{"title": "The VC-Dimension versus the Statistical Capacity of Multilayer Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 928, "page_last": 935, "abstract": null, "full_text": "The VC-Dimension versus the Statistical Capacity of Multilayer Networks\n\nChuanyi Ji* and Demetri Psaltis\nDepartment of Electrical Engineering\nCalifornia Institute of Technology\nPasadena, CA 91125\n\nAbstract\n\nA general relationship is developed between the VC-dimension and the statistical lower epsilon-capacity which shows that the VC-dimension can be lower bounded (in order) by the statistical lower epsilon-capacity of a network trained with random samples. This relationship explains quantitatively how generalization takes place after memorization, and relates the concept of generalization (consistency) with the capacity of the optimal classifier over a class of classifiers with the same structure and the capacity of the Bayesian classifier. Furthermore, it provides a general methodology to evaluate a lower bound for the VC-dimension of feedforward multilayer neural networks.\n\nThis general methodology is applied to two types of networks which are important for hardware implementations: two-layer $(N - 2L - 1)$ networks with binary weights, integer thresholds for the hidden units and zero threshold for the output unit, and a single neuron ($(N - 1)$ networks) with binary weights and a zero threshold. Specifically, we obtain $O(W / \\ln L) \\le d_2 \\le O(W)$ and $d_1 \\sim O(N)$. Here $W$ is the total number of weights of the $(N - 2L - 1)$ networks, and $d_1$ and $d_2$ represent the VC-dimensions for the $(N - 1)$ and $(N - 2L - 1)$ networks respectively. 
\n\n1 Introduction\n\nThe information capacity and the VC-dimension are two important quantities that characterize multilayer feedforward neural networks. The former characterizes their memorization capability, while the latter represents the sample complexity needed for generalization. Discovering their relationships is of importance for obtaining a better understanding of the fundamental properties of multilayer networks in learning and generalization.\n\n*Present Address: Department of Electrical Computer and System Engineering, Rensselaer Polytech Institute, Troy, NY 12180.\n\nIn this work we show that the VC-dimension of feedforward multilayer neural networks, which is a distribution- and network-parameter-independent quantity, can be lower bounded (in order) by the statistical lower epsilon-capacity $C_{\\epsilon}^{-}$ (McEliece et al., 1987), which is a distribution- and network-dependent quantity, when the samples are drawn from two classes: $\\Omega_1(+1)$ and $\\Omega_2(-1)$. The only requirement on the distribution from which samples are drawn is that the optimal classification error achievable, the Bayes error $P_{be}$, is greater than zero. Then we will show that the VC-dimension $d$ and the statistical lower epsilon-capacity $C_{\\epsilon}^{-}$ are related by\n\n$$C_{\\epsilon}^{-} \\le Ad, \\qquad (1)$$\n\nwhere $\\epsilon = P_{eo} - \\epsilon'$ for $0 < \\epsilon' \\le P_{eo}$; or $\\epsilon = P_{be} - \\epsilon'$ for $0 < \\epsilon' \\le P_{be}$. Here $\\epsilon$ is the error tolerance, and $P_{eo}$ represents the optimal error rate achievable on the class of classifiers considered. It is obvious that $P_{eo} \\ge P_{be}$. The relation given in Equation (1) is non-trivial if $P_{be} > 0$, and $P_{eo} \\ge \\epsilon'$ or $P_{be} \\ge \\epsilon'$, so that $\\epsilon$ is a non-negative quantity. 
$Ad$ is called the universal sample bound for generalization, where $A$ is a positive constant. When the sample complexity exceeds $Ad$, all the networks of the same architecture for all distributions of the samples can generalize with probability almost 1 for large $d$. A special case of interest, in which $P_{be} = \\frac{1}{2}$, corresponds to random assignments of samples. Then $C_{\\epsilon}^{-}$ represents the random storage capacity which characterizes the memorizing capability of networks.\n\nAlthough the VC-dimension is a key parameter in generalization, there exists no systematic way of finding it. The relationship we have obtained, however, brings concomitantly a constructive method of finding a lower bound for the VC-dimension of multilayer networks. That is, if the weights of a network are properly constructed using random samples drawn from a chosen distribution, the statistical lower epsilon-capacity can be evaluated and then utilized as a lower bound for the VC-dimension. In this paper we will show how this constructive approach contributes to finding lower bounds of the VC-dimension of multilayer networks with binary weights.\n\n2 A Relationship Between the VC-Dimension and the Statistical Capacity\n\n2.1 Definition of the Statistical Capacity\n\nConsider a network $s$ whose weights are constructed from $M$ random samples belonging to two classes. Let $r(s) = Z/M$, where $Z$ is the total number of samples classified incorrectly by the network $s$. Then the random variable $r(s)$ is the training error rate. Let\n\n$$P_{\\epsilon}(M) = \\Pr(r(s) < \\epsilon), \\qquad (2)$$\n\nwhere $0 < \\epsilon \\le 1$. 
Then the statistical lower epsilon-capacity (statistical capacity in short) $C_{\\epsilon}^{-}$ is the maximum $M$ such that $P_{\\epsilon}(M) \\ge 1 - \\eta$, where $\\eta$ can be arbitrarily small for sufficiently large $N$.\n\nRoughly speaking, the statistical lower epsilon-capacity defined here can be regarded as a sharp transition point on the curve $P_{\\epsilon}(M)$ shown in Fig. 1. When the number of samples used is below this sharp transition, the network can memorize them perfectly.\n\n2.2 The Universal Sample Bound for Generalization\n\nLet $P_e(x|s)$ be the true probability of error for the network $s$. Then the generalization error $\\Delta E(s)$ satisfies $\\Delta E(s) = |r(s) - P_e(x|s)|$. We can show that the probability for the generalization error to exceed a given small quantity $\\epsilon'$ satisfies the following relation.\n\nTheorem 1\n\n$$\\Pr\\left(\\max_{s \\in S} \\Delta E(s) > \\epsilon'\\right) \\le h(2M; d, \\epsilon'), \\qquad (3)$$\n\nwhere\n\n$$h(2M; d, \\epsilon') = \\begin{cases} 1, & \\text{if either } 2M \\le d, \\text{ or } 6 \\frac{(2M)^d}{d!} e^{-\\epsilon'^2 M / 8} \\ge 1 \\text{ and } 2M > d, \\\\ 6 \\frac{(2M)^d}{d!} e^{-\\epsilon'^2 M / 8}, & \\text{otherwise.} \\end{cases}$$\n\nHere $S$ is a class of networks with the same architecture. The function $h(2M; d, \\epsilon')$ has one sharp transition occurring at $Ad$ shown in Fig. 1, where $A$ is a constant satisfying the equation $\\ln(2A) + 1 - \\frac{\\epsilon'^2}{8} A = 0$.\n\nThis theorem says that when the number $M$ of samples used exceeds $Ad$, generalization happens with probability 1. Since $Ad$ is a distribution- and network-parameter-independent quantity, we call it the universal sample bound for generalization.\n\n2.3 A Relationship between the VC-Dimension and $C_{\\epsilon}^{-}$\n\nRoughly speaking, since both the statistical capacity and the VC-dimension represent sharp transition points, it is natural to ask whether they are related. The relationship can actually be given through the theorem below. 
\n\nTheorem 2 Let samples belonging to two classes $\\Omega_1(+1)$ and $\\Omega_2(-1)$ be drawn independently from some distribution. The only requirement on the distributions considered is that the Bayes error $P_{be}$ satisfies $0 < P_{be} \\le \\frac{1}{2}$. Let $S$ be a class of feedforward multilayer networks with a fixed structure consisting of threshold elements and $s_1$ be one network in $S$, where the weights of $s_1$ are constructed from $M$ (training) samples drawn from one distribution as specified above. For a given distribution, let $P_{eo}$ be the optimal error rate achievable on $S$ and $P_{be}$ be the Bayes error rate. Then\n\n$$\\Pr(r(s_1) < P_{eo} - \\epsilon') \\le h(2M; d, \\epsilon'), \\qquad (4)$$\n\nand\n\n(5)\n\nwhere $r(s_1)$ is equal to the training error rate of $s_1$. (It is also called the resubstitution error estimator in the pattern recognition literature.) These relations are nontrivial if $P_{eo} > \\epsilon'$, $P_{be} > \\epsilon'$ and $\\epsilon' > 0$ is small.\n\nFigure 1: Two sharp transition points for the capacity and the universal sample bound for generalization.\n\nThe key idea of this result is illustrated in Fig. 1. That is, the sharp transition which stands for the lower epsilon-capacity is below the sharp transition for the universal sample bound for generalization.\n\nTo interpret this relation, let us compare Equation (2) and Equation (5) and examine the range of $\\epsilon$ and $\\epsilon'$ respectively. Since $\\epsilon'$, which is initially given in Inequality (3), represents a bound on the generalization error, it is usually quite small. For most practical problems, $P_{be}$ is small also. 
If the structure of the class of networks is properly chosen so that $P_{eo} \\approx P_{be}$, then $\\epsilon = P_{eo} - \\epsilon'$ will be a small quantity. Although the epsilon-capacity is a valid quantity depending on $M$ for any network in the class, for $M$ sufficiently large, the meaningful networks to be considered through this relation are only a small subset in the class whose true probability of error is close to $P_{eo}$. That is, this small subset contains only those networks which can approximate the best classifier contained in this class.\n\nFor a special case in which samples are assigned randomly to two classes with equal probability, we have a result stated in Corollary 1.\n\nCorollary 1 Let samples be drawn independently from some distribution and then assigned randomly to two classes $\\Omega_1(+1)$ and $\\Omega_2(-1)$ with equal probability. This is equivalent to the case that the two class conditional distributions have complete overlap with one another. That is, $\\Pr(x | \\Omega_1) = \\Pr(x | \\Omega_2)$. Then the Bayes error is $\\frac{1}{2}$. Using the same notation as in the above theorem, we have\n\n$$C_{\\frac{1}{2} - \\epsilon'}^{-} \\le Ad. \\qquad (6)$$\n\nAlthough the distributions specified here give an uninteresting case for classification purposes, we will see later that the random statistical epsilon-capacity in Inequality (6) can be used to characterize the memorizing capability of networks, and to formulate a constructive approach to find a lower bound for the VC-dimension.\n\n3 Bounds for the VC-Dimension of Two Networks with Binary Weights\n\n3.1 A Constructive Methodology\n\nOne of the applications of this relation is that it provides a general constructive approach to find a lower bound for the VC-dimension for a class of networks. Specifically, using the relationship given in Inequality (6), the procedure can be described as follows.\n\n1) Select a distribution.\n\n2) Draw samples independently from the chosen distribution, and then assign them randomly to two classes.\n\n3) Evaluate the lower epsilon-capacity and then use it as a lower bound for the VC-dimension.\n\nTwo examples are given below to demonstrate how this general approach can be applied to find lower bounds for the VC-dimension.\n\n3.2 Bounds for Two-Layer Networks with Binary Weights\n\nTwo-layer $(N - 2L - 1)$ networks with binary weights and integer thresholds are considered in this section.\n\n3.2.1 A Lower Bound\n\nThe construction of the network we consider is motivated by the one used by Baum (Baum, 1988) in finding the capacity for two-layer networks with real weights. Although this particular network will fail if the accuracy of the weights and the thresholds is reduced, the idea of using the grandmother-cell type of network will be adopted to construct our network.\n\nWe consider a two-layer binary network with $2L$ hidden threshold units and one output threshold unit shown in Fig. 2 a).\n\nThe weights at the second layer are fixed and equal to $+1$ and $-1$ alternately. The hidden units are allowed to have integer thresholds in $[-N, N]$, and the threshold for the output unit is zero.\n\nLet $X_l^{(m)} = (x_{l1}^{(m)}, \\ldots, x_{lN}^{(m)})$ be an $N$-dimensional random vector, where the $x_{li}^{(m)}$'s are independent random variables taking $(+1)$ and $(-1)$ with equal probability $\\frac{1}{2}$, $1 \\le l \\le L$, and $1 \\le m \\le M$. Consider the $l$th pair of hidden units. The weights at the first layer for this pair of hidden units are equal. 
Let $w_{li}$ denote the weight from the $i$th input to these two hidden units, then we have\n\n$$w_{li} = \\mathrm{sgn}\\left(\\alpha_l \\sum_{m=1}^{M} x_{li}^{(m)}\\right), \\qquad (7)$$\n\nwhere $\\mathrm{sgn}(x) = 1$ if $x > 0$, and $-1$ otherwise. The $\\alpha_l$'s, $1 \\le l \\le L$, which are independent random variables taking the two values $+1$ or $-1$ with equal probability, represent the random assignments of the $LM$ samples into two classes $\\Omega_1(+1)$ and $\\Omega_2(-1)$.\n\nFigure 2: a) The two-layer network with binary weights. b) Illustration on how a pair of hidden units separates samples.\n\nThe thresholds for these two units are different and are given as\n\n(8)\n\nwhere $0 < k < 1$, and $t_l^{\\pm}$ correspond to the thresholds for the units with weight $+1$ and $-1$ at the second layer respectively.\n\nFig. 2 b) illustrates how this network works. Each pair of hidden units forms two parallel hyperplanes separated by the two thresholds, which will generate a presynaptic input of either $+2$ or $-2$ to the output unit only for the samples stored in this pair which fall in between the planes when $\\alpha_l$ equals either $+1$ or $-1$, and a presynaptic input $0$ for the samples falling outside. When the samples as well as the parallel hyperplanes are random, with a certain probability they will fall either between a pair of parallel hyperplanes or outside. Therefore, statistical analysis is needed to obtain the lower epsilon-capacity. 
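The first-layer construction in Equation (7) can be sketched numerically. The following is only a minimal illustration under stated assumptions, not the construction analyzed in Theorem 3: it uses the Python standard library, the helper names (sgn, first_layer_weights, projection, demo) and the sizes N = 64, L = 4, M = 10 are hypothetical, and the thresholds of Equation (8) are left out. It checks only that samples stored in a pair project more strongly onto that pair's weight vector (after multiplying by the class assignment) than samples stored in other pairs, which is the property the two parallel hyperplanes rely on.

```python
import random


def sgn(x):
    # Sign convention used in the text: +1 if x > 0, and -1 otherwise.
    return 1 if x > 0 else -1


def first_layer_weights(samples, alpha):
    # samples[l][m][i]: component i of the m-th sample stored in pair l.
    # alpha[l]: random class assignment (+1 or -1) shared by the samples of pair l.
    # Equation (7): w[l][i] = sgn(alpha[l] * sum over m of samples[l][m][i]).
    weights = []
    for l, group in enumerate(samples):
        n = len(group[0])
        weights.append([sgn(alpha[l] * sum(x[i] for x in group)) for i in range(n)])
    return weights


def projection(w, x):
    # Inner product between a binary weight vector and a binary sample.
    return sum(wi * xi for wi, xi in zip(w, x))


def demo(N=64, L=4, M=10, seed=0):
    # Hypothetical sizes and seed, chosen only to keep the run small and repeatable.
    rng = random.Random(seed)
    samples = [[[rng.choice((-1, 1)) for _ in range(N)] for _ in range(M)]
               for _ in range(L)]
    alpha = [rng.choice((-1, 1)) for _ in range(L)]
    W = first_layer_weights(samples, alpha)
    stored, foreign = [], []
    for l in range(L):
        for k in range(L):
            for x in samples[k]:
                # Signed projection of a sample onto the weight vector of pair l.
                value = alpha[l] * projection(W[l], x)
                (stored if k == l else foreign).append(value)
    return W, sum(stored) / len(stored), sum(foreign) / len(foreign)


W, stored_mean, foreign_mean = demo()
print(stored_mean, foreign_mean)
```

In this sketch the average signed projection of the stored samples should be clearly positive while that of the other samples stays near zero. Choosing integer thresholds that exploit this gap is the role of Equation (8), which the sketch does not attempt to reproduce.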
\n\nTheorem 3 A lower bound for the lower epsilon-capacity $C_{\\frac{1}{2} - \\epsilon'}^{-}$ of this network is:\n\n(1-k)2NL\n\n(9)\n\n3.2.2 An Upper Bound\n\nSince the total number of possible mappings of two-layer $(N - 2L - 1)$ networks with binary weights and integer thresholds ranging in $[-N, N]$ is bounded by $2^{W + L \\log 2N}$, the VC-dimension $d_2$ is upper bounded by $W + L \\log 2N$, which is in the order of $W$. Then $d_2 \\le O(W)$. By combining both the upper and lower bounds, we have\n\n$$O(W / \\ln L) \\le d_2 \\le O(W). \\qquad (10)$$\n\n3.3 Bounds for One-Layer Networks with Binary Weights\n\nThe one-layer network we consider here is equivalent to one hidden unit in the above $(N - 2L - 1)$ network. Specifically, the weight from the $i$-th input unit to the neuron is\n\n$$w_i = \\mathrm{sgn}\\left(\\sum_{m=1}^{M} \\sigma_m x_i^{(m)}\\right), \\qquad (11)$$\n\nwhere $1 \\le i \\le N$, and the $x_i^{(m)}$'s and $\\sigma_m$'s are independent and equally probable binary ($\\pm 1$) random variables, which represent elements of $N$-dimensional sample vectors and their random assignments to two classes respectively.\n\nTheorem 4 The lower epsilon-capacity $C_{\\frac{1}{2} - \\epsilon'}^{-}$ of this network satisfies\n\n$$C_{\\frac{1}{2} - \\epsilon'}^{-} \\sim \\frac{N}{\\pi^2 \\epsilon'^2}. \\qquad (12)$$\n\nThen by Corollary 1 we have $O(N) \\le O(d_1)$, where $d_1$ is the VC-dimension of one-layer $(N - 1)$ networks.\n\nUsing a similar counting argument, an upper bound can be obtained as $d_1 \\le N$. Then combining the lower and upper bounds, we have $d_1 \\sim O(N)$.\n\n4 Discussions\n\nThe general relationship we have drawn between the VC-dimension and the statistical lower epsilon-capacity provides a new view on the sample complexity for generalization. Specifically, it has two implications for learning and generalization. 
\n\n1) For random assignments of the samples ($P_{be} = \\frac{1}{2}$), the relationship confirms that generalization occurs after memorization, since the statistical lower epsilon-capacity for this case is the random storage capacity, which characterizes the memorizing capability of networks, and it is upper bounded by the universal sample bound for generalization.\n\n2) For cases where the Bayes error is smaller than $\\frac{1}{2}$, the relationship indicates that an appropriate choice of a network structure is very important. If a network structure is properly chosen so that the optimal achievable error rate $P_{eo}$ is close to the Bayes error $P_{be}$, then the optimal network in this class is the one which has the largest lower epsilon-capacity. Since a suitable structure can hardly be chosen a priori due to the lack of knowledge about the underlying distribution, searching for network structures as well as weight values becomes necessary. A similar idea has been addressed by Devroye (Devroye, 1988) and by Vapnik (Vapnik, 1982) for structural minimization.\n\nWe have applied this relation as a general constructive approach to obtain lower bounds for the VC-dimension of two-layer and one-layer networks with binary interconnections. For the one-layer networks, the lower bound is tight and matches the upper bound. For the two-layer networks, the lower bound is smaller than the upper bound (in order) by a $\\ln$ factor. In an independent work by Littlestone (Littlestone, 1988), the VC-dimension of so-called DNF expressions was obtained. 
Since any DNF expression can be implemented by a two-layer network of threshold units with binary weights and integer thresholds, this result is equivalent to showing that the VC-dimension of such networks is $O(W)$. We believe that the $\\ln$ factor in our lower bound is due to the limitations of the grandmother-cell type of networks used in our construction.\n\nAcknowledgement\n\nThe authors would like to thank Yaser Abu-Mostafa and David Haussler for helpful discussions. The support of AFOSR and DARPA is gratefully acknowledged.\n\nReferences\n\nE. Baum. (1988) On the Capacity of Multilayer Perceptron. J. of Complexity, 4:193-215.\n\nL. Devroye. (1988) Automatic Pattern Recognition: A Study of Probability of Error. IEEE Trans. on Pattern Recognition and Machine Intelligence, Vol. 10, No. 4: 530-543.\n\nN. Littlestone. (1988) Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm. Machine Learning 2: 285-318.\n\nR.J. McEliece, E.C. Posner, E.R. Rodemich, S.S. Venkatesh. (1987) The Capacity of the Hopfield Associative Memory. IEEE Trans. Inform. Theory, Vol. IT-33, No. 4, 461-482.\n\nV.N. Vapnik. (1982) Estimation of Dependences Based on Empirical Data, New York: Springer-Verlag.\n", "award": [], "sourceid": 481, "authors": [{"given_name": "Chuanyi", "family_name": "Ji", "institution": null}, {"given_name": "Demetri", "family_name": "Psaltis", "institution": null}]}