{"title": "Scaling and Generalization in Neural Networks: A Case Study", "book": "Advances in Neural Information Processing Systems", "page_first": 160, "page_last": 168, "abstract": null, "full_text": "SCALING AND GENERALIZATION IN NEURAL NETWORKS: A CASE STUDY \n\nSubutai Ahmad \nCenter for Complex Systems Research \nUniversity of Illinois at Urbana-Champaign \n508 S. 6th St., Champaign, IL 61820 \n\nGerald Tesauro \nIBM Watson Research Center \nPO Box 704 \nYorktown Heights, NY 10598 \n\nABSTRACT \n\nThe issues of scaling and generalization have emerged as key issues in current studies of supervised learning from examples in neural networks. Questions such as how many training patterns and training cycles are needed for a problem of a given size and difficulty, how to represent the inputs, and how to choose useful training exemplars, are of considerable theoretical and practical importance. Several intuitive rules of thumb have been obtained from empirical studies, but as yet there are few rigorous results. In this paper we summarize a study of generalization in the simplest possible case: perceptron networks learning linearly separable functions. The task chosen was the majority function (i.e. return a 1 if a majority of the input units are on), a predicate with a number of useful properties. We find that many aspects of generalization in multilayer networks learning large, difficult tasks are reproduced in this simple domain, in which concrete numerical results and even some analytic understanding can be achieved. \n\n1 INTRODUCTION \n\nIn recent years there has been a tremendous growth in the study of machines which learn. One class of learning systems which has been fairly popular is neural networks. 
Originally motivated by the study of the nervous system in biological organisms and as an abstract model of computation, they have since been applied to a wide variety of real-world problems (for examples see [Sejnowski and Rosenberg, 87] and [Tesauro and Sejnowski, 88]). Although the results have been encouraging, there is actually little understanding of the extensibility of the formalism. In particular, little is known of the resources required when dealing with large problems (i.e. scaling), and the abilities of networks to respond to novel situations (i.e. generalization). \n\nThe objective of this paper is to gain some insight into the relationships between three fundamental quantities under a variety of situations. In particular we are interested in the relationships between the size of the network, the number of training instances, and the generalization that the network performs, with an emphasis on the effects of the input representation and the particular patterns present in the training set. \n\nAs a first step to a detailed understanding, we summarize a study of scaling and generalization in the simplest possible case. Using feed forward networks, the type of networks most common in the literature, we examine the majority function (return a 1 if a majority of the inputs are on), a boolean predicate with a number of useful features. 
By using a combination of computer simulations and analysis in the limited domain of the majority function, we obtain some concrete numerical results which provide insight into the process of generalization and which will hopefully lead to a better understanding of learning in neural networks in general.\u00b7 \n\n2 THE MAJORITY FUNCTION \n\nThe function we have chosen to study is the majority function, a simple predicate whose output is a 1 if and only if more than half of the input units are on. This function has a number of useful properties which facilitate a study of this type. The function has a natural appeal and can occur in several different contexts in the real world. The problem is linearly separable (i.e. of predicate order 1 [Minsky and Papert, 69]). A version of the perceptron convergence theorem applies, so we are guaranteed that a network with one layer of weights can learn the function. Finally, when there are an odd number of input units, exactly half of the possible inputs result in an output of 1. This property tends to minimize any negative effects that may result from having too many positive or negative training examples. \n\n3 METHODOLOGY \n\nThe networks used are feed forward networks [Rumelhart and McClelland, 86], a general category of networks that includes perceptrons and the multi-layered networks most often used in current research. Since majority is a boolean function of predicate order 1, we use a network with no hidden units. The output function used was a sigmoid with a bias. The basic procedure consisted of three steps. First the network was initialized to some random starting weights. Next it was trained using back propagation on a set of training patterns. Finally, the performance of the network was tested on a set of random test patterns. 
This performance figure was used as the estimate of the network's generalization. Since there is a large amount of randomness in the procedure, most of our data are averages over several simulations. \n\n\u00b7 The material contained in this paper is a condensation of portions of the first author's M.S. thesis [Ahmad, 88]. \n\nFigure 1: The average failure rate f as a function of S, for d = 25. \n\nNotation. In the following discussion, we denote by S the number of training patterns, d the number of input units, and c the number of cycles through the training set. Let f be the failure rate (the fraction of misclassified training instances), and $\\pi$ be the set of training patterns. \n\n4 RANDOM TRAINING PATTERNS \n\nWe first examine the failure rate as a function of S and d. Figure 1 shows the graph of the average failure rate as a function of S, for a fixed input size d = 25. Not surprisingly we find that the failure rate decreases fairly monotonically with S. Our simulations show that in fact, for majority there is a well defined relationship between the failure rate and S. Figure 2 shows this for a network with 25 input units. The figure indicates that ln f is proportional to S, implying that the failure rate decreases exponentially with S, i.e., $f = \\alpha e^{-\\beta S}$. $1/\\beta$ can be thought of as a characteristic training set size, corresponding to a failure rate of $\\alpha/e$. \n\nObtaining the exact scaling relationship of $1/\\beta$ was somewhat tricky. Plotting $\\beta$ on a log-log plot against d showed it to be close to a straight line, indicating that $1/\\beta$ increases as $d^{\\alpha}$ for some constant $\\alpha$. 
Extracting the exponent by measuring the slope of the log-log graph turned out to be very error prone, since the data only ranged over one order of magnitude. An alternate method for obtaining the exponent is to look for a particular exponent $\\alpha$ by setting $S = a d^{\\alpha}$. Since a linear scaling relationship is theoretically plausible, we measured the failure rate of the network at S = ad for various values of a. As Figure 3 shows, the failure rate remains more or less constant for fixed values of a, indicating a linear scaling relationship with d. Thus O(d) training patterns should be required to learn majority to a fixed level of performance. Note that if we require perfect learning, then the failure rate has to be less than $1/(2^d - S) \\approx 1/2^d$. By substituting this for f in the above formula and solving for S, we get that $(1/\\beta)(d \\ln 2 + \\ln \\alpha)$ patterns are required. The extra factor of d suggests that $O(d^2)$ would be required to learn majority perfectly. We will show in Section 6.1 that this is actually an overestimate. \n\nFigure 2: ln f as a function of S, for d = 25. The slope was approximately -0.01. \n\n5 THE INPUT REPRESENTATION \n\nSo far in our simulations we have used the representation commonly used for boolean predicates. Whenever an input feature has been true, we clamped the corresponding input unit to a 1, and when it has been off we have clamped it to a 0. There is no reason, however, why some other representation couldn't have been used. 
Notice that in back propagation the weight change is proportional to the incoming input signal, hence the weight from a particular input unit to the output unit is changed only when the pattern is misclassified and the input unit is non-zero. The weight remains unchanged when the input unit is 0. If the 0,1 representation were changed to a -1,+1 representation, each weight would be changed more often, hence the network should learn the training set quicker (simulations in [Stornetta and Huberman, 87] reported such a decrease in training time using a -1/2,+1/2 representation). \n\nFigure 3: Failure rate vs d with S = 3d, 5d, 7d. \n\nWe found that not only did the training time decrease with the new representation, the generalization of the network improved significantly. The scaling of the failure rate with respect to S is unchanged, but for any fixed value of S, the generalization is about 5-10% better. Also, the scaling with respect to d is still linear, but the constant for a fixed performance level is smaller. Although the exact reason for the improved generalization is unclear, the following might be a plausible reason. A weight is changed only if the corresponding input is non-zero. By the definition of the majority function, the average number of units that are on for the positive instances is higher than for the negative instances. Hence, using the 0,1 representation, the weight changes are more pronounced for the positive instances than for the negative instances. 
Since the weights are changed whenever a pattern is misclassified, the net result is that the weight change is greater when a positive event is misclassified than when a negative event is misclassified. Thus, there seems to be a bias in the 0,1 representation for correcting the hyperplane more when a positive event is misclassified. In the new representation, both positive and negative events are treated equally, hence it is unbiased. \n\nThe basic lesson here seems to be that one should carefully examine every choice that has been made during the design process. The representation of the input, even down to such low level details as deciding whether \"off\" should be represented as 0 or -1, could make a significant difference in the generalization. \n\n6 BORDER PATTERNS \n\nWe now consider a method for improving the generalization by intelligently selecting the patterns in the training set. Normally, for a given training set, when the inputs are spread evenly around the input space, there can be several generalizations which are consistent with the patterns. The performance of the network on the test set becomes a random event, depending on the initial state of the network. If practical, it makes sense to choose training patterns which can limit the possible generalizations. In particular, if we can find those examples which are closest to the separating surface, we can maximally constrain the number of generalizations. The solution that the network converges to using these \"border\" patterns should have a higher probability of being a good separator. In general, finding a perfect set of border patterns can be computationally expensive; however, there might exist simple heuristics which can help select good training examples. 
\n\nWe explored one heuristic for choosing such points: selecting only those patterns in which the number of 1's is either one less or one more than half the number of input units. Intuitively, these inputs should be close to the desired separating surface, thereby constraining the network more than random patterns would. Our results show that using only border patterns in the training set, there is a large increase in the expected performance of the network for a given S. In addition, the scaling behavior as a function of S seems to be very different and is faster than an exponential decrease. (Figure 4 shows typical failure rate vs S curves comparing border patterns, the -1,+1 representation, and the 0,1 representation.) \n\n6.1 BORDER PATTERNS AND PERFECT LEARNING \n\nWe say the network has perfectly learned a function when the test patterns are never misclassified. For the majority function, one can argue that at least some border patterns must be present in order to guarantee perfect performance. If no border patterns were in the training set, then the network could have learned the $\\frac{d}{2} - 1$ of d or the $\\frac{d}{2} + 1$ of d function. Furthermore, if we know that a certain number of border patterns is guaranteed to give perfect performance, say b(d), then given the probability that a random pattern is a border pattern, we can calculate the expected number of random patterns sufficient to learn majority. \n\nFor odd d, there are $2\\binom{d}{(d-1)/2}$ border patterns, so the probability of choosing a border pattern randomly is: \n\n$$\\binom{d}{(d-1)/2} \\Big/ 2^{d-1}$$ \n\nAs d gets larger this probability decreases as $1/\\sqrt{d}$ (this can be shown using Stirling's approximation to d!). The expected number of randomly chosen patterns required before b(d) border patterns are chosen is therefore: 
\n\n$$b(d)\\sqrt{d}.$$ \n\nFrom our data we find that 3d border patterns are always sufficient to learn the test set perfectly. From this, and from the theoretical results in [Cover, 65], we can be confident that b(d) is linear in d. Thus, $O(d^{3/2})$ random patterns should be sufficient to learn majority perfectly. \n\nFigure 4: Graph showing the average failure rate vs. S using the 0,1 representation (right), the -1,+1 representation (middle), and using border patterns (left). The network had 23 input units and was tested on a test set consisting of 1024 patterns. \n\nIt should be mentioned that border patterns are not the only patterns which contribute to the generalization of the network. Figure 5 shows that the failure rate of the network when trained with random training patterns which happen to contain b border patterns is substantially lower than when it is trained on a set consisting of only b border patterns. Note that perfect performance is achieved at the same point in both cases. \n\n7 CONCLUSION \n\nIn this paper we have described a systematic study of some of the various factors affecting scaling and generalization in neural networks. Using empirical studies in a simple test domain, we were able to obtain precise scaling relationships between the performance of the network, the number of training patterns, and the size of the network. It was shown that for a fixed network size, the failure rate decreases exponentially with the size of the training set. The number of patterns required to 
achieve a fixed performance level was shown to increase linearly with the network size. \n\nFigure 5: This figure compares the failure rate on a random training set which happens to contain b border patterns (bottom plot) with a training set composed of only b border patterns (top plot). The horizontal axis is the number of border patterns. \n\nA general finding was that the performance of the network was very sensitive to a number of factors. A slight change in the input representation caused a jump in the performance of the network. The specific patterns in the training set had a large influence on the final weights and on the generalization. By selecting the training patterns intelligently, the performance of the network was increased significantly. \n\nBorder patterns were introduced as the most interesting patterns in the training set. As far as the number of patterns required to teach a function to the network is concerned, these patterns are near optimal. It was shown that a network trained only on border patterns generalizes substantially better than one trained on the same number of random patterns. Border patterns were also used to derive an expected bound on the number of random patterns sufficient to learn majority perfectly. It was shown that on average, $O(d^{3/2})$ random patterns are sufficient to learn majority perfectly. \n\nIn conclusion, this paper advocates a careful study of the process of generalization in neural networks. There are a large number of different factors which can affect the performance. Any assumptions made when applying neural networks to a real-world problem should be made with care. 
Although much more work needs to be done, it was shown that many of the issues can be effectively studied in a simple test domain. \n\nAcknowledgements \n\nWe thank T. Sejnowski, R. Rivest and A. Barron for helpful discussions. We also thank T. Sejnowski and B. Bogstad for assistance in development of the simulator code. This work was partially supported by the National Center for Supercomputing Applications and by National Science Foundation grant Phy 86-58062. \n\nReferences \n\n[Ahmad, 88] S. Ahmad. A Study of Scaling and Generalization in Neural Networks. Technical Report UIUCDCS-R-88-1454, Department of Computer Science, University of Illinois, Urbana-Champaign, IL, 1988. \n\n[Cover, 65] T. Cover. Geometric and statistical properties of systems of linear equations. IEEE Trans. Elect. Comp., 14:326-334, 1965. \n\n[Minsky and Papert, 69] Marvin Minsky and Seymour Papert. Perceptrons. MIT Press, Cambridge, Mass., 1969. \n\n[Muroga, 71] S. Muroga. Threshold Logic and its Applications. Wiley, New York, 1971. \n\n[Rumelhart and McClelland, 86] D. E. Rumelhart and J. L. McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations. Volume I, MIT Press, Cambridge, Mass., 1986. \n\n[Stornetta and Huberman, 87] W.S. Stornetta and B.A. Huberman. An improved three-layer, back propagation algorithm. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, 1987. \n\n[Sejnowski and Rosenberg, 87] T.J. Sejnowski and C.R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168, 1987. \n\n[Tesauro and Sejnowski, 88] G. Tesauro and T.J. Sejnowski. A Parallel Network that Learns to Play Backgammon. 
Technical  Report  CCSR-88-2,  Center  for \nComplex Systems Research,  University of Illinois,  Urbana-Champaign, IL,  1988. \n\n\f", "award": [], "sourceid": 129, "authors": [{"given_name": "Subutai", "family_name": "Ahmad", "institution": null}, {"given_name": "Gerald", "family_name": "Tesauro", "institution": null}]}