{"title": "A Simple Weight Decay Can Improve Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 950, "page_last": 957, "abstract": null, "full_text": "A Simple Weight Decay Can Improve Generalization\n\nAnders Krogh\u00b7\nCONNECT, The Niels Bohr Institute\nBlegdamsvej 17\nDK-2100 Copenhagen, Denmark\nkrogh@cse.ucsc.edu\n\nJohn A. Hertz\nNordita\nBlegdamsvej 17\nDK-2100 Copenhagen, Denmark\nhertz@nordita.dk\n\nAbstract\n\nIt has been observed in numerical simulations that a weight decay can improve generalization in a feed-forward neural network. This paper explains why. It is proven that a weight decay has two effects in a linear network. First, it suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. Second, if the size is chosen right, a weight decay can suppress some of the effects of static noise on the targets, which improves generalization quite a lot. It is then shown how to extend these results to networks with hidden layers and non-linear units. Finally the theory is confirmed by some numerical simulations using the data from NetTalk.\n\n1 INTRODUCTION\n\nMany recent studies have shown that the generalization ability of a neural network (or any other 'learning machine') depends on a balance between the information in the training examples and the complexity of the network, see for instance [1,2,3]. Bad generalization occurs if the information does not match the complexity, e.g. if the network is very complex and there is little information in the training set. In this last instance the network will be over-fitting the data, and the opposite situation corresponds to under-fitting. 
\n\n\u00b7Present address: Computer and Information Sciences, Univ. of California Santa Cruz, Santa Cruz, CA 95064.\n\nOften the number of free parameters, i.e. the number of weights and thresholds, is used as a measure of the network complexity, and algorithms have been developed that minimize the number of weights while still keeping the error on the training examples small [4,5,6]. This minimization of the number of free parameters is not always what is needed.\n\nA different way to constrain a network, and thus decrease its complexity, is to limit the growth of the weights through some kind of weight decay. It should prevent the weights from growing too large unless it is really necessary. It can be realized by adding a term to the cost function that penalizes large weights,\n\n$$E(w) = E_0(w) + \\frac{1}{2} \\lambda \\sum_i w_i^2, \\qquad (1)$$\n\nwhere $E_0$ is one's favorite error measure (usually the sum of squared errors), and $\\lambda$ is a parameter governing how strongly large weights are penalized. $w$ is a vector containing all free parameters of the network; it will be called the weight vector. If gradient descent is used for learning, the last term in the cost function leads to a new term $-\\lambda w_i$ in the weight update:\n\n$$\\dot{w}_i \\propto -\\frac{\\partial E_0}{\\partial w_i} - \\lambda w_i. \\qquad (2)$$\n\nHere it is formulated in continuous time. If the gradient of $E_0$ (the 'force term') were not present this equation would lead to an exponential decay of the weights.\n\nObviously there are infinitely many possibilities for choosing other forms of the additional term in (1), but here we will concentrate on this simple form. 
\n\nIt has been known for a long time that a weight decay of this form can improve generalization [7], but until now this has not been widely recognized. The aim of this paper is to analyze this effect both theoretically and experimentally. Weight decay as a special kind of regularization is also discussed in [8,9].\n\n2 FEED-FORWARD NETWORKS\n\nA feed-forward neural network implements a function of the inputs that depends on the weight vector $w$; it is called $f_w$. For simplicity it is assumed that there is only one output unit. When the input is $\\xi$ the output is $f_w(\\xi)$. Note that the input vector is a vector in the $N$-dimensional input space, whereas the weight vector is a vector in the weight space, which has a different dimension $W$.\n\nThe aim of the learning is not only to learn the examples, but to learn the underlying function that produces the targets for the learning process. First, we assume that this target function can actually be implemented by the network. This means there exists a weight vector $u$ such that the target function is equal to $f_u$. The network with parameters $u$ is often called the teacher, because from input vectors it can produce the right targets. The sum of squared errors is\n\n$$E_0(w) = \\frac{1}{2} \\sum_{\\mu=1}^{p} [f_u(\\xi^\\mu) - f_w(\\xi^\\mu)]^2, \\qquad (3)$$\n\nwhere $p$ is the number of training patterns. The learning equation (2) can then be written\n\n$$\\dot{w}_i \\propto \\sum_\\mu [f_u(\\xi^\\mu) - f_w(\\xi^\\mu)] \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_i} - \\lambda w_i. \\qquad (4)$$\n\nNow the idea is to expand this around the solution $u$, but first the linear case will be analyzed in some detail.\n\n3 THE LINEAR PERCEPTRON\n\nThe simplest kind of 'network' is the linear perceptron characterized by\n\n$$f_w(\\xi) = N^{-1/2} \\sum_i w_i \\xi_i, \\qquad (5)$$\n\nwhere the $N^{-1/2}$ is just a convenient normalization factor. 
\nHere the dimension of the weight space ($W$) is the same as the dimension of the input space ($N$).\n\nThe learning equation then takes the simple form\n\n$$\\dot{w}_i \\propto \\sum_\\mu N^{-1} \\sum_j [u_j - w_j] \\xi_j^\\mu \\xi_i^\\mu - \\lambda w_i. \\qquad (6)$$\n\nDefining\n\n$$v_i = u_i - w_i \\qquad (7)$$\n\nand\n\n$$A_{ij} = N^{-1} \\sum_\\mu \\xi_i^\\mu \\xi_j^\\mu \\qquad (8)$$\n\nit becomes\n\n$$\\dot{v}_i \\propto -\\sum_j A_{ij} v_j + \\lambda (u_i - v_i). \\qquad (9)$$\n\nTransforming this equation to the basis where $A$ is diagonal yields\n\n$$\\dot{v}_r \\propto -(A_r + \\lambda) v_r + \\lambda u_r, \\qquad (10)$$\n\nwhere $A_r$ are the eigenvalues of $A$, and a subscript $r$ indicates transformation to this basis. The generalization error is defined as the error averaged over the distribution of input vectors\n\n$$F = \\langle [f_u(\\xi) - f_w(\\xi)]^2 \\rangle_\\xi = N^{-1} \\sum_{ij} v_i v_j \\langle \\xi_i \\xi_j \\rangle_\\xi. \\qquad (11)$$\n\nHere it is assumed that $\\langle \\xi_i \\xi_j \\rangle_\\xi = \\delta_{ij}$. The generalization error $F$ is thus proportional to $|v|^2$, which is also quite natural.\n\nThe eigenvalues of the covariance matrix $A$ are non-negative, and its rank can easily be shown to be less than or equal to $p$. It is also easily seen that all eigenvectors belonging to eigenvalues larger than 0 lie in the subspace of weight space spanned by the input patterns $\\xi^1, \\ldots, \\xi^p$. This subspace, called the pattern subspace, will be denoted $V_p$, and the orthogonal subspace is denoted by $V_\\perp$. When there are sufficiently many examples they span the whole space, and there will be no zero eigenvalues. This can only happen for $p \\geq N$.\n\nWhen $\\lambda = 0$ the solution to (10) inside $V_p$ is just a simple exponential decay to $v_r = 0$. Outside the pattern subspace $A_r = 0$, and the corresponding part of $v_r$ will be constant. Any weight vector which has the same projection onto the pattern subspace as $u$ gives a learning error 0. One can think of this as a 'valley' in the error surface given by $u + V_\\perp$. 
\nThe training set contains no information that can help us choose between all these solutions to the learning problem. When learning with a weight decay $\\lambda > 0$, the constant part in $V_\\perp$ will decay to zero asymptotically (as $e^{-\\lambda t}$, where $t$ is the time). An infinitesimal weight decay will therefore choose the solution with the smallest norm out of all the solutions in the valley described above. This solution can be shown to be the optimal one on average.\n\n4 LEARNING WITH AN UNRELIABLE TEACHER\n\nRandom errors made by the teacher can be modeled by adding a random term $\\eta$ to the targets:\n\n$$f_u(\\xi^\\mu) \\rightarrow f_u(\\xi^\\mu) + \\eta^\\mu. \\qquad (12)$$\n\nThe variance of $\\eta$ is called $\\sigma^2$, and it is assumed to have zero mean. Note that these targets are not exactly realizable by the network (for $\\alpha > 0$), and therefore this is a simple model for studying learning of an unrealizable function.\n\nWith this noise the learning equation (2) becomes\n\n$$\\dot{w}_i \\propto \\sum_\\mu \\Big( N^{-1} \\sum_j v_j \\xi_j^\\mu + N^{-1/2} \\eta^\\mu \\Big) \\xi_i^\\mu - \\lambda w_i. \\qquad (13)$$\n\nTransforming it to the basis where $A$ is diagonal as before,\n\n$$\\dot{v}_r \\propto -(A_r + \\lambda) v_r + \\lambda u_r - N^{-1/2} \\sum_\\mu \\eta^\\mu \\xi_r^\\mu. \\qquad (14)$$\n\nThe asymptotic solution to this equation is\n\n$$v_r = \\frac{\\lambda u_r - N^{-1/2} \\sum_\\mu \\eta^\\mu \\xi_r^\\mu}{\\lambda + A_r}. \\qquad (15)$$\n\nThe contribution to the generalization error is the square of this summed over all $r$. If averaged over the noise (shown by the bar) it becomes for each $r$\n\n$$\\overline{v_r^2} = \\frac{\\lambda^2 u_r^2 + N^{-1} \\sum_{\\mu\\nu} \\overline{\\eta^\\mu \\eta^\\nu} \\xi_r^\\mu \\xi_r^\\nu}{(\\lambda + A_r)^2} = \\frac{\\lambda^2 u_r^2 + \\sigma^2 A_r}{(\\lambda + A_r)^2}. \\qquad (16)$$\n\nFigure 1: Generalization error as a function of $\\alpha = p/N$. The full line is for $\\lambda = \\sigma^2 = 0.2$, and the dashed line for $\\lambda = 0$. The dotted line is the generalization error with no noise and $\\lambda = 0$.\n\nThe last expression has a minimum in $\\lambda$, which can be found by putting the derivative with respect to $\\lambda$ equal to zero, $\\lambda_r^{\\text{optimal}} = \\sigma^2 / u_r^2$. Remarkably it depends only 
\non $u$ and the variance of the noise, and not on $A$. If it is assumed that $u$ is random, (16) can be averaged over $u$. This yields an optimal $\\lambda$ independent of $r$,\n\n$$\\lambda_{\\text{optimal}} = \\frac{\\sigma^2}{\\overline{u^2}}, \\qquad (17)$$\n\nwhere $\\overline{u^2}$ is the average of $N^{-1} |u|^2$.\n\nIn this case the weight decay to some extent prevents the network from fitting the noise.\n\nFrom equation (14) one can see that the noise is projected onto the pattern subspace. Therefore the contribution to the generalization error from $V_\\perp$ is the same as before, and this contribution is on average minimized by a weight decay of any size.\n\nEquation (17) was derived in [10] in the context of a particular eigenvalue spectrum. Fig. 1 shows the dramatic improvement in generalization error when the optimal weight decay is used in this case. The present treatment shows that (17) is independent of the spectrum of $A$.\n\nWe conclude that a weight decay has two positive effects on generalization in a linear network: 1) It suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. 2) If the size is chosen right, it can suppress some of the effect of static noise on the targets.\n\n5 NON-LINEAR NETWORKS\n\nIt is not possible to analyze a general non-linear network exactly, as done above for the linear case. By a local linearization it is, however, possible to draw some interesting conclusions from the results in the previous section.\n\nAssume the function is realizable, $f = f_u$. Then learning corresponds to solving the $p$ equations\n\n$$f_w(\\xi^\\mu) = f_u(\\xi^\\mu), \\qquad \\mu = 1, \\ldots, p \\qquad (18)$$\n\nin $W$ variables, where $W$ is the number of weights. 
\nFor $p < W$ these equations define a manifold in weight space of dimension at least $W - p$. Any point $\\tilde{w}$ on this manifold gives a learning error of zero, and therefore (4) can be expanded around $\\tilde{w}$. Putting $v = \\tilde{w} - w$, expanding $f_w$ in $v$, and using it in (4) yields\n\n$$\\dot{v}_i \\propto -\\sum_{\\mu, j} \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_j} v_j \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_i} + \\lambda (\\tilde{w}_i - v_i) = -\\sum_j A_{ij}(\\tilde{w}) v_j - \\lambda v_i + \\lambda \\tilde{w}_i. \\qquad (19)$$\n\n(The derivatives in this equation should be taken at $\\tilde{w}$.) The analogue of $A$ is defined as\n\n$$A_{ij}(\\tilde{w}) = \\sum_\\mu \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_i} \\frac{\\partial f_w(\\xi^\\mu)}{\\partial w_j}. \\qquad (20)$$\n\nSince it is of outer product form (like $A$) its rank $R(\\tilde{w}) \\leq \\min\\{p, W\\}$. Thus when $p < W$, $A$ is never of full rank. The rank of $A$ is of course equal to $W$ minus the dimension of the manifold mentioned above.\n\nFrom these simple observations one can argue that good generalization should not be expected for $p < W$. This is in accordance with other results (cf. [3]), and with current 'folk-lore'. The difference from the linear case is that the valley (or 'rain gutter') need not be (and most probably is not) linear, but curved in this case. There may in fact be other valleys or rain gutters disconnected from the one containing $u$. One can also see that if $A$ has full rank, all points in the immediate neighborhood of $\\tilde{w} = u$ give a learning error larger than 0, i.e. there is a simple minimum at $u$.\n\nAssume that the learning finds one of these valleys. A small weight decay will pick out the point in the valley with the smallest norm among all the points in the valley. In general it cannot be proven that picking that solution is the best strategy. 
\nBut, at least from a philosophical point of view, it seems sensible, because it is (in a loose sense) the solution with the smallest complexity, the one that Ockham would probably have chosen.\n\nThe value of a weight decay is more evident if there are small errors in the targets. In that case one can go through exactly the same line of arguments as for the linear case to show that a weight decay can improve generalization, and even with the same optimal choice (17) of $\\lambda$. This is strictly true only for small errors (where the linear approximation is valid).\n\n6 NUMERICAL EXPERIMENTS\n\nA weight decay has been tested on the NetTalk problem [11]. In the simulations back-propagation derived from the 'entropic error measure' [12] with a momentum term fixed at 0.8 was used. The network had 7 x 26 input units, 40 hidden units and 26 output units, in all about 8400 weights. It was trained on 400 to 5000 random words from the data base of around 20,000 words, and tested on a different set of 1000 random words. The training set and test set were independent from run to run.\n\nFigure 2: The top full line corresponds to the generalization error after 300 epochs (300 cycles through the training set) without a weight decay. The lower full line is with a weight decay. The top dotted line is the lowest error seen during learning without a weight decay, and the lower dotted line with a weight decay. The size of the weight decay was $\\lambda = 0.00008$. Inset: Same figure except that the error rate is shown instead of the squared error. 
\nThe error rate is the fraction of wrong phonemes when the phoneme vector with the smallest angle to the actual output is chosen, see [11].\n\nResults are shown in fig. 2. There is a clear improvement in generalization error when weight decay is used. There is also an improvement in error rate (inset of fig. 2), but it is less pronounced in terms of relative improvement. Results shown here are for a weight decay of $\\lambda = 0.00008$. The values 0.00005 and 0.0001 were also tried and gave basically the same curves.\n\n7 CONCLUSION\n\nIt was shown how a weight decay can improve generalization in two ways: 1) It suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. 2) If the size is chosen right, a weight decay can suppress some of the effect of static noise on the targets. Static noise on the targets can be viewed as a model of learning an unrealizable function. The analysis assumed that the network could be expanded around an optimal weight vector, and therefore it is strictly only valid in a small neighborhood around that vector.\n\nThe improvement from a weight decay was also tested by simulations. For the NetTalk data it was shown that a weight decay can decrease the generalization error (squared error) and also, although less significantly, the actual mistake rate of the network when the phoneme closest to the output is chosen.\n\nAcknowledgements\n\nAK acknowledges support from the Danish Natural Science Council and the Danish Technical Research Council through the Computational Neural Network Center (CONNECT).\n\nReferences\n\n[1] D.B. Schwartz, V.K. Samalam, S.A. Solla, and J.S. Denker. Exhaustive learning. 
\nNeural Computation, 2:371-382, 1990.\n\n[2] N. Tishby, E. Levin, and S.A. Solla. Consistent inference of probabilities in layered networks: predictions and generalization. In International Joint Conference on Neural Networks, pages 403-410, (Washington 1989), IEEE, New York, 1989.\n\n[3] E.B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1:151-160, 1989.\n\n[4] Y. Le Cun, J.S. Denker, and S.A. Solla. Optimal brain damage. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems, pages 598-605, (Denver 1989), Morgan Kaufmann, San Mateo, 1990.\n\n[5] H.H. Thodberg. Improving generalization of neural networks through pruning. International Journal of Neural Systems, 1:317-326, 1990.\n\n[6] D.H. Weigend, D.E. Rumelhart, and B.A. Huberman. Generalization by weight-elimination with application to forecasting. In R.P. Lippmann et al., editors, Advances in Neural Information Processing Systems, pages 875-882, (Denver 1989), Morgan Kaufmann, San Mateo, 1991.\n\n[7] G.E. Hinton. Learning translation invariant recognition in a massively parallel network. In G. Goos and J. Hartmanis, editors, PARLE: Parallel Architectures and Languages Europe. Lecture Notes in Computer Science, pages 1-13, Springer-Verlag, Berlin, 1987.\n\n[8] J. Moody. Generalization, weight decay, and architecture selection for nonlinear learning systems. These proceedings.\n\n[9] D. MacKay. A practical Bayesian framework for backprop networks. These proceedings.\n\n[10] A. Krogh and J.A. Hertz. Generalization in a Linear Perceptron in the Presence of Noise. To appear in Journal of Physics A 1992.\n\n[11] T.J. Sejnowski and C.R. Rosenberg. Parallel networks that learn to pronounce English text. 
\nComplex Systems, 1:145-168, 1987.\n\n[12] J.A. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.\n", "award": [], "sourceid": 563, "authors": [{"given_name": "Anders", "family_name": "Krogh", "institution": null}, {"given_name": "John", "family_name": "Hertz", "institution": null}]}