{"title": "Statistical Dynamics of Batch Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 286, "page_last": 292, "abstract": null, "full_text": "Statistical Dynamics of Batch Learning\n\nS. Li and K. Y. Michael Wong\n\nDepartment of Physics, Hong Kong University of Science and Technology\n\nClear Water Bay, Kowloon, Hong Kong\n\n{phlisong, phkywong}@ust.hk\n\nAbstract\n\nAn important issue in neural computing concerns the description of learning dynamics with macroscopic dynamical variables. Recent progress on on-line learning only addresses the often unrealistic case of an infinite training set. We introduce a new framework to model batch learning of restricted sets of examples, widely applicable to any learning cost function, and fully taking into account the temporal correlations introduced by the recycling of the examples. For illustration we analyze the effects of weight decay and early stopping during the learning of teacher-generated examples.\n\n1 Introduction\n\nThe dynamics of learning in neural computing is a complex multi-variate process. The interest at the macroscopic level is thus to describe the process with macroscopic dynamical variables. Recently, much progress has been made on modeling the dynamics of on-line learning, in which an independent example is generated for each learning step [1, 2]. Since statistical correlations among the examples can be ignored, the dynamics can be simply described by instantaneous dynamical variables.\n\nHowever, most studies on on-line learning focus on the ideal case in which the network has access to an almost infinite training set, whereas in many applications, the collection of training examples may be costly. 
A restricted set of examples introduces extra temporal correlations during learning, and the dynamics is much more complicated. Early studies briefly considered the dynamics of Adaline learning [3, 4, 5], and this work has recently been extended to linear perceptrons learning nonlinear rules [6, 7]. Recent attempts, using the dynamical replica theory, have been made to study the learning of restricted sets of examples, but so far exact results are published only for simple learning rules such as Hebbian learning, beyond which appropriate approximations are needed [8].\n\nIn this paper, we introduce a new framework to model batch learning of restricted sets of examples, widely applicable to any learning rule which minimizes an arbitrary cost function by gradient descent. It fully takes into account the temporal correlations during learning, and is therefore exact for large networks.\n\n2 Formulation\n\nConsider the single layer perceptron with $N \\gg 1$ input nodes $\\{\\xi_j\\}$ connecting to a single output node by the weights $\\{J_j\\}$. For convenience we assume that the inputs $\\xi_j$ are Gaussian variables with mean 0 and variance 1, and the output state $S$ is a function $f(x)$ of the activation $x$ at the output node, i.e.\n\n$$S = f(x); \\qquad x = \\mathbf{J} \\cdot \\boldsymbol{\\xi}. \\qquad (1)$$\n\nThe network is assigned to \"learn\" $p = \\alpha N$ examples which map inputs $\\{\\xi_j^\\mu\\}$ to the outputs $\\{S_\\mu\\}$ ($\\mu = 1, \\ldots, p$). $S_\\mu$ are the outputs generated by a teacher perceptron $\\{B_j\\}$, namely\n\n$$S_\\mu = f(y_\\mu); \\qquad y_\\mu = \\mathbf{B} \\cdot \\boldsymbol{\\xi}^\\mu. \\qquad (2)$$\n\nBatch learning by gradient descent is achieved by adjusting the weights $\\{J_j\\}$ iteratively so that a certain cost function in terms of the student and teacher activations $\\{x_\\mu\\}$ and $\\{y_\\mu\\}$ is minimized. Hence we consider a general cost function\n\n$$E = -\\sum_\\mu g(x_\\mu, y_\\mu). \\qquad (3)$$\n\nThe precise functional form of $g(x, y)$ depends on the adopted learning algorithm. 
For the case of binary outputs, $f(x) = \\mathrm{sgn}\\,x$. Early studies on the learning dynamics considered Adaline learning [3, 4, 5], where $g(x, y) = -(S - x)^2/2$ with $S = \\mathrm{sgn}\\,y$. For recent studies on Hebbian learning [8], $g(x, y) = xS$.\n\nTo ensure that the perceptron is regularized after learning, it is customary to introduce a weight decay term. Furthermore, to avoid the system being trapped in local minima, noise is often added in the dynamics. Hence the gradient descent dynamics is given by\n\n$$\\frac{dJ_j(t)}{dt} = \\frac{1}{N} \\sum_\\mu g'(x_\\mu(t), y_\\mu)\\,\\xi_j^\\mu - \\lambda J_j(t) + \\eta_j(t), \\qquad (4)$$\n\nwhere, here and below, $g'(x, y)$ and $g''(x, y)$ respectively represent the first and second partial derivatives of $g(x, y)$ with respect to $x$. $\\lambda$ is the weight decay strength, and $\\eta_j(t)$ is the noise term at temperature $T$ with\n\n$$\\langle \\eta_j(t) \\rangle = 0 \\quad \\mathrm{and} \\quad \\langle \\eta_j(t)\\,\\eta_k(s) \\rangle = \\frac{2T}{N}\\,\\delta_{jk}\\,\\delta(t - s). \\qquad (5)$$\n\n3 The Cavity Method\n\nOur theory is the dynamical version of the cavity method [9, 10, 11]. It uses a self-consistency argument to consider what happens when a new example is added to a training set. The central quantity in this method is the cavity activation, which is the activation of a new example for a perceptron trained without that example. Since the original network has no information about the new example, the cavity activation is stochastic. Specifically, denoting the new example by the label 0, its cavity activation at time $t$ is\n\n$$h_0(t) = \\mathbf{J}(t) \\cdot \\boldsymbol{\\xi}^0. \\qquad (6)$$\n\nFor large $N$ and independently generated examples, $h_0(t)$ is a Gaussian variable. Its covariance is given by the correlation function $C(t, s)$ of the weights at times $t$ and $s$, that is,\n\n$$\\langle h_0(t)\\,h_0(s) \\rangle = \\mathbf{J}(t) \\cdot \\mathbf{J}(s) \\equiv C(t, s), \\qquad (7)$$\n\nwhere $\\xi_j^0$ and $\\xi_k^0$ are assumed to be independent for $j \\neq k$. 
The distribution is further specified by the teacher-student correlation $R(t)$, given by\n\n$$\\langle h_0(t)\\,y_0 \\rangle = \\mathbf{J}(t) \\cdot \\mathbf{B} = R(t). \\qquad (8)$$\n\nNow suppose the perceptron incorporates the new example at the batch-mode learning step at time $s$. Then the activation of this new example at a subsequent time $t > s$ will no longer be a random variable. Furthermore, the activations of the original $p$ examples at time $t$ will also be adjusted from $\\{x_\\mu(t)\\}$ to $\\{x_\\mu^0(t)\\}$ because of the newcomer, which will in turn affect the evolution of the activation of example 0, giving rise to the so-called Onsager reaction effects. This makes the dynamics complex, but fortunately for large $p \\sim N$, we can assume that the adjustment from $x_\\mu(t)$ to $x_\\mu^0(t)$ is small, and perturbative analysis can be applied.\n\nSuppose the weights of the original and new perceptron at time $t$ are $\\{J_j(t)\\}$ and $\\{J_j^0(t)\\}$ respectively. Then a perturbation of (4) yields\n\n$$\\left(\\frac{d}{dt} + \\lambda\\right) \\left(J_j^0(t) - J_j(t)\\right) = \\frac{1}{N}\\,g'(x_0(t), y_0)\\,\\xi_j^0 + \\frac{1}{N} \\sum_{\\mu k} \\xi_j^\\mu\\,g''(x_\\mu(t), y_\\mu)\\,\\xi_k^\\mu \\left(J_k^0(t) - J_k(t)\\right). \\qquad (9)$$\n\nThe first term on the right hand side describes the primary effects of adding example 0 to the training set, and is the driving term for the difference between the two perceptrons. The second term describes the secondary effects due to the changes to the original examples caused by the added example, and is referred to as the Onsager reaction term. One should note the difference between the cavity and generic activations of the added example. The former is denoted by $h_0(t)$ and corresponds to the activation in the perceptron $\\{J_j(t)\\}$, whereas the latter, denoted by $x_0(t)$ and corresponding to the activation in the perceptron $\\{J_j^0(t)\\}$, is the one used in calculating the gradient in the driving term of (9). 
Since their notations are sufficiently distinct, we have omitted the superscript 0 in $x_0(t)$, which appears in the background examples $x_\\mu^0(t)$.\n\nThe equation can be solved by the Green's function technique, yielding\n\n$$J_j^0(t) - J_j(t) = \\sum_k \\int ds\\,G_{jk}(t, s) \\left(\\frac{1}{N}\\,g_0'(s)\\,\\xi_k^0\\right), \\qquad (10)$$\n\nwhere $g_0'(s) = g'(x_0(s), y_0)$ and $G_{jk}(t, s)$ is the weight Green's function satisfying\n\n$$G_{jk}(t, s) = G^{(0)}(t - s)\\,\\delta_{jk} + \\frac{1}{N} \\sum_{\\mu i} \\int dt'\\,G^{(0)}(t - t')\\,\\xi_j^\\mu\\,g_\\mu''(t')\\,\\xi_i^\\mu\\,G_{ik}(t', s), \\qquad (11)$$\n\n$G^{(0)}(t - s) = \\Theta(t - s) \\exp(-\\lambda(t - s))$ is the bare Green's function, and $\\Theta$ is the step function. The weight Green's function describes how the effects of example 0 propagate from weight $J_k$ at learning time $s$ to weight $J_j$ at a subsequent time $t$, including both primary and secondary effects. Hence all the temporal correlations have been taken into account.\n\nFor large $N$, the equation can be solved by a diagrammatic approach similar to [5]. The weight Green's function is self-averaging over the distribution of examples and is diagonal, i.e. $\\lim_{N \\to \\infty} G_{jk}(t, s) = G(t, s)\\,\\delta_{jk}$, where\n\n$$G(t, s) = G^{(0)}(t - s) + \\alpha \\int dt_1 \\int dt_2\\,G^{(0)}(t - t_1)\\,\\langle g_\\mu''(t_1)\\,D_\\mu(t_1, t_2) \\rangle\\,G(t_2, s). \\qquad (12)$$\n\n$D_\\mu(t, s)$ is the example Green's function given by\n\n$$D_\\mu(t, s) = \\delta(t - s) + \\int dt'\\,G(t, t')\\,g_\\mu''(t')\\,D_\\mu(t', s). \\qquad (13)$$\n\nThis allows us to express the generic activations of the examples in terms of their cavity counterparts. Multiplying both sides of (10) by $\\xi_j^0$ and summing over $j$, we get\n\n$$x_0(t) - h_0(t) = \\int ds\\,G(t, s)\\,g_0'(s). \\qquad (14)$$\n\nThis equation is interpreted as follows. At time $t$, the generic activation $x_0(t)$ deviates from its cavity counterpart because its gradient term $g_0'(s)$ was present in the batch learning step at previous times $s$. This gradient term propagates its influence from time $s$ to $t$ via the Green's function $G(t, s)$. 
Statistically, this equation enables us to express the activation distribution in terms of the cavity activation distribution, thereby getting a macroscopic description of the dynamics.\n\nTo solve for the Green's functions and the activation distributions, we further need the fluctuation-response relation derived by linear response theory,\n\n$$C(t, s) = \\alpha \\int dt'\\,G^{(0)}(t - t')\\,\\langle g_\\mu'(t')\\,x_\\mu(s) \\rangle + 2T \\int dt'\\,G^{(0)}(t - t')\\,G(s, t'). \\qquad (15)$$\n\nFinally, the teacher-student correlation is given by\n\n$$R(t) = \\alpha \\int dt'\\,G^{(0)}(t - t')\\,\\langle g_\\mu'(t')\\,y_\\mu \\rangle. \\qquad (16)$$\n\n4 A Solvable Case\n\nThe cavity method can be applied to the dynamics of learning with an arbitrary cost function. When it is applied to the Hebb rule, it yields results identical to the exact results in [8]. Here we present the results for the Adaline rule to illustrate features of learning dynamics derivable from the study. This is a common learning rule and bears resemblance to the more common back-propagation rule. Theoretically, its dynamics is particularly convenient for analysis since $g''(x) = -1$, rendering the weight Green's function time translation invariant, i.e. $G(t, s) = G(t - s)$. In this case, the dynamics can be solved by Laplace transform.\n\nTo monitor the progress of learning, we are interested in three performance measures: (a) Training error $\\epsilon_t$, which is the probability of error for the training examples. It is given by $\\epsilon_t = \\langle \\Theta(-x\\,\\mathrm{sgn}\\,y) \\rangle_{xy}$, where the average is taken over the joint distribution $p(x, y)$ of the training set. (b) Test error $\\epsilon_{test}$, which is the probability of error when the inputs $\\xi_j^\\mu$ of the training examples are corrupted by an additive Gaussian noise of variance $\\Delta^2$. This is a relevant performance measure when the perceptron is applied to process data which are the corrupted versions of the training data. 
It is given by $\\epsilon_{test} = \\left\\langle H\\left(\\frac{x\\,\\mathrm{sgn}\\,y}{\\Delta\\sqrt{C(t, t)}}\\right) \\right\\rangle_{xy}$. When $\\Delta^2 = 0$, the test error reduces to the training error. (c) Generalization error $\\epsilon_g$, which is the probability of error for an arbitrary input $\\xi_j$ when the teacher and student outputs are compared. It is given by $\\epsilon_g = \\arccos[R(t)/\\sqrt{C(t, t)}]/\\pi$.\n\nFigure 1(a) shows the evolution of the generalization error at $T = 0$. When the weight decay strength varies, the steady-state generalization error is minimized at the optimum\n\n$$\\lambda_{opt} = \\frac{\\pi}{2} - 1, \\qquad (17)$$\n\nwhich is independent of $\\alpha$. It is interesting to note that in the case of the linear perceptron, the optimal weight decay strength is also independent of $\\alpha$ and only determined by the output noise and unlearnability of the examples [5, 7]. Similarly, here the student is only provided the coarse-grained version of the teacher's activation in the form of binary bits.\n\nFor $\\lambda < \\lambda_{opt}$, the generalization error is a non-monotonic function of learning time. Hence the dynamics is plagued by overtraining, and it is desirable to introduce early stopping to improve the perceptron performance. Similar behavior is observed in linear perceptrons [5, 6, 7].\n\nTo verify the theoretical predictions, simulations were done with $N = 500$ and using 50 samples for averaging. As shown in Fig. 1(a), the agreement is excellent.\n\nFigure 1(b) compares the generalization errors at the steady-state and the early stopping point. It shows that early stopping improves the performance for $\\lambda < \\lambda_{opt}$, which becomes near-optimal when compared with the best result at $\\lambda = \\lambda_{opt}$. Hence early stopping can speed up the learning process without significant sacrifice in the generalization ability. However, it cannot outperform the optimal result at steady-state. 
This agrees with a recent empirical observation that a careful control of the weight decay may be better than early stopping in optimizing generalization [12].\n\nFigure 1: (a) The evolution of the generalization error at $T = 0$ for $\\alpha = 0.5, 1.2$ and different weight decay strengths $\\lambda$. Theory: solid line, simulation: symbols. (b) Comparing the generalization error at the steady state ($\\infty$) and at the early stopping point ($t_{es}$) for $\\alpha = 0.5, 1.2$ and $T = 0$.\n\nIn the search for optimal learning algorithms, an important consideration is the environment in which the performance is tested. Besides the generalization performance, there are applications in which the test examples have inputs correlated with the training examples. Hence we are interested in the evolution of the test error for a given additive Gaussian noise $\\Delta$ in the inputs. Figure 2(a) shows, again, that there is an optimal weight decay parameter $\\lambda_{opt}$ which minimizes the test error. Furthermore, when the weight decay is weak, early stopping is desirable.\n\nFigure 2(b) shows the value of the optimal weight decay as a function of the input noise variance $\\Delta^2$. To the lowest order approximation, $\\lambda_{opt} \\propto \\Delta^2$ for sufficiently large $\\Delta^2$. 
The dependence of $\\lambda_{opt}$ on input noise is rather general since it also holds in the case of random examples [13]. In the limit of small $\\Delta^2$, $\\lambda_{opt}$ vanishes as $\\Delta^2$ for $\\alpha < 1$, whereas $\\lambda_{opt}$ approaches a nonzero constant for $\\alpha > 1$. Hence for $\\alpha < 1$, weight decay is not necessary when the training error is optimized, but when the perceptron is applied to process increasingly noisy data, weight decay becomes more and more important in performance enhancement.\n\nFigure 2(b) also shows the phase line $\\lambda_{ot}(\\Delta^2)$ below which overtraining occurs. Again, to the lowest order approximation, $\\lambda_{ot} \\propto \\Delta^2$ for sufficiently large $\\Delta^2$. However, unlike the case of generalization error, the line for the onset of overtraining does not coincide exactly with the line of optimal weight decay. In particular, for an intermediate range of input noise, the optimal line lies in the region of overtraining, so that the optimal performance can only be attained by tuning both the weight decay strength and learning time. However, at least in the present case, computational results show that the improvement is marginal.\n\nFigure 2: (a) The evolution of the test error for $\\Delta^2 = 3$, $T = 0$ and different weight decay strengths $\\lambda$ ($\\lambda_{opt} \\approx 1.5, 3.6$ for $\\alpha = 0.5, 1.2$ respectively). (b) The lines of the optimal weight decay and the onset of overtraining for $\\alpha = 5$. Inset: The same data with $\\lambda_{ot} - \\lambda_{opt}$ (magnified) versus $\\Delta^2$. 
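The simulations described in Section 4 can be approximated by a short numerical sketch. The code below is not the authors' implementation. It assumes a forward-Euler discretization of the gradient descent dynamics (4) at T = 0 with the Adaline cost, zero initial weights and a unit-norm random teacher, and the names simulate_adaline, dt and t_max are hypothetical choices made for illustration. It records the generalization error arccos[R/sqrt(C)]/pi along a single run.

```python
import numpy as np


def simulate_adaline(N=200, alpha=0.5, lam=0.1, dt=0.05, t_max=10.0, seed=0):
    # Batch Adaline learning with weight decay at T = 0 (noise term omitted).
    # Assumptions (not from the paper): Euler step dt, J(0) = 0, |B| = 1.
    rng = np.random.default_rng(seed)
    p = int(alpha * N)                      # number of examples, p = alpha * N
    xi = rng.standard_normal((p, N))        # Gaussian inputs, mean 0, variance 1
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)                  # teacher weights scaled to unit length
    S = np.sign(xi @ B)                     # binary teacher outputs, S = sgn(y)

    J = np.zeros(N)                         # student weights
    steps = int(round(t_max / dt))
    times = dt * np.arange(steps + 1)
    eps_g = np.empty(steps + 1)
    for k in range(steps + 1):
        R = J @ B                           # teacher-student correlation R(t)
        C = J @ J                           # weight correlation C(t, t)
        if C > 0.0:
            eps_g[k] = np.arccos(np.clip(R / np.sqrt(C), -1.0, 1.0)) / np.pi
        else:
            eps_g[k] = 0.5                  # chance level before any learning
        x = xi @ J                          # student activations on the training set
        # Adaline cost g = -(S - x)^2 / 2 gives g' = S - x in Eq. (4)
        J = J + dt * ((S - x) @ xi / N - lam * J)
    return times, eps_g


if __name__ == '__main__':
    for lam in (0.1, np.pi / 2 - 1, 10.0):
        t, eps = simulate_adaline(lam=lam)
        print('lambda = %.3f  final eps_g = %.3f  min eps_g = %.3f'
              % (lam, eps[-1], eps.min()))
```

Averaging eps_g over many seeds and increasing N would bring this sketch closer to the setup quoted in Section 4 (N = 500, 50 samples). Comparing the final and minimum values across different lam is one way to look for the overtraining behavior discussed there, subject to the assumptions listed above.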
\n\n5 Conclusion\n\nBased on the cavity method, we have introduced a new framework for modeling the dynamics of learning, which is applicable to any learning cost function, making it a versatile theory. It takes into full account the temporal correlations generated by the use of a restricted set of examples, which is more realistic in many situations than theories of on-line learning with an infinite training set.\n\nWhile the Adaline rule is solvable by the cavity method, it is still a relatively simple model approachable by more direct methods. Hence the justification of the method as a general framework for learning dynamics hinges on its applicability to less trivial cases. In general, $g_\\mu''(t')$ in (13) is not a constant and $D_\\mu(t, s)$ has to be expanded as a series. The dynamical equations can then be considered as the starting point of a perturbation theory, and results in various limits can be derived, e.g. the limits of small $\\alpha$, large $\\alpha$, large $\\lambda$, or the asymptotic limit. Another area for the useful application of the cavity method is the case of batch learning with very large learning steps. Since it has been shown recently that such learning converges in a few steps [6], the dynamical equations remain simple enough for a meaningful study. Preliminary results along this direction are promising and will be reported elsewhere.\n\nAn alternative general theory for learning dynamics, the dynamical replica theory, has recently been developed [8]. It yields exact results for Hebbian learning, and approximate results for more non-trivial cases. 
Based on certain self-averaging assumptions, the theory is able to approximate the dynamics by the evolution of single-time functions, at the expense of having to solve a set of saddle point equations in the replica formalism at every learning instant. On the other hand, our theory retains the functions $G(t, s)$ and $C(t, s)$ with double arguments, but develops naturally from the stochastic nature of the cavity activations. Contrary to a suggestion [14], the cavity method can also be applied to on-line learning with restricted sets of examples. It is hoped that by adhering to an exact formalism, the cavity method can provide more fundamental insights when the studies are extended to more sophisticated multilayer networks of practical importance.\n\nThe method enables us to study the effects of weight decay and early stopping. It shows that the optimal strength of weight decay is determined by the imprecision in the examples, or the level of input noise in anticipated applications. For weaker weight decay, the generalization performance can be made near-optimal by early stopping. Furthermore, depending on the performance measure, optimality may only be attained by a combination of weight decay and early stopping. Though the performance improvement is marginal in the present case, the question remains open in the more general context.\n\nWe consider the present work as the beginning of an in-depth study of learning dynamics. Many interesting and challenging issues remain to be explored.\n\nAcknowledgments\n\nWe thank A. C. C. Coolen and D. Saad for fruitful discussions during NIPS. This work was supported by the grant HKUST6130/97P from the Research Grant Council of Hong Kong.\n\nReferences\n\n[1] D. Saad and S. Solla, Phys. Rev. Lett. 74, 4337 (1995).\n[2] D. 
Saad and M. Rattray, Phys. Rev. Lett. 79, 2578 (1997).\n[3] J. Hertz, A. Krogh and G. I. Thorbergsson, J. Phys. A 22, 2133 (1989).\n[4] M. Opper, Europhys. Lett. 8, 389 (1989).\n[5] A. Krogh and J. A. Hertz, J. Phys. A 25, 1135 (1992).\n[6] S. Bos and M. Opper, J. Phys. A 31, 4835 (1998).\n[7] S. Bos, Phys. Rev. E 58, 833 (1998).\n[8] A. C. C. Coolen and D. Saad, in On-line Learning in Neural Networks, ed. D. Saad (Cambridge University Press, Cambridge, 1998).\n[9] M. Mezard, G. Parisi and M. Virasoro, Spin Glass Theory and Beyond (World Scientific, Singapore, 1987).\n[10] K. Y. M. Wong, Europhys. Lett. 30, 245 (1995).\n[11] K. Y. M. Wong, Advances in Neural Information Processing Systems 9, 302 (1997).\n[12] L. K. Hansen, J. Larsen and T. Fog, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing 4, 3205 (1997).\n[13] Y. W. Tong, K. Y. M. Wong and S. Li, to appear in Proc. of IJCNN'99 (1999).\n[14] A. C. C. Coolen and D. Saad, Preprint KCL-MTH-99-33 (1999).\n", "award": [], "sourceid": 1697, "authors": [{"given_name": "Song", "family_name": "Li", "institution": null}, {"given_name": "K. Y. Michael", "family_name": "Wong", "institution": null}]}