{"title": "Geometry of Early Stopping in Linear Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 365, "page_last": 371, "abstract": null, "full_text": "Geometry of Early Stopping in Linear Networks\n\nRobert Dodier *\n\nDept. of Computer Science\n\nUniversity of Colorado\n\nBoulder, CO 80309\n\n* Address correspondence to: dodier@cs.colorado.edu\n\nAbstract\n\nA theory of early stopping as applied to linear models is presented. The backpropagation learning algorithm is modeled as gradient descent in continuous time. Given a training set and a validation set, all weight vectors found by early stopping must lie on a certain quadric surface, usually an ellipsoid. Given a training set and a candidate early stopping weight vector, all validation sets have least-squares weights lying on a certain plane. This latter fact can be exploited to estimate the probability of stopping at any given point along the trajectory from the initial weight vector to the least-squares weights derived from the training set, and to estimate the probability that training goes on indefinitely. The prospects for extending this theory to nonlinear models are discussed.\n\n1 INTRODUCTION\n\n'Early stopping' is the following training procedure:\n\nSplit the available data into a training set and a \"validation\" set. Start with initial weights close to zero. Apply gradient descent (backpropagation) on the training data. If the error on the validation set increases over time, stop training.\n\nThis training method, as applied to neural networks, is of relatively recent origin. The earliest references include Morgan and Bourlard [4] and Weigend et al. [7]. Finnoff et al. [2] studied early stopping empirically. 
While the goal of a theory of early stopping is to analyze its application to nonlinear approximators such as sigmoidal networks, this paper will deal mainly with linear systems and only marginally with nonlinear systems. Baldi and Chauvin [1] and Wang et al. [6] have also analyzed linear systems.\n\nThe main result of this paper can be summarized as follows. It can be shown (see Sec. 5) that the most probable stopping point on a given trajectory (fixing the training set and initial weights) is the same no matter what the size of the validation set. That is, the most probable stopping point (considering all possible validation sets) for a finite validation set is the same as for an infinite validation set. (If the validation data is unlimited, then the validation error is the same as the true generalization error.) However, for finite validation sets there is a dispersion of stopping points around the best (most probable and least generalization error) stopping point, and this increases the expected generalization error. See Figure 1 for an illustration of these ideas.\n\n2 MATHEMATICAL PRELIMINARIES\n\nIn what follows, backpropagation will be modeled as a process in continuous time. This corresponds to letting the learning rate approach zero. This continuum model simplifies the necessary algebra while preserving the important properties of early stopping. Let the inputs be denoted $X = (x_{ij})$, so that $x_{ij}$ is the j'th component of the i'th observation; there are $p$ components of each of the $n$ observations. Likewise, let $y = (y_i)$ be the (scalar) outputs observed when the inputs are $X$. Our regression model will be a linear model, $y_i = w'x_i + \\varepsilon_i$, $i = 1, \\ldots, n$. Here $\\varepsilon_i$ represents independent, identically distributed (i.i.d.) Gaussian noise, $\\varepsilon_i \\sim N(0, \\sigma^2)$. 
Let $E(w) = \\tfrac{1}{2}\\|Xw - y\\|^2$ be one-half the usual sum of squared errors.\n\nThe error gradient with respect to the weights is $\\nabla E(w) = w'X'X - y'X$. The backprop algorithm is modeled as $\\dot{w} = -\\nabla E(w)$. The least-squares solution, at which $\\nabla E(w) = 0$, is $w_{LS} = (X'X)^{-1}X'y$. Note the appearance here of the input correlation matrix, $X'X = (\\sum_{k=1}^{n} x_{ki}x_{kj})$. The properties of this matrix determine, to a large extent, the properties of the least-squares solutions we find. It turns out that as the number of observations $n$ increases without bound, the matrix $\\sigma^2(X'X)^{-1}$ converges with probability one to the population covariance matrix of the weights. We will find that the correlation matrix plays an important role in the analysis of early stopping.\n\nWe can rewrite the error $E$ using a diagonalization of the correlation matrix $X'X = S\\Lambda S'$. Omitting a few steps of algebra,\n\n$$E(w) = \\tfrac{1}{2}\\sum_{k=1}^{p} \\lambda_k v_k^2 + \\tfrac{1}{2}y'(y - Xw_{LS}) \\qquad (1)$$\n\nwhere $v = S'(w - w_{LS})$ and $\\Lambda = \\mathrm{diag}(\\lambda_1, \\ldots, \\lambda_p)$. In this sum we see that the magnitude of the k'th term is proportional to the corresponding characteristic value, so moving $w$ toward $w_{LS}$ in the direction corresponding to the largest characteristic value yields the greatest reduction of error. Likewise, moving in the direction corresponding to the smallest characteristic value gives the least reduction of error.\n\nSo far, we have implicitly considered only one set of data; we have assumed all data is used for training. Now let us distinguish training data, $X_t$ and $y_t$, from validation data, $X_v$ and $y_v$; there are $n_t$ training and $n_v$ validation data. Now each set of data has its own least-squares weight vector, $w_t$ and $w_v$, and its own error gradient, $\\nabla E_t(w)$ and $\\nabla E_v(w)$. Also define $M_t = X_t'X_t$ and $M_v = X_v'X_v$ for convenience. 
The early stopping method can be analyzed in terms of these pairs of matrices, gradients, and least-squares weight vectors.\n\n3 THE MAGIC ELLIPSOID\n\nConsider the early stopping criterion, $\\frac{dE_v}{dt}(w) = 0$. Applying the chain rule,\n\n$$\\frac{dE_v}{dt} = \\frac{dE_v}{dw} \\cdot \\frac{dw}{dt} = \\nabla E_v \\cdot (-\\nabla E_t), \\qquad (2)$$\n\nwhere the last equality follows from the definition of gradient descent. So the early stopping criterion is the same as saying\n\n$$\\nabla E_t \\cdot \\nabla E_v = 0, \\qquad (3)$$\n\nthat is, at an early stopping point, the training and validation error gradients are perpendicular, if they are not zero.\n\nConsider now the set of all points in the weight space such that the training and validation error gradients are perpendicular. These are the points at which early stopping may stop. It turns out that this set of points has an easily described shape. The condition given by Eq. 3 is equivalent to\n\n$$(w - w_t)'M_tM_v'(w - w_v) = 0. \\qquad (4)$$\n\nNote that all correlation matrices are symmetric, so $M_tM_v' = M_tM_v$. We see that Eq. 4 gives a quadratic form. Let us put Eq. 4 into a standard form. Toward this end, let us define some useful terms. Let\n\n$$M = M_tM_v, \\qquad (5)$$\n\n$$\\bar{M} = \\tfrac{1}{2}(M + M') = \\tfrac{1}{2}(M_tM_v + M_vM_t), \\qquad (6)$$\n\n$$\\bar{w} = \\tfrac{1}{2}(w_t + w_v), \\qquad (7)$$\n\n$$\\Delta w = w_t - w_v, \\qquad (8)$$\n\nand\n\n$$\\tilde{w} = \\bar{w} - \\tfrac{1}{4}\\bar{M}^{-1}(M - M')\\Delta w. \\qquad (9)$$\n\nNow an important result can be stated. The proof is omitted.\n\nProposition 1. $\\nabla E_t \\cdot \\nabla E_v = 0$ is equivalent to\n\n$$(w - \\tilde{w})'\\bar{M}(w - \\tilde{w}) = \\tfrac{1}{4}\\Delta w'[\\bar{M} + \\tfrac{1}{4}(M' - M)\\bar{M}^{-1}(M - M')]\\Delta w. \\qquad (10)$$ \u25a1\n\nThe matrix $\\bar{M}$ of the quadratic form given by Eq. 10 is \"usually\" positive definite. As the number of observations $n_t$ and $n_v$ of training and validation data increase without bound, $\\bar{M}$ converges to a positive definite matrix. In what follows it will always be assumed that $\\bar{M}$ is indeed positive definite. Given this, the locus defined by $\\nabla E_t \\perp \\nabla E_v$ is an ellipsoid. 
The centroid is $\\tilde{w}$, the orientation is determined by the characteristic vectors of $\\bar{M}$, and the length of the k'th semiaxis is $\\sqrt{c/\\lambda_k}$, where $c$ is the constant on the righthand side of Eq. 10 and $\\lambda_k$ is the k'th characteristic value of $\\bar{M}$.\n\n4 THE MAGIC PLANE\n\nGiven the least-squares weight vector $w_t$ derived from the training data and a candidate early stopping weight vector $w_{es}$, any least-squares weight vector $w_v$ from a validation set must lie on a certain plane, the 'magic plane.' The proof of this statement is omitted.\n\nProposition 2. The condition that $w_t$, $w_v$, and $w_{es}$ all lie on the magic ellipsoid,\n\n$$(w_t - \\tilde{w})'\\bar{M}(w_t - \\tilde{w}) = (w_v - \\tilde{w})'\\bar{M}(w_v - \\tilde{w}) = (w_{es} - \\tilde{w})'\\bar{M}(w_{es} - \\tilde{w}) = c, \\qquad (11)$$\n\nimplies\n\n$$(w_t - w_{es})'Mw_v = (w_t - w_{es})'Mw_{es}. \\qquad (12)$$ \u25a1\n\nThis shows that $w_v$ lies on a plane, the magic plane, with normal $M'(w_t - w_{es})$. The reader will note a certain difficulty here, namely that $M = M_tM_v$ depends on the particular validation set used, as does $w_v$. However, we can make progress by considering only a fixed correlation matrix $M_v$ and letting $w_v$ vary. Let us suppose the inputs $(x_1, x_2, \\ldots, x_p)$ are i.i.d. Gaussian random variables with mean zero and some covariance $\\Sigma$. (Here the inputs are random but they are observed exactly, so the error model $y = w'x + \\varepsilon$ still applies.) Then\n\n$$\\langle M_v \\rangle = \\langle X_v'X_v \\rangle = n_v\\Sigma,$$\n\nso in Eq. 12 let us replace $M_v$ with its expected value $n_v\\Sigma$. That is, we can approximate Eq. 12 with\n\n$$(w_t - w_{es})'M_t(n_v\\Sigma)w_v = (w_t - w_{es})'M_t(n_v\\Sigma)w_{es}. \\qquad (13)$$\n\nNow consider the probability that a particular point $w(t)$ on the trajectory from $w(0)$ to $w_t$ is an early stopping point, that is, $\\nabla E_t(w(t)) \\cdot \\nabla E_v(w(t)) = 0$. This is exactly the probability that Eq. 12 is satisfied, and approximately the probability that Eq. 13 is satisfied. 
This latter approximation is easy to calculate: it is the mass of an infinitesimally-thin slab cutting through the distribution of least-squares validation weight vectors. Given the usual additive noise model $y = w'x + \\varepsilon$ with $\\varepsilon$ being i.i.d. Gaussian distributed noise with mean zero and variance $\\sigma^2$, the least-squares weights are approximately distributed as\n\n$$w_v \\sim N(w^*, \\tfrac{\\sigma^2}{n_v}\\Sigma^{-1}) \\qquad (14)$$\n\nwhen the number of data is large. Here $w^*$ denotes the true weights.\n\nConsider now the plane $\\Pi = \\{w : w'\\hat{n} = k\\}$. The probability mass on this plane as it cuts through a Gaussian distribution $N(\\mu, C)$ is then\n\n$$p_\\Pi(k, \\hat{n}) = (2\\pi\\hat{n}'C\\hat{n})^{-1/2} \\exp\\left(-\\frac{1}{2}\\frac{(k - \\hat{n}'\\mu)^2}{\\hat{n}'C\\hat{n}}\\right) ds \\qquad (15)$$\n\nwhere $ds$ denotes an infinitesimal arc length. (See, for example, Sec. VIII-9.3 of von Mises [3].)\n\n[Figure 1 plot. Horizontal axis: Arc Length Along Trajectory.]\n\nFigure 1: Histogram of early stopping points along a trajectory, with bins of equal arc length. An approximation to the probability of stopping (Eq. 16) is superimposed. Altogether 1000 validation sets were generated for a certain training set; of these, 288 gave \"don't start\" solutions, 701 gave early stopping solutions (which are binned here) somewhere on the trajectory, and 11 gave \"don't stop\" solutions.\n\n5 PROBABILITY OF STOPPING AT A GIVEN POINT\n\nLet us apply Eq. 15 to the problem at hand. Our normal is $\\hat{n} = n_v\\Sigma M_t(w_t - w_{es})$ and the offset is $k = \\hat{n}'w_{es}$. A formal statement of the approximation of $p_\\Pi$ can now be made.\n\nProposition 3. Assuming the validation correlation matrix $X_v'X_v$ equals the mean correlation matrix $n_v\\Sigma$, the probability of stopping at a point $w_{es} = w(t)$ on the trajectory from $w(0)$ to $w_t$ is approximately\n\nwith\n\n(17)\n\nHow useful is this approximation? 
Simulations were carried out in which the initial weight vector $w(0)$ and the training data ($n_t = 20$) were fixed, and many validation sets of size $n_v = 20$ were generated (without fixing $X_v'X_v$). The trajectory was divided into segments of equal length and histograms of the number of early stopping weights on each segment were constructed. A typical example is shown in Figure 1. It can be seen that the empirical histogram is well-approximated by Eq. 16.\n\nIf for some $w(t)$ on the trajectory the magic plane cuts through the true weights $w^*$, then $p_\\Pi$ will have a peak at $t$. As the number of validation data $n_v$ increases, the variance of $w_v$ decreases and the peak narrows, but the position $w(t)$ of the peak does not move. As $n_v \\to \\infty$ the peak becomes a spike at $w(t)$. That is, the peak of $p_\\Pi$ for a finite validation set is the same as if we had access to the true generalization error. In this sense, early stopping does the right thing.\n\nIt has been observed that when early stopping is employed, the validation error may decrease forever and never rise; thus the 'early stopping' procedure yields the least-squares weights. How common is this phenomenon? Let us consider a fixed training set and a fixed initial weight vector, so that the trajectory is fixed. Letting the validation set range over all possible realizations, let us denote by $P_\\Pi(t) = P_\\Pi(k(t), \\hat{n}(t))$ the probability that training stops at time $t$ or later. The quantity $1 - P_\\Pi(0)$ is the probability that validation error rises immediately upon beginning training, and let us agree that $P_\\Pi(\\infty)$ denotes the probability that validation error never increases. This $P_\\Pi(t)$ is approximately the mass that is \"behind\" the plane $\\hat{n}'w_v = \\hat{n}'w_{es}$, \"behind\" meaning the points $w_v$ such that $(w_v - w_{es})'\\hat{n} < 0$. 
(The identification of $P_\\Pi$ with the mass to one side of the plane is not exact because intersections of magic planes are ignored.) As Eq. 15 has the form of a Gaussian p.d.f., it is easy to show that\n\n$$P_\\Pi(k, \\hat{n}) = G\\left(\\frac{k - \\hat{n}'w^*}{(\\hat{n}'C\\hat{n})^{1/2}}\\right) \\qquad (18)$$\n\nwhere $G$ denotes the standard Gaussian c.d.f., $G(z) = (2\\pi)^{-1/2}\\int_{-\\infty}^{z} \\exp(-t^2/2) dt$. Recall that we take the normal $\\hat{n}$ of the magic plane through $w_{es}$ as $\\hat{n} = \\Sigma M_t(w_t - w_{es})$. For $t = 0$ there is no problem with Eq. 18 and an approximation for the \"never-starting\" probability is stated in the next proposition.\n\nProposition 4. The probability that validation error increases immediately upon beginning training (\"never starting\"), assuming the validation correlation matrix $X_v'X_v$ equals the mean correlation matrix $n_v\\Sigma$, is approximately\n\n$$1 - P_\\Pi(0) = 1 - G\\left(\\frac{\\sqrt{n_v}}{\\sigma} \\frac{(w^* - w(0))'M_t\\Sigma(w_t - w(0))}{[(w_t - w(0))'M_t\\Sigma M_t(w_t - w(0))]^{1/2}}\\right). \\qquad (19)$$ \u25a1\n\nWith similar arguments we can develop an approximation to the \"never-stopping\" probability.\n\nProposition 5. The probability that training continues indefinitely (\"never stopping\"), assuming the validation correlation matrix $X_v'X_v$ equals the mean correlation matrix $n_v\\Sigma$, is approximately\n\n$$P_\\Pi(\\infty) = G\\left(\\frac{\\sqrt{n_v}}{\\sigma} \\frac{(w^* - w_t)'M_t\\Sigma(\\pm s^*)}{\\lambda^*[(s^*)'\\Sigma s^*]^{1/2}}\\right). \\qquad (20)$$\n\nIn Eq. 20 pick $+s^*$ if $(w_t - w(0))'s^* > 0$, otherwise pick $-s^*$. \u25a1\n\nSimulations are in good agreement with the estimates given by Propositions 4 and 5.\n\n6 EXTENDING THE THEORY TO NONLINEAR SYSTEMS\n\nIt may be possible to extend the theory presented in this paper to nonlinear approximators. The elementary concepts carry over unchanged, although it will be more difficult to describe them algebraically. 
In a nonlinear early stopping problem, there will be a surface corresponding to the magic ellipsoid on which $\\nabla E_t \\perp \\nabla E_v$, but this surface may be nonconvex or not simply connected. Likewise, corresponding to the magic plane there will be a surface on which least-squares validation weights must fall, but this surface need not be flat or unbounded.\n\nIt is customary in the world of statistics to apply results derived for linear systems to nonlinear systems by assuming the number of data is very large and various regularity conditions hold. If the errors $\\varepsilon$ are additive, the least-squares weights again have a Gaussian distribution. As in the linear case, the Hessian of the total error appears as the inverse of the covariance of the least-squares weights. In this asymptotic (large data) regime, the standard results for linear regression carry over to nonlinear regression mostly unchanged. This suggests that the linear theory of early stopping will also apply to nonlinear regression models, such as sigmoidal networks, when there is much data.\n\nHowever, it should be noted that the asymptotic regression theory is purely local: it describes only what happens in the neighborhood of the least-squares weights. As the outcome of early stopping depends upon the initial weights and the trajectory taken through the weight space, a purely local theory will not suffice to analyze early stopping. Nonlinear effects such as local minima and non-quadratic basins cannot be accounted for by a linear or asymptotically linear theory, and these may play important roles in nonlinear regression problems. This may invalidate direct extrapolations of linear results to nonlinear networks, such as that given by Wang and Venkatesh [5]. 
\n\n7 ACKNOWLEDGMENTS\n\nThis research was supported by NSF Presidential Young Investigator award IRI-9058450 and grant 90-21 from the James S. McDonnell Foundation to Michael C. Mozer.\n\nReferences\n\n[1] Baldi, P., and Y. Chauvin. \"Temporal Evolution of Generalization during Learning in Linear Networks,\" Neural Computation 3, 589-603 (Winter 1991).\n\n[2] Finnoff, W., F. Hergert, and H. G. Zimmermann. \"Extended Regularization Methods for Nonconvergent Model Selection,\" in Advances in NIPS 5, S. Hanson, J. Cowan, and C. L. Giles, eds., pp 228-235. San Mateo, CA: Morgan Kaufmann Publishers. 1993.\n\n[3] von Mises, R. Mathematical Theory of Probability and Statistics. New York: Academic Press. 1964.\n\n[4] Morgan, N., and H. Bourlard. \"Generalization and Parameter Estimation in Feedforward Nets: Some Experiments,\" in Advances in NIPS 2, D. Touretzky, ed., pp 630-637. San Mateo, CA: Morgan Kaufmann. 1990.\n\n[5] Wang, C., and S. Venkatesh. \"Temporal Dynamics of Generalization in Neural Networks,\" in Advances in NIPS 7, G. Tesauro, D. Touretzky, and T. Leen, eds., pp 263-270. Cambridge, MA: MIT Press. 1995.\n\n[6] Wang, C., S. Venkatesh, and J. S. Judd. \"Optimal Stopping and Effective Machine Complexity in Learning,\" in Advances in NIPS 6, J. Cowan, G. Tesauro, and J. Alspector, eds., pp 303-310. San Francisco: Morgan Kaufmann. 1994.\n\n[7] Weigend, A., B. Huberman, and D. Rumelhart. \"Predicting the Future: A Connectionist Approach,\" Int'l J. Neural Systems 1, 193-209 (1990).\n\n", "award": [], "sourceid": 1147, "authors": [{"given_name": "Robert", "family_name": "Dodier", "institution": null}]}