{"title": "A Realizable Learning Task which Exhibits Overfitting", "book": "Advances in Neural Information Processing Systems", "page_first": 218, "page_last": 224, "abstract": null, "full_text": "A Realizable Learning Task which Exhibits Overfitting\n\nSiegfried B\u00f6s\n\nLaboratory for Information Representation, RIKEN,\nHirosawa 2-1, Wako-shi, Saitama, 351-01, Japan\nemail: boes@zoo.riken.go.jp\n\nAbstract\n\nIn this paper we examine a perceptron learning task. The task is realizable since it is provided by another perceptron with identical architecture. Both perceptrons have nonlinear sigmoid output functions. The gain of the output function determines the level of nonlinearity of the learning task. It is observed that a high level of nonlinearity leads to overfitting. We give an explanation for this rather surprising observation and develop a method to avoid the overfitting. This method has two possible interpretations: one is learning with noise, the other is cross-validated early stopping.\n\n1 Learning Rules from Examples\n\nThe property that makes feedforward neural nets interesting for many practical applications is their ability to approximate functions that are given only by examples. Feedforward networks with at least one hidden layer of nonlinear units are able to approximate any continuous function on an N-dimensional hypercube arbitrarily well. While the existence of neural function approximators is already established, there is still a lack of knowledge about their practical realizations. Major problems that complicate a good realization, such as overfitting, also need to be better understood.\n\nIn this work we study overfitting in a one-layer perceptron model. 
The model allows a good theoretical description while already exhibiting behavior qualitatively similar to that of the multilayer perceptron.\n\nA one-layer perceptron has N input units and one output unit. Between input and output it has one layer of adjustable weights $W_i$ (i = 1, ..., N). The output z is a possibly nonlinear function of the weighted sum of inputs $x_i$, i.e.\n\n$$z = g(h), \\quad \\text{with} \\quad h = \\frac{1}{\\sqrt{N}} \\sum_{i=1}^{N} W_i x_i .$$\n\n(1)\n\nThe quality of the function approximation is measured by the difference between the correct output $z_*$ and the net's output z averaged over all possible inputs. In the supervised learning scheme one trains the network using a set of examples $x^\\mu$ ($\\mu$ = 1, ..., P), for which the correct output is known. The learning task is to minimize a certain cost function, which measures the difference between the correct output $z_*^\\mu$ and the net's output $z^\\mu$ averaged over all examples.\n\nUsing the mean squared error as a suitable measure for the difference between the outputs, we can define the training error $E_T$ and the generalization error $E_G$ as\n\n$$E_T := \\frac{1}{2P} \\sum_{\\mu=1}^{P} \\left[ z_*^\\mu - z^\\mu \\right]^2 , \\qquad E_G := \\frac{1}{2} \\left\\langle \\left[ z_*(x) - z(x) \\right]^2 \\right\\rangle_x .$$\n\n(2)\n\nThe development of both errors as a function of the number P of trained examples is given by the learning curves. Training is conventionally done by gradient descent.\n\nFor theoretical purposes it is very useful to study learning tasks which are provided by a second network, the so-called teacher network. This concept allows a more transparent definition of the difficulty of the learning task. Monitoring the training process also becomes clearer, since it is always possible to compare the student network and the teacher network directly. 
\n\nSuitable quantities for such a comparison are, in the perceptron case, the following order parameters,\n\n$$r := \\frac{W_* \\cdot W}{\\|W_*\\| \\, \\|W\\|} , \\qquad q := \\|W\\| = \\sqrt{\\sum_{i=1}^{N} (W_i)^2} .$$\n\n(3)\n\nBoth have a very transparent interpretation: r is the normalized overlap between the weight vectors of teacher and student, and q is the norm of the student's weight vector. These order parameters can also be used in multilayer learning, but their number increases with the number of all possible permutations between the hidden units of teacher and student.\n\n2 The Learning Task\nHere we concentrate on the case in which a student perceptron has to learn a mapping provided by another perceptron. We choose identical networks for teacher and student. Both have the same sigmoid output function, i.e. $g_*(h) = g(h) = \\tanh(\\gamma h)$. Tasks in which teacher and student have identical network architectures are realizable: in principle the student is able to learn the task provided by the teacher exactly. Unrealizable tasks cannot be learnt exactly; a finite error always remains.\n\nIf we use uniformly distributed random inputs x and weights W, the weighted sum h in (1) can be assumed to be Gaussian distributed. Then we can express the generalization error (2) by the order parameters (3),\n\n$$E_G = \\int Dz_1 \\int Dz_2 \\, \\frac{1}{2} \\left\\{ \\tanh[\\gamma z_1] - \\tanh\\left[ q \\left( r z_1 + \\sqrt{1 - r^2} \\, z_2 \\right) \\right] \\right\\}^2 ,$$\n\n(4)\n\nwith the Gaussian measure\n\n$$\\int Dz := \\int_{-\\infty}^{+\\infty} \\frac{dz}{\\sqrt{2\\pi}} \\exp\\left( -\\frac{z^2}{2} \\right) .$$\n\n(5)\n\nFrom equation (4) we can see how the student learns the gain $\\gamma$ of the teacher's output function: it adjusts the norm q of its weights. The gain $\\gamma$ plays an important role since it allows one to tune the function $\\tanh(\\gamma h)$ between a linear function ($\\gamma \\ll 1$) and a highly nonlinear function ($\\gamma \\gg 1$). Now we want to determine the learning curves of this task. 
\n\n3 Emergence of Overfitting\n\n3.1 Explicit Expression for the Weights\nBelow the storage capacity of the perceptron, i.e. $\\alpha = 1$, the minimum of the training error $E_T$ is zero. A zero training error implies that every example has been learnt exactly, thus\n\n$$z^\\mu = z_*^\\mu \\quad \\Longleftrightarrow \\quad h^\\mu = h_*^\\mu , \\qquad \\mu = 1, \\ldots, P .$$\n\n(6)\n\nThe weights with minimal norm that fulfill this condition are given by the pseudoinverse (see Hertz et al. 1991),\n\n$$W_i = \\sum_{\\mu,\\nu=1}^{P} h_*^\\mu \\, (C^{-1})_{\\mu\\nu} \\, x_i^\\nu .$$\n\n(7)\n\nNote that the weights are completely independent of the output function $g(h) = g_*(h)$. They are the same as in the simplest realizable case, linear perceptron learns linear perceptron.\n\n3.2 Statistical Mechanics\nThe calculation of the order parameters can be done by a method from statistical mechanics, the commonly used replica method. For details about the replica approach see Hertz et al. (1991). The solution of the continuous perceptron problem can be found in B\u00f6s et al. (1993). Since the results of the statistical mechanics calculations are exact only in the thermodynamic limit, i.e. $N \\to \\infty$, the variable $\\alpha$ is the more natural measure. It is defined as the ratio of the number of patterns P to the system size N, i.e. $\\alpha := P/N$. In the thermodynamic limit N and P are infinite, but $\\alpha$ is still finite. Normally, reasonable system sizes, such as $N \\approx 100$, are already well described by this theory.\n\nUsually one concentrates on the zero temperature limit, because this implies that the training error $E_T$ attains its absolute minimum for every number of presented examples P. The corresponding order parameters for the case, linear perceptron learns linear perceptron, are\n\n$$q = \\gamma \\sqrt{\\alpha} , \\qquad r = \\sqrt{\\alpha} .$$ 
\n\n(8)\n\nThe zero temperature limit can also be called exhaustive training, since the student net is trained until the absolute minimum of $E_T$ is reached.\n\nFor small $\\alpha$ and high gains $\\gamma$, i.e. high levels of nonlinearity, exhaustive training leads to overfitting. That means the generalization error $E_G(\\alpha)$ is not, as it should be, monotonically decreasing with $\\alpha$. One reason for overfitting is that the training follows the examples too closely. The critical gain $\\gamma_c$, which determines whether the generalization error $E_G(\\alpha)$ is an increasing or decreasing function for small values of $\\alpha$, can be determined by a linear approximation. For small $\\alpha$, both order parameters (3) are small, and the student's tanh-function in (4) can be approximated by a linear function. This simplifies equation (4) to the following expression,\n\n$$E_G(\\alpha) = E_G(0) - \\frac{\\alpha \\gamma}{2} \\left[ 2 H(\\gamma) - \\gamma \\right] , \\quad \\text{with} \\quad H(\\gamma) := \\int Dz \\, \\tanh(\\gamma z) \\, z .$$\n\n(9)\n\nSince the function $H(\\gamma)$ has an upper bound, i.e. $\\sqrt{2/\\pi}$, the critical gain is reached if $\\gamma_c = 2 H(\\gamma_c)$. The numerical solution gives $\\gamma_c = 1.3371$. If $\\gamma$ is higher, the slope of $E_G(\\alpha)$ is positive for small $\\alpha$. In the following considerations we will always use the gain $\\gamma = 5$ as an example, since this is an intermediate level of nonlinearity.\n\nFigure 1: Learning curves $E_G(\\alpha)$ for the problem, tanh-perceptron learns tanh-perceptron, for different values of the gain $\\gamma$ (legend: 100.0, 10.0, 5.0, 2.0, 1.0, 0.5; horizontal axis P/N from 0.0 to 1.0). Even in this realizable case, exhaustive training can lead to overfitting, if the gain $\\gamma$ is high enough. 
\n\n3.3 How to Understand the Emergence of Overfitting\nHere the evaluation of the generalization error as a function of the order parameters r and q is helpful. Fig. 2 shows the function $E_G(r, q)$ for r between 0 and 1 and q between 0 and $1.2\\gamma$.\n\nExhaustive training in realizable cases always follows the line $q(r) = \\gamma r$, independent of the actual output function. That means training is guided only by the training error and not by the generalization error. If the gain $\\gamma$ is higher than $\\gamma_c$, the line $E_G = E_G(0, 0)$ starts with a lower slope than $q(r) = \\gamma r$, which results in overfitting.\n\n4 How to Avoid Overfitting\nFrom Fig. 2 we can already guess that q increases too fast compared to r. Maybe the ratio between q and r is better during the training process. So we have to develop a description for the training process first.\n\n4.1 Training Process\nWe found already that the order parameters for finite temperatures (T > 0) of the statistical mechanics approach are a good description of the training process in an unrealizable learning task (B\u00f6s 1995). So we use the finite temperature order parameters also in this task. These are, again taken from the task 'linear perceptron learns linear perceptron',\n\n$$q(\\alpha, a) = \\gamma \\sqrt{ \\frac{\\alpha}{a} \\, \\frac{(1 + \\alpha) a - 2 \\alpha}{a^2 - \\alpha} } , \\qquad r(\\alpha, a) = \\sqrt{ \\frac{\\alpha}{a} \\, \\frac{a^2 - \\alpha}{(1 + \\alpha) a - 2 \\alpha} } ,$$\n\n(10)\n\nwith the temperature dependent variable\n\n$$a := 1 + [\\beta (Q - q)]^{-1} .$$\n\n(11)
\n\nFigure 2: Contour plot of $E_G(r, q)$ defined by (4), the generalization error as a function of the two order parameters (horizontal axis r from 0.0 to 1.0, vertical axis q from 0.0 to 6.0). Starting from the minimum $E_G = 0$ at (r, q) = (1, 5) the contour lines for $E_G$ = 0.1, 0.2, ..., 0.8 are given (dotted lines). The dashed line corresponds to $E_G(0, 0) = 0.42$. The solid lines are parametric curves of the order parameters (r, q) for certain training strategies. The straight line illustrates exhaustive training, the lower ones the optimal training, which will be explained in Fig. 3. Here the gain $\\gamma = 5$.\n\nThe zero temperature limit corresponds to a = 1. We will show now that the decrease of the temperature dependent parameter a from $\\infty$ to 1 describes the evolution of the order parameters during the training process. In the training process the natural parameter is the number of parallel training steps t. In each parallel training step all patterns are presented once and all weights are updated. Fig. 3 shows the evolution of the order parameters (10) as parametric curves (r, q).\n\nThe exhaustive learning curve is defined by a = 1 with the parameter $\\alpha$ (solid line). For each $\\alpha$ the training ends on this curve. The dotted lines illustrate the training process, in which a runs from infinity to 1. Simulations of the training process have shown that this theoretical curve is a good description, at least after some training steps. We will now use this description of the training process for the definition of an optimized training strategy.\n\n4.2 Optimal temperature\n\nThe optimized training strategy chooses not a = 1 or the corresponding temperature T = 0, but the value of a (i.e. temperature), which minimizes the generalization error $E_G$. 
In the lower solid curve, indicating the parametric curve (r, q), the value of a that minimizes $E_G$ is chosen for every $\\alpha$. The function $E_G(a)$ has two minima between $\\alpha$ = 0.5 and 0.7. The solid line always indicates the absolute minimum. The parametric curves corresponding to the local minima are given by the double-dashed and dash-dotted lines. Note that the optimized value a is always related to an optimized temperature through equation (11). But the parameter a is also related to the number of training steps t.\n\nFigure 3: Training process. The order parameters (10) as parametric curves (r, q) with the parameters $\\alpha$ and a (horizontal axis r from 0.0 to 1.0, vertical axis q from 0.0 to 6.0; legend: local min., abs. min., local min., simulation). The straight solid line corresponds to exhaustive learning, i.e. a = 1 (marks at $\\alpha$ = 0.1, 0.2, ..., 1.0). The dotted lines describe the training process for fixed $\\alpha$. Iterative training reduces the parameter a from $\\infty$ to 1. Examples for $\\alpha$ = 0.1, 0.2, 0.3, 0.4, 0.9, 0.99 are given. The lower solid line is an optimized learning curve. To achieve this curve the value of a is chosen which minimizes $E_G$ absolutely. Between $\\alpha \\approx 0.5$ and 0.7 the error $E_G$ has two minima; the double-dashed and dash-dotted lines indicate the second, local minimum of $E_G$. Compare with Fig. 2 to see which is the absolute and which the local minimum of $E_G$. A naive early stopping procedure always ends in the minimum with the smaller q, since it is the first minimum during the training process (see simulation indicated with error bars).\n\n4.3 Early Stopping\nFig. 3 and Fig. 
2 together indicate that an earlier stopping of the training process can avoid the overfitting. But in order to determine the stopping point one has to know the actual generalization error during the training. Cross-validation tries to provide an approximation for the real generalization error. The cross-validation error $E_{CV}$ is defined like $E_T$, see (2), on a set of examples which are not used during the training. Here we calculate the optimum using the real generalization error, given by r and q, to determine the optimal point for early stopping. It is a lower bound for training with finite cross-validation sets. Some preliminary tests have shown that already small cross-validation sets approximate the real $E_G$ quite well. Training is stopped when $E_G$ increases. The resulting curve is given by the error bars in Fig. 3. The error bars indicate the standard deviation of a simulation with N = 100 averaged over 50 trials.\n\nIn Fig. 4 the same results are shown as learning curves $E_G(\\alpha)$. There one can see clearly that the early stopping strategy avoids the overfitting.\n\nFigure 4: Learning curves corresponding to the parametric curves in Fig. 3 (horizontal axis P/N from 0.0 to 1.0, vertical axis $E_G$ from 0.0 to 0.5; legend: exh., local min., abs. min., local min., simulation). The upper solid line shows again exhaustive training. The optimized finite temperature curve is the lower solid line. From $\\alpha$ = 0.6 exhaustive and optimal training lead to identical results (see marks). The simulation for early stopping (error bars) finds the first minimum of $E_G$.\n\n5 Summary and outlook\n\nIn this paper we have shown that overfitting can also emerge in realizable learning tasks. The calculation of a critical gain and the contour lines in Fig. 2 imply that the reason for the overfitting is the nonlinearity of the problem. The network adjusts slowly to the nonlinearity of the task. We have developed a method to avoid the overfitting, which can be interpreted in two ways.\n\nTraining at a finite temperature reduces overfitting. It can be realized if one trains with noisy examples. In the other interpretation one learns without noise, but stops the training earlier. The early stopping is guided by cross-validation. It was observed that early stopping is not completely simple, since it can lead to a local minimum of the generalization error. One should be aware of this possibility before one applies early stopping.\n\nSince multilayer perceptrons are built of nonlinear perceptrons, the same effects are important for multilayer learning. A study with large scale simulations (M\u00fcller et al. 1995) has shown that overfitting occurs also in realizable multilayer learning tasks.\n\nAcknowledgments\nI would like to thank S. Amari and M. Opper for stimulating discussions, and M. Herrmann for hints concerning the presentation.\n\nReferences\nS. B\u00f6s. (1995) Avoiding overfitting by finite temperature learning and cross-validation. International Conference on Artificial Neural Networks '95 Vol.2, p.111.\nS. B\u00f6s, W. Kinzel & M. Opper. (1993) Generalization ability of perceptrons with continuous outputs. Phys. Rev. E 47:1384-1391.\nJ. Hertz, A. Krogh & R. G. Palmer. (1991) Introduction to the Theory of Neural Computation. Reading: Addison-Wesley.\nK. R. M\u00fcller, M. Finke, N. Murata, K. Schulten & S. Amari. (1995) On large scale simulations for learning curves, Neural Computation in press. 
\n\n\f", "award": [], "sourceid": 1127, "authors": [{"given_name": "Siegfried", "family_name": "B\u00f6s", "institution": null}]}