{"title": "Statistical Theory of Overtraining - Is Cross-Validation Asymptotically Effective?", "book": "Advances in Neural Information Processing Systems", "page_first": 176, "page_last": 182, "abstract": null, "full_text": "Statistical Theory of Overtraining - Is Cross-Validation Asymptotically Effective?\n\nS. Amari, N. Murata, K.-R. M\u00fcller*\n\nDept. of Math. Engineering and Inf. Physics, University of Tokyo\n\nHongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan\n\nM. Finke\n\nInst. f. Logik, University of Karlsruhe\n\n76128 Karlsruhe, Germany\n\nH. Yang\n\nLab. f. Inf. Representation, RIKEN,\n\nWakoshi, Saitama, 351-01, Japan\n\nAbstract\n\nA statistical theory for overtraining is proposed. The analysis treats realizable stochastic neural networks, trained with Kullback-Leibler loss in the asymptotic case. It is shown that the asymptotic gain in the generalization error is small if we perform early stopping, even if we have access to the optimal stopping time. Considering cross-validation stopping, we answer the question: In what ratio should the examples be divided into training and testing sets in order to obtain the optimum performance? In the non-asymptotic region cross-validated early stopping always decreases the generalization error. Our large-scale simulations done on a CM5 are in good agreement with our analytical findings.\n\n1 Introduction\n\nIn training multilayer feed-forward neural networks, there is a folklore that the generalization error decreases in an early period of training, reaches a minimum and then increases as training goes on, while the training error decreases monotonically. Therefore, it is considered advantageous to stop training at an adequate time or to use regularizers (Hecht-Nielsen [1989], Hassoun [1995], Wang et al. [1994], Poggio and Girosi [1990], Moody [1992], LeCun et al. [1990] and others). To avoid overtraining, the following stopping rule has been proposed based on cross-validation: Divide all the available examples into two disjoint sets. One set is used for training. The other set is used for testing such that the behavior of the trained network is evaluated by using the test examples and training is stopped at the point that minimizes the testing error.\n\n*Permanent address: GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany. E-mail: Klaus@first.gmd.de\n\nThe present paper gives a mathematical analysis of the so-called overtraining phenomena to elucidate the folklore. We analyze the asymptotic case where the number t of examples is very large. Our analysis treats 1) a realizable stochastic machine, 2) Kullback-Leibler loss (negative log-likelihood loss), 3) asymptotic behavior where the number t of examples is sufficiently large (compared with the number m of parameters). We first show that asymptotically the gain in the generalization error is small even if we could find the optimal stopping time. We then answer the question: In what ratio should the examples be divided into training and testing sets in order to obtain the optimum performance? We give a definite answer to this problem. When the number m of network parameters is large, the best strategy is to use almost all t examples in the training set and to use only a fraction $1/\\sqrt{2m}$ of the examples in the testing set, e.g. when m = 100, this means that only 7% of the training patterns are to be used in the set determining the point for early stopping. 
\n\nOur analytic results were confirmed by large-scale computer simulations of three-layer continuous feedforward networks where the number m of modifiable parameters is m = 100. When t > 30m, the theory fits well with simulations, showing that cross-validation is not necessary, because the generalization error becomes worse by using test examples to obtain an adequate stopping time. For an intermediate range, where t < 30m, overtraining surely occurs and cross-validation stopping strongly improves the generalization ability.\n\n2 Stochastic feedforward networks\n\nLet us consider a stochastic network which receives input vector x and emits output vector y. The network includes a modifiable vector parameter $w = (w_1, \\ldots, w_m)$ and is denoted by N(w). The input-output relation of the network N(w) is specified by the conditional probability p(y|x; w). We assume (a) that there exists a teacher network $N(w_0)$ which generates training examples for the student N(w), and (b) that the Fisher information matrix $G_{ij}(w) = E\\left[\\frac{\\partial}{\\partial w_i} \\log p(x, y; w) \\frac{\\partial}{\\partial w_j} \\log p(x, y; w)\\right]$ exists, is non-degenerate and is smooth in w, where E denotes the expectation with respect to p(x, y; w) = q(x) p(y|x; w). The training set $D_t = \\{(x_1, y_1), \\ldots, (x_t, y_t)\\}$ consists of t independent examples generated by the distribution $p(x, y; w_0)$ of $N(w_0)$. The maximum likelihood estimator (m.l.e.) $\\hat{w}$ is the one that maximizes the likelihood of producing $D_t$, or equivalently minimizes the training error or empirical risk function\n\n$R_{\\mathrm{train}}(w) = -\\frac{1}{t} \\sum_{i=1}^{t} \\log p(x_i, y_i; w).$ (2.1)\n\nThe generalization error or risk function R(w) of network N(w) is the expectation with respect to the true distribution,\n\n$R(w) = -E_0[\\log p(x, y; w)] = H_0 + D(w_0 \\| w) = H_0 + E_0\\left[\\log \\frac{p(x, y; w_0)}{p(x, y; w)}\\right],$ (2.2)\n\nwhere $E_0$ denotes the expectation with respect to $p(x, y; w_0)$, $H_0$ is the entropy of the teacher network and $D(w_0 \\| w)$ is the Kullback-Leibler divergence from probability distribution $p(x, y; w_0)$ to p(x, y; w) or the divergence of N(w) from $N(w_0)$. Hence, minimizing R(w) is equivalent to minimizing $D(w_0 \\| w)$, and the minimum is attained at $w = w_0$. The asymptotic theory of statistics proves that the m.l.e. $\\hat{w}$ is asymptotically subject to the normal distribution with mean $w_0$ and covariance matrix $G^{-1}/t$, where $G^{-1}$ is the inverse of the Fisher information matrix G. We can expand, for example, the risk $R(w) = H_0 + \\frac{1}{2}(w - w_0)^T G(w_0)(w - w_0) + O(1/t^{3/2})$ to obtain\n\n$\\langle R_{\\mathrm{gen}}(\\hat{w}) \\rangle = H_0 + \\frac{m}{2t} + O\\left(\\frac{1}{t^2}\\right), \\quad \\langle R_{\\mathrm{train}}(\\hat{w}) \\rangle = H_0 - \\frac{m}{2t} + O\\left(\\frac{1}{t^2}\\right),$ (2.3)\n\nas asymptotic results for the training and test error (see Murata et al. [1993] and Amari and Murata [1990]). An extension of (2.3) including higher order corrections was recently obtained by M\u00fcller et al. [1995].\n\nLet us consider the gradient descent learning rule (Amari [1967], Rumelhart et al. [1986], and many others), where the parameter w(n) at the nth step is modified by\n\n$w(n + 1) = w(n) - \\epsilon \\frac{\\partial R_{\\mathrm{train}}(w(n))}{\\partial w},$ (2.4)\n\nand where $\\epsilon$ is a small positive constant. This is batch learning where all the training examples are used for each iteration of modifying w(n).$^1$ The batch process is deterministic and w(n) converges to $\\hat{w}$, provided the initial w(0) is included in its basin of attraction. For large n we can argue that w(n) approaches $\\hat{w}$ isotropically and the learning trajectory follows a linear ray towards $\\hat{w}$ (for details see Amari et al. [1995]).\n\n3 Virtual optimal stopping rule\n\nDuring learning, as the parameter w(n) approaches $\\hat{w}$, the generalization behavior of network N{w(n)} is evaluated by the sequence R(n) = R{w(n)}, n = 1, 2, ... The folklore says that R(n) decreases in an early period of learning but increases later. Therefore, there exists an optimal stopping time n at which R(n) is minimized. The stopping time $n_{\\mathrm{opt}}$ is a random variable depending on $\\hat{w}$ and the initial w(0). We now evaluate the ensemble average $\\langle R(n_{\\mathrm{opt}}) \\rangle$.\n\nThe true $w_0$ and the m.l.e. $\\hat{w}$ are in general different, and they are apart by a distance of order $1/\\sqrt{t}$. Let us compose a sphere S whose center is at $(1/2)(w_0 + \\hat{w})$ and which passes through both $w_0$ and $\\hat{w}$, as shown in Fig.1b. Its diameter is denoted by d, where $d^2 = |\\hat{w} - w_0|^2$ and\n\n$E_0[d^2] = E_0[(\\hat{w} - w_0)^T G^{-1} (\\hat{w} - w_0)] = \\frac{1}{t} \\mathrm{tr}(G^{-1} G) = \\frac{m}{t}.$ (3.1)\n\nLet A be the ray, that is, the trajectory w(n) starting at w(0), which is not in the neighborhood of $w_0$. The optimal stopping point $w^*$ that minimizes\n\n$R(n) = H_0 + \\frac{1}{2} |w(n) - w_0|^2$ (3.2)\n\nis given by the first intersection of the ray A and the sphere S.\n\nSince $w^*$ is the point on A such that $w_0 - w^*$ is orthogonal to A, it lies on the sphere S (Fig.1b). When ray A' is approaching $\\hat{w}$ from the opposite side of $w_0$ (the right-hand side in the figure), the first intersection point is $\\hat{w}$ itself. In this case, the optimal stopping never occurs until it converges to $\\hat{w}$.\n\nLet $\\theta$ be the angle between the ray A and the diameter $w_0 - \\hat{w}$ of the sphere S. We now calculate the distribution of $\\theta$ when the rays are isotropically distributed.
\n\n$^1$We can alternatively use on-line learning, studied by Amari [1967], Heskes and Kappen [1991], and recently by Barkai et al. [1994] and Solla and Saad [1995].\n\nLemma 1. When ray A is approaching $\\hat{w}$ from the side in which $w_0$ is included, the probability density of $\\theta$, $0 \\le \\theta \\le \\pi/2$, is given by\n\n$r(\\theta) = \\frac{1}{I_{m-2}} \\sin^{m-2} \\theta, \\quad \\text{where} \\quad I_m = \\int_0^{\\pi/2} \\sin^m \\theta \\, d\\theta.$ (3.3)\n\nThe detailed proof of this lemma can be found in Amari et al. [1995]. Using the density of $\\theta$ given by Eq.(3.3), we arrive at the following theorem.\n\nTheorem 1. The average generalization error at the optimal stopping point is given by\n\n$\\langle R(n_{\\mathrm{opt}}) \\rangle = H_0 + \\frac{1}{2t}\\left(m - \\frac{1}{2}\\right).$ (3.4)\n\nProof. When ray A is at angle $\\theta$, $0 \\le \\theta < \\pi/2$, the optimal stopping point $w^*$ is on the sphere S. It is easily shown that $|w^* - w_0| = d \\sin \\theta$. This is the case where A is from the same side as $w_0$ (from the left-hand side in Fig.1b), which occurs with probability 0.5, and the average of $(d \\sin \\theta)^2$ is\n\n$E_0[(d \\sin \\theta)^2] = \\frac{E_0[d^2]}{I_{m-2}} \\int_0^{\\pi/2} \\sin^2 \\theta \\, \\sin^{m-2} \\theta \\, d\\theta = \\frac{m}{t} \\frac{I_m}{I_{m-2}} = \\frac{m}{t} \\left(1 - \\frac{1}{m}\\right).$\n\nWhen $\\pi/2 \\le \\theta \\le \\pi$, that is, when A approaches $\\hat{w}$ from the opposite side, it does not stop until it reaches $\\hat{w}$, so that $|w^* - w_0|^2 = |\\hat{w} - w_0|^2 = d^2$. This occurs with probability 0.5. Averaging the two cases gives $E_0[|w^* - w_0|^2] = (m - 1/2)/t$, and substituting this into Eq.(3.2) proves the theorem.\n\nThe theorem shows that, if we could know the optimal stopping time $n_{\\mathrm{opt}}$ for each trajectory, the generalization error would decrease by $1/(4t)$, which has the effect of decreasing the effective dimension by 1/2. This effect is negligible when m is large. The optimal stopping time is of the order log t. However, it is impossible to know the optimal stopping time. 
If we stop learning at an estimated optimal time $n_{\\mathrm{opt}}$, we have a small gain when the ray A is from the same side as $w_0$ but we have some loss when ray A is from the opposite direction. This shows that the gain is even smaller if we use a common stopping time $\\bar{n}_{\\mathrm{opt}}$ independent of $\\hat{w}$ and w(0) as proposed by Wang et al. [1994]. However, the point is that there is no direct means to estimate either $n_{\\mathrm{opt}}$ or $\\bar{n}_{\\mathrm{opt}}$ other than, for example, cross-validation. Hence, we analyze cross-validation stopping in the following.\n\n4 Optimal stopping by cross-validation\n\nThe present section studies asymptotically two fundamental problems: 1) Given t examples, how many examples should be used in the training set and how many in the testing set? 2) How much gain can one expect by the above cross-validated stopping?\n\nLet us divide t examples into rt examples of the training set and r't examples of the testing set, where r + r' = 1. Let $\\hat{w}$ be the m.l.e. from rt training examples, and let $\\tilde{w}$ be the m.l.e. from the other r't testing examples. Since the training examples and testing examples are independent, $\\hat{w}$ and $\\tilde{w}$ are subject to independent normal distributions with mean $w_0$ and covariance matrices $G^{-1}/(rt)$ and $G^{-1}/(r't)$, respectively.\n\nLet us compose the triangle with vertices $w_0$, $\\hat{w}$ and $\\tilde{w}$. The trajectory A starting at w(0) enters $\\hat{w}$ linearly in the neighborhood. The point $w^*$ on the trajectory A which minimizes the testing error is the point on A that is closest to $\\tilde{w}$, since the testing error defined by\n\n$R_{\\mathrm{test}}(w) = \\frac{1}{r't} \\sum_{i} \\{-\\log p(x_i, y_i; w)\\},$ (4.1)\n\nwhere summation is taken over r't testing examples, can be expanded as\n\n$R_{\\mathrm{test}}(w) = H_0 - \\frac{1}{2} |\\tilde{w} - w_0|^2 + \\frac{1}{2} |w - \\tilde{w}|^2.$ (4.2)\n\nLet S be the sphere centered at $(\\hat{w} + \\tilde{w})/2$ and passing through both $\\hat{w}$ and $\\tilde{w}$. Its diameter is given by $d = |\\hat{w} - \\tilde{w}|$. Then, the optimal stopping point $w^*$ is given by the intersection of the trajectory A and sphere S. When the trajectory comes from the opposite side of $\\tilde{w}$, it does not intersect S until it converges to $\\hat{w}$, so that the optimal point is $w^* = \\hat{w}$ in this case. Omitting the detailed proof, the generalization error of $w^*$ is given by Eq.(??), so that we calculate the expectation\n\n$E[|w^* - w_0|^2] = \\frac{m}{tr} - \\frac{1}{2t} \\left(\\frac{1}{r} - \\frac{1}{r'}\\right).$\n\nLemma 2. The average generalization error by the optimal cross-validated stopping is\n\n$\\langle R(w^*, r) \\rangle = H_0 + \\frac{2m - 1}{4rt} + \\frac{1}{4r't}.$ (4.3)\n\nWe can then calculate the optimal division rate\n\n$r_{\\mathrm{opt}} = 1 - \\frac{\\sqrt{2m - 1} - 1}{2(m - 1)} \\quad \\text{and} \\quad r_{\\mathrm{opt}} = 1 - \\frac{1}{\\sqrt{2m}} \\quad \\text{(large m limit)}$ (4.4)\n\nof examples, which minimizes the generalization error. So for large m only $(1/\\sqrt{2m}) \\times 100\\%$ of examples should be used for testing and all others for training. For example, when m = 100, this shows that 93% of examples are to be used for training and only 7% are to be kept for testing. From Eq.(4.4) we obtain as optimal generalization error for large m\n\n$\\langle R(w^*, r_{\\mathrm{opt}}) \\rangle = H_0 + \\frac{m}{2t} \\left(1 + \\sqrt{\\frac{2}{m}}\\right).$ (4.5)\n\nThis shows that the generalization error asymptotically increases slightly by cross-validation compared with non-stopped learning, which uses all the examples for training.\n\n5 Simulations\n\nWe use standard feed-forward classifier networks with N inputs, H sigmoid hidden units and M softmax outputs (classes). The output activity $O_l$ of the lth output unit is calculated via the softmax squashing function\n\n$p(y = C_l | x; w) = O_l = \\frac{\\exp(h_l^O)}{1 + \\sum_k \\exp(h_k^O)}, \\quad l = 1, \\ldots, M,$\n\nwhere $h_l^O = \\sum_j w_{lj}^O s_j - \\vartheta_l^O$ is the local field potential. Each output $O_l$ codes the a-posteriori probability of being in class $C_l$; $O_0$ denotes a zero class for normalization purposes. The m network parameters consist of biases $\\vartheta$ and weights w. When x is input, the activity of the j-th hidden unit is\n\n$s_j = \\left[1 + \\exp\\left(-\\sum_{k=1}^{N} w_{jk}^H x_k - \\vartheta_j^H\\right)\\right]^{-1}, \\quad j = 1, \\ldots, H.$\n\nThe input layer is connected to the hidden layer via $w^H$, the hidden layer is connected to the output layer via $w^O$, but no short-cut connections are present. Although the network is completely deterministic, it is constructed to approximate class conditional probabilities (Finke and M\u00fcller [1994]).\n\nThe examples $\\{(x_1, y_1), \\ldots, (x_t, y_t)\\}$ are produced randomly, by drawing $x_i$, i = 1, ..., t, from a uniform distribution independently and producing the labels $y_i$ stochastically from the teacher classifier. Conjugate gradient learning with line-search on the empirical risk function Eq.(2.1) is applied, starting from some random initial vector. The generalization ability is measured using Eq.(2.2) on a large test set (50000 patterns). Note that we use Eq.(2.1) on the cross-validation set, because only the empirical risk is available on the cross-validation set in a practical situation. We compare the generalization error for the settings: exhaustive training (no stopping), early stopping (controlled by the cross-validation set) and optimal stopping (controlled by the large test set). The simulations were performed on a parallel computer (CM5). 
Every curve in the figures takes about 8h of computing time on a 128 or 256 partition of the CM5, i.e. we perform 128-256 parallel trials. This setting enabled us to do extensive statistics (cf. Amari et al. [1995]).\n\nFig. 1a shows the results of simulations, where N = 8, H = 8, M = 4, so that the number m of modifiable parameters is m = (N + 1)H + (H + 1)M = 108. We observe clearly that saturated learning without early stopping is the best in the asymptotic range of t > 30m, a range which is often inaccessible in practical applications due to the limited size of the data sets. Cross-validated early stopping does not improve the generalization error here, so that no overtraining is observed on the average in this range. In the asymptotic area (Fig. 1a) we observe that the smaller the percentage of the training set that is used to determine the point of early stopping, the better the generalization ability. When we use cross-validation, the optimal size of the test set is about 7% of all the examples, as the theory predicts.\n\nClearly, early stopping does improve the generalization ability to a large extent in an intermediate range for t < 30m (see M\u00fcller et al. [1995]). Note that our theory also gives a good estimate of the optimal size of the early stopping set in this intermediate range.\n\nFigure 1: (a) R(w) plotted as a function of 1/t for different sizes r' of the early stopping set for an 8-8-4 classifier network. opt. denotes the use of a very large cross-validation set (50000) and no stopping refers to the case where 100% of the training set is used for exhaustive learning. (b) Geometrical picture to determine the optimal stopping point $w^*$.\n\n6 Conclusion\n\nWe proposed an asymptotic theory for overtraining. The analysis treats realizable stochastic neural networks, trained with Kullback-Leibler loss.\n\nIt is demonstrated both theoretically and in simulations that asymptotically the gain in the generalization error is small if we perform early stopping, even if we have access to the optimal stopping time. For cross-validation stopping we showed for large m that optimally only a fraction $r'_{\\mathrm{opt}} = 1/\\sqrt{2m}$ of the examples should be used to determine the point of early stopping in order to obtain the best performance. For example, if m = 100 this corresponds to using 93% of the t training patterns for training and only 7% for testing where to stop. Yet, even if we use $r_{\\mathrm{opt}}$ for cross-validated stopping, the generalization error is always increased compared to exhaustive training. Nevertheless, note that this range is often inaccessible in practical applications due to the limited size of the data sets.\n\nIn the non-asymptotic region simulations show that cross-validated early stopping always helps to enhance the performance since it decreases the generalization error.
\nIn this intermediate range our theory also gives a good estimate of the optimal size of the early stopping set. In future we will consider higher order correction terms to extend our theory to also give a quantitative description of the non-asymptotic region.\n\nAcknowledgements: We would like to thank Y. LeCun, S. Bos and K. Schulten for valuable discussions. K.-R. M. thanks K. Schulten for warm hospitality during his stay at the Beckman Inst. in Urbana, Illinois. We acknowledge computing time on the CM5 in Urbana (NCSA) and in Bonn, supported by the National Institutes of Health (P41RRO 5969) and the EC S & T fellowship (FTJ3-004, K.-R. M.).\n\nReferences\n\nAmari, S. [1967], IEEE Trans., EC-16, 299-307.\nAmari, S., Murata, N. [1993], Neural Computation 5, 140.\nAmari, S., Murata, N., M\u00fcller, K.-R., Finke, M., Yang, H. [1995], Statistical Theory of Overtraining and Overfitting, Univ. of Tokyo Tech. Report 95-06, submitted.\nBarkai, N., Seung, H. S., Sompolinsky, H. [1994], On-line learning of dichotomies, NIPS'94.\nFinke, M. and M\u00fcller, K.-R. [1994], in Proc. of the 1993 Connectionist Models summer school, Mozer, M., Smolensky, P., Touretzky, D.S., Elman, J.L. and Weigend, A.S. (Eds.), Hillsdale, NJ: Erlbaum Associates, 324.\nHassoun, M. H. [1995], Fundamentals of Artificial Neural Networks, MIT Press.\nHecht-Nielsen, R. [1989], Neurocomputing, Addison-Wesley.\nHeskes, T. and Kappen, B. [1991], Physical Review, A44, 2718-2762.\nLeCun, Y., Denker, J.S., Solla, S. [1990], Optimal brain damage, NIPS'89.\nMoody, J. E. [1992], The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems, NIPS 4.\nMurata, N., Yoshizawa, S., Amari, S. [1994], IEEE Trans., NN5, 865-872.
\nM\u00fcller, K.-R., Finke, M., Murata, N., Schulten, K. and Amari, S. [1995], A numerical study on learning curves in stochastic multilayer feed-forward networks, Univ. of Tokyo Tech. Report METR 95-03 and Neural Computation, in press.\nPoggio, T. and Girosi, F. [1990], Science, 247, 978-982.\nRissanen, J. [1986], Ann. Statist., 14, 1080-1100.\nRumelhart, D., Hinton, G. E., Williams, R. J. [1986], in PDP, Vol.1, MIT Press.\nSaad, D., Solla, S. A. [1995], PRL, 74, 4337 and Phys. Rev. E, 52, 4225.\nWang, Ch., Venkatesh, S. S., Judd, J. S. [1994], Optimal stopping and effective machine complexity in learning, to appear (revised and extended version of NIPS'93).\n", "award": [], "sourceid": 1060, "authors": [{"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}, {"given_name": "Noboru", "family_name": "Murata", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Michael", "family_name": "Finke", "institution": null}, {"given_name": "Howard", "family_name": "Yang", "institution": null}]}