{"title": "Optimal Brain Damage", "book": "Advances in Neural Information Processing Systems", "page_first": 598, "page_last": 605, "abstract": null, "full_text": "598 \n\nLe Cun, Denker and Solla \n\nOptimal Brain Damage \n\nYann Le Cun, John S. Denker and Sara A. Solla \nAT&T Bell Laboratories, Holmdel, N. J. 07733 \n\nABSTRACT \n\nWe have used information-theoretic ideas to derive a class of practical and nearly optimal schemes for adapting the size of a neural network. By removing unimportant weights from a network, several improvements can be expected: better generalization, fewer training examples required, and improved speed of learning and/or classification. The basic idea is to use second-derivative information to make a tradeoff between network complexity and training set error. Experiments confirm the usefulness of the methods on a real-world application. \n\n1 INTRODUCTION \n\nMost successful applications of neural network learning to real-world problems have been achieved using highly structured networks of rather large size [for example (Waibel, 1989; Le Cun et al., 1990a)]. As applications become more complex, the networks will presumably become even larger and more structured. Design tools and techniques for comparing different architectures and minimizing the network size will be needed. More importantly, as the number of parameters in the systems increases, overfitting problems may arise, with devastating effects on the generalization performance. We introduce a new technique called Optimal Brain Damage (OBD) for reducing the size of a learning network by selectively deleting weights. We show that OBD can be used both as an automatic network minimization procedure and as an interactive tool to suggest better architectures. 
\n\nThe basic idea of OBD is that it is possible to take a perfectly reasonable network, delete half (or more) of the weights and wind up with a network that works just as well, or better. It can be applied in situations where a complicated problem must be solved, and the system must make optimal use of a limited amount of training data. It is known from theory (Denker et al., 1987; Baum and Haussler, 1989; Solla et al., 1990) and experience (Le Cun, 1989) that, for a fixed amount of training data, networks with too many weights do not generalize well. On the other hand, networks with too few weights will not have enough power to represent the data accurately. The best generalization is obtained by trading off the training error and the network complexity. \n\nOne technique to reach this tradeoff is to minimize a cost function composed of two terms: the ordinary training error, plus some measure of the network complexity. Several such schemes have been proposed in the statistical inference literature [see (Akaike, 1986; Rissanen, 1989; Vapnik, 1989) and references therein] as well as in the NN literature (Rumelhart, 1988; Chauvin, 1989; Hanson and Pratt, 1989; Mozer and Smolensky, 1989). \n\nVarious complexity measures have been proposed, including Vapnik-Chervonenkis dimensionality (Vapnik and Chervonenkis, 1971) and description length (Rissanen, 1989). A time-honored (albeit inexact) measure of complexity is simply the number of non-zero free parameters, which is the measure we choose to use in this paper [but see (Denker, Le Cun and Solla, 1990)]. Free parameters are used rather than connections, since in constrained networks, several connections can be controlled by a single parameter. 
\n\nIn most cases in the statistical inference literature, there is some a priori or heuristic information that dictates the order in which parameters should be deleted; for example, in a family of polynomials, a smoothness heuristic may require high-order terms to be deleted first. In a neural network, however, it is not at all obvious in which order the parameters should be deleted. \n\nA simple strategy consists in deleting parameters with small \"saliency\", i.e. those whose deletion will have the least effect on the training error. Other things being equal, small-magnitude parameters will have the least saliency, so a reasonable initial strategy is to train the network and delete small-magnitude parameters in order. After deletion, the network should be retrained. Of course this procedure can be iterated; in the limit it reduces to continuous weight-decay during training (using disproportionately rapid decay of small-magnitude parameters). In fact, several network minimization schemes have been implemented using non-proportional weight decay (Rumelhart, 1988; Chauvin, 1989; Hanson and Pratt, 1989), or \"gating coefficients\" (Mozer and Smolensky, 1989). Generalization performance has been reported to increase significantly on the somewhat small problems examined. Two drawbacks of these techniques are that they require fine-tuning of the \"pruning\" coefficients to avoid catastrophic effects, and also that the learning process is significantly slowed down. Such methods include the implicit hypothesis that the appropriate measure of network complexity is the number of parameters (or sometimes the number of units) in the network. 
\n\nOne of the main points of this paper is to move beyond the approximation that \"magnitude equals saliency\", and propose a theoretically justified saliency measure. Our technique uses the second derivative of the objective function with respect to the parameters to compute the saliencies. The method was validated using our handwritten digit recognition network trained with backpropagation (Le Cun et al., 1990b). \n\n2 OPTIMAL BRAIN DAMAGE \n\nObjective functions play a central role in this field; therefore it is more than reasonable to define the saliency of a parameter to be the change in the objective function caused by deleting that parameter. It would be prohibitively laborious to evaluate the saliency directly from this definition, i.e. by temporarily deleting each parameter and reevaluating the objective function. \n\nFortunately, it is possible to construct a local model of the error function and analytically predict the effect of perturbing the parameter vector. We approximate the objective function E by a Taylor series. A perturbation $\\delta U$ of the parameter vector will change the objective function by \n\n$$\\delta E = \\sum_i g_i \\delta u_i + \\frac{1}{2} \\sum_i h_{ii} \\delta u_i^2 + \\frac{1}{2} \\sum_{i \\neq j} h_{ij} \\delta u_i \\delta u_j + O(\\|\\delta U\\|^3) \\qquad (1)$$ \n\nHere, the $\\delta u_i$'s are the components of $\\delta U$, the $g_i$'s are the components of the gradient G of E with respect to U, and the $h_{ij}$'s are the elements of the Hessian matrix H of E with respect to U: \n\n$$g_i = \\frac{\\partial E}{\\partial u_i} \\quad \\text{and} \\quad h_{ij} = \\frac{\\partial^2 E}{\\partial u_i \\partial u_j} \\qquad (2)$$ \n\nThe goal is to find a set of parameters whose deletion will cause the least increase of E. This problem is practically insoluble in the general case. One reason is that the matrix H is enormous ($6.5 \\times 10^6$ terms for our 2600 parameter network), and is very difficult to compute. Therefore we must introduce some simplifying approximations. 
The \"diagonal\" approximation assumes that the $\\delta E$ caused by deleting several parameters is the sum of the $\\delta E$'s caused by deleting each parameter individually; cross terms are neglected, so the third term of the right hand side of equation 1 is discarded. The \"extremal\" approximation assumes that parameter deletion will be performed after training has converged. The parameter vector is then at a (local) minimum of E and the first term of the right hand side of equation 1 can be neglected. Furthermore, at a local minimum, all the $h_{ii}$'s are non-negative, so any perturbation of the parameters will cause E to increase or stay the same. Thirdly, the \"quadratic\" approximation assumes that the cost function is nearly quadratic so that the last term in the equation can be neglected. Equation 1 then reduces to \n\n$$\\delta E = \\frac{1}{2} \\sum_i h_{ii} \\delta u_i^2 \\qquad (3)$$ \n\n2.1 COMPUTING THE SECOND DERIVATIVES \n\nNow we need an efficient way of computing the diagonal second derivatives $h_{ii}$. Such a procedure was derived in (Le Cun, 1987), and was the basis of a fast back-propagation method used extensively in various applications (Becker and Le Cun, 1989; Le Cun, 1989; Le Cun et al., 1990a). The procedure is very similar to the back-propagation algorithm used for computing the first derivatives. We will only outline the procedure; details can be found in the references. \n\nWe assume the objective function is the usual mean-squared error (MSE); generalization to other additive error measures is straightforward. The following expressions apply to a single input pattern; afterward E and H must be averaged over the training set. 
The network state is computed using the standard formulae \n\n$$z_i = f(a_i) \\quad \\text{and} \\quad a_i = \\sum_j w_{ij} z_j \\qquad (4)$$ \n\nwhere $z_i$ is the state of unit i, $a_i$ its total input (weighted sum), f the squashing function and $w_{ij}$ is the connection going from unit j to unit i. In a shared-weight network like ours, a single parameter $u_k$ can control one or more connections: $w_{ij} = u_k$ for all $(i, j) \\in V_k$, where $V_k$ is a set of index pairs. By the chain rule, the diagonal terms of H are given by \n\n$$h_{kk} = \\sum_{(i,j) \\in V_k} \\frac{\\partial^2 E}{\\partial w_{ij}^2} \\qquad (5)$$ \n\nThe summand can be expanded (using the basic network equations 4) as: \n\n$$\\frac{\\partial^2 E}{\\partial w_{ij}^2} = \\frac{\\partial^2 E}{\\partial a_i^2} z_j^2 \\qquad (6)$$ \n\nThe second derivatives are back-propagated from layer to layer: \n\n$$\\frac{\\partial^2 E}{\\partial a_i^2} = f'(a_i)^2 \\sum_l w_{li}^2 \\frac{\\partial^2 E}{\\partial a_l^2} + f''(a_i) \\frac{\\partial E}{\\partial z_i} \\qquad (7)$$ \n\nWe also need the boundary condition at the output layer, specifying the second derivative of E with respect to the last-layer weighted sums: \n\n$$\\frac{\\partial^2 E}{\\partial a_i^2} = 2 f'(a_i)^2 - 2 (d_i - z_i) f''(a_i) \\qquad (8)$$ \n\nfor all units i in the output layer. \n\nAs can be seen, computing the diagonal Hessian is of the same order of complexity as computing the gradient. In some cases, the second term of the right hand side of the last two equations (involving the second derivative of f) can be neglected. This corresponds to the well-known Levenberg-Marquardt approximation, and has the interesting property of giving guaranteed positive estimates of the second derivative. \n\n2.2 THE RECIPE \n\nThe OBD procedure can be carried out as follows: \n\n1. Choose a reasonable network architecture \n2. Train the network until a reasonable solution is obtained \n3. Compute the second derivatives $h_{kk}$ for each parameter \n4. Compute the saliencies for each parameter: $s_k = h_{kk} u_k^2 / 2$ \n5. Sort the parameters by saliency and delete some low-saliency parameters \n6. 
Iterate to step 2 \n\nDeleting a parameter is defined as setting it to 0 and freezing it there. Several variants of the procedure can be devised, such as decreasing the values of the low-saliency parameters instead of simply setting them to 0, or allowing the deleted parameters to adapt again after they have been set to 0. \n\n2.3 EXPERIMENTS \n\nThe simulation results given in this section were obtained using back-propagation applied to handwritten digit recognition. The initial network was highly constrained and sparsely connected, having $10^5$ connections controlled by 2578 free parameters. It was trained on a database of segmented handwritten zip code digits and printed digits containing approximately 9300 training examples and 3350 test examples. More details can be obtained from the companion paper (Le Cun et al., 1990b). \n\n[Figure 1: two panels, (a) and (b); curve labels: Magnitude, OBD; horizontal axis: Parameters (0 to 2500).] \n\nFigure 1: (a) Objective function (in dB) versus number of parameters for OBD (lower curve) and magnitude-based parameter deletion (upper curve). (b) Predicted and actual objective function versus number of parameters. The predicted value (lower curve) is the sum of the saliencies of the deleted parameters. \n\nFigure 1a shows how the objective function increases (from right to left) as the number of remaining parameters decreases. It is clear that deleting parameters by order of saliency causes a significantly smaller increase of the objective function than deleting them according to their magnitude. 
Random deletions were also tested for the sake of comparison, but the performance was so bad that the curves cannot be shown on the same scale. \n\nFigure 1b shows how the objective function increases (from right to left) as the number of remaining parameters decreases, compared to the increase predicted by the Quadratic-Extremum-Diagonal approximation. Good agreement is obtained for up to approximately 800 deleted parameters (approximately 30% of the parameters). Beyond that point, the curves begin to split, for several reasons: the off-diagonal terms in equation 1 become disproportionately more important as the number of deleted parameters increases, and higher-than-quadratic terms become more important when larger-valued parameters are deleted. \n\n[Figure 2: two panels, (a) and (b); horizontal axis: Parameters (0 to 2500).] \n\nFigure 2: Objective function (in dB) versus number of parameters, without retraining (upper curve), and after retraining (lower curve). Curves are given for the training set (a) and the test set (b). \n\nFigure 2 shows the log-MSE on the training set and on the test set before and after retraining. The performance on the training set and on the test set (after retraining) stays almost the same when up to 1500 parameters (60% of the total) are deleted. \n\nWe have also used OBD as an interactive tool for network design and analysis. This contrasts with the usual view of weight deletion as a more-or-less automatic procedure. 
Specifically, we prepared charts depicting the saliency of the 10,000 parameters in the digit recognition network reported last year (Le Cun et al., 1990b). To our surprise, several large groups of parameters were expendable. We were able to excise the second-to-last layer, thereby reducing the number of parameters by a factor of two. The training set MSE increased by a factor of 10, and the generalization MSE increased by only 50%. The 10-category classification error on the test set actually decreased (which indicates that MSE is not the optimal objective function for this task). OBD motivated other architectural changes, as can be seen by comparing the 2600-parameter network in (Le Cun et al., 1990a) to the 10,000-parameter network in (Le Cun et al., 1990b). \n\n3 CONCLUSIONS AND OUTLOOK \n\nWe have used Optimal Brain Damage interactively to reduce the number of parameters in a practical neural network by a factor of four. We obtained an additional factor of more than two by using OBD to delete parameters automatically. The network's speed improved significantly, and its recognition accuracy increased slightly. We emphasize that the starting point was a state-of-the-art network. It would be too easy to start with a foolish network and make large improvements: a technique that can help improve an already-good network is particularly valuable. \n\nWe believe that the techniques presented here only scratch the surface of the applications where second-derivative information can and should be used. In particular, we have also been able to move beyond the approximation that \"complexity equals number of free parameters\" by using second-derivative information. 
In (Denker, Le Cun and Solla, 1990), we use it to derive an improved measure of the network's information content, or complexity. This allows us to compare network architectures on a given task, and makes contact with the notion of Minimum Description Length (MDL) (Rissanen, 1989). The main idea is that a \"simple\" network whose description needs a small number of bits is more likely to generalize correctly than a more complex network, because it presumably has extracted the essence of the data and removed the redundancy from it. \n\nAcknowledgments \n\nWe thank the US Postal Service and its contractors for providing us with the database. We also thank Rich Howard and Larry Jackel for their helpful comments and encouragements. We especially thank David Rumelhart for sharing unpublished ideas. \n\nReferences \n\nAkaike, H. (1986). Use of Statistical Models for Time Series Analysis. In Proceedings ICASSP 86, pages 3147-3155, Tokyo. IEEE. \n\nBaum, E. B. and Haussler, D. (1989). What Size Net Gives Valid Generalization? Neural Computation, 1:151-160. \n\nBecker, S. and Le Cun, Y. (1989). Improving the Convergence of Back-Propagation Learning with Second-Order Methods. In Touretzky, D., Hinton, G., and Sejnowski, T., editors, Proc. of the 1988 Connectionist Models Summer School, pages 29-37, San Mateo. Morgan Kaufmann. \n\nChauvin, Y. (1989). A Back-Propagation Algorithm with Optimal Use of Hidden Units. In Touretzky, D., editor, Neural Information Processing Systems, volume 1, Denver, 1988. Morgan Kaufmann. \n\nDenker, J., Schwartz, D., Wittner, B., Solla, S. A., Howard, R., Jackel, L., and Hopfield, J. (1987). Large Automatic Learning, Rule Extraction and Generalization. Complex Systems, 1:877-922. \n\nDenker, J. 
S., Le Cun, Y., and Solla, S. A. (1990). Optimal Brain Damage. To appear in Computer and System Sciences. \n\nHanson, S. J. and Pratt, L. Y. (1989). Some Comparisons of Constraints for Minimal Network Construction with Back-Propagation. In Touretzky, D., editor, Neural Information Processing Systems, volume 1, Denver, 1988. Morgan Kaufmann. \n\nLe Cun, Y. (1987). Modeles Connexionnistes de l'Apprentissage. PhD thesis, Universite Pierre et Marie Curie, Paris, France. \n\nLe Cun, Y. (1989). Generalization and Network Design Strategies. In Pfeifer, R., Schreter, Z., Fogelman, F., and Steels, L., editors, Connectionism in Perspective, Zurich, Switzerland. Elsevier. \n\nLe Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1990a). Handwritten Digit Recognition with a Back-Propagation Network. In Touretzky, D., editor, Neural Information Processing Systems, volume 2, Denver, 1989. Morgan Kaufmann. \n\nLe Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1990b). Back-Propagation Applied to Handwritten Zipcode Recognition. Neural Computation, 1(4). \n\nMozer, M. C. and Smolensky, P. (1989). Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment. In Touretzky, D., editor, Neural Information Processing Systems, volume 1, Denver, 1988. Morgan Kaufmann. \n\nRissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore. \n\nRumelhart, D. E. (1988). personal communication. \n\nSolla, S. A., Schwartz, D. B., Tishby, N., and Levin, E. (1990). Supervised Learning: a Theoretical Framework. 
In Touretzky, D., editor, Neural Information Processing Systems, volume 2, Denver, 1989. Morgan Kaufmann. \n\nVapnik, V. N. (1989). Inductive Principles of the Search for Empirical Dependences. In Proceedings of the second annual Workshop on Computational Learning Theory, pages 3-21. Morgan Kaufmann. \n\nVapnik, V. N. and Chervonenkis, A. Y. (1971). On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Th. Prob. and its Applications, 17(2):264-280. \n\nWaibel, A. (1989). Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks. In Touretzky, D., editor, Neural Information Processing Systems, volume 1, pages 215-223, Denver, 1988. Morgan Kaufmann. \n", "award": [], "sourceid": 250, "authors": [{"given_name": "Yann", "family_name": "LeCun", "institution": null}, {"given_name": "John", "family_name": "Denker", "institution": null}, {"given_name": "Sara", "family_name": "Solla", "institution": null}]}