{"title": "Bayesian Model Comparison and Backprop Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 839, "page_last": 846, "abstract": null, "full_text": "Bayesian Model Comparison and Backprop Nets \n\nDavid J.C.  MacKay\u00b7 \n\nComputation and  Neural Systems \n\nCalifornia Institute of Technology 139-14 \n\nPasadena CA  91125 \n\nmackayGras.phy.cam.ac.uk \n\nAbstract \n\nThe Bayesian model comparison framework  is  reviewed, and the Bayesian \nOccam's razor is explained.  This framework can be applied to feedforward \nnetworks,  making  possible  (1)  objective  comparisons  between  solutions \nusing alternative network architectures; (2)  objective choice of magnitude \nand  type of weight decay terms;  (3)  quantified estimates of the error  bars \non network  parameters and on network output.  The framework  also  gen(cid:173)\nerates a  measure of the effective number of parameters determined by the \ndata. \nThe  relationship  of  Bayesian  model  comparison  to  recent  work  on  pre(cid:173)\ndiction  of generalisation  ability  (Guyon  et  al.,  1992,  Moody,  1992)  is  dis(cid:173)\ncussed. \n\n1  BAYESIAN INFERENCE AND  OCCAM'S  RAZOR \n\nIn science, a  central task is  to develop and compare models to account for  the data \nthat  are  gathered.  Typically,  two  levels  of inference  are  involved  in  the  task  of \ndata modelling.  At  the  first  level  of inference,  we  assume  that  one  of the  models \nthat  we  invented  is  true,  and  we  fit  that  model  to  the  data.  Typically  a  model \nincludes some free  parameters; fitting  the model to the data involves inferring what \nvalues those parameters should  probably take, given the data.  This  is  repeated  for \neach  model.  The second  level  of inference  is  the  task of model  comparison.  Here, \n\n\u00b7Current address:  Darwin  College,  Cambridge CB3  9EU,  U.K. \n\n839 \n\n\f840 \n\nMacKay \n\nwe  wish  to  compare  the  models  in  the  light  of the  data,  and  assign  some  sort  of \npreference  or  ranking to the alternatives.1 \n\nFor  example,  consider  the  task  of interpolating  a  noisy  data  set.  The  data  set \ncould  be  interpolated  using  a  splines  model,  polynomials,  or  feedforward  neural \nnetworks.  At the first  level of inference,  we  find  for  each individual model  the best \nfit  interpolant  (a  process  sometimes  known  as  'learning').  At  the  second  level  of \ninference we  want to rank  the alternative models  and state for  our  particular  data \nset that, for  example,  'splines are  probably the best interpolation model', or  'if the \ninterpolant is modelled as a  polynomial, it should probably be a  cubic', or 'the best \nneural network for  this data set has eight hidden  units'. \n\nModel  comparison is  a  difficult  task  because it  is  not possible simply to choose  the \nmodel that fits  the data best:  more complex models  can  always fit  the data better, \nso  the  maximum  likelihood  model  choice  leads  us  inevitably  to implausible  over(cid:173)\nparameterised models which generalise poorly.  'Occam's razor' is  the principle that \nstates that unnecessarily  complex models should  not  be  preferred  to simpler ones. \nBayesian methods  automatically  and  quantitatively embody Occam's  razor  (Gull, \n1988,  Jeffreys,  1939),  without  the  introduction  of ad  hoc  penalty  terms.  Complex \nmodels  are  automatically self-penalising under Bayes' rule. \n\nLet  us  write  down Bayes' rule  for  the two levels of inference  described  above .  As(cid:173)\nsume  each  model  1ii  has  a  vector  of parameters  w.  A  model  is  defined  by  its \nfunctional  form  and  two  probability  distributions:  a  'prior'  distribution  P(WI1ii) \nwhich states what values the model's parameters might plausibly take; and the pre(cid:173)\ndictions P(Dlw, 1ii) that the model  makes about  the data D  when  its parameters \nhave a  particular  value  w.  Note  that  models  with  the same  parameterisation  but \ndifferent priors over the parameters are therefore defined  to be different models. \n1.  Model fitting.  At  the  first  level of inference,  we  assume  that  one  model 1ii \nis  true,  and  we  infer  what  the  model's  parameters  w  might  be  given  the  data  D. \nUsing  Bayes' rule,  the posterior probability of the parameters w  is: \n\n(1) \n\nIn words: \n\n. \n\nPostenor = \n\nLikelihood  X  Prior \n\n. d \n\nEVI  ence \n\nIt is  common to use gradient-based methods to find  the maximum of the  posterior, \nwhich defines  the most  probable value for  the parameters,  W MP ;  it is then  common \nto  summarise  the  posterior  distribution  by  the  value  of W MP ,  and  error  bars  on \nthese  best  fit  parameters.  The  error  bars  are  obtained  from  the  curvature of the \nposterior; writing the Hessian A =  -\\7\\7 log P(wID, 1ii) and Taylor-expanding the \nlog  posterior  with  ~w = w  - W MP , \n\n(2) \n\n1 Note that  both levels of inference are distinct from  decision  theory.  The goal  of infer(cid:173)\n\nence is,  given  a  defined  hypothesis space and  a  particular data set,  to  assign  probabilities \nto  hypotheses.  Decision  theory  chooses  between  alternative actions  on  the basis  of these \nprobabilities so  as to minimise  the expectation of a  'loss function'. \n\n\fBayesian  Model Comparison and Backprop Nets \n\n841 \n\nw \n\nFigure  1:  The Occam factor \n\nThis figure shows the quantities that determine the Occam factor for  a hypothesis 1ii hav(cid:173)\nin\u00a7 a single parameter w.  The prior distribution (dotted line) for the parameter has width \nIl.  w.  The posterior distribution  (solid  line)  has a single peak at WMP  with  characteristic \nwidth Il.w.  The Occam  factor  is  :b~. \n\nwe see that the posterior can be locally approximated as  a gaussian with covariance \nmatrix (error  bars)  A -1. \n\n2.  Model  comparison.  At the  second  level  of inference,  we  wish  to infer  which \nmodel is  most plausible given  the data.  The posterior probability of each model is: \n(3) \nNotice  that  the  objective  data-dependent  term  P(DI1id  is  the  evidence  for  1ii, \nwhich  appeared  as  the  normalising  constant in  (1).  The second  term, P(1ii ),  is  a \n'subjective' prior over our  hypothesis space.  Assuming  that we  have no  reason  to \nassign  strongly  differing  priors  P(1ii)  to  the  alternative  models,  models  1ii  are \nranked  by evaluating the evidence. \n\nP(1ii ID) ex  P(DI1ii )P(1ii ) \n\nThis  concept  is  very  general:  the  evidence  can  be  evaluated  for  parametric  and \n'non-parametric'  models  alike;  whether  our  data  modelling  task  is  a  regression \nproblem,  a  classification  problem,  or  a  density  estimation  problem,  the  evidence \nis  the  Bayesian's  transportable  quantity  for  comparing  alternative  models.  In  all \nthese cases the evidence naturally embodies Occam's razor,  as we will  now see.  The \nevidence is the normalising constant for  equation  (1): \n\nP(D l1ii) = J P(Dlw, 1ii)P(wl1id dw \n\n(4) \n\nFor  many  problems,  including  interpolation,  it  is  common  for  the  posterior \nP(wID,1ii)  ex  P(Dlw,1ij)P(wl1ii)  to  have  a  strong  peak  at  the  most  probable \nparameters W MP  (figure  1).  Then the evidence  can  be approximated by the height \nof the peak of the integrand P(Dlw, 1ii)P(wl1ii) times its width,  ~w: \n\nEvidence \n\n-\n-\n\nP(D IwMP ' 1ii) \n, \nBest fit  likelihood \n\ny \n\n~ \n\nP(wMPI1ii) ~w \n, \nOccam factor \n\ny \n\n~ \n\n(5) \n\nThus  the  evidence  is  found  by  taking  the  best  fit  likelihood  that  the  model  can \nachieve and  multiplying it by an  'Occam factor'  (Gull,  1988), which is  a  term with \nmagnitude less  than one that  penalises 1ii for  having the parameter w. \n\n\f842 \n\nMacKay \n\nInterpretation of the Occam factor \n\nThe  quantity  ~ w  is  the  posterior  uncertainty  in  w.  Imagine  for  simplicity  that \nthe  prior  P(wlllj)  is  uniform  on  some  large  interval  ~ow (figure  1),  so  that \nP(wMPllli) = AJW; then \n\n~w \nOccam factor = ~ ow' \n\ni.e. the ratio of the posterior accessible volume oflli's parameter space to \nthe prior accessible  volume (Gull,  1988, Jeffreys,  1939).  The log  of the Occam \nfactor can be interpreted as  the amount of information we  gain abou t  the model lli \nwhen  the data arrive. \n\nTypically, a  complex or flexible  model with many parameters, each of which  is  free \nto vary over  a  large  range  ~ ow,  will  be penalised  with  a  larger Occam factor  than \na  simpler  model.  The Occam factor  also  penalises  models  which  have  to  be finely \ntuned  to  fit  the  data.  Which  model  achieves  the  greatest  evidence  is  determined \nby a  trade-off between minimising this natural complexity measure and minimising \nthe  data misfit. \n\nOccam factor for several parameters \n\nIf w  is  k-dimensional,  and if the posterior is  well  approximated by a  gaussian,  the \nOccam factor  is  given  by the determinant of the gaussian's covariance matrix: \n\nwhere  A = - VV log P(wID, lli), the Hessian which we  already evaluated when we \ncalculated  the  error  bars on  W MP .  As  the  amount  of data  collected,  N, increases, \nthis gaussian approximation is expected to become increasingly accurate on account \nof the central limit  theorem. \n\nThus Bayesian model selection is  a simple extension of maximum likelihood  model \nselection:  the evidence  is  obtained  by multiplying  the  best fit  likelihood \nby the Occam factor.  To evaluate the Occam factor  all we  need is  the  Hessian \nA,  if  the  gaussian  approximation  is  good.  Thus  the  Bayesian  method  of model \ncomparison by evaluating the evidence is computationally no more demanding than \nthe  task of finding for  each model  the best fit  parameters and their error  bars. \n\n2  THE EVIDENCE FOR NEURAL NETWORKS \n\nNeural  network  learning  procedures  include  a  host  of control  parameters  such  as \nthe number of hidden  units and weight  decay  rates.  These parameters are  difficult \nto  set  because  there  is  an  Occam's  razor  problem:  if we  just  set  the  parameters \nso  as  to  minimise  the  error  on  the  training set,  we  would  be  led  to  over-complex \nand  under-regularised  models  which  over-fit  the  data.  Figure  2a  illustrates  this \nproblem  by showing the  test  error versus the training error of a  hundred networks \nof varying complexity all  trained on the same interpolation  problem. \n\n\fBayesian  Model Comparison and  Backprop Nets \n\n843 \n\nOf course if we  had unlimited  resources,  we  could  compare these networks by mea(cid:173)\nsuring  the  error  on  an  unseen  test  set  or  by  similar  cross-validation  techniques. \nHowever  these  techniques may require  us to devote  a  large  amount of data to the \ntest  set,  and  may  be  computationally  demanding.  If there  are  several  parameters \nlike  weight  decay rates, it is preferable if they can be optimised on line. \n\nUsing the Bayesian framework, it is possible for  all our data to have a say in both the \nmodel fitting and the model comparison levels of inference.  We can rank alternative \nneural network solutions by evaluating the  'evidence'.  Weight  decay rates can  also \nbe optimised by finding  the  'most  probable'  weight  decay rate.  Alternative weight \ndecay  schemes  can  be  compared  using  the  evidence.  The  evidence  also  makes  it \npossible  to compare  neural  network  solutions  with  other interpolation  models,  for \nexample,  splines  or  radial  basis  functions,  and  to  choose  control  parameters such \nas  spline  order  or  RBF  kernel.  The  framework  can  be  applied  to  classification \nnetworks  as  well  as  the  interpolation  networks  discussed  here.  For  details  of the \ntheoretical  framework  (which  is  due  to  Gull  and  Skilling  (1989\u00bb  and  for  more \ncomplete discussion  and bibliography,  MacKay (1991)  should  be consulted. \n\n2.1  THE PROBABILISTIC INTERPRETATION \nFitting a  backprop network  to  a  data set  D  = {x, t}  often  involves  minimising  an \nobjective function  M(w) = {3ED(W; D) + aEw(w).  The first  term  is  the  data er(cid:173)\nror,  for  example  ED  = L !(Y -\nt)2,  and  the second  term  is  a  regulariser  (weight \ndecay  term),  for  example  Ew  = L !w~ .  (There  may  be several  regularisers  with \nindependent  constants  {a c}.  The  Bayesian  framework  also  covers  that  case.)  A \nmodel 1\u00a3  has  three components  {A,N, 'R}:  The  architecture  A  specifies  the func(cid:173)\ntional  dependence  of the  input-output  mapping  on  the  network's  parameters  w. \nThe  noise  model  N  specifies  the  functional  form  of the  data  error.  Within  the \nprobabilistic  interpretation (Tishby  et  ai., 1989),  the data error  is  viewed  as  relat(cid:173)\ning to a  likelihood,  P(Dlw,{3,A,N) = exp(-{3ED )/ZD.  For example,  a  quadratic \nED  corresponds  to the assumption that the distribution of errors between the data \nand  the  true  interpolant  is  Gaussian,  with  variance  u~ = 1/ {3.  Lastly,  the  regu(cid:173)\nlariser  'R,  with  associated  regularisation  constant  a, is  interpreted  as specifying  a \nprior on  the parameters w,  P(wla,A, 'R) = exp( -aEw).  For example, the  use  of \na  plain  quadratic  regulariser  corresponds  to  a  Gaussian  prior  distribution  for  the \nparameters. \n\nGiven this probabilistic interpretation, interpolation with neural networks can then \nbe decomposed  into three levels of inference: \n{3  1\u00a3.) = P(Dlw, {3, 1\u00a3j)P(wla, 1\u00a3i) \nP(  ID \n\nFitting a  regularised  model \n\n1 \n\n2a  Setting  regularisation  con(cid:173)\nstants  and  estimating  nOIse \nlevel \n\n2  Model  comparison \n\n, \n\n, a  \n\na \n\n,a,  , \n\nw \nP(a  {3ID  1\u00a3-) = P(Dla,{3,1\u00a3i)P(a, {311\u00a3i) \n\nP(Dla,{3,1id \n\nP(DI1ij) \n\nBoth  levels  2a  and  2  require  Occam's  razor.  For  both  levels  the  key  step  is  to \nevaluate  the  evidence  P(Dla,{3,1\u00a3),  which,  within  the  quadratic  approximation \n\n\f844 \n\nMacKay \n\neo \n\n.. \nI  ... \nI \n... \n\nuo \n\n... \n..  . 'i'~ \u2022 \n... \nJ  ... \ni \nJ \n... \n.. \n\ntao \n\no \n\n. .  \n\n. . .. ,~. ' .. \n. . '\\ .. :, \n\n, \n\u2022  I \n\nb)  JOe \n\n... \n\n... \n\n\u2022 \n\n\u2022 \n\n_.-\n... \n\n.. \n\nIII \n\nFigure 2:  The evidence solves  the neural networks'  Occam problem \n\na)  Test  error  vs.  data error.  Each  point  represents  the  performance of a  single  trained \nneural network on  the training set and on the test set.  This graph illustrates the fact  that \nthe best generalisation  is  not achieved  by the models which fit  the training data best. \nb)  Log  Evidence vs.  test error. \n\nX!. = 2aEw = 'Y \n\nX~ = 2,BED  = N  - 'Y, \n\naround w MP ,  is given by: \nlog P(Dla,,B, 1-\u00a3) =  -aEW -,BE~P -210g det A-log Zw(a)-log ZD (,B) + '2 log 27r. \n(1) \nAt level  2a we  can find  the most  probable value  for  the regularisation  constant  a \nand noise  level  1/,B  by differentiating (1)  with respect  to a  and ,B.  The result is \n\nk \n\n1 \n\n(8) \nwhere'Y is  'the effective number of parameters determined by the data' (Gull,  1989), \n\nand \n\n'Y = k - aTraceA:  = ~ A \n\n-1 \n\nAa \n\n\" \na=1  a + a \n\nk \n\n' \n\n(9) \n\nwhere  Aa  are  the  eigenvalues of 'V''V' ,BED  in  the  natural  basis  of Ew.  Each  term \nin  the  sum  is  a  number  between  0  and  1  which  measures  how  well  one  parame(cid:173)\nter  is  determined  by  the  data  rather  than  by  the  prior.  The  expressions  (8),  or \napproximations to them, can be used  to re-estimate weight decay rates on line. \nThe  central  quantity  in  the  evidence  and  in  'Y  is  the  inverse  hessian  kl, which \ndescribes  the  error  bars on  the  parameters w.  From this  we  can  also  obtain error \nbars on the outputs of a network (Denker and Le  Cun, 1991, MacKay, 1991).  These \nerror  bars  are  closely  related  to  the  predicted  generalisation  error  calculated  by \nLevin  et  al.(1989).  In  (MacKay,  1991)  the  practical  utility  of these  error  bars is \ndemonstrated for  both regression and classification networks. \n\nFigure 2b shows the Bayesian 'evidence' for each of the solutions in figure 2a against \nthe  test  error.  It can  be  seen  that  the  correlation  between  the evidence  and  the \ntest  error  is  extremely  good.  This  good  correlation  depends  on  the  model  being \nwell-matched to the problem;  when an  inconsistent weight decay scheme  was  used \n(forcing all weights to decay at the same rate), it was found  that the correlation be(cid:173)\ntween the evidence and the test error was much poorer.  Such comparisons between \nBayesian and traditional methods are powerful tools  for  human learning. \n\n\fBayesian  Model Comparison and  Backprop Nets \n\n845 \n\n3  RELATION TO  THEORIES  OF  GENERALISATION \n\nThe Bayesian 'evidence' framework assesses within a  well-defined hypothesis space \nhow  probable  a  set  of alternative  models  are.  However,  what  we  really  want  to \nknow is  how  well  each model  is expected to generalise.  Empirically, the correlation \nbetween the evidence and generalisation error is surprisingly good.  But a theoretical \nconnection  linking  the  two  is  not  yet established.  Here,  a  brief discussion  is  given \nof similarities and differences  between the evidence and quantities arising in  recent \nwork on prediction of generalisation error. \n\n3.1  RELATION TO  MOODY'S 'G.P.E.' \n\nMoody's  (1992)  'Generalised  Prediction  Error'  is  a  generalisation  of  Akaike's \n'F .P.E.' to non-linear regularised models.  The F .P.E.  is  an estimator of generalisa(cid:173)\ntion error which  can be derived without making assumptions about the distribution \nof errors  between  the  data and  true  interpolant,  and  without  assuming  a  known \nclass  to  which  the  true  interpolant  belongs.  The  difference  between  F .P.E.  and \nG.P.E. is  that the total number of parameters k in F.P.E. is  replaced by an effective \nnumber  of parameters,  which  is  in  fact  identical  to  the  quantity  -y  arising  in  the \nBayesian analysis (9).  If ED  is  as  defined  earlier, \n\nG.P.E. =  (ED + u~-y) IN. \n\n(10) \n\nLike  the log  evidence,  the  G .P.E.  has  the form  of the data error  plus  a  term  that \npenalises complexity.  However, although the same quantity -y  arises in  the Bayesian \nanalysis, the Bayesian Occam factor does not have the same scaling behaviour as the \nG.P.E. term (see discussion below).  And empirically, the G.P.E. is not always a good \npredictor of generalisation.  The reason for this is that in the derivation of the G.P.E., \nit is  assumed  that the distribution  over x  values is  well  approximated  by a  sum of \ndelta functions at the samples in the training set.  This is equivalent to assuming test \nsamples will be drawn only at the x locations at which we have already received data. \nThis assumption breaks down for  over-parameterised and over-flexible models.  An \nadditional distinction  that between  the G.P.E. and the evidence framework  is  that \nthe  G.P.E.  is  defined  for  regression  problems only;  the  evidence  can  be evaluated \nfor  regression,  classification  and density estimation models. \n\n3.2  RELATION TO THE EFFECTIVE V-C  DIMENSION \n\nRecent work on 'structural risk minimisation' (Guyon  et  al.,  1992) utilises empirical \nexpressions of the form: \n\nE \n\nlog(NI-y) + C2 \n\nN l-y \n\n(11) \n\n+ Cl \n\n,....,  E  IN \n\ngen  -\n\nD \n\nwhere -y  is  the 'effective V-C dimension' of the model,  and is  identical to the quan(cid:173)\ntity  arising  in  (9).  The  constants  Cl  and  C2  are  determined  by experiment.  The \nstructural risk  theory is  currently intended to be applied only to nested families  of \nclassification models (hence the abscence of (3:  ED  is dimensionless) with monotonic \neffective  V-C  dimension,  whereas  the  evidence  can  be  evaluated  for  any  models. \nHowever, it  is  very interesting that the scaling  behaviour of this expression  (11)  is \n\n\f846 \n\nMacKay \n\nidentical to  the scaling  behaviour of the  log evidence  (1),  subject  to  the following \nassumptions.  Assume  that  the  value  of the  regularisation  constant  satisfies  (8). \nAssume  furthermore  that  the significant  eigenvalues  (Aa  > a) scale  as  Aa  -- Na/'Y \n(It  can  be confirmed  that  this scaling is  obtained for  example  in  the  interpolation \nmodels  consisting  of a  sequence  of steps  of independent  heights,  as  we  vary  the \nnumber of steps.)  Then it can be shown  that the scaling of the log  evidence  is: \n\n1 \n\n-log P(Dla,,B, 1i) '\" ,BE~P + 2 ('Y log(N /'Y) + 'Y) \n\n(12) \n(Readers  familiar  with  MDL  will  recognise  the  dominant 1 log N  term;  MDL  and \nBayes are identical.)  Thus the scaling  behaviour of the  lo~ evidence  is  identical to \nthe  structural  risk  minimisation  expression  (11),  if Cl  = - and  C2  = 1.  I.  Guyon \n(personal communication) has confirmed that the empirically determined values for \nCl  and C2  are indeed close to these Bayesian values.  It will  be interesting to try and \nunderstand  and  develop this  relationship. \n\nAcknowledgements \n\nThis work  was supported  by studentships from Caltech and  SERC, UK. \n\nReferences \n\nJ.S. Denker and Y.  Le  Cun (1991).  'Transforming neural-net output levels  to prob(cid:173)\nability  distributions',  in  Advances  in  neural information  processing  systems  3,  ed. \nR.P.  Lippmann  et  al.,  853-859,  Morgan  Kaufmann. \n'Bayesian  inductive  inference  and  maximum entropy',  in  Maxi(cid:173)\nS.F.  Gull  (1988). \nmum  Entropy  and  Bayesian  Methods  in  science  and  engineering,  vol.  1:  Founda(cid:173)\ntions,  G.J.  Erickson and  C.R.  Smith,  eds.,  Kluwer. \nS.F. Gull  (1989).  'Developments in  Maximum entropy data analysis',  in J. Skilling, \ned., 53-11. \nI. Guyon, V.N.  Vapnik, B.E.  Boser, L.Y.  Bottou and S.A.  Solla (1992).  'Structural \nrisk  minimization for  character recognition',  this  volume. \nH.  Jeffreys  (1939).  Theory  of Probability, Oxford  Univ.  Press. \nE.  Levin,  N.  Tishby  and  S.  Solla  (1989). \ngeneralization in  layered neural  networks', in  COLT  '89:  2nd workshop  on  compu(cid:173)\ntational learning  theory,  245-260. \nD.J.C. MacKay (1991)  'Bayesian Methods for  Adaptive Models',  Ph.D. Thesis, Cal(cid:173)\ntech.  Also  'Bayesian interpolation',  'A  practical  Bayesian framework  for  backprop \nnetworks',  'Information-based objective functions for  active data selection',  to ap(cid:173)\npear in  Neural  computation.  And  'The evidence framework  applied to classification \nnetworks', submitted to  Neural computation. \nJ .E.  Moody  (1992). \nnonlinear  learning systems', this volume. \nN.  Tishby,  E.  Levin  and S.A.  Solla  (1989).  'Consistent inference of probabilities  in \nlayered networks:  predictions  and generalization',  in  Proc.  [lCNN,  Washington. \n\n'Generalization,  regularization  and  architecture  selection  in \n\n'A  statistical  approach  to  learning  and \n\n\f", "award": [], "sourceid": 488, "authors": [{"given_name": "David", "family_name": "MacKay", "institution": null}]}