{"title": "A Practical Monte Carlo Implementation of Bayesian Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 598, "page_last": 604, "abstract": null, "full_text": "A Practical Monte Carlo Implementation of Bayesian Learning\n\nCarl Edward Rasmussen\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nToronto, Ontario, M5S 1A4, Canada\n\ncarl@cs.toronto.edu\n\nAbstract\n\nA practical method for Bayesian training of feed-forward neural networks using sophisticated Monte Carlo methods is presented and evaluated. In reasonably small amounts of computer time this approach outperforms other state-of-the-art methods on 5 data-limited tasks from real-world domains.\n\n1  INTRODUCTION\n\nBayesian learning uses a prior on model parameters, combines this with information from a training set, and then integrates over the resulting posterior to make predictions. With this approach, we can use large networks without fear of overfitting, allowing us to capture more structure in the data, thus improving prediction accuracy and eliminating the tedious search (often performed using cross validation) for the model complexity that optimises the bias/variance tradeoff. In this approach the size of the model is limited only by computational considerations.\n\nThe application of Bayesian learning to neural networks was pioneered by MacKay (1992), who uses a Gaussian approximation to the posterior weight distribution. However, the Gaussian approximation is poor because of multiple modes in the posterior. Even locally around a mode the accuracy of the Gaussian approximation is questionable, especially when the model is large compared to the amount of training data. 
\n\nHere I present and test a Monte Carlo method (Neal, 1995) which avoids the Gaussian approximation. The implementation is complicated, but the user is not required to have extensive knowledge about the algorithm. Thus, the implementation represents a practical tool for learning in neural nets.\n\n1.1  THE PREDICTION TASK\nThe training data consists of $n$ examples in the form of inputs $x = \\{x^{(i)}\\}$ and corresponding outputs $y = \\{y^{(i)}\\}$, where $i = 1 \\ldots n$. For simplicity we consider only real-valued scalar outputs. The network is parametrised by weights $w$ and hyperparameters $h$ that control the distributions for weights, playing a role similar to that of conventional weight decay. Weights and hyperparameters are collectively termed $\\theta$, and the network function is written as $F_\\theta(x)$, although the function value is only indirectly dependent on the hyperparameters (through the weights).\n\nBayes' rule gives the posterior distribution for the parameters in terms of the likelihood, $p(y \\mid x, \\theta)$, and prior, $p(\\theta)$:\n\n$$p(\\theta \\mid x, y) = \\frac{p(\\theta)\\, p(y \\mid x, \\theta)}{p(y \\mid x)}$$\n\nTo minimize the expected squared error on an unseen test case with input $x^{(n+1)}$, we use the mean prediction\n\n$$\\hat{y}^{(n+1)} = \\int F_\\theta(x^{(n+1)})\\, p(\\theta \\mid x, y)\\, d\\theta \\qquad (1)$$\n\n2  MONTE CARLO SAMPLING\n\nThe following implementation is due to Neal (1995). The network weights are updated using the hybrid Monte Carlo method (Duane et al. 1987). This method combines the Metropolis algorithm with dynamical simulation. This helps to avoid the random walk behaviour of simple forms of Metropolis, which is essential if we wish to explore weight space efficiently. The hyperparameters are updated using Gibbs sampling. 
\n\n2.1  NETWORK SPECIFICATION\n\nThe networks used here are always of the same form: a single linear output unit, a single hidden layer of tanh units and a task-dependent number of input units. All layers are fully connected in a feed-forward manner (including direct connections from input to output). The output and hidden units have biases.\n\nThe network priors are specified in a hierarchical manner in terms of hyperparameters; weights of different kinds are divided into groups, each group having its own prior. The output-bias is given a zero-mean Gaussian prior with a std. dev. of $\\sigma = 1000$, so it is effectively unconstrained.\n\nThe hidden-biases are given a two layer prior: the bias $b$ is given a zero-mean Gaussian prior $b \\sim N(0, \\sigma^2)$; the value of $\\sigma$ is specified in terms of precision $\\tau = \\sigma^{-2}$, which is given a Gamma prior with mean $\\mu = 400$ (corresponding to $\\sigma = 0.05$) and shape parameter $\\alpha = 0.5$; the Gamma density is given by $p(\\tau) \\sim \\mathrm{Gamma}(\\mu, \\alpha) \\propto \\tau^{\\alpha/2 - 1} \\exp(-\\tau\\alpha / 2\\mu)$. Note that this type of prior introduces a dependency between the biases for different hidden units through the common $\\tau$. The prior for the hidden-to-output weights is identical to the prior for the hidden-biases, except that the variance of these weights under the prior is scaled down by the square root of the number of hidden units, such that the network output magnitude becomes independent of the number of hidden units. The noise variance is also given a Gamma prior with these parameters.\n\nThe input-to-hidden weights are given a three layer prior: again each weight is given a zero-mean Gaussian prior $w \\sim N(0, \\sigma^2)$; the corresponding precision for the weights out of input unit $i$ is given a Gamma prior with a mean $\\mu$ and a shape parameter $\\alpha_1 = 0.5$: $\\tau_i \\sim \\mathrm{Gamma}(\\mu, \\alpha_1)$. The mean $\\mu$ is determined on the top level by a Gamma distribution with mean 400 and shape parameter $\\alpha_0 = 1$: $\\mu_i \\sim \\mathrm{Gamma}(400, \\alpha_0)$. The direct input-to-output connections are also given this prior.\n\nThe above-mentioned 3 layer prior incorporates the idea of Automatic Relevance Determination (ARD), due to MacKay and Neal, and discussed in Neal (1995). The hyperparameters, $\\tau_i$, associated with individual inputs can adapt according to the relevance of the input; for an unimportant input, $\\tau_i$ can grow very large (governed by the top level prior), thus forcing $\\sigma_i$ and the associated weights to vanish.\n\n2.2  MONTE CARLO SPECIFICATION\n\nSampling from the posterior weight distribution is performed by iteratively updating the values of the network weights and hyperparameters. Each iteration involves two components: weight updates and hyperparameter updates. A cursory description of these steps follows.\n\n2.2.1  Weight Updates\n\nWeight updates are done using the hybrid Monte Carlo method. A fictitious dynamical system is generated by interpreting weights as positions, and augmenting the weights $w$ with momentum variables $p$. The purpose of the dynamical system is to give the weights \"inertia\" so that slow random walk behaviour can be avoided during exploration of weight space. The total energy, $H$, of the system is the sum of the kinetic energy, $K$ (a function of the momenta), and the potential energy, $E$. The potential energy is defined such that $p(w) \\propto \\exp(-E)$. 
We sample from the joint distribution for $w$ and $p$ given by $p(w, p) \\propto \\exp(-E - K)$, under which the marginal distribution for $w$ is given by the posterior. A sample of weights from the posterior can therefore be obtained by simply ignoring the momenta.\n\nSampling from the joint distribution is achieved by two steps: 1) finding new points in phase space with near-identical energies $H$ by simulating the dynamical system using a discretised approximation to Hamiltonian dynamics, and 2) changing the energy $H$ by doing Gibbs sampling for the momentum variables.\n\nHamiltonian Dynamics. Hamilton's first order differential equations for $H$ are approximated by a series of discrete first order steps (specifically by the leapfrog method). The first derivatives of the network error function enter through the derivative of the potential energy, and are computed using backpropagation. In the original version of the hybrid Monte Carlo method the final position is then accepted or rejected depending on the final energy $H^*$ (which is not necessarily equal to the initial energy $H$ because of the discretisation). Here we use a modified version that uses an average over a window of states instead. The step size of the discrete dynamics should be as large as possible while keeping the rejection rate low. The step sizes are set individually using several heuristic approximations, and scaled by an overall parameter $c$. We use $L = 200$ iterations, a window size of 20 and a step size of $c = 0.2$ for all simulations.\n\nGibbs Sampling for Momentum Variables. The momentum variables are updated using a modified version of Gibbs sampling, allowing the energy $H$ to change. A \"persistence\" of 0.95 is used; the new value of the momentum is a weighted sum of the previous value (weight 0.95) and the value obtained by Gibbs sampling (weight $(1 - 0.95^2)^{1/2}$). With this form of persistence, the momenta change approximately 20 times more slowly, thus increasing the \"inertia\" of the weights, so as to further help in avoiding random walks. Larger values of the persistence will further increase the weight inertia, but reduce the rate of exploration of $H$. The advantage of increasing the weight inertia in this way rather than by increasing $L$ is that the hyperparameters are updated at shorter intervals, allowing them to adapt to the rapidly changing weights.\n\n2.2.2  Hyperparameter Updates\n\nThe hyperparameters are updated using Gibbs sampling. The conditional distributions for the hyperparameters given the weights are of the Gamma form, for which efficient generators exist, except for the top-level hyperparameter in the case of the 3 layer priors used for the weights from the inputs; in this case the conditional distribution is more complicated and a form of rejection sampling is employed.\n\n2.3  NETWORK TRAINING AND PREDICTION\n\nThe network training consists of two levels of initialisation before sampling for networks used for prediction. At the first level of initialisation the hyperparameters (variance of the Gaussians) are kept constant at 1, allowing the weights to grow during 1000 leapfrog iterations. Neglecting this phase can cause the network to get caught for a long time in a state where weights and hyperparameters are both very small.\n\nThe scheme described above is then invoked and run for as long as desired, eventually producing networks from the posterior distribution. 
The initial 1/3 of these nets are discarded, since the algorithm may need time to reach regions of high posterior probability. Networks sampled during the remainder of the run are saved for making predictions.\n\nThe predictions are made using an average of the networks sampled from the posterior as an approximation to the integral in eq. (1). Since the output unit is linear the final prediction can be seen as coming from a huge (fully connected) ensemble net with appropriately scaled output weights. All the results reported here were for ensemble nets with 4000 hidden units. The size of the individual nets is given by the rule that we want at least as many network parameters as we have training examples (with a lower limit of 4 hidden units). We hope thereby to be well out of the underfitting region. Using even larger nets would probably not gain us much (in the face of the limited training data) and is avoided for computational reasons.\n\nAll runs used the parameter values given above. The only check that is necessary is that the rejection rate stays low, say below 5%; if not, the step size should be lowered. In all runs reported here, $c = 0.2$ was adequate. The parameters concerning the Monte Carlo method and the network priors were all selected based on intuition and on experience with toy problems. Thus no parameters need to be set by the user.\n\n3  TESTS\n\nThe performance of the algorithm was evaluated by comparing it to other state-of-the-art methods on 5 real-world regression tasks. All 5 data sets have previously been studied using a 10-way cross-validation scheme (Quinlan 1993). The task in these domains is to predict the price or performance of an object from various discrete and real-valued attributes. 
For each domain the data is split into two sets of roughly equal size, one for training and one for testing. The training data is further subdivided into full-, half-, quarter- and eighth-sized subsets, 15 subsets in total. Networks are trained on each of these partitions, and evaluated on the large common test set. On the small training sets, the average performance and one std. dev. error bars on this estimate are computed.\n\n3.1  ALGORITHMS\n\nThe Monte Carlo method was compared to four other algorithms. For the three neural network methods nets with a single hidden layer and direct input-output connections were used. The Monte Carlo method was run for 1 hour on each of the small training sets, and 2, 4 and 8 hours respectively on the larger training sets. All simulations were done on a 200 MHz MIPS R4400 processor. The Gaussian Process method is described in a companion paper (Williams & Rasmussen 1996).\n\nThe Evidence method (MacKay 1992) was used for a network with separate hyperparameters for the direct connections, the weights from individual inputs (ARD), hidden biases, and output biases. Nets were trained using a conjugate gradient method, allowing 10000 gradient evaluations (batch) before each of 6 updates of the hyperparameters. The network Hessian was computed analytically. The value of the evidence was computed without compensating for network symmetries, since this can lead to a vastly over-estimated evidence for big networks where the posterior Gaussians from different modes overlap. 
A large number of nets were trained for each task, with the number of hidden units computed from the results of previous nets by the following heuristics: The minimum and maximum numbers of hidden units among the 20% of nets with the highest evidence are found. The new architecture is picked from a Gaussian (truncated at 0) with mean (max - min)/2 and std. dev. 2 + max - min, which is thought to give a reasonable trade-off between exploration and exploitation. This procedure is run for 1 hour of cpu time or until more than 1000 nets have been trained. The final predictions are made from an ensemble of the 20% (but a maximum of 100) nets with the highest evidence.\n\nAn ensemble method using cross-validation to search over a 2-dimensional grid for the number of hidden units and the value of a single weight decay parameter has been included, as an attempt to have a thorough version of \"common practice\". The weight decay parameter takes on the values 0, 0.01, 0.04, 0.16, 0.64 and 2.56. Up to 6 sizes of nets are used, from 0 hidden units (a linear model) up to a number that gives as many weights as training examples. Each of these up to 36 networks is trained with a conjugate gradient method for 10000 epochs, and performance is monitored on a validation set containing 1/3 of the examples, selected at random. This is repeated 5 times with different random validation sets, and the architecture and weight decay that did best on average are selected. The predictions are made from an ensemble of 10 nets with this architecture, trained on the full training set. This algorithm took several hours of cpu time for the largest training sets.\n\nThe Multivariate Adaptive Regression Splines (MARS) method (Friedman 1991) was included as a non-neural network approach. 
It is possible to vary the maximum number of variables allowed to interact in the additive components of the model. It is common to allow either pairwise or full interactions. I do not have sufficient experience with MARS to make this choice. Therefore, I tried both options and reported for each partition on each domain the best performance based on the test error, so results as good as the ones reported here might not be obtainable in practice. All other parameters of MARS were left at their default values. MARS always required less than 1 minute of cpu time.\n\n[Figure 1 plots: six panels (Auto price, Cpu, House, Mpg, Servo, Geometric mean). Methods plotted: Monte Carlo, Gaussian Evidence, Backprop, MARS, Gaussian Process. Geometric means: Monte Carlo 0.283, Gaussian Evidence 0.364, Backprop 0.339, MARS 0.371, Gaussian Process 0.304.]\n\nFigure 1: Squared error on test cases for the five algorithms applied to the five problems. Errors are normalized with respect to the variance on the test cases. The x-axis gives the number of training examples; four different set sizes were used on each domain. The error bars give one std. dev. for the distribution of the mean over training sets. 
No error bar is given for the largest size, for which only a single training set was available. Some of the large error bars are cut off at the top. MARS was unable to run on the smallest partitions from the Auto price and the Servo domains; in these cases the means of the four other methods were used in the reported geometric mean for MARS.\n\nTable 1: Data Sets\n\ndomain      # training cases  # test cases  # binary inputs  # real inputs\nAuto Price                80            79                0             16\nCpu                      104           105                0              6\nHouse                    256           250                1             12\nMpg                      192           200                6              3\nServo                     88            79               10              2\n\n3.2  PERFORMANCE\n\nThe test results are presented in fig. 1. On the servo domain the Monte Carlo method is uniformly better than all other methods, although the difference should probably not always be considered statistically significant. The Monte Carlo method generally does well for the smallest training sets. Note that no single method does well on all these tasks. The Monte Carlo method is never vastly out-performed by the other methods.\n\nThe geometric mean of the performances over all 5 domains for the 4 different training set sizes is computed. Assuming a Gaussian distribution of prediction errors, the log of the error variance can (apart from normalising constants) be interpreted as the amount of information unexplained by the models. Thus, the log of the geometric means in fig. 1 gives the average information unexplained by the models. According to this measure the Monte Carlo method does best, closely followed by the Gaussian Process method. Note that MARS is the worst, even though the decision between pairwise and full interactions was made on the basis of the test errors. 
\n\n4  CONCLUSIONS\n\nI have outlined a black-box Monte Carlo implementation of Bayesian learning in neural networks, and shown that it has excellent performance. These results suggest that Monte Carlo based Bayesian methods are serious competitors for practical prediction tasks on data-limited domains.\n\nAcknowledgements\n\nI am grateful to Radford Neal for his generosity with insight and software. This research was funded by a grant to G. Hinton from the Institute for Robotics and Intelligent Systems.\n\nReferences\n\nS. Duane, A. D. Kennedy, B. J. Pendleton & D. Roweth (1987) \"Hybrid Monte Carlo\", Physics Letters B, vol. 195, pp. 216-222.\nJ. H. Friedman (1991) \"Multivariate adaptive regression splines\" (with discussion), Annals of Statistics, 19, 1-141 (March). Source: http://lib.stat.cmu.edu/general/mars3.5.\nD. J. C. MacKay (1992) \"A practical Bayesian framework for backpropagation networks\", Neural Computation, vol. 4, pp. 448-472.\nR. M. Neal (1995) Bayesian Learning for Neural Networks, PhD thesis, Dept. of Computer Science, University of Toronto, ftp: pub/radford/thesis.ps.Z from ftp.cs.toronto.edu.\nJ. R. Quinlan (1993) \"Combining instance-based and model-based learning\", Proc. ML '93 (ed. P. E. Utgoff), San Mateo: Morgan Kaufmann.\nC. K. I. Williams & C. E. Rasmussen (1996) \"Regression with Gaussian processes\", NIPS 8, editors D. Touretzky, M. Mozer and M. Hasselmo (this volume).\n", "award": [], "sourceid": 1029, "authors": [{"given_name": "Carl", "family_name": "Rasmussen", "institution": null}]}