{"title": "Bayesian Averaging is Well-Temperated", "book": "Advances in Neural Information Processing Systems", "page_first": 265, "page_last": 271, "abstract": null, "full_text": "Bayesian averaging is well-temperated\n\nLars Kai Hansen\n\nDepartment of Mathematical Modelling\nTechnical University of Denmark B321\nDK-2800 Lyngby, Denmark\nlkhansen@imm.dtu.dk\n\nAbstract\n\nBayesian predictions are stochastic, just like the predictions of any other inference scheme that generalizes from a finite sample. While a simple variational argument shows that Bayes averaging is generalization optimal given that the prior matches the teacher parameter distribution, the situation is less clear if the teacher distribution is unknown. I define a class of averaging procedures, the temperated likelihoods, including both Bayes averaging with a uniform prior and maximum likelihood estimation as special cases. I show that Bayes is generalization optimal in this family for any teacher distribution for two learning problems that are analytically tractable: learning the mean of a Gaussian and asymptotics of smooth learners.\n\n1 Introduction\n\nLearning is the stochastic process of generalizing from a random finite sample of data. Often a learning problem has a natural quantitative measure of generalization. If a loss function is defined, the natural measure is the generalization error, i.e., the expected loss on a random sample independent of the training set. Generalizability is a key topic of learning theory and much progress has been reported. Analytic results for a broad class of machines can be found in the literature [8, 12, 9, 10] describing the asymptotic generalization ability of supervised algorithms that are continuously parameterized. 
Asymptotic bounds on generalization for general machines have been advocated by Vapnik [11]. Generalization results valid for finite training sets can only be obtained for specific learning machines, see e.g. [5]. A very rich framework for analysis of generalization for Bayesian averaging and other schemes is defined in [6].\n\nAveraging has become popular as a tool for improving generalizability of learning machines. In the context of (time series) forecasting, averaging has been investigated intensely for decades [3]. Neural network ensembles were shown to improve generalization by simple voting in [4] and later work has generalized these results to other types of averaging. Boosting, Bagging, Stacking, and Arcing are recent examples of averaging procedures based on data resampling that have proven useful; see [2] for a recent review with references. However, Bayesian averaging in particular is attaining a kind of cult status. Bayesian averaging is indeed provably optimal in a number of ways (admissibility, the likelihood principle, etc.) [1]. While it follows by construction that Bayes is generalization optimal if given the correct prior information, i.e., the teacher parameter distribution, the situation is less clear if the teacher distribution is unknown. Hence, the pragmatic Bayesians downplay the role of the prior. Instead the averaging aspect is emphasized and \"vague\" priors are invoked. It is important to note that whatever prior is used, Bayesian predictions are stochastic, just like the predictions of any other inference scheme that generalizes from a finite sample. 
\n\nIn this contribution I analyse two scenarios where averaging can improve generalizability and I show that the vague Bayes average is in fact optimal among the averaging schemes investigated. Averaging is shown to reduce variance at the cost of introducing bias, and Bayes happens to implement the optimal bias-variance trade-off.\n\n2 Bayes and generalization\n\nConsider a model that is smoothly parametrized and whose predictions can be described in terms of a density function (footnote 1). Predictions in the model are based on a given training set: a finite sample $D = \\{x_\\alpha\\}_{\\alpha=1}^{N}$ of the stochastic vector $x$ whose density, the teacher, is denoted $p(x|\\theta_0)$. In other words, the true density is assumed to be defined by a fixed, but unknown, teacher parameter vector $\\theta_0$. The model, denoted $H$, involves the parameter vector $\\theta$ and the predictive density is given by\n\n$$p(x|D,H) = \\int p(x|\\theta,H)\\,p(\\theta|D,H)\\,d\\theta \\tag{1}$$\n\nwhere $p(\\theta|D,H)$ is the parameter distribution produced in the training process. In a maximum likelihood scenario this distribution is a delta function centered on the most likely parameters under the model for the given data set. In ensemble averaging approaches, like boosting, bagging or stacking, the distribution is obtained by training on resampled training sets. In a Bayesian scenario, the parameter distribution is the posterior distribution,\n\n$$p(\\theta|D,H) = \\frac{p(D|\\theta,H)\\,p(\\theta|H)}{\\int p(D|\\theta',H)\\,p(\\theta'|H)\\,d\\theta'} \\tag{2}$$\n\nwhere $p(\\theta|H)$ is the prior distribution (probability density of parameters if $D$ is empty). In the sequel we will only consider one model, hence we suppress the model conditioning label $H$. 
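
As an illustration of the predictive density and the posterior of Eq. (2), the following minimal sketch approximates the posterior on a parameter grid and evaluates the predictive density by summation over that grid. It is not part of the original analysis. It assumes a Gaussian location model with known variance and a uniform prior, and the function names (normal_pdf, grid_posterior, predictive), the grid and the data values are hypothetical choices made for this example.

```python
import math


def normal_pdf(x, mean, var):
    # Density of a normal distribution with the given mean and variance.
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def grid_posterior(data, thetas, prior, var=1.0):
    # Posterior weights on a grid: likelihood times prior, normalized to sum to 1.
    log_w = []
    for theta, p in zip(thetas, prior):
        log_lik = sum(math.log(normal_pdf(x, theta, var)) for x in data)
        log_w.append(log_lik + math.log(p))
    top = max(log_w)  # subtract the maximum before exponentiating for stability
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


def predictive(x, thetas, post, var=1.0):
    # Predictive density: model density averaged over the posterior weights.
    return sum(q * normal_pdf(x, theta, var) for theta, q in zip(thetas, post))


# Example grid (assumed): theta in [-5, 5] with step 0.01 and a uniform prior.
thetas = [-5.0 + 0.01 * k for k in range(1001)]
prior = [1.0 / len(thetas)] * len(thetas)
post = grid_posterior([0.5, 1.5], thetas, prior)
```

For two observations with unit variance, the grid-based predictive density should be close to a normal density centered on the sample mean with variance 1 + 1/2, which anticipates the closed form derived for the 1D example later in the text.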
\n\nThe generalization error is the average negative log density (also known simply as the \"log loss\", and in some applied statistics works as the \"deviance\"),\n\n$$\\Gamma(D|\\theta_0) = \\int -\\log p(x|D)\\,p(x|\\theta_0)\\,dx. \\tag{3}$$\n\nThe expected value of the generalization error for training sets produced by the given teacher is given by\n\n$$\\Gamma(\\theta_0) = \\int\\!\\!\\int -\\log p(x|D)\\,p(x|\\theta_0)\\,dx\\,p(D|\\theta_0)\\,dD. \\tag{4}$$\n\nFootnote 1: This does not limit us to conventional density estimation; pattern recognition and many functional approximation problems can be formulated as density estimation problems as well.\n\nPlaying the game of \"guessing a probability distribution\" [6] we not only face a random training set, we also face a teacher drawn from the teacher distribution $p(\\theta_0)$. The teacher averaged generalization error must then be defined as\n\n$$\\Gamma = \\int \\Gamma(\\theta_0)\\,p(\\theta_0)\\,d\\theta_0. \\tag{5}$$\n\nThis is the typical generalization error for a random training set from the randomly chosen teacher, produced by the model $H$. The generalization error is minimized by Bayes averaging if the teacher distribution is used as prior. To see this, form the Lagrangian functional\n\n$$\\mathcal{L}[q(x|D)] = \\int\\!\\!\\int\\!\\!\\int -\\log q(x|D)\\,p(x|\\theta_0)\\,dx\\,p(D|\\theta_0)\\,dD\\,p(\\theta_0)\\,d\\theta_0 + \\lambda \\int q(x|D)\\,dx \\tag{6}$$\n\ndefined on positive functions $q(x|D)$. The second term is used to ensure that $q(x|D)$ is a normalized density in $x$. Now compute the variational derivative to obtain\n\n$$\\frac{\\delta \\mathcal{L}}{\\delta q(x|D)} = -\\frac{1}{q(x|D)} \\int p(x|\\theta_0)\\,p(D|\\theta_0)\\,p(\\theta_0)\\,d\\theta_0 + \\lambda. \\tag{7}$$\n\nEquating this derivative to zero we recover the predictive distribution of Bayesian averaging,\n\n$$q(x|D) = \\int p(x|\\theta)\\,\\frac{p(D|\\theta)\\,p(\\theta)}{\\int p(D|\\theta')\\,p(\\theta')\\,d\\theta'}\\,d\\theta, \\tag{8}$$\n\nwhere we used that $\\lambda = \\int p(D|\\theta)\\,p(\\theta)\\,d\\theta$ is the appropriate normalization constant. 
\nIt is easily verified that this is indeed the global minimum of the averaged generalization error. We also note that if the Bayes average is performed with another prior than the teacher distribution $p(\\theta_0)$, we can expect a higher generalization error. The important question from a Bayesian point of view is then: Are there cases where averaging with generic priors (e.g. vague or uniform priors) can be shown to be optimal?\n\n3 Temperated likelihoods\n\nTo come closer to a quantitative statement about when and why vague Bayes is the better procedure, we will analyse two problems for which some analytical progress is possible. We will consider a one-parameter family of learning procedures including both a Bayes and the maximum likelihood procedure,\n\n$$p(\\theta|\\beta,D,H) = \\frac{p^{\\beta}(D|\\theta)}{\\int p^{\\beta}(D|\\theta')\\,d\\theta'}, \\tag{9}$$\n\nwhere $\\beta$ is a positive parameter (playing the role of an inverse temperature). The procedures in the family are all averaging procedures, and $\\beta$ controls the width of the average. Vague Bayes (here used synonymously with Bayes with a uniform prior) is recovered for $\\beta = 1$, while the maximum posterior procedure is obtained by cooling to zero width, $\\beta \\to \\infty$.\n\nIn this context the generalization design question can be phrased as follows: is there an optimal temperature in the family of the temperated likelihoods?\n\n3.1 Example: 1D normal variates\n\nLet the teacher distribution be given by\n\n$$p(x|\\theta_0) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left(-\\frac{1}{2\\sigma^2}(x-\\theta_0)^2\\right). \\tag{10}$$\n\nThe model density is of the same form with $\\theta$ unknown and $\\sigma^2$ assumed to be known. For $N$ examples the posterior (with a uniform prior) is\n\n$$p(\\theta|D) = \\sqrt{\\frac{N}{2\\pi\\sigma^2}} \\exp\\left(-\\frac{N}{2\\sigma^2}(\\bar{x}-\\theta)^2\\right), \\tag{11}$$\n\nwith $\\bar{x} = \\frac{1}{N}\\sum_\\alpha x_\\alpha$. 
The temperated likelihood is obtained by raising to the $\\beta$'th power and normalizing,\n\n$$p(\\theta|D,\\beta) = \\sqrt{\\frac{\\beta N}{2\\pi\\sigma^2}} \\exp\\left(-\\frac{\\beta N}{2\\sigma^2}(\\bar{x}-\\theta)^2\\right). \\tag{12}$$\n\nThe predictive distribution is found by integrating w.r.t. $\\theta$,\n\n$$p(x|D,\\beta) = \\int p(x|\\theta)\\,p(\\theta|D,\\beta)\\,d\\theta = \\frac{1}{\\sqrt{2\\pi\\sigma_\\beta^2}} \\exp\\left(-\\frac{1}{2\\sigma_\\beta^2}(x-\\bar{x})^2\\right), \\tag{13}$$\n\nwith $\\sigma_\\beta^2 = \\sigma^2(1 + 1/(\\beta N))$. We note that this distribution is wider for all the averaging procedures than it is for maximum likelihood ($\\beta \\to \\infty$), i.e., it has lower variance. For very small $\\beta$ the predictive distribution is almost independent of the data set, hence highly biased.\n\nIt is straightforward to compute the generalization error of the predictive distribution for general $\\beta$. First we compute the generalization error for the specific training set $D$,\n\n$$\\Gamma(D,\\beta,\\theta_0) = \\int -\\log p(x|D,\\beta)\\,p(x|\\theta_0)\\,dx = \\log\\sqrt{2\\pi\\sigma_\\beta^2} + \\frac{1}{2\\sigma_\\beta^2}\\left((\\bar{x}-\\theta_0)^2 + \\sigma^2\\right). \\tag{14}$$\n\nThe average generalization error is then found by averaging w.r.t. the sampling distribution using $\\bar{x} \\sim N(\\theta_0, \\sigma^2/N)$,\n\n$$\\Gamma(\\beta) = \\int \\Gamma(D,\\beta,\\theta_0)\\,p(D|\\theta_0)\\,dD = \\log\\sqrt{2\\pi\\sigma_\\beta^2} + \\frac{\\sigma^2}{2\\sigma_\\beta^2}\\left(\\frac{1}{N} + 1\\right). \\tag{15}$$\n\nWe first note that the generalization error is independent of the teacher parameter $\\theta_0$; this happens because $\\theta$ is a \"location\" parameter. The $\\beta$-dependency of the averaged generalization error is depicted in Figure 1. Solving $\\partial\\Gamma(\\beta)/\\partial\\beta = 0$ we find that the optimal $\\beta$ solves\n\n$$\\sigma_\\beta^2 = \\sigma^2\\left(\\frac{1}{\\beta N} + 1\\right) = \\sigma^2\\left(\\frac{1}{N} + 1\\right) \\;\\Rightarrow\\; \\beta = 1. \\tag{16}$$\n\nNote that this result holds for any $N$ and is independent of the teacher parameter. The Bayes averaging at unit temperature is optimal for any given value of $\\theta_0$, hence, for any teacher distribution. We may say that the vague Bayes scheme is robust to the teacher distribution in this case. Clearly this is a much stronger optimality than the more general result proven above. 
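
The result of this example can be checked numerically. The sketch below is an illustration, not the author's code. It assumes that Eq. (14) reads 0.5 * log(2 * pi * s) + ((xbar - theta0)^2 + sigma^2) / (2 * s) and that Eq. (15) reads 0.5 * log(2 * pi * s) + sigma^2 / (2 * s) * (1/N + 1), with s = sigma^2 * (1 + 1/(beta * N)) the predictive variance. The function names, the grid of beta values, the number of trials and the seed are hypothetical choices for this example.

```python
import math
import random


def sigma_beta_sq(beta, n, sigma_sq=1.0):
    # Width of the predictive distribution: sigma^2 * (1 + 1/(beta * N)).
    return sigma_sq * (1.0 + 1.0 / (beta * n))


def gen_error_given_mean(xbar, beta, n, theta0, sigma_sq=1.0):
    # Eq. (14): generalization error for one training set with sample mean xbar.
    s = sigma_beta_sq(beta, n, sigma_sq)
    return 0.5 * math.log(2.0 * math.pi * s) + ((xbar - theta0) ** 2 + sigma_sq) / (2.0 * s)


def avg_gen_error(beta, n, sigma_sq=1.0):
    # Eq. (15): average over training sets; it does not depend on theta0.
    s = sigma_beta_sq(beta, n, sigma_sq)
    return 0.5 * math.log(2.0 * math.pi * s) + sigma_sq / (2.0 * s) * (1.0 / n + 1.0)


def monte_carlo_gen_error(beta, n, theta0, sigma_sq=1.0, trials=20000, seed=0):
    # Draw sample means from N(theta0, sigma^2 / N) and average Eq. (14).
    rng = random.Random(seed)
    sd_mean = math.sqrt(sigma_sq / n)
    total = 0.0
    for _ in range(trials):
        xbar = rng.gauss(theta0, sd_mean)
        total += gen_error_given_mean(xbar, beta, n, theta0, sigma_sq)
    return total / trials


def best_beta(n, sigma_sq=1.0):
    # Grid search over beta = 0.01, 0.02, ..., 10.00 for the minimum of Eq. (15).
    grid = [0.01 * k for k in range(1, 1001)]
    return min(grid, key=lambda b: avg_gen_error(b, n, sigma_sq))
```

Under these assumptions, best_beta should return a value at the grid point nearest 1 for any N, and the Monte Carlo average should approach the closed form of Eq. (15) whatever value of theta0 is passed in.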
\n\n3.2 Bias-variance tradeoff\n\nIt is interesting to decompose the generalization error in Eq. (15) into bias and variance components. We follow Heskes [7] and define the bias error as the generalization error of the geometric average distribution,\n\n$$B(\\beta) = \\int -\\log \\bar{p}(x)\\,p(x|\\theta_0)\\,dx, \\tag{17}$$\n\nwith\n\n$$\\bar{p}(x) = Z^{-1} \\exp\\left(\\int \\log[p(x|D)]\\,p(D|\\theta_0)\\,dD\\right). \\tag{18}$$\n\nInserting from Eq. (13), we find\n\n$$\\bar{p}(x) = \\frac{1}{\\sqrt{2\\pi\\sigma_\\beta^2}} \\exp\\left(-\\frac{1}{2\\sigma_\\beta^2}(x-\\theta_0)^2\\right). \\tag{19}$$\n\nIntegrating over the teacher distribution we find\n\n$$B(\\beta) = \\frac{1}{2}\\log 2\\pi\\sigma_\\beta^2 + \\frac{\\sigma^2}{2\\sigma_\\beta^2}. \\tag{20}$$\n\nThe variance error is given by $V(\\beta) = \\Gamma(\\beta) - B(\\beta)$,\n\n$$V(\\beta) = \\frac{\\sigma^2}{2N\\sigma_\\beta^2}. \\tag{21}$$\n\n[Figure 1 plot: generalization error (vertical axis) versus temperature (horizontal axis).]\n\nFigure 1: Bias-variance trade-off as function of the width of the temperated likelihood ensemble (temperature = $1/\\beta$) for $N = 1$. The bias is computed as the generalization error of the predictive distribution obtained from the geometric average distribution w.r.t. training set fluctuations as proposed by Heskes. The predictive distribution produced by Bayesian averaging corresponds to unit temperature (vertical line) and it achieves the minimal generalization error. Maximum-likelihood estimation for reference is recovered as the zero width/temperature limit.\n\nWe can now quantify the statements above. By averaging a bias is introduced (the predictive distribution becomes wider), which decreases the variance contribution initially, so that the generalization error, being the sum of the two, decreases. At still higher temperatures the bias becomes too strong and the generalization error starts to increase. 
The Bayes average at unit temperature is the optimal trade-off within the given family of procedures.\n\n3.3 Asymptotics for smoothly parameterized models\n\nWe now go on to show that a similar result also holds for general learning problems in the limit of large data sets. We consider a system parameterized by a finite dimensional parameter vector $\\theta$. For a given large training set and for a smooth likelihood function, the temperated likelihood is approximately Gaussian centered at the maximum posterior parameters [13], hence the normalized temperated posterior reads\n\n$$p(\\theta|\\beta,D,H) = \\left|\\frac{\\beta N A(D,\\theta_{ML})}{2\\pi}\\right|^{1/2} \\exp\\left(-\\frac{\\beta N}{2}\\,\\delta\\theta' A(D,\\theta_{ML})\\,\\delta\\theta\\right) \\tag{22}$$\n\nwhere $\\delta\\theta = \\theta - \\theta_{ML}$, with $\\theta_{ML} = \\theta_{ML}(D)$ denoting the maximum likelihood solution for the given training sample. The second derivative or Hessian matrix is given by\n\n$$A(D,\\theta) = \\frac{1}{N}\\sum_{\\alpha=1}^{N} A(x_\\alpha,\\theta), \\tag{23}$$\n\n$$A(x,\\theta) = \\frac{\\partial^2}{\\partial\\theta\\,\\partial\\theta'}\\left(-\\log p(x|\\theta)\\right). \\tag{24}$$\n\nThe predictive distribution is given by\n\n$$p(x|\\beta,D) = \\int p(x|\\theta)\\,p(\\theta|\\beta,D)\\,d\\theta. \\tag{25}$$\n\nWe write $p(x|\\theta) = \\exp(-f(x|\\theta))$ and expand $f(x|\\theta)$ around $\\theta_{ML}$ to second order to find\n\n$$p(x|\\theta) \\approx p(x|\\theta_{ML}) \\exp\\left(-a(x|\\theta_{ML})'\\,\\delta\\theta - \\frac{1}{2}\\,\\delta\\theta' A(x|\\theta_{ML})\\,\\delta\\theta\\right), \\tag{26}$$\n\nwhere $a(x|\\theta_{ML})$ denotes the gradient of $f(x|\\theta)$ at $\\theta_{ML}$. We are then in position to perform the integration over the posterior to find the normalized predictive distribution. Writing $A(D) = A(D,\\theta_{ML})$ and $A(x) = A(x|\\theta_{ML})$, the Gaussian integral gives\n\n$$p(x|\\beta,D) = p(x|\\theta_{ML}) \\sqrt{\\frac{|\\beta N A(D)|}{|\\beta N A(D) + A(x)|}} \\exp\\left(\\frac{1}{2}\\,a(x|\\theta_{ML})' \\left[\\beta N A(D) + A(x)\\right]^{-1} a(x|\\theta_{ML})\\right). \\tag{27}$$\n\nProceeding as above, we compute the generalization error\n\n$$\\Gamma(\\beta,\\theta_0) = \\int\\!\\!\\int -\\log p(x|\\beta,D)\\,p(x|\\theta_0)\\,dx\\,p(D|\\theta_0)\\,dD. \\tag{28}$$\n\nFor sufficiently smooth likelihoods, fluctuations in the maximum likelihood parameters will be asymptotically normal, see e.g. [8], and furthermore fluctuations in $A(D)$ can be neglected. This means that we can approximate\n\n$$A(x) + \\beta N A(D) \\approx (\\beta N + 1)A_0, \\qquad A_0 = \\int A(x|\\theta_0)\\,p(x|\\theta_0)\\,dx, \\tag{29}$$\n\nwhere $A_0$ is the averaged Fisher information matrix. With these approximations (valid as $N \\to \\infty$) the generalization error can be found,\n\n$$\\Gamma(\\beta,\\theta_0) \\approx \\Gamma(\\theta_0) + \\frac{d}{2}\\log\\left(1 + \\frac{1}{\\beta N}\\right) - \\frac{d}{2}\\,\\frac{1 + \\frac{1}{N}}{1 + \\beta N}, \\tag{30}$$\n\nwith $d = \\dim(\\theta)$ denoting the dimension of the parameter vector. Like in the 1D example (Eq. (15)) we find that the generalization error is asymptotically independent of the teacher parameters. It is minimized for $\\beta = 1$ and we conclude that Bayes is well-temperated in the asymptotics and that this holds for any teacher distribution. In the Bayes literature this is referred to as the prior being overwhelmed by data [1]. Decomposing the errors in bias and variance contributions we find similar results as for the 1D example: Bayes introduces the optimal bias by averaging at unit temperature.\n\n4 Discussion\n\nWe have seen two examples of Bayes averaging being optimal, in particular improving on maximum likelihood estimation. We found that averaging introduces a bias and reduces variance so that the generalization error (being the sum of bias and variance) initially decreases. Bayesian averaging at unit temperature gives the optimal width of the averaging distribution. For larger temperatures (widths) the bias is too strong and the generalization error increases. Both examples were special in the sense that they lead to generalization errors that are independent of the random teacher parameter. This is not generic, of course; rather, the generic case is that a mis-specified prior can lead to arbitrarily large learning catastrophes. 
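
The bias-variance picture summarized above can be tabulated for the 1D Gaussian example. The sketch below is an illustration, not the author's code. It assumes that the bias of Eq. (20) reads 0.5 * log(2 * pi * s) + sigma^2 / (2 * s) and that the variance of Eq. (21) reads sigma^2 / (2 * N * s), with s = sigma^2 * (1 + 1/(beta * N)) and temperature = 1/beta. The function names and the list of temperatures are hypothetical choices for this example.

```python
import math


def bias_variance(beta, n, sigma_sq=1.0):
    # Bias, Eq. (20), and variance, Eq. (21), of the temperated average.
    s = sigma_sq * (1.0 + 1.0 / (beta * n))
    bias = 0.5 * math.log(2.0 * math.pi * s) + sigma_sq / (2.0 * s)
    variance = sigma_sq / (2.0 * n * s)
    return bias, variance


def sweep(n=1, temperatures=(0.25, 0.5, 1.0, 2.0, 4.0)):
    # One row per temperature: (temperature, bias, variance, bias + variance).
    rows = []
    for t in temperatures:
        bias, variance = bias_variance(1.0 / t, n)
        rows.append((t, bias, variance, bias + variance))
    return rows


rows = sweep()
for t, bias, variance, total in rows:
    print(f'T={t:5.2f}  bias={bias:.4f}  variance={variance:.4f}  total={total:.4f}')
```

With these defaults (N = 1, unit variance) the bias column grows with temperature, the variance column shrinks, and their sum is smallest in the row for temperature 1, in line with the trade-off described in the text.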
\n\nAcknowledgments\n\nI thank the organizers of the 1999 Max Planck Institute Workshop on Statistical Physics of Neural Networks, Michael Biehl, Wolfgang Kinzel and Ido Kanter, where this work was initiated. I thank Carl Edward Rasmussen, Jan Larsen, and Manfred Opper for stimulating discussions on Bayesian averaging. This work was funded by the Danish Research Councils through the Computational Neural Network Center CONNECT and the THOR Center for Neuroinformatics.\n\nReferences\n\n[1] C.P. Robert: The Bayesian Choice - A Decision-Theoretic Motivation. Springer Texts in Statistics, Springer Verlag, New York (1994). A. O'Hagan: Bayesian Inference. Kendall's Advanced Theory of Statistics. Vol 2B. The University Press, Cambridge (1994).\n\n[2] L. Breiman: Using adaptive bagging to debias regressions. Technical Report 547, Statistics Dept. U.C. Berkeley, (1999).\n\n[3] R.T. Clemen: Combining forecasts: A review and annotated bibliography. Journal of Forecasting 5, 559 (1989).\n\n[4] L.K. Hansen and P. Salamon: Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993-1001 (1990).\n\n[5] L.K. Hansen: Stochastic Linear Learning: Exact Test and Training Error Averages. Neural Networks 6, 393-396, (1993).\n\n[6] D. Haussler and M. Opper: Mutual Information, Metric Entropy, and Cumulative Relative Entropy Risk. Annals of Statistics 25, 2451-2492 (1997).\n\n[7] T. Heskes: Bias/Variance Decomposition for Likelihood-Based Estimators. Neural Computation 10, pp 1425-1433, (1998).\n\n[8] L. Ljung: System Identification: Theory for the User. Englewood Cliffs, New Jersey: Prentice-Hall, (1987).\n\n[9] J. Moody: \"Note on Generalization, Regularization, and Architecture Selection in Nonlinear Learning Systems,\" in B.H. 
Juang, S.Y. Kung & C.A. Kamm (eds.) Proceedings of the first IEEE Workshop on Neural Networks for Signal Processing, Piscataway, New Jersey: IEEE, 1-10, (1991).\n\n[10] N. Murata, S. Yoshizawa & S. Amari: Network Information Criterion - Determining the Number of Hidden Units for an Artificial Neural Network Model. IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 865-872, 1994.\n\n[11] V. Vapnik: Estimation of Dependences Based on Empirical Data. Springer-Verlag New York (1982).\n\n[12] H. White, \"Consequences and Detection of Misspecified Nonlinear Regression Models,\" Journal of the American Statistical Association, 76(374), 419-433, (1981).\n\n[13] D.J.C. MacKay: Bayesian Interpolation, Neural Computation 4, 415-447, (1992).\n", "award": [], "sourceid": 1709, "authors": [{"given_name": "Lars", "family_name": "Hansen", "institution": null}]}