{"title": "Boosting Algorithms as Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 512, "page_last": 518, "abstract": null, "full_text": "Boosting Algorithms as Gradient Descent\n\nLlew Mason\nResearch School of Information Sciences and Engineering\nAustralian National University\nCanberra, ACT, 0200, Australia\nlmason@syseng.anu.edu.au\n\nJonathan Baxter\nResearch School of Information Sciences and Engineering\nAustralian National University\nCanberra, ACT, 0200, Australia\nJonathan.Baxter@anu.edu.au\n\nPeter Bartlett\nResearch School of Information Sciences and Engineering\nAustralian National University\nCanberra, ACT, 0200, Australia\nPeter.Bartlett@anu.edu.au\n\nMarcus Frean\nDepartment of Computer Science and Electrical Engineering\nThe University of Queensland\nBrisbane, QLD, 4072, Australia\nmarcusf@elec.uq.edu.au\n\nAbstract\n\nWe provide an abstract characterization of boosting algorithms as gradient descent on cost-functionals in an inner-product function space. We prove convergence of these functional-gradient-descent algorithms under quite weak conditions. Following previous theoretical results bounding the generalization performance of convex combinations of classifiers in terms of general cost functions of the margin, we present a new algorithm (DOOM II) for performing a gradient descent optimization of such cost functions. Experiments on several data sets from the UC Irvine repository demonstrate that DOOM II generally outperforms AdaBoost, especially in high noise situations, and that the overfitting behaviour of AdaBoost is predicted by our cost functions.
\n\n1 Introduction\n\nThere has been considerable interest recently in voting methods for pattern classification, which predict the label of a particular example using a weighted vote over a set of base classifiers [10, 2, 6, 9, 16, 5, 3, 19, 12, 17, 7, 11, 8]. Recent theoretical results suggest that the effectiveness of these algorithms is due to their tendency to produce large margin classifiers [1, 18]. Loosely speaking, if a combination of classifiers correctly classifies most of the training data with a large margin, then its error probability is small.\n\nIn [14] we gave improved upper bounds on the misclassification probability of a combined classifier in terms of the average over the training data of a certain cost function of the margins. That paper also described DOOM, an algorithm for directly minimizing the margin cost function by adjusting the weights associated with each base classifier (the base classifiers are supplied to DOOM). DOOM exhibits performance improvements over AdaBoost, even when using the same base hypotheses, which provides additional empirical evidence that these margin cost functions are appropriate quantities to optimize.\n\nIn this paper, we present a general class of algorithms (called AnyBoost) which are gradient descent algorithms for choosing linear combinations of elements of an inner product function space so as to minimize some cost functional. The normal operation of a weak learner is shown to be equivalent to maximizing a certain inner product. We prove convergence of AnyBoost under weak conditions. In Section 3, we show that this general class of algorithms includes as special cases nearly all existing voting methods. 
In Section 5, we present experimental results for a special case of AnyBoost that minimizes a theoretically-motivated margin cost functional. The experiments show that the new algorithm typically outperforms AdaBoost, and that this is especially true with label noise. In addition, the theoretically-motivated cost functions provide good estimates of the error of AdaBoost, in the sense that they can be used to predict its overfitting behaviour.\n\n2 AnyBoost\n\nLet $(x, y)$ denote examples from $X \\times Y$, where $X$ is the space of measurements (typically $X \\subseteq \\mathbb{R}^N$) and $Y$ is the space of labels ($Y$ is usually a discrete set or some subset of $\\mathbb{R}$). Let $\\mathcal{F}$ denote some class of functions (the base hypotheses) mapping $X \\to Y$, and $\\mathrm{lin}(\\mathcal{F})$ denote the set of all linear combinations of functions in $\\mathcal{F}$. Let $\\langle \\cdot, \\cdot \\rangle$ be an inner product on $\\mathrm{lin}(\\mathcal{F})$, and\n\n$$C : \\mathrm{lin}(\\mathcal{F}) \\to \\mathbb{R}$$\n\na cost functional on $\\mathrm{lin}(\\mathcal{F})$.\n\nOur aim is to find a function $F \\in \\mathrm{lin}(\\mathcal{F})$ minimizing $C(F)$. We will proceed iteratively via a gradient descent procedure.\n\nSuppose we have some $F \\in \\mathrm{lin}(\\mathcal{F})$ and we wish to find a new $f \\in \\mathcal{F}$ to add to $F$ so that the cost $C(F + \\epsilon f)$ decreases, for some small value of $\\epsilon$. Viewed in function space terms, we are asking for the \"direction\" $f$ such that $C(F + \\epsilon f)$ most rapidly decreases. The desired direction is simply the negative of the functional derivative of $C$ at $F$, $-\\nabla C(F)$, where:\n\n$$\\nabla C(F)(x) := \\left. \\frac{\\partial C(F + \\alpha 1_x)}{\\partial \\alpha} \\right|_{\\alpha = 0}, \\qquad (1)$$\n\nwhere $1_x$ is the indicator function of $x$. Since we are restricted to choosing our new function $f$ from $\\mathcal{F}$, in general it will not be possible to choose $f = -\\nabla C(F)$, so instead we search for an $f$ with greatest inner product with $-\\nabla C(F)$. That is, we should choose $f$ to maximize $-\\langle \\nabla C(F), f \\rangle$. 
This can be motivated by observing that, to first order in $\\epsilon$, $C(F + \\epsilon f) = C(F) + \\epsilon \\langle \\nabla C(F), f \\rangle$ and hence the greatest reduction in cost will occur for the $f$ maximizing $-\\langle \\nabla C(F), f \\rangle$.\n\nFor reasons that will become obvious later, an algorithm that chooses $f$ attempting to maximize $-\\langle \\nabla C(F), f \\rangle$ will be described as a weak learner.\n\nThe preceding discussion motivates Algorithm 1 (AnyBoost), an iterative algorithm for finding linear combinations $F$ of base hypotheses in $\\mathcal{F}$ that minimize the cost functional $C(F)$. Note that we have allowed the base hypotheses to take values in an arbitrary set $Y$, we have not restricted the form of the cost or the inner product, and we have not specified what the step-sizes should be. Appropriate choices for these things will be made when we apply the algorithm to more concrete situations. Note also that the algorithm terminates when $-\\langle \\nabla C(F_t), f_{t+1} \\rangle \\le 0$, i.e., when the weak learner $\\mathcal{L}$ returns a base hypothesis $f_{t+1}$ which no longer points in the downhill direction of the cost function $C(F)$. Thus, the algorithm terminates when, to first order, a step in function space in the direction of the base hypothesis returned by $\\mathcal{L}$ would increase the cost.\n\nAlgorithm 1: AnyBoost\n\nRequire:\n\u2022 An inner product space $(\\mathcal{X}, \\langle \\cdot, \\cdot \\rangle)$ containing functions mapping from $X$ to some set $Y$.\n\u2022 A class of base classifiers $\\mathcal{F} \\subseteq \\mathcal{X}$.\n\u2022 A differentiable cost functional $C : \\mathrm{lin}(\\mathcal{F}) \\to \\mathbb{R}$.\n\u2022 A weak learner $\\mathcal{L}(F)$ that accepts $F \\in \\mathrm{lin}(\\mathcal{F})$ and returns $f \\in \\mathcal{F}$ with a large value of $-\\langle \\nabla C(F), f \\rangle$.\n\nLet $F_0(x) := 0$.\nfor $t := 0$ to $T$ do\n    Let $f_{t+1} := \\mathcal{L}(F_t)$.\n    if $-\\langle \\nabla C(F_t), f_{t+1} \\rangle \\le 0$ then\n        return $F_t$.\n    end if\n    Choose $w_{t+1}$.\n    Let $F_{t+1} := F_t + w_{t+1} f_{t+1}$.\nend for\nreturn $F_{T+1}$.
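

The loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under assumptions that the algorithm itself leaves open: the base class is a finite, negation-closed pool evaluated on m training points, the inner product is the sample average of products, the step size is a fixed constant, and the cost in the usage part is an exponential margin cost. The names anyboost, H, grad, exp_cost and exp_grad are hypothetical.

```python
import numpy as np

def anyboost(H, grad, w=0.1, T=100):
    # H    : (k, m) array; row j holds base hypothesis j on the m points.
    # grad : maps the vector F (values on the m points) to the values of
    #        the functional gradient of the cost at those points.
    # w    : fixed step size (an assumption; Algorithm 1 leaves it open).
    k, m = H.shape
    F = np.zeros(m)                 # F_0 := 0
    coef = np.zeros(k)              # weights of the linear combination
    for _ in range(T + 1):          # t = 0, ..., T
        g = grad(F)
        scores = -(H @ g) / m       # -<grad C(F_t), f> for every f in the pool
        j = int(np.argmax(scores))  # weak learner: best available direction
        if scores[j] <= 0:          # no downhill direction left, so stop
            break
        coef[j] += w
        F = F + w * H[j]            # F_{t+1} := F_t + w_{t+1} f_{t+1}
    return coef, F

# Toy usage: four labelled points and three stumps plus their negations.
y = np.array([1.0, 1.0, -1.0, -1.0])
H = np.array([[1.0, 1.0, -1.0, 1.0],
              [1.0, -1.0, -1.0, -1.0],
              [-1.0, 1.0, 1.0, -1.0]])
H = np.vstack([H, -H])              # make the pool negation closed

def exp_cost(F):
    # Exponential margin cost, averaged over the sample.
    return float(np.mean(np.exp(-y * F)))

def exp_grad(F):
    # Gradient of exp_cost with respect to F at each training point.
    return -y * np.exp(-y * F) / len(y)

coef, F = anyboost(H, exp_grad, w=0.1, T=50)
print(exp_cost(F), y * F)
```

Representing F by its values on the training points is enough here because both the cost and the inner product in this sketch depend on F only through those values.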
\n\n3 A gradient descent view of voting methods\n\nWe now restrict our attention to base hypotheses $f \\in \\mathcal{F}$ mapping to $Y = \\{\\pm 1\\}$, and the inner product\n\n$$\\langle F, G \\rangle := \\frac{1}{m} \\sum_{i=1}^{m} F(x_i) G(x_i) \\qquad (2)$$\n\nfor all $F, G \\in \\mathrm{lin}(\\mathcal{F})$, where $S = \\{(x_1, y_1), \\ldots, (x_m, y_m)\\}$ is a set of training examples generated according to some unknown distribution $\\mathcal{D}$ on $X \\times Y$. Our aim now is to find $F \\in \\mathrm{lin}(\\mathcal{F})$ such that $\\Pr_{(x,y) \\sim \\mathcal{D}}[\\mathrm{sgn}(F(x)) \\ne y]$ is minimal, where $\\mathrm{sgn}(F(x)) = -1$ if $F(x) < 0$ and $\\mathrm{sgn}(F(x)) = 1$ otherwise. In other words, $\\mathrm{sgn}\\, F$ should minimize the misclassification probability.\n\nThe margin of $F : X \\to \\mathbb{R}$ on example $(x, y)$ is defined as $yF(x)$. Consider margin cost-functionals defined by\n\n$$C(F) := \\frac{1}{m} \\sum_{i=1}^{m} c(y_i F(x_i))$$\n\nwhere $c : \\mathbb{R} \\to \\mathbb{R}$ is any differentiable real-valued function of the margin. With these definitions, a quick calculation shows:\n\n$$-\\langle \\nabla C(F), f \\rangle = -\\frac{1}{m^2} \\sum_{i=1}^{m} y_i f(x_i) c'(y_i F(x_i)).$$\n\nTable 1: Existing voting methods viewed as AnyBoost on margin cost functions.\n\nAlgorithm | Cost function | Step size\nAdaBoost [9] | $e^{-yF(x)}$ | Line search\nARC-X4 [2] | $(1 - yF(x))^5$ | $1/t$\nConfidenceBoost [19] | $e^{-yF(x)}$ | Line search\nLogitBoost [12] | $\\ln(1 + e^{-yF(x)})$ | Newton-Raphson\n\nSince positive margins correspond to examples correctly labelled by $\\mathrm{sgn}\\, F$ and negative margins to incorrectly labelled examples, any sensible cost function of the margin will be monotonically decreasing. Hence $-c'(y_i F(x_i))$ will always be positive. Dividing through by $-\\sum_{i=1}^{m} c'(y_i F(x_i))$, we see that finding an $f$ maximizing $-\\langle \\nabla C(F), f \\rangle$ is equivalent to finding an $f$ minimizing the weighted error\n\n$$\\sum_{i:\\, f(x_i) \\ne y_i} D(i) \\quad \\text{where} \\quad D(i) := \\frac{c'(y_i F(x_i))}{\\sum_{j=1}^{m} c'(y_j F(x_j))} \\quad \\text{for } i = 1, \\ldots, m.$$
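

The equivalence between maximizing the inner product and minimizing a weighted error can be checked numerically. The sketch below assumes the sample inner product and margin cost of this section, and takes the weights to be the cost derivatives normalized to sum to one; the function names and the toy numbers are hypothetical.

```python
import numpy as np

def boosting_weights(y, F, c_prime):
    # Normalized cost derivatives: one non-negative weight per example.
    d = c_prime(y * F)
    return d / d.sum()

def weighted_error(D, y, f):
    # Total weight of the examples that f labels incorrectly.
    return float(D[f != y].sum())

def neg_inner_product(y, F, f, c_prime):
    # -<grad C(F), f> for the sample inner product and a margin cost.
    m = len(y)
    return float(-(y * f * c_prime(y * F)).sum() / m**2)

# Toy check with the exponential cost, whose derivative is -exp(-z).
def exp_c_prime(z):
    return -np.exp(-z)

y = np.array([1.0, -1.0, 1.0, -1.0])
F = np.array([0.5, 0.2, -0.3, -1.0])
f = np.array([1.0, 1.0, 1.0, -1.0])

D = boosting_weights(y, F, exp_c_prime)
err = weighted_error(D, y, f)
scale = -exp_c_prime(y * F).sum() / len(y)**2   # positive constant
lhs = neg_inner_product(y, F, f, exp_c_prime)
rhs = scale * (1.0 - 2.0 * err)
print(lhs, rhs)
```

Because the scale factor is positive and does not depend on f, a larger inner product corresponds exactly to a smaller weighted error.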
\n\nMany of the most successful voting methods are, for the appropriate choice of margin cost function $c$ and step-size, specific cases of the AnyBoost algorithm (see Table 1). A more detailed analysis can be found in the full version of this paper [15].\n\n4 Convergence of AnyBoost\n\nIn this section we provide convergence results for the AnyBoost algorithm, under quite weak conditions on the cost functional $C$. The prescriptions given for the step-sizes $w_t$ in these results are for convergence guarantees only: in practice they will almost always be smaller than necessary, hence fixed small steps or some form of line search should be used.\n\nThe following theorem (proof omitted, see [15]) supplies a specific step-size for AnyBoost and characterizes the limiting behaviour with this step-size.\n\nTheorem 1. Let $C : \\mathrm{lin}(\\mathcal{F}) \\to \\mathbb{R}$ be any lower bounded, Lipschitz differentiable cost functional (that is, there exists $L > 0$ such that $\\|\\nabla C(F) - \\nabla C(F')\\| \\le L \\|F - F'\\|$ for all $F, F' \\in \\mathrm{lin}(\\mathcal{F})$). Let $F_0, F_1, \\ldots$ be the sequence of combined hypotheses generated by the AnyBoost algorithm, using step-sizes\n\n$$w_{t+1} := -\\frac{\\langle \\nabla C(F_t), f_{t+1} \\rangle}{L \\|f_{t+1}\\|^2}. \\qquad (3)$$\n\nThen AnyBoost either halts on round $T$ with $-\\langle \\nabla C(F_T), f_{T+1} \\rangle \\le 0$, or $C(F_t)$ converges to some finite value $C^*$, in which case $\\lim_{t \\to \\infty} \\langle \\nabla C(F_t), f_{t+1} \\rangle = 0$.\n\nThe next theorem (proof omitted, see [15]) shows that if the weak learner can always find the best weak hypothesis $f_t \\in \\mathcal{F}$ on each round of AnyBoost, and if the cost functional $C$ is convex, then any accumulation point $F$ of the sequence $(F_t)$ generated by AnyBoost with the step sizes (3) is a global minimum of the cost. 
For ease of exposition, we have assumed that rather than terminating when $-\\langle \\nabla C(F_T), f_{T+1} \\rangle \\le 0$, AnyBoost simply continues to return $F_T$ for all subsequent time steps $t$.\n\nTheorem 2. Let $C : \\mathrm{lin}(\\mathcal{F}) \\to \\mathbb{R}$ be a convex cost functional with the properties in Theorem 1, and let $(F_t)$ be the sequence of combined hypotheses generated by the AnyBoost algorithm with step sizes given by (3). Assume that the weak hypothesis class $\\mathcal{F}$ is negation closed ($f \\in \\mathcal{F} \\implies -f \\in \\mathcal{F}$) and that on each round the AnyBoost algorithm finds a function $f_{t+1}$ maximizing $-\\langle \\nabla C(F_t), f_{t+1} \\rangle$. Then any accumulation point $F$ of the sequence $(F_t)$ satisfies $\\sup_{f \\in \\mathcal{F}} -\\langle \\nabla C(F), f \\rangle = 0$, and $C(F) = \\inf_{G \\in \\mathrm{lin}(\\mathcal{F})} C(G)$.\n\n5 Experiments\n\nAdaBoost had been perceived to be resistant to overfitting despite the fact that it can produce combinations involving very large numbers of classifiers. However, recent studies have shown that this is not the case, even for base classifiers as simple as decision stumps [13, 5, 17]. This overfitting can be attributed to the use of exponential margin cost functions (recall Table 1).\n\nThe results in [14] showed that overfitting may be avoided by using margin cost functionals of a form qualitatively similar to\n\n$$C(F) = \\frac{1}{m} \\sum_{i=1}^{m} \\bigl(1 - \\tanh(\\lambda y_i F(x_i))\\bigr), \\qquad (4)$$\n\nwhere $\\lambda$ is an adjustable parameter controlling the steepness of the margin cost function $c(z) = 1 - \\tanh(\\lambda z)$. For the theoretical analysis of [14] to apply, $F$ must be a convex combination of base hypotheses, rather than a general linear combination. Henceforth (4) will be referred to as the normalized sigmoid cost functional. 
AnyBoost with (4) as the cost functional and (2) as the inner product will be referred to as DOOM II. In our implementation of DOOM II we use a fixed small step-size $\\epsilon$ (for all of the experiments $\\epsilon = 0.05$). For all details of the algorithm the reader is referred to the full version of this paper [15].\n\nWe compared the performance of DOOM II and AdaBoost on a selection of nine data sets taken from the UCI machine learning repository [4] to which various levels of label noise had been applied. To simplify matters, only binary classification problems were considered. For all of the experiments axis orthogonal hyperplanes (also known as decision stumps) were used as the weak learner. Full details of the experimental setup may be found in [15]. A summary of the experimental results is shown in Figure 1. The improvement in test error exhibited by DOOM II over AdaBoost is shown for each data set and noise level. DOOM II generally outperforms AdaBoost and the improvement is more pronounced in the presence of label noise.\n\nThe effect of using the normalized sigmoid cost function rather than the exponential cost function is best illustrated by comparing the cumulative margin distributions generated by AdaBoost and DOOM II. Figure 2 shows comparisons for two data sets with 0% and 15% label noise applied. For a given margin, the value on the curve corresponds to the proportion of training examples with margin less than or equal to this value. These curves show that in trying to increase the margins of negative examples AdaBoost is willing to sacrifice the margin of positive examples significantly. In contrast, DOOM II 'gives up' on examples with large negative margin in order to reduce the value of the cost function.
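

A cumulative margin distribution of the kind described above can be computed as follows. This is a minimal sketch, assuming the combined classifier is stored as a weight vector over a pool of base hypotheses evaluated on the training points, and that the weights are rescaled so that the margins lie in the interval from -1 to 1; margin_cdf, normalized_margins and the toy numbers are hypothetical.

```python
import numpy as np

def normalized_margins(y, H, w):
    # Margins y_i F(x_i), with the weights rescaled to unit absolute sum.
    # H is a (k, m) array of base hypothesis outputs with values +1 or -1.
    w = np.asarray(w, dtype=float)
    F = (w / np.abs(w).sum()) @ H
    return y * F

def margin_cdf(margins, thresholds):
    # Proportion of examples whose margin is <= each threshold.
    margins = np.sort(np.asarray(margins, dtype=float))
    counts = np.searchsorted(margins, thresholds, side='right')
    return counts / margins.size

# Toy usage.
y = np.array([1.0, -1.0])
H = np.array([[1.0, -1.0],
              [1.0, 1.0]])
margins = normalized_margins(y, H, [3.0, 1.0])

toy = [-0.5, 0.0, 0.25, 0.25, 1.0]
cdf = margin_cdf(toy, [-1.0, 0.0, 0.25, 1.0])
print(margins, cdf)
```

Evaluating margin_cdf on a fine grid of thresholds and plotting the result against the thresholds gives a curve of the same kind as those compared in Figure 2.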
\n\nFigure 1: Summary of test error advantage (with standard error bars) of DOOM II over AdaBoost with varying levels of noise on nine UCI data sets. [Plot omitted; x-axis: data set (sonar, cleve, ionosphere, vote1, credit, breast-cancer, pima-indians, hypo1, splice).]\n\nFigure 2: Margin distributions for AdaBoost and DOOM II with 0% and 15% label noise for the breast-cancer and splice data sets.\n\nGiven that AdaBoost does suffer from overfitting and is guaranteed to minimize an exponential cost function of the margins, this cost function certainly does not relate to test error. How does the value of our proposed cost function correlate against AdaBoost's test error? Figure 3 shows the variation in the normalized sigmoid cost function, the exponential cost function and the test error for AdaBoost for two UCI data sets over 10000 rounds. There is a strong correlation between the normalized sigmoid cost and AdaBoost's test error. In both data sets the minimum 
of AdaBoost's test error and the minimum of the normalized sigmoid cost very nearly coincide, showing that the sigmoid cost function predicts when AdaBoost will start to overfit.\n\nReferences\n\n[1] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, March 1998.\n\n[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.\n\n[3] L. Breiman. Prediction games and arcing algorithms. Technical Report 504, Department of Statistics, University of California, Berkeley, 1998.\n\n[4] E. Keogh, C. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.\n\n[5] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Technical report, Computer Science Department, Oregon State University, 1998.\n\nFigure 3: AdaBoost test error, exponential cost and normalized sigmoid cost over 10000 rounds of AdaBoost for the labor and vote1 data sets. 
Both costs have been scaled in each case for easier comparison with test error.\n\n[6] H. Drucker and C. Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems 8, pages 479-485, 1996.\n\n[7] N. Duffy and D. Helmbold. A geometric approach to leveraging weak learners. In Computational Learning Theory: 4th European Conference, 1999. (to appear).\n\n[8] Y. Freund. An adaptive version of the boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999. (to appear).\n\n[9] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, 1996.\n\n[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.\n\n[11] J. Friedman. Greedy function approximation: A gradient boosting machine. Technical report, Stanford University, 1999.\n\n[12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Technical report, Stanford University, 1998.\n\n[13] A. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 692-699, 1998.\n\n[14] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 1999. (to appear).\n\n[15] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional Gradient Techniques for Combining Hypotheses. 
In Alex Smola, Peter Bartlett, Bernhard Sch\u00f6lkopf, and Dale Schuurmans, editors, Large Margin Classifiers. MIT Press, 1999. To appear.\n\n[16] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725-730, 1996.\n\n[17] G. R\u00e4tsch, T. Onoda, and K.-R. M\u00fcller. Soft margins for AdaBoost. Technical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998.\n\n[18] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651-1686, October 1998.\n\n[19] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 80-91, 1998.", "award": [], "sourceid": 1766, "authors": [{"given_name": "Llew", "family_name": "Mason", "institution": null}, {"given_name": "Jonathan", "family_name": "Baxter", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Marcus", "family_name": "Frean", "institution": null}]}