{"title": "Leveraged Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 610, "page_last": 616, "abstract": null, "full_text": "Leveraged Vector Machines \n\nYoram Singer \n\nHebrew University \n\nsinger@cs.huji.ac.il \n\nAbstract \n\nWe describe an iterative algorithm for building vector machines used in \nclassification tasks. The algorithm builds on ideas from support vector \nmachines, boosting, and generalized additive models. The algorithm can \nbe used with various continuously differentiable functions that bound the \ndiscrete (0-1) classification loss and is very simple to implement. We test \nthe proposed algorithm with two different loss functions on synthetic and \nnatural data. We also describe a norm-penalized version of the algorithm \nfor the exponential loss function used in AdaBoost. The performance of \nthe algorithm on natural data is comparable to that of support vector machines, \nwhile its running time is typically shorter than that of SVM. \n\n1  Introduction \nSupport vector machines (SVM) [1, 13] and boosting [10, 3, 4, 11] are highly popular and \neffective methods for constructing linear classifiers. The theoretical basis for SVMs stems \nfrom Vapnik's seminal work on learning and generalization [12] and has proved to be of great \npractical use. The first boosting algorithms [10, 3], on the other hand, were developed \nto answer certain fundamental questions about PAC-learnability [6]. While mathematically \nbeautiful, these algorithms were rather impractical. Later, Freund and Schapire [4] \ndeveloped the AdaBoost algorithm, which proved to be a practically useful meta-learning \nalgorithm. AdaBoost works by making repeated calls to a weak learner. On each call the \nweak learner generates a single weak hypothesis, and these weak hypotheses are combined \ninto an ensemble called a strong hypothesis. 
Recently, Schapire and Singer [11] studied a \nsimple generalization of AdaBoost in which a weak hypothesis can assign a real-valued \nconfidence to each prediction. Even more recently, Friedman, Hastie, and Tibshirani [5] \npresented an alternative, statistical view of boosting and also described \na new family of algorithms for constructing generalized additive models of base learners \nin a similar fashion to AdaBoost. The work of Friedman, Hastie, and Tibshirani generated \nconsiderable attention and motivated research on classification algorithms that employ various \nloss functions [8, 7]. \nIn this work we combine ideas from the research mentioned above and devise an alternative \napproach to constructing vector machines for classification. As in SVM, the base predictors \nthat we use are Mercer kernels. The value of a kernel evaluated at an input pattern, i.e., \nthe dot-product between two instances embedded in a high-dimensional space, is viewed \nas a real-valued prediction. We describe a simple extension to additive models in which the \nprediction of a base learner is a linear transformation of a given kernel. We then describe \nan iterative algorithm that greedily adds kernels. We derive our algorithm using the exponential \nloss function used in AdaBoost and the loss function used by Friedman, Hastie, and \nTibshirani [5] in \"LogitBoost\". For brevity we call the resulting classifiers boosted vector \nmachines (BVM) and logistic vector machines (LVM). We would like to note in passing \nthat the resulting algorithms are not boosting algorithms in the PAC sense. For instance, \nthe weak-learnability assumption that the weak learner can always find a weak hypothesis \nis violated. We therefore adopt the terminology used in [2] and call the resulting classifiers \nleveraged vector machines. 
\nThe leveraging procedure we give adopts the chunking technique from SVM. After presenting \nthe basic leveraging algorithms we compare their performance with SVM on synthetic \ndata. The experimental results show that the leveraged vector machines achieve similar \nperformance to SVM and often the resulting vector machines are smaller than the ones \nobtained by SVM. The experiments also demonstrate that BVM is especially sensitive to \n(malicious) label noise while LVM seems to be less sensitive. We also describe a simple \nnorm-penalized extension of BVM that provides a partial solution to overfitting in the \npresence of noise. Finally, we give results of experiments performed with natural data from \nthe UCI repository and conclude. \n2  Preliminaries \nLet $S = ((x_1, y_1), \\ldots, (x_m, y_m))$ be a sequence of training examples where each instance \n$x_i$ belongs to a domain or instance space $X$, and each label $y_i$ is in $\\{-1, +1\\}$. (The \nmethods described in this paper to build vector machines and SVMs can be extended to \nsolve multiclass problems using, for instance, error correcting output coding. Such methods \nare beyond the scope of this paper and will be discussed elsewhere.) For convenience, we \nwill use $\\tilde{y}_i$ to denote $(y_i + 1)/2 \\in \\{0, 1\\}$. \nAs in boosting, we assume access to a weak or base learning algorithm which accepts as \ninput a weighted sequence of training examples $S$. Given such input, the weak learner \ncomputes a weak (or base) hypothesis $h$. In general, $h$ has the form $h : X \\to \\mathbb{R}$. We \ninterpret the sign of $h(x)$ as the predicted label ($-1$ or $+1$) to be assigned to instance $x$, \nand the magnitude $|h(x)|$ as the \"confidence\" in this prediction. \nTo build vector machines we use the notion of confidence-rated predictions. 
We take for base \nhypotheses sample-based Mercer kernels [13], and define the confidence (i.e., the magnitude \nof prediction) of a base learner to be the value of its dot-product with another instance. \nThe sign of the prediction is set to be the label of the corresponding instance. Formally, \nfor each base hypothesis $h$ there exists $(x_j, y_j) \\in S$ such that $h(x) = y_j K(x_j, x)$ and \n$K(u, v)$ defines an inner product in a feature space: $K(u, v) = \\sum_{k=1}^{\\infty} a_k \\psi_k(u) \\psi_k(v)$. \nWe denote the function induced by an instance-label pair $(x_j, y_j)$ with a kernel $K$ by \n$\\phi_j(x) = y_j K(x_j, x)$. Our goal is to find a classifier $f(x)$, called a strong hypothesis in the \ncontext of boosting algorithms, of the form $f(x) = \\sum_{t=1}^{T} \\alpha_t h_t(x) + \\beta$, such that the signs \nof the predictions of the classifier agree, as much as possible, with the labels of the \ntraining instances. \nThe leverage algorithm we describe maintains a distribution $D$ over $\\{1, \\ldots, m\\}$, i.e., over \nthe indices of $S$. This distribution is simply a vector of non-negative weights, one weight \nper example, and is an exponential function of the classifier $f$, which is built incrementally, \n\n$$D(i) = \\frac{1}{Z} \\exp(-y_i f(x_i)) \\quad \\text{where} \\quad Z = \\sum_{i=1}^{m} \\exp(-y_i f(x_i)). \\qquad (1)$$ \n\nFor a random function $g$ of the input instances and the labels, we denote the sample expectation \nof $g$ according to $D$ by $E_D(g) = \\sum_{i=1}^{m} D(i) g(x_i, y_i)$. We also use this notation \nto denote the expectation of matrices of random functions. We will convert a confidence-rated \nclassifier $f$ into a randomized predictor by using the soft-max function and denote it \nby $P(x_i)$ where \n\n$$P(x_i) = \\frac{\\exp(f(x_i))}{\\exp(f(x_i)) + \\exp(-f(x_i))} = \\frac{1}{1 + \\exp(-2 f(x_i))}. \\qquad (2)$$ \n\n3  The leveraging algorithm \nThe basic procedure to construct leveraged vector machines builds on ideas from [11, 5] by \nextending the prediction to be a linear function of the base classifiers. The algorithm works \nin rounds, constructing a new classifier $f_t$ from the previous one $f_{t-1}$ by adding a new \nbase hypothesis $h_t$ to the current classifier, $f_{t-1}$. Denoting by $D_t$ and $P_{t+1}$ the distribution \nand probability given by Eqn. (1) and Eqn. (2) using $f_t$ and $f_{t+1}$, the algorithm attempts to \nminimize either the exponential function that arises in AdaBoost: \n\n$$Z = \\sum_{i=1}^{m} \\exp(-y_i f_t(x_i)) = \\sum_{i=1}^{m} \\exp(-y_i (f_{t-1}(x_i) + \\alpha_t h_t(x_i) + \\beta_t)) \\qquad (3)$$ \n\n$$\\propto \\sum_{i=1}^{m} D_t(i) \\exp(-y_i (\\alpha_t h_t(x_i) + \\beta_t)), \\qquad (4)$$ \n\nor the logistic loss function: \n\n$$L = -\\sum_{i=1}^{m} \\left( \\tilde{y}_i \\log(P_{t+1}(x_i)) + (1 - \\tilde{y}_i) \\log(1 - P_{t+1}(x_i)) \\right). \\qquad (5)$$ \n\nWe initialize $f_0(x)$ to be zero everywhere and run the procedure for a predefined number \nof rounds $T$. The final classifier is therefore $f_T(x) = \\sum_{t=1}^{T} (\\alpha_t h_t(x) + \\beta_t) = \n\\beta + \\sum_{t=1}^{T} \\alpha_t h_t(x)$ where $\\beta = \\sum_t \\beta_t$. We would like to note parenthetically that it \nis possible to use other loss functions that bound the 0-1 (classification) loss (see for instance \n[8]). Here we focus on the above loss functions, $L$ and $Z$. Fixing $f_{t-1}$ and $h_t$, these \nfunctions are convex in $\\alpha_t$ and $\\beta_t$, which guarantees, under mild conditions (details omitted \ndue to lack of space), the uniqueness of $\\alpha_t$ and $\\beta_t$. \nOn each round we look for the current base hypothesis $h_t$ that will reduce the loss function \n($Z$ or $L$) the most. As discussed before, each input instance $x_j$ defines a function $\\phi_j(x)$ \nand is a candidate for $h_t(x)$. In general, there is no closed form solution for Eqn. (3) and (5) \nand finding $\\alpha$ and $\\beta$ for each possible input instance is time consuming. 
We therefore use a \nquadratic approximation for the loss functions. Using the quadratic approximation, for each \n$\\phi_j$ we can find $\\alpha$ and $\\beta$ analytically and calculate the reduction in the loss function. Let \n$\\nabla Z = (\\partial Z / \\partial \\alpha, \\partial Z / \\partial \\beta)^T$ and $\\nabla L = (\\partial L / \\partial \\alpha, \\partial L / \\partial \\beta)^T$ be the column vectors of the partial derivatives \nof $Z$ and $L$ w.r.t. $\\alpha$ and $\\beta$ (fixing $f_{t-1}$ and $h_t$). Similarly, let $\\nabla^2 Z$ and $\\nabla^2 L$ be the $2 \\times 2$ \nmatrices of second order derivatives of $Z$ and $L$ with respect to $\\alpha$ and $\\beta$. Then, quadratic \napproximation yields that $(\\alpha, \\beta)^T = (\\nabla^2 Z)^{-1} \\nabla Z$ and $(\\alpha, \\beta)^T = (\\nabla^2 L)^{-1} \\nabla L$. On \neach round $t$ we maintain a distribution $D_t$ which is defined from $f_t$ as given by Eqn. (1) \nand conditional class probability estimates $P_t(x_i)$ as given by Eqn. (2). Solving the linear \nequation above for $\\alpha$ and $\\beta$ for each possible instance is done by setting $h_t(x) = \\phi_j(x)$; \nwe get for $Z$ \n\n(6) \n\nand for $L$ \n\n(7) \n\nFigure 1: Comparison of the test error as a function of the number of leveraging rounds \nwhen using full numerical search for $\\alpha$ and $\\beta$, a \"one-step\" numerical search based on a \nquadratic approximation of the loss function, and a one-step search with chunking of the \ninstances. \n\nNote that the equations above share much in common and require, after pre-computing \n$P(x_i)$, the same amount of computation time. \nAfter calculating the value of $\\alpha$ and $\\beta$ for each instance $(x_j, y_j)$, we simply evaluate the \ncorresponding value of the loss function, choose the instance $(x_{j^*}, y_{j^*})$ that attains the \nminimal loss, and set $h_t = \\phi_{j^*}$. We then numerically search for the optimal value of $\\alpha$ \nand $\\beta$ by iterating Eqn. (6) or Eqn. (7) and summing the values into $\\alpha_t$ and $\\beta_t$. 
We would \nlike to note that typically two or three iterations suffice and we can save time by using the \nvalue of $\\alpha$ and $\\beta$ found using the quadratic approximation without a full numerical search \nfor the optimal value of $\\alpha$ and $\\beta$. (See also Fig. 1.) We repeat this process for $T$ rounds \nor until no instance can serve as a base hypothesis. We note that the same instance can be \nchosen more than once, although not in consecutive iterations, and typically only a small \nfraction of the instances is actually used in building $f$. Roughly speaking, these instances \nare the \"support patterns\" of the leveraged machines although they are not necessarily the \ngeometric support patterns. \nAs in SVMs, in order to make the search for a base hypothesis efficient we pre-compute and \nstore $K(x, x')$ for all pairs $x \\neq x'$ from $S$. Storing these values requires $|S|^2$ space, which \nmight be prohibitive in large problems. To save space, we employ the idea of chunking \nused in SVM. We partition $S$ into $r$ blocks $S_1, S_2, \\ldots, S_r$ of about the same size. We \ndivide the iterations into sub-groups such that all iterations belonging to the $i$th sub-group \nuse and evaluate kernels based on instances from the $i$th block only. When switching to a \nnew block $k$ we need to compute the values $K(x, x')$ for $x \\in S$ and $x' \\in S_k$. This division \ninto blocks might be more expensive since we typically use each block of instances more \nthan once. However, the storage of the kernel values can be done in place and we thus save \na factor of $r$ in memory requirements. In practice we found that chunking does not hurt \nthe performance. In Fig. 1 we show the test error as a function of the number of rounds when \nusing (a) full numerical search to determine $\\alpha$ and $\\beta$ on each round, (b) the quadratic \napproximation (\"one-step\") to find $\\alpha$ and $\\beta$, and (c) the quadratic approximation with \nchunking. 
The number of instances in the experiment is 1000, each block for chunking is \nof size 100, and we switch to a different block every 100 iterations. (Further description of \nthe data is given in the next section.) In this example, after 10 iterations, there is virtually \nno difference in the performance of the different schemes. \n\n4  Experiments with synthetic data \nIn this section we describe experiments with synthetic data comparing different aspects \nof leveraged vector machines to SVMs. The original instance space is two dimensional \nwhere the positive class includes all points inside a circle of radius $R$, i.e., an instance \n$(u_1, u_2) \\in \\mathbb{R}^2$ is labeled $+1$ iff $u_1^2 + u_2^2 \\le R$. The instances were picked at random \naccording to a zero mean unit variance normal distribution and $R$ was set such that exactly half \nof the instances belong to the positive class. In all the experiments described in this section \nwe generated 10 groups of training and test sets, each of which includes 1000 training and 1000 test \nexamples. Overall, there are 10,000 training examples and 10,000 test examples. \n\nFigure 2: Performance comparison of SVM and BVM as a function of the training data \nsize (left), the dimension of the kernels (middle), and the number of redundant features (right). \n\nFigure 3: Train and test errors for SVM, LVM, and BVM as a function of the label noise. \n\nThe average variance of the estimates of the empirical errors across experiments is about 0.2%. 
\nFor SVM we set the regularization parameter, $C$, to 100 and used 500 iterations to build \nleveraged machines. In all the experiments without noise the results for BVM and LVM \nwere practically the same. We therefore only compare BVM to SVM in Fig. 2. Unless stated \notherwise we used polynomials of degree two as kernels: $K(x, x') = (x \\cdot x' + 1)^2$. Hence, \nthe data is separable in the absence of noise. \nIn the first experiment we tested the sensitivity to the number of training examples by omitting \nexamples from the training data (without any modification to the test sets). On the left \npart of Fig. 2 we plot the test error as a function of the number of training examples. The \ntest error of BVM is almost indistinguishable from the error of SVM and the performance of \nboth methods improves very fast as a function of the number of training examples. Next, we compared \nthe performance as a function of the dimension of the polynomial constituting the kernel. We \nran the algorithms with kernels of the form $K(x, x') = (x \\cdot x' + 1)^d$ for $d = 2, \\ldots, 8$. \nThe results are depicted in the middle plot of Fig. 2. Again, the performance of BVM and \nSVM is very close (note the small scale of the y axis for the test error in this experiment). \nTo conclude the experiments with clean, realizable data we checked the sensitivity to \nirrelevant features of the input. Each input instance $(u_1, u_2)$ was augmented with random \nelements $u_3, \\ldots, u_l$ to form an input vector of dimension $l$. The right-hand side graph of \nFig. 2 shows the test error as a function of $l$ for $l = 2, \\ldots, 12$. Once more we see that the \nperformance of both algorithms is very similar. \nWe next compared the performance of the algorithms in the presence of noise. We used kernels \nof dimension two and instances without redundant features. The label of each instance \nwas flipped with probability $\\epsilon$. 
We ran 15 sets of experiments, for $\\epsilon = 0.01, \\ldots, 0.15$. As \nbefore, each set included 10 runs, each of which used 1000 training examples and 1000 test \nexamples. In Fig. 3 we show the average training error (left) and the average test error \n(right) for each of the algorithms. It is apparent from the graphs that BVMs built based \non the exponential loss are much more sensitive to noise than SVMs and LVMs, and their \ngeneralization error degrades significantly, even for low noise rates. The generalization error \nof LVMs is, on the other hand, only slightly worse than that of SVMs, although the \nonly algorithmic difference in constructing BVMs and LVMs is in the loss function. The \nfact that LVMs exhibit performance similar to SVM can be partially attributed to the fact \nthat the asymptotic behavior of their loss functions is the same. \n\nFigure 4: The training error, test error, and the cumulative $L_1$ norm ($\\sum_{t'=1}^{t} |\\alpha_{t'}|$) as a \nfunction of the number of leveraging iterations for LVM, BVM, and PBVM. \n\n5  A norm-penalized version \nOne of the problems with boosting and the corresponding leveraging algorithm with the \nexponential loss described here is that it might increase the confidence on a few instances \nwhile misclassifying many other instances, albeit with a small confidence. This often happens \non late rounds, during which the distribution $D_t(i)$ is concentrated on a few examples, \nand the leveraging algorithm typically assigns a large weight to a weak hypothesis that does \nnot affect most of the instances. It is therefore desirable to control the complexity of the leveraged \nclassifiers by limiting the magnitude of the base hypotheses' weights. 
Several methods \nhave been proposed to limit the confidence of AdaBoost, using, for instance, regularization \n(e.g., [9]) or \"smoothing\" the predictions [11]. Here we propose a norm-penalized \nmethod for BVM that is very simple to implement and maintains the convexity properties \nof the objective function. Following Cortes and Vapnik's idea for SVMs in the non-separable \ncase [1] we add the following penalization term: $\\gamma_0 \\exp(\\sum_{t=1}^{T} |\\alpha_t|^p)$. Simple \nalgebraic manipulation implies that the objective function at the $t$th round for BVMs with \nthe penalization term above is \n\n$$Z_t = \\sum_{i=1}^{m} D_t(i) \\exp(-y_i (\\alpha_t h_t(x_i) + \\beta_t)) + \\gamma_t \\exp(|\\alpha_t|^p). \\qquad (8)$$ \n\nIt is also easy to show that the penalty parameter should be updated after each round as \n$\\gamma_t = \\gamma_{t-1} \\exp(|\\alpha_{t-1}|^p) / Z_{t-1}$. Since $Z_t < 1$, unless there is no kernel function better \nthan random, $\\gamma_t$ typically increases as a function of $t$, forcing the new \nweights to be smaller and smaller. Note that Eqn. (8) implies that the search for a base predictor $h_t$ \nand weights $\\alpha_t, \\beta_t$ on each round can still be done independently of previous rounds by \nmaintaining the distribution $D_t$ and a single regularization value $\\gamma_t$. The penalty term for \n$p = 1$ and $p = 2$ simply adds a diagonal term to the matrix of second order derivatives \n(Eqn. (6)) and the algorithm follows the same line (details omitted). For brevity we call \nthe norm-penalized leveraging procedure PBVM. In Fig. 4 we plot the test error (right), \ntraining error (middle), and $\\sum_t |\\alpha_t|$ as functions of the number of rounds for LVM, BVM, \nand PBVM with $p = 1$, $\\gamma_0 = 0.01$. The training set in this example was made small on \npurpose (200 examples) and was contaminated with 5% label noise. 
In this very small \nexample both LVM and BVM overfit while PBVM stops increasing the weights and finds \na reasonably good classifier. The plots demonstrate that the norm-penalized version can \nsafeguard against overfitting by preventing the weights from growing arbitrarily large, and \nthat the effect of the penalized version is very similar to early stopping. We would like \nto note that we found experimentally that the norm-penalized version does not compensate for \nincorrect estimates of $\\alpha$ and $\\beta$ due to malicious label noise. The experimental results given \nin the next section show, however, that it does indeed help in preventing overfitting when \nthe training set is small. \n\nDataset (source)     #Examples  #Features   SVM size   LVM size   BVM size  PBVM size  SVM error  LVM error  BVM error PBVM error \nlabor (uci)                 57         16       12.5       13.7       16.1       13.6        6.0       14.0       14.0       12.0 \nechocard. (uci)             74         12        7.8       13.0       12.6       12.4        8.6        5.7       10.0       10.0 \nbridges (uci)              102          7       27.2       20.2       18.5       17.9       15.0       15.0       23.0       14.0 \nhepatitis (uci)            155         19       41.2       13.5       17.4       14.0       21.3       22.0       22.7       22.0 \nhorse-colic (uci)          300         23      122.0       13.0       13.0       13.0       14.7       14.7       14.7       13.2 \nliver (uci)                345          6      228.6       11.3       12.8       10.7       33.8       35.6       33.5       35.6 \nionosphere (uci)           351         34       63.4       58.9       67.9       59.1       13.7       13.1       16.9       13.7 \nvote (uci)                 435         16       37.0       37.0       41.0       37.0        4.4        5.2        5.9        5.2 \nticket1 (att)              556         78       48.1       84.6       89.3       82.3        8.4        3.3       11.5        5.1 \nticket2 (att)              556         53       52.6       77.1       75.4       74.0        6.6        6.4        8.0        6.4 \nticket3 (att)              556         61       46.1       76.2       77.8       73.3        6.9        4.9        7.6        6.7 \nbands (uci)                690         39      265.5       78.2       76.4       75.6       32.8       33.2       34.3       33.3 \nbreast-wisc (uci)          699          9       49.3       26.5       24.4       24.0        3.5        3.6        4.1        4.1 \npima (uci)                 768          8      360.7       47.7       30.3       22.8       23.0       22.6       23.2       22.1 \ngerman (uci)              1000         10      485.2       89.8       96.5       87.0       23.5       24.0       23.8       24.1 \nweather (uci)             1000         35      562.0       52.0       52.0       52.0       25.9       25.4       25.4       25.4 \nnetwork (att)             2600         35     1031.0       42.0       43.0       42.0       24.8       21.2       23.5       21.2 \nsplice (uci)              3190         60      318.0      153.0      156.0      153.0        8.0        8.4        8.4        8.4 \nboa (att)                 5000         68      637.0      183.0      178.0      160.0       41.5       40.8       40.8       41.0 \n\nTable 1: Summary of results for a collection of binary classification problems. \n\n6  Experiments with natural data \nWe compared the practical performance of leveraged vector machines with SVMs on a \ncollection of nineteen datasets from the UCI machine learning repository and AT&T networking \nand marketing data. For SVM we set $C = 100$. We built each of the leveraged \nvector machines using 500 rounds. For PBVM we again used $p = 1$ and $\\gamma_0 = 0.01$. We \nused chunking in building the leveraged vector machines, dividing each training set into 10 \nblocks. For all the datasets, with the exception of \"boa\", we used 10-fold cross validation \nto calculate the test error. (The dataset \"boa\" has 5000 training examples and 6000 test \nexamples.) The performance of SVM, LVM, and PBVM seems comparable. In fact, with \nthe exception of a very few datasets the differences in error rates are not statistically significant. \nOf the three methods (SVM, PBVM, and LVM), LVM is the simplest to implement, and \nthe time required to build an LVM is typically much shorter than that of an SVM. It is \nalso worth noting that the size of leveraged machines is often smaller than the size of the \ncorresponding SVM. Finally, it is apparent that PBVMs frequently yield better results than \nBVMs, especially for small and medium size datasets. 
\nReferences \n[1]  Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995. \n[2]  N. Duffy and D. Helmbold. A geometric approach to leveraging weak learners. EuroCOLT '99. \n[3]  Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995. \n[4]  Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. \nJournal of Computer and System Sciences, 55(1):119-139, August 1997. \n[5]  J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Tech. Report, 1998. \n[6]  Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal \nof the Association for Computing Machinery, 41(1):67-95, January 1994. \n[7]  John D. Lafferty. Additive models, boosting and inference for generalized divergences. In Proceedings of the Twelfth Annual \nConference on Computational Learning Theory, 1999. \n[8]  L. Mason, J. Baxter, P. Bartlett, and M. Frean. Doom II. Technical report, Dept. of Sys. Eng., ANU, 1999. \n[9]  G. R\u00e4tsch, T. Onoda, and K.-R. M\u00fcller. Regularizing AdaBoost. In Advances in Neural Info. Processing Systems 12, 1998. \n[10]  Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990. \n[11]  Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. COLT'98. \n[12]  V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982. \n[13]  Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995. \n", "award": [], "sourceid": 1771, "authors": [{"given_name": "Yoram", "family_name": "Singer", "institution": null}]}