{"title": "Large Margin DAGs for Multiclass Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 547, "page_last": 553, "abstract": null, "full_text": "Large Margin DAGs for Multiclass Classification \n\nJohn C. Platt \nMicrosoft Research \n1 Microsoft Way \nRedmond, WA 98052 \njplatt@microsoft.com \n\nNello Cristianini \nDept. of Engineering Mathematics \nUniversity of Bristol \nBristol, BS8 1TR - UK \nnello.cristianini@bristol.ac.uk \n\nJohn Shawe-Taylor \nDepartment of Computer Science \nRoyal Holloway College - University of London \nEGHAM, Surrey, TW20 0EX - UK \nj.shawe-taylor@dcs.rhbnc.ac.uk \n\nAbstract \n\nWe present a new learning architecture: the Decision Directed Acyclic Graph (DDAG), which is used to combine many two-class classifiers into a multiclass classifier. For an N-class problem, the DDAG contains N(N-1)/2 classifiers, one for each pair of classes. We present a VC analysis of the case when the node classifiers are hyperplanes; the resulting bound on the test error depends on N and on the margin achieved at the nodes, but not on the dimension of the space. This motivates an algorithm, DAGSVM, which operates in a kernel-induced feature space and uses two-class maximal margin hyperplanes at each decision node of the DDAG. The DAGSVM is substantially faster to train and evaluate than either the standard algorithm or Max Wins, while maintaining comparable accuracy to both of these algorithms. \n\n1  Introduction \n\nThe problem of multiclass classification, especially for systems like SVMs, does not have an easy solution. It is generally simpler to construct classifier theory and algorithms for two mutually exclusive classes than for N mutually exclusive classes. We believe constructing N-class SVMs is still an unsolved research problem. 
\n\nThe standard method for N-class SVMs [10] is to construct N SVMs. The ith SVM is trained with all of the examples in the ith class given positive labels, and all other examples given negative labels. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-rest). The final output of the N 1-v-r SVMs is the class that corresponds to the SVM with the highest output value. Unfortunately, there is no bound on the generalization error for the 1-v-r SVM, and the training time of the standard method scales linearly with N. \n\nAnother method for constructing N-class classifiers from SVMs is derived from previous research into combining two-class classifiers. Knerr [5] suggested constructing all possible two-class classifiers from a training set of N classes, each classifier being trained on only two out of N classes. There would thus be K = N(N-1)/2 classifiers. When applied to SVMs, we refer to this as 1-v-1 SVMs (short for one-versus-one). \n\nKnerr suggested combining these two-class classifiers with an \"AND\" gate [5]. Friedman [4] suggested a Max Wins algorithm: each 1-v-1 classifier casts one vote for its preferred class, and the final result is the class with the most votes. Friedman shows circumstances in which this algorithm is Bayes optimal. Kre\u00dfel [6] applies the Max Wins algorithm to Support Vector Machines with excellent results. \n\nA significant disadvantage of the 1-v-1 approach, however, is that, unless the individual classifiers are carefully regularized (as in SVMs), the overall N-class classifier system will tend to overfit. The \"AND\" combination method and the Max Wins combination method do not have bounds on the generalization error. 
Finally, the size of the 1-v-1 classifier may grow superlinearly with N, and hence may be slow to evaluate on large problems. \n\nIn Section 2, we introduce a new multiclass learning architecture, called the Decision Directed Acyclic Graph (DDAG). The DDAG contains N(N-1)/2 nodes, each with an associated 1-v-1 classifier. In Section 3, we present a VC analysis of DDAGs whose classifiers are hyperplanes, showing that the margins achieved at the decision nodes and the size of the graph both affect their performance, while the dimensionality of the input space does not. The VC analysis indicates that building large margin DAGs in high-dimensional feature spaces can yield good generalization performance. Using this bound as a guide, in Section 4 we introduce a novel algorithm for multiclass classification based on placing 1-v-1 SVMs into the nodes of a DDAG. This algorithm, called DAGSVM, is efficient to train and evaluate. Empirical evidence of this efficiency is shown in Section 5. \n\n2  Decision DAGs \n\nA Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and no cycles. A Rooted DAG has a unique node such that it is the only node which has no arcs pointing into it. A Rooted Binary DAG has nodes which have either 0 or 2 arcs leaving them. We will use Rooted Binary DAGs in order to define a class of functions to be used in classification tasks. The class of functions computed by Rooted Binary DAGs is formally defined as follows. \n\nDefinition 1  Decision DAGs (DDAGs). Given a space X and a set of boolean functions $F = \\{f : X \\to \\{0, 1\\}\\}$, the class DDAG(F) of Decision DAGs on N classes over F are functions which can be implemented using a rooted binary DAG with N leaves labeled by the classes where each of the K = N(N-1)/2 internal nodes is labeled with an element of F. 
The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer and so on until the final layer of N leaves. The i-th node in layer j < N is connected to the i-th and (i+1)-st node in the (j+1)-st layer. \n\nTo evaluate a particular DDAG G on input $x \\in X$, starting at the root node, the binary function at a node is evaluated. The node is then exited via the left edge, if the binary function is zero; or the right edge, if the binary function is one. The next node's binary function is then evaluated. The value of the decision function D(x) is the value associated with the final leaf node (see Figure 1(a)). The path taken through the DDAG is known as the evaluation path. The input x reaches a node of the graph, if that node is on the evaluation path for x. We refer to the decision node distinguishing classes i and j as the ij-node. Assuming that the number of a leaf is its class, this node is the i-th node in the (N-j+i)-th layer provided i < j. Similarly the j-nodes are those nodes involving class j, that is, the internal nodes on the two diagonals containing the leaf labeled by j. \n\nThe DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with a list of all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with N classes, N-1 decision nodes will be evaluated in order to derive an answer. \n\nFigure 1: (a) The decision DAG for finding the best class out of four classes. The equivalent list state for each node is shown next to that node. (b) A diagram of the input space of a four-class problem. A 1-v-1 SVM can only exclude one class from consideration. \n\nThe current state of the list is the total state of the system. Therefore, since a list state is reachable in more than one possible path through the system, the decision graph the algorithm traverses is a DAG, not simply a tree. \n\nDecision DAGs naturally generalize the class of Decision Trees, allowing for a more efficient representation of redundancies and repetitions that can occur in different branches of the tree, by allowing the merging of different decision paths. The class of functions implemented is the same as that of Generalized Decision Trees [1], but this particular representation presents both computational and learning-theoretical advantages. \n\n3  Analysis of Generalization \n\nIn this paper we study DDAGs where the node-classifiers are hyperplanes. We define a Perceptron DDAG to be a DDAG with a perceptron at every node. Let w be the (unit) weight vector correctly splitting the i and j classes at the ij-node with threshold $\\theta$. We define the margin of the ij-node to be $\\gamma = \\min_{c(x) = i, j} \\{ |\\langle w, x \\rangle - \\theta| \\}$, where c(x) is the class associated to training example x. Note that, in this definition, we only take into account examples with class labels equal to i or j. 
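The list-based evaluation from Section 2 and the ij-node margin defined above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names dot, node_margin, ddag_predict and the dictionary nodes are hypothetical, the one-dimensional toy data are made up for illustration, and the convention that a positive node output prefers the first class of the pair is an assumption.

```python
# Minimal sketch of a Perceptron DDAG (illustrative names, toy data).

def dot(w, x):
    # Inner product <w, x> of two equal-length sequences.
    return sum(wi * xi for wi, xi in zip(w, x))

def node_margin(w, theta, examples, i, j):
    # Margin of the ij-node: min of |<w, x> - theta| over the
    # training examples whose class label is i or j.
    return min(abs(dot(w, x) - theta) for x, c in examples if c in (i, j))

def ddag_predict(x, classes, nodes):
    # nodes[(a, b)] = (w, theta) for the node comparing class a with class b,
    # where a comes before b in the class list.
    # Assumed convention: <w, x> - theta > 0 means the node prefers class a.
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        w, theta = nodes[(first, last)]
        if dot(w, x) - theta > 0:
            remaining.pop()     # node prefers the first class: drop the last
        else:
            remaining.pop(0)    # node prefers the last class: drop the first
    return remaining[0]

# Toy problem: three classes on the real line, clustered near 0, 5 and 10.
classes = [1, 2, 3]
nodes = {
    (1, 2): ([-1.0], -2.5),   # prefers class 1 when x < 2.5
    (2, 3): ([-1.0], -7.5),   # prefers class 2 when x < 7.5
    (1, 3): ([-1.0], -5.0),   # prefers class 1 when x < 5.0
}
examples = [([0.0], 1), ([1.0], 1), ([4.0], 2), ([6.0], 2), ([9.0], 3)]

predictions = [ddag_predict(x, classes, nodes) for x, _ in examples]
margin_12 = node_margin([-1.0], -2.5, examples, 1, 2)
```

On this toy problem each prediction evaluates N - 1 = 2 nodes, matching the count given in Section 2, and the example of class 3 plays no role in the margin of the 12-node.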
\n\nTheorem 1  Suppose we are able to classify a random m-sample of labeled examples using a Perceptron DDAG on N classes containing K decision nodes with margins $\\gamma_i$ at node i, then we can bound the generalization error with probability greater than $1 - \\delta$ to be less than \n\n$$\\frac{130 R^2}{m} \\left( D' \\log(4em) \\log(4m) + \\log \\frac{2(2m)^K}{\\delta} \\right),$$ \n\nwhere $D' = \\sum_{i=1}^{K} \\frac{1}{\\gamma_i^2}$, and R is the radius of a ball containing the distribution's support. \n\nProof: see Appendix. $\\Box$ \n\nTheorem 1 implies that we can control the capacity of DDAGs by enlarging their margin. Note that, in some situations, this bound may be pessimistic: the DDAG partitions the input space into polytopic regions, each of which is mapped to a leaf node and assigned to a specific class. Intuitively, the only margins that should matter are the ones relative to the boundaries of the cell where a given training point is assigned, whereas the bound in Theorem 1 depends on all the margins in the graph. \n\nBy the above observations, we would expect that a DDAG whose j-node margins are large would be accurate at identifying class j, even when other nodes do not have large margins. Theorem 2 substantiates this by showing that the appropriate bound depends only on the j-node margins, but first we introduce the notation $\\epsilon_j(G) = P\\{x : (x \\text{ in class } j \\text{ and } x \\text{ is misclassified by } G) \\text{ or } x \\text{ is misclassified as class } j \\text{ by } G\\}$. \n\nTheorem 2  Suppose we are able to correctly distinguish class j from the other classes in a random m-sample with a DDAG G over N classes containing K decision nodes with margins $\\gamma_i$ at node i, then with probability $1 - \\delta$, \n\n$$\\epsilon_j(G) \\le \\frac{130 R^2}{m} \\left( D' \\log(4em) \\log(4m) + \\log \\frac{2(2m)^{N-1}}{\\delta} \\right),$$ \n\nwhere $D' = \\sum_{i \\in j\\text{-nodes}} \\frac{1}{\\gamma_i^2}$, and R is the radius of a ball containing the support of the distribution. 
\n\nProof: follows exactly Lemma 4 and Theorem 1, but is omitted. $\\Box$ \n\n4  The DAGSVM algorithm \n\nBased on the previous analysis, we propose a new algorithm, called the Directed Acyclic Graph SVM (DAGSVM) algorithm, which combines the results of 1-v-1 SVMs. We will show that this combination method is efficient to train and evaluate. \n\nThe analysis of Section 3 indicates that maximizing the margin of all of the nodes in a DDAG will minimize a bound on the generalization error. This bound is also independent of input dimensionality. Therefore, we will create a DDAG whose nodes are maximum margin classifiers over a kernel-induced feature space. Such a DDAG is obtained by training each ij-node only on the subset of training points labeled by i or j. The final class decision is derived by using the DDAG architecture, described in Section 2. \n\nThe DAGSVM separates the individual classes with large margin. It is safe to discard the losing class at each 1-v-1 decision because, for the hard margin case, all of the examples of the losing class are far away from the decision surface (see Figure 1(b)). \n\nFor the DAGSVM, the choice of the class order in the list (or DDAG) is arbitrary. The experiments in Section 5 simply use a list of classes in the natural numerical (or alphabetical) order. Limited experimentation with re-ordering the list did not yield significant changes in accuracy. \n\nThe DAGSVM algorithm is superior to other multiclass SVM algorithms in both training and evaluation time. Empirically, SVM training is observed to scale super-linearly with the training set size m [7], according to a power law: $T = c m^{\\gamma}$, where $\\gamma \\approx 2$ for algorithms based on the decomposition method, with some proportionality constant c. 
For the standard 1-v-r multiclass SVM training algorithm, the entire training set is used to create all N classifiers. Hence the training time for 1-v-r is \n\n$$T_{1\\text{-v-r}} = c N m^{\\gamma}. \\qquad (1)$$ \n\nAssuming that the classes have the same number of examples, training each 1-v-1 SVM only requires 2m/N training examples. Thus, training K 1-v-1 SVMs would require \n\n$$T_{1\\text{-v-1}} = c \\, \\frac{N(N-1)}{2} \\left( \\frac{2m}{N} \\right)^{\\gamma} \\approx 2^{\\gamma - 1} c N^{2 - \\gamma} m^{\\gamma}. \\qquad (2)$$ \n\nFor a typical case, where $\\gamma = 2$, the amount of time required to train all of the 1-v-1 SVMs is independent of N, and is only twice that of training a single 1-v-r SVM. Using 1-v-1 SVMs with a combination algorithm is thus preferred for training time. \n\n5  Empirical Comparisons and Conclusions \n\nThe DAGSVM algorithm was evaluated on three different test sets: the USPS handwritten digit data set [10], the UCI Letter data set [2], and the UCI Covertype data set [2]. The USPS digit data consists of 10 classes (0-9), whose inputs are pixels of a scaled input image. There are 7291 training examples and 2007 test examples. The UCI Letter data consists of 26 classes (A-Z), whose inputs are measured statistics of printed font glyphs. We used the first 16000 examples for training, and the last 4000 for testing. All inputs of the UCI Letter data set were scaled to lie in [-1, 1]. The UCI Covertype data consists of 7 classes of trees, where the inputs are terrain features. There are 11340 training examples and 565893 test examples. All of the continuous inputs for Covertype were scaled to have zero mean and unit variance. Discrete inputs were represented as a 1-of-n code. \n\nOn each data set, we trained N 1-v-r SVMs and K 1-v-1 SVMs, using SMO [7], with soft margins. 
We combined the 1-v-1 SVMs both with the Max Wins algorithm and with DAGSVM. The choice of kernel and of the regularizing parameter C was determined via performance on a validation set. The validation performance was measured by training on 70% of the training set and testing the combination algorithm on 30% of the training set (except for Covertype, where the UCI validation set was used). The best kernel was selected from a set of polynomial kernels (from degree 1 through 6), both homogeneous and inhomogeneous; and Gaussian kernels, with various $\\sigma$. The Gaussian kernel was always found to be best. \n\nAlgorithm            sigma    C      Error Rate (%)   Kernel Evaluations   Training CPU Time (sec)   Classifier Size (Kparameters) \nUSPS \n  1-v-r              3.58     100    4.7              2936                 3532                      760 \n  Max Wins           5.06     100    4.5              1877                 307                       487 \n  DAGSVM             5.06     100    4.4              819                  307                       487 \n  Neural Net [10]    -        -      5.9              -                    -                         - \nUCI Letter \n  1-v-r              0.447    100    2.2              8183                 1764                      148 \n  Max Wins           0.632    100    2.4              7357                 441                       160 \n  DAGSVM             0.447    10     2.2              3834                 792                       223 \n  Neural Net         -        -      4.3              -                    -                         - \nUCI Covertype \n  1-v-r              1        10     30.2             7366                 4210                      105 \n  Max Wins           1        10     29.0             7238                 1305                      107 \n  DAGSVM             1        10     29.2             4390                 1305                      107 \n  Neural Net [2]     -        -      30               -                    -                         - \n\nTable 1: Experimental Results \n\nTable 1 shows the results of the experiments. The optimal parameters for all three multiclass SVM algorithms are very similar for both data sets. Also, the error rates are similar for all three algorithms for both data sets. Neither 1-v-r nor Max Wins is statistically significantly better than DAGSVM using McNemar's test [3] at a 0.05 significance level for USPS or UCI Letter. For UCI Covertype, Max Wins is slightly better than either of the other SVM-based algorithms. 
The results for a neural network trained on the same data sets are shown for a baseline accuracy comparison. \n\nThe three algorithms distinguish themselves in training time, evaluation time, and classifier size. The number of kernel evaluations is a good indication of evaluation time. For 1-v-r and Max Wins, the number of kernel evaluations is the total number of unique support vectors for all SVMs. For the DAGSVM, the number of kernel evaluations is the number of unique support vectors averaged over the evaluation paths through the DDAG taken by the test set. As can be seen in Table 1, Max Wins is faster than 1-v-r SVMs, due to shared support vectors between the 1-v-1 classifiers. The DAGSVM has the fastest evaluation. The DAGSVM is between 1.6 and 2.3 times faster to evaluate than Max Wins. \n\nThe DAGSVM algorithm is also substantially faster to train than the standard 1-v-r SVM algorithm: a factor of 2.2 (UCI Letter) and 11.5 (USPS) times faster for these two data sets. The Max Wins algorithm shares a similar training speed advantage. \n\nBecause the SVM basis functions are drawn from a limited set, they can be shared across classifiers for a great savings in classifier size. The number of parameters for DAGSVM (and Max Wins) is comparable to the number of parameters for 1-v-r SVM, even though there are N(N-1)/2 classifiers, rather than N. \n\nIn summary, we have created a Decision DAG architecture, which is amenable to a VC-style bound of generalization error. Using this bound, we created the DAGSVM algorithm, which places a two-class SVM at every node of the DDAG. The DAGSVM algorithm was tested versus the standard 1-v-r multiclass SVM algorithm, and Friedman's Max Wins combination algorithm. 
The DAGSVM algorithm yields comparable accuracy and memory usage to the other two algorithms, but yields substantial improvements in both training and evaluation time. \n\n6  Appendix: Proof of Main Theorem \n\nDefinition 2  Let F be a set of real valued functions. We say that a set of points X is $\\gamma$-shattered by F relative to $r = (r_x)_{x \\in X}$, if there are real numbers $r_x$, indexed by $x \\in X$, such that for all binary vectors b indexed by X, there is a function $f_b \\in F$ satisfying $(2b_x - 1) f_b(x) \\ge (2b_x - 1) r_x + \\gamma$. The fat shattering dimension, $\\mathrm{fat}_F$, of the set F is a function from the positive real numbers to the integers which maps a value $\\gamma$ to the size of the largest $\\gamma$-shattered set, if the set is finite, or maps to infinity otherwise. \n\nAs a relevant example, consider the class $F_{lin} = \\{x \\to \\langle w, x \\rangle - \\theta : \\|w\\| = 1\\}$. We quote the following result from [1]. \n\nTheorem 3  Let $F_{lin}$ be restricted to points in a ball of n dimensions of radius R about the origin. Then \n\n$$\\mathrm{fat}_{F_{lin}}(\\gamma) \\le \\min\\{R^2/\\gamma^2, n + 1\\}.$$ \n\nWe will bound generalization with a technique that closely resembles the technique used in [1] to study Perceptron Decision Trees. We will now give a lemma and a theorem: the lemma bounds the probability over a double sample that the first half has zero error and the second error greater than an appropriate $\\epsilon$. We assume that the DDAG on N classes has K = N(N-1)/2 nodes and we denote $\\mathrm{fat}_{F_{lin}}(\\gamma)$ by $\\mathrm{fat}(\\gamma)$. \n\nLemma 4  Let G be a DDAG on N classes with K = N(N-1)/2 decision nodes with margins $\\gamma_1, \\gamma_2, \\ldots, \\gamma_K$ at the decision nodes satisfying $k_i = \\mathrm{fat}(\\gamma_i/8)$, where fat is continuous from the right. Then the following bound holds: $P^{2m}\\{\\mathbf{x}\\mathbf{y} : \\exists \\text{ a graph } G : G \\text{ which separates classes } i \\text{ and } j \\text{ at the } ij\\text{-node for all } x \\text{ in } \\mathbf{x}, \\text{ a fraction of points misclassified in } \\mathbf{y} > \\epsilon(m, K, \\delta)\\} < \\delta$, where $\\epsilon(m, K, \\delta) = \\frac{1}{m} \\left( D \\log(8m) + \\log \\frac{2^K}{\\delta} \\right)$ and $D = \\sum_{i=1}^{K} k_i \\log(4em/k_i)$. \n\nProof: The proof of Lemma 4 is omitted for space reasons, but is formally analogous to the proof of Lemma 4.4 in [8], and can easily be reconstructed from it. $\\Box$ \n\nLemma 4 applies to a particular DDAG with a specified margin $\\gamma_i$ at each node. In practice, we observe these quantities after generating the DDAG. Hence, to obtain a bound that can be applied in practice, we must bound the probabilities uniformly over all of the possible margins that can arise. We can now give the proof for Theorem 1. \n\nProof of Main Theorem: We must bound the probabilities over different margins. We first use a standard result due to Vapnik [9, page 168] to bound the probability of error in terms of the probability of the discrepancy between the performance on two halves of a double sample. Then we combine this result with Lemma 4. We must consider all possible patterns of $k_i$'s over the decision nodes. The largest allowed value of $k_i$ is m, and so, for fixed K, we can bound the number of possibilities by $m^K$. Hence, there are $m^K$ applications of Lemma 4 for a fixed N. Since K = N(N-1)/2, we can let $\\delta_k = \\delta/m^K$, so that the sum $\\sum_{k=1}^{m^K} \\delta_k = \\delta$. Choosing \n\n$$\\epsilon\\left(m, K, \\frac{\\delta_k}{2}\\right) = \\frac{65 R^2}{m} \\left( D' \\log(4em) \\log(4m) + \\log \\frac{2(2m)^K}{\\delta} \\right) \\qquad (3)$$ \n\nin the applications of Lemma 4 ensures that the probability of any of the statements failing to hold is less than $\\delta/2$. Note that we have replaced the constant $8^2 = 64$ by 65 in order to ensure the continuity from the right required for the application of Lemma 4 and have upper bounded $\\log(4em/k_i)$ by $\\log(4em)$. Applying Vapnik's Lemma [9, page 168] in each case, the probability that the statement of the theorem fails to hold is less than $\\delta$. $\\Box$ \n\nMore details on this style of proof, omitted in this paper for space constraints, can be found in [1]. 
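To see how the quantities in Theorem 1 interact, the following Python sketch evaluates the bound numerically. It is only an illustration under stated assumptions: the function name theorem1_bound and the example values are hypothetical, and the base of the logarithm (taken here as 2) is an assumption, since the text does not state it.

```python
import math

def theorem1_bound(margins, radius, m, delta, log=math.log2):
    # margins: the K node margins gamma_i; radius: R; m: sample size.
    # Assumption: logarithms are base 2; pass log=math.log for natural logs.
    k = len(margins)
    d_prime = sum(1.0 / g ** 2 for g in margins)   # D' = sum of 1 / gamma_i^2
    margin_term = d_prime * log(4 * math.e * m) * log(4 * m)
    # log(2 (2m)^K / delta), expanded to avoid forming a huge power
    confidence_term = log(2) + k * log(2 * m) - log(delta)
    return 130.0 * radius ** 2 / m * (margin_term + confidence_term)

# Hypothetical values: N = 3 classes, so K = 3 nodes, R = 1, m = 10000, delta = 0.05.
narrow = theorem1_bound([0.5, 0.5, 0.5], 1.0, 10000, 0.05)
wide = theorem1_bound([1.0, 1.0, 1.0], 1.0, 10000, 0.05)
```

Doubling every margin divides D' by four, so wide is smaller than narrow, while the dimension of the input space never enters the computation.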
\n\nReferences \n\n[1] K. Bennett, N. Cristianini, J. Shawe-Taylor, and D. Wu. Enlarging the margin in perceptron decision trees. Machine Learning (submitted). http://lara.enm.bris.ac.uk/cig/pubs/ML-PDT.ps. \n\n[2] C. Blake, E. Keogh, and C. Merz. UCI repository of machine learning databases. Dept. of Information and Computer Sciences, University of California, Irvine, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html. \n\n[3] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895-1924, 1998. \n\n[4] J. H. Friedman. Another approach to polychotomous classification. Technical report, Stanford Department of Statistics, 1996. http://www-stat.stanford.edu/reports/friedman/poly.ps.Z. \n\n[5] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: A stepwise procedure for building and training a neural network. In Fogelman-Soulie and Herault, editors, Neurocomputing: Algorithms, Architectures and Applications, NATO ASI. Springer, 1990. \n\n[6] U. Kre\u00dfel. Pairwise classification and support vector machines. In B. Sch\u00f6lkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 255-268. MIT Press, Cambridge, MA, 1999. \n\n[7] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Sch\u00f6lkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 185-208. MIT Press, Cambridge, MA, 1999. \n\n[8] J. Shawe-Taylor and N. Cristianini. Data dependent structural risk minimization for perceptron decision trees. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems, volume 10, pages 336-342. MIT Press, 1999. \n\n[9] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. 
Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982). \n\n[10] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. \n", "award": [], "sourceid": 1773, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}, {"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}]}