{"title": "A Geometric Interpretation of v-SVM Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 244, "page_last": 250, "abstract": null, "full_text": "A Geometric Interpretation of $\\nu$-SVM Classifiers\n\nDavid J. Crisp\nCentre for Sensor Signal and Information Processing,\nDepartment of Electrical Engineering,\nUniversity of Adelaide, South Australia\ndcrisp@eleceng.adelaide.edu.au\n\nChristopher J.C. Burges\nAdvanced Technologies,\nBell Laboratories,\nLucent Technologies\nHolmdel, New Jersey\nburges@lucent.com\n\nAbstract\n\nWe show that the recently proposed variant of the Support Vector machine (SVM) algorithm, known as $\\nu$-SVM, can be interpreted as a maximal separation between subsets of the convex hulls of the data, which we call soft convex hulls. The soft convex hulls are controlled by the choice of the parameter $\\nu$. If the intersection of the convex hulls is empty, the hyperplane is positioned halfway between them such that the distance between convex hulls, measured along the normal, is maximized; and if it is not, the hyperplane's normal is similarly determined by the soft convex hulls, but its position (perpendicular distance from the origin) is adjusted to minimize the error sum. The proposed geometric interpretation of $\\nu$-SVM also leads to necessary and sufficient conditions for the existence of a choice of $\\nu$ for which the $\\nu$-SVM solution is nontrivial.\n\n1 Introduction\n\nRecently, Schölkopf et al. [1] introduced a new class of SVM algorithms, called $\\nu$-SVM, for both regression estimation and pattern recognition. The basic idea is to remove the user-chosen error penalty factor C that appears in SVM algorithms by introducing a new variable $\\rho$ which, in the pattern recognition case, adds another degree of freedom to the margin.\n
For a given normal to the separating hyperplane, the size of the margin increases linearly with $\\rho$. It turns out that by adding $\\rho$ to the primal objective function with coefficient $-\\nu$, $\\nu \\geq 0$, the variable C can be absorbed, and the behaviour of the resulting SVM (the number of margin errors and the number of support vectors) can to some extent be controlled by setting $\\nu$. Moreover, the decision function produced by $\\nu$-SVM can also be produced by the original SVM algorithm with a suitable choice of C.\n\nIn this paper we show that $\\nu$-SVM, for the pattern recognition case, has a clear geometric interpretation, which also leads to necessary and sufficient conditions for the existence of a nontrivial solution to the $\\nu$-SVM problem. All our considerations apply to feature space, after the mapping of the data induced by some kernel. We adopt the usual notation: $w$ is the normal to the separating hyperplane, the mapped data is denoted by $x_i \\in \\mathbb{R}^N$, $i = 1, \\ldots, l$, with corresponding labels $y_i \\in \\{\\pm 1\\}$, $b$ and $\\rho$ are scalars, and $\\xi_i$, $i = 1, \\ldots, l$, are positive scalar slack variables.\n\n2 $\\nu$-SVM Classifiers\n\nThe $\\nu$-SVM formulation, as given in [1], is as follows: minimize\n\n$$P' = \\frac{1}{2} \\|w'\\|^2 - \\nu \\rho' + \\frac{1}{l} \\sum_i \\xi'_i \\tag{1}$$\n\nwith respect to $w'$, $b'$, $\\rho'$, $\\xi'_i$, subject to:\n\n$$y_i (w' \\cdot x_i + b') \\geq \\rho' - \\xi'_i, \\quad \\xi'_i \\geq 0, \\quad \\rho' \\geq 0. \\tag{2}$$\n\nHere $\\nu$ is a user-chosen parameter between 0 and 1. The decision function (whose sign determines the label given to a test point $x$) is then:\n\n$$f'(x) = w' \\cdot x + b'. \\tag{3}$$\n\nThe Wolfe dual of this problem is: maximize $P'_D = -\\frac{1}{2} \\sum_{ij} \\alpha_i \\alpha_j y_i y_j \\, x_i \\cdot x_j$ subject to\n\n$$0 \\leq \\alpha_i \\leq \\frac{1}{l}, \\quad \\sum_i \\alpha_i y_i = 0, \\quad \\sum_i \\alpha_i \\geq \\nu \\tag{4}$$\n\nwith $w'$ given by $w' = \\sum_i \\alpha_i y_i x_i$. Schölkopf et al. 
\n[1] show that $\\nu$ is an upper bound on the fraction of margin errors¹, a lower bound on the fraction of support vectors, and that both of these quantities approach $\\nu$ asymptotically.\n\nNote that the point $w' = b' = \\rho' = \\xi'_i = 0$ is feasible, and that at this point, $P' = 0$. Thus any solution of interest must have $P' \\leq 0$. Furthermore, if $\\nu \\rho' = 0$, the optimal solution is at $w' = b' = \\rho' = \\xi'_i = 0$². Thus we can assume that $\\nu \\rho' > 0$ (and therefore $\\nu > 0$) always. Given this, the constraint $\\rho' \\geq 0$ is in fact redundant: a negative value of $\\rho'$ cannot appear in a solution (to the problem with this constraint removed) since the above (feasible) solution (with $\\rho' = 0$) gives a lower value for $P'$. Thus below we replace the constraints (2) by\n\n$$y_i (w' \\cdot x_i + b') \\geq \\rho' - \\xi'_i, \\quad \\xi'_i \\geq 0. \\tag{5}$$\n\n¹ A margin error $x_i$ is defined to be any point for which $\\xi_i > 0$ (see [1]).\n² In fact we can prove that, even if the optimal solution is not unique, the global solutions still all have $w = 0$: see Burges and Crisp, \"Uniqueness of the SVM Solution\" in this volume.\n\n2.1 A Reparameterization of $\\nu$-SVM\n\nWe reparameterize the primal problem by dividing the objective function $P'$ by $\\nu^2/2$, the constraints (5) by $\\nu$, and by making the following substitutions:\n\n$$\\mu = \\frac{2}{\\nu l}, \\quad w = \\frac{w'}{\\nu}, \\quad b = \\frac{b'}{\\nu}, \\quad \\rho = \\frac{\\rho'}{\\nu}, \\quad \\xi_i = \\frac{\\xi'_i}{\\nu}. \\tag{6}$$\n\nThis gives the equivalent formulation: minimize\n\n$$P = \\|w\\|^2 - 2\\rho + \\mu \\sum_i \\xi_i \\tag{7}$$\n\nwith respect to $w$, $b$, $\\rho$, $\\xi_i$, subject to:\n\n$$y_i (w \\cdot x_i + b) \\geq \\rho - \\xi_i, \\quad \\xi_i \\geq 0. \\tag{8}$$\n\nIf we use as decision function $f(x) = f'(x)/\\nu$, the formulation is exactly equivalent, although both primal and dual appear different. The dual problem is now: minimize\n\n$$\\frac{1}{4} \\sum_{ij} \\alpha_i \\alpha_j y_i y_j \\, x_i \\cdot x_j \\tag{9}$$\n\nwith respect to the $\\alpha_i$, subject to:\n\n$$0 \\leq \\alpha_i \\leq \\mu, \\quad \\sum_i \\alpha_i y_i = 0, \\quad \\sum_i \\alpha_i = 2 \\tag{10}$$\n\nwith $w$ given by $w = \\frac{1}{2} \\sum_i \\alpha_i y_i x_i$.\n
In the following, we will refer to the reparameterized version of $\\nu$-SVM given above as $\\mu$-SVM, although we emphasize that it describes the same problem.\n\n3 A Geometric Interpretation of $\\nu$-SVM\n\nIn the separable case, it is clear that the optimal separating hyperplane is just that hyperplane which bisects the shortest vector joining the convex hulls of the positive and negative polarity points³. We now show that this geometric interpretation can be extended to the case of $\\nu$-SVM for both separable and nonseparable cases.\n\n³ See, for example, K. Bennett, 1997, in http://www.rpi.edu/bennek/svmtalk.ps (also, to appear).\n\n3.1 The Separable Case\n\nWe start by giving the analysis for the separable case. The convex hulls of the two classes are\n\n$$H_+ = \\Big\\{ \\sum_{i: y_i = +1} \\alpha_i x_i \\;\\Big|\\; \\sum_{i: y_i = +1} \\alpha_i = 1, \\; \\alpha_i \\geq 0 \\Big\\} \\tag{11}$$\n\nand\n\n$$H_- = \\Big\\{ \\sum_{i: y_i = -1} \\alpha_i x_i \\;\\Big|\\; \\sum_{i: y_i = -1} \\alpha_i = 1, \\; \\alpha_i \\geq 0 \\Big\\}. \\tag{12}$$\n\nFinding the two closest points can be written as the following optimization problem:\n\n$$\\min_{\\alpha} \\Big\\| \\sum_{i: y_i = +1} \\alpha_i x_i - \\sum_{i: y_i = -1} \\alpha_i x_i \\Big\\|^2 \\tag{13}$$\n\nsubject to:\n\n$$\\sum_{i: y_i = +1} \\alpha_i = 1, \\quad \\sum_{i: y_i = -1} \\alpha_i = 1, \\quad \\alpha_i \\geq 0. \\tag{14}$$\n\nTaking the decision boundary $f(x) = w \\cdot x + b = 0$ to be the perpendicular bisector of the line segment joining the two closest points means that at the solution,\n\n$$w = \\frac{1}{2} \\Big( \\sum_{i: y_i = +1} \\alpha_i x_i - \\sum_{i: y_i = -1} \\alpha_i x_i \\Big) \\tag{15}$$\n\nand $b = -w \\cdot p$, where\n\n$$p = \\frac{1}{2} \\Big( \\sum_{i: y_i = +1} \\alpha_i x_i + \\sum_{i: y_i = -1} \\alpha_i x_i \\Big). \\tag{16}$$\n\nThus $w$ lies along the line segment (and is half its size) and $p$ is the midpoint of the line segment. By rescaling the objective function and using the class labels $y_i = \\pm 1$ we can rewrite this as⁴:\n\n$$\\min_{\\alpha} \\frac{1}{4} \\sum_{ij} \\alpha_i \\alpha_j y_i y_j \\, x_i \\cdot x_j \\tag{17}$$\n\nsubject to\n\n$$\\sum_i \\alpha_i y_i = 0, \\quad \\sum_i \\alpha_i = 2, \\quad \\alpha_i \\geq 0. \\tag{18}$$\n\nThe associated decision function is $f(x) = w \\cdot x + b$ where $w = \\frac{1}{2} \\sum_i \\alpha_i y_i x_i$, $p = \\frac{1}{2} \\sum_i \\alpha_i x_i$ and $b = -w \\cdot p = -\\frac{1}{4} \\sum_{ij} \\alpha_i y_i \\alpha_j \\, x_i \\cdot x_j$.\n\n⁴ That one can rescale the objective function without changing the constraints follows from uniqueness of the solution. See also Burges and Crisp, \"Uniqueness of the SVM Solution\" in this volume.\n\n3.2 The Connection with $\\nu$-SVM\n\nConsider now the two sets of points defined by:\n\n$$H_{+\\mu} = \\Big\\{ \\sum_{i: y_i = +1} \\alpha_i x_i \\;\\Big|\\; \\sum_{i: y_i = +1} \\alpha_i = 1, \\; 0 \\leq \\alpha_i \\leq \\mu \\Big\\} \\tag{19}$$\n\nand\n\n$$H_{-\\mu} = \\Big\\{ \\sum_{i: y_i = -1} \\alpha_i x_i \\;\\Big|\\; \\sum_{i: y_i = -1} \\alpha_i = 1, \\; 0 \\leq \\alpha_i \\leq \\mu \\Big\\}. \\tag{20}$$\n\nWe have the following simple proposition:\n\nProposition 1: $H_{+\\mu} \\subset H_+$ and $H_{-\\mu} \\subset H_-$, and $H_{+\\mu}$ and $H_{-\\mu}$ are both convex sets. Furthermore, the positions of the points of $H_{+\\mu}$ and $H_{-\\mu}$ with respect to the $x_i$ do not depend on the choice of origin.\n\nProof: Clearly, since the set of $\\alpha_i$ admissible in $H_{+\\mu}$ is a subset of the set admissible in $H_+$, $H_{+\\mu} \\subset H_+$; similarly for $H_-$. Now consider two points in $H_{+\\mu}$ defined by $\\alpha_1$, $\\alpha_2$. Then all points on the line joining these two points can be written as $\\sum_{i: y_i = +1} ((1 - \\lambda) \\alpha_{1i} + \\lambda \\alpha_{2i}) x_i$, $0 \\leq \\lambda \\leq 1$. Since $\\alpha_{1i}$ and $\\alpha_{2i}$ both satisfy $0 \\leq \\alpha_i \\leq \\mu$, so does $(1 - \\lambda) \\alpha_{1i} + \\lambda \\alpha_{2i}$, and since also $\\sum_{i: y_i = +1} ((1 - \\lambda) \\alpha_{1i} + \\lambda \\alpha_{2i}) = 1$, the set $H_{+\\mu}$ is convex. The argument for $H_{-\\mu}$ is similar. Finally, suppose that every $x_i$ is translated by $x_0$, i.e. $x_i \\rightarrow x_i + x_0 \\; \\forall i$. Then since $\\sum_{i: y_i = +1} \\alpha_i = 1$, every point in $H_{+\\mu}$ is also translated by the same amount; similarly for $H_{-\\mu}$. □\n\nThe problem of finding the optimal separating hyperplane between the convex sets $H_{+\\mu}$ and $H_{-\\mu}$ then becomes:\n\n$$\\min_{\\alpha} \\frac{1}{4} \\sum_{ij} \\alpha_i \\alpha_j y_i y_j \\, x_i \\cdot x_j \\tag{21}$$\n\nsubject to\n\n$$\\sum_i \\alpha_i y_i = 0, \\quad \\sum_i \\alpha_i = 2, \\quad 0 \\leq \\alpha_i \\leq \\mu. \\tag{22}$$\n\nSince Eqs. (21) and (22) are identical to (9) and (10), we see that the $\\nu$-SVM algorithm is in fact finding the optimal separating hyperplane between the convex sets $H_{+\\mu}$ and $H_{-\\mu}$. We note that the convex sets $H_{+\\mu}$ and $H_{-\\mu}$ are not simply uniformly scaled versions of $H_+$ and $H_-$. An example is shown in Figure 1.\n\n[Figure 1 plot not recovered; panels labelled $\\mu = 1/3$ and $\\mu = 5/12$, axes $x_1$ and $x_2$.]\n\nFigure 1: The soft convex hull for the vertices of a right isosceles triangle, for various $\\mu$. Note how the shape changes as the set grows and is constrained by the boundaries of the encapsulating convex hull. For $\\mu < \\frac{1}{3}$, the set is empty.\n\nBelow, we will refer to the formulation given in this section as the soft convex hull formulation, and the sets of points defined in Eqs. (19) and (20) as soft convex hulls.\n\n3.3 Comparing the Offsets and Margin Widths\n\nThe natural value of the offset in the soft convex hull approach, $\\bar{b} = -w \\cdot p$, arose by asking that the separating hyperplane lie halfway between the closest extremities of the two soft convex hulls. Different choices of the offset just amount to hyperplanes with the same normal but at different perpendicular distances from the origin. This value $\\bar{b}$ will not in general be the same as the value $b$ for which the cost term in Eq. (7) is minimized. We can compare the two values as follows. The KKT conditions for the $\\mu$-SVM formulation are\n\n$$(\\mu - \\alpha_i) \\xi_i = 0 \\tag{23}$$\n\n$$\\alpha_i (y_i (w \\cdot x_i + b) - \\rho + \\xi_i) = 0 \\tag{24}$$\n\nMultiplying (24) by $y_i$, summing over $i$ and using (23) together with $\\sum_i \\alpha_i y_i = 0$, $\\sum_i \\alpha_i = 2$ and $\\sum_i \\alpha_i x_i = 2p$ gives\n\n$$b = \\bar{b} - \\frac{\\mu}{2} \\sum_i y_i \\xi_i. \\tag{25}$$\n\nThus the separating hyperplane found in the $\\mu$-SVM algorithm sits a perpendicular distance $\\left| \\frac{\\mu}{2 \\|w\\|} \\sum_i y_i \\xi_i \\right|$ away from that found in the soft convex hull formulation. For the given $w$, this choice of $b$ results in the lowest value of the cost, $\\mu \\sum_i \\xi_i$.\n\nThe soft convex hull approach suggests taking $\\bar{\\rho} = w \\cdot w$, since this is the value $|\\bar{f}|$ takes at the points $\\sum_{i: y_i = +1} \\alpha_i x_i$ and $\\sum_{i: y_i = -1} \\alpha_i x_i$, where $\\bar{f}(x) = w \\cdot x + \\bar{b}$. Again, we can use the KKT conditions to compare this with $\\rho$. Summing (24) over $i$ and using (23) gives\n\n$$\\rho = \\bar{\\rho} + \\frac{\\mu}{2} \\sum_i \\xi_i. \\tag{26}$$\n\nSince $\\bar{\\rho} = w \\cdot w$, this again shows that if $\\rho = 0$ then $w = \\xi_i = 0$, and, by (25), $b = 0$.\n\n3.4 The Primal for the Soft Convex Hull Formulation\n\nBy substituting (25) and (26) into the $\\mu$-SVM primal formulation (7) and (8) we obtain the primal formulation for the soft convex hull problem: minimize\n\n$$\\|w\\|^2 - 2 \\bar{\\rho} \\tag{27}$$\n\nwith respect to $w$, $\\bar{b}$, $\\bar{\\rho}$, $\\xi_i$, subject to:\n\n$$y_i (w \\cdot x_i + \\bar{b}) \\geq \\bar{\\rho} - \\xi_i + \\mu \\sum_j \\frac{1 + y_i y_j}{2} \\xi_j, \\quad \\xi_i \\geq 0. \\tag{28}$$\n\nIt is straightforward to check that the dual is exactly (9) and (10). Moreover, by summing the relevant KKT conditions, as above, we see that $\\bar{b} = -w \\cdot p$ and $\\bar{\\rho} = w \\cdot w$. Note that in this formulation the variables $\\xi_i$ retain their meaning according to (8).\n\n4 Choosing $\\nu$\n\nIn this section we establish some results on the choices for $\\nu$, using the $\\mu$-SVM formulation. First, note that $\\sum_i \\alpha_i y_i = 0$ and $\\sum_i \\alpha_i = 2$ implies $\\sum_{i: y_i = +1} \\alpha_i = \\sum_{i: y_i = -1} \\alpha_i = 1$. Then $\\alpha_i \\geq 0$ gives $\\alpha_i \\leq 1$, $\\forall i$. Thus choosing $\\mu > 1$, which corresponds to choosing $\\nu < 2/l$, results in the same solution of the dual (and hence the same normal $w$) as choosing $\\mu = 1$. (Note that different values of $\\mu > 1$ can still result in different values of the other primal variables, e.g. $b$.)\n\nThe equalities $\\sum_{i: y_i = +1} \\alpha_i = \\sum_{i: y_i = -1} \\alpha_i = 1$ also show that if $\\mu < 2/l$ then the feasible region for the dual is empty and hence the problem is insoluble. This corresponds to the requirement $\\nu \\leq 1$. However, we can improve upon this. Let $l_+$ ($l_-$) be the number of positive (negative) polarity points, so that $l_+ + l_- = l$. Let $l_{\\min} \\equiv \\min\\{l_+, l_-\\}$. Then the minimal value of $\\mu$ which still results in a nonempty feasible region is $\\mu_{\\min} = 1/l_{\\min}$. This gives the condition $\\nu \\leq 2 l_{\\min} / l$.\n\nWe define a \"nontrivial\" solution of the problem to be any solution with $w \\neq 0$. 
\nThe following proposition gives conditions for the existence of nontrivial solutions.\n\nProposition 2: A value of $\\nu$ exists which will result in a nontrivial solution to the $\\nu$-SVM classification problem if and only if $\\{H_{+\\mu} : \\mu = \\mu_{\\min}\\} \\cap \\{H_{-\\mu} : \\mu = \\mu_{\\min}\\} = \\emptyset$.\n\nProof: Suppose that $\\{H_{+\\mu} : \\mu = \\mu_{\\min}\\} \\cap \\{H_{-\\mu} : \\mu = \\mu_{\\min}\\} \\neq \\emptyset$. Then for all allowable values of $\\mu$ (and hence $\\nu$), the two convex hulls will intersect, since $\\{H_{+\\mu} : \\mu = \\mu_{\\min}\\} \\subset \\{H_{+\\mu} : \\mu \\geq \\mu_{\\min}\\}$ and $\\{H_{-\\mu} : \\mu = \\mu_{\\min}\\} \\subset \\{H_{-\\mu} : \\mu \\geq \\mu_{\\min}\\}$. If the two convex hulls intersect, then the solution is trivial, since by definition there then exist feasible points $z$ such that $z = \\sum_{i: y_i = +1} \\alpha_i x_i$ and $z = \\sum_{i: y_i = -1} \\alpha_i x_i$, and hence $2w = \\sum_i \\alpha_i y_i x_i = \\sum_{i: y_i = +1} \\alpha_i x_i - \\sum_{i: y_i = -1} \\alpha_i x_i = 0$ (cf. (21), (22)). Now suppose that $\\{H_{+\\mu} : \\mu = \\mu_{\\min}\\} \\cap \\{H_{-\\mu} : \\mu = \\mu_{\\min}\\} = \\emptyset$. Then clearly a nontrivial solution exists, since the shortest distance between the two convex sets $\\{H_{+\\mu} : \\mu = \\mu_{\\min}\\}$ and $\\{H_{-\\mu} : \\mu = \\mu_{\\min}\\}$ is not zero, hence the corresponding $w \\neq 0$. □\n\nNote that when $l_+ = l_-$, the condition amounts to the requirement that the centroid of the positive examples does not coincide with that of the negative examples. Note also that this shows that, given a data set, one can find a lower bound on $\\nu$, by finding the largest $\\mu$ that satisfies $H_{-\\mu} \\cap H_{+\\mu} = \\emptyset$.\n\n5 Discussion\n\nThe soft convex hull interpretation suggests that an appropriate way to penalize positive polarity errors differently from negative is to replace the sum $\\mu \\sum_i \\xi_i$ in (7) with $\\mu_+ \\sum_{i: y_i = +1} \\xi_i + \\mu_- \\sum_{i: y_i = -1} \\xi_i$. In fact one can go further and introduce a $\\mu$ for every train point.\n
The $\\mu$-SVM formulation makes this possibility explicit, which it is not in the original $\\nu$-SVM formulation.\n\nNote also that the fact that $\\nu$-SVM leads to values of $b$ which differ from that which would place the optimal hyperplane halfway between the soft convex hulls suggests that there may be principled methods for choosing the best $b$ for a given problem, other than that dictated by minimizing the sum of the $\\xi_i$'s. Indeed, originally, the sum of $\\xi_i$'s term arose in an attempt to approximate the number of errors on the train set [2]. The above reasoning in a sense separates the justification for $w$ from that for $b$. For example, given $w$, a simple line search could be used to find that value of $b$ which actually does minimize the number of errors on the train set. Other methods (for example, minimizing the estimated Bayes error [3]) may also prove useful.\n\nAcknowledgments\n\nC. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Technologies for their support.\n\nReferences\n\n[1] B. Schölkopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. NeuroCOLT2 technical report NC2-TR-1998-031, GMD First and Australian National University, 1998.\n\n[2] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.\n\n[3] C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.\n", "award": [], "sourceid": 1687, "authors": [{"given_name": "David", "family_name": "Crisp", "institution": null}, {"given_name": "Christopher", "family_name": "Burges", "institution": null}]}