{"title": "A Multi-class Linear Learning Algorithm Related to Winnow", "book": "Advances in Neural Information Processing Systems", "page_first": 519, "page_last": 525, "abstract": null, "full_text": "A Multi-class Linear Learning Algorithm Related to Winnow \n\nChris Mesterharm* \n\nRutgers Computer Science Department \n\n110 Frelinghuysen Road \nPiscataway, NJ 08854 \n\nmesterha@paul.rutgers.edu \n\nAbstract \n\nIn this paper, we present Committee, a new multi-class learning algorithm related to the Winnow family of algorithms. Committee is an algorithm for combining the predictions of a set of sub-experts in the on-line mistake-bounded model of learning. A sub-expert is a special type of attribute that predicts with a distribution over a finite number of classes. Committee learns a linear function of sub-experts and uses this function to make class predictions. We provide bounds for Committee that show it performs well when the target can be represented by a few relevant sub-experts. We also show how Committee can be used to solve more traditional problems composed of attributes. This leads to a natural extension that learns on multi-class problems that contain both traditional attributes and sub-experts. \n\n1 Introduction \n\nIn this paper, we present a new multi-class learning algorithm called Committee. Committee learns a $k$ class target function by combining information from a large set of sub-experts. A sub-expert is a special type of attribute that predicts with a distribution over the target classes. The target space consists of linear-max functions. We define these as functions that take a linear combination of sub-expert predictions and return the class with maximum value. It may be useful to think of the sub-experts as individual classifying functions that are attempting to predict the target function.  
Even though the individual sub-experts may not be perfect, Committee attempts to learn a linear-max function that represents the target function. In truth, this picture is not quite accurate. The reason we call them sub-experts and not experts is that even though an individual sub-expert might be poor at prediction, it may be useful when used in a linear-max function. For example, some sub-experts might be used to add constant weights to the linear-max function. \n\nThe algorithm is analyzed for the on-line mistake-bounded model of learning [Lit89]. This is a useful model for a type of incremental learning where an algorithm can use feedback about its current hypothesis to improve its performance. In this model, the algorithm goes through a series of learning trials. A trial is composed of three steps. First, the algorithm receives an instance, in this case, the predictions of the sub-experts. Second, the algorithm predicts a label for the instance; this is the global prediction of Committee. And last, the algorithm receives the true label of the instance; Committee uses this information to update its estimate of the target. The goal of the algorithm is to minimize the total number of prediction mistakes the algorithm makes while learning the target. \n\n*Part of this work was supported by NEC Research Institute, Princeton, NJ. \n\nThe analysis and performance of Committee are similar to those of another learning algorithm, Winnow [Lit89]. Winnow is an algorithm for learning a linear-threshold function that maps attributes in [0, 1] to a binary target. It is an algorithm that is effective when the concept can be represented with a few relevant attributes, irrespective of the behavior of the other attributes. Committee is similar but deals with learning a target that contains only a few relevant sub-experts.  
While learning with sub-experts is interesting in its own right, it turns out the distinction between the two tasks is not significant. We will show in section 4 how to transform attributes from [0, 1] into sub-experts. Using particular transformations, Committee is identical to the Winnow algorithms, Balanced and WMA [Lit89]. Furthermore, we can generalize these transformations to handle attribute problems with multi-class targets. These transformations naturally lead to a hybrid algorithm that allows a combination of sub-experts and attributes for multi-class learning problems. This opens up a range of new practical problems that did not easily fit into the previous framework of [0, 1] attributes and binary classification. \n\n2 Previous work \n\nMany people have successfully tried the Winnow algorithms on real-world tasks. In the course of their work, they have made modifications to the algorithms to fit certain aspects of their problem. These modifications include multi-class extensions. \n\nFor example, [DKR97] use Winnow algorithms on text classification problems. This multi-class problem has a special form; a document can belong to more than one class. Because of this property, it makes sense to learn a different binary classifier for each class. The linear functions are allowed, even desired, to overlap. However, this paper is concerned with cases where this is not possible. For example, in [GR96] the correct spelling of a word must be selected from a set of many possibilities. In this setting, it is more desirable to have the algorithm select a single word. \n\nThe work in [GR96] presents many interesting ideas and modifications of the Winnow algorithms. At a minimum, these modifications are useful for improving the performance of Winnow on those particular problems.  
Part of that work also extends the Winnow algorithm to general multi-class problems. While the results are favorable, the contribution of this paper is to give a different algorithm that has a stronger theoretical foundation for customizing it to a particular multi-class problem. \n\nBlum also works with multi-class Winnow algorithms on the calendar scheduling problem of [MCF+94]. In [Blu95], a modified Winnow is given with theoretical arguments for good performance on certain types of multi-class disjunctions. In this paper, these results are extended, with the new algorithm Committee, to cover a wider range of multi-class linear functions. \n\nOther related theoretical work on multi-class problems includes the regression algorithm EG\u00b1. In [KW97], Kivinen and Warmuth introduce EG\u00b1, an algorithm related to Winnow but used on regression problems. In general, while regression is a useful framework for many multi-class problems, it is not straightforward how to extend regression to the concepts learned by Committee. A particular problem is the inability of current regression techniques to handle 0-1 loss. \n\n3 Algorithm \n\nThis section of the paper describes the details of Committee. Near the end of the section, we will give a formal statement of the algorithm. \n\n3.1 Prediction scheme \n\nAssume there are $n$ sub-experts. Each sub-expert has a positive weight that is used to vote for $k$ different classes; let $w_i$ be the weight of sub-expert $i$. A sub-expert can vote for several classes by spreading its weight with a prediction distribution. For example, if $k = 3$, a sub-expert may give 3/5 of its weight to class 1, 1/5 of its weight to class 2, and 1/5 of its weight to class 3.  
Let $x_i$ represent this prediction distribution, where $x_i^j$ is the fraction of the weight sub-expert $i$ gives to class $j$. The vote for class $j$ is $\\sum_{i=1}^{n} w_i x_i^j$. Committee predicts the class that has the highest vote. (On ties, the algorithm picks one of the classes involved in the tie.) We call the function computed by this prediction scheme a linear-max function, since it is the maximum class value taken from a linear combination of the sub-expert predictions. \n\n3.2 Target function \n\nThe goal of Committee is to minimize the number of mistakes by quickly learning sub-expert weights that correctly classify the target function. Assume there exists $\\mu$, a vector of nonnegative weights that correctly classifies the target. Notice that $\\mu$ can be multiplied by any positive constant without changing the target. To remove this ambiguity, we will normalize the weights to sum to 1, i.e., $\\sum_{i=1}^{n} \\mu_i = 1$. Let $\\zeta(j)$ be the target's vote for class $j$. \n\n$$\\zeta(j) = \\sum_{i=1}^{n} \\mu_i x_i^j$$ \n\nPart of the difficulty of the learning problem is hidden in the target weights. Intuitively, a target function will be more difficult to learn if there is a small difference between the $\\zeta$ votes of the correct and incorrect classes. We measure this difficulty by looking at the minimum difference, over all trials, of the vote of the correct label and the vote of the other labels. Assume for trial $t$ that $\\rho_t$ is the correct label. \n\n$$\\delta = \\min_{t \\in \\mathrm{Trials}} \\left( \\min_{j \\neq \\rho_t} \\left( \\zeta(\\rho_t) - \\zeta(j) \\right) \\right)$$ \n\nBecause these are the weights of the target, and the target always makes the correct prediction, $\\delta > 0$. \n\nOne problem with the above assumptions is that they do not allow noise (cases where $\\delta \\leq 0$). However, there are variations of the analysis that allow for limited amounts of noise [Lit89, Lit91].  
Also, experimental work [Lit95, LM] shows the family of Winnow algorithms to be much more robust to noise than the theory would predict. Based on the similarity of the algorithm and analysis, and some preliminary experiments, Committee should be able to tolerate some noise. \n\n3.3 Updates \n\nCommittee only updates on mistakes using multiplicative updates. The algorithm starts by initializing all weights to $1/n$. During the trials, let $\\rho$ be the correct label and $\\lambda$ be the predicted label of Committee. When $\\lambda \\neq \\rho$ the weight of each sub-expert $i$ is multiplied by $\\alpha^{x_i^{\\rho} - x_i^{\\lambda}}$. This corresponds to increasing the weights of the sub-experts who predicted the correct label instead of the label Committee predicted. The value of $\\alpha$ is initialized at the start of the algorithm. The optimal value of $\\alpha$ for the bounds depends on $\\delta$. Often $\\delta$ is not known in advance, but experiments on Winnow algorithms suggest that these algorithms are more flexible, often performing well with a wider range of $\\alpha$ values [LM]. Last, the weights are renormalized to sum to 1. While this is not strictly necessary, normalizing has several advantages including reducing the likelihood of underflow/overflow errors. \n\n3.4 Committee code \n\nInitialization \n\n    $\\forall i \\in \\{1, \\ldots, n\\}$ $w_i := 1/n$. \n    Set $\\alpha > 1$. \n\nTrials \n\n    Instance: sub-experts $(x_1, \\ldots, x_n)$. \n    Prediction: $\\lambda$ is the first class $c$ such that for all other classes $j$, $\\sum_{i=1}^{n} w_i x_i^c \\geq \\sum_{i=1}^{n} w_i x_i^j$. \n    Update: Let $\\rho$ be the correct label. If mistake ($\\lambda \\neq \\rho$) \n        for $i := 1$ to $n$ \n            $w_i := \\alpha^{x_i^{\\rho} - x_i^{\\lambda}} w_i$. \n        Normalize weights, $\\sum_{i=1}^{n} w_i = 1$. \n\n3.5 Mistake bound \n\nWe do not have the space to give the proof for the mistake bound of Committee, but the technique is similar to the proof of the Winnow algorithm, Balanced, given in [Lit89].  
For the complete proof, the reader can refer to [Mes99]. \n\nTheorem 1 Committee makes at most $2 \\ln(n) / \\delta^2$ mistakes when the target conditions in section 3.2 are satisfied and $\\alpha$ is set to $(1 - \\delta)^{-1/2}$. \n\nSurprisingly, this bound does not refer to the number of classes. The effects of larger values of $k$ show up indirectly in the $\\delta$ value. \n\nWhile it is not obvious, this bound shows that Committee performs well when the target can be represented by a small fraction of the sub-experts. Call the sub-experts in the target the relevant sub-experts. Since $\\delta$ is a function of the target, $\\delta$ only depends on the relevant sub-experts. On the other hand, the remaining sub-experts have a small effect on the bound since they are only represented in the $\\ln(n)$ factor. This means that the mistake bound of Committee is fairly stable even when adding a large number of additional sub-experts. In truth, this does not mean that the algorithm will have a good bound when there are few relevant sub-experts. In some cases, a small number of sub-experts can give an arbitrarily small $\\delta$ value. (This is a general problem with all the Winnow algorithms.) What it does mean is that, given any problem, increasing the number of irrelevant sub-experts will only have a logarithmic effect on the mistake bound. \n\n4 Attributes to sub-experts \n\nOften there are no obvious sub-experts to use in solving a learning problem. Many times the only information available is a set of attributes. For attributes in [0, 1], we will show how to use Committee to learn a natural kind of $k$ class target function, a linear machine. To learn this target, we will transform each attribute into $k$ separate sub-experts. We will use some of the same notation as Committee to help understand the transformation. 
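
For reference before the transformation is described, the Committee procedure of section 3.4 and the bound of Theorem 1 can be sketched in Python as follows. This is only an illustrative sketch, not code from the paper: the function names, the list-of-lists layout for an instance, and the zero-based class labels are assumptions made for the example.

```python
import math


def alpha_from_delta(delta):
    # Theorem 1 setting: alpha = (1 - delta)^(-1/2), assuming 0 < delta < 1.
    return (1.0 - delta) ** -0.5


def mistake_bound(n, delta):
    # Theorem 1 bound: 2 ln(n) / delta^2.
    return 2.0 * math.log(n) / delta ** 2


def committee_predict(weights, instance):
    # instance[i][j] is the fraction of its weight that sub-expert i
    # gives to class j (hypothetical layout chosen for this sketch).
    k = len(instance[0])
    votes = [sum(w * x[j] for w, x in zip(weights, instance)) for j in range(k)]
    # max returns the first maximal element, so ties go to the first class.
    return max(range(k), key=lambda j: votes[j])


def committee_update(weights, instance, predicted, correct, alpha):
    # Committee only changes its weights after a mistake.
    if predicted == correct:
        return list(weights)
    updated = [w * alpha ** (x[correct] - x[predicted])
               for w, x in zip(weights, instance)]
    total = sum(updated)
    # Renormalize so the weights sum to 1.
    return [w / total for w in updated]


def run_committee(trials, n, alpha):
    # trials is a sequence of (instance, correct_label) pairs.
    weights = [1.0 / n] * n
    mistakes = 0
    for instance, correct in trials:
        predicted = committee_predict(weights, instance)
        if predicted != correct:
            mistakes += 1
        weights = committee_update(weights, instance, predicted, correct, alpha)
    return weights, mistakes
```

With two sub-experts, one always right and one always wrong, a call such as run_committee(trials, 2, 2.0) shifts weight toward the reliable sub-expert after the first mistake. When delta is known, alpha_from_delta gives the value of alpha used in Theorem 1.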
\n\n4.1 Attribute target (linear machine) \n\nA linear machine [DH73] is a prediction function that divides the feature space into disjoint convex regions where each class corresponds to one region. The predictions are made by comparing the value of $k$ different linear functions where each function corresponds to a class. \n\nMore formally, assume there are $m - 1$ attributes and $k$ classes. Let $z_i \\in [0, 1]$ be attribute $i$. Assume the target function is represented using $k$ linear functions of the attributes. Let $\\zeta(j) = \\sum_{i=1}^{m} \\mu_i^j z_i$ be the linear function for class $j$ where $\\mu_i^j$ is the weight of attribute $i$ in class $j$. Notice that we have added one extra attribute. This attribute is set to 1 and is needed for the constant portion of the linear functions. The target function labels an instance with the class of the largest $\\zeta$ function. (Ties are not defined.) Therefore, $\\zeta(j)$ is similar to the voting function for class $j$ used in Committee. \n\n4.2 Transforming the target \n\nOne difficulty with these linear functions is that they may have negative weights. Since Committee only allows targets with nonnegative weights, we need to transform to an equivalent problem that has nonnegative weights. This is not difficult. Since we are only concerned with the relative difference between the $\\zeta$ functions, we are allowed to add any function to the $\\zeta$ functions as long as we add it to all $\\zeta$ functions. This gives us a simple procedure to remove negative weights. For example, if $\\zeta(1) = 3z_1 - 2z_2 + 1z_3 - 4$, we can add $2z_2 + 4$ to every $\\zeta$ function to remove the negative weights from $\\zeta(1)$. It is straightforward to extend this and remove all negative weights. \n\nWe also need to normalize the weights.  
Again, since only the relative difference between the $\\zeta$ functions matters, we can divide all the $\\zeta$ functions by any positive constant. We normalize the weights to sum to 1, i.e., $\\sum_{j=1}^{k} \\sum_{i=1}^{m} \\mu_i^j = 1$. At this point, without loss of generality, assume that the original $\\zeta$ functions are nonnegative and normalized. \n\nThe last step is to identify a $\\delta$ value. We use the same definition of $\\delta$ as Committee, substituting the corresponding $\\zeta$ functions of the linear machine. Assume for trial $t$ that $\\rho_t$ is the correct label. \n\n$$\\delta = \\min_{t \\in \\mathrm{Trials}} \\left( \\min_{j \\neq \\rho_t} \\left( \\zeta(\\rho_t) - \\zeta(j) \\right) \\right)$$ \n\n4.3 Transforming the attributes \n\nThe transformation works as follows: convert attribute $z_i$ into $k$ sub-experts. Each sub-expert will always vote for one of the $k$ classes with value $z_i$. The target weight for each of these sub-experts is the corresponding target weight of the attribute, label pair in the $\\zeta$ functions. Do this for every attribute. \n\nNotice that we are not using distributions for the sub-expert predictions. A sub-expert's prediction can be converted to a distribution by adding a constant amount to each class prediction. For example, a sub-expert that predicts $z_1 = .7$, $z_2 = 0$, $z_3 = 0$ can be changed to $z_1 = .8$, $z_2 = .1$, $z_3 = .1$ by adding .1 to each class. This conversion does not affect the predicting or updating of Committee. \n\nTheorem 2 Committee makes at most $2 \\ln(mk) / \\delta^2$ mistakes on a linear machine, as defined in this section, when $\\alpha$ is set to $(1 - \\delta)^{-1/2}$. \n\nProof: The above target transformation creates $mk$ normalized target sub-experts that vote with the same $\\zeta$ functions as the linear machine. Therefore, this set of sub-experts has the same $\\delta$ value. Plugging these values into the bound for Committee gives the result. 
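
The attribute transformation of section 4.3 can be illustrated with a short Python sketch. It is a hedged example rather than the paper's implementation: the function names, the ordering of the generated sub-experts (attribute by attribute, then class by class), and the flat weight list are all assumptions made for the illustration.

```python
def attributes_to_subexperts(z, k):
    # z holds the m attribute values in [0, 1], including the constant
    # attribute that is always 1. Each attribute yields k sub-experts,
    # one per class, each voting z_i for its own class and 0 elsewhere.
    subexperts = []
    for value in z:
        for c in range(k):
            prediction = [0.0] * k
            prediction[c] = value
            subexperts.append(prediction)
    return subexperts


def to_distribution(prediction):
    # Add the same constant to every class so the prediction sums to 1.
    k = len(prediction)
    pad = (1.0 - sum(prediction)) / k
    return [p + pad for p in prediction]


def subexpert_votes(weights, subexperts):
    # Vote for class j is the weighted sum of the sub-expert predictions.
    k = len(subexperts[0])
    return [sum(w * x[j] for w, x in zip(weights, subexperts)) for j in range(k)]
```

For instance, to_distribution([0.7, 0.0, 0.0]) adds .1 to each class, matching the example in the text. With weights listed in the same attribute-then-class order, subexpert_votes returns one value per class that equals the corresponding linear function of the attributes.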
\n\nThis transformation provides a simple procedure for solving linear machine problems. While the details of the transformation may look cumbersome, the actual implementation of the algorithm is relatively simple. There is no need to explicitly keep track of the sub-experts. Instead, the algorithm can use a linear machine type representation. Each class keeps a vector of weights, one weight for each attribute. During an update, only the correct class weights and the predicted class weights are changed. The correct class weights are multiplied by $\\alpha^{z_i}$; the predicted class weights are multiplied by $\\alpha^{-z_i}$. \n\nThe above procedure is very similar to the Balanced algorithm from [Lit89]; in fact, for $k = 2$, it is identical. A similar transformation duplicates the behavior of the linear-threshold learning version of WMA as given in [Lit89]. \n\nWhile this transformation shows some advantages for $k = 2$, more research is needed to determine the proper way to generalize to the multi-class case. For both of these transformations, the bounds given in this paper are equivalent (except for a superficial adjustment in the $\\delta$ notation of WMA) to the original bounds given in [Lit89]. \n\n4.4 Combining attributes and sub-experts \n\nThese transformations suggest the proper way to do a hybrid algorithm that combines sub-experts and attributes: use the transformations to create new sub-experts from the attributes and combine them with the original sub-experts when running Committee. It may even be desirable to break original sub-experts into attributes and use both in the algorithm because some sub-experts may perform better on certain classes. For example, if it is felt that a sub-expert is particularly good at class 1, we can perform the following transformation. 
\n\nNow, instead of using one weight for the whole sub-expert, Committee can also learn based on the sub-expert's performance for the first class. Even if a good target is representable only with the original sub-experts, these additional sub-experts will not have a large effect because of the logarithmic bound. In the same vein, it may be useful to add constant attributes to a set of sub-experts. These add only $k$ extra sub-experts, but allow the algorithm to represent a larger set of target functions. \n\n5 Conclusion \n\nIn this paper, we have introduced Committee, a multi-class learning algorithm. We feel that this algorithm will be important in practice, extending the range of problems that can be handled by the Winnow family of algorithms. With a solid theoretical foundation, researchers can customize Winnow algorithms to handle various multi-class problems. \n\nPart of this customization includes feature transformations. We show how Committee can handle general linear machine problems by transforming attributes into sub-experts. This suggests a way to do a hybrid learning algorithm that allows a combination of sub-experts and attributes. These same techniques can also be used to add to the representational power on a standard sub-expert problem. \n\nIn the future, we plan to empirically test Committee and the feature transformations on real world problems. Part of this testing will include modifying the algorithm to use extra information that is related to the proof technique [Mes99], in an attempt to lower the number of mistakes. We speculate that adjusting the multiplier to increase the change in progress per trial will be useful for certain types of multi-class problems. 
\n\nAcknowledgments \n\nWe thank Nick Littlestone for stimulating this work by suggesting techniques for converting the Balanced algorithm to multi-class targets. Also we thank Haym Hirsh, Nick Littlestone and Warren Smith for providing valuable comments and corrections. \n\nReferences \n\n[Blu95] Avrim Blum. Empirical support for winnow and weighted-majority algorithms: results on a calendar scheduling domain. In ML-95, pages 64-72, 1995. \n\n[DH73] R. O. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973. \n\n[DKR97] I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization. In EMNLP-97, pages 55-63, 1997. \n\n[GR96] A. R. Golding and D. Roth. Applying winnow to context-sensitive spelling correction. In ML-96, 1996. \n\n[KW97] Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-64, 1997. \n\n[Lit89] Nick Littlestone. Mistake bounds and linear-threshold learning algorithms. PhD thesis, University of California, Santa Cruz, 1989. Technical Report UCSC-CRL-89-11. \n\n[Lit91] Nick Littlestone. Redundant noisy attributes, attribute errors, and linear-threshold learning using winnow. In COLT-91, pages 147-156, 1991. \n\n[Lit95] Nick Littlestone. Comparing several linear-threshold learning algorithms on tasks involving superfluous attributes. In ML-95, pages 353-361, 1995. \n\n[LM] Nick Littlestone and Chris Mesterharm. A simulation study of winnow and related algorithms. Work in progress. \n\n[MCF+94] T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski. Experience with a personal learning assistant. CACM, 37(7):81-91, 1994. \n\n[Mes99] Chris Mesterharm. A multi-class linear learning algorithm related to winnow with proof. Technical report, Rutgers University, 1999. 
", "award": [], "sourceid": 1718, "authors": [{"given_name": "Chris", "family_name": "Mesterharm", "institution": null}]}