{"title": "The Entropy Regularization Information Criterion", "book": "Advances in Neural Information Processing Systems", "page_first": 342, "page_last": 348, "abstract": null, "full_text": "The Entropy Regularization Information Criterion \n\nAlex J. Smola \n\nDept. of Engineering and RSISE \nAustralian National University \nCanberra ACT 0200, Australia \n\nAlex.Smola@anu.edu.au \n\nJohn Shawe-Taylor \n\nRoyal Holloway College \n\nUniversity of London \n\nEgham, Surrey TW20 0EX, UK \n\njohn@dcs.rhbnc.ac.uk \n\nBernhard Sch\u00f6lkopf \n\nMicrosoft Research Limited \n\nSt. George House, 1 Guildhall Street \n\nCambridge CB2 3NH \nbsc@microsoft.com \n\nRobert C. Williamson \nDept. of Engineering \n\nAustralian National University \nCanberra ACT 0200, Australia \nBob.Williamson@anu.edu.au \n\nAbstract \n\nEffective methods of capacity control via uniform convergence bounds for function expansions have been largely limited to Support Vector machines, where good bounds are obtainable by the entropy number approach. We extend these methods to systems with expansions in terms of arbitrary (parametrized) basis functions and a wide range of regularization methods covering the whole range of general linear additive models. This is achieved by a data-dependent analysis of the eigenvalues of the corresponding design matrix. \n\n1 INTRODUCTION \n\nModel selection criteria based on the Vapnik-Chervonenkis (VC) dimension are known to be difficult to obtain, worst case, and often not very tight. Yet they have the theoretical appeal of providing bounds, with few or no assumptions made. \n\nRecently new methods [8, 7, 6] have been developed which are able to provide a better characterization of the complexity of function classes than the VC dimension, and moreover, are easily obtainable and take advantage of the data at hand (i.e. they employ the concept of luckiness). 
These techniques, however, have been limited to linear functions or expansions of functions in terms of kernels as happens to be the case in Support Vector (SV) machines. \n\nIn this paper we show that the previously mentioned techniques can be extended to expansions in terms of arbitrary basis functions, covering a large range of practical algorithms such as general linear models, weight decay, sparsity regularization [3], and regularization networks [4]. \n\n2 SUPPORT VECTOR MACHINES \n\nSupport Vector machines carry out an effective means of capacity control by minimizing a weighted sum of the training error \n\n$$R_{\\mathrm{emp}}[f] := \\frac{1}{m} \\sum_{i=1}^{m} c(x_i, y_i, f(x_i)) \\tag{1}$$ \n\nand a regularization term $Q[f] = \\frac{1}{2}\\|w\\|^2$; i.e. they minimize the regularized risk functional \n\n$$R_{\\mathrm{reg}}[f] := R_{\\mathrm{emp}}[f] + \\lambda Q[f] = \\frac{1}{m} \\sum_{i=1}^{m} c(x_i, y_i, f(x_i)) + \\frac{\\lambda}{2}\\|w\\|^2. \\tag{2}$$ \n\nHere $X := \\{x_1, \\ldots, x_m\\} \\subset \\mathcal{X}$ denotes the training set, $Y := \\{y_1, \\ldots, y_m\\} \\subset \\mathcal{Y}$ the corresponding labels (target values), $\\mathcal{X}, \\mathcal{Y}$ the corresponding domains, $\\lambda > 0$ a regularization constant, $c : \\mathcal{X} \\times \\mathcal{Y} \\times \\mathcal{Y} \\to \\mathbb{R}_0^+$ a cost function, and $f : \\mathcal{X} \\to \\mathcal{Y}$ is given by \n\n$$f(x) := \\langle x, w \\rangle, \\quad \\text{or in the nonlinear case} \\quad f(x) := \\langle \\Phi(x), w \\rangle. \\tag{3}$$ \n\nHere $\\Phi : \\mathcal{X} \\to \\mathcal{F}$ is a map into a feature space $\\mathcal{F}$. Finally, dot products in feature space can be written as $\\langle \\Phi(x), \\Phi(x') \\rangle = k(x, x')$ where $k$ is a so-called Mercer kernel. \n\nFor $n \\in \\mathbb{N}$, $\\mathbb{R}^n$ denotes the $n$-dimensional space of vectors $x = (x_1, \\ldots, x_n)$. We define spaces $\\ell_p^n$ as follows: as vector spaces, they are identical to $\\mathbb{R}^n$; in addition, they are endowed with $p$-norms: \n\n$$\\|x\\|_{\\ell_p^n} := \\Big(\\sum_{j=1}^{n} |x_j|^p\\Big)^{1/p} \\ \\text{for } 0 < p < \\infty, \\qquad \\|x\\|_{\\ell_\\infty^n} := \\max_{j=1,\\ldots,n} |x_j| \\ \\text{for } p = \\infty.$$ \n\nWe write $\\ell_p = \\ell_p^\\infty$. Furthermore let $U_{\\ell_p^n} := \\{x : \\|x\\|_{\\ell_p^n} \\le 1\\}$ be the unit $\\ell_p^n$-ball. \n\nFor model selection purposes one wants to obtain bounds on the richness of the map $S_X$ \n\n$$S_X : w \\mapsto (f(x_1), \\ldots, f(x_m)) = (\\langle \\Phi(x_1), w \\rangle, \\ldots, \\langle \\Phi(x_m), w \\rangle)$$ 
\n\n(4) \n\nwhere $w$ is restricted to an $\\ell_2$ ball of some radius $\\Lambda$ (this is equivalent to choosing an appropriate value of $\\lambda$: an increase in $\\lambda$ decreases $\\Lambda$ and vice versa). By the \"richness\" of $S_X$ specifically we mean the $\\ell_\\infty^m$ $\\epsilon$-covering numbers $\\mathcal{N}(\\epsilon, S_X(\\Lambda U_{\\ell_p^m}), \\ell_\\infty^m)$ of the set $S_X(\\Lambda U_{\\ell_p^m})$. In the standard COLT notation, we mean \n\n$$\\mathcal{N}(\\epsilon, S_X(\\Lambda U_{\\ell_p^m}), \\ell_\\infty^m) := \\min \\Big\\{ n : \\text{there exists a set } \\{z_1, \\ldots, z_n\\} \\subset F \\text{ such that for all } z \\in S_X(\\Lambda U_{\\ell_p^m}) \\text{ we have } \\min_{1 \\le i \\le n} \\|z - z_i\\|_{\\ell_\\infty^m} \\le \\epsilon \\Big\\}.$$ \n\nSee [8] for further details. \n\nWhen carrying out model selection in this case, advanced methods [6] exploit the distribution of $X$ mapped into feature space $\\mathcal{F}$, and thus the spectral properties of the operator $S_X$, by analyzing the spectrum of the Gram matrix $G = [g_{ij}]_{ij}$, where $g_{ij} := k(x_i, x_j)$. \n\nAll this is possible since $k(x_i, x_j)$ can be seen as a dot product of $x_i, x_j$ mapped into some feature space $\\mathcal{F}$, i.e. $k(x_i, x_j) = \\langle \\Phi(x_i), \\Phi(x_j) \\rangle$. This property, whilst true for SV machines with Mercer kernels, does not hold in the general case where $f$ is expanded in terms of more or less arbitrary basis functions. \n\n3 THE BASIC PROBLEMS \n\nOne basic problem is that when expanding $f$ into \n\n$$f(x) = \\sum_{i=1}^{n} \\alpha_i f_i(x) \\tag{5}$$ \n\nwith $f_i(x)$ being arbitrary functions, it is not immediately obvious how to regard $f$ as a dot product in some feature space. One can show that the VC dimension of a set of $n$ linearly independent functions is $n$. Hence one would intuitively try to restrict the class of admissible models by controlling the number of basis functions $n$ in terms of which $f$ can be expanded. \n\nNow consider an extreme case. 
In addition to the $n$ basis functions $f_i$ defined previously, we are given $n$ further basis functions $f_i'$, linearly independent of the previous ones, which differ from $f_i$ only on a small domain $\\mathcal{X}'$, i.e. $f_i|_{\\mathcal{X} \\setminus \\mathcal{X}'} = f_i'|_{\\mathcal{X} \\setminus \\mathcal{X}'}$. Since this new set of functions is linearly independent, the VC dimension of the joint set is given by $2n$. On the other hand, if hardly any data occurs on the domain $\\mathcal{X}'$, one would not notice the difference between $f_i$ and $f_i'$. In other words, the joint system of functions would behave as if we only had the initial system of $n$ basis functions. \n\nAn analogous situation occurs if $f_i' = f_i + \\epsilon g_i$ where $\\epsilon$ is a small constant and $g_i$ is bounded, say, within $[0, 1]$. Again, in this case, the additional effect of the set of functions $f_i'$ would be hardly noticeable, but still, the joint set of functions would count as one with VC dimension $2n$. This already indicates that simply counting the number of basis functions may not be a good idea after all. \n\nFigure 1: From left to right: (a) initial set of functions $f_1, \\ldots, f_5$ (dots on the x-axis indicate sampling points); (b) additional set of functions $f_1', \\ldots, f_5'$ which differ globally, but only by a small amount; (c) additional set of functions $f_1', \\ldots, f_5'$ which differ locally, however by a large amount; (d) spectrum of the corresponding design matrices - the bars denote the cases (a)-(c) in the corresponding order. Note that the difference is quite small. \n\nOn the other hand, the spectra of the corresponding design matrices (see Figure 1) are very similar. This suggests the use of the latter for a model selection criterion. \n\nFinally we have the practical problem that capacity control, which in SV machines was carried out by minimizing the length of the \"weight vector\" $w$ in feature space, cannot be done in an analogous way either. There are several ways to carry out capacity control instead. 
Below we consider three that have appeared in the literature and for which there exist effective algorithms. \n\nExample 1 (Weight Decay) Define $Q[f] := \\frac{1}{2} \\sum_i \\alpha_i^2$, i.e. the coefficients $\\alpha_i$ of the function expansion are constrained to an $\\ell_2$ ball. In this case we can consider the following operator $S_X^{(1)} : \\ell_2^n \\to \\ell_\\infty^m$ where \n\n$$S_X^{(1)} : \\alpha \\mapsto (f(x_1), \\ldots, f(x_m)) = (\\langle \\mathbf{f}(x_1), \\alpha \\rangle, \\ldots, \\langle \\mathbf{f}(x_m), \\alpha \\rangle) = F\\alpha. \\tag{6}$$ \n\nHere $\\mathbf{f}(x) := (f_1(x), \\ldots, f_n(x))$, $F_{ij} := f_j(x_i)$, $\\alpha := (\\alpha_1, \\ldots, \\alpha_n)$ and $\\alpha \\in \\Lambda U_{\\ell_2^n}$ for some $\\Lambda > 0$. \n\nExample 2 (Sparsity Regularization) In this case $Q[f] := \\sum_i |\\alpha_i|$, i.e. the coefficients $\\alpha_i$ of the function expansion are constrained to an $\\ell_1$ ball to enforce sparseness [3]. Thus $S_X^{(2)} : \\ell_1^n \\to \\ell_\\infty^m$ with $S_X^{(2)}$ mapping $\\alpha$ as in (6) except $\\alpha \\in \\Lambda U_{\\ell_1^n}$. This is similar to expansions encountered in boosting or in linear programming machines. \n\nExample 3 (Regularization Networks) Finally one could set $Q[f] := \\frac{1}{2} \\alpha^\\top Q \\alpha$ for some positive definite matrix $Q$. For instance, $Q_{ij}$ could be obtained from $\\langle P f_i, P f_j \\rangle$ where $P$ is a regularization operator penalizing non-smooth functions [4]. In this case $\\alpha$ lives inside some $n$-dimensional ellipsoid. By substituting $\\alpha' := Q^{1/2} \\alpha$ one can reduce this setting to the case of example 1 with a different set of basis functions ($\\mathbf{f}'(x) = Q^{-1/2} \\mathbf{f}(x)$) and consider an evaluation operator $S_X^{(3)} : \\ell_2^n \\to \\ell_\\infty^m$ given by \n\n$$S_X^{(3)} : \\alpha' \\mapsto (f(x_1), \\ldots, f(x_m)) = (\\langle Q^{-1/2} \\mathbf{f}(x_1), \\alpha' \\rangle, \\ldots, \\langle Q^{-1/2} \\mathbf{f}(x_m), \\alpha' \\rangle) = F Q^{-1/2} \\alpha' \\tag{7}$$ \n\nwhere $\\alpha' \\in \\Lambda U_{\\ell_2^n}$ for some $\\Lambda > 0$ and $F_{ij} = f_j(x_i)$ as in example 1. \n\nExample 4 (Support Vector Machines) An important special case of example 3 is given by Support Vector Machines, where we have $Q_{ij} = k(x_i, x_j)$ and $f_i(x) = k(x_i, x)$, hence $Q = F$. 
\nHence the possible values generated by a Support Vector Machine can be written as \n\n$$S_X^{(4)} : \\alpha' \\mapsto (f(x_1), \\ldots, f(x_m)) = (\\langle Q^{-1/2} \\mathbf{f}(x_1), \\alpha' \\rangle, \\ldots, \\langle Q^{-1/2} \\mathbf{f}(x_m), \\alpha' \\rangle) = F^{1/2} \\alpha' \\tag{8}$$ \n\nwhere $\\alpha' \\in \\Lambda U_{\\ell_2^n}$ for some $\\Lambda > 0$. \n\n4 ENTROPY NUMBERS \n\nCovering numbers characterize the difficulty of learning elements of a function class. Entropy numbers of operators can be used to compute covering numbers more easily and more tightly than the traditional techniques based on VC-like dimensions such as the fat shattering dimension [1]. Knowing $e_l(S_X) = \\epsilon$ (see below for the definition) tells one that $\\log \\mathcal{N}(\\epsilon, F, \\ell_\\infty^m) \\le l$, where $F$ is the effective class of functions used by the regularised learning machines under consideration. In this section we summarize a few basic definitions and results as presented in [8] and [2]. \n\nThe $l$th entropy number $\\epsilon_l(F)$ of a set $F$ with a corresponding metric $d$ is the precision up to which $F$ can be approximated by $l$ elements of $F$; i.e. for all $f \\in F$ there exists some $f_i \\in \\{f_1, \\ldots, f_l\\}$ such that $d(f, f_i) \\le \\epsilon_l$. Hence $\\epsilon_l(F)$ is the functional inverse of the covering number of $F$. \n\nThe entropy number of a bounded linear operator $T : A \\to B$ between normed linear spaces $A$ and $B$ is defined as $\\epsilon_l(T) := \\epsilon_l(T(U_A))$ with the metric $d$ being induced by $\\|\\cdot\\|_B$. The dyadic entropy numbers $e_l$ are defined by $e_l := \\epsilon_{2^{l-1}}$ (the latter quantity is often more convenient to deal with since it corresponds to the log of the covering number). \n\nWe make use of the following three results on entropy numbers of the identity mapping from $\\ell_{p_1}^m$ into $\\ell_{p_2}^m$, diagonal operators, and products of operators. The first of these results is due to Sch\u00fctt; the constants 9.94 and 1.86 were obtained in [9]. Let 
\n\n$$\\mathrm{id}_{p_1,p_2}^m : \\ell_{p_1}^m \\to \\ell_{p_2}^m; \\qquad \\mathrm{id}_{p_1,p_2}^m : x \\mapsto x.$$ \n\nProposition 1 (Entropy numbers for identity operators) Let $m \\in \\mathbb{N}$. Then \n\n$$e_l(\\mathrm{id}_{1,2}^m) \\le 9.94 \\Big(\\frac{1}{l} \\log\\Big(1 + \\frac{m}{l}\\Big)\\Big)^{1/2} \\quad \\text{and} \\quad e_l(\\mathrm{id}_{2,\\infty}^m) \\le 1.86 \\Big(\\frac{1}{l} \\log\\Big(1 + \\frac{m}{l}\\Big)\\Big)^{1/2}. \\tag{9}$$ \n\nProposition 2 (Carl and Stephani [2, p. 11]) Let $E, F, G$ be Banach spaces, $R : F \\to G$, and $S : E \\to F$. Then, for $n, t \\in \\mathbb{N}$, \n\n$$e_{n+t-1}(RS) \\le e_n(R)\\, e_t(S), \\quad e_n(RS) \\le e_n(R)\\, \\|S\\| \\quad \\text{and} \\quad e_n(RS) \\le e_n(S)\\, \\|R\\|. \\tag{10}$$ \n\nNote that the latter two inequalities follow directly from the fact that $\\epsilon_1(R) = \\|R\\|$ for all $R : F \\to G$ by definition of the operator norm $\\|R\\|$. \n\nProposition 3 Let $\\sigma_1 \\ge \\sigma_2 \\ge \\cdots \\ge \\sigma_j \\ge \\cdots \\ge 0$, $1 \\le p \\le \\infty$ and \n\n$$Dx = (\\sigma_1 x_1, \\sigma_2 x_2, \\ldots, \\sigma_j x_j, \\ldots) \\tag{11}$$ \n\nfor $x = (x_1, x_2, \\ldots, x_j, \\ldots) \\in \\ell_p$ be the diagonal operator from $\\ell_p$ into itself, generated by the sequence $(\\sigma_j)_j$. Then for all $n \\in \\mathbb{N}$, \n\n$$\\sup_{j \\in \\mathbb{N}} n^{-1/j} (\\sigma_1 \\sigma_2 \\cdots \\sigma_j)^{1/j} \\le \\epsilon_n(D) \\le 6 \\sup_{j \\in \\mathbb{N}} n^{-1/j} (\\sigma_1 \\sigma_2 \\cdots \\sigma_j)^{1/j}. \\tag{12}$$ \n\n5 THE MAIN RESULT \n\nWe can now state the main theorem which gives bounds on the entropy numbers of $S_X^{(i)}$ for the first three examples of model selection described above (since Support Vector Machines are a special case of example 3 we will not deal with them separately). \n\nProposition 4 Let $f$ be expanded in a linear combination of basis functions as $f := \\sum_{i=1}^{n} \\alpha_i f_i$ and the coefficients $\\alpha$ restricted to one of the convex sets as described in examples 1 to 3. Moreover denote by $F_{ij} := f_j(x_i)$ the design matrix on a particular sample $X$, and by $Q$ the regularization matrix in the case of example 3. Then the following bounds on $S_X$ hold. \n\n1. In the case of weight decay (ex. 1) (with $l_1 + l_2 \\le l + 1$) \n\n$$e_l(S_X^{(1)}) \\le 1.96 \\big(l_1^{-1} \\log(1 + m/l_1)\\big)^{1/2} e_{l_2}(\\Sigma). \\tag{13}$$ \n\n2. In the case of weight sparsity regularization (ex. 2) (with $l_1 + l_2 + l_3 \\le l + 2$) \n\n$$e_l(S_X^{(2)}) \\le 18.48 \\big(l_1^{-1} \\log(1 + m/l_1)\\big)^{1/2} e_{l_2}(\\Sigma) \\big(l_3^{-1} \\log(1 + m/l_3)\\big)^{1/2}. \\tag{14}$$ \n\n3. 
Finally, in the case of regularization networks (ex. 3) (with $l_1 + l_2 \\le l + 1$) \n\n$$e_l(S_X^{(3)}) \\le 1.96 \\big(l_1^{-1} \\log(1 + m/l_1)\\big)^{1/2} e_{l_2}(\\Sigma). \\tag{15}$$ \n\nHere $\\Sigma$ is a diagonal scaling operator (matrix) with $(i, i)$ entries $\\sqrt{\\sigma_i}$, where $(\\sigma_i)_i$ are the eigenvalues (sorted in decreasing order) of the matrix $F F^\\top$ in the case of examples 1 and 2, and $F Q^{-1} F^\\top$ in the case of example 3. \n\nThe entropy number of $\\Sigma$ is readily bounded in terms of $(\\sigma_i)_i$ by using Proposition 3. One can see that the first setting (weight decay) is a special case of the third one, namely when $Q = \\mathbf{1}$, i.e. when $Q$ is just the identity matrix. \n\nProof The proof relies on a factorization of $S_X^{(i)}$ ($i = 1, 2, 3$) in the following way. First we consider the equivalent operator $S_X$ mapping from $\\ell_2^n$ to $\\ell_2^m$ and perform a singular value decomposition [5] of the latter into $S_X = V \\Sigma W$ where $V, W$ are operators of norm 1, and $\\Sigma$ contains the singular values of $S_X^{(i)}$, i.e. the singular values of $F$ and $F Q^{-1/2}$ respectively. The latter, however, are identical to the square roots of the eigenvalues of $F F^\\top$ or $F Q^{-1} F^\\top$. Consequently we can factorize $S_X^{(i)}$ as in the diagram \n\n(16) \n\nFinally, in order to compute the entropy number of the overall operator one only has to use the factorization of $S_X$ into $S_X^{(i)} = \\mathrm{id}_{2,\\infty}^m V \\Sigma W$ for $i \\in \\{1, 3\\}$ and into $S_X^{(2)} = \\mathrm{id}_{2,\\infty}^m V \\Sigma W \\mathrm{id}_{1,2}^m$ for example 2, and apply Proposition 2 several times. We also exploit the fact that for singular value decompositions $\\|V\\|, \\|W\\| \\le 1$. \u25a0 \n\nThe present result allows us to compute the entropy numbers (and thus the complexity) of a class of functions on the current sample $X$. Going back to the examples of section 3, which led to large bounds on the VC dimension, one can see that the new result is much less susceptible to such modifications: the addition of $f_1', \\ldots, f_n'$ to $f_1, \\ldots,$ 
$f_n$ does not change the eigenspectrum $\\Sigma$ of the design matrix significantly (possibly only doubling the nominal value of the singular values), if the functions $f_i'$ differ from $f_i$ only slightly. Consequently the bounds will not change significantly either, even though the number of basis functions just doubled. \n\nAlso note that the current error bounds reduce to the results of [6] in the SV case: here $Q_{ij} = F_{ij} = k(x_i, x_j)$ (both the design matrix $F$ and the regularization matrix $Q$ are determined by kernels) and therefore $F Q^{-1} F = Q$. Thus the analysis of the singular values of $F Q^{-1} F$ leads to an analysis of the eigenvalues of the kernel matrix, which is exactly what is done when dealing with SV machines. \n\n6 ERROR BOUNDS \n\nTo use the above result we need a bound on the expected error of a hypothesis $f$ in terms of the empirical error (training error) and the observed entropy numbers $\\epsilon_n(\\mathcal{F})$. We use [6, Theorem 4.1] with a small modification. \n\nTheorem 1 Let $\\mathcal{F}$ be a set of linear functions as described in the previous examples with $e_n(S_X)$ as the corresponding bound on the observed entropy numbers of $\\mathcal{F}$ on the dataset $X$. Moreover suppose that for a fixed threshold $b \\in \\mathbb{R}$ for some $f \\in \\mathcal{F}$, $\\mathrm{sgn}(f - b)$ correctly classifies the set $X$ with a margin $\\gamma := \\min_{1 \\le i \\le m} |f(x_i) - b|$. \n\nFinally let $U := \\min\\{n \\in \\mathbb{N} \\text{ with } e_n(S_X) \\le \\gamma / 8.001\\}$ and $\\alpha(U, \\delta) := 3.08\\big(1 + \\frac{1}{U} \\ln \\frac{1}{\\delta}\\big)$. \n\nThen with confidence $1 - \\delta$ over $X$ (drawn randomly from $P^m$ where $P$ is some probability distribution) the expected error of $\\mathrm{sgn}(f - b)$ is bounded from above by \n\n\u20ac(m,U,<5)  =! (U(1+a(U,~)log(5t-m)log(17m)) + log (l6r)) . \n\n(17) \n\nThe proof is essentially identical to that of [6, Theorem 4.1] and is omitted. [6] also shows how to compute $e_n(S_X)$ efficiently, including an explicit formula for evaluating $e_l(\\Sigma)$. 
\n\n7 DISCUSSION \n\nWe showed how improved bounds could be obtained on the entropy numbers of a wide class of popular statistical estimators ranging from weight decay to sparsity regularization (with SV machines being a special case thereof). The results are given in a way that is directly usable for practitioners without any tedious calculations of the VC dimension or similar combinatorial quantities. In particular, our method ignores (nearly) linearly dependent basis functions automatically. Finally, it takes advantage of favourable distributions of data by using the observed entropy numbers as a base for stating bounds on the true entropy numbers with respect to the function class under consideration. \n\nWhilst this leads to significantly improved bounds on the expected risk (in the experiments we achieved an improvement of approximately two orders of magnitude over previous VC-type bounds involving only the radius of the data $R$ and the norm of the weight vector $\\|w\\|$), the bounds are still not good enough to become predictive. This indicates that, rather than using the standard uniform convergence bounds (as used in the previous section), one might want to use other techniques such as a PAC-Bayesian treatment (as recently suggested by Herbrich and Graepel) in combination with the bounds on eigenvalues of the design matrix. \n\nAcknowledgements: This work was supported by the Australian Research Council and a grant of the Deutsche Forschungsgemeinschaft SM 62/1-1. \n\nReferences \n\n[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive Dimensions, Uniform Convergence, and Learnability. J. of the ACM, 44(4):615-631, 1997. \n\n[2] B. Carl and I. Stephani. 
Entropy, compactness, and the approximation of operators. Cambridge University Press, Cambridge, UK, 1990. \n\n[3] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, 1995. \n\n[4] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219-269, 1995. \n\n[5] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1992. \n\n[6] B. Sch\u00f6lkopf, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Generalization bounds via eigenvalues of the Gram matrix. Technical Report NC-TR-99-035, NeuroColt2, University of London, UK, 1999. \n\n[7] J. Shawe-Taylor and R. C. Williamson. Generalization performance of classifiers in terms of observed covering numbers. In Proc. EUROCOLT'99, 1999. \n\n[8] R. C. Williamson, A. J. Smola, and B. Sch\u00f6lkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. NeuroCOLT NC-TR-98-019, Royal Holloway College, 1998. \n\n[9] R. C. Williamson, A. J. Smola, and B. Sch\u00f6lkopf. A Maximum Margin Miscellany. Typescript, 1999. \n\n", "award": [], "sourceid": 1677, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}