{"title": "Some Theoretical Results Concerning the Convergence of Compositions of Regularized Linear Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 370, "page_last": 378, "abstract": null, "full_text": "Some Theoretical Results Concerning the Convergence of Compositions of Regularized Linear Functions\n\nTong Zhang\n\nMathematical Sciences Department\nIBM T.J. Watson Research Center\nYorktown Heights, NY 10598\ntzhang@watson.ibm.com\n\nAbstract\n\nRecently, sample complexity bounds have been derived for problems involving linear functions such as neural networks and support vector machines. In this paper, we extend some theoretical results in this area by deriving dimension-independent covering number bounds for regularized linear functions under certain regularization conditions. We show that such bounds lead to a class of new methods for training linear classifiers with theoretical advantages similar to those of the support vector machine. Furthermore, we also present a theoretical analysis of these new methods from the asymptotic statistical point of view. This technique provides a better description of the large-sample behavior of these algorithms.\n\n1  Introduction\n\nIn this paper, we are interested in the generalization performance of linear classifiers obtained from certain algorithms. From the point of view of computational learning theory, such performance measurements, or sample complexity bounds, can be described by a quantity called the covering number [11, 15, 17], which measures the size of a parametric function family. For the two-class classification problem, the covering number can be bounded by a combinatorial quantity called the VC-dimension [12, 17]. Following this work, researchers have found other combinatorial quantities (dimensions) useful for bounding the covering numbers.  
Consequently, the concept of VC-dimension has been generalized to deal with more general problems, for example in [15, 11].\n\nRecently, Vapnik introduced the concept of the support vector machine [16], which has been successfully applied to many real problems. This method achieves good generalization by restricting the 2-norm of the weights of a separating hyperplane. A similar technique has been investigated by Bartlett [3], where the author studied the performance of neural networks when the 1-norm of the weights is bounded. The same idea has also been applied in [13] to explain the effectiveness of the boosting algorithm. In this paper, we will extend their results and emphasize the importance of dimension independence. Specifically, we consider the following form of regularization method (with an emphasis on classification problems), which has been widely studied for regression problems both in statistics and in numerical mathematics:\n\n$$\\inf_w E_{x,y} L(w, x, y) = \\inf_w E_{x,y} f(w^T x y) + \\lambda g(w), \\qquad (1)$$\n\nwhere $E_{x,y}$ is the expectation over a distribution of $(x, y)$, and $y \\in \\{-1, 1\\}$ is the binary label of data vector $x$. To apply this formulation for the purpose of training linear classifiers, we can choose $f$ as a decreasing function such that $f(\\cdot) \\ge 0$, and choose $g(w) \\ge 0$ as a function that penalizes large $w$ ($g(w) \\to \\infty$ as $w \\to \\infty$). $\\lambda$ is an appropriately chosen positive parameter to balance the two terms.\n\nThe paper is organized as follows. In Section 2, we briefly review the concept of covering numbers as well as the main results related to analyzing the performance of learning algorithms. In Section 3, we introduce the regularization idea. Our main goal is to construct regularization conditions so that dimension-independent bounds on covering numbers can be obtained.  
Section 4 extends results from the previous section to nonlinear compositions of linear functions. In Section 5, we give an asymptotic formula for the generalization performance of a learning algorithm, which will then be used to analyze an instance of SVM. Due to the space limitation, we will only present the main results and discuss their implications. The detailed derivations can be found in [18].\n\n2  Covering numbers\n\nWe formulate the learning problem as finding a parameter from random observations to minimize risk: given a loss function $L(\\alpha, x)$ and $n$ observations $X_1^n = \\{x_1, \\ldots, x_n\\}$ independently drawn from a fixed but unknown distribution $D$, we want to find $\\alpha$ that minimizes the expected loss over $x$ (risk):\n\n$$R(\\alpha) = E_x L(\\alpha, x) = \\int L(\\alpha, x) \\, dP(x). \\qquad (2)$$\n\nThe most natural method for solving (2) using a limited number of observations is the empirical risk minimization (ERM) method (cf. [15, 16]). We simply choose a parameter $\\alpha$ that minimizes the observed risk:\n\n$$R(\\alpha, X_1^n) = \\frac{1}{n} \\sum_{i=1}^n L(\\alpha, x_i). \\qquad (3)$$\n\nWe denote the parameter obtained in this way as $\\alpha_{erm}(X_1^n)$. The convergence behavior of this method can be analyzed from the VC theoretical point of view, which relies on the uniform convergence of the empirical risk (the uniform law of large numbers): $\\sup_\\alpha |R(\\alpha, X_1^n) - R(\\alpha)|$. Such a bound can be obtained from quantities that measure the size of a Glivenko-Cantelli class. For a finite number of indices, the family size can be measured simply by its cardinality. For general function families, a well-known quantity to measure the degree of uniform convergence is the covering number, which can be dated back to Kolmogorov [8, 9]. The idea is to discretize (in a way that can depend on the data $X_1^n$) the parameter space into $N$ values $\\alpha_1, \\ldots, \\alpha_N$ so that each $L(\\alpha, \\cdot)$ can be approximated by $L(\\alpha_i, \\cdot)$ for some $i$.  
We shall only describe a simplified version relevant for our purposes.\n\nDefinition 2.1  Let $B$ be a metric space with metric $\\rho$. Given a norm $p$, observations $X_1^n = [x_1, \\ldots, x_n]$, and vectors $f(\\alpha, X_1^n) = [f(\\alpha, x_1), \\ldots, f(\\alpha, x_n)] \\in B^n$ parameterized by $\\alpha$, the covering number in $p$-norm, denoted as $\\mathcal{N}_p(f, \\epsilon, X_1^n)$, is the minimum size $m$ of a collection of vectors $v_1, \\ldots, v_m \\in B^n$ such that $\\forall \\alpha$, $\\exists v_i$: $\\|\\rho(f(\\alpha, X_1^n), v_i)\\|_p \\le n^{1/p} \\epsilon$. We also denote $\\mathcal{N}_p(f, \\epsilon, n) = \\max_{X_1^n} \\mathcal{N}_p(f, \\epsilon, X_1^n)$.\n\nNote that from the definition and Jensen's inequality, we have $\\mathcal{N}_p \\le \\mathcal{N}_q$ for $p \\le q$. We will always assume the metric on $R$ to be $|x_1 - x_2|$ if not explicitly specified otherwise. The following theorem is due to Pollard [11]:\n\nTheorem 2.1 ([11])  $\\forall n$, $\\epsilon > 0$ and distribution $D$,\n\n$$P\\left[\\sup_\\alpha |R(\\alpha, X_1^n) - R(\\alpha)| > \\epsilon\\right] \\le 8 E[\\mathcal{N}_1(L, \\epsilon/8, X_1^n)] \\exp\\left(\\frac{-n \\epsilon^2}{128 M^2}\\right),$$\n\nwhere $M = \\sup_{\\alpha, x} L(\\alpha, x) - \\inf_{\\alpha, x} L(\\alpha, x)$, and $X_1^n = \\{x_1, \\ldots, x_n\\}$ are independently drawn from $D$.\n\nThe constants in the above theorem can be improved for certain problems; see [4, 6, 15, 16] for related results. However, they yield very similar bounds. The result most relevant for this paper is a lemma in [3] where the 1-norm covering number is replaced by the $\\infty$-norm covering number. The latter can be bounded by a scale-sensitive combinatorial dimension [1], which can be bounded from the 1-norm covering number if this covering number does not depend on $n$. These results can replace Theorem 2.1 to yield better estimates under certain circumstances. 
\n\nSince Bartlett's lemma in [3] is only for binary loss functions, we shall give a generalization so that it is comparable to Theorem 2.1:\nTheorem 2.2  Let $f_1$ and $f_2$ be two functions $R^n \\to [0, 1]$ such that $|y_1 - y_2| \\le \\gamma$ implies $f_1(y_1) \\le f_3(y_2) \\le f_2(y_1)$, where $f_3 : R^n \\to [0, 1]$ is a reference separating function, then\n\n$$P\\left[\\sup_\\alpha \\left[E_x f_1(L(\\alpha, x)) - E_{X_1^n} f_2(L(\\alpha, x))\\right] > \\epsilon\\right] \\le 4 E[\\mathcal{N}_\\infty(L, \\gamma, X_1^n)] \\exp\\left(\\frac{-n \\epsilon^2}{32}\\right).$$\n\nNote that in the extreme case where some choice of $\\alpha$ achieves perfect generalization, $E_x f_2(L(\\alpha, x)) = 0$, and our choices of $\\alpha(X_1^n)$ always satisfy the condition $E_{X_1^n} f_2(L(\\alpha, x)) = 0$, better bounds can be obtained by using a refined version of the Chernoff bound.\n\n3  Covering number bounds for linear systems\n\nIn this section, we present a few new bounds on covering numbers for the following form of real-valued loss functions:\n\n$$L(w, x) = x^T w = \\sum_{i=1}^d x_i w_i. \\qquad (4)$$\n\nAs we shall see later, these bounds are relevant to the convergence properties of (1). Note that in order to apply Theorem 2.1, since $\\mathcal{N}_1 \\le \\mathcal{N}_2$, it is sufficient to estimate $\\mathcal{N}_2(L, \\epsilon, n)$ for $\\epsilon > 0$. It is clear that $\\mathcal{N}_2(L, \\epsilon, n)$ is not finite if no restrictions on $x$ and $w$ are imposed. Therefore in the following, we will assume that each $\\|x_i\\|_p$ is bounded, and study conditions on $\\|w\\|_q$ so that $\\log \\mathcal{N}(L, \\epsilon, n)$ is independent of, or only weakly dependent on, $d$.\nOur first result generalizes a theorem of Bartlett [3]. The original result is for $p = \\infty$ and $q = 1$, and the related technique has also appeared in [10, 13]. The proof uses a lemma that is attributed to Maurey (cf. [2, 7]).\nTheorem 3.1  If $\\|x_i\\|_p \\le b$ and $\\|w\\|_q \\le a$, where $1/p + 1/q = 1$ and $2 \\le p \\le \\infty$, then\n\n$$\\log_2 \\mathcal{N}_2(L, \\epsilon, n) \\le \\left\\lceil \\frac{a^2 b^2}{\\epsilon^2} \\right\\rceil \\log_2(2d + 1).$$\n\nThe above bound on the covering number depends logarithmically on $d$, which is already quite weak (as compared to the linear dependency on $d$ in the standard situation). However, the bound in Theorem 3.1 is not tight for $p < \\infty$. For example, the following theorem improves the above bound for $p = 2$. Our technique of proof relies on the singular value decomposition (SVD) [5] for matrices, which improves a similar result in [14] by a logarithmic factor.\n\nThe next theorem shows that if $1/p + 1/q > 1$, then the 2-norm covering number is also independent of dimension.\nTheorem 3.3  Let $L(w, x) = x^T w$. If $\\|x_i\\|_p \\le b$ and $\\|w\\|_q \\le a$, where $1 \\le q \\le 2$ and $\\delta = 1/p + 1/q - 1 > 0$, then\n\nOne consequence of this theorem is a potentially refined explanation for the boosting algorithm. In [13], the boosting algorithm has been analyzed by using a technique related to results in [3], which essentially rely on Theorem 3.1 with $p = \\infty$. Unfortunately, the bound contains a logarithmic dependency on $d$ (in the most general case), which does not seem to fully explain the fact that in many cases the performance of the boosting algorithm keeps improving as $d$ increases. However, this seemingly mysterious behavior might be better understood from Theorem 3.3 under the assumption that the data is more restricted than simply being $\\infty$-norm bounded. For example, when the contribution of the wrong predictions is bounded by a constant (or grows very slowly as $d$ increases), we can regard its $p$-th norm as bounded for some $p < \\infty$. In this case, Theorem 3.3 implies dimension-independent generalization.\n\nIf we want to apply Theorem 2.2, then it is necessary to obtain bounds for infinity-norm covering numbers. The following theorem gives such bounds by using a result from online learning. 
\nTheorem 3.4  If $\\|x_i\\|_p \\le b$ and $\\|w\\|_q \\le a$, where $2 \\le p < \\infty$ and $1/p + 1/q = 1$, then $\\forall \\epsilon > 0$,\n\nIn the case of $p = \\infty$, an entropy condition can be used to obtain dimension-independent covering number bounds.\nDefinition 3.1  Let $\\mu = [\\mu_i]$ be a vector with positive entries such that $\\|\\mu\\|_1 = 1$ (in this case, we call $\\mu$ a distribution vector). Let $x = [x_i] \\ne 0$ be a vector of the same length, then we define the weighted relative entropy of $x$ with respect to $\\mu$ as:\n\n$$\\mathrm{entro}_\\mu(x) = \\sum_i |x_i| \\ln \\frac{|x_i|}{\\mu_i \\|x\\|_1}.$$\n\nTheorem 3.5  Given a distribution vector $\\mu$, if $\\|x_i\\|_\\infty \\le b$, $\\|w\\|_1 \\le a$ and $\\mathrm{entro}_\\mu(w) \\le c$, where we assume that $w$ has non-negative entries, then $\\forall \\epsilon > 0$,\n\n$$\\log_2 \\mathcal{N}_\\infty(L, \\epsilon, n) \\le \\frac{36 b^2 (a^2 + ac)}{\\epsilon^2} \\log_2\\left[2 \\lceil 4ab/\\epsilon + 2 \\rceil n + 1\\right].$$\n\nTheorems in this section can be combined with Theorem 4.1 to form more complex covering number bounds for nonlinear compositions of linear functions.\n\n4  Nonlinear extensions\n\nConsider the following system:\n\n$$L([\\alpha, w], x) = f(g(\\alpha, x) + w^T h(\\alpha, x)), \\qquad (5)$$\n\nwhere $x$ is the observation, and $[\\alpha, w]$ is the parameter. We assume that $f$ is a nonlinear function with bounded total variation.\nDefinition 4.1  A function $f : R \\to R$ is said to satisfy the Lipschitz condition with parameter $\\gamma$ if $\\forall x, y$: $|f(x) - f(y)| \\le \\gamma |x - y|$.\nDefinition 4.2  The total variation of a function $f : R \\to R$ is defined as\n\n$$TV(f, x) = \\sup_{x_0 < x_1 < \\cdots < x_L \\le x} \\sum_{\\ell=1}^{L} |f(x_\\ell) - f(x_{\\ell-1})|.$$\n\nWe also denote $TV(f, \\infty)$ as $TV(f)$.\nTheorem 4.1  If $L([\\alpha, w], x) = f(g(\\alpha, x) + w^T h(\\alpha, x))$, where $TV(f) < \\infty$ and $f$ is Lipschitz with parameter $\\gamma$.  
Assume also that $w$ is a $d$-dimensional vector and $\\|w\\|_q \\le c$, then $\\forall \\epsilon_1, \\epsilon_2 > 0$, and $n > 2(d + 1)$:\n\n$$\\log_2 \\mathcal{N}_r(L, \\epsilon_1 + \\epsilon_2, n) < (d + 1) \\log_2\\left[\\frac{en}{d + 1} \\max\\left(\\left\\lfloor \\frac{TV(f)}{2\\epsilon_1} \\right\\rfloor, 1\\right)\\right] + \\log_2 \\mathcal{N}_r([g, h], \\epsilon_2/\\gamma, n),$$\n\nwhere the metric of $[g, h]$ is defined as $|g_1 - g_2| + c \\|h_1 - h_2\\|_p$ ($1/p + 1/q = 1$).\nExample 4.1  Consider classification by hyperplane: $L(w, x) = I(w^T x < 0)$, where $I$ is the set indicator function. Let $L'(w, x) = f_0(w^T x)$ be another loss function where\n\n$$f_0(z) = \\begin{cases} 1 & z < 0 \\\\ 1 - z & z \\in [0, 1] \\\\ 0 & z > 1. \\end{cases}$$\n\nInstead of using ERM for estimating the parameter that minimizes the risk of $L$, consider the scheme of minimizing the empirical risk associated with $L'$, under the assumption that $\\|x\\|_2 \\le b$ and the constraint that $\\|w\\|_2 \\le a$. Denote the estimated parameter by $w_n$. It follows from the covering number bounds and Theorem 2.1 that with probability of at least $1 - \\eta$:\n\n$$\\ldots + O\\left(\\sqrt{\\frac{n^{1/2} ab \\ln(nab + 2) + \\ln \\frac{1}{\\eta}}{n}}\\right).$$\n\nIf we apply a slight generalization of Theorem 2.2 and the covering number bound of Theorem 3.4, then with probability of at least $1 - \\eta$:\n\n$$E_x I(w_n^T x \\le 0) \\le E_{X_1^n} I(w_n^T x \\le 2\\gamma) + O\\left(\\sqrt{\\frac{1}{n}\\left(\\frac{a^2 b^2}{\\gamma^2} \\ln(ab/\\gamma + 2) + \\ln n + \\ln \\frac{1}{\\eta}\\right)}\\right)$$\n\nfor all $\\gamma \\in (0, 1]$. $\\Box$\n\nBounds given in this paper can be applied to show that under appropriate regularization conditions and assumptions on the data, methods based on (1) lead to generalization performance of the form $\\tilde{O}(1/\\sqrt{n})$, where the $\\tilde{O}$ symbol (which is independent of $d$) is used to indicate that the hidden constant may include a polynomial dependency on $\\log(n)$. It is also important to note that in certain cases, $\\lambda$ will not appear (or it has a small influence on the convergence) in the constant of $\\tilde{O}$, as demonstrated by the example in the next section. 
\n\n5  Asymptotic analysis\n\nThe convergence results in the previous sections are in the form of VC-style convergence in probability, which has a combinatorial flavor. However, for problems with differentiable function families involving vector parameters, it is often convenient to derive precise asymptotic results using the differential structure.\n\nAssume that the parameter $\\alpha \\in R^m$ in (2) is a vector and $L$ is a smooth function. Let $\\alpha^*$ denote the optimal parameter, $\\nabla_\\alpha$ denote the derivative with respect to $\\alpha$, and $\\psi(\\alpha, x)$ denote $\\nabla_\\alpha L(\\alpha, x)$. Assume that\n\n$$V = \\int \\nabla_\\alpha \\psi(\\alpha^*, x) \\, dP(x), \\qquad U = \\int \\psi(\\alpha^*, x) \\psi(\\alpha^*, x)^T \\, dP(x).$$\n\nThen under certain regularity conditions, the asymptotic expected generalization error is given by\n\n$$E \\, R(\\alpha_{erm}) = R(\\alpha^*) + \\frac{1}{2n} \\mathrm{tr}(V^{-1} U). \\qquad (6)$$\n\nMore generally, for any evaluation function $h(\\alpha)$ such that $\\nabla h(\\alpha^*) = 0$:\n\n$$E \\, h(\\alpha_{erm}) \\approx h(\\alpha^*) + \\frac{1}{2n} \\mathrm{tr}(V^{-1} \\nabla^2 h \\cdot V^{-1} U), \\qquad (7)$$\n\nwhere $\\nabla^2 h$ is the Hessian matrix of $h$ at $\\alpha^*$. Note that this approach assumes that the optimal solution is unique. These results are exact asymptotically and provide better bounds than those from the standard PAC analysis.\n\nExample 5.1  We would like to study a form of the support vector machine. Consider $L(\\alpha, x) = f(\\alpha^T x) + \\frac{1}{2} \\lambda \\alpha^2$, where\n\n$$f(z) = \\begin{cases} 1 - z & z < 1 \\\\ 0 & z > 1. \\end{cases}$$\n\nBecause of the discontinuity in the derivative of $f$, the asymptotic formula may not hold. However, if we make an assumption on the smoothness of the distribution of $x$, then the expectation of the derivative over $x$ can still be smooth. In this case, the smoothness of $f$ itself is not crucial. Furthermore, in a separate report,  
we shall illustrate that similar small-sample bounds without any assumption on the smoothness of the distribution can be obtained by using techniques related to asymptotic analysis.\nConsider the optimal parameter $\\alpha^*$ and let $S = \\{x : \\alpha^{*T} x \\le 1\\}$. Note that $\\lambda \\alpha^* = E_{x \\in S} x$, and $U = E_{x \\in S} (x - E_{x \\in S} x)(x - E_{x \\in S} x)^T$. Assume that $\\exists \\gamma > 0$ s.t. $P(\\alpha^{*T} x \\le \\gamma) = 0$, then $V = \\lambda I + B$ where $B$ is a positive semi-definite matrix. It follows that\n\n$$\\mathrm{tr}(V^{-1} U) \\le \\mathrm{tr}(U)/\\lambda \\le \\frac{E_{x \\in S} \\|x\\|_2^2}{E_{x \\in S} \\alpha^{*T} x} \\|\\alpha^*\\|_2^2 \\le \\sup \\|x\\|_2^2 \\|\\alpha^*\\|_2^2 / \\gamma.$$\n\nNow, consider the parameter $\\alpha_{emp}$ obtained from observations $X_1^n = [x_1, \\ldots, x_n]$ by minimizing the empirical risk associated with the loss function $L(\\alpha, x)$, then\n\n$$E_x L(\\alpha_{emp}, x) \\le \\inf_\\alpha E_x L(\\alpha, x) + \\frac{1}{2\\gamma n} \\sup \\|x\\|_2^2 \\|\\alpha^*\\|_2^2$$\n\nasymptotically. Letting $\\lambda \\to 0$, this scheme becomes the optimal separating hyperplane [16]. This asymptotic bound is better than typical PAC bounds with fixed $\\lambda$. $\\Box$\n\nNote that although the bound obtained in the above example is very similar to the mistake bound for the perceptron online update algorithm, we may in practice obtain much better estimates from (6) by plugging in the empirical data.\n\nReferences\n\n[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.\n\n[2] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.\n\n[3] P.L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.\n\n[4] R.M. Dudley. A course on empirical processes, volume 1097 of Lecture Notes in Mathematics. 1984.\n\n[5] G.H.  
Golub and C.P. Van Loan. Matrix computations. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.\n\n[6] D. Haussler. Generalizing the PAC model: sample size bounds from metric dimension-based uniform convergence results. In Proc. 30th IEEE Symposium on Foundations of Computer Science, pages 40-45, 1989.\n\n[7] Lee K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20(1):608-613, 1992.\n\n[8] A.N. Kolmogorov. Asymptotic characteristics of some completely bounded metric spaces. Dokl. Akad. Nauk. SSSR, 108:585-589, 1956.\n\n[9] A.N. Kolmogorov and V.M. Tihomirov. $\\epsilon$-entropy and $\\epsilon$-capacity of sets in functional spaces. Amer. Math. Soc. Transl., 17(2):277-364, 1961.\n\n[10] Wee Sun Lee, P.L. Bartlett, and R.C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118-2132, 1996.\n\n[11] D. Pollard. Convergence of stochastic processes. Springer-Verlag, New York, 1984.\n\n[12] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory (Series A), 13:145-147, 1972.\n\n[13] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist., 26(5):1651-1686, 1998.\n\n[14] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44(5):1926-1940, 1998.\n\n[15] V.N. Vapnik. Estimation of dependences based on empirical data. Springer-Verlag, New York, 1982. Translated from the Russian by Samuel Kotz.\n\n[16] V.N. Vapnik. The nature of statistical learning theory. Springer-Verlag, New York, 1995. 
\n\n[17] V.N. Vapnik and A.J. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Applications, 16:264-280, 1971.\n\n[18] Tong Zhang. Analysis of regularized linear functions for classification problems. Technical Report RC-21572, IBM, 1999.\n", "award": [], "sourceid": 1689, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}