{"title": "Improved risk tail bounds for on-line algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 195, "page_last": 202, "abstract": null, "full_text": "Improved Risk Tail Bounds \nfor On-Line Algorithms * \n\nNicolo Cesa-Bianchi \n\nDSI, Universita di Milano \n\nvia Comelico 39 \n\n20135 Milano, Italy \n\nClaudio Gentile \n\nDICOM, Universita dell'Insubria \n\nvia Mazzini 5 \n\n21100 Varese, Italy \n\ncesa-bianchi@dsi.unimi.it \n\ngentile@dsi.unimi.it \n\nAbstract \n\nWe prove the strongest known bound for the risk of hypotheses selected \nfrom the ensemble generated by running a learning algorithm incremen(cid:173)\ntally on the training data. Our result is based on proof techniques that are \nremarkably different from the standard risk analysis based on uniform \nconvergence arguments. \n\n1 Introduction \n\nIn this paper, we analyze the risk of hypotheses selected from the ensemble obtained by \nrunning an arbitrary on-line learning algorithm on an i.i.d. sequence of training data. We \ndescribe a procedure that selects from the ensemble a hypothesis whose risk is , with high \nprobability, at most \n\nMn + 0 ((innn)2 + J~n Inn) , \n\nwhere Mn is the average cumulative loss incurred by the on-line algorithm on a training \nsequence of length n. Note that this bound exhibits the \"fast\" rate (in n)2 I n whenever the \ncumulative loss nMn is 0(1). \n\nThis result is proven through a refinement of techniques that we used in [2] to prove the \nsubstantially weaker bound Mn + 0 ( J (in n) I n). As in the proof of the older result, we \nanalyze the empirical process associated with a run of the on-line learner using exponential \ninequalities for martingales. However, this time we control the large deviations of the \non-line process using Bernstein's maximal inequality rather than the Azuma-Hoeffding \ninequality. This provides a much tighter bound on the average risk of the ensemble. 
Finally, we relate the risk of a specific hypothesis within the ensemble to the average risk. As in [2], we select this hypothesis using a deterministic sequential testing procedure, but the use of Bernstein's inequality makes the analysis of this procedure far more complicated.

The study of the statistical risk of hypotheses generated by on-line algorithms, initiated by Littlestone [5], uses tools that are sharply different from those used in uniform convergence analysis, a popular approach based on the manipulation of suprema of empirical processes (see, e.g., [3]). Unlike uniform convergence, which is tailored to empirical risk minimization, our bounds hold for any learning algorithm. Indeed, disregarding efficiency issues, any learner can be run incrementally on a data sequence to generate an ensemble of hypotheses.

The consequences of this line of research for kernel-based and margin-based algorithms have been presented in our previous work [2].

* Part of the results contained in this paper have been presented in a talk given at the NIPS 2004 workshop on "(Ab)Use of Bounds".

Notation. An example is a pair $(x, y)$, where $x \in X$ (which we call the instance) is a data element and $y \in Y$ is the label associated with it. Instances $x$ are tuples of numerical and/or symbolic attributes. Labels $y$ belong either to a finite set of symbols (the class elements) or to an interval of the real line, depending on whether the task is classification or regression. We allow a learning algorithm to output hypotheses of the form $h : X \to D$, where $D$ is a decision space not necessarily equal to $Y$. The goodness of hypothesis $h$ on example $(x, y)$ is measured by the quantity $\ell(h(x), y)$, where $\ell : D \times Y \to \mathbb{R}$ is a nonnegative and bounded loss function.

2 A bound on the average risk

An on-line algorithm $A$ works in a sequence of trials. In each trial $t = 1, 2, \ldots$
the algorithm takes as input a hypothesis $H_{t-1}$ and an example $Z_t = (X_t, Y_t)$, and returns a new hypothesis $H_t$ to be used in the next trial. We follow the standard assumptions of statistical learning: the sequence of examples $Z^n = ((X_1, Y_1), \ldots, (X_n, Y_n))$ is drawn i.i.d. according to an unknown distribution over $X \times Y$. We also assume that the loss function $\ell$ satisfies $0 \le \ell \le 1$. The success of a hypothesis $h$ is measured by the risk of $h$, denoted by $\mathrm{risk}(h)$. This is the expected loss of $h$ on an example $(X, Y)$ drawn from the underlying distribution, $\mathrm{risk}(h) = \mathbb{E}\,\ell(h(X), Y)$. Define also $\mathrm{risk}_{\mathrm{emp}}(h)$ to be the empirical risk of $h$ on a sample $Z^n$,

$$\mathrm{risk}_{\mathrm{emp}}(h) = \frac{1}{n}\sum_{t=1}^{n} \ell(h(X_t), Y_t)\,.$$

Given a sample $Z^n$ and an on-line algorithm $A$, we use $H_0, H_1, \ldots, H_{n-1}$ to denote the ensemble of hypotheses generated by $A$. Note that the ensemble is a function of the random training sample $Z^n$. Our bounds hinge on the sample statistic

$$M_n = \frac{1}{n}\sum_{t=1}^{n} \ell(H_{t-1}(X_t), Y_t)\,,$$

the average cumulative loss of $A$, which can be easily computed as the on-line algorithm is run on $Z^n$.

The following bound, a consequence of Bernstein's maximal inequality for martingales due to Freedman [4], is of primary importance for proving our results.

Lemma 1 Let $L_1, L_2, \ldots$ be a sequence of random variables with $0 \le L_t \le 1$. Define the bounded martingale difference sequence $V_t = \mathbb{E}[L_t \mid L_1, \ldots, L_{t-1}] - L_t$ and the associated martingale $S_n = V_1 + \cdots + V_n$ with conditional variance $K_n = \sum_{t=1}^{n} \mathrm{Var}[L_t \mid L_1, \ldots, L_{t-1}]$. Then, for all $s, k \ge 0$,

$$\mathbb{P}(S_n \ge s,\; K_n \le k) \le \exp\!\left(-\frac{s^2}{2k + 2s/3}\right).$$

The next proposition, derived from Lemma 1, establishes a bound on the average risk of the ensemble of hypotheses.

Proposition 2 Let $H_0, \ldots, H_{n-1}$ be the ensemble of hypotheses generated by an arbitrary on-line algorithm $A$. Then, for any $0 < \delta \le 1$,
$$\mathbb{P}\left(\frac{1}{n}\sum_{t=1}^{n}\mathrm{risk}(H_{t-1}) \ge M_n + \frac{36}{n}\ln\frac{n M_n + 3}{\delta} + 2\sqrt{\frac{M_n}{n}\ln\frac{n M_n + 3}{\delta}}\right) \le \delta\,.$$

The bound shown in Proposition 2 has the same rate as a bound recently proven by Zhang [6, Theorem 5]. However, rather than deriving the bound from Bernstein's inequality as we do, Zhang uses an ad hoc argument.

Proof. Let

$$\mu_n = \frac{1}{n}\sum_{t=1}^{n}\mathrm{risk}(H_{t-1}) \qquad\text{and}\qquad V_{t-1} = \mathrm{risk}(H_{t-1}) - \ell(H_{t-1}(X_t), Y_t) \quad\text{for } t \ge 1.$$

Let $\kappa_t$ be the conditional variance $\mathrm{Var}\big(\ell(H_{t-1}(X_t), Y_t) \mid Z_1, \ldots, Z_{t-1}\big)$. Also, set for brevity $K_n = \sum_{t=1}^{n}\kappa_t$, $K_n' = \lfloor K_n \rfloor$, and introduce the function $A(x) = 2\ln\frac{(x+1)(x+3)}{\delta}$ for $x \ge 0$. We find upper and lower bounds on the probability

$$\mathbb{P}\left(\sum_{t=1}^{n} V_{t-1} \ge A(K_n) + \sqrt{A(K_n)\,K_n}\right). \tag{1}$$

The upper bound is determined through a simple stratification argument over Lemma 1. We can write

$$\begin{aligned}
\mathbb{P}\Big(\sum_{t=1}^{n} V_{t-1} \ge A(K_n) + \sqrt{A(K_n)\,K_n}\Big)
&\le \mathbb{P}\Big(\sum_{t=1}^{n} V_{t-1} \ge A(K_n') + \sqrt{A(K_n')\,K_n'}\Big)\\
&\le \sum_{s=0}^{n} \mathbb{P}\Big(\sum_{t=1}^{n} V_{t-1} \ge A(s) + \sqrt{A(s)\,s},\; K_n' = s\Big)\\
&\le \sum_{s=0}^{n} \mathbb{P}\Big(\sum_{t=1}^{n} V_{t-1} \ge A(s) + \sqrt{A(s)\,s},\; K_n \le s+1\Big)\\
&\le \sum_{s=0}^{n} \exp\left(-\frac{\big(A(s) + \sqrt{A(s)\,s}\big)^2}{\frac{2}{3}\big(A(s) + \sqrt{A(s)\,s}\big) + 2(s+1)}\right) \qquad\text{(using Lemma 1).}
\end{aligned}$$

Since

$$\frac{\big(A(s) + \sqrt{A(s)\,s}\big)^2}{\frac{2}{3}\big(A(s) + \sqrt{A(s)\,s}\big) + 2(s+1)} \ge \frac{A(s)}{2} \qquad\text{for all } s \ge 0,$$

we obtain

$$(1) \le \sum_{s=0}^{n} e^{-A(s)/2} = \sum_{s=0}^{n} \frac{\delta}{(s+1)(s+3)} < \delta\,. \tag{2}$$

As far as the lower bound on (1) is concerned, we note that our assumption $0 \le \ell \le 1$ implies $\kappa_t \le \mathrm{risk}(H_{t-1})$ for all $t$ which, in turn, gives $K_n \le n\mu_n$. Thus

$$\begin{aligned}
(1) &= \mathbb{P}\Big(n\mu_n - nM_n \ge A(K_n) + \sqrt{A(K_n)\,K_n}\Big)\\
&\ge \mathbb{P}\Big(n\mu_n - nM_n \ge A(n\mu_n) + \sqrt{A(n\mu_n)\,n\mu_n}\Big)\\
&= \mathbb{P}\Big(2n\mu_n \ge 2nM_n + 3A(n\mu_n) + \sqrt{4nM_n\,A(n\mu_n) + 5A(n\mu_n)^2}\Big)\\
&= \mathbb{P}\left(x \ge B + \tfrac{3}{2}A(x) + \sqrt{B\,A(x) + \tfrac{5}{4}A^2(x)}\right),
\end{aligned}$$

where we set for brevity $x = n\mu_n$ and $B = nM_n$.
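Inequality (2) rests on the elementary fact that $\sum_{s=0}^{n}\frac{1}{(s+1)(s+3)} < 1$ for every $n$: the partial sums telescope to $\frac{3}{4} - \frac{1}{2}\big(\frac{1}{n+2} + \frac{1}{n+3}\big)$. A quick numerical confirmation of this step, over an arbitrary sample of values of $n$:

```python
# check: sum_{s=0}^{n} 1/((s+1)(s+3)) stays below 3/4 < 1 for every n,
# so the right-hand side of (2) is at most (3/4) * delta < delta
for n in (1, 10, 100, 10_000):
    total = sum(1.0 / ((s + 1) * (s + 3)) for s in range(n + 1))
    closed_form = 0.75 - 0.5 * (1.0 / (n + 2) + 1.0 / (n + 3))
    assert abs(total - closed_form) < 1e-9  # telescoping identity
    assert total < 0.75
print("ok")
```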
We would like to solve the inequality

$$x \ge B + \tfrac{3}{2}A(x) + \sqrt{B\,A(x) + \tfrac{5}{4}A^2(x)} \tag{3}$$

with respect to $x$. More precisely, we would like to find a suitable upper bound on the (unique) $x^*$ such that (3) is satisfied as an equality. A (tedious) derivative argument, along with the upper bound $A(x) \le 4\ln\frac{x+3}{\delta}$, shows that

$$x' = B + 2\sqrt{B\ln\frac{B+3}{\delta}} + 36\ln\frac{B+3}{\delta}$$

makes the left-hand side of (3) larger than its right-hand side. Thus $x'$ is an upper bound on $x^*$, and we conclude that

$$\mathbb{P}\big(x \ge x'\big) \le (1)\,,$$

which, recalling the definitions of $x$, $x'$, and $B$, and combining with (2), proves the bound. □

3 Selecting a good hypothesis from the ensemble

If the decision space $D$ of $A$ is a convex set and the loss function $\ell$ is convex in its first argument, then via Jensen's inequality we can directly apply the bound of Proposition 2 to the risk of the average hypothesis $\overline{H} = \frac{1}{n}\sum_{t=1}^{n} H_{t-1}$. This yields

$$\mathbb{P}\left(\mathrm{risk}(\overline{H}) \ge M_n + \frac{36}{n}\ln\frac{n M_n + 3}{\delta} + 2\sqrt{\frac{M_n}{n}\ln\frac{n M_n + 3}{\delta}}\right) \le \delta\,. \tag{4}$$

Observe that this is an $O(1/n)$ bound whenever the cumulative loss $n M_n$ is $O(1)$.

If the convexity hypotheses do not hold (as in the case of classification problems), then the bound in (4) applies to a hypothesis randomly drawn from the ensemble (this was investigated in [1], though with different goals). In this section we show how to deterministically pick from the ensemble a hypothesis whose risk is close to the average ensemble risk.

To see how this can be done, let us first introduce the functions

$$\mathcal{E}_{\delta}(r, t) = \frac{B}{3(n-t)} + \sqrt{\frac{2Br}{n-t}} \qquad\text{and}\qquad c_{\delta}(r, t) = \mathcal{E}_{\delta}\!\left(r + \sqrt{\frac{2Br}{n-t}},\; t\right), \qquad\text{with } B = \ln\frac{n(n+2)}{\delta}\,.$$

Let $\mathrm{risk}_{\mathrm{emp}}(H_t, t+1) + \mathcal{E}_{\delta}\big(\mathrm{risk}_{\mathrm{emp}}(H_t, t+1), t\big)$ be the penalized empirical risk of hypothesis $H_t$, where

$$\mathrm{risk}_{\mathrm{emp}}(H_t, t+1) = \frac{1}{n-t}\sum_{i=t+1}^{n} \ell(H_t(X_i), Y_i)$$

is the empirical risk of $H_t$ on the remaining sample $Z_{t+1}, \ldots$
$, Z_n$. We now analyze the performance of the learning algorithm that returns the hypothesis $\hat{H}$ minimizing the penalized risk estimate over all hypotheses in the ensemble, i.e.,

$$\hat{H} = \operatorname*{argmin}_{0 \le t < n}\Big(\mathrm{risk}_{\mathrm{emp}}(H_t, t+1) + \mathcal{E}_{\delta}\big(\mathrm{risk}_{\mathrm{emp}}(H_t, t+1), t\big)\Big)\,. \tag{5}$$

Lemma 3 For any $0 < \delta \le 1$, the hypothesis $\hat{H}$ defined in (5) satisfies

$$\mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \min_{0 \le t < n}\big(\mathrm{risk}(H_t) + 2\,c_{\delta}(\mathrm{risk}(H_t), t)\big)\Big) \le \delta\,.$$

Proof. Let $T^* = \operatorname*{argmin}_{0 \le t < n}\big(\mathrm{risk}(H_t) + 2\,c_{\delta}(\mathrm{risk}(H_t), t)\big)$, $H^* = H_{T^*}$, and set $R_t = \mathrm{risk}_{\mathrm{emp}}(H_t, t+1)$ and $R^* = R_{T^*}$. Also, let $Q(r, t)$ be defined so that, for each $t$, the condition

$$\mathrm{risk}(H_t) < R_t - \frac{B + \sqrt{B\big(B + 18(n-t)\,\mathrm{risk}(H_t)\big)}}{3(n-t)}$$

is equivalent to $\mathrm{risk}(H_t) < R_t - Q(R_t, t)$. We can write

$$\begin{aligned}
\mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2\,c_{\delta}(\mathrm{risk}(H^*), T^*)\Big)
&\le \mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2\,c_{\delta}\big(R^* - Q(R^*, T^*), T^*\big)\Big)\\
&\qquad + \mathbb{P}\Big(\mathrm{risk}(H^*) < R^* - Q(R^*, T^*)\Big)\\
&\le \mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2\,c_{\delta}\big(R^* - Q(R^*, T^*), T^*\big)\Big)\\
&\qquad + \sum_{t=0}^{n-1}\mathbb{P}\Big(\mathrm{risk}(H_t) < R_t - Q(R_t, t)\Big)\,.
\end{aligned}$$

Applying the standard Bernstein inequality (see, e.g., [3, Ch. 8]) to the random variables $R_t$, with $|R_t| \le 1$ and expected value $\mathrm{risk}(H_t)$, and upper bounding the variance of $R_t$ with $\mathrm{risk}(H_t)$, yields

$$\mathbb{P}\left(\mathrm{risk}(H_t) < R_t - \frac{B + \sqrt{B\big(B + 18(n-t)\,\mathrm{risk}(H_t)\big)}}{3(n-t)}\right) \le e^{-B}\,,$$

that is, $\mathbb{P}\big(\mathrm{risk}(H_t) < R_t - Q(R_t, t)\big) \le e^{-B}$. Hence, we get

$$\begin{aligned}
\mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2\,c_{\delta}(\mathrm{risk}(H^*), T^*)\Big)
&\le \mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2\,c_{\delta}\big(R^* - Q(R^*, T^*), T^*\big)\Big) + n\,e^{-B}\\
&\le \mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2\,\mathcal{E}_{\delta}(R^*, T^*)\Big) + n\,e^{-B}\,,
\end{aligned}$$

where in the last step we used

$$Q(r, t) \le \sqrt{\frac{2Br}{n-t}} \qquad\text{and}\qquad c_{\delta}\!\left(r - \sqrt{\frac{2Br}{n-t}},\; t\right) = \mathcal{E}_{\delta}(r, t)\,.$$

Set for brevity $E = \mathcal{E}_{\delta}(R^*, T^*)$, and let $\hat{T}$ be the index of the selected hypothesis, $\hat{H} = H_{\hat{T}}$. We have

$$\begin{aligned}
\mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2E\Big)
&= \mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2E,\; R_{\hat{T}} + \mathcal{E}_{\delta}(R_{\hat{T}}, \hat{T}) \le R^* + E\Big)\\
&\qquad\text{(since } R_{\hat{T}} + \mathcal{E}_{\delta}(R_{\hat{T}}, \hat{T}) \le R^* + E \text{ holds with certainty)}\\
&\le \sum_{t=0}^{n-1}\mathbb{P}\Big(R_t + \mathcal{E}_{\delta}(R_t, t) \le R^* + E,\; \mathrm{risk}(H_t) > \mathrm{risk}(H^*) + 2E\Big)\,. \tag{6}
\end{aligned}$$

Now, if $R_t + \mathcal{E}_{\delta}(R_t, t) \le R^* + E$ holds, then at least one of the following three conditions

$$R_t \le \mathrm{risk}(H_t) - \mathcal{E}_{\delta}(R_t, t)\,, \qquad R^* > \mathrm{risk}(H^*) + E\,, \qquad \mathrm{risk}(H_t) - \mathrm{risk}(H^*) < 2E$$

must hold.
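The claim that these three conditions cover the event appearing in each summand of (6) is a purely arithmetic implication, and can be random-tested mechanically. In the sketch below (illustrative only) the six quantities involved are treated as free numbers in $[0,1]$:

```python
import random

def cases_cover(risk_t, risk_star, r_t, r_star, eps_t, e):
    """True iff one of the three conditions holds whenever
    R_t + E_delta(R_t, t) <= R* + E does; here eps_t and e stand
    for E_delta(R_t, t) and E respectively."""
    if r_t + eps_t > r_star + e:
        return True  # the event of (6) fails, nothing to cover
    return (r_t <= risk_t - eps_t) or (r_star > risk_star + e) \
        or (risk_t - risk_star < 2 * e)

random.seed(0)
assert all(cases_cover(*[random.random() for _ in range(6)])
           for _ in range(100_000))
print("ok")
```

Indeed, if all three conditions failed, then $R_t + \mathcal{E}_{\delta}(R_t,t) > \mathrm{risk}(H_t) \ge \mathrm{risk}(H^*) + 2E \ge R^* + E$, contradicting the event.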
Hence, for any fixed $t$ we can write

$$\begin{aligned}
\mathbb{P}\Big(R_t + \mathcal{E}_{\delta}(R_t, t) \le R^* + E,\; \mathrm{risk}(H_t) > \mathrm{risk}(H^*) + 2E\Big)
&\le \mathbb{P}\Big(R_t \le \mathrm{risk}(H_t) - \mathcal{E}_{\delta}(R_t, t),\; \mathrm{risk}(H_t) > \mathrm{risk}(H^*) + 2E\Big)\\
&\quad + \mathbb{P}\Big(R^* > \mathrm{risk}(H^*) + E,\; \mathrm{risk}(H_t) > \mathrm{risk}(H^*) + 2E\Big)\\
&\quad + \mathbb{P}\Big(\mathrm{risk}(H_t) - \mathrm{risk}(H^*) < 2E,\; \mathrm{risk}(H_t) > \mathrm{risk}(H^*) + 2E\Big)\\
&\le \mathbb{P}\Big(R_t \le \mathrm{risk}(H_t) - \mathcal{E}_{\delta}(R_t, t)\Big) + \mathbb{P}\Big(R^* > \mathrm{risk}(H^*) + E\Big)\,. \tag{7}
\end{aligned}$$

Plugging (7) into (6) we have

$$\begin{aligned}
\mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2E\Big)
&\le \sum_{t=0}^{n-1}\mathbb{P}\Big(R_t \le \mathrm{risk}(H_t) - \mathcal{E}_{\delta}(R_t, t)\Big) + n\,\mathbb{P}\Big(R^* > \mathrm{risk}(H^*) + E\Big)\\
&\le n\,e^{-B} + n\sum_{t=0}^{n-1}\mathbb{P}\Big(R_t \ge \mathrm{risk}(H_t) + \mathcal{E}_{\delta}(R_t, t)\Big) \le n\,e^{-B} + n^2 e^{-B}\,,
\end{aligned}$$

where in the last two inequalities we again applied Bernstein's inequality to the random variables $R_t$ with mean $\mathrm{risk}(H_t)$. Putting everything together we obtain

$$\mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \mathrm{risk}(H^*) + 2\,c_{\delta}(\mathrm{risk}(H^*), T^*)\Big) \le (2n + n^2)\,e^{-B}$$

which, recalling that $B = \ln\frac{n(n+2)}{\delta}$ and $n^2 + 2n = n(n+2)$, implies the thesis. □

Fix $n \ge 1$ and $\delta \in (0,1)$. For each $t = 0, \ldots, n-1$, introduce the function

$$f_t(x) = x + \frac{11\,C\,\big(\ln(n-t) + 1\big)}{3(n-t)} + 2\sqrt{\frac{Cx}{n-t}}\,, \qquad x \ge 0\,,$$

where $C = \ln\frac{2n(n+2)}{\delta}$. Note that each $f_t$ is monotonically increasing. We are now ready to state and prove the main result of this paper.

Theorem 4 Fix any loss function $\ell$ satisfying $0 \le \ell \le 1$. Let $H_0, \ldots, H_{n-1}$ be the ensemble of hypotheses generated by an arbitrary on-line algorithm $A$, and let $\hat{H}$ be the hypothesis minimizing the penalized empirical risk expression obtained by replacing $\mathcal{E}_{\delta}$ with $\mathcal{E}_{\delta/2}$ in (5). Then, for any $0 < \delta \le 1$, $\hat{H}$ satisfies

$$\mathbb{P}\left(\mathrm{risk}(\hat{H}) \ge \min_{0 \le t < n} f_t\!\left(M_{t,n} + \frac{36}{n-t}\ln\frac{2n(n+3)}{\delta}\right)\right) \le \delta\,,$$

where, in analogy with $M_n$, $M_{t,n} = \frac{1}{n-t}\sum_{i=t+1}^{n}\ell(H_{i-1}(X_i), Y_i)$ denotes the average loss incurred by $A$ on trials $t+1, \ldots, n$.

Proof. By Lemma 3, applied with confidence level $\delta/2$,

$$\mathbb{P}\Big(\mathrm{risk}(\hat{H}) > \min_{0 \le t < n}\big(\mathrm{risk}(H_t) + 2\,c_{\delta/2}(\mathrm{risk}(H_t), t)\big)\Big) \le \frac{\delta}{2}\,. \tag{9}$$
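To summarize the procedure analyzed in this section, here is a minimal sketch of the selection rule (5). The loss values and $\delta$ below are illustrative, and the penalty is taken to be the Bernstein-type quantity $\mathcal{E}_{\delta}(r,t) = \frac{B}{3(n-t)} + \sqrt{\frac{2Br}{n-t}}$ with $B = \ln\frac{n(n+2)}{\delta}$, with constants as given in Section 3:

```python
import math

def select_hypothesis(loss_rows, delta):
    """Return the index t minimizing the penalized empirical risk
    riskemp(H_t, t+1) + E_delta(riskemp(H_t, t+1), t).

    loss_rows[t] lists the losses of H_t on the remaining examples
    z_{t+1}, ..., z_n, so row t contains n - t values in [0, 1].
    """
    n = len(loss_rows)
    b = math.log(n * (n + 2) / delta)
    best_t, best_pen = 0, float("inf")
    for t, row in enumerate(loss_rows):
        m = n - t
        r = sum(row) / m                                  # riskemp(H_t, t+1)
        pen = r + b / (3.0 * m) + math.sqrt(2.0 * b * r / m)
        if pen < best_pen:
            best_t, best_pen = t, pen
    return best_t

# toy run: H_1 has low empirical risk on a still-sizable validation suffix
rows = [[0.9, 0.8, 0.7], [0.1, 0.2], [0.5]]
print(select_hypothesis(rows, delta=0.1))  # -> 1
```

Hypotheses generated late in the run are validated on fewer examples, so $\mathcal{E}_{\delta}$ penalizes them more heavily; this is what allows the deterministic choice to compete with the best hypothesis in the ensemble.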