{"title": "Exploiting easy data in online optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 810, "page_last": 818, "abstract": "We consider the problem of online optimization, where a learner chooses a decision from a given decision set and suffers some loss associated with the decision and the state of the environment. The learner's objective is to minimize its cumulative regret against the best fixed decision in hindsight. Over the past few decades numerous variants have been considered, with many algorithms designed to achieve sub-linear regret in the worst case. However, this level of robustness comes at a cost. Proposed algorithms are often over-conservative, failing to adapt to the actual complexity of the loss sequence which is often far from the worst case. In this paper we introduce a general algorithm that, provided with a safe learning algorithm and an opportunistic benchmark, can effectively combine good worst-case guarantees with much improved performance on easy data. We derive general theoretical bounds on the regret of the proposed algorithm and discuss its implementation in a wide range of applications, notably in the problem of learning with shifting experts (a recent COLT open problem). Finally, we provide numerical simulations in the setting of prediction with expert advice with comparisons to the state of the art.", "full_text": "Exploiting easy data in online optimization\n\nAmir Sani\n\nGergely Neu\n\nSequeL team, INRIA Lille \u2013 Nord Europe, France\n\nAlessandro Lazaric\n\n{amir.sani,gergely.neu,alessandro.lazaric}@inria.fr\n\nAbstract\n\nWe consider the problem of online optimization, where a learner chooses a deci-\nsion from a given decision set and suffers some loss associated with the decision\nand the state of the environment. The learner\u2019s objective is to minimize its cu-\nmulative regret against the best \ufb01xed decision in hindsight. 
Over the past few decades numerous variants have been considered, with many algorithms designed to achieve sub-linear regret in the worst case. However, this level of robustness comes at a cost. Proposed algorithms are often over-conservative, failing to adapt to the actual complexity of the loss sequence, which is often far from the worst case. In this paper we introduce a general algorithm that, provided with a “safe” learning algorithm and an opportunistic “benchmark”, can effectively combine good worst-case guarantees with much improved performance on “easy” data. We derive general theoretical bounds on the regret of the proposed algorithm and discuss its implementation in a wide range of applications, notably in the problem of learning with shifting experts (a recent COLT open problem). Finally, we provide numerical simulations in the setting of prediction with expert advice with comparisons to the state of the art.

1 Introduction

We consider a general class of online decision-making problems, where a learner sequentially decides which actions to take from a given decision set and suffers some loss associated with the decision and the state of the environment. The learner’s goal is to minimize its cumulative loss as the interaction between the learner and the environment is repeated. Performance is usually measured in terms of the regret; that is, the difference between the cumulative loss of the algorithm and that of the best single decision in the decision set over the horizon. The objective of the learning algorithm is to guarantee that the per-round regret converges to zero as time progresses. This general setting includes a wide range of applications such as online linear pattern recognition, sequential investment, and time series prediction.

Numerous variants of this problem have been considered over the last few decades, mainly differing in the shape of the decision set (see [6] for an overview). 
One of the most popular variants is the problem of prediction with expert advice, where the decision set is the N-dimensional simplex and the per-round losses are linear functions of the learner’s decision. In this setting, a number of algorithms are known to guarantee regret of order √T after T repetitions of the game. Another well-studied setting is online convex optimization (OCO), where the decision set is a convex subset of R^d and the loss functions are convex and smooth. Again, a number of simple algorithms are known to guarantee a worst-case regret of order √T in this setting. These results hold for any (possibly adversarial) assignment of the loss sequences. Thus, these algorithms are guaranteed to achieve a decreasing per-round regret that approaches the performance of the best fixed decision in hindsight even in the worst case. Furthermore, these guarantees are unimprovable in the sense that there exist sequences of loss functions on which the learner suffers Ω(√T) regret no matter what algorithm it uses. However, this robustness comes at a cost. These algorithms are often over-conservative and fail to adapt to the actual complexity of the loss sequence, which in practice is often far from the worst possible. In fact, it is well known that making some assumptions on the loss-generating mechanism improves the regret guarantees. For instance, the simple strategy of following the leader (FTL, otherwise known as fictitious play in game theory; see, e.g., [6, Chapter 7]), which at each round picks the single decision that minimizes the total losses so far, guarantees O(log T) regret in the expert setting when assuming i.i.d. loss vectors. The same strategy also guarantees O(log T) regret in the OCO setting when assuming all loss functions are strongly convex. 
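To make the rule concrete, here is a minimal sketch of FTL in the expert setting; the helper name `ftl_expert` is ours, not from the paper:

```python
import numpy as np

def ftl_expert(loss_history):
    """Follow-the-leader in the expert setting: play the expert whose
    cumulative loss over the rounds seen so far is smallest (ties are
    broken by the lowest index)."""
    cumulative = loss_history.sum(axis=0)   # total loss of each expert so far
    return int(np.argmin(cumulative))

# After three rounds, expert 1 has the smallest total loss (0.6).
history = np.array([[0.9, 0.1, 0.5],
                    [0.8, 0.2, 0.4],
                    [0.7, 0.3, 0.6]])
leader = ftl_expert(history)   # -> 1
```

On i.i.d. losses the leader quickly stabilizes on the best expert, which is exactly why FTL is opportunistic on such sequences while remaining fragile against adversarial ones.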
On the other hand, the risk of using this strategy is that it is known to suffer Ω(T) regret in the worst case.

This paper focuses on how to distinguish between “easy” and “hard” problem instances, while achieving the best possible guarantees on both types of loss sequences. This problem has recently received much attention in a variety of settings (see, e.g., [8] and [13]), but most of the proposed solutions required the development of ad-hoc algorithms for each specific scenario and definition of an “easy” problem. Another obvious downside of such ad-hoc solutions is that their theoretical analysis is often quite complicated and difficult to generalize to more complex problems. In the current paper, we set out to define an algorithm providing a general structure that can be instantiated in a wide range of settings by simply plugging in the most appropriate choice of two algorithms, one for learning on “easy” problems and one for “hard” problems.

Aside from exploiting easy data, our method has other potential applications. For example, in some sensitive applications we may want to protect ourselves from complete catastrophe rather than take risks for higher payoffs. In fact, our work builds directly on the results of Even-Dar et al. [9], who point out that learning algorithms in the experts setting may fail to satisfy the rather natural requirement of performing strictly better than a trivial algorithm that merely decides which expert to follow by uniform coin flips. While Even-Dar et al. propose methods that achieve this goal, they leave an obvious question open: is it possible to strictly improve the performance of an existing (and possibly naïve) solution by means of principled online learning methods? This problem can be seen as the polar opposite of failing to exploit easy data. In this paper, we push the idea of Even-Dar et al. one step further. 
We construct learning algorithms with order-optimal regret bounds, while also guaranteeing that their cumulative loss stays within an additive constant of the loss of some pre-defined strategy referred to as the benchmark. We stress that this property is much stronger than simply guaranteeing O(1) regret with respect to some fixed distribution D, as done by Even-Dar et al. [9], since we allow comparisons to any fixed strategy, one that is even allowed to learn. Our method guarantees that replacing an existing solution can be done at a negligible price in terms of output performance, with additional strong guarantees on the worst-case performance. However, in what follows, we will only regard this aspect of our results as an interesting consequence, while emphasizing the ability of our algorithm to exploit easy data. Our general structure, referred to as (A,B)-PROD, receives a learning algorithm A and a benchmark B as input. Depending on the online optimization setting, it is enough to set A to any learning algorithm with performance guarantees on “hard” problems and B to an opportunistic strategy exploiting the structure of “easy” problems. (A,B)-PROD smoothly mixes the decisions of A and B, achieving the best possible guarantees of both.

2 Online optimization with a benchmark

Parameters: set of decisions S, number of rounds T.
For all t = 1, 2, . . . , T, repeat
1. The environment chooses the loss function f_t : S → [0, 1].
2. The learner chooses a decision x_t ∈ S.
3. The environment reveals f_t (possibly chosen depending on the past history of losses and decisions).
4. The forecaster suffers the loss f_t(x_t).

Figure 1: The protocol of online optimization.

We now present the formal setting and an algorithm for online optimization with a benchmark. The interaction protocol between the learner and the environment is formally described in Figure 1. 
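In code, one run of this protocol can be sketched as follows; the `act`/`observe`/`loss_fn` interface and the toy learner and environment are illustrative assumptions of ours, not part of the paper:

```python
def run_protocol(learner, environment, T):
    """One run of the online optimization protocol of Figure 1: in each
    round the learner commits to a decision, the environment then reveals
    the loss function, and the learner suffers its loss in [0, 1]."""
    total_loss = 0.0
    for t in range(T):
        x_t = learner.act()            # decision chosen before f_t is seen
        f_t = environment.loss_fn(t)   # may depend on the past history
        total_loss += f_t(x_t)
        learner.observe(f_t)           # full-information feedback
    return total_loss

class ConstantLearner:
    """Trivial learner that always plays decision 0, for illustration only."""
    def act(self):
        return 0
    def observe(self, f_t):
        pass

class ConstantEnv:
    """Toy environment that always charges decision 0 a loss of 0.5."""
    def loss_fn(self, t):
        return lambda x: 0.5 if x == 0 else 1.0

# Ten rounds of loss 0.5 give a total loss of 5.0.
```

The learner and the benchmark of the next paragraphs both expose this interface; the hedging strategy only needs the scalar losses they suffer, not the full loss functions.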
The online optimization problem is characterized by the decision set S and the class F ⊆ [0, 1]^S of loss functions utilized by the environment. The performance of the learner is usually measured in terms of the regret, defined as R_T = sup_{x∈S} Σ_{t=1}^T (f_t(x_t) − f_t(x)). We say that an algorithm learns if it makes decisions so that R_T = o(T).

Let A and B be two online optimization algorithms that map observation histories to decisions in a possibly randomized fashion. For a formal definition, we fix a time index t ∈ [T] = {1, 2, . . . , T} and define the observation history (or, in short, the history) at the end of round t − 1 as H_{t−1} = (f_1, . . . , f_{t−1}), with H_0 defined as the empty set. Furthermore, define the random variables U_t and V_t, drawn from the standard uniform distribution, independently of H_{t−1} and of each other. The learning algorithms A and B are formally defined as mappings from F* × [0, 1] to S, with their respective decisions given as

a_t := A(H_{t−1}, U_t)   and   b_t := B(H_{t−1}, V_t).

Finally, we define a hedging strategy C that produces a decision x_t based on the history of decisions proposed by A and B, with the possible help of some external randomness represented by the uniform random variable W_t, as x_t = C(a_t, b_t, H*_{t−1}, W_t). Here, H*_{t−1} is the simplified history consisting of (f_1(a_1), f_1(b_1), . . . , f_{t−1}(a_{t−1}), f_{t−1}(b_{t−1})), and C bases its decisions only on the past losses incurred by A and B, without using any further information on the loss functions. The total expected loss of C is defined as L̂_T(C) = E[Σ_{t=1}^T f_t(x_t)], where the expectation integrates over the possible realizations of the internal randomization of A, B, and C. 
The total expected losses of A, B, and any fixed decision x ∈ S are defined similarly.

Our goal is to define a hedging strategy with low regret against a benchmark strategy B, while also enjoying near-optimal guarantees on the worst-case regret against the best decision in hindsight. The (expected) regret of C against any fixed decision x ∈ S and against the benchmark are defined as

R_T(C, x) = E[Σ_{t=1}^T (f_t(x_t) − f_t(x))],   R_T(C, B) = E[Σ_{t=1}^T (f_t(x_t) − f_t(b_t))].

Our hedging strategy, (A,B)-PROD, is based on the classic PROD algorithm popularized by Cesa-Bianchi et al. [7] and builds on a variant of PROD called D-PROD, proposed in Even-Dar et al. [9], which (when properly tuned) achieves constant regret against the performance of a fixed distribution D over experts, while guaranteeing O(√(T log T)) regret against the best expert in hindsight. Our variant, (A,B)-PROD (shown in Figure 2), is based on the observation that it is not necessary to use a fixed distribution D in the definition of the benchmark: any learning algorithm or signal can be used as a baseline. (A,B)-PROD maintains two weights, balancing the advice of the learning algorithm A and the benchmark B. The benchmark weight is defined as w_{1,B} ∈ (0, 1) and is kept unchanged during the entire learning process. The initial weight assigned to A is w_{1,A} = 1 − w_{1,B}, and in the remaining rounds t = 2, 3, . . . , T it is updated as

w_{t,A} = w_{1,A} Π_{s=1}^{t−1} (1 − η (f_s(a_s) − f_s(b_s))),

where the difference between the losses of A and B is used. The output x_t is set to a_t with probability s_t = w_{t,A}/(w_{t,A} + w_{1,B}), and to b_t otherwise.¹

Figure 2: (A,B)-PROD.
Input: learning rate η ∈ (0, 1/2], initial weights {w_{1,A}, w_{1,B}}, number of rounds T.
For all t = 1, 2, . . . , T, repeat
1. Let s_t = w_{t,A} / (w_{t,A} + w_{1,B}).
2. Observe a_t and b_t and predict x_t = a_t with probability s_t, and x_t = b_t otherwise.
3. Observe f_t and suffer loss f_t(x_t).
4. Feed f_t to A and B.
5. Compute δ_t = f_t(b_t) − f_t(a_t) and set w_{t+1,A} = w_{t,A} · (1 + η δ_t).

The following theorem states the performance guarantees for (A,B)-PROD.

Theorem 1 (cf. Lemma 1 in [9]). For any assignment of the loss sequence, the total expected loss of (A,B)-PROD initialized with weights w_{1,B} ∈ (0, 1) and w_{1,A} = 1 − w_{1,B} simultaneously satisfies

L̂_T((A,B)-PROD) ≤ L̂_T(A) + η Σ_{t=1}^T (f_t(b_t) − f_t(a_t))² − (log w_{1,A})/η

and

L̂_T((A,B)-PROD) ≤ L̂_T(B) − (log w_{1,B})/η.

¹For convex decision sets S and loss families F, one can directly set x_t = s_t a_t + (1 − s_t) b_t at no expense.

The proof directly follows from the PROD analysis of Cesa-Bianchi et al. [7]. Next, we suggest a parameter setting for (A,B)-PROD that guarantees constant regret against the benchmark B and O(√(T log T)) regret against the learning algorithm A in the worst case.

Corollary 1. Let C ≥ 1 be an upper bound on the total benchmark loss L̂_T(B). Then setting η = (1/2)√((log C)/C) < 1/2 and w_{1,B} = 1 − w_{1,A} = 1 − η simultaneously guarantees

R_T((A,B)-PROD, x) ≤ R_T(A, x) + 2√(C log C) for any x ∈ S, and
R_T((A,B)-PROD, B) ≤ 2 log 2

against any assignment of the loss sequence. Notice that for any x ∈ S, the previous bounds can be combined as

R_T((A,B)-PROD, x) ≤ min{R_T(A, x) + 2√(C log C), R_T(B, x) + 2 log 2},

which states that (A,B)-PROD achieves the minimum between the regret of the benchmark B and that of the learning algorithm A, plus an additional regret of O(√(C log C)). 
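A minimal sketch of the (A,B)-PROD loop of Figure 2, together with the tuning of Corollary 1, might look as follows; here `A` and `B` are stand-ins for the two input algorithms, each mapping the observed loss history to a decision, and the function names are ours, not the authors':

```python
import numpy as np

def corollary1_tuning(C):
    """Corollary 1: eta = (1/2) * sqrt(log(C) / C) and w_{1,B} = 1 - eta,
    where C > 1 upper bounds the total loss of the benchmark."""
    eta = 0.5 * np.sqrt(np.log(C) / C)
    return eta, 1.0 - eta

def ab_prod(A, B, losses, eta, w1_B, rng):
    """(A,B)-Prod (Figure 2): follow A with probability s_t, keep the
    benchmark weight fixed, and update only A's weight multiplicatively
    with delta_t = f_t(b_t) - f_t(a_t)."""
    w_A = 1.0 - w1_B                      # w_{1,A} = 1 - w_{1,B}
    history, total_loss = [], 0.0
    for f_t in losses:
        a_t, b_t = A(history), B(history)
        s_t = w_A / (w_A + w1_B)          # probability of playing a_t
        x_t = a_t if rng.random() < s_t else b_t
        total_loss += f_t(x_t)
        delta_t = f_t(b_t) - f_t(a_t)     # benchmark loss minus A's loss
        w_A *= 1.0 + eta * delta_t        # Prod update; w1_B never changes
        history.append(f_t)
    return total_loss
```

If A and B agree in every round, δ_t = 0, the weights never move, and the hedge simply plays their common decision; the interesting behavior is how w_A drifts when one of the two is consistently better.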
If we consider that in most online optimization settings the worst-case regret of a learning algorithm is O(√T), the previous bound shows that, at the cost of an additional O(√(T log T)) regret in the worst case, (A,B)-PROD performs as well as the benchmark, which is very useful whenever R_T(B, x) is small. This suggests that if we set A to a learning algorithm with worst-case guarantees on “difficult” problems and B to an algorithm with very good performance only on “easy” problems, then (A,B)-PROD successfully adapts to the difficulty of the problem by finding a suitable mixture of A and B. Furthermore, as discussed by Even-Dar et al. [9], we note that in this case the PROD update rule is crucial to achieve this result: any algorithm that bases its decisions solely on the cumulative difference between f_t(a_t) and f_t(b_t) is bound to suffer an additional regret of O(√T) against both A and B. While HEDGE and follow-the-perturbed-leader (FPL) both fall into this category, it can easily be seen that this is not the case for PROD. A similar observation has been made by de Rooij et al. [8], who discuss the possibility of combining a robust learning algorithm and FTL by HEDGE, and conclude that this approach is insufficient for their goals; see also Sect. 3.1.

Finally, we note that the parameter proposed in Corollary 1 can hardly be computed in practice, since an upper bound on the loss of the benchmark L̂_T(B) is rarely available. Fortunately, we can adapt an improved version of PROD with adaptive learning rates, recently proposed by Gaillard et al. [11], and obtain an anytime version of (A,B)-PROD. The resulting algorithm and its corresponding bounds are reported in App. B.

3 Applications

The following sections apply our results to special cases of online optimization. 
Unless otherwise noted, all theorems are direct consequences of Corollary 1 and thus their proofs are omitted.

3.1 Prediction with expert advice

We first consider the most basic online optimization problem of prediction with expert advice. Here, S is the N-dimensional simplex Δ_N = {x ∈ R_+^N : Σ_{i=1}^N x_i = 1} and the loss functions are linear; that is, the loss of any decision x ∈ Δ_N in round t is given as the inner product f_t(x) = x^⊤ ℓ_t, where ℓ_t ∈ [0, 1]^N is the loss vector in round t. Accordingly, the family F of loss functions can be equivalently represented by the set [0, 1]^N. Many algorithms are known to achieve the optimal regret guarantee of O(√(T log N)) in this setting, including HEDGE (so dubbed by Freund and Schapire [10]; see also the seminal works of Littlestone and Warmuth [20] and Vovk [23]) and the follow-the-perturbed-leader (FPL) prediction method of Hannan [16], later rediscovered by Kalai and Vempala [19]. However, as de Rooij et al. [8] note, these algorithms are usually too conservative to exploit “easily learnable” loss sequences and might be significantly outperformed by the simple strategy known as follow-the-leader (FTL), which predicts b_t = argmin_{x∈S} x^⊤ Σ_{s=1}^{t−1} ℓ_s. For instance, FTL is known to be optimal in the case of i.i.d. losses, where it achieves a regret of O(log T). As a direct consequence of Corollary 1, we can use the general structure of (A,B)-PROD to match the performance of FTL on easy data and, at the same time, obtain the same worst-case guarantees as standard algorithms for prediction with expert advice. In particular, if we set FTL as the benchmark B and ADAHEDGE (see [8]) as the learning algorithm A, we obtain the following.

Theorem 2. Let S = Δ_N and F = [0, 1]^N. 
Running (A,B)-PROD with A = ADAHEDGE and B = FTL, with the parameter setting suggested in Corollary 1, simultaneously guarantees

R_T((A,B)-PROD, x) ≤ R_T(ADAHEDGE, x) + 2√(C log C) ≤ √((L*_T (T − L*_T)/T) log N) + 2√(C log C)

for any x ∈ S, where L*_T = min_{x∈Δ_N} L_T(x), and

R_T((A,B)-PROD, FTL) ≤ 2 log 2

against any assignment of the loss sequence.

While we recover the worst-case guarantee of O(√(T log N)) plus an additional regret of O(√(T log T)) on “hard” loss sequences, on “easy” problems we inherit the good performance of FTL.

Comparison with FLIPFLOP. The FLIPFLOP algorithm proposed by de Rooij et al. [8] addresses the problem of constructing algorithms that perform nearly as well as FTL on easy problems while retaining optimal guarantees on all possible loss sequences. More precisely, FLIPFLOP is a HEDGE algorithm where the learning rate η alternates between infinity (corresponding to FTL) and the value suggested by ADAHEDGE, depending on the cumulative mixability gaps over the two regimes. The resulting algorithm is guaranteed to achieve the regret guarantees

R_T(FLIPFLOP, x) ≤ 5.64 R_T(FTL, x) + 3.73

and

R_T(FLIPFLOP, x) ≤ 5.64 √((L*_T (T − L*_T)/T) log N) + O(log N)

against any fixed x ∈ Δ_N at the same time. Notice that while the guarantees in Thm. 2 are very similar in nature to those of de Rooij et al. [8] concerning FLIPFLOP, the two results are slightly different. The first difference is that our worst-case bounds are inferior to theirs by a factor of order √(T log T).² On the positive side, our guarantees are much stronger when FTL outperforms ADAHEDGE. 
To see this, observe that their regret bound can be rewritten as

L_T(FLIPFLOP) ≤ L_T(FTL) + 4.64 (L_T(FTL) − inf_x L_T(x)) + 3.73,

whereas our result replaces the last two terms by 2 log 2.³ The other advantage of our result is that we can directly bound the total loss of our algorithm in terms of the total loss of ADAHEDGE (see Thm. 1). This is to be contrasted with the result of de Rooij et al. [8], who upper bound their regret in terms of the regret bound of ADAHEDGE, which may not be tight and may be much worse in practice than the actual performance of ADAHEDGE. All these advantages of our approach stem from the fact that we smoothly mix the predictions of ADAHEDGE and FTL, while FLIPFLOP explicitly follows one policy or the other for extended periods of time, potentially accumulating unnecessary losses when switching too late or too early. Finally, we note that since FLIPFLOP is a sophisticated algorithm specifically designed for balancing the performance of ADAHEDGE and FTL in the expert setting, we cannot reasonably hope to beat its performance in every respect with our general-purpose algorithm. Notice, however, that the analysis of FLIPFLOP is difficult to generalize to other learning settings such as the ones we discuss in the sections below.

Comparison with D-PROD. In the expert setting, we can also use a straightforward modification of the D-PROD algorithm originally proposed by Even-Dar et al. [9]: this variant of PROD includes the benchmark B in Δ_N as an additional expert and performs PROD updates for each base expert using the difference between the expert and benchmark losses. While the worst-case regret of this algorithm is O(√(C log C log N)), which is asymptotically inferior to the guarantees given by Thm. 2, D-PROD also has its merits in some special cases. 
For instance, in a situation where the total loss of FTL and the regret of ADAHEDGE are both Θ(√T), D-PROD guarantees a regret of O(T^{1/4}) while the (A,B)-PROD guarantee remains O(√T).

²In fact, the worst case for our bound is realized when C = Ω(T), which is precisely the case when ADAHEDGE has excellent performance, as will be seen in Sect. 4.

³While one can parametrize FLIPFLOP so as to decrease the gap between these bounds, the bound on L_T(FLIPFLOP) is always going to be linear in R_T(FLIPFLOP, x).

3.2 Tracking the best expert

We now turn to the problem of tracking the best expert, where the goal of the learner is to control the regret against the best fixed strategy that is allowed to change its prediction at most K times during the entire decision process (see, e.g., [18, 14]). The regret of an algorithm A producing predictions a_1, . . . , a_T against an arbitrary sequence of decisions y_{1:T} ∈ S^T is defined as

R_T(A, y_{1:T}) = Σ_{t=1}^T (f_t(a_t) − f_t(y_t)).

Regret bounds in this setting typically depend on the complexity of the sequence y_{1:T}, as measured by the number of decision switches C(y_{1:T}) = |{t ∈ {2, . . . , T} : y_t ≠ y_{t−1}}|. For example, a properly tuned version of the FIXED-SHARE (FS) algorithm of Herbster and Warmuth [18] guarantees that R_T(FS, y_{1:T}) = O(C(y_{1:T}) √(T log N)). This upper bound can be tightened to O(√(KT log N)) when the learner knows an upper bound K on the complexity of y_{1:T}. While this bound is unimprovable in general, one might wonder if it is possible to achieve better performance when the loss sequence is easy. This precise question was posed very recently as a COLT open problem by Warmuth and Koolen [24]. The generality of our approach allows us to solve their open problem by using (A,B)-PROD as a master algorithm to combine an opportunistic strategy with a principled learning algorithm. 
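For reference, one round of a textbook Fixed-Share update can be sketched as below; this is a generic version under our own parameter names, not the exact tuning used in [18] or in the theorem that follows:

```python
import numpy as np

def fixed_share_update(w, loss_vec, eta, alpha):
    """One round of Fixed-Share: an exponential-weights step followed by a
    'share' step that mixes a fraction alpha of the weight back to the
    uniform distribution, which is what lets the algorithm keep tracking
    a best expert that switches over time."""
    v = w * np.exp(-eta * loss_vec)             # exponential-weights step
    v = v / v.sum()                             # renormalize
    return (1.0 - alpha) * v + alpha / len(w)   # share step toward uniform

w = np.full(4, 0.25)
w = fixed_share_update(w, np.array([1.0, 0.0, 1.0, 1.0]), eta=2.0, alpha=0.1)
# w still sums to 1 and now puts the largest weight on expert 1
```

Because every expert retains weight at least α/N after the share step, a new leader can be picked up quickly after a switch, unlike plain exponential weights whose past leaders can drive other weights arbitrarily close to zero.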
The following theorem states the performance of the (A,B)-PROD-based algorithm.

Theorem 3. Let S = Δ_N, F = [0, 1]^N, and y_{1:T} be any sequence in S with known complexity K = C(y_{1:T}). Running (A,B)-PROD with an appropriately tuned instance of A = FS (see [18]) and the parameter setting suggested in Corollary 1 simultaneously guarantees

R_T((A,B)-PROD, y_{1:T}) ≤ R_T(FS, y_{1:T}) + 2√(C log C) = O(√(KT log N)) + 2√(C log C)

and

R_T((A,B)-PROD, B) ≤ 2 log 2

against any assignment of the loss sequence.

The remaining problem is then to find a benchmark that works well on “easy” problems, notably when the losses are i.i.d. in K (unknown) segments of the rounds 1, . . . , T. Out of the strategies suggested by Warmuth and Koolen [24], we analyze a windowed variant of FTL (referred to as FTL(w)) that bases its decision at time t on the losses observed in the time window [t − w − 1, t − 1] and picks the expert b_t = argmin_{x∈Δ_N} x^⊤ Σ_{s=t−w−1}^{t−1} ℓ_s. The next proposition (proved in the appendix) gives a performance guarantee for FTL(w) with an optimal parameter setting.

Proposition 1. Assume that there exists a partition of [1, T] into K intervals such that the losses are generated i.i.d. within each interval. Furthermore, assume that the expected loss of the best expert within each interval is at least δ away from the expected loss of every other expert. Then, setting w = ⌈4 log(NT/K)/δ²⌉, the regret of FTL(w) is upper bounded for any y_{1:T} as

E[R_T(FTL(w), y_{1:T})] ≤ (4K/δ²) log(NT/K) + 2K,

where the expectation is taken with respect to the distribution of the losses.

3.3 Online convex optimization

Here we consider the problem of online convex optimization (OCO), where S is a convex and closed subset of R^d and F is the family of convex functions on S. 
In this setting, if we assume that the loss functions are smooth (see [25]), an appropriately tuned version of online gradient descent (OGD) is known to achieve a regret of O(√T). As shown by Hazan et al. [17], if we additionally assume that the environment plays strongly convex loss functions and tune the parameters of the algorithm accordingly, the same algorithm can be used to guarantee an improved regret of O(log T). Furthermore, they also show that FTL enjoys essentially the same guarantees. The question of whether the two guarantees can be combined was studied by Bartlett et al. [4], who present the adaptive online gradient descent (AOGD) algorithm that guarantees O(log T) regret when the aggregated loss functions F_t = Σ_{s=1}^t f_s are strongly convex for all t, while retaining the O(√T) bounds if this is not the case. The next theorem shows that we can replace their complicated analysis by our general argument and show essentially the same guarantees.

Theorem 4. Let S be a convex closed subset of R^d and F be the family of smooth convex functions on S. Running (A,B)-PROD with an appropriately tuned instance of A = OGD (see [25]) and B = FTL, with the parameter setting suggested in Corollary 1, simultaneously guarantees

R_T((A,B)-PROD, x) ≤ R_T(OGD, x) + 2√(C log C) = O(√T) + 2√(C log C)

for any x ∈ S and

R_T((A,B)-PROD, FTL) ≤ 2 log 2

against any assignment of the loss sequence. In particular, this implies that R_T((A,B)-PROD, x) = O(log T) if the loss functions are strongly convex.

Similar to the previous settings, at the cost of an additional regret of O(√(T log T)) in the worst case, (A,B)-PROD successfully adapts to “easy” loss sequences, which in this case correspond to strongly convex functions, on which it achieves O(log T) regret.

3.4 Learning with two-point bandit feedback

We consider the multi-armed bandit problem with two-point feedback, where we assume that in each round t the learner picks one arm I_t in the decision set S = {1, 2, . . . , K} and also has the possibility to choose and observe the loss of another arm J_t. The learner suffers the loss f_t(I_t). Unlike the settings considered in the previous sections, the learner only gets to observe the loss function for the arms I_t and J_t. This is a special case of the partial-information game recently studied by Seldin et al. [21]. A similar model has also been studied as a simplified version of online convex optimization with partial feedback [1]. While this setting does not entirely conform to our assumptions concerning A and B, observe that a hedging strategy C defined over A and B only requires access to the losses suffered by the two algorithms and not to the entire loss functions. Formally, we give A and B access to the decision set S, and C to S². The hedging strategy C selects the pair (I_t, J_t) based on the arms suggested by A and B as

(I_t, J_t) = (a_t, b_t) with probability s_t, and (I_t, J_t) = (b_t, a_t) with probability 1 − s_t.

The probability s_t is a well-defined deterministic function of H*_{t−1}, thus the regret bound of (A,B)-PROD can be directly applied. In this case, “easy” problems correspond to i.i.d. loss sequences (with a fixed gap between the expected losses), for which the UCB algorithm of Auer et al. 
[2] is guaranteed to have O(log T) regret, while on “hard” problems we can rely on the EXP3 algorithm of Auer et al. [3], which suffers a regret of O(√(TK)) in the worst case. The next theorem gives the performance guarantee of (A,B)-PROD when combining UCB and EXP3.

Theorem 5. Consider the multi-armed bandit problem with K arms and two-point feedback. Running (A,B)-PROD with an appropriately tuned instance of A = EXP3 (see [3]) and B = UCB (see [2]), with the parameter setting suggested in Corollary 1, simultaneously guarantees

R_T((A,B)-PROD, x) ≤ R_T(EXP3, x) + 2√(C log C) = O(√(TK log K)) + 2√(C log C)

for any arm x ∈ {1, 2, . . . , K} and

R_T((A,B)-PROD, UCB) ≤ 2 log 2

against any assignment of the loss sequence. In particular, if the losses are generated in an i.i.d. fashion and there exists a unique best arm x* ∈ S, then

E[R_T((A,B)-PROD, x)] = O(log T),

where the expectation is taken with respect to the distribution of the losses.

This result shows that even in the multi-armed bandit setting, we can achieve nearly the best performance on both “hard” and “easy” problems, given that we are allowed to pull two arms at a time. This result is to be contrasted with those of Bubeck and Slivkins [5], later improved by Seldin and Slivkins [22], who consider the standard one-point feedback setting. The algorithm of Seldin and Slivkins, called EXP3++, is a variant of the EXP3 algorithm that simultaneously guarantees O(log² T) regret in stochastic environments while retaining the regret bound of O(√(TK log K)) in the adversarial setting. While our result holds under stronger assumptions, Thm. 5 shows that (A,B)-PROD is not restricted to work only in full-information settings. 
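As a small illustration, the (I_t, J_t) selection rule of Section 3.4 can be implemented as follows (the function name is ours); `s_t` is the (A,B)-PROD probability computed from the simplified history:

```python
import numpy as np

def two_point_choice(a_t, b_t, s_t, rng):
    """Select the pair (I_t, J_t): pull A's suggested arm and probe B's
    with probability s_t, and swap the roles otherwise. Either way the
    losses of both suggested arms are observed, which is all the
    (A,B)-Prod update needs."""
    if rng.random() < s_t:
        return a_t, b_t   # I_t = a_t is played, J_t = b_t is only observed
    return b_t, a_t
```

The point of the swap is that the played arm I_t matches the (A,B)-PROD draw, while the second observation J_t supplies the other algorithm's loss, so δ_t is always computable despite the bandit feedback.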
Once again, we note that such a result cannot be obtained by simply combining the predictions of UCB and EXP3 with a generic learning algorithm such as HEDGE.

4 Empirical Results

[Figure 3: Regret over time (T = 2000) in Settings 1–4 for FTL, AdaHedge, FlipFlop, D-Prod, (A,B)-Prod, and (A,B)-Hedge, on the hand-tuned loss sequences from de Rooij et al. [8].]

We study the performance of (A,B)-PROD in the experts setting to verify the theoretical results of Thm. 2, show the importance of the (A,B)-PROD weight update rule, and compare to FLIPFLOP. We report the performance of FTL, ADAHEDGE, FLIPFLOP, and, with B = FTL and A = ADAHEDGE, the anytime versions of D-PROD, (A,B)-PROD, and (A,B)-HEDGE, a variant of (A,B)-PROD that uses an exponential weighting scheme. We consider the two-expert settings defined by de Rooij et al. [8], where deterministic loss sequences of T = 2000 steps are designed to obtain different configurations. (We refer to [8] for a detailed specification of the settings.) The results are reported in Figure 3.
The first remark is that the performance of (A,B)-PROD is always comparable to that of the better of A and B. In setting 1, although FTL suffers linear regret, (A,B)-PROD rapidly adjusts its weights towards ADAHEDGE and finally achieves the same order of performance. In settings 2 and 3 the situation is reversed: FTL has constant regret, while ADAHEDGE has regret of order √T. In this case, after a short initial phase in which (A,B)-PROD has increasing regret, it stabilizes at the same performance as FTL. In setting 4 both ADAHEDGE and FTL have constant regret and (A,B)-PROD attains the same performance. These results match the behavior predicted by the bound of Thm. 2, which guarantees that the regret of (A,B)-PROD is roughly the minimum of those of FTL and ADAHEDGE. As discussed in Sect. 2, the PROD update rule used in (A,B)-PROD plays a crucial role in obtaining constant regret against the benchmark, while other rules, such as the exponential update used in (A,B)-HEDGE, may fail to find a suitable mix between A and B. As illustrated in settings 2 and 3, (A,B)-HEDGE suffers regret similar to ADAHEDGE and fails to take advantage of the good performance of FTL, which has constant regret. In setting 1, (A,B)-HEDGE performs as well as (A,B)-PROD because FTL is consistently worse than ADAHEDGE and its corresponding weight is decreased very quickly, while in setting 4 both FTL and ADAHEDGE achieve constant regret and so does (A,B)-HEDGE. Finally, we compare (A,B)-PROD and FLIPFLOP. As discussed in Sect. 2, the two algorithms share similar theoretical guarantees, with potential advantages of one over the other depending on the specific setting. In particular, FLIPFLOP performs slightly better in settings 2, 3, and 4, whereas (A,B)-PROD obtains smaller regret in setting 1, where the constants in the FLIPFLOP bound show their teeth.
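The contrast between the PROD-style and exponential update rules discussed above can be made concrete with a toy computation (a sketch only: it assumes losses in [0, 1], a fixed learning rate eta, and, as in D-PROD-style schemes, holds the benchmark's weight fixed while A's weight is updated multiplicatively; the exact (A,B)-PROD update and its tuning are given in Sect. 2):

```python
import math

def prod_mix(losses_A, losses_B, eta=0.1, w0=0.5):
    """PROD-style rule: A's weight is multiplied by 1 - eta*(l_A - l_B)
    each round, while the benchmark B keeps its initial weight."""
    w = w0
    for la, lb in zip(losses_A, losses_B):
        w *= 1.0 - eta * (la - lb)
    return w / (w + (1.0 - w0))  # probability of following A

def hedge_mix(losses_A, losses_B, eta=0.1):
    """Exponential (HEDGE-style) rule on cumulative losses."""
    ea = math.exp(-eta * sum(losses_A))
    eb = math.exp(-eta * sum(losses_B))
    return ea / (ea + eb)
```

When A keeps losing to B, both rules drive the probability of following A towards zero; the qualitative difference is that the PROD-style rule measures A against a benchmark whose weight never shrinks, which is what underlies the constant regret to B, whereas under the exponential rule the two weights compete symmetrically.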
While it is not possible to rank the two algorithms definitively, (A,B)-PROD clearly avoids the pathological behavior exhibited by FLIPFLOP in setting 1. Finally, we note that the anytime version of D-PROD is slightly better than (A,B)-PROD, but no consistent difference is observed.

5 Conclusions

We introduced (A,B)-PROD, a general-purpose algorithm which receives a learning algorithm A and a benchmark strategy B as inputs and guarantees nearly the best regret of the two. We showed that whenever A is a learning algorithm with worst-case performance guarantees and B is an opportunistic strategy exploiting a specific structure within the loss sequence, we obtain an algorithm which smoothly adapts to "easy" and "hard" problems. We applied this principle to a number of different settings of online optimization, matching the performance of existing ad-hoc solutions (e.g., AOGD in convex optimization) and solving the open problem of learning on "easy" loss sequences in the tracking-the-best-expert setting proposed by Warmuth and Koolen [24]. We point out that the general structure of (A,B)-PROD could be instantiated in many other settings and scenarios in online optimization, such as learning with switching costs [12, 15] and, more generally, in any problem where the objective is to improve over a given benchmark strategy. The main open problem is the extension of our techniques to work with one-point bandit feedback.

Acknowledgements This work was supported by the French Ministry of Higher Education and Research, by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 270327 (project CompLACS), and by FUI project Hermès.

References

[1] Agarwal, A., Dekel, O., and Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In Kalai, A.
and Mohri, M., editors, Proceedings of the 23rd Annual Conference on Learning Theory (COLT 2010), pages 28–40.

[2] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256.

[3] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.

[4] Bartlett, P. L., Hazan, E., and Rakhlin, A. (2008). Adaptive online gradient descent. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 65–72. Curran Associates.

[5] Bubeck, S. and Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits. In COLT, pages 42.1–42.23.

[6] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA.

[7] Cesa-Bianchi, N., Mansour, Y., and Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352.

[8] de Rooij, S., van Erven, T., Grünwald, P. D., and Koolen, W. M. (2014). Follow the leader if you can, hedge if you must. Accepted to the Journal of Machine Learning Research.

[9] Even-Dar, E., Kearns, M., Mansour, Y., and Wortman, J. (2008). Regret to the best vs. regret to the average. Machine Learning, 72(1-2):21–37.

[10] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139.

[11] Gaillard, P., Stoltz, G., and van Erven, T. (2014). A second-order bound with excess losses. In Balcan, M.-F. and Szepesvári, Cs., editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of JMLR Proceedings, pages 176–196.
JMLR.org.

[12] Geulen, S., Vöcking, B., and Winkler, M. (2010). Regret minimization for online buffering problems using the weighted majority algorithm. In COLT, pages 132–143.

[13] Grünwald, P., Koolen, W. M., and Rakhlin, A., editors (2013). NIPS Workshop on "Learning faster from easy data".

[14] György, A., Linder, T., and Lugosi, G. (2012). Efficient tracking of large classes of experts. IEEE Transactions on Information Theory, 58(11):6709–6725.

[15] György, A. and Neu, G. (2013). Near-optimal rates for limited-delay universal lossy source coding. Submitted to the IEEE Transactions on Information Theory.

[16] Hannan, J. (1957). Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139.

[17] Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192.

[18] Herbster, M. and Warmuth, M. (1998). Tracking the best expert. Machine Learning, 32:151–178.

[19] Kalai, A. and Vempala, S. (2005). Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307.

[20] Littlestone, N. and Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108:212–261.

[21] Seldin, Y., Bartlett, P., Crammer, K., and Abbasi-Yadkori, Y. (2014). Prediction with limited advice and multiarmed bandits with paid observations. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 280–287.

[22] Seldin, Y. and Slivkins, A. (2014). One practical algorithm for both stochastic and adversarial bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1287–1295.

[23] Vovk, V. (1990). Aggregating strategies.
In Proceedings of the Third Annual Workshop on Computational Learning Theory (COLT), pages 371–386.

[24] Warmuth, M. and Koolen, W. (2014). Shifting experts on easy data. COLT 2014 open problem.

[25] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML).