{"title": "Online Learning with Switching Costs and Other Adaptive Adversaries", "book": "Advances in Neural Information Processing Systems", "page_first": 1160, "page_last": 1168, "abstract": "We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player's performance using a new notion of regret, also known as policy regret, which better captures the adversary's adaptiveness to the player's behavior. In a setting where losses are allowed to drift, we characterize ---in a nearly complete manner--- the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is $T^{2/3}$. Interestingly, this rate is significantly worse than the $\\sqrt{T}$ rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force $T^{2/3}$ regret even in the full information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.", "full_text": "Online Learning with Switching Costs and Other Adaptive Adversaries

Nicolò Cesa-Bianchi (Università degli Studi di Milano, Italy), Ofer Dekel (Microsoft Research, USA), Ohad Shamir (Microsoft Research and the Weizmann Institute)

Abstract

We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player's performance using a new notion of regret, also known as policy regret, which better captures the adversary's adaptiveness to the player's behavior.
In a setting where losses are allowed to drift, we characterize —in a nearly complete manner— the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is Θ̃(T^{2/3}). Interestingly, this rate is significantly worse than the Θ(√T) rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force Θ̃(T^{2/3}) regret even in the full-information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.

1 Introduction

An important instance of the framework of prediction with expert advice —see, e.g., [8]— is defined as the following repeated game, between a randomized player with a finite and fixed set of available actions and an adversary. At the beginning of each round of the game, the adversary assigns a loss to each action. Next, the player defines a probability distribution over the actions, draws an action from this distribution, and suffers the loss associated with that action. The player's goal is to accumulate loss at the smallest possible rate, as the game progresses. Two versions of this game are typically considered: in the full-information feedback version, at the end of each round, the player observes the adversary's assignment of loss values to each action.
In the bandit feedback version, the player only observes the loss associated with his chosen action, but not the loss values of other actions.

We assume that the adversary is adaptive (also called nonoblivious by [8] or reactive by [16]), which means that the adversary chooses the loss values on round t based on the player's actions on rounds 1 . . . t − 1. We also assume that the adversary is deterministic and has unlimited computational power. These assumptions imply that the adversary can specify his entire strategy before the game begins. In other words, the adversary can perform all of the calculations needed to specify, in advance, how he plans to react on each round to any sequence of actions chosen by the player.

More formally, let A denote the finite set of actions and let Xt denote the player's random action on round t. We adopt the notation X1:t as shorthand for the sequence X1 . . . Xt. We assume that the adversary defines, in advance, a sequence of history-dependent loss functions f1, f2, . . .. The input to each loss function ft is the entire history of the player's actions so far; therefore, the player's loss on round t is ft(X1:t). Note that the player doesn't observe the functions ft, only the losses that result from his past actions. Specifically, in the bandit feedback model, the player observes ft(X1:t) on round t, whereas in the full-information model, the player observes ft(X1:t−1, x) for all x ∈ A.

On any round T, we evaluate the player's performance so far using the notion of regret, which compares his cumulative loss on the first T rounds to the cumulative loss of the best fixed action in hindsight. Formally, the player's regret on round T is defined as

R_T = Σ_{t=1}^{T} ft(X1:t) − min_{x∈A} Σ_{t=1}^{T} ft(x . . . x) .   (1)

R_T is a random variable, as it depends on the randomized action sequence X1:t.
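The policy-regret quantity in Eq. (1) is easy to make concrete in simulation. A minimal sketch, assuming a hypothetical two-action switching-costs adversary (the loss function and all names below are illustrative, not from the paper):

```python
import random

def f_t(history):
    # Hypothetical switching-costs loss: a constant oblivious part (0.5)
    # plus a unit cost whenever the newest action differs from the previous one.
    switch = 1.0 if len(history) >= 2 and history[-1] != history[-2] else 0.0
    return 0.5 + switch

def policy_regret(actions, T):
    # Eq. (1): cumulative loss of the played sequence, minus the cumulative
    # loss of the best sequence that repeats one fixed action on every round.
    played = sum(f_t(actions[:t + 1]) for t in range(T))
    best_fixed = min(sum(f_t([x] * (t + 1)) for t in range(T)) for x in (0, 1))
    return played - best_fixed

random.seed(0)
T = 100
actions = [random.randrange(2) for _ in range(T)]
assert policy_regret(actions, T) >= 0.0  # a fixed action never pays to switch
```

Note how the comparator re-evaluates the loss functions on the counterfactual history (x, . . . , x), which is exactly what distinguishes Eq. (1) from the standard regret of Eq. (2).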
Therefore, we also consider the expected regret E[R_T]. This definition is the same as the one used in [18] and [3] (in the latter, it is called policy regret), but differs from the more common definition of expected regret

E[ Σ_{t=1}^{T} ft(X1:t) − min_{x∈A} Σ_{t=1}^{T} ft(X1:t−1, x) ] .   (2)

The definition in Eq. (2) is more common in the literature (e.g., [4, 17, 10, 16]), but is clearly inadequate for measuring a player's performance against an adaptive adversary. Indeed, if the adversary is adaptive, the quantity ft(X1:t−1, x) is hardly interpretable —see [3] for a more detailed discussion.

In general, we seek algorithms for which E[R_T] can be bounded by a sublinear function of T, implying that the per-round expected regret, E[R_T]/T, tends to zero. Unfortunately, [3] shows that arbitrary adaptive adversaries can easily force the regret to grow linearly. Thus, we need to focus on (reasonably) weaker adversaries, which have constraints on the loss functions they can generate.

The weakest adversary we discuss is the oblivious adversary, which determines the loss on round t based only on the current action Xt. In other words, this adversary is oblivious to the player's past actions. Formally, the oblivious adversary is constrained to choose a sequence of loss functions that satisfies, ∀t, ∀x1:t ∈ A^t, and ∀x′1:t−1 ∈ A^{t−1},

ft(x1:t) = ft(x′1:t−1, xt) .   (3)

The majority of previous work in online learning focuses on oblivious adversaries. When dealing with oblivious adversaries, we denote the loss function by ℓt and omit the first t − 1 arguments. With this notation, the loss at time t is simply written as ℓt(Xt).

For example, imagine an investor that invests in a single stock at a time. On each trading day he invests in one stock and suffers losses accordingly.
In this example, the investor is the player and the stock market is the adversary. If the investment amount is small, the investor's actions will have no measurable effect on the market, so the market is oblivious to the investor's actions. Also note that this example relates to the full-information feedback version of the game, as the investor can see the performance of each stock at the end of each trading day.

A stronger adversary is the oblivious adversary with switching costs. This adversary is similar to the oblivious adversary defined above, but charges the player an additional switching cost of 1 whenever Xt ≠ Xt−1. More formally, this adversary defines his sequence of loss functions in two steps: first he chooses an oblivious sequence of loss functions, ℓ1, ℓ2, . . ., which satisfies the constraint in Eq. (3). Then, he sets f1(x) = ℓ1(x), and

∀ t ≥ 2, ft(x1:t) = ℓt(xt) + I{xt ≠ xt−1} .   (4)

This is a very natural setting. For example, let us consider again the single-stock investor, but now assume that each trade has a fixed commission cost. If the investor keeps his position in a stock for multiple trading days, he is exempt from any additional fees, but when he sells one stock and buys another, he incurs a fixed commission. More generally, this setting (or simple generalizations of it) allows us to capture any situation where choosing a different action involves a costly change of state. In the paper, we will also discuss a special case of this adversary, where the loss function ℓt(x) for each action is sampled i.i.d. from a fixed distribution.

The switching costs adversary defines ft to be a function of Xt and Xt−1, and is therefore a special case of a more general adversary called an adaptive adversary with a memory of 1.
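The two-step construction in Eq. (4) can be sketched directly in code; the oblivious losses below are hypothetical placeholders:

```python
def with_switching_costs(oblivious_losses):
    # oblivious_losses[t][x] is the oblivious loss l_t(x) of action x on round t.
    # Returns a history-dependent loss as in Eq. (4):
    #   f_t(x_{1:t}) = l_t(x_t) + 1{x_t != x_{t-1}}.
    def f(t, history):
        loss = oblivious_losses[t][history[t]]
        if t >= 1 and history[t] != history[t - 1]:
            loss += 1.0  # unit switching cost
        return loss
    return f

# Example: two actions, three rounds of hypothetical oblivious losses.
losses = [{0: 0.2, 1: 0.7}, {0: 0.4, 1: 0.1}, {0: 0.9, 1: 0.3}]
f = with_switching_costs(losses)
assert f(0, [0]) == 0.2        # round 1: no switching cost
assert f(1, [0, 1]) == 1.1     # switched 0 -> 1: 0.1 + 1
assert f(2, [0, 1, 1]) == 0.3  # stayed on action 1: no extra cost
```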
This adversary is constrained to choose loss functions that satisfy, ∀t, ∀x1:t ∈ A^t, and ∀x′1:t−2 ∈ A^{t−2},

ft(x1:t) = ft(x′1:t−2, xt−1, xt) .   (5)

This adversary is more general than the switching costs adversary because his loss functions can depend on the previous action in an arbitrary way. We can further strengthen this adversary and define the bounded memory adaptive adversary, which has a bounded memory of an arbitrary size. In other words, this adversary is allowed to set his loss function based on the player's m most recent past actions, where m is a predefined parameter. Formally, the bounded memory adversary must choose loss functions that satisfy, ∀t, ∀x1:t ∈ A^t, and ∀x′1:t−m−1 ∈ A^{t−m−1},

ft(x1:t) = ft(x′1:t−m−1, xt−m:t) .

In the information theory literature, this setting is called individual sequence prediction against loss functions with memory [18].

In addition to the adversary types described above, the bounded memory adaptive adversary has additional interesting special cases. One of them is the delayed feedback oblivious adversary of [19], which defines an oblivious loss sequence, but reveals each loss value with a delay of m rounds. Since the loss at time t depends on the player's action at time t − m, this adversary is a special case of a bounded memory adversary with a memory of size m. The delayed feedback adversary is not a focus of our work, and we present it merely as an interesting special case.

So far, we have defined a succession of adversaries of different strengths.
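On small instances, the memory-m constraint above can be verified by brute force: a loss function has memory m if its value depends only on the last m + 1 actions x_{t−m:t}. A sketch with hypothetical helper names:

```python
from itertools import product

def has_memory(f, t, m, actions):
    # f maps a history tuple x_{1:t} to a loss. Check that any two histories
    # agreeing on the last m+1 actions x_{t-m:t} receive the same loss.
    seen = {}
    for hist in product(actions, repeat=t):
        tail = hist[-(m + 1):]
        if tail in seen and seen[tail] != f(hist):
            return False
        seen.setdefault(tail, f(hist))
    return True

# A switching-costs style loss depends on the last two actions only.
f = lambda h: 0.5 + (1.0 if len(h) >= 2 and h[-1] != h[-2] else 0.0)
assert has_memory(f, t=4, m=1, actions=(0, 1))       # memory of size 1
g = lambda h: float(sum(h))                          # depends on the whole history
assert not has_memory(g, t=4, m=1, actions=(0, 1))
```

The oblivious adversary of Eq. (3) is the special case m = 0, and the switching costs adversary of Eq. (4) is a memory-1 adversary, matching the hierarchy described above.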
This paper's goal is to understand the upper and lower bounds on the player's regret when he faces these adversaries. Specifically, we focus on how the expected regret depends on the number of rounds, T, with either full-information or bandit feedback.

1.1 The Current State of the Art

Different aspects of this problem have been previously studied and the known results are surveyed below and summarized in Table 1. Most of these previous results rely on the additional assumption that the range of the loss functions is bounded in a fixed interval, say [0, C]. We explicitly make note of this because our new results require us to slightly generalize this assumption.

As mentioned above, the oblivious adversary has been studied extensively and is the best understood of all the adversaries discussed in this paper. With full-information feedback, both the Hedge algorithm [15, 11] and the follow the perturbed leader (FPL) algorithm [14] guarantee a regret of O(√T), with a matching lower bound of Ω(√T) —see, e.g., [8]. Analyses of Hedge in settings where the loss range may vary over time have also been considered —see, e.g., [9]. The oblivious setting with bandit feedback, where the player only observes the incurred loss ft(X1:t), is called the nonstochastic (or adversarial) multi-armed bandit problem. In this setting, the Exp3 algorithm of [4] guarantees the same regret O(√T) as the full-information setting, and clearly the full-information lower bound Ω(√T) still applies.

The follow the lazy leader (FLL) algorithm of [14] is designed for the switching costs setting with full-information feedback.
The analysis of FLL guarantees that the oblivious component of the player's expected regret (without counting the switching costs), as well as the expected number of switches, is upper bounded by O(√T), implying an expected regret of O(√T).

The work in [3] focuses on the bounded memory adversary with bandit feedback and guarantees an expected regret of O(T^{2/3}). This bound naturally extends to the full-information setting. We note that [18, 12] study this problem in a different feedback model, which we call counterfactual feedback, where the player receives a full description of the history-dependent function ft at the end of round t. In this setting, the algorithm presented in [12] guarantees an expected regret of O(√T).

Learning with bandit feedback and switching costs has mostly been considered in the economics literature, using a different setting than ours and with prior knowledge assumptions (see [13] for an overview). The setting of stochastic oblivious adversaries (i.e., oblivious loss functions sampled i.i.d. from a fixed distribution) was first studied by [2], where they show that O(log T) switches are sufficient to asymptotically guarantee logarithmic regret. The paper [20] achieves logarithmic regret nonasymptotically with O(log T) switches.

Several other papers discuss online learning against "adaptive" adversaries [4, 10, 16, 17], but these results are not relevant to our work and can be easily misunderstood. For example, several bandit algorithms have extensions to the "adaptive" adversary case, with a regret upper bound of O(√T) [1].
This bound doesn't contradict the Ω(T) lower bound for general adaptive adversaries mentioned earlier, since these papers use the regret defined in Eq. (2) rather than the regret used in our work, defined in Eq. (1).

                          | oblivious | switching cost | memory of size 1 | bounded memory | adaptive
Full-Information, Ω       | √T        | √T             | √T               | √T → T^{2/3}   | T
Full-Information, Õ       | √T        | √T             | T^{2/3}          | T^{2/3}        | T
Bandit Feedback, Ω        | √T        | √T → T^{2/3}   | √T → T^{2/3}     | √T → T^{2/3}   | T
Bandit Feedback, Õ        | √T        | T^{2/3}        | T^{2/3}          | T^{2/3}        | T

Table 1: State-of-the-art upper and lower bounds on regret (as a function of T) against different adversary types. Our contribution to this table is presented in bold face (here, the entries of the form √T → T^{2/3}).

Another related body of work lies in the field of competitive analysis —see [5], which also deals with loss functions that depend on the player's past actions, and the adversary's memory may even be unbounded. However, obtaining sublinear regret is generally impossible in this case. Therefore, competitive analysis studies much weaker performance metrics, such as the competitive ratio, making it orthogonal to our work.

1.2 Our Contribution

In this paper, we make the following contributions (see Table 1):

• Our main technical contribution is a new lower bound on regret that matches the existing upper bounds in several of the settings discussed above.
Specifically, our lower bound applies to the switching costs adversary with bandit feedback and to all strictly stronger adversaries.

• Building on this lower bound, we prove another regret lower bound in the bounded memory setting with full-information feedback, again matching the known upper bound.

• We confirm that existing upper bounds on regret hold in our setting and match the lower bounds up to logarithmic factors.

• Despite the lower bound, we show that for switching costs and bandit feedback, if we also assume stochastic i.i.d. losses, then one can get a distribution-free regret bound of O(√T log log log T) for finite action sets, with only O(log log T) switches. This result uses ideas from [7], and is deferred to the supplementary material.

Our new lower bound is a significant step towards a complete understanding of adaptive adversaries; observe that the upper and lower bounds in Table 1 essentially match in all but one of the settings.

Our results have two important consequences. First, observe that the optimal regret against the switching costs adversary is Θ(√T) with full-information feedback, versus Θ(T^{2/3}) with bandit feedback. To the best of our knowledge, this is the first theoretical confirmation that learning with bandit feedback is strictly harder than learning with full-information, even on a small finite action set and even in terms of the dependence on T (previous gaps we are aware of were either in terms of the number of actions [4], or required large or continuous action spaces —see, e.g., [6, 21]). Moreover, recall the regret bound of O(√T log log log T) against the stochastic i.i.d. adversary with switching costs and bandit feedback. This demonstrates that dependencies in the loss process must play a crucial role in controlling the power of the switching costs adversary.
Indeed, the Ω(T^{2/3}) lower bound proven in the next section heavily relies on such dependencies.

Second, observe that in the full-information feedback case, the optimal regret against a switching costs adversary is Θ(√T), whereas the optimal regret against the more general bounded memory adversary is Ω(T^{2/3}). This is somewhat surprising given the ideas presented in [18] and later extended in [3]: the main technique used in these papers is to take an algorithm originally designed for oblivious adversaries, forcefully prevent it from switching actions very often, and obtain a new algorithm that guarantees a regret of O(T^{2/3}) against bounded memory adversaries. This would seem to imply that a small number of switches is the key to dealing with general bounded memory adversaries. Our result contradicts this intuition by showing that controlling the number of switches is easier than dealing with a general bounded memory adversary.

As noted above, our lower bounds require us to slightly weaken the standard technical assumption that loss values lie in a fixed interval [0, C]. We replace it with the following two assumptions:

1. Bounded range. We assume that the loss values on each individual round are bounded in an interval of constant size C, but we allow this interval to drift from round to round. Formally, ∀t, ∀x1:t ∈ A^t and ∀x′1:t ∈ A^t,

|ft(x1:t) − ft(x′1:t)| ≤ C .   (6)

2. Bounded drift. We also assume that the drift of each individual action from round to round is contained in a bounded interval of size Dt, where Dt may grow slowly, as O(√log(t)). Formally, ∀t and ∀x1:t ∈ A^t,

|ft(x1:t) − ft+1(x1:t, xt)| ≤ Dt .   (7)

Since these assumptions are a relaxation of the standard assumption, all of the known lower bounds on regret automatically extend to our relaxed setting. For our results to be consistent with the current state of the art, we must also prove that all of the known upper bounds continue to hold after the relaxation, up to logarithmic factors.

2 Lower Bounds

In this section, we prove lower bounds on the player's expected regret in various settings.

2.1 Ω(T^{2/3}) with Switching Costs and Bandit Feedback

We begin with a Ω(T^{2/3}) regret lower bound against an oblivious adversary with switching costs, when the player receives bandit feedback. It is enough to consider a very simple setting, with only two actions, labeled 1 and 2. Using the notation introduced earlier, we use ℓ1, ℓ2, . . . to denote the oblivious sequence of loss functions chosen by the adversary before adding the switching cost.

Theorem 1. For any player strategy that relies on bandit feedback and for any number of rounds T, there exist loss functions f1, . . . , fT that are oblivious with switching costs, with a range bounded by C = 2, and a drift bounded by Dt = √(3 log(t)) + 16, such that E[R_T] ≥ (1/40) T^{2/3}.

The full proof is given in the supplementary material, and here we give an informal proof sketch. We begin by constructing a randomized adversarial strategy, where the loss functions ℓ1, . . . , ℓT are an instantiation of random variables L1, . . . , LT defined as follows. Let ξ1, . . . , ξT be i.i.d. standard Gaussian random variables (with zero mean and unit variance) and let Z be a random variable that equals −1 or 1 with equal probability. Using these random variables, define for all t = 1 . . . T

Lt(1) = Σ_{s=1}^{t} ξs ,   Lt(2) = Lt(1) + Z T^{−1/3} .   (8)

In words, {Lt(1)} is simply a Gaussian random walk and {Lt(2)} is the same random walk, slightly shifted up or down —see Figure 1 for an illustration. It is straightforward to confirm that this loss sequence has a bounded range, as required by the theorem: by construction we have |ℓt(1) − ℓt(2)| = T^{−1/3} ≤ 1 for all t, and since the switching cost can add at most 1 to the loss on each round, we conclude that |ft(1) − ft(2)| ≤ 2 for all t. Next, we show that the expected regret of any player against this random loss sequence is Ω(T^{2/3}), where expectation is taken over the randomization of both the adversary and the player. The intuition is that the player can only gain information about which action is better by switching between them. Otherwise, if he stays on the same action, he only observes a random walk, and gets no further information. Since the gap between the two losses on each round is T^{−1/3}, the player must perform Ω(T^{2/3}) switches before he can identify the better action. If the player performs that many switches, the total regret incurred due to the switching costs is Ω(T^{2/3}). Alternatively, if the player performs o(T^{2/3}) switches, he can't identify the better action; as a result he suffers an expected regret of Ω(T^{−1/3}) on each round and a total regret of Ω(T^{2/3}).

Since the randomized loss sequence defined in Eq. (8), plus a switching cost, achieves an expected regret of Ω(T^{2/3}), there must exist at least one deterministic loss sequence ℓ1 . . . ℓT with a regret of Ω(T^{2/3}). In our proof, we show that there exists such ℓ1 . . . ℓT with bounded drift.

Figure 1: A particular realization of the random loss sequence defined in Eq. (8). The sequence of losses for action 1 follows a Gaussian random walk, whereas the sequence of losses for action 2 follows the same random walk, but slightly shifted either up or down.

2.2 Ω(T^{2/3}) with Bounded Memory and Full-Information Feedback

We build on Thm. 1 and prove a Ω(T^{2/3}) regret lower bound in the full-information setting, where we get to see the entire loss vector on every round. To get this strong result, we need to give the adversary a little bit of extra power: memory of size 2 instead of size 1 as in the case of switching costs. To show this result, we again consider a simple setting with two actions.

Theorem 2. For any player strategy that relies on full-information feedback and for any number of rounds T ≥ 2, there exist loss functions f1, . . . , fT, each with a memory of size m = 2, a range bounded by C = 2, and a drift bounded by Dt = √(3 log(t)) + 18, such that E[R_T] ≥ (1/40)(T − 1)^{2/3}.

The formal proof is deferred to the supplementary material and a proof sketch is given here. The proof is based on a reduction from full-information to bandit feedback that might be of independent interest. We construct the adversarial loss sequence as follows: on each round, the adversary assigns the same loss to both actions. Namely, the value of the loss depends only on the player's previous two actions, and not on his action on the current round.
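The randomized loss sequence of Eq. (8), which underlies both lower bounds, is simple to generate; a minimal sketch using only the standard library:

```python
import random

def adversary_losses(T, rng):
    # Eq. (8): L_t(1) is a standard Gaussian random walk; L_t(2) is the
    # same walk shifted by Z * T^{-1/3}, with Z uniform on {-1, +1}.
    Z = rng.choice([-1.0, 1.0])
    gap = Z * T ** (-1.0 / 3.0)
    walk, L1, L2 = 0.0, [], []
    for _ in range(T):
        walk += rng.gauss(0.0, 1.0)  # i.i.d. standard Gaussian increment
        L1.append(walk)
        L2.append(walk + gap)
    return L1, L2

rng = random.Random(0)
T = 1000
L1, L2 = adversary_losses(T, rng)
# The per-round gap is constant and tiny, as in the proof sketch: a player
# who never switches observes only a random walk and learns nothing about Z.
assert all(abs(abs(a - b) - T ** (-1.0 / 3.0)) < 1e-9 for a, b in zip(L1, L2))
```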
Recall that even in the full-information version of the game, the player doesn't know what the losses would have been had he chosen different actions in the past. Therefore, we have made the full-information game as difficult as the bandit game. Specifically, we construct an oblivious loss sequence ℓ1 . . . ℓT as in Thm. 1 and define

ft(x1:t) = ℓt−1(xt−1) + I{xt−1 ≠ xt−2} .   (9)

In words, we define the loss on round t of the full-information game to be equal to the loss on round t − 1 of a bandits-with-switching-costs game in which the player chooses the same sequence of actions. This can be done with a memory of size 2, since the loss in Eq. (9) is fully specified by the player's choices on rounds t, t − 1, t − 2. Therefore, the Ω(T^{2/3}) lower bound for switching costs and bandit feedback extends to the full-information setting with a memory of size at least 2.

3 Upper Bounds

In this section, we show that the known upper bounds on regret, originally proved for bounded losses, can be extended to the case of losses with bounded range and bounded drift. Specifically, of the upper bounds that appear in Table 1, we prove the following:

• O(√T) for an oblivious adversary with switching costs, with full-information feedback.
• Õ(√T) for an oblivious adversary with bandit feedback (where Õ hides logarithmic factors).
• Õ(T^{2/3}) for a bounded memory adversary with bandit feedback.

The remaining upper bounds in Table 1 are either trivial or follow from the principle that an upper bound still holds if we weaken the adversary or provide a more informative feedback.

3.1 O(√T) with Switching Costs and Full-Information Feedback

In this setting, ft(x1:t) = ℓt(xt) + I{xt ≠ xt−1}. If the oblivious losses ℓ1 . . .
ℓT (without the additional switching costs) were all bounded in [0, 1], the Follow the Lazy Leader (FLL) algorithm of [14] would guarantee a regret of O(√T) with respect to these losses (again, without the additional switching costs). Additionally, FLL guarantees that its expected number of switches is O(√T). We use a simple reduction to extend these guarantees to loss functions with a range bounded in an interval of size C and with an arbitrary drift.

On round t, after choosing an action and receiving the loss function ℓt, the player defines the modified loss ℓ′t(x) = (1/(C−1)) (ℓt(x) − min_y ℓt(y)) and feeds it to the FLL algorithm. The FLL algorithm then chooses the next action.

Theorem 3. If each of the loss functions f1, f2, . . . is oblivious with switching costs and has a range bounded by C, then the player strategy described above attains O(C√T) expected regret.

The full proof is given in the supplementary material but the proof technique is straightforward. We first show that each ℓ′t is bounded in [0, 1] and therefore the standard regret bound for FLL holds with respect to the sequence of modified loss functions ℓ′1, ℓ′2, . . .. Then we show that the guarantees provided for FLL imply a regret of O(√T) with respect to the original loss sequence f1, f2, . . ..

3.2 Õ(√T) with an Oblivious Adversary and Bandit Feedback

In this setting, ft(x1:t) simply equals ℓt(xt). The reduction described in the previous subsection cannot be used in the bandit setting, since min_x ℓt(x) is unknown to the player, and a different reduction is needed. The player sets a fixed horizon T and focuses on controlling his regret at time T; he can then use a standard doubling trick [8] to handle an infinite horizon. The player uses the fact that each ft has a range bounded by C.
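The full-information rescaling of Section 3.1 is a one-line map; a sketch with hypothetical loss dictionaries, following the 1/(C − 1) normalization stated above:

```python
def rescale_full_info(losses, C):
    # Modified loss fed to FLL: l'_t(x) = (l_t(x) - min_y l_t(y)) / (C - 1).
    # This maps each round's losses into [0, 1], so FLL's standard O(sqrt(T))
    # regret and O(sqrt(T)) switch guarantees apply even when the original
    # loss interval drifts from round to round.
    m = min(losses.values())
    return {x: (v - m) / (C - 1) for x, v in losses.items()}

round_losses = {0: 3.0, 1: 5.0, 2: 4.0}  # hypothetical; spread is C - 1 = 2
scaled = rescale_full_info(round_losses, C=3.0)
assert scaled == {0: 0.0, 1: 1.0, 2: 0.5}
assert all(0.0 <= v <= 1.0 for v in scaled.values())
```

Note that the subtraction of min_y ℓt(y) is exactly what is unavailable under bandit feedback, which is why Section 3.2 needs the different, drift-aware reduction below.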
Additionally, he defines D = max_{t≤T} Dt and on each round he defines the modified loss

f′t(x1:t) = (1/(2(C + D))) (ℓt(xt) − ℓt−1(xt−1)) + 1/2 .   (10)

Note that f′t(X1:t) can be computed by the player using only bandit feedback. The player then feeds f′t(X1:t) to an algorithm that guarantees a O(√T) standard regret (see definition in Eq. (2)) against a fixed action. The Exp3 algorithm, due to [4], is such an algorithm. The player chooses his actions according to the choices made by Exp3. The following theorem states that this reduction results in a bandit algorithm that guarantees a regret of Õ(√T) against oblivious adversaries.

Theorem 4. If each of the loss functions f1 . . . fT is oblivious with a range bounded by C and a drift bounded by Dt = O(√log(t)), then the player strategy described above attains Õ(C√T) expected regret.

The full proof is given in the supplementary material. In a nutshell, we show that each f′t is a loss function bounded in [0, 1] and that the analysis of Exp3 guarantees a regret of O(√T) with respect to the loss sequence f′1 . . . f′T. Then, we show that this guarantee implies a regret of (C + D) O(√T) = Õ(C√T) with respect to the original loss sequence f1 . . . fT.

3.3 Õ(T^{2/3}) with Bounded Memory and Bandit Feedback

Proving an upper bound against an adversary with a memory of size m, with bandit feedback, requires a more delicate reduction. As in the previous section, we assume a finite horizon T and we let D = max_t Dt. Let K = |A| be the number of actions available to the player.

Since ft(x1:t) depends only on the last m + 1 actions in x1:t, we slightly overload our notation and define ft(xt−m:t) to mean the same as ft(x1:t). To define the reduction, the player fixes a base
To de\ufb01ne the reduction, the player \ufb01xes a base\n\n3.3\n\n7\n\n\f\ufffdft(xt\u2212m:t) =\n\n1\n\n2\ufffdC + (m + 1)D\ufffd\ufffdft(xt\u2212m:t) \u2212 ft\u2212m\u22121(x0 . . . x0)\ufffd +\n\n1\n2\n\n.\n\naction x0 \u2208 A and for each t > m he de\ufb01nes the loss function\n\nNext, he divides the T rounds into J consecutive epochs of equal length, where J = \u0398(T 2/3). We\nassume that the epoch length T /J is at least 2K(m + 1), which is true when T is suf\ufb01ciently large.\nAt the beginning of each epoch, the player plans his action sequence for the entire epoch. He uses\nsome of the rounds in the epoch for exploration and the rest for exploitation. For each action in A,\nthe player chooses an exploration interval of 2(m + 1) consecutive rounds within the epoch. These\nK intervals are chosen randomly, but they are not allowed to overlap, giving a total of 2K(m + 1)\nexploration rounds in the epoch. The details of how these intervals are drawn appears in our analysis,\nin the supplementary material. The remaining T /J \u2212 2K(m + 1) rounds are used for exploitation.\nThe player runs the Hedge algorithm [11] in the background, invoking it only at the beginning of\neach epoch and using it to choose one exploitation action that will be played consistently on all of the\nexploitation rounds in the epoch. In the exploration interval for action x, the player \ufb01rst plays m + 1\nrounds of the base action x0 followed by m + 1 rounds of the action x. Letting tx denote the \ufb01rst\nround in this interval, the player uses the observed losses ftx+m(x0 . . . x0) and ftx+2m+1(x . . . x)\n\nas feedback to the Hedge algorithm.\nWe prove the following regret bound, with the proof deferred to the supplementary material.\nTheorem 5. If each of the loss functions f1 . . . fT is has a memory of size m, a range bounded\n\nto compute \ufffdftx+2m+1(x . . . x). In our analysis, we show that the latter is an unbiased estimate of\nthe average value of \ufffdft(x . . . 
x) over t in the epoch. At the end of the epoch, the K estimates are fed\nby C, and a drift bounded by Dt = O\ufffd\ufffdlog(t)\ufffd then the player strategy described above attains\n\ufffdO(T 2/3) expected regret.\n\n4 Discussion\n\nIn this paper, we studied the problem of prediction with expert advice against different types of\nadversaries, ranging from the oblivious adversary to the general adaptive adversary. We proved\nupper and lower bounds on the player\u2019s regret against each of these adversary types, in both the\nfull-information and the bandit feedback models. Our lower bounds essentially matched our up-\nper bounds in all but one case: the adaptive adversary with a unit memory in the full-information\n\nsetting, where we only know that regret is \u03a9(\u221aT ) and O(T 2/3). Our bounds have two important\n\nconsequences. First, we characterize the regret attainable with switching costs, and show a setting\nwhere predicting with bandit feedback is strictly more dif\ufb01cult than predicting with full-information\nfeedback \u2014even in terms of the dependence on T , and even on small \ufb01nite action sets. Second, in\nthe full-information setting, we show that predicting against a switching costs adversary is strictly\neasier than predicting against an arbitrary adversary with a bounded memory. To obtain our re-\nsults, we had to relax the standard assumption that loss values are bounded in [0, 1]. Re-introducing\nthis assumption and proving similar lower bounds remains an elusive open problem. Many other\nquestions remain unanswered. Can we characterize the dependence of the regret on the number of\nactions? Can we prove regret bounds that hold with high probability? Can our results be generalized\nto more sophisticated notions of regret, as in [3]?\nIn addition to the adversaries discussed in this paper, there are other interesting classes of adversaries\nthat lie between the oblivious and the adaptive. 
A notable example is the family of deterministically adaptive adversaries, which includes adversaries that adapt to the player's actions in a known deterministic way, rather than in a secret malicious way. For example, imagine playing a multi-armed bandit game where the loss values are initially oblivious, but whenever the player chooses an arm with zero loss, the loss of the same arm on the next round is deterministically changed to zero. Many real-world online prediction scenarios are deterministically adaptive, but we lack a characterization of the expected regret in this setting.

Acknowledgments

Part of this work was done while NCB was visiting OD at Microsoft Research, whose support is gratefully acknowledged.

References

[1] J. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability. In COLT, 2009.
[2] R. Agrawal, M.V. Hedge, and D. Teneketzis. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10):899–906, 1988.
[3] R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 2012.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[5] A. Borodin and R. El-Yaniv. Online computation and competitive analysis. Cambridge University Press, 1998.
[6] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011.
[7] N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Regret minimization for reserve prices in second-price auctions. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA13), 2013.
[8] N. Cesa-Bianchi and G. Lugosi. 
Prediction, learning, and games. Cambridge University Press, 2006.
[9] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2/3):321–352, 2007.
[10] V. Dani and T. P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2006.
[11] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[12] A. Gyorgy and G. Neu. Near-optimal rates for limited-delay universal lossy source coding. In IEEE International Symposium on Information Theory, pages 2218–2222, 2011.
[13] T. Jun. A survey on the bandit problem with switching costs. De Economist, 152:513–541, 2004.
[14] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307, 2005.
[15] N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
[16] O. Maillard and R. Munos. Adaptive bandits: Towards the best history-dependent strategy. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[17] H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the Seventeenth Annual Conference on Learning Theory, 2004.
[18] N. Merhav, E. Ordentlich, G. Seroussi, and M.J. Weinberger. Sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48(7):1947–1958, 2002.
[19] C. Mesterharm. Online learning with delayed label feedback. 
In Proceedings of the Sixteenth International Conference on Algorithmic Learning Theory, 2005.
[20] R. Ortner. Online regret bounds for Markov decision processes with deterministic transitions. Theoretical Computer Science, 411(29–30):2684–2695, 2010.
[21] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. CoRR, abs/1209.2388, 2012.
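To make the epoch-based reduction of Section 3.3 concrete, the following is a minimal, non-authoritative Python sketch of the player's strategy. The `loss` oracle, the particular interval-placement scheme, and the Hedge learning rate are our own stand-ins (the paper draws the exploration intervals differently and defers those details, and the proof of Theorem 5, to the supplementary material); it is a sketch of the technique, not the authors' implementation.

```python
import math
import random

def hedge_reduction(T, K, m, C, D, loss, seed=0):
    """Epoch-based exploration/exploitation reduction (sketch of Section 3.3).

    `loss(t, seq)` is a hypothetical oracle returning f_t evaluated on the
    tuple `seq` of the most recent m+1 plays; C bounds the range of the
    losses and D bounds their drift, as in Theorem 5.
    """
    rng = random.Random(seed)
    J = max(1, round(T ** (2 / 3)))           # Theta(T^{2/3}) epochs
    L = T // J                                # epoch length T/J
    assert L >= 2 * K * (m + 1), "T too small for this sketch"
    eta = math.sqrt(math.log(max(K, 2)) / J)  # assumed Hedge rate (J steps)
    w = [1.0] * K                             # Hedge weights, one per action
    x0 = 0                                    # fixed base action
    history, total = [], 0.0
    for e in range(J):
        start = e * L
        # One exploration interval of 2(m+1) rounds per action, placed on
        # randomly chosen non-overlapping blocks inside the epoch.
        offsets = rng.sample(range(L // (2 * (m + 1))), K)
        t_x = {x: start + o * 2 * (m + 1) for x, o in zip(range(K), offsets)}
        # One exploitation action per epoch, drawn from the Hedge distribution.
        z = sum(w)
        x_exp = rng.choices(range(K), weights=[wi / z for wi in w])[0]
        base_obs, estimates = {}, [0.0] * K
        for t in range(start, start + L):
            played = x_exp                    # default: exploitation round
            for x, tx in t_x.items():
                if tx <= t < tx + m + 1:
                    played = x0               # m+1 rounds of the base action
                elif tx + m + 1 <= t < tx + 2 * (m + 1):
                    played = x                # then m+1 rounds of action x
            history.append(played)
            ft = loss(t, tuple(history[-(m + 1):]))
            total += ft
            for x, tx in t_x.items():
                if t == tx + m:               # observed f_{t_x+m}(x0 ... x0)
                    base_obs[x] = ft
                elif t == tx + 2 * m + 1:     # observed f_{t_x+2m+1}(x ... x)
                    # normalized estimate, cf. the definition of ~f_t
                    estimates[x] = (ft - base_obs[x]) / (2 * (C + (m + 1) * D)) + 0.5
        for x in range(K):                    # feed the K estimates to Hedge
            w[x] *= math.exp(-eta * estimates[x])
    return total
```

With a constant loss oracle, for instance, `hedge_reduction(T=4096, K=2, m=1, C=1.0, D=1.0, loss=lambda t, seq: 0.5)` simply accumulates total loss T/2; the interesting behavior appears once `loss` depends on the played sequence, which is where the base-action "flush" before each exploration block earns its keep.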