{"title": "Eluder Dimension and the Sample Complexity of Optimistic Exploration", "book": "Advances in Neural Information Processing Systems", "page_first": 2256, "page_last": 2264, "abstract": "This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, also shares a close theoretical connection with optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models.", "full_text": "Eluder Dimension and the Sample Complexity of\n\nOptimistic Exploration\n\nDaniel Russo\n\nStanford University\nStanford, CA 94305\n\ndjrusso@stanford.edu\n\nBenjamin Van Roy\nStanford University\nStanford, CA 94305\n\nbvr@stanford.edu\n\nAbstract\n\nThis paper considers the sample complexity of the multi-armed bandit with depen-\ndencies among the arms. Some of the most successful algorithms for this problem\nuse the principle of optimism in the face of uncertainty to guide exploration. The\nclearest example of this is the class of upper con\ufb01dence bound (UCB) algorithms,\nbut recent work has shown that a simple posterior sampling algorithm, sometimes\ncalled Thompson sampling, can be analyzed in the same manner as optimistic ap-\nproaches. 
In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models.\n\n1 Introduction\n\nConsider a politician trying to elude a group of reporters. She hopes to keep her true position hidden from the reporters, but each piece of information she provides must be new, in the sense that it's not a clear consequence of what she has already told them. How long can she continue before her true position is pinned down? This is the essence of what we call the eluder dimension. We show this notion controls the rate at which algorithms using optimistic exploration converge to optimality.\n\nWe consider an optimization problem faced by an agent who is uncertain about how her actions influence performance. The agent selects actions sequentially, and upon each action observes a reward. A reward function governs the mean reward of each action. As rewards are observed, the agent learns about the reward function, and this allows her to improve behavior. Good performance requires adaptively sampling actions in a way that strikes an effective balance between exploring poorly understood actions and exploiting previously acquired knowledge to attain high rewards.\n\nUnless the agent has prior knowledge of the structure of the mean payoff function, she can only learn to attain near optimal performance by exhaustively sampling each possible action. In this paper, we focus on problems where there is a known relationship among the rewards generated by different actions, potentially allowing the agent to learn without exploring every action. 
Problems of this form are often referred to as multi-armed bandit (MAB) problems with dependent arms.\n\nA notable example is the “linear bandit” problem, where actions are described by a finite number of features and the reward function is linear in these features. Several researchers have studied algorithms for such problems and established theoretical guarantees that have no dependence on the number of actions [1, 2, 3]. Instead, their bounds depend on the linear dimension of the class of reward functions. In this paper, we assume that the reward function lies in a known but otherwise arbitrary class of uniformly bounded real-valued functions, and provide theoretical guarantees that depend on more general measures of the complexity of the class of functions. Our analysis of this abstract framework yields a result that applies broadly, beyond the scope of specific problems that have been studied in the literature, and also identifies fundamental insights that unify more specialized prior results.\n\nThe guarantees we provide apply to two popular classes of algorithms for the stochastic MAB: upper confidence bound (UCB) algorithms and Thompson sampling. Each algorithm is described in Section 3. The aforementioned papers on the linear bandit problem study UCB algorithms [1, 2, 3]. Other authors have studied UCB algorithms in cases where the reward function is Lipschitz continuous [4, 5], sampled from a Gaussian process [6], or takes the form of a generalized [7] or sparse [8] linear model. More generally, there is an immense literature on this approach to balancing between exploration and exploitation, including work on bandits with independent arms [9, 10, 11, 12], reinforcement learning [13, 14], and Monte Carlo Tree Search [15].\n\nRecently, a simple posterior sampling algorithm called Thompson sampling was shown to share a close connection with UCB algorithms [16]. 
This connection enables us to study both types of algorithms in a unified manner. Though it was first proposed in 1933 [17], Thompson sampling has until recently received relatively little attention. Interest in the algorithm grew after empirical studies [18, 19] demonstrated performance exceeding state-of-the-art methods. Strong theoretical guarantees are now available for an important class of problems with independent arms [20, 21, 22]. A recent paper considers the application of this algorithm to a linear contextual bandit problem [23].\n\nTo our knowledge, few other papers have studied MAB problems in a general framework like the one we consider. There is work that provides general bounds for contextual bandit problems where the context space is allowed to be infinite, but the action space is small (see e.g., [24]). Our model captures contextual bandits as a special case, but we emphasize problem instances with large or infinite action sets, and where the goal is to learn without sampling every possible action. The closest related work to ours is that of Amin et al. [25], who consider the problem of learning the optimum of a function that lies in a known, but otherwise arbitrary set of functions. They provide bounds based on a new notion of dimension, but unfortunately this notion does not provide a guarantee for the algorithms we consider.\n\nWe provide bounds on expected regret over a time horizon T that are, up to a logarithmic factor, of order\n\n√( dimE(F, T⁻²) · log N(F, T⁻², ‖·‖∞) · T ),\n\nwhere the first factor under the square root is the eluder dimension and the second is the log-covering number. This quantity depends on the class of reward functions F through two measures of complexity. Each captures the approximate structure of the class of functions at a scale T⁻² that depends on the time horizon. The first measures the growth rate of the covering numbers of F, and is closely related to measures of complexity that are common in the supervised learning literature. 
This quantity roughly captures the sensitivity of F to statistical over-fitting. The second measure, the eluder dimension, is a new notion we introduce. This captures how effectively the value of unobserved actions can be inferred from observed samples. We highlight in Section 4.1 why notions of dimension common to the supervised learning literature are insufficient for our purposes. Finally, we show that our more general result, when specialized to linear models, recovers the strongest known regret bound, and in the case of generalized linear models yields a bound stronger than that established in prior literature.\n\n2 Problem Formulation\n\nWe consider a model involving a set of actions A and a set of real-valued functions F = {fρ : A ↦ R | ρ ∈ Θ}, indexed by a parameter that takes values from an index set Θ. We will define random variables with respect to a probability space (Ω, F, P). A random variable θ indexes the true reward function fθ. At each time t, the agent is presented with a possibly random subset 𝒜t ⊆ A and selects an action At ∈ 𝒜t, after which she observes a reward Rt.\n\nWe denote by Ht the history (𝒜1, A1, R1, . . . , 𝒜t−1, At−1, Rt−1, 𝒜t) of observations available to the agent when choosing an action At. The agent employs a policy π = {πt | t ∈ N}, which is a deterministic sequence of functions, each mapping the history Ht to a probability distribution over actions A. For each realization of Ht, πt(Ht) is a distribution over A with support 𝒜t. The action At is selected by sampling from the distribution πt(·), so that P(At ∈ · | Ht) = πt(Ht). 
We assume that E[Rt | Ht, θ, At] = fθ(At). In other words, the realized reward is the mean-reward value corrupted by zero-mean noise. We will also assume that for each f ∈ F and t ∈ N, arg max_{a ∈ 𝒜t} f(a) is nonempty with probability one, though algorithms and results can be generalized to handle cases where this assumption does not hold. We fix constants C > 0 and η > 0 and impose two further simplifying assumptions. The first concerns boundedness of reward functions.\n\nAssumption 1. For all f ∈ F and a ∈ A, f(a) ∈ [0, C].\n\nOur second assumption ensures that observation noise is light-tailed. We say a random variable X is η-sub-Gaussian if E[exp(λX)] ≤ exp(λ²η²/2) almost surely for all λ.\n\nAssumption 2. For all t ∈ N, Rt − fθ(At) conditioned on (Ht, θ, At) is η-sub-Gaussian.\n\nWe let A*_t ∈ arg max_{a ∈ 𝒜t} fθ(a) denote the optimal action at time t. The T period regret is the random variable\n\nR(T, π) = Σ_{t=1}^T [fθ(A*_t) − fθ(At)],\n\nwhere the actions {At : t ∈ N} are selected according to π. We sometimes study expected regret E[R(T, π)], where the expectation is taken over the prior distribution of θ, the reward noise, and the algorithm's internal randomization. This quantity is sometimes called Bayes risk or Bayesian regret. Similarly, we study conditional expected regret E[R(T, π) | θ], which integrates over all randomness in the system except for θ.\n\nExample 1. Contextual Models. The contextual multi-armed bandit model is a special case of the formulation presented above. In such a model, an exogenous Markov process Xt taking values in a set X influences rewards. 
In particular, the expected reward at time t is given by fθ(a, Xt). However, this is mathematically equivalent to a problem with stochastic time-varying decision sets 𝒜t. In particular, one can define the set of actions to be the set of state-action pairs A := {(x, a) : x ∈ X, a ∈ A(x)}, and the set of available actions to be 𝒜t = {(Xt, a) : a ∈ A(Xt)}.\n\n3 Algorithms\n\nWe will establish performance bounds for two classes of algorithms: Thompson sampling and UCB algorithms. As background, we discuss the algorithms in this section. We also provide an example of each type of algorithm that is designed to address the “linear bandit” problem.\n\nUCB Algorithms: UCB algorithms have received a great deal of attention in the MAB literature. Here we describe a very broad class of UCB algorithms. We say that a confidence set is a random subset Ft ⊂ F that is measurable with respect to σ(Ht). Typically, Ft is constructed so that it contains fθ with high probability. We denote by π^{F1:∞} a UCB algorithm that makes use of a sequence of confidence sets {Ft : t ∈ N}. At each time t, such an algorithm selects the action\n\nAt ∈ arg max_{a ∈ 𝒜t} sup_{f ∈ Ft} f(a),\n\nwhere sup_{f ∈ Ft} f(a) is an optimistic estimate of fθ(a) representing the greatest value that is statistically plausible at time t. Optimism encourages selection of poorly-understood actions, which leads to informative observations. As data accumulates, optimistic estimates are adapted, and this process of exploration and learning converges toward optimal behavior.\n\nIn this paper, we will assume for simplicity that the maximum defining At is attained. Results can be generalized to handle cases when this technical condition does not hold. Unfortunately, for natural choices of Ft, it may be exceptionally difficult to solve for such an action. 
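As a concrete illustration, the selection rule At ∈ arg max_a sup_{f ∈ Ft} f(a) can be sketched for the simple case of a finite action set and a finite confidence set. This is our own minimal sketch, not code from the paper; the toy confidence set below is an assumption for illustration.

```python
# Minimal sketch of the generic UCB rule: pick the action whose largest
# statistically plausible mean reward (sup over the confidence set) is highest.
# Reward functions are represented as dicts mapping action -> mean reward.

def ucb_action(actions, confidence_set):
    """Return A_t in argmax_a sup_{f in F_t} f(a)."""
    def optimistic_value(a):
        # sup over the confidence set: the greatest plausible reward at a
        return max(f[a] for f in confidence_set)
    return max(actions, key=optimistic_value)

# Toy instance: two candidate reward functions remain plausible. They agree
# on actions 0 and 1 but disagree sharply on the poorly understood action 2,
# so optimism directs exploration there.
F_t = [
    {0: 0.5, 1: 0.4, 2: 0.1},
    {0: 0.5, 1: 0.4, 2: 0.9},
]
print(ucb_action([0, 1, 2], F_t))  # 2: its optimistic estimate is 0.9
```

Observing action 2 would then shrink the confidence set, lowering its optimistic estimate in later rounds.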
Thankfully, all results in this paper also apply to a posterior sampling algorithm that avoids this hard optimization problem.\n\nThompson sampling: The Thompson sampling algorithm simply samples each action according to the probability it is optimal. In particular, the algorithm applies action sampling distributions πTS_t(Ht) = P(A*_t ∈ · | Ht), where A*_t is a random variable that satisfies A*_t ∈ arg max_{a ∈ 𝒜t} fθ(a). Practical implementations typically operate by, at each time t, sampling an index θ̂t ∈ Θ from the distribution P(θ ∈ · | Ht) and then generating an action At ∈ arg max_{a ∈ 𝒜t} fθ̂t(a).\n\nAlgorithm 1 Linear UCB\n1: Initialize: Select d linearly independent actions\n2: Update Statistics:\nΦt ← Σ_{k=1}^{t−1} φ(Āk)φ(Āk)⊤\nθ̂t ← OLS estimate of θ\nΘt ← {ρ : ‖ρ − θ̂t‖_{Φt} ≤ β√(d log t)}\n3: Select Action:\nAt ∈ arg max_{a ∈ A} {max_{ρ ∈ Θt} ⟨φ(a), ρ⟩}\n4: Increment t and Goto Step 2\n\nAlgorithm 2 Linear Thompson sampling\n1: Sample Model:\nθ̂t ∼ N(μt, Σt)\n2: Select Action:\nAt ∈ arg max_{a ∈ A} ⟨φ(a), θ̂t⟩\n3: Update Statistics:\nμ_{t+1} ← E[θ | H_{t+1}]\nΣ_{t+1} ← E[(θ − μ_{t+1})(θ − μ_{t+1})⊤ | H_{t+1}]\n4: Increment t and Goto Step 1\n\nAlgorithms for Linear Bandits: Here we provide an example of a Thompson sampling and a UCB algorithm, each of which addresses a problem in which the reward function is linear in a d-dimensional vector θ. 
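The posterior-sampling scheme of Algorithm 2 can be sketched in a few lines. This is our own toy instantiation, not the paper's code: the feature vectors, prior parameters, noise scale, and horizon below are illustrative choices, and the posterior update is the standard conjugate Gaussian (Kalman-filter) step for this model.

```python
import numpy as np

# Sketch of linear Thompson sampling with a Gaussian prior and Gaussian
# noise, so the posterior over theta stays Gaussian (conjugate updates).
rng = np.random.default_rng(0)
d = 2
phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # features of 3 actions
theta = np.array([0.2, 0.8])                           # true parameter (hidden)
noise_sd = 0.1

mu, Sigma = np.zeros(d), np.eye(d)                     # prior N(mu_1, Sigma_1)
for t in range(200):
    theta_hat = rng.multivariate_normal(mu, Sigma)     # 1: sample model
    a = int(np.argmax(phi @ theta_hat))                # 2: select action
    r = phi[a] @ theta + noise_sd * rng.standard_normal()  # observe reward
    # 3: conjugate posterior update for (mu, Sigma)
    prec_old = np.linalg.inv(Sigma)
    prec_new = prec_old + np.outer(phi[a], phi[a]) / noise_sd**2
    Sigma = np.linalg.inv(prec_new)
    mu = Sigma @ (prec_old @ mu + phi[a] * r / noise_sd**2)

print(np.round(mu, 2))  # posterior mean concentrates near theta in explored directions
```

Early rounds sample widely because the posterior is diffuse; as the posterior concentrates, the sampled models, and hence the chosen actions, increasingly agree with the optimal one.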
In particular, there is a known feature mapping φ : A → R^d such that an action a yields expected reward fθ(a) = ⟨φ(a), θ⟩. Algorithm 1 is a variation of one proposed by Rusmevichientong and Tsitsiklis [3] to address such problems. Given past observations, the algorithm constructs a confidence ellipsoid Θt centered around a least squares estimate θ̂t and employs the upper confidence bound Ut(a) := max_{ρ ∈ Θt} ⟨φ(a), ρ⟩ = ⟨φ(a), θ̂t⟩ + β√(d log t) ‖φ(a)‖_{Φt⁻¹}. The term ‖φ(a)‖_{Φt⁻¹} captures the amount of previous exploration in the direction φ(a), and causes the “uncertainty bonus” β√(d log t) ‖φ(a)‖_{Φt⁻¹} to diminish as the number of observations increases.\n\nNow, consider Algorithm 2. Here we assume θ is drawn from a normal distribution N(μ1, Σ1). We consider a linear reward function fθ(a) = ⟨φ(a), θ⟩ and assume the reward noise Rt − fθ(At) is normally distributed and independent from (Ht, At, θ). It is easy to show that, conditioned on the history Ht, θ remains normally distributed. Algorithm 2 presents an implementation of Thompson sampling for this problem. The expectations can be computed efficiently via Kalman filtering.\n\n4 Notions of Dimension\n\nRecently, there has been a great deal of interest in the development of regret bounds for linear UCB algorithms [1, 2, 3, 26]. 
These papers show that for a broad class of problems, a variant π* of Algorithm 1 satisfies the upper bounds E[R(T, π*)] = Õ(d√T) and E[R(T, π*) | θ] = Õ(d√T).\n\nAn interesting feature of these bounds is that they have no dependence on the number of actions in A, and instead depend only on the linear dimension of the set of functions F. Our goal is to provide bounds that depend on more general measures of the complexity of the class of functions. This section introduces a new notion, the eluder dimension, on which our bounds will depend. First, we highlight why common notions from statistical learning theory do not suffice when it comes to multi-armed bandit problems.\n\n4.1 Vapnik-Chervonenkis Dimension\n\nWe begin with an example that illustrates how a class of functions that is learnable in constant time in a supervised learning context may require an arbitrarily long duration when learning to optimize.\n\nExample 2. Consider a finite class of binary-valued functions F = {fρ : A ↦ {0, 1} | ρ ∈ {1, . . . , n}} over a finite action set A = {1, . . . , n}. Let fρ(a) = 1(ρ = a), so that each function is an indicator for an action. To keep things simple, assume that Rt = fθ(At), so that there is no noise. If θ is uniformly distributed over {1, . . . , n}, it is easy to see that the regret of any algorithm grows linearly with n. For large n, until θ is discovered, each sampled action is unlikely to reveal much about θ and learning therefore takes very long.\n\nConsider the closely related supervised learning problem in which at each time an action Ãt is sampled uniformly from A and the mean-reward value fθ(Ãt) is observed. For large n, the time it takes to effectively learn to predict fθ(Ãt) given Ãt does not depend on n. 
In particular, prediction error converges to 1/n in constant time. Note that predicting 0 at every time already achieves this low level of error.\n\nIn the preceding example, the Vapnik-Chervonenkis (VC) dimension, which characterizes the sample complexity of supervised learning, is 1. On the other hand, the eluder dimension, which we define below, is n. To highlight conceptual differences between the eluder dimension and the VC dimension, we will now define VC dimension in a way analogous to how we will define eluder dimension. We begin with a notion of independence.\n\nDefinition 1. An action a is VC-independent of Ã ⊆ A if for any f, f̃ ∈ F there exists some f̄ ∈ F which agrees with f on a and with f̃ on Ã; that is, f̄(a) = f(a) and f̄(ã) = f̃(ã) for all ã ∈ Ã. Otherwise, a is VC-dependent on Ã.\n\nBy this definition, an action a is said to be VC-dependent on Ã if knowing the values f ∈ F takes on Ã could restrict the set of possible values at a. This notion of independence is intimately related to the VC dimension of a class of functions. In fact, it can be used to define VC dimension.\n\nDefinition 2. The VC dimension of a class of binary-valued functions with domain A is the largest cardinality of a set Ã ⊆ A such that every a ∈ Ã is VC-independent of Ã\\{a}.\n\nIn the above example, any two actions are VC-dependent because knowing the label fθ(a) of one action could completely determine the value of the other action. However, this only happens if the sampled action has label 1. If it has label 0, one cannot infer anything about the value of the other action. 
Instead of capturing the fact that one could gain useful information through exploration, we need a stronger requirement that guarantees one will gain useful information.\n\n4.2 Defining Eluder Dimension\n\nHere we define the eluder dimension of a class of functions, which plays a key role in our results.\n\nDefinition 3. An action a ∈ A is ϵ-dependent on actions {a1, ..., an} ⊆ A with respect to F if any pair of functions f, f̃ ∈ F satisfying √(Σ_{i=1}^n (f(ai) − f̃(ai))²) ≤ ϵ also satisfies f(a) − f̃(a) ≤ ϵ. Further, a is ϵ-independent of {a1, ..., an} with respect to F if a is not ϵ-dependent on {a1, ..., an}.\n\nIntuitively, an action a is independent of {a1, ..., an} if two functions that make similar predictions at {a1, ..., an} can nevertheless differ significantly in their predictions at a. The above definition measures the “similarity” of predictions at ϵ-scale, and measures whether two functions make similar predictions at {a1, ..., an} based on the cumulative discrepancy √(Σ_{i=1}^n (f(ai) − f̃(ai))²). This measure of dependence suggests using the following notion of dimension.\n\nDefinition 4. The ϵ-eluder dimension dimE(F, ϵ) is the length d of the longest sequence of elements in A such that, for some ϵ′ ≥ ϵ, every element is ϵ′-independent of its predecessors.\n\nRecall that a vector space has dimension d if and only if d is the length of the longest sequence of elements such that each element is linearly independent of, or equivalently, 0-independent of, its predecessors. Definition 4 replaces the requirement of linear independence with ϵ-independence. 
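Definitions 3 and 4 can be checked mechanically on the indicator class of Example 2. The sketch below is our own illustration (all names are ours): a greedy scan keeps each action that is ϵ-independent of those kept so far, certifying a long eluding sequence and hence a lower bound on the eluder dimension, which grows linearly in n even though the VC dimension of this class is 1.

```python
import itertools
import math

def eps_independent(a, prior, F, eps):
    # a is eps-independent of `prior` if some ordered pair f, g makes
    # similar predictions on `prior` (cumulative discrepancy <= eps)
    # yet satisfies f(a) - g(a) > eps, per Definition 3.
    for f, g in itertools.permutations(F, 2):
        gap = math.sqrt(sum((f[p] - g[p]) ** 2 for p in prior))
        if gap <= eps and f[a] - g[a] > eps:
            return True
    return False

def greedy_eluding_sequence(actions, F, eps):
    # Greedy scan: a lower bound on the longest eluding sequence
    # (exact computation would search over all orderings).
    seq = []
    for a in actions:
        if eps_independent(a, seq, F, eps):
            seq.append(a)
    return seq

n = 6
actions = list(range(n))
# The "needle in a haystack" class: f_rho(a) = 1 if rho == a else 0
F = [{a: float(rho == a) for a in actions} for rho in actions]
seq = greedy_eluding_sequence(actions, F, eps=0.5)
print(len(seq))  # 5: each of the first n-1 actions eludes its predecessors
```

Each action k < n is 0.5-independent of {0, ..., k−1} because the indicators for k and for any later action agree (both are zero) on the predecessors yet differ by 1 at k; only the final action fails this test.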
This extension is advantageous as it captures both nonlinear dependence and approximate dependence.\n\n5 Confidence Bounds and Regret Decompositions\n\nA key to our analysis is the recent observation [16] that the regret of both Thompson sampling and a UCB algorithm can be decomposed in terms of confidence sets. Define the width of a subset F̃ ⊂ F at an action a ∈ A by\n\nwF̃(a) = sup_{f̄, f ∈ F̃} (f̄(a) − f(a)).   (1)\n\nThis is a worst-case measure of the uncertainty about the payoff fθ(a) at a given that fθ ∈ F̃.\n\nProposition 1. Fix any sequence {Ft : t ∈ N}, where Ft ⊂ F is measurable with respect to σ(Ht). Then for any T ∈ N, with probability 1,\n\nR(T, π^{F1:∞}) ≤ Σ_{t=1}^T [wFt(At) + C · 1(fθ ∉ Ft)]   (2)\n\nand\n\nE[R(T, πTS)] ≤ E Σ_{t=1}^T [wFt(At) + C · 1(fθ ∉ Ft)].   (3)\n\nIf the confidence sets Ft are constructed to contain fθ with high probability, this proposition essentially bounds regret in terms of the sum of widths Σ_{t=1}^T wFt(At). In this sense, the decomposition bounds regret only in terms of uncertainty about the actions A1, ..., At that the algorithm has actually sampled. As actions are sampled, the value of fθ(·) at those actions is learned accurately, and hence we expect that the width wFt(·) of the confidence sets should diminish over time.\n\nIt is worth noting that the regret bound of the UCB algorithm π^{F1:∞} depends on the specific confidence sets {Ft : t ∈ N} used by the algorithm whereas the bound of πTS applies for any sequence of confidence sets. However, the decomposition (3) holds only in expectation under the prior distribution. 
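The essential step behind Proposition 1 admits a direct numerical check: when fθ ∈ Ft and At is chosen optimistically, fθ(A*_t) ≤ sup_{f ∈ Ft} f(At) while fθ(At) ≥ inf_{f ∈ Ft} f(At), so the instantaneous regret is at most wFt(At). The following toy construction is our own, not the paper's.

```python
# Toy check: optimistic selection bounds instantaneous regret by the width.

def width(confidence_set, a):
    # w_F(a) = sup_{f, f' in F} (f(a) - f'(a))
    vals = [f[a] for f in confidence_set]
    return max(vals) - min(vals)

actions = [0, 1, 2]
f_theta = {0: 0.3, 1: 0.6, 2: 0.2}
# A confidence set that contains the true function f_theta
F_t = [f_theta, {0: 0.4, 1: 0.5, 2: 0.8}, {0: 0.2, 1: 0.7, 2: 0.3}]

a_star = max(actions, key=lambda a: f_theta[a])             # optimal action
a_ucb = max(actions, key=lambda a: max(f[a] for f in F_t))  # optimistic action
regret = f_theta[a_star] - f_theta[a_ucb]
print(regret, width(F_t, a_ucb))  # the regret never exceeds the width
```

Here the optimistic rule picks action 2 (plausible value 0.8), incurring regret 0.4, while the width of the confidence set at that action is 0.6; summing this per-step bound over t yields the decomposition (2).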
The implications of these decompositions are discussed further in earlier work [16]. In the next section, we design abstract confidence sets Ft that are shown to contain the true function fθ with high probability. Then, in Section 7 we give a worst case bound on the sum Σ_{t=1}^T wFt(At) in terms of the eluder dimension of the class of functions F. When combined with Proposition 1, this analysis provides regret bounds for both Thompson sampling and for a UCB algorithm.\n\n6 Construction of confidence sets\n\nThe abstract confidence sets we construct are centered around least squares estimates f̂LS_t ∈ arg min_{f ∈ F} L_{2,t}(f), where L_{2,t}(f) = Σ_{k=1}^{t−1} (f(Ak) − Rk)² is the cumulative squared prediction error.¹ The sets take the form Ft := {f ∈ F : ‖f − f̂LS_t‖_{2,Et} ≤ √βt}, where βt is an appropriately chosen confidence parameter, and the empirical 2-norm ‖·‖_{2,Et} is defined by ‖g‖²_{2,Et} = Σ_{k=1}^{t−1} g²(Ak). Hence ‖f − fθ‖²_{2,Et} measures the cumulative discrepancy between the previous predictions of f and fθ.\n\nThe following lemma is the key to constructing strong confidence sets (Ft : t ∈ N). For an arbitrary function f, it bounds the squared error of f from below in terms of the empirical loss of the true function fθ and the aggregate empirical discrepancy ‖f − fθ‖²_{2,Et} between f and fθ. It establishes that for any function f, with high probability, the random process (L_{2,t}(f) : t ∈ N) never falls below the process (L_{2,t}(fθ) + ½‖f − fθ‖²_{2,Et} : t ∈ N) by more than a fixed constant. A proof of the lemma is provided in the appendix. Recall that η is a constant given in Assumption 2.\n\nLemma 1. 
For any δ > 0 and f : A ↦ R,\n\nP( L_{2,t}(f) ≥ L_{2,t}(fθ) + ½‖f − fθ‖²_{2,Et} − 4η² log(1/δ) ∀t ∈ N | θ ) ≥ 1 − δ.\n\nBy Lemma 1, with high probability, f can enjoy lower squared error than fθ only if its empirical deviation ‖f − fθ‖²_{2,Et} from fθ is less than 8η² log(1/δ). Through a union bound, this property holds uniformly for all functions in a finite subset of F. To extend this result to infinite classes of functions, we measure the function class at some discretization scale α. Let N(F, α, ‖·‖∞) denote the α-covering number of F in the sup-norm ‖·‖∞, and let\n\nβ*_t(F, δ, α) := 8η² log(N(F, α, ‖·‖∞)/δ) + 2αt(8C + √(8η² ln(4t²/δ))).   (4)\n\nProposition 2. For all δ > 0 and α > 0, if\n\nFt = {f ∈ F : ‖f − f̂LS_t‖_{2,Et} ≤ √(β*_t(F, δ, α))}\n\nfor all t ∈ N, then\n\nP( fθ ∈ ∩_{t=1}^∞ Ft | θ ) ≥ 1 − 2δ.\n\n¹The results can be extended to the case where the infimum of L_{2,t}(f) is unattainable by selecting a function with squared prediction error sufficiently close to the infimum.\n\nExample 3. Suppose Θ ⊂ [0, 1]^d and for each a ∈ A, fθ(a) is an L-Lipschitz function of θ. 
Then N(F, α, ‖·‖∞) ≤ (1 + L/α)^d and hence log N(F, α, ‖·‖∞) ≤ d log(1 + L/α).\n\n7 Measuring the rate at which confidence sets shrink\n\nOur remaining task is to provide a worst case bound on the sum Σ_{t=1}^T wFt(At). First consider the case of a linearly parameterized model where fρ(a) := ⟨φ(a), ρ⟩ for each ρ ∈ Θ ⊂ R^d. Then, it can be shown that our confidence set takes the form Ft := {fρ : ρ ∈ Θt} where Θt ⊂ R^d is an ellipsoid. When an action At is sampled, the ellipsoid shrinks in the direction φ(At). Here the explicit geometric structure of the confidence set implies that the width wFt shrinks not only at At but also at any other action whose feature vector is not orthogonal to φ(At). Some linear algebra leads to a worst case bound on Σ_{t=1}^T wFt(At). For a general class of functions, the situation is much subtler, and we need to measure the way in which the width at each action can be reduced by sampling other actions.\n\nThe following result uses our new notion of dimension to bound the number of times the width of the confidence interval for a selected action At can exceed a threshold.\n\nProposition 3. 
If (βt ≥ 0 | t ∈ N) is a nondecreasing sequence and Ft := {f ∈ F : ‖f − f̂LS_t‖_{2,Et} ≤ √βt}, then with probability 1,\n\nΣ_{t=1}^T 1(wFt(At) > ϵ) ≤ (4βT/ϵ² + 1) dimE(F, ϵ)\n\nfor all T ∈ N and ϵ > 0.\n\nUsing Proposition 3, one can bound the sum Σ_{t=1}^T wFt(At), as established by the following lemma. To extend our analysis to infinite classes of functions, we consider the α^F_T-eluder dimension of F, where\n\nα^F_t = max{ 1/t², inf{‖f1 − f2‖∞ : f1, f2 ∈ F, f1 ≠ f2} }.   (5)\n\nLemma 2. If (βt ≥ 0 | t ∈ N) is a nondecreasing sequence and Ft := {f ∈ F : ‖f − f̂LS_t‖_{2,Et} ≤ √βt}, then with probability 1, for all T ∈ N,\n\nΣ_{t=1}^T wFt(At) ≤ 1/T + min{dimE(F, α^F_T), T} · C + 4√(dimE(F, α^F_T) βT T).   (6)\n\n8 Main Result\n\nOur analysis provides a new guarantee both for Thompson sampling, and for a UCB algorithm π^{F*1:∞} executed with appropriate confidence sets {F*_t : t ∈ N}. Recall, for a sequence of confidence sets {Ft : t ∈ N} we denote by π^{F1:∞} the UCB algorithm that chooses an action Āt ∈ arg max_{a ∈ 𝒜t} {sup_{f ∈ Ft} f(a)} at each time t. 
We establish bounds that are, up to a logarithmic factor, of order\n\n√( dimE(F, T⁻²) · log N(F, T⁻², ‖·‖∞) · T ),\n\nwhere the first factor under the square root is the eluder dimension and the second is the log-covering number. This term depends on two measures of the complexity of the function class F. The first, which controls for statistical over-fitting, grows logarithmically in the cover numbers of the function class. This is a common feature of notions of dimension from statistical learning theory. The second measure of complexity, the eluder dimension, measures the extent to which the reward value at one action can be inferred by sampling other actions.\n\nThe next two propositions, which provide finite time bounds for a particular UCB algorithm and for Thompson sampling, follow by combining Proposition 1, Proposition 2, and Lemma 2. Define\n\nB(F, T, δ) = 1/T + min{dimE(F, α^F_T), T} · C + 4√(dimE(F, α^F_T) β*_T(F, α^F_T, δ) T).\n\nNotice that B(F, T, δ) is the right hand side of the bound (6) with βT taken to be β*_T(F, α^F_T, δ).\n\nProposition 4. Fix any δ > 0 and T ∈ N, and define for each t ∈ N, F*_t = {f ∈ F : ‖f − f̂LS_t‖_{2,Et} ≤ √(β*_t(F, α^F_T, δ))}. Then,\n\nP{ R(T, π^{F*1:∞}) ≤ B(F, T, δ) | θ } ≥ 1 − 2δ  and  E[ R(T, π^{F*1:∞}) | θ ] ≤ B(F, T, δ) + 2δT C.\n\nProposition 5. For any T ∈ N,\n\nE[ R(T, πTS) ] ≤ B(F, T, T⁻¹) + 2C.\n\n
For any T \u2208 N,\n\nThe next two examples show how the regret bounds of Proposition 4 and 5 specialize to d-\ndimensional linear and generalized linear models. For each of these examples \u0398 \u2282 Rd and each\naction is associated with a known feature vector \u03c6(a). Throughout these two examples, we \ufb01x posi-\ntive constants \u03b3 and s and assume that \u03b3 \u2265 supa\u2208A (cid:107)\u03c6(a)(cid:107) and s \u2265 sup\u03c1\u2208\u0398 (cid:107)\u03c1(cid:107). For each of these\nexamples, a bound on dimE (F, \u0001) is provided in the supplementary material.\nExample 4. Linear Models: Consider the case of a d-dimensional linear model f\u03c1(a) :=\n(cid:104)\u03c6(a), \u03c1(cid:105). Then, dimE(F, \u0001) = O(d log(1/\u0001)) and log N (F, \u0001,(cid:107)\u00b7(cid:107)\u221e) = O(d log(1/\u0001)). Proposi-\nT \u2265 T \u22122, This is tight to\ntions 4 and 5 therefore yield O(d log(1/\u03b1F\nT )\nwithin a factor of log T [3], and matches the best available bound for a linear UCB algorithm [2].\nExample 5. Generalized Linear Models: Consider the case of a d-dimensional general-\nized linear model f\u03b8(a) := g ((cid:104)\u03c6(a), \u03b8(cid:105)) where g is an increasing Lipschitz continuous func-\nThen,\ntion.\nlog N (F, \u0001,(cid:107)\u00b7(cid:107)\u221e) = O(d log(h/\u0001)) and dimE(F, \u0001) = O(dr2 log(h/\u0001)), and Propositions 4 and\n\u221a\n5 yield O(rd log(h/\u03b1F\nT ) regret bounds. To our knowledge, this bound is a slight improvement\nT )\n\u221a\nover the strongest regret bound available for any algorithm in this setting. The regret bound of\nFilippi et al. [7] is of order rd log3/2(T )\n\nSet h = sup\u02dc\u03b8,a g(cid:48)((cid:104)\u03c6(a), \u02dc\u03b8(cid:105)), h = inf \u02dc\u03b8,a g(cid:48)((cid:104)\u03c6(a), \u02dc\u03b8(cid:105)) and r = h/h.\n\nT ) regret bounds. 
Since \u03b1F\n\n\u221a\n\nT .\n\n9 Conclusion\n\nIn this paper, we have analyzed two algorithms, Thompson sampling and a UCB algorithm, in a\nvery general framework, and developed regret bounds that depend on a new notion of dimension.\nIn constructing these bounds, we have identi\ufb01ed two factors that control the hardness of a particular\nmulti-armed bandit problem. First, an agent\u2019s ability to quickly attain near-optimal performance\ndepends on the extent to which the reward value at one action can be inferred by sampling other\nactions. However, in order to select an action the agent must make inferences about many possible\nactions, and an error in its evaluation of any one could result in large regret. Our second measure\nof complexity controls for the dif\ufb01culty of maintaining appropriate con\ufb01dence sets simultaneously\nat every action. While our bounds are nearly tight in some cases, further analysis is likely to yield\nstronger results in other cases. We hope, however, that our work provides a conceptual foundation\nfor the study of such problems, and inspires further investigation.\n\nAcknowledgments\n\nThe \ufb01rst author is supported by a Burt and Deedee McMurty Stanford Graduate Fellowship. This\nwork was supported in part by Award CMMI-0968707 from the National Science Foundation.\n\n8\n\n\fReferences\n[1] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. In Proceed-\n\nings of the 21st Annual Conference on Learning Theory (COLT), pages 355\u2013366, 2008.\n\n[2] Y. Abbasi-Yadkori, D. P\u00b4al, and C. Szepesv\u00b4ari. Improved algorithms for linear stochastic bandits. Advances\n\nin Neural Information Processing Systems, 24, 2011.\n\n[3] P. Rusmevichientong and J.N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations\n\nResearch, 35(2):395\u2013411, 2010.\n\n[4] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. 
In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008.

[5] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1587–1627, 2011.

[6] N. Srinivas, A. Krause, S.M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, May 2012.

[7] S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23:1–9, 2010.

[8] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[9] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[10] T.L. Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114, 1987.

[11] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

[12] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Submitted to the Annals of Statistics.

[13] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. The Journal of Machine Learning Research, 99:1563–1600, 2010.

[14] P.L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.

[15] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293. Springer, 2006.

[16] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. arXiv preprint arXiv:1301.2609, 2013.

[17] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[18] S.L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.

[19] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Neural Information Processing Systems (NIPS), 2011.

[20] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. 2012.

[21] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. arXiv preprint arXiv:1209.3353, 2012.

[22] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal finite time analysis. In International Conference on Algorithmic Learning Theory, 2012.

[23] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. arXiv preprint arXiv:1209.3352, 2012.

[24] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R.E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Conference on Artificial Intelligence and Statistics (AISTATS), volume 15. JMLR Workshop and Conference Proceedings, 2011.

[25] K. Amin, M. Kearns, and U. Syed. Bandits, query learning, and the haystack dimension. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011.

[26] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research, 3:397–422, 2003.