{"title": "Linear Contextual Bandits with Knapsacks", "book": "Advances in Neural Information Processing Systems", "page_first": 3450, "page_last": 3458, "abstract": "We consider the linear contextual bandit problem with resource consumption, in addition to reward generation. In each round, the outcome of pulling an arm is a reward as well as a vector of resource consumptions. The expected values of these outcomes depend linearly on the context of that arm. The budget/capacity constraints require that the sum of these vectors doesn't exceed the budget in each dimension. The objective is once again to maximize the total reward. This problem turns out to be a common generalization of classic linear contextual bandits  (linContextual),  bandits with knapsacks (BwK), and the online stochastic packing problem (OSPP). We present algorithms with near-optimal regret bounds for this problem. Our bounds compare favorably to results on the unstructured version of the problem, where the relation between the contexts and the outcomes could be arbitrary, but the algorithm only competes against a fixed set of policies accessible through  an optimization oracle. We combine techniques from the work on linContextual, BwK and OSPP in a nontrivial manner while also tackling new difficulties that are not present in any of these special cases.", "full_text": "Linear Contextual Bandits with Knapsacks\n\nShipra Agrawal\u2217\n\nNikhil R. Devanur\u2020\n\nAbstract\n\nWe consider the linear contextual bandit problem with resource consumption, in\naddition to reward generation. In each round, the outcome of pulling an arm is\na reward as well as a vector of resource consumptions. The expected values of\nthese outcomes depend linearly on the context of that arm. The budget/capacity\nconstraints require that the total consumption doesn\u2019t exceed the budget for each\nresource. The objective is once again to maximize the total reward. 
This problem\nturns out to be a common generalization of classic linear contextual bandits (linContextual)\n[8, 11, 1], bandits with knapsacks (BwK) [3, 9], and the online stochastic\npacking problem (OSPP) [4, 14]. We present algorithms with near-optimal regret\nbounds for this problem. Our bounds compare favorably to results on the unstructured\nversion of the problem [5, 10] where the relation between the contexts and\nthe outcomes could be arbitrary, but the algorithm only competes against a fixed\nset of policies accessible through an optimization oracle. We combine techniques\nfrom the work on linContextual, BwK and OSPP in a nontrivial manner while also\ntackling new difficulties that are not present in any of these special cases.\n\n1 Introduction\n\nIn the contextual bandit problem [8, 2], the decision maker observes a sequence of contexts (or\nfeatures). In every round she needs to pull one out of K arms, after observing the context for that\nround. The outcome of pulling an arm may be used along with the contexts to decide future arms.\nContextual bandit problems have found many useful applications such as online recommendation\nsystems, online advertising, and clinical trials, where the decision in every round needs to be\ncustomized to the features of the user being served. The linear contextual bandit problem [1, 8, 11]\nis a special case of the contextual bandit problem, where the outcome is linear in the feature vector\nencoding the context. As pointed out by [2], contextual bandit problems represent a natural half-way\npoint between supervised learning and reinforcement learning: the use of features to encode contexts\nand the models for the relation between these feature vectors and the outcome are often inherited from\nsupervised learning, while managing the exploration-exploitation tradeoff is necessary to ensure good\nperformance in reinforcement learning. 
The linear contextual bandit problem can thus be thought of\nas lying midway between the linear regression model of supervised learning and reinforcement learning.\n\nRecently, there has been significant interest in introducing multiple \u201cglobal constraints\u201d in the\nstandard bandit setting [9, 3, 10, 5]. Such constraints are crucial for many important real-world\napplications. For example, in clinical trials, the treatment plans may be constrained by the total\navailability of medical facilities, drugs and other resources. In online advertising, there are budget\nconstraints that restrict the number of times an ad is shown. Other applications include dynamic\npricing, dynamic procurement, crowdsourcing, etc.; see [9, 3] for many such examples.\n\n\u2217Columbia University. sa3305@columbia.edu.\n\u2020Microsoft Research. nikdev@microsoft.com.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this paper, we consider the linear contextual bandit with knapsacks (henceforth, linCBwK)\nproblem. In this problem, the context vectors are generated i.i.d. in every round from some unknown\ndistribution, and on picking an arm, a reward and a consumption vector are observed, which depend\nlinearly on the context vector. The aim of the decision maker is to maximize the total reward while\nensuring that the total consumption of every resource remains within a given budget. Below, we give\na more precise definition of this problem. We use the following notational convention throughout:\nvectors are denoted by bold face lower case letters, while matrices are denoted by regular face upper\ncase letters. Other quantities such as sets, scalars, etc. may be of either case, but never bold faced. All\nvectors are column vectors, i.e., a vector in n dimensions is treated as an n \u00d7 1 matrix. The transpose\nof matrix A is A^\u22a4.\nDefinition 1 (linCBwK). 
There are K \u201carms\u201d, which we identify with the set [K]. The algorithm\nis initially given as input a budget B \u2208 R_+. In every round t, the algorithm first observes context\nx_t(a) \u2208 [0, 1]^m for every arm a, and then chooses an arm a_t \u2208 [K], and finally observes a reward\nr_t(a_t) \u2208 [0, 1] and a d-dimensional consumption vector v_t(a_t) \u2208 [0, 1]^d. The algorithm has a \u201cno-op\u201d\noption, which is to pick none of the arms and get 0 reward and 0 consumption. The goal of the\nalgorithm is to pick arms such that the total reward \u2211_{t=1}^T r_t(a_t) is maximized, while ensuring that\nthe total consumption does not exceed the budget, i.e., \u2211_t v_t(a_t) \u2264 B1.\nWe make the following stochastic assumption for context, reward, and consumption vectors. In\nevery round t, the tuple {x_t(a), r_t(a), v_t(a)}_{a=1}^K is generated from an unknown distribution D,\nindependent of everything in previous rounds. Also, there exists an unknown vector \u00b5_\u2217 \u2208 [0, 1]^m and\na matrix W_\u2217 \u2208 [0, 1]^{m\u00d7d} such that for every arm a, given contexts x_t(a), and history H_{t\u22121} before\ntime t,\n\nE[r_t(a) | x_t(a), H_{t\u22121}] = \u00b5_\u2217^\u22a4 x_t(a), E[v_t(a) | x_t(a), H_{t\u22121}] = W_\u2217^\u22a4 x_t(a). (1)\n\nFor succinctness, we will denote the tuple of contexts for K arms at time t as matrix X_t \u2208 [0, 1]^{m\u00d7K},\nwith x_t(a) being the a-th column of this matrix. 
Similarly, rewards and consumption vectors at time t\nare represented as the vector r_t \u2208 [0, 1]^K and the matrix V_t \u2208 [0, 1]^{d\u00d7K} respectively.\n\nAs we discuss later in the text, the assumption in equation (1) forms the primary distinction between\nour linear contextual bandit setting and the general contextual bandit setting considered in [5].\nExploiting this linearity assumption will allow us to generate regret bounds which do not depend\non the number of arms K, which makes them especially useful when the number of arms is large.\nSome examples of this include recommendation systems with a large number of products (e.g., retail\nproducts, travel packages, ad creatives, sponsored Facebook posts). Another advantage over using the\ngeneral contextual bandit setting of [5] is that we don\u2019t need oracle access to a certain optimization\nproblem, which in this case is required to solve an NP-Hard problem. (See Section 1.1 for a more\ndetailed discussion.)\n\nWe compare the performance of an algorithm to that of an optimal adaptive policy that knows the\ndistribution D and the parameters (\u00b5_\u2217, W_\u2217), and can take into account the history up to that point,\nas well as the current context, to decide (possibly with randomization) which arm to pull at time t.\nHowever, it is easier to work with an upper bound on this, which is the optimal expected reward of a\nstatic policy that is required to satisfy the constraints only in expectation. This technique has been\nused in several related problems and is standard by now [14, 9].\n\nDefinition 2 (Optimal Static Policy). A context-dependent non-adaptive policy \u03c0 is a mapping from\ncontext space [0, 1]^{m\u00d7K} to \u2126 = {p \u2208 [0, 1]^K : \u2016p\u2016_1 \u2264 1}, where \u03c0(X)_i denotes the probability of\nplaying arm i when the context is X, and 1 \u2212 \u2211_{i=1}^K \u03c0(X)_i is the probability of no-op. 
Define r(\u03c0)\nand v(\u03c0) to be the expected reward and consumption vector of policy \u03c0, respectively, i.e.\n\nr(\u03c0) := E_{(X,r,V)\u223cD}[r\u03c0(X)] = E_{X\u223cD}[\u00b5_\u2217^\u22a4 X\u03c0(X)]. (2)\nv(\u03c0) := E_{(X,r,V)\u223cD}[V\u03c0(X)] = E_{X\u223cD}[W_\u2217^\u22a4 X\u03c0(X)]. (3)\n\nLet\n\n\u03c0\u2217 := arg max_\u03c0 T r(\u03c0) such that T v(\u03c0) \u2264 B1 (4)\n\nbe the optimal static policy. Note that since no-op is allowed, a feasible policy always exists. We\ndenote the value of this optimal static policy by OPT := T r(\u03c0\u2217).\n\nThe following lemma proves that OPT upper bounds the value of an optimal adaptive policy. Proof is\nin Appendix B in the supplement.\nLemma 1. Let \u00afOPT denote the value of an optimal adaptive policy that knows the distribution D\nand parameters \u00b5_\u2217, W_\u2217. Then OPT \u2265 \u00afOPT.\n\nDefinition 3 (Regret). Let a_t be the arm played at time t by the algorithm. Then, regret is defined as\n\nregret(T) := OPT \u2212 \u2211_{t=1}^T r_t(a_t).\n\n1.1 Main results\n\nOur main result is an algorithm with near-optimal regret bound for linCBwK.\nTheorem 1. There is an algorithm for linCBwK such that if B > m^{1/2}T^{3/4}, then with probability\nat least 1 \u2212 \u03b4,\n\nregret(T) = O((OPT/B + 1) m \u221a(T log(dT/\u03b4) log(T))).\n\nRelation to general contextual bandits. There have been recent papers [5, 10] that solve problems\nsimilar to linCBwK but for general contextual bandits. In these papers the relation between\ncontexts and outcome vectors is arbitrary and the algorithms compete with an arbitrary fixed set\nof context dependent policies \u03a0 accessible via an optimization oracle, with regret bounds being\nO((OPT/B + 1) \u221a(KT log(dT|\u03a0|/\u03b4))). 
These approaches could potentially be applied to the linear\nsetting using a set \u03a0 of linear context dependent policies. Comparing their bounds with ours, in our\nresults, essentially a \u221a(K log(|\u03a0|)) factor is replaced by a factor of m. Most importantly, we have no\ndependence on K,3 which enables us to consider problems with large action spaces.\n\nFurther, suppose that we want to use their result with the set of linear policies, i.e., policies of the\nform, for some fixed \u03b8 \u2208 \u211c^m,\n\narg max_{a\u2208[K]} {x_t(a)^\u22a4 \u03b8}.\n\nThen, their algorithms would require access to an \u201cArg-Max Oracle\u201d that can find the best such policy\n(maximizing total reward) for a given set of contexts and rewards (no resource consumption). In\nfact, by a reduction from the problem of learning halfspaces with noise [16], we can show that the\noptimization problem underlying such an \u201cArg-Max Oracle\u201d problem is NP-Hard, making such an\napproach computationally expensive. The proof of this is in Appendix C in the supplement.\n\nThe only downside to our results is that we need the budget B to be \u2126(m^{1/2}T^{3/4}). Getting similar\nbounds for budgets as small as B = \u0398(m\u221aT) is an interesting open problem. (This also indicates\nthat this is indeed a harder problem than all the special cases.)\n\nNear-optimality of regret bounds. In [12], it was shown that for the linear contextual bandits\nproblem, no online algorithm can achieve a regret bound better than \u2126(m\u221aT). In fact, they prove\nthis lower bound for linear contextual bandits with static contexts. Since that problem is a special\ncase of the linCBwK problem with d = 1, this shows that the dependence on m and T in the above\nregret bound is optimal up to log factors. For general contextual bandits with resource constraints, the\nbounds of [5, 10] are near optimal.\n\nRelation to BwK [3] and OSPP [4]. 
It is easy to see that the linCBwK problem is a generalization\nof the linear contextual bandits problem [1, 8, 11]. 
There, the outcome is scalar and the goal is\nto simply maximize the sum of these. Remarkably, the linCBwK problem also turns out to be a\ncommon generalization of the bandits with knapsacks (BwK) problem considered in [9, 3], and the\nonline stochastic packing problem (OSPP) studied by [13, 6, 15, 14, 4]. In both BwK and OSPP, the\noutcome of every round t is a reward r_t and a vector v_t and the goal of the algorithm is to maximize\n\u2211_{t=1}^T r_t while ensuring that \u2211_{t=1}^T v_t \u2264 B1. The problems differ in how these rewards and vectors\nare picked. In the OSPP problem, in every round t, the algorithm may pick any (reward, vector) pair\nfrom a given set A_t of (d + 1)-dimensional vectors. The set A_t is drawn i.i.d. from an unknown\ndistribution over sets of vectors. This corresponds to the special case of linCBwK, where m = d + 1\nand the context x_t(a) itself is equal to (r_t(a), v_t(a)). In the BwK problem, there is a fixed set of\narms, and for each arm there is an unknown distribution over (reward, vector) pairs. The algorithm\npicks an arm and a (reward, vector) pair is drawn from the corresponding distribution for that arm. This\ncorresponds to the special case of linCBwK, where m = K and the context X_t = I, the identity\nmatrix, for all t.\n\n3Similar to the regret bounds for linear contextual bandits [8, 1, 11].\n\nWe use techniques from all three special cases: our algorithms follow the primal-dual paradigm\nand use an online learning algorithm to search the dual space, as was done in [3]. In order to deal\nwith linear contexts, we use techniques from [1, 8, 11] to estimate the weight matrix W_\u2217, and define\n\u201coptimistic estimates\u201d of W_\u2217. We also use the technique of combining the objective and the constraints\nusing a certain tradeoff parameter, which was introduced in [4]. 
Further new difficulties arise, such\nas in estimating the optimum value from the first few rounds, a task that follows from standard\ntechniques in each of the special cases but is very challenging here. We develop a new way of\nexploration that uses the linear structure, so that one can evaluate all possible choices that could\nhave led to an optimum solution on the historic sample. This technique might be of independent\ninterest in estimating optimum values. One can see that the problem is indeed more than the sum of\nits parts, from the fact that we get the optimal bound for linCBwK only when B \u2265 \u02dc\u2126(m^{1/2}T^{3/4}),\nunlike either special case for which the optimal bound holds for all B (but is meaningful only for\nB = \u02dc\u2126(m\u221aT)).\nThe approach in [3] (for BwK) extends to the case of \u201cstatic\u201d contexts,4 where each arm has a context\nthat doesn\u2019t change over time. The OSPP of [4] is not a special case of linCBwK with static contexts;\nthis is one indication of the additional difficulty of dynamic over static contexts.\n\nOther related work. Recently, [17] showed an O(\u221aT) regret in the linear contextual setting with\na single budget constraint, when costs depend only on contexts and not arms.\n\nDue to space constraints, we have moved many proofs from the main part of the paper to the\nsupplement.\n\n2 Preliminaries\n\n2.1 Confidence Ellipsoid\n\nConsider a stochastic process which in each round t, generates a pair of observations (r_t, y_t), such\nthat r_t is an unknown linear function of y_t plus some 0-mean bounded noise, i.e., r_t = \u00b5_\u2217^\u22a4 y_t + \u03b7_t,\nwhere y_t, \u00b5_\u2217 \u2208 R^m, |\u03b7_t| \u2264 2R, and E[\u03b7_t | y_1, r_1, . . . 
, y_{t\u22121}, r_{t\u22121}, y_t] = 0.\nAt any time t, a high confidence estimate of the unknown vector \u00b5_\u2217 can be obtained by building\na \u201cconfidence ellipsoid\u201d around the \u2113_2-regularized least-squares estimate \u02c6\u00b5_t constructed from the\nobservations made so far. This technique is common in prior work on linear contextual bandits (e.g.,\nin [8, 11, 1]). For any regularization parameter \u03bb > 0, let\n\nM_t := \u03bbI + \u2211_{i=1}^{t\u22121} y_i y_i^\u22a4, and \u02c6\u00b5_t := M_t^{\u22121} \u2211_{i=1}^{t\u22121} y_i r_i.\n\nThe following result from [1] shows that \u00b5_\u2217 lies with high probability in an ellipsoid with center \u02c6\u00b5_t.\nFor any positive semi-definite (PSD) matrix M, define the M-norm as \u2016\u00b5\u2016_M := \u221a(\u00b5^\u22a4 M \u00b5). The\nconfidence ellipsoid at time t is defined as\n\nC_t := {\u00b5 \u2208 R^m : \u2016\u00b5 \u2212 \u02c6\u00b5_t\u2016_{M_t} \u2264 R \u221a(m log((1 + tm/\u03bb)/\u03b4)) + \u221a(\u03bbm)}.\n\nLemma 2 (Theorem 2 of [1]). If \u2200 t, \u2016\u00b5_\u2217\u2016_2 \u2264 \u221am and \u2016y_t\u2016_2 \u2264 \u221am, then with prob. 1 \u2212 \u03b4,\n\u00b5_\u2217 \u2208 C_t.\nAnother useful observation about this construction is stated below. It first appeared as Lemma 11 of\n[8], and was also proved as Lemma 3 in [11].\n\nLemma 3 (Lemma 11 of [8]). \u2211_{t=1}^T \u2016y_t\u2016_{M_t^{\u22121}} \u2264 \u221a(mT log(T)).\n\nAs a corollary of the above two lemmas, we obtain a bound on the total error in the estimate provided\nby \u201cany point\u201d from the confidence ellipsoid. (Proof is in Appendix D in the supplement.)\n\n4It was incorrectly claimed in [3] that the approach can be extended to dynamic contexts without much\nmodification.\n\nCorollary 1. For t = 1, . . . , T, let \u02dc\u00b5_t \u2208 C_t be a point in the confidence ellipsoid, with \u03bb = 1 and\n2R = 1. 
Then, with probability 1 \u2212 \u03b4,\n\n\u2211_{t=1}^T |\u02dc\u00b5_t^\u22a4 y_t \u2212 \u00b5_\u2217^\u22a4 y_t| \u2264 2m \u221a(T log((1 + Tm)/\u03b4) log(T)).\n\n2.2 Online Learning\n\nConsider a T round game played between an online learner and an adversary, where in round\nt, the learner chooses a \u03b8_t \u2208 \u2126 := {\u03b8 : \u2016\u03b8\u2016_1 \u2264 1, \u03b8 \u2265 0}, and then observes a linear function\ng_t : \u2126 \u2192 [\u22121, 1] picked by the adversary. The learner\u2019s choice \u03b8_t may only depend on learner\u2019s and\nadversary\u2019s choices in previous rounds. The goal of the learner is to minimize regret defined as the\ndifference between the learner\u2019s objective value and the value of the best single choice in hindsight:\n\nR(T) := max_{\u03b8\u2208\u2126} \u2211_{t=1}^T g_t(\u03b8) \u2212 \u2211_{t=1}^T g_t(\u03b8_t).\n\nThe multiplicative weight update (MWU) algorithm (generalization by [7]) is a fast and efficient\nonline learning algorithm for this problem. Let g_{t,j} := g_t(1_j). Then, given a parameter \u01eb > 0, in\nround t + 1, the choice of this algorithm takes the following form,\n\n\u03b8_{t+1,j} = w_{t,j} / (1 + \u2211_j w_{t,j}), where w_{t,j} = w_{t\u22121,j}(1 + \u01eb)^{g_{t,j}} if g_{t,j} > 0, and w_{t,j} = w_{t\u22121,j}(1 \u2212 \u01eb)^{\u2212g_{t,j}} if g_{t,j} \u2264 0, (5)\n\nwith initialization w_{0,j} = 1, for all j = 1, . . . , K.\nLemma 4. 
[7] For any 0 < \u01eb \u2264 1/2, the MWU algorithm provides the following regret bound for the\nonline learning problem described above:\n\nR(T) \u2264 \u01ebT + log(d + 1)/\u01eb.\n\nIn particular, for \u01eb = \u221a(log(d + 1)/T), we have R(T) \u2264 \u221a(log(d + 1)T).\n\nFor the rest of the paper, we refer to the MWU algorithm with \u01eb = \u221a(log(d + 1)/T) as the online learning\n(OL) algorithm, and the update in (5) as the OL update at time t + 1.\n\n3 Algorithm\n\n3.1 Optimistic estimates of unknown parameters\n\nLet a_t denote the arm played by the algorithm at time t. 
In the beginning of every round, we use the\noutcomes and contexts from previous rounds to construct a confidence ellipsoid for \u00b5_\u2217 and every\ncolumn of W_\u2217. The construction of confidence ellipsoid for \u00b5_\u2217 follows directly from the techniques\nin Section 2.1 with y_t = x_t(a_t) and r_t being reward at time t. To construct a confidence ellipsoid\nfor a column j of W_\u2217, we use the techniques in Section 2.1 while substituting y_t = x_t(a_t) and\nr_t = v_t(a_t)_j for every j.\nAs in Section 2.1, let M_t := I + \u2211_{i=1}^{t\u22121} x_i(a_i) x_i(a_i)^\u22a4, and construct the regularized least squares\nestimate for \u00b5_\u2217, W_\u2217, respectively, as\n\n\u02c6\u00b5_t := M_t^{\u22121} \u2211_{i=1}^{t\u22121} x_i(a_i) r_i(a_i)^\u22a4 (6)\n\u02c6W_t := M_t^{\u22121} \u2211_{i=1}^{t\u22121} x_i(a_i) v_i(a_i)^\u22a4. (7)\n\nDefine confidence ellipsoid for parameter \u00b5_\u2217 as\n\nC_{t,0} := {\u00b5 \u2208 R^m : \u2016\u00b5 \u2212 \u02c6\u00b5_t\u2016_{M_t} \u2264 \u221a(m log((d + tmd)/\u03b4)) + \u221am},\n\nand for every arm a, the optimistic estimate of \u00b5_\u2217 as:\n\n\u02dc\u00b5_t(a) := arg max_{\u00b5\u2208C_{t,0}} x_t(a)^\u22a4 \u00b5. (8)\n\nLet w_j denote the jth column of a matrix W. We define a confidence ellipsoid for each column j, as\n\nC_{t,j} := {w \u2208 R^m : \u2016w \u2212 \u02c6w_{tj}\u2016_{M_t} \u2264 \u221a(m log((d + tmd)/\u03b4)) + \u221am},\n\nand denote by G_t, the Cartesian product of all these ellipsoids: G_t := {W \u2208 R^{m\u00d7d} : w_j \u2208 C_{t,j}}.\nNote that Lemma 2 implies that W_\u2217 \u2208 G_t with probability 1 \u2212 \u03b4. Now, given a vector \u03b8_t \u2208 R^d, we\ndefine the optimistic estimate of the weight matrix at time t w.r.t. \u03b8_t, for every arm a \u2208 [K], as:\n\n\u02dcW_t(a) := arg min_{W\u2208G_t} x_t(a)^\u22a4 W \u03b8_t. 
(9)\n\nIntuitively, for the reward, we want an upper confidence bound and for the consumption we want a\nlower confidence bound as an optimistic estimate. 
This intuition aligns with the above definitions,\nwhere the maximizer was used in case of reward and a minimizer was used for consumption. The\nutility and precise meaning of \u03b8_t will become clearer when we describe the algorithm and present the\nregret analysis.\nUsing the definition of \u02dc\u00b5_t, \u02dcW_t, along with the results in Lemma 2 and Corollary 1 about confidence\nellipsoids, the following can be derived.\nCorollary 2. With probability 1 \u2212 \u03b4, for any sequence of \u03b8_1, \u03b8_2, . . . , \u03b8_T,\n1. x_t(a)^\u22a4 \u02dc\u00b5_t(a) \u2265 x_t(a)^\u22a4 \u00b5_\u2217, for all arms a \u2208 [K], for all time t.\n2. x_t(a)^\u22a4 \u02dcW_t(a) \u03b8_t \u2264 x_t(a)^\u22a4 W_\u2217 \u03b8_t, for all arms a \u2208 [K], for all time t.\n3. |\u2211_{t=1}^T (\u02dc\u00b5_t(a_t) \u2212 \u00b5_\u2217)^\u22a4 x_t(a_t)| \u2264 2m \u221a(T log((1 + tm)/\u03b4) log(T)).\n4. \u2016\u2211_{t=1}^T (\u02dcW_t(a_t) \u2212 W_\u2217)^\u22a4 x_t(a_t)\u2016 \u2264 \u20161_d\u2016 (2m \u221a(T log((d + tmd)/\u03b4) log(T))).\n\nEssentially, the first two claims ensure that we have optimistic estimates, and the last two claims\nensure that the estimates quickly converge to the true parameters.\n\n3.2 The core algorithm\n\nIn this section, we present an algorithm and its analysis, under the assumption that a parameter Z\nsatisfying certain properties is given. Later, we show how to use the first T_0 rounds to compute such\na Z, and also bound the additional regret due to these T_0 rounds. We define Z now.\nAssumption 1. Let Z be such that for some universal constants c, c\u2032, OPT/B \u2264 Z \u2264 c OPT/B + c\u2032.\n\nThe algorithm constructs estimates \u02c6\u00b5_t and \u02c6W_t as in Section 3.1. It also runs the OL algorithm for an\ninstance of the online learning problem. The vector played by the OL algorithm in time step t is \u03b8_t.\nAfter observing the context, the optimistic estimates for each arm are then constructed using \u03b8_t, as\ndefined in (8) and (9). 
Intuitively, \u03b8_t is used here as a multiplier to combine different columns of\nthe weight matrix, to get an optimistic weight vector for every arm. An adjusted estimated reward\nfor arm a is then defined by using Z to linearly combine the optimistic estimate of the reward with\nthe optimistic estimate of the consumption, as (x_t(a)^\u22a4 \u02dc\u00b5_t(a)) \u2212 Z(x_t(a)^\u22a4 \u02dcW_t(a) \u03b8_t). The algorithm\nchooses the arm which appears to be the best according to the adjusted estimated reward. After\nobserving the resulting reward and consumption vectors, the estimates are updated. The online\nlearning algorithm is advanced by one step, by defining the profit vector to be v_t(a_t) \u2212 (B/T)1. The\nalgorithm ends either after T time steps or as soon as the total consumption exceeds the budget along\nsome dimension.\nTheorem 2. Given a Z as per Assumption 1, Algorithm 1 achieves the following, with prob. 1 \u2212 \u03b4:\n\nregret(T) \u2264 O((OPT/B + 1) m \u221a(T log(dT/\u03b4) log(T))).\n\n(Proof Sketch) We provide a sketch of the proof here, with a full proof given in Appendix E in the\nsupplement. Let \u03c4 be the stopping time of the algorithm. The proof is in 3 steps:\nStep 1: Since E[v_t(a_t) | X_t, a_t, H_{t\u22121}] = W_\u2217^\u22a4 x_t(a_t), we apply Azuma-Hoeffding inequality to get\nthat with high probability \u2016\u2211_{t=1}^\u03c4 (v_t(a_t) \u2212 W_\u2217^\u22a4 x_t(a_t))\u2016_\u221e is small. Therefore, we can work with\n\u2211_{t=1}^\u03c4 W_\u2217^\u22a4 x_t(a_t) instead of \u2211_{t=1}^\u03c4 v_t(a_t). 
A similar application of Azuma-Hoeffding inequality\nis used to bound the gap |\u2211_{t=1}^\u03c4 (r_t(a_t) \u2212 \u00b5_\u2217^\u22a4 x_t(a_t))|, so that a lower bound on \u2211_{t=1}^\u03c4 \u00b5_\u2217^\u22a4 x_t(a_t) is\nsufficient to lower bound the total reward \u2211_{t=1}^\u03c4 r_t(a_t).\nStep 2: Using Corollary 2, with high probability, we can bound \u2016\u2211_{t=1}^T (W_\u2217 \u2212 \u02dcW_t(a_t))^\u22a4 x_t(a_t)\u2016_\u221e.\nIt is therefore sufficient to work with the sum of vectors \u02dcW_t(a_t)^\u22a4 x_t(a_t) instead of W_\u2217^\u22a4 x_t(a_t), and\nsimilarly with \u02dc\u00b5_t(a_t)^\u22a4 x_t(a_t) instead of \u00b5_\u2217^\u22a4 x_t(a_t).\nStep 3: The proof is completed by showing the desired bound on OPT \u2212 \u2211_{t=1}^\u03c4 \u02dc\u00b5_t(a_t)^\u22a4 x_t(a_t). This\npart is similar to the online stochastic packing problem; if the actual reward and consumption vectors\nwere \u02dc\u00b5_t(a_t)^\u22a4 x_t(a_t) and \u02dcW_t(a_t)^\u22a4 x_t(a_t), then it would be exactly that problem. We adapt techniques\nfrom [4]: use the OL algorithm and the Z parameter to combine constraints into the objective. 
If a\ndimension is being consumed too fast, then the multiplier for that dimension should increase, making\nthe algorithm pick arms that are not likely to consume too much along this dimension. Regret is\nthen bounded by a combination of the online learning regret and the error in the optimistic estimates.\n\nAlgorithm 1 Algorithm for linCBwK, with given Z\n\nInitialize \u03b8_1 as per the online learning (OL) algorithm.\nInitialize Z that satisfies Assumption 1.\nfor all t = 1, ..., T do\n  Observe X_t.\n  For every a \u2208 [K], compute \u02dc\u00b5_t(a) and \u02dcW_t(a) as per (8) and (9) respectively.\n  Play the arm a_t := arg max_{a\u2208[K]} x_t(a)^\u22a4 (\u02dc\u00b5_t(a) \u2212 Z \u02dcW_t(a) \u03b8_t).\n  Observe r_t(a_t) and v_t(a_t).\n  If for some j = 1..d, \u2211_{t\u2032\u2264t} v_{t\u2032}(a_{t\u2032}) \u00b7 e_j \u2265 B then EXIT.\n  Use x_t(a_t), r_t(a_t) and v_t(a_t) to obtain \u02c6\u00b5_{t+1}, \u02c6W_{t+1} and G_{t+1}.\n  Choose \u03b8_{t+1} using the OL update (refer to (5)) with g_t(\u03b8_t) := \u03b8_t \u00b7 (v_t(a_t) \u2212 (B/T)1).\nend for\n\n3.3 Algorithm with Z computation\n\nIn this section, we present a modification of Algorithm 1 which computes the required parameter\nZ that satisfies Assumption 1, and therefore does not need to be provided with a Z as input. The\nalgorithm computes Z using observations from the first T_0 rounds. Once Z is computed, Algorithm\n1 can be run for the remaining time steps. However, it needs to be modified slightly to take into\naccount the budget consumed during the first T_0 rounds. We handle this by using a smaller budget\nB\u2032 = B \u2212 T_0 in the computations for the remaining rounds. 
The modified algorithm is given below.\n\nAlgorithm 2 Algorithm for linCBwK, with Z computation\n\nInputs: B, T_0, B\u2032 = B \u2212 T_0\nUsing observations from first T_0 rounds, compute a Z that satisfies Assumption 1.\nRun Algorithm 1 for T \u2212 T_0 rounds and budget B\u2032.\n\nNext, we provide the details of how to compute Z from observations in the first T_0 rounds, and how\nto choose T_0. We provide a method that takes advantage of the linear structure of the problem, and\nexplores in the m-dimensional space of contexts and weight vectors to obtain bounds independent of\nK. In every round t = 1, . . . , T_0, after observing X_t, let p_t \u2208 \u2206[K] be\n\np_t := arg max_{p\u2208\u2206[K]} \u2016X_t p\u2016_{M_t^{\u22121}}, (10)\n\nwhere M_t := I + \u2211_{i=1}^{t\u22121} (X_i p_i)(X_i p_i)^\u22a4. (11)\n\nSelect arm a_t = a with probability p_t(a). In fact, since M_t is a PSD matrix, due to convexity of the\nfunction \u2016X_t p\u2016^2_{M_t^{\u22121}}, it is the same as playing a_t = arg max_{a\u2208[K]} \u2016x_t(a)\u2016_{M_t^{\u22121}}. 
Construct estimates\n\u02c6\u00b5_t, \u02c6W_t of \u00b5_\u2217, W_\u2217 at time t as\n\n\u02c6\u00b5_t := M_t^{\u22121} \u2211_{i=1}^{t\u22121} (X_i p_i) r_i(a_i), \u02c6W_t := M_t^{\u22121} \u2211_{i=1}^{t\u22121} (X_i p_i) v_i(a_i)^\u22a4.\n\nAnd, for some value of \u03b3 defined later, obtain an estimate \u02c6OPT^\u03b3 of OPT as:\n\n\u02c6OPT^\u03b3 := max_\u03c0 (T/T_0) \u2211_{i=1}^{T_0} \u02c6\u00b5_i^\u22a4 X_i \u03c0(X_i) such that (T/T_0) \u2211_{i=1}^{T_0} \u02c6W_i^\u22a4 X_i \u03c0(X_i) \u2264 B + \u03b3. (12)\n\nFor an intuition about the choice of arm in (10), observe from the discussion in Section 2.1 that every\ncolumn w_{\u2217j} of W_\u2217 is guaranteed to lie inside the confidence ellipsoid centered at column \u02c6w_{tj} of \u02c6W_t,\nnamely the ellipsoid, \u2016w \u2212 \u02c6w_{tj}\u2016^2_{M_t} \u2264 4m log(Tm/\u03b4). Note that this ellipsoid has principal axes as\neigenvectors of M_t, and the length of the semi-principal axes is given by the inverse eigenvalues of\nM_t. Therefore, by maximizing \u2016X_t p\u2016_{M_t^{\u22121}} we are choosing the context closest to the direction of the\nlongest principal axis of the confidence ellipsoid, i.e. in the direction of the maximum uncertainty.\nIntuitively, this corresponds to pure exploration: by making an observation in the direction where\nuncertainty is large we can reduce the uncertainty in our estimate most effectively.\n\nA more algebraic explanation is as follows. In order to get a good estimate of OPT by \u02c6OPT^\u03b3, we want\nthe estimates \u02c6W_t and W_\u2217 (and, \u02c6\u00b5_t and \u00b5_\u2217) to be close enough so that \u2016\u2211_{t=1}^{T_0} (\u02c6W_t \u2212 W_\u2217)^\u22a4 X_t \u03c0(X_t)\u2016_\u221e\n(and, |\u2211_{t=1}^{T_0} (\u02c6\u00b5_t \u2212 \u00b5_\u2217)^\u22a4 X_t \u03c0(X_t)|) is small for all policies \u03c0, and in particular for sample optimal\npolicies. 
Now, using Cauchy-Schwarz these are bounded by\n\n\u2211_{t=1}^{T_0} \u2016\u02c6\u00b5_t \u2212 \u00b5_\u2217\u2016_{M_t} \u2016X_t \u03c0(X_t)\u2016_{M_t^{\u22121}}, and\n\u2211_{t=1}^{T_0} \u2016\u02c6W_t \u2212 W_\u2217\u2016_{M_t} \u2016X_t \u03c0(X_t)\u2016_{M_t^{\u22121}},\n\nwhere we define \u2016W\u2016_M, the M-norm of matrix W to be the max of column-wise M-norms. Using\nLemma 2, the term \u2016\u02c6\u00b5_t \u2212 \u00b5_\u2217\u2016_{M_t} is bounded by 2\u221a(m log(T_0 m/\u03b4)), and \u2016\u02c6W_t \u2212 W_\u2217\u2016_{M_t} is bounded by\n2\u221a(m log(T_0 md/\u03b4)), with probability 1 \u2212 \u03b4. Lemma 3 bounds the second term \u2211_{t=1}^{T_0} \u2016X_t \u03c0(X_t)\u2016_{M_t^{\u22121}}\nbut only when \u03c0 is the played policy. This is where we use that the played policy p_t was chosen\nto maximize \u2016X_t p_t\u2016_{M_t^{\u22121}}, so that \u2211_{t=1}^{T_0} \u2016X_t \u03c0(X_t)\u2016_{M_t^{\u22121}} \u2264 \u2211_{t=1}^{T_0} \u2016X_t p_t\u2016_{M_t^{\u22121}} and the bound\n\u2211_{t=1}^{T_0} \u2016X_t p_t\u2016_{M_t^{\u22121}} \u2264 \u221a(m T_0 log(T_0)) given by Lemma 3 actually bounds \u2211_{t=1}^{T_0} \u2016X_t \u03c0(X_t)\u2016_{M_t^{\u22121}}\nfor all \u03c0. Combining, we get a bound of 2m \u221a(T_0 log(T_0) log(T_0 d/\u03b4)) on deviations \u2016\u2211_{t=1}^{T_0} (\u02c6W_t \u2212 W_\u2217)^\u22a4 X_t \u03c0(X_t)\u2016_\u221e\nand |\u2211_{t=1}^{T_0} (\u02c6\u00b5_t \u2212 \u00b5_\u2217)^\u22a4 X_t \u03c0(X_t)| for all \u03c0.\n\nWe prove the following lemma.\nLemma 5. For \u03b3 = (T/T_0) 2m \u221a(T_0 log(T_0) log(T_0 d/\u03b4)), with probability 1 \u2212 O(\u03b4),\n\nOPT \u2212 2\u03b3 \u2264 \u02c6OPT^{2\u03b3} \u2264 OPT + 9\u03b3(OPT/B + 1).\n\nCorollary 3. Set Z = (\u02c6OPT^{2\u03b3} + 2\u03b3)/B + 1, with the above value of \u03b3. 
Then, with probability 1 \u2212 O(\u03b4),\n\nOPT/B + 1 \u2264 Z \u2264 (1 + 11\u03b3/B)(OPT/B + 1).\n\nCorollary 3 implies that as long as B \u2265 \u03b3, i.e., B \u2265 \u02dc\u2126(mT/\u221aT_0), Z is a constant factor approximation of\nOPT/B + 1 \u2265 Z\u2217, therefore Theorem 2 should provide an \u02dcO((OPT/B + 1) m\u221aT) regret bound. 
However, this bound does not account for the budget consumed in the first $T_0$ rounds. Considering that (at most) $T_0$ amount can be consumed from the budget in the first $T_0$ rounds, we have an additional regret of $\\frac{\\mathrm{OPT}}{B} T_0$. Further, since we have $B' = B - T_0$ budget for the remaining $T - T_0$ rounds, we need a $Z$ that satisfies the required assumption for $B'$ instead of $B$ (i.e., we need $\\frac{\\mathrm{OPT}}{B'} \\le Z \\le O(1)\\left(\\frac{\\mathrm{OPT}}{B'} + 1\\right)$). If $B \\ge 2T_0$, then $B' \\ge B/2$, and using 2 times the $Z$ computed in Corollary 3 would satisfy the required assumption.\n\nTogether, these observations give Theorem 3.\n\nTheorem 3. Using Algorithm 2 with $T_0$ such that $B > \\max\\{2T_0, mT/\\sqrt{T_0}\\}$, and twice the $Z$ given by Corollary 3, we get a high probability regret bound of\n\n$$\\tilde{O}\\left(\\left(\\frac{\\mathrm{OPT}}{B} + 1\\right)\\left(T_0 + m\\sqrt{T}\\right)\\right).$$\n\nIn particular, for $B > m^{1/2} T^{3/4}$ and $m \\le \\sqrt{T}$, we can use $T_0 = m\\sqrt{T}$ to get a regret bound of\n\n$$\\tilde{O}\\left(\\left(\\frac{\\mathrm{OPT}}{B} + 1\\right) m\\sqrt{T}\\right).$$\n\nReferences\n\n[1] Y. Abbasi-Yadkori, D. P\u00e1l, and C. Szepesv\u00e1ri. Improved algorithms for linear stochastic bandits. In NIPS, 2012.\n\n[2] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In ICML 2014, June 2014.\n\n[3] S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC \u201914, 2014.\n\n[4] S. Agrawal and N. R. Devanur. Fast algorithms for online stochastic convex programming. In SODA, pages 1405\u20131424, 2015.\n\n[5] S. Agrawal, N. R. Devanur, and L. Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In COLT, 2016.\n\n[6] S. Agrawal, Z. Wang, and Y. Ye. A dynamic near-optimal algorithm for online linear programming. 
Operations Research, 62:876\u2013890, 2014.\n\n[7] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(6):121\u2013164, 2012.\n\n[8] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3, Mar. 2003.\n\n[9] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In FOCS, pages 207\u2013216, 2013.\n\n[10] A. Badanidiyuru, J. Langford, and A. Slivkins. Resourceful contextual bandits. In Proceedings of The Twenty-Seventh Conference on Learning Theory (COLT-14), pages 1109\u20131134, 2014.\n\n[11] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual Bandits with Linear Payoff Functions. In AISTATS, 2011.\n\n[12] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic Linear Optimization under Bandit Feedback. In COLT, 2008.\n\n[13] N. R. Devanur and T. P. Hayes. The adwords problem: online keyword matching with budgeted bidders under random permutations. In EC, 2009.\n\n[14] N. R. Devanur, K. Jain, B. Sivan, and C. A. Wilkens. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In EC, 2011.\n\n[15] J. Feldman, M. Henzinger, N. Korula, V. S. Mirrokni, and C. Stein. Online stochastic packing applied to display ad allocation. In Proceedings of the 18th Annual European Conference on Algorithms: Part I, ESA\u201910, 2010.\n\n[16] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742\u2013765, 2009.\n\n[17] H. Wu, R. Srikant, X. Liu, and C. Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. 
In Proceedings of the 28th International Conference on Neural\nInformation Processing Systems (NIPS), 2015.\n", "award": [], "sourceid": 1719, "authors": [{"given_name": "Shipra", "family_name": "Agrawal", "institution": "Columbia University"}, {"given_name": "Nikhil", "family_name": "Devanur", "institution": "Microsoft Research"}]}