{"title": "Combinatorial semi-bandit with known covariance", "book": "Advances in Neural Information Processing Systems", "page_first": 2972, "page_last": 2980, "abstract": "The combinatorial stochastic semi-bandit problem is an extension of the classical multi-armed bandit problem in which an algorithm pulls more than one arm at each stage and the rewards of all pulled arms are revealed. One difference with the single arm variant is that the dependency structure of the arms is crucial. Previous works on this setting either used a worst-case approach or imposed independence of the arms. We introduce a way to quantify the dependency structure of the problem and design an algorithm that adapts to it. The algorithm is based on linear regression and the analysis uses techniques from the linear bandit literature. By comparing its performance to a new lower bound, we prove that it is optimal, up to a poly-logarithmic factor in the number of arms pulled.", "full_text": "Combinatorial semi-bandit with known covariance\n\nR\u00e9my Degenne\n\nLMPA, Universit\u00e9 Paris Diderot\n\nCMLA, ENS Paris-Saclay\n\ndegenne@cmla.ens-cachan.fr\n\nVianney Perchet\n\nCMLA, ENS Paris-Saclay\nCRITEO Research, Paris\n\nperchet@normalesup.org\n\nAbstract\n\nThe combinatorial stochastic semi-bandit problem is an extension of the classical\nmulti-armed bandit problem in which an algorithm pulls more than one arm at\neach stage and the rewards of all pulled arms are revealed. One difference with the\nsingle arm variant is that the dependency structure of the arms is crucial. Previous\nworks on this setting either used a worst-case approach or imposed independence\nof the arms. We introduce a way to quantify the dependency structure of the\nproblem and design an algorithm that adapts to it. 
The algorithm is based on linear regression and the analysis develops techniques from the linear bandit literature. By comparing its performance to a new lower bound, we prove that it is optimal, up to a poly-logarithmic factor in the number of pulled arms.\n\n1 Introduction and setting\n\nThe multi-armed bandit problem (MAB) is a sequential learning task in which an algorithm takes at each stage a decision (or, \u201cpulls an arm\u201d). It then gets a reward from this choice, with the goal of maximizing the cumulative reward [Robbins, 1985]. We consider here its stochastic combinatorial extension, in which the algorithm chooses at each stage a subset of arms [Audibert et al., 2013, Cesa-Bianchi and Lugosi, 2012, Chen et al., 2013, Gai et al., 2012]. These arms could form, for example, the path from an origin to a destination in a network. In the combinatorial setting, contrary to the classical MAB, the inter-dependencies between the arms can play a role (we consider that the distribution of rewards is invariant with time). We investigate here how the covariance structure of the arms affects the difficulty of the learning task and whether it is possible to design a single algorithm capable of performing optimally in all cases, from the simple scenario with independent rewards to the more challenging scenario of general correlated rewards.\nFormally, at each stage t \u2208 N, t \u2265 1, an algorithm pulls m \u2265 1 arms among d \u2265 m. Such a set of m arms is called an \u201caction\u201d and will be denoted by A_t \u2208 {0, 1}^d, a vector with exactly m non-zero entries. The possible actions are restricted to an arbitrary fixed subset A \u2282 {0, 1}^d. After choosing action A_t, the algorithm receives the reward A_t^\u22a4 X_t, where X_t \u2208 R^d is the vector encapsulating the reward of the d arms at stage t. The successive reward vectors (X_t)_{t\u22651} are i.i.d. with unknown mean \u00b5 \u2208 R^d.
We consider a semi-bandit feedback system: after choosing the action A_t, the algorithm observes the reward of each of the arms in that action, but not the other rewards. Other possible feedbacks previously studied include bandit (only A_t^\u22a4 X_t is revealed) and full information (X_t is revealed). The goal of the algorithm is to maximize the cumulated reward up to stage T \u2265 1 or equivalently to minimize the expected regret, which is the difference of the reward that would have been gained by choosing the best action in hindsight A\u2217 and what was actually gained:\n\nE[R_T] = E[\u2211_{t=1}^{T} ((A\u2217)^\u22a4 \u00b5 \u2212 A_t^\u22a4 \u00b5)].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFor an action A \u2208 A, the difference \u2206_A = ((A\u2217)^\u22a4 \u00b5 \u2212 A^\u22a4 \u00b5) is called the gap of A. We denote by \u2206_t the gap of A_t, so that the regret rewrites as E[R_T] = E[\u2211_{t=1}^{T} \u2206_t]. We also define the minimal gap of an arm, \u2206_{i,min} = min_{A\u2208A: i\u2208A} \u2206_A.\nThis setting was already studied in Cesa-Bianchi and Lugosi [2012], and most recently in Combes et al. [2015], Kveton et al. [2015], where two different algorithms are used to tackle on one hand the case where the arms have independent rewards and on the other hand the general bounded case. The regret guarantees of the two algorithms are different and reflect that the independent case is easier. Another algorithm for the independent arms case based on Thompson Sampling was introduced in Komiyama et al. [2015]. One of the main objectives of this paper is to design a single algorithm that can adapt to the covariance structure of the problem when prior information is available.\nThe following notations will be used throughout the paper: given a matrix M (resp. vector v), its (i, j)th (resp. ith) coefficient is denoted by M^(ij) (resp. v^(i)).
For a matrix M, the diagonal matrix with same diagonal as M is denoted by \u03a3_M.\nWe denote by \u03b7_t the noise in the reward, i.e. \u03b7_t := X_t \u2212 \u00b5. We consider a subgaussian setting, in which we suppose that there is a positive semi-definite matrix C such that for all t \u2265 1,\n\n\u2200u \u2208 R^d, E[exp(u^\u22a4 \u03b7_t)] \u2264 exp((1/2) u^\u22a4 C u).\n\nThis is equivalent to the usual setting for bandits where we suppose that the individual arms are subgaussian. Indeed if we have such a matrix C then each \u03b7_t^(i) is \u221a(C^(ii))-subgaussian. And under a subgaussian arms assumption, such a matrix always exists. This setting encompasses the case of bounded rewards.\nWe call C a subgaussian covariance matrix of the noise (see appendix A of the supplementary material). A good knowledge of C can simplify the problem greatly, as we will show. In the case of 1-subgaussian independent rewards, in which C can be chosen diagonal, a known lower bound on the regret appearing in Combes et al. [2015] is (d/\u2206) log T, while Kveton et al. [2015] proves a (dm/\u2206) log T lower bound in general. Our goal here is to investigate the spectrum of intermediate cases between these two settings, from the uninformed general case to the independent case in which one has much information on the relations between the arm rewards. We characterize the difficulty of the problem as a function of the subgaussian covariance matrix C. We suppose that we know a positive semi-definite matrix \u0393 such that for all vectors v with positive coordinates, v^\u22a4 C v \u2264 v^\u22a4 \u0393 v, a property that we denote by C \u2aaf_+ \u0393. \u0393 reflects the prior information available about the possible degree of independence of the arms.
We will study algorithms that enjoy regret bounds as functions of \u0393.\nThe matrix \u0393 can be chosen such that all its coefficients are non-negative and verify for all i, j, \u0393^(ij) \u2264 \u221a(\u0393^(ii) \u0393^(jj)). From now on, we suppose that it is the case. In the following, we will use \u03b5_t such that \u03b7_t = C^{1/2} \u03b5_t and write for the reward: X_t = \u00b5 + C^{1/2} \u03b5_t.\n\n2 Lower bound\n\nWe first prove a lower bound on the regret of any algorithm, demonstrating the link between the subgaussian covariance matrix and the difficulty of the problem. It depends on the maximal off-diagonal correlation coefficient of the covariance matrix. This coefficient is \u03b3 = max_{(i,j)\u2208[d], i\u2260j} C^(ij) / \u221a(C^(ii) C^(jj)).\nThe bound is valid for consistent algorithms [Lai and Robbins, 1985], for which the regret on any problem verifies E[R_t] = o(t^a) as t \u2192 +\u221e for all a > 0.\nTheorem 1. Suppose to simplify that d is a multiple of m. Then, for any \u2206 > 0, for any consistent algorithm, there is a problem with gaps \u2206, \u03c3-subgaussian arms and correlation coefficients smaller than \u03b3 \u2208 [0, 1] on which the regret is such that\n\nlim inf_{t\u2192+\u221e} E[R_t] / log t \u2265 (1 + \u03b3(m \u2212 1)) 2\u03c3^2 (d \u2212 m) / \u2206.\n\nThis bound is a consequence of the classical result of Lai and Robbins [1985] for multi-armed bandits, applied to the problem of choosing one among d/m paths, each of which has m different successive edges (Figure 1). The rewards in the same path are correlated but the paths are independent. A complete proof can be found in appendix B.1 of the supplementary material.\n\nFigure 1: Left: parallel paths problem.
Right: regret of OLS-UCB as a function of m and \u03b3 in the parallel paths problem with 5 paths (average over 1000 runs).\n\n3 OLS-UCB Algorithm and analysis\n\nFaced with the combinatorial semi-bandit at stage t \u2265 1, the observations from t \u2212 1 stages form as many linear equations and the goal of an algorithm is to choose the best action. To find the action with the highest mean, we estimate the mean of all arms. This can be viewed as a regression problem. The design of our algorithm stems from this observation and is inspired by linear regression in the fixed design setting, similarly to what was done in the stochastic linear bandit literature [Rusmevichientong and Tsitsiklis, 2010, Filippi et al., 2010]. There are many estimators for linear regression and we focus on the one that is simple enough and adaptive: Ordinary Least Squares (OLS).\n\n3.1 Fixed design linear regression and OLS-UCB algorithm\n\nFor an action A \u2208 A, let I_A be the diagonal matrix with a 1 at line i if A^(i) = 1 and 0 otherwise. For a matrix M, we also denote by M_A the matrix I_A M I_A. At stage t, if all actions A_1, . . . , A_t were independent of the rewards, we would have observed a set of linear equations\n\nI_{A_1} X_1 = I_{A_1} \u00b5 + I_{A_1} \u03b7_1\n...\nI_{A_{t\u22121}} X_{t\u22121} = I_{A_{t\u22121}} \u00b5 + I_{A_{t\u22121}} \u03b7_{t\u22121}\n\nand we could use the OLS estimator to estimate \u00b5, which is unbiased and has a known subgaussian constant controlling its variance. This is however not true in our online setting since the successive actions are not independent. At stage t, we define\n\nn_t^(i) = \u2211_{s=1}^{t\u22121} I{i \u2208 A_s}, n_t^(ij) = \u2211_{s=1}^{t\u22121} I{i \u2208 A_s} I{j \u2208 A_s} and D_t = \u2211_{s=1}^{t\u22121} I_{A_s},\n\nwhere n_t^(i) is the number of times arm i has been pulled before stage t and D_t is a diagonal matrix of these numbers. The OLS estimator is, for an arm i \u2208 [d],\n\n\u02c6\u00b5_t^(i) = (1/n_t^(i)) \u2211_{s<t: i\u2208A_s} X_s^(i) = \u00b5^(i) + (D_t^{-1} \u2211_{s=1}^{t\u22121} I_{A_s} C^{1/2} \u03b5_s)^(i).\n\nThen for all A \u2208 A, A^\u22a4(\u02c6\u00b5_t \u2212 \u00b5) in the fixed design setting has a subgaussian matrix equal to D_t^{-1} (\u2211_{s=1}^{t\u22121} C_{A_s}) D_t^{-1}. We get confidence intervals for the estimates and can use an upper confidence bound strategy [Lai and Robbins, 1985, Auer et al., 2002]. In the online learning setting the actions are not independent but we will show that using this estimator still leads to estimates that are well concentrated around \u00b5, with confidence intervals given by the same subgaussian matrix. The algorithm OLS-UCB (Algorithm 1) results from an application of an upper confidence bound strategy with this estimator.\nWe now turn to an analysis of the regret of OLS-UCB. At any stage t \u2265 1 of the algorithm, let \u03b3_t = max_{(i,j)\u2208A_t, i\u2260j} \u0393^(ij) / \u221a(\u0393^(ii) \u0393^(jj)) be the maximal off-diagonal correlation coefficient of \u0393_{A_t} and let \u03b3 = max_{t\u2208[T]} \u03b3_t be the maximum up to stage T.\n\nAlgorithm 1 OLS-UCB.\nRequire: Positive semi-definite matrix \u0393, real parameter \u03bb > 0.\n1: Choose actions such that each arm is pulled at least one time.\n2: loop: at stage t,\n3: A_t = arg max_A A^\u22a4 \u02c6\u00b5_t + E_t(A), with E_t(A) = \u221a(2f(t)) \u221a(A^\u22a4 D_t^{-1} (\u03bb \u03a3_\u0393 D_t + \u2211_{s=1}^{t\u22121} \u0393_{A_s}) D_t^{-1} A).\n4: Choose action A_t, observe I_{A_t} X_t.\n5: Update \u02c6\u00b5_t, D_t.\n6: end loop\n\nTheorem 2.
The OLS-UCB algorithm with parameter \u03bb > 0 and f(t) = log t + (m + 2) log log t + (m/2) log(1 + e/\u03bb) enjoys for all times T \u2265 1 the regret bound\n\nE[R_T] \u2264 16 f(T) \u2211_{i\u2208[d]} (\u0393^(ii) / \u2206_{i,min}) (5(\u03bb + 1 \u2212 \u03b3) \u2308log m / 1.6\u2309^2 + 45\u03b3m) + 8dm^2 max_i{C^(ii)} \u2206_max / \u2206_min^2 + 4\u2206_max,\n\nwhere \u2308x\u2309 stands for the smallest positive integer bigger than or equal to x. In particular, \u23080\u2309 = 1.\n\nThis bound shows the transition between a general case with a (dm log T)/\u2206 regime and an independent case with a (d log^2 m log T)/\u2206 upper bound (we recall that the lower bound is of the order of (d log T)/\u2206). The weight of each case is given by the maximum correlation parameter \u03b3. The parameter \u03bb seems to be an artefact of the analysis and can in practice be taken very small or even equal to 0.\nFigure 1 illustrates the regret of OLS-UCB on the parallel paths problem used to derive the lower bound. It shows a linear dependency in \u03b3 and supports the hypothesis that the true upper bound matches the lower bound with a dependency in m and \u03b3 of the form (1 + \u03b3(m \u2212 1)).\nCorollary 1. The OLS-UCB algorithm with matrix \u0393 and parameter \u03bb > 0 has a regret bounded as\n\nE[R_T] \u2264 O(\u221a(dT log T max_{i\u2208[d]}{\u0393^(ii)} (5(\u03bb + 1 \u2212 \u03b3) \u2308log m / 1.6\u2309^2 + 45\u03b3m))).\n\nProof. We write that the regret up to stage T is bounded by \u2206T for actions with gap smaller than some \u2206 and bounded using theorem 2 for other actions (with \u2206_min \u2265 \u2206).
Maximizing over \u2206 then gives the result.\n\n3.2 Comparison with other algorithms\n\nPrevious works supposed that the rewards of the individual arms are in [0, 1], which gives them a 1/2-subgaussian property. Hence we suppose (\u2200i \u2208 [d], C^(ii) = 1/2) for our comparison.\nIn the independent case, our algorithm is the same as ESCB-2 from Combes et al. [2015], up to the parameter \u03bb. That paper shows that ESCB-2 enjoys an O((d \u221am log T)/\u2206) upper bound but our analysis tightens it to O((d log^2 m log T)/\u2206).\nIn the general (worst) case, Kveton et al. [2015] prove an O((dm log T)/\u2206) upper bound (which is tight) using CombUCB1, a UCB based algorithm introduced in Chen et al. [2013] which at stage t uses the exploration term \u221a(1.5 log t) \u2211_{i\u2208A} 1/\u221a(n_t^(i)). Our exploration term always verifies E_t(A) \u2264 \u221a(f(t)) \u2211_{i\u2208A} 1/\u221a(n_t^(i)) with f(t) \u2248 log t (see section 3.6). Their exploration term is a worst-case confidence interval for the means. Their broader confidence intervals however have the desirable property that one can find the action that realizes the maximum index by solving a linear optimization problem, making their algorithm computationally efficient, a quality that both ESCB and OLS-UCB are lacking.\nNeither of the two former algorithms benefits from guarantees in the other regime. The regret of ESCB in the general possibly correlated case is unknown and the regret bound for CombUCB1 is not improved in the independent case. In contrast, OLS-UCB is adaptive in the sense that its performance gets better when more information is available on the independence of the arms.\n\n3.3 Regret Decomposition\n\nLet H_{i,t} = {|\u02c6\u00b5_t^(i) \u2212 \u00b5^(i)| \u2265 \u2206_t/(2m)} and H_t = \u222a_{i=1}^{d} H_{i,t}. H_t is the event that at least one coordinate of \u02c6\u00b5_t is far from the true mean. Let G_t = {(A\u2217)^\u22a4 \u00b5 \u2265 (A\u2217)^\u22a4 \u02c6\u00b5_t + E_t(A\u2217)} be the event that the estimate of the optimal action is below its true mean by a big margin. We decompose the regret according to these events (a bar denotes the complementary event):\n\nR_T \u2264 \u2211_{t=1}^{T} \u2206_t I{G\u0305_t, H\u0305_t} + \u2211_{t=1}^{T} \u2206_t I{G_t} + \u2211_{t=1}^{T} \u2206_t I{H_t}\n\nEvents G_t and H_t are rare and lead to a finite regret (see below). We first simplify the regret due to G\u0305_t \u2229 H\u0305_t and show that it is bounded by the \"variance\" term of the algorithm.\nLemma 1. With the algorithm choosing at stage t the action A_t = arg max_A (A^\u22a4 \u02c6\u00b5_t + E_t(A)), we have \u2206_t I{G\u0305_t, H\u0305_t} \u2264 2E_t(A_t) I{\u2206_t \u2264 2E_t(A_t)}.\nProof in appendix B.2 of the supplementary material. Then the regret is cut into three terms,\n\nR_T \u2264 2 \u2211_{t=1}^{T} E_t(A_t) I{\u2206_t \u2264 2E_t(A_t)} + \u2211_{t=1}^{T} \u2206_t I{G_t} + \u2211_{t=1}^{T} \u2206_t I{H_t}.\n\nThe three terms will be bounded as follows:\n\n\u2022 The H_t term leads to a finite regret from a simple application of Hoeffding\u2019s inequality.\n\u2022 The G_t term leads to a finite regret for a good choice of f(t). This is where we need to show that the exploration term of the algorithm gives a high probability upper confidence bound of the reward.\n\u2022 The E_t(A_t) term, or variance term, is the main source of the regret and is bounded using ideas similar to the ones used in existing works on semi-bandits.\n\n3.4 Expected regret from H_t\n\nLemma 2.
The expected regret due to the event H_t is E[\u2211_{t=1}^{T} \u2206_t I{H_t}] \u2264 8dm^2 max_i{C^(ii)} \u2206_max / \u2206_min^2.\nThe proof uses Hoeffding\u2019s inequality on the arm mean estimates and can be found in appendix B.2 of the supplementary material.\n\n3.5 Expected regret from G_t\n\nWe want to bound the probability that the estimated reward for the optimal action is far from its mean. We show that it is sufficient to control a self-normalized sum and do it using arguments from Pe\u00f1a et al. [2008], or Abbasi-Yadkori et al. [2011] who applied them to linear bandits. The analysis also involves a peeling argument, as was done in one dimension by Garivier [2013] to bound a similar quantity.\nLemma 3. Let \u03b4_t > 0. With \u02dcf(\u03b4_t) = log(1/\u03b4_t) + m log log t + (m/2) log(1 + e/\u03bb) and an algorithm given by the exploration term E_t(A) = \u221a(2\u02dcf(\u03b4_t)) \u221a(A^\u22a4 D_t^{-1} (\u03bb \u03a3_\u0393 D_t + \u2211_{s=1}^{t\u22121} \u0393_{A_s}) D_t^{-1} A), then the event G_t = {(A\u2217)^\u22a4 \u00b5 \u2265 (A\u2217)^\u22a4 \u02c6\u00b5_t + E_t(A\u2217)} verifies P{G_t} \u2264 \u03b4_t.\nWith \u03b4_1 = 1 and \u03b4_t = 1/(t log^2 t) for t \u2265 2, such that \u02dcf(\u03b4_t) = f(t), the regret due to G_t is finite in expectation, bounded by 4\u2206_max.\n\nProof. We use a peeling argument: let \u03b7 > 0 and for a = (a_1, . . . , a_m) \u2208 N^m, let the set D_a \u2282 [T] be a subset of indices defined by (t \u2208 D_a \u21d4 \u2200i \u2208 A\u2217, (1 + \u03b7)^{a_i} \u2264 n_t^(i) < (1 + \u03b7)^{a_i+1}). For any B_t \u2208 R,\n\nP{(A\u2217)^\u22a4(\u00b5 \u2212 \u02c6\u00b5_t) \u2265 B_t} \u2264 \u2211_a P{(A\u2217)^\u22a4(\u00b5 \u2212 \u02c6\u00b5_t) \u2265 B_t | t \u2208 D_a}.\n\nThe number of possible sets D_a for t is bounded by (log t / log(1 + \u03b7))^m, since each number of pulls n_t^(i) for i \u2208 A\u2217 is bounded by t. We now search a bound of the form P{(A\u2217)^\u22a4(\u00b5 \u2212 \u02c6\u00b5_t) \u2265 B_t | t \u2208 D_a}.\nSuppose t \u2208 D_a and let D be a positive definite diagonal matrix (that depends on a). Let S_t = \u2211_{s=1}^{t\u22121} I_{A_s\u2229A\u2217} C^{1/2} \u03b5_s, V_t = \u2211_{s=1}^{t\u22121} C_{A_s\u2229A\u2217} and I_{V_t+D}(\u03b5) = (1/2) \u2016S_t\u2016^2_{(V_t+D)^{-1}}.\nLemma 4. Let \u03b4_t > 0 and let \u02dcf(\u03b4_t) be a function of \u03b4_t. With a choice of D such that I_{A\u2217} D \u2aaf \u03bb I_{A\u2217} \u03a3_C D_t for all t in D_a,\n\nP{(A\u2217)^\u22a4(\u00b5 \u2212 \u02c6\u00b5_t) \u2265 \u221a(2\u02dcf(\u03b4_t) (A\u2217)^\u22a4 D_t^{-1} (\u03bb \u03a3_C D_t + V_t) D_t^{-1} A\u2217) | t \u2208 D_a} \u2264 P{I_{V_t+D}(\u03b5) \u2265 \u02dcf(\u03b4_t) | t \u2208 D_a}.\n\nProof in appendix B.2 of the supplementary material.\nThe self-normalized sum I_{V_t}(\u03b5) is an interesting quantity for the following reason: exp((1/2) I_{V_t}(\u03b5)) = max_{u\u2208R^d} \u220f_{s=1}^{t\u22121} exp(u^\u22a4 I_{A_s\u2229A\u2217} C^{1/2} \u03b5_s \u2212 u^\u22a4 C_{A_s\u2229A\u2217} u). For a given u, the exponential is smaller than 1 in expectation, from the subgaussian hypothesis. The maximum of the expectation is then smaller than 1. To control I_{V_t}(\u03b5), we are however interested in the expectation of this maximum and cannot interchange max and E. The method of mixtures circumvents this difficulty: it provides an approximation of the maximum by integrating the exponential against a multivariate normal centered at the point V_t^{-1} S_t, where the maximum is attained. The integrals over u and \u03b5 can then be swapped by Fubini\u2019s theorem to get an approximation of the expectation of the maximum using an integral of the expectations. Doing so leads to the following lemma, extracted from the proof of Theorem 1 of Abbasi-Yadkori et al. [2011].\nLemma 5. Let D be a positive definite matrix that does not depend on t and M_t(D) = \u221a(det D / det(V_t + D)) exp(I_{V_t+D}(\u03b5)). Then E[M_t(D)] \u2264 1.\nWe rewrite P{I_{V_t+D}(\u03b5) \u2265 \u02dcf(\u03b4_t)} to introduce M_t(D),\n\nP{I_{V_t+D}(\u03b5) \u2265 \u02dcf(\u03b4_t) | t \u2208 D_a} = P{M_t(D) \u2265 exp(\u02dcf(\u03b4_t)) / \u221a(det(Id + D^{-1/2} V_t D^{-1/2})) | t \u2208 D_a}.\n\nThe peeling lets us bound V_t. Let the matrix D_a be the diagonal matrix with entry (i, i) equal to (1 + \u03b7)^{a_i} for i \u2208 A\u2217 and 0 elsewhere.\nLemma 6.
With D = \u03bb \u03a3_C D_a + I_{[d]\u2216A\u2217}, det(Id + D^{-1/2} V_t D^{-1/2}) \u2264 (1 + (1 + \u03b7)/\u03bb)^m.\nThe union bound on the sets D_a and Markov\u2019s inequality give\n\nP{(A\u2217)^\u22a4(\u00b5 \u2212 \u02c6\u00b5_t) \u2265 \u221a(2\u02dcf(\u03b4_t)) \u221a(\u03bb (A\u2217)^\u22a4 \u03a3_C D_t^{-1} A\u2217 + (A\u2217)^\u22a4 D_t^{-1} V_t D_t^{-1} A\u2217)}\n\u2264 \u2211_{D_a} P{M_t(D) \u2265 (1 + (1 + \u03b7)/\u03bb)^{\u2212m/2} exp(\u02dcf(\u03b4_t)) | t \u2208 D_a}\n\u2264 (log t / log(1 + \u03b7))^m (1 + (1 + \u03b7)/\u03bb)^{m/2} exp(\u2212\u02dcf(\u03b4_t)).\n\nFor \u03b7 = e \u2212 1 and \u02dcf(\u03b4_t) as in lemma 3, this is bounded by \u03b4_t. The result with \u0393 instead of C is a consequence of C \u2aaf_+ \u0393. With \u03b4_1 = 1 and \u03b4_t = 1/(t log^2 t) for t \u2265 2, the regret due to G_t is\n\nE[\u2211_{t=1}^{T} \u2206_t I{G_t}] \u2264 \u2206_max (1 + \u2211_{t=2}^{T} 1/(t log^2 t)) \u2264 4\u2206_max.\n\n3.6 Bounding the variance term\n\nThe goal of this section is to bound E_t(A_t) under the event {\u2206_t \u2264 2E_t(A_t)}. Let \u03b3_t \u2208 [0, 1] such that for all i, j \u2208 A_t with i \u2260 j, \u0393^(ij) \u2264 \u03b3_t \u221a(\u0393^(ii) \u0393^(jj)). From the Cauchy-Schwarz inequality, n_t^(ij) \u2264 \u221a(n_t^(i) n_t^(j)). Using these two inequalities,\n\nA_t^\u22a4 D_t^{-1} (\u2211_{s=1}^{t\u22121} \u0393_{A_s}) D_t^{-1} A_t = \u2211_{i,j\u2208A_t} n_t^(ij) \u0393^(ij) / (n_t^(i) n_t^(j)) \u2264 (1 \u2212 \u03b3_t) \u2211_{i\u2208A_t} \u0393^(ii)/n_t^(i) + \u03b3_t (\u2211_{i\u2208A_t} \u221a(\u0393^(ii)/n_t^(i)))^2.\n\nWe recognize here the forms of the indexes used in Combes et al. [2015] for independent arms (left term) and Kveton et al. [2015] for general arms (right term).
Using \u2206_t \u2264 2E_t(A_t) we get\n\n\u2206_t^2 / (8f(t)) \u2264 (\u03bb + 1 \u2212 \u03b3_t) \u2211_{i\u2208A_t} \u0393^(ii)/n_t^(i) + \u03b3_t (\u2211_{i\u2208A_t} \u221a(\u0393^(ii)/n_t^(i)))^2. (1)\n\nThe strategy from here is to find events that must happen when (1) holds and to show that these events cannot happen very often. For positive integers j and t and for e \u2208 {1, 2}, we define the set of arms in A_t that were pulled less than a given threshold: S^j_{t,e} = {i \u2208 A_t, n_t^(i) \u2264 \u03b1_{j,e} 8f(t) \u0393^(ii) g_e(m, \u03b3_t) / \u2206_t^2}, with g_e(m, \u03b3_t) to be stated later and (\u03b1_{i,e})_{i\u22651} a decreasing sequence. Let also S^0_{t,e} = A_t. (S^j_{t,e})_{j\u22650} is decreasing for the inclusion of sets and we impose lim_{j\u2192+\u221e} \u03b1_{j,e} = 0, such that there is an index j_\u2205 with S^{j_\u2205}_{t,e} = \u2205. We introduce another positive sequence (\u03b2_{j,e})_{j\u22650} and consider the events that at least m\u03b2_{j,e} arms in A_t are in the set S^j_{t,e} and that the same is false for k < j, i.e. for t \u2265 1, the event A^j_{t,e} = {|S^j_{t,e}| \u2265 m\u03b2_{j,e}; \u2200k < j, |S^k_{t,e}| < m\u03b2_{k,e}}. To avoid having some of these events being impossible we choose (\u03b2_{j,e})_{j\u22650} decreasing. We also impose \u03b2_{0,e} = 1, such that |S^0_{t,e}| = m\u03b2_{0,e}.\nLet then the event A_{t,e} = \u222a_{j=1}^{+\u221e} A^j_{t,e} and the event A_t = A_{t,1} \u222a A_{t,2}. We will show that the event A_t must happen for (1) to be true. First, remark that under a condition on (\u03b2_{j,e})_{j\u22650}, the event A_t is a finite union of events,\nLemma 7. For e \u2208 {1, 2}, if there exists j_{0,e} such that \u03b2_{j_{0,e},e} \u2264 1/m, then A_{t,e} = \u222a_{j=1}^{j_0} A^j_{t,e}.\nWe now show that the complement of the event A_t is impossible by proving a contradiction in (1).\nLemma 8. Under the complement of the event A_{t,1}, if there exists j_0 such that \u03b2_{j_0,1} \u2264 1/m, then\n\n\u2211_{i\u2208A_t} \u0393^(ii)/n_t^(i) < (m \u2206_t^2 / (8f(t) g_1(m, \u03b3_t))) (\u2211_{j=1}^{j_0} (\u03b2_{j\u22121,1} \u2212 \u03b2_{j,1})/\u03b1_{j,1} + \u03b2_{j_0,1}/\u03b1_{j_0,1}).\n\nUnder the complement of the event A_{t,2}, if lim_{j\u2192+\u221e} \u03b2_{j,2}/\u221a(\u03b1_{j,2}) = 0 and \u2211_{j=1}^{+\u221e} (\u03b2_{j\u22121,2} \u2212 \u03b2_{j,2})/\u221a(\u03b1_{j,2}) exists, then\n\n\u2211_{i\u2208A_t} \u221a(\u0393^(ii)/n_t^(i)) \u2264 (m \u2206_t / \u221a(8f(t) g_2(m, \u03b3_t))) \u2211_{j=1}^{+\u221e} (\u03b2_{j\u22121,2} \u2212 \u03b2_{j,2})/\u221a(\u03b1_{j,2}).\n\nA proof can be found in appendix B.2 of the supplementary material. To ensure that the conditions of these lemmas are fulfilled, we impose that (\u03b2_{i,1})_{i\u22650} and (\u03b2_{i,2})_{i\u22650} have limit 0 and that lim_{j\u2192+\u221e} \u03b2_{j,2}/\u221a(\u03b1_{j,2}) = 0. Let j_{0,1} be the smallest integer such that \u03b2_{j_{0,1},1} \u2264 1/m. Let l_1 = \u03b2_{j_{0,1},1}/\u03b1_{j_{0,1},1} + \u2211_{j=1}^{j_{0,1}} (\u03b2_{j\u22121,1} \u2212 \u03b2_{j,1})/\u03b1_{j,1} and l_2 = \u2211_{j=1}^{+\u221e} (\u03b2_{j\u22121,2} \u2212 \u03b2_{j,2})/\u221a(\u03b1_{j,2}). Using the two last lemmas with (1), we get that if the complement of the event A_t is true,\n\n\u2206_t^2 / (8f(t)) < (\u2206_t^2 / (8f(t))) ((\u03bb + 1 \u2212 \u03b3_t) m l_1 / g_1(m, \u03b3_t) + \u03b3_t m^2 l_2^2 / g_2(m, \u03b3_t)).\n\nTaking g_1(m, \u03b3_t) = 2(\u03bb + 1 \u2212 \u03b3_t) m l_1 and g_2(m, \u03b3_t) = 2\u03b3_t m^2 l_2^2, we get a contradiction. Hence with these choices the event A_t must happen. The regret bound will be obtained by a union bound on the events that form A_t. First suppose that all gaps are equal to the same \u2206.\n\nLemma 9. Let \u03b3 = max_{t\u22651} \u03b3_t. For j \u2208 N\u2217, the event A^j_{t,e} happens at most d \u03b1_{j,e} 8f(T) max_i{\u0393^(ii)} g_e(m, \u03b3) / (m \u03b2_{j,e} \u2206^2) times.\n\nProof.
Each time that the event A^j_{t,e} happens, the counter of plays n_t^(i) of at least m\u03b2_{j,e} arms is incremented. After \u03b1_{j,e} 8f(T) max_i{\u0393^(ii)} g_e(m, \u03b3) / \u2206^2 increments, an arm cannot verify the condition on n_t^(i) any more. There are d arms, so the event can happen at most d (1/(m\u03b2_{j,e})) \u03b1_{j,e} 8f(T) max_i{\u0393^(ii)} g_e(m, \u03b3) / \u2206^2 times.\n\nIf all gaps are equal to \u2206, a union bound for the event A_t gives (a bar denotes the complementary event)\n\nE[\u2211_{t=1}^{T} \u2206 I{H\u0305_t \u2229 G\u0305_t}] \u2264 16 max_{i\u2208[d]}{\u0393^(ii)} (f(T)/\u2206) d [(\u03bb + 1 \u2212 \u03b3) l_1 \u2211_{j=1}^{j_{0,1}} \u03b1_{j,1}/\u03b2_{j,1} + \u03b3 m l_2^2 \u2211_{j=1}^{+\u221e} \u03b1_{j,2}/\u03b2_{j,2}].\n\nThe general case requires more involved manipulations but the result is similar and no new important idea is used. The following lemma is proved in appendix B.2 of the supplementary material:\nLemma 10. Let \u03b3^(i) = max_{t: i\u2208A_t} \u03b3_t. The regret from the event H\u0305_t \u2229 G\u0305_t is such that\n\nE[\u2211_{t=1}^{T} \u2206_t I{H\u0305_t \u2229 G\u0305_t}] \u2264 16f(T) \u2211_{i\u2208[d]} (\u0393^(ii)/\u2206_{i,min}) [(\u03bb + 1 \u2212 \u03b3) l_1 \u2211_{j=1}^{j_0} \u03b1_{j,1}/\u03b2_{j,1} + \u03b3 m l_2^2 \u2211_{j=1}^{+\u221e} \u03b1_{j,2}/\u03b2_{j,2}].\n\nFinally we can find sequences (\u03b1_{j,1})_{j\u22651}, (\u03b1_{j,2})_{j\u22651}, (\u03b2_{j,1})_{j\u22650} and (\u03b2_{j,2})_{j\u22650} such that\n\nE[\u2211_{t=1}^{T} \u2206_t I{H\u0305_t \u2229 G\u0305_t}] \u2264 16f(T) \u2211_{i\u2208[d]} (\u0393^(ii)/\u2206_{i,min}) (5(\u03bb + 1 \u2212 \u03b3^(i)) \u2308log m / 1.6\u2309^2 + 45\u03b3^(i) m).\n\nSee appendix C of the supplementary material. In Combes et al. [2015], \u03b1_{i,1} and \u03b2_{i,1} were such that the log^2 m term was replaced by \u221am. Our choice is also applicable to their ESCB algorithm. Our use of geometric sequences is only optimal among sequences such that \u03b2_{i,1} = \u03b1_{i,1} for all i \u2265 1. It is unknown to us if one can do better. With this control of the variance term, we finally proved Theorem 2.\n\n4 Conclusion\n\nWe defined a continuum of settings from the general to the independent arms cases which is suitable for the analysis of semi-bandit algorithms. We exhibited a lower bound scaling with a parameter that quantifies the particular setting in this continuum and proposed an algorithm inspired from linear regression with an upper bound that matches the lower bound up to a log^2 m term. Finally we showed how to use tools from the linear bandits literature to analyse algorithms for the combinatorial bandit case that are based on linear regression.\nIt would be interesting to estimate the subgaussian covariance matrix online to attain good regret bounds without prior knowledge. Also, our algorithm is not computationally efficient since it requires the computation of an argmax over the actions at each stage. It may be possible to compute this argmax less often and still keep the regret guarantee, as was done in Abbasi-Yadkori et al. [2011] and Combes et al. [2015].\nOn a broader scope, the inspiration from linear regression could lead to algorithms using different estimators, adapted to the structure of the problem. For example, the weighted least-square estimator is also unbiased and has smaller variance than OLS.
Or one could take advantage of a sparse covariance matrix by using sparse estimators, as was done in the linear bandit case in Carpentier and Munos [2012].\n\nAcknowledgements\n\nThe authors would like to acknowledge funding from the ANR under grant number ANR-13-JS01-0004 as well as the Fondation Math\u00e9matiques Jacques Hadamard and EDF through the Program Gaspard Monge for Optimization and the Irsdi project Tecolere.\n\nReferences\n\nYasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved Algorithms for Linear Stochastic Bandits. Neural Information Processing Systems, pages 1\u201319, 2011.\n\nJean-Yves Audibert, S\u00e9bastien Bubeck, and G\u00e1bor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31\u201345, 2013.\n\nPeter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235\u2013256, 2002.\n\nAlexandra Carpentier and R\u00e9mi Munos. Bandit Theory meets Compressed Sensing for high dimensional Stochastic Linear Bandit. Advances in Neural Information Processing Systems (NIPS), pages 251\u2013259, 2012.\n\nNicolo Cesa-Bianchi and G\u00e1bor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404\u20131422, 2012.\n\nWei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. Proceedings of the 30th International Conference on Machine Learning (ICML), pages 151\u2013159, 2013.\n\nRichard Combes, M. Sadegh Talebi, Alexandre Proutiere, and Marc Lelarge. Combinatorial Bandits Revisited. Neural Information Processing Systems, pages 1\u20139, 2015.\n\nSarah Filippi, Olivier Capp\u00e9, Aur\u00e9lien Garivier, and Csaba Szepesv\u00e1ri. Parametric Bandits: The Generalized Linear Case. Neural Information Processing Systems, pages 1\u20139, 2010.\n\nYi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466\u20131478, 2012.\n\nAur\u00e9lien Garivier. Informational confidence bounds for self-normalized averages and applications. 2013 IEEE Information Theory Workshop, ITW 2013, 2013.\n\nJunpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays. Proceedings of the 32nd International Conference on Machine Learning, 2015.\n\nBranislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.\n\nTze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4\u201322, 1985.\n\nVictor H Pe\u00f1a, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008.\n\nHerbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169\u2013177. Springer, 1985.\n\nPaat Rusmevichientong and John N. Tsitsiklis. Linearly Parameterized Bandits. Mathematics of Operations Research, (1985):1\u201340, 2010.", "award": [], "sourceid": 1480, "authors": [{"given_name": "R\u00e9my", "family_name": "Degenne", "institution": "Universit\u00e9 Paris Diderot"}, {"given_name": "Vianney", "family_name": "Perchet", "institution": "Ensae & Criteo Labs"}]}