{"title": "Distributed Exploration in Multi-Armed Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 854, "page_last": 862, "abstract": "We study exploration in Multi-Armed Bandits (MAB) in a setting where~$k$ players collaborate in order to identify an $\\epsilon$-optimal arm. Our motivation comes from recent employment of MAB algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the $k$ players to communicate \\emph{only once}, they are able to learn $\\sqrt{k}$ times faster than a single player. That is, distributing learning to $k$ players gives rise to a factor~$\\sqrt{k}$ parallel speed-up. We complement this result with a lower bound showing this is in general the best possible.  On the other extreme, we present an algorithm that achieves the ideal factor $k$ speed-up in learning performance, with communication only logarithmic in~$1/\\epsilon$.", "full_text": "Distributed Exploration in Multi-Armed Bandits\n\nEshcar Hillel\n\nYahoo Labs, Haifa\n\neshcar@yahoo-inc.com\n\nTomer Koren\u2217\n\nTechnion \u2014 Israel Inst. of Technology\n\ntomerk@technion.ac.il\n\nZohar Karnin\n\nYahoo Labs, Haifa\n\nzkarnin@yahoo-inc.com\n\nRonny Lempel\nYahoo Labs, Haifa\n\nrlempel@yahoo-inc.com\n\nOren Somekh\n\nYahoo Labs, Haifa\n\norens@yahoo-inc.com\n\nAbstract\n\nWe study exploration in Multi-Armed Bandits in a setting where k players col-\nlaborate in order to identify an \u03b5-optimal arm. Our motivation comes from recent\nemployment of bandit algorithms in computationally intensive, large-scale appli-\ncations. Our results demonstrate a non-trivial tradeoff between the number of arm\npulls required by each of the players, and the amount of communication between\nthem. 
In particular, our main result shows that by allowing the k players to communicate only once, they are able to learn \u221ak times faster than a single player. That is, distributing learning to k players gives rise to a factor \u221ak parallel speed-up. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/\u03b5.\n\n1 Introduction\n\nOver the past years, multi-armed bandit (MAB) algorithms have been employed in an increasing number of large-scale applications. MAB algorithms rank results of search engines [23, 24], choose between stories or ads to showcase on web sites [2, 8], accelerate model selection and stochastic optimization tasks [21, 22], and more. In many of these applications, the workload is simply too high to be handled by a single processor. In the web context, for example, the sheer volume of user requests and the high rate at which they arrive require websites to use many front-end machines that run in multiple data centers. In the case of model selection tasks, a single evaluation of a certain model or configuration might require considerable computation time, so that distributing the exploration process across several nodes may result in a significant gain in performance. In this paper, we study such large-scale MAB problems in a distributed environment where learning is performed by several independent nodes that may take actions and observe rewards in parallel.\n\nFollowing recent MAB literature [14, 3, 15, 18], we focus on the problem of identifying a \u201cgood\u201d bandit arm with high confidence. 
In this problem, we may repeatedly choose one arm (corresponding to an action) and observe a reward drawn from a probability distribution associated with that arm. Our goal is to find an arm with an (almost) optimal expected reward, with as few arm pulls as possible (that is, minimize the simple regret [7]). Our objective is thus explorative in nature, and in particular we do not mind the incurred costs or the involved regret. This is indeed the natural goal in many applications, such as in the case of model selection problems mentioned above. In our setup, a distributed strategy is evaluated by the number of arm pulls per node required for the task, which correlates with the parallel speed-up obtained by distributing the learning process.\n\n\u2217 Most of this work was done while the author was at Yahoo Labs, Haifa.\n\nWe abstract a distributed MAB system as follows. In our model, there are k players that correspond to k independent machines in a cluster. The players are presented with a set of arms, with a common goal of identifying a good arm. Each player receives a stream of queries, and upon each query it chooses an arm to pull. This stream is usually regulated by some load balancer ensuring the load is roughly divided evenly across players. To collaborate, the players may communicate with each other. We assume that the bandwidth of the underlying network is limited, so that players cannot simply share every piece of information. Also, communicating over the network might incur substantial latencies, so players should refrain from doing so as much as possible. When measuring communication of a certain multi-player protocol we consider the number of communication rounds it requires, where in a round of communication each player broadcasts a single message (of arbitrary size) to all other players. 
Round-based models are natural in distributed learning scenarios, where frameworks such as MapReduce [11] are ubiquitous.\n\nWhat is the tradeoff between the learning performance of the players, and the communication between them? At one extreme, if all players broadcast to each other each and every arm reward as it is observed, they can simply simulate the decisions of a serial, optimal algorithm. However, the communication load of this strategy is of course prohibitive. At the other extreme, if the players never communicate, each will suffer the learning curve of a single player, thereby forgoing any possible speed-up the distributed system may provide. Our goal in this work is to better understand this tradeoff between inter-player communication and learning performance.\n\nConsidering the high cost of communication, perhaps the simplest and most important question that arises is how well the players can learn while keeping communication to the very minimum. More specifically, is there a non-trivial strategy by which the players can identify a \u201cgood\u201d arm while communicating only once, at the end of the process? As we discuss later on, this is a non-trivial question. On the positive side, we present a k-player algorithm that attains an asymptotic parallel speed-up of factor \u221ak, as compared to the conventional, serial setting. In fact, our approach demonstrates how to convert virtually any serial exploration strategy to a distributed algorithm enjoying such speed-up. Ideally, one could hope for a factor k speed-up in learning performance; however, we show a lower bound on the required number of pulls in this case, implying that our \u221ak speed-up is essentially optimal.\n\nAt the other end of the trade-off, we investigate how much communication is necessary for obtaining the ideal factor k parallel speed-up. We present a k-player strategy achieving such speed-up, with communication only logarithmic in 1/\u03b5. 
As a corollary, we derive an algorithm that demonstrates an explicit trade-off between the number of arm pulls and the amount of inter-player communication.\n\n1.1 Related Work\n\nRecently there has been an increasing interest in distributed and collaborative learning problems. In the MAB literature, several recent works consider multi-player MAB scenarios in which players actually compete with each other, either on arm-pulls resources [15] or on the rewards received [19]. In contrast, we study a collaborative multi-player problem and investigate how sharing observations helps players achieve their common goal. The related work of Kanade et al. [17] in the context of non-stochastic (i.e. adversarial) experts also deals with a collaborative problem in a similar distributed setup, and examines the trade-off between communication and the cumulative regret.\n\nAnother line of recent work has focused on distributed stochastic optimization [13, 1, 12] and distributed PAC models [6, 10, 9], investigating the involved communication trade-offs. The techniques developed there, however, are inherently \u201cbatch\u201d learning methods and thus are not directly applicable to our MAB problem which is online in nature. Questions involving network topology [13, 12] and delays [1] are relevant to our setup as well; however, our present work focuses on establishing the first non-trivial guarantees in a distributed collaborative MAB setting.\n\n2 Problem Setup and Statement of Results\n\nIn our model of the Distributed Multi-Armed Bandit problem, there are k \u2265 1 individual players. The players are given n arms, enumerated by [n] := {1, 2, . . . , n}. Each arm i \u2208 [n] is associated with a reward, which is a [0, 1]-valued random variable with expectation pi. For convenience, we assume that the arms are ordered by their expected rewards, that is p1 \u2265 p2 \u2265 \u00b7\u00b7\u00b7 \u2265 pn. At every time step t = 1, 2, . . . , T, each player pulls one arm of his choice and observes an independent sample of its reward. Each player may choose any of the arms, regardless of the other players and their actions. At the end of the game, each player must commit to a single arm. In a communication round, which may take place at any predefined time step, each player may broadcast a message to all other players. While we do not restrict the size of each message, in a reasonable implementation a message should not be larger than $\\tilde{O}(n)$ bits.\n\nIn the best-arm identification version of the problem, the goal of a multi-player algorithm, given some target confidence level \u03b4 > 0, is that with probability at least 1 \u2212 \u03b4 all players correctly identify the best arm (i.e. the arm having the maximal expected reward). For simplicity, we assume in this setting that the best arm is unique. Similarly, in the (\u03b5, \u03b4)-PAC variant the goal is that each player finds an \u03b5-optimal (or \u201c\u03b5-best\u201d) arm, that is an arm i with pi \u2265 p1 \u2212 \u03b5, with high probability. In this paper we focus on the more general (\u03b5, \u03b4)-PAC setup, which also includes best-arm identification for \u03b5 = 0.\n\nWe use the notation \u2206i := p1 \u2212 pi to denote the suboptimality gap of arm i, and occasionally use $\\Delta_\\star := \\Delta_2$ for denoting the minimal gap. In the best-arm version of the problem, where we assume that the best arm is unique, we have \u2206i > 0 for all i > 1. When dealing with the (\u03b5, \u03b4)-PAC setup, we also consider the truncated gaps $\\Delta_i^\\epsilon := \\max\\{\\Delta_i, \\epsilon\\}$. In the context of MAB problems, we are interested in deriving distribution-dependent bounds, namely, bounds that are stated as a function of \u03b5, \u03b4 and also the distribution-specific values \u2206 := (\u22062, . . . , \u2206n). 
The $\\tilde{O}$ notation in our bounds hides polylogarithmic factors in n, k, \u03b5, \u03b4, and also in \u22062, . . . , \u2206n. In the case of serial exploration algorithms (i.e., when there is only one player), the lower bounds of Mannor and Tsitsiklis [20] and Audibert et al. [3] show that in general $\\tilde{\\Omega}(H_\\epsilon)$ pulls are necessary for identifying an \u03b5-best arm, where\n\n$$H_\\epsilon := \\sum_{i=2}^{n} \\frac{1}{(\\Delta_i^\\epsilon)^2} . \\tag{1}$$\n\nIntuitively, the hardness of the task is therefore captured by the quantity $H_\\epsilon$, which is roughly the number of arm pulls needed to find an \u03b5-best arm with a reasonable probability; see also [3] for a discussion. Our goal in this work is therefore to establish bounds in the distributed model that are expressed as a function of $H_\\epsilon$, in the same vein as the bounds known in the classic MAB setup.\n\n2.1 Baseline approaches\n\nWe now discuss several baseline approaches for the problem, starting with our main focus, the single-round setting. The first obvious approach, already mentioned earlier, is the no-communication strategy: just let each player explore the arms in isolation from the other players, following an independent instance of some serial strategy; at the end of the executions, all players hold an \u03b5-best arm. Clearly, this approach performs poorly in terms of learning performance, needing $\\tilde{\\Omega}(H_\\epsilon)$ pulls per player in the worst case and not leading to any parallel speed-up.\n\nAnother straightforward approach is to employ a majority vote among the players: let each player independently identify an arm, and choose the arm having most of the votes (alternatively, at least half of the votes). However, this approach does not lead to any improvement in performance: for this vote to work, each player has to solve the problem correctly with reasonable probability, which already requires $\\tilde{\\Omega}(H_\\epsilon)$ pulls from each. 
Even if we somehow split the arms between players and let each player explore a share of them, a majority vote would still fail since those players getting the \u201cgood\u201d arms might have to pull arms $\\tilde{\\Omega}(H_\\epsilon)$ times: a small MAB instance might be as hard as the full-sized problem (in terms of the complexity measure $H_\\epsilon$).\n\nWhen considering algorithms employing multiple communication rounds, we use an ideal simulated serial algorithm (i.e., a full-communication approach) as our baseline. This approach is of course prohibitively expensive in our context, but is able to achieve the optimal parallel speed-up, linear in the number of players k.\n\n2.2 Our results\n\nWe now discuss our approach and overview our algorithmic results. These are summarized in Table 1 below, which compares the different algorithms in terms of parallel speed-up and communication.\n\nOur approach for the one-round case is based on the idea of majority vote. For the best-arm identification task, our observation is that by letting each player explore a smaller set of n/\u221ak arms chosen at random and choose one of them as \u201cbest\u201d, about \u221ak of the players would come up with the global best arm. This (partial) consensus on a single arm is a key aspect in our approach, since it allows the players to identify the correct best arm among the votes of all k players, after sharing information only once. Our approach leads to a factor \u221ak parallel speed-up which, as we demonstrate in our lower bound, is the optimal factor in this setting. Although our goal here is pure exploration, in our algorithms each player follows an explore-exploit strategy. The idea is that a player should sample his recommended arm as much as his budget permits, even if it was easy to identify in his small-sized problem. 
This way we can guarantee that the top arms are sampled to a sufficient precision by the time each of the players has to choose a single best arm.\n\nThe algorithm for the (\u03b5, \u03b4)-PAC setup is similar, but its analysis is more challenging. As mentioned above, an agreement on a single arm is essential for a vote to work. Here, however, there might be several \u03b5-best arms, so arriving at a consensus on a single one is more difficult. Nonetheless, by examining two different regimes, namely when there are \u201cmany\u201d \u03b5-best arms and when there are \u201cfew\u201d of them, our analysis shows that a vote can still work and achieve the \u221ak multiplicative speed-up.\n\nIn the case of multiple communication rounds, we present a distributed elimination-based algorithm that discards arms right after each communication round. Between rounds, we share the work load between players uniformly. We show that the number of such rounds can be reduced to as low as O(log(1/\u03b5)), by eliminating all $2^{-r}$-suboptimal arms in the r\u2019th round. A similar idea was employed in [4] for improving the regret bound of UCB with respect to the parameters \u2206i. We also use this technique to develop an algorithm that performs only R communication rounds, for any given parameter R \u2265 1, that achieves a slightly worse multiplicative $\\epsilon^{2/R} k$ speed-up.\n\nSETTING | ALGORITHM | SPEED-UP | COMMUNICATION\nONE-ROUND | No-Communication | 1 | none\nONE-ROUND | Majority Vote | 1 | 1 round\nONE-ROUND | Algorithm 1,2 | \u221ak | 1 round\nMULTI-ROUND | Serial (simulated) | k | every time step\nMULTI-ROUND | Algorithm 3 | k | O(log(1/\u03b5)) rounds\nMULTI-ROUND | Algorithm 3\u2019 | $\\epsilon^{2/R} \\cdot k$ | R rounds\n\nTable 1: Summary of baseline approaches and our results. 
The speed-up results are asymptotic (logarithmic factors are omitted).\n\n3 One Communication Round\n\nThis section considers the most basic variant of the multi-player MAB problem, where each player is only allowed a single transmission, when finishing her queries. For the clarity of exposition, we first consider the best-arm identification setting in Section 3.1. Section 3.2 deals with the (\u03b5, \u03b4)-PAC setup. We demonstrate the tightness of our result in Section 3.3 with a lower bound for the required budget of arm pulls in this setting.\n\nOur algorithms in this section assume the availability of a serial algorithm $\\mathcal{A}(A, \\epsilon)$, that given a set of arms A and target accuracy \u03b5, identifies an \u03b5-best arm in A with probability at least 2/3 using no more than\n\n$$c_{\\mathcal{A}} \\sum_{i \\in A} \\frac{1}{(\\Delta_i^\\epsilon)^2} \\log \\frac{|A|}{\\Delta_i^\\epsilon} \\tag{2}$$\n\narm pulls, for some constant $c_{\\mathcal{A}} > 1$. For example, the Successive Elimination algorithm [14] and the Exp-Gap Elimination algorithm [18] provide a guarantee of this form. Essentially, any exploration strategy whose guarantee is expressed as a function of $H_\\epsilon$ can be used as the procedure $\\mathcal{A}$, with technical modifications in our analysis.\n\n3.1 Best-arm Identification Algorithm\n\nWe now describe our one-round best-arm identification algorithm. For simplicity, we present a version matching \u03b4 = 1/3, meaning that the algorithm produces the correct arm with probability at least 2/3; we later explain how to extend it to deal with arbitrary values of \u03b4.\n\nOur algorithm is akin to a majority vote among the multiple players, in which each player pulls arms in two stages. In the first EXPLORE stage, each player independently solves a \u201csmaller\u201d MAB instance on a random subset of the arms using the exploration strategy $\\mathcal{A}$. 
In the second EXPLOIT stage, each player exploits the arm identified as \u201cbest\u201d in the first stage, and communicates that arm and its observed average reward. See Algorithm 1 below for a precise description. An appealing feature of our algorithm is that it requires each player to transmit a single message of constant size (up to logarithmic factors).\n\nAlgorithm 1 ONE-ROUND BEST-ARM\ninput time horizon T\noutput an arm\n1: for player j = 1 to k do\n2:   choose a subset $A_j$ of 6n/\u221ak arms uniformly at random\n3:   EXPLORE: execute $i_j \\leftarrow \\mathcal{A}(A_j, 0)$ using at most T/2 pulls (and halting the algorithm early if necessary); if the algorithm fails to identify any arm or does not terminate gracefully, let $i_j$ be an arbitrary arm\n4:   EXPLOIT: pull arm $i_j$ for T/2 times and let $\\hat{q}_j$ be its average reward\n5:   communicate the numbers $i_j$, $\\hat{q}_j$\n6: end for\n7: let $k_i$ be the number of players j with $i_j = i$, and define $A = \\{i : k_i > \\sqrt{k}\\}$\n8: let $\\hat{p}_i = (1/k_i) \\sum_{\\{j : i_j = i\\}} \\hat{q}_j$ for all i\n9: return $\\arg\\max_{i \\in A} \\hat{p}_i$; if the set A is empty, output an arbitrary arm.\n\nIn Theorem 3.1 we prove that Algorithm 1 indeed achieves the promised upper bound.\n\nTheorem 3.1. Algorithm 1 identifies the best arm correctly with probability at least 2/3 using no more than\n\n$$O\\left( \\frac{1}{\\sqrt{k}} \\sum_{i=2}^{n} \\frac{1}{\\Delta_i^2} \\log \\frac{n}{\\Delta_i} \\right)$$\n\narm pulls per player, provided that 6 \u2264 \u221ak \u2264 n. The algorithm uses a single communication round, in which each player communicates $\\tilde{O}(1)$ bits.\n\nBy repeating the algorithm O(log(1/\u03b4)) times and taking the majority vote of the independent runs, we can amplify the success probability to 1 \u2212 \u03b4 for any given \u03b4 > 0. Note that we can still do that with one communication round (at the end of all executions), but each player now has to communicate O(log(1/\u03b4)) values.\u00b9\n\nTheorem 3.2. There exists a k-player algorithm that given $\\tilde{O}\\left( \\frac{1}{\\sqrt{k}} \\sum_{i=2}^{n} 1/\\Delta_i^2 \\right)$ arm pulls, identifies the best arm correctly with probability at least 1 \u2212 \u03b4. The algorithm uses a single communication round, in which each player communicates O(log(1/\u03b4)) numerical values.\n\n\u00b9 In fact, by letting each player pick a slightly larger subset of $O(\\sqrt{\\log(1/\\delta)} \\cdot n/\\sqrt{k})$ arms, we can amplify the success probability to 1 \u2212 \u03b4 without needing to communicate more than 2 values per player. However, this approach only works when k = \u2126(log(1/\u03b4)).\n\nWe now prove Theorem 3.1. We show that a budget T of samples (arm pulls) per player, where\n\n$$T \\ge \\frac{24 c_{\\mathcal{A}}}{\\sqrt{k}} \\sum_{i=2}^{n} \\frac{1}{\\Delta_i^2} \\ln \\frac{n}{\\Delta_i} , \\tag{3}$$\n\nsuffices for the players to jointly identify the best arm $i^\\star$ with the desired probability. Clearly, this would imply the bound stated in Theorem 3.1. We note that we did not try to optimize the constants in the above expression.\n\nWe begin by analyzing the EXPLORE phase of the algorithm. Our first lemma shows that each player chooses the global best arm and identifies it as the local best arm with sufficiently large probability.\n\nLemma 3.3. When (3) holds, each player identifies the (global) best arm correctly after the EXPLORE phase with probability at least 2/\u221ak.\n\nWe next address the EXPLOIT phase. The next simple lemma shows that the popular arms (i.e. those selected by many players) are estimated to a sufficient precision.\n\nLemma 3.4. Provided that (3) holds, we have $|\\hat{p}_i - p_i| \\le \\frac{1}{2} \\Delta_\\star$ for all arms i \u2208 A with probability at least 5/6.\n\nDue to lack of space, the proofs of the above lemmas are omitted and can be found in [16]. 
We can now prove Theorem 3.1.\n\nProof (of Theorem 3.1). Let us first show that with probability at least 5/6, the best arm $i^\\star$ is contained in the set A. To this end, notice that $k_{i^\\star}$ is the sum of k i.i.d. Bernoulli random variables $\\{I_j\\}_j$ where $I_j$ is the indicator of whether player j chooses arm $i^\\star$ after the EXPLORE phase. By Lemma 3.3 we have that $E[I_j] \\ge 2/\\sqrt{k}$ for all j, hence by Hoeffding\u2019s inequality, $\\Pr[k_{i^\\star} \\le \\sqrt{k}] \\le \\Pr[k_{i^\\star} - E[k_{i^\\star}] \\le -\\sqrt{k}] \\le \\exp(-2k/k) \\le 1/6$, which implies that $i^\\star \\in A$ with probability at least 5/6.\n\nNext, note that with probability at least 5/6 the arm i \u2208 A having the highest empirical reward $\\hat{p}_i$ is the one with the highest expected reward pi. Indeed, this follows directly from Lemma 3.4, which shows that with probability at least 5/6, for all arms i \u2208 A the estimate $\\hat{p}_i$ is within $\\frac{1}{2} \\Delta_\\star$ of the true bias pi. Hence, via a union bound we conclude that with probability at least 2/3, the best arm is in A and has the highest empirical reward. In other words, with probability at least 2/3 the algorithm outputs the best arm $i^\\star$.\n\n3.2 (\u03b5, \u03b4)-PAC Algorithm\n\nWe now present an algorithm whose purpose is to recover an \u03b5-optimal arm. Here, there might be more than one \u03b5-best arm, so each \u201csuccessful\u201d player might come up with a different \u03b5-best arm. Nevertheless, our analysis below shows that with high probability, a subset of the players can still agree on a single \u03b5-best arm, which makes it possible to identify it among the votes of all players. Our algorithm is described in Algorithm 2, and the following theorem states its guarantees.\n\nTheorem 3.5. 
Algorithm 2 identifies a 2\u03b5-best arm with probability at least 2/3 using no more than\n\n$$O\\left( \\frac{1}{\\sqrt{k}} \\sum_{i=2}^{n} \\frac{1}{(\\Delta_i^\\epsilon)^2} \\log \\frac{n}{\\Delta_i^\\epsilon} \\right)$$\n\narm pulls per player, provided that 24 \u2264 \u221ak \u2264 n. The algorithm uses a single communication round, in which each player communicates $\\tilde{O}(1)$ bits.\n\nBefore proving the theorem, we first state several key lemmas. In the following, let $n_\\epsilon$ and $n_{2\\epsilon}$ denote the number of \u03b5-best and 2\u03b5-best arms respectively. Our analysis considers two different regimes: $n_{2\\epsilon} \\le \\frac{1}{50} \\sqrt{k}$ and $n_{2\\epsilon} > \\frac{1}{50} \\sqrt{k}$, and shows that in any case,\n\n$$T \\ge \\frac{400 c_{\\mathcal{A}}}{\\sqrt{k}} \\sum_{i=2}^{n} \\frac{1}{(\\Delta_i^\\epsilon)^2} \\ln \\frac{24n}{\\Delta_i^\\epsilon} \\tag{4}$$\n\nsuffices for identifying a 2\u03b5-best arm with the desired probability. Clearly, this implies the bound stated in Theorem 3.5.\n\nThe first lemma shows that at least one of the players is able to find an \u03b5-best arm. As we later show, this is sufficient for the success of the algorithm in case there are many 2\u03b5-best arms.\n\nLemma 3.6. When (4) holds, at least one player successfully identifies an \u03b5-best arm in the EXPLORE phase, with probability at least 5/6.\n\nThe next lemma is more refined and states that in case there are few 2\u03b5-best arms, the probability of each player to successfully identify an \u03b5-best arm grows linearly with $n_\\epsilon$.\n\nLemma 3.7. Assume that $n_{2\\epsilon} \\le \\frac{1}{50} \\sqrt{k}$. When (4) holds, each player identifies an \u03b5-best arm in the EXPLORE phase, with probability at least $2 n_\\epsilon / \\sqrt{k}$.\n\nThe last lemma we need analyzes the accuracy of the estimated rewards of arms in the set A.\n\nLemma 3.8. 
With probability at least 5/6, we have $|\\hat{p}_i - p_i| \\le \\epsilon/2$ for all arms i \u2208 A.\n\nFor the proofs of the above lemmas, refer to [16]. We now turn to prove Theorem 3.5.\n\nAlgorithm 2 ONE-ROUND \u03b5-ARM\ninput time horizon T, accuracy \u03b5\noutput an arm\n1: for player j = 1 to k do\n2:   choose a subset $A_j$ of 12n/\u221ak arms uniformly at random\n3:   EXPLORE: execute $i_j \\leftarrow \\mathcal{A}(A_j, \\epsilon)$ using at most T/2 pulls (and halting the algorithm early if necessary); if the algorithm fails to identify any arm or does not terminate gracefully, let $i_j$ be an arbitrary arm\n4:   EXPLOIT: pull arm $i_j$ for T/2 times, and let $\\hat{q}_j$ be the average reward\n5:   communicate the numbers $i_j$, $\\hat{q}_j$\n6: end for\n7: let $k_i$ be the number of players j with $i_j = i$\n8: let $t_i = \\frac{1}{2} k_i T$ and $\\hat{p}_i = (1/k_i) \\sum_{\\{j : i_j = i\\}} \\hat{q}_j$ for all i\n9: define $A = \\{i \\in [n] : t_i \\ge (1/\\epsilon^2) \\ln(12n)\\}$\n10: return $\\arg\\max_{i \\in A} \\hat{p}_i$; if the set A is empty, output an arbitrary arm.\n\nProof. We shall prove that with probability 5/6 the set A contains at least one \u03b5-best arm. This would complete the proof, since Lemma 3.8 assures that with probability 5/6, the estimates $\\hat{p}_i$ of all arms i \u2208 A are at most \u03b5/2 away from the true reward pi, and in turn implies (via a union bound) that with probability 2/3 the arm i \u2208 A having the maximal empirical reward $\\hat{p}_i$ must be a 2\u03b5-best arm.\n\nFirst, consider the case $n_{2\\epsilon} > \\frac{1}{50} \\sqrt{k}$. Lemma 3.6 shows that with probability 5/6 there exists a player j that identifies an \u03b5-best arm $i_j$. Since for at least $n_{2\\epsilon}$ arms \u2206i \u2264 2\u03b5, we have\n\n$$t_{i_j} \\ge \\frac{1}{2} T \\ge \\frac{400}{2\\sqrt{k}} \\cdot \\frac{n_{2\\epsilon} - 1}{(2\\epsilon)^2} \\ln \\frac{24n}{2\\epsilon} \\ge \\frac{1}{\\epsilon^2} \\ln(12n) ,$$\n\nthat is, $i_j \\in A$.\n\nNext, consider the case $n_{2\\epsilon} \\le \\frac{1}{50} \\sqrt{k}$. Let N denote the number of players that identified some \u03b5-best arm. The random variable N is a sum of Bernoulli random variables $\\{I_j\\}_j$ where $I_j$ indicates whether player j identified some \u03b5-best arm. By Lemma 3.7, $E[I_j] \\ge 2 n_\\epsilon / \\sqrt{k}$ and thus by Hoeffding\u2019s inequality, $\\Pr[N < n_\\epsilon \\sqrt{k}] \\le \\Pr[N - E[N] \\le -n_\\epsilon \\sqrt{k}] \\le \\exp(-2 n_\\epsilon^2) \\le 1/6$. That is, with probability 5/6, at least $n_\\epsilon \\sqrt{k}$ players found an \u03b5-best arm. A pigeon-hole argument now shows that in this case there exists an \u03b5-best arm $i^\\star$ selected by at least \u221ak players. Hence, with probability 5/6 the number of samples of this arm collected in the EXPLOIT phase is at least $t_{i^\\star} \\ge \\sqrt{k} T / 2 > (1/\\epsilon^2) \\ln(12n)$, which means that $i^\\star \\in A$.\n\n3.3 Lower Bound\n\nThe following theorem suggests that in general, for identifying the best arm k players achieve a multiplicative speed-up of at most $\\tilde{O}(\\sqrt{k})$ when allowing one transmission per player (at the end of the game). Clearly, this also implies that a similar lower bound holds in the PAC setup, and proves that our algorithmic results for the one-round case are essentially tight.\n\nTheorem 3.9. For any k-player strategy that uses a single round of communication, there exist rewards p1, . . . 
, pn \u2208 [0, 1] and integer T such that\n\u2022 each individual player must use at least T/\u221ak arm pulls for them to collectively identify the best arm with probability at least 2/3;\n\u2022 there exists a single-player algorithm that needs at most $\\tilde{O}(T)$ pulls for identifying the best arm with probability at least 2/3.\n\nThe proof of the theorem is omitted due to space constraints and can be found in [16].\n\n4 Multiple Communication Rounds\n\nIn this section we establish an explicit tradeoff between the performance of a multi-player algorithm and the number of communication rounds it uses, in terms of the accuracy \u03b5. Our observation is that by allowing O(log(1/\u03b5)) rounds of communication, it is possible to achieve the optimal speedup of factor k. That is, we do not gain any improvement in learning performance by allowing more than O(log(1/\u03b5)) rounds.\n\nOur algorithm is given in Algorithm 3. The idea is to eliminate in each round r (i.e., right after the rth communication round) all $2^{-r}$-suboptimal arms. We accomplish this by letting each player sample uniformly all remaining arms and communicate the results to other players. Then, players are able to eliminate suboptimal arms with high confidence. If each such round is successful, after $\\log_2(1/\\epsilon)$ rounds only \u03b5-best arms survive. Theorem 4.1 below bounds the number of arm pulls used by this algorithm (a proof can be found in [16]).\n\nAlgorithm 3 MULTI-ROUND \u03b5-ARM\ninput (\u03b5, \u03b4)\noutput an arm\n1: initialize $S_0 \\leftarrow [n]$, $r \\leftarrow 0$, $t_0 \\leftarrow 0$\n2: repeat\n3:   set $r \\leftarrow r + 1$\n4:   let $\\epsilon_r \\leftarrow 2^{-r}$, $t_r \\leftarrow (2/(k \\epsilon_r^2)) \\ln(4 n r^2 / \\delta)$\n5:   for player j = 1 to k do\n6:     sample each arm $i \\in S_{r-1}$ for $t_r - t_{r-1}$ times\n7:     let $\\hat{p}^r_{j,i}$ be the average reward of arm i (in all rounds so far of player j)\n8:     communicate the numbers $\\hat{p}^r_{j,1}, \\ldots, \\hat{p}^r_{j,n}$\n9:   end for\n10:   let $\\hat{p}^r_i = (1/k) \\sum_{j=1}^{k} \\hat{p}^r_{j,i}$ for all $i \\in S_{r-1}$, and let $\\hat{p}^r_\\star = \\max_{i \\in S_{r-1}} \\hat{p}^r_i$\n11:   set $S_r \\leftarrow S_{r-1} \\setminus \\{i \\in S_{r-1} : \\hat{p}^r_i < \\hat{p}^r_\\star - \\epsilon_r\\}$\n12: until $\\epsilon_r \\le \\epsilon/2$ or $|S_r| = 1$\n13: return an arm from $S_r$\n\nTheorem 4.1. With probability at least 1 \u2212 \u03b4, Algorithm 3\n\u2022 identifies the optimal arm using\n\n$$O\\left( \\frac{1}{k} \\sum_{i=2}^{n} \\frac{1}{(\\Delta_i^\\epsilon)^2} \\log\\left( \\frac{n}{\\delta} \\log \\frac{1}{\\Delta_i^\\epsilon} \\right) \\right)$$\n\narm pulls per player;\n\u2022 terminates after at most $1 + \\lceil \\log_2(1/\\epsilon) \\rceil$ rounds of communication (or after $1 + \\lceil \\log_2(1/\\Delta_\\star) \\rceil$ rounds for \u03b5 = 0).\n\nBy properly tuning the elimination thresholds $\\epsilon_r$ of Algorithm 3 in accordance with the target accuracy \u03b5, we can establish an explicit trade-off between the number of communication rounds and the number of arm pulls each player needs. In particular, we can design a multi-player algorithm that terminates after at most R communication rounds, for any given parameter R > 0. This, however, comes at the cost of a compromise in learning performance as quantified in the following corollary.\n\nCorollary 4.2. Given a parameter R > 0, set $\\epsilon_r \\leftarrow \\epsilon^{r/R}$ for all r \u2265 1 in Algorithm 3. 
With probability at least 1 \u2212 \u03b4, the modified algorithm\n\u2022 identifies an \u03b5-best arm using \u02dcO((\u03b5^{\u22122/R}/k) \u00b7 \u2211_{i=2}^{n} (1/\u2206_i^\u03b5)^2) arm pulls per player;\n\u2022 terminates after at most R rounds of communication.\n\n5 Conclusions and Further Research\n\nWe have considered a collaborative MAB exploration problem, in which several independent players\nexplore a set of arms with a common goal, and obtained the first non-trivial results in such a\nsetting. Our main results apply to the particularly interesting regime where each of the players is\nallowed a single transmission; this setting fits naturally into common distributed frameworks such as\nMapReduce. An interesting open question in this context is whether one can obtain a strictly better\nspeed-up result (which, in particular, is independent of \u03b5) by allowing more than a single round.\nEven when allowing merely two communication rounds, it is unclear whether the \u221ak speed-up can\nbe improved. Intuitively, the difficulty here is that in the second phase of a reasonable strategy each\nplayer should focus on the arms that excelled in the first phase; this makes the sub-problems being\nfaced in the second phase as hard as the entire MAB instance, in terms of the quantity H\u03b5. Nevertheless,\nwe expect our one-round approach to serve as a building block in the design of future\ndistributed exploration algorithms that are applicable in more complex communication models.\nAn additional interesting problem for future research is how to translate our results to the regret\nminimization setting. In particular, it would be nice to see a conversion of algorithms like UCB [5]\nto a distributed setting. In this respect, perhaps a more natural distributed model is one resembling\nthat of Kanade et al. [17], who have established a regret vs. 
communication trade-off in the non-stochastic setting.\n\nReferences\n[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In NIPS, pages 873\u2013881, 2011.\n[2] D. Agarwal, B.-C. Chen, P. Elango, N. Motgi, S.-T. Park, R. Ramakrishnan, S. Roy, and J. Zachariah. Online models for content optimization. In NIPS, pages 17\u201324, December 2008.\n[3] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In COLT, pages 41\u201353, 2010.\n[4] P. Auer and R. Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55\u201365, 2010.\n[5] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235\u2013256, 2002.\n[6] M. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. arXiv preprint arXiv:1204.3514, 2012.\n[7] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory, pages 23\u201337. Springer, 2009.\n[8] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal multi-armed bandits. In NIPS, pages 273\u2013280, 2008.\n[9] H. Daum\u00e9 III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Efficient protocols for distributed classification and optimization. In ALT, 2012.\n[10] H. Daum\u00e9 III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Protocols for learning classifiers on distributed data. AISTATS, 2012.\n[11] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107\u2013113, Jan. 2008.\n[12] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165\u2013202, 2012.\n[13] J. Duchi, A. Agarwal, and M. J. 
Wainwright. Distributed dual averaging in networks. NIPS, 23, 2010.\n[14] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. The Journal of Machine Learning Research, 7:1079\u20131105, 2006.\n[15] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. NIPS, 2011.\n[16] E. Hillel, Z. Karnin, T. Koren, R. Lempel, and O. Somekh. Distributed exploration in multi-armed bandits. arXiv preprint arXiv:1311.0800, 2013.\n[17] V. Kanade, Z. Liu, and B. Radunovic. Distributed non-stochastic experts. In Advances in Neural Information Processing Systems 25, pages 260\u2013268, 2012.\n[18] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, 2013.\n[19] K. Liu and Q. Zhao. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing, 58(11):5667\u20135681, Nov. 2010.\n[20] S. Mannor and J. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. The Journal of Machine Learning Research, 5:623\u2013648, 2004.\n[21] O. Maron and A. W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In NIPS, 1994.\n[22] V. Mnih, C. Szepesv\u00e1ri, and J.-Y. Audibert. Empirical Bernstein stopping. In ICML, pages 672\u2013679. ACM, 2008.\n[23] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In CIKM, pages 43\u201352, October 2008.\n[24] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML, page 151, June 2009.\n", "award": [], "sourceid": 476, "authors": [{"given_name": "Eshcar", "family_name": "Hillel", "institution": "Yahoo! 
Labs"}, {"given_name": "Zohar", "family_name": "Karnin", "institution": "Yahoo! Labs"}, {"given_name": "Tomer", "family_name": "Koren", "institution": "Technion"}, {"given_name": "Ronny", "family_name": "Lempel", "institution": "Yahoo! Labs"}, {"given_name": "Oren", "family_name": "Somekh", "institution": "Yahoo! Labs"}]}