{"title": "Threshold Bandits, With and Without Censored Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 4889, "page_last": 4897, "abstract": "We consider the \\emph{Threshold Bandit} setting, a variant of the classical multi-armed bandit problem in which the reward on each round depends on a piece of side information known as a \\emph{threshold value}. The learner selects one of $K$ actions (arms), this action generates a random sample from a fixed distribution, and the action then receives a unit payoff in the event that this sample exceeds the threshold value. We consider two versions of this problem, the \\emph{uncensored} and \\emph{censored} case, that determine whether the sample is always observed or only when the threshold is not met. Using new tools to understand the popular UCB algorithm, we show that the uncensored case is essentially no more difficult than the classical multi-armed bandit setting. Finally we show that the censored case exhibits more challenges, but we give guarantees in the event that the sequence of threshold values is generated optimistically.", "full_text": "Threshold Bandit, With and Without Censored Feedback\n\nJacob Abernethy\nDepartment of Computer Science\nUniversity of Michigan\nAnn Arbor, MI 48109\njabernet@umich.edu\n\nKareem Amin\nDepartment of Computer Science\nUniversity of Michigan\nAnn Arbor, MI 48109\namkareem@umich.edu\n\nRuihao Zhu\nAeroAstro&CSAIL\nMIT\nCambridge, MA 02139\nrzhu@mit.edu\n\nAbstract\n\nWe consider the Threshold Bandit setting, a variant of the classical multi-armed bandit problem in which the reward on each round depends on a piece of side information known as a threshold value. The learner selects one of $K$ actions (arms), this action generates a random sample from a fixed distribution, and the action then receives a unit payoff in the event that this sample exceeds the threshold value.
We consider two versions of this problem, the uncensored and censored case, that determine whether the sample is always observed or only when the threshold is not met. Using new tools to understand the popular UCB algorithm, we show that the uncensored case is essentially no more difficult than the classical multi-armed bandit setting. Finally we show that the censored case exhibits more challenges, but we give guarantees in the event that the sequence of threshold values is generated optimistically.\n\n1 Introduction\n\nThe classical Multi-armed Bandit (MAB) problem provides a framework to reason about sequential decision settings, but specifically where the learner\u2019s chosen decision is intimately tied to information content received as feedback. MAB problems have generated much interest in the Machine Learning research literature in recent years, particularly as a result of the changing nature in which learning and estimation algorithms are employed in practice. More and more we encounter scenarios in which the procedure used to make and exploit algorithmic predictions is exactly the same procedure used to capture new data to improve prediction performance. In other words it is increasingly harder to view training and testing as distinct entities.\n\nMAB problems generally involve repeatedly making a choice between one of a finite (or even infinite) set of actions, and these actions have historically been referred to as arms of the bandit. If we \u201cpull\u201d arm $i$ at round $t$, then we receive a reward $R_i^t \\in [0,1]$ which is frequently assumed to be a stochastic quantity that is drawn according to distribution $D_i$.
Typically we assume that $D_i$ are heterogeneous across the arms $i$, whereas we assume the samples $\\{R_i^t\\}_{t=1,\\ldots,T}$ are independently and identically distributed according to the fixed $D_i$ across all times $t$.\u00b9 Of course, were the learner to have full knowledge of the distributions $D_i$ from the outset, she would presumably choose to pull the arm whose expected reward $\\mu_i$ is highest. With that in mind, we tend to consider the (expected) regret of the learner, defined to be the (expected) reward of the best arm minus the (expected) reward of the actual arms selected by the learner.\n\nEarly work on MAB problems (Robbins, 1952; Lai and Robbins, 1985; Gittins et al., 2011) tended to be more focused on asymptotic guarantees, whereas more recent work (Auer et al., 2002; Auer, 2003) has been directed towards a more \u201cfinite time\u201d approach in which we can bound regret for fixed time horizons $T$. One of the best-known and well-studied techniques is known as the Upper Confidence Bound (UCB) algorithm (Auer et al., 2002; Auer and Ortner, 2010). The magic of UCB relies on a very intuitive policy framework, that a learner should select decisions by maximizing over rewards estimated from previous data but only after biasing each estimate according to its uncertainty. Simply put, one should choose the arm that maximizes the \u201cmean plus confidence interval,\u201d hence the name Upper Confidence Bound.\n\n\u00b9 Note that in much of our notation we use superscript $t$ to denote the time period rather than as an exponent.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn the present paper we focus on the Threshold Bandit setting, described as follows.
On each round $t$, a piece of side information is given to the learner in the form of a real number $c^t$, the learner must then choose arm $i$ out of $K$ arms, and this arm produces a value $X_i^t$ drawn from a survival distribution with survival function $F_i(x) = \\Pr(X_i^t \\geq x)$. The reward to the learner is not $X_i^t$ itself but is instead the binary value $R_i^t = \\mathbb{I}[X_i^t \\geq c^t]$; that is, we receive a unit reward when the sample $X_i^t$ exceeds the threshold value $c^t$, and otherwise we receive no reward. For a fixed value of $c^t$, each arm $i$ has expected payoff $\\mathbb{E}[R_i^t] = F_i(c^t)$. Notice, crucially, that the arm with the greatest expected payoff can vary significantly across different threshold values.\n\nThis abstract model has a number of very natural applications:\n\n1. Packet Delivery with Deadlines: FedEx receives a stream of packages that need to be shipped from source to destination, and each package is supplied with a delivery deadline. The goal of the FedEx routing system is to select a transportation route (via air or road or ship, etc.) that has the highest probability of on-time arrival. Of course some transportation schemes are often faster (e.g. air travel) but have higher volatility (e.g. due to poor weather).\n\n2. Supplier Selection: Customers approach a manufacturing firm to produce a product with specific quality demands. The firm must approach one of several suppliers to contract out the work, but the firm is uncertain as to the capabilities and variabilities of the products each supplier produces.\n\n3. Dark Pool Brokerage: A financial brokerage firm is asked to buy or sell various sized bundles of shares, and the brokerage aims to offload the transactions onto one of many dark pools, i.e. financial exchanges that match buyers and sellers in a confidential manner (Ganchev et al., 2010; Amin et al., 2012; Agarwal et al., 2010).
A standard dark pool mechanism will simply execute the transaction if there is suitable liquidity, or will reject the transaction when no match is made. Of course the brokerage gets paid on commission, and simply wants to choose the pool that has the highest probability of completion.\n\nWhat distinguishes the Threshold Bandit problem from the standard stochastic multi-armed bandit setting are two main features:\n\n1. The regret of the learner will be measured in comparison to the best policy rather than to simply the best arm. Note that the optimal offline policy may incorporate the threshold value $c^t$ before selecting an arm $I^t$.\n\n2. Whereas the standard stochastic bandit setting assumes that we observe the reward $R^t_{I^t}$ of the chosen arm $I^t$, in the Threshold Bandit setting we consider two types of feedback.\n(a) Uncensored Feedback: After playing arm $I^t$, the learner observes the sample $X^t_{I^t}$ regardless of the threshold value $c^t$. This is a natural model for the FedEx routing problem above, wherein one learns the travel time of a package regardless of the deadline having been met.\n(b) Censored Feedback: After playing $I^t$, the learner observes a null value when $X^t_{I^t} \\geq c^t$, and otherwise observes $X^t_{I^t}$. This is a natural model for the Supplier Selection problem above, as we would only learn the product\u2019s quality value when the customer rejects what is received from the supplier.\n\nIn the present paper we present roughly three primary results. First, we provide a new perspective on the classical UCB algorithm, giving an alternative proof that relies on an interesting potential function argument; we believe this technique may be of independent interest. Second, we analyze the Threshold Bandit setting when given uncensored feedback, and we give a novel algorithm called DKWUCB based on the Dvoretzky-Kiefer-Wolfowitz inequality (Dvoretzky et al., 1956).
We show, somewhat surprisingly, that with uncensored feedback the regret bound is no worse than in the standard stochastic MAB setting, suggesting that despite the much richer policy class one has nearly the same learning challenge. Finally, we consider learning in the censored feedback setting, and propose an algorithm KMUCB, akin to the Kaplan-Meier estimator (Kaplan and Meier, 1958). Learning with censored feedback is indeed more difficult, as the performance can depend significantly on the order of the threshold values. In the worst case, when threshold values are chosen in an adversarial order, the cost of learning scales with the number of unique threshold values, whereas one can perform significantly better under certain optimistic assumptions on the order, or even under a random order.\n\n2 A New Perspective on UCB\n\nBefore focusing on the Threshold Bandit problem, let us turn our attention to the classical stochastic MAB setting and give another look at the UCB algorithm. We will now provide a modified proof of the regret bound of UCB that relies on a potential function. Potential arguments have proved quite popular in studying adversarial bandit problems (Auer et al., 2003; Audibert and Bubeck, 2009; Abernethy et al., 2012; Neu and Bart\u00f3k, 2013; Abernethy et al., 2015), but have received less use in the stochastic setting. This potential trick is the basis for forthcoming results on the Threshold Bandit.\n\nLet $D_i$ be a distribution on the reward $R_i^t$, with support on $[0,1]$. We imagine the rewards $R_i^1, \\ldots, R_i^T \\overset{\\text{i.i.d.}}{\\sim} D_i$, whose mean is $\\mathbb{E}[R_i^t] = \\mu_i$. A bandit algorithm is simply a procedure that chooses a random arm/action $I^t$ on round $t$ as a function of the set of past observed (action, reward) pairs, $(I^1, R^1_{I^1}), \\ldots, (I^{t-1}, R^{t-1}_{I^{t-1}})$. Finally, let $N_i^t := \\sum_{\\tau=1}^{t-1} \\mathbb{I}[I^\\tau = i]$ and define the empirical mean estimator at time $t$ to be\n\n$$\\hat{\\mu}_i^t := \\frac{\\sum_{\\tau=1}^{t-1} \\mathbb{I}[I^\\tau = i]\\, R^\\tau_{I^\\tau}}{N_i^t}.$$\n\nWe assume we are given a particular deviation bound which provides the following guarantee,\n\n$$\\Pr\\left(|\\mu_i - \\hat{\\mu}_i^t| > \\epsilon \\,\\middle|\\, N_i^t \\geq N\\right) \\leq f(N, \\epsilon),$$\n\nwhere $f(\\cdot,\\cdot)$ is some function, continuous in $\\epsilon$ and monotonically decreasing in both parameters, that controls the probability of a large deviation. While UCB relies specifically on the Hoeffding-Azuma inequality (Cesa-Bianchi and Lugosi, 2006), for now we leave the deviation bound in generic form. This will be useful in following sections.\n\nGiven $f(\\cdot,\\cdot)$, what is of interest to our present work is a pair of functions that allow us to convert between values of $\\epsilon$ and $N$ in order to guarantee that $f(N,\\epsilon) \\leq \\delta$ for a given $\\delta > 0$. To this end define\n\n$$\\sharp(\\epsilon,\\delta) := \\min\\{N : f(N, \\epsilon/2) \\leq \\delta\\},$$\n\n$$\\varsigma(N,\\delta) := \\begin{cases} \\inf\\{\\epsilon : f(N,\\epsilon) \\leq \\delta\\} & \\text{if } N > 0; \\\\ 1 & \\text{otherwise.} \\end{cases}$$\n\nWe will often omit the $\\delta$ in the argument to $\\sharp(\\cdot), \\varsigma(\\cdot)$. Note the key property that $\\varsigma(N,\\delta) \\leq \\epsilon/2$ for any $N \\geq \\sharp(\\epsilon,\\delta)$.\n\nWe can now define our variant of the UCB algorithm for a fixed choice of $\\delta > 0$.\n\nUCB Algorithm: on round $t$ play\n\n$$I^t = \\arg\\max_i \\left( \\hat{\\mu}_i^t + \\varsigma(N_i^t, \\delta) \\right) \\tag{1}$$\n\nWe will make the simplifying assumption that the largest $\\mu_i$ is unique and, without loss of generality, let us assume that the coordinates are permuted in order that $\\mu_1$ is the largest mean reward. Furthermore, define $\\Delta_i := \\mu_1 - \\mu_i$ for $i = 2, \\ldots, K$.\n\nA central piece of the analysis relies on the following potential function, which depends on the current number of plays of each arm $i = 2, \\ldots, K$:\n\n$$\\Phi(N_2^t, \\ldots, N_K^t) := 2 \\sum_{i=2}^{K} \\sum_{N=0}^{N_i^t - 1} \\varsigma(N, \\delta) \\tag{2}$$\n\nLemma 1. The expected regret of UCB is bounded as\n\n$$\\mathbb{E}[\\mathrm{Regret}_T(\\mathrm{UCB})] \\leq \\mathbb{E}[\\Phi(N_2^{T+1}, \\ldots, N_K^{T+1})] + O(T\\delta).$$\n\nProof. The (random) additional regret suffered on round $t$ of UCB is exactly $\\mu_1 - \\mu_{I^t}$. By virtue of our given deviation bound, we know that\n\n$$\\mu_1 \\leq \\hat{\\mu}_1^t + \\varsigma(N_1^t, \\delta) \\quad \\text{and} \\quad \\hat{\\mu}_{I^t}^t \\leq \\mu_{I^t} + \\varsigma(N_{I^t}^t, \\delta), \\quad \\text{each w.p.} > 1 - \\delta. \\tag{3}$$\n\nAlso, let $\\xi^t$ be the indicator variable that one of the above two inequalities fails to hold. Of course we chose $\\varsigma(\\cdot)$ in order that $\\mathbb{E}[\\xi^t] \\leq 2\\delta$ via a simple union bound.\n\nNote that, by virtue of using the UCB selection rule for $I^t$, it is clear that we have\n\n$$\\hat{\\mu}_1^t + \\varsigma(N_1^t, \\delta) \\leq \\hat{\\mu}_{I^t}^t + \\varsigma(N_{I^t}^t, \\delta). \\tag{4}$$\n\nIf we combine Equations 3 and 4, and consider the event that $\\xi^t = 0$, then we obtain\n\n$$\\mu_1 \\leq \\hat{\\mu}_1^t + \\varsigma(N_1^t, \\delta) \\leq \\hat{\\mu}_{I^t}^t + \\varsigma(N_{I^t}^t, \\delta) \\leq \\mu_{I^t} + 2\\varsigma(N_{I^t}^t, \\delta).$$\n\nEven in the event that $\\xi^t = 1$ we have that $\\mu_1 - \\mu_{I^t} \\leq 1$. Hence, it follows immediately that $\\mu_1 - \\mu_{I^t} \\leq 2\\varsigma(N_{I^t}^t, \\delta) + \\xi^t$.\n\nFinally, we observe that the potential function was chosen so that $\\Phi(N_2^{t+1}, \\ldots, N_K^{t+1}) - \\Phi(N_2^t, \\ldots, N_K^t) = 2\\varsigma(N_{I^t}^t, \\delta)$ whenever $I^t \\neq 1$ (a pull of arm 1 incurs no regret). Recalling that $\\Phi(0, \\ldots, 0) = 0$, a simple telescoping argument gives that\n\n$$\\mathbb{E}[\\mathrm{Regret}_T(\\mathrm{UCB})] \\leq \\mathbb{E}\\left[\\Phi(N_2^{T+1}, \\ldots, N_K^{T+1}) + \\sum_{t=1}^{T} \\xi^t\\right] \\leq \\mathbb{E}[\\Phi(N_2^{T+1}, \\ldots, N_K^{T+1})] + 2T\\delta.$$\n\nThe final piece we need to establish is that the number of pulls $N_i^t$ of arm $i$, for $i = 2, \\ldots, K$, is unlikely to exceed $\\sharp(\\Delta_i, \\delta)$. This result uses some more standard techniques from the original UCB analysis (Auer et al., 2002), and we defer it to the appendix.\n\nLemma 2. For any $T > 0$ we have $\\mathbb{E}[\\Phi(N_2^{T+1}, \\ldots, N_K^{T+1})] \\leq \\Phi(\\sharp(\\Delta_2, \\delta), \\ldots, \\sharp(\\Delta_K, \\delta)) + O(T^2\\delta)$.\n\nWe are now able to combine the above results for the final bound.\n\nTheorem 1. If we set $\\delta = T^{-2}/2$, the expected regret of UCB is bounded as\n\n$$\\mathbb{E}[\\mathrm{Regret}_T(\\mathrm{UCB})] \\leq 8 \\sum_{i=2}^{K} \\frac{\\log(T)}{\\Delta_i} + O(1).$$\n\nProof. Note that a very standard deviation bound that holds for all distributions supported on $[0,1]$ is the Hoeffding-Azuma inequality (Cesa-Bianchi and Lugosi, 2006), where the bound is given by $f(N,\\epsilon) = 2\\exp(-2N\\epsilon^2)$. Utilizing Hoeffding-Azuma we have $\\sharp(\\epsilon,\\delta) = \\left\\lceil \\frac{2\\log(2/\\delta)}{\\epsilon^2} \\right\\rceil$ and $\\varsigma(N,\\delta) = \\sqrt{\\frac{\\log(2/\\delta)}{2N}}$ for $N > 0$. If we utilize the fact that $\\sum_{y=1}^{Y} \\frac{1}{\\sqrt{y}} \\leq 2\\sqrt{Y}$, then we see that\n\n$$\\Phi(\\sharp(\\Delta_2, \\delta), \\ldots, \\sharp(\\Delta_K, \\delta)) = 2 \\sum_{i=2}^{K} \\sum_{N=0}^{\\sharp(\\Delta_i, \\delta) - 1} \\varsigma(N, \\delta) = 2 \\sum_{i=2}^{K} 2\\sqrt{\\frac{\\log(2/\\delta)\\, \\sharp(\\Delta_i, \\delta)}{2}} = 4 \\sum_{i=2}^{K} \\frac{\\log(2/\\delta)}{\\Delta_i}.$$\n\nCombining Lemma 1 and Lemma 2 and setting $\\delta = T^{-2}/2$, we conclude the theorem.\n\n3 The Threshold Bandits Model\n\nIn the preceding, we described a potential-based proof for the UCB algorithm in the classic stochastic bandit problem. We now return to the Threshold Bandit setting, our problem of interest.\n\nA $K$-armed Threshold Bandit problem is defined by random variables $X_i^t$ and a sequence of threshold values $c^t$ for $1 \\leq i \\leq K$ and $1 \\leq t \\leq T$, where $i$ is the index for arms. Successive pulling of arm $i$ generates the values $X_i^1, X_i^2, \\ldots, X_i^T$, which are drawn i.i.d. from an unknown distribution. The threshold values $c^1, c^2, \\ldots, c^T$ are drawn from $M = \\{1, 2, \\ldots, m\\}$ (according to rules specified later). The threshold value $c^t$ is observed at the beginning of round $t$, and the learner follows a policy $\\Pi$ to choose the arm to play based on its past selections and previously observed feedbacks. Suppose the arm pulled at round $t$ is $I^t$; the observed reward is then $R^t_{I^t} = \\mathbb{I}[X^t_{I^t} \\geq c^t]$; that is, we receive a unit reward when the sample $X^t_{I^t}$ exceeds the threshold value $c^t$, and otherwise we receive no reward. We distinguish two different types of feedback.\n\n1. Uncensored Feedback: After playing arm $I^t$, the learner observes the sample $X^t_{I^t}$ regardless of the threshold value $c^t$.\n\n2. Censored Feedback: After playing $I^t$, the learner observes\u00b2\n\n$$\\begin{cases} \\emptyset & \\text{if } X^t_{I^t} \\geq c^t, \\\\ X^t_{I^t} & \\text{otherwise.} \\end{cases}$$\n\nIn this case, we refer to the threshold value as a censor value.\n\nLet $F_i(x)$ denote the survival function of the distribution on arm $i$. That is, $F_i(x) = \\Pr(X_i^t \\geq x)$. We measure regret against the optimal policy with full knowledge of $F_1, \\ldots, F_K$, i.e.,\n\n$$\\mathrm{Regret}_T(\\Pi) = \\mathbb{E}\\left[\\sum_{t=1}^{T} \\left(\\max_{i \\in [K]} R_i^t - R^t_{I^t}\\right)\\right] = \\mathbb{E}\\left[\\sum_{t=1}^{T} \\left(\\max_{i \\in [K]} \\mathbb{I}[X_i^t \\geq c^t] - \\mathbb{I}[X^t_{I^t} \\geq c^t]\\right)\\right].$$\n\nNotice that for a fixed value of $c^t$, each arm $i$ has expected payoff $\\mathbb{E}[R_i^t] = F_i(c^t)$, so the regret can also be written as\n\n$$\\mathrm{Regret}_T(\\Pi) = \\mathbb{E}\\left[\\sum_{t=1}^{T} \\left(\\max_{i \\in [K]} F_i(c^t) - F_{I^t}(c^t)\\right)\\right].$$\n\nOur goal is to design a policy that minimizes the regret.\n\n4 DKWUCB: Dvoretzky-Kiefer-Wolfowitz Inequality based Upper Confidence Bound algorithm\n\nIn this section, we study the uncensored feedback setting in which the value $X^t_{I^t}$ is always observed regardless of $c^t$. We assume that the largest $F_i(j)$ is unique for all $j \\in M$, and define $i^*(j) = \\arg\\max_i F_i(j)$, $\\Delta_i(j) = F_{i^*(j)}(j) - F_i(j)$ for all $i = 1, 2, \\ldots, K$ and $j \\in M$.\n\nUnder this setting, the algorithm will use the empirical distribution as an estimate for the true distribution. Specifically, we want to estimate the true survival function $F_i$ via:\n\n$$\\hat{F}_i^t(j) = \\frac{\\sum_{\\tau=1}^{t-1} \\mathbb{I}[X^\\tau_{I^\\tau} \\geq j,\\, I^\\tau = i]}{N_i^t} \\quad \\forall j \\in M \\tag{5}$$\n\nThe key tool in our analysis is a deviation bound on the empirical CDF of a distribution, and we note that this bound holds uniformly over the support of the distribution. The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (Dvoretzky et al., 1956) allows us to bound the error on $\\hat{F}_i^t(j)$:\n\nLemma 3.
At a time $t$, let $\\hat{F}_i^t$ be the empirical distribution function of $F_i$ as given in Equation 5. The probability that the maximum of the difference between $\\hat{F}_i^t$ and $F_i$ over all $j \\in M$ is at least $\\epsilon$ is less than $2\\exp(-2\\epsilon^2 N_i^t)$, i.e.,\n\n$$\\Pr\\left(\\sup_{j \\in M} |\\hat{F}_i^t(j) - F_i(j)| \\geq \\epsilon \\,\\middle|\\, N_i^t \\geq N\\right) \\leq 2\\exp(-2\\epsilon^2 N).$$\n\nThe proof of the lemma can be found in Dvoretzky et al. (1956). The key insight is that the estimate $\\hat{F}_i$ converges to $F_i$ point-wise at the same rate as the Hoeffding-Azuma inequality. That is, one does not pay an additional $|M|$ factor from applying a union bound. The fact that we have uniform convergence of the CDF with the same rate as the Hoeffding-Azuma inequality allows us to immediately apply the potential function argument from Section 2. In particular, we define $f(N,\\epsilon) = 2\\exp(-2\\epsilon^2 N)$, as well as the pair of functions $\\sharp(\\epsilon,\\delta)$ and $\\varsigma(N,\\delta)$ exactly the same as the previous section, i.e.,\n\n$$\\sharp(\\epsilon,\\delta) := \\left\\lceil \\frac{2\\log(2/\\delta)}{\\epsilon^2} \\right\\rceil,$$\n\n$$\\varsigma(N,\\delta) := \\begin{cases} \\sqrt{\\frac{\\log(2/\\delta)}{2N}} & \\text{if } N > 0; \\\\ 1 & \\text{otherwise.} \\end{cases}$$\n\nWe are now ready to define our DKWUCB algorithm for a fixed choice of parameter $\\delta > 0$ to solve the problem.\n\nDKWUCB Algorithm: on round $t$ play\n\n$$I^t \\leftarrow \\arg\\max_i \\left( \\hat{F}_i^t(c^t) + \\varsigma(N_i^t, \\delta) \\right). \\tag{6}$$\n\n\u00b2 Existing literature often refers to this as right-censoring. With right-censored feedback, samples from playing arms at high threshold values can inform decisions at low threshold values but not vice versa.\n\nTo analyze DKWUCB, we use a slight variant of the potential function defined in Section 2. Let $i^*(j) = \\arg\\max_i F_i(j)$ denote the optimal arm for threshold value $j$, and $\\tilde{N}_i^t$ denote the number of rounds arm $i$ is pulled when it is not optimal, $\\tilde{N}_i^t = \\sum_{\\tau=1}^{t-1} \\mathbb{I}[I^\\tau = i,\\, I^\\tau \\neq i^*(c^\\tau)]$. Notice that $\\tilde{N}_i^t \\leq N_i^t$. Define the potential function as:\n\n$$\\Phi(\\tilde{N}_1^t, \\ldots, \\tilde{N}_K^t) := 2 \\sum_{i=1}^{K} \\sum_{N=0}^{\\tilde{N}_i^t - 1} \\varsigma(N, \\delta) \\tag{7}$$\n\nTheorem 2. Setting $\\delta = T^{-2}/2$, the expected regret of DKWUCB is bounded as\n\n$$\\mathbb{E}[\\mathrm{Regret}_T(\\mathrm{DKWUCB})] \\leq 8 \\sum_{i=1}^{K} \\frac{\\log T}{\\min_{j \\in M} \\Delta_i(j)} + O(1).$$\n\nWe defer the proof of this theorem to the appendix.\n\nWe pause now to comment on some of the strengths of this type of analysis. At a high level, the typical analysis of the UCB algorithm for the standard multi-armed bandit problem (Auer et al., 2002) is the following: (1) at some finite time $T$, the number of pulls of a bad arm $i$ is $O\\left(\\frac{\\log(T)}{\\Delta_i^2}\\right)$ with high probability, and (2) the regret suffered by any such pull is $O(\\Delta_i)$. The contribution of arm $i$ to total regret is therefore $O\\left(\\frac{\\log(T)}{\\Delta_i}\\right)$. In contrast, we analyzed the UCB algorithm in Section 2 by observing that the expected regret suffered on round $t$ is bounded by the difference between the empirical mean estimator and the true mean for the payoff of arm $I^t$. Of course by design this quantity is almost certainly (w.p. at least $1 - \\delta$) less than $\\varsigma(N^t_{I^t})$. The potential function $\\Phi(\\cdot, \\ldots, \\cdot)$ tracks the accumulation of these values $\\varsigma(N_i^t)$ for each arm $i$, and the final regret bound is a consequence of the summation properties of $\\varsigma, \\sharp$ for the particular estimator being used.\n\nWhile these two approaches lead to the same bound in the standard multi-armed bandit problem, the potential function approach bears fruit in the Threshold Bandit setting. Because the uniform convergence rate promised by the DKW inequality matches that of the Hoeffding-Azuma inequality, Theorem 2 should not be surprising; the $i$th arm\u2019s contribution to DKWUCB\u2019s regret should be identical to UCB, but with the suboptimality gap now equal to $\\min_j \\Delta_i(j)$.\n\nHowever, following the program for the standard analysis of UCB, one would naively argue that arm $i$ is incorrectly pulled $O\\left(\\frac{\\log(T)}{(\\min_{j \\in M} \\Delta_i(j))^2}\\right)$ times. These pulls might come in the face of any number of threshold values $c^t$, suffering as much as $\\max_{j \\in M} \\Delta_i(j)$ regret, yielding a bound of $O\\left(\\frac{\\max_{j \\in M} \\Delta_i(j) \\log(T)}{(\\min_{j \\in M} \\Delta_i(j))^2}\\right)$ on the $i$th arm\u2019s regret contribution, which is a factor $O\\left(\\frac{\\max_j \\Delta_i(j)}{\\min_j \\Delta_i(j)}\\right)$ worse than the derived result. By tracking the convergence of the underlying estimator, we circumvent this problem entirely.\n\n5 KMUCB: Kaplan-Meier based Upper Confidence Bound Algorithm\n\nWe now turn to the censored feedback setting, in which the feedback of pulling arm $I^t$ is observed only when $X^t_{I^t}$ is less than $c^t$. For ease of presentation, we assume that the largest $F_i(j)$ is unique for all $j \\in M$, and define $i^*(j) = \\arg\\max_i F_i(j)$, $\\Delta_i(j) = F_{i^*(j)}(j) - F_i(j)$ for all $i = 1, 2, \\ldots, K$ and $j \\in M$.\n\nOne prevalent non-parametric estimator for censored data is the Kaplan-Meier maximum likelihood estimator (Kaplan and Meier, 1958; Peterson, 1983). Most existing works have studied the uniform error bound of the Kaplan-Meier estimator in the case that the threshold values are drawn i.i.d. from a known distribution (Foldes and Rejto, 1981) or the asymptotic error bound for the non-i.i.d. case (Huh et al., 2009). The only known uniform error bound of the Kaplan-Meier estimator is proposed in Ganchev et al. (2010).\n\nNoting that for a given threshold value, all the feedbacks from larger threshold values are useful, we propose a new estimator with a tighter uniform error bound based on the Kaplan-Meier estimator as follows:\n\n$$\\hat{F}_i^t(j) = \\frac{D_i^t(j)}{N_i^t(j)} \\tag{8}$$\n\nwhere $D_i^t(j)$ and $N_i^t(j)$ are defined as follows:\n\n$$A^\\tau := \\min\\{X^\\tau_{I^\\tau}, c^\\tau\\}, \\quad D_i^t(j) := \\sum_{\\tau=1}^{t-1} \\mathbb{I}[A^\\tau \\geq j,\\, I^\\tau = i], \\quad N_i^t(j) := \\sum_{\\tau=1}^{t-1} \\mathbb{I}[c^\\tau \\geq j,\\, I^\\tau = i].$$\n\nWe first present an error bound for the modified Kaplan-Meier estimate of $F_i(j)$:\n\nLemma 4.
At time $t$, let $\\hat{F}_i^t$ be the modified Kaplan-Meier estimate of $F_i$ as given in Equation 8. For any $j \\in M$, the probability that the difference between $\\hat{F}_i^t(j)$ and $F_i(j)$ is at least $\\epsilon$ is less than $2\\exp\\left(-\\frac{\\epsilon^2 N_i^t(j)}{2}\\right)$, i.e.,\n\n$$\\Pr\\left(|\\hat{F}_i^t(j) - F_i(j)| \\geq \\epsilon\\right) \\leq 2\\exp\\left(-\\frac{\\epsilon^2 N_i^t(j)}{2}\\right).$$\n\nWe defer the proof of this lemma to the appendix.\n\nUnlike in the stochastic uncensored MAB setting, we show that the cost of learning with censored feedback depends significantly on the order of the threshold values. To illustrate this point, we first compare the regret of the adversarial setting and the optimistic setting. In the adversarial setting, the threshold values are chosen to arrive in a non-decreasing order $1, 1, \\ldots, 1, 2, \\ldots, 2, 3, \\ldots, m$; the problem becomes playing $m$ independent copies of bandits, and the regret scales with $m$. In the optimistic setting, the threshold values are chosen to arrive in a non-increasing order $m, m, \\ldots, m, m-1, \\ldots, m-1, \\ldots, 1, \\ldots, 1$, which means the learner can make full use of the samples, and can thus perform significantly better. Afterwards, we show that if the order of the threshold values is close to uniformly random, the regret only scales with $\\log m$.\n\n5.1 Adversarial vs. Optimistic Setting\n\nFor simplicity of presentation, we assume that in both settings, the time horizon can be divided into $m$ stages, each with length $\\lfloor T/m \\rfloor$. In the adversarial setting, threshold value $j$ comes during stage $j$; while in the optimistic setting threshold value $m - j + 1$ comes during stage $j$.\n\nFor the adversarial setting, due to the censored feedback structure, only the samples observed within the same stage can help to inform decision making. From the perspective of the learner, this is equivalent to facing $m$ independent copies of stochastic MAB problems, and thus, the regret scales with $m$.
Making use of the lower bound of stochastic MAB problems (Lai and Robbins, 1985), we can conclude the following theorem.\n\nTheorem 3. If the threshold values arrive according to the adversarial order specified above, no learning algorithm can achieve a regret bound better than $\\sum_{j=1}^{m} \\sum_{i=1}^{K} \\frac{\\log(T/m)}{\\mathrm{KL}(B(F_i(j)) \\| B(F_{i^*(j)}(j)))}$, where $\\mathrm{KL}(\\cdot \\| \\cdot)$ is the Kullback-Leibler divergence (Lai and Robbins, 1985) and $B(\\cdot)$ is the probability distribution function of the Bernoulli distribution.\n\nFor the optimistic setting, although the feedbacks are right censored, we note that every sample observed in the previous rounds is useful in later rounds. This is because the threshold values arrive in non-increasing order. Therefore, we can reduce the optimistic setting to the Threshold Bandit problem with uncensored feedback, and use the DKWUCB proposed in Section 4 to solve it. More specifically, we can set\n\n$$f(N,\\epsilon) := 2\\exp(-\\epsilon^2 N/2),$$\n\n$$\\sharp(\\epsilon,\\delta) := \\left\\lceil \\frac{8\\log(2/\\delta)}{\\epsilon^2} \\right\\rceil,$$\n\n$$\\varsigma(N,\\delta) := \\begin{cases} \\sqrt{\\frac{2\\log(2/\\delta)}{N}} & \\text{if } N \\geq 1; \\\\ 1 & \\text{otherwise,} \\end{cases}$$\n\nand on every round, the learner plays the same strategy as DKWUCB. We call this strategy OPTIM. Following the same procedure as in Section 4, we can provide a regret bound for OPTIM.\n\nTheorem 4. Let $\\delta = T^{-2}/2$ and assume $T \\geq mK$. The regret of the optimistic setting satisfies\n\n$$\\mathbb{E}[\\mathrm{Regret}_T(\\mathrm{OPTIM})] \\leq 32 \\sum_{i=1}^{K} \\frac{\\log T}{\\min_{j \\in M} \\Delta_i(j)} + O(1).$$\n\n5.2 Cyclic Permutation Setting\n\nIn this subsection, we show that if the order of threshold values is close to uniformly random, we can perform significantly better than in the adversarial setting. To be precise, we assume that the threshold values arrive in a cyclic permutation order of $1, 2, \\ldots, m$; that is, $M = \\{c^{km}, c^{km+1}, \\ldots, c^{(k+1)m-1}\\}$ for any non-negative integer $k \\leq T/m$.\n\nWe are now ready to present KMUCB, which is a modified Kaplan-Meier-based UCB algorithm. KMUCB divides the time horizon into epochs of length $Km$ and, for each epoch, pulls each arm once for each threshold value. KMUCB then performs an \u201carm elimination\u201d process, and once all but one arm has been eliminated, it proceeds to pull the single remaining arm for the given threshold value. KMUCB\u2019s estimation procedure leverages information across threshold values, where observations from higher thresholds are utilized to estimate mean payoffs for lower thresholds; information does not flow in the other direction, however, as a result of the censoring assumption. Specifically, for a given threshold index $j$, KMUCB tracks the arm elimination process as follows: for any threshold values below $j$, KMUCB believes that we have determined the best arm, and plays that arm constantly. For threshold values greater than or equal to $j$, KMUCB explores all arms uniformly. Note that by uniform exploration over all arms for threshold value $j$, all sub-optimal arms can be detected with probability at least $1 - O\\left(\\frac{1}{T}\\right)$ after $O\\left(\\frac{\\log T}{(m-j+1) \\min_{i \\in [K]} \\Delta_i^2(j)}\\right)$ epochs. KMUCB then removes all the sub-optimal arms for threshold value $j$, and increments $j$ by 1. Denoting the last time unit of epoch $k$ as $t_k = kKm$, the detailed description of KMUCB is shown in Algorithm 1.\n\nAlgorithm 1 KMUCB\nInput: A set of arms $1, 2, \\ldots, K$.\nInitialization: $L_j \\leftarrow [K]$ $\\forall j \\in M$, $k \\leftarrow 1$, $j \\leftarrow 1$\nfor epoch $k = 1, 2, \\ldots, T/Km$ do\n  $\\mathrm{count}[j'] \\leftarrow 0$ $\\forall j' \\in M$\n  for $t$ from $(t_{k-1} + 1)$ to $t_k$ do\n    Observe $c^t = j'$ and set $\\mathrm{count}[j'] \\leftarrow \\mathrm{count}[j'] + 1$\n    if $j' < j$ then\n      $I^t \\leftarrow$ index of the single arm remaining in $L_{j'}$\n    else\n      $I^t \\leftarrow \\mathrm{count}[j']$\n    end if\n  end for\n  if $j \\leq m$ and $\\max_{i' \\in [K]} \\hat{F}_{i'}^{t_k}(j) - \\hat{F}_i^{t_k}(j) \\geq \\sqrt{\\frac{16\\log(Tk)}{(m-j+1)k}}$ $\\forall i \\in L_j \\setminus \\{\\arg\\max_{i' \\in [K]} \\hat{F}_{i'}^{t_k}(j)\\}$ then\n    $L_j \\leftarrow \\{\\arg\\max_{i' \\in [K]} \\hat{F}_{i'}^{t_k}(j)\\}$, $j \\leftarrow j + 1$\n  end if\nend for\n\nTheorem 5. The expected regret of KMUCB is bounded as\n\n$$\\mathbb{E}[\\mathrm{Regret}_T(\\mathrm{KMUCB})] \\leq \\log m \\sum_{i=1}^{K} \\frac{128 \\max_{j \\in M} \\Delta_i(j) \\log T}{\\min_{i \\in [K], j \\in M} \\Delta_i^2(j)} + O(1).$$\n\nWe defer the proof of this theorem to the appendix.\n\nWe note two directions of future research. First, we believe the above bound can likely be made stronger by either improving upon the minimization in the denominator or the maximization in the numerator. Second, we believe the \u201ccyclic permutation\u201d assumption can be weakened to a \u201cuniformly random sequence of thresholds,\u201d but we were unable to make progress in this direction. We welcome further investigation along these lines.\n\nReferences\n\nJacob Abernethy, Elad Hazan, and Alexander Rakhlin. 2012. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory 58, 7 (2012), 4164\u20134175.\n\nJacob D Abernethy, Chansoo Lee, and Ambuj Tewari. 2015. Fighting Bandits with a New Kind of Smoothness. In Advances in Neural Information Processing Systems. 2188\u20132196.\n\nAlekh Agarwal, Peter L Bartlett, and Max Dama. 2010. Optimal Allocation Strategies for the Dark Pool Problem. In International Conference on Artificial Intelligence and Statistics. 9\u201316.\n\nKareem Amin, Michael Kearns, Peter Key, and Anton Schwaighofer. 2012. Budget optimization for sponsored search: Censored learning in MDPs.
arXiv preprint arXiv:1210.4847 (2012).\n\nJean-Yves Audibert and S\u00e9bastien Bubeck. 2009. Minimax policies for adversarial and stochastic bandits. In COLT. 217\u2013226.\n\nPeter Auer. 2003. Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research 3 (2003), 397\u2013422.\n\nPeter Auer, Nicol\u00f2 Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235\u2013256.\n\nPeter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2003. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing 32, 1 (2003), 48\u201377.\n\nPeter Auer and Ronald Ortner. 2010. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica 61, 1-2 (2010), 55\u201365.\n\nNicol\u00f2 Cesa-Bianchi and G\u00e1bor Lugosi. 2006. Prediction, Learning, and Games. Cambridge University Press.\n\nA. Dvoretzky, J. Kiefer, and J. Wolfowitz. 1956. Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator. In Annals of Mathematical Statistics.\n\nA. Foldes and L. Rejto. 1981. Strong uniform consistency for nonparametric survival curve estimators from randomly censored data. In The Annals of Statistics. 9(1):122\u2013129.\n\nKuzman Ganchev, Michael Kearns, Yuriy Nevmyvaka, and Jennifer Wortman Vaughan. 2010. Censored Exploration and the Dark Pool Problem. In UAI.\n\nJohn Gittins, Kevin Glazebrook, and Richard Weber. 2011. Multi-armed bandit allocation indices. John Wiley & Sons.\n\nW. T. Huh, R. Levi, P. Rusmevichientong, and J. Orlin. 2009. Adaptive data-driven inventory control policies based on Kaplan-Meier estimator. In http://legacy.orie.cornell.edu/ paatrus/ psfiles/km-myopic.pdf.\n\nE. L. Kaplan and P. Meier. 1958. Nonparametric Estimation from Incomplete Observations. In JASA.\n\nT. L. Lai and Herbert Robbins.
1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1985), 4\u201322.\n\nGergely Neu and G\u00e1bor Bart\u00f3k. 2013. An efficient algorithm for learning with semi-bandit feedback. In Algorithmic Learning Theory. Springer, 234\u2013248.\n\nA. V. Peterson. 1983. Kaplan-Meier estimator. In Encyclopedia of Statistical Sciences.\n\nHerbert Robbins. 1952. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58, 5 (1952), 527\u2013535.", "award": [], "sourceid": 2474, "authors": [{"given_name": "Jacob", "family_name": "Abernethy", "institution": "University of Michigan"}, {"given_name": "Kareem", "family_name": "Amin", "institution": "University of Michigan"}, {"given_name": "Ruihao", "family_name": "Zhu", "institution": "MIT"}]}