{"title": "Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising", "book": "Advances in Neural Information Processing Systems", "page_first": 2400, "page_last": 2408, "abstract": "In search advertising, the search engine needs to select the most profitable advertisements to display, which can be formulated as an instance of online learning with partial feedback, also known as the stochastic multi-armed bandit (MAB) problem. In this paper, we show that the naive application of MAB algorithms to search advertising for advertisement selection will produce sample selection bias that harms the search engine by decreasing expected revenue and \u201cestimation of the largest mean\u201d (ELM) bias that harms the advertisers by increasing game-theoretic player-regret. We then propose simple bias-correction methods with benefits to both the search engine and the advertisers.", "full_text": "Estimation Bias in Multi-Armed Bandit Algorithms\n\nfor Search Advertising\n\nMin Xu\n\nMachine Learning Department\nCarnegie Mellon University\n\nminx@cs.cmu.edu\n\nTao Qin\n\nMicrosoft Research Asia\n\ntaoqin@microsoft.com\n\nTie-Yan Liu\n\nMicrosoft Research Asia\n\ntie-yan.liu@microsoft.com\n\nAbstract\n\nIn search advertising, the search engine needs to select the most pro\ufb01table adver-\ntisements to display, which can be formulated as an instance of online learning\nwith partial feedback, also known as the stochastic multi-armed bandit (MAB)\nproblem. In this paper, we show that the naive application of MAB algorithms\nto search advertising for advertisement selection will produce sample selection\nbias that harms the search engine by decreasing expected revenue and \u201cestima-\ntion of the largest mean\u201d (ELM) bias that harms the advertisers by increasing\ngame-theoretic player-regret. 
We then propose simple bias-correction methods with benefits to both the search engine and the advertisers.\n\n1 Introduction\n\nSearch advertising, also known as sponsored search, has been formulated as a multi-armed bandit (MAB) problem [11], in which the search engine needs to choose one ad from a pool of candidates to maximize some objective (e.g., its revenue). To select the best ad from the pool, one needs to know the quality of each ad, which is usually measured by the probability that a random user will click on the ad. Stochastic MAB algorithms provide an attractive way to select high-quality ads, and the regret guarantee on MAB algorithms ensures that we do not display low-quality ads too many times.\nWhen applied to search advertising, a MAB algorithm needs not only to identify the best ad (suppose for simplicity that there is only one ad slot) but also to accurately learn the click probabilities of the top two ads, which the search engine uses to charge a fair fee to the winning advertiser according to the generalized second price auction mechanism [6]. If the probabilities are estimated poorly, the search engine may charge too low a payment and lose revenue, or it may charge too high a payment, which would encourage the advertisers to engage in strategic behavior. However, most existing MAB algorithms focus only on identifying the best arm; if they are naively applied to search advertising, there is no guarantee of an accurate estimate of the click probabilities of the top two ads.\nThus, search advertising, with its special model and goals, merits specialized algorithmic design and analysis when MAB algorithms are used. Our work is a step in this direction. We show in particular that naive ways of combining click probability estimation and MAB algorithms lead to sample selection bias that harms the search engine\u2019s revenue. 
We present a simple modification to MAB algorithms that eliminates such a bias and provably achieves almost the same revenue as if an oracle gave us the actual click probabilities. We also analyze the game theoretic notion of incentive compatibility (IC) and show that low regret MAB algorithms may have worse IC properties than high regret uniform exploration algorithms and that a trade-off may be required.\n\n2 Setting\n\nEach time a user visits a webpage, which we call an impression, the search engine runs a generalized second price (SP) auction [6] to determine which ads to show to the user and how much to charge advertisers if their ads are clicked. In this paper we suppose that there is only one ad slot, in which we can display one ad. The multiple slot setting is more realistic but also much more complicated to analyze; we leave the extension as future work. In the single slot case, the generalized SP auction becomes simply the well known second price auction, which we describe below.\nAssume there are n ads. Let b_k denote the bid of advertiser k (or of ad k), which is the maximum amount of money advertiser k is willing to pay for a click, and let \u03c1_k denote the click-through rate (CTR) of ad k, which is the probability that a random user will click on it. The SP auction ranks ads according to the products of the ad CTRs and bids. Assume that advertisers are numbered in decreasing order of b_i \u03c1_i: b_1 \u03c1_1 > b_2 \u03c1_2 > \u00b7\u00b7\u00b7 > b_n \u03c1_n. Then advertiser 1 wins the ad slot and needs to pay b_2 \u03c1_2 / \u03c1_1 for each click on his/her ad. This payment formula is chosen to satisfy the game theoretic notion of incentive compatibility (see Chapter 9 of [10] for a good introduction). 
Therefore, the per-impression expected revenue of the SP auction is b_2 \u03c1_2.\n\n2.1 A Two-Stage Framework\n\nSince the CTRs are unknown to both advertisers and the search engine, the search engine needs to estimate them through some learning process. We adopt the same two-stage framework as in [12, 2], which is composed of a CTR learning stage lasting for the first T impressions, followed by an SP auction stage lasting for the remaining T_end \u2212 T impressions.\n\n1. Advertisers 1, ..., n submit bids b_1, ..., b_n.\n2. CTR learning stage: For each impression t = 1, ..., T, display ad k_t \u2208 {1, ..., n} using MAB algorithm M.\n3. SP auction stage: Estimate \u03c1\u0302_i based on the click records from the previous stage. For t = T + 1, ..., T_end, we run the SP auction using the estimators \u03c1\u0302_i: display the ad that maximizes b_k \u03c1\u0302_k and charge b_(2) \u03c1\u0302_(2) / \u03c1\u0302_(1). Here we use (s) to indicate the ad with the s-th largest score b_i \u03c1\u0302_i.\n\nOne can see that in this framework, the estimators \u03c1\u0302_i are computed at the end of the first stage and kept unchanged in the second stage. Recent work [2] suggested that one could also run the MAB algorithm and keep updating the estimators until T_end. However, it is hard to compute a fair payment when we display ads using a MAB algorithm rather than the SP auction, and a randomized payment is proposed in [2]. Their scheme, though theoretically interesting, is impractical because it is difficult for advertisers to accept a randomized payment rule. We thus adhere to the above framework and do not update the \u03c1\u0302_i in the second stage.\n\nIt is important to note that in search advertising, we measure the quality of CTR estimators not by mean-squared error but by criteria important to advertising. 
One criterion is the per-impression expected revenue (defined below) in rounds T + 1, ..., T_end. Two types of estimation errors can harm the expected revenue: (1) the ranking may be incorrect, i.e. arg max_k b_k \u03c1\u0302_k \u2260 arg max_k b_k \u03c1_k, and (2) the estimators may be biased. Another criterion is incentive compatibility, which is a more complicated concept; we defer its definition and discussion to Section 4. We do not analyze the revenue and incentive compatibility properties of the first CTR learning stage because of its complexity and brief duration; we assume that T_end >> T.\n\nDefinition 2.1. Let (1) := arg max_k b_k \u03c1\u0302_k and (2) := arg max_{k\u2260(1)} b_k \u03c1\u0302_k. We define the per-impression empirical revenue as rev\u0302 := \u03c1_(1) b_(2) \u03c1\u0302_(2) / \u03c1\u0302_(1) and the per-impression expected revenue as E[rev\u0302], where the expectation is taken over the CTR estimators. We then define the per-impression expected revenue loss as b_2 \u03c1_2 \u2212 E[rev\u0302], where b_2 \u03c1_2 is the oracle revenue we obtain if we know the true click probabilities.\n\nChoice of Estimator. We will analyze the most straightforward estimator \u03c1\u0302_k = C_k / T_k, where T_k is the number of impressions allocated to ad k in the CTR learning stage and C_k is the number of clicks received by ad k in the CTR learning stage. This estimator is in fact biased and we will later propose simple improvements.\n\n2.2 Characterizing MAB Algorithms\n\nWe analyze two general classes of MAB algorithms: uniform and adaptive. Because there are many specific algorithms in each class, we give our formal definitions by characterizing T_k, the number of impressions assigned to each advertiser k by the end of the CTR learning stage.\nDefinition 2.2. 
We say that the learning algorithm M is uniform if, for some constant 0 < c < 1, for all k and all bid vectors b, with probability at least 1 \u2212 O(n/T): T_k \u2265 (c/n) T.\n\nWe next describe adaptive algorithms, which have low regret because they stop allocating impressions to ad k once it is certain that b_k \u03c1_k < max_{k\u2032} b_{k\u2032} \u03c1_{k\u2032}.\nDefinition 2.3. Let b be a bid vector. We say that a MAB algorithm is adaptive with respect to b if, with probability at least 1 \u2212 O(n/T), we have that: T_1 \u2265 c T_max and c\u2032 (b_k^2 / \u2206_k^2) ln T \u2265 T_k \u2265 min(c T_max, (4 b_k^2 / \u2206_k^2) ln T) for all k \u2260 1, where \u2206_k = b_1 \u03c1_1 \u2212 b_k \u03c1_k, c < 1 and c\u2032 are positive constants, and T_max = max_k T_k. For simplicity, we assume that c here is the same as c in Definition 2.2; we can take the minimum of the two if they are different.\n\nBoth uniform and adaptive algorithms have been used in search advertising auctions [5, 7, 12, 2, 8]. UCB (Upper Confidence Bound) is a simple example of an adaptive algorithm.\nExample 2.1. UCB Algorithm. The UCB algorithm, at round t, allocates the impression to the ad with the largest score, which is defined as s_{k,t} \u2261 b_k \u03c1\u0302_{k,t} + \u03b3 b_k \u221a((1/T_k(t)) log T), where T_k(t) is the number of impressions ad k has received before round t and \u03c1\u0302_{k,t} is the number of clicks divided by T_k(t) in the history log before round t. \u03b3 is a tuning parameter that trades off exploration and exploitation; the larger \u03b3 is, the more UCB resembles uniform algorithms. 
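As a rough illustration of the score in Example 2.1, the following Python sketch allocates each impression to the ad with the largest value of b_k * rho_hat_k + gamma * b_k * sqrt(log(T) / T_k(t)). The function name ucb_allocate, the initial pass that shows every ad once, the simulated Bernoulli clicks, the fixed seed and the example bids and CTRs are all assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def ucb_allocate(bids, ctrs, T, gamma=1.0, seed=0):
    # Simulate T impressions of the CTR learning stage (illustrative only).
    # bids[k] is b_k and ctrs[k] is the true click probability rho_k.
    rng = random.Random(seed)
    n = len(bids)
    shown = [0] * n    # T_k(t): impressions given to ad k so far
    clicks = [0] * n   # clicks received by ad k so far
    for t in range(T):
        if t < n:
            k = t  # assumption: show every ad once so that T_k(t) > 0
        else:
            def score(j):
                rho_hat = clicks[j] / shown[j]
                bonus = gamma * bids[j] * math.sqrt(math.log(T) / shown[j])
                return bids[j] * rho_hat + bonus
            k = max(range(n), key=score)
        shown[k] += 1
        if rng.random() < ctrs[k]:
            clicks[k] += 1
    return shown, clicks


# Hypothetical example: three ads with equal bids and different CTRs.
shown, clicks = ucb_allocate([1.0, 1.0, 1.0], [0.3, 0.1, 0.05], T=5000)
rho_hat = [c / s for c, s in zip(clicks, shown)]
```

Here shown[k] plays the role of T_k and rho_hat[k] = clicks[k] / shown[k] is the plug-in estimator of Section 2.1.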
Some versions of the UCB algorithm use log t instead of log T in the exploration term \u03b3 b_k \u221a((1/T_k(t)) log T); this difference is unimportant and we use the latter form to simplify the proof.\n\nUnder the UCB algorithm, it is well known that the T_k satisfy the upper bounds in Definition 2.3. That the T_k also satisfy the lower bounds is not obvious and has not been previously proved. Previous analyses of UCB, whose goal is to show low regret, do not need any lower bounds on the T_k; our analysis does require a lower bound because we need to control the accuracy of the estimator \u03c1\u0302_k. The following theorem is, to the best of our knowledge, a novel result.\nTheorem 2.1. Suppose we run the UCB algorithm with \u03b3 \u2265 4; then the T_k satisfy the bounds described in Definition 2.3.\n\nIn practice, the UCB algorithm satisfies the lower bounds even with a smaller \u03b3. We refer the readers to Theorem 5.1 and Theorem 5.2 of Section 5.1 of the appendix for the proof.\n\nAs described in Section 2.1, we form the estimators \u03c1\u0302_k by dividing the number of clicks by the number of impressions T_k. The estimator \u03c1\u0302_k is not an average of T_k i.i.d. Bernoulli random variables because the sample size T_k is correlated with \u03c1\u0302_k. This is known as sample selection bias.\nDefinition 2.4. We define the sample selection bias as E[\u03c1\u0302_k] \u2212 \u03c1_k.\nWe can still make the following concentration of measure statement about \u03c1\u0302_k, for which we give a standard proof in Section 5.1 of the appendix.\n\nLemma 2.1. 
For any MAB learning algorithm, with probability at least 1 \u2212 O(n/T), for all t = 1, ..., T and all k = 1, ..., n, the following confidence bound holds: \u03c1_k \u2212 \u221a((1/T_k(t)) log T) \u2264 \u03c1\u0302_{k,t} \u2264 \u03c1_k + \u221a((1/T_k(t)) log T).\n\n2.3 Related Work\n\nAs mentioned before, how to design incentive compatible payment rules when using MAB algorithms to select the best ads has been studied in [2] and [5]. However, their randomized payment scheme is very different from the current industry standard and is somewhat impractical. The idea of using MAB algorithms to simultaneously select ads and estimate click probabilities has been proposed in [11], [8] and [13]. However, they either do not analyze estimation quality or do not analyze it beyond a concentration of measure deviation bound. Our work in contrast shows that it is in fact the estimation bias that is important in the game theoretic setting. [9] studies the effect of CTR learning on incentive compatibility from the perspective of an advertiser with imperfect information.\nThis work is only the first step towards understanding the effect of estimation bias in MAB algorithms for search advertising auctions, and we focus on a relatively simplified setting with only a single ad slot and without budget constraints, which is already difficult to analyze. We leave the extensions to multiple ad slots and to budget constraints as future work.\n\n3 Revenue and Sample Selection Bias\n\nIn this section, we analyze the impact of a MAB algorithm on the search engine\u2019s revenue. We show that the direct plug-in of the estimators from a MAB algorithm (either uniform or adaptive) causes sample selection bias and damages the search engine\u2019s revenue; we then propose a simple debiasing method that ensures the revenue guarantee. Throughout the section, we fix a bid vector b. 
We\n\nde\ufb01ne the notations (1), (2) as (1) := arg maxk(cid:98)\u03c1kbk and (2) := arg maxk(cid:54)=(1)(cid:98)\u03c1kbk.\n(cid:98)\u03c1k > \u03c1k, then the UCB algorithm will select k more often and thus acquire more click data to\ngradually correct the overestimation. If (cid:98)\u03c1k < \u03c1k however, the UCB algorithm will select k less\n\nBefore we present our main result, we pause to give some intuition about sample selection bias.\nAssume b1\u03c11 \u2265 b2\u03c12... \u2265 bn\u03c1n and suppose we use the UCB algorithm in the learning stage. If\n\noften and the underestimation persists. Therefore, E[\u03c1k] < \u03c1k.\n\n3.1 Revenue Analysis\n\nThe following theorem is the main result of this section, which shows that the bias of the CTR\nestimators can critically affect the search engine\u2019s revenue.\nTheorem 3.1. Let T0 := 4n\n\u03c12\n1\n4 nb2\nc\u22062\n2\nIf T \u2265 T0, then, for either adaptive or uniform algorithms,\n\nlog T . Let c be the constant introduced in De\ufb01nition 2.3 and 2.2.\n\n:= 5c(cid:48)(cid:16)(cid:80)\n\nlog T , and T unif\nmin\n\nlog T , T adpt\nmin\n\nmax(b2\n\u22062\nk\n\n(cid:17)\n\nk(cid:54)=1\n\n1,b2\nk)\n\n:=\n\nmax\n\n\u03c11\n\nE[(cid:98)\u03c11]\n\n(cid:18)(cid:114) n\n\n(cid:1) \u2212 O\n\n(cid:18)(cid:0)b2\u03c12 \u2212 b2E[(cid:98)\u03c12]\n(cid:18)\nmin or if we use uniform algorithms and T \u2265 T unif\n(b2\u03c12 \u2212 b2E[(cid:98)\u03c12]\n\n(cid:19)\n(cid:17)(cid:19)\n\n(cid:17)(cid:19)\n\n(cid:16) n\n\n(cid:16) n\n\n) \u2212 O\n\n\u2212 O\n\nlog T\n\nT\n\nb2\u03c12 \u2212 E[(cid:99)rev] \u2264\n\n.\n\nT\n\nmin , then\n\n\u03c11\n\nE[(cid:98)\u03c11]\n\nT\n\nb2\u03c12 \u2212 E[(cid:99)rev] \u2264\n\nIf we use adaptive algorithms and T \u2265 T adpt\n\nWe leave the full proof to Section 5.2 of the appendix and provide a quick sketch here. 
In the \ufb01rst\ncase where T is smaller than thresholds T adpt\nmin , the probability of incorrect ranking, that is,\nincorrectly identifying the best ad, is high and we can only use concentration of measure bounds to\ncontrol the revenue loss. In the second case, we show that we can almost always identify the best ad\n\nmin or T unif\n\nand therefore, the(cid:112) n\n\nT log T error term disappears.\n\n4\n\n\fThe (b2\u03c12\u2212b2E[(cid:98)\u03c12] \u03c11E[(cid:98)\u03c11] ) term in the theorem is in general positive because of sample selection bias.\n(cid:17)\nWith bias, the best bound we can get on the expectation E[(cid:98)\u03c12] is that |E[(cid:98)\u03c12]\u2212\u03c12| \u2264 O\nfore, \u03c11E[(cid:98)\u03c11] is at most on the order of 1 +(cid:112) n\nCombining these derivations, we get that b2\u03c12 \u2212 E[(cid:99)rev] \u2264 O(\u22062) + O(cid:0) n\n\nT log T and b2\u03c12 \u2212 b2E[(cid:98)\u03c12] is on the order of O(\u2206).\n(cid:1). This bound suggests\n\nwhich is through the concentration inequality (Lemma 2.1).\nRemark 3.1. With adaptive learning, T1 is at least the order of O( n\n\nthat the revenue loss does not converge to 0 as T increases. Simulations in Section 5 show that\nour bound is in fact tight: the expected revenue loss for adaptive learning, in presence of sample\nselection bias, can be large and persistent.\n\nlog T \u2265 \u22062\nc(cid:48)b2\n\n(cid:16)(cid:113) 1\n\nT ) and 1\nT2\n\n. There-\n\n2\n\n2\n\nT\n\nlog T\n\n,\n\nT2\n\nFor many common uniform learning algorithms (uniformly random selection for instance) sample\nselection bias does not exist and so the expected revenue loss is smaller. This seems to suggest that,\nbecause of sample selection bias, adaptive algorithms are, from a revenue optimization perspective,\ninferior. The picture is switched however if we use a debiasing technique such as the one we propose\nin section 3.2. 
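The held-out log idea of Section 3.2 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the UCB-style score from Example 2.1 is used as the underlying adaptive algorithm, and the function names ucb_with_heldout and average_estimates, the simulated clicks, the seeds and the example bids and CTRs are hypothetical. Each allocation shows the chosen ad twice; one click result feeds the log that drives selection and the other goes to a log that the selection rule never reads.

```python
import math
import random


def ucb_with_heldout(bids, ctrs, rounds, gamma=1.0, seed=0):
    # rounds counts allocations; each allocation uses two impressions.
    rng = random.Random(seed)
    n = len(bids)
    shown = [0] * n          # allocations per ad (same count in both logs)
    first_clicks = [0] * n   # clicks in the first log (read by the score)
    held_clicks = [0] * n    # clicks in the held-out log (never read by it)
    log_r = math.log(max(rounds, 2))
    for t in range(rounds):
        if t < n:
            k = t  # assumption: one initial allocation per ad
        else:
            def score(j):
                rho_hat = first_clicks[j] / shown[j]
                return bids[j] * (rho_hat + gamma * math.sqrt(log_r / shown[j]))
            k = max(range(n), key=score)
        shown[k] += 1
        first_clicks[k] += rng.random() < ctrs[k]
        held_clicks[k] += rng.random() < ctrs[k]
    biased = [c / s for c, s in zip(first_clicks, shown)]
    heldout = [c / s for c, s in zip(held_clicks, shown)]
    return shown, biased, heldout


def average_estimates(trials=200):
    # Average both estimators of the weaker ad over independent runs.
    bids, ctrs = [1.0, 1.0], [0.3, 0.1]
    sum_biased = sum_heldout = 0.0
    for seed in range(trials):
        _, biased, heldout = ucb_with_heldout(bids, ctrs, rounds=500, seed=seed)
        sum_biased += biased[1]
        sum_heldout += heldout[1]
    return sum_biased / trials, sum_heldout / trials


mean_biased, mean_heldout = average_estimates()
```

Averaged over runs, one would expect mean_biased to fall below the true CTR of 0.1 while mean_heldout stays close to it, although the exact values depend on the seeds and on the chosen parameters.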
When sample selection bias is 0, adaptive algorithms yield better revenue because they are able to correctly identify the best advertisement in fewer rounds. We make this discussion concrete with the following results, in which we assume a post-learning unbiasedness condition.\n\nDefinition 3.1. We say that the post-learning unbiasedness condition holds if for all k, E[\u03c1\u0302_k] = \u03c1_k.\n\nThis condition does not hold in general, but we provide a simple debiasing procedure in Section 3.2 to ensure that it always does. The following corollary follows immediately from Theorem 3.1 with an application of Jensen\u2019s inequality.\nCorollary 3.1. Suppose the post-learning unbiasedness condition holds. Let T_0 \u2264 T^adpt_min \u2264 T^unif_min be defined as in Theorem 3.1.\n\nIf we use either adaptive or uniform algorithms and T \u2265 T_0, then b_2 \u03c1_2 \u2212 E[rev\u0302] \u2264 O(\u221a((n/T) log T)).\n\nIf we use an adaptive algorithm and T \u2265 T^adpt_min, or if we use a uniform algorithm and T \u2265 T^unif_min, then b_2 \u03c1_2 \u2212 E[rev\u0302] \u2264 O(n/T).\n\nThe revenue loss guarantee is much stronger with unbiasedness, which we confirm in our simulations in Section 5.\n\nCorollary 3.1 also shows that the revenue loss drops sharply from \u221a((n/T) log T) to n/T once T is larger than some threshold. Intuitively, this behavior exists because the probability of incorrect ranking becomes negligibly small when T is larger than the threshold. Because the adaptive learning threshold T^adpt_min is always smaller and often much smaller than the uniform learning threshold T^unif_min, Corollary 3.1 shows that adaptive learning can guarantee much lower revenue loss when T is between T^adpt_min and T^unif_min. 
It is in fact the same adaptiveness that leads to low regret that also leads to the strong revenue loss guarantees for adaptive learning algorithms.\n\n3.2 Sample Selection Debiasing\n\nGiven a MAB algorithm, one simple meta-algorithm to produce an unbiased estimator where the T_k still satisfy Definitions 2.3 and 2.2 is to maintain \u201cheld-out\u201d click history logs. Instead of keeping one history log for each advertisement, we keep two; if the original algorithm allocates one impression to advertiser k, we actually allocate two impressions at a time and record the click result of one of the impressions in the first history log and the click result of the other in the held-out history log.\n\nWhen the MAB algorithm requires the estimators \u03c1\u0302_k or click data to make an allocation, we allow it access only to the first history log. The estimator learned from the first history log is biased by the selection procedure, but the held-out history log, since it does not influence the ad selection, can be used to output an unbiased estimator of each advertisement\u2019s click probability at the end of the exploration stage. Although this scheme doubles the learning length, sample selection debiasing can significantly improve the guarantee on expected revenue, as shown in both theory and simulations.\n\n4 Advertisers\u2019 Utilities and ELM Bias\n\nIn this section, we analyze the impact of a MAB algorithm on advertisers\u2019 utilities. The key result of this section is that adaptive algorithms can exacerbate the \u201cestimation of the largest mean\u201d (ELM) bias, which arises because the expectation of the maximum is larger than the maximum of the expectations. This ELM bias damages advertisers\u2019 utilities because of overcharging.\nWe will assume that the reader is familiar with the concept of incentive compatibility and give only a brief review. 
We suppose that there exists a true value vi, which exactly measures how much a click\nis worth to advertiser i. The utility per impression of advertiser i in the auction is then \u03c1i(vi \u2212 pi)\nif the ad i is displayed where pi is the per-click payment charged by the search engine charges. An\nauction mechanism is called incentive compatible if the advertisers maximize their own utility by\ntruthfully bidding: bi = vi. For auctions that are close but not fully incentive compatible, we also\nde\ufb01ne player-regret as the utility lost by advertiser i in truthfully bidding vi rather than a bid that\noptimizes utility.\n\n4.1 Player-Regret Analysis\n\n(cid:16)\n\n(cid:17)\n\n\u03c1k\n\n(cid:98)\u03c1k(bk)\n\nvk\u03c1k \u2212 maxk(cid:48)(cid:54)=k bk(cid:48)(cid:98)\u03c1k(cid:48) (bk)\n\nWe de\ufb01ne v = (v1, ..., vn) to be the true per-click values of the advertisers. We will for\nsimplicity assume that the post-learning unbiasedness condition (De\ufb01nition 3.1) holds for all\nour results in this section. We introduce some formal de\ufb01nitions before we begin our anal-\nysis. For a \ufb01xed vector of competing bids b\u2212k, we de\ufb01ne the player utility as uk(bk) \u2261\nIbk(cid:98)\u03c1k(bk)\u2265bk(cid:48)(cid:98)\u03c1k(cid:48) (bk)\u2200k(cid:48)\n, where Ibk(cid:98)\u03c1k(bk)\u2265bk(cid:48)(cid:98)\u03c1k(cid:48) (bk)\u2200k(cid:48) is a 0/1 func-\ntion indicating whether the impression is allocated to ad k. We de\ufb01ne the player-regret, with respect\nE[uk(vk)]. It is important to note that we are hiding uk(bk)\u2019s and(cid:98)\u03c1k(bk)\u2019s dependency on the com-\nE[uk(bk)] \u2212\nto a bid vector b, as the player\u2019s optimal gain in utility through false bidding supb\n(cid:1). Suppose\nSuppose bk\u2217 \u03c1k\u2217 \u2212 v1\u03c11 \u2265 \u03c9((cid:112) n\n|v1\u03c11 \u2212 bk\u2217 \u03c1k\u2217| \u2264 O((cid:112) n\nTheorem 4.2. 
Suppose v1\u03c11 \u2212 bk\u2217 \u03c1k\u2217 \u2265 \u03c9(cid:0)(cid:112) n\n\npeting bids b\u2212k in our notation. Without loss of generality, we consider the utility of player 1. We\n\ufb01x b\u22121 and we de\ufb01ne k\u2217 \u2261 arg maxk(cid:54)=1 bk\u03c1k. We divide our analysis into cases, which cover the\ndifferent possible settings of v1 and competing bid b\u22121.\nTheorem 4.1. The following holds for both uniform and adaptive algorithms.\n\nTheorem 4.1 shows that when v1\u03c11 is not much larger than bk\u2217 \u03c1k\u2217, the player-regret is not too large.\nThe next Theorem shows that when v1\u03c11 is much larger than bk\u2217 \u03c1k\u2217 however, the player-regret can\nbe large.\n\nE[u1(b1)] \u2212 E[u1(v1)] \u2264 O(cid:0)(cid:112) n\nT log T(cid:1), then, for both uniform and adaptive\n(cid:16) n\n(cid:17)(cid:17)\n0, E[b(2)(v1)(cid:98)\u03c1(2)(v1)] \u2212 E[b(2)(b1)(cid:98)\u03c1(2)(b1)] + O\n\nE[u1(b1)] \u2212 E[u1(v1)] \u2264 O(cid:0) n\nT log T(cid:1).\n\nalgorithms:\n\u2200b1, E[u1(b1, b\u22121)]\u2212E[u1(v1, b\u22121)] \u2264 max\n\nT log T ), then, supb1\n\nT log T ) , then supb1\n\n(cid:16)\n\nT\n\nT\n\nsupb1\n\nWe give the proofs of both Theorem 4.1 and 4.2 in Section 5.3 of the appendix.\n\nRemark 4.1. In the special case of only two advertisers, it must be that (2) = 2 and therefore\n\nBoth expectations E[b(2)(v1)(cid:98)\u03c1(2)(v1)] and E[b(2)(b1)(cid:98)\u03c1(2)(b1)] can be larger than b2\u03c12 because the\nE[maxk(cid:54)=1 bk(cid:98)\u03c1k(v1)] \u2265 maxk(cid:54)=1 bkE[(cid:98)\u03c1k(v1)].\nE[b(2)(v1)(cid:98)\u03c1(2)(v1)] = b2\u03c12 and E[b(2)(v1)(cid:98)\u03c1(2)(v1)] = b2\u03c12. The player-regret is then very small:\n(cid:1).\nE[u1(b1, b2)] \u2212 E[u1(v1, b2)] \u2264 O(cid:0) n\ncause the bias E[b(2)(b1)(cid:98)\u03c1(2)(b1)] \u2212 b2\u03c12 increases when T2(b1), ..., Tn(b1) are low\u2013that is, it in-\ncreases when the variance of (cid:98)\u03c1k(b1)(cid:48)s are high. 
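The upward bias of a maximum over individually unbiased estimators, and its growth when each estimator uses fewer samples, can be checked with a small Python simulation. This is only a sketch under simplifying assumptions: all competing bids are set to 1, every competitor has the same hypothetical CTR of 0.1, each estimator uses a fixed (non-adaptive) number of samples, and the function name mean_of_max and the trial counts are illustrative.

```python
import random


def mean_of_max(scores, samples_per_ad, trials=500, seed=0):
    # scores[k] plays the role of b_k * rho_k with all bids equal to 1, so each
    # estimate is an unbiased Bernoulli sample mean with a fixed sample size.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = []
        for rho in scores:
            clicks = sum(rng.random() < rho for _ in range(samples_per_ad))
            estimates.append(clicks / samples_per_ad)
        total += max(estimates)
    return total / trials


rhos = [0.1, 0.1, 0.1, 0.1]  # hypothetical competitors with equal scores
bias_few = mean_of_max(rhos, samples_per_ad=20) - max(rhos)
bias_many = mean_of_max(rhos, samples_per_ad=500) - max(rhos)
```

In this setup bias_few comes out larger than bias_many and both are positive, which matches the intuition that the bias of the largest estimate grows with the variance of the individual estimators.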
An omniscient advertiser 1, with the belief that\nrithm to allocate more rounds to advertisers 2, .., n and reduce the variance of (cid:98)\u03c1k(b1)(cid:48)s. Such a\n\nv1\u03c11 >> b2\u03c12, can thus increase his/her utility by underbidding to manipulate the learning algo-\n\nThe incentive can be much larger when there are more than 2 advertisers. Intuitively, this is be-\n\nstrategy will give advertiser 1 negative utility in the learning CTR learning stage, but it will yield\npositive utility in the longer SP auction stage and thus give an overall increase to the player utility.\n\nT\n\n6\n\n\f(cid:19)\n\nlog T\n\nT\n\nsup\nb1\n\nIn the case of uniform learning, the advertiser\u2019s manipulation is limited because the learning algo-\nrithm is not signi\ufb01cantly affected by the bid.\nSuppose that v1\u03c11 \u2212 bk\u2217 \u03c1k\u2217 \u2265\nCorollary 4.1. Let the competing bid vector b\u22121 be \ufb01xed.\n\nT log T(cid:1). If uniform learning is used in the \ufb01rst stage, we have that\n(cid:18)(cid:114) n\n\n\u03c9(cid:0)(cid:112) n\nNevertheless, by contrasting this with(cid:112) n\nutility by bidding some b1 smaller than v1 but still large enough to ensure that b1(cid:98)\u03c11(b1) still be\n\nT bound we would get in the two\nadvertiser case, we see the negative impact of ELM bias on incentive compatibility. The negative\neffect is even more pronounced in the case of adaptive learning. Advertiser 1 can increase its own\n\nE[u1(b1, b\u22121)] \u2212 E[u1(v1, b\u22121)] \u2264 O\n\nranked the highest at the end of the learning stage. We explain this intuition with more details in the\nfollowing example, which we also simulate in Section 5.\nExample 4.1. Suppose we have n advertisers and b2\u03c12 = b3\u03c13 = ...bn\u03c1n. 
Suppose that v1\u03c11 >>\nb2\u03c12 and we will show that advertiser 1 has the incentive to underbid.\nLet \u2206k(b1) \u2261 b1\u03c11 \u2212 bk\u03c1k, then \u2206k(b1)\u2019s are the same for all k and \u2206k(v1) >> 0 by previous\nsupposition. Suppose advertiser 1 bids b1 < v1 but where \u2206k(b1) >> 0 still. We assume that\nfor all k = 2, ..., n, which must hold for large T by de\ufb01nition of adaptive\nTk(b1) = \u0398\nlearning.\nFrom Lemma 5.4 in the appendix, we know that\n\nT log T bound with the n\n\n(cid:16) log T\n\n\u2206k(b1)2\n\n(cid:17)\n\n(cid:115)\n\n(cid:115)\n\nE[b(2)(b1)(cid:98)\u03c1(2)(b1)] \u2212 b2\u03c12 \u2264\n\nlog(n \u2212 1)\n\n\u2264\n\nlog(n \u2212 1)\n\n(b1\u03c11 \u2212 bk\u03c1k)\n\n(4.1)\n\nThe Eqn. (4.1) is an upper bound but numerical experiments easily show that E[b(2)(b1)(cid:98)\u03c1(2)(b1)] is\nFrom Eqn. (4.1), we derive that, for any b1 such that b1\u03c11 \u2212 b2\u03c12 \u2265 \u03c9(cid:0)(cid:112) n\n\nin fact on the same order as the RHS of Eqn. (4.1).\n\nlog T\n\nTk\n\n(cid:32)(cid:115)\n\nT log T(cid:1):\n(cid:33)\n\nE[u1(b1, b\u22121)] \u2212 E[u1(v1, b\u22121)] \u2264 O\n\nlog(n \u2212 1)\n\nlog T\n\n(v1\u03c11 \u2212 b\u03c11)\n\nThus, we cannot guarantee that the mechanism is approximately truthful. The bound decreases with\nT at a very slow logarithmic rate because with adaptive algorithm, a longer learning period T might\n\nnot reduce the variances of many of the estimators(cid:98)\u03c1k\u2019s.\n\nWe would like to at this point brie\ufb02y compare our results with that of [9], which shows, under an\nimperfect information de\ufb01nition of utility, that advertisers have an incentive to overbid so that the\ntheir CTRs can be better learned by the search engine. 
Our results are not contradictory since we show that only the leading advertiser has an incentive to underbid.\n\n4.2 Bias Reduction in Estimation of the Largest Mean\n\nThe previous analysis shows that the incentive-incompatibility issue in the case of adaptive learning is caused by the fact that the estimator b_(2) \u03c1\u0302_(2) = max_{k\u22601} b_k \u03c1\u0302_k is upward biased. E[b_(2) \u03c1\u0302_(2)] is much larger than b_2 \u03c1_2 in general even if the individual estimators \u03c1\u0302_k are unbiased. We can abstract out the game theoretic setting and distill a problem known in the statistics literature as \u201cEstimation of the Largest Mean\u201d (ELM): given N probabilities {\u03c1_k}_{k=1,...,N}, find an estimator \u03c1\u0302_max such that E[\u03c1\u0302_max] = max_k \u03c1_k. Unfortunately, as proved by [4] and [3], an unbiased estimator of the largest mean does not exist for many common distributions, including the Gaussian, Binomial, and Beta; we thus survey some methods for reducing the bias.\n[3] studies techniques that explicitly estimate and then subtract the bias. Their method, though interesting, is specific to the case of selecting the larger mean among only two distributions. [1] proposes a different approach based on data-splitting. We randomly partition the data in the click-through history into two sets S, E and get two estimators \u03c1\u0302^S_k, \u03c1\u0302^E_k. We then use \u03c1\u0302^S_k for selection and output a weighted average \u03bb \u03c1\u0302^S_k + (1 \u2212 \u03bb) \u03c1\u0302^E_k. 
We cannot use only \u03c1\u0302^E_k for estimating the value because, without conditioning on a specific selection, it is downwardly biased. We unfortunately know of no principled way to choose \u03bb. We implement this scheme with \u03bb = 0.5 and show in simulation studies in Section 5 that it is effective.\n\n5 Simulations\n\nWe simulate our two-stage framework for various values of T. Figures 1a and 1b show the effect of sample selection debiasing (see Sections 3 and 3.2) on the expected revenue when one uses adaptive learning (the UCB algorithm of Example 2.1 in our experiment). One can see that selection bias harms the revenue but the debiasing method described in Section 3.2, even though it holds out half of the click data, significantly lowers the expected revenue loss, as theoretically shown in Corollary 3.1. We choose the tuning parameter \u03b3 = 1. Figure 1c shows that when there are a large number of poor quality ads, low regret adaptive algorithms indeed achieve better revenue in far fewer rounds of learning. Figure 1d shows the effect of estimation-of-the-largest-mean (ELM) bias on the utility gain of the advertiser. We simulate the setting of Example 4.1 and we see that without ELM debiasing, the advertiser can noticeably increase utility by underbidding. We implement the ELM debiasing technique described in Section 4.2; it does not completely address the problem since it does not completely remove the bias (such a task has been proven impossible), but it does ameliorate the problem: the increase in utility from underbidding is smaller.\n\n[Figure 1 plots omitted. Panels (a) and (b): expected revenue loss vs. rounds of exploration, with and without selection debiasing. Panel (c): expected revenue loss vs. rounds of exploration, uniform vs. adaptive with debiasing. Panel (d): player utility gain vs. bid price, with and without ELM debiasing.]\n\n(a) n = 2, \u03c11 = 0.09, \u03c12 = 0.1, b1 = 2, b2 = 1\n(b) n = 2, \u03c11 = .3, \u03c12 = 0.1, b1 = 0.7, b2 = 1\n(c) n = 42, \u03c11 = .2, \u03c12 = 0.15, b1 = 0.8, b2 = 1. All other bk = 1, \u03c1k = 0.01.\n(d) n = 5, \u03c1 = {0.15, 0.11, 0.1, 0.05, 01}, b_{\u22121} = {0.9, 1, 2, 1}\n\nFigure 1: Simulation studies demonstrating the effect of sample selection debiasing and ELM debiasing. The revenue loss in figures a to c is relative and is measured by 1 \u2212 revenue/(b_2 \u03c1_2); negative loss indicates revenue improvement over oracle SP. Figure d shows advertiser 1\u2019s utility gain as a function of possible bids. 
The vertical dotted black line denotes the advertiser\u2019s true value at v = 1. Utility gain is relative and defined as utility(b)/utility(v) \u2212 1; higher utility gain implies that advertiser 1 can benefit more from strategic bidding. The expected value is computed over 500 simulated trials.\n\nReferences\n[1] K. Alam. A two-sample estimate of the largest mean. Annals of the Institute of Statistical Mathematics, 19(1):271\u2013283, 1967.\n\n[2] M. Babaioff, R.D. Kleinberg, and A. Slivkins. Truthful mechanisms with implicit payment computation. arXiv preprint arXiv:1004.3630, 2010.\n\n[3] S. Blumenthal and A. Cohen. Estimation of the larger of two normal means. Journal of the American Statistical Association, pages 861\u2013876, 1968.\n\n[4] Bhaeiyal Ishwaei D, D. Shabma, and K. Krishnamoorthy. Non-existence of unbiased estimators of ordered parameters. Statistics: A Journal of Theoretical and Applied Statistics, 16(1):89\u201395, 1985.\n\n[5] N.R. Devanur and S.M. Kakade. The price of truthfulness for pay-per-click auctions. In Proceedings of the tenth ACM conference on Electronic commerce, pages 99\u2013106, 2009.\n\n[6] Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. Technical report, National Bureau of Economic Research, 2005.\n\n[7] N. Gatti, A. Lazaric, and F. Trov\u00f2. 
A truthful learning mechanism for contextual multi-slot\nsponsored search auctions with externalities. In Proceedings of the 13th ACM Conference on\nElectronic Commerce, pages 605\u2013622. ACM, 2012.\n\n[8] R. Gonen and E. Pavlov. An incentive-compatible multi-armed bandit mechanism. In Proceedings\nof the twenty-sixth annual ACM symposium on Principles of distributed computing,\npages 362\u2013363. ACM, 2007.\n\n[9] S.M. Li, M. Mahdian, and R. McAfee. Value of learning in sponsored search auctions. Internet\nand Network Economics, pages 294\u2013305, 2010.\n\n[10] Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V Vazirani. Algorithmic game theory.\nCambridge University Press, 2007.\n\n[11] Sandeep Pandey and Christopher Olston. Handling advertisements of unknown quality in\nsearch advertising. Advances in Neural Information Processing Systems, 19:1065, 2007.\n\n[12] A.D. Sarma, S. Gujar, and Y. Narahari. Multi-armed bandit mechanisms for multi-slot\nsponsored search auctions. arXiv preprint arXiv:1001.1414, 2010.\n\n[13] J. Wortman, Y. Vorobeychik, L. Li, and J. Langford. Maintaining equilibria during exploration\nin sponsored search auctions. Internet and Network Economics, pages 119\u2013130, 2007.\n", "award": [], "sourceid": 1137, "authors": [{"given_name": "Min", "family_name": "Xu", "institution": "CMU"}, {"given_name": "Tao", "family_name": "Qin", "institution": "Microsoft Research"}, {"given_name": "Tie-Yan", "family_name": "Liu", "institution": "Microsoft Research"}]}