{"title": "Online Pricing with Strategic and Patient Buyers", "book": "Advances in Neural Information Processing Systems", "page_first": 3864, "page_last": 3872, "abstract": "We consider a seller with an unlimited supply of a single good, who is faced with a stream of $T$ buyers. Each buyer has a window of time in which she would like to purchase, and would buy at the lowest price in that window, provided that this price is lower than her private value (and otherwise, would not buy at all). In this setting, we give an algorithm that attains $O(T^{2/3})$ regret over any sequence of $T$ buyers with respect to the best fixed price in hindsight, and prove that no algorithm can perform better in the worst case.", "full_text": "Online Pricing with Strategic and Patient Buyers\n\nMichal Feldman\n\nTel-Aviv University and MSR Herzliya\n\nmichal.feldman@cs.tau.ac.il\n\nTomer Koren*\nGoogle Brain\n\ntkoren@google.com\n\nRoi Livni*\n\nPrinceton University\n\nrlivni@cs.princeton.edu\n\nAviv Zohar*\n\nHebrew University of Jerusalem\n\navivz@cs.huji.ac.il\n\nYishay Mansour*\nTel-Aviv University\nmansour@tau.ac.il\n\nAbstract\n\nWe consider a seller with an unlimited supply of a single good, who is faced with\na stream of $T$ buyers. Each buyer has a window of time in which she would like\nto purchase, and would buy at the lowest price in that window, provided that this\nprice is lower than her private value (and otherwise, would not buy at all). In this\nsetting, we give an algorithm that attains $O(T^{2/3})$ regret over any sequence of $T$\nbuyers with respect to the best fixed price in hindsight, and prove that no algorithm\ncan perform better in the worst case.\n\n1 Introduction\n\nPerhaps the most common way to sell items is using a \u201cposted price\u201d mechanism in which the seller\npublishes the price of an item in advance, and buyers that wish to obtain the item decide whether to\nacquire it at the given price or to forgo the purchase. 
Such mechanisms are extremely appealing. The\ndecision made by the buyer in a single-shot interaction is simple: if it values the item at more than\nthe offered price, it should buy, and if its valuation is lower, it should decline. The seller, on the other\nhand, needs to determine the price at which she wishes to sell goods. In order to set prices, additive\nregret can be minimized using, for example, a multi-armed bandit (MAB) algorithm in which arms\ncorrespond to different prices, and rewards correspond to the revenue obtained by the seller.\nThings become much more complicated when the buyers who are facing the mechanism are patient\nand can choose to wait for the price to drop. The simplicity of posted price mechanisms is then\ntainted by strategic considerations, as buyers attempt to guess whether or not the seller will lower\nthe price in the future. The direct application of MABs is no longer adequate, as prices set by such\nalgorithms may fluctuate at every time period. Strategic buyers can make use of this fact to obtain\nthe item at a lower price, which lowers the revenue of the seller and, more crucially, changes the\nseller\u2019s feedback for a given price. With patient buyers, the revenue from sales is no longer a result\nof the price at the current period alone, but rather the combined outcome of prices that were set in\nsurrounding time periods, and of the expectation of buyers regarding future prices.\nIn this paper, we focus on strategic buyers that may delay their purchase in hopes of obtaining a\nbetter deal. We assume that each buyer has a valuation for the item, and a \u201cpatience level\u201d which\nrepresents the length of the time window during which it is willing to wait in order to purchase the\nitem. Buyers wish to minimize the price during this period. 
Note that such buyers may interfere with\nna\u00efve attempts to minimize regret, as consecutive days at which different prices are set are no longer\nindependent.\n\n*Parts of this work were done while the author was at Microsoft Research, Herzliya.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nTo regain the simplicity of posted prices for the buyers, we consider a setting in which the seller\ncommits to the price in subsequent time periods in advance, publishing prices for the entire window\nof the buyers. Strategic buyers that arrive at the market are then able to immediately choose the\nlowest price within their window. Thus, given the valuation and patience of the buyers (the number\nof days they are willing to wait), their actions are clearly determined: buy on the day within the\nbuyer\u2019s patience window on which the price is lowest, provided that it does not exceed the valuation.\nAn important aspect of our proposed model is to consider for each buyer a window of time (rather\nthan, for example, discounting). For example, when considering discounting, the buyers, in order\nto best respond, would have to reason about how other buyers would behave and how the seller would\nadjust the prices in response to them. By fixing a window of time, and forcing the seller to publish\nprices for the entire window, the buyers become \u201cprice takers\u201d and their behavior becomes tractable\nto analyze.\nAs in previous works, we focus on minimizing the additive regret of the seller, assuming that the\nappearance of buyers is adversarial; that is, we do not make any statistical assumptions on the buyers\u2019\nvaluation and window size (except for a simple upper bound). Specifically, we assume that the values\nare in the range $[0, 1]$ and that the window size is in the range $\\{1, \\ldots, \\hat{\\tau} + 1\\}$. The regret is measured\nwith respect to the best single price in hindsight. 
Note that the benchmark of a fixed price $p^*$ implies\nthat any buyer with value above $p^*$ buys and any buyer with value below $p^*$ does not buy. The window\nsize has no effect when we have a fixed price. On the other hand, for the online algorithm, having to\ndeal with various window sizes creates a new challenge.\nThe special case of this model where $\\hat{\\tau} = 0$ (and hence all buyers have a window of size exactly one)\nwas previously studied by Kleinberg and Leighton [11], who discussed a few different models for\nthe buyer valuations and derived tight regret bounds for them. When the set of feasible prices is of\nconstant size their result implies a $\\Theta(\\sqrt{T})$ regret bound with respect to the best fixed price, which\nis also proven to be the best possible in that case. In contrast, in the current paper we focus on\nthe case $\\hat{\\tau} \\ge 1$, where the buyers\u2019 window sizes may be larger than one, and exhibit the following\ncontributions:\n(i) We present an algorithm that achieves $O(\\hat{\\tau}^{1/3} T^{2/3})$ additive regret in an adversarial setting,\ncompared to the best fixed posted price in hindsight. The upper bound relies on creating epochs,\nwhere the price within each epoch is fixed and the number of epochs limits the number of times\nthe seller switches prices. The actual algorithm that is used to select the price for each epoch is\nEXP3 (or can be any other multi-armed bandit algorithm with similar performance).\n\n(ii) We exhibit a matching lower bound of $\\Omega(\\hat{\\tau}^{1/3} T^{2/3})$ regret. The proof of the lower bound reveals\nthat the difficulty in achieving lower regret stems from the lost revenue that the seller suffers\nevery time she tries to lower prices. Buyers from preceding time slots wait and do not purchase\nthe items at the higher prices that prevailed when they arrived. We are thus able to prove a lower\nbound by reducing to a multi-armed bandit problem with switching costs. 
Our lower bound\nuses only two prices.\n\nIn other words, we see that as soon as the buyers\u2019 patience increases from zero to one, the optimal\nregret rate immediately jumps from $\\Theta(\\sqrt{T})$ to $\\Theta(T^{2/3})$.\n\nThe rest of the paper is organized as follows. In the remainder of this section we briefly overview\nrelated work. We then proceed in Section 2 to provide a formal definition of the model and the\nstatement of our main results. We continue in Section 3 with a presentation of our algorithm and its\nanalysis, present our lower bound in Section 4, and conclude with a brief discussion.\n\n1.1 Related work\n\nAs mentioned above, the work most closely related to ours is the paper of Kleinberg and Leighton\n[11] that studies the case $\\hat{\\tau} = 0$, i.e., in which the buyers\u2019 windows are limited to be all of size one.\nFor a fixed set of feasible prices of constant size, their result implies a $\\Theta(\\sqrt{T})$ regret bound, whereas\nfor a continuum of prices they achieve a $\\Theta(T^{2/3})$ regret bound. The $\\Omega(T^{2/3})$ lower bound found in\n[11] is similar to our own in asymptotic magnitude, but stems from the continuous nature of the prices.\nIn our case the lower bound is achieved with only 2 prices, a case in which Kleinberg\nand Leighton [11] have a bound of $\\Theta(\\sqrt{T})$. Hence, we show that such a bound can occur due to the\nstrategic nature of the interaction itself.\n\nA line of work appearing in [1, 12, 13] considers a model of a single buyer and a single seller, where\nthe buyer is strategic and has a constant discount factor. The main issue is that the buyer continuously\ninteracts with the seller and thus has an incentive to lower future prices at the cost of current valuations.\nThey define strategic regret and derive near optimal strategic regret bounds for various valuation\nmodels. We differ from this line of work in a few important ways. 
First, they consider either\nfixed unknown valuations or stochastic i.i.d. valuations, while we consider adversarial valuations.\nSecond, they consider a single buyer while we consider a stream of buyers. More importantly, in\nour model the buyers do not influence the prices they are offered, so the strategic incentives are very\ndifferent. Third, their model uses discounting to model the decay of buyer valuation over time, while\nwe use a window of time.\nThere is a vast literature in Algorithmic Game Theory on revenue maximization with posted prices,\nin settings where agents\u2019 valuations are drawn from unknown distributions. For the case of a single\ngood of unlimited supply, the goal is to approximate the best price, as a function of the number\nof samples observed and with a multiplicative approximation ratio. The work of Balcan et al. [4]\ngives a generic reduction which can be used to show that one can achieve an $\\epsilon$-optimal pricing with\na sample of size $O((H/\\epsilon^2) \\log(H/\\epsilon))$, where $H$ is a bound on the maximum valuation. The works\nof Cole and Roughgarden [8] and Huang et al. [10] show that for regular and Monotone Hazard\nRate distributions sample bounds of $\\Theta(\\epsilon^{-3})$ and $\\Theta(\\epsilon^{-3/2})$, respectively, guarantee a multiplicative\napproximation of $1 - \\epsilon$.\nFinally, our setting is somewhat similar to a unit-demand auction in which agents desire a single\nitem out of several offerings. In our case, we can consider items sold at different times as different\nitems and agents desire a single one that is within their window. When agents have unit-demand\npreferences, posted-price mechanisms can extract a constant fraction of the optimal revenue [5, 6, 7].\nNote that a constant ratio approximation algorithm implies a linear regret in our model. 
On the other\nhand, these works consider a more involved problem from a buyer\u2019s valuation perspective.\n\n2 Setup and Main Results\n\nWe consider a setting with a single seller and a sequence of $T$ buyers $b_1, \\ldots, b_T$. Every buyer $b_t$ is\nassociated with a value $v_t \\in [0, 1]$ and a patience $\\tau_t$. A buyer\u2019s patience indicates the time duration in\nwhich the buyer stays in the system and may purchase an item.\nThe seller posts prices in advance over some time window. Let $\\hat{\\tau}$ be the maximum patience, and\nassume that $\\tau_t \\le \\hat{\\tau}$ for every $t$. Let $p_t$ denote the price at time $t$, and assume that all prices are chosen\nfrom a discrete (and normalized) predefined set of $n$ prices $P = \\{0, \\frac{1}{n}, \\frac{2}{n}, \\ldots, 1\\}$. At time $t = 1$, the\nseller posts prices $p_1, \\ldots, p_{\\hat{\\tau}+1}$, and learns the revenue obtained at time $t = 1$ (the revenue depends\non the buyers\u2019 behavior, which is explained below). Then, at each time step $t$, the seller publishes\na new price $p_{t+\\hat{\\tau}} \\in P$, and learns the revenue obtained at time $t$, which she can use to set the next\nprices. Note that at every time step, prices are known for the next $\\hat{\\tau}$ time steps.\nThe revenue in every time step is determined by the strategic behavior of buyers, which is explained\nnext. Every buyer $b_t$ observes prices $p_t, \\ldots, p_{t+\\tau_t}$, and purchases the item at the lowest price among\nthese prices (breaking ties toward earlier times), if it does not exceed her value. The revenue\nobtained from buyer $b_t$ is given by:\n\n$$\\rho(p_t, \\ldots, p_{t+\\hat{\\tau}}; b_t) = \\begin{cases} \\min\\{p_t, \\ldots, p_{t+\\tau_t}\\} & \\text{if } \\min\\{p_t, \\ldots, p_{t+\\tau_t}\\} \\le v_t, \\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nAs $b_t$ has patience $\\tau_t$, we will sometimes omit the irrelevant prices and write\n$\\rho(p_t, \\ldots, p_{t+\\tau_t}; b_t) = \\rho(p_t, \\ldots, p_{t+\\hat{\\tau}}; b_t)$.\nAs we described, a buyer need not buy the item on her day of appearance and may choose to wait.\nIf the buyer chooses to wait, we will observe the feedback from her decision only on the day of\npurchase. We therefore need to distinguish between the revenue from buyer $t$ and the revenue at time\n$t$. Given a sequence of prices $p_1, \\ldots, p_{t+\\hat{\\tau}}$ and a sequence of buyers $b_1, \\ldots, b_t$, we define the revenue\nat time $t$ to be the sum of all revenues from buyers that preferred to buy at time $t$. Formally, let $I_t$\ndenote the set of all buyers that buy at time $t$, i.e.,\n\n$$I_t = \\{ b_i : t = \\arg\\min \\{ i \\le t' \\le i + \\tau_i : p_{t'} = \\rho(p_i, \\ldots, p_{i+\\hat{\\tau}}; b_i) \\} \\}.$$\n\nThen the revenue obtained at time $t$ is given by:\n\n$$R_t(p_{t-\\hat{\\tau}}, \\ldots, p_{t+\\hat{\\tau}}) = R(p_1, \\ldots, p_{t+\\hat{\\tau}}; b_{1:t}) := \\sum_{i \\in I_t} \\rho(p_i, \\ldots, p_{i+\\hat{\\tau}}; b_i),$$\n\nwhere we use the notation $b_{1:T}$ as a shorthand for the sequence $b_1, \\ldots, b_T$. The regret of the (possibly\nrandomized) seller $A$ is the difference between the revenue obtained by the best fixed price in hindsight\nand the expected revenue obtained by the seller $A$, given a sequence of buyers:\n\n$$\\mathrm{Regret}_T(A; b_{1:T}) = \\max_{p^* \\in P} \\sum_{t=1}^{T} R(p^*, \\ldots, p^*; b_{1:t}) - \\mathbb{E}\\left[ \\sum_{t=1}^{T} R(p_1, \\ldots, p_{t+\\hat{\\tau}}; b_{1:t}) \\right].$$\n\nWe further denote by $\\mathrm{Regret}_T(A)$ the expected regret a seller $A$ incurs for the worst case sequence,\ni.e., $\\mathrm{Regret}_T(A) = \\max_{b_{1:T}} \\mathrm{Regret}_T(A; b_{1:T})$.\n\n2.1 Main Results\n\nOur main results are optimal regret rates in the strategic buyers setting.\nTheorem 1. The $T$-round expected regret of Algorithm 1 for any sequence of buyers $b_1, \\ldots, b_T$ with\npatience at most $\\hat{\\tau} \\ge 1$ is upper bounded as $\\mathrm{Regret}_T \\le 10 (\\hat{\\tau} n \\log n)^{1/3} T^{2/3}$.\nTheorem 2. For any $\\hat{\\tau} \\ge 1$, $n \\ge 2$ and for any pricing algorithm, there exists a sequence of buyers\n$b_1, \\ldots, b_T$ with patience at most $\\hat{\\tau}$ such that $\\mathrm{Regret}_T = \\Omega(\\hat{\\tau}^{1/3} T^{2/3})$.\n\n3 Algorithm\n\nIn this section we describe and analyze our online pricing algorithm. It is worth starting by highlighting\nwhy simply running an \u201coff the shelf\u201d multi-armed bandit algorithm such as EXP3 would fail. Consider\na fixed distribution over the actions and assume the buyer has a window size of two. Unlike the\nstandard multi-armed bandit, where we get the expected revenue from the price we select, now the\nbuyer would select the lower of the two prices, which would clearly hurt our revenue (there is a slight\ngain, by the increased probability of a sale, but it does not suffice to offset the loss). For this reason, the\nseller would intuitively like to minimize the number of times she changes prices (more precisely, lowers\nthe prices).\nOur online pricing algorithm, which is given in Algorithm 1, is based on the EXP3 algorithm of Auer\net al. [3], which we use as a black box. The algorithm divides the time horizon into roughly $T^{2/3}$ epochs,\nand within each epoch the seller repeatedly announces the same price, which was chosen by the EXP3\nblack box at the beginning of the epoch. At the end of the epoch, EXP3 is updated with the overall\naverage performance of the chosen price during the epoch (ignoring the time steps which might be\ninfluenced by different prices). Hence, our algorithm changes the posted price only $O(T^{2/3})$ times,\nthereby keeping under control the costs associated with price fluctuations due to the patience of the\nbuyers.\n\nAlgorithm 1: Online posted pricing algorithm\nParameters: horizon $T$, number of prices $n$, and maximal patience $\\hat{\\tau}$;\nLet $B = \\lfloor \\hat{\\tau}^{2/3} (n \\log n)^{-1/3} T^{1/3} \\rfloor$ and $T' = \\lfloor T/B \\rfloor$;\nInitialize $A \\leftarrow \\mathrm{EXP3}(T', n)$;\nfor $j = 0, \\ldots, T' - 1$ do\n    Sample $i \\sim A$ and let $p'_j = i/n$;\n    for $t = Bj + 1, \\ldots, B(j + 1)$ do\n        Announce price $p_{t+\\hat{\\tau}} = p'_j$; % On $j = 0$, $t = 1$ announce $p_1 = \\cdots = p_{t+\\hat{\\tau}} = p'_0$.\n        Receive and observe total revenue $R_t(p_{t-\\hat{\\tau}}, \\ldots, p_{t+\\hat{\\tau}})$;\n    Update $A$ with feedback $\\frac{1}{B} \\sum_{t=Bj+2\\hat{\\tau}+1}^{B(j+1)} R_t(p_{t-\\hat{\\tau}}, \\ldots, p_{t+\\hat{\\tau}})$;\nfor $t = BT' + 1, \\ldots, T$ do\n    Announce price $p_{t+\\hat{\\tau}} = p'_{T'-1}$;\n\nWe now analyze Algorithm 1 and prove Theorem 1. The proof follows standard arguments in\nadversarial online learning (e.g., Arora et al. [2]); we note, however, that for obtaining the optimal\ndependence on the maximal patience $\\hat{\\tau}$ one cannot apply existing results directly and has to analyze\nthe effect of accumulating revenues over epochs more carefully, as we do in the proof below. This\nis mainly because in our model the revenue at time $t$ is not bounded by 1 but by $\\hat{\\tau}$, hence applying\nexisting results directly would add a factor of $\\hat{\\tau}$ to the regret.\nProof of Theorem 1. For all $0 \\le j \\le T'$ and for all prices $p \\in P$, define\n\n$$R'_j(p) = \\frac{1}{B} \\sum_{t=Bj+2\\hat{\\tau}+1}^{B(j+1)} R_t(p, \\ldots, p).$$\n\n(Here, the argument $p$ is repeated $2\\hat{\\tau} + 1$ times.) Observe that $0 \\le R'_j(p) \\le 1$ for all $j$ and $p$, as the\nmaximal total revenue between rounds $Bj + 2\\hat{\\tau} + 1$ and $B(j + 1)$ is at most $B$; indeed, there are at\nmost $B$ buyers who might make a purchase during that time, and each purchase yields revenue of at\nmost 1. By a similar reasoning, we also have\n\n$$\\sum_{t=Bj+1}^{Bj+2\\hat{\\tau}} R_t(p, \\ldots, p) \\le 4\\hat{\\tau} \\qquad (1)$$\n\nfor all $j$ and $p$.\nNow, notice that $p_t = p'_j$ for all $Bj + \\hat{\\tau} + 1 \\le t \\le B(j + 1) + \\hat{\\tau}$, hence the feedback fed back to $A$\nafter epoch $j$ is\n\n$$\\frac{1}{B} \\sum_{t=Bj+2\\hat{\\tau}+1}^{B(j+1)} R_t(p_{t-\\hat{\\tau}}, \\ldots, p_{t+\\hat{\\tau}}) = \\frac{1}{B} \\sum_{t=Bj+2\\hat{\\tau}+1}^{B(j+1)} R_t(p'_j, \\ldots, p'_j) = R'_j(p'_j).$$\n\nThat is, Algorithm 1 is essentially running EXP3 on the reward functions $R'_j$. By the regret bound of\nEXP3, we know that\n\n$$\\sum_{j=0}^{T'-1} R'_j(p^*) - \\mathbb{E}\\left[ \\sum_{j=0}^{T'-1} R'_j(p'_j) \\right] \\le 3\\sqrt{T' n \\log n}$$\n\nfor any fixed $p^* \\in P$, which implies\n\n$$\\sum_{j=0}^{T'-1} \\sum_{t=Bj+2\\hat{\\tau}+1}^{B(j+1)} R_t(p^*, \\ldots, p^*) - \\mathbb{E}\\left[ \\sum_{j=0}^{T'-1} \\sum_{t=Bj+2\\hat{\\tau}+1}^{B(j+1)} R_t(p_{t-\\hat{\\tau}}, \\ldots, p_{t+\\hat{\\tau}}) \\right] \\le 3\\sqrt{B T n \\log n}. \\qquad (2)$$\n\nIn addition, due to Eq. (1) and the non-negativity of the revenues, we also have\n\n$$\\sum_{j=0}^{T'-1} \\sum_{t=Bj+1}^{Bj+2\\hat{\\tau}} R_t(p^*, \\ldots, p^*) - \\mathbb{E}\\left[ \\sum_{j=0}^{T'-1} \\sum_{t=Bj+1}^{Bj+2\\hat{\\tau}} R_t(p_{t-\\hat{\\tau}}, \\ldots, p_{t+\\hat{\\tau}}) \\right] \\le 4\\hat{\\tau} T' \\le \\frac{4\\hat{\\tau} T}{B}. \\qquad (3)$$\n\nSumming Eqs. (2) and (3), and taking into account rounds $BT' + 1, \\ldots, T$ during which the total\nrevenue is at most $B + 2\\hat{\\tau}$, we obtain the regret bound\n\n$$\\sum_{t=1}^{T} R_t(p^*, \\ldots, p^*) - \\mathbb{E}\\left[ \\sum_{t=1}^{T} R_t(p_{t-\\hat{\\tau}}, \\ldots, p_{t+\\hat{\\tau}}) \\right] \\le 3\\sqrt{B T n \\log n} + \\frac{4\\hat{\\tau} T}{B} + B + 2\\hat{\\tau}.$$\n\nFinally, for $B = \\lfloor \\hat{\\tau}^{2/3} (n \\log n)^{-1/3} T^{1/3} \\rfloor$, the theorem follows (assuming that $\\hat{\\tau} < T$). $\\square$\n\n4 Lower Bound\n\nWe next briefly overview the lower bound and the proof\u2019s main technique. A full proof is given in the\nsupplementary material; for simplicity of exposition, here we assume $\\hat{\\tau} = 1$ and $n = 2$.\nOur proof relies on two steps. The first step is a reduction from pricing with patience $\\hat{\\tau} = 0$ but with\nswitching cost. The second step is to lower bound the regret of pricing with switching cost. This we\ndo again by reduction from the Multi-Armed Bandit (MAB) problem with switching cost. We begin\nby briefly reviewing these terms and definitions.\nWe recall the standard setting of MAB with two actions and switching cost $c$. 
A sequence of losses\n$\\ell_1, \\ldots, \\ell_T$ is produced, where each loss is defined as a function $\\ell_t : \\{1, 2\\} \\to \\{0, 1\\}$. At each round\na player chooses an action $i_t \\in \\{1, 2\\}$ and receives as feedback $\\ell_t(i_t)$. The switching cost regret of\nplayer $A$ is given by\n\n$$S_c\\text{-Regret}_T(A; \\ell_{1:T}) = \\mathbb{E}\\left[ \\sum_{t=1}^{T} \\ell_t(i_t) - \\min_{i^*} \\sum_{t=1}^{T} \\ell_t(i^*) \\right] + c\\, \\mathbb{E}\\left[ |\\{ i_t : i_t \\ne i_{t-1} \\}| \\right].$$\n\nWe define analogously the switching cost regret for non-strategic buyers. Namely, given a\nsequence of buyers $b_1, \\ldots, b_T$, all with patience $\\hat{\\tau} = 0$, the switching cost regret for a seller is given\nby:\n\n$$S_c\\text{-Regret}_T(A; b_{1:T}) = \\mathbb{E}\\left[ \\max_{p^*} \\sum_{t=1}^{T} R(p^*; b_t) - \\sum_{t=1}^{T} R(p_t; b_t) \\right] + c\\, \\mathbb{E}\\left[ |\\{ p_t : p_t \\ne p_{t-1} \\}| \\right].$$\n\n4.1 Reduction from Switching Cost Regret\n\nAs we stated above, our first step is to show a reduction from switching cost regret for non-strategic\nbuyers. This we do in Theorem 3:\nTheorem 3. For every (possibly randomized) seller $A$ for strategic buyers with patience at most\n$\\hat{\\tau} = 1$, there exists a randomized seller $A'$ for non-strategic buyers with patience $\\hat{\\tau} = 0$ such that:\n\n$$\\frac{1}{2} S_{\\frac{1}{12}}\\text{-Regret}_T(A') \\le \\mathrm{Regret}_T(A).$$\n\nThe proof idea is to construct from every sequence of non-strategic buyers $b_1, \\ldots, b_T$ a sequence of\nstrategic buyers $\\bar{b}_1, \\ldots, \\bar{b}_T$ such that the regret incurred by $A$ on $\\bar{b}_{1:T}$ is at least the switching cost\nregret incurred by $A'$ on $b_{1:T}$. The idea behind the construction is as follows: at each iteration $t$ we\nchoose with probability half to present to the seller $b_t$, and with probability half we present to the\nseller a buyer $z_t$ that has the following statistics:\n\n$$z_t = \\begin{cases} (v = \\frac{1}{2}, \\tau = 0) & \\text{w.p. } \\frac{1}{2}, \\\\ (v = 1, \\tau = 1) & \\text{w.p. } \\frac{1}{2}. \\end{cases} \\qquad (4)$$\n\nThat is, $z_t$ is with probability $\\frac{1}{2}$ a buyer with value $v = \\frac{1}{2}$ and patience $\\tau = 0$, and with probability $\\frac{1}{2}$,\n$z_t$ is a buyer with value $v = 1$ and patience $\\tau = 1$.\nObserve that if $z_t$ always had patience $\\tau = 0$ (i.e., even if her value is $v = 1$), then for any sequence\nof prices the expected reward from the $z_t$ buyer would always be half, independent of the prices. In other\nwords, the sequence of noise does not change the performance of the sequence of prices and cannot\nbe exploited to improve. On the other hand, note that since the value 1 corresponds to patience 1, the\nseller might lose half whenever she reduces the price from 1 to $\\frac{1}{2}$. A crucial point is that the seller\nmust post her price in advance, therefore she cannot in any way predict whether the buyer is willing to\nwait or not and manipulate prices accordingly. A proof of the following lemma is provided in the\nsupplementary material.\nLemma 4. Consider the pricing problem with $\\hat{\\tau} = 1$ and $n = 2$. Let $b_1, \\ldots, b_T$ be a sequence of\nbuyers with patience 0. Let $z_1, \\ldots, z_T$ be a sequence of stochastic buyers as in Eq. (4). Define $\\bar{b}_t$ to\nbe a stochastic buyer that is with probability half $b_t$ and with probability half $z_t$. Then, for any seller\n$A$, the expected regret $A$ incurs from the sequence $\\bar{b}_{1:T}$ is at least\n\n$$\\mathbb{E}\\left[ \\mathrm{Regret}_T(A; \\bar{b}_{1:T}) \\right] \\ge \\frac{1}{2} \\mathbb{E}\\left[ \\max_{p^* \\in P} \\sum_{t=1}^{T} \\left( \\rho(p^*; b_t) - \\rho(p_t; b_t) \\right) \\right] + \\frac{1}{8} \\mathbb{E}\\left[ \\sum_{t=1}^{T} |\\{ p_t : p_t > p_{t+1} \\}| \\right] \\qquad (5)$$\n\nwhere the expectations are taken with respect to the internal randomization of the seller $A$ and the\nrandom bits used to generate the sequence $\\bar{b}_{1:T}$.\n\n4.1.1 Proof of Theorem 3\nTo construct algorithm $A'$ from $A$, we develop a meta-algorithm, depicted in Algorithm 2, that\nreceives an algorithm, or seller, as input. 
$A'$ is then the seller obtained by fixing $A$ as the input to the meta-algorithm.\nIn our reduction we assume that at each iteration the meta-algorithm can ask $A$ for one posted price, $p_t$,\nand in turn can return a feedback $r_t$ to algorithm $A$, after which a new iteration begins.\nThe idea of the construction is as follows: as an initialization step, algorithm $A'$ produces a stochastic\nsequence of buyers of type $z_1, \\ldots, z_T$; the algorithm then chooses a priori whether at step $t$ the buyer $\\bar{b}_t$ is\ngoing to be the buyer $b_t$ that she observes or $z_t$ (with probability half each). The sequence $\\bar{b}_t$ is\ndistributed as depicted in Lemma 4. Note that we do not assume that the learner knows the value of\n$b_t$.\nAt each iteration $t$, algorithm $A'$ receives price $p_t$ from algorithm $A$ and posts price $p_t$. She then\nreceives as feedback $\\rho(p_t; b_t)$. Given the revenues $\\rho(p_1; b_1), \\ldots, \\rho(p_t; b_t)$ and her own internal\nrandom variables, the algorithm can calculate the revenue for algorithm $A$ w.r.t. the sequence of\nbuyers $\\bar{b}_1, \\ldots, \\bar{b}_t$, namely $r_t = R(p_{t-1}, \\ldots, p_{t+1}; \\bar{b}_{1:t})$.\nIn turn, at time $t$ algorithm $A'$ returns to algorithm $A$ her revenue, or feedback, w.r.t. $\\bar{b}_1, \\ldots, \\bar{b}_T$ at\ntime $t$, which is $r_t$.\nSince algorithm $A$ receives as feedback at time $t$ the value $R(p_{t-1}, p_t, p_{t+1}; \\bar{b}_{1:t})$, we obtain that for the sequence\nof posted prices $p_1, \\ldots, p_T$:\n\n$$\\mathrm{Regret}_T(A; \\bar{b}_{1:T}) = \\sum_{t=1}^{T} \\rho(p^*, p^*; \\bar{b}_t) - \\sum_{t=1}^{T} \\rho(p_t, p_{t+1}; \\bar{b}_t).$$\n\nTaking expectation, using Lemma 4, and noting that the number of times $p_{t+1} > p_t$ is at least 1/3 of\nthe times $p_t \\ne p_{t+1}$ (since there are only 2 prices), we have that\n\n$$\\frac{1}{2} S_{\\frac{1}{12}}\\text{-Regret}_T(A'; b_{1:T}) \\le \\mathbb{E}_{\\bar{b}_{1:T}}\\left[ \\mathrm{Regret}_T(A; \\bar{b}_{1:T}) \\right] \\le \\mathrm{Regret}_T(A).$$\n\nSince this is true for any sequence $b_{1:T}$ we obtain the desired result.\n\nAlgorithm 2: Reduction from pricing with switching cost to strategic buyers\nInput: $T$, $A$ % $A$ is an algorithm with bounded regret for strategic buyers;\nOutput: $p_1, \\ldots, p_T$;\nSet $r_1 = \\cdots = r_T = 0$;\nDraw IID $z_1, \\ldots, z_T$ % see Eq. (4);\nDraw IID $e_1, \\ldots, e_T \\in \\{0, 1\\}$ distributed according to a Bernoulli distribution;\nfor $t = 1, \\ldots, T$ do\n    Receive from $A$ a posted price $p_{t+1}$; % At the first round receive two prices $p_1, p_2$.\n    Post price $p_t$ and receive as feedback $\\rho(p_t; b_t)$;\n    if $e_t = 0$ then\n        Set $r_t = r_t + \\rho(p_t; b_t)$; % $\\bar{b}_t = b_t$\n    else\n        if ($p_t \\le p_{t+1}$) OR ($z_t$ has patience 0) then\n            Set $r_t = r_t + \\rho(p_t; z_t)$\n        else\n            Set $r_{t+1} = r_{t+1} + \\rho(p_t, p_{t+1}; z_t)$\n    Return $r_t$ as feedback to $A$.\n\n4.2 From MAB with switching cost to Pricing with switching cost\nThe above section concluded that switching cost for pricing may be reduced to pricing with strategic\nbuyers. Therefore, our next step is to show that we can produce a sequence of non-strategic\nbuyers with high switching cost regret. Our proof relies on a further reduction from MAB with\nswitching cost.\nTheorem 5 (Dekel et al. [9]). Consider the MAB setting with 2 actions. For any randomized\nplayer, there exists a sequence of loss functions $\\ell_1, \\ldots, \\ell_T$ where $\\ell_t : \\{1, 2\\} \\to \\{0, 1\\}$ such that\n$S_c\\text{-Regret}_T(A; \\ell_{1:T}) \\in \\Omega(T^{2/3})$, for every $c > 0$.\n\nHere we prove an analogous statement for the pricing setting:\nTheorem 6. Consider the pricing problem for buyers with patience $\\hat{\\tau} = 0$ and $n = 2$. For any\nrandomized seller, there exists a sequence of buyers $b_1, \\ldots, b_T$ such that $S_c\\text{-Regret}_T(A; b_{1:T}) \\in\n\\Omega(T^{2/3})$, for every $c > 0$.\nThe transition from MAB with switching cost to pricing with switching cost is a non-trivial task. To\ndo so, we have to relate actions to prices and values to loss vectors in a manner that would relate the\nrevenue regret to the loss regret. The main challenge, perhaps, is that the structure of the feedback is\ninherently different in the two problems. In two-armed bandit problems all loss configurations are\nfeasible. 
In contrast, in the pricing case certain feedbacks collapse to full information: for example, if\nwe sell at price 1 we know the feedback from price $\\frac{1}{2}$, and if we fail to sell at price $\\frac{1}{2}$ we obtain full\nfeedback for price 1.\nOur reduction proceeds roughly along the following lines. We begin by constructing stochastic\nmappings that turn loss vectors into values, $\\nu_t : \\{0, 1\\}^2 \\to \\{0, \\frac{1}{2}, 1\\}$. This in turn defines a mapping\nfrom a sequence of losses $\\ell_t$ to stochastic sequences of buyers $b_t$. In our reduction we assume we\nare given an algorithm $A$ that solves the pricing problem; that is, at each iteration we may ask for\na price and then in turn we return a feedback $\\rho(p_t; b_t)$. Note that we cannot assume that we have\naccess to or know $b_t$, which is defined by $\\nu_t(\\ell_t)$. The buyer $b_t$ depends on the full loss vector $\\ell_t$: assuming\nthat we can see the full $\\ell_t$ would not lead to a meaningful reduction for MAB.\nHowever, our construction of $\\nu_t$ is such that each posted price is associated with a single action. This\nmeans that for each posted price there is a single action we need to observe in order to calculate the\ncorrect feedback or revenue. This also means that we switch actions only when algorithm $A$ switches\nprices. Finally, our sequence of transformations has the following property: if $i$ is the action needed in\norder to discover the revenue for price $p$, then $\\mathbb{E}(\\ell_t(i)) = \\frac{1}{2} - \\frac{1}{4} \\mathbb{E}(\\rho(p; b_t))$. Thus, the regret for our\nactions compares to the regret of the seller.\n\n5 Discussion\n\nIn this work we introduced a new model of strategic buyers, where buyers have a window of time\nin which they would like to purchase the item. Our modeling circumvents complicated dynamics\nbetween the buyers, since it forces the seller to post prices for the entire window of time in advance.\nWe consider an adversarial setting, where both buyer valuation and window size are selected\nadversarially. 
We compare our online algorithm to a static fixed price, which is by definition oblivious\nto the window sizes. We show that the regret is sub-linear, and more precisely $\\Theta(T^{2/3})$. The upper\nbound shows that in this model the average regret per buyer is still vanishing. The lower bound shows\nthat having a window size greater than 1 impacts the regret bounds dramatically. Even for window\nsizes 1 or 2 and prices $\\frac{1}{2}$ or 1 we get a regret of $\\Omega(T^{2/3})$, compared to a regret of $O(T^{1/2})$ when all\nthe windows are of size 1.\nGiven the sharp $\\Theta(T^{2/3})$ bound, it might be worth revisiting our feedback model. Our model assumes\nthat the feedback for the seller is the revenue obtained at the end of each day. It is worthwhile to\nconsider stronger feedback models, where the seller can gain more information about the buyers,\nnamely, their day of arrival and their window size. In terms of the upper bound, our result applies to\nany feedback model that is stronger, i.e., as long as the seller gets to observe the revenue per day,\nthe $O(T^{2/3})$ bound holds. As far as the lower bound is concerned, one can observe that our proofs\nand construction are valid even for very strong feedback models. Namely, even if the seller gets as\nfeedback the revenue from buyer $t$ at time $t$ (instead of the time of purchase), and in fact even if she\ngets to observe the patience of the buyers (i.e., full information w.r.t. patience), the $\\Omega(T^{2/3})$ bound\nholds, as long as the seller posts prices in advance.\nWe did not consider continuous pricing explicitly, but one can verify that applying our algorithm to a\nsetting of continuous pricing gives a regret bound of $O(T^{3/4})$, by discretizing the continuous prices\nto $T^{1/4}$ prices. On the positive side, it shows that we still obtain a vanishing average regret in the\ncontinuous case. On the other hand, we were not able to improve our lower bound to match this upper\nbound. 
This gap is one of the interesting open problems in our work.\n\nReferences\n[1] K. Amin, A. Rostamizadeh, and U. Syed. Learning prices for repeated auctions with strategic\nbuyers. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger,\neditors, Advances in Neural Information Processing Systems 26, pages 1169\u20131177. 2013.\n\n[2] R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from\nregret to policy regret. arXiv preprint arXiv:1206.6400, 2012.\n\n[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit\nproblem. SIAM Journal on Computing, 32(1):48\u201377, 2002.\n\n[4] M.-F. Balcan, A. Blum, J. D. Hartline, and Y. Mansour. Reducing mechanism design to\nalgorithm design via machine learning. J. Comput. Syst. Sci., 74(8):1245\u20131270, 2008.\n\n[5] S. Chawla, J. D. Hartline, and R. D. Kleinberg. Algorithmic pricing via virtual valuations. In\nACM Conference on Electronic Commerce, pages 243\u2013251, 2007.\n\n[6] S. Chawla, J. D. Hartline, D. L. Malec, and B. Sivan. Multi-parameter mechanism design and\nsequential posted pricing. In STOC, pages 311\u2013320, 2010.\n\n[7] S. Chawla, D. L. Malec, and B. Sivan. The Power of Randomness in Bayesian Optimal\nMechanism Design. In the 11th ACM Conference on Electronic Commerce (EC), 2010.\n\n[8] R. Cole and T. Roughgarden. The sample complexity of revenue maximization. In Proceedings\nof the 46th Annual ACM Symposium on Theory of Computing, pages 243\u2013252. ACM, 2014.\n\n[9] O. Dekel, J. Ding, T. Koren, and Y. Peres. Bandits with switching costs: $T^{2/3}$ regret. In\nProceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 459\u2013467.\nACM, 2014.\n\n[10] Z. Huang, Y. Mansour, and T. Roughgarden. Making the most of your samples. In Proceedings\nof the Sixteenth ACM Conference on Economics and Computation, EC, pages 45\u201360, 2015.\n\n[11] R. D. Kleinberg and F. T. 
Leighton. The value of knowing a demand curve: Bounds on regret for\nonline posted-price auctions. In 44th Symposium on Foundations of Computer Science FOCS,\npages 594\u2013605, 2003.\n\n[12] M. Mohri and A. Munoz. Optimal regret minimization in posted-price auctions with strategic\nbuyers. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger,\neditors, Advances in Neural Information Processing Systems 27, pages 1871\u20131879. 2014.\n\n[13] M. Mohri and A. Munoz. Revenue optimization against strategic buyers. In C. Cortes, N. D.\nLawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information\nProcessing Systems 28, pages 2530\u20132538. 2015.", "award": [], "sourceid": 1920, "authors": [{"given_name": "Michal", "family_name": "Feldman", "institution": "TAU"}, {"given_name": "Tomer", "family_name": "Koren", "institution": "Technion---Israel Inst. of Technology"}, {"given_name": "Roi", "family_name": "Livni", "institution": "Huji"}, {"given_name": "Yishay", "family_name": "Mansour", "institution": "Microsoft"}, {"given_name": "Aviv", "family_name": "Zohar", "institution": "huji"}]}