Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*David C. Parkes, Dimah Yanovsky, Satinder Singh*

Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement $\epsilon$-efficient policies, and retain truth-revelation as an approximate Bayesian-Nash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse.

1 Introduction

Mechanism design (MD) is concerned with the problem of providing incentives to implement desired system-wide outcomes in systems with multiple self-interested agents. Agents are assumed to have private information, for example about their utility for different outcomes and about their ability to implement different outcomes, and act to maximize their own utility. The MD approach to achieving multiagent coordination supposes the existence of a center that can receive messages from agents, implement an outcome, and collect payments from agents. The goal of MD is to implement an outcome with desired system-wide properties in a game-theoretic equilibrium.

Classic mechanism design considers static systems in which all agents are present and a one-time decision is made about an outcome. Auctions, used in the context of resource-allocation problems, are a standard example. Online mechanism design [1] departs from this and allows agents to arrive and depart dynamically, requiring decisions to be made with uncertainty about the future. Thus, an online mechanism makes a sequence of decisions without the benefit of hindsight about the valuations of the agents yet to arrive. Setting aside the issue of incentives, the online MD problem is a classic sequential decision problem.

In prior work [6], we showed that Markov decision processes (MDPs) can be used to define an online Vickrey-Clarke-Groves (VCG) mechanism [2] that makes truth-revelation by the agents (called incentive-compatibility) a Bayesian-Nash equilibrium [5] and implements a policy that maximizes the expected summed value of all agents. This online VCG model

differs from the line of work on online auctions, introduced by Lavi and Nisan [4], in that it assumes that the center has a model, and it handles a general decision space and any decision horizon. Computing the payments and allocations in the online VCG mechanism involves solving the MDP that defines the underlying centralized (ignoring self-interest) decision-making problem. For large systems, the MDPs that need to be solved exactly become large and thus computationally infeasible.

In this paper we consider the case where the underlying centralized MDPs are indeed too large and thus must be solved approximately, as will be the case in most real applications. Of course, there are several choices of methods for solving MDPs approximately. We show that the sparse-sampling algorithm due to Kearns et al. [3] is particularly well suited to online MD because it produces the needed local approximations to the optimal value and action efficiently. A further challenge, regardless of the choice of method, is that agents in the system can exploit their knowledge of the mechanism's approximation algorithm to try to "cheat" the mechanism and further their own selfish interests. Our main contribution is to demonstrate that our new approximate online VCG mechanism has truth-revelation by the agents as an $\epsilon$-Bayesian-Nash equilibrium (BNE). This approximate equilibrium supposes that each agent is indifferent to within an expected utility of $\epsilon$, and will play a truthful strategy in best-response to truthful strategies of other agents if no other strategy can improve its utility by more than $\epsilon$. For any $\epsilon > 0$, our online mechanism has a run time that is independent of the number of states in the underlying MDP, provides an $\epsilon$-BNE, and implements a policy with expected value within $\epsilon$ of the optimal policy's value.

Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. We demonstrate the speed-up achieved with sparse-sampling (compared with policy calculation via value iteration), as well as the economic value of adopting an MDP-based approach over a simple fixed-price approach.

2 Preliminaries

Here we formalize the multiagent sequential decision problem that defines the online mechanism design (OMD) problem. The approach is centralized. Each agent is asked to report its private information (for instance, about its value and its capabilities) to a central planner, or mechanism, upon arrival. The mechanism implements a policy based on its view of the state of the world (as reported by agents). The policy defines actions in each state, and the assumption is that all agents acquiesce to the decisions of the planner. The mechanism also collects payments from agents, which can themselves depend on the reports of agents.

Online Mechanism Design. We consider a finite-horizon problem with a set $T$ of discrete time points and a sequence of decisions $k = \{k^1, \ldots, k^T\}$, where $k^t \in K^t$ and $K^t$ is the set of feasible decisions in period $t$. Agent $i \in \mathcal{I}$ arrives at time $a_i \in T$, departs at time $l_i \in T$, and has value $v_i(k) \geq 0$ for a sequence of decisions $k$. By assumption, an agent has no value for decisions outside of the interval $[a_i, l_i]$. Agents also face payments, which can be collected after an agent's departure. Collectively, information $\theta_i = (a_i, l_i, v_i)$ defines the type of agent $i$, with $\theta_i \in \Theta$. Agent types are sampled i.i.d. from a probability distribution $f(\theta)$, assumed known to the agents and to the central mechanism. Multiple agents can arrive and depart at the same time. Agent $i$, with type $\theta_i$, receives utility $u_i(k, p; \theta_i) = v_i(k; \theta_i) - p$, for decisions $k$ and payment $p$. Agents are modeled as expected-utility maximizers.

Definition 1 (Online Mechanism Design) The OMD problem is to implement the sequence of decisions that maximizes the expected summed value across all agents in equilibrium, given self-interested agents with private information about valuations.

In economic terms, an optimal (value-maximizing) policy is the allocatively efficient, or simply the efficient, policy. Note that nothing about the OMD model requires centralized

execution of the joint plan. Rather, the agents themselves can have capabilities to perform actions and be asked to perform particular actions by the mechanism. The agents can also have private information about the actions that they are able to perform.

Using MDPs to Solve Online Mechanism Design. In the MDP-based approach to solving the OMD problem, the sequential decision problem is formalized as an MDP with the state at any time encapsulating both the current agent population and the constraints on current decisions imposed by decisions made previously. The reward function in the MDP is defined to correspond with the total reported value of all agents for all sequences of decisions.

Given types $\theta_i \sim f(\theta)$ we define an MDP, $M_f$, as follows. Define the state of the MDP at time $t$ as the history-vector $h^t = (\theta^1, \ldots, \theta^t; k^1, \ldots, k^{t-1})$, which includes the reported types up to and including period $t$ and the decisions made up to and including period $t-1$.1 The set of all possible states at time $t$ is denoted $H^t$. The set of all possible states across all time is $H = \bigcup_{t=1}^{T+1} H^t$. The set of decisions available in state $h^t$ is $K^t(h^t)$. Given a decision $k^t \in K^t(h^t)$ in state $h^t$, there is some probability distribution $\mathrm{Prob}(h^{t+1} | h^t, k^t)$ over possible next states $h^{t+1}$. In the setting of OMD, this probability distribution is determined by the uncertainty over new agent arrivals (as represented within $f(\theta)$), together with departures and the impact of decision $k^t$ on the state.

The payoff function for the induced MDP is defined to reflect the goal of maximizing the total expected reward across all agents. In particular, payoff $R_i(h^t, k^t) = v_i(k^t; \theta_i) - v_i(k^{t-1}; \theta_i)$ becomes available from agent $i$ upon taking action $k^t$ in state $h^t$. With this, we have $\sum_{t=1}^{\tau} R_i(h^t, k^t) = v_i(k^\tau; \theta_i)$ for all periods $\tau$, to provide the required correspondence with agent valuations. Let $R(h^t, k^t) = \sum_i R_i(h^t, k^t)$ denote the payoff obtained from all agents at time $t$. Given a (nonstationary) policy $\pi = \{\pi^1, \pi^2, \ldots, \pi^T\}$, where $\pi^t : H^t \rightarrow K^t$, an MDP defines an MDP-value function $V^\pi$ as follows: $V^\pi(h^t)$ is the expected value of the summed payoff obtained from state $h^t$ onwards under policy $\pi$, i.e., $V^\pi(h^t) = E\{R(h^t, \pi(h^t)) + R(h^{t+1}, \pi(h^{t+1})) + \cdots + R(h^T, \pi(h^T))\}$. An optimal policy is one that maximizes the MDP-value of every state in $H$.

The optimal MDP-value function $V^*$ can be computed by value iteration, and is defined so that $V^*(h) = \max_{k \in K^t(h)} [R(h, k) + \sum_{h' \in H^{t+1}} \mathrm{Prob}(h' | h, k) V^*(h')]$ for $t = T-1, T-2, \ldots, 1$ and all $h \in H^t$, with $V^*(h \in H^T) = \max_{k \in K^T(h)} R(h, k)$. Given the optimal MDP-value function, the optimal policy is derived as follows: for $t < T$, policy $\pi^*(h \in H^t)$ chooses action $\arg\max_{k \in K^t(h)} [R(h, k) + \sum_{h' \in H^{t+1}} \mathrm{Prob}(h' | h, k) V^*(h')]$

and $\pi^*(h \in H^T) = \arg\max_{k \in K^T(h)} R(h, k)$. Let $\hat{\theta}^{\leq t}$ denote reported types up to and including period $t$. Let $R_i^{\leq t}(\hat{\theta}^{\leq t}; \pi)$ denote the total reported reward to agent $i$ up to and including period $t$. The commitment period for agent $i$ is defined as the first period $m_i$ for which $R_i^{\leq m_i}(\hat{\theta}^{\leq m_i}; \pi) = R_i^{\leq t}((\hat{\theta}^{\leq m_i}, \theta^{>m_i}); \pi)$ for all $t \geq m_i$ and any types $\theta^{>m_i}$ still to arrive. This is the earliest period in which agent $i$'s total value is known with certainty.
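When the state space is small enough to enumerate, the backward induction above is direct to implement. The sketch below assumes states, feasible decisions, rewards, and transition probabilities are supplied as plain Python callables and dictionaries; these interface names are ours, not the paper's, and for realistic OMD instances the history-based state space is far too large for this approach (the subject of Section 3).

```python
def value_iteration(states_by_t, actions, reward, prob, T):
    """Finite-horizon value iteration (illustrative interface, not the paper's).
    states_by_t[t] -- list of states at period t (1-indexed up to T)
    actions(h, t)  -- feasible decisions K^t(h)
    reward(h, k)   -- summed agent payoff R(h, k)
    prob(h, k)     -- dict mapping next state h' -> Prob(h' | h, k)
    Returns (V, policy): optimal MDP-values V*(h) and decisions pi*(h)."""
    V, policy = {}, {}
    # Terminal period: V*(h) = max_k R(h, k).
    for h in states_by_t[T]:
        best = max(actions(h, T), key=lambda k: reward(h, k))
        policy[h], V[h] = best, reward(h, best)
    # Backward induction for t = T-1, ..., 1.
    for t in range(T - 1, 0, -1):
        for h in states_by_t[t]:
            def q(k):  # one-step lookahead value of decision k in state h
                return reward(h, k) + sum(p * V[h2] for h2, p in prob(h, k).items())
            best = max(actions(h, t), key=q)
            policy[h], V[h] = best, q(best)
    return V, policy
```

With a correct model, the returned `policy` plays the role of the optimal policy $\pi^*$ used by the online VCG mechanism.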

Let $h^t(\hat{\theta}^{\leq t}; \pi)$ denote the state in period $t$ given reports $\hat{\theta}^{\leq t}$. Let $\hat{\theta}^{\leq t}_{-i} = \hat{\theta}^{\leq t} \setminus \hat{\theta}_i$.

Definition 2 (Online VCG mechanism) Given history $h \in H$, mechanism $M_{vcg} = (\Theta; \pi^*, p^{vcg})$ implements policy $\pi^*$ and collects payment,

$p^{vcg}_i(\hat{\theta}; \pi^*) = R_i^{\leq m_i}(\hat{\theta}^{\leq m_i}; \pi^*) - \left[ V^*(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}; \pi^*)) - V^*(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}_{-i}; \pi^*)) \right] \quad (1)$

from agent $i$ in some period $t \geq m_i$.

1Using histories as state will make the state space very large. Often, there will be some function g for which g(h) is a sufficient statistic for all possible states h. We ignore this possibility here.

Agent $i$'s payment is equal to its reported value discounted by the expected marginal value that it will contribute to the system, as determined by the MDP-value function for the policy in its arrival period. The incentive compatibility of the online VCG mechanism requires that the MDP satisfies stalling: the expected value from the optimal policy in every state in which an agent arrives must be at least the expected value from following the optimal action in that state as though the agent had never arrived, and then returning to the optimal policy. Clearly, the property $K^t(h^t) \supseteq K^t(h^t \setminus \theta_i)$ in any period $t$ in which agent $i$ has just arrived is sufficient for stalling. Stalling is satisfied whenever an agent's arrival does not force a change in action on a policy.
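Once $R_i^{\leq m_i}$ and the two value-function evaluations at agent $i$'s arrival state are available, the payment in Equation 1 reduces to a small computation. A hedged sketch follows; the function and argument names are ours, not the paper's:

```python
def online_vcg_payment(reported_reward_i, V_arrival_with_i, V_arrival_without_i):
    """Payment of Equation 1 (illustrative names, not from the paper).
    reported_reward_i   -- agent i's total reported reward up to its
                           commitment period m_i
    V_arrival_with_i    -- MDP-value of the arrival state with agent i's
                           report included
    V_arrival_without_i -- MDP-value of the arrival state with agent i's
                           report removed
    Agent i pays its reported value minus its expected marginal
    contribution to system value, measured at its arrival period."""
    marginal_contribution = V_arrival_with_i - V_arrival_without_i
    return reported_reward_i - marginal_contribution
```

For example, an agent that reports a total reward of 5 while raising the arrival-state value from 9 to 12 would pay $5 - (12 - 9) = 2$.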

Theorem 1 (Parkes & Singh [6]) An online VCG mechanism, $M_{vcg} = (\Theta; \pi^*, p^{vcg})$, based on an optimal policy $\pi^*$ for a correct MDP model that satisfies stalling, is Bayesian-Nash incentive compatible and implements the optimal MDP policy.

3 Solving Very Large MDPs Approximately

From Equation 1, it can be seen that making outcome and payment decisions at any point in time in an online VCG mechanism does not require a global value function or a global policy. Unlike most methods for approximately solving MDPs, which compute global approximations, the sparse-sampling methods of Kearns et al. [3] compute approximate values and actions for a single state at a time. Furthermore, sparse-sampling methods provide approximation guarantees that will be important in establishing equilibrium properties: they can compute an $\epsilon$-approximation to the optimal value and action in a given state in time independent of the size of the state space (though polynomial in $1/\epsilon$ and exponential in the time horizon). Thus, sparse-sampling methods are particularly suited to approximating online VCG, and we adopt them here.

The sparse-sampling algorithm uses the MDP model $M_f$ as a generative model, i.e., as a black box from which a sample of the next-state and reward distributions for any given state-action pair can be obtained. Given a state $s$ and an approximation parameter $\epsilon$, it computes an $\epsilon$-accurate estimate of the optimal value for $s$ as follows. We make the parameterization on $\epsilon$ explicit by writing sparse-sampling($\epsilon$). The algorithm builds out a depth-$T$ sampled tree in which each node is a state and each node's children are obtained by sampling each action in that state $m$ times (where $m$ is chosen to guarantee an $\epsilon$-approximation to the optimal value of $s$), and each edge is labeled with the sampled reward for that transition. The algorithm computes estimates of the optimal value for nodes in the tree working backwards from the leaves as follows. The leaf nodes have zero value. The value of a node is the maximum over the values of all actions in that node. The value of an action in a node is the average, over that action's $m$ outgoing edges, of the sampled reward plus the estimated value of the corresponding child node. The estimated optimal value of state $s$ is the value of the root node of the tree. The estimated optimal action in state $s$ is the action that achieves the value of the root node.
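A minimal sketch of this recursion, simplified to use a fixed width $m$ at every depth and no discounting; the generative-model interface (`actions`, `sample_step`) is an assumption of this sketch, not an API from the paper:

```python
def sparse_sample(state, depth, m, actions, sample_step):
    """Estimate (optimal value, greedy action) for `state` by building a
    depth-limited sampled tree with m sampled children per action per node.
    actions(state)        -- feasible decisions in this state
    sample_step(state, a) -- one sampled (reward, next_state) pair from
                             the generative model
    depth                 -- remaining horizon"""
    if depth == 0:
        return 0.0, None          # leaf nodes are assigned zero value
    best_value, best_action = float('-inf'), None
    for a in actions(state):
        total = 0.0
        for _ in range(m):        # sample action a in this state m times
            r, nxt = sample_step(state, a)
            child_value, _ = sparse_sample(nxt, depth - 1, m, actions, sample_step)
            total += r + child_value
        q = total / m             # Monte Carlo estimate of Q(state, a)
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action
```

Because each call expands $n \cdot m$ children per node down to the horizon, the running time grows with $(nm)^{depth}$ regardless of how many states the MDP has, which is exactly the property exploited here.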

Lemma 1 (Adapted from Kearns, Mansour & Ng [3]) The sparse-sampling($\epsilon$) algorithm, given access to a generative model for any $n$-action MDP $M$, takes as input any state $s \in S$ and any $\epsilon > 0$, outputs an action, and satisfies the following two conditions:

(Running Time) The running time of the algorithm is $O((nC)^T)$, where $C = f(n, 1/\epsilon, R_{max}, T)$ and $f$ is a polynomial function of the approximation parameter $1/\epsilon$, the number of actions $n$, the largest expected reward in a state $R_{max}$, and the horizon $T$. In particular, the running time has no dependence on the number of states.

(Near-Optimality) The value function of the stochastic policy implemented by the sparse-sampling($\epsilon$) algorithm, denoted $V^{ss}$, satisfies $|V^*(s) - V^{ss}(s)| \leq \epsilon$ simultaneously for all states $s \in S$.

It is straightforward to derive the following corollary from the proof of Lemma 1 in [3].

Corollary 1 The value function computed by the sparse-sampling($\epsilon$) algorithm, denoted $\hat{V}^{ss}$, is near-optimal in expectation, i.e., $|V^*(s) - E\{\hat{V}^{ss}(s)\}| \leq \epsilon$ simultaneously for all states $s \in S$, where the expectation is over the randomness introduced by the sparse-sampling($\epsilon$) algorithm.

4 Approximately Efficient Online Mechanism Design

In this section, we define an approximate online VCG mechanism and consider the effect on incentives of substituting the sparse-sampling($\epsilon$) algorithm into the online VCG mechanism. We model agents as indifferent between decisions that differ by at most a utility of $\epsilon > 0$, and consider an approximate Bayesian-Nash equilibrium. Let $v_i(\theta; \pi)$ denote the final value to agent $i$ after reports $\theta$, given policy $\pi$.

Definition 3 (approximate BNE) Mechanism $M_{vcg} = (\Theta; \pi, p^{vcg})$ is $\epsilon$-Bayesian-Nash incentive compatible if

$E_{\theta | \theta^{\leq t'}}\{v_i(\theta; \pi) - p^{vcg}_i(\theta; \pi)\} + \epsilon \geq E_{\theta | \theta^{\leq t'}}\{v_i((\theta_{-i}, \hat{\theta}_i); \pi) - p^{vcg}_i((\theta_{-i}, \hat{\theta}_i); \pi)\} \quad (2)$

where agent $i$ with type $\theta_i$ arrives in period $t'$, and with the expectation taken over future types given current reports $\theta^{\leq t'}$.

In particular, when truth-telling is an $\epsilon$-BNE we say that the mechanism is $\epsilon$-BNE incentive compatible, and no agent can improve its expected utility by more than $\epsilon > 0$, for any type, as long as other agents are bidding truthfully. Equivalently, one can interpret an $\epsilon$-BNE as an exact equilibrium for agents that face a computational cost of at least $\epsilon$ to compute the exact BNE.

Definition 4 (approximate mechanism) A sparse-sampling($\epsilon$) based approximate online VCG mechanism, $M_{vcg}(\epsilon) = (\Theta; \tilde{\pi}, \tilde{p}^{vcg})$, uses the sparse-sampling($\epsilon$) algorithm to implement stochastic policy $\tilde{\pi}$ and collects payment

$\tilde{p}^{vcg}_i(\hat{\theta}; \tilde{\pi}) = R_i^{\leq m_i}(\hat{\theta}^{\leq m_i}; \tilde{\pi}) - \left[ \hat{V}^{ss}(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}; \tilde{\pi})) - \hat{V}^{ss}(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}_{-i}; \tilde{\pi})) \right]$

from agent $i$ in some period $t \geq m_i$, for commitment period $m_i$.

Our proof of incentive compatibility first demonstrates that an approximate delayed VCG mechanism [1, 6] is $\epsilon$-BNE incentive compatible. With this, we demonstrate that the expected value of the payments in the approximate online VCG mechanism is within $3\epsilon$ of the payments in the approximate delayed VCG mechanism. The delayed VCG mechanism makes the same decisions as the online VCG mechanism, except that payments are delayed until the final period and computed as:

$p^{Dvcg}_i(\hat{\theta}; \pi) = R_i^{\leq T}(\hat{\theta}; \pi) - \left[ R^{\leq T}(\hat{\theta}; \pi) - R^{\leq T}(\hat{\theta}_{-i}; \pi) \right] \quad (3)$

where the discount is computed ex post, once the effect of an agent on the system value is known. In an approximate delayed-VCG mechanism, the role of the sparse-sampling algorithm is to implement an approximate policy, as well as counterfactual policies for the worlds $\hat{\theta}_{-i}$ without each agent $i$ in turn. The total reported reward to agents $j \neq i$ over this counterfactual series of states is used to define the payment to agent $i$.
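The delayed payment of Equation 3, computed from realized totals at the horizon, can be sketched in the same hedged style (the names are ours, not the paper's):

```python
def delayed_vcg_payment(R_i_total, R_total_with_i, R_total_without_i):
    """Delayed VCG payment of Equation 3 (illustrative names), computed ex
    post once the effect of agent i on system value is known.
    R_i_total         -- agent i's total reported reward over the horizon
    R_total_with_i    -- total reported reward of all agents under the
                         actual policy run
    R_total_without_i -- total reported reward under the counterfactual
                         run without agent i"""
    realized_marginal_contribution = R_total_with_i - R_total_without_i
    return R_i_total - realized_marginal_contribution
```

The contrast with Equation 1 is that the discount here is a realized quantity, not an expected one, which is why payments must wait until the final period.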

Lemma 2 Truthful bidding is an $\epsilon$-Bayesian-Nash equilibrium in the sparse-sampling($\epsilon$) based approximate delayed-VCG mechanism.

Proof: Let $\tilde{\pi}$ denote the approximate policy computed by the sparse-sampling algorithm. Assume agents $j \neq i$ are truthful. Now, if agent $i$ bids truthfully its expected utility is

$E_{\theta | \theta^{\leq a_i}}\left\{ v_i(\theta; \tilde{\pi}) + \sum_{j \neq i} R_j^{\leq T}(\theta; \tilde{\pi}) - \sum_{j \neq i} R_j^{\leq T}(\theta_{-i}; \tilde{\pi}) \right\} \quad (4)$

where the expectation is over both the types yet to be reported and the randomness introduced by the sparse-sampling($\epsilon$) algorithm. Substituting for the expected value of the first two terms, which is the MDP-value of the arrival state under $\tilde{\pi}$, the expected utility from truthful bidding is at least

$V^*(h^{a_i}(\theta^{\leq a_i}; \tilde{\pi})) - V^{ss}(h^{a_i}(\theta^{\leq a_i}_{-i}; \tilde{\pi})) - \epsilon \quad (5)$

because $V^{ss}(h^{a_i}(\theta^{\leq a_i}; \tilde{\pi})) \geq V^*(h^{a_i}(\theta^{\leq a_i}; \tilde{\pi})) - \epsilon$ by Lemma 1. Now, ignore the term $R^{\leq T}(\theta_{-i}; \tilde{\pi})$ in Equation (4), which is independent of agent $i$'s bid $\hat{\theta}_i$, and consider the maximal expected utility to agent $i$ from some non-truthful bid. The effect of $\hat{\theta}_i$ on the first two terms is indirect, through a change in the policy for periods $t \geq a_i$. An agent can change the policy only indirectly, by changing the center's view of the state by misreporting its type. By definition, the agent can do no better than selecting the optimal policy $\pi^*$, which is defined to maximize the expected value of the first two terms. Putting this together, the expected utility from $\hat{\theta}_i$ is at most $V^*(h^{a_i}(\theta^{\leq a_i}; \tilde{\pi})) - V^{ss}(h^{a_i}(\theta^{\leq a_i}_{-i}; \tilde{\pi}))$, and at most $\epsilon$ better than that from bidding truthfully.

Theorem 2 Truthful bidding is a $4\epsilon$-Bayesian-Nash equilibrium in the sparse-sampling($\epsilon$) based approximate online VCG mechanism.

Proof: Assume agents $j \neq i$ bid truthfully, and consider report $\hat{\theta}_i$. Clearly, the policy implemented in the approximate online VCG mechanism is the same as in the delayed-VCG mechanism for all $\hat{\theta}_i$. It is left to show that the expected values of the payments are within $3\epsilon$ for all $\hat{\theta}_i$. From this we conclude that the expected utility to agent $i$ in the approximate online VCG mechanism is always within $3\epsilon$ of that in the approximate delayed-VCG mechanism, and therefore truthful bidding is a $4\epsilon$-BNE by Lemma 2. The expected payment in the approximate online VCG mechanism is

$E_{\theta | \hat{\theta}^{\leq \hat{a}_i}}\{R_i^{\leq T}(\hat{\theta}; \tilde{\pi})\} - \left[ E\{\hat{V}^{ss}(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}; \tilde{\pi}))\} - E\{\hat{V}^{ss}(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}_{-i}; \tilde{\pi}))\} \right]$

The value function computed by the sparse-sampling($\epsilon$) algorithm is a random variable to agent $i$ at the time of bidding, and the second and third expectations are over the randomness introduced by the sparse-sampling($\epsilon$) algorithm. The first term is the same as in the payment in the approximate delayed-VCG mechanism. By Corollary 1, the value function estimated by the sparse-sampling($\epsilon$) algorithm is near-optimal in expectation, so the total of the second two terms is at least $V^*(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}_{-i}; \tilde{\pi})) - V^*(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}; \tilde{\pi})) - 2\epsilon$. Ignoring the first

term in $p^{Dvcg}_i$, the expected payment in the approximate delayed-VCG mechanism is no more than $V^*(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}_{-i}; \tilde{\pi})) - (V^*(h^{\hat{a}_i}(\hat{\theta}^{\leq \hat{a}_i}; \tilde{\pi})) - \epsilon)$, because of the near-optimality of the value function of the stochastic policy (Lemma 1). Putting this together, we have a maximum difference in expected payments of $3\epsilon$. A similar analysis yields a maximum difference of $3\epsilon$ when an upper bound is taken on the payment in the online VCG mechanism and compared with a lower bound on the payment in the delayed mechanism.

Theorem 3 For any parameter $\epsilon > 0$, the sparse-sampling($\epsilon$) based approximate online VCG mechanism has $\epsilon$-efficiency in a $4\epsilon$-BNE.

5 Empirical Evaluation of Approximate Online VCG

The WiFi Problem. The WiFi problem considers a fixed number of channels $C$ with a random number of agents (at most $A$) arriving per period, and time horizon $T = 50$. Agents demand a single channel and arrive with a per-unit value drawn i.i.d. from $V = \{v_1, \ldots, v_k\}$ and a duration in the system drawn i.i.d. from $D = \{d_1, \ldots, d_l\}$. The value model requires that any allocation to agent $i$ must be for contiguous periods, and be made while the agent is present (i.e., during periods $[a_i, a_i + d_i]$, for arrival $a_i$ and duration $d_i$). An agent's value for an allocation of duration $x$ is $v_i x$, where $v_i$ is its per-unit value. Let $d_{max}$ denote the maximal possible allocated duration. We define the following MDP components:

State space: We use the following compact, sufficient statistic of history: a resource schedule, a (weakly non-decreasing) vector of length $d_{max}$ that counts the number of channels available in the current period and the next $d_{max} - 1$ periods given previous actions ($C$ channels are available after this); an agent vector of size $(k \times l)$ that counts the agents present but not yet allocated, for each possible per-unit value and each possible duration (the duration is automatically decremented when an agent remains in the system for a period without receiving an allocation); and the time remaining until horizon $T$.

Action space: The policy can postpone an agent's allocation, or allocate an agent to a channel for the remaining duration of the agent's time in the system, provided a channel is available and the remaining duration is not greater than $d_{max}$.

Payoff function: The reward at a time step is the summed value obtained from all agents for which an allocation is made in this time step. This is the total value such an agent will receive before it departs.

Transition probabilities: The change in the resource schedule, and in the part of the agent vector that relates to agents currently present, is deterministic. The random additions to the agent vector at each step are unaffected by the actions and also independent of time.
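As an illustration of how compact this statistic is, the following sketch encodes the state and the deterministic part of an allocation's effect; the class and field names are ours, not the paper's:

```python
from dataclasses import dataclass

@dataclass
class WiFiState:
    """Illustrative encoding of the compact WiFi state described above."""
    resource_schedule: list   # channels free now and in the next d_max - 1 periods
    agent_counts: dict        # (per-unit value, remaining duration) -> count
    time_remaining: int       # periods left before the horizon T

    def allocate(self, value, duration):
        """Allocate one waiting (value, duration) agent to a channel for the
        rest of its stay; the full reward value * duration accrues now."""
        assert self.agent_counts.get((value, duration), 0) > 0, "no such agent"
        assert all(self.resource_schedule[t] > 0 for t in range(duration)), \
            "no channel free for the full duration"
        for t in range(duration):
            self.resource_schedule[t] -= 1   # channel busy for `duration` periods
        self.agent_counts[(value, duration)] -= 1
        return value * duration              # summed payoff for this decision
```

The random part of the transition (new arrivals added to `agent_counts`) is sampled from $f(\theta)$ and is independent of the action taken, matching the description above.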

Mechanisms. In the exact online VCG mechanism we compute an optimal policy, and optimal MDP values, offline using finite-horizon value iteration [7]. In the sparse-sampling($\epsilon$) approach, we define a sampling-tree depth $L$ (perhaps $< T$) and sample each state $m$ times. This limited sampling depth places a lower bound on the best possible approximation accuracy of the sparse-sampling algorithm. We also employ agent pruning, with the agent vector in the state representation pruned to remove dominated agents: consider an agent type with duration $d$ and value $v$, and remove all but $C - N$ such agents, where $N$ is the number of agents that either have remaining duration $\leq d$ and value $> v$, or duration $< d$ and value $\geq v$. In computing payments we use factoring, and determine VCG payments only once for each type of agent to arrive. We compare performance with a simple fixed-price allocation scheme that, given a particular problem, computes offline a fixed number of periods and a fixed price (agents are queued and offered the price at random as resources become available) that yield the maximum expected total value.
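The pruning rule can be sketched as follows, under our reading of the dominance test: an agent becomes removable once at least $C$ agents are present that each have at least its value and at most its remaining duration (strictly better in one of the two dimensions). The per-agent formulation and names here are ours, not the paper's:

```python
def prune_agents(agents, C):
    """Remove dominated agents from the state (illustrative sketch).
    agents -- list of (per-unit value, remaining duration) pairs
    C      -- number of channels; with C strictly-preferable agents present,
              a dominated agent can never be allocated."""
    def dominates(a, b):
        # a dominates b: at least b's value in at most b's duration,
        # and strictly better in one of the two dimensions.
        (v1, d1), (v2, d2) = a, b
        return (v1 >= v2 and d1 <= d2) and (v1 > v2 or d1 < d2)

    kept = []
    for agent in agents:
        n_dominating = sum(1 for other in agents if dominates(other, agent))
        if n_dominating < C:   # fewer than C better agents: still allocatable
            kept.append(agent)
    return kept
```

Pruning shrinks the agent vector before the lookahead tree is built, which is where the reported 50% reduction in sampled agents comes from.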

Results. In the default model, we set $C = 5$, $A = 5$, define the set of values $V = \{1, 2, 3\}$ and the set of durations $D = \{1, 2, 6\}$, with lookahead $L = 4$ and sampling width $m = 6$. All results are averaged over at least 10 instances, and experiments were performed on a 3 GHz P4 with 512 MB RAM. Value and revenue are normalized by the total value demanded by all agents, i.e., the value with infinite capacity.2 Looking first at economic properties, Figure 1(A) shows the effect of varying the number of agents from 2 to 12, comparing value and revenue between the approximate online VCG mechanism and the fixed-price mechanism. Notice that the MDP method dominates the price-based scheme on value, with a notable performance improvement over fixed price when demand is neither very low (no contention) nor very high (lots of competition). Revenue is also generally better from the MDP-based mechanism than from the fixed-price scheme. Figure 1(B) shows the similar effect of varying the number of channels from 3 to 10.

Turning now to computational properties, Figure 1(C) illustrates the effectiveness of sparse-sampling, and also of agent pruning, sampled over 100 instances. The model is very

2This explains why the value appears to drop as we scale up the number of agents: the total available value is increasing but supply remains fixed.

[Figure 1 here. Panels (A) and (B) plot normalized value and revenue (%) for the MDP-based and fixed-price mechanisms against the number of agents (2 to 12) and the number of channels (3 to 10), respectively. Panel (C) plots percentage of exact value and percentage of exact time against sampling width, with and without pruning. Panel (D) plots run time (s) against the number of agents, with and without pruning.]

Figure 1: (A) Value and Revenue vs. Number of Agents. (B) Value and Revenue vs. Number of Channels. (C) Effect of Sampling Width. (D) Pruning speed-up.

small: $A = 2$, $C = 2$, $D = \{1, 2, 3\}$, $V = \{1, 2, 3\}$ and $L = 4$, to allow a comparison with the compute time for an optimal policy. The sparse-sampling method already runs in less than 1% of the time for optimal value iteration (right-hand axis), with accuracy as high as 96% of the optimal. Pruning provides an incremental speed-up, and actually improves accuracy, presumably by making better use of each sample. Figure 1(D) shows that pruning is extremely useful computationally (in comparison with plain sparse-sampling), for the default model parameters and as the number of agents is increased from 2 to 12. Pruning is effective, removing around 50% of agents (summed across all states in the lookahead tree) at 5 agents.

Acknowledgments. David Parkes was funded by NSF grant IIS-0238147. Satinder Singh was funded by NSF grant CCF 0432027 and by a grant from DARPA's IPTO program.
