Part of Advances in Neural Information Processing Systems 19 (NIPS 2006)
Eli Ben-sasson, Ehud Kalai, Adam Kalai
A central question in game theory and artificial intelligence is how a rational agent should behave in a complex environment, given that it cannot perform unbounded computations. We study strategic aspects of this question by formulating a simple model of a game with additional costs (computational or otherwise) for each strategy. First we connect this to zero-sum games, proving a counter-intuitive generalization of the classic min-max theorem to zero-sum games with the addition of strategy costs. We then show that potential games with strategy costs remain potential games. Both zero-sum and potential games with strategy costs maintain a very appealing property: simple learning dynamics converge to equilibrium. 1 The Approach and Basic Model How should an intelligent agent play a complicated game like chess, given that it does not have unlimited time to think? This question reflects one fundamental aspect of "bounded rationality," a term coined by Herbert Simon [1]. However, bounded rationality has proven to be a slippery concept to formalize (prior work has focused largely on finite automata playing simple repeated games such as prisoner's dilemma, e.g. [2, 3, 4, 5]). This paper focuses on the strategic aspects of decisionmaking in complex multi-agent environments, i.e., on how a player should choose among strategies of varying complexity, given that its opponents are making similar decisions. Our model applies to general strategic games and allows for a variety of complexities that arise in real-world applications. For this reason, it is applicable to one-shot games, to extensive games, and to repeated games, and it generalizes existing models such as repeated games played by finite automata. To easily see that bounded rationality can drastically affect the outcome of a game, consider the following factoring game. Player 1 chooses an n-bit number and sends it to Player 2, who attempts to find its prime factorization. If Player 2 is correct, he is paid 1 by Player 1, otherwise he pays 1 to Player 1. Ignoring complexity costs, the game is a trivial win for Player 2. However, for large n, the game should is essentially a win for Player 1, who can easily output a large random number that Player 2 cannot factor (under appropriate complexity assumptions). In general, the outcome of a game (even a zero-sum game like chess) with bounded rationality is not so clear. To concretely model such games, we consider a set of available strategies along with strategy costs. Consider an example of two players preparing to play a computerized chess game for 100Kprize.Supposetheplayerssimultaneouslychooseamongtwoavailableoptions:tousea10K program A or an advanced program B, which costs 50K.Werefertotherowchooseraswhiteandtothecolumnchooserasblack,withthecorrespondingadvantagesreflectedbythewinprobabilitiesofwhitedescribedinTable1a.Forexample,whenbothplayersuseprogramA,whitewins55Figure1:a)Tableoffirst−playerwinningprobabilitiesbasedonprogramchoices.b)Tableofexpectednetearningsinthousandsofdollars.Theuniqueequilibriumis(A,B)whichstronglyfavorsthesecondplayer.Asurprisingpropertyisevidentintheabovegame.Everythingaboutthegameseemstofavorwhite.Yetduetothe(symmetric)costs,attheuniqueNashequilibrium(A,B)ofTable1b,blackwins8734K more than white. In fact, it is a dominant strategy for white to play A and for black to play B. To see this, note that playing B increases white's probability of winning by 38%, independent of what black chooses. Since the pot is 100K,thisisworth38K in expectation, but B costs 40KmorethanA.Ontheotherhand,blackenjoysa4240K. Before formulating the general model, we comment on some important aspects of the chess example. First, traditional game theory states that chess can be solved in "only" two rounds of elimination of dominated strategies [10], and the outcome with optimal play should always be the same: either a win for white or a win for black. This theoretical prediction fails in practice: in top play, the outcome is very nondeterministic with white winning roughly twice as often as black. The game is too large and complex to be solved by brute force. Second, we have been able to analyze the above chess program selection example exactly because we formulated as a game with a small number of available strategies per player. Another formulation that would fit into our model would be to include all strategies of chess, with some reasonable computational costs. However, it is beyond our means to analyze such a large game. Third, in the example above we used monetary software cost to illustrate a type of strategy cost. But the same analysis could accommodate many other types of costs that can be measured numerically and subtracted from the payoffs, such as time or effort involved in the development or execution of a strategy, and other resource costs. Additional examples in this paper include the number of states in a finite automaton, the number of gates in a circuit, and the number of turns on a commuter's route. Our analysis is limited, however, to cost functions that depend only on the strategy of the player and not the strategy chosen by its opponent. For example, if our players above were renting computers A or B and paying for the time of actual usage, then the cost of using A would depend on the choice of computer made by the opponent. Generalizing the example above, we consider a normal form game with the addition of strategy costs, a player-dependent cost for playing each available strategy. Our main results regard two important classes of games: constant-sum and potential games. Potential games with strategy costs remain potential games. While two-person constant-sum games are no longer constant, we give a basic structural description of optimal play in these games. Lastly, we show that known learning dynamics converge in both classes of games. 2 Definition of strategy costs We first define an N -person normal-form game G = (N , S, p) consisting of finite sets of (available) pure strategies S = (S1 , . . . , SN ) for the N players, and a payoff function p : S1 . . . SN RN . Players simultaneously choose strategies si Si after which player i is rewarded with pi (s1 , . . . , sN ). A randomized or mixed strategy i for player i is a probability distribution over its pure strategies Si , x . x i i = R|Si | : = 1, xj 0 j We extend p to 1 . . . N in the natural way, i.e., pi (1 , . . . , N ) = E[pi (s1 , . . . , sN )] where each si is drawn from i , independently. Denote by s-i = (s1 , s2 , . . . , si-1 , si+1 , . . . , sN ) and similarly for -i . A best response by player i to -i is i i such that pi (i , -i ) = maxi i pi (i , -i ). A (mixed strategy) Nash equilibrium of G is a vector of strategies (1 , . . . , N ) 1 . . . N such that each i is a best response to -i . We now define G-c , the game G with strategy costs c = (c1 , . . . , cN ), where ci : Si R. It is simply an N -person normal-form game G-c = (N , S, p-c ) with the same sets of pure strategies as G, but with a new payoff function p-c : S1 . . . SN RN where, p-c (s1 , . . . , sN ) = pi (s1 , . . . , sN ) - ci (si ), for i = 1, . . . , N . i We similarly extend ci to i in the natural way. 3 Two-person constant-sum games with strategy costs Recall that a game is constant-sum (k -sum for short) if at every combination of individual strategies, the players' payoffs sum to some constant k. Two-person k -sum games have some important properties, not shared by general sum games, which result in more effective game-theoretic analysis. In particular, every k -sum game has a unique value v R. A mixed strategy for player 1 is called optimal if it guarantees payoff v against any strategy of player 2. A mixed strategy for player 2 is optimal if it guarantees k - v against any strategy of player 1. The term optimal is used because optimal strategies guarantee as much as possible (v + k - v = k ) and playing anything that is not optimal can result in a lesser payoff, if the opponent responds appropriately. (This fact is easily illustrated in the game rock-paper-scissors randomizing uniformly among the strategies guarantees each player 50% of the pot, while playing anything other than uniformly random enables the opponent to win strictly more often.) The existence of optimal strategies for both players follows from the min-max theorem. An easy corollary is that the Nash equilibria of a k -sum game are exchangeable: they are simply the cross-product of the sets of optimal mixed strategies for both players. Lastly, it is well-known that equilibria in two-person k -sum games can be learned in repeated play by simple dynamics that are guaranteed to converge [17]. With the addition of strategy costs, a k -sum game is no longer k -sum and hence it is not clear, at first, what optimal strategies there are, if any. (Many examples of general-sum games do not have optimal strategies.) We show the following generalization of the above properties for zero-sum games with strategies costs. Theorem 1. Let G be a finite two-person k -sum game and G-c be the game with strategy costs c = (c1 , c2 ). 1. There is a value v R for G-c and nonempty sets OPT1 and OPT2 of optimal mixed strategies for the two players. OPT1 is the set of strategies that guarantee player 1 payoff v - c2 (2 ), against any strategy 2 chosen by player 2. Similarly, OPT2 is the set of strategies that guarantee player 2 payoff k - v - c1 (1 ) against any 1 . 2. The Nash equilibria of G-c are exchangeable: the set of Nash equilibria is OPT1 OPT2 . 3. The set of net payoffs possible at equilibrium is an axis-parallel rectangle in R2 . For zero-sum games, the term optimal strategy was natural: the players could guarantee v and k - v , respectively, and this is all that there was to share. Moreover, it is easy to see that only pairs of optimal strategies can have the Nash equilibria property, being best responses to each other. In the case of zero-sum games with strategy costs, the optimal structure is somewhat counterintuitive. First, it is strange that the amount guaranteed by either player depends on the cost of the other player's action, when in reality each player pays the cost of its own action. Second, it is not even clear why we call these optimal strategies. To get a feel for this latter issue, notice that the sum of the net payoffs to the two players is always k - c1 (1 ) - c2 (2 ), which is exactly the total of what optimal strategies guarantee, v - c2 (2 ) + k - v - c1 (1 ). Hence, if both players play what we call optimal strategies, then neither player can improve and they are at Nash equilibrium. On the other hand, suppose player 1 selects a strategy 1 that does not guarantee him payoff at least v - c2 (2 ). This means that there is some response 2 by player 2 for which player 1's payoff is < v - c2 (2 ) and hence player 2's payoff is > k - v - c1 (1 ). Thus player 2's best response to 1 must give player 2 payoff > k - v - c1 (1 ) and leave player 1 with < v - c2 (2 ). The proof of the theorem (the above reasoning only implies part 2 from part 1) is based on the following simple observation. Consider the k -sum game H = (N , S, q ) with the following payoffs: q1 (s1 , s2 ) = p1 (s1 , s2 ) - c1 (s1 ) + c2 (s2 ) = p-c (s1 , s2 ) + c2 (s2 ) 1 q2 (s1 , s2 ) = p2 (s1 , s2 ) - c2 (s1 ) + c1 (s1 ) = p-c (s1 , s2 ) + c1 (s1 ) 2 That is to say, Player 1 pays its strategy cost to Player 2 and vice versa. It is easy to verify that, 1 , 1 1 , 2 2 q1 (1 , 2 ) - q1 (1 , 2 ) = p-c (1 , 2 ) - p-c (1 , 2 ) 1 1 (1) This means that the relative advantage in switching strategies in games G-c and H are the same. In particular, 1 is a best response to 2 in G-c if and only if it is in H . A similar equality holds for player 2's payoffs. Note that these conditions imply that the games G-c and H are strategically equivalent in the sense defined by Moulin and Vial [16]. Proof of Theorem 1. Let v be the value of the game H . For any strategy 1 that guarantees player 1 payoff v in H , 1 guarantees player 1 v - c2 (2 ) in G-c . This follows from the definition of H . Similarly, any strategy 2 that guarantees player 2 payoff k - v in H will guarantee k - v - c1 (1 ) in G-c . Thus the sets OPT1 and OPT2 are non-empty. Since v - c2 (2 ) + k - v - c1 (1 ) = k - c1 (1 ) - c2 (2 ) is the sum of the payoffs in G-c , nothing greater can be guaranteed by either player. Since the best responses of G-c and H are the same, the Nash equilibria of the two games are the same. Since H is a k -sum game, its Nash equilibria are exchangeable, and thus we have part 2. (This holds for any game that is strategically equivalent to k -sum.) Finally, the optimal mixed strategies OPT1 , OPT2 of any k -sum game are convex sets. If we look at the achievable costs of the mixed strategies in OPTi , by the definition of the cost of a mixed strategy, this will be a convex subset of R, i.e., an interval. By parts 1 and 2, the set of achievable net payoffs at equilibria of G-c are therefore the cross-product of intervals. To illustrate Theorem 1 graphically, Figure 2 gives a 4 4 example with costs of 1, 2, 3, and 4, respectively. It illustrates a situation with multiple optimal strategies. Notice that player 1 is completely indifferent between its optimal choices A and B, and player 2 is completely indifferent between C and D. Thus the only question is how kind they would like to be to their opponent. The (A,C) equilibrium is perhaps most natural as it is yields the highest payoffs for both parties. Note that the proof of the above theorem actually shows that zero-sum games with costs share additional appealing properties of zero-sum games. For example, computing optimal strategies is a polynomial time-computation in an n n game, as it amounts to computing the equilibria of H . We next show that they also have appealing learning properties, though they do not share all properties of zero-sum games.1 3.1 Learning in repeated two-person k -sum games with strategy costs Another desirable property of k -sum games is that, in repeated play, natural learning dynamics converge to the set of Nash equilibria. Before we state the analogous conditions for k -sum games with costs, we briefly give a few definitions. A repeated game is one in which players chooses a sequence of strategies vectors s1 , s2 , . . ., where each st = (st , . . . , st ) is a strategy vector of some 1 N fixed stage game G = (N , S, p). Under perfect monitoring, when selecting an action in any period the players know all the previous selected actions.As we shall discuss, it is possible to learn to play without perfect monitoring as well. 1 One property that is violated by the chess example is the "advantage of an advantage" property. Say Player 1 has the advantage over Player 2 in a square game if p1 (s1 , s2 ) p2 (s2 , s1 ) for all strategies s1 , s2 . At equilibrium of a k-sum game, a player with the advantage must have a payoff at least as large as its opponent. This is no longer the case after incorporating strategy costs, as seen in the chess example, where Player 1 has the advantage (even including strategy costs), yet his equilibrium payoff is smaller than 2's. a) A B C D b) A (-1) B (-2) C (-3) D (-4) A 6, 4 7, 3 7.5, 2.5 8.5, 1.5 A (-1) 5, 3 5, 2 4.5, 1.5 4.5, 0.5 B 5, 5 6, 4 6.5, 3.5 7, 3 B (-2) 4, 3 4, 2 3.5, 1.5 3, 1 C 3, 7 4, 6 4.5, 5.5 5.5, 4.5 C (-3) 2, 4 2, 3 1.5, 2.5 1.5, 1.5 D 2, 8 3, 7 3.5, 6.5 4.5, 5.5 D (-4) 1, 4 1, 3 0.5, 2.5 0.5, 1.5 PLAYER 2 NET PAYOFF