{"title": "Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions", "book": "Advances in Neural Information Processing Systems", "page_first": 1880, "page_last": 1888, "abstract": "Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per iteration time cost of CFR by traversing a sampled portion of the tree. The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player\u2019s actions according to the player\u2019s average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff.", "full_text": "Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

Richard Gibson, Neil Burch, Marc Lanctot, and Duane Szafron
Department of Computing Science, University of Alberta
{rggibson | nburch | lanctot | dszafron}@ualberta.ca
Edmonton, Alberta, T6G 2E8, Canada

Abstract

Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per-iteration time cost of CFR by traversing a smaller, sampled portion of the tree. The most effective previous instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. 
In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player's actions according to the player's average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff.

1 Introduction

An extensive-form game is a common formalism used to model sequential decision making problems containing multiple agents, imperfect information, and chance events. A typical solution concept in games is a Nash equilibrium profile. Counterfactual Regret Minimization (CFR) [12] is an iterative algorithm that, in 2-player zero-sum extensive-form games, converges to a Nash equilibrium. Other techniques for computing Nash equilibria of 2-player zero-sum games include linear programming [8] and the Excessive Gap Technique [6]. Theoretical results indicate that for a fixed solution quality, CFR takes a number of iterations at most quadratic in the size of the game [12, Theorem 4]. Thus, as we consider larger games, more iterations are required to obtain a fixed solution quality. Nonetheless, CFR's versatility and memory efficiency make it a popular choice.
Monte Carlo CFR (MCCFR) [9] can be used to reduce the traversal time per iteration by considering only a sampled portion of the game tree. For example, Chance Sampling (CS) [12] is an instance of MCCFR that only traverses the portion of the game tree corresponding to a single, sampled sequence of chance's actions. However, in games where a player has many possible actions, such as no-limit poker, iterations of CS are still very time consuming. 
This is because CS considers all possible player actions, even if many actions are poor or contribute little to the algorithm's computation.
Our main contribution in this paper is a new MCCFR algorithm that samples player actions and is suitable for games involving many player choices. Firstly, we provide tighter theoretical bounds on the number of iterations required by CFR and previous MCCFR algorithms to reach a fixed solution quality. Secondly, we use these new bounds to motivate our new MCCFR sampling algorithm. By using a player's average strategy to sample actions, convergence time is significantly reduced in large games with many player actions. We prove convergence and show that our new algorithm approaches equilibrium faster than previous sampling schemes in both no-limit poker and Bluff.

2 Background

A finite extensive game contains a game tree with nodes corresponding to histories of actions h ∈ H and edges corresponding to actions a ∈ A(h) available to player P(h) ∈ N ∪ {c} (where N is the set of players and c denotes chance). When P(h) = c, σ_c(h, a) is the (fixed) probability of chance generating action a at h. Each terminal history z ∈ Z has associated utilities u_i(z) for each player i. We define ∆_i = max_{z,z'∈Z} u_i(z) − u_i(z') to be the range of utilities for player i. Non-terminal histories are partitioned into information sets I ∈ I_i representing the different game states that player i cannot distinguish between. For example, in poker, player i does not see the private cards dealt to the opponents, and thus all histories differing only in the private cards of the opponents are in the same information set for player i. The action sets A(h) must be identical for all h ∈ I, and we denote this set by A(I). 
We define |A_i| = max_{I∈I_i} |A(I)| to be the maximum number of actions available to player i at any information set. We assume perfect recall, which guarantees that players always remember the information that was revealed to them and the order in which it was revealed.
A (behavioral) strategy for player i, σ_i ∈ Σ_i, is a function that maps each information set I ∈ I_i to a probability distribution over A(I). A strategy profile is a vector of strategies σ = (σ_1, ..., σ_|N|) ∈ Σ, one for each player. Let u_i(σ) denote the expected utility for player i, given that all players play according to σ. We let σ_{−i} refer to the strategies in σ excluding σ_i. Let π^σ(h) be the probability of history h occurring if all players choose actions according to σ. We can decompose

π^σ(h) = ∏_{i∈N∪{c}} π^σ_i(h),

where π^σ_i(h) is the contribution to this probability from player i when playing according to σ_i (or from chance when i = c). Let π^σ_{−i}(h) be the product of all players' contributions (including chance) except that of player i. Let π^σ(h, h') be the probability of history h' occurring after h, given h has occurred. Furthermore, for I ∈ I_i, the probability of player i playing to reach I is π^σ_i(I) = π^σ_i(h) for any h ∈ I, which is well-defined due to perfect recall.
A best response to σ_{−i} is a strategy that maximizes player i's expected payoff against σ_{−i}. The best response value for player i is the value of that strategy, b_i(σ_{−i}) = max_{σ'_i∈Σ_i} u_i(σ'_i, σ_{−i}). A strategy profile σ is an ε-Nash equilibrium if no player can unilaterally deviate from σ and gain more than ε; i.e., u_i(σ) + ε ≥ b_i(σ_{−i}) for all i ∈ N. A game is two-player zero-sum if N = {1, 2} and u_1(z) = −u_2(z) for all z ∈ Z. 
In this case, the exploitability of σ, e(σ) = (b_1(σ_2) + b_2(σ_1))/2, measures how much σ loses to a worst case opponent when players alternate positions. A 0-Nash equilibrium (or simply a Nash equilibrium) has zero exploitability.
Counterfactual Regret Minimization (CFR) [12] is an iterative algorithm that, for two-player zero-sum games, computes an ε-Nash equilibrium profile with ε → 0. CFR has also been shown to work well in games with more than two players [1, 3]. On each iteration t, the base algorithm, "vanilla" CFR, traverses the entire game tree once per player, computing the expected utility for player i at each information set I ∈ I_i under the current profile σ^t, assuming player i plays to reach I. This expectation is the counterfactual value for player i,

v_i(I, σ) = Σ_{z∈Z_I} u_i(z) π^σ_{−i}(z[I]) π^σ(z[I], z),

where Z_I is the set of terminal histories passing through I and z[I] is that history along z contained in I. For each action a ∈ A(I), these values determine the counterfactual regret at iteration t,

r^t_i(I, a) = v_i(I, σ^t_{(I→a)}) − v_i(I, σ^t),

where σ_{(I→a)} is the profile σ except that at I, action a is always taken. The regret r^t_i(I, a) measures how much player i would rather play action a at I than play σ^t. 
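In code, one regret update at a single information set can be sketched as follows (a minimal illustration, not the paper's implementation; the action-value dictionary and both helper names are hypothetical, and the regret-matching step follows equation (1)):

```python
def regret_matching(cum_regret):
    """Map cumulative counterfactual regrets to a strategy, per equation (1).

    Positive regrets are normalized; if none are positive, play uniformly.
    """
    positives = {a: max(r, 0.0) for a, r in cum_regret.items()}
    total = sum(positives.values())
    if total == 0:
        return {a: 1.0 / len(cum_regret) for a in cum_regret}
    return {a: p / total for a, p in positives.items()}

def update_regrets(cum_regret, strategy, action_values):
    """Accumulate r^t_i(I, a) = v_i(I, sigma_(I->a)) - v_i(I, sigma) for one iteration."""
    expected = sum(strategy[a] * action_values[a] for a in action_values)
    for a in action_values:
        cum_regret[a] += action_values[a] - expected

# One illustrative step with made-up counterfactual values at a poker-like node:
cum = {'fold': 0.0, 'call': 0.0, 'raise': 0.0}
update_regrets(cum, {a: 1.0 / 3.0 for a in cum}, {'fold': -1.0, 'call': 0.5, 'raise': 1.0})
strat = regret_matching(cum)   # 'fold' has negative regret, so it gets probability 0
```

After this single update the strategy shifts all mass onto the two actions with positive regret, in proportion to their regrets, which is exactly the behavior the surrounding text describes.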
These regrets are accumulated to obtain the cumulative counterfactual regret, R^T_i(I, a) = Σ_{t=1}^T r^t_i(I, a), and are used to update the current strategy profile via regret matching [5, 12],

σ^{T+1}(I, a) = R^{T,+}_i(I, a) / Σ_{b∈A(I)} R^{T,+}_i(I, b),    (1)

where x^+ = max{x, 0} and actions are chosen uniformly at random when the denominator is zero. It is well-known that in a two-player zero-sum game, if both players' average (external) regret,

R^T_i / T = (1/T) max_{σ'_i∈Σ_i} Σ_{t=1}^T ( u_i(σ'_i, σ^t_{−i}) − u_i(σ^t_i, σ^t_{−i}) ),

is at most ε/2, then the average profile σ̄^T is an ε-Nash equilibrium. During computation, CFR stores a cumulative profile s^T_i(I, a) = Σ_{t=1}^T π^{σ^t}_i(I) σ^t_i(I, a) and outputs the average profile σ̄^T_i(I, a) = s^T_i(I, a) / Σ_{b∈A(I)} s^T_i(I, b). The original CFR analysis shows that player i's regret is bounded by the sum of the positive parts of the cumulative counterfactual regrets R^{T,+}_i(I, a):

Theorem 1 (Zinkevich et al. [12])

R^T_i ≤ Σ_{I∈I_i} max_{a∈A(I)} R^{T,+}_i(I, a).

Regret matching minimizes the average of the cumulative counterfactual regrets, and so player i's average regret is also minimized by Theorem 1. For each player i, let B_i be the partition of I_i such that two information sets I, I' are in the same part B ∈ B_i if and only if player i's sequence of actions leading to I is the same as the sequence of actions leading to I'. B_i is well-defined due to perfect recall. Next, define the M-value of the game to player i to be M_i = Σ_{B∈B_i} √|B|. The best known bound on player i's average regret is:

Theorem 2 (Lanctot et al. 
[9]) When using vanilla CFR, average regret is bounded by

R^T_i / T ≤ ∆_i M_i √|A_i| / √T.

We prove a tighter bound in Section 3. For large games, CFR's full game tree traversals can be very expensive. Alternatively, one can traverse a smaller, sampled portion of the tree on each iteration using Monte Carlo CFR (MCCFR) [9]. Let Q = {Q_1, ..., Q_K} be a set of subsets, or blocks, of the terminal histories Z such that the union of Q spans Z. For example, Chance Sampling (CS) [12] is an instance of MCCFR that partitions Z into blocks such that two histories are in the same block if and only if no two chance actions differ. On each iteration, a block Q_j is sampled with probability q_j, where Σ_{k=1}^K q_k = 1. In CS, we generate a block by sampling a single action a at each history h ∈ H with P(h) = c according to its likelihood of occurring, σ_c(h, a). In general, the sampled counterfactual value for player i is

ṽ_i(I, σ) = Σ_{z∈Z_I∩Q_j} u_i(z) π^σ_{−i}(z[I]) π^σ(z[I], z) / q(z),

where q(z) = Σ_{k:z∈Q_k} q_k is the probability that z was sampled. For example, in CS, q(z) = π^σ_c(z). Define the sampled counterfactual regret for action a at I to be r̃^t_i(I, a) = ṽ_i(I, σ^t_{(I→a)}) − ṽ_i(I, σ^t). Strategies are then generated by applying regret matching to R̃^T_i(I, a) = Σ_{t=1}^T r̃^t_i(I, a). CS has been shown to significantly reduce computing time in poker games [11, Appendix A.5.2]. Other instances of MCCFR include External Sampling (ES) and Outcome Sampling (OS) [9]. ES takes CS one step further by considering only a single action for not only chance, but also for the opponents, where opponent actions are sampled according to the current profile σ^t_{−i}. 
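To see why dividing by q(z) is the right correction, here is a toy numeric check (made-up numbers, not from the paper's experiments): when each block is a single terminal history sampled with probability q(z), the importance-weighted estimate has the true summed value as its expectation.

```python
import random

random.seed(0)

# Each terminal z contributes weights[z] = u_i(z) * pi_{-i}(z[I]) * pi(z[I], z),
# folded into one number for this sketch; q gives the block-sampling probabilities.
weights = {'z1': 2.0, 'z2': -1.0, 'z3': 4.0}
q = {'z1': 0.5, 'z2': 0.3, 'z3': 0.2}

true_value = sum(weights.values())   # the unsampled counterfactual value: 5.0

def sampled_value():
    """One draw of the importance-weighted estimator: weights[z] / q(z)."""
    z = random.choices(list(q), weights=list(q.values()))[0]
    return weights[z] / q[z]

n = 200000
estimate = sum(sampled_value() for _ in range(n)) / n
# E[sampled_value] = sum_z q(z) * weights[z] / q(z) = sum_z weights[z], so the
# empirical mean should sit close to true_value.
assert abs(estimate - true_value) < 0.2
```

The cancellation in the expectation is the whole trick: whatever sampling scheme chooses the block, dividing each reached terminal's contribution by its sampling probability keeps the estimator unbiased.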
OS is the most extreme version of MCCFR that samples a single action at every history, walking just a single trajectory through the tree on each traversal (Q_j = {z}). ES and OS converge to equilibrium faster than vanilla CFR in a number of different domains [9, Figure 1].
ES and OS yield a probabilistic bound on the average regret, and thus provide a probabilistic guarantee that σ̄^T converges to a Nash equilibrium. Since both algorithms generate blocks by sampling actions independently, we can decompose q(z) = ∏_{i∈N∪{c}} q_i(z) so that q_i(z) is the probability contributed to q(z) by sampling player i's actions.

Theorem 3 (Lanctot et al. [9])¹ Let X be one of ES or OS (assuming OS also samples opponent actions according to σ_{−i}), let p ∈ (0, 1], and let δ = min_{z∈Z} q_i(z) > 0 over all 1 ≤ t ≤ T. When using X, with probability 1 − p, average regret is bounded by

R^T_i / T ≤ ( M_i + √(2|I_i||B_i|) / √p ) (1/δ) ∆_i √|A_i| / √T.

¹ The bound presented by Lanctot et al. appears slightly different, but the last step of their proof mistakenly used M_i ≥ √(|I_i||B_i|), which is actually incorrect. The bound we present here is correct.

3 New CFR Bounds

While Zinkevich et al. [12] bound a player's regret by a sum of cumulative counterfactual regrets (Theorem 1), we can actually equate a player's regret to a weighted sum of counterfactual regrets. For a strategy σ_i ∈ Σ_i and an information set I ∈ I_i, define R^T_i(I, σ_i) = Σ_{a∈A(I)} σ_i(I, a) R^T_i(I, a). In addition, let σ*_i ∈ Σ_i be a player i strategy such that σ*_i = argmax_{σ'_i∈Σ_i} Σ_{t=1}^T u_i(σ'_i, σ^t_{−i}). 
Note that in a two-player game, Σ_{t=1}^T u_i(σ*_i, σ^t_{−i}) = T u_i(σ*_i, σ̄^T_{−i}), and thus σ*_i is a best response to the opponent's average strategy after T iterations.

Theorem 4

R^T_i = Σ_{I∈I_i} π^{σ*_i}_i(I) R^T_i(I, σ*_i).

All proofs in this paper are provided in full as supplementary material. Theorem 4 leads to a tighter bound on the average regret when using CFR. For a strategy σ_i ∈ Σ_i, define the M-value of σ_i to be M_i(σ_i) = Σ_{B∈B_i} π^{σ_i}_i(B) √|B|, where π^{σ_i}_i(B) = max_{I∈B} π^{σ_i}_i(I). Clearly, M_i(σ_i) ≤ M_i for all σ_i ∈ Σ_i since π^{σ_i}_i(B) ≤ 1. For vanilla CFR, we can simply replace M_i in Theorem 2 with M_i(σ*_i):

Theorem 5 When using vanilla CFR, average regret is bounded by

R^T_i / T ≤ ∆_i M_i(σ*_i) √|A_i| / √T.

For MCCFR, we can show a similar improvement to Theorem 3. Our proof includes a bound for CS that appears to have been omitted in previous work. Details are in the supplementary material.

Theorem 6 Let X be one of CS, ES, or OS (assuming OS samples opponent actions according to σ_{−i}), let p ∈ (0, 1], and let δ = min_{z∈Z} q_i(z) > 0 over all 1 ≤ t ≤ T. When using X, with
When using X, with\nprobability 1 \u2212 p, average regret is bounded by\n\n(cid:32)\n\n\u2264\n\nRT\ni\nT\n\nMi(\u03c3\u2217\n\ni ) +\n\n(cid:112)2|Ii||Bi|\n\n\u221a\n\n(cid:33)(cid:18) 1\n\n(cid:19) \u2206i\n\np\n\n\u03b4\n\n(cid:112)|Ai|\u221a\n\nT\n\n.\n\nTheorem 4 states that player i\u2019s regret is equal to the weighted sum of player i\u2019s counterfactual\nregrets at each I \u2208 Ii, where the weights are equal to player i\u2019s probability of reaching I under \u03c3\u2217\ni .\nSince our goal is to minimize average regret, this means that we only need to minimize the average\ncumulative counterfactual regret at each I \u2208 Ii that \u03c3\u2217\ni plays to reach. Therefore, when using\nMCCFR, we may want to sample more often those information sets that \u03c3\u2217\ni plays to reach, and less\noften those information sets that \u03c3\u2217\n\ni avoids. This inspires our new MCCFR sampling algorithm.\n\n4 Average Strategy Sampling\n\ni on hand. Recall that in a two-player game, \u03c3\u2217\n\nLeveraging the theory developed in the previous section, we now introduce a new MCCFR sam-\npling algorithm that can minimize average regret at a faster rate than CS, ES, and OS. As we just\ndescribed, we want our algorithm to sample more often the information sets that \u03c3\u2217\ni plays to reach.\nUnfortunately, we do not have the exact strategy \u03c3\u2217\ni is\na best response to the opponent\u2019s average strategy, \u00af\u03c3T\u2212i. However, for two-player zero-sum games,\nwe do know that the average pro\ufb01le \u00af\u03c3T converges to a Nash equilibrium. This means that player i\u2019s\naverage strategy, \u00af\u03c3T\ni , converges to a best response of \u00af\u03c3T\u2212i. While the average strategy is not an exact\nbest response, it can be used as a heuristic to guide sampling within MCCFR. 
Our new sampling algorithm, Average Strategy Sampling (AS), selects actions for player i according to the cumulative profile and three predefined parameters. AS can be seen as a sampling scheme between OS and ES where a subset of player i's actions are sampled at each information set I, as opposed to sampling one action (OS) or sampling every action (ES). Given the cumulative profile s^T_i(I, ·) on iteration T, an exploration parameter ε ∈ (0, 1], a threshold parameter τ ∈ [1, ∞), and a bonus parameter β ∈ [0, ∞), each of player i's actions a ∈ A(I) is sampled independently with probability

ρ(I, a) = max{ ε, (β + τ s^T_i(I, a)) / (β + Σ_{b∈A(I)} s^T_i(I, b)) },    (2)

Algorithm 1 Average Strategy Sampling (Two-player version)
1: Require: Parameters ε, τ, β
2: Initialize regret and cumulative profile: ∀I, a : r(I, a) ← 0, s(I, a) ← 0
3:
4: WalkTree(history h, player i, sample prob q):
5:   if h ∈ Z then return u_i(h)/q end if
6:   if h ∈ P(c) then Sample action a ∼ σ_c(h, ·), return WalkTree(ha, i, q) end if
7:   I ← Information set containing h, σ(I, ·) ← RegretMatching(r(I, ·))
8:   if h ∉ P(i) then
9:     for a ∈ A(I) do s(I, a) ← s(I, a) + (σ(I, a)/q) end for
10:    Sample action a ∼ σ(I, ·), return WalkTree(ha, i, q)
11:  end if
12:  for a ∈ A(I) do
13:    ρ ← max{ε, (β + τ s(I, a)) / (β + Σ_{b∈A(I)} s(I, b))}, ṽ(a) ← 0
14:    if Random(0, 1) < ρ then ṽ(a) ← WalkTree(ha, i, q · min{1, ρ}) end if
15:  end for
16:  for a ∈ A(I) do r(I, a) ← r(I, a) + ṽ(a) − Σ_{b∈A(I)} σ(I, b)ṽ(b) end for
17:  return Σ_
{a∈A(I)} σ(I, a)ṽ(a)

or with probability 1 if either ρ(I, a) > 1 or β + Σ_{b∈A(I)} s^T_i(I, b) = 0. As in ES, at opponent and chance nodes, a single action is sampled on-policy according to the current opponent profile σ^T_{−i} and the fixed chance probabilities σ_c, respectively.
If τ = 1 and β = 0, then ρ(I, a) is equal to the probability that the average strategy σ̄^T_i(I, a) = s^T_i(I, a) / Σ_{b∈A(I)} s^T_i(I, b) plays a at I, except that each action is sampled with probability at least ε. For choices greater than 1, τ acts as a threshold so that any action taken with probability at least 1/τ by the average strategy is always sampled by AS. Furthermore, β's purpose is to increase the rate of exploration during early AS iterations. When β > 0, we effectively add β as a bonus to the cumulative value s^T_i(I, a) before normalizing. Since player i's average strategy σ̄^T_i is not a good approximation of σ*_i for small T, we include β to avoid making ill-informed choices early on. As the cumulative profile s^T_i(I, ·) grows over time, β eventually becomes negligible. In Section 5, we present a set of values for ε, τ, and β that work well across all of our test games.
Pseudocode for a two-player version of AS is presented in Algorithm 1. In Algorithm 1, the recursive function WalkTree considers four different cases. Firstly, if we have reached a terminal node, we return the utility scaled by 1/q (line 5), where q = q_i(z) is the probability of sampling z contributed from player i's actions. Secondly, when at a chance node, we sample a single action according to σ_c and recurse down that action (line 6). 
Thirdly, at an opponent's choice node (lines 8 to 11), we again sample a single action and recurse, this time according to the opponent's current strategy obtained via regret matching (equation (1)). At opponent nodes, we also update the cumulative profile (line 9) for reasons that we describe in a previous paper [2, Algorithm 1]. For games with more than two players, a second tree walk is required and we omit these details.
The final case in Algorithm 1 handles choice nodes for player i (lines 7 to 17). For each action a, we compute the probability ρ of sampling a and stochastically decide whether to sample a or not, where Random(0,1) returns a random real number in [0, 1). If we do sample a, then we recurse to obtain the sampled counterfactual value ṽ(a) = ṽ_i(I, σ^t_{(I→a)}) (line 14). Finally, we update the regrets at I (line 16) and return the sampled counterfactual value at I, Σ_{a∈A(I)} σ(I, a)ṽ(a) = ṽ_i(I, σ^t).
Repeatedly running WalkTree(∅, i, 1) for all i ∈ N provides a probabilistic guarantee that all players' average regret will be minimized. In the supplementary material, we prove that AS exhibits the same regret bound as CS, ES, and OS provided in Theorem 6. Note that δ in Theorem 6 is guaranteed to be positive for AS by the inclusion of ε in equation (2). However, for CS and ES, δ = 1 since all of player i's actions are sampled, whereas δ ≤ 1 for OS and AS. While this suggests that fewer iterations of CS or ES are required to achieve the same regret bound compared to OS and AS, iterations for OS and AS are faster as they traverse less of the game tree. 
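The sampling probability of equation (2) is simple to compute in isolation; here is a sketch with hypothetical helper and parameter names (the defaults mirror the values the paper settles on in Section 5, ε = 0.05, τ = 1000, β = 10^6):

```python
def as_sample_prob(cum_profile, a, epsilon=0.05, tau=1000.0, beta=1e6):
    """Probability of sampling action a under AS, per equation (2).

    cum_profile maps each action at information set I to its cumulative
    profile value s(I, .). Actions with rho(I, a) >= 1 are always sampled,
    as is every action when the denominator would be zero.
    """
    denom = beta + sum(cum_profile.values())
    if denom == 0:                      # beta = 0 and no profile mass yet
        return 1.0
    rho = max(epsilon, (beta + tau * cum_profile[a]) / denom)
    return min(1.0, rho)
```

Two behaviors from the surrounding text fall out directly: early on (all s(I, ·) near zero) the β bonus pushes ρ toward 1, so AS explores widely; much later, an action the average strategy plays with probability below 1/τ falls to the ε floor, while any action played with probability at least 1/τ is still sampled with certainty.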
Just as CS, ES, and OS have been shown to benefit from this trade-off over vanilla CFR, we will show that in practice, AS can likewise benefit over CS and ES and that AS is a better choice than OS.

5 Experiments

In this section, we compare the convergence rates of AS to those of CS, ES, and OS. While AS can be applied to any extensive game, the aim of AS is to provide faster convergence rates in games involving many player actions. Thus, we consider two domains, no-limit poker and Bluff, where we can easily scale the number of actions available to the players.
No-limit poker. The two-player poker game we consider here, which we call 2-NL Hold'em(k), is inspired by no-limit Texas Hold'em. 2-NL Hold'em(k) is played over two betting rounds. Each player starts with a stack of k chips. To begin play, the player denoted as the dealer posts a small blind of one chip and the other player posts a big blind of two chips. Each player is then dealt two private cards from a standard 52-card deck and the first betting round begins. During each betting round, players can either fold (forfeit the game), call (match the previous bet), or raise by any number of chips in their remaining stack (increase the previous bet), as long as the raise is at least as big as the previous bet. After the first betting round, three public community cards are revealed (the flop) and a second and final betting round begins. If a player has no more chips left after a call or a raise, that player is said to be all-in. At the end of the second betting round, if neither player folded, then the player with the highest ranked five-card poker hand wins all of the chips played. Note that the number of player actions in 2-NL Hold'em(k) at one information set is at most the starting stack size, k. Increasing k adds more betting options and allows for more actions before being all-in.
Bluff. 
Bluff(D_1, D_2) [7], also known as Liar's Dice, Perudo, and Dudo, is a two-player dice-bidding game played with six-sided dice over a number of rounds. Each player i starts with D_i dice. In each round, players roll their dice and look at the result without showing their opponent. Then, players alternate by bidding a quantity q of a face value f of all dice in play until one player claims that the other is bluffing (i.e., claims that the bid does not hold). To place a new bid, a player must increase q or f of the current bid. A face value of six is considered "wild" and counts as any other face value. The player calling bluff wins the round if the opponent's last bid is incorrect, and loses otherwise. The losing player removes one of their dice from the game and a new round begins. Once a player has no more dice left, that player loses the game and receives a utility of −1, while the winning player earns +1 utility. The maximum number of player actions at an information set is 6(D_1 + D_2) + 1, as increasing D_i allows both players to bid higher quantities q.
Preliminary tests. Before comparing AS to CS, ES, and OS, we first run some preliminary experiments to find a good set of parameter values for ε, τ, and β to use with AS. All of our preliminary experiments are in two-player 2-NL Hold'em(k). In poker, a common approach is to create an abstract game by merging similar card dealings together into a single chance action or "bucket" [4]. To keep the size of our games manageable, we employ a five-bucket abstraction that reduces the branching factor at each chance node down to five, where dealings are grouped according to expected hand strength squared as described by Zinkevich et al. [12].
Firstly, we fix τ = 1000 and test different values for ε and β in 2-NL Hold'em(30). 
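As an aside, the action count for Bluff quoted above, 6(D_1 + D_2) + 1, is easy to verify by enumeration (a sanity check written for this text, not code from the paper):

```python
def max_bluff_actions(d1, d2):
    """Upper bound on actions at any Bluff(D1, D2) information set.

    Every bid names a quantity q in 1..(d1+d2) and a face f in 1..6, and the
    one remaining action is calling bluff, giving 6*(d1+d2) + 1 in total.
    """
    dice = d1 + d2
    bids = [(q, f) for q in range(1, dice + 1) for f in range(1, 7)]
    return len(bids) + 1    # +1 for the bluff call

assert max_bluff_actions(1, 1) == 13    # 6*2 + 1, for Bluff(1, 1)
assert max_bluff_actions(2, 1) == 19    # 6*3 + 1, for Bluff(2, 1)
```

This is why adding dice grows the branching factor linearly, in contrast to 2-NL Hold'em(k), where the branching factor grows with the stack size k.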
Recall that τ = 1000 implies that actions taken by the average strategy with probability at least 0.001 are always sampled by AS. Figure 1a shows the exploitability in the five-bucket abstract game, measured in milli-big-blinds per game (mbb/g), of the profile produced by AS after 10^12 nodes visited. Recall that lower exploitability implies a closer approximation to equilibrium. Each data point is averaged over five runs of AS. The ε = 0.05 and β = 10^5 or 10^6 profiles are the least exploitable profiles within statistical noise (not shown).
Next, we fix ε = 0.05 and β = 10^6 and test different values for τ. Figure 1b shows the abstract game exploitability over the number of nodes visited by AS in 2-NL Hold'em(30), where again each data point is averaged over five runs. Here, the least exploitable strategies after 10^12 nodes visited are obtained with τ = 100 and τ = 1000 (again within statistical noise). Similar results to Figure 1b hold in 2-NL Hold'em(40) and are not shown. Throughout the remainder of our experiments, we use the fixed set of parameters ε = 0.05, β = 10^6, and τ = 1000 for AS.

(a) τ = 1000    (b) ε = 0.05, β = 10^6

Figure 1: (a) Abstract game exploitability of AS profiles for τ = 1000 after 10^12 nodes visited in 2-NL Hold'em(30). (b) Log-log plot of abstract game exploitability over the number of nodes visited by AS with ε = 0.05 and β = 10^6 in 2-NL Hold'em(30). For both figures, units are in milli-big-blinds per hand (mbb/g) and data points are averaged over five runs with different random seeds. Error bars in (b) indicate 95% confidence intervals.

Main results. We now compare AS to CS, ES, and OS in both 2-NL Hold'em(k) and Bluff(D_1, D_2). Similar to Lanctot et al. 
[9], our OS implementation is ε-greedy so that the current player i samples a single action at random with probability ε = 0.5, and otherwise samples a single action according to the current strategy σ_i.
Firstly, we consider two-player 2-NL Hold'em(k) with starting stacks of k = 20, 22, 24, ..., 38, and 40 chips, for a total of eleven different 2-NL Hold'em(k) games. Again, we apply the same five-bucket card abstraction as before to keep the games reasonably sized. For each game, we ran each of CS, ES, OS, and AS five times, measured the abstract game exploitability at a number of checkpoints, and averaged the results. Figure 2a displays the results for 2-NL Hold'em(36), a game with approximately 68 million information sets and 5 billion histories (nodes). Here, AS achieved an improvement of 54% over ES at the final data points. In addition, Figure 2b shows the average exploitability in each of the eleven games after approximately 3.16 × 10^12 nodes visited by CS, ES, and AS. OS performed much worse and is not shown. Since one can lose more as the starting stacks are increased (i.e., ∆_i becomes larger), we "normalized" exploitability across each game by dividing the units on the y-axis by k. While there is little difference between the algorithms for the smaller 20 and 22 chip games, we see a significant benefit to using AS over CS and ES for the larger games that contain many player actions. For the most part, the margins between AS, CS, and ES increase with the game size.
Figure 3 displays similar results for Bluff(1, 1) and Bluff(2, 1), which contain over 24 thousand and 3.5 million information sets, and 294 thousand and 66 million histories (nodes), respectively. Again, AS converged faster than CS, ES, and OS in both Bluff games tested. 
Note that the same choices of parameters (ε = 0.05, β = 10^6, τ = 1000) that worked well in 2-NL Hold'em(30) also worked well in other 2-NL Hold'em(k) games and in Bluff(D_1, D_2).

6 Conclusion

This work has established a number of improvements for computing strategies in extensive-form games with CFR, both theoretically and empirically. We have provided new, tighter bounds on the average regret when using vanilla CFR or one of several different MCCFR sampling algorithms. These bounds were derived by showing that a player's regret is equal to a weighted sum of the player's cumulative counterfactual regrets (Theorem 4), where the weights are given by a best response to the opponents' previous sequence of strategies. We then used this bound as inspiration for our new MCCFR algorithm, AS. By sampling a subset of a player's actions, AS can provide faster

(a) 2-NL Hold'em(36)    (b) 2-NL Hold'em(k), k ∈ {20, 22, ..., 40}

Figure 2: (a) Log-log plot of abstract game exploitability over the number of nodes visited by CS, ES, OS, and AS in 2-NL Hold'em(36). The initial uniform random profile is exploitable for 6793 mbb/g, as indicated by the black dashed line. (b) Abstract game exploitability after approximately 3.16 × 10^12 nodes visited over the game size for 2-NL Hold'em(k) with even-sized starting stacks k between 20 and 40 chips. For both graphs, units are in milli-big-blinds per hand (mbb/g) and data points are averaged over five runs with different random seeds. Error bars indicate 95% confidence intervals. 
For (b), units on the y-axis are normalized by dividing by the starting chip stacks.\n\n(a) Bluff(1, 1)\n\n(b) Bluff(2, 1)\n\nFigure 3: Log-log plots of exploitability over number of nodes visited by CS, ES, OS, and AS in\nBluff(1, 1) and Bluff(2, 1). The initial uniform random pro\ufb01le is exploitable for 0.780 and 0.784\nin Bluff(1, 1) and Bluff(2, 1) respectively, as indicated by the black dashed lines. Data points are\naveraged over \ufb01ve runs with different random seeds and error bars indicate 95% con\ufb01dence intervals.\n\nconvergence rates in games containing many player actions. AS converged faster than previous MC-\nCFR algorithms in all of our test games. For future work, we would like to apply AS to games with\nmany player actions and with more than two players. All of our theory still applies, except that\nplayer i\u2019s average strategy is no longer guaranteed to converge to \u03c3\u2217\ni . Nonetheless, AS may still \ufb01nd\nstrong strategies faster than CS and ES when it is too expensive to sample all of a player\u2019s actions.\n\nAcknowledgments\n\nWe thank the members of the Computer Poker Research Group at the University of Alberta for help-\nful conversations pertaining to this work. This research was supported by NSERC, Alberta Innovates\n\u2013 Technology Futures, and computing resources provided by WestGrid and Compute Canada.\n\n8\n\n10-1100101102103104101010111012Abstract game exploitability (mbb/g)Nodes VisitedCSESOSAS 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16106107108Abstract game exploitability (mbb/g) / kGame size (# information sets)k=20k=30k=40CSESAS10-510-410-310-210-11001071081091010101110121013ExploitabilityNodes VisitedCSESOSAS10-510-410-310-210-11001071081091010101110121013ExploitabilityNodes VisitedCSESOSAS\fReferences\n[1] Nick Abou Risk and Duane Szafron. Using counterfactual regret minimization to create com-\npetitive multiplayer poker agents. 
In Ninth International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 159–166, 2010.

[2] Richard Gibson, Marc Lanctot, Neil Burch, Duane Szafron, and Michael Bowling. Generalized sampling and variance in counterfactual regret minimization. In Twenty-Sixth Conference on Artificial Intelligence (AAAI), pages 1355–1361, 2012.

[3] Richard Gibson and Duane Szafron. On strategy stitching in large extensive form multiplayer games. In Advances in Neural Information Processing Systems 24 (NIPS), pages 100–108, 2011.

[4] Andrew Gilpin and Tuomas Sandholm. A competitive Texas Hold'em poker player via automated abstraction and real-time equilibrium computation. In Twenty-First Conference on Artificial Intelligence (AAAI), pages 1007–1013, 2006.

[5] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.

[6] Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2):494–512, 2010.

[7] Reiner Knizia. Dice Games Properly Explained. Blue Terrier Press, 2010.

[8] Daphne Koller, Nimrod Megiddo, and Bernhard von Stengel. Fast algorithms for finding randomized strategies in game trees. In Annual ACM Symposium on Theory of Computing (STOC'94), pages 750–759, 1994.

[9] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22 (NIPS), pages 1078–1086, 2009.

[10] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. Technical Report TR09-15, University of Alberta, 2009.

[11] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. Technical Report TR07-14, University of Alberta, 2007.

[12] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20 (NIPS), pages 905–912, 2008.