Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Xiang Yan, Persi Diaconis, Paat Rusmevichientong, Benjamin Roy*

In this paper, we use the rollout method for policy improvement to an- alyze a version of Klondike solitaire. This version, sometimes called thoughtful solitaire, has all cards revealed to the player, but then follows the usual Klondike rules. A strategy that we establish, using iterated roll- outs, wins about twice as many games on average as an expert human player does.

1 Introduction

Though proposed more than fifty years ago [1, 7], the effectiveness of the policy improve- ment algorithm remains a mystery. For discounted or average reward Markov decision problems with n states and two possible actions per state, the tightest known worst-case upper bound in terms of n on the number of iterations taken to find an optimal policy is O(2n/n) [9]. This is also the tightest known upper bound for deterministic Markov de- cision problems. It is surprising, however, that there are no known examples of Markov decision problems with two possible actions per state for which more than n + 2 iterations are required. A more intriguing fact is that even for problems with a large number of states say, in the millions an optimal policy is often delivered after only half a dozen or so iterations.

In problems where n is enormous say, a googol this may appear to be a moot point because each iteration requires (n) compute time. In particular, a policy is represented by a table with one action per state and each iteration improves the policy by updating each entry of this table. In such large problems, one might resort to a suboptimal heuris- tic policy, taking the form of an algorithm that accepts a state as input and generates an action as output. An interesting recent development in dynamic programming is the roll- out method. Pioneered by Tesauro and Galperin [13, 2], the rollout method leverages the policy improvement concept to amplify the performance of any given heuristic. Unlike the conventional policy improvement algorithm, which computes an optimal policy off-line so that it may later be used in decision-making, the rollout method performs its computations on-line at the time when a decision is to be made. When making a decision, rather than applying the heuristic policy directly, the rollout method computes an action that would result from an iteration of policy improvement applied to the heuristic policy. This does

not require (n) compute time since only one entry of the table is computed.

The way in which actions are generated by the rollout method may be considered an al- ternative heuristic that improves on the original. One might consider applying the rollout method to this new heuristic. Another heuristic would result, again with improved perfor- mance. Iterated a sufficient number of times, this process would lead to an optimal policy. However, iterating is usually not an option. Computational requirements grow exponen- tially in the number of iterations, and the first iteration, which improves on the original heuristic, is already computationally intensive. For this reason, prior applications of the rollout method have involved only one iteration [3, 4, 5, 6, 8, 11, 12, 13]. For example, in the interesting study of Backgammon by Tesauro and Galperin [13], moves were generated in five to ten seconds by the rollout method running on configurations of sixteen to thirty- two nodes in a network of IBM SP1 and SP2 parallel-RISC supercomputers with parallel speedup efficiencies of 90%. A second iteration of the rollout method would have been infeasible requiring about six orders of magnitude more time per move.

In this paper, we apply the rollout method to a version of solitaire, modeled as a deter- ministic Markov decision problem with over 52! states. Determinism drastically reduces computational requirements, making it possible to consider iterated rollouts1. With five iterations, a game, implemented in Java, takes about one hour and forty-five minutes on average on a SUN Blade 2000 machine with two 900MHz CPUs, and the probability of winning exceeds that of a human expert by about a factor of two. Our study represents an important contribution both to the study of the rollout method and to the study of solitaire.

Do not remove: This comment is monitored to verify that the site is working properly