{"title": "Improved and Generalized Upper Bounds on the Complexity of Policy Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 386, "page_last": 394, "abstract": "Given a Markov Decision Process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\\gamma$-discounted optimal policy. We consider two variations of PI: Howard's PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most  $ O \\left( \\frac{ n m}{1-\\gamma} \\log \\left( \\frac{1}{1-\\gamma} \\right)\\right) $ iterations, improving by a factor $O(\\log n)$ a result by Hansen et al. (2013), while Simplex-PI terminates after at most $ O \\left(  \\frac{n^2 m}{1-\\gamma} \\log \\left( \\frac{1}{1-\\gamma} \\right)\\right) $ iterations, improving by a factor $O(\\log n)$ a result by Ye (2011). Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor~$\\gamma$: given a measure of the maximal transient time $\\tau_t$ and the maximal time $\\tau_r$ to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most $ \\tilde O \\left( n^3 m^2 \\tau_t \\tau_r \\right) $ iterations. This generalizes a recent result for deterministic MDPs by Post & Ye (2012), in which $\\tau_t \\le n$ and $\\tau_r \\le n$. We explain why similar results seem hard to derive for Howard's PI. 
Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively states that are transient and recurrent for all policies, we show that Simplex-PI and Howard's PI terminate after at most $ \\tilde O(nm (\\tau_t+\\tau_r))$ iterations.", "full_text": "Improved and Generalized Upper Bounds on the Complexity of Policy Iteration\n\nBruno Scherrer\n\nInria, Villers-l\u00e8s-Nancy, F-54600, France\n\nUniversit\u00e9 de Lorraine, LORIA, UMR 7503, Vandoeuvre-l\u00e8s-Nancy, F-54506, France\n\nbruno.scherrer@inria.fr\n\nAbstract\n\nGiven a Markov Decision Process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\\gamma$-discounted optimal policy. We consider two variations of PI: Howard\u2019s PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. We show that Howard\u2019s PI terminates after at most $n(m-1) \\left\\lceil \\frac{1}{1-\\gamma} \\log\\left(\\frac{1}{1-\\gamma}\\right) \\right\\rceil = O\\left(\\frac{nm}{1-\\gamma} \\log\\left(\\frac{1}{1-\\gamma}\\right)\\right)$ iterations, improving by a factor $O(\\log n)$ a result by [3], while Simplex-PI terminates after at most $n^2(m-1) \\left(1 + \\frac{2}{1-\\gamma} \\log\\left(\\frac{1}{1-\\gamma}\\right)\\right) = O\\left(\\frac{n^2 m}{1-\\gamma} \\log\\left(\\frac{1}{1-\\gamma}\\right)\\right)$ iterations, improving by a factor $O(\\log n)$ a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor $\\gamma$: given a measure of the maximal transient time $\\tau_t$ and the maximal time $\\tau_r$ to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most $n^2(m-1) \\left(\\lceil \\tau_r \\log(n\\tau_r) \\rceil + \\lceil \\tau_r \\log(n\\tau_t) \\rceil\\right) \\left[(m-1) \\lceil n\\tau_t \\log(n\\tau_t) \\rceil + \\lceil n\\tau_t \\log(n^2\\tau_t) \\rceil\\right] = \\tilde O\\left(n^3 m^2 \\tau_t \\tau_r\\right)$ iterations. 
This generalizes a recent result for deterministic MDPs by [8], in which $\\tau_t \\le n$ and $\\tau_r \\le n$. We explain why similar results seem hard to derive for Howard\u2019s PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively states that are transient and recurrent for all policies, we show that Howard\u2019s PI terminates after at most $n(m-1) \\left(\\lceil \\tau_t \\log n\\tau_t \\rceil + \\lceil \\tau_r \\log n\\tau_r \\rceil\\right) = \\tilde O(nm(\\tau_t + \\tau_r))$ iterations while Simplex-PI terminates after $n(m-1) \\left(\\lceil n\\tau_t \\log n\\tau_t \\rceil + \\lceil \\tau_r \\log n\\tau_r \\rceil\\right) = \\tilde O(n^2 m(\\tau_t + \\tau_r))$ iterations.\n\n1 Introduction\n\nWe consider a discrete-time dynamic system whose state transition depends on a control. We assume that there is a state space $X$ of finite size $n$. At state $i \\in \\{1, \\dots, n\\}$, the control is chosen from a control space $A$ of finite size$^1$ $m$. The control $a \\in A$ specifies the transition probability $p_{ij}(a) = P(i_{t+1} = j \\mid i_t = i, a_t = a)$ to the next state $j$. At each transition, the system is given a reward $r(i, a, j)$ where $r$ is the instantaneous reward function. In this context, we look for a stationary deterministic policy (a function $\\pi : X \\to A$ that maps states into controls$^2$) that maximizes the expected discounted sum of rewards from any state $i$, called the value of policy $\\pi$ at state $i$:\n\n$$v_\\pi(i) := E\\left[ \\sum_{k=0}^{\\infty} \\gamma^k r(i_k, a_k, i_{k+1}) \\,\\middle|\\, i_0 = i,\\ \\forall k \\ge 0,\\ a_k = \\pi(i_k),\\ i_{k+1} \\sim P(\\cdot \\mid i_k, a_k) \\right]$$\n\nwhere $\\gamma \\in (0, 1)$ is a discount factor.\n\n$^1$ In the works of [11, 8, 3] that we reference, the integer \u201cm\u201d denotes the total number of actions, that is $nm$ with our notation. When we restate their result, we do it with our own notation, that is we replace their \u201cm\u201d by \u201cnm\u201d.\n\n
The tuple $\\langle X, A, p, r, \\gamma \\rangle$ is called a Markov Decision Process (MDP) [9, 1], and the associated problem is known as optimal control.\n\nThe optimal value starting from state $i$ is defined as\n\n$$v_*(i) := \\max_\\pi v_\\pi(i).$$\n\nFor any policy $\\pi$, we write $P_\\pi$ for the $n \\times n$ stochastic matrix whose elements are $p_{ij}(\\pi(i))$ and $r_\\pi$ the vector whose components are $\\sum_j p_{ij}(\\pi(i)) r(i, \\pi(i), j)$. The value functions $v_\\pi$ and $v_*$ can be seen as vectors on $X$. It is well known that $v_\\pi$ is the solution of the following Bellman equation:\n\n$$v_\\pi = r_\\pi + \\gamma P_\\pi v_\\pi,$$\n\nthat is $v_\\pi$ is a fixed point of the affine operator $T_\\pi : v \\mapsto r_\\pi + \\gamma P_\\pi v$. It is also well known that $v_*$ satisfies the following Bellman equation:\n\n$$v_* = \\max_\\pi (r_\\pi + \\gamma P_\\pi v_*) = \\max_\\pi T_\\pi v_*$$\n\nwhere the max operator is componentwise. In other words, $v_*$ is a fixed point of the nonlinear operator $T : v \\mapsto \\max_\\pi T_\\pi v$. For any value vector $v$, we say that a policy $\\pi$ is greedy with respect to the value $v$ if it satisfies:\n\n$$\\pi \\in \\arg\\max_{\\pi'} T_{\\pi'} v$$\n\nor equivalently $T_\\pi v = T v$. With some slight abuse of notation, we write $G(v)$ for any policy that is greedy with respect to $v$. The notions of optimal value function and greedy policies are fundamental to optimal control because of the following property: any policy $\\pi_*$ that is greedy with respect to the optimal value $v_*$ is an optimal policy and its value $v_{\\pi_*}$ is equal to $v_*$.\n\nLet $\\pi$ be some policy. 
We call advantage with respect to $\\pi$ the following quantity:\n\n$$a_\\pi = \\max_{\\pi'} T_{\\pi'} v_\\pi - v_\\pi = T v_\\pi - v_\\pi.$$\n\nWe call the set of switchable states of $\\pi$ the following set\n\n$$S_\\pi = \\{i,\\ a_\\pi(i) > 0\\}.$$\n\nAssume now that $\\pi$ is non-optimal (this implies that $S_\\pi$ is a non-empty set). For any non-empty subset $Y$ of $S_\\pi$, we denote $\\mathrm{switch}(\\pi, Y)$ a policy satisfying:\n\n$$\\forall i,\\ \\mathrm{switch}(\\pi, Y)(i) = \\begin{cases} G(v_\\pi)(i) & \\text{if } i \\in Y \\\\ \\pi(i) & \\text{if } i \\notin Y. \\end{cases}$$\n\nThe following result is well known (see for instance [9]).\n\nLemma 1. Let $\\pi$ be some non-optimal policy. If $\\pi' = \\mathrm{switch}(\\pi, Y)$ for some non-empty subset $Y$ of $S_\\pi$, then $v_{\\pi'} \\ge v_\\pi$ and there exists at least one state $i$ such that $v_{\\pi'}(i) > v_\\pi(i)$.\n\nThis lemma is the foundation of the well-known iterative procedure, called Policy Iteration (PI), that generates a sequence of policies $(\\pi_k)$ as follows.\n\n$$\\pi_{k+1} \\leftarrow \\mathrm{switch}(\\pi_k, Y_k) \\text{ for some set } Y_k \\text{ such that } \\emptyset \\subsetneq Y_k \\subseteq S_{\\pi_k}.$$\n\nThe choice for the subsets $Y_k$ leads to different variations of PI. In this paper we will focus on two specific variations:\n\n$^2$ Restricting our attention to stationary deterministic policies is not a limitation. 
Indeed, for the optimality criterion to be defined soon, it can be shown that there exists at least one stationary deterministic policy that is optimal [9].\n\n\u2022 When for all iterations $k$, $Y_k = S_{\\pi_k}$, that is one switches the actions in all states with positive advantage with respect to $\\pi_k$, the above algorithm is known as Howard\u2019s PI; it can be seen then that $\\pi_{k+1} \\in G(v_{\\pi_k})$.\n\u2022 When for all $k$, $Y_k$ is a singleton containing a state $i_k \\in \\arg\\max_i a_{\\pi_k}(i)$, that is if we only switch one action in the state with maximal advantage with respect to $\\pi_k$, we will call it Simplex-PI$^3$.\n\nSince it generates a sequence of policies with increasing values, any variation of PI converges to the optimal policy in a number of iterations that is smaller than the total number of policies $m^n$. In practice, PI converges in very few iterations. On random MDP instances, convergence often occurs in time sub-linear in $n$. The aim of this paper is to discuss existing and provide new upper bounds on the number of iterations required by Howard\u2019s PI and Simplex-PI that are much sharper than $m^n$.\n\nIn the next sections, we describe some known results (see [11] for a recent and comprehensive review) about the number of iterations required by Howard\u2019s PI and Simplex-PI, along with some of our original improvements and extensions.$^4$\n\n2 Bounds with respect to a Fixed Discount Factor $\\gamma < 1$\n\nA key observation for both algorithms, that will be central to the results we are about to discuss, is that the sequence they generate satisfies some contraction property$^5$. For any vector $u \\in \\mathbb{R}^n$, let $\\|u\\|_\\infty = \\max_{1 \\le i \\le n} |u(i)|$ be the max-norm of $u$. Let $\\mathbf{1}$ be the vector of which all components are equal to 1.\n\nLemma 2 (Proof in Section A). 
The sequence $(\\|v_* - v_{\\pi_k}\\|_\\infty)_{k \\ge 0}$ built by Howard\u2019s PI is contracting with coefficient $\\gamma$.\n\nLemma 3 (Proof in Section B). The sequence $(\\mathbf{1}^T (v_* - v_{\\pi_k}))_{k \\ge 0}$ built by Simplex-PI is contracting with coefficient $1 - \\frac{1-\\gamma}{n}$.\n\nThough this observation is widely known for Howard\u2019s PI, it was to our knowledge never mentioned explicitly in the literature for Simplex-PI. These contraction properties have the following immediate consequence$^6$.\n\nCorollary 1. Let $V_{\\max} = \\frac{\\max_\\pi \\|r_\\pi\\|_\\infty}{1-\\gamma}$ be an upper bound on $\\|v_\\pi\\|_\\infty$ for all policies $\\pi$. In order to get an $\\epsilon$-optimal policy, that is a policy $\\pi_k$ satisfying $\\|v_* - v_{\\pi_k}\\|_\\infty \\le \\epsilon$, Howard\u2019s PI requires at most $\\left\\lceil \\frac{\\log \\frac{V_{\\max}}{\\epsilon}}{1-\\gamma} \\right\\rceil$ iterations, while Simplex-PI requires at most $\\left\\lceil \\frac{n \\log \\frac{n V_{\\max}}{\\epsilon}}{1-\\gamma} \\right\\rceil$ iterations.\n\nThese bounds depend on the precision term $\\epsilon$, which means that Howard\u2019s PI and Simplex-PI are weakly polynomial for a fixed discount factor $\\gamma$. An important breakthrough was recently achieved by [11] who proved that one can remove the dependency with respect to $\\epsilon$, and thus show that Howard\u2019s PI and Simplex-PI are strongly polynomial for a fixed discount factor $\\gamma$.\n\nTheorem 1 ([11]). Simplex-PI and Howard\u2019s PI both terminate after at most $n(m-1) \\left\\lceil \\frac{n}{1-\\gamma} \\log\\left(\\frac{n^2}{1-\\gamma}\\right) \\right\\rceil$ iterations.\n\n$^3$ In this case, PI is equivalent to running the simplex algorithm with the highest-pivot rule on a linear program version of the MDP problem [11].\n\n$^4$ For clarity, all proofs are deferred to the Appendix. The first proofs about bounds for the case $\\gamma < 1$ are given in the Appendix of the paper. 
The other proofs, that are more involved, are provided in the Supplementary Material.\n\n$^5$ A sequence of non-negative numbers $(x_k)_{k \\ge 0}$ is contracting with coefficient $\\alpha$ if and only if for all $k \\ge 0$, $x_{k+1} \\le \\alpha x_k$.\n\n$^6$ For Howard\u2019s PI, we have: $\\|v_* - v_{\\pi_k}\\|_\\infty \\le \\gamma^k \\|v_* - v_{\\pi_0}\\|_\\infty \\le \\gamma^k V_{\\max}$. Thus, a sufficient condition for $\\|v_* - v_{\\pi_k}\\|_\\infty < \\epsilon$ is $\\gamma^k V_{\\max} < \\epsilon$, which is implied by $k \\ge \\frac{\\log \\frac{V_{\\max}}{\\epsilon}}{1-\\gamma} > \\frac{\\log \\frac{V_{\\max}}{\\epsilon}}{\\log \\frac{1}{\\gamma}}$. For Simplex-PI, we have $\\|v_* - v_{\\pi_k}\\|_\\infty \\le \\|v_* - v_{\\pi_k}\\|_1 \\le \\left(1 - \\frac{1-\\gamma}{n}\\right)^k \\|v_* - v_{\\pi_0}\\|_1 \\le \\left(1 - \\frac{1-\\gamma}{n}\\right)^k n V_{\\max}$, and the conclusion is similar to that for Howard\u2019s PI.\n\nThe proof is based on the fact that PI corresponds to the simplex algorithm in a linear programming formulation of the MDP problem. Using a more direct proof, [3] recently improved the result by a factor $O(n)$ for Howard\u2019s PI.\n\nTheorem 2 ([3]). Howard\u2019s PI terminates after at most $(nm + 1) \\left\\lceil \\frac{1}{1-\\gamma} \\log\\left(\\frac{n}{1-\\gamma}\\right) \\right\\rceil$ iterations.\n\nOur first two results, that are consequences of the contraction properties (Lemmas 2 and 3), are stated in the following theorems.\n\nTheorem 3 (Proof in Section C). Howard\u2019s PI terminates after at most $n(m-1) \\left\\lceil \\frac{1}{1-\\gamma} \\log\\left(\\frac{1}{1-\\gamma}\\right) \\right\\rceil$ iterations.\n\nTheorem 4 (Proof in Section D). Simplex-PI terminates after at most $n(m-1) \\left\\lceil \\frac{n}{1-\\gamma} \\log\\left(\\frac{n}{1-\\gamma}\\right) \\right\\rceil$ iterations.\n\nOur result for Howard\u2019s PI is a factor $O(\\log n)$ better than the previous best result of [3]. Our result for Simplex-PI is only very slightly better (by a factor 2) than that of [11], and uses a proof that is more direct. 
Using a more refined argument, we managed to also improve the bound for Simplex-PI by a factor $O(\\log n)$.\n\nTheorem 5 (Proof in Section E). Simplex-PI terminates after at most $n^2(m-1) \\left(1 + \\frac{2}{1-\\gamma} \\log \\frac{1}{1-\\gamma}\\right)$ iterations.\n\nCompared to Howard\u2019s PI, our bound for Simplex-PI is a factor $O(n)$ larger. However, since one changes only one action per iteration, each iteration may have a complexity lower by a factor $n$: the update of the value can be done in time $O(n^2)$ through the Sherman-Morrison formula, though in general each iteration of Howard\u2019s PI, which amounts to computing the value of some policy that may be arbitrarily different from the previous policy, may require $O(n^3)$ time. Overall, both algorithms seem to have a similar complexity.\n\nIt is easy to see that the linear dependency of the bound for Howard\u2019s PI with respect to $n$ is optimal. We conjecture that the linear dependency of both bounds with respect to $m$ is also optimal. The dependency with respect to the term $\\frac{1}{1-\\gamma}$ may be improved, but removing it is impossible for Howard\u2019s PI and very unlikely for Simplex-PI. [2] describes an MDP for which Howard\u2019s PI requires an exponential (in $n$) number of iterations for $\\gamma = 1$ and [5] argued that this holds also when $\\gamma$ is in the vicinity of 1. Though a similar result does not seem to exist for Simplex-PI in the literature, [7] consider four variations of PI that all switch one action per iteration, and show through specifically designed MDPs that they may require an exponential (in $n$) number of iterations when $\\gamma = 1$.\n\n3 Bounds for Simplex-PI that are independent of $\\gamma$\n\nIn this section, we will describe some bounds that do not depend on $\\gamma$ but that will be based on some structural assumptions of the MDPs. On this topic, [8] recently showed the following result for deterministic MDPs.\n\nTheorem 6 ([8]). 
If the MDP is deterministic, then Simplex-PI terminates after at most $O(n^5 m^2 \\log^2 n)$ iterations.\n\nGiven a policy $\\pi$ of a deterministic MDP, states are either on cycles or on paths induced by $\\pi$. The core of the proof relies on the following lemmas that altogether show that cycles are created regularly and that significant progress is made every time a new cycle appears; in other words, significant progress is made regularly.\n\nLemma 4. If the MDP is deterministic, after at most $nm \\lceil 2(n-1) \\log n \\rceil$ iterations, either Simplex-PI finishes or a new cycle appears.\n\nLemma 5. If the MDP is deterministic, when Simplex-PI moves from $\\pi$ to $\\pi'$ where $\\pi'$ involves a new cycle, we have\n\n$$\\mathbf{1}^T (v_{\\pi_*} - v_{\\pi'}) \\le \\left(1 - \\frac{1}{n}\\right) \\mathbf{1}^T (v_{\\pi_*} - v_\\pi).$$\n\nIndeed, these observations suffice to prove$^7$ that Simplex-PI terminates after $O\\left(n^4 m^2 \\log \\frac{n}{1-\\gamma}\\right) = \\tilde O(n^4 m^2)$ iterations. Removing completely the dependency with respect to the discount factor $\\gamma$ (the term in $O(\\log \\frac{1}{1-\\gamma})$) requires careful extra work described in [8], which incurs an extra term of order $O(n \\log(n))$.\n\nAt a more technical level, the proof of [8] critically relies on some properties of the vector $x_\\pi = (I - \\gamma P_\\pi^T)^{-1} \\mathbf{1}$ that provides a discounted measure of state visitations along the trajectories induced by a policy $\\pi$ starting from a uniform distribution:\n\n$$\\forall i \\in X,\\ x_\\pi(i) = n \\sum_{t=0}^{\\infty} \\gamma^t P(i_t = i \\mid i_0 \\sim U, a_t = \\pi(i_t)),$$\n\nwhere $U$ denotes the uniform distribution on the state space $X$. For any policy $\\pi$ and state $i$, we trivially have $x_\\pi(i) \\in \\left(1, \\frac{n}{1-\\gamma}\\right)$. The proof exploits the fact that $x_\\pi(i)$ belongs to the set $(1, n)$ when $i$ is on a path of $\\pi$, while $x_\\pi(i)$ belongs to the set $\\left(\\frac{1}{1-\\gamma}, \\frac{n}{1-\\gamma}\\right)$ when $i$ is on a cycle of $\\pi$. As we are going to show, it is possible to extend the proof of [8] to stochastic MDPs. Given a policy $\\pi$ of a stochastic MDP, states are either in recurrent classes or transient classes (these two categories respectively generalize those of cycles and paths). We will consider the following structural assumption.\n\nAssumption 1. Let $\\tau_t \\ge 1$ and $\\tau_r \\ge 1$ be the smallest constants such that for all policies $\\pi$ and all states $i$,\n\n$$(1 \\le)\\ x_\\pi(i) \\le \\tau_t \\quad \\text{if } i \\text{ is transient for } \\pi, \\text{ and}$$\n\n$$\\frac{n}{(1-\\gamma)\\tau_r} \\le x_\\pi(i) \\ \\left(\\le \\frac{n}{1-\\gamma}\\right) \\quad \\text{if } i \\text{ is recurrent for } \\pi.$$\n\nThe constant $\\tau_t$ (resp. $\\tau_r$) can be seen as a measure of the time needed to leave transient states (resp. the time needed to revisit states in recurrent classes). In particular, when $\\gamma$ tends to 1, it can be seen that $\\tau_t$ is an upper bound of the expected time $L$ needed to \u201cLeave the set of transient states\u201d, since for any policy $\\pi$,\n\n$$\\lim_{\\gamma \\to 1} \\tau_t \\ge \\frac{1}{n} \\lim_{\\gamma \\to 1} \\sum_{i \\text{ transient for } \\pi} x_\\pi(i) = \\sum_{t=0}^{\\infty} P(i_t \\text{ transient for } \\pi \\mid i_0 \\sim U, a_t = \\pi(i_t)) = E[L \\mid i_0 \\sim U, a_t = \\pi(i_t)].$$\n\nSimilarly, when $\\gamma$ is in the vicinity of 1, $\\frac{1}{\\tau_r}$ is the minimal asymptotic frequency$^8$ in recurrent states given that one starts from a random uniform state, since for any policy $\\pi$ and recurrent state $i$:\n\n$$\\lim_{\\gamma \\to 1} \\frac{1-\\gamma}{n} x_\\pi(i) = \\lim_{\\gamma \\to 1} (1-\\gamma) \\sum_{t=0}^{\\infty} \\gamma^t P(i_t = i \\mid i_0 \\sim U, a_t = \\pi(i_t)) = \\lim_{T \\to \\infty} \\frac{1}{T} \\sum_{t=0}^{T-1} P(i_t = i \\mid i_0 \\sim U, a_t = \\pi(i_t)).$$\n\nWith Assumption 1 in hand, we can generalize Lemmas 4-5 as follows.\n\nLemma 6. If the MDP satisfies Assumption 1, after at most $n \\left[(m-1) \\lceil n\\tau_t \\log(n\\tau_t) \\rceil + \\lceil n\\tau_t \\log(n^2\\tau_t) \\rceil\\right]$ iterations either Simplex-PI finishes or a new recurrent class appears.\n\n$^7$ This can be done by using arguments similar to the proof of Theorem 4 in Section D.\n\n$^8$ If the MDP is aperiodic and irreducible, and thus admits a stationary distribution $\\nu_\\pi$ for any policy $\\pi$, one can see that $\\frac{1}{\\tau_r} = \\min_{\\pi,\\ i \\text{ recurrent for } \\pi} \\nu_\\pi(i)$.\n\nLemma 7. If the MDP satisfies Assumption 1, when Simplex-PI moves from $\\pi$ to $\\pi'$ where $\\pi'$ involves a new recurrent class, we have\n\n$$\\mathbf{1}^T (v_{\\pi_*} - v_{\\pi'}) \\le \\left(1 - \\frac{1}{\\tau_r}\\right) \\mathbf{1}^T (v_{\\pi_*} - v_\\pi).$$\n\nFrom these generalized observations, we can deduce the following original result.\n\nTheorem 7 (Proof in Appendix F of the Supp. Material). If the MDP satisfies Assumption 1, then Simplex-PI terminates after at most\n\n$$n^2(m-1) \\left(\\lceil \\tau_r \\log(n\\tau_r) \\rceil + \\lceil \\tau_r \\log(n\\tau_t) \\rceil\\right) \\left[(m-1) \\lceil n\\tau_t \\log(n\\tau_t) \\rceil + \\lceil n\\tau_t \\log(n^2\\tau_t) \\rceil\\right]$$\n\niterations.\n\nRemark 1. This new result is a strict generalization of the result for deterministic MDPs. Indeed, in the deterministic case, we have $\\tau_t \\le n$ and $\\tau_r \\le n$, and it is easy to see that Lemmas 6, 7 and Theorem 7 respectively imply Lemmas 4, 5 and Theorem 6.\n\nAn immediate consequence of the above result is that Simplex-PI is strongly polynomial for sets of MDPs that are much larger than the deterministic MDPs mentioned in Theorem 6.\n\nCorollary 2. For any family of MDPs indexed by $n$ and $m$ such that $\\tau_t$ and $\\tau_r$ are polynomial functions of $n$ and $m$, Simplex-PI terminates after a number of steps that is polynomial in $n$ and $m$.\n\n4 Similar results for Howard\u2019s PI?\n\nOne may then wonder whether similar results can be derived for Howard\u2019s PI. 
Unfortunately, and as quickly mentioned by [8], the line of analysis developed for Simplex-PI does not seem to adapt easily to Howard\u2019s PI, because simultaneously switching several actions can interfere in a way that the policy improvement turns out to be small. We can be more precise on what actually breaks in the approach we have described so far. On the one hand, it is possible to write counterparts of Lemmas 4 and 6 for Howard\u2019s PI (see Appendix G of the Supp. Material).\n\nLemma 8. If the MDP is deterministic, after at most $n$ iterations, either Howard\u2019s PI finishes or a new cycle appears.\n\nLemma 9. If the MDP satisfies Assumption 1, after at most $nm \\lceil \\tau_t \\log n\\tau_t \\rceil$ iterations, either Howard\u2019s PI finishes or a new recurrent class appears.\n\nHowever, on the other hand, we did not manage to adapt Lemma 5 nor Lemma 7. In fact, it is unlikely that a result similar to that of Lemma 5 will be shown to hold for Howard\u2019s PI. In a recent deterministic example due to [4], built to show that Howard\u2019s PI may require $\\Omega(n^2)$ iterations, new cycles are created every single iteration but the sequence of values satisfies$^9$ for all iterations $k < \\frac{n^2}{4} + \\frac{n}{4}$ and states $i$,\n\n$$v_*(i) - v_{\\pi_{k+1}}(i) \\ge \\left[1 - \\left(\\frac{2}{n}\\right)^k\\right] (v_*(i) - v_{\\pi_k}(i)).$$\n\nContrary to Lemma 5, as $k$ grows, the amount of contraction gets (exponentially) smaller and smaller. With respect to Simplex-PI, this suggests that Howard\u2019s PI may suffer from subtle specific pathologies. In fact, the problem of determining the number of iterations required by Howard\u2019s PI has been challenging for almost 30 years. It was originally identified as an open problem by [10]. 
In the simplest (deterministic) case, the question is still open: the currently best known lower bound is the $\\Omega(n^2)$ bound by [4] we have just mentioned, while the best known upper bound is $O\\left(\\frac{m^n}{n}\\right)$ (valid for all MDPs) due to [6].\n\n$^9$ This MDP has an even number of states $n = 2p$. The goal is to minimize the long term expected cost. The optimal value function satisfies $v_*(i) = -p^N$ for all $i$, with $N = p^2 + p$. The policies generated by Howard\u2019s PI have values $v_{\\pi_k}(i) \\in (p^{N-k-1}, p^{N-k})$. We deduce that for all iterations $k$ and states $i$, $\\frac{v_*(i) - v_{\\pi_{k+1}}(i)}{v_*(i) - v_{\\pi_k}(i)} \\ge \\frac{1 + p^{-k-2}}{1 + p^{-k}} = 1 - \\frac{p^{-k} - p^{-k-2}}{1 + p^{-k}} \\ge 1 - p^{-k}(1 - p^{-2}) \\ge 1 - p^{-k}$.\n\nOn the positive side, an adaptation of the line of proof we have considered so far can be carried out under the following assumption.\n\nAssumption 2. The state space $X$ can be partitioned in two sets $T$ and $R$ such that for all policies $\\pi$, the states of $T$ are transient and those of $R$ are recurrent.\n\nIndeed, under this assumption, we can prove for Howard\u2019s PI a variation of Lemma 7 introduced for Simplex-PI.\n\nLemma 10. For an MDP satisfying Assumptions 1-2, suppose Howard\u2019s PI moves from $\\pi$ to $\\pi'$ and that $\\pi'$ involves a new recurrent class. Then\n\n$$\\mathbf{1}^T (v_{\\pi_*} - v_{\\pi'}) \\le \\left(1 - \\frac{1}{\\tau_r}\\right) \\mathbf{1}^T (v_{\\pi_*} - v_\\pi).$$\n\nAnd we can deduce the following original bound (that also applies to Simplex-PI).\n\nTheorem 8 (Proof in Appendix H of the Supp. Material). 
If the MDP satisfies Assumptions 1-2, then Howard\u2019s PI terminates after at most $n(m-1) \\left(\\lceil \\tau_t \\log n\\tau_t \\rceil + \\lceil \\tau_r \\log n\\tau_r \\rceil\\right)$ iterations, while Simplex-PI terminates after at most $n(m-1) \\left(\\lceil n\\tau_t \\log n\\tau_t \\rceil + \\lceil \\tau_r \\log n\\tau_r \\rceil\\right)$ iterations.\n\nIt should however be noted that Assumption 2 is rather restrictive. It implies that the algorithms converge on the recurrent states independently of the transient states, and thus the analysis can be decomposed in two phases: 1) the convergence on recurrent states and then 2) the convergence on transient states (given that recurrent states do not change anymore). The analysis of the first phase (convergence on recurrent states) is greatly facilitated by the fact that in this case, a new recurrent class appears every single iteration (this is in contrast with Lemmas 4, 6, 8 and 9 that were designed to show under which conditions cycles and recurrent classes are created). Furthermore, the analysis of the second phase (convergence on transient states) is similar to that of the discounted case of Theorems 3 and 4. 
In other words, if this last result sheds some light on the practical efficiency of Howard\u2019s PI and Simplex-PI, a general analysis of Howard\u2019s PI is still largely open, and constitutes our main future work.\n\nA Contraction property for Howard\u2019s PI (Proof of Lemma 2)\n\nFor any $k$, using the facts that {$\\forall \\pi,\\ T_\\pi v_\\pi = v_\\pi$}, {$T_{\\pi_*} v_{\\pi_{k-1}} \\le T_{\\pi_k} v_{\\pi_{k-1}}$} and {Lemma 1 and $P_{\\pi_k}$ is positive definite}, we have\n\n$$v_{\\pi_*} - v_{\\pi_k} = T_{\\pi_*} v_{\\pi_*} - T_{\\pi_*} v_{\\pi_{k-1}} + T_{\\pi_*} v_{\\pi_{k-1}} - T_{\\pi_k} v_{\\pi_{k-1}} + T_{\\pi_k} v_{\\pi_{k-1}} - T_{\\pi_k} v_{\\pi_k}$$\n\n$$\\le \\gamma P_{\\pi_*}(v_{\\pi_*} - v_{\\pi_{k-1}}) + \\gamma P_{\\pi_k}(v_{\\pi_{k-1}} - v_{\\pi_k}) \\le \\gamma P_{\\pi_*}(v_{\\pi_*} - v_{\\pi_{k-1}}).$$\n\nSince $v_{\\pi_*} - v_{\\pi_k}$ is non negative, we can take the max norm and get: $\\|v_{\\pi_*} - v_{\\pi_k}\\|_\\infty \\le \\gamma \\|v_{\\pi_*} - v_{\\pi_{k-1}}\\|_\\infty$.\n\nB Contraction property for Simplex-PI (Proof of Lemma 3)\n\nBy using the fact that {$v_\\pi = T_\\pi v_\\pi \\Rightarrow v_\\pi = (I - \\gamma P_\\pi)^{-1} r_\\pi$}, we have that for all pairs of policies $\\pi$ and $\\pi'$,\n\n$$v_{\\pi'} - v_\\pi = (I - \\gamma P_{\\pi'})^{-1} r_{\\pi'} - v_\\pi = (I - \\gamma P_{\\pi'})^{-1}(r_{\\pi'} + \\gamma P_{\\pi'} v_\\pi - v_\\pi) = (I - \\gamma P_{\\pi'})^{-1}(T_{\\pi'} v_\\pi - v_\\pi). \\quad (1)$$\n\nOn the one hand, by using this lemma and the fact that {$T_{\\pi_{k+1}} v_{\\pi_k} - v_{\\pi_k} \\ge 0$}, we have for any $k$: $v_{\\pi_{k+1}} - v_{\\pi_k} = (I - \\gamma P_{\\pi_{k+1}})^{-1}(T_{\\pi_{k+1}} v_{\\pi_k} - v_{\\pi_k}) \\ge T_{\\pi_{k+1}} v_{\\pi_k} - v_{\\pi_k}$, which implies that\n\n$$\\mathbf{1}^T (v_{\\pi_{k+1}} - v_{\\pi_k}) \\ge \\mathbf{1}^T (T_{\\pi_{k+1}} v_{\\pi_k} - v_{\\pi_k}). \\quad (2)$$\n\nOn the other hand, using Equation (1) and the facts that {$\\|(I - \\gamma P_{\\pi_*})^{-1}\\|_\\infty = \\frac{1}{1-\\gamma}$ and $(I - \\gamma P_{\\pi_*})^{-1}$ is positive definite}, {$\\max_s T_{\\pi_{k+1}} v_{\\pi_k}(s) = \\max_{s, \\tilde\\pi} T_{\\tilde\\pi} v_{\\pi_k}(s)$} and {$\\forall x \\ge 0,\\ \\max_s x(s) \\le \\mathbf{1}^T x$}, we have:\n\n$$v_{\\pi_*} - v_{\\pi_k} = (I - \\gamma P_{\\pi_*})^{-1}(T_{\\pi_*} v_{\\pi_k} - v_{\\pi_k}) \\le \\frac{1}{1-\\gamma} \\max_s \\left[T_{\\pi_*} v_{\\pi_k}(s) - v_{\\pi_k}(s)\\right]$$\n\n$$\\le \\frac{1}{1-\\gamma} \\max_s \\left[T_{\\pi_{k+1}} v_{\\pi_k}(s) - v_{\\pi_k}(s)\\right] \\le \\frac{1}{1-\\gamma} \\mathbf{1}^T (T_{\\pi_{k+1}} v_{\\pi_k} - v_{\\pi_k}),$$\n\nwhich implies (using {$\\forall x,\\ \\mathbf{1}^T x \\le n \\|x\\|_\\infty$}) that\n\n$$\\mathbf{1}^T (T_{\\pi_{k+1}} v_{\\pi_k} - v_{\\pi_k}) \\ge (1-\\gamma) \\|v_{\\pi_*} - v_{\\pi_k}\\|_\\infty \\ge \\frac{1-\\gamma}{n} \\mathbf{1}^T (v_{\\pi_*} - v_{\\pi_k}). \\quad (3)$$\n\nCombining Equations (2) and (3), we get:\n\n$$\\mathbf{1}^T (v_{\\pi_*} - v_{\\pi_{k+1}}) = \\mathbf{1}^T (v_{\\pi_*} - v_{\\pi_k}) - \\mathbf{1}^T (v_{\\pi_{k+1}} - v_{\\pi_k}) \\le \\mathbf{1}^T (v_{\\pi_*} - v_{\\pi_k}) - \\frac{1-\\gamma}{n} \\mathbf{1}^T (v_{\\pi_*} - v_{\\pi_k}) = \\left(1 - \\frac{1-\\gamma}{n}\\right) \\mathbf{1}^T (v_{\\pi_*} - v_{\\pi_k}).$$\n\nC A bound for Howard\u2019s PI when $\\gamma < 1$ (Proof of Theorem 3)\n\nFor any $k$, by using Equation (1) and the fact {$v_* - v_{\\pi_k} \\ge 0$ and $P_{\\pi_k}$ positive definite}, we have:\n\n$$v_* - T_{\\pi_k} v_* = (I - \\gamma P_{\\pi_k})(v_* - v_{\\pi_k}) \\le v_* - v_{\\pi_k}.$$\n\nSince $v_* - T_{\\pi_k} v_*$ is non negative, we can take the max norm and, using Lemma 2, Equation (1) and the fact that {$\\|(I - \\gamma P_{\\pi_0})^{-1}\\|_\\infty = \\frac{1}{1-\\gamma}$}, we get:\n\n$$\\|v_* - T_{\\pi_k} v_*\\|_\\infty \\le \\|v_* - v_{\\pi_k}\\|_\\infty \\le \\gamma^k \\|v_{\\pi_*} - v_{\\pi_0}\\|_\\infty = \\gamma^k \\|(I - \\gamma P_{\\pi_0})^{-1}(v_* - T_{\\pi_0} v_*)\\|_\\infty \\le \\frac{\\gamma^k}{1-\\gamma} \\|v_* - T_{\\pi_0} v_*\\|_\\infty. \\quad (4)$$\n\nBy definition of the max-norm, there exists a state $s_0$ such that $v_*(s_0) - [T_{\\pi_0} v_*](s_0) = \\|v_* - T_{\\pi_0} v_*\\|_\\infty$. From Equation (4), we deduce that for all $k$,\n\n$$v_*(s_0) - [T_{\\pi_k} v_*](s_0) \\le \\|v_* - T_{\\pi_k} v_*\\|_\\infty \\le \\frac{\\gamma^k}{1-\\gamma} \\|v_* - T_{\\pi_0} v_*\\|_\\infty = \\frac{\\gamma^k}{1-\\gamma} (v_*(s_0) - [T_{\\pi_0} v_*](s_0)).$$\n\nAs a consequence, the action $\\pi_k(s_0)$ must be different from $\\pi_0(s_0)$ when $\\frac{\\gamma^k}{1-\\gamma} < 1$, that is for all values of $k$ satisfying $k \\ge k^* = \\left\\lceil \\frac{\\log \\frac{1}{1-\\gamma}}{1-\\gamma} \\right\\rceil > \\left\\lceil \\frac{\\log \\frac{1}{1-\\gamma}}{\\log \\frac{1}{\\gamma}} \\right\\rceil$. In other words, if some policy $\\pi$ is not optimal, then one of its non-optimal actions will be eliminated for good after at most $k^*$ iterations. 
By repeating this argument, one can eliminate all non-optimal actions (they are at most $n(m-1)$), and the result follows.\n\nD A bound for Simplex-PI when $\\gamma < 1$ (Proof of Theorem 4)\n\nUsing {$\\forall x \\ge 0,\\ \\|x\\|_\\infty \\le \\mathbf{1}^T x$}, Lemma 3, {$\\forall x,\\ \\mathbf{1}^T x \\le n \\|x\\|_\\infty$}, Equation (1) and {$\\|(I - \\gamma P_{\\pi_0})^{-1}\\|_\\infty = \\frac{1}{1-\\gamma}$}, we have for all $k$,\n\n$$\\|v_{\\pi_*} - T_{\\pi_k} v_{\\pi_*}\\|_\\infty \\le \\|v_{\\pi_*} - v_{\\pi_k}\\|_\\infty \\le \\mathbf{1}^T (v_{\\pi_*} - v_{\\pi_k}) \\le \\left(1 - \\frac{1-\\gamma}{n}\\right)^k \\mathbf{1}^T (v_{\\pi_*} - v_{\\pi_0}) \\le n \\left(1 - \\frac{1-\\gamma}{n}\\right)^k \\|v_{\\pi_*} - v_{\\pi_0}\\|_\\infty$$\n\n$$= n \\left(1 - \\frac{1-\\gamma}{n}\\right)^k \\|(I - \\gamma P_{\\pi_0})^{-1}(v_* - T_{\\pi_0} v_*)\\|_\\infty \\le \\frac{n}{1-\\gamma} \\left(1 - \\frac{1-\\gamma}{n}\\right)^k \\|v_{\\pi_*} - T_{\\pi_0} v_{\\pi_*}\\|_\\infty.$$\n\nSimilarly to the proof for Howard\u2019s PI, we deduce that a non-optimal action is eliminated after at most $k^* = \\left\\lceil \\frac{n}{1-\\gamma} \\log \\frac{n}{1-\\gamma} \\right\\rceil \\ge \\left\\lceil \\frac{\\log \\frac{n}{1-\\gamma}}{-\\log\\left(1 - \\frac{1-\\gamma}{n}\\right)} \\right\\rceil$, and the overall number of iterations is obtained by noting that there are at most $n(m-1)$ non optimal actions to eliminate.\n\nReferences\n\n[1] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.\n\n[2] J. Fearnley. Exponential lower bounds for policy iteration. In Proceedings of the 37th international colloquium conference on Automata, languages and programming: Part II, ICALP\u201910, pages 551\u2013562, Berlin, Heidelberg, 2010. Springer-Verlag.\n\n[3] T.D. Hansen, P.B. Miltersen, and U. Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM, 60(1):1:1\u20131:16, February 2013.\n\n[4] T.D. Hansen and U. Zwick. 
Lower bounds for Howard\u2019s algorithm for finding minimum mean-cost cycles. In ISAAC (1), pages 415\u2013426, 2010.\n\n[5] R. Hollanders, J.C. Delvenne, and R. Jungers. The complexity of policy iteration is exponential for discounted Markov decision processes. In 51st IEEE Conference on Decision and Control (CDC\u201912), 2012.\n\n[6] Y. Mansour and S.P. Singh. On the complexity of policy iteration. In UAI, pages 401\u2013408, 1999.\n\n[7] M. Melekopoglou and A. Condon. On the complexity of the policy improvement algorithm for Markov decision processes. INFORMS Journal on Computing, 6(2):188\u2013192, 1994.\n\n[8] I. Post and Y. Ye. The simplex method is strongly polynomial for deterministic Markov decision processes. Technical report, arXiv:1208.5083v2, 2012.\n\n[9] M. Puterman. Markov Decision Processes. Wiley, New York, 1994.\n\n[10] N. Schmitz. How good is Howard\u2019s policy improvement algorithm? Zeitschrift f\u00fcr Operations Research, 29(7):315\u2013316, 1985.\n\n[11] Y. Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Math. Oper. Res., 36(4):593\u2013603, 2011.", "award": [], "sourceid": 258, "authors": [{"given_name": "Bruno", "family_name": "Scherrer", "institution": "INRIA"}]}