{"title": "(More) Efficient Reinforcement Learning via Posterior Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 3003, "page_last": 3011, "abstract": "Most provably efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an $\\tilde{O}(\\tau S \\sqrt{AT} )$ bound on the expected regret, where $T$ is time, $\\tau$ is the episode length and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.", "full_text": "(More) E\ufb03cient Reinforcement Learning via\n\nPosterior Sampling\n\nOsband, Ian\n\nStanford University\nStanford, CA 94305\n\niosband@stanford.edu\n\nRusso, Daniel\n\nStanford University\nStanford, CA 94305\n\ndjrusso@stanford.edu\n\nVan Roy, Benjamin\nStanford University\nStanford, CA 94305\nbvr@stanford.edu\n\nAbstract\n\nMost provably-e\ufb03cient reinforcement learning algorithms introduce opti-\nmism about poorly-understood states and actions to encourage exploration.\nWe study an alternative approach for e\ufb03cient exploration: posterior sam-\npling for reinforcement learning (PSRL). This algorithm proceeds in re-\npeated episodes of known duration. 
At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τS√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.

1 Introduction

We consider the classical reinforcement learning problem of an agent interacting with its environment while trying to maximize total reward accumulated over time [1, 2]. The agent's environment is modeled as a Markov decision process (MDP), but the agent is uncertain about the true dynamics of the MDP. As the agent interacts with its environment, it observes the outcomes that result from previous states and actions, and learns about the system dynamics. This leads to a fundamental tradeoff: by exploring poorly-understood states and actions the agent can learn to improve future performance, but it may attain better short-run performance by exploiting its existing knowledge.

Naïve optimization using point estimates for unknown variables overstates an agent's knowledge, and can lead to premature and suboptimal exploitation. To offset this, the majority of provably efficient learning algorithms use a principle known as optimism in the face of uncertainty [3] to encourage exploration. 
In such an algorithm, each state and action is afforded some optimism bonus such that their value to the agent is modeled to be as high as is statistically plausible. The agent will then choose a policy that is optimal under this "optimistic" model of the environment. This incentivizes exploration since poorly-understood states and actions will receive a higher optimism bonus. As the agent resolves its uncertainty, the effect of optimism is reduced and the agent's behavior approaches optimality. Many authors have provided strong theoretical guarantees for optimistic algorithms [4, 5, 6, 7, 8]. In fact, almost all reinforcement learning algorithms with polynomial bounds on sample complexity employ optimism to guide exploration.

We study an alternative approach to efficient exploration, posterior sampling, and provide finite time bounds on regret. We model the agent's initial uncertainty over the environment through a prior distribution.1 At the start of each episode, the agent chooses a new policy, which it follows for the duration of the episode. Posterior sampling for reinforcement learning (PSRL) selects this policy through two simple steps. First, a single instance of the environment is sampled from the posterior distribution at the start of an episode. Then, PSRL solves for and executes the policy that is optimal under the sampled environment over the episode. PSRL randomly selects policies according to the probability they are optimal; exploration is guided by the variance of sampled policies as opposed to optimism.

The idea of posterior sampling goes back to 1933 [9] and has been applied successfully to multi-armed bandits. In that literature, the algorithm is often referred to as Thompson sampling or as probability matching. 
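In the bandit setting, the two steps above reduce to sampling a mean for each arm from its posterior and pulling the argmax. A minimal sketch for a Bernoulli bandit with Beta(1, 1) priors (the arm means and horizon below are illustrative, not from the paper):

```python
import random

def thompson_bandit(true_means, horizon, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1,1) priors."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta posterior parameters per arm: alpha = successes + 1, beta = failures + 1.
    alpha = [1] * n_arms
    beta = [1] * n_arms
    total_reward = 0
    for _ in range(horizon):
        # Sample one mean per arm from its posterior; play the arm whose
        # sample is largest (i.e., select arms by probability of optimality).
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward, alpha, beta

reward, alpha, beta = thompson_bandit([0.3, 0.7], horizon=2000)
```

As the posterior of the better arm concentrates, the sampled means select it almost always, which is the same mechanism PSRL uses at the level of whole MDPs.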
Despite its long history, posterior sampling was largely\nneglected by the multi-armed bandit literature until empirical studies [10, 11] demonstrated\nthat the algorithm could produce state of the art performance. This prompted a surge of\ninterest, and a variety of strong theoretical guarantees are now available [12, 13, 14, 15].\nOur results suggest this method has great potential in reinforcement learning as well.\nPSRL was originally introduced in the context of reinforcement learning by Strens [16]\nunder the name \u201cBayesian Dynamic Programming\u201d,2 where it appeared primarily as a\nheuristic method. In reference to PSRL and other \u201cBayesian RL\u201d algorithms, Kolter and\nNg [17] write \u201clittle is known about these algorithms from a theoretical perspective, and\nit is unclear, what (if any) formal guarantees can be made for such approaches.\u201d Those\nBayesian algorithms for which performance guarantees exist are guided by optimism. BOSS\n[18] introduces a more complicated version of PSRL that samples many MDPs, instead\nof just one, and then combines them into an optimistic environment to guide exploration.\nBEB [17] adds an exploration bonus to states and actions according to how infrequently\nthey have been visited. We show it is not always necessary to introduce optimism via a\ncomplicated construction, and that the simple algorithm originally proposed by Strens [16]\nsatis\ufb01es strong bounds itself.\nOur work is motivated by several advantages of posterior sampling relative to optimistic\nalgorithms. First, since PSRL only requires solving for an optimal policy for a single sam-\npled MDP, it is computationally e\ufb03cient both relative to many optimistic methods, which\nrequire simultaneous optimization across a family of plausible environments [4, 5, 18], and\nto computationally intensive approaches that attempt to approximate the Bayes-optimal\nsolutions directly [18, 19, 20]. 
Second, the presence of an explicit prior allows an agent to incorporate known environment structure in a natural way. This is crucial for most practical applications, as learning without prior knowledge requires exhaustive experimentation in each possible state. Finally, posterior sampling allows us to separate the algorithm from the analysis. In any optimistic algorithm, performance is greatly influenced by the manner in which optimism is implemented. Past works have designed algorithms, at least in part, to facilitate theoretical analysis for toy problems. Although our analysis of posterior sampling is closely related to the analysis in [4], this worst-case bound has no impact on the algorithm's actual performance. In addition, PSRL is naturally suited to more complex settings where design of an efficiently optimistic algorithm might not be possible. We demonstrate through a computational study in Section 6 that PSRL outperforms the optimistic algorithm UCRL2 [4], a competitor with similar regret bounds, over some example MDPs.

2 Problem formulation

We consider the problem of learning to optimize a random finite horizon MDP M = (S, A, R^M, P^M, τ, ρ) in repeated finite episodes of interaction. S is the state space, A is the action space, R^M_a(s) is a probability distribution, with support [0, 1], over the reward realized when selecting action a while in state s, P^M_a(s′|s) is the probability of transitioning to state s′ if action a is selected while at state s, τ is the time horizon, and ρ the initial state distribution. We define the MDP and all other random variables we will consider with respect to a probability space (Ω, F, P).

1 For an MDP, this might be a prior over transition dynamics and reward distributions.
2 We alter terminology since PSRL is neither Bayes-optimal, nor a direct approximation of this.

We assume S, A, and τ are deterministic so the agent need not learn the state and action spaces or the time horizon.

A deterministic policy µ is a function mapping each state s ∈ S and i = 1, . . . , τ to an action a ∈ A. For each MDP M = (S, A, R^M, P^M, τ, ρ) and policy µ, we define a value function

V^M_{µ,i}(s) := E_{M,µ}[ Σ_{j=i}^{τ} R̄^M_{a_j}(s_j) | s_i = s ],

where R̄^M_a(s) denotes the expected reward realized when action a is selected while in state s, and the subscripts of the expectation operator indicate that a_j = µ(s_j, j) and s_{j+1} ∼ P^M_{a_j}(·|s_j) for j = i, . . . , τ. A policy µ is said to be optimal for MDP M if V^M_{µ,i}(s) = max_{µ′} V^M_{µ′,i}(s) for all s ∈ S and i = 1, . . . , τ. We will associate with each MDP M a policy µ^M that is optimal for M.

The reinforcement learning agent interacts with the MDP over episodes that begin at times t_k = (k − 1)τ + 1, k = 1, 2, . . .. At each time t, the agent selects an action a_t, observes a scalar reward r_t, and then transitions to s_{t+1}. If an agent follows a policy µ then, when in state s at time t during episode k, it selects an action a_t = µ(s, t − t_k). Let H_t = (s_1, a_1, r_1, . . . , s_{t−1}, a_{t−1}, r_{t−1}) denote the history of observations made prior to time t. A reinforcement learning algorithm is a deterministic sequence {π_k | k = 1, 2, . . .} of functions, each mapping H_{t_k} to a probability distribution π_k(H_{t_k}) over policies. At the start of the kth episode, the algorithm samples a policy µ_k from the distribution π_k(H_{t_k}). 
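For a known tabular MDP, the finite-horizon value functions just defined can be computed exactly by backward induction. A sketch, with a two-state instance of our own invention (the transition and reward numbers are illustrative only):

```python
def solve_finite_horizon(P, R, tau):
    """Backward induction for a finite-horizon tabular MDP.

    P[s][a][s2] = transition probability, R[s][a] = expected reward.
    Returns (V, mu): V[i][s] is the optimal value-to-go at step i
    (with V[tau][s] = 0), and mu[i][s] the optimal action at step i.
    """
    S, A = len(P), len(P[0])
    V = [[0.0] * S for _ in range(tau + 1)]
    mu = [[0] * S for _ in range(tau)]
    for i in range(tau - 1, -1, -1):  # Bellman backup from the horizon
        for s in range(S):
            q = [R[s][a] + sum(P[s][a][s2] * V[i + 1][s2] for s2 in range(S))
                 for a in range(A)]
            mu[i][s] = max(range(A), key=lambda a: q[a])
            V[i][s] = q[mu[i][s]]
    return V, mu

# Illustrative instance: action 1 moves to (or stays in) rewarding state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [0.0, 1.0]]
V, mu = solve_finite_horizon(P, R, tau=3)
```

With three steps to go, the optimal policy from state 0 moves to state 1 immediately and then collects reward, so V[0][0] = 2 and V[0][1] = 3 in this toy instance.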
The algorithm then selects actions a_t = µ_k(s_t, t − t_k) at times t during the kth episode.

We define the regret incurred by a reinforcement learning algorithm π up to time T to be

Regret(T, π) := Σ_{k=1}^{⌈T/τ⌉} ∆_k,

where ∆_k denotes regret over the kth episode, defined with respect to the MDP M* by

∆_k = Σ_{s∈S} ρ(s)(V^{M*}_{µ*,1}(s) − V^{M*}_{µ_k,1}(s)),

with µ* = µ^{M*} and µ_k ∼ π_k(H_{t_k}). Note that regret is not deterministic since it can depend on the random MDP M*, the algorithm's internal random sampling and, through the history H_{t_k}, on previous random transitions and random rewards. We will assess and compare algorithm performance in terms of regret and its expectation.

3 Posterior sampling for reinforcement learning

The use of posterior sampling for reinforcement learning (PSRL) was first proposed by Strens [16]. PSRL begins with a prior distribution over MDPs with states S, actions A and horizon τ. At the start of each kth episode, PSRL samples an MDP M_k from the posterior distribution conditioned on the history H_{t_k} available at that time. PSRL then computes and follows the policy µ_k = µ^{M_k} over episode k.

Algorithm: Posterior Sampling for Reinforcement Learning (PSRL)

Data: prior distribution f, t = 1
for episodes k = 1, 2, . . . do
    sample M_k ∼ f(·|H_{t_k})
    compute µ_k = µ^{M_k}
    for timesteps j = 1, . . . , τ do
        sample and apply a_t = µ_k(s_t, j)
        observe r_t and s_{t+1}
        t = t + 1
    end
end

We show PSRL obeys performance guarantees intimately related to those for learning algorithms based upon OFU, as has been demonstrated for multi-armed bandit problems [15]. We believe that a posterior sampling approach offers some inherent advantages. 
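The loop above can be sketched concretely for a tabular MDP. The sketch below assumes, for illustration only, independent Dirichlet(1, . . . , 1) priors on each transition row and known rewards (the paper's experiments additionally place normal-gamma priors on rewards); the environment at the bottom is our own toy example:

```python
import random

def sample_dirichlet(row, rng):
    """One draw from a Dirichlet posterior via normalized Gamma variates."""
    g = [rng.gammavariate(c, 1.0) for c in row]
    z = sum(g)
    return [x / z for x in g]

def psrl(true_P, R, tau, episodes, seed=0):
    """PSRL with Dirichlet priors on transitions and known rewards R[s][a].

    true_P[s][a][s2] holds the unknown true transition probabilities.
    Returns total reward accumulated over all episodes.
    """
    rng = random.Random(seed)
    S, A = len(true_P), len(true_P[0])
    counts = [[[1.0] * S for _ in range(A)] for _ in range(S)]  # prior pseudocounts
    total = 0.0
    for _ in range(episodes):
        # Step 1: sample one MDP from the posterior.
        P_k = [[sample_dirichlet(counts[s][a], rng) for a in range(A)]
               for s in range(S)]
        # Step 2: compute the optimal policy for the sample (backward induction).
        V = [[0.0] * S for _ in range(tau + 1)]
        mu = [[0] * S for _ in range(tau)]
        for i in range(tau - 1, -1, -1):
            for s in range(S):
                q = [R[s][a] + sum(P_k[s][a][s2] * V[i + 1][s2] for s2 in range(S))
                     for a in range(A)]
                mu[i][s] = max(range(A), key=lambda a: q[a])
                V[i][s] = q[mu[i][s]]
        # Step 3: follow that policy for one episode, updating the posterior counts.
        s = 0
        for i in range(tau):
            a = mu[i][s]
            s2 = rng.choices(range(S), weights=true_P[s][a])[0]
            counts[s][a][s2] += 1.0
            total += R[s][a]
            s = s2
    return total

# Toy chain: from state 0, action 1 reaches the rewarding state 1 w.p. 0.9.
true_P = [[[1.0, 0.0], [0.1, 0.9]], [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]
total = psrl(true_P, R, tau=5, episodes=100)
```

Note that the only randomization is the single posterior draw per episode; within an episode the agent acts greedily with respect to the sampled model.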
Optimistic algorithms require explicit construction of the confidence bounds on V^{M*}_{µ,1}(s) based on observed data, which is a complicated statistical problem even for simple models. In addition, even if strong confidence bounds for V^{M*}_{µ,1}(s) were known, solving for the best optimistic policy may be computationally intractable. Algorithms such as UCRL2 [4] are computationally tractable, but must resort to separately bounding R̄^M_a(s) and P^M_a(·|s) with high probability for each (s, a). These bounds allow a "worst-case" mis-estimation simultaneously in every state-action pair and consequently give rise to a confidence set which may be far too conservative.

By contrast, PSRL always selects policies according to the probability they are optimal. Uncertainty about each policy is quantified in a statistically efficient way through the posterior distribution. The algorithm only requires a single sample from the posterior, which may be approximated through algorithms such as Metropolis-Hastings if no closed form exists. As such, we believe PSRL will be simpler to implement, computationally cheaper and statistically more efficient than existing optimistic methods.

3.1 Main results

The following result establishes regret bounds for PSRL. The bounds have Õ(τS√(AT)) expected regret, and, to our knowledge, provide the first guarantees for an algorithm not based upon optimism:

Theorem 1. If f is the distribution of M* then,

E[Regret(T, π^PS_τ)] = O( τ S √(AT log(SAT)) ).   (1)

This result holds for any prior distribution on MDPs, and so applies to an immense class of models. To accommodate this generality, the result bounds expected regret under the prior distribution (sometimes called Bayes risk or Bayesian regret). 
We feel this is a natural measure of performance, but should emphasize that it is more common in the literature to bound regret under a worst-case MDP instance. The next result provides a link between these notions of regret. Applying Markov's inequality to (1) gives convergence in probability.

Corollary 1. If f is the distribution of M* then for any α > 1/2,

Regret(T, π^PS_τ) / T^α →_p 0.

As shown in the appendix, this also bounds the frequentist regret for any MDP with non-zero probability. State-of-the-art guarantees similar to Theorem 1 are satisfied by the algorithms UCRL2 [4] and REGAL [5] for the case of non-episodic RL. Here UCRL2 gives regret bounds Õ(DS√(AT)) where D = max_{s′≠s} min_π E[T(s′|M, π, s)] and T(s′|M, π, s) is the first time step where s′ is reached from s under the policy π. REGAL improves this result to Õ(ΨS√(AT)) where Ψ ≤ D is the span of the optimal value function. However, there is so far no computationally tractable implementation of this algorithm.

In many practical applications we may be interested in episodic learning tasks where the constants D and Ψ could be improved to take advantage of the episode length τ. Simple modifications to both UCRL2 and REGAL will produce regret bounds of Õ(τS√(AT)), just as PSRL. This is close to the theoretical lower bound in its √(SAT)-dependence.

4 True versus sampled MDP

A simple observation, which is central to our analysis, is that, at the start of each kth episode, M* and M_k are identically distributed. This fact allows us to relate quantities that depend on the true, but unknown, MDP M*, to those of the sampled MDP M_k, which is fully observed by the agent. We introduce σ(H_{t_k}) as the σ-algebra generated by the history up to t_k. 
Readers unfamiliar with measure theory can think of this as "all information known just before the start of period t_k." When we say that a random variable X is σ(H_{t_k})-measurable, this intuitively means that although X is random, it is deterministically known given the information contained in H_{t_k}. The following lemma is an immediate consequence of this observation [15].

Lemma 1 (Posterior Sampling). If f is the distribution of M* then, for any σ(H_{t_k})-measurable function g,

E[g(M*) | H_{t_k}] = E[g(M_k) | H_{t_k}].   (2)

Note that taking the expectation of (2) shows E[g(M*)] = E[g(M_k)] through the tower property.

Recall, we have defined ∆_k = Σ_{s∈S} ρ(s)(V^{M*}_{µ*,1}(s) − V^{M*}_{µ_k,1}(s)) to be the regret over period k. A significant hurdle in analyzing this quantity is its dependence on the optimal policy µ*, which we do not observe. For many reinforcement learning algorithms, there is no clean way to relate the unknown optimal policy to the states and actions the agent actually observes. The following result shows how we can avoid this issue using Lemma 1. First, define

∆̃_k = Σ_{s∈S} ρ(s)(V^{M_k}_{µ_k,1}(s) − V^{M*}_{µ_k,1}(s))   (3)

as the difference in expected value of the policy µ_k under the sampled MDP M_k, which is known, and its performance under the true MDP M*, which is observed by the agent.

Theorem 2 (Regret equivalence).

E[ Σ_{k=1}^m ∆_k ] = E[ Σ_{k=1}^m ∆̃_k ]   (4)

and for any δ > 0, with probability at least 1 − δ,

Σ_{k=1}^m (∆_k − ∆̃_k) ≤ τ √(2m log(1/δ)).   (5)

Proof. Note ∆_k − ∆̃_k = Σ_{s∈S} ρ(s)(V^{M*}_{µ*,1}(s) − V^{M_k}_{µ_k,1}(s)) ∈ [−τ, τ]. By Lemma 1, E[∆_k − ∆̃_k | H_{t_k}] = 0. 
Taking expectations of these sums therefore establishes the first claim; the high-probability statement follows by applying the Azuma-Hoeffding inequality to the bounded martingale difference sequence (∆_k − ∆̃_k).

This result bounds the agent's regret in episode k by the difference between the agent's estimate V^{M_k}_{µ_k,1}(s_{t_k}) of the expected reward in M_k from the policy it chooses, and the expected reward V^{M*}_{µ_k,1}(s_{t_k}) in M*. If the agent has a poor estimate of the MDP M*, we expect it to learn as the performance of following µ_k under M* differs from its expectation under M_k. As more information is gathered, its performance should improve. In the next section, we formalize these ideas and give a precise bound on the regret of posterior sampling.

5 Analysis

An essential tool in our analysis will be the dynamic programming, or Bellman, operator T^M_µ, which for any MDP M = (S, A, R^M, P^M, τ, ρ), stationary policy µ : S → A and value function V : S → R, is defined by

T^M_µ V(s) := R̄^M_{µ(s)}(s) + Σ_{s′∈S} P^M_{µ(s)}(s′|s) V(s′).

This operation returns the expected value of state s where we follow the policy µ under the laws of M for one time step. The following lemma gives a concise form for the dynamic programming paradigm in terms of the Bellman operator.

Lemma 2 (Dynamic programming equation). For any MDP M = (S, A, R^M, P^M, τ, ρ) and policy µ : S × {1, . . . , τ} → A, the value functions V^M_µ satisfy

V^M_{µ,i} = T^M_{µ(·,i)} V^M_{µ,i+1}

for i = 1, . . . , 
τ, with V^M_{µ,τ+1} := 0.

In order to streamline our notation we will let V*_{µ,i} := V^{M*}_{µ,i}, V^k_{µ,i} := V^{M_k}_{µ,i}, T^k_µ := T^{M_k}_µ, T*_µ := T^{M*}_µ and P*_µ(·|s) := P^{M*}_{µ(s)}(·|s).

5.1 Rewriting regret in terms of Bellman error

E[ ∆̃_k | M*, M_k ] = E[ Σ_{i=1}^τ (T^k_{µ_k(·,i)} − T*_{µ_k(·,i)}) V^k_{µ_k,i+1}(s_{t_k+i}) | M*, M_k ].   (6)

To see why (6) holds, simply apply the dynamic programming equation inductively:

(V^k_{µ_k,1} − V*_{µ_k,1})(s_{t_k+1})
  = (T^k_{µ_k(·,1)} V^k_{µ_k,2} − T*_{µ_k(·,1)} V*_{µ_k,2})(s_{t_k+1})
  = (T^k_{µ_k(·,1)} − T*_{µ_k(·,1)}) V^k_{µ_k,2}(s_{t_k+1}) + Σ_{s′∈S} P*_{µ_k(·,1)}(s′|s_{t_k+1}) (V^k_{µ_k,2} − V*_{µ_k,2})(s′)
  = (T^k_{µ_k(·,1)} − T*_{µ_k(·,1)}) V^k_{µ_k,2}(s_{t_k+1}) + (V^k_{µ_k,2} − V*_{µ_k,2})(s_{t_k+2}) + d_{t_k+1}
  = . . .
  = Σ_{i=1}^τ (T^k_{µ_k(·,i)} − T*_{µ_k(·,i)}) V^k_{µ_k,i+1}(s_{t_k+i}) + Σ_{i=1}^τ d_{t_k+i},

where d_{t_k+i} := Σ_{s′∈S} P*_{µ_k(·,i)}(s′|s_{t_k+i}) (V^k_{µ_k,i+1} − V*_{µ_k,i+1})(s′) − (V^k_{µ_k,i+1} − V*_{µ_k,i+1})(s_{t_k+i+1}).

This expresses the regret in terms of two factors. The first factor is the one-step Bellman error (T^k_{µ_k(·,i)} − T*_{µ_k(·,i)}) V^k_{µ_k,i+1}(s_{t_k+i}) under the sampled MDP M_k. Crucially, (6) depends only on the Bellman error under the observed policy µ_k and the states s_1, . . . , s_T that are actually visited over the first T periods. 
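Identity (6) can be verified numerically on a small example: the expected value gap of a fixed policy between a sampled and a true model equals the sum of one-step Bellman errors of the sampled model, weighted by the state distribution the true model induces. The two-state instance below is our own illustration:

```python
def policy_value(P, R, mu, tau):
    """Policy evaluation by backward induction: V[i][s], with V[tau] = 0."""
    S = len(P)
    V = [[0.0] * S for _ in range(tau + 1)]
    for i in range(tau - 1, -1, -1):
        for s in range(S):
            a = mu[i][s]
            V[i][s] = R[s][a] + sum(P[s][a][s2] * V[i + 1][s2] for s2 in range(S))
    return V

def bellman_error_sum(P_true, P_samp, R_true, R_samp, mu, rho, tau):
    """RHS of (6): expected one-step Bellman errors of the sampled model
    along the state distribution induced by the true MDP under mu."""
    S = len(P_true)
    Vk = policy_value(P_samp, R_samp, mu, tau)
    dist = list(rho)  # distribution of s_{t_k + i} under M*, mu
    total = 0.0
    for i in range(tau):
        for s in range(S):
            a = mu[i][s]
            Tk = R_samp[s][a] + sum(P_samp[s][a][s2] * Vk[i + 1][s2] for s2 in range(S))
            Tstar = R_true[s][a] + sum(P_true[s][a][s2] * Vk[i + 1][s2] for s2 in range(S))
            total += dist[s] * (Tk - Tstar)
        # Propagate the state distribution one step under the TRUE dynamics.
        dist = [sum(dist[s] * P_true[s][mu[i][s]][s2] for s in range(S))
                for s2 in range(S)]
    return total

# Arbitrary illustrative "true" and "sampled" models and a fixed policy.
P_true = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.7, 0.3]]]
P_samp = [[[0.6, 0.4], [0.3, 0.7]], [[0.4, 0.6], [0.8, 0.2]]]
R_true = [[0.1, 0.5], [0.9, 0.2]]
R_samp = [[0.2, 0.4], [0.8, 0.3]]
mu = [[1, 0], [0, 1], [1, 1]]  # mu[i][s]: action at step i in state s
rho = [0.5, 0.5]
tau = 3
lhs = sum(rho[s] * (policy_value(P_samp, R_samp, mu, tau)[0][s]
                    - policy_value(P_true, R_true, mu, tau)[0][s]) for s in range(2))
rhs = bellman_error_sum(P_true, P_samp, R_true, R_samp, mu, rho, tau)
```

The two quantities agree exactly, which is the content of the telescoping argument above.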
We go on to show the posterior distribution of M_k concentrates around M* as these actions are sampled, and so this term tends to zero.

The second term captures the randomness in the transitions of the true MDP M*. In state s_{t_k+i} under policy µ_k, the expected value of (V^k_{µ_k,i+1} − V*_{µ_k,i+1})(s_{t_k+i+1}) is exactly Σ_{s′∈S} P*_{µ_k(·,i)}(s′|s_{t_k+i}) (V^k_{µ_k,i+1} − V*_{µ_k,i+1})(s′). Hence, conditioned on the true MDP M* and the sampled MDP M_k, the term Σ_{i=1}^τ d_{t_k+i} has expectation zero.

5.2 Introducing confidence sets

The last section reduced the algorithm's regret to its expected Bellman error. We will proceed by arguing that the sampled Bellman operator T^k_{µ_k(·,i)} concentrates around the true Bellman operator T*_{µ_k(·,i)}. To do this, we introduce high-probability confidence sets similar to those used in [4] and [5]. Let P̂^t_a(·|s) denote the empirical distribution up to period t of transitions observed after sampling (s, a), and let R̂^t_a(s) denote the empirical average reward. Finally, define N_{t_k}(s, a) = Σ_{t=1}^{t_k−1} 1{(s_t, a_t) = (s, a)} to be the number of times (s, a) was sampled prior to time t_k. Define the confidence set for episode k:

M_k := { M : ||P̂^{t_k}_a(·|s) − P^M_a(·|s)||_1 ≤ β_k(s, a)  and  |R̂^{t_k}_a(s) − R̄^M_a(s)| ≤ β_k(s, a)  ∀(s, a) },

where β_k(s, a) := √( 14 S log(2SAm t_k) / max{1, N_{t_k}(s, a)} ) is chosen conservatively so that M_k contains both M* and M_k with high probability. It's worth pointing out that we have not tried to optimize this confidence bound, and it can be improved, at least by a numerical factor, with more careful analysis. 
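The width β_k and the membership test defining M_k are straightforward to compute; a minimal sketch (the S, A, m, t_k and visit counts below are illustrative numbers of our choosing):

```python
import math

def confidence_width(S, A, m, t_k, n_visits):
    """beta_k(s,a) = sqrt(14 S log(2 S A m t_k) / max(1, N_{t_k}(s,a)))."""
    return math.sqrt(14 * S * math.log(2 * S * A * m * t_k) / max(1, n_visits))

def in_confidence_set(P_hat, R_hat, P_model, R_model, beta):
    """Membership test for one (s, a): L1 ball on transitions,
    absolute-error interval on the mean reward."""
    l1 = sum(abs(p_hat - p) for p_hat, p in zip(P_hat, P_model))
    return l1 <= beta and abs(R_hat - R_model) <= beta

# Illustrative numbers: S=6, A=2 (RiverSwim-sized), m=500 episodes, t_k=1000.
beta_unvisited = confidence_width(6, 2, 500, 1000, n_visits=0)
beta_visited = confidence_width(6, 2, 500, 1000, n_visits=400)
```

The width shrinks like 1/√N_{t_k}(s, a), so well-visited pairs contribute little to the regret sum while unvisited pairs keep a large (but capped at 1 in the analysis) width.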
Now, using that ∆̃_k ≤ τ, we can decompose regret as follows:

Σ_{k=1}^m ∆̃_k ≤ Σ_{k=1}^m ∆̃_k 1{M_k, M* ∈ M_k} + τ Σ_{k=1}^m [ 1{M_k ∉ M_k} + 1{M* ∉ M_k} ].   (7)

Now, since M_k is σ(H_{t_k})-measurable, by Lemma 1, E[1{M_k ∉ M_k} | H_{t_k}] = E[1{M* ∉ M_k} | H_{t_k}]. Lemma 17 of [4] shows3 P(M* ∉ M_k) ≤ 1/m for this choice of β_k(s, a), which implies

E[ Σ_{k=1}^m ∆̃_k ] ≤ E[ Σ_{k=1}^m ∆̃_k 1{M_k, M* ∈ M_k} ] + 2τ Σ_{k=1}^m P{M* ∉ M_k}
  ≤ E[ Σ_{k=1}^m E[∆̃_k | M*, M_k] 1{M_k, M* ∈ M_k} ] + 2τ
  ≤ E[ Σ_{k=1}^m Σ_{i=1}^τ |(T^k_{µ_k(·,i)} − T*_{µ_k(·,i)}) V^k_{µ_k,i+1}(s_{t_k+i})| 1{M_k, M* ∈ M_k} ] + 2τ
  ≤ τ E[ Σ_{k=1}^m Σ_{i=1}^τ min{β_k(s_{t_k+i}, a_{t_k+i}), 1} ] + 2τ.   (8)

We also have the worst-case bound Σ_{k=1}^m ∆̃_k ≤ T. In the technical appendix we go on to provide a worst-case bound on min{ τ Σ_{k=1}^m Σ_{i=1}^τ min{β_k(s_{t_k+i}, a_{t_k+i}), 1}, T } of order τ S √(AT log(SAT)), which completes our analysis.

6 Simulation results

We compare performance of PSRL to UCRL2 [4]: an optimistic algorithm with similar regret bounds. We use the standard example of RiverSwim [21], as well as several randomly generated MDPs. We provide results in both the episodic case, where the state is reset every τ = 20 steps, as well as the setting without episodic reset.

Figure 1: RiverSwim - continuous and dotted arrows represent the MDP under the actions "right" and "left".

RiverSwim consists of six states arranged in a chain as shown in Figure 1. The agent begins at the far left state and at every time step has the choice to swim left or right. 
Swimming left (with the current) is always successful, but swimming right (against the current) often fails. The agent receives a small reward for reaching the leftmost state, but the optimal policy is to attempt to swim right and receive a much larger reward. This MDP is constructed so that efficient exploration is required in order to obtain the optimal policy. To generate the random MDPs, we sampled 10-state, 5-action environments according to the prior.

We express our prior in terms of Dirichlet and normal-gamma distributions over the transitions and rewards respectively.4 In both environments we perform 20 Monte Carlo simulations and compute the total regret over 10,000 time steps. We implement UCRL2 with δ = 0.05 and optimize the algorithm to take account of finite episodes where appropriate. PSRL outperformed UCRL2 across every environment, as shown in Table 1. In Figure 2, we show regret through time across 50 Monte Carlo simulations to 100,000 time-steps in the RiverSwim environment: PSRL's outperformance is quite extreme.

3 Our confidence sets are equivalent to those of [4] when the parameter δ = 1/m.
4 These priors are conjugate to the multinomial and normal distribution. We used the values α = 1/S, µ = σ² = 1 and pseudocount n = 1 for a diffuse uniform prior.

Table 1: Total regret in simulation. PSRL outperforms UCRL2 over different environments.

Algorithm   Random MDP τ-episodes   Random MDP ∞-horizon   RiverSwim ∞-horizon   RiverSwim τ-episodes
PSRL        1.06 × 10²              6.88 × 10¹             7.30 × 10³            1.04 × 10⁴
UCRL2       3.64 × 10³              1.26 × 10³             1.13 × 10⁵            5.92 × 10⁴

6.1 Learning in MDPs without episodic resets

The majority of practical problems in reinforcement learning can be mapped to repeated episodic interactions for some length τ. 
Even in cases where there is no actual reset of episodes, one can show that PSRL's regret is bounded against all policies which work over horizon τ or less [6]. Any setting with discount factor α can be learned for τ ∝ (1 − α)⁻¹. One appealing feature of UCRL2 [4] and REGAL [5] is that they learn this optimal timeframe τ. Instead of computing a new policy after a fixed number of periods, they begin a new episode when the total visits to any state-action pair is doubled. We can apply this same rule for episodes to PSRL in the ∞-horizon case, as shown in Figure 2. Using optimism with KL-divergence instead of L1 balls has also shown improved performance over UCRL2 [22], but its regret remains orders of magnitude more than PSRL on RiverSwim.

(a) PSRL outperforms UCRL2 by large margins   (b) PSRL learns quickly despite misspecified prior

Figure 2: Simulated regret on the ∞-horizon RiverSwim environment.

7 Conclusion

We establish posterior sampling for reinforcement learning not just as a heuristic, but as a provably efficient learning algorithm. We present Õ(τS√(AT)) Bayesian regret bounds, which are some of the first for an algorithm not motivated by optimism and are close to state of the art for any reinforcement learning algorithm. These bounds hold in expectation irrespective of prior or model structure. PSRL is conceptually simple, computationally efficient and can easily incorporate prior knowledge. Compared to feasible optimistic algorithms we believe that PSRL is often more efficient statistically, simpler to implement and computationally cheaper. We demonstrate that PSRL performs well in simulation over several domains. 
We\nbelieve there is a strong case for the wider adoption of algorithms based upon posterior\nsampling in both theory and practice.\n\nAcknowledgments\nOsband and Russo are supported by Stanford Graduate Fellowships courtesy of PACCAR\ninc., and Burt and Deedee McMurty, respectively. This work was supported in part by\nAward CMMI-0968707 from the National Science Foundation.\n\n8\n\n\fReferences\n[1] A. N. Burnetas and M. N. Katehakis. Optimal adaptive policies for markov decision processes.\n\nMathematics of Operations Research, 22(1):222\u2013255, 1997.\n\n[2] P. R. Kumar and P. Varaiya. Stochastic systems: estimation, identi\ufb01cation and adaptive\n\ncontrol. Prentice-Hall, Inc., 1986.\n\n[3] T.L. Lai and H. Robbins. Asymptotically e\ufb03cient adaptive allocation rules. Advances in\n\napplied mathematics, 6(1):4\u201322, 1985.\n\n[4] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning.\n\nThe Journal of Machine Learning Research, 99:1563\u20131600, 2010.\n\n[5] P. L. Bartlett and A. Tewari. Regal: A regularization based algorithm for reinforcement\nlearning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on\nUncertainty in Arti\ufb01cial Intelligence, pages 35\u201342. AUAI Press, 2009.\n\n[6] R. I. Brafman and M. Tennenholtz. R-max-a general polynomial time algorithm for near-\noptimal reinforcement learning. The Journal of Machine Learning Research, 3:213\u2013231, 2003.\n[7] S. M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of\n\nLondon, 2003.\n\n[8] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine\n\nLearning, 49(2-3):209\u2013232, 2002.\n\n[9] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of\n\nthe evidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n\n[10] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. 
In Neural Information Processing Systems (NIPS), 2011.

[11] S. L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.

[12] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. arXiv preprint arXiv:1209.3353, 2012.

[13] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. arXiv preprint arXiv:1209.3352, 2012.

[14] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal finite time analysis. In International Conference on Algorithmic Learning Theory, 2012.

[15] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.

[16] M. Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 943–950, 2000.

[17] J. Z. Kolter and A. Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM, 2009.

[18] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In Proceedings of the 22nd International Conference on Machine Learning, pages 956–963. ACM, 2005.

[19] A. Guez, D. Silver, and P. Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. arXiv preprint arXiv:1205.3109, 2012.

[20] J. Asmuth and M. L. Littman. Approaching Bayes-optimality using Monte-Carlo tree search. In Proc. 21st Int. Conf. Automat. Plan. Sched., Freiburg, Germany, 2011.

[21] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

[22] S. Filippi, O. Cappé, and A. Garivier.
Optimism in reinforcement learning based on Kullback-Leibler divergence. CoRR, abs/1004.5229, 2010.

A Relating Bayesian to frequentist regret

Let $\mathcal{M}$ be any family of MDPs with non-zero probability under the prior $f$. Then, for any $\epsilon > 0$ and $\alpha > \frac{1}{2}$:
$$ P\left( \frac{\mathrm{Regret}(T, \pi^{PS}_\tau)}{T^\alpha} > \epsilon \;\Big|\; M^* \in \mathcal{M} \right) \rightarrow 0 $$
This provides regret bounds even if $M^*$ is not distributed according to $f$. As long as the true MDP is not impossible under the prior, the asymptotic frequentist regret is close to the theoretical lower bound of $O(\sqrt{T})$ in its $T$-dependence.

Proof. We have for any $\epsilon > 0$:
$$ \frac{\mathbb{E}[\mathrm{Regret}(T, \pi^{PS}_\tau)]}{T^\alpha} \;\ge\; \mathbb{E}\left[ \frac{\mathrm{Regret}(T, \pi^{PS}_\tau)}{T^\alpha} \;\Big|\; M^* \in \mathcal{M} \right] P(M^* \in \mathcal{M}) \;\ge\; \epsilon\, P\left( \frac{\mathrm{Regret}(T, \pi^{PS}_\tau)}{T^\alpha} > \epsilon \;\Big|\; M^* \in \mathcal{M} \right) P(M^* \in \mathcal{M}) $$
Therefore, via Theorem 1, for any $\alpha > \frac{1}{2}$:
$$ P\left( \frac{\mathrm{Regret}(T, \pi^{PS}_\tau)}{T^\alpha} > \epsilon \;\Big|\; M^* \in \mathcal{M} \right) \;\le\; \frac{1}{\epsilon\, P(M^* \in \mathcal{M})} \cdot \frac{\mathbb{E}[\mathrm{Regret}(T, \pi^{PS}_\tau)]}{T^\alpha} \;\rightarrow\; 0 $$

B Bounding the sum of confidence set widths

We are interested in bounding $\min\{\tau \sum_{k=1}^m \sum_{i=1}^\tau \min\{\beta_k(s_{t_k+i}, a_{t_k+i}), 1\},\; T\}$, which we claim is $O(\tau S \sqrt{AT \log(SAT)})$ for
$$ \beta_k(s,a) := \sqrt{\frac{14 S \log(2 S A m t_k)}{\max\{1, N_{t_k}(s,a)\}}} . $$

Proof. In a manner similar to [4] we can say, writing $(s,a)$ for $(s_{t_k+i}, a_{t_k+i})$:
$$ \sum_{k=1}^m \sum_{i=1}^\tau \min\left\{ \sqrt{\frac{14 S \log(2SAmt_k)}{\max\{1, N_{t_k}(s,a)\}}},\, 1 \right\} \;\le\; \sum_{k=1}^m \sum_{i=1}^\tau \mathbf{1}\{N_{t_k}(s,a) \le \tau\} + \sum_{k=1}^m \sum_{i=1}^\tau \mathbf{1}\{N_{t_k}(s,a) > \tau\} \sqrt{\frac{14 S \log(2SAmt_k)}{N_{t_k}(s,a)}} $$
Now, consider the event $(s_t, a_t) = (s,a)$ and $N_{t_k}(s,a) \le \tau$. This can happen fewer than $2\tau$ times per state-action pair. Therefore $\sum_{k=1}^m \sum_{i=1}^\tau \mathbf{1}\{N_{t_k}(s,a) \le \tau\} \le 2\tau SA$. Now, suppose $N_{t_k}(s,a) > \tau$. Then for any $t \in \{t_k, \ldots, t_{k+1}-1\}$, $N_t(s,a) + 1 \le N_{t_k}(s,a) + \tau \le 2 N_{t_k}(s,a)$. Therefore:
$$ \sum_{k=1}^m \sum_{t=t_k}^{t_{k+1}-1} \frac{\mathbf{1}\{N_{t_k}(s_t,a_t) > \tau\}}{\sqrt{N_{t_k}(s_t,a_t)}} \;\le\; \sum_{k=1}^m \sum_{t=t_k}^{t_{k+1}-1} \sqrt{\frac{2}{N_t(s_t,a_t)+1}} \;=\; \sqrt{2} \sum_{t=1}^T (N_t(s_t,a_t)+1)^{-1/2} \;\le\; \sqrt{2} \sum_{s,a} \sum_{j=1}^{N_{T+1}(s,a)} j^{-1/2} \;\le\; \sqrt{2} \sum_{s,a} \int_{x=0}^{N_{T+1}(s,a)} x^{-1/2} \, dx \;=\; 2\sqrt{2} \sum_{s,a} \sqrt{N_{T+1}(s,a)} \;\le\; 2\sqrt{2SAT} $$
Note that since all rewards and transitions are absolutely constrained to $[0,1]$, our regret satisfies:
$$ \min\left\{ \tau \sum_{k=1}^m \sum_{i=1}^\tau \min\{\beta_k(s_{t_k+i}, a_{t_k+i}), 1\},\; T \right\} \;\le\; \min\left\{ 2\tau^2 SA + \tau \sqrt{28 S^2 AT \log(SAT)},\; T \right\} \;\le\; 2\tau\sqrt{2SAT} + \tau\sqrt{28 S^2 A T \log(SAT)} \;\le\; \tau S \sqrt{30\, AT \log(SAT)} $$
Which is our required result.
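Two elementary steps in the displays above are easy to sanity-check numerically: the integral comparison $\sum_{j=1}^N j^{-1/2} \le \int_0^N x^{-1/2}\,dx = 2\sqrt{N}$, and the Cauchy-Schwarz step $\sum_{s,a} \sqrt{N_{T+1}(s,a)} \le \sqrt{SAT}$ used in the final inequality of the chain. A quick illustrative check (not part of the proof):

```python
import math

import numpy as np

def sqrt_partial_sum(N):
    """sum_{j=1}^{N} j^(-1/2), the quantity bounded by the integral test."""
    return sum(j ** -0.5 for j in range(1, N + 1))

# Integral comparison: each partial sum is at most 2 * sqrt(N).
for N in (1, 10, 100, 10_000):
    assert sqrt_partial_sum(N) <= 2 * math.sqrt(N)

# Cauchy-Schwarz: for any visit counts over S*A pairs summing to T,
# the sum of square roots is at most sqrt(S * A * T).
rng = np.random.default_rng(0)
S, A, T = 4, 3, 1_000
counts = rng.multinomial(T, np.ones(S * A) / (S * A))
assert np.sqrt(counts).sum() <= math.sqrt(S * A * T)
```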