{"title": "Policy Gradient Methods for Reinforcement Learning with Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1057, "page_last": 1063, "abstract": null, "full_text": "Policy Gradient Methods for Reinforcement Learning with Function Approximation\n\nRichard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour\n\nAT&T Labs - Research, 180 Park Avenue, Florham Park, NJ 07932\n\nAbstract\n\nFunction approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.\n\nLarge applications of reinforcement learning (RL) require the use of generalizing function approximators such as neural networks, decision trees, or instance-based methods. 
The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the \"greedy\" policy with respect to the estimated values (e.g., as the policy that selects in each state the action with highest estimated value). The value-function approach has worked well in many applications, but has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994). Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and Van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of \"best\" is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods.\n\nIn this paper we explore an alternative approach to function approximation in RL. Rather than approximating a value function and using that to compute a deterministic policy, we approximate a stochastic policy directly using an independent function approximator with its own parameters. For example, the policy might be represented by a neural network whose input is a representation of the state, whose output is action selection probabilities, and whose weights are the policy parameters. Let $\\theta$ denote the vector of policy parameters and $\\rho$ the performance of the corresponding policy (e.g., the average reward per step). Then, in the policy gradient approach, the policy parameters are updated approximately proportional to the gradient:\n\n$$\\Delta\\theta \\approx \\alpha \\frac{\\partial \\rho}{\\partial \\theta}, \\qquad (1)$$\n\nwhere $\\alpha$ is a positive-definite step size. If the above can be achieved, then $\\theta$ can usually be assured to converge to a locally optimal policy in the performance measure $\\rho$. Unlike the value-function approach, here small changes in $\\theta$ can cause only small changes in the policy and in the state-visitation distribution.\n\nIn this paper we prove that an unbiased estimate of the gradient (1) can be obtained from experience using an approximate value function satisfying certain properties. Williams's (1988, 1992) REINFORCE algorithm also finds an unbiased estimate of the gradient, but without the assistance of a learned value function. REINFORCE learns much more slowly than RL methods using value functions and has received relatively little attention. Learning a value function and using it to reduce the variance of the gradient estimate appears to be essential for rapid learning. Jaakkola, Singh, and Jordan (1995) proved a result very similar to ours for the special case of function approximation corresponding to tabular POMDPs. Our result strengthens theirs and generalizes it to arbitrary differentiable function approximators. 
Konda and Tsitsiklis (in prep.) independently developed a very similar result to ours. See also Baxter and Bartlett (in prep.) and Marbach and Tsitsiklis (1998).\n\nOur result also suggests a way of proving the convergence of a wide variety of algorithms based on \"actor-critic\" or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable function approximation is convergent to a locally optimal policy. Baird and Moore (1999) obtained a weaker but superficially similar result for their VAPS family of methods. Like policy-gradient methods, VAPS includes separately parameterized policy and value functions updated by gradient methods. However, VAPS methods do not climb the gradient of performance (expected long-term reward), but of a measure combining performance and value-function accuracy. As a result, VAPS does not converge to a locally optimal policy, except in the case that no weight is put upon value-function accuracy, in which case VAPS degenerates to REINFORCE. Similarly, Gordon's (1995) fitted value iteration is also convergent and value-based, but does not find a locally optimal policy.\n\n1 Policy Gradient Theorem\n\nWe consider the standard reinforcement learning framework (see, e.g., Sutton and Barto, 1998), in which a learning agent interacts with a Markov decision process (MDP). The state, action, and reward at each time $t \\in \\{0, 1, 2, \\ldots\\}$ are denoted $s_t \\in \\mathcal{S}$, $a_t \\in \\mathcal{A}$, and $r_t \\in \\Re$ respectively. 
The environment's dynamics are characterized by state transition probabilities, $\\mathcal{P}^a_{ss'} = \\Pr\\{s_{t+1} = s' \\mid s_t = s, a_t = a\\}$, and expected rewards $\\mathcal{R}^a_s = E\\{r_{t+1} \\mid s_t = s, a_t = a\\}$, $\\forall s, s' \\in \\mathcal{S}, a \\in \\mathcal{A}$. The agent's decision making procedure at each time is characterized by a policy, $\\pi(s, a, \\theta) = \\Pr\\{a_t = a \\mid s_t = s, \\theta\\}$, $\\forall s \\in \\mathcal{S}, a \\in \\mathcal{A}$, where $\\theta \\in \\Re^l$, for $l \\ll |\\mathcal{S}|$, is a parameter vector. We assume that $\\pi$ is differentiable with respect to its parameter, i.e., that $\\frac{\\partial \\pi(s,a)}{\\partial \\theta}$ exists. We also usually write just $\\pi(s, a)$ for $\\pi(s, a, \\theta)$.\n\nWith function approximation, two ways of formulating the agent's objective are useful. One is the average reward formulation, in which policies are ranked according to their long-term expected reward per step, $\\rho(\\pi)$:\n\n$$\\rho(\\pi) = \\lim_{n \\to \\infty} \\frac{1}{n} E\\{r_1 + r_2 + \\cdots + r_n \\mid \\pi\\} = \\sum_s d^\\pi(s) \\sum_a \\pi(s, a) \\mathcal{R}^a_s,$$\n\nwhere $d^\\pi(s) = \\lim_{t \\to \\infty} \\Pr\\{s_t = s \\mid s_0, \\pi\\}$ is the stationary distribution of states under $\\pi$, which we assume exists and is independent of $s_0$ for all policies. In the average reward formulation, the value of a state-action pair given a policy is defined as\n\n$$Q^\\pi(s, a) = \\sum_{t=1}^{\\infty} E\\{r_t - \\rho(\\pi) \\mid s_0 = s, a_0 = a, \\pi\\}, \\qquad \\forall s \\in \\mathcal{S}, a \\in \\mathcal{A}.$$\n\nThe second formulation we cover is that in which there is a designated start state $s_0$, and we care only about the long-term reward obtained from it. We will give our results only once, but they will apply to this formulation as well under the definitions\n\n$$\\rho(\\pi) = E\\Big\\{\\sum_{t=1}^{\\infty} \\gamma^{t-1} r_t \\,\\Big|\\, s_0, \\pi\\Big\\} \\quad \\text{and} \\quad Q^\\pi(s, a) = E\\Big\\{\\sum_{k=1}^{\\infty} \\gamma^{k-1} r_{t+k} \\,\\Big|\\, s_t = s, a_t = a, \\pi\\Big\\},$$\n\nwhere $\\gamma \\in [0, 1]$ is a discount rate ($\\gamma = 1$ is allowed only in episodic tasks). 
In this formulation, we define $d^\\pi(s)$ as a discounted weighting of states encountered starting at $s_0$ and then following $\\pi$: $d^\\pi(s) = \\sum_{t=0}^{\\infty} \\gamma^t \\Pr\\{s_t = s \\mid s_0, \\pi\\}$.\n\nOur first result concerns the gradient of the performance metric with respect to the policy parameter:\n\nTheorem 1 (Policy Gradient). For any MDP, in either the average-reward or start-state formulations,\n\n$$\\frac{\\partial \\rho}{\\partial \\theta} = \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a). \\qquad (2)$$\n\nProof: See the appendix.\n\nThis way of expressing the gradient was first discussed for the average-reward formulation by Marbach and Tsitsiklis (1998), based on a related expression in terms of the state-value function due to Jaakkola, Singh, and Jordan (1995) and Cao and Chen (1997). We extend their results to the start-state formulation and provide simpler and more direct proofs. Williams's (1988, 1992) theory of REINFORCE algorithms can also be viewed as implying (2). In any event, the key aspect of both expressions for the gradient is that there are no terms of the form $\\frac{\\partial d^\\pi(s)}{\\partial \\theta}$: the effect of policy changes on the distribution of states does not appear. This is convenient for approximating the gradient by sampling. For example, if $s$ was sampled from the distribution obtained by following $\\pi$, then $\\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a)$ would be an unbiased estimate of $\\frac{\\partial \\rho}{\\partial \\theta}$. Of course, $Q^\\pi(s, a)$ is also not normally known and must be estimated. One approach is to use the actual returns, $R_t = \\sum_{k=1}^{\\infty} r_{t+k} - \\rho(\\pi)$ (or $R_t = \\sum_{k=1}^{\\infty} \\gamma^{k-1} r_{t+k}$ in the start-state formulation) as an approximation for each $Q^\\pi(s_t, a_t)$. This leads to Williams's episodic REINFORCE algorithm, $\\Delta\\theta_t \\propto \\frac{\\partial \\pi(s_t, a_t)}{\\partial \\theta} R_t \\frac{1}{\\pi(s_t, a_t)}$ (the $\\frac{1}{\\pi(s_t, a_t)}$ corrects for the oversampling of actions preferred by $\\pi$), which is known to follow $\\frac{\\partial \\rho}{\\partial \\theta}$ in expected value (Williams, 1988, 1992). 
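The sampling argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a single-state, one-step task (so Q of an action is just its expected reward) and a Gibbs policy with one parameter per action. The function names, the reward values, and the sample count are all illustrative assumptions. It compares the exact gradient given by Theorem 1 with the average of REINFORCE-style samples, in which each sampled action contributes its return times the policy derivative divided by the action probability.

```python
import math
import random

def gibbs_policy(theta):
    # Softmax over one preference per action (a single-state Gibbs policy).
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, action):
    # (d pi(a) / d theta_b) / pi(a) = 1[a == b] - pi(b) for this policy.
    pi = gibbs_policy(theta)
    return [(1.0 if b == action else 0.0) - pi[b] for b in range(len(theta))]

def exact_gradient(theta, rewards):
    # Theorem 1 with one state: sum_a (d pi(a) / d theta) * Q(a),
    # which simplifies to pi(b) * (rewards[b] - rho) for component b.
    pi = gibbs_policy(theta)
    rho = sum(p * r for p, r in zip(pi, rewards))
    return [pi[b] * (rewards[b] - rho) for b in range(len(theta))]

def reinforce_estimate(theta, rewards, n_samples, rng):
    # Average of (d pi / d theta) * R / pi over actions sampled from pi.
    pi = gibbs_policy(theta)
    total = [0.0] * len(theta)
    for _ in range(n_samples):
        a = rng.choices(range(len(theta)), weights=pi)[0]
        g = grad_log_pi(theta, a)
        for b in range(len(theta)):
            total[b] += g[b] * rewards[a]
    return [t / n_samples for t in total]

theta = [0.2, -0.1, 0.4]        # hypothetical policy parameters
rewards = [1.0, 0.0, 2.0]       # hypothetical expected reward per action
exact = exact_gradient(theta, rewards)
estimate = reinforce_estimate(theta, rewards, 50000, random.Random(0))
```

With enough samples the two vectors should agree closely, which is the sense in which the sampled update follows the gradient in expected value. The per-sample variance is what a learned value function or baseline is meant to reduce.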
\n\n2 Policy Gradient with Approximation\n\nNow consider the case in which $Q^\\pi$ is approximated by a learned function approximator. If the approximation is sufficiently good, we might hope to use it in place of $Q^\\pi$ in (2) and still point roughly in the direction of the gradient. For example, Jaakkola, Singh, and Jordan (1995) proved that for the special case of function approximation arising in a tabular POMDP one could assure positive inner product with the gradient, which is sufficient to ensure improvement for moving in that direction. Here we extend their result to general function approximation and prove equality with the gradient.\n\nLet $f_w : \\mathcal{S} \\times \\mathcal{A} \\to \\Re$ be our approximation to $Q^\\pi$, with parameter $w$. It is natural to learn $f_w$ by following $\\pi$ and updating $w$ by a rule such as $\\Delta w_t \\propto \\frac{\\partial}{\\partial w} [\\hat{Q}^\\pi(s_t, a_t) - f_w(s_t, a_t)]^2 \\propto [\\hat{Q}^\\pi(s_t, a_t) - f_w(s_t, a_t)] \\frac{\\partial f_w(s_t, a_t)}{\\partial w}$, where $\\hat{Q}^\\pi(s_t, a_t)$ is some unbiased estimator of $Q^\\pi(s_t, a_t)$, perhaps $R_t$. When such a process has converged to a local optimum, then\n\n$$\\sum_s d^\\pi(s) \\sum_a \\pi(s, a) [Q^\\pi(s, a) - f_w(s, a)] \\frac{\\partial f_w(s, a)}{\\partial w} = 0. \\qquad (3)$$\n\nTheorem 2 (Policy Gradient with Function Approximation). If $f_w$ satisfies (3) and is compatible with the policy parameterization in the sense that$^1$\n\n$$\\frac{\\partial f_w(s, a)}{\\partial w} = \\frac{\\partial \\pi(s, a)}{\\partial \\theta} \\frac{1}{\\pi(s, a)}, \\qquad (4)$$\n\nthen\n\n$$\\frac{\\partial \\rho}{\\partial \\theta} = \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} f_w(s, a). \\qquad (5)$$\n\nProof: Combining (3) and (4) gives\n\n$$\\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} [Q^\\pi(s, a) - f_w(s, a)] = 0, \\qquad (6)$$\n\nwhich tells us that the error in $f_w(s, a)$ is orthogonal to the gradient of the policy parameterization. 
Because the expression above is zero, we can subtract it from the policy gradient theorem (2) to yield\n\n$$\\begin{aligned} \\frac{\\partial \\rho}{\\partial \\theta} &= \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) - \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} [Q^\\pi(s, a) - f_w(s, a)] \\\\\\\\ &= \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} [Q^\\pi(s, a) - Q^\\pi(s, a) + f_w(s, a)] \\\\\\\\ &= \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} f_w(s, a). \\end{aligned}$$\n\nQ.E.D.\n\n$^1$ Tsitsiklis (personal communication) points out that $f_w$ being linear in the features given on the righthand side may be the only way to satisfy this condition.\n\n3 Application to Deriving Algorithms and Advantages\n\nGiven a policy parameterization, Theorem 2 can be used to derive an appropriate form for the value-function parameterization. For example, consider a policy that is a Gibbs distribution in a linear combination of features:\n\n$$\\pi(s, a) = \\frac{e^{\\theta^T \\phi_{sa}}}{\\sum_b e^{\\theta^T \\phi_{sb}}}, \\qquad \\forall s \\in \\mathcal{S}, a \\in \\mathcal{A},$$\n\nwhere each $\\phi_{sa}$ is an $l$-dimensional feature vector characterizing state-action pair $s, a$. Meeting the compatibility condition (4) requires that\n\n$$\\frac{\\partial f_w(s, a)}{\\partial w} = \\frac{\\partial \\pi(s, a)}{\\partial \\theta} \\frac{1}{\\pi(s, a)} = \\phi_{sa} - \\sum_b \\pi(s, b) \\phi_{sb},$$\n\nso that the natural parameterization of $f_w$ is\n\n$$f_w(s, a) = w^T \\Big[\\phi_{sa} - \\sum_b \\pi(s, b) \\phi_{sb}\\Big].$$\n\nIn other words, $f_w$ must be linear in the same features as the policy, except normalized to be mean zero for each state. Other algorithms can easily be derived for a variety of nonlinear policy parameterizations, such as multi-layer backpropagation networks.\n\nThe careful reader will have noticed that the form given above for $f_w$ requires that it have zero mean for each state: $\\sum_a \\pi(s, a) f_w(s, a) = 0$, $\\forall s \\in \\mathcal{S}$. 
In this sense it is better to think of $f_w$ as an approximation of the advantage function, $A^\\pi(s, a) = Q^\\pi(s, a) - V^\\pi(s)$ (much as in Baird, 1993), rather than of $Q^\\pi$. Our convergence requirement (3) is really that $f_w$ get the relative value of the actions correct in each state, not the absolute value, nor the variation from state to state. Our results can be viewed as a justification for the special status of advantages as the target for value function approximation in RL. In fact, our (2), (3), and (5) can all be generalized to include an arbitrary function of state added to the value function or its approximation. For example, (5) can be generalized to $\\frac{\\partial \\rho}{\\partial \\theta} = \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} [f_w(s, a) + v(s)]$, where $v : \\mathcal{S} \\to \\Re$ is an arbitrary function. (This follows immediately because $\\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} = 0$, $\\forall s \\in \\mathcal{S}$.) The choice of $v$ does not affect any of our theorems, but can substantially affect the variance of the gradient estimators. The issues here are entirely analogous to those in the use of reinforcement baselines in earlier work (e.g., Williams, 1992; Dayan, 1991; Sutton, 1984). In practice, $v$ should presumably be set to the best available approximation of $V^\\pi$. Our results establish that that approximation process can proceed without affecting the expected evolution of $f_w$ and $\\pi$.\n\n4 Convergence of Policy Iteration with Function Approximation\n\nGiven Theorem 2, we can prove for the first time that a form of policy iteration with function approximation is convergent to a locally optimal policy.\n\nTheorem 3 (Policy Iteration with Function Approximation). 
Let $\\pi$ and $f_w$ be any differentiable function approximators for the policy and value function respectively that satisfy the compatibility condition (4) and for which $\\max_{\\theta, s, a, i, j} \\left| \\frac{\\partial^2 \\pi(s, a)}{\\partial \\theta_i \\partial \\theta_j} \\right| < B < \\infty$. Let $\\{\\alpha_k\\}_{k=0}^{\\infty}$ be any step-size sequence such that $\\lim_{k \\to \\infty} \\alpha_k = 0$ and $\\sum_k \\alpha_k = \\infty$. Then, for any MDP with bounded rewards, the sequence $\\{\\rho(\\pi_k)\\}_{k=0}^{\\infty}$, defined by any $\\theta_0$, $\\pi_k = \\pi(\\cdot, \\cdot, \\theta_k)$, and\n\n$$w_k = w \\text{ such that } \\sum_s d^{\\pi_k}(s) \\sum_a \\pi_k(s, a) [Q^{\\pi_k}(s, a) - f_w(s, a)] \\frac{\\partial f_w(s, a)}{\\partial w} = 0,$$\n\n$$\\theta_{k+1} = \\theta_k + \\alpha_k \\sum_s d^{\\pi_k}(s) \\sum_a \\frac{\\partial \\pi_k(s, a)}{\\partial \\theta} f_{w_k}(s, a),$$\n\nconverges such that $\\lim_{k \\to \\infty} \\frac{\\partial \\rho(\\pi_k)}{\\partial \\theta} = 0$.\n\nProof: Our Theorem 2 assures that the $\\theta_k$ update is in the direction of the gradient. The bounds on $\\frac{\\partial^2 \\pi(s, a)}{\\partial \\theta_i \\partial \\theta_j}$ and on the MDP's rewards together assure us that $\\frac{\\partial^2 \\rho}{\\partial \\theta_i \\partial \\theta_j}$ is also bounded. These, together with the step-size requirements, are the necessary conditions to apply Proposition 3.5 from page 96 of Bertsekas and Tsitsiklis (1996), which assures convergence to a local optimum. Q.E.D.\n\nAcknowledgements\n\nThe authors wish to thank Martha Steenstrup and Doina Precup for comments, and Michael Kearns for insights into the notion of optimal policy under function approximation.\n\nReferences\n\nBaird, L. C. (1993). Advantage Updating. Wright Lab. Technical Report WL-TR-93-1146.\nBaird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. Proc. of the Twelfth Int. Conf. on Machine Learning, pp. 30-37. Morgan Kaufmann.\nBaird, L. C., Moore, A. W. (1999). Gradient descent for general reinforcement learning. NIPS 11. MIT Press.\nBarto, A. G., Sutton, R. S., Anderson, C. W. (1983). 
Neuronlike elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man, and Cybernetics 13:835.\nBaxter, J., Bartlett, P. (in prep.) Direct gradient-based reinforcement learning: I. Gradient estimation algorithms.\nBertsekas, D. P., Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.\nCao, X.-R., Chen, H.-F. (1997). Perturbation realization, potentials, and sensitivity analysis of Markov Processes, IEEE Trans. on Automatic Control 42(10):1382-1393.\nDayan, P. (1991). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton (eds.), Connectionist Models: Proceedings of the 1990 Summer School, pp. 45-51. Morgan Kaufmann.\nGordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth Int. Conf. on Machine Learning, pp. 261-268. Morgan Kaufmann.\nGordon, G. J. (1996). Chattering in SARSA($\\lambda$). CMU Learning Lab Technical Report.\nJaakkola, T., Singh, S. P., Jordan, M. I. (1995). Reinforcement learning algorithms for partially observable Markov decision problems, NIPS 7, pp. 345-352. Morgan Kaufmann.\nKimura, H., Kobayashi, S. (1998). An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions. Proc. ICML-98, pp. 278-286.\nKonda, V. R., Tsitsiklis, J. N. (in prep.) Actor-critic algorithms.\nMarbach, P., Tsitsiklis, J. N. (1998). Simulation-based optimization of Markov reward processes, technical report LIDS-P-2411, Massachusetts Institute of Technology.\nSingh, S. P., Jaakkola, T., Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision problems. Proc. ICML-94, pp. 284-292.\nSutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. 
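thesis, University of Massachusetts, Amherst.\nSutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.\nTsitsiklis, J. N., Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning 22:59-94.\nWilliams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University, College of Computer Science.\nWilliams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229-256.\n\nAppendix: Proof of Theorem 1\n\nWe prove the theorem first for the average-reward formulation and then for the start-state formulation.\n\n$$\\begin{aligned} \\frac{\\partial V^\\pi(s)}{\\partial \\theta} &\\stackrel{\\text{def}}{=} \\frac{\\partial}{\\partial \\theta} \\sum_a \\pi(s, a) Q^\\pi(s, a) \\qquad \\forall s \\in \\mathcal{S} \\\\\\\\ &= \\sum_a \\Big[ \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) + \\pi(s, a) \\frac{\\partial}{\\partial \\theta} Q^\\pi(s, a) \\Big] \\\\\\\\ &= \\sum_a \\Big[ \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) + \\pi(s, a) \\frac{\\partial}{\\partial \\theta} \\Big[ \\mathcal{R}^a_s - \\rho(\\pi) + \\sum_{s'} \\mathcal{P}^a_{ss'} V^\\pi(s') \\Big] \\Big] \\\\\\\\ &= \\sum_a \\Big[ \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) + \\pi(s, a) \\Big[ -\\frac{\\partial \\rho}{\\partial \\theta} + \\sum_{s'} \\mathcal{P}^a_{ss'} \\frac{\\partial V^\\pi(s')}{\\partial \\theta} \\Big] \\Big] \\end{aligned}$$\n\nTherefore,\n\n$$\\frac{\\partial \\rho}{\\partial \\theta} = \\sum_a \\Big[ \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) + \\pi(s, a) \\sum_{s'} \\mathcal{P}^a_{ss'} \\frac{\\partial V^\\pi(s')}{\\partial \\theta} \\Big] - \\frac{\\partial V^\\pi(s)}{\\partial \\theta}.$$\n\nSumming both sides over the stationary distribution $d^\\pi$,\n\n$$\\sum_s d^\\pi(s) \\frac{\\partial \\rho}{\\partial \\theta} = \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) + \\sum_s d^\\pi(s) \\sum_a \\pi(s, a) \\sum_{s'} \\mathcal{P}^a_{ss'} \\frac{\\partial V^\\pi(s')}{\\partial \\theta} - \\sum_s d^\\pi(s) \\frac{\\partial V^\\pi(s)}{\\partial \\theta},$$\n\nbut since $d^\\pi$ is stationary,\n\n$$\\sum_s d^\\pi(s) \\frac{\\partial \\rho}{\\partial \\theta} = \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) + \\sum_{s'} d^\\pi(s') \\frac{\\partial V^\\pi(s')}{\\partial \\theta} - \\sum_s d^\\pi(s) \\frac{\\partial V^\\pi(s)}{\\partial \\theta},$$\n\n$$\\frac{\\partial \\rho}{\\partial \\theta} = \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a).$$\n\nQ.E.D.\n\nFor the start-state formulation:\n\n$$\\begin{aligned} \\frac{\\partial V^\\pi(s)}{\\partial \\theta} &\\stackrel{\\text{def}}{=} \\frac{\\partial}{\\partial \\theta} \\sum_a \\pi(s, a) Q^\\pi(s, a) \\qquad \\forall s \\in \\mathcal{S} \\\\\\\\ &= \\sum_a \\Big[ \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) + \\pi(s, a) \\frac{\\partial}{\\partial \\theta} Q^\\pi(s, a) \\Big] \\\\\\\\ &= \\sum_a \\Big[ \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) + \\pi(s, a) \\frac{\\partial}{\\partial \\theta} \\Big[ \\mathcal{R}^a_s + \\sum_{s'} \\gamma \\mathcal{P}^a_{ss'} V^\\pi(s') \\Big] \\Big] \\\\\\\\ &= \\sum_a \\Big[ \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) + \\pi(s, a) \\sum_{s'} \\gamma \\mathcal{P}^a_{ss'} \\frac{\\partial}{\\partial \\theta} V^\\pi(s') \\Big] \\qquad (7) \\\\\\\\ &= \\sum_x \\sum_{k=0}^{\\infty} \\gamma^k \\Pr(s \\to x, k, \\pi) \\sum_a \\frac{\\partial \\pi(x, a)}{\\partial \\theta} Q^\\pi(x, a), \\end{aligned}$$\n\nafter several steps of unrolling (7), where $\\Pr(s \\to x, k, \\pi)$ is the probability of going from state $s$ to state $x$ in $k$ steps under policy $\\pi$. It is then immediate that\n\n$$\\begin{aligned} \\frac{\\partial \\rho}{\\partial \\theta} &= \\frac{\\partial}{\\partial \\theta} E\\Big\\{ \\sum_{t=1}^{\\infty} \\gamma^{t-1} r_t \\,\\Big|\\, s_0, \\pi \\Big\\} = \\frac{\\partial}{\\partial \\theta} V^\\pi(s_0) \\\\\\\\ &= \\sum_s \\sum_{k=0}^{\\infty} \\gamma^k \\Pr(s_0 \\to s, k, \\pi) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a) \\\\\\\\ &= \\sum_s d^\\pi(s) \\sum_a \\frac{\\partial \\pi(s, a)}{\\partial \\theta} Q^\\pi(s, a). \\end{aligned}$$\n\nQ.E.D.", "award": [], "sourceid": 1713, "authors": [{"given_name": "Richard", "family_name": "Sutton", "institution": null}, {"given_name": "David", "family_name": "McAllester", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "Yishay", "family_name": "Mansour", "institution": null}]}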