{"title": "Actor-Critic Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1008, "page_last": 1014, "abstract": null, "full_text": "Actor-Critic Algorithms\n\nVijay R. Konda\n\nJohn N. Tsitsiklis\n\nLaboratory for Information and Decision Systems,\n\nMassachusetts Institute of Technology,\n\nCambridge, MA, 02139.\n\nkonda@mit.edu, jnt@mit.edu\n\nAbstract\n\nWe propose and analyze a class of actor-critic algorithms for simulation-based optimization of a Markov decision process over a parameterized family of randomized stationary policies. These are two-time-scale algorithms in which the critic uses TD learning with a linear approximation architecture and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features for the critic should span a subspace prescribed by the choice of parameterization of the actor. We conclude by discussing convergence properties and some open problems.\n\n1 Introduction\n\nThe vast majority of Reinforcement Learning (RL) [9] and Neuro-Dynamic Programming (NDP) [1] methods fall into one of the following two categories:\n\n(a) Actor-only methods work with a parameterized family of policies. The gradient of the performance, with respect to the actor parameters, is directly estimated by simulation, and the parameters are updated in a direction of improvement [4, 5, 8, 13]. A possible drawback of such methods is that the gradient estimators may have a large variance. Furthermore, as the policy changes, a new gradient is estimated independently of past estimates. Hence, there is no \"learning,\" in the sense of accumulation and consolidation of older information. 
\n\n(b) Critic-only methods rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which will then hopefully prescribe a near-optimal policy. Such methods are indirect in the sense that they do not try to optimize directly over a policy space. A method of this type may succeed in constructing a \"good\" approximation of the value function, yet lack reliable guarantees in terms of near-optimality of the resulting policy.\n\nActor-critic methods aim at combining the strong points of actor-only and critic-only methods. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement. Such methods, as long as they are gradient-based, may have desirable convergence properties, in contrast to critic-only methods for which convergence is guaranteed in very limited settings. They hold the promise of delivering faster convergence (due to variance reduction), when compared to actor-only methods. On the other hand, theoretical understanding of actor-critic methods has been limited to the case of lookup table representations of policies [6].\n\nIn this paper, we propose some actor-critic algorithms and provide an overview of a convergence proof. The algorithms are based on an important observation. Since the number of parameters that the actor has to update is relatively small (compared to the number of states), the critic need not attempt to compute or approximate the exact value function, which is a high-dimensional object. 
In fact, we show that the critic should ideally compute a certain \"projection\" of the value function onto a low-dimensional subspace spanned by a set of \"basis functions\" that are completely determined by the parameterization of the actor. Finally, as the analysis in [11] suggests for TD algorithms, our algorithms can be extended to the case of arbitrary state and action spaces as long as certain ergodicity assumptions are satisfied.\n\nWe close this section by noting that ideas similar to ours have been presented in the simultaneous and independent work of Sutton et al. [10].\n\n2 Markov decision processes and parameterized family of RSP's\n\nConsider a Markov decision process with finite state space $S$ and finite action space $A$. Let $g : S \\times A \\to \\mathbb{R}$ be a given cost function. A randomized stationary policy (RSP) is a mapping $\\mu$ that assigns to each state $x$ a probability distribution over the action space $A$. We consider a set of randomized stationary policies $\\mathbb{P} = \\{\\mu_\\theta; \\theta \\in \\mathbb{R}^n\\}$, parameterized in terms of a vector $\\theta$. For each pair $(x, u) \\in S \\times A$, $\\mu_\\theta(x, u)$ denotes the probability of taking action $u$ when the state $x$ is encountered, under the policy corresponding to $\\theta$. Let $p_{xy}(u)$ denote the probability that the next state is $y$, given that the current state is $x$ and the current action is $u$. Note that under any RSP, the sequence of states $\\{X_n\\}$ and of state-action pairs $\\{X_n, U_n\\}$ of the Markov decision process form Markov chains with state spaces $S$ and $S \\times A$, respectively. We make the following assumptions about the family of policies $\\mathbb{P}$.\n\n(A1) For all $x \\in S$ and $u \\in A$, the map $\\theta \\mapsto \\mu_\\theta(x, u)$ is twice differentiable with bounded first and second derivatives. 
Furthermore, there exists an $\\mathbb{R}^n$-valued function $\\psi_\\theta(x, u)$ such that $\\nabla \\mu_\\theta(x, u) = \\mu_\\theta(x, u) \\psi_\\theta(x, u)$, where the mapping $\\theta \\mapsto \\psi_\\theta(x, u)$ is bounded and has bounded first derivatives for any fixed $x$ and $u$.\n\n(A2) For each $\\theta \\in \\mathbb{R}^n$, the Markov chains $\\{X_n\\}$ and $\\{X_n, U_n\\}$ are irreducible and aperiodic, with stationary probabilities $\\pi_\\theta(x)$ and $\\eta_\\theta(x, u) = \\pi_\\theta(x) \\mu_\\theta(x, u)$, respectively, under the RSP $\\mu_\\theta$.\n\nIn reference to Assumption (A1), note that whenever $\\mu_\\theta(x, u)$ is nonzero we have\n\n$$\\psi_\\theta(x, u) = \\frac{\\nabla \\mu_\\theta(x, u)}{\\mu_\\theta(x, u)} = \\nabla \\ln \\mu_\\theta(x, u).$$\n\nConsider the average cost function $\\lambda : \\mathbb{R}^n \\to \\mathbb{R}$, given by\n\n$$\\lambda(\\theta) = \\sum_{x \\in S, u \\in A} g(x, u) \\eta_\\theta(x, u).$$\n\nWe are interested in minimizing $\\lambda(\\theta)$ over all $\\theta$. For each $\\theta \\in \\mathbb{R}^n$, let $V_\\theta : S \\to \\mathbb{R}$ be the \"differential\" cost function, defined as the solution of the Poisson equation:\n\n$$\\lambda(\\theta) + V_\\theta(x) = \\sum_{u \\in A} \\mu_\\theta(x, u) \\Big[ g(x, u) + \\sum_{y} p_{xy}(u) V_\\theta(y) \\Big].$$\n\nIntuitively, $V_\\theta(x)$ can be viewed as the \"disadvantage\" of state $x$: it is the expected excess cost, on top of the average cost, incurred if we start at state $x$. It plays a role similar to that played by the more familiar value function that arises in total or discounted cost Markov decision problems. Finally, for every $\\theta \\in \\mathbb{R}^n$, we define the q-function $q_\\theta : S \\times A \\to \\mathbb{R}$ by\n\n$$q_\\theta(x, u) = g(x, u) - \\lambda(\\theta) + \\sum_{y} p_{xy}(u) V_\\theta(y).$$\n\nWe recall the following result, as stated in [8]. (Different versions of this result have been established in [3, 4, 5].)\n\nTheorem 1.\n\n$$\\frac{\\partial}{\\partial \\theta_i} \\lambda(\\theta) = \\sum_{x, u} \\eta_\\theta(x, u) q_\\theta(x, u) \\psi_\\theta^i(x, u), \\tag{1}$$\n\nwhere $\\psi_\\theta^i(x, u)$ stands for the $i$th component of $\\psi_\\theta$. 
\n\nIn [8], the quantity $q_\\theta(x, u)$ in the above formula is interpreted as the expected excess cost incurred over a certain renewal period of the Markov chain $\\{X_n, U_n\\}$, under the RSP $\\mu_\\theta$, and is then estimated by means of simulation, leading to actor-only algorithms. Here, we provide an alternative interpretation of the formula in Theorem 1, as an inner product, and thus derive a different set of algorithms, which readily generalize to the case of an infinite space as well.\n\nFor any $\\theta \\in \\mathbb{R}^n$, we define the inner product $\\langle \\cdot, \\cdot \\rangle_\\theta$ of two real-valued functions $q_1, q_2$ on $S \\times A$, viewed as vectors in $\\mathbb{R}^{|S||A|}$, by\n\n$$\\langle q_1, q_2 \\rangle_\\theta = \\sum_{x, u} \\eta_\\theta(x, u) q_1(x, u) q_2(x, u).$$\n\nWith this notation we can rewrite the formula (1) as\n\n$$\\frac{\\partial}{\\partial \\theta_i} \\lambda(\\theta) = \\langle q_\\theta, \\psi_\\theta^i \\rangle_\\theta, \\qquad i = 1, \\ldots, n.$$\n\nLet $\\|\\cdot\\|_\\theta$ denote the norm induced by this inner product on $\\mathbb{R}^{|S||A|}$. For each $\\theta \\in \\mathbb{R}^n$, let $\\Psi_\\theta$ denote the span of the vectors $\\{\\psi_\\theta^i; 1 \\le i \\le n\\}$ in $\\mathbb{R}^{|S||A|}$. (This is the same as the set of all functions $f$ on $S \\times A$ of the form $f(x, u) = \\sum_{i=1}^n \\alpha_i \\psi_\\theta^i(x, u)$, for some scalars $\\alpha_1, \\ldots, \\alpha_n$.)\n\nNote that although the gradient of $\\lambda$ depends on the q-function, which is a vector in a possibly very high-dimensional space $\\mathbb{R}^{|S||A|}$, the dependence is only through its inner products with vectors in $\\Psi_\\theta$. Thus, instead of \"learning\" the function $q_\\theta$, it would suffice to learn the projection of $q_\\theta$ on the subspace $\\Psi_\\theta$.\n\nIndeed, let $\\Pi_\\theta : \\mathbb{R}^{|S||A|} \\to \\Psi_\\theta$ be the projection operator defined by\n\n$$\\Pi_\\theta q = \\arg\\min_{\\hat{q} \\in \\Psi_\\theta} \\|q - \\hat{q}\\|_\\theta.$$\n\nSince\n\n$$\\langle q_\\theta, \\psi_\\theta \\rangle_\\theta = \\langle \\Pi_\\theta q_\\theta, \\psi_\\theta \\rangle_\\theta, \\tag{2}$$\n\nit is enough to compute the projection of $q_\\theta$ onto $\\Psi_\\theta$.\n\n3 Actor-critic algorithms\n\nWe view actor-critic algorithms as stochastic gradient algorithms on the parameter space of the actor. 
When the actor parameter vector is $\\theta$, the job of the critic is to compute an approximation of the projection $\\Pi_\\theta q_\\theta$ of $q_\\theta$ onto $\\Psi_\\theta$. The actor uses this approximation to update its policy in an approximate gradient direction. The analysis in [11, 12] shows that this is precisely what TD algorithms try to do, i.e., to compute the projection of an exact value function onto a subspace spanned by feature vectors. This allows us to implement the critic by using a TD algorithm. (Note, however, that other types of critics are possible, e.g., based on batch solution of least squares problems, as long as they aim at computing the same projection.)\n\nWe note some minor differences with the common usage of TD. In our context, we need the projection of q-functions, rather than value functions. But this is easily achieved by replacing the Markov chain $\\{x_t\\}$ in [11, 12] by the Markov chain $\\{X_n, U_n\\}$. A further difference is that [11, 12] assume that the control policy and the feature vectors are fixed. In our algorithms, the control policy as well as the features need to change as the actor updates its parameters. As shown in [6, 2], this need not pose any problems, as long as the actor parameters are updated on a slower time scale.\n\nWe are now ready to describe two actor-critic algorithms, which differ only as far as the critic updates are concerned. In both variants, the critic is a TD algorithm with a linearly parameterized approximation architecture for the q-function, of the form\n\n$$Q_r^\\theta(x, u) = \\sum_{j=1}^m r^j \\phi_\\theta^j(x, u),$$\n\nwhere $r = (r^1, \\ldots, r^m) \\in \\mathbb{R}^m$ denotes the parameter vector of the critic. The features $\\phi_\\theta^j$, j = 1, ... 
,m, used by the critic are dependent on the actor parameter vector $\\theta$ and are chosen such that their span in $\\mathbb{R}^{|S||A|}$, denoted by $\\Phi_\\theta$, contains $\\Psi_\\theta$. Note that the formula (2) still holds if $\\Pi_\\theta$ is redefined as projection onto $\\Phi_\\theta$, as long as $\\Phi_\\theta$ contains $\\Psi_\\theta$. The most straightforward choice would be to let $m = n$ and $\\phi_\\theta^i = \\psi_\\theta^i$ for each $i$. Nevertheless, we allow the possibility that $m > n$ and $\\Phi_\\theta$ properly contains $\\Psi_\\theta$, so that the critic uses more features than are actually necessary. This added flexibility may turn out to be useful in a number of ways:\n\n1. It is possible that, for certain values of $\\theta$, the features $\\psi_\\theta$ are either close to zero or almost linearly dependent. For these values of $\\theta$, the operator $\\Pi_\\theta$ becomes ill-conditioned and the algorithms can become unstable. This might be avoided by using a richer set of features $\\psi_\\theta^i$.\n\n2. For the second algorithm that we propose (TD($\\alpha$), $\\alpha < 1$), the critic can only compute an approximate, rather than exact, projection. The use of additional features can result in a reduction of the approximation error.\n\nAlong with the parameter vector $r$, the critic stores some auxiliary parameters: these are a (scalar) estimate $\\lambda$ of the average cost, and an $m$-vector $z$ which represents Sutton's eligibility trace [1, 9]. The actor and critic updates take place in the course of a simulation of a single sample path of the controlled Markov chain. Let $r_k, z_k, \\lambda_k$ be the parameters of the critic, and let $\\theta_k$ be the parameter vector of the actor, at time $k$. Let $(X_k, U_k)$ be the state-action pair at that time. Let $X_{k+1}$ be the new state, obtained after action $U_k$ is applied. A new action $U_{k+1}$ is generated according to the RSP corresponding to the actor parameter vector $\\theta_k$. 
The critic carries out an update similar to the average cost temporal-difference method of [12]:\n\n$$\\lambda_{k+1} = \\lambda_k + \\gamma_k \\big( g(X_k, U_k) - \\lambda_k \\big),$$\n\n$$r_{k+1} = r_k + \\gamma_k \\big( g(X_k, U_k) - \\lambda_k + Q_{r_k}^{\\theta_k}(X_{k+1}, U_{k+1}) - Q_{r_k}^{\\theta_k}(X_k, U_k) \\big) z_k.$$\n\n(Here, $\\gamma_k$ is a positive stepsize parameter.) The two variants of the critic use different ways of updating $z_k$:\n\nTD(1) Critic: Let $x^*$ be a state in $S$.\n\n$$z_{k+1} = \\begin{cases} z_k + \\phi_{\\theta_k}(X_{k+1}, U_{k+1}), & \\text{if } X_{k+1} \\neq x^*, \\\\ \\phi_{\\theta_k}(X_{k+1}, U_{k+1}), & \\text{otherwise.} \\end{cases}$$\n\nTD($\\alpha$) Critic, $0 \\le \\alpha < 1$:\n\n$$z_{k+1} = \\alpha z_k + \\phi_{\\theta_k}(X_{k+1}, U_{k+1}).$$\n\nActor: Finally, the actor updates its parameter vector by letting\n\n$$\\theta_{k+1} = \\theta_k - \\beta_k \\Gamma(r_k) Q_{r_k}^{\\theta_k}(X_{k+1}, U_{k+1}) \\psi_{\\theta_k}(X_{k+1}, U_{k+1}).$$\n\nHere, $\\beta_k$ is a positive stepsize and $\\Gamma(r_k) > 0$ is a normalization factor satisfying:\n\n(A3) $\\Gamma(\\cdot)$ is Lipschitz continuous.\n\n(A4) There exists $C > 0$ such that\n\n$$\\Gamma(r) \\le \\frac{C}{1 + \\|r\\|}.$$\n\nThe algorithms presented above are only two out of many variations. For instance, one could also consider \"episodic\" problems in which one starts from a given initial state and runs the process until a random termination time (at which time the process is reinitialized at $x^*$), with the objective of minimizing the expected cost until termination. In this setting, the average cost estimate $\\lambda_k$ is unnecessary and is removed from the critic update formula. If the critic parameter $r_k$ were to be reinitialized each time that $x^*$ is entered, one would obtain a method closely related to Williams' REINFORCE algorithm [13]. Such a method does not involve any value function learning, because the observations during one episode do not affect the critic parameter $r$ during another episode. In contrast, in our approach, the observations from all past episodes affect the current critic parameter $r$, and in this sense the critic is \"learning\". 
This can be advantageous because, as long as $\\theta$ is slowly changing, the observations from recent episodes carry useful information on the q-function under the current policy.\n\n4 Convergence of actor-critic algorithms\n\nSince our actor-critic algorithms are gradient-based, one cannot expect to prove convergence to a globally optimal policy (within the given class of RSP's). The best that one could hope for is the convergence of $\\nabla \\lambda(\\theta)$ to zero; in practical terms, this will usually translate to convergence to a local minimum of $\\lambda(\\theta)$. Actually, because the TD($\\alpha$) critic will generally converge to an approximation of the desired projection of the value function, the corresponding convergence result is necessarily weaker, only guaranteeing that $\\nabla \\lambda(\\theta_k)$ becomes small (infinitely often). Let us now introduce some further assumptions.\n\n(A5) For each $\\theta \\in \\mathbb{R}^n$, we define an $m \\times m$ matrix $G(\\theta)$ by\n\n$$G(\\theta) = \\sum_{x, u} \\eta_\\theta(x, u) \\phi_\\theta(x, u) \\phi_\\theta(x, u)^T.$$\n\nWe assume that $G(\\theta)$ is uniformly positive definite, that is, there exists some $\\epsilon_1 > 0$ such that for all $r \\in \\mathbb{R}^m$ and $\\theta \\in \\mathbb{R}^n$\n\n$$r^T G(\\theta) r \\ge \\epsilon_1 \\|r\\|^2.$$\n\n(A6) We assume that the stepsize sequences $\\{\\gamma_k\\}$, $\\{\\beta_k\\}$ are positive, nonincreasing, and satisfy\n\n$$\\delta_k > 0, \\ \\forall k, \\qquad \\sum_k \\delta_k = \\infty, \\qquad \\sum_k \\delta_k^2 < \\infty,$$\n\nwhere $\\delta_k$ stands for either $\\beta_k$ or $\\gamma_k$. We also assume that\n\n$$\\frac{\\beta_k}{\\gamma_k} \\to 0.$$\n\nNote that the last assumption requires that the actor parameters be updated at a time scale slower than that of the critic.\n\nTheorem 2. In an actor-critic algorithm with a TD(1) critic,\n\n$$\\liminf_k \\|\\nabla \\lambda(\\theta_k)\\| = 0 \\quad \\text{w.p. 1.}$$\n\nFurthermore, if $\\{\\theta_k\\}$ is bounded w.p. 1 then\n\n$$\\lim_k \\|\\nabla \\lambda(\\theta_k)\\| = 0 \\quad \\text{w.p. 1.}$$\n\nTheorem 3. For every $\\epsilon > 0$, there exists $\\alpha$ sufficiently close to 1, such that $\\liminf_k \\|\\nabla \\lambda(\\theta_k)\\| \\le \\epsilon$ w.p. 
1.\n\nNote that the theoretical guarantees appear to be stronger in the case of the TD(1) critic. However, we expect that TD($\\alpha$) will perform better in practice because of much smaller variance for the parameter $r_k$. (Similar issues arise when considering actor-only algorithms. The experiments reported in [7] indicate that introducing a forgetting factor $\\alpha < 1$ can result in much faster convergence, with very little loss of performance.) We now provide an overview of the proofs of these theorems. Since $\\beta_k / \\gamma_k \\to 0$, the size of the actor updates becomes negligible compared to the size of the critic updates. Therefore the actor looks stationary, as far as the critic is concerned. Thus, the analysis in [1] for the TD(1) critic and the analysis in [12] for the TD($\\alpha$) critic (with $\\alpha < 1$) can be used, with appropriate modifications, to conclude that the critic's approximation of $\\Pi_{\\theta_k} q_{\\theta_k}$ will be \"asymptotically correct\". If $r(\\theta)$ denotes the value to which the critic converges when the actor parameters are fixed at $\\theta$, then the update for the actor can be rewritten as\n\n$$\\theta_{k+1} = \\theta_k - \\beta_k \\Gamma(r(\\theta_k)) Q_{r(\\theta_k)}^{\\theta_k}(X_{k+1}, U_{k+1}) \\psi_{\\theta_k}(X_{k+1}, U_{k+1}) + \\beta_k e_k,$$\n\nwhere $e_k$ is an error that becomes asymptotically negligible. At this point, standard proof techniques for stochastic approximation algorithms can be used to complete the proof.\n\n5 Conclusions\n\nThe key observation in this paper is that in actor-critic methods, the actor parameterization and the critic parameterization need not, and should not, be chosen independently. Rather, an appropriate approximation architecture for the critic is directly prescribed by the parameterization used in the actor. 
\n\nCapitalizing on the above observation, we have presented a class of actor-critic algorithms, aimed at combining the advantages of actor-only and critic-only methods. In contrast to existing actor-critic methods, our algorithms apply to high-dimensional problems (they do not rely on lookup table representations), and are mathematically sound in the sense that they possess certain convergence properties.\n\nAcknowledgments: This research was partially supported by the NSF under grant ECS-9873451, and by the AFOSR under grant F49620-99-1-0320.\n\nReferences\n\n[1] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.\n\n[2] V. S. Borkar. Stochastic approximation with two time scales. Systems and Control Letters, 29:291-294, 1996.\n\n[3] X. R. Cao and H. F. Chen. Perturbation realization, potentials, and sensitivity analysis of Markov processes. IEEE Transactions on Automatic Control, 42:1382-1393, 1997.\n\n[4] P. W. Glynn. Stochastic approximation for Monte Carlo optimization. In Proceedings of the 1986 Winter Simulation Conference, pages 285-289, 1986.\n\n[5] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithms for partially observable Markov decision problems. In Advances in Neural Information Processing Systems, volume 7, pages 345-352, San Francisco, CA, 1995. Morgan Kaufmann.\n\n[6] V. R. Konda and V. S. Borkar. Actor-critic like learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94-123, 1999.\n\n[7] P. Marbach. Simulation based optimization of Markov reward processes. PhD thesis, Massachusetts Institute of Technology, 1998.\n\n[8] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. 
Submitted to IEEE Transactions on Automatic Control.\n\n[9] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1995.\n\n[10] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In this proceedings.\n\n[11] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.\n\n[12] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799-1808, 1999.\n\n[13] R. Williams. Simple statistical gradient following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.\n", "award": [], "sourceid": 1786, "authors": [{"given_name": "Vijay", "family_name": "Konda", "institution": null}, {"given_name": "John", "family_name": "Tsitsiklis", "institution": null}]}