{"title": "Temporal Difference Learning in Continuous Time and Space", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1079, "abstract": "", "full_text": "Temporal Difference Learning in Continuous Time and Space\n\nKenji Doya\n\ndoya@hip.atr.co.jp\n\nATR Human Information Processing Research Laboratories\n2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan\n\nAbstract\n\nA continuous-time, continuous-state version of the temporal difference (TD) algorithm is derived in order to facilitate the application of reinforcement learning to real-world control tasks and neurobiological modeling. An optimal nonlinear feedback control law was also derived using the derivatives of the value function. The performance of the algorithms was tested in a task of swinging up a pendulum with limited torque. Both the \"critic\" that specifies the paths to the upright position and the \"actor\" that works as a nonlinear feedback controller were successfully implemented by radial basis function (RBF) networks.\n\n1  INTRODUCTION\n\nThe temporal-difference (TD) algorithm (Sutton, 1988) for delayed reinforcement learning has been applied to a variety of tasks, such as robot navigation, board games, and biological modeling (Houk et al., 1994). Elucidation of the relationship between TD learning and dynamic programming (DP) has provided good theoretical insights (Barto et al., 1995). However, conventional TD algorithms were based on discrete-time, discrete-state formulations. In applying these algorithms to control problems, time, space and action had to be appropriately discretized using a priori knowledge or by trial and error. Furthermore, when a TD algorithm is used for neurobiological modeling, discrete-time operation is often very unnatural. 
\n\nThere have been several attempts to extend TD-like algorithms to continuous cases. Bradtke et al. (1994) showed convergence results for DP-based algorithms for a discrete-time, continuous-state linear system with a quadratic cost. Bradtke and Duff (1995) derived TD-like algorithms for continuous-time, discrete-state systems (semi-Markov decision problems). Baird (1993) proposed the \"advantage updating\" algorithm by modifying Q-learning so that it works with arbitrarily small time steps.\n\nIn this paper, we derive a TD learning algorithm for continuous-time, continuous-state, nonlinear control problems. The correspondence of the continuous-time version to the conventional discrete-time version is also shown. The performance of the algorithm was tested in a nonlinear control task of swinging up a pendulum with limited torque.\n\n2  CONTINUOUS-TIME TD LEARNING\n\nWe consider a continuous-time dynamical system (plant)\n\n$$\\dot{x}(t) = f(x(t), u(t))$$ (1)\n\nwhere $x \\in X \\subset R^n$ is the state and $u \\in U \\subset R^m$ is the control input (action). We denote the immediate reinforcement (evaluation) for the state and the action as\n\n$$r(t) = r(x(t), u(t)).$$ (2)\n\nOur goal is to find a feedback control law (policy)\n\n$$u(t) = \\mu(x(t))$$ (3)\n\nthat maximizes the expected reinforcement for a certain period in the future. To be specific, for a given control law $\\mu$, we define the \"value\" of the state $x(t)$ as\n\n$$V^{\\mu}(x(t)) = \\int_t^{\\infty} \\frac{1}{\\tau} e^{-\\frac{s-t}{\\tau}} r(x(s), u(s))\\,ds,$$ (4)\n\nwhere $x(s)$ and $u(s)$ ($t \\le s < \\infty$) follow the system dynamics (1) and the control law (3). Our problem now is to find an optimal control law $\\mu^*$ that maximizes $V^{\\mu}(x)$ for any state $x \\in X$. 
Note that $\\tau$ is the time scale of \"imminence-weighting\" and the scaling factor $\\frac{1}{\\tau}$ is used for normalization, i.e., $\\int_t^{\\infty} \\frac{1}{\\tau} e^{-\\frac{s-t}{\\tau}}\\,ds = 1$.\n\n2.1  TD ERROR\n\nThe basic idea in TD learning is to predict future reinforcement in an on-line manner. We first derive a local consistency condition for the value function $V^{\\mu}(x)$. By differentiating (4) with respect to $t$, we have\n\n$$\\tau \\frac{d}{dt} V^{\\mu}(x(t)) = V^{\\mu}(x(t)) - r(t).$$ (5)\n\nLet $P(t)$ be the prediction of the value function $V^{\\mu}(x(t))$ from $x(t)$ (output of the \"critic\"). If the prediction is perfect, it should satisfy $\\tau \\dot{P}(t) = P(t) - r(t)$. If this is not satisfied, the prediction should be adjusted to decrease the inconsistency\n\n$$\\hat{r}(t) = r(t) - P(t) + \\tau \\dot{P}(t).$$ (6)\n\nThis is a continuous version of the temporal difference error.\n\n2.2  EULER DIFFERENTIATION: TD(0)\n\nThe relationship between the above continuous-time TD error and the discrete-time TD error (Sutton, 1988)\n\n$$\\hat{r}(t) = r(t) + \\gamma P(t) - P(t - \\Delta t)$$ (7)\n\ncan be easily seen by a backward Euler approximation of $\\dot{P}(t)$. By substituting $\\dot{P}(t) = (P(t) - P(t - \\Delta t))/\\Delta t$ into (6), we have\n\n$$\\hat{r} = r(t) + \\frac{\\tau}{\\Delta t} \\left[ \\left(1 - \\frac{\\Delta t}{\\tau}\\right) P(t) - P(t - \\Delta t) \\right].$$\n\nThis coincides with (7) if we make the \"discount factor\" $\\gamma = 1 - \\frac{\\Delta t}{\\tau} \\simeq e^{-\\frac{\\Delta t}{\\tau}}$, except for the scaling factor $\\frac{\\tau}{\\Delta t}$.\n\nNow let us consider a case when the prediction of the value function is given by\n\n$$P(t) = \\sum_i v_i b_i(x(t)),$$ (8)\n\nwhere $b_i()$ are basis functions (e.g., sigmoid, Gaussian, etc.) and $v_i$ are the weights. The gradient descent of the squared TD error is given by\n\n$$\\Delta v_i \\propto -\\frac{\\partial \\frac{1}{2} \\hat{r}^2(t)}{\\partial v_i} \\propto -\\hat{r}(t) \\left[ \\left(1 - \\frac{\\Delta t}{\\tau}\\right) \\frac{\\partial P(t)}{\\partial v_i} - \\frac{\\partial P(t - \\Delta t)}{\\partial v_i} \\right].$$\n\nIn order to \"back-up\" the information about the future reinforcement to correct the prediction in the past, we should modify $P(t - \\Delta t)$ rather than $P(t)$ in the above formula. 
This results in the learning rule\n\n$$\\Delta v_i \\propto \\hat{r}(t) \\frac{\\partial P(t - \\Delta t)}{\\partial v_i} = \\hat{r}(t) b_i(x(t - \\Delta t)).$$ (9)\n\nThis is equivalent to the TD(0) algorithm that uses the \"eligibility trace\" from the previous time step.\n\n2.3  SMOOTH DIFFERENTIATION: TD($\\lambda$)\n\nThe Euler approximation of a time derivative is susceptible to noise (e.g., when we use stochastic control for exploration). Alternatively, we can use a \"smooth\" differentiation algorithm that uses a weighted average of the past input, such as\n\n$$\\dot{P}(t) \\simeq \\frac{P(t) - \\bar{P}(t)}{\\tau_c} \\quad \\text{where} \\quad \\tau_c \\frac{d}{dt} \\bar{P}(t) = P(t) - \\bar{P}(t)$$\n\nand $\\tau_c$ is the time constant of the differentiation. The corresponding gradient descent algorithm is\n\n$$\\Delta v_i \\propto -\\frac{\\partial \\frac{1}{2} \\hat{r}^2(t)}{\\partial v_i} \\propto \\hat{r}(t) \\frac{\\partial \\bar{P}(t)}{\\partial v_i} = \\hat{r}(t) \\bar{b}_i(t),$$ (10)\n\nwhere $\\bar{b}_i$ is the eligibility trace for the weight\n\n$$\\tau_c \\frac{d}{dt} \\bar{b}_i(t) = b_i(x(t)) - \\bar{b}_i(t).$$ (11)\n\nNote that this is equivalent to the TD($\\lambda$) algorithm (Sutton, 1988) with $\\lambda = 1 - \\frac{\\Delta t}{\\tau_c}$ if we discretize the above equation with time step $\\Delta t$.\n\n3  OPTIMAL CONTROL BY VALUE GRADIENT\n\n3.1  HJB EQUATION\n\nThe value function $V^*$ for an optimal control $\\mu^*$ is defined as\n\n$$V^*(x(t)) = \\max_{u[t,\\infty)} \\left[ \\int_t^{\\infty} \\frac{1}{\\tau} e^{-\\frac{s-t}{\\tau}} r(x(s), u(s))\\,ds \\right].$$ (12)\n\nAccording to the principle of dynamic programming (Bryson and Ho, 1975), we consider optimization in two phases, $[t, t + \\Delta t]$ and $[t + \\Delta t, \\infty)$, resulting in the expression\n\n$$V^*(x(t)) = \\max_{u[t,t+\\Delta t)} \\left[ \\int_t^{t+\\Delta t} \\frac{1}{\\tau} e^{-\\frac{s-t}{\\tau}} r(x(s), u(s))\\,ds + e^{-\\frac{\\Delta t}{\\tau}} V^*(x(t + \\Delta t)) \\right].$$ 
\n\nBy Taylor expanding the value at $t + \\Delta t$ as\n\n$$V^*(x(t + \\Delta t)) = V^*(x(t)) + \\frac{\\partial V^*}{\\partial x(t)} f(x(t), u(t)) \\Delta t + O(\\Delta t)$$\n\nand then taking $\\Delta t$ to zero, we have a differential constraint for the optimal value function\n\n$$V^*(x(t)) = \\max_{u(t) \\in U} \\left[ r(x(t), u(t)) + \\tau \\frac{\\partial V^*}{\\partial x} f(x(t), u(t)) \\right].$$ (13)\n\nThis is a variant of the Hamilton-Jacobi-Bellman equation (Bryson and Ho, 1975) for a discounted case.\n\n3.2  OPTIMAL NONLINEAR FEEDBACK CONTROL\n\nWhen the reinforcement $r(x, u)$ is convex with respect to the control $u$, and the vector field $f(x, u)$ is linear with respect to $u$, the optimization problem in (13) has a unique solution. The condition for the optimal control is\n\n$$\\frac{\\partial r(x, u)}{\\partial u} + \\tau \\frac{\\partial V^*}{\\partial x} \\frac{\\partial f(x, u)}{\\partial u} = 0.$$ (14)\n\nNow we consider the case when the cost for control is given by a convex potential function $G_j()$ for each control input\n\n$$r(x, u) = r_x(x) - \\sum_j G_j(u_j),$$\n\nwhere reinforcement for the state $r_x(x)$ is still unknown. We also assume that the input gain of the system\n\n$$b_j(x) = \\frac{\\partial f(x, u)}{\\partial u_j}$$\n\nis available. In this case, the optimal condition (14) for $u_j$ is given by\n\n$$-G_j'(u_j) + \\tau \\frac{\\partial V^*}{\\partial x} b_j(x) = 0.$$\n\nNoting that the derivative $G'()$ is a monotonic function since $G()$ is convex, we have the optimal feedback control law\n\n$$u_j = (G_j')^{-1} \\left( \\tau \\frac{\\partial V^*}{\\partial x} b_j(x) \\right).$$ (15)\n\nParticularly, when the amplitude of control is bounded as $|u_j| < u_j^{max}$, we can enforce this constraint using a control cost\n\n$$G_j(u_j) = c_j \\int_0^{u_j / u_j^{max}} g^{-1}(s)\\,ds,$$ (16)\n\nwhere $g^{-1}()$ is an inverse sigmoid function that diverges at $\\pm 1$ (Hopfield, 1984). In this case, the optimal feedback control law is given by\n\n$$u_j = u_j^{max} g \\left( \\frac{u_j^{max}}{c_j} \\tau \\frac{\\partial V^*}{\\partial x} b_j(x) \\right).$$ 
\n\n(17)\n\nIn the limit of $c_j \\to 0$, this results in the \"bang-bang\" control law\n\n$$u_j = u_j^{max} \\, \\mathrm{sign} \\left[ \\frac{\\partial V^*}{\\partial x} b_j(x) \\right].$$ (18)\n\nFigure 1: A pendulum with limited torque. The dynamics is given by $m l^2 \\ddot{\\theta} = -\\mu \\dot{\\theta} + m g l \\sin\\theta + T$. Parameters were $m = l = 1$, $g = 9.8$, and $\\mu = 0.01$.\n\nFigure 2: Left: The learning curves for (a) optimal control and (c) actor-critic. $t_{up}$: time during which $|\\theta| < 90^\\circ$. Right: (b) The predicted value function $P$ after 100 trials of optimal control. (d) The output of the controller after 100 trials with actor-critic learning. The thick gray line shows the trajectory of the pendulum. th: $\\theta$ (degrees), om: $\\dot{\\theta}$ (degrees/sec).\n\n4  ACTOR-CRITIC\n\nWhen the information about the control cost, the input gain of the system, or the gradient of the value function is not available, we cannot use the above optimal control law. However, the TD error (6) can be used as \"internal reinforcement\" for training a stochastic controller, or an \"actor\" (Barto et al., 1983).\n\nIn the simulation below, we combined our TD algorithm for the critic with a reinforcement learning algorithm for real-valued output (Gullapalli, 1990). The output of the controller was given by\n\n$$u_j(t) = u_j^{max} g \\left( \\sum_i w_{ji} b_i(x(t)) + \\sigma n_j(t) \\right),$$ (19)\n\nwhere $n_j(t)$ is normalized Gaussian noise and $w_{ji}$ is a weight. The size of this perturbation was changed based on the predicted performance by $\\sigma = \\sigma_0 \\exp(-P(t))$. The connection weights were changed by\n\n$$\\Delta w_{ji} \\propto \\hat{r}(t) n_j(t) b_i(x(t)).$$ 
\n\n(20)\n\n5  SIMULATION\n\nThe performance of the above continuous-time TD algorithm was tested on a task of swinging up a pendulum with limited torque (Figure 1). Control of this one-degree-of-freedom system is trivial near the upright equilibrium. However, bringing the pendulum near the upright position is not trivial if we set the maximal torque $T^{max}$ smaller than $mgl$. The controller has to swing the pendulum several times to build up enough momentum to bring it upright. Furthermore, the controller has to decelerate the pendulum early enough to avoid falling over.\n\nWe used a radial basis function (RBF) network to approximate the value function for the state of the pendulum $x = (\\theta, \\dot{\\theta})$. We prepared a fixed set of $12 \\times 12$ Gaussian basis functions. This is a natural extension of the \"boxes\" approach previously used to control inverted pendulums (Barto et al., 1983). The immediate reinforcement was given by the height of the tip of the pendulum, i.e., $r_x = \\cos\\theta$.\n\n5.1  OPTIMAL CONTROL\n\nFirst, we used the optimal control law (17) with the predicted value function $P$ instead of $V^*$. We added noise to the control command to enhance exploration. The torque was given by\n\n$$T = T^{max} g \\left( \\frac{T^{max}}{c} \\tau \\frac{\\partial P(x)}{\\partial x} b + \\sigma n(t) \\right),$$\n\nwhere $g(x) = \\frac{2}{\\pi} \\tan^{-1}(\\frac{\\pi}{2} x)$ (Hopfield, 1984). Note that the input gain $b = (0, 1/ml^2)^T$ was constant. Parameters were $T^{max} = 5$, $c = 0.1$, $\\sigma_0 = 0.01$, $\\tau = 1.0$, and $\\tau_c = 0.1$.\n\nEach run was started from a random $\\theta$ and was continued for 20 seconds. Within ten trials, the value function $P$ became accurate enough to be able to swing up and hold the pendulum (Figure 2a). An example of the predicted value function $P$ after 100 trials is shown in Figure 2b. 
The paths toward the upright position, which were implicitly determined by the dynamical properties of the system, can be seen as the ridges of the value function. We also had successful results when the reinforcement was given only near the goal: $r_x = 1$ if $|\\theta| < 30^\\circ$, $-1$ otherwise.\n\n5.2  ACTOR-CRITIC\n\nNext, we tested the actor-critic learning scheme as described above. The controller was also implemented by an RBF network with the same $12 \\times 12$ basis functions as the critic network. It took about one hundred trials to achieve reliable performance (Figure 2c). Figure 2d shows an example of the output of the controller after 100 trials. We can see nearly linear feedback in the neighborhood of the upright position and a non-linear torque field away from the equilibrium.\n\n6  CONCLUSION\n\nWe derived a continuous-time, continuous-state version of the TD algorithm and showed its applicability to a nonlinear control task. One advantage of continuous formulation is that we can derive an explicit form of optimal control law as in (17) using derivative information, whereas a one-ply search for the best action is usually required in discrete formulations.\n\nReferences\n\nBaird III, L. C. (1993). Advantage updating. Technical Report WL-TR-93-1146, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-7301, USA.\n\nBarto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81-138.\n\nBarto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834-846.\n\nBradtke, S. J. and Duff, M. O. (1995). 
Reinforcement learning methods for continuous-time Markov decision problems. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 393-400. MIT Press, Cambridge, MA.\n\nBradtke, S. J., Ydstie, B. E., and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. CMPSCI Technical Report 94-49, University of Massachusetts, Amherst, MA.\n\nBryson, Jr., A. E. and Ho, Y.-C. (1975). Applied Optimal Control. Hemisphere Publishing, New York, 2nd edition.\n\nGullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3:671-692.\n\nHopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of National Academy of Science, 81:3088-3092.\n\nHouk, J. C., Adams, J. L., and Barto, A. G. (1994). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Houk, J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing in the Basal Ganglia, pages 249-270. MIT Press, Cambridge, MA.\n\nSutton, R. S. (1988). Learning to predict by the methods of temporal difference. Machine Learning, 3:9-44.\n", "award": [], "sourceid": 1169, "authors": [{"given_name": "Kenji", "family_name": "Doya", "institution": null}]}