{"title": "A Reinforcement Learning Variant for Control Scheduling", "book": "Advances in Neural Information Processing Systems", "page_first": 479, "page_last": 485, "abstract": null, "full_text": "A Reinforcement Learning Variant for Control \n\nScheduling \n\nHoneywell  Sensor and  System  Development Center \n\nAloke Guha \n\n3660 Technology Drive \nMinneapolis,  MN  55417 \n\nAbstract \n\nWe  present  an  algorithm  based  on  reinforcement  and  state  recurrence \nlearning  techniques  to  solve  control  scheduling  problems.  In  particular,  we \nhave  devised  a  simple  learning  scheme  called  \"handicapped  learning\",  in \nwhich  the  weights  of the  associative  search  element  are  reinforced,  either \npositively  or negatively,  such  that the  system  is forced  to  move  towards the \ndesired  setpoint  in  the  shortest possible  trajectory.  To  improve  the  learning \nrate,  a  variable  reinforcement  scheme  is  employed:  negative  reinforcement \nvalues are  varied depending  on  whether the failure  occurs in  handicapped or \nnormal  mode  of operation.  Furthermore,  to  realize  a  simulated  annealing \nscheme  for  accelerated  learning,  if the  system  visits  the  same  failed  state \nsuccessively,  the  negative  reinforcement  value  is  increased. \nIn  examples \nstudied,  these  learning  schemes  have  demonstrated  high  learning  rates,  and \ntherefore may prove useful  for  in-situ learning. \n\n1  INTRODUCTION \n\nReinforcement  learning  techniques  have been  applied  successfully  for  simple control \nproblems,  such  as  the  pole-cart problem  [Barto  83,  Michie 68,  Rosen  88]  where  the \ngoal  was  to  maintain  the  pole  in  a  quasistable  region,  but  not  at  specific  setpoints. \nHowever,  a  large  class  of  continuous  control  problems  require  maintaining  the \nsystem  at  a  desired  operating  point,  or  setpoint,  at  a  given  time.  We  refer  to  this \nproblem  as  the  basic  setpoint  control  problem  [Guha  90],  and  have  shown  that \nreinforcement learning can be  used,  not surprisingly, quite  well  for such  control  tasks. \nA more  general  version  of the  same problem  requires  steering  the  system  from  some \n\n479 \n\n\f480 \n\nGuha \n\ninitial  or  starting  state  to  a  desired  state  or  setpoint  at  specific  times  without \nknowledge  of  the  dynamics  of  the  system.  We  therefore  wish  to  examine  how \ncontrol  scheduling  tasks,  where  the  system  must  be  steered  through  a  sequence  of \nsetpoints  at  specific  times.  can  be  learned.  Solving  such  a  control  problem  without \nexplicit modeling  of the  system or plant can  prove to  be beneficial in  many  adaptive \ncontrol  tasks. \n\nTo  address  the  control  scheduling  problem.  we  have  derived  a  learning  algorithm \ncalled  handicapped learning.  Handicapped  learning  uses  a nonlinear encoding of the \nstate of the  system.  a  new  associative  reinforcement learning  algorithm.  and  a  novel \nreinforcement  scheme  to  explore  the  control  space  to  meet  the  scheduling \nconstraints.  The goal of handicapped learning  is to  learn  the control law  necessary  to \nsteer the  system  from  one  setpoint to another.  We provide a  description  of the  state \nencoding  and  associative  learning  in  Section  2.  the  reinforcement  scheme  in  Section \n3,  the  experimental results  in  Section 4,  and  the  conclusions  in  Section  5. 
\n\n2  REINFORCEMENT  LEARNING  STRATEGY: \n\nHANDICAPPED  LEARNING \n\nOur earlier work on  regulatory control  using reinforcement  learning  [Guha  90]  used  a \nsimple  linear  coded  state  representation  of the  system.  However.  when  considering \nmultiple  setpoints  in  a  schedule,  a  linear  coding  of  high-resolution  results  in  a \ncombinatorial  explosion  of states.  To  avoid  this  curse  of dimensionality,  we  have \nadopted  a simple nonlinear encoding of the state space.  We  describe  this  first. \n\n2.1  STATE ENCODING \n\nTo  define  the  states  in  which  reinforcement  must  be  provided  to  the  controller.  we \nset tolerance  limits around  the desired  setpoint.  say  Xd.  If the tolerance of operation \ndefined  by  the  level  of control  sophistication  required  in  the  problem  is  T.  then  the \ncontroller is defined to fail  if IX(t)  - Xdl  > T as described  in our earlier work  in  [Guha \n90]. \n\nThe controller must learn  to maintain  the system  within  this  tolerance  window.  If the \nrange,  R.  of possible  values  of the  setpoint  or  control  variable  X(t)  is  significantly \ngreater  than  the  tolerance  window.  then  the  number  of states  required  to  define  the \nsetpoint  will  be large.  We  therefore  use  a  nonlinear coding  of the  control  variable. \nThus,  if the  level  of discrimination  within  the  tolerance  window  is  2T/n.  then  the \nnumber of states  required  to  represent  the  control  variable  is  (n  + 2)  where  the  two \nadded  states  represent  the  states,  (X(t)  - Xd)  > T  and  (X(t)  - Xd)  <  -T.  With  this \nrepresentation  scheme.  any  continuous  range  of setpoints  can  be  represented  with \nvery high resolution but without the explosion  in  state space. \n\nThe  above  state  encoding  will  be  used  in  our  associative  reinforcement  learning \nalgorithm.  handicapped  learning,  which  we describe  next. \n\n\fA Reinforcement Learning Variant for Control Scheduling \n\n481 \n\n2.2 HANDICAPPED LEARNING ALGORITHM \n\nOur  reinforcement  learning  strategy  is  derived  from  the  Associative  Search \nElement/Adaptive  Heuristic  Critic  (ASE/AHC)  algorithm  [Barto  83.  Anderson  86]. \nWe have considered a binary  control  output.  y(t): \n\ny(t)  =  f(L  wi(t)xi(t) + noise(t\u00bb \n\ni \n\n(1) \n\nwhere  f  is  the  thresholding  step  function.  and xi(t). 0 SiS N. is the current decoded \nstate.  that  is.  xi(t) =  1  when  the  system  is  in  the  ith  state  and  0  otherwise.  As  in \nASE.  the  added  term  noise(t)  facilitates  stochastic  learning.  Note  that  the  learning \nalgorithm  can  be  easily  extended  to  continuous  valued  outputs.  the  nature  of  the \ncontinuity is determined by  the thresholding function. \n\nWe  incorporate  two  learning  heuristics:  state  recurrence  [Rosen  88]  and  a  newly \nintroduced  heuristic  called  \"handicapped  learning\".  The  controller  is  in  the \nhandicapped learning mode if a flag.  H.  is set high.  H is defined as follows: \n\nH = O.  if IX(t)  - Xdl < T \n\n= 1.  otherwise \n\n(2) \n\nThe  handicap  mode  provides  a  mechanism  to  modify  the  reinforcement  scheme.  In \nthis  mode  the controller is  allowed  to  explore  the  search  space  of action  sequences. \nto  steer to  a  new  setpoint. without \"punishment\"  (negative  reinforcement).  The mode \nis  invoked  when  the  system  is  at  a  valid  setpoint  XI(tI) at time  tl. 
The handicap mode is invoked when the system is at a valid setpoint X1(t1) at time t1 but must be steered to a new setpoint X2 outside the tolerance window, that is, |X1 - X2| > T, at time t2. Since both setpoints are valid operating points, these setpoints, as well as all points on the possible optimal trajectories from X1 to X2, cannot be deemed failure states. Further, by following a special reinforcement scheme during the handicapped mode, one can enable learning and help the controller find the optimal trajectory to steer the system from one setpoint to another.

The weight updating rule used during setpoint schedule learning is given by equation (3):

    wi(t+1) = wi(t) + α1 r1(t) ei(t) + α2 r2(t) e2i(t) + α3 r3(t) e3i(t)   (3)

where the term α1 r1(t) ei(t) is the basic associative learning component, r1(t) the heuristic reinforcement, and ei(t) the eligibility trace of the state xi(t) [Barto 83].

The third term in equation (3) is the state recurrence component for reinforcing short cycles [Rosen 88]. Here α2 is a constant gain, r2(t) is a positive constant reward, and e2i, the state recurrence eligibility, is defined as follows:

    e2i(t) = β2 xi(t) y(ti,last) / (β2 + t - ti,last),  if (t - ti,last) > 1 and H = 0
           = 0,  otherwise                                                 (4)

where β2 is a positive constant, and ti,last is the last time the system visited the ith state. The eligibility function in equation (4) reinforces shorter cycles more than longer cycles, and improves control when the system is within the tolerance window.

The fourth term in equation (3) is the handicapped learning component. Here α3 is a constant gain, r3(t) is a positive constant reward, and e3i, the handicapped learning eligibility, is defined as follows:

    e3i(t) = -β3 xi(t) y(ti,last) / (β3 + t - ti,last),  if H = 1
           = 0,  otherwise                                                 (5)

where β3 is a positive constant. While state recurrence promotes short cycles around a desired operating point, handicapped learning forces the controller to move away from the current operating point X(t). The system enters the handicapped mode whenever it is outside the tolerance window around the desired setpoint. If the initial operating point Xi (= X(0)) is outside the tolerance window of the desired setpoint Xd, that is, |Xi - Xd| > T, the basic AHC network will always register a failure. This failure situation is avoided by invoking the handicapped learning described above. By setting absolute upper and lower limits on operating point values, the controller based on handicapped learning can learn the correct sequence of actions necessary to steer the system to the desired operating point Xd.

The weight update equations for the critic in the AHC are unchanged from the original AHC and we do not list them here.
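The sketch below shows the structure of the update in equations (3)-(5) as we read it; the vectorized bookkeeping (arrays last_visit and y_last standing in for ti,last and y(ti,last)) and the gain and constant values are our own assumptions, not the paper's settings.

import numpy as np

def eligibilities(s, t, last_visit, y_last, H, beta2=1.0, beta3=1.0):
    """State-recurrence eligibility e2 (eq. 4) and handicapped eligibility e3 (eq. 5).
    s          : one-hot state vector x_i(t)
    last_visit : per-state time of the last visit, t_i,last
    y_last     : per-state action taken on that last visit, y(t_i,last)
    H          : handicap flag from equation (2)"""
    dt = t - last_visit
    e2 = np.zeros_like(s, dtype=float)
    e3 = np.zeros_like(s, dtype=float)
    if H == 0:
        active = (s > 0) & (dt > 1)
        e2[active] = beta2 * y_last[active] / (beta2 + dt[active])    # favors short cycles
    else:
        active = s > 0
        e3[active] = -beta3 * y_last[active] / (beta3 + dt[active])   # pushes off the current point
    return e2, e3

def update_weights(w, r1, e1, e2, e3, r2=1.0, r3=1.0, a1=0.5, a2=0.5, a3=0.5):
    """Equation (3): associative + state-recurrence + handicapped components."""
    return w + a1 * r1 * e1 + a2 * r2 * e2 + a3 * r3 * e3

In the full algorithm, r1(t) and ei(t) would come from the AHC critic and the usual ASE eligibility trace, which, as noted above, are unchanged from [Barto 83].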
3  REINFORCEMENT SCHEMES

Unlike previous experiments by other researchers, we have constructed the reinforcement values used during learning to be multivalued, not binary, and both positive and negative reinforcements are provided to the critic. There are two forms of failure that can occur during setpoint control. First, the controller can reach the absolute upper or lower limits. Second, there may be a timeout failure in the handicapped mode. By design, when the controller is in handicapped mode, it is allowed to remain there only for a time TL, determined by the average control step Δy and the error between the current operating point and the desired setpoint:

    TL = k Δy (X0 - Xd)                                                    (6)

where X0 is the initial setpoint and k is some constant. The negative reinforcement provided to the controller is higher if the absolute limits of the operating point are reached.

We have implemented a more interesting reinforcement scheme that is somewhat similar to simulated annealing: if the system fails in the same state on two successive trials, the negative reinforcement is increased. The primary reinforcement function can be defined as follows:

    ri(k+1) = ri(k) - r0,  if i = j
            = r1,          if i ≠ j                                        (7)

where ri(k) is the negative reinforcement provided if the system failed in state i during trial k, and r0 and r1 are constants.
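The following sketch illustrates the variable-reinforcement ideas of this section under our own assumptions: the constants k, r0, and r1 are placeholders, and we read j in equation (7) as the state in which the previous failure occurred.

def handicap_time_limit(x0, x_d, delta_y, k=1.0):
    """Equation (6): time allowed in handicapped mode, TL = k * Δy * (X0 - Xd).
    abs() is our addition so the limit is positive for either steering direction."""
    return abs(k * delta_y * (x0 - x_d))

def update_failure_penalty(penalties, failed_state, prev_failed_state, r0=0.2, r1=-1.0):
    """Equation (7), as we read it: if the system fails in the same state on two
    successive trials, deepen that state's negative reinforcement by r0;
    otherwise reset the newly failed state's reinforcement to the baseline r1."""
    penalties = dict(penalties)
    if failed_state == prev_failed_state:
        penalties[failed_state] = penalties.get(failed_state, r1) - r0
    else:
        penalties[failed_state] = r1
    return penalties

In addition, per the text above, a failure at the absolute operating limits would be delivered with a larger (more negative) value than a handicapped-mode timeout; the scaling factor is left as an implementation choice.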
4  EXPERIMENTS AND RESULTS

Two different setpoint control experiments have been conducted. The first was the basic setpoint control of a continuous stirred tank reactor in which the temperature must be held at a desired setpoint. That experiment successfully demonstrated the use of reinforcement learning for setpoint control of a highly nonlinear and unstable process [Guha 90]. The second, more recent, experiment evaluated the handicapped learning strategy for an environmental controller, where the controller must learn to control the heating system to maintain the ambient temperature specified by a time-temperature schedule. Thus, as the external temperature varies, the network must adapt the heating ON/OFF control sequence so as to maintain the environment at the desired temperature as quickly as possible. The state information describing the system is composed of the time interval of the schedule, the current heating state (ON/OFF), and the error, that is, the difference between the desired and current ambient (interior) temperature. The heating and cooling rates are variable: the heating rate decreases while the cooling rate increases exponentially as the exterior temperature falls below the ambient or controlled temperature.

[Figure 1: Rate of Learning with and without Handicapped Learning. Percentage of the schedule learned (0-100) plotted against trial number (0-40), comparing handicapped learning with no handicapped learning.]

[Figure 2: Time-Temperature Plot of Controlled Environment at Forty-third Trial. Ambient (controlled) temperature, roughly 54-70 degrees, plotted against time in minutes (0-1200).]

The experiments on the environmental controller consisted of embedding a daily setpoint schedule that contains six setpoints at six specific times. Trials were conducted to train the controller. Each trial starts at the beginning of the schedule (time = 0). The setpoints typically varied in the range of 55 to 75 degrees. The desired tolerance window was 1 degree. The upper and lower limits of the controlled temperature were set arbitrarily at 50 and 80 degrees, respectively. Control actions were taken every 5 minutes. Learning was monitored by examining how much of the schedule was learnt correctly as the number of trials increased.

[Figure 3: Time-Temperature Plot of Controlled Environment for a Test Run. Setpoint schedule temperature, ambient (controlled) temperature, and exterior temperature plotted against time in minutes (0-1400).]

Figure 1 shows how the learning progresses with the number of trials. Current results show that the learning of the complete schedule (of the six time-temperature pairs), requiring 288 control steps, can be accomplished in only 43 trials. (Given binary output, the controller could in the worst case have executed on the order of 10^86 (≈ 2^288) trials to learn the complete schedule.)

More details on the learning ability of the reinforcement learning strategy are available from the time-temperature plots of the trial and test runs in Figures 2 and 3. As the learning progresses to the forty-third trial, the controller learns to continuously heat up or cool down to the desired temperature (Figure 2). To further test the generalization of the learned schedule, the trained network was tested in a different environment where the exterior temperature profile (and therefore the heating and cooling rates) was different from the one used for training. Figure 3 shows the schedule that is maintained. Because the controller encounters different cooling rates in the test run, some learning still occurs, as evident from Figure 3. However, all six setpoints were reached in the proper sequence. In essence, this test shows that the controller has generalized the heating and cooling control law, independently of the setpoints and the heating and cooling rates.

5  CONCLUSIONS

We have developed a new learning strategy based on reinforcement learning that can be used to learn setpoint schedules for continuous processes. The experimental results have demonstrated good learning performance. However, a number of interesting extensions to this work are possible. For instance, the handicapped-mode exploration of the control space could be better directed, and learning made faster, if more information on the desired or possible trajectory were known. Another area of investigation is state encoding. In our approach, the nonlinear encoding of the system state was assumed uniform over different regions of the control space.
In applications where the system exhibits high nonlinearity, different nonlinear codings could be used adaptively in different regions to improve the state representation. Finally, other formulations of reinforcement learning algorithms, besides ASE/AHC, should also be explored. One such possibility is Watkins' Q-learning [Watkins 89].

References

[Guha 90] A. Guha and A. Mathur, Set Point Control Based on Reinforcement Learning, Proceedings of IJCNN 90, Washington D.C., January 1990.

[Barto 83] A. G. Barto, R. S. Sutton, and C. W. Anderson, Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems, IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-13, No. 5, September/October 1983.

[Michie 68] D. Michie and R. Chambers, in Machine Intelligence, E. Dale and D. Michie (eds.), Oliver and Boyd, Edinburgh, 1968, p. 137.

[Rosen 88] B. E. Rosen, J. M. Goodwin, and J. J. Vidal, Learning by State Recurrence Detection, IEEE Conference on Neural Information Processing Systems - Natural and Synthetic, AIP Press, 1988.

[Watkins 89] C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. Dissertation, King's College, May 1989.
", "award": [], "sourceid": 337, "authors": [{"given_name": "Aloke", "family_name": "Guha", "institution": null}]}