{"title": "Generalization and Scaling in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 550, "page_last": 557, "abstract": null, "full_text": "550 \n\nAckley and Littman \n\nGeneralization and scaling in reinforcement learning \n\nDavid H. Ackley \n\nMichael L. Littman \n\nCognitive Science Research Group \n\nBellcore \n\nMorristown, NJ 07960 \n\nABSTRACT \n\nIn associative reinforcement learning, an environment generates input vectors, a learning system generates possible output vectors, and a reinforcement function computes feedback signals from the input-output pairs. The task is to discover and remember input-output pairs that generate rewards. Especially difficult cases occur when rewards are rare, since the expected time for any algorithm can grow exponentially with the size of the problem. Nonetheless, if a reinforcement function possesses regularities, and a learning algorithm exploits them, learning time can be reduced below that of non-generalizing algorithms. This paper describes a neural network algorithm called complementary reinforcement back-propagation (CRBP), and reports simulation results on problems designed to offer differing opportunities for generalization. \n\n1 REINFORCEMENT LEARNING REQUIRES SEARCH \n\nReinforcement learning (Sutton, 1984; Barto & Anandan, 1985; Ackley, 1988; Allen, 1989) requires more from a learner than does the more familiar supervised learning paradigm. Supervised learning supplies the correct answers to the learner, whereas reinforcement learning requires the learner to discover the correct outputs before they can be stored. 
The reinforcement paradigm divides neatly into search and learning aspects: When rewarded the system makes internal adjustments to learn the discovered input-output pair; when punished the system makes internal adjustments to search elsewhere. \n\n1.1 MAKING REINFORCEMENT INTO ERROR \n\nFollowing work by Anderson (1986) and Williams (1988), we extend the backpropagation algorithm to associative reinforcement learning. Start with a \"garden variety\" backpropagation network: A vector i of n binary input units propagates through zero or more layers of hidden units, ultimately reaching a vector s of m sigmoid units, each taking continuous values in the range (0,1). Interpret each s_j as the probability that an associated random bit o_j takes on value 1. Let us call the continuous, deterministic vector s the search vector to distinguish it from the stochastic binary output vector o. \n\nGiven an input vector, we forward propagate to produce a search vector s, and then perform m independent Bernoulli trials to produce an output vector o. The i-o pair is evaluated by the reinforcement function and reward or punishment ensues. Suppose reward occurs. We therefore want to make o more likely given i. Backpropagation will do just that if we take o as the desired target to produce an error vector (o - s) and adjust weights normally. \n\nNow suppose punishment occurs, indicating o does not correspond with i. By choice of error vector, backpropagation allows us to push the search vector in any direction; which way should we go? In the absence of problem-specific information, we cannot pick an appropriate direction with certainty. Any decision will involve assumptions. 
A very minimal \"don't be like o\" assumption, employed in Anderson (1986), Williams (1988), and Ackley (1989), pushes s directly away from o by taking (s - o) as the error vector. A slightly stronger \"be like not-o\" assumption, employed in Barto & Anandan (1985) and Ackley (1987), pushes s directly toward the complement of o by taking ((1 - o) - s) as the error vector. Although the two approaches always agree on the signs of the error terms, they differ in magnitudes. In this work, we explore the second possibility, embodied in an algorithm called complementary reinforcement back-propagation (CRBP). \n\nFigure 1 summarizes the CRBP algorithm. The algorithm in the figure reflects three modifications to the basic approach just sketched. First, in step 2, instead of using the s_j's directly as probabilities, we found it advantageous to \"stretch\" the values using a parameter ν. When ν < 1, it is not necessary for the s_j's to reach zero or one to produce a deterministic output. Second, in step 6, we found it important to use a smaller learning rate for punishment compared to reward. Third, consider step 7: Another forward propagation is performed, another stochastic binary output vector o* is generated (using the procedure from step 2), and o* is compared to o. If they are identical and punishment occurred, or if they are different and reward occurred, then another error vector is generated and another weight update is performed. This loop continues until a different output is generated (in the case of failure) or until the original output is regenerated (in the case of success). This modification improved performance significantly, and added only a small percentage to the total number of weight updates performed. \n\n0. 
Build a back propagation network with input dimensionality n and output dimensionality m. Let t = 0 and t_e = 0. \n1. Pick random i ∈ 2^n and forward propagate to produce s_j's. \n2. Generate a binary output vector o. Given a uniform random variable ξ_j ∈ [0,1] and parameter 0 < ν < 1, let o_j = 1 if (s_j - 1/2)/ν + 1/2 ≥ ξ_j, and o_j = 0 otherwise. \n3. Compute reinforcement r = f(i,o). Increment t. If r < 0, let t_e = t. \n4. Generate output errors e_j. If r > 0, let t_j = o_j, otherwise let t_j = 1 - o_j. Let e_j = (t_j - s_j) s_j (1 - s_j). \n5. Backpropagate errors. \n6. Update weights. Δw_jk = η e_k s_j, using η = η+ if r ≥ 0, and η = η- otherwise, with parameters η+, η- > 0. \n7. Forward propagate again to produce new s_j's. Generate temporary output vector o*. If (r > 0 and o* ≠ o) or (r < 0 and o* = o), go to 4. \n8. If t_e ≪ t, exit returning t_e, else go to 1. \n\nFigure 1: Complementary Reinforcement Back Propagation (CRBP) \n\n2 ON-LINE GENERALIZATION \n\nWhen there are many possible outputs and correct pairings are rare, the computational cost associated with the search for the correct answers can be profound. The search for correct pairings will be accelerated if the search strategy can effectively generalize the reinforcement received on one input to others. The speed of an algorithm on a given problem relative to non-generalizing algorithms provides a measure of generalization that we call on-line generalization. \n\n0. Let z be an array of length 2^n. Set the z[i] to random numbers from 0 to 2^m - 1. Let t = t_e = 0. \n1. Pick a random input i ∈ 2^n. \n2. Compute reinforcement r = f(i, z[i]). Increment t. \n3. If r < 0 let z[i] = (z[i] + 1) mod 2^m, and let t_e = t. \n4. If t_e ≪ t exit returning t_e, else go to 1. 
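As a rough illustration, the table-lookup steps listed just above can be sketched in Python as follows. The function names, the packing of bit vectors into integers, the +1 and -1 reward values, and the patience parameter (standing in for the informal stopping test on t_e and t) are assumptions made for this example and are not taken from the paper; the sketch also returns the table alongside t_e for inspection.

```python
import random


def table_lookup_reference(f, n, m, patience, rng):
    '''Sketch of the table-lookup steps; returns (t_e, table).

    f(i, o) is assumed to return a negative number for punishment and a
    positive number for reward; i and o are bit vectors packed into integers.
    '''
    # Step 0: one slot per possible input, each holding a random output guess.
    z = [rng.randrange(2 ** m) for _ in range(2 ** n)]
    t = 0
    t_e = 0
    # Step 4 (assumed form): stop once no error has occurred for more than
    # patience consecutive trials, i.e. when t_e + patience < t.
    while t - t_e <= patience:
        i = rng.randrange(2 ** n)      # Step 1: pick a random input.
        r = f(i, z[i])                 # Step 2: judge the stored guess.
        t += 1
        if r < 0:                      # Step 3: on punishment, try the next output.
            z[i] = (z[i] + 1) % (2 ** m)
            t_e = t
    return t_e, z


def copy_reward(i, o):
    # Toy reinforcement function for a copy problem: reward iff o equals i.
    return 1 if o == i else -1


if __name__ == '__main__':
    n = 4
    t_e, table = table_lookup_reference(copy_reward, n, n, 2000, random.Random(0))
    print('last error at trial', t_e)
    print('table solved:', table == list(range(2 ** n)))
```

Because each slot is searched independently, nothing learned for one input helps with any other, which is what makes this procedure a non-generalizing baseline.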
\n\nFigure 2: The Table Lookup Reference Algorithm T_ref(f, n, m) \n\nConsider the table-lookup algorithm T_ref(f, n, m) summarized in Figure 2. In this algorithm, a separate storage location is used for each possible input. This prevents the memorization of one i-o pair from interfering with any other. Similarly, the selection of a candidate output vector depends only on the slot of the table corresponding to the given input. The learning speed of T_ref depends only on the input and output dimensionalities and the number of correct outputs associated with each input. When a problem possesses n input bits and n output bits, and there is only one correct output vector for each input vector, T_ref runs in about 4^n time (counting each input-output judgment as one). In such cases one expects to take at least 2^{n-1} just to find one correct i-o pair, so exponential time cannot be avoided without a priori information. How does a generalizing algorithm such as CRBP compare to T_ref? \n\n3 SIMULATIONS ON SCALABLE PROBLEMS \n\nWe have tested CRBP on several simple problems designed to offer varying degrees and types of generalization. In all of the simulations in this section, the following details apply: Input and output bit counts are equal (n). Parameters are dependent on n but independent of the reinforcement function f. η+ is hand-picked for each n (see footnote 1), η- = η+/10 and ν = 0.5. All data points are medians of five runs. The stopping criterion t_e ≪ t is interpreted as t_e + max(2000, 2^{n+1}) < t. The fit lines in the figures are least squares solutions to a × b^n, to two significant digits. \n\nAs a notational convenience, let c = (1/n) Σ_{j=1}^{n} i_j, the fraction of ones in the input. 
\n\n3.1 n-MAJORITY \n\nConsider this \"majority rules\" problem: [if c > 1/2 then o = 1^n else o = 0^n]. The i-o mapping is many-to-1. This problem provides an opportunity for what Anderson (1986) called \"output generalization\": since there are only two correct output states, every pair of output bits are completely correlated in the cases when reward occurs. \n\n[Figure 3: The n-majority problem. Log-scale vertical axis from 10^0 to 10^7; horizontal axis n from 0 to 14. Legend: × Table, □ CRBP n-n-n, + CRBP n-n.] \n\nFigure 3 displays the simulation results. Note that although T_ref is faster than CRBP at small values of n, CRBP's slower growth rate (1.6^n vs 4.2^n) allows it to cross over and begin outperforming T_ref at about 6 bits. Note also, in violation of some conventional wisdom, that although n-majority is a linearly separable problem, the performance of CRBP with hidden units is better than without. Hidden units can be helpful, even on linearly separable problems, when there are opportunities for output generalization. \n\nFootnote 1: For n = 1 to 12, we used η+ = {2.000, 1.550, 1.130, 0.979, 0.783, 0.709, 0.623, 0.525, 0.280, 0.219, 0.170, 0.121}. \n\n3.2 n-COPY AND THE 2^k-ATTRACTORS FAMILY \n\nAs a second example, consider the n-copy problem: [o = i]. The i-o mapping is now 1-1, and the values of output bits in rewarding states are completely uncorrelated, but the value of each output bit is completely correlated with the value of the corresponding input bit. Figure 4 displays the simulation results. Once again, at 
low values of n, T_ref is faster, but CRBP rapidly overtakes T_ref as n increases. In n-copy, unlike n-majority, CRBP performs better without hidden units. \n\n[Figure 4: The n-copy problem. Log-scale vertical axis from 10^0 to 10^7; horizontal axis n from 0 to 12. Legend: × Table, □ CRBP n-n-n, + CRBP n-n. Fit lines labeled 150*2.0^n and 12*2.2^n.] \n\nThe n-majority and n-copy problems are extreme cases of a spectrum. n-majority can be viewed as a \"2-attractors\" problem in that there are only two correct outputs, all zeros and all ones, and the correct output is the one that i is closer to in Hamming distance. By dividing the input and output bits into two groups and performing the majority function independently on each group, one generates a \"4-attractors\" problem. In general, by dividing the input and output bits into 1 ≤ k ≤ n groups, one generates a \"2^k-attractors\" problem. When k = 1, n-majority results, and when k = n, n-copy results. \n\nFigure 5 displays simulation results on the n = 8-bit problems generated when k is varied from 1 to n. The advantage of hidden units for low values of k is evident, as is the advantage of \"shortcut connections\" (direct input-to-output weights) for larger values of k. Note also that combination of both hidden units and shortcut connections performs better than either alone. \n\n[Figure 5: The 2^k-attractors family at n = 8. Horizontal axis k from 1 to 8. Legend: CRBP 8-10-8, CRBP 8-8, CRBP 8-10-8/s, Table.] \n\n3.3 n-EXCLUDED MIDDLE \n\nAll of the functions considered so far have been linearly separable. 
Consider this \"folded majority\" function: [if a < c < b then o = 0^n else o = 1^n], where a and b stand for the lower and upper threshold fractions, whose values are garbled in this copy of the text. Now, like n-majority, there are only two rewarding output states, but the determination of which output state is correct is not linearly separable in the input space. When n = 2, the n-excluded middle problem yields the EQV (i.e., the complement of XOR) function, but whereas functions such as n-parity [if nc is even then o = 0^n else o = 1^n] get more non-linear with increasing n, n-excluded middle does not. \n\n[Figure 6: The n-excluded middle problem. Log-scale vertical axis from 10^0 to 10^7; horizontal axis n from 0 to 12. Legend: Table, CRBP n-n-n/s. Fit line labeled 1700*1.6^n.] \n\nFigure 6 displays the simulation results. CRBP is slowed somewhat compared to the linearly separable problems, yielding a higher \"cross over point\" of about 8 bits. \n\n4 STRUCTURING DEGENERATE OUTPUT SPACES \n\nAll of the scaling problems in the previous section are designed so that there is a single correct output for each possible input. This allows for difficult problems even at small sizes, but it rules out an important aspect of generalizing algorithms for associative reinforcement learning: If there are multiple satisfactory outputs for given inputs, a generalizing algorithm may impose structure on the mapping it produces. \n\nWe have two demonstrations of this effect, \"Bit Count\" and \"Inverse Arithmetic.\" The Bit Count problem simply states that the number of 1-bits in the output should equal the number of 1-bits in the input. When n = 9, T_ref rapidly finds solutions involving hundreds of different output patterns. 
CRBP is slower, especially with relatively few hidden units, but it regularly finds solutions involving just 10 output patterns that form a sequence from 0^9 to 1^9 with one bit changing per step. \n\n0+0×4=0  0+2×4=8   0+4×4=16  0+6×4=24 \n1+0×4=1  1+2×4=9   1+4×4=17  1+6×4=25 \n2+0×4=2  2+2×4=10  2+4×4=18  2+6×4=26 \n3+0×4=3  3+2×4=11  3+4×4=19  3+6×4=27 \n\n4+0×4=4  4+2×4=12  4+4×4=20  4+6×4=28 \n5+0×4=5  5+2×4=13  5+4×4=21  5+6×4=29 \n6+0×4=6  6+2×4=14  6+4×4=22  6+6×4=30 \n7+0×4=7  7+2×4=15  7+4×4=23  7+6×4=31 \n\n2+2-4=0  2+2+4=8   6+6+4=16  0+6×4=24 \n3+2-4=1  3+2+4=9   7+6+4=17  1+6×4=25 \n2+2÷4=2  2+2×4=10  2+4×4=18  2+6×4=26 \n3+2÷4=3  3+2×4=11  3+4×4=19  3+6×4=27 \n\n6+2-4=4  6+2+4=12  4×4+4=20  4+6×4=28 \n7+2-4=5  7+2+4=13  5+4×4=21  5+6×4=29 \n6+2÷4=6  6+2×4=14  6+4×4=22  6+6×4=30 \n7+2÷4=7  7+2×4=15  7+4×4=23  7+6×4=31 \n\nFigure 7: Sample CRBP solutions to Inverse Arithmetic \n\nThe Inverse Arithmetic problem can be summarized as follows: Given i ∈ 2^5, find x, y, z ∈ 2^3 and ∘, ⋄ ∈ {+(00), -(01), ×(10), ÷(11)} such that x ∘ y ⋄ z = i. In all there are 13 bits of output, interpreted as three 3-bit binary numbers and two 2-bit operators, and the task is to pick an output that evaluates to the given 5-bit binary input under the usual rules: operator precedence, left-right evaluation, integer division, and division by zero fails. 
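To make these evaluation rules concrete, here is a small Python sketch of a reinforcement function for Inverse Arithmetic. The operator codes follow the text; the layout of the 13 output bits (x, first operator, y, second operator, z, most significant first), the helper names, and the +1 and -1 reward values are assumptions made for illustration only.

```python
OPS = {0b00: '+', 0b01: '-', 0b10: '*', 0b11: '/'}


def apply_op(op, a, b):
    '''Apply one operator with integer division; return None if it fails.'''
    if op == '+':
        return a + b
    if op == '-':
        return a - b
    if op == '*':
        return a * b
    if b == 0:
        return None                    # division by zero fails
    return a // b


def evaluate(x, op1, y, op2, z):
    '''Evaluate x op1 y op2 z with the usual precedence, left to right.'''
    high = ('*', '/')
    if op2 in high and op1 not in high:
        # The second operator binds tighter, so y op2 z is computed first.
        right = apply_op(op2, y, z)
        return None if right is None else apply_op(op1, x, right)
    left = apply_op(op1, x, y)
    return None if left is None else apply_op(op2, left, z)


def encode(x, op1, y, op2, z):
    '''Pack three 3-bit numbers and two operators into 13 bits (assumed layout).'''
    codes = {symbol: code for code, symbol in OPS.items()}
    return (x << 10) | (codes[op1] << 8) | (y << 5) | (codes[op2] << 3) | z


def decode(bits):
    '''Inverse of encode: split a 13-bit integer into (x, op1, y, op2, z).'''
    z = bits & 0b111
    op2 = OPS[(bits >> 3) & 0b11]
    y = (bits >> 5) & 0b111
    op1 = OPS[(bits >> 8) & 0b11]
    x = (bits >> 10) & 0b111
    return x, op1, y, op2, z


def reinforcement(i, bits):
    '''Return +1 if the decoded expression evaluates to i, otherwise -1.'''
    return 1 if evaluate(*decode(bits)) == i else -1
```

Under this assumed layout, an output encoding 7+6×4 evaluates to 31 and would be rewarded only for the input 31, while any expression that divides by zero evaluates to None and is always punished.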
\n\nAs shown in Figure 7, CRBP sometimes solves this problem essentially by discovering positional notation, and sometimes produces less-globally structured solutions, particularly as outputs for lower-valued i's, which have a wider range of solutions. \n\n5 CONCLUSIONS \n\nSome basic concepts of supervised learning appear in different guises when the paradigm of reinforcement learning is applied to large output spaces. Rather than a \"learning phase\" followed by a \"generalization test,\" in reinforcement learning the search problem is a generalization test, performed simultaneously with learning. Information is put to work as soon as it is acquired. \n\nThe problem of \"overfitting\" or \"learning the noise\" seems to be less of an issue, since learning stops automatically when consistent success is reached. In experiments not reported here we gradually increased the number of hidden units on the 8-bit copy problem from 8 to 25 without observing the performance decline associated with \"too many free parameters.\" \n\nThe 2^k-attractors (and 2^k-folds, generalizing Excluded Middle) families provide a starter set of sample problems with easily understood and distinctly different extreme cases. \n\nIn degenerate output spaces, generalization decisions can be seen directly in the discovered mapping. Network analysis is not required to \"see how the net does it.\" \n\nThe possibility of ultimately generating useful new knowledge via reinforcement learning algorithms cannot be ruled out. \n\nReferences \n\nAckley, D.H. (1987) A connectionist machine for genetic hillclimbing. Boston, MA: Kluwer Academic Press. \n\nAckley, D.H. (1989) Associative learning via inhibitory search. In D.S. 
Touretzky (ed.), Advances in Neural Information Processing Systems 1, 20-28. San Mateo, CA: Morgan Kaufmann. \n\nAllen, R.B. (1989) Developing agent models with a neural reinforcement technique. IEEE Systems, Man, and Cybernetics Conference. Cambridge, MA. \n\nAnderson, C.W. (1986) Learning and problem solving with multilayer connectionist systems. University of Mass. Ph.D. dissertation. COINS TR 86-50. Amherst, MA. \n\nBarto, A.G. (1985) Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4:229-256. \n\nBarto, A.G., & Anandan, P. (1985) Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15, 360-374. \n\nRumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986) Learning representations by back-propagating errors. Nature, 323, 533-536. \n\nSutton, R.S. (1984) Temporal credit assignment in reinforcement learning. University of Mass. Ph.D. dissertation. COINS TR 84-2. Amherst, MA. \n\nWilliams, R.J. (1988) Toward a theory of reinforcement-learning connectionist systems. College of Computer Science of Northeastern University Technical Report NU-CCS-88-3. Boston, MA. \n\n", "award": [], "sourceid": 208, "authors": [{"given_name": "David", "family_name": "Ackley", "institution": null}, {"given_name": "Michael", "family_name": "Littman", "institution": null}]}