{"title": "Practical Issues in Temporal Difference Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 259, "page_last": 266, "abstract": null, "full_text": "Practical Issues in Temporal Difference Learning\n\nGerald Tesauro\n\nIBM Thomas J. Watson Research Center\n\nP. O. Box 704\n\nYorktown Heights, NY 10598\n\ntesauro@watson.ibm.com\n\nAbstract\n\nThis paper examines whether temporal difference methods for training\nconnectionist networks, such as Sutton's TD(\u03bb) algorithm, can be successfully\napplied to complex real-world problems. A number of important\npractical issues are identified and discussed from a general theoretical perspective.\nThese practical issues are then examined in the context of a case\nstudy in which TD(\u03bb) is applied to learning the game of backgammon\nfrom the outcome of self-play. This is apparently the first application of\nthis algorithm to a complex nontrivial task. It is found that, with zero\nknowledge built in, the network is able to learn from scratch to play the\nentire game at a fairly strong intermediate level of performance, which is\nclearly better than conventional commercial programs, and which in fact\nsurpasses comparable networks trained on a massive human expert data\nset. The hidden units in these networks have apparently discovered useful\nfeatures, a longstanding goal of computer games research. Furthermore,\nwhen a set of hand-crafted features is added to the input representation,\nthe resulting networks reach a near-expert level of performance, and have\nachieved good results against world-class human play.\n\n1 INTRODUCTION\n\nWe consider the prospects for applications of the TD(\u03bb) algorithm for delayed reinforcement\nlearning, proposed in (Sutton, 1988), to complex real-world problems.
\nTD(\u03bb) is an algorithm for adjusting the weights in a connectionist network which\nhas the following form:\n\n$$\\Delta w_t = \\alpha (P_{t+1} - P_t) \\sum_{k=1}^{t} \\lambda^{t-k} \\nabla_w P_k \\qquad (1)$$\n\nwhere P_t is the network's output upon observation of input pattern x_t at time t, w\nis the vector of weights that parameterizes the network, and \u2207_w P_k is the gradient\nof network output with respect to weights. Equation 1 basically couples a temporal\ndifference method for temporal credit assignment with a gradient-descent method\nfor structural credit assignment; thus it provides a way to adapt supervised learning\nprocedures such as back-propagation to solve temporal credit assignment problems.\nThe \u03bb parameter interpolates between two limiting cases: \u03bb = 1 corresponds to an\nexplicit supervised pairing of each input pattern x_t with the final reward signal,\nwhile \u03bb = 0 corresponds to an explicit pairing of x_t with the next prediction P_{t+1}.\nLittle theoretical guidance is available for practical uses of this algorithm. For example,\none of the most important issues in applications of network learning procedures\nis the choice of a good representation scheme. However, the existing theoretical\nanalysis of TD(\u03bb) applies primarily to look-up table representations in which the\nnetwork has enough adjustable parameters to explicitly store the value of every\npossible state in the state space. This will clearly be intractable for real-world\nproblems, and the theoretical results may be completely inappropriate, as they indicate,\nfor example, that every possible state in the state space has to be visited\ninfinitely many times in order to guarantee convergence.\nAnother important class of practical issues has to do with the nature of the task\nbeing learned, e.g., whether it is noisy or deterministic. 
In volatile environments\nwith a high step-to-step variance in expected reward, TD learning is likely to be\ndifficult. This is because the value of P_{t+1}, which is used as a heuristic teacher\nsignal for P_t, may have nothing to do with the true value of the state x_t. In such\ncases it may be necessary to modify TD(\u03bb) by including a lookahead process which\naverages over the step-to-step noise.\nAdditional difficulties must also be expected if the task is a combined prediction-control\ntask, in which the predictor network is used to make control decisions, as\nopposed to a prediction only task. As the network's predictions change, its control\nstrategies also change, and this changes the target predictions that the network is\ntrying to learn. In this case, theory does not say whether the combined learning\nsystem would converge at all, and if so, whether it would converge to the optimal\npredictor-controller. It might be possible for the system to get stuck in a self-consistent\nbut non-optimal predictor-controller.\n\nA final set of practical issues are algorithmic in nature, such as convergence, scaling,\nand the possibility of overtraining or overfitting. TD(\u03bb) has been proven to converge\nonly for a linear network and a linearly independent set of input patterns (Sutton,\n1988; Dayan, 1992). In the more general case, the algorithm may not converge even\nto a locally optimal solution, let alone to a globally optimal solution.\nRegarding scaling, no results are available to indicate how the speed and quality\nof TD learning will scale with the temporal length of sequences to be learned, the\ndimensionality of the input space, the complexity of the task, or the size of the\nnetwork. 
Intuitively it seems likely that the required training time might increase\ndramatically with the sequence length. The training time might also scale poorly\nwith the network or input space dimension, e.g., due to increased sensitivity to\nnoise in the teacher signal. Another potential problem is that the quality of solution\nfound by gradient-descent learning relative to the globally optimal solution might\nget progressively worse with increasing network size.\nOvertraining occurs when continued training of the network results in poorer performance.\nOverfitting occurs when a larger network does not do as well on a task\nas a smaller network. In supervised learning, both of these problems are believed\nto be due to a limited data set. In the TD approach, training takes place on-line\nusing patterns generated de novo, thus one might hope that these problems would\nnot occur. But both overtraining and overfitting may occur if the error function\nminimized during training does not correspond to the performance function that\nthe user cares about. For example, in a combined prediction-control task, the user\nmay care only about the quality of control signals, not the absolute accuracy of the\npredictions.\n\n2 A CASE STUDY: TD LEARNING OF BACKGAMMON STRATEGY\n\nWe have seen that existing theory provides little indication of how TD(\u03bb) will behave\nin practical applications. In the absence of theory, we now examine empirically the\nabove-mentioned issues in the context of a specific application: learning to play the\ngame of backgammon from the outcome of self-play. 
This application was selected\nbecause of its complexity and stochastic nature, and because a detailed comparison\ncan be made with the alternative approach of supervised learning from human\nexpert examples (Tesauro, 1989; Tesauro, 1990).\nIt seems reasonable that, by watching two fixed opponents play out a large number\nof games, a network could learn by TD methods to predict the expected outcome of\nany given board position. However, the experiments presented here study the more\ninteresting question of whether a network can learn from its own play. The learning\nsystem is set up as follows: the network observes a sequence of board positions\nx_1, x_2, ..., x_f leading to a final reward signal z. In the simplest case, z = 1 if White\nwins and z = 0 if Black wins. In this case the network's output P_t is an estimate\nof White's probability of winning from board position x_t. The sequence of board\npositions is generated by setting up an initial configuration, and making plays for\nboth sides using the network's output as an evaluation function. In other words,\nthe move which is selected at each time step is the move which maximizes P_t when\nWhite is to play and minimizes P_t when Black is to play.\nThe representation scheme used here contained only a simple encoding of the \"raw\"\nboard description (explained in detail in figure 2), and did not utilize any additional\npre-computed \"features\" relevant to good play. Since the input encoding scheme\ncontains no built-in knowledge about useful features, and since the network only\nobserves its own play, we may say that this is a \"knowledge-free\" approach to\nlearning backgammon. 
While it's not clear that this approach can make any progress\nbeyond a random starting state, it at least provides a baseline for judging other\napproaches using various forms of built-in knowledge.\n\nThe approach described above is similar in spirit to Samuel's scheme for learning\ncheckers from self-play (Samuel, 1959), but in several ways it is a more challenging\nlearning task. Unlike the raw board description used here, Samuel's board description\nused a number of hand-crafted features which were designed in consultation\nwith human checkers experts. The evaluation function learned in Samuel's study\nwas a linear function of the input variables, whereas multilayer networks learn more\ncomplex nonlinear functions. Finally, Samuel found that it was necessary to give\nthe learning system at least one fixed intermediate goal, material advantage, as well\nas the ultimate goal of the game. The proposed backgammon learning system has\nno such intermediate goals.\n\nThe networks had a feedforward fully-connected architecture with either no hidden\nunits, or a single hidden layer with between 10 and 40 hidden units. The learning\nalgorithm parameters were set, after a certain amount of parameter tuning, at\n\u03b1 = 0.1 and \u03bb = 0.7.\nThe average sequence length appeared to depend strongly on the quality of play.\nWith decent play on both sides, the average game length is about 50-60 time steps,\nwhereas for the random initial networks, games often last several hundred or even\nseveral thousand time steps. This is one of the reasons why the proposed self-learning\nscheme appeared unlikely to work.\n\nLearning was assessed primarily by testing the networks in actual game play against\nSun Microsystems' Gammontool program. 
Gammontool is representative of the\nplaying ability of typical commercial programs, and provides a decent benchmark\nfor measuring game-playing strength: human beginners can win about 40% of the\ntime against it, decent intermediate-level humans would win about 60%, and human\nexperts would win about 75%. (The random initial networks before training win\nonly about 1%.)\nNetworks were trained on the entire game, starting from the opening position and\ngoing all the way to the end. This is an admittedly naive approach which was not\nexpected to yield any useful results other than a reference point for judging more\nsensible approaches. However, the rather surprising result was that a significant\namount of learning actually took place. Results are shown in figure 1. For comparison\npurposes, networks with the same input coding scheme were also trained on\na massive human expert data base of over 15,000 engaged positions, following the\ntraining procedure described in (Tesauro, 1989). These networks were also tested\nin game play against Gammontool.\nGiven the complexity of the task, size of input space and length of typical sequences,\nit seems remarkable that the TD nets can learn on their own to play at a level\nsubstantially better than Gammontool. Perhaps even more remarkable is that the\nTD nets surpass the EP nets trained on a massive human expert data base: the\nbest TD net won 66.2% against Gammontool, whereas the best EP net could only\nmanage 59.4%. This was confirmed in a head-to-head test in which the best TD\nnet played 10,000 games against the best EP net. The result was 55% to 45%\nin favor of the TD net. 
This confirms that the Gammontool benchmark gives a\nreasonably accurate measure of relative game-playing strength, and that the TD\nnet really is better than the EP net. In fact, the TD net with no features appears\nto be as good as Neurogammon 1.0, backgammon champion of the 1989 Computer\nOlympiad, which does have features, and which wins 65% against Gammontool. A\n10,000 game test of the best TD net against Neurogammon 1.0 yielded statistical\nequality: 50% for the TD net and 50% for Neurogammon.\n\n[Figure 1 plot: fraction of games won (vertical axis) vs. number of hidden units (horizontal axis).]\n\nFigure 1: Plot of game performance against Gammontool vs. number of hidden\nunits for networks trained using TD learning from self-play (TD), and supervised\ntraining on human expert preferences (EP). Each data point represents the result\nof a 10,000 game test, and should be accurate to within one percentage point.\n\nIt is also of interest to examine the weights learned by the TD nets, shown in figure 2.\nOne can see a great deal of spatially organized structure in the pattern of\nweights, and some of this structure can be interpreted as useful features by a knowledgeable\nbackgammon player. For example, the first hidden unit in figure 2 appears\nto be a race-oriented feature detector, while the second hidden unit appears to be\nan attack-oriented feature detector. The TD net has apparently solved the longstanding\n\"feature discovery\" problem, which was recently stated in (Frey, 1986) as\nfollows: \"Samuel was disappointed in his inability to develop a mechanical strategy\nfor defining features. 
He thought that true machine learning should include the\ndiscovery and definition of features. Unfortunately, no one has found a practical\nway to do this even though more than two and a half decades have passed.\"\nThe training times needed to reach the levels of performance shown in figure 1 were\non the order of 50,000 training games for the networks with 0 and 10 hidden units,\n100,000 games for the 20-hidden unit net, and 200,000 games for the 40-hidden\nunit net. Since the number of training games appears to scale roughly linearly with\nthe number of weights in the network, and the CPU simulation time per game on\na serial computer also scales linearly with the number of weights, the total CPU\ntime thus scales quadratically with the number of weights: on an IBM RS/6000\nworkstation, the smallest network was trained in several hours, while the largest\nnet required two weeks of simulation time.\n\n[Figure 2 diagram: two weight maps, one per hidden unit, with rows numbered 1-24 and columns grouped under Black pieces and White pieces.]\n\nFigure 2: Weights from the input units to two hidden units in the best TD net.\nBlack squares represent negative weights; white squares represent positive weights;\nsize indicates magnitude of weights. Rows represent spatial locations 1-24, top row\nrepresents no. of barmen, men off, and side to move. Columns represent number\nof Black and White men as indicated. The first hidden unit has two noteworthy\nfeatures: a linearly increasing pattern of negative weights for Black blots and Black\npoints, and a negative weighting of White men off and a positive weighting of Black\nmen off. These contribute to an estimate of Black's probability of winning based\non his racing lead. The second hidden unit has the following noteworthy features:\nstrong positive weights for Black home board points, strong positive weights for\nWhite men on bar, positive weights for White blots, and negative weights for White\npoints in Black's home board. These factors all contribute to the probability of a\nsuccessful Black attacking strategy.\n\nIn qualitative terms, the TD nets have developed a style of play emphasizing running\nand tactical play, whereas the EP nets favor more quiescent positional play\nemphasizing blocking rather than racing. This is more in line with human expert\nplay, but it often leads to complex prime vs. prime and back-game situations that\nare hard for the network to evaluate properly. This suggests one possible advantage\nof the TD approach over the EP approach: by imitating an expert teacher,\nthe learner may get itself into situations that it can't handle. With the alternative\napproach of learning from experience, the learner may not reproduce the expert\nstrategies, but at least it will learn to handle whatever situations are brought about\nby its own strategy.\n\nIt's also interesting that the TD net plays well in early phases of play, whereas its play\nbecomes worse in the late phases of the game. This is contrary to the intuitive notion\nthat states far from the end of the sequence should be harder to learn than states\nnear the end. Apparently the inductive bias due to the representation scheme and\nnetwork architecture is more important than temporal distance to the final outcome.\n\n3 TD LEARNING WITH BUILT-IN FEATURES\n\nWe have seen that TD networks with no built-in knowledge are able to reach computer\nchampionship levels of performance for this particular application. 
It is then\nnatural to wonder whether even greater levels of performance might be obtained\nby adding hand-crafted features to the input representation. In a separate series of\nexperiments, TD nets containing all of Neurogammon's features were trained from\nself-play as described in the previous section. Once again it was found that the performance\nimproved monotonically by adding more hidden units to the network, and\ntraining for longer training times. The best performance was obtained with a network\ncontaining 80 hidden units and over 25,000 weights. This network was trained\nfor over 300,000 training games, taking over a month of CPU time on an RS/6000\nworkstation. The resulting level of performance was 73% against Gammontool and\nnearly 60% against Neurogammon. This is very close to a human expert level of\nperformance, and is the strongest program ever seen by this author.\nThe level of play of this network was also tested in an all-day match against two-time\nWorld Champion Bill Robertie, one of the world's best backgammon players.\nAt the end of the match, a total of 31 games had been played, of which Robertie won\n18 and the TD net 13. This showed that the TD net was capable of a respectable\nshowing against world-class human play. In fact, Robertie thinks that the network's\nlevel of play is equal to the average good human tournament player.\nIt's interesting to speculate about how far this approach can be carried. Further\nsubstantial improvements might be obtained by training much larger networks on\na supercomputer or special-purpose hardware. On such a machine one could also\nsearch beyond one ply, and there is some evidence that small-to-moderate improvements\ncould be obtained by running the network in two-ply search mode. 
Finally,\nthe features in Berliner's BKG program (Berliner, 1980) or in some of the top commercial\nprograms are probably more sophisticated than Neurogammon's relatively\nsimple features, and hence might give better performance. The combination of all\nthree improvements (bigger nets, two-ply search, better features) could conceivably\nresult in a network capable of playing at world-class level.\n\n4 CONCLUSIONS\n\nThe experiments in this paper were designed to test whether TD(\u03bb) could be successfully\napplied to complex, stochastic, nonlinear, real-world prediction and control\nproblems. This cannot be addressed within current theory because it cannot answer\nsuch basic questions as whether the algorithm converges or how it would scale.\nGiven the lack of any theoretical guarantees, the results of these experiments are\nvery encouraging. Empirically the algorithm always converges to at least a local\nminimum, and the quality of solution generally improves with increasing network\nsize. Furthermore, the scaling of training time with the length of input sequences,\nand with the size and complexity of the task, does not appear to be a serious problem.\nThis was ascertained through studies of simplified endgame situations, which\ntook about as many training games to learn as the full-game situation (Tesauro,\n1992). Finally, the network's move selection ability is better than one would expect\nbased on its prediction accuracy. The absolute prediction accuracy is only at\nthe 10% level, whereas the difference in expected outcome between optimal and\nnon-optimal moves is usually at the level of 1% or less. 
\nThe most encouraging finding, however, is a clear demonstration that TD nets with\nzero built-in knowledge can outperform identical networks trained on a massive\ndata base of expert examples. It would be nice to understand exactly how this is\npossible. The ability of TD nets to discover features on their own may also be of\nsome general importance in computer games research, and thus worthy of further\nanalysis.\nBeyond this particular application area, however, the larger and more important\nissue is whether learning from experience can be useful and practical for more\ngeneral complex problems. The quality of results obtained in this study indicates\nthat the approach may work well in practice. There may also be some intrinsic\nadvantages over supervised training on a fixed data set. At the very least, for tasks\nin which the exemplars are hand-labeled by humans, it eliminates the laborious\nand time-consuming process of labeling the data. Furthermore, the learning system\nwould not be fundamentally limited by the quantity of labeled data, or by errors\nin the labeling process. Finally, preserving the intrinsic temporal nature of the\ntask, and informing the system of the consequences of its actions, may convey\nimportant information about the task which is not necessarily contained in the\nlabeled exemplars. More theoretical and empirical work will be needed to establish\nthe relative advantages and disadvantages of the two approaches; this could result\nin the development of hybrid algorithms combining the best of both approaches.\n\nReferences\n\nH. Berliner, \"Computer backgammon.\" Sci. Am. 243:1, 64-72 (1980).\nP. Dayan, \"Temporal differences: TD(\u03bb) for general \u03bb.\" Machine Learning, in press\n(1992).\nP. W. 
Frey, \"Algorithmic strategies for improving the performance of game playing\nprograms.\" In: D. Farmer et al. (Eds.), Evolution, Games and Learning. Amsterdam:\nNorth Holland (1986).\nA. Samuel, \"Some studies in machine learning using the game of checkers.\" IBM\nJ. of Research and Development 3, 210-229 (1959).\nR. S. Sutton, \"Learning to predict by the methods of temporal differences.\" Machine\nLearning 3, 9-44 (1988).\nG. Tesauro and T. J. Sejnowski, \"A parallel network that learns to play backgammon.\"\nArtificial Intelligence 39, 357-390 (1989).\nG. Tesauro, \"Connectionist learning of expert preferences by comparison training.\"\nIn D. Touretzky (Ed.), Advances in Neural Information Processing 1, 99-106 (1989).\nG. Tesauro, \"Neurogammon: a neural network backgammon program.\" IJCNN\nProceedings III, 33-39 (1990).\nG. Tesauro, \"Practical issues in temporal difference learning.\" Machine Learning,\nin press (1992).\n", "award": [], "sourceid": 465, "authors": [{"given_name": "Gerald", "family_name": "Tesauro", "institution": null}]}