{"title": "Reinforcement Learning by Probability Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 1080, "page_last": 1086, "abstract": null, "full_text": "Reinforcement  Learning by Probability \n\nMatching \n\nPhilip N.  Sabes \n\nMichael I.  Jordan \n\nsabes~psyche.mit.edu \n\njordan~psyche.mit.edu \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts  Institute of Technology \n\nCambridge, MA  02139 \n\nAbstract \n\nWe  present  a  new  algorithm  for  associative  reinforcement  learn(cid:173)\ning.  The algorithm is based upon the idea of matching a network's \noutput probability with a probability distribution derived from the \nenvironment's reward signal.  This Probability Matching algorithm \nis shown  to perform faster  and  be  less  susceptible to local  minima \nthan  previously  existing  algorithms.  We  use  Probability  Match(cid:173)\ning to train mixture of experts networks,  an architecture for  which \nother reinforcement  learning rules fail  to converge reliably on even \nsimple  problems.  This  architecture  is  particularly  well  suited  for \nour  algorithm as it can compute arbitrarily complex functions  yet \ncalculation of the output probability is  simple. \n\n1 \n\nINTRODUCTION \n\nThe  problem  of learning  associative  networks  from  scalar  reinforcement  signals  is \nnotoriously  difficult .  Although  general  purpose  algorithms such  as  REINFORCE \n(Williams, 1992) and Generalized Learning Automata (Phansalkar, 1991) exist, they \nare  generally slow  and  have  trouble  with  local  minima.  As  an example,  when  we \nattempted to apply these  algorithms to mixture of experts networks (Jacobs et al. , \n1991), the  algorithms  typically  converged  to  the  local  minimum which  places  the \nentire  burden of the task on one expert. \n\nHere  we  present  a new  reinforcement learning algorithm which has faster and more \nreliable convergence properties than previous algorithms.  The next section describes \nthe  algorithm  and  draws  comparisons  between  it  and  existing  algorithms.  The \nfollowing section details its application to Gaussian units and mixtures of Gaussian \nexperts.  Finally, we  present  empirical results. \n\n\fReinforcement  Learning  by  Probability  Matching \n\n1081 \n\n2  REINFORCEMENT  PROBABILITY MATCHING \n\nWe  begin  by  formalizing  the  learning  problem.  Given  an  input  x  E  X  from  the \nenvironment, the network must select  an output y  E y.  The network  then receives \na  scalar  reward  signal  r,  with  a  mean  r  and  distribution  that  depend  on  x  and \ny.  The goal  of the  learner  is  to  choose  an  output  which  maximizes the expected \nreward.  Due  to  the  lack  of an  explicit  error  signal,  the  learner  must  choose  its \noutput stochastically, exploring for better rewards.  Typically the learner starts with \na  parameterized form for  the  conditional output density P8(ylx),  and  the  learning \nproblem  becomes  one  of finding  the  parameters  0  which  maximize  the  expected \nreward: \n\nJr(O)  = 1 p(x)p8(ylx)r(x, y)dydx. \n\nX,Y \n\nWe  present  an  alternative  route  to  the  maximum expected  reward  cost  function, \nand in doing so  derive  a  novel learning rule for  updating the network's parameters. \nThe learner's task is to choose  from  a set of conditional output distributions based \non the reward it receives from the environment. These rewards can be thought of as \ninverse  energies;  input/output  pairs that  receive  high  rewards  are  low  energy  and \nare preferred  by  the environment.  Energies can always be converted  into probabil(cid:173)\nities  through  the  Boltzmann distribution,  and  so  we  can  define  the environment's \nconditional distribution on Y given  x, \n\n*(  I)  exp( _T- 1 E(x, y\u00bb \npyx  = \n\nZT(X) \n\n= \n\nexp(T-1r(x, y\u00bb \n, \n\nZT(X) \n\nwhere  T  is  a  temperature parameter  and ZT(X)  is  a  normalization constant  which \ndepends on T . This distribution can be thought of as representing the environment's \nideal  input-output mapping, high reward  input-output pairs  being more typical or \nlikely than low reward pairs.  The temperature parameter determines the strength of \nthis preference:  when T  is infinity all outputs are equally likely; when T  is zero only \nthe  highest  reward  output  is  chosen.  This new  distribution  is  a  purely  theoretical \nconstruct,  but  it  can  be  used  as  a  target  distribution for  the  learner.  If the  0  are \nadjusted so  that P8(ylx) is  nearly equal to p*(ylx),  then the network's output will \ntypically result  in high rewards. \n\nThe agreement  between  the network  and environment conditional output densities \ncan be measured with the Kullback-Liebler  (KL)  divergence: \n\nK L(p II  P*)  = -1  p(x)p8(ylx) [logp*(Ylx) -logp8(ylx)] dydx \n\nX,Y \n\n=  -~ 1 p(x)p8(ylx)[r(x,y) - Tr8(X,y)]dydx+  f  p(x)logZT(x)dx, \n\n(1) \n\nX,Y \n\nJx \n\nwhere  r8(x, y) is defined  as  the logarithm of the conditional output probability and \ncan be thought of as the network's estimate of the mean reward.  This cost function \nis always greater than or equal to zero,  with equality only when the two probability \ndistributions are identical. \nKeeping only the part of Equation 1 which  depends on 0,  we  define  the Probability \nMatching (PM)  cost function: \n\nJpM(O)  = - f \n\np(x)p8(ylx)[r(x, y) - Tr8(X, y)] dydx = -Jr(O) - TS(P8) \n\nJx,y \n\nThe  PM  cost  function  is  analogous  to  a  free  energy,  balancing  the  energy,  in  the \nform  of the  negative  of the  average  reward,  and  the  entropy  S(P8)  of the  output \n\n\f1082 \n\nP. N.  SABES, M. I. JORDAN \n\nT=l \n\nT=.5 \n\n-1 \n\nT= .2 \n\n-1 \n\n-0.5 \n\n0 \n\n0.5 \n\nT= .05 \n\nFigure  1:  p*'s  (dashed)  and  PM  optimal  Gaussians  (solid)  for  the same  bimodal  reward \nfunction  and  various  temperatures.  Note  the differences  in scale. \n\ndistribution.  A higher T  corresponds to a smoother target distribution and tilts the \nbalance of the cost function in favor of the entropy term, making diffuse output dis(cid:173)\ntributions more favorable.  Likewise,  a small T  results in a sharp target distribution \nplacing most of the weight on the reward  dependent  term of cost function,  which  is \nalways optimized by  the singular solution of a spike  at the highest  reward  output. \n\nAlthough  minimizing  the  PM  cost  function  will  result  in  sampling most  often  at \nhigh  reward  outputs,  it  will  not  optimize  the  overall  expected  reward  if T  >  O. \nThere  are  two  reasons  for  this.  First,  the output y  which  maximizes ro(x, y) may \nnot  maximize  rex, y).  Such  an  example  is  seen  in  the  first  panel  of Figure  1: \nthe  network's  conditional output  density  is  a  Gaussian  with  adjustable mean  and \nvariance,  and  the environment has a  bimodal reward  function  and  T  = 1.  Even  in \nthe realizable case,  however,  the network will  choose  outputs which are suboptimal \nwith respect  to its own predicted reward,  with the probability of choosing output y \nfalling off exponentially with ro(x, y).  The key  point here  is  that early  in learning \nthis  non-optimality is  exactly  what  is  desired.  The  PM  cost  function  forces  the \nlearner to maintain output density everywhere the reward,  as  measure by p*l/T, is \nnot much smaller than  its maximum.  When  T  is  high,  the  rewards  are  effectively \nflattened and even fairly small rewards look big.  This means that a high temperature \nensures  that the learner will explore the output space. \n\nOnce the network is nearly PM optimal, it would be advantageous to  \"sharpen up\" \nthe  conditional  output  distribution,  sampling more  often  at  outputs  with  higher \npredicted  rewards.  This  translates  to  decreasing  the  entropy  of the  output  distri(cid:173)\nbution or lowering T.  Figure 1 shows how the PM optimal Gaussian changes as the \ntemperature  is  lowered  in  the example discussed  above;  at  very  low  temperatures \nthe output  is  almost always near  the mode of the target  distribution.  In  the  limit \nof T  =  0,  J PM  becomes original reward  maximization criterion  Jr.  The idea of the \nProbability Matching algorithm is  to begin training with a  large T,  say  unity,  and \ngradually  decrease  it  as  the  performance  improves,  effectively  shifting  the  bias  of \nthe learner from exploration to exploitation. \n\nWe  now  must find  an  update  rule for  0  which  minimizes  JpM(O).  We  proceed  by \nlooking for a stochastic gradient descent step.  Differentiating the cost function gives \n\n\\T OJpM(O)  = -1  p(x)po(Ylx) [rex, y) - Tro(x, y)] \\T oro(x, y)dydx. \n\nX,Y \n\nThus,  if after every  action the  parameters are  updated by  the step \n\nt:.o =  a [r - Tro(x, y)] \\T oro (x, y), \n\n(2) \n\nwhere  alpha is  a  constant  which  can  vary  over  time,  then  the  parameters  will  on \naverage  move  down  the gradient of the  PM  cost  function.  Note  that  any  quantity \n\n\fReinforcement  Learning  by  Probability  Matching \n\n1083 \n\nwhich  does  not depend  on Y or r  can be  added to the difference  in the update rule, \nand  the expected  step  will still point  along the direction  of the gradient. \n\nThe form of Equation 2 is similar to the REINFORCE algorithm (Williams, 1992), \nwhose  update rule is \n\nt:.() = a(r - b)V' elogpe(Ylx), \n\nwhere  b,  the reinforcement baseline,  is  a quantity which  does not depend  on Y or r. \nNote that these two update rules are identical when T  is zero.!  The advantage of the \nPM  rule  is  that it allows for  an  early  training  phase  which  encourages  exploration \nwithout  forcing  the  output  distribution  to  converge  on  suboptimal outputs.  This \nwill lead  to striking qualitative differences  in the performance of the  algorithm for \ntraining mixtures of Gaussian experts. \n\n3  UPDATE  RULES  FOR GAUSSIAN  UNITS  AND \n\nMIXTURES  OF  GAUSSIAN  EXPERTS \n\nWe  employ Gaussian  units  with  mean I' =  w T x  and  covariance  0\"21.  The learner \nmust  select  the  matrix wand scalar  0\"  which  minimize  JpM(W, 0\").  Applying  the \nupdate rule in Equation 2,  we  get \n\nt:.w \n\n\" \n\n1 \n0\" \n\na[r - Tr(x,y)] 2\"(Y -I'?x \n\n\"  1 (\"Y -I'W \n\n0\"2 \n\na [r - Tr(x, y)]  0\"2 \n\n) \n- 1 \n\n. \n\nIn  practice,  for  both  single  Gaussian  units  and  the  mixtures  presented  below  we \navoid  the issue  of constraining  0\"  > 0 by updating log 0\"  directly. \nWe  can generalize the linear model by considering a conditional output distribution \nin the form of a  mixture of Gaussian experts  (Jacobs et  al.,  1991), \n\np(Ylx) = Lgi(x)(27r0\"1)-~ exp(--2I1y -l'iW)\u00b7 \n\n1 \n\nN \n\ni=! \n\n1 \n\n20\"i \n\nExpert  i  has mean I'i = w r x  and covariance  0\"[1.  The prior  probability given x  of \n\nchoosing expert  i, gi(X),  is  determined  by  a single layer gating network with weight \nmatrix v  and  softmax output  units.  The gating network  learns  a  soft  partitioning \nof the input space  into regions for  which  each expert  is  responsible. \nAgain,  we  can  apply  Equation 2  to get  the PM update rules: \n\nt:.Vi \n\nt:.Wi \n\n6.O\"i \n\na [r - Tf(x,y)] (hi  - gi)X \na [r - Tf(x, y)] hi~(Y -\na[r-Tf(x,Y)]hi : 1 (\"Y~riW -1), \n\nl'i?X \n\nO\"i \n\nwhere  hi = giPi(ylx)jp(ylx) is  the posterior  probability of choosing expert  i  given \ny.  We  note  that  the  PM  update  rules  are  equivalent  to  the  supervised  learning \ngradient  descent  update rules  in  (Jacobs et  al.,  1991)  modulated by  the  difference \nbetween  the  actual and expected  rewards. \n\nlThis fact implies that the REINFORCE step is in the direction of the gradient of JR(B), \nas shown  by  (Williams,  1992).  See  Williams  and  Peng,  1991,  for  a  similar  REINFORCE \nplus entropy  update  rule. \n\n\f1084 \n\nP. N. SABES, M. I. JORDAN \n\nTable  1:  Convergence  times  and  gate  entropies  for  the linear  example  (standard  errors \nin  parentheses).  Convergence  times:  An  experiment  consisting  of 50  runs  was conducted \nfor  each  algorithm,  with  a  wide  range  of learning  rates and  both  reward  functions.  Best \nresults for  each  algorithm  are  reported.  Entropy:  Values  are averages  over  the last  5,000 \ntime steps of each  run.  20  runs of 50,000 time steps were conducted. \nI Convergence  Time  I  Entropy \n.0011 \n.02 \n.04 \n.03 \n.03 \n\nAlgorithm \nPM,  T= 1 \nPM, T=.5 \nPM, T  =.1 \nREINFORCE \nREINF-COMP \n\n.993 \n.97 \n.48 \n.21 \n.21 \n\n1088  (43) \n\n-\n-\n\n2998  (183) \n1622  (46) \n\nBoth  the  hi  and r depend  on  the  overall  conditional  probability p(ylx), which  in \nturn  depends  on each Pi(ylx).  This adds  an extra step  to the  training procedure. \nAfter receiving the input x, the network chooses an expert based on the priors gi(X) \nand an output y  from the selected  expert's output distribution.  The output is then \n.  sent  back  to each  of the experts in order to compute the likelihood of their having \ngenerated it.  Given the set of Pi'S,  the network can update its parameters as  above. \n\n4  SIMULATIONS \n\nWe  present  three  examples  designed  to  explore  the  behavior  of the  Probability \nMatching algorithm.  In each  case,  networks  were  trained using  Probability Match(cid:173)\ning,  REINFORCE,  and  REINFORCE  with  reinforcement  comparison  (REINF(cid:173)\nCOMP),  where  a  running  average  of the  reward  is  used  as  a  reinforcement  base(cid:173)\nline  (Sutton,  1984).  In  the  first  two  examples  an  optimal output  function  y*(x) \nwas  chosen  and used  to calculate  a  noisy  error,  c =  Ily - y*(x) - zll,  where  z was \ni.i.d.  zero-mean  Gaussian  with  u  =  .1.  The error  signal  determined  the  reward \nby one  of two functions,  r  = -c2/2 or  exp( _c2 /2).  When the  RMSE  between  the \nnetwork  mean  and  the  optimal output  was  less  that  .05  the  network  was  said  to \nhave converged. \n\n4.1  A  Linear Example \n\nIn this example x  was chosen uniformly from  [-1,1]2 x {I}, and the optimal output \nwas y* = Ax, for a 2 x 3 matrix A.  A mixture of three Gaussian experts was trained. \nThe details  of the  simulation and results  for  each  algorithm are shown  in Table  1. \nProbability  Matching  with  constant  T  = 1  shows  almost  a  threefold  reduction \nin  training  time  compared  to  REINFORCE  and  about  a  50%  improvement  over \nREINF-COMP. \n\nThe  important point  of this  example  is  the  manner  in  which  the  extra  Gaussian \nunits were employed.  We  calculated the entropy of the gating network,  normalized \nso that a  value of one means that each expert has equal probability of being chosen \nand  a  value  of zero  means  that  only  one  expert  is  ever  chosen.  The  values  after \n50,000  time steps  are  shown  in  the second  column of Table  1.  When  T  ~ 1,  the \nProbability Matching algorithm gives  the three experts  roughly equal  priors.  This \nis due to the fact  that small differences  in the experts' parameters lead to increased \noutput  entropy  if all  experts  are  used.  REINFORCE  on  the  other  hand  always \nconverges  to  a  solution  which  employs  only  one  expert.  This  difference  in  the \nbehavior of the algorithms will have  a  large qualitative effect  in the next  example. \n\n\fReinforcement  Learning  by  Probability  Matching \n\nJ085 \n\nTable  2:  Results  for  absolute  value.  The  percentage  of  trials  that  converged  and  the \naverage  time to convergence  for  those trials.  Standard errors are in parentheses.  50  trials \nwere  conducted  for  a  range  of learning  rates  and  with  both  reward  functions;  the  best \nresults for  each  algorithm  are shown. \n\nAlgorithm \n\nPM \n\nREINFORCE \nREINF-COMP \n\nI Successful  Trials  I Convergence Time  I \n\n100% \n48% \n38% \n\n6,052  313) \n\n76,775  3,329) \n42,105  3,869) \n\n110 \n\n100 \n\n10 \n\n60 \n\n.0 \n\n10 \n\n8.0 \n\n2.0 \n\n0 .0 \n\n-2.0 \n\n(a) \n\n(b) \n\n(c) \n\n(d) \n\n0.0 \n\n1.0  1 .0 \n\n'.0 \n\n' .0 \n\n' .0 \n\n0.0 \n\n1.0  2.0 \n\n'.0 \n\n'.0 \n\n'.0 \n\nFigure 2:  Example 4.3.  The environment's  probability  distribution  for T  =  1:  (a) density \nplot  of p.  vs.  y / x,  (b)  cross-sectional  view  with  Y2  =  o.  Locally  weighted  mean  and \nvariance of Y2/X  over  representative  runs:  (c)  T  = 1,  (d) T  = 0  (i.e.  REINFORCE). \n\n4.2  Absolute Value \n\nWe  used  a  mixture  of two  Gaussian  units  to  learn  the  absolute  value  function. \nThe details of the simulation and  the best  results for  each  algorithm are  shown in \nTable 2.  Probability Matching with constant T  =  1 converged  to criterion on every \ntrial,  in  marked  contrast  to  the  REINFORCE  algorithm.  With  no  reinforcement \nbaseline,  REINFORCE converged  to criterion  in only  about  half of the  cases,  less \nwith reinforcement comparison.  In almost all of the trials that didn't converge,  only \none expert was active on the domain of the input.  Neither version of REINFORCE \never  converged  to criterion in less  than 14,000 time steps. \n\nThis example highlights the advantage of the Probability Matching algorithm.  Dur(cid:173)\ning  training,  all  three  algorithms initially use  both experts  to capture  the  overall \nmean  of the  data.  REINFORCE  converges  on  this  local  minimum,  cutting  one \nexpert  off  before  it has  a  chance  to explore  the  rest  of the  parameter space.  The \nProbability Matching algorithm keeps  both experts  in use.  Here,  the more conser(cid:173)\nvative approach  leads to a stark improvement in performance. \n\n4.3  An Example with Many Local Maxima \n\nIn this example, the learner's conditional output distribution was a bivariate Gaus(cid:173)\nsian with  It =  [Wl, W2]T x,  and  the environment's rewards  were  a function  of y/x. \nThe optimal output distribution p*(y/x) is  shown  in  Figures 2(a,b).  These figures \ncan also be interpreted as the expected value of p*  for a given w.  The weight vector \nis  initially chosen  from  a  uniform distribution over  [-.2, .2]2,  depicted  as  the  very \nsmall while dot in Figure 2(a).  There are  a series of larger and larger local maxima \noff to the right,  with  a peak of height 2n  at Wl  =  2n. \n\nThe results  are shown in Table 3.  REINFORCE,  both with  and  without reinforce(cid:173)\nment comparison, never got past third peak; the variance of the Gaussian unit would \n\n\f1086 \n\nP. N. SABES, M. I. JORDAN \n\nTable 3:  Results  for  Example  4.3.  These  values  represent  20  runs for  50,000  time  steps \neach.  The first  and second columns correspond to number of the peak the learner reached. \n\nMean  Final  Range of Final  Mean  Final \n\nAlgorithm \nPM, T= 2 \nPM, T  = 1 \nPM, T=.5 \nREINFORCE \nREINF-COMP \n\nlog2  Wl \n\n28,8 \n6.34 \n3.06 \n2.17 \n2.05 \n\nlog2  Wl'S \n[19.1,51.0] \n5.09,8.08 \n3.04,3.07 \n2.00,2.90 \n2.05,2.06 \n\n(T \n\n> 101> \n13.1 \n.40 \n.019 \n.18 \n\nvery  quickly  close  down  to  a small value making further  exploration of the  output \nspace impossible.  Probability Matching, on the other hand, was able to find greater \nand greater maxima, with the variance growing adaptively to match the local scale \nof the  reward  function.  These  differences  can  be  clearly  seen  in  Figures  2( c,d), \nwhich  show  typical  behavior  for  the  Probability  Matching  algorithm with T  =  1 \nand T = O. \n\n5  CONCLUSION \n\nWe  have presented a new  reinforcement learning algorithm for  associative networks \nwhich converges faster  and more reliably than existing algorithms.  The strength of \nthe  Probability  Matching  algorithm is  that  it  allows  for  a  better  balance  between \nexploration of the output space  and and exploitation of good outputs.  The param(cid:173)\neter T  can be  adjusted  during learning to allow  broader  output distributions early \nin training and then to force  the network to sharpen up its distribution once nearly \noptimal parameters have  been  found. \nAlthough  the  applications  in  this  paper  were  restricted  to  networks  with  Gaus(cid:173)\nsian units, the Probability Matching algorithm can be applied to any reinforcement \nlearning task and any conditional output distribution.  It could easily be employed, \nfor example, on classification problems using logistic or multinomial (softmax) out(cid:173)\nput units or mixtures of such units.  Finally, the simulations presented in this paper \nare  of simple  examples.  Preliminary  results  indicate  that  the  advantages  of the \nProbability Matching  algorithm scale  up to larger,  more interesting  problems. \n\nReferences \n\nJacobs,  R.  A.,  Jordan,  M.  I.,  Nowlan,  S.  J.,  and  Hinton,  G.  E.  (1991).  Adaptive \n\nmixtures of local experts.  Neural  Computation,  3:79-87. \n\nPhansalkar, V.  V.  (1991).  Learning  automata  algorithms for  connectionist  systems \n- local  and  global  convergence.  PhD Thesis,  Dept.  of Electrical  Engineering, \nIndia Institute of Science,  Bangalore. \n\nSutton,  R.  S.  (1984). \n\nTemporal  credit  assignment  in  reinforcement  learning. \n\nPhD Thesis,  Dept.  of Computer and  Information Science,  University  of Mas(cid:173)\nsachusetts,  Amherst,  MA. \n\nWilliams, R. J.  (1992) .  Simple statistical gradient-following algorithms for  connec(cid:173)\n\ntionist reinforcement  learning.  Machine  Learning,  8:229-256. \n\nWilliams,  R.  J.  and  Peng,  J.  (1991).  Function  optimization  using  connectionist \n\nreinforcement  learning algorithms.  Connection  Science,  3:241-268. \n\n\f", "award": [], "sourceid": 1042, "authors": [{"given_name": "Philip", "family_name": "Sabes", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}