{"title": "Policy Search via Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1022, "page_last": 1028, "abstract": null, "full_text": "Policy Search via Density Estimation\n\nAndrew Y. Ng\nComputer Science Division\nU.C. Berkeley\nBerkeley, CA 94720\nang@cs.berkeley.edu\n\nRonald Parr\nComputer Science Dept.\nStanford University\nStanford, CA 94305\nparr@cs.stanford.edu\n\nDaphne Koller\nComputer Science Dept.\nStanford University\nStanford, CA 94305\nkoller@cs.stanford.edu\n\nAbstract\n\nWe propose a new approach to the problem of searching a space of stochastic controllers for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP). Following several other authors, our approach is based on searching in parameterized families of policies (for example, via gradient descent) to optimize solution quality. However, rather than trying to estimate the values and derivatives of a policy directly, we do so indirectly using estimates for the probability densities that the policy induces on states at the different points in time. This enables our algorithms to exploit the many techniques for efficient and robust approximate density propagation in stochastic systems. We show how our techniques can be applied both to deterministic propagation schemes (where the MDP's dynamics are given explicitly in compact form) and to stochastic propagation schemes (where we have access only to a generative model, or simulator, of the MDP). We present empirical results for both of these variants on complex problems. 
\n\n1  Introduction\n\nIn recent years, there has been growing interest in algorithms for approximate planning in (exponentially or even infinitely) large Markov decision processes (MDPs) and partially observable MDPs (POMDPs). For such large domains, the value and Q-functions are sometimes complicated and difficult to approximate, even though there may be simple, compactly representable policies which perform very well. This observation has led to particular interest in direct policy search methods (e.g., [9, 8, 1]), which attempt to choose a good policy from some restricted class $\\Pi$ of policies. In our setting, $\\Pi = \\{\\pi_\\theta : \\theta \\in \\mathbb{R}^m\\}$ is a class of policies smoothly parameterized by $\\theta \\in \\mathbb{R}^m$. If the value of $\\pi_\\theta$ is differentiable in $\\theta$, then gradient ascent methods may be used to find a locally optimal $\\pi_\\theta$. However, estimating values of $\\pi_\\theta$ (and the associated gradient) is often far from trivial. One simple method for estimating $\\pi_\\theta$'s value involves executing one or more Monte Carlo trajectories using $\\pi_\\theta$, and then taking the average empirical return; cleverer algorithms executing single trajectories also allow gradient estimates [9, 1]. These methods have become a standard approach to policy search, and sometimes work fairly well.\n\nIn this paper, we propose a somewhat different approach to this value/gradient estimation problem. Rather than estimating these quantities directly, we estimate the probability density over the states of the system induced by $\\pi_\\theta$ at different points in time. These time slice densities completely determine the value of the policy $\\pi_\\theta$. 
While density estimation is not an easy problem, we can utilize existing approaches to density propagation [3, 5], which allow users to specify prior knowledge about the densities, and which have also been shown, both theoretically and empirically, to provide robust estimates for time slice densities. We show how direct policy search can be implemented using this approach in two very different settings of the planning problem: In the first, we have access to an explicit model of the system dynamics, allowing us to provide an explicit algebraic operator that implements the approximate density propagation process. In the second, we have access only to a generative model of the dynamics (which allows us only to sample from, but does not provide an explicit representation of, next-state distributions). We show how both of our techniques can be combined with gradient ascent in order to perform policy search, a somewhat subtle argument in the case of the sampling-based approach. We also present empirical results for both variants in complex domains.\n\n2  Problem description\n\nA Markov Decision Process (MDP) is a tuple $(S, s_0, A, R, P)$ where:\u00b9 $S$ is a (possibly infinite) set of states; $s_0 \\in S$ is a start state; $A$ is a finite set of actions; $R$ is a reward function $R : S \\mapsto [0, R_{\\max}]$; $P$ is a transition model $P : S \\times A \\mapsto \\Delta_S$, such that $P(s' \\mid s, a)$ gives the probability of landing in state $s'$ upon taking action $a$ in state $s$.\n\nA stochastic policy is a map $\\pi : S \\mapsto \\Delta_A$, where $\\pi(a \\mid s)$ is the probability of taking action $a$ in state $s$. There are many ways of defining a policy $\\pi$'s \"quality\" or value. For a horizon $T$ and discount factor $\\gamma$, the finite horizon discounted value function $V_{T,\\gamma}[\\pi]$ is defined by $V_{0,\\gamma}[\\pi](s) = R(s)$; $V_{t+1,\\gamma}[\\pi](s) = R(s) + \\gamma \\sum_a \\pi(a \\mid s) \\sum_{s'} P(s' \\mid s, a) V_{t,\\gamma}[\\pi](s')$. 
For an infinite state space (here and below), the summation is replaced by an integral. We can now define several optimality criteria. The finite horizon total reward with horizon $T$ is $V_T[\\pi] = V_{T,1}[\\pi](s_0)$. The infinite horizon discounted reward with discount $\\gamma < 1$ is $V_\\gamma[\\pi] = \\lim_{T \\to \\infty} V_{T,\\gamma}[\\pi](s_0)$. The infinite horizon average reward is $V_{\\mathrm{avg}}[\\pi] = \\lim_{T \\to \\infty} \\frac{1}{T} V_{T,1}[\\pi](s_0)$, where we assume that the limit exists.\n\nFix an optimality criterion $V$. Our goal is to find a policy that has a high value. As discussed, we assume we have a restricted set $\\Pi$ of policies, and wish to select a good $\\pi \\in \\Pi$. We assume that $\\Pi = \\{\\pi_\\theta \\mid \\theta \\in \\mathbb{R}^m\\}$ is a set of policies parameterized by $\\theta \\in \\mathbb{R}^m$, and that $\\pi_\\theta(a \\mid s)$ is continuously differentiable in $\\theta$ for each $s, a$. As a very simple example, we may have a one-dimensional state, two-action MDP with \"sigmoidal\" $\\pi_\\theta$, such that the probability of choosing action $a_0$ at state $x$ is $\\pi_\\theta(a_0 \\mid x) = 1/(1 + \\exp(-\\theta_1 - \\theta_2 x))$.\n\nNote that this framework also encompasses cases where our family $\\Pi$ consists of policies that depend only on certain aspects of the state. In particular, in POMDPs, we can restrict attention to policies that depend only on the observables. This restriction results in a subclass of stochastic memory-free policies. By introducing artificial \"memory bits\" into the process state, we can also define stochastic limited-memory policies [6].\n\nEach $\\theta$ has a value $V[\\theta] = V[\\pi_\\theta]$, as specified above. To find the best policy in $\\Pi$, we can search for the $\\theta$ that maximizes $V[\\theta]$. If we can compute or approximate $V[\\theta]$, there are many algorithms that can be used to find a local maximum. Some, such as Nelder-Mead simplex search (not to be confused with the simplex algorithm for linear programs), require only the ability to evaluate the function being optimized at any point. 
If we can compute or estimate $V[\\theta]$'s gradient with respect to $\\theta$, we can also use a variety of (deterministic or stochastic) gradient ascent methods.\n\n\u00b9 We write rewards as $R(s)$ rather than $R(s, a)$, and assume a single start state rather than an initial-state distribution, only to simplify exposition; these and several other minor extensions are trivial.\n\n3  Densities and value functions\n\nMost optimization algorithms require some method for computing $V[\\theta]$ for any $\\theta$ (and sometimes also its gradient). In many real-life MDPs, however, doing so exactly is completely infeasible, due to the large or even infinite number of states. Here, we will consider an approach to estimating these quantities, based on a density-based reformulation of the value function expression. A policy $\\pi$ induces a probability distribution over the states at each time $t$. Letting $\\phi^{(0)}$ be the initial distribution (giving probability 1 to $s_0$), we define the time slice distributions via the recurrence:\n\n$$\\phi^{(t+1)}(s') = \\sum_s \\sum_a P(s' \\mid s, a) \\pi(a \\mid s) \\phi^{(t)}(s) \\qquad (1)$$\n\nIt is easy to verify that the standard notions of value defined earlier can be reformulated in terms of $\\phi^{(t)}$; e.g., $V_{T,\\gamma}[\\pi](s_0) = \\sum_{t=0}^{T} \\gamma^t (\\phi^{(t)} \\cdot R)$, where $\\cdot$ is the dot-product operation (equivalently, the expectation of $R$ with respect to $\\phi^{(t)}$). Somewhat more subtly, for the case of infinite horizon average reward, we have that $V_{\\mathrm{avg}}[\\pi] = \\phi^{(\\infty)} \\cdot R$, where $\\phi^{(\\infty)}$ is the limiting distribution of (1), if one exists.\n\nThis reformulation gives us an alternative approach to evaluating the value of a policy $\\pi_\\theta$: we first compute the time slice densities $\\phi^{(t)}$ (or $\\phi^{(\\infty)}$), and then use them to compute the value. Unfortunately, that modification, by itself, does not resolve the difficulty. 
Representing and computing probability densities over large or infinite spaces is often no easier than representing and computing value functions. However, several results [3, 5] indicate that representing and computing high-quality approximate densities may often be quite feasible. The general approach is an approximate density propagation algorithm, using time-slice distributions in some restricted family $\\Xi$. For example, in continuous spaces, $\\Xi$ might be the set of multivariate Gaussians.\n\nThe approximate propagation algorithm modifies equation (1) to maintain the time-slice densities in $\\Xi$. More precisely, for a policy $\\pi_\\theta$, we can view (1) as defining an operator $\\Phi[\\theta]$ that takes one distribution in $\\Delta_S$ and returns another. For our current policy $\\pi_{\\theta_0}$, we can rewrite (1) as: $\\phi^{(t+1)} = \\Phi[\\theta_0](\\phi^{(t)})$. In most cases, $\\Xi$ will not be closed under $\\Phi$; approximate density propagation algorithms use some alternative operator $\\hat{\\Phi}$, with the properties that, for $\\phi \\in \\Xi$: (a) $\\hat{\\Phi}(\\phi)$ is also in $\\Xi$, and (b) $\\hat{\\Phi}(\\phi)$ is (hopefully) close to $\\Phi(\\phi)$. We use $\\hat{\\Phi}[\\theta]$ to denote the approximation to $\\Phi[\\theta]$, and $\\hat{\\phi}^{(t)}$ to denote $(\\hat{\\Phi}[\\theta])^{(t)}(\\phi^{(0)})$. If $\\hat{\\Phi}$ is selected carefully, it is often the case that $\\hat{\\phi}^{(t)}$ is close to $\\phi^{(t)}$. Indeed, a standard contraction analysis for stochastic processes can be used to show:\n\nProposition 1  Assume that for all $t$, $\\|\\Phi(\\hat{\\phi}^{(t)}) - \\hat{\\Phi}(\\hat{\\phi}^{(t)})\\|_1 \\le \\epsilon$. Then there exists some constant $\\lambda$ such that for all $t$, $\\|\\phi^{(t)} - \\hat{\\phi}^{(t)}\\|_1 \\le \\epsilon / \\lambda$.\n\nIn some cases, $\\lambda$ might be arbitrarily small, in which case the proposition is meaningless. However, there are many systems where $\\lambda$ is reasonable (and independent of $\\epsilon$) [3]. 
Furthermore, empirical results also show that approximate density propagation can often track the exact time slice distributions quite accurately.\n\nApproximate tracking can now be applied to our planning task. Given an optimality criterion $V$ expressed with $\\phi^{(t)}$s, we define an approximation $\\hat{V}$ to it by replacing each $\\phi^{(t)}$ with $\\hat{\\phi}^{(t)}$, e.g., $\\hat{V}_{T,\\gamma}[\\pi](s_0) = \\sum_{t=0}^{T} \\gamma^t \\hat{\\phi}^{(t)} \\cdot R$. Accuracy guarantees on approximate tracking induce comparable guarantees on the value approximation; from this, guarantees on the performance of a policy $\\pi_{\\hat{\\theta}}$ found by optimizing $\\hat{V}$ are also possible:\n\nProposition 2  Assume that, for all $t$, we have that $\\|\\phi^{(t)} - \\hat{\\phi}^{(t)}\\|_1 \\le \\delta$. Then for each fixed $T, \\gamma$: $|V_{T,\\gamma}[\\pi](s_0) - \\hat{V}_{T,\\gamma}[\\pi](s_0)| = O(\\delta)$.\n\nProposition 3  Let $\\theta^* = \\arg\\max_\\theta V[\\theta]$ and $\\hat{\\theta} = \\arg\\max_\\theta \\hat{V}[\\theta]$. If $\\max_\\theta |V[\\theta] - \\hat{V}[\\theta]| \\le \\epsilon$, then $V[\\theta^*] - V[\\hat{\\theta}] \\le 2\\epsilon$.\n\n4  Differentiating approximate densities\n\nIn this section we discuss two very different techniques for maintaining an approximate density $\\hat{\\phi}^{(t)}$ using an approximate propagation operator $\\hat{\\Phi}$, and show when and how they can be combined with gradient ascent to perform policy search. In general, we will assume that $\\Xi$ is a family of distributions parameterized by $\\xi \\in \\mathbb{R}^\\ell$. For example, if $\\Xi$ is the set of $d$-dimensional multivariate Gaussians with diagonal covariance matrices, $\\xi$ would be a $2d$-dimensional vector, specifying the mean vector and the covariance matrix's diagonal.\n\nNow, consider the task of doing gradient ascent over the space of policies, using some optimality criterion $\\hat{V}$, say $\\hat{V}_{T,\\gamma}[\\theta]$. Differentiating it relative to $\\theta$, we get $\\nabla_\\theta \\hat{V}_{T,\\gamma}[\\theta] = \\sum_{t=0}^{T} \\gamma^t \\frac{d\\hat{\\phi}^{(t)}}{d\\theta} \\cdot R$. 
To avoid introducing new notation, we also use $\\hat{\\phi}^{(t)}$ to denote the associated vector of parameters $\\xi \\in \\mathbb{R}^\\ell$. These parameters are a function of $\\theta$. Hence, the internal gradient term is represented by an $\\ell \\times m$ Jacobian matrix, with entries representing the derivative of a parameter $\\xi_i$ relative to a parameter $\\theta_j$. This gradient can be computed using a simple recurrence, based on the chain rule for derivatives:\n\n$$\\frac{d\\hat{\\phi}^{(t+1)}}{d\\theta} = \\frac{\\partial \\hat{\\Phi}[\\theta](\\hat{\\phi}^{(t)})}{\\partial \\theta} + \\frac{\\partial \\hat{\\Phi}[\\theta](\\hat{\\phi}^{(t)})}{\\partial \\hat{\\phi}^{(t)}} \\cdot \\frac{d\\hat{\\phi}^{(t)}}{d\\theta} \\qquad (2)$$\n\nThe first summand (an $\\ell \\times m$ Jacobian) is the derivative of the transition operator $\\hat{\\Phi}$ relative to the policy parameters $\\theta$. The second is a product of two terms: the derivative of $\\hat{\\Phi}$ relative to the distribution parameters, and the result of the previous step in the recurrence.\n\n4.1  Deterministic density propagation\n\nConsider a transition operator $\\Phi$ (for simplicity, we omit the dependence on $\\theta$). The idea in this approach is to try to get $\\hat{\\Phi}(\\phi)$ to be as close as possible to $\\Phi(\\phi)$, subject to the constraint that $\\hat{\\Phi}(\\phi) \\in \\Xi$. Specifically, we define a projection operator $\\Gamma$ that takes a distribution $\\psi$ not in $\\Xi$, and returns a distribution in $\\Xi$ which is closest (in some sense) to $\\psi$. We then define $\\hat{\\Phi}(\\phi) = \\Gamma(\\Phi(\\phi))$. In order to ensure that gradient descent applies in this setting, we need only ensure that $\\Gamma$ and $\\Phi$ are differentiable functions. Clearly, there are many instantiations of this idea for which this assumption holds. We provide two examples.\n\nConsider a continuous-state process with nonlinear dynamics, where $\\Phi$ is a mixture of conditional linear Gaussians. We can define $\\Xi$ to be the set of multivariate Gaussians. The operator $\\Gamma$ takes a distribution (a mixture of Gaussians) $\\psi$ and computes its mean and covariance matrix. This can be easily computed from $\\psi$'s parameters using simple differentiable algebraic operations. 
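

The Gaussian projection in this first example amounts to moment matching. As a minimal sketch (not the authors' code), the Python function below computes the mean and covariance of a mixture of Gaussians from its parameters. The name project_mixture_to_gaussian and the list-based representation are hypothetical choices made for illustration.

```python
def project_mixture_to_gaussian(weights, means, covs):
    # weights: K mixture weights summing to 1
    # means:   K component mean vectors, each of length d
    # covs:    K component covariance matrices, each d x d (nested lists)
    # Returns the mean vector and covariance matrix of the whole mixture,
    # i.e. the parameters of the moment-matched single Gaussian.
    d = len(means[0])

    # Mixture mean: weighted average of the component means.
    mean = [sum(w * mu[i] for w, mu in zip(weights, means)) for i in range(d)]

    # Mixture covariance (law of total covariance): weighted component
    # covariances plus the weighted spread of component means around the
    # mixture mean.
    cov = [[0.0] * d for _ in range(d)]
    for w, mu, sigma in zip(weights, means, covs):
        for i in range(d):
            for j in range(d):
                spread = (mu[i] - mean[i]) * (mu[j] - mean[j])
                cov[i][j] += w * (sigma[i][j] + spread)
    return mean, cov


# Illustrative input: two equally weighted unit-covariance components
# centred at (-1, 0) and (1, 0).
identity = [[1.0, 0.0], [0.0, 1.0]]
mean, cov = project_mixture_to_gaussian(
    [0.5, 0.5], [[-1.0, 0.0], [1.0, 0.0]], [identity, identity]
)
```

For this input the projected Gaussian has mean (0, 0) and covariance diag(2, 1). Every step is a sum or product of the mixture parameters, which is why the projection is differentiable in them.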
\n\nA very different example is the algorithm of [3] for approximate density propagation in dynamic Bayesian networks (DBNs). A DBN is a structured representation of a stochastic process that exploits conditional independence properties of the distribution to allow compact representation. In a DBN, the state space is defined as a set of possible assignments $x$ to a set of random variables $X_1, \\ldots, X_n$. The transition model $P(x' \\mid x)$ is described using a Bayesian network fragment over the nodes $\\{X_1, \\ldots, X_n, X_1', \\ldots, X_n'\\}$. A node $X_i$ represents $X_i^{(t)}$ and $X_i'$ represents $X_i^{(t+1)}$. The nodes $X_i$ in the network are forced to be roots (i.e., have no parents), and are not associated with conditional probability distributions. Each node $X_i'$ is associated with a conditional probability distribution (CPD), which specifies $P(X_i' \\mid \\mathrm{Parents}(X_i'))$. The transition probability $P(X' \\mid X)$ is defined as $\\prod_i P(X_i' \\mid \\mathrm{Parents}(X_i'))$. DBNs support a compact representation of complex transition models in MDPs [2]. We can extend the DBN to encode the behavior of an MDP with a stochastic policy $\\pi$ by introducing a new random variable $A$ representing the action taken at the current time. The parents of $A$ will be those variables in the state on which the action is allowed to depend. The CPD of $A$ (which may be compactly represented with function approximation) is the distribution over actions defined by $\\pi$ for the different contexts.\n\nIn discrete DBNs, the number of states grows exponentially with the number of state variables, making an explicit representation of a joint distribution impractical. The algorithm of [3] defines $\\Xi$ to be a set of distributions defined compactly as a set of marginals over smaller clusters of variables. In the simplest example, $\\Xi$ is the set of distributions where $X_1, \\ldots, X_n$ are independent. The parameters $\\xi$ defining a distribution in $\\Xi$ are the parameters of $n$ multinomials. The projection operator $\\Gamma$ simply marginalizes distributions onto the individual variables, and is differentiable. One useful corollary of [3]'s analysis is that the decay rate of a structured $\\hat{\\Phi}$ over $\\Xi$ can often be much higher than the decay rate of $\\Phi$, so that multiple applications of $\\hat{\\Phi}$ can converge very rapidly to a stationary distribution; this property is very useful when approximating $\\phi^{(\\infty)}$ to optimize relative to $V_{\\mathrm{avg}}$.\n\n4.2  Stochastic density propagation\n\nIn many settings, the assumption that we have direct access to $\\Phi$ is too strong. A weaker assumption is that we have access to a generative model: a black box from which we can generate samples with the appropriate distribution; i.e., for any $s, a$, we can generate samples $s'$ from $P(s' \\mid s, a)$. In this case, we use a different approximation scheme, based on [5]. The operator $\\hat{\\Phi}$ is a stochastic operator. It takes the distribution $\\phi$, and generates some number of random state samples $s_i$ from it. Then, for each $s_i$ and each action $a$, we generate a sample $s_i'$ from the transition distribution $P(\\cdot \\mid s_i, a)$. This sample $(s_i, a_i, s_i')$ is then assigned a weight $w_i = \\pi_\\theta(a_i \\mid s_i)$, to compensate for the fact that not all actions would have been selected by $\\pi_\\theta$ with equal probability. The resulting set of $N$ samples $s_i'$ weighted by the $w_i$s is given as input to a statistical density estimator, which uses it to estimate a new density $\\phi'$. We assume that the density estimation procedure is a differentiable function of the weights, often a reasonable assumption.\n\nClearly, this $\\hat{\\Phi}$ can be used to compute $\\hat{\\phi}^{(t)}$ for any $t$, and thereby approximate $\\pi_\\theta$'s value. However, the gradient computation for $\\hat{\\Phi}$ is far from trivial. 
In particular, to compute the derivative $\\partial \\hat{\\Phi} / \\partial \\phi$, we must consider $\\hat{\\Phi}$'s behavior for some perturbed $\\hat{\\phi}_1^{(t)}$ other than the one (say, $\\hat{\\phi}_0^{(t)}$) to which it was applied originally. In this case, an entirely different set of samples would probably have been generated, possibly leading to a very different density. It is hard to see how one could differentiate the result of this perturbation. We propose an alternative solution based on importance sampling. Rather than change the samples, we modify their weights to reflect the change in the probability that they would be generated.\n\nSpecifically, when fitting $\\hat{\\phi}_1^{(t+1)}$, we now define a sample $(s_i, a_i, s_i')$'s weight to be\n\n$$w_i(\\hat{\\phi}_1^{(t)}, \\theta) = \\frac{\\hat{\\phi}_1^{(t)}(s_i) \\pi_\\theta(a_i \\mid s_i)}{\\hat{\\phi}_0^{(t)}(s_i)} \\qquad (3)$$\n\nWe can now compute $\\hat{\\Phi}$'s derivatives at $(\\theta_0, \\hat{\\phi}_0^{(t)})$ with respect to any of its parameters, as required in (2). Let $\\zeta$ be the vector of parameters $(\\theta, \\xi)$. Using the chain rule, we have\n\n$$\\frac{\\partial \\hat{\\Phi}[\\theta](\\phi)}{\\partial \\zeta} = \\frac{\\partial \\hat{\\Phi}[\\theta](\\phi)}{\\partial w} \\cdot \\frac{\\partial w}{\\partial \\zeta}$$\n\nThe first term is the derivative of the estimated density relative to the sample weights (an $\\ell \\times N$ matrix). The second is the derivative of the weights relative to the parameter vector (an $N \\times (m + \\ell)$ Jacobian), which can easily be computed from (3).\n\n[Figure 1 omitted; the horizontal axis of panel (b) is #Function evaluations.]\n\nFigure 1: Driving task: (a) DBN model; (b) policy-search/optimization results (with 1 s.e.)\n\n5  Experimental results\n\nWe tested our approach in two very different domains. 
The first is an average-reward DBN-MDP problem (shown in Figure 1(a)), where the task is to find a policy for changing lanes when driving on a moderately busy two-lane highway with a slow lane and a fast lane. The model is based on the BAT DBN of [4], the result of a separate effort to build a good model of driver behavior. For simplicity, we assume that the car's speed is controlled automatically, so we are concerned only with choosing the Lateral Action: change lane or drive straight. The observables are shown in the figure: LClr and RClr are the clearance to the next car in each lane (close, medium or far). The agent pays a cost of 1 for each step it is \"blocked\" by (meaning driving close to) the car to its front; it pays a penalty of 0.2 per step for staying in the fast lane. Policies are specified by action probabilities for the 18 possible observation combinations. Since this is a reasonably small number of parameters, we used the simplex search algorithm described earlier to optimize $V[\\theta]$.\n\nThe process mixed quite quickly, so $\\phi^{(20)}$ was a fairly good approximation to $\\phi^{(\\infty)}$. $\\Xi$ used a fully factored representation of the joint distribution except for a single cluster over the three observables. Evaluations are averages of 300 Monte Carlo trials of 400 steps each. Figure 1(b) shows the estimated and actual average rewards, as the policy parameters are evolved over time. The algorithm improved quickly, converging to a very natural policy with the car generally staying in the slow lane, and switching to the fast lane only when necessary to overtake.\n\nIn our second experiment, we used the bicycle simulator of [7]. 
There are 9 actions corresponding to leaning left/center/right and applying negative/zero/positive torque to the handlebar; the six-dimensional state used in [7] includes variables for the bicycle's tilt angle and orientation, and the handlebar's angle. If the bicycle tilt exceeds $\\pi/15$, it falls over and enters an absorbing state. We used policy search over the following space: we selected twelve (simple, manually chosen but not fine-tuned) features of each state; actions were chosen with a softmax (the probability of taking action $a_i$ is $\\exp(x \\cdot w_i) / \\sum_j \\exp(x \\cdot w_j)$). As the problem only comes with a generative model of the complicated, nonlinear, noisy bicycle dynamics, we used the stochastic density propagation version of our algorithm, with (stochastic) gradient ascent. Each distribution in $\\Xi$ was a mixture of a singleton point consisting of the absorbing state, and of a 6-D multivariate Gaussian.\n\nThe first task in this domain was to balance reliably on the bicycle. Using a horizon of $T = 200$, discount $\\gamma = 0.995$, and 600 $s_i$ samples per density propagation step, this was quickly achieved. Next, trying to learn to ride to a goal\u00b2 10m in radius and 1000m away, it also succeeded in finding policies that do so reliably. Formal evaluation is difficult, but this is a sufficiently hard problem that even finding a solution can be considered a success. There was also some slight parameter sensitivity (and the best results were obtained only with $\\hat{\\phi}^{(0)}$ picked/fit with some care, using in part data from earlier and less successful trials, to be \"representative\" of a fairly good rider's state distribution), but using this algorithm, we were able to obtain solutions with median riding distances under 1.1km to the goal. 
This is significantly better than the results of [7] (obtained in the learning rather than planning setting, and using a value-function approximation solution), which reported much larger riding distances to the goal of about 7km, and a single \"best-ever\" trial of about 1.7km.\n\n6  Conclusions\n\nWe have presented two new variants of algorithms for performing direct policy search in the deterministic and stochastic density propagation settings. Our empirical results have also shown these methods working well on two large problems.\n\nAcknowledgements. We warmly thank Kevin Murphy for use of and help with his Bayes Net Toolbox, and Jette Randl\u00f8v and Preben Alstr\u00f8m for use of their bicycle simulator. A. Ng is supported by a Berkeley Fellowship. The work of D. Koller and R. Parr is supported by the ARO-MURI program \"Integrated Approach to Intelligent Systems\", DARPA contract DACA76-93-C-0025 under subcontract to IET, Inc., ONR contract N66001-97-C-8554 under DARPA's HPKB program, the Sloan Foundation, and the Powell Foundation.\n\nReferences\n\n[1] L. Baird and A.W. Moore. Gradient descent for general reinforcement learning. In NIPS 11, 1999.\n\n[2] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. J. Artificial Intelligence Research, 1999.\n\n[3] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. UAI, pages 33-42, 1998.\n\n[4] J. Forbes, T. Huang, K. Kanazawa, and S.J. Russell. The BATmobile: Towards a Bayesian automated taxi. In Proc. IJCAI, 1995.\n\n[5] D. Koller and R. Fratkina. Using learning for approximation in stochastic processes. In Proc. ICML, pages 287-295, 1998.\n\n[6] N. Meuleau, L. Peshkin, K-E. Kim, and L.P. Kaelbling. 
Learning finite-state controllers for partially observable environments. In Proc. UAI 15, 1999.\n\n[7] J. Randl\u00f8v and P. Alstr\u00f8m. Learning to drive a bicycle using reinforcement learning and shaping. In Proc. ICML, 1998.\n\n[8] J.K. Williams and S. Singh. Experiments with an algorithm which learns stochastic memoryless policies for POMDPs. In NIPS 11, 1999.\n\n[9] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.\n\n\u00b2 For these experiments, we found learning could be accomplished faster with the simulator's integration delta-time constant tripled for training. This and \"shaping\" reinforcements (chosen to reward progress made towards the goal) were both used, and training was with the bike \"infinitely distant\" from the goal. For this and the balancing experiments, sampling from the fallen/absorbing-state portion of the distributions $\\hat{\\phi}^{(t)}$ is obviously inefficient use of samples, so all samples were drawn from the non-absorbing state portion (i.e., the Gaussian, also with its tails corresponding to tilt angles greater than $\\pi/15$ truncated), and weighted accordingly relative to the absorbing-state portion.", "award": [], "sourceid": 1748, "authors": [{"given_name": "Andrew", "family_name": "Ng", "institution": null}, {"given_name": "Ronald", "family_name": "Parr", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}