{"title": "Monte Carlo POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1064, "page_last": 1070, "abstract": null, "full_text": "Monte Carlo POMDPs \n\nSebastian Thrun \n\nSchool of Computer Science \nCarnegie Mellon University \n\nPittsburgh, PA  15213 \n\nAbstract \n\nWe present a Monte Carlo algorithm for learning to act in partially observable Markov decision processes (POMDPs) with real-valued state and action spaces. Our approach uses importance sampling for representing beliefs, and Monte Carlo approximation for belief propagation. A reinforcement learning algorithm, value iteration, is employed to learn value functions over belief states. Finally, a sample-based version of nearest neighbor is used to generalize across states. Initial empirical results suggest that our approach works well in practical applications. \n\n1  Introduction \nPOMDPs address the problem of acting optimally in partially observable dynamic environments [6]. In POMDPs, a learner interacts with a stochastic environment whose state is only partially observable. Actions change the state of the environment and lead to numerical penalties/rewards, which may be observed with an unknown temporal delay. The learner's goal is to devise a policy for action selection that maximizes the reward. Obviously, the POMDP framework embraces a large range of practical problems. \n\nPast work has predominantly studied POMDPs in discrete worlds [1]. Discrete worlds have the advantage that distributions over states (so-called \"belief states\") can be represented exactly, using one parameter per state. The optimal value function (for a finite planning horizon) has been shown to be convex and piecewise linear [10, 14], which makes it possible to derive exact solutions for discrete POMDPs. 
\n\nHere we are interested in POMDPs with continuous state and action spaces, paying tribute to the fact that a large number of real-world problems are continuous in nature. In general, such POMDPs are not solvable exactly, and little is known about special cases that can be solved. This paper proposes an approximate approach, the MC-POMDP algorithm, which can accommodate real-valued spaces and models. The central idea is to use Monte Carlo sampling for belief representation and propagation. Reinforcement learning in belief space is employed to learn value functions, using a sample-based version of nearest neighbor for generalization. Empirical results illustrate that our approach finds close-to-optimal solutions efficiently. \n\n2  Monte Carlo POMDPs \n2.1  Preliminaries \nPOMDPs address the problem of selecting actions in stationary, partially observable, controllable Markov chains. To establish the basic vocabulary, let us define: \n\n\u2022  State.  At any point in time, the world is in a specific state, denoted by $x$. \n\u2022  Action.  The agent can execute actions, denoted $a$. \n\u2022  Observation.  Through its sensors, the agent can observe a (noisy) projection of the world's state. We use $o$ to denote observations. \n\u2022  Reward.  Additionally, the agent receives rewards/penalties, denoted $R \\in \\Re$. To simplify the notation, we assume that the reward is part of the observation. More specifically, we will use $R(o)$ to denote the function that \"extracts\" the reward from the observation. \n\nThroughout this paper, we use the subscript $t$ to refer to a specific point in time (e.g., $x_t$ refers to the state at time $t$). \n\nPOMDPs are characterized by three probability distributions: \n\n1.  The initial distribution, $\\pi(x) := \\Pr(x_0)$, specifies the distribution of states at time $t = 0$. \n2.  
The next state distribution, $p(x' \\mid a, x) := \\Pr(x_t = x' \\mid a_{t-1} = a, x_{t-1} = x)$, describes the likelihood that action $a$, when executed at state $x$, leads to state $x'$. \n3.  The perceptual distribution, $\\nu(o \\mid x) := \\Pr(o_t = o \\mid x_t = x)$, describes the likelihood of observing $o$ when the world is in state $x$. \n\nA history is a sequence of actions and observations. For simplicity, we assume that actions and observations are alternated. We use $d_t$ to denote the history leading up to time $t$: \n\n$$ d_t = \\{o_t, a_{t-1}, o_{t-1}, a_{t-2}, \\ldots, a_0, o_0\\} \\tag{1} $$ \n\nThe fundamental problem in POMDPs is to devise a policy for action selection that maximizes reward. A policy, denoted \n\n$$ \\sigma : d \\rightarrow a \\tag{2} $$ \n\nis a mapping from histories to actions. Assuming that actions are chosen by a policy $\\sigma$, each policy induces an expected cumulative (and possibly discounted by a discount factor $\\gamma \\le 1$) reward, defined as \n\n$$ J^\\sigma = \\sum_{\\tau=0}^{\\infty} E\\left[\\gamma^\\tau R(o_\\tau)\\right] \\tag{3} $$ \n\nHere $E[\\,\\cdot\\,]$ denotes the mathematical expectation. The POMDP problem is, thus, to find a policy $\\sigma^*$ that maximizes $J^\\sigma$, i.e., \n\n$$ \\sigma^* = \\operatorname{argmax}_{\\sigma} J^\\sigma \\tag{4} $$ \n\n2.2  Belief States \nTo avoid the difficulty of learning a function with unbounded input (the history can be arbitrarily long), it is common practice to map histories into belief states, and learn a mapping from belief states to actions instead [10]. \n\nFormally, a belief state (denoted $\\theta$) is a probability distribution over states conditioned on past actions and observations: \n\n$$ \\theta_t = \\Pr(x_t \\mid d_t) = \\Pr(x_t \\mid o_t, a_{t-1}, \\ldots, o_0) \\tag{5} $$ \n\nBeliefs are computed incrementally, using knowledge of the POMDP's defining distributions $\\pi$, $p$, and $\\nu$.  
Initially, \n\n$$ \\theta_0 = \\pi \\tag{6} $$ \n\nFor $t \\ge 0$, we obtain \n\n$$ \\begin{aligned} \\theta_{t+1} &= \\Pr(x_{t+1} \\mid o_{t+1}, a_t, \\ldots, o_0) && (7) \\\\ &= \\alpha \\, \\Pr(o_{t+1} \\mid x_{t+1}, \\ldots, o_0) \\, \\Pr(x_{t+1} \\mid a_t, \\ldots, o_0) && (8) \\\\ &= \\alpha \\, \\Pr(o_{t+1} \\mid x_{t+1}) \\int \\Pr(x_{t+1} \\mid a_t, \\ldots, o_0, x_t) \\, \\Pr(x_t \\mid a_t, \\ldots, o_0) \\, dx_t && (9) \\\\ &= \\alpha \\, \\Pr(o_{t+1} \\mid x_{t+1}) \\int \\Pr(x_{t+1} \\mid a_t, x_t) \\, \\theta_t \\, dx_t && (10) \\end{aligned} $$ \n\nHere $\\alpha$ denotes a constant normalizer. The derivations of (8) and (10) follow directly from the fact that the environment is a stationary Markov chain, for which future states and observations are conditionally independent from past ones given knowledge of the state. Equation (9) is obtained using the theorem of total probability. \n\nArmed with the notion of belief states, the policy is now a mapping from belief states (instead of histories) to actions: \n\n$$ \\sigma : \\theta \\rightarrow a \\tag{11} $$ \n\nThe legitimacy of conditioning $a$ on $\\theta$, instead of $d$, follows directly from the fact that the environment is Markov, which implies that $\\theta$ is all one needs to know about the past to make optimal decisions. \n\nFigure 1:  Sampling:  (a) Likelihood-weighted sampling and (b) importance sampling. At the bottom of each graph, samples are shown that approximate the function $f$ shown at the top. The height of the samples illustrates their importance factors. \n\n2.3  Sample Representations \nThus far, we intentionally left open how belief states $\\theta$ are represented. In prior work, state spaces have been discrete.  
In discrete worlds, beliefs can be represented by a collection of probabilities (one for each state); hence, beliefs can be represented exactly. Here we are interested in real-valued state spaces. In general, probability distributions over real-valued spaces possess infinitely many dimensions, hence cannot be represented on a digital computer. \n\nThe key idea is to represent belief states by sets of (weighted) samples drawn from the belief distribution. Figure 1 illustrates two popular schemes for sample-based approximation: likelihood-weighted sampling, in which samples (shown at the bottom of Figure 1a) are drawn directly from the target distribution (labeled $f$ in Figure 1a), and importance sampling, where samples are drawn from some other distribution, such as the curve labeled $g$ in Figure 1b. In the latter case, samples $x$ are annotated by a numerical importance factor \n\n$$ p(x) = \\frac{f(x)}{g(x)} \\tag{12} $$ \n\nto account for the difference in the sampling distribution, $g$, and the target distribution $f$ (the height of the bars in Figure 1b illustrates the importance factors). Importance sampling requires that $f > 0 \\rightarrow g > 0$, which will be the case throughout this paper. Obviously, both sampling methods generate approximations only. Under mild assumptions, they converge to the target distribution at a rate of $1/\\sqrt{N}$, with $N$ denoting the sample set size [16]. 
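\n\nAs an illustration of how such weighted samples are used, the following is the standard self-normalized importance sampling estimate. It is a sketch added for clarity and is not taken from this paper; the test function $h$ and the sample index $i$ are notation assumed here. Given $N$ samples $x^{(1)}, \\ldots, x^{(N)}$ drawn from $g$, an expectation under the target $f$ can be approximated as \n\n$$ E_f[h(x)] = \\int h(x) \\, \\frac{f(x)}{g(x)} \\, g(x) \\, dx \\approx \\frac{\\sum_{i=1}^{N} p(x^{(i)}) \\, h(x^{(i)})}{\\sum_{i=1}^{N} p(x^{(i)})}, \\qquad p(x^{(i)}) = \\frac{f(x^{(i)})}{g(x^{(i)})} $$ \n\nDividing by the sum of the importance factors means that $f$ only needs to be known up to a normalizing constant; with $g = f$ all factors are equal and the estimate reduces to a plain sample average. 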
\n\nIn the context of POMDPs, the use of sample-based representations gives rise to the following algorithm for approximate belief propagation (cf. Equation (10)): \n\nAlgorithm particle_filter($\\theta_t$, $a_t$, $o_{t+1}$): \n\n    $\\theta_{t+1} = \\emptyset$ \n    do $N$ times: \n        draw random state $x_t$ from $\\theta_t$ \n        sample $x_{t+1}$ according to $p(x_{t+1} \\mid a_t, x_t)$ \n        set importance factor $p(x_{t+1}) = \\nu(o_{t+1} \\mid x_{t+1})$ \n        add $(x_{t+1}, p(x_{t+1}))$ to $\\theta_{t+1}$ \n    normalize all $p(x_{t+1}) \\in \\theta_{t+1}$ so that $\\sum p(x_{t+1}) = 1$ \n    return $\\theta_{t+1}$ \n\nThis algorithm converges to (10) for arbitrary models $p$, $\\nu$, and $\\pi$ and arbitrary belief distributions $\\theta$, defined over discrete, continuous, or mixed continuous-discrete state and action spaces. It has, with minor modifications, been proposed under names like particle filters [13], condensation algorithm [5], survival of the fittest [8], and, in the context of robotics, Monte Carlo localization [4]. \n\n2.4  Projection \nIn conventional planning, the result of applying an action $a_t$ at a state $x_t$ is a distribution $\\Pr(x_{t+1}, R_{t+1} \\mid a_t, x_t)$ over states $x_{t+1}$ and rewards $R_{t+1}$ at the next time step. This operation is called projection. In POMDPs, the state $x_t$ is unknown. Instead, one has to compute the result of applying action $a_t$ to a belief state $\\theta_t$. The result is a distribution $\\Pr(\\theta_{t+1}, R_{t+1} \\mid a_t, \\theta_t)$ over belief states $\\theta_{t+1}$ and rewards $R_{t+1}$. Since belief states themselves are distributions, the result of a projection in POMDPs is, technically, a distribution over distributions. \n\nThe projection algorithm is derived as follows.  
Using total probability, we obtain: \n\n$$ \\begin{aligned} \\Pr(\\theta_{t+1}, R_{t+1} \\mid a_t, \\theta_t) &= \\Pr(\\theta_{t+1}, R_{t+1} \\mid a_t, d_t) && (13) \\\\ &= \\int \\underbrace{\\Pr(\\theta_{t+1}, R_{t+1} \\mid o_{t+1}, a_t, d_t)}_{(*)} \\; \\underbrace{\\Pr(o_{t+1} \\mid a_t, d_t)}_{(**)} \\; do_{t+1} && (14) \\end{aligned} $$ \n\nThe term (*) has already been derived in the previous section (cf. Equation (10)), under the observation that the reward $R_{t+1}$ is trivially computed from the observation $o_{t+1}$. \n\nThe second term, (**), is obtained by integrating out the unknown variables, $x_{t+1}$ and $x_t$, and by once again exploiting the Markov property: \n\n$$ \\begin{aligned} \\Pr(o_{t+1} \\mid a_t, d_t) &= \\int \\Pr(o_{t+1} \\mid x_{t+1}) \\, \\Pr(x_{t+1} \\mid a_t, d_t) \\, dx_{t+1} && (15) \\\\ &= \\int \\Pr(o_{t+1} \\mid x_{t+1}) \\int \\Pr(x_{t+1} \\mid x_t, a_t) \\, \\Pr(x_t \\mid d_t) \\, dx_t \\, dx_{t+1} && (16) \\\\ &= \\int \\nu(o_{t+1} \\mid x_{t+1}) \\int p(x_{t+1} \\mid x_t, a_t) \\, \\theta_t(x_t) \\, dx_t \\, dx_{t+1} && (17) \\end{aligned} $$ \n\nThis leads to the following approximate algorithm for projecting belief states. In the spirit of this paper, our approach uses Monte Carlo integration instead of exact integration. It represents distributions (and distributions over distributions) by samples drawn from such distributions. \n\nAlgorithm particle_projection($\\theta_t$, $a_t$): \n\n    $\\Theta_t = \\emptyset$ \n    do $N$ times: \n        draw random state $x_t$ from $\\theta_t$ \n        sample a next state $x_{t+1}$ according to $p(x_{t+1} \\mid a_t, x_t)$ \n        sample an observation $o_{t+1}$ according to $\\nu(o_{t+1} \\mid x_{t+1})$ \n        compute $\\theta_{t+1}$ = particle_filter($\\theta_t$, $a_t$, $o_{t+1}$) \n        add $(\\theta_{t+1}, R(o_{t+1}))$ to $\\Theta_t$ \n    return $\\Theta_t$ \n\nThe result of this algorithm, $\\Theta_t$, is a sample set of belief states $\\theta_{t+1}$ and rewards $R_{t+1}$, drawn from the desired distribution $\\Pr(\\theta_{t+1}, R_{t+1} \\mid \\theta_t, a_t)$. As $N \\rightarrow \\infty$, $\\Theta_t$ converges with probability 1 to the true posterior [16]. \n\n2.5  Learning Value Functions \nFollowing the rich literature on reinforcement learning [7, 15], our approach solves the POMDP problem by value iteration in belief space.  
More specifically, our approach recursively learns a value function $Q$ over belief states and actions, by backing up values from subsequent belief states: \n\n$$ Q(\\theta_t, a_t) \\longleftarrow E\\left[ R(o_{t+1}) + \\gamma \\max_{a} Q(\\theta_{t+1}, a) \\right] \\tag{18} $$ \n\nLeaving open (for a moment) how $Q$ is represented, it is easy to see how the algorithm particle_projection can be applied to compute a Monte Carlo approximation of the right-hand side: given a belief state $\\theta_t$ and an action $a_t$, particle_projection computes a sample of $R(o_{t+1})$ and $\\theta_{t+1}$, from which the expected value on the right-hand side of (18) can be approximated. \n\nIt has been shown [2] that if both sides of (18) are equal, the greedy policy \n\n$$ \\sigma_Q(\\theta) = \\operatorname{argmax}_{a} Q(\\theta, a) \\tag{19} $$ \n\nis optimal, i.e., $\\sigma^* = \\sigma_Q$. Furthermore, it has been shown (for the discrete case!) that repetitive application of (18) leads to an optimal value function and, thus, to the optimal policy [17, 3]. \n\nOur approach essentially performs model-based reinforcement learning in belief space using approximate sample-based representations. This makes it possible to apply a rich bag of tricks found in the literature on MDPs. In our experiments below, we use online reinforcement learning with counter-based exploration and experience replay [9] to determine the order in which belief states are updated. \n\n2.6  Nearest Neighbor \nWe now return to the issue of how to represent $Q$. Since we are operating in real-valued spaces, some sort of function approximation method is called for. However, recall that $Q$ accepts a probability distribution (a sample set) as an input. This makes most existing function approximators (e.g., neural networks) inapplicable. \n\nIn our current implementation, nearest neighbor [11] is applied to represent $Q$.  
More specifically, our algorithm maintains a set of sample sets $\\theta$ (belief states) annotated by an action $a$ and a Q-value $Q(\\theta, a)$. When a new belief state $\\theta'$ is encountered, its Q-value is obtained by finding the $k$ nearest neighbors in the database, and linearly averaging their Q-values. If there aren't sufficiently many neighbors (within a pre-specified maximum distance), $\\theta'$ is added to the database; hence, the database grows over time. \n\nOur approach uses KL divergence (relative entropy) as a distance function$^1$. Technically, the KL divergence between two continuous distributions is well-defined. When applied to sample sets, however, it cannot be computed. Hence, when evaluating the distance between two different sample sets, our approach maps them into continuous-valued densities using Gaussian kernels, and uses Monte Carlo sampling to approximate the KL divergence between them. This algorithm is a fairly generic extension of nearest neighbor to function approximation in density space, where densities are represented by samples. Space limitations preclude us from providing further detail (see [11, 12]). \n\n$^1$ Strictly speaking, KL divergence is not a distance metric, but this is ignored here. \n\n3  Experimental Results \nPreliminary results have been obtained in two domains, one synthetic and one using a simulator of a RWI B21 robot. \n\nIn the synthetic environment (Figure 2a), the agent starts at the lower left corner. Its objective is to reach \"heaven\" which is either at the upper left corner or the lower right corner. The opposite location is \"hell.\" The agent does not know the location of heaven, but it can ask a \"priest\" who is located in the upper right corner. Thus, an optimal solution requires the agent to go first to the priest, and then head to heaven. The state space contains a real-valued (coordinates of the agent) and a discrete (location of heaven) component. Both are unobservable: in addition to not knowing the location of heaven, the agent also cannot sense its (real-valued) coordinates. 5% random motion noise is injected at each move. When an agent hits a boundary, it is penalized, but it is also told which boundary it hit (which makes it possible to infer its coordinates along one axis). However, notice that the initial coordinates of the agent are known. \n\nFigure 2:  (a) The environment, schematically. (b) Average performance (reward) as a function of training episodes. The black graph corresponds to the smaller environment (25 steps min), the grey graph to the larger environment (50 steps min). (c) Same results, plotted as a function of number of backups (in thousands). \n\nThe optimal solution takes approximately 25 steps; thus, a successful POMDP planner must be capable of looking 25 steps ahead. We will use the term \"successful policy\" to refer to a policy that always leads to heaven, even if the path is suboptimal. For a policy to be successful, the agent must have learned to first move to the priest (information gathering), and then proceed to the right target location. \n\nFigures 2b&c show performance results, averaged over 13 experiments. The solid (black) curve in both diagrams plots the average cumulative reward $J$ as a function of the number of training episodes (Figure 2b), and as a function of the number of backups (Figure 2c). 
A successful policy was consistently found after 17 episodes (or 6,150 backups), in all 13 experiments. In our current implementation, 6,150 backups require approximately 29 minutes on a Pentium PC. In some experiments, a successful policy was identified in 6 episodes (fewer than 1,500 backups or 7 minutes). After a successful policy is found, further learning gradually optimizes the path. To investigate scaling, we doubled the size of the environment (quadrupling the size of the state space), making the optimal solution 50 steps long. The results are depicted by the gray curves in Figures 2b&c. Here a successful policy is consistently found after 33 episodes (10,250 backups, 58 minutes). In some runs, a successful policy is identified after only 14 episodes. \n\nWe also applied MC-POMDPs to a robotic locate-and-retrieve task. Here a robot (Figure 3a) is to find and grasp an object somewhere in its vicinity (at floor or table height). The robot's task is to grasp the object using its gripper. It is rewarded for successfully grasping the object, and penalized for unsuccessful grasps or for moving too far away from the object. The state space is continuous in $x$ and $y$ coordinates, and discrete in the object's height. \n\nThe robot uses a mono-camera system for object detection; hence, viewing the object from a single location is insufficient for its 3D localization. Moreover, initially the object might not be in sight of the robot's camera, so that the robot must look around first. In our simulation, we assume 30% general detection error (false-positive and false-negative), with additional Gaussian noise if the object is detected correctly. The robot's actions include turns (by a variable angle), translations (by a variable distance), and grasps (at one of two legal heights).  
Robot control is erroneous with a variance of 20% (in x-y-space) and 5% (in rotational space). Typical belief states range from uniformly distributed sample sets (initial belief) to samples narrowly focused on a specific x-y-z location. \n\nFigure 3:  Find and fetch task: (a) The mobile robot with gripper and camera, holding the target object (experiments are carried out in simulation!), (b) three successful runs (trajectory projected into 2D), and (c) success rate as a function of number of planning steps. \n\nFigure 3c shows the rate of successful grasps as a function of iterations (actions). While initially the robot fails to grasp the object, after approximately 4,000 iterations its performance surpasses 80%. Here the planning time is on the order of 2 hours. However, the robot fails to reach 100%. This is in part because certain initial configurations make it impossible to succeed (e.g., when the object is too close to the maximum allowed distance), and in part because the robot occasionally misses the object by a few centimeters. Figure 3b depicts three successful example trajectories. In all three, the robot initially searches for the object, then moves towards it and grasps it successfully. \n\n4  Discussion \nWe have presented a Monte Carlo approach for learning how to act in partially observable Markov decision processes (POMDPs). Our approach represents all belief distributions using samples drawn from these distributions. Reinforcement learning in belief space is applied to learn optimal policies, using a sample-based version of nearest neighbor for generalization. Backups are performed using Monte Carlo sampling.  
Initial experimental results demonstrate that our approach is applicable to real-valued domains, and that it yields good performance results in environments that are, by POMDP standards, relatively large. \n\nReferences \n[1]  AAAI Fall Symposium on POMDPs. 1998. See http://www.cs.duke.edu/ ... mlittman/talks/pomdp-symposium.html \n[2]  R. E. Bellman. Dynamic Programming. Princeton University Press, 1957. \n[3]  P. Dayan and T. J. Sejnowski. TD($\\lambda$) converges with probability 1. 1993. \n[4]  D. Fox, W. Burgard, F. Dellaert, and S. Thrun. Monte Carlo localization: Efficient position estimation for mobile robots. AAAI-99. \n[5]  M. Isard and A. Blake. Condensation: conditional density propagation for visual tracking. International Journal of Computer Vision, 1998. \n[6]  L.P. Kaelbling, M.L. Littman, and A.R. Cassandra. Planning and acting in partially observable stochastic domains. Submitted for publication, 1997. \n[7]  L.P. Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement learning: A survey. JAIR, 4, 1996. \n[8]  K. Kanazawa, D. Koller, and S.J. Russell. Stochastic simulation algorithms for dynamic probabilistic networks. UAI-95. \n[9]  L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8, 1992. \n[10]  M.L. Littman, A.R. Cassandra, and L.P. Kaelbling. Learning policies for partially observable environments: Scaling up. ICML-95. \n[11]  A.W. Moore, C.G. Atkeson, and S.A. Schaal. Locally weighted learning for control. AI Review, 11, 1997. \n[12]  D. Ormoneit and S. Sen. Kernel-based reinforcement learning. TR 1999-8, Statistics, Stanford University, 1999. \n[13]  M. Pitt and N. Shephard. Filtering via simulation: auxiliary particle filter. Journal of the American Statistical Association, 1999. \n[14]  E. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford, 1971. \n[15]  R.S. 
Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. \n[16]  M.A. Tanner. Tools for Statistical Inference. Springer Verlag, 1993. \n[17]  C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989. \n\n", "award": [], "sourceid": 1772, "authors": [{"given_name": "Sebastian", "family_name": "Thrun", "institution": null}]}