{"title": "Approximate Planning in Large POMDPs via Reusable Trajectories", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1007, "abstract": null, "full_text": "Approximate Planning in Large POMDPs via Reusable Trajectories\n\nMichael Kearns\nAT&T Labs\nmkearns@research.att.com\n\nYishay Mansour\nTel Aviv University\nmansour@math.tau.ac.il\n\nAndrew Y. Ng\nUC Berkeley\nang@cs.berkeley.edu\n\nAbstract\n\nWe consider the problem of reliably choosing a near-best strategy from a restricted class of strategies Π in a partially observable Markov decision process (POMDP). We assume we are given the ability to simulate the POMDP, and study what might be called the sample complexity: that is, the amount of data one must generate in the POMDP in order to choose a good strategy. We prove upper bounds on the sample complexity showing that, even for infinitely large and arbitrarily complex POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class Π, and exponentially on the horizon time. This latter dependence can be eased in a variety of ways, including the application of gradient and local search algorithms. Our measure of complexity generalizes the classical supervised learning notion of VC dimension to the settings of reinforcement learning and planning.\n\n1  Introduction\n\nMuch recent attention has been focused on partially observable Markov decision processes (POMDPs) which have exponentially or even infinitely large state spaces. For such domains, a number of interesting basic issues arise. As the state space becomes large, the classical way of specifying a POMDP by tables of transition probabilities clearly becomes infeasible. 
To intelligently discuss the problem of planning (that is, computing a good strategy¹ in a given POMDP), compact or implicit representations of both POMDPs, and of strategies in POMDPs, must be developed. Examples include factored next-state distributions [2, 3, 7], and strategies derived from function approximation schemes [8]. The trend towards such compact representations, as well as algorithms for planning and learning using them, is reminiscent of supervised learning, where researchers have long emphasized parametric models (such as decision trees and neural networks) that can capture only limited structure, but which enjoy a number of computational and information-theoretic benefits.\n\n¹ Throughout, we use the word strategy to mean any mapping from observable histories to actions, which generalizes the notion of policy in a fully observable MDP.\n\nMotivated by these issues, we consider a setting where we are given a generative model, or simulator, for a POMDP, and wish to find a good strategy π from some restricted class of strategies Π. A generative model is a \"black box\" that allows us to generate experience (trajectories) from different states of our choosing. Generative models are an abstract notion of compact POMDP representations, in the sense that the compact representations typically considered (such as factored next-state distributions) already provide efficient generative models. Here we are imagining that the strategy class Π is given by some compact representation or by some natural limitation on strategies (such as bounded memory). 
Thus, the view we are adopting is that even though the world (POMDP) may be extremely complex, we assume that we can at least simulate or sample experience in the world (via the generative model), and we try to use this experience to choose a strategy from some \"simple\" class Π.\n\nWe study the following question: How many calls to a generative model are needed to have enough data to choose a near-best strategy in the given class? This is analogous to the question of sample complexity in supervised learning, but harder. The added difficulty lies in the reuse of data. In supervised learning, every sample (x, f(x)) provides feedback about every hypothesis function h(x) (namely, how close h(x) is to f(x)). If h is restricted to lie in some hypothesis class ℋ, this reuse permits sample complexity bounds that are far smaller than the size of ℋ. For instance, only O(log(|ℋ|)) samples are needed to choose a near-best model from a finite class ℋ. If ℋ is infinite, then sample sizes are obtained that depend only on some measure of the complexity of ℋ (such as VC dimension [9]), but which have no dependence on the complexity of the target function or the size of the input domain.\n\nIn the POMDP setting, we would like analogous sample complexity bounds in terms of the \"complexity\" of the strategy class Π: bounds that have no dependence on the size or complexity of the POMDP. But unlike the supervised learning setting, experience \"reuse\" is not immediate in POMDPs. To see this, consider the \"straw man\" algorithm that, starting with some π ∈ Π, uses the generative model to generate many trajectories under π, and thus forms a Monte Carlo estimate of V^π(s_0). It is not clear that these trajectories under π are of much use in evaluating a different π' ∈ Π, since π and π' may quickly disagree on which actions to take. 
The naive Monte Carlo method thus gives O(|Π|) bounds on the \"sample complexity,\" rather than O(log(|Π|)), for the finite case.\n\nIn this paper, we shall describe the trajectory tree method of generating \"reusable\" trajectories, which requires generating only a (relatively) small number of trajectories: a number that is independent of the state-space size of the POMDP, depends only linearly on a general measure of the complexity of the strategy class Π, and depends exponentially on the horizon time. This latter dependence can be eased via gradient algorithms such as Williams' REINFORCE [10] and Baird and Moore's more recent VAPS [1], and by local search techniques. Our measure of strategy class complexity generalizes the notion of VC dimension in supervised learning to the settings of reinforcement learning and planning, and we give bounds that recover for these settings the most powerful analogous results in supervised learning: bounds for arbitrary, infinite strategy classes that depend only on the dimension of the class rather than the size of the state space.\n\n2  Preliminaries\n\nWe begin with some standard definitions. A Markov decision process (MDP) is a tuple (S, s_0, A, {P(·|s, a)}, R), where: S is a (possibly infinite) state set; s_0 ∈ S is a start state; A = {a_1, ..., a_k} are actions; P(·|s, a) gives the next-state distribution upon taking action a from state s; and the reward function R(s, a) gives the corresponding rewards. We assume for simplicity that rewards are deterministic, and further that they are bounded in absolute value by R_max. A partially observable Markov decision process (POMDP) consists of an underlying MDP and observation distributions Q(o|s) for each state s, where o is the random observation made at s. 
\n\nWe have adopted the common assumption of a fixed start state,² because once we limit the class of strategies we entertain, there may not be a single \"best\" strategy in the class: different start states may have different best strategies in Π. We also assume that we are given a POMDP M in the form of a generative model for M that, when given as input any state-action pair (s, a), will output a state s' drawn according to P(·|s, a), an observation o drawn according to Q(·|s), and the reward R(s, a). This gives us the ability to sample the POMDP M in a random-access way. This definition may initially seem unreasonably generous: the generative model is giving us a fully observable simulation of a partially observable process. However, the key point is that we must still find a strategy that performs well in the partially observable setting. As a concrete example, in designing an elevator control system, we may have access to a simulator that generates random rider arrival times, and keeps track of the waiting time of each rider, the number of riders waiting at every floor at every time of day, and so on. However helpful this information might be in designing the controller, this controller must only use information about which floors currently have had their call button pushed (the observables). In any case, readers uncomfortable with the power provided by our generative models are referred to Section 5, where we briefly describe results requiring only an extremely weak form of partially observable simulation.\n\nAt any time t, the agent will have seen some sequence of observations, o_0, ..., o_t, and will have chosen actions and received rewards for each of the t time steps prior to the current one. We write its observable history as h = ((o_0, a_0, r_0), ..., (o_{t-1}, a_{t-1}, r_{t-1}), (o_t, _, _)). 
Such observable histories, also called trajectories, are the inputs to strategies. More formally, a strategy π is any (stochastic) mapping from observable histories to actions. (For example, this includes approaches which use the observable history to track the belief state [5].) A strategy class Π is any set of strategies.\n\nWe will restrict our attention to the case of discounted return,³ and we let γ ∈ [0,1) be the discount factor. We define the ε-horizon time to be H_ε = log_γ(ε(1 - γ)/(2R_max)). Note that returns beyond the first H_ε steps can contribute at most ε/2 to the total discounted return. Also, let V_max = R_max/(1 - γ) bound the value function. Finally, for a POMDP M and a strategy class Π, we define opt(M, Π) = sup_{π∈Π} V^π(s_0) to be the best expected return achievable from s_0 using Π.\n\nOur problem is thus the following: Given a generative model for a POMDP M and a strategy class Π, how many calls to the generative model must we make, in order to have enough data to choose a π ∈ Π whose performance V^π(s_0) approaches opt(M, Π)? Also, which calls should we make to the generative model to achieve this?\n\n3  The Trajectory Tree Method\n\nWe now describe how we can use a generative model to create \"reusable\" trajectories. For ease of exposition, we assume there are only two actions a_1 and a_2, but our results generalize easily to any finite number of actions. (See the full paper [6].)\n\n² An equivalent definition is to assume a fixed distribution D over start states, since s_0 can be a \"dummy\" state whose next-state distribution under any action is D.\n\n³ The results in this paper can be extended without difficulty to the undiscounted finite-horizon setting [6]. 
\n\nA trajectory tree is a binary tree in which each node is labeled by a state and observation pair, and has a child for each of the two actions. Additionally, each link to a child is labeled by a reward, and the tree's depth will be H_ε, so it will have about 2^{H_ε} nodes. (In Section 4, we will discuss settings where this exponential dependence on H_ε can be eased.) Each trajectory tree is built as follows: The root is labeled by s_0 and the observation there, o_0. Its two children are then created by calling the generative model on (s_0, a_1) and (s_0, a_2), which gives us the two next-states reached (say s'_1 and s'_2 respectively), the two observations made (say o'_1 and o'_2), and the two rewards received (r'_1 = R(s_0, a_1) and r'_2 = R(s_0, a_2)). Then (s'_1, o'_1) and (s'_2, o'_2) label the root's a_1-child and a_2-child, and the links to these children are labeled r'_1 and r'_2. Recursively, we generate two children and rewards this way for each node down to depth H_ε.\n\nNow for any deterministic strategy π and any trajectory tree T, π defines a path through T: π starts at the root, and inductively, if π is at some internal node in T, then we feed to π the observable history along the path from the root to that node, and π selects and moves to a child of the current node. This continues until a leaf node is reached, and we define R(π, T) to be the discounted sum of returns along the path taken. In the case that π is stochastic, π defines a distribution on paths in T, and R(π, T) is the expected return according to this distribution. (We will later also describe another method for treating stochastic strategies.) Hence, given m trajectory trees T_1, ..., T_m, a natural estimate for V^π(s_0) is V̂^π(s_0) = (1/m) Σ_{i=1}^{m} R(π, T_i). 
Note that each tree can be used to evaluate any strategy, much the way a single labeled example (x, f(x)) can be used to evaluate any hypothesis h(x) in supervised learning. Thus in this sense, trajectory trees are reusable.\n\nOur goal now is to establish uniform convergence results that bound the error of the estimates V̂^π(s_0) as a function of the \"sample size\" (number of trees) m. Section 3.1 first treats the easier case of deterministic classes Π; Section 3.2 extends the result to stochastic classes.\n\n3.1  The Case of Deterministic Π\n\nLet us begin by stating a result for the special case of finite classes of deterministic strategies, which will serve to demonstrate the kind of bound we seek.\n\nTheorem 3.1  Let Π be any finite class of deterministic strategies for an arbitrary two-action POMDP M. Let m trajectory trees be created using a generative model for M, and V̂^π(s_0) be the resulting estimates. If m = O((V_max/ε)^2 (log(|Π|) + log(1/δ))), then with probability 1 - δ, |V^π(s_0) - V̂^π(s_0)| ≤ ε holds simultaneously for all π ∈ Π.\n\nDue to space limitations, detailed proofs of the results of this section are left to the full paper [6], but we will try to convey the intuition behind the ideas. Observe that for any fixed deterministic π, the estimates R(π, T_i) that are generated by the m different trajectory trees T_i are independent. Moreover, each R(π, T_i) is an unbiased estimate of the expected discounted H_ε-step return of π, which is in turn ε/2-close to V^π(s_0). These observations, combined with a simple Chernoff and union bound argument, are sufficient to establish Theorem 3.1. Rather than developing this argument here, we instead move straight on to the harder case of infinite Π. 
\n\nWhen addressing sample complexity in supervised learning, perhaps the most important insight is that even though a class ℋ may be infinite, the number of possible behaviors of ℋ on a finite set of points is often not exhaustive. More precisely, for boolean functions, we say that the set x_1, ..., x_d is shattered by ℋ if every one of the 2^d possible labelings of these points is realized by some h ∈ ℋ. The VC dimension of ℋ is then defined as the size of the largest shattered set [9]. It is known that if the VC dimension of ℋ is d, then the number Φ_d(m) of possible labelings induced by ℋ on a set of m points is at most (em/d)^d, which is much less than 2^m for d ≪ m. This fact provides the key leverage exploited by the classical VC dimension results, and we will concentrate on replicating this leverage in our setting.\n\nIf Π is a (possibly infinite) set of deterministic strategies, then each strategy π ∈ Π is simply a deterministic function mapping from the set of observable histories to the set {a_1, a_2}, and is thus a boolean function on observable histories. We can therefore write VC(Π) to denote the familiar VC dimension of the set of binary functions Π. For example, if Π is the set of all thresholded linear functions of the current vector of observations (a particular type of memoryless strategy), then VC(Π) simply equals the number of parameters. We now show intuitively why a class Π of bounded VC dimension d cannot induce exhaustive behavior on a set T_1, ..., T_m of trajectory trees for m ≫ d. Note that if π_1, π_2 ∈ Π are such that their \"reward labelings\" (R(π_1, T_1), ..., R(π_1, T_m)) and (R(π_2, T_1), ..., R(π_2, T_m)) differ, then R(π_1, T_i) ≠ R(π_2, T_i) for some 1 ≤ i ≤ m. 
But if π_1 and π_2 give different returns on T_i, then they must choose different actions at some node in T_i. In other words, every different reward labeling of the set of m trees yields a different (binary) labeling of the set of m · 2^{H_ε} observable histories in the trees. So, the number of different tree reward labelings can be at most Φ_d(m · 2^{H_ε}) ≤ (em · 2^{H_ε}/d)^d. By developing this argument carefully and applying classical uniform convergence techniques, we obtain the following theorem. (Full proof in [6].)\n\nTheorem 3.2  Let Π be any class of deterministic strategies for an arbitrary two-action POMDP M, and let VC(Π) denote its VC dimension. Let m trajectory trees be created using a generative model for M, and V̂^π(s_0) be the resulting estimates. If\n\n(1)\n\nthen with probability 1 - δ, |V^π(s_0) - V̂^π(s_0)| ≤ ε holds simultaneously for all π ∈ Π.\n\n3.2  The Case of Stochastic Π\n\nWe now address the case of stochastic strategy classes. We describe an approach where we transform stochastic strategies into \"equivalent\" deterministic ones and operate on the deterministic versions, reducing the problem to the one handled in the previous section. The transformation is as follows: Given a class of stochastic strategies Π, each with domain X (where X is the set of all observable histories), we first extend the domain to be X × [0,1]. Now for each stochastic strategy π ∈ Π, define a corresponding deterministic transformed strategy π' with domain X × [0,1], given by: π'(h, r) = a_1 if r ≤ Pr[π(h) = a_1], and π'(h, r) = a_2 otherwise (for any h ∈ X, r ∈ [0,1]). Let Π' be the collection of these transformed deterministic strategies π'. Since Π' is just a set of deterministic boolean functions, its VC dimension is well-defined. 
We then define the pseudo-dimension of the original set of stochastic strategies Π to be pVC(Π) = VC(Π').⁴\n\nHaving transformed the strategy class, we also need to transform the POMDP, by augmenting the state space S to be S × [0,1]. Informally, the transitions and rewards remain the same, except that after each state transition, we draw a new random variable r uniformly in [0,1], and independently of all previous events. States are now of the form (s, r), and we let r be an observed variable. Whenever in the original POMDP a stochastic strategy π would have been given a history h, in the transformed POMDP the corresponding deterministic transformed strategy π' is given (h, r), where r is the [0,1]-random variable at the current state. By the definition of π', it is easy to see that π' and π have exactly the same chance of choosing each action at any node (randomization over r).\n\nWe are now back in the deterministic case, so Theorem 3.2 applies, with VC(Π) replaced by pVC(Π) = VC(Π'), and we again have the desired uniform convergence result.\n\n⁴ This is equivalent to the conventional definition of the pseudo-dimension of Π [4], when it is viewed as a set of maps into real-valued action-probabilities.\n\n4  Algorithms for Approximate Planning\n\nGiven a generative model for a POMDP, the preceding section's results immediately suggest a class of approximate planning algorithms: generate m trajectory trees T_1, ..., T_m, and search for a π ∈ Π that maximizes V̂^π(s_0) = (1/m) Σ_i R(π, T_i). The following corollary to the uniform convergence results establishes the soundness of this approach.\n\nCorollary 4.1  Let Π be a class of strategies in a POMDP M, and let the number m of trajectory trees be as given in Theorem 3.2. 
Let π̂ = argmax_{π∈Π} {V̂^π(s_0)} be the policy in Π with the highest empirical return on the m trees. Then with probability 1 - δ, π̂ is near-optimal within Π:\n\nV^{π̂}(s_0) ≥ opt(M, Π) - 2ε.    (2)\n\nIf the suggested maximization is computationally infeasible, one can search for a local maximum π instead, and uniform convergence again assures us that V̂^π(s_0) is a trusted estimate of our true performance. Of course, even finding a local maximum can be expensive, since each trajectory tree is of size exponential in H_ε.\n\nHowever, in practice it may be possible to significantly reduce the cost of the search. Suppose we are using a class of (possibly transformed) deterministic strategies, and we perform a greedy local search over Π to optimize V̂^π(s_0). Then at any time in the search, to evaluate the policy we are currently considering, we really need to look at only a single path of length H_ε in each tree, corresponding to the path taken by the strategy being considered. Thus, we should build the trajectory trees lazily, that is, incrementally build each node of each tree only as it is needed to evaluate R(π, T_i) for the current strategy π. If there are parts of a tree that are reached only by poor policies, then a good search algorithm may never even build these parts of the tree. In any case, for a fixed number of trees, each step of the local search now takes time only linear in H_ε.⁵\n\nThere is a different approach that works directly on stochastic strategies (that is, without requiring the transformation to deterministic strategies). In this case each stochastic strategy π defines a distribution over all the paths in a trajectory tree, and thus calculating R(π, T) may in general require examining complete trees. 
However, we can view each trajectory tree as a small, deterministic POMDP by itself, with the children of each node in the tree being its successor nodes. So if Π = {π_θ : θ ∈ ℝ^d} is a smoothly parameterized family of stochastic strategies, then algorithms such as Williams' REINFORCE [10] can be used to find an unbiased estimate of the gradient (d/dθ) V̂^{π_θ}(s_0), which in turn can be used to perform stochastic gradient ascent to maximize V̂^{π_θ}(s_0). Moreover, for a fixed number of trees, these algorithms need only O(H_ε) time per gradient estimate; so combined with lazy tree construction, we again have a practical algorithm whose per-step complexity is only linear in the horizon time. This line of thought is further developed in the long version of the paper.⁶\n\n⁵ See also (Ng and Jordan, in preparation) which, by assuming a much stronger model of a POMDP (a deterministic function f such that f(s, a, r) is distributed according to P(·|s, a) when r is distributed Uniform[0,1]), gives an algorithm that enjoys uniform convergence bounds similar to those presented here, but with only a polynomial rather than exponential dependence on H_ε. The algorithm samples a number of vectors r^(i) ∈ [0,1]^{H_ε}, each of which, with f, defines an H_ε-step Monte Carlo evaluation trial for any policy π. The bound is on the number of such random vectors needed (rather than on the total number of calls to f).\n\n5  The Random Trajectory Method\n\nUsing a fully observable generative model of a POMDP, we have shown that the trajectory tree method gives uniformly good value estimates, with an amount of experience linear in VC(Π), and exponential in H_ε. It turns out we can significantly weaken the generative model, yet still obtain essentially the same theoretical results. 
In this harder case, we assume a generative model that provides only partially observable histories generated by a truly random strategy (which takes each action with equal probability at every step, regardless of the history so far). Furthermore, these trajectories always begin at the designated start state, so there is no ability provided to \"reset\" the POMDP to any state other than s_0. (Indeed, underlying states may never be observed.)\n\nOur method for this harder case is called the Random Trajectory method. It seems to lead less readily to practical algorithms than the trajectory tree method, and its formal description and analysis, which is more difficult than for trajectory trees, are given in the long version of this paper [6]. As in Theorem 3.2, we prove that the amount of data needed is linear in VC(Π), and exponential in the horizon time; that is, by averaging appropriately over the resulting ensemble of trajectories generated, this amount of data is sufficient to yield uniformly good estimates of the values for all strategies in Π.\n\nReferences\n\n[1] L. Baird and A. W. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11, 1999.\n\n[2] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 1999.\n\n[3] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. UAI, pages 33-42, 1998.\n\n[4] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78-150, 1992.\n\n[5] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 1998. 
\n\n[6] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. (long version), 1999.\n\n[7] D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.\n\n[8] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.\n\n[9] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.\n\n[10] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.\n\n⁶ In the full paper, we also show how these algorithms can be extended to find in expected O(H_ε) time an unbiased estimate of the gradient of the true value V^{π_θ}(s_0) for discounted infinite horizon problems (whereas most current algorithms either only converge asymptotically to an unbiased estimate of this gradient, or need an absorbing state and \"proper\" strategies).\n", "award": [], "sourceid": 1664, "authors": [{"given_name": "Michael", "family_name": "Kearns", "institution": null}, {"given_name": "Yishay", "family_name": "Mansour", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}