{"title": "Automated Hierarchy Discovery for Planning in Partially Observable Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 225, "page_last": 232, "abstract": null, "full_text": "Automated Hierarchy Discovery for Planning in Partially Observable Environments\n\nLaurent Charlin & Pascal Poupart\nDavid R. Cheriton School of Computer Science\nFaculty of Mathematics, University of Waterloo\nWaterloo, Ontario\n{lcharlin,ppoupart}@cs.uwaterloo.ca\n\nRomy Shioda\nDept of Combinatorics and Optimization\nFaculty of Mathematics, University of Waterloo\nWaterloo, Ontario\nrshioda@math.uwaterloo.ca\n\nAbstract\n\nPlanning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy.\n\n1 Introduction\n\nPlanning in partially observable domains is a notoriously difficult problem. 
However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Such decompositions can be exploited in planning to temporally abstract sub-policies into macro actions (a.k.a. options). Pineau et al. [17], Theocharous et al. [22], and Hansen and Zhou [10] proposed various algorithms that speed up planning in partially observable domains by exploiting the decompositions induced by a hierarchy. However, these approaches assume that a policy hierarchy is specified by the user, so an important question arises: how can we automate the discovery of a policy hierarchy? In fully observable domains, there exists a large body of work on hierarchical Markov decision processes and reinforcement learning [6, 21, 7, 15] and several hierarchy discovery techniques have been proposed [23, 13, 11, 20]. However, those techniques rely on the assumption that states are fully observable to detect abstractions and subgoals, which prevents their use in partially observable domains.\n\nWe propose to frame hierarchy and policy discovery as an optimization problem with variables corresponding to the hierarchy and policy parameters. We present an approach that searches in the space of hierarchical controllers [10] for a good hierarchical policy. The search leads to a difficult non-convex optimization problem that we tackle using three approaches: generic non-linear solvers, a mixed-integer non-linear programming approximation, or an alternating optimization technique that can be thought of as a form of hierarchical bounded policy iteration. We also generalize Hansen and Zhou's hierarchical controllers [10] to allow recursive controllers. These are controllers that may recursively call themselves, allowing them to represent with finitely many parameters policies that would otherwise require infinitely many parameters. 
Recursive policies are likely to arise in language processing tasks such as dialogue management and text generation due to the recursive nature of language models.\n\n2 Finite State Controllers\n\nWe first review partially observable Markov decision processes (POMDPs) (Sect. 2.1), the framework used throughout the paper for planning in partially observable domains. Then we review how to represent POMDP policies as finite state controllers (Sect. 2.2) as well as some algorithms to optimize controllers of a fixed size (Sect. 2.3).\n\n2.1 POMDPs\n\nPOMDPs have emerged as a popular framework for planning in partially observable domains [12]. A POMDP is formally defined by a tuple (S, O, A, T, Z, R, γ) where S is the set of states, O is the set of observations, A is the set of actions, T(s′, s, a) = Pr(s′|s, a) is the transition function, Z(o, s′, a) = Pr(o|s′, a) is the observation function, R(s, a) = r is the reward function and γ ∈ [0, 1) is the discount factor. It will be useful to view γ as a termination probability. This will allow us to absorb γ into the transition probabilities by defining discounted transition probabilities: Pr_γ(s′|s, a) = γ Pr(s′|s, a). Given a POMDP, the goal is to find a course of action that maximizes the expected total reward. To select actions, the system can only use the information available in past actions and observations. Thus we define a policy π as a mapping from histories of past actions and observations to actions. Since histories may become arbitrarily long, we can alternatively define policies as mappings from beliefs to actions (i.e., π(b) = a). A belief b(s) = Pr(s) is a probability distribution over states, taking into account the information provided by past actions and observations. 
Given a belief b, after executing a and receiving o, we can compute an updated belief b^{a,o} using Bayes' theorem: b^{a,o}(s′) = k Σ_s b(s) Pr(s′|s, a) Pr(o|s′, a), where k is a normalization constant. The value V^π of policy π when starting with belief b is measured by the expected sum of the future rewards: V^π(b) = Σ_t γ^t R(b_t, π(b_t)), where R(b, a) = Σ_s b(s) R(s, a). An optimal policy π* is a policy with the highest value V* for all beliefs (i.e., V*(b) ≥ V^π(b) ∀b, π). The optimal value function also satisfies Bellman's equation: V*(b) = max_a [R(b, a) + γ Σ_o Pr(o|b, a) V*(b^{a,o})], where Pr(o|b, a) = Σ_{s,s′} b(s) Pr(s′|s, a) Pr(o|s′, a).\n\n2.2 Policy Representation\n\nA convenient representation for an important class of policies is that of finite state controllers [9]. A finite state controller consists of a finite state automaton (N, E) with a set N of nodes and a set E of directed edges. Each node n has one outgoing edge per observation. A controller encodes a policy π = (α, β) by mapping each node to an action (i.e., α(n) = a) and each edge (referred to by its observation label o and its parent node n) to a successor node (i.e., β(n, o) = n′). At runtime, the policy encoded by a controller is executed by doing the action a_t = α(n_t) associated with the node n_t traversed at time step t and following the edge labelled with observation o_t to reach the next node n_{t+1} = β(n_t, o_t).\n\nStochastic controllers [18] can also be used to represent stochastic policies by redefining α and β as distributions over actions and successor nodes. More precisely, let Pr_α(a|n) be the distribution from which an action a is sampled in node n and let Pr_β(n′|n, a, o) be the distribution from which the successor node n′ is sampled after executing a and receiving o in node n. 
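Executing the policy encoded by such a stochastic controller only requires sampling from Pr_α and Pr_β at each step; a minimal sketch (all sizes, distributions and the `observe` callback are hypothetical toy values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 2 nodes, 2 actions, 2 observations.
N, A, O = 2, 2, 2

# pr_alpha[n, a]: action distribution Pr_alpha(a|n); rows sum to 1.
pr_alpha = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
# pr_beta[n, a, o, n']: successor distribution Pr_beta(n'|n, a, o).
pr_beta = np.full((N, A, O, N), 0.5)

def step(node, observe):
    """One controller step: sample an action from Pr_alpha, receive an
    observation from the environment, sample the successor from Pr_beta."""
    a = rng.choice(A, p=pr_alpha[node])
    o = observe(a)  # environment callback returning an observation index
    n_next = rng.choice(N, p=pr_beta[node, a, o])
    return a, o, n_next

node = 0  # start in node n0
for _ in range(5):
    a, o, node = step(node, observe=lambda a: int(rng.integers(O)))
```

Note that the execution never needs a belief: all memory about the action-observation history is summarized by the current node.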
The value of a controller is computed by solving the following system of linear equations:\n\nV^π_n(s) = Σ_a Pr_α(a|n) [R(s, a) + Σ_{s′,o,n′} Pr_γ(s′|s, a) Pr(o|s′, a) Pr_β(n′|n, a, o) V^π_{n′}(s′)]   ∀n, s   (1)\n\nWhile there always exists an optimal policy representable by a deterministic controller, this controller may have a very large (possibly infinite) number of nodes. Given time and memory constraints, it is common practice to search for the best controller with a bounded number of nodes [18]. However, when the number of nodes is fixed, the best controller is not necessarily deterministic. This explains why searching in the space of stochastic controllers may be advantageous.\n\nTable 1: Quadratically constrained optimization program for bounded stochastic controllers [1]. The variables are x = Pr(n′, a|n, o) and y = V_n(s).\n\nmax_{x,y} Σ_s b₀(s) V_{n₀}(s)\n\ns.t. V_n(s) = Σ_{a,n′} [Pr(n′, a|n, o_k) R(s, a) + Σ_{s′,o} Pr_γ(s′|s, a) Pr(o|s′, a) Pr(n′, a|n, o) V_{n′}(s′)]   ∀n, s\n\nPr(n′, a|n, o) ≥ 0   ∀n′, a, n, o\n\nΣ_{n′,a} Pr(n′, a|n, o) = 1   ∀n, o\n\nΣ_{n′} Pr(n′, a|n, o) = Σ_{n′} Pr(n′, a|n, o_k)   ∀a, n, o\n\n2.3 Optimization of Stochastic Controllers\n\nThe optimization of a stochastic controller with a fixed number of nodes can be formulated as a quadratically constrained optimization problem (QCOP) [1]. The idea is to maximize V^π by varying the controller parameters Pr_α and Pr_β. Table 1 describes the optimization problem with V_n(s) and the joint distribution Pr(n′, a|n, o) = Pr_α(a|n) Pr_β(n′|n, a, o) as variables. The first set of constraints corresponds to those of Eq. 1 while the remaining constraints ensure that Pr(n′, a|n, o) is a proper distribution and that Σ_{n′} Pr(n′, a|n, o) = Pr_α(a|n) ∀o (here o_k denotes an arbitrary fixed observation). This optimization program is non-convex due to the first set of constraints. Hence, existing techniques can at best guarantee convergence to a local optimum. Several techniques have been tried including gradient ascent [14], stochastic local search [3], bounded policy iteration (BPI) [18] and a general non-linear solver called SNOPT (based on sequential quadratic programming) [1, 8]. Empirically, biased-BPI (a version of BPI that biases its search towards the belief region reachable from a given initial belief state) and SNOPT have been shown to outperform the other approaches on some benchmark problems [19, 1]. We quickly review BPI since it will be extended in Section 3.2 to optimize hierarchical controllers. BPI alternates between policy evaluation and policy improvement. Given a policy with fixed parameters Pr(a, n′|n, o), policy evaluation solves the linear system in Eq. 1 to find V_n(s) for all n, s. Policy improvement can be viewed as a linear simplification of the program in Table 1 achieved by fixing V_{n′}(s′) in the right hand side of the first set of constraints. Policy improvement is achieved by optimizing the controller parameters Pr(n′, a|n, o) and the value V_n(s) on the left hand side.1\n\n3 Hierarchical controllers\n\nHansen and Zhou [10] recently proposed hierarchical finite-state controllers as a simple and intuitive way of encoding hierarchical policies. A hierarchical controller consists of a set of nodes and edges as in a flat controller; however, some nodes may be abstract, corresponding to subcontrollers themselves. As with flat controllers, concrete nodes are parameterized with an action mapping α and edges outgoing from concrete nodes are parameterized by a successor node mapping β. In contrast, abstract nodes are parameterized by a child node mapping indicating in which child node the subcontroller should start. 
Hansen and Zhou consider two schemes for the edges leaving abstract nodes: either there is a single outgoing edge labelled with a null observation or there is one edge per terminal node of the subcontroller labelled with an abstract observation identifying the node in which the subcontroller terminated.\n\nSubcontrollers encode full POMDP policies with the addition of a termination condition. In fully observable domains, it is customary to stop the subcontroller once a goal state (from a predefined set of terminal states) is reached. This strategy cannot work in partially observable domains, so Hansen and Zhou propose to terminate a subcontroller when an end node (from a predefined set of terminal nodes) is reached. Since the decision to reach a terminal node is made according to the successor node mapping β, the timing for returning control is implicitly optimized. Hansen and Zhou propose to use |A| terminal nodes, each mapped to a different action. Terminal nodes do not have any outgoing edges nor any action mapping since they already have an action assigned.\n\n1 Note however that this optimization may decrease the value of some nodes, so [18] add an additional constraint to ensure monotonic improvement by forcing V_n(s) on the left hand side to be at least as high as V_n(s) on the right hand side.\n\nThe hierarchy of the controller is assumed to be finite and specified by the programmer. Subcontrollers are optimized in isolation in a bottom-up fashion. Subcontrollers at the bottom level are made up only of concrete nodes and therefore can be optimized as usual using any controller optimization technique. Controllers at other levels may contain abstract nodes for which we have to define the reward function and the transition probabilities. Recall that abstract nodes are not mapped to concrete actions, but rather to children nodes. 
Hence, the immediate reward of an abstract node n̄ corresponds to the value V_{α(n̄)}(s) of its child node α(n̄). Similarly, the probability of reaching state s′ after executing the subcontroller of an abstract node n̄ corresponds to the probability Pr(s_end|s, α(n̄)) of terminating the subcontroller in s_end when starting in s at child node α(n̄). This transition probability can be computed by solving the following linear system:\n\nPr(s_end|s, n) =\n  1   when n is a terminal node and s = s_end\n  0   when n is a terminal node and s ≠ s_end\n  Σ_{o,s′} Pr(s′|s, α(n)) Pr(o|s′, α(n)) Pr(s_end|s′, β(n, o))   otherwise   (2)\n\nSubcontrollers with abstract actions correspond to partially observable semi-Markov decision processes (POSMDPs) since the duration of each abstract action may vary. The duration of an action is important to determine the amount by which future rewards should be discounted. Hansen and Zhou propose to use the mean duration to determine the amount of discounting; however, this approach does not work. In particular, abstract actions with non-zero probability of never terminating have an infinite mean duration. Instead, we propose to absorb the discount factor into the transition distribution (i.e., Pr_γ(s′|s, a) = γ Pr(s′|s, a)). This avoids all issues related to discounting and allows us to solve POSMDPs with the same algorithms as POMDPs. Hence, given the abstract reward function R(s, α(n̄)) = V_{α(n̄)}(s) and the abstract transition function Pr_γ(s′|s, α(n̄)) obtained by solving the linear system in Eq. 2, we have a POSMDP which can be optimized using any POMDP optimization technique (as long as the discount factor is absorbed into the transition function).\n\nHansen's hierarchical controllers have two limitations: the hierarchy must have a finite number of levels and it must be specified by hand. 
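For a deterministic subcontroller, the linear system of Eq. 2 can be solved directly by flattening the (s, n) pairs into a single index; a minimal numpy sketch with hypothetical toy quantities (here the discount is absorbed into the transitions, as proposed above, so row sums give the discounted termination probability):

```python
import numpy as np

# Hypothetical toy subcontroller (all sizes and values invented).
S, A, O = 3, 2, 2
nodes = [0, 1]          # concrete nodes
TERMINAL = 2            # single terminal node index
alpha = {0: 0, 1: 1}    # action mapping alpha(n)
beta = {(0, 0): 1, (0, 1): TERMINAL,
        (1, 0): 0, (1, 1): TERMINAL}  # successor mapping beta(n, o)

rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(S), size=(S, A))   # T[s, a, s'] = Pr(s'|s, a)
Z = rng.dirichlet(np.ones(O), size=(S, A))   # Z[s', a, o] = Pr(o|s', a)
gamma = 0.95
T_gamma = gamma * T                          # discount absorbed into transitions

# Unknowns: P[i, s_end] = Pr(s_end | s, n) for non-terminal (s, n) pairs.
# Eq. 2: Pr(s_end|s,n) = sum_{o,s'} T[s,a,s'] Z[s',a,o] Pr(s_end|s',beta(n,o)).
idx = {(s, n): i for i, (s, n) in
       enumerate((s, n) for s in range(S) for n in nodes)}
M = np.eye(len(idx))
b = np.zeros((len(idx), S))
for (s, n), i in idx.items():
    a = alpha[n]
    for o in range(O):
        n2 = beta[(n, o)]
        for s2 in range(S):
            w = T_gamma[s, a, s2] * Z[s2, a, o]
            if n2 == TERMINAL:
                b[i, s2] += w   # boundary case: terminal node reached in s2
            else:
                M[i, idx[(s2, n2)]] -= w
P = np.linalg.solve(M, b)       # P[idx[(s, n)], s_end] = Pr(s_end | s, n)
```

Because γ < 1, the system matrix is strictly diagonally dominant, so the solve is always well posed.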
In the next section we describe recursive controllers, which may have infinitely many levels. We also describe an algorithm to discover a suitable hierarchy by simultaneously optimizing the controller parameters and hierarchy.\n\n3.1 Recursive Controllers\n\nIn some domains, policies are naturally recursive in the sense that they decompose into subpolicies that may call themselves. This is often the case in language processing tasks since language models such as probabilistic context-free grammars are composed of recursive rules. Recent work in dialogue management uses POMDPs to make high level discourse decisions [24]. Assuming POMDP dialogue management eventually handles decisions at the sentence level, recursive policies will naturally arise. Similarly, language generation with POMDPs would naturally lead to recursive policies that reflect the recursive nature of language models.\n\nWe now propose several modifications to Hansen and Zhou's hierarchical controllers that simplify things while allowing recursive controllers. First, the subcontrollers of abstract nodes may be composed of any node (including the parent node itself) and transitions can be made to any node anywhere (whether concrete or abstract). This allows recursive controllers and smaller controllers since nodes may be shared across levels. Second, we use a single terminal node that has neither an action nor any outgoing edge. It is a virtual node simply used to signal the termination of a subcontroller. Third, while abstract nodes lead to the execution of a subcontroller, they are also associated with an action. This action is executed upon termination of the subcontroller. Hence, the actions that were associated with the terminal nodes in Hansen and Zhou's proposal are associated with the abstract nodes in our proposal. This allows a uniform parameterization of actions for all nodes while reducing the number of terminal nodes to 1. 
Fourth, the outer edges of abstract nodes are labelled with regular observations since an observation will be made following the execution of the action of an abstract node. Finally, to circumvent all issues related to discounting, we absorb the discount factor into the transition probabilities (i.e., Pr_γ(s′|s, a)).\n\n[Figure 1: The figures represent controllers and transitions as written in Equations 5 and 6b. Alongside the directed edges we've indicated the equivalent part of the equations which they correspond to: Pr(n_beg|n̄), oc(s_end, n_end|s, n_beg) and Pr(n′, a|n̄, o).]\n\n3.2 Hierarchy and Policy Optimization\n\nWe formulate the search for a good stochastic recursive controller, including the automated hierarchy discovery, as an optimization problem (see Table 2). The global maximum of this optimization problem corresponds to the optimal policy (and hierarchy) for a fixed set N of concrete nodes n and a fixed set N̄ of abstract nodes n̄. The variables consist of the value function V_n(s), the policy parameters Pr(n′, a|n, o), the (stochastic) child node mapping Pr(n′|n̄) for each abstract node n̄ and the occupancy frequency oc(n, s|n₀, s₀) of each (n, s)-pair when starting in (n₀, s₀). The objective (Eq. 3) is the expected value Σ_s b₀(s) V_{n₀}(s) of starting the controller in node n₀ with initial belief b₀. The constraints in Equations 4 and 5 respectively indicate the expected value of concrete and abstract nodes. 
The expected value of an abstract node corresponds to the sum of three terms: the expected value V_{n_beg}(s) of its subcontroller given by its child node n_beg, the reward R(s_end, a_n̄) immediately after the termination of the subcontroller and the future rewards V_{n′}(s′). Figure 1a illustrates graphically the relationship between the variables in Equation 5. Circles are state-node pairs labelled by their expected value. Edges indicate single transitions (solid line), sequences of transitions (dashed line) or the beginning/termination of a subcontroller (bold/dotted line). Edges are labelled with the corresponding transition probability variables.\n\nNote that the reward R(s_end, a_n̄) depends on the state s_end in which the subcontroller terminates. Hence we need to compute the probability that the last state visited in the subcontroller is s_end. This probability is given by the occupancy frequency oc(s_end, n_end|s, n_beg), which is recursively defined in Eq. 6 in terms of a preceding state-node pair which may be concrete (6a) or abstract (6b). Figure 1b illustrates graphically the relationship between the variables in Eq. 6b. Eq. 7 prevents infinite loops (without any action execution) in the child node mappings. The label function refers to the labelling of all abstract nodes, which induces an ordering on the abstract nodes. Only the nodes labelled with numbers larger than the label of an abstract node can be children of that abstract node. This constraint ensures that chains of child node mappings have a finite length, eventually reaching a concrete node where an action is executed. Constraints, like the ones in Table 1, are also needed to guarantee that the policy parameters and the child node mappings are proper distributions.\n\n3.3 Algorithms\n\nSince the problem in Table 2 has non-convex (quartic) constraints in Eq. 5 and 6, it is difficult to solve. 
Table 2: Non-convex quartically constrained optimization problem for hierarchy and policy discovery in bounded stochastic recursive controllers. The variables are w = oc(·,·|·,·), x = Pr(n′, a|·, o), y = V_(·)(·) and z = Pr(n_beg|n̄).\n\nmax_{w,x,y,z} Σ_{s∈S} b₀(s) V_{n₀}(s)   (3)\n\ns.t. V_n(s) = Σ_{a,n′} [Pr(n′, a|n, o_k) R(s, a) + Σ_{s′,o} Pr_γ(s′|s, a) Pr(o|s′, a) Pr(n′, a|n, o) V_{n′}(s′)]   ∀s, n   (4)\n\nV_n̄(s) = Σ_{n_beg} Pr(n_beg|n̄) [V_{n_beg}(s) + Σ_{s_end,a,n′} oc(s_end, n_end|s, n_beg) [Pr(n′, a|n̄, o_k) R(s_end, a) + Σ_{s′,o} Pr_γ(s′|s_end, a) Pr(o|s′, a) Pr(n′, a|n̄, o) V_{n′}(s′)]]   ∀s, n̄   (5)\n\noc(s′, n′|s₀, n₀) = δ(s′, n′, s₀, n₀)\n + Σ_{s,o,a,n concrete} oc(s, n|s₀, n₀) Pr_γ(s′|s, a) Pr(o|s′, a) Pr(n′, a|n, o)   (6a)\n + Σ_{s,s_end,o,a,n_beg,n̄ abstract} oc(s, n̄|s₀, n₀) Pr(n_beg|n̄) oc(s_end, n_end|s, n_beg) Pr_γ(s′|s_end, a) Pr(o|s′, a) Pr(n′, a|n̄, o)   (6b)\n   ∀s′, n′, s₀, n₀   (6)\n\nPr(n̄′|n̄) = 0 if label(n̄′) ≤ label(n̄)   ∀n̄, n̄′   (7)\n\nWe consider three approaches inspired by the techniques for non-hierarchical controllers:\n\nNon-convex optimization: Use a general non-linear solver, such as SNOPT, to directly tackle the optimization problem in Table 2. This is the most convenient approach; however, a globally optimal solution may not be found due to the non-convex nature of the problem.\n\nMixed-Integer Non-Linear Programming (MINLP): We restrict Pr(n′, a|n, o) and Pr(n_beg|n̄) to be binary (i.e., in {0, 1}). Since the optimal controller is often near deterministic in practice, this restriction tends to have a negligible effect on the value of the optimal controller. The problem is still non-convex but can be tackled with a mixed-integer non-linear solver such as MINLP_BB.2\n\n2 http://www-unix.mcs.anl.gov/~leyffer/solvers.html\n\nBounded Hierarchical Policy Iteration (BHPI): We alternate between (i) solving a simplified version of the optimization where some variables are fixed and (ii) updating the values of the fixed variables. More precisely, we fix V_{n′}(s′) in Eq. 5 and oc(s, n̄|s₀, n₀) in Eq. 6. As a result, Eq. 5 and 6 are now cubic, involving products of variables that include a single continuous variable. This permits the use of disjunctive programming [2] to linearize the constraints without any approximation. The idea is to replace any product BX (where B is binary and X is continuous) by a new continuous variable Y constrained by lb_X B ≤ Y ≤ ub_X B and X + (B − 1) ub_X ≤ Y ≤ X + (B − 1) lb_X, where lb_X and ub_X are lower and upper bounds on X. One can verify that those additional linear constraints force Y to be equal to BX. After applying disjunctive programming, we solve the resulting mixed-integer linear program (MILP) and update V_{n′}(s′) and oc(s, n̄|s₀, n₀) based on the new values for V_n(s) and oc(s′, n′|s₀, n₀). We repeat the process until convergence or until a pre-defined time limit is reached. 
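The product linearization used by BHPI can be sanity-checked numerically: for any binary B and any X within its bounds, the four linear constraints pin Y to exactly BX. A small sketch (bounds and sample values are arbitrary):

```python
def linearize_bounds(B, X, lb, ub):
    """Interval for Y implied by the disjunctive-programming constraints
    lb*B <= Y <= ub*B  and  X + (B-1)*ub <= Y <= X + (B-1)*lb."""
    lo = max(lb * B, X + (B - 1) * ub)
    hi = min(ub * B, X + (B - 1) * lb)
    return lo, hi

# For every X in [lb, ub] and binary B, the interval collapses to the
# single point Y = B * X, i.e., the linearization is exact.
lb, ub = -2.0, 5.0
for B in (0, 1):
    for X in (-2.0, 0.0, 3.5, 5.0):
        lo, hi = linearize_bounds(B, X, lb, ub)
        assert lo == hi == B * X
```

When B = 0 the first pair of constraints forces Y = 0, and when B = 1 the second pair forces Y = X, which is why no approximation is introduced.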
Although convergence cannot be guaranteed, in practice we have found BHPI to be monotonically increasing. Note that fixing V_{n′}(s′) and oc(s, n̄|s₀, n₀) while varying the policy parameters is reminiscent of policy iteration, hence the name bounded hierarchical policy iteration.\n\n3.4 Discussion\n\nDiscovering a hierarchy offers many advantages over previous methods that assume the hierarchy is already known. In situations where the user is unable to specify the hierarchy, our approach provides a principled way of discovering it. In situations where the user has a hierarchy in mind, it may be possible to find a better one. Note however that discovering the hierarchy while optimizing the policy is a much more difficult problem than simply optimizing the policy parameters. Additional variables (e.g., Pr(n′, a|n, o) and oc(s, n|s₀, n₀)) must be optimized and the degree of non-linearity increases. Our approach can also be used when the hierarchy and the policy are partly known. It is fairly easy to set the variables that are known or to reduce their range by specifying upper and lower bounds. This also has the benefit of simplifying the optimization problem.\n\nIt is also interesting to note that hierarchical policies may be encoded with exponentially fewer nodes in a hierarchical controller than in a flat controller. Intuitively, when a subcontroller is called by k abstract nodes, this subcontroller is shared by all its abstract parents. An equivalent flat controller would have to use k separate copies of the subcontroller. If a hierarchical controller has l levels with subcontrollers shared by k parents in each level, then the equivalent flat controller will need O(k^l) copies. By allowing recursive controllers, policies may be represented even more compactly. Recursive controllers allow abstract nodes to call subcontrollers that may contain themselves. 
An equivalent non-hierarchical controller would have to unroll the recursion by creating a separate copy of the subcontroller each time it is called. Since recursive controllers essentially call themselves infinitely many times, they can represent infinitely large non-recursive controllers with finitely many nodes. As a comparison, recursive controllers are to non-recursive hierarchical controllers what context-free grammars are to regular expressions. Since the leading approaches for controller optimization fix the number of nodes [18, 1], one may be able to find a much better policy by considering hierarchical recursive controllers. In addition, hierarchical controllers may be easier to understand and interpret than flat controllers given their natural decomposition into subcontrollers and their possibly smaller size.\n\n4 Experiments\n\nWe report on some preliminary experiments with three toy problems (paint, shuttle and maze) from the POMDP repository.3 We used the SNOPT package to directly solve the non-convex optimization problem in Table 2 and bounded hierarchical policy iteration (BHPI) to solve it iteratively.\n\nTable 3: Experiment results\n\nProblem | |S| | |A| | |O| | V* | Num. of Nodes | SNOPT Time | SNOPT V | BHPI Time | BHPI V | MINLP_BB Time | MINLP_BB V\nPaint | 4 | 4 | 2 | 3.3 | 4(3/1) | 2s | 0.48 | 13s | 3.29 | <1s | 3.29\nShuttle | 8 | 3 | 5 | 32.7 | 4(3/1) | 2s | 31.87 | 85s | 18.92 | 4s | 18.92\nShuttle | 8 | 3 | 5 | 32.7 | 6(4/2) | 6s | 31.87 | 7459s | 27.93 | 221s | 27.68\nShuttle | 8 | 3 | 5 | 32.7 | 7(4/3) | 26s | 31.87 | 10076s | 31.87 | N/A | –\nShuttle | 8 | 3 | 5 | 32.7 | 9(5/4) | 1449s | 30.27 | 10518s | 3.73 | N/A | –\n4x4 Maze | 16 | 4 | 2 | 3.7 | 3(2/1) | 3s | 3.15 | 397s | 3.21 | 30s | 3.73\n\n
Table 3 reports the running time and the value of the hierarchical policies found.4 For comparison purposes, the optimal value of each problem (copied from [4]) is reported in the column labelled by V*. We optimized hierarchical controllers of two levels with a fixed number of nodes reported in the column labelled \u201cNum. of Nodes\u201d. The numbers in parentheses indicate the number of nodes at the top level (left) and at the bottom level (right).5 In general, SNOPT finds the optimal solution with minimal computational time. In contrast, BHPI is less robust and takes up to several orders of magnitude longer. MINLP_BB returns good solutions for the smaller problems but is unable to find feasible solutions to the larger ones. We also looked at the hierarchy discovered for each problem and verified that it made sense. In particular, the hierarchy discovered for the paint problem matches the one hand-coded by Pineau in her PhD thesis [16]. Given the relatively small size of the test problems, these experiments should be viewed as a proof of concept that demonstrates the feasibility of our approach. More extensive experiments with larger problems will be necessary to demonstrate the scalability of our approach.\n\n5 Conclusion & Future Work\n\nThis paper proposes the first approach for hierarchy discovery in partially observable planning problems. We model the search for a good hierarchical policy as a non-convex optimization problem with variables corresponding to the hierarchy and policy parameters. We propose to tackle the optimization problem using non-linear solvers such as SNOPT or by reformulating the problem as an approximate MINLP or as a sequence of MILPs that can be thought of as a form of hierarchical bounded policy iteration. Preliminary experiments demonstrate the feasibility of our approach; however, further research is necessary to improve scalability. 
The approach can also be used in situations where a user would like to improve or learn part of the hierarchy. Many variables can then be set (or restricted to a smaller range), which simplifies the optimization problem and improves scalability.\n\nWe also generalize Hansen and Zhou's hierarchical controllers to recursive controllers. Recursive controllers can encode policies with finitely many nodes that would otherwise require infinitely large non-recursive controllers. Further details about recursive controllers and our other contributions can be found in [5]. We plan to further investigate the use of recursive controllers in dialogue management and text generation where recursive policies are expected to naturally capture the recursive nature of language models.\n\n3 http://pomdp.org/pomdp/examples/index.shtml\n4 N/A refers to a trial where the solver was unable to return a feasible solution to the problem.\n5 Since the problems are simple, the number of levels was restricted to two, though our approach permits any number of levels and does not require the number of levels nor the number of nodes per level to be specified.\n\nAcknowledgements: this research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Canada Foundation for Innovation (CFI) and the Ontario Innovation Trust (OIT).\n\nReferences\n\n[1] C. Amato, D. Bernstein, and S. Zilberstein. Solving POMDPs using quadratically constrained linear programs. In International Joint Conference on Artificial Intelligence (IJCAI), 2007.\n[2] E. Balas. Disjunctive programming. Annals of Discrete Mathematics, 5:3–51, 1979.\n[3] D. Braziunas and C. Boutilier. Stochastic local search for POMDP controllers. In AAAI, pages 690–696, 2004.\n[4] A. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, Brown University, Dept. 
of Computer Science, 1998.\n[5] L. Charlin. Automated hierarchy discovery for planning in partially observable domains. Master's thesis, University of Waterloo, 2006.\n[6] T. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. JAIR, 13:227–303, 2000.\n[7] M. Ghavamzadeh and S. Mahadevan. Hierarchical policy gradient algorithms. In T. Fawcett and N. Mishra, editors, ICML, pages 226–233. AAAI Press, 2003.\n[8] P. Gill, W. Murray, and M. Saunders. SNOPT: An SQP algorithm for large-scale constrained optimization. SIAM Review, 47(1):99–131, 2005.\n[9] E. Hansen. An improved policy iteration algorithm for partially observable MDPs. In NIPS, 1998.\n[10] E. Hansen and R. Zhou. Synthesis of hierarchical finite-state controllers for POMDPs. In E. Giunchiglia, N. Muscettola, and D. Nau, editors, ICAPS, pages 113–122. AAAI, 2003.\n[11] B. Hengst. Discovering hierarchy in reinforcement learning with HEXQ. In ICML, pages 243–250, 2002.\n[12] L. Kaelbling, M. Littman, and A. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.\n[13] A. McGovern and A. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, pages 361–368, 2001.\n[14] N. Meuleau, L. Peshkin, K.-E. Kim, and L. Kaelbling. Learning finite-state controllers for partially observable environments. In UAI, pages 427–436, 1999.\n[15] R. Parr. Hierarchical Control and Learning for Markov Decision Processes. PhD thesis, University of California at Berkeley, 1998.\n[16] J. Pineau. Tractable Planning Under Uncertainty: Exploiting Structure. PhD thesis, Robotics Institute, Carnegie Mellon University, 2004.\n[17] J. Pineau, G. Gordon, and S. Thrun. Policy-contingent abstraction for robust robot control. In UAI, pages 477–484, 2003.\n[18] P. Poupart and C. 
Boutilier. Bounded finite state controllers. In NIPS, 2003.\n[19] P. Poupart. Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. PhD thesis, University of Toronto, 2005.\n[20] M. Ryan. Using abstract models of behaviours to automatically generate reinforcement learning hierarchies. In ICML, pages 522–529, 2002.\n[21] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.\n[22] G. Theocharous, S. Mahadevan, and L. Kaelbling. Spatial and temporal abstractions in POMDPs applied to robot navigation. Technical Report MIT-CSAIL-TR-2005-058, Computer Science and Artificial Intelligence Laboratory, MIT, 2005.\n[23] S. Thrun and A. Schwartz. Finding structure in reinforcement learning. In NIPS, pages 385–392, 1994.\n[24] J. Williams and S. Young. Scaling POMDPs for dialogue management with composite summary point-based value iteration (CSPBVI). In AAAI Workshop on Statistical and Empirical Methods in Spoken Dialogue Systems, 2006.\n", "award": [], "sourceid": 2968, "authors": [{"given_name": "Laurent", "family_name": "Charlin", "institution": null}, {"given_name": "Pascal", "family_name": "Poupart", "institution": null}, {"given_name": "Romy", "family_name": "Shioda", "institution": null}]}