{"title": "Generating Long-term Trajectories Using Deep Hierarchical Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1543, "page_last": 1551, "abstract": "We study the problem of modeling spatiotemporal trajectories over long time horizons using expert demonstrations. For instance, in sports, agents often choose action sequences with long-term goals in mind, such as achieving a certain strategic position. Conventional policy learning approaches, such as those based on Markov decision processes, generally fail at learning cohesive long-term behavior in such high-dimensional state spaces, and are only effective when fairly myopic decision-making yields the desired behavior. The key difficulty is that conventional models are ``single-scale'' and only learn a single state-action policy. We instead propose a hierarchical policy class that automatically reasons about both long-term and short-term goals, which we instantiate as a hierarchical neural network. We showcase our approach in a case study on learning to imitate demonstrated basketball trajectories, and show that it generates significantly more realistic trajectories compared to non-hierarchical baselines as judged by professional sports analysts.", "full_text": "Generating Long-term Trajectories Using Deep\n\nHierarchical Networks\n\nStephan Zheng\n\nCaltech\n\nstzheng@caltech.edu\n\nYisong Yue\n\nCaltech\n\nyyue@caltech.edu\n\nPatrick Lucey\n\nSTATS\n\nplucey@stats.com\n\nAbstract\n\nWe study the problem of modeling spatiotemporal trajectories over long time\nhorizons using expert demonstrations. For instance, in sports, agents often choose\naction sequences with long-term goals in mind, such as achieving a certain strategic\nposition. 
Conventional policy learning approaches, such as those based on Markov\ndecision processes, generally fail at learning cohesive long-term behavior in such\nhigh-dimensional state spaces, and are only effective when fairly myopic decision-\nmaking yields the desired behavior. The key dif\ufb01culty is that conventional models\nare \u201csingle-scale\u201d and only learn a single state-action policy. We instead propose a\nhierarchical policy class that automatically reasons about both long-term and short-\nterm goals, which we instantiate as a hierarchical neural network. We showcase our\napproach in a case study on learning to imitate demonstrated basketball trajectories,\nand show that it generates signi\ufb01cantly more realistic trajectories compared to\nnon-hierarchical baselines as judged by professional sports analysts.\n\n1\n\nIntroduction\n\nModeling long-term behavior is a key challenge in many learning prob-\nlems that require complex decision-making. Consider a sports player\ndetermining a movement trajectory to achieve a certain strategic position.\nThe space of such trajectories is prohibitively large, and precludes conven-\ntional approaches, such as those based on simple Markovian dynamics.\nMany decision problems can be naturally modeled as requiring high-level,\nlong-term macro-goals, which span time horizons much longer than the\ntimescale of low-level micro-actions (cf. He et al. [8], Hausknecht and\nStone [7]). A natural example for such macro-micro behavior occurs in\nspatiotemporal games, such as basketball where players execute complex\ntrajectories. The micro-actions of each agent are to move around the\ncourt and, if they have the ball, dribble, pass or shoot the ball. These\nmicro-actions operate at the centisecond scale, whereas their macro-goals,\nsuch as \"maneuver behind these 2 defenders towards the basket\", span\nmultiple seconds. 
Figure 1 depicts an example from a professional basketball game, where the player\nmust make a sequence of movements (micro-actions) in order to reach a speci\ufb01c location on the\nbasketball court (macro-goal).\nIntuitively, agents need to trade-off between short-term and long-term behavior: often sequences of\nindividually reasonable micro-actions do not form a cohesive trajectory towards a macro-goal. For\ninstance, in Figure 1 the player (green) takes a highly non-linear trajectory towards his macro-goal of\npositioning near the basket. As such, conventional approaches are not well suited for these settings,\nas they generally use a single (low-level) state-action policy, which is only successful when myopic\nor short-term decision-making leads to the desired behavior.\n\nFigure 1: The player (green)\nhas two macro-goals: 1)\npass the ball (orange) and\n2) move to the basket.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fIn this paper, we propose a novel class of hierarchical policy models, which we instantiate using\nrecurrent neural networks, that can simultaneously reason about both macro-goals and micro-actions.\nOur model utilizes an attention mechanism through which the macro-policy guides the micro-policy.\nOur model is further distinguished from previous work on hierarchical policies by dynamically\npredicting macro-goals instead of following \ufb01xed goals, which gives additional \ufb02exibility to our\nmodel class that can be \ufb01tted to data (rather than having the macro-goals be speci\ufb01cally hand-crafted).\nWe showcase our approach in a case study on learning to imitate demonstrated behavior in professional\nbasketball. Our primary result is that our approach generates signi\ufb01cantly more realistic player\ntrajectories compared to non-hierarchical baselines, as judged by professional sports analysts. 
We\nalso provide a comprehensive qualitative and quantitative analysis, e.g., showing that incorporating\nmacro-goals can actually improve 1-step micro-action prediction accuracy.\n\n2 Related Work\n\nThe reinforcement learning community has largely focused on non-hierarchical policies such as those\nbased on Markovian or linear dynamics (cf. Ziebart et al. [17], Mnih et al. [11], Hausknecht and\nStone [7]). By and large, such policy classes are shown to be effective only when the optimal action\ncan be found via short-term planning. Previous research has instead focused on issues such as how\nto perform effective exploration, plan over parameterized action spaces, or deal with non-convexity\nissues from using deep neural networks. In contrast, we focus on developing hierarchical policies that\ncan effectively generate realistic long-term plans in complex settings such as basketball gameplay.\nThe use of hierarchical models to decompose macro-goals from micro-actions is relatively common\nin the planning community (cf. Sutton et al. [14], He et al. [8], Bai et al. [1]). For instance, the\nwinning team in the 2015 RoboCup Simulation Challenge (Bai et al. [1]) used a manually constructed\nhierarchical policy to solve MDPs with a set of fixed sub-tasks, while Konidaris et al. [10] segmented\ndemonstrations to construct a hierarchy of static macro-goals. In contrast, we study how one can\nlearn a hierarchical policy from a large amount of expert demonstrations that can adapt its policy in\nnon-Markovian environments with dynamic macro-goals.\nOur approach shares an affinity with behavioral cloning. One difference with previous work is that we\ndo not learn a reward function that induces such behavior (cf. Muelling et al. [12]). Another related\nline of research aims to develop efficient policies for factored MDPs (Guestrin et al. [6]), e.g. by\nlearning value functions over factorized state spaces for multi-agent systems. 
It may be possible that\nsuch approaches are also applicable for learning our hierarchical policy.\nAttention models for deep networks have mainly been applied to natural language processing, image\nrecognition and combinations thereof (Xu et al. [15]). In contrast to previous work which focuses on\nattention models of the input, our attention model is applied to the output by integrating control from\nboth the macro-policy and the micro-policy.\nRecent work on generative models for sequential data (Chung et al. [4]), such as handwriting\ngeneration, has combined latent variables with an RNN\u2019s hidden state to capture temporal variability\nin the input. In our work, we instead aim to learn semantically meaningful latent variables that are\nexternal to the RNN and reason about long-term behavior and goals.\nOur model shares conceptual similarities to the Dual Process framework (Evans and Stanovich\n[5]), which decomposes cognitive processes into fast, unconscious behavior (System 1) and slow,\nconscious behavior (System 2). This separation reflects our policy decomposition into a macro and\nmicro part. Other related work in neuroscience and cognitive science includes hierarchical models of\nlearning by imitation (Byrne and Russon [2]).\n\n3 Long-Term Trajectory Planning\n\nWe are interested in learning policies that can produce high quality trajectories, where quality is some\nglobal measure of the trajectory (e.g., realistic trajectories as in Figure 1). We first set notation:\n\n\u2022 At time t, an agent i is in state $s_t^i \\in S$ and takes action $a_t^i \\in A$. The full state and action are $s_t = \\{s_t^i\\}_{\\text{players } i}$, $a_t = \\{a_t^i\\}_{\\text{players } i}$. The history of events is $h_t = \\{(s_u, a_u)\\}_{0 \\le u < t}$.\n\u2022 Macro policies also use a goal space G, e.g. regions in the court that a player should reach.\n\u2022 Let $\\pi(s_t, h_t)$ denote a policy that maps state and history to a distribution over actions\n$P(a_t | s_t, h_t)$. If $\\pi$ is deterministic, the distribution is peaked around a specific action. We\nalso abuse notation to sometimes refer to $\\pi$ as deterministically taking the most probable\naction $\\pi(s_t, h_t) = \\operatorname{argmax}_{a \\in A} P(a | s_t, h_t)$ \u2013 this usage should be clear from context.\n\nFigure 3: The general structure of a 2-level hierarchical policy that consists of 1) a raw micro-policy $\\pi_{\\text{raw}}$, 2) a\nmacro-policy $\\pi_{\\text{macro}}$ and 3) a transfer function $\\phi$. For clarity, we suppressed the indices i, t in the image. The\nraw micro-policy learns optimal short-term policies, while the macro-policy is optimized to achieve long-term\nrewards. The macro-policy outputs a macro-goal $g_t^i = \\pi_{\\text{macro}}(s_t^i, h_t^i)$, which guides the raw micro-policy\n$u_t^i = \\pi_{\\text{raw}}(s_t^i, h_t^i)$ in order for the hierarchical policy $\\pi_{\\text{micro}}$ to achieve a long-term goal $g_t^i$. The hierarchical\npolicy $\\pi_{\\text{micro}} = \\psi(u_t^i, \\phi(g_t^i))$ uses a transfer function $\\phi$ and synthesis function $\\psi$, see (3) and Section 4.\n\nOur main research question is how to design a policy class that can capture the salient properties of\nhow expert agents execute trajectories. In particular, we present a general policy class that utilizes\na goal space G to guide its actions to create such trajectory histories. We show in Section 4 how to\ninstantiate this policy class as a hierarchical network that uses an attention mechanism to combine\nmacro-goals and micro-actions. 
In our case study on modeling basketball behavior (Section 5.1), we\ntrain such a policy to imitate expert demonstrations using a large dataset of tracked basketball games.\n\n3.1 Incorporating Macro-Goals\n\nOur main modeling assumption is that a policy should simultaneously optimize\nbehavior hierarchically on multiple well-separated timescales. We consider\ntwo distinct timescales (macro and micro-level), although our approach could\nin principle be generalized to even more timescales. During an episode $[t_0, t_1]$,\nan agent i executes a sequence of micro-actions $(a_t^i)_{t \\ge 0}$ that leads to a macro-goal\n$g_t^i \\in G$. We do not assume that the start and end times of an episode\nare fixed. For instance, macro-goals can change before they are reached. We\nfinally assume that macro-goals are relatively static on the timescale of the\nmicro-actions, that is: $dg_t^i/dt \\ll 1$.\nFigure 2 depicts an example of an agent with two unique macro-goals over a\n50-frame trajectory. At every timestep t, the agent executes a micro-action $a_t^i$,\nwhile the macro-goals $g_t^i$ change more slowly.\n\nFigure 2: Depicting two macro-goals (blue boxes) as an agent moves to the top left.\n\nWe model the interaction between a micro-action $a_t^i$ and a macro-goal $g_t^i$ through a raw micro-action\n$u_t^i \\in A$ that is independent of the macro-goal. The micro-policy is then defined as:\n\n$$a_t^i = \\pi_{\\text{micro}}(s_t, h_t) = \\operatorname{argmax}_a P^{\\text{micro}}(a | s_t, h_t) \\qquad (1)$$\n\n$$P^{\\text{micro}}(a_t^i | s_t, h_t) = \\int du \\, dg \\, P(a_t^i | u, g, s_t, h_t) P(u, g | s_t, h_t). \\qquad (2)$$\n\nHere, we model the conditional distribution $P(a_t^i | u, g, s_t, h_t)$ as a non-linear function of u, g:\n\n$$P(a_t^i | u_t^i, g_t^i, s_t, h_t) = \\psi(u_t^i, \\phi(g_t^i)), \\qquad (3)$$\n\nwhere \u03c6, \u03c8 are transfer and synthesis functions respectively that we make explicit in Section 4. Note\nthat (3) does not explicitly depend on $s_t, h_t$: although it is straightforward to generalize, this did not\nmake a significant difference in our experiments. This decomposition is shown in Figure 3 and can\nbe generalized to multiple scales l using multiple macro-goals gl and transfer functions \u03c6l.\n\n4 Hierarchical Policy Network\n\nFigure 3 depicts a high-level overview of our hierarchical policy class for generating long-term\nspatiotemporal trajectories. Both the raw micro-policy and macro-policy are instantiated as recurrent\nconvolutional neural networks, and the raw action and macro-goals are combined via an attention\nmechanism, which we elaborate on below.\nDiscretization and deep neural architecture. In general, when using continuous latent variables\ng, learning the model (1) is intractable, and one must resort to approximation methods. We choose\nto discretize the state-action and latent spaces. In the basketball setting, a state $s_t^i \\in S$ is naturally\nrepresented as a 1-hot occupancy vector of the basketball court. We then pose goal states $g_t^i$ as\nsub-regions of the court that i wants to reach, defined at a coarser resolution than S. As such, we\ninstantiate the macro and micro-policies as convolutional recurrent neural networks, which can\ncapture both predictive spatial patterns and non-Markovian temporal dynamics.\nAttention mechanism for integrating macro-goals and micro-actions. We model (3) as an attention,\ni.e. \u03c6 computes a softmax density $\\phi(g_t^i)$ over the output action space A and \u03c8 is an element-wise\n(Hadamard) product. Suppressing indices i, t and s, h for clarity, the distribution (3) becomes\n\n$$\\phi_k(g) = \\frac{\\exp h_\\phi(g)_k}{\\sum_j \\exp h_\\phi(g)_j}, \\quad P(a_k | u, g) \\propto P^{\\text{raw}}(u_k | s, h) \\cdot \\phi_k(g), \\quad k = 1, \\ldots, |A|, \\qquad (4)$$\n\nwhere $h_\\phi(g)$ is computed by a neural network that takes $P^{\\text{macro}}(g)$ as an input. 
Intuitively, this\nstructure captures the trade-off between the macro- and raw micro-policy. On the one hand, the\nraw micro-policy \u03c0raw aims for short-term optimality. On the other hand, the macro-policy \u03c0macro\ncan attend via \u03c6 to sequences of actions that lead to a macro-goal and bias the agent towards good\nlong-term behavior. The difference between u and \u03c6(g) thus reflects the trade-off that the hierarchical\npolicy learns between actions that are good for either short-term or long-term goals.\nMulti-stage learning. Given a set D of sequences of state-action tuples $(s_t, \\hat{a}_t)$, where the $\\hat{a}_t$ are\n1-hot labels (omitting the index i for clarity), the hierarchical policy network can be trained via\n\n$$\\theta^* = \\operatorname{argmin}_\\theta \\sum_D \\sum_{t=1}^{T} L_t(s_t, h_t, \\hat{a}_t; \\theta). \\qquad (5)$$\n\nGiven the hierarchical structure of our model class, we decompose the loss $L_t$ (omitting the index t):\n\n$$L(s, h, \\hat{a}; \\theta) = L_{\\text{macro}}(s, h, g; \\theta) + L_{\\text{micro}}(s, h, \\hat{a}; \\theta) + R(\\theta), \\qquad (6)$$\n\n$$L_{\\text{micro}}(s, h, \\hat{a}; \\theta) = \\sum_{k=1}^{A} \\hat{a}_k \\log \\left[ P^{\\text{raw}}(u_k | s, h; \\theta) \\cdot \\phi_k(g; \\theta) \\right], \\qquad (7)$$\n\nwhere $R_t(\\theta)$ regularizes the model weights \u03b8 and k indexes A discrete action-values. Although we\nhave ground truths $\\hat{a}_t$ for the observable micro-actions, in general we may not have labels for the\nmacro-goals $g_t$ that induce optimal long-term planning. As such, one would have to appeal to separate\nsolution methods to compute the posterior $P(g_t | s_t, h_t)$ which minimizes $L_{t,\\text{macro}}(s_t, h_t, g_t; \\theta)$.\nTo reduce complexity and given the non-convexity of (7), we instead follow a multi-stage learning\napproach with a set of weak labels $\\hat{g}_t$, $\\hat{\\phi}_t$ for the macro-goals $g_t$ and attention masks $\\phi_t = \\phi(g_t)$.\nWe assume access to such weak labels and only use them in the initial training phases. 
Here, we\nfirst train the raw micro-policy, macro-policy and attention individually, freezing the other parts of\nthe network. The policies $\\pi_{\\text{micro}}$, $\\pi_{\\text{macro}}$ and attention $\\phi$ can be trained using standard cross-entropy\nminimization with the labels $\\hat{a}_t$, $\\hat{g}_t$ and $\\hat{\\phi}_t$, respectively. In the final stage we fine-tune the entire\nnetwork on objective (5), using only $L_{t,\\text{micro}}$ and R. We found this approach capable of finding a good\ninitialization for fine-tuning and generating high-quality long-term trajectories.1 Another advantage\nof this approach is that the network can be trained using gradient descent during all stages.\n\n5 Case Study on Modeling Basketball Behavior\n\nWe applied our approach to modeling basketball behavior data. In particular, we focus on imitating\nthe players\u2019 movements, which is a challenging problem in the spatiotemporal planning setting.\n\n1 As $u_t$ and $\\phi(g_t)$ enter symmetrically into the objective (7), it is hypothetically possible that the network\nconverges to a symmetric phase where the predictions $u_t$ and $\\phi(g_t)$ become identical along the entire trajectory.\nHowever, our experiments suggest that our multi-stage learning approach separates timescales well between the\nmicro- and macro-policy and prevents the network from settling in such a redundant symmetric phase.\n\nFigure 4: Network architecture and hyperparameters of the hierarchical policy network. For clarity, we\nsuppressed the indices i, t in the image. Max-pooling layers (numbers indicate kernel size) with unit stride\nupsample the sparse tracking data $s_t$. The policies $\\pi_{\\text{raw}}$, $\\pi_{\\text{macro}}$ use a convolutional (kernel size, stride) and GRU\nmemory (number of cells) stack to predict $u_t^i$ and $g_t^i$. Batch-normalization \"bn\" (Ioffe and Szegedy [9]) is applied\nto stabilize training. The output attention $\\phi$ is implemented by 2 fully-connected layers (number of output units).\nFinally, the network predicts the final output $\\pi_{\\text{micro}}(s_t, h_t) = \\pi_{\\text{raw}}(s_t, h_t) \\odot \\phi(g_t^i)$.\n\n5.1 Experimental Setup\n\nWe validated the hierarchical policy network (HPN) by learning a movement policy of individual\nbasketball players that predicts as the micro-action the instantaneous velocity $v_t^i = \\pi_{\\text{micro}}(s_t, h_t)$.\nTraining data. We trained the HPN on a large dataset of tracking data from professional basketball\ngames (Yue et al. [16]). The dataset consists of possessions of variable length: each possession is\na sequence of tracking coordinates $s_t^i = (x_t^i, y_t^i)$ for each player i, recorded at 25 Hz, where one\nteam has continuous possession of the ball. Since possessions last between 50 and 300 frames, we\nsub-sampled every 4 frames and used a fixed input sequence length of 50 to make training feasible.\nSpatially, we discretized the left half court using 400\u00d7380 cells of size 0.25ft \u00d7 0.25ft. For simplicity,\nwe modeled every player identically using a single policy network. The resulting input data for each\npossession is grouped into 4 channels: the ball, the player\u2019s location, his teammates, and the opposing\nteam. After this pre-processing, we extracted 130,000 tracks for training and 13,000 as a holdout set.\nLabels. We extracted micro-action labels $\\hat{v}_t^i = s_{t+1}^i - s_t^i$ as 1-hot vectors in a grid of 17 \u00d7 17 unit\ncells. Additionally, we constructed a set of weak macro-labels $\\hat{g}_t$, $\\hat{\\phi}_t$ by heuristically segmenting\neach track using its stationary points. The labels $\\hat{g}_t$ were defined as the next stationary point. For $\\hat{\\phi}_t$,\nwe used 1-hot velocity vectors $v_{t,\\text{straight}}^i$ along the straight path from the player\u2019s location $s_t^i$ to the\nmacro-goal $g_t^i$. We refer to the supplementary material for additional details.\nModel hyperparameters. To generate smooth rollouts while sub-sampling every 4 frames, we\nsimultaneously predicted the next 4 micro-actions $a_t, \\ldots, a_{t+3}$. A more general approach would\nmodel the dependency between look-ahead predictions as well, e.g. $P(\\pi_{t+\\Delta+1} | \\pi_{t+\\Delta})$. However, we\nfound that this variation did not outperform baseline models. We selected a network architecture to\nbalance performance and feasible training-time. The macro and micro-policy use GRU memory cells\n(Chung et al. [3]) and a memory-less 2-layer fully-connected network as the transfer function $\\phi$, as\ndepicted in Figure 4. We refer to the supplementary material for more details.\nBaselines. We compared the HPN against two natural baselines.\n\n1. A policy with only a raw micro-policy and memory (GRU-CNN) and without memory (CNN).\n2. A hierarchical policy network H-GRU-CNN-CC without an attention mechanism, which\ninstead learns the final output from a concatenation of the raw micro- and macro-policy.\n\nRollout evaluation. To evaluate the quality of our model, we generated rollouts $(s_t; h_{0,r_0})$ with burn-in\nperiod $r_0$. These are generated by 1) feeding a ground truth sequence of states $h_{0,r_0} = (s_0, \\ldots, s_{r_0})$\nto the policy network and 2) for $t > r_0$, predicting $a_t$ as the mode of the network output (1) and\nupdating the game-state $s_t \\to s_{t+1}$, using ground truth locations for the other agents.\n\n5.2 How Realistic are the Generated Trajectories?\n\nThe most holistic way to evaluate the trajectory rollouts is via visual analysis. Simply put, would a\nbasketball expert find the rollouts by HPN more realistic than those by the baselines? 
We begin by\nfirst visually analyzing some rollouts, and then present our human preference study results.\n\nFigure 5 panels: (a) HPN rollouts; (b) HPN rollouts; (c) HPN rollouts; (d) HPN (top) and failure case (bottom); (e) HPN (top), baseline (bottom).\n\nFigure 5: Rollouts generated by the HPN (columns a, b, c, d) and baselines (column e). Each frame shows\nan offensive player (dark green), a rollout (blue) track that extrapolates after 20 frames, the offensive team\n(light green) and defenders (red). Note we do not show the ball, as we did not use semantic basketball features\n(i.e. \u201ccurrently has the ball\u201d) during training. The HPN rollouts do not memorize training tracks (column a) and\ndisplay a variety of natural behavior, such as curving, moving towards macro-goals and making sharp turns\n(c, bottom). We also show a failure case (d, bottom), where the HPN behaves unnaturally by moving along a\nstraight line off the right side of the court \u2013 which may be fixable by adding semantic game state information.\nFor comparison, a hierarchical baseline without an attention model produces a straight-line rollout (column e,\nbottom), whereas the HPN produces a more natural movement curve (column e, top).\n\nModel comparison | Experts W/T/L | Experts Avg Gain | Non-Experts W/T/L | Non-Experts Avg Gain | All W/T/L | All Avg Gain\nVS-CNN | 21 / 0 / 4 | 0.68 | 15 / 9 / 1 | 0.56 | 21 / 0 / 4 | 0.68\nVS-GRU-CNN | 21 / 0 / 4 | 0.68 | 18 / 2 / 5 | 0.52 | 21 / 0 / 4 | 0.68\nVS-H-GRU-CNN-CC | 22 / 0 / 3 | 0.76 | 21 / 0 / 4 | 0.68 | 21 / 0 / 4 | 0.68\nVS-GROUND TRUTH | 11 / 0 / 14 | -0.12 | 10 / 4 / 11 | -0.04 | 11 / 0 / 14 | -0.12\n\nTable 1: Preference study results. 
We asked basketball experts and knowledgeable non-experts to judge the\nrelative quality of policy rollouts. We compare HPN with ground truth and 3 baselines: a memory-less (CNN )\nand memory-full (GRU-CNN ) micro-policy and a hierarchical policy without attention (GRU-CNN -CC). For\neach of 25 test cases, HPN wins if more judges preferred the HPN rollout over a competitor. Average gain is\nthe average signed vote (1 for always preferring HPN , and -1 for never preferring). We see that the HPN is\npreferred over all baselines (all results against baselines are signi\ufb01cant at the 95% con\ufb01dence level). Moreover,\nHPN is competitive with ground truth, indicating that HPN generates realistic trajectories within our rollout\nsetting. Please see the supplementary material for more details.\n\nVisualization. Figure 5 depicts example rollouts for our HPN approach and one example rollout for\na baseline. Every rollout consists of two parts: 1) an initial ground truth phase from the holdout set\nand 2) a continuation by either the HPN or ground truth. We note a few salient properties. First, the\nHPN does not memorize tracks, as the rollouts differ from the tracks it has trained on. Second, the\nHPN generates rollouts with a high dynamic range, e.g. they have realistic curves, sudden changes of\ndirections and move over long distances across the court towards macro-goals. For instance, HPN\ntracks do not move towards macro-goals in unrealistic straight lines, but often take a curved route,\nindicating that the policy balances moving towards macro-goals with short-term responses to the\ncurrent state (see also Figure 6b). In contrast, the baseline model often generates more constrained\nbehavior, such as moving in straight lines or remaining stationary for long periods of time.\nHuman preference study. Our primary empirical result is a preference study eliciting judgments on\nthe relative quality of rollout trajectories between HPN and baselines or ground truth. 
We recruited\nseven experts (professional sports analysts) and eight knowledgeable non-experts (e.g., college\nbasketball players) as judges.\n\n(a) Predicted distributions for attention masks \u03c6(g)\n(left column), velocity (micro-action) \u03c0micro (middle)\nand weighted velocity \u03c6(g) \u2299 \u03c0micro (right) for basketball\nplayers. The center corresponds to 0 velocity.\n\n(b) Rollout tracks and predicted macro-goals gt (blue\nboxes). The HPN starts the rollout after 20 frames.\nMacro-goal box intensity corresponds to relative prediction\nfrequency during the trajectory.\n\nFigure 6: Left: Visualizing how the attention mask generated from the macro-policy interacts with the micro-policy\n\u03c0micro. Row 1 and 2: the micro-policy \u03c0micro decides to stay stationary, but the attention \u03c6 goes to the left.\nThe weighted result \u03c6 \u2299 \u03c0micro is to move to the left, with a magnitude that is the average. Row 3) \u03c0micro wants to\ngo straight down, while \u03c6 boosts the velocity so the agent bends to the bottom-left. Row 4) \u03c0micro goes straight\nup, while \u03c6 goes left: the agent goes to the top-left. Row 5) \u03c0micro remains stationary, but \u03c6 prefers to move in\nany direction. As a result, the agent moves down. Right: The HPN dynamically predicts macro-goals and guides\nthe micro-policy in order to reach them. The macro-goal predictions are stable over a large number of timesteps.\nThe HPN sometimes predicts inconsistent macro-goals. 
For instance, in the bottom right frame, the agent moves\nto the top-left, but still predicts the macro-goal to be in the bottom-left sometimes.\n\nBecause all the learned policies perform better with a \u201cburn-in\u201d period, we \ufb01rst animated with the\nground truth for 20 frames (after subsampling), and then extrapolated with a policy for 30 frames.\nDuring extrapolation, the other nine players do not animate.2 For each test case, the judges were\nshown an animation of two rollout extrapolations of a speci\ufb01c player\u2019s movement: one generated by\nthe HPN, the other by a baseline or ground truth. The judges then chose which rollout looked more\nrealistic. Please see the supplementary material for details of the study.\nTable 1 shows the preference study results. We tested 25 scenarios (some corresponding to scenarios\nin Figure 6b). HPN won the vast majority of comparisons against the baselines using expert judges,\nwith slightly weaker but still very positive results using non-expert judgments. HPN was also\ncompetitive with ground truth. These results suggest that HPN can generate high-quality player\ntrajectories that are signi\ufb01cant improvements over baselines, and approach the ground truth quality in\nthis comparison setting.\n\n5.3 Analyzing Macro- and Micro-policy Integration\n\nOur model integrates the macro- and micro-policy by converting the macro-goal into an attention mask\non the micro-action output space, which intuitively guides the micro-policy towards the macro-goal.\nWe now inspect our macro-policy and attention mechanism to verify this behavior.\nFigure 6a depicts how the macro-policy \u03c0macro guides the micro-policy \u03c0micro through the attention \u03c6,\nby attending to the direction in which the agent can reach the predicted macro-goal. 
The attention\nmodel and micro-policy differ in semantic behavior: the attention favors a wider range of velocities\nand larger magnitudes, while the micro-policy favors smaller velocities.\n\n2 We chose this preference study design to focus the qualitative comparison on the plausibility of individual\nmovements (e.g. how players might practice alone), as opposed to strategically coordinated team movements.\n\nModel | \u2206 = 0 | \u2206 = 1 | \u2206 = 2 | \u2206 = 3 | Macro-goals g | Attention \u03c6\nCNN | 21.8% | 21.5% | 21.7% | 21.5% | - | -\nGRU-CNN | 25.8% | 25.0% | 24.9% | 24.4% | - | -\nH-GRU-CNN-CC | 31.5% | 29.9% | 29.5% | 29.1% | 10.1% | -\nH-GRU-CNN-STACK | 26.9% | 25.7% | 25.9% | 24.9% | 9.8% | -\nH-GRU-CNN-ATT | 33.7% | 31.6% | 31.0% | 30.5% | 10.5% | -\nH-GRU-CNN-AUX | 31.6% | 30.7% | 29.4% | 28.0% | 10.8% | 19.2%\n\nTable 2: Benchmark Evaluations. \u2206-step look-ahead prediction accuracy for micro-actions $a_{t+\\Delta}^i = \\pi(s_t)$\non a holdout set, with \u2206 = 0, 1, 2, 3. H-GRU-CNN-STACK is an HPN where predictions are organized in a\nfeed-forward stack, with $\\pi(s_t)_t$ feeding into $\\pi(s_t)_{t+1}$. Using attention (H-GRU-CNN-ATT) improves on all\nbaselines in micro-action prediction. All hierarchical models are pre-trained, but not fine-tuned, on macro-goals\n$\\hat{g}$. We report prediction accuracy on the weak labels $\\hat{g}$, $\\hat{\\phi}$ for hierarchical models. H-GRU-CNN-AUX is an HPN\nthat was trained using $\\hat{\\phi}$. As $\\hat{\\phi}$ optimizes for optimal long-term behavior, this lowers the micro-action accuracy.\n\nFigure 6b depicts predicted macro-goals by HPN along with rollout tracks. In general, we see that the\nrollouts are guided towards the predicted macro-goals. However, we also observe that the HPN makes\nsome inconsistent macro-goal predictions, which suggests there is still room for improvement.\n\n5.4 Benchmark Analysis\n\nFinally, we evaluated our approach using a number of benchmark experiments, with results shown in Table 2. 
These\nexperiments measure quantities that are related to overall quality, albeit not holistically. We \ufb01rst\nevaluated the 4-step look-ahead accuracy of the HPN for micro-actions ai\nt+\u2206, \u2206 = 0, 1, 2, 3. On this\nbenchmark, the HPN outperforms all baseline policy networks when not using weak labels \u02c6\u03c6 to train\nthe attention mechanism, which suggests that using a hierarchical model can noticeably improve the\nshort-term prediction accuracy over non-hierarchical baselines.\nWe also report the prediction accuracy on weak labels \u02c6g, \u02c6\u03c6, although they were only used during pre-\ntraining, and we did not \ufb01ne-tune for accuracy on them. Using weak labels one can tune the network\nfor both long-term and short-term planning, whereas all non-hierarchical baselines are optimized\nfor short-term planning only. Including the weak labels \u02c6\u03c6 can lower the accuracy on short-term\nprediction, but increases the quality of the on-policy rollouts. This trade-off can be empirically set by\ntuning the number of weak labels used during pre-training.\n\n6 Limitations and Future Work\n\nWe have presented a hierarchical memory network for generating long-term spatiotemporal trajec-\ntories. Our approach simultaneously models macro-goals and micro-actions and integrates them\nusing a novel attention mechanism. We demonstrated signi\ufb01cant improvement over non-hierarchical\nbaselines in a case study on modeling basketball player behavior.\nThere are several notable limitations to our HPN model. First, we did not consider all aspects of\nbasketball gameplay, such as passing and shooting. We also modeled all players using a single policy\nwhereas in reality player behaviors vary (although the variability can be low-dimensional (Yue et al.\n[16])). We only modeled offensive players: an interesting direction is modeling defensive players and\nintegrating adversarial reinforcement learning (Panait and Luke [13]) into our approach. 
These issues\nlimited the scope of our preference study, and it would be interesting to consider extended settings.\n\nIn order to focus on the HPN model class, we only used the imitation learning setting. More broadly,\nmany planning problems require collecting training data via exploration (Mnih et al. [11]), which can\nbe more challenging. One interesting scenario is having two adversarial policies learn to be strategic\nagainst each other through repeated game-play in a basketball simulator. Furthermore, in general it\ncan be difficult to acquire the appropriate weak labels to initialize the macro-policy training.\n\nFrom a technical perspective, using attention in the output space may be applicable to other\narchitectures. More sophisticated applications may require multiple levels of output attention masking.\n\nAcknowledgments. This research was supported in part by NSF Award #1564330, and a GPU donation (Tesla\nK40 and Titan X) by NVIDIA.\n\nReferences\n\n[1] Aijun Bai, Feng Wu, and Xiaoping Chen. Online planning for large Markov decision processes with\nhierarchical decomposition. ACM Transactions on Intelligent Systems and Technology (TIST), 6(4):45,\n2015.\n\n[2] Richard W Byrne and Anne E Russon. Learning by imitation: A hierarchical approach. Behavioral and\nbrain sciences, 21(05):667\u2013684, 1998.\n\n[3] Junyoung Chung, \u00c7aglar G\u00fcl\u00e7ehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural\nnetworks. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille,\nFrance, 6-11 July 2015, pages 2067\u20132075, 2015.\n\n[4] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A\nrecurrent latent variable model for sequential data. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,\nand R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2980\u20132988. Curran\nAssociates, Inc., 2015.\n\n[5] Jonathan St B. T.
Evans and Keith E. Stanovich. Dual-Process Theories of Higher Cognition: Advancing the\nDebate. Perspectives on Psychological Science, 8(3):223\u2013241, May 2013. ISSN 1745-6916, 1745-6924.\ndoi: 10.1177/1745691612460685.\n\n[6] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient Solution Algorithms\nfor Factored MDPs. J. Artif. Int. Res., 19(1):399\u2013468, October 2003. ISSN 1076-9757.\n\n[7] Matthew Hausknecht and Peter Stone. Deep reinforcement learning in parameterized action space. In\nProceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico,\nMay .\n\n[8] Ruijie He, Emma Brunskill, and Nicholas Roy. PUMA: Planning Under Uncertainty with Macro-Actions.\nIn Twenty-Fourth AAAI Conference on Artificial Intelligence, July 2010.\n\n[9] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by\nReducing Internal Covariate Shift. pages 448\u2013456, 2015.\n\n[10] George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew Barto. Robot learning from\ndemonstration by constructing skill trees. The International Journal of Robotics Research, 31(3):360\u2013375, March\n2012. ISSN 0278-3649, 1741-3176. doi: 10.1177/0278364911428653.\n\n[11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare,\nAlex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie,\nAmir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and\nDemis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529\u2013533,\nFebruary 2015. ISSN 0028-0836. doi: 10.1038/nature14236.\n\n[12] Katharina Muelling, Abdeslam Boularias, Betty Mohler, Bernhard Sch\u00f6lkopf, and Jan Peters. Learning\nstrategies in table tennis using inverse reinforcement learning. Biological Cybernetics, 108(5):603\u2013619,\nOctober 2014. ISSN 1432-0770.
doi: 10.1007/s00422-014-0599-1.\n\n[13] Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents\nand Multi-Agent Systems, 11(3):387\u2013434, 2005.\n\n[14] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for\ntemporal abstraction in reinforcement learning. Artificial Intelligence, 112(1\u20132):181\u2013211, August 1999.\nISSN 0004-3702. doi: 10.1016/S0004-3702(99)00052-1.\n\n[15] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard\nZemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual\nAttention. arXiv:1502.03044 [cs], February 2015. arXiv: 1502.03044.\n\n[16] Yisong Yue, Patrick Lucey, Peter Carr, Alina Bialkowski, and Iain Matthews. Learning Fine-Grained\nSpatial Models for Dynamic Sports Play Prediction. In IEEE International Conference on Data Mining\n(ICDM).\n\n[17] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse\nreinforcement learning. In AAAI, pages 1433\u20131438, 2008.\n", "award": [], "sourceid": 849, "authors": [{"given_name": "Stephan", "family_name": "Zheng", "institution": "Caltech"}, {"given_name": "Yisong", "family_name": "Yue", "institution": "Caltech"}, {"given_name": "Jennifer", "family_name": "Hobbs", "institution": "Stats"}]}