{"title": "Learning from Limited Demonstrations", "book": "Advances in Neural Information Processing Systems", "page_first": 2859, "page_last": 2867, "abstract": "We propose an approach to learning from demonstration (LfD) which leverages expert data, even if the expert examples are very few or inaccurate.  We achieve this by integrating LfD in an approximate policy iteration algorithm.  The key idea of our approach is that expert examples are used to generate linear constraints on the optimization, in a similar fashion to large-margin classification.  We prove an upper bound on the true Bellman error of the approximation computed by the algorithm at each iteration.  We show empirically that the algorithm outperforms both pure policy iteration, as well as DAgger (a state-of-art LfD algorithm) and supervised learning in a variety of scenarios, including when very few and/or imperfect demonstrations are available.  Our experiments include simulations as well as a real robotic navigation task.", "full_text": "Learning from Limited Demonstrations\n\nBeomjoon Kim\n\nSchool of Computer Science\n\nMcGill University\n\nMontreal, Quebec, Canada\n\nJoelle Pineau\n\nSchool of Computer Science\n\nMcGill University\n\nMontreal, Quebec, Canada\n\nAmir-massoud Farahmand\nSchool of Computer Science\n\nMcGill University\n\nMontreal, Quebec, Canada\n\nDoina Precup\n\nSchool of Computer Science\n\nMcGill University\n\nMontreal, Quebec, Canada\n\nAbstract\n\nWe propose a Learning from Demonstration (LfD) algorithm which leverages ex-\npert data, even if they are very few or inaccurate. We achieve this by using both\nexpert data, as well as reinforcement signals gathered through trial-and-error inter-\nactions with the environment. The key idea of our approach, Approximate Policy\nIteration with Demonstration (APID), is that expert\u2019s suggestions are used to de-\n\ufb01ne linear constraints which guide the optimization performed by Approximate\nPolicy Iteration. 
We prove an upper bound on the Bellman error of the estimate\ncomputed by APID at each iteration. Moreover, we show empirically that APID\noutperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm,\nand supervised learning in a variety of scenarios, including when very few and/or\nsuboptimal demonstrations are available. Our experiments include simulations as\nwell as a real robot path-\ufb01nding task.\n\n1\n\nIntroduction\n\nLearning from Demonstration (LfD) is a practical framework for learning complex behaviour poli-\ncies from demonstration trajectories produced by an expert. In most conventional approaches to\nLfD, the agent observes mappings between states and actions in the expert trajectories, and uses su-\npervised learning to estimate a function that can approximately reproduce this mapping. Ideally, the\nfunction (i.e., policy) should also generalize well to regions of the state space that are not observed\nin the demonstration data. Many of the recent methods focus on incrementally querying the expert in\nappropriate regions of the state space to improve the learned policy, or to reduce uncertainty [1, 2, 3].\nKey assumptions of most these works are that (1) the expert exhibits optimal behaviour, (2) the ex-\npert demonstrations are abundant, and (3) the expert stays with the learning agent throughout the\ntraining. In practice, these assumptions signi\ufb01cantly reduce the applicability of LfD.\nWe present a framework that leverages insights and techniques from the reinforcement learning\n(RL) literature to overcome these limitations of the conventional LfD methods. RL is a general\nframework for learning optimal policies from trial-and-error interactions with the environment [4,\n5]. The conventional RL approaches alone, however, might have dif\ufb01culties in achieving a good\nperformance from relatively little data. 
Moreover, they are not particularly cautious about the risk involved\nin trial-and-error learning, which could lead to catastrophic failures. A combination of both expert\nand interaction data (i.e., mixing LfD and RL), however, offers a tantalizing way to effectively\naddress challenging real-world policy learning problems under realistic assumptions.\nOur primary contribution is a new algorithmic framework that integrates LfD, tackled using a large\nmargin classifier, with a regularized Approximate Policy Iteration (API) method. The method is\nformulated as a coupled constraint convex optimization, in which expert demonstrations define a\nset of linear constraints in API. The optimization is formulated in a way that permits mistakes in\nthe demonstrations provided by the expert, and also accommodates variable availability of demonstrations\n(i.e., just an initial batch or continued demonstrations). We provide a theoretical analysis\ndescribing an upper bound on the Bellman error achievable by our approach.\nWe evaluate our algorithm in a simulated environment under various scenarios, such as varying the\nquality and quantity of expert demonstrations. In all cases, we compare our algorithm with Least-Squares\nPolicy Iteration (LSPI) [6], a popular API method, as well as with a state-of-the-art LfD\nmethod, Dataset Aggregation (DAgger) [1]. We also evaluate the algorithm\u2019s practicality in a real\nrobot path finding task, where there are few demonstrations, and exploration data is expensive due\nto limited time. In all of the experiments, our method outperformed LSPI, using less exploration\ndata and exhibiting significantly less variance. 
Our method also signi\ufb01cantly outperformed Dataset\nAggregation (DAgger), a state-of-art LfD algorithm, in cases where the expert demonstrations are\nfew or suboptimal.\n\n2 Proposed Algorithm\nWe consider a continuous-state, \ufb01nite-action discounted MDP (X ,A, P,R, \u03b3), where X is a\nmeasurable state space, A is a \ufb01nite set of actions, P : X \u00d7 A \u2192 M(X ) is the transition\nmodel, R : X \u00d7 A \u2192 M(R) is the reward model, and \u03b3 \u2208 [0, 1) is a discount factor.1 Let\nr(x, a) = E [R(\u00b7|x, a)], and assume that r is uniformly bounded by Rmax. A measurable mapping\n\u03c0 : X \u2192 A is called a policy. As usual, V \u03c0 and Q\u03c0 denote the value and action-value function for\n\u03c0, and V \u2217 and Q\u2217 denote the corresponding value functions for the optimal policy \u03c0\u2217 [5].\nOur algorithm is couched in the framework of API [7]. A standard API algorithm starts with an\ninitial policy \u03c00. At the (k + 1)th iteration, given a policy \u03c0k, the algorithm approximately evaluates\n\u03c0k to \ufb01nd \u02c6Qk, usually as an approximate \ufb01xed point of the Bellman operator T \u03c0k: \u02c6Qk \u2248 T \u03c0k \u02c6Qk.2\nThis is called the approximate policy evaluation step. Then, a new policy \u03c0k+1 is computed, which\nis greedy with respect to \u02c6Qk. There are several variants of API that mostly differ on how the approx-\nimate policy evaluation is performed. Most methods attempt to exploit the structures in the value\nfunction [8, 9, 10, 11], but in some problems one might have extra information about the structure\nof good or optimal policies as well. This is precisely our case, since we have expert demonstrations.\nTo develop the algorithm, we start with regularized Bellman error minimization, which is a common\n\ufb02avour of policy evaluation used in API. 
Suppose that we want to evaluate a policy \u03c0 given a batch\nof data DRL = {(Xi, Ai)}_{i=1}^{n} containing n examples, and that the exact Bellman operator T \u03c0 is\nknown. Then, the new value function \u02c6Q is computed as:\n\n$$\\hat{Q} \\leftarrow \\operatorname*{argmin}_{Q \\in \\mathcal{F}^{|\\mathcal{A}|}} \\|Q - T^{\\pi} Q\\|_n^2 + \\lambda J^2(Q), \\qquad (1)$$\n\nwhere F|A| is the set of action-value functions, the first term is the squared Bellman error evaluated\non the data (see footnote 3), J^2(Q) is the regularization penalty, which can prevent overfitting when F|A| is complex,\nand \u03bb > 0 is the regularization coefficient. The regularizer J(Q) measures the complexity of\nfunction Q. Different choices of F|A| and J lead to different notions of complexity, e.g., various\ndefinitions of smoothness, sparsity in a dictionary, etc. For example, F|A| could be a reproducing\nkernel Hilbert space (RKHS) and J its corresponding norm, i.e., J(Q) = \u2016Q\u2016_H.\nIn addition to DRL, we have a set of expert examples DE = {(Xi, \u03c0E(Xi))}_{i=1}^{m}, which we would\nlike to take into account in the optimization process. The intuition behind our algorithm is that\nwe want to use the expert examples to \u201cshape\u201d the value function where they are available, while\nusing the RL data to improve the policy everywhere else. 
Hence, even if we have few demonstration\nexamples, we can still obtain good generalization everywhere due to the RL data.\n\nFootnote 1: For a space \u2126 with \u03c3-algebra \u03c3\u2126, M(\u2126) denotes the set of all probability measures over \u03c3\u2126.\nFootnote 2: For discrete state spaces, (T \u03c0k Q)(x, a) = r(x, a) + \u03b3 \u2211_{x\u2032} P(x\u2032|x, a) Q(x\u2032, \u03c0k(x\u2032)).\nFootnote 3: \u2016Q \u2212 T \u03c0Q\u2016_n^2 \u225c (1/n) \u2211_{i=1}^{n} |Q(Xi, Ai) \u2212 (T \u03c0Q)(Xi, Ai)|^2 with (Xi, Ai) from DRL.\n\nTo incorporate the expert examples in the algorithm one might require that at the states Xi from DE,\nthe demonstrated action \u03c0E(Xi) be optimal, which can be expressed as a large-margin constraint:\nQ(Xi, \u03c0E(Xi)) \u2212 max_{a\u2208A\\\u03c0E(Xi)} Q(Xi, a) \u2265 1. However, this might not always be feasible, or\ndesirable (if the expert itself is not optimal), so we add slack variables \u03bei \u2265 0 to allow occasional\nviolations of the constraints (similar to soft vs. hard margin in the large-margin literature [12]). The\npolicy evaluation step can then be written as the following constrained optimization problem:\n\n$$\\hat{Q} \\leftarrow \\operatorname*{argmin}_{Q \\in \\mathcal{F}^{|\\mathcal{A}|},\\, \\xi \\in \\mathbb{R}_+^m} \\|Q - T^{\\pi} Q\\|_n^2 + \\lambda J^2(Q) + \\frac{\\alpha}{m} \\sum_{i=1}^{m} \\xi_i \\qquad (2)$$\n$$\\text{s.t.} \\quad Q(X_i, \\pi_E(X_i)) - \\max_{a \\in \\mathcal{A} \\setminus \\pi_E(X_i)} Q(X_i, a) \\ge 1 - \\xi_i \\quad \\text{for all } (X_i, \\pi_E(X_i)) \\in \\mathcal{D}_E.$$\n\nThe parameter \u03b1 balances the influence of the data obtained by the RL algorithm (generally by trial-and-error)\nvs. the expert data. When \u03b1 = 0, we obtain (1), while when \u03b1 \u2192 \u221e, we essentially\nsolve a structured classification problem based on the expert\u2019s data [13]. 
Note that the right side of\nthe constraints could also be multiplied by a coefficient \u2206i > 0, to set the size of the acceptable\nmargin between Q(Xi, \u03c0E(Xi)) and max_{a\u2208A\\\u03c0E(Xi)} Q(Xi, a). Such a coefficient can then be\nset adaptively for different examples. However, this is beyond the scope of the paper.\nThe above constrained optimization problem is equivalent to the following unconstrained one:\n\n$$\\hat{Q} \\leftarrow \\operatorname*{argmin}_{Q \\in \\mathcal{F}^{|\\mathcal{A}|}} \\|Q - T^{\\pi} Q\\|_n^2 + \\lambda J^2(Q) + \\frac{\\alpha}{m} \\sum_{i=1}^{m} \\left[ 1 - \\left( Q(X_i, \\pi_E(X_i)) - \\max_{a \\in \\mathcal{A} \\setminus \\pi_E(X_i)} Q(X_i, a) \\right) \\right]_+, \\qquad (3)$$\n\nwhere [1 \u2212 z]_+ = max{0, 1 \u2212 z} is the hinge loss.\nIn many problems, we do not have access to the exact Bellman operator T \u03c0, but only to samples\nDRL = {(Xi, Ai, Ri, X\u2032i)}_{i=1}^{n} with Ri \u223c R(\u00b7|Xi, Ai) and X\u2032i \u223c P(\u00b7|Xi, Ai). In this\ncase, one might want to use the empirical Bellman error \u2016Q \u2212 \u02c6T \u03c0Q\u2016_n^2 (with (\u02c6T \u03c0Q)(Xi, Ai) \u225c\nRi + \u03b3Q(X\u2032i, \u03c0(X\u2032i)) for 1 \u2264 i \u2264 n) instead of \u2016Q \u2212 T \u03c0Q\u2016_n^2. It is known, however, that this is a\nbiased estimate of the Bellman error, and does not lead to proper solutions [14]. One approach to\naddress this issue is to use the modified Bellman error [14]. Another approach is to use Projected\nBellman error, which leads to an LSTD-like algorithm [8]. 
Using the latter idea, we formulate our\noptimization as:\n\n$$\\hat{Q} \\leftarrow \\operatorname*{argmin}_{Q \\in \\mathcal{F}^{|\\mathcal{A}|},\\, \\xi \\in \\mathbb{R}_+^m} \\|Q - \\hat{h}_Q\\|_n^2 + \\lambda J^2(Q) + \\frac{\\alpha}{m} \\sum_{i=1}^{m} \\xi_i \\qquad (4)$$\n$$\\text{s.t.} \\quad \\hat{h}_Q = \\operatorname*{argmin}_{h \\in \\mathcal{F}^{|\\mathcal{A}|}} \\left[ \\|h - \\hat{T}^{\\pi} Q\\|_n^2 + \\lambda_h J^2(h) \\right],$$\n$$Q(X_i, \\pi_E(X_i)) - \\max_{a \\in \\mathcal{A} \\setminus \\pi_E(X_i)} Q(X_i, a) \\ge 1 - \\xi_i \\quad \\text{for all } (X_i, \\pi_E(X_i)) \\in \\mathcal{D}_E.$$\n\nHere \u03bbh > 0 is the regularization coefficient for \u02c6hQ, which might be different from \u03bb. For some\nchoices of the function space F|A| and the regularizer J, the estimate \u02c6hQ can be found in closed-form.\nFor example, one can use linear function approximators h(\u00b7) = \u03c6(\u00b7)\u22a4u and Q(\u00b7) = \u03c6(\u00b7)\u22a4w\nwhere u, w \u2208 Rp are parameter vectors and \u03c6(\u00b7) \u2208 Rp is a vector of p linearly independent basis\nfunctions defined over the space of state-action pairs. Using L2-regularization, J^2(h) = u\u22a4u and\nJ^2(Q) = w\u22a4w, the best parameter vector u\u2217 can be obtained as a function of w by solving a ridge\nregression problem:\n\n$$u^*(w) = \\left( \\Phi^\\top \\Phi + n \\lambda_h I \\right)^{-1} \\Phi^\\top (r + \\gamma \\Phi' w),$$\n\nwhere \u03a6, \u03a6\u2032 and r are the feature matrices and reward vector, respectively: \u03a6 = (\u03c6(Z1), . . . , \u03c6(Zn))\u22a4,\n\u03a6\u2032 = (\u03c6(Z\u20321), . . . , \u03c6(Z\u2032n))\u22a4, r = (R1, . . . , Rn)\u22a4, with Zi = (Xi, Ai)\nand Z\u2032i = (X\u2032i, \u03c0(X\u2032i)) (for data belonging to DRL). 
More generally, as discussed above, we might\nchoose the function space F|A| to be a reproducing kernel Hilbert space (RKHS) and J to be its\ncorresponding norm, which provides the flexibility of working with a nonparametric representation\nwhile still having a closed-form solution for \u02c6hQ. We do not provide the detail of formulation here\ndue to space constraints.\nThe approach presented so far tackles the policy evaluation step of the API algorithm. As usual\nin API, we alternate this step with the policy improvement step (i.e., greedification). The resulting\nalgorithm is called Approximate Policy Iteration with Demonstration (APID).\n\nUp to this point, we have left open the problem of how the datasets DRL and DE are obtained. These\ndatasets might be regenerated at each iteration, or they might be reused, depending on the availability\nof the expert and the environment. In practice, if the expert data is rare, DE will be a single fixed\nbatch, but DRL could be increased, e.g., by running the most current policy (possibly with some\nexploration) to collect more data. The approach used should be tailored to the application. Note\nthat the values of the regularization coefficients as well as \u03b1 should ideally change from iteration to\niteration as a function of the number of samples as well as the value function Q\u03c0k. The choice of\nthese parameters might be automated by model selection [15].\n\n3 Theoretical Analysis\n\nIn this section we focus on the kth iteration of APID and consider the solution \u02c6Q to the optimization\nproblem (2). The theoretical contribution is an upper bound on the true Bellman error of \u02c6Q. 
Such an\nupper bound allows us to use error propagation results [16, 17] to provide a performance guarantee\non the value of the outcome policy \u03c0K (the policy obtained after K iterations of the algorithm)\ncompared to the optimal value function V \u2217. We make the following assumptions in our analysis.\nAssumption A1 (Sampling) DRL contains n independent and identically distributed (i.i.d.) samples\n(Xi, Ai) \u223c \u03bdRL \u2208 M(X \u00d7 A), where \u03bdRL is a fixed distribution (possibly dependent on k), and\nthe states in DE = {(Xi, \u03c0E(Xi))}_{i=1}^{m} are also drawn i.i.d., Xi \u223c \u03bdE \u2208 M(X ), from an expert\ndistribution \u03bdE. DRL and DE are independent from each other. We denote N = n + m.\nAssumption A2 (RKHS) The function space F|A| is an RKHS defined by a kernel function K :\n(X \u00d7 A) \u00d7 (X \u00d7 A) \u2192 R, i.e., F|A| = { z \u21a6 \u2211_{i=1}^{N} wi K(z, Zi) : w \u2208 R^N } with {Zi}_{i=1}^{N} =\nDRL \u222a DE. We assume that sup_{z\u2208X\u00d7A} K(z, z) \u2264 1. Moreover, the function space F|A| is Qmax-bounded.\nAssumption A3 (Function Approximation Property) For any policy \u03c0, Q\u03c0 \u2208 F|A|.\nAssumption A4 (Expansion of Smoothness) For all Q \u2208 F|A|, there exist constants 0 \u2264 LR, LP <\n\u221e, depending only on the MDP and F|A|, such that for any policy \u03c0, J(T \u03c0Q) \u2264 LR + \u03b3LP J(Q).\nAssumption A5 (Regularizers) The regularizer functionals J : B(X ) \u2192 R and J : B(X \u00d7 A) \u2192\nR are pseudo-norms on F and F|A|, respectively (see footnote 4), and for all Q \u2208 F|A| and a \u2208 A, we have\nJ(Q(\u00b7, a)) \u2264 J(Q).\n\nSome of these assumptions are quite mild, while some are only here to simplify the analysis, but\nare not necessary for practical application of the algorithm. For example, the i.i.d. 
assumption A1\ncan be relaxed using independent block technique [18] or other techniques to handle dependent data,\ne.g., [19]. 
The method is certainly not speci\ufb01c to RKHS (Assumption A2), so other function spaces\ncan be used without much change in the proof. Assumption A3 holds for \u201crich\u201d enough function\nspaces, e.g., universal kernels satisfy it for reasonable Q\u03c0. Assumption A4 ensures that if Q \u2208 F|A|\nthen T \u03c0Q \u2208 F|A|. It holds if F|A| is rich enough and the MDP is \u201cwell-behaving\u201d. Assumption A5\nis mild and ensures that if we control the complexity of Q \u2208 F|A|, the complexity of Q(\u00b7, a) \u2208 F\nis controlled too. Finally, note that focusing on the case when we have access to the true Bellman\noperator simpli\ufb01es the analysis while allowing us to gain more understanding about APID. We are\nnow ready to state the main theorem of this paper.\nTheorem 1. For any \ufb01xed policy \u03c0, let \u02c6Q be the solution to the optimization problem (2) with the\nchoice of \u03b1 > 0 and \u03bb > 0. If Assumptions A1\u2013A5 hold, for any n, m \u2208 N and 0 < \u03b4 < 1, with\nprobability at least 1 \u2212 \u03b4 we have\n\n4 B(X ) and B(X \u00d7 A) denote the space of bounded measurable functions de\ufb01ned on X and X \u00d7 A. Here\nwe are slightly abusing notation as the same symbol is used for the regularizer over both spaces. 
However, this\nshould not cause any confusion since the identity of the regularizer should always be clear from the context.\n\n$$\\left\\| \\hat{Q} - T^{\\pi} \\hat{Q} \\right\\|_{\\nu_{RL}}^2 \\le 64 Q_{\\max} \\frac{\\sqrt{n+m}}{n} \\left( \\frac{(1 + \\gamma L_P) \\sqrt{R_{\\max}^2 + \\alpha}}{\\sqrt{\\lambda}} + L_R \\right)$$\n$$+ \\min \\Big\\{ 2\\alpha \\, \\mathbb{E}_{X \\sim \\nu_E} \\Big[ \\Big[ 1 - \\Big( Q^{\\pi}(X, \\pi_E(X)) - \\max_{a \\in \\mathcal{A} \\setminus \\pi_E(X)} Q^{\\pi}(X, a) \\Big) \\Big]_+ \\Big] + \\lambda J^2(Q^{\\pi}),$$\n$$\\qquad 2 \\| Q^{\\pi_E} - T^{\\pi} Q^{\\pi_E} \\|_{\\nu_{RL}}^2 + 2\\alpha \\, \\mathbb{E}_{X \\sim \\nu_E} \\Big[ \\Big[ 1 - \\Big( Q^{\\pi_E}(X, \\pi_E(X)) - \\max_{a \\in \\mathcal{A} \\setminus \\pi_E(X)} Q^{\\pi_E}(X, a) \\Big) \\Big]_+ \\Big] + \\lambda J^2(Q^{\\pi_E}) \\Big\\}$$\n$$+ 4 Q_{\\max}^2 \\left( \\sqrt{\\frac{2 \\ln(4/\\delta)}{n}} + \\frac{6 \\ln(4/\\delta)}{n} \\right) + \\alpha \\frac{20 (1 + 2 Q_{\\max}) \\ln(8/\\delta)}{3m}.$$\n\nThe proof of this theorem is in the supplemental material. Let us now discuss some aspects of\nthe result. The theorem guarantees that when the amount of RL data is large enough (n \u226b m),\nwe indeed minimize the Bellman error if we let \u03b1 \u2192 0. In that case, the upper bound would\nbe OP(1/\u221a(n\u03bb)) + min{\u03bbJ^2(Q\u03c0), 2\u2016Q\u03c0E \u2212 T \u03c0Q\u03c0E\u2016^2_{\u03bdRL} + \u03bbJ^2(Q\u03c0E)}. Considering only the first\nterm inside min, the upper bound is minimized by the choice of \u03bb = [n^{1/3} J^{4/3}(Q\u03c0)]^{\u22121}, which\nleads to OP(J^{2/3}(Q\u03c0) n^{\u22121/3}) behaviour of the upper bound. The bound shows that the difficulty of\nlearning depends on J(Q\u03c0), which is the complexity of the true (but unknown) action-value function\nQ\u03c0 measured according to J in F|A|. 
Note that Q\u03c0 might be \u201csimple\u201d with respect to some choice\nof function space/regularizer, but complex in another one. The choice of F|A| and J reflects prior\nknowledge regarding the function space and complexity measure that are suitable.\nWhen the number of samples n increases, we can afford to increase the size of the function space\nby making \u03bb smaller. Since we have two terms inside min, the complexity of the problem might\nactually depend on 2\u2016Q\u03c0E \u2212 T \u03c0Q\u03c0E\u2016^2_{\u03bdRL} + \u03bbJ^2(Q\u03c0E), which is the Bellman error of Q\u03c0E (the true\naction-value function of the expert) according to \u03c0 plus the complexity of Q\u03c0E in F|A|. Roughly\nspeaking, if \u03c0 is close to \u03c0E, the Bellman error would be small. Two remarks are in order. First, this\nresult does not provide a proper upper bound on the Bellman error when m dominates n. This is\nto be expected, because if \u03c0 is quite different from \u03c0E and we do not have enough samples in DRL,\nwe cannot guarantee that the Bellman error, which is measured according to \u03c0, will be small. But,\none can still provide a guarantee by choosing a large \u03b1 and using a margin-based error bound (cf.\nSection 4.1 of [20]). Second, this upper bound is not optimal, as we use a simple proof technique\nbased on controlling the supremum of the empirical process. More advanced empirical processes\ntechniques can be used to obtain a faster error rate (cf. [12]).\n\n4 Experiments\n\nWe evaluate APID on a simulated domain, as well as a real robot path-finding task. In the simulated\nenvironment, we compare APID against other benchmarks under varying availability and optimality\nof the expert demonstrations. 
In the real robot task, we evaluate the practicality of deploying APID\non a live system, especially when DRL and DE are both expensive to obtain.\n\n4.1 Car Brake Control Simulation\nIn the vehicle brake control simulation [21], the agent\u2019s goal is to reach a target velocity, then maintain\nthat target. It can either press the acceleration pedal or the brake pedal, but not both simultaneously.\nA state is represented by four continuous-valued features: target and current velocities, current\npositions of brake pedal and acceleration pedal. Given a state, the learned policy has to output one\nof five actions: acceleration up, acceleration down, brake up, brake down, do nothing. The reward\nis \u221210 times the error in velocity. The initial velocity is set to 2m/s, and the target velocity is set\nto 7m/s. The expert was implemented using the dynamics between the pedal pressure and output\nvelocity, from which we calculate the optimal velocity at each state. We added random noise to the\ndynamics to simulate a realistic scenario, in which the output velocity is governed by factors such\nas friction and wind. The agent has no knowledge of the dynamics, and receives only DE and DRL.\n\nFigure 1: (a) Average reward with m = 15 optimal demonstrations. (b) Average reward with\nm = 100 sub-optimal demonstrations. Each iteration adds 500 new RL data to APID and LSPI,\nwhile the expert data stays the same. First iteration has n = 500 \u2212 m for APID. LSPI treats all the\ndata at this iteration as RL data.\n\nFor all experiments, we used a linear Radial Basis Function (RBF) approximator for the value function\nand CVX, a package for specifying and solving convex programs [22], to solve the optimization\nproblem (4). We set \u03b1/m to 1 if the expert is optimal and 0.01 otherwise. The regularization parameter\nwas set according to 1/\u221a(n + m). We averaged results over 10 runs and computed confidence\nintervals as well. 
We compare APID with the regularized version of LSPI [6] in all the experiments.\nDepending on the availability of expert data, we either compare APID with the standard supervised\nLfD, or DAgger [1], a state-of-the-art LfD algorithm that has strong theoretical guarantees and good\nempirical performance when the expert is optimal. DAgger is designed to query for more demon-\nstrations at each iteration; then, it aggregates all demonstrations and trains a new policy. The number\nof queries in DAgger increases linearly with the task horizon. For Dagger and supervised LfD, we\nuse random forests to train the policy.\nWe \ufb01rst consider the case with little but optimal expert data, with task horizon 1000. At each\niteration, the agent gathers more RL data using a random policy. In this case, shown in Figure 1a,\nLSPI performs worse than APID on average, and it also has much higher variance, especially when\nDRL is small. This is consistent with empirical results in [6], in which LSPI showed signi\ufb01cant\nvariance even for simple tasks. In the \ufb01rst iteration, APID has moderate variance, but it is quickly\nreduced in the next iteration. This is due to the fact that expert constraints impose a particular shape\nto the value function, as noted in Section 2. The supervised LfD performs the worst, as the amount of\nexpert data is insuf\ufb01cient. Results for the case in which the agent has more but sub-optimal expert\ndata are shown in Figure 1b. Here, with probability 0.5 the expert gives a random action rather\nthan the optimal action. Compared to supervised LfD, APID and LSPI are both able to overcome\nsub-optimality in the expert\u2019s behaviour to achieve good performance, by leveraging the RL data.\nNext, we consider the case of abundant demonstrations from a sub-optimal expert, who gives random\nactions with probability 0.25, to characterize the difference between APID and DAgger. 
The task\nhorizon is reduced to 100, due to the number of demonstrations required by DAgger. As can be\nseen in Figure 2a, the sub-optimal demonstrations cause DAgger to diverge, because it changes the\npolicy at each iteration, based on the newly aggregated sub-optimal demonstrations. APID, on the\nother hand, is able to learn a better policy by leveraging DRL. APID also outperforms LSPI (which\nuses the same DRL), by generalizing from DE via function approximation. This result illustrates\nwell APID\u2019s robustness to sub-optimal expert demonstrations. Figure 2b shows the result for the\ncase of optimal and abundant demonstrations. In this case, which fits DAgger\u2019s assumptions, the\nperformance of APID is on par with that of DAgger.\n\n[Figure 1 panels (a) and (b): x-axis Number of Iterations, y-axis Average Rewards; curves APID, LSPI, Supervised.]\n\nFigure 2: (a) Performance with a sub-optimal expert. (b) Performance with an optimal expert. Each\niteration (X-axis) adds 100 new expert data points to APID and DAgger. We use n = 3000 \u2212 m for\nAPID. LSPI treats all data as RL data.\n\n4.2 Real Robot Path Finding\n\nWe now evaluate the practicality of APID on a real robot path-finding task and compare it with LSPI\nand supervised LfD, using only one demonstrated trajectory. We do not assume that the expert is\noptimal (and abundant), and therefore do not include DAgger, which was shown to perform poorly\nfor this case. In this task, the robot needs to get to the goal in an unmapped environment by learning\nto avoid obstacles. We use an iRobot Create equipped with Kinect RGB-depth sensor and a laptop.\nWe encode the Kinect observations with 1 \u00d7 3 grid cells (each 1m \u00d7 1m). The robot also has three\nbumpers to detect a collision from the front, left, and right. 
Figures 3a and 3b show a picture of the\nrobot and its environment. In order to reach the goal, the robot needs to turn left to avoid a first box\nand wall on the right, while not turning too much, to avoid the couch. Next, the robot must turn\nright to avoid a second box, but make sure not to turn too much or too soon to avoid colliding with\nthe wall or first box. Then, the robot needs to get into the hallway, turn right, and move forward to\nreach the goal position; the goal position is 6m forward and 1.5m right from the initial position.\nThe state space is represented by 3 non-negative integer features to represent number of point clouds\nproduced by Kinect in each grid cell, and 2 continuous features (robot position). The robot has three\ndiscrete actions: turn left, turn right, and move forward. The reward is minus the distance to the\ngoal, but if the robot\u2019s front bumper is pressed and the robot moves forward, it receives a penalty\nequal to 2 times the current distance to the goal. If the robot\u2019s left bumper is pressed and the robot\ndoes not turn right, and vice-versa, it also receives a penalty equal to 2 times the current distance to the goal. The robot\noutputs actions at a rate of 1.7Hz.\nWe started from a single trajectory of demonstration, then incrementally added only RL data. The\nnumber of data points added varied at each iteration, but the average was 160 data points, which is\naround 1.6 minutes of exploration using an \u03b5-greedy exploration policy (decreasing \u03b5 over iterations).\nFor 11 iterations, the training time was approximately 18 minutes. Initially, \u03b1/m was set to 0.9, then\nit was decreased as new data was acquired. 
To evaluate the performance of each algorithm, we ran\neach iteration\u2019s policy for a task horizon of 100 (\u2248 1min); we repeated each iteration 5 times, to\ncompute the mean and standard deviation.\nAs shown in Figure 3c, APID outperformed both LSPI and supervised LfD; in fact, these two methods\ncould not reach the goal. The supervised LfD kept running into the couch, as state distributions\nof expert and robot differed, as noted in [1]. LSPI had a problem due to exploring unnecessary\nstates; specifically, the \u03b5-greedy policy of LSPI explored regions of state space that were not relevant\nin learning the optimal plan, such as the far left areas from the initial position. The \u03b5-greedy policy\nof APID, on the other hand, was able to leverage the expert data to efficiently explore the most relevant\nstates and avoid unnecessary collisions. For example, it learned to avoid the first box in the first\niteration, then explored states near the couch, where supervised LfD failed. Table 1 gives the time it\ntook for the robot to reach the goal (within 1.5m). Only iterations 9, 10 and 11 of APID reached the\ngoal. Note that the times achieved by APID (iteration 11) are similar to the expert.\n\nTable 1: Average time to reach the goal\n\nAverage Vals | Demonstration | APID-9th | APID-10th | APID-11th\nTime To Goal (s) | 35.9 | 38.4 \u00b1 0.81 | 37.7 \u00b1 0.84 | 36.1 \u00b1 0.24\n\n[Figure 2 panels (a) and (b): x-axis Number of Iterations, y-axis Average Rewards; curves APID, LSPI, DAgger.]\n\nFigure 3: (a) Picture of iRobot Create equipped with Kinect. (b) Top-down view of the environment.\nThe star represents the goal, the circle represents the initial position, black lines indicate walls, and\nthe three grid cells represent the vicinity of the Kinect. 
(c) Distance to the goal for LSPI, APID and\nsupervised LfD with random forest.\n\n5 Discussion\n\nWe proposed an algorithm that learns from limited demonstrations by leveraging a state-of-the-art\nAPI method. To our knowledge, this is the first LfD algorithm that learns from few and/or suboptimal\ndemonstrations. Most LfD methods focus on addressing the violation of the i.i.d. data assumption\nby changing the policy slowly [23], by reducing the problem to online learning [1], by querying\nthe expert [2] or by obtaining corrections from the expert [3]. These methods all assume that the\nexpert is optimal or close-to-optimal, and that demonstration data is abundant. The TAMER system [24]\nuses rewards provided by the expert (possibly blended with MDP rewards), instead of assuming\nthat an action choice is provided. There are a few Inverse RL methods that do not assume optimal\nexperts [25, 26], but their focus is on learning the reward function rather than on planning. Also,\nthese methods require a model of the system dynamics, which is typically not available in practice.\nIn the simulated environment, we compared our method with DAgger (a state-of-the-art LfD\nmethod) as well as with a popular API algorithm, LSPI. We considered four scenarios: very few but\noptimal demonstrations, a reasonable number of sub-optimal demonstrations, abundant sub-optimal\ndemonstrations, and abundant optimal demonstrations. In the first three scenarios, which are more\nrealistic, our method outperformed the others. In the last scenario, in which the standard LfD assumptions\nhold, APID performed just as well as DAgger. In the real robot path-finding task, our\nmethod again outperformed LSPI and supervised LfD. LSPI suffered from inefficient exploration,\nand supervised LfD was affected by the violation of the i.i.d. assumption, as pointed out in [1].\nWe note that APID accelerated API by utilizing demonstration data. 
Previous approaches [27, 28]\naccelerated policy search, e.g. by using LfD to find initial policy parameters. In contrast, APID\nleverages the expert data to shape the policy throughout the planning.\nThe work most similar to ours, in terms of goals, is [29], where the agent is given multiple sub-optimal\ntrajectories, and infers a hidden desired trajectory using Expectation Maximization and\nKalman Filtering. However, their approach is less general, as it assumes a particular noise model in\nthe expert data, whereas APID is able to handle demonstrations that are sub-optimal non-uniformly\nalong the trajectory.\nIn future work, we will explore more applications of APID and study its behaviour with respect to\n\u2206i. For instance, in safety-critical applications, large \u2206i could be used at critical states.\n\nAcknowledgements\nFunding for this work was provided by the NSERC Canadian Field Robotics Network, Discovery Grants Program,\nand Postdoctoral Fellowship Program, as well as by the CIHR (CanWheel team), and the FQRNT (Regroupements\nstrat\u00e9giques INTER et REPARTI).\n\n[Figure 3c plot. x-axis: Number of Iterations; y-axis: Distance to the Goal; legend: APID, LSPI, supervised]\n\nReferences\n[1] S. Ross, G. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011. 1, 2, 6, 7, 8\n[2] S. Chernova and M. Veloso. Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34, 2009. 1, 8\n[3] B. Argall, M. Veloso, and B. Browning. Teacher feedback to scaffold and refine demonstrated motion primitives on a mobile robot. Robotics and Autonomous Systems, 59(3-4), 2011. 1, 8\n[4] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. 1\n[5] Cs. Szepesv\u00e1ri. Algorithms for Reinforcement Learning. Morgan Claypool Publishers, 2010. 1, 2\n[6] M. G. Lagoudakis and R. Parr. 
Least-squares policy iteration. Journal of Machine Learning Research, 4:1107\u20131149, 2003. 2, 6\n[7] D. P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310\u2013335, 2011. 2\n[8] A.-m. Farahmand, M. Ghavamzadeh, Cs. Szepesv\u00e1ri, and S. Mannor. Regularized policy iteration. In NIPS 21, 2009. 2, 3\n[9] J. Z. Kolter and A. Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML, 2009. 2\n[10] G. Taylor and R. Parr. Kernelized value function approximation for reinforcement learning. In ICML, 2009. 2\n[11] M. Ghavamzadeh, A. Lazaric, R. Munos, and M. Hoffman. Finite-sample analysis of Lasso-TD. In ICML, 2011. 2\n[12] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008. 3, 5\n[13] I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, and Y. Singer. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(2):1453\u20131484, 2006. 3\n[14] A. Antos, Cs. Szepesv\u00e1ri, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89\u2013129, 2008. 3\n[15] A.-m. Farahmand and Cs. Szepesv\u00e1ri. Model selection in reinforcement learning. Machine Learning, 85(3):299\u2013332, 2011. 4\n[16] R. Munos. Error bounds for approximate policy iteration. In ICML, 2003. 4\n[17] A.-m. Farahmand, R. Munos, and Cs. Szepesv\u00e1ri. Error propagation for approximate policy and value iteration. In NIPS 23, 2010. 4\n[18] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94\u2013116, January 1994. 4\n[19] P.-M. Samson. Concentration of measure inequalities for Markov chains and \u03c6-mixing processes. The Annals of Probability, 28(1):416\u2013461, 2000. 
4\n[20] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323\u2013375, 2005. 5\n[21] T. Hester, M. Quinlan, and P. Stone. RTMBA: A real-time model-based reinforcement learning architecture for robot control. In ICRA, 2012. 5\n[22] CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version 2.0. http://cvxr.com/cvx, August 2012. 5\n[23] S. Ross and J. A. Bagnell. Efficient reductions for imitation learning. In AISTATS, 2010. 8\n[24] W. B. Knox and P. Stone. Reinforcement learning from simultaneous human and MDP reward. In AAMAS, 2012. 8\n[25] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. In IJCAI, 2007. 8\n[26] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008. 8\n[27] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives. In NIPS 15, 2002. 8\n[28] J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171\u2013203, 2011. 8\n[29] A. Coates, P. Abbeel, and A. Y. Ng. Learning for control from multiple demonstrations. In ICML, 2008. 8\n", "award": [], "sourceid": 1305, "authors": [{"given_name": "Beomjoon", "family_name": "Kim", "institution": "McGill University"}, {"given_name": "Amir-massoud", "family_name": "Farahmand", "institution": "McGill University"}, {"given_name": "Joelle", "family_name": "Pineau", "institution": "McGill University"}, {"given_name": "Doina", "family_name": "Precup", "institution": "McGill University"}]}