{"title": "Probabilistic Movement Primitives", "book": "Advances in Neural Information Processing Systems", "page_first": 2616, "page_last": 2624, "abstract": "Movement Primitives (MP) are a well-established approach for representing modular and re-usable robot movement generators. Many state-of-the-art robot learning successes are based MPs, due to their compact representation of the inherently continuous and high dimensional robot movements. A major goal in robot learning is to combine multiple MPs as building blocks in a modular control architecture to solve complex tasks. To this effect, a MP representation has to allow for blending between motions, adapting to altered task variables, and co-activating multiple MPs in parallel. We present a probabilistic formulation of the MP concept that maintains a distribution over trajectories. Our probabilistic approach allows for the derivation of new operations which are essential for implementing all aforementioned properties in one framework. In order to use such a trajectory distribution for robot movement control, we analytically derive a stochastic feedback controller which reproduces the given trajectory distribution. We evaluate and compare our approach to existing methods on several simulated as well as real robot scenarios.", "full_text": "Probabilistic Movement Primitives\n\nAlexandros Paraschos, Christian Daniel, Jan Peters, and Gerhard Neumann\n\nIntelligent Autonomous Systems, Technische Universit\u00e4t Darmstadt\n\n{paraschos,daniel,peters,neumann}@ias.tu-darmstadt.de\n\nHochschulstr. 10, 64289 Darmstadt, Germany\n\nAbstract\n\nMovement Primitives (MP) are a well-established approach for representing mod-\nular and re-usable robot movement generators. Many state-of-the-art robot learn-\ning successes are based MPs, due to their compact representation of the inherently\ncontinuous and high dimensional robot movements. 
A major goal in robot learning is to combine multiple MPs as building blocks in a modular control architecture to solve complex tasks. To this end, an MP representation has to allow for blending between motions, adapting to altered task variables, and co-activating multiple MPs in parallel. We present a probabilistic formulation of the MP concept that maintains a distribution over trajectories. Our probabilistic approach allows for the derivation of new operations which are essential for implementing all aforementioned properties in one framework. In order to use such a trajectory distribution for robot movement control, we analytically derive a stochastic feedback controller which reproduces the given trajectory distribution. We evaluate and compare our approach to existing methods on several simulated as well as real robot scenarios.\n\n1 Introduction\n\nMovement Primitives (MPs) are commonly used for representing and learning basic movements in robotics, e.g., hitting and batting, grasping, etc. [1, 2, 3]. MP formulations are compact parameterizations of the robot's control policy. Modulating their parameters permits imitation and reinforcement learning as well as adapting to different scenarios. MPs have been used to solve many complex tasks, including 'Ball-in-the-Cup' [4], Ball-Throwing [5, 6], Pancake-Flipping [7] and Tetherball [8].\n\nThe aim of MPs is to allow for composing complex robot skills out of elemental movements with a modular control architecture. Hence, we require an MP architecture that supports parallel activation and smooth blending of MPs for composing complex movements of sequentially [9] and simultaneously [10] activated primitives. Moreover, adaptation to a new task or a new situation requires modulation of the MP to an altered desired target position, target velocity or via-points [3]. Additionally, the execution speed of the movement needs to be adjustable to change the speed of, for example, a ball-hitting movement. As we want to learn the movement from data, another crucial requirement is that the parameters of the MPs should be straightforward to learn from demonstrations as well as through trial and error for reinforcement learning approaches. Ideally, the same architecture is applicable to both stroke-based and periodic movements, and capable of representing optimal behavior in deterministic and stochastic environments.\n\nWhile many of these properties are implemented by one or more existing MP architectures [1, 11, 10, 2, 12, 13, 14, 15], no approach exists which exhibits all of these properties in one framework. For example, [13] also offers a probabilistic interpretation of MPs by representing an MP as a learned graphical model. However, this approach heavily depends on the quality of the used planner and the movement cannot be temporally scaled. Rozo et al. [12, 16] use a combination of primitives; yet, their control policy of the MP is based on heuristics and it is unclear how the combination of MPs affects the resulting movements.\n\nIn this paper, we introduce the concept of probabilistic movement primitives (ProMPs) as a general probabilistic framework for representing and learning MPs. Such a ProMP is a distribution over trajectories. Working with distributions enables us to formulate the described properties by operations from probability theory. For example, modulation of a movement to a novel target can be realized by conditioning on the desired target's positions or velocities. Similarly, consistent parallel activation of two elementary behaviors can be accomplished by a product of two independent trajectory probability distributions. Moreover, a trajectory distribution can also encode the variance of the movement, and, hence, a ProMP can often directly encode optimal behavior in stochastic systems [17]. Finally, a probabilistic framework allows us to model the covariance between trajectories of different degrees of freedom, which can be used to couple the joints of the robot.\n\nSuch properties of trajectory distributions have so far not been properly exploited for representing and learning MPs. The main reason for the absence of such an approach has been the difficulty of extracting a policy for controlling the robot from a trajectory distribution. We show how this step can be accomplished and derive a control policy that exactly reproduces a given trajectory distribution. To the best of our knowledge, we present the first principled MP approach that can exploit the power of operations from probability theory.\n\nWhile the ProMPs' representation introduces many novel components, it incorporates many advantages from well-known previous movement primitive representations [18, 10], such as phase variables for timing of the movement that enable temporal rescaling of movements, and the ability to represent both rhythmic and stroke-based movements. However, since ProMPs incorporate the variance of demonstrations, the increased flexibility and advantageous properties of the representation come at the price of requiring multiple demonstrations to learn the primitives, as opposed to past approaches [18, 3] that can clone movements from a single demonstration.\n\n2 Probabilistic Movement Primitives (ProMPs)\n\nTable 1: Desirable properties and their implementation in the ProMP\n\nA movement primitive representation should exhibit several desirable properties, such as co-activation, adaptability and optimality, in order to be a powerful MP representation. The goal of this paper is to unify these properties in one framework. 
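For Gaussian trajectory models, the modulation-by-conditioning described above reduces to a standard Gaussian identity. The following is a minimal single-time-step sketch (NumPy; the dimensions, feature matrix, and target values are illustrative assumptions, not the paper's robot setup):

```python
import numpy as np

def condition_on_target(mu_w, Sigma_w, Psi, y_star, Sigma_star):
    """Condition N(w | mu_w, Sigma_w) on observing y* = Psi^T w + noise,
    using the standard Gaussian conditioning identity."""
    S = Sigma_star + Psi.T @ Sigma_w @ Psi        # innovation covariance
    gain = Sigma_w @ Psi @ np.linalg.inv(S)       # Kalman-style gain
    mu_new = mu_w + gain @ (y_star - Psi.T @ mu_w)
    Sigma_new = Sigma_w - gain @ Psi.T @ Sigma_w
    return mu_new, Sigma_new

rng = np.random.default_rng(1)
n = 8                                             # number of basis weights (assumed)
mu_w, Sigma_w = rng.normal(size=n), np.eye(n)
Psi = rng.normal(size=(n, 2))                     # features of one time step (assumed)
y_star = np.array([0.5, 0.0])                     # desired position and velocity
mu_new, Sigma_new = condition_on_target(mu_w, Sigma_w, Psi, y_star,
                                        1e-8 * np.eye(2))
```

With a near-zero observation covariance, the conditioned mean passes through the target (Psi.T @ mu_new is approximately y_star) while the posterior uncertainty shrinks.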
We accomplish this objective by using a probabilistic formulation for MPs. We summarize all the properties and how they are implemented in our framework in Table 1. In this section, we will sequentially explain the importance of each of these properties and discuss their implementation in our framework. As a crucial part of our objective, we will introduce conditioning and a product of ProMPs as new operations that can be applied to ProMPs due to the probabilistic formulation. Finally, we show how to derive a controller which follows a given trajectory distribution.\n\nProperty | Implementation\nCo-Activation | Product\nModulation | Conditioning\nOptimality | Encode variance\nCoupling | Mean, Covariance\nLearning | Max. Likelihood\nTemporal Scaling | Modulate Phase\nRhythmic Movements | Periodic Basis\n\n2.1 Probabilistic Trajectory Representation\n\nWe model a single movement execution as a trajectory τ = {q_t}_{t=0...T}, defined by the joint angles q_t over time. In our framework, an MP describes multiple ways to execute a movement, which naturally leads to a probability distribution over trajectories.\n\nEncoding a Time-Varying Variance of Movements. Our movement primitive representation models the time-varying variance of the trajectories to be able to capture multiple demonstrations with high variability. Representing the variance information is crucial as it reflects the importance of single time points for the movement execution, and it is often a requirement for representing optimal behavior in stochastic systems [17].\n\nWe use a weight vector w to compactly represent a single trajectory. 
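A minimal numerical sketch of this weight-vector representation (NumPy; the basis type, count, width, and the per-demonstration regularized least-squares fit are illustrative assumptions rather than the paper's maximum-likelihood estimator, and only positions are fitted):

```python
import numpy as np

def basis_matrix(zs, centers, h):
    """Normalized Gaussian basis features, one row per phase step z."""
    B = np.exp(-(zs[:, None] - centers[None, :]) ** 2 / (2.0 * h))
    return B / B.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
zs = np.linspace(0.0, 1.0, 50)                     # phase from 0 to 1
Phi = basis_matrix(zs, np.linspace(0.0, 1.0, 10), h=0.01)

# Compress each noisy demonstration into a weight vector w via
# regularized least squares, q ~ Phi w.
demos = [np.sin(2.0 * np.pi * zs) + rng.normal(scale=0.01, size=zs.size)
         for _ in range(20)]
A = Phi.T @ Phi + 1e-8 * np.eye(Phi.shape[1])
W = np.stack([np.linalg.solve(A, Phi.T @ q) for q in demos])

# A Gaussian over w captures both the mean movement and its variability.
mu_w = W.mean(axis=0)
Sigma_w = np.cov(W, rowvar=False)
mean_traj = Phi @ mu_w                                    # mean trajectory
var_traj = np.einsum('ti,ij,tj->t', Phi, Sigma_w, Phi)    # per-step variance
```

The per-step variance computed this way is exactly the marginal variance of the trajectory distribution induced by the Gaussian over w.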
The probability of observing a trajectory τ given the underlying weight vector w is given as a linear basis function model\n\ny_t = [q_t, q̇_t]^T = Φ_t^T w + ε_y,    p(τ|w) = ∏_t N(y_t | Φ_t^T w, Σ_y),    (1)\n\nwhere Φ_t = [φ_t, φ̇_t] defines the n × 2 dimensional time-dependent basis matrix for the joint positions q_t and velocities q̇_t, n defines the number of basis functions, and ε_y ∼ N(0, Σ_y) is zero-mean i.i.d. Gaussian noise. By weighting the basis functions Φ_t with the parameter vector w, we can represent the mean of a trajectory.\n\nIn order to capture the variance of the trajectories, we introduce a distribution p(w; θ) over the weight vector w, with parameters θ. The trajectory distribution p(τ; θ) can now be computed by marginalizing out the weight vector w, i.e., p(τ; θ) = ∫ p(τ|w) p(w; θ) dw. The distribution p(τ; θ) defines a Hierarchical Bayesian Model (HBM) whose parameters are given by the observation noise variance Σ_y and the parameters θ of p(w; θ).\n\nTemporal Modulation. Temporal modulation is needed for a faster or slower execution of the movement. We introduce a phase variable z to decouple the movement from the time signal, as in previous non-probabilistic approaches [18]. The phase can be any function z(t) that increases monotonically with time. By modifying the rate of the phase variable, we can modulate the speed of the movement. Without loss of generality, we define the phase as z_0 = 0 at the beginning of the movement and as z_T = 1 at the end. The basis functions φ_t now directly depend on the phase instead of time, such that φ_t = φ(z_t), and the corresponding derivative becomes φ̇_t = φ′(z_t) ż_t.\n\nRhythmic and Stroke-Based Movements. 
The choice of the basis functions depends on the type of movement, which can be either rhythmic or stroke-based. For stroke-based movements, we use Gaussian basis functions b_i^G, while for rhythmic movements we use Von-Mises basis functions b_i^VM to model periodicity in the phase variable z, i.e.,\n\nb_i^G(z) = exp( −(z_t − c_i)² / (2h) ),    b_i^VM(z) = exp( cos(2π(z_t − c_i)) / h ),    (2)\n\nwhere h defines the width of the basis and c_i the center of the ith basis function. We normalize the basis functions with φ_i(z_t) = b_i(z) / ∑_j b_j(z).\n\nEncoding Coupling between Joints. So far, we have considered each degree of freedom to be modeled independently. However, for many tasks we have to coordinate the movement of the joints. A common way to implement such coordination is via the phase variable z_t that couples the mean of the trajectory distribution [18]. Yet, it is often desirable to also encode higher-order moments of the coupling, such as the covariance of the joints at time point t. Hence, we extend our model to multiple dimensions. For each dimension i, we maintain a parameter vector w_i, and we define the combined weight vector as w = [w_1^T, ..., w_d^T]^T. The basis matrix Φ_t now extends to a block-diagonal matrix containing the basis functions and their derivatives for each dimension. The observation vector y_t consists of the angles and velocities of all joints. The probability of an observation y at time t is given by\n\np(y_t | w) = N( [y_{1,t}; ...; y_{d,t}] | diag(Φ_t^T, ..., Φ_t^T) w, Σ_y ) = N(y_t | Ψ_t w, Σ_y),    (3)\n\nwhere y_{i,t} = [q_{i,t}, q̇_{i,t}]^T denotes the joint angle and velocity of the ith joint. We now maintain a distribution p(w; θ) over the combined parameter vector w. Using this distribution, we can also capture the covariance between the joints.\n\nLearning from Demonstrations. One crucial requirement of an MP representation is that the parameters of a single primitive are easy to acquire from demonstrations. To facilitate the estimation of the parameters, we will assume a Gaussian distribution p(w; θ) = N(w | μ_w, Σ_w) over the parameters w. Consequently, the distribution of the state p(y_t; θ) for time step t is given by\n\np(y_t; θ) = ∫ N(y_t | Ψ_t^T w, Σ_y) N(w | μ_w, Σ_w) dw = N(y_t | Ψ_t^T μ_w, Ψ_t^T Σ_w Ψ_t + Σ_y),    (4)\n\nand, thus, we can easily evaluate the mean and the variance for any time point t. As a ProMP represents multiple ways to execute an elemental movement, we also need multiple demonstrations to learn p(w; θ). The parameters θ = {μ_w, Σ_w} can be learned from multiple demonstrations by maximum likelihood estimation, for example, by using the expectation maximization algorithm for HBMs with Gaussian distributions [19].\n\n2.2 New Probabilistic Operators for Movement Primitives\n\nThe ProMPs allow for the formulation of new operators from probability theory, e.g., conditioning for modulating the trajectory and a product of distributions for co-activating MPs. We will now describe both operators in our general framework and, subsequently, discuss their implementation for our specific choice of Gaussian distributions for p(w; θ).\n\nModulation of Via-Points, Final Positions or Velocities by Conditioning. 
The modulation of via-points and final positions is an important property of any MP framework, such that the MP can be adapted to new situations. In our probabilistic formulation, such operations can be described by conditioning the MP to reach a certain state y*_t at time t. Conditioning is performed by adding a desired observation x*_t = [y*_t, Σ*_y] to our probabilistic model and applying Bayes' theorem, i.e.,\n\np(w | x*_t) ∝ N(y*_t | Ψ_t^T w, Σ*_y) p(w).\n\nThe state vector y*_t represents the desired position and velocity vector at time t, and Σ*_y describes the accuracy of the desired observation. We can also condition on any subset of y*_t. For example, by specifying a desired joint position q_1 for the first joint, the trajectory distribution will automatically infer the most probable joint positions for the other joints. For Gaussian trajectory distributions, the conditional distribution p(w | x*_t) for w is Gaussian with mean and variance\n\nμ_w^[new] = μ_w + Σ_w Ψ_t (Σ*_y + Ψ_t^T Σ_w Ψ_t)^-1 (y*_t − Ψ_t^T μ_w),    (5)\n\nΣ_w^[new] = Σ_w − Σ_w Ψ_t (Σ*_y + Ψ_t^T Σ_w Ψ_t)^-1 Ψ_t^T Σ_w.    (6)\n\nConditioning a ProMP to different target states is also illustrated in Figure 1(a). We can see that, despite the modulation of the ProMP by conditioning, the ProMP stays within the original distribution, and, hence, the modulation is also learned from the original demonstrations. Modulation strategies in current approaches such as the DMPs do not show this beneficial effect [18].\n\nCombination and Blending of Movement Primitives. 
Another beneficial probabilistic operation is to continuously combine and blend different MPs into a single movement. Suppose that we maintain a set of different primitives, indexed by i, that we want to combine. We can co-activate them by taking the product of distributions, i.e., p_new(τ) ∝ ∏_i p_i(τ)^α[i], where the factors α[i] ∈ [0, 1] denote the activations of the ith primitive. This product captures the overlapping region of the active MPs, i.e., the part of the trajectory space where all MPs have high probability mass.\n\nHowever, we also want to be able to modulate the activations of the primitives, for example, to continuously blend the movement execution from one primitive to the next. Hence, we decompose the trajectory into single time steps and use time-varying activation functions α_t^[i], i.e.,\n\np*(τ) ∝ ∏_t ∏_i p_i(y_t)^α_t^[i],    p_i(y_t) = ∫ p_i(y_t | w^[i]) p_i(w^[i]) dw^[i].    (7)\n\nFor Gaussian distributions p_i(y_t) = N(y_t | μ_t^[i], Σ_t^[i]), the resulting distribution p*(y_t) is again Gaussian with variance and mean\n\nΣ*_t = ( ∑_i (Σ_t^[i] / α_t^[i])^-1 )^-1,    μ*_t = Σ*_t ∑_i (Σ_t^[i] / α_t^[i])^-1 μ_t^[i].    (8)\n\nBoth terms, and their derivatives, are required to obtain the stochastic feedback controller which is finally used to control the robot. We illustrate the co-activation of two ProMPs in Figure 1(b) and the blending of two ProMPs in Figure 1(c).\n\n(a) Conditioning    (b) Combination    (c) Blending\n\nFigure 1: (a) Conditioning on different target states. The blue shaded area represents the learned trajectory distribution. We condition on different target positions, indicated by the 'x'-markers. 
The produced trajectories exactly reach the desired targets while keeping the shape of the demonstrations. (b) Combination of two ProMPs. The trajectory distributions are indicated by the blue and red shaded areas. Both primitives have to reach via-points at different points in time, indicated by the 'x'-markers. We co-activate both primitives with the same activation factor. The trajectory distribution generated by the resulting feedback controller now goes through all four via-points. (c) Blending of two ProMPs. We smoothly blend from the red primitive to the blue primitive. The activation factors are shown at the bottom. The resulting movement (green) first follows the red primitive and, subsequently, switches to following the blue primitive.\n\n2.3 Using Trajectory Distributions for Robot Control\n\nIn order to fully exploit the properties of trajectory distributions, a policy for controlling the robot is needed that reproduces these distributions. To this end, we analytically derive a stochastic feedback controller that can accurately reproduce the mean vectors μ_t and the variances Σ_t for all t of a given trajectory distribution.\n\nWe follow a model-based approach. First, we approximate the continuous time dynamics of the system by a linearized discrete-time system with step duration dt,\n\ny_{t+dt} = (I + A_t dt) y_t + B_t dt u + c_t dt,    (9)\n\nwhere the system matrices A_t, the input matrices B_t and the drift vectors c_t can be obtained by first order Taylor expansion of the dynamical system¹. We assume that a stochastic linear feedback controller with time-varying feedback gains generates the control actions, i.e.,\n\nu = K_t y_t + k_t + ε_u,    ε_u ∼ N(ε_u | 0, Σ_u/dt),    (10)\n\nwhere the matrix K_t denotes a feedback gain matrix and k_t a feed-forward component. We use a control noise which behaves like a Wiener process [21], and, hence, its variance grows linearly with the step duration² dt. By substituting Eq. (10) into Eq. (9), we rewrite the next state of the system as\n\ny_{t+dt} = (I + (A_t + B_t K_t) dt) y_t + B_t dt (k_t + ε_u) + c_t dt = F_t y_t + f_t + B_t dt ε_u,\nwith F_t = I + (A_t + B_t K_t) dt,    f_t = B_t k_t dt + c_t dt.    (11)\n\nFor improved clarity, we will omit the time index as subscript for most matrices in the remainder of the paper. From Eq. (4) we know that the distribution for our current state y_t is Gaussian with mean μ_t = Ψ_t^T μ_w and covariance³ Σ_t = Ψ_t^T Σ_w Ψ_t. As the system dynamics are modeled by a Gaussian linear model, we can obtain the distribution of the next state p(y_{t+dt}) analytically from the forward model\n\np(y_{t+dt}) = ∫ N(y_{t+dt} | F y_t + f, Σ_s dt) N(y_t | μ_t, Σ_t) dy_t = N(y_{t+dt} | F μ_t + f, F Σ_t F^T + Σ_s dt),    (12)\n\nwhere Σ_s dt = B Σ_u B^T dt represents the system noise matrix. Both sides of Eq. (12) are Gaussian distributions, where the left-hand side can also be computed from our desired trajectory distribution p(τ; θ). We match the mean and the variance of both sides with our control law, i.e.,\n\nμ_{t+dt} = F μ_t + (B k + c) dt,    Σ_{t+dt} = F Σ_t F^T + Σ_s dt,    (13)\n\nwhere F is given in Eq. (11) and contains the time-varying feedback gains K. Using both constraints, we can now obtain the time-dependent gains K and k.\n\n¹If inverse dynamics control [20] is used for the robot, the system reduces to a linear system where the terms A_t, B_t and c_t are constant in time.\n²As we multiply the noise by B dt, we need to divide the covariance Σ_u of the control noise ε_u by dt to obtain this desired behavior.\n³The observation noise is omitted as it represents independent noise which is not used for predicting the next state.\n\nDerivation of the Controller Gains. By rearranging terms, the covariance constraint becomes\n\nΣ_{t+dt} − Σ_t = Σ_s dt + (A + BK) Σ_t dt + Σ_t (A + BK)^T dt + O(dt²),    (14)\n\nwhere O(dt²) denotes all second order terms in dt. After dividing by dt and taking the limit of dt → 0, the second order terms disappear and we obtain the time derivative of the covariance\n\nΣ̇_t = lim_{dt→0} (Σ_{t+dt} − Σ_t)/dt = (A + BK) Σ_t + Σ_t (A + BK)^T + Σ_s.    (15)\n\nThe matrix Σ̇_t can also be obtained from the trajectory distribution, Σ̇_t = Ψ̇_t^T Σ_w Ψ_t + Ψ_t^T Σ_w Ψ̇_t, which we substitute into Eq. (15). After rearranging terms, the equation reads\n\nM + M^T = BK Σ_t + (BK Σ_t)^T,    with M = Ψ̇_t^T Σ_w Ψ_t − A Σ_t − Σ_s/2.    (16)\n\nSetting M = BK Σ_t and solving for the gain matrix K,\n\nK = B† (Ψ̇_t^T Σ_w Ψ_t − A Σ_t − Σ_s/2) Σ_t^-1,    (17)\n\nyields the solution, where B† denotes the pseudo-inverse of the control matrix B.\n\nDerivation of the Feed-Forward Controls. Similarly, we obtain the feed-forward control signal k by matching the mean of the trajectory distribution μ_{t+dt} with the mean computed with the forward model. After rearranging terms, dividing by dt and taking the limit of dt → 0, we arrive at the continuous time constraint for the vector k,\n\nμ̇_t = (A + BK) μ_t + B k + c.    (18)\n\nWe can again use the trajectory distribution p(τ; θ) to obtain μ_t = Ψ_t^T μ_w and μ̇_t = Ψ̇_t^T μ_w and solve Eq. (18) for k,\n\nk = B† (Ψ̇_t^T μ_w − (A + BK) Ψ_t^T μ_w − c).    (19)\n\nEstimation of the Control Noise. In order to match a trajectory distribution, we also need to match the control noise matrix Σ_u which has been applied to generate the distribution. We first compute the system noise covariance Σ_s = B Σ_u B^T by examining the cross-correlation between time steps of the trajectory distribution. To do so, we compute the joint distribution p(y_t, y_{t+dt}) of the current state y_t and the next state y_{t+dt},\n\np(y_t, y_{t+dt}) = N( [y_t; y_{t+dt}] | [μ_t; μ_{t+dt}], [Σ_t, C_t; C_t^T, Σ_{t+dt}] ),    (20)\n\nwhere C_t = Ψ_t^T Σ_w Ψ_{t+dt} is the cross-correlation. We can again use our model to match the cross-correlation. The joint distribution for y_t and y_{t+dt} is obtained from our system dynamics by p(y_t, y_{t+dt}) = N(y_t | μ_t, Σ_t) N(y_{t+dt} | F y_t + f, Σ_s dt), which yields\n\np(y_t, y_{t+dt}) = N( [y_t; y_{t+dt}] | [μ_t; F μ_t + f], [Σ_t, Σ_t F^T; F Σ_t, F Σ_t F^T + Σ_s dt] ).    (21)\n\nThe noise covariance Σ_s can be obtained by matching both covariance matrices given in Eqs. (20) and (21),\n\nΣ_s dt = Σ_{t+dt} − F Σ_t F^T = Σ_{t+dt} − F Σ_t Σ_t^-1 Σ_t F^T = Σ_{t+dt} − C_t^T Σ_t^-1 C_t.    (22)\n\nThe variance Σ_u of the control noise is then given by Σ_u = B† Σ_s B†^T. As we can see from Eq. 
(22), the variance of our stochastic feedback controller does not depend on the controller gains and can be pre-computed before estimating the controller gains.\n\nFigure 2: A 7-link planar robot has to reach a target position at T = 1.0s with its end-effector while passing a via-point at t1 = 0.25s (top) or t2 = 0.75s (middle). The plot shows the mean posture of the robot at different time steps in black and samples generated by the ProMP in gray. The ProMP approach was able to exactly reproduce the demonstrations, which have been generated by an optimal control law. The combination of both learned ProMPs is shown at the bottom. The resulting movement reached both via-points with high accuracy.\n\nFigure 3: Robot Hockey. The robot shoots a hockey puck. We demonstrate ten straight shots for varying distances and ten shots for varying angles. The pictures show samples from the ProMP model for straight shots (b) and angled shots (c). Learning from the combined data set yields a model that represents variance in both distance and angle (d). Multiplying the individual models leads to a model that only reproduces shots where both models had probability mass, in the center at medium distance (e). The last picture shows the effect of conditioning on only left and right angles (f).\n\n3 Experiments\n\nWe evaluated our approach on two different real robot tasks, one stroke-based movement and one rhythmic movement. Additionally, we illustrate our approach on a 7-link simulated planar robot. For all real robot experiments we use a seven degrees of freedom KUKA lightweight robot arm. A more detailed description of the experiments is given in the supplementary material.\n\n7-link Reaching Task. In this task, a seven link planar robot has to reach a target position in end-effector space. While doing so, it also has to reach a via-point at a certain time point. We generated the demonstrations for learning the MPs with an optimal control law [22]. In the first set of demonstrations, the robot has to reach the via-point at t1 = 0.25s. The reproduced behavior with the ProMPs is illustrated in Figure 2 (top). We learned the coupling of all seven joints with one ProMP. The ProMP exactly reproduced the via-points in task space while exhibiting a large variability in between the time points of the via-points. Moreover, the ProMP could also reproduce the coupling of the joints from the optimal control law, which can be seen from the small variance of the end-effector in comparison to the rather large variance of the single joints at the via-points. The ProMP achieved an average cost value of a similar quality as the optimal controller. We also used a second set of demonstrations where the first via-point was located at time step t2 = 0.75, which is illustrated in Figure 2 (middle). We combined the ProMPs learned from both demonstrations, which resulted in the movement illustrated in Figure 2 (bottom). The combination of both MPs accurately reaches both via-points at t1 = 0.25 and t2 = 0.75.\n\n(a)    (b)    (c)\n\nFigure 4: (a) The maracas task. (b) Trajectory distribution for playing maracas (joint number 4). By modulating the speed of the phase signal z_t, the speed of the movement can be adapted. The plot shows the desired distribution in blue and the generated distribution from the feedback controller in green. Both distributions match. (c) Blending between two rhythmic movements (blue and red shaded areas) for playing maracas. The green shaded area is produced by continuously switching from the blue to the red movement.\n\nRobot Hockey. In the hockey task, the robot has to shoot a hockey puck in different directions and distances. 
The task setup can be seen in Figure 3(a). We record two different sets of demonstrations,\none that contains straight shots with varying distances while the second set contains shots with a\nvarying shooting angle. Both data sets contain ten demonstrations each. Sampling from the two\nmodels generated by the different data sets yields shots that exhibit the demonstrated variance in\neither angle or distance, as shown in Figure 3(b) and 3(c). When combining the two individual\nprimitives, the resulting model shoots only in the center at medium distance, i.e., the intersection\nof both MPs. We also learn a joint distribution over the \ufb01nal puck position and the weight vectors\nw and condition on the angle of the shot. The conditioning yields a model that shoots in different\ndirections, depending on the conditioning, see Figure 3(f).\n\nRobot Maracas. A maracas is a musical instrument containing grains, such that shaking it pro-\nduces sounds. Demonstrating fast movements can be dif\ufb01cult on the robot arm, due to the inertia\nof the arm. Instead, we demonstrate a slower movement of ten periods to learn the motion. We\nuse this slow demonstration and change the phase after learning the model to achieve a shaking\nmovement of appropriate speed to generate the desired sound of the instrument. Using a variable\nphase also allows us to change the speed of the motion during one execution to achieve different\nsound patterns. We show an example movement of the robot in Figure 4(a). The desired trajectory\ndistribution of the rhythmic movement and the resulting distribution generated from the feedback\ncontroller are shown in Figure 4(b). Both distributions match. We also demonstrated a second type\nof rhythmic shaking movement which we use to continuously blend between both movements to\nproduce different sounds. 
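The continuous blending used in the maracas experiment is the activation-weighted product of Gaussians from Eq. (8); a one-dimensional sketch with assumed means, variances, and a sigmoidal activation schedule (all numerical values are illustrative):

```python
import numpy as np

def blend(mu1, var1, mu2, var2, a1, a2):
    """Activation-weighted product of two Gaussians at one time step:
    precisions add in proportion to the activations (cf. Eq. (8))."""
    prec = a1 / var1 + a2 / var2
    var = 1.0 / prec
    mu = var * (a1 * mu1 / var1 + a2 * mu2 / var2)
    return mu, var

ts = np.linspace(0.0, 1.0, 101)
a2 = 1.0 / (1.0 + np.exp(-20.0 * (ts - 0.5)))   # second primitive ramps up
a1 = 1.0 - a2                                    # first primitive fades out
mus, variances = zip(*[blend(-1.0, 0.05, 1.0, 0.05, x1, x2)
                       for x1, x2 in zip(a1, a2)])
```

Early on, the blended mean follows the first primitive (near -1); after the activation crossover it follows the second one (near +1), mirroring the smooth hand-over between the two shaking movements.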
One such transition between the two ProMPs is shown for one joint in Figure 4(c).

4 Conclusion

Probabilistic movement primitives are a promising approach for learning, modulating, and re-using movements in a modular control architecture. To effectively take advantage of such a control architecture, ProMPs support simultaneous activation, match the quality of the behavior encoded in the demonstrations, adapt to different desired target positions, and learn efficiently by imitation. We parametrize the desired trajectory distribution of the primitive by a hierarchical Bayesian model with Gaussian distributions. The trajectory distribution can easily be obtained from demonstrations. Our probabilistic formulation allows for new operations on movement primitives, including conditioning and combination of primitives. Future work will focus on using ProMPs in a modular control architecture and on improving upon imitation learning by reinforcement learning.

Acknowledgements

The research leading to these results has received funding from the European Community's Framework Programme projects CoDyCo (FP7-ICT-2011-9, Grant No. 600716), CompLACS (FP7-ICT-2009-6, Grant No. 270327), and GeRT (FP7-ICT-2009-4, Grant No. 248273).

References

[1] A. Ijspeert and S. Schaal. Learning Attractor Landscapes for Learning Motor Primitives. In Advances in Neural Information Processing Systems 15 (NIPS). MIT Press, Cambridge, MA, 2003.

[2] M. Khansari-Zadeh and A. Billard. Learning Stable Non-Linear Dynamical Systems with Gaussian Mixture Models. IEEE Transactions on Robotics, 2011.

[3] J. Kober, K. Mülling, O. Kroemer, C. Lampert, B. Schölkopf, and J. Peters. Movement Templates for Learning of Hitting and Batting.
In International Conference on Robotics and Automation (ICRA), 2010.

[4] J. Kober and J. Peters. Policy Search for Motor Primitives in Robotics. Machine Learning, pages 1–33, 2010.

[5] A. Ude, A. Gams, T. Asfour, and J. Morimoto. Task-Specific Generalization of Discrete and Periodic Dynamic Movement Primitives. IEEE Transactions on Robotics, (5), October 2010.

[6] B. da Silva, G. Konidaris, and A. Barto. Learning Parameterized Skills. In International Conference on Machine Learning (ICML), 2012.

[7] P. Kormushev, S. Calinon, and D. Caldwell. Robot Motor Skill Coordination with EM-based Reinforcement Learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2010.

[8] C. Daniel, G. Neumann, and J. Peters. Learning Concurrent Motor Skills in Versatile Solution Spaces. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012.

[9] G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto. Robot Learning from Demonstration by Constructing Skill Trees. International Journal of Robotics Research, 31(3):360–375, March 2012.

[10] A. d'Avella and E. Bizzi. Shared and Specific Muscle Synergies in Natural Motor Behaviors. Proceedings of the National Academy of Sciences (PNAS), 102(3):3076–3081, 2005.

[11] B. Williams, M. Toussaint, and A. Storkey. Modelling Motion Primitives and their Timing in Biologically Executed Movements. In Advances in Neural Information Processing Systems (NIPS), 2007.

[12] L. Rozo, S. Calinon, D. G. Caldwell, P. Jimenez, and C. Torras. Learning Collaborative Impedance-Based Robot Behaviors. In AAAI Conference on Artificial Intelligence, 2013.

[13] E. Rueckert, G. Neumann, M. Toussaint, and W. Maass. Learned Graphical Models for Probabilistic Planning provide a new Class of Movement Primitives. 2012.

[14] L. Righetti and A. Ijspeert.
Programmable Central Pattern Generators: an Application to Biped Locomotion Control. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA), 2006.

[15] A. Paraschos, G. Neumann, and J. Peters. A Probabilistic Approach to Robot Trajectory Generation. In Proceedings of the International Conference on Humanoid Robots (HUMANOIDS), 2013.

[16] S. Calinon, P. Kormushev, and D. Caldwell. Compliant Skills Acquisition and Multi-Optima Policy Search with EM-based Reinforcement Learning. Robotics and Autonomous Systems (RAS), 61(4):369–379, 2013.

[17] E. Todorov and M. Jordan. Optimal Feedback Control as a Theory of Motor Coordination. Nature Neuroscience, 5:1226–1235, 2002.

[18] S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert. Learning Movement Primitives. In International Symposium on Robotics Research (ISRR), 2003.

[19] A. Lazaric and M. Ghavamzadeh. Bayesian Multi-Task Reinforcement Learning. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

[20] J. Peters, M. Mistry, F. E. Udwadia, J. Nakanishi, and S. Schaal. A Unifying Methodology for Robot Control with Redundant DOFs. Autonomous Robots, (1):1–12, 2008.

[21] H. Stark and J. Woods. Probability and Random Processes with Applications to Signal Processing. 3rd edition, August 2001.

[22] M. Toussaint. Robot Trajectory Optimization using Approximate Inference. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.