{"title": "Projected Natural Actor-Critic", "book": "Advances in Neural Information Processing Systems", "page_first": 2337, "page_last": 2345, "abstract": "Natural actor-critics are a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability - their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.", "full_text": "Projected Natural Actor-Critic

Philip S. Thomas, William Dabney, Sridhar Mahadevan, and Stephen Giguere
{pthomas,wdabney,mahadeva,sgiguere}@cs.umass.edu
School of Computer Science
University of Massachusetts Amherst
Amherst, MA 01003

Abstract

Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability - their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space.
While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.

1 Introduction

Natural actor-critics form a class of policy search algorithms for finding locally optimal policies for Markov decision processes (MDPs) by approximating and ascending the natural gradient [1] of an objective function. Despite the numerous successes of, and the continually growing interest in, natural actor-critic algorithms, they have not achieved widespread use for real-world applications. A lack of safety guarantees is a common reason for avoiding the use of natural actor-critic algorithms, particularly for biomedical applications. Since natural actor-critics are unconstrained optimization algorithms, there are no guarantees that they will avoid regions of policy space that are known to be dangerous.

For example, proportional-integral-derivative controllers (PID controllers) are the most widely used control algorithms in industry, and have been studied in depth [2]. Techniques exist for determining the set of stable gains (policy parameters) when a model of the system is available [3]. Policy search can be used to find the optimal gains within this set (for some definition of optimality). A desirable property of a policy search algorithm in this context would be a guarantee that it will remain within the predicted region of stable gains during its search.

Consider a second example: functional electrical stimulation (FES) control of a human arm.
By selectively stimulating muscles using subcutaneous probes, researchers have made significant strides toward returning motor control to people suffering from paralysis induced by spinal cord injury [4]. There has been a recent push to develop controllers that specify how much and when to stimulate each muscle in a human arm to move it from its current position to a desired position [5]. This closed-loop control problem is particularly challenging because each person's arm has different dynamics due to differences in, for example, length, mass, strength, clothing, and amounts of muscle atrophy, spasticity, and fatigue. Moreover, these differences are challenging to model. Hence, a proportional-derivative (PD) controller, tuned to a simulation of an ideal human arm, required manual tuning to obtain desirable performance on a human subject with biceps spasticity [6].

Researchers have shown that policy search algorithms are a viable approach to creating controllers that can automatically adapt to an individual's arm by training on a few hundred two-second reaching movements [7]. However, safety concerns have been raised in regard to both this specific application and other biomedical applications of policy search algorithms. Specifically, the existing state-of-the-art gradient-based algorithms, including the current natural actor-critic algorithms, are unconstrained and could potentially select dangerous policies. For example, it is known that certain muscle stimulations could cause the dislocation of a subject's arm. Although we lack an accurate model of each individual's arm, we can generate conservative safety constraints on the space of policies.
Once again, a desirable property of a policy search algorithm would be a guarantee that it will remain within a specified region of policy space (known-safe policies).

In this paper we present a class of natural actor-critic algorithms that perform constrained optimization - given a known safe region of policy space, they search for a locally optimal policy while always remaining within the specified region. We call our class of algorithms Projected Natural Actor-Critics (PNACs) since, whenever they generate a new policy, they project the policy back to the set of safe policies. The interesting question is how the projection can be done in a principled manner. We show that natural gradient descent (ascent), which is an unconstrained optimization algorithm, is a special case of mirror descent (ascent), which is a constrained optimization algorithm. In order to create a projected natural gradient algorithm, we add constraints to the mirror descent algorithm that is equivalent to natural gradient descent. We apply this projected natural gradient algorithm to policy search to create the PNAC algorithms, which we validate empirically.

2 Related Work

Researchers have addressed safety concerns like these before [8]. Bendrahim and Franklin [9] showed how a walking biped robot can switch to a stabilizing controller whenever the robot leaves a stable region of state space. Similar state-avoidant approaches to safety have been proposed by several others [10, 11, 12]. These approaches do not account for situations where, over an unavoidable region of state space, the actions themselves are dangerous. Kuindersma et al. [13] developed a method for performing risk-sensitive policy search, which models the variance of the objective function for each policy and permits runtime adjustments of risk sensitivity. However, their approach does not guarantee that an unsafe region of state space or policy space will be avoided. Bhatnagar et al.
[14] presented projected natural actor-critic algorithms for the average reward setting. As in our projected natural actor-critic algorithms, they proposed computing the update to the policy parameters and then projecting back to the set of allowed policy parameters. However, they did not specify how the projection could be done in a principled manner. We show in Section 7 that the Euclidean projection can be arbitrarily bad, and argue that the projection that we propose is particularly compatible with natural actor-critics (natural gradient descent).

Duchi et al. [15] presented mirror descent using the Mahalanobis norm for the proximal function, which is very similar to the proximal function that we show to cause mirror descent to be equivalent to natural gradient descent. However, their proximal function is not identical to ours and they did not discuss any possible relationship between mirror descent and natural gradient descent.

3 Natural Gradients

Consider the problem of minimizing a differentiable function f : R^n → R. The standard gradient descent approach is to select an initial x_0 ∈ R^n, compute the direction of steepest descent, −∇f(x_0), and then move some amount in that direction (scaled by a step size parameter, α_0). This process is then repeated indefinitely: x_{k+1} = x_k − α_k ∇f(x_k), where {α_k} is a step size schedule and k ∈ {1, . . .}. Gradient descent has been criticized for its low asymptotic rate of convergence. Natural gradients are a quasi-Newton approach to improving the convergence rate of gradient descent.

When computing the direction of steepest descent, gradient descent assumes that the vector x_k resides in Euclidean space.
However, in several settings it is more appropriate to assume that x_k resides in a Riemannian space with metric tensor G(x_k), which is an n × n positive definite matrix that may vary with x_k [16]. In this case, the direction of steepest descent is called the natural gradient and is given by −G(x_k)^{−1} ∇f(x_k) [1]. In certain cases (which include our policy search application), following the natural gradient is asymptotically Fisher-efficient [16].

4 Mirror Descent

Mirror descent algorithms form a class of highly scalable online gradient methods that are useful in constrained minimization of non-smooth functions [17, 18]. They have recently been applied to value function approximation and basis adaptation for reinforcement learning [19, 20]. The mirror descent update is

x_{k+1} = ∇ψ*_k(∇ψ_k(x_k) − α_k ∇f(x_k)),   (1)

where ψ_k : R^n → R is a continuously differentiable and strongly convex function called the proximal function, and where the conjugate of ψ_k is ψ*_k(y) ≜ max_{x ∈ R^n} {x^⊤ y − ψ_k(x)}, for any y ∈ R^n. Different choices of ψ_k result in different mirror descent algorithms. A common choice for a fixed ψ_k = ψ, ∀k, is the p-norm [20], and a common adaptive ψ_k is the Mahalanobis norm with a dynamic covariance matrix [15].

Intuitively, the distance metric for the space that x_k resides in is not necessarily the same as that of the space that ∇f(x_k) resides in. This suggests that it may not be appropriate to directly add x_k and −α_k ∇f(x_k) in the gradient descent update. To correct this, mirror descent moves x_k into the space of gradients (the dual space) with ∇ψ_k(x_k) before performing the gradient update.
It takes the result of this step in gradient space and returns it to the space of x_k (the primal space) with ∇ψ*_k. Different choices of ψ_k amount to different assumptions about the relationship between the primal and dual spaces at x_k.

5 Equivalence of Natural Gradient Descent and Mirror Descent

Theorem 5.1. The natural gradient descent update at step k with metric tensor G_k ≜ G(x_k):

x_{k+1} = x_k − α_k G_k^{−1} ∇f(x_k),   (2)

is equivalent to (1), the mirror descent update at step k, with ψ_k(x) = (1/2) x^⊤ G_k x.

Proof. First, notice that ∇ψ_k(x) = G_k x. Next, we derive a closed form for ψ*_k:

ψ*_k(y) = max_{x ∈ R^n} { x^⊤ y − (1/2) x^⊤ G_k x }.   (3)

Since the function being maximized on the right hand side is strictly concave, the x that maximizes it is its critical point. Solving for this critical point, we get x = G_k^{−1} y. Substituting this into (3), we find that ψ*_k(y) = (1/2) y^⊤ G_k^{−1} y. Hence, ∇ψ*_k(y) = G_k^{−1} y. Inserting the definitions of ∇ψ_k(x) and ∇ψ*_k(y) into (1), we find that the mirror descent update is

x_{k+1} = G_k^{−1}(G_k x_k − α_k ∇f(x_k)) = x_k − α_k G_k^{−1} ∇f(x_k),

which is identical to (2). □

Although researchers often use ψ_k that are norms like the p-norm and Mahalanobis norm, notice that the ψ_k that results in natural gradient descent is not a norm. Also, since G_k depends on k, ψ_k is an adaptive proximal function [15].

6 Projected Natural Gradients

When x is constrained to some set, X, ψ_k in mirror descent is augmented with the indicator function I_X, where I_X(x) = 0 if x ∈ X, and +∞ otherwise.
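The equivalence in Theorem 5.1 is easy to sanity-check numerically. A minimal sketch (the function f, metric tensor G, and all values below are our own illustrative choices, not from the paper):

```python
import numpy as np

# Illustrative quadratic objective f(x) = x^T M x, with grad f(x) = 2 M x.
M = np.array([[3.0, 1.0], [1.0, 2.0]])
G = np.array([[2.0, 0.5], [0.5, 1.0]])  # a fixed positive definite metric tensor G_k

def grad_f(x):
    return 2.0 * M @ x

x_k = np.array([1.0, -2.0])
alpha = 0.1

# Natural gradient descent update (2): x_{k+1} = x_k - alpha * G^{-1} grad f(x_k).
x_natural = x_k - alpha * np.linalg.solve(G, grad_f(x_k))

# Mirror descent update (1) with psi(x) = (1/2) x^T G x, so that
# grad psi(x) = G x and grad psi*(y) = G^{-1} y:
x_mirror = np.linalg.solve(G, G @ x_k - alpha * grad_f(x_k))

assert np.allclose(x_natural, x_mirror)
```

Both updates produce the same iterate, as the proof of Theorem 5.1 predicts.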
The ψ_k that was shown to generate an update equivalent to the natural gradient descent update, with the added constraint that x ∈ X, is ψ_k(x) = (1/2) x^⊤ G_k x + I_X(x). Hereafter, any references to ψ_k refer to this augmented version. For this proximal function, the subdifferential of ψ_k(x) is ∇ψ_k(x) = G_k(x) + N̂_X(x) = (G_k + N̂_X)(x), where N̂_X(x) ≜ ∂I_X(x) and, in the middle term, G_k and N̂_X are relations and + denotes Minkowski addition.¹ N̂_X(x) is the normal cone of X at x if x ∈ X and ∅ otherwise [21]. Hence,

∇ψ*_k(y) = (G_k + N̂_X)^{−1}(y).   (4)

¹ Later, we abuse notation and switch freely between treating G_k as a matrix and a relation. When it is a matrix, G_k x denotes matrix-vector multiplication that produces a vector. When it is a relation, G_k(x) produces the singleton {G_k x}.

Let Π^{G_k}_X(y) be the set of x ∈ X that are closest to y, where the length of a vector, z, is (1/2) z^⊤ G_k z. More formally,

Π^{G_k}_X(y) ≜ argmin_{x ∈ X} (1/2)(y − x)^⊤ G_k (y − x).   (5)

Lemma 6.1. Π^{G_k}_X(y) = (G_k + N̂_X)^{−1}(G_k y).

Proof. We write (5) without the explicit constraint that x ∈ X by appending the indicator function:

Π^{G_k}_X(y) = argmin_{x ∈ R^n} h_y(x),

where h_y(x) = (1/2)(y − x)^⊤ G_k (y − x) + I_X(x). Since h_y is strictly convex over X and +∞ elsewhere, its critical point is its global minimizer. The critical point satisfies

0 ∈ ∇h_y(x) = −G_k(y) + G_k(x) + N̂_X(x).

The globally minimizing x therefore satisfies G_k y ∈ G_k(x) + N̂_X(x) = (G_k + N̂_X)(x). Solving for x, we find that x = (G_k + N̂_X)^{−1}(G_k y). □

Combining Lemma 6.1 with (4), we find that ∇ψ*_k(y) = Π^{G_k}_X(G_k^{−1} y).
Hence, mirror descent with the proximal function that produces natural gradient descent, augmented to include the constraint that x ∈ X, is:

x_{k+1} = Π^{G_k}_X(G_k^{−1}((G_k + N̂_X)(x_k) − α_k ∇f(x_k)))
        = Π^{G_k}_X((I + G_k^{−1} N̂_X)(x_k) − α_k G_k^{−1} ∇f(x_k)),

where I denotes the identity relation. Since x_k ∈ X, we know that 0 ∈ N̂_X(x_k), and hence the update can be written as

x_{k+1} = Π^{G_k}_X(x_k − α_k G_k^{−1} ∇f(x_k)),   (6)

which we call projected natural gradient (PNG).

7 Compatibility of Projection

The standard projected subgradient (PSG) descent method follows the negative gradient (as opposed to the negative natural gradient) and projects back to X using the Euclidean norm. If f and X are convex and the step size is decayed appropriately, it is guaranteed to converge to a global minimum, x* ∈ X. Any such x* is a fixed point. This means that a small step in the negative direction of any subdifferential of f at x* will project back to x*.

Our choice of projection, Π^{G_k}_X, results in PNG having the same fixed points (see Lemma 7.1). This means that, when the algorithm is at x* and a small step is taken down the natural gradient to x′, Π^{G_k}_X will project x′ back to x*. We therefore say that Π^{G_k}_X is compatible with the natural gradient. For comparison, the Euclidean projection of x′ will not necessarily return x′ to x*.

Lemma 7.1. The sets of fixed points for PSG and PNG are equivalent.

Proof. A necessary and sufficient condition for x to be a fixed point of PSG is that −∇f(x) ∈ N̂_X(x) [22].
A necessary and sufficient condition for x to be a fixed point of PNG is

x = Π^{G_k}_X(x − α_k G_k^{−1} ∇f(x))
  = (G_k + N̂_X)^{−1}(G_k(x − α_k G_k^{−1} ∇f(x)))
  = (G_k + N̂_X)^{−1}(G_k x − α_k ∇f(x))
⇔ G_k x − α_k ∇f(x) ∈ G_k(x) + N̂_X(x)
⇔ −∇f(x) ∈ N̂_X(x). □

To emphasize the importance of using a compatible projection, consider the following simple example. Minimize the function f(x) = x^⊤ A x + b^⊤ x, where A = diag(1, 0.01) and b = [−0.2, −0.1]^⊤, subject to the constraints ‖x‖_1 ≤ 1 and x ≥ 0. We implemented three algorithms, and ran each for 1000 iterations using a fixed step size:

1. PSG - projected subgradient descent using the Euclidean projection.
2. PNG - projected natural gradient descent using Π^{G_k}_X.
3. PNG-Euclid - projected natural gradient descent using the Euclidean projection.

Figure 1: The thick diagonal line shows one constraint and dotted lines show projections. Solid arrows show the directions of the natural gradient and gradient at the optimal solution, x*. The dashed blue arrows show PNG-Euclid's projections, and emphasize that the projections cause PNG-Euclid to move away from the optimal solution.

The results are shown in Figure 1. Notice that PNG and PSG converge to the optimal solution, x*. From this point, they both step in different directions, but project back to x*. However, PNG-Euclid converges to a suboptimal solution (outside the domain of the figure).
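This example is small enough to reproduce directly. The sketch below is our own reimplementation (the paper does not say how its projections were computed; here we solve each projection exactly by enumerating active sets of the small polytope), running PNG and PNG-Euclid; PSG behaves like PNG here and is omitted for brevity:

```python
import itertools
import numpy as np

A = np.diag([1.0, 0.01])
b = np.array([-0.2, -0.1])
f = lambda x: x @ A @ x + b @ x
grad_f = lambda x: 2.0 * A @ x + b

# Feasible set: x >= 0 and x1 + x2 <= 1 (the l1 constraint with x >= 0),
# written as C x <= d.
C = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
d = np.array([1.0, 0.0, 0.0])

def project(y, W):
    """Exact projection of y onto {x : C x <= d} in the norm induced by W,
    by enumerating active constraint sets of this tiny polytope."""
    best, best_dist = None, np.inf
    Winv = np.linalg.inv(W)
    for k in range(C.shape[0] + 1):
        for active in itertools.combinations(range(C.shape[0]), k):
            if not active:
                x = y.copy()
            else:
                Ca, da = C[list(active)], d[list(active)]
                # Minimize (1/2)(y-x)^T W (y-x) s.t. Ca x = da, via KKT.
                lam = np.linalg.lstsq(Ca @ Winv @ Ca.T, Ca @ y - da, rcond=None)[0]
                x = y - Winv @ Ca.T @ lam
            if np.all(C @ x <= d + 1e-9):
                dist = 0.5 * (y - x) @ W @ (y - x)
                if dist < best_dist:
                    best, best_dist = x, dist
    return best

G = A  # metric tensor used by the natural methods in this example
alpha = 0.05
x_png = np.zeros(2)   # PNG: compatible (G-weighted) projection
x_pnge = np.zeros(2)  # PNG-Euclid: Euclidean projection
for _ in range(300):
    x_png = project(x_png - alpha * np.linalg.solve(G, grad_f(x_png)), G)
    x_pnge = project(x_pnge - alpha * np.linalg.solve(G, grad_f(x_pnge)), np.eye(2))
```

Running this, x_png settles on the constrained optimum on the face x1 + x2 = 1, while x_pnge stalls at the corner (0, 1), an inferior point, illustrating the incompatibility of the Euclidean projection with the natural gradient step.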
If X were a line segment between the points that PNG-Euclid and PNG converge to, then PNG-Euclid would converge to the pessimal solution within X, while PSG and PNG would converge to the optimal solution within X. Also, notice that the natural gradient corrects for the curvature of the function and heads directly towards the global unconstrained minimum. Since the natural methods in this example use metric tensor G = A, which is the Hessian of f, they are essentially an incremental form of Newton's method. In practice, the Hessian is usually not known, and an estimate thereof is used.

8 Natural Actor-Critic Algorithms

An MDP is a tuple M = (S, A, P, R, d_0, γ), where S is a set of states, A is a set of actions, P(s′|s, a) gives the probability density of the system entering state s′ when action a is taken in state s, R(s, a) is the expected reward, r, when action a is taken in state s, d_0 is the initial state distribution, and γ ∈ [0, 1) is a reward discount parameter. A parameterized policy, π, is a conditional probability density function - π(a|s, θ) is the probability density of action a in state s given a vector of policy parameters, θ ∈ R^n.

Let J(θ) = E[∑_{t=0}^∞ γ^t r_t | θ] be the discounted-reward objective, or let J(θ) = lim_{n→∞} (1/n) E[∑_{t=0}^n r_t | θ] be the average-reward objective. Given an MDP, M, and a parameterized policy, π, the goal is to find policy parameters that maximize one of these objectives.
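For a fixed policy, the discounted objective is just an expected discounted sum of rewards, which can be evaluated directly on a small MDP. A toy sketch (the two-state MDP below is our own construction, purely for illustration):

```python
# Toy deterministic 2-state MDP used to illustrate the discounted objective
# J = sum_t gamma^t r_t for a fixed policy.
gamma = 0.9
next_state = {0: 1, 1: 0}   # transitions: 0 -> 1 -> 0 -> ...
reward = {0: 1.0, 1: 0.0}   # reward received in each state

def discounted_return(horizon=300):
    """Truncated evaluation of J for the fixed behavior encoded above."""
    s, total = 0, 0.0
    for t in range(horizon):
        total += gamma ** t * reward[s]
        s = next_state[s]
    return total

# Reward 1 arrives on even steps only, so J = sum_k gamma^(2k) = 1/(1 - gamma^2).
```

Because rollouts only ever truncate or sample this expectation, policy search methods work with estimates of J(θ) and of its gradient rather than exact values.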
When the action set is continuous, the search for globally optimal policy parameters becomes intractable, so policy search algorithms typically search for locally optimal policy parameters. Natural actor-critics, first proposed by Kakade [23], are algorithms that estimate and ascend the natural gradient of J(θ), using the average Fisher information matrix as the metric tensor:

G_k = G(θ_k) = E_{s∼d^π, a∼π}[(∂/∂θ_k log π(a|s, θ_k))(∂/∂θ_k log π(a|s, θ_k))^⊤],

where d^π is a policy and objective function-dependent distribution over the state set [24].

There are many natural actor-critics, including Natural policy gradient utilizing the Temporal Differences (NTD) algorithm [25], Natural Actor-Critic using LSTD-Q(λ) (NAC-LSTD) [26], Episodic Natural Actor-Critic (eNAC) [26], Natural Actor-Critic using Sarsa(λ) (NAC-Sarsa) [27], Incremental Natural Actor-Critic (INAC) [28], and Natural-Gradient Actor-Critic with Advantage Parameters (NGAC) [14]. All of them form an estimate, typically denoted w_k, of the natural gradient of J(θ_k). That is, w_k ≈ G(θ_k)^{−1} ∇J(θ_k). They then perform the policy parameter update, θ_{k+1} = θ_k + α_k w_k.

9 Projected Natural Actor-Critics

If we are given a closed convex set, Θ ⊆ R^n, of admissible policy parameters (e.g., the stable region of gains for a PID controller), we may wish to ensure that the policy parameters remain within Θ. The natural actor-critic algorithms described in the previous section do not provide such a guarantee.
However, their policy parameter update equations, which are natural gradient ascent updates, can easily be modified to the projected natural gradient ascent update in (6) by projecting the parameters back onto Θ using Π^{G(θ_k)}_Θ:

θ_{k+1} = Π^{G(θ_k)}_Θ(θ_k + α_k w_k).

Many of the existing natural policy gradient algorithms, including NAC-LSTD, eNAC, NAC-Sarsa, and INAC, follow biased estimates of the natural policy gradient [29]. For our experiments, we must use an unbiased algorithm since the projection that we propose is compatible with the natural gradient, but not necessarily with biased estimates thereof.

NAC-Sarsa and INAC are equivalent biased discounted-reward natural actor-critic algorithms with per-time-step time complexity linear in the number of features. The former was derived by replacing the LSTD-Q(λ) component of NAC-LSTD with Sarsa(λ), while the latter is the discounted-reward version of NGAC. Both are similar to NTD, which is a biased average-reward algorithm. The unbiased discounted-reward form of NAC-Sarsa was recently derived [29]. References to NAC-Sarsa hereafter refer to this unbiased variant. In our case studies we use the projected natural actor-critic using Sarsa(λ) (PNAC-Sarsa), the projected version of the unbiased NAC-Sarsa algorithm.

Notice that the projection, Π^{G(θ_k)}_Θ, as defined in (5), is not merely the Euclidean projection back onto Θ. For example, if Θ is the set of θ that satisfy Aθ ≤ b, for some fixed matrix A and vector b, then the projection, Π^{G(θ_k)}_Θ, of y onto Θ is given by the quadratic program

minimize f(θ) = −y^⊤ G(θ_k) θ + (1/2) θ^⊤ G(θ_k) θ,   s.t. Aθ ≤ b.

In order to perform this projection, we require an estimate of the average Fisher information matrix, G(θ_k).
If the natural actor-critic algorithm does not already include this (as NAC-LSTD and NAC-Sarsa do not), then an estimate can be generated by selecting G_0 = βI, where β is a positive scalar and I is the identity matrix, and then updating the estimate with

G_{t+1} = (1 − μ_t) G_t + μ_t (∂/∂θ_k log π(a_t|s_t, θ_k))(∂/∂θ_k log π(a_t|s_t, θ_k))^⊤,

where {μ_t} is a step size schedule [14]. Notice that we use t and k subscripts since many time steps of the MDP may pass between updates to the policy parameters.

10 Case Study: Functional Electrical Stimulation

In this case study, we searched for proportional-derivative (PD) gains to control a simulated human arm undergoing FES. We used the Dynamic Arm Simulator 1 (DAS1) [30], a detailed biomechanical simulation of a human arm undergoing functional electrical stimulation. In a previous study, a controller created using DAS1 performed well on an actual human subject undergoing FES, although it required some additional tuning in order to cope with biceps spasticity [6]. This suggests that it is a reasonably accurate model of an ideal arm.

The DAS1 model, depicted in Figure 2a, has state s_t = (φ_1, φ_2, φ̇_1, φ̇_2, φ_1^target, φ_2^target), where φ_1^target and φ_2^target are the desired joint angles, and the desired joint angle velocities are zero. The goal is to, during a two-second episode, move the arm from its random initial state to a randomly chosen stationary target. The arm is controlled by providing a stimulation in the interval [0, 1] to each of six muscles. The reward function used was similar to that of Jagodnik and van den Bogert [6], which punishes joint angle error and high muscle stimulation.
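The running Fisher-matrix estimate described above is straightforward to implement. A sketch for a simple Gaussian policy (the policy form, feature distribution, and all constants here are our own illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

def score(theta, x, a):
    """Gradient of log pi(a|x, theta) for a Gaussian policy
    a ~ N(theta . x, sigma^2): (a - theta . x) * x / sigma^2."""
    return (a - theta @ x) * x / sigma ** 2

# Running estimate G_{t+1} = (1 - mu_t) G_t + mu_t * s s^T, from G_0 = beta I.
# With mu_t = 1/t the estimate is simply the sample mean of the outer products.
beta = 1.0
theta = np.array([0.3, -0.2])
G = beta * np.eye(2)
for t in range(1, 5001):
    x = rng.normal(size=2)                 # stand-in for observed state features
    a = theta @ x + sigma * rng.normal()   # action sampled from the policy
    s = score(theta, x, a)
    mu = 1.0 / t
    G = (1.0 - mu) * G + mu * np.outer(s, s)
```

For this policy and feature distribution the true average Fisher matrix is (1/σ²)·I, and the running estimate converges to it as more samples arrive.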
We searched for locally optimal PD gains using PNAC-Sarsa, where the policy was a PD controller with Gaussian noise added for exploration.

Although DAS1 does not model shoulder dislocation, we added safety constraints by limiting the l1-norm of certain pairs of gains. The constraints were selected to limit the forces applied to the humerus. These constraints can be expressed in the form Aθ ≤ b, where A is a matrix, b is a vector, and θ are the PD gains (policy parameters). We compared the performance of three algorithms:

1. NAC: NAC-Sarsa with no constraints on θ.
2. PNAC: PNAC-Sarsa using the compatible projection, Π^{G(θ_k)}_Θ.
3. PNAC-E: PNAC-Sarsa using the Euclidean projection.

Figure 2a: DAS1, the two-joint, six-muscle biomechanical model used. Antagonistic muscle pairs are as follows, listed as (flexor, extensor): monoarticular shoulder muscles (a: anterior deltoid, b: posterior deltoid); monoarticular elbow muscles (c: brachialis, d: triceps brachii (short head)); biarticular muscles (e: biceps brachii, f: triceps brachii (long head)).

Figure 2b: Mean return during the last 250,000 episodes of training using the three algorithms. Standard deviation error bars from the 10 trials are provided. The NAC bar is red to emphasize that the final policy found by NAC resides in the dangerous region of policy space.

Since we are not promoting the use of one natural actor-critic over another, we did not focus on finely tuning the natural actor-critic nor comparing the learning speeds of different natural actor-critics.
Rather, we show the importance of the proper projection by allowing PNAC-Sarsa to run for a million episodes (far longer than required for convergence), after which we plot the mean sum of rewards during the last quarter million episodes. Each algorithm was run ten times, and the results averaged and plotted in Figure 2b. Notice that PNAC performs worse than the unconstrained NAC. This happens because NAC leaves the safe region of policy space during its search, and converges to a dangerous policy - one that reaches the goal quickly and with low total muscle force, but which can cause large, short spikes in muscle forces surrounding the shoulder, which violates our safety constraints. We suspect that PNAC converges to a near-optimal policy within the region of policy space that we have designated as safe. PNAC-E converges to a policy that is worse than that found by PNAC because it uses an incompatible projection.

11 Case Study: uBot Balancing

In the previous case study, the optimal policy lay outside the designated safe region of policy space (this is common when a single failure is so costly that adding a penalty to the reward function for failure is impractical, since a single failure is unacceptable). We present a second case study in which the optimal policy lies within the designated safe region of policy space, but where an unconstrained search algorithm may enter the unsafe region during its search of policy space (at which point large negative rewards return it to the safe region).

The uBot-5, shown in Figure 3, is an 11-DoF mobile manipulator developed at the University of Massachusetts Amherst [31, 32]. During experiments, it often uses its arms to interact with the world. Here, we consider the problem faced by the controller tasked with keeping the robot balanced during such experiments.
To allow for results that are easy to visualize in 2D, we use a PD controller that observes only the current body angle, its time derivative, and the target angle (always vertical). This results in the PD controller having only two gains (tunable policy parameters). We use a crude simulation of the uBot-5 with random upper-body movements, and search for the PD gains that minimize a weighted combination of the energy used and the mean angle error (distance from vertical).

We constructed a set of conservative estimates of the region of stable gains, with which the uBot-5 should never fall, and used PNAC-Sarsa and NAC-Sarsa to search for the optimal gains. Each training episode lasted 20 seconds, but was terminated early (with a large penalty) if the uBot-5 fell over. Figure 3 (middle) shows performance over 100 training episodes. Using NAC-Sarsa, the PD weights often left the conservative estimate of the safe region, which resulted in the uBot-5 falling over. Figure 3 (right) shows one trial where the uBot-5 fell over four times (circled in red).

Figure 3: Left: uBot-5 holding a ball. Middle: Mean (over 20 trials) returns over time using PNAC-Sarsa and NAC-Sarsa on the simulated uBot-5 balancing task. The shaded region depicts standard deviations. Right: Trace of the two PD gains, θ_1 and θ_2, from a typical run of PNAC-Sarsa and NAC-Sarsa. A marker is placed for the gains after each episode, and red markers denote episodes where the simulated uBot-5 fell over.

The resulting large punishments cause NAC-Sarsa to quickly return to the safe region of policy space. Using PNAC-Sarsa, the simulated uBot-5 never fell. Both algorithms converge to gains that reside within the safe region of policy space.
We selected this example because it shows how, even if the optimal solution resides within the safe region of policy space (unlike in the previous case study), unconstrained RL algorithms may traverse unsafe regions of policy space during their search.

12 Conclusion

We presented a class of algorithms, which we call projected natural actor-critics (PNACs). PNACs are a simple modification of existing natural actor-critic algorithms to include a projection of newly computed policy parameters back onto an allowed set of policy parameters (e.g., those of policies that are known to be safe). We argued that a principled projection is the one that results from viewing natural gradient descent, which is an unconstrained algorithm, as a special case of mirror descent, which is a constrained algorithm.

We showed that the resulting projection is compatible with the natural gradient and gave a simple empirical example that shows why a compatible projection is important. This example also shows how an incompatible projection can result in natural gradient descent converging to a pessimal solution in situations where a compatible projection results in convergence to an optimal solution. We then applied a PNAC algorithm to a realistic constrained control problem with six-dimensional continuous states and actions. Our results support our claim that the use of an incompatible projection can result in convergence to inferior policies. Finally, we applied PNAC to a simulated robot and showed its substantial benefits over unconstrained natural actor-critic algorithms.

References

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.

[2] K. J. Åström and T. Hägglund. PID Controllers: Theory, Design, and Tuning. ISA: The Instrumentation, Systems, and Automation Society, 1995.

[3] M. T. Söylemez, N. Munro, and H. Baki.
Fast calculation of stabilizing PID controllers. Automatica, 39(1):121–126, 2003.
[4] C. L. Lynch and M. R. Popovic. Functional electrical stimulation. IEEE Control Systems Magazine, 28:40–50.
[5] E. K. Chadwick, D. Blana, A. J. van den Bogert, and R. F. Kirsch. A real-time 3-D musculoskeletal model for dynamic simulation of arm movements. IEEE Transactions on Biomedical Engineering, 56:941–948, 2009.
[6] K. Jagodnik and A. van den Bogert. A proportional derivative FES controller for planar arm movement. In 12th Annual Conference of the International FES Society, Philadelphia, PA, 2007.
[7] P. S. Thomas, M. S. Branicky, A. J. van den Bogert, and K. M. Jagodnik. Application of the actor-critic architecture to functional electrical stimulation control of a human arm. In Proceedings of the Twenty-First Innovative Applications of Artificial Intelligence, 2009.
[8] T. J. Perkins and A. G. Barto. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3:803–832, 2003.
[9] H. Bendrahim and J. A. Franklin. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, 22:283–302, 1997.
[10] A. Arapostathis, R. Kumar, and S. P. Hsu. Control of Markov chains with safety bounds. IEEE Transactions on Automation Science and Engineering, 2:333–343, October 2005.
[11] E. Arvelo and N. C. Martins. Control design for Markov chains under safety constraints: A convex approach. CoRR, abs/1209.2883, 2012.
[12] P. Geibel and F. Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.
[13] S. Kuindersma, R. Grupen, and A. G. Barto. Variational Bayesian optimization for runtime risk-sensitive control. In Robotics: Science and Systems VIII, 2012.
[14] S. Bhatnagar, R. S. Sutton, M.
Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
[15] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report UCB/EECS-2010-24, Electrical Engineering and Computer Sciences, University of California at Berkeley, March 2010.
[16] S. Amari and S. Douglas. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 1213–1216, 1998.
[17] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.
[18] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 2003.
[19] S. Mahadevan and B. Liu. Sparse Q-learning with mirror descent. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2012.
[20] S. Mahadevan, S. Giguere, and N. Jacek. Basis adaptation for sparse nonlinear reinforcement learning. In Proceedings of the Conference on Artificial Intelligence, 2013.
[21] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, New Jersey, 1970.
[22] J. Nocedal and S. Wright. Numerical Optimization. Springer, second edition, 2006.
[23] S. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, pages 1531–1538, 2002.
[24] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, 2000.
[25] T. Morimura, E. Uchibe, and K. Doya. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Application, 2005.
[26] J. Peters and S.
Schaal. Natural actor-critic. Neurocomputing, 71:1180–1190, 2008.
[27] P. S. Thomas and A. G. Barto. Motor primitive discovery. In Proceedings of the IEEE Conference on Development and Learning and Epigenetic Robotics, 2012.
[28] T. Degris, P. M. Pilarski, and R. S. Sutton. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference, 2012.
[29] P. S. Thomas. Bias in natural actor-critic algorithms. Technical Report UM-CS-2012-018, Department of Computer Science, University of Massachusetts at Amherst, 2012.
[30] D. Blana, R. F. Kirsch, and E. K. Chadwick. Combined feedforward and feedback control of a redundant, nonlinear, dynamic musculoskeletal system. Medical and Biological Engineering and Computing, 47:533–542, 2009.
[31] P. Deegan. Whole-Body Strategies for Mobility and Manipulation. PhD thesis, University of Massachusetts Amherst, 2010.
[32] S. R. Kuindersma, E. Hannigan, D. Ruiken, and R. A. Grupen. Dexterous mobility with the uBot-5 mobile manipulator. In Proceedings of the 14th International Conference on Advanced Robotics, 2009.