{"title": "Inverse Reinforcement Learning through Structured Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1007, "page_last": 1015, "abstract": "This paper addresses the inverse reinforcement learning (IRL) problem, that is, inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multi-class classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator.", "full_text": "Inverse Reinforcement Learning through Structured Classification

Edouard Klein1,2 (1LORIA – team ABC, Nancy, France; edouard.klein@supelec.fr)
Matthieu Geist2 (2Supélec – IMS-MaLIS Research Group, Metz, France; matthieu.geist@supelec.fr)
Bilal Piot2,3, Olivier Pietquin2,3 (3UMI 2958 (GeorgiaTech-CNRS), Metz, France; {bilal.piot,olivier.pietquin}@supelec.fr)

Abstract

This paper addresses the inverse reinforcement learning (IRL) problem, that is, inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multi-class classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior.
This is illustrated on a car driving simulator.

1 Introduction

Inverse reinforcement learning (IRL) [14] consists in finding a reward function such that a demonstrated expert behavior is optimal. Many IRL algorithms (to be briefly reviewed in Sec. 5) search for a reward function such that the associated optimal policy induces a distribution over trajectories (or some measure of this distribution) which matches the one induced by the expert. Often, this distribution is characterized by the so-called feature expectation (see Sec. 2.1): given a reward function linearly parameterized by some feature vector, it is the expected discounted cumulative feature vector for starting in a given state, applying a given action and following the related policy.

In this paper, we take a different route. The expert behavior could be mimicked by a supervised learning algorithm generalizing the mapping from states to actions. Here, we generally consider multi-class classifiers which compute from a training set the parameters of a linearly parameterized score function; the decision rule for a given state is the argument (the action) which maximizes the score function for this state (see Sec. 2.2). The basic idea of our SCIRL (Structured Classification-based IRL) algorithm is simply to take an estimate of the expert feature expectation as the parameterization of the score function (see Sec. 3.1). The computed parameter vector actually defines a reward function for which we show the expert policy to be near-optimal (Sec. 3.2).

Contrary to most existing IRL algorithms, a clear advantage of SCIRL is that it does not require repeatedly solving the direct reinforcement learning (RL) problem. It does require estimating the expert feature expectation, but this is roughly a policy evaluation problem (for an observed policy, so it is less involved than repeated policy optimization problems), see Sec. 4.
Moreover, up to the use of some heuristic, SCIRL may be trained solely from transitions sampled from the expert policy (no need to sample the whole dynamics). We illustrate this on a car driving simulator in Sec. 6.

2 Background and Notations

2.1 (Inverse) Reinforcement Learning

A Markov Decision Process (MDP) [12] is a tuple {S, A, P, R, γ} where S is the finite state space¹, A the finite action space, P = {P_a = (p(s'|s, a))_{1≤s,s'≤|S|}, a ∈ A} the set of Markovian transition probabilities, R ∈ R^S the state-dependent reward function and γ the discount factor. A deterministic policy π ∈ A^S defines the behavior of an agent. The quality of this control is quantified by the value function v^π_R ∈ R^S, associating to each state the cumulative discounted reward for starting in this state and following the policy π afterwards: v^π_R(s) = E[Σ_{t≥0} γ^t R(S_t) | S_0 = s, π]. An optimal policy π*_R (according to the reward function R) is a policy whose associated value function v*_R satisfies v*_R ≥ v^π_R, for any policy π and componentwise.

Let P_π be the stochastic matrix P_π = (p(s'|s, π(s)))_{1≤s,s'≤|S|}. With a slight abuse of notation, we may write a for the policy which associates the action a to each state s. The Bellman evaluation (resp. optimality) operators T^π_R (resp. T*_R) : R^S → R^S are defined as T^π_R v = R + γ P_π v and T*_R v = max_π T^π_R v. These operators are contractions, and v^π_R and v*_R are their respective fixed points: v^π_R = T^π_R v^π_R and v*_R = T*_R v*_R. The action-value function Q^π_R ∈ R^{S×A} adds a degree of freedom on the choice of the first action; it is formally defined as Q^π_R(s, a) = [T^a_R v^π_R](s). We also write ρ_π for the stationary distribution of the policy π (satisfying ρ_π^T P_π = ρ_π^T).

Reinforcement learning and approximate dynamic programming aim at estimating the optimal control policy π*_R when the model (transition probabilities and reward function) is unknown (but observed through interactions with the system to be controlled) and when the state space is too large to allow exact representations of the objects of interest (such as value functions or policies) [2, 15, 17]. We refer to this as the direct problem. On the contrary, (approximate) inverse reinforcement learning [11] aims at estimating a reward function for which an observed policy is (nearly) optimal. Let us call this policy the expert policy, denoted π_E. We may assume that it optimizes some unknown reward function R_E. The aim of IRL is to compute some reward R̂ such that the expert policy is (close to) optimal, that is, such that v*_R̂ ≈ v^{π_E}_R̂. We refer to this as the inverse problem.

Similarly to the direct problem, the state space may be too large for the reward function to admit a practical exact representation. Therefore, we restrict our search for a good reward to linearly parameterized functions. Let φ(s) = (φ_1(s) ... φ_p(s))^T be a feature vector composed of p basis functions φ_i ∈ R^S; we define the parameterized reward functions as R_θ(s) = θ^T φ(s) = Σ_{i=1}^p θ_i φ_i(s). Searching for a good reward thus reduces to searching for a good parameter vector θ ∈ R^p. Notice that we will use interchangeably R_θ and θ as subscripts (e.g., v^π_θ for v^π_{R_θ}). Parameterizing the reward this way implies a related parameterization for the action-value function:

Q^π_θ(s, a) = θ^T μ^π(s, a) with μ^π(s, a) = E[Σ_{t≥0} γ^t φ(S_t) | S_0 = s, A_0 = a, π].   (1)

Therefore, the action-value function shares the parameter vector of the reward function, with an associated feature vector μ^π called the feature expectation. This notion will be of primary importance for the contribution of this paper. Notice that each component μ^π_i of this feature vector is actually the action-value function of the policy π assuming the reward is φ_i: μ^π_i(s, a) = Q^π_{φ_i}(s, a). Therefore, any algorithm designed for estimating an action-value function may be used to estimate the feature expectation, such as Monte-Carlo rollouts or temporal difference learning [7].

¹This work can be extended to compact state spaces, up to some technical aspects.

2.2 Classification with Linearly Parameterized Score Functions

Let X be a compact or a finite set (of inputs to be classified) and let Y be a finite set (of labels). Assume that inputs x ∈ X are drawn according to some unknown distribution P(x) and that there exists some oracle which associates to each of these inputs a label y ∈ Y drawn according to the unknown conditional distribution P(y|x). Generally speaking, the goal of multi-class classification is, given a training set {(x_i, y_i)_{1≤i≤N}} drawn according to P(x, y), to produce a decision rule g ∈ Y^X which aims at minimizing the classification error E[χ_{g(x)≠y}] = P(g(x) ≠ y), where χ denotes the indicator function.

Here, we consider a more restrictive set of classification algorithms.
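To make Eq. (1) above concrete, the following sketch computes μ^π for a tiny hand-made two-state MDP by fixed-point policy evaluation (one evaluation shared across the p basis functions), then recovers Q^π_θ = θ^T μ^π(s, a). The MDP, features and numbers are illustrative, not from the paper.

```python
# Minimal illustration of Eq. (1): the feature expectation mu^pi(s, a) is a
# vector of action-value functions, one per reward basis function phi_i.
# Toy 2-state / 2-action MDP; all numbers are made up for the example.

GAMMA = 0.9
N_S, N_A, P_DIM = 2, 2, 2

# p[a][s][s2]: transition probability p(s2 | s, a)
p = [
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.1, 0.9], [0.7, 0.3]],   # action 1
]
phi = [[1.0, 0.0], [0.0, 1.0]]  # reward feature vectors phi(s), p = 2 basis functions
pi = [0, 1]                     # a fixed deterministic policy

def feature_expectation(p, phi, pi, gamma, iters=2000):
    """Iterate mu(s, a) <- phi(s) + gamma * sum_s2 p(s2|s,a) mu(s2, pi(s2)),
    the fixed-point equation behind Eq. (1)."""
    mu = [[[0.0] * P_DIM for _ in range(N_A)] for _ in range(N_S)]
    for _ in range(iters):
        # on-policy values: v[s][i] = mu^pi(s, pi(s))_i
        v = [mu[s][pi[s]] for s in range(N_S)]
        mu = [[[phi[s][i] + gamma * sum(p[a][s][s2] * v[s2][i] for s2 in range(N_S))
                for i in range(P_DIM)]
               for a in range(N_A)]
              for s in range(N_S)]
    return mu

mu = feature_expectation(p, phi, pi, GAMMA)

# For any reward weights theta, Q^pi_theta(s, a) = theta . mu^pi(s, a):
theta = [1.0, -0.5]
q = [[sum(t * m for t, m in zip(theta, mu[s][a])) for a in range(N_A)]
     for s in range(N_S)]
```

When the model is known one could equivalently solve the linear system (I − γP_π)v = Φ directly instead of iterating; the iteration is used here only to keep the sketch dependency-free.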
We assume that the decision rule associates to an input the argument which maximizes a related score function, this score function being linearly parameterized and the associated parameters being learnt by the algorithm. More formally, let ψ(x, y) = (ψ_1(x, y) ... ψ_d(x, y))^T ∈ R^d be a feature vector whose components are d basis functions ψ_i ∈ R^{X×Y}. The linearly parameterized score function s_w ∈ R^{X×Y} of parameter vector w ∈ R^d is defined as s_w(x, y) = w^T ψ(x, y). The associated decision rule g_w ∈ Y^X is defined as g_w(x) ∈ argmax_{y∈Y} s_w(x, y). Using a training set {(x_i, y_i)_{1≤i≤N}}, a linearly parameterized score function-based multi-class classification (MC2 for short) algorithm computes a parameter vector θ_c. The quality of the solution is quantified by the classification error ε_c = P(g_{θ_c}(x) ≠ y). We do not consider a specific MC2 algorithm, as long as it classifies inputs by maximizing the argument of a linearly parameterized score function. For example, one may choose a multi-class support vector machine [6] (taking the kernel induced by the feature vector) or a structured large margin approach [18]. Other choices are possible; one may use one's preferred algorithm.

3 Structured Classification for Inverse Reinforcement Learning

3.1 General Algorithm

Consider the classification framework of Sec. 2.2. The input x may be seen as a state and the label y as an action. Then, the decision rule g_w(x) can be interpreted as a policy which is greedy according to the score function w^T ψ(x, y), which may itself be seen as an action-value function. Making the parallel with Eq.
(1), if ψ(x, y) is the feature expectation of some policy π which produces the labels of the training set, and if the classification error is small, then w will be the parameter vector of a reward function for which we may hope the policy π to be near-optimal. Based on these remarks, we're ready to present the proposed Structured Classification-based IRL (SCIRL) algorithm.

Let π_E be the expert policy from which we would like to recover a reward function. Assume that we have a training set D = {(s_i, a_i = π_E(s_i))_{1≤i≤N}} where states are sampled according to the expert stationary distribution² ρ_E = ρ_{π_E}. Assume also that we have an estimate μ̂^{π_E} of the expert feature expectation μ^{π_E} defined in Eq. (1). How to practically estimate this quantity is postponed to Sec. 4.1; however, recall that estimating μ^{π_E} is simply a policy evaluation problem (estimating the action-value function of a given policy), as noted in Sec. 2.1. Assume also that an MC2 algorithm has been chosen. The proposed algorithm simply consists in choosing θ^T μ̂^{π_E}(s, a) as the linearly parameterized score function, training the classifier on D, which produces a parameter vector θ_c, and outputting the reward function R_{θ_c}(s) = θ_c^T φ(s).

Algorithm 1: SCIRL algorithm
Given a training set D = {(s_i, a_i = π_E(s_i))_{1≤i≤N}}, an estimate μ̂^{π_E} of the expert feature expectation μ^{π_E} and an MC2 algorithm;
Compute the parameter vector θ_c using the MC2 algorithm fed with the training set D and considering the parameterized score function θ^T μ̂^{π_E}(s, a);
Output the reward function R_{θ_c}(s) = θ_c^T φ(s);

The proposed approach is summarized in Alg. 1.
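Once an MC2 classifier is fixed, the whole of Alg. 1 fits in a few lines. The sketch below uses a simple structured perceptron over the score θ^T μ̂^{π_E}(s, a) as the classifier; the paper allows any MC2 algorithm, so the perceptron, the toy data and the feature values here are illustrative choices, not the paper's instantiation.

```python
# Compact sketch of Alg. 1 (SCIRL) with a structured-perceptron MC2 classifier.
# mu_hat[(s, a)] is an estimate of the expert feature expectation (Eq. (1));
# the returned theta_c defines the reward R(s) = sum_i theta_c[i] * phi_i(s).

def scirl(D, mu_hat, n_actions, p_dim, epochs=50):
    """D: list of (state, expert_action) pairs. Returns reward weights theta_c."""
    theta = [0.0] * p_dim
    for _ in range(epochs):
        for s, a_exp in D:
            # decision rule: the action maximizing the score theta^T mu_hat(s, a)
            scores = [sum(t * m for t, m in zip(theta, mu_hat[(s, a)]))
                      for a in range(n_actions)]
            a_pred = max(range(n_actions), key=scores.__getitem__)
            if a_pred != a_exp:  # perceptron update towards the expert action
                for i in range(p_dim):
                    theta[i] += mu_hat[(s, a_exp)][i] - mu_hat[(s, a_pred)][i]
    return theta

# Toy usage: 2 states, 2 actions, p = 2, hand-made feature expectation estimates
mu_hat = {(0, 0): [2.0, 0.5], (0, 1): [1.0, 1.0],
          (1, 0): [0.5, 1.0], (1, 1): [1.5, 2.0]}
D = [(0, 0), (1, 1)]
theta_c = scirl(D, mu_hat, n_actions=2, p_dim=2)
```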
We call this Structured Classification-based IRL because using the (estimated) expert feature expectation as the feature vector for the classifier somehow takes the MDP structure into account in the classification problem and allows outputting a reward vector. Notice that contrary to most existing IRL algorithms, SCIRL does not require solving the direct problem. While it possibly requires estimating the expert feature expectation, this is just a policy evaluation problem, less difficult than the policy optimization issue involved in the direct problem. This is further discussed in Sec. 5.

²For example, if the Markov chain induced by the expert policy is fast-mixing, sampling a trajectory will quickly lead to sampling states according to this distribution.

3.2 Analysis

In this section, we show that the expert policy π_E is close to optimal according to the reward function R_{θ_c}, more precisely that E_{s∼ρ_E}[v*_{θ_c}(s) − v^{π_E}_{θ_c}(s)] is small. Before stating our main result, we need to introduce some notations and define some objects.

We will use the first-order discounted future state distribution concentration coefficient C_f [9]:

C_f = (1 − γ) Σ_{t≥0} γ^t c(t) with c(t) = max_{π_1,...,π_t, s∈S} (ρ_E^T P_{π_1} ... P_{π_t})(s) / ρ_E(s).

We note π_c the decision rule of the classifier: π_c(s) ∈ argmax_{a∈A} θ_c^T μ̂^{π_E}(s, a). The classification error is therefore ε_c = E_{s∼ρ_E}[χ_{π_c(s)≠π_E(s)}] ∈ [0, 1]. We write Q̂^{π_E} = θ_c^T μ̂^{π_E} the score function computed from the training set D (which can be interpreted as an approximate action-value function). Let also ε_μ = μ̂^{π_E} − μ^{π_E} : S × A → R^p be the feature expectation error. Consequently, we define the action-value function error as ε_Q = Q̂^{π_E} − Q^{π_E}_{θ_c} = θ_c^T(μ̂^{π_E} − μ^{π_E}) = θ_c^T ε_μ : S × A → R. We finally define the mean delta-max action-value function error as ε̄_Q = E_{s∼ρ_E}[max_{a∈A} ε_Q(s, a) − min_{a∈A} ε_Q(s, a)] ≥ 0.

Theorem 1. Let R_{θ_c} be the reward function outputted by Alg. 1. Let also the quantities C_f, ε_c and ε̄_Q be defined as above. We have

0 ≤ E_{s∼ρ_E}[v*_{θ_c}(s) − v^{π_E}_{θ_c}(s)] ≤ C_f/(1 − γ) ( ε̄_Q + ε_c · 2γ ||R_{θ_c}||_∞ / (1 − γ) ).

Proof. As the proof only relies on the reward R_{θ_c}, we omit the related subscripts to keep the notations simple (e.g., v^π for v^π_{θ_c} = v^π_{R_{θ_c}}, or R for R_{θ_c}). First, we link the error E_{s∼ρ_E}[v*(s) − v^{π_E}(s)] to the Bellman residual E_{s∼ρ_E}[[T* v^{π_E}](s) − v^{π_E}(s)]. Componentwise, we have that:

v* − v^{π_E} = T* v* − T^{π*} v^{π_E} + T^{π*} v^{π_E} − T* v^{π_E} + T* v^{π_E} − v^{π_E}
  (a)≤ γ P_{π*}(v* − v^{π_E}) + T* v^{π_E} − v^{π_E}
  (b)≤ (I − γ P_{π*})^{-1}(T* v^{π_E} − v^{π_E}).

Inequality (a) holds because T^{π*} v^{π_E} ≤ T* v^{π_E}, and inequality (b) holds thanks to [9, Lemma 4.2]. Moreover, v* being optimal we have that v* − v^{π_E} ≥ 0, and T* being the Bellman optimality operator, we have T* v^{π_E} ≥ T^{π_E} v^{π_E} = v^{π_E}. Additionally, remark that (I − γ P_{π*})^{-1} = Σ_{t≥0} γ^t P^t_{π*}. Therefore, using the definition of the concentration coefficient C_f, we have that:

0 ≤ E_{s∼ρ_E}[v*(s) − v^{π_E}(s)] ≤ C_f/(1 − γ) · E_{s∼ρ_E}[[T* v^{π_E}](s) − v^{π_E}(s)].   (2)

This result actually follows closely that of [9, Theorem 4.2]. There remains to bound the Bellman residual E_{s∼ρ_E}[[T* v^{π_E}](s) − v^{π_E}(s)]. Considering the following decomposition,

T* v^{π_E} − v^{π_E} = T* v^{π_E} − T^{π_c} v^{π_E} + T^{π_c} v^{π_E} − v^{π_E},

we will bound E_{s∼ρ_E}[[T* v^{π_E}](s) − [T^{π_c} v^{π_E}](s)] and E_{s∼ρ_E}[[T^{π_c} v^{π_E}](s) − v^{π_E}(s)].

The policy π_c (the decision rule of the classifier) is greedy with respect to Q̂^{π_E} = θ_c^T μ̂^{π_E}. Therefore, for any state-action couple (s, a) ∈ S × A we have:

Q̂^{π_E}(s, π_c(s)) ≥ Q̂^{π_E}(s, a) ⇔ Q^{π_E}(s, a) ≤ Q^{π_E}(s, π_c(s)) + ε_Q(s, π_c(s)) − ε_Q(s, a).

By definition, Q^{π_E}(s, a) = [T^a v^{π_E}](s) and Q^{π_E}(s, π_c(s)) = [T^{π_c} v^{π_E}](s). Therefore, for s ∈ S:

∀a ∈ A, [T^a v^{π_E}](s) ≤ [T^{π_c} v^{π_E}](s) + ε_Q(s, π_c(s)) − ε_Q(s, a)
⇒ [T* v^{π_E}](s) ≤ [T^{π_c} v^{π_E}](s) + max_{a∈A} ε_Q(s, a) − min_{a∈A} ε_Q(s, a).

Taking the expectation according to ρ_E and noticing that T* v^{π_E} ≥ v^{π_E}, we bound the first term:

0 ≤ E_{s∼ρ_E}[[T* v^{π_E}](s) − [T^{π_c} v^{π_E}](s)] ≤ ε̄_Q.   (3)

There finally remains to bound the term E_{s∼ρ_E}[[T^{π_c} v^{π_E}](s) − v^{π_E}(s)]. Let us write M ∈ R^{|S|×|S|} the diagonal matrix defined as M = diag(χ_{π_c(s)≠π_E(s)}). Using this, the Bellman operator T^{π_c} may be written as, for any v ∈ R^S:

T^{π_c} v = R + γ M P_{π_c} v + γ (I − M) P_{π_E} v = R + γ P_{π_E} v + γ M (P_{π_c} − P_{π_E}) v.

Applying this operator to v^{π_E} and recalling that R + γ P_{π_E} v^{π_E} = T^{π_E} v^{π_E} = v^{π_E}, we get:

T^{π_c} v^{π_E} − v^{π_E} = γ M (P_{π_c} − P_{π_E}) v^{π_E} ⇒ |ρ_E^T(T^{π_c} v^{π_E} − v^{π_E})| = γ |ρ_E^T M (P_{π_c} − P_{π_E}) v^{π_E}|.

One can easily see that ||(P_{π_c} − P_{π_E}) v^{π_E}||_∞ ≤ 2/(1 − γ) ||R||_∞, which allows bounding the last term:

|E_{s∼ρ_E}[[T^{π_c} v^{π_E}](s) − v^{π_E}(s)]| ≤ ε_c · 2γ/(1 − γ) ||R||_∞.   (4)

Injecting the bounds of Eqs. (3) and (4) into Eq.
(2) gives the stated result.

This result shows that if the expert feature expectation is well estimated (in the sense that the estimation error ε_μ is small for states sampled according to the expert stationary distribution and for all actions) and if the classification error ε_c is small, then the proposed generic algorithm outputs a reward function R_{θ_c} for which the expert policy will be near-optimal. A direct corollary of Th. 1 is that, given the true expert feature expectation μ^{π_E} and a perfect classifier (ε_c = 0), π_E is the unique optimal policy for R_{θ_c}.

One may argue that this bound trivially holds for the null reward function (a reward often exhibited to show that IRL is an ill-posed problem), obtained if θ_c = 0. However, recall that the parameter vector θ_c is computed by the classifier. With θ_c = 0, the decision rule would be a random policy and we would have ε_c = (|A| − 1)/|A|, the worst possible classification error. This case is really unlikely. Therefore, we advocate that the proposed approach somehow disambiguates the IRL problem (at least, it does not output trivial reward functions such as the null vector). Also, this bound is scale-invariant: one could impose ||θ_c|| = 1 or normalize (action-)value functions by ||R_{θ_c}||_∞^{-1}.

One should notice that there is a hidden dependency of the classification error ε_c on the estimated expert feature expectation μ̂^{π_E}.
Indeed, the minimum classification error depends on the hypothesis space spanned by the chosen score function basis functions for the MC2 algorithm (here μ̂^{π_E}). Nevertheless, provided a good representation for the reward function (that is, a good choice of basis functions φ_i) and a small estimation error, this should not be a practical problem.

Finally, while our bound relies on the generalization errors ε_c and ε̄_Q, the classifier will only use (μ̂^{π_E}(s_i, a))_{1≤i≤N, a∈A} in the training phase, where the s_i are the states from the set D. It outputs θ_c, seen as a reward function, and thus the estimated feature expectation μ̂^{π_E} is no longer required. Therefore, in practice it should be sufficient to estimate μ̂^{π_E} well on the state-action couples (s_i, a)_{1≤i≤N, a∈A}, which allows envisioning Monte-Carlo rollouts for example.

4 A Practical Approach

4.1 Estimating the Expert Feature Expectation

SCIRL relies on an estimate μ̂^{π_E} of the expert feature expectation. Basically, this is a policy evaluation problem. An already made key observation is that each component of μ^{π_E} is the action-value function of π_E for a reward function φ_i: μ^{π_E}_i(s, a) = Q^{π_E}_{φ_i}(s, a) = [T^a_{φ_i} v^{π_E}_{φ_i}](s). We briefly review its exact computation and possible estimation approaches, and consider possible heuristics.

If the model is known, the feature expectation can be computed explicitly. Let Φ ∈ R^{|S|×p} be the feature matrix whose rows contain the feature vectors φ(s)^T for all s ∈ S. For a fixed a ∈ A, let μ^{π_E}_a ∈ R^{|S|×p} be the feature expectation matrix whose rows are the expert feature vectors, that is, (μ^{π_E}(s, a))^T for any s ∈ S.
With these notations, we have μ^{π_E}_a = Φ + γ P_a (I − γ P_{π_E})^{-1} Φ. Moreover, the related computational cost is of the same order of magnitude as evaluating a single policy (as the costly part, computing (I − γ P_{π_E})^{-1}, is shared by all components).

If the model is unknown, any temporal difference learning algorithm can be used to estimate the expert feature expectation [7], such as LSTD (Least-Squares Temporal Differences) [4]. Let ψ : S × A → R^d be a feature vector composed of d basis functions ψ_i ∈ R^{S×A}. Each component μ^{π_E}_i of the expert feature expectation is parameterized by a vector ξ_i ∈ R^d: μ^{π_E}_i(s, a) ≈ ξ_i^T ψ(s, a). Assume that we have a training set {(s_i, a_i, s'_i, a'_i = π_E(s'_i))_{1≤i≤M}} with actions a_i not necessarily sampled according to policy π_E (e.g., this may be obtained by sampling trajectories according to an expert-based ε-greedy policy), the aim being to have a better variability of tuples (non-expert actions should be tried). Let Ψ̃ ∈ R^{M×d} (resp. Ψ̃') be the feature matrix whose rows are the feature vectors ψ(s_i, a_i)^T (resp. ψ(s'_i, a'_i)^T). Let also Φ̃ ∈ R^{M×p} be the feature matrix whose rows are the reward's feature vectors φ(s_i)^T. Finally, let Ξ = [ξ_1 ... ξ_p] ∈ R^{d×p} be the matrix of all parameter vectors. Applying LSTD to each component of the feature expectation gives the LSTD-μ algorithm [7]: Ξ = (Ψ̃^T(Ψ̃ − γΨ̃'))^{-1} Ψ̃^T Φ̃ and μ̂^{π_E}(s, a) = Ξ^T ψ(s, a). As for the exact case, the costly part (computing the inverse matrix) is shared by all feature expectation components, so the computational cost is reasonable (same order as LSTD).

Provided a simulator and the ability to sample according to the expert policy, the expert feature expectation may also be estimated using Monte-Carlo rollouts for a given state-action pair (as noted in Sec. 3.2, μ̂^{π_E} need only be known on (s_i, a)_{1≤i≤N, a∈A}). Assuming that K trajectories are sampled for each required state-action pair, this method would require KN|A| rollouts.

In order to have a small error ε̄_Q, one may learn using transitions whose starting state is sampled according to ρ_E and whose actions are uniformly distributed. However, it may happen that only transitions of the expert are available: T = {(s_i, a_i = π_E(s_i), s'_i)_{1≤i≤N}}. While the state-action couples (s_i, a_i) may be used to feed the classifier, the transitions (s_i, a_i, s'_i) are not enough to provide an accurate estimate of the feature expectation. In this case, we can still expect an accurate estimate of μ^{π_E}(s, π_E(s)), but there is little hope for μ^{π_E}(s, a ≠ π_E(s)). However, one can still rely on some heuristic; this does not fit the analysis of Sec. 3.2, but it can still provide good experimental results, as illustrated in Sec. 6.

We propose such a heuristic. Assume that only the data T is available and that we use it to provide an (accurate) estimate μ̂^{π_E}(s, π_E(s)) (this basically means estimating a value function instead of an action-value function as described above). We may adopt an optimistic point of view by assuming that applying a non-expert action just delays the effect of the expert action. More formally, we associate to each state s a virtual state s_v for which p(·|s_v, a) = p(·|s, π_E(s)) for any action a and for which the reward feature vector is null, φ(s_v) = 0. In this case, we have μ^{π_E}(s, a ≠ π_E(s)) = γ μ^{π_E}(s, π_E(s)). Applying this idea to the available estimate (recalling that the classifier only requires evaluating μ̂^{π_E} on (s_i, a)_{1≤i≤N, a∈A}) provides the proposed heuristic: for 1 ≤ i ≤ N, μ̂^{π_E}(s_i, a ≠ a_i) = γ μ̂^{π_E}(s_i, a_i).

We may even push this idea further, to get a simpler estimate of the expert feature expectation (but with the weakest guarantees). Assume that the set T consists of one long trajectory, that is, s'_i = s_{i+1} (thus T = {s_1, a_1, s_2, ..., s_{N−1}, a_{N−1}, s_N, a_N}). We may estimate μ^{π_E}(s_i, a_i) using the single rollout available in the training set and use the proposed heuristic for the other actions:

∀1 ≤ i ≤ N, μ̂^{π_E}(s_i, a_i) = Σ_{j=i}^{N} γ^{j−i} φ(s_j) and μ̂^{π_E}(s_i, a ≠ a_i) = γ μ̂^{π_E}(s_i, a_i).   (5)

To sum up, the expert feature expectation may be seen as a vector of action-value functions (for the same policy π_E and different reward functions φ_i). Consequently, any action-value function evaluation algorithm may be used to estimate μ^π(s, a). Depending on the available data, one may have to rely on some heuristic to assess the feature expectation for an unexperienced (non-expert) action. Also, this expert feature expectation estimate is only required for training the classifier, so it is sufficient to estimate it on the state-action couples (s_i, a)_{1≤i≤N, a∈A}.
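The single-trajectory estimator of Eq. (5), together with the optimistic heuristic for non-expert actions, can be sketched as follows; a backward pass computes all the discounted tail sums in one sweep. The feature map and trajectory below are illustrative.

```python
# Sketch of Eq. (5): along one expert trajectory, mu_hat(s_i, a_i) is the
# discounted sum of the remaining feature vectors, and non-expert actions use
# the optimistic heuristic mu_hat(s_i, a != a_i) = gamma * mu_hat(s_i, a_i).

GAMMA = 0.9

def expert_feature_expectation(trajectory, phi, n_actions, gamma=GAMMA):
    """trajectory: [(s_1, a_1), ..., (s_N, a_N)] sampled from the expert.
    phi[s]: reward feature vector of state s. Returns a dict keyed by
    (trajectory index i, action a)."""
    p_dim = len(phi[trajectory[0][0]])
    mu_hat = {}
    tail = [0.0] * p_dim  # discounted feature sum of the remaining trajectory
    # walk backwards: tail at step i is phi(s_i) + gamma * (tail at step i+1)
    for i in range(len(trajectory) - 1, -1, -1):
        s_i, a_i = trajectory[i]
        tail = [phi[s_i][k] + gamma * tail[k] for k in range(p_dim)]
        mu_hat[(i, a_i)] = list(tail)
        for a in range(n_actions):
            if a != a_i:  # heuristic: a non-expert action just delays the expert one
                mu_hat[(i, a)] = [gamma * x for x in tail]
    return mu_hat

# Toy usage: a 3-step trajectory over 2 states with one-hot features
phi = {0: [1.0, 0.0], 1: [0.0, 1.0]}
traj = [(0, 0), (1, 1), (0, 0)]
mu_hat = expert_feature_expectation(traj, phi, n_actions=2)
```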
In any case, estimating μ^{π_E} is not harder than estimating the action-value function of a given policy in the on-policy case, which is much easier than computing an optimal policy for an arbitrary reward function (as required by most existing IRL algorithms, see Sec. 5).

4.2 An Instantiation

As stated before, any MC2 algorithm may be used. Here, we choose the structured large margin approach [18]. Let L : S × A → R_+ be a user-defined margin function satisfying L(s, π_E(s)) ≤ L(s, a) (here, L(s_i, a_i) = 0 and L(s_i, a ≠ a_i) = 1). The MC2 algorithm solves:

min_{θ,ζ} (1/2)||θ||² + (η/N) Σ_{i=1}^{N} ζ_i  s.t.  ∀i, θ^T μ̂^{π_E}(s_i, a_i) + ζ_i ≥ max_a (θ^T μ̂^{π_E}(s_i, a) + L(s_i, a)).

Following [13], we express the equivalent hinge-loss form (noting that the slack variables ζ_i are tight, which allows moving the constraints into the objective function):

J(θ) = (1/N) Σ_{i=1}^{N} [max_a (θ^T μ̂^{π_E}(s_i, a) + L(s_i, a)) − θ^T μ̂^{π_E}(s_i, a_i)] + (λ/2)||θ||².

This objective function is minimized using a subgradient descent. The expert feature expectation is estimated using the scheme described in Eq. (5).

5 Related Works

The notion of IRL has first been introduced in [14] and first been formalized in [11]. A classic approach to IRL, initiated in [1], consists in finding a policy (through some reward function) such that its feature expectation (or more generally some measure of the underlying trajectories' distribution) matches that of the expert policy. See [10] for a review.
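Before continuing, here is a sketch of the Sec. 4.2 instantiation above: batch subgradient descent on the hinge-loss objective J(θ), with margin L(s_i, a_i) = 0 and L(s_i, a ≠ a_i) = 1. The step size, regularization constant and toy data are illustrative assumptions, not the paper's settings.

```python
# Subgradient descent on J(theta) = 1/N sum_i [max_a (theta^T mu_hat(s_i, a)
# + L(s_i, a)) - theta^T mu_hat(s_i, a_i)] + lambda/2 ||theta||^2.

def train_scirl_margin(D, mu_hat, n_actions, p_dim, lam=0.01, lr=0.1, iters=200):
    theta = [0.0] * p_dim
    n = len(D)
    for _ in range(iters):
        grad = [lam * t for t in theta]  # gradient of the L2 regularizer
        for s, a_exp in D:
            # margin-augmented best action: argmax_a theta^T mu_hat(s,a) + L(s,a)
            def score(a):
                base = sum(t * m for t, m in zip(theta, mu_hat[(s, a)]))
                return base + (0.0 if a == a_exp else 1.0)
            a_star = max(range(n_actions), key=score)
            # subgradient of the hinge term: mu_hat(s, a*) - mu_hat(s, a_exp)
            for i in range(p_dim):
                grad[i] += (mu_hat[(s, a_star)][i] - mu_hat[(s, a_exp)][i]) / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Toy usage with hand-made feature expectation estimates (illustrative)
mu_hat = {(0, 0): [2.0, 0.5], (0, 1): [1.0, 1.0],
          (1, 0): [0.5, 1.0], (1, 1): [1.5, 2.0]}
D = [(0, 0), (1, 1)]
theta_c = train_scirl_margin(D, mu_hat, n_actions=2, p_dim=2)
```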
Notice that related algorithms are not always able to output a reward function, even if they may make use of IRL as an intermediate step. In such cases, they are usually referred to as apprenticeship learning algorithms.

Closer to our contribution, some approaches also somehow introduce a structure in a classification procedure [8, 13]. In [8], a metric induced by the MDP is used to build a kernel which is used in a classification algorithm, showing improvements compared to a non-structured kernel. However, this approach is not an IRL algorithm, and more importantly, assessing the metric of an MDP is a quite involved problem. In [13], a classification algorithm is also used to produce a reward function. However, instead of associating actions to states, as we do, it associates optimal policies (labels) to MDPs (inputs), which is how the structure is incorporated. This involves solving many MDPs.

As far as we know, all IRL algorithms require solving the direct RL problem repeatedly, except [5, 3]. [5] applies to linearly-solvable MDPs (where the control is done by imposing any dynamic on the system). In [3], based on a relative entropy argument, some utility function is maximized using a subgradient ascent. Estimating the subgradient requires sampling trajectories according to the policy being optimal for the current estimated reward. This is avoided thanks to the use of importance sampling. Still, this requires sampling trajectories according to a non-expert policy, and the direct problem remains at the core of the approach (even if solving it is avoided).

SCIRL does not require solving the direct problem, just estimating the feature expectation of the expert policy. In other words, instead of solving multiple policy optimization problems, we only solve one policy evaluation problem. This comes with theoretical guarantees (which is not the case for all IRL algorithms, e.g., [3]).
Moreover, using heuristics which go beyond our analysis, SCIRL may rely solely on data provided by expert trajectories. We demonstrate this empirically in the next section. To the best of our knowledge, no other IRL algorithm can work in such a restrictive setting.

6 Experiments

We illustrate the proposed approach on a car driving simulator, similar to [1, 16]. The goal is to drive a car on a busy three-lane highway with randomly generated traffic (driving off-road is allowed on both sides). The car can move left and right, accelerate, decelerate and keep a constant speed. The expert optimizes a handcrafted reward RE which favours speed, punishes off-road driving, punishes collisions even more, and is neutral otherwise.
We compare SCIRL as instantiated in Sec. 4.2 to the unstructured classifier (using the same classification algorithm) and to the algorithm of [1] (called here PIRL, for Projection IRL). We also consider the optimal behavior according to a randomly sampled reward function as a baseline (using the same reward feature vector as SCIRL and PIRL, the associated parameter vector being randomly sampled).

Figure 1: Highway problem. The highest line is the expert value. For each curve, we show the mean (plain line), the standard deviation (dark color) and the min-max values (light color). The policy corresponding to the random reward is in blue, the policy outputted by the classifier is in yellow and the optimal policy according to SCIRL's reward is in red. PIRL is the dark blue line.

For SCIRL and PIRL we use a discretization of the state space as the reward feature vector, φ ∈ R729: 9 horizontal positions for the user's car, 3 horizontal and 9 vertical positions for the closest traffic car, and 3 speeds. Notice that these features are much less informative than the ones used in [1, 16].
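For concreteness, the 729-dimensional indicator feature vector described above (9 × 3 × 9 × 3 = 729 discretized configurations) could be built as follows. This is a hypothetical sketch: the bin ordering and function name are our own choices, not specified in the paper.

```python
import numpy as np

# Assumed discretization of the highway state (Sec. 6): 9 horizontal positions
# for the user's car, 3 horizontal and 9 vertical positions for the closest
# traffic car, and 3 speeds, giving 9 * 3 * 9 * 3 = 729 indicator features.
N_X, N_TX, N_TY, N_SPEED = 9, 3, 9, 3

def reward_features(x, traffic_x, traffic_y, speed):
    """One-hot reward feature vector phi(s) in R^729 for a discretized state."""
    idx = np.ravel_multi_index((x, traffic_x, traffic_y, speed),
                               (N_X, N_TX, N_TY, N_SPEED))
    phi = np.zeros(N_X * N_TX * N_TY * N_SPEED)
    phi[idx] = 1.0
    return phi
```

With such indicator features, the reward Rθ(s) = θ⊤φ(s) simply assigns one scalar per discretized state, which is what makes them far less informative than the handcrafted features of [1, 16].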
Actually, in [16] features are so informative that sampling a random positive parameter vector θ already gives an acceptable behavior. The discount factor is γ = 0.9. The classifier uses the same feature vector, reproduced for each action.
SCIRL is fed with n trajectories of length n (started in a random state), with n varying from 3 to 20 (so it is fed with 9 to 400 transitions). Each experiment is repeated 50 times. The classifier uses the same data. PIRL is an iterative algorithm, each iteration requiring solving the MDP for some reward function. It is run for 70 iterations; all required objects (a feature expectation for a non-expert policy and an optimal policy according to some reward function at each iteration) are computed exactly using the model. We measure the performance of each approach with Es∼U[vπRE(s)], where U is the uniform distribution (this allows measuring the generalization capability of each approach for states infrequently encountered), RE is the expert reward and π is one of the following policies: the optimal policy for RE (upper baseline), the optimal policy for a random reward (lower baseline), the optimal policy for Rθc (SCIRL), the policy produced by PIRL and the classifier decision rule.
Fig. 1 shows the performance of each approach as a function of the number of expert transitions used (except for PIRL, which uses the model). We can see that the classifier does not work well on this example. Increasing the number of samples would improve its performance, but after 400 transitions it still does not work as well as SCIRL with only about ten transitions. SCIRL works quite well here: after only about a hundred transitions it reaches the performance of PIRL, both being close to the expert value.
We do not report exact computational times, but running SCIRL once with 400 transitions is approximately a hundred times faster than running PIRL for 70 iterations.

7 Conclusion

We have introduced a new way to perform IRL by structuring a linearly parameterized score function-based multi-class classification algorithm with an estimate of the expert feature expectation. This outputs a reward function for which we have shown the expert to be near-optimal, provided a small classification error and a good estimate of the expert feature expectation. How to practically estimate this quantity has been discussed, and we have introduced a heuristic for the case where only transitions from the expert are available, along with a specific instantiation of the SCIRL algorithm. We have shown on a car driving simulator benchmark that the proposed approach works well (even combined with the introduced heuristic), much better than the unstructured classifier and as well as a state-of-the-art algorithm making use of the model (and with a much lower computational time). In the future, we plan to deepen the theoretical properties of SCIRL (notably regarding possible heuristics) and to apply it to real-world robotic problems.

Acknowledgments. This research was partly funded by the EU FP7 project ILHAIRE (grant n°270780), by the EU INTERREG IVa project ALLEGRO and by the Région Lorraine (France).

References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

[2] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3).
Athena Scientific, 1996.

[3] Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In JMLR Workshop and Conference Proceedings Volume 15: AISTATS 2011, 2011.

[4] Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.

[5] Krishnamurthy Dvijotham and Emanuel Todorov. Inverse optimal control with linearly-solvable MDPs. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

[6] Yann Guermeur. VC theory of large margin multi-category classifiers. Journal of Machine Learning Research, 8:2551–2594, 2007.

[7] Edouard Klein, Matthieu Geist, and Olivier Pietquin. Batch, off-policy and model-free apprenticeship learning. In Proceedings of the European Workshop on Reinforcement Learning (EWRL), 2011.

[8] Francisco S. Melo and Manuel Lopes. Learning from demonstration using MDP induced metrics. In Proceedings of the European Conference on Machine Learning (ECML), 2010.

[9] Rémi Munos. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.

[10] Gergely Neu and Csaba Szepesvári. Training parsers by inverse reinforcement learning. Machine Learning, 77(2-3):303–337, 2009.

[11] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), 2000.

[12] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994.

[13] Nathan Ratliff, Andrew D. Bagnell, and Martin Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.

[14] Stuart Russell. Learning agents for uncertain environments (extended abstract).
In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), 1998.

[15] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

[16] Umar Syed and Robert Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20 (NIPS), 2008.

[17] Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.

[18] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: a large margin approach. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.