{"title": "VIME: Variational Information Maximizing Exploration", "book": "Advances in Neural Information Processing Systems", "page_first": 1109, "page_last": 1117, "abstract": "Scalable and effective exploration remains a key challenge in reinforcement learning (RL). While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios. As such, most contemporary RL relies on simple heuristics such as epsilon-greedy exploration or adding Gaussian noise to the controls. This paper introduces Variational Information Maximizing Exploration (VIME), an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics. We propose a practical implementation, using variational inference in Bayesian neural networks which efficiently handles continuous state and action spaces. VIME modifies the MDP reward function, and can be applied with several different underlying RL algorithms. We demonstrate that VIME achieves significantly better performance compared to heuristic exploration methods across a variety of continuous control tasks and algorithms, including tasks with very sparse rewards.", "full_text": "VIME: Variational Information Maximizing\n\nExploration\n\nRein Houthooft\u00a7\u2020\u2021, Xi Chen\u2020\u2021, Yan Duan\u2020\u2021, John Schulman\u2020\u2021, Filip De Turck\u00a7, Pieter Abbeel\u2020\u2021\n\n\u2021 OpenAI\n\n\u2020 UC Berkeley, Department of Electrical Engineering and Computer Sciences\n\n\u00a7 Ghent University - imec, Department of Information Technology\n\nAbstract\n\nScalable and effective exploration remains a key challenge in reinforcement learn-\ning (RL). While there are methods with optimality guarantees in the setting of dis-\ncrete state and action spaces, these methods cannot be applied in high-dimensional\ndeep RL scenarios. 
As such, most contemporary RL relies on simple heuristics such as ε-greedy exploration or adding Gaussian noise to the controls. This paper introduces Variational Information Maximizing Exploration (VIME), an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics. We propose a practical implementation, using variational inference in Bayesian neural networks, which efficiently handles continuous state and action spaces. VIME modifies the MDP reward function, and can be applied with several different underlying RL algorithms. We demonstrate that VIME achieves significantly better performance compared to heuristic exploration methods across a variety of continuous control tasks and algorithms, including tasks with very sparse rewards.

1 Introduction

Reinforcement learning (RL) studies how an agent can maximize its cumulative reward in a previously unknown environment, which it learns about through experience. A long-standing problem is how to manage the trade-off between exploration and exploitation. In exploration, the agent experiments with novel strategies that may improve returns in the long run; in exploitation, it maximizes rewards through behavior that is known to be successful. An effective exploration strategy allows the agent to generate trajectories that are maximally informative about the environment. For small tasks, this trade-off can be handled effectively through Bayesian RL [1] and PAC-MDP methods [2–6], which offer formal guarantees. However, these guarantees assume discrete state and action spaces. Hence, in settings where state-action discretization is infeasible, many RL algorithms use heuristic exploration strategies. Examples include acting randomly using ε-greedy or Boltzmann exploration [7], and utilizing Gaussian noise on the controls in policy gradient methods [8]. 
These heuristics often rely on random walk behavior, which can be highly inefficient; for example, Boltzmann exploration requires a training time exponential in the number of states in order to solve the well-known n-chain MDP [9]. In between formal methods and simple heuristics, several works have proposed to address the exploration problem using less formal, but more expressive methods [10–14]. However, none of them fully address exploration in continuous control, as discretization of the state-action space scales exponentially in its dimensionality. For example, the Walker2D task [15] has a 26-dim state-action space. If we assume a coarse discretization into 10 bins for each dimension, a table of state-action visitation counts would require 10^26 entries.

This paper proposes a curiosity-driven exploration strategy, making use of information gain about the agent's internal belief of the dynamics model as a driving force. This principle can be traced back to the concepts of curiosity and surprise [16–18]. Within this framework, agents are encouraged to take actions that result in states they deem surprising, i.e., states that cause large updates to the dynamics model distribution. We propose a practical implementation of measuring information gain using variational inference. Herein, the agent's current understanding of the environment dynamics is represented by a Bayesian neural network (BNN) [19, 20]. We also show how this can be interpreted as measuring compression improvement, a proposed model of curiosity [21]. In contrast to previous curiosity-based approaches [10, 22], our model scales naturally to continuous state and action spaces. The presented approach is evaluated on a range of continuous control tasks, and multiple underlying RL algorithms. 
Experimental results show that VIME achieves significantly better performance than naïve exploration strategies.

2 Methodology

In Section 2.1, we establish notation for the subsequent equations. Next, in Section 2.2, we explain the theoretical foundation of curiosity-driven exploration. In Section 2.3 we describe how to adapt this idea to continuous control, and we show how to build on recent advances in variational inference for Bayesian neural networks (BNNs) to make this formulation practical. Thereafter, we make explicit the intuitive link between compression improvement and the variational lower bound in Section 2.4. Finally, Section 2.5 describes how our method is practically implemented.

2.1 Preliminaries

This paper assumes a finite-horizon discounted Markov decision process (MDP), defined by (S, A, P, r, ρ_0, γ, T), in which S ⊆ R^n is a state set, A ⊆ R^m an action set, P : S × A × S → R≥0 a transition probability distribution, r : S × A → R a bounded reward function, ρ_0 : S → R≥0 an initial state distribution, γ ∈ (0, 1] a discount factor, and T the horizon. States and actions viewed as random variables are abbreviated as S and A. The presented models are based on the optimization of a stochastic policy π_α : S × A → R≥0, parametrized by α. Let μ(π_α) denote its expected discounted return:

μ(π_α) = E_τ[Σ_{t=0}^T γ^t r(s_t, a_t)],

where τ = (s_0, a_0, ...) denotes the whole trajectory, s_0 ~ ρ_0(s_0), a_t ~ π_α(a_t|s_t), and s_{t+1} ~ P(s_{t+1}|s_t, a_t).

2.2 Curiosity

Our method builds on the theory of curiosity-driven exploration [16, 17, 21, 22], in which the agent engages in systematic exploration by seeking out state-action regions that are relatively unexplored. The agent models the environment dynamics via a model p(s_{t+1}|s_t, a_t; θ), parametrized by the random variable Θ with values θ ∈ Θ. Assuming a prior p(θ), it maintains a distribution over dynamics models through a distribution over θ, which is updated in a Bayesian manner (as opposed to a point estimate). The history of the agent up until time step t is denoted as ξ_t = {s_1, a_1, ..., s_t}. According to curiosity-driven exploration [17], the agent should take actions that maximize the reduction in uncertainty about the dynamics. This can be formalized as maximizing the sum of reductions in entropy

Σ_t (H(Θ|ξ_t, a_t) − H(Θ|S_{t+1}, ξ_t, a_t)),    (1)

through a sequence of actions {a_t}. According to information theory, each individual term equals the mutual information between the next-state distribution S_{t+1} and the model parameter Θ, namely I(S_{t+1}; Θ|ξ_t, a_t). Therefore, the agent is encouraged to take actions that lead to states that are maximally informative about the dynamics model. Furthermore, we note that

I(S_{t+1}; Θ|ξ_t, a_t) = E_{s_{t+1}~P(·|ξ_t, a_t)}[D_KL[p(θ|ξ_t, a_t, s_{t+1}) ‖ p(θ|ξ_t)]],    (2)

the KL divergence from the agent's new belief over the dynamics model to the old one, taking the expectation over all possible next states according to the true dynamics P. 
This KL divergence can be interpreted as information gain.

If calculating the posterior dynamics distribution is tractable, it is possible to optimize Eq. (2) directly by maintaining a belief over the dynamics model [17]. However, this is not generally the case. Therefore, a common practice [10, 23] is to use RL to approximate planning for maximal mutual information along a trajectory, Σ_t I(S_{t+1}; Θ|ξ_t, a_t), by adding each term I(S_{t+1}; Θ|ξ_t, a_t) as an intrinsic reward, which captures the agent's surprise in the form of a reward function. This is practically realized by taking actions a_t ~ π_α(s_t) and sampling s_{t+1} ~ P(·|s_t, a_t) in order to add D_KL[p(θ|ξ_t, a_t, s_{t+1}) ‖ p(θ|ξ_t)] to the external reward. The trade-off between exploitation and exploration can now be realized explicitly as follows:

r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + η D_KL[p(θ|ξ_t, a_t, s_{t+1}) ‖ p(θ|ξ_t)],    (3)

with η ∈ R+ a hyperparameter controlling the urge to explore. In conclusion, the biggest practical issue with maximizing information gain for exploration is that the computation of Eq. (3) requires calculating the posterior p(θ|ξ_t, a_t, s_{t+1}), which is generally intractable.

2.3 Variational Bayes

We propose a tractable solution to maximize the information gain objective presented in the previous section. In a purely Bayesian setting, we can derive the posterior distribution given a new state-action pair through Bayes' rule as

p(θ|ξ_t, a_t, s_{t+1}) = p(θ|ξ_t) p(s_{t+1}|ξ_t, a_t; θ) / p(s_{t+1}|ξ_t, a_t),    (4)

with p(θ|ξ_t, a_t) = p(θ|ξ_t), as actions do not influence beliefs about the environment [17]. Herein, the denominator is computed through the integral

p(s_{t+1}|ξ_t, a_t) = ∫_Θ p(s_{t+1}|ξ_t, a_t; θ) p(θ|ξ_t) dθ.    (5)

In general, this integral tends to be intractable when using highly expressive parametrized models (e.g., neural networks), which are often needed to accurately capture the environment model in high-dimensional continuous control.

We propose a practical solution through variational inference [24]. Herein, we embrace the fact that calculating the posterior p(θ|D) for a data set D is intractable. Instead, we approximate it through an alternative distribution q(θ; φ), parameterized by φ, by minimizing D_KL[q(θ; φ) ‖ p(θ|D)]. This is done through maximization of the variational lower bound L[q(θ; φ), D]:

L[q(θ; φ), D] = E_{θ~q(·;φ)}[log p(D|θ)] − D_KL[q(θ; φ) ‖ p(θ)].    (6)

Rather than computing the information gain in Eq. (3) explicitly, we compute an approximation to it, leading to the following total reward:

r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + η D_KL[q(θ; φ_{t+1}) ‖ q(θ; φ_t)],    (7)

with φ_{t+1} the updated and φ_t the old parameters representing the agent's belief. Natural candidates for parametrizing the agent's dynamics model are Bayesian neural networks (BNNs) [19], as they maintain a distribution over their weights. This allows us to view the BNN as an infinite neural network ensemble by integrating out its parameters:

p(y|x) = ∫_Θ p(y|x; θ) q(θ; φ) dθ.    (8)

In particular, we utilize a BNN parametrized by a fully factorized Gaussian distribution [20]. 
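The ensemble view of Eq. (8) and the shaped reward of Eq. (7) can be sketched in a few lines. The sketch below is illustrative only: the "network" is a toy linear model, and the sample count, dimensions, and η value are assumptions, not the paper's actual architecture. It uses the softplus parametrization σ = log(1 + e^ρ) detailed in Section 2.5.

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_mean(x, mu, rho, n_samples=10):
    """Monte Carlo estimate of the BNN predictive distribution of Eq. (8):
    p(y|x) = ∫ p(y|x; θ) q(θ; φ) dθ, with q a fully factorized Gaussian.
    A toy linear model y = θ·x stands in for the network (illustrative only)."""
    sigma = np.log1p(np.exp(rho))  # σ = log(1 + e^ρ) keeps σ > 0
    outs = []
    for _ in range(n_samples):
        theta = mu + sigma * rng.standard_normal(mu.shape)  # θ ~ q(·; φ)
        outs.append(theta @ x)  # one ensemble member's prediction
    return np.mean(outs)

def shaped_reward(r_ext, kl, eta=1.0):
    """Total reward of Eq. (7): external reward plus η-weighted information gain."""
    return r_ext + eta * kl
```

Averaging predictions over weight samples is what makes the BNN behave like an infinite ensemble; the intrinsic bonus is simply added to the environment reward before the RL update.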
Practical BNN implementation details are deferred to Section 2.5, while we give some intuition into the behavior of BNNs in the appendix.

2.4 Compression

It is possible to derive an interesting relationship between compression improvement, an intrinsic reward objective defined in [25], and the information gain of Eq. (2). In [25], the agent's curiosity is equated with compression improvement, measured through C(ξ_t; φ_{t−1}) − C(ξ_t; φ_t), where C(ξ; φ) is the description length of ξ using φ as a model. Furthermore, it is known that the negative variational lower bound can be viewed as the description length [19]. Hence, we can write compression improvement as L[q(θ; φ_t), ξ_t] − L[q(θ; φ_{t−1}), ξ_t]. In addition, an alternative formulation of the variational lower bound in Eq. (6) is given by

log p(D) = ∫_Θ q(θ; φ) log(p(θ, D)/q(θ; φ)) dθ + D_KL[q(θ; φ) ‖ p(θ|D)],    (9)

in which the first term equals L[q(θ; φ), D]. Thus, compression improvement can now be written as

(log p(ξ_t) − D_KL[q(θ; φ_t) ‖ p(θ|ξ_t)]) − (log p(ξ_t) − D_KL[q(θ; φ_{t−1}) ‖ p(θ|ξ_t)]).    (10)

If we assume that φ_t perfectly optimizes the variational lower bound for the history ξ_t, then D_KL[q(θ; φ_t) ‖ p(θ|ξ_t)] = 0, which occurs when the approximation equals the true posterior, i.e., q(θ; φ_t) = p(θ|ξ_t). Hence, compression improvement becomes D_KL[p(θ|ξ_{t−1}) ‖ p(θ|ξ_t)]. Therefore, optimizing for compression improvement comes down to optimizing the KL divergence from the posterior given the past history ξ_{t−1} to the posterior given the total history ξ_t. As such, we arrive at an alternative way to encode curiosity than information gain, namely D_KL[p(θ|ξ_t) ‖ p(θ|ξ_t, a_t, s_{t+1})], its reversed KL divergence. In experiments, we noticed no significant difference between the two KL divergence variants. This can be explained as both variants are locally equal when introducing small changes to the parameter distributions. Investigation of how to combine both information gain and compression improvement is deferred to future work.

2.5 Implementation

The complete method is summarized in Algorithm 1. We first set forth implementation and parametrization details of the dynamics BNN. The BNN weight distribution q(θ; φ) is given by the fully factorized Gaussian distribution [20]:

q(θ; φ) = Π_{i=1}^{|Θ|} N(θ_i | μ_i; σ_i²).    (11)

Hence, φ = {μ, σ}, with μ the Gaussian's mean vector and σ the covariance matrix diagonal. This is particularly convenient as it allows for a simple analytical formulation of the KL divergence, described later in this section. Because of the restriction σ > 0, the standard deviation of each Gaussian BNN parameter is parametrized as σ = log(1 + e^ρ), with ρ ∈ R [20].

Now the training of the dynamics BNN through optimization of the variational lower bound is described. The expected log-likelihood term in Eq. (6) is approximated through sampling: E_{θ~q(·;φ)}[log p(D|θ)] ≈ (1/N) Σ_{i=1}^N log p(D|θ_i), with N samples drawn according to θ_i ~ q(·; φ) [20]. Optimizing the variational lower bound in Eq. (6) in combination with the reparametrization trick is called stochastic gradient variational Bayes (SGVB) [26] or Bayes by Backprop [20]. 
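The objective of Eq. (6) with the reparametrization trick can be sketched as follows. This is a minimal sketch under stated assumptions: a toy linear dynamics model with a Gaussian likelihood, a standard normal prior, and illustrative sample counts; it is not the paper's actual training loop.

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_kl(mu_q, sig_q, mu_p, sig_p):
    """KL divergence between fully factorized Gaussians, used for the
    prior term D_KL[q(θ;φ) || p(θ)] in Eq. (6)."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

def elbo_estimate(mu, rho, X, Y, n_samples=5, noise=0.1):
    """Monte Carlo estimate of the variational lower bound, Eq. (6):
    E_{θ~q}[log p(D|θ)] − D_KL[q(θ;φ) || p(θ)], with the reparametrization
    θ = μ + σ·ε, ε ~ N(0, I), and σ = log(1 + e^ρ)."""
    sigma = np.log1p(np.exp(rho))
    log_lik = 0.0
    for _ in range(n_samples):
        theta = mu + sigma * rng.standard_normal(mu.shape)  # reparametrized sample
        resid = Y - X @ theta  # toy linear dynamics model (assumption)
        log_lik += -0.5 * np.sum(resid**2) / noise**2  # Gaussian log-likelihood, up to a constant
    log_lik /= n_samples
    # standard normal prior p(θ) = N(0, I) (assumption)
    return log_lik - gauss_kl(mu, sigma, np.zeros_like(mu), np.ones_like(mu))
```

Because θ is written as a deterministic function of (μ, ρ) and noise ε, gradients of this estimate with respect to μ and ρ can be taken directly, which is the essence of SGVB / Bayes by Backprop.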
Furthermore, we make use of the local reparametrization trick proposed in [26], in which sampling at the weights is replaced by sampling the neuron pre-activations, which is more computationally efficient and reduces gradient variance. The optimization of the variational lower bound is done at regular intervals during the RL training process, by sampling D from a FIFO replay pool that stores recent samples (s_t, a_t, s_{t+1}). This is to break up the strong intratrajectory sample correlation which destabilizes learning in favor of obtaining i.i.d. data [7]. Moreover, it diminishes the effect of compounding posterior approximation errors.

The posterior distribution of the dynamics parameter, which is needed to compute the KL divergence in the total reward function r' of Eq. (7), can be computed through the following minimization:

φ' = arg min_φ ℓ(q(θ; φ), s_t),  where  ℓ(q(θ; φ), s_t) = ℓ_KL(q(θ; φ)) − E_{θ~q(·;φ)}[log p(s_t|ξ_t, a_t; θ)]  and  ℓ_KL(q(θ; φ)) = D_KL[q(θ; φ) ‖ q(θ; φ_{t−1})],    (12)

in which we replace the expectation over θ with samples θ ~ q(·; φ). Because we only update the model periodically based on samples drawn from the replay pool, this optimization can be performed in parallel for each s_t, keeping φ_{t−1} fixed. Once φ' has been obtained, we can use it to compute the intrinsic reward.

Algorithm 1: Variational Information Maximizing Exploration (VIME)
for each epoch n do
    for each timestep t in each trajectory generated during n do
        Generate action a_t ~ π_α(s_t) and sample state s_{t+1} ~ P(·|ξ_t, a_t); get r(s_t, a_t).
        Add triplet (s_t, a_t, s_{t+1}) to FIFO replay pool R.
        Compute D_KL[q(θ; φ'_{n+1}) ‖ q(θ; φ_{n+1})], either by the approximation ∇⊤H^{−1}∇ following Eq. (16) for diagonal BNNs, or by optimizing Eq. (12) to obtain φ'_{n+1} for general BNNs.
        Divide D_KL[q(θ; φ'_{n+1}) ‖ q(θ; φ_{n+1})] by the median of previous KL divergences.
        Construct r'(s_t, a_t, s_{t+1}) ← r(s_t, a_t) + η D_KL[q(θ; φ'_{n+1}) ‖ q(θ; φ_{n+1})], following Eq. (7).
    Minimize D_KL[q(θ; φ_n) ‖ p(θ)] − E_{θ~q(·;φ_n)}[log p(D|θ)] following Eq. (6), with D sampled randomly from R, leading to updated posterior q(θ; φ_{n+1}).
    Use rewards {r'(s_t, a_t, s_{t+1})} to update policy π_α using any standard RL method.

To optimize Eq. (12) efficiently, we only take a single second-order step. This way, the gradient is rescaled according to the curvature of the KL divergence at the origin. As such, we compute D_KL[q(θ; φ + λΔφ) ‖ q(θ; φ)], with the update step Δφ defined as

Δφ = H^{−1}(ℓ) ∇_φ ℓ(q(θ; φ), s_t),    (13)

in which H(ℓ) is the Hessian of ℓ(q(θ; φ), s_t). Since we assume that the variational approximation is a fully factorized Gaussian, the KL divergence from posterior to prior has a particularly simple form:

D_KL[q(θ; φ) ‖ q(θ; φ')] = ½ Σ_{i=1}^{|Θ|} ((σ_i/σ'_i)² + 2 log σ'_i − 2 log σ_i + (μ'_i − μ_i)²/σ'_i²) − |Θ|/2.    (14)

Because this KL divergence is approximately quadratic in its parameters, while the log-likelihood term can be seen as locally linear compared to this highly curved KL term, we approximate H by calculating it only for the KL term ℓ_KL(q(θ; φ)). 
This can be computed very efficiently in the case of a fully factorized Gaussian distribution, as this approximation becomes a diagonal matrix. Looking at Eq. (14), we can calculate the Hessian at the origin, whose μ and ρ entries are defined as

∂²ℓ_KL/∂μ_i² = 1 / log²(1 + e^{ρ_i})  and  ∂²ℓ_KL/∂ρ_i² = 2e^{2ρ_i} / ((1 + e^{ρ_i})² log²(1 + e^{ρ_i})),    (15)

while all other entries are zero. Furthermore, it is also possible to approximate the KL divergence through a second-order Taylor expansion as ½ Δφ⊤HΔφ = ½ (H^{−1}∇)⊤ H (H^{−1}∇), since both the value and the gradient of the KL divergence are zero at the origin. This gives us

D_KL[q(θ; φ + λΔφ) ‖ q(θ; φ)] ≈ ½ λ² ∇_φℓ⊤ H^{−1}(ℓ_KL) ∇_φℓ.    (16)

Note that H^{−1}(ℓ_KL) is diagonal, so this expression can be computed efficiently. Instead of using the KL divergence D_KL[q(θ; φ_{t+1}) ‖ q(θ; φ_t)] directly as an intrinsic reward in Eq. (7), we normalize it by dividing by the average of the median KL divergences taken over a fixed number of previous trajectories. Rather than focusing on its absolute value, we emphasize the relative difference in KL divergence between samples. This accomplishes the same effect, since the variance of the KL divergence converges to zero once the model is fully learned.

3 Experiments

In this section, we investigate (i) whether VIME can succeed in domains that have extremely sparse rewards, (ii) whether VIME improves learning when the reward is shaped to guide the agent towards its goal, and (iii) how η, as used in Eq. (3), trades off exploration and exploitation behavior. 
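Section 2.5's closed-form KL of Eq. (14) and the median-based normalization of the intrinsic reward can be sketched in a few lines; the window contents and the η value below are illustrative assumptions:

```python
import numpy as np

def factorized_gauss_kl(mu, sigma, mu2, sigma2):
    """Closed-form KL of Eq. (14) between two fully factorized Gaussians,
    D_KL[q(θ; φ) || q(θ; φ')], where (mu2, sigma2) play the role of φ'."""
    return 0.5 * np.sum(
        (sigma / sigma2) ** 2
        + 2 * np.log(sigma2) - 2 * np.log(sigma)
        + (mu2 - mu) ** 2 / sigma2 ** 2
    ) - 0.5 * mu.size

def normalized_intrinsic_reward(kl, kl_history, eta=1.0):
    """Divide the raw KL by the average of per-trajectory median KLs over a
    window of previous trajectories (kl_history: list of KL arrays), so the
    bonus reflects relative rather than absolute surprise."""
    denom = np.mean([np.median(traj_kls) for traj_kls in kl_history])
    return eta * kl / denom
```

Because the variational posterior is diagonal, this KL is a simple sum over parameters, which is what makes the per-step intrinsic reward cheap to evaluate.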
All experiments make use of the rllab [15] benchmark code base and the complementary continuous control tasks suite. The following tasks are part of the experimental setup: CartPole (S ⊆ R^4, A ⊆ R^1), CartPoleSwingup (S ⊆ R^4, A ⊆ R^1), DoublePendulum (S ⊆ R^6, A ⊆ R^1), MountainCar (S ⊆ R^3, A ⊆ R^1), the locomotion tasks HalfCheetah (S ⊆ R^20, A ⊆ R^6) and Walker2D (S ⊆ R^20, A ⊆ R^6), and the hierarchical task SwimmerGather (S ⊆ R^33, A ⊆ R^2).

Performance is measured through the average return (not including the intrinsic rewards) over the trajectories generated (y-axis) at each iteration (x-axis). More specifically, the darker-colored lines in each plot represent the median performance over a fixed set of 10 random seeds, while the shaded areas show the interquartile range at each iteration. Moreover, the number in each legend shows this performance measure, averaged over all iterations. The exact setup is described in the Appendix.

Figure 1: (a,b,c) TRPO+VIME versus TRPO on tasks with sparse rewards ((a) MountainCar, (b) CartPoleSwingup, (c) HalfCheetah); (d) comparison of TRPO+VIME (red) and TRPO (blue) on MountainCar: visited states until convergence.

Domains with sparse rewards are difficult to solve through naïve exploration behavior because, before the agent obtains any reward, it lacks a feedback signal on how to improve its policy. This allows us to test whether an exploration strategy is truly capable of systematic exploration, rather than improving existing RL algorithms by adding more hyperparameters. Therefore, VIME is compared with heuristic exploration strategies on the following tasks with sparse rewards. 
A reward of +1 is given when the car escapes the valley on the right side in MountainCar; when the pole is pointed upwards in CartPoleSwingup; and when the cheetah moves forward over five units in HalfCheetah. We compare VIME with the following baselines: only using Gaussian control noise [15], and using the ℓ2 BNN prediction error as an intrinsic reward, a continuous extension of [10]. TRPO [8] is used as the RL algorithm, as it performs very well compared to other methods [15]. Figure 1 shows the performance results. We notice that naïve exploration performs very poorly, as it is almost never able to reach the goal in any of the tasks. Similarly, using ℓ2 errors does not perform well. In contrast, VIME performs much better, achieving the goal in most cases. This experiment demonstrates that curiosity drives the agent to explore, even in the absence of any initial reward, where naïve exploration completely breaks down.

To further strengthen this point, we have evaluated VIME on the highly difficult hierarchical task SwimmerGather (Figure 5), whose reward signal is naturally sparse. In this task, a two-link robot needs to reach "apples" while avoiding "bombs" that are perceived through a laser scanner. However, before it can make any forward progress, it has to learn complex locomotion primitives in the absence of any reward. None of the RL methods tested previously in [15] were able to make progress with naïve exploration. Remarkably, VIME leads the agent to acquire coherent motion primitives without any reward guidance, achieving promising results on this challenging task.

Next, we investigate whether VIME is widely applicable by (i) testing it on environments where the reward is well shaped, and (ii) pairing it with different RL methods. 
In addition to TRPO, we choose to equip REINFORCE [27] and ERWR [28] with VIME because these two algorithms usually suffer from premature convergence to suboptimal policies [15, 29], which can potentially be alleviated by better exploration. Their performance is shown in Figure 2 on several well-established continuous control tasks. Furthermore, Figure 3 shows the same comparison for the Walker2D locomotion task. In the majority of cases, VIME leads to a significant performance gain over heuristic exploration. Our exploration method allows the RL algorithms to converge faster, and notably helps REINFORCE and ERWR avoid converging to a locally optimal solution on DoublePendulum and MountainCar. We note that in environments such as CartPole, a better exploration strategy is redundant, as following the policy gradient direction leads to the globally optimal solution. Additionally, we tested adding Gaussian noise to the rewards as a baseline, which did not improve performance.

To give an intuitive understanding of VIME's exploration behavior, the distribution of visited states for both naïve exploration and VIME after convergence is investigated. Figure 1d shows that using Gaussian control noise exhibits random walk behavior: the state visitation plot is more condensed and ball-shaped around the center. In comparison, VIME leads to a more diffused visitation pattern, exploring the states more efficiently, and hence reaching the goal more quickly.

Figure 2: Performance of TRPO (top row), ERWR (middle row), and REINFORCE (bottom row) with (+VIME) and without exploration on (a) CartPole, (b) CartPoleSwingup, (c) DoublePendulum, and (d) MountainCar.

Figure 3: Performance of TRPO with and without VIME on the high-dimensional Walker2D locomotion task.

Figure 4: VIME: performance over the first few iterations for TRPO, REINFORCE, and ERWR as a function of η on MountainCar.

Figure 5: Performance of TRPO with and without VIME on the challenging hierarchical task SwimmerGather.

Finally, we investigate how η, as used in Eq. (3), trades off exploration and exploitation behavior. On the one hand, higher η values should lead to a higher curiosity drive, causing more exploration. On the other hand, very low η values should reduce VIME to traditional Gaussian control noise. Figure 4 shows the performance on MountainCar for different η values. Setting η too high clearly results in prioritizing exploration over obtaining additional external reward. Too low an η value reduces the method to the baseline algorithm, as the intrinsic reward contribution to the total reward r' becomes negligible. Most importantly, this figure highlights that there is a wide η range for which the task is best solved, across different algorithms.

4 Related Work

A body of theoretically oriented work demonstrates exploration strategies that are able to learn online in a previously unknown MDP and incur a polynomial amount of regret; as a result, these algorithms find a near-optimal policy in a polynomial amount of time. Some of these algorithms are based on the principle of optimism under uncertainty: E3 [3], R-Max [4], UCRL [30]. An alternative approach is Bayesian reinforcement learning methods, which maintain a distribution over possible MDPs [1, 17, 23, 31]. The optimism-based exploration strategies have been extended to continuous state spaces, for example [6, 9]; however, these methods do not accommodate nonlinear function approximators.

Practical RL algorithms often rely on simple exploration heuristics, such as ε-greedy and Boltzmann exploration [32]. However, these heuristics exhibit random walk exploratory behavior, which can lead to exponential regret even in the case of small MDPs [9]. 
Our proposed method of utilizing information gain can be traced back to [22], and has been further explored in [17, 33, 34]. Other metrics for curiosity have also been proposed, including prediction error [10, 35], prediction error improvement [36], leverage [14], neuro-correlates [37], and predictive information [38]. These methods have not been applied directly to high-dimensional continuous control tasks without discretization. We refer the reader to [21, 39] for an extensive review on curiosity and intrinsic rewards.

Recently, various exploration strategies have been proposed in the context of deep RL. [10] proposes to use the ℓ2 prediction error as the intrinsic reward. [12] performs approximate visitation counting in a learned state embedding using Gaussian kernels. [11] proposes a form of Thompson sampling, training multiple value functions using bootstrapping. Although these approaches can scale up to high-dimensional state spaces, they generally assume discrete action spaces. [40] makes use of mutual information for gait stabilization in continuous control, but relies on state discretization. Finally, [41] proposes a variational method for information maximization in the context of optimizing empowerment, which, as noted by [42], does not explicitly favor exploration.

5 Conclusions

We have proposed Variational Information Maximizing Exploration (VIME), a curiosity-driven exploration strategy for continuous control tasks. Variational inference is used to approximate the posterior distribution of a Bayesian neural network that represents the environment dynamics. Using information gain in this learned dynamics model as intrinsic rewards allows the agent to optimize for both external reward and intrinsic surprise simultaneously. Empirical results show that VIME performs significantly better than heuristic exploration methods across various continuous control tasks and algorithms. 
As future work, we would like to investigate measuring surprise in the value function and using the learned dynamics model for planning.

Acknowledgments

This work was supported in part by DARPA, the Berkeley Vision and Learning Center (BVLC), the Berkeley Artificial Intelligence Research (BAIR) laboratory, Berkeley Deep Drive (BDD), and ONR through a PECASE award. Rein Houthooft is supported by a Ph.D. Fellowship of the Research Foundation - Flanders (FWO). Xi Chen was also supported by a Berkeley AI Research lab Fellowship. Yan Duan was also supported by a Berkeley AI Research lab Fellowship and a Huawei Fellowship.

References

[1] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar, "Bayesian reinforcement learning: A survey", Found. Trends Mach. Learn., vol. 8, no. 5-6, pp. 359-483, 2015.
[2] S. Kakade, M. Kearns, and J. Langford, "Exploration in metric state spaces", in ICML, vol. 3, 2003, pp. 306-312.
[3] M. Kearns and S. Singh, "Near-optimal reinforcement learning in polynomial time", Mach. Learn., vol. 49, no. 2-3, pp. 209-232, 2002.
[4] R. I. Brafman and M. Tennenholtz, "R-Max - a general polynomial time algorithm for near-optimal reinforcement learning", J. Mach. Learn. Res., vol. 3, pp. 213-231, 2003.
[5] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs", J. Mach. Learn. Res., vol. 3, pp. 397-422, 2003.
[6] J. Pazis and R. Parr, "PAC optimal exploration in continuous space Markov decision processes", in AAAI, 2013.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning", Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[8] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, "Trust region policy optimization", in ICML, 2015.
[9] I. Osband, B. Van Roy, and Z. Wen, "Generalization and exploration via randomized value functions", arXiv preprint arXiv:1402.0635, 2014.
[10] B. C. Stadie, S. Levine, and P. Abbeel, "Incentivizing exploration in reinforcement learning with deep predictive models", arXiv preprint arXiv:1507.00814, 2015.
[11] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, "Deep exploration via bootstrapped DQN", in ICML, 2016.
[12] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, "Action-conditional video prediction using deep networks in Atari games", in NIPS, 2015, pp. 2845-2853.
[13] T. Hester and P. Stone, "Intrinsically motivated model learning for developing curious robots", Artificial Intelligence, 2015.
[14] K. Subramanian, C. L. Isbell Jr, and A. L. Thomaz, "Exploration from demonstration for interactive reinforcement learning", in AAMAS, 2016.
[15] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control", in ICML, 2016.
[16] J. Schmidhuber, "Curious model-building control systems", in IJCNN, 1991, pp. 1458-1463.
[17] Y. Sun, F. Gomez, and J. Schmidhuber, "Planning to be surprised: Optimal Bayesian exploration in dynamic environments", in Artificial General Intelligence, 2011, pp. 41-51.
[18] L. Itti and P. F. Baldi, "Bayesian surprise attracts human attention", in NIPS, 2005, pp. 547-554.
[19] A. Graves, "Practical variational inference for neural networks", in NIPS, 2011, pp. 2348-2356.
[20] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural networks", in ICML, 2015.
[21] J. Schmidhuber, "Formal theory of creativity, fun, and intrinsic motivation (1990-2010)", IEEE Trans. Auton. Mental Develop., vol. 2, no. 3, pp. 230-247, 2010.
[22] J. Storck, S. Hochreiter, and J. Schmidhuber, "Reinforcement driven information acquisition in non-deterministic environments", in ICANN, vol. 2, 1995, pp. 159-164.
[23] J. Z. Kolter and A. Y. Ng, "Near-Bayesian exploration in polynomial time", in ICML, 2009, pp. 513-520.
[24] G. E. Hinton and D. Van Camp, "Keeping the neural networks simple by minimizing the description length of the weights", in COLT, 1993, pp. 5-13.
[25] J. Schmidhuber, "Simple algorithmic principles of discovery, subjective beauty, selective attention, curiosity & creativity", in Intl. Conf. on Discovery Science, 2007, pp. 26-38.
[26] D. P. Kingma, T. Salimans, and M. Welling, "Variational dropout and the local reparameterization trick", in NIPS, 2015, pp. 2575-2583.
[27] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning", Mach. Learn., vol. 8, no. 3-4, pp. 229-256, 1992.
[28] J. Kober and J. R. Peters, "Policy search for motor primitives in robotics", in NIPS, 2009, pp. 849-856.
[29] J. Peters and S. Schaal, "Reinforcement learning by reward-weighted regression for operational space control", in ICML, 2007, pp. 745-750.
[30] P. Auer, T. Jaksch, and R. Ortner, "Near-optimal regret bounds for reinforcement learning", in NIPS, 2009, pp. 89-96.
[31] A. Guez, N. Heess, D. Silver, and P. Dayan, "Bayes-adaptive simulation-based search with value function approximation", in NIPS, 2014, pp. 451-459.
[32] R. S. Sutton, Introduction to reinforcement learning.
[33] S. Still and D. 
Precup, \u201cAn information-theoretic approach to curiosity-driven reinforcement learning\u201d,\n\nTheory Biosci., vol. 131, no. 3, pp. 139\u2013148, 2012.\n\n[34] D. Y. Little and F. T. Sommer, \u201cLearning and exploration in action-perception loops\u201d, Closing the Loop\n\nAround Neural Systems, p. 295, 2014.\n\n[35] S. B. Thrun, \u201cEf\ufb01cient exploration in reinforcement learning\u201d, Tech. Rep., 1992.\n[36] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer, \u201cExploration in model-based reinforcement learning\n\nby empirically estimating learning progress\u201d, in NIPS, 2012, pp. 206\u2013214.\nJ. Schossau, C. Adami, and A. Hintze, \u201cInformation-theoretic neuro-correlates boost evolution of\ncognitive systems\u201d, Entropy, vol. 18, no. 1, p. 6, 2015.\n\n[37]\n\n[38] K. Zahedi, G. Martius, and N. Ay, \u201cLinear combination of one-step predictive information with an\nexternal reward in an episodic policy gradient setting: A critical analysis\u201d, Front. Psychol., vol. 4, 2013.\n[39] P.-Y. Oudeyer and F. Kaplan, \u201cWhat is intrinsic motivation? a typology of computational approaches\u201d,\n\nFront Neurorobot., vol. 1, p. 6, 2007.\n\n[40] G. Montufar, K. Ghazi-Zahedi, and N. Ay, \u201cInformation theoretically aided reinforcement learning for\n\nembodied agents\u201d, ArXiv preprint arXiv:1605.09735, 2016.\n\n[41] S. Mohamed and D. J. Rezende, \u201cVariational information maximisation for intrinsically motivated\n\nreinforcement learning\u201d, in NIPS, 2015, pp. 2116\u20132124.\n\n[42] C. Salge, C. Glackin, and D. Polani, \u201cGuided self-organization: Inception\u201d, in. 2014, ch. Empowerment\u2013\n\nAn Introduction, pp. 
67\u2013114.\n\n9\n\n\f", "award": [], "sourceid": 632, "authors": [{"given_name": "Rein", "family_name": "Houthooft", "institution": "Ghent University - iMinds and UC Berkeley and OpenAI"}, {"given_name": "Xi", "family_name": "Chen", "institution": "UC Berkeley and OpenAI"}, {"given_name": "Xi", "family_name": "Chen", "institution": "UC Berkeley and OpenAI"}, {"given_name": "Yan", "family_name": "Duan", "institution": "UC Berkeley"}, {"given_name": "John", "family_name": "Schulman", "institution": "OpenAI"}, {"given_name": "Filip", "family_name": "De Turck", "institution": "Ghent University - iMinds"}, {"given_name": "Pieter", "family_name": "Abbeel", "institution": "OpenAI / UC Berkeley / Gradescope"}]}