{"title": "Active Exploration in Dynamic Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 531, "page_last": 538, "abstract": null, "full_text": "Active Exploration in Dynamic Environments \n\nSebastian B. Thrun \nSchool of Computer Science \nCarnegie Mellon University \nPittsburgh, PA 15213 \nE-mail: thrun@cs.cmu.edu \n\nKnut M\u00f6ller \nUniversity of Bonn \nDept. of Computer Science \nR\u00f6merstr. 164 \nD-5300 Bonn, Germany \n\nAbstract \n\nWhenever an agent learns to control an unknown environment, two opposing principles have to be combined, namely exploration (long-term optimization) and exploitation (short-term optimization). Many real-valued connectionist approaches to learning control realize exploration by randomness in action selection. This might be disadvantageous when costs are assigned to \"negative experiences\". The basic idea presented in this paper is to make an agent explore unknown regions in a more directed manner. This is achieved by a so-called competence map, which is trained to predict the controller's accuracy, and is used for guiding exploration. Based on this, a bistable system enables smoothly switching attention between two behaviors - exploration and exploitation - depending on expected costs and knowledge gain. \nThe appropriateness of this method is demonstrated by a simple robot navigation task. \n\nINTRODUCTION \nThe need for exploration in adaptive control has been recognized by various authors [MB89, Sut90, Moo90, Sch90, BBS91]. Many connectionist approaches (e.g. [Mel89, MB89]) distinguish a random exploration phase, in which a controller is constructed by generating actions randomly, and a subsequent exploitation phase. 
\nRandom exploration usually suffers from three major disadvantages: \n\n\u2022 Whenever costs are assigned to certain experiences - which is the case for various real-world tasks such as autonomous robot learning, chemical control, flight control etc. - exploration may become unnecessarily expensive. Intuitively speaking, a child would burn itself again and again simply because it is in its random phase. \n\n\u2022 Random exploration is often inefficient in terms of learning time, too [Whi91, Thr92]. Random actions usually make an agent waste plenty of time in already well-explored regions in state space, while other regions may still be poorly explored. Exploration happens by chance and is thus undirected. \n\n\u2022 Once the exploitation phase begins, learning is finished and the system is unable to adapt to time-varying, dynamic environments. \n\nFigure 1: The training of the model network is a system identification task. Weights and biases of the network are estimated by gradient descent using the backpropagation algorithm. \n\nHowever, more efficient exploration techniques rely on knowledge about the learning process itself, which is used for guiding exploration. Rather than selecting actions randomly, these exploration techniques select actions such that the expected knowledge gain is maximal. In discrete domains, this may be achieved by preferring states (or state-action pairs) that have been visited less frequently [BS90], or less recently [Sut90], or have previously shown a high prediction error [Moo90, Sch91]^1. For various discrete deterministic domains such exploration heuristics have been proved to prevent exponential learning time [Thr92] (exponential in the size of the state space). However, such techniques require a variable associated with each state-action pair, which is not feasible if states and actions are real-valued. 
\nA novel real-valued generalization of these approaches is presented in this paper. A so-called competence map estimates the controller's accuracy. Using this estimation, the agent is driven into regions in state space with low accuracy, where the resulting learning effect is assumed to be maximal. This technique defines a directed exploration rule. In order to minimize costs during learning, exploration is combined with an exploitation mechanism using selective attention, which allows for switching between exploration and exploitation. \n\n^1 Note that these two approaches [Moo90, Sch91] are real-valued. \n\nINDIRECT CONTROL USING FORWARD MODELS \n\nIn this paper we focus on an adaptive control scheme adopted from Jordan [Jor89]: \nSystem identification (Fig. 1): Observing the input-output behavior of the unknown world (environment), a model is constructed by minimizing the difference between the observed outcome and its corresponding prediction. This is done with backpropagation. \nAction search using the model network (Fig. 2): Let an actual state $s$ and a goal state $s^*$ be given. Optimal actions are searched using gradient descent in action space: starting with an initial action (e.g. randomly chosen), the next state $\\hat{s}$ is predicted with the world model. The exploitation energy function \n\n$$E_{\\text{exploit}} = (s^* - \\hat{s})^T (s^* - \\hat{s})$$ \n\nmeasures the LMS-deviation of the predicted and the desired state. Since the model network is differentiable, gradients of $E_{\\text{exploit}}$ can be propagated back through the model network. Using these gradients, actions are optimized progressively by gradient descent in action space, minimizing $E_{\\text{exploit}}$. The resulting actions exploit the world. \n\nFigure 2: Using the model for optimizing actions (exploitation). Starting with some initial action, gradient descent through the model network progressively improves actions. 
\n\nTHE COMPETENCE MAP \n\nThe general principle of many enhanced exploration schemes [BS90, Sut90, Moo90, TM91, Sch91, Thr92] is to select actions such that the resulting observations are expected to optimally improve the controller. In terms of the above control scheme, this may be realized by driving the agent into regions in state-action space where the accuracy of the model network is assumed to be low, and thus the knowledge gain by visiting these regions is assumed to be high. In order to estimate the accuracy of the model network, we introduce the notion of a competence network [Sch91, TM91]. Basically, this map estimates some upper bound of the LMS-error of the model network. This estimation is used for exploring the world by selecting actions which minimize the expected competence of the model, and thus maximize the resulting learning effect. \nHowever, training the competence map is not straightforward, since it is impossible to exactly predict the accuracy of the model network for regions in state space not visited for some time. The training procedure for the competence map is based on the assumption that the error increases (and thus competence decreases) slowly for such regions due to relearning and environmental dynamics: \n\n1. At each time tick, backpropagation learning is applied using the last state-action pair as input, and the observed LMS-prediction error of the model as target value (cf. Fig. 3), normalized to $(0, c_{\\max})$ ($0 \\leq c_{\\max} \\leq 1$; so far we used $c_{\\max} = 1$). \n\n2. For some^2 randomly generated state-action pairs, the competence map is subsequently trained with target 1.0 ($\\geq$ largest possible error $c_{\\max}$) [ACL+90]. This training step establishes a heuristic, realizing the loss of accuracy in unvisited regions: over time, the output values of the competence map increase for these regions. 
\n\nActions are now selected with respect to an energy function $E$ which combines both exploration and exploitation: \n\n$$E = (1-\\Gamma) \\cdot E_{\\text{explore}} + \\Gamma \\cdot E_{\\text{exploit}} \\qquad (1)$$ \n\nwith gain parameter $\\Gamma$ ($0 < \\Gamma < 1$). Here the exploration energy \n\n$$E_{\\text{explore}} = 1 - \\text{competence}(\\text{action})$$ \n\nis evaluated using the competence map - minimizing $E_{\\text{explore}}$ is equivalent to maximizing the predicted model error. Since both the model net and the competence net are differentiable, gradient descent in action space may be used for minimizing Eq. (1). $E$ combines exploration with exploitation: on the one hand minimizing $E_{\\text{exploit}}$ serves to avoid costs (short-term optimization), and on the other hand minimizing $E_{\\text{explore}}$ ensures exploration (long-term optimization). $\\Gamma$ determines the portion of both target functions - which can be viewed as representing behaviors - in the action selection process. \nNote that $c_{\\max}$ determines the character of exploration: if $c_{\\max}$ is large, the agent is attracted by regions in state space which have previously shown high prediction error. The smaller $c_{\\max}$ is, the more the agent is attracted by rarely-visited regions. \n\n^2 In our simulations: five, with a small learning rate. \n\nFigure 3: Training the competence map to predict the error of the model by gradient descent (see text). \n\nEXPLORATION AND SELECTIVE ATTENTION \nClearly, exploration and exploitation are often conflicting and can hinder each other. For example, if exploration and exploitation pull a mobile robot in opposite directions, the system will stay where it is. It therefore makes sense not to keep $\\Gamma$ constant during learning, but sometimes to focus more on exploration and sometimes more on exploitation, depending on expected costs and improvements. 
In our approach, this is achieved by determining the focus of attention $\\Gamma$ using the following bistable recursive function, which allows for smoothly switching attention between both policies. \nAt each step of action search, let $e_{\\text{exploit}} = \\Delta E_{\\text{exploit}}(a)$ and $e_{\\text{explore}} = \\Delta E_{\\text{explore}}(a)$ denote the expected change of both energy functions by action $a$. With $f(\\cdot)$ being a positive and monotonically increasing function^3, \n\n$$\\kappa = \\Gamma \\cdot f(e_{\\text{exploit}}) - (1-\\Gamma) \\cdot f(e_{\\text{explore}}) \\qquad (2)$$ \n\ncompares the influence of action $a$ on both energy functions under the current focus of attention $\\Gamma$. The new $\\Gamma$ is then derived by squashing $\\kappa$ (with $c > 0$): \n\n$$\\Gamma = \\frac{1}{1 + e^{-c \\cdot \\kappa}} \\qquad (3)$$ \n\nIf $\\kappa > 0$, the learning system is in exploitation mode and $\\Gamma > 0.5$. Likewise, if $\\kappa < 0$, the system is in exploration mode and $\\Gamma < 0.5$. Since the actual attention $\\Gamma$ weighs both competing energy functions, in most cases Eqs. (2) and (3) establish two stable points (fixpoints), close to 0 and 1, respectively. Attention is switched only if $\\kappa$ changes its sign. The scalar $c$ serves as a stability factor: the larger $c$ is, the closer $\\Gamma$ is to its extremal values and the larger the switching factors $\\Gamma (1-\\Gamma)^{-1}$ (taken from Eq. (2)). \n\n^3 We chose $f(x) = e^x$ in our simulations. \n\nFigure 4: (a) Robot world - note that there are two equally good paths leading around the obstacle. (b) Potential field: In addition to the x-y-state vector, the environment returns for each state a potential field value (the darker the color, the larger the value). Gradient ascent in the potential field yields both optimal paths depicted. Learning this potential field function is part of the system identification task. \n\nA ROBOT NAVIGATION TASK \nWe will now demonstrate the benefits of active exploration using a competence map with selective attention by a simple robot navigation example. 
The environment is a 2-dimensional room with one obstacle and walls (see Fig. 4a), and x-y-states are evaluated by a potential field function (Fig. 4b). The goal is to navigate the robot from the start to the goal position without colliding with the obstacle or a wall. Using a model network without hidden units for state prediction and a model with two hidden layers (10 units with Gaussian activation functions in the first hidden layer, and 8 logistic units in the second) for potential field value prediction, we compared the following exploration techniques (Table 1 summarizes the results): \n\n\u2022 Pure random exploration. In Fig. 5a the best result out of 20 runs is shown. The dark color in the middle indicates that the obstacle was touched extremely often. Moreover, the resulting controller (exploitation phase) did not find a path to the goal. \n\n\u2022 Pure exploitation (see Fig. 5b). With a bit of randomness in the beginning, this exploration technique found one of two paths but failed both to find the other path and to perform proper system identification. The number of crashes during learning was significantly smaller than with random exploration. \n\n\u2022 Directed exploration with selective attention. Using a competence network with two hidden layers (6 units in each hidden layer), a proper model was found in all simulations we performed (Fig. 6a), and the number of collisions was the smallest. An intermediate state of the competence map is depicted in Fig. 6b, and three exploration runs are shown in Fig. 7. \n\ntechnique            # runs   # crashes   # paths found   L2-model error \nrandom exploration   10000    9993        0               2.5 % \npure exploitation    15000    11000       1               0.7 % \nactive exploration   15000    4000        2               0.4 % \n\nTable 1: Results (averaged over 20 runs). The L2-model error is measured in relation to its initial value (= 100%). \n\nFigure 5: Resulting models of the potential field function. (a) Random exploration. The dark color in the middle indicates the high number of crashes against the obstacle. Note that the agent is restarted whenever it crashes against a wall or the obstacle - the probability for reaching the goal is 0.0007. (b) Pure exploitation: The resulting model is accurate along the path, but inaccurate elsewhere. Only one of two paths is identified. \n\nFigure 6: Active exploration. (a) Resulting model of the potential field function. This model is the most accurate, and the number of crashes during training is the smallest. Both paths are found about equally often. (b) \"Typical\" competence map: The arrows indicate actions which maximize $E_{\\text{explore}}$ (pure exploration). \n\nFigure 7: Three examples of trajectories during learning demonstrate the switching attention mechanism described in the paper. Thick lines indicate exploration mode ($\\Gamma < 0.2$), and thin lines indicate exploitation ($\\Gamma > 0.8$). The arrows mark some points where exploration is switched off due to a predicted collision. \n\nDISCUSSION \n\nWe have presented an adaptive strategy for efficient exploration in non-discrete environments. A so-called competence map is trained to estimate the competence (error) of the world model, and is used for driving the agent to less familiar regions. In order to avoid unnecessary exploration costs, a selective attention mechanism switches between exploration and exploitation. The resulting learning system is dynamic in the sense that whenever one particular region in state space is preferred for several runs, sooner or later the exploration behavior forces the agent to leave this region. Benefits of this exploration technique have been demonstrated on a robot navigation task. 
\nHowever, it should be noted that the exploration method presented seeks to explore more or less the whole state-action space. This may be reasonable for the above robot navigation task, but many state spaces, e.g. those typically found in traditional AI, are too large to be explored exhaustively even once. In order to deal with such spaces, this method should be extended by some mechanism for cutting off exploration in \"irrelevant\" regions in state-action space, which may be determined by some notion of \"relevance\". \nNote that the technique presented here does not depend on the particular control scheme at hand. E.g., some exploration techniques in the context of reinforcement learning may be found in [Sut90, BBS91], and are surveyed and compared in [Thr92]. \n\nAcknowledgements \n\nThe authors wish to thank Jonathan Bachrach, Andy Barto, Jorg Kindermann, Long-Ji Lin, Alexander Linden, Tom Mitchell, Andy Moore, Satinder Singh, Don Sofge, Alex Waibel, and the reinforcement learning group at CMU for interesting and fruitful discussions. S. Thrun gratefully acknowledges the support by German National Research Center for Computer Science (GMD) where part of the research was done, and also the financial support from Siemens Corp. \n\nReferences \n\n[ACL+90] L. Atlas, D. Cohn, R. Ladner, M.A. El-Sharkawi, R.J. Marks, M.E. Aggoune, and D.C. Park. Training connectionist networks with queries and selective sampling. In D. Touretzky (ed.), Advances in Neural Information Processing Systems 2, San Mateo, CA, 1990. IEEE, Morgan Kaufmann. \n[BBS91] A.G. Barto, S.J. Bradtke, and S.P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report COINS 91-57, Department of Computer Science, University of Massachusetts, MA, Aug. 1991. \n[BS90] A.G. Barto and S.P. Singh. On the computational economics of reinforcement learning. In D.S. Touretzky et al. (eds.), Connectionist Models, Proceedings of the 1990 Summer School, San Mateo, CA, 1990. Morgan Kaufmann. \n[Jor89] M.I. Jordan. Generic constraints on underspecified target trajectories. In Proceedings of the First International Joint Conference on Neural Networks, Washington, DC, IEEE TAB Neural Network Committee, San Diego, 1989. \n[MB89] M.C. Mozer and J.R. Bachrach. Discovering the structure of a reactive environment by exploration. Technical Report CU-CS-451-89, Dept. of Computer Science, University of Colorado, Boulder, Nov. 1989. \n[Mel89] B.W. Mel. Murphy: A neurally-inspired connectionist approach to learning and performance in vision-based robot motion planning. Technical Report CCSR-89-17A, Center for Complex Systems Research, Beckman Institute, University of Illinois, 1989. \n[Moo90] A.W. Moore. Efficient Memory-based Learning for Robot Control. PhD thesis, Trinity Hall, University of Cambridge, England, 1990. \n[Sch90] J.H. Schmidhuber. Making the world differentiable: On using supervised learning fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report, Technische Universit\u00e4t M\u00fcnchen, Germany, 1990. \n[Sch91] J.H. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical Report FKI-149-91, Technische Universit\u00e4t M\u00fcnchen, Germany, 1991. \n[Sut90] R.S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, June 1990. \n[TM91] S.B. Thrun and K. M\u00f6ller. On planning and exploration in non-discrete environments. Technical Report 528, GMD, St. Augustin, FRG, 1991. \n[Thr92] S.B. Thrun. Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie Mellon University, Pittsburgh, Jan. 1992. \n[Whi91] S.D. Whitehead. A study of cooperative mechanisms for faster reinforcement learning. Technical Report 365, University of Rochester, Computer Science Department, Rochester, NY, March 1991. \n", "award": [], "sourceid": 573, "authors": [{"given_name": "Sebastian", "family_name": "Thrun", "institution": null}, {"given_name": "Knut", "family_name": "M\u00f6ller", "institution": null}]}