{"title": "Instance-Based State Identification for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 377, "page_last": 384, "abstract": null, "full_text": "Instance-Based State Identification for Reinforcement Learning \n\nR. Andrew McCallum \n\nDepartment of Computer Science \n\nUniversity of Rochester \n\nRochester, NY 14627-0226 \n\nmccallum@cs.rochester.edu \n\nAbstract \n\nThis paper presents instance-based state identification, an approach \nto reinforcement learning and hidden state that builds disambiguating \namounts of short-term memory on-line, and also learns with an \norder of magnitude fewer training steps than several previous approaches. \nInspired by a key similarity between learning with hidden \nstate and learning in continuous geometrical spaces, this approach \nuses instance-based (or \"memory-based\") learning, a method that \nhas worked well in continuous spaces. \n\n1 BACKGROUND AND RELATED WORK \n\nWhen a robot's next course of action depends on information that is hidden from \nthe sensors because of problems such as occlusion, restricted range, bounded field \nof view and limited attention, the robot suffers from hidden state. More formally, \nwe say a reinforcement learning agent suffers from the hidden state problem if the \nagent's state representation is non-Markovian with respect to actions and utility. \n\nThe hidden state problem arises as a case of perceptual aliasing: the mapping between \nstates of the world and sensations of the agent is not one-to-one [Whitehead, \n1992]. If the agent's perceptual system produces the same outputs for two world \nstates in which different actions are required, and if the agent's state representation \nconsists only of its percepts, then the agent will fail to choose correct actions. Note \nthat even if an agent's state representation includes some internal state beyond its \nimmediate percepts, the agent can still suffer from hidden state if it does not keep \nenough internal state to disambiguate the non-Markovian aspects of its environment. \n\nOne solution to the hidden state problem is simply to avoid passing through the \naliased states. This is the approach taken in Whitehead's Lion algorithm [Whitehead, \n1992]. Whenever the agent finds a state that delivers inconsistent reward, it \nsets that state's utility so low that the policy will never visit it again. The success \nof this algorithm depends on a deterministic world and on the existence of a path \nto the goal that consists of only unaliased states. \n\nOther solutions do not avoid aliased states, but do the best they can given a \nnon-Markovian state representation [Littman, 1994; Singh et al., 1994; Jaakkola et al., \n1995]. They involve either learning deterministic policies that execute incorrect \nactions in some aliased states, or learning stochastic policies with action choice \nprobabilities matching the proportions of the different underlying aliased world \nstates. These approaches do not depend on a path of unaliased states, but they \nhave other limitations: when faced with many aliased states, a stochastic policy \ndegenerates into a random walk; when faced with potentially harmful results from \nincorrect actions, deterministically incorrect or probabilistically incorrect action \nchoice may prove too dangerous; and when faced with performance-critical tasks, \ninefficiency that is proportional to the amount of aliasing may be unacceptable. \n\nThe most robust solution to the hidden state problem is to augment the agent's \nstate representation on-line so as to disambiguate the aliased states. State identification \ntechniques uncover the hidden state information; that is, they make the \nagent's internal state space Markovian. 
This transformation from an imperfect state \ninformation model to a perfect state information model has been formalized in the \ndecision and control literature, and involves adding previous percepts and actions to \nthe definition of agent internal state [Bertsekas and Shreve, 1978]. By augmenting \nthe agent's perception with history information (short-term memory of past percepts, \nactions and rewards), the agent can distinguish perceptually aliased states, \nand can then reliably choose correct actions from them. \n\nPredefined, fixed memory representations such as order-n Markov models (also \nknown as constant-sized perception windows, linear traces or tapped-delay lines) \nare often undesirable. When the length of the window is more than needed, they \nexponentially increase the number of internal states for which a policy must be \nstored and learned; when the length of the window is less than needed, the agent \nreverts to the disadvantages of undistinguished hidden state. Even if the agent designer \nunderstands the task well enough to know its maximal memory requirements, \nthe agent is at a disadvantage with constant-sized windows because, for most tasks, \ndifferent amounts of memory are needed at different steps of the task. \n\nThe on-line memory creation approach has been adopted in several reinforcement \nlearning algorithms. The Perceptual Distinctions Approach [Chrisman, 1992] and \nUtile Distinction Memory [McCallum, 1993] are both based on splitting states of a \nfinite state machine by doing off-line analysis of statistics gathered over many steps. \nRecurrent-Q [Lin, 1993] is based on training recurrent neural networks. Indexed \nMemory [Teller, 1994] uses genetic programming to evolve agents that use load and \nstore instructions on a register bank. A chief disadvantage of all these techniques \nis that they require a very large number of steps for training. 
\n\n2 INSTANCE-BASED STATE IDENTIFICATION \n\nThis paper advocates an alternate solution to the hidden state problem, which we term \ninstance-based state identification. The approach was inspired by the successes of \ninstance-based (also called \"memory-based\") methods for learning in continuous \nperception spaces (e.g., [Atkeson, 1992; Moore, 1992]). \n\nThe application of instance-based learning to short-term memory for hidden state is \ndriven by the important insight that learning in continuous spaces and learning with \nhidden state have a crucial feature in common: they both begin learning without \nknowing the final granularity of the agent's state space. The former learns which \nregions of continuous input space can be represented uniformly and which areas \nmust be finely divided among many states. The latter learns which percepts can \nbe represented uniformly because they uniquely identify a course of action without \nthe need for memory, and which percepts must be divided among many states, \neach with its own detailed history to distinguish it from other perceptually \naliased world states. The first approach works with a continuous geometrical input \nspace; the second works with a percept-action-reward \"sequence\" space (or \"history\" \nspace). Large continuous regions correspond to less-specified, small memories; \nsmall continuous regions correspond to more-specified, large memories. \n\nFurthermore, learning in continuous spaces and sequence spaces both have a lot to \ngain from instance-based methods. In situations where the state space granularity \nis unknown, it is especially useful to memorize the raw previous experiences. If \nthe agent tries to fit experience to its current, flawed state space granularity, it is \nbound to lose information by attributing experience to the wrong states. 
Experience \nattributed to the wrong state turns to garbage and is wasted. When faced \nwith an evolving state space, keeping raw previous experience is the path of least \ncommitment, and thus the most cautious about losing information. \n\n3 NEAREST SEQUENCE MEMORY \n\nThere are many possible instance-based techniques to choose from, but we wanted \nto keep the first application simple. With that in mind, this initial algorithm is \nbased on k-nearest neighbor. We call it Nearest Sequence Memory (NSM). It bears \nemphasizing that this algorithm is the most straightforward, simple, almost naive \ncombination of instance-based methods and history sequences that one could think \nof; there are still more sophisticated instance-based methods to try. The surprising \nresult is that such a simple technique works as well as it does. \n\nAny application of k-nearest neighbor consists of three parts: 1) recording each \nexperience, 2) using some distance metric to find neighbors of the current query \npoint, and 3) extracting output values from those neighbors. We apply these three \nparts to action-percept-reward sequences and reinforcement learning by Q-learning \n[Watkins, 1989] as follows: \n\n1. For each step the agent makes in the world, it records the action, percept \nand reward by adding a new state to a single, long chain of states. Thus, \neach state in the chain contains a snapshot of immediate experience, and \nall the experiences are laid out in a time-connected history chain. \n\n[Figure 1 panels: \"Learning in a Geometric Space\" and \"Learning in a Sequence Space\", each illustrating k-nearest neighbor with k = 3; the sequence-space panel annotates match length over a chain of action, percept, reward records.] \n\nFigure 1: A continuous space compared with a sequence space. In each case, the \n\"query point\" is indicated with a gray cross, and the three nearest neighbors are \nindicated with gray shadows. In a geometric space, the neighborhood metric is \ndefined by Euclidean distance. In a sequence space, the neighborhood metric is \ndetermined by sequence match length, i.e., the number of preceding states that match \nthe states preceding the query point. \n\n2. When the agent is about to choose an action, it finds states considered to \nbe similar by looking in its state chain for states with histories similar to \nthe current situation. The longer a state's string of previous experiences \nmatches the agent's most recent experiences, the more likely the state represents \nwhere the agent is now. \n\n3. Using these states, the agent obtains Q-values by averaging together the \nexpected future reward values associated with the k nearest states for each \naction. The agent then chooses the action with the highest Q-value. The \nregular Q-learning update rule is used to update the k states that voted for \nthe chosen action. \n\nChoosing to represent short-term memory as a linear trace is a simple, well-established \ntechnique. Nearest Sequence Memory uses a linear trace to represent \nmemory, but it differs from the fixed-sized window approaches because it provides \na variable memory length: like k-nearest neighbor, NSM can represent varying resolution \nin different regions of state space. \n\n4 DETAILS OF THE ALGORITHM \n\nA more complete description of Nearest Sequence Memory, its performance and its \npossible improvements can be found in [McCallum, 1995]. \n\nThe interaction between the agent and its environment is described by actions, \npercepts and rewards. There is a finite set of possible actions, $\\mathcal{A} = \\{a_1, a_2, \\ldots, a_m\\}$, \na finite set of possible percepts, $\\mathcal{O} = \\{o_1, o_2, \\ldots, o_n\\}$, and a scalar range of possible \nrewards, $\\mathcal{R} = [x, y]$, $x, y \\in \\Re$. At each time step, $t$, the agent executes an action, \n$a_t \\in \\mathcal{A}$, then as a result receives a new percept, $o_t \\in \\mathcal{O}$, and a reward, $r_t \\in \\mathcal{R}$. The \nagent records its raw experience at time $t$ in a \"state\" data point, $s_t$. Also associated \nwith $s_t$ is a slot to hold a single expected future discounted reward value, denoted \n$q(s_t)$. This value is associated with $a_t$ and no other action. \n\n1. Find the k nearest neighbor (most similar) states for each possible future action. \nThe state currently at the end of the chain is the \"query point\" from which we \nmeasure all the distances. The neighborhood metric is defined by the number \nof preceding experience records that match the experience records preceding the \n\"query point\" state. (Here higher values of $n(s_i, s_j)$ indicate that $s_i$ and $s_j$ are \ncloser neighbors.) \n\n$$n(s_i, s_j) = \\begin{cases} 1 + n(s_{i-1}, s_{j-1}), & \\text{if } (a_{i-1} = a_{j-1}) \\wedge (o_{i-1} = o_{j-1}) \\wedge (r_{i-1} = r_{j-1}) \\\\ 0, & \\text{otherwise} \\end{cases} \\qquad (1)$$ \n\nConsidering each of the possible future actions in turn, we find the k nearest \nneighbors and give them a vote, $v(s_i)$. \n\n$$v(s_i) = \\begin{cases} 1, & \\text{if } n(s_t, s_i) \\text{ is among the } k\\ \\max_{\\forall s_j \\mid a_j = a_i} n(s_t, s_j)\\text{'s} \\\\ 0, & \\text{otherwise} \\end{cases} \\qquad (2)$$ \n\n2. Determine the Q-value for each action by averaging the individual q-values from \nthe k voting states for that action. \n\n$$Q_t(a_i) = \\sum_{\\forall s_j \\mid a_j = a_i} (v(s_j)/k)\\, q(s_j) \\qquad (3)$$ \n\n3. Select an action by maximum Q-value, or by random exploration. According to \nan exploration probability, $e$, either let $a_{t+1}$ be randomly chosen from $\\mathcal{A}$, or \n\n$$a_{t+1} = \\arg\\max_{a_i} Q_t(a_i) \\qquad (4)$$ \n\n4. Execute the action chosen in step 3, and record the resulting experience. Do this \nby creating a new \"state\" representing the current state of the environment, and \nstoring the action-percept-reward triple associated with it: \nIncrement the time counter: $t \\leftarrow t + 1$. Create $s_t$; record in it $a_t$, $o_t$, $r_t$. 
\nThe agent can limit its storage and computational load by limiting the number \nof instances it maintains to $N$ (where $N$ is some reasonably large number). Once \nthe agent accumulates $N$ instances, it can discard the oldest instance each time \nit adds a new one. This also provides a way to handle a changing environment. \n\n5. Update the q-values by vote. Perform the dynamic programming step using the \nstandard Q-learning rule to update those states that voted for the chosen action. \nNote that this actually involves performing steps 1 and 2 to get the next Q-values \nneeded for calculating the utility of the agent's current state, $U_t$. (Here $\\beta$ is the \nlearning rate and $\\gamma$ is the discount factor.) \n\n$$U_t = \\max_a Q_t(a) \\qquad (5)$$ \n\n$$(\\forall s_i \\mid a_i = a_{t-1}) \\quad q(s_i) \\leftarrow (1 - \\beta v(s_i))\\, q(s_i) + \\beta v(s_i)(r_i + \\gamma U_t) \\qquad (6)$$ \n\n[Figure 2 panels: \"Performance during learning\" (Nearest Sequence Memory vs. Perceptual Distinctions Approach; x-axis: Steps), \"Steps per Trial during learning\" (Nearest Sequence Memory vs. Recurrent-Q; x-axis: Trials), and local-perception maze panels annotated with steps to learn.] \n\nFigure 2: Comparing Nearest Sequence Memory with three other algorithms: Perceptual \nDistinctions Approach, Recurrent-Q and Utile Distinction Memory. In each \ncase, NSM learns with roughly an order of magnitude fewer steps. \n\n5 EXPERIMENTAL RESULTS \n\nThe performance of NSM is compared to three other algorithms using the tasks \nchosen by the other algorithms' designers. In each case, NSM learns the task with \nroughly an order of magnitude fewer steps. 
Although NSM learns good policies \nquickly, it does not always learn optimal policies. In section 6 we will discuss why \nthe policies are not always optimal and how NSM could be improved. \n\nThe Perceptual Distinctions Approach (PDA) [Chrisman, 1992] was demonstrated in a \nspace ship docking application with hidden state. The task was made difficult by \nnoisy sensors and actions. Some of the sensors returned incorrect values 30% of \nthe time. Various actions failed 70, 30 or 20% of the time, and when they failed, \nresulted in random states. NSM used $\\beta = 0.2$, $\\gamma = 0.9$, $k = 8$, and $N = 1000$. PDA \ntakes almost 8000 steps to learn the task. NSM learns a good policy in fewer than \n1000 steps, although the policy is not quite optimal. \n\nUtile Distinction Memory (UDM) [McCallum, 1993] was demonstrated on several local \nperception mazes. Unlike most reinforcement learning maze domains, the agent perceives \nonly four bits indicating whether there is a barrier to the immediately adjacent \nnorth, east, south and west. NSM used $\\beta = 0.9$, $\\gamma = 0.9$, $k = 4$, and $N = 1000$. In \ntwo of the mazes, NSM learns the task in only about 1/20th the time required by \nUDM; in the other two, NSM learns mazes that UDM did not solve at all. \n\nRecurrent-Q [Lin, 1993] was demonstrated on a robot 2-cup retrieval task. The \nenvironment is deterministic, but the task is made difficult by two nested levels of \nhidden state and by providing no reward until the task is completely finished. NSM \nused $\\beta = 0.9$, $\\gamma = 0.9$, $k = 4$, and $N = 1000$. NSM learns good performance in \nabout 15 trials; Recurrent-Q takes about 100 trials to reach equivalent performance. \n\n6 DISCUSSION \n\nNearest Sequence Memory offers much improved on-line performance and fewer \ntraining steps than its predecessors. Why is the improvement so dramatic? 
\nI believe the chief reason lies with the inherent advantage of instance-based methods, \nas described in section 2: the key idea behind Instance-Based State Identification \nis the recognition that recording raw experience is particularly advantageous when \nthe agent is learning a policy over a changing state space granularity, as is the case \nwhen the agent is building short-term memory for disambiguating hidden state. \n\nIf, instead of using an instance-based technique, the agent simply averages new experiences \ninto its current, flawed state space model, the experiences will be applied \nto the wrong states, and cannot be reused when the agent reconfigures its state \nspace. Furthermore, and perhaps even more detrimentally, incoming data is always \ninterpreted in the context of the flawed state space and always biased in an inappropriate \nway, rather than simply recorded, kept uncommitted and open to easy reinterpretation \nin light of future data. \n\nThe experimental results in this paper bode well for instance-based state identification. \nNearest Sequence Memory is simple; if such a simplistic implementation \nworks as well as it does, more sophisticated approaches may work even better. Here \nare some ideas for improvement: \n\nThe agent should use a more sophisticated neighborhood distance metric than exact \nstring match length. A new metric could account for distances between different \npercepts instead of considering only exact matches. A new metric could also handle \ncontinuous-valued inputs. \n\nNearest Sequence Memory demonstrably solves tasks that involve noisy sensation \nand action, but it could perhaps handle noise even better if it used some technique \nfor explicitly separating noise from structure. K-nearest neighbor does not explicitly \ndiscriminate between structure and noise. 
If the current query point has neighbors \nwith wildly varying output values, there is no way to know if the variations are due \nto noise (in which case they should all be averaged), or due to fine-grained structure \nof the underlying function (in which case only the few closest should be averaged). \nBecause NSM is built on k-nearest neighbor, it suffers from the same inability to \nmethodically separate history differences that are significant for predicting reward \nfrom history differences that are not. I believe this is the single most important \nreason that NSM sometimes did not find optimal policies. \n\nWork in progress addresses the structure/noise issue by combining instance-based \nstate identification with the structure/noise separation method from Utile Distinction \nMemory [McCallum, 1993]. The algorithm, called Utile Suffix Memory, \nuses a tree-structured representation, and is related to Ron, Singer and \nTishby's Prediction Suffix Trees, Moore's Parti-game, Chapman and Kaelbling's \nG-algorithm, and Moore's Variable Resolution Dynamic Programming. See [McCallum, \n1994] for more details as well as references to this related work. \n\nAcknowledgments \n\nThis work has benefited from discussions with many colleagues, including Dana \nBallard, Andrew Moore, Jeff Schneider, and Jonas Karlsson. This material is based \nupon work supported by NSF under Grant no. IRI-8903582 and by NIH/PHS under \nGrant no. 1 R24 RR06853-02. \n\nReferences \n\n[Atkeson, 1992] Christopher G. Atkeson. Memory-based approaches to approximating \ncontinuous functions. In M. Casdagli and S. Eubank, editors, Nonlinear Modeling and \nForecasting, pages 503-521. Addison Wesley, 1992. \n\n[Bertsekas and Shreve, 1978] Dimitri P. Bertsekas and Steven E. Shreve. Stochastic Optimal \nControl. Academic Press, 1978. \n\n[Chrisman, 1992] Lonnie Chrisman. 
Reinforcement learning with perceptual aliasing: The \nperceptual distinctions approach. In Tenth Nat'l Conf. on AI, 1992. \n\n[Jaakkola et al., 1995] Tommi Jaakkola, Satinder Pal Singh, and Michael I. Jordan. Reinforcement \nlearning algorithm for partially observable Markov decision problems. In \nAdvances in Neural Information Processing Systems 7. Morgan Kaufmann, 1995. \n\n[Lin, 1993] Long-Ji Lin. Reinforcement Learning for Robots Using Neural Networks. PhD \nthesis, Carnegie Mellon, School of Computer Science, January 1993. \n\n[Littman, 1994] Michael Littman. Memoryless policies: Theoretical limitations and practical \nresults. In Proceedings of the Third International Conference on Simulation of \nAdaptive Behavior: From Animals to Animats, 1994. \n\n[McCallum, 1993] R. Andrew McCallum. Overcoming incomplete perception with utile \ndistinction memory. In The Proceedings of the Tenth International Machine Learning \nConference. Morgan Kaufmann Publishers, Inc., 1993. \n\n[McCallum, 1994] R. Andrew McCallum. Utile suffix memory for reinforcement learning \nwith hidden state. TR 549, U. of Rochester, Computer Science, 1994. \n\n[McCallum, 1995] R. Andrew McCallum. Hidden state and reinforcement learning with \ninstance-based state identification. IEEE Trans. on Systems, Man, and Cybernetics, \n1995. (In press) [Earlier version available as U. of Rochester TR 502]. \n\n[Moore, 1992] Andrew Moore. Efficient Memory-based Learning for Robot Control. PhD \nthesis, University of Cambridge, November 1992. \n\n[Singh et al., 1994] Satinder Pal Singh, Tommi Jaakkola, and Michael I. Jordan. Model-free \nreinforcement learning for non-Markovian decision problems. In The Proceedings of \nthe Eleventh International Machine Learning Conference, 1994. \n\n[Teller, 1994] Astro Teller. The evolution of mental models. In Kim Kinnear, editor, \nAdvances in Genetic Programming, chapter 9. MIT Press, 1994. 
\n\n[Watkins, 1989] Chris Watkins. Learning from delayed rewards. PhD thesis, Cambridge \nUniversity, 1989. \n\n[Whitehead, 1992] Steven Whitehead. Reinforcement Learning for the Adaptive Control \nof Perception and Action. PhD thesis, Department of Computer Science, University of \nRochester, 1992. \n", "award": [], "sourceid": 932, "authors": [{"given_name": "R.", "family_name": "McCallum", "institution": null}]}