{"title": "Navigating through Temporal Difference", "book": "Advances in Neural Information Processing Systems", "page_first": 464, "page_last": 470, "abstract": null, "full_text": "Navigating through Temporal Difference \n\nPeter Dayan \n\nCentre for Cognitive Science & Department of Physics \n\nUniversity of Edinburgh \n\n2 Buccleuch Place, Edinburgh EH8 9LW \n\ndayan@cns.ed.ac.uk \n\nAbstract \n\nBarto, Sutton and Watkins [2] introduced a grid task as a didactic example \nof temporal difference planning and asynchronous dynamical programming. \nThis paper considers the effects of changing the coding of the \ninput stimulus, and demonstrates that the self-supervised learning of a \nparticular form of hidden unit representation improves performance. \n\n1 INTRODUCTION \n\nTemporal difference (TD) planning [6, 7] uses prediction for control. Consider an \nagent moving around a finite grid such as the one in figure 1 (the agent is incapable \nof crossing the barrier) trying to reach a goal whose position it does not know. If \nit can predict how far away from the goal it is at the current step, and how far \naway from the goal it is at the next step, after making a move, then it can decide \nwhether that move was helpful or harmful. If, in addition, it can record this \nfact, then it can learn how to navigate to the goal. This generation of actions from \npredictions is closely related to the mechanism of dynamical programming. \n\nTD is used to learn the predictions in the first place. Consider the agent moving \naround randomly on the grid, receiving a negative reinforcement of -1 for every \nmove it makes apart from moves which take it onto the goal. In this case, if it can \nestimate, from every location it visits, how much reinforcement (discounted by how \nsoon it arrives) it will get before it next reaches the goal, it will be predicting how \nfar away it is, based on the random method of selecting actions. 
TD's mechanism \nof learning is to force the predictions to be consistent; the prediction from location \na should be -1 more than the average of the predictions from the locations that can \nbe reached in one step (hence the extra -1 reinforcement) from a. \n\nIf the agent initially selects each action with the same probability, then the estimate \nof future reinforcement from a will be monotonically related to how many steps a \nis away from the goal. This makes the predictions useful for criticising actions as \nabove. In practice, the agent will modify its actions according to this criticism at \nthe same time as learning the predictions based on those actions. \n\nBarto, Sutton and Watkins [2] develop this example, and show how the TD mechanism \ncoupled with a punctate representation of the stimulus (referred to as R_BSW \nbelow) finds the optimal paths to the goal. R_BSW ignores the cues shown in figure 1, \nand devotes one input unit to each location on the grid, which fires if and only if \nthe agent is at that place. \n\nTD methods can however work with more general codes. Section 2 considers alternative \nrepresentations, including ones that are sensitive to the orientation of the \nagent as it moves through the grid, and section 3 looks at a restricted form of latent \nlearning - what the agent can divine about its environment in the absence of \nreinforcement. Both techniques can improve the speed of learning. \n\n2 ALTERNATE REPRESENTATIONS \n\nStimulus representations, the means by which the agent finds out from the environment \nwhere it is, can be classified along two dimensions: whether they are punctate \nor distributed, and whether they are directionally sensitive or in register with the \nworld. 
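The consistency condition described in the introduction, combined with a punctate code of the R_BSW kind, can be sketched in a few lines. This is a minimal illustration rather than the simulation reported in the paper: the 4 by 4 grid without a barrier, the goal position, the learning rate alpha, the undiscounted setting gamma = 1 and the function name td_random_walk_values are all assumptions made for the example. Because each location has its own unit, the linear prediction reduces to one stored value per location.

```python
import random


def td_random_walk_values(rows=4, cols=4, goal=(0, 0), gamma=1.0,
                          alpha=0.05, episodes=2000, seed=0):
    # Punctate code: exactly one unit fires per location, so the
    # prediction is simply one learned value per grid point.
    rng = random.Random(seed)
    values = {(r, c): 0.0 for r in range(rows) for c in range(cols)}
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # North, South, West, East
    for _ in range(episodes):
        state = (rng.randrange(rows), rng.randrange(cols))
        while state != goal:
            dr, dc = rng.choice(moves)  # every action equally likely
            nxt = (state[0] + dr, state[1] + dc)
            if nxt not in values:
                nxt = state  # assumed: moving off the grid leaves the agent in place
            # -1 for every move except moves which take the agent onto the goal
            reward = 0.0 if nxt == goal else -1.0
            # TD step: pull the prediction towards reward + next prediction
            target = reward + gamma * values[nxt]
            values[state] += alpha * (target - values[state])
            state = nxt
    return values
```

Under these assumptions the learned values become more negative the further a location is from the goal, which is the monotonic relationship that makes the predictions usable for criticising actions.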
\n\nOver most of the grid, a 'sensible' distributed representation, such as a coarse-coded \none, would be expected to make learning faster, as information about the value and \naction functions could be shared across adjacent grid points. There are points of \ndiscontinuity in the actions, as in the region above the right hand arm of the barrier, \nbut they are few. In his PhD thesis [9], Watkins considers a rather similar problem \nto that in figure 1, and solves it using his variant of TD, Q-learning, based on a CMAC \n[1] coarse-coded representation of the space. Since his agent moves in a continuous \nbounded space, rather than being confined merely to discrete grid points, something \nof this sort is anyway essential. After the initial learning, Watkins arbitrarily makes \nthe agent move ten times more slowly in a closed section of the space. This has a \nsimilar effect to the barrier in inducing a discontinuity in the action space. Despite \nthe CMACs forcing the system to share information across such discontinuities, it \nwas able to learn the task quickly. \n\nThe other dimension over which representations may vary involves the extent to \nwhich they are sensitive to the direction in which the agent is facing. This is of \ninterest if the agent must construe its location from the cues around the grid. In this \ncase, rather than moving North, South, East or West, which are actions registered \nwith the world, the agent should only move Ahead, Left or Right (Behind is disabled \nas an additional constraint), whose effects are also orientation dependent. This, \ntogether with the fact that the representation will be less compact (it having a \nlarger input dimensionality) should make learning slower. Dynamical programming \nand its equivalents are notoriously subject to Bellman's curse of dimensionality, an \nengineering equivalent of exponential explosion in search. \n\nTable 1 shows four possible representations classified along these two dimensions. 
\n\nDirectionally    Coarseness \n                 Punctate    Distributed \nSensitive        R_4X        R_A \nInsensitive      R_BSW       R_CMAC \n\nTable 1: Representations. \n\nR_BSW is the representation Barto, Sutton and Watkins used. R_4X is punctate \nand directionally sensitive - it devotes four units to every grid point, one of which \nfires for each possible orientation of the agent. R_CMAC, the equivalent of Watkins' \nrepresentation, was not simulated, because its capabilities would not differ markedly \nfrom those of the mapping-based representation developed in the next section. \n\nR_A is rather different from the other representations; it provides a test of a representation \nwhich is more directly associated with the sensory information that might be \navailable directly from the cues. Figure 2 shows how R_A works. Various identifiable \ncues, C_1 ... C_c (c = 7 in the figure) are scattered around the outside of the grid, \nand the agent has a fictitious 'retina' which rotates with it. This retina is divided \ninto a number of angular buckets (8 in the figure), and each bucket has c units, the \ni-th one of which responds if the cue C_i is visible in that bucket. This representation \nis clearly directionally sensitive (if the agent is facing a different way, then so is its \nretina, and so no cue will be visible in the same bucket as it was before), and also \ndistributed, since in general more than one cue will be visible from every location. \nNote that there is no restriction on the number of units that can fire in each bucket \nat any time - more than one will fire if more than one cue is visible there. Also, \nunder the present system R_A will in general not work if its coding is ambiguous \n- grid points must be distinguishable. Finally, it should be clear that R_A is not \nbiologically plausible. \n\nFigure 3 shows the learning curves for the three representations simulated. 
Each \npoint is generated by switching off the learning temporarily after a certain number \nof iterations, starting the agent from everywhere in the grid, and averaging how \nmany steps it takes in getting to the goal over and above the minimum necessary. It \nis apparent that R_4X is substantially worse, but, surprisingly, that R_A is actually \nbetter than R_BSW. This implies that the added advantage of its distributed nature \nmore than outweighs its disadvantages of having more components and being \ndirectionally sensitive. \n\nOne of the motivations behind studying alternate representations is the experimental \nfindings on place cells in the hippocampi of rats (amongst other species). These \nare cells that fire only when the rat is at a certain location in its environment. \nAlthough their existence has led to many hypotheses about rat cognitive mapping \n(see [5] for a substantial discussion of place cells and mapping), it is important to \nnote that even with a map, there remains the computationally intensive problem of \nnavigation addressed, in this paper, by TD. R_A, being closely related to the input \nstimuli, is quite unlike a place cell code - the other representations all bear some \nsimilarities. \n\n3 GOAL-FREE LEARNING \n\nOne of the problems with the TD system as described is that it is incapable of latent \nlearning in the absence of reinforcement or a goal. If the goal is just taken away, but \nthe -1 reinforcements are still applied at each step, then the values assigned to each \nlocation will tend to -∞. If both are removed, then although the agent will wander \nabout its environment with random gay abandon, it will not pick up anything that \ncould be used to speed subsequent learning. 
Latent learning experiments with rats \nin dry mazes prove fairly conclusively that rats running mazes in the absence of \nrewards and punishments learn almost as much as rats that are reinforced. \n\nOne way to solve this problem is suggested by Sutton's DYNA architecture [7]. \nBriefly, this constructs a map of place × action → next place, and takes steps \nin the fictitious world constructed from its map in-between taking steps in the real \nworld, as a way of ironing out the computational 'bumps' (ie inconsistencies) in the \nvalue and action functions. \n\nInstead, it is possible to avoid constructing a complete map by altering the representation \nof the environment used for learning the prediction function and optimal \nactions. The section on representations concluded that coarse-coded representations \nare generally better than punctate ones, since information can be shared between \nneighbouring points. However, not all neighbouring points are amenable to this \nsharing, because of discontinuities in the value and action functions. If there were \na way of generating a coarse-coded representation (generally from a punctate one) \nthat is sensitive to the structure of the task, rather than arbitrarily assigned by \nthe environment, it should provide the base for faster learning still. In this case, \nneighbouring points should only be coded together if they are not separated by the \nbarrier. The initial exploration would allow the agent to learn this much about the \nstructure of the environment. \n\nConsider a set of units whose job is to predict the future discounted sum of firings \nof the raw input lines. Using R_BSW during the initial stage of learning when the \nactions are still random, if the agent is at location (3,3) of the grid, say, then the \ndiscounted prediction of how often it will be in (3,4) (ie the frequency with which \nthe single unit representing (3,4) will fire) will be high, since this location is close. 
\nHowever, the prediction for (7,11) will be low, because it is very unlikely to get \nthere quickly. Consider the effect of the barrier: locations on opposite sides of it, eg \n(1,6) and (2,6), though close in the Euclidean (or Manhattan) metric on the grid, \nare far apart in the task. This means that the discounted prediction of how often \nthe agent will be at (1,6), given that it starts at (2,6), will be proportionately lower. \n\nOverall, the prediction units should act like a coarse code, sensitive to the structure \nof the task. As required, this information about the environment is entirely \nindependent of whether or not the agent is reinforced during its exploration. In \nfact, the resulting 'map' will be more accurate if it is not, as its exploration will be \nmore random. The output of the prediction units is taken as an additional source \nof information for the value and action functions. \n\nSince their main aim is to create intelligently distributed representations from punctate \nones, it is only appropriate to use these prediction units for R_BSW and R_4X. \nFigure 4 compares average learning curves for R_BSW with and without these extra \nmapping units, and with and without 6000 steps of latent learning (LL) in the \nabsence of any reinforcement. A significant improvement is apparent. \n\nFigure 5 shows one set of predictions based on the R_BSW representation¹ after a \nfew un-reinforced iterations. The predictions are clearly fairly well developed and \nsmooth - a predictable exponentially decaying hump. The only deviations from \nthis are at the barrier and along the edges, where the effects of impermeability and \nimmobility are apparent. \n\nFigure 6 shows the same set of predictions but after 2000 reinforced iterations, by \nwhich time the agent reaches the goal almost optimally. The predictions degenerate \nfrom being roughly radially symmetric (bar the barrier) to being highly asymmetric. 
\nOnce the agent has learnt how to get to the goal from some location, the path it will \nfollow, and so the locations it will visit from there, is largely fixed. The asymptotic \nvalues of the predictions will therefore be 0 for units not on the path, and γ^r for \nthose on the path, where r is the number of steps since the agent's start point and \nγ is the discounting factor weighting immediate versus distant reinforcement. This \nis a severe limitation since it implies that the topological information present in the \nearly stages of learning evaporates, and with it almost all the benefits \nof the prediction units. \n\n4 DISCUSSION \n\nNavigation comprises two problems: where the agent and the goals in its environment \nare, and how it can get to them. Having some form of cognitive map, as is \nsuggested by the existence of place cells, addresses the first, but leaves open the \nsecond. For the case of one goal, the simple TD method described here is one \nsolution. \n\nTD planning methods are clearly robust to changes in the way the input stimulus \nis represented. Distributed codes, particularly ones that allow for the barrier, \nmake learning faster. This is even true for R_A, which is sensitive to the orientation \nof the agent. All these results require each location to have a unique representation \n- Mozer and Bachrach [4] and Chrisley [3] and references therein look at how \nambiguities can be resolved using information on the sequence of states the agent \ntraverses. \n\nSince these TD planning methods are totally general, just like dynamical programming, \nthey are unlikely to scale well. Some evidence for this comes from the relatively \npoor performance of R_4X, with its quadrupled input dimension. This puts \nthe onus back either onto dividing the task into manageable chunks, or onto more \nsophisticated representation. 
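The prediction units of section 3 can be sketched along the following lines. This is a hedged illustration, not the simulation used for figures 4 to 6: the 4 by 4 grid, the single blocked edge standing in for the barrier, the values of gamma and alpha, the choice to count firings from the next step onwards, and the function name learn_prediction_units are all assumptions made for the example. No reinforcement or goal is involved, so it corresponds to the latent learning case.

```python
import random


def learn_prediction_units(rows=4, cols=4, blocked=(((1, 1), (1, 2)),),
                           gamma=0.8, alpha=0.02, steps=60000, seed=0):
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # The barrier is modelled as a set of blocked edges, in both directions.
    walls = set()
    for a, b in blocked:
        walls.add((a, b))
        walls.add((b, a))
    # pred[s][j]: predicted discounted future firing of punctate unit j
    # when the agent is at location s.
    pred = {s: {j: 0.0 for j in cells} for s in cells}
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    state = rng.choice(cells)
    for _ in range(steps):
        dr, dc = rng.choice(moves)  # purely random, un-reinforced exploration
        nxt = (state[0] + dr, state[1] + dc)
        if nxt not in pred or (state, nxt) in walls:
            nxt = state  # edges and the barrier leave the agent in place
        row, nxt_row = pred[state], pred[nxt]
        for j in cells:
            fired = 1.0 if j == nxt else 0.0
            # TD step on each prediction unit, with the firing of raw
            # input line j playing the role of the reinforcement
            row[j] += alpha * (fired + gamma * nxt_row[j] - row[j])
        state = nxt
    return pred
```

With these assumed settings, the prediction from (1,1) for its open neighbour (1,0) comes out higher than the prediction for (1,2), which is equally close on the grid but lies on the other side of the blocked edge.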
\n\nAcknowledgements \n\nI am very grateful to Jay Buckingham, Kate Jeffrey, Richard Morris, Toby Tyrell, \nDavid Willshaw, and the attendees of the PDP Workshop at Edinburgh, the Connectionist \nGroup at Amherst, and a spatial learning workshop at King's College \nCambridge for their helpful comments. This work was funded by SERC. \n\n1 Note that these are normalised to a maximum value of 10, for graphical convenience. \n\nReferences \n\n[1] Albus, JS (1975). A new approach to manipulator control: the Cerebellar \nModel Articulation Controller (CMAC). Transactions of the ASME: Journal \nof Dynamical Systems, Measurement and Control, 97, pp 220-227. \n\n[2] Barto, AG, Sutton, RS & Watkins, CJCH (1989). Learning and Sequential \nDecision Making. Technical Report 89-95, Computer and Information Science, \nUniversity of Massachusetts, Amherst, MA. \n\n[3] Chrisley, RL (1990). Cognitive map construction and use: A parallel distributed \napproach. In DS Touretzky, J Elman, TJ Sejnowski, & GE Hinton, \neditors, Proceedings of the 1990 Connectionist Models Summer School. San \nMateo, CA: Morgan Kaufmann. \n\n[4] Mozer, MC, & Bachrach, J (1990). Discovering the structure of a reactive \nenvironment by exploration. In D Touretzky, editor, Advances in Neural Information \nProcessing Systems, 2, pp 439-446. San Mateo, CA: Morgan Kaufmann. \n\n[5] O'Keefe, J & Nadel, L (1978). The Hippocampus as a Cognitive Map. Oxford, \nEngland: Oxford University Press. \n\n[6] Sutton, RS (1988). Learning to predict by the methods of temporal difference. \nMachine Learning, 3, pp 9-44. \n\n[7] Sutton, RS (1990). Integrated architectures for learning, planning, and reacting \nbased on approximating dynamic programming. In Proceedings of the Seventh \nInternational Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann. \n\n[8] Sutton, RS, & Barto, AG. To appear. 
Time-derivative models of Pavlovian \nconditioning. In M Gabriel & JW Moore, editors, Learning and Computational \nNeuroscience. Cambridge, MA: MIT Press. \n\n[9] Watkins, CJCH (1989). Learning from Delayed Rewards. PhD Thesis. University \nof Cambridge, England. \n\nFig 1: The grid task (agent, goal, barrier, and cues around the outside of the grid) \n\nFig 2: The 'retina' for R_A (orientation, angular buckets, cues; units shown as firing or not firing) \n\nFig 3: Different representations (average extra steps to goal against learning iterations, for 4X, BSW and A) \n\nFig 4: Mapping with R_BSW (average extra steps to goal against learning iterations, for no map; map, no LL; map, LL) \n\nFig 5: Initial predictions from (5,6) \n\nFig 6: Predictions after 2000 iterations \n", "award": [], "sourceid": 428, "authors": [{"given_name": "Peter", "family_name": "Dayan", "institution": null}]}