{"title": "Sequential Decision Problems and Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 686, "page_last": 693, "abstract": "", "full_text": "686 \n\nBarto, Sutton and Watkins \n\nSequential Decision Problems \n\nand Neural Networks \n\nA. G. Barto \nDept. of Computer and \nInformation Science \nUniv. of Massachusetts \nAmherst, MA 01003 \n\nR. S. Sutton \n\nGTE Laboratories Inc. \nWaltham, MA 02254 \n\nc. J. C. H. Watkins \n25B Framfield \nHighbury, London \nN51UU \n\nABSTRACT \n\nDecision making tasks that involve delayed consequences are very \ncommon yet difficult to address with supervised learning methods. \nIf there is an accurate model of the underlying dynamical system, \nthen these tasks can be formulated as sequential decision problems \nand solved by Dynamic Programming. This paper discusses rein(cid:173)\nforcement learning in terms of the sequential decision framework \nand shows how a learning algorithm similar to the one implemented \nby the Adaptive Critic Element used in the pole-balancer of Barto, \nSutton, and Anderson (1983), and further developed by Sutton \n(1984), fits into this framework. Adaptive neural networks can \nplay significant roles as modules for approximating the functions \nrequired for solving sequential decision problems. \n\nINTRODUCTION \n\n1 \nMost neural network research on learning assumes the existence of a supervisor or \nteacher knowledgeable enough to supply desired, or target, network outputs during \ntraining. These network learning algorithms are function approximation methods \nhaving various useful properties. Other neural network research addresses the ques(cid:173)\ntion of where the training information might come from. 
Typical of this research is that into reinforcement learning systems; these systems learn without detailed instruction about how to interact successfully with reactive environments. Learning tasks involving delays between actions and their consequences are particularly difficult to address with supervised learning methods, and special reinforcement learning algorithms have been developed to handle them. In this paper, reinforcement learning is related to the theory of sequential decision problems and to the computational methods known as Dynamic Programming (DP). DP methods are not learning methods because they rely on complete prior knowledge of the task, but their theory is nevertheless relevant for understanding and developing learning methods.\n\nAn example of a sequential decision problem involving delayed consequences is the version of the pole-balancing problem studied by Barto, Sutton, and Anderson (1983). In this problem the consequences of control decisions are not immediately available because training information comes only in the form of a \"failure signal\" occurring when the pole falls past a critical angle or when the cart hits an end of the track. The learning system used by Barto et al. (1983), and subsequently systematically explored by Sutton (1984), consists of two different neuron-like adaptive elements: an Associative Search Element (ASE), which implemented and adjusted the control rule, or decision policy, and an Adaptive Critic Element (ACE), which used the failure signal to learn how to provide useful moment-to-moment evaluation of control decisions. The focus of this paper is the algorithm implemented by the ACE: What computational task does this algorithm solve, and how does it solve it?
\n\nSutton (1988) analyzed a class of learning rules which includes the algorithm used by the ACE, calling them Temporal Difference, or TD, algorithms. Although Sutton briefly discussed the relationship between TD algorithms and DP, he did not develop this perspective. Here, we discuss an algorithm slightly different from the one implemented by the ACE and call it simply the \"TD algorithm\" (although the class of TD algorithms includes others as well). The earliest use of a TD algorithm that we know of was by Samuel (1959) in his checkers player. Werbos (1977) was the first we know of to suggest such algorithms in the context of DP, calling them \"heuristic dynamic programming\" methods. The connection to dynamic programming has recently been extensively explored by Watkins (1989), who uses the term \"incremental dynamic programming.\" Also related are the \"bucket brigade\" used in classifier systems (see Liepins et al., 1989), the adaptive controller developed by Witten (1977), and certain animal learning models (see Sutton and Barto, to appear). Barto, Sutton, and Watkins (to appear) discuss the relationship between TD algorithms and DP more extensively than is possible here and provide references to other related research.\n\n2 OPTIMIZING DELAYED CONSEQUENCES\n\nMany problems require making decisions whose consequences emerge over time periods of variable and uncertain duration. Decision-making strategies must be formed that take into account expectations of both the short-term and long-term consequences of decisions. The theory of sequential decision problems is highly developed and includes formulations of both deterministic and stochastic problems (the books by Bertsekas, 1976, and Ross, 1983, are two of the many relevant texts). This theory concerns problems such as the following special case of a stochastic problem.
\n\nA decision maker (DM) interacts with a discrete-time stochastic dynamical system in such a way that, at each time step, the DM observes the system's current state and selects an action. After the action is performed, the DM receives (at the next time step) a certain amount of payoff that depends on the action and the current state, and the system makes a transition to a new state determined by the current state, the action, and random disturbances. Upon observing the new state, the DM chooses another action and continues in this manner for a sequence of time steps. The objective of the task is to form a rule for the DM to use in selecting actions, called a policy, that maximizes a measure of the total amount of payoff accumulated over time. The amount of time over which this measure is computed is the horizon of the problem, and a maximizing policy is an optimal policy. One commonly studied measure of cumulative payoff is the expected infinite-horizon discounted return, defined below. Because the objective is to maximize a measure of cumulative payoff, both short- and long-term consequences of decisions are important. Decisions that produce high immediate payoff may prevent high payoff from being received later on, and hence such decisions should not necessarily be included in optimal policies.\n\nMore formally (following the presentation of Ross, 1983), a policy is a mapping, denoted π, that assigns an action to each state of the underlying system (for simplicity, here we consider only the special case of deterministic policies). Let x_t denote the system state at time step t; if the DM uses policy π, the action it takes at step t is a_t = π(x_t). After the action is taken, the system makes a transition from state x = x_t to state y = x_{t+1} with probability P_xy(a_t). At time step t + 1, the DM receives a payoff, r_{t+1}, with expected value R(x_t, a_t). 
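The interaction loop just described can be sketched in code. The two-state system below (its transition probabilities P, payoffs R, and the policy) is a hypothetical illustration, not taken from the paper; the sketch accumulates the discounted return along one sampled trajectory.

```python
import random

# Hypothetical two-state, two-action system, used only for illustration:
# P[x][a] gives the transition probabilities P_xy(a) over next states y,
# and R[x][a] gives the expected payoff for taking action a in state x.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.0}}

def step(x, a, rng):
    """One interaction step: payoff depends on (x, a); the next state is
    drawn from the transition probabilities P_xy(a)."""
    r = R[x][a]
    y = rng.choices([0, 1], weights=P[x][a])[0]
    return r, y

def sample_return(policy, x0, gamma, steps, seed=0):
    """Accumulate the discounted return sum_t gamma^t r_{t+1} along one
    trajectory generated by a deterministic policy."""
    rng = random.Random(seed)
    x, total, discount = x0, 0.0, 1.0
    for _ in range(steps):
        r, x = step(x, policy[x], rng)
        total += discount * r
        discount *= gamma
    return total

pi = {0: 1, 1: 0}   # a deterministic policy: pi[x] gives the action in state x
ret = sample_return(pi, 0, gamma=0.9, steps=200)
```

With payoffs bounded by 2 and γ = 0.9, any sampled return lies between 0 and 2/(1 − γ) = 20, which is the geometric-series bound the discounted-return definition implies.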
For any policy π and state x, one can define the expected infinite-horizon discounted return (which we simply call the expected return) under the condition that the system begins in state x, the DM continues to use policy π throughout the future, and γ, 0 ≤ γ < 1, is the discount factor:\n\nV^π_γ(x) = E_π[ Σ_{t=0}^∞ γ^t r_{t+1} | x_0 = x ],  (1)\n\nwhere x_0 is the initial system state, and E_π is the expectation assuming the DM uses policy π. The objective of the decision problem is to form a policy that maximizes the expected return defined by Equation 1 for each state x.\n\n3 DYNAMIC PROGRAMMING\n\nDynamic Programming (DP) is a collection of computational methods for solving stochastic sequential decision problems. These methods require a model of the dynamical system underlying the decision problem in the form of the state transition probabilities, P_xy(a), for all states x and y and actions a, as well as knowledge of the function, R(x, a), giving the payoff expectations for all states x and actions a. There are several different DP methods, all of which are iterative methods for computing optimal policies, and all of which compute sequences of different types of evaluation functions. Most relevant to the TD algorithm is the evaluation function for a given policy. This function assigns to each state the expected value of the return assuming the problem starts in that state and the given policy is used. Specifically, for policy π and discount factor γ, the evaluation function, V^π_γ, assigns to each state, x, the expected return given the initial state x:\n\nV^π_γ(x) = E_π[ Σ_{t=0}^∞ γ^t r_{t+1} | x_0 = x ].\n\nFor each state, the evaluation function provides a prediction of the return that will accrue throughout the future whenever this state is encountered if the given policy is followed.
If one can compute the evaluation function for a state merely from observing that state, this prediction is effectively available immediately upon the system entering that state. Evaluation functions provide the means for assessing the temporally extended consequences of decisions in a temporally local manner.\n\nIt can be shown (e.g., Ross, 1983) that the evaluation function V^π_γ is the unique function satisfying the following condition for each state x:\n\nV^π_γ(x) = R(x, π(x)) + γ Σ_y P_xy(π(x)) V^π_γ(y).  (2)\n\nDP methods for solving this system of equations (i.e., for determining V^π_γ) typically proceed through successive approximations. For dynamical systems with large state sets the solution requires considerable computation. For systems with continuous state spaces, DP methods require approximations of evaluation functions (and also of policies). In their simplest form, DP methods rely on lookup-table representations of these functions, based on discretizations of the state space in continuous cases, and are therefore exponential in the state space dimension. In fact, Richard Bellman, who introduced the term Dynamic Programming (Bellman, 1957), also coined the phrase \"curse of dimensionality\" to describe the difficulty of representing these functions for use in DP. Consequently, any advance in function approximation methods, whether due to theoretical insights or to the development of hardware having high speed and high capacity, can be used to great advantage in DP. Artificial neural networks therefore have natural applications in DP.\n\nBecause DP methods rely on complete prior knowledge of the decision problem, they are not learning methods. However, DP methods and reinforcement learning methods are closely related, and many concepts from DP are relevant to the case of incomplete prior knowledge. 
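The successive-approximation solution of Equation 2 can be sketched as follows. The two-state model (P, R) is a hypothetical illustration; the loop repeatedly applies the right-hand side of Equation 2 until the values stop changing.

```python
# Successive approximation to the evaluation function of a fixed policy:
# repeatedly apply V(x) <- R(x, pi(x)) + gamma * sum_y P_xy(pi(x)) V(y)
# until no state's value changes by more than a tolerance.
# The two-state system (P, R) is a hypothetical illustration.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.0}}

def evaluate_policy(policy, gamma, tol=1e-10):
    V = {x: 0.0 for x in P}
    while True:
        delta = 0.0
        for x in P:
            a = policy[x]
            new = R[x][a] + gamma * sum(p * V[y] for y, p in enumerate(P[x][a]))
            delta = max(delta, abs(new - V[x]))
            V[x] = new
        if delta < tol:
            return V

V = evaluate_policy({0: 1, 1: 0}, gamma=0.9)
```

Because the update is a contraction for γ < 1, the iteration converges to the unique fixed point of Equation 2, i.e., to V^π_γ for this model.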
Payoff values correspond to the available evaluation signals (the \"primary reinforcers\"), and the values of an evaluation function correspond to improved evaluation signals (the \"secondary reinforcers\") such as those produced by the ACE. In the simplest reinforcement learning systems, the role of the dynamical system model required by DP is played by the real system itself. A reinforcement learning system improves performance by interacting directly with the real system. A system model is not required.[1]\n\n[1] Although reinforcement learning methods can greatly benefit from such models (Sutton, to appear).\n\n4 THE TD ALGORITHM\n\nThe TD algorithm approximates V^π_γ for a given policy π in the absence of knowledge of the transition probabilities and the function determining expected payoff values. Assume that each system state is represented by a feature vector, and that V^π_γ can be approximated adequately as a function in a class of parameterized functions of the feature vectors, such as a class of functions parameterized by the connection weights of a neural network. Letting φ(x_t) denote the feature vector representing state x_t, let the estimated evaluation of x_t be\n\nV_t(x_t) = f(v_t, φ(x_t)),\n\nwhere v_t is the weight vector at step t and f depends on the class of models assumed. In terms of a neural network, φ(x_t) is the input vector at time t, and V_t(x_t) is the output at time t, assuming no delay across the network.\n\nIf we knew the true evaluations of the states, then we could define as an error the difference between the true evaluations and the estimated evaluations and adjust the weight vector v_t according to this error using supervised-learning methods. However, it is unrealistic to assume such knowledge in sequential decision tasks.
\n\nInstead the TD algorithm uses the following update rule to adjust the weight vector:\n\nv_{t+1} = v_t + α [ r_{t+1} + γ V_t(x_{t+1}) − V_t(x_t) ] ∇_{v_t} f(φ(x_t)).  (3)\n\nIn this equation, α is a positive step-size parameter, r_{t+1} is the payoff received at time step t + 1, V_t(x_{t+1}) is the estimated evaluation of the state at t + 1 using the weight vector v_t (i.e., V_t(x_{t+1}) = f(v_t, φ(x_{t+1}))),[2] and ∇_{v_t} f(φ(x_t)) is the gradient of f with respect to v_t evaluated at φ(x_t). If f is the inner product of v_t and φ(x_t), this gradient is just φ(x_t), as it is for a single linear ACE element. In the case of an appropriate feedforward network, this gradient can be computed by the error backpropagation method as illustrated by Anderson (1986). One can think of Equation 3 as the usual supervised-learning rule using r_{t+1} + γ V_t(x_{t+1}) as the \"target\" output in the error term.\n\n[2] Instead of using v_t to evaluate the state at t + 1, the learning rule used by the ACE of Barto et al. (1983) uses v_{t+1}. This closely approximates the algorithm described here if the weights change slowly.\n\nTo understand why the TD algorithm uses this target, assume that the DM is using a fixed policy for selecting actions. The output of the critic at time step t, V_t(x_t), is intended to be a prediction of the return that will accrue after time step t. Specifically, V_t(x_t) should be an estimate for the expected value of\n\nr_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ...,\n\nwhere r_{t+k} is the payoff received at time step t + k. One way to adjust the weights would be to wait forever and use the actual return as a target. More practically, one could wait n time steps and use what Watkins (1989) calls the n-step truncated return as a target:\n\nr_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... + γ^{n-1} r_{t+n}.\n\nHowever, it is possible to do better than this. One can use what Watkins calls the corrected n-step truncated return as a target:\n\nr_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... + γ^{n-1} r_{t+n} + γ^n V_t(x_{t+n}),\n\nwhere V_t(x_{t+n}) is the estimated evaluation of state x_{t+n} using the weight values at time t. Because V_t(x_{t+n}) is an estimate of the expected return from step t + n + 1 onwards, γ^n V_t(x_{t+n}) is an estimate for the missing terms in the n-step truncated return from state x_t. To see this, note that γ^n V_t(x_{t+n}) approximates\n\nγ^n [ r_{t+n+1} + γ r_{t+n+2} + γ^2 r_{t+n+3} + ... ].\n\nMultiplying through by γ^n, this equals\n\nγ^n r_{t+n+1} + γ^{n+1} r_{t+n+2} + ...,\n\nwhich is the part of the series missing from the n-step truncated return. The weight update rule for the TD algorithm (Equation 3) uses the corrected 1-step truncated return as a target, and using the corrected n-step truncated return for n > 1 produces obvious generalizations of this learning rule at the cost of requiring longer delay lines for implementation.\n\nThe above justification of the TD algorithm is based on the assumption that the critic's output V_t(x) is in fact a useful estimate of the expected return starting from any state x. Whether this estimate is good or bad, however, the expected value of the corrected n-step truncated return is always better (Watkins, 1989). Intuitively, this is true because the corrected n-step truncated return includes more data, namely the payoffs r_{t+k}, k = 1, ..., n. Surprisingly, as Sutton (1988) shows, the corrected truncated return is often a better estimate of the actual expected return than is the actual return itself.\n\nAnother way to explain the TD algorithm is to refer to the system of equations from DP (Equation 2), which the evaluation function for a given policy must satisfy. One can obtain an error based on how much the current estimated evaluation function, V_t, departs from the desired condition given by Equation 2 for the current state, x_t:\n\nR(x_t, a_t) + γ Σ_y P_{x_t y}(a_t) V_t(y) − V_t(x_t).\n\nBut the function R and the transition probabilities, P_{x_t y}(a_t), are not known. 
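The TD update of Equation 3, specialized to a linear evaluator (so the gradient is just φ(x_t)) and the corrected 1-step truncated return as target, can be sketched as follows. The two-state system, its one-hot feature vectors, and the policy are hypothetical illustrations, not from the paper.

```python
import random

# TD learning with a linear evaluator: f(v, phi) = v . phi, so the gradient
# of f with respect to v is just phi(x_t), and the update of Equation 3
# becomes v <- v + alpha * [r + gamma*V(y) - V(x)] * phi(x).
# The two-state system, features, and policy are hypothetical.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.0}}
PHI = {0: [1.0, 0.0], 1: [0.0, 1.0]}   # one-hot feature vectors

def td0(policy, gamma, alpha, steps, seed=0):
    rng = random.Random(seed)
    v = [0.0, 0.0]                      # weight vector v_t
    value = lambda s: sum(w * f for w, f in zip(v, PHI[s]))
    x = 0
    for _ in range(steps):
        a = policy[x]
        r = R[x][a]
        y = rng.choices([0, 1], weights=P[x][a])[0]
        # target is the corrected 1-step truncated return r + gamma * V_t(y)
        err = r + gamma * value(y) - value(x)
        for i, f in enumerate(PHI[x]):
            v[i] += alpha * err * f
        x = y
    return v

v = td0({0: 1, 1: 0}, gamma=0.9, alpha=0.01, steps=200000)
```

With one-hot features the weights are per-state value estimates, and for this model they should settle near the exact solution of Equation 2 (about 15.7 and 16.5 for the two states).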
Consequently, one substitutes r_{t+1}, the payoff actually received at step t + 1, for the expected value of this payoff, R(x_t, a_t), and substitutes the current estimated evaluation of the state actually reached in one step for the expectation of the estimated evaluations of states reachable in one step. That is, one uses V_t(x_{t+1}) in place of Σ_y P_{x_t y}(a_t) V_t(y). Using the resulting error in the usual supervised-learning rule yields the TD algorithm (Equation 3).\n\n5 USING THE TD ALGORITHM\n\nWe have described the TD algorithm above as a method for approximating the evaluation function associated with a fixed policy. However, if the fixed policy and the underlying dynamical system are viewed together as an autonomous dynamical system, i.e., a system without input, then the TD algorithm can be regarded purely as a prediction method, a view taken by Sutton (1988). The predicted quantity can be a discounted sum of any observable signal, not just payoff. For example, in speech recognition, the signal might give the identity of a word at the word's end, and the prediction would provide an anticipatory indication of the word's identity. Unlike other adaptive prediction methods, the TD algorithm does not require fixing a prediction time interval.\n\nMore relevant to the topic of this paper, the TD algorithm can be used as a component in methods for improving policies. The pole-balancing system of Barto et al. (1983; see also Sutton, 1984) provides one example in which the policy changes while the TD algorithm operates. The ASE of that system changes the policy by attempting to improve it according to the current estimated evaluation function. 
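Improving a policy according to an evaluation function can be sketched, in DP terms, as a greedy step. The sketch below assumes the model (P and R) is known, which the ASE does not require; the two-state system is a hypothetical illustration.

```python
# One step of DP-style policy improvement: make the policy greedy with
# respect to a given evaluation function V. This uses the model (P and R),
# unlike the ASE; the two-state system is a hypothetical illustration.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.0}}

def improve_policy(V, gamma):
    """Return the policy that is greedy with respect to V:
    pi'(x) = argmax_a [ R(x, a) + gamma * sum_y P_xy(a) V(y) ]."""
    return {x: max(P[x], key=lambda a: R[x][a]
                   + gamma * sum(p * V[y] for y, p in enumerate(P[x][a])))
            for x in P}

better = improve_policy({0: 15.67, 1: 16.46}, gamma=0.9)
```

If the greedy step leaves the policy unchanged, that policy already satisfies the optimality condition for this model; otherwise alternating evaluation and improvement is the classical policy iteration of DP.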
This approach is most closely related to the policy improvement algorithm of DP (e.g., see Bertsekas, 1976; Ross, 1983) and is one of several ways to use TD-like methods for improving policies; others are described by Watkins (1989) and Werbos (1987).\n\n6 CONCLUSION\n\nDecision making problems involving delayed consequences can be formulated as stochastic sequential decision problems and solved by DP if there is a complete and accurate model of the underlying dynamical system. Due to the computational cost of exact DP methods and their reliance on complete and exact models, there is a need for methods that can provide approximate solutions and that do not require this amount of prior knowledge. The TD algorithm is an incremental, on-line method for approximating the evaluation function associated with a given policy that does not require a system model. The TD algorithm directly adjusts a parameterized model of the evaluation function, a model that can take the form of an artificial neural network. The TD learning process is a Monte-Carlo approximation to a successive approximation method of DP. This perspective provides the necessary framework for extending the theory of TD algorithms as well as that of other algorithms used in reinforcement learning. Adaptive neural networks can play significant roles as modules for approximating the required functions.\n\nAcknowledgements\n\nA. G. Barto's contribution was supported by the Air Force Office of Scientific Research, Bolling AFB, through grants AFOSR-87-0030 and AFOSR-89-0526.\n\nReferences\n\nC. W. Anderson. (1986) Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, University of Massachusetts, Amherst, MA.\n\nA. G. Barto, R. S. Sutton, and C. W. Anderson. (1983) Neuronlike elements that can solve difficult learning control problems. 
IEEE Transactions on Systems, Man, and Cybernetics, 13:835-846.\n\nA. G. Barto, R. S. Sutton, and C. Watkins. (to appear) Learning and sequential decision making. In M. Gabriel and J. W. Moore, editors, Learning and Computational Neuroscience. The MIT Press, Cambridge, MA.\n\nR. E. Bellman. (1957) Dynamic Programming. Princeton University Press, Princeton, NJ.\n\nD. P. Bertsekas. (1976) Dynamic Programming and Stochastic Control. Academic Press, New York.\n\nG. E. Liepins, M. R. Hilliard, M. Palmer, and G. Rangarajan. (1989) Alternatives for classifier system credit assignment. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 756-761.\n\nS. Ross. (1983) Introduction to Stochastic Dynamic Programming. Academic Press, New York.\n\nA. L. Samuel. (1959) Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 210-229.\n\nR. S. Sutton. (1984) Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA.\n\nR. S. Sutton. (1988) Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44.\n\nR. S. Sutton. (to appear) First results with Dyna, an integrated architecture for learning, planning and reacting. Proceedings of the 1990 AAAI Symposium on Planning in Uncertain, Unpredictable, or Changing Environments.\n\nR. S. Sutton and A. G. Barto. (to appear) Time-derivative models of Pavlovian reinforcement. In M. Gabriel and J. W. Moore, editors, Learning and Computational Neuroscience. The MIT Press, Cambridge, MA.\n\nC. J. C. H. Watkins. (1989) Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.\n\nP. J. Werbos. (1977) Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 22:25-38.\n\nP. J. Werbos. 
(1987) Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17:7-20.\n\nI. H. Witten. (1977) An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34:286-295.\n", "award": [], "sourceid": 194, "authors": [{"given_name": "A.", "family_name": "Barto", "institution": null}, {"given_name": "R.", "family_name": "Sutton", "institution": null}, {"given_name": "C. J.", "family_name": "Watkins", "institution": null}]}