Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Richard S. Sutton, Brian Tanner*

We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single pre- diction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predic- tions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possi- ble with conventional TD methods. Secondly, we show that if the inter- predictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (super- vised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these net- works. Overall we argue that TD networks represent a substantial ex- tension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.

Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this set- ting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learn- ing is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as "learning a guess from a guess". In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first.

To be more precise, we must make a distinction between the extensive definition of a predic- tion, expressing its desired relationship to measurable data, and its TD definition, express- ing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future re- wards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It's the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics.

1 The Learning-to-predict Problem

The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation ot O, and the agent takes an action at A. Whereas A is an arbitrary discrete set, we assume without loss of gener- ality that ot can be represented as a vector of bits. The action and observation events occur in sequence, o1, a1, o2, a2, o3 , with each event of course dependent only on those pre- ceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state:

```
1 0 0 0 0 0 1 1 2 3 4 5 6 7
```

The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends determinis- tically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, first unconditionally and then conditioned on action sequences.

Do not remove: This comment is monitored to verify that the site is working properly