Part of Advances in Neural Information Processing Systems 7 (NIPS 1994)

*Mance E. Harmon, Leemon Baird, A. Harry Klopf*

An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual gradient form of advantage updating. The game is a Markov Decision Process (MDP) with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. The reinforcement learning algorithm for optimal control is modified for differential games in order to find the minimax point, rather than the maximum. Simulation results are compared to the optimal solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual gradient and non-residual gradient forms of advantage updating and Q-learning are compared. The results show that advantage updating converges faster than Q-learning in all simulations. The results also show advantage updating converges regardless of the time step duration; Q-learning is unable to converge as the time step duration ~rows small.

U.S .A.F. Academy, 2354 Fairchild Dr. Suite 6K4l, USAFA, CO 80840-6234

354

Mance E. Hannon, Leemon C. Baird ll/, A. Harry Klopf

1 ADVANTAGE UPDATING

The advantage updating algorithm (Baird, 1993) is a reinforcement learning algorithm in which two types of information are stored. For each state x, the value V(x) is stored, representing an estimate of the total discounted return expected when starting in state x and performing optimal actions. For each state x and action u, the advantage, A(x,u), is stored, representing an estimate of the degree to which the expected total discounted reinforcement is increased by performing action u rather than the action currently considered best. The optimal value function V* (x) represents the true value of each state. The optimal advantage function A * (x,u) will be zero if u is the optimal action (because u confers no advantage relative to itself) and A * (x,u) will be negative for any suboptimal u (because a suboptimal action has a negative advantage relative to the best action). The optimal advantage function A * can be defined in terms of the optimal value function v*:

A*(x,u) = ~[RN(X,U)- V*(x)+ rNV*(x')]

Do not remove: This comment is monitored to verify that the site is working properly