{"title": "Time Dependent Adaptive Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 710, "page_last": 718, "abstract": null, "full_text": "710 \n\nPineda \n\nTime Dependent Adaptive Neural Networks \n\nFernando J. Pineda \n\nCenter for Microelectronics Technology \n\nJet Propulsion Laboratory \n\nCalifornia Institute of Technology \n\nPasadena, CA 91109 \n\nABSTRACT \n\nA comparison of algorithms that minimize error functions to train the \ntrajectories of recurrent networks reveals how complexity is traded off for \ncausality. These algorithms are also related to time-independent \nformalisms. It is suggested that causal and scalable algorithms are \npossible when the activation dynamics of adaptive neurons are fast \ncompared to the behavior to be learned. Standard continuous-time \nrecurrent backpropagation is used in an example. \n\n1 INTRODUCTION \n\nTraining the time-dependent behavior of a neural network model involves the minimization \nof a function that measures the difference between an actual trajectory and a desired \ntrajectory. The standard method of accomplishing this minimization is to calculate the \ngradient of an error function with respect to the weights of the system and then to use the \ngradient in a minimization algorithm (e.g. gradient descent or conjugate gradient). \n\nTechniques for evaluating gradients and performing minimizations are well developed in the \nfields of optimal control and system identification, but are only now being introduced to the \nneural network community. Not all algorithms that are useful or efficient in control problems \nare realizable as physical neural networks. In particular, physical neural network algorithms \nmust satisfy locality, scaling and causality constraints. 
Locality is simply the constraint that \none should be able to update each connection using only presynaptic and postsynaptic \ninformation. There should be no need to use information from neurons or connections that \nare not in physical contact with a given connection. Scaling, for this paper, refers to the \nscaling law that governs the amount of computation or hardware that is required to perform \nthe weight updates. For neural networks, where the number of weights can become very \nlarge, the amount of hardware or computation required to calculate the gradient must scale \nlinearly with the number of weights. Otherwise, large networks are not possible. Finally, \nlearning algorithms must be causal since physical neural networks must evolve forwards in \ntime. Many algorithms for learning time-dependent behavior, although they are seductively \nelegant and computationally efficient, cannot be implemented as physical systems because \nthe gradient evaluation requires time evolution in two directions. In this paper networks that \nviolate the causality constraint will be referred to as unphysical. \n\nIt is useful to understand how scalability and causality trade off in various gradient evaluation \nalgorithms. In the next section three related gradient evaluation algorithms are derived and \ntheir scaling and causality properties are compared. The three algorithms demonstrate a \nnatural progression from a causal algorithm that scales poorly to an acausal algorithm that \nscales linearly. \n\nThe difficulties that these exact algorithms exhibit appear to be inescapable. This suggests \nthat approximation schemes that do not calculate exact gradients or that exploit special \nproperties of the tasks to be learned may lead to physically realizable neural networks. 
The \nfinal section of this paper suggests an approach that could be exploited in systems where the \ntime scale of the to-be-learned task is much slower than the relaxation time scale of the \nadaptive neurons. \n\n2 ANALYSIS OF ALGORITHMS \n\nWe will begin by reviewing the learning algorithms that apply to time-dependent recurrent \nnetworks. The control literature generally derives these algorithms by taking a variational \napproach (e.g. Bryson and Ho, 1975). Here we will take a somewhat unconventional \napproach and restrict ourselves to the domain of differential equations and their solutions. To \nbegin with, let us take a concrete example. Consider the neural system given by the equation \n\n$$\\frac{dx_i}{dt} = -x_i + \\sum_{j=1}^{n} w_{ij} f(x_j) + I_i \\tag{1}$$ \n\nwhere $f(\\cdot)$ is a sigmoid-shaped function (e.g. $\\tanh(\\cdot)$) and $I_i$ is an external input. This system \nis a well studied neural model (e.g. Aplevich, 1968; Cowan, 1967; Hopfield, 1984; Malsburg, \n1973; Sejnowski, 1977). The goal is to find the weight matrix $w$ that causes the states $x(t)$ \nof the output units to follow a specified trajectory $\\xi(t)$. The actual trajectory depends not \nonly on the weight matrix but also on the external input vector $I$. To find the weights one \nminimizes a measure of the difference between the actual trajectory $x(t)$ and the desired \ntrajectory $\\xi(t)$. This measure is a functional of the trajectories and a function of the weights. \nIt is given by \n\n$$E(w, t_0, t_f) = \\frac{1}{2} \\sum_{i \\in O} \\int_{t_0}^{t_f} dt \\, (x_i(t) - \\xi_i(t))^2 \\tag{2}$$ \n\nwhere $O$ is the set of output units. We shall, only for the purpose of algorithm comparison, 
make the following assumptions: (1) that the networks are fully connected, (2) that the \ninterval $[t_0, t_f]$ is divided into $q$ segments with numerical integrations performed using the \nEuler method and (3) that all the operations are performed with the same precision. This will \nallow us to easily estimate the amount of computation and memory required for each \nalgorithm relative to the others. \n\n2.1 ALGORITHM A \n\nIf the objective function $E$ is differentiated with respect to $w_{rs}$ one obtains \n\n$$\\frac{\\partial E}{\\partial w_{rs}} = -\\sum_{i=1}^{n} \\int_{t_0}^{t_f} dt \\, J_i(t) \\, p_{irs}(t) \\tag{3a}$$ \n\nwhere \n\n$$J_i(t) = \\begin{cases} \\xi_i(t) - x_i(t) & \\text{if } i \\in O \\\\ 0 & \\text{if } i \\notin O \\end{cases} \\tag{3b}$$ \n\nand where \n\n$$p_{irs} = \\frac{\\partial x_i}{\\partial w_{rs}} \\tag{3c}$$ \n\nTo evaluate $p_{irs}$, differentiate equation (1) with respect to $w_{rs}$ and observe that the time \nderivative and the partial derivative with respect to $w_{rs}$ commute. The resulting equation is \n\n$$\\frac{dp_{irs}}{dt} = \\sum_{j=1}^{n} L_{ij}(x_j) \\, p_{jrs} + s_{irs} \\tag{4a}$$ \n\nwhere \n\n$$L_{ij}(x_j) = -\\delta_{ij} + w_{ij} f'(x_j) \\tag{4b}$$ \n\nand where \n\n$$s_{irs} = \\delta_{ir} f(x_s) \\tag{4c}$$ \n\nThe initial condition for eqn. (4a) is $p(t_0) = 0$. Equations (1), (3) and (4) can be used to \ncalculate the gradient for a learning rule. This is the approach taken by Williams and Zipser \n(1989) and also discussed by Pearlmutter (1988). Williams and Zipser further observe that \none can use the instantaneous value of $p(t)$ and $J(t)$ to update the weights continually \nprovided the weights change slowly. The computationally intensive part of this algorithm \noccurs in the integration of equation (4a). There are $n^3$ components to $p$, hence there are $n^3$ \nequations. 
Accordingly the amount of hardware or memory required to perform the \ncalculation will scale like $n^3$. Each of these equations requires a summation over all the \nneurons, hence the amount of computation (measured in multiply-accumulates) goes like $n^4$ \nper time step, and there are $q$ time steps, hence the total number of multiply-accumulates \nscales like $n^4 q$. Clearly, the scaling properties of this approach are very poor and it cannot \nbe practically applied to very large networks. \n\n2.2 ALGORITHM B \n\nRather than numerically integrate the system of equations (4a) to obtain $p(t)$, suppose we \nwrite down the formal solution. This solution is \n\n$$p_{irs}(t) = \\sum_{j=1}^{n} K_{ij}(t, t_0) \\, p_{jrs}(t_0) + \\sum_{j=1}^{n} \\int_{t_0}^{t} d\\tau \\, K_{ij}(t, \\tau) \\, s_{jrs}(\\tau) \\tag{5a}$$ \n\nThe matrix $K$ is defined by the expression \n\n$$K(t_2, t_1) = \\exp\\left( \\int_{t_1}^{t_2} d\\tau \\, L(x(\\tau)) \\right) \\tag{5b}$$ \n\nThis matrix is known as the propagator or transition matrix. The expression for $p_{irs}$ consists \nof a homogeneous solution and a particular solution. The choice of initial condition $p_{irs}(t_0) = 0$ \nleaves only the particular solution. If the particular solution is substituted back into eqn. \n(3a), one eventually obtains the following expression for the gradient \n\n$$\\frac{\\partial E}{\\partial w_{rs}} = -\\sum_{i=1}^{n} \\int_{t_0}^{t_f} dt \\int_{t_0}^{t} d\\tau \\, J_i(t) \\, K_{ir}(t, \\tau) \\, f(x_s(\\tau)) \\tag{6}$$ \n\nTo obtain this expression one must observe that $s_{jrs}$ can be expressed in terms of $x_s$, i.e. use \neqn. (4c). This allows the summation over $j$ to be performed trivially, thus resulting in \neqn. (6). The familiar outer product form of backpropagation is not yet manifest in this \nexpression. To uncover it, change the order of the integrations. This requires some care \nbecause the limits of the integration are not the same. 
The result is \n\n$$\\frac{\\partial E}{\\partial w_{rs}} = -\\sum_{i=1}^{n} \\int_{t_0}^{t_f} d\\tau \\int_{\\tau}^{t_f} dt \\, J_i(t) \\, K_{ir}(t, \\tau) \\, f(x_s(\\tau)) \\tag{7}$$ \n\nInspection of this expression reveals that neither the summation over $i$ nor the integration over \n$t$ includes $x_s(\\tau)$, thus it is useful to factor it out. Consequently equation (7) takes on the \nfamiliar outer product form of backpropagation \n\n$$\\frac{\\partial E}{\\partial w_{rs}} = -\\int_{t_0}^{t_f} dt \\, y_r(t) \\, f(x_s(t)) \\tag{8}$$ \n\nwhere $y_r(\\tau)$ is defined to be \n\n$$y_r(\\tau) = \\sum_{i=1}^{n} \\int_{\\tau}^{t_f} dt \\, J_i(t) \\, K_{ir}(t, \\tau) \\tag{9}$$ \n\nEquation (8) defines an expression for the gradient, provided we can calculate $y_r(t)$ from eqn. \n(9). In principle, this can be done since the propagator $K$ and the vector $J$ are both completely \ndetermined by $x(t)$. The computationally intensive part of this algorithm is the calculation \nof $K(t, \\tau)$ for all values of $t$ and $\\tau$. The calculation requires the integration of equations of the \nform \n\n$$\\frac{dK_{ij}(t, \\tau)}{dt} = \\sum_{k=1}^{n} L_{ik}(x(t)) \\, K_{kj}(t, \\tau) \\tag{10}$$ \n\nfor $q$ different values of $\\tau$. There are $n^2$ different equations to integrate for each value of $\\tau$. \nConsequently there are $n^2 q$ integrations to be performed where the interval from $t_0$ to $t_f$ is \ndivided into $q$ intervals. The calculation of all the components of $K(t, \\tau)$, from $t_0$ to $t_f$, scales \nlike $n^3 q^2$, since each integration requires $n$ multiply-accumulates per time step and there are \n$q$ time steps. Similarly, the memory requirements scale like $n^2 q^2$. This is because $K$ has $n^2$ \ncomponents for each $(t, \\tau)$ pair and there are $q^2$ such pairs. \n\nEquation (10) must be integrated forwards in time from $t = \\tau$ to $t = t_f$ and backwards in time \nfrom $t = \\tau$ to $t = t_0$. This is because $K$ must satisfy $K(\\tau, \\tau) = 1$ (the identity matrix) for all \n$\\tau$. This condition follows from the definition of $K$, eqn. (5b). 
Finally, we observe that \nexpression (9) is the time-dependent analog of the expression used by Rohwer and Forrest \n(1987) to calculate the gradient in recurrent networks. The analogy can be made somewhat \nmore explicit by writing $K(t, \\tau)$ as the inverse $K^{-1}(\\tau, t)$. Thus we see that $y(t)$ can be expressed \nin terms of a matrix inverse just as in the Rohwer and Forrest algorithm. \n\n2.3 ALGORITHM C \n\nThe final algorithm is familiar from continuous time optimal control and identification. The \nalgorithm is usually derived by performing a variation on the functional given by eqn. (2). \nThis results in a two-point boundary value problem. On the other hand, we know that $y$ is \ngiven by eqn. (9). So we simply observe that this is the particular solution of the differential \nequation \n\n$$-\\frac{dy}{dt} = L^T(x(t)) \\, y + J \\tag{11}$$ \n\nwhere $L^T$ is the transpose of the matrix defined in eqn. (4b). To see this simply substitute \nthe form for $y$ into eqn. (11) and verify that it is indeed the solution to the equation. \n\nThe homogeneous solution to eqn. (11) vanishes only if $y(t_f) = 0$. In other words: to obtain $y(t)$ \nwe need only integrate eqn. (11) backwards from the final condition $y(t_f) = 0$. This is just \nthe algorithm introduced to the neural network community by Pearlmutter (1988). This also \ncorresponds to the unfolding in time approach discussed by Rumelhart et al. (1986), provided \nthat all the equations are discretized and one takes $\\Delta t = 1$. \n\nThe two-point boundary value problem is rather straightforward to solve because the \nequation for $x(t)$ is independent of $y(t)$. Each component of $x(t)$ and $y(t)$ can be obtained with $n$ \nmultiply-accumulates per time step. There are $q$ time steps from $t_0$ to $t_f$ and both $x(t)$ and $y(t)$ have $n$ \ncomponents, hence the calculation of $x(t)$ and $y(t)$ scales like $n^2 q$. The weight update \nequation also requires $n^2 q$ multiply-accumulates. 
Thus the computational requirements of \nthe algorithm as a whole scale like $n^2 q$. The memory required also scales like $n^2 q$, since it \nis necessary to save each value of $x(t)$ along the trajectory to compute $y(t)$. \n\n2.4 SCALING VS CAUSALITY \n\nThe results of the previous sections are summarized in Table 1 below. We see that we have \na progression of tradeoffs between scaling and causality. That is, we must choose between \na causal algorithm with exploding computational and storage requirements and an acausal \nalgorithm with modest storage requirements. For algorithm A there is no $q$ dependence in the memory \nrequirements because the integral given in eqn. (3a) can be accumulated at each time step. \nAlgorithm B has some of the worst features of both algorithms. \n\nTable 1: Comparison of three algorithms \n\nAlgorithm | Memory | Multiply-accumulates | Direction of integrations \nA | $n^3$ | $n^4 q$ | x and p are both forward in time \nB | $n^2 q^2$ | $n^3 q^2$ | x is forward, K is forward and backward \nC | $n^2 q$ | $n^2 q$ | x is forward, y is backward in time \n\nDigital hardware has no difficulties (at least over finite time intervals) with acausal \nalgorithms provided a stack is available to act as a memory that can recall states in reverse \norder. To the extent that the gradient calculations are carried out on digital machines, it makes \nsense to use algorithm C because it is the most efficient. In analog VLSI, however, it is \ndifficult to imagine how to build a continually running network that uses an acausal \nalgorithm. Algorithm A is attractive for physical implementation because it could be run \ncontinually and in real time (Williams and Zipser, 1989). However, its scaling properties \npreclude the possibility of building very large networks based on the algorithm. 
Recently, \nZipser (1990) has suggested that a divide and conquer approach may reduce the \ncomputational and spatial complexity of the algorithm. This approach, although promising, \ndoes not always work and there is as yet no convergence proof. How then is it possible to \nlearn trajectories using local, scalable and causal algorithms? In the next section a possible \navenue of attack is suggested. \n\n3 EXPLOITING DISPARATE TIME SCALES \n\nI assert that for some classes of problems there are scalable and causal algorithms that \napproximate the gradient and that these algorithms can be found by exploiting the disparity \nin time scales found in these classes of problems. In particular, I assert that when the time \nscale of the adaptive units is fast compared to the time scale of the behavior to be learned, it \nis possible to find scalable and causal adaptive algorithms. A general formalism for doing \nthis will not be presented here; instead a simple, perhaps artificial, example will be presented. \nThis example minimizes an error function for a time-dependent problem. \n\nIt is likely that trajectory generation in motor control problems is of this type. The \ncharacteristic time scales of the trajectories that need to be generated are determined by \ninertia and friction. These mechanical time scales are considerably longer than the electronic \ntime scales that occur in VLSI. Thus it seems that for robotic problems, there may be no need \nto use the completely general algorithms discussed in section 2. Instead, algorithms that take \nadvantage of the disparity between the mechanical and the electronic time scales are likely \nto be more useful for learning to generate trajectories. \n\nThe task is to map from a periodic input $I(t)$ to a periodic output $\\xi(t)$. 
The basic idea is to use \nthe continuous-time recurrent-backpropagation approach with slowly varying time-\ndependent inputs rather than with static inputs. The learning is done in real time and in a \ncontinuous fashion. Consider a set of $n$ \"fast\" neurons ($i = 1, \\ldots, n$) each of which satisfies the \nadditive activation dynamics determined by eqn. (1). Assume that the initial weights are \nsufficiently small that the dynamics of the network would be convergent if the inputs $I$ were \nconstant. The external input is applied to the network through the vector $I$. It has \nbeen previously shown (Pineda, 1988) that the ij-th component of the gradient of $E$ is equal \nto $y_i^f f(x_j^f)$ where $x_j^f$ is the steady state solution of eqn. (1) and where $y_i^f$ is a component of \nthe steady state solution of \n\n$$\\frac{dy}{dt} = L^T(x^f) \\, y + J \\tag{12}$$ \n\nwhere the components of $L^T$ are given by eqn. (4b). Note that the relative sign between \nequations (11) and (12) is what enables this algorithm to be causal. Now suppose that instead \nof a fixed input vector $I$, we use a slowly varying input $I(t/\\tau)$ where $\\tau$ is the characteristic \ntime scale over which the input changes significantly. Suppose we take as the gradient descent \nalgorithm the dynamics defined by \n\n$$\\tau_w \\frac{dw_{rs}}{dt} = y_r(t) \\, f(x_s(t)) \\tag{13}$$ \n\nwhere $\\tau_w$ is the time constant that defines the (slow) time scale over which $w$ changes and \nwhere $x_s$ is the instantaneous solution of eqn. (1) and $y_r$ is the instantaneous solution of \neqn. (12). Then in the adiabatic limit the Cartesian product $y f(x)$ in eqn. (13) approximates \nthe negative gradient of the objective function $E$, that is \n\n$$y_r(t) \\, f(x_s(t)) \\approx -\\frac{\\partial E}{\\partial w_{rs}} \\tag{14}$$ \n\nThis approach can map one continuous trajectory into another continuous trajectory, \nprovided the trajectories change slowly enough. Furthermore, learning occurs causally and \nscalably. There is no memory in the model, i.e. 
the output of the adaptive neurons depends \nonly on their input and not on their internal state. Thus, this network can never learn to \nperform tasks that require memory unless the learning algorithm is modified to learn the \nappropriate transitions. This is the major drawback of the adiabatic approach. Some state \ninformation can be incorporated into this model by using recurrent connections, in which \ncase the network can have multiple basins and the final state will depend on the initial state \nof the net as well as on the inputs, but this will not be pursued here. \n\nSimple simulations were performed to verify that the approach did indeed perform gradient \ndescent. One simulation is presented here for the benefit of investigators who may wish to \nverify the results. A feedforward network topology consisting of two input units, five hidden \nunits and two output units was used for the adaptive network. Units were numbered \nsequentially, 1 through 9, beginning with the input layer and ending in the output layer. Time-\ndependent external inputs for the two input neurons were generated with time dependence \n$I_1 = \\sin(2\\pi t)$ and $I_2 = \\cos(2\\pi t)$. The targets for the output neurons were $\\xi_8 = R \\sin(2\\pi t)$ and \n$\\xi_9 = R \\cos(2\\pi t)$ where $R = 1.0 + 0.1 \\sin(6\\pi t)$. All the equations were simultaneously integrated \nusing 4th order Runge-Kutta with a time step of 0.1. A relaxation time scale was introduced \ninto the forward and backward propagation equations by multiplying the time derivatives in \neqns. (1) and (12) by $\\tau_x$ and $\\tau_y$ respectively. These time scales were set to $\\tau_x = \\tau_y = 0.5$. The \nadaptive time scale of the weights was $\\tau_w = 1.0$. The error in the network was initially $E = 10$ \nand the integration was cut off when the error reached a plateau at $E = 0.12$. The learning \ncurve is shown in Fig. 1. The trained trajectory did not exactly reach the desired solution. 
In \nparticular the network did not learn the odd order harmonic that modulates $R$. By way of \ncomparison, a conventional backpropagation approach that calculated a cumulative gradient \nover the trajectory and used conjugate gradient for the descent was able to converge to the \nglobal minimum. \n\nFigure 1: Learning curve. One time unit corresponds to a single oscillation. (Horizontal axis: Time.) \n\n4 SUMMARY \n\nThe key points of this paper are: 1) Exact minimization algorithms for learning time-\ndependent behavior either scale poorly or else violate causality and 2) Approximate gradient \ncalculations will likely lead to causal and scalable learning algorithms. The adiabatic \napproach should be useful for learning to generate trajectories of the kind encountered when \nlearning motor skills. \n\nReference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, \nor otherwise, does not constitute or imply any endorsement by the United States Government or the Jet Propulsion \nLaboratory, California Institute of Technology. The work described in this paper was carried out at the \nCenter for Space Microelectronics Technology, Jet Propulsion Laboratory, California Institute of \nTechnology. Support for the work came from the Air Force Office of Scientific Research through an \nagreement with the National Aeronautics and Space Administration (AFOSR-ISSA-90-0027). \n\nREFERENCES \n\nAplevich, J. D. (1968). Models of certain nonlinear systems. In E. R. Caianiello (Ed.), Neural \nNetworks, (pp. 110-115), Berlin: Springer Verlag. \n\nBryson, A. E. and Ho, Y. (1975). 
Applied Optimal Control: Optimization, Estimation, and \nControl. New York: Hemisphere Publishing Co. \n\nCowan, J. D. (1967). A mathematical theory of central nervous activity. Unpublished \ndissertation, Imperial College, University of London. \n\nHopfield, J. J. (1984). Neurons with graded response have collective computational \nproperties like those of two-state neurons. Proc. Nat. Acad. Sci. USA, Bio., 81, 3088-3092. \n\nMalsburg, C. van der (1973). Self-organization of orientation sensitive cells in striate cortex. \nKybernetik, 14, 85-100. \n\nPearlmutter, B. A. (1988). Learning state space trajectories in recurrent neural networks: A \npreliminary report, (Tech. Rep. AIP-54), Department of Computer Science, Carnegie Mellon \nUniversity, Pittsburgh, PA. \n\nPineda, F. J. (1988). Dynamics and Architecture for Neural Computation. Journal of \nComplexity, 4, (pp. 216-245). \n\nRohwer, R. and Forrest, B. (1987). Training time dependence in neural networks. In M. \nCaudill and C. Butler (Eds.), Proceedings of the IEEE First Annual International Conference \non Neural Networks, (pp. 701-708). San Diego, California: IEEE. \n\nRumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Internal \nRepresentations by Error Propagation. In D. E. Rumelhart and J. L. McClelland (Eds.), \nParallel Distributed Processing, (pp. 318-362). Cambridge: M.I.T. Press. \n\nSejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. Journal \nof Mathematical Biology, 4, 303-321. \n\nWilliams, R. J. and Zipser, D. (1989). A learning algorithm for continually running \nfully recurrent neural networks. Neural Computation, 1, (pp. 270-280). \n\nZipser, D. (1990). Subgrouping reduces complexity and speeds up learning in recurrent \nnetworks, (this volume). \n\n", "award": [], "sourceid": 212, "authors": [{"given_name": "Fernando", "family_name": "Pineda", "institution": null}]}