{"title": "Adjoint Operator Algorithms for Faster Learning in Dynamical Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 498, "page_last": 508, "abstract": null, "full_text": "498\n\nBarhen, Toomarian and Gulati\n\nAdjoint Operator Algorithms for Faster Learning in Dynamical Neural Networks\n\nJacob Barhen\n\nNikzad Toomarian\n\nSandeep Gulati\n\nCenter for Space Microelectronics Technology\n\nJet Propulsion Laboratory\n\nCalifornia Institute of Technology\n\nPasadena, CA 91109\n\nABSTRACT\n\nA methodology for faster supervised learning in dynamical nonlinear neural networks is presented. It exploits the concept of adjoint operators to enable computation of changes in the network's response due to perturbations in all system parameters, using the solution of a single set of appropriately constructed linear equations. The lower bound on speedup per learning iteration over conventional methods for calculating the neuromorphic energy gradient is $O(N^2)$, where N is the number of neurons in the network.\n\n1 INTRODUCTION\n\nThe biggest promise of artificial neural networks as computational tools lies in the hope that they will enable fast processing and synthesis of complex information patterns. In particular, considerable efforts have recently been devoted to the formulation of efficient methodologies for learning (e.g., Rumelhart et al., 1986; Pineda, 1988; Pearlmutter, 1989; Williams and Zipser, 1989; Barhen, Gulati and Zak, 1989). The development of learning algorithms is generally based upon the minimization of a neuromorphic energy function. 
The fundamental requirement of such an approach is the computation of the gradient of this objective function with respect to the various parameters of the neural architecture, e.g., synaptic weights, neural gains, etc. The paramount contribution to the often excessive cost of learning using dynamical neural networks arises from the necessity to solve, at each learning iteration, one set of equations for each parameter of the neural system, since those parameters affect the network's energy both directly and indirectly.\n\nIn this paper we show that the concept of adjoint operators, when applied to dynamical neural networks, not only yields a considerable algorithmic speedup, but also puts on a firm mathematical basis prior results for \"recurrent\" networks, the derivations of which sometimes involved much heuristic reasoning. We have already used adjoint operators in some of our earlier work in the fields of energy-economy modeling (Alsmiller and Barhen, 1984) and nuclear reactor thermal hydraulics (Barhen et al., 1982; Toomarian et al., 1987) at the Oak Ridge National Laboratory, where the concept flourished during the past decade (Oblow, 1977; Cacuci et al., 1980).\n\nIn the sequel we first motivate and construct, in the most elementary fashion, a computational framework based on adjoint operators. We then apply our results to the Cohen-Grossberg-Hopfield (CGH) additive model, enhanced with terminal attractor (Barhen, Gulati and Zak, 1989) capabilities. We conclude by presenting the results of a few typical simulations. 
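As a rough illustration of the cost argument above (one linear solve per parameter versus a single adjoint solve), the following sketch compares the two routes on a toy three-neuron tanh network. Everything in it is an assumption made for illustration: the toy steady-state equation u = tanh(T u + b), the response R = 1/2 sum (u_n - a_n)^2, and the helper names (solve, steady_state, response, gradients) are hypothetical and are not the network model developed in this paper.

```python
import math


def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting (small dense systems).
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steady_state(T, b, iters=300):
    # Fixed point of u = tanh(T u + b); the iteration converges for small weights.
    u = [0.0] * len(b)
    for _ in range(iters):
        u = [math.tanh(dot(T[i], u) + b[i]) for i in range(len(b))]
    return u


def response(T, b, a):
    # R = 1/2 * sum_n (u_n - a_n)^2 evaluated at the steady state.
    u = steady_state(T, b)
    return 0.5 * sum((ui - ai) ** 2 for ui, ai in zip(u, a))


def gradients(T, b, a):
    n = len(b)
    u = steady_state(T, b)
    gp = [1.0 - ui * ui for ui in u]  # derivative of tanh at the fixed point
    # State equation phi_n = tanh(sum_m T_nm u_m + b_n) - u_n, so A = d(phi)/du.
    A = [[gp[i] * T[i][j] - (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    dR_du = [ui - ai for ui, ai in zip(u, a)]
    # Source vectors s = -d(phi)/dp, one per parameter: T_ij (row-major), then b_i.
    sources = []
    for i in range(n):
        for j in range(n):
            s = [0.0] * n
            s[i] = -gp[i] * u[j]
            sources.append(s)
    for i in range(n):
        s = [0.0] * n
        s[i] = -gp[i]
        sources.append(s)
    # Forward route: solve A q = s once per parameter (M solves in total).
    forward = [dot(dR_du, solve(A, s)) for s in sources]
    # Adjoint route: a single solve with the transposed matrix, A^T v = dR/du.
    At = [[A[j][i] for j in range(n)] for i in range(n)]
    v = solve(At, dR_du)
    adjoint = [dot(v, s) for s in sources]
    return forward, adjoint
```

With N = 3 neurons there are M = 12 parameters, so the forward route performs twelve linear solves while the adjoint route performs one; the two gradient lists should agree to round-off.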
\n\n2 ADJOINT OPERATORS\n\nConsider, for the sake of simplicity, that a problem of interest is represented by the following system of N coupled nonlinear equations\n\n$$\\varphi(u, p) = 0 \\tag{2.1}$$\n\nwhere $\\varphi$ denotes a nonlinear operator¹. Let u and p represent the N-vector of dependent state variables and the M-vector of system parameters, respectively. We will assume that generally $M \\gg N$ and that the elements of p are, in principle, independent. Furthermore, we will also assume that, for a specific choice of parameters, a unique solution of Eq. (2.1) exists. Hence, u is an implicit function of p. A system \"response\", R, represents any result of the calculations that is of interest. Specifically\n\n$$R = R(u, p) \\tag{2.2}$$\n\ni.e., R is a known nonlinear function of p and u and may be calculated from Eq. (2.2) when the solution u of Eq. (2.1) has been obtained for a given p. The problem of interest is to compute the \"sensitivities\" of R, i.e., the derivatives of R with respect to the parameters $p_\\mu$, $\\mu = 1, \\ldots, M$. By definition\n\n$$\\frac{dR}{dp_\\mu} = \\frac{\\partial R}{\\partial p_\\mu} + \\frac{\\partial R}{\\partial u} \\cdot \\frac{\\partial u}{\\partial p_\\mu} \\tag{2.3}$$\n\n¹ If differential operators appear in Eq. (2.1), then a corresponding set of boundary and/or initial conditions to specify the domain of $\\varphi$ must also be provided. In general an inhomogeneous \"source\" term can also be present. The learning model discussed in this paper focuses on the adiabatic approximation only. Nonadiabatic learning algorithms, wherein the response is defined as a functional, will be discussed in a forthcoming article.\n\nSince the response R is known analytically, the computation of $\\partial R/\\partial p_\\mu$ and $\\partial R/\\partial u$ is straightforward. 
The quantity that needs to be determined is the vector $\\partial u/\\partial p_\\mu$. Differentiating the state equations (2.1), we obtain a set of equations to be referred to as \"forward\" sensitivity equations\n\n$$\\frac{\\partial \\varphi}{\\partial u} \\cdot \\frac{\\partial u}{\\partial p_\\mu} = -\\frac{\\partial \\varphi}{\\partial p_\\mu} \\tag{2.4}$$\n\nTo simplify the notation, we omit the \"transposed\" sign and denote the N by N forward sensitivity matrix $\\partial \\varphi/\\partial u$ by A, the N-vector $\\partial u/\\partial p_\\mu$ by ${}^{\\mu}q$ and the \"source\" N-vector $-\\partial \\varphi/\\partial p_\\mu$ by ${}^{\\mu}s$. Thus\n\n$$A\\,{}^{\\mu}q = {}^{\\mu}s \\tag{2.5}$$\n\nSince the source term in Eq. (2.5) explicitly depends on $\\mu$, computing $dR/dp_\\mu$ requires solving the above system of N algebraic equations for each parameter $p_\\mu$. This difficulty is circumvented by introducing adjoint operators. Let $A^{*}$ denote the formal adjoint² of the operator A. The adjoint sensitivity equations can then be expressed as\n\n$$A^{*}\\,{}^{\\mu}q^{*} = {}^{\\mu}s^{*} \\tag{2.6}$$\n\nBy definition, for algebraic operators\n\n$${}^{\\mu}q^{*} \\cdot (A\\,{}^{\\mu}q) = {}^{\\mu}q^{*} \\cdot {}^{\\mu}s = {}^{\\mu}q \\cdot (A^{*}\\,{}^{\\mu}q^{*}) = {}^{\\mu}q \\cdot {}^{\\mu}s^{*} \\tag{2.7}$$\n\nSince Eq. (2.3) can be rewritten as\n\n$$\\frac{dR}{dp_\\mu} = \\frac{\\partial R}{\\partial p_\\mu} + \\frac{\\partial R}{\\partial u} \\cdot {}^{\\mu}q \\tag{2.8}$$\n\nif we identify\n\n$$\\frac{\\partial R}{\\partial u} \\equiv {}^{\\mu}s^{*} \\tag{2.9}$$\n\nwe observe that the source term for the adjoint equations is independent of the specific parameter $p_\\mu$. Hence, the solution of a single set of adjoint equations will provide all the information required to compute the gradient of R with respect to all parameters. To underscore that fact we shall denote ${}^{\\mu}q^{*}$ as v. Thus\n\n$$\\frac{dR}{dp_\\mu} = \\frac{\\partial R}{\\partial p_\\mu} + {}^{\\mu}s \\cdot v \\tag{2.10}$$\n\nWe will now apply this computational framework to a CGH network enhanced with terminal attractor dynamics. The model developed in the sequel differs from our earlier formulations (Barhen, Gulati and Zak, 1989; Barhen, Zak and Gulati, 1989) in avoiding the use of constraints in the neuromorphic energy function, thereby eliminating the need for differential equations to evolve the concomitant Lagrange multipliers. Also, the usual activation dynamics is transformed into a set of equivalent equations which exhibit more \"congenial\" numerical properties, such as \"contraction\".\n\n² Adjoint operators can only be considered for densely defined linear operators on Banach spaces (see e.g., Cacuci, 1980). For the neural application under consideration we will limit ourselves to real Hilbert spaces. Such spaces are self-dual. Furthermore, the domain of an adjoint operator is determined by selecting appropriate adjoint boundary conditions¹. The associated bilinear form evaluated on the domain boundary must thus also generally be included.\n\n3 APPLICATIONS TO NEURAL LEARNING\n\nWe formalize a neural network as an adaptive dynamical system whose temporal evolution is governed by the following set of coupled nonlinear differential equations\n\n$$\\dot{z}_n + \\kappa_n z_n = \\sum_m W_{nm} T_{nm}\\, g_\\gamma(z_m) + {}^{k}I_n \\tag{3.1}$$\n\nwhere $z_n$ represents the mean soma potential of the n-th neuron and $T_{nm}$ denotes the synaptic coupling from the m-th to the n-th neuron. The weighting factor $W_{nm}$ enforces topological considerations. The constant $\\kappa_n$ characterizes the decay of neuron activity. The sigmoidal function $g_\\gamma(\\cdot)$ modulates the neural response, with gain given by $\\gamma_m$; typically, $g_\\gamma(z) = \\tanh(\\gamma z)$. The \"source\" term ${}^{k}I_n$, which includes dimensional considerations, encodes the contribution, in terms of attractor coordinates, of the k-th training sample via the following expression\n\n${}^{k}I_n$ = [...] if $n \\in S_X$; [...] if $n \\in S_H \\cup S_Y$ (3.2)\n\nThe topographic input, output and hidden network partitions $S_X$, $S_Y$ and $S_H$ are architectural requirements related to the encoding of mapping-type problems for which a number of possibilities exist (Barhen, Gulati and Zak, 1989; Barhen, Zak and Gulati, 1989). 
In previous articles (ibid.; Zak, 1989) we have demonstrated that in general, for $\\beta = (2i + 1)^{-1}$ and i a strictly positive integer, such attractors have infinite local stability and provide opportunity for learning in real time. Typically, $\\beta$ can be set to 1/3. Assuming an adiabatic framework, the fixed point equations at equilibrium, i.e., as $\\dot{z}_n \\to 0$, yield\n\n$$\\frac{\\kappa_n}{\\gamma_n}\\, g^{-1}({}^{k}\\tilde{u}_n) = \\sum_m W_{nm} T_{nm}\\,{}^{k}\\tilde{u}_m + {}^{k}\\tilde{I}_n \\tag{3.3}$$\n\nwhere $u_n = g_\\gamma(z_n)$ represents the neural response. The superscript ~ denotes quantities evaluated at steady state. Operational network dynamics is then given by\n\n$$\\dot{u}_n + u_n = g_\\gamma\\!\\left[\\frac{\\gamma_n}{\\kappa_n} \\sum_m W_{nm} T_{nm} u_m + \\frac{\\gamma_n}{\\kappa_n}\\,{}^{k}I_n\\right] \\tag{3.4}$$\n\nTo proceed formally with the development of a supervised learning algorithm, we consider an approach based upon the minimization of a constrained \"neuromorphic\" energy function E given by the following expression\n\n$$E(u, p) = \\frac{1}{2} \\sum_k \\sum_n \\left[{}^{k}\\tilde{u}_n - {}^{k}a_n\\right]^2 \\quad \\forall\\, n \\in S_X \\cup S_Y \\tag{3.5}$$\n\nWe relate adjoint theory to neural learning by identifying the neuromorphic energy function, E in Eq. (3.5), with the system response R. Also, let p denote the following system parameters:\n\n$$p = \\{T, \\kappa, \\gamma, \\ldots\\}$$\n\nThe proposed objective function enforces convergence of every neuron in $S_X$ and $S_Y$ to attractor coordinates corresponding to the components in the input-output training patterns, thereby prompting the network to learn the embedded invariances. Lyapunov stability requires an energy-like function to be monotonically decreasing in time. 
Since in our model the internal dynamical parameters of interest are the synaptic strengths $T_{nm}$ of the interconnection topology, the characteristic decay constants $\\kappa_n$ and the gain parameters $\\gamma_n$, this implies that\n\n$$\\dot{E} = \\sum_n \\sum_m \\frac{dE}{dT_{nm}} \\dot{T}_{nm} + \\sum_n \\frac{dE}{d\\kappa_n} \\dot{\\kappa}_n + \\sum_n \\frac{dE}{d\\gamma_n} \\dot{\\gamma}_n < 0 \\tag{3.6}$$\n\nFor each adaptive system parameter, $p_\\mu$, Lyapunov stability will be satisfied by the following choice of equations of motion\n\n$$\\dot{p}_\\mu = -\\tau_p \\frac{dE}{dp_\\mu} \\tag{3.7}$$\n\nExamples include\n\n$$\\dot{T}_{nm} = -\\tau_T \\frac{dE}{dT_{nm}}, \\qquad \\dot{\\kappa}_n = -\\tau_\\kappa \\frac{dE}{d\\kappa_n}, \\qquad \\dot{\\gamma}_n = -\\tau_\\gamma \\frac{dE}{d\\gamma_n}$$\n\nwhere the time-scale parameters $\\tau_T$, $\\tau_\\kappa$ and $\\tau_\\gamma$ are positive. Since E depends on $p_\\mu$ both directly and indirectly, previous methods required the solution of a system of N equations for each parameter $p_\\mu$ to obtain $dE/dp_\\mu$ from $du/dp_\\mu$. Our methodology, based on adjoint operators, yields all derivatives $dE/dp_\\mu$, $\\forall \\mu$, by solving a single set of N linear equations.\n\nThe nonlinear neural operator for each training pattern k, $k = 1, \\ldots, K$, at equilibrium is given by\n\n$${}^{k}\\varphi_n(\\tilde{u}, p) = g\\!\\left[\\frac{1}{\\kappa_n} \\sum_{m'} W_{nm'} T_{nm'}\\,{}^{k}\\tilde{u}_{m'} + \\frac{1}{\\kappa_n}\\,{}^{k}\\tilde{I}_n\\right] - {}^{k}\\tilde{u}_n \\tag{3.8}$$\n\nwhere, without loss of generality, we have set $\\gamma_n$ to unity. So, in principle, ${}^{k}\\tilde{u}_n = {}^{k}\\tilde{u}_n[T, \\kappa, \\gamma, {}^{k}a_n, \\ldots]$. Using Eqs. (3.8), the forward sensitivity matrix can be computed and compactly expressed as\n\n$$\\frac{\\partial\\,{}^{k}\\varphi_n}{\\partial\\,{}^{k}\\tilde{u}_m} = \\frac{1}{\\kappa_n}\\,{}^{k}\\hat{g}'_n \\left[W_{nm} T_{nm} + \\frac{\\partial\\,{}^{k}\\tilde{I}_n}{\\partial\\,{}^{k}\\tilde{u}_m}\\right] - \\delta_{nm} = \\frac{1}{\\kappa_n}\\,{}^{k}\\hat{g}'_n W_{nm} T_{nm} - {}^{k}\\eta_n \\delta_{nm} \\tag{3.9}$$\n\nwhere\n\n${}^{k}\\eta_n$ = [...] if $n \\in S_X$; [...] if $n \\in S_H \\cup S_Y$ (3.10)\n\nAbove, ${}^{k}\\hat{g}'_n$ represents the derivative of g with respect to ${}^{k}\\tilde{u}_n$, i.e., if g = tanh, then\n\n$${}^{k}\\hat{g}'_n = 1 - [{}^{k}\\hat{g}_n]^2 \\quad \\text{where} \\quad {}^{k}\\hat{g}_n = g\\!\\left[\\frac{1}{\\kappa_n}\\left(\\sum_m W_{nm} T_{nm}\\,{}^{k}\\tilde{u}_m + {}^{k}\\tilde{I}_n\\right)\\right] \\tag{3.11}$$\n\nRecall that the formal adjoint equation is given as $A^{*} v = s^{*}$; here\n\n$$A^{*}_{nm} = \\frac{1}{\\kappa_m}\\,{}^{k}\\hat{g}'_m W_{mn} T_{mn} - {}^{k}\\eta_m \\delta_{mn} \\tag{3.12}$$\n\nUsing Eqs. (2.9) and (3.5), we can compute the formal adjoint source\n\n$${}^{k}s^{*}_n = \\frac{\\partial E}{\\partial\\,{}^{k}\\tilde{u}_n} = \\begin{cases} {}^{k}\\tilde{u}_n - {}^{k}a_n & \\text{if } n \\in S_X \\cup S_Y \\\\ 0 & \\text{if } n \\in S_H \\end{cases} \\tag{3.13}$$\n\nThe system of adjoint fixed-point equations can then be constructed using Eqs. (3.12) and (3.13), to yield:\n\n$$\\sum_m \\frac{1}{\\kappa_m}\\,{}^{k}\\hat{g}'_m W_{mn} T_{mn}\\,{}^{k}v_m - \\sum_m {}^{k}\\eta_m \\delta_{mn}\\,{}^{k}v_m = {}^{k}s^{*}_n \\tag{3.14}$$\n\nNotice that the above coupled system, (3.14), is linear in ${}^{k}v$. Furthermore, it has the same mathematical characteristics as the operational dynamics (3.4). Its components can be obtained as the equilibrium points (i.e., $\\dot{v}_i \\to 0$) of the adjoint neural dynamics\n\n$$[\\ldots] \\; \\sum_m \\frac{1}{\\kappa_m}\\,{}^{k}\\hat{g}'_m W_{mn} T_{mn}\\,{}^{k}v_m \\; [\\ldots] \\tag{3.15}$$\n\nAs an implementation example, let us conclude by deriving the learning equations for the synaptic strengths, $T_\\mu$. Recall that\n\n$$\\frac{dE}{dT_\\mu} = \\frac{\\partial E}{\\partial T_\\mu} + \\sum_k {}^{k}v \\cdot {}^{\\mu k}s, \\qquad \\mu = (i, j) \\tag{3.16}$$\n\nWe differentiate the steady state equations (3.8) with respect to $T_{ij}$ to obtain the forward source term ${}^{\\mu k}s_n = -\\partial\\,{}^{k}\\varphi_n/\\partial T_{ij}$, with\n\n$$\\frac{\\partial\\,{}^{k}\\varphi_n}{\\partial T_{ij}} = {}^{k}\\hat{g}'_n \\left[\\frac{1}{\\kappa_n} \\sum_l W_{nl}\\,\\delta_{in}\\,\\delta_{jl}\\,{}^{k}\\tilde{u}_l\\right] = \\frac{1}{\\kappa_n}\\,{}^{k}\\hat{g}'_n\\,\\delta_{in} W_{nj}\\,{}^{k}\\tilde{u}_j \\tag{3.17}$$\n\nSince, by definition, $\\partial E/\\partial T_{nm} = 0$, the explicit energy gradient contribution is obtained as\n\n$$\\dot{T}_{nm} = -\\tau_T \\left[-\\frac{W_{nm}}{\\kappa_n} \\sum_k {}^{k}v_n\\,{}^{k}\\hat{g}'_n\\,{}^{k}\\tilde{u}_m\\right] \\tag{3.18}$$\n\nIt is straightforward to obtain learning equations for $\\gamma_n$ and $\\kappa_n$ in a similar fashion.\n\n4 ADAPTIVE TIME-SCALES\n\nSo far the adaptive learning rates, i.e., $\\tau_p$ in Eq. (3.7), have not been specified. We now show that, by an appropriate selection of these parameters, the convergence of the corresponding dynamical systems can be considerably improved. Without loss of generality, we shall assume $\\tau_T = \\tau_\\kappa = \\tau_\\gamma = \\tau$, and we shall seek $\\tau$ in the form (Barhen et al., 1989; Zak, 1989)\n\n$$\\tau = |\\nabla E|^{-\\beta} \\tag{4.1}$$\n\nwhere $\\nabla E$ denotes the vector with components $\\nabla_T E$, $\\nabla_\\gamma E$ and $\\nabla_\\kappa E$. It is straightforward to show that\n\n$$\\frac{d|\\nabla E|}{dt} = -\\chi\\,|\\nabla E|^{1-\\beta} \\tag{4.2}$$\n\nas $\\nabla E$ tends to zero, where $\\chi$ is an arbitrary positive constant. If we evaluate the relaxation time of the energy gradient, we find that\n\n$$t_E = -\\frac{1}{\\chi} \\int_{|\\nabla E|_0}^{|\\nabla E| \\to 0} \\frac{d|\\nabla E|}{|\\nabla E|^{1-\\beta}} = \\begin{cases} \\infty & \\text{if } \\beta \\le 0 \\\\ |\\nabla E|_0^{\\beta}/(\\chi\\beta) < \\infty & \\text{if } \\beta > 0 \\end{cases} \\tag{4.3}$$\n\nThus, for $\\beta \\le 0$ the relaxation time is infinite, while for $\\beta > 0$ it is finite. The dynamical system (4.2) suffers a qualitative change for $\\beta > 0$: it loses uniqueness of solution. The equilibrium point $|\\nabla E| = 0$ becomes a singular solution being intersected by all the transients, and the Lipschitz condition is violated, as one can see from\n\n$$\\frac{d}{d|\\nabla E|} \\left(\\frac{d|\\nabla E|}{dt}\\right) = -\\chi (1-\\beta)\\,|\\nabla E|^{-\\beta} \\to -\\infty \\tag{4.4}$$\n\nwhere $|\\nabla E|$ tends to zero, while $\\beta$ is strictly positive. Such infinitely stable points are \"terminal attractors\". 
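The finite relaxation time discussed above can be checked with a short sketch. It assumes the gradient norm x = |∇E| obeys dx/dt = -χ x^(1-β), the form implied by the relaxation-time integral and by Eq. (4.4); the closed-form trajectory x(t)^β = x0^β - χβt follows from that assumption by direct integration. The function names relaxation_time and gradient_norm are hypothetical, chosen only for this illustration.

```python
def relaxation_time(x0, beta, chi=1.0):
    # Time for x(t) = |grad E| to reach zero under dx/dt = -chi * x**(1 - beta).
    # For beta > 0 the integral of dx / (chi * x**(1 - beta)) from 0 to x0 is finite.
    if beta <= 0.0:
        # Ordinary attractor: the equilibrium is approached only asymptotically.
        return float('inf')
    return x0 ** beta / (chi * beta)


def gradient_norm(t, x0, beta, chi=1.0):
    # Closed-form trajectory for beta > 0:
    # x(t)**beta = x0**beta - chi * beta * t, clipped at zero once the attractor is reached.
    remaining = x0 ** beta - chi * beta * t
    return remaining ** (1.0 / beta) if remaining > 0.0 else 0.0
```

For example, with x0 = 1, χ = 1 and β = 2/3 the closed form gives a relaxation time of 1.5 time units, whereas β = 0 (plain exponential decay) never reaches the equilibrium in finite time.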
By analogy with our previous results we choose $\\beta = 2/3$, which yields\n\n$$\\tau = \\left(\\sum_n \\sum_m [\\nabla_T E]^2_{nm} + \\sum_n [\\nabla_\\gamma E]^2_n + \\sum_n [\\nabla_\\kappa E]^2_n\\right)^{-1/3} \\tag{4.5}$$\n\nThe introduction of these adaptive time-scales dramatically improves the convergence of the corresponding learning dynamical systems.\n\n5 SIMULATIONS\n\nThe computational framework developed in the preceding section has been applied to a number of problems that involve learning nonlinear mappings, including Exclusive-OR, the hyperbolic tangent and trigonometric functions, e.g., sin. Some of these mappings (e.g., XOR) have been extensively benchmarked in the literature, and provide an adequate basis for illustrating the computational efficacy of our proposed formulation. Figures 1(a)-1(d) demonstrate the temporal profile of various network elements during learning of the XOR function. A six-neuron feedforward network, which included self-feedback on the output unit and a bias, was used. Fig. 1(a) shows the LMS error during the training phase. The worst-case convergence of the output state neuron to the presented attractor is displayed in Fig. 1(b). Notice the rapid convergence of the input state due to the terminal attractor effect. The behavior of the adaptive time-scale parameter $\\tau$ is depicted in Fig. 1(c). Finally, Fig. 1(d) shows the evolution of the energy gradient components.\n\nThe test setup for signal processing applications, i.e., learning the sin function and the tanh sigmoidal nonlinearity, included an 8-neuron fully connected network with no bias. In each case the network was trained using as few as 4 randomly sampled training points. Efficacy of recall was determined by presenting 100 random samples. Figs. 2 and 3(b) illustrate that we were able to approximate the sin and the hyperbolic tangent functions using 4 and 16 pairs, respectively. Fig. 3(a) demonstrates the network performance when 4 pairs were used to learn the hyperbolic tangent.\n\nWe would like to mention that since our learning methodology involves terminal attractors, extreme caution must be exercised when simulating the algorithms in a digital computing environment. Our discussion on the sensitivity of results to the integration schemes (Barhen, Zak and Gulati, 1989) emphasizes that explicit methods such as Euler or Runge-Kutta should not be used, since the presence of terminal attractors induces extreme stiffness. Practically, this would require an integration time-step of infinitesimal size, resulting in numerical round-off errors of unacceptable magnitude. Implicit integration techniques such as the Kaps-Rentrop scheme should therefore be used.\n\n6 CONCLUSIONS\n\nIn this paper we have presented a theoretical framework for faster learning in dynamical neural networks. Central to our approach is the concept of adjoint operators, which enables computation of network neuromorphic energy gradients with respect to all system parameters using the solution of a single set of linear equations. If $C_F$ and $C_A$ denote the computational costs associated with solving the forward and adjoint sensitivity equations (Eqs. 2.5 and 2.6), and if M denotes the number of parameters of interest in the network, the speedup achieved is $M\\,C_F / C_A$.\n\nIf we assume that $C_F \\approx C_A$ and that $M = N^2 + 2N + \\ldots$, we see that the lower bound on speedup per learning iteration is $O(N^2)$. 
Finally, particular care must be exercised when integrating the dynamical systems of interest, due to the extreme stiffness introduced by the terminal attractor constructs.\n\nAcknowledgements\n\nThe research described in this paper was performed by the Center for Space Microelectronics Technology, Jet Propulsion Laboratory, California Institute of Technology, and was sponsored by agencies of the U.S. Department of Defense, and by the Office of Basic Energy Sciences of the U.S. Department of Energy, through interagency agreements with NASA.\n\nReferences\n\nR.G. Alsmiller, J. Barhen and J. Horwedel. (1984) \"The Application of Adjoint Sensitivity Theory to a Liquid Fuels Supply Model\", Energy, 9(3), 239-253.\n\nJ. Barhen, D.G. Cacuci and J.J. Wagschal. (1982) \"Uncertainty Analysis of Time-Dependent Nonlinear Systems\", Nucl. Sci. Eng., 81, 23-44.\n\nJ. Barhen, S. Gulati and M. Zak. (1989) \"Neural Learning of Constrained Nonlinear Transformations\", IEEE Computer, 22(6), 67-76.\n\nJ. Barhen, M. Zak and S. Gulati. (1989) \"Fast Neural Learning Algorithms Using Networks with Non-Lipschitzian Dynamics\", in Proc. Neuro-Nimes '89, 55-68, EC2, Nanterre, France.\n\nD.G. Cacuci, C.F. Weber, E.M. Oblow and J.H. Marable. (1980) \"Sensitivity Theory for General Systems of Nonlinear Equations\", Nucl. Sci. Eng., 75, 88-110.\n\nE.M. Oblow. (1977) \"Sensitivity Theory for General Non-Linear Algebraic Equations with Constraints\", ORNL/TM-5815, Oak Ridge National Laboratory.\n\nB.A. Pearlmutter. (1989) \"Learning State Space Trajectories in Recurrent Neural Networks\", Neural Computation, 1(3), 263-269.\n\nF.J. Pineda. (1988) \"Dynamics and Architecture in Neural Computation\", Journal of Complexity, 4, 216-245.\n\nD.E. Rumelhart and J.L. McClelland. (1986) Parallel Distributed Processing, MIT Press, Cambridge, MA.\n\nN. Toomarian, E. Wacholder and S. Kaizerman. (1987) \"Sensitivity Analysis of Two-Phase Flow Problems\", Nucl. Sci. Eng., 99(1), 53-81.\n\nR.J. Williams and D. Zipser. (1989) \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks\", Neural Computation, 1(3), 270-280.\n\nM. Zak. (1989) \"Terminal Attractors\", Neural Networks, 2(4), 259-274.\n\nFigure 1(a)-(d). Learning the Exclusive-OR function using a 6-neuron (including bias) feedforward dynamical network with self-feedback on the output unit. [Four panels, (a) to (d), each plotted against iterations.]\n\nFigure 2. Learning the sin function using a fully connected, 8-neuron network with no bias. The training set comprised 4 points that were randomly selected. [Plot; both axes run from -1.000 to 1.000.]\n\nFigure 3. Learning the hyperbolic tangent function using a fully connected, 8-neuron network with no bias: (a) using 4 randomly selected training samples; (b) using 16 randomly selected training samples. [Two plots; both axes run from -1.000 to 1.000.] 
\n\n\f", "award": [], "sourceid": 262, "authors": [{"given_name": "Jacob", "family_name": "Barhen", "institution": null}, {"given_name": "Nikzad", "family_name": "Toomarian", "institution": null}, {"given_name": "Sandeep", "family_name": "Gulati", "institution": null}]}