{"title": "Higher Order Recurrent Networks and Grammatical Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 380, "page_last": 387, "abstract": null, "full_text": "380 \n\nGiles, Sun, Chen, Lee and Chen \n\nHIGHER ORDER RECURRENT NETWORKS & GRAMMATICAL INFERENCE \n\nC. L. Giles*, G. Z. Sun, H. H. Chen, Y. C. Lee, D. Chen \n\nDepartment of Physics and Astronomy and Institute for Advanced Computer Studies \n\nUniversity of Maryland, College Park, MD 20742 \n\n* NEC Research Institute \n\n4 Independence Way, Princeton, NJ 08540 \n\nABSTRACT \n\nA higher order single layer recursive network easily learns to simulate a deterministic finite state machine and recognize regular grammars. When an enhanced version of this neural net state machine is connected through a common error term to an external analog stack memory, the combination can be interpreted as a neural net pushdown automaton. The neural net finite state machine is given the primitives push and pop, and is able to read the top of the stack. Through a gradient descent learning rule derived from the common error function, the hybrid network learns to effectively use the stack actions to manipulate the stack memory and to learn simple context-free grammars. \n\nINTRODUCTION \n\nBiological networks readily and easily process temporal information; artificial neural networks should do the same. Recurrent neural network models permit the encoding and learning of temporal sequences. There are many recurrent neural net models; for example, see [Jordan 1986, Pineda 1987, Williams & Zipser 1988]. Nearly all encode the current state representation of the models in the activity of the neurons, and the next state is determined by the current state and input. 
From an automata perspective, this dynamical structure is a state machine. One formal model of sequences and of the machines that generate and recognize them is formal grammars and their respective automata. These models formalize some of the foundations of computer science. In the Chomsky hierarchy of formal grammars [Hopcroft & Ullman 1979] the simplest level of complexity is defined by the finite state machine and its regular grammars. (All machines and grammars described here are deterministic.) The next level of complexity is described by pushdown automata and their associated context-free grammars. The pushdown automaton is a finite state machine with the added power to use a stack memory. Neural networks should be able to perform the same type of computation and thus solve such learning problems as grammatical inference [Fu 1982]. \n\nSimple grammatical inference is defined as the problem of finding (learning) a grammar from a finite set of strings, often called the teaching sample. Recall that a (phrase-structured) grammar is defined as a 4-tuple (N, V, P, S) where N and V are nonterminal and terminal vocabularies, P is a finite set of production rules, and S is the start symbol. Here grammatical inference is also defined as the learning of the machine that recognizes the teaching and testing samples. Potential applications of grammatical inference include such varied areas as pattern recognition, information retrieval, programming language design, translation and compiling, and graphics languages [Fu 1982]. \n\nThere has been a great deal of interest in teaching neural nets to recognize grammars and simulate automata [Allen 1989, Jordan 1986, Pollack 1989, 
Servan-Schreiber et al. 1989, Williams & Zipser 1988]. Some important extensions of that work are discussed here. In particular we construct recurrent higher order neural net state machines which have no hidden layers and seem to be at least as powerful as any neural net multilayer state machine discussed so far. For example, the learning time and training sample size are significantly reduced. In addition, we integrate this neural net finite state machine with an external stack memory and inform the network through a common objective function that it has at its disposal the symbol at the top of the stack and the operation primitives of push and pop. By devising a common error function which integrates the stack and the neural net state machine, this hybrid structure learns to effectively use the stack to recognize context-free grammars. In the interesting work of [Williams & Zipser 1988] a recurrent net learns only the state machine part of a Turing machine, since the associated move, read, and write operations for each input string are known and are given as part of the training set. However, the model we present learns how to manipulate the push, pop, and read primitives of an external stack memory and also learns the additional necessary state operations and structure. \n\nHIGHER ORDER RECURRENT NETWORK \n\nThe recurrent neural network utilized can be considered as a higher order modification of the network model developed by [Williams & Zipser 1988]. Recall that in a recurrent net the activation state S of the neurons at time (t+1) is defined as in a state machine automaton: \n\nS(t+1) = F( S(t), I(t); W )    (1) \n\nwhere F maps the state S and the input I at time t to the next state. The weight matrix W forms the mapping and is usually learned. We use a higher order form for this mapping: \n\nS_i(t+1) = g( Σ_{j,k} W_{ijk} S_j(t) I_k(t) )    (2) \n\nwhere the range of i, j is the number of state neurons and k the number of input neurons; g is defined as g(x) = 1/(1+exp(-x)). In order to use the net for grammatical inference, a learning rule must be devised. To learn the mapping F and the weight matrix W, given a sample set of P strings of the grammar, we construct the following error function E: \n\nE = Σ_r E_r^2 = Σ_r ( T_r - S_o(t_p) )^2    (3) \n\nwhere the sum is over P samples. The error function is evaluated at the end of a presented sequence of length t_p and S_o is the activity of the output neuron. For a recurrent net, the output neuron is a designated member of the state neurons. The target value of any pattern is 1 for a legal string and 0 for an illegal one. Using a gradient descent procedure, we minimize the error function E for only the rth pattern. The weight update rule becomes \n\nΔW_{ijk} = η E_r ∂S_o(t_p)/∂W_{ijk}    (4) \n\nwhere η is the learning rate. Using eq. (2), ∂S_o(t_p)/∂W_{ijk} is easily calculated using the recursion relationship and the choice of an initial value for ∂S_i(t=0)/∂W_{ijk}: \n\n∂S_l(t+1)/∂W_{ijk} = h(S_l(t+1)) { δ_{li} S_j(t) I_k(t) + Σ_{m,n} W_{lmn} I_n(t) ∂S_m(t)/∂W_{ijk} }    (5) \n\nwhere h(x) = dg/dx. Note that this requires ∂S_i(t)/∂W_{ijk} to be updated as each element of each string is presented and to have a known initial value. Given an adequate network topology, the above neural net state machine should be capable of learning any regular grammar of arbitrary string length or a more complex grammar of finite length. \n\nFINITE STATE MACHINE SIMULATION \n\nIn order to see how such a net performs, we trained the net on a regular grammar, the dual parity grammar. 
An arbitrary length string of 0's and 1's has dual parity if the string contains an even number of 0's and an even number of 1's. The network architecture was 3 input neurons and either 3, 4, or 5 state neurons with fully connected second order interconnection weights. The string vocabulary 0, 1, e (end symbol) used a unary representation. The initial training set consisted of 30 positive and negative strings of increasing string length up to length 4. After including in the training all strings up to length 10 which resulted in misclassification (about 30 strings), the neural net state machine perfectly recognized all strings up to length 20. Total training time was usually 500 epochs or less. \n\nBy looking closely at the dynamics of learning, it was discovered that for different inputs the states of the network tended to cluster around three values plus the initial state. These four states can be considered as possible states of an actual finite state machine and the movement between these states as a function of input can be interpreted as the state transitions of a state machine. Constructing a state machine yields a perfect four state machine which will recognize the dual parity grammar. Using minimization procedures [Fu 1982], the extraneous state transitions can be reduced to the minimal 4-state machine. The extracted state machine is shown in Fig. 1. However, for more complicated grammars and different initial conditions, it might be difficult to extract the finite state machine. When different initial weights were chosen, different extraneous transition diagrams with more states resulted. 
What is interesting is that the neural net finite state machine learned this simple grammar perfectly. A first order net can also learn this problem; the higher order net learns it much faster. It is easy to prove that there are finite state machines that cannot be represented by first order, single layer recurrent nets [Minsky 1967]. For further discussion of higher order state machines, see [Liu et al. 1990]. \n\nFIGURE 1: A learned four state machine; state 1 is both the start and the final state. \n\nNEURAL NET PUSHDOWN AUTOMATA \n\nIn order to easily learn more complex deterministic grammars, the neural net must somehow develop and/or learn to use some type of memory, the simplest being a stack memory. Two approaches easily come to mind: teach the additional weight structure in a multilayer neural network to serve as memory [Pollack 1989], or teach the neural net to use an external memory source. The latter is appealing because it is well known from formal language theory that a finite stack machine requires significantly fewer resources than a finite state machine for bounded problems such as recognizing a finite length context-free grammar. To teach a neural net to use a stack memory poses at least three problems: 1) how to construct the stack memory, 2) how to couple the stack memory to the neural net state machine, and 3) how to formulate the objective function such that its optimization will yield effective learning rules. \n\nMost straightforward is formulating the objective function so that the stack is coupled to the neural net state machine. The most stringent condition for a pushdown automaton to accept a context-free grammar is that the pushdown automaton be in a final state and the stack be empty. 
Thus, the error function of eq. (3) above is modified to include both final state and stack length terms: \n\nE = Σ_r E_r^2 = Σ_r ( T_r - S_o(t_p) + L(t_p) )^2    (6) \n\nwhere L(t_p) is the final stack length at time t_p, i.e. the time at which the last symbol of the string is presented. Therefore, for legal strings E = 0 if the pushdown automaton is in a final state and the stack is empty. \n\nNow consider how the stack can be connected to the neural net state machine. Recall that for a pushdown automaton [Fu 1982], the state transition mapping of eq. (1) includes an additional argument, the symbol R(t) read from the top of the stack, and an additional stack action mapping. An obvious approach to connecting the stack to the neural net is to let the activity level of certain neurons represent the symbol at the top of the stack and others represent the action on the stack. The pushdown automaton has an additional stack action of reading or writing to the top of the stack based on the current state, input, and top stack symbol. One interpretation of these mappings would be extensions of eq. (2): \n\nS_i(t+1) = g( Σ_{j,k} W^s_{ijk} S_j(t) V_k(t) )    (7) \n\nA_i(t+1) = f( Σ_{j,k} W^a_{ijk} S_j(t) V_k(t) )    (8) \n\nFIGURE 2: Single layer higher order recursive neural network that is connected to a stack memory. A represents action neurons connected to the stack; R represents memory buffer neurons which read the top of the stack. The activation proceeds upward from states, input, and stack top at time t to states and action at time t+1. The recursion replaces the states in the bottom layer with the states in the top layer. \n\nwhere A_j(t) are output neurons controlling the action of the stack; V_k(t) is either the input neuron value I_k(t) or the connected stack memory neuron value R_k(t), dependent on the index k; and f = 2g - 1. 
The current values S_j(t), I_k(t), and R_k(t) are all fully connected through 2nd order weights with no hidden neurons. The mappings of eqs. (7) and (8) define the recursive network and can be implemented concurrently and in parallel. Let A(t=0) = R(t=0) = 0. The neuron state values range continuously from 0 to 1 while the neuron action values range from -1 to 1. The neural network part of the architecture is depicted in Fig. 2. The number of read neurons is equal to the coding representation of the stack. For most applications, one action neuron suffices. \n\nIn order to use the gradient descent learning rule described in eq. (4), the stack length must have continuous values. (Other types of learning algorithms may not require a continuous stack.) We now explain how a continuous stack is used and connected to the action and read neurons. Interpret the stack actions as follows: push (A>0), pop (A<0), no action (A=0). For simplicity, only the current input symbol is pushed; then the number of input and stack memory neurons are equal. (If the input symbol is a, then only an amount A of that value is pushed onto the stack.) The stack consists of a summation of analog symbols. By definition, all symbols up to unit depth one are in the read neuron R at time t. If A<0 (pop), a depth of |A| of all symbols in that depth is removed from the stack. In the next time step what remains in R is a unit length from the current stack top. An attempt to pop an empty stack occurs if not enough remains in the stack to pop depth |A|. Further description of this operation with examples can be found in [Sun et al. 1990]. Since the action operation A removes from or adds to the stack, the stack length at time t+1 is L(t+1) = L(t) + A(t), where L(t=0) = 0. 
\n\nWith the recursion relations, stack construction, and error function defined, the learning algorithms may be derived from eqs. (4) & (6): \n\nΔW_{ijk} = η E_r ( ∂S_o(t_p)/∂W_{ijk} - ∂L(t_p)/∂W_{ijk} )    (9) \n\nThe derivative terms may be derived from the recurrent relations eqs. (7) & (8) and the stack length equation. They are \n\n∂S_l(t+1)/∂W_{ijk} = h(S_l(t+1)) { δ_{li} S_j(t) V_k(t) + Σ_{m,n} W_{lmn} V_n(t) ∂S_m(t)/∂W_{ijk} + Σ_{m,n} W_{lmn} S_m(t) ∂R_n(t)/∂W_{ijk} }    (10) \n\nand \n\n∂L(t+1)/∂W_{ijk} = ∂L(t)/∂W_{ijk} + ∂A(t)/∂W_{ijk}    (11) \n\nSince the change ∂R_k(t)/∂W_{ijk} must contain information about past changes in action A, we have \n\n∂R_k(t)/∂W_{ijk} = Σ ∂R_k(t)/∂A(t) ∂A(t)/∂W_{ijk} ≈ ΔR ∂A(t)/∂W_{ijk}    (12) \n\nwhere ΔR = 0, 1, or -1 and depends on the top and bottom symbols read in R(t). This approximation assumes that the read changes are only affected by actions which occurred in the recent past. The change in action with respect to the weights is defined by a recursion derived from eq. (8) and has the same form as eq. (10). For the case of popping an empty stack, the weight change increases the stack length for a legal string; otherwise nothing happens. It appears that all these derivatives are necessary to adequately integrate the neural net to the continuous stack memory. \n\nPUSHDOWN AUTOMATA SIMULATIONS \n\nTo test this theoretical development, we trained the neural net pushdown automaton on two context-free grammars, 1^n 0^n and the parenthesis grammar (balanced strings of parentheses). For the parenthesis grammar, the net architecture consisted of a 2nd order fully interconnected single layer net with 3 state neurons, 3 input neurons, and 2 action neurons (one for push & one for pop). 
\nIn 20 epochs with fifty positive and negative training samples of increasing length up to length eight, the network learned how to be a perfect pushdown automaton. We concluded this after testing on all strings up to length 20 and through a similar analysis of emergent state-stack values. Using a similar clustering analysis and heuristic reduction approach, the minimal pushdown automaton emerges. It should be noted that for this pushdown automaton, the state machine does very little and is easily learned. Fig. 3 shows the pushdown automaton that emerged; the 3-tuple represents (input symbol, stack symbol, action of push or pop). The 1^n 0^n grammar was also successfully trained with a small training set and a few hundred epochs of learning. This should be compared to the more computationally intense learning of layered networks [Allen 1989]. A minimal pushdown automaton was also derived. For further details of the learning and emergent pushdown automata, see [Sun et al. 1990]. \n\n(0,φ,-1)  (0,φ,-1)  (1,1,1)  (0,1,-1)  (1,φ,1)  (e,1,.) \n\nFIGURE 3: Learned neural network pushdown automaton for parenthesis balance checker where the numerical results for states (1), (2), (3), and (4) are (1,0,0), (.9,.2,.2), (.89,.17,.48) and (.79,.25,.70). State (1) is the start state. State (3) is a legal end state. Before feeding the end symbol, a legal string must end at state (2) with empty stack. \n\nCONCLUSIONS \n\nThis work presents a different approach to incorporating and using memory in a neural network. A recurrent higher order net learned to effectively employ an external stack memory to learn simple context-free grammars. 
However, to do so required the creation of a continuous stack structure. Since it was possible to reduce the neural network to the ideal pushdown automaton, the neural network can be said to have \"perfectly\" learned these simple grammars. Though the simulations appear very promising, many questions remain. Besides extending the simulations to more complex grammars, there are questions of how well such architectures will scale for \"real\" problems. What became evident was the power of the higher order network, again demonstrating its speed of learning and sparseness of training sets. Whether the same will be true for more complex problems is a question for further work. \n\nREFERENCES \n\nR.A. Allen, Adaptive Training for Connectionist State Machines, ACM Computer Conference, Louisville, p. 428 (1989). \n\nD. Angluin & C.H. Smith, Inductive Inference: Theory and Methods, ACM Computing Surveys, vol. 15, no. 3, p. 237 (1983). \n\nK.S. Fu, Syntactic Pattern Recognition and Applications, Prentice-Hall, Englewood Cliffs, NJ (1982). \n\nJ.E. Hopcroft & J.D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, MA (1979). \n\nM.I. Jordan, Attractor Dynamics and Parallelism in a Connectionist Sequential Machine, Proceedings of the Eighth Conference of the Cognitive Science Society, Amherst, MA, p. 531 (1986). \n\nY.D. Liu, G.Z. Sun, H.H. Chen, Y.C. Lee, C.L. Giles, Grammatical Inference and Neural Network State Machines, Proceedings of the International Joint Conference on Neural Networks, M. Caudill (ed), Lawrence Erlbaum, Hillsdale, NJ, vol. 1, p. 285 (1990). \n\nM.L. Minsky, Computation: Finite and Infinite Machines, Prentice-Hall, Englewood Cliffs, NJ, p. 55 (1967). \n\nF.J. 
Pineda, Generalization of Backpropagation to Recurrent Neural Networks, Phys. Rev. Lett., vol. 59, p. 2229 (1987). \n\nJ.B. Pollack, Implications of Recursive Distributed Representations, Advances in Neural Information Processing Systems 1, D.S. Touretzky (ed), Morgan Kaufmann, San Mateo, CA, p. 527 (1989). \n\nD. Servan-Schreiber, A. Cleeremans & J.L. McClelland, Encoding Sequential Structure in Simple Recurrent Networks, Advances in Neural Information Processing Systems 1, D.S. Touretzky (ed), Morgan Kaufmann, San Mateo, CA, p. 643 (1989). \n\nG.Z. Sun, H.H. Chen, C.L. Giles, Y.C. Lee, D. Chen, Connectionist Pushdown Automata that Learn Context-free Grammars, Proceedings of the International Joint Conference on Neural Networks, M. Caudill (ed), Lawrence Erlbaum, Hillsdale, NJ, vol. 1, p. 577 (1990). \n\nR.J. Williams & D. Zipser, A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, Institute for Cognitive Science Report 8805, U. of CA, San Diego, La Jolla, CA 92093 (1988). \n", "award": [], "sourceid": 243, "authors": [{"given_name": "C.", "family_name": "Giles", "institution": null}, {"given_name": "Guo-Zheng", "family_name": "Sun", "institution": null}, {"given_name": "Hsing-Hen", "family_name": "Chen", "institution": null}, {"given_name": "Yee-Chun", "family_name": "Lee", "institution": null}, {"given_name": "Dong", "family_name": "Chen", "institution": null}]}