{"title": "Induction of Finite-State Automata Using Second-Order Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 309, "page_last": 317, "abstract": null, "full_text": "Induction of Finite-State Automata Using Second-Order Recurrent Networks\n\nRaymond L. Watrous\nSiemens Corporate Research\n755 College Road East, Princeton, NJ 08540\n\nGary M. Kuhn\nCenter for Communications Research, IDA\nThanet Road, Princeton, NJ 08540\n\nAbstract\n\nSecond-order recurrent networks that recognize simple finite state languages over {0,1}* are induced from positive and negative examples. Using the complete gradient of the recurrent network and sufficient training examples to constrain the definition of the language to be induced, solutions are obtained that correctly recognize strings of arbitrary length. A method for extracting a finite state automaton corresponding to an optimized network is demonstrated.\n\n1 Introduction\n\nWe address the problem of inducing languages from examples by considering a set of finite state languages over {0,1}* that were selected for study by Tomita (Tomita, 1982):\n\nL1. 1*\nL2. (10)*\nL3. no odd-length 0-string anywhere after an odd-length 1-string\nL4. not more than two 0's in a row\nL5. bit pairs, #01's + #10's = 0 mod 2\nL6. abs(#1's - #0's) = 0 mod 3\nL7. 0*1*0*1*\n\nTomita also selected for each language a set of positive and negative examples (summarized in Table 1) to be used as a training set. By a method of heuristic search over the space of finite state automata with up to eight states, he was able to induce a recognizer for each of these languages (Tomita, 1982). 
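
As a cross-check on the language definitions above, the following minimal sketch enumerates every string over {0,1} of length 10 or less and counts the grammatical ones for L1, L2, L4, L6 and L7. The names LANGUAGES, all_strings and count_grammatical are hypothetical helpers introduced only for this illustration; L3 and L5 are left out because their one-line descriptions leave more room for interpretation. Under these readings the counts agree with the grammatical totals in Table 1 (11, 6, 1103, 683 and 561 out of 2047 strings).

```python
import re
from itertools import product

# Membership tests for five of the seven languages, written from the
# definitions listed above (L3 and L5 are omitted from this sketch).
LANGUAGES = {
    1: lambda s: re.fullmatch('1*', s) is not None,
    2: lambda s: re.fullmatch('(10)*', s) is not None,
    4: lambda s: '000' not in s,  # not more than two 0's in a row
    6: lambda s: abs(s.count('1') - s.count('0')) % 3 == 0,
    7: lambda s: re.fullmatch('0*1*0*1*', s) is not None,
}

def all_strings(max_len):
    # Every string over {0,1} of length 0..max_len, null string included.
    for n in range(max_len + 1):
        for bits in product('01', repeat=n):
            yield ''.join(bits)

def count_grammatical(max_len=10):
    # Returns the total number of strings and, per language, how many
    # of them are grammatical.
    strings = list(all_strings(max_len))
    counts = {k: sum(1 for s in strings if accept(s))
              for k, accept in LANGUAGES.items()}
    return len(strings), counts
```

Calling count_grammatical() with the default length gives the total number of test strings together with a per-language dictionary, which can be compared row by row with Table 1.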
\nRecognizers of finite-state languages have also been induced using first-order recurrent connectionist networks (Elman, 1990; Williams and Zipser, 1988; Cleeremans, Servan-Schreiber and McClelland, 1989). Generally speaking, these results were obtained by training the network to predict the next symbol (Cleeremans, Servan-Schreiber and McClelland, 1989; Williams and Zipser, 1988), rather than by training the network to accept or reject strings of different lengths. Several training algorithms used an approximation to the gradient (Elman, 1990; Cleeremans, Servan-Schreiber and McClelland, 1989) by truncating the computation of the backward recurrence.\n\nThe problem of inducing languages from examples has also been approached using second-order recurrent networks (Pollack, 1990; Giles et al., 1990). Using a truncated approximation to the gradient, and Tomita's training sets, Pollack reported that \"none of the ideal languages were induced\" (Pollack, 1990). On the other hand, a Tomita language has been induced using the complete gradient (Giles et al., 1991). This paper reports the induction of several Tomita languages and the extraction of the corresponding automata with certain differences in method from (Giles et al., 1991).\n\n2 Method\n\n2.1 Architecture\n\nThe network model consists of one input unit, one threshold unit, N state units and one output unit. The output unit and each state unit receive a first-order connection from the input unit and the threshold unit. In addition, each of the output and state units receives a second-order connection for each pairing of the input and threshold unit with each of the state units. For N = 3, the model is mathematically identical to that used by Pollack (Pollack, 1990); it has 32 free parameters. 
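
The parameter count follows directly from the connectivity just described: each of the N + 1 receiving units has 2 first-order weights and 2N second-order weights, giving 2(N + 1)^2 weights in total. The sketch below checks this count and shows one possible update step. The function names, the logistic activation and the synchronous update of state and output units from the previous state are assumptions made for illustration; the text above does not specify them.

```python
import math

def num_weights(n_state):
    # Each of the n_state + 1 receiving units (state units plus output) has
    # 2 first-order weights (input, threshold) and 2 * n_state second-order
    # weights (input or threshold paired with each state unit).
    return (n_state + 1) * (2 + 2 * n_state)

def step(weights, state, x, bias=1.0):
    # weights[j] = (first_order, second_order) for receiving unit j, where
    # first_order has 2 entries and second_order has 2 rows of n_state entries.
    # Assumption: logistic units, all updated synchronously from the old state.
    drivers = (x, bias)
    new_values = []
    for first_order, second_order in weights:
        net = sum(w * d for w, d in zip(first_order, drivers))
        for d, row in zip(drivers, second_order):
            net += sum(w * d * s for w, s in zip(row, state))
        new_values.append(1.0 / (1.0 + math.exp(-net)))
    # The last receiving unit is the output; the rest form the new state.
    return new_values[:-1], new_values[-1]
```

With this layout num_weights(3) gives 32, matching the count stated for N = 3.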
\n\n2.2 Data Representation\n\nThe symbols of the language are represented by byte values, which are mapped into real values between 0 and 1 by dividing by 255. Thus, the ZERO symbol is represented by octal 040 (0.1255). This value was chosen to be different from 0.0, which is used as the initial condition for all units except the threshold unit, which is set to 1.0. The ONE symbol was chosen as octal 370 (0.97255). All strings are terminated by two occurrences of a termination symbol that has the value 0.0.\n\nLanguage   Grammatical (length ≤ 10)   Ungrammatical (length ≤ 10)\n           Total    Training           Total    Training\n1             11           9            2036           8\n2              6           5            2041          10\n3            652          11            1395          11\n4           1103          10             944           7\n5            683           9            1364          11\n6            683          10            1364          11\n7            561          11            1486           6\n\nLonger strings in training set (values as listed in the source; row alignment not recoverable): grammatical 1, 2, 1, 2; ungrammatical 1, 2, 1, 1, 2.\n\nTable 1: Number of grammatical and ungrammatical strings of length 10 or less for Tomita languages and number of those included in the Tomita training sets.\n\n2.3 Training\n\nThe Tomita languages are characterized in Table 1 by the number of grammatical strings of length 10 or less (out of a total of 2047 strings). The Tomita training sets are also characterized by the number of grammatical strings of length 10 or less included in the training data. For completeness, the Table also shows the number of grammatical strings in the training set of length greater than 10. 
A comparison of the number of grammatical strings with the number included in the training set shows that while Languages 1 and 2 are very sparse, they are almost completely covered by the training data, whereas Languages 3-7 are more dense, and are sparsely covered by the training sets. Possible consequences of these differences are considered in discussing the experimental results.\n\nA mean-squared error measure was defined with target values of 0.9 and 0.1 for accept and reject, respectively. The target function was weighted so that error was injected only at the end of the string.\n\nThe complete gradient of this error measure for the recurrent network was computed by a method of accumulating the weight dependencies backward in time (Watrous, Ladendorf and Kuhn, 1990). This is in contrast to the truncated gradient used by Pollack (Pollack, 1990) and to the forward-propagation algorithm used by Giles (Giles et al., 1991).\n\nThe networks were optimized by gradient descent using the BFGS algorithm. A termination criterion of 10^-10 was set; it was believed that such a strict tolerance might lead to smaller loss of accuracy on very long strings. No constraints were set on the number of iterations.\n\nFive networks with different sets of random initial weights were trained separately on each of the seven languages described by Tomita using exactly his training sets (Tomita, 1982), including the null string. The training set used by Pollack (Pollack, 1990) differs only in not including the null string.\n\n2.4 Testing\n\nThe networks were tested on the complete set of strings up to length 10. 
Acceptance of a string was defined as the network having a final output value of greater than 0.9 - T and rejection as a final value of less than 0.1 + T, where 0 < T < 0.4 is the tolerance. The decision was considered ambiguous otherwise.\n\n3 Results\n\nThe results of the first experiment are summarized in Table 2. For each language, each network is listed by the seed value used to initialize the random weights. For each network, the iterations to termination are listed, followed by the minimum MSE value reached. Also listed is the percentage of strings of length 10 or less that were correctly recognized by the network, and the percentage of strings for which the decision was uncertain at a tolerance of 0.0.\n\nThe number of iterations until termination varied widely, from 28 to 37909. There is no obvious correlation between number of iterations and minimum MSE.\n\n3.1 Language 1\n\nIt may be observed that Language 1 is recognized correctly by two of the networks (seeds 72 and 987235) and nearly correctly by a third (seed 239). This latter network failed on the strings 1^9 and 1^10, neither of which was in the training set.\n\nThe network of seed 72 was further tested on all strings of length 15 or less and made no errors. This network was also tested on a string of 100 ones and showed no diminution of output value over the length of the string. When tested on strings of 99 ones plus either an initial zero or a final zero, the network also made no errors. Another network, seed 987235, made no errors on strings of length 15 or less but failed on the string of 100 ones. The hidden units broke into oscillation after about the 30th input symbol and the output fell into a low amplitude oscillation near zero.
\n\n3.2 Language 2\n\nSimilarly, Language 2 was recognized correctly by two networks (seeds 89340 and 987235) and nearly correctly by a third network (seed 104). The latter network failed only on strings of the form (10)*010, none of which were included in the training data.\n\nThe networks that performed perfectly on strings up to length 10 were tested further on all strings up to length 15 and made no errors. These networks were also tested on a string of 100 alternations of 1 and 0, and responded correctly. Changing the first or final zero to a one caused both networks correctly to reject the string.\n\n3.3 The Other Languages\n\nFor most of the other languages, at least one network converged to a very low MSE value. However, networks that performed perfectly on the training set did not generalize well to a definition of the language. For example, for Language 3, the network with seed 104 reached an MSE of 8 x 10^-10 at termination, yet the performance on the test set was only 78.31%. One interpretation of this outcome is that the intended language was not sufficiently constrained by the training set.
\n\nLanguage  Seed    Iterations  MSE           Accuracy (%)  Uncertainty (%)\n1         72      28          0.0012500000  100.00        0.00\n1         104     95          0.0215882357  78.07         20.76\n1         239     8707        0.0005882353  99.90         0.00\n1         89340   5345        0.0266176471  66.93         0.00\n1         987235  994         0.0000000001  100.00        0.00\n2         72      5935        0.0005468750  93.36         4.93\n2         104     4081        0.0003906250  99.80         0.20\n2         239     807         0.0476171875  62.73         37.27\n2         89340   1084        0.0005468750  100.00        0.00\n2         987235  1(\"06       0.0001562500  100.00        0.00\n3         72      442         0.0149000000  47.09         33.27\n3         104     37909       0.0000000008  78.31         0.15\n3         239     9264        0.0087000000  74.60         11.87\n3         89340   8250        0.0005000000  73.57         0.00\n3         987235  5769        0.0136136712  50.76         23.94\n4         72      8630        0.0004375001  52.71         6.45\n4         104     60          0.0624326924  20.86         50.02\n4         239     2272        0.0005000004  55.40         9.38\n4         89340   10680       0.0003750001  60.92         15.53\n4         987235  324         0.0459375000  22.62         77.38\n5         72      890         0.0526912920  34.39         63.80\n5         104     368         0.0464772727  45.92         41.62\n5         239     1422        0.0487500000  31.46         36.93\n5         89340   2775        0.0271525856  46.12         22.52\n5         987235  2481        0.0209090867  66.83         2.49\n6         72      524         0.0788760972  0.05          99.95\n6         104     332         0.0789530751  0.05          99.95\n6         239     1355        0.0229551248  31.95         47.04\n6         89340   8171        0.0001733280  46.21         5.32\n6         987235  306         0.0577867426  37.71         24.87\n7         72      373         0.0588385157  9.38          86.08\n7         104     8578        0.0104224185  55.74         17.00\n7         239     969         0.0211073814  52.76         26.58\n7         89340   4259        0.0007684520  54.42         0.49\n7         987235  666         0.0688690476  12.55         74.94\n\nTable 2: Results of Training Three State-Unit Network from 5 Random Starts on Tomita Languages Using Tomita Training Data\n\nIn the case of Language 5, in no case was the MSE reduced below 0.02. 
We believe that the model is sufficiently powerful to compute the language. It is possible, however, that the power of the model is marginally sufficient, so that finding a solution depends critically upon the initial conditions.\n\nSeed    Iterations  MSE           Accuracy (%)  Uncertainty (%)\n72      215         0.0000001022  100.00        0.00\n104     665         0.0000000001  99.85         0.05\n239     205         0.0000000001  99.90         0.10\n89340   5244        0.0005731708  99.32         0.10\n987235  2589        0.0004624581  92.13         6.55\n\nTable 3: Results of Training Three State-Unit Network from 5 Random Starts on Tomita Language 4 Using Probabilistic Training Data (p = 0.1)\n\n4 Further Experiments\n\nThe effect of additional training data was investigated by creating training sets in which each string of length 10 or less is randomly included with a fixed probability p. Thus, for p = 0.1 approximately 10% of 2047 strings are included in the training set. A flat random sampling of the lexicographic domain may not be the best approach, however, since grammaticality can vary non-uniformly.\n\nThe same networks as before were trained on the larger training set for Language 4, with the results listed in Table 3.\n\nUnder these conditions, a network solution was obtained that generalizes perfectly to the test set (seed 72). This network also made no errors on strings up to length 15. However, very low MSE values were again obtained for networks that do not perform perfectly on the test data (seeds 104 and 239). Network 239 made two ambiguous decisions that would have been correct at a tolerance value of 0.23. Network 104 incorrectly accepted the strings 000 and 1000 and would have correctly accepted the string 0100 at a tolerance of 0.25. Both networks made no additional errors on strings up to length 15. 
The training data may still be slightly indeterminate. Moreover, the few errors made were on short strings, which are not included in the training data.\n\nSince this network model is continuous, and thus potentially infinite state, it is perhaps not surprising that the successful induction of a finite state language seems to require more training data than was needed for Tomita's finite state model (Tomita, 1982).\n\nThe effect of more complex models was investigated for Language 5 using a network with 11 state units; this increases the number of weights from 32 to 288. Networks of this type were optimized from 5 random initial conditions on the original training data. The results of this experiment are summarized in Table 4. By increasing the complexity of the model, convergence to low MSE values was obtained in every case, although none of these networks generalized to the desired language. Once again, it is possible that more data is required to constrain the language sufficiently.\n\nSeed    Iterations  MSE           Accuracy (%)  Uncertainty (%)\n72      1327        0.0002840909  53.00         11.87\n104     680         0.0001136364  39.47         16.32\n239     357         0.0006818145  61.31         3.32\n89340   122         0.0068189264  63.36         6.64\n987235  4502        0.0001704545  48.41         16.95\n\nTable 4: Results of Training Network with 11 State-Units from 5 Random Starts on Tomita Language 5 Using Tomita Training Data\n\n5 FSA Extraction\n\nThe following method for extracting a deterministic finite-state automaton corresponding to an optimized network was developed:\n\n1. Record the response of the network to a set of strings.\n2. Compute a zero bin-width histogram for each hidden unit and partition each histogram so that the intervals between adjacent peaks are bisected.\n3. 
Initialize a state-transition table which is indexed by the current state and input symbol; then, for each string:\n   (a) Starting from the NULL state, for each hidden unit activation vector:\n      i. Obtain the next state label from the concatenation of the histogram interval number of each hidden unit value.\n      ii. Record the next state in the state-transition table. If a transition is recorded from the same state on the same input symbol to two different states, move or remove hidden unit histogram partitions so that the two states are collapsed and go to 3; otherwise, update the current state.\n   (b) At the end of the string, mark the current state as accept, reject or uncertain according as the output unit is ≥ 0.9, ≤ 0.1 or otherwise. If the current state has already received a different marking, move or insert histogram partitions so that the offending state is subdivided and go to 3.\n\nIf the recorded strings are processed successfully, then the resulting state-transition table may be taken as an FSA interpretation of the optimized network. The FSA may then be minimized by standard methods (Giles et al., 1991). If no histogram partition can be found such that the process succeeds, the network may not have a finite-state interpretation.\n\nAs an approximation to Step 3, the hidden unit vector was labeled by the index of that vector in an initially empty set of reference vectors for which each component value was within some global threshold (θ) of the hidden unit value. If no such reference vector was found, the observed vector was added to the reference set. The threshold θ could be raised or lowered as states needed to be collapsed or subdivided.
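
The approximate labeling just described can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the names label and build_table are hypothetical, a component counts as within the threshold when the absolute difference is at most theta, the first matching reference vector wins, and the NULL state is represented by -1. A return value of None signals a conflicting transition, that is, a case in which theta would have to be adjusted and the pass repeated.

```python
def label(vector, references, theta):
    # Index of the first reference vector whose every component lies within
    # theta of the observed vector; otherwise the vector becomes a new state.
    for index, ref in enumerate(references):
        if all(abs(a - b) <= theta for a, b in zip(vector, ref)):
            return index
    references.append(list(vector))
    return len(references) - 1

def build_table(runs, theta):
    # runs: list of (symbols, vectors) pairs, where vectors[t] is the hidden
    # vector recorded after reading symbols[t]. State -1 stands for NULL.
    references, table = [], {}
    for symbols, vectors in runs:
        current = -1
        for symbol, vector in zip(symbols, vectors):
            nxt = label(vector, references, theta)
            if table.setdefault((current, symbol), nxt) != nxt:
                return None  # same state and symbol lead to two states
            current = nxt
    return table
```

Marking states as accept, reject or uncertain from the final output value, and minimizing the resulting automaton, are left out of this sketch.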
\n\nUsing the approximate method, for Language 1, the correct and minimal FSA was extracted from one network (seed 72, θ = 0.1). The correct FSA was also extracted from another network (seed 987235, θ = 0.06), although for no partition of the hidden unit activation values could the minimal FSA be extracted. Interestingly, the FSA extracted from the network with seed 239 corresponded to 1^n for n < 8. Also, the FSA for another network (seed 89340, θ = 0.0003) was nearly correct, although the string accuracy was only 67%; one state was wrongly labeled \"accept\".\n\nFor Language 2, the correct and minimal FSA was extracted from one network (seed 987235, θ = 0.00001). A correct FSA was also extracted from another network (seed 89340, θ = 0.0022), although this FSA was not minimal.\n\nFor Language 4, a histogram partition was found for one network (seed 72) that led to the correct and minimal FSA; for the zero-width histogram, the FSA was correct, but not minimal.\n\nThus, a correct FSA was extracted from every optimized network that correctly recognized strings of length 10 or less from the language for which it was trained. However, in some cases, no histogram partition was found for which the extracted FSA was minimal. It also appears that an almost-correct FSA can be extracted, which might perhaps be corrected externally. And, finally, the extracted FSA may be correct, even though the network might fail on very long strings.\n\n6 Conclusions\n\nWe have succeeded in recognizing several simple finite state languages using second-order recurrent networks and extracting corresponding finite-state automata. We consider the computation of the complete gradient a key element in this result.
\n\nAcknowledgements\n\nWe thank Lee Giles for sharing with us their results (Giles et al., 1991).\n\nReferences\n\nCleeremans, A., Servan-Schreiber, D., and McClelland, J. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3):372-381.\n\nElman, J. L. (1990). Finding structure in time. Cognitive Science, 14:179-212.\n\nGiles, C. L., Chen, D., Miller, C. B., Chen, H. H., Sun, G. Z., and Lee, Y. C. (1991). Second-order recurrent neural networks for grammatical inference. In Proceedings of the International Joint Conference on Neural Networks, volume II, pages 273-281.\n\nGiles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C., and Chen, D. (1990). Higher order recurrent networks and grammatical inference. In Touretzky, D. S., editor, Advances in Neural Information Systems 2, pages 380-387. Morgan Kaufmann.\n\nPollack, J. B. (1990). The induction of dynamical recognizers. Technical Report 90-JP-AUTOMATA, Ohio State University.\n\nTomita, M. (1982). Dynamic construction of finite automata from examples using hill-climbing. In Proceedings of the Fourth International Cognitive Science Conference, pages 105-108.\n\nWatrous, R. L., Ladendorf, B., and Kuhn, G. M. (1990). Complete gradient optimization of a recurrent network applied to /b/, /d/, /g/ discrimination. Journal of the Acoustical Society of America, 87(3):1301-1309.\n\nWilliams, R. J. and Zipser, D. (1988). A learning algorithm for continually running fully recurrent neural networks. Technical Report ICS Report 8805, UCSD Institute for Cognitive Science.", "award": [], "sourceid": 560, "authors": [{"given_name": "Raymond", "family_name": "Watrous", "institution": null}, {"given_name": "Gary", "family_name": "Kuhn", "institution": null}]}