{"title": "Hierarchical Recurrent Neural Networks for Long-Term Dependencies", "book": "Advances in Neural Information Processing Systems", "page_first": 493, "page_last": 499, "abstract": null, "full_text": "Hierarchical Recurrent Neural Networks for Long-Term Dependencies\n\nSalah El Hihi\nDept. Informatique et Recherche Operationnelle\nUniversite de Montreal\nMontreal, Qc H3C-3J7\nelhihi@iro.umontreal.ca\n\nYoshua Bengio*\nDept. Informatique et Recherche Operationnelle\nUniversite de Montreal\nMontreal, Qc H3C-3J7\nbengioy@iro.umontreal.ca\n\nAbstract\n\nWe have already shown that extracting long-term dependencies from sequential\ndata is difficult, both for deterministic dynamical systems such\nas recurrent networks, and probabilistic models such as hidden Markov\nmodels (HMMs) or input/output hidden Markov models (IOHMMs). In\npractice, to avoid this problem, researchers have used domain-specific\na-priori knowledge to give meaning to the hidden or state variables representing\npast context. In this paper, we propose to use a more general\ntype of a-priori knowledge, namely that the temporal dependencies are\nstructured hierarchically. This implies that long-term dependencies are\nrepresented by variables with a long time scale. This principle is applied\nto a recurrent network which includes delays and multiple time scales. Experiments\nconfirm the advantages of such structures. A similar approach\nis proposed for HMMs and IOHMMs.\n\n1 Introduction\n\nLearning from examples basically amounts to identifying the relations between random\nvariables of interest. Several learning problems involve sequential data, in which the variables\nare ordered (e.g., time series). 
Many learning algorithms take advantage of this\nsequential structure by assuming some kind of homogeneity or continuity of the model\nover time, e.g., by sharing parameters for different times, as in Time-Delay Neural Networks\n(TDNNs) (Lang, Waibel and Hinton, 1990), recurrent neural networks (Rumelhart,\nHinton and Williams, 1986), or hidden Markov models (Rabiner and Juang, 1986). This\ngeneral a-priori assumption considerably simplifies the learning problem.\nIn previous papers (Bengio, Simard and Frasconi, 1994; Bengio and Frasconi, 1995a), we\nhave shown for recurrent networks and Markovian models that, even with this assumption,\ndependencies that span longer intervals are significantly harder to learn. In all of the\nsystems we have considered for learning from sequential data, some form of representation\nof context (or state) is required (to summarize all \"useful\" past information). The \"hard\nlearning\" problem is to learn to represent context, which involves performing the proper\ncredit assignment through time. Indeed, in practice, recurrent networks (e.g., injecting\nprior knowledge for grammar inference (Giles and Omlin, 1992; Frasconi et al., 1993))\nand HMMs (e.g., for speech recognition (Levinson, Rabiner and Sondhi, 1983; Rabiner\nand Juang, 1986)) work quite well when the representation of context (the meaning of the\nstate variable) is decided a-priori. The hidden variable is then no longer completely hidden,\nand learning becomes much easier. Unfortunately, this requires a very precise knowledge of\nthe appropriate state variables, which is not available in many applications.\n\n* Also, AT&T Bell Labs, Holmdel, NJ 07733\n
\nWe have seen that the successes of TDNNs, recurrent networks and HMMs are based on a\ngeneral assumption on the sequential nature of the data. In this paper, we propose another,\nsimple, a-priori assumption on the sequences to be analyzed: the temporal dependencies\nhave a hierarchical structure. This implies that dependencies spanning long intervals are\n\"robust\" to small local changes in the timing of events, whereas dependencies spanning\nshort intervals are allowed to be more sensitive to the precise timing of events. This yields\na multi-resolution representation of state information. This general idea is not new and\ncan be found in various approaches to learning and artificial intelligence. For example, in\nconvolutional neural networks, both for sequential data with TDNNs (Lang, Waibel and\nHinton, 1990), and for 2-dimensional data with MLCNNs (LeCun et al., 1989; Bengio,\nLeCun and Henderson, 1994), the network is organized in layers representing features\nof increasing temporal or spatial coarseness. Similarly, mostly as a tool for analyzing\nand preprocessing sequential or spatial data, wavelet transforms (Daubechies, 1990) also\nrepresent such information at multiple resolutions. Multi-scale representations have also\nbeen proposed to improve reinforcement learning systems (Singh, 1992; Dayan and Hinton,\n1993; Sutton, 1995) and path planning systems. However, with these algorithms, one\ngenerally assumes that the state of the system is observed, whereas, in this paper we\nconcentrate on the difficulty of learning what the state variable should represent. A\nrelated idea using a hierarchical structure was presented in (Schmidhuber, 1992). 
\nOn the HMM side, several researchers (Brugnara et al., 1992; Suaudeau, 1994) have\nattempted to improve HMMs for speech recognition to better model the different types\nof variables, intrinsically varying at different time scales in speech. In those papers, the\nfocus was on setting an a-priori representation, not on learning how to represent context.\nIn section 2, we attempt to draw a common conclusion from the analyses performed on\nrecurrent networks and HMMs to learn to represent long-term dependencies. This will\njustify the proposed approach, presented in section 3. In section 4 a specific hierarchical\nmodel is proposed for recurrent networks, using different time scales for different layers of\nthe network. Experiments performed with this model are described in section 4. Finally,\nwe discuss a similar scheme for HMMs and IOHMMs in section 5.\n\n2 Too Many Products\nIn this section, we take another look at the analyses of (Bengio, Simard and Frasconi, 1994)\nand (Bengio and Frasconi, 1995a), for recurrent networks and HMMs respectively. The\nobjective is to draw a parallel between the problems encountered with the two approaches,\nin order to guide us towards some form of solution, and justify the proposals made here.\nFirst, let us consider the deterministic dynamical systems (Bengio, Simard and Frasconi,\n1994) (such as recurrent networks), which map an input sequence $u_1, \\ldots, u_T$ to an output\nsequence $y_1, \\ldots, y_T$. The state or context information is represented at each time $t$ by a\nvariable $x_t$, for example the activities of all the hidden units of a recurrent network:\n\n$$x_t = f(x_{t-1}, u_t) \\qquad (1)$$\n\nwhere $u_t$ is the system input at time $t$ and $f$ is a differentiable function (such as\n$\\tanh(W x_{t-1} + u_t)$). When the sequence of inputs $u_1, u_2, \\ldots$
, $u_T$ is given, we can write\n$x_t = f_t(x_{t-1}) = f_t(f_{t-1}(\\ldots f_1(x_0)) \\ldots)$. A learning criterion $C_t$ yields gradients on outputs,\nand therefore on the state variables $x_t$. Since parameters are shared across time,\nlearning using a gradient-based algorithm depends on the influence of parameters $W$ on\n$C_t$ through all time steps before $t$:\n\n$$\\frac{\\partial C_t}{\\partial W} = \\sum_{\\tau} \\frac{\\partial C_t}{\\partial x_t} \\frac{\\partial x_t}{\\partial x_\\tau} \\frac{\\partial x_\\tau}{\\partial W} \\qquad (2)$$\n\nThe Jacobian matrix of derivatives $\\frac{\\partial x_t}{\\partial x_\\tau}$ can further be factored as follows:\n\n$$\\frac{\\partial x_t}{\\partial x_\\tau} = \\frac{\\partial x_t}{\\partial x_{t-1}} \\frac{\\partial x_{t-1}}{\\partial x_{t-2}} \\cdots \\frac{\\partial x_{\\tau+1}}{\\partial x_\\tau} = f'_t f'_{t-1} \\cdots f'_{\\tau+1} \\qquad (3)$$\n\nOur earlier analysis (Bengio, Simard and Frasconi, 1994) shows that the difficulty revolves\naround the matrix product in equation 3. In order to reliably \"store\" information in the\ndynamics of the network, the state variable $x_t$ must remain in regions where $|f'_t| < 1$\n(i.e., near enough to a stable attractor representing the stored information). However, the\nabove products then rapidly converge to 0 when $t - \\tau$ increases. Consequently, the sum\nin equation 2 is dominated by terms corresponding to short-term dependencies ($t - \\tau$ is small).\nLet us now consider the case of Markovian models (including HMMs and IOHMMs (Bengio\nand Frasconi, 1995b)). These are probabilistic models, either of an \"output\"\nsequence $P(y_1 \\ldots y_T)$ (HMMs) or of an output sequence given an input sequence\n$P(y_1 \\ldots y_T \\mid u_1 \\ldots u_T)$ (IOHMMs). Introducing a discrete state variable $x_t$ and using\nMarkovian assumptions of independence, this probability can be factored in terms of transition\nprobabilities $P(x_t \\mid x_{t-1})$ (or $P(x_t \\mid x_{t-1}, u_t)$) and output probabilities $P(y_t \\mid x_t)$ (or\n$P(y_t \\mid x_t, u_t)$). 
According to the model, the distribution of the state $x_t$ at time $t$ given the\nstate $x_\\tau$ at an earlier time $\\tau$ is given by the matrix\n\n$$P(x_t \\mid x_\\tau) = P(x_t \\mid x_{t-1}) P(x_{t-1} \\mid x_{t-2}) \\cdots P(x_{\\tau+1} \\mid x_\\tau) \\qquad (4)$$\n\nwhere each of the factors is a matrix of transition probabilities (conditioned on inputs in\nthe case of IOHMMs). Our earlier analysis (Bengio and Frasconi, 1995a) shows that the\ndifficulty in representing and learning to represent context (i.e., learning what $x_t$ should\nrepresent) revolves around equation 4. The matrices in the above equation have one\neigenvalue equal to 1 (because of the normalization constraint) and the others $\\le 1$. In\nthe case in which all eigenvalues are 1 the matrices have only 1's and 0's, i.e., we obtain\ndeterministic dynamics for IOHMMs or pure cycles for HMMs (which cannot be used to\nmodel most interesting sequences). Otherwise the above product converges to a lower\nrank matrix (some or most of the eigenvalues converge toward 0). Consequently, $P(x_t \\mid x_\\tau)$\nbecomes more and more independent of $x_\\tau$ as $t - \\tau$ increases. Therefore, both representing\nand learning context become more difficult as the span of dependencies increases or when\nthe Markov model is more non-deterministic (transition probabilities not close to 0 or 1).\nClearly, a common trait of both analyses lies in taking too many products, too many time\nsteps, or too many transformations to relate the state variable at time $\\tau$ with the state variable\nat time $t > \\tau$, as in equations 3 and 4. Therefore the idea presented in the next section\nis centered on allowing several paths between $x_\\tau$ and $x_t$, some with few \"transformations\"\nand some with many transformations. 
At least through those with few transformations,\nwe expect context information (forward), and credit assignment (backward) to propagate\nmore easily over longer time spans than through \"paths\" involving many transformations.\n\n3 Hierarchical Sequential Models\nInspired by the above analysis we introduce an assumption about the sequential data to\nbe modeled, although it will be a very simple and general a-priori on the structure of the\ndata. Basically, we will assume that the sequential structure of data can be described\nhierarchically: long-term dependencies (e.g., between two events remote from each other\nin time) do not depend on a precise time scale (i.e., on the precise timing of these events).\nConsequently, in order to represent a context variable taking these long-term dependencies\ninto account, we will be able to use a coarse time scale (or a slowly changing state variable).\nTherefore, instead of a single homogeneous state variable, we will introduce several levels\nof state variables, each \"working\" at a different time scale. To implement in a discrete-time\nsystem such a multi-resolution representation of context, two basic approaches can\nbe considered. Either the higher level state variables change value less often or they\nare constrained to change more slowly at each time step.\n\nFigure 1: Four multi-resolution recurrent architectures used in the experiments. Small\nsquares represent a discrete delay, and numbers near each neuron represent its time scale.\nThe architectures B to E have respectively 2, 3, 4, and 6 time scales.\n\nIn our experiments, we have\nconsidered input and output variables both at the shortest time scale (highest frequency),\nbut one of the potential advantages of the approach presented here is that it becomes very 
\nsimple to incorporate input and output variables that operate at different time scales.\nFor example, in speech recognition and synthesis, the variables of interest are not only\nthe speech signal itself (fast) but also slower-varying variables such as prosodic (average\nenergy, pitch, etc.) and phonemic (place of articulation, phoneme duration) variables.\nAnother example is in the application of learning algorithms to financial and economic\nforecasting and decision making. Some of the variables of interest are given daily, others\nweekly, monthly, etc.\n\n4 Hierarchical Recurrent Neural Network: Experiments\nAs in TDNNs (Lang, Waibel and Hinton, 1990) and reverse-TDNNs (Simard and LeCun,\n1992), we will use discrete time delays and subsampling (or oversampling) in order to\nimplement the multiple time scales. In the time-unfolded network, paths going through\nthe recurrences in the slowly varying units (long time scale) will carry context farther,\nwhile paths going through faster varying units (short time scale) will respond faster to\nchanges in input or desired changes in output. Examples of such multi-resolution recurrent\nneural networks are shown in Figure 1. Two sets of simple experiments were performed to\nvalidate some of the ideas presented in this paper. In both cases, we compare a hierarchical\nrecurrent network with a single-scale fully-connected recurrent network.\nIn the first set of experiments, we want to evaluate the performance of a hierarchical\nrecurrent network on a problem already used for studying the difficulty in learning long-term\ndependencies (Bengio, Simard and Frasconi, 1994; Bengio and Frasconi, 1994). In 
In \nthis 2-class J?roblem, the network has to detect a pattern at the beginning of the sequence, \nkeeping  a  blt  of information in  \"memory\"  (while  the  inputs  are  noisy)  until  the  end  of \nthe sequence  (supervision  is  only  a  the end of the sequence).  As  in  (Bengio,  Simard and \nFrasconi,  1994; Bengio and Frasconi,  1994) only the first  3 time steps contain information \nabout the class  (a 3-number pattern was  randomly chosen for  each  class within [-1,1]3). \nThe length  of the  sequences  is  varied  to evaluate  the effect  of the  span  of input/output \ndependencies.  Uniformly  distributed  noisy  inputs  between  -.1  and  .1  are  added  to  the \ninitial patterns as  well  as  to the remainder of the sequence.  For each  sequence  length,  10 \ntrials were run with different initial weights and noise patterns, with 30 training sequences. \nExperiments were  performed with sequence  of lengths  10,  20,40 and  100. \nSeveral  recurrent  network  architectures  were  compared.  All  were  trained  with  the  same \nalgorithm  (back-propagation  through  time)  to  minimize  the  sum  of squared  differences \nbetween  the final  output and a  desired  value.  The simplest architecture  (A)  is  similar to \narchitecture  B  in  Figure  1 but it is  not hierarchical:  it has  a  single  time scale.  Like  the \n\n\fHierarchical Recurrent Neural Networks for Long-term Dependencies \n\n497 \n\neo \n\n50 \n\n40 \n\nl \n~  30 \n\n20 \n\n1.4 \n1.3 \n1.2 \n1.1 \n1.0 \n0.9 \nIs  0.8 \nIi  0.7 \nI  0.6 \n\n0.5 \n0.4 \n\nABCDE  ABCDE  ABCDE  ABCDE \n\nseq.1engIh \n\n10 \n\n20 \n\n40 \n\n100 \n\n~~ '-----'-I....L.~~~....L.-.W....L.mL.......J....I.ll \n\nABCDE  ABCDE  ABCDE  ABCDE \n\n1IIq. 
1engIh \n\n10 \n\n20 \n\n40 \n\n100 \n\nFigure 2:  Average classification error after training for  2-sequence problem (left, classifica(cid:173)\ntion error)  and network-generated data (right,  mean squared  error), for  varying sequence \nlengths and architectures.  Each set of 5 consecutive  bars represents  the performance of 5 \narchitectures  A  to E,  with  respectively  1,  2,  3,  4  and  6  time scales  (the architectures  B \nto E  are shown  in  Figure 1).  Error  bars show  the standard deviation over  10  trials. \n\nother  networks,  it has  however  a  theoretically  \"sufficient\"  architecture,  i.e.,  there  exists \na  set  of weights  for  which  it  classifies  perfectly  the  trainin~ sequences.  Four of the  five \narchitectures  that  we  compared  are  shown  in  Figure  1,  wIth  an  increasing  number  of \nlevels  in  the  hierarchy.  The performance of these  four  architectures  (B  to E)  as  well  as \nthe  architecture  with  a  single  time-scale  (A)  are  compared  in  Figure  2  (left,  for  the  2-\nsequence  problem).  Clearly,  adding  more levels  to the  hierarchy  has significantly helped \nto reduce  the difficulty in  learning long-term dependencies. \nIn  a  second  set  of experiments,  a  hierarchical  recurrent  network  with  4  time scales  was \ninitialized with random (but large)  weights  and  used  to generate  a  data set.  To generate \nthe inputs  as  well  as  the  outputs,  the  network  has  feedback  links from  hidden  to input \nunits.  At  the  initial  time  step  as  well  as  at  5%  of the  time  steps  (chosen  randomly), \nthe input was  clamped  with  random  values  to introduce some further  variability.  It is  a \nregression  task, and the mean squared error is shown on Figure 2.  Because of the network \nstructure, we expect the data to contain long-term dependencies  that can be modeled with \na hierarchical structure.  
100 training sequences of length 10, 20, 40 and 100 were generated\nby this network. The same 5 network architectures as in the previous experiments were\ncompared (see Figure 1 for architectures B to E), with 10 training trials per network and\nper sequence length. The results are summarized in Figure 2 (right). More high-level\nhierarchical structure appears to have improved performance for long-term dependencies.\nThe fact that the simpler 1-level network does not achieve a good performance suggests\nthat there were some difficult long-term dependencies in the artificially generated data\nset. It is interesting to compare those results with those reported in (Lin et al., 1995) which\nshow that using longer delays in certain recurrent connections helps learning longer-term\ndependencies. In both cases we find that introducing longer time scales allows learning\ndependencies whose span is proportionally longer.\n\n5 Hierarchical HMMs\nHow do we represent multiple time scales with a HMM? Some solutions have already been\nproposed in the speech recognition literature, motivated by the obvious presence of different\ntime scales in the speech phenomena. In (Brugnara et al., 1992) two Markov chains\nare coupled in a \"master/slave\" configuration. For the \"master\" HMM, the observations\nare slowly varying features (such as the signal energy), whereas for the \"slave\" HMM the\nobservations are the speech spectra themselves. The two chains are synchronous and operate\nat the same time scale, therefore the problem of diffusion of credit in HMMs would\nprobably also make the learning of long-term dependencies difficult. Note on the other 
hand that in most applications of HMMs to speech recognition the meaning of states is\nfixed a-priori rather than learned from the data (see (Bengio and Frasconi, 1995a) for a\ndiscussion). In a more recent contribution, Nelly Suaudeau (Suaudeau, 1994) proposes a\n\"two-level HMM\" in which the higher level HMM represents \"segmental\" variables (such\nas phoneme duration). The two levels operate at different scales: the higher level state\nvariable represents the phonetic identity and models the distributions of the average energy\nand the duration within each phoneme. Again, this work is not geared towards learning a\nrepresentation of context, but rather, given the traditional (phoneme-based) representation\nof context in speech recognition, towards building a better model of the distribution\nof \"slow\" segmental variables such as phoneme duration and energy. Another promising\napproach was recently proposed in (Saul and Jordan, 1995). Using decimation techniques\nfrom statistical mechanics, a polynomial-time algorithm is derived for parallel Boltzmann\nchains (which are similar to parallel HMMs), which can operate at different time scales.\nThe ideas presented here point toward a HMM or IOHMM in which the (hidden) state\nvariable $x_t$ is represented by the Cartesian product of several state variables $x_t^s$, each\n\"working\" at a different time scale: $x_t = (x_t^1, x_t^2, \\ldots, x_t^S)$. To take advantage of the\ndecomposition, we propose to consider that the state distributions at the different levels\nare conditionally independent (given the state at the previous time step and at the current\nand previous levels). 
Transition probabilities are therefore factored as follows:\n\n$$P(x_t \\mid x_{t-1}) = \\prod_{s=1}^{S} P(x_t^s \\mid x_{t-1}^s, x_{t-1}^{s-1}) \\qquad (5)$$\n\nTo force the state variable at each level to effectively work at a given time scale, self-transition\nprobabilities are constrained as follows (using above independence assumptions):\n\n$$P(x_t^s = i_s \\mid x_{t-1}^1 = i_1, \\ldots, x_{t-1}^s = i_s, \\ldots, x_{t-1}^S = i_S) = P(x_t^s = i_s \\mid x_{t-1}^s = i_s, x_{t-1}^{s-1} = i_{s-1}) = w_s$$\n\n6 Conclusion\nMotivated by the analysis of the problem of learning long-term dependencies in sequential\ndata, i.e., of learning to represent context, we have proposed to use a very general\nassumption on the structure of sequential data to reduce the difficulty of these learning\ntasks. Following much previous work in artificial intelligence we are assuming that\ncontext can be represented with a hierarchical structure. More precisely, here, it means\nthat long-term dependencies are insensitive to small timing variations, i.e., they can be\nrepresented with a coarse temporal scale. This scheme allows context information and\ncredit information to be respectively propagated forward and backward more easily.\nFollowing this intuitive idea, we have proposed to use hierarchical recurrent networks for sequence\nprocessing. These networks use multiple time scales to achieve a multi-resolution\nrepresentation of context. Series of experiments on artificial data have confirmed the\nadvantages of imposing such structures on the network architecture. Finally we have\nproposed a similar application of this concept to hidden Markov models (for density estimation)\nand input/output hidden Markov models (for classification and regression).\n\nReferences\nBengio, Y. and Frasconi, P. (1994). Credit assignment through time: Alternatives to\nbackpropagation. 
In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in\nNeural Information Processing Systems 6. Morgan Kaufmann.\n\nBengio, Y. and Frasconi, P. (1995a). Diffusion of context and credit information in markovian\nmodels. Journal of Artificial Intelligence Research, 3:223-244.\n\nBengio, Y. and Frasconi, P. (1995b). An input/output HMM architecture. In Tesauro,\nG., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing\nSystems 7, pages 427-434. MIT Press, Cambridge, MA.\n\nBengio, Y., LeCun, Y., and Henderson, D. (1994). Globally trained handwritten word\nrecognizer using spatial representation, space displacement neural networks and hidden\nMarkov models. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances\nin Neural Information Processing Systems 6, pages 937-944.\n\nBengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with\ngradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166.\n\nBrugnara, F., DeMori, R., Giuliani, D., and Omologo, M. (1992). A family of parallel\nhidden markov models. In International Conference on Acoustics, Speech and Signal\nProcessing, pages 377-370, New York, NY, USA. IEEE.\n\nDaubechies, I. (1990). The wavelet transform, time-frequency localization and signal\nanalysis. IEEE Transaction on Information Theory, 36(5):961-1005.\n\nDayan, P. and Hinton, G. (1993). Feudal reinforcement learning. In Hanson, S. J., Cowan,\nJ. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems\n5, San Mateo, CA. Morgan Kaufmann.\n\nFrasconi, P., Gori, M., Maggini, M., and Soda, G. (1993). 
Unified integration of explicit\nrules and learning by example in recurrent networks. IEEE Transactions on Knowledge\nand Data Engineering. (in press).\n\nGiles, C. L. and Omlin, C. W. (1992). Inserting rules into recurrent neural networks. In\nKung, Fallside, Sorenson, and Kamm, editors, Neural Networks for Signal Processing\nII, Proceedings of the 1992 IEEE workshop, pages 13-22. IEEE Press.\n\nLang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network\narchitecture for isolated word recognition. Neural Networks, 3:23-43.\n\nLeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel,\nL. (1989). Backpropagation applied to handwritten zip code recognition. Neural\nComputation, 1:541-551.\n\nLevinson, S., Rabiner, L., and Sondhi, M. (1983). An introduction to the application of the\ntheory of probabilistic functions of a Markov process to automatic speech recognition.\nBell System Technical Journal, 64(4):1035-1074.\n\nLin, T., Horne, B., Tino, P., and Giles, C. (1995). Learning long-term dependencies is not\nas difficult with NARX recurrent neural networks. Technical Report UMIACS-TR-95-78,\nInstitute for Advanced Computer Studies, University of Maryland.\n\nRabiner, L. and Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP\nMagazine, pages 257-285.\n\nRumelhart, D., Hinton, G., and Williams, R. (1986). Learning internal representations by\nerror propagation. In Rumelhart, D. and McClelland, J., editors, Parallel Distributed\nProcessing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge.\n\nSaul, L. and Jordan, M. (1995). Boltzmann chains and hidden markov models. In Tesauro,\nG., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing\nSystems 7, pages 435-442. MIT Press, Cambridge, MA. 
\n\nSchmidhuber, J. (1992). Learning complex, extended sequences using the principle of\nhistory compression. Neural Computation, 4(2):234-242.\n\nSimard, P. and LeCun, Y. (1992). Reverse TDNN: An architecture for trajectory generation.\nIn Moody, J., Hanson, S., and Lipmann, R., editors, Advances in Neural\nInformation Processing Systems 4, pages 579-588, Denver, CO. Morgan Kaufmann,\nSan Mateo.\n\nSingh, S. (1992). Reinforcement learning with a hierarchy of abstract models. In Proceedings\nof the 10th National Conference on Artificial Intelligence, pages 202-207.\nMIT/AAAI Press.\n\nSuaudeau, N. (1994). Un modele probabiliste pour integrer la dimension temporelle dans\nun systeme de reconnaissance automatique de la parole. PhD thesis, Universite de\nRennes I, France.\n\nSutton, R. (1995). TD models: modeling the world at a mixture of time scales. In Proceedings\nof the 12th International Conference on Machine Learning. Morgan Kaufmann.\n", "award": [], "sourceid": 1102, "authors": [{"given_name": "Salah", "family_name": "Hihi", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}