{"title": "Induction of Multiscale Temporal Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 275, "page_last": 282, "abstract": null, "full_text": "Induction of Multiscale Temporal  Structure \n\nMichael C.  Moser \n\nDepartment of Computer Science  &: \n\nInstitute of Cognitive  Science \n\nUniversity  of Colorado \n\nBoulder,  CO  80309-0430 \n\nAbstract \n\nLearning  structure  in  temporally-extended  sequences  is  a  difficult  com(cid:173)\nputational problem because  only a  fraction  of the relevant  information is \navailable  at  any  instant.  Although  variants  of back  propagation  can  in \nprinciple  be  used  to find  structure in  sequences,  in  practice  they  are  not \nsufficiently  powerful  to  discover  arbitrary  contingencies,  especially  those \nspanning  long  temporal  intervals  or  involving  high  order  statistics.  For \nexample,  in  designing  a  connectionist  network for  music  composition,  we \nhave encountered  the problem that the net is  able to learn  musical struc(cid:173)\nture that occurs locally in time-e.g., relations  among notes within a  mu(cid:173)\nsical phrase-but not structure that occurs over longer time periods--e.g., \nrelations  among  phrases.  To  address  this  problem,  we  require  a  means \nof constructing  a  reduced  deacription  of the  sequence  that  makes  global \naspects more explicit  or more readily detectable.  I propose to achieve this \nusing hidden  units that operate with different  time constants.  Simulation \nexperiments  indicate  that slower  time-scale  hidden  units  are  able  to pick \nup global structure,  structure that simply can not be learned  by standard \nback propagation. \n\nMany patterns in  the world  are intrinsically  temporal,  e.g.,  speech,  music,  the  un(cid:173)\nfolding  of events.  Recurrent  neural net  architectures  have  been devised  to  accom(cid:173)\nmodate  time-varying  sequences.  
For example, the architecture shown in Figure 1 can map a sequence of inputs to a sequence of outputs. Learning structure in temporally-extended sequences is a difficult computational problem because the input pattern may not contain all the task-relevant information at any instant. \n\nFigure 1: A generic recurrent network architecture for processing input and output sequences. Each box corresponds to a layer of units, each line to full connectivity between layers. \n\nThus, the context layer must hold on to relevant aspects of the input history until a later point in time at which they can be used. \nIn principle, variants of back propagation for recurrent networks (Rumelhart, Hinton, & Williams, 1986; Williams & Zipser, 1989) can discover an appropriate representation in the context layer for a particular task. In practice, however, back propagation is not sufficiently powerful to discover arbitrary contingencies, especially those that span long temporal intervals or that involve high order statistics (e.g., Mozer, 1989; Rohwer, 1990; Schmidhuber, 1991). \nLet me present a simple situation where back propagation fails. It involves remembering an event over an interval of time. A variant of this task was first studied by Schmidhuber (1991). The input is a sequence of discrete symbols: A, B, C, D, ..., X, Y. The task is to predict the next symbol in the sequence. Each sequence begins with either an X or a Y (call this the trigger symbol) and is followed by a fixed sequence such as ABCDE, which in turn is followed by a second instance of the trigger symbol, i.e., XABCDEX or YABCDEY. 
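The sequence construction just described, and the gap manipulation used in the experiments below, can be sketched as follows. This is an illustrative reconstruction, not the original experimental code; the helper name and the fixed filler string are my assumptions, and the trigger symbols are rendered X and Y.

```python
# Illustrative sketch of the prediction-task sequences (not the original code).
# A trigger symbol (X or Y) opens the sequence; its second occurrence is
# inserted after `gap` filler symbols, keeping total length constant.
FILLER = "ABCDEFGHIJK"

def make_sequence(trigger, gap, filler=FILLER):
    assert trigger in ("X", "Y")
    return trigger + filler[:gap] + trigger + filler[gap:]

print(make_sequence("X", 4))  # XABCDXEFGHIJK, a gap of 4
print(make_sequence("Y", 8))  # YABCDEFGHYIJK, a gap of 8
```

A training set for a given gap then consists of just two such sequences, one per trigger symbol.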
To perform the prediction task, it is necessary to store the trigger symbol when it is first presented, and then to recall the same symbol five time steps later. \n\nThe number of symbols intervening between the two triggers (call this the gap) can be varied. By training different networks on different gaps, we can examine how difficult the learning task is as a function of gap. To better control the experiments, all input sequences had the same length and consisted of either X or Y followed by ABCDEFGHIJK. The second instance of the trigger symbol was inserted at various points in the sequence. For example, XABCDXEFGHIJK represents a gap of 4, YABCDEFGHYIJK a gap of 8. \n\nEach training set consisted of two sequences, one with X and one with Y. Different networks were trained on different gaps. The network architecture consisted of one input and output unit per symbol, and ten context units. Twenty-five replications of each network were run with different random initial weights. If the training set was not learned within 10000 epochs, the replication was counted as a \"failure.\" The primary result was that training sets with gaps of 4 or more could not be learned reliably, as shown in Table 1. \n\nTable 1: Learning contingencies across gaps \n\ngap | % failures | mean # epochs to learn \n2 | 0 | 468 \n4 | 36 | 7406 \n6 | 92 | 9830 \n8 | 100 | 10000 \n10 | 100 | 10000 \n\nThe results are surprisingly poor. My general impression is that back propagation is powerful enough to learn only structure that is fairly local in time. 
For instance, in earlier work on neural net music composition (Mozer & Soukup, 1991), we found that our network could master the rules of composition for notes within a musical phrase, but not rules operating at a more global level: rules for how phrases are interrelated. \nThe focus of the present work is on devising learning algorithms and architectures for better handling temporal structure at more global scales, as well as multiscale or hierarchical structure. This difficult problem has been identified and studied by several other researchers, including Miyata and Burr (1990), Rohwer (1990), and Schmidhuber (1991). \n\n1 BUILDING A REDUCED DESCRIPTION \n\nThe basic idea behind my work involves building a reduced description (Hinton, 1988) of the sequence that makes global aspects more explicit or more readily detectable. The challenge of this approach is to devise an appropriate reduced description. I've experimented with a scheme that constructs a reduced description that is essentially a bird's eye view of the sequence, sacrificing a representation of individual elements for the overall contour of the sequence. Imagine a musical tape played at double the regular speed. Individual sounds are blended together and become indistinguishable. However, coarser time-scale events become more explicit, such as an ascending trend in pitch or a repeated progression of notes. Figure 2 illustrates the idea. The curve in the left graph, depicting a sequence of individual pitches, has been smoothed and compressed to produce the right graph. Mathematically, \"smoothed and compressed\" means that the waveform has been low-pass filtered and sampled at a lower rate. The result is a waveform in which the alternating upwards and downwards flow is unmistakable. 
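The \"smoothed and compressed\" operation, low-pass filtering followed by sampling at a lower rate, can be illustrated with a minimal sketch. A plain moving average stands in for the low-pass filter, and the pitch sequence is invented for illustration:

```python
def smooth_and_compress(seq, window=3, factor=2):
    """Moving-average (low-pass) filter, then keep every `factor`-th sample."""
    pad = window // 2
    padded = [seq[0]] * pad + list(seq) + [seq[-1]] * pad
    smoothed = [sum(padded[i:i + window]) / window for i in range(len(seq))]
    return smoothed[::factor]

# A zigzagging but rising pitch contour (MIDI-style numbers, made up):
pitches = [60, 64, 62, 66, 64, 68, 66, 70]
reduced = smooth_and_compress(pitches)
# Half as many points remain; the local zigzag is damped while the
# global ascending trend becomes visible at a glance.
```

The same trade-off appears in the network: individual elements blur together while the overall contour becomes explicit.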
\nMultiple views of the sequence are realized using context units that operate with different time constants: \n\nc_i(t) = tau_i c_i(t-1) + (1 - tau_i) f(net_i(t)),  (1) \n\nwhere c_i(t) is the activity of context unit i at time t, net_i(t) is the net input to unit i at time t, including activity both from the input layer and the recurrent context connections, f is the unit's squashing function, and tau_i is a time constant associated with each unit that has the range (0,1) and determines the responsiveness of the unit, the rate at which its activity changes. \n\nFigure 2: (a) A sequence of musical notes. The vertical axis indicates the pitch, the horizontal axis time. Each point corresponds to a particular note. (b) A smoothed, compact view of the sequence. \n\nWith tau_i = 0, the activation rule reduces to the standard one and the unit can sharply change its response based on a new input. With large tau_i, the unit is sluggish, holding on to much of its previous value and thereby averaging the response to the net input over time. At the extreme of tau_i = 1, the second term drops out and the unit's activity becomes fixed. Thus, large tau_i smooth out the response of a context unit over time. Note, however, that what is smoothed is the activity of the context units, not the input itself as Figure 2 might suggest. \n\nSmoothing is one property that distinguishes the waveform in Figure 2b from the original. The other property, compactness, is also achieved by a large tau_i, although somewhat indirectly. The key benefit of the compact waveform in Figure 2b is that it allows a longer period of time to be viewed in a single glance, thereby explicating contingencies occurring in this interval during learning. 
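A minimal simulation of the activation rule makes the role of the time constant concrete. The logistic squashing function and the single-spike input stream are assumptions chosen for illustration:

```python
import math

def run_context_unit(net_inputs, tau):
    """c(t) = tau*c(t-1) + (1 - tau)*f(net(t)), with f the logistic function."""
    f = lambda x: 1.0 / (1.0 + math.exp(-x))
    c, trace = 0.0, []
    for net in net_inputs:
        c = tau * c + (1.0 - tau) * f(net)
        trace.append(c)
    return trace

spike = [4.0] + [0.0] * 9        # one salient input, then silence
fast = run_context_unit(spike, tau=0.0)
slow = run_context_unit(spike, tau=0.9)
# One step after the spike, the tau=0 unit's state is indistinguishable from
# that of a unit that never saw it; the tau=0.9 unit's state still reflects
# the spike many steps later.
```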
The context unit activation rule (Equation 1) permits this. To see why this is the case, consider the relation between the error derivative with respect to the context units at time t, ∂E/∂c(t), and the error back propagated to the previous step, t-1. One contribution to ∂E/∂c_i(t-1), from the first term in Equation 1, is \n\n∂E/∂c_i(t-1) = tau_i ∂E/∂c_i(t).  (2) \n\nThis means that when tau_i is large, most of the error signal in context unit i at time t is carried back to time t-1. Intuitively, just as the activation of units with large tau_i changes slowly forward in time, the error propagated back through these units changes slowly too. Thus, the back propagated error signal can make contact with points further back in time, facilitating the learning of more global structure in the input sequence. \n\nTime constants have been incorporated into the activation rules of other connectionist architectures (Jordan, 1987; McClelland, 1979; Mozer, 1989; Pearlmutter, 1989; Pineda, 1987). However, none of this work has exploited time constants to control the temporal responsivity of individual units. \n\n2 LEARNING AABA PHRASE PATTERNS \n\nA simple simulation illustrates the benefits of temporal reduced descriptions. I generated pseudo-musical phrases consisting of five notes in ascending chromatic order, e.g., F#2 G2 G#2 A2 A#2 or C4 C#4 D4 D#4 E4, where the first pitch was selected at random.1 Pairs of phrases (call them A and B) were concatenated to form an AABA pattern, terminated by a special END marker. The complete melody then consisted of 21 elements, four phrases of five notes followed by the END marker, an example of which is: \n\n[musical notation example not reproduced] \n\nTwo versions of CONCERT were tested, each with 35 context units. 
In the standard version, all 35 units had tau = 0; in the reduced description or RD version, 30 had tau = 0 and 5 had tau = 0.8. The training set consisted of 200 examples and the test set another 100 examples. Ten replications of each simulation were run for 300 passes through the training set. See Mozer and Soukup (1991) for details of the network architecture and note representations. \n\nBecause of the way that the sequences are organized, certain pitches can be predicted based on local structure whereas other pitches require a more global memory of the sequence. In particular, the second through fifth pitches within a phrase can be predicted based on knowledge of the immediately preceding pitch. To predict the first pitch in the repeated A phrases and to predict the END marker, more global information is necessary. Thus, the analysis was split to distinguish between pitches requiring only local structure and pitches requiring more global structure. As Table 2 shows, performance requiring global structure was significantly better for the RD version (F(1,9)=179.8, p < .001), but there was only a marginally reliable difference for performance involving local structure (F(1,9)=3.82, p=.08). The global structure can be further broken down to prediction of the END marker and prediction of the first pitch of the repeated A phrases. In both cases, the performance improvement for the RD version was significant: 88.0% versus 52.9% for the end of sequence (F(1,9)=220, p < .001); 69.4% versus 61.2% for the first pitch (F(1,9)=77.6, p < .001). \nExperiments with different values of tau in the range .7-.95 yielded qualitatively similar results, as did experiments in which the A and B phrases were formed by random walks in the key of C major. 
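The training melodies can be reproduced with a short generator. The pitch spelling and the two-octave starting range are illustrative assumptions; see Mozer and Soukup (1991) for the actual note representation:

```python
import random

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chromatic_phrase(start):
    """Five notes in ascending chromatic order from a given starting pitch."""
    return [NOTES[(start + i) % 12] + str(2 + (start + i) // 12)
            for i in range(5)]

def aaba_melody(rng):
    a = chromatic_phrase(rng.randrange(24))   # random start over two octaves
    b = chromatic_phrase(rng.randrange(24))
    return a + a + b + a + ["END"]            # four phrases plus the END marker

melody = aaba_melody(random.Random(0))        # 21 elements in all
```

A phrase starting at index 6, for instance, ascends F#2 G2 G#2 A2 A#2.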
\n\n1 One need not understand the musical notation to make sense of this example. Simply consider each note to be a unique symbol in a set of symbols having a fixed ordering. The example is framed in terms of music because my original work involved music composition. \n\nTable 2: Performance on AABA phrases \n\nstructure | standard version | RD version \nlocal | 97.3% | 96.7% \nglobal | 58.4% | 75.6% \n\n3 DETECTING CONTINGENCIES ACROSS GAPS REVISITED \n\nI now return to the prediction task involving sequences containing two X's or Y's separated by a stream of intervening symbols. A reduced description network had no problem learning the contingency across wide gaps. Table 3 compares the results presented earlier for a standard net with ten context units and the results for an RD net having six standard context units (tau = 0) and four units having identical nonzero tau, in the range of .75-.95. More on the choice of tau below, but first observe that the reduced description net had a 100% success rate. Indeed, it had no difficulty with much wider gaps: I tested gaps of up to 25 symbols. The number of epochs to learn scales roughly linearly with the gap. \n\nWhen the task was modified slightly such that the intervening symbols were randomly selected from the set {A,B,C,D}, the RD net still had no difficulty with the prediction task. \n\nThe bad news here is that the choice of tau can be important. In the results reported above, tau was selected to optimize performance. In general, a larger tau was needed to span larger gaps. For small gaps, performance was insensitive to the particular tau chosen. However, the larger the temporal gap that had to be spanned, the smaller the range of tau values that gave acceptable results. This would appear to be a serious limitation of the approach. 
However, there are several potential solutions. \n\n1. One might try using back propagation to train the time constants directly. This does not work particularly well on the problems I've examined, apparently because the path to an appropriate tau is fraught with local optima. Using gradient descent to fine tune tau, once it's in the right neighborhood, is somewhat more successful. \n\n2. One might include a complete range of tau values in the context layer. It is not difficult to determine a rough correspondence between the choice of tau and the temporal interval to which a unit is optimally tuned. If sufficient units are used to span a range of intervals, the network should perform well. The down side, of course, is that this gives the network an excess of weight parameters with which it could potentially overfit the training data. However, because the different tau correspond to different temporal scales, there is much less freedom to abuse the weights here than, say, in a situation where additional hidden units are added to a feedforward network. \n\n3. One might dynamically adjust tau as a sequence is presented based on external criteria. In Section 5, I discuss one such criterion. \n\nTable 3: Learning contingencies across gaps (revisited) \n\ngap | standard net % failures | standard net mean # epochs to learn | RD net % failures | RD net mean # epochs to learn \n2 | 0 | 468 | 0 | 328 \n4 | 36 | 7406 | 0 | 584 \n6 | 92 | 9830 | 0 | 992 \n8 | 100 | 10000 | 0 | 1312 \n10 | 100 | 10000 | 0 | 1630 \n\nFigure 3: A sketch of the Schmidhuber (1991) architecture 
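Point 2's rough correspondence between tau and temporal interval can be sketched quantitatively: under Equation 1, a past input's influence on a unit's activity decays by a factor of tau per step, so one informal tuning rule equates a unit's half-life, log 0.5 / log tau steps, with the interval to be spanned. The half-life criterion is my illustrative choice, not a prescription from the text:

```python
import math

def half_life(tau):
    """Steps until a past input's contribution (decaying as tau**k) halves."""
    return math.log(0.5) / math.log(tau)

def tau_for_span(k):
    """Inverse: the time constant whose half-life is about k steps."""
    return 0.5 ** (1.0 / k)

# Larger tau -> longer memory, matching the empirical need for larger tau
# on wider gaps:
spans = {tau: half_life(tau) for tau in (0.75, 0.85, 0.95)}
# tau=0.75 remembers ~2.4 steps; tau=0.95, ~13.5 steps.
```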
\n\n4 MUSIC COMPOSITION \n\nI have used music composition as a domain for testing and evaluating different approaches to learning multiscale temporal structure. In previous work (Mozer & Soukup, 1991), we designed a sequential prediction network, called CONCERT, that learns to reproduce a set of pieces of a particular musical style. CONCERT also learns structural regularities of the musical style, and can be used to compose new pieces in the same style. CONCERT was trained on a set of Bach pieces and a set of traditional European folk melodies. The compositions it produced were reasonably pleasant, but were lacking in global coherence. The compositions tended to wander randomly with little direction, modulating haphazardly from major to minor keys, flip-flopping from the style of a march to that of a minuet. I attribute these problems to the fact that CONCERT had learned only local temporal structure. \n\nI have recently trained CONCERT on a third set of examples, waltzes, and have included context units that operate with a range of time constants. There is a consensus among listeners that the new compositions are more coherent. I am presently running more controlled simulations using the same musical training set and versions of CONCERT with and without reduced descriptions, and am attempting to quantify CONCERT's abilities at various temporal scales. \n\n5 A HYBRID APPROACH \n\nSchmidhuber (1991; this volume) has proposed an alternative approach to learning multiscale temporal structure in sequences. His approach, the chunking architecture, basically involves two (or more) sequential prediction networks cascaded together (Figure 3). The lower net receives each input and attempts to predict the next input. When it fails to predict reliably, the next input is passed to the upper net. 
\nThus, once the lower net has been trained to predict local temporal structure, such structure is removed from the input to the upper net. This simplifies the task of learning global structure in the upper net. \n\nSchmidhuber's approach has some serious limitations, as does the approach I've described. We have thus merged the two in a scheme that incorporates the strengths of each approach (Schmidhuber, Prelinger, Mozer, Blumenthal, & Mathis, in preparation). The architecture is the same as depicted in Figure 3, except that all units in the upper net have associated with them a time constant tau_u, and the prediction error in the lower net determines tau_u. In effect, this allows the upper net to kick in only when the lower net fails to predict. This avoids the problem of selecting time constants, from which my approach suffers. It also avoids the drawback of Schmidhuber's approach that yes-or-no decisions must be made about whether the lower net was successful. Initial simulation experiments indicate robust performance of the hybrid algorithm. \n\nAcknowledgements \n\nThis research was supported by NSF Presidential Young Investigator award IRI-9058450, grant 90-21 from the James S. McDonnell Foundation, and DEC external research grant 1250. Thanks to Jürgen Schmidhuber and Paul Smolensky for helpful comments regarding this work, and to Darren Hardy for technical assistance. \n\nReferences \n\nHinton, G. E. (1988). Representing part-whole hierarchies in connectionist networks. Proceedings of the Eighth Annual Conference of the Cognitive Science Society. \n\nJordan, M. I. (1987). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 531-546). Hillsdale, NJ: Erlbaum. 
\n\nMcClelland, J. L. (1979). On the time relations of mental processes: An examination of systems of processes in cascade. Psychological Review, 86, 287-330. \n\nMiyata, Y., & Burr, D. (1990). Hierarchical recurrent networks for learning musical structure. Unpublished manuscript. \n\nMozer, M. C. (1989). A focused back-propagation algorithm for temporal pattern recognition. Complex Systems, 3, 349-381. \n\nMozer, M. C., & Soukup, T. (1991). CONCERT: A connectionist composer of erudite tunes. In R. P. Lippmann, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 789-796). San Mateo, CA: Morgan Kaufmann. \n\nPearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation, 1, 263-269. \n\nPineda, F. (1987). Generalization of back propagation to recurrent neural networks. Physical Review Letters, 59, 2229-2232. \n\nRohwer, R. (1990). The 'moving targets' training algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 558-565). San Mateo, CA: Morgan Kaufmann. \n\nRumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations (pp. 318-362). Cambridge, MA: MIT Press/Bradford Books. \n\nSchmidhuber, J. (1991). Neural sequence chunkers (Report FKI-148-91). Munich, Germany: Technische Universitaet Muenchen, Institut fuer Informatik. \n\nWilliams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280. 
\n", "award": [], "sourceid": 522, "authors": [{"given_name": "Michael", "family_name": "Mozer", "institution": null}]}