{"title": "Factorial Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 472, "page_last": 478, "abstract": null, "full_text": "Factorial Hidden Markov Models\n\nZoubin Ghahramani\nzoubin@psyche.mit.edu\n\nDepartment of Computer Science\nUniversity of Toronto\nToronto, ON M5S 1A4\nCanada\n\nMichael I. Jordan\njordan@psyche.mit.edu\n\nDepartment of Brain & Cognitive Sciences\nMassachusetts Institute of Technology\nCambridge, MA 02139\nUSA\n\nAbstract\n\nWe present a framework for learning in hidden Markov models with distributed state representations. Within this framework, we derive a learning algorithm based on the Expectation-Maximization (EM) procedure for maximum likelihood estimation. Analogous to the standard Baum-Welch update rules, the M-step of our algorithm is exact and can be solved analytically. However, due to the combinatorial nature of the hidden state representation, the exact E-step is intractable. A simple and tractable mean field approximation is derived. Empirical results on a set of problems suggest that both the mean field approximation and Gibbs sampling are viable alternatives to the computationally expensive exact algorithm.\n\n1 Introduction\n\nA problem of fundamental interest to machine learning is time series modeling. Due to the simplicity and efficiency of its parameter estimation algorithm, the hidden Markov model (HMM) has emerged as one of the basic statistical tools for modeling discrete time series, finding widespread application in the areas of speech recognition (Rabiner and Juang, 1986) and computational molecular biology (Baldi et al., 1994). An HMM is essentially a mixture model, encoding information about the history of a time series in the value of a single multinomial variable (the hidden state). 
 This multinomial assumption allows an efficient parameter estimation algorithm to be derived (the Baum-Welch algorithm). However, it also severely limits the representational capacity of HMMs. For example, to represent 30 bits of information about the history of a time sequence, an HMM would need $2^{30}$ distinct states. On the other hand, an HMM with a distributed state representation could achieve the same task with 30 binary units (Williams and Hinton, 1991). This paper addresses the problem of deriving efficient learning algorithms for hidden Markov models with distributed state representations.\n\nThe need for distributed state representations in HMMs can be motivated in two ways. First, such representations allow the state space to be decomposed into features that naturally decouple the dynamics of a single process generating the time series. Second, distributed state representations simplify the task of modeling time series generated by the interaction of multiple independent processes. For example, a speech signal generated by the superposition of multiple simultaneous speakers can potentially be modeled with such an architecture.\n\nWilliams and Hinton (1991) first formulated the problem of learning in HMMs with distributed state representation and proposed a solution based on deterministic Boltzmann learning. The approach presented in this paper is similar to Williams and Hinton's in that it is also based on a statistical mechanical formulation of hidden Markov models. However, our learning algorithm is quite different in that it makes use of the special structure of HMMs with distributed state representation, resulting in a more efficient learning procedure. 
Anticipating the results in section 2, this learning algorithm both obviates the need for the two-phase procedure of Boltzmann machines and has an exact M-step. A different approach comes from Saul and Jordan (1995), who derived a set of rules for computing the gradients required for learning in HMMs with distributed state spaces. However, their methods can only be applied to a limited class of architectures.\n\n2 Factorial hidden Markov models\n\nHidden Markov models are a generalization of mixture models. At any time step, the probability density over the observables defined by an HMM is a mixture of the densities defined by each state in the underlying Markov model. Temporal dependencies are introduced by specifying that the prior probability of the state at time $t$ depends on the state at time $t-1$ through a transition matrix, $P$ (Figure 1a). Another generalization of mixture models, the cooperative vector quantizer (CVQ; Hinton and Zemel, 1994), provides a natural formalism for distributed state representations in HMMs. Whereas in simple mixture models each data point must be accounted for by a single mixture component, in CVQs each data point is accounted for by the combination of contributions from many mixture components, one from each separate vector quantizer. The total probability density modeled by a CVQ is also a mixture model; however, this mixture density is assumed to factorize into a product of densities, each density associated with one of the vector quantizers. Thus, the CVQ is a mixture model with distributed representations for the mixture components.\n\nFactorial hidden Markov models$^1$ combine the state transition structure of HMMs with the distributed representations of CVQs (Figure 1b). 
Each of the $d$ underlying Markov models has a discrete state $s_i^t$ at time $t$ and transition probability matrix $P_i$. As in the CVQ, the states are mutually exclusive within each vector quantizer and we assume real-valued outputs. The sequence of observable output vectors is generated from a normal distribution with mean given by the weighted combination of the states of the underlying Markov models:\n\n$$y^t \\sim \\mathcal{N}\\left(\\sum_{i=1}^{d} W_i s_i^t,\\; C\\right),$$\n\nwhere $C$ is a common covariance matrix. The $k$-valued states $s_i$ are represented as discrete column vectors with a 1 in one position and 0 everywhere else; the mean of the observable is therefore a combination of columns from each of the $W_i$ matrices.\n\n$^1$ We refer to HMMs with distributed state as factorial HMMs because the features of the distributed state factorize the total state representation.\n\nFigure 1. a) Hidden Markov model. b) Factorial hidden Markov model.\n\nWe capture the above probability model by defining the energy of a sequence of $T$ states and observations, $\\{(s^t, y^t)\\}_{t=1}^{T}$, which we abbreviate to $\\{s, y\\}$, as:\n\n$$\\mathcal{H}(\\{s,y\\}) = \\frac{1}{2} \\sum_{t=1}^{T} \\left[y^t - \\sum_{i=1}^{d} W_i s_i^t\\right]' C^{-1} \\left[y^t - \\sum_{i=1}^{d} W_i s_i^t\\right] - \\sum_{t=1}^{T} \\sum_{i=1}^{d} (s_i^t)' A_i s_i^{t-1}, \\qquad (1)$$\n\nwhere $[A_i]_{jl} = \\log P(s_{ij}^t \\mid s_{il}^{t-1})$ such that $\\sum_{j=1}^{k} e^{[A_i]_{jl}} = 1$, and $'$ denotes matrix transpose. Priors for the initial state, $s^1$, are introduced by setting the second term in (1) at $t = 1$ to $-\\sum_{i=1}^{d} (s_i^1)' \\log \\pi_i$. The probability model is defined from this energy by the Boltzmann distribution\n\n$$P(\\{s,y\\}) = \\frac{1}{Z} \\exp\\{-\\mathcal{H}(\\{s,y\\})\\}. \\qquad (2)$$\n\nNote that, as in the CVQ (Ghahramani, 1995), the unclamped partition function\n\n$$Z = \\int d\\{y\\} \\sum_{\\{s\\}} \\exp\\{-\\mathcal{H}(\\{s,y\\})\\}$$\n\nevaluates to a constant, independent of the parameters. 
 This can be shown by first integrating the Gaussian variables, removing all dependency on $\\{y\\}$, and then summing over the states using the constraint on $e^{[A_i]_{jl}}$.\n\nThe EM algorithm for Factorial HMMs\n\nAs in HMMs, the parameters of a factorial HMM can be estimated via the EM (Baum-Welch) algorithm. This procedure alternates between computing probabilities over the hidden states under the current parameters (E-step) and using these probabilities to maximize the expected log likelihood of the parameters (M-step).\n\nUsing the likelihood (2), the expected log likelihood of the parameters is\n\n$$Q(\\phi^{new} \\mid \\phi) = \\langle -\\mathcal{H}(\\{s,y\\}) - \\log Z \\rangle_c, \\qquad (3)$$\n\nwhere $\\phi = \\{W_i, P_i, C\\}_{i=1}^{d}$ denotes the current parameters, and $\\langle \\cdot \\rangle_c$ denotes expectation given the clamped observation sequence and $\\phi$. Given the observation sequence, the only random variables are the hidden states. Expanding equation (3) and limiting the expectation to these random variables, we find that the statistics that need to be computed for the E-step are $\\langle s_i^t \\rangle_c$, $\\langle s_i^t (s_j^t)' \\rangle_c$, and $\\langle s_i^t (s_i^{t-1})' \\rangle_c$. Note that in standard HMM notation (Rabiner and Juang, 1986), $\\langle s_i^t \\rangle_c$ corresponds to $\\gamma_t$ and $\\langle s_i^t (s_i^{t-1})' \\rangle_c$ corresponds to $\\xi_t$, whereas $\\langle s_i^t (s_j^t)' \\rangle_c$ has no analogue when there is only a single underlying Markov model. The M-step uses these expectations to maximize $Q$ with respect to the parameters.\n\nThe constant partition function allowed us to drop the second term in (3). Therefore, unlike the Boltzmann machine, the expected log likelihood does not depend on statistics collected in an unclamped phase of learning, resulting in much faster learning than the traditional Boltzmann machine (Neal, 1992). 
\n\nM-step\n\nSetting the derivatives of $Q$ with respect to the output weights to zero, we obtain a linear system of equations for $W$:\n\n$$W^{new} = \\left[\\sum_{N,t} \\langle s s' \\rangle_c\\right]^{\\dagger} \\left[\\sum_{N,t} \\langle s \\rangle_c y'\\right],$$\n\nwhere $s$ and $W$ are the vector and matrix of concatenated $s_i$ and $W_i$, respectively, $\\sum_N$ denotes summation over a data set of $N$ sequences, and $\\dagger$ is the Moore-Penrose pseudo-inverse. To estimate the log transition probabilities we solve $\\partial Q / \\partial [A_i]_{jl} = 0$ subject to the constraint $\\sum_j e^{[A_i]_{jl}} = 1$, obtaining\n\n$$[A_i]_{jl}^{new} = \\log \\left( \\frac{\\sum_{N,t} \\langle s_{ij}^t s_{il}^{t-1} \\rangle_c}{\\sum_{N,t,j} \\langle s_{ij}^t s_{il}^{t-1} \\rangle_c} \\right). \\qquad (4)$$\n\nThe covariance matrix can be similarly estimated:\n\n$$C^{new} = \\sum_{N,t} y y' - \\sum_{N,t} y \\langle s \\rangle_c' \\langle s s' \\rangle_c^{\\dagger} \\langle s \\rangle_c y'.$$\n\nThe M-step equations can therefore be solved analytically; furthermore, for a single underlying Markov chain, they reduce to the traditional Baum-Welch re-estimation equations.\n\nE-step\n\nUnfortunately, as in the simpler CVQ, the exact E-step for factorial HMMs is computationally intractable. For example, the expectation of the $j$th unit in vector $i$ at time step $t$, given $\\{y\\}$, is:\n\n$$\\langle s_{ij}^t \\rangle_c = P(s_{ij}^t = 1 \\mid \\{y\\}, \\phi) = \\sum_{j_1, \\ldots, j_{h \\neq i}, \\ldots, j_d}^{k} P(s_{1,j_1}^t = 1, \\ldots, s_{i,j}^t = 1, \\ldots, s_{d,j_d}^t = 1 \\mid \\{y\\}, \\phi).$$\n\nAlthough the Markov property can be used to obtain a forward-backward-like factorization of this expectation across time steps, the sum over all possible configurations of the other hidden units within each time step is unavoidable. For a data set of $N$ sequences of length $T$, the full E-step calculated through the forward-backward procedure has time complexity $O(NTk^{2d})$. Although more careful bookkeeping can reduce the complexity to $O(NTdk^{d+1})$, the exponential time cannot be avoided. 
\nThis intractability of the exact E-step is due inherently to the cooperative nature of the model: the setting of one vector only determines the mean of the observable if all the other vectors are fixed.\n\nRather than summing over all possible hidden state patterns to compute the exact expectations, a natural approach is to approximate them through a Monte Carlo method such as Gibbs sampling. The procedure starts with a clamped observable sequence $\\{y\\}$ and a random setting of the hidden states $\\{s_j^t\\}$. At each time step, each state vector is updated stochastically according to its probability distribution conditioned on the setting of all the other state vectors: $s_i^t \\sim P(s_i^t \\mid \\{y\\}, \\{s_j^\\tau : j \\neq i \\text{ or } \\tau \\neq t\\}, \\phi)$. These conditional distributions are straightforward to compute and a full pass of Gibbs sampling requires $O(NTkd)$ operations. The first and second-order statistics needed to estimate $\\langle s_i^t \\rangle_c$, $\\langle s_i^t (s_j^t)' \\rangle_c$ and $\\langle s_i^t (s_i^{t-1})' \\rangle_c$ are collected using the $s_{ij}^t$'s visited and the probabilities estimated during this sampling process.\n\nMean field approximation\n\nA different approach to computing the expectations in an intractable system is given by mean field theory. A mean field approximation for factorial HMMs can be obtained by defining the energy function\n\n$$\\tilde{\\mathcal{H}}(\\{s,y\\}) = \\frac{1}{2} \\sum_{t} \\left[y^t - \\mu^t\\right]' C^{-1} \\left[y^t - \\mu^t\\right] - \\sum_{t,i} (s_i^t)' \\log m_i^t,$$\n\nwhich results in a completely factorized approximation to probability density (2):\n\n$$\\tilde{P}(\\{s,y\\}) \\propto \\prod_{t} \\exp\\left\\{-\\frac{1}{2} \\left[y^t - \\mu^t\\right]' C^{-1} \\left[y^t - \\mu^t\\right]\\right\\} \\prod_{t,i,j} (m_{ij}^t)^{s_{ij}^t}. \\qquad (5)$$\n\nIn this approximation, the observables are independently Gaussian distributed with mean $\\mu^t$ and each hidden state vector is multinomially distributed with mean $m_i^t$. 
\nThis approximation is made as tight as possible by choosing the mean field parameters $\\mu^t$ and $m_i^t$ that minimize the Kullback-Leibler divergence\n\n$$KL(\\tilde{P} \\| P) \\equiv \\langle \\log \\tilde{P} \\rangle_{\\tilde{P}} - \\langle \\log P \\rangle_{\\tilde{P}},$$\n\nwhere $\\langle \\cdot \\rangle_{\\tilde{P}}$ denotes expectation over the mean field distribution (5). With the observables clamped, $\\mu^t$ can be set equal to the observable $y^t$. Minimizing $KL(\\tilde{P} \\| P)$ with respect to the mean field parameters for the states results in a fixed-point equation which can be iterated until convergence:\n\n$$m_i^{t\\,new} = \\sigma\\left\\{ W_i' C^{-1} \\left[y^t - \\hat{y}^t\\right] + W_i' C^{-1} W_i m_i^t - \\frac{1}{2} \\mathrm{diag}\\{W_i' C^{-1} W_i\\} - 1 + A_i m_i^{t-1} + A_i' m_i^{t+1} \\right\\}, \\qquad (6)$$\n\nwhere $\\hat{y}^t \\equiv \\sum_i W_i m_i^t$ and $\\sigma\\{\\cdot\\}$ is the softmax exponential, normalized over each hidden state vector. The first term is the projection of the error in the observable onto the weights of state vector $i$; the more a hidden unit can reduce this error, the larger its mean field parameter. The next three terms arise from the fact that $\\langle s_{ij}^2 \\rangle_{\\tilde{P}}$ is equal to $m_{ij}$ and not $m_{ij}^2$. The last two terms introduce dependencies forward and backward in time. Each state vector is asynchronously updated using (6), at a time cost of $O(NTkd)$ per iteration. Convergence is diagnosed by monitoring the KL divergence in the mean field distribution between successive time steps; in practice convergence is very rapid (about 2 to 10 iterations of (6)). 
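As a rough illustration of the fixed-point update (6), the sketch below runs one asynchronous sweep over all time steps and chains in plain Python. It is not the authors' implementation: the function name mean_field_sweep, the nested-list data layout, the precomputed inverse covariance C_inv, and the use of log_pi in place of the previous-step term at the first time step are all assumptions made for this example. The first two terms of (6) are merged algebraically, and the constant -1 is dropped because the softmax is unchanged when the same constant is added to every component.

```python
import math

def softmax(v):
    # Normalized exponential, shifted by the max for numerical stability.
    top = max(v)
    e = [math.exp(x - top) for x in v]
    total = sum(e)
    return [x / total for x in e]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def column(M, j):
    return [row[j] for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_field_sweep(y, W, A, log_pi, C_inv, m):
    # One asynchronous pass of update (6) over every time step t and chain i.
    # y[t]       : observation vector of length D
    # W[i]       : D x k output weights of chain i
    # A[i][j][l] : log P(state j at t | state l at t-1) for chain i
    # log_pi[i]  : log prior over the initial state of chain i (assumed input)
    # C_inv      : D x D inverse covariance (assumed precomputed)
    # m[t][i]    : length-k mean field vector, overwritten in place
    T, d = len(y), len(W)
    for t in range(T):
        for i in range(d):
            k = len(W[i][0])
            # Prediction from all chains: y_hat = sum_c W_c m_c^t.
            y_hat = [0.0] * len(y[t])
            for c in range(d):
                for r, val in enumerate(matvec(W[c], m[t][c])):
                    y_hat[r] += val
            # The first two terms of (6) share the factor W_i' C^-1, so the
            # residual adds back chain i's own contribution W_i m_i^t.
            own = matvec(W[i], m[t][i])
            resid = [yt - yh + o for yt, yh, o in zip(y[t], y_hat, own)]
            proj = matvec(C_inv, resid)
            arg = []
            for j in range(k):
                w_j = column(W[i], j)
                a = dot(w_j, proj) - 0.5 * dot(w_j, matvec(C_inv, w_j))
                # Coupling to the previous step: A_i m_i^{t-1}, or the prior at t = 0.
                if t == 0:
                    a += log_pi[i][j]
                else:
                    a += dot(A[i][j], m[t - 1][i])
                # Coupling to the next step: A_i' m_i^{t+1}.
                if t != T - 1:
                    a += dot(column(A[i], j), m[t + 1][i])
                arg.append(a)
            m[t][i] = softmax(arg)
    return m
```

With a single chain and a single time step the coupling terms vanish, and the update reduces to the exact posterior of a k-component Gaussian mixture, which gives a quick sanity check for an implementation like this one.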
\n\nTable 1: Comparison of factorial HMM on four problems of varying size\n\nd  k  Alg    #  Train             Test              Cycles         Time/Cycle\n3  2  HMM    5  649 \u00b1 8           358 \u00b1 81          33 \u00b1 19        1.1 s\n3  2  Exact  5  877 \u00b1 0           768 \u00b1 0           22 \u00b1 6         3.0 s\n3  2  Gibbs  5  710 \u00b1 152         627 \u00b1 129         28 \u00b1 11        6.0 s\n3  2  MF     5  755 \u00b1 168         670 \u00b1 137         32 \u00b1 22        1.2 s\n3  3  HMM    5  670 \u00b1 26          -782 \u00b1 128        23 \u00b1 10        3.6 s\n3  3  Exact  5  568 \u00b1 164         276 \u00b1 62          35 \u00b1 12        5.2 s\n3  3  Gibbs  5  564 \u00b1 160         305 \u00b1 51          45 \u00b1 16        9.2 s\n3  3  MF     5  495 \u00b1 83          326 \u00b1 62          38 \u00b1 22        1.6 s\n5  2  HMM    5  588 \u00b1 37          -2634 \u00b1 566       18 \u00b1 1         5.2 s\n5  2  Exact  5  223 \u00b1 76          159 \u00b1 80          31 \u00b1 17        6.9 s\n5  2  Gibbs  5  123 \u00b1 103         73 \u00b1 95           40 \u00b1 5         12.7 s\n5  2  MF     5  292 \u00b1 101         237 \u00b1 103         54 \u00b1 29        2.2 s\n5  3  HMM    3  1671, 1678, 1690  -\u221e, -\u221e, -\u221e        14, 14, 12     90.0 s\n5  3  Exact  3  -55, -354, -295   -123, -378, -402  90, 100, 100   51.0 s\n5  3  Gibbs  3  -123, -160, -194  -202, -237, -307  100, 73, 100   14.2 s\n5  3  MF     3  -287, -286, -296  -364, -370, -365  100, 100, 100  4.7 s\n\nTable 1. Data was generated from a factorial HMM with $d$ underlying Markov models of $k$ states each. The training set was 10 sequences of length 20 where the observable was a 4-dimensional vector; the test set was 20 such sequences. HMM indicates a hidden Markov model with $k^d$ states; the other algorithms are factorial HMMs with $d$ underlying $k$-state models. Gibbs sampling used 10 samples of each state. The algorithms were run until convergence, as monitored by relative change in the likelihood, or a maximum of 100 cycles. The # column indicates number of runs. 
The Train and Test columns show the log likelihood \u00b1 one standard deviation on the two data sets. The last column indicates approximate time per cycle on a Silicon Graphics R4400 processor running Matlab.\n\n3 Empirical Results\n\nWe compared three EM algorithms for learning in factorial HMMs, which used Gibbs sampling, the mean field approximation, and the exact (exponential) E-step respectively, on the basis of performance and speed on randomly generated problems. Problems were generated from a factorial HMM structure, the parameters of which were sampled from a uniform [0,1] distribution and appropriately normalized to satisfy the sum-to-one constraints of the transition matrices and priors. Also included in the comparison was a traditional HMM with as many states ($k^d$) as the factorial HMM.\n\nTable 1 summarizes the results. Even for moderately large state spaces ($d \\geq 3$ and $k \\geq 3$) the standard HMM with $k^d$ states suffers from severe overfitting. Furthermore, both the standard HMM and the exact E-step factorial HMM are extremely slow on the larger problems. The Gibbs sampling and mean field approximations offer roughly comparable performance at a great increase in speed.\n\n4 Discussion\n\nThe basic contribution of this paper is a learning algorithm for hidden Markov models with distributed state representations. The standard Baum-Welch procedure is intractable for such architectures as the size of the state space generated from the cross product of $d$ $k$-valued features is $O(k^d)$, and the time complexity of Baum-Welch is quadratic in this size. More importantly, unless special constraints are applied to this cross-product HMM architecture, the number of parameters also grows as $O(k^{2d})$, which can result in severe overfitting. 
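To make the growth rates above concrete, the following sketch counts states and transition-matrix entries for the four problem sizes in Table 1. The function names are hypothetical, and the counts cover transition entries only (not output weights or priors, and without subtracting the sum-to-one constraints), so they illustrate the scaling rather than give exact parameter totals.

```python
def cross_product_hmm_counts(d, k):
    # A single HMM whose states are all joint settings of d k-valued features.
    states = k ** d
    transition_entries = states ** 2  # one (k^d) x (k^d) transition matrix
    return states, transition_entries

def factorial_hmm_counts(d, k):
    # d separate chains, each with k states and its own k x k transition matrix.
    state_units = d * k
    transition_entries = d * k * k
    return state_units, transition_entries

for d, k in [(3, 2), (3, 3), (5, 2), (5, 3)]:
    print(d, k, cross_product_hmm_counts(d, k), factorial_hmm_counts(d, k))
```

For d = 5 and k = 3 the cross-product HMM has 243 states and 59049 transition entries, against 45 entries across the five 3 x 3 matrices of the factorial model.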
\n\nThe architecture for factorial HMMs presented in this paper did not include any coupling between the underlying Markov chains. It is possible to extend the algorithm presented to architectures which incorporate such couplings. However, these couplings must be introduced with caution as they may result either in an exponential growth in parameters or in a loss of the constant partition function property.\n\nThe learning algorithm derived in this paper assumed real-valued observables. The algorithm can also be derived for HMMs with discrete observables, an architecture closely related to sigmoid belief networks (Neal, 1992). However, the nonlinearities induced by discrete observables make both the E-step and M-step of the algorithm more difficult.\n\nIn conclusion, we have presented Gibbs sampling and mean field learning algorithms for factorial hidden Markov models. Such models incorporate the time series modeling capabilities of hidden Markov models and the advantages of distributed representations for the state space. Future work will concentrate on a more efficient mean field approximation in which the forward-backward algorithm is used to compute the E-step exactly within each Markov chain, and mean field theory is used to handle interactions between chains (Saul and Jordan, 1996).\n\nAcknowledgements\n\nThis project was supported in part by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, and by grant N00014-94-1-0777 from the Office of Naval Research.\n\nReferences\n\nBaldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. (1994). Hidden Markov models of biological primary sequence information. Proc. Nat. Acad. Sci. (USA), 91(3):1059-1063. 
\n\nGhahramani, Z. (1995). Factorial learning and the EM algorithm. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA.\n\nHinton, G. and Zemel, R. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6. Morgan Kaufmann Publishers, San Francisco, CA.\n\nNeal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56:71-113.\n\nRabiner, L. and Juang, B. (1986). An introduction to hidden Markov models. IEEE Acoustics, Speech & Signal Processing Magazine, 3:4-16.\n\nSaul, L. and Jordan, M. (1995). Boltzmann chains and hidden Markov models. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA.\n\nSaul, L. and Jordan, M. (1996). Exploiting tractable substructures in intractable networks. In Touretzky, D., Mozer, M., and Hasselmo, M., editors, Advances in Neural Information Processing Systems 8. MIT Press.\n\nWilliams, C. and Hinton, G. (1991). Mean field networks that learn to discriminate temporally distorted strings. In Touretzky, D., Elman, J., Sejnowski, T., and Hinton, G., editors, Connectionist Models: Proceedings of the 1990 Summer School, pages 18-22. Morgan Kaufmann Publishers, San Mateo, CA.", "award": [], "sourceid": 1144, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}