{"title": "Exploiting Tractable Substructures in Intractable Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 486, "page_last": 492, "abstract": null, "full_text": "Exploiting Tractable Substructures in Intractable Networks\n\nLawrence K. Saul and Michael I. Jordan\n\n{lksaul,jordan}@psyche.mit.edu\n\nCenter for Biological and Computational Learning\n\nMassachusetts Institute of Technology\n\n79 Amherst Street, E10-243\n\nCambridge, MA 02139\n\nAbstract\n\nWe develop a refined mean field approximation for inference and learning in probabilistic neural networks. Our mean field theory, unlike most, does not assume that the units behave as independent degrees of freedom; instead, it exploits in a principled way the existence of large substructures that are computationally tractable. To illustrate the advantages of this framework, we show how to incorporate weak higher order interactions into a first-order hidden Markov model, treating the corrections (but not the first order structure) within mean field theory.\n\n1 INTRODUCTION\n\nLearning the parameters in a probabilistic neural network may be viewed as a problem in statistical estimation. In networks with sparse connectivity (e.g. trees and chains), there exist efficient algorithms for the exact probabilistic calculations that support inference and learning. In general, however, these calculations are intractable, and approximations are required.\n\nMean field theory provides a framework for approximation in probabilistic neural networks (Peterson & Anderson, 1987). Most applications of mean field theory, however, have made a rather drastic probabilistic assumption, namely that the units in the network behave as independent degrees of freedom. In this paper we show how to go beyond this assumption. 
We describe a self-consistent approximation in which tractable substructures are handled by exact computations and only the remaining, intractable parts of the network are handled within mean field theory. For simplicity we focus on networks with binary units; the extension to discrete-valued (Potts) units is straightforward.\n\nWe apply these ideas to hidden Markov modeling (Rabiner & Juang, 1991). The first order probabilistic structure of hidden Markov models (HMMs) leads to networks with chained architectures for which efficient, exact algorithms are available. More elaborate networks are obtained by introducing couplings between multiple HMMs (Williams & Hinton, 1990) and/or long-range couplings within a single HMM (Stolorz, 1994). Both sorts of extensions have interesting applications; in speech, for example, multiple HMMs can provide a distributed representation of the articulatory state, while long-range couplings can model the effects of coarticulation. In general, however, such extensions lead to networks for which exact probabilistic calculations are not feasible. One would like to develop a mean field approximation for these networks that exploits the tractability of first-order HMMs. This is possible within the more sophisticated mean field theory described here.\n\n2 MEAN FIELD THEORY\n\nWe briefly review the basic methodology of mean field theory for networks of binary ($\\pm 1$) stochastic units (Parisi, 1988). For each configuration $\\{S\\} = \\{S_1, S_2, \\ldots, S_N\\}$, we define an energy $E\\{S\\}$ and a probability $P\\{S\\}$ via the Boltzmann distribution:\n\n$$P\\{S\\} = \\frac{e^{-\\beta E\\{S\\}}}{Z}, \\tag{1}$$\n\nwhere $\\beta$ is the inverse temperature and $Z$ is the partition function. 
When it is intractable to compute averages over $P\\{S\\}$, we are motivated to look for an approximating distribution $Q\\{S\\}$. Mean field theory posits a particular parametrized form for $Q\\{S\\}$, then chooses parameters to minimize the Kullback-Leibler (KL) divergence:\n\n$$\\mathrm{KL}(Q\\|P) = \\sum_{\\{S\\}} Q\\{S\\} \\ln \\left[ \\frac{Q\\{S\\}}{P\\{S\\}} \\right]. \\tag{2}$$\n\nWhy are mean field approximations valuable for learning? Suppose that $P\\{S\\}$ represents the posterior distribution over hidden variables, as in the E-step of an EM algorithm (Dempster, Laird, & Rubin, 1977). Then we obtain a mean field approximation to this E-step by replacing the statistics of $P\\{S\\}$ (which may be quite difficult to compute) with those of $Q\\{S\\}$ (which may be much simpler). If, in addition, $Z$ represents the likelihood of observed data (as is the case for the example of section 3), then the mean field approximation yields a lower bound on the log-likelihood. This can be seen by noting that for any approximating distribution $Q\\{S\\}$, we can form the lower bound:\n\n\\begin{align} \\ln Z &= \\ln \\sum_{\\{S\\}} e^{-\\beta E\\{S\\}} \\tag{3} \\\\ &= \\ln \\sum_{\\{S\\}} Q\\{S\\} \\cdot \\left[ \\frac{e^{-\\beta E\\{S\\}}}{Q\\{S\\}} \\right] \\tag{4} \\\\ &\\ge \\sum_{\\{S\\}} Q\\{S\\} \\left[ -\\beta E\\{S\\} - \\ln Q\\{S\\} \\right], \\tag{5} \\end{align}\n\nwhere the last line follows from Jensen's inequality. The difference between the left and right-hand sides of eq. (5) is exactly $\\mathrm{KL}(Q\\|P)$; thus the better the approximation to $P\\{S\\}$, the tighter the bound on $\\ln Z$. Once a lower bound is available, a learning procedure can maximize the lower bound. This is useful when the true likelihood itself cannot be efficiently computed.\n\n2.1 Complete Factorizability\n\nThe simplest mean field theory involves assuming marginal independence for the units $S_i$. 
Consider, for example, a quadratic energy function\n\n$$-\\beta E\\{S\\} = \\sum_{i<j} J_{ij} S_i S_j + \\sum_i h_i S_i \\tag{6}$$\n\nand the factorized approximation:\n\n$$Q\\{S\\} = \\prod_i \\left( \\frac{1 + m_i S_i}{2} \\right). \\tag{7}$$\n\nThe expectations under this mean field approximation are $\\langle S_i \\rangle = m_i$ and $\\langle S_i S_j \\rangle = m_i m_j$ for $i \\neq j$. The best approximation of this form is found by minimizing the KL-divergence,\n\n$$\\mathrm{KL}(Q\\|P) = \\sum_i \\left[ \\left( \\frac{1 + m_i}{2} \\right) \\ln \\left( \\frac{1 + m_i}{2} \\right) + \\left( \\frac{1 - m_i}{2} \\right) \\ln \\left( \\frac{1 - m_i}{2} \\right) \\right] - \\sum_{i<j} J_{ij} m_i m_j - \\sum_i h_i m_i + \\ln Z, \\tag{8}$$\n\nwith respect to the mean field parameters $m_i$. Setting the gradients of eq. (8) equal to zero, we obtain the (classical) mean field equations:\n\n$$\\tanh^{-1}(m_i) = \\sum_j J_{ij} m_j + h_i. \\tag{9}$$\n\n2.2 Partial Factorizability\n\nWe now consider a more structured model in which the network consists of interacting modules that, taken in isolation, define tractable substructures. One example of this would be a network of weakly coupled HMMs, in which each HMM, taken by itself, defines a chain-like substructure that supports efficient probabilistic calculations. We denote the interactions between these modules by parameters $K^{\\mu\\nu}_{ij}$, where the superscripts $\\mu$ and $\\nu$ range over modules and the subscripts $i$ and $j$ index units within modules. An appropriate energy function for this network is:\n\n$$-\\beta E\\{S\\} = \\sum_\\mu \\left\\{ \\sum_{i<j} J^\\mu_{ij} S^\\mu_i S^\\mu_j + \\sum_i h^\\mu_i S^\\mu_i \\right\\} + \\sum_{\\mu<\\nu} \\sum_{ij} K^{\\mu\\nu}_{ij} S^\\mu_i S^\\nu_j. \\tag{10}$$\n\nThe first term in this energy function contains the intra-modular interactions; the last term, the inter-modular ones.\n\nWe now consider a mean field approximation that maintains the first sum over modules but dispenses with the inter-modular corrections:\n\n$$Q\\{S\\} = \\frac{1}{Z_Q} \\exp \\left\\{ \\sum_\\mu \\left[ \\sum_{i<j} J^\\mu_{ij} S^\\mu_i S^\\mu_j + \\sum_i H^\\mu_i S^\\mu_i \\right] \\right\\}. \\tag{11}$$\n\nThe parameters of this mean field approximation are $H^\\mu_i$; they will be chosen to provide a self-consistent model of the inter-modular interactions. 
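We easily obtain the following expectations under the mean field approximation, where $\\mu \\neq \\nu$:\n\n\\begin{align} \\langle S^\\mu_i S^\\omega_j \\rangle &= \\delta_{\\mu\\omega} \\langle S^\\mu_i S^\\mu_j \\rangle + (1 - \\delta_{\\mu\\omega}) \\langle S^\\mu_i \\rangle \\langle S^\\omega_j \\rangle, \\tag{12} \\\\ \\langle S^\\mu_i S^\\nu_j S^\\omega_k \\rangle &= \\delta_{\\mu\\omega} \\langle S^\\mu_i S^\\mu_k \\rangle \\langle S^\\nu_j \\rangle + \\delta_{\\nu\\omega} \\langle S^\\nu_j S^\\nu_k \\rangle \\langle S^\\mu_i \\rangle + (1 - \\delta_{\\nu\\omega})(1 - \\delta_{\\omega\\mu}) \\langle S^\\mu_i \\rangle \\langle S^\\nu_j \\rangle \\langle S^\\omega_k \\rangle. \\tag{13} \\end{align}\n\nNote that units in the same module are statistically correlated and that these correlations are assumed to be taken into account in calculating the expectations. We assume that an efficient algorithm is available for handling these intra-modular correlations. For example, if the factorized modules are chains (e.g. obtained from a coupled set of HMMs), then computing these expectations requires a forward-backward pass through each chain.\n\nThe best approximation of the form, eq. (11), is found by minimizing the KL-divergence,\n\n$$\\mathrm{KL}(Q\\|P) = \\ln(Z/Z_Q) + \\sum_\\mu \\sum_i (H^\\mu_i - h^\\mu_i) \\langle S^\\mu_i \\rangle - \\sum_{\\mu<\\nu} \\sum_{ij} K^{\\mu\\nu}_{ij} \\langle S^\\mu_i S^\\nu_j \\rangle, \\tag{14}$$\n\nwith respect to the mean field parameters $H^\\mu_i$. To compute the appropriate gradients, we use the fact that derivatives of expectations under a Boltzmann distribution (e.g. $\\partial \\langle S^\\mu_i \\rangle / \\partial H^\\omega_k$) yield cumulants (e.g. $\\langle S^\\mu_i S^\\omega_k \\rangle - \\langle S^\\mu_i \\rangle \\langle S^\\omega_k \\rangle$). The conditions for stationarity are then:\n\n$$0 = \\sum_\\mu \\sum_i (H^\\mu_i - h^\\mu_i) \\left[ \\langle S^\\mu_i S^\\omega_k \\rangle - \\langle S^\\mu_i \\rangle \\langle S^\\omega_k \\rangle \\right] - \\sum_{\\mu<\\nu} \\sum_{ij} K^{\\mu\\nu}_{ij} \\left[ \\langle S^\\mu_i S^\\nu_j S^\\omega_k \\rangle - \\langle S^\\mu_i S^\\nu_j \\rangle \\langle S^\\omega_k \\rangle \\right]. \\tag{15}$$\n\nSubstituting the expectations from eqs. (12) and (13), we find that $\\mathrm{KL}(Q\\|P)$ is minimized when\n\n$$0 = \\sum_i \\left\\{ H^\\omega_i - h^\\omega_i - \\sum_{\\nu \\neq \\omega} \\sum_j K^{\\omega\\nu}_{ij} \\langle S^\\nu_j \\rangle \\right\\} \\left[ \\langle S^\\omega_i S^\\omega_k \\rangle - \\langle S^\\omega_i \\rangle \\langle S^\\omega_k \\rangle \\right]. \\tag{16}$$\n\nThe resulting mean field equations are:\n\n$$H^\\omega_i = \\sum_{\\nu \\neq \\omega} \\sum_j K^{\\omega\\nu}_{ij} \\langle S^\\nu_j \\rangle + h^\\omega_i. \\tag{17}$$\n\nThese equations may be solved by iteration, in which the (assumed) tractable algorithms for averaging over $Q\\{S\\}$ are invoked as subroutines to compute the expectations $\\langle S^\\nu_j \\rangle$ on the right hand side. 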
Because these expectations depend on $H^\\mu_i$, these equations may be viewed as a self-consistent model of the inter-modular interactions. Note that the mean field parameter $H^\\mu_i$ plays a role analogous to $\\tanh^{-1}(m_i)$ in eq. (9) of the fully factorized case.\n\n2.3 Inducing Partial Factorizability\n\nMany interesting networks do not have strictly modular architectures and can only be approximately decomposed into tractable core structures. Techniques are needed in such cases to induce partial factorizability. Suppose for example that we are given an energy function\n\n$$-\\beta E\\{S\\} = \\sum_{i<j} J_{ij} S_i S_j + \\sum_i h_i S_i + \\sum_{i<j} K_{ij} S_i S_j, \\tag{18}$$\n\nfor which the first two terms represent tractable interactions and the last term, intractable ones. Thus the weights $J_{ij}$ by themselves define a tractable skeleton network, but the weights $K_{ij}$ spoil this tractability. Mimicking the steps of the previous section, we obtain the mean field equations:\n\n$$0 = \\sum_i \\left( \\langle S_i S_k \\rangle - \\langle S_i \\rangle \\langle S_k \\rangle \\right) [H_i - h_i] - \\sum_{i<j} K_{ij} \\left[ \\langle S_i S_j S_k \\rangle - \\langle S_i S_j \\rangle \\langle S_k \\rangle \\right]. \\tag{19}$$\n\nIn this case, however, the weights $K_{ij}$ couple units in the same core structure. Because these units are not assumed to be independent, the triple correlator $\\langle S_i S_j S_k \\rangle$ does not factorize, and we no longer obtain the decoupled update rules of eq. (17). Rather, for these mean field equations, each iteration requires computing triple correlators and solving a large set of coupled linear equations.\n\nTo avoid this heavy computational load, we instead manipulate the energy function into one that can be partially factorized. This is done by introducing extra hidden variables $W_{ij} = \\pm 1$ on the intractable links of the network. In particular, consider the energy function\n\n$$-\\beta E\\{S, W\\} = \\sum_{i<j} J_{ij} S_i S_j + \\sum_i h_i S_i + \\sum_{i<j} \\left[ K^{(1)}_{ij} S_i + K^{(2)}_{ij} S_j \\right] W_{ij}. \\tag{20}$$\n\nThe hidden variables $W_{ij}$ in eq. 
(20) serve to decouple the units connected by the intractable weights $K_{ij}$. However, we can always choose the new interactions, $K^{(1)}_{ij}$ and $K^{(2)}_{ij}$, so that\n\n$$e^{-\\beta E\\{S\\}} = \\sum_{\\{W\\}} e^{-\\beta E\\{S, W\\}}. \\tag{21}$$\n\nEq. (21) states that the marginal distribution over $\\{S\\}$ in the new network is identical to the joint distribution over $\\{S\\}$ in the original one. Summing both sides of eq. (21) over $\\{S\\}$, it follows that both networks have the same partition function. The form of the energy function in eq. (20) suggests the mean field approximation:\n\n$$Q\\{S, W\\} = \\frac{1}{Z_Q} \\exp \\left\\{ \\sum_{i<j} J_{ij} S_i S_j + \\sum_i H_i S_i + \\sum_{i<j} H_{ij} W_{ij} \\right\\}, \\tag{22}$$\n\nwhere the mean field parameters $H_i$ have been augmented by a set of additional mean field parameters $H_{ij}$ that account for the extra hidden variables. In this expression, the variables $S_i$ and $W_{ij}$ act as decoupled degrees of freedom and the methods of the preceding section can be applied directly. We consider an example of this reduction in the following section.\n\n3 EXAMPLE\n\nConsider a continuous-output HMM in which the probability of an output $X_t$ at time $t$ is dependent not only on the state at time $t$, but also on the state at time $t + \\Delta$. Such a context-sensitive HMM may serve as a flexible model of anticipatory coarticulatory effects in speech, with $\\Delta \\approx 50$ ms representing a mean phoneme lifetime. Incorporating these interactions into the basic HMM probability model, we obtain the following joint probability on states and outputs:\n\n$$P\\{S, X\\} = \\prod_{t=1}^{T-1} a_{S_t S_{t+1}} \\prod_{t=1}^{T-\\Delta} \\frac{1}{(2\\pi)^{D/2}} \\exp \\left\\{ -\\frac{1}{2} \\left[ X_t - U_{S_t} - V_{S_{t+\\Delta}} \\right]^2 \\right\\}. \\tag{23}$$\n\nDenoting the likelihood of an output sequence by $Z$, we have\n\n$$Z = P\\{X\\} = \\sum_{\\{S\\}} P\\{S, X\\}. \\tag{24}$$\n\nWe can represent this probability model using energies rather than transition probabilities (Luttrell, 1989; Saul and Jordan, 1995). For the special case of binary 
For  the  special  case  of binary \n\n\fExploiting Tractable  Substructures in  Intractable  Networks \n\n491 \n\nHere,  a++  is  the  probability  of transitioning  from  the  ON  state  to  the  ON  state \n(and similarly for  the other a  parameters),  while 0+  and V+  are the mean outputs \nassociated  with  the  ON  state  at time steps t  and t + .6.  (and similarly for  0_  and \nV_).  Given these  definitions,  we  obtain an equivalent expression  for  the likelihood: \n\nZ  =  Lexp {-gO + 'E JStSt+1  + thtSt + 'E KStSt+t:..} , \n\nt=l \n\nt=l \n\n{S} \n\nt=l \n\n(27) \n\nwhere  go  is  a  placeholder  for  the  terms in  InP{S,X}  that do  not  depend  on  {S}. \nWe  can  interpret  Z  as  the  partition function  for  the  chained  network  of T  binary \nunits  that  represents  the  HMM  unfolded  in  time.  The  nearest  neighbor  connec(cid:173)\ntivity of this network  reflects  the first  order structure of the  HMM;  the  long-range \nconnectivity reflects  the higher order interactions that model sensitivity to context. \n\nThe  exact  likelihood  can  in  principle  be  computed  by  summing over  the  hidden \nstates in eq.  (27),  but the required  forward-backward  algorithm scales much worse \nthan  the  case  of first-order  HMMs.  Because  the  likelihood  can  be  identified  as  a \npartition function,  however,  we  can obtain a  lower  bound on its value  from  mean \nfield  theory.  To exploit the tractable first  order structure of the HMM,  we  induce a \npartially factorizable network by introducing extra link variables on the long-range \nconnections,  as  described  in  section  2.3.  The  resulting  mean field  approximation \nuses  the  chained  structure  as  its  backbone  and  should  be  accurate  if  the  higher \norder  effects  in the data are weak  compared to the basic first-order  structure. \n\nThe  above  scenario  was  tested  in  numerical  simulations.  
In actuality, we implemented a generalization of the model in eq. (23): our HMM had non-binary hidden states and a coarticulation model that incorporated both left and right context. This network was trained on several artificial data sets according to the following procedure. First, we fixed the \"context\" weights to zero and used the Baum-Welch algorithm to estimate the first order structure of the HMM. Then, we lifted the zero constraints and re-estimated the parameters of the HMM by a mean field EM algorithm. In the E-step of this algorithm, the true posterior $P\\{S|X\\}$ was approximated by the distribution $Q\\{S|X\\}$ obtained by solving the mean field equations; in the M-step, the parameters of the HMM were updated to match the statistics of $Q\\{S|X\\}$. Figure 1 shows the type of structure captured by a typical network.\n\nFigure 1: 2D output vectors $\\{X_t\\}$ sampled from a first-order HMM and a context-sensitive HMM, each with $n = 5$ hidden states. The latter's coarticulation model used left and right context, coupling $X_t$ to the hidden states at times $t$ and $t \\pm 5$. At left: the five main clusters reveal the basic first-order structure. At right: weak modulations reveal the effects of context.\n\n1 There are boundary corrections to $h_t$ (not shown) for $t = 1$ and $t > T - \\Delta$.\n\n4 CONCLUSIONS\n\nEndowing networks with probabilistic semantics provides a unified framework for incorporating prior knowledge, handling missing data, and performing inferences under uncertainty. Probabilistic calculations, however, can quickly become intractable, so it is important to develop techniques that both approximate probability distributions in a flexible manner and make use of exact techniques wherever possible. In this paper we have developed a mean field approximation that meets both these objectives. As an example, we have applied our methods to context-sensitive HMMs, but the methods are general and can be applied more widely.\n\nAcknowledgements\n\nThe authors acknowledge support from NSF grant CDA-9404932, ONR grant N00014-94-1-0777, ATR Research Laboratories, and Siemens Corporation.\n\nReferences\n\nA. Dempster, N. Laird, and D. Rubin. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B39: 1-38.\n\nB. H. Juang and L. R. Rabiner. (1991) Hidden Markov models for speech recognition. Technometrics 33: 251-272.\n\nS. Luttrell. (1989) The Gibbs machine applied to hidden Markov model problems. Royal Signals and Radar Establishment: SP Research Note 99.\n\nG. Parisi. (1988) Statistical field theory. Addison-Wesley: Redwood City, CA.\n\nC. Peterson and J. R. Anderson. (1987) A mean field theory learning algorithm for neural networks. Complex Systems 1: 995-1019.\n\nL. Saul and M. Jordan. (1994) Learning in Boltzmann trees. Neural Comp. 6: 1174-1184.\n\nL. Saul and M. Jordan. (1995) Boltzmann chains and hidden Markov models. In G. Tesauro, D. Touretzky, and T. Leen, eds. Advances in Neural Information Processing Systems 7. MIT Press: Cambridge, MA.\n\nP. Stolorz. (1994) Recursive approaches to the statistical physics of lattice proteins. In L. Hunter, ed. Proc. 27th Hawaii Intl. Conf. on System Sciences V: 316-325.\n\nC. Williams and G. E. Hinton. (1990) Mean field networks that learn to discriminate temporally distorted strings. Proc. Connectionist Models Summer School: 18-22. 
\n\n\f", "award": [], "sourceid": 1155, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}