{"title": "Deterministic Annealing Variant of the EM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 545, "page_last": 552, "abstract": null, "full_text": "Deterministic Annealing Variant \n\nof the EM Algorithm \n\nN aonori U eda \nueda@cslab.kecl.ntt.jp \n\nRyohei N alcano \nnakano@cslab.kecl.ntt.jp \n\nNTT Communication Science  Laboratories \n\nHikaridai, Seika-cho,  Soraku-gun, \n\nKyoto 619-02 Japan \n\nAbstract \n\nWe  present  a  deterministic annealing variant of the EM  algorithm \nfor  maximum likelihood  parameter  estimation  problems.  In  our \napproach,  the  EM  process  is  reformulated  as the  problem of min(cid:173)\nimizing  the  thermodynamic free  energy  by  using  the  principle  of \nmaximum  entropy and  statistical mechanics  analogy.  Unlike simu(cid:173)\nlated  annealing approaches,  this  minimization is  deterministically \nperformed.  Moreover,  the  derived  algorithm,  unlike  the  conven(cid:173)\ntional EM  algorithm, can obtain better estimates free of the initial \nparameter values. \n\n1 \n\nINTRODUCTION \n\nThe  Expectation-Maximization (EM)  algorithm (Dempster,  Laird &  Rubin,  1977) \nis  an  iterative statistical  technique  for  computing maximum likelihood  parameter \nestimates from  incomplete data.  It has generally been  employed to  a  wide  variety \nof parameter estimation problems.  Recently,  the EM  algorithm has also been suc(cid:173)\ncessfully employed as  the learning algorithm of the hierarchical  mixture of experts \n(Jordan &  Jacobs,  1993).  In addition, it has been  found  to have some relationship \nto the learning of the Boltzmann machines (Byrne,  1992). \n\nThis algorithm has attractive features such  as reliable global convergence,  low  cost \nper iteration, economy of storage, and ease of programming, but it is not free  from \nproblems in  practice.  The serious  practical problem associated  with  the algorithm \n\n\f546 \n\nNaonori  Veda, Ryohei Nakano \n\nis  the local maxima problem.  This problem makes the  performance dependent  on \nthe  initial parameter value.  Indeed,  the  EM  algorithm should  be  performed  from \nas  wide a  choice  of starting values  as  possible according to some ad hoc criterion. \n\nTo overcome this  problem, we  adopt the principle of statistical mechanics.  Namely, \nby using the  principle of maximum entropy,  the thermodynamic free  energy  is  de(cid:173)\nfined  as  an  effective  cost  function  that  depends  on  the  temperature.  The  maxi(cid:173)\nmization of log-likelihood is done by minimizing the cost function.  Unlike simulated \nannealing  (Geman  &  Geman,  1984)  where  stochastic  search  is  performed  on  the \ngiven energy surface,  this cost function  is  deterministically optimized at each  tem(cid:173)\nperature. \n\nSuch deterministic annealing (DA)  approach has been successfully  adopted for  vec(cid:173)\ntor  quantization or  clustering  problems  (Rose  et  al.,  1992;  Buhmann  et  al.,  1993; \nWong,  1993).  Recently, Yuile  et  al.(Yuile, Stolorz, & Utans,  1994) have shown that \nthe EM  algorithm can be used  in  conjunction with the  DA.  In our previous paper, \nindependent ofYuile's work, we presented a new EM  algorithm with DA for mixture \ndensity  estimation problems (Ueda  &  Nakano,  1994).  The aim of this  paper  is  to \ngeneralize our earlier work and to derive a DA variant of the general EM algorithm. \nSince the  EM  algorithm can  be used  not only for  the mixture estimation problems \nbut also for other parameter estimation problems, this generalization is expected to \nbe of value in practice. \n\n2  GENERAL THEORY OF  THE EM  ALGORITHM \n\nSuppose  that a  measure space Y of \"unobservable data\"  exists  corresponding  to a \nmeasure  space  X  of \"observable  data\".  An  observable  data sample  ~(E X)  with \ndensity p(~; @)  is called incomplete and (~, y) with joint density p(~, y;@) is called \ncomplete, where  y  is an unobservable data sample 1  corresponding to~. Note that \n~ E'R,n  and y  E 'R,m.  @ is  parameter of the density  distribution to be estimated. \nGiven  incomplete  data samples X  =  {~klk =  1\"\", N},  the  goal  of the  EM  al(cid:173)\ngorithm is  to compute the maximum-likelihood estimate of @ that maximizes the \nfollowing log-likelihood function : \n\nN \n\nL(@;X) = Llogp(~k;@)' \n\n1:=1 \n\nby  using the following complete data log-likelihood function: \n\nLe(@;X) = I: logp(zk' Yk;@) ' \n\nN \n\nk=l \n\n(1) \n\n(2) \n\nIn the EM  algorithm, the parameter @ is  iteratively estimated.  Suppose that  @(t) \ndenotes  the  current  estimate of @  after  the  tth  iteration  of the  algorithm.  Then \n@(t+1)  at the next  iteration is determined by the following two steps: \n\n1 In  such  unsupervised  learning  as  mixture  problems,  1/  reduces  to  an  integer  value \n(1/  E {I, 2, ... , C}, where C  is the number of components),  indicating  the component from \nwhich  a data sample  x  originates. \n\n\fDeterministic Annealing  Variant  of the  EM Algorithm \n\n547 \n\nE-step:  Compute the  Q-function  defined  by  the  conditional  expectation  of the \n\ncomplete data log-likelihood given  X  and  @(t): \n\nQ(@; @(t\u00bb  ~ E{Lc(@;X)IX, @(t)}. \n\n(3) \n\nM-step: Set  @(t+l)  equal  to @  which  maximizes Q(@, @(t\u00bb. \n\nIt has theoretically  been  shown  that  an iterative procedure  for  maximizing Q  over \n@  will  cause the likelihood L  to monotonically increase,  e.g.,  L(@(t+l\u00bb  2:  L(@(t\u00bb. \nEventually, L(@(t\u00bb  converges to a local maximum. The EM  algorithm is especially \nuseful when the maximization of the Q-function can be more easily performed than \nthat of L. \nBy  substituting Eq.  2 into Eq.  3,  we  have \n\nQ(@;@{t\u00bb \n\n=  ~ J  ... J{logp(~j:'Yj:;@)} IIp(Yjl~j;@(t))dYl . .. dYN \n\nN \n\nN \n\nk=l \nN \n~ J  {logp(~k' Yk; @)}p(Ykl~k; @(t\u00bbdYk' \n\nj=l \n\n(4) \n\n(5) \n\nk=l \n\n@  that maximizes Q(@; @(t\u00bb  should satisfy aQla@ = 0,  or equivalently, \n\nN  J a \n~ {a@ logp( ~k, Yk; @) }P(Yk I~k; @(t\u00bbdYk = o. \nk=l \n\nHere,  p(Ykl~k' @(t\u00bb  denotes  the  posterior  and  can  be  computed by  the following \nBayes rule: \n\n(6) \n\nIt can be interpreted  that the missing information is estimated by  using the poste(cid:173)\nrior.  However,  because the reliability of the posterior highly depends on the param(cid:173)\neter @(t),  the performance of the EM  algorithm is sensitive to an initial parameter \nvalue  @(O).  This has often  caused  the algorithm to become trapped  by some local \nmaxima.  In  the next section,  we  will  derive  a  new  variant of the  EM  algorithm as \nan attempt at global maximization of the Q-function in the EM  process. \n\n3  DETERMINISTIC ANNEALING APPROACH \n\n3.1  DERIVATION  OF PARAMETERIZED  POSTERIOR \n\nInstead  of the  posterior  given  in  Eq.  6,  we  introduce  another posterior  f(Ykl~k)' \nThe function  form  of f  will  be specified  later.  Using  f(Ykl~k)' we  consider  a  new \nfunction  instead of Q,  say  E,  defined  as: \n\nE ~r ~ J  {-logp(~k' Yk; @)}f(Yk l~k)dYk' \n\nN \n\nk=l \n\n(7) \n\n\f548 \n\nNaonori  Ueda.  Ryohei Nakano \n\n(N ote:  E  is  always  nonnegative.)  One  can easily  see  that  (-E)  is  also  the  condi(cid:173)\ntional expectation of the  complete data log-likelihood but it differs from Q in  that \nthe expectation  is  taken with respect  to !(lIk I:l:k)  instead of the posterior given by \nEq.  6.  In  other words,  if !(Ykl:l:k) = P(lIkl:l:k; B(t\u00bb, then E = -Q. \nSince we  do  not have  a priori knowledge about !(lIk I:l:k),  we  apply the  principle  of \nmaximum  entropy to specify  it.  That is,  by  maximizing the entropy given by: \n\nN \n\nH = - L J {log !(1I k I:l:k)} !(Yk I:l:k )dllb \n\nk=1 \n\n(8) \nwith  respect  to  !, under  the  constraints of Eq.  7  and J !dYk  =  1,  we  can  obtain \nthe following Gibbs distribution: \n1 \nZ:l:\" \n\n(9) \nwhere  Z:I:\"  =  J exp{ -{J( -logp(:l:k, 11k; B\u00bb)}dllb  and  is  called  the  partition  func(cid:173)\ntion.  The parameter {J  is the Lagrange multiplier determined by the value E.  From \nan analogy of the  annealing,  1 j {J  corresponds  to the  \"temperature\". \n\nexp{ -{J( -logp(:l:k.lIk; B\u00bb)}, \n\n!(lIk I:l:k)  =  -\n\nBy simplifying Eq.  9,  we  obtain a  new  posterior  parameterized by {J, \n\n!(  I )  \n\nlIk:l:k  =  J p(:l:k, 11k; B)fJdllk . \n\np(:l:k, 11k; B)fJ \n\n(10) \n\nClearly, when {J  =  1, !(lIkl:l:k) reduces to the original posterior given in Eq.  6.  The \neffect  of {J  will  be explained later. \n\nSince  :1:1,  ..\u2022 ,:l:N  are  identically  and  independently  distributed  observations,  the \npartition function  Zp(B) for  X  becomes  Ok Z:l:\".  Therefore, \n\nZp(B) = II J P(lIbllk; B)Pdllk \u00b7 \n\nN \n\nk=l \n\n(11) \n\nOnce the partition function  is obtained explicitly, using statistical mechanics anal(cid:173)\nogy,  we  can define  the free  energy as an effective  cost  function  that depends  on the \ntemperature: \n\n(12) \n\nAt  equilibrium, it is well  known that a thermodynamic system settles  into a config(cid:173)\nuration  that  minimizes its free  energy.  Hence,  B  should  satisfy  oFp(B)joB =  O. \nIt follows  that \n\n(13) \n\nInterestingly, We have arrived at the same equation as the result ofthe maximization \nof the Q-function,  except  that the posterior P(lIk I:l:k; B(t\u00bb \nin  Eq.  5 is  replaced  by \n!(lIk I:l:k). \n\n\fDeterministic Annealing  Variant  of the  EM Algorithm \n\n549 \n\n3.2  ANNEALING  VARIANT  OF THE EM  ALGORITHM \n\nLet  Qf3(@; @(I\u00bb  be the expectation of the complete data log-likelihood by  the pa(cid:173)\nrameterized posterior f(y\" I~,,).  Then, the following deterministic annealing variant \nof the  EM  algorithm can  be naturally derived  to maximize -Ff3(@). \n[Annealing EM  (AEM)  algorithm] \n1.  Set {3  +- {3min(O  < {3min  <t::  1). \n2.  Arbitrarily choose  an initial estimate @(O).  Set  t  +- O. \n3.  Iterate the following two steps  until convergence2 : \n\nE-step:  Compute \n\nN  J \n\n\"=1 \n\nQfJ(@;@(t\u00bb  = ~ {logp(~\", y,,; @)} \n\ndy\". \n\n(14) \n\np(~\" y  . @(t\u00bbf3 \n\n, \n\n'\" \n\nI p(~\", y\";@(t\u00bb f3 dy,, \n\nM-step:  Set  @(t+l)  equal to  @  which maximizes Qf3(@; @(t\u00bb. \n4.  Increase {3. \n5.  If {3  < {3max,  set  t  +- t + 1,  and repeat  from step  3;  otherwise stop. \nOne  can see  that in the  proposed  algorithm, an outer  loop is  added to the original \nEM  algorithm for  the annealing process.  An important distinction to keep  in mind \nis  that  unlike simulated annealing,  the  optimization in  step  3  is  deterministically \nperformed at each {3.  Now let's consider the effect of the posterior parameterization \nof Eq.  10.  The  annealing  process  begins at small {3  (high  temperature).  Clearly, \nat this  time, since  f(y\"I~,,) becomes  uniform,  -Ff3(@)  has only  one  global maxi(cid:173)\nmum.  Hence,  the maximum can  be easily  found.  Then by  gradually  increasing {3, \nthe  influence  of each  ~\" is  gradually  localized.  At  {3  > 0,  function  Qf3  will  have \nseveral local maxima.  However, at each step, it can be assumed that the new  global \nmaximum is  close  to the  previous  one.  Hence,  by  this  assumption,  the  algorithm \ncan  track the global maximum at each  {3  while increasing {3.  Clearly,  when  {3  = 1 \n-Fl(@) = L(@), {3max  ought  be one. \nthe parameterized posterior coincides with  the original one.  Moreover,  noting that \n\n4  Demonstration \n\nTo  visualize  how  the  proposed  algorithm  works,  we  consider  a  simple  one(cid:173)\ndimensional,  two-component  normal  mixture  problem.  The  mixture  is  given  by \np(x;m},m2)  = 0.3kexp{-~(x - ml?} + 0.7*exp{-~(x - m2)2}.  In  this \ncase,  @  =  (mb m2),  y\"  E {I, 2},  and therefore,  the joint density  p(x, 1; ml, m2)  = \n0.3$ exp{ -~(x - ml)2}, while p(x, 2; ml, m2) = 0.7 $ \nOne  hundred  samples  in  total  were  generated  from  this  mixture  with  ml  =  -2 \nand  m2  = 2.  Figure  1 shows  contour  plots  of the  -FfJ(ml,m2)/N surface.  It is \ninteresting to see how  FfJ(m}, m2)  varies with {3.  One can see that a finer  and truer \nstructure  emerges  by  increasing  {3.  Note  that  as  explained  before,  when  {3  = 1, \n2When  the  sequence  converges  to  a  saddle  point  (e.g.,  when  the  Hessian  matrix  of \n-Fp(9)  has  at least  one  positive  eigen  value),  a  local  line  search  in  the  direction  of the \neigen  vector  corresponding  to  the largest  eigen  value  should  be  performed  to escape  the \nsolution  from  the saddle  point. \n\nexp{ -~(x - m2)2}. \n\n\f550 \n\nNaonori  Ueda,  Ryohei Nakano \n\n-4 \n\n0 \n\n-2 \n2 \n(a)  ~ .. 0.1 \n\n0 \n\n-2 \n2 \n(b)  ~ - 0.25 \n\n-4 \n\n0 \n\n-2 \n2 \n(c)  ~ '\" 0.3 \n\n-4 \n\n0 \n\n-2 \n2 \n(d)  ~ =  0.5 \n\n4  ml \n\n-4 \n\n0 \n\n-2 \n2 \n(e)  ~ .. 0.7 \n\n4  ml \n\n-4 \n\n0 \n\n-2 \n2 \n(f)  ~ = 1 \n\nFigure 1: Contour plots of  -Fp(mhm2)fN surface. \n\n(\"+\" denotes a global maximum at each JJ.) \n\n0 \n\n2 \n\n-2 \n\n-4 \n(a)  (~),m~~ = (4, -1) \n\n4  ml \n\n-4 \n\n-2 \n\n0 \n\n2 \n\n4  ml \n\n(b)  (m(~),m~~ =(-2,-4) \n\nFigure 2: Trajectories for EM and AEM procedures. \n\n\fDeterministic Annealing  Variant  of the  EM Algorithm \n\n551 \n\n-Fl(ml, m2)  ==  L(mb m2).  The maximum of -Fl (or L) occurs at m1  =  -2.0 and \nm2  = 1.9.  As  initial points,  (m~O),m~O\u00bb  = (4,-1),(-2,-4) were  tested.  Note that \nthese  points  are  close  to  a  local  maximum.  Figure  2  shows  how  both  algorithms \nconverge  from these starting points3. \nThe  original  EM  algorithm  converges  to  (mp2), m~12\u00bb  =  (2.002, -1.687)  (Figure \n2( a\u00bb and  to  (m~19), m~19\u00bb  = (1.999, -1.696)  (Figure  2(b\u00bb.  In contrast,  the  pro(cid:173)\nposed  AEM  algorithm converges  to (m~43), m~43\u00bb  =  (-2.022,1.879) (Figure  2(a\u00bb \nand  (m~43),m~43\u00bb  =  (-2.022,1.879) (Figure  2(b\u00bb.  Since  these  starting points  are \nclose  to the local maximum point, the original EM  algorithm becomes  trapped  by \nthe local  maximum point,  while  the  proposed  algorithm successfully  converges  to \nthe global  maximum in  both cases. \n\n5  Discussion \n\nA new view of the EM  algorithm has recently been presented (Hathaway, 1986; Neal \n&  Hinton,  1993).  This view states that the EM  steps can be regarded  as a grouped \nversion of the method of coordinate  ascent of the following objective function: \n\nJ ~f I: E p(lIlo) {logp(:l:l:, 111:; @)}  - I: E p(lIlo) {logp(lIl:)} \n\n(15) \n\nN \n\nI: \n\nN \n\nI: \n\nover  P(lIl:)  and  @.  That is,  the E-step  corresponds  to the maximization of J  with \nrespect  to  P(lIl:)  with  fixed  @,  while  the  M-step  corresponds  to  that  of J  with \nrespect  to @  with fixed  P(lIl:)'  Neal  & Hinton mentioned that apart from the sign \nchange,  J  is analogous to the  \"free energy\"  well  known  in statistical physics. \nIt is worth mentioning another interpretation of the derived parameterized posterior; \ni.e.,  it  plays  a  central  role  in  the  AEM  algorithm.  Taking the  logarithm on  both \nsides of Eq.  10,  we  have \n\n(16) \n\nN \n\nMoreover,  taking the conditional expectation4 given :1:1:,  summing over k,  and using \nEq.  12,  we  have \nFfJ(@) = I: EJ(lIlol:l:lo){ -logp(:l:I:, 111:; @)}+ p I: EJ(lIlo 1:1:10) {log 1(111: 1:l:1:)}\u00b7  (17) \nFrom Eqs.  7 and 8,  Eq.  17 can be rewritten as FfJ(@) = E - !H.  Noting that l/f3 \ncorresponds  to the  \"temperature\",  one  can  see  that  this expression  exactly  agrees \nwith  the  free  energy,  while  Eq.  15  is  completely  without  the  \"temperature\".  In \nother  words,  FfJ(@)  can  be  interpreted  as  an  annealing  variant  of -J.  Clearly, \nwhen f3  =  1,  FfJ(@)  ==  -J. \n\n1  N \n\n1:=1 \n\n1:=1 \n\n3 Although  the  AEM  procedure  was  actually  performed  for  successive  fJ  (fJnew  -\n\n1.4), the results  a.re  superimposed  on  -Fl(ml, m2)/N surfa.ce  for convenience. \n\nfJ  x \n\n4Since  the LHS of Eq.  16  is independent  of 111\"  it does  not  cha.nge  a.fter  expecta.tion. \n\n\f552 \n\nNaonori  Veda.  Ryohei Nakano \n\nThe  proposed  algorithm is  also  applicable to the learning of the  (Generalized)  Ra(cid:173)\ndial  Basis  Function  (RBF,  GRBF)  networks.  Indeed,  Nowlan  (Nowlan,  1990), for \ninstance,  proposes  a  maximum likelihood  competitive  learning  algorithm  for  the \nRBF  networks.  In  his  study,  \"soft  competition\"  and  \"hard  competition\"  are  eX(cid:173)\nperimentally  compared  and  it  is  shown  that  the  soft  competition  can  give  better \nperformance.  In  our  algorithm, on  the other  hand,  the  soft  model  exactly  corre(cid:173)\nsponds  to  the  case  f3  =  1,  while  the  hard  model  corresponds  to  the  case  f3  -+  00. \nConsequently, both models can be regarded  as special  cases  in  our algorithm. \n\nAcknowledgements \n\nWe would like to thank Dr. Tsukasa Kawaoka, NTT CS labs, for  his encouragement. \n\nReferences \n\nJ.  Buhmann  &  H.  Kuhnel.  (1993)  Complexity optimized data clustering  by  com(cid:173)\npetitive neural  networks.  Neural  Computation,  5:75-88. \n\n(1992)  Alternating  minimization  and  Boltzmann  machine  learning. \n\nW.  Byrne. \nIEEE  Trans.  Neural  Networks,  3:612-620. \nA.  P.  Dempster,  N.  M.  Laird  &  D.  B.  Rubin.  (1977)  Maximum-likelihood  from \nincomplete data via the EM  algorithm.  J.  Royal Statist.  Soc.  Ser.  B  (methodolog(cid:173)\nical),39:1-38. \nS.  Geman  &  D.  Geman.  (1984)  Stochastic  relaxation,  Gibbs  distribution  and the \nBaysian restortion in images.  IEEE Trans.  Pattern Anal.  Machine  Intel/\"  6,6:721-\n741. \n\nR.  J.  Hathaway.  (1986)  Another  interpretation of the  EM  algorithm for  mixture \ndistributions.  Statistics  fj  Probability  Leiters, 4:  53-56. \nM.  1.  Jordan  &  R.  A.  Jacob.  (1993)  Hierarchical  mixtures of experts  and the  EM \nalgorithm.  MIT  Dept.  of Brain and Cognitive Science  preprint. \nR.  M.  Neal &  G.  E.  Hinton.  (1993)  A new  view  of the  EM  algorithm that justifies \nincremental and other variants.  submitted to Biometrika. \n\nS.  J.  Nowlan.  (1990) Maximum likelihood competitve learning.  in  D.  S.  Touretzky \net  al.  eds.,  Advances in Neural Information Systems 2,  Morgan Kaufmann.  574-582. \n\nK.  Rose,  E.  Gurewitz  &  G.  C.  Fox.  (1992)  Vector  quantization  by  deterministic \nannealing.  IEEE  Trans.  Information  Theory,  38,4:1249-1257. \nN.  Ueda &  R.  Nakano.  (1994)  Mixture  density  estimation via  EM  algorithm with \ndeterministic  annealing. \nin  proc.  IEEE  Neural  Networks  for  Signal  Processing, \n69-77. \n\nY.  Wong.  (1993)  Clustering data by  melting.  Neural  Computation,  5:89-104. \nA.  1. Yuille, P.  Stolorz &  J. Utans.  (1994) Statistical physics,  mixtures of distribu(cid:173)\ntions,  and  the EM  algorithm.  Neural  Computation,  6:334-340. \n\n\f", "award": [], "sourceid": 941, "authors": [{"given_name": "Naonori", "family_name": "Ueda", "institution": null}, {"given_name": "Ryohei", "family_name": "Nakano", "institution": null}]}