{"title": "EM Optimization of Latent-Variable Density Models", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 471, "abstract": "", "full_text": "EM Optimization of Latent-Variable Density Models\n\nChristopher M Bishop, Markus Svensen and Christopher K I Williams\n\nNeural Computing Research Group\n\nAston University, Birmingham, B4 7ET, UK\n\nc.m.bishop@aston.ac.uk svensjfm@aston.ac.uk c.k.i.williams@aston.ac.uk\n\nAbstract\n\nThere is currently considerable interest in developing general non-linear density models based on latent, or hidden, variables. Such models have the ability to discover the presence of a relatively small number of underlying 'causes' which, acting in combination, give rise to the apparent complexity of the observed data set. Unfortunately, training such models generally requires large computational effort. In this paper we introduce a novel latent variable algorithm which retains the general non-linear capabilities of previous models but which uses a training procedure based on the EM algorithm. We demonstrate the performance of the model on a toy problem and on data from flow diagnostics for a multi-phase oil pipeline.\n\n1 INTRODUCTION\n\nMany conventional approaches to density estimation, such as mixture models, rely on linear superpositions of basis functions to represent the data density. Such approaches are unable to discover structure within the data whereby a relatively small number of 'causes' act in combination to account for apparent complexity in the data. There is therefore considerable interest in latent variable models in which the density function is expressed in terms of hidden variables. These include density networks (MacKay, 1995) and Helmholtz machines (Dayan et al., 1995). 
Much of this work has been concerned with predicting binary variables. In this paper we focus on continuous data.\n\nFigure 1: The latent variable density model constructs a distribution function in t-space in terms of a non-linear mapping y(x; W) from a latent variable x-space.\n\n2 THE LATENT VARIABLE MODEL\n\nSuppose we wish to model the distribution of data which lives in a D-dimensional space $t = (t_1, \\ldots, t_D)$. We first introduce a transformation from the hidden variable space $x = (x_1, \\ldots, x_L)$ to the data space, governed by a non-linear function y(x; W) which is parametrized by a matrix of weight parameters W. Typically we are interested in the situation in which the dimensionality L of the latent variable space is less than the dimensionality D of the data space, since we wish to capture the fact that the data itself has an intrinsic dimensionality which is less than D. The transformation y(x; W) then maps the hidden variable space into an L-dimensional non-Euclidean subspace embedded within the data space. This is illustrated schematically for the case of L = 2 and D = 3 in Figure 1.\n\nIf we define a probability distribution p(x) on the latent variable space, this will induce a corresponding distribution p(y) in the data space. We shall refer to p(x) as the prior distribution of x for reasons which will become clear shortly. Since L < D, the distribution in t-space would be confined to a manifold of dimension L and hence would be singular. Since in reality data will only approximately live on a lower-dimensional space, it is appropriate to include a noise model for the t vector. 
We therefore define the distribution of t, for given x and W, to be a spherical Gaussian centred on y(x; W) having variance $\\beta^{-1}$, so that\n\n$$p(t|x, W) = \\left(\\frac{\\beta}{2\\pi}\\right)^{D/2} \\exp\\left\\{-\\frac{\\beta}{2} \\|y(x; W) - t\\|^2\\right\\}. \\tag{1}$$\n\nThe distribution in t-space, for a given value of the weight matrix W, is then obtained by integration over the x-distribution\n\n$$p(t|W) = \\int p(t|x, W)\\, p(x)\\, dx. \\tag{2}$$\n\nFor a given data set $\\mathcal{D} = (t^1, \\ldots, t^N)$ of N data points, we can determine the weight matrix W using maximum likelihood. For convenience we introduce an error function given by the negative log likelihood:\n\n$$E(W) = -\\ln \\prod_{n=1}^{N} p(t^n|W) = -\\sum_{n=1}^{N} \\ln \\left\\{ \\int p(t^n|x^n, W)\\, p(x^n)\\, dx^n \\right\\}. \\tag{3}$$\n\nIn principle we can now seek the maximum likelihood solution for the weight matrix, once we have specified the prior distribution p(x) and the functional form of the mapping y(x; W), by minimizing E(W). However, the integrals over x occurring in (3), and in the corresponding expression for $\\nabla E$, will, in general, be analytically intractable. MacKay (1995) uses Monte Carlo techniques to evaluate these integrals and conjugate gradients to find the weights. This is computationally very intensive, however, since a Monte Carlo integration must be performed every time the conjugate gradient algorithm requests a value for E(W) or $\\nabla E(W)$. We now show how, by a suitable choice of model, it is possible to find an EM algorithm for determining the weights.\n\n2.1 EM ALGORITHM\n\nThere are three key steps to finding a tractable EM algorithm for evaluating the weights. The first is to use a generalized linear network model for the mapping function y(x; W). 
Thus we write\n\n$$y(x; W) = W \\phi(x) \\tag{4}$$\n\nwhere the elements of $\\phi(x)$ consist of M fixed basis functions $\\phi_j(x)$, and W is a $D \\times M$ matrix with elements $w_{kj}$. Generalized linear networks possess the same universal approximation capabilities as multi-layer adaptive networks. The price which has to be paid, however, is that the number of basis functions must typically grow exponentially with the dimensionality L of the input space. In the present context this is not a serious problem since the dimensionality is governed by the latent variable space and will typically be small. In fact we are particularly interested in visualization applications, for which L = 2.\n\nThe second important step is to use a simple Monte Carlo approximation for the integrals over x. In general, for a function Q(x) we can write\n\n$$\\int Q(x)\\, p(x)\\, dx \\approx \\frac{1}{K} \\sum_{i=1}^{K} Q(x^i) \\tag{5}$$\n\nwhere $x^i$ represents a sample drawn from the distribution p(x). If we apply this to (3) we obtain\n\n$$E(W) = -\\sum_{n=1}^{N} \\ln \\left\\{ \\frac{1}{K} \\sum_{i=1}^{K} p(t^n|x^{ni}, W) \\right\\}. \\tag{6}$$\n\nThe third key step is to choose the sample of points $\\{x^{ni}\\}$ to be the same for each term in the summation over n. Thus we can drop the index n on $x^{ni}$ to give\n\n$$E(W) = -\\sum_{n=1}^{N} \\ln \\left\\{ \\frac{1}{K} \\sum_{i=1}^{K} p(t^n|x^i, W) \\right\\}. \\tag{7}$$\n\nWe now note that (7) represents the negative log likelihood under a distribution consisting of a mixture of K kernel functions. This allows us to apply the EM algorithm to find the maximum likelihood solution for the weights. Furthermore, as a consequence of our choice (4) for the non-linear mapping function, it will turn out that the M-step can be performed explicitly, leading to a solution in terms of a set of linear equations. 
We note that this model corresponds to a constrained Gaussian mixture distribution of the kind discussed in Hinton et al. (1992).\n\nWe can formulate the EM algorithm for this system as follows. Setting the derivatives of (7) with respect to $w_{kj}$ to zero we obtain\n\n$$\\sum_{n=1}^{N} \\sum_{i=1}^{K} R_{in}(W) \\left\\{ \\sum_{j'=1}^{M} w_{kj'} \\phi_{j'}(x^i) - t_k^n \\right\\} \\phi_j(x^i) = 0 \\tag{8}$$\n\nwhere we have used Bayes' theorem to introduce the posterior probabilities, or responsibilities, for the mixture components given by\n\n$$R_{in}(W) = \\frac{p(t^n|x^i, W)}{\\sum_{i'=1}^{K} p(t^n|x^{i'}, W)}. \\tag{9}$$\n\nSimilarly, maximizing with respect to $\\beta$ we obtain\n\n$$\\frac{1}{\\beta} = \\frac{1}{ND} \\sum_{i=1}^{K} \\sum_{n=1}^{N} R_{in}(W)\\, \\|y(x^i; W) - t^n\\|^2. \\tag{10}$$\n\nThe EM algorithm is obtained by supposing that, at some point in the algorithm, the current weight matrix is given by $W^{\\mathrm{old}}$ and the current value of $\\beta$ is $\\beta^{\\mathrm{old}}$. Then we can evaluate the responsibilities using these values for W and $\\beta$ (the E-step), and then solve (8) for the weights to give $W^{\\mathrm{new}}$ and subsequently solve (10) to give $\\beta^{\\mathrm{new}}$ (the M-step). The two steps are repeated until a suitable convergence criterion is reached. In practice the algorithm converges after a relatively small number of iterations.\n\nA more formal justification for the EM algorithm can be given by introducing auxiliary variables to label which component is responsible for generating each data point, and then computing the expectation with respect to the distribution of these variables. Application of Jensen's inequality then shows that, at each iteration of the algorithm, the error function will decrease unless it is already at a (local) minimum, as discussed for example in Bishop (1995).\n\nIf desired, a regularization term can be added to the error function to control the complexity of the model y(x; W). 
From a Bayesian viewpoint, this corresponds to a prior distribution over weights. For a regularizer which is a quadratic function of the weight parameters, this leads to a straightforward modification to the weight update equations. It is convenient to write the condition (8) in matrix notation as\n\n$$(\\Phi^T G^{\\mathrm{old}} \\Phi + \\lambda I)(W^{\\mathrm{new}})^T = \\Phi^T T^{\\mathrm{old}} \\tag{11}$$\n\nwhere we have included a regularization term with coefficient $\\lambda$, and I denotes the unit matrix. In (11) $\\Phi$ is a $K \\times M$ matrix with elements $\\Phi_{ij} = \\phi_j(x^i)$, T is a $K \\times D$ matrix, and G is a $K \\times K$ diagonal matrix, with elements\n\n$$T_{ik} = \\sum_{n=1}^{N} R_{in}(W)\\, t_k^n, \\qquad G_{ii} = \\sum_{n=1}^{N} R_{in}(W). \\tag{12}$$\n\nWe can now solve (11) for $W^{\\mathrm{new}}$ using standard linear matrix inversion techniques, based on singular value decomposition to allow for possible ill-conditioning. Note that the matrix $\\Phi$ is constant throughout the algorithm, and so need only be evaluated once at the start.\n\nFigure 2: Results from a toy problem involving data ('x') generated from a 1-dimensional curve embedded in 2 dimensions, together with the projected sample points ('+') and their Gaussian noise distributions (filled circles). The initial configuration, determined by principal component analysis, is shown on the left, and an intermediate configuration, obtained after 4 iterations of EM, is shown on the right. 
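The E-step and M-step of Section 2.1 can be sketched in a few lines. The following is a minimal illustration in plain Python for a one-dimensional latent space, not the authors' implementation. The function names (`gaussian_basis`, `solve`, `em_step`), the default value of `lam`, the toy data, and the use of Gaussian elimination in place of the singular value decomposition mentioned in the text are all assumptions made for this example. It also assumes that (10) is evaluated with the updated weights.

```python
import math

def gaussian_basis(x, centres, width):
    # phi_j(x) for a one-dimensional latent variable x (illustrative choice)
    return [math.exp(-0.5 * ((x - c) / width) ** 2) for c in centres]

def solve(A, B):
    # Solve A X = B by Gauss-Jordan elimination with partial pivoting.
    # Stand-in for the SVD-based solver mentioned in the text.
    n, m = len(A), len(B[0])
    aug = [list(A[r]) + list(B[r]) for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [[aug[r][n + k] / aug[r][r] for k in range(m)] for r in range(n)]

def em_step(T, Phi, W, beta, lam=1e-3):
    # T: N data points of dimension D; Phi: K x M matrix; W: D x M weights
    N, D, K, M = len(T), len(T[0]), len(Phi), len(Phi[0])
    Y = [[sum(W[k][j] * Phi[i][j] for j in range(M)) for k in range(D)]
         for i in range(K)]
    # E-step, equation (9): the Gaussian normalising constant cancels
    R = [[0.0] * N for _ in range(K)]
    for n in range(N):
        d2 = [sum((Y[i][k] - T[n][k]) ** 2 for k in range(D)) for i in range(K)]
        dmin = min(d2)  # subtracted for numerical stability only
        p = [math.exp(-0.5 * beta * (d - dmin)) for d in d2]
        s = sum(p)
        for i in range(K):
            R[i][n] = p[i] / s
    # M-step for W, equations (11) and (12)
    G = [sum(R[i]) for i in range(K)]
    Tm = [[sum(R[i][n] * T[n][k] for n in range(N)) for k in range(D)]
          for i in range(K)]
    A = [[sum(Phi[i][a] * G[i] * Phi[i][b] for i in range(K))
          + (lam if a == b else 0.0) for b in range(M)] for a in range(M)]
    B = [[sum(Phi[i][a] * Tm[i][k] for i in range(K)) for k in range(D)]
         for a in range(M)]
    Wt = solve(A, B)  # M x D, i.e. the transpose of the new weights
    W_new = [[Wt[j][k] for j in range(M)] for k in range(D)]
    # M-step for beta, equation (10), using the updated weights (assumption)
    Y = [[sum(W_new[k][j] * Phi[i][j] for j in range(M)) for k in range(D)]
         for i in range(K)]
    sq = sum(R[i][n] * sum((Y[i][k] - T[n][k]) ** 2 for k in range(D))
             for i in range(K) for n in range(N))
    return W_new, N * D / sq, R

# Usage on made-up data: a slightly perturbed parabola in two dimensions
xs = [i / 9 for i in range(10)]  # K = 10 latent sample points
Phi = [gaussian_basis(x, [0.0, 0.5, 1.0], 0.5) for x in xs]
T = [[u / 19, (u / 19) ** 2 + 0.01 * (u % 3 - 1)] for u in range(20)]
W, beta = [[0.0, 0.5, 1.0], [0.0, 0.2, 1.0]], 1.0
for _ in range(5):
    W, beta, R = em_step(T, Phi, W, beta)
```

Because the design matrix is fixed, it is built once outside the loop, in line with the remark that it need only be evaluated at the start.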
\n\n3 RESULTS\n\nWe now present results from the application of this algorithm first to a toy problem involving data in two dimensions, and then to a more realistic problem involving 12-dimensional data arising from diagnostic measurements of oil flows along multi-phase pipelines.\n\nFor simplicity we choose the distribution p(x) to be uniform over the unit square. The basis functions $\\phi_j(x)$ are taken to be spherically symmetric Gaussian functions whose centres are distributed on a uniform grid in x-space, with a common width parameter chosen so that the standard deviation is equal to the separation of neighbouring basis functions. For both problems the weights in the network were initialized by performing principal components analysis on the data and then finding the least-squares solution for the weights which best approximates the linear transformation which maps latent space to target space while generating the correct mean and variance in target space.\n\nAs a simple demonstration of this algorithm, we consider data generated from a one-dimensional distribution embedded in two dimensions, as shown in Figure 2.\n\n3.1 OIL FLOW DATA\n\nOur second example arises in the problem of determining the fraction of oil in a multi-phase pipeline carrying a mixture of oil, water and gas (Bishop and James, 1993). Each data point consists of 12 measurements taken from dual-energy gamma densitometers measuring the attenuation of gamma beams passing through the pipe. Synthetically generated data is used which models accurately the attenuation processes in the pipe, as well as the presence of noise (arising from photon statistics). 
The three phases in the pipe (oil, water and gas) can belong to one of three different geometrical configurations, corresponding to stratified, homogeneous, and annular flows, and the data set consists of 1000 points distributed equally between the 3 classes. We take the latent variable space to be two-dimensional. This is appropriate for this problem as we know that, locally, the data must have an intrinsic dimensionality of two (neglecting noise on the data) since, for any given geometrical configuration of the three phases, there are two degrees of freedom corresponding to the fractions of oil and water in the pipe (the fraction of gas being redundant since the three fractions must sum to one). It also allows us to use the latent variable model to visualize the data by projection onto x-space.\n\nFigure 3: The left plot shows the posterior-mean projection of the oil data in the latent space of the non-linear model. The plot on the right shows the same data set projected onto the first two principal components. In both plots, crosses, circles and plus-signs represent the stratified, annular and homogeneous configurations respectively.
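The projection onto x-space can be illustrated with the posterior mean over the latent sample points. Under the sampled prior used in (7), one natural reading is that the posterior over the K sample points is given by the responsibilities, so that the posterior mean is their responsibility-weighted average. The sketch below makes that assumption explicit; `posterior_mean` and the toy inputs are hypothetical names and values chosen for illustration, not taken from the paper.

```python
import math

def posterior_mean(t, latent_points, images, beta):
    # latent_points: K sample points x^i in latent space (lists of length L)
    # images: the K corresponding points y(x^i; W) in data space
    # t: a single data point; beta: inverse noise variance
    d2 = [sum((yk - tk) ** 2 for yk, tk in zip(y, t)) for y in images]
    dmin = min(d2)  # subtracted for numerical stability only
    p = [math.exp(-0.5 * beta * (d - dmin)) for d in d2]
    s = sum(p)
    r = [pi / s for pi in p]  # responsibility of each sample point for t
    L = len(latent_points[0])
    return [sum(r[i] * latent_points[i][a] for i in range(len(r)))
            for a in range(L)]

# Usage: four latent points on the unit square mapped into three dimensions
latent = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
images = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
near_corner = posterior_mean([1.0, 0.0, 0.0], latent, images, beta=50.0)
centre = posterior_mean([0.5, 0.5, 0.0], latent, images, beta=50.0)
```

A data point lying on the image of one sample point projects almost exactly onto that sample point, while a point equidistant from all four images projects to the centre of the square. The second case also shows why the mean alone can hide a multi-modal posterior.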
\n\nFor the purposes of visualization, we note that a data point $t^n$ induces a posterior distribution $p(x|t^n, W^*)$ in x-space, where $W^*$ denotes the value of the weight matrix for the trained network. This provides considerably more information in the visualization space than many simple techniques (which generally project each data point onto a single point in the visualization space). For example, the posterior distribution may be multi-modal, indicating that there is more than one region of x-space which can claim significant responsibility for generating the data point. However, it is often convenient to project each data point down to a unique point in x-space. This can be done by finding the mean of the posterior distribution, which itself can be evaluated by a simple Monte Carlo integration using quantities already calculated in the evaluation of $W^*$.\n\nFigure 3 shows the oil data visualized in the latent-variable space in which, for each data point, we have plotted the posterior mean vector. Again the points have been labelled according to their multi-phase configuration. We have compared these results with those from a number of conventional techniques including factor analysis and principal component analysis. Note that factor analysis is precisely the model which results if a linear mapping is assumed for y(x; W), a Gaussian distribution p(x) is chosen in the latent space, and the noise distribution in data space is taken to be Gaussian with a diagonal covariance matrix. Of these techniques, principal component analysis gave the best class separation (assessed subjectively) and is illustrated in Figure 3. 
Comparison with the results from the non-linear model clearly shows that the latter gives much better separation of the three classes, as a consequence of the non-linearity permitted by the latent variable mapping.\n\n4 DISCUSSION\n\nThere are interesting relationships between the model discussed here and a number of well-known algorithms for unsupervised learning. We have already commented that factor analysis is a special case of this model, involving a linear mapping from latent space to data space. The Kohonen topographic map algorithm (Kohonen, 1995) can be regarded as an approximation to a latent variable density model of the kind outlined here. Finally, there are interesting similarities to a 'soft' version of the 'principal curves' algorithm (Tibshirani, 1992).\n\nThe model we have described can readily be extended to deal with the problem of missing data, provided we assume that the missing data is ignorable and missing at random (Little and Rubin, 1987). This involves maximizing the likelihood function in which the missing values have been integrated out. For the model discussed here, the integrations can be performed analytically, leading to a modified form of the EM algorithm.\n\nCurrently we are extending the model to allow for mixed continuous and categorical variables. We are also exploring Bayesian approaches, based on Markov chain Monte Carlo, to replace the maximum likelihood procedure.\n\nAcknowledgements\n\nThis work was partially supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory. Markus Svensen would like to thank the staff of the SANS group in Stockholm for their hospitality during part of this project.\n\nReferences\n\nBishop, C. M. (1995). 
Neural Networks for Pattern Recognition. Oxford University Press.\n\nBishop, C. M. and G. D. James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A327, 580-593.\n\nDayan, P., G. E. Hinton, R. M. Neal, and R. S. Zemel (1995). The Helmholtz machine. Neural Computation 7 (5), 889-904.\n\nHinton, G. E., C. K. I. Williams, and M. D. Revow (1992). Adaptive elastic models for hand-printed character recognition. In J. E. Moody, S. J. Hanson, and R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. Morgan Kaufmann.\n\nKohonen, T. (1995). Self-Organizing Maps. Berlin: Springer-Verlag.\n\nLittle, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York: John Wiley.\n\nMacKay, D. J. C. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, A 354 (1), 73-80.\n\nTibshirani, R. (1992). Principal curves revisited. Statistics and Computing 2, 183-190.\n", "award": [], "sourceid": 1132, "authors": [{"given_name": "Christopher", "family_name": "Bishop", "institution": null}, {"given_name": "Markus", "family_name": "Svens\u00e9n", "institution": null}, {"given_name": "Christopher", "family_name": "Williams", "institution": null}]}