{"title": "Fast Learning by Bounding Likelihoods in Sigmoid Type Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 528, "page_last": 534, "abstract": null, "full_text": "Fast Learning by Bounding Likelihoods in Sigmoid Type Belief Networks\n\nTommi Jaakkola\ntommi@psyche.mit.edu\n\nLawrence K. Saul\nlksaul@psyche.mit.edu\n\nMichael I. Jordan\njordan@psyche.mit.edu\n\nDepartment of Brain and Cognitive Sciences\nMassachusetts Institute of Technology\nCambridge, MA 02139\n\nAbstract\n\nSigmoid type belief networks, a class of probabilistic neural networks, provide a natural framework for compactly representing probabilistic information in a variety of unsupervised and supervised learning problems. Often the parameters used in these networks need to be learned from examples. Unfortunately, estimating the parameters via exact probabilistic calculations (i.e., the EM algorithm) is intractable even for networks with fairly small numbers of hidden units. We propose to avoid the infeasibility of the E step by bounding likelihoods instead of computing them exactly. We introduce extended and complementary representations for these networks and show that the estimation of the network parameters can be made fast (reduced to quadratic optimization) by performing the estimation in either of the alternative domains. The complementary networks can be used for continuous density estimation as well. 
\n\n1 Introduction\n\nThe appeal of probabilistic networks for knowledge representation, inference, and learning (Pearl, 1988) derives both from the sound Bayesian framework and from the explicit representation of dependencies among the network variables, which allows ready incorporation of prior information into the design of the network. The Bayesian formalism permits full propagation of probabilistic information across the network regardless of which variables in the network are instantiated. In this sense these networks can be \"inverted\" probabilistically.\n\nThis inversion, however, relies heavily on the use of look-up table representations of conditional probabilities, or representations equivalent to them, for modeling dependencies between the variables. For sparse dependency structures such as trees or chains this poses no difficulty. In more realistic cases of reasonably interdependent variables, the exact algorithms developed for these belief networks (Lauritzen & Spiegelhalter, 1988) become infeasible due to the exponential growth in the size of the conditional probability tables needed to store the exact dependencies. Therefore the use of compact representations to model probabilistic interactions is unavoidable in large problems. As belief network models move away from tables, however, the representations can be harder to assess from expert knowledge, and the important role of learning is further emphasized.\n\nCompact representations of interactions between simple units have long been emphasized in neural networks. Lacking a thorough probabilistic interpretation, however, classical feed-forward neural networks cannot be inverted in the above sense; e.g., 
given the output pattern of a feed-forward neural network it is not feasible to compute a probability distribution over the possible input patterns that would have resulted in the observed output. On the other hand, stochastic neural networks such as Boltzmann machines admit probabilistic interpretations and therefore, at least in principle, can be inverted and used as a basis for inference and learning in the presence of uncertainty.\n\nSigmoid belief networks (Neal, 1992) form a subclass of probabilistic neural networks where the activation function has a sigmoidal form, usually the logistic function. Neal (1992) proposed a learning algorithm for these networks which can be viewed as an improvement of the algorithm for Boltzmann machines. Recently Hinton et al. (1995) introduced the wake-sleep algorithm for layered bi-directional probabilistic networks. This algorithm relies on forward sampling and has an appealing coding theoretic motivation. The Helmholtz machine (Dayan et al., 1995), on the other hand, can be seen as an alternative technique for these architectures that avoids Gibbs sampling altogether. Dayan et al. also introduced the important idea of bounding likelihoods instead of computing them exactly. Saul et al. (1995) subsequently derived rigorous mean field bounds for the likelihoods. In this paper we introduce the idea of alternative (extended and complementary) representations of these networks by reinterpreting the nonlinearities in the activation function. We show that deriving likelihood bounds in the new representational domains leads to efficient (quadratic) estimation procedures for the network parameters. 
\n\n2 The probability representations\n\nBelief networks represent the joint probability of a set of variables $\\{S\\}$ as a product of conditional probabilities given by\n\n$$P(S_1, \\ldots, S_n) = \\prod_{k=1}^{n} P(S_k \\mid \\mathrm{pa}[k]), \\qquad (1)$$\n\nwhere the notation pa[k], \"parents of $S_k$\", refers to all the variables that directly influence the probability of $S_k$ taking on a particular value (for equivalent representations, see Lauritzen & Spiegelhalter, 1988). The fact that the joint probability can be written in the above form implies that there are no \"cycles\" in the network; i.e., there exists an ordering of the variables in the network such that no variable directly influences any preceding variables.\n\nIn this paper we consider sigmoid belief networks where the variables $S$ are binary (0/1), the conditional probabilities have the form\n\n$$P(S_i \\mid \\mathrm{pa}[i]) = g\\Big( (2S_i - 1) \\sum_j W_{ij} S_j \\Big) \\qquad (2)$$\n\nand the weights $W_{ij}$ are zero unless $S_j$ is a parent of $S_i$, thus preserving the feed-forward directionality of the network. For notational convenience we have assumed the existence of a bias variable whose value is clamped to one. The activation function $g(\\cdot)$ is chosen to be the cumulative Gaussian distribution function given by\n\n$$g(x) = \\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{x} e^{-\\frac{1}{2} z^2}\\, dz = \\frac{1}{\\sqrt{2\\pi}} \\int_{0}^{\\infty} e^{-\\frac{1}{2} (z - x)^2}\\, dz \\qquad (3)$$\n\nAlthough very similar to the standard logistic function, this activation function derives a number of advantages from its integral representation. In particular, we may reinterpret the integration as a marginalization and thereby obtain alternative representations for the network. We consider two such representations.\n\nWe derive an extended representation by making explicit the nonlinearities in the activation function. 
More precisely,\n\n$$P(S_i \\mid \\mathrm{pa}[i]) = g\\Big( (2S_i - 1) \\sum_j W_{ij} S_j \\Big) = \\int_{0}^{\\infty} \\frac{1}{\\sqrt{2\\pi}} e^{-\\frac{1}{2} [Z_i - (2S_i - 1) \\sum_j W_{ij} S_j]^2}\\, dZ_i \\qquad (4)$$\n\nThis suggests defining the extended network in terms of the new conditional probabilities $P(S_i, Z_i \\mid \\mathrm{pa}[i])$. By construction, the original binary network is then obtained by marginalizing over the extra variables $Z$. In this sense the extended network is (marginally) equivalent to the binary network.\n\nWe distinguish a complementary representation from the extended one by writing the probabilities entirely in terms of continuous variables (footnote 1). Such a representation can be obtained from the extended network by a simple transformation of variables. The new continuous variables are defined by $\\tilde{Z}_i = (2S_i - 1) Z_i$, or, equivalently, by $Z_i = |\\tilde{Z}_i|$ and $S_i = \\theta(\\tilde{Z}_i)$, where $\\theta(\\cdot)$ is the step function. Performing this transformation yields\n\n$$P(\\tilde{Z}_i \\mid \\mathrm{pa}[i]) = \\frac{1}{\\sqrt{2\\pi}} e^{-\\frac{1}{2} [\\tilde{Z}_i - \\sum_j W_{ij} \\theta(\\tilde{Z}_j)]^2} \\qquad (5)$$\n\nwhich defines a network of conditionally Gaussian variables. The original network in this case can be recovered by conditional marginalization over $\\tilde{Z}$ where the conditioning variables are $\\theta(\\tilde{Z})$. In the following we write $Z$ for these variables whenever the representation is clear from context.\n\nFootnote 1: While the binary variables are the outputs of each unit, the continuous variables pertain to the inputs; hence the name complementary.\n\nFigure 1 below summarizes the relationships between the different representations. As will become clear later, working with the alternative representations instead of the original binary representation can lead to more flexible and efficient (least-squares) parameter estimation.\n\n[Figure 1: The relationship between the alternative representations. Diagram labels: extended network; original network over {S}; complementary network over {Z}; transformation of variables.]\n\n3 The learning problem\n\nWe consider the problem of learning the parameters of the network from instantiations of variables contained in a training set. Such instantiations, however, need not be complete; there may be variables that have no value assignments in the training set as well as variables that are always instantiated. The tacit division between hidden (H) and visible (V) variables therefore depends on the particular training example considered and is not an intrinsic property of the network.\n\nTo learn from these instantiations we adopt the principle of maximum likelihood to estimate the weights in the network. In essence, this is a density estimation problem where the weights are chosen so as to match the probabilistic behavior of the network with the observed activities in the training set. Central to this estimation is the ability to compute likelihoods (or log-likelihoods) for any (partial) configuration of variables appearing in the training set. In other words, if we let $X^V$ be the configuration of visible or instantiated variables (footnote 2) and $X^H$ denote the hidden or uninstantiated variables, we need to compute marginal probabilities of the form\n\n$$P(X^V) = \\sum_{X^H} P(X^H, X^V) \\qquad (6)$$\n\nIf the training samples are independent, then these log marginals can be added to give the overall log-likelihood of the training set\n\n$$\\log P(\\text{training set}) = \\sum_t \\log P(X^{V_t}) \\qquad (7)$$\n\nUnfortunately, computing each of these marginal probabilities involves summing (integrating) over an exponential number of different configurations assumed by the hidden variables in the network. This renders the sum (integration) intractable in all but a few special cases (e.g., trees and chains). 
It is possible, however, to instead find a manageable lower bound on the log-likelihood and optimize the weights in the network so as to maximize this bound.\n\nTo obtain such a lower bound we resort to Jensen's inequality:\n\n$$\\log P(X^V) = \\log \\sum_{X^H} P(X^H, X^V) = \\log \\sum_{X^H} Q(X^H) \\frac{P(X^H, X^V)}{Q(X^H)} \\geq \\sum_{X^H} Q(X^H) \\log \\frac{P(X^H, X^V)}{Q(X^H)} \\qquad (8)$$\n\nAlthough this bound holds for all distributions $Q(X^H)$ over the hidden variables, the accuracy of the bound is determined by how closely $Q$ approximates the posterior distribution $P(X^H \\mid X^V)$ in terms of the Kullback-Leibler divergence; if the approximation is perfect the divergence is zero and the inequality is satisfied with equality. Suitable choices for $Q$ can make the bound both accurate and easy to compute. The feasibility of finding such $Q$, however, is highly dependent on the choice of the representation for the network.\n\nFootnote 2: To postpone the issue of representation we use $X$ to denote $S$, $\\{S, Z\\}$, or $Z$ depending on the particular representation chosen.\n\n4 Likelihood bounds in different representations\n\nTo complete the derivation of the likelihood bound (equation 8) we need to fix the representation for the network. Which representation to select, however, affects the quality and accuracy of the bound. In addition, the accompanying bound of the chosen representation implies bounds in the other two representational domains as they all code the same distributions over the observables. In this section we illustrate these points by deriving bounds in the complementary and extended representations and discuss the corresponding bounds in the original binary domain.\n\nNow, to obtain a lower bound we need to specify the approximate posterior $Q$. 
In the complementary representation the conditional probabilities are Gaussians and therefore a reasonable approximation (mean field) is found by choosing the posterior approximation from the family of factorized Gaussians:\n\n$$Q(Z) = \\prod_i \\frac{1}{\\sqrt{2\\pi}} e^{-(Z_i - h_i)^2 / 2} \\qquad (9)$$\n\nSubstituting this into equation 8 we obtain the bound\n\n$$\\log P(S^*) \\geq -\\frac{1}{2} \\sum_i \\Big( h_i - \\sum_j J_{ij}\\, g(h_j) \\Big)^2 - \\frac{1}{2} \\sum_{ij} J_{ij}^2\\, g(h_j)\\, g(-h_j) \\qquad (10)$$\n\nHere $J_{ij}$ denotes the weights of the network, written $W_{ij}$ in equation 2. The means $h_i$ for the hidden variables are adjustable parameters that can be tuned to make the bound as tight as possible. For the instantiated variables we need to enforce the constraints $g(h_i) = S_i^*$ to respect the instantiation. These can be satisfied very accurately by setting $h_i = 4(2S_i^* - 1)$, since $g(4) \\approx 0.99997$. A very convenient property of this bound and the complementary representation in general is the quadratic weight dependence, a property very conducive to fast learning. Finally, we note that the complementary representation transforms the binary estimation problem into a continuous density estimation problem.\n\nWe now turn to the interpretation of the above bound in the binary domain. The same bound can be obtained by first fixing the inputs to all the units to be the means $h_i$ and then computing the negative total mean squared error between the fixed inputs and the corresponding probabilistic inputs propagated from the parents. That this procedure in fact gives a lower bound on the log-likelihood would be more difficult to justify by working with the binary representation alone.\n\nIn the extended representation the probability distribution for $Z_i$ is a truncated Gaussian given $S_i$ and its parents. 
We therefore propose the partially factorized posterior approximation:\n\n$$Q(S, Z) = \\prod_i Q(Z_i \\mid S_i)\\, Q(S_i) \\qquad (11)$$\n\nwhere $Q(Z_i \\mid S_i)$ is a truncated Gaussian:\n\n$$Q(Z_i \\mid S_i) = \\frac{1}{g((2S_i - 1) h_i)} \\frac{1}{\\sqrt{2\\pi}} e^{-\\frac{1}{2} (Z_i - (2S_i - 1) h_i)^2} \\qquad (12)$$\n\nAs in the complementary domain, the resulting bound depends quadratically on the weights. Instead of writing out the bound here, however, it is more informative to see its derivation in the binary domain.\n\nA factorized posterior approximation (mean field) $Q(S) = \\prod_i q_i^{S_i} (1 - q_i)^{1 - S_i}$ for the binary network yields a bound\n\n$$\\log P(S^*) \\geq \\sum_i \\Big\\{ \\Big\\langle S_i \\log g\\Big( \\sum_j J_{ij} S_j \\Big) \\Big\\rangle + \\Big\\langle (1 - S_i) \\log \\Big( 1 - g\\Big( \\sum_j J_{ij} S_j \\Big) \\Big) \\Big\\rangle \\Big\\} \\qquad (13)$$\n\nwhere the averages $\\langle \\cdot \\rangle$ are with respect to the $Q$ distribution. These averages, however, cannot be evaluated analytically. The tractable posterior approximation in the extended domain avoids the problem by implicitly making the following Legendre transformation:\n\n$$\\log g(x) = \\Big[ \\frac{1}{2} x^2 + \\log g(x) \\Big] - \\frac{1}{2} x^2 \\geq \\lambda x - G(\\lambda) - \\frac{1}{2} x^2 \\qquad (14)$$\n\nwhich holds since $x^2/2 + \\log g(x)$ is a convex function. Inserting this back into the relevant parts of equation 13 and performing the averages gives\n\n$$\\log P(S^*) \\geq \\sum_i \\Big\\{ [q_i \\lambda_i - (1 - q_i) \\bar{\\lambda}_i] \\sum_j J_{ij} q_j - q_i G(\\lambda_i) - (1 - q_i) G(\\bar{\\lambda}_i) \\Big\\} - \\frac{1}{2} \\sum_i \\Big( \\sum_j J_{ij} q_j \\Big)^2 - \\frac{1}{2} \\sum_{ij} J_{ij}^2 q_j (1 - q_j) \\qquad (15)$$\n\nwhich is quadratic in the weights as expected. The mean activities $q$ for the hidden variables and the parameters $\\lambda$ can be optimized to make the bound tight. For the instantiated variables we set $q_i = S_i^*$.\n\n5 Numerical experiments\n\nTo test these techniques in practice we applied the complementary network to the problem of detecting motor failures from spectra obtained during motor operation (see Petsche et al., 1995). 
We cast the problem as a continuous density estimation problem. The training set consisted of 800 out of 1283 FFT spectra, each with 319 components, measured from an electric motor in a good operating condition but under varying loads. The test set included the remaining 483 FFTs from the same motor in a good condition in addition to three sets of 1340 FFTs, each measured when a particular fault was present. The goal was to use the likelihood of a test FFT with respect to the estimated density to determine whether there was a fault present in the motor.\n\nWe used a layered $6 \\to 20 \\to 319$ generative model to estimate the training set density. The resulting classification error rates on the test set are shown in figure 2 as a function of the threshold likelihood. The achieved error rates are comparable to those of Petsche et al. (1995).\n\n[Figure 2: The probability of error curves for missing a fault (dashed lines) and misclassifying a good motor (solid line) as a function of the likelihood threshold.]\n\n6 Conclusions\n\nNetwork models that admit probabilistic formulations derive a number of advantages from probability theory. Moving away from explicit representations of dependencies, however, can make these properties harder to exploit in practice. We showed that an efficient estimation procedure can be derived for sigmoid belief networks, where standard methods are intractable in all but a few special cases (e.g., trees and chains). The efficiency of our approach derived from the combination of two ideas. First, we avoided the intractability of computing likelihoods in these networks by computing lower bounds instead. Second, we introduced new representations for these networks and showed how the lower bounds in the new representational domains transform the parameter estimation problem into quadratic optimization.\n\nAcknowledgments\n\nThe authors wish to thank Peter Dayan for helpful comments. This project was supported in part by NSF grant CDA-9404932, by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, and by grant N00014-94-1-0777 from the Office of Naval Research. Michael I. Jordan is an NSF Presidential Young Investigator.\n\nReferences\n\nP. Dayan, G. Hinton, R. Neal, and R. Zemel (1995). The Helmholtz machine. Neural Computation 7: 889-904.\n\nA. Dempster, N. Laird, and D. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B 39: 1-38.\n\nG. Hinton, P. Dayan, B. Frey, and R. Neal (1995). The wake-sleep algorithm for unsupervised neural networks. Science 268: 1158-1161.\n\nS. L. Lauritzen and D. J. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Roy. Statist. Soc. B 50: 154-227.\n\nR. Neal (1992). Connectionist learning of belief networks. Artificial Intelligence 56: 71-113.\n\nJ. Pearl (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann: San Mateo.\n\nT. Petsche, A. Marcantonio, C. Darken, S. J. Hanson, G. M. Kuhn, and I. Santoso (1995). A neural network autoassociator for induction motor failure prediction. 
In Advances in Neural Information Processing Systems 8. MIT Press.\n\nL. K. Saul, T. Jaakkola, and M. I. Jordan (1995). Mean field theory for sigmoid belief networks. M.I.T. Computational Cognitive Science Technical Report 9501.", "award": [], "sourceid": 1111, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}