{"title": "Local Probability Propagation for Factor Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 442, "page_last": 448, "abstract": null, "full_text": "Local probability propagation for factor analysis\n\nBrendan J. Frey\n\nComputer Science, University of Waterloo, Waterloo, Ontario, Canada\n\nAbstract\n\nEver since Pearl's probability propagation algorithm in graphs with cycles was shown to produce excellent results for error-correcting decoding a few years ago, we have been curious about whether local probability propagation could be used successfully for machine learning. One of the simplest adaptive models is the factor analyzer, which is a two-layer network that models bottom layer sensory inputs as a linear combination of top layer factors plus independent Gaussian sensor noise. We show that local probability propagation in the factor analyzer network usually takes just a few iterations to perform accurate inference, even in networks with 320 sensors and 80 factors. We derive an expression for the algorithm's fixed point and show that this fixed point matches the exact solution in a variety of networks, even when the fixed point is unstable. We also show that this method can be used successfully to perform inference for approximate EM and we give results on an online face recognition task.\n\n1  Factor analysis\n\nA simple way to encode input patterns is to suppose that each input can be well approximated by a linear combination of component vectors, where the amplitudes of the vectors are modulated to match the input. For a given training set, the most appropriate set of component vectors will depend on how we expect the modulation levels to behave and how we measure the distance between the input and its approximation. 
These effects can be captured by a generative probability model that specifies a distribution $p(z)$ over modulation levels $z = (z_1, \\ldots, z_K)^T$ and a distribution $p(x|z)$ over sensors $x = (x_1, \\ldots, x_N)^T$ given the modulation levels. Principal component analysis, independent component analysis and factor analysis can be viewed as maximum likelihood learning in a model of this type, where we assume that over the training set, the appropriate modulation levels are independent and the overall distortion is given by the sum of the individual sensor distortions.\n\nIn factor analysis, the modulation levels are called factors and the distributions have the following form:\n\n$$p(z_k) = \\mathcal{N}(z_k; 0, 1), \\qquad p(z) = \\prod_{k=1}^{K} p(z_k) = \\mathcal{N}(z; 0, I),$$\n\n$$p(x_n|z) = \\mathcal{N}\\Big(x_n; \\sum_{k=1}^{K} \\lambda_{nk} z_k, \\psi_n\\Big), \\qquad p(x|z) = \\prod_{n=1}^{N} p(x_n|z) = \\mathcal{N}(x; \\Lambda z, \\Psi). \\tag{1}$$\n\nThe parameters of this model are the factor loading matrix $\\Lambda$, with elements $\\lambda_{nk}$, and the diagonal sensor noise covariance matrix $\\Psi$, with diagonal elements $\\psi_n$. A belief network for the factor analyzer is shown in Fig. 1a. The likelihood is\n\n$$p(x) = \\int \\mathcal{N}(z; 0, I)\\, \\mathcal{N}(x; \\Lambda z, \\Psi)\\, dz = \\mathcal{N}(x; 0, \\Lambda\\Lambda^T + \\Psi), \\tag{2}$$\n\nand online factor analysis consists of adapting $\\Lambda$ and $\\Psi$ to increase the likelihood of the current input, such as a vector of pixels from an image in Fig. 1b.\n\nFigure 1: (a) A belief network for factor analysis. (b) High-dimensional data (N = 560).\n\nProbabilistic inference (computing or estimating $p(z|x)$) is needed to do dimensionality reduction and to fill in the unobserved factors for online EM-type learning. In this paper, we focus on methods that infer independent factors. 
$p(z|x)$ is Gaussian and it turns out that the posterior means and variances of the factors are\n\n$$E[z|x] = (\\Lambda^T \\Psi^{-1} \\Lambda + I)^{-1} \\Lambda^T \\Psi^{-1} x, \\qquad \\mathrm{diag}(\\mathrm{COV}(z|x)) = \\mathrm{diag}\\big((\\Lambda^T \\Psi^{-1} \\Lambda + I)^{-1}\\big). \\tag{3}$$\n\nGiven $\\Lambda$ and $\\Psi$, computing these values exactly takes $O(K^2 N)$ computations, mainly because of the time needed to compute $\\Lambda^T \\Psi^{-1} \\Lambda$. Since there are only $KN$ connections in the network, exact inference takes at least $O(K)$ bottom-up/top-down iterations.\n\nOf course, if the same network is going to be applied more than $K$ times for inference (e.g., for batch EM), then the matrices in (3) can be computed once and reused. However, this is not directly applicable in online learning and in biological models. One way to circumvent computing the matrices is to keep a separate recognition network, which approximates $E[z|x]$ with $Rx$ (Dayan et al., 1995). The optimal recognition network, $R = (\\Lambda^T \\Psi^{-1} \\Lambda + I)^{-1} \\Lambda^T \\Psi^{-1}$, can be approximated by jointly estimating the generative network and the recognition network using online wake-sleep learning (Hinton et al., 1995).\n\n2  Probability propagation in the factor analyzer network\n\nRecent results on error-correcting coding show that in some cases Pearl's probability propagation algorithm, which does exact probabilistic inference in graphs that are trees, gives excellent performance even if the network contains so many cycles that its minimal cut set is exponential (Frey and MacKay, 1998; Frey, 1998; MacKay, 1999). In fact, the probability propagation algorithm for decoding low-density parity-check codes (MacKay, 1999) and turbocodes (Berrou and Glavieux, 1996) is widely considered to be a major breakthrough in the information theory community. 
\n\nWhen the network contains cycles, the local computations give rise to an iterative algorithm, which hopefully converges to a good answer. Little is known about the convergence properties of the algorithm. Networks containing a single cycle have been successfully analyzed by Weiss (1999) and Smyth et al. (1997), but results for networks containing many cycles are much less revealing.\n\nThe probability messages produced by probability propagation in the factor analyzer network of Fig. 1a are Gaussians. Each iteration of propagation consists of passing a mean and a variance along each edge in a bottom-up pass, followed by passing a mean and a variance along each edge in a top-down pass. At any instant, the bottom-up means and variances can be combined to form estimates of the means and variances of the modulation levels given the input.\n\nInitially, the variance and mean sent from the $k$th top layer unit to the $n$th sensor are set to $v_{kn}^{(0)} = 1$ and $\\eta_{kn}^{(0)} = 0$. The bottom-up pass begins by computing a noise level and an error signal at each sensor using the top-down variances and means from the previous iteration:\n\n$$s_n^{(i)} = \\psi_n + \\sum_{k=1}^{K} \\lambda_{nk}^2 v_{kn}^{(i-1)}, \\qquad e_n^{(i)} = x_n - \\sum_{k=1}^{K} \\lambda_{nk} \\eta_{kn}^{(i-1)}. \\tag{4}$$\n\nThese are used to compute bottom-up variances and means as follows:\n\n$$\\phi_{nk}^{(i)} = s_n^{(i)}/\\lambda_{nk}^2 - v_{kn}^{(i-1)}, \\qquad \\mu_{nk}^{(i)} = e_n^{(i)}/\\lambda_{nk} + \\eta_{kn}^{(i-1)}. \\tag{5}$$\n\nThe bottom-up variances and means are then combined to form the current estimates of the modulation variances and means:\n\n$$v_k^{(i)} = 1\\Big/\\Big(1 + \\sum_{n=1}^{N} 1/\\phi_{nk}^{(i)}\\Big), \\qquad \\hat{z}_k^{(i)} = v_k^{(i)} \\sum_{n=1}^{N} \\mu_{nk}^{(i)}/\\phi_{nk}^{(i)}. \\tag{6}$$\n\nThe top-down pass proceeds by computing top-down variances and means as follows:\n\n$$v_{kn}^{(i)} = 1\\big/\\big(1/v_k^{(i)} - 1/\\phi_{nk}^{(i)}\\big), \\qquad \\eta_{kn}^{(i)} = v_{kn}^{(i)} \\big(\\hat{z}_k^{(i)}/v_k^{(i)} - \\mu_{nk}^{(i)}/\\phi_{nk}^{(i)}\\big). \\tag{7}$$\n\nNotice that the variance updates are independent of the mean updates, whereas the mean updates depend on the variance updates.\n\n2.1  Performance of local probability propagation.  We created a total of 200,000 factor analysis networks with 20 different sizes ranging from $K = 5$, $N = 10$ to $K = 80$, $N = 320$ and for each size of network we measured the inference error as a function of the number of iterations of propagation. Each of the 10,000 networks of a given size was produced by drawing the $\\lambda_{nk}$s from standard normal distributions and then drawing each sensor variance $\\psi_n$ from an exponential distribution with mean $\\sum_{k=1}^{K} \\lambda_{nk}^2$. (A similar procedure was used by Neal and Dayan (1997).)\n\nFor each random network, a pattern was simulated from the network and probability propagation was applied using the simulated pattern as input. We measured the error between the estimate $\\hat{z}^{(i)}$ and the correct value $E[z|x]$ by computing the difference between their coding costs under the exact posterior distribution and then normalizing by $K$ to get an average number of nats per top layer unit.\n\nFig. 
2a shows the inference error on a logarithmic scale versus the number of iterations (maximum of 20) in the 20 different network sizes. In all cases, the median error is reduced below 0.01 nats within 6 iterations. The rate of convergence of the error improves for larger $N$, as indicated by a general trend for the error curves to drop when $N$ is increased. In contrast, the rate of convergence of the error appears to worsen for larger $K$, as shown by a general slight trend for the error curves to rise when $K$ is increased.\n\nFor $K \\geq N/8$, 0.1% of the networks actually diverge. To better understand the divergent cases, we studied the means and variances for all of the divergent networks. In all cases, the variances converge within a few iterations whereas the means oscillate and diverge. For $K = 5$, $N = 10$, 54 of the 10,000 networks diverged and 5 of these are shown in Fig. 2b. This observation suggests that in general the dynamics are determined by the dynamics of the mean updates.\n\n2.2  Fixed points and a condition for global convergence.  When the variance updates converge, the dynamics of probability propagation in factor analysis networks become linear. This allows us to derive the fixed point of propagation in closed form and write an eigenvalue condition for global convergence.\n\nFigure 2: (a) Performance of probability propagation. Median inference error (bold curve) on a logarithmic scale as a function of the number of iterations for different sizes of network parameterized by K and N. 
The two curves adjacent to the bold curve show the range within which 98% of the errors lie. 99.9% of the errors were below the fourth, topmost curve. (b) The error, bottom-up variances and top-down means as a function of the number of iterations (maximum of 20) for 5 divergent networks of size K = 5, N = 10.\n\nTo analyze the system of mean updates, we define the following length $KN$ vectors of means and the input: $\\tilde{\\eta}^{(i)} = (\\eta_{11}^{(i)}, \\eta_{21}^{(i)}, \\ldots, \\eta_{K1}^{(i)}, \\eta_{12}^{(i)}, \\ldots, \\eta_{KN}^{(i)})^T$, $\\tilde{\\mu}^{(i)} = (\\mu_{11}^{(i)}, \\mu_{12}^{(i)}, \\ldots, \\mu_{1K}^{(i)}, \\mu_{21}^{(i)}, \\ldots, \\mu_{NK}^{(i)})^T$, $\\tilde{x} = (x_1, x_1, \\ldots, x_1, x_2, \\ldots, x_2, \\ldots, x_N, \\ldots, x_N)^T$, where each $x_n$ is repeated $K$ times in the last vector. The network parameters are represented using $KN \\times KN$ diagonal matrices, $\\tilde{\\Lambda}$ and $\\tilde{\\Psi}$. The diagonal of $\\tilde{\\Lambda}$ is $\\lambda_{11}, \\ldots, \\lambda_{1K}, \\lambda_{21}, \\ldots, \\lambda_{NK}$, and the diagonal of $\\tilde{\\Psi}$ is $\\psi_1 I, \\psi_2 I, \\ldots, \\psi_N I$, where $I$ is the $K \\times K$ identity matrix. The converged bottom-up variances are represented using a diagonal matrix $\\tilde{\\Phi}$ with diagonal $\\phi_{11}, \\ldots, \\phi_{1K}, \\phi_{21}, \\ldots, \\phi_{NK}$.\n\nThe summation operations in the propagation formulas are represented by a $KN \\times KN$ matrix $\\tilde{\\Sigma}_z$ that sums over means sent down from the top layer and a $KN \\times KN$ matrix $\\tilde{\\Sigma}_x$ that sums over means sent up from the sensory input:\n\n$$\\tilde{\\Sigma}_z = \\begin{pmatrix} \\mathbf{1} & & \\\\ & \\ddots & \\\\ & & \\mathbf{1} \\end{pmatrix}, \\qquad \\tilde{\\Sigma}_x = \\begin{pmatrix} I & I & \\cdots & I \\\\ I & I & \\cdots & I \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ I & I & \\cdots & I \\end{pmatrix}. \\tag{8}$$\n\nThese are $N \\times N$ matrices of $K \\times K$ blocks, where $\\mathbf{1}$ is the $K \\times K$ block of ones and $I$ is the $K \\times K$ identity matrix.\n\nUsing the above representations, the bottom-up pass is given by\n\n$$\\tilde{\\mu}^{(i)} = \\tilde{\\Lambda}^{-1} \\tilde{x} - \\tilde{\\Lambda}^{-1} (\\tilde{\\Sigma}_z - I) \\tilde{\\Lambda} \\tilde{\\eta}^{(i-1)}, \\tag{9}$$\n\nand the top-down pass is given by\n\n$$\\tilde{\\eta}^{(i)} = \\big(I + \\mathrm{diag}(\\tilde{\\Sigma}_x \\tilde{\\Phi}^{-1} \\tilde{\\Sigma}_x) - \\tilde{\\Phi}^{-1}\\big)^{-1} (\\tilde{\\Sigma}_x - I) \\tilde{\\Phi}^{-1} \\tilde{\\mu}^{(i)}. \\tag{10}$$\n\nSubstituting (10) into (9), we get the linear update for $\\tilde{\\mu}$:\n\n$$\\tilde{\\mu}^{(i)} = \\tilde{\\Lambda}^{-1} \\tilde{x} - \\tilde{\\Lambda}^{-1} (\\tilde{\\Sigma}_z - I) \\tilde{\\Lambda} \\big(I + \\mathrm{diag}(\\tilde{\\Sigma}_x \\tilde{\\Phi}^{-1} \\tilde{\\Sigma}_x) - \\tilde{\\Phi}^{-1}\\big)^{-1} (\\tilde{\\Sigma}_x - I) \\tilde{\\Phi}^{-1} \\tilde{\\mu}^{(i-1)}. \\tag{11}$$\n\nFigure 3: The error (log scale) versus number of iterations (log scale, max. of 1000) in 10 of the divergent networks with K = 5, N = 10. The means were initialized to the fixed point solutions and machine round-off errors cause divergence from the fixed points, whose errors are shown by horizontal lines. Values printed with the panels (the modulus of the largest eigenvalue for each network): 1.11, 1.06, 1.24, 1.07, 1.49, 1.13, 1.03, 1.02, 1.09, 1.01.\n\nThe fixed point of this dynamic system, when it exists, is\n\n$$\\tilde{\\mu}^{*} = \\tilde{\\Phi} \\Big\\{ \\tilde{\\Lambda} \\tilde{\\Phi} + (\\tilde{\\Sigma}_z - I) \\tilde{\\Lambda} \\big(I + \\mathrm{diag}(\\tilde{\\Sigma}_x \\tilde{\\Phi}^{-1} \\tilde{\\Sigma}_x) - \\tilde{\\Phi}^{-1}\\big)^{-1} (\\tilde{\\Sigma}_x - I) \\Big\\}^{-1} \\tilde{x}. \\tag{12}$$\n\nA fixed point exists if the determinant of the expression in large braces in (12) is nonzero. We have found a simplified expression for this determinant in terms of the determinants of smaller, $K \\times K$ matrices.\n\nReinterpreting the dynamics in (11) as dynamics for $\\tilde{\\Lambda} \\tilde{\\mu}^{(i)}$, the stability of a fixed point is determined by the largest eigenvalue of the update matrix, $(\\tilde{\\Sigma}_z - I) \\tilde{\\Lambda} \\big(I + \\mathrm{diag}(\\tilde{\\Sigma}_x \\tilde{\\Phi}^{-1} \\tilde{\\Sigma}_x) - \\tilde{\\Phi}^{-1}\\big)^{-1} (\\tilde{\\Sigma}_x - I) \\tilde{\\Phi}^{-1} \\tilde{\\Lambda}^{-1}$. If the modulus of the largest eigenvalue is less than 1, the fixed point is stable. Since the system is linear, if a stable fixed point exists, the system will be globally convergent to this point.\n\nOf the 200,000 networks we explored, about 99.9% of the networks converged. For 10 of the divergent networks with $K = 5$, $N = 10$, we used 1000 iterations of probability propagation to compute the steady state variances. Then, we computed the modulus of the largest eigenvalue of the system and we computed the fixed point. 
After initializing the bottom-up means to the fixed point values, we performed 1000 iterations to see if numerical errors due to machine precision would cause divergence from the fixed point. Fig. 3 shows the error versus number of iterations (on logarithmic scales) for each network, the error of the fixed point, and the modulus of the largest eigenvalue. In some cases, the network diverges from the fixed point and reaches a dynamic equilibrium that has a lower average error than the fixed point.\n\n3  Online factor analysis\n\nTo perform maximum likelihood factor analysis in an online fashion, each parameter should be modified to slightly increase the log-probability of the current sensory input, $\\log p(x)$. However, since the factors are hidden, they must be probabilistically \"filled in\" using inference before an incremental learning step is performed.\n\nIf the estimated mean and variance of the $k$th factor are $\\hat{z}_k$ and $v_k$, then it turns out (e.g., Neal and Dayan, 1997) that the parameters can be updated as follows:\n\n$$\\lambda_{nk} \\leftarrow \\lambda_{nk} + \\eta \\Big[\\hat{z}_k \\Big(x_n - \\sum_{j=1}^{K} \\lambda_{nj} \\hat{z}_j\\Big) - v_k \\lambda_{nk}\\Big] \\Big/ \\psi_n,$$\n\n$$\\psi_n \\leftarrow (1 - \\eta) \\psi_n + \\eta \\Big[\\Big(x_n - \\sum_{j=1}^{K} \\lambda_{nj} \\hat{z}_j\\Big)^2 + \\sum_{j=1}^{K} v_j \\lambda_{nj}^2\\Big], \\tag{13}$$\n\nwhere $\\eta$ is a learning rate.\n\nOnline learning consists of performing some number of iterations of probability propagation for the current input (e.g., 4 iterations) and then modifying the parameters before processing the next input.\n\n3.1  Results on simulated data.  We produced 95 training sets of 200 cases each, with input sizes ranging from 20 sensors to 320 sensors. For each of 19 sizes of factor analyzer, we randomly selected 5 sets of parameters as described above and generated a training set. 
The factor analyzer sizes were $K \\in \\{5, 10, 20, 40, 80\\}$, $N \\in \\{20, 40, 80, 160, 320\\}$, $N > K$. For each factor analyzer and simulated data set, we estimated the optimal log-probability of the data using 100 iterations of EM.\n\nFigure 4: (a) Achievable errors after the same number of epochs of learning using 4 iterations versus 1 iteration. The horizontal axis gives the log-probability error (log scale) for learning with 1 iteration and the vertical axis gives the error after the same number of epochs for learning with 4 iterations. (b) The achievable errors for learning using 4 iterations of propagation versus wake-sleep learning using 4 iterations.\n\nFor learning, the size of the model to be trained was set equal to the size of the model that was used to generate the data. To avoid the issue of how to schedule learning rates, we searched for achievable learning curves, regardless of whether or not a simple schedule for the learning rate exists. So, for a given method and randomly initialized parameters, we performed one separate epoch of learning using each of the learning rates $1, 0.5, \\ldots, 0.5^{20}$ and picked the learning rate that most improved the log-probability. Each successive learning rate was determined by comparing the performance using the old learning rate and one 0.75 times smaller.\n\nWe are mainly interested in comparing the achievable curves for different methods and how the differences scale with $K$ and $N$. For two methods with the same $K$ and $N$ trained on the same data, we plot the log-probability error (optimal log-probability minus log-probability under the learned model) of one method against the log-probability error of the other method.\n\nFig. 4a shows the achievable errors using 4 iterations versus using 1 iteration. 
Usually, using 4 iterations produces networks with lower errors than those learned using 1 iteration. The difference is most significant for networks with large $K$, where in Sec. 2.1 we found that the convergence of the inference error was slower.\n\nFig. 4b shows the achievable errors for learning using 4 iterations of probability propagation versus wake-sleep learning using 4 iterations. Generally, probability propagation achieves much smaller errors than wake-sleep learning, although for small $K$ wake-sleep performs better very close to the optimum log-probability. The most significant difference between the methods occurs for large $K$, where aside from local optima probability propagation achieves nearly optimal log-probabilities while the log-probabilities for wake-sleep learning are still close to their values at the start of learning.\n\n4  Online face recognition\n\nFig. 1b shows examples from a set of 30,000 $20 \\times 28$ greyscale face images of 18 different people. In contrast to other data sets used to test face recognition methods, these faces include wide variation in expression and pose. To make classification more difficult, we normalized the images for each person so that each pixel has the same mean and variance. We used probability propagation and a recognition network in a factor analyzer to reduce the dimensionality of the data online from 560 dimensions to 40 dimensions. For probability propagation, we rather arbitrarily chose a learning rate of 0.0001, but for wake-sleep learning we tried learning rates ranging from 0.1 down to 0.0001. A multilayer perceptron with one hidden layer of 160 tanh units and one output layer of 18 softmax units was simultaneously being trained using gradient descent to predict face identity from the mean factors. 
The learning rate for the multilayer perceptron was set to 0.05 and this value was used for both methods.\n\nFor each image, a prediction was made before the parameters were modified. Fig. 5 shows online error curves obtained by filtering the losses. The curve for probability propagation is generally below the curves for wake-sleep learning.\n\nThe figure also shows the error curves for two forms of online nearest neighbors, where only the most recent $W$ cases are used to make a prediction. The form of nearest neighbors that performs the worst has $W$ set so that the storage requirements are the same as for the factor analysis / multilayer perceptron method. The better form of nearest neighbors has $W$ set so that the number of computations is the same as for the factor analysis / multilayer perceptron method.\n\nFigure 5: Online error curves for probability propagation (solid), wake-sleep learning (dashed), nearest neighbors (dot-dashed) and guessing (dotted). Horizontal axis: number of pattern presentations.\n\n5  Summary\n\nIt turns out that iterative probability propagation can be fruitful when used for learning in a graphical model with cycles, even when the model is densely connected. Although we are more interested in extending this work to more complex models where exact inference takes exponential time, studying iterative probability propagation in the factor analyzer allowed us to compare our results with exact inference and allowed us to derive the fixed point of the algorithm. We are currently applying iterative propagation in multiple cause networks for vision problems.\n\nReferences\n\nC. Berrou and A. Glavieux 1996. 
Near optimum error correcting coding and decoding: Turbo-codes. IEEE Trans. on Communications, 44, 1261-1271.\n\nP. Dayan, G. E. Hinton, R. M. Neal and R. S. Zemel 1995. The Helmholtz machine. Neural Computation 7, 889-904.\n\nB. J. Frey and D. J. C. MacKay 1998. A revolution: Belief propagation in graphs with cycles. In M. Jordan, M. Kearns and S. Solla (eds), Advances in Neural Information Processing Systems 10, Denver, 1997.\n\nB. J. Frey 1998. Graphical Models for Machine Learning and Digital Communication. MIT Press, Cambridge MA. See http://www.cs.utoronto.ca/~frey.\n\nG. E. Hinton, P. Dayan, B. J. Frey and R. M. Neal 1995. The wake-sleep algorithm for unsupervised neural networks. Science 268, 1158-1161.\n\nD. J. C. MacKay 1999. Information Theory, Inference and Learning Algorithms. Book in preparation, currently available at http://wol.ra.phy.cam.ac.uk/mackay.\n\nR. M. Neal and P. Dayan 1997. Factor analysis using delta-rule wake-sleep learning. Neural Computation 9, 1781-1804.\n\nP. Smyth, R. J. McEliece, M. Xu, S. Aji and G. Horn 1997. Probability propagation in graphs with cycles. Presented at the workshop on Inference and Learning in Graphical Models, Vail, Colorado.\n\nY. Weiss 1998. Correctness of local probability propagation in graphical models. To appear in Neural Computation.\n", "award": [], "sourceid": 1649, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}]}