{"title": "A Theory of Mean Field Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 351, "page_last": 360, "abstract": null, "full_text": "A Theory of Mean Field Approximation \n\nT.Tanaka \n\nDepartment of Electronics and Information Engineering \n\nTokyo Metropolitan University \n\nI-I, Minami-Osawa, Hachioji , Tokyo 192-0397 Japan \n\nAbstract \n\nI present a theory of mean field  approximation based on information ge(cid:173)\nometry.  This  theory  includes  in  a  consistent way  the  naive  mean  field \napproximation, as well as the TAP approach and the  linear response the(cid:173)\norem  in  statistical physics, giving clear information-theoretic interpreta(cid:173)\ntions to them. \n\n1 \n\nINTRODUCTION \n\nMany  problems of neural  networks,  such  as  learning and  pattern  recognition,  can  be cast \ninto a framework of statistical  estimation problem.  How difficult it  is to solve a particular \nproblem depends on a statistical model one employs in solving the problem. For Boltzmann \nmachines[ 1]  for example,  it  is  computationally  very  hard  to evaluate expectations of state \nvariables from  the model parameters. \n\nMean field  approximation[2], which is originated in  statistical physics, has been frequently \nused in practical situations in  order to circumvent this difficulty.  In  the context of statistical \nphysics several  advanced  theories  have  been  known , such  as  the  TAP approach[3],  linear \nresponse  theorem[4],  and  so on.  For neural  networks,  application  of mean  field  approxi(cid:173)\nmation  has  been  mostly  confined  to that of the so-called  naive mean  field  approximation, \nbut there are also attempts to utilize those advanced theories[5, 6, 7, 8] . \nIn  this paper I present an  information-theoretic formulation of mean field  approximation.  It \nis based on  information geometry[9], which has been successfully applied to several prob(cid:173)\nlems in  neural networks[ 1 0].  This formulation includes the naive mean field  approximation \nas  well  as the advanced theories in  a consistent way.  I give the formulation for Boltzmann \nmachines, but  its  extension  to  wider classes of statistical  models is  possible,  as described \nelsewhere[ 11 ]. \n\n2  BOLTZMANN MACHINES \n\nA Boltzmann machine is a statistical model with N  binary random variables Si  E {-I, I}, \ni  = 1,  ... ,  N. The vector s  = (s},  . .. , S N)  is called the state of the Boltzmann machine. \n\n\f352 \n\nT.  Tanaka \n\nThe state  s  is  also  a  random  variable,  and  its  probability  law  is  given  by  the  Boltzmann(cid:173)\nGibbs distribution \n\np(s)  =  e- E (s)-1/J(p) , \n\n(I) \n\n(2) \n\nwhere E( s)  is  the \"energy\" defined by \n\nE(s) = - 2: hisi - 2: wij SiSj \n\n(ij) \n\nwith hi and  wij  the  parameters, and  -1jJ(p)  is determined  by  the normalization  condition \nand  is  called the Helmholtz free energy of p.  The notation  (ij)  means that the summation \nshould be taken over all distinct pairs. \nLet 'fJi(P)  ==  (Si}p  and 'fJij(p)  ==  (SiSj}p,  where  (.}p  means the expectation with respect to \np.  The following problem is essential for  Boltzmann machines: \n\nProblem 1  Evaluate the expectations '1Ji (p)  and 'fJij (p)  from  the parameters hi  and wij of \nthe Boltzmann machine p. 
3 INFORMATION GEOMETRY

3.1 ORTHOGONAL DUAL FOLIATIONS

The whole set $M$ of Boltzmann-Gibbs distributions (1) realizable by a Boltzmann machine is regarded as an exponential family. Let us use the shorthand notations $I, J, \ldots$ to represent distinct pairs of indices, such as $ij$. The parameters $h_i$ and $w_I$ constitute a coordinate system of $M$, called the canonical parameters of $M$. The expectations $\eta_i$ and $\eta_I$ constitute another coordinate system of $M$, called the expectation parameters of $M$.

Let $F_0$ be the subset of $M$ on which the $w_I$ are all equal to zero. I call $F_0$ the factorizable submodel of $M$, since $p(s) \in F_0$ can be factorized with respect to $s_i$. On $F_0$ the problem is easy: since the $w_I$ are all zero, the $s_i$ are statistically independent of one another, and therefore $\eta_i = \tanh h_i$ and $\eta_{ij} = \eta_i \eta_j$ hold (see the numerical check at the end of Section 3.2).

Mean field approximation systematically reduces the problem onto the factorizable submodel $F_0$. For this reduction, I introduce dual foliations $F$ and $A$ onto $M$. The foliation $F = \{F(w)\}$, $M = \bigcup_w F(w)$, is parametrized by $w \equiv (w_I)$, and each leaf $F(w)$ is defined as

$F(w) = \{p(s) \mid w_I(p) = w_I\}.$    (3)

The leaf $F(0)$ is the same as $F_0$, the factorizable submodel. Each leaf $F(w)$ is again an exponential family with $h_i$ and $\eta_i$ the canonical and the expectation parameters, respectively. A pair of dual potentials is defined on each leaf: one is the Helmholtz free energy $\psi(p)$ itself, and the other is its Legendre transform, the Gibbs free energy,

$\phi(p) = \sum_i h_i(p) \eta_i(p) - \psi(p),$    (4)

and the parameters of $p \in F(w)$ are given by

$\eta_i(p) = \partial_i \psi(p), \quad h_i(p) = \partial^i \phi(p),$    (5)

where $\partial_i \equiv \partial/\partial h_i$ and $\partial^i \equiv \partial/\partial \eta_i$. Another foliation $A = \{A(m)\}$, $M = \bigcup_m A(m)$, is parametrized by $m \equiv (m_i)$, and each leaf $A(m)$ is defined as

$A(m) = \{p(s) \mid \eta_i(p) = m_i\}.$    (6)

Each leaf $A(m)$ is not an exponential family, but again a pair of dual potentials $\tilde\psi$ and $\tilde\phi$ is defined on each leaf; the former is given by

$\tilde\psi(p) = \psi(p) - \sum_i h_i(p) m_i$    (7)

and the latter by its Legendre transform,

$\tilde\phi(p) = \sum_I w_I(p) \eta_I(p) - \tilde\psi(p),$    (8)

and the parameters of $p \in A(m)$ are given by

$\eta_I(p) = \partial_I \tilde\psi(p), \quad w_I(p) = \partial^I \tilde\phi(p),$    (9)

where $\partial_I \equiv \partial/\partial w_I$ and $\partial^I \equiv \partial/\partial \eta_I$. These two foliations form the orthogonal dual foliations, since the leaves $F(w)$ and $A(m)$ are orthogonal at their intersecting point. On the basis of the orthogonal dual foliations I introduce still another coordinate system on $M$, called the mixed coordinate system. It uses a pair $(m, w)$ of the expectation and the canonical parameters to specify a single element $p \in M$: the $m$ part specifies the leaf $A(m)$ on which $p$ resides, and the $w$ part specifies the leaf $F(w)$.

3.2 REFORMULATION OF PROBLEM

Assume that a target Boltzmann machine $q$ is given by specifying its parameters $h_i(q)$ and $w_I(q)$. Problem 1 is restated as follows: evaluate the expectations $\eta_i(q)$ and $\eta_I(q)$ from those parameters. To evaluate $\eta_i$, mean field approximation translates the problem into the following one:

Problem 2  Let $F(w)$ be the leaf on which $q$ resides. Find the $p \in F(w)$ which is closest to $q$.
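As promised in Section 3.1, here is a quick numerical confirmation of the factorizable-submodel facts $\eta_i = \tanh h_i$ and $\eta_{ij} = \eta_i \eta_j$; this is a minimal sketch reusing the ad-hoc exact_expectations() helper from the Section 2 snippet:

import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=4)
w = np.zeros((4, 4))           # all pair weights zero: a point on F_0
eta_i, eta_ij = exact_expectations(h, w)
assert np.allclose(eta_i, np.tanh(h))                         # eta_i = tanh h_i
off = ~np.eye(4, dtype=bool)
assert np.allclose(eta_ij[off], np.outer(eta_i, eta_i)[off])  # eta_ij = eta_i eta_j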
At first sight this problem is trivial, since one immediately finds the solution $p = q$. However, solving the problem with respect to $\eta_i(p)$ is nontrivial, and it is the key to understanding mean field approximation, including the advanced theories.

Let us measure the proximity of $p$ to $q$ by the Kullback divergence

$D(p \| q) = \sum_s p(s) \log \frac{p(s)}{q(s)};$    (10)

then solving Problem 2 reduces to finding a minimizer $p \in F(w)$ of $D(p \| q)$ for a given $q$. For $p, q \in F(w)$, $D(p \| q)$ is expressed in terms of the dual potentials $\psi$ and $\phi$ of $F(w)$ as

$D(p \| q) = \psi(q) + \phi(p) - \sum_i h_i(q) \eta_i(p).$    (11)

The minimization problem is thus equivalent to minimizing

$G(p) = \phi(p) - \sum_i h_i(q) \eta_i(p),$    (12)

since $\psi(q)$ in eq. (11) does not depend on $p$. Solving the stationary condition $\partial^i G(p) = 0$ with respect to $\eta_i(p)$ would give the correct expectations $\eta_i(q)$, since the true minimizer is $p = q$. However, this scenario is in general intractable, since $\phi(p)$ cannot be given explicitly as a function of the $\eta_i(p)$.

3.3 PLEFKA EXPANSION

The problem is easy if $w_I = 0$. In this case $\phi(p)$ is given explicitly as a function of $m_i \equiv \eta_i(p)$ as

$\phi(p) = \frac{1}{2} \sum_i \left[ (1 + m_i) \log \frac{1 + m_i}{2} + (1 - m_i) \log \frac{1 - m_i}{2} \right].$    (13)

Minimization of $G(p)$ with respect to $m_i$ gives the solution $m_i = \tanh h_i$, as expected. When $w_I \neq 0$ the expression (13) is no longer exact, but to compensate for the error one may use, leaving the convergence problem aside, the Taylor expansion of $\phi(w) \equiv \phi(p)$ around $w = 0$,

$\phi(w) = \phi(0) + \sum_I (\partial_I \phi(0)) w_I + \frac{1}{2} \sum_{IJ} (\partial_I \partial_J \phi(0)) w_I w_J + \frac{1}{6} \sum_{IJK} (\partial_I \partial_J \partial_K \phi(0)) w_I w_J w_K + \cdots.$    (14)

This expansion has been called the Plefka expansion[12] in the literature of spin glasses. Note that in considering the expansion one should temporarily assume that $m$ is fixed: one can rely on the solution $m$ evaluated from the stationary condition $\partial^i G(p) = 0$ only if the expansion does not change the value of $m$.

The coefficients in the expansion can be computed efficiently by fully utilizing the orthogonal dual structure of the foliations. First, we have the following theorem:

Theorem 1  The coefficients of the expansion (14) are given by the cumulant tensors of the corresponding orders, defined on $A(m)$.

Because $\phi = -\tilde\psi$ holds, one can consider derivatives of $\tilde\psi$ instead of those of $\phi$. The first-order derivatives $\partial_I \tilde\psi$ are immediately given by the property of the potential of the leaf $A(m)$ (eq. (9)), yielding

$\partial_I \tilde\psi(0) = \eta_I(p_0),$    (15)

where $p_0$ denotes the distribution on $A(m)$ corresponding to $w = 0$. The coefficients of the lowest orders, including the first-order one, are given by the following theorem.

Theorem 2  The first-, second-, and third-order coefficients of the expansion (14) are given by

$\partial_I \tilde\psi(0) = \eta_I(p_0), \quad \partial_I \partial_J \tilde\psi(0) = \langle (\partial_I \ell)(\partial_J \ell) \rangle_{p_0}, \quad \partial_I \partial_J \partial_K \tilde\psi(0) = \langle (\partial_I \ell)(\partial_J \ell)(\partial_K \ell) \rangle_{p_0},$    (16)

where $\ell \equiv \log p_0$.

The proofs can be found in [11]. It should be noted that, although these results happen to be the same as the ones which would be obtained by regarding $A(m)$ as an exponential family, they are not the same in general, since $A(m)$ is actually not an exponential family; for example, they differ at the fourth-order coefficients.
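Eq. (15) can be checked numerically. The sketch below (my own construction, not from the paper; it assumes SciPy and reuses exact_expectations() from the Section 2 snippet, and the step size is an arbitrary choice) realizes the leaf $A(m)$ for a 3-unit machine by re-solving for $h$ whenever $w$ is perturbed so that $\eta_i$ stays at $m_i$, and then finite-differences $\tilde\psi$; for the factorized $p_0$, $\eta_I(p_0) = m_i m_{i'}$:

import itertools
import numpy as np
from scipy.optimize import fsolve

def psi(h, w):
    # Helmholtz free energy psi(p) = log sum_s exp(-E(s)).
    states = np.array(list(itertools.product([-1, 1], repeat=len(h))))
    energies = -states @ h - 0.5 * np.einsum('si,ij,sj->s', states, w, states)
    return np.log(np.exp(-energies).sum())

def psi_tilde(m, w):
    # Potential on the leaf A(m), eq. (7): psi(p) - sum_i h_i(p) m_i, where h
    # is re-solved so that eta_i(p) = m_i (the mixed-coordinate construction).
    h = fsolve(lambda h: exact_expectations(h, w)[0] - m, np.arctanh(m))
    return psi(h, w) - h @ m

m = np.array([0.3, -0.2, 0.5])
eps = 1e-5
w_pos = np.zeros((3, 3)); w_pos[0, 1] = w_pos[1, 0] = eps   # perturb w_I, I = (0,1)
w_neg = -w_pos
deriv = (psi_tilde(m, w_pos) - psi_tilde(m, w_neg)) / (2 * eps)
print(deriv, m[0] * m[1])    # both close to 0.3 * (-0.2) = -0.06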
The explicit formulas for these coefficients for Boltzmann machines are given as follows:

• For the first order, with $I = ii'$,

$\partial_I \tilde\psi(0) = m_i m_{i'}.$    (17)

• For the second order,

$(\partial_I)^2 \tilde\psi(0) = (1 - m_i^2)(1 - m_{i'}^2) \quad (I = ii'),$    (18)

and

$\partial_I \partial_J \tilde\psi(0) = 0 \quad (I \neq J).$    (19)

• For the third order,

$(\partial_I)^3 \tilde\psi(0) = 4 m_i m_{i'} (1 - m_i^2)(1 - m_{i'}^2) \quad (I = ii'),$    (20)

and, for $I = ij$, $J = jk$, $K = ik$ with three distinct indices $i$, $j$, and $k$,

$\partial_I \partial_J \partial_K \tilde\psi(0) = (1 - m_i^2)(1 - m_j^2)(1 - m_k^2).$    (21)

For other combinations of $I$, $J$, and $K$,

$\partial_I \partial_J \partial_K \tilde\psi(0) = 0.$    (22)

4 MEAN FIELD APPROXIMATION

4.1 MEAN FIELD EQUATION

Truncating the Plefka expansion (14) at the $n$-th order term gives the $n$-th order approximations $\phi_n(p)$ and $G_n(p) \equiv \phi_n(p) - \sum_i h_i(q) m_i$. The Weiss free energy, which is used in the naive mean field approximation, is given by $\phi_1(p)$. The TAP approach picks up all relevant terms of the Plefka expansion[12], and for the SK model it gives the second-order approximation $\phi_2(p)$.

The stationary condition $\partial^i G_n(p) = 0$ gives the so-called mean field equation, from which a solution of the approximate minimization problem is to be determined. For $n = 1$ it takes the familiar form

$\tanh^{-1} m_i - h_i - \sum_{j \neq i} w_{ij} m_j = 0,$    (23)

and for $n = 2$ it includes the so-called Onsager reaction term,

$\tanh^{-1} m_i - h_i - \sum_{j \neq i} w_{ij} m_j + m_i \sum_{j \neq i} (w_{ij})^2 (1 - m_j^2) = 0.$    (24)

Note that all of these are expressed as functions of the $m_i$; a simple fixed-point solver for both equations is sketched at the end of this subsection.

Geometrically, the mean field equation approximately represents the \"surface\" $h_i(p) = h_i(q)$ in terms of the mixed coordinate system of $M$, since for the exact Gibbs free energy $G$ the stationary condition $\partial^i G(p) = 0$ gives $h_i(p) - h_i(q) = 0$. Accordingly, the approximate relation $h_i(p) = \partial^i \phi_n(p)$, for fixed $m$, represents the $n$-th order approximate expression of the leaf $A(m)$ in the canonical coordinate system. The fit of this expression to the true leaf $A(m)$ around the point $w = 0$ becomes better as the order of approximation gets higher, as seen in Fig. 1. Such a behavior is well expected, since the Plefka expansion is essentially a Taylor expansion.

Figure 1: Approximate expressions of A(m) by mean field approximations of several orders for a 2-unit Boltzmann machine, with (m_1, m_2) = (0.5, 0.5) (left), and their magnified view (right).

Figure 2: Relation between the \"naive\" approximation and the present theory.
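The following sketch (illustrative only, not part of the original presentation) solves eqs. (23) and (24) by damped fixed-point iteration for a small random network; the damping factor, tolerance, and coupling scale are ad-hoc choices, and convergence is not guaranteed in general:

import numpy as np

def mean_field(h, w, order=2, damping=0.5, tol=1e-10, max_iter=10000):
    # Damped fixed-point iteration for eq. (23) (order=1) or eq. (24) (order=2):
    # m_i <- tanh(h_i + sum_j w_ij m_j [- m_i sum_j w_ij^2 (1 - m_j^2)])
    m = np.tanh(h)                       # start from the independent solution
    for _ in range(max_iter):
        field = h + w @ m
        if order >= 2:                   # Onsager reaction term of eq. (24)
            field -= m * ((w ** 2) @ (1 - m ** 2))
        m_new = (1 - damping) * m + damping * np.tanh(field)
        if np.max(np.abs(m_new - m)) < tol:
            break
        m = m_new
    return m_new

rng = np.random.default_rng(1)
n = 6
w = rng.normal(scale=0.1, size=(n, n)); w = np.triu(w, 1); w = w + w.T
h = rng.normal(scale=0.5, size=n)
print(mean_field(h, w, order=1))         # naive mean field, eq. (23)
print(mean_field(h, w, order=2))         # second order (TAP), eq. (24)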
4.2 LINEAR RESPONSE

For estimating $\eta_I(p)$ one can utilize the linear response theorem. In the information-geometric framework it is represented as a trivial identity relation for the Fisher information on the leaf $F(w)$. The Fisher information matrix $(g_{ij})$, or the Riemannian metric tensor, on the leaf $F(w)$, and its inverse $(g^{ij})$, are given by

$g_{ij} = \partial_i \partial_j \psi(p) = \eta_{ij}(p) - \eta_i(p) \eta_j(p)$    (25)

and

$g^{ij} = \partial^i \partial^j \phi(p),$    (26)

respectively. In the framework here, the linear response theorem states the trivial fact that these two matrices are the inverse of each other.

In mean field approximation, one substitutes the approximation $\phi_n(p)$ in place of $\phi(p)$ in eq. (26) to get an approximate inverse $(\tilde g^{ij})$ of the metric. The derivatives in eq. (26) can be calculated analytically, and therefore $(\tilde g^{ij})$ can be evaluated numerically by substituting into it a solution $m_i$ of the mean field equation. Equating its inverse to $(g_{ij})$ gives an estimate of $\eta_{ij}(p)$ by using eq. (25). Problem 1 has thus been solved within the framework of mean field approximation, with $\eta_i$ and $\eta_{ij}$ obtained by the mean field equation and the linear response theorem, respectively.
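A minimal sketch of this procedure for the first-order (Weiss) case, i.e., the algorithm of Kappen and Rodríguez[7, 8] (again my own illustrative code, reusing the ad-hoc helpers mean_field() and exact_expectations() from the earlier snippets):

import numpy as np

def linear_response_eta(h, w):
    # The Hessian of phi_1 with respect to m, delta_ij / (1 - m_i^2) - w_ij,
    # approximates (g^ij) of eq. (26); its inverse approximates the covariance
    # matrix (g_ij) of eq. (25), whence eta_ij ~ (inverse Hessian)_ij + m_i m_j.
    m = mean_field(h, w, order=1)            # solution of eq. (23)
    hessian = np.diag(1.0 / (1.0 - m ** 2)) - w
    cov = np.linalg.inv(hessian)             # approximate (g_ij)
    return cov + np.outer(m, m)

rng = np.random.default_rng(2)
n = 5
w = rng.normal(scale=0.1, size=(n, n)); w = np.triu(w, 1); w = w + w.T
h = rng.normal(scale=0.5, size=n)
print(np.max(np.abs(linear_response_eta(h, w) - exact_expectations(h, w)[1])))
# small for weak couplings; the diagonal is only approximately reproduced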
5 DISCUSSION

Following the framework presented so far, one can in principle construct mean field approximation algorithms of any desired order. The first-order algorithm with linear response was first proposed and examined by Kappen and Rodríguez[7, 8]. Tanaka[13] has formulated second- and third-order algorithms and explored them by computer simulations. It is also possible to extend the present formulation so that it is applicable to higher-order Boltzmann machines. Tanaka[14] discusses an extension of the present formulation to third-order Boltzmann machines: it is possible to extend the linear response theorem to higher orders, which allows one to treat higher-order correlations within the framework of mean field approximation.

The common understanding of the \"naive\" mean field approximation is that it minimizes the Kullback divergence $D(p_0 \| q)$ with respect to $p_0 \in F_0$ for a given $q$. It can be shown that this view is consistent with the theory presented in this paper. Assume that $q \in F(w)$ and $p_0 \in A(m)$, and let $p$ be the distribution corresponding to the intersecting point of the leaves $F(w)$ and $A(m)$. Because of the orthogonality of the two foliations $F$ and $A$, the following \"Pythagorean law\"[9] holds (Fig. 2):

$D(p_0 \| q) = D(p_0 \| p) + D(p \| q).$    (27)

Intuitively, $D(p_0 \| p)$ measures the squared distance between $F(w)$ and $F_0$, and is a second-order quantity in $w$. It should be ignored in the first-order approximation, and thus $D(p_0 \| q) \approx D(p \| q)$ holds. Under this approximation, minimization of the former with respect to $p_0$ is equivalent to that of the latter with respect to $p$, which establishes the relation between the \"naive\" approximation and the present theory. It can also be checked directly that the first-order approximation of $D(p \| q)$ exactly gives $D(p_0 \| q)$, the Weiss free energy.

The present theory provides an alternative view of the validity of mean field approximation: as opposed to the common \"belief\" that mean field approximation is good when $N$ is sufficiently large, one can state from the present formulation that it is good whenever the higher-order contributions of the Plefka expansion vanish, regardless of whether $N$ is large or not. This provides a theoretical basis for the observation that mean field approximation often works well for small networks.

The author would like to thank the Telecommunications Advancement Foundation for financial support.

References

[1] Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9: 147-169.
[2] Peterson, C., and Anderson, J. R. (1987) A mean field theory learning algorithm for neural networks. Complex Systems 1: 995-1019.
[3] Thouless, D. J., Anderson, P. W., and Palmer, R. G. (1977) Solution of 'Solvable model of a spin glass'. Phil. Mag. 35 (3): 593-601.
[4] Parisi, G. (1988) Statistical Field Theory. Addison-Wesley.
[5] Galland, C. C. (1993) The limitations of deterministic Boltzmann machine learning. Network 4 (3): 355-379.
[6] Hofmann, T. and Buhmann, J. M. (1997) Pairwise data clustering by deterministic annealing. IEEE Trans. Pattern Anal. & Machine Intell. 19 (1): 1-14; Errata, ibid. 19 (2): 197 (1997).
[7] Kappen, H. J. and Rodríguez, F. B. (1998) Efficient learning in Boltzmann machines using linear response theory. Neural Computation 10 (5): 1137-1156.
[8] Kappen, H. J. and Rodríguez, F. B. (1998) Boltzmann machine learning using mean field theory and linear response correction. In M. I. Jordan, M. J. Kearns, and S. A. Solla (Eds.), Advances in Neural Information Processing Systems 10, pp. 280-286. The MIT Press.
[9] Amari, S.-I. (1985) Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics 28, Springer-Verlag.
[10] Amari, S.-I., Kurata, K., and Nagaoka, H. (1992) Information geometry of Boltzmann machines. IEEE Trans. Neural Networks 3 (2): 260-271.
[11] Tanaka, T. Information geometry of mean field approximation. Preprint.
[12] Plefka, T. (1982) Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. J. Phys. A: Math. Gen. 15 (6): 1971-1978.
[13] Tanaka, T. (1998) Mean field theory of Boltzmann machine learning. Phys. Rev. E 58 (2): 2302-2310.
[14] Tanaka, T. (1998) Estimation of third-order correlations within mean field approximation. In S. Usui and T. Omori (Eds.), Proc. Fifth International Conference on Neural Information Processing, vol. 1, pp. 554-557.
", "award": [], "sourceid": 1604, "authors": [{"given_name": "Toshiyuki", "family_name": "Tanaka", "institution": null}]}