{"title": "Projecting Ising Model Parameters for Fast Mixing", "book": "Advances in Neural Information Processing Systems", "page_first": 665, "page_last": 673, "abstract": "Inference in general Ising models is difficult, due to high treewidth making tree-based algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.", "full_text": "Projecting Ising Model Parameters for Fast Mixing\n\nJustin Domke\n\nXianghang Liu\n\nNICTA, The Australian National University\n\nNICTA, The University of New South Wales\n\njustin.domke@nicta.com.au\n\nxianghang.liu@nicta.com.au\n\nAbstract\n\nInference in general Ising models is dif\ufb01cult, due to high treewidth making tree-\nbased algorithms intractable. Moreover, when interactions are strong, Gibbs sam-\npling may take exponential time to converge to the stationary distribution. We\npresent an algorithm to project Ising model parameters onto a parameter set that\nis guaranteed to be fast mixing, under several divergences. We \ufb01nd that Gibbs\nsampling using the projected parameters is more accurate than with the original\nparameters when interaction strengths are strong and when limited time is avail-\nable for sampling.\n\n1 Introduction\n\nHigh-treewidth graphical models typically yield distributions where exact inference is intractable.\nTo cope with this, one often makes an approximation based on a tractable model. 
For example, given some intractable distribution q, mean-field inference [14] attempts to minimize KL(p||q) over p ∈ TRACT, where TRACT is the set of fully-factorized distributions. Similarly, structured mean-field minimizes the KL-divergence, but allows TRACT to be the set of distributions that obey some tree [16] or a non-overlapping clustered [20] structure. In different ways, loopy belief propagation [21] and tree-reweighted belief propagation [19] also make use of tree-based approximations, while Globerson and Jaakkola [6] provide an approximate inference method based on exact inference in planar graphs with zero field.\n\nIn this paper, we explore an alternative notion of a “tractable” model. These are “fast mixing” models, or distributions that, while they may be high-treewidth, have parameter-space conditions guaranteeing that Gibbs sampling will quickly converge to the stationary distribution. While the precise form of the parameter-space conditions is slightly technical (Sections 2-3), informally, it is simply that interaction strengths between neighboring variables are not too strong.\n\nIn the context of the Ising model, we attempt to use these models in the most basic way possible: by taking an arbitrary (slow-mixing) set of parameters and projecting onto the fast-mixing set, using four different divergences. First, we show how to project in the Euclidean norm, by iteratively thresholding a singular value decomposition (Theorem 7). Secondly, we experiment with projecting using the “zero-avoiding” divergence KL(q||p). Since this requires taking (intractable) expectations with respect to q, it is of only theoretical interest. Third, we suggest a novel “piecewise” approximation of the KL-divergence, where one drops edges from both q and p until a low-treewidth graph remains where the exact KL-divergence can be calculated.
Experimentally, this does not perform as well as the true KL-divergence, but is easy to evaluate. Fourth, we consider the “zero-forcing” divergence KL(p||q). Since this requires expectations with respect to p, which is constrained to be fast-mixing, it can be approximated by Gibbs sampling, and the divergence can be minimized through stochastic approximation. This can be seen as a generalization of mean-field where the set of approximating distributions is expanded from fully-factorized to fast-mixing.\n\n2 Background\n\nThe literature on mixing times in Markov chains is extensive, including a recent textbook [10]. The presentation in the rest of this section is based on that of Dyer et al. [4].\n\nGiven a distribution p(x), one will often wish to draw samples from it. While in certain cases (e.g. the Normal distribution) one can obtain exact samples, for Markov random fields (MRFs), one must generally resort to iterative Markov chain Monte Carlo (MCMC) methods that obtain a sample asymptotically. In this paper, we consider the classic Gibbs sampling method [5], where one starts with some configuration x, and repeatedly picks a node i and samples x_i from p(x_i|x_−i). Under mild conditions, this can be shown to sample from a distribution that converges to p as t → ∞.\n\nIt is common to use more sophisticated methods such as block Gibbs sampling, the Swendsen-Wang algorithm [18], or tree sampling [7]. In principle, each algorithm could have unique parameter-space conditions under which it is fast mixing. Here, we focus on the univariate case for simplicity and because fast mixing of univariate Gibbs is sufficient for fast mixing of some other methods [13].\n\nDefinition 1. Given two finite distributions p and q, the total variation distance || · ||_TV is\n\n||p(X) − q(X)||_TV = (1/2) Σ_x |p(X = x) − q(X = x)|.\n\nWe need a property of a distribution that can guarantee fast mixing.
The dependency R_ij of x_i on x_j is defined by considering two configurations x and x′ that agree everywhere except possibly at coordinate j (i.e., x_k = x′_k for all k ≠ j), and measuring how much the conditional distribution of x_i can vary.\n\nDefinition 2. Given a distribution p, the dependency matrix R is defined by\n\nR_ij = max_{x,x′: x_−j = x′_−j} ||p(X_i|x_−i) − p(X_i|x′_−i)||_TV.\n\nGiven some threshold ε, the mixing time is the number of iterations needed to guarantee that the total variation distance of the Gibbs chain to the stationary distribution is less than ε.\n\nDefinition 3. Suppose that {X^t} denotes the sequence of random variables corresponding to running Gibbs sampling on some distribution p. The mixing time τ(ε) is the minimum time t such that the total variation distance between X^t and the stationary distribution is at most ε. That is,\n\nτ(ε) = min{t : d(t) < ε}, where d(t) = max_x ||P(X^t|X^0 = x) − p(X)||_TV.\n\nUnfortunately, the mixing time can be extremely long, which makes the use of Gibbs sampling delicate in practice. For example, for the two-dimensional Ising model with zero field and uniform interactions, it is known that the mixing time is polynomial (in the size of the grid) when the interaction strengths are below a threshold β_c, and exponential for stronger interactions [11]. For more general distributions, such tight bounds are not generally known, but one can still derive sufficient conditions for fast mixing. The main result we will use is the following [8].\n\nTheorem 4. Consider the dependency matrix R corresponding to some distribution p(X_1, ..., X_n). For Gibbs sampling with random updates, if ||R||_2 < 1, the mixing time is bounded by\n\nτ(ε) ≤ n / (1 − ||R||_2) · ln(n/ε).\n\nRoughly speaking, if the spectral norm (maximum singular value) of R is less than one, rapid mixing will occur.
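As an illustrative sketch (not part of the original paper), the Theorem 4 bound is easy to evaluate numerically once a dependency matrix is available; numpy and the 4-cycle example below are our own assumptions:

```python
import numpy as np

def mixing_time_bound(R, eps):
    # Theorem 4: for random-update Gibbs sampling, if ||R||_2 < 1 then
    # tau(eps) <= n / (1 - ||R||_2) * ln(n / eps).
    n = R.shape[0]
    norm = np.linalg.norm(R, 2)   # spectral norm = largest singular value
    if norm >= 1.0:
        return np.inf             # the theorem gives no guarantee here
    return n / (1.0 - norm) * np.log(n / eps)

# Hypothetical dependency matrix for a 4-cycle with weak interactions:
R = 0.2 * np.array([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [1., 0., 1., 0.]])
tau = mixing_time_bound(R, eps=0.01)  # ||R||_2 = 0.4, so the bound is finite
```

Here the bound is roughly 40 sweeps; when ||R||_2 ≥ 1 the theorem is simply silent, which the sketch signals by returning infinity.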
A similar result holds in the case of systematic scan updates [4, 8].\n\nSome of the classic ways of establishing fast mixing can be seen as special cases of this. For example, the Dobrushin criterion is that ||R||_1 < 1, which can be easier to verify in many cases, since ||R||_1 = max_j Σ_i |R_ij| does not require the computation of singular values. However, for symmetric matrices, it can be shown that ||R||_2 ≤ ||R||_1, meaning the above result is tighter.\n\n3 Mixing Time Bounds\n\nFor variables x_i ∈ {−1, +1}, an Ising model is of the form\n\np(x) = exp( Σ_{i,j} β_ij x_i x_j + Σ_i α_i x_i − A(β, α) ),\n\nwhere β_ij is the interaction strength between variables i and j, α_i is the “field” for variable i, and A ensures normalization. This can be seen as a member of the exponential family p(x) = exp(θ · f(x) − A(θ)), where f(x) = {x_i x_j ∀(i, j)} ∪ {x_i ∀i} and θ contains both β and α.\n\nLemma 5. For an Ising model, the dependency matrix is bounded by\n\nR_ij ≤ tanh|β_ij| ≤ |β_ij|.\n\nHayes [8] proves this for the case of constant β and zero field, but simple modifications to the proof give this result.\n\nThus, to summarize, an Ising model can be guaranteed to be fast mixing if the spectral norm of the absolute value of the interaction terms is less than one.\n\n4 Projection\n\nIn this section, we imagine that we have some set of parameters θ, not necessarily fast mixing, and would like to obtain another set of parameters ψ which are as close as possible to θ, but guaranteed to be fast mixing.
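Combining Lemma 5 with Theorem 4 gives an easily checkable sufficient condition for fast mixing. A minimal sketch (numpy and the 5-node chain example are our own assumptions, not from the paper):

```python
import numpy as np

def is_fast_mixing(beta):
    # Sufficient condition from Section 3: the spectral norm of the
    # entrywise bound tanh(|beta_ij|) on the dependency matrix (Lemma 5)
    # must be below one.
    R_bound = np.tanh(np.abs(beta))
    return bool(np.linalg.norm(R_bound, 2) < 1.0)

# Hypothetical 5-node chain with uniform interaction strength:
adj = np.diag(np.ones(4), 1)
adj = adj + adj.T                       # symmetric adjacency matrix

weak_ok = is_fast_mixing(0.2 * adj)     # norm is about 0.34, condition holds
strong_ok = is_fast_mixing(2.0 * adj)   # norm is about 1.67, no guarantee
```

Note that failing the test does not prove slow mixing; the condition is sufficient, not necessary.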
This section derives a projection in the Euclidean norm, while Section 5 will build on this to consider other divergence measures.\n\nWe will use the following standard result, which states that given a matrix A, the closest matrix with a bounded spectral norm can be obtained by thresholding the singular values.\n\nTheorem 6. If A has a singular value decomposition A = U S V^T, and ||·||_F denotes the Frobenius norm, then B = arg min_{B: ||B||_2 ≤ c} ||A − B||_F can be obtained as B = U S′ V^T, where S′_ii = min(S_ii, c).\n\nWe denote this projection by B = Π_c[A]. This is close to providing an algorithm for obtaining the closest set of Ising model parameters that obey a given spectral norm constraint. However, there are two issues. First, in general, even if A is sparse, the projected matrix B will be dense, meaning that projecting will destroy a sparse graph structure. Second, this result constrains the spectral norm of B itself, rather than of R = |B|, which is what needs to be controlled. The theorem below provides a dual method that fixes these issues.\n\nHere, we take some matrix Z that encodes the graph structure, by setting Z_ij = 0 if (i, j) is an edge, and Z_ij = 1 otherwise. Then, enforcing that B obeys the graph structure is equivalent to enforcing that Z_ij B_ij = 0 for all (i, j). Thus, finding the closest set of parameters B is equivalent to solving\n\nmin_{B,D} ||A − B||_F subject to ||D||_2 ≤ c, Z_ij D_ij = 0, D = |B|.   (1)\n\nWe find it convenient to solve this minimization by performing some manipulations and deriving a dual. The proof of the following theorem is provided in the appendix. To accomplish the maximization of g over M and Λ, we use LBFGS-B [1], with bound constraints used to enforce that M ≥ 0. The theorem uses the “triple dot product” notation A · B · C = Σ_ij A_ij B_ij C_ij.\n\nTheorem 7. Define R = |A|. The minimization in Eq.
1 is equivalent to the problem of max_{M ≥ 0, Λ} g(Λ, M), where, for D(Λ, M) = Π_c[R + M − Λ ⊙ Z], the objective and gradient of g are\n\ng(Λ, M) = (1/2) ||D(Λ, M) − R||²_F + Λ · Z · D(Λ, M),   (2)\n\ndg/dΛ = Z ⊙ D(Λ, M),   (3)\n\ndg/dM = D(Λ, M).   (4)\n\n5 Divergences\n\nAgain, we would like to find a parameter vector ψ that is close to a given vector θ but guaranteed to be fast mixing, under several notions of “closeness” that vary in terms of accuracy and computational convenience. Formally, if Ψ is the set of parameters that we can guarantee to be fast mixing, and D(θ, ψ) is a divergence between θ and ψ, then we would like to solve\n\narg min_{ψ ∈ Ψ} D(θ, ψ).   (5)\n\nAs we will see, in selecting D there appears to be something of a trade-off between the quality of the approximation and the ease of computing the projection in Eq. 5.\n\nIn this section, we work with the generic exponential family representation\n\np(x; θ) = exp(θ · f(x) − A(θ)).\n\nWe use μ to denote the mean value of f. By a standard result, this is equal to the gradient of A, i.e.,\n\nμ(θ) = Σ_x p(x; θ) f(x) = ∇A(θ).\n\n5.1 Euclidean Distance\n\nThe simplest divergence is simply the l2 distance between the parameter vectors, D(θ, ψ) = ||θ − ψ||_2. For the Ising model, Theorem 7 provides a method to compute the projection arg min_{ψ ∈ Ψ} ||θ − ψ||_2. While simple, this has no obvious probabilistic interpretation, and other divergences perform better in the experiments below.\n\nHowever, it also forms the basis of our projected gradient descent strategy for computing the projection in Eq. 5 under more general divergences D.
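The basic thresholding operation Π_c of Theorem 6 is straightforward to implement; a minimal sketch assuming numpy (the example matrix is ours, and the graph-structure refinements of Theorem 7 are omitted):

```python
import numpy as np

def project_spectral(A, c):
    # Pi_c[A] from Theorem 6: the closest matrix to A in Frobenius norm
    # whose spectral norm is at most c, via singular value thresholding.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.minimum(s, c)) @ Vt

A = np.array([[0., 3.],
              [3., 0.]])              # spectral norm 3
B = project_spectral(A, c=1.0)        # both singular values clipped to 1
```

Since the projection only shrinks singular values, applying it to a matrix that already satisfies the constraint leaves it unchanged.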
Specifically, we will do this by iterating\n\n1. ψ′ ← ψ − λ dD(θ, ψ)/dψ\n\n2. ψ ← arg min_{ψ ∈ Ψ} ||ψ′ − ψ||_2\n\nfor some step-size λ. In some cases, dD/dψ can be calculated exactly, and this is simply projected gradient descent. In other cases, one needs to estimate dD/dψ by sampling from ψ. As discussed below, we do this by maintaining a “pool” of samples. In each iteration, a few Markov chain steps are applied with the current parameters, and then the gradient is estimated using them. Since the gradients estimated at each time-step are dependent, this can be seen as an instance of Ergodic Mirror Descent [3]. This guarantees convergence if the number of Markov chain steps and the step-size λ are both appropriate functions of the total number of optimization iterations.\n\n5.2 KL-Divergence\n\nPerhaps the most natural divergence to use would be the “inclusive” KL-divergence\n\nD(θ, ψ) = KL(θ||ψ) = Σ_x p(x; θ) log [p(x; θ)/p(x; ψ)].   (6)\n\nThis has the “zero-avoiding” property [12] that ψ will tend to assign some probability to all configurations that θ assigns nonzero probability to. It is easy to show that the derivative is\n\ndD(θ, ψ)/dψ = μ(ψ) − μ(θ),   (7)\n\nwhere μ(θ) = E_θ[f(X)]. Unfortunately, this requires inference with respect to both the parameter vectors θ and ψ. Since ψ will be enforced to be fast-mixing during optimization, one could approximate μ(ψ) by sampling. However, θ is presumed to be slow-mixing, making μ(θ) difficult to compute.
Thus, this divergence is only practical on low-treewidth “toy” graphs.\n\n5.3 Piecewise KL-Divergences\n\nInspired by the piecewise likelihood [17] and likelihood approximations based on mixtures of trees [15], we seek tractable approximations of the KL-divergence based on tractable subgraphs. Our motivation is the following: if θ and ψ define the same distribution, then if a certain set of edges is removed from both, they should continue to define the same distribution1. Thus, given some graph T, we define the “projection” θ(T) onto T by setting to zero all edge parameters that are not part of T. Then, given a set of graphs T, the piecewise KL-divergence is\n\nD(θ, ψ) = max_T KL(θ(T)||ψ(T)).\n\nComputing the derivative of this divergence is not hard: one simply computes the KL-divergence for each graph, and uses the gradient as in Eq. 7 for the maximizing graph.\n\nThere is some flexibility in selecting the graphs T. In the simplest case, one could simply select a set of trees (ensuring that each edge is covered by at least one tree), which makes it easy to compute the KL-divergence on each tree using the sum-product algorithm. We will also experiment with selecting low-treewidth graphs, where exact inference can take place using the junction tree algorithm.\n\n5.4 Reversed KL-Divergence\n\nWe also consider the “zero-forcing” KL-divergence\n\nD(θ, ψ) = KL(ψ||θ) = Σ_x p(x; ψ) log [p(x; ψ)/p(x; θ)].\n\nTheorem 8. The divergence D(θ, ψ) = KL(ψ||θ) has the gradient\n\n(d/dψ) D(θ, ψ) = Σ_x p(x; ψ) [(ψ − θ) · f(x)] (f(x) − μ(ψ)).\n\nArguably, using this divergence is inferior to the “zero-avoiding” KL-divergence.
For example, since the parameters ψ may fail to put significant probability at configurations where θ does, using importance sampling to reweight samples from ψ to estimate expectations with respect to θ could have high variance. Further, it can be non-convex with respect to ψ. Nevertheless, it often works well in practice. Minimizing this divergence under the constraint that the dependency matrix R corresponding to ψ have a limited spectral norm is closely related to naive mean-field, which can be seen as a degenerate case where one constrains R to have zero norm.\n\nThis is easier to work with than the “zero-avoiding” KL-divergence in Eq. 6 since it involves taking expectations with respect to ψ, rather than θ: since ψ is enforced to be fast-mixing, these expectations can be approximated by sampling. Specifically, suppose that one has generated a set of samples x_1, ..., x_K using the current parameters ψ. Then, one can first approximate the marginals by μ̂ = (1/K) Σ_{k=1}^K f(x_k), and then approximate the gradient by\n\nĝ = (1/K) Σ_{k=1}^K [(ψ − θ) · f(x_k)] (f(x_k) − μ̂).   (8)\n\nIt is a standard result that if two estimators are unbiased and independent, the product of the two estimators will also be unbiased. Thus, if one used separate sets of perfect samples to estimate μ̂ and ĝ, then ĝ would be an unbiased estimator of dD/dψ. In practice, of course, we generate the samples by Gibbs sampling, so they are not quite perfect. We find in practice that using the same set of samples twice makes little difference, and do so in the experiments.\n\n1 Technically, here, we assume that the exponential family is minimal. However, in the case of an overcomplete exponential family, enforcing this will simply ensure that θ and ψ use the same reparameterization.\n\n[Figure 1 shows four panels: Grid, Mixed; Grid, Attractive; Edge Density = 0.3, Attractive; and Edge Density = 0.3, Mixed. Each plots marginal error against interaction strength for LBP, TRW, Mean-Field, Original Parameters, Euclidean, Piecewise KL(θ||ψ) (TW 1 and, for grids, TW 2), KL(ψ||θ), and KL(θ||ψ).]\n\nFigure 1: The mean error of estimated univariate marginals on 8x8 grids (top row) and low-density random graphs (bottom row), comparing 30k iterations of Gibbs sampling after projection to variational methods.
To approximate the computational effort of projection (Table 1), sampling on the original parameters with 250k iterations is also included as a lower curve. (Full results in appendix.)\n\n6 Experiments\n\nOur experimental evaluation follows that of Hazan and Shashua [9] in evaluating the accuracy of the methods using the Ising model in various configurations. In the experiments, we approximate randomly generated Ising models with rapid-mixing distributions using the projection algorithms described previously. Then, the marginals of the rapid-mixing approximate distributions are compared against those of the target distributions by running a Gibbs chain on each. We calculate the mean absolute distance of the marginals as the accuracy measure, with the true marginals computed via the exact junction-tree algorithm.\n\nWe evaluate projecting under the Euclidean distance (Section 5.1), the piecewise divergence (Section 5.3), and the zero-forcing KL-divergence KL(ψ||θ) (Section 5.4). On small graphs, it is also possible to minimize the zero-avoiding KL-divergence KL(θ||ψ) by computing marginals using the junction-tree algorithm. However, as minimizing this KL-divergence leads to exact marginal estimates, it doesn't provide a useful measure of marginal accuracy. Our methods are compared with four other inference algorithms, namely loopy belief propagation (LBP), tree-reweighted belief propagation (TRW), naive mean-field (MF), and Gibbs sampling on the original parameters.\n\nLBP, MF and TRW are among the most widely applied variational methods for approximate inference. The MF algorithm uses a fully factorized distribution as the tractable family, and can be viewed as an extreme case of minimizing the zero-forcing KL-divergence KL(ψ||θ) under the constraint of zero spectral norm. The tractable family that it uses guarantees “instant” mixing but is much more restrictive.
Theoretically, Gibbs sampling on the original parameters will produce highly accurate marginals if run long enough. However, this can take exponentially long, and convergence is generally hard to diagnose [2]. In contrast, Gibbs sampling on the rapid-mixing approximation is guaranteed to converge rapidly but will result in less accurate marginals asymptotically. Thus, we also include time-accuracy comparisons between these two strategies in the experiments.\n\nMethod | Grid, Strength 1.5 (Gibbs Steps; SVDs) | Grid, Strength 3 (Gibbs Steps; SVDs) | Random Graph, Strength 3 (Gibbs Steps; SVDs)\n30,000 Gibbs steps | 30k / 0.17s; -- | 30k / 0.17s; -- | 30k / 0.04s; --\n250,000 Gibbs steps | 250k / 1.4s; -- | 250k / 1.4s; -- | 250k / 0.33s; --\nEuclidean Projection | --; 22 / 0.04s | --; 78 / 0.15s | --; 17 / .0002s\nPiecewise-1 Projection | --; 322 / 0.61s | --; 547 / 1.0s | --; 408 / 0.047s\nKL Projection | 30k / 0.17s; 265 / 0.55s | 30k / 0.17s; 471 / 0.94s | 30k / 0.04s; 300 / 0.037s\n\nTable 1: Running times on various attractive graphs, showing the number of Gibbs passes and singular value decompositions, as well as the amount of computation time. The random graph is based on an edge density of 0.7. Mean-field, loopy BP, and TRW take less than 0.01s.\n\n6.1 Configurations\n\nTwo types of graph topologies are used: two-dimensional 8 × 8 grids and random graphs with 10 nodes. Each edge is independently present with probability p_e ∈ {0.3, 0.5, 0.7}. Node parameters θ_i are uniformly drawn from unif(−d_n, d_n) and we fix the field strength to d_n = 1.0. Edge parameters θ_ij are uniformly drawn from unif(−d_e, d_e) or unif(0, d_e) to obtain mixed or attractive interactions, respectively. We generate graphs with interaction strengths d_e = 0, 0.5, . . . , 4. All results are averaged over 50 random trials.\n\nTo calculate piecewise divergences, it remains to specify the set of subgraphs T. Each element of T can be any tractable subgraph of the original graph.
For the grids, one straightforward choice is to use the horizontal and the vertical chains as subgraphs. We also test with subgraphs of treewidth 2. For random graphs, we use a set of random spanning trees chosen so that every edge of the original graph is covered by at least one tree.\n\nA stochastic gradient descent algorithm is applied to minimize the zero-forcing KL-divergence KL(ψ||θ). In this algorithm, a “pool” of samples is repeatedly used to estimate gradients as in Eq. 8. After each parameter update, each sample is updated by a single Gibbs step, consisting of one pass over all variables. The performance of this algorithm can be affected by several parameters, including the gradient search step size, the size of the sample pool, the number of Gibbs updates, and the number of total iterations. (This algorithm can be seen as an instance of Ergodic Mirror Descent [3].) Without intensive tuning of these parameters, we choose a constant step size of 0.1, a sample pool size of 500, and 60 total iterations, which performed reasonably well in practice.\n\nFor each original or approximate distribution, a single chain of Gibbs sampling is run on the final parameters, and marginals are estimated from the samples drawn. Each Gibbs iteration is one pass of systematic scan over the variables in fixed order. Note that this does not take into account the computational effort deployed during projection, which ranges from 30,000 total Gibbs iterations with repeated Euclidean projection (KL(ψ||θ)) to none at all (original parameters).
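The sample-pool procedure above can be sketched as follows. This is our own simplified illustration, not the paper's implementation: it updates only the edge parameters, projects the edge-weight matrix directly with the Theorem 6 thresholding (ignoring the graph-structure and absolute-value refinements of Theorem 7), and assumes a symmetric beta with zero diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, beta, alpha):
    # One systematic-scan Gibbs pass over an Ising model with symmetric
    # beta (zero diagonal assumed) and fields alpha; x has entries in {-1, +1}.
    for i in range(len(x)):
        field = alpha[i] + beta[i] @ x
        x[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1.0
    return x

def project_spectral(A, c):
    # Euclidean projection of Theorem 6 (singular value thresholding).
    U, s, Vt = np.linalg.svd(A)
    return U @ np.diag(np.minimum(s, c)) @ Vt

def project_kl(beta, alpha, c=1.0, iters=60, pool=500, step=0.1):
    # Stochastic projection under KL(psi||theta): evolve a pool of samples
    # by Gibbs sweeps, estimate the gradient via Eq. 8 (edge features only),
    # take a gradient step, and project back onto the fast-mixing set.
    # Small diagonal leakage from the projection is ignored in this sketch.
    psi = project_spectral(beta, c)
    X = rng.choice([-1.0, 1.0], size=(pool, len(alpha)))
    for _ in range(iters):
        X = np.array([gibbs_sweep(x, psi, alpha) for x in X])
        F = np.einsum('ki,kj->kij', X, X)          # edge features f(x_k)
        mu = F.mean(axis=0)                        # estimated edge marginals
        w = np.einsum('kij,ij->k', F, psi - beta)  # (psi - theta) . f(x_k)
        g = np.einsum('k,kij->ij', w, F - mu) / pool
        psi = project_spectral(psi - step * g, c)
    return psi
```

Because the final operation in each iteration is the projection, the returned psi satisfies the spectral norm constraint by construction.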
It has been our experience that more aggressive parameters can lead to this procedure being more accurate than Gibbs in a comparison of total computational effort, but such a schedule also tends to reduce the accuracy of the final parameters, making results more difficult to interpret.\n\nIn Section 3, we showed that for Ising models, a sufficient condition for rapid mixing is that the spectral norm of the pairwise weight matrix be less than 1.0. However, we find in practice that using a spectral norm bound of 2.5 instead of 1.0 can still preserve the rapid-mixing property and gives a better approximation to the original distributions. (See Section 7 for a discussion.)\n\n7 Discussion\n\nInference in high-treewidth graphical models is intractable, which has motivated several classes of approximations based on tractable families. In this paper, we have proposed a new notion of “tractability”, insisting not that a graph has a fast algorithm for exact inference, but only that it obeys parameter-space conditions ensuring that Gibbs sampling will converge rapidly to the stationary distribution.
For the case of Ising models, we use a simple condition that can guarantee rapid mixing, namely that the spectral norm of the matrix of interaction strengths is less than one.\n\n[Figure 2 shows four panels: Grid, Interaction Strength 4.0, Mixed; Grid, Interaction Strength 4.0, Attractive; Edge Density = 0.3, Interaction Strength 3.0, Mixed; and Edge Density = 0.3, Interaction Strength 3.0, Attractive. Each plots marginal error against the number of samples (log scale) for the same set of methods as in Figure 1.]\n\nFigure 2: Example plots of the accuracy of obtained marginals vs. the number of samples. Top: grid graphs. Bottom: low-density random graphs.
(Full results in appendix.)\n\nGiven an intractable set of parameters, we consider using this approximate family by “projecting” the intractable distribution onto it under several divergences. First, we consider the Euclidean distance between parameters, and derive a dual algorithm to solve the projection, based on an iterative thresholding of the singular value decomposition. Next, we extend this to more probabilistic divergences. Firstly, we consider a novel “piecewise” divergence, based on computing the exact KL-divergence on several low-treewidth subgraphs. Secondly, we consider projecting under the zero-forcing KL-divergence. This requires a stochastic approximation approach, where one repeatedly generates samples from the model, and projects in the Euclidean norm after taking a gradient step.\n\nWe compare experimentally to Gibbs sampling on the original parameters, along with several standard variational methods. The proposed methods are more accurate than the variational approximations. Given enough time, Gibbs sampling using the original parameters will always be more accurate, but with finite time, projecting onto the fast-mixing set generally gives better results.\n\nFuture work might extend this approach to general Markov random fields. This poses two technical challenges. First, one must find a bound on the dependency matrix for general MRFs, and second, an algorithm is needed to project onto the fast-mixing set defined by this bound. Fast-mixing distributions might also be used for learning. E.g., if one is doing maximum likelihood learning using MCMC to estimate the likelihood gradient, it would be natural to constrain the parameters to a fast-mixing set.\n\nOne weakness of the proposed approach is the apparent looseness of the spectral norm bound.
For the two-dimensional Ising model with no univariate terms and a constant interaction strength β, there is a well-known threshold β_c = (1/2) ln(1 + √2) ≈ 0.4407, obtained using more advanced techniques than the spectral norm [11]. Roughly, for β < β_c, mixing is known to occur quickly (polynomial in the grid size), while for β > β_c, mixing is exponential. On the other hand, the spectral norm bound will be equal to one for β = 0.25, meaning the bound is too conservative in this case by a factor of β_c/0.25 ≈ 1.76. A tighter bound on when rapid mixing will occur would be more informative.\n\nReferences\n\n[1] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput., 16(5):1190–1208, 1995.\n\n[2] Mary Kathryn Cowles and Bradley P. Carlin. Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91:883–904, 1996.\n\n[3] John C. Duchi, Alekh Agarwal, Mikael Johansson, and Michael I. Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012.\n\n[4] Martin E. Dyer, Leslie Ann Goldberg, and Mark Jerrum. Matrix norms and rapid mixing for spin systems. Ann. Appl. Probab., 19:71–107, 2009.\n\n[5] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, 1984.\n\n[6] Amir Globerson and Tommi Jaakkola. Approximate inference using planar graph decomposition. In NIPS, pages 473–480, 2006.\n\n[7] Firas Hamze and Nando de Freitas. From fields to trees. In UAI, 2004.\n\n[8] Thomas P. Hayes. A simple condition implying rapid mixing of single-site dynamics on spin systems. In FOCS, pages 39–46, 2006.\n\n[9] Tamir Hazan and Amnon Shashua.
Convergent message-passing algorithms for inference over general graphs with convex free energies. In UAI, pages 264–273, 2008.\n\n[10] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2006.\n\n[11] Eyal Lubetzky and Allan Sly. Critical Ising on the square lattice mixes in polynomial time. Commun. Math. Phys., 313(3):815–836, 2012.\n\n[12] Thomas Minka. Divergence measures and message passing. Technical report, 2005.\n\n[13] Yuval Peres and Peter Winkler. Can extra updates delay mixing? arXiv/1112.0603, 2011.\n\n[14] C. Peterson and J. R. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019, 1987.\n\n[15] Patrick Pletscher, Cheng S. Ong, and Joachim M. Buhmann. Spanning tree approximations for conditional random fields. In AISTATS, 2009.\n\n[16] Lawrence K. Saul and Michael I. Jordan. Exploiting tractable substructures in intractable networks. In NIPS, pages 486–492, 1995.\n\n[17] Charles Sutton and Andrew McCallum. Piecewise training for structured prediction. Machine Learning, 77:165–194, 2009.\n\n[18] Robert H. Swendsen and Jian-Sheng Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett., 58:86–88, Jan 1987.\n\n[19] Martin Wainwright, Tommi Jaakkola, and Alan Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.\n\n[20] Eric P. Xing, Michael I. Jordan, and Stuart Russell. A generalized mean field algorithm for variational inference in exponential families. In UAI, 2003.\n\n[21] Jonathan Yedidia, William Freeman, and Yair Weiss.
Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2005.", "award": [], "sourceid": 391, "authors": [{"given_name": "Justin", "family_name": "Domke", "institution": "NICTA"}, {"given_name": "Xianghang", "family_name": "Liu", "institution": "NICTA/UNSW"}]}