{"title": "Analyzing Hogwild Parallel Gaussian Gibbs Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 2715, "page_last": 2723, "abstract": "Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going \u201cHogwild\u201d and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean.", "full_text": "Analyzing Hogwild Parallel Gaussian Gibbs Sampling\n\nMatthew J. Johnson\n\nEECS, MIT\n\nmattjj@mit.edu\n\nJames Saunderson\n\nEECS, MIT\n\njamess@mit.edu\n\nAlan S. Willsky\n\nEECS, MIT\n\nwillsky@mit.edu\n\nAbstract\n\nSampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. 
Empirical work has shown that some models can be sampled effectively by going \u201cHogwild\u201d and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean.\n\n1 Introduction\n\nScaling probabilistic inference algorithms to large datasets and parallel computing architectures is a challenge of great importance and considerable current research interest, and great strides have been made in designing parallelizable algorithms. Along with the powerful and sometimes complex new algorithms, a very simple strategy has proven to be surprisingly successful in some situations: running Gibbs sampling updates, derived only for the sequential setting, in parallel without globally synchronizing the sampler state after each update. Concretely, the strategy is to apply an algorithm like Algorithm 1. 
We refer to this strategy as \u201cHogwild Gibbs sampling\u201d in reference to recent work [1] in which sequential computations for computing gradient steps were applied in parallel (without global coordination) to great beneficial effect.\nThis Hogwild Gibbs sampling strategy has long been considered a useful hack, perhaps for preparing decent initial states for a proper serial Gibbs sampler, but extensive empirical work on Approximate Distributed Latent Dirichlet Allocation (AD-LDA) [2, 3, 4, 5, 6], which applies the strategy to generate samples from a collapsed LDA model, has demonstrated its effectiveness in sampling LDA models with the same predictive performance as those generated by standard serial Gibbs [2, Figure 3]. However, the results are largely empirical and so it is difficult to understand how model properties and algorithm parameters might affect performance, or whether similar success can be expected for any other models. There have been recent advances in understanding some of the particular structure of AD-LDA [6], but a thorough theoretical explanation for the effectiveness and limitations of Hogwild Gibbs sampling is far from complete.\nSampling inference algorithms for complex Bayesian models have notoriously resisted theoretical analysis, so to begin an analysis of Hogwild Gibbs sampling we consider a restricted class of models that is especially tractable for analysis: Gaussians. Gaussian distributions and algorithms are tractable because of their deep connection with linear algebra. Further, Gaussian sampling is of great interest in its own right, and there is active research in developing powerful Gaussian samplers [7, 8, 9, 10]. Gaussian Hogwild Gibbs sampling could be used in conjunction with those methods to allow greater parallelization and scalability, given an understanding of its applicability and tradeoffs.\n\nAlgorithm 1 Hogwild Gibbs Sampling\nRequire: Samplers $G_i(\\bar{x}_{\\neg i})$ which sample $p(x_i \\mid x_{\\neg i} = \\bar{x}_{\\neg i})$, a partition $\\{\\mathcal{I}_1, \\mathcal{I}_2, \\ldots, \\mathcal{I}_K\\}$ of $\\{1, 2, \\ldots, n\\}$, and an inner iteration schedule $q(k, \\ell) \\geq 0$\n1: Initialize $\\bar{x}^{(1)}$\n2: for $\\ell = 1, 2, \\ldots$ until convergence do    \u25b7 global iterations/synchronizations\n3:     for $k = 1, 2, \\ldots, K$ in parallel do    \u25b7 for each of K parallel processors\n4:         $\\bar{y}^{(1)}_{\\mathcal{I}_k} \\leftarrow \\bar{x}^{(\\ell)}_{\\mathcal{I}_k}$\n5:         for $j = 1, 2, \\ldots, q(k, \\ell)$ do    \u25b7 run local Gibbs steps with old\n6:             for $i \\in \\mathcal{I}_k$ do    \u25b7 statistics from other processors\n7:                 $\\bar{y}^{(j)}_i \\leftarrow G_i(\\bar{x}^{(\\ell)}_{\\mathcal{I}_1}, \\ldots, \\bar{y}^{(j)}_{\\mathcal{I}_k \\setminus \\{i\\}}, \\ldots, \\bar{x}^{(\\ell)}_{\\mathcal{I}_K})$\n8:     $\\bar{x}^{(\\ell+1)} \\leftarrow (\\bar{y}^{(q(1,\\ell))}_{\\mathcal{I}_1} \\cdots \\bar{y}^{(q(K,\\ell))}_{\\mathcal{I}_K})$    \u25b7 globally synchronize statistics\n\nToward the goal of understanding Gaussian Hogwild Gibbs sampling, the main contribution of this paper is a linear algebraic framework for analyzing the stability and errors in Gaussian Hogwild Gibbs sampling. Our framework yields several results, including a simple proof for a sufficient condition for all Gaussian Hogwild Gibbs sampling processes to be stable and yield the correct asymptotic mean no matter the allocation of variables to processors or number of sub-iterations (Proposition 1, Theorem 1), as well as an analysis of errors introduced in the process variance.\nCode to regenerate our plots is available at https://github.com/mattjj/gaussian-hogwild-gibbs.\n\n2 Related Work\n\nThere has been significant work on constructing parallel Gibbs sampling algorithms, and the contributions are too numerous to list here. One recent body of work [11] provides exact parallel Gibbs samplers which exploit graphical model structure for parallelism. 
The algorithms are supported by the standard Gibbs sampling analysis, and the authors point out that while heuristic parallel samplers such as the AD-LDA sampler offer easier implementation and often greater parallelism, they are currently not supported by much theoretical analysis.\nThe parallel sampling work that is most relevant to the proposed Hogwild Gibbs sampling analysis is the thorough empirical demonstration of AD-LDA [2, 3, 4, 5, 6] and its extensions. The AD-LDA sampling algorithm is an instance of the strategy we have named Hogwild Gibbs, and Bekkerman et al. [5, Chapter 11] suggest applying the strategy to other latent variable models.\nThe work of Ihler et al. [6] provides some understanding of the effectiveness of a variant of AD-LDA by bounding, in terms of run-time quantities, the one-step error probability induced by proceeding with sampling steps in parallel, thereby allowing an AD-LDA user to inspect the computed error bound after inference [6, Section 4.2]. In experiments, the authors empirically demonstrate very small upper bounds on these one-step error probabilities, e.g. a value of their parameter $\\varepsilon = 10^{-4}$ meaning that at least 99.99% of samples are expected to be drawn just as if they were sampled sequentially. However, this per-sample error does not necessarily provide a direct understanding of the effectiveness of the overall algorithm because errors might accumulate over sampling steps; indeed, understanding this potential error accumulation is of critical importance in iterative systems. Furthermore, the bound is in terms of empirical run-time quantities, and thus it does not provide guidance regarding which other models the Hogwild strategy may be effective on. Ihler et al. [6, Section 4.3] also provide approximate scaling analysis by estimating the order of the one-step bound in terms of a Gaussian approximation and some distributional assumptions.\nFinally, Niu et al. 
[1] provide both a motivation for Hogwild Gibbs sampling and the Hogwild name. The authors present \u201ca lock-free approach to parallelizing stochastic gradient descent\u201d (SGD) by providing analysis that shows, for certain common problem structures, that the locking and synchronization needed to run a stochastic gradient descent algorithm \u201ccorrectly\u201d on a multicore architecture are unnecessary, and in fact the robustness of the SGD algorithm compensates for the uncertainty introduced by allowing processors to perform updates without locking their shared memory.\n\n3 Background\n\nIn this section we fix notation for Gaussian distributions and describe known connections between Gaussian sampling and a class of stationary iterative linear system solvers which are useful in analyzing the behavior of Hogwild Gibbs sampling.\nThe density of a Gaussian distribution on n variables with mean vector $\\mu$ and positive definite (see footnote 1) covariance matrix $\\Sigma \\succ 0$ has the form\n\n$$p(x) \\propto \\exp\\left\\{ -\\tfrac{1}{2} (x - \\mu)^{\\mathsf{T}} \\Sigma^{-1} (x - \\mu) \\right\\} \\propto \\exp\\left\\{ -\\tfrac{1}{2} x^{\\mathsf{T}} J x + h^{\\mathsf{T}} x \\right\\} \\qquad (1)$$\n\nwhere we have written the information parameters $J := \\Sigma^{-1}$ and $h := J\\mu$. The matrix J is often called the precision matrix or information matrix, and it has a natural interpretation in the context of Gaussian graphical models: its entries are the coefficients on pairwise log potentials and its sparsity pattern is exactly the sparsity pattern of a graphical model. Similarly h, also called the potential vector, encodes node potentials and evidence.\nIn many problems [12] one has access to the pair (J, h) and must compute or estimate the moment parameters $\\mu$ and $\\Sigma$ (or just the diagonal) or generate samples from $\\mathcal{N}(\\mu, \\Sigma)$. Sampling provides both a means for estimating the moment parameters and a subroutine for other algorithms. 
Computing $\\mu$ from (J, h) is equivalent to solving the linear system $J\\mu = h$ for $\\mu$.\nOne way to generate samples is via Gibbs sampling, in which one iterates sampling each $x_i$ conditioned on all other variables to construct a Markov chain for which the invariant distribution is the target $\\mathcal{N}(\\mu, \\Sigma)$. The conditional distributions for Gibbs sampling steps are\n\n$$p(x_i \\mid x_{\\neg i} = \\bar{x}_{\\neg i}) \\propto \\exp\\left\\{ -\\tfrac{1}{2} J_{ii} x_i^2 + (h_i - J_{i \\neg i} \\bar{x}_{\\neg i}) x_i \\right\\}.$$\n\nThat is, we update each $x_i$ via $x_i \\leftarrow \\frac{1}{J_{ii}} (h_i - J_{i \\neg i} \\bar{x}_{\\neg i}) + v_i$ where $v_i \\overset{\\text{iid}}{\\sim} \\mathcal{N}(0, \\frac{1}{J_{ii}})$.\nSince each variable update is a linear function of other variables with added Gaussian noise, we can collect one scan for i = 1, 2, . . . , n into a matrix equation relating the sampler state at t and t + 1:\n\n$$x^{(t+1)} = -D^{-1} L x^{(t+1)} - D^{-1} L^{\\mathsf{T}} x^{(t)} + D^{-1} h + D^{-\\frac{1}{2}} \\tilde{v}^{(t)}, \\quad \\tilde{v}^{(t)} \\overset{\\text{iid}}{\\sim} \\mathcal{N}(0, I),$$\n\nwhere we have split $J = L + D + L^{\\mathsf{T}}$ into its strictly lower-triangular, diagonal, and strictly upper-triangular parts, respectively. Note that $x^{(t+1)}$ appears on both sides of the equation, and that the sparsity patterns of $L$ and $L^{\\mathsf{T}}$ ensure that each entry of $x^{(t+1)}$ depends on the appropriate entries of $x^{(t)}$ and $x^{(t+1)}$. We can re-arrange the equation into an update expression:\n\n$$x^{(t+1)} = -(D + L)^{-1} L^{\\mathsf{T}} x^{(t)} + (D + L)^{-1} h + (D + L)^{-1} \\tilde{v}^{(t)}, \\quad \\tilde{v}^{(t)} \\overset{\\text{iid}}{\\sim} \\mathcal{N}(0, D).$$\n\nThe expectation of this update is exactly the Gauss-Seidel iterative linear system solver update [13, Section 7.3] applied to $J\\mu = h$, i.e. $x^{(t+1)} = -(D + L)^{-1} L^{\\mathsf{T}} x^{(t)} + (D + L)^{-1} h$. Therefore a Gaussian Gibbs sampling process can be interpreted as Gauss-Seidel iterates on the system $J\\mu = h$ with appropriately-shaped noise injected at each iteration.\nGauss-Seidel is one instance of a stationary iterative linear solver based on a matrix splitting. In general, one can construct a stationary iterative linear solver for any splitting $J = M - N$ where M is invertible, and similarly one can construct iterative Gaussian samplers via\n\n$$x^{(t+1)} = (M^{-1} N) x^{(t)} + M^{-1} h + M^{-1} v^{(t)}, \\quad v^{(t)} \\overset{\\text{iid}}{\\sim} \\mathcal{N}(0, M^{\\mathsf{T}} + N) \\qquad (2)$$\n\nwith the constraint that $M^{\\mathsf{T}} + N \\succeq 0$ (i.e. that the splitting is P-regular [14]). For an iterative process like (2) to be stable or convergent for any initialization we require the eigenvalues of its update map to lie in the interior of the complex unit disk, i.e. $\\rho(M^{-1} N) := \\max_i |\\lambda_i(M^{-1} N)| < 1$ [13, Lemma 7.3.6]. The Gauss-Seidel solver (and Gibbs sampling) correspond to choosing M to be the lower-triangular part of J and N to be the negative of the strict upper-triangle of J. $J \\succeq 0$ is a sufficient condition for Gauss-Seidel to be convergent [13, Theorem 7.5.41] [15], and the connection to Gibbs sampling provides an independent proof.\n\nFootnote 1: Assume models are non-degenerate: matrix parameters are of full rank and densities are finite everywhere.\n\nFor solving linear systems with splitting-based algorithms, the complexity of solving linear systems in M directly affects the computational cost per iteration. For the Gauss-Seidel splitting (and hence Gibbs sampling), M is chosen to be lower-triangular so that the corresponding linear system can be solved efficiently via backsubstitution. In the sampling context, the per-iteration computational complexity is also determined by the covariance of the injected noise process $v^{(t)}$, because at each iteration one must sample from a Gaussian distribution with covariance $M^{\\mathsf{T}} + N$.\nWe highlight one other standard stationary iterative linear solver that is relevant to analyzing Gaussian Hogwild Gibbs sampling: Jacobi iterations, in which one splits $J = D - A$ where D is the diagonal part of J and A is the negative of the off-diagonal part. 
Due to the choice of a diagonal M, each coordinate update depends only on the previous sweep\u2019s output, and thus the Jacobi update sweep can be performed in parallel. A sufficient condition for the convergence of Jacobi iterates is for J to be a generalized diagonally dominant matrix (i.e. an H-matrix) [13, Definition 5.13]. A simple proof (see footnote 2), due to Ruozzi et al. [16], is to consider Gauss-Seidel iterations on a lifted 2n \u00d7 2n system:\n\n$$\\begin{pmatrix} D & -A \\\\ -A & D \\end{pmatrix} \\xrightarrow{\\text{G-S update}} \\begin{pmatrix} D^{-1} & 0 \\\\ D^{-1} A D^{-1} & D^{-1} \\end{pmatrix} \\begin{pmatrix} 0 & A \\\\ 0 & 0 \\end{pmatrix} = \\begin{pmatrix} 0 & D^{-1} A \\\\ 0 & (D^{-1} A)^2 \\end{pmatrix}. \\qquad (3)$$\n\nTherefore one iteration of Gauss-Seidel on the lifted system is exactly two applications of the Jacobi update $D^{-1} A$ to the second half of the state vector, so Jacobi iterations converge if Gauss-Seidel on the lifted system converges. Furthermore, a sufficient condition for Gauss-Seidel to converge on the lifted system is for it to be positive semi-definite, and by taking Schur complements we require $D - A D^{-1} A \\succeq 0$ or $I - (D^{-\\frac{1}{2}} A D^{-\\frac{1}{2}})(D^{-\\frac{1}{2}} A D^{-\\frac{1}{2}}) \\succeq 0$, which is equivalent to requiring generalized diagonal dominance [13, Theorem 5.14].\n\n4 Gaussian Hogwild Analysis\n\nGiven that Gibbs sampling iterations and Jacobi solver iterations, which can be computed in parallel, can each be written as iterations of a stochastic linear dynamical system (LDS), it is not surprising that Gaussian Hogwild Gibbs sampling can also be expressed as an LDS by appropriately composing these ideas. 
In this section we describe the LDS corresponding to Gaussian Hogwild Gibbs sampling and provide convergence and error analysis, along with a connection to a class of linear solvers.\nFor the majority of this section, we assume that the number of inner iterations performed on each processor is constant across time and processor index; that is, we have a single number $q = q(k, \\ell)$ of sub-iterations per processor for each outer iteration. We describe how to relax the assumption at the end of this subsection.\nGiven a joint Gaussian distribution of dimension n represented by a pair (J, h) as in (1), we represent an allocation of the n scalar variables to local processors by a partition of {1, 2, . . . , n}, where we assume partition elements are contiguous without loss of generality. Consider a block-Jacobi splitting of J into its block diagonal and block off-diagonal components, $J = D_{\\mathrm{block}} - A$, according to the partition. A includes the cross-processor potentials, and this block-Jacobi splitting will model the outer iterations in Algorithm 1. We further perform a Gauss-Seidel splitting on $D_{\\mathrm{block}}$ into (block-diagonal) lower-triangular and strictly upper-triangular parts, $D_{\\mathrm{block}} = B - C$; these processor-local Gauss-Seidel splittings model the inner Gibbs sampling steps in Algorithm 1. We refer to this splitting $J = B - C - A$ as the Hogwild splitting; see Figure 1a for an example.\n\nFootnote 2: When J is symmetric one can arrive at the same condition by applying a similarity transform as in Proposition 5. We use the lifting argument here because we extend the idea in our other proofs.\n\nFor each outer iteration of the Hogwild Gibbs sampler we perform q processor-local Gibbs steps, effectively applying the block-diagonal update $B^{-1} C$ repeatedly using $A x^{(t)} + h$ as a potential vector that includes out-of-date statistics from the other processors. 
The resulting update operator for one outer iteration of the Hogwild Gibbs sampling process can be written as\n\n$$x^{(t+1)} = (B^{-1} C)^q x^{(t)} + \\sum_{j=0}^{q-1} (B^{-1} C)^j B^{-1} \\left( A x^{(t)} + h + v^{(t,j)} \\right), \\quad v^{(t,j)} \\overset{\\text{iid}}{\\sim} \\mathcal{N}(0, D) \\qquad (4)$$\n\nwhere D is the diagonal of J. Note that we shape the noise diagonally because in Hogwild Gibbs sampling we simply apply standard Gibbs updates in the inner loop.\nAs mentioned previously, the update in (4) is written so that the number of sub-iterations is homogeneous, but the expression can easily be adapted to model any number of sub-iterations by writing a separate sum over j for each block row of the output and a separate matrix power for each block in the first $B^{-1} C$ term. The proofs and arguments in the following subsections can also be extended with extra bookkeeping, so we focus on the homogeneous q case for convenience.\n\n4.1 Convergence and Correctness of Means\n\nBecause the Gaussian Hogwild Gibbs sampling iterates form a Gaussian linear dynamical system, the process is stable (i.e. its iterates converge in distribution) if and only if [13, Lemma 7.3.6] the deterministic part of the update map (4) has spectral radius less than unity, i.e.\n\n$$T := (B^{-1} C)^q + \\sum_{j=0}^{q-1} (B^{-1} C)^j B^{-1} A = (B^{-1} C)^q + (I - (B^{-1} C)^q)(B - C)^{-1} A \\qquad (5)$$\n\nsatisfies $\\rho(T) < 1$. We can write $T = T_{\\mathrm{ind}}^q + (I - T_{\\mathrm{ind}}^q) T_{\\mathrm{block}}$ where $T_{\\mathrm{ind}} = B^{-1} C$ is the purely Gauss-Seidel update when A = 0 and $T_{\\mathrm{block}} = (B - C)^{-1} A$ is the block Jacobi update, which corresponds to solving the processor-local linear systems exactly at each outer iteration. The update (5) falls into the class of two-stage splitting methods [14, 17, 18], and the next proposition is equivalent to such two-stage solvers having the correct fixed point.\nProposition 1. If a Gaussian Hogwild Gibbs process is stable, the asymptotic mean is correct.\n\nProof. If the process is stable the mean process has a unique fixed point, and from (4) and (5) we can write the fixed-point equation for the process mean $\\mu_{\\mathrm{hog}}$ as $(I - T)\\mu_{\\mathrm{hog}} = (I - T_{\\mathrm{ind}}^q)(I - T_{\\mathrm{block}})\\mu_{\\mathrm{hog}} = (I - T_{\\mathrm{ind}}^q)(B - C)^{-1} h$, hence $(I - (B - C)^{-1} A)\\mu_{\\mathrm{hog}} = (B - C)^{-1} h$ and $\\mu_{\\mathrm{hog}} = (B - C - A)^{-1} h$.\nThe behavior of the spectral radius of the update map can be very complicated, even generically over simple ensembles. In Figure 1b, we compare $\\rho(T)$ for q = 1 and q = \u221e (corresponding to $T = T_{\\mathrm{block}}$) with models sampled from a natural random ensemble; we see that there is no general relationship between stability at q = 1 and at q = \u221e.\nDespite the complexity of the update map\u2019s stability, in the next subsection we give a simple argument that identifies its convergence with the convergence of Gauss-Seidel iterates on a larger, non-symmetric linear system. Given that relationship we then prove a condition on the entries of J that ensures the convergence of the Gaussian Hogwild Gibbs sampling process for any choice of partition or sub-iteration count.\n\n4.1.1 A lifting argument and sufficient condition\n\nFirst observe that we can write multiple steps of Gauss-Seidel as a single step of Gauss-Seidel on a larger system: given $J = L - U$ where L is lower-triangular (including the diagonal, unlike the notation of Section 3) and U is strictly upper-triangular, consider applying Gauss-Seidel to a larger block k \u00d7 k system:\n\n$$\\begin{pmatrix} L & & & -U \\\\ -U & L & & \\\\ & \\ddots & \\ddots & \\\\ & & -U & L \\end{pmatrix} \\xrightarrow{\\text{G-S}} \\begin{pmatrix} L^{-1} & & & \\\\ L^{-1} U L^{-1} & L^{-1} & & \\\\ \\vdots & & \\ddots & \\\\ (L^{-1} U)^{k-1} L^{-1} & \\cdots & L^{-1} U L^{-1} & L^{-1} \\end{pmatrix} \\begin{pmatrix} 0 & \\cdots & 0 & U \\\\ 0 & \\cdots & 0 & 0 \\\\ \\vdots & & \\vdots & \\vdots \\\\ 0 & \\cdots & 0 & 0 \\end{pmatrix} = \\begin{pmatrix} 0 & \\cdots & 0 & L^{-1} U \\\\ 0 & \\cdots & 0 & (L^{-1} U)^2 \\\\ \\vdots & & \\vdots & \\vdots \\\\ 0 & \\cdots & 0 & (L^{-1} U)^k \\end{pmatrix} \\qquad (6)$$\n\nTherefore one step of Gauss-Seidel on the larger system corresponds to k applications of the Gauss-Seidel update $L^{-1} U$ from the original system to the last block element of the lifted state vector.\nNow we provide a lifting on which Gauss-Seidel corresponds to Gaussian Hogwild Gibbs iterations.\n\n[Figure 1 panels]\n(a) Support pattern (in black) of the Hogwild splitting $J = B - C - A$ with n = 9 and the processor partition {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}\n(b) $\\rho(T)$ for q = 1 versus for q = \u221e\n(c) $\\Pi$ projects to the block diagonal\n(d) $\\Pi$ projects to the off-block-diagonal\n\nFigure 1: (a) visualization of the Hogwild splitting; (b) Hogwild stability for generic models; (c) and (d) typical plots of $\\|\\Pi(\\Sigma - \\Sigma_{\\mathrm{hog}})\\|_{\\mathrm{Fro}}$. In (b) each point corresponds to a sampled model $J = Q Q^{\\mathsf{T}} + n r I$ with $Q_{ij} \\overset{\\text{iid}}{\\sim} \\mathcal{N}(0, 1)$, $r \\overset{\\text{iid}}{\\sim} \\mathrm{Uniform}[0.5, 1]$, n = 24 with an even partition of size 4. In (c) and (d), models are $J = B - C - tA$ where $B - C - A = Q Q^{\\mathsf{T}}$, n = 150 with an even partition of size 3. The plots can be generated with python figures.py -seed=0.\n\nProposition 2. Two applications of the Hogwild update T of (5) are equivalent to the update to the last block element of the state vector in one Gauss-Seidel iteration on the (2qn) \u00d7 (2qn) system\n\n$$\\begin{pmatrix} E & -F \\\\ -F & E \\end{pmatrix} \\tilde{x} = \\begin{pmatrix} h \\\\ \\vdots \\\\ h \\end{pmatrix} \\quad \\text{with} \\quad E = \\begin{pmatrix} B & & & \\\\ -C & B & & \\\\ & \\ddots & \\ddots & \\\\ & & -C & B \\end{pmatrix}, \\quad F = \\begin{pmatrix} 0 & \\cdots & 0 & A + C \\\\ 0 & \\cdots & 0 & A \\\\ \\vdots & & \\vdots & \\vdots \\\\ 0 & \\cdots & 0 & A \\end{pmatrix}. \\qquad (7)$$\n\nProof. By comparing to the block update in (3), it suffices to consider $E^{-1} F$. Furthermore, since the claim concerns the last block entry, we need only consider the last block row of $E^{-1} F$. 
E is block lower-bidiagonal as the matrix that is inverted in (6), so $E^{-1}$ has the same lower-triangular form as in (6) and the product of the last block row of $E^{-1}$ with the last block column of F yields\n\n$$(B^{-1} C)^q + \\sum_{j=0}^{q-1} (B^{-1} C)^j B^{-1} A = T.$$\n\nProposition 3. Gaussian Hogwild Gibbs sampling is convergent if Gauss-Seidel converges on (7).\n\nUnfortunately the lifting is not symmetric and so we cannot impose positive semi-definiteness on the lifted system; however, another sufficient condition for Gauss-Seidel stability can be applied:\nTheorem 1. If J is generalized diagonally dominant (i.e. an H-matrix, see Berman et al. [13, Definition 5.13, Theorem 5.14]) then Hogwild Gibbs sampling is convergent for any variable partition and any number of sub-iterations.\n\nProof. If J is generalized diagonally dominant then there exists a diagonal scaling matrix R such that $\\tilde{J} := J R$ is row diagonally dominant, i.e. $\\tilde{J}_{ii} \\geq \\sum_{j \\neq i} |\\tilde{J}_{ij}|$. Since each scalar row of the coefficient matrix in (7) contains only entries from one row of J and zeros, it is generalized diagonally dominant with a scaling matrix that consists of 2q copies of R along the diagonal. 
Finally, Gauss-Seidel iterations on generalized diagonally dominant systems are convergent [13, Theorem 5.14], so by Proposition 3 the corresponding Hogwild Gibbs iterations are convergent.\n\n[Figure 1 plots (b)-(d): (b) axes $\\rho(T)$, q = 1 and $\\rho(T)$, q = \u221e; (c) block diagonal error versus t; (d) off-block-diagonal error versus t; curves in (c) and (d) are labeled A = 0 and $\\rho(B^{-1} C)^q$ = 0.670, 0.448, 0.300, 0.201.]\n\nIn terms of Gaussian graphical models, generalized diagonally dominant models include tree models and latent tree models (since H-matrices are closed under Schur complements), in which the density of the distribution can be written as a tree-structured set of pairwise potentials over the model variables and a set of latent variables. Latent tree models are useful in modeling data with hierarchical or multi-scaled relationships, and this connection to latent tree structure is evocative of many hierarchical Bayesian models. More broadly, diagonally dominant systems are well-known for their tractability and applicability in many other settings [19], and Gaussian Hogwild Gibbs provides another example of their utility.\nBecause of the connection to linear system solvers based on two-stage multisplittings, this result can be identified with [18, Theorem 2.3], which shows that if the coefficient matrix is an H-matrix then the two-stage iterative solver is convergent. Indeed, by the connection between solvers and samplers one can prove our Theorem 1 as a corollary to [18, Theorem 2.3] (or vice-versa), though our proof technique is much simpler. 
The other results on two-stage multisplittings [18, 14] can also be applied immediately for results on the convergence of Gaussian Hogwild Gibbs sampling.\nThe sufficient condition provided by Theorem 1 is coarse in that it provides convergence for any partition or update schedule. However, given the complexity of the processes, as exhibited in Figure 1b, it is difficult to provide general conditions without taking into account some model structure.\n\n4.1.2 Exact local block samples\n\nConvergence analysis simplifies greatly in the case where exact block samples are drawn at each processor because q is sufficiently large or because another exact sampler [9, 10] is used on each processor. This regime of Hogwild Gibbs sampling is particularly interesting because it minimizes communication between processors.\nIn (4), we see that as q \u2192 \u221e we have $T \\to T_{\\mathrm{block}}$; that is, the deterministic part of the update becomes the block Jacobi update map, which admits a natural sufficient condition for convergence:\nProposition 4. If $((B - C)^{-\\frac{1}{2}} A (B - C)^{-\\frac{1}{2}})^2 \\prec I$, then block Hogwild Gibbs sampling converges.\nProof. Since similarity transformations preserve eigenvalues, with $\\bar{A} := (B - C)^{-\\frac{1}{2}} A (B - C)^{-\\frac{1}{2}}$ we have $\\rho(T_{\\mathrm{block}}) = \\rho((B - C)^{\\frac{1}{2}} (B - C)^{-1} A (B - C)^{-\\frac{1}{2}}) = \\rho(\\bar{A})$ and since $\\bar{A}$ is symmetric, $\\bar{A}^2 \\prec I \\Rightarrow \\rho(\\bar{A}) < 1 \\Rightarrow \\rho(T_{\\mathrm{block}}) < 1$.\n\n4.2 Variances\n\nSince we can analyze Gaussian Hogwild Gibbs sampling as a linear dynamical system, we can write an expression for the steady-state covariance $\\Sigma_{\\mathrm{hog}}$ of the process when it is stable. For a general stable LDS of the form $x^{(t+1)} = T x^{(t)} + v^{(t)}$ with $v^{(t)} \\sim \\mathcal{N}(0, \\Sigma_{\\mathrm{inj}})$ the steady-state covariance is given by the series $\\sum_{t=0}^{\\infty} T^t \\Sigma_{\\mathrm{inj}} (T^t)^{\\mathsf{T}}$, which is the solution to the linear discrete-time Lyapunov equation $\\Sigma - T \\Sigma T^{\\mathsf{T}} = \\Sigma_{\\mathrm{inj}}$ in $\\Sigma$.\nThe injected noise for the outer loop of the Hogwild iterations is generated by the inner loop, which itself has injected noise with covariance D, the diagonal of J, so for Hogwild sampling we have $\\Sigma_{\\mathrm{inj}} = \\sum_{j=0}^{q-1} (B^{-1} C)^j B^{-1} D B^{-\\mathsf{T}} ((B^{-1} C)^j)^{\\mathsf{T}}$. The target covariance is $J^{-1} = (B - C - A)^{-1}$.\nComposing these expressions we see that the Hogwild covariance is complicated, but we can analyze some salient properties in at least two regimes: when A is small and when local processors draw exact block samples (e.g. when q \u2192 \u221e).\n\n4.2.1 First-order effects in A\n\nIntuitively, the Hogwild strategy works best when cross-processor interactions are small, and so it is natural to analyze the case when A is small and we can discard terms that include powers of A beyond first order.\nWhen A = 0, the model is independent across processors and both the exact covariance and the Hogwild steady-state covariance for any q are $(B - C)^{-1}$. For small nonzero A, we consider $\\Sigma_{\\mathrm{hog}}(A)$ to be a function of A and linearize around A = 0 to write $\\Sigma_{\\mathrm{hog}}(A) \\approx (B - C)^{-1} + [D_0 \\Sigma_{\\mathrm{hog}}](A)$, where the derivative $[D_0 \\Sigma_{\\mathrm{hog}}](A)$ is a matrix determined by the linear equation\n\n$$[D_0 \\Sigma_{\\mathrm{hog}}](A) - S [D_0 \\Sigma_{\\mathrm{hog}}](A) S^{\\mathsf{T}} = \\tilde{A} - S \\tilde{A} S^{\\mathsf{T}} - (I - S) \\tilde{A} (I - S)^{\\mathsf{T}}$$\n\nwhere $\\tilde{A} := (B - C)^{-1} A (B - C)^{-1}$ and $S := (B^{-1} C)^q$. See the supplementary materials. 
We can compare this linear approximation to the linear approximation for the exact covariance:\n\n$$J^{-1} = [I + (B - C)^{-1} A + ((B - C)^{-1} A)^2 + \\cdots](B - C)^{-1} \\approx (B - C)^{-1} + \\tilde{A}. \\qquad (8)$$\n\nSince $\\tilde{A}$ has zero block-diagonal and S is block-diagonal, we see that to first order A has no effect on the block-diagonal of either the exact covariance or the Hogwild covariance. As shown in Figure 1c, in numerical experiments higher-order terms improve the Hogwild covariance on the block diagonal relative to the A = 0 approximation, and the improvements increase with local mixing rates.\nThe off-block-diagonal first-order term in the Hogwild covariance is nonzero and it depends on the local mixing performed by S. In particular, if global synchronization happens infrequently relative to the speed of local sampler mixing (e.g. if q is large), $S \\approx 0$ and $D_0 \\Sigma_{\\mathrm{hog}} \\approx 0$, so $\\Sigma_{\\mathrm{hog}} \\approx (B - C)^{-1}$ (to first order in A) and cross-processor interactions are ignored (though they are still used to compute the correct mean, as per Proposition 1). However, when there are directions in which S is slow to mix, $D_0 \\Sigma_{\\mathrm{hog}}$ picks up some parts of the correct covariance\u2019s first-order term, $\\tilde{A}$. Figure 1d shows the off-block-diagonal error increasing with faster local mixing for small A.\nIntuitively, more local mixing, and hence relatively less frequent global synchronization, degrades the Hogwild approximation of the cross-processor covariances. Such an effect may be undesirable because increased local mixing reflects greater parallelism (or an application of more powerful local samplers [9, 10]). In the next subsection we show that this case admits a special analysis and even an inexpensive correction to recover asymptotically unbiased estimates for the full covariance matrix.\n\n4.2.2 Exact local block samples\n\nAs local mixing increases, e.g. 
as q \u2192 \u221e or if we use an exact block local sampler between global\nsynchronizations, we are effectively sampling in the lifted model of Eq. (3) and therefore we can use\nthe lifting construction to analyze the error in variances:\n\nProposition 5. When local block samples are exact, the Hogwild sampled covariance \u03a3_Hog satisfies\n\n\u03a3 = (I + (B \u2212 C)^{\u22121}A) \u03a3_Hog    and    ||\u03a3 \u2212 \u03a3_Hog|| \u2264 ||(B \u2212 C)^{\u22121}A|| ||\u03a3_Hog||,\n\nwhere \u03a3 = J^{\u22121} is the exact target covariance and || \u00b7 || is any submultiplicative matrix norm.\n\nProof. Using the lifting in (3), the Hogwild process steady-state covariance is the marginal covariance\nof half of the lifted state vector, so using Schur complements we can write\n\n\u03a3_Hog = ((B \u2212 C) \u2212 A(B \u2212 C)^{\u22121}A)^{\u22121} = [I + ((B \u2212 C)^{\u22121}A)^2 + \u00b7\u00b7\u00b7 ](B \u2212 C)^{\u22121}.\n\nWe can compare this series to the exact expansion in (8) to see that \u03a3_Hog includes exactly the even\npowers (due to the block-bipartite lifting), and therefore\n\n\u03a3 \u2212 \u03a3_Hog = [(B \u2212 C)^{\u22121}A + ((B \u2212 C)^{\u22121}A)^3 + \u00b7\u00b7\u00b7 ](B \u2212 C)^{\u22121} = (B \u2212 C)^{\u22121}A \u03a3_Hog.\n\nRearranging gives the first identity, and the norm bound follows from submultiplicativity.\n\n5 Conclusion\n\nWe have introduced a framework for understanding Gaussian Hogwild Gibbs sampling and shown\nsome results on the stability and errors of the algorithm, including (1) quantitative conditions under\nwhich a Gaussian model is not \u201ctoo dependent\u201d for Hogwild sampling to be stable (Proposition 2,\nTheorem 1, Proposition 4); (2) given stability, the asymptotic Hogwild mean is always correct\n(Proposition 1); (3) in the linearized regime with small cross-processor interactions, there is a tradeoff\nbetween local mixing and error in Hogwild cross-processor covariances (Section 4.2.1); and (4)\nwhen local samplers are run to convergence we can bound the error in the Hogwild variances and\neven efficiently correct estimates of the full covariance 
(Proposition 5). We hope these ideas may be\nextended to provide further insight into Hogwild Gibbs sampling, in the Gaussian case and beyond.\n\n6 Acknowledgements\n\nThis research was supported in part under AFOSR Grant FA9550-12-1-0287.\n\nReferences\n\n[1] F. Niu, B. Recht, C. R\u00e9, and S.J. Wright. \u201cHogwild!: A lock-free approach to parallelizing stochastic gradient descent\u201d. In: Advances in Neural Information Processing Systems (2011).\n\n[2] D. Newman, A. Asuncion, P. Smyth, and M. Welling. \u201cDistributed inference for latent dirichlet allocation\u201d. In: Advances in Neural Information Processing Systems 20.1081-1088 (2007), pp. 17\u201324.\n\n[3] D. Newman, A. Asuncion, P. Smyth, and M. Welling. \u201cDistributed algorithms for topic models\u201d. In: The Journal of Machine Learning Research 10 (2009), pp. 1801\u20131828.\n\n[4] Z. Liu, Y. Zhang, E.Y. Chang, and M. Sun. \u201cPLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing\u201d. In: ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011), p. 26.\n\n[5] R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2012.\n\n[6] A. Ihler and D. Newman. \u201cUnderstanding Errors in Approximate Distributed Latent Dirichlet Allocation\u201d. In: Knowledge and Data Engineering, IEEE Transactions on 24.5 (2012), pp. 952\u2013960.\n\n[7] Y. Liu, O. Kosut, and A. S. Willsky. \u201cSampling GMRFs by Subgraph Correction\u201d. In: NIPS 2012 Workshop: Perturbations, Optimization, and Statistics (2012).\n\n[8] G. Papandreou and A. Yuille. \u201cGaussian sampling by local perturbations\u201d. In: Neural Information Processing Systems (NIPS). 2010.\n\n[9] A. Parker and C. Fox. \u201cSampling Gaussian distributions in Krylov spaces with conjugate gradients\u201d. In: SIAM Journal on Scientific Computing 34.3 (2012), pp. 
312\u2013334.\n\n[10] Colin Fox and Albert Parker. \u201cConvergence in Variance of First-Order and Second-Order Chebyshev Accelerated Gibbs Samplers\u201d. 2013. URL: http://www.physics.otago.ac.nz/data/fox/publications/SIAM_CS_2012-11-30.pdf.\n\n[11] J. Gonzalez, Y. Low, A. Gretton, and C. Guestrin. \u201cParallel Gibbs Sampling: From Colored Fields to Thin Junction Trees\u201d. In: Artificial Intelligence and Statistics (AISTATS). Ft. Lauderdale, FL, May 2011.\n\n[12] M. J. Wainwright and M. I. Jordan. \u201cGraphical models, exponential families, and variational inference\u201d. In: Foundations and Trends\u00ae in Machine Learning 1.1-2 (2008), pp. 1\u2013305.\n\n[13] A. Berman and R.J. Plemmons. \u201cNonnegative Matrices in the Mathematical Sciences\u201d. In: Classics in Applied Mathematics, 9 (1979).\n\n[14] M. J. Castel, V. Migall\u00f3n, and J. Penad\u00e9s. \u201cOn Parallel two-stage methods for Hermitian positive definite matrices with applications to preconditioning\u201d. In: Electronic Transactions on Numerical Analysis 12 (2001), pp. 88\u2013112.\n\n[15] D. Serre. Nov. 2011. URL: http://mathoverflow.net/questions/80793/is-gauss-seidel-guaranteed-to-converge-on-semi-positive-definite-matrices/80845#80845.\n\n[16] Nicholas Ruozzi and Sekhar Tatikonda. \u201cMessage-Passing Algorithms for Quadratic Minimization\u201d. In: Journal of Machine Learning Research 14 (2013), pp. 2287\u20132314. URL: http://jmlr.org/papers/v14/ruozzi13a.html.\n\n[17] A. Frommer and D.B. Szyld. \u201cOn asynchronous iterations\u201d. In: Journal of computational and applied mathematics 123.1 (2000), pp. 201\u2013216.\n\n[18] A. Frommer and D.B. Szyld. \u201cAsynchronous two-stage iterative methods\u201d. In: Numerische Mathematik 69.2 (1994), pp. 141\u2013153.\n\n[19] J. A. Kelner, L. Orecchia, A. Sidford, and Z. A. Zhu. A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time. 2013. 
arXiv: 1301.6628 [cs.DS].\n", "award": [], "sourceid": 1267, "authors": [{"given_name": "Matthew", "family_name": "Johnson", "institution": "MIT"}, {"given_name": "James", "family_name": "Saunderson", "institution": "MIT"}, {"given_name": "Alan", "family_name": "Willsky", "institution": "MIT"}]}