{"title": "Stochastic Gradient MCMC with Stale Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 2937, "page_last": 2945, "abstract": "Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale Bayesian learning, with well-developed theoretical convergence properties. In such applications of SG-MCMC, it is becoming increasingly popular to employ distributed systems, where stochastic gradients are computed based on some outdated parameters, yielding what are termed stale gradients. While stale gradients could be directly used in SG-MCMC, their impact on convergence properties has not been well studied. In this paper we develop theory to show that while the bias and MSE of an SG-MCMC algorithm depend on the staleness of stochastic gradients, its estimation variance (relative to the expected estimate, based on a prescribed number of samples) is independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale gradients are computed asynchronously by a set of workers, our theory indicates a linear speedup on the decrease of estimation variance w.r.t. the number of workers. Experiments on synthetic data and deep neural networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC with stale gradients.", "full_text": "Stochastic Gradient MCMC with Stale Gradients\n\nChangyou Chen\u2020\n\nNan Ding\u2021\n\nChunyuan Li\u2020\n\nYizhe Zhang\u2020\n\nLawrence Carin\u2020\n\n\u2020Dept. of Electrical and Computer Engineering, Duke University, Durham, NC, USA\n\n\u2020{cc448,cl319,yz196,lcarin}@duke.edu; \u2021dingnan@google.com\n\n\u2021Google Inc., Venice, CA, USA\n\nAbstract\n\nStochastic gradient MCMC (SG-MCMC) has played an important role in large-\nscale Bayesian learning, with well-developed theoretical convergence properties. 
In such applications of SG-MCMC, it is becoming increasingly popular to employ distributed systems, where stochastic gradients are computed based on some outdated parameters, yielding what are termed stale gradients. While stale gradients could be directly used in SG-MCMC, their impact on convergence properties has not been well studied. In this paper we develop theory to show that while the bias and MSE of an SG-MCMC algorithm depend on the staleness of stochastic gradients, its estimation variance (relative to the expected estimate, based on a prescribed number of samples) is independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale gradients are computed asynchronously by a set of workers, our theory indicates a linear speedup on the decrease of estimation variance w.r.t. the number of workers. Experiments on synthetic data and deep neural networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC with stale gradients.

1 Introduction

The pervasiveness of big data has made scalable machine learning increasingly important, especially for deep models. A basic technique is to adopt stochastic optimization algorithms [1], e.g., stochastic gradient descent and its extensions [2]. In each iteration of stochastic optimization, a minibatch of data is used to evaluate the gradients of the objective function and update model parameters (errors are introduced in the gradients because they are computed based on minibatches rather than the entire dataset; since the minibatches are typically selected at random, this yields the term "stochastic" gradient). This is highly scalable because processing a minibatch of data in each iteration is relatively cheap compared to analyzing the entire (large) dataset at once. Under certain conditions, stochastic optimization is guaranteed to converge to a (local) optimum [1].
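The minibatch mechanics described above can be sketched as follows (an illustration of ours, not code from the paper; the least-squares model and all variable names are assumptions for the example):

```python
import numpy as np

def sgd_step(theta, X, y, lr, batch_size, rng):
    """One stochastic gradient descent step on a least-squares loss.

    A random minibatch approximates the full-data gradient, so each
    step touches only `batch_size` rows instead of the whole dataset.
    """
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of 0.5 * ||X theta - y||^2 averaged over the minibatch;
    # the per-example mean is an unbiased estimate of the full-data mean gradient.
    grad = Xb.T @ (Xb @ theta - yb) / batch_size
    return theta - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta = np.zeros(3)
for _ in range(2000):
    theta = sgd_step(theta, X, y, lr=0.1, batch_size=10, rng=rng)
# theta is now close to true_theta
```

Because the targets here are noiseless, the stochastic iterates contract toward the exact solution; with noisy data one would decay the step size instead.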
Because of its scalability, the minibatch strategy has recently been extended to Markov Chain Monte Carlo (MCMC) Bayesian sampling methods, yielding SG-MCMC [3, 4, 5].

In order to handle large-scale data, distributed stochastic optimization algorithms have been developed, for example [6], to further improve scalability. In a distributed setting, a cluster of machines with multiple cores cooperate with each other, typically through an asynchronous scheme, for scalability [7, 8, 9]. A downside of an asynchronous implementation is that stale gradients must be used in parameter updates ("stale gradients" are stochastic gradients computed based on outdated parameters, instead of the latest parameters; they are easier to compute in a distributed system, but introduce additional errors relative to traditional stochastic gradients). While some theory has been developed to guarantee the convergence of stochastic optimization with stale gradients [10, 11, 12], little analysis has been done in a Bayesian setting, where SG-MCMC is applied. Distributed SG-MCMC algorithms share characteristics with distributed stochastic optimization, and thus are highly scalable and suitable for large-scale Bayesian learning. Existing Bayesian distributed systems with traditional MCMC methods, such as [13], usually employ stale statistics instead of stale gradients, where stale statistics are summarized based on outdated parameters, e.g., outdated topic distributions in distributed Gibbs sampling [13].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Little theory exists to guarantee the convergence of such methods. For existing distributed SG-MCMC methods, typically only standard stochastic gradients are used, for limited problems such as matrix factorization, without rigorous convergence theory [14, 15, 16].

In this paper, by extending techniques from standard SG-MCMC [17], we develop theory to study the convergence behavior of SG-MCMC with Stale gradients (S2G-MCMC). Our goal is to evaluate the posterior average of a test function $\phi(x)$, defined as $\bar{\phi} \triangleq \int_X \phi(x) \rho(x) \, dx$, where $\rho(x)$ is the desired posterior distribution with $x$ the possibly augmented model parameters (see Section 2). In practice, S2G-MCMC generates $L$ samples $\{x_l\}_{l=1}^L$ and uses the sample average $\hat{\phi}_L \triangleq \frac{1}{L} \sum_{l=1}^L \phi(x_l)$ to approximate $\bar{\phi}$. We measure how $\hat{\phi}_L$ approximates $\bar{\phi}$ in terms of bias, MSE and estimation variance, defined as $|\mathbb{E}\hat{\phi}_L - \bar{\phi}|$, $\mathbb{E}(\hat{\phi}_L - \bar{\phi})^2$ and $\mathbb{E}(\hat{\phi}_L - \mathbb{E}\hat{\phi}_L)^2$, respectively. From the definitions, the bias and MSE characterize how accurately $\hat{\phi}_L$ approximates $\bar{\phi}$, and the variance characterizes how fast $\hat{\phi}_L$ converges to its own expectation (for a prescribed number of samples $L$). Our theoretical results show that while the bias and MSE depend on the staleness of stochastic gradients, the variance is independent of it. In a simple asynchronous Bayesian distributed system with S2G-MCMC, our theory indicates a linear speedup on the decrease of the variance w.r.t. the number of workers used to calculate the stale gradients, while maintaining the same optimal bias level as standard SG-MCMC. We validate our theory on several synthetic experiments and deep neural network models, demonstrating the effectiveness and scalability of the proposed S2G-MCMC framework.

Related Work Using stale gradients is a standard setup in distributed stochastic optimization systems. Representative algorithms include, but are not limited to, the ASYSG-CON [6] and HOGWILD! algorithms [18], and some more recent developments [19, 20].
Furthermore, recent research on stochastic optimization has been extended to non-convex problems with provable convergence rates [12]. In Bayesian learning with MCMC, existing work has focused on running parallel chains on subsets of data [21, 22, 23, 24], and little if any effort has been made to use stale stochastic gradients, the setting considered in this paper.

2 Stochastic Gradient MCMC

Throughout this paper, we denote vectors as bold lower-case letters, and matrices as bold upper-case letters. For example, $\mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma})$ means a multivariate Gaussian distribution with mean $\mathbf{m}$ and covariance $\boldsymbol{\Sigma}$. In the analysis we consider algorithms with fixed stepsizes for simplicity; decreasing-stepsize variants can be addressed similarly as in [17].

The goal of SG-MCMC is to generate random samples from a posterior distribution $p(\theta \mid D) \propto p(\theta) \prod_{i=1}^{N} p(d_i \mid \theta)$, which are used to evaluate a test function. Here $\theta \in \mathbb{R}^n$ represents the parameter vector and $D = \{d_1, \cdots, d_N\}$ represents the data, $p(\theta)$ is the prior distribution, and $p(d_i \mid \theta)$ the likelihood for $d_i$. SG-MCMC algorithms are based on a class of stochastic differential equations, called Itô diffusions, defined as

$d x_t = F(x_t)\, dt + g(x_t)\, d w_t,$  (1)

where $x \in \mathbb{R}^m$ represents the model states; typically $x$ augments $\theta$ such that $\theta \subseteq x$ and $n \le m$; $t$ is the time index, $w_t \in \mathbb{R}^m$ is $m$-dimensional Brownian motion, and the functions $F: \mathbb{R}^m \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^{m \times m}$ are assumed to satisfy the usual Lipschitz continuity condition [25].

For appropriate functions $F$ and $g$, the stationary distribution, $\rho(x)$, of the Itô diffusion (1) has a marginal distribution equal to the posterior distribution $p(\theta \mid D)$ [26]. For example, denoting the unnormalized negative log-posterior as $U(\theta) \triangleq -\log p(\theta) - \sum_{i=1}^{N} \log p(d_i \mid \theta)$, the stochastic gradient Langevin dynamic (SGLD) method [3] is based on 1st-order Langevin dynamics, with $x = \theta$, $F(x_t) = -\nabla_\theta U(\theta)$, and $g(x_t) = \sqrt{2}\, I_n$, where $I_n$ is the $n \times n$ identity matrix. The stochastic gradient Hamiltonian Monte Carlo (SGHMC) method [4] is based on 2nd-order Langevin dynamics, with $x = (\theta, q)$, $F(x_t) = \begin{pmatrix} q \\ -B q - \nabla_\theta U(\theta) \end{pmatrix}$, and $g(x_t) = \sqrt{2B} \begin{pmatrix} 0 & 0 \\ 0 & I_n \end{pmatrix}$ for a scalar $B > 0$; $q$ is an auxiliary variable known as the momentum [4, 5]. Diffusion forms for other SG-MCMC algorithms, such as the stochastic gradient thermostat [5] and variants with Riemannian information geometry [27, 26, 28], are defined similarly.

In order to efficiently draw samples from the continuous-time diffusion (1), SG-MCMC algorithms typically apply two approximations: i) Instead of analytically integrating infinitesimal increments $dt$, numerical integration over small step size $h$ is used to approximate the integration of the true dynamics.
ii) Instead of working with the full gradient $\nabla_\theta U(\theta_{lh})$, a stochastic gradient $\nabla_\theta \tilde{U}_l(\theta_{lh})$, defined as

$\nabla_\theta \tilde{U}_l(\theta) \triangleq -\nabla_\theta \log p(\theta) - \frac{N}{J} \sum_{i=1}^{J} \nabla_\theta \log p(d_{\pi_i} \mid \theta),$  (2)

is calculated from a minibatch of size $J$, where $\{\pi_1, \cdots, \pi_J\}$ is a random subset of $\{1, \cdots, N\}$. Note that to match the time index $t$ in (1), parameters have been and will be indexed by "lh" in the $l$-th iteration.

3 Stochastic Gradient MCMC with Stale Gradients

In this section, we extend SG-MCMC to the stale-gradient setting, commonly met in asynchronous distributed systems [7, 8, 9], and develop theory to analyze convergence properties.

3.1 Stale stochastic gradient MCMC (S2G-MCMC)

The setting for S2G-MCMC is the same as the standard SG-MCMC described above, except that the stochastic gradient (2) is replaced with a stochastic gradient evaluated with an outdated parameter $\theta_{(l-\tau_l)h}$ instead of the latest version $\theta_{lh}$ (see Appendix A for an example):

$\nabla_\theta \hat{U}_{\tau_l}(\theta) \triangleq -\nabla_\theta \log p(\theta_{(l-\tau_l)h}) - \frac{N}{J} \sum_{i=1}^{J} \nabla_\theta \log p(d_{\pi_i} \mid \theta_{(l-\tau_l)h}),$  (3)

where $\tau_l \in \mathbb{Z}^+$ denotes the staleness of the parameter used to calculate the stochastic gradient in the $l$-th iteration. A distinctive difference between S2G-MCMC and SG-MCMC is that stale stochastic gradients are no longer unbiased estimations of the true gradients.
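As a toy illustration of the stale stochastic gradient in (3) (our own sketch, using the simple Gaussian model $d_i \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 1)$; the function and variable names are assumptions, not from the paper), the gradient is evaluated at a parameter drawn from a buffer of past states rather than the latest one:

```python
import numpy as np

def stale_stoch_grad(theta_history, tau_l, data, J, rng):
    """Stale stochastic gradient (cf. eq. (3)): the minibatch gradient is
    evaluated at the parameter from tau_l iterations ago.

    Toy model: d_i ~ N(theta, 1), prior theta ~ N(0, 1), so
    grad log p(theta) = -theta and grad log p(d_i | theta) = d_i - theta.
    """
    N = len(data)
    theta_old = theta_history[-1 - tau_l]        # outdated parameter
    idx = rng.choice(N, size=J, replace=False)   # random minibatch
    grad_log_prior = -theta_old
    grad_log_lik = np.sum(data[idx] - theta_old)
    # grad U-hat = -grad log prior - (N/J) * sum of minibatch likelihood grads
    return -grad_log_prior - (N / J) * grad_log_lik

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=1000)
history = [0.0, 0.1, 0.2, 0.3]   # past parameter values, newest last
g_fresh = stale_stoch_grad(history, 0, data, J=10, rng=rng)   # tau_l = 0
g_stale = stale_stoch_grad(history, 2, data, J=10, rng=rng)   # tau_l = 2
```

With `tau_l = 0` this reduces to the standard stochastic gradient (2); with `tau_l > 0` the same minibatch estimator is simply applied at an older point, which is what makes it biased with respect to the gradient at the current parameter.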
This leads to additional challenges in developing convergence bounds, one of the main contributions of this paper. We assume a bounded staleness for all $\tau_l$'s, i.e., $\max_l \tau_l \le \tau$ for some constant $\tau$. As an example, Algorithm 1 describes the update rule of the stale-SGHMC in each iteration with the Euler integrator, where the stale gradient $\nabla_\theta \hat{U}_{\tau_l}(\theta)$ with staleness $\tau_l$ is used.

Algorithm 1 State update of SGHMC with the stale stochastic gradient $\nabla_\theta \hat{U}_{\tau_l}(\theta)$
  Input: $x_{lh} = (\theta_{lh}, q_{lh})$, $\nabla_\theta \hat{U}_{\tau_l}(\theta)$, $\tau_l$, $\tau$, $h$, $B$
  Output: $x_{(l+1)h} = (\theta_{(l+1)h}, q_{(l+1)h})$
  if $\tau_l \le \tau$ then
    Draw $\zeta_l \sim \mathcal{N}(0, I)$;
    $q_{(l+1)h} = (1 - Bh)\, q_{lh} - \nabla_\theta \hat{U}_{\tau_l}(\theta)\, h + \sqrt{2Bh}\, \zeta_l$;
    $\theta_{(l+1)h} = \theta_{lh} + q_{(l+1)h}\, h$;
  end if

3.2 Convergence analysis

This section analyzes the convergence properties of the basic S2G-MCMC; an extension with multiple chains is discussed in Section 3.3. It is shown that the bias and MSE depend on the staleness parameter $\tau$, while the variance is independent of it, yielding significant speedup in Bayesian distributed systems.

Bias and MSE In [17], the bias and MSE of the standard SG-MCMC algorithms with a $K$th-order integrator were analyzed, where the order of an integrator reflects how accurately an SG-MCMC algorithm approximates the corresponding continuous diffusion. Specifically, if evolving $x_t$ with a numerical integrator using discrete time increment $h$ induces an error bounded by $O(h^K)$, the integrator is called a $K$th-order integrator; e.g., the popular Euler method used in SGLD [3] is a 1st-order integrator. In particular, [17] proved the bounds stated in Lemma 1.

Lemma 1 ([17]).
Under standard assumptions (see Appendix B), the bias and MSE of SG-MCMC with a $K$th-order integrator at time $T = hL$ are bounded as:

Bias: $\left| \mathbb{E}\hat{\phi}_L - \bar{\phi} \right| = O\left( \frac{\sum_l \|\mathbb{E}\Delta V_l\|}{L} + \frac{1}{Lh} + h^K \right)$

MSE: $\mathbb{E}\left( \hat{\phi}_L - \bar{\phi} \right)^2 = O\left( \frac{1}{L} \sum_l \frac{\mathbb{E}\|\Delta V_l\|^2}{L} + \frac{1}{Lh} + h^{2K} \right)$

Here $\Delta V_l \triangleq \mathcal{L} - \tilde{\mathcal{L}}_l$, where $\mathcal{L}$ is the generator of the Itô diffusion (1), defined as

$\mathcal{L} f(x_t) \triangleq \lim_{h \to 0^+} \frac{\mathbb{E}[f(x_{t+h})] - f(x_t)}{h} = \left( F(x_t) \cdot \nabla_x + \frac{1}{2} \left( g(x_t) g(x_t)^T \right) : \nabla_x \nabla_x^T \right) f(x_t),$  (4)

for any compactly supported twice differentiable function $f: \mathbb{R}^n \to \mathbb{R}$; $h \to 0^+$ means $h$ approaches zero along the positive real axis. $\tilde{\mathcal{L}}_l$ is the same as $\mathcal{L}$ except using the stochastic gradient $\nabla \tilde{U}_l$ instead of the full gradient.

We show that the bounds of the bias and MSE of S2G-MCMC share similar forms as SG-MCMC, but with additional dependence on the staleness parameter. In addition to the assumptions in SG-MCMC [17] (see details in Appendix B), the following additional assumption is imposed.

Assumption 1. The noise in the stochastic gradients is well-behaved, such that: 1) the stochastic gradient is unbiased, i.e., $\nabla_\theta U(\theta) = \mathbb{E}_\xi \nabla_\theta \tilde{U}(\theta)$, where $\xi$ denotes the random permutation over $\{1, \cdots, N\}$; 2) the variance of the stochastic gradient is bounded, i.e., $\mathbb{E}_\xi \| U(\theta) - \tilde{U}(\theta) \|^2 \le \sigma^2$; 3) the gradient function $\nabla_\theta U$ is Lipschitz (so is $\nabla_\theta \tilde{U}$), i.e., $\|\nabla_\theta U(x) - \nabla_\theta U(y)\| \le C \|x - y\|, \; \forall x, y$.

In the following theorems, we omit the assumption statement for conciseness. Due to the staleness of the stochastic gradients, the term $\Delta V_l$ in S2G-MCMC is equal to $\mathcal{L} - \tilde{\mathcal{L}}_{l-\tau_l}$, where $\tilde{\mathcal{L}}_{l-\tau_l}$ arises from $\nabla_\theta \hat{U}_{\tau_l}$. The challenge is to bound the terms involving $\Delta V_l$. To this end, define $f_{lh} \triangleq \|x_{lh} - x_{(l-1)h}\|$, and $\psi$ to be a functional satisfying the Poisson equation*:

$\frac{1}{L} \sum_{l=1}^{L} \mathcal{L} \psi(x_{lh}) = \hat{\phi}_L - \bar{\phi}.$  (5)

Theorem 2. After $L$ iterations, the bias of S2G-MCMC with a $K$th-order integrator is bounded, for some constant $D_1$ independent of $\{L, h, \tau\}$, as:

$\left| \mathbb{E}\hat{\phi}_L - \bar{\phi} \right| \le D_1 \left( \frac{1}{Lh} + M_1 \tau h + M_2 h^K \right),$

where $M_1 \triangleq \max_l |\mathcal{L} f_{lh}| \max_l \|\mathbb{E}\nabla\psi(x_{lh})\|\, C$ and $M_2 \triangleq \sum_{k=1}^{K} \frac{\sum_l \mathbb{E} \tilde{\mathcal{L}}_l^{k+1} \psi(x_{(l-1)h})}{(k+1)!\, L}$ are constants.

Theorem 3. After $L$ iterations, the MSE of S2G-MCMC with a $K$th-order integrator is bounded, for some constant $D_2$ independent of $\{L, h, \tau\}$, as:

$\mathbb{E}\left( \hat{\phi}_L - \bar{\phi} \right)^2 \le D_2 \left( \frac{1}{Lh} + \tilde{M}_1 \tau^2 h^2 + \tilde{M}_2 h^{2K} \right),$

where the constants $\tilde{M}_1 \triangleq \max_l \|\mathbb{E}\nabla\psi(x_{lh})\|^2 \max_l (\mathcal{L} f_{lh})^2 C^2$ and $\tilde{M}_2 \triangleq \mathbb{E}\left( \frac{\sum_l \tilde{\mathcal{L}}_l^{K+1} \psi(x_{(l-1)h})}{L(K+1)!} \right)^2$.

The theorems indicate that both the bias and MSE depend on the staleness parameter $\tau$. For a fixed computational time, this could possibly lead to unimproved bounds, compared to standard SG-MCMC, when $\tau$ is too large, i.e., the terms with $\tau$ would dominate, as is the case in the distributed system discussed in Section 4. Nevertheless, better bounds than standard SG-MCMC could be obtained if the decrease of $\frac{1}{Lh}$ is faster than the increase of the staleness in a distributed system.

Variance Next we investigate the convergence behavior of the variance, $\mathrm{Var}(\hat{\phi}_L) \triangleq \mathbb{E}(\hat{\phi}_L - \mathbb{E}\hat{\phi}_L)^2$. Theorem 4 indicates that the variance is independent of $\tau$; hence a linear speedup in the decrease of variance is always achievable when stale gradients are computed in parallel. An example is discussed in the Bayesian distributed system in Section 4.

Theorem 4. After $L$ iterations, the variance of S2G-MCMC with a $K$th-order integrator is bounded, for some constant $D$, as:

$\mathrm{Var}\left( \hat{\phi}_L \right) \le D \left( \frac{1}{Lh} + h^{2K} \right).$

The variance bound is the same as for standard SG-MCMC, whereas $L$ could increase linearly w.r.t. the number of workers in a distributed setting, yielding significant variance reduction. When optimizing the variance bound w.r.t. $h$, we get the optimal variance bound stated in Corollary 5.

Corollary 5. In terms of estimation variance, the optimal convergence rate of S2G-MCMC with a $K$th-order integrator is bounded as: $\mathrm{Var}\left( \hat{\phi}_L \right) \le O\left( L^{-2K/(2K+1)} \right)$.

*The existence of a nice $\psi$ is guaranteed in the elliptic/hypoelliptic SDE settings when $x$ is on a torus [25].

In real distributed systems, the decrease of $1/(Lh)$ and the increase of $\tau$, in the bias and MSE bounds, would typically cancel, leading to the same bias and MSE level compared to standard SG-MCMC, whereas a linear speedup on the decrease of variance w.r.t. the number of workers is always achievable. More details are discussed in Section 4.

3.3 Extension to multiple parallel chains

This section extends the theory to the setting with $S$ parallel chains, each independently running an S2G-MCMC algorithm. After generating samples from the $S$ chains, an aggregation step is needed to combine the sample averages from the chains, i.e., $\{\hat{\phi}_{L_s}\}_{s=1}^{S}$, where $L_s$ is the number of iterations on chain $s$. For generality, we allow each chain to have a different step size, e.g., $(h_s)_{s=1}^{S}$. We aggregate the sample averages as $\hat{\phi}_L^S \triangleq \sum_{s=1}^{S} \frac{T_s}{T} \hat{\phi}_{L_s}$, where $T_s \triangleq L_s h_s$ and $T \triangleq \sum_{s=1}^{S} T_s$.

Interestingly, with increasing $S$, using multiple chains does not seem to directly improve the convergence rate for the bias, but improves the MSE bound, as stated in Theorem 6.

Theorem 6. Let $T_m \triangleq \max_s T_s$, $h_m \triangleq \max_s h_s$, and $\bar{T} = T/S$. The bias and MSE of $S$ parallel S2G-MCMC chains with a $K$th-order integrator are bounded, for some constants $D_1$ and $D_2$ independent of $\{L, h, \tau\}$, as:

Bias: $\left| \mathbb{E}\hat{\phi}_L^S - \bar{\phi} \right| \le D_1 \left( \frac{1}{\bar{T}} + \frac{T_m}{\bar{T}} \sum_{s=1}^{S} \frac{T_s}{T} \left( M_1 \tau h_s + M_2 h_s^K \right) \right)$

MSE: $\mathbb{E}\left( \hat{\phi}_L^S - \bar{\phi} \right)^2 \le D_2 \left( \frac{1 - 1/\bar{T}}{T} + \frac{1}{\bar{T}^2} + \frac{T_m^2}{\bar{T}^2} \sum_{s=1}^{S} \frac{T_s}{T} \left( M_1^2 \tau^2 h_s^2 + M_2^2 h_s^{2K} \right) \right).$

Assume that $\bar{T} = T/S$ is independent of the number of chains. As a result, using multiple chains does not directly improve the bound for the bias†. However, for the MSE bound, although the last two terms are independent of $S$, the first term decreases linearly with respect to $S$ because $T = \bar{T} S$. This indicates a decreased estimation variance with more chains.
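The aggregation step of Section 3.3 can be sketched as a weighted average with weights $T_s / T$ (a minimal illustration of ours; the function name and example inputs are assumptions, not from the paper):

```python
import numpy as np

def aggregate_chains(chain_means, step_sizes, iter_counts):
    """Combine per-chain sample averages as in Section 3.3:
    phi_hat_S = sum_s (T_s / T) * phi_hat_{L_s},  with T_s = L_s * h_s.
    """
    T_s = np.asarray(iter_counts, dtype=float) * np.asarray(step_sizes, dtype=float)
    weights = T_s / T_s.sum()          # T_s / T, weights sum to 1
    return float(np.dot(weights, chain_means))

# Three chains with different step sizes and iteration counts:
est = aggregate_chains(chain_means=[1.0, 1.2, 0.9],
                       step_sizes=[0.01, 0.02, 0.01],
                       iter_counts=[1000, 500, 2000])
# T_s = [10, 10, 20], so the weights are [0.25, 0.25, 0.5]
```

When all chains use the same step size and length, the weights are uniform and the aggregate reduces to the plain mean of the chain averages.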
This matches the intuition, because more samples can be obtained with more chains in a given amount of time. The decrease of MSE for multiple chains is due to the decrease of the variance, as stated in Theorem 7.

Theorem 7. The variance of $S$ parallel S2G-MCMC chains with a $K$th-order integrator is bounded, for some constant $D$ independent of $\{L, h, \tau\}$, as:

$\mathbb{E}\left( \hat{\phi}_L^S - \mathbb{E}\hat{\phi}_L^S \right)^2 \le D \left( \frac{1}{T} + \sum_{s=1}^{S} \frac{T_s^2}{T^2} h_s^{2K} \right).$

When using the same step size for all chains, Theorem 7 gives an optimal variance bound of $O\left( (\sum_s L_s)^{-2K/(2K+1)} \right)$, i.e., a linear speedup with respect to $S$ is achieved.

In addition, Theorem 6 with $\tau = 0$ and $K = 1$ provides convergence rates for the distributed SGLD algorithm in [14], i.e., improved MSE and variance bounds compared to the single-server SGLD.

4 Applications to Distributed SG-MCMC Systems

Our theory for S2G-MCMC is general, serving as a basic analytic tool for distributed SG-MCMC systems. We propose two simple Bayesian distributed systems with S2G-MCMC in the following.

Single-chain distributed SG-MCMC Perhaps the simplest architecture is an asynchronous distributed SG-MCMC system, where a server runs an S2G-MCMC algorithm, with stale gradients computed asynchronously by $W$ workers. The detailed operations of the server and workers are described in Appendix A.

With our theory, we can now explain the convergence property of this simple distributed system with SG-MCMC, i.e., a linear speedup w.r.t. the number of workers on the decrease of variance, while maintaining the same bias level. To this end, rewrite $L = W \bar{L}$ in Theorems 2 and 3, where $\bar{L}$ is the average number of iterations on each worker. We can observe from the theorems that when $M_1 \tau h > M_2 h^K$ in the bias and $\tilde{M}_1 \tau^2 h^2 > \tilde{M}_2 h^{2K}$ in the MSE, the terms with $\tau$ dominate. Optimizing the bounds with respect to $h$ yields a bound of $O((\tau / W\bar{L})^{1/2})$ for the bias, and $O((\tau / W\bar{L})^{2/3})$ for the MSE. In practice, we usually observe $\tau \approx W$, which makes $W$ in the optimal bounds cancel, i.e., the same optimal bias and MSE bounds as standard SG-MCMC are obtained, and no theoretical speedup is achieved when increasing $W$. However, from Corollary 5, the variance is independent of $\tau$; thus a linear speedup on the variance bound can always be obtained when increasing the number of workers, i.e., the distributed SG-MCMC system converges a factor of $W$ faster than standard SG-MCMC on a single machine. We are not aware of similar conclusions from optimization, because most of the research focuses on the convex setting, and thus only variance (equivalent to MSE) is studied.

Multiple-chain distributed SG-MCMC We can also adopt multiple servers, based on the multiple-chain setup in Section 3.3, where each chain corresponds to one server. The detailed architecture is described in Appendix A. This architecture trades off communication cost with convergence rates. As indicated by Theorems 6 and 7, the MSE and variance bounds can be improved with more servers. Note that when only one worker is associated with each server, we recover the setting of $S$ independent servers. Compared to the single-server architecture described above with $S$ workers, from Theorems 2–7, while the variance bound is the same, the single-server architecture improves the bias and MSE bounds by a factor of $S$.

More advanced architectures More complex architectures could also be designed to reduce communication cost, for example, by extending the downpour [7] and elastic SGD [29] architectures to the SG-MCMC setting.

†It means the bound does not directly relate to low-order terms of $S$, though constants might be improved.
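A minimal simulation of the single-chain architecture above (our own sketch, not the system from Appendix A: a toy 1-d Gaussian posterior, with asynchrony modeled by a FIFO pipeline so that each applied gradient is roughly $W$ iterations stale) illustrates how a server applies SGLD updates with stale gradients:

```python
import numpy as np
from collections import deque

def async_sgld(grad_U, theta0, h, n_iter, n_workers, rng):
    """Toy single-server simulation: gradients enter a FIFO pipeline of
    depth n_workers, so the gradient applied at each step was computed
    at a parameter that is n_workers - 1 iterations old (tau ~ W).
    """
    theta = theta0
    pending = deque()                    # gradients computed at past thetas
    for _ in range(n_iter):
        pending.append(grad_U(theta))    # a worker starts a gradient here
        if len(pending) < n_workers:     # pipeline not yet full
            continue
        g = pending.popleft()            # stale gradient reaches the server
        theta = theta - h * g + np.sqrt(2 * h) * rng.normal()  # SGLD step
    return theta

# Target posterior ~ exp(-U) with U(theta) = (theta - 2)^2, i.e. N(2, 1/2)
rng = np.random.default_rng(0)
samples = [async_sgld(lambda t: 2 * (t - 2.0), 0.0, h=0.01,
                      n_iter=3000, n_workers=4, rng=rng) for _ in range(200)]
# the empirical mean of the samples should be near the posterior mean 2
```

The staleness here only slows mixing slightly; averaging the end states of independent runs recovers the posterior mean, matching the claim that the variance behavior is unaffected by moderate staleness.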
Their convergence properties can also be analyzed with our theory, since they essentially use stale gradients. We leave the detailed analysis for future work.

5 Experiments

Our primary goal is to validate the theory; comparing with different distributed architectures and algorithms, such as [30, 31], is beyond the scope of this paper. We first use two synthetic experiments to validate the theory, then apply the distributed architecture described in Section 4 to Bayesian deep learning. To quantitatively describe the speedup property, we adopt the iteration speedup [12], defined as: iteration speedup $\triangleq$ (#iterations with a single worker) / (average #iterations on a worker), where # is the iteration count when the same level of precision is achieved. This speedup best matches the theory. We also consider the time speedup, defined as: (running time for a single worker) / (running time for $W$ workers), where the running time is recorded at the same accuracy. It is affected significantly by hardware, and thus is not accurately consistent with the theory.

5.1 Synthetic experiments

Impact of stale gradients A simple Gaussian model is used to verify the impact of stale gradients on the convergence accuracy, with $d_i \sim \mathcal{N}(\theta, 1)$, $\theta \sim \mathcal{N}(0, 1)$. 1000 data samples $\{d_i\}$ are generated, with minibatches of size 10 used to calculate stochastic gradients. The test function is $\phi(\theta) \triangleq \theta^2$. The distributed SGLD algorithm is adopted in this experiment. We aim to verify the optimal MSE bound $\propto \tau^{2/3} L^{-2/3}$, derived from Theorem 3 and discussed in Section 4 (with $W = 1$). The optimal stepsize is $h = C \tau^{-2/3} L^{-1/3}$ for some constant $C$. Based on the optimal bound, setting $L = L_0 \times \tau$ for some fixed $L_0$ and varying $\tau$'s would result in the same MSE, which is $\propto L_0^{-2/3}$. In the experiments we set $C = 1/30$, $L_0 = 500$, $\tau = \{1, 2, 5, 10, 15, 20\}$, and average over 200 runs to approximate the expectations in the MSE formula. As indicated in Figure 1, approximately the same MSEs are obtained after $L_0 \tau$ iterations for different $\tau$ values, consistent with the theory. Note that since the stepsizes are set so that the end points of the curves reach the optimal MSEs, the curves would not match the optimal MSE curves of $\tau^{2/3} L^{-2/3}$ in general, except at the end points, i.e., they are lower bounded by $\tau^{2/3} L^{-2/3}$.

[Figure 1: MSE vs. #iterations ($L = 500 \times \tau$) with increasing staleness $\tau$, resulting in roughly the same MSE.]

Convergence speedup of the variance A Bayesian logistic regression model (BLR) is adopted to verify the variance convergence properties. We use the Adult dataset‡, a9a, with 32,561 training samples and 16,281 test samples. The test function is defined as the standard logistic loss. We average over 10 runs to estimate the expectation $\mathbb{E}\hat{\phi}_L$ in the variance. We use the single-server distributed architecture in Section 4, with multiple workers computing stale gradients in parallel. We plot the variance versus the average number of iterations on the workers ($\bar{L}$) and the running time in Figure 2 (a) and (b), respectively. We can see that the variance drops faster with an increasing number of workers.

‡http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

[Figure 2: Variance with increasing number of workers. (a) Variance vs. iteration $\bar{L}$; (b) Variance vs. time; (c) Speedup.]
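The estimation-variance measurement used in this experiment can be sketched with a toy stand-in for the BLR setup (our own illustration, using a 1-d Gaussian posterior instead of a9a; names and constants are assumptions): run SGLD several times, form the sample average $\hat{\phi}_L$ per run, and take the variance of those averages across runs.

```python
import numpy as np

def sgld_run(grad_U, h, L, rng):
    """One SGLD run; returns the sample average phi_hat_L of phi(x) = x."""
    theta, total = 0.0, 0.0
    for _ in range(L):
        theta += -h * grad_U(theta) + np.sqrt(2 * h) * rng.normal()
        total += theta
    return total / L

def estimation_variance(grad_U, h, L, runs, rng):
    """Estimate Var(phi_hat_L) = E(phi_hat_L - E phi_hat_L)^2 over runs."""
    estimates = [sgld_run(grad_U, h, L, rng) for _ in range(runs)]
    return float(np.var(estimates))

rng = np.random.default_rng(0)
grad_U = lambda t: t            # U = t^2 / 2, target posterior N(0, 1)
v_short = estimation_variance(grad_U, h=0.05, L=200, runs=50, rng=rng)
v_long = estimation_variance(grad_U, h=0.05, L=2000, runs=50, rng=rng)
# the variance shrinks as L grows, in line with the 1/(Lh) + h^{2K} bound
```

In the paper's experiments the same across-run averaging (10 runs) plays the role of `runs` here, with the logistic loss as the test function.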
To quantitatively relate these results to the theory, Corollary 5 indicates that $W_1 L_1 = W_2 L_2$, where $(W_i, L_i)_{i=1}^2$ denote the numbers of workers and iterations at the same variance, i.e., a linear speedup is achieved. The iteration speedup and time speedup are plotted in Figure 2 (c), showing that the iteration speedup scales approximately linearly with the number of workers, consistent with Corollary 5; whereas the time speedup deteriorates when the number of workers is large, due to high system latency.

5.2 Applications to deep learning

We further test S2G-MCMC on Bayesian learning of deep neural networks. The distributed system is developed based on an MPI (message passing interface) extension of the popular Caffe package for deep learning [32]. We implement the SGHMC algorithm, with the point-to-point communications between servers and workers handled by the MPICH library. The algorithm is run on a cluster of five machines. Each machine is equipped with eight 3.60GHz Intel(R) Core(TM) i7-4790 CPU cores. We evaluate S2G-MCMC on the above BLR model and on two deep convolutional neural networks (CNNs). In all these models, zero-mean and unit-variance Gaussian priors are employed for the weights to capture weight uncertainty, an effective way to deal with overfitting [33]. We vary the number of servers $S$ among {1, 3, 5, 7}, and the number of workers for each server from 1 to 9.

LeNet for MNIST We modify the standard LeNet to a Bayesian setting for the MNIST dataset. LeNet consists of 2 convolutional layers, 2 max-pooling layers and 2 ReLU nonlinear layers, followed by 2 fully connected layers [34]. The detailed specification can be found in Caffe.
For simplicity, we use the default parameter setting specified in Caffe, with the additional parameter $B$ in SGHMC (Algorithm 1) set to $(1 - m)$, where $m$ is the momentum variable defined in the SGD algorithm in Caffe.

Cifar10-Quick net for CIFAR10 The Cifar10-Quick net consists of 3 convolutional layers, 3 max-pooling layers and 3 ReLU nonlinear layers, followed by 2 fully connected layers. The CIFAR-10 dataset consists of 60,000 color images of size 32×32 in 10 classes, with 50,000 for training and 10,000 for testing. Similar to LeNet, the default parameter setting specified in Caffe is used.

In these models, the test function is defined as the cross entropy of the softmax outputs $\{o_1, \cdots, o_N\}$ for test data $\{(d_1, y_1), \cdots, (d_N, y_N)\}$ with $C$ classes, i.e., $\mathrm{loss} = -\sum_{i=1}^{N} o_{y_i} + N \log \sum_{c=1}^{C} e^{o_c}$. Since the theory indicates a linear speedup on the decrease of variance w.r.t. the number of workers, this means that for a single run of the models, the loss would converge faster to its expectation with an increasing number of workers. The following experiments verify this intuition.

5.2.1 Single-server experiments

We first test the single-server architecture in Section 4 on the three models. Because the expectations in the bias, MSE or variance are not analytically available in these complex models, we instead plot the loss versus the average number of iterations ($\bar{L}$, defined in Section 4) on each worker and the running time in Figure 3. As mentioned above, a faster decrease of the loss with more workers is expected. For ease of visualization, we only plot the results with {1, 2, 4, 6, 9} workers; more detailed results are provided in Appendix I. We can see that generally the loss decreases faster with an increasing number of workers. On the CIFAR-10 dataset, the final losses with 6 and 9 workers are worse than the one with 4 workers.
This shows that the accuracy of the sample average suffers from the increased staleness that comes with more workers. Therefore, a smaller step size h should be used to maintain high accuracy when the number of workers is large. Note that the 1-worker curves correspond to standard SG-MCMC, whose loss decreases much more slowly due to its high estimation variance, though in theory it has the same level of bias as the single-server architecture for a given number of iterations (they converge to the same accuracy).

Figure 3: Testing loss vs. #workers. From left to right, each column corresponds to the a9a, MNIST and CIFAR datasets, respectively. The loss is defined in the text.

5.2.2 Multiple-server experiments
Finally, we test the multiple-server architecture on the same models. We use the same criterion as in the single-server setting to measure convergence behavior. The loss versus the average number of iterations per worker ( ¯L defined in Section 4) for the three datasets is plotted in Figure 4, where we vary the number of servers among {1, 3, 5, 7} and use 2 workers per server. Plots of loss versus time, and with different numbers of workers per server, are provided in the Appendix. We can see that in the simple BLR model, multiple servers do not show a significant speedup, probably due to the simplicity of the posterior, where the sample variance is too small for multiple servers to take effect; in the more complicated deep neural networks, using more servers results in a faster decrease of the loss, especially on the MNIST dataset.

Figure 4: Testing loss vs. #servers.
From left to right, each column corresponds to the a9a, MNIST and CIFAR datasets, respectively. The loss is defined in the text.

6 Conclusion
We extend theory from standard SG-MCMC to the stale-stochastic-gradient setting, and analyze the impact of staleness on the convergence behavior of an S2G-MCMC algorithm. Our theory reveals that the estimation variance is independent of the staleness, leading to a linear speedup w.r.t. the number of workers, although in practice little speedup in terms of optimal bias and MSE may be achieved, due to their dependence on the staleness. We test our theory on a simple asynchronous distributed SG-MCMC system with two simulated examples and several deep neural network models. Experimental results verify the effectiveness and scalability of the proposed S2G-MCMC framework.

Acknowledgements Supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.

References
[1] L. Bottou, editor. Online Algorithms and Stochastic Approximations. Cambridge University Press, 1998.
[2] L. Bottou. Stochastic gradient descent tricks. Technical report, Microsoft Research, Redmond, WA, 2012.
[3] M. Welling and Y. W. Teh.
Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
[4] T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.
[5] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.
[6] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In NIPS, 2011.
[7] J. Dean et al. Large scale distributed deep networks. In NIPS, 2012.
[8] T. Chen et al. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274, Dec. 2015.
[9] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[10] Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.
[11] M. Li, D. Andersen, A. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In NIPS, 2014.
[12] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, 2015.
[13] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In WSDM, 2012.
[14] S. Ahn, B. Shahbaba, and M. Welling. Distributed stochastic gradient MCMC. In ICML, 2014.
[15] S. Ahn, A. Korattikara, N. Liu, S. Rajan, and M. Welling. Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In KDD, 2015.
[16] U. Simsekli, H. Koptagel, H. Guldas, A. Y. Cemgil, F. Oztoprak, and S. Birbil. Parallel stochastic gradient Markov chain Monte Carlo for matrix factorisation models. Technical report, 2015.
[17] C. Chen, N. Ding, and L. Carin.
On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In NIPS, 2015.
[18] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[19] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularised stochastic optimization. Technical Report arXiv:1505.04824, May 2015.
[20] S. Chaturapruek, J. C. Duchi, and C. Ré. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In NIPS, 2015.
[21] S. L. Scott, A. W. Blocker, and F. V. Bonassi. Bayes and big data: The consensus Monte Carlo algorithm. In Bayes 250, 2013.
[22] M. Rabinovich, E. Angelino, and M. I. Jordan. Variational consensus Monte Carlo. In NIPS, 2015.
[23] W. Neiswanger, C. Wang, and E. P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In UAI, 2014.
[24] X. Wang, F. Guo, K. Heller, and D. Dunson. Parallelizing MCMC with random partition trees. In NIPS, 2015.
[25] J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Construction of numerical time-average and stationary measures via Poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010.
[26] Y. A. Ma, T. Chen, and E. B. Fox. A complete recipe for stochastic gradient MCMC. In NIPS, 2015.
[27] S. Patterson and Y. W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, 2013.
[28] C. Li, C. Chen, D. Carlson, and L. Carin. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI, 2016.
[29] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In NIPS, 2015.
[30] Y. W. Teh, L. Hasenclever, T. Lienart, S. Vollmer, S. Webb, B. Lakshminarayanan, and C. Blundell. Distributed Bayesian learning with stochastic natural-gradient expectation propagation and the posterior server.
Technical Report arXiv:1512.09327v1, December 2015.
[31] R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. Technical Report arXiv:1505.02827, May 2015.
[32] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[33] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In ICML, 2015.
[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[35] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. (Non-)asymptotic properties of stochastic gradient Langevin dynamics. Technical Report arXiv:1501.00438, University of Oxford, UK, January 2015.