{"title": "Streaming Variational Bayes", "book": "Advances in Neural Information Processing Systems", "page_first": 1727, "page_last": 1735, "abstract": "We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation primitive function.  We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections.  We demonstrate the advantages of our algorithm over stochastic variational inference (SVI), both in the single-pass setting SVI was designed for and in the streaming setting, to which SVI does not apply.", "full_text": "Streaming Variational Bayes\n\nTamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson\n\nUniversity of California, Berkeley\n\n{tab@stat, nickboyd@eecs, wibisono@eecs, ashia@stat}.berkeley.edu\n\nMichael I. Jordan\n\nUniversity of California, Berkeley\njordan@cs.berkeley.edu\n\nAbstract\n\nWe present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data (a case where SVI may be applied) and in the streaming setting, where SVI does not apply.\n\n1 Introduction\n\nLarge, streaming data sets are increasingly the norm in science and technology. 
Simple descriptive statistics can often be readily computed with a constant number of operations for each data point in the streaming setting, without the need to revisit past data or have advance knowledge of future data. But these time and memory restrictions are not generally available for the complex, hierarchical models that practitioners often have in mind when they collect large data sets. Significant progress on scalable learning procedures has been made in recent years [e.g., 1, 2]. But the underlying models remain simple, and the inferential framework is generally non-Bayesian. The advantages of the Bayesian paradigm (e.g., hierarchical modeling, coherent treatment of uncertainty) currently seem out of reach in the Big Data setting.\nAn exception to this statement is provided by [3\u20135], who have shown that a class of approximation methods known as variational Bayes (VB) [6] can be usefully deployed for large-scale data sets. They have applied their approach, referred to as stochastic variational inference (SVI), to the domain of topic modeling of document collections, an area with a major need for scalable inference algorithms. VB traditionally uses the variational lower bound on the marginal likelihood as an objective function, and the idea of SVI is to apply a variant of stochastic gradient descent to this objective. Notably, this objective is based on the conceptual existence of a full data set involving D data points (i.e., documents in the topic model setting), for a fixed value of D. Although the stochastic gradient is computed for a single, small subset of data points (documents) at a time, the posterior being targeted is a posterior for D data points. This value of D must be specified in advance and is used by the algorithm at each step. Posteriors for D\u2032 data points, for D\u2032 
\u2260 D, are not obtained as part of the analysis.\nWe view this lack of a link between the number of documents that have been processed thus far and the posterior that is being targeted as undesirable in many settings involving streaming data. In this paper we aim at an approximate Bayesian inference algorithm that is scalable like SVI but is also truly a streaming procedure, in that it yields an approximate posterior for each processed collection of D\u2032 data points, and not just a pre-specified \u201cfinal\u201d number of data points D. To that end, we return to the classical perspective of Bayesian updating, where the recursive application of Bayes theorem provides a sequence of posteriors, not a sequence of approximations to a fixed posterior. To this classical recursive perspective we bring the VB framework; our updates need not be exact Bayesian updates but rather may be approximations such as VB. This approach is similar in spirit to assumed density filtering or expectation propagation [7\u20139], but each step of those methods involves a moment-matching step that can be computationally costly for models such as topic models. We are able to avoid the moment-matching step via the use of VB. We also note other related work in this general vein: MCMC approximations have been explored by [10], and VB or VB-like approximations have also been explored by [11, 12].\nAlthough the empirical success of SVI is the main motivation for our work, we are also motivated by recent developments in computer architectures, which permit distributed and asynchronous computations in addition to streaming computations. As we will show, a streaming VB algorithm naturally lends itself to distributed and asynchronous implementations.\n\n2 Streaming, distributed, asynchronous Bayesian updating\n\nStreaming Bayesian updating. Consider data x_1, x_2, . . . generated iid according to a distribution p(x | \u0398) given parameter(s) \u0398. 
Assume that a prior p(\u0398) has also been specified. Then Bayes theorem gives us the posterior distribution of \u0398 given a collection of S data points, C_1 := (x_1, . . . , x_S):\n\np(\u0398 | C_1) = p(C_1)^{\u22121} p(C_1 | \u0398) p(\u0398),\n\nwhere p(C_1 | \u0398) = p(x_1, . . . , x_S | \u0398) = \u220f_{s=1}^{S} p(x_s | \u0398).\nSuppose we have seen and processed b \u2212 1 collections, sometimes called minibatches, of data. Given the posterior p(\u0398 | C_1, . . . , C_{b\u22121}), we can calculate the posterior after the bth minibatch:\n\np(\u0398 | C_1, . . . , C_b) \u221d p(C_b | \u0398) p(\u0398 | C_1, . . . , C_{b\u22121}).    (1)\n\nThat is, we treat the posterior after b \u2212 1 minibatches as the new prior for the incoming data points. If we can save the posterior from b \u2212 1 minibatches and calculate the normalizing constant for the bth posterior, repeated application of Eq. (1) is streaming; it automatically gives us the new posterior without needing to revisit old data points.\nIn complex models, it is often infeasible to calculate the posterior exactly, and an approximation must be used. Suppose that, given a prior p(\u0398) and data minibatch C, we have an approximation algorithm A that calculates an approximate posterior q: q(\u0398) = A(C, p(\u0398)). Then, setting q_0(\u0398) = p(\u0398), one way to recursively calculate an approximation to the posterior is\n\np(\u0398 | C_1, . . . , C_b) \u2248 q_b(\u0398) = A(C_b, q_{b\u22121}(\u0398)).    (2)\n\nWhen A yields the posterior from Bayes theorem, this calculation is exact. This approach already differs from that of [3\u20135], which we will see (Sec. 3.2) directly approximates p(\u0398 | C_1, . . . , C_B) for fixed B without making intermediate approximations for b strictly between 1 and B.\nDistributed Bayesian updating. The sequential updates in Eq. 
(2) handle streaming data in theory, but in practice, the A calculation might take longer than the time interval between minibatch arrivals or simply take longer than desired. Parallelizing computations increases algorithm throughput. And posterior calculations need not be sequential. Indeed, Bayes theorem yields\n\np(\u0398 | C_1, . . . , C_B) \u221d [\u220f_{b=1}^{B} p(C_b | \u0398)] p(\u0398) \u221d [\u220f_{b=1}^{B} p(\u0398 | C_b) p(\u0398)^{\u22121}] p(\u0398).    (3)\n\nThat is, we can calculate the individual minibatch posteriors p(\u0398 | C_b), perhaps in parallel, and then combine them to find the full posterior p(\u0398 | C_1, . . . , C_B).\nGiven an approximating algorithm A as above, the corresponding approximate update would be\n\np(\u0398 | C_1, . . . , C_B) \u2248 q(\u0398) \u221d [\u220f_{b=1}^{B} A(C_b, p(\u0398)) p(\u0398)^{\u22121}] p(\u0398),    (4)\n\nfor some approximating distribution q, provided the normalizing constant for the right-hand side of Eq. (4) can be computed.\nVariational inference methods are generally based on exponential family representations [6], and we will make that assumption here. In particular, we suppose p(\u0398) \u221d exp{\u03be_0 \u00b7 T(\u0398)}; that is, p(\u0398) is an exponential family distribution for \u0398 with sufficient statistic T(\u0398) and natural parameter \u03be_0. We suppose further that A always returns a distribution in the same exponential family; in particular, we suppose that there exists some parameter \u03be_b such that\n\nq_b(\u0398) \u221d exp{\u03be_b \u00b7 T(\u0398)}  for  q_b(\u0398) = A(C_b, p(\u0398)).    (5)\n\nWhen we make these two assumptions, the update in Eq. (4) becomes\n\np(\u0398 | C_1, . . . , C_B) \u2248 q(\u0398) \u221d exp{[\u03be_0 + \u2211_{b=1}^{B} (\u03be_b \u2212 \u03be_0)] \u00b7 T(\u0398)},    (6)\n\nwhere the normalizing constant is readily obtained from the exponential family form. 
In what follows we use the shorthand \u03be \u2190 A(C, \u03be_0) to denote that A takes as input a minibatch C and a prior with exponential family parameter \u03be_0 and that it returns a distribution in the same exponential family with parameter \u03be.\nSo, to approximate p(\u0398 | C_1, . . . , C_B), we first calculate \u03be_b via the approximation primitive A for each minibatch C_b; note that these calculations may be performed in parallel. Then we sum together the quantities \u03be_b \u2212 \u03be_0 across b, along with the initial \u03be_0 from the prior, to find the final exponential family parameter to the full posterior approximation q. We previously saw that the general Bayes sequential update can be made streaming by iterating with the old posterior as the new prior (Eq. (2)). Similarly, here we see that the full posterior approximation q is in the same exponential family as the prior, so one may iterate these parallel computations to arrive at a parallelized algorithm for streaming posterior computation.\nWe emphasize that while these updates are reminiscent of prior-posterior conjugacy, it is actually the approximate posteriors and single, original prior that we assume belong to the same exponential family. It is not necessary to assume any conjugacy in the generative model itself nor that any true intermediate or final posterior take any particular limited form.\nAsynchronous Bayesian updating. Performing B computations in parallel can in theory speed up algorithm running time by a factor of B, but in practice it is often the case that a single computation thread takes longer than the rest. Waiting for this thread to finish diminishes potential gains from distributing the computations. This problem can be ameliorated by making computations asynchronous. In this case, processors known as workers each solve a subproblem. When a worker finishes, it reports its solution to a single master processor. 
If the master gives the worker a new subproblem without waiting for the other workers to finish, it can decrease downtime in the system.\nOur asynchronous algorithm is in the spirit of Hogwild! [1]. To present the algorithm we first describe an asynchronous computation that we will not use in practice, but which will serve as a conceptual stepping stone. Note in particular that the following scheme makes the computations in Eq. (6) asynchronous. Have each worker continuously iterate between three steps: (1) collect a new minibatch C, (2) compute the local approximate posterior \u03be \u2190 A(C, \u03be_0), and (3) return \u2206\u03be := \u03be \u2212 \u03be_0 to the master. The master, in turn, starts by assigning the posterior to equal the prior: \u03be^{(post)} \u2190 \u03be_0. Each time the master receives a quantity \u2206\u03be from any worker, it updates the posterior synchronously: \u03be^{(post)} \u2190 \u03be^{(post)} + \u2206\u03be. If A returns the exponential family parameter of the true posterior (rather than an approximation), then the posterior at the master is exact by Eq. (4).\nA preferred asynchronous computation works as follows. The master initializes its posterior estimate to the prior: \u03be^{(post)} \u2190 \u03be_0. Each worker continuously iterates between four steps: (1) collect a new minibatch C, (2) copy the master posterior value locally \u03be^{(local)} \u2190 \u03be^{(post)}, (3) compute the local approximate posterior \u03be \u2190 A(C, \u03be^{(local)}), and (4) return \u2206\u03be := \u03be \u2212 \u03be^{(local)} to the master. Each time the master receives a quantity \u2206\u03be from any worker, it updates the posterior synchronously: \u03be^{(post)} \u2190 \u03be^{(post)} + \u2206\u03be.\nThe key difference between the first and second frameworks proposed above is that, in the second, the latest posterior is used as a prior. This latter framework is more in line with the streaming update of Eq. 
(2) but introduces a new layer of approximation. Since \u03be^{(post)} might change at the master while the worker is computing \u2206\u03be, it is no longer the case that the posterior at the master is exact when A returns the exponential family parameter of the true posterior. Nonetheless we find that the latter framework performs better in practice, so we focus on it exclusively in what follows.\nWe refer to our overall framework as SDA-Bayes, which stands for (S)treaming, (D)istributed, (A)synchronous Bayes. The framework is intended to be general enough to allow a variety of local approximations A. Indeed, SDA-Bayes works out of the box once an implementation of A (and a prior on the global parameter(s) \u0398) is provided. In the current paper our preferred local approximation will be VB.\n\n3 Case study: latent Dirichlet allocation\n\nIn what follows, we consider examples of the choices for the \u0398 prior and primitive A in the context of latent Dirichlet allocation (LDA) [13]. LDA models the content of D documents in a corpus. Themes potentially shared by multiple documents are described by topics. The unsupervised learning problem is to learn the topics as well as discover which topics occur in which documents.\nMore formally, each topic (of K total topics) is a distribution over the V words in the vocabulary: \u03b2_k = (\u03b2_{kv})_{v=1}^{V}. Each document is an admixture of topics. The words in document d are assumed to be exchangeable. Each word w_{dn} belongs to a latent topic z_{dn} chosen according to a document-specific distribution of topics \u03b8_d = (\u03b8_{dk})_{k=1}^{K}. The full generative model, with Dirichlet priors for \u03b2_k and \u03b8_d conditioned on respective parameters \u03b7_k and \u03b1, appears in [13].\nTo see that this model fits our specification in Sec. 2, consider the set of global parameters \u0398 = \u03b2. Each document w_d = (w_{dn})_{n=1}^{N_d} is distributed iid conditioned on the global topics. 
The full collection of data is a corpus C = w = (w_d)_{d=1}^{D} of documents. The posterior for LDA, p(\u03b2, \u03b8, z | C, \u03b7, \u03b1), is equal to the following expression up to proportionality:\n\np(\u03b2, \u03b8, z | C, \u03b7, \u03b1) \u221d [\u220f_{k=1}^{K} Dirichlet(\u03b2_k | \u03b7_k)] \u00b7 [\u220f_{d=1}^{D} Dirichlet(\u03b8_d | \u03b1)] \u00b7 [\u220f_{d=1}^{D} \u220f_{n=1}^{N_d} \u03b8_{d z_{dn}} \u03b2_{z_{dn}, w_{dn}}].    (7)\n\nThe posterior for just the global parameters p(\u03b2 | C, \u03b7, \u03b1) can be obtained from p(\u03b2, \u03b8, z | C, \u03b7, \u03b1) by integrating out the local, document-specific parameters \u03b8, z. As is common in complex models, the normalizing constant for Eq. (7) is intractable to compute, so the posterior must be approximated.\n\n3.1 Posterior-approximation algorithms\n\nTo apply SDA-Bayes to LDA, we use the prior specified by the generative model. It remains to choose a posterior-approximation algorithm A. We consider two possibilities here: variational Bayes (VB) and expectation propagation (EP). Both primitives take Dirichlet distributions as priors for \u03b2 and both return Dirichlet distributions for the approximate posterior of the topic parameters \u03b2; thus the prior and approximate posterior are in the same exponential family. Hence both VB and EP can be utilized as a choice for A in the SDA-Bayes framework.\nMean-field variational Bayes. We use the shorthand p_D for Eq. (7), the posterior given D documents. We assume the approximating distribution, written q_D for shorthand, takes the form\n\nq_D(\u03b2, \u03b8, z | \u03bb, \u03b3, \u03c6) = [\u220f_{k=1}^{K} q_D(\u03b2_k | \u03bb_k)] \u00b7 [\u220f_{d=1}^{D} q_D(\u03b8_d | \u03b3_d)] \u00b7 [\u220f_{d=1}^{D} \u220f_{n=1}^{N_d} q_D(z_{dn} | \u03c6_{d w_{dn}})]    (8)\n\nfor parameters (\u03bb_{kv}), (\u03b3_{dk}), (\u03c6_{dvk}) with k \u2208 {1, . . . , K}, v \u2208 {1, . . . , V}, d \u2208 {1, . . . 
, D}.\nMoreover, we set q_D(\u03b2_k | \u03bb_k) = Dirichlet_V(\u03b2_k | \u03bb_k), q_D(\u03b8_d | \u03b3_d) = Dirichlet_K(\u03b8_d | \u03b3_d), and q_D(z_{dn} | \u03c6_{d w_{dn}}) = Categorical_K(z_{dn} | \u03c6_{d w_{dn}}). The subscripts on Dirichlet and Categorical indicate the dimensions of the distributions (and of the parameters).\nThe problem of VB is to find the best approximating q_D, defined as the collection of variational parameters \u03bb, \u03b3, \u03c6 that minimize the KL divergence from the true posterior: KL(q_D \u2016 p_D). Even finding the minimizing parameters is a difficult optimization problem. Typically the solution is approximated by coordinate descent in each parameter [6, 13] as in Alg. 1. The derivation of VB for LDA can be found in [4, 13] and Sup. Mat. A.1.\n\nAlgorithm 1: VB for LDA\nInput: Data (n_d)_{d=1}^{D}; hyperparameters \u03b7, \u03b1\nOutput: \u03bb\nInitialize \u03bb\nwhile (\u03bb, \u03b3, \u03c6) not converged do\n    for d = 1, . . . , D do\n        (\u03b3_d, \u03c6_d) \u2190 LocalVB(d, \u03bb)\n    \u2200(k, v), \u03bb_{kv} \u2190 \u03b7_{kv} + \u2211_{d=1}^{D} \u03c6_{dvk} n_{dv}\n\nSubroutine LocalVB(d, \u03bb)\nOutput: (\u03b3_d, \u03c6_d)\nInitialize \u03b3_d\nwhile (\u03b3_d, \u03c6_d) not converged do\n    \u2200(k, v), set \u03c6_{dvk} \u221d exp(E_q[log \u03b8_{dk}] + E_q[log \u03b2_{kv}]) (normalized across k)\n    \u2200k, \u03b3_{dk} \u2190 \u03b1_k + \u2211_{v=1}^{V} \u03c6_{dvk} n_{dv}\n\nAlgorithm 2: SVI for LDA\nInput: Hyperparameters \u03b7, \u03b1, D, (\u03c1_t)_{t=1}^{T}\nOutput: \u03bb\nInitialize \u03bb\nfor t = 1, . . . , T do\n    Collect new data minibatch C\n    foreach document indexed d in C do\n        (\u03b3_d, \u03c6_d) \u2190 LocalVB(d, \u03bb)\n    \u2200(k, v), \u03bb\u0303_{kv} \u2190 \u03b7_{kv} + (D / |C|) \u2211_{d in C} \u03c6_{dvk} n_{dv}\n    \u2200(k, v), \u03bb_{kv} \u2190 (1 \u2212 \u03c1_t) \u03bb_{kv} + \u03c1_t \u03bb\u0303_{kv}\n\nAlgorithm 3: SSU for LDA\nInput: Hyperparameters \u03b7, \u03b1\nOutput: A sequence \u03bb^{(1)}, \u03bb^{(2)}, . . .\nInitialize \u2200(k, v), \u03bb^{(0)}_{kv} \u2190 \u03b7_{kv}\nfor b = 1, 2, . . . do\n    Collect new data minibatch C\n    foreach document indexed d in C do\n        (\u03b3_d, \u03c6_d) \u2190 LocalVB(d, \u03bb)\n    \u2200(k, v), \u03bb^{(b)}_{kv} \u2190 \u03bb^{(b\u22121)}_{kv} + \u2211_{d in C} \u03c6_{dvk} n_{dv}\n\nFigure 1: Algorithms for calculating \u03bb, the parameters for the topic posteriors in LDA. VB iterates multiple times through the data, SVI makes a single pass, and SSU is streaming. Here, n_{dv} represents the number of words v in document d.\n\nExpectation propagation. An EP [7] algorithm for approximating the LDA posterior appears in Alg. 6 of Sup. Mat. B. Alg. 6 differs from [14], which does not provide an approximate posterior for the topic parameters, and is instead our own derivation. Our version of EP, like VB, learns factorized Dirichlet distributions over topics.\n\n3.2 Other single-pass algorithms for approximate LDA posteriors\n\nThe algorithms in Sec. 3.1 pass through the data multiple times and require storing the data set in memory, but are useful as primitives for SDA-Bayes in the context of the processing of minibatches of data. Next, we consider two algorithms that can pass through a data set just one time (single pass) and to which we compare in the evaluations (Sec. 4).\nStochastic variational inference. VB uses coordinate descent to find a value of q_D, Eq. (8), that locally minimizes the KL divergence, KL(q_D \u2016 p_D). Stochastic variational inference (SVI) [3, 4] is exactly the application of a particular version of stochastic gradient descent to the same optimization problem. While stochastic gradient descent can often be viewed as a streaming algorithm, the optimization problem itself here depends on D via p_D, the posterior on D data points. We see that, as a result, D must be specified in advance, appears in each step of SVI (see Alg. 2), and is independent of the number of data points actually processed by the algorithm. Nonetheless, while one may choose to visit D\u2032 
\u2260 D data points or revisit data points when using SVI to estimate p_D [3, 4], SVI can be made single-pass by visiting each of D data points exactly once and then has constant memory requirements. We also note that two new parameters, \u03c4_0 > 0 and \u03ba \u2208 (0.5, 1], appear in SVI, beyond those in VB, to determine a learning rate \u03c1_t as a function of iteration t: \u03c1_t := (\u03c4_0 + t)^{\u2212\u03ba}.\nSufficient statistics. On each round of VB (Alg. 1), we update the local parameters for all documents and then compute \u03bb_{kv} \u2190 \u03b7_{kv} + \u2211_{d=1}^{D} \u03c6_{dvk} n_{dv}. An alternative single-pass (and indeed streaming) option would be to update the local parameters for each minibatch of documents as they arrive and then add the corresponding terms \u03c6_{dvk} n_{dv} to the current estimate of \u03bb for each document d in the minibatch. This essential idea has been proposed previously for models other than LDA by [11, 12] and forms the basis of what we call the sufficient statistics update algorithm (SSU): Alg. 3. This algorithm is equivalent to SDA-Bayes with A chosen to be a single iteration over the global variable \u03bb of VB (i.e., updating \u03bb exactly once instead of iterating until convergence).\n\nCorpus     Metric          32-SDA   1-SDA    SVI      SSU\nWikipedia  Log pred prob   \u22127.31    \u22127.43    \u22127.32    \u22127.91\nWikipedia  Time (hours)    2.09     43.93    7.87     8.28\nNature     Log pred prob   \u22127.11    \u22127.19    \u22127.08    \u22127.82\nNature     Time (hours)    0.55     10.02    1.22     1.27\n\nTable 1: A comparison of (1) log predictive probability of held-out data and (2) running time of four algorithms: SDA-Bayes with 32 threads, SDA-Bayes with 1 thread, SVI, and SSU.\n\n4 Evaluation\n\nWe follow [4] (and further [15, 16]) in evaluating our algorithms by computing (approximate) predictive probability. 
Under this metric, a higher score is better, as a better model will assign a higher probability to the held-out words.\nWe calculate predictive probability by first setting aside held-out testing documents C^{(test)} from the full corpus and then further setting aside a subset of held-out testing words W_{d,test} in each testing document d. The remaining (training) documents C^{(train)} are used to estimate the global parameter posterior q(\u03b2), and the remaining (training) words W_{d,train} within the dth testing document are used to estimate the document-specific parameter posterior q(\u03b8_d).\u00b9 To calculate predictive probability, an approximation is necessary since we do not know the predictive distribution, just as we seek to learn the posterior distribution. Specifically, we calculate the normalized predictive distribution and report \u201clog predictive probability\u201d as\n\n[\u2211_{d\u2208C^{(test)}} log p(W_{d,test} | C^{(train)}, W_{d,train})] / [\u2211_{d\u2208C^{(test)}} |W_{d,test}|] = [\u2211_{d\u2208C^{(test)}} \u2211_{w_test\u2208W_{d,test}} log p(w_test | C^{(train)}, W_{d,train})] / [\u2211_{d\u2208C^{(test)}} |W_{d,test}|],\n\nwhere we use the approximation\n\np(w_test | C^{(train)}, W_{d,train}) = \u222b_\u03b2 \u222b_{\u03b8_d} (\u2211_{k=1}^{K} \u03b8_{dk} \u03b2_{k w_test}) p(\u03b8_d | W_{d,train}, \u03b2) p(\u03b2 | C^{(train)}) d\u03b8_d d\u03b2\n\u2248 \u222b_\u03b2 \u222b_{\u03b8_d} (\u2211_{k=1}^{K} \u03b8_{dk} \u03b2_{k w_test}) q(\u03b8_d) q(\u03b2) d\u03b8_d d\u03b2 = \u2211_{k=1}^{K} E_q[\u03b8_{dk}] E_q[\u03b2_{k w_test}].\n\nTo facilitate comparison with SVI, we use the Wikipedia and Nature corpora of [3, 5] in our experiments. These two corpora represent a range of sizes (3,611,558 training documents for Wikipedia and 351,525 for Nature) as well as different types of topics. We expect words in Wikipedia to represent an extremely broad range of topics whereas we expect words in Nature to focus more on the sciences. We further use the vocabularies of [3, 5] and SVI code available online at [17]. 
We hold out 10,000 Wikipedia documents and 1,024 Nature documents (not included in the counts above) for testing.\nIn the results presented in the main text, we follow [3, 4] in fitting an LDA model with K = 100 topics and hyperparameters chosen as: \u2200k, \u03b1_k = 1/K, \u2200(k, v), \u03b7_{kv} = 0.01. For both Wikipedia and Nature, we set the parameters in SVI according to the optimal values of the parameters described in Table 1 of [3] (number of documents D correctly set in advance, step size parameters \u03ba = 0.5 and \u03c4_0 = 64).\nFigs. 3(a) and 3(d) demonstrate that both SVI and SDA are sensitive to minibatch size when \u03b7_{kv} = 0.01, with generally superior performance at larger batch sizes. Interestingly, both SVI and SDA performance improve and are steady across batch size when \u03b7_{kv} = 1 (Figs. 3(a) and 3(d)). Nonetheless, we use \u03b7_{kv} = 0.01 in what follows in the interest of consistency with [3, 4]. Moreover, in the remaining experiments, we use a large minibatch size of 2^{15} = 32,768. This size is the largest before SVI performance degrades in the Nature data set (Fig. 3(d)).\nPerformance and timing results are shown in Table 1. One would expect that with additional streaming capabilities, SDA-Bayes should show a performance loss relative to SVI. 
We see from Table 1 that such loss is small in the single-thread case, while SSU performs much worse. SVI is faster than single-thread SDA-Bayes in this single-pass setting.\n\n\u00b9 In all cases, we estimate q(\u03b8_d) for evaluative purposes using VB since direct EP estimation takes prohibitively long.\n\n[Figure 2 plots: (a) Wikipedia and (b) Nature, log predictive probability vs. number of threads; (c) Wikipedia and (d) Nature, run time (hours) vs. number of threads; each panel compares sync and async.]\nFigure 2: SDA-Bayes log predictive probability (two left plots) and run time (two right plots) as a function of number of threads.\n\nFull SDA-Bayes improves run time with no performance cost. We handicap SDA-Bayes in the above comparisons by utilizing just a single thread. In Table 1, we also report performance of SDA-Bayes with 32 threads and the same minibatch size. In the synchronous case, we consider minibatch size to equal the total number of data points processed per round; therefore, the minibatch size equals the number of data points sent to each thread per round times the total number of threads. In the asynchronous case, we analogously report minibatch size as this product.\nFig. 2 shows the performance of SDA-Bayes when we run with {1, 2, 4, 8, 16, 32} threads while keeping the minibatch size constant. 
The goal in such a distributed context is to improve run time while not hurting performance. Indeed, we see dramatic run time improvement as the number of threads grows and in fact some slight performance improvement as well. We tried both a parallel version and a fully distributed, asynchronous version of the algorithm; Fig. 2 indicates that the speedup and performance improvements we see here come from parallelizing, which is theoretically justified by Eq. (3) when A is Bayes rule. Our experiments indicate that our Hogwild!-style asynchrony does not hurt performance. In our experiments, the processing time at each thread seems to be approximately equal across threads and to dominate any communication time at the master, so synchronous and asynchronous performance and running time are essentially identical. In general, a practitioner might prefer asynchrony since it is more robust to node failures.\nSVI is sensitive to the choice of total data size D. The evaluations above are for a single posterior over D data points. Of greater concern to us in this work is the evaluation of algorithms in the streaming setting. We have seen that SVI is designed to find the posterior for a particular, pre-chosen number of data points D. In practice, when we run SVI on the full data set but change the input value of D in the algorithm, we can see degradations in performance. In particular, we try values of D equal to {0.01, 0.1, 1, 10, 100} times the true D in Fig. 3(b) for the Wikipedia data set and in Fig. 3(e) for the Nature data set.\nA practitioner in the streaming setting will typically not know D in advance, or multiple values of D may be of interest. Figs. 3(b) and 3(e) illustrate that an estimate may not be sufficient. Even in the case where D is known in advance, it is reasonable to imagine a new influx of further data. 
One might need to run SVI again from the start (and, in so doing, revisit the first data set) to obtain the desired performance.\nSVI is sensitive to learning step size. [3, 5] use cross-validation to tune step-size parameters (\u03c4_0, \u03ba) in the stochastic gradient descent component of the SVI algorithm. This cross-validation requires multiple runs over the data and thus is not suited to the streaming setting. Figs. 3(c) and 3(f) demonstrate that the parameter choice does indeed affect algorithm performance. In these figures, we keep D at the true training data size.\n[3] have observed that the optimal (\u03c4_0, \u03ba) may interact with minibatch size, and we further observe that the optimal values may vary with D as well. We also note that recent work has suggested a way to update (\u03c4_0, \u03ba) adaptively during an SVI run [18].\n\n[Figure 3 plots: (a) Sensitivity to minibatch size on Wikipedia and (d) on Nature: log predictive probability vs. log batch size (base 2), for SVI and SDA with \u03b7 = 1.0 and \u03b7 = 0.01. (b) SVI sensitivity to D on Wikipedia (D \u2208 {36115, 361155, 3611558, 36115580, 361155800}) and (e) on Nature (D \u2208 {3515, 35152, 351525, 3515250, 35152500}): log predictive probability vs. number of examples seen. (c) SVI sensitivity to stepsize parameters on Wikipedia and (f) on Nature: log predictive probability vs. number of examples seen, for \u03c4_0 \u2208 {16, 64, 256} and \u03ba \u2208 {0.5, 1.0}.]\nFigure 3: Sensitivity of SVI and SDA-Bayes to some respective parameters. Legends have the same top-to-bottom order as the rightmost curve points.\n\nEP is not suited to LDA. Earlier attempts to apply EP to the LDA model in the non-streaming setting have had mixed success, with [19] in particular finding that EP performance can be poor for LDA and, moreover, that EP requires \u201cunrealistic intermediate storage requirements.\u201d We found this to also be true in the streaming setting. 
We were not able to obtain competitive results with EP;\nbased on an 8-thread implementation of SDA-Bayes with an EP primitive (see footnote 2), after over 91 hours on\nWikipedia (and 6.7\u00d710^4 data points), log predictive probability had stabilized at around \u22127.95 and,\nafter over 97 hours on Nature (and 9.7\u00d710^4 data points), log predictive probability had stabilized at\naround \u22128.02. Although SDA-Bayes with the EP primitive is not effective for LDA, it remains to be\nseen whether this combination may be useful in other domains where EP is known to be effective.\n\n5 Discussion\n\nWe have introduced SDA-Bayes, a framework for streaming, distributed, asynchronous computation\nof an approximate Bayesian posterior. Our framework makes streaming updates to the estimated\nposterior according to a user-specified approximation primitive. We have demonstrated the\nusefulness of our framework, with variational Bayes as the primitive, by fitting the latent Dirichlet\nallocation topic model to the Wikipedia and Nature corpora. We have demonstrated the advantages\nof our algorithm over stochastic variational inference and the sufficient statistics update algorithm,\nparticularly with respect to the key issue of obtaining approximations to posterior probabilities based\non the number of documents seen thus far, not posterior probabilities for a fixed number of documents.\n\nAcknowledgments\n\nWe thank M. Hoffman, C. Wang, and J. Paisley for discussions, code, and data and our reviewers\nfor helpful comments. TB is supported by the Berkeley Fellowship, NB by a Hertz Foundation\nFellowship, and ACW by the Chancellor\u2019s Fellowship at UC Berkeley.
This research is supported in\npart by NSF award CCF-1139158, DARPA Award FA8750-12-2-0331, AMPLab sponsor donations,\nand the ONR under grant number N00014-11-1-0688.\n\nFootnote 2: We chose 8 threads since any fewer was too slow to get results and anything larger created too high a\nmemory demand on our system.\n\nReferences\n\n[1] F. Niu, B. Recht, C. R\u00e9, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Neural Information Processing Systems, 2011.\n\n[2] A. Kleiner, A. Talwalkar, P. Sarkar, and M. Jordan. The big data bootstrap. In International Conference on Machine Learning, 2012.\n\n[3] M. Hoffman, D. M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems, volume 23, pages 856\u2013864, 2010.\n\n[4] M. Hoffman, D. M. Blei, J. Paisley, and C. Wang. Stochastic variational inference. Journal of Machine Learning Research, 14:1303\u20131347, 2013.\n\n[5] C. Wang, J. Paisley, and D. M. Blei. Online variational inference for the hierarchical Dirichlet process. In Artificial Intelligence and Statistics, 2011.\n\n[6] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.\n\n[7] T. P. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence, pages 362\u2013369. Morgan Kaufmann, 2001.\n\n[8] T. P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.\n\n[9] M. Opper. A Bayesian approach to on-line learning.\n\n[10] K. R. Canini, L. Shi, and T. L. Griffiths. Online inference of topics with latent Dirichlet allocation. In Artificial Intelligence and Statistics, volume 5, 2009.\n\n[11] A. Honkela and H. Valpola. On-line variational Bayesian learning.
In International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803\u2013808, 2003.\n\n[12] J. Luts, T. Broderick, and M. P. Wand. Real-time semiparametric regression. Journal of Computational and Graphical Statistics, to appear. Preprint arXiv:1209.3550.\n\n[13] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n\n[14] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Uncertainty in Artificial Intelligence, pages 352\u2013359. Morgan Kaufmann, 2002.\n\n[15] Y. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Neural Information Processing Systems, 2006.\n\n[16] A. Asuncion, M. Welling, P. Smyth, and Y. Teh. On smoothing and inference for topic models. In Uncertainty in Artificial Intelligence, 2009.\n\n[17] M. Hoffman. Online inference for LDA (Python code) at http://www.cs.princeton.edu/~blei/downloads/onlineldavb.tar, 2010.\n\n[18] R. Ranganath, C. Wang, D. M. Blei, and E. P. Xing. An adaptive learning rate for stochastic variational inference. In International Conference on Machine Learning, 2013.\n\n[19] W. L. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In Uncertainty in Artificial Intelligence.\n\n[20] M. Seeger. Expectation propagation for exponential families.
Technical report, University of California at Berkeley, 2005.\n", "award": [], "sourceid": 874, "authors": [{"given_name": "Tamara", "family_name": "Broderick", "institution": "UC Berkeley"}, {"given_name": "Nicholas", "family_name": "Boyd", "institution": "UC Berkeley"}, {"given_name": "Andre", "family_name": "Wibisono", "institution": "UC Berkeley"}, {"given_name": "Ashia", "family_name": "Wilson", "institution": "UC Berkeley"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}]}