{"title": "Variance Reduction for Stochastic Gradient Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 181, "page_last": 189, "abstract": "Stochastic gradient optimization is a class of widely used algorithms for training machine learning models. To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. Data statistics such as low-order moments (pre-computed or estimated online) is used to form the control variate. We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. One is convex---the MAP estimation for logistic regression, and the other is non-convex---stochastic variational inference for latent Dirichlet allocation. On both problems, our approach shows faster convergence and better performance than the classical approach.", "full_text": "Variance Reduction for\n\nStochastic Gradient Optimization\n\nChong Wang Xi Chen\u2217 Alex Smola Eric P. Xing\n\nCarnegie Mellon University, University of California, Berkeley\u2217\n\n{chongw,xichen,epxing}@cs.cmu.edu\n\nalex@smola.org\n\nAbstract\n\nStochastic gradient optimization is a class of widely used algorithms for training\nmachine learning models. To optimize an objective, it uses the noisy gradient\ncomputed from the random data samples instead of the true gradient computed\nfrom the entire dataset. However, when the variance of the noisy gradient is\nlarge, the algorithm might spend much time bouncing around, leading to slower\nconvergence and worse performance. 
In this paper, we develop a general approach\nof using control variate for variance reduction in stochastic gradient. Data statistics\nsuch as low-order moments (pre-computed or estimated online) is used to form\nthe control variate. We demonstrate how to construct the control variate for two\npractical problems using stochastic gradient optimization. One is convex\u2014the\nMAP estimation for logistic regression, and the other is non-convex\u2014stochastic\nvariational inference for latent Dirichlet allocation. On both problems, our approach\nshows faster convergence and better performance than the classical approach.\n\n1\n\nIntroduction\n\nStochastic gradient (SG) optimization [1, 2] is widely used for training machine learning models with\nvery large-scale datasets. It uses the noisy gradient (a.k.a. stochastic gradient) estimated from random\ndata samples rather than that from the entire data. Thus, stochastic gradient algorithms can run many\nmore iterations in a limited time budget. However, if the noisy gradient has a large variance, the\nstochastic gradient algorithm might spend much time bouncing around, leading to slower convergence\nand worse performance. Taking a mini-batch with a larger size for computing the noisy gradient could\nhelp to reduce its variance; but if the mini-batch size is too large, it can undermine the advantage in\nef\ufb01ciency of stochastic gradient optimization.\nIn this paper, we propose a general remedy to the \u201cnoisy gradient\u201d problem ubiquitous to all stochastic\ngradient optimization algorithms for different models. Our approach builds on a variance reduction\ntechnique, which makes use of control variates [3] to augment the noisy gradient and thereby reduce\nits variance. The augmented \u201cstochastic gradient\u201d can be shown to remain an unbiased estimate of\nthe true gradient, a necessary condition that ensures the convergence. 
For such control variates to be\neffective and sound, they must satisfy the following key requirements: 1) they have a high correlation\nwith the noisy gradient, and 2) their expectation (with respect to random data samples) is inexpensive\nto compute. We show that such control variates can be constructed via low-order approximations\nto the noisy gradient so that their expectation only depends on low-order moments of the data. The\nintuition is that these low-order moments roughly characterize the empirical data distribution, and\ncan be used to form the control variate to correct the noisy gradient to a better direction. In other\nwords, the variance of the augmented \u201cstochastic gradient\u201d becomes smaller as it is derived with\nmore information about the data.\nThe rest of the paper is organized as follows. In \u00a72, we describe the general formulation and the\ntheoretical property of variance reduction via control variates in stochastic gradient optimization.\n\n1\n\n\fIn \u00a73, we present two examples to show how one can construct control variates for practical algorithms.\n(More examples are provided in the supplementary material.) These include a convex problem\u2014the\nMAP estimation for logistic regression, and a non-convex problem\u2014stochastic variational inference\nfor latent Dirichlet allocation [22]. Finally, we demonstrate the empirical performance of our\nalgorithms under these two examples in \u00a74. We conclude with a discussion on some future work.\n\n2 Variance reduction for general stochastic gradient optimization\n\nWe begin with a description of the general formulation of variance reduction via control variate for\nstochastic gradient optimization. Consider a general optimization problem over a \ufb01nite set of training\ndata D = {xd}D\nd=1 with each xd \u2208 Rp. Here D is the number of the training data. 
We want to\nmaximize the following function with respect to a p-dimensional vector w,\n\n$$\\max_{w} \\; L(w) := R(w) + \\frac{1}{D} \\sum_{d=1}^{D} f(w; x_d),$$\n\nwhere R(w) is a regularization function.1 Gradient-based algorithms can be used to maximize L(w)\nat the expense of computing the gradient over the entire training set. Instead, stochastic gradient\n(SG) methods use the noisy gradient estimated from random data samples. Suppose data index d is\nselected uniformly from {1, . . . , D} at step t,\n\n$$g(w; x_d) = \\nabla_w R(w) + \\nabla_w f(w; x_d), \\tag{1}$$\n\n$$w_{t+1} = w_t + \\rho_t \\, g(w; x_d), \\tag{2}$$\n\nwhere g(w; xd) is the noisy gradient that only depends on xd and \u03c1t is a proper step size. To simplify\nnotation, we write gd(w) \u225c g(w; xd).\nFollowing the standard stochastic optimization literature [1, 4], we require that the expectation of the\nnoisy gradient gd equal the true gradient,\n\n$$\\mathbb{E}_d[g_d(w)] = \\nabla_w L(w), \\tag{3}$$\n\nto ensure the convergence of the stochastic gradient algorithm. When the variance of gd(w) is large,\nthe algorithm could suffer from slow convergence.\nThe basic idea of using control variates for variance reduction is to construct a new random vector\nthat has the same expectation as the target but smaller variance. In previous work [5],\ncontrol variates were used to improve the estimate of the intractable integral in variational Bayesian\ninference, which was then used to compute the gradient of the variational lower bound. In our context,\nwe employ a random vector hd(w) of length p to reduce the variance of the sampled gradient,\n\n$$\\tilde{g}_d(w) = g_d(w) - A^\\top (h_d(w) - h(w)), \\tag{4}$$\n\nwhere A is a p \u00d7 p matrix and h(w) \u225c Ed[hd(w)]. (We will show how to choose hd(w) later, but it\nusually depends on the form of gd(w).) The random vector $\\tilde{g}_d(w)$ has the same expectation as the\nnoisy gradient gd(w) in Eq. 1, and thus can be used to replace gd(w) in the SG update in Eq. 2. To\nreduce the variance of the noisy gradient, the trace of the covariance matrix of $\\tilde{g}_d(w)$,\n\n$$\\mathrm{Var}_d[\\tilde{g}_d(w)] \\triangleq \\mathrm{Cov}_d[\\tilde{g}_d(w), \\tilde{g}_d(w)] = \\mathrm{Var}_d[g_d(w)] - (\\mathrm{Cov}_d[h_d(w), g_d(w)] + \\mathrm{Cov}_d[g_d(w), h_d(w)]) A + A^\\top \\mathrm{Var}_d[h_d(w)] A, \\tag{5}$$\n\nmust be small; therefore we set A to be the minimizer of $\\mathrm{Tr}(\\mathrm{Var}_d[\\tilde{g}_d(w)])$. That is,\n\n$$A^* = \\operatorname{argmin}_A \\mathrm{Tr}(\\mathrm{Var}_d[\\tilde{g}_d(w)]) = (\\mathrm{Var}_d[h_d(w)])^{-1} (\\mathrm{Cov}_d[g_d(w), h_d(w)] + \\mathrm{Cov}_d[h_d(w), g_d(w)]) / 2. \\tag{6}$$\n\nThe optimal A\u2217 is a function of w.\n\nWhy is $\\tilde{g}_d(w)$ a better choice? Now we show that $\\tilde{g}_d(w)$ is a better \u201cstochastic gradient\u201d under the\n$\\ell_2$-norm. In the first-order stochastic oracle model, we normally assume that there exists a constant \u03c3\nsuch that for any estimate w in its domain [6, 7]:\n\n$$\\mathbb{E}_d\\left[\\|g_d(w) - \\mathbb{E}_d[g_d(w)]\\|_2^2\\right] = \\mathrm{Tr}(\\mathrm{Var}_d[g_d(w)]) \\le \\sigma^2.$$\n\n1 We follow the convention of maximizing a function f: when we mention a convex problem, we actually\nmean the objective function \u2212f is convex.\n\nUnder this assumption, the dominating term in the optimal convergence rate is O(\u03c3/\u221at) for convex\nproblems and O(\u03c32/(\u00b5t)) for strongly convex problems, where \u00b5 is the strong convexity parameter\n(see the definition of strong convexity on Page 459 in [8]).\nNow suppose that we can find a random vector hd(w) and compute A\u2217 according to Eq. 6. By\nplugging A\u2217 back into Eq. 5,\n\n$$\\mathbb{E}_d\\left[\\|\\tilde{g}_d(w) - \\mathbb{E}_d[\\tilde{g}_d(w)]\\|_2^2\\right] = \\mathrm{Tr}(\\mathrm{Var}_d[\\tilde{g}_d(w)]),$$\n\nwhere $\\mathrm{Var}_d[\\tilde{g}_d(w)] = \\mathrm{Var}_d[g_d(w)] - \\mathrm{Cov}_d[g_d(w), h_d(w)] (\\mathrm{Var}_d[h_d(w)])^{-1} \\mathrm{Cov}_d[h_d(w), g_d(w)]$.\n\nFor any estimate w, $\\mathrm{Cov}_d(g_d, h_d) (\\mathrm{Cov}_d(h_d, h_d))^{-1} \\mathrm{Cov}_d(h_d, g_d)$ is a positive semidefinite matrix.\nTherefore, its trace, which equals the sum of the eigenvalues, is nonnegative (zero when hd and gd\nare uncorrelated) and hence,\n\n$$\\mathbb{E}_d\\left[\\|\\tilde{g}_d(w) - \\mathbb{E}_d[\\tilde{g}_d(w)]\\|_2^2\\right] \\le \\mathbb{E}_d\\left[\\|g_d(w) - \\mathbb{E}_d[g_d(w)]\\|_2^2\\right].$$\n\nIn other words, it is possible to find a constant \u03c4 \u2264 \u03c3 such that $\\mathbb{E}_d[\\|\\tilde{g}_d(w) - \\mathbb{E}_d[\\tilde{g}_d(w)]\\|_2^2] \\le \\tau^2$\nfor all w. Therefore, when applying stochastic gradient methods, we could improve the optimal\nconvergence rate from O(\u03c3/\u221at) to O(\u03c4/\u221at) for convex problems; and from O(\u03c32/(\u00b5t)) to O(\u03c4 2/(\u00b5t))\nfor strongly convex problems.\nEstimating optimal A\u2217. When estimating A\u2217 according to Eq. 6, one needs to compute the inverse\nof Vard[hd(w)], which could be computationally expensive. In practice, we could constrain A to be\na diagonal matrix. According to Eq. 5, when A = Diag(a11, . . . , app), its optimal value is:\n\n$$a^*_{ii} = \\frac{[\\mathrm{Cov}_d(g_d(w), h_d(w))]_{ii}}{[\\mathrm{Var}_d(h_d(w))]_{ii}}. \\tag{7}$$\n\nThis formulation avoids the computation of the matrix inverse, and leads to a significant reduction\nof computational cost since only the diagonal elements of Covd(gd(w), hd(w)) and Vard(hd(w)),\ninstead of the full matrices, need to be evaluated. It can be shown that this simpler surrogate to the\nA\u2217 of Eq. 6 still leads to a better convergence rate. 
Specifically:\n\n$$\\mathbb{E}_d\\left[\\|\\tilde{g}_d(w) - \\mathbb{E}_d[\\tilde{g}_d(w)]\\|_2^2\\right] = \\mathrm{Tr}(\\mathrm{Var}_d(\\tilde{g}_d(w))) = \\mathrm{Tr}(\\mathrm{Var}_d(g_d(w))) - \\sum_{i=1}^{p} \\frac{([\\mathrm{Cov}_d(g_d(w), h_d(w))]_{ii})^2}{[\\mathrm{Var}_d(h_d(w))]_{ii}} = \\sum_{i=1}^{p} (1 - \\rho_{ii}^2) \\mathrm{Var}(g_d(w))_{ii} \\le \\mathrm{Tr}(\\mathrm{Var}_d(g_d(w))) = \\mathbb{E}_d\\left[\\|g_d(w) - \\mathbb{E}_d[g_d(w)]\\|_2^2\\right], \\tag{8}$$\n\nwhere \u03c1ii is the Pearson\u2019s correlation coefficient between [gd(w)]i and [hd(w)]i.\nIndeed, an even simpler surrogate to A\u2217, obtained by reducing A to a single real number a, can also\nimprove the convergence rate of SG. In this case, according to Eq. 5, the optimal a\u2217 is simply:\n\n$$a^* = \\mathrm{Tr}(\\mathrm{Cov}_d(g_d(w), h_d(w))) / \\mathrm{Tr}(\\mathrm{Var}_d(h_d(w))). \\tag{9}$$\n\nTo estimate the optimal A\u2217 or its surrogates, we need to evaluate Covd(gd(w), hd(w)) and\nVard(hd(w)) (or their diagonal elements), which can be approximated by the sample covariance and\nvariance from mini-batch samples while running the stochastic gradient algorithm. If we cannot\nalways obtain mini-batch samples, we may use strategies such as a moving average across iterations, as\nused in [9, 10].\nFrom Eq. 8, we observe that when the Pearson\u2019s correlation coefficient between gd(w) and hd(w)\nis higher, the control variate hd(w) will lead to a more significant level of variance reduction and\nhence faster convergence. In the maximal correlation case, one could set hd(w) = gd(w) to obtain\nzero variance. But obviously, we cannot compute Ed[hd(w)] efficiently in this case. In practice, one\nshould construct hd(w) such that it is highly correlated with gd(w). 
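
The scalar surrogate of Eq. 9 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not part of the paper: the toy data, the helper names trace_cov and scalar_control_variate, and the choice of a sigmoid as the noisy quantity with its first-order Taylor expansion around 0 as the control variate. The expectation h(w) is taken to be the sample mean over the whole toy dataset.

```python
import math
import random


def trace_cov(u, v):
    '''Trace of the sample covariance between two lists of equal-length vectors.'''
    n, p = len(u), len(u[0])
    mu_u = [sum(row[i] for row in u) / n for i in range(p)]
    mu_v = [sum(row[i] for row in v) / n for i in range(p)]
    return sum(
        (ru[i] - mu_u[i]) * (rv[i] - mu_v[i])
        for ru, rv in zip(u, v)
        for i in range(p)
    ) / n


def scalar_control_variate(g, h):
    '''Return a* as in Eq. 9 and the corrected vectors of Eq. 4 with A = a* I.'''
    n, p = len(h), len(h[0])
    # h(w) = E_d[h_d(w)], here the mean over the full (toy) dataset.
    h_mean = [sum(row[i] for row in h) / n for i in range(p)]
    a_star = trace_cov(g, h) / trace_cov(h, h)
    g_tilde = [
        [gd[i] - a_star * (hd[i] - h_mean[i]) for i in range(p)]
        for gd, hd in zip(g, h)
    ]
    return a_star, g_tilde


# Toy data (hypothetical): g_d is a sigmoid of x_d, h_d its linearization at 0.
rng = random.Random(0)
xs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2000)]
g = [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in xs]
h = [[0.5 + 0.25 * x for x in row] for row in xs]

a_star, g_tilde = scalar_control_variate(g, h)
var_before = trace_cov(g, g)          # Tr(Var_d[g_d])
var_after = trace_cov(g_tilde, g_tilde)  # Tr(Var_d[g~_d])
print(a_star, var_before, var_after)
```

In this sketch the corrected vectors keep the same mean as the original ones, while the trace of their covariance cannot exceed that of the original vectors, which mirrors the argument around Eq. 8. In an actual SG run the covariance terms would be estimated from mini-batches rather than from the full dataset.
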
In next section, we will show\nhow to construct control variates for both convex and non-convex problems.\n\n3 Practicing variance reduction on convex and non-convex problems\n\nIn this section, we apply the variance reduction technique presented above to two exemplary but\npractical problems: MAP estimation for logistic regression\u2014a convex problem; and stochastic varia-\ntional inference for latent Dirichlet allocation [11, 22]\u2014a non-convex problem. In the supplement,\n\n3\n\n\fFigure 1: The illustration of how data statistics help reduce variance for the noisy gradient in stochastic\noptimization. The solid (red) line is the \ufb01nal gradient direction the algorithm will follow. (a) The exact gradient\ndirection computed using the entire dataset. (b) The noisy gradient direction computed from the sampled subset,\nwhich can have high variance. (c) The improved noisy gradient direction with data statistics, such as low-order\nmoments of the entire data. These low-order moments roughly characterize the data distribution, and are used to\nform the control variate to aid the noisy gradient.\n\nwe show that the same principle can be applied to more problems, such as hierarchical Dirichlet\nprocess [12, 13] and nonnegative matrix factorization [14].\nAs we discussed in \u00a72, the higher the correlation between gd(w) and hd(w), the lower the variance\nis. Therefore, to apply the variance reduction technique in practice, the key is to construct a random\nvector hd(w) such that it has high correlations with gd(w), but its expectation h(w) = Ed[hd(w)] is\ninexpensive to compute. The principle behind our choice of h(w) is that we construct h(w) based on\nsome data statistics, such as low-order moments. These low-order moments roughly characterize\nthe data distribution which does not depend on parameter w. Thus they can be pre-computed when\nprocessing the data or estimated online while running the stochastic gradient algorithm. Figure 1\nillustrates this idea. 
We will use this principle throughout the paper to construct control variates for\nvariance reduction under different scenarios.\n\n3.1 SG with variance reduction for logistic regression\n\nLogistic regression is widely used for classification [15]. Given a set of training examples (xd, yd),\nd = 1, ..., D, where yd = 1 or yd = \u22121 indicates class labels, the probability of yd is\n\n$$p(y_d \\mid x_d, w) = \\sigma(y_d w^\\top x_d),$$\n\nwhere \u03c3(z) = 1/(1 + exp(\u2212z)) is the logistic function. The averaged log likelihood of the training\ndata is\n\n$$\\ell(w) = \\frac{1}{D} \\sum_{d=1}^{D} \\left\\{ y_d w^\\top x_d - \\log\\left(1 + \\exp(y_d w^\\top x_d)\\right) \\right\\}. \\tag{10}$$\n\nAn SG algorithm employs the following noisy gradient:\n\n$$g_d(w) = y_d x_d \\, \\sigma(-y_d w^\\top x_d). \\tag{11}$$\n\nNow we show how to construct our control variate for logistic regression. We begin with the first-order\nTaylor expansion around $\\hat{z}$ for the sigmoid function,\n\n$$\\sigma(z) \\approx \\sigma(\\hat{z}) \\left(1 + \\sigma(-\\hat{z})(z - \\hat{z})\\right).$$\n\nWe then apply this approximation to $\\sigma(-y_d w^\\top x_d)$ in Eq. 11 to obtain our control variate.2 For\nlogistic regression, we consider the two classes separately, since data samples within each class are more\nlikely to be similar. We consider positive data samples first. Let $z = -w^\\top x_d$, and we define our\ncontrol variate hd(w) for yd = 1 as\n\n$$h_d^{(1)}(w) \\triangleq x_d \\, \\sigma(\\hat{z}) \\left(1 + \\sigma(-\\hat{z})(z - \\hat{z})\\right) = x_d \\, \\sigma(\\hat{z}) \\left(1 + \\sigma(-\\hat{z})(-w^\\top x_d - \\hat{z})\\right).$$\n\nIts expectation given yd = 1 can be computed in closed form as\n\n$$\\mathbb{E}_d[h_d^{(1)}(w) \\mid y_d = 1] = \\sigma(\\hat{z}) \\left( \\bar{x}^{(1)} \\left(1 - \\sigma(-\\hat{z}) \\hat{z}\\right) - \\sigma(-\\hat{z}) \\left( \\mathrm{Var}^{(1)}[x_d] + \\bar{x}^{(1)} (\\bar{x}^{(1)})^\\top \\right) w \\right),$$\n\nwhere $\\bar{x}^{(1)}$ and $\\mathrm{Var}^{(1)}[x_d]$ are the mean and variance of the input features for the positive examples.\nIn our experiments, we choose $\\hat{z} = -w^\\top \\bar{x}^{(1)}$, which is the center of the positive examples. We can\nsimilarly derive the control variate $h_d^{(-1)}(w)$ for negative examples and we omit the details. Given\na random sample regardless of its label, the expectation of the control variate is computed as\n\n$$\\mathbb{E}_d[h_d(w)] = (D^{(1)}/D) \\, \\mathbb{E}_d[h_d^{(1)}(w) \\mid y_d = 1] + (D^{(-1)}/D) \\, \\mathbb{E}_d[h_d^{(-1)}(w) \\mid y_d = -1],$$\n\nwhere $D^{(1)}$ and $D^{(-1)}$ are the numbers of positive and negative examples and $D^{(1)}/D$ is the probability\nof choosing a positive example from the training set. With the Taylor approximation, we would expect\nour control variate to be highly correlated with the noisy gradient. See our experiments in \u00a74 for details.\n\n2 Taylor expansion is not the only way to obtain control variates. Lower bounds or upper bounds of the\nobjective function [16] can also provide alternatives. But we will not explore those solutions in this paper.\n\n3.2 SVI with variance reduction for latent Dirichlet allocation\n\nThe stochastic variational inference (SVI) algorithm used for latent Dirichlet allocation (LDA) [22] is\nalso a form of stochastic gradient optimization; therefore it can also benefit from variance reduction.\nThe basic idea is to stochastically optimize the variational objective for LDA, using stochastic mean\nfield updates augmented by control variates derived from low-order moments of the data.\nLatent Dirichlet allocation (LDA). LDA is the simplest topic model for discrete data such as text\ncollections [17, 18]. Assume there are K topics. The generative process of LDA is as follows.\n\n1. 
Draw topics \u03b2k \u223c DirV(\u03b7) for k \u2208 {1, . . . , K}.\n2. For each document d \u2208 {1, . . . , D}:\n\n(a) Draw topic proportions \u03b8d \u223c DirK(\u03b1).\n(b) For each word position n \u2208 {1, . . . , N}:\ni. Draw topic assignment zdn \u223c Mult(\u03b8d).\nii. Draw word wdn \u223c Mult(\u03b2zdn).\n\nGiven the observed words w \u225c w1:D, we want to estimate the posterior distribution of the latent\nvariables, including topics \u03b2 \u225c \u03b21:K, topic proportions \u03b8 \u225c \u03b81:D and topic assignments z \u225c z1:D,\n\n$$p(\\beta, \\theta, z \\mid w) \\propto \\prod_{k=1}^{K} p(\\beta_k \\mid \\eta) \\prod_{d=1}^{D} p(\\theta_d \\mid \\alpha) \\prod_{n=1}^{N} p(z_{dn} \\mid \\theta_d) \\, p(w_{dn} \\mid \\beta_{z_{dn}}). \\tag{12}$$\n\nHowever, this posterior is intractable. We must resort to approximation methods. Mean-field\nvariational inference is a popular approach for the approximation [19].\nMean-field variational inference for LDA. Mean-field variational inference posits a family of\ndistributions (called variational distributions) indexed by free variational parameters and then optimizes\nthese parameters to minimize the KL divergence between the variational distribution and the true\nposterior. For LDA, the variational distribution is\n\n$$q(\\beta, \\theta, z) = \\prod_{k=1}^{K} q(\\beta_k \\mid \\lambda_k) \\prod_{d=1}^{D} q(\\theta_d \\mid \\gamma_d) \\prod_{n=1}^{N} q(z_{dn} \\mid \\phi_{dn}), \\tag{13}$$\n\nwhere the variational parameters are \u03bbk (Dirichlet), \u03b3d (Dirichlet), and \u03c6dn (multinomial). We seek\nthe variational distribution (Eq. 13) that minimizes the KL divergence to the true posterior (Eq. 12).\nThis is equivalent to maximizing the lower bound of the log marginal likelihood of the data,\n\n$$\\log p(w) \\ge \\mathbb{E}_q[\\log p(\\beta, \\theta, z, w)] - \\mathbb{E}_q[\\log q(\\beta, \\theta, z)] \\triangleq L(q), \\tag{14}$$\n\nwhere Eq [\u00b7] denotes the expectation with respect to the variational distribution q(\u03b2, \u03b8, z). Setting\nthe gradient of the lower bound L(q) with respect to the variational parameters to zero gives the\nfollowing coordinate ascent algorithm [17]. For each document d \u2208 {1, . . . , D}, we run local\nvariational inference using the following updates until convergence,\n\n$$\\phi_{dv}^{k} \\propto \\exp\\left\\{\\Psi(\\gamma_{dk}) + \\Psi(\\lambda_{kv}) - \\Psi\\left(\\sum_{v} \\lambda_{kv}\\right)\\right\\} \\quad \\text{for } v \\in \\{1, \\dots, V\\}, \\tag{15}$$\n\n$$\\gamma_d = \\alpha + \\sum_{v=1}^{V} n_{dv} \\phi_{dv}, \\tag{16}$$\n\nwhere \u03a8(\u00b7) is the digamma function and ndv is the number of occurrences of term v in document d. Note that here\nwe use \u03c6dv instead of \u03c6dn in Eq. 13 since occurrences of the same term v have the same \u03c6dn. After finding the\nvariational parameters for each document, we update the variational Dirichlet for each topic,\n\n$$\\lambda_{kv} = \\eta + \\sum_{d=1}^{D} n_{dv} \\phi_{dv}^{k}. \\tag{17}$$\n\nThe whole coordinate ascent variational algorithm iterates over Eq. 15, 16 and 17 until convergence.\nHowever, this also reveals the drawback of this algorithm: updating the topic parameter \u03bb in Eq. 17\ndepends on the variational parameters \u03c6 from every document. This is especially inefficient for\nlarge-scale datasets. Stochastic variational inference solves this problem using stochastic optimization.\nStochastic variational inference (SVI). Instead of using the coordinate ascent algorithm, SVI\noptimizes the variational lower bound L(q) using stochastic optimization [22]. It draws random\nsamples from the corpus and uses these samples to form the noisy estimate of the natural gradient [20].\nThen the algorithm follows that noisy natural gradient with a decreasing step size until convergence.\nThe noisy gradient only depends on the sampled data and it is inexpensive to compute. This leads to\na much faster algorithm than the traditional coordinate ascent variational inference algorithm.\nLet d be a random document index, d \u223c Unif(1, ..., D) and Ld(q) be the sampled lower bound. 
The\nsampled lower bound Ld(q) has the same form as L(q) in Eq. 14 except that the sampled lower\nbound uses a virtual corpus that only contains document d replicated D times. According to [22], for\nLDA the noisy natural gradient with respect to the topic variational parameters is\n\n$$g_d(\\lambda_{kv}) \\triangleq -\\lambda_{kv} + \\eta + D n_{dv} \\phi_{dv}^{k}, \\tag{18}$$\n\nwhere the $\\phi_{dv}^{k}$ are obtained from the local variational inference by iterating over Eq. 15 and 16 until\nconvergence.3 With a step size \u03c1t, SVI uses the following update \u03bbkv \u2190 \u03bbkv + \u03c1t gd(\u03bbkv). However,\nthe sampled natural gradient gd(\u03bbkv) in Eq. 18 might have a large variance when the number of\ndocuments is large. This could lead to slow convergence or a poor local mode.\nControl variate. Now we show how to construct control variates for the noisy gradient to reduce\nits variance. According to Eq. 18, the noisy gradient gd(\u03bbkv) is a function of topic assignment\nparameters \u03c6dv, which in turn depend on wd, the words in document d, through the iterative updates\nin Eq. 15 and 16. This is different from the case in Eq. 11. In logistic regression, the gradient is an\nanalytical function of the training data (Eq. 11), while in LDA, the natural gradient directly depends\non the optimal local variational parameters (Eq. 18), which then depend on the training data through\nthe local variational inference (Eq. 15). However, by carefully exploring the structure of the iterations,\nwe can create effective control variates.\nThe key idea is to run Eq. 15 and 16 only up to a fixed number of iterations, together with some\nadditional approximations to maintain analytical tractability. Starting the iteration with \u03b3dk having\nthe same value, we have $\\phi_v^{k(0)} \\propto \\exp\\{\\Psi(\\lambda_{kv}) - \\Psi(\\sum_v \\lambda_{kv})\\}$.4 Note that $\\phi_v^{k(0)}$ does not depend\non document d. Intuitively, $\\phi_v^{k(0)}$ is the probability of term v belonging to topic k out of K topics.\nNext we use \u03b3dk \u2212 \u03b1 to approximate exp(\u03a8(\u03b3dk)) in Eq. 15.5 Plugging this approximation into\nEq. 15 and 16 leads to the update,\n\n$$\\phi_{dv}^{k(1)} = \\frac{\\left(\\sum_{u=1}^{V} f_{du} \\phi_u^{k(0)}\\right) \\phi_v^{k(0)}}{\\sum_{k=1}^{K} \\left(\\sum_{u=1}^{V} f_{du} \\phi_u^{k(0)}\\right) \\phi_v^{k(0)}} \\approx \\frac{\\left(\\sum_{u=1}^{V} f_{du} \\phi_u^{k(0)}\\right) \\phi_v^{k(0)}}{\\sum_{k=1}^{K} \\left(\\sum_{u=1}^{V} \\bar{f}_u \\phi_u^{k(0)}\\right) \\phi_v^{k(0)}}, \\tag{19}$$\n\nwhere fdv = ndv/nd is the empirical frequency of term v in document d. In addition, we replace fdu\nwith $\\bar{f}_u \\triangleq (1/D) \\sum_d f_{du}$, the averaged frequency of term u in the corpus, making the denominator\nof Eq. 19, $m_v^{(1)} \\triangleq \\sum_{k=1}^{K} (\\sum_{u=1}^{V} \\bar{f}_u \\phi_u^{k(0)}) \\phi_v^{k(0)}$, independent of documents. This approximation\ndoes not change the relative importance for the topics from term v. We define our control variate as\n\n$$h_d(\\lambda_{kv}) \\triangleq D n_{dv} \\phi_{dv}^{k(1)},$$\n\nwhose expectation is $\\mathbb{E}_d[h_d(\\lambda_{kv})] = (D / m_v^{(1)}) \\{ (\\sum_{u=1}^{V} \\overline{n_v f_u} \\, \\phi_u^{k(0)}) \\phi_v^{k(0)} \\}$, where $\\overline{n_v f_u} \\triangleq (1/D) \\sum_d n_{du} f_{dv} = (1/D) \\sum_d n_{du} n_{dv} / n_d$. This depends on up to the second-order moments\nof the data, which are usually sparse. We can continue to compute $\\phi_{dv}^{k(2)}$ (or higher) given $\\phi_{dv}^{k(1)}$, which\nturns out to use the third-order (or higher) moments. We omit the details here. Similar ideas can be\nused in deriving control variates for hierarchical Dirichlet process [12, 13] and nonnegative matrix\nfactorization [14]. We outline these in the supplementary material.\n\n3 Running to convergence is essential to ensure the natural gradient is valid in Eq. 18 [22].\n4 In our experiments, we set $\\phi_v^{k(0)} = 0$ if $\\phi_v^{k(0)}$ is less than 0.02. This leaves \u03c6(0) very sparse, since a term\nusually belongs to a small set of topics. For example, in Nature data, only 6% entries are non-zero.\n5 The scale of the approximation does not matter: C(\u03b3dk \u2212 \u03b1), where C is a constant, has the same effect as\n\u03b3dk \u2212 \u03b1. Other approximations to exp(\u03a8(\u03b3dk)) can also be used as long as they are linear in terms of \u03b3dk.\n\nFigure 2: Comparison of our approach with standard SG algorithms using different constant learning rates. The\nfigure was created using the geom_smooth function in ggplot2 with local polynomial regression fitting (loess). A\nwider stripe indicates the result fluctuates more. This figure is best viewed in color. (Decayed learning rates we\ntested did not perform as well as constant ones and are not shown.) Legend \u201cVariance Reduction-1\u201d indicates the\nalgorithm with variance reduction using learning rate \u03c1t = 1.0. (a) Optimum minus the objective on the training\ndata. The lower the better. (b) Test accuracy on testing data. The higher the better. From these results, we see\nthat variance reduction with \u03c1t = 1.0 performs the best, while the standard SG algorithm with \u03c1t = 1.0 learns\nfaster but bounces more (a wider stripe) and performs worse at the end. With \u03c1t = 0.05, variance reduction\nperforms about the same as the standard algorithm and both converge slowly. These indicate that with the\nvariance reduction, a larger learning rate is possible to allow faster convergence without sacrificing performance.\n\nFigure 3: Pearson\u2019s correlation coefficient for \u03c1t = 1.0 as we run our algorithm. It is usually high, indicating\nthe control variate is highly correlated with the noisy gradient, leading to a large variance reduction. 
Other\nsettings are similar.\n\n4 Experiments\n\nIn this section, we conducted experiments on the MAP estimation for logistic regression and stochastic\nvariational inference for LDA.6 In our experiments, we chose to estimate the optimal a\u2217 as a scalar\nshown in Eq. 9 for simplicity.\n\n4.1 Logistic regression\n\nWe evaluate our algorithm on stochastic gradient (SG) for logistic regression. For the standard SG\nalgorithm, we also evaluated the version with averaged output (ASG), although we did not \ufb01nd it\noutperforms the standard SG algorithm much. Our regularization added to Eq. 10 for the MAP\nestimation is \u2212 1\n2D w(cid:62)w. Our dataset contains covtype (D = 581, 012, p = 54), obtained from the\nLIBSVM data website.7 We separate 5K examples as the test set. We test two types of learning rates,\nconstant and decayed. For constant rates, we explore \u03c1t \u2208 {0.01, 0.05, 0.1, 0.2, 0.5, 1}. For decayed\nrates, we explore \u03c1t \u2208 {t\u22121/2, t\u22120.75, t\u22121}. We use a mini-batch size of 100.\nResults. We found that the decayed learning rates we tested did not work well compared with the\nconstant ones on this data. So we focus on the results using the constant rates. We plot three cases\nin Figure 2 for \u03c1t \u2208 {0.05, 0.2, 1} to show the trend by comparing the objective function on the\ntraining data and the test accuracy on the testing data. (The best result for variance reduction is\nobtained when \u03c1t = 1.0 and for standard SGD is when \u03c1t = 0.2.) 
These contain the best results of\neach. With variance reduction, a large learning rate is possible to allow faster convergence without\nsacrificing performance. Figure 3 shows the mean of Pearson\u2019s correlation coefficient between the\ncontrol variate and noisy gradient,8 which is quite high: the control variate is highly correlated with\nthe noisy gradient, leading to a large variance reduction.\n\n6 Code will be available on authors\u2019 websites.\n7 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets\n\n[Figure 2 plot: (a) Optimum minus Objective on training data, (b) Test Accuracy on testing data; x-axis: data points (x100K). Figure 3 plot: Pearson\u2019s correlation coefficient against data points (x100K).]\n\nFigure 4: Held-out log likelihood on three large corpora. (Higher numbers are better.) Legend \u201cStandard-100\u201d\nindicates the stochastic algorithm in [10] with a batch size of 100. Our method consistently performs better\nthan the standard stochastic variational inference. A large batch size tends to perform better.\n\n4.2 Stochastic variational inference for LDA\n\nWe evaluate our algorithm on stochastic variational inference for LDA. [10] has shown that the\nadaptive learning rate algorithm for SVI performed better than the manually tuned ones. So we use\ntheir algorithm to estimate the adaptive learning rate. For LDA, we set the number of topics K = 100,\nhyperparameters \u03b1 = 0.1 and \u03b7 = 0.01. We tested mini-batch sizes of 100 and 500.\nData sets. We analyzed three large corpora: Nature, New York Times, and Wikipedia. 
The Nature\ncorpus contains 340K documents and a vocabulary of 4,500 terms; the New York Times corpus\ncontains 1.8M documents and a vocabulary of 8,000 terms; the Wikipedia corpus contains\n3.6M documents and a vocabulary of 7,700 terms.\nEvaluation metric and results. To evaluate our models, we held out 10K documents from each\ncorpus and calculated its predictive likelihood. We follow the metric used in recent topic modeling\nliterature [21, 22]. For a document wd in Dtest, we split it into halves, wd = (wd1, wd2), and\ncomputed the predictive log likelihood of the words in wd2 conditioned on wd1 and Dtrain. The\nper-word predictive log likelihood is defined as\n\n$$\\mathrm{likelihood}_{pw} \\triangleq \\frac{\\sum_{d \\in D_{test}} \\log p(w_{d2} \\mid w_{d1}, D_{train})}{\\sum_{d \\in D_{test}} |w_{d2}|}.$$\n\nHere | \u00b7 | is the number of words. A better predictive distribution given the first half gives higher\nlikelihood to the second half. We used the same strategy as in [22] to approximate its computation.\nFigure 4 shows the results. On all three corpora, our algorithm gives better predictive distributions.\n\n5 Discussions and future work\n\nIn this paper, we show that variance reduction with control variates can be used to improve stochastic\ngradient optimization. We further demonstrate its usage on convex and non-convex problems,\nshowing improved performance on both. In future work, we would like to explore how to use\nsecond-order methods (such as Newton\u2019s method) or better line search algorithms to further improve\nthe performance of stochastic optimization. This is because, for example, with variance reduction,\nsecond-order methods are able to capture the local curvature much better.\n\nAcknowledgement. We thank anonymous reviewers for their helpful comments. We also thank\nDani Yogatama for helping with some experiments on LDA. Chong Wang and Eric P. 
Xing are\nsupported by NSF DBI-0546594 and NIH 1R01GM093156.\n\n8 Since the control variate and noisy gradient are vectors, we use the mean of the Pearson\u2019s coefficients\ncomputed for each dimension between these two vectors.\n\n[Figure 4 panels: Nature, New York Times, Wikipedia. x-axis: time (in hours). y-axis: Heldout log likelihood. Methods: Standard-100, Standard-500, Var-Reduction-100, Var-Reduction-500.]\n\nReferences\n[1] Spall, J. Introduction to stochastic search and optimization: Estimation, simulation, and control. John Wiley and Sons, 2003.\n[2] Bottou, L. Stochastic learning. In O. Bousquet, U. von Luxburg, eds., Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, pages 146\u2013168. Springer Verlag, Berlin, 2004.\n[3] Ross, S. M. Simulation. Elsevier, fourth edn., 2006.\n[4] Nemirovski, A., A. Juditsky, G. Lan, et al. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574\u20131609, 2009.\n[5] Paisley, J., D. Blei, M. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning. 2012.\n[6] Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133:365\u2013397, 2012.\n[7] Chen, X., Q. Lin, J. Pena. Optimal regularized dual averaging methods for stochastic optimization. In Advances in Neural Information Processing Systems (NIPS). 2012.\n[8] Boyd, S., L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[9] Schaul, T., S. Zhang, Y. LeCun. No More Pesky Learning Rates. ArXiv e-prints, 2012.\n[10] Ranganath, R., C. Wang, D. M. Blei, et al. An adaptive learning rate for stochastic variational inference. In International Conference on Machine Learning. 2013.\n[11] Hoffman, M., D. Blei, F. Bach. Online inference for latent Dirichlet allocation. 
In Neural Information Processing Systems. 2010.\n[12] Teh, Y., M. Jordan, M. Beal, et al. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566\u20131581, 2007.\n[13] Wang, C., J. Paisley, D. Blei. Online variational inference for the hierarchical Dirichlet process. In International Conference on Artificial Intelligence and Statistics. 2011.\n[14] Seung, D., L. Lee. Algorithms for non-negative matrix factorization. In Neural Information Processing Systems. 2001.\n[15] Bishop, C. Pattern Recognition and Machine Learning. Springer New York, 2006.\n[16] Jaakkola, T., M. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In International Workshop on Artificial Intelligence and Statistics. 1996.\n[17] Blei, D., A. Ng, M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n[18] Blei, D., J. Lafferty. Topic models. In A. Srivastava, M. Sahami, eds., Text Mining: Theory and Applications. Taylor and Francis, 2009.\n[19] Jordan, M., Z. Ghahramani, T. Jaakkola, et al. Introduction to variational methods for graphical models. Machine Learning, 37:183\u2013233, 1999.\n[20] Amari, S. Natural gradient works efficiently in learning. Neural Computation, 10(2):251\u2013276, 1998.\n[21] Asuncion, A., M. Welling, P. Smyth, et al. On smoothing and inference for topic models. In Uncertainty in Artificial Intelligence. 2009.\n[22] Hoffman, M., D. Blei, C. Wang, et al. Stochastic Variational Inference. Journal of Machine Learning Research, 2013.\n", "award": [], "sourceid": 165, "authors": [{"given_name": "Chong", "family_name": "Wang", "institution": "CMU"}, {"given_name": "Xi", "family_name": "Chen", "institution": "CMU"}, {"given_name": "Alexander", "family_name": "Smola", "institution": "Yahoo! Research"}, {"given_name": "Eric", "family_name": "Xing", "institution": "CMU"}]}