{"title": "Barzilai-Borwein Step Size for Stochastic Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 685, "page_last": 693, "abstract": "One of the major issues in stochastic gradient descent (SGD) methods is how to choose an appropriate step size while running the algorithm. Since the traditional line search technique does not apply for stochastic optimization methods, the common practice in SGD is either to use a diminishing step size, or to tune a step size by hand, which can be time consuming in practice. In this paper, we propose to use the Barzilai-Borwein (BB) method to automatically compute step sizes for SGD and its variant: stochastic variance reduced gradient (SVRG) method, which leads to two algorithms: SGD-BB and SVRG-BB. We prove that SVRG-BB converges linearly for strongly convex objective functions. As a by-product, we prove the linear convergence result of SVRG with Option I proposed in [10], whose convergence result has been missing in the literature. Numerical experiments on standard data sets show that the performance of SGD-BB and SVRG-BB is comparable to and sometimes even better than SGD and SVRG with best-tuned step sizes, and is superior to some advanced SGD variants.", "full_text": "Barzilai-Borwein Step Size for Stochastic Gradient\n\nDescent\n\nConghui Tan\n\nShiqian Ma\n\nThe Chinese University of Hong Kong\n\nThe Chinese University of Hong Kong\n\nchtan@se.cuhk.edu.hk\n\nsqma@se.cuhk.edu.hk\n\nYu-Hong Dai\n\nChinese Academy of Sciences, Beijing, China\n\ndyh@lsec.cc.ac.cn\n\nYuqiu Qian\n\nThe University of Hong Kong\n\nqyq79@connect.hku.hk\n\nAbstract\n\nOne of the major issues in stochastic gradient descent (SGD) methods is how to\nchoose an appropriate step size while running the algorithm. 
Since the traditional line search technique does not apply to stochastic optimization methods, the common practice in SGD is either to use a diminishing step size, or to tune a step size by hand, which can be time consuming in practice. In this paper, we propose to use the Barzilai-Borwein (BB) method to automatically compute step sizes for SGD and its variant, the stochastic variance reduced gradient (SVRG) method, which leads to two algorithms: SGD-BB and SVRG-BB. We prove that SVRG-BB converges linearly for strongly convex objective functions. As a by-product, we prove the linear convergence result of SVRG with Option I proposed in [10], whose convergence result has been missing in the literature. Numerical experiments on standard data sets show that the performance of SGD-BB and SVRG-BB is comparable to and sometimes even better than SGD and SVRG with best-tuned step sizes, and is superior to some advanced SGD variants.\n\n1 Introduction\n\nThe following optimization problem, which minimizes the sum of cost functions over samples from a finite training set, appears frequently in machine learning:\n\nmin F (x) \u2261 (1/n) \u2211_{i=1}^{n} fi(x), (1)\n\nwhere n is the sample size, and each fi : Rd \u2192 R is the cost function corresponding to the i-th sample. Throughout this paper, we assume that each fi is convex and differentiable, and that the function F is strongly convex. Problem (1) is challenging when n is extremely large, so that computing F (x) and \u2207F (x) for a given x is prohibitively expensive. The stochastic gradient descent (SGD) method and its variants have been the main approaches for solving (1). In the t-th iteration of SGD, a random training sample it is chosen from {1, 2, . . . , n} and the iterate xt is updated by\n\nxt+1 = xt \u2212 \u03b7t\u2207fit(xt), (2)\n\nwhere \u2207fit(xt) denotes the gradient of the it-th component function at xt, and \u03b7t > 0 is the step size (a.k.a. learning rate). 
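As a toy illustration of the SGD update (2) (not from the paper; the one-dimensional objective and all constants below are made up), a few lines of NumPy suffice:

```python
import numpy as np

def sgd(grad_i, n, x0, steps, eta, rng=None):
    """Plain SGD iteration (2), here with a constant step size eta."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(steps):
        i = int(rng.integers(n))      # pick a random sample i_t
        x -= eta * grad_i(x, i)       # x_{t+1} = x_t - eta * grad f_{i_t}(x_t)
    return x

# Hypothetical example: f_i(x) = 0.5 (x - c_i)^2, so F is minimized at mean(c).
c = np.array([1.0, 2.0, 3.0, 4.0])
x_end = sgd(lambda x, i: x - c[i], n=4, x0=np.array([0.0]), steps=2000, eta=0.01)
```

With a constant eta the iterates only settle into a noise ball around the minimizer rather than converging, which is precisely the step-size dilemma discussed here.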
In (2), it is usually assumed that \u2207fit is an unbiased estimator of \u2207F, i.e.,\n\nE[\u2207fit(xt) | xt] = \u2207F (xt). (3)\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nHowever, it is known that the total number of gradient evaluations of SGD depends on the variance of the stochastic gradients, and that SGD has only a sublinear convergence rate for strongly convex and smooth instances of problem (1). As a result, many works along this line have focused on designing variants of SGD that can reduce the variance and improve the complexity. Some popular methods include the stochastic average gradient (SAG) method [16], the SAGA method [7], the stochastic dual coordinate ascent (SDCA) method [17], and the stochastic variance reduced gradient (SVRG) method [10]. These methods are proven to converge linearly on strongly convex problems.\n\nAs pointed out by Le Roux et al. [16], one important issue regarding stochastic algorithms that has not been fully addressed in the literature is how to choose an appropriate step size \u03b7t while running the algorithm. In the classical gradient descent method, the step size is usually obtained by employing line search techniques. However, line search is computationally prohibitive in stochastic gradient methods because one only has sub-sampled information about function values and gradients. As a result, for SGD and its variants used in practice, people usually use a diminishing step size \u03b7t, or a best-tuned fixed step size. Neither of these two approaches is efficient.\n\nSome recent works that discuss the choice of step size in SGD are summarized as follows. AdaGrad [8] scales the gradient by the square root of the accumulated magnitudes of the gradients in the past iterations, but this still requires a fixed step size \u03b7 to be decided. 
[16] suggests a line search technique on the component function fik(x) selected in each iteration, to estimate the step size for SAG. [12] suggests performing line search on an estimated function, which is evaluated by a Gaussian process with samples fit(xt). [13] suggests generating the step sizes by a given function with an unknown parameter, and using online SGD to update this unknown parameter.\n\nOur contributions in this paper are threefold.\n\n(i) We propose to use the Barzilai-Borwein (BB) method to compute the step size for SGD and SVRG. The two new methods are named SGD-BB and SVRG-BB, respectively. The per-iteration computational cost of SGD-BB and SVRG-BB is almost the same as that of SGD and SVRG, respectively.\n\n(ii) We prove the linear convergence of SVRG-BB for strongly convex functions. As a by-product, we show the linear convergence of SVRG with Option I (SVRG-I) proposed in [10]. Note that in [10] only the convergence of SVRG with Option II (SVRG-II) was given, and the proof for SVRG-I has been missing in the literature. However, SVRG-I is numerically a better choice than SVRG-II, as demonstrated in [10].\n\n(iii) We conduct numerical experiments for SGD-BB and SVRG-BB on logistic regression and SVM problems. The numerical results show that SGD-BB and SVRG-BB are comparable to and sometimes even better than SGD and SVRG with best-tuned step sizes. We also compare SGD-BB with some advanced SGD variants, and demonstrate that our method is superior.\n\nThe rest of this paper is organized as follows. In Section 2 we briefly introduce the BB method in the deterministic setting. In Section 3 we propose our SVRG-BB method, and prove its linear convergence for strongly convex functions. As a by-product, we also prove the linear convergence of SVRG-I. In Section 4 we propose our SGD-BB method. A smoothing technique is also implemented to improve the performance of SGD-BB. 
Finally, we conduct some numerical experiments for SVRG-BB and SGD-BB in Section 5.\n\n2 The Barzilai-Borwein Step Size\n\nThe BB method, proposed by Barzilai and Borwein in [2], has proven to be very successful in solving nonlinear optimization problems. The key idea behind the BB method is motivated by quasi-Newton methods. Suppose we want to solve the unconstrained minimization problem\n\nmin_x f (x), (4)\n\nwhere f is differentiable. A typical iteration of quasi-Newton methods for solving (4) is\n\nxt+1 = xt \u2212 Bt\u207b\u00b9\u2207f (xt), (5)\n\nwhere Bt is an approximation of the Hessian matrix of f at the current iterate xt. The most important feature of Bt is that it must satisfy the so-called secant equation Btst = yt, where st = xt \u2212 xt\u22121 and yt = \u2207f (xt) \u2212 \u2207f (xt\u22121) for t \u2265 1. Note that in (5) one needs to solve a linear system, which may be time consuming when Bt is large and dense.\n\nOne way to alleviate this burden is to use the BB method, which replaces Bt by a scalar matrix (1/\u03b7t)I. However, one cannot in general choose a scalar \u03b7t such that the secant equation holds with Bt = (1/\u03b7t)I. Instead, one can find the \u03b7t for which the residual of the secant equation, i.e., \u2016(1/\u03b7t)st \u2212 yt\u2016\u2082\u00b2, is minimized, which leads to the following choice of \u03b7t:\n\n\u03b7t = \u2016st\u2016\u2082\u00b2 / (st\u22a4yt). (6)\n\nTherefore, a typical iteration of the BB method for solving (4) is\n\nxt+1 = xt \u2212 \u03b7t\u2207f (xt), (7)\n\nwhere \u03b7t is computed by (6). For convergence analysis, generalizations and variants of the BB method, we refer the interested readers to [14, 15, 6, 9, 4, 5, 3] and references therein. 
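The deterministic BB iteration (6)-(7) is easy to state in code. The following is a minimal sketch; the strongly convex quadratic used as a test function is a made-up example, not from the paper:

```python
import numpy as np

def bb_descent(grad, x0, eta0=1e-3, max_iter=200, tol=1e-10):
    """BB method: x_{t+1} = x_t - eta_t * grad f(x_t), with the step size
    eta_t = ||s_t||^2 / (s_t^T y_t) from (6)."""
    x_old, g_old = x0, grad(x0)
    x = x_old - eta0 * g_old            # the first step needs a given step size
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s, y = x - x_old, g - g_old     # iterate / gradient differences
        x_old, g_old = x, g
        x = x - (s.dot(s) / s.dot(y)) * g   # BB step size (6), update (7)
    return x

# Hypothetical strongly convex quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_min = bb_descent(lambda x: A @ x - b, np.zeros(2))
```

Note that no step-size parameter has to be tuned beyond the single starting value eta0, which is the property the rest of the paper carries over to the stochastic setting.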
Recently, the BB method has been successfully applied to problems arising from emerging applications, such as compressed sensing [21], sparse reconstruction [20] and image processing [19].\n\n3 Barzilai-Borwein Step Size for SVRG\n\nWe see from (7) and (6) that the BB method does not need any parameter and the step size is computed while running the algorithm. This has been our main motivation for working out a black-box stochastic gradient descent method that computes the step size automatically, without requiring any parameters. In this section, we propose to incorporate the BB step size into SVRG, which leads to the SVRG-BB method.\n\n3.1 SVRG-BB Method\n\nStochastic variance reduced gradient (SVRG) is a variant of SGD proposed in [10], which utilizes a variance reduction technique to alleviate the impact of the random sampling of the gradients. SVRG computes the full gradient \u2207F (x) of (1) once every m iterations, where m is a prespecified integer, and the full gradient is then used for generating stochastic gradients with lower variance in the next m iterations (the next epoch). In SVRG, the step size \u03b7 needs to be provided by the user. According to [10], the choice of \u03b7 depends on the Lipschitz constant of F, which is usually difficult to estimate in practice.\n\nOur SVRG-BB algorithm is described in Algorithm 1. 
The only difference between SVRG and SVRG-BB is that in the latter we use the BB method to compute the step size \u03b7k, instead of using a fixed \u03b7 chosen in advance as in SVRG.\n\nAlgorithm 1 SVRG with BB step size (SVRG-BB)\n\nParameters: update frequency m, initial point \u02dcx0, initial step size \u03b70 (only used in the first epoch)\nfor k = 0, 1, \u00b7\u00b7\u00b7 do\n    gk = (1/n) \u2211_{i=1}^{n} \u2207fi(\u02dcxk)\n    if k > 0 then\n        \u03b7k = (1/m) \u00b7 \u2016\u02dcxk \u2212 \u02dcxk\u22121\u2016\u2082\u00b2 / ((\u02dcxk \u2212 \u02dcxk\u22121)\u22a4(gk \u2212 gk\u22121))    (\u25b3)\n    end if\n    x0 = \u02dcxk\n    for t = 0, \u00b7\u00b7\u00b7 , m \u2212 1 do\n        Randomly pick it \u2208 {1, . . . , n}\n        xt+1 = xt \u2212 \u03b7k(\u2207fit(xt) \u2212 \u2207fit(\u02dcxk) + gk)\n    end for\n    Option I: \u02dcxk+1 = xm\n    Option II: \u02dcxk+1 = xt for randomly chosen t \u2208 {1, . . . , m}\nend for\n\nRemark 1. A few remarks on the SVRG-BB algorithm are in order.\n(i) If we always set \u03b7k = \u03b7 in SVRG-BB instead of using (\u25b3), then it reduces to the original SVRG.\n(ii) One may notice that \u03b7k is equal to the step size computed by the BB formula (6) divided by m. This is because in the inner loop for updating xt, m unbiased gradient estimators are added to x0 to get xm.\n(iii) For the first epoch of SVRG-BB, a step size \u03b70 needs to be specified. However, we observed from our numerical experiments that the performance of SVRG-BB is not sensitive to the choice of \u03b70.\n(iv) The BB step size can also be naturally incorporated into other SVRG variants, such as SVRG with batching [1].\n\n3.2 Linear Convergence Analysis\n\nIn this section, we analyze the linear convergence of SVRG-BB (Algorithm 1) for solving (1) with strongly convex objective F (x); as a by-product, our analysis also proves the linear convergence of SVRG-I. The proofs in this section are provided in the supplementary materials. 
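For concreteness, here is one possible NumPy sketch of Algorithm 1 with Option I; the ridge-regression instance, the helper names, and all constants are hypothetical, not from the paper:

```python
import numpy as np

def svrg_bb(grad_i, n, x0, m, eta0, n_epochs=30, rng=None):
    """Sketch of SVRG-BB (Algorithm 1, Option I): after the first epoch,
    eta_k = (1/m) * ||s||^2 / (s . y), with s = x~_k - x~_{k-1} and
    y = g_k - g_{k-1} built from full-gradient differences."""
    rng = rng or np.random.default_rng(0)
    x_tilde, eta = x0.copy(), eta0
    x_old = g_old = None
    for k in range(n_epochs):
        g = sum(grad_i(x_tilde, i) for i in range(n)) / n   # full gradient g_k
        if x_old is not None:
            s, y = x_tilde - x_old, g - g_old
            eta = s.dot(s) / (m * s.dot(y))                 # BB step size over m
        x_old, g_old = x_tilde, g
        x = x_tilde.copy()
        for _ in range(m):
            i = int(rng.integers(n))
            x -= eta * (grad_i(x, i) - grad_i(x_tilde, i) + g)  # variance-reduced step
        x_tilde = x                                         # Option I
    return x_tilde

# Hypothetical ridge regression with unit-norm rows (so each grad f_i is
# roughly 1.1-Lipschitz) and lam = 0.1 for strong convexity.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)
b, lam = rng.normal(size=50), 0.1
grad_i = lambda x, i: A[i] * (A[i].dot(x) - b[i]) + lam * x
x_hat = svrg_bb(grad_i, 50, np.zeros(5), m=100, eta0=0.05)
```

On a quadratic like this, y is exactly the Hessian applied to s, so the computed step size always stays between 1/(m L) and 1/(m mu), matching remark (ii).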
The following assumption is made throughout this section.\n\nAssumption 1. We assume that (3) holds for any xt. We assume that the objective function F (x) is \u00b5-strongly convex, i.e.,\n\nF (y) \u2265 F (x) + \u2207F (x)\u22a4(y \u2212 x) + (\u00b5/2)\u2016x \u2212 y\u2016\u2082\u00b2, \u2200x, y \u2208 Rd.\n\nWe also assume that the gradient of each component function fi(x) is L-Lipschitz continuous, i.e.,\n\n\u2016\u2207fi(x) \u2212 \u2207fi(y)\u2016\u2082 \u2264 L\u2016x \u2212 y\u2016\u2082, \u2200x, y \u2208 Rd.\n\nUnder this assumption, it is easy to see that \u2207F (x) is also L-Lipschitz continuous.\n\nWe first provide the following lemma, which reveals the relationship between the distances of two consecutive iterates to the optimal point.\n\nLemma 1. Define\n\n\u03b1k := (1 \u2212 2\u03b7k\u00b5(1 \u2212 \u03b7kL))^m + 4\u03b7kL\u00b2 / (\u00b5(1 \u2212 \u03b7kL)). (8)\n\nFor both SVRG-I and SVRG-BB, we have the following inequality for the k-th epoch:\n\nE\u2016\u02dcxk+1 \u2212 x\u2217\u2016\u2082\u00b2 < \u03b1k \u2016\u02dcxk \u2212 x\u2217\u2016\u2082\u00b2,\n\nwhere x\u2217 is the optimal solution to (1).\n\nThe linear convergence of SVRG-I follows immediately.\n\nCorollary 1. In SVRG-I, if m and \u03b7 are chosen such that\n\n\u03b1 := (1 \u2212 2\u03b7\u00b5(1 \u2212 \u03b7L))^m + 4\u03b7L\u00b2 / (\u00b5(1 \u2212 \u03b7L)) < 1, (9)\n\nthen SVRG-I converges linearly in expectation:\n\nE\u2016\u02dcxk \u2212 x\u2217\u2016\u2082\u00b2 < \u03b1^k \u2016\u02dcx0 \u2212 x\u2217\u2016\u2082\u00b2.\n\nRemark 2. We now give some remarks on this convergence result.\n(i) To the best of our knowledge, this is the first time that the linear convergence of SVRG-I has been established.\n(ii) The convergence result given in Corollary 1 is for the iterates \u02dcxk, while the one given in [10] is for the objective function values F (\u02dcxk).\n\nThe following theorem establishes the linear convergence of SVRG-BB (Algorithm 1).\n\nTheorem 1. 
Denote \u03b8 = (1 \u2212 e^{\u22122\u00b5/L})/2 and note that \u03b8 \u2208 (0, 1/2). In SVRG-BB, if m is chosen such that\n\nm > max{ 2 / (log(1 \u2212 2\u03b8) + 2\u00b5/L), 4L\u00b2/(\u03b8\u00b5\u00b2) + L/\u00b5 }, (10)\n\nthen SVRG-BB (Algorithm 1) converges linearly in expectation:\n\nE\u2016\u02dcxk \u2212 x\u2217\u2016\u2082\u00b2 < (1 \u2212 \u03b8)^k \u2016\u02dcx0 \u2212 x\u2217\u2016\u2082\u00b2.\n\n4 Barzilai-Borwein Step Size for SGD\n\nIn this section, we propose to incorporate the BB method into SGD (2). The BB method does not apply to SGD directly, because SGD never computes the full gradient \u2207F (x). One may suggest using \u2207fit+1(xt+1) \u2212 \u2207fit(xt) to estimate \u2207F (xt+1) \u2212 \u2207F (xt) when computing the BB step size by formula (6). However, this approach does not work well, because of the variance of the stochastic gradients. The recent work by Sopy\u0142a and Drozda [18] suggested several variants of this idea for computing an estimated BB step size from the stochastic gradients. However, these ideas lack theoretical justification, and the numerical results in [18] show that these approaches are inferior to some existing methods.\n\nThe SGD-BB algorithm we propose in this paper works in the following manner. We call every m iterations of SGD one epoch. Following the idea of SVRG-BB, SGD-BB also uses the same step size, computed by the BB formula, throughout each epoch. 
Our SGD-BB algorithm is described in Algorithm 2.\n\nAlgorithm 2 SGD with BB step size (SGD-BB)\n\nParameters: update frequency m, initial step sizes \u03b70 and \u03b71 (only used in the first two epochs), weighting parameter \u03b2 \u2208 (0, 1), initial point \u02dcx0\nfor k = 0, 1, \u00b7\u00b7\u00b7 do\n    if k > 1 then\n        \u03b7k = (1/m) \u00b7 \u2016\u02dcxk \u2212 \u02dcxk\u22121\u2016\u2082\u00b2 / |(\u02dcxk \u2212 \u02dcxk\u22121)\u22a4(\u02c6gk \u2212 \u02c6gk\u22121)|    (\u2217)\n    end if\n    x0 = \u02dcxk\n    \u02c6gk+1 = 0\n    for t = 0, \u00b7\u00b7\u00b7 , m \u2212 1 do\n        Randomly pick it \u2208 {1, . . . , n}\n        xt+1 = xt \u2212 \u03b7k\u2207fit(xt)\n        \u02c6gk+1 = \u03b2\u2207fit(xt) + (1 \u2212 \u03b2)\u02c6gk+1\n    end for\n    \u02dcxk+1 = xm\nend for\n\nRemark 3. We have a few remarks about SGD-BB (Algorithm 2).\n(i) SGD-BB takes a convex combination of the m stochastic gradients in one epoch, with parameter \u03b2, as an estimate of the full gradient. The performance of SGD-BB on different problems is not sensitive to the choice of \u03b2. For example, setting \u03b2 = 10/m worked well for all test problems in our experiments.\n(ii) Note that for computing \u03b7k in Algorithm 2, we actually take the absolute value in the BB formula (6). This is because, unlike in SVRG-BB, \u02c6gk in Algorithm 2 is not an exact full gradient, and as a result the step size generated by (6) can be negative. This can be seen from the following argument. Consider a simple case in which \u03b2 = 1/m; then approximately we have\n\n\u02c6gk = (1/m) \u2211_{t=0}^{m\u22121} \u2207fit(xt). (11)\n\nIt is easy to see that \u02dcxk \u2212 \u02dcxk\u22121 = \u2212m\u03b7k\u22121\u02c6gk. 
By substituting this equality into the formula for computing \u03b7k in Algorithm 2, we have\n\n\u03b7k = (1/m) \u00b7 \u2016\u02dcxk \u2212 \u02dcxk\u22121\u2016\u2082\u00b2 / |(\u02dcxk \u2212 \u02dcxk\u22121)\u22a4(\u02c6gk \u2212 \u02c6gk\u22121)| = \u03b7k\u22121 / |1 \u2212 \u02c6gk\u22a4\u02c6gk\u22121/\u2016\u02c6gk\u2016\u2082\u00b2|. (12)\n\nWithout taking the absolute value, the denominator of (12) would be \u02c6gk\u22a4\u02c6gk\u22121/\u2016\u02c6gk\u2016\u2082\u00b2 \u2212 1, which is usually negative in stochastic settings.\n(iii) Moreover, from (12) we have the following observations. If \u02c6gk\u22a4\u02c6gk\u22121 < 0, then \u03b7k is smaller than \u03b7k\u22121. This is reasonable, because \u02c6gk\u22a4\u02c6gk\u22121 < 0 indicates that the step size is too large and we need to shrink it. If \u02c6gk\u22a4\u02c6gk\u22121 > 0, then it indicates that we should be more aggressive and take a larger step size. Hence, the way we compute \u03b7k in Algorithm 2 in a sense dynamically adjusts the step size, by evaluating whether we are moving the iterates along the right direction. This kind of idea can be traced back to [11].\n\nNote that SGD-BB requires the averaged gradients of two epochs to compute the BB step size. Therefore, we need to specify the step sizes \u03b70 and \u03b71 for the first two epochs. From our numerical experiments, we found that the performance of SGD-BB is not sensitive to the choices of \u03b70 and \u03b71.\n\n4.1 Smoothing Technique for the Step Sizes\n\nDue to the randomness of the stochastic gradients, the step size computed by SGD-BB may sometimes fluctuate drastically, which can cause instability of the algorithm. Inspired by [13], we propose the following smoothing technique to stabilize the step size. It is known that in order to guarantee the convergence of SGD, the step sizes are required to be diminishing. 
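Before turning to the smoothing, a minimal sketch of Algorithm 2 as stated so far (raw BB steps, with the absolute value in (*)) may help; the noise-free toy problem below is made up so that the run is reproducible, and all names are illustrative:

```python
import numpy as np

def sgd_bb(grad_i, n, x0, m, eta0, eta1, beta, n_epochs=10, rng=None):
    """Sketch of SGD-BB (Algorithm 2), without the smoothing of Section 4.1.
    ghat is the recursively averaged stochastic gradient of the last epoch."""
    rng = rng or np.random.default_rng(0)
    x_tilde = x0.copy()
    x_prev = ghat = ghat_prev = None
    eta = eta0
    for k in range(n_epochs):
        if k == 1:
            eta = eta1
        elif k >= 2:
            s = x_tilde - x_prev
            eta = s.dot(s) / (m * abs(s.dot(ghat - ghat_prev)))  # step size (*)
        x = x_tilde.copy()
        ghat_new = np.zeros_like(x)
        for _ in range(m):
            i = int(rng.integers(n))
            g = grad_i(x, i)
            x -= eta * g                                  # plain SGD step
            ghat_new = beta * g + (1 - beta) * ghat_new   # recursive average
        x_prev, x_tilde = x_tilde, x
        ghat_prev, ghat = ghat, ghat_new
    return x_tilde

# Noise-free toy example (every f_i is identical): f_i(x) = 0.5 ||x - c||^2.
c = np.array([1.0, -1.0])
x_end = sgd_bb(lambda x, i: x - c, n=10, x0=np.zeros(2), m=50,
               eta0=0.1, eta1=0.1, beta=0.2)
```

With real stochastic gradients the raw step sizes computed this way can fluctuate considerably, which is what motivates the smoothing technique described next.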
Similarly to [13], we assume that the step sizes are of the form C/\u03c6(k), where C > 0 is an unknown constant that needs to be estimated, and \u03c6(k) is a prespecified function that controls the decreasing rate of the step size; a typical choice is \u03c6(k) = k + 1. In the k-th epoch of Algorithm 2, we have all the previous step sizes \u03b72, \u03b73, . . . , \u03b7k generated by the BB method, while the step sizes generated by the function C/\u03c6(k) are C/\u03c6(2), C/\u03c6(3), . . . , C/\u03c6(k). In order to make these two sets of step sizes close to each other, we solve the following optimization problem to determine the unknown parameter C:\n\n\u02c6Ck := argmin_C \u2211_{j=2}^{k} [log(C/\u03c6(j)) \u2212 log \u03b7j]\u00b2. (13)\n\nHere we take the logarithms of the step sizes so that the estimate is not dominated by the \u03b7j's with large magnitudes. It is easy to verify that the solution to (13) is given by \u02c6Ck = \u220f_{j=2}^{k} [\u03b7j\u03c6(j)]^{1/(k\u22121)}. Therefore, the smoothed step size for the k-th epoch of Algorithm 2 is\n\n\u02dc\u03b7k = \u02c6Ck/\u03c6(k) = \u220f_{j=2}^{k} [\u03b7j\u03c6(j)]^{1/(k\u22121)} / \u03c6(k). (14)\n\nThat is, we replace the \u03b7k in equation (\u2217) of Algorithm 2 by the \u02dc\u03b7k in (14). In practice, we do not need to store all the \u03b7j's, since \u02c6Ck can be computed recursively by \u02c6Ck = \u02c6Ck\u22121^{(k\u22122)/(k\u22121)} \u00b7 [\u03b7k\u03c6(k)]^{1/(k\u22121)}.\n\n4.2 Incorporating the BB Step Size into SGD Variants\n\nThe BB step size and the smoothing technique used in SGD-BB (Algorithm 2) can also be used in other variants of SGD, because these methods only require the gradient estimates, which are accessible in all SGD variants. 
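The geometric-mean form of the smoothed step size (14) can be sketched in a few lines; the raw step-size values below are made up for illustration:

```python
import numpy as np

def smoothed_steps(raw_etas, phi):
    """Smoothed step sizes (14): eta~_k = C_k / phi(k), where
    C_k = prod_{j=2}^{k} [eta_j * phi(j)]^{1/(k-1)} solves (13).
    raw_etas[0] is eta_2, raw_etas[1] is eta_3, and so on."""
    out, log_sum = [], 0.0
    for idx, eta_j in enumerate(raw_etas):
        j = idx + 2                       # BB step sizes start at epoch k = 2
        log_sum += np.log(eta_j * phi(j))
        c_k = np.exp(log_sum / (j - 1))   # geometric mean, i.e. hat C_k
        out.append(c_k / phi(j))
    return out

phi = lambda k: k + 1.0                   # the typical choice phi(k) = k + 1
raw = [0.30, 0.05, 0.20, 0.08, 0.12]      # made-up raw BB step sizes
smooth = smoothed_steps(raw, phi)
```

Keeping only the running sum of logarithms is exactly the recursive update of hat C_k mentioned in the text, so nothing beyond one scalar has to be stored between epochs.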
For example, when replacing the stochastic gradient in Algorithm 2 by the averaged gradient of the SAG method, we obtain SAG with BB step size (denoted SAG-BB). Because SAG does not need diminishing step sizes to ensure convergence, in the smoothing technique we simply choose \u03c6(k) \u2261 1. The details of SAG-BB are given in the supplementary material.\n\n5 Numerical Experiments\n\nIn this section, we conduct numerical experiments to demonstrate the efficacy of our SVRG-BB (Algorithm 1) and SGD-BB (Algorithm 2) algorithms. In particular, we apply SVRG-BB and SGD-BB to solve two standard test problems in machine learning: logistic regression with \u21132-norm regularization (LR), and the squared hinge loss SVM with \u21132-norm regularization (SVM):\n\n(LR) min_x F (x) = (1/n) \u2211_{i=1}^{n} log[1 + exp(\u2212bi ai\u22a4x)] + (\u03bb/2)\u2016x\u2016\u2082\u00b2, (15)\n\n(SVM) min_x F (x) = (1/n) \u2211_{i=1}^{n} ([1 \u2212 bi ai\u22a4x]\u208a)\u00b2 + (\u03bb/2)\u2016x\u2016\u2082\u00b2, (16)\n\nwhere ai \u2208 Rd and bi \u2208 {\u00b11} are the feature vector and class label of the i-th sample, respectively, and \u03bb > 0 is a weighting parameter.\n\nWe tested SVRG-BB and SGD-BB on three standard real data sets, downloaded from the LIBSVM website (www.csie.ntu.edu.tw/~cjlin/libsvmtools/). Detailed information on the data sets is given in Table 1.\n\nTable 1: Data and model information of the experiments\n\nDataset | n | d | model | \u03bb\nrcv1.binary | 20,242 | 47,236 | LR | 10\u207b\u2075\nw8a | 49,749 | 300 | LR | 10\u207b\u2074\nijcnn1 | 49,990 | 22 | SVM | 10\u207b\u2074\n\n5.1 Numerical Results of SVRG-BB\n\n(a) Sub-optimality on rcv1.binary\n(b) Sub-optimality on w8a\n(c) Sub-optimality on ijcnn1\n(d) Step sizes on rcv1.binary\n(e) Step sizes on w8a\n(f) Step sizes on ijcnn1\n\nFigure 1: Comparison of SVRG-BB and SVRG with fixed step sizes on different problems. 
The dashed lines stand for SVRG with different fixed step sizes \u03b7k given in the legend. The solid lines stand for SVRG-BB with different \u03b70; for example, the solid lines in sub-figures (a) and (d) correspond to SVRG-BB with \u03b70 = 10, 1, 0.1, respectively.\n\nIn this section, we compare SVRG-BB (Algorithm 1) with SVRG using fixed step sizes for solving (15) and (16). We used the best-tuned step size for SVRG, and three different initial step sizes \u03b70 for SVRG-BB. For both SVRG-BB and SVRG, we set m = 2n as suggested in [10].\n\nThe comparison results of SVRG-BB and SVRG are shown in Figure 1. In all sub-figures, the x-axis denotes the number of epochs k, i.e., the number of outer loops in Algorithm 1. In Figures 1(a), 1(b) and 1(c), the y-axis denotes the sub-optimality F (\u02dcxk) \u2212 F (x\u2217), and in Figures 1(d), 1(e) and 1(f), the y-axis denotes the corresponding step sizes \u03b7k; x\u2217 is obtained by running SVRG with the best-tuned step size until it converges. In all sub-figures, the dashed lines correspond to SVRG with the fixed step sizes given in the legends. Moreover, the black dashed lines always represent SVRG with the best-tuned step size, while the green and red lines use a relatively larger and a relatively smaller fixed step size, respectively. The solid lines correspond to SVRG-BB with different initial step sizes \u03b70.\n\nIt can be seen from Figures 1(a), 1(b) and 1(c) that SVRG-BB can always achieve the same level of sub-optimality as SVRG with the best-tuned step size. Although SVRG-BB needs slightly more epochs than SVRG with the best-tuned step size, it clearly outperforms SVRG with the other two choices of step size. Moreover, from Figures 1(d), 1(e) and 1(f) we see that the step sizes computed by SVRG-BB converge to the best-tuned step sizes after about 10 to 15 epochs. 
From Figure 1 we also see that SVRG-BB is not sensitive to the choice of \u03b70. Therefore, SVRG-BB has very promising potential in practice, because it generates the best step sizes automatically while running the algorithm.\n\n5.2 Numerical Results of SGD-BB\n\n(a) Sub-optimality on rcv1.binary\n(b) Sub-optimality on w8a\n(c) Sub-optimality on ijcnn1\n(d) Step sizes on rcv1.binary\n(e) Step sizes on w8a\n(f) Step sizes on ijcnn1\n\nFigure 2: Comparison of SGD-BB and SGD. The dashed lines correspond to SGD with diminishing step sizes of the form \u03b7/(k + 1) with different constants \u03b7. The solid lines stand for SGD-BB with different initial step sizes \u03b70.\n\nIn this section, we compare SGD-BB with the smoothing technique (Algorithm 2) against SGD for solving (15) and (16). We set m = n, \u03b2 = 10/m and \u03b71 = \u03b70 in our experiments, and used \u03c6(k) = k + 1 when applying the smoothing technique. Since SGD requires diminishing step sizes to converge, we tested SGD with diminishing step sizes of the form \u03b7/(k + 1) with different constants \u03b7. The comparison results are shown in Figure 2. As in Figure 1, the black dashed line represents SGD with the best-tuned \u03b7, the green and red dashed lines correspond to the other two choices of \u03b7, and the solid lines represent SGD-BB with different \u03b70.\n\nFrom Figures 2(a), 2(b) and 2(c) we can see that SGD-BB achieves comparable or even better sub-optimality than SGD with the best-tuned diminishing step size, and that SGD-BB is significantly better than SGD with the other two choices of step size. From Figures 2(d), 2(e) and 2(f) we see that after only a few epochs, the step sizes generated by SGD-BB approximately coincide with the best-tuned ones. 
It can also be seen that after only a few epochs, the step sizes are stabilized by the smoothing technique, and they approximately follow the same decreasing trend as the best-tuned diminishing step sizes.\n\n5.3 Comparison with Other Methods\n\nWe also compared our algorithms with many existing related methods. The experimental results again demonstrated the superiority of our methods; they are given in the supplementary materials.\n\nAcknowledgements\n\nResearch of Shiqian Ma was supported in part by the Hong Kong Research Grants Council General Research Fund (Grant 14205314). Research of Yu-Hong Dai was supported by the Chinese NSF (Nos. 11631013 and 11331012) and the National 973 Program of China (No. 2015CB856000).\n\nReferences\n\n[1] R. Babanezhad, M. O. Ahmed, A. Virani, M. Schmidt, J. Kone\u010dn\u00fd, and S. Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2242\u20132250, 2015.\n\n[2] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141\u2013148, 1988.\n\n[3] Y.-H. Dai. A new analysis on the Barzilai-Borwein gradient method. Journal of the Operations Research Society of China, 1(2):187\u2013198, 2013.\n\n[4] Y.-H. Dai and R. Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numerische Mathematik, 100(1):21\u201347, 2005.\n\n[5] Y.-H. Dai, W. W. Hager, K. Schittkowski, and H. Zhang. The cyclic Barzilai-Borwein method for unconstrained optimization. IMA Journal of Numerical Analysis, 26(3):604\u2013627, 2006.\n\n[6] Y.-H. Dai and L. Liao. R-linear convergence of the Barzilai and Borwein gradient method. IMA Journal of Numerical Analysis, 22:1\u201310, 2002.\n\n[7] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. 
In Advances in Neural Information Processing Systems, pages 1646\u20131654, 2014.\n\n[8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n\n[9] R. Fletcher. On the Barzilai-Borwein method. In Optimization and Control with Applications, pages 235\u2013256. Springer, 2005.\n\n[10] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315\u2013323, 2013.\n\n[11] H. Kesten. Accelerated stochastic approximation. The Annals of Mathematical Statistics, 29(1):41\u201359, 1958.\n\n[12] M. Mahsereci and P. Hennig. Probabilistic line searches for stochastic optimization. arXiv preprint arXiv:1502.02846, 2015.\n\n[13] P. Y. Mass\u00e9 and Y. Ollivier. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.\n\n[14] M. Raydan. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis, 13(3):321\u2013326, 1993.\n\n[15] M. Raydan. The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7(1):26\u201333, 1997.\n\n[16] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663\u20132671, 2012.\n\n[17] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567\u2013599, 2013.\n\n[18] K. Sopy\u0142a and P. Drozda. Stochastic gradient descent with Barzilai-Borwein update step for SVM. Information Sciences, 316:218\u2013233, 2015.\n\n[19] Y. Wang and S. Ma. Projected Barzilai-Borwein methods for large scale nonnegative image restorations. 
Inverse Problems in Science and Engineering, 15(6):559\u2013583, 2007.\n\n[20] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation. SIAM Journal on Scientific Computing, 32(4):1832\u20131857, 2010.\n\n[21] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479\u20132493, 2009.\n", "award": [], "sourceid": 389, "authors": [{"given_name": "Conghui", "family_name": "Tan", "institution": "The Chinese University of HK"}, {"given_name": "Shiqian", "family_name": "Ma", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Yu-Hong", "family_name": "Dai", "institution": null}, {"given_name": "Yuqiu", "family_name": "Qian", "institution": "The University of Hong Kong"}]}