{"title": "A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-Order to First-Order", "book": "Advances in Neural Information Processing Systems", "page_first": 3054, "page_last": 3062, "abstract": "Asynchronous parallel optimization received substantial successes and extensive attention recently. One of core theoretical questions is how much speedup (or benefit) the asynchronous parallelization can bring to us. This paper provides a comprehensive and generic analysis to study the speedup property for a broad range of asynchronous parallel stochastic algorithms from the zeroth order to the first order methods. Our result recovers or improves existing analysis on special cases, provides more insights for understanding the asynchronous parallel behaviors, and suggests a novel asynchronous parallel zeroth order method for the first time. Our experiments provide novel applications of the proposed asynchronous parallel zeroth order method on hyper parameter tuning and model blending problems.", "full_text": "A Comprehensive Linear Speedup Analysis for\n\nAsynchronous Stochastic Parallel Optimization from\n\nZeroth-Order to First-Order\n\nXiangru Lian*, Huan Zhangy, Cho-Jui Hsiehz, Yijun Huang*, and Ji Liu*\n\n(cid:3) Department of Computer Science, University of Rochester, USA\n\ny Department of Electrical and Computer Engineering, University of California, Davis, USA\n\nz Department of Computer Science, University of California, Davis, USA\n\nxiangru@yandex.com, victzhang@gmail.com, chohsieh@ucdavis.edu,\n\nhuangyj0@gmail.com, ji.liu.uwisc@gmail.com\n\nAbstract\n\nAsynchronous parallel optimization received substantial successes and extensive\nattention recently. One of core theoretical questions is how much speedup (or\nbene\ufb01t) the asynchronous parallelization can bring to us. 
This paper provides a comprehensive and generic analysis to study the speedup property for a broad range of asynchronous parallel stochastic algorithms, from zeroth-order to first-order methods. Our result recovers or improves existing analyses on special cases, provides more insights for understanding asynchronous parallel behaviors, and suggests a novel asynchronous parallel zeroth-order method for the first time. Our experiments provide novel applications of the proposed asynchronous parallel zeroth-order method on hyperparameter tuning and model blending problems.

1 Introduction

Asynchronous parallel optimization has received substantial success and extensive attention recently, for example, [5, 25, 31, 33, 34, 37]. It has been used to solve various machine learning problems, such as deep learning [4, 7, 26, 36], matrix completion [25, 28, 34], SVM [15], linear systems [3, 21], PCA [10], and linear programming [32]. Its main advantage over synchronous parallel optimization is avoiding the synchronization cost, so it minimizes the system overhead and maximizes the efficiency of all computation workers.

One of the core theoretical questions is how much speedup (or benefit) the asynchronous parallelization can bring to us, that is, how much time can we save by employing more computation resources? More precisely, people are interested in the running time speedup (RTS) with T workers:

RTS(T) = (running time using a single worker) / (running time using T workers).

Since in asynchronous parallelism all workers keep busy, RTS can be measured roughly by the computational complexity speedup (CCS) with T workers¹:

CCS(T) = (total computational complexity using a single worker) / (total computational complexity using T workers) × T.

In this paper, we are mainly interested in the conditions to ensure the linear speedup property. 
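For concreteness, the two speedup measures can be computed directly from measured wall-clock times and operation counts. The following is a minimal sketch (the helper names are ours, not from the paper):

```python
def running_time_speedup(t_single: float, t_parallel: float) -> float:
    """RTS(T): wall-clock time with one worker divided by time with T workers."""
    return t_single / t_parallel

def computational_complexity_speedup(c_single: float, c_parallel: float, T: int) -> float:
    """CCS(T): ratio of total computational complexity, scaled by the worker count T.

    If T workers together perform the same total work as one worker would
    (no wasted computation), CCS(T) = T, i.e. a linear speedup."""
    return c_single / c_parallel * T
```

For example, if 8 workers finish in 12.5 s what one worker finishes in 100 s, RTS(8) = 8; if the total complexity is unchanged, CCS(8) = 8 as well.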
More specifically, what is the upper bound on T to ensure CCS(T) = Θ(T)?

Existing studies on special cases, such as asynchronous stochastic gradient descent (ASGD) and asynchronous stochastic coordinate descent (ASCD), have revealed some clues for what factors can

¹For simplicity, we assume that the communication cost is not dominant throughout this paper.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Table 1: Asynchronous parallel algorithms. "I" and "C" in "model" stand for the inconsistent and consistent read models respectively, which will be explained later. "Base alg." is short for base algorithm.

Asyn. alg. | base alg. | problem type | upper bound of T | model
ASGD [25] | SGD | smooth, strongly convex | O(N^{1/4}) | C
ASGD [1] | SGD | smooth, convex | O(K^{1/4} min{σ^{3/2}, σ^{1/2}}) | C
ASGD [11] | SGD | composite, convex | O(K^{1/4} σ^{1/2}) | C
ASGD [18] | SGD | smooth, nonconvex | O(N^{1/4} K^{1/2} σ) | I
ASGD [18] | SGD | smooth, nonconvex | O(K^{1/2} σ) | C
ARK [21] | SGD | Ax = b | O(N) | C
ASCD [20] | SCD | smooth, convex, unconstrained | O(N^{1/2}) | C
ASCD [20] | SCD | smooth, convex, constrained | O(N^{1/4}) | C
ASCD [19] | SCD | composite, convex | O(N^{1/4}) | I
ASCD [3] | SCD | (1/2)xᵀAx − bᵀx | O(N) | C
ASCD [3] | SCD | (1/2)xᵀAx − bᵀx | O(N^{1/2}) | I
ASCD [15] | SCD | (1/2)xᵀAx − bᵀx, constrained | O(N^{1/2}) | I
ASZD (this paper) | zeroth-order SGD & SCD | smooth, nonconvex | O(√(N^{3/2} + KN^{1/2}σ²)) | I
ASGD (this paper) | SGD | smooth, nonconvex | O(√(N^{3/2} + KN^{1/2}σ²)) | I
ASGD (this paper) | SGD | smooth, nonconvex | O(√(Kσ² + 1)) | C
ASCD (this paper) | SCD | smooth, nonconvex | O(N^{3/4}) | I

affect the upper bound of T. For example, Agarwal and Duchi [1] showed that the upper bound depends on the variance of the stochastic gradient in ASGD; Niu et al. 
[25] showed that the upper bound depends on the data sparsity and the dimension of the problem in ASGD; and Avron et al. [3] and Liu and Wright [19] found that the upper bound depends on the problem dimension as well as the diagonal dominance of the Hessian matrix of the objective. However, there is still no comprehensive and generic analysis that comprehends all these pieces and shows how these factors jointly affect the speedup property.

This paper provides a comprehensive and generic analysis to study the speedup property for a broad range of asynchronous parallel stochastic algorithms, from zeroth-order to first-order methods. To avoid unnecessary complication while covering practical problems and algorithms, we consider the following nonconvex stochastic optimization problem:

min_{x∈R^N} f(x) := E_ξ(F(x; ξ)), (1)

where ξ ∈ Ξ is a random variable, and both F(·; ξ) : R^N → R and f(·) : R^N → R are smooth but not necessarily convex functions. This objective function covers a large scope of machine learning problems, including deep learning. The F(·; ξ)'s are called component functions in this paper. 
The most common specification is that Ξ is the index set of all training samples, Ξ = {1, 2, ..., n}, and F(x; ξ) is the loss function with respect to the training sample indexed by ξ.

We highlight the main contributions of this paper in the following:

• We provide a generic analysis for convergence and speedup, which covers many existing algorithms, including ASCD, ASGD (implementation on a parameter server), ASGD (implementation on multicore systems), and others, as its special cases.

• Our generic analysis can recover or improve the existing results on special cases.

• Our generic analysis suggests a novel asynchronous stochastic zeroth-order gradient descent (ASZD) algorithm and provides the analysis for its convergence rate and speedup property. To the best of our knowledge, this is the first asynchronous parallel zeroth-order algorithm.

• The experiments include a novel application of the proposed ASZD method on model blending and hyperparameter tuning for big data optimization.

1.1 Related Works

We first review first-order asynchronous parallel stochastic algorithms. Table 1 summarizes existing linear speedup results for the asynchronous parallel optimization algorithms most related to this paper. The last block of Table 1 shows the results in this paper. Reddi et al. [29] proved the convergence of the asynchronous variance reduced stochastic gradient (SVRG) method and its speedup in the sparse setting. Mania et al. [22] provide a general perspective (or starting point) to analyze asynchronous stochastic algorithms, including HOGWILD!, asynchronous SCD, and asynchronous sparse SVRG. The fundamental difference in our work lies in that we apply a different analysis and our result can be directly applied to various special cases, while theirs cannot. In addition, there is a line of research studying asynchronous ADMM type methods, which is not in the scope of this paper. We encourage readers to refer to the recent literature, for example, Hong [14], Zhang and Kwok [35].

We end this section by reviewing zeroth-order stochastic methods. We use N to denote the dimension of the problem, K to denote the iteration number, and σ to denote the variance of the stochastic gradient. Nesterov and Spokoiny [24] proved a convergence rate of O(N/√K) for zeroth-order SGD applied to convex optimization. Based on [24], Ghadimi and Lan [12] proved a convergence rate of O(√(N/K)) for zeroth-order SGD on nonconvex smooth problems. Jamieson et al. [16] show a lower bound O(1/√K) for any zeroth-order method with inaccurate evaluation. Duchi et al. [9] proved an O(N^{1/4}/K + 1/√K) rate for zeroth-order SGD on convex objectives, but with some very different assumptions compared to our paper. Agarwal et al. [2] proved a regret of O(poly(N)√K) for a zeroth-order bandit algorithm on convex objectives.

For a more comprehensive review of asynchronous algorithms, please refer to the long version of this paper on arXiv:1606.00498.

1.2 Notation

• e_i ∈ R^N denotes the ith natural unit basis vector.

• E(·) means taking the expectation with respect to all random variables, while E_a(·) denotes the expectation with respect to a random variable a.

• ∇f(x) ∈ R^N is the gradient of f(x) with respect to x. Let S be a subset of {1, ..., N}. ∇_S f(x) ∈ R^N is the projection of ∇f(x) onto the index set S, that is, setting components of ∇f(x) outside of S to zero. We use ∇_i f(x) ∈ R^N to denote ∇_{{i}} f(x) for short.

• f^* denotes the optimal objective value in (1).

2 Algorithm

We illustrate the asynchronous parallelism by assuming a centralized network: a central node and multiple child nodes (workers). The central node maintains the optimization variable x. 
It could be a parameter server if implemented on a computer cluster [17]; it could be a shared memory if implemented on a multicore machine. Given a base algorithm A, all child nodes run algorithm A independently and concurrently: read x from the central node (we call the result of this read x̂, and it is mathematically defined later in (4)), calculate locally using x̂, and modify x on the central node. There is no need to synchronize the child nodes. Therefore, all child nodes stay busy and consequently their efficiency gets maximized. In other words, we have CCS(T) ≈ RTS(T). Note that due to the asynchronous parallel mechanism, the variable x in the central node is not updated exactly following the protocol of Algorithm A, since when a child node returns its computation result, the x in the central node might have been changed by other child nodes. Thus a new analysis is required. A fundamental question is under what conditions a linear speedup can be guaranteed. 
In other words, under what conditions does CCS(T) = Θ(T), or equivalently RTS(T) = Θ(T), hold?

To provide a comprehensive analysis, we consider a generic algorithm A – the zeroth-order hybrid of SCD and SGD: iteratively sample a component function² indexed by ξ and a coordinate block S ⊆ {1, 2, ..., N}, where |S| = Y for some constant Y, and update x with

x ← x − γ G_S(x; ξ), (2)

where G_S(x; ξ) is an approximation to the block coordinate stochastic gradient NY^{-1}∇_S F(x; ξ):

G_S(x; ξ) := Σ_{i∈S} N/(2Yμ_i) (F(x + μ_i e_i; ξ) − F(x − μ_i e_i; ξ)) e_i,  S ⊆ {1, 2, ..., N}. (3)

In the definition of G_S(x; ξ), μ_i is the approximation parameter for the ith coordinate. (μ_1, μ_2, ..., μ_N) is predefined in practice. We only use function values (the zeroth-order information) to estimate G_S(x; ξ). It is easy to see that the closer to 0 the μ_i's are, the closer G_S(x; ξ) and NY^{-1}∇_S F(x; ξ) will be. In particular, lim_{μ_i→0, ∀i} G_S(x; ξ) = NY^{-1}∇_S F(x; ξ).

Algorithm 1 Generic Asynchronous Stochastic Algorithm (GASA)
Require: x_0, K, Y, (μ_1, μ_2, ..., μ_N), {γ_k}_{k=0,...,K−1} ▷ γ_k is the step length for the kth iteration
Ensure: {x_k}_{k=0}^{K}
1: for k = 0, ..., K − 1 do
2:   Randomly select a component function index ξ_k and a set of coordinate indices S_k, where |S_k| = Y;
3:   x_{k+1} = x_k − γ_k G_{S_k}(x̂_k; ξ_k);
4: end for

²The algorithm and the theoretical analysis that follows can be easily extended to the minibatch version.

Applying the asynchronous parallelism, we propose a generic asynchronous stochastic algorithm in Algorithm 1. 
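As a sanity check on the notation, the update (2)–(3) can be sketched in a few lines of Python. This is a serial sketch with our own function names; the asynchronous staleness in x̂ is omitted here:

```python
import numpy as np

def zeroth_order_block_gradient(F, x, xi, S, mu):
    """Zeroth-order estimate of (N/Y) * grad_S F(x; xi), as in Eq. (3):
    a central difference along each sampled coordinate i in the block S."""
    N, Y = x.size, len(S)
    g = np.zeros(N)
    for i in S:
        e = np.zeros(N)
        e[i] = 1.0
        g[i] = N / (2 * Y * mu[i]) * (F(x + mu[i] * e, xi) - F(x - mu[i] * e, xi))
    return g

def gasa_step(F, x, xi, S, mu, gamma):
    """One serial update of Algorithm 1: x <- x - gamma * G_S(x; xi)."""
    return x - gamma * zeroth_order_block_gradient(F, x, xi, S, mu)

# Toy check: one quadratic "component" F(x; xi) = 0.5*||x||^2, so grad F = x and
# the block estimate on S = {0} should be (N/Y) * x[0] = 3.0 at x = (1, 2, 3).
F = lambda x, xi: 0.5 * float(np.dot(x, x))
x = np.array([1.0, 2.0, 3.0])
g = zeroth_order_block_gradient(F, x, 0, [0], np.full(3, 1e-4))
x_new = gasa_step(F, x, 0, [0], np.full(3, 1e-4), 0.1)
```

Because the toy component function is quadratic, the central difference recovers the (scaled) coordinate gradient essentially exactly; coordinates outside S are untouched.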
This algorithm essentially characterizes how the value of x is updated in the central node. γ_k is the predefined steplength (or learning rate). K is the total number of iterations (note that this iteration number is counted by the central node, that is, any update on x, no matter from which child node, increases this counter).

As we mentioned, the key difference of the asynchronous algorithm from the protocol of Algorithm A in Eq. (2) is that x̂_k may not be equal to x_k. In asynchronous parallelism, there are two different ways to model the value of x̂_k:

• Consistent read: x̂_k is some earlier state of x in the central node, that is, x̂_k = x_{k−τ_k} for some τ_k ≥ 0. This happens if reading x and writing x on the central node by any child node are atomic operations, for instance, in the implementation on a parameter server [17].

• Inconsistent read: x̂_k could be more complicated when the atomic read of x cannot be guaranteed, which could happen, for example, in the implementation on a multicore system. It means that while one child node is reading x in the central node, other child nodes may be performing modifications on x at the same time. Therefore, different coordinates of x read by any child node may have different ages. In other words, x̂_k may not be any state of x that ever existed in the central node.

Readers who want to learn more details about consistent read and inconsistent read can refer to [3, 18, 19]. To cover both cases, we note that x̂_k can be represented in the following generic form:

x̂_k = x_k − Σ_{j∈J(k)} (x_{j+1} − x_j), (4)

where J(k) ⊂ {k−1, k−2, ..., k−T} is a subset of the indices of earlier iterations, and T is the upper bound for staleness. This expression is also considered in [3, 18, 19, 27]. Note that the practical value of T is usually proportional to the number of involved nodes (or workers). 
Therefore, the total number of workers and the upper bound of the staleness are treated as the same in the following discussion, and the notation T is abused for simplicity.

3 Theoretical Analysis

Before we show the main results of this paper, let us first make some global assumptions commonly used in the analysis of stochastic algorithms.³

Bounded Variance of Stochastic Gradient E_ξ(‖∇F(x; ξ) − ∇f(x)‖²) ≤ σ², ∀x.

Lipschitzian Gradient The gradients of both the objective and its component functions are Lipschitzian:⁴

max{‖∇f(x) − ∇f(y)‖, ‖∇F(x; ξ) − ∇F(y; ξ)‖} ≤ L‖x − y‖, ∀x, ∀y, ∀ξ. (5)

Under the Lipschitzian gradient assumption, we define two more constants, L_s and L_max. Let s be any positive integer bounded by N. Define L_s to be the minimal constant satisfying the following inequality: ∀ξ, ∀x, and ∀S ⊆ {1, 2, ..., N} with |S| ≤ s, for any z = Σ_{i∈S} α_i e_i we have:

max{‖∇f(x) − ∇f(x + z)‖, ‖∇F(x; ξ) − ∇F(x + z; ξ)‖} ≤ L_s ‖z‖. (6)

Define L_(i) for i ∈ {1, 2, ..., N} as the minimal constant that satisfies:

max{‖∇_i f(x) − ∇_i f(x + αe_i)‖, ‖∇_i F(x; ξ) − ∇_i F(x + αe_i; ξ)‖} ≤ L_(i) |α|, ∀ξ, ∀x.

Define L_max := max_{i∈{1,...,N}} L_(i). 
It can be seen that L_max ≤ L_s ≤ L.

Independence All random variables ξ_k, S_k for k = 0, 1, ..., K are independent of each other.

Bounded Age Let T be the global bound for delay: J(k) ⊆ {k − 1, ..., k − T}, ∀k, so |J(k)| ≤ T.

We define the following global quantities for short notation:

ω := (Σ_{i=1}^{N} L_(i)² μ_i²)/N,  α_1 := 4 + 4(T√Y + Y^{3/2}T²/√N) L_T²/(L_Y² N),
α_2 := Y/((f(x_0) − f^*) L_Y N),  α_3 := (K(Nω + σ²)α_2 + 4) L_T²/L_Y². (7)

Next we show our main result in the following theorem:

³Some underlying assumptions, such as atomic reading and writing of a float number, are omitted here. As pointed out in [25], these behaviors are guaranteed by most modern architectures.

⁴Note that the Lipschitz assumption on the component functions F(x; ξ) can be eliminated when it comes to first-order methods (i.e., ω → 0) in our following theorems.

Theorem 1 (Generic Convergence Rate for GASA). Choose the steplength γ_k in Algorithm 1 to be a constant γ with

γ_k^{-1} = γ^{-1} = 2L_Y N Y^{-1} √(K(Nω + σ²)α_2 + α_1), ∀k,

and suppose the age T is bounded by

T ≤ (√N/(2Y^{1/2})) (√(1 + 4Y^{-1/2}N^{1/2}α_3) − 1).

We have the following convergence rate:

(Σ_{k=0}^{K} E‖∇f(x_k)‖²)/K ≤ (20/(Kα_2)) (L_T²/L_Y²) (√(1 + 4Y^{-1/2}N^{1/2}α_3) − 1) + 11√((Nω + σ²)/(Kα_2)) + Nω. (8)

Roughly speaking, the first term on the RHS of (8) is related to SCD; the second term is related to "stochastic" gradient descent; and the last term is due to the zeroth-order approximation.

Although this result looks complicated (or may seem less elegant), it is capable of capturing many important subtle structures, which can be seen in the subsequent discussion. We will show how to recover and improve existing results, as well as prove the convergence of new algorithms, using Theorem 1. To make the results more interpretable, we use big-O notation to avoid explicitly writing down all the constant factors, including all L's, f(x_0), and f^*, in the following corollaries.

3.1 Asynchronous Stochastic Coordinate Descent (ASCD)

We apply Theorem 1 to study the asynchronous SCD algorithm by taking Y = 1 and σ = 0. S_k = {i_k} only contains a single randomly sampled coordinate, and ω = 0 (or equivalently μ_i = 0, ∀i). The essential updating rule on x is x_{k+1} = x_k − γ_k ∇_{i_k} f(x̂_k).

Corollary 2 (ASCD). Let ω = 0, σ = 0, and Y = 1 in Algorithm 1 and Theorem 1. If

T ≤ O(N^{3/4}), (9)

the following convergence rate holds:

(Σ_{k=0}^{K} E‖∇f(x_k)‖²)/K ≤ O(N/K). (10)

The proved convergence rate O(N/K) is consistent with the existing analysis of SCD [30] and of ASCD for smooth optimization [20]. However, our requirement in (9) to ensure the linear speedup property is better than the one in [20], improving it from T ≤ O(N^{1/2}) to T ≤ O(N^{3/4}). Mania et al. [22] analyzed ASCD for strongly convex objectives and proved a linear speedup smaller than O(N^{1/6}), which is also more restrictive than ours.

3.2 Asynchronous Stochastic Gradient Descent (ASGD)

ASGD has been widely used to solve deep learning [7, 26, 36], NLP [4, 13], and many other important machine learning problems [25]. 
There are two typical implementations of ASGD. The first type is implemented on a computer cluster with a parameter server [1, 17]. The parameter server serves as the central node. It can ensure atomic read and write of the whole vector x, and leads to the following updating rule for x (setting Y = N and μ_i = 0, ∀i in Algorithm 1):

x_{k+1} = x_k − γ_k ∇F(x̂_k; ξ_k). (11)

Note that a single iteration is defined as modifying the whole vector. The other type is implemented on a single computer with multiple cores. In this case, the central node corresponds to the shared memory. Multiple cores (or threads) can access it simultaneously. However, in this model the atomic read and write of x cannot be guaranteed. Therefore, for the purpose of analysis, each update on a single coordinate accounts for an iteration. This yields the following updating rule (setting S_k = {i_k}, that is, Y = 1, and μ_i = 0, ∀i in Algorithm 1):

x_{k+1} = x_k − γ_k ∇_{i_k} F(x̂_k; ξ_k). (12)

Readers can refer to [3, 18, 25] for more details and illustrations of these two implementations.

Corollary 3 (ASGD in (11)). Let ω = 0 (or equivalently μ_i = 0, ∀i) and Y = N in Algorithm 1 and Theorem 1. If

T ≤ O(√(Kσ² + 1)), (13)

then the following convergence rate holds:

(Σ_{k=0}^{K} E‖∇f(x_k)‖²)/K ≤ O(σ/√K + 1/K). (14)

First note that the convergence rate in (14) is tight, since it is consistent with the serial (nonparallel) version of SGD [23]. We compare the linear speedup property indicated by (13) with the results in [1], [11], and [18]. To ensure such a rate, Agarwal and Duchi [1] need T to be bounded by T ≤ O(K^{1/4} min{σ^{3/2}, σ^{1/2}}), which is inferior to our result in (13). Feyzmahdavian et al. 
[11] need T to be bounded by σ^{1/2}K^{1/4} to achieve the same rate, which is also inferior to our result. Our requirement is consistent with the one in [18]. To the best of our knowledge, it is the best result so far.

Corollary 4 (ASGD in (12)). Let ω = 0 (or equivalently μ_i = 0, ∀i) and Y = 1 in Algorithm 1 and Theorem 1. If

T ≤ O(√(N^{3/2} + KN^{1/2}σ²)), (15)

then the following convergence rate holds:

(Σ_{k=0}^{K} E‖∇f(x_k)‖²)/K ≤ O(√(N/K) σ + N/K). (16)

The additional factor N in (16) (compared to (14)) arises from the different way of counting iterations. This additional factor also appears in [25] and [18]. We first compare our result with [18], which requires T to be bounded by O(√(KN^{1/2}σ²)). We can see that our requirement in (15) allows a larger value for T, especially when σ is small, so that N^{3/2} dominates KN^{1/2}σ². Next we compare with [25], which assumes that the objective function is strongly convex. Although this is sort of comparing "apples" with "oranges", it is still meaningful if one believes that strong convexity would not affect the linear speedup property, which is implied by [22]. In [25], the linear speedup is guaranteed if T ≤ O(N^{1/4}), under the assumption that the sparsity of the stochastic gradient is bounded by O(1). In comparison, we do not require the sparsity assumption on the stochastic gradient and have a better dependence on N. Moreover, beyond the improvement over the existing analyses in [22] and [18], our analysis provides some interesting insights for asynchronous parallelism. Niu et al. [25] essentially suggest that a large problem dimension N is beneficial to the linear speedup, while Lian et al. 
[18] and many others (for example, Agarwal and Duchi [1], Feyzmahdavian et al. [11]) suggest that a large stochastic variance σ (which often implies a large number of samples) is beneficial to the linear speedup. Our analysis shows the combined effect of N and σ, and shows how they improve the linear speedup jointly.

3.3 Asynchronous Stochastic Zeroth-order Descent (ASZD)

We end this section by applying Theorem 1 to generate a novel asynchronous stochastic zeroth-order descent algorithm, by setting the block size Y = 1 (or equivalently S_k = {i_k}) in G_{S_k}(x̂_k; ξ_k):

G_{S_k}(x̂_k; ξ_k) = G_{{i_k}}(x̂_k; ξ_k) = (F(x̂_k + μ_{i_k} e_{i_k}; ξ_k) − F(x̂_k − μ_{i_k} e_{i_k}; ξ_k))/(2μ_{i_k}) e_{i_k}. (17)

To the best of our knowledge, this is the first asynchronous algorithm for zeroth-order optimization.

Corollary 5 (ASZD). Set Y = 1 and all μ_i's to a constant μ in Algorithm 1. Suppose that μ satisfies

μ ≤ O(1/√K + min{σ(NK)^{-1/4}, σ/√N}), (18)

and T satisfies

T ≤ O(√(N^{3/2} + KN^{1/2}σ²)). (19)

We have the following convergence rate:

(Σ_{k=0}^{K} E‖∇f(x_k)‖²)/K ≤ O(N/K + √(N/K) σ). (20)

We first note that the convergence rate in (20) is consistent with the rate for the serial (nonparallel) zeroth-order stochastic gradient method in [12]. We then evaluate this result from two perspectives. First, we consider T = 1, which leads to the serial (nonparallel) zeroth-order stochastic descent. Our result implies a better dependence on μ, compared with [12].⁵ To obtain such a convergence rate

⁵Acute readers may notice that our way in (17) to estimate the stochastic gradient is different from the one used in [12]. 
Our method only estimates a single coordinate gradient of a sampled component function, while Ghadimi and Lan [12] estimate the whole gradient of the sampled component function. Our estimation is more accurate but less aggressive. The proved convergence rate actually improves a small constant in [12].

in (20), Ghadimi and Lan [12] require μ ≤ O(1/(N√K)), while our requirement in (18) is much less restrictive. An important insight in our requirement is the dependence on the variance σ: if the variance σ is large, μ is allowed to be a much larger value. This insight meets common sense: a large variance means that the stochastic gradient may deviate largely from the true gradient, so we are allowed to choose a large μ, obtaining a less exact estimate of the stochastic gradient without affecting the convergence rate. From a practical point of view, one always tends to choose a large value for μ. Recall that the zeroth-order method uses the function difference at two different points (e.g., x + μe_i and x − μe_i) to estimate the differential. In a practical system (e.g., a concrete control system), there usually exists some system noise while querying the function values. If the two points are too close (in other words, μ is too small), the obtained function difference is dominated by noise and does not really reflect the function differential.

Second, we consider the case T ≥ 1, which leads to the asynchronous zeroth-order stochastic descent. To the best of our knowledge, this is the first such algorithm. The upper bound for T in (19) essentially indicates the requirement for the linear speedup property. 
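The noise sensitivity of small μ described above can be illustrated numerically. The following toy model is ours, not the paper's: the true black box is f(x) = x² (derivative 2 at x = 1), and every query is corrupted by bounded noise of amplitude 10⁻³ (here a deterministic ± pattern standing in for measurement noise), so the two-point estimate has error roughly noise/μ:

```python
class NoisyBlackBox:
    """True function x**2, but every query is corrupted by bounded system noise
    (a deterministic alternating +/- pattern standing in for measurement noise)."""
    def __init__(self, amplitude=1e-3):
        self.amplitude = amplitude
        self.sign = 1.0

    def __call__(self, x):
        self.sign = -self.sign
        return x ** 2 + self.sign * self.amplitude

def central_difference(f, x, mu):
    """Two-point zeroth-order estimate of f'(x), as in Eq. (17)."""
    return (f(x + mu) - f(x - mu)) / (2 * mu)

# With a tiny mu the noise dominates (error ~ noise/mu = 1e5); with a
# moderate mu the same noise only perturbs the estimate by ~0.1.
err_tiny = abs(central_difference(NoisyBlackBox(), 1.0, 1e-8) - 2.0)
err_moderate = abs(central_difference(NoisyBlackBox(), 1.0, 1e-2) - 2.0)
```

Since the toy function is quadratic, the central difference itself has no bias; the entire error comes from the query noise divided by 2μ, which is exactly why a larger μ is preferable on noisy systems.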
The linear speedup property here also shows that even if Kσ² is much smaller than 1, we still have an O(N^{3/4}) linear speedup, which reflects a fundamental understanding of asynchronous stochastic algorithms: N and σ can improve the linear speedup jointly.

4 Experiment

Since ASCD and the various ASGDs have been extensively validated in recent papers, we conduct two experiments to validate the proposed ASZD in this section. The first part applies ASZD to estimate the parameters of a synthetic black box system. The second part applies ASZD to model combination for the Yahoo Music Recommendation Competition.

4.1 Parameter Optimization for a Black Box

We use a deep neural network to simulate a black box system. The optimization variables are the weights associated with the neural network. We choose 5 layers (400/100/50/20/10 nodes) for the neural network, with 46,380 weights (or parameters) in total. The weights are randomly generated from an i.i.d. Gaussian distribution. The output vector is constructed by applying the network to the input vector and adding some Gaussian random noise. We use this network to generate 463,800 samples. These synthetic samples are used to optimize the weights of the black box. (We pretend not to know the structure and weights of this neural network because it is a black box.) To optimize (estimate) the parameters of this black box, we apply the proposed ASZD method.

The experiment is conducted on a machine (Intel Xeon architecture) with 4 sockets and 10 cores per socket. 
We run Algorithm 1 on various numbers of cores from 1 to 32, and the steplength is chosen as γ = 0.1, based on the best performance of Algorithm 1 running on 1 core to achieve the precision 10^{-1} for the objective value.

Table 2: CCS and RTS of ASZD for different numbers of threads (synthetic data).

thr-# | 1 | 4 | 8 | 12 | 16 | 20 | 24 | 28 | 32
CCS | 1 | 3.87 | 7.91 | 9.97 | 14.74 | 17.86 | 21.76 | 26.44 | 30.86
RTS | 1 | 3.32 | 6.74 | 8.48 | 12.49 | 15.08 | 18.52 | 22.49 | 26.12

The speedup is reported in Table 2. We observe that the iteration speedup (CCS) is almost linear, while the running time speedup is slightly worse than the iteration speedup. We also draw Figure 1 (see the supplement) to show the curves of the objective value against the number of iterations and the running time, respectively.

4.2 Asynchronous Parallel Model Combination for the Yahoo Music Recommendation Competition

In KDD-Cup 2011, teams were challenged to predict user ratings in music given the Yahoo! Music data set [8]. The evaluation criterion is the Root Mean Squared Error (RMSE) on the test data set:

RMSE = √( Σ_{(u,i)∈T_1} (r_ui − r̂_ui)² / |T_1| ), (21)

where (u, i) ∈ T_1 ranges over all user ratings in the Track 1 test data set (6,005,940 ratings), r_ui is the true rating of user u for item i, and r̂_ui is the predicted rating. The winning team from NTU created more than 200 models using different machine learning algorithms [6], including Matrix Factorization, k-NN, Restricted Boltzmann Machines, etc. They blended these models using a Neural Network and Binned Linear Regression on the validation data set (4,003,960 ratings) to create a model ensemble achieving a better RMSE.

We were able to obtain the predicted ratings of N = 237 individual models on the KDD-Cup test data set from the NTU KDD-Cup team, which form a matrix X with 6,005,940 rows (corresponding to the 6,005,940 test data set samples) and 237 columns. 
Each element Xij indicates the j-th model's predicted rating on the i-th Yahoo! Music test data sample. In our experiments, we try to linearly blend the 237 models using information from the test data set. Thus, our variable to optimize is a vector x ∈ R^N of coefficients for the predicted ratings of each model. To ensure that our linear blending does not over-fit, we further split X randomly into two equal parts, calling them the "validation" set (denoted as A ∈ R^{n×N}) for model blending and the true test set.

We define our objective function as the squared RMSE of the blended output on the validation set: f(x) = ‖Ax − r‖²/n, where r contains the corresponding true ratings in the validation set and Ax gives the predicted ratings after blending.

We assume that we cannot see the entries of r directly, and thus cannot compute the gradient of f(x). In our experiment, we treat f(x) as a black box, and the only information we can get from it is its value given a vector of model blending coefficients x. This is similar to submitting a model for KDD-Cup and obtaining a leaderboard RMSE on the test set; we do not know the actual values of the test set. Then, we apply our ASZD algorithm to minimize f(x) with zeroth-order information only.

Table 3: Comparing RMSEs on the test data set with KDD-Cup winner teams.

         NTU (1st)   Commendo (2nd)   InnerPeace (3rd)   Our result
RMSE     21.0004     21.0545          21.2335            21.1241

We implement our algorithm using Julia on a 10-core Xeon E7-4680 machine and run our algorithm for the same number of iterations, with different numbers of threads, and measure the running time speedup (RTS) in Figure 4 (see supplement). Similar to our experiment on the neural network black box, our algorithm has an almost linear speedup.
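To make the black-box interface concrete, the value oracle f(x) = ‖Ax − r‖²/n can be sketched as follows. The function name `blend_objective` and the tiny A, r instance are illustrative stand-ins, not the actual KDD-Cup data:

```python
def blend_objective(A, r, x):
    # Black-box value f(x) = ||A x - r||^2 / n: the squared RMSE of the
    # linear blend A x against the true ratings r on the validation split.
    # Only this scalar is revealed to the optimizer; r itself stays hidden.
    n = len(r)
    sq_err = 0.0
    for row, r_i in zip(A, r):
        pred = sum(a_ij * x_j for a_ij, x_j in zip(row, x))
        sq_err += (pred - r_i) ** 2
    return sq_err / n

# Tiny illustrative instance (2 samples, 3 models): an equal-weight blend
# of the first two models reproduces the true ratings exactly.
A = [[2.0, 6.0, 1.0],
     [1.0, 3.0, 7.0]]
r = [4.0, 2.0]
rmse = blend_objective(A, r, [0.5, 0.5, 0.0]) ** 0.5  # -> 0.0
```

A zeroth-order method only ever calls `blend_objective`; it never reads A or r directly, which mirrors the leaderboard-submission setting described above.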
For completeness, Figure 2 in the supplement shows the square root of the objective function value (RMSE) against the number of iterations and the running time. After about 150 seconds, our algorithm running with 10 threads achieves an RMSE of 21.1241 on our test set. Our results are comparable to those of the KDD-Cup winners, as shown in Table 3. Since our goal is to show the performance of our algorithm, we assume we can "submit" our solution x an unlimited number of times, which is unrealistic in a real contest like KDD-Cup. However, even with very few iterations, our algorithm converges quickly to a reasonably small RMSE, as shown in Figure 3.

5 Conclusion

In this paper, we provide a generic linear speedup analysis for zeroth-order and first-order asynchronous parallel algorithms. Our generic analysis can recover or improve the existing results on special cases, such as ASCD, ASGD (parameter server implementation), and ASGD (multicore implementation). Our generic analysis also suggests a novel ASZD algorithm with guaranteed convergence rate and speedup property. To the best of our knowledge, this is the first asynchronous parallel zeroth-order algorithm. The experiments include a novel application of the proposed ASZD method to model blending and hyperparameter tuning for big data optimization.

Acknowledgements

This project is in part supported by the NSF grant CNS-1548078. We especially thank Chen-Tse Tsai for providing the code and data for the Yahoo Music Competition.

References

[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. NIPS, 2011.

[2] A. Agarwal, D. P. Foster, D. J. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. In NIPS, pages 1035–1043, 2011.

[3] H. Avron, A. Druinsky, and A. Gupta. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. Journal of the ACM (JACM), 62(6):51, 2015.

[4] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin.
A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.

[5] S. Chaturapruek, J. C. Duchi, and C. Ré. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In NIPS, pages 1531–1539, 2015.

[6] P.-L. Chen, C.-T. Tsai, Y.-N. Chen, K.-C. Chou, C.-L. Li, C.-H. Tsai, K.-W. Wu, Y.-C. Chou, C.-Y. Li, W.-S. Lin, et al. A linear ensemble of individual and blended models for music rating prediction. In KDDCup, 2012.

[7] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. NIPS, 2012.

[8] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! Music dataset and KDD-Cup'11. In KDDCup, pages 8–18, 2012.

[9] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.

[10] J. Fellus, D. Picard, and P. H. Gosselin. Asynchronous gossip principal components analysis. Neurocomputing, 2015. doi: 10.1016/j.neucom.2014.11.076.

[11] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. arXiv, 2015.

[12] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[13] K. Gimpel, D. Das, and N. A. Smith. Distributed asynchronous online learning for natural language processing. In CoNLL, pages 213–222, 2010.

[14] M. Hong. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM based approach. arXiv:1412.6058, 2014.

[15] C. Hsieh, H. Yu, and I. S. Dhillon. PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent. In ICML, pages 2370–2379, 2015.

[16] K. G. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. In NIPS, 2012.

[17] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. OSDI, 2014.

[18] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, pages 2719–2727, 2015.

[19] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. arXiv:1403.3862, 2014.

[20] J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. ICML, 2014.

[21] J. Liu, S. J. Wright, and S. Sridhar. An asynchronous parallel randomized Kaczmarz algorithm. arXiv, 2014.

[22] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv:1507.06970, 2015.

[23] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[24] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, pages 1–40, 2011.

[25] F. Niu, B. Recht, C. Re, and S. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. NIPS, 2011.

[26] T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang. GPU asynchronous stochastic gradient descent to speed up neural network training. NIPS, 2013.

[27] Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: an algorithmic framework for asynchronous parallel coordinate updates. arXiv, 2015.

[28] F. Petroni and L. Querzoni. GASGD: stochastic gradient descent for distributed asynchronous matrix completion via graph partitioning. ACM Conference on Recommender Systems, 2014.

[29] S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS, pages 2629–2637, 2015.

[30] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

[31] C. D. Sa, C. Zhang, K. Olukotun, and C. Ré. Taming the wild: A unified analysis of hogwild-style algorithms. In NIPS, pages 2674–2682, 2015.

[32] S. Sridhar, S. Wright, C. Re, J. Liu, V. Bittorf, and C. Zhang. An approximate, efficient LP solver for LP rounding. NIPS, 2013.

[33] J. Wei, W. Dai, A. Kumar, X. Zheng, Q. Ho, and E. P. Xing. Consistent bounded-asynchronous parameter servers for distributed ML. arXiv:1312.7869, 2013.

[34] H. Yun, H.-F. Yu, C.-J. Hsieh, S. Vishwanathan, and I. Dhillon. NOMAD: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. arXiv:1312.0193, 2013.

[35] R. Zhang and J. Kwok. Asynchronous distributed ADMM for consensus optimization. ICML, 2014.

[36] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. arXiv, 2014.

[37] S. Zhao and W. Li. Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. In AAAI, pages 2379–2385, 2016.