{"title": "LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain", "book": "Advances in Neural Information Processing Systems", "page_first": 974, "page_last": 982, "abstract": "We study k-SVD that is to obtain the first k singular vectors of a matrix A. Recently, a few breakthroughs have been discovered on k-SVD: Musco and Musco [1] proved the first gap-free convergence result using the block Krylov method, Shamir [2] discovered the first variance-reduction stochastic method, and Bhojanapalli et al. [3] provided the fastest $O(\\mathsf{nnz}(A) + \\mathsf{poly}(1/\\varepsilon))$-time algorithm using alternating minimization.\r\n\r\nIn this paper, we put forward a new and simple LazySVD framework to improve the above breakthroughs. This framework leads to a faster gap-free method outperforming [1], and the first accelerated and stochastic method outperforming [2]. In the $O(\\mathsf{nnz}(A) + \\mathsf{poly}(1/\\varepsilon))$ running-time regime, LazySVD outperforms [3] in certain parameter regimes without even using alternating minimization.", "full_text": "LazySVD: Even Faster SVD Decomposition\n\nYet Without Agonizing Pain\u2217\n\nZeyuan Allen-Zhu\n\nzeyuan@csail.mit.edu\n\nInstitute for Advanced Study\n\n& Princeton University\n\nYuanzhi Li\n\nyuanzhil@cs.princeton.edu\n\nPrinceton University\n\nAbstract\n\nWe study k-SVD that is to obtain the \ufb01rst k singular vectors of a matrix A.\nRecently, a few breakthroughs have been discovered on k-SVD: Musco and\nMusco [19] proved the \ufb01rst gap-free convergence result using the block Krylov\nmethod, Shamir [21] discovered the \ufb01rst variance-reduction stochastic method, and\nBhojanapalli et al. [7] provided the fastest O(nnz(A) + poly(1/\u03b5))-time algorithm\nusing alternating minimization.\nIn this paper, we put forward a new and simple LazySVD framework to improve\nthe above breakthroughs. 
This framework leads to a faster gap-free method outperforming [19], and the first accelerated and stochastic method outperforming [21]. In the O(nnz(A) + poly(1/ε)) running-time regime, LazySVD outperforms [7] in certain parameter regimes without even using alternating minimization.

1 Introduction

The singular value decomposition (SVD) of a rank-r matrix A ∈ R^{d×n} corresponds to decomposing A = V Σ U^⊤, where V ∈ R^{d×r} and U ∈ R^{n×r} are two column orthonormal matrices, and Σ = diag{σ_1, ..., σ_r} ∈ R^{r×r} is a non-negative diagonal matrix with σ_1 ≥ σ_2 ≥ ··· ≥ σ_r ≥ 0. The columns of V (resp. U) are called the left (resp. right) singular vectors of A, and the diagonal entries of Σ are called the singular values of A. SVD is one of the most fundamental tools used in machine learning, computer vision, statistics, and operations research, and is essentially equivalent to principal component analysis (PCA) up to column averaging.

A rank-k partial SVD, or k-SVD for short, is to find the top k left singular vectors of A or, equivalently, the first k columns of V. Denoting by V_k ∈ R^{d×k} the first k columns of V, and by U_k the first k columns of U, one can define A*_k := V_k V_k^⊤ A = V_k Σ_k U_k^⊤, where Σ_k = diag{σ_1, ..., σ_k}. Under this notation, A*_k is the best rank-k approximation of matrix A in terms of minimizing ‖A − A_k‖ among all rank-k matrices A_k. Here, the norm can be any Schatten-q norm for q ∈ [1, ∞], including the spectral norm (q = ∞) and the Frobenius norm (q = 2), therefore making k-SVD a very powerful tool for information retrieval, data de-noising, and even data compression.

Traditional algorithms to compute SVD essentially run in time O(nd·min{d, n}), which is usually very expensive for big-data scenarios.
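For concreteness, the best rank-k approximation A*_k defined above can be checked numerically against a dense full SVD — whose O(nd·min{d, n}) cost is exactly what the faster methods discussed next avoid. The following Python/NumPy sketch (illustrative only, not part of the paper) verifies the Eckart–Young optimality of the truncated SVD in both the spectral and Frobenius norms:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 100, 80, 5
A = rng.standard_normal((d, n))

# Full SVD; NumPy returns A = U diag(s) Vh, so its first factor is the
# paper's V of left singular vectors.
V, s, Uh = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation A*_k = V_k Sigma_k U_k^T
A_k = V[:, :k] * s[:k] @ Uh[:k]

# Eckart-Young: the spectral error equals sigma_{k+1}, and the Frobenius
# error equals the l2 norm of the tail singular values.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
```

The same optimality holds for every Schatten-q norm, which is why the truncated SVD is the reference point for all approximation guarantees below.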
As for k-SVD, defining gap := (σ_k − σ_{k+1})/σ_k to be the relative k-th eigengap of matrix A, the famous subspace power method and block Krylov method [14] solve k-SVD in time O(gap^{−1}·k·nnz(A)·log(1/ε)) and O(gap^{−0.5}·k·nnz(A)·log(1/ε)) respectively, ignoring lower-order terms. Here, nnz(A) is the number of non-zero elements in matrix A; the more precise running times are stated in Table 1.
Recently, there have been breakthroughs that compute k-SVD faster, from three distinct perspectives.

*The full version of this paper can be found on https://arxiv.org/abs/1607.03463. This paper is partially supported by a Microsoft Research Award, no. 0518584, and an NSF grant, no. CCF-1412958.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

| Paper | Running time | GF? | Stoc? | Acc? |
|---|---|---|---|---|
| subspace PM [19] (×) | Õ(k·nnz(A)/ε + k^2 d/ε) | yes | no | no |
| subspace PM [19] (×) | Õ(k·nnz(A)/gap + k^2 d/gap) | no | no | no |
| block Krylov [19] (×) | Õ(k·nnz(A)/ε^{1/2} + k^2 d/ε + k^3/ε^{3/2}) | yes | no | yes |
| block Krylov [19] (×) | Õ(k·nnz(A)/gap^{1/2} + k^2 d/gap + k^3/gap^{3/2}) | no | no | yes |
| LazySVD, Corollaries 4.3 and 4.4 | Õ(k·nnz(A)/ε^{1/2} + k^2 d/ε^{1/2}) | yes | no | yes |
| LazySVD, Corollaries 4.3 and 4.4 | Õ(k·nnz(A)/gap^{1/2} + k^2 d/gap^{1/2}) | no | no | yes |
| Shamir [21] (×) | Õ(knd + k^4 d/(σ_k^4·gap^2)) (local convergence only) | no | yes | no |
| LazySVD, Corollaries 4.3 and 4.4 | Õ(knd + k·n^{3/4}d/(σ_k^{1/2}·ε^{1/2})), always ≤ Õ(knd + kd/(σ_k^2 ε^2)) | yes | yes | yes |
| LazySVD, Corollaries 4.3 and 4.4 | Õ(knd + k·n^{3/4}d/(σ_k^{1/2}·gap^{1/2})), always ≤ Õ(knd + kd/(σ_k^2 gap^2)) | no | yes | yes |

All GF results above provide (1+ε)‖Δ‖_2 spectral and (1+ε)‖Δ‖_F Frobenius guarantees.

Table 1: Performance comparison among direct methods. Define gap = (σ_k − σ_{k+1})/σ_k ∈ [0, 1]. GF = Gap Free; Stoc = Stochastic; Acc = Accelerated; "(×)" marks a result outperformed by LazySVD. Stochastic results in this table assume ‖a_i‖ ≤ 1 following (1.1).

The first breakthrough is the work of Musco and Musco [19], proving a running time for k-SVD that does not depend on singular value gaps (or any other properties) of A. As highlighted in [19], providing gap-free results had been an open question for decades and is a more reliable goal for practical purposes. Specifically, they proved that the block Krylov method converges in time Õ(k·nnz(A)/ε^{1/2} + k^2 d/ε + k^3/ε^{3/2}), where ε is the multiplicative approximation error.²

The second breakthrough is the work of Shamir [21], providing a fast stochastic k-SVD algorithm. In the stochastic setting, one assumes³ that

    A is given in the form AA^⊤ = (1/n)·Σ_{i=1}^n a_i a_i^⊤ and each a_i ∈ R^d has norm at most 1.   (1.1)

Instead of repeatedly multiplying a vector by the matrix AA^⊤ as in the (subspace) power method, Shamir proposed to use a random rank-1 copy a_i a_i^⊤ to approximate such multiplications. When equipped with very ad-hoc variance-reduction techniques, Shamir showed that the algorithm has better (local) performance than the power method (see Table 1).
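The rank-1 stochastic idea can be illustrated with the simplest non-variance-reduced version: an Oja-style update that replaces the exact product (AA^⊤)w = (1/n)·Σ_i a_i(a_i^⊤w) with a single sampled term. The sketch below (Python/NumPy, with a synthetic planted direction) only shows the mechanism; it is not Shamir's algorithm, which adds variance reduction and needs a warm start:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 2000
u = np.ones(d) / np.sqrt(d)                        # planted leading direction
a = 0.25 * rng.standard_normal((n, d)) + rng.standard_normal((n, 1)) * u
a /= np.maximum(1.0, np.linalg.norm(a, axis=1, keepdims=True))  # enforce ||a_i|| <= 1 as in (1.1)

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for t in range(20000):
    i = rng.integers(n)                            # one random rank-1 sample a_i a_i^T
    w += (5.0 / (t + 100)) * a[i] * (a[i] @ w)     # surrogate for the exact product (A A^T) w
    w /= np.linalg.norm(w)

M = a.T @ a / n                                    # A A^T = (1/n) sum_i a_i a_i^T
lead = np.linalg.eigh(M)[1][:, -1]                 # exact leading eigenvector
assert abs(lead @ w) > 0.9                         # the iterate aligns with it
```

Each iteration touches one d-dimensional sample instead of all of A, which is the source of the `knd`-type running times in Table 1; the price is the variance that variance-reduction techniques are designed to remove.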
Unfortunately, Shamir's result is (1) not gap-free; (2) not accelerated (i.e., it does not match the gap^{−0.5} dependence of block Krylov); and (3) in need of a very accurate warm start that in principle can take a very long time to compute.

The third breakthrough is in obtaining running times of the form Õ(nnz(A) + poly(k, 1/ε)·(n + d)) [7, 8]; see Table 2. We call them NNZ results. To obtain NNZ results, one needs to sub-sample the matrix, and this incurs a poor dependence on ε. For this reason, the polynomial dependence on 1/ε is usually considered the most important factor. In 2015, Bhojanapalli et al. [7] obtained a 1/ε^2-rate NNZ result using alternating minimization. Since 1/ε^2 also shows up in the sampling complexity, we believe the quadratic dependence on ε is tight among NNZ types of algorithms.

All the cited results rely on ad-hoc non-convex optimization techniques together with matrix algebra, which makes the final proofs complicated. Furthermore, Shamir's result [21] only works if a 1/poly(d)-accurate warm start is given, and the time needed to find such a warm start is unclear.

In this paper, we develop a new algorithmic framework to solve k-SVD.
It not only improves the aforementioned breakthroughs, but also relies only on simple convex analysis.

² In this paper, we use the Õ notation to hide possible logarithmic factors in 1/gap, 1/ε, n, d, k, and potentially also in σ_1/σ_{k+1}.
³ This normalization follows the tradition of the stochastic k-SVD and 1-SVD literature [12, 20, 21] in order to state results more cleanly.

| Paper | Running time | Frobenius norm | Spectral norm |
|---|---|---|---|
| [8] | O(nnz(A)) + O(k^2/ε^4·(n+d) + k^3/ε^5) | (1+ε)‖Δ‖_F | (1+ε)‖Δ‖_F |
| [7] | O(nnz(A)) + Õ(k^5(σ_1/σ_k)^2/ε^2·(n+d)) | (1+ε)‖Δ‖_F | ‖Δ‖_2 + ε‖Δ‖_F |
| LazySVD, Theorem 5.1 | O(nnz(A)) + Õ(k^2(σ_1/σ_{k+1})^4/ε^{2.5}·d) | N/A | ‖Δ‖_2 + ε‖Δ‖_F |
| LazySVD, Theorem 5.1 | O(nnz(A)) + Õ(k^2(σ_1/σ_{k+1})^2/ε^2·(n+d)) | N/A | ‖Δ‖_2 + ε‖Δ‖_F |
| LazySVD, Theorem 5.1 | O(nnz(A)) + Õ(k^4(σ_1/σ_{k+1})^{4.5}/ε^{2.5}·d) | (1+ε)‖Δ‖_F | ‖Δ‖_2 + ε‖Δ‖_F |

Table 2: Performance comparison among O(nnz(A) + poly(1/ε)) types of algorithms. Remark: we have not tried hard to improve the dependency with respect to k or (σ_1/σ_{k+1}); see Remark 5.2.

1.1 Our Results and the Settlement of an Open Question

We propose to use an extremely simple framework that we call LazySVD to solve k-SVD:

    LazySVD: perform 1-SVD repeatedly, k times in total.

More specifically, in this framework we first compute the leading singular vector v of A, then left-project (I − vv^⊤)A, and repeat this procedure k times.
Quite surprisingly,

    this seemingly "most-intuitive" approach was widely considered "not a good idea."

In textbooks and research papers, one typically states that LazySVD has a running time that inversely depends on all the intermediate singular value gaps σ_1 − σ_2, ..., σ_k − σ_{k+1} [18, 21]. This dependence makes the algorithm useless if some singular values are close, and was even thought to be necessary [18]. For this reason, textbooks describe only block methods (such as the block power method, block Krylov, and alternating minimization), which find the top k singular vectors together. Musco and Musco [19] posed as an open question the design of "single-vector" methods whose running time does not depend on all the intermediate singular value gaps.

In this paper, we fully answer this open question with novel analyses of this LazySVD framework. In particular, the resulting running time either
• depends on gap^{−0.5}, where gap is the relative singular value gap only between σ_k and σ_{k+1}, or
• depends on ε^{−0.5}, where ε is the approximation ratio (so it is gap-free).
Such dependency matches the best known dependency for block methods.

More surprisingly, by making different choices of the 1-SVD subroutine in this LazySVD framework, we obtain multiple algorithms for different needs (see Tables 1 and 2):
• If accelerated gradient descent or the Lanczos algorithm is used for 1-SVD, we obtain a faster k-SVD algorithm than block Krylov [19].
• If a variance-reduction stochastic method is used for 1-SVD, we obtain the first accelerated stochastic algorithm for k-SVD, and this outperforms Shamir [21].
• If one sub-samples A before applying LazySVD, the running time becomes Õ(nnz(A) + ε^{−2}·poly(k)·d).
This improves upon [7] in certain (but sufficiently interesting) parameter regimes, yet completely avoids the use of alternating minimization.

Finally, besides the running-time advantages above, our analysis is based entirely on convex optimization, because 1-SVD is solvable using convex techniques. LazySVD also works when k is not known to the algorithm, as opposed to block methods, which need to know k in advance.

Other Related Work. Some authors focus on the streaming or online model of 1-SVD [4, 15, 17] or k-SVD [3]. These algorithms are slower than offline methods. Unlike for k-SVD, accelerated stochastic methods were previously known for 1-SVD [12, 13]. After this paper was published, LazySVD was generalized by the same authors to also solve canonical correlation analysis and generalized PCA [1]. If one is only interested in projecting a vector onto the top-k eigenspace, without computing the top k eigenvectors as we do in this paper, this can also be done in an accelerated manner [2].

2 Preliminaries

Given a matrix A, we denote by ‖A‖_2 and ‖A‖_F respectively the spectral and Frobenius norms of A. For q ≥ 1, we denote by ‖A‖_{S_q} the Schatten q-norm of A. We write A ⪰ B if A and B are symmetric and A − B is positive semi-definite (PSD). We denote by λ_k(M) the k-th largest eigenvalue of a symmetric matrix M, and by σ_k(A) the k-th largest singular value of a rectangular matrix A. Since λ_k(AA^⊤) = λ_k(A^⊤A) = (σ_k(A))^2,

    solving k-SVD for A is the same as solving k-PCA for M = AA^⊤.

We denote by σ_1 ≥ ··· ≥ σ_d ≥ 0 the singular values of A ∈ R^{d×n}, and by λ_1 ≥ ··· ≥ λ_d ≥ 0 the eigenvalues of M = AA^⊤ ∈ R^{d×d}.
(Although A may have fewer than d singular values, for instance when n < d; if this happens, we append zeros.) We denote by A*_k the best rank-k approximation of A.
We use ⊥ to denote the orthogonal complement of a matrix. More specifically, given a column orthonormal matrix U ∈ R^{d×k}, we define U^⊥ := {x ∈ R^d | U^⊤x = 0}. For notational simplicity, we sometimes also denote by U^⊥ a d×(d−k) matrix consisting of some basis of U^⊥.

Theorem 2.1 (approximate matrix inverse). Given a d×d matrix M ⪰ 0 and constants λ, δ > 0 satisfying λI − M ⪰ δI, one can minimize the quadratic f(x) := x^⊤(λI − M)x − b^⊤x in order to compute (λI − M)^{−1}b. Suppose the desired accuracy is ‖x − (λI − M)^{−1}b‖ ≤ ε. Then,
• accelerated gradient descent (AGD) produces such an output x in O((λ^{1/2}/δ^{1/2})·log(λ/(εδ))) iterations, each requiring O(d) time plus the time needed to multiply M with a vector;
• if M is given in the form M = (1/n)·Σ_{i=1}^n a_i a_i^⊤ with each ‖a_i‖ ≤ 1, then accelerated SVRG (see for instance [5]) produces such an output x in time O(max{nd, n^{3/4}d·λ^{1/4}/δ^{1/2}}·log(λ/(εδ))).

3 A Specific 1-SVD Algorithm: Shift-and-Inverse Revisited

In this section, we study a specific 1-PCA algorithm, AppxPCA (recall that 1-PCA equals 1-SVD). It is a (multiplicatively) approximate algorithm for computing the leading eigenvector of a symmetric matrix. We emphasize that, in principle, most known 1-PCA algorithms (e.g., the power method or the Lanczos method) are suitable for our LazySVD framework.
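To preview how Theorem 2.1 gets used: shift-and-inverse runs power iteration on (λI − M)^{−1} for a shift λ slightly above λ_1, and each application of the inverse is exactly one quadratic minimization as in the theorem. The Python/NumPy sketch below uses an exact dense solve in place of the inexact subroutine A, and fixes the shift by hand; AppxPCA instead uses AGD or accelerated SVRG for the solves and also searches for a good shift:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 40
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
lams = np.linspace(0.0, 0.9, d)
lams[-1] = 1.0                       # lambda_1 = 1, second eigenvalue ~ 0.877
M = (Q * lams) @ Q.T                 # symmetric with known spectrum, 0 <= M <= I

shift = 1.05                         # a shift slightly above lambda_1
w = rng.standard_normal(d)
for _ in range(30):
    # one oracle call: x = (shift*I - M)^{-1} w, i.e. the minimizer of
    # x^T (shift*I - M) x - 2 w^T x, solved exactly here for illustration
    w = np.linalg.solve(shift * np.eye(d) - M, w)
    w /= np.linalg.norm(w)

# power iteration on the inverse amplifies the top direction by a factor
# ((shift - lambda_2)/(shift - lambda_1))^30, which is astronomically large here
assert abs(Q[:, -1] @ w) > 0.999
```

The point of the shift is conditioning: the closer λ is to λ_1, the better the eigenvalue separation of (λI − M)^{−1}, but the harder each linear solve becomes — which is precisely the trade-off quantified in Theorem 3.1 below.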
We choose AppxPCA solely because it provides the maximum flexibility in obtaining all the stochastic / NNZ running-time results at once.
Our AppxPCA uses the shift-and-inverse routine [12, 13], and our pseudo-code in Algorithm 1 is a modification of Algorithm 5 of [12]. Since we need a more refined running-time statement with a multiplicative error guarantee, and since the stated proof in [12] is anyway only a sketch, we choose to carefully reprove a similar result of [12] and state the following theorem:

Theorem 3.1 (AppxPCA). Let M ∈ R^{d×d} be a symmetric matrix with eigenvalues 1 ≥ λ_1 ≥ ··· ≥ λ_d ≥ 0 and corresponding eigenvectors u_1, ..., u_d. With probability at least 1 − p, AppxPCA produces an output w satisfying

    Σ_{i ∈ [d], λ_i ≤ (1−δ_×)λ_1} (w^⊤u_i)^2 ≤ ε   and   w^⊤Mw ≥ (1 − δ_×)(1 − ε)λ_1.

Furthermore, the total number of oracle calls to A is O(log(1/δ_×)·m_1 + m_2), and each time we call A we have λ^{(s)}/λ_min(λ^{(s)}I − M) ≤ 12/δ_× and 1/λ_min(λ^{(s)}I − M) ≤ 12/(δ_× λ_1).

Since AppxPCA reduces 1-PCA to oracle calls to a matrix inversion subroutine A, the stated conditions in Theorem 3.1, together with the complexity results for matrix inversion (see Theorem 2.1), imply the following running times for AppxPCA:

Corollary 3.2.
• If A is AGD, the running time of AppxPCA is Õ(1/δ_×^{1/2}) multiplied by O(d) plus the time needed to multiply M with a vector.
• If M = (1/n)·Σ_{i=1}^n a_i a_i^⊤ where each ‖a_i‖ ≤ 1, and A is accelerated SVRG, then the total running time of AppxPCA is Õ(max{nd, n^{3/4}d·λ_1^{1/4}/δ_×^{1/2}}).

Algorithm 1 AppxPCA(A, M, δ_×, ε, p)
Input: A, an approximate matrix inversion method; M ∈ R^{d×d}, a symmetric matrix satisfying 0 ⪯ M ⪯ I; δ_× ∈ (0, 0.5], a multiplicative error; ε ∈ (0, 1), a numerical accuracy parameter; and p ∈ (0, 1), a confidence parameter. ▷ The running time depends only logarithmically on 1/ε and 1/p. (AppxPCA is used only for proving our theoretical results; practitioners, feel free to use your favorite 1-PCA algorithm, such as Lanczos, in its place.)
 1: m_1 ← ⌈4 log(36d/p^2)⌉ and m_2 ← ⌈4 log(288d/(p^2 ε))⌉; ▷ m_1 = T^PM(8, 1/32, p) and m_2 = T^PM(2, ε/4, p) using the definition in Lemma A.1
 2: ε̃_1 ← (1/(64 m_1))·(δ_×/6)^{m_1} and ε̃_2 ← (ε/(8 m_2))·(δ_×/6)^{m_2};
 3: ŵ_0 ← a random unit vector; s ← 0; λ^{(0)} ← 1 + δ_×;
 4: repeat
 5:   s ← s + 1;
 6:   for t = 1 to m_1 do
 7:     apply A to find ŵ_t satisfying ‖ŵ_t − (λ^{(s−1)}I − M)^{−1} ŵ_{t−1}‖ ≤ ε̃_1;
 8:   w ← ŵ_{m_1}/‖ŵ_{m_1}‖;
 9:   apply A to find v satisfying ‖v − (λ^{(s−1)}I − M)^{−1} w‖ ≤ ε̃_1;
10:   Δ^{(s)} ← (1/2)·(w^⊤v − ε̃_1)^{−1} and λ^{(s)} ← λ^{(s−1)} − Δ^{(s)}/2;
11: until Δ^{(s)} ≤ δ_× λ^{(s)}/3;
12: f ← s;
13: for t = 1 to m_2 do
14:   apply A to find ŵ_t satisfying ‖ŵ_t − (λ^{(f)}I − M)^{−1} ŵ_{t−1}‖ ≤ ε̃_2;
15: return w := ŵ_{m_2}/‖ŵ_{m_2}‖.

Algorithm 2 LazySVD(A, M, k, δ_×, ε_pca, p)
Input: A, an approximate matrix inversion method; M ∈ R^{d×d}, a matrix satisfying 0 ⪯ M ⪯ I; k ∈ [d], the desired rank; δ_× ∈ (0, 1), a multiplicative error; ε_pca ∈ (0, 1), a numerical accuracy parameter; and p ∈ (0, 1), a confidence parameter.
 1: M_0 ← M and V_0 ← [];
 2: for s = 1 to k do
 3:   v′_s ← AppxPCA(A, M_{s−1}, δ_×/2, ε_pca, p/k); ▷ to practitioners: use your favorite 1-PCA algorithm, such as Lanczos, to compute v′_s
 4:   v_s ← (I − V_{s−1}V_{s−1}^⊤)v′_s / ‖(I − V_{s−1}V_{s−1}^⊤)v′_s‖; ▷ project v′_s onto V_{s−1}^⊥
 5:   V_s ← [V_{s−1}, v_s];
 6:   M_s ← (I − v_s v_s^⊤)M_{s−1}(I − v_s v_s^⊤); ▷ we also have M_s = (I − V_s V_s^⊤)M(I − V_s V_s^⊤)
 7: end for
 8: return V_k.

4 Main Algorithm and Theorems

Our algorithm LazySVD is stated in Algorithm 2. It starts with M_0 = M and applies AppxPCA k times. In the s-th iteration, it computes an approximate leading eigenvector of matrix M_{s−1} using AppxPCA with multiplicative error δ_×/2, projects M_{s−1} onto the orthogonal complement of this vector, and calls the resulting matrix M_s.
In this stated form, LazySVD finds approximately the top k eigenvectors of a symmetric matrix M ∈ R^{d×d}.
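A minimal executable rendering of this outer loop is below (Python/NumPy sketch; an exact dense eigensolver stands in for AppxPCA, so the δ_×, ε_pca, and p bookkeeping of Algorithm 2 is omitted):

```python
import numpy as np

def lazy_svd(M, k):
    """k rounds of (1-PCA, projection, deflation) on a symmetric 0 <= M <= I."""
    d = M.shape[0]
    V = np.zeros((d, 0))
    Ms = M.copy()
    for _ in range(k):
        v = np.linalg.eigh(Ms)[1][:, -1]   # 1-PCA step (AppxPCA in the paper)
        v -= V @ (V.T @ v)                 # project onto the complement of span(V)
        v /= np.linalg.norm(v)
        V = np.column_stack([V, v])
        P = np.eye(d) - np.outer(v, v)
        Ms = P @ Ms @ P                    # deflate: M_s = (I - v v^T) M_{s-1} (I - v v^T)
    return V

rng = np.random.default_rng(3)
B = rng.standard_normal((60, 60))
M = B @ B.T
M /= np.linalg.norm(M, 2)                  # normalize so that 0 <= M <= I

V = lazy_svd(M, 5)
top = np.linalg.eigh(M)[1][:, -5:]         # exact top-5 eigenvectors
# with an exact 1-PCA subroutine, V spans exactly the top-5 eigenspace
assert np.allclose(np.linalg.norm(top.T @ V, axis=0), 1.0)
```

The substance of the paper's analysis is showing that this loop remains correct — with no dependence on the intermediate gaps σ_1 − σ_2, ..., σ_{k−1} − σ_k — when the exact 1-PCA call is replaced by the multiplicatively approximate AppxPCA.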
If M is given as M = AA^⊤, then LazySVD automatically finds the k-SVD of A.

4.1 Our Core Theorems

We state our approximation and running-time core theorems of LazySVD below, and then provide corollaries that translate them into gap-dependent and gap-free theorems on k-SVD.

Theorem 4.1 (approximation). Let M ∈ R^{d×d} be a symmetric matrix with eigenvalues 1 ≥ λ_1 ≥ ··· ≥ λ_d ≥ 0 and corresponding eigenvectors u_1, ..., u_d. Let k ∈ [d], let δ_×, p ∈ (0, 1), and let ε_pca ≤ poly(ε, δ_×, 1/d, λ_1/λ_{k+1}).⁴ Then, LazySVD outputs a (column) orthonormal matrix V_k = (v_1, ..., v_k) ∈ R^{d×k} which, with probability at least 1 − p, satisfies all of the following properties. (Denote by M_k = (I − V_k V_k^⊤)M(I − V_k V_k^⊤).)
(a) Core lemma: ‖V_k^⊤ U‖_2 ≤ ε, where U = (u_j, ..., u_d) is the (column) orthonormal matrix and j is the smallest index satisfying λ_j ≤ (1 − δ_×)‖M_{k−1}‖_2.
(b) Spectral norm guarantee: λ_{k+1} ≤ ‖M_k‖_2 ≤ λ_{k+1}/(1 − δ_×).
(c) Rayleigh quotient guarantee: (1 − δ_×)λ_k ≤ v_k^⊤ M v_k ≤ λ_k/(1 − δ_×).
(d) Schatten-q norm guarantee: for every q ≥ 1, we have ‖M_k‖_{S_q} ≤ ((1+δ_×)^2/(1−δ_×)^2)·(Σ_{i=k+1}^d λ_i^q)^{1/q}.

We defer the proof of Theorem 4.1 to the full version, which also contains a section highlighting the technical ideas behind the proof. Below we state the running time of LazySVD.

Theorem 4.2 (running time). LazySVD can be implemented to run in time
• Õ((k·nnz(M) + k^2 d)/δ_×^{1/2}) if A is AGD and M ∈ R^{d×d} is given explicitly;
• Õ((k·nnz(A) + k^2 d)/δ_×^{1/2}) if A is AGD and M is given as M = AA^⊤ where A ∈ R^{d×n}; or
• Õ(knd + k·n^{3/4}d/(λ_k^{1/4}·δ_×^{1/2})) if A is accelerated SVRG and M = (1/n)·Σ_{i=1}^n a_i a_i^⊤ where each ‖a_i‖ ≤ 1.
Above, the Õ notation hides logarithmic factors with respect to k, d, 1/δ_×, 1/p, 1/λ_1, and λ_1/λ_k.

Proof of Theorem 4.2. We call AppxPCA k times, and each time we can feed M_{s−1} = (I − V_{s−1}V_{s−1}^⊤)M(I − V_{s−1}V_{s−1}^⊤) implicitly into AppxPCA; thus the time needed to multiply M_{s−1} with a d-dimensional vector is O(dk + nnz(M)) or O(dk + nnz(A)). Here, the O(dk) overhead is due to the projection of a vector onto V_{s−1}^⊥. This proves the first two running times using Corollary 3.2.
To obtain the third running time, when we compute M_s from M_{s−1}, we explicitly project a′_i ← (I − v_s v_s^⊤)a_i for each vector a_i, and feed the new a′_1, ..., a′_n into AppxPCA. Now the running time follows from the second part of Corollary 3.2 together with the fact that ‖M_{s−1}‖_2 ≥ ‖M_{k−1}‖_2 ≥ λ_k. □

4.2 Our Main Results for k-SVD

Our main theorems imply the following corollaries (proved in the full version of this paper).

Corollary 4.3 (gap-dependent k-SVD). Let A ∈ R^{d×n} be a matrix with singular values 1 ≥ σ_1 ≥ ··· ≥ σ_d ≥ 0 and corresponding left singular vectors u_1, ..., u_d ∈ R^d. Let gap = (σ_k − σ_{k+1})/σ_k be the relative gap. For fixed ε, p > 0, consider the output

    V_k ← LazySVD(A, AA^⊤, k, gap, O(ε^4·gap^2/(k^4(σ_1/σ_k)^4)), p).

Then, defining W = (u_{k+1}, ..., u_d), we have with probability at least 1 − p:

    V_k is a rank-k (column) orthonormal matrix with ‖V_k^⊤ W‖_2 ≤ ε.

Our running time is Õ((k·nnz(A) + k^2 d)/gap^{1/2}), or Õ(knd + k·n^{3/4}d/(σ_k^{1/2}·gap^{1/2})) in the stochastic setting (1.1). Above, both running times depend only poly-logarithmically on 1/ε.

Corollary 4.4 (gap-free k-SVD). Let A ∈ R^{d×n} be a matrix with singular values 1 ≥ σ_1 ≥ ··· ≥ σ_d ≥ 0. For fixed ε, p > 0, consider the output

    (v_1, ..., v_k) = V_k ← LazySVD(A, AA^⊤, k, ε/3, O(ε^6/(k^4 d^4 (σ_1/σ_{k+1})^{12})), p).

Then, defining A_k = V_k V_k^⊤ A, which is a rank-k matrix, we have with probability at least 1 − p:
1. Spectral norm guarantee: ‖A − A_k‖_2 ≤ (1 + ε)‖A − A*_k‖_2;
2. Frobenius norm guarantee: ‖A − A_k‖_F ≤ (1 + ε)‖A − A*_k‖_F; and
3. Rayleigh quotient guarantee: ∀i ∈ [k], |v_i^⊤ AA^⊤ v_i − σ_i^2| ≤ ε σ_i^2.
The running time is Õ((k·nnz(A) + k^2 d)/ε^{1/2}), or Õ(knd + k·n^{3/4}d/(σ_k^{1/2}·ε^{1/2})) in the stochastic setting (1.1).

⁴ The detailed specifications of ε_pca can be found in the appendix, where we restate the theorem more formally. To provide the simplest proof, we have not tightened the polynomial factors in the theoretical upper bound on ε_pca, because the running time depends only logarithmically on 1/ε_pca.

Remark 4.5.
The spectral and Frobenius guarantees are standard, and the spectral guarantee is more desirable than the Frobenius one in practice [19]. In fact, our algorithm implies, for all q ≥ 1, ‖A − A_k‖_{S_q} ≤ (1 + ε)‖A − A*_k‖_{S_q}, where ‖·‖_{S_q} is the Schatten-q norm. The Rayleigh-quotient guarantee was introduced by Musco and Musco [19] for a more refined comparison. They showed that the block Krylov method satisfies |v_i^⊤ AA^⊤ v_i − σ_i^2| ≤ ε σ_{k+1}^2, which is slightly stronger than ours. However, the two guarantees are not much different in practice, as we evidence in the experiments.

5 NNZ Running Time

In this section, we translate the results of the previous section into O(nnz(A) + poly(k, 1/ε)·(n + d)) running-time statements. The idea is surprisingly simple: we sample either random columns of A or random entries of A, and then apply LazySVD to compute the k-SVD. Such a translation directly gives either 1/ε^{2.5} results, if AGD is used as the convex subroutine together with either column or entry sampling, or a 1/ε^2 result, if accelerated SVRG and column sampling are used together.
We only informally state our theorem here and defer all details to the full paper.

Theorem 5.1 (informal). Let A ∈ R^{d×n} be a matrix with singular values σ_1 ≥ ··· ≥ σ_d ≥ 0. For every ε ∈ (0, 1/2), one can apply LazySVD with an appropriately chosen δ_× to a "carefully sub-sampled version" of A.
Then, the resulting matrix V ∈ R^{d×k} can satisfy
• spectral norm guarantee: ‖A − VV^⊤A‖_2 ≤ ‖A − A*_k‖_2 + ε‖A − A*_k‖_F;⁵ and
• Frobenius norm guarantee: ‖A − VV^⊤A‖_F ≤ (1 + ε)‖A − A*_k‖_F.
The total running time depends on (1) whether column or entry sampling is used, (2) which matrix-inversion routine A is used, and (3) whether the spectral or the Frobenius guarantee is needed. We list the deduced results in Table 2; the formal statements can be found in the full version of this paper.

Remark 5.2. The main purpose of our NNZ results is to demonstrate the strength of the LazySVD framework in improving the ε dependency to 1/ε^2. Since the 1/ε^2 rate matches the sampling complexity, it is very challenging to obtain an NNZ result with 1/ε^2 dependency.⁶ We have not tried hard, and believe it possible, to improve the polynomial dependence with respect to k or (σ_1/σ_{k+1}).

6 Experiments

We demonstrate the practicality of our LazySVD framework by comparing it to the block power method and the block Krylov method. We emphasize that, in theory, the best worst-case complexity for 1-PCA is obtained by AppxPCA on top of accelerated SVRG. However, for the size of our chosen datasets, the Lanczos method runs faster than AppxPCA, and we therefore adopt the Lanczos method as the 1-PCA subroutine in our LazySVD framework.⁷

Datasets. We use the datasets SNAP/amazon0302, SNAP/email-enron, and news20, which were also used by Musco and Musco [19], as well as an additional, famous dataset, RCV1. The first two can be found on the SNAP website [16] and the last two on the LibSVM website [11]. The four datasets give rise to sparse matrices of dimensions 257570×262111, 35600×16507, 11269×53975, and 20242×47236, respectively.

⁵ This is the best known spectral guarantee one can obtain using NNZ running time [7].
It is an open question whether the stricter ‖A − VV^⊤A‖_2 ≤ (1 + ε)‖A − A*_k‖_2 type of spectral guarantee is possible.
⁶ On one hand, one can use dimension reduction such as [9] to reduce the problem size to O(k/ε^2); to the best of our knowledge, it is impossible to obtain any NNZ result faster than 1/ε^3 using solely dimension reduction. On the other hand, obtaining the 1/ε^2 dependency was the main contribution of [7]: they relied on alternating minimization, which we have avoided in our paper.
⁷ Our LazySVD framework turns every 1-PCA method satisfying Theorem 3.1 (including the Lanczos method) into a k-SVD solver. However, our theoretical results (especially the stochastic and NNZ ones) rely on AppxPCA, because Lanczos is not a stochastic method.

Figure 1: Selected performance plots; relative error (y-axis) vs. running time (x-axis). Panels: (a) amazon, k = 20, spectral; (b) news, k = 20, spectral; (c) news, k = 20, rayleigh; (d) email, k = 10, Fnorm; (e) rcv1, k = 30, Fnorm; (f) rcv1, k = 30, rayleigh(last).

Implemented Algorithms. For the block Krylov method, it is a well-known issue that the Lanczos-type three-term-recurrence update is numerically unstable. This is why Musco and Musco [19] used only the stable variant of block Krylov, which requires an orthogonalization of each n×k matrix against all previously obtained n×k matrices. This greatly improves numerical stability, albeit at a cost in running time. We implement both variants. In sum, we have implemented:
• PM: the block power method for T iterations.
• Krylov: the stable block Krylov method for T iterations [19].
• Krylov(unstable): the three-term-recurrence implementation of block Krylov for T iterations.
• LazySVD: k calls of the vanilla Lanczos method, where each call runs T iterations.

A Fair Running-Time Comparison.
For a fixed integer T, the four methods go through the dataset (in terms of multiplying A with column vectors) the same number of times. However, since LazySVD needs neither block orthogonalization (as needed in PM and Krylov) nor a final (Tk)-dimensional SVD computation (as needed in Krylov), its running time is clearly much smaller for a fixed value of T. We therefore compare the performance of the four methods in terms of running time rather than T.
We programmed the four algorithms using the same programming language with the same sparse-matrix implementation, and tested them single-threaded on the same Intel i7-3770 3.40GHz personal computer. As for the final low-dimensional SVD decomposition step at the end of the PM or Krylov method (which is not needed for our LazySVD), we used a third-party library built upon the x64 Intel Math Kernel Library, so the time needed for that SVD is maximally reduced.
Performance Metrics. We compute four metrics on the output V = (v₁, . . . , v_k) ∈ R^{n×k}:
• Fnorm: relative Frobenius norm error: (‖A − VV⊤A‖_F − ‖A − A*_k‖_F) / ‖A − A*_k‖_F.
• spectral: relative spectral norm error: (‖A − VV⊤A‖₂ − ‖A − A*_k‖₂) / ‖A − A*_k‖₂.
• rayleigh(last): Rayleigh quotient error relative to σ_{k+1}: max_{j=1,...,k} |σ_j² − v_j⊤AA⊤v_j| / σ_{k+1}².
• rayleigh: relative Rayleigh quotient error: max_{j=1,...,k} |σ_j² − v_j⊤AA⊤v_j| / σ_j².
The first three metrics were also used by Musco and Musco [19]. We added the fourth one because our theory only predicts convergence with respect to the fourth but not the third metric. However, we observe that in practice they are not much different from each other.
Our Results.
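For reference, all four metrics can be computed directly on small matrices against a dense SVD baseline. The sketch below is ours (the function name `four_metrics` and the brute-force baseline are our illustration, not the experiment code); `sigma` holds the exact singular values of A, so `sigma[k]` is σ_{k+1} in 0-indexed form.

```python
import numpy as np

def four_metrics(A, V, k):
    """Compute the four relative-error metrics for an output
    V = (v_1, ..., v_k), using a dense SVD for the exact singular
    values (feasible only for small matrices)."""
    A = np.asarray(A, dtype=float)
    sigma = np.linalg.svd(A, compute_uv=False)
    R = A - V @ (V.T @ A)                       # residual A - V V^T A
    fro_opt = np.sqrt(np.sum(sigma[k:] ** 2))   # ||A - A*_k||_F
    spec_opt = sigma[k]                         # ||A - A*_k||_2
    fnorm = (np.linalg.norm(R, 'fro') - fro_opt) / fro_opt
    spectral = (np.linalg.norm(R, 2) - spec_opt) / spec_opt
    X = A.T @ V                                 # column j is A^T v_j
    quad = np.einsum('ij,ij->j', X, X)          # v_j^T A A^T v_j
    gap = np.abs(sigma[:k] ** 2 - quad)
    rayleigh_last = np.max(gap) / sigma[k] ** 2
    rayleigh = np.max(gap / sigma[:k] ** 2)
    return fnorm, spectral, rayleigh_last, rayleigh
```

When V consists of the exact top-k left singular vectors, all four metrics evaluate to (numerically) zero.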
We study the four datasets, each with k = 10, 20, 30 and with the four performance metrics, totaling 48 plots. Due to space limitations, we select only six representative plots out of the 48 and include them in Figure 1. (The full plots can be found in Figures 2, 3, 4 and 5 in the appendix.) We make the following observations:
• LazySVD outperforms its three competitors almost universally.
• Krylov(unstable) outperforms Krylov for small values of T; however, it is less useful for obtaining accurate solutions due to its instability. (The dotted green curves even go up when T is large.)
• The block power method (PM) performs the slowest, unsurprisingly, due to its lack of acceleration.

References
[1] Zeyuan Allen-Zhu and Yuanzhi Li. Doubly Accelerated Methods for Faster CCA and Generalized Eigendecomposition. ArXiv e-prints, abs/1607.06017, July 2016.
[2] Zeyuan Allen-Zhu and Yuanzhi Li. Faster Principal Component Regression via Optimal Polynomial Approximation to sgn(x). ArXiv e-prints, abs/1608.04773, August 2016.
[3] Zeyuan Allen-Zhu and Yuanzhi Li. First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate. ArXiv e-prints, abs/1607.07837, July 2016.
[4] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the Compressed Leader: Faster Algorithm for Matrix Multiplicative Weight Updates. ArXiv e-prints, abs/1701.01722, January 2017.
[5] Zeyuan Allen-Zhu and Yang Yuan.
Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In ICML, 2016.
[6] Sanjeev Arora, Satish Rao, and Umesh V. Vazirani. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM, 56(2), 2009.
[7] Srinadh Bhojanapalli, Prateek Jain, and Sujay Sanghavi. Tighter Low-rank Approximation via Sampling the Leveraged Element. In SODA, pages 902–920, 2015.
[8] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC, pages 81–90, 2013.
[9] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In STOC, pages 163–172. ACM, 2015.
[10] Petros Drineas and Anastasios Zouzias. A Note on Element-wise Matrix Sparsification via a Matrix-valued Bernstein Inequality. ArXiv e-prints, abs/1006.0407, January 2011.
[11] Rong-En Fan and Chih-Jen Lin. LIBSVM Data: Classification, Regression and Multi-label. Accessed: 2015-06.
[12] Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. ArXiv e-prints, September 2015.
[13] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In ICML, 2016.
[14] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The JHU Press, 4th edition, 2012.
[15] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja's Algorithm. In COLT, 2016.
[16] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[17] Chris J. Li, Mengdi Wang, Han Liu, and Tong Zhang.
Near-Optimal Stochastic Approximation for Online Principal Component Estimation. ArXiv e-prints, abs/1603.05305, March 2016.
[18] Ren-Cang Li and Lei-Hong Zhang. Convergence of the block Lanczos method for eigenvalue clusters. Numerische Mathematik, 131(1):83–113, 2015.
[19] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In NIPS, pages 1396–1404, 2015.
[20] Ohad Shamir. A Stochastic PCA and SVD Algorithm with an Exponential Convergence Rate. In ICML, pages 144–153, 2015.
[21] Ohad Shamir. Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. In ICML, 2016.
[22] Joel A. Tropp. An Introduction to Matrix Concentration Inequalities. ArXiv e-prints, abs/1501.01571, January 2015.