{"title": "Efficient Second Order Online Learning by Sketching", "book": "Advances in Neural Information Processing Systems", "page_first": 902, "page_last": 910, "abstract": "We propose Sketched Online Newton (SON), an online second order learning algorithm that enjoys substantially improved regret guarantees for ill-conditioned data. SON is an enhanced version of the Online Newton Step, which, via sketching techniques enjoys a running time linear in the dimension and sketch size.  We further develop sparse forms of the sketching methods (such as Oja's rule), making the computation linear in the sparsity of features. Together, the algorithm eliminates all computational obstacles in previous second order online learning approaches.", "full_text": "Ef\ufb01cient Second Order Online Learning by Sketching\n\nHaipeng Luo\n\nAlekh Agarwal\n\nPrinceton University, Princeton, NJ USA\n\nMicrosoft Research, New York, NY USA\n\nhaipengl@cs.princeton.edu\n\nalekha@microsoft.com\n\nNicol\u00f2 Cesa-Bianchi\n\nUniversit\u00e0 degli Studi di Milano, Italy\n\nnicolo.cesa-bianchi@unimi.it\n\nJohn Langford\n\nMicrosoft Research, New York, NY USA\n\njcl@microsoft.com\n\nAbstract\n\nWe propose Sketched Online Newton (SON), an online second order learning\nalgorithm that enjoys substantially improved regret guarantees for ill-conditioned\ndata. SON is an enhanced version of the Online Newton Step, which, via sketching\ntechniques enjoys a running time linear in the dimension and sketch size. We\nfurther develop sparse forms of the sketching methods (such as Oja\u2019s rule), making\nthe computation linear in the sparsity of features. Together, the algorithm eliminates\nall computational obstacles in previous second order online learning approaches.\n\n1\n\nIntroduction\n\nOnline learning methods are highly successful at rapidly reducing the test error on large, high-\ndimensional datasets. 
First order methods are particularly attractive in such problems as they typically\nenjoy computational complexity linear in the input size. However, the convergence of these methods\ncrucially depends on the geometry of the data; for instance, running the same algorithm on a rotated\nset of examples can return vastly inferior results. See Fig. 1 for an illustration.\nSecond order algorithms such as Online Newton Step [18] have the attractive property of being\ninvariant to linear transformations of the data, but typically require space and update time quadratic\nin the number of dimensions. Furthermore, the dependence on dimension is not improved even\nif the examples are sparse. These issues lead to the key question in our work: Can we develop\n(approximately) second order online learning algorithms with ef\ufb01cient updates? We show that\nthe answer is \u201cyes\u201d by developing ef\ufb01cient sketched second order methods with regret guarantees.\nSpeci\ufb01cally, the three main contributions of this work are:\n\n1. Invariant learning setting and optimal algorithms (Section 2). The typical online regret\nminimization setting evaluates against a benchmark that is bounded in some \ufb01xed norm (such as the\n(cid:96)2-norm), implicitly putting the problem in a nice geometry. However, if all the features are scaled\ndown, it is desirable to compare with accordingly larger weights, which is precluded by an apriori\n\ufb01xed norm bound. We study an invariant learning setting similar to the paper [33] which compares\nthe learner to a benchmark only constrained to generate bounded predictions on the sequence of\nexamples. We show that a variant of the Online Newton Step [18], while quadratic in computation,\nstays regret-optimal with a nearly matching lower bound in this more general setting.\n\n2. Improved ef\ufb01ciency via sketching (Section 3). 
To overcome the quadratic running time, we\nnext develop sketched variants of the Newton update, approximating the second order information\nusing a small number of carefully chosen directions, called a sketch. While the idea of data sketching\nis widely studied [36], as far as we know our work is the \ufb01rst one to apply it to a general adversarial\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fonline learning setting and provide rigorous regret guarantees. Three different sketching methods are\nconsidered: Random Projections [1, 19], Frequent Directions [12, 23], and Oja\u2019s algorithm [28, 29],\nall of which allow linear running time per round. For the \ufb01rst two methods, we prove regret bounds\nsimilar to the full second order update whenever the sketch-size is large enough. Our analysis makes\nit easy to plug in other sketching and online PCA methods (e.g. [11]).\n\n3. Sparse updates (Section 4). For practical\nimplementation, we further develop sparse ver-\nsions of these updates with a running time linear\nin the sparsity of the examples. The main chal-\nlenge here is that even if examples are sparse,\nthe sketch matrix still quickly becomes dense.\nThese are the \ufb01rst known sparse implementa-\ntions of the Frequent Directions1 and Oja\u2019s algo-\nrithm, and require new sparse eigen computation\nroutines that may be of independent interest.\nEmpirically, we evaluate our algorithm using\nthe sparse Oja sketch (called Oja-SON) against\n\ufb01rst order methods such as diagonalized ADA-\nGRAD [6, 25] on both ill-conditioned synthetic\nand a suite of real-world datasets. As Fig. 1\nshows for a synthetic problem, we observe sub-\nstantial performance gains as data conditioning\nworsens. 
On the real-world datasets, we find improvements in some instances, while observing no substantial second-order signal in the others.\n\nFigure 1: Error rate of SON using Oja\u2019s sketch, and ADAGRAD on a synthetic ill-conditioned problem. m is the sketch size (m = 0 is Online Gradient, m = d resembles Online Newton). SON is nearly invariant to condition number for m = 10.\n[Figure 1 plot: x-axis condition number, y-axis error rate; curves for AdaGrad and Oja-SON with m = 0, 5, 10.]\n\nRelated work Our online learning setting is closest to the one proposed in [33], which studies scale-invariant algorithms, a special case of the invariance property considered here (see also [31, Section 5]). Computational efficiency, a main concern in this work, is not a problem there since each coordinate is scaled independently. Orabona and P\u00e1l [30] study unrelated notions of invariance. Gao et al. [9] study a specific randomized sketching method for a special online learning setting.\nThe L-BFGS algorithm [24] has recently been studied in the stochastic setting (see footnote 2) [3, 26, 27, 34, 35], but it relies on strong assumptions, has pessimistic rates in theory, and depends on large mini-batches empirically. Recent works [7, 15, 14, 32] employ sketching in stochastic optimization, but do not provide sparse implementations or extend in an obvious manner to the online setting. The Frank-Wolfe algorithm [8, 20] is also invariant to linear transformations, but with worse regret bounds [17] without further assumptions and modifications [10].\n\nNotation Vectors are represented by bold letters (e.g., x, w, . . . ) and matrices by capital letters (e.g., M, A, . . . ). M_{i,j} denotes the (i, j) entry of matrix M. I_d represents the d \u00d7 d identity matrix, 0_{m\u00d7d} represents the m \u00d7 d matrix of zeroes, and diag{x} represents a diagonal matrix with x on the diagonal. \u03bb_i(A) denotes the i-th largest eigenvalue of A, \u2016w\u2016_A denotes \u221a(w^\u22a4 A w), |A| is the determinant of A, TR(A) is the trace of A, \u27e8A, B\u27e9 denotes \u2211_{i,j} A_{i,j} B_{i,j}, and A \u2aaf B means that B \u2212 A is positive semidefinite. The sign function SGN(a) is 1 if a \u2265 0 and \u22121 otherwise.\n\nFootnote 1: Recent work by [13] also studies sparse updates for a more complicated variant of Frequent Directions which is randomized and incurs extra approximation error.\nFootnote 2: Stochastic setting assumes that the examples are drawn i.i.d. from a distribution.\n\n2 Setup and an Optimal Algorithm\n\nWe consider the following setting. On each round t = 1, 2, . . . , T: (1) the adversary first presents an example x_t \u2208 R^d, (2) the learner chooses w_t \u2208 R^d and predicts w_t^\u22a4 x_t, (3) the adversary reveals a loss function f_t(w) = \u2113_t(w^\u22a4 x_t) for some convex, differentiable \u2113_t : R \u2192 R_+, and (4) the learner suffers loss f_t(w_t) for this round.\n\nThe learner\u2019s regret to a comparator w is defined as R_T(w) = \u2211_{t=1}^T f_t(w_t) \u2212 \u2211_{t=1}^T f_t(w). Typical results study R_T(w) against all w with a bounded norm in some geometry. For an invariant update, we relax this requirement and only put bounds on the predictions w^\u22a4 x_t. Specifically, for some pre-chosen constant C we define K_t := {w : |w^\u22a4 x_t| \u2264 C}. We seek to minimize regret to all comparators that generate bounded predictions on every data point, that is:\n\nR_T = sup_{w \u2208 K} R_T(w), where K := \u22c2_{t=1}^T K_t = {w : \u2200t = 1, 2, . . . , T, |w^\u22a4 x_t| \u2264 C}.\n\nUnder this setup, if the data are transformed to M x_t for all t and some invertible matrix M \u2208 R^{d\u00d7d}, the optimal w\u2217 simply moves to (M^{\u22121})^\u22a4 w\u2217, which still has bounded predictions but might have significantly larger norm.\n
This relaxation is similar to the comparator set considered in [33].\nWe make two structural assumptions on the loss functions.\nAssumption 1. (Scalar Lipschitz) The loss function \u2113_t satisfies |\u2113\u2032_t(z)| \u2264 L whenever |z| \u2264 C.\nAssumption 2. (Curvature) There exists \u03c3_t \u2265 0 such that for all u, w \u2208 K, f_t(w) is lower bounded by f_t(u) + \u2207f_t(u)^\u22a4 (w \u2212 u) + (\u03c3_t / 2) (\u2207f_t(u)^\u22a4 (u \u2212 w))^2.\nNote that when \u03c3_t = 0, Assumption 2 merely imposes convexity. More generally, it is satisfied by squared loss f_t(w) = (w^\u22a4 x_t \u2212 y_t)^2 with \u03c3_t = 1/(8C^2) whenever |w^\u22a4 x_t| and |y_t| are bounded by C, as well as for all exp-concave functions (see [18, Lemma 3]).\nEnlarging the comparator set might result in worse regret. We next show matching upper and lower bounds qualitatively similar to the standard setting, but with an extra unavoidable \u221ad factor (see footnote 3).\nTheorem 1. For any online algorithm generating w_t \u2208 R^d and all T \u2265 d, there exists a sequence of T examples x_t \u2208 R^d and loss functions \u2113_t satisfying Assumptions 1 and 2 (with \u03c3_t = 0) such that the regret R_T is at least CL \u221a(dT/2).\n\nWe now give an algorithm that matches the lower bound up to logarithmic constants in the worst case but enjoys much smaller regret when \u03c3_t \u2260 0. At round t + 1 with some invertible matrix A_t specified later and gradient g_t = \u2207f_t(w_t), the algorithm performs the following update before making the prediction on the example x_{t+1}:\n\nu_{t+1} = w_t \u2212 A_t^{\u22121} g_t, and w_{t+1} = argmin_{w \u2208 K_{t+1}} \u2016w \u2212 u_{t+1}\u2016_{A_t}.    (1)\n\nThe projection onto the set K_{t+1} differs from typical norm-based projections as it only enforces boundedness on x_{t+1} at round t + 1. Moreover, this projection step can be performed in closed form.\nLemma 1.\n
For any x \u2260 0, u \u2208 R^d and positive definite matrix A \u2208 R^{d\u00d7d}, we have\n\nargmin_{w : |w^\u22a4 x| \u2264 C} \u2016w \u2212 u\u2016_A = u \u2212 (\u03c4_C(u^\u22a4 x) / (x^\u22a4 A^{\u22121} x)) A^{\u22121} x, where \u03c4_C(y) = SGN(y) max{|y| \u2212 C, 0}.\n\nIf A_t is a diagonal matrix, updates similar to those of Ross et al. [33] are recovered. We study a choice of A_t that is similar to the Online Newton Step (ONS) [18] (though with different projections):\n\nA_t = \u03b1 I_d + \u2211_{s=1}^t (\u03c3_s + \u03b7_s) g_s g_s^\u22a4    (2)\n\nfor some parameters \u03b1 > 0 and \u03b7_t \u2265 0. The regret guarantee of this algorithm is shown below:\nTheorem 2. Under Assumptions 1 and 2, suppose that \u03c3_t \u2265 \u03c3 \u2265 0 for all t, and \u03b7_t is non-increasing. Then using the matrices (2) in the updates (1) yields for all w \u2208 K,\n\nR_T(w) \u2264 (\u03b1/2) \u2016w\u2016_2^2 + 2(CL)^2 \u2211_{t=1}^T \u03b7_t + (d / (2(\u03c3 + \u03b7_T))) ln(1 + (\u03c3 + \u03b7_T) \u2211_{t=1}^T \u2016g_t\u2016_2^2 / (d\u03b1)).\n\nFootnote 3: In the standard setting where w_t and x_t are restricted such that \u2016w_t\u2016 \u2264 D and \u2016x_t\u2016 \u2264 X, the minimax regret is O(DXL\u221aT). This is clearly a special case of our setting with C = DX.\n\nAlgorithm 1 Sketched Online Newton (SON)\nInput: Parameters C, \u03b1 and m.\n1: Initialize u_1 = 0_{d\u00d71}.\n2: Initialize sketch (S, H) \u2190 SketchInit(\u03b1, m).\n3: for t = 1 to T do\n4:   Receive example x_t.\n5:   Projection step: compute x\u0302 = S x_t, \u03b3 = \u03c4_C(u_t^\u22a4 x_t) / (x_t^\u22a4 x_t \u2212 x\u0302^\u22a4 H x\u0302) and set w_t = u_t \u2212 \u03b3 (x_t \u2212 S^\u22a4 H x\u0302).\n6:   Predict label y_t = w_t^\u22a4 x_t and suffer loss \u2113_t(y_t).\n7:   Compute gradient g_t = \u2113\u2032_t(y_t) x_t and the to-sketch vector \u011d = \u221a(\u03c3_t + \u03b7_t) g_t.\n8:   (S, H) \u2190 SketchUpdate(\u011d).\n9:   Update weight: u_{t+1} = w_t \u2212 (1/\u03b1) (g_t \u2212 S^\u22a4 H S g_t).\n10: end for\n\nThe dependence on \u2016w\u2016_2^2 implies that the method is not completely invariant to transformations of the data. This is due to the part \u03b1 I_d in A_t. However, this is not critical since \u03b1 is fixed and small while the other part of the bound grows to eventually become the dominating term. Moreover, we can even set \u03b1 = 0 and replace the inverse with the Moore-Penrose pseudoinverse to obtain a truly invariant algorithm, as discussed in Appendix D. We use \u03b1 > 0 in the remainder for simplicity.\nThe implication of this regret bound is the following: in the worst case where \u03c3 = 0, we set \u03b7_t = \u221a(d / (C^2 L^2 t)) and the bound simplifies to\n\nR_T(w) \u2264 (\u03b1/2) \u2016w\u2016_2^2 + (CL/2) \u221a(Td) ln(1 + \u2211_{t=1}^T \u2016g_t\u2016_2^2 / (\u03b1 CL \u221a(Td))) + 4CL \u221a(Td),\n\nessentially only losing a logarithmic factor compared to the lower bound in Theorem 1.\n
On the other hand, if \u03c3_t \u2265 \u03c3 > 0 for all t, then we set \u03b7_t = 0 and the regret simplifies to\n\nR_T(w) \u2264 (\u03b1/2) \u2016w\u2016_2^2 + (d / (2\u03c3)) ln(1 + \u03c3 \u2211_{t=1}^T \u2016g_t\u2016_2^2 / (d\u03b1)),    (3)\n\nextending the O(d ln T) results in [18] to the weaker Assumption 2 and a larger comparator set K.\n\n3 Efficiency via Sketching\n\nOur algorithm so far requires \u2126(d^2) time and space just as ONS. In this section we show how to achieve regret guarantees nearly as good as the above bounds, while keeping computation within a constant factor of first order methods.\n\nLet G_t \u2208 R^{t\u00d7d} be a matrix such that the t-th row is \u011d_t^\u22a4 where we define \u011d_t = \u221a(\u03c3_t + \u03b7_t) g_t to be the to-sketch vector. Our previous choice of A_t (Eq. (2)) can be written as \u03b1 I_d + G_t^\u22a4 G_t. The idea of sketching is to maintain an approximation of G_t, denoted by S_t \u2208 R^{m\u00d7d} where m \u226a d is a small constant called the sketch size. If m is chosen so that S_t^\u22a4 S_t approximates G_t^\u22a4 G_t well, we can redefine A_t as \u03b1 I_d + S_t^\u22a4 S_t for the algorithm.\nTo see why this admits an efficient algorithm, notice that by the Woodbury formula one has A_t^{\u22121} = (1/\u03b1) (I_d \u2212 S_t^\u22a4 (\u03b1 I_m + S_t S_t^\u22a4)^{\u22121} S_t). With the notation H_t = (\u03b1 I_m + S_t S_t^\u22a4)^{\u22121} \u2208 R^{m\u00d7m} and \u03b3_t = \u03c4_C(u_{t+1}^\u22a4 x_{t+1}) / (x_{t+1}^\u22a4 x_{t+1} \u2212 x_{t+1}^\u22a4 S_t^\u22a4 H_t S_t x_{t+1}), update (1) becomes:\n\nu_{t+1} = w_t \u2212 (1/\u03b1) (g_t \u2212 S_t^\u22a4 H_t S_t g_t), and w_{t+1} = u_{t+1} \u2212 \u03b3_t (x_{t+1} \u2212 S_t^\u22a4 H_t S_t x_{t+1}).\n\nThe operations involving S_t g_t or S_t x_{t+1} require only O(md) time, while matrix vector products with H_t require only O(m^2). Altogether, these updates are at most m times more expensive than first order algorithms as long as S_t and H_t can be maintained efficiently. We call this algorithm Sketched Online Newton (SON) and summarize it in Algorithm 1.\nWe now discuss three sketching techniques to maintain the matrices S_t and H_t efficiently, each requiring O(md) storage and time linear in d.\n\nAlgorithm 2 FD-Sketch for FD-SON\nInternal State: S and H.\nSketchInit(\u03b1, m)\n1: Set S = 0_{m\u00d7d} and H = (1/\u03b1) I_m.\n2: Return (S, H).\nSketchUpdate(\u011d)\n1: Insert \u011d into the last row of S.\n2: Compute eigendecomposition: V^\u22a4 \u03a3 V = S^\u22a4 S and set S = (\u03a3 \u2212 \u03a3_{m,m} I_m)^{1/2} V.\n3: Set H = diag{1/(\u03b1 + \u03a3_{1,1} \u2212 \u03a3_{m,m}), \u00b7\u00b7\u00b7 , 1/\u03b1}.\n4: Return (S, H).\n\nAlgorithm 3 Oja\u2019s Sketch for Oja-SON\nInternal State: t, \u039b, V and H.\nSketchInit(\u03b1, m)\n1: Set t = 0, \u039b = 0_{m\u00d7m}, H = (1/\u03b1) I_m and V to any m \u00d7 d matrix with orthonormal rows.\n2: Return (0_{m\u00d7d}, H).\nSketchUpdate(\u011d)\n1: Update t \u2190 t + 1, \u039b and V as Eqn. 4.\n2: Set S = (t\u039b)^{1/2} V.\n3: Set H = diag{1/(\u03b1 + t\u039b_{1,1}), \u00b7\u00b7\u00b7 , 1/(\u03b1 + t\u039b_{m,m})}.\n4: Return (S, H).\n\nRandom Projection (RP). Random projections are classical methods for sketching [19, 1, 21]. Here we consider Gaussian Random Projection sketch: S_t = S_{t\u22121} + r_t \u011d_t^\u22a4, where each entry of r_t \u2208 R^m is an independent random Gaussian variable drawn from N(0, 1/\u221am). One can verify that the update of H_t^{\u22121} can be realized by two rank-one updates: H_t^{\u22121} = H_{t\u22121}^{\u22121} + q_t r_t^\u22a4 + r_t q_t^\u22a4 where q_t = S_t \u011d_t \u2212 (\u2016\u011d_t\u2016_2^2 / 2) r_t.\n
Using the Woodbury formula, this results in O(md) update of S and H (see Algorithm 6 in Appendix E). We call this combination of SON with RP-sketch RP-SON. When \u03b1 = 0 this algorithm is invariant to linear transformations for each fixed realization of the randomness. Using the existing guarantees for RP-sketch, in Appendix E we show a regret bound similar to Theorem 2 up to constants, provided m = \u2126\u0303(r) where r is the rank of G_T. Therefore RP-SON is near invariant, and gives substantial computational gains when r \u226a d with small regret overhead.\n\nFrequent Directions (FD). When G_T is near full-rank, however, RP-SON may not perform well. To address this, we consider Frequent Directions (FD) sketch [12, 23], a deterministic sketching method. FD maintains the invariant that the last row of S_t is always 0. On each round, the vector \u011d_t^\u22a4 is inserted into the last row of S_{t\u22121}, then the covariance of the resulting matrix is eigendecomposed into V_t^\u22a4 \u03a3_t V_t and S_t is set to (\u03a3_t \u2212 \u03c1_t I_m)^{1/2} V_t where \u03c1_t is the smallest eigenvalue. Since the rows of S_t are orthogonal to each other, H_t is a diagonal matrix and can be maintained efficiently (see Algorithm 2). The sketch update works in O(md) time (see [12] and Appendix G.2) so the total running time is O(md) per round. We call this combination FD-SON and prove the following regret bound with notation \u2126_k = \u2211_{i=k+1}^d \u03bb_i(G_T^\u22a4 G_T) for any k = 0, . . . , m \u2212 1.\nTheorem 3. Under Assumptions 1 and 2, suppose that \u03c3_t \u2265 \u03c3 \u2265 0 for all t and \u03b7_t is non-increasing. FD-SON ensures that for any w \u2208 K and k = 0, . . . , m \u2212 1, we have\n\nR_T(w) \u2264 (\u03b1/2) \u2016w\u2016_2^2 + 2(CL)^2 \u2211_{t=1}^T \u03b7_t + (m / (2(\u03c3 + \u03b7_T))) ln(1 + TR(S_T^\u22a4 S_T) / (m\u03b1)) + m\u2126_k / (2(m \u2212 k)(\u03c3 + \u03b7_T)\u03b1).\n\nInstead of the rank, the bound depends on the spectral decay \u2126_k, which essentially is the only extra term compared to the bound in Theorem 2. Similarly to previous discussion, if \u03c3_t \u2265 \u03c3, we get the bound (\u03b1/2) \u2016w\u2016_2^2 + (m / (2\u03c3)) ln(1 + TR(S_T^\u22a4 S_T) / (m\u03b1)) + m\u2126_k / (2(m \u2212 k)\u03c3\u03b1). With \u03b1 tuned well, we pay logarithmic regret for the top m eigenvectors, but a square root regret O(\u221a\u2126_k) for remaining directions not controlled by our sketch. This is expected for deterministic sketching which focuses on the dominant part of the spectrum. When \u03b1 is not tuned we still get sublinear regret as long as \u2126_k is sublinear.\n\nOja\u2019s Algorithm. Oja\u2019s algorithm [28, 29] is not usually considered as a sketching algorithm but seems very natural here. This algorithm uses online gradient descent to find eigenvectors and eigenvalues of data in a streaming fashion, with the to-sketch vectors \u011d_t as the input. Specifically, let V_t \u2208 R^{m\u00d7d} denote the estimated eigenvectors and the diagonal matrix \u039b_t \u2208 R^{m\u00d7m} contain the estimated eigenvalues at the end of round t. Oja\u2019s algorithm updates as:\n\n\u039b_t = (I_m \u2212 \u0393_t) \u039b_{t\u22121} + \u0393_t diag{V_{t\u22121} \u011d_t}^2, V_t \u2190orth V_{t\u22121} + \u0393_t V_{t\u22121} \u011d_t \u011d_t^\u22a4    (4)\n\nwhere \u0393_t \u2208 R^{m\u00d7m} is a diagonal matrix with (possibly different) learning rates of order \u0398(1/t) on the diagonal, and the \u201c\u2190orth\u201d operator represents an orthonormalizing step (see footnote 4). The sketch is then S_t = (t\u039b_t)^{1/2} V_t.\n
The rows of S_t are orthogonal and thus H_t is an efficiently maintainable diagonal matrix (see Algorithm 3). We call this combination Oja-SON.\nThe time complexity of Oja\u2019s algorithm is O(m^2 d) per round due to the orthonormalizing step. To improve the running time to O(md), one can update the sketch only every m rounds (similar to the block power method [16, 22]). The regret guarantee of this algorithm is unclear since existing analysis for Oja\u2019s algorithm is only for the stochastic setting (see e.g. [2, 22]). However, Oja-SON provides good performance experimentally.\n\n4 Sparse Implementation\nIn many applications, examples (and hence gradients) are sparse in the sense that \u2016x_t\u2016_0 \u2264 s for all t and some small constant s \u226a d. Most online first order methods enjoy a per-example running time depending on s instead of d in such settings. Achieving the same for second order methods is more difficult since A_t^{\u22121} g_t (or sketched versions) is typically dense even if g_t is sparse.\nWe show how to implement our algorithms in sparsity-dependent time, specifically, in O(m^2 + ms) for RP-SON and FD-SON and in O(m^3 + ms) for Oja-SON. We emphasize that since the sketch would still quickly become a dense matrix even if the examples are sparse, achieving purely sparsity-dependent time is highly non-trivial (especially for FD-SON and Oja-SON), and may be of independent interest. Due to space limits, below we only briefly mention how to do it for Oja-SON. Similar discussion for the other two sketches can be found in Appendix G.\n
Note that mathematically these updates are equivalent to the non-sparse counterparts and regret guarantees are thus unchanged.\nThere are two ingredients to doing this for Oja-SON: (1) The eigenvectors V_t are represented as V_t = F_t Z_t, where Z_t \u2208 R^{m\u00d7d} is a sparsely updatable direction (Step 3 in Algorithm 5) and F_t \u2208 R^{m\u00d7m} is a matrix such that F_t Z_t is orthonormal. (2) The weights w_t are split as w\u0304_t + Z_{t\u22121}^\u22a4 b_t, where b_t \u2208 R^m maintains the weights on the subspace captured by V_{t\u22121} (same as Z_{t\u22121}), and w\u0304_t captures the weights on the complementary subspace which are again updated sparsely.\nWe describe the sparse updates for w\u0304_t and b_t below with the details for F_t and Z_t deferred to Appendix H. Since S_t = (t\u039b_t)^{1/2} V_t = (t\u039b_t)^{1/2} F_t Z_t and w_t = w\u0304_t + Z_{t\u22121}^\u22a4 b_t, we know u_{t+1} is\n\nw_t \u2212 (I_d \u2212 S_t^\u22a4 H_t S_t) g_t/\u03b1 = [w\u0304_t \u2212 g_t/\u03b1 \u2212 (Z_t \u2212 Z_{t\u22121})^\u22a4 b_t] + Z_t^\u22a4 [b_t + (1/\u03b1) F_t^\u22a4 (t\u039b_t H_t) F_t Z_t g_t],    (5)\n\nwhere the first bracket defines \u016b_{t+1} and the second bracket defines b\u2032_{t+1}. Since Z_t \u2212 Z_{t\u22121} is sparse by construction and the matrix operations defining b\u2032_{t+1} scale with m, overall the update can be done in O(m^2 + ms). Using the update for w_{t+1} in terms of u_{t+1}, w_{t+1} is equal to\n\nu_{t+1} \u2212 \u03b3_t (I_d \u2212 S_t^\u22a4 H_t S_t) x_{t+1} = [\u016b_{t+1} \u2212 \u03b3_t x_{t+1}] + Z_t^\u22a4 [b\u2032_{t+1} + \u03b3_t F_t^\u22a4 (t\u039b_t H_t) F_t Z_t x_{t+1}],    (6)\n\nwhere the first bracket defines w\u0304_{t+1} and the second bracket defines b_{t+1}.\nAgain, it is clear that all the computations scale with s and not d, so both w\u0304_{t+1} and b_{t+1} require only O(m^2 + ms) time to maintain.\n
Furthermore, the prediction w_t^\u22a4 x_t = w\u0304_t^\u22a4 x_t + b_t^\u22a4 Z_{t\u22121} x_t can also be computed in O(ms) time. The O(m^3) in the overall complexity comes from a Gram-Schmidt step in maintaining F_t (details in Appendix H).\nThe pseudocode is presented in Algorithms 4 and 5 with some details deferred to Appendix H. This is the first sparse implementation of online eigenvector computation to the best of our knowledge.\n\nAlgorithm 4 Sparse Sketched Online Newton with Oja\u2019s Algorithm\nInput: Parameters C, \u03b1 and m.\n1: Initialize \u016b = 0_{d\u00d71} and b = 0_{m\u00d71}.\n2: (\u039b, F, Z, H) \u2190 SketchInit(\u03b1, m) (Algorithm 5).\n3: for t = 1 to T do\n4:   Receive example x_t.\n5:   Projection step: compute x\u0302 = F Z x_t and \u03b3 = \u03c4_C(\u016b^\u22a4 x_t + b^\u22a4 Z x_t) / (x_t^\u22a4 x_t \u2212 (t \u2212 1) x\u0302^\u22a4 \u039b H x\u0302).\n     Obtain w\u0304 = \u016b \u2212 \u03b3 x_t and b \u2190 b + \u03b3 (t \u2212 1) F^\u22a4 \u039b H x\u0302 (Equation 6).\n6:   Predict label y_t = w\u0304^\u22a4 x_t + b^\u22a4 Z x_t and suffer loss \u2113_t(y_t).\n7:   Compute gradient g_t = \u2113\u2032_t(y_t) x_t and the to-sketch vector \u011d = \u221a(\u03c3_t + \u03b7_t) g_t.\n8:   (\u039b, F, Z, H, \u03b4) \u2190 SketchUpdate(\u011d) (Algorithm 5).\n9:   Update weight: \u016b = w\u0304 \u2212 (1/\u03b1) g_t \u2212 (\u03b4^\u22a4 b) \u011d and b \u2190 b + (1/\u03b1) t F^\u22a4 \u039b H F Z g_t (Equation 5).\n10: end for\n\nAlgorithm 5 Sparse Oja\u2019s Sketch\nInternal State: t, \u039b, F, Z, H and K.\nSketchInit(\u03b1, m)\n1: Set t = 0, \u039b = 0_{m\u00d7m}, F = K = \u03b1H = I_m and Z to any m \u00d7 d matrix with orthonormal rows.\n2: Return (\u039b, F, Z, H).\nSketchUpdate(\u011d)\n1: Update t \u2190 t + 1. Pick a diagonal stepsize matrix \u0393_t to update \u039b \u2190 (I \u2212 \u0393_t) \u039b + \u0393_t diag{F Z \u011d}^2.\n2: Set \u03b4 = A^{\u22121} \u0393_t F Z \u011d and update K \u2190 K + \u03b4 \u011d^\u22a4 Z^\u22a4 + Z \u011d \u03b4^\u22a4 + (\u011d^\u22a4 \u011d) \u03b4 \u03b4^\u22a4.\n3: Update Z \u2190 Z + \u03b4 \u011d^\u22a4.\n4: (L, Q) \u2190 Decompose(F, K) (Algorithm 13), so that LQZ = F Z and QZ is orthogonal. Set F = Q.\n5: Set H \u2190 diag{1/(\u03b1 + t\u039b_{1,1}), \u00b7\u00b7\u00b7 , 1/(\u03b1 + t\u039b_{m,m})}.\n6: Return (\u039b, F, Z, H, \u03b4).\n\nFootnote 4: For simplicity, we assume that V_{t\u22121} + \u0393_t V_{t\u22121} \u011d_t \u011d_t^\u22a4 is always of full rank so that the orthonormalizing step does not reduce the dimension of V_t.\n\n5 Experiments\n\nPreliminary experiments revealed that out of our three sketching options, Oja\u2019s sketch generally has better performance (see Appendix I). For more thorough evaluation, we implemented the sparse version of Oja-SON in Vowpal Wabbit (see footnote 5). We compare it with ADAGRAD [6, 25] on both synthetic and real-world datasets. Each algorithm takes a stepsize parameter: 1/\u03b1 serves as a stepsize for Oja-SON and a scaling constant on the gradient matrix for ADAGRAD. We try both methods with the parameter set to 2^j for j = \u22123, \u22122, . . . , 6 and report the best results. We keep the stepsize matrix in Oja-SON fixed as \u0393_t = (1/t) I_m throughout. All methods make one online pass over data minimizing square loss.\n\n5.1 Synthetic Datasets\n\nTo investigate Oja-SON\u2019s performance in the setting it is really designed for, we generated a range of synthetic ill-conditioned datasets as follows. We picked a random Gaussian matrix Z \u2208 R^{T\u00d7d} (T = 10,000 and d = 100) and a random orthonormal basis V \u2208 R^{d\u00d7d}. We chose a specific spectrum \u03bb \u2208 R^d where the first d \u2212 10 coordinates are 1 and the rest increase linearly to some fixed condition number parameter \u03ba.\n
We let X = Z diag{\u03bb}^{1/2} V^\u22a4 be our example matrix, and created a binary classification problem with labels y = sign(\u03b8^\u22a4 x), where \u03b8 \u2208 R^d is a random vector. We generated 20 such datasets with the same Z, V and labels y but different values of \u03ba \u2208 {10, 20, . . . , 200}. Note that if the algorithm is truly invariant, it would have the same behavior on these 20 datasets.\nFig. 1 (in Section 1) shows the final progressive error (i.e. fraction of misclassified examples after one pass over data) for ADAGRAD and Oja-SON (with sketch size m = 0, 5, 10) as the condition number increases. As expected, the plot confirms that the performance of first order methods such as ADAGRAD degrades when the data is ill-conditioned. The plot also shows that as the sketch size increases, Oja-SON becomes more accurate: when m = 0 (no sketch at all), Oja-SON is vanilla gradient descent and is worse than ADAGRAD as expected; when m = 5, the accuracy greatly improves; and finally when m = 10, the accuracy of Oja-SON is substantially better and hardly worsens with \u03ba.\n\nFootnote 5: An open source machine learning toolkit available at http://hunch.net/~vw\n\nFigure 2: Oja\u2019s algorithm\u2019s eigenvalue recovery error.\n\nFigure 3: (a) Comparison of two sketch sizes on real data, and (b) Comparison against ADAGRAD on real data.\n\nTo further explain the effectiveness of Oja\u2019s algorithm in identifying top eigenvalues and eigenvectors, the plot in Fig. 2 shows the largest relative difference between the true and estimated top 10 eigenvalues as Oja\u2019s algorithm sees more data. This gap drops quickly after seeing just 500 examples.\n\n5.2 Real-world Datasets\n\nNext we evaluated Oja-SON on 23 benchmark datasets from the UCI and LIBSVM repository (see Appendix I for description of these datasets). Note that some datasets are very high dimensional but very sparse (e.g. for 20news, d \u2248 102,000 and s \u2248 94), and consequently methods with running time quadratic (such as ONS) or even linear in dimension rather than sparsity are prohibitive.\nIn Fig. 3(a), we show the effect of using sketched second order information, by comparing sketch size m = 0 and m = 10 for Oja-SON (concrete error rates in Appendix I). We observe significant improvements in 5 datasets (acoustic, census, heart, ionosphere, letter), demonstrating the advantage of using second order information. However, we found that Oja-SON was outperformed by ADAGRAD on most datasets, mostly because the diagonal adaptation of ADAGRAD greatly reduces the condition number on these datasets. Moreover, one disadvantage of SON is that for the directions not in the sketch, it is essentially doing vanilla gradient descent. We expect better results using diagonal adaptation as in ADAGRAD in off-sketch directions.\nTo incorporate this high level idea, we performed a simple modification to Oja-SON: upon seeing example x_t, we feed D_t^{\u22121/2} x_t to our algorithm instead of x_t, where D_t \u2208 R^{d\u00d7d} is the diagonal part of the matrix \u2211_{\u03c4=1}^{t\u22121} g_\u03c4 g_\u03c4^\u22a4 (see footnote 6). The intuition is that this diagonal rescaling first homogenizes the scales of all dimensions. Any remaining ill-conditioning is further addressed by the sketching to some degree, while the complementary subspace is no worse-off than with ADAGRAD. We believe this flexibility in picking the right vectors to sketch is an attractive aspect of our sketching-based approach.\nWith this modification, Oja-SON outperforms ADAGRAD on most of the datasets even for m = 0, as shown in Fig. 3(b) (concrete error rates in Appendix I). The improvement on ADAGRAD at m = 0 is surprising but not impossible as the updates are not identical; our update is scale invariant like Ross et al. [33].\n
However, the diagonal adaptation already greatly reduces the condition number\non all datasets except splice (see Fig. 4 in Appendix I for detailed results on this dataset), so little\nimprovement is seen for sketch size m = 10 over m = 0. For several datasets, we verified the\naccuracy of Oja\u2019s method in computing the top-few eigenvalues (Appendix I), so the lack of difference\nbetween sketch sizes is due to the lack of second order information after the diagonal correction.\nOn average, our algorithm with m = 10 runs about 11 times slower than ADAGRAD,\nmatching expectations. Overall, SON can significantly outperform baselines on ill-conditioned data,\nwhile maintaining a practical computational complexity.\n\nAcknowledgements. This work was done when Haipeng Luo and Nicol\u00f2 Cesa-Bianchi were at\nMicrosoft Research, New York.\n\n6 D_1 is defined as 0.1 \u00d7 I_d to avoid division by zero.\n\nFigure 2 plot (axes: number of examples, relative eigenvalue difference; curves: \u03ba=50, \u03ba=100, \u03ba=150, \u03ba=200).\nFigure 3 plots (first panel, m = 0 vs m=10, axes: error rate of Oja-SON (m=0), error rate of Oja-SON (m=10); second panel, axes: error rate of AdaGrad, error rate of Oja-SON; legend: AdaGrad vs Oja-SON (m=0), AdaGrad vs Oja-SON (m=10)).\n\nReferences\n[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671\u2013687, 2003.\n[2] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In NIPS, 2013.\n[3] R. H. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26:1008\u20131031, 2016.\n[4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n[5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. 
SIAM Journal on Computing, 34(3):640\u2013668, 2005.\n[6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121\u20132159, 2011.\n[7] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled Newton methods. In NIPS, 2015.\n[8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95\u2013110, 1956.\n[9] W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, 2013.\n[10] D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. SIAM Journal on Optimization, 26:1493\u20131528, 2016.\n[11] D. Garber, E. Hazan, and T. Ma. Online learning of eigenvectors. In ICML, 2015.\n[12] M. Ghashami, E. Liberty, J. M. Phillips, and D. P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45:1762\u20131792, 2015.\n[13] M. Ghashami, E. Liberty, and J. M. Phillips. Efficient frequent directions algorithm for sparse matrices. In KDD, 2016.\n[14] A. Gonen and S. Shalev-Shwartz. Faster SGD using sketched conditioning. arXiv:1506.02649, 2015.\n[15] A. Gonen, F. Orabona, and S. Shalev-Shwartz. Solving ridge regression using sketched preconditioned SVRG. In ICML, 2016.\n[16] M. Hardt and E. Price. The noisy power method: A meta algorithm with applications. In NIPS, 2014.\n[17] E. Hazan and S. Kale. Projection-free online learning. In ICML, 2012.\n[18] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169\u2013192, 2007.\n[19] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, 1998.\n[20] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.\n[21] D. M. Kane and J. Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1):4, 2014.\n[22] C.-L. Li, H.-T. Lin, and C.-J. Lu. Rivalry of two families of algorithms for memory-restricted streaming PCA. arXiv:1506.01490, 2015.\n[23] E. Liberty. Simple and deterministic matrix sketching. In KDD, 2013.\n[24] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503\u2013528, 1989.\n[25] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.\n[26] A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. JMLR, 16:3151\u20133181, 2015.\n[27] P. Moritz, R. Nishihara, and M. I. Jordan. A linearly-convergent stochastic L-BFGS algorithm. In AISTATS, 2016.\n[28] E. Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267\u2013273, 1982.\n[29] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69\u201384, 1985.\n[30] F. Orabona and D. P\u00e1l. Scale-free algorithms for online linear optimization. In ALT, 2015.\n[31] F. Orabona, K. Crammer, and N. Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411\u2013435, 2015.\n[32] M. Pilanci and M. J. Wainwright. Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. arXiv:1505.02250, 2015.\n[33] S. Ross, P. Mineiro, and J. Langford. Normalized online learning. In UAI, 2013.\n[34] N. N. Schraudolph, J. Yu, and S. G\u00fcnter. A stochastic quasi-Newton method for online convex optimization. In AISTATS, 2007.\n[35] J. Sohl-Dickstein, B. Poole, and S. Ganguli. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In ICML, 2014.\n[36] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Machine Learning, 10(1-2):1\u2013157, 2014.\n", "award": [], "sourceid": 555, "authors": [{"given_name": "Haipeng", "family_name": "Luo", "institution": "Princeton University"}, {"given_name": "Alekh", "family_name": "Agarwal", "institution": "Microsoft"}, {"given_name": "Nicol\u00f2", "family_name": "Cesa-Bianchi", "institution": "University of Milan"}, {"given_name": "John", "family_name": "Langford", "institution": "Microsoft Research New York"}]}