{"title": "Feature-distributed sparse regression: a screen-and-clean approach", "book": "Advances in Neural Information Processing Systems", "page_first": 2712, "page_last": 2720, "abstract": "Most existing approaches to distributed sparse regression assume the data is partitioned by samples. However, for high-dimensional data (D >> N), it is more natural to partition the data by features. We propose an algorithm to distributed sparse regression when the data is partitioned by features rather than samples. Our approach allows the user to tailor our general method to various distributed computing platforms by trading-off the total amount of data (in bits) sent over the communication network and the number of rounds of communication. We show that an implementation of our approach is capable of solving L1-regularized L2 regression problems with millions of features in minutes.", "full_text": "Feature-distributed sparse regression: a\n\nscreen-and-clean approach\n\nJiyan Yang\u2020 Michael W. Mahoney\u2021 Michael A. Saunders\u2020 Yuekai Sun\u00a7\n\n\u2021 University of California at Berkeley\n\n\u00a7 University of Michigan\n\n\u2020 Stanford University\n\njiyan@stanford.edu\n\nmmahoney@stat.berkeley.edu\n\nsaunders@stanford.edu\n\nyuekai@umich.edu\n\nAbstract\n\nMost existing approaches to distributed sparse regression assume the data is par-\ntitioned by samples. However, for high-dimensional data (D (cid:29) N), it is more\nnatural to partition the data by features. We propose an algorithm to distributed\nsparse regression when the data is partitioned by features rather than samples.\nOur approach allows the user to tailor our general method to various distributed\ncomputing platforms by trading-off the total amount of data (in bits) sent over the\ncommunication network and the number of rounds of communication. 
We show that an implementation of our approach is capable of solving ℓ1-regularized ℓ2 regression problems with millions of features in minutes.

1 Introduction

Explosive growth in the size of modern datasets has fueled the recent interest in distributed statistical learning. For examples, we refer to [2, 20, 9] and the references therein. The main computational bottleneck in distributed statistical learning is usually the movement of data between compute nodes, so the overarching goal of algorithm design is the minimization of such communication costs.

Most work on distributed statistical learning assumes the data is partitioned by samples. However, for high-dimensional datasets, it is more natural to partition the data by features. Unfortunately, methods that are suited to such feature-distributed problems are scarce. A possible explanation for the paucity of methods is that feature-distributed problems are harder than their sample-distributed counterparts. If the data is distributed by samples, each machine has a complete view of the problem (albeit a partial view of the dataset). Given only its local data, each machine can fit the full model. On the other hand, if the data is distributed by features, each machine no longer has a complete view of the problem. It can only fit a (generally mis-specified) submodel. Thus communication among the machines is necessary to solve feature-distributed problems. In this paper, our goal is to develop algorithms for feature-distributed sparse linear regression that minimize the amount of data (in bits) sent over the network across all rounds.

The sparse linear model is

y = Xβ* + ε,   (1)

where X ∈ R^{N×D} are features, y ∈ R^N are responses, β* ∈ R^D are (unknown) regression coefficients, and ε ∈ R^N are unobserved errors. The model is sparse because β* is s-sparse; i.e., the cardinality of S := supp(β*) is at most s.
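As a concrete illustration, data from model (1) can be generated in a few lines. This is a minimal sketch: the sizes, noise level, and random seed are made up for illustration, and we apply the ‖β*‖₁ = 1 scaling that the lasso formulation below assumes.

```python
import numpy as np

# Illustrative draw from the sparse linear model y = X beta* + eps.
# The paper's regime of interest is D >> N; the sizes here are toy values.
rng = np.random.default_rng(0)
N, D, s = 100, 1000, 5

X = rng.standard_normal((N, D))               # features, N x D
beta_star = np.zeros(D)
support = rng.choice(D, size=s, replace=False)
beta_star[support] = rng.standard_normal(s)   # s-sparse coefficients
beta_star /= np.abs(beta_star).sum()          # scale so ||beta*||_1 = 1
eps = 0.1 * rng.standard_normal(N)            # unobserved errors
y = X @ beta_star + eps
```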
Although it is an idealized model, the sparse linear model has proven itself useful in a wide variety of applications. A popular way to fit a sparse linear model is the lasso [15, 3]:

β̂ ← arg min_{‖β‖₁ ≤ 1} (1/2N) ‖y − Xβ‖₂²,

where we assume the problem is scaled so that ‖β*‖₁ = 1.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

There is a well-developed theory of the lasso that ensures the lasso estimator β̂ is nearly as close to β* as the oracle estimator X_S† y, where S ⊂ [D] is the support of β* [11]. Formally, under some conditions on the Gram matrix (1/N) XᵀX, the (in-sample) prediction error of the lasso is roughly (s log D)/N. Since the prediction error of the oracle estimator is (roughly) s/N, the lasso estimator is almost as good as the oracle estimator. We refer to [8] for the details.

We propose an approach to feature-distributed sparse regression that attains the convergence rate of the lasso estimator. Our approach, which we call SCREENANDCLEAN, consists of two stages: a screening stage, where we reduce the dimensionality of the problem by discarding irrelevant features; and a cleaning stage, where we fit a sparse linear model to a sketched problem. The key features of the proposed approach are:

• We reduce the best-known communication cost (in bits) of feature-distributed sparse regression from O(mN²) to O(Nms) bits, where N is the sample size, m is the number of machines, and s is the sparsity. To our knowledge, the proposed approach is the only one that exploits sparsity to reduce communication cost.

• As a corollary, we show that constrained Newton-type methods converge linearly (up to a statistical tolerance) on high-dimensional problems that are not strongly convex.
Also, the convergence rate is only weakly dependent on the condition number of the problem.

• Another benefit of our approach is that it allows users to trade off the amount of data (in bits) sent over the network against the number of rounds of communication. At one extreme, it is possible to reduce the amount of data sent over the network to Õ(mNs) bits (at the cost of log(N/(s log D)) rounds of communication). At the other extreme, it is possible to reduce the total number of iterations to a constant at the cost of sending Õ(mN²) bits over the network.

Related work. DECO [17] is a recently proposed method that addresses the same problem we address. At a high level, DECO is based on the observation that if the features on separate machines are uncorrelated, the sparse regression problem decouples across machines. To make the features on separate machines uncorrelated, DECO first applies a decorrelation step. The method is communication efficient in that it requires only a single round of communication, in which O(mN²) bits of data are sent over the network. We refer to [17] for the details of DECO.

As we shall see, in the cleaning stage of our approach we utilize sub-Gaussian sketches. In fact, other sketches, e.g., sketches based on the Hadamard transform [16] and sparse sketches [4], may also be used. An overview of various sketching techniques can be found in [19].

The cleaning stage of our approach is operationally very similar to the iterative Hessian sketch (IHS) of Pilanci and Wainwright for constrained least-squares problems [12]. Similar Newton-type methods that rely on sub-sampling rather than sketching were also studied by [14].
However, they are chiefly concerned with the convergence of the iterates to the (stochastic) minimizer of the least-squares problem, while we are chiefly concerned with the convergence of the iterates to the unknown regression coefficients β*. Further, their assumptions on the sketching matrix are stated in terms of the transformed tangent cone at the minimizer of the least-squares problem, while our assumptions are stated in terms of the tangent cone at β*.

Finally, we wish to point out that our results are similar in spirit to those on the fast convergence of first-order methods [1, 10] on high-dimensional problems in the presence of restricted strong convexity. However, those results are also chiefly concerned with the convergence of the iterates to the (stochastic) minimizer of the least-squares problem. Further, those results concern first-order, rather than second-order, methods.

2 A screen-and-clean approach

Our approach SCREENANDCLEAN consists of two stages:

1. Screening stage: reduce the dimension of the problem from D to d = O(N) by discarding irrelevant features.

2. Cleaning stage: fit a sparse linear model to the O(N) selected features.

We note that it is possible to avoid communication in the screening stage by using a method based on the marginal correlations between the features and the response. Further, by exploiting sparsity, it is possible to reduce the amount of communication to O(mNs) bits (ignoring polylogarithmic factors). To the authors' knowledge, all existing one-shot approaches to feature-distributed sparse regression that involve only a single round of communication require sending O(mN²) bits over the network.

In the first stage of SCREENANDCLEAN, the k-th machine selects a subset Ŝ_k of potentially relevant features, where |Ŝ_k| = d_k ≲ N.
To avoid discarding any relevant features, we use a screening method that has the sure screening property:

P( supp(β*_k) ⊂ ∪_{k∈[m]} Ŝ_k ) → 1,   (2)

where β*_k is the k-th block of β*. We remark that we do not require the selection procedure to be variable-selection consistent. That is, we do not require the selection procedure to select only relevant features. In fact, we permit the possibility that most of the selected features are irrelevant. There are many existing methods that, under some conditions on the strength of the signal, have the sure screening property. A prominent example is sure independence screening (SIS) [6]:

Ŝ_SIS ← { i ∈ [D] : |(1/N) x_iᵀ y| is among the ⌊τN⌋ largest entries of |(1/N) Xᵀ y| }.   (3)

SIS requires no communication among the machines, making it particularly amenable to distributed implementation. Other methods include HOLP [18].

In the second stage of SCREENANDCLEAN, which is presented as Algorithm 1, we solve the reduced sparse regression problem in an iterative manner. At a high level, our approach is a constrained quasi-Newton method. At the beginning of the second stage, each machine sketches the features that are stored locally:

X̃_k ← (1/√(nT)) S X_{k,Ŝ_k},

where S ∈ R^{nT×N} is a sketching matrix and X_{k,Ŝ_k} ∈ R^{N×d_k} comprises the features stored on the k-th machine that were selected by the screening stage. For notational convenience later, we divide X̃_k row-wise into T blocks,

X̃_k = [X̃_{k,1}; … ; X̃_{k,T}],

where each block X̃_{k,t} is an n × d_k block. We emphasize that the sketching matrix is identical on all the machines.
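The screening step above can be sketched in a few lines. This is an illustrative simplification: it ranks a single block of features by their marginal correlations and keeps a fixed number of them, whereas matching (3) exactly across machines requires thresholding against the globally largest ⌊τN⌋ entries (a shared threshold, or one small round of communication).

```python
import numpy as np

def sis_screen(X_local, y, keep):
    """Sure independence screening in the spirit of (3): rank the features
    in this block by the marginal correlations |(1/N) x_j^T y| and return
    the indices of the `keep` largest."""
    N = X_local.shape[0]
    corr = np.abs(X_local.T @ y) / N               # entries of |(1/N) X^T y|
    return np.sort(np.argsort(corr)[::-1][:keep])  # indices of largest entries

# Toy check: strongly relevant features should survive screening.
rng = np.random.default_rng(1)
N, D = 200, 500
X = rng.standard_normal((N, D))
beta = np.zeros(D)
beta[:3] = 5.0                                   # three strong signals
y = X @ beta + 0.01 * rng.standard_normal(N)
selected = sis_screen(X, y, keep=25)
```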
To ensure the sketching matrix is identical, it is necessary to synchronize the random number generators on the machines.

We restrict our attention to sub-Gaussian sketches; i.e., the rows of S are i.i.d. sub-Gaussian random vectors. Formally, a random vector x ∈ R^d is 1-sub-Gaussian if

P(θᵀx ≥ ε) ≤ e^{−ε²/2} for any θ ∈ S^{d−1}, ε > 0.

Two examples of sub-Gaussian sketches are the standard Gaussian sketch, where S_{i,j} i.i.d. ∼ N(0, 1), and the Rademacher sketch, where the S_{i,j} are i.i.d. Rademacher random variables.

After each machine sketches the features that are stored locally, it sends the sketched features X̃_k and the correlations of the screened features with the response, γ̂_k := (1/N) X_{k,Ŝ_k}ᵀ y, to a central machine, which solves a sequence of T regularized quadratic programs (QPs) to estimate β*:

β̃_t ← arg min_{β ∈ B₁^d} (1/2) βᵀ Γ̃_t β − (γ̂ − Γ̂ β̃_{t−1} + Γ̃_t β̃_{t−1})ᵀ β,

where B₁^d is the unit ℓ1 ball in R^d, γ̂ = [γ̂_1ᵀ … γ̂_mᵀ]ᵀ are the correlations of the screened features with the response, Γ̂ = (1/N) X_Ŝᵀ X_Ŝ is the Gram matrix of the features selected by the screening stage, and

Γ̃_t := [X̃_{1,t} … X̃_{m,t}]ᵀ [X̃_{1,t} … X̃_{m,t}].

As we shall see, despite the absence of strong convexity, the sequence {β̃_t}_{t=1}^∞ converges q-linearly
to β* up to the statistical precision.

Algorithm 1 Cleaning Stage

Sketching
1: Each machine computes the sketches (1/√(nT)) S_t X_{k,Ŝ_k}, t ∈ [T], and the sufficient statistics (1/N) X_{k,Ŝ_k}ᵀ y.
2: A central machine collects the sketches and sufficient statistics and forms Γ̃_t ← [X̃_{1,t} … X̃_{m,t}]ᵀ [X̃_{1,t} … X̃_{m,t}] and γ̂ ← [γ̂_1ᵀ … γ̂_mᵀ]ᵀ.
Optimization
3: for t ∈ [T] do
4: The cluster computes Γ̂ β̃_{t−1} in a distributed fashion: ŷ_{t−1} ← Σ_{k∈[m]} X_{k,Ŝ_k} β̃_{t−1,k}, followed by Γ̂ β̃_{t−1} ← [(1/N) X_{1,Ŝ₁}ᵀ ŷ_{t−1}; … ; (1/N) X_{m,Ŝ_m}ᵀ ŷ_{t−1}].
5: The central machine solves β̃_t ← arg min_{β ∈ B₁^d} (1/2) βᵀ Γ̃_t β − (γ̂ − Γ̂ β̃_{t−1} + Γ̃_t β̃_{t−1})ᵀ β.
6: end for
7: The central machine pads β̃_T with zeros to obtain an estimator of β*.

The cleaning stage involves 2T + 1 rounds of communication: step 2 involves a single round of communication, and step 4 involves two rounds of communication. We remark that T is a small integer in practice. Consequently, the number of rounds of communication is a small integer.

In terms of the amount of data (in bits) sent over the network, the communication cost of the cleaning stage grows as O(dnmT), where d is the number of features selected by the screening stage and n is the sketch size.
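Abstracting away the distribution across machines, the cleaning stage can be simulated on a single machine as below. This is a sketch under our own choices, not the paper's exact implementation: fresh Gaussian sketches per round, a projected-gradient inner solver for the ℓ1-ball-constrained QP, and toy dimensions; the paper solves the inner QP exactly (with TFOCS in the experiments).

```python
import numpy as np

def project_l1_ball(v, r=1.0):
    """Euclidean projection onto the l1 ball {b : ||b||_1 <= r}."""
    if np.abs(v).sum() <= r:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u - (css - r) / np.arange(1.0, len(u) + 1) > 0)[0][-1]
    theta = (css[k] - r) / (k + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def clean_stage(X, y, T, n, inner=300, rng=None):
    """Single-machine simulation of the cleaning stage: at round t, minimize
    (1/2) b' Gt b - (g - G b_prev + Gt b_prev)' b over the unit l1 ball,
    where Gt is a sketched Gram matrix and g = X'y/N."""
    rng = np.random.default_rng(0) if rng is None else rng
    N, d = X.shape
    G = X.T @ X / N                  # full Gram matrix (Gamma-hat)
    g = X.T @ y / N                  # correlations (gamma-hat)
    beta = np.zeros(d)
    for _ in range(T):
        S = rng.standard_normal((n, N)) / np.sqrt(n)   # Gaussian sketch
        Xs = S @ X
        Gt = Xs.T @ Xs / N           # sketched Gram matrix (Gamma-tilde)
        lin = g - G @ beta + Gt @ beta
        eta = 1.0 / np.linalg.norm(Gt, 2)              # 1 / lambda_max(Gt)
        b = beta.copy()
        for _ in range(inner):       # projected gradient on the inner QP
            b = project_l1_ball(b - eta * (Gt @ b - lin))
        beta = b
    return beta

# Toy run on a well-conditioned instance with ||beta*||_1 = 1.
rng = np.random.default_rng(3)
N, d = 200, 30
beta_star = np.zeros(d)
beta_star[:3] = [0.5, 0.3, 0.2]
X = rng.standard_normal((N, d))
y = X @ beta_star + 0.01 * rng.standard_normal(N)
beta_hat = clean_stage(X, y, T=8, n=200, rng=rng)
```

Note that the iterate never needs the full data after sketching: each round touches only the sketched Gram matrix and the distributed matrix-vector products, mirroring steps 2 and 4 of Algorithm 1.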
The communication cost of step 2 is O(dmnT + d), while that of step 4 is O(d + N). Thus the dominant term is the O(dnmT) incurred by machines sending sketches to the central machine.

3 Theoretical properties of the screen-and-clean approach

In this section we establish our main theoretical result regarding the SCREENANDCLEAN approach, given as Theorem 3.5. Recall that a key element of our approach is to prove that the first stage of SCREENANDCLEAN establishes the sure screening property, i.e., (2). To this end, we begin by stating a result by Fan and Lv that establishes sufficient conditions for SIS, i.e., (3), to possess the sure screening property.

Theorem 3.1 (Fan and Lv (2008)). Let Σ be the covariance of the predictors and Z = XΣ^{−1/2} be the whitened predictors. We assume Z satisfies the concentration property: there are c, c₁ > 1 and C₁ > 0 such that

P( λ_max(d̃^{−1} Z̃Z̃ᵀ) > c₁ or λ_min(d̃^{−1} Z̃Z̃ᵀ) < c₁^{−1} ) ≤ e^{−C₁N}

for any N × d̃ submatrix Z̃ of Z. Further,

1. the rows of Z are spherically symmetric, and ε_i i.i.d. ∼ N(0, σ²) for some σ > 0;

2. var(y) ≲ 1, min_{j∈S} |β*_j| ≥ c₂ N^{−κ}, and min_{j∈S} |cov(β_j^{−1} y, x_j)| ≥ c₃ for some κ > 0 and c₂, c₃ > 0;

3. there is c₄ > 0 such that λ_max(Σ) ≤ c₄.

As long as κ < 1/2, there is some θ < 1 − 2κ such that if τ = cN^{−θ} for some c > 0, we have

P(S ⊂ Ŝ_SIS) = 1 − C₂ exp(−C N^{1−2κ}/log N)

for some C, C₂ > 0, where Ŝ_SIS is given by (3).

The assumptions of Theorem 3.1 are discussed at length in [6], Section 5.
We remark that the most stringent assumption is the second, which is an assumption on the signal-to-noise ratio (SNR). It rules out the possibility that a relevant variable is (marginally) uncorrelated with the response.

We continue our analysis by studying the convergence rate of our approach. We begin by describing three structural conditions we impose on the problem. In the rest of the section, let

K(S) := {β ∈ R^d : ‖β_{S^c}‖₁ ≤ ‖β_S‖₁}.

Condition 3.2 (RE condition). There is α₁ > 0 s.t. ‖β‖²_Γ̂ ≥ α₁ ‖β‖₂² for any β ∈ K(S).

Condition 3.3. There is α₂ > 0 s.t. ‖β‖²_Γ̃ₜ ≥ α₂ ‖β‖²_Γ̂ for any β ∈ K(S).

Condition 3.4. There is α₃ > 0 s.t. |β₁ᵀ(Γ̃ₜ − Γ̂)β₂| ≤ α₃ ‖β₁‖_Γ̂ ‖β₂‖_Γ̂ for any β₁, β₂ ∈ K(S).

The preceding conditions deserve elaboration. The cone K(S) is an object that appears in the study of the statistical properties of constrained M-estimators: it is the set to which the error of the constrained lasso, β̂ − β*, belongs. Its image under X_Ŝ is the transformed tangent cone, which contains the prediction error X_Ŝ(β̂ − β*). Condition 3.2 is a common assumption in the literature on high-dimensional statistics. It is a specialization of the notion of restricted strong convexity that plays a crucial part in the study of constrained M-estimators. Conditions 3.3 and 3.4 are conditions on the sketch. At a high level, Conditions 3.3 and 3.4 state that the action of the sketched Gram matrix Γ̃ₜ on K(S) is similar to that of Γ̂ on K(S). As we shall see, they are satisfied with high probability by sub-Gaussian sketches. The following theorem is our main result regarding the SCREENANDCLEAN method.

Theorem 3.5.
Under Conditions 3.2, 3.3, and 3.4, for any T > 0 such that ‖β̃ₜ − β*‖_Γ̂ ≥ (L/√s) ‖β̂ − β*‖₁ for all t ≤ T, we have

‖β̃ₜ − β*‖_Γ̂ ≤ γ^{t−1} ‖β̃₁ − β*‖_Γ̂ + ε_st(N, D)/(1 − γ),

where γ = c_γ α₃/α₂ is the contraction factor (c_γ > 0 is an absolute constant) and

ε_st(N, D) = (2(1 + 12α₃) λ_max(Γ̂)^{1/2} L)/(α₂ √s) ‖β̂ − β*‖₁ + (24 √s)/(α₁ α₂) ‖Γ̂β* − γ̂‖_∞.

To interpret Theorem 3.5, recall that

‖β̂ − β*‖₁ ≲_P s ‖Γ̂β* − γ̂‖_∞ and ‖β̂ − β*‖₂ ≲_P √s ‖Γ̂β* − γ̂‖_∞,

where β̂ is the lasso estimator. Further, the prediction error of the lasso estimator is (up to a constant) (L/√s) ‖β̂ − β*‖₁, which (up to a constant) is exactly the statistical precision ε_st(N, D). Theorem 3.5 states that the prediction error of β̃ₜ decreases q-linearly to that of the lasso estimator. We emphasize that the convergence rate is linear despite the absence of strong convexity, which is usually the case when N < D. A direct consequence is that only logarithmically many iterations ensure a desired suboptimality, as stated in the following corollary.

Corollary 3.6.
Under the conditions of Theorem 3.5, running

T = (log((ε − ε_st(N, D)/(1 − γ))^{−1}) − log(1/ε₁)) / log(1/γ) ≈ log(1/ε)

iterations of the constrained quasi-Newton method, where ε₁ = ‖β̃₁ − β*‖_Γ̂, is enough to produce an iterate whose prediction error is smaller than any

ε > max{ (λ_max(Γ̂)^{1/2}/√s) ‖β̂ − β*‖₁, ε_st(N, D)/(1 − γ) } ≈ ‖β̂ − β*‖_Γ̂.

Theorem 3.5 is vacuous if the contraction factor γ = c_γ α₃/α₂ is not smaller than 1. To ensure γ < 1, it is enough to choose the sketch size n so that α₃/α₂ < c_γ^{−1}. Consider the "good event"

E(δ) := {α₂ ≥ 1 − δ, α₃ ≤ δ}.   (4)

If the rows of Sₜ are sub-Gaussian, to ensure E(δ) occurs with high probability, Pilanci and Wainwright show it is enough to choose

n > c_s δ^{−2} W(X_Ŝ(K(S) ∩ S^{d−1}))²,   (5)

where c_s > 0 is an absolute constant and W(·) is the Gaussian width of a set in R^d [13].

Lemma 3.7 (Pilanci and Wainwright (2014)). For any sketching matrix whose rows are independent 1-sub-Gaussian vectors, as long as the sketch size n satisfies (5),

P(E(δ)) ≥ 1 − c₅ exp(−c₆ n δ²),

where c₅, c₆ are absolute constants.

As a result, when the sketch size n satisfies (5), Theorem 3.5 is non-trivial.

Tradeoffs depending on sketch size. We remark that the contraction coefficient in Theorem 3.5 depends on the sketch size.
As the sketch size n increases, the contraction coefficient decays, and vice versa. Thus the sketch size allows practitioners to trade off the total number of rounds of communication against the total amount of data (in bits) sent over the network. A larger sketch size results in fewer rounds of communication but more bits per round of communication, and vice versa. Recall [5] that the communication cost of an algorithm is

rounds × overhead + bits × bandwidth^{−1}.

By tweaking the sketch size, users can trade off rounds and bits, thereby minimizing the communication cost of our approach on various distributed computing platforms. For example, the user of a cluster comprising commodity machines is more concerned with overhead than the user of a purpose-built high-performance cluster [7]. In the following, we study the two extremes of the trade-off.

At one extreme, users are solely concerned with the total amount of data sent over the network. On such platforms, users should use smaller sketches to reduce the total amount of data sent over the network at the expense of performing a few extra iterations (rounds of communication).

Corollary 3.8. Under the conditions of Theorem 3.5 and Lemma 3.7, selecting d := ⌊τN⌋ features by SIS, where τ = cN^{−θ} for some c > 0 and θ < 1 − 2κ, and letting

n > 4 c_s (c_γ + 2)² W(X_Ŝ(K(S) ∩ S^{d−1}))²,   T = (log(1/ε_st(N, D)) − log(1/ε₁)) / log 2

in Algorithm 1 ensures ‖β̃_T − β*‖_Γ̂ ≤ 3 ε_st(N, D) with probability at least

1 − c₄ T exp(−c₂ n δ²) − C₂ exp(−C N^{1−2κ}/log N),

where c, c_γ, c_s, c₂, c₄, C, C₂ are absolute constants.

We state the corollary in terms of the statistical precision ε_st(N, D) and the Gaussian width to keep the expressions concise.
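The cost model above is easy to explore numerically. All numbers below (overheads, bandwidths, message sizes) are made up purely to illustrate how the preferred extreme flips between a low-overhead cluster and a high-overhead commodity cluster:

```python
def comm_cost(rounds, bits, overhead_s, bandwidth_bps):
    """Cost model from the text: rounds x overhead + bits x bandwidth^{-1}."""
    return rounds * overhead_s + bits / bandwidth_bps

small_sketch = dict(rounds=20, bits=5e8)   # many rounds, few bits per round
big_sketch = dict(rounds=3, bits=5e9)      # few rounds, many bits per round

hpc = dict(overhead_s=1e-2, bandwidth_bps=1e9)        # low per-round overhead
commodity = dict(overhead_s=1.0, bandwidth_bps=1e9)   # high per-round overhead

cost_hpc_small = comm_cost(**small_sketch, **hpc)        # 0.2 + 0.5  = 0.7 s
cost_hpc_big = comm_cost(**big_sketch, **hpc)            # 0.03 + 5.0 = 5.03 s
cost_com_small = comm_cost(**small_sketch, **commodity)  # 20.0 + 0.5 = 20.5 s
cost_com_big = comm_cost(**big_sketch, **commodity)      # 3.0 + 5.0  = 8.0 s
```

With these (hypothetical) numbers, the small-sketch extreme wins on the low-overhead platform and the big-sketch extreme wins on the high-overhead one, which is the trade-off the corollaries below formalize.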
It is known that the Gaussian width of the transformed tangent cone that appears in Corollary 3.8 is O(s log d)^{1/2} [13]. Thus it is possible to keep the sketch size n on the order of s log d. Recalling that d = ⌊τN⌋, where τ is specified in the statement of Theorem 3.1, and ε_st(N, D) ≤ ((s log D)/N)^{1/2}, we deduce that the communication cost of the approach is

O(dnmT) = O(N (s log d) m log(N/(s log D))) = Õ(mNs),

where Õ ignores polylogarithmic terms. The takeaway is that it is possible to obtain an O(ε_st(N, D))-accurate solution by sending Õ(mNs) bits over the network. Compared to the O(mN²) communication cost of DECO, we see that our approach exploits sparsity to reduce communication cost.

At the other extreme, there is a line of work in statistics that studies estimators whose evaluation requires only a single round of communication. DECO is such a method. In our approach, it is possible to obtain an ε_st(N, D)-accurate solution in a single iteration by choosing the sketch size large enough to ensure the contraction factor γ is on the order of ε_st(N, D).

Corollary 3.9.
Under the conditions of Theorem 3.5 and Lemma 3.7, selecting d := ⌊τN⌋ features by SIS, where τ = cN^{−θ} for some c > 0 and θ < 1 − 2κ, and letting

n > c_s (c_γ ε_st(N, D)^{−1} + 2)² W(X_Ŝ(K(S) ∩ S^{d−1}))²

and T = 1 in Algorithm 1 ensures ‖β̃_T − β*‖_Γ̂ ≤ 3 ε_st(N, D) with probability at least

1 − c₄ T exp(−c₂ n δ²) − C₂ exp(−C N^{1−2κ}/log N),

where c, c_γ, c_s, c₂, c₄, C, C₂ are absolute constants.

Figure 1: Plots of the statistical error log ‖X(β̂ − β*)‖₂² versus iteration. Panel (a): x_i i.i.d. ∼ N(0, I_D); panel (b): x_i i.i.d. ∼ AR(1). Each plot shows the convergence of 10 runs of Algorithm 1 on the same problem instance. We see that the statistical error decreases linearly up to the statistical precision of the problem.

Recalling that

ε_st(N, D)² ≈ (s log D)/N and W(X_Ŝ(K(S) ∩ S^{d−1}))² ≈ s log d,

we deduce that the communication cost of the one-shot approach is

O(dnmT) = O(N² m log(N/(s log D))) = Õ(mN²),

which matches the communication cost of DECO.

4 Simulation results

In this section, we provide empirical evaluations of our main algorithm SCREENANDCLEAN on synthetic datasets. In most of the experiments, the performance of the methods is evaluated in terms of the prediction error, defined as ‖X(β̂ − β*)‖₂². All the experiments are implemented in Matlab on a shared-memory machine with 512 GB of RAM and four 6-core Intel Xeon E7540 2 GHz processors. We use TFOCS as the solver for any optimization problem involved, e.g., step 5 in Algorithm 1.
For brevity, we refer to our approach as SC in the rest of the section.

4.1 Impact of number of iterations and sketch size

First, we confirm the prediction of Theorem 3.5 by simulation. Figure 1 shows the prediction error of the iterates of Algorithm 1 with different sketch sizes. We generate a random instance of a sparse regression problem of size 1000 × 10000 with sparsity s = 10, and apply Algorithm 1 to estimate the regression coefficients. Since Algorithm 1 is a randomized algorithm, for a given (fixed) dataset, its error is reported as the median of the results from 11 independent trials. The two subfigures show the results for two random designs: standard Gaussian (left) and AR(1) (right). Within each subfigure, each curve corresponds to a sketch size, and the dashed black line shows the prediction error of the lasso estimator. On a logarithmic scale, a linearly convergent sequence of points appears on a straight line. As predicted by Theorem 3.5, the iterates of Algorithm 1 converge linearly up to the statistical precision, which is (roughly) the prediction error of the lasso estimator, and then the error plateaus. As expected, the larger the sketch size, the fewer iterations are needed. These results are consistent with our theoretical findings.

4.2 Impact of sample size N

Next, we evaluate the statistical performance of our SC algorithm as N grows. For completeness, we also evaluate several competing methods, namely lasso, SIS [6], and DECO [17]. The synthetic datasets used in our experiments are based on model (1). In it, X ∼ N(0, I_D) or X ∼ N(0, Σ) with all predictors equally correlated with correlation 0.7, and ε ∼ N(0, 1). Similar to the setting in [17], the support S of β* satisfies |S| = 5, its coordinates are randomly chosen from {1, . . . , D}, and

β*_i = (−1)^{Ber(0.5)} ( |N(0, 1)| + 5 ((log D)/N)^{1/2} ) for i ∈ S, and β*_i = 0 for i ∉ S.

We generate datasets with fixed D = 3000 and N ranging from 50 to 600. For each N, 20 synthetic datasets are generated and the plots are made by averaging the results.

In order to compare with methods such as DECO, which is concerned with the Lagrangian formulation of the lasso, we modify our algorithm accordingly. That is, in step 5 of Algorithm 1, we solve

β̃_t ← arg min_{β ∈ R^d} (1/2) βᵀ Γ̃_t β − (γ̂ − Γ̂ β̃_{t−1})ᵀ β + λ ‖β‖₁.

Herein, in our experiments, the regularization parameter is set to λ = 2‖Xᵀε‖_∞. Also, for SIS and SC, the screening size is set to 2N. For SC, we run it with sketch size n = 2s log N, where s = 5, and 3 iterations. For DECO, the dataset is partitioned into m = 3 subsets and DECO is implemented without the refinement step. The results for the two kinds of design matrices are presented in Figure 2.

Figure 2: Plots of the statistical error log ‖X(β̂ − β*)‖₂² versus log N. Panel (a) is generated on datasets with independent predictors (x_i i.i.d. ∼ N(0, I_D)) and panel (b) on datasets with correlated predictors (x_i i.i.d. ∼ N(0, Σ)). Besides our main algorithm SC, several competing methods, namely lasso, SIS, and DECO, are evaluated. Here D = 3000. For each N, 20 independent simulated datasets are generated and the averaged results are plotted.

As can be seen, SIS achieves similar errors as lasso.
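The two random designs used in the experiments can be generated as follows. This is an illustrative sketch: ρ = 0.7 for the equicorrelated design is from the text, while the AR(1) coefficient is our assumption, since the text does not state it.

```python
import numpy as np

def ar1_design(N, D, rho, rng):
    """Rows x_i with stationary AR(1) correlation rho^|j-k|.
    The AR(1) coefficient rho here is an assumed value."""
    Z = rng.standard_normal((N, D))
    X = np.empty((N, D))
    X[:, 0] = Z[:, 0]
    for j in range(1, D):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho**2) * Z[:, j]
    return X

def equicorrelated_design(N, D, rho, rng):
    """Rows with unit variance and all pairwise correlations rho,
    as in the rho = 0.7 design of Section 4.2."""
    shared = rng.standard_normal((N, 1))
    Z = rng.standard_normal((N, D))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * Z

rng = np.random.default_rng(2)
X_eq = equicorrelated_design(5000, 5, 0.7, rng)
X_ar = ar1_design(5000, 5, 0.7, rng)
```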
Indeed, after careful inspection, we find that in the cases where the predictors are highly correlated, i.e., Figure 2(b), usually fewer than 2 of the non-zero coefficients are recovered by sure independence screening. Nevertheless, this does not deteriorate the accuracy too much. Moreover, SC's performance is comparable to both SIS and lasso, as its prediction error goes down at the same rate, and SC outperforms DECO in our experiments.

Finally, in order to demonstrate that our approach is amenable to distributed computing environments, we implement it using Spark¹ on a modern cluster with 20 nodes, each of which has 12 executor cores. We run our algorithm on an independent Gaussian problem instance of size 6000 × 200,000 with sparsity s = 20. The screening size is 2400, the sketch size is 700, and the number of iterations is 3. To show the scalability, we report the running time using 1, 2, 4, 8, and 16 machines, respectively. As most of the steps in our approach are embarrassingly parallel, the running time is almost halved as we double the number of machines.

Figure 3: Running time of a Spark implementation of SC versus number of machines.

5 Conclusion and discussion

We presented an approach to feature-distributed sparse regression that exploits the sparsity of the regression coefficients to reduce communication cost. Our approach relies on sketching to compress the information that has to be sent over the network. Empirical results verify our theoretical findings.

¹http://spark.apache.org/

Acknowledgments. We would like to thank the Army Research Office and the Defense Advanced Research Projects Agency for providing partial support for this work.

References

[1] Alekh Agarwal, Sahand Negahban, Martin J. Wainwright, et al.
Fast global convergence of gradient\n\nmethods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452\u20132482, 2012.\n\n[2] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and\nstatistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine\nLearning, 3(1):1\u2013122, 2011.\n\n[3] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit.\n\nSIAM Review, 43(1):129\u2013159, 2001.\n\n[4] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In\n\nSymposium on Theory of Computing (STOC), 2013.\n\n[5] Jim Demmel. Communication avoiding algorithms.\n\nIn 2012 SC Companion: High Performance\n\nComputing, Networking Storage and Analysis, pages 1942\u20132000. IEEE, 2012.\n\n[6] Jianqing Fan and Jinchi Lv. Sure independence screening for ultra-high dimensional feature space. Journal\n\nof the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849\u2013911, 2008.\n\n[7] Alex Gittens, Aditya Devarakonda, Evan Racah, Michael F. Ringenburg, Lisa Gerhardt, Jey Kottalam, Jialin\nLiu, Kristyn J. Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel,\nJim Harrell, Venkat Krishnamurthy, Michael W. Mahoney, and Prabhat. Matrix factorization at scale:\na comparison of scienti\ufb01c data analytics in spark and C+MPI using three case studies. arXiv preprint\narXiv:1607.01335, 2016.\n\n[8] Trevor J. Hastie, Robert Tibshirani, and Martin J. Wainwright. Statistical Learning with Sparsity: The\n\nLasso and Its Generalizations. CRC Press, 2015.\n\n[9] Jason D. Lee, Yuekai Sun, Qiang Liu, and Jonathan E. Taylor. Communication-ef\ufb01cient sparse regression:\n\na one-shot approach. arXiv preprint arXiv:1503.04337, 2015.\n\n[10] Po-Ling Loh and Martin J. Wainwright. 
High-dimensional regression with noisy and missing data: Provable\n\nguarantees with nonconvexity. Ann. Statist., 40(3):1637\u20131664, 06 2012.\n\n[11] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A uni\ufb01ed framework for high-\ndimensional analysis of m-estimators with decomposable regularizers. Statistical Science, 27(4):538\u2013557,\n2012.\n\n[12] Mert Pilanci and Martin J. Wainwright. Iterative Hessian sketch: Fast and accurate solution approximation\n\nfor constrained least-squares. arXiv preprint arXiv:1411.0347, 2014.\n\n[13] Mert Pilanci and Martin J. Wainwright. Randomized sketches of convex programs with sharp guarantees.\n\nInformation Theory, IEEE Transactions on, 61(9):5096\u20135115, 2015.\n\n[14] Farbod Roosta-Khorasani and Michael W. Mahoney. Sub-sampled Newton methods II: Local convergence\n\nrates. arXiv preprint arXiv:1601.04738, 2016.\n\n[15] Robert Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol.,\n\npages 267\u2013288, 1996.\n\n[16] Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data\n\nAnal., 3(1-2):115\u2013126, 2011.\n\n[17] Xiangyu Wang, David Dunson, and Chenlei Leng. Decorrelated feature space partitioning for distributed\n\nsparse regression. arXiv preprint arXiv:1602.02575, 2016.\n\n[18] Xiangyu Wang and Chenlei Leng. High dimensional ordinary least squares projection for screening\n\nvariables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2015.\n\n[19] David P. Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357,\n\n2014.\n\n[20] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-ef\ufb01cient algorithms for statistical\n\noptimization. 
Journal of Machine Learning Research, 14:3321\u20133363, 2013.\n\n9\n\n\f", "award": [], "sourceid": 1386, "authors": [{"given_name": "Jiyan", "family_name": "Yang", "institution": "Stanford University"}, {"given_name": "Michael", "family_name": "Mahoney", "institution": "UC Berkeley"}, {"given_name": "Michael", "family_name": "Saunders", "institution": "Stanford University"}, {"given_name": "Yuekai", "family_name": "Sun", "institution": "University of Michigan"}]}