{"title": "A Comparative Framework for Preconditioned Lasso Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1061, "page_last": 1069, "abstract": "The Lasso is a cornerstone of modern multivariate data analysis, yet its performance suffers in the common situation in which covariates are correlated. This limitation has led to a growing number of \\emph{Preconditioned Lasso} algorithms that pre-multiply $X$ and $y$ by matrices $P_X$, $P_y$ prior to running the standard Lasso. A direct comparison of these and similar Lasso-style algorithms to the original Lasso is difficult because the performance of all of these methods depends critically on an auxiliary penalty parameter $\\lambda$. In this paper we propose an agnostic, theoretical framework for comparing Preconditioned Lasso algorithms to the Lasso without having to choose $\\lambda$. We apply our framework to three Preconditioned Lasso instances and highlight when they will outperform the Lasso. Additionally, our theory offers insights into the fragilities of these algorithms to which we provide partial solutions.", "full_text": "A Comparative Framework for\nPreconditioned Lasso Algorithms\n\nFabian L. Wauthier\nStatistics and WTCHG\nUniversity of Oxford\n\nflw@stats.ox.ac.uk\n\nNebojsa Jojic\n\nMicrosoft Research, Redmond\njojic@microsoft.com\n\nMichael I. Jordan\n\nComputer Science Division\n\nUniversity of California, Berkeley\njordan@cs.berkeley.edu\n\nAbstract\n\nThe Lasso is a cornerstone of modern multivariate data analysis, yet its perfor-\nmance suffers in the common situation in which covariates are correlated. This\nlimitation has led to a growing number of Preconditioned Lasso algorithms that\npre-multiply X and y by matrices PX, Py prior to running the standard Lasso. 
A\ndirect comparison of these and similar Lasso-style algorithms to the original Lasso\nis difficult because the performance of all of these methods depends critically on\nan auxiliary penalty parameter \u03bb. In this paper we propose an agnostic framework\nfor comparing Preconditioned Lasso algorithms to the Lasso without having to\nchoose \u03bb. We apply our framework to three Preconditioned Lasso instances and\nhighlight cases when they will outperform the Lasso. Additionally, our theory\nreveals fragilities of these algorithms to which we provide partial solutions.\n\n1 Introduction\n\nVariable selection is a core inferential problem in a multitude of statistical analyses. Confronted with\na large number of (potentially) predictive variables, the goal is to select a small subset of variables\nthat can be used to construct a parsimonious model. Variable selection is especially relevant in linear\nobservation models of the form\n\n$$y = X\\beta^* + w \\quad \\text{with} \\quad w \\sim N(0, \\sigma^2 I_{n \\times n}),$$ (1)\n\nwhere X is an n \u00d7 p matrix of features or predictors, \u03b2\u2217 is an unknown p-dimensional regression\nparameter, and w is a noise vector. In high-dimensional settings where n \u226a p, ordinary least squares\nis generally inappropriate. Assuming that \u03b2\u2217 is sparse (i.e., the support set S(\u03b2\u2217) \u225c {i | \u03b2\u2217i \u2260 0} has\ncardinality k < n), a mainstay algorithm for such settings is the Lasso [10]:\n\n$$\\text{Lasso:} \\quad \\hat{\\beta} = \\operatorname{argmin}_{\\beta \\in \\mathbb{R}^p} \\frac{1}{2n} \\|y - X\\beta\\|_2^2 + \\lambda \\|\\beta\\|_1.$$ (2)\n\nFor a particular choice of \u03bb, the variable selection properties of the Lasso can be analyzed by\nquantifying how well the estimated support S(\u02c6\u03b2) approximates the true support S(\u03b2\u2217). 
More careful\nanalyses focus instead on recovering the signed support S\u00b1(\u03b2\u2217),\n\n$$S_\\pm(\\beta^*_i) \\triangleq \\begin{cases} +1 & \\text{if } \\beta^*_i > 0 \\\\ -1 & \\text{if } \\beta^*_i < 0 \\\\ 0 & \\text{otherwise.} \\end{cases}$$ (3)\n\nTheoretical developments during the last decade have shed light onto the support recovery\nproperties of the Lasso and highlighted practical difficulties when the columns of X are correlated. These\ndevelopments have led to various conditions on X for support recovery, such as the mutual\nincoherence or the irrepresentable condition [1, 3, 8, 12, 13].\n\nIn recent years, several modifications of the standard Lasso have been proposed to improve its\nsupport recovery properties [2, 7, 14, 15]. In this paper we focus on a class of \u201cPreconditioned\nLasso\u201d algorithms [5, 6, 9] that pre-multiply X and y by suitable matrices PX and Py to yield\n\u00afX = PX X, \u00afy = Pyy, prior to running Lasso. Thus, the general strategy of these methods is\n\n$$\\text{Preconditioned Lasso:} \\quad \\hat{\\bar{\\beta}} = \\operatorname{argmin}_{\\beta \\in \\mathbb{R}^p} \\frac{1}{2n} \\|\\bar{y} - \\bar{X}\\beta\\|_2^2 + \\bar{\\lambda} \\|\\beta\\|_1.$$ (4)\n\nAlthough this class of algorithms often compares favorably to the Lasso in practice, our theoretical\nunderstanding of them is at present still fairly poor. Huang and Jojic [5], for example, consider\nonly empirical evaluations, while both Jia and Rohe [6] and Paul et al. [9] consider asymptotic\nconsistency under various assumptions. Important and necessary as they are, consistency results do\nnot provide insight into the relative performance of Preconditioned Lasso variants for finite data sets.\nIn this paper we provide a new theoretical basis for making such comparisons. Although the focus\nof the paper is on problems of the form of Eq. 
(4), we note that the core ideas can also be applied to\nalgorithms that right-multiply X and/or y with some matrices (e.g., [4, 11]).\nFor particular instances of X, \u03b2\u2217, we want to discover whether a given Preconditioned Lasso\nalgorithm following Eq. (4) improves or degrades signed support recovery relative to the standard\nLasso of Eq. (2). A major roadblock to a one-to-one comparison is the pair of auxiliary penalty\nparameters \u03bb, \u00af\u03bb, which trade off the \u21131 penalty against the quadratic objective in Eq. (2) and Eq. (4), respectively.\nA correct choice of penalty parameter is essential for signed support recovery: If it is too small,\nthe algorithm behaves like ordinary least squares; if it is too large, the estimated support may be\nempty. Unfortunately, in all but the simplest cases, pre-multiplying data X, y by matrices PX, Py\nchanges the relative geometry of the \u21131 penalty contours to the elliptical objective contours in a\nnontrivial way. Suppose we wanted to compare the Lasso to the Preconditioned Lasso by choosing\nfor each \u03bb in Eq. (2) a suitable, matching \u00af\u03bb in Eq. (4). For a fair comparison, the resulting\nmapping would have to capture the change of relative geometry induced by preconditioning of X, y,\ni.e. \u00af\u03bb = f(\u03bb, X, y, PX, Py). It seems difficult to theoretically characterize such a mapping. 
Furthermore, it seems unlikely that a comparative framework could be built by independently choosing\n\u201cideal\u201d penalty parameters \u03bb, \u00af\u03bb: Meinshausen and B\u00fchlmann [8], for example, demonstrate that a\nseemingly reasonable oracle estimator of \u03bb will not lead to consistent support recovery in the Lasso.\nIn the Preconditioned Lasso literature this problem is commonly sidestepped either by resorting\nto asymptotic comparisons [6, 9], by empirically comparing regularization paths [5], or by using\nmodel-selection techniques which aim to choose reasonably \u201cgood\u201d matching penalty parameters [6]. We\ndeem these approaches to be unsatisfactory: asymptotic and empirical analyses provide limited\ninsight, and model selection strategies add a layer of complexity that may lead to unfair comparisons.\nIt is our view that all of these approaches place unnecessary emphasis on particular choices of\npenalty parameter. In this paper we propose an alternative strategy that instead compares the Lasso to\nthe Preconditioned Lasso by comparing data-dependent upper and lower penalty parameter bounds.\nSpecifically, we give bounds (\u03bbu, \u03bbl) on \u03bb so that the Lasso in Eq. (2) is guaranteed to recover the\nsigned support iff \u03bbl < \u03bb < \u03bbu. Consequently, if \u03bbl > \u03bbu signed support recovery is not possible.\nThe Preconditioned Lasso in Eq. (4) uses data \u00afX = PX X, \u00afy = Pyy and will thus induce new\nbounds (\u00af\u03bbu, \u00af\u03bbl) on \u00af\u03bb. The comparison of Lasso and Preconditioned Lasso on an instance X, \u03b2\u2217\nthen proceeds by suitably comparing the bounds on \u03bb and \u00af\u03bb. 
The advantage of this approach is that\nthe upper and lower bounds are easy to compute, even though a general mapping between specific\npenalty parameters cannot be readily derived.\nTo demonstrate the effectiveness of our framework, we use it to analyze three Preconditioned Lasso\nalgorithms [5, 6, 9]. Using our framework we make several contributions: (1) We confirm intuitions\nabout advantages and disadvantages of the algorithms proposed in [5, 9]; (2) We show that for an\nSVD-based construction of n \u00d7 p matrices X, the algorithm in [6] changes the bounds\ndeterministically; (3) We show that in the context of our framework, this SVD-based construction can be\nthought of as a limit point of a Gaussian construction.\nThe paper is organized as follows. In Section 2 we will discuss three recent instances of Eq. (4). We\noutline our comparative framework in Section 3 and highlight some immediate consequences for [5]\nand [9] on general matrices X in Section 4. More detailed comparisons can be made by considering\na generative model for X. In Section 5 we introduce such a model based on a block-wise SVD of X\nand then analyze [6] for specific instances of this generative model. Finally, we show that in terms\nof signed support recovery, this generative model can be thought of as a limit point of a Gaussian\nconstruction. Section 6 concludes with some final thoughts. The proofs of all lemmas and theorems\nare in the supplementary material.\n\n2 Preconditioned Lasso Algorithms\n\nOur interest lies in the class of Preconditioned Lasso algorithms that is summarized by Eq. (4).\nExtensions to related algorithms, such as [4, 11] will follow readily. In this section we focus on three\nrecent Preconditioned Lasso examples and instantiate the matrices PX , Py appropriately. Detailed\nderivations can be found in the supplementary material. For later reference, we will denote each\nalgorithm by the author initials.\nHuang and Jojic [5] (HJ). 
Huang and Jojic proposed Correlation Sifting [5], which, although\nnot presented as a preconditioning algorithm, can be rewritten as one. Let the SVD of X be X =\nU DV\u22a4. Given an algorithm parameter q, let UA be the left singular vectors of X associated with its q smallest singular values.\u00b9\nThen HJ amounts to setting\n\n$$P_X = P_y = U_A U_A^\\top.$$ (5)\n\nPaul et al. [9] (PBHT). An earlier instance of the preconditioning idea was put forward by Paul\net al. [9]. For some algorithm parameter q, let A be the q column indices of X with largest absolute\ncorrelation to y (i.e., where |X\u22a4j y|/||Xj||2 is largest). Define UA to be the left singular vectors\nof XA associated with its q largest singular values. With this, PBHT can be expressed as setting\n\n$$P_X = I_{n \\times n}, \\qquad P_y = U_A U_A^\\top.$$ (6)\n\nJia and Rohe [6] (JR). Jia and Rohe [6] propose a preconditioning method that amounts to\nwhitening the matrix X. If X = U DV\u22a4 is full rank, then JR defines\u00b2\n\n$$P_X = P_y = U (D D^\\top)^{-1/2} U^\\top.$$ (7)\n\nIf n < p then \u00afX \u00afX\u22a4 = PX XX\u22a4P\u22a4X \u221d In\u00d7n and if n > p then \u00afX\u22a4 \u00afX = X\u22a4P\u22a4X PX X \u221d Ip\u00d7p.\nBoth HJ and PBHT estimate a basis UA for a q-dimensional subspace onto which they project y\nand/or X. However, since the methods differ substantially in their assumptions, the estimators differ\nalso. Empirical results in [5] and [9] suggest that the respective assumptions are useful in a variety of\nsituations. In contrast, JR reweights the column space directions U and requires no extra parameter\nq to be estimated.\n\n3 Comparative Framework\n\nIn this section we propose a new comparative approach for Preconditioned Lasso algorithms which\navoids choosing particular penalty parameters \u03bb, \u00af\u03bb. 
We first derive upper and lower bounds for \u03bb\nand \u00af\u03bb respectively so that signed support recovery can be guaranteed iff \u03bb and \u00af\u03bb satisfy the bounds.\nWe then compare estimators by comparing the resulting bounds.\n\n3.1 Conditions for signed support recovery\n\nBefore proceeding, we make some definitions motivated by Wainwright [12]. Suppose that the\nsupport set of \u03b2\u2217 is S \u225c S(\u03b2\u2217), with |S| = k. To simplify notation, we will assume throughout that\nS = {1, . . . , k} so that the corresponding off-support set is Sc = {1, . . . , p}\\S, with |Sc| = p \u2212 k.\nDenote by Xj column j of X and by XA the submatrix of X consisting of columns indexed by set\nA. Define the following variables: For all j \u2208 Sc and i \u2208 S, let\n\n$$\\mu_j = X_j^\\top X_S (X_S^\\top X_S)^{-1} \\operatorname{sgn}(\\beta^*_S), \\qquad \\eta_j = X_j^\\top \\left( I_{n \\times n} - X_S (X_S^\\top X_S)^{-1} X_S^\\top \\right) \\frac{w}{n},$$ (8)\n\n$$\\gamma_i = e_i^\\top \\left( \\frac{1}{n} X_S^\\top X_S \\right)^{-1} \\operatorname{sgn}(\\beta^*_S), \\qquad \\epsilon_i = e_i^\\top \\left( \\frac{1}{n} X_S^\\top X_S \\right)^{-1} X_S^\\top \\frac{w}{n}.$$ (9)\n\n\u00b9 The choice of smallest singular vectors is considered for matrices X with sharply decaying spectrum.\n\u00b2 We note that Jia and Rohe [6] let D be square, so that it can be directly inverted. If X is not full rank, the\npseudo-inverse of D can be used.\n\n(a) Signed support recovery around \u03bbl.\n\n(b) Signed support recovery around \u03bbu.\n\nFigure 1: Empirical evaluation of the penalty parameter bounds of Lemma 1. For each of 500\nsynthetic Lasso problems (n = 300, p = 1000, k = 10) we computed \u03bbl, \u03bbu as per Lemma 1.\nThen we ran Lasso using penalty parameters f \u03bbl in Figure (a) and f \u03bbu in Figure (b), where the\nfactor f = 0.5, . . . , 1.5. 
The figures show the empirical probability of signed support recovery as a\nfunction of the factor f for both \u03bbl and \u03bbu. As expected, the probabilities change sharply at f = 1.\n\nFor the traditional Lasso of Eq. (2), results in (for example) Wainwright [12] connect settings of \u03bb\nwith instances of X, \u03b2\u2217, w to certify whether or not Lasso will recover the signed support. We invert\nthese results and, for particular instances of X, \u03b2\u2217, w, derive bounds on \u03bb so that signed support\nrecovery is guaranteed if and only if the bounds are satisfied. Specifically, we prove the following\nLemma in the supplementary material.\nLemma 1. Suppose that X\u22a4S XS is invertible, |\u00b5j| < 1, \u2200j \u2208 Sc, and sgn(\u03b2\u2217i)\u03b3i > 0, \u2200i \u2208 S. Then\nthe Lasso has a unique solution \u02c6\u03b2 which recovers the signed support (i.e., S\u00b1(\u02c6\u03b2) = S\u00b1(\u03b2\u2217)) if and\nonly if \u03bbl < \u03bb < \u03bbu, where\n\n$$\\lambda_l = \\max_{j \\in S^c} \\frac{\\eta_j}{(2\\llbracket \\eta_j > 0 \\rrbracket - 1) - \\mu_j}, \\qquad \\lambda_u = \\min_{i \\in S} \\left| \\frac{\\beta^*_i + \\epsilon_i}{\\gamma_i} \\right|_+,$$ (10)\n\n\u27e6\u00b7\u27e7 denotes the indicator function and | \u00b7 |+ = max(0, \u00b7) denotes the hinge function. On the other\nhand, if X\u22a4S XS is not invertible, then the signed support cannot in general be recovered.\nLemma 1 recapitulates well-worn intuitions about when the Lasso has difficulty recovering the\nsigned support. For instance, assuming that w has a symmetric distribution with mean 0, if 1 \u2212 |\u00b5j|\nis small (i.e., the irrepresentable condition almost fails to hold), then \u03bbl will tend to be large. In\nextreme cases we might have \u03bbl > \u03bbu so that signed support recovery is impossible. 
Figure 1 empirically\nvalidates the bounds of Lemma 1 by estimating probabilities of signed support recovery for\na range of penalty parameters on synthetic Lasso problems.\n\n3.2 Comparisons\n\nIn this paper we propose to compare a preconditioning algorithm to the traditional Lasso by\ncomparing the penalty parameter bounds produced by Lemma 1. As highlighted in Eq. (4), the\npreconditioning framework runs Lasso on modified variables \u00afX = PX X, \u00afy = Pyy. For the purpose of applying\nLemma 1, these transformations induce a new noise vector\n\n$$\\bar{w} = \\bar{y} - \\bar{X}\\beta^* = P_y (X\\beta^* + w) - P_X X\\beta^*.$$ (11)\n\nNote that if PX = Py then \u00afw = Pyw. Provided the conditions of Lemma 1 hold for \u00afX, \u03b2\u2217 we can\ndefine updated variables \u00af\u00b5j, \u00af\u03b3i, \u00af\u03b7j, \u00af\u03b5i from which the bounds \u00af\u03bbu, \u00af\u03bbl on the penalty parameter \u00af\u03bb can\nbe derived. In order for our comparison to be scale-invariant, we will compare algorithms by ratios\nof resulting penalty parameter bounds. That is, we deem a Preconditioned Lasso algorithm to be\nmore effective than the traditional Lasso if \u00af\u03bbu/\u00af\u03bbl > \u03bbu/\u03bbl. Intuitively, the upper bound \u00af\u03bbu is then\ndisproportionately larger than \u00af\u03bbl relative to \u03bbu and \u03bbl, which in principle allows easier tuning of \u00af\u03bb.\u00b3\nWe will later encounter the special case \u00af\u03bbu \u2260 0, \u00af\u03bbl = 0 in which case we define \u00af\u03bbu/\u00af\u03bbl \u225c \u221e to\nindicate that the preconditioned problem is very easy. If \u00af\u03bbu/\u00af\u03bbl < 1 then signed support recovery is\nin general impossible. Finally, to match this intuition, we define \u00af\u03bbu/\u00af\u03bbl \u225c 0 if \u00af\u03bbu = \u00af\u03bbl = 0.\n\n\u00b3 Other functions of \u03bbl, \u03bbu and \u00af\u03bbl, \u00af\u03bbu could also be considered. 
However, we find the ratio to be a particularly\nintuitive measure.\n\n[Figure 1 axes, both panels: factor f (horizontal) against P(S\u00b1(\u02c6\u03b2) = S\u00b1(\u03b2\u2217)) (vertical).]\n\n4 General Comparisons\n\nWe begin our comparisons with some immediate consequences of Lemma 1 for HJ and PBHT. In\norder to highlight the utility of the proposed framework, we focus in this section on special cases of\nPX , Py. The framework can of course also be applied to general matrices PX , Py. As we will see,\nboth HJ and PBHT have the potential to improve signed support recovery relative to the traditional\nLasso, provided the matrices PX , Py are suitably estimated. The following notation will be used\nduring our comparisons: We will write \u00afA \u2aaf A to indicate that random variable A stochastically\ndominates \u00afA, that is, \u2200t P(\u00afA \u2265 t) \u2264 P(A \u2265 t). We also let US be a minimal basis for the column\nspace of the submatrix XS, and define $\\operatorname{span}(U_S) = \\{x \\mid \\exists c \\in \\mathbb{R}^k \\text{ s.t. } x = U_S c\\} \\subseteq \\mathbb{R}^n$. Finally, we\nlet USc be a minimal basis for the orthogonal complement of span(US).\nConsequences for HJ. Recall from Section 2 that HJ uses PX = Py = UAU\u22a4A, where UA is a\ncolumn basis estimated from X. We have the following theorem:\nTheorem 1. Suppose that the conditions of Lemma 1 are met for a fixed instance of X, \u03b2\u2217. If\nspan(US) \u2286 span(UA), then after preconditioning using HJ the conditions continue to hold, and\n\n$$\\frac{\\lambda_u}{\\lambda_l} \\preceq \\frac{\\bar{\\lambda}_u}{\\bar{\\lambda}_l},$$ (12)\n\nwhere the stochasticity on both sides is due to independent noise vectors w. On the other hand, if\nX\u22a4S P\u22a4X PX XS is not invertible, then HJ cannot in general recover the signed support.\nWe briefly sketch the proof of Theorem 1. 
If span(US) \u2286 span(UA) then plugging in the definition\nof PX into \u00af\u00b5j, \u00af\u03b3i, \u00af\u03b7j, \u00af\u03b5i, one can derive the following\n\n$$\\bar{\\mu}_j = \\mu_j, \\qquad \\bar{\\gamma}_i = \\gamma_i,$$ (13)\n\n$$\\bar{\\eta}_j = X_j^\\top \\left( I_{n \\times n} - U_S U_S^\\top \\right) U_A U_A^\\top \\frac{w}{n}, \\qquad \\bar{\\epsilon}_i = \\epsilon_i.$$ (14)\n\nIf span(UA) = span(US), then it is easy to see that \u00af\u03b7j = 0. Notice that because \u00af\u00b5j and \u00af\u03b3i are\nunchanged, if the conditions of Lemma 1 hold for the original Lasso problem (i.e., X\u22a4S XS is invertible,\n|\u00b5j| < 1 \u2200j \u2208 Sc and sgn(\u03b2\u2217i)\u03b3i > 0 \u2200i \u2208 S), they will continue to hold for the preconditioned\nproblem. Suppose then that the conditions set forth in Lemma 1 are met. With some additional work\none can show that\n\n$$\\bar{\\lambda}_u = \\min_{i \\in S} \\left| \\frac{\\beta^*_i + \\bar{\\epsilon}_i}{\\bar{\\gamma}_i} \\right|_+ = \\lambda_u, \\qquad \\bar{\\lambda}_l = \\max_{j \\in S^c} \\frac{\\bar{\\eta}_j}{(2\\llbracket \\bar{\\eta}_j > 0 \\rrbracket - 1) - \\bar{\\mu}_j} \\preceq \\lambda_l.$$ (15)\n\nThe result then follows by showing that \u00af\u03bbl, \u03bbl are both independent of \u00af\u03bbu = \u03bbu. Note that if\nspan(UA) = span(US), then \u00af\u03bbl = 0 and so \u00af\u03bbu/\u00af\u03bbl \u225c \u221e.\nIn the more common case when\nspan(US) \u2288 span(UA) the performance of the Lasso depends on how misaligned UA and US are. In\nextreme cases, X\u22a4S P\u22a4X PX XS is singular and so signed support recovery is not in general possible.\nConsequences for PBHT. Recall from Section 2 that PBHT uses PX = In\u00d7n, Py = UAU\u22a4A,\nwhere UA is a column basis estimated from X. We have the following theorem.\nTheorem 2. 
Suppose that the conditions of Lemma 1 are met for a fixed instance of X, \u03b2\u2217. If\nspan(US) \u2286 span(UA), after preconditioning using PBHT the conditions continue to hold, and\n\n$$\\frac{\\lambda_u}{\\lambda_l} \\preceq \\frac{\\bar{\\lambda}_u}{\\bar{\\lambda}_l},$$ (16)\n\nwhere the stochasticity on both sides is due to independent noise vectors w. On the other hand, if\nspan(USc) = span(UA), then PBHT cannot recover the signed support.\nAs before, we sketch the proof to build some intuition. Because PBHT does not set PX = Py as HJ\ndoes, there is no danger of X\u22a4S P\u22a4X PX XS becoming singular. On the other hand, this complicates\nthe form of the induced noise vector \u00afw. Plugging PX and Py into Eq. (11), we find\n\u00afw = (UAU\u22a4A \u2212 In\u00d7n)X\u03b2\u2217 + UAU\u22a4A w. However, even though the noise has a more complicated form, derivations\nin the supplementary material show that if span(US) \u2286 span(UA), then\n\n$$\\bar{\\mu}_j = \\mu_j, \\qquad \\bar{\\gamma}_i = \\gamma_i,$$ (17)\n\n$$\\bar{\\eta}_j = X_j^\\top \\left( I_{n \\times n} - U_S U_S^\\top \\right) U_A U_A^\\top \\frac{w}{n}, \\qquad \\bar{\\epsilon}_i = \\epsilon_i.$$ (18)\n\n(a) Empirical validation of Theorems 1 and 2.\n\n(b) Evaluation of JR on Gaussian ensembles.\n\nFigure 2: Experimental evaluations. Figure (a) shows empirical c.d.f.\u2019s of penalty parameter bounds\nratios estimated from 1000 variable selection problems. Each problem consists of Gaussians X and\nw, and \u03b2\u2217, with n = 100, p = 300, k = 5. The blue curve shows the c.d.f. for \u03bbu/\u03bbl estimated on\nthe original data (Lasso). Then we projected the data using PX = Py = UAU\u22a4A, where span(US) \u2286\nspan(UA) but dim(UA) = dim(span(UA)) is variable (see legend), and estimated the resulting\nc.d.f. for the updated bounds ratio \u00af\u03bbu/\u00af\u03bbl. 
As predicted by Theorems 1 and 2, \u03bbu/\u03bbl \u2aaf \u00af\u03bbu/\u00af\u03bbl. In\nFigure (b) the blue curve shows the scale factor (p \u2212 k)/(n + p\u03ba\u00b2 \u2212 k) predicted by Theorem 3 for\nproblems constructed from Eq. (19) for \u03ba = f \u221a(1 \u2212 (n/p)). The red curve plots the corresponding\nfactor estimated from the Gaussian construction in Eq. (25) (n = 100, m = 2000, p = 200, k = 5)\nusing the same \u03a3S, \u03a3Sc as in Theorem 3, averaged over 50 problem instances and with error bars\nfor one standard deviation. As in Theorem 3, the factor is approximately 1 if f = 1.\n\nAs with HJ, if span(UA) = span(US), then \u00af\u03b7j = 0. Because \u00af\u00b5j and \u00af\u03b3i are again unchanged,\nthe conditions of Lemma 1 will continue to hold for the preconditioned problem if they hold for\nthe original Lasso problem. With the previous equalities established, the remainder of the proof\nis identical to that of Theorem 1. The fact that the above \u00af\u00b5j, \u00af\u03b7j, \u00af\u03b3i, \u00af\u03b5i are identical to those of HJ\ndepends crucially on the fact that span(US) \u2286 span(UA). In general the values will differ because\nPBHT sets PX = In\u00d7n, but HJ does not.\nOn the other hand, if span(US) \u2288 span(UA) then the distribution of \u00af\u03b5i depends on how misaligned\nUA and US are. In the extreme case when span(USc) = span(UA), one can show that \u00af\u03b5i = \u2212\u03b2\u2217i,\nwhich results in \u00af\u03bbu = 0, \u00af\u03bbl \u2aaf \u03bbl. Because P(\u00af\u03bbl \u2265 0) = 1, signed support recovery is not possible.\n\nRemarks. Our theoretical analyses show that both HJ and PBHT can indeed lead to improved\nsigned support recovery relative to the Lasso on finite datasets. 
To underline our findings, we empirically\nvalidate Theorems 1 and 2 in Figure 2(a), where we plot estimated c.d.f.\u2019s for penalty\nparameter bounds ratios of Lasso and Preconditioned Lasso for various subspaces UA. Our theorems\nfocussed on specific settings of PX , Py and ignored others. In general, the gains of HJ and\nPBHT over Lasso depend on how much the decoy signals in XSc are suppressed and how much of\nthe true signal due to XS is preserved. Further comparison of HJ and PBHT must thus analyze how\nthe subspaces span(UA) are estimated in the context of the assumptions made in [5] and [9]. A final\nnote concerns the dimension of the subspace span(UA). Both HJ and PBHT were proposed with the\nimplicit goal of finding a basis UA that has the same span as US. This of course requires estimating\n|S| = k by q, which adds another layer of complexity to these algorithms. Theorems 1 and 2\nsuggest that underestimating k can be more detrimental to signed support recovery than overestimating\nit. By overestimating q > k, we can trade off milder improvement when span(US) \u2286 span(UA)\nagainst poor behavior should we have span(US) \u2288 span(UA).\n\n5 Model-Based Comparisons\n\nIn the previous section we used Lemma 1 in conjunction with assumptions on UA to make statements\nabout HJ and PBHT. Of course, the quality of the estimated UA depends on the specific instances\nX, \u03b2\u2217, w, which hinders a general analysis. Similarly, a direct application of Lemma 1 to JR yields\nbounds that exhibit strong X dependence.\nIt is possible to crystallize prototypical examples by\nspecializing X and w to come from a generative model. 
In this section we briefly present this model\nand show the resulting penalty parameter bounds for JR.\n\n[Figure 2 axes: panel (a) plots P(\u03bbu/\u03bbl < t) against t; panel (b) plots (\u00af\u03bbu/\u00af\u03bbl)/(\u03bbu/\u03bbl) against f, with legend entries \u201cOrthogonal data\u201d and \u201cGaussian data\u201d.]\n\n5.1 Generative model for X\n\nAs discussed in Section 2, many preconditioning algorithms can be phrased as truncating or\nreweighting column subspaces associated with X [5, 6, 9]. This suggests that a natural generative\nmodel for X can be formulated in terms of the SVD of submatrices of X.\nAssume p \u2212 k > n and let \u03a3S, \u03a3Sc be fixed-spectrum matrices of dimension n \u00d7 k and\nn \u00d7 (p \u2212 k) respectively. We will assume throughout this paper that the top left \u201cdiagonal\u201d entries of\n\u03a3S, \u03a3Sc are positive and the remainder is zero. Furthermore, we let U, VS, VSc be orthonormal\nbases of dimension n \u00d7 n, k \u00d7 k and (p \u2212 k) \u00d7 (p \u2212 k) respectively. We assume that these bases are\nchosen uniformly at random from the corresponding Stiefel manifold. As before and without loss of\ngenerality, suppose S = {1, . . . , k}. Then we let the Lasso problem be\n\n$$y = X\\beta^* + w \\quad \\text{with} \\quad X = U \\left[ \\Sigma_S V_S^\\top, \\; \\Sigma_{S^c} V_{S^c}^\\top \\right], \\quad w \\sim N(0, \\sigma^2 I_{n \\times n}).$$ (19)\n\nTo ensure that the column norms of X are controlled, we compute the spectra \u03a3S, \u03a3Sc by\nnormalizing spectra \u02c6\u03a3S and \u02c6\u03a3Sc with arbitrary positive elements on the diagonal. Specifically, we let\n\n$$\\Sigma_S = \\frac{\\hat{\\Sigma}_S}{\\|\\hat{\\Sigma}_S\\|_F} \\sqrt{kn}, \\qquad \\Sigma_{S^c} = \\frac{\\hat{\\Sigma}_{S^c}}{\\|\\hat{\\Sigma}_{S^c}\\|_F} \\sqrt{(p - k)n}.$$ (20)\n\nWe verify in the supplementary material that with these assumptions the squared column norms of\nX are in expectation n (provided the orthonormal bases are chosen uniformly at random).\nIntuition. 
Note that any matrix X can be decomposed using a block-wise SVD as\n\n$$X = [X_S, X_{S^c}] = U \\left[ \\Sigma_S V_S^\\top, \\; T \\Sigma_{S^c} V_{S^c}^\\top \\right],$$ (21)\n\nwith orthonormal bases U, T, VS, VSc. Our model in Eq. (19) is only a minor restriction of this\nmodel, where we set T = In\u00d7n. To develop more intuition, let us temporarily set VS = Ik\u00d7k,\nVSc = I(p\u2212k)\u00d7(p\u2212k). Then X = [XS, XSc] = U [\u03a3S, \u03a3Sc] and we see that up to scaling XS equals\nthe first k columns of XSc. The difficulty for Lasso thus lies in correctly selecting the columns in\nXS, which are highly correlated with the first few columns in XSc.\n\n5.2 Piecewise constant spectra\n\nFor notational clarity we will now focus on a special case of the above model. To begin, we develop\nsome notation. In previous sections we used US to denote a basis for the column space of XS. We\nwill continue to use this notation, and let US contain the first k columns of U. Accordingly, we\ndenote the last n \u2212 k columns of U by USc. We let the diagonal elements of \u03a3S, \u02c6\u03a3S, \u03a3Sc, \u02c6\u03a3Sc\nbe identified by their column indices. That is, the diagonal entries \u03c3S,c of \u03a3S and \u02c6\u03c3S,c of \u02c6\u03a3S\nare indexed by c \u2208 {1, . . . , k}; the diagonal entries \u03c3Sc,c of \u03a3Sc and \u02c6\u03c3Sc,c of \u02c6\u03a3Sc are indexed\nby c \u2208 {1, . . . , n}. Each of the diagonal entries in \u03a3S, \u03a3Sc is associated with a column of U.\nThe set of diagonal entries of \u03a3S and \u03a3Sc associated with US is \u03c3(S) = {1, . . . , k} and the set\nof diagonal entries in \u03a3Sc associated with USc is \u03c3(Sc) = {1, . . . , n}\\\u03c3(S). We will construct\nspectrum matrices \u03a3S, \u03a3Sc that are piecewise constant on their diagonals. For some \u03ba \u2265 0, we let\n\u02c6\u03c3S,i = 1, \u02c6\u03c3Sc,i = \u03ba \u2200i \u2208 \u03c3(S) and \u02c6\u03c3Sc,j = 1 \u2200j \u2208 \u03c3(Sc).\n\nConsequences for JR. 
Recall that for JR, if X = U DV (cid:62), then PX = Py = U(cid:0)DD(cid:62)(cid:1)\u22121/2\n\nWe have the following theorem.\nTheorem 3. Assume the Lasso problem was generated according to the generative model of\nEq. (19) with \u2200i \u2208 \u03c3(S), \u02c6\u03c3S,i = 1, \u02c6\u03c3Sc,i = \u03ba and \u2200j \u2208 \u03c3(Sc), \u02c6\u03c3Sc,j = 1 and that\nk(p \u2212 k \u2212 1). Then the conditions of Lemma 1 hold before and after precondi-\n\u03ba <\ntioning using JR. Moreover,\n\nn \u2212 k/\n\n(cid:112)\n\nU(cid:62).\n\n\u221a\n\n\u00af\u03bbu\n\u00af\u03bbl\n\n=\n\n(p \u2212 k)\n\nn + p\u03ba2 \u2212 k\n\n\u03bbu\n\u03bbl\n\n.\n\n(22)\n\nIn other words, JR deterministically scales the ratio of penalty parameter bounds. The proof idea\nis as follows. It is easy to see that X(cid:62)S XS is always invertible. Furthermore, one can show that if\n\n7\n\n\f\u221a\n\nn \u2212 k/\n\n(cid:112)\nk(p \u2212 k \u2212 1), we have |\u00b5j| < 1,\u2200j \u2208 Sc and sgn(\u03b2\u2217i )\u03b3i > 0,\u2200i \u2208 S. Thus, by our\n\u03ba <\nassumptions, the preconditions of Lemma 1 are satis\ufb01ed for the original Lasso problem. Plugging in\nthe de\ufb01nitions of \u03a3S, \u03a3Sc into Eq. (19) we \ufb01nd that the SVD becomes X = U DV (cid:62), where U is the\nsame column basis as in Eq. (19), and the diagonal elements of D are determined by \u03ba. Substituting\nthis into the de\ufb01nitions of \u00af\u00b5j, \u00af\u03b3i, \u00af\u03b7j, \u00af\u0001i, we have that after preconditioning using JR\n\n(cid:18)\n\n(cid:19)\n\nn(p \u2212 k)\u03ba2\nk\u03ba2 + n \u2212 k\n\n\u00af\u00b5j = \u00b5j\n\n\u00af\u03b3i =\n\nn +\n\n(k\u03ba2 + n \u2212 k)\n\nn(p \u2212 k)\n\n\u00af\u03b7j =\n\n\u03b7j\n\n\u00af\u0001i = \u0001i.\n\n\u03b3i\n\n(23)\n\n(24)\n\nThus, if the conditions of Lemma 1 hold for X, \u03b2\u2217, they will continue to hold after precondition-\n\ning using JR. 
Furthermore, notice that $(2[\\![\\bar{\\eta}_j > 0]\\!] - 1) - \\bar{\\mu}_j = (2[\\![\\eta_j > 0]\\!] - 1) - \\mu_j$. Applying\nLemma 1 then gives the new ratio \u00af\u03bbu/\u00af\u03bbl as claimed. According to Theorem 3 the ratio \u00af\u03bbu/\u00af\u03bbl will\nbe larger than \u03bbu/\u03bbl iff $\\kappa < \\sqrt{1 - (n/p)}$. Indeed, if $\\kappa = \\sqrt{1 - (n/p)}$ then PX = Py \u221d In\u00d7n and\nso JR coincides with standard Lasso.\n\n5.3 Extension to Gaussian ensembles\n\nThe construction in Eq. (19) uses an orthonormal matrix U as the column basis of X. At first\nsight this may appear to be restrictive. However, as we show in the supplementary material, one\ncan construct Lasso problems using a Gaussian basis W m which lead to penalty parameter bounds\nratios that converge in distribution to those of the Lasso problem in Eq. (19). For some fixed \u03b2\u2217, VS,\nVSc, \u03a3S and \u03a3Sc, generate two independent problems: one using Eq. (19), and one according to\n\n$$y^m = X^m \\beta^* + w^m \\quad \\text{with} \\quad X^m = \\frac{1}{\\sqrt{n}} W^m \\left[ \\Sigma_S V_S^\\top, \\; \\Sigma_{S^c} V_{S^c}^\\top \\right], \\quad w^m \\sim N\\left( 0, \\sigma^2 \\frac{m}{n} I_{m \\times m} \\right), \\qquad (25)$$\n\nwhere W m is an m \u00d7 n standard Gaussian ensemble. Note that an X so constructed is low rank if\nn < p. The latter generative model bears some resemblance to Gaussian models considered in Paul\net al. [9] (Eq. (7)) and Jia and Rohe [6] (Proposition 2). Note that while the problem in Eq. (19) uses\nn observations with noise variance \u03c32, Eq. (25) has m observations with noise variance \u03c32m/n.\nThe increased variance is necessary because the matrix W m has expected squared column length m, while\ncolumns in U are of length 1. We will think of n as fixed and will let m \u2192 \u221e. Let the penalty\nparameter bounds ratio induced by the problem in Eq. (19) be \u03bbu/\u03bbl and that induced by Eq. (25)\nbe $\\lambda^m_u / \\lambda^m_l$. Then we have the following result.\n\nTheorem 4. Let VS, VSc, \u03a3S, \u03a3Sc and \u03b2\u2217 be fixed. If the conditions of Lemma 1 hold for X, \u03b2\u2217,\nthen for m large enough they will hold for X m, \u03b2\u2217. Furthermore, as m \u2192 \u221e,\n\n$$\\frac{\\lambda^m_u}{\\lambda^m_l} \\stackrel{d}{\\to} \\frac{\\lambda_u}{\\lambda_l}, \\qquad (26)$$\n\nwhere the stochasticity on the left is due to W m, wm and on the right is due to w.\n\nThus, with respect to the bounds ratio \u03bbu/\u03bbl, the construction of Eq. (19) can be thought of as the\nlimiting construction of Gaussian Lasso problems in Eq. (25) for large m. As such, we believe\nthat Eq. (19) is a useful proxy for less restrictive generative models. Indeed, as the experiment in\nFigure 2(b) shows, Theorem 3 can be used to predict the scaling factor for penalty parameter bounds\nratios (i.e., $(\\bar{\\lambda}_u / \\bar{\\lambda}_l) / (\\lambda_u / \\lambda_l)$) with good accuracy even for Gaussian ensembles.\n\n6 Conclusions\n\nThis paper proposes a new framework for comparing Preconditioned Lasso algorithms to the standard\nLasso which skirts the difficulty of choosing penalty parameters. By eliminating this parameter\nfrom consideration, finite data comparisons can be greatly simplified, avoiding the use of model\nselection strategies. To demonstrate the framework\u2019s usefulness, we applied it to a number of Preconditioned\nLasso algorithms and in the process confirmed intuitions and revealed fragilities and\nmitigation strategies. Additionally, we presented an SVD-based generative model for Lasso problems\nthat can be thought of as the limit point of a less restrictive Gaussian model. 
We believe this\nwork to be a first step towards a comprehensive theory for evaluating and comparing Lasso-style\nalgorithms and believe that the strategy can be extended to comparing other penalized likelihood\nmethods on finite datasets.\n\nReferences\n\n[1] D.L. Donoho, M. Elad, and V.N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. Information Theory, IEEE Transactions on, 52(1):6\u201318, 2006.\n\n[2] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348\u20131360, 2001.\n\n[3] J.J. Fuchs. Recovery of exact sparse representations in the presence of bounded noise. Information Theory, IEEE Transactions on, 51(10):3601\u20133608, 2005.\n\n[4] H.-C. Huang, N.-J. Hsu, D.M. Theobald, and F.J. Breidt. Spatial Lasso with applications to GIS model selection. Journal of Computational and Graphical Statistics, 19(4):963\u2013983, 2010.\n\n[5] J.C. Huang and N. Jojic. Variable selection through Correlation Sifting. In V. Bafna and S.C. Sahinalp, editors, RECOMB, volume 6577 of Lecture Notes in Computer Science, pages 106\u2013123. Springer, 2011.\n\n[6] J. Jia and K. Rohe. \u201cPreconditioning\u201d to comply with the irrepresentable condition. 2012.\n\n[7] N. Meinshausen. Lasso with relaxation. Technical Report 129, Eidgen\u00f6ssische Technische Hochschule, Z\u00fcrich, 2005.\n\n[8] N. Meinshausen and P. B\u00fchlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436\u20131462, 2006.\n\n[9] D. Paul, E. Bair, T. Hastie, and R. Tibshirani. \u201cPreconditioning\u201d for feature selection and regression in high-dimensional problems. Annals of Statistics, 36(4):1595\u20131618, 2008.\n\n[10] R. Tibshirani. Regression shrinkage and selection via the Lasso. 
Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1994.\n\n[11] R.J. Tibshirani. The solution path of the Generalized Lasso. Stanford University, 2011.\n\n[12] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using \u21131-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183\u20132202, 2009.\n\n[13] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541\u20132563, 2006.\n\n[14] H. Zou. The Adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101:1418\u20131429, 2006.\n\n[15] H. Zou and T. Hastie. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67:301\u2013320, 2005.\n", "award": [], "sourceid": 557, "authors": [{"given_name": "Fabian", "family_name": "Wauthier", "institution": "UC Berkeley"}, {"given_name": "Nebojsa", "family_name": "Jojic", "institution": "Microsoft Research"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}]}