{"title": "When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements", "book": "Advances in Neural Information Processing Systems", "page_first": 989, "page_last": 997, "abstract": "We consider the problem of accurately estimating a high-dimensional sparse vector using a small number of linear measurements that are contaminated by noise.  It is well known that standard computationally tractable sparse recovery algorithms, such as the Lasso, OMP, and their various extensions, perform poorly when the measurement matrix contains highly correlated columns.  We develop a simple greedy algorithm, called SWAP, that iteratively swaps variables until a desired loss function cannot be decreased any further.  SWAP is surprisingly effective in handling measurement matrices with high correlations.  We prove that SWAP can be easily used as a wrapper around standard sparse recovery algorithms for improved performance.  We theoretically quantify the statistical guarantees of SWAP and complement our analysis with numerical results on synthetic and real data.", "full_text": "When in Doubt, SWAP: High-Dimensional\n\nSparse Recovery from Correlated Measurements\n\nDivyanshu Vats\nRice University\n\nHouston, TX 77251\ndvats@rice.edu\n\nRichard Baraniuk\n\nRice University\n\nHouston, TX 77251\nrichb@rice.edu\n\nAbstract\n\nWe consider the problem of accurately estimating a high-dimensional sparse vec-\ntor using a small number of linear measurements that are contaminated by noise. It\nis well known that standard computationally tractable sparse recovery algorithms,\nsuch as the Lasso, OMP, and their various extensions, perform poorly when the\nmeasurement matrix contains highly correlated columns. We develop a simple\ngreedy algorithm, called SWAP, that iteratively swaps variables until a desired\nloss function cannot be decreased any further. SWAP is surprisingly effective in\nhandling measurement matrices with high correlations. 
We prove that SWAP can\neasily be used as a wrapper around standard sparse recovery algorithms for\nimproved performance. We theoretically quantify the statistical guarantees of SWAP\nand complement our analysis with numerical results on synthetic and real data.\n\n1 Introduction\n\nAn important problem that arises in many applications is that of recovering a high-dimensional\nsparse (or approximately sparse) vector given a small number of linear measurements. Depending\non the problem of interest, the unknown sparse vector can encode relationships between genes [1],\npower line failures in massive power grid networks [2], sparse representations of signals [3, 4], or\nedges in a graphical model [5,6], to name just a few applications. The simplest, but still very useful,\nsetting is when the observations can be approximated as a sparse linear combination of the columns\nin a measurement matrix X weighted by the non-zero entries of the unknown sparse vector. In\nthis paper, we study the problem of recovering the location of the non-zero entries, say S\u2217, in\nthe unknown vector, which is equivalent to recovering the columns of X that y depends on. In the\nliterature, this problem is often referred to as the sparse recovery or the support recovery problem.\nAlthough several tractable sparse recovery algorithms have been proposed in the literature,\nstatistical guarantees for accurately estimating S\u2217 can only be provided under conditions that limit how\ncorrelated the columns of X can be. For example, if there exists a column, say Xi, that is nearly\nlinearly dependent on the columns indexed by S\u2217, some sparse recovery algorithms may falsely select\nXi. In certain applications, where X can be specified a priori, correlations can easily be avoided\nby appropriately choosing X. However, in many applications, X cannot be specified by a\npractitioner, and correlated measurement matrices are inevitable. 
For example, when the columns in X\ncorrespond to gene expression values, it has been observed that genes in the same pathway produce\ncorrelated values [1]. Additionally, it has been observed that regions in the brain that are in close\nproximity produce correlated signals as measured using an MRI [7].\nIn this paper, we develop new sparse recovery algorithms that can accurately recover S\u2217 for\nmeasurement matrices that exhibit strong correlations. We propose a greedy algorithm, called SWAP,\nthat iteratively swaps variables starting from an initial estimate of S\u2217 until a desired loss function\ncannot be decreased any further. We prove that SWAP can accurately identify the true signal support\nunder relatively mild conditions on the restricted eigenvalues of the matrix $X^T X$ and under certain\nconditions on the correlations between the columns of X. A novel aspect of our theory is that the\nconditions we derive are only needed when conventional sparse recovery algorithms fail to recover\nS\u2217. This motivates the use of SWAP as a wrapper around sparse recovery algorithms for improved\nperformance. Finally, using numerical simulations, we show that SWAP consistently outperforms\nmany state of the art algorithms on both synthetic and real data corresponding to gene expression\nvalues.\nAs alluded to earlier, several algorithms now exist in the literature for accurately estimating S\u2217. The\ntheoretical properties of such algorithms either depend on the irrepresentability condition [5, 8\u201310]\nor various forms of the restricted eigenvalue conditions [11,12]. See [13] for a comprehensive review\nof such algorithms and the related conditions. SWAP is a greedy algorithm with novel guarantees\nfor sparse recovery and we make appropriate comparisons in the text. Another line of research when\ndealing with correlated measurements is to estimate a superset of S\u2217; see [14\u201318] for examples.\nThe rest of the paper is organized as follows. 
Section 2 formally defines the sparse recovery problem.\nSection 3 introduces SWAP. Section 4 presents theoretical results on the conditions needed for\nprovably correct sparse recovery. Section 5 discusses numerical simulations. Section 6 summarizes\nthe paper and discusses future work.\n\n2 Problem Setup\nThroughout this paper, we assume that $y \\in \\mathbb{R}^n$ and $X \\in \\mathbb{R}^{n \\times p}$ are known and related to each other\nby the linear model\n\n$$y = X\\beta^* + w , \\qquad (1)$$\n\nwhere $\\beta^* \\in \\mathbb{R}^p$ is the unknown sparse vector that we seek to estimate. We assume that the columns\nof X are normalized, i.e., $\\|X_i\\|_2^2/n = 1$ for all i \u2208 [p], where we use the notation [p] = {1, 2, . . . , p}\nthroughout the paper. In practice, normalization can easily be done by scaling X and \u03b2\u2217 accordingly.\nWe assume that the entries of w are i.i.d. zero-mean sub-Gaussian random variables with parameter\n\u03c3 so that $E[\\exp(t w_i)] \\le \\exp(t^2 \\sigma^2/2)$. The sub-Gaussian condition on w is common in the literature\nand allows for a wide class of noise models, including Gaussian, symmetric Bernoulli, and bounded\nrandom variables. We let k be the number of non-zero entries in \u03b2\u2217, and let S\u2217 denote the location\nof the non-zero entries. It is common to refer to S\u2217 as the support of \u03b2\u2217 and we adopt this notation\nthroughout the paper.\nOnce S\u2217 has been estimated, it is relatively straightforward to estimate \u03b2\u2217. Thus, we mainly focus\non the sparse recovery problem of estimating S\u2217. A classical strategy for sparse recovery is to\nsearch for a support of size k that minimizes a suitable loss function. 
For a support S, we assume\nthe least-squares loss, which is defined as follows:\n\n$$L(S; y, X) := \\min_{\\alpha \\in \\mathbb{R}^{|S|}} \\|y - X_S \\alpha\\|_2^2 = \\|\\Pi^{\\perp}[S] y\\|_2^2 , \\qquad (2)$$\n\nwhere $X_S$ refers to an n \u00d7 |S| matrix that only includes the columns indexed by S and\n$\\Pi^{\\perp}[S] = I - X_S (X_S^T X_S)^{-1} X_S^T$ is the orthogonal projection onto the null space of the linear operator $X_S$. In\nthis paper, we design a sparse recovery algorithm that provably, and efficiently, finds the true support\nfor a broad class of measurement matrices that includes matrices with high correlations.\n\n3 Overview of SWAP\n\nWe now describe our proposed greedy algorithm SWAP. Recall that our main goal is to find a\nsupport $\\widehat{S}$ that minimizes the loss defined in (2). Suppose that we are given an estimate, say S(1), of\nthe true support and let L(1) be the corresponding least-squares loss (see (2)). We want to transition\nto another estimate S(2) that is closer (in terms of the number of true variables), or equal, to S\u2217. Our\nmain idea to transition from S(1) to an appropriate S(2) is to swap variables as follows:\nSwap every i \u2208 S(1) with i' \u2208 (S(1))c and compute the resulting loss $L^{(1)}_{i,i'} = L(\\{S^{(1)} \\setminus i\\} \\cup i'; y, X)$.\nIf $\\min_{i,i'} L^{(1)}_{i,i'} < L^{(1)}$, there exists a support that has a lower loss than the original one.\nSubsequently, we find $\\{\\hat{i}, \\hat{i}'\\} = \\arg\\min_{i,i'} L^{(1)}_{i,i'}$ and let $S^{(2)} = \\{S^{(1)} \\setminus \\hat{i}\\} \\cup \\{\\hat{i}'\\}$. 
We repeat the\nabove steps to find a sequence of supports S(1), S(2), . . . , S(r), where S(r) has the property that\n$\\min_{i,i'} L^{(r)}_{i,i'} \\ge L^{(r)}$. In other words, we stop SWAP when perturbing S(r) by one variable increases\nor does not change the resulting loss.\n\n[Figure 1 plots omitted. Legend: TLasso, S-TLasso, FoBa, S-FoBa, CoSaMP, S-CoSaMP, MaR, S-MaR. Axes: (c) True Positive Rate vs. Sparsity Level; (d) Mean # of Iterations vs. Sparsity Level.]\n\nFigure 1: Example of using SWAP on pseudo real data where the design matrix X corresponds to\ngene expression values and y is simulated. The notation S-Alg refers to the SWAP based algorithms.\n(a) Histogram of sparse eigenvalues of X over 10,000 random sets of size 10; (b) legend; (c) mean\ntrue positive rate vs. sparsity; (d) mean number of iterations vs. sparsity.\n\nAlgorithm 1: SWAP(y, X, S)\nInputs: Measurements y, design matrix X, and initial support S.\n1 Let r = 1, S(1) = S, and L(1) = L(S(1); y, X)\n2 Swap i \u2208 S(r) with i' \u2208 (S(r))c and compute the loss $L^{(r)}_{i,i'} = L(\\{S^{(r)} \\setminus i\\} \\cup i'; y, X)$.\n3 if $\\min_{i,i'} L^{(r)}_{i,i'} < L^{(r)}$ then\n4   $\\{\\hat{i}, \\hat{i}'\\} = \\arg\\min_{i,i'} L^{(r)}_{i,i'}$ (In case of a tie, choose a pair arbitrarily)\n5   Let $S^{(r+1)} = \\{S^{(r)} \\setminus \\hat{i}\\} \\cup \\hat{i}'$ and L(r+1) be the corresponding loss.\n6   Let r = r + 1 and repeat steps 2-4.\n  else\n7   Return $\\widehat{S} = S^{(r)}$.\n\n
These steps are summarized in Algorithm 1.\nFigure 1 illustrates the performance of SWAP for a matrix X that corresponds to 83 samples of\n2308 gene expression values for patients with small round blue cell tumors [19]. Since there is no\nground truth available, we simulate the observations y using Gaussian w with \u03c3 = 0.5 and randomly\nchosen sparse vectors with non-zero entries between 1 and 2. Figure 1(a) shows the histogram of the\neigenvalues of 10,000 randomly chosen matrices $X_A^T X_A/n$, where |A| = 10. We clearly see that\nthese eigenvalues are very small. This means that the columns of X are highly correlated with each\nother. Figure 1(c) shows the mean fraction of variables estimated to be in the true support over 100\ndifferent trials. Figure 1(d) shows the mean number of iterations required for SWAP to converge.\nRemark 3.1. The main input to SWAP is the initial support S. This parameter implicitly specifies the\ndesired sparsity level. Although SWAP can be used with a random initialization S, we recommend\nusing SWAP in combination with another sparse recovery algorithm. For example, in Figure 1(c),\nwe run SWAP using four different types of initializations. The dashed lines represent standard\nsparse recovery algorithms, while the solid lines with markers represent SWAP algorithms. We\nclearly see that all SWAP based algorithms outperform standard algorithms. Intuitively, since many\nsparse recovery algorithms can perform partial support recovery, using such an initialization results\nin a smaller search space when searching for the true support.\nRemark 3.2. Since each iteration of SWAP necessarily produces a unique loss, the supports\nS(1), . . . , S(r) are all unique. Thus, SWAP clearly converges in a finite number of iterations. The\nexact convergence rate depends on the correlations in the matrix X. 
Although we do not theoretically\nquantify the convergence rate, in all numerical simulations, and over a broad range of design\nmatrices, we observed that SWAP converged in roughly O(k) iterations. See Figure 1(d) for an\nexample.\nRemark 3.3. Using the properties of orthogonal projections, we can write Line 2 of SWAP as a\ndifference of two rank one projection matrices. The main computational complexity is in computing\nthis quantity k(p \u2212 k) times for all i \u2208 S(r) and i' \u2208 (S(r))c. If the computational complexity of\ncomputing a rank k orthogonal projection is Ik, then Line 2 can be implemented in time O(k(Ik +\np \u2212 k)). When k is small relative to p (k \u226a p), Ik = O(k^3). When k is large, several computational tricks\ncan be used to significantly reduce the computational time.\nRemark 3.4. SWAP differs significantly from other greedy algorithms in the literature. When k\nis known, the main distinctive feature of SWAP is that it always maintains a k-sparse estimate\nof the support. Note that the same is true for the computationally intractable exhaustive search\nalgorithm [10]. Other competitive algorithms, such as forward-backward (FoBa) [20] or CoSaMP\n[21], usually estimate a signal with higher sparsity level and iteratively remove variables until k\nvariables are selected. The same is true for multi-stage algorithms [22\u201325]. Intuitively, as we shall\nsee in Section 4, by maintaining a support of size k, the performance of SWAP only depends on\ncorrelations among the columns of the matrix XA, where A is of size at most 2k and it includes the\ntrue support. In contrast, for other sparse recovery algorithms, |A| \u2265 2k. In Figure 1, we compare\nSWAP to several state of the art algorithms (see Section 5 for a description of the algorithms). 
In all\ncases, SWAP results in superior performance.\n\n4 Theoretical Analysis of SWAP\n\n4.1 Some Important Parameters\n\nIn this Section, we collect some important parameters that determine the performance of SWAP.\nFirst, we define the restricted eigenvalue as\n\n$$\\rho_{k+\\ell} := \\inf \\left\\{ \\frac{\\|X\\theta\\|_2^2}{n \\|\\theta\\|_2^2} : \\|\\theta\\|_0 \\le k + \\ell , \\; |S^* \\cap \\mathrm{supp}(\\theta)| = k \\right\\} . \\qquad (3)$$\n\nThe parameter $\\rho_{k+\\ell}$ is the minimum eigenvalue of certain blocks of the matrix $X^T X/n$ of size 2k\nthat includes the blocks $X_{S^*}^T X_{S^*}/n$. Smaller values of $\\rho_{k+\\ell}$ correspond to correlated columns in\nthe matrix X. Next, we define the minimum absolute value of the non-zero entries in \u03b2\u2217 as\n\n$$\\beta_{\\min} := \\min_{i \\in S^*} |\\beta_i^*| . \\qquad (4)$$\n\nA smaller \u03b2min will evidently require a larger number of observations for exact recovery of the support.\nFinally, we define a parameter that characterizes the correlations between the columns of the matrix\nXS\u2217 and the columns of the matrix X(S\u2217)c, where recall that S\u2217 is the true support of the unknown\nsparse vector \u03b2\u2217. For a set \u2126k,d that contains all supports of size k with at least k \u2212 d active variables\nfrom S\u2217, define \u03b3d as\n\n$$\\gamma_d^2 := \\max_{S \\in \\Omega_{k,d} \\setminus S^*} \\; \\min_{i \\in (S^*)^c \\cap S} \\frac{\\left\\| \\Sigma^{S \\setminus i}_{i,\\bar{S}} \\left( \\Sigma^{S \\setminus i}_{\\bar{S},\\bar{S}} \\right)^{-1} \\right\\|_1^2}{\\Sigma^{S \\setminus i}_{i,i}} , \\qquad \\bar{S} = S^* \\setminus S , \\qquad (5)$$\n\nwhere $\\Sigma^{B} = X^T \\Pi^{\\perp}[B] X/n$. Popular sparse regression algorithms, such as the Lasso and the OMP,\ncan perform accurate support recovery when $\\zeta^2 = \\max_{i \\in (S^*)^c} \\|\\Sigma_{i,S^*} \\Sigma_{S^*,S^*}^{-1}\\|_1^2 < 1$. 
We will show\nin Section 4.2 that SWAP can perform accurate support recovery when \u03b3d < 1. Although the form\nof \u03b3d is similar to \u03b6, there are several key differences, which we highlight as follows:\n\u2022 Since \u2126k,d contains all supports such that |S\u2217\\S| \u2264 d, it is clear that \u03b3d is the \u21131 norm of a d \u00d7 1\nvector, where d \u2264 k. In contrast, \u03b6 is the \u21131 norm of a k \u00d7 1 vector. If indeed \u03b6 < 1, i.e., accurate\nsupport recovery is possible using the Lasso, then SWAP can be initialized by the output of the\nLasso. In this case, \u03b3(\u2126) = 0 and SWAP also outputs the true support as long as S\u2217 minimizes\nthe loss function. We make this statement precise in Theorem 4.1. Thus, it is only when \u03b6 \u2265 1\nthat the parameter \u03b3d plays a role in the performance of SWAP.\n\u2022 The parameter \u03b6 directly computes correlations between the columns of X. In contrast, \u03b3d\ncomputes correlations between the columns of X when projected onto the null space of a matrix XB,\nwhere |B| = d \u2212 1.\n\u2022 Notice that \u03b3d is computed by taking a maximum over supports in the set \u2126k,d\\S\u2217 and a minimum\nover inactive variables in each support. The minimum appears in \u03b3d because we\nchoose to swap variables that result in the smallest loss. In contrast, \u03b6 is computed by taking a\nmaximum over all inactive variables.\n\n4.2 Statement of Main Results\n\nIn this Section, we state the main results that characterize the performance of SWAP. Throughout\nthis Section, we assume the following:\n\n(A1) The observations y and the measurement matrix X follow the linear model in (1), where the\nnoise is sub-Gaussian with parameter \u03c3, and the columns of X have been normalized.\n\n(A2) SWAP is initialized with a support S(1) of size k and $\\widehat{S}$ is the output of SWAP. 
Since k is\ntypically unknown, a suitable value can be selected using standard model selection algorithms\nsuch as cross-validation or stability selection [26].\n\nOur first result for SWAP is as follows.\nTheorem 4.1. Suppose (A1)-(A2) holds and $|S^* \\setminus S^{(1)}| \\le 1$. If $n > \\frac{4 + \\log(k^2 (p-k))}{c^2 \\beta_{\\min}^2 \\rho_{2k}/2}$, where $0 < c^2 \\le 1/(18\\sigma^2)$, then $P(\\widehat{S} = S^*) \\to 1$ as (n, p, k) \u2192 \u221e.\n\nThe proof of Theorem 4.1 can be found in the extended version of our paper [27].\nInformally,\nTheorem 4.1 states that if the input to SWAP falsely detects at most one variable, then SWAP\nis high-dimensional consistent when given a sufficient number of observations n. The condition\non n is mainly enforced to guarantee that the true support S\u2217 minimizes the loss function. This\ncondition is weaker than the sufficient conditions required for other computationally tractable sparse\nrecovery algorithms. For example, the method FoBa is known to be superior to other methods\nsuch as the Lasso and the OMP. As shown in [20], FoBa requires that $n = \\Omega(\\log(p)/(\\rho_{k+\\ell}^3 \\beta_{\\min}^2))$\nfor high-dimensional consistent support recovery, where the choice of \u2113, which is greater than k,\ndepends on the correlations in the matrix X.\nIn contrast, the condition in Theorem 4.1, which reduces\nto $n = \\Omega(\\log(p - k)/(\\rho_{2k} \\beta_{\\min}^2))$, is weaker since $1/\\rho_{2k} < 1/\\rho_{k+\\ell}^3$ for \u2113 > k and p \u2212 k < p.\nThis shows that if a sparse recovery algorithm can accurately estimate the true support, then SWAP\ndoes not introduce any false positives and also outputs the true support. Furthermore, if a sparse\nregression algorithm falsely detects one variable, then SWAP can potentially recover the correct\nsupport. 
Thus, using SWAP with other algorithms does not harm the sparse recovery performance\nof other algorithms.\nWe now consider the more interesting case when SWAP is initialized by a support S(1) that falsely\ndetects more than one variable. In this case, SWAP will clearly need more than one iteration to\nrecover the true support. Furthermore, to ensure that the true support can be recovered, we need to\nimpose some additional assumptions on the measurement matrix X. The particular assumption we\nenforce will depend on the parameter \u03b3k defined in (5). As mentioned in Section 4.1, \u03b3k captures\nthe correlations between the columns of XS\u2217 and the columns of X(S\u2217)c. To simplify the statement\nin the next theorem, define $g(\\delta, \\rho, c) = (\\delta - 1) + 2c(\\sqrt{\\delta} + 1/\\sqrt{\\rho}) + 2c^2$.\nTheorem 4.2. Suppose (A1)-(A2) holds and $|S^* \\setminus S^{(1)}| > 1$. If for a constant c such that $0 < c^2 < 1/(18\\sigma^2)$, $g(\\gamma_k, \\rho_{k,1}, c\\sigma) < 0$, $\\log \\binom{p}{k} > 4 + \\log(k^2 (p-k))$, and $n > \\frac{2 \\log \\binom{p}{k}}{c^2 \\beta_{\\min}^2 \\rho_{2k}^2}$, then $P(\\widehat{S} = S^*) \\to 1$ as (n, p, k) \u2192 \u221e.\n\nTheorem 4.2 says that if SWAP is initialized with any support of size k, and \u03b3k satisfies the condition\nstated in the theorem, then SWAP will output the true support when given a sufficient number\nof observations. In the noiseless case, i.e., when \u03c3 = 0, the condition required for accurate support\nrecovery reduces to \u03b3k < 1. The proof of Theorem 4.2, outlined in [27], relies on imposing conditions\non each support of size k such that there exists a swap so that the loss can be necessarily\ndecreased. 
Clearly, if such a property holds for each support, except S\u2217, then SWAP will output the\ntrue support since (i) there are only a finite number of possible supports, and (ii) each iteration of\nSWAP results in a different support. The dependence on $\\binom{p}{k}$ in the expression for the number of\nobservations n arises from applying the union bound over all supports of size k.\nThe condition in Theorem 4.2 is independent of the initialization S(1). This is why the sample\ncomplexity, i.e., the number of observations n required for consistent support recovery, scales as\n$\\log \\binom{p}{k}$. To reduce the sample complexity, we can impose additional conditions on the support\nS(1) that is used to initialize SWAP. Under such assumptions, assuming that $|S^* \\setminus S^{(1)}| > d$, the\nperformance of SWAP will depend on \u03b3d, which is less than \u03b3k, and n will scale as $\\log \\binom{p}{d}$. We\nrefer to [27] for more details.\n\n5 Numerical Simulations\n\nIn this section, we show how SWAP compares to other sparse recovery algorithms. Section 5.1\npresents results for synthetic data and Section 5.2 presents results for real data.\n\n5.1 Synthetic Data\n\nTo illustrate the advantages of SWAP, we use the following examples:\n\n(A1) We sample the rows of X from a Gaussian distribution with mean zero and covariance \u03a3. The\ncovariance \u03a3 is block-diagonal with blocks of size 10. The entries in each block $\\bar{\\Sigma}$ are\nspecified as follows: $\\bar{\\Sigma}_{ii} = 1$ for i \u2208 [10] and $\\bar{\\Sigma}_{ij} = a$ for i \u2260 j. This construction of the design\nmatrix is motivated from [18]. The true support is chosen so that each variable in the support\nis assigned to a different block. The non-zero entries in \u03b2\u2217 are chosen uniformly between 1\nand 2. We let \u03c3 = 1, p = 500, n = 100, 200, k = 20, and a = 0.5, 0.55, . . . , 0.9, 0.95.\n\n(A2) We sample X from the same distribution as described in (A1). 
The only difference is that the\ntrue support is chosen so that five different blocks contain active variables and each chosen\nblock contains four active variables. The rest of the parameters are also the same.\n\nIn both (A1) and (A2), as a increases, the strength of correlations between the columns increases.\nFurther, the restricted eigenvalue parameter for (A1) is greater than the restricted eigenvalue\nparameter of (A2).\nWe use the following sparse recovery algorithms to initialize SWAP: (i) Lasso, (ii) Thresholded\nLasso (TLasso) [25], (iii) Forward-Backward (FoBa) [20], (iv) CoSaMP [21], (v) Marginal\nRegression (MaR), and (vi) Random. TLasso first applies Lasso to select a superset of the support and then\nselects the largest k as the estimated support. In our implementation, we used Lasso to select 2k\nvariables and then selected the largest k variables after least-squares. This algorithm is known to\nhave better performance than the Lasso. FoBa uses a combination of a forward and a backward\nalgorithm. CoSaMP is an iterative greedy algorithm. MaR selects the support by choosing the largest\nk variables in $|X^T y|$. Finally, Random selects a random subset of size k. We use the notation\nS-TLasso to refer to the algorithm that uses TLasso as an initialization for SWAP. A similar notation\nfollows for other algorithms.\nOur results are shown in Figure 2. We use two metrics to assess the performance of SWAP. The\nfirst metric is the true positive rate (TPR), i.e., the number of active variables in the estimate divided\nby the total number of active variables. The second metric is the number of iterations needed\nfor SWAP to converge. Since all the results are over supports of size k, the false positive rate (FPR)\nis simply 1 \u2212 TPR. 
All results for SWAP based algorithms have markers, while all results for non\nSWAP based algorithms are represented in dashed lines.\nFrom the TPR performance, we clearly see the advantages of using SWAP in practice. For different\nchoices of the algorithm Alg, when n = 100, the performance of S-Alg is always better than the\nperformance of Alg. When the number of observations increases to n = 200, we observe that all\nSWAP based algorithms perform better than standard sparse recovery algorithms. For (A1), we\nhave exact support recovery for SWAP when a \u2264 0.9. For (A2), we have exact support recovery\nwhen a < 0.8. This difference arises from the different placement of the\nnon-zero entries.\nFigures 2(a) and 2(b) show the mean number of iterations required by SWAP based algorithms as\nthe correlations in the matrix X increase. We clearly see that the number of iterations increases with\nthe degree of correlations. For algorithms that estimate a large fraction of the true support (TLasso,\nFoBa, and CoSaMP), the number of iterations is generally very small. 
For MaR and Random, the\nnumber of iterations is larger, but still comparable to the sparsity level of k = 20.\n\n[Figure 2 plots omitted. Legend: Lasso, S-Lasso, TLasso, S-TLasso, FoBa, S-FoBa, CoSaMP, S-CoSaMP, MaR, S-MaR, S-Random. Axes: Mean TPR or Mean # of Iterations vs. Degree of Correlation. Panels: (a) Legend; (b) Example (A1), n = 100; (c) Example (A1), n = 100; (d) Example (A2), n = 100; (e) Example (A2), n = 100; (f) Example (A2), n = 100; (g) Example (A1), n = 200; (h) Example (A1), n = 200; (i) Example (A2), n = 200.]\n\nFigure 2: Empirical true positive rate (TPR) and number of iterations required by SWAP.\n\n5.2 Gene Expression Data\n\nWe now present results on two gene expression cancer datasets. The first dataset1 contains\nexpression values from patients with two different types of cancers related to leukemia. The second dataset2\ncontains expression levels from patients with and without prostate cancer. 
The matrix X contains\nthe gene expression values and the vector y is an indicator of the type of cancer a patient has.\nAlthough this is a classification problem, we treat it as a recovery problem. For the leukemia data,\np = 5147 and n = 72. For the prostate cancer data, p = 12533 and n = 102. Both are clearly\nhigh-dimensional datasets, and the goal is to identify a small set of genes that are predictive of the\ncancer type.\nFigure 3 shows the performance of standard algorithms vs. SWAP. We use leave-one-out\ncross-validation and apply the sparse recovery algorithms described in Section 5.1 using multiple different\nchoices of the sparsity level. For each level of sparsity, we choose the sparse recovery algorithm\n(labeled as standard) and the SWAP based algorithm that results in the minimum least-squares loss\nover the training data. This allows us to compare the performance of using SWAP vs. not using\nSWAP. For both datasets, we clearly see that the training and testing errors are lower for SWAP based\nalgorithms. 
This means that SWAP is able to choose a subset of genes that has better predictive\nperformance than that of standard algorithms for each level of sparsity.\n\n1 See http://www.biolab.si/supp/bi-cancer/projections/info/leukemia.htm\n2 See http://www.biolab.si/supp/bi-cancer/projections/info/prostata.htm\n\n[Figure 3 plots omitted. Legend: SWAP, Standard. Axes: CV-Train Error or CV-Test Error vs. Sparsity Level. Panels: (a) Training Error; (b) Testing Error; (c) Training Error; (d) Testing Error.]\n\nFigure 3: (a)-(b) Leukemia dataset with p = 5147 and n = 72. (c)-(d) Prostate cancer dataset with\np = 12533 and n = 102.\n\n6 Summary and Future Work\n\nWe studied the sparse recovery problem of estimating the support of a high-dimensional sparse\nvector when given a measurement matrix that contains correlated columns. We presented a simple\nalgorithm, called SWAP, that iteratively swaps variables starting from an initial estimate of the\nsupport until an appropriate loss function can no longer be decreased further. We showed that SWAP\nis surprisingly effective in situations where the measurement matrix contains correlated columns. We\ntheoretically quantified the conditions on the measurement matrix that guarantee accurate support\nrecovery. 
Our theoretical results show that if SWAP is initialized with a support that contains some\nactive variables, then SWAP can tolerate even higher correlations in the measurement matrix. Using\nnumerical simulations on synthetic and real data, we showed how SWAP outperformed several\nsparse recovery algorithms.\nOur work in this paper sets up a platform to study the following interesting extensions of SWAP.\nThe first is a generalization of SWAP so that a group of variables can be swapped in a sequential\nmanner. The second is a detailed analysis of SWAP when used with other sparse recovery\nalgorithms. The third is an extension of SWAP to high-dimensional vectors that admit structured sparse\nrepresentations.\n\nAcknowledgement\n\nThe authors would like to thank Aswin Sankaranarayanan and Christoph Studer for feedback and\ndiscussions. The work of D. Vats was partly supported by an Institute for Mathematics and\nApplications (IMA) Postdoctoral Fellowship.\n\nReferences\n\n[1] M. Segal, K. Dahlquist, and B. Conklin, \u201cRegression approaches for microarray data analysis,\u201d\nJournal of Computational Biology, vol. 10, no. 6, pp. 961\u2013980, 2003.\n\n[2] H. Zhu and G. Giannakis, \u201cSparse overcomplete representations for efficient identification of\npower line outages,\u201d IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 2215\u20132224,\nNov. 2012.\n\n[3] E. J. Cand\u00e8s, J. Romberg, and T. Tao, \u201cRobust uncertainty principles: Exact signal\nreconstruction from highly incomplete frequency information,\u201d IEEE Trans. Information Theory, vol. 52,\nno. 2, pp. 489\u2013509, 2006.\n\n[4] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk,\n\u201cSingle-pixel imaging via compressive sampling,\u201d IEEE Signal Processing Magazine, vol. 25,\nno. 2, pp. 83\u201391, Mar. 2008.\n\n[5] N. Meinshausen and P. 
B\u00fchlmann, \u201cHigh-dimensional graphs and variable selection with the\nLasso,\u201d Annals of Statistics, vol. 34, no. 3, pp. 1436, 2006.\n\n[6] P. Ravikumar, M. Wainwright, and J. Lafferty, \u201cHigh-dimensional Ising model selection using\n\u21131-regularized logistic regression,\u201d Annals of Statistics, vol. 38, no. 3, pp. 1287\u20131319, 2010.\n\n[7] G. Varoquaux, A. Gramfort, and B. Thirion, \u201cSmall-sample brain mapping: sparse recovery\non spatially correlated designs with randomization and clustering,\u201d in Proceedings of the 29th\nInternational Conference on Machine Learning (ICML-12), 2012, pp. 1375\u20131382.\n\n[8] P. Zhao and B. Yu, \u201cOn model selection consistency of Lasso,\u201d Journal of Machine Learning\nResearch, vol. 7, pp. 2541\u20132563, 2006.\n\n[9] J. A. Tropp and A. C. Gilbert, \u201cSignal recovery from random measurements via orthogonal\nmatching pursuit,\u201d IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655\u20134666,\n2007.\n\n[10] M. J. Wainwright, \u201cSharp thresholds for noisy and high-dimensional recovery of sparsity using\n\u21131-constrained quadratic programming (Lasso),\u201d IEEE Transactions on Information Theory, vol.\n55, no. 5, May 2009.\n\n[11] N. Meinshausen and B. Yu, \u201cLasso-type recovery of sparse representations for\nhigh-dimensional data,\u201d Annals of Statistics, vol. 37, no. 1, pp. 246\u2013270, 2009.\n\n[12] P. J. Bickel, Y. Ritov, and A. B. Tsybakov, \u201cSimultaneous analysis of Lasso and Dantzig\nselector,\u201d Annals of Statistics, vol. 37, no. 4, pp. 1705\u20131732, 2009.\n\n[13] P. B\u00fchlmann and S. van de Geer, Statistics for High-Dimensional Data: Methods, Theory and\nApplications, Springer-Verlag New York Inc, 2011.\n\n[14] H. Zou and T. Hastie, \u201cRegularization and variable selection via the elastic net,\u201d Journal of\nthe Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 
301\u2013320,\n2005.\n\n[15] Y. She, \u201cSparse regression with exact clustering,\u201d Electronic Journal of Statistics, vol. 4, pp.\n1055\u20131096, 2010.\n\n[16] E. Grave, G. R. Obozinski, and F. R. Bach, \u201cTrace Lasso: A trace norm regularization for\ncorrelated designs,\u201d in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor,\nR. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds., 2011, pp. 2187\u20132195.\n\n[17] J. Huang, S. Ma, H. Li, and C. Zhang, \u201cThe sparse Laplacian shrinkage estimator for\nhigh-dimensional regression,\u201d Annals of Statistics, vol. 39, no. 4, pp. 2021, 2011.\n\n[18] P. B\u00fchlmann, P. R\u00fctimann, S. van de Geer, and C.-H. Zhang, \u201cCorrelated variables in\nregression: clustering and sparse estimation,\u201d Journal of Statistical Planning and Inference, vol. 143,\npp. 1835\u20131858, Nov. 2013.\n\n[19] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M.\nSchwab, C. R. Antonescu, C. Peterson, et al., \u201cClassification and diagnostic prediction of\ncancers using gene expression profiling and artificial neural networks,\u201d Nature Medicine, vol.\n7, no. 6, pp. 673\u2013679, 2001.\n\n[20] T. Zhang, \u201cAdaptive forward-backward greedy algorithm for learning sparse representations,\u201d\nIEEE Transactions on Information Theory, vol. 57, no. 7, pp. 4689\u20134708, 2011.\n\n[21] D. Needell and J. A. Tropp, \u201cCoSaMP: Iterative signal recovery from incomplete and\ninaccurate samples,\u201d Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301\u2013321,\n2009.\n\n[22] T. Zhang, \u201cSome sharp performance bounds for least squares regression with l1 regularization,\u201d\nThe Annals of Statistics, vol. 37, no. 5A, pp. 2109\u20132144, 2009.\n\n[23] L. Wasserman and K. Roeder, \u201cHigh dimensional variable selection,\u201d Annals of Statistics, vol.\n37, no. 5A, pp. 2178, 2009.\n\n[24] T. 
Zhang, \u201cAnalysis of multi-stage convex relaxation for sparse regularization,\u201d Journal of\nMachine Learning Research, vol. 11, pp. 1081\u20131107, Mar. 2010.\n\n[25] S. van de Geer, P. B\u00fchlmann, and S. Zhou, \u201cThe adaptive and the thresholded lasso for\npotentially misspecified models (and a lower bound for the lasso),\u201d Electronic Journal of Statistics,\nvol. 5, pp. 688\u2013749, 2011.\n\n[26] N. Meinshausen and P. B\u00fchlmann, \u201cStability selection,\u201d Journal of the Royal Statistical\nSociety: Series B (Statistical Methodology), vol. 72, no. 4, pp. 417\u2013473, 2010.\n\n[27] D. Vats and R. G. Baraniuk, \u201cSwapping variables for high-dimensional sparse regression with\ncorrelated measurements,\u201d arXiv:1312.1706, 2013.\n", "award": [], "sourceid": 530, "authors": [{"given_name": "Divyanshu", "family_name": "Vats", "institution": "Rice University"}, {"given_name": "Richard", "family_name": "Baraniuk", "institution": "Rice University"}]}