{"title": "Structured Sparse Regression via Greedy Hard Thresholding", "book": "Advances in Neural Information Processing Systems", "page_first": 1516, "page_last": 1524, "abstract": "Several learning applications require solving high-dimensional regression problems where the relevant features belong to a small number of (overlapping) groups. For very large datasets and under standard sparsity constraints, hard thresholding methods have proven to be extremely efficient, but such methods require NP hard projections when dealing with overlapping groups. In this paper, we show that such NP-hard projections can not only be avoided by appealing to submodular optimization, but such methods come with strong theoretical guarantees even in the presence of poorly conditioned data (i.e. say when two features have correlation  $\\geq 0.99$), which existing analyses cannot handle. These methods exhibit an interesting computation-accuracy trade-off and can be extended to significantly harder problems such as sparse overlapping groups. Experiments on both real and synthetic data validate our claims and demonstrate that the proposed methods are orders of magnitude faster than other greedy and convex relaxation techniques for learning with group-structured sparsity.", "full_text": "Structured Sparse Regression via Greedy Hard-thresholding\n\nPrateek Jain\nMicrosoft Research India\n\nNikhil Rao\nTechnicolor\n\nInderjit Dhillon\nUT Austin\n\nAbstract\n\nSeveral learning applications require solving high-dimensional regression problems\nwhere the relevant features belong to a small number of (overlapping) groups. For\nvery large datasets and under standard sparsity constraints, hard thresholding\nmethods have proven to be extremely efficient, but such methods require NP hard\nprojections when dealing with overlapping groups. 
In this paper, we show that\nsuch NP-hard projections can not only be avoided by appealing to submodular\noptimization, but such methods come with strong theoretical guarantees even\nin the presence of poorly conditioned data (i.e. say when two features have\ncorrelation $\\geq 0.99$), which existing analyses cannot handle. These methods exhibit\nan interesting computation-accuracy trade-off and can be extended to significantly\nharder problems such as sparse overlapping groups. Experiments on both real and\nsynthetic data validate our claims and demonstrate that the proposed methods are\norders of magnitude faster than other greedy and convex relaxation techniques for\nlearning with group-structured sparsity.\n\n1 Introduction\n\nHigh dimensional problems where the regressor belongs to a small number of groups play a critical\nrole in many machine learning and signal processing applications, such as computational biology and\nmultitask learning. In most of these cases, the groups overlap, i.e., the same feature can belong to\nmultiple groups. For example, gene pathways overlap in computational biology applications, and\nparent-child pairs of wavelet transform coefficients overlap in signal processing applications.\nThe existing state-of-the-art methods for solving such group sparsity structured regression problems\ncan be categorized into two broad classes: a) convex relaxation based methods, b) iterative hard\nthresholding (IHT) or greedy methods. In practice, IHT methods tend to be significantly more\nscalable than the (group-)lasso style methods that solve a convex program. But, these methods\nrequire a certain projection operator which in general is NP-hard to compute and often certain simple\nheuristics are used with relatively weak theoretical guarantees. 
Moreover, existing guarantees for\nboth classes of methods require relatively restrictive assumptions on the data, like Restricted Isometry\nProperty or variants thereof [2, 7, 16], that are unlikely to hold in most common applications. In fact,\neven under such settings, the group sparsity based convex programs offer at most polylogarithmic\ngains over standard sparsity based methods [16].\nConcretely, let us consider the following linear model:\n\n$$y = Xw^* + \\beta, \\qquad (1)$$\n\nwhere $\\beta \\sim N(0, \\lambda^2 I)$, $X \\in \\mathbb{R}^{n \\times p}$, each row of $X$ is sampled i.i.d. s.t. $x_i \\sim N(0, \\Sigma)$, $1 \\leq i \\leq n$,\nand $w^*$ is a $k^*$-group sparse vector i.e. $w^*$ can be expressed in terms of only $k^*$ groups, $G_j \\subseteq [p]$.\nThe existing analyses for both convex as well as hard thresholding based methods require $\\kappa = \\sigma_1/\\sigma_p \\leq c$,\nwhere $c$ is an absolute constant (like say 3) and $\\sigma_i$ is the $i$-th largest eigenvalue of $\\Sigma$.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThis is a significantly restrictive assumption as it requires all the features to be nearly independent of\neach other. For example, if features 1 and 2 have correlation more than say .99 then the restriction on\n$\\kappa$ required by the existing results does not hold.\nMoreover, in this setting (i.e., when $\\kappa = O(1)$), the number of samples required to exactly recover\n$w^*$ (with $\\lambda = 0$) is given by: $n = \\Omega(s + k^* \\log M)$ [16], where $s$ is the maximum support size of a\nunion of $k^*$ groups and $M$ is the number of groups. In contrast, if one were to directly use sparse\nregression techniques (by ignoring group sparsity altogether) then the number of samples is given by\n$n = \\Omega(s \\log p)$. 
Hence, even in the restricted setting of $\\kappa = O(1)$, group-sparse regression improves\nupon the standard sparse regression only by logarithmic factors.\nGreedy, Iterative Hard Thresholding (IHT) methods have been considered for group sparse regression\nproblems, but they involve NP-hard projections onto the constraint set [3]. While this can be\ncircumvented using approximate operations, the guarantees they provide are along the same lines as\nthe ones that exist for convex methods.\nIn this paper, we show that IHT schemes with approximate projections for the group sparsity\nproblem yield much stronger guarantees. Specifically, our result holds for arbitrarily large $\\kappa$, and\narbitrary group structures. In particular, using IHT with greedy projections, we show that\n$n = \\Omega\\left(s \\log\\frac{1}{\\epsilon} + \\kappa^2 k^* \\log M\\right) \\log\\frac{1}{\\epsilon}$ samples suffice to recover an $\\epsilon$-approximation to $w^*$ when $\\lambda = 0$.\nOn the other hand, IHT for standard sparse regression [10] requires $n = \\Omega(\\kappa^2 s \\log p)$. Moreover,\nfor general noise variance $\\lambda^2$, our method recovers $\\hat{w}$ s.t. $\\|\\hat{w} - w^*\\| \\leq 2\\epsilon + \\lambda \\cdot \\kappa \\sqrt{\\frac{s + \\kappa^2 k^* \\log M}{n}}$.\nOn the other hand, the existing state-of-the-art results for IHT for group sparsity [4] guarantee\n$\\|\\hat{w} - w^*\\| \\leq \\lambda \\cdot \\sqrt{s + k^* \\log M}$ for $\\kappa \\leq 3$, i.e., $\\hat{w}$ is not a consistent estimator of $w^*$ even for small\ncondition number $\\kappa$.\nOur analysis is based on an extension of the sparse regression result by [10] that requires exact\nprojections. However, a critical challenge in the case of overlapping groups is that the projection onto\nthe set of group-sparse vectors is NP-hard in general. To alleviate this issue, we use the connection\nbetween submodularity and overlapping group projections to show that a greedy selection based\nprojection is good enough. 
The main contribution of this work is to carefully use the greedy projection\nbased procedure along with hard thresholding iterates to guarantee the convergence to the global\noptima as long as enough i.i.d. data points are generated from model (1).\nMoreover, the simplicity of our hard thresholding operator allows us to easily extend it to more\ncomplicated sparsity structures. In particular, we show that the methods we propose can be generalized\nto the sparse overlapping group setting, and to hierarchies of (overlapping) groups.\nWe also provide extensive experiments on both real and synthetic datasets that show that our methods\nare not only faster than several other approaches, but are also accurate despite performing approximate\nprojections. Indeed, even for poorly-conditioned data, IHT methods are an order of magnitude faster\nthan other greedy and convex methods. We also observe a similar phenomenon when dealing with\nsparse overlapping groups.\n\n1.1 Related Work\n\nSeveral papers, notably [5] and references therein, have studied convergence properties of IHT\nmethods for sparse signal recovery under standard RIP conditions. [10] generalized the method to\nsettings where RIP does not hold, and also to the low rank matrix recovery setting. [21] used a similar\nanalysis to obtain results for nonlinear models. However, these techniques apply only to cases where\nexact projections can be performed onto the constraint set. Forward greedy selection schemes for\nsparse [9] and group sparse [18] constrained programs have been considered previously, where a\nsingle group is added at each iteration. The authors in [2] propose a variant of CoSaMP to solve\nproblems that are of interest to us, and again, these methods require exact projections.\nSeveral works have studied approximate projections in the context of IHT [17, 6, 12]. 
However, these\nresults require that the data satisfies RIP-style conditions which typically do not hold in real-world\nregression problems. Moreover, these analyses do not guarantee a consistent estimate of the optimal\nregressor when the measurements have zero-mean random noise. In contrast, we provide results\nunder a more general RSC/RSS condition, which is weaker [20], and provide crisp rates for the error\nbounds when the noise in measurements is random.\n\n2 Group Iterative Hard Thresholding for Overlapping Groups\n\nIn this section, we formally set up the group sparsity constrained optimization problem, and then\nbriefly present the IHT algorithm for the same. Suppose we are given a set of $M$ groups that can\narbitrarily overlap $\\mathcal{G} = \\{G_1, \\ldots, G_M\\}$, where $G_i \\subseteq [p]$. Also, let $\\cup_{i=1}^M G_i = \\{1, 2, \\ldots, p\\}$. We\nlet $\\|w\\|$ denote the Euclidean norm of $w$, and $\\mathrm{supp}(w)$ denotes the support of $w$. For any vector\n$w \\in \\mathbb{R}^p$, [8] defined the overlapping group norm as\n\n$$\\|w\\|^{\\mathcal{G}} := \\inf \\sum_{i=1}^M \\|a_{G_i}\\| \\quad \\text{s.t.} \\quad \\sum_{i=1}^M a_{G_i} = w, \\; \\mathrm{supp}(a_{G_i}) \\subseteq G_i \\qquad (2)$$\n\nWe also introduce the notion of \u201cgroup-support\u201d of a vector and its group-$\\ell_0$ pseudo-norm:\n\n$$\\mathcal{G}\\text{-supp}(w) := \\{i \\text{ s.t. } \\|a_{G_i}\\| > 0\\}, \\qquad \\|w\\|_0^{\\mathcal{G}} := \\inf \\sum_{i=1}^M \\mathbf{1}\\{\\|a_{G_i}\\| > 0\\}, \\qquad (3)$$\n\nwhere $a_{G_i}$ satisfies the constraints of (2). $\\mathbf{1}\\{\\cdot\\}$ is the indicator function, taking the value 1 if the\ncondition is satisfied, and 0 otherwise. For a set of groups $G$, $\\mathrm{supp}(G) = \\{G_i, i \\in G\\}$. Similarly,\n$\\mathcal{G}\\text{-supp}(S) = \\mathcal{G}\\text{-supp}(w_S)$.\nSuppose we are given a function $f : \\mathbb{R}^p \\to \\mathbb{R}$ and $M$ groups $\\mathcal{G} = \\{G_1, \\ldots, G_M\\}$. The goal is to\nsolve the following group sparsity structured problem (GS-Opt):\n\n$$\\text{GS-Opt:} \\quad \\min_w f(w) \\quad \\text{s.t.} \\quad \\|w\\|_0^{\\mathcal{G}} \\leq k \\qquad (4)$$\n\n$f$ can be thought of as a loss function over the training data, for instance, logistic or least squares loss.\nIn the high dimensional setting, problems of the form (4) are somewhat ill posed and are NP-hard\nin general. 
Hence, additional assumptions on the loss function ($f$) are warranted to guarantee a\nreasonable solution. Here, we focus on problems where $f$ satisfies the restricted strong convexity and\nsmoothness conditions:\nDefinition 2.1 (RSC/RSS). The function $f : \\mathbb{R}^p \\to \\mathbb{R}$ satisfies the restricted strong convexity (RSC)\nand restricted strong smoothness (RSS) of order $k$, if the following holds:\n\n$$\\alpha_k I \\preceq H(w) \\preceq L_k I,$$\n\nwhere $H(w)$ is the Hessian of $f$ at any $w \\in \\mathbb{R}^p$ s.t. $\\|w\\|_0^{\\mathcal{G}} \\leq k$.\nNote that the goal of our algorithms/analysis would be to solve the problem for arbitrary $\\alpha_k > 0$ and\n$L_k < \\infty$. In contrast, adapting existing IHT results to this setting leads to results that only allow $L_k/\\alpha_k$ less\nthan a constant (like say 3).\nWe are especially interested in the linear model described in (1), and in recovering $w^*$ consistently\n(i.e. recover $w^*$ exactly as $n \\to \\infty$). To this end, we look to solve the following (non convex)\nconstrained least squares problem\n\n$$\\text{GS-LS:} \\quad \\hat{w} = \\arg\\min_w f(w) := \\frac{1}{2n}\\|y - Xw\\|_2^2 \\quad \\text{s.t.} \\quad \\|w\\|_0^{\\mathcal{G}} \\leq k \\qquad (5)$$\n\nwith $k \\geq k^*$ being a positive, user defined integer$^1$. In this paper, we propose to solve (5) using an\nIterative Hard Thresholding (IHT) scheme. IHT methods iteratively take a gradient descent step, and\nthen project the resulting vector ($g$) on to the (non-convex) constraint set of group sparse vectors, i.e.,\n\n$$w^* = P_k^{\\mathcal{G}}(g) = \\arg\\min_w \\|w - g\\|^2 \\quad \\text{s.t.} \\quad \\|w\\|_0^{\\mathcal{G}} \\leq k \\qquad (6)$$\n\nComputing the gradient is easy and hence the complexity of the overall algorithm heavily depends on\nthe complexity of performing the aforementioned projection. Algorithm 1 details the IHT procedure\nfor the group sparsity problem (4). 
Throughout the paper we consider the same high-level procedure,\nbut consider different projection operators $\\hat{P}_k^{\\mathcal{G}}(g)$ for different settings of the problem.\n\n$^1$ typically chosen via cross-validation\n\nAlgorithm 1 IHT for Group-sparsity\n1: Input: data $y$, $X$, parameter $k$, iterations $T$, step size $\\eta$\n2: Initialize: $t = 0$, $w_0 \\in \\mathbb{R}^p$ a $k$-group sparse vector\n3: for $t = 1, 2, \\ldots, T$ do\n4:   $g_t = w_{t-1} - \\eta \\nabla f(w_{t-1})$\n5:   $w_t = \\hat{P}_k^{\\mathcal{G}}(g_t)$ where $\\hat{P}_k^{\\mathcal{G}}(g_t)$ performs (approximate) projections\n6: end for\n7: Output: $w_T$\n\nAlgorithm 2 Greedy Projection\nRequire: $g \\in \\mathbb{R}^p$, parameter $\\tilde{k}$, groups $\\mathcal{G}$\n1: $\\hat{u} = 0$, $v = g$, $\\hat{G} = \\{0\\}$\n2: for $t = 1, 2, \\ldots, \\tilde{k}$ do\n3:   Find $G^\\star = \\arg\\max_{G \\in \\mathcal{G} \\setminus \\hat{G}} \\|v_G\\|$\n4:   $\\hat{G} = \\hat{G} \\cup G^\\star$\n5:   $\\hat{u} = \\hat{u} + v_{G^\\star}$\n6:   $v = v - v_{G^\\star}$\n7: end for\n8: Output $\\hat{u} := \\hat{P}_k^{\\mathcal{G}}(g)$, $\\hat{G} = \\mathrm{supp}(\\hat{u})$\n\n2.1 Submodular Optimization for General $\\mathcal{G}$\nSuppose we are given a vector $g \\in \\mathbb{R}^p$, which needs to be projected onto the constraint set $\\|u\\|_0^{\\mathcal{G}} \\leq k$\n(see (6)). Solving (6) is NP-hard when $\\mathcal{G}$ contains arbitrary overlapping groups. To overcome\nthis, $P_k^{\\mathcal{G}}(\\cdot)$ can be replaced by an approximate operator $\\hat{P}_k^{\\mathcal{G}}(\\cdot)$ (step 5 of Algorithm 1). Indeed,\nthe procedure for performing projections reduces to a submodular optimization problem [3], for\nwhich the standard greedy procedure can be used (Algorithm 2). For completeness, we detail this in\nAppendix A, where we also prove the following:\nLemma 2.2. Given an arbitrary vector $g \\in \\mathbb{R}^p$, suppose we obtain $\\hat{u}, \\hat{G}$ as the output of Algorithm\n2 with input $g$ and target group sparsity $\\tilde{k}$. Let $u^* = P_k^{\\mathcal{G}}(g)$ be as defined in (6). Then\n\n$$\\|\\hat{u} - g\\|^2 \\leq e^{-\\tilde{k}/k} \\|(g)_{\\mathrm{supp}(u^*)}\\|^2 + \\|u^* - g\\|^2,$$\n\nwhere $e$ is the base of the natural logarithm.\n\nNote that the term with the exponent in Lemma 2.2 approaches 0 as $\\tilde{k}$ increases. Increasing $\\tilde{k}$ should\nimply more samples for recovery of $w^*$. 
Hence, this lemma hints at the possibility of trading off\nsample complexity for better accuracy, despite the projections being approximate. See Section 3 for\nmore details. Algorithm 2 can be applied to any $\\mathcal{G}$, and is extremely efficient.\n\n2.2 Incorporating Full Corrections\nIHT methods can be improved by the incorporation of \u201ccorrections\u201d after each projection step. This\nmerely entails adding the following step in Algorithm 1 after step 5:\n\n$$w_t = \\arg\\min_{\\tilde{w}} f(\\tilde{w}) \\quad \\text{s.t.} \\quad \\mathrm{supp}(\\tilde{w}) = \\mathrm{supp}(\\hat{P}_k^{\\mathcal{G}}(g_t))$$\n\nWhen $f(\\cdot)$ is the least squares loss as we consider, this step can be solved efficiently using Cholesky\ndecompositions via the backslash operator in MATLAB. We will refer to this procedure as IHT-FC.\nFully corrective methods in greedy algorithms typically yield significant improvements, both\ntheoretically and in practice [10].\n\n3 Theoretical Performance Bounds\n\nWe now provide theoretical guarantees for Algorithm 1 when applied to the overlapping group\nsparsity problem (4). We then specialize the results for the linear regression model (5).\nTheorem 3.1. Let $w^* = \\arg\\min_{w, \\|w\\|_0^{\\mathcal{G}} \\leq k^*} f(w)$ and let $f$ satisfy RSC/RSS with constants $\\alpha_{k'}$,\n$L_{k'}$, respectively (see Definition 2.1). Set $k = 32 \\left(\\frac{L_{k'}}{\\alpha_{k'}}\\right)^2 \\cdot k^* \\log\\left(\\frac{L_{k'}}{\\alpha_{k'}} \\cdot \\frac{\\|w^*\\|_2}{\\epsilon}\\right)$ and let $k' \\leq 2k + k^*$.\nSuppose we run Algorithm 1, with $\\eta = 1/L_{k'}$ and projections computed according to Algorithm 2.\nThen, the following holds after $t + 1$ iterations:\n\n$$\\|w_{t+1} - w^*\\|_2 \\leq \\left(1 - \\frac{\\alpha_{k'}}{10 \\cdot L_{k'}}\\right) \\cdot \\|w_t - w^*\\|_2 + \\gamma + \\frac{\\alpha_{k'}}{L_{k'}} \\epsilon,$$\n\nwhere $\\gamma = \\frac{2}{L_{k'}} \\max_{S, \\text{ s.t. } |\\mathcal{G}\\text{-supp}(S)| \\leq k} \\|(\\nabla f(w^*))_S\\|_2$. Specifically, the output of the $T = O\\left(\\frac{L_{k'}}{\\alpha_{k'}} \\cdot \\frac{\\|w^*\\|_2}{\\epsilon}\\right)$-th iteration of Algorithm 1 satisfies:\n\n$$\\|w_T - w^*\\|_2 \\leq 2\\epsilon + \\frac{10 \\cdot L_{k'}}{\\alpha_{k'}} \\cdot \\gamma.$$\n\nThe proof uses the fact that Algorithm 2 performs approximately good projections. The result follows\nfrom combining this with results from convex analysis (RSC/RSS) and a careful setting of parameters.\nWe prove this result in Appendix B.\n\nRemarks\n\nTheorem 3.1 shows that Algorithm 1 recovers $w^*$ up to $O\\left(\\frac{L_{k'}}{\\alpha_{k'}} \\cdot \\gamma\\right)$ error. If $\\|\\arg\\min_w f(w)\\|_0^{\\mathcal{G}} \\leq k$,\nthen $\\gamma = 0$. In general our result obtains an additive error which is weaker than what one can obtain\nfor a convex optimization problem. However, for typical statistical problems, we show that $\\gamma$ is small\nand gives us nearly optimal statistical generalization error (for example, see Theorem 3.2).\nTheorem 3.1 displays an interesting interplay between the desired accuracy $\\epsilon$ and the penalty $\\gamma$ that we\npay as a result of performing approximate projections. Specifically, as $\\epsilon$ is made small, $k$ becomes\nlarge, and thus so does $\\gamma$. Conversely, we can let $\\epsilon$ be large so that the projections are coarse, but\nincur a smaller penalty via the $\\gamma$ term. Also, since the projections are not too accurate in this case, we\ncan get away with fewer iterations. Thus, there is a tradeoff between estimation error $\\epsilon$ and model\nselection error $\\gamma$. Also, note that the inverse dependence of $k$ on $\\epsilon$ is only logarithmic in nature.\nWe stress that our results do not hold for arbitrary approximate projection operators. Our proof\ncritically uses the greedy scheme (Algorithm 2), via Lemma 2.2. 
Also, as discussed in Section 4, the\nproof easily extends to other structured sparsity sets that allow such greedy selection steps.\nWe obtain a similar result as [10] for the standard sparsity case, i.e., when the groups are singletons.\nHowever, our proof is significantly simpler and allows for a significantly easier setting of $\\eta$.\n\n3.1 Linear Regression Guarantees\nWe next proceed to the standard linear regression model considered in (5). To the best of our\nknowledge, this is the first consistency result for overlapping group sparsity problems, especially\nwhen the data can be arbitrarily conditioned. Recall that $\\sigma_{\\max}$ ($\\sigma_{\\min}$) are the maximum (minimum)\nsingular value of $\\Sigma$, and $\\kappa := \\sigma_{\\max}/\\sigma_{\\min}$ is the condition number of $\\Sigma$.\nTheorem 3.2. Let the observations $y$ follow the model in (1). Suppose $w^*$ is $k^*$-group sparse and\nlet $f(w) := \\frac{1}{2n}\\|Xw - y\\|_2^2$. Let the number of samples satisfy:\n\n$$n \\geq \\Omega\\left((s + \\kappa^2 \\cdot k^* \\cdot \\log M) \\cdot \\log\\left(\\frac{\\kappa}{\\epsilon}\\right)\\right),$$\n\nwhere $s = \\max_{w, \\|w\\|_0^{\\mathcal{G}} \\leq k} |\\mathrm{supp}(w)|$. Then, applying Algorithm 1 with $k = 8\\kappa^2 k^* \\cdot \\log\\left(\\frac{\\kappa}{\\epsilon}\\right)$,\n$\\eta = 1/(4\\sigma_{\\max})$, guarantees the following after $T = \\Omega\\left(\\kappa \\log \\frac{\\kappa \\cdot \\|w^*\\|_2}{\\epsilon}\\right)$ iterations (w.p. $\\geq 1 - 1/n^8$):\n\n$$\\|w_T - w^*\\| \\leq \\lambda \\cdot \\kappa \\sqrt{\\frac{s + \\kappa^2 k^* \\log M}{n}} + 2\\epsilon$$\n\nRemarks\nNote that one can ignore the group sparsity constraint, and instead look to recover the (at most) $s$-\nsparse vector $w^*$ using IHT methods for $\\ell_0$ optimization [10]. However, the corresponding sample\ncomplexity is $n \\geq \\kappa^2 s \\log(p)$. Hence, for an ill conditioned $\\Sigma$, using group sparsity based methods\nprovides a significantly stronger result, especially when the groups overlap significantly.\nNote that the number of samples required increases logarithmically with the accuracy $\\epsilon$. 
Theorem\n3.2 thus displays an interesting phenomenon: by obtaining more samples, one can provide a smaller\nrecovery error while incurring a larger approximation error (since we choose more groups).\nOur proof critically requires that when restricted to group sparse vectors, the least squares objective\nfunction $f(w) = \\frac{1}{2n}\\|y - Xw\\|_2^2$ is strongly convex as well as strongly smooth:\n\nLemma 3.3. Let $X \\in \\mathbb{R}^{n \\times p}$ be such that each $x_i \\sim N(0, \\Sigma)$. Let $w \\in \\mathbb{R}^p$ be $k$-group sparse over\ngroups $\\mathcal{G} = \\{G_1, \\ldots, G_M\\}$, i.e., $\\|w\\|_0^{\\mathcal{G}} \\leq k$ and $s = \\max_{w, \\|w\\|_0^{\\mathcal{G}} \\leq k} |\\mathrm{supp}(w)|$. Let the number of\nsamples $n \\geq \\Omega(C (k \\log M + s))$. Then, the following holds with probability $\\geq 1 - 1/n^{10}$:\n\n$$\\left(1 - \\frac{4}{\\sqrt{C}}\\right) \\sigma_{\\min} \\|w\\|_2^2 \\leq \\frac{1}{n}\\|Xw\\|_2^2 \\leq \\left(1 + \\frac{4}{\\sqrt{C}}\\right) \\sigma_{\\max} \\|w\\|_2^2,$$\n\nWe prove Lemma 3.3 in Appendix C. Theorem 3.2 then follows by combining Lemma 3.3 with\nTheorem 3.1. Note that in the least squares case, these are the Restricted Eigenvalue conditions on\nthe matrix $X$, which as explained in [20] are much weaker than traditional RIP assumptions on the\ndata. In particular, RIP requires almost 0 correlation between any two features, while our assumption\nallows for arbitrarily high correlations albeit at the cost of a larger number of samples.\n\n3.2 IHT with Exact Projections $P_k^{\\mathcal{G}}(\\cdot)$\nWe now consider the setting where $P_k^{\\mathcal{G}}(\\cdot)$ can be computed exactly and efficiently for any $k$. Examples\ninclude the dynamic programming based method by [3] for certain group structures, or Algorithm 2\nwhen the groups do not overlap. Since the exact projection operator can be arbitrary, our proof of\nTheorem 3.1 does not apply directly in this case. However, we show that by exploiting the structure\nof hard thresholding, we can still obtain a similar result:\nTheorem 3.4. Let $w^* = \\arg\\min_{w, \\|w\\|_0^{\\mathcal{G}} \\leq k^*} f(w)$. Let $f$ satisfy RSC/RSS with constants $\\alpha_{2k+k^*}$,\n$L_{2k+k^*}$, respectively (see Definition 2.1). 
Then, the following holds for the $T = O\\left(\\frac{L_{k'}}{\\alpha_{k'}} \\cdot \\frac{\\|w^*\\|_2}{\\epsilon}\\right)$-th\niterate of Algorithm 1 (with $\\eta = 1/L_{2k+k^*}$) with $\\hat{P}_k^{\\mathcal{G}}(\\cdot) = P_k^{\\mathcal{G}}(\\cdot)$ being the exact projection:\n\n$$\\|w_T - w^*\\|_2 \\leq \\epsilon + \\frac{10 \\cdot L_{k'}}{\\alpha_{k'}} \\cdot \\gamma,$$\n\nwhere $k' = 2k + k^*$, $k = O\\left(\\left(\\frac{L_{k'}}{\\alpha_{k'}}\\right)^2 \\cdot k^*\\right)$, $\\gamma = \\frac{2}{L_{k'}} \\max_{S, \\text{ s.t. } |\\mathcal{G}\\text{-supp}(S)| \\leq k} \\|(\\nabla f(w^*))_S\\|_2$.\n\nSee Appendix D for a detailed proof. Note that unlike the greedy projection method (see Theorem 3.1), $k$\nis independent of $\\epsilon$. Also, in the linear model, the above result leads to a consistent estimate of $w^*$.\n\n4 Extension to Sparse Overlapping Groups (SoG)\n\nThe SoG model generalizes the overlapping group sparse model, allowing the selected groups\nthemselves to be sparse. Given positive integers $k_1, k_2$ and a set of groups $\\mathcal{G}$, IHT for SoG would\nperform projections onto the following set:\n\n$$C_0^{sog} := \\left\\{ w = \\sum_{i=1}^M a_{G_i} : \\|w\\|_0^{\\mathcal{G}} \\leq k_1, \\; \\|a_{G_1}\\|_0 \\leq k_2 \\right\\} \\qquad (7)$$\n\nAs in the case of overlapping group lasso, projection onto (7) is NP-hard in general. Motivated by\nour greedy approach in Section 2, we propose a similar method for SoG (see Algorithm 3). The\nalgorithm essentially greedily selects the groups that have large top-$k_2$ elements by magnitude.\nBelow, we show that the IHT (Algorithm 1) combined with the greedy projection (Algorithm 3)\nindeed converges to the optimal solution. Moreover, our experiments (Section 5) reveal that this\nmethod, when combined with full corrections, yields highly accurate results significantly faster than\nthe state-of-the-art.\nWe suppose that there exists a set of supports $S_{k^*}$ such that $\\mathrm{supp}(w^*) \\in S_{k^*}$. Then, we obtain the\nfollowing result, proved in Appendix E:\nTheorem 4.1. Let $w^* = \\arg\\min_{w, \\mathrm{supp}(w) \\in S_{k^*}} f(w)$, where $S_{k^*} \\subseteq S_k \\subseteq \\{0, 1\\}^p$ is a fixed set\nparameterized by $k^*$. Let $f$ satisfy RSC/RSS with constants $\\alpha_k$, $L_k$, respectively. 
Furthermore, assume\nthat there exists an approximately good projection operator for the set defined in (7) (for example,\nAlgorithm 3). Then, the following holds for the $T = O\\left(\\frac{L_{k'}}{\\alpha_{k'}} \\cdot \\frac{\\|w^*\\|_2}{\\epsilon}\\right)$-th iterate of Algorithm 1:\n\n$$\\|w_T - w^*\\|_2 \\leq 2\\epsilon + \\frac{10 \\cdot L_{2k+k^*}}{\\alpha_{2k+k^*}} \\cdot \\gamma,$$\n\nwhere $k = O\\left(\\left(\\frac{L_{2k+k^*}}{\\alpha_{2k+k^*}}\\right)^2 \\cdot k^* \\cdot \\beta_{\\frac{\\alpha_{2k+k^*}}{L_{2k+k^*}} \\epsilon}\\right)$, $\\gamma = \\frac{2}{L_{2k+k^*}} \\max_{S, S \\in S_k} \\|(\\nabla f(w^*))_S\\|_2$.\n\nAlgorithm 3 Greedy Projections for SoG\nRequire: $g \\in \\mathbb{R}^p$, parameters $k_1, k_2$, groups $\\mathcal{G}$\n1: $\\hat{u} = 0$, $v = g$, $\\hat{G} = \\{0\\}$, $\\hat{S} = \\{0\\}$\n2: for $t = 1, 2, \\ldots, k_1$ do\n3:   Find $G^\\star = \\arg\\max_{G \\in \\mathcal{G} \\setminus \\hat{G}} \\|v_G\\|$\n4:   $\\hat{G} = \\hat{G} \\cup G^\\star$\n5:   Let $S$ correspond to the indices of the top $k_2$ entries of $v_{G^\\star}$ by magnitude\n6:   Define $\\bar{v} \\in \\mathbb{R}^p$, $\\bar{v}_S = (v_{G^\\star})_S$, $\\bar{v}_i = 0$ for $i \\notin S$\n7:   $\\hat{S} = \\hat{S} \\cup S$\n8:   $v = v - \\bar{v}$\n9:   $\\hat{u} = \\hat{u} + \\bar{v}$\n10: end for\n11: Output $\\hat{u}$, $\\hat{G}$, $\\hat{S}$\n\nRemarks\nSimilar to Theorem 3.1, we see that there is a tradeoff between obtaining accurate projections $\\epsilon$ and\nmodel mismatch $\\gamma$. Specifically in this case, one can obtain small $\\epsilon$ by increasing $k_1, k_2$ in Algorithm\n3. However, this will mean we select a large number of groups, and subsequently $\\gamma$ increases.\nA result similar to Theorem 3.2 can be obtained for the case when $f$ is the least squares loss function.\nSpecifically, the sample complexity evaluates to $n \\geq \\kappa^2 k_1^* \\log(M) + \\kappa^2 k_1^* k_2^* \\log(\\max_i |G_i|)$. We\nobtain results for least squares in Appendix F.\nAn interesting extension to the SoG case is that of a hierarchy of overlapping, sparsely activated\ngroups. 
When the groups at each level do not overlap, this reduces to the case considered in [11].\nHowever, our theory shows that when a corresponding approximate projection operator is defined for\nthe hierarchical overlapping case (extending Algorithm 3), IHT methods can be used to obtain the\nsolution in an efficient manner.\n\n5 Experiments and Results\n\n[Figure 1 panels: log(objective) vs. time (seconds) for IHT, IHT+FC, CoGEnT, FW and GOMP (two panels); measurements vs. condition number (two panels)]\n\nFigure 1: (From left to right) Objective value as a function of time for various methods, when data is\nwell conditioned and poorly conditioned. The latter two figures show the phase transition plots for\npoorly conditioned data, for IHT and GOMP respectively.\n\nIn this section, we empirically compare and contrast our proposed group IHT methods against the\nexisting approaches to solve the overlapping group sparsity problem. At a high level, we observe\nthat our proposed variants of IHT indeed outperform the existing state-of-the-art methods for group-\nsparse regression in terms of time complexity. Encouragingly, IHT also performs competitively with\nthe existing methods in terms of accuracy. In fact, our results on the breast cancer dataset show a\n10% relative improvement in accuracy over existing methods.\nGreedy methods for group sparsity have been shown to outperform proximal point schemes, and\nhence we restrict our comparison to greedy procedures. 
We compared four methods: our algorithm\nwith (IHT-FC) and without (IHT) the fully corrective step, the Frank Wolfe (FW) method [19],\nCoGEnT [15], and the Group OMP (GOMP) [18]. All relevant hyper-parameters were chosen via\na grid search, and experiments were run on a macbook laptop with a 2.5 GHz processor and 16gb\nmemory. Additional experimental results are presented in Appendix G.\n\n[Figure 2 panels: (Left) log(MSE) vs. time (seconds) for IHT, IHT-FC, CoGEnT and FW; (Center) three signal plots; (Right) the table below]\n\nMethod    Error %    time (sec)\nFW        29.41      6.4538\nIHT       27.94      0.0400\nGOMP      25.01      0.2891\nCoGEnT    23.53      0.1414\nIHT-FC    21.65      0.1601\n\nFigure 2: (Left) SoG: error vs time comparison for various methods, (Center) SoG: reconstruction of\nthe true signal (top) from IHT-FC (middle) and CoGEnT (bottom). (Right:) Tumor Classification:\nmisclassification rate of various methods.\n\nSynthetic Data, well conditioned: We first compared various greedy schemes for solving the\noverlapping group sparsity problem on synthetic data. We generated $M = 1000$ groups of contiguous\nindices of size 25; the last 5 entries of one group overlap with the first 5 of the next. We randomly\nset 50 of these to be active, populated by uniform $[-1, 1]$ entries. This yields $w^* \\in \\mathbb{R}^p$, $p \\sim 22000$.\n$X \\in \\mathbb{R}^{n \\times p}$ where $n = 5000$ and $X_{ij} \\overset{i.i.d.}{\\sim} N(0, 1)$. Each measurement is corrupted with Additive\nWhite Gaussian Noise (AWGN) with standard deviation $\\lambda = 0.1$. 
IHT methods achieve orders\nof magnitude speedup compared to the competing schemes, and achieve almost the same (final)\nobjective function value despite approximate projections (Figure 1 (Left)).\nSynthetic Data, poorly conditioned: Next, we consider the exact same setup, but with each row of\n$X$ given by: $x_i \\sim N(0, \\Sigma)$ where $\\kappa = \\sigma_{\\max}(\\Sigma)/\\sigma_{\\min}(\\Sigma) = 10$. Figure 1 (Center-left) shows again\nthe advantages of using IHT methods; IHT-FC is about 10 times faster than the next best CoGEnT.\nWe next generate phase transition plots for recovery by our method (IHT) as well as the state-\nof-the-art GOMP method. We generate vectors in the same vein as the above experiment, with\n$M = 500$, $B = 15$, $k = 25$, $p \\sim 5000$. We vary the condition number of the data covariance\n($\\Sigma$) as well as the number of measurements ($n$). Figure 1 (Center-right and Right) shows the\nphase transition plot as the measurements and the condition number are varied for IHT, and GOMP\nrespectively. The results are averaged over 10 independent runs. It can be seen that even for condition\nnumbers as high as 200, $n \\sim 1500$ measurements suffice for IHT to exactly recover $w^*$, whereas\nGOMP with the same setting is not able to recover $w^*$ even once.\n\nTumor Classification, Breast Cancer Dataset: We next compare the aforementioned methods on\na gene selection problem for breast cancer tumor classification. We use the data used in [8]$^2$. We ran\na 5-fold cross validation scheme to choose parameters, where we varied $\\eta \\in \\{2^{-5}, 2^{-4}, \\ldots, 2^{3}\\}$, $k \\in\n\\{2, 5, 10, 15, 20, 50, 100\\}$, $\\tau \\in \\{2^{3}, 2^{4}, \\ldots, 2^{13}\\}$. Figure 2 (Right) shows that the vanilla hard\nthresholding method is competitive despite performing approximate projections, and the method with\nfull corrections obtains the best performance among the methods considered. 
We randomly chose\n15% of the data to test on.\nSparse Overlapping Group Lasso: Finally, we study the sparse overlapping group (SoG) problem\nthat was introduced and analyzed in [14] (Figure 2). We perform projections as detailed in Algorithm\n3. We generated synthetic vectors with 100 groups of size 50 and randomly selected 5 groups to be\nactive, and among the active groups only set 30 coefficients to be non zero. The groups themselves\nwere overlapping, with the last 10 entries of one group shared with the first 10 of the next, yielding\n$p \\sim 4000$. We chose the best parameters from a grid, and we set $k = 2k^*$ for the IHT methods.\n\n6 Conclusions and Discussion\nWe proposed a greedy-IHT method that can be applied to regression problems over sets of group sparse\nvectors. Our proposed solution is efficient, scalable, and provides fast convergence guarantees under\ngeneral RSC/RSS style conditions, unlike existing methods. We extended our analysis to handle even\nmore challenging structures like sparse overlapping groups. Our experiments show that IHT methods\nachieve fast, accurate results even with greedy and approximate projections.\n\n$^2$ download at http://cbio.ensmp.fr/~ljacob/\n\nReferences\n[1] Francis Bach. Convex analysis and optimization with submodular functions: A tutorial. arXiv preprint arXiv:1010.4207, 2010.\n\n[2] Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. Information Theory, IEEE Transactions on, 56(4):1982\u20132001, 2010.\n\n[3] Nirav Bhan, Luca Baldassarre, and Volkan Cevher. Tractability of interpretability via selection of group-sparse models. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pages 1037\u20131041. IEEE, 2013.\n\n[4] Thomas Blumensath and Mike E Davies. Sampling theorems for signals from the union of finite-dimensional linear subspaces. 
Information Theory, IEEE Transactions on, 55(4):1872\u20131882, 2009.\n\n[5] Thomas Blumensath and Mike E Davies. Normalized iterative hard thresholding: Guaranteed stability and performance. Selected Topics in Signal Processing, IEEE Journal of, 4(2):298\u2013309, 2010.\n\n[6] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation algorithms for model-based compressive sensing. Information Theory, IEEE Transactions on, 61(9):5129\u20135147, 2015.\n\n[7] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. The Journal of Machine Learning Research, 12:3371\u20133412, 2011.\n\n[8] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433\u2013440. ACM, 2009.\n\n[9] Prateek Jain, Ambuj Tewari, and Inderjit S Dhillon. Orthogonal matching pursuit with replacement. In Advances in Neural Information Processing Systems, pages 1215\u20131223, 2011.\n\n[10] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685\u2013693, 2014.\n\n[11] Rodolphe Jenatton, Julien Mairal, Francis R Bach, and Guillaume R Obozinski. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 487\u2013494, 2010.\n\n[12] Anastasios Kyrillidis and Volkan Cevher. Combinatorial selection and least absolute shrinkage via the CLASH algorithm. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 2216\u20132220. IEEE, 2012.\n\n[13] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions. 
Mathematical Programming, 14(1):265\u2013294, 1978.\n\n[14] Nikhil Rao, Christopher Cox, Rob Nowak, and Timothy T Rogers. Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis. In Advances in Neural Information Processing Systems, pages 2202\u20132210, 2013.\n\n[15] Nikhil Rao, Parikshit Shah, and Stephen Wright. Forward\u2013backward greedy algorithms for atomic norm regularization. Signal Processing, IEEE Transactions on, 63(21):5798\u20135811, 2015.\n\n[16] Nikhil S Rao, Ben Recht, and Robert D Nowak. Universal measurement bounds for structured sparse signal recovery. In International Conference on Artificial Intelligence and Statistics, pages 942\u2013950, 2012.\n\n[17] Parikshit Shah and Venkat Chandrasekaran. Iterative projections for signal identification on manifolds: Global recovery guarantees. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pages 760\u2013767. IEEE, 2011.\n\n[18] Grzegorz Swirszcz, Naoki Abe, and Aurelie C Lozano. Grouped orthogonal matching pursuit for variable selection and prediction. In Advances in Neural Information Processing Systems, pages 1150\u20131158, 2009.\n\n[19] Ambuj Tewari, Pradeep K Ravikumar, and Inderjit S Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In Advances in Neural Information Processing Systems, pages 882\u2013890, 2011.\n\n[20] Sara A Van De Geer, Peter B\u00fchlmann, et al. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360\u20131392, 2009.\n\n[21] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. 
In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 127\u2013135, 2014.\n", "award": [], "sourceid": 830, "authors": [{"given_name": "Prateek", "family_name": "Jain", "institution": "Microsoft Research"}, {"given_name": "Nikhil", "family_name": "Rao", "institution": "Technicolor"}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": "University of Texas at Austin"}]}