{"title": "Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 2202, "page_last": 2210, "abstract": "Multitask learning can be effective when features useful in one task are also useful for other tasks, and the group lasso is a standard method for selecting a common subset of features. In this paper, we are interested in a less restrictive form of multitask learning, wherein (1) the available features can be organized into subsets according to a notion of similarity and (2) features useful in one task are similar, but not necessarily identical, to the features best suited for other tasks. The main contribution of this paper is a new procedure called {\\em Sparse Overlapping Sets (SOS) lasso}, a convex optimization that automatically selects similar features for related learning tasks.  Error bounds are derived for SOSlasso and its consistency is established for squared error loss. In particular,  SOSlasso is motivated by multi-subject fMRI studies in which functional activity is classified using brain voxels as features. Experiments with real and synthetic data demonstrate the advantages of SOSlasso compared to the lasso and group lasso.", "full_text": "Sparse Overlapping Sets Lasso for Multitask\nLearning and its Application to fMRI Analysis\n\nNikhil S. Rao\u2020\n\nnrao2@wisc.edu\n\nRobert D. Nowak\u2020\n\nnowak@ece.wisc.edu\n\nChristopher R. Cox#\ncrcox@wisc.edu\n\nTimothy T. Rogers#\n\nttrogers@wisc.edu\n\n\u2020 Department of Electrical and Computer Engineering, # Department of Psychology\n\nUniversity of Wisconsin- Madison\n\nAbstract\n\nMultitask learning can be effective when features useful in one task are also useful\nfor other tasks, and the group lasso is a standard method for selecting a common\nsubset of features. 
In this paper, we are interested in a less restrictive form of mul-\ntitask learning, wherein (1) the available features can be organized into subsets\naccording to a notion of similarity and (2) features useful in one task are simi-\nlar, but not necessarily identical, to the features best suited for other tasks. The\nmain contribution of this paper is a new procedure called Sparse Overlapping Sets\n(SOS) lasso, a convex optimization that automatically selects similar features for\nrelated learning tasks. Error bounds are derived for SOSlasso and its consistency\nis established for squared error loss. In particular, SOSlasso is motivated by multi-\nsubject fMRI studies in which functional activity is classi\ufb01ed using brain voxels\nas features. Experiments with real and synthetic data demonstrate the advantages\nof SOSlasso compared to the lasso and group lasso.\n\n1 Introduction\n\nMultitask learning exploits the relationships between several learning tasks in order to improve\nperformance, which is especially useful if a common subset of features is useful for all tasks at\nhand. The group lasso (Glasso) [19, 8] is naturally suited for this situation: if a feature is selected\nfor one task, then it is selected for all tasks. This may be too restrictive in many applications, and\nthis motivates a less rigid approach to multitask feature selection. Suppose that the available features\ncan be organized into overlapping subsets according to a notion of similarity, and that the features\nuseful in one task are similar, but not necessarily identical, to those best suited for other tasks. In\nother words, a feature that is useful for one task suggests that the subset it belongs to may contain\nthe features useful in other tasks (Figure 1).\nIn this paper, we introduce the sparse overlapping sets lasso (SOSlasso), a convex program to re-\ncover the sparsity patterns corresponding to the situations explained above. 
SOSlasso generalizes\nlasso [16] and Glasso, effectively spanning the range between these two well-known procedures.\nSOSlasso is capable of exploiting the similarities between useful features across tasks, but unlike\nGlasso it does not force different tasks to use exactly the same features. It produces sparse solutions,\nbut unlike lasso it encourages similar patterns of sparsity across tasks. Sparse group lasso [14] is\na special case of SOSlasso that only applies to disjoint sets, a signi\ufb01cant limitation when features\ncannot be easily partitioned, as is the case of our motivating example in fMRI. The main contribu-\ntion of this paper is a theoretical analysis of SOSlasso, which also covers sparse group lasso as a\nspecial case (further differentiating us from [14]). The performance of SOSlasso is analyzed, error\nbounds are derived for general loss functions, and its consistency is shown for squared error loss.\nExperiments with real and synthetic data demonstrate the advantages of SOSlasso relative to lasso\nand Glasso.\n\n1.1 Sparse Overlapping Sets\n\nSOSlasso encourages sparsity patterns that are similar, but not identical, across tasks. This is ac-\ncomplished by decomposing the features of each task into groups G1 . . . GM , where M is the same\nfor each task, and Gi is a set of features that can be considered similar across tasks. Conceptually,\nSOSlasso \ufb01rst selects subsets that are most useful for all tasks, and then identi\ufb01es a unique sparse\nsolution for each task drawing only from features in the selected subsets. In the fMRI application\ndiscussed later, the subsets are simply clusters of adjacent spatial data points (voxels) in the brains of\nmultiple subjects. 
Figure 1 shows an example of the patterns that typically arise in sparse multitask\nlearning applications, where rows indicate features and columns correspond to tasks.\nPast work has focused on recovering variables that exhibit within and across group sparsity, when\nthe groups do not overlap [14], \ufb01nding application in genetics, handwritten character recognition\n[15] and climate and oceanography [2]. Along related lines, the exclusive lasso [21] can be used\nwhen it is explicitly known that variables in certain sets are negatively correlated.\n\n(a) Sparse\n\n(b) Group sparse\n\n(c) Group sparse\nplus sparse\n\n(d) Group sparse\nand sparse\n\nFigure 1: A comparison of different sparsity patterns. (a) shows a standard sparsity pattern. An\nexample of group sparse patterns promoted by Glasso [19] is shown in (b). In (c), we show the\npatterns considered in [6]. Finally, in (d), we show the patterns of interest in this paper.\n\n1.2 fMRI Applications\n\nIn psychological studies involving fMRI, multiple participants are scanned while subjected to ex-\nactly the same experimental manipulations. Cognitive neuroscientists are interested in identifying\nthe patterns of activity associated with different cognitive states, and in constructing a model of the activity\nthat accurately predicts the cognitive state evoked on novel trials. In these datasets, it is reasonable\nto expect that the same general areas of the brain will respond to the manipulation in every partici-\npant. However, the speci\ufb01c patterns of activity in these regions will vary, both because neural codes\ncan vary by participant [4] and because brains vary in size and shape, rendering neuroanatomy only\nan approximate guide to the location of relevant information across individuals. In short, a voxel\nuseful for prediction in one participant suggests the general anatomical neighborhood where useful\nvoxels may be found, but not the precise voxel. 
While logistic Glasso [17], lasso [13], and the elastic net penalty [12] have been applied to neuroimaging data, these methods do not explicitly take into account both the common macrostructure and the differences in microstructure across brains. SOSlasso, in contrast, lends itself well to such a scenario, as we will see from our experiments.\n\n1.3 Organization\n\nThe rest of the paper is organized as follows. In Section 2, we set up the notation, formally state the problem, and introduce the SOSlasso regularizer. We derive certain key properties of the regularizer in Section 3. In Section 4, we specialize the problem to the multitask linear regression setting (2), and derive consistency rates for it, leveraging ideas from [9]. In Section 5, we describe experiments on simulated data, perform logistic regression on fMRI data, and argue that SOSlasso yields more interpretable multivariate solutions than Glasso and lasso.\n\n2 Sparse Overlapping Sets Lasso\n\nWe first fix the notation used in the sequel. Lowercase and uppercase bold letters indicate vectors and matrices respectively. We assume a multitask learning framework, with a data matrix $\\Phi_t \\in \\mathbb{R}^{n \\times p}$ for each task $t \\in \\{1, 2, \\ldots, T\\}$. We assume there exists a vector $x_t^\\star \\in \\mathbb{R}^p$ such that the measurements obtained are of the form $y_t = \\Phi_t x_t^\\star + \\eta_t$, with $\\eta_t \\sim \\mathcal{N}(0, \\sigma^2 I)$. Let $X^\\star := [x_1^\\star \\; x_2^\\star \\; \\ldots \\; x_T^\\star] \\in \\mathbb{R}^{p \\times T}$. Suppose we are given $M$ (possibly overlapping) groups $\\tilde{\\mathcal{G}} = \\{\\tilde{G}_1, \\tilde{G}_2, \\ldots, \\tilde{G}_M\\}$, so that $\\tilde{G}_i \\subset \\{1, 2, \\ldots, p\\}$ $\\forall i$, of maximum size $B$. These groups contain sets of \u201csimilar\u201d features, the notion of similarity being application dependent. We assume that all but $k \\ll M$ groups are identically zero. 
Among the active groups, we further assume that at most a fraction $\\alpha \\in (0, 1)$ of the coefficients per group are nonzero. We consider the following optimization program in this paper:\n\n$$\\hat{X} = \\arg\\min_{x} \\left\\{ \\sum_{t=1}^{T} L_{\\Phi_t}(x_t) + \\lambda_n h(x) \\right\\} \\qquad (1)$$\n\nwhere $x = [x_1^\\top \\; x_2^\\top \\; \\ldots \\; x_T^\\top]^\\top$, $h(x)$ is a regularizer and $L_t := L_{\\Phi_t}(x_t)$ denotes the loss function, whose value depends on the data matrix $\\Phi_t$. We consider least squares and logistic loss functions. In the least squares setting, we have $L_t = \\frac{1}{2n}\\|y_t - \\Phi_t x_t\\|^2$. We reformulate the optimization problem (1) with the least squares loss as\n\n$$\\hat{x} = \\arg\\min_{x} \\left\\{ \\frac{1}{2n}\\|y - \\Phi x\\|_2^2 + \\lambda_n h(x) \\right\\} \\qquad (2)$$\n\nwhere $y = [y_1^\\top \\; y_2^\\top \\; \\ldots \\; y_T^\\top]^\\top$ and the block diagonal matrix $\\Phi$ is formed by block concatenating the $\\Phi_t$'s. We use this reformulation for ease of exposition (see also [8] and references therein). Note that $x \\in \\mathbb{R}^{Tp}$, $y \\in \\mathbb{R}^{Tn}$, and $\\Phi \\in \\mathbb{R}^{Tn \\times Tp}$. We also define $\\mathcal{G} = \\{G_1, G_2, \\ldots, G_M\\}$ to be the set of groups defined on $\\mathbb{R}^{Tp}$ formed by aggregating the rows of $X$ that were originally in $\\tilde{\\mathcal{G}}$, so that $x$ is composed of groups $G \\in \\mathcal{G}$.\nWe next define a regularizer $h$ that promotes sparsity both within and across overlapping sets of similar features:\n\n$$h(x) = \\inf_{\\mathcal{W}} \\sum_{G \\in \\mathcal{G}} \\left( \\alpha_G \\|w_G\\|_2 + \\|w_G\\|_1 \\right) \\quad \\text{s.t.} \\quad \\sum_{G \\in \\mathcal{G}} w_G = x \\qquad (3)$$\n\nwhere the $\\alpha_G > 0$ are constants that balance the tradeoff between the group norms and the $\\ell_1$ norm. Each $w_G$ has the same size as $x$, with support restricted to the variables indexed by group $G$. $\\mathcal{W}$ is a set of vectors, where each vector has a support restricted to one of the groups $G \\in \\mathcal{G}$:\n\n$$\\mathcal{W} = \\{ w_G \\in \\mathbb{R}^{Tp} \\mid [w_G]_i = 0 \\text{ if } i \\notin G \\}$$\n\nwhere $[w_G]_i$ is the $i$th coefficient of $w_G$. 
The SOSlasso is the optimization in (1) with $h(x)$ as defined in (3).\nWe say the set of vectors $w_G$ is an optimal decomposition of $x$ if they achieve the inf in (3). The objective function in (3) is convex and coercive. Hence, $\\forall x$, an optimal decomposition always exists.\nAs the $\\alpha_G \\to \\infty$ the $\\ell_1$ term becomes redundant, reducing $h(x)$ to the overlapping group lasso penalty introduced in [5], and studied in [10, 11]. When the $\\alpha_G \\to 0$, the overlapping group lasso term vanishes and $h(x)$ reduces to the lasso penalty. We consider $\\alpha_G = 1$ $\\forall G$. All the results in the paper can be easily modified to incorporate different settings for the $\\alpha_G$.\n\nSupport | Values | $\\sum_G \\|x_G\\|_2$ | $\\|x\\|_1$ | $\\sum_G (\\|x_G\\|_2 + \\|x_G\\|_1)$\n{1, 4, 9} | {3, 4, 7} | 12 | 14 | 26\n{1, 2, 3, 4, 5} | {2, 5, 2, 4, 5} | 8.602 | 18 | 26.602\n{1, 3, 4} | {3, 4, 7} | 8.602 | 14 | 22.602\n\nTable 1: Different instances of a 10-d vector and their corresponding norms.\n\nThe example in Table 1 gives an insight into the kind of sparsity patterns preferred by the function $h(x)$. The optimization problems (1) and (2) will prefer solutions that have a small value of $h(\\cdot)$. Consider 3 instances of $x \\in \\mathbb{R}^{10}$, and the corresponding group lasso, $\\ell_1$, and $h(x)$ function values. The vector is assumed to be made up of two groups, $G_1 = \\{1, 2, 3, 4, 5\\}$ and $G_2 = \\{6, 7, 8, 9, 10\\}$. $h(x)$ is smallest when the support set is sparse within groups, and also when only one of the two groups is selected. The $\\ell_1$ norm does not take into account sparsity across groups, while the group lasso norm does not take into account sparsity within groups.\nTo solve (1) and (2) with the regularizer proposed in (3), we use the covariate duplication method of [5], to reduce the problem to a non overlapping sparse group lasso problem. 
We then use proximal point methods [7] in conjunction with the MALSAR [20] package to solve the optimization problem.\n\n3 Error Bounds for SOSlasso with General Loss Functions\n\nWe derive certain key properties of the regularizer $h(\\cdot)$ in (3), independent of the loss function used.\nLemma 3.1 The function $h(x)$ in (3) is a norm.\nThe proof follows from basic properties of norms and from the fact that if $w_G$, $v_G$ are optimal decompositions of $x$, $y$, then $w_G + v_G$ is a valid, though not necessarily optimal, decomposition of $x + y$. For a detailed proof, please refer to the supplementary material.\nThe dual norm of $h(x)$ can be bounded as\n\n$$h^*(u) = \\max_{x} \\{x^\\top u\\} \\quad \\text{s.t.} \\quad h(x) \\leq 1$$\n$$= \\max_{\\mathcal{W}} \\Big\\{\\sum_{G \\in \\mathcal{G}} w_G^\\top u_G\\Big\\} \\quad \\text{s.t.} \\quad \\sum_{G \\in \\mathcal{G}} (\\|w_G\\|_2 + \\|w_G\\|_1) \\leq 1$$\n$$\\overset{(i)}{\\leq} \\max_{\\mathcal{W}} \\Big\\{\\sum_{G \\in \\mathcal{G}} w_G^\\top u_G\\Big\\} \\quad \\text{s.t.} \\quad \\sum_{G \\in \\mathcal{G}} 2\\|w_G\\|_2 \\leq 1$$\n$$= \\max_{\\mathcal{W}} \\Big\\{\\sum_{G \\in \\mathcal{G}} w_G^\\top u_G\\Big\\} \\quad \\text{s.t.} \\quad \\sum_{G \\in \\mathcal{G}} \\|w_G\\|_2 \\leq \\frac{1}{2}$$\n$$\\Rightarrow h^*(u) \\leq \\frac{1}{2} \\max_{G \\in \\mathcal{G}} \\|u_G\\|_2 \\qquad (4)$$\n\n(i) follows from the fact that the constraint set in (i) is a superset of the constraint set in the previous statement, since $\\|a\\|_2 \\leq \\|a\\|_1$. (4) follows from noting that the maximum is obtained by setting $w_{G^*} = \\frac{u_{G^*}}{2\\|u_{G^*}\\|_2}$, where $G^* = \\arg\\max_{G \\in \\mathcal{G}} \\|u_G\\|_2$. The inequality (4) is far more tractable than the actual dual norm, and will be useful in our derivations below. Since $h(\\cdot)$ is a norm, we can apply methods developed in [9] to derive consistency rates for the optimization problems (1) and (2). 
We will use the same notations as in [9] wherever possible.\nDefinition 3.2 A norm $h(\\cdot)$ is decomposable with respect to the subspace pair $s_A \\subset s_B$ if $h(a + b) = h(a) + h(b)$ $\\forall a \\in s_A, b \\in s_B^\\perp$.\nLemma 3.3 Let $x^\\star \\in \\mathbb{R}^p$ be a vector that can be decomposed into (overlapping) groups with within-group sparsity. Let $\\mathcal{G}^\\star \\subset \\mathcal{G}$ be the set of active groups of $x^\\star$. Let $S = \\mathrm{supp}(x^\\star)$ indicate the support set of $x^\\star$. Let $s_A$ be the subspace spanned by the coordinates indexed by $S$, and let $s_B = s_A$. We then have that the norm in (3) is decomposable with respect to $s_A$, $s_B$.\n\nThe result follows in a straightforward way from noting that supports of decompositions for vectors in $s_A$ and $s_B^\\perp$ do not overlap. We defer the proof to the supplementary material.\nDefinition 3.4 Given a subspace $s_B$, the subspace compatibility constant with respect to a norm $\\|\\cdot\\|$ is given by\n\n$$\\Psi(s_B) = \\sup_{x \\in s_B \\setminus \\{0\\}} \\frac{h(x)}{\\|x\\|}$$\n\nLemma 3.5 Consider a vector $x$ that can be decomposed into $\\mathcal{G}^\\star \\subset \\mathcal{G}$ active groups. Suppose the maximum group size is $B$, and also assume that a fraction $\\alpha \\in (0, 1)$ of the coordinates in each active group is nonzero. Then,\n\n$$h(x) \\leq (1 + \\sqrt{B\\alpha}) \\sqrt{|\\mathcal{G}^\\star|} \\, \\|x\\|_2$$\n\nProof For any vector $x$ with $\\mathrm{supp}(x) \\subset \\mathcal{G}^\\star$, there exists a representation $x = \\sum_{G \\in \\mathcal{G}^\\star} w_G$, such that the supports of the different $w_G$ do not overlap. 
Then,\n\n$$h(x) \\leq \\sum_{G \\in \\mathcal{G}^\\star} (\\|w_G\\|_2 + \\|w_G\\|_1) \\leq (1 + \\sqrt{B\\alpha}) \\sum_{G \\in \\mathcal{G}^\\star} \\|w_G\\|_2 \\leq (1 + \\sqrt{B\\alpha}) \\sqrt{|\\mathcal{G}^\\star|} \\, \\|x\\|_2$$\n\nWe see that $(1 + \\sqrt{B\\alpha}) \\sqrt{|\\mathcal{G}^\\star|}$ (Lemma 3.5) gives an upper bound on the subspace compatibility constant with respect to the $\\ell_2$ norm for the subspace indexed by the support of the vector, which is contained in the span of the union of groups in $\\mathcal{G}^\\star$.\nDefinition 3.6 For a given set $S$, and given vector $x^\\star$, the loss function $L_\\Phi(x)$ satisfies the Restricted Strong Convexity (RSC) condition with parameter $\\kappa$ and tolerance $\\tau$ if\n\n$$L_\\Phi(x^\\star + \\Delta) - L_\\Phi(x^\\star) - \\langle \\nabla L_\\Phi(x^\\star), \\Delta \\rangle \\geq \\kappa \\|\\Delta\\|_2^2 - \\tau^2(x^\\star) \\quad \\forall \\Delta \\in S$$\n\nIn this paper, we consider vectors $x^\\star$ that lie exactly in $k \\ll M$ groups, and display within-group sparsity. This implies that the tolerance $\\tau(x^\\star) = 0$, and we will ignore this term henceforth.\nWe also define the following set, which will be used in the sequel:\n\n$$C(s_A, s_B, x^\\star) := \\{ \\Delta \\in \\mathbb{R}^p \\mid h(\\Pi_{s_B^\\perp} \\Delta) \\leq 3 h(\\Pi_{s_B} \\Delta) + 4 h(\\Pi_{s_A^\\perp} x^\\star) \\} \\qquad (5)$$\n\nwhere $\\Pi_{s_A}(\\cdot)$ denotes the projection onto the subspace $s_A$. Based on the results above, we can now apply a result from [9] to the SOSlasso:\n\nTheorem 3.7 (Corollary 1 in [9]) Consider a convex and differentiable loss function such that RSC holds with constants $\\kappa$ and $\\tau = 0$ over (5), and a norm $h(\\cdot)$ decomposable over sets $s_A$ and $s_B$. 
For the optimization program in (1), using the parameter $\\lambda_n \\geq 2h^*(\\nabla L_\\Phi(x^\\star))$, any optimal solution $\\hat{x}_{\\lambda_n}$ to (1) satisfies\n\n$$\\|\\hat{x}_{\\lambda_n} - x^\\star\\|^2 \\leq 9 \\frac{\\lambda_n^2}{\\kappa} \\Psi^2(s_B)$$\n\nThe result above shows a general bound on the error using the lasso with sparse overlapping sets. Note that the regularization parameter $\\lambda_n$ as well as the RSC constant $\\kappa$ depend on the loss function $L_\\Phi(x)$. Convergence for logistic regression settings may be derived using methods in [1]. In the next section, we consider the least squares loss (2), and show that the estimate using the SOSlasso is consistent.\n\n4 Consistency of SOSlasso with Squared Error Loss\n\nWe first need to bound the dual norm of the gradient of the loss function, so as to bound $\\lambda_n$. Consider $L := L_\\Phi(x) = \\frac{1}{2n}\\|y - \\Phi x\\|^2$. The gradient of the loss function with respect to $x$ is given by $\\nabla L = \\frac{1}{n}\\Phi^\\top(\\Phi x - y) = \\frac{1}{n}\\Phi^\\top \\eta$ where $\\eta = [\\eta_1^\\top \\; \\eta_2^\\top \\; \\ldots \\; \\eta_T^\\top]^\\top$ (see Section 2). Our goal now is to find an upper bound on the quantity $h^*(\\nabla L)$, which from (4) is\n\n$$\\frac{1}{2} \\max_{G \\in \\mathcal{G}} \\|\\nabla L_G\\|_2 = \\frac{1}{2n} \\max_{G \\in \\mathcal{G}} \\|\\Phi_G^\\top \\eta\\|_2$$\n\nwhere $\\Phi_G$ is the matrix $\\Phi$ restricted to the columns indexed by the group $G$. We will prove an upper bound for the above quantity in the course of the results that follow.\nSince $\\eta \\sim \\mathcal{N}(0, \\sigma^2 I)$, we have $\\Phi_G^\\top \\eta \\sim \\sigma \\mathcal{N}(0, \\Phi_G^\\top \\Phi_G)$. Defining $\\sigma_{mG} := \\sigma_{\\max}\\{\\Phi_G^\\top \\Phi_G\\}$ to be the maximum singular value, we have $\\|\\Phi_G^\\top \\eta\\|_2^2 \\leq \\sigma^2 \\sigma_{mG}^2 \\|\\gamma\\|_2^2$, where $\\gamma \\sim \\mathcal{N}(0, I_{|G|}) \\Rightarrow \\|\\gamma\\|_2^2 \\sim \\chi^2_{|G|}$, where $\\chi^2_d$ is a chi-squared random variable with $d$ degrees of freedom. 
This allows us to work with the more tractable chi squared random variable when we look to bound the dual norm of $\\nabla L$. The next lemma helps us obtain a bound on the maximum of $\\chi^2$ random variables.\nLemma 4.1 Let $z_1, z_2, \\ldots, z_M$ be chi-squared random variables with $d$ degrees of freedom. Then for some constant $c$,\n\n$$P\\left( \\max_{i=1,2,\\ldots,M} z_i \\leq c^2 d \\right) \\geq 1 - \\exp\\left( \\log(M) - \\frac{(c-1)^2 d}{2} \\right)$$\n\nProof From the chi-squared tail bound in [3], $P(z_i \\geq c^2 d) \\leq \\exp\\left( -\\frac{(c-1)^2 d}{2} \\right)$. The result follows from a union bound and inverting the expression.\n\nLemma 4.2 Consider the loss function $L := \\frac{1}{2n} \\sum_{t=1}^{T} \\|y_t - \\Phi_t x_t\\|^2 = \\frac{1}{2n}\\|y - \\Phi x\\|^2$, with the $\\Phi_t$'s deterministic and the measurements corrupted with AWGN of variance $\\sigma^2$. For the regularizer in (3), the dual norm of the gradient of the loss function is bounded as\n\n$$h^*(\\nabla L)^2 \\leq \\frac{\\sigma^2 \\sigma_m^2}{4} \\, \\frac{\\log(M) + TB}{n}$$\n\nwith probability at least $1 - c_1 \\exp(-c_2 n)$, for $c_1, c_2 > 0$, and where $\\sigma_m = \\max_{G \\in \\mathcal{G}} \\sigma_{mG}$.\nProof Let $\\gamma \\sim \\chi^2_{T|G|}$. We begin with the upper bound obtained for the dual norm of the regularizer in (4):\n\n$$h^*(\\nabla L)^2 \\overset{(i)}{\\leq} \\frac{1}{4} \\max_{G \\in \\mathcal{G}} \\left\\| \\frac{1}{n} \\Phi_G^\\top \\eta \\right\\|_2^2$$\n$$\\leq \\frac{\\sigma^2}{4} \\max_{G \\in \\mathcal{G}} \\frac{\\sigma_{mG}^2 \\gamma}{n^2}$$\n$$\\overset{(ii)}{\\leq} \\frac{\\sigma^2 \\sigma_m^2}{4} \\max_{G \\in \\mathcal{G}} \\frac{\\gamma}{n^2}$$\n$$\\overset{(iii)}{\\leq} \\frac{\\sigma^2 \\sigma_m^2}{4} c^2 TB \\quad \\text{w.p.} \\quad 1 - \\exp\\left( \\log(M) - \\frac{(cn - 1)^2 TB}{2} \\right)$$\n\n
where (i) follows from the formulation of the gradient of the loss function and the fact that the square of maximum of non negative numbers is the maximum of the squares of the same numbers. In (ii), we have defined $\\sigma_m = \\max_G \\sigma_{mG}$. Finally, we have made use of Lemma 4.1 in (iii). We then set\n\n$$c^2 = \\frac{\\log(M) + TB}{TBn}$$\n\nto obtain the result.\n\nWe combine the results developed so far to derive the following consistency result for the SOSlasso, with the least squares loss function.\n\nTheorem 4.3 Suppose we obtain linear measurements of a sparse overlapping grouped matrix $X^\\star \\in \\mathbb{R}^{p \\times T}$, corrupted by AWGN of variance $\\sigma^2$. Suppose the matrix $X^\\star$ can be decomposed into $M$ possible overlapping groups of maximum size $B$, out of which $k$ are active. Furthermore, assume that a fraction $\\alpha \\in (0, 1]$ of the coefficients are nonzero in each active group. Consider the following vectorized SOSlasso multitask regression problem (2):\n\n$$\\hat{x} = \\arg\\min_{x} \\left\\{ \\frac{1}{2n}\\|y - \\Phi x\\|_2^2 + \\lambda_n h(x) \\right\\},$$\n$$h(x) = \\inf_{\\mathcal{W}} \\sum_{G \\in \\mathcal{G}} (\\|w_G\\|_2 + \\|w_G\\|_1) \\quad \\text{s.t.} \\quad \\sum_{G \\in \\mathcal{G}} w_G = x$$\n\nSuppose the data matrices $\\Phi_t$ are non random, and the loss function satisfies restricted strong convexity assumptions with parameter $\\kappa$. 
Then, for $\\lambda_n \\geq \\frac{\\sigma^2 \\sigma_m^2 (\\log(M) + TB)}{4n}$, the following holds with probability at least $1 - c_1 \\exp(-c_2 n)$, with $c_1, c_2 > 0$:\n\n$$\\|\\hat{x} - x^\\star\\|^2 \\leq \\frac{9}{4} \\, \\frac{\\sigma^2 \\sigma_m^2 \\left( 1 + \\sqrt{TB\\alpha} \\right)^2 k (\\log(M) + TB)}{n\\kappa}$$\n\nwhere we define $\\sigma_m := \\max_{G \\in \\mathcal{G}} \\sigma_{\\max}\\{\\Phi_G^\\top \\Phi_G\\}$.\nProof Follows from substituting in Theorem 3.7 the results from Lemma 3.5 and Lemma 4.2.\n\nFrom [9], we see that the convergence rate matches that of the group lasso, with an additional multiplicative factor $\\alpha$. This stems from the fact that the signal has a sparse structure \u201cembedded\u201d within a group sparse structure. Visualizing the optimization problem as that of solving a lasso within a group lasso framework lends some intuition into this result. Note that since $\\alpha < 1$, this bound is smaller than that of the standard group lasso.\n\n5 Experiments and Results\n\n5.1 Synthetic data, Gaussian Linear Regression\n\nFor $T = 20$ tasks, we define an $N = 2002$ element vector divided into $M = 500$ groups of size $B = 6$. Each group overlaps with its neighboring groups ($G_1 = \\{1, 2, \\ldots, 6\\}$, $G_2 = \\{5, 6, \\ldots, 10\\}$, $G_3 = \\{9, 10, \\ldots, 14\\}$, ...). 20 of these groups were activated uniformly at random, and populated from a uniform $[-1, 1]$ distribution. A proportion $\\alpha$ of these coefficients with largest magnitude were retained as true signal. For each task, we obtain 250 linear measurements using a $\\mathcal{N}(0, \\frac{1}{250} I)$ matrix. We then corrupt each measurement with Additive White Gaussian Noise (AWGN), and assess signal recovery in terms of Mean Squared Error (MSE). The regularization parameter was clairvoyantly picked to minimize the MSE over a range of parameter values. 
The results of applying\nlasso, standard latent group lasso [5, 10], and our SOSlasso to these data are plotted in Figures 2(a),\nvarying \u03c3, \u03b1 = 0.2, and 2(b), varying \u03b1, \u03c3 = 0.1. Each point in Figures 2(a) and 2(b), is the\naverage of 100 trials, where each trial is based on a new random instance of X (cid:63) and the Gaussian\ndata matrices.\n\n(a) Varying \u03c3\n\n(b) Varying \u03b1\n\n(c) Sample pattern\n\nFigure 2: As the noise is increased (a), our proposed penalty function (SOSlasso) allows us to\nrecover the true coef\ufb01cients more accurately than the group lasso (Glasso). Also, when alpha is\nlarge, the active groups are not sparse, and the standard overlapping group lasso outperforms the\nother methods. However, as \u03b1 reduces, the method we propose outperforms the group lasso (b). (c)\nshows a toy sparsity pattern, with different colors denoting different overlapping groups\n\n5.2 The SOSlasso for fMRI\n\nIn this experiment, we compared SOSlasso, lasso, and Glasso in analysis of the star-plus dataset [18].\n6 subjects made judgements that involved processing 40 sentences and 40 pictures while their brains\nwere scanned in half second intervals using fMRI1. We retained the 16 time points following each\nstimulus, yielding 1280 measurements at each voxel. The task is to distinguish, at each point in time,\nwhich stimulus a subject was processing. [18] showed that there exists cross-subject consistency in\nthe cortical regions useful for prediction in this task. 
Specifically, experts partitioned each dataset into 24 non overlapping regions of interest (ROIs), then reduced the data by discarding all but 7 ROIs and, for each subject, averaging the BOLD response across voxels within each ROI. They then showed that a classifier trained on data from 5 subjects generalized when applied to data from a 6th.\nWe assessed whether SOSlasso could leverage this cross-individual consistency to aid in the discovery of predictive voxels without requiring expert pre-selection of ROIs, or data reduction, or any alignment of voxels beyond that existing in the raw data. Note that, unlike [18], we do not aim to learn a solution that generalizes to a withheld subject. Rather, we aim to discover a group sparsity pattern that suggests a similar set of voxels in all subjects, before optimizing a separate solution for each individual. If SOSlasso can exploit cross-individual anatomical similarity from this raw, coarsely-aligned data, it should show reduced cross-validation error relative to the lasso applied separately to each individual. If the solution is sparse within groups and highly variable across individuals, SOSlasso should show reduced cross-validation error relative to Glasso. Finally, if SOSlasso is finding useful cross-individual structure, the features it selects should align at least somewhat with the expert-identified ROIs shown by [18] to carry consistent information.\n\n1 Data and documentation available at http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/\n\nFigure 3: Results from fMRI experiments. (a) Aggregated sparsity patterns for a single brain slice. (b) Cross-validation error obtained with each method. Lines connect data for a single subject. 
(c) The full sparsity pattern obtained with SOSlasso.\n\nMethod | % ROI | t(5), p\nlasso | 46.11 | 6.08, 0.001\nGlasso | 50.89 | 5.65, 0.002\nSOSlasso | 70.31 |\n\nTable 2: Proportion of selected voxels in the 7 relevant ROIs aggregated over subjects, and corresponding two-tailed significance levels for the contrast of lasso and Glasso to SOSlasso.\n\nWe trained 3 classifiers using 4-fold cross validation to select the regularization parameter, considering all available voxels without preselection. We grouped regions of 5 \u00d7 5 \u00d7 1 voxels and considered overlapping groups \u201cshifted\u201d by 2 voxels in the first 2 dimensions.2 Figure 3(b) shows the individual error rates across the 6 subjects for the three methods. Across subjects, SOSlasso had a significantly lower cross-validation error rate (27.47 %) than individual lasso (33.3 %; within-subjects t(5) = 4.8; p = 0.004 two-tailed), showing that the method can exploit anatomical similarity across subjects to learn a better classifier for each. SOSlasso also showed significantly lower error rates than Glasso (31.1 %; t(5) = 2.92; p = 0.03 two-tailed), suggesting that the signal is sparse within selected regions and variable across subjects.\nFigure 3(a) presents a sample of the sparsity patterns obtained from the different methods, aggregated over all subjects. Red points indicate voxels that contributed positively to picture classification in at least one subject, but never to sentences; blue points have the opposite interpretation. Purple points indicate voxels that contributed positively to picture and sentence classification in different subjects. The remaining slices for the SOSlasso are shown in Figure 3(c). There are three things to note from Figure 3(a). First, the Glasso solution is fairly dense, with many voxels signaling both picture and sentence across subjects. 
We believe this \u201cpurple haze\u201d demonstrates why Glasso is ill-\nsuited for fMRI analysis: a voxel selected for one subject must also be selected for all others. This\napproach will not succeed if, as is likely, there exists no direct voxel-to-voxel correspondence or if\nthe neural code is variable across subjects. Second, the lasso solution is less sparse than the SOSlasso\nbecause it allows any task-correlated voxel to be selected. It leads to a higher cross-validation error,\nindicating that the ungrouped voxels are inferior predictors (Figure 3(b)). Third, the SOSlasso yields\na solution that is not only sparse but also clustered. To assess how well these clusters align with the\nanatomical regions thought a priori to be involved in sentence and picture representation, we calcu-\nlated the proportion of selected voxels falling within the 7 ROIs identi\ufb01ed by [18] as relevant to the\nclassi\ufb01cation task (Table 2). For SOSlasso an average of 70% of identi\ufb01ed voxels fell within these\nROIs, signi\ufb01cantly more than for lasso or Glasso.\n\n6 Conclusions and Extensions\nWe have introduced SOSlasso, a function that recovers sparsity patterns that are a hybrid of overlap-\nping group sparse and sparse patterns when used as a regularizer in convex programs, and proved\nits theoretical convergence rates when minimizing least squares. The SOSlasso succeeds in a multi-\ntask fMRI analysis, where it both makes better inferences and discovers more theoretically plausible\nbrain regions than lasso and Glasso. 
Future work involves experimenting with different parameters\nfor the group and l1 penalties, and using other similarity groupings, such as functional connectivity\nin fMRI.\n\n2 The irregular group size compensates for voxels being larger and scanner coverage being smaller in the\nz-dimension (only 8 slices relative to 64 in the x- and y-dimensions).\n\n[Figure 3 labels: SOSlasso, Glasso, lasso; axis: Error (0.24 to 0.36); legend: Picture only, Sentence only, Picture and Sentence]\n\nReferences\n[1] Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic\nregression. arXiv preprint arXiv:1303.6149, 2013.\n\n[2] S. Chatterjee, A. Banerjee, and A. Ganguly. Sparse group lasso for regression on land climate variables.\nIn Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 1\u20138. IEEE,\n2011.\n\n[3] S. Dasgupta, D. Hsu, and N. Verma. A concentration theorem for projections. arXiv preprint\narXiv:1206.6813, 2012.\n\n[4] Eva Feredoes, Giulio Tononi, and Bradley R Postle. The neural bases of the short-term storage of verbal\ninformation are anatomically variable across individuals. The Journal of Neuroscience, 27(41):11003\u201311008,\n2007.\n\n[5] L. Jacob, G. Obozinski, and J. P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the\n26th Annual International Conference on Machine Learning, pages 433\u2013440. ACM, 2009.\n\n[6] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. Advances in\nNeural Information Processing Systems, 23:964\u2013972, 2010.\n\n[7] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. arXiv\npreprint arXiv:1009.2139, 2010.\n\n[8] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task\nlearning. arXiv preprint arXiv:0903.1468, 2009.\n\n[9] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and Bin Yu. 
A unified framework for high-dimensional\nanalysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538\u2013557, 2012.\n\n[10] G. Obozinski, L. Jacob, and J.P. Vert. Group lasso with overlaps: The latent group lasso approach. arXiv\npreprint arXiv:1110.0413, 2011.\n\n[11] N. Rao, B. Recht, and R. Nowak. Universal measurement bounds for structured sparse signal recovery.\nIn Proceedings of AISTATS, volume 2102, 2012.\n\n[12] Irina Rish, Guillermo A Cecchi, Kyle Heuton, Marwan N Baliki, and A Vania Apkarian. Sparse\nregression analysis of task-relevant information distribution in the brain. In Proceedings of SPIE, volume\n8314, page 831412, 2012.\n\n[13] Srikanth Ryali, Kaustubh Supekar, Daniel A Abrams, and Vinod Menon. Sparse logistic regression for\nwhole brain classification of fMRI data. NeuroImage, 51(2):752, 2010.\n\n[14] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and\nGraphical Statistics, (just-accepted), 2012.\n\n[15] P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In\nInformation Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1\u20136. IEEE, 2010.\n\n[16] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical\nSociety. Series B (Methodological), pages 267\u2013288, 1996.\n\n[17] Marcel van Gerven, Christian Hesse, Ole Jensen, and Tom Heskes. Interpreting single trial data using\ngroupwise regularisation. NeuroImage, 46(3):665\u2013676, 2009.\n\n[18] X. Wang, T. M Mitchell, and R. Hutchinson. Using machine learning to detect cognitive states across\nmultiple subjects. CALD KDD project paper, 2003.\n\n[19] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the\nRoyal Statistical Society: Series B (Statistical Methodology), 68(1):49\u201367, 2006.\n\n[20] J. Zhou, J. Chen, and J. Ye. 
MALSAR: Multi-task learning via structural regularization, 2012.\n\n[21] Y. Zhou, R. Jin, and S. C. Hoi. Exclusive lasso for multi-task feature selection. In Proceedings of the\nInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2010.\n", "award": [], "sourceid": 1072, "authors": [{"given_name": "Nikhil", "family_name": "Rao", "institution": "UW-Madison"}, {"given_name": "Christopher", "family_name": "Cox", "institution": "UW-Madison"}, {"given_name": "Rob", "family_name": "Nowak", "institution": "UW-Madison"}, {"given_name": "Timothy", "family_name": "Rogers", "institution": "UW-Madison"}]}