{"title": "Polar Operators for Structured Sparse Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 82, "page_last": 90, "abstract": "Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso.", "full_text": "Polar Operators for Structured Sparse Estimation\n\nXinhua Zhang\n\nMachine Learning Research Group\nNational ICT Australia and ANU\nxinhua.zhang@anu.edu.au\n\nYaoliang Yu and Dale Schuurmans\n\nDepartment of Computing Science, University of Alberta\n\nEdmonton, Alberta T6G 2E8, Canada\n\n{yaoliang,dale}@cs.ualberta.ca\n\nAbstract\n\nStructured sparse estimation has become an important technique in many areas of\ndata analysis. Unfortunately, these estimators normally create computational dif-\n\ufb01culties that entail sophisticated algorithms. Our \ufb01rst contribution is to uncover a\nrich class of structured sparse regularizers whose polar operator can be evaluated\nef\ufb01ciently. With such an operator, a simple conditional gradient method can then\nbe developed that, when combined with smoothing and local optimization, signif-\nicantly reduces training time vs. the state of the art. 
We also demonstrate a new\nreduction of polar to proximal maps that enables more ef\ufb01cient latent fused lasso.\n\nIntroduction\n\n1\nSparsity is an important concept in high-dimensional statistics [1] and signal processing [2] that has\nled to important application successes by reducing model complexity and improving interpretability\nof the results. Standard computational strategies such as greedy feature selection [3] and generic\nconvex optimization [4\u20137] can be used to implement simple sparse estimators. However, sophis-\nticated notions of structured sparsity have been recently developed that can encode combinatorial\npatterns over variable subsets [8]. Although combinatorial structure greatly enhances modeling ca-\npability, it also creates computational challenges that require sophisticated optimization approaches.\nFor example, current structured sparse estimators often adopt an accelerated proximal gradient\n(APG) strategy [9, 10], which has a low per-step complexity and enjoys an optimal convergence\nrate among black-box \ufb01rst-order procedures [10]. Unfortunately, APG must also compute a proxi-\nmal update (PU) of the nonsmooth regularizer during each iteration. Not only does the PU require a\nhighly nontrivial computation for structured regularizers [4]\u2014e.g., requiring tailored network \ufb02ow\nalgorithms in existing cases [5, 11, 12]\u2014it yields dense intermediate iterates. Recently, [6] has\ndemonstrated a class of regularizers where the corresponding PUs can be computed by a sequence\nof submodular function minimizations, but such an approach remains expensive.\nInstead, in this paper, we demonstrate that an alternative approach can be more effective for many\nstructured regularizers. We base our development on the generalized conditional gradient (GCG)\nalgorithm [13, 14], which also demonstrates promise for sparse model optimization. 
Although GCG\npossesses a slower convergence rate than APG, it demonstrates competitive performance if its up-\ndates are interleaved with local optimization [14\u201316]. Moreover, GCG produces sparse intermediate\niterates, which allows additional sparsity control. Importantly, unlike APG, GCG requires comput-\ning the polar of the regularizer, instead of the PU, in each step. This difference allows important\nnew approaches for characterizing and evaluating structured sparse regularizers.\nOur \ufb01rst main contribution is to characterize a rich class of structured sparse regularizers that allow\nef\ufb01cient computation of their polar operator. In particular, motivated by [6], we consider a family\nof structured sparse regularizers induced by a cost function on variable subsets. By introducing a\n\u201clifting\u201d construction, we show how these regularizers can be expressed as linear functions, which\nafter some reformulation, allows ef\ufb01cient evaluation by a simple linear program (LP). Important\nexamples covered include overlapping group lasso [5] and path regularization in directed acyclic\ngraphs [12]. By exploiting additional structure in these cases, the LP can be reduced to a piecewise\n\n1\n\n\flinear objective over a simple domain, allowing further reduction in computation time via smoothing\n\u221a\n[17]. For example, for the overlapping group lasso with n groups where each variable belongs to at\nmost r groups, the cost of evaluating the polar operator can be reduced from O(rn3) to O(rn\nn/\u0001)\nfor a desired accuracy of \u0001. Encouraged by the superior performance of GCG in these cases, we\nthen provide a simple reduction of the polar operator to the PU. This reduction makes it possible to\nextend GCG to cases where the PU is easy to compute. 
To illustrate the usefulness of this reduction\nwe provide an ef\ufb01cient new algorithm for solving the fused latent lasso [18].\n\n2 Structured Sparse Models\n\nConsider the standard regularized risk minimization framework\n\nmin\nw\u2208Rn\n\nf (w) + \u03bb \u2126(w),\n\n(1)\n\nwhere f is the empirical risk, assumed to be convex with a Lipschitz continuous gradient, and \u2126 is\na convex, positively homogeneous regularizer, i.e. a gauge [19, \u00a74]. Let 2[n] denote the power set of\n[n] := {1, . . . , n}, and let R+ := R+ \u222a {\u221e}. Recently, [6] has established a principled method for\nderiving regularizers from a subset cost function F : 2[n] \u2192 R+ based on de\ufb01ning the gauge:\n\n\u2126F (w) = inf{\u03b3 \u2265 0 : w\u2208 \u03b3 conv(SF )}, where SF =(cid:8)wA : (cid:107)wA(cid:107)p\n\n\u02dcp = 1/F (A),\u2205 (cid:54)= A \u2286 [n](cid:9). (2)\n\np = 1, (cid:107) \u00b7 (cid:107)p\nHere \u03b3 is a scalar, conv(SF ) denotes the convex hull of the set SF , \u02dcp, p \u2265 1 with 1\nthroughout is the usual (cid:96)p-norm, and wA denotes a duplicate of w with all coordinates not in A set\nto 0. Note that we have tacitly assumed F (A) = 0 iff A = \u2205 in (2). The gauge \u2126F de\ufb01ned in (2)\nis also known as the atomic norm with the set of atoms SF [20]. It will be useful to recall that the\npolar of a gauge \u2126 is de\ufb01ned by [19, \u00a715]:\n\n\u2126\u25e6(g) := supw{(cid:104)g, w(cid:105) : \u2126(w) \u2264 1}.\n\n(3)\nIn particular, the polar of a norm is its dual norm. (Recall that any norm is also a gauge.) For the\nspeci\ufb01c gauge \u2126F de\ufb01ned in (2), its polar is simply the support function of SF [19, Theorem 13.2]:\n(4)\n\n(cid:107)gA(cid:107)p /[F (A)]1/p.\n\n\u02dcp + 1\n\n(cid:104)g, w(cid:105) = max\n\u2205(cid:54)=A\u2286[n]\n\n\u2126\u25e6\nF (g) = max\nw\u2208SF\n\n(The \ufb01rst equality uses the de\ufb01nition of support function, and the second follows from (2).) 
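For intuition, the polar (4) is a finite maximization over nonempty supports A, so for small n it can be evaluated by brute-force enumeration. The following pure-Python sketch is ours, not part of the paper; the example cost function reproduces the l1/l-infinity pair discussed in the text:

```python
from itertools import combinations

def polar(g, F, p=2):
    """Brute-force evaluation of the polar in (4):
    max over nonempty A of ||g_A||_p / F(A)^(1/p)."""
    n, best = len(g), 0.0
    for k in range(1, n + 1):
        for A in combinations(range(n), k):
            cost = F(frozenset(A))
            if cost == float("inf"):
                continue  # A lies outside the effective domain of F
            norm_A = sum(abs(g[i]) ** p for i in A) ** (1.0 / p)
            best = max(best, norm_A / cost ** (1.0 / p))
    return best

# F(A) = 1 for singletons, infinity otherwise: Omega_F is the l1 norm
# and its polar Omega_F^o is the l_infinity norm.
F = lambda A: 1.0 if len(A) == 1 else float("inf")
print(polar([0.5, -2.0, 1.0], F))  # 2.0 = max_i |g_i|
```

Any other subset cost F can be plugged in the same way; the point of Section 3 is precisely to avoid this exponential enumeration.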
By vary-\ning \u02dcp and F , one can generate a class of sparsity inducing regularizers that includes most current\nproposals [6]. For instance, if F (A) = 1 whenever |A| (the cardinality of A) is 1, and F (A) = \u221e\nfor |A| > 1, then \u2126\u25e6\nF is the (cid:96)\u221e norm and \u2126F is the usual (cid:96)1 norm. More importantly, one can encode\nstructural information through the cost function F , which selects and establishes preferences over\nthe set of atoms SF . As pointed out in [6], when F is submodular, (4) can be evaluated by a se-\ncant method with submodular minimizations ([21, \u00a78.4], see also Appendix B). However, as we will\nshow, it is possible to do signi\ufb01cantly better by completely avoiding submodular optimization. Be-\nfore presenting our main results, we \ufb01rst review the state of the art for solving (1), and demonstrate\nhow the performance of current methods can hinge on ef\ufb01cient computation of (4).\n\n2.1 Optimization Algorithms\n\n(cid:107)w \u2212 wk(cid:107)2\n\nA standard approach for minimizing (1) is the accelerated proximal gradient (APG) algorithm [9,\n10], where each iteration involves solving the proximal update (PU): wk+1 = arg minw (cid:104)dk, w(cid:105) +\n2 + \u03bb\u2126F (w), for some step size sk and descent direction dk. Although it can be\n\u0001) iterations [9, 10], each update can be quite\n\n1\n2sk\nshown that APG \ufb01nds an \u0001 accurate solution in O(1/\ndif\ufb01cult to compute when \u2126F encodes combinatorial structure, as noted in the introduction.\nAn alternative approach to solving (1) is the generalized conditional gradient (GCG) method [13,\n14], which has recently received renewed attention. Unlike APG, GCG only requires the polar\noperator of the regularizer \u2126F to be computed in each iteration, given by the argument of (4):\nP\u25e6\nF (g) = arg max\nw\u2208SF\n\n(cid:104)gC, w(cid:105) for C = arg max\n\u2205(cid:54)=A\u2286[n]\n\n(cid:104)g, w(cid:105) = F (C)\n\np /F (A). 
(5)\n\n\u22121\np arg max\nw:(cid:107)w(cid:107) \u02dcp=1\n\n(cid:107)gA(cid:107)p\n\n\u221a\n\nAlgorithm 1 outlines a GCG procedure for solving (1) that only requires the evaluation of P\u25e6\nF in\neach iteration without needing the full PU to be computed. The algorithm is quite simple: Line 3\n\n2\n\n\fAlgorithm 1 Generalized conditional gradient (GCG) for optimizing (1).\n1: Initialize w0 \u2190 0, s0 \u2190 0, (cid:96)0 \u2190 0.\n2: for k = 0, 1, . . . do\n3:\n4:\n5:\n\nPolar operator: vk \u2190P\u25e6\n2-D Conic search: (\u03b1, \u03b2) := arg min\u03b1\u22650,\u03b2\u22650 f (\u03b1wk + \u03b2vk) + \u03bb(\u03b1sk + \u03b2).\nLocal re-optimization: {ui}k\ni F (Ai)\n\n} f ((cid:80)\ni ui, (cid:96)i \u2190 ui for i \u2264 k, sk+1 \u2190(cid:80)\n\n6: wk+1 \u2190(cid:80)\n\ni ui) + \u03bb(cid:80)\n\n1 := arg min{ui=ui\nAi\n\np(cid:107)ui(cid:107) \u02dcp.\n\ni F (Ai)\n\n1\n\nF (gk), Ak \u2190 C(gk), where gk =\u2212\u2207f (wk) and C is de\ufb01ned in (5).\n\nwhere the {ui} are initialized by ui = \u03b1(cid:96)i for i < k and ui = \u03b2vi for i = k.\n\n1\n\np(cid:107)ui(cid:107) \u02dcp\n\n7: end for\n\nevaluates the polar operator, which provides a descent direction vk; Line 4 \ufb01nds the optimal step\nsizes for combining the current iterate wk with the direction vk; and Line 5 locally improves the\nobjective (1) by maintaining the same support patterns but re-optimizing the parameters. It has been\nshown that GCG can \ufb01nd an \u0001 accurate solution to (1) in O(1/\u0001) steps, provided only that the polar\n(5) is computed to \u0001 accuracy [14]. Although GCG has a slower theoretical convergence rate than\nAPG, the introduction of local optimization (Line 5) often yields faster convergence in practice [14\u2013\n16]. Importantly, Line 5 does not increase the sparsity of the intermediate iterates. 
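To make Algorithm 1 concrete, here is a minimal toy instantiation (ours) for the plain l1 norm, where the polar operator (5) simply returns a signed coordinate vector; a coarse grid stands in for the 2-D conic search of Line 4, and the local re-optimization of Line 5 is omitted:

```python
def gcg_l1(X, y, lam, iters=30):
    """Toy GCG (Algorithm 1) for 0.5*||Xw - y||^2 + lam*||w||_1."""
    m, n = len(X), len(X[0])
    w, s = [0.0] * n, 0.0            # s tracks an upper bound on ||w||_1

    def f(w):
        return 0.5 * sum((sum(X[i][j] * w[j] for j in range(n)) - y[i]) ** 2
                         for i in range(m))

    def grad(w):
        r = [sum(X[i][j] * w[j] for j in range(n)) - y[i] for i in range(m)]
        return [sum(X[i][j] * r[i] for i in range(m)) for j in range(n)]

    for _ in range(iters):
        g = [-gj for gj in grad(w)]                  # g_k = -grad f(w_k)
        j = max(range(n), key=lambda j: abs(g[j]))   # polar operator (5)
        v = [0.0] * n
        v[j] = 1.0 if g[j] > 0 else -1.0             # atom v_k
        # Line 4: min over alpha, beta >= 0 of f(a*w + b*v) + lam*(a*s + b),
        # here approximated by a coarse grid on [0, 1]^2.
        best = min(((f([a * wi + b * vi for wi, vi in zip(w, v)])
                     + lam * (a * s + b), a, b)
                    for a in (t / 10 for t in range(11))
                    for b in (t / 10 for t in range(11))))
        _, a, b = best
        w = [a * wi + b * vi for wi, vi in zip(w, v)]
        s = a * s + b
    return w, f(w) + lam * s

X, y = [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.5]
w, obj = gcg_l1(X, y, lam=0.1)
print(obj < 0.625)  # objective at w = 0 is 0.625; GCG decreases it
```

Note how every iterate is a combination of at most k atoms, which is exactly the sparsity control of the intermediate iterates mentioned above.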
Our main goal\nin this paper therefore is to extend this GCG approach to structured sparse models by developing\nef\ufb01cient algorithms for computing the polar operator for the structured regularizers de\ufb01ned in (2).\n\n3 Polar Operators for Atomic Norms\n\nLet 1 denote the vector of all 1s with length determined by context. Our \ufb01rst main contribution is\nto develop a general class of atomic norm regularizers whose polar operator (5) can be computed\nef\ufb01ciently. To begin, consider the case of a (partially) linear function F where there exists a c \u2208\nRn such that F (A) = (cid:104)c, 1A(cid:105) for all A \u2208 dom F (note that the domain need not be a lattice).\nA few useful regularizers can be generated by linear functions: for example, the (cid:96)1 norm can be\nderived from F (A) = (cid:104)1, 1A(cid:105) for |A| = 1, which is linear. Unfortunately, linearity is too restrictive\nto capture most structured regularizers of interest, therefore we will need to expand the space of\nfunctions F we consider. To do so, we introduce the more general class of marginalized linear\nfunctions: we say that F is marginalized linear if there exists a nonnegative linear function M on an\nextended domain 2[n+l] such that its marginalization to 2[n] is exactly F :\n\u2200 A \u2286 [n].\n\nF (A) =\n\nM (B),\n\nmin\n\n(6)\n\nB:A\u2286B\u2286[n+l]\n\nEssentially, such a function F is \u201clifted\u201d to a larger domain where it becomes linear. The key\nquestion is whether the polar \u2126\u25e6\nTo develop an ef\ufb01cient procedure for computing the polar \u2126\u25e6\ncomputing the polar \u2126\u25e6\ncan be expressed as M (B) = (cid:104)b, 1B(cid:105) for B \u2208 dom M \u2286 2[n+l] (b \u2208 Rn+l\ndomain of M need not be the whole space in general, we make use of the specialized polytope:\n\nF , \ufb01rst consider the simpler case of\nM for a nonnegative linear function M. Note that by linearity the function M\n+ ). 
Since the effective\n\nF can be ef\ufb01ciently evaluated for such functions.\n\nP := conv{1B : B \u2208 dom M} \u2286 [0, 1]n+l.\n\n(7)\nNote P may have exponentially many faces. From the de\ufb01nition (4) one can then re-express the\npolar \u2126\u25e6\nM as:\n\u2126\u25e6\nM (g) =\n\nwhere \u02dcgi = |gi|p \u2200i,\n\n(cid:107)gB(cid:107)p /M (B)1/p =\n\n(cid:19)1/p\n\n(cid:18)\n\nmax\n\n(8)\n\n(cid:104)\u02dcg, w(cid:105)\n(cid:104)b, w(cid:105)\n\nmax\n0(cid:54)=w\u2208P\n\n\u2205(cid:54)=B\u2208dom M\n\nwhere we have used the fact that the linear-fractional objective must attain its maximum at vertices of\nP ; that is, at 1B for some B \u2208 dom M. Although the linear-fractional program (8) can be reduced to\na sequence of LPs using the classical method of [22], a single LP suf\ufb01ces for our purposes. Indeed,\nlet us \ufb01rst remove the constraint w (cid:54)= 0 by considering the alternative polytope:\n\nQ := P \u2229 {w \u2208 Rn+l : (cid:104)1, w(cid:105) \u2265 1}.\n\n(9)\nAs shown in Appendix A, all vertices of Q are scalar multiples of the nonzero vertices of P . Since\nthe objective in (8) is scale invariant, we can restrict the constraints to w \u2208 Q. Then, by applying\ntransformations \u02dcw = w/(cid:104)b, w(cid:105), \u03c3 = 1/(cid:104)b, w(cid:105), problem (8) can be equivalently re-expressed by:\n(10)\n\n(cid:104)\u02dcg, \u02dcw(cid:105) , subject to \u02dcw \u2208 \u03c3Q, (cid:104)b, \u02dcw(cid:105) = 1.\n\nmax\n\u02dcw,\u03c3>0\n\n3\n\n\f(cid:107)gA(cid:107)p\np\nF (A)\n\nOf course, whether this LP can be solved ef\ufb01ciently depends on the structure of Q (and of P indeed).\nFinally, we note that the same formulation allows the polar to be ef\ufb01ciently computed for a marginal-\nized linear function F via a simple reduction: Consider any g \u2208 Rn and let [g; 0] \u2208 Rn+l denote g\npadded by l zeros. 
Then \u2126\u25e6\n\nF (g) = \u2126\u25e6\n\nM ([g; 0]) for all g \u2208 Rn because\n(cid:107)gA(cid:107)p\n(cid:107)gA(cid:107)p\np\nM (B)\n\n= max\n\u2205(cid:54)=A\u2286B\n\np\n\n(cid:107)[g; 0]B(cid:107)p\nM (B)\n\np\n\n= max\n\n\u2205(cid:54)=A\u2286[n]\n\nminB:A\u2286B\u2286[n+l] M (B)\n\n. (11)\nmax\n\u2205(cid:54)=A\u2286[n]\nTo see the last equality, \ufb01xing B the optimal A is attained at A = B \u2229 [n]. If B \u2229 [n] is empty, then\n(cid:107)[g; 0]B(cid:107) = 0 and the corresponding B cannot be the maximizer of the last term, unless \u2126\u25e6\nF (g) = 0\nin which case it is easy to see \u2126\u25e6\nAlthough we have kept our development general so far, the idea is clear: once an appropriate \u201clifting\u201d\nhas been found so that the polytope Q in (9) can be compactly represented, the polar (5) can be\nreformulated as the LP (10), for which ef\ufb01cient implementations can be sought. We now demonstrate\nthis new methodology for the two important structured regularizers: group sparsity and path coding.\n\n= max\nB:\u2205(cid:54)=B\u2286[n+l]\n\nM ([g; 0]) = 0.\n\n3.1 Group Sparsity\nFor a general formulation of group sparsity, let G \u2286 2[n] be a set of variable groups (subsets) that\npossibly overlap [3, 6, 7]. Here we use i \u2208 [n] to index variables and G \u2208 G to index groups.\nConsider the cost function over variable groups Fg : 2[n] \u2192 R+ de\ufb01ned by:\n\nFg(A) =\n\ncG I(A \u2229 G (cid:54)= \u2205),\n\n(12)\n\n(cid:88)\n\nG\u2208G\n\nwhere cG is a nonnegative cost and I is an indicator such that I(\u00b7) = 1 if its argument is true, and\n0 otherwise. The value Fg(A) provides a weighted count of how many groups overlap with A.\nUnfortunately, Fg is not linear, so we need to re-express it to recover an ef\ufb01cient polar operator. To\ndo so, augment the domain by adding l = |G| variables such that each new variable G corresponds\nto a group G. 
Then de\ufb01ne a weight vector b \u2208 Rn+l\n+ such that bi = 0 for i \u2264 n and bG = cG for\nn < G \u2264 n + l. Finally, consider the linear cost function Mg : 2[n+l] \u2192 R+ de\ufb01ned by:\nMg(B) = (cid:104)b, 1B(cid:105) if i \u2208 B \u21d2 G \u2208 B, \u2200 i \u2208 G \u2208 G; Mg(B) = \u221e otherwise.\n\n(13)\nThe constraint ensures that if a variable i \u2264 n appears in the set B, then every variable G corre-\nsponding to a group G that contains i must also appear in B. By construction, Mg is a nonnegative\nlinear function. It is also easy to verify that Fg satis\ufb01es (6) with respect to Mg.\nTo compute the corresponding polar, observe that the effective domain of Mg is a lattice, hence (4)\ncan be solved by combinatorial methods. However, we can do better by exploiting problem structure\nin the LP. For example, observe that the polytope (7) can now be compactly represented as:\n\nPg = {w \u2208 Rn+l : 0 \u2264 w \u2264 1, wi \u2264 wG,\u2200 i \u2208 G \u2208 G}.\n\n(14)\nIndeed, it is easy to verify that the integral vectors in Pg are precisely {1B : B \u2208 dom Mg}.\nMoreover, the linear constraint in (14) is totally unimodular (TUM) since it is the incidence matrix\nof a bipartite graph (variables and groups), hence Pg is the convex hull of its integral vectors [23].\nUsing the fact that the scalar \u03c3 in (10) admits a closed form solution \u03c3 = (cid:104)1, \u02dcw(cid:105) in this case, the LP\n(10) can be reduced to:\n\nmax\n\n\u02dcw\n\n\u02dcgi min\n\nG:i\u2208G\u2208G \u02dcwG, subject to \u02dcw \u2265 0,\n\nbG \u02dcwG = 1.\n\n(15)\n\n(cid:88)\n\nG\u2208G\n\n(cid:88)\n\ni\u2208[n]\n\nNote only { \u02dcwG} appear in the problem as implicitly \u02dcwi = minG:i\u2208G \u02dcwG, \u2200 i \u2208 [n]. This is now\njust a piecewise linear objective over a (reweighted) simplex. Since projecting to a simplex can\nbe performed in linear time, the smoothing method of [17] can be used to obtain a very ef\ufb01cient\nimplementation. 
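As a sanity check (ours), the group-lasso polar can be brute-forced directly from (4) with the cost F_g of (12), and the generic log-sum-exp "softmin" underlying smoothing schemes such as [17] can be verified against its uniform error bound. We use the textbook softmin here, not the exact h_eps of the proposition below:

```python
import math
from itertools import combinations

def F_g(A, groups, c):
    """Group cost (12): weighted count of groups that A intersects."""
    return sum(cG for G, cG in zip(groups, c) if A & G)

def polar_group(g, groups, c, p=1):
    """Brute-force polar (4) for the group cost F_g."""
    n, best = len(g), 0.0
    for k in range(1, n + 1):
        for A in combinations(range(n), k):
            A = frozenset(A)
            norm_A = sum(abs(g[i]) ** p for i in A) ** (1.0 / p)
            best = max(best, norm_A / F_g(A, groups, c) ** (1.0 / p))
    return best

# Two overlapping groups on 3 variables, unit costs.
groups = [frozenset({0, 1}), frozenset({1, 2})]
c = [1.0, 1.0]
print(polar_group([1.0, 3.0, 2.0], groups, c))  # 3.0, attained at A = [n]

# Log-sum-exp softmin: min_G x_G <= softmin <= min_G x_G + mu*log(r).
def softmin(xs, mu):
    return -mu * math.log(sum(math.exp(-x / mu) for x in xs) / len(xs))

xs, mu = [0.3, 1.0, 2.5], 0.1
assert min(xs) <= softmin(xs, mu) <= min(xs) + mu * math.log(len(xs))
```

Shrinking mu trades a smaller approximation error for a larger Lipschitz constant of the smoothed gradient, which is the accuracy/cost trade-off discussed below.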
We illustrate a particular case where each variable i \u2208 [n] belongs to at most r > 1\ngroups. (Appendix D considers when the groups form a directed acyclic graph.)\n\nG:i\u2208G r\u2212n\u02dcgi \u02dcwG/\u0001 satis\ufb01es: (i) the gradient of h\u0001 is(cid:0) n\n\nProposition 1 Let h( \u02dcw) denote the negated objective of (15). Then for any \u0001 > 0, h\u0001( \u02dcw) :=\n(ii) h( \u02dcw) \u2212 h\u0001( \u02dcw) \u2208 (\u2212\u0001, 0] for all \u02dcw, and (iii) the gradient of h\u0001 can be computed in O(nr) time.\n\n\u0001 (cid:107)\u02dcg(cid:107)2\u221e log r(cid:1)-Lipschitz,\n\n(cid:80)\ni\u2208[n] log(cid:80)\n\nn log r\n\n\u0001\n\n4\n\n\f(The proof is given in Appendix C.) With this construction, APG can be run on h\u0001 to achieve a 2\u0001\naccurate solution to (15) within O( 1\nn log r).\n\u0001\nNote that this is signi\ufb01cantly cheaper than the O(n2(l + n)r) worst case complexity of [11, Al-\ngorithm 2]. More importantly, we gain explicit control of the trade-off between accuracy \u0001 and\ncomputational cost. A detailed comparison to related approaches is given in Appendix B.1 and E.\n\nn log r) steps [17], using a total time cost of O( nr\n\u0001\n\n\u221a\n\n\u221a\n\n3.2 Path Coding\nAnother interesting regularizer, recently investigated by [12], is determined by path costs in a di-\nrected acyclic graph (DAG) de\ufb01ned over the set of variables i \u2208 [n]. For convenience, we add two\nnodes, a source s and a sink t, with dummy edges (s, i) and (i, t) for all i \u2208 [n]. An (s, t)-path (or\nsimply path) is then given by a sequence (s, i1), (i1, i2), . . . , (ik\u22121, ik), (ik, t) with k \u2265 1. A non-\nnegative cost is associated with each edge including (s, i) and (i, t), so the cost of a path is the sum\nof its edge costs. 
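Since the cost of a path is just the sum of its edge costs, the resulting polar (4) can be pinned down on a toy DAG by enumerating all (s,t)-paths (graph, costs, and code are ours):

```python
def st_paths(edges, node, path):
    """Enumerate all (s,t)-paths in a DAG given as an adjacency dict."""
    if node == "t":
        yield path
        return
    for nxt in edges.get(node, []):
        yield from st_paths(edges, nxt, path + [nxt])

def polar_path(g, edges, cost, p=1):
    """max over (s,t)-paths of ||g_A||_p / (path cost)^(1/p), cf. (4)."""
    best = 0.0
    for path in st_paths(edges, "s", ["s"]):
        A = [v for v in path if v not in ("s", "t")]   # interior nodes
        c = sum(cost[(u, v)] for u, v in zip(path, path[1:]))
        best = max(best, sum(abs(g[i]) ** p for i in A) ** (1.0 / p)
                         / c ** (1.0 / p))
    return best

# DAG on variables {0, 1} with dummy source s and sink t, unit edge costs.
edges = {"s": [0, 1], 0: [1, "t"], 1: ["t"]}
cost = {("s", 0): 1, ("s", 1): 1, (0, 1): 1, (0, "t"): 1, (1, "t"): 1}
# Paths: s-0-t (cost 2), s-1-t (cost 2), s-0-1-t (cost 3).
print(polar_path([3.0, 1.0], edges, cost))  # max(3/2, 1/2, 4/3) = 1.5
```

The number of (s,t)-paths is exponential in general; the flow-polytope formulation developed next is what makes this computation tractable.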
A regularizer can then be de\ufb01ned by (2) applied to the cost function Fp : 2[n]\u2192R+\n(16)\n\n(cid:26)cost of the path if the nodes in A form an (s, t)-path (unique for DAG)\n\nFp(A) =\n\n.\n\nif such a path does not exist\n\n\u221e\n\nNote Fp is not submodular. Although Fp is not linear, a similar \u201clifting\u201d construction can be used to\nshow that it is marginalized linear, hence it supports ef\ufb01cient computation of the polar. To explain\nthe construction, let V := [n] \u222a {s, t} be the node set including s and t, E be the edge set including\n(s, i) and (i, t), T = V \u222a E, and let b \u2208 R|T|\n+ be the concatenation of zeros for node costs and the\ngiven edge costs. Let m := |E| be the number of edges. It is then easy to verify that Fp satis\ufb01es (6)\nwith respect to the linear cost function Mp : 2T \u2192 R+ de\ufb01ned by:\n(17)\nTo ef\ufb01ciently compute the resulting polar, we consider the form (8) using \u02dcgi = |gi|p \u2200i as before:\nwki, \u2200i \u2208 [n]. (18)\n\u2126\u25e6\nMp(g) =\nHere the constraints form the well-known \ufb02ow polytope whose vertices are exactly all the paths in a\nDAG. Similar to (15), the normalized LP (10) can be simpli\ufb01ed by solving for the scalar \u03c3 to obtain:\n\u02dcwki, \u2200i \u2208 [n]. (19)\n\nMp(B) = (cid:104)b, 1B(cid:105) if B represents a path; \u221e otherwise.\n\n(cid:32) (cid:88)\n\n, s.t. (cid:104)b, \u02dcw(cid:105) = 1,\n\n(cid:104)\u02dcg, w(cid:105)\n(cid:104)b, w(cid:105) ,\n\n0(cid:54)=w\u2208[0,1]|T |\n\n(cid:88)\n\n(cid:88)\n\n(cid:88)\n\n(cid:88)\n\n(cid:88)\n\n(cid:88)\n\ns.t. wi =\n\nk:(k,i)\u2208E\n\nj:(i,j)\u2208E\n\n(cid:33)\n\nwij =\n\n\u02dcwij +\n\n\u02dcwki\n\nmax\n\nmax\n\u02dcw\u22650\n\n\u02dcgi\n\ni\u2208[n]\n\nj:(i,j)\u2208E\n\nk:(k,i)\u2208E\n\n\u02dcwij =\n\nj:(i,j)\u2208E\n\nk:(k,i)\u2208E\n\nDue to the extra constraints, the LP (19) is more complicated than (15) obtained for group spar-\nsity. 
Nevertheless, after some reformulation (essentially dualization), (19) can still be converted to\na simple piecewise linear objective, hence it is amenable to smoothing; see Appendix F for details.\nTo \ufb01nd a 2\u0001 accurate solution, the cutting plane method takes O( mn\n\u00012 ) computations to optimize the\nnonsmooth piecewise linear objective, while APG needs O( 1\nn) steps to optimize the smoothed\n\u0001\nobjective, using a total time cost of O( m\nn). This too is faster than the O(nm) worst case com-\n\u0001\nplexity of [12, Appendix D.5] in the regime where n is large and the desired accuracy \u0001 is moderate.\n\n\u221a\n\n\u221a\n\n4 Generalizing Beyond Atomic Norms\n\n(cid:88)\n\nAlthough we \ufb01nd the above approach to be effective, many useful regularizers are not expressed in\nform of an atomic norm (2), which makes evaluation of the polar a challenge and thus creates dif\ufb01-\nculty in applying Algorithm 1. For example, another important class of structured sparse regularizers\nis given by an alternative, composite gauge construction:\n\ni\n\n\u2126s(w) =\n\n\u03bai(w), where \u03bai is a closed gauge that can be different for different i.\n\n(20)\ni wi = g}, where each\nThe polar for such a regularizer is given by \u2126\u25e6\nwi is an independent vector and \u03ba\u25e6\ni corresponds to the polar of \u03bai (proof given in Appendix H).\nUnfortunately, a polar in this form does not appear to be easy to compute. 
However, for some\nregularizers in the form (20) the following proximal objective can indeed be computed ef\ufb01ciently:\n\ns(g) = inf{maxi \u03ba\u25e6\n\ni (wi) :(cid:80)\n\n1\n\n2(cid:107)g \u2212 \u03b8(cid:107)2\n\nProx\u2126(g) = min\u03b8\n2 + \u2126(\u03b8).\nThe key observation is that computing \u2126\u25e6 can be ef\ufb01ciently reduced to just computing Prox\u2126.\nProposition 2 For any closed gauge \u2126, its polar \u2126\u25e6 can be equivalently expressed by:\n\nArgProx\u2126(g) = arg min\u03b8\n\n2 + \u2126(\u03b8),\n\n1\n\n2(cid:107)g \u2212 \u03b8(cid:107)2\n\n\u2126\u25e6(g) = inf{ \u03b6 \u2265 0 : Prox\u03b6\u2126(g) = 1\n\n2 }.\n2(cid:107)g(cid:107)2\n\n(21)\n\n(22)\n\n5\n\n\f(The proof is included in Appendix I.) Since the left hand side of the inner constraint is decreasing in\n\u03b6, one can ef\ufb01ciently compute the polar \u2126\u25e6 by a simple root \ufb01nding search in \u03b6. Thus, regularizers in\nthe form of (20) can still be accommodated in an ef\ufb01cient GCG method in the form of Algorithm 1.\n\n4.1 Latent Fused Lasso\n\n(cid:17)\n\n,\n\ni\n\n(cid:16)\n\n(cid:88)\n\nW,U\u2208U f (W U, X) + \u2126p(W ), where \u2126p(W ) =\nmin\n\nTo demonstrate the usefulness of this reduction we consider the recently proposed latent fused lasso\nmodel [18], where for given data X \u2208 Rm\u00d7n one seeks a dictionary matrix W \u2208 Rm\u00d7t and\ncoef\ufb01cient matrix U \u2208 Rt\u00d7n that allow X to be accurately reconstructed from a dictionary that has\ndesired structure. In particular, for a reconstruction loss f, the problem is speci\ufb01ed by:\n\u03bb1 (cid:107)W:i(cid:107)p + \u03bb2 (cid:107)W:i(cid:107)TV\n\nsuch that (cid:107) \u00b7 (cid:107)TV is given by (cid:107)w(cid:107)TV = (cid:80)m\u22121\n\n(23)\nj=1 |wj+1 \u2212 wj| and (cid:107) \u00b7 (cid:107)p is the usual (cid:96)p-norm. The\nfused lasso [24] corresponds to p = 1. Note that U is constrained to be in a compact set U to avoid\ndegeneracy. 
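The reduction (22) can be sanity-checked on a case where every quantity is known in closed form: for Omega = ||.||_1 the prox is soft-thresholding and the polar is the l-infinity norm, so a bisection on zeta should recover max_i |g_i|. A pure-Python sketch (ours):

```python
def prox_val_l1(g, zeta):
    """Optimal value of min_theta 0.5*||g - theta||^2 + zeta*||theta||_1,
    via the soft-threshold solution theta_i = sign(g_i)*max(|g_i|-zeta, 0)."""
    val = 0.0
    for gi in g:
        ti = (1 if gi >= 0 else -1) * max(abs(gi) - zeta, 0.0)
        val += 0.5 * (gi - ti) ** 2 + zeta * abs(ti)
    return val

def polar_via_prox(g, tol=1e-9):
    """Proposition 2: Omega^o(g) is the smallest zeta >= 0 for which
    Prox_{zeta*Omega}(g) attains 0.5*||g||^2, i.e. theta = 0 is optimal."""
    half_sq = 0.5 * sum(gi ** 2 for gi in g)
    lo, hi = 0.0, max(abs(gi) for gi in g) + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prox_val_l1(g, mid) >= half_sq - tol:
            hi = mid      # theta = 0 already optimal: zeta large enough
        else:
            lo = mid      # value still below 0.5*||g||^2: increase zeta
    return hi

print(polar_via_prox([0.5, -2.0, 1.0]))  # close to 2.0 = ||g||_inf
```

The bisection is valid because the prox value is monotone in zeta, which is exactly the monotonicity invoked after Proposition 2; swapping in any other exact prox routine turns it into a polar oracle for that regularizer.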
To ease notation, we assume w.l.o.g. λ1 = λ2 = 1.\nThe main motivation for this regularizer arises from biostatistics, where one wishes to identify DNA copy number variations simultaneously for a group of related samples [18]. In this case the total variation norm ‖·‖TV encourages the dictionary to vary smoothly from entry to entry, while the ℓp norm shrinks the dictionary so that few latent features are selected. Conveniently, Ωp decomposes along the columns of W, so one can apply the reduction in Proposition 2 to compute its polar, provided Prox_Ωp can be computed efficiently. Solving Prox_Ωp appears non-trivial due to the composition of two overlapping norms; however, [25] showed that for p = 1 it can be computed efficiently by applying the Prox of each of the two norms successively. Here we extend this result by proving in Appendix J that the same fact holds for any ℓp norm.\n\nProposition 3 For any 1 ≤ p ≤ ∞, ArgProx_{‖·‖TV + ‖·‖p}(w) = ArgProx_{‖·‖p}(ArgProx_{‖·‖TV}(w)).\n\nSince Prox_{‖·‖p} is easy to compute, the only remaining problem is to develop an efficient algorithm for computing Prox_{‖·‖TV}. Although [26] has recently proposed an approximate iterative method, we provide an algorithm in Appendix K that efficiently computes the exact solution. Therefore, by combining this result with Propositions 2 and 3, we are able to efficiently compute the polar Ω°p and hence apply Algorithm 1 to solve (23) with respect to W.\n\n5 Experiments\n\nTo investigate the effectiveness of these computational schemes we considered three applications: group lasso, path coding, and latent fused lasso. 
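Before turning to the experiments, Proposition 3 can be checked numerically for the fused case p = 1 on two variables, where Prox_TV has a simple closed form (keep the mean, soft-threshold the difference) and Prox_l1 is the usual soft threshold; a crude grid search serves as an independent check (example and code ours):

```python
def prox_tv2(w, lam):
    """Exact Prox of lam*||.||_TV for m = 2: the mean of (w1, w2) is
    preserved and the difference is soft-thresholded by 2*lam."""
    mean, d = 0.5 * (w[0] + w[1]), w[1] - w[0]
    t = (1 if d >= 0 else -1) * max(abs(d) - 2 * lam, 0.0)
    return [mean - t / 2, mean + t / 2]

def soft(w, lam):
    return [(1 if x >= 0 else -1) * max(abs(x) - lam, 0.0) for x in w]

def obj(theta, w, lam1, lam2):
    """0.5*||w - theta||^2 + lam1*||theta||_1 + lam2*||theta||_TV."""
    return (0.5 * sum((a - b) ** 2 for a, b in zip(w, theta))
            + lam1 * sum(abs(x) for x in theta)
            + lam2 * abs(theta[1] - theta[0]))

w, lam1, lam2 = [3.0, -0.5], 1.0, 1.0
composed = soft(prox_tv2(w, lam2), lam1)   # Proposition 3 with p = 1

# Crude grid search over theta in [-2, 4]^2 as an independent check.
grid = [x / 20.0 for x in range(-40, 81)]
best = min(((obj([t1, t2], w, lam1, lam2), t1, t2)
            for t1 in grid for t2 in grid))
print(composed, best[1:])  # both close to (1.0, 0.0)
```

The same composition pattern, with Prox_{||.||_p} in place of the soft threshold, is what the proposition asserts for general p.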
All algorithms were implemented in Matlab unless\notherwise noted.\n\n5.1 Group Lasso: CUR-like Matrix Factorization\n\n1\n\nminW\n\ni (cid:107)Wi:(cid:107)\u221e +(cid:80)\n\n2 (cid:107)X\u2212XW X(cid:107)2 + \u03bb(cid:0)(cid:80)\n\nOur \ufb01rst experiment considered an example of group lasso that is inspired by CUR matrix factor-\nization [27]. Given a data matrix X \u2208 Rn\u00d7d, the goal is to compute an approximate factorization\nX \u2248 CU R, such that C contains a subset of c columns from X and R contains a subset of r rows\nfrom X. Mairal et al. [11, \u00a75.3] proposed a convex relaxation of this problem:\nj (cid:107)W:j(cid:107)\u221e\n\n(24)\nConveniently, the regularizer \ufb01ts the development of Section 3.1, with p = 1 and the groups de\ufb01ned\nto be the rows and columns of W . To evaluate different methods, we used four gene-expression data\nsets [28]: SRBCT, Brain Tumor 2, 9 Tumor, and Leukemia2, of sizes 83 \u00d7 2308, 50 \u00d7 10367,\n60\u00d7 5762, and 72\u00d7 11225, respectively. The data matrices were \ufb01rst centered columnwise and then\nrescaled to have unit Frobenius norm.\nAlgorithms. We compared three algorithms: GCG (Algorithm 1) with our polar operator which we\ncall GCG TUM, GCG with the polar operator of [11, Algorithm 2] (GCG Secant), and APG (see\nSection 2.1). The PU in APG uses the routine mexProximalGraph from the SPAMS package [29].\nThe polar operator of GCG Secant was implemented with a mex wrapper of a max-\ufb02ow package\n[30], while GCG TUM used L-BFGS to \ufb01nd an optimal solution {w\u2217\nG} for the smoothed version of\n\n(cid:1).\n\n6\n\n\f(a) SRBCT\n\n(b) Brain Tumor 2\n\n(a) Obj vs CPU time (\u03bb = 10\u22122)\n\n(c) 9 Tumor\n\n(d) Leukemia2\n\nFigure 1: Convex CUR matrix factorization results.\n\n(b) Obj vs CPU time (\u03bb = 10\u22123)\nFigure 2: Path coding results.\n(15) given in Proposition 1, with smoothing parameter \u0001 set to 10\u22123. 
To recover an integral solution\nit suf\ufb01ces to \ufb01nd an optimal solution to (15) that has the form wG = c for some groups and wG = 0\nG} and set the wG of the smallest k\nfor the remainder (such a solution must exist). So we sorted {w\u2217\ngroups to 0, and wG for the remaining groups set to a common value that satis\ufb01es the constraint. The\nbest k can be recovered from {0, 1, . . . ,|G| \u2212 1} in O(nr) time. See more details in Appendix G.\nBoth GCG methods relinquish local optimization (step 5) in Algorithm 1, but use a totally corrective\nvariant of step 4, which allows ef\ufb01cient optimization by L-BFGS-B via pre-computing XP\u25e6\n(gk)X.\nResults. For simplicity, we tested three values for \u03bb: 10\u22123, 10\u22124, and 10\u22125, which led to increas-\ningly dense solutions. Due to space limitations we only show in Figure 1 the results for \u03bb = 10\u22124\nwhich gives moderately sparse solutions. On these data sets, GCG TUM proves to be an order of\nmagnitude faster than GCG Secant in computing the polar. As [11] observes, network \ufb02ow based\nalgorithms often \ufb01nd solutions in practice far more quickly than their theoretical bounds. Thanks\nto the ef\ufb01ciency of totally corrective update, almost all computations taken by GCG Secant were\ndevoted to the polar operator. Therefore the acceleration proffered by GCG TUM in computing the\npolar leads to a reduction of overall optimization time by at least 50%. Finally, APG is always even\nslower than GCG Secant by an order of magnitude, with PU taking up the most computation.\n\nFg\n\n5.2 Path Coding\nFollowing [12, \u00a74.3], we consider a logistic regression problem where one is given training examples\nxi \u2208 Rn with corresponding labels yi \u2208 {\u22121, 1}. 
For this problem, we formulate (1) with a path coding regularizer Ω_{Fp} and the empirical risk\n\nf(w) = Σi (1/ni) log(1 + exp(−yi ⟨w, xi⟩)), (25)\n\nwhere ni is the number of examples that share the same label as yi. We used the breast cancer data set for this experiment, which consists of 8141 genes and 295 tumors [31]. The gene network is adopted from [32]. Similar to [12, §4.3], we removed all isolated genes (nodes) to which no edge is incident, randomly oriented the raw edges, and removed cycles to form a DAG using the function mexRemoveCyclesGraph in SPAMS. This resulted in 34864 edges and n = 7910 nodes.\nAlgorithms. We again considered three methods: APG, GCG with our polar operator (GCG TUM), and GCG with the polar operator from [12, Algorithm 1], which we label as GCG Secant. The PU in APG uses the routine mexProximalPathCoding from SPAMS, which solves a quadratic network flow problem. It turns out that the time cost of a single call to the PU was enough for GCG TUM and GCG Secant to converge to a final solution, so the APG result is not included in our plots. We implemented the polar operator for GCG Secant based on Matlab's built-in shortest path routine graphshortestpath (C++ wrapped by mex). For GCG TUM, we used a cutting plane method to solve a variant of the dual of (19) (see Appendix F), which is much simpler to implement than smoothing but exhibits similar efficiency in practice. 
An integral solution can also be naturally recovered in the course of computing the objective. Again, both GCG methods used only totally corrective updates.
Results. Figure 2 shows the results for path coding, with the regularization coefficient λ set to 10−2 and 10−3 so that the solutions are moderately sparse. Again, it is clear that GCG TUM is an order of magnitude faster than GCG Secant.

5.3 Latent Fused Lasso
Finally, we compared GCG and APG on the latent fused lasso problem (23). Two algorithms were tested as the PU in APG: our proposed method and the algorithm in [26], which we label APG-Liu. The synthetic data is generated following [18]. For each basis (column) of the dictionary, we use the model W̃ij = Σs=1..Sj cs I(is ≤ i ≤ is + ls), where Sj ∈ {3, 5, 8, 10} specifies the number of consecutive blocks in the j-th basis, and cs ∈ {±1, ±2, ±3, ±4, ±5}, is ∈ {1, . . . , m − 10}, and ls ∈ {5, 10, 15, 20} are the magnitude, starting position, and length of the s-th block, respectively. We choose cs, is, ls randomly (and independently for each block s) from their respective sets. The coefficient matrix Ũ is sampled entrywise from the Gaussian distribution N(0, 1) and normalized so that each row has unit ℓ2 norm. Finally, we generate the observation matrix X = W̃Ũ + ε, where ε is zero-mean, unit-variance Gaussian noise. We set the dimension m = 300, the number of samples n = 200, and the number of bases (latent dimension) t̃ = 10.
Since the noise is Gaussian, we choose the squared loss f(WU, X) = (1/2)‖X − WU‖²F, though the algorithm is applicable to any other smooth loss as well. To avoid degeneracy, we constrained each row of U to have unit ℓ2 norm.
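The synthetic-data recipe above can be sketched as follows. This is a sketch under our own assumptions: the function name, the fixed random seed, the 0-based block positions, and the samples-as-columns orientation of X are our choices, not taken from [18].

```python
import numpy as np

def make_data(m=300, n=200, t=10, seed=0):
    """Generate (W, U, X) following the blocky-dictionary recipe above.
    Sketch only: seed, 0-based indexing, and orientation are our choices."""
    rng = np.random.default_rng(seed)
    W = np.zeros((m, t))                       # dictionary, one basis per column
    for j in range(t):
        for _ in range(rng.choice([3, 5, 8, 10])):             # S_j blocks
            c = rng.choice([1, 2, 3, 4, 5]) * rng.choice([-1, 1])  # magnitude in {+-1,...,+-5}
            i0 = rng.integers(0, m - 10)                       # starting position i_s
            l = rng.choice([5, 10, 15, 20])                    # length l_s
            W[i0:i0 + l + 1, j] += c           # block I(i_s <= i <= i_s + l_s); slice clips at m
    U = rng.standard_normal((t, n))            # coefficients with N(0, 1) entries
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit l2 norm per row
    X = W @ U + rng.standard_normal((m, n))    # observations with N(0, 1) noise
    return W, U, X
```

Each column of W is piecewise constant, which is why a fused-lasso (total-variation) penalty on W matches the generative structure of this data.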
Finally, to pick an appropriate dictionary size, we tried t ∈ {5, 10, 20}, which corresponds to under-, perfect-, and over-estimation, respectively. The regularization constants λ1, λ2 in Ωp were chosen from {0.01, 0.1, 1, 10, 100}.
Note that problem (23) is not jointly convex in W and U, so we followed the same strategy as [18]; that is, we alternately optimized W and U while keeping the other fixed. For each subproblem, we ran both APG and GCG to compare their performance. Due to space limitations, we only report the running time for the setting λ1 = λ2 = 0.1, t = 20, and p ∈ {1, 2}. In these experiments we observed that the polar typically requires only 5 to 6 calls to Prox. As can be seen from Figure 3, GCG is significantly faster than APG and APG-Liu in reducing the objective. This is due to the greedy nature of GCG, which yields very sparse iterates and, when interleaved with local search, achieves fast convergence.

Figure 3: Latent fused lasso.

6 Conclusion
We have identified and investigated a new class of structured sparse regularizers whose polar can be reformulated as a linear program with totally unimodular constraints. By leveraging smoothing techniques, we are able to compute the corresponding polars significantly more efficiently than previous approaches. When plugged into the GCG algorithm, this yields significant reductions in run time for both group lasso and path coding regularization. We have further developed a generic scheme for converting an efficient proximal solver into an efficient method for computing the polar operator. This reduction allowed us to develop a fast new method for latent fused lasso. For future work, we plan to study more general subset cost functions and investigate new structured regularizers amenable to our approach.
It will also be interesting to extend GCG to handle nonsmooth losses.

References
[1] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.
[2] Y. Eldar and G. Kutyniok, editors. Compressed Sensing: Theory and Applications. Cambridge, 2012.
[3] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. JMLR, 12:3371–3412, 2011.
[4] S. Kim and E. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, 2010.
[5] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. JMLR, 12:2297–2334, 2011.
[6] G. Obozinski and F. Bach. Convex relaxation for combinatorial penalties. Technical Report HAL 00694765, 2012.
[7] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.
[8] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
[9] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[10] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140:125–161, 2013.
[11] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. JMLR, 12:2681–2720, 2011.
[12] J. Mairal and B. Yu. Supervised feature selection in graphs with path coding penalties and network flows. JMLR, 14:2449–2485, 2013.
[13] M. Dudik, Z. Harchaoui, and J. Malick.
Lifted coordinate descent for learning with trace-norm regularizations. In AISTATS, 2012.
[14] X. Zhang, Y. Yu, and D. Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS, 2012.
[15] S. Laue. A hybrid algorithm for convex semidefinite optimization. In ICML, 2012.
[16] B. Mishra, G. Meyer, F. Bach, and R. Sepulchre. Low-rank optimization with trace norm penalty. Technical report, 2011. http://arxiv.org/abs/1112.2318.
[17] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[18] G. Nowak, T. Hastie, J. R. Pollack, and R. Tibshirani. A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics, 12(4):776–791, 2011.
[19] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[20] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[21] F. Bach. Convex analysis and optimization with submodular functions: a tutorial. Technical Report HAL 00527714, 2010.
[22] W. Dinkelbach. On nonlinear fractional programming. Management Science, 13(7), 1967.
[23] A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1st edition, 1986.
[24] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67:91–108, 2005.
[25] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
[26] J. Liu, L. Yuan, and J. Ye. An efficient algorithm for a class of fused lasso problems. In Conference on Knowledge Discovery and Data Mining, 2010.
[27] M. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis.
Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
[28] URL http://www.gems-system.or.
[29] URL http://spams-devel.gforge.inria.fr.
[30] URL http://drwn.anu.edu.au/index.html.
[31] M. Van De Vijver et al. A gene-expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, 347(25):1999–2009, 2002.
[32] H. Chuang, E. Lee, Y. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer metastasis. Molecular Systems Biology, 3(140), 2007.