{"title": "On Decomposing the Proximal Map", "book": "Advances in Neural Information Processing Systems", "page_first": 91, "page_last": 99, "abstract": "The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory.", "full_text": "On Decomposing the Proximal Map\n\nDepartment of Computing Science, University of Alberta, Edmonton AB T6G 2E8, Canada\n\nyaoliang@cs.ualberta.ca\n\nYaoliang Yu\n\nAbstract\n\nThe proximal map is the key step in gradient-type algorithms, which have be-\ncome prevalent in large-scale high-dimensional problems. For simple functions\nthis proximal map is available in closed-form while for more complicated func-\ntions it can become highly nontrivial. Motivated by the need of combining regu-\nlarizers to simultaneously induce different types of structures, this paper initiates\na systematic investigation of when the proximal map of a sum of functions de-\ncomposes into the composition of the proximal maps of the individual summands.\nWe not only unify a few known results scattered in the literature but also discover\nseveral new decompositions obtained almost effortlessly from our theory.\n\nIntroduction\n\n1\nRegularization has become an indispensable part of modern machine learning algorithms. 
For example, the \u2113_2-regularizer for kernel methods [1] and the \u2113_1-regularizer for sparse methods [2] have led to immense successes in various fields. As real data become more and more complex, different types of regularizers, usually nonsmooth functions, have been designed. In many applications, it is thus desirable to combine regularizers, usually taking their sum, to promote different structures simultaneously.\nSince many interesting regularizers are nonsmooth, they are harder to optimize numerically, especially in large-scale high-dimensional settings. Thanks to recent advances [3\u20135], gradient-type algorithms have been generalized to take nonsmooth regularizers explicitly into account. And due to their cheap per-iteration cost (usually linear-time), these algorithms have become prevalent in many fields recently. The key step of such gradient-type algorithms is to compute the proximal map (of the nonsmooth regularizer), which is available in closed-form for some specific regularizers. However, the proximal map becomes highly nontrivial when we start to combine regularizers.\nThe main goal of this paper is to systematically investigate when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual functions, which we simply term prox-decomposition. Our motivation comes from a few known decomposition results scattered in the literature [6\u20138], all in the form of our interest. The study of such prox-decompositions is not only of mathematical interest, but also the backbone of popular gradient-type algorithms [3\u20135]. More importantly, a precise understanding of this decomposition will shed light on how we should combine regularizers, taking computational efforts explicitly into account.\nAfter setting the context in Section 2, we motivate the decomposition rule with some justifications, as well as some cautionary results. 
Based on a sufficient condition presented in Section 3.1, we study how \u201cinvariance\u201d of the subdifferential of one function leads to nontrivial prox-decompositions. Specifically, we prove in Section 3.3 that when the subdifferential of one function is scaling invariant, the prox-decomposition always holds if and only if the other function is radial, which is, quite unexpectedly, exactly the condition proven recently for the validity of the representer theorem in the context of kernel methods [9, 10]. The generalization to cone invariance is considered in Section 3.4, and enables us to recover most known prox-decompositions, as well as some new ones falling out quite naturally.\n\nOur notation is mostly standard. We use \u03b9_C(x) for the indicator function that takes 0 if x \u2208 C and \u221e otherwise, and 1_C(x) for the indicator that takes 1 if x \u2208 C and 0 otherwise. The symbol Id stands for the identity map, and the extended real line R \u222a {\u221e} is denoted by R\u0304. Throughout the paper we denote by \u2202f(x) the subdifferential of the function f at the point x.\n\n2 Preliminary\nLet our domain be some (real) Hilbert space (H, \u27e8\u00b7,\u00b7\u27e9), with the induced Hilbertian norm \u2016\u00b7\u2016. If needed, we will assume some fixed orthonormal basis {e_i}_{i\u2208I} is chosen for H, so that for x \u2208 H we are able to refer to its \u201ccoordinates\u201d x_i = \u27e8x, e_i\u27e9.\nFor any closed convex proper function f : H \u2192 R\u0304, we define its Moreau envelope as [11]\n\n\u2200y \u2208 H, M_f(y) = min_{x\u2208H} (1/2)\u2016x \u2212 y\u2016^2 + f(x),    (1)\n\nand the related proximal map\n\nP_f(y) = argmin_{x\u2208H} (1/2)\u2016x \u2212 y\u2016^2 + f(x).    (2)\n\nDue to the strong convexity of \u2016\u00b7\u2016^2 and the closedness and convexity of f, P_f(y) always exists and is unique. Note that M_f : H \u2192 R while P_f : H \u2192 H. 
When f = \u03b9_C is the indicator of some closed convex set C, the proximal map reduces to the usual projection. Perhaps the most interesting property of M_f, known as Moreau\u2019s identity, is the following decomposition [11]\n\nM_f(y) + M_{f\u2217}(y) = (1/2)\u2016y\u2016^2,    (3)\n\nwhere f\u2217(z) = sup_x \u27e8x, z\u27e9 \u2212 f(x) is the Fenchel conjugate of f. It can be shown that M_f is Fr\u00e9chet differentiable, hence taking the derivative w.r.t. y on both sides of (3) yields\n\nP_f(y) + P_{f\u2217}(y) = y.    (4)\n\n3 Main Results\nOur main goal is to investigate and understand the equality (we always assume f + g \u2262 \u221e)\n\nP_{f+g} =? P_f \u25e6 P_g =? P_g \u25e6 P_f,    (5)\n\nwhere f, g \u2208 \u0393_0, the set of all closed convex proper functions on H, and f \u25e6 g denotes the mapping composition. We present first some cautionary results.\nNote that P_f = (Id + \u2202f)^{\u22121}, hence under minor technical assumptions P_{f+g} = (P_{2f}^{\u22121} + P_{2g}^{\u22121})^{\u22121} \u25e6 2Id. However, computationally this formula is of little use. On the other hand, it is possible to develop forward-backward splitting procedures^1 to numerically compute P_{f+g}, using only P_f and P_g as subroutines [12]. Our focus is on the exact closed-form formula (5). Interestingly, under some \u201cshrinkage\u201d assumption, the prox-decomposition (5), even if it does not necessarily hold, can still be used in subgradient algorithms [13].\nOur first result is encouraging:\nProposition 1. If H = R, then for any f, g \u2208 \u0393_0, there exists h \u2208 \u0393_0 such that P_h = P_f \u25e6 P_g.\nProof: In fact, Moreau [11, Corollary 10.c] proved that P : H \u2192 H is a proximal map iff it is nonexpansive and it is the subdifferential of some convex function in \u0393_0. 
Although the latter condition in general is not easy to verify, it reduces to being monotonically increasing when H = R (note that P must be continuous). Since both P_f and P_g are increasing and nonexpansive, it follows easily that so is P_f \u25e6 P_g, hence the existence of h \u2208 \u0393_0 so that P_h = P_f \u25e6 P_g.\nIn a general Hilbert space H, we again easily conclude that the composition P_f \u25e6 P_g is always a nonexpansion, which means that it is \u201cclose\u201d to being a proximal map. This justifies the composition P_f \u25e6 P_g as a candidate for the decomposition of P_{f+g}. However, we note that the conclusion of Proposition 1 can indeed fail already in R^2:\nExample 1. Let H = R^2. Let f = \u03b9_{{x_1=x_2}} and g = \u03b9_{{x_2=0}}. Clearly both f and g are in \u0393_0. The proximal maps in this case are simply projections: P_f(x) = ((x_1+x_2)/2, (x_1+x_2)/2) and P_g(x) = (x_1, 0). Therefore P_f(P_g(x)) = (x_1/2, x_1/2). We easily verify that the inequality\n\n\u2016P_f(P_g(x)) \u2212 P_f(P_g(y))\u2016^2 \u2264 \u27e8P_f(P_g(x)) \u2212 P_f(P_g(y)), x \u2212 y\u27e9\n\nis not always true, which would be a contradiction if P_f \u25e6 P_g were a proximal map [11, Eq. (5.3)].\nEven worse, when the conclusion of Proposition 1 does hold, in general we cannot expect the decomposition (5) to be true without additional assumptions.\nExample 2. Let H = R and q(x) = (1/2)x^2. It is easily seen that P_{\u03bbq}(x) = x/(1+\u03bb). Therefore P_q \u25e6 P_q = (1/4) Id \u2260 (1/3) Id = P_{q+q}. We will give an explanation for this failure of composition shortly.\nNevertheless, as we will see, the equality in (5) does hold in many scenarios, and an interesting theory can be suitably developed.\n\n^1 In some sense, this procedure is to compute P_{f+g} \u2248 lim_{t\u2192\u221e} (P_f \u25e6 P_g)^t, modulo some intermediate steps. Essentially, our goal is to establish the one-step convergence of that iterative procedure.\n\n3.1 A Sufficient Condition\nWe start with a sufficient condition that yields (5). 
This result, although easy to obtain, will play a key role in our subsequent development.\nUsing the first order optimality condition and the definition of the proximal map (2), we have\n\nP_{f+g}(y) \u2212 y + \u2202(f + g)(P_{f+g}(y)) \u220b 0    (6)\nP_g(y) \u2212 y + \u2202g(P_g(y)) \u220b 0    (7)\nP_f(P_g(y)) \u2212 P_g(y) + \u2202f(P_f(P_g(y))) \u220b 0.    (8)\n\nAdding the last two inclusions we obtain\n\nP_f(P_g(y)) \u2212 y + \u2202g(P_g(y)) + \u2202f(P_f(P_g(y))) \u220b 0.    (9)\n\nComparing (6) and (9) gives us\nTheorem 1. A sufficient condition for P_{f+g} = P_f \u25e6 P_g is\n\n\u2200x \u2208 H, \u2202g(P_f(x)) \u2287 \u2202g(x).    (10)\n\nProof: Let x = P_g(y). Applying (10) at this x gives \u2202g(P_g(y)) \u2286 \u2202g(P_f(P_g(y))). Then by (9) and the subdifferential rule \u2202(f + g) \u2287 \u2202f + \u2202g we verify that P_f(P_g(y)) satisfies (6), hence P_{f+g} = P_f \u25e6 P_g follows since the proximal map is single-valued.\nWe note that a special form of our sufficient condition has appeared in the proof of [8, Theorem 1], whose main result also follows immediately from our Theorem 4 below. Let us fix f, and define\n\nK_f = {g \u2208 \u0393_0 : f + g \u2262 \u221e, (f, g) satisfy (10)}.\n\nImmediately we have\nProposition 2. For any f \u2208 \u0393_0, K_f is a cone. Moreover, if g_1 \u2208 K_f, g_2 \u2208 K_f, f + g_1 + g_2 \u2262 \u221e and \u2202(g_1 + g_2) = \u2202g_1 + \u2202g_2, then g_1 + g_2 \u2208 K_f too.\nThe condition \u2202(g_1 + g_2) = \u2202g_1 + \u2202g_2 in Proposition 2 is purely technical; it is satisfied when, say, g_1 is continuous at a single, arbitrary point in dom g_1 \u2229 dom g_2. For comparison purposes, we note that it is not clear how P_{f+g+h} = P_f \u25e6 P_{g+h} would follow from P_{f+g} = P_f \u25e6 P_g and P_{f+h} = P_f \u25e6 P_h. This is the main motivation to consider the sufficient condition (10). In particular\nDefinition 1. We call f \u2208 \u0393_0 self-prox-decomposable (s.p.d.) if f \u2208 K_{\u03b1f} for all \u03b1 > 0.\nFor any s.p.d. 
f, since K_f is a cone, \u03b2f \u2208 K_{\u03b1f} for all \u03b1, \u03b2 \u2265 0. Consequently, P_{(\u03b1+\u03b2)f} = P_{\u03b2f} \u25e6 P_{\u03b1f} = P_{\u03b1f} \u25e6 P_{\u03b2f}.\nRemark 1. A weaker definition for s.p.d. is to require f \u2208 K_f, from which we conclude that \u03b2f \u2208 K_f for all \u03b2 \u2265 0, in particular P_{(m+n)f} = P_{nf} \u25e6 P_{mf} = P_{mf} \u25e6 P_{nf} for all natural numbers m and n. The two definitions coincide for positive homogeneous functions. We have not been able to construct a function that satisfies this weaker definition but not the stronger one in Definition 1.\nExample 3. We easily verify that all affine functions \u2113 = \u27e8\u00b7, a\u27e9 + b are s.p.d.; in fact, they are the only differentiable functions that are s.p.d., which explains why Example 2 must fail. Another trivial class of s.p.d. functions are indicators of closed convex sets, whose proximal maps are projectors. Also, univariate gauges^2 are s.p.d., due to Theorem 4 below. Some multivariate s.p.d. functions are given in Remark 5 below.\n\n^2 A gauge is a positively homogeneous convex function that vanishes at the origin.\n\nThe next example shows that (10) is not necessary.\nExample 4. Fix z \u2208 H, f = \u03b9_{{z}}, and g \u2208 \u0393_0 with full domain. Clearly for any x \u2208 H, P_{f+g}(x) = z = P_f[P_g(x)]. However, since x is arbitrary, \u2202g(P_f(x)) = \u2202g(z) \u2289 \u2202g(x) if g is not linear.\nOn the other hand, if f, g are differentiable, then we actually have equality in (10), which is clearly necessary in this case. Since convex functions are almost everywhere differentiable (in the interior of their domain), we expect the sufficient condition (10) to be necessary \u201calmost everywhere\u201d too.\nThus we see that the key for the decomposition (5) to hold is to let the proximal map of f and the subdifferential of g \u201cinteract well\u201d in the sense of (10). 
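As a quick numerical illustration of the discussion so far (a minimal sketch, not part of the original paper), the following Python snippet compares P_{f+g} with the composition of P_f and P_g on the real line in two cases already mentioned: f = g = q from Example 2, where the composition fails, and f = g = |.|, a univariate gauge as in Example 3, where it holds. The helper names prox_brute, prox_abs and prox_q, the grid bounds and the test points are illustrative assumptions; prox_brute simply minimizes the objective in definition (2) over a finite grid, so it is only accurate up to the grid spacing.

```python
def prox_brute(f, y, lo=-10.0, hi=10.0, n=20001):
    # Definition (2) on the real line: minimize 0.5*(x - y)**2 + f(x) over a grid.
    # Hypothetical helper: accuracy is limited by the grid spacing (1e-3 here).
    step = (hi - lo) / (n - 1)
    grid = (lo + i * step for i in range(n))
    return min(grid, key=lambda x: 0.5 * (x - y) ** 2 + f(x))


def prox_abs(y, t=1.0):
    # Soft-thresholding: closed-form proximal map of t*|.|.
    return max(abs(y) - t, 0.0) * (1.0 if y >= 0 else -1.0)


def prox_q(y, lam=1.0):
    # Closed-form proximal map of lam*q with q(x) = x**2 / 2, as in Example 2.
    return y / (1.0 + lam)


ys = [-3.0, -0.5, 0.0, 1.2, 4.0]

# Example 2: P_{q+q} = Id/3 whereas P_q o P_q = Id/4, so (5) fails.
sum_q = [prox_brute(lambda x: x * x, y) for y in ys]
comp_q = [prox_q(prox_q(y)) for y in ys]

# Univariate gauge |.| (Example 3): P_{|.|+|.|} = P_{|.|} o P_{|.|}.
sum_abs = [prox_brute(lambda x: 2.0 * abs(x), y) for y in ys]
comp_abs = [prox_abs(prox_abs(y)) for y in ys]

for y, a, b, c, d in zip(ys, sum_q, comp_q, sum_abs, comp_abs):
    print(f'y={y:+.1f}  q+q: {a:+.3f} vs {b:+.3f}   2|.|: {c:+.3f} vs {d:+.3f}')
```

On these test points the first pair of columns differs (y/3 versus y/4), while the second pair agrees up to the grid spacing.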
Interestingly, both the proximal map and the subdifferential are fully equivalent to the function itself.\nProposition 3 ([11, \u00a78]). Let f, g \u2208 \u0393_0. f = g + c for some c \u2208 R \u21d4 \u2202f \u2286 \u2202g \u21d4 P_f = P_g.\nProof: The first implication is clear. The second follows from the optimality condition P_f = (Id + \u2202f)^{\u22121}. Lastly, P_f = P_g implies that M_{f\u2217} = M_{g\u2217} \u2212 c for some c \u2208 R (by integration). Conjugating we get f = g + c for some c \u2208 R.\nTherefore some properties of the proximal map will transfer to some properties of the function f itself, and vice versa. The next result is easy to obtain, and appeared essentially in [14].\nProposition 4. Let f \u2208 \u0393_0 and x \u2208 H be arbitrary, then\ni). P_f is odd iff f is even;\nii). P_f(Ux) = U P_f(x) for all unitary U iff f(Ux) = f(x) for all unitary U;\niii). P_f(Qx) = Q P_f(x) for all permutations Q (under some fixed basis) iff f is permutation invariant, that is f(Qx) = f(x) for all permutations Q.\nIn the following, we will put some invariance assumptions on the subdifferential of g and accordingly find the right family of f whose proximal map \u201crespects\u201d that invariance. This way we will meet (10) by construction and therefore effortlessly have the decomposition (5).\n\n3.2 No Invariance\nTo begin with, consider first the trivial case where no invariance on the subdifferential of g is assumed. This is equivalent to requiring (10) to hold for all g \u2208 \u0393_0. Not surprisingly, we end up with a trivial choice of f.\nTheorem 2. Fix f \u2208 \u0393_0. 
P_{f+g} = P_f \u25e6 P_g for all g \u2208 \u0393_0 if and only if\n\n\u2022 dim(H) \u2265 2; f \u2261 c or f = \u03b9_{{w}} + c for some c \u2208 R and w \u2208 H;\n\u2022 dim(H) = 1 and f = \u03b9_C + c for some closed and convex set C and c \u2208 R.\n\nProof: \u21d0: Straightforward calculations, see [15] for details.\n\u21d2: We first prove that f is constant on its domain even when g is restricted to indicators. Indeed, let x \u2208 dom f and take g = \u03b9_{{x}}. Then x = P_{f+g}(x) = P_f[P_g(x)] = P_f(x), meaning that x \u2208 argmin f. Since x \u2208 dom f is arbitrary, f is constant on its domain. The case dim(H) = 1 is complete. We consider the other case where dim(H) \u2265 2 and dom f contains at least two points. If dom f \u2260 H, there exist z \u2209 dom f such that P_f(z) = y for some y \u2208 dom f, and a closed convex set C with C \u2229 dom f \u2260 \u2205 and y \u2209 C \u220b z. Letting g = \u03b9_C we obtain P_{f+g}(z) \u2208 C \u2229 dom f while P_f(P_g(z)) = P_f(z) = y \u2209 C, a contradiction.\nObserve that the decomposition (5) is not symmetric in f and g, as also reflected in the next result:\nTheorem 3. Fix g \u2208 \u0393_0. P_{f+g} = P_f \u25e6 P_g for all f \u2208 \u0393_0 iff g is a continuous affine function.\nProof: \u21d0: If g = \u27e8\u00b7, a\u27e9 + c, then P_g(x) = x \u2212 a. Easy calculation reveals that P_{f+g}(x) = P_f(x \u2212 a) = P_f[P_g(x)].\n\u21d2: The converse is true even when f is restricted to continuous linear functions. Indeed, let a \u2208 H be arbitrary and consider f = \u27e8\u00b7, a\u27e9. Then P_{f+g}(x) = P_g(x \u2212 a) = P_f(P_g(x)) = P_g(x) \u2212 a. Letting a = x yields P_g(x) = x + P_g(0) = P_{\u27e8\u00b7, \u2212P_g(0)\u27e9}(x). Therefore by Proposition 3 we know g is equal to a continuous affine function.\n\nNaturally, the next step is to put invariance assumptions on the subdifferential of g, effectively restricting the function class of g. 
As a trade-off, the function class of f that satisfies (10) becomes larger so that nontrivial results will arise.\n\n3.3 Scaling Invariance\nThe first invariance property we consider is scaling-invariance. What kind of convex functions have their subdifferential invariant to (positive) scaling? Assuming 0 \u2208 dom g and by simple integration\n\ng(tx) \u2212 g(0) = \u222b_0^t g\u2032(sx) ds = \u222b_0^t \u27e8\u2202g(sx), x\u27e9 ds = t \u00b7 [g(x) \u2212 g(0)],\n\nwhere the last equality follows from the scaling invariance of the subdifferential of g. Therefore, up to some additive constant, g is positive homogeneous (p.h.). On the other hand, if g \u2208 \u0393_0 is p.h. (automatically 0 \u2208 dom g), then from the definition we verify that \u2202g is scaling-invariant. Therefore, under the scaling-invariance assumption, the right function class for g is the set of all p.h. functions in \u0393_0, up to some additive constant. Consequently, the right function class for f is to have the proximal map P_f(x) = \u03bb \u00b7 x for some \u03bb \u2208 [0, 1] that may depend on x as well^3. The next theorem completely characterizes such functions.\nTheorem 4. Let f \u2208 \u0393_0. Consider the statements\ni). f = h(\u2016\u00b7\u2016) for some increasing function h : R_+ \u2192 R\u0304;\nii). x \u22a5 y \u21d2 f(x + y) \u2265 f(y);\niii). P_f(u) = \u03bb \u00b7 u for some \u03bb \u2208 [0, 1] (that may itself depend on u);\niv). 0 \u2208 dom f and P_{f+\u03ba} = P_f \u25e6 P_\u03ba for all p.h. (up to some additive constant) functions \u03ba \u2208 \u0393_0.\nThen we have i) \u21d2 ii) \u21d4 iii) \u21d4 iv). Moreover, when dim(H) \u2265 2, ii) \u21d2 i) as well, in which case P_f(u) = P_h(\u2016u\u2016)/\u2016u\u2016 \u00b7 u (where we interpret 0/0 = 0).\nRemark 2. 
When dim(H) = 1, ii) is equivalent to requiring f to attain its minimum at 0, in which case the implication ii) \u21d2 iv), under the redundant condition that f is differentiable, was proved by Combettes and Pesquet [14, Proposition 3.6]. The implication ii) \u21d2 iii) also generalizes [14, Corollary 2.5], where only the case dim(H) = 1 and f differentiable is considered. Note that there exist non-even f that satisfy Theorem 4 when dim(H) = 1. This is impossible for dim(H) \u2265 2, in which case any f that satisfies Theorem 4 must also enjoy all properties listed in Proposition 4.\nProof: i) \u21d2 ii): x \u22a5 y \u21d2 \u2016x + y\u2016 \u2265 \u2016y\u2016.\nii) \u21d2 iii): Indeed, by definition\n\nM_f(u) = min_x (1/2)\u2016x \u2212 u\u2016^2 + f(x)\n = min_{u^\u22a5, \u03bb} (1/2)\u2016u^\u22a5 + \u03bbu \u2212 u\u2016^2 + f(u^\u22a5 + \u03bbu)\n = min_\u03bb (1/2)\u2016\u03bbu \u2212 u\u2016^2 + f(\u03bbu)\n = min_{\u03bb\u2208[0,1]} (1/2)(\u03bb \u2212 1)^2 \u2016u\u2016^2 + f(\u03bbu),\n\nwhere the third equality is due to ii), and the constraint \u03bb \u2208 [0, 1] in the last equality can be seen as follows: For any \u03bb < 0, by increasing it to 0 we can only decrease both terms; a similar argument applies for \u03bb > 1. Therefore there exists \u03bb \u2208 [0, 1] such that \u03bbu attains the minimum defining the Moreau envelope M_f(u), hence we have P_f(u) = \u03bbu due to uniqueness.\niii) \u21d2 iv): Note first that iii) implies 0 \u2208 \u2202f(0), therefore 0 \u2208 dom f. Since the subdifferential of \u03ba is scaling-invariant, iii) implies the sufficient condition (10), hence iv).\niv) \u21d2 iii): Fix y and construct the gauge function\n\n\u03ba(z) = 0 if z = \u03bb \u00b7 y for some \u03bb \u2265 0, and \u03ba(z) = \u221e otherwise.\n\nThen P_\u03ba(y) = y, hence P_f(P_\u03ba(y)) = P_f(y) = P_{f+\u03ba}(y) by iv). 
On the other hand,\n\nM_{f+\u03ba}(y) = min_x (1/2)\u2016x \u2212 y\u2016^2 + f(x) + \u03ba(x) = min_{\u03bb\u22650} (1/2)\u2016\u03bby \u2212 y\u2016^2 + f(\u03bby).    (11)\n\nTaking y = 0 we obtain P_{f+\u03ba}(0) = 0. Thus P_f(0) = 0, i.e. 0 \u2208 \u2202f(0), from which we deduce that P_f(y) = P_{f+\u03ba}(y) = \u03bby for some \u03bb \u2208 [0, 1], since f(\u03bby) in (11) is increasing on [1, \u221e[.\n\n^3 Note that \u03bb \u2264 1 is necessary since any proximal map is nonexpansive.\n\niii) \u21d2 ii): First note that iii) implies that P_f(0) = 0 hence 0 \u2208 \u2202f(0), in particular, 0 \u2208 dom f. If dim(H) = 1 we are done, so we assume dim(H) \u2265 2 in the rest of the proof. In this case, it is known, cf. [9, Theorem 1] or [10, Theorem 3], that ii) \u21d4 i) (even without assuming f convex). All that is left is to prove iii) \u21d2 ii), or equivalently i), for the case dim(H) \u2265 2.\nWe first prove the case when dom f = H. By iii), P_f(x) = \u03bbx for some \u03bb \u2208 [0, 1] (which may depend on x as well). Using the first order optimality condition for the proximal map we have 0 \u2208 \u03bbx \u2212 x + \u2202f(\u03bbx), that is (1/\u03bb \u2212 1)y \u2208 \u2202f(y) for each y \u2208 ran(P_f) = H due to our assumption dom f = H. Now for any x \u22a5 y, by the definition of the subdifferential,\n\nf(x + y) \u2265 f(y) + \u27e8x, \u2202f(y)\u27e9 = f(y) + \u27e8x, (1/\u03bb \u2212 1)y\u27e9 = f(y).\n\nFor the case when dom f \u2282 H, we consider the proximal average [16]\n\ng = A(f, q) = [((1/2)(f\u2217 + q)\u2217 + (1/4)q)\u2217 \u2212 q]\u2217,    (12)\n\nwhere q = (1/2)\u2016\u00b7\u2016^2. Importantly, since q is defined on the whole space, the proximal average g has full domain too [16, Corollary 4.7]. Moreover, P_g(x) = (1/2)P_f(x) + (1/4)x = ((1/2)\u03bb + 1/4)x. Therefore by our previous argument, g satisfies ii) hence also i). 
It is easy to check that i) is preserved under taking the Fenchel conjugation (note that the convexity of f implies that of h). Since we have shown that g satisfies i), it follows from (12) that f satisfies i) hence also ii).\nAs mentioned, when dim(H) \u2265 2, the implication ii) \u21d2 i) was shown in [9, Theorem 1]. The formula P_f(u) = P_h(\u2016u\u2016)/\u2016u\u2016 \u00b7 u for f = h(\u2016\u00b7\u2016) follows from straightforward calculation.\nWe now discuss some applications of Theorem 4. When dim(H) \u2265 2, iii) in Theorem 4 automatically implies that the scalar constant \u03bb depends on x only through its norm. This fact, although not entirely obvious, does have a clear geometric picture:\nCorollary 1. Let dim(H) \u2265 2, C \u2286 H be a closed convex set that contains the origin. Then the projection onto C is simply a shrinkage towards the origin iff C is a ball (of the norm \u2016\u00b7\u2016).\nProof: Let f = \u03b9_C and apply Theorem 4.\n\nExample 5. As usual, denote q = (1/2)\u2016\u00b7\u2016^2. In many applications, in addition to the regularizer \u03ba (usually a gauge), one adds the \u2113_2^2 regularizer \u03bbq either for stability or grouping effect or strong convexity. This incurs no computational cost in the sense of computing the proximal map: We easily compute that P_{\u03bbq} = (1/(\u03bb+1)) Id. 
By Theorem 4, for any gauge \u03ba, P_{\u03ba+\u03bbq} = (1/(\u03bb+1)) P_\u03ba, whence it is also clear that adding an extra \u2113_2 regularizer tends to double \u201cshrink\u201d the solution. In particular, letting H = R^d and taking \u03ba = \u2016\u00b7\u2016_1 (the sum of absolute values) we recover the proximal map for the elastic-net regularizer [17].\nExample 6. The Berhu regularizer\n\nh(x) = |x| 1_{|x| < \u03b3} + ((x^2 + \u03b3^2)/(2\u03b3)) 1_{|x| \u2265 \u03b3} = |x| + ((|x| \u2212 \u03b3)^2/(2\u03b3)) 1_{|x| \u2265 \u03b3},\n\nbeing the reverse of Huber\u2019s function, is proposed in [18] as a bridge between the lasso (\u2113_1 regularization) and ridge regression (\u2113_2^2 regularization). Let f(x) = h(x) \u2212 |x|. Clearly, f satisfies ii) of Theorem 4 (but not differentiable), hence\n\nP_h = P_f \u25e6 P_{|\u00b7|},\n\nwhereas simple calculation verifies that\n\nP_f(x) = sign(x) \u00b7 min{|x|, (\u03b3/(1+\u03b3))(|x| + 1)},\n\nand of course P_{|\u00b7|}(x) = sign(x) \u00b7 max{|x| \u2212 1, 0}. Note that this regularizer is not s.p.d.\nCorollary 2. Let dim(H) \u2265 2, then the p.h. function f \u2208 \u0393_0 satisfies any item of Theorem 4 iff it is a positive multiple of the norm \u2016\u00b7\u2016.\nProof: [10, Theorem 4] showed that under positive homogeneity, i) implies that f is a positive multiple of the norm.\nTherefore (positive multiples of) the Hilbertian norm is the only p.h. convex function f that satisfies P_{f+\u03ba} = P_f \u25e6 P_\u03ba for all gauges \u03ba. In particular, this means that the norm \u2016\u00b7\u2016 is s.p.d. Moreover, we easily recover the following result that is perhaps not so obvious at first glance:\nCorollary 3 (Jenatton et al. [7]). Fix the orthonormal basis {e_i}_{i\u2208I} of H. Let G \u2286 2^I be a collection of tree-structured groups, that is, either g \u2286 g\u2032 or g\u2032 \u2286 g or g \u2229 g\u2032 = \u2205 for all g, g\u2032 \u2208 G. Then\n\nP_{\u2211_{i=1}^m \u2016\u00b7\u2016_{g_i}} = P_{\u2016\u00b7\u2016_{g_1}} \u25e6 \u00b7\u00b7\u00b7 \u25e6 P_{\u2016\u00b7\u2016_{g_m}},\n\nwhere we arrange the groups so that g_i \u2282 g_j \u21d2 i > j, and the notation \u2016\u00b7\u2016_{g_i} denotes the Hilbertian norm that is restricted to the coordinates indexed by the group g_i.\nProof: Let f = \u2016\u00b7\u2016_{g_1} and \u03ba = \u2211_{i=2}^m \u2016\u00b7\u2016_{g_i}. 
Clearly they are both p.h. (and convex). By the tree-structured assumption we can partition \u03ba = \u03ba_1 + \u03ba_2, where g_i \u2282 g_1 for all g_i appearing in \u03ba_1 while g_j \u2229 g_1 = \u2205 for all g_j appearing in \u03ba_2. Restricting to the subspace spanned by the variables in g_1 we can treat f as the Hilbertian norm. Applying Theorem 4 we obtain P_{f+\u03ba_1} = P_f \u25e6 P_{\u03ba_1}. On the other hand, due to the non-overlapping property, nothing will be affected by adding \u03ba_2, thus\n\nP_{\u2211_{i=1}^m \u2016\u00b7\u2016_{g_i}} = P_{\u2016\u00b7\u2016_{g_1}} \u25e6 P_{\u2211_{i=2}^m \u2016\u00b7\u2016_{g_i}}.\n\nWe can clearly iterate the argument to unravel the proximal map as claimed.\nFor notational clarity, we have chosen not to incorporate weights in the sum of group seminorms: These can be absorbed into the seminorm and the corollary clearly remains intact. Our proof also reveals the fundamental reason why Corollary 3 is true: The \u2113_2 norm admits the decomposition (5) for any gauge g! This fact, to the best of our knowledge, has not been recognized previously.\n\n3.4 Cone Invariance\nIn the previous subsection, we restricted the subdifferential of g to be constant along each ray. We now generalize this to cones. Specifically, consider the gauge function\n\n\u03ba(x) = max_{j\u2208J} \u27e8a_j, x\u27e9,    (13)\n\nwhere J is a finite index set and each a_j \u2208 H. Such polyhedral gauge functions have become extremely important due to the work of Chandrasekaran et al. [19]. Define the polyhedral cones\n\nK_j = {x \u2208 H : \u27e8a_j, x\u27e9 = \u03ba(x)}.    (14)\n\nAssume K_j \u2260 \u2205 for each j (otherwise delete j from J). 
Since \u2202\u03ba(x) = {a_j | j \u2208 J, x \u2208 K_j}, the sufficient condition (10) becomes\n\n\u2200j \u2208 J, P_f(K_j) \u2286 K_j.    (15)\n\nIn other words, each cone K_j is \u201cfixed\u201d under the proximal map of f. Although it would be very interesting to completely characterize f under (15), we show that in its current form, (15) already implies many known results, with some new generalizations falling out naturally.\nCorollary 4. Denote by E a collection of pairs (m, n), and define the total variational norm \u2016x\u2016_tv = \u2211_{{m,n}\u2208E} w_{m,n}|x_m \u2212 x_n|, where w_{m,n} \u2265 0. Then for any permutation invariant function^4 f,\n\nP_{f+\u2016\u00b7\u2016_tv} = P_f \u25e6 P_{\u2016\u00b7\u2016_tv}.\n\nProof: Pick an arbitrary pair (m, n) \u2208 E and let \u03ba = |x_m \u2212 x_n|. Clearly J = {1, 2}, K_1 = {x_m \u2265 x_n} and K_2 = {x_m \u2264 x_n}. Since f is permutation invariant, its proximal map P_f(x) maintains the order of x, hence we establish (15). Finally apply Proposition 2 and Theorem 1.\n\n^4 All we need is the weaker condition: For all {m, n} \u2208 E, x_m \u2265 x_n \u21d2 [P_f(x)]_m \u2265 [P_f(x)]_n.\n\nRemark 3. The special case where E = {(1, 2), (2, 3), . . .} is a chain, w_{m,n} \u2261 1 and f is the \u2113_1 norm, appeared first in [6] and is generally known as the fused lasso. The case where f is the \u2113_p norm appeared in [20].\nWe call the permutation invariant function f symmetric if \u2200x, f(|x|) = f(x), where |\u00b7| denotes the componentwise absolute value. The proof for the next corollary is almost the same as that of Corollary 4, except that we also use the fact sign([P_f(x)]_m) = sign(x_m) for symmetric functions.\nCorollary 5. As in Corollary 4, define the norm \u2016x\u2016_oct = \u2211_{{m,n}\u2208E} w_{m,n} max{|x_m|, |x_n|}. 
Then for any symmetric function f, P_{f+\u2016\u00b7\u2016_oct} = P_f \u25e6 P_{\u2016\u00b7\u2016_oct}.\nRemark 4. This norm \u2016\u00b7\u2016_oct is proposed in [21] for feature grouping. Surprisingly, Corollary 5 appears to be new. The proximal map P_{\u2016\u00b7\u2016_oct} is derived in [22], which turns out to be another decomposition result. Indeed, for i \u2265 2, define \u03ba_i(x) = \u2211_{j\u2264i\u22121} max{|x_i|, |x_j|}. Thus\n\n\u2016\u00b7\u2016_oct = \u2211_{i\u22652} \u03ba_i.\n\nImportantly, we observe that \u03ba_i is symmetric on the first i \u2212 1 coordinates. We claim that\n\nP_{\u2016\u00b7\u2016_oct} = P_{\u03ba_{|I|}} \u25e6 . . . \u25e6 P_{\u03ba_2}.\n\nThe proof is by recursion: Write \u2016\u00b7\u2016_oct = f + g, where f = \u03ba_{|I|}. Note that the subdifferential of g depends only on the ordering and sign of the first |I| \u2212 1 coordinates while the proximal map of f preserves the ordering and sign of the first |I| \u2212 1 coordinates (due to symmetry). If we pre-sort x, the individual proximal maps P_{\u03ba_i}(x) become easy to compute sequentially and we recover the algorithm in [22] with some bookkeeping.\nCorollary 6. As in Corollary 3, let G \u2286 2^I be a collection of tree-structured groups, then\n\nP_{\u2211_{i=1}^m \u2016\u00b7\u2016_{g_i,k}} = P_{\u2016\u00b7\u2016_{g_1,k}} \u25e6 \u00b7\u00b7\u00b7 \u25e6 P_{\u2016\u00b7\u2016_{g_m,k}},\n\nwhere we arrange the groups so that g_i \u2282 g_j \u21d2 i > j, and \u2016x\u2016_{g_i,k} = \u2211_{j=1}^k |x_{g_i}|_{[j]} is the sum of the k (absolute-value) largest elements in the group g_i, i.e., Ky-Fan\u2019s k-norm.\nProof: Similarly to the proof of Corollary 3, we need only prove that\n\nP_{\u2016\u00b7\u2016_{g_1,k} + \u2016\u00b7\u2016_{g_2,k}} = P_{\u2016\u00b7\u2016_{g_1,k}} \u25e6 P_{\u2016\u00b7\u2016_{g_2,k}},\n\nwhere w.l.o.g. we assume g_1 contains all variables while g_2 \u2282 g_1. 
Therefore \u2016\u00b7\u2016_{g_1,k} can be treated as symmetric and the rest follows the proof of Corollary 5.\nNote that the case k \u2208 {1, |I|} was proved in [7] and Corollary 6 can be seen as an interpolation. Interestingly, there is another interpolated result whose proof should be apparent now.\nCorollary 7. Corollary 6 remains true if we replace Ky-Fan\u2019s k-norm with\n\n\u2016x\u2016_{oct,k} = \u2211_{1\u2264i_1<i_2<...<i_k\u2264|I|} max{|x_{i_1}|, . . . , |x_{i_k}|}. (16)\n\nTherefore we can employ the norm \u2016x\u2016_{oct,2} for feature grouping in a hierarchical manner. Clearly we can also combine Corollary 6 and Corollary 7.\nCorollary 8. For any symmetric f, P_{f + \u2016\u00b7\u2016_{oct,k}} = P_f \u25e6 P_{\u2016\u00b7\u2016_{oct,k}}. Similarly for Ky-Fan\u2019s k-norm.\nRemark 5. The above corollary implies that Ky-Fan\u2019s k-norm and the norm \u2016\u00b7\u2016_{oct,k} defined in (16) are both s.p.d. (see Definition 1). The special case for the \u2113_p norm where p \u2208 {1, 2, \u221e} was proved in [23, Proposition 11], with a substantially more complicated argument. As pointed out in [23], s.p.d. regularizers allow us to perform lazy updates in gradient-type algorithms.\n\nWe remark that we have not exhausted the possibilities for obtaining the decomposition (5). It is our hope to stimulate further work in understanding the prox-decomposition (5).\nAdded after acceptance: We have managed to extend the results in this subsection to the Lov\u00e1sz extension of submodular set functions. Details will be given elsewhere.\n\n4 Conclusion\nThe main goal of this paper is to understand when the proximal map of the sum of functions decomposes into the composition of the proximal maps of the individual functions. Using a simple sufficient condition we are able to completely characterize the decomposition when certain scaling invariance is exhibited.
The generalization to cone invariance is also considered and we recover many known decomposition results, with some new ones obtained almost effortlessly. In the future we plan to generalize some of the results here to nonconvex functions.\n\nAcknowledgement\n\nThe author thanks Bob Williamson and Xinhua Zhang from NICTA, Canberra for their hospitality during the author\u2019s visit when part of this work was performed; Warren Hare, Yves Lucet, and Heinz Bauschke from UBC, Okanagan for some discussions around Theorem 4; and the reviewers for their valuable comments.\n\nReferences\n[1] Bernhard Sch\u00f6lkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.\n[2] Peter B\u00fchlmann and Sara van de Geer. Statistics for High-Dimensional Data. Springer, 2011.\n[3] Patrick L. Combettes and Val\u00e9rie R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168\u20131200, 2005.\n[4] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n[5] Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, Series B, 140:125\u2013161, 2013.\n[6] Jerome Friedman, Trevor Hastie, Holger H\u00f6fling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302\u2013332, 2007.\n[7] Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, and Francis Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297\u20132334, 2011.\n[8] Jiayu Zhou, Jun Liu, Vaibhav A. Narayan, and Jieping Ye. Modeling disease progression via fused sparse group lasso. In Conference on Knowledge Discovery and Data Mining, 2012.\n[9] Francesco Dinuzzo and Bernhard Sch\u00f6lkopf.
The representer theorem for Hilbert spaces: a necessary and sufficient condition. In NIPS, 2012.\n[10] Yao-Liang Yu, Hao Cheng, Dale Schuurmans, and Csaba Szepesv\u00e1ri. Characterizing the representer theorem. In ICML, 2013.\n[11] Jean J. Moreau. Proximit\u00e9 et dualit\u00e9 dans un espace Hilbertien. Bulletin de la Soci\u00e9t\u00e9 Math\u00e9matique de France, 93:273\u2013299, 1965.\n[12] Patrick L. Combettes, \u0110inh D\u0169ng, and B\u1eb1ng C\u00f4ng V\u0169. Proximity for sums of composite functions. Journal of Mathematical Analysis and Applications, 380(2):680\u2013688, 2011.\n[13] Andr\u00e9 F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, and M\u00e1rio A. T. Figueiredo. Online learning of structured predictors with multiple kernels. In Conference on Artificial Intelligence and Statistics, 2011.\n[14] Patrick L. Combettes and Jean-Christophe Pesquet. Proximal thresholding algorithm for minimization over orthonormal bases. SIAM Journal on Optimization, 18(4):1351\u20131376, 2007.\n[15] Yaoliang Yu. Fast Gradient Algorithms for Structured Sparsity. PhD thesis, University of Alberta, 2013.\n[16] Heinz H. Bauschke, Rafal Goebel, Yves Lucet, and Xianfu Wang. The proximal average: Basic theory. SIAM Journal on Optimization, 19(2):766\u2013785, 2008.\n[17] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67:301\u2013320, 2005.\n[18] Art B. Owen. A robust hybrid of lasso and ridge regression. In Prediction and Discovery, pages 59\u201372. AMS, 2007.\n[19] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805\u2013849, 2012.\n[20] Xinhua Zhang, Yaoliang Yu, and Dale Schuurmans. Polar operators for structured sparse estimation. In NIPS, 2013.\n[21] Howard Bondell and Brian Reich.
Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115\u2013123, 2008.\n[22] Leon Wenliang Zhong and James T. Kwok. Efficient sparse modeling with automatic feature grouping. In ICML, 2011.\n[23] John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899\u20132934, 2009.\n", "award": [], "sourceid": 93, "authors": [{"given_name": "Yao-Liang", "family_name": "Yu", "institution": "University of Alberta"}]}