{"title": "Geometric optimisation on positive definite matrices for elliptically contoured distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 2562, "page_last": 2570, "abstract": "Hermitian positive definite matrices (HPD) recur throughout statistics and machine learning. In this paper we develop \\emph{geometric optimisation} for globally optimising certain nonconvex loss functions arising in the modelling of data via elliptically contoured distributions (ECDs). We exploit the remarkable structure of the convex cone of positive definite matrices which allows one to uncover hidden geodesic convexity of objective functions that are nonconvex in the ordinary Euclidean sense. Going even beyond manifold convexity we show how further metric properties of HPD matrices can be exploited to globally optimise several ECD log-likelihoods that are not even geodesic convex. We present key results that help recognise this geometric structure, as well as obtain efficient fixed-point algorithms to optimise the corresponding objective functions. To our knowledge, ours are the most general results on geometric optimisation of HPD matrices known so far. Experiments reveal the benefits of our approach---it avoids any eigenvalue computations which makes it very competitive.", "full_text": "Geometric optimisation on positive definite matrices\nwith application to elliptically contoured distributions\n\nSuvrit Sra\n\nMax Planck Institute for Intelligent Systems\n\nT\u00fcbingen, Germany\n\nReshad Hosseini\n\nSchool of ECE, College of Engineering\n\nUniversity of Tehran, Tehran, Iran\n\nAbstract\n\nHermitian positive definite (hpd) matrices recur throughout machine learning,\nstatistics, and optimisation. This paper develops (conic) geometric optimisation\non the cone of hpd matrices, which allows us to globally optimise a large class of\nnonconvex functions of hpd matrices. 
Specifically, we first use the Riemannian\nmanifold structure of the hpd cone for studying functions that are nonconvex\nin the Euclidean sense but are geodesically convex (g-convex), hence globally\noptimisable. We then go beyond g-convexity, and exploit the conic geometry\nof hpd matrices to identify another class of functions that remain amenable to\nglobal optimisation without requiring g-convexity. We present key results that\nhelp recognise g-convexity and also the additional structure alluded to above. We\nillustrate our ideas by applying them to likelihood maximisation for a broad family\nof elliptically contoured distributions: for this maximisation, we derive novel,\nparameter-free fixed-point algorithms. To our knowledge, ours are the most general\nresults on geometric optimisation of hpd matrices known so far. Experiments show\nthe advantages of using our fixed-point algorithms.\n\n1 Introduction\n\nThe geometry of Hermitian positive definite (hpd) matrices is remarkably rich and forms a foundational\npillar of modern convex optimisation [21] and of the rapidly evolving area of convex algebraic\ngeometry [4]. The geometry exhibited by hpd matrices, however, goes beyond what is typically\nexploited in these two areas. In particular, hpd matrices form a convex cone which is also a differentiable\nRiemannian manifold that is also a CAT(0) space (i.e., a metric space of nonpositive\ncurvature [7]). This rich structure enables \u201cgeometric optimisation\u201d with hpd matrices, which allows\nsolving many problems that are nonconvex in the Euclidean sense but convex in the manifold sense\n(see \u00a72 or [29]), or have enough metric structure (see \u00a73) to permit efficient optimisation.\nThis paper develops (conic) geometric optimisation\u00b9 (GO) for hpd matrices. 
We present key results\nthat help recognise geodesic convexity (g-convexity); we also present sufficient conditions that put a\nclass of even non-g-convex functions within the grasp of GO. To our knowledge, ours are the most\ngeneral results on geometric optimisation with hpd matrices known so far.\nMotivation for GO. We begin by noting that the widely studied class of geometric programs is\nultimately nothing but the 1D version of GO on hpd matrices. Given that geometric programming\nhas enjoyed great success in numerous applications (see e.g., the survey of Boyd et al. [6]), we\nhope GO also gains broad applicability. For this paper, GO arises naturally while performing\nmaximum likelihood parameter estimation for a rich class of elliptically contoured distributions\n(ECDs) [8, 13, 20]. Perhaps the best known GO problem is the task of computing the Karcher /\nFr\u00e9chet mean of hpd matrices: a topic that has attracted great attention within matrix theory [2, 3, 27],\ncomputer vision [10], radar imaging [22; Part II], and medical imaging [11, 31]; we refer the reader\nto the recent book [22] for additional applications, references, and details. Another GO problem\narises as a subroutine in nearest neighbour search over hpd matrices [12]. Several other areas involve\nGO problems: statistics (covariance shrinkage) [9], nonlinear matrix equations [17], Markov decision\nprocesses and the wider encompassing area of nonlinear Perron-Frobenius theory [18].\n\n\u00b9 To our knowledge the name \u201cgeometric optimisation\u201d has not been previously attached to hpd matrix\noptimisation, perhaps because so far only a few scattered examples were known. Our theorems provide a starting\npoint for recognising and constructing numerous problems amenable to geometric optimisation.\n\nMotivating application. 
We use ECDs as a platform for illustrating our ideas for two reasons:\n(i) ECDs are important in a variety of settings (see the recent survey [23]); and (ii) they offer an\ninstructive setup for presenting key ideas from the world of geometric optimisation.\nLet us therefore begin by recalling some basics. An ECD with density on R^d takes the form\u00b2\n\nE_\u03d5(x; S) \u221d det(S)^{\u22121/2} \u03d5(x^T S^{\u22121} x),   \u2200 x \u2208 R^d,   (1)\n\nwhere S \u2208 P_d (i.e., the set of d \u00d7 d symmetric positive definite matrices) is the scatter matrix while\n\u03d5 : R \u2192 R_{++} is a positive density generating function (dgf). If an ECD has a finite covariance matrix,\nthen the scatter matrix is proportional to the covariance matrix [8].\nExample 1. With \u03d5(t) = e^{\u2212t/2}, density (1) reduces to the multivariate normal density. For the choice\n\n\u03d5(t) = t^{\u03b1\u2212d/2} exp(\u2212(t/b)^\u03b2),   (2)\n\nwhere \u03b1, b and \u03b2 are fixed positive numbers, density (1) yields the rich class called Kotz-type\ndistributions that are known to have powerful modelling abilities [15; \u00a73.2]; they include as special\ncases multivariate power exponentials, elliptical gamma, and multivariate W-distributions, for instance.\nMLE. Let (x_1, . . . , x_n) be i.i.d. samples from an ECD E_\u03d5(S). Up to constants, the log-likelihood is\n\nL(S) = \u2212(1/2) n log det S + \u2211_{i=1}^{n} log \u03d5(x_i^T S^{\u22121} x_i).   (3)\n\nEquivalently, we may consider the minimisation problem\n\nmin_{S \u227b 0} \u03a6(S) := (1/2) n log det(S) \u2212 \u2211_{i} log \u03d5(x_i^T S^{\u22121} x_i).   (4)\n\nProblem (4) is in general difficult as \u03a6 may be nonconvex and may have multiple local minima.\nSince statistical estimation theory relies on having access to global optima, it is important to be able\nto solve (4) to global optimality. 
These difficulties notwithstanding, using GO ideas, we identify a\nrich class of ECDs for which we can indeed solve (4) optimally. Some examples already exist in\nthe literature [16, 23, 30]; this paper develops techniques that are strictly more general and subsume\nprevious examples, while advancing the broader idea of geometric optimisation.\nWe illustrate our ideas by studying the following two main classes of dgfs in (1):\n\n(i) Geodesically Convex (GC): This class contains functions for which the negative log-likelihood\n\u03a6(S) is g-convex, i.e., convex along geodesics in the manifold of hpd matrices. Some members\nof this class have been previously studied (though sometimes without recognising or directly\nexploiting the g-convexity);\n\n(ii) Log-Nonexpansive (LN): This is a new class that we introduce in this paper. It exploits the\n\u201cnon-positive curvature\u201d property of the manifold of hpd matrices.\n\nThere is a third important class: LC, the class of log-convex dgfs \u03d5. However, since (4) involves\n\u2212 log \u03d5, the optimisation problem remains nonconvex. We describe class LC only in [28], primarily\nfor lack of space and also because the first two classes contain our most novel results. These\nclasses of dgfs are neither mutually disjoint nor proper subsets of each other. Each captures unique\nanalytic or geometric structure crucial to efficient optimisation. Class GC characterises the \u201chidden\u201d\nconvexity found in several instances of (4), while LN is a novel class of models that might not have\nthis hidden convexity, but nevertheless admit global optimisation.\nContributions. The key contributions of this paper are the following:\n\u2013 New results that characterise and help recognise g-convexity (Thm. 1, Cor. 2, Cor. 3, Thm. 4).\nThough initially motivated by ECDs, our matrix-theoretic proofs are more generally applicable and\nshould be of wider interest. 
All technical proofs, and several additional results that help recognise\ng-convexity, are in the longer version of this paper [28].\n\u2013 New fixed-point theory for solving GO problems, including some that might even lack g-convexity.\nHere too, our results go beyond ECDs: in fact, they broaden the class of problems that admit\nfixed-point algorithms in the metric space (P_d, \u03b4_T). Thms. 11 and 14 are the key results here.\n\nOur results on geodesic convexity subsume the more specialised results reported recently in [29].\nWe believe our matrix-theoretic proofs, though requiring slightly more advanced machinery, are\nultimately simpler and more widely applicable. Our fixed-point theory offers a unified framework\nthat not only captures the well-known M-estimators of [16], but applies to a larger class of problems\nthan possible using previous methods. Our experiments illustrate the computational benefits of one of\nthe resulting algorithms.\n\n\u00b2 For simplicity we describe only mean zero families; the extension to the general case is trivial.\n\n2 Geometric optimisation with geodesic convexity: class GC\n\nGeodesic convexity (g-convexity) is a classical concept in mathematics and is used extensively in\nthe study of Hadamard manifolds and metric spaces of nonpositive curvature [7, 24] (i.e., spaces\nwhose distance function is g-convex). This concept has been previously studied in nonlinear\noptimisation [25], but its full importance and applicability in statistical applications and optimisation is only\nrecently emerging [29, 30].\nWe begin our presentation by recalling some definitions; please see [7, 24] for extensive details.\nDefinition 2 (gc set). Let M denote a d-dimensional connected C^2 Riemannian manifold. A set\nX \u2282 M is called geodesically convex if any two points of X are joined by a geodesic lying in\nX. 
That is, if x, y \u2208 X, then there exists a geodesic \u03b3 : [0, 1] \u2192 X such that \u03b3(0) = x and \u03b3(1) = y.\nDefinition 3 (gc function). Let X \u2282 M be a gc set. A function \u03c6 : X \u2192 R is geodesically convex,\nif for any x, y \u2208 X and a unit speed geodesic \u03b3 : [0, 1] \u2192 X with \u03b3(0) = x and \u03b3(1) = y, we have\n\n\u03c6(\u03b3(t)) \u2264 (1 \u2212 t)\u03c6(\u03b3(0)) + t\u03c6(\u03b3(1)) = (1 \u2212 t)\u03c6(x) + t\u03c6(y).   (5)\n\nThe power of gc functions in the context of solving (4) comes into play because the set P_d (the\nconvex cone of positive definite matrices) is also a differentiable Riemannian manifold where\ngeodesics between points can be computed efficiently. Indeed, the tangent space to P_d at any point\ncan be identified with the set of Hermitian matrices, and the inner product on this space leads to\na Riemannian metric on P_d. At any point A \u2208 P_d, this metric is given by the differential form\nds = \u2016A^{\u22121/2} dA A^{\u22121/2}\u2016_F; also, between A, B \u2208 P_d there is a unique geodesic [1; Thm. 6.1.6]\n\nA #_t B := \u03b3(t) = A^{1/2} (A^{\u22121/2} B A^{\u22121/2})^t A^{1/2},   t \u2208 [0, 1].   (6)\n\nThe midpoint of this path, namely A #_{1/2} B, is called the matrix geometric mean, which is an object\nof great interest in numerous areas [1\u20133, 10, 22]. As per convention, we denote it simply by A#B.\nExample 4. Let z \u2208 C^d be any vector. The function \u03c6(X) := z^\u2217 X^{\u22121} z is gc.\nProof. Since \u03c6 is continuous, it suffices to verify midpoint convexity: \u03c6(X#Y) \u2264 (1/2)\u03c6(X) + (1/2)\u03c6(Y)\nfor X, Y \u2208 P_d. Since (X#Y)^{\u22121} = X^{\u22121}#Y^{\u22121} and X^{\u22121}#Y^{\u22121} \u2aaf (X^{\u22121} + Y^{\u22121})/2 ([1; 4.16]), it follows\nthat \u03c6(X#Y) = z^\u2217 (X#Y)^{\u22121} z \u2264 (1/2)(z^\u2217 X^{\u22121} z + z^\u2217 Y^{\u22121} z) = (1/2)(\u03c6(X) + \u03c6(Y)).\n\nWe are ready to state our first main theorem, which vastly generalises the above example and provides\na foundational tool for recognising and constructing gc functions.\nTheorem 1. Let \u03a0 : P_d \u2192 P_k be a strictly positive linear map. Then, for A, B \u2208 P_d, we have\n\n\u03a0(A #_t B) \u2aaf \u03a0(A) #_t \u03a0(B),   t \u2208 [0, 1].   (7)\n\nProof. Although positive linear maps are well-studied objects (see e.g., [1; Ch. 4]), we did not find\nan explicit proof of (7) in the literature, so we provide a proof in the longer version [28].\n\nA useful corollary of Thm. 1 is the following (notice this corollary subsumes Example 4).\nCorollary 2. For positive definite matrices A, B \u2208 P_d and matrices 0 \u2260 X \u2208 C^{d\u00d7k} we have\n\ntr X^\u2217 (A #_t B) X \u2264 [tr X^\u2217 A X]^{1\u2212t} [tr X^\u2217 B X]^t,   t \u2208 (0, 1).   (8)\n\nProof. Use the map A \u21a6 tr X^\u2217 A X in Thm. 1.\n\nNote: Cor. 2 actually constructs a log-g-convex function, from which g-convexity is immediate.\nA notable corollary to Thm. 1 that subsumes a nontrivial result [14; Lem. 3.2] is mentioned below.\nCorollary 3. Let X_i \u2208 C^{d\u00d7k} with k \u2264 d such that rank([X_i]_{i=1}^{m}) = k. Then the function \u03c6(S) :=\nlog det(\u2211_i X_i^\u2217 S X_i) is gc on P_d.\nProof. By our assumption on the X_i, the map \u03a0 = S \u21a6 \u2211_i X_i^\u2217 S X_i is strictly positive. Thus, from\nThm. 1 it follows that \u03a0(S#R) = \u2211_i X_i^\u2217 (S#R) X_i \u2aaf \u03a0(S)#\u03a0(R). Since log det is monotonic,\nand determinant is multiplicative, the previous inequality yields\n\n\u03c6(S#R) = log det \u03a0(S#R) \u2264 (1/2) log det(\u03a0(S)) + (1/2) log det(\u03a0(R)) = (1/2)\u03c6(S) + (1/2)\u03c6(R).\n\nWe are now ready to state our second main theorem.\nTheorem 4. Let h : P_k \u2192 R be a gc function that is nondecreasing in L\u00f6wner order. Let r \u2208 {\u00b11},\nand let \u03a0 : P_d \u2192 P_k be a strictly positive linear map. Then, \u03c6(S) = h(\u03a0(S^r)) \u00b1 log det(S) is gc.\nProof. Since \u03c6 is continuous, it suffices to prove midpoint geodesic convexity. Since r \u2208 {\u00b11},\n(S#R)^r = S^r#R^r; thus, from Thm. 1 and since h is matrix nondecreasing, it follows that\n\nh(\u03a0((S#R)^r)) = h(\u03a0(S^r#R^r)) \u2264 h(\u03a0(S^r)#\u03a0(R^r)).   (9)\n\nSince h is also gc, inequality (9) further yields\n\nh(\u03a0(S^r)#\u03a0(R^r)) \u2264 (1/2) h(\u03a0(S^r)) + (1/2) h(\u03a0(R^r)).   (10)\n\nSince \u00b1 log det(S#R) = \u00b1(1/2)(log det(S) + log det(R)), on combining with (10) we obtain\n\n\u03c6(S#R) \u2264 (1/2)\u03c6(S) + (1/2)\u03c6(R),\n\nas desired. Notice also that if h is strictly gc, then \u03c6(S) is also strictly gc.\n\nFinally, we state a corollary of Thm. 4 helpful towards recognising geodesic convexity of ECDs.\nWe mention here that a result equivalent to Cor. 5 was recently also discovered in [30]. Thm. 4 is\nmore general and uses a completely different argument founded on the matrix-theoretic results; our\ntechniques may also be of wider independent interest.\nCorollary 5. Let h : R_{++} \u2192 R be nondecreasing and gc (i.e., h(x^{1\u2212\u03bb} y^\u03bb) \u2264 (1 \u2212 \u03bb)h(x) + \u03bbh(y)).\nThen, for r \u2208 {\u00b11}, the function \u03c6 : P_d \u2192 R : S \u21a6 \u2211_i h(x_i^T S^r x_i) \u00b1 log det(S) is gc.\n\n2.1 Application to ECDs in class GC\n\nWe begin with a straightforward corollary of the above discussion.\nCorollary 6. 
For the following distributions the negative log-likelihood (4) is gc: (i) Kotz with \u03b1 \u2264 d/2\n(its special cases include Gaussian, multivariate power exponential, multivariate W-distribution with\nshape parameter smaller than one, elliptical gamma with shape parameter \u03bd \u2264 d/2); (ii) Multivariate-t;\n(iii) Multivariate Pearson type II with positive shape parameter; (iv) Elliptical multivariate logistic\ndistribution.\u00b3\n\nIf the negative log-likelihood is strictly gc then (4) cannot have multiple solutions. Moreover, for any local\noptimisation method that computes a solution to (4), geodesic convexity ensures that this solution is\nglobally optimal. Therefore, the key question to answer is: does (4) have a solution?\nNote that answering this question is nontrivial even in special cases [16, 30]. We provide below a\nfairly general result that helps establish existence.\n\n\u00b3 The dgfs of the different distributions are listed here for the reader\u2019s convenience.\nMultivariate power exponential: \u03d5(t) = exp(\u2212t^\u03bd/b), \u03bd > 0;\nMultivariate W-distribution: \u03d5(t) = t^{\u03bd\u22121} exp(\u2212t^\u03bd/b), \u03bd > 0;\nElliptical gamma: \u03d5(t) = t^{\u03bd\u2212d/2} exp(\u2212t/b), \u03bd > 0;\nMultivariate t: \u03d5(t) = (1 + t/\u03bd)^{\u2212(\u03bd+d)/2}, \u03bd > 0;\nMultivariate Pearson type II: \u03d5(t) = (1 \u2212 t)^\u03bd, \u03bd > \u22121, 0 \u2264 t \u2264 1;\nElliptical multivariate logistic: \u03d5(t) = exp(\u2212\u221at)/(1 + exp(\u2212\u221at))^2.\n\nTheorem 7. If \u03a6(S) satisfies the following properties: (i) \u2212 log \u03d5(t) is lower semi-continuous (lsc)\nfor t > 0, and (ii) \u03a6(S) \u2192 \u221e as \u2016S\u2016 \u2192 \u221e or \u2016S^{\u22121}\u2016 \u2192 \u221e, then \u03a6(S) attains its minimum.\n\nProof. 
Consider the metric space (P_d, d_R), where d_R is the Riemannian distance,\n\nd_R(A, B) = \u2016log(A^{\u22121/2} B A^{\u22121/2})\u2016_F,   A, B \u2208 P_d.   (11)\n\nIf \u03a6(S) \u2192 \u221e as \u2016S\u2016 \u2192 \u221e or as \u2016S^{\u22121}\u2016 \u2192 \u221e, then \u03a6(S) has bounded lower-level sets in (P_d, d_R).\nIt is a well-known result in variational analysis that a function that has bounded lower-level sets in\na metric space and is lsc attains its minimum [26]. Since \u2212 log \u03d5(t) is lsc and\nlog det(S^{\u22121}) is continuous, \u03a6(S) is lsc on (P_d, d_R). Therefore it attains its minimum.\n\nA key consequence of Thm. 7 is its ability to show existence of solutions to (4) for a variety of\ndifferent ECDs. Let us look at an application to Kotz-type distributions below. For these distributions,\nthe function \u03a6(S) assumes the form\n\nK(S) = (n/2) log det(S) + (d/2 \u2212 \u03b1) \u2211_{i=1}^{n} log(x_i^T S^{\u22121} x_i) + \u2211_{i=1}^{n} (x_i^T S^{\u22121} x_i / b)^\u03b2.   (12)\n\nLemma 8 shows that K(S) \u2192 \u221e whenever \u2016S^{\u22121}\u2016 \u2192 \u221e or \u2016S\u2016 \u2192 \u221e.\nLemma 8. Let the data X = {x_1, . . . , x_n} span the whole space and satisfy for \u03b1 < d/2 the condition\n\n|X \u2229 L| / |X| < d_L / (d \u2212 2\u03b1),   (13)\n\nwhere L is an arbitrary subspace with dimension d_L < d and |X \u2229 L| is the number of datapoints\nthat lie in the subspace L. If \u2016S^{\u22121}\u2016 \u2192 \u221e or \u2016S\u2016 \u2192 \u221e, then K(S) \u2192 \u221e.\nProof. If \u2016S^{\u22121}\u2016 \u2192 \u221e, then since the data span the whole space, it is possible to find a datum x_1 such\nthat t_1 = x_1^T S^{\u22121} x_1 \u2192 \u221e. Since\n\nlim_{t\u2192\u221e} [c_1 log(t) + t^{c_2} + c_3] = \u221e\n\nfor constants c_1, c_3 and c_2 > 0, it follows that K(S) \u2192 \u221e whenever \u2016S^{\u22121}\u2016 \u2192 \u221e.\nIf \u2016S\u2016 \u2192 \u221e and \u2016S^{\u22121}\u2016 is bounded, then the third term in the expression of K(S) is bounded. Assume\nthat d_L is the number of eigenvalues of S that go to \u221e and |X \u2229 L| is the number of data that lie\nin the subspace spanned by the corresponding eigenvectors. Then in the limit when these eigenvalues of S go to \u221e, K(S)\nconverges to the following limit\n\nlim_{\u03bb\u2192\u221e} [(n/2) d_L log \u03bb + (d/2 \u2212 \u03b1) |X \u2229 L| log \u03bb^{\u22121} + c].\n\nClearly, if (n/2) d_L \u2212 (d/2 \u2212 \u03b1) |X \u2229 L| > 0, then K(S) \u2192 \u221e and the proof is complete.\n\nIt is important to note that overlap condition (13) can be fulfilled easily by assuming that the number\nof data is larger than their dimensionality and that they are noisy. Using Lemma 8, we can invoke\nThm. 7 to immediately state the following result.\nTheorem 9 (Existence Kotz-distr.). If the data samples satisfy condition (13), then the Kotz negative\nlog-likelihood has a minimiser.\n\nAs previously mentioned, once existence is ensured, one may use any local optimisation method to\nminimise (4) to obtain the desired mle. This brings us to the next question. What if \u03a6(S) is neither\nconvex nor g-convex? The ideas introduced in Sec. 3 below offer a partial answer.\n\n3 Geometric optimisation for class LN\n\nWithout convexity or g-convexity, in general at best we might obtain local minima. However, as\nalluded to previously, the set P_d of hpd matrices possesses remarkable geometric structure that allows\nus to extend global optimisation to a rich class beyond just gc functions. To our knowledge, this class\nof ECDs was beyond the grasp of previous methods [16, 29, 30]. 
We begin with a key definition.\nDefinition 5 (Log-nonexpansive). Let f : R_{++} \u2192 (0, \u221e). We say f is log-nonexpansive (LN) on a\ncompact interval I \u2282 R_+ if there exists a fixed constant 0 \u2264 q \u2264 1 such that\n\n|log f(t) \u2212 log f(s)| \u2264 q |log t \u2212 log s|,   \u2200 s, t \u2208 I.   (14)\n\nIf q < 1, we say f is log-contractive. Finally, if for every s \u2260 t it holds that\n\n|log f(t) \u2212 log f(s)| < |log t \u2212 log s|,   \u2200 s, t,  s \u2260 t,\n\nwe say f is weakly log-contractive (wlc); an important point to note here is the absence of a fixed q.\n\nNext we study existence, uniqueness, and computation of solutions to (4). To that end, momentarily\nignore the constraint S \u227b 0, to see that the first-order necessary optimality condition for (4) is\n\n\u2202\u03a6(S)/\u2202S = 0 \u21d0\u21d2 (1/2) n S^{\u22121} + \u2211_{i=1}^{n} [\u03d5\u2032(x_i^T S^{\u22121} x_i) / \u03d5(x_i^T S^{\u22121} x_i)] S^{\u22121} x_i x_i^T S^{\u22121} = 0.   (15)\n\nDefining h \u2261 \u2212\u03d5\u2032/\u03d5, condition (15) may be rewritten more compactly as\n\nS = (2/n) \u2211_{i=1}^{n} x_i h(x_i^T S^{\u22121} x_i) x_i^T = (2/n) X h(D_S) X^T,   (16)\n\nwhere D_S := Diag(x_i^T S^{\u22121} x_i), and X = [x_1, . . . , x_n]. If (16) has a positive definite solution, then\nit is a candidate mle; if it is unique, then it is the desired solution (observe that if we have a Gaussian,\nthen h(t) \u2261 1/2, and as expected (16) reduces to the sample covariance matrix).\nBut how should we solve (16)? This question is in general highly nontrivial to answer because (16) is\na difficult nonlinear equation in matrix variables. This is the point where the class LN introduced above\ncomes into play. More specifically, we solve (16) via a fixed-point iteration. Introduce therefore the\nnonlinear map G : P_d \u2192 P_d that maps S to the right hand side of (16); then, starting with a feasible\nS_0 \u227b 0, simply perform the iteration\n\nS_{k+1} \u2190 G(S_k),   k = 0, 1, . . . ,   (17)\n\nwhich is shown more explicitly as Alg. 1 below.\n\nAlgorithm 1 Fixed-point iteration for mle\n\nInput: Observations x_1, . . . , x_n; function h\nInitialize: k \u2190 0; S_0 \u2190 I_d\nwhile \u00ac converged do\n  S_{k+1} \u2190 (2/n) \u2211_{i=1}^{n} x_i h(x_i^T S_k^{\u22121} x_i) x_i^T\n  k \u2190 k + 1\nend while\nreturn S_k\n\nThe most interesting twist to analysing iteration (17) is that the map G is usually not contractive with\nrespect to the Euclidean metric. But the metric geometry of P_d alluded to previously suggests that it\nmight be better to analyse the iteration using a non-Euclidean metric. Unfortunately, the Riemannian\ndistance (11) on P_d, while canonical, also turns out to be unsuitable. This impasse is broken by\nselecting a more suitable \u201chyperbolic distance\u201d that captures the crucial non-Euclidean geometry of\nP_d, while still respecting its convex conical structure.\nSuch a suitable choice is provided by the Thompson metric (an object of great interest in nonlinear\nmatrix equations [17]), which is known to possess geometric properties suitable for analysing convex\ncones, of which P_d is a shining example [18]. On P_d, the Thompson metric is given by\n\n\u03b4_T(X, Y) := \u2016log(Y^{\u22121/2} X Y^{\u22121/2})\u2016,   (18)\n\nwhere \u2016\u00b7\u2016 is the usual operator 2-norm, and \u2018log\u2019 is the matrix logarithm. The core properties of (18)\nthat prove useful for analysing fixed-point iterations are listed below; for proofs please see [17, 19].\nProposition 10. 
Unless noted otherwise, all matrices are assumed to be hpd.\n\n\u03b4_T(X^{\u22121}, Y^{\u22121}) = \u03b4_T(X, Y),   (19a)\n\u03b4_T(B^\u2217 X B, B^\u2217 Y B) = \u03b4_T(X, Y),   B \u2208 GL_n(C),   (19b)\n\u03b4_T(X^t, Y^t) \u2264 |t| \u03b4_T(X, Y),   for t \u2208 [\u22121, 1],   (19c)\n\u03b4_T(\u2211_i w_i X_i, \u2211_i w_i Y_i) \u2264 max_{1\u2264i\u2264m} \u03b4_T(X_i, Y_i),   w_i \u2265 0, w \u2260 0,   (19d)\n\u03b4_T(X + A, Y + A) \u2264 (\u03b1/(\u03b1+\u03b2)) \u03b4_T(X, Y),   A \u2ab0 0,   (19e)\n\nwhere \u03b1 = max{\u2016X\u2016, \u2016Y\u2016} and \u03b2 = \u03bb_min(A).\n\nWe need one more crucial result (see [28] for a proof), which we state below. This theorem should be\nof wider interest as it enlarges the class of maps that one can study using the Thompson metric.\nTheorem 11. Let X \u2208 C^{d\u00d7p}, where p \u2264 d, and rank(X) = p. Let A, B \u2208 P_d. Then,\n\n\u03b4_T(X^\u2217 A X, X^\u2217 B X) \u2264 \u03b4_T(A, B).   (20)\n\nWe now show how to use Prop. 10 and Thm. 11 to analyse contractions on P_d.\nProposition 12. Let h be an LN function. Then, the map G in (17) is nonexpansive in \u03b4_T. Moreover, if\nh is wlc, then G is weakly contractive in \u03b4_T.\nProof. Let S, R \u227b 0 be arbitrary. Then, we have the following chain of inequalities\n\n\u03b4_T(G(S), G(R)) = \u03b4_T((2/n) X h(D_S) X^T, (2/n) X h(D_R) X^T)\n\u2264 \u03b4_T(h(D_S), h(D_R)) \u2264 max_{1\u2264i\u2264n} \u03b4_T(h(x_i^T S^{\u22121} x_i), h(x_i^T R^{\u22121} x_i))\n\u2264 max_{1\u2264i\u2264n} \u03b4_T(x_i^T S^{\u22121} x_i, x_i^T R^{\u22121} x_i) \u2264 \u03b4_T(S^{\u22121}, R^{\u22121}) = \u03b4_T(S, R),\n\nwhere the first inequality follows from (19b) and Thm. 11; the second inequality follows since\nh(D_S) and h(D_R) are diagonal; the third follows from (19d); the fourth from another application of\nThm. 11; while the final equality is via (19a). 
This proves nonexpansivity. If in addition h is weakly\nlog-contractive and S \u2260 R, then the second inequality above is strict, that is,\n\n\u03b4_T(G(S), G(R)) < \u03b4_T(S, R)   \u2200 S, R with S \u2260 R.\n\nConsequently, we obtain the following main convergence theorem for (17).\nTheorem 13. If G is weakly contractive and (16) has a solution, then this solution is unique and\niteration (17) converges to it.\n\nWhen h is merely LN (not wlc), it is still possible to show uniqueness of the solution to (16) up to a constant. Our\nproof depends on the following new property of \u03b4_T, which again should be of broader interest.\nTheorem 14. Let G be nonexpansive in the \u03b4_T metric, that is \u03b4_T(G(X), G(Y)) \u2264 \u03b4_T(X, Y), and let F\nbe weakly contractive, that is \u03b4_T(F(X), F(Y)) < \u03b4_T(X, Y). Then G + F is also weakly contractive.\nObserve that the property proved in Thm. 14 is a striking feature of the nonpositive curvature of\nP_d; clearly, such a result does not usually hold in Banach spaces. As a consequence, Thm. 14 helps\nestablish the following \u201crobustness\u201d result for iteration (17).\nTheorem 15. If h is LN, and S_1 \u2260 S_2 are solutions to the nonlinear equation (16), then iteration\n(17) converges to a solution, and S_1 \u221d S_2.\nAs an illustrative example of these results, consider the problem of finding the minimiser of the negative\nlog-likelihood of a Kotz-type distribution. The convergence of the iterative algorithm in (17)\ncan be obtained from Thm. 15. But for the Kotz distribution we can show a stronger result, which\nhelps obtain geometric convergence rates for the fixed-point iteration.\nLemma 16. If c > 0 and \u22121 < b < 1, the function h(x) = x + c x^b is weakly log-contractive.\n\nAccording to this lemma, h in the iteration (16) for the Kotz-type distributions with 0 < \u03b2 < 2\nand \u03b1 < d/2 is wlc. Based on Thm. 9, K(S) has a minimum. Therefore, we have the following.\nCorollary 17. 
The iterative algorithm (16) for the Kotz-type distribution with 0 < \u03b2 < 2 and \u03b1 < d/2\nconverges to a unique fixed point.\n\n4 Numerical results\n\nWe briefly highlight the numerical performance of our fixed-point iteration. The key message here\nis that our fixed-point iterations solve nonconvex likelihood maximisation problems that involve a\ncomplicating hpd constraint. But since the fixed-point iterations always generate hpd iterates, no\nextra eigenvalue computation is needed, which leads to substantial computational advantages. In\ncontrast, a nonlinear solver must perform constrained optimisation, which can be unduly expensive.\n\nFigure 1: Running-time comparison of the fixed-point iteration with MATLAB\u2019s fmincon to\nmaximise a Kotz-likelihood (see text for details). The plots show (from left to right) running times for estimating\nS \u2208 P_d, for d \u2208 {4, 16, 32}. Larger d was not tried because fmincon does not scale.\n\nFigure 2: In the Kotz-type distribution, when \u03b2 gets close to zero or 2, the contraction factor becomes smaller\nwhich could impact the convergence rate. This figure shows running time variance for Kotz-type distributions\nwith fixed d = 16, and \u03b1 = 2, for different values of \u03b2: \u03b2 = 0.1, \u03b2 = 1, \u03b2 = 1.7.\n\nWe present two short experiments (Figs. 1 and 2) showing scalability of the fixed-point iteration with\nincreasing dimensionality of the input matrix, and for varying \u03b2 parameter of the Kotz distribution; this\nparameter influences the convergence rate of the fixed-point iteration. For three different dimensions\nd = 4, d = 16, and d = 32, we sample 10,000 datapoints from a Kotz-type distribution with\n\u03b2 = 0.5, \u03b1 = 2, and a random covariance matrix. The convergence speed is shown as blue curves\nin Figure 1. 
For comparison, the results of constrained optimisation (red curves) using MATLAB\u2019s\noptimisation toolbox are shown. The fixed-point algorithm clearly outperforms MATLAB\u2019s toolbox,\nespecially as dimensionality increases. These results indicate that the fixed-point approach can be very\ncompetitive. Also note that the problems are nonconvex with an open constraint set; this precludes\ndirect application of simple approaches such as gradient-projection (since projection requires closed\nsets; moreover, projection also requires eigenvector decompositions). Additional comparisons in the\nlonger version [28] show that the fixed-point iteration also significantly outperforms sophisticated\nmanifold optimisation techniques [5], especially for increasing data dimensionality.\n\n5 Conclusion\n\nWe developed geometric optimisation for minimising potentially nonconvex functions over the set of\npositive definite matrices. We showed key results that help recognise geodesic convexity; we also\nintroduced the class of log-nonexpansive functions that contains functions that need not be g-convex,\nbut can still be optimised efficiently. Key to our ideas here was a careful construction of fixed-point\niterations in a suitably chosen metric space. We motivated, developed, and applied our results to\nthe task of maximum likelihood estimation for various elliptically contoured distributions, covering\nclasses and examples substantially beyond what had been known so far in the literature. We believe\nthat the general geometric optimisation techniques that we developed in this paper will prove to be of\nwider use and interest beyond our motivating application. Developing a more extensive geometric\noptimisation numerical package is part of our ongoing project.\n\nReferences\n[1] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.\n[2] R. Bhatia and R. L. Karandikar. The matrix geometric mean. 
Technical Report isid/ms/2-11/02, Indian Statistical Institute, 2011.\n[3] D. A. Bini and B. Iannazzo. Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 438(4):1700\u20131710, 2013.\n\nFigures 1 and 2 (plots omitted): log running time (seconds) versus log \u03a6(S)\u2212\u03a6(Smin) for the fixed-point iteration and fmincon.\n\n[4] G. Blekherman and P. A. Parrilo, editors. Semidefinite Optimization and Convex Algebraic Geometry. SIAM, 2013.\n[5] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt: a Matlab toolbox for optimization on manifolds. arXiv Preprint 1308.5200, 2013.\n[6] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A Tutorial on Geometric Programming. Optimization and Engineering, 8(1):67\u2013127, 2007.\n[7] M. R. Bridson and A. Haefliger. Metric Spaces of Non-Positive Curvature. Springer, 1999.\n[8] S. Cambanis, S. Huang, and G. Simons. On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11(3):368\u2013385, 1981.\n[9] Y. Chen, A. Wiesel, and A. 
Hero. Robust shrinkage estimation of high-dimensional covariance matrices. IEEE Transactions on Signal Processing, 59(9):4097\u20134107, 2011.\n[10] G. Cheng and B. Vemuri. A novel dynamic system in the space of SPD matrices with applications to appearance tracking. SIAM Journal on Imaging Sciences, 6(1):592\u2013615, 2013.\n[11] G. Cheng, H. Salehian, and B. C. Vemuri. Efficient Recursive Algorithms for Computing the Mean Diffusion Tensor and Applications to DTI Segmentation. In European Conference on Computer Vision (ECCV), volume 7, pages 390\u2013401, 2012.\n[12] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Jensen-Bregman LogDet Divergence for Efficient Similarity Computations on Positive Definite Tensors. IEEE TPAMI, 2012.\n[13] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman and Hall/CRC, 1999.\n[14] L. Gurvits and A. Samorodnitsky. A deterministic algorithm for approximating mixed discriminant and mixed volume, and a combinatorial corollary. Disc. Comp. Geom., 27(4), 2002.\n[15] K.-T. Fang, S. Kotz, and K. W. Ng. Symmetric multivariate and related distributions. Chapman & Hall, 1990.\n[16] J. T. Kent and D. E. Tyler. Redescending M-estimates of multivariate location and scatter. The Annals of Statistics, 19(4):2102\u20132119, Dec. 1991.\n[17] H. Lee and Y. Lim. Invariant metrics, contractions and nonlinear matrix equations. Nonlinearity, 21:857\u2013878, 2008.\n[18] B. Lemmens and R. Nussbaum. Nonlinear Perron-Frobenius Theory. Cambridge Univ. Press, 2012.\n[19] Y. Lim and M. P\u00e1lfia. Matrix power means and the Karcher mean. J. Functional Analysis, 262:1498\u20131514, 2012.\n[20] R. J. Muirhead. Aspects of multivariate statistical theory. John Wiley, 1982.\n[21] Y. Nesterov and A. S. Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.\n[22] F. Nielsen and R. Bhatia, editors. Matrix Information Geometry. Springer, 2013.\n[23] E. Ollila, D. Tyler, V. Koivunen, and H. V. Poor. Complex elliptically symmetric distributions: Survey, new results and applications. IEEE Transactions on Signal Processing, 60(11):5597\u20135625, 2011.\n[24] A. Papadopoulos. Metric spaces, convexity and nonpositive curvature. Europ. Math. Soc., 2005.\n[25] T. Rapcs\u00e1k. Geodesic convexity in nonlinear optimization. J. Optim. Theory and Appl., 69(1):169\u2013183, 1991.\n[26] R. T. Rockafellar and R. J.-B. Wets. Variational analysis. Springer, 1998.\n[27] S. Sra. Positive Definite Matrices and the Symmetric Stein Divergence. arXiv:1110.1773, Oct. 2012.\n[28] S. Sra and R. Hosseini. Conic geometric optimisation on the manifold of positive definite matrices. arXiv preprint, 2013.\n[29] A. Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182\u201389, 2012.\n[30] T. Zhang, A. Wiesel, and S. Greco. Multivariate generalized Gaussian distribution: Convexity and graphical models. arXiv preprint arXiv:1304.3206, Nov. 2013.\n[31] H. Zhu, H. Zhang, J. Ibrahim, and B. Peterson. Statistical analysis of diffusion tensors in diffusion-weighted magnetic resonance imaging data. Journal of the American Statistical Association, 102(480):1085\u20131102, 2007.\n", "award": [], "sourceid": 1216, "authors": [{"given_name": "Suvrit", "family_name": "Sra", "institution": "MPI for Intelligent Systems & CMU"}, {"given_name": "Reshad", "family_name": "Hosseini", "institution": "MPI T\u00fcbingen"}]}