{"title": "Transelliptical Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 800, "page_last": 808, "abstract": null, "full_text": "Transelliptical Graphical Models\n\nHan Liu\n\nDepartment of Operations Research\n\nand Financial Engineering\n\nPrinceton University, NJ 08544\nhanliu@princeton.edu\n\nFang Han\n\nDepartment of Biostatistics\nJohns Hopkins University\n\nBaltimore, MD 21210\nfhan@jhsph.edu\n\nCun-hui Zhang\n\nDepartment of Statistics\n\nRutgers University\n\nPiscataway, NJ 08854\n\ncunhui@stat.rutgers.edu\n\nAbstract\n\nWe advocate the use of a new distribution family\u2014the transelliptical\u2014for robust\ninference of high dimensional graphical models. The transelliptical family is an\nextension of the nonparanormal family proposed by Liu et al. (2009). Just as the\nnonparanormal extends the normal by transforming the variables using univariate\nfunctions, the transelliptical extends the elliptical family in the same way. We\npropose a nonparametric rank-based regularization estimator which achieves the\nparametric rates of convergence for both graph recovery and parameter estimation.\nSuch a result suggests that the extra robustness and flexibility obtained by\nthe semiparametric transelliptical modeling incurs almost no efficiency loss. We\nalso discuss the relationship between this work and the transelliptical component\nanalysis proposed by Han and Liu (2012).\n\n1 Introduction\nWe consider the problem of learning high dimensional graphical models.\nIn a typical setting, a\nd-dimensional random vector X = (X1, ..., Xd)T can be represented as an undirected graph denoted\nby G = (V, E), where V contains nodes corresponding to the d variables in X, and the\nedge set E describes the conditional independence relationship among X1, ..., Xd. Let X\\{i,j} :=\n{Xk : k \u2260 i, j}. We say the joint distribution of X is Markov to G if Xi is independent of Xj given\nX\\{i,j} for all (i, j) \u2209 E. 
While often G is assumed given, here we want to estimate it from data.\nMost graph estimation methods rely on the Gaussian graphical models, in which the random vector\nX is assumed to be Gaussian: X \u223c Nd(\u00b5, \u03a3). Under this assumption, the graph G is encoded\nby the precision matrix \u0398 := \u03a3\u22121. More specifically, no edge connects Xj and Xk if and only\nif \u0398jk = 0. This problem of estimating G is called covariance selection [5]. In low dimensions\nwhere d < n, [6, 7] develop a multiple testing procedure for identifying the sparsity pattern of the\nprecision matrix. In high dimensions where d \u226b n, [21] propose a neighborhood pursuit approach\nfor estimating Gaussian graphical models by solving a collection of sparse regression problems using\nthe Lasso [25, 3]. Such an approach can be viewed as a pseudo-likelihood approximation of the full\nlikelihood.\nIn contrast, [1, 30, 10] propose a penalized likelihood approach to directly estimate\n\u0398. [15, 14, 24] maximize the non-concave penalized likelihood to obtain an estimator with less\nbias than the traditional L1-regularized estimator. Under the irrepresentable conditions [33, 31, 27],\n[22, 23] study the theoretical properties of the penalized likelihood methods. More recently, [29, 2]\npropose the graphical Dantzig selector and CLIME, which can be solved by linear programming and\npossess more favorable theoretical properties than the penalized likelihood approach.\nBesides Gaussian models, [18] propose a semiparametric procedure named nonparanormal SKEPTIC\nwhich extends the Gaussian family to the more flexible semiparametric Gaussian copula family.\nInstead of assuming X follows a Gaussian distribution, they assume there exists a set of monotone\nfunctions f1, . . . , fd, such that the transformed data f(X) := (f1(X1), . . . , fd(Xd))T is Gaussian.\nMore details can be found in [18]. 
[32] has developed a scalable software package to implement\nthese algorithms. In another line of research, [26] extends the Gaussian graphical models to the\nelliptical graphical models. However, for elliptical distributions, only the generalized partial\ncorrelation graph can be reliably estimated. These graphs only represent conditional\nuncorrelatedness, not conditional independence, among variables. Therefore, by extending the Gaussian to the\nelliptical family, the gain in modeling flexibility is traded off with a loss in the strength of inference.\nIn a related work, [9] provide a latent variable interpretation of the generalized partial correlation\ngraph for multivariate t-distributions. An EM-type algorithm is proposed to fit the model for high\ndimensional data. However, the theoretical properties of their estimator are unknown.\nIn this paper, we introduce a new distribution family named transelliptical graphical model. A key\nconcept is the transelliptical distribution [12]. The transelliptical distribution is a generalization of\nthe nonparanormal distribution proposed by [18]. Just as the nonparanormal extends the\nnormal family, the transelliptical extends the elliptical family. The transelliptical\nfamily contains the nonparanormal family and the elliptical family. To infer the graph structure, a rank-based\nprocedure using the Kendall\u2019s tau statistic is proposed. We show such a procedure is adaptive\nover the transelliptical family: the procedure by default delivers a conditional uncorrelatedness graph\namong certain latent variables; however, if the true distribution is the nonparanormal, the procedure\nautomatically delivers the conditional independence graph. Computationally, the only extra cost is a\none-pass data sort, which is almost negligible. 
Theoretically, even though the transelliptical family\nis much larger than the nonparanormal family, the same parametric rates of convergence for graph\nrecovery and parameter estimation can be established. These results suggest that the transelliptical\ngraphical model can be used routinely as a replacement of the nonparanormal models. Thorough\nnumerical results are provided to back up our theory.\n2 Background on Elliptical Distributions\nLet X and Y be two random variables, we denote by X d= Y if they have the same distribution.\nDefinition 2.1 (elliptical distribution [8]). Let \u00b5 \u2208 Rd and \u03a3 \u2208 Rd\u00d7d with rank(\u03a3) = q \u2264 d. A\nd-dimensional random vector X has an elliptical distribution, denoted by X \u223c ECd(\u00b5, \u03a3, \u03be), if it\nhas a stochastic representation: X d= \u00b5 + \u03beAU, where U is a random vector uniformly distributed\non the unit sphere in Rq, \u03be \u2265 0 is a scalar random variable independent of U, A \u2208 Rd\u00d7q is a\ndeterministic matrix such that AAT = \u03a3.\nRemark 2.1. An equivalent definition of an elliptical distribution is that its characteristic function\ncan be written as exp(itT \u00b5)\u03c6(tT \u03a3t), where \u03c6 is a properly-defined characteristic function which\nhas a one-to-one mapping with \u03be in Definition 2.1. In this setting we denote by X \u223c ECd(\u00b5, \u03a3, \u03c6).\nAn elliptical distribution does not necessarily have a density. One example is the rank-deficient\nGaussian. More examples can be found in [11]. However, when the random variable \u03be is absolutely\ncontinuous with respect to the Lebesgue measure and \u03a3 is non-singular, the density of X exists and\nhas the form\n\np(x) = |\u03a3|\u22121/2 g((x \u2212 \u00b5)T \u03a3\u22121(x \u2212 \u00b5)),    (1)\n\nwhere g(\u00b7) is a scale function uniquely determined by the distribution of \u03be. 
In this case, we can also\ndenote it as X \u223c ECd(\u00b5, \u03a3, g). Many multivariate distributions belong to the elliptical family. For\nexample, when g(x) = (2\u03c0)\u2212d/2 exp{\u2212x/2}, X is d-dimensional Gaussian. Another important\nsubclass is the multivariate t-distribution with v degrees of freedom, in which we choose\n\ng(x) = cv \u00b7 \u0393((v + d)/2) / ((v\u03c0)^{d/2} \u0393(v/2)) \u00b7 (1 + cv^2 x/v)^{\u2212(v+d)/2},    (2)\n\nwhere cv is a normalizing constant.\nThe model family in Definition 2.1 is not identifiable. For example, given X \u223c ECd(\u00b5, \u03a3, \u03be) with\nrank(\u03a3) = q, there will be multiple As corresponding to the same \u03a3, i.e., there exist A1 \u2260 A2 \u2208\nRd\u00d7q such that A1 A1^T = A2 A2^T = \u03a3. Moreover, for some constant c \u2260 0, we define \u03be\u2217 = \u03be/c and A\u2217 = c \u00b7 A,\nthen \u03beAU = \u03be\u2217A\u2217U. Therefore, the matrix \u03a3 is unique only up to a constant scaling. To make the\nmodel identifiable, we impose the condition that max{diag(\u03a3)} = 1. More discussions about the\nidentifiability issue can be found in [12].\n3 Transelliptical Graphical Models\nIn this paper we only consider distributions with continuous marginals. We introduce the transelliptical\ngraphical models in analogy to the nonparanormal graphical models [19, 18]. The key concept\nis the transelliptical distribution, which is also introduced in [12]. However, the definition of the transelliptical\ndistribution in this paper is slightly more restrictive than that in [12] due to the complication of\ngraphical modeling. More specifically, let\n\nR+d := {\u03a3 \u2208 Rd\u00d7d : \u03a3T = \u03a3, diag(\u03a3) = 1, \u03a3 \u227b 0},    (3)\n\nwe define the transelliptical distribution as follows:\nDefinition 3.1 (transelliptical distribution). A continuous random vector X = (X1, . . . 
, Xd)T is\ntranselliptical, denoted by X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd), if there exists a set of monotone univariate\nfunctions f1, . . . , fd and a nonnegative random variable \u03be satisfying P(\u03be = 0) = 0, such that\n\n(f1(X1), . . . , fd(Xd))T \u223c ECd(0, \u03a3, \u03be), where \u03a3 \u2208 R+d.    (4)\n\nHere, \u03a3 is called the latent generalized correlation matrix (see footnote 1).\nWe then discuss the relationship between the transelliptical family and the nonparanormal family,\nwhich is defined as follows:\nDefinition 3.2 (nonparanormal distribution). A random vector X = (X1, . . . , Xd)T is nonparanormal,\ndenoted by X \u223c N P Nd(\u03a3; f1, . . . , fd), if there exist monotone functions f1, . . . , fd such\nthat (f1(X1), . . . , fd(Xd))T \u223c Nd(0, \u03a3), where \u03a3 \u2208 R+d is called the latent correlation matrix.\nFrom Definitions 3.1 and 3.2, we see the transelliptical is a strict extension of the nonparanormal.\nBoth families assume there exists a set of univariate transformations such that the transformed data\nfollow a base distribution: the nonparanormal exploits a normal base distribution, while the transelliptical\nexploits an elliptical base distribution. In the nonparanormal, \u03a3 is the correlation matrix for\nthe latent normal, therefore it is called the latent correlation matrix; in the transelliptical, \u03a3 is the generalized\ncorrelation matrix for the latent elliptical distribution, therefore it is called the latent generalized\ncorrelation matrix.\nWe now define the transelliptical graphical models. Let X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd) where \u03a3 \u2208 R+d\nis the latent generalized correlation matrix. In this paper, we always assume the second moment\nE\u03be2 < \u221e. We define \u0398 := \u03a3\u22121 to be the latent generalized concentration matrix. Let \u0398jk be the\nelement of \u0398 on the j-th row and k-th column. 
We define the latent generalized partial correlation\nmatrix \u0393 as \u0393jk := \u2212\u0398jk/\u221a(\u0398jj \u00b7 \u0398kk). Let diag(A) be the matrix A with off-diagonal elements\nreplaced by zero and A1/2 be the square root matrix of A. It is easy to see that\n\n\u0393 = \u2212[diag(\u03a3\u22121)]\u22121/2 \u03a3\u22121 [diag(\u03a3\u22121)]\u22121/2.    (5)\n\nTherefore, \u0393 has the same nonzero pattern as \u03a3\u22121. We then define an undirected graph G = (V, E):\nthe vertex set V contains nodes corresponding to the d variables in X, and the edge set E satisfies\n\n(Xj, Xk) \u2208 E if and only if \u0393jk \u2260 0 for j, k = 1, . . . , d.    (6)\n\nGiven a graph G, we define R+d(G) to be the set containing all the \u03a3 \u2208 R+d with zero entries at the\npositions specified by the graph G. The transelliptical graphical model induced by G is defined as:\nDefinition 3.3 (transelliptical graphical model). The transelliptical graphical model induced by a\ngraph G, denoted by P(G), is defined to be the set of distributions:\n\nP(G) := {all the transelliptical distributions T Ed(\u03a3, \u03be; f1, . . . , fd) satisfying \u03a3 \u2208 R+d(G)}.    (7)\n\nIn the rest of this section, we prove some properties of the transelliptical family and discuss the\ninterpretation of the graph G. This graph is called the latent generalized partial correlation\ngraph. First, we show the transelliptical family is closed under marginalization and conditioning.\n\nFootnote 1: One thing to note is that in [12], the condition that \u03a3 \u2208 R+d is not required.\n\nLemma 3.1. Let X := (X1, . . . , Xd)T \u223c T Ed(\u03a3, \u03be; f1, . . . , fd). The marginal and the conditional\ndistributions of (X1, X2)T given the remaining variables are still transelliptical.\nProof. Since X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd), we have (f1(X1), . . . , fd(Xd))T \u223c ECd(0, \u03a3, \u03be). Let\nZj := fj(Xj) for j = 1, . . . 
, d. From Theorem 2.18 of [8], the marginal distribution of (Z1, Z2)T\nand the conditional distribution of (Z1, Z2)T given the remaining Z3, . . . , Zd are both elliptical. By\nde\ufb01nition, the marginal distribution of (X1, X2)T is transelliptical. To see the conditional case, since\nX has continuous marginals and f1, . . . , fd are monotone, the distribution of (X1, X2)T conditional\non X\\{1,2} is the same as conditional on Z\\{1,2}. Combined with the fact that Z1 = f1(X1),\nZ2 = f2(X2), we know that (X1, X2)T | X\\{1,2} follows a transelliptical distribution.\n\nFrom (5), we see the matrices \u0393 and \u0398 have the same nonzero pattern, therefore, they encode\nthe same graph G. Let X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd). The next lemma shows that, if the second\nmoment of X exists, the absence of an edge in the graph G is equivalent to the pairwise conditional\nuncorrelatedness of two corresponding latent variables.\nLemma 3.2. Let X :=(X1, . . . , Xd)T \u223c T Ed(\u03a3, \u03be; f1, . . . , fd) with E\u03be2 < \u221e, and Zj := fj(Xj)\nfor j = 1, . . . , d. \u0393jk = 0 if and only if Zj and Zk are conditionally uncorrelated given Z\\{j,k}.\nProof. Let Z := (Z1, . . . , Zd)T . Since X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd), we have Z \u223c ECd(0, \u03a3, \u03be).\nTherefore, the latent generalized correlation matrix \u03a3 is the generalized correlation matrix of the\nlatent variable Z. It suf\ufb01ces to prove that, for elliptical distributions with E\u03be2 < \u221e, the generalized\npartial correlation matrix \u0393 as de\ufb01ned in (5) encodes the conditional uncorrelatedness among the\nvariables. Such a result has been proved in the section 2 of [26].\nLet A, B, C \u2282 {1, . . . , d}. We say C separates A and B in the graph G if any path from a node\nin A to a node in B goes through at least one node in C. We denote by XA the subvector of X\nindexed by A. 
The next lemma implies the equivalence between the pairwise and global conditional\nuncorrelatedness of the latent variables for the transelliptical graphical models. This lemma connects\nthe graph theory with probability theory.\nLemma 3.3. Let X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd) be any element of the transelliptical graphical model\nP(G) satisfying E\u03be2 < \u221e. Let Z := (Z1, . . . , Zd)T with Zj = fj(Xj) and A, B, C \u2282 {1, . . . , d}.\nThen C separates A and B in G if and only if ZA and ZB are conditionally uncorrelated given ZC.\nProof. By definition, we know Z \u223c ECd(0, \u03a3, \u03be). It then suffices to show the pairwise conditional\nuncorrelatedness implies the global conditional uncorrelatedness for the elliptical family. This follows\nfrom the same induction argument as in Theorem 3.7 of [16].\n\nCompared with the nonparanormal graphical model, the transelliptical graphical model gains a lot\nin modeling flexibility, but at the price of inferring a weaker notion of graphs: a missing edge in\nthe graph only represents the conditional uncorrelatedness of the latent variables. The next lemma\nshows that we do not lose anything compared with the nonparanormal graphical model. The proof\nof this lemma is simple and is omitted. Some related discussions can be found in [19].\nLemma 3.4. Let X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd) be a member of the transelliptical graphical model\nP(G). If X is also nonparanormal, the graph G encodes the conditional independence relationship\nof X (in other words, the distribution of X is Markov to G).\n\n4 Rank-based Regularization Estimator\nIn this section, we propose a nonparametric rank-based regularization estimator which achieves the\noptimal parametric rates of convergence for both graph recovery and parameter estimation. 
The\nmain idea of our procedure is to treat the marginal transformation functions fj and the generating\nvariable \u03be as nuisance parameters, and exploit the nonparametric Kendall\u2019s tau statistic to directly\nestimate the latent generalized correlation matrix \u03a3. The obtained correlation matrix estimate is then\nplugged into the CLIME procedure to estimate the sparse latent generalized concentration matrix \u0398.\nFrom the previous discussion, we know the graph G is encoded by the nonzero pattern of \u0398. We\nthen get a graph estimator by thresholding the estimated \u0398\u0302.\n\n4.1 The Kendall\u2019s tau Statistic and its Invariance Property\nLet x1, . . . , xn \u2208 Rd be n observations of a random vector X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd). Our task is\nto estimate the latent generalized concentration matrix \u0398 := \u03a3\u22121. The Kendall\u2019s tau is defined as:\n\n\u03c4\u0302jk = 2/(n(n \u2212 1)) \u2211_{1\u2264i<i\u2032\u2264n} sign((x^i_j \u2212 x^{i\u2032}_j)(x^i_k \u2212 x^{i\u2032}_k)),    (8)\n\nwhere x^i_j denotes the j-th entry of the i-th observation. This is a monotone transformation-invariant correlation between the empirical realizations of two\nrandom variables Xj and Xk. Let X\u0303j and X\u0303k be two independent copies of Xj and Xk. The\npopulation version of the Kendall\u2019s tau statistic is \u03c4jk := Corr(sign(Xj \u2212 X\u0303j), sign(Xk \u2212 X\u0303k)).\n\nLet X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd). The following theorem from [12] illustrates an important relationship\nbetween the population Kendall\u2019s tau statistic \u03c4jk and the latent generalized correlation\ncoefficient \u03a3jk.\nTheorem 4.1 (Invariance Property of Kendall\u2019s tau Statistic [12]). Let X := (X1, . . . , Xd)T \u223c\nT Ed(\u03a3, \u03be; f1, . . . , fd). 
We denote by \u03c4jk the population Kendall\u2019s tau statistic between Xj and\nXk. Then \u03a3jk = sin((\u03c0/2) \u03c4jk).\n\n4.2 Rank-based Regularization Method\nWe start with some notations. We denote by I(\u00b7) the indicator function and by Id the identity\nmatrix. Given a matrix A, we define \u2016A\u2016max := maxjk |Ajk| and \u2016A\u20161 := \u2211jk |Ajk|.\nMotivated by Theorem 4.1, we define S\u0302 = [S\u0302jk] \u2208 Rd\u00d7d to estimate \u03a3:\n\nS\u0302jk = sin((\u03c0/2) \u03c4\u0302jk) \u00b7 I(j \u2260 k) + I(j = k).    (9)\n\nWe then plug S\u0302 into the CLIME estimator [2] to get the final parameter and graph estimates. More\nspecifically, the latent generalized concentration matrix \u0398 can be estimated by solving\n\n\u0398\u0302 = arg min_\u0398 \u2211_{j,k} |\u0398jk| s.t. \u2016S\u0302\u0398 \u2212 Id\u2016max \u2264 \u03bb,    (10)\n\nwhere \u03bb > 0 is a tuning parameter.\n[2] show that this optimization can be decomposed into d\nvector minimization problems, each of which can be reformulated as a linear program. Thus it\nhas the potential to scale to very large problems. Once \u0398\u0302 is obtained, we can apply an additional\nthresholding step to estimate the graph G. For this, we define a graph estimator G\u0302 = (V, E\u0302), in\nwhich an edge (j, k) \u2208 E\u0302 if \u0398\u0302jk \u2265 \u03b3. Here \u03b3 is another tuning parameter.\nCompared with the original CLIME, the extra cost of our rank-based procedure is the computation of\nS\u0302, which requires us to evaluate d(d\u22121)/2 pairwise Kendall\u2019s tau statistics. A naive implementation\nof the Kendall\u2019s tau requires O(n^2) computation. However, an efficient algorithm based on sorting and\nbalanced binary trees has been developed to calculate the Kendall\u2019s tau statistic with a computational\ncomplexity O(n log n) [4]. Therefore, the incurred computational burden is negligible.\nRemark 4.1. Similar rank-based procedures have been discussed in [19, 18, 28]. 
Unlike our work,\nthey focus on the more restrictive nonparanormal family and discuss several rank-based procedures\nusing the normal-score, Spearman\u2019s rho, and Kendall\u2019s tau. Unlike our results, they advocate the\nuse of the Spearman\u2019s rho and normal-score correlation coefficients. Their main concern is that,\nwithin the more restrictive nonparanormal family, the Spearman\u2019s rho and normal-score correlations\nare slightly easier to compute and have smaller asymptotic variance. In contrast to their results,\nthe new insight obtained from this current paper is that we advocate the usage of the Kendall\u2019s tau\ndue to its invariance property within the much larger transelliptical family. In fact, we can show that\nthe Spearman\u2019s rho is not invariant within the transelliptical family unless the true distribution is\nnonparanormal. More details on this issue can be found in [8].\n5 Asymptotic Properties\nWe analyze the theoretical properties of the rank-based regularization estimator proposed in Section\n4.2. Our main result shows: under the same conditions on \u03a3 that ensure the parameter estimation\nand graph recovery consistency of the original CLIME estimator for Gaussian graphical models, our\nrank-based regularization procedure achieves exactly the same parametric rates of convergence for\nboth parameter estimation and graph recovery for the much larger transelliptical family. This result\nsuggests that the transelliptical graphical model can be used as a safe replacement of the Gaussian\ngraphical models, the nonparanormal graphical models, and the elliptical graphical models.\nWe introduce some additional notations. Given a symmetric matrix A, for 0 \u2264 q < 1, we define\n\u2016A\u2016Lq := maxi \u2211j |Aij|^q and the spectral norm \u2016A\u2016L2 to be its largest eigenvalue. 
We define\n\nSd(q, s, M) := {\u0398 : \u2016\u0398\u2016L1 \u2264 M and \u2016\u0398\u2016Lq \u2264 s}.    (11)\n\nFor q = 0, the class Sd(0, s, M) contains all the s-sparse matrices. Our main result is Theorem 5.1.\nTheorem 5.1. Let X \u223c T Ed(\u03a3, \u03be; f1, . . . , fd) with \u03a3 \u2208 R+d and \u0398 := \u03a3\u22121 \u2208 Sd(q, s, M) with\n0 \u2264 q < 1. Let \u0398\u0302 be defined in (10). There exist constants C0 and C1 only depending on q, such\nthat, whenever \u03bb = C0 M \u221a((log d)/n), with probability no less than 1 \u2212 d^{\u22122}, we have\n\n(Parameter estimation)    \u2016\u0398\u0302 \u2212 \u0398\u2016L2 \u2264 C1 M^{2\u22122q} \u00b7 s \u00b7 ((log d)/n)^{(1\u2212q)/2}.    (12)\n\nLet G\u0302 be the graph estimator defined in Section 4.2 with the additional tuning parameter \u03b3 = 4M\u03bb.\nIf we further assume \u0398 \u2208 Sd(0, s, M) and min_{j,k: |\u0398jk| \u2260 0} |\u0398jk| \u2265 2\u03b3, then\n\n(Graph recovery)    P(G\u0302 = G) \u2265 1 \u2212 o(1),    (13)\n\nwhere G is the graph determined by the nonzero pattern of \u0398.\nProof. The difference between the rank-based CLIME and the original CLIME is that we replace\nthe Pearson correlation coefficient matrix R\u0302 by the Kendall\u2019s tau matrix S\u0302. By examining the proofs\nof Theorems 1 and 7 in [2], the only property needed of R\u0302 is an exponential concentration inequality\n\nP(|R\u0302jk \u2212 \u03a3jk| > t) \u2264 c1 exp(\u2212c2 n t^2).\n\nTherefore, it suffices if we can prove a similar concentration inequality for |S\u0302jk \u2212 \u03a3jk|. Since\n\nS\u0302jk = sin((\u03c0/2) \u03c4\u0302jk) and \u03a3jk = sin((\u03c0/2) \u03c4jk),\n\nand the sine function is 1-Lipschitz, we have |S\u0302jk \u2212 \u03a3jk| \u2264 (\u03c0/2) |\u03c4\u0302jk \u2212 \u03c4jk|. 
Therefore, we only need to prove\n\nP(|\u03c4\u0302jk \u2212 \u03c4jk| > t) \u2264 exp(\u2212nt^2/(2\u03c0)).\n\nThis result holds since \u03c4\u0302jk is a U-statistic: \u03c4\u0302jk = 2/(n(n \u2212 1)) \u2211_{1\u2264i<i\u2032\u2264n} K\u03c4(x^i, x^{i\u2032}), where\nK\u03c4(x^i, x^{i\u2032}) = sign((x^i_j \u2212 x^{i\u2032}_j)(x^i_k \u2212 x^{i\u2032}_k)) is a kernel bounded between \u22121 and 1. The result follows\nfrom the Hoeffding\u2019s inequality for U-statistic [13].\n\n6 Numerical Experiments\nWe investigate the empirical performance of the rank-based regularization estimator. We compare\nit with the following methods: (1) Pearson: the CLIME using the Pearson sample correlation; (2)\nKendall: the CLIME using the Kendall\u2019s tau; (3) Spearman: the CLIME using the Spearman\u2019s\nrho; (4) NPN: the CLIME using the original nonparanormal correlation estimator [19]; (5) NS:\nthe CLIME using the normal score correlation. The latter three methods are discussed under the\nnonparanormal graphical model and we refer to [18] for detailed descriptions.\n6.1 Simulation Studies\nWe adopt the same data generating procedure as in [18]. To generate a d-dimensional sparse graph\nG = (V, E) where V = {1, . . . , d} corresponds to variables X = (X1, . . . , Xd), we associate each\nindex j \u2208 {1, . . . , d} with a bivariate data point (Y^(1)_j, Y^(2)_j) \u2208 [0, 1]^2 where Y^(k)_1, . . . , Y^(k)_n \u223c\nUniform[0, 1] for k = 1, 2. 
Each pair of vertices (i, j) is included in the edge set E with probability\nP((i, j) \u2208 E) = exp(\u2212\u2016yi \u2212 yj\u2016n^2/0.25)/\u221a(2\u03c0), where yi := (y^(1)_i, y^(2)_i) is the empirical observation\nof (Y^(1)_i, Y^(2)_i) and \u2016 \u00b7 \u2016n represents the Euclidean distance. We restrict the maximum degree of the\ngraph to be 4 and build the inverse correlation matrix \u2126 according to \u2126jk = 1 if j = k, \u2126jk = 0.145\nif (j, k) \u2208 E, and \u2126jk = 0 otherwise. The value 0.145 guarantees the positive definiteness of \u2126.\nLet \u03a3 = \u2126\u22121. To obtain the correlation matrix, we rescale \u03a3 so that all its diagonal elements are 1.\nIn the simulated study we randomly sample n data points from a certain transelliptical distribution\nX \u223c T Ed(\u03a3, \u03be; f1, . . . , fd). We set d = 100. To determine the transelliptical distribution, we first\ngenerate \u03a3 as discussed in the previous paragraph. Secondly, three types of \u03be are considered:\n(1) \u03be(1) \u223c \u03c7d, i.e., \u03be follows a chi-distribution with degree of freedom d;\n(2) \u03be(2) d= \u03be\u22171/\u03be\u22172, \u03be\u22171 \u223c \u03c7d, \u03be\u22172 \u223c \u03c71, \u03be\u22171 is independent of \u03be\u22172;\n(3) \u03be(3) \u223c F (d, 1), i.e., \u03be follows an F -distribution with degree of freedom d and 1.\nThirdly, two types of transformation functions f = {fj}_{j=1}^d are considered:\n(1) linear transformation: f (1) = {f0, . . . , f0} with f0(x) = x;\n(2) nonlinear transformation: f (2) = {f1, . . . , fd} = {h1, h2, h3, h4, h5, h1, h2, h3, h4, h5, . . 
.},\nwhere h1^{\u22121}(x) := x, h2^{\u22121}(x) := sign(x)|x|^{1/2} / \u221a(\u222b|t|\u03c6(t)dt), h3^{\u22121}(x) := x^3 / \u221a(\u222bt^6 \u03c6(t)dt),\nh4^{\u22121}(x) := (\u03a6(x) \u2212 \u222b\u03a6(t)\u03c6(t)dt) / \u221a(\u222b(\u03a6(y) \u2212 \u222b\u03a6(t)\u03c6(t)dt)^2 \u03c6(y)dy),\nh5^{\u22121}(x) := (exp(x) \u2212 \u222bexp(t)\u03c6(t)dt) / \u221a(\u222b(exp(y) \u2212 \u222bexp(t)\u03c6(t)dt)^2 \u03c6(y)dy).\n\nWe consider the following four data generating schemes:\n\u2022 Scheme 1: X \u223c T Ed(\u03a3, \u03be(1); f (1)), i.e., X \u223c N(0, \u03a3).\n\u2022 Scheme 2: X \u223c T Ed(\u03a3, \u03be(2); f (1)), i.e., X follows the multivariate Cauchy.\n\u2022 Scheme 3: X \u223c T Ed(\u03a3, \u03be(3); f (1)), i.e., the distribution is highly related to the multivariate t.\n\u2022 Scheme 4: X \u223c T Ed(\u03a3, \u03be(3); f (2)).\n\n[Figure 1 panels: Scheme 1, Scheme 2, Scheme 3, Scheme 4; axes: FPR (horizontal) and TPR (vertical); curves: Pearson, Kendall, Spearman, NPN, NS.]\nFigure 1: ROC curves for different methods in generating schemes 1 to 4 and different contamination\nlevel r = 0, 0.02, 0.05 (top, middle, bottom) using the CLIME. Here n = 400 and d = 100.\n\nTo evaluate the robustness of different methods, let r \u2208 [0, 1) represent the proportion of samples\nbeing contaminated. For each dimension, we randomly select \u230anr\u230b entries and replace them with\neither 5 or -5 with equal probability. 
The final data matrix we obtained is X \u2208 Rn\u00d7d. Here we\npick r = 0, 0.02 or 0.05. Under Scheme 1 to Scheme 4 with different levels of contamination\n(r = 0, 0.02 or 0.05), we repeatedly generate the data matrix X 100 times and compute the\naveraged False Positive Rates and False Negative Rates using a path of tuning parameters \u03bb from\n0.01 to 0.5 and \u03b3 = 10^{\u22125}. The feature selection performances of different methods are evaluated\nby plotting (FPR(\u03bb), 1 \u2212 FNR(\u03bb)). The corresponding ROC curves are presented in Figure 1. We\nsee: (1) when the data are perfectly Gaussian without contamination, all methods perform well; (2)\nwhen data are non-Gaussian, with outliers existing or latent elliptical distribution different from the\nGaussian, Kendall is better than the other methods in terms of achieving a lower FPR + FNR.\n6.2 Equities Data\nWe compare different methods on the stock price data from Yahoo! Finance (finance.yahoo.com).\nWe collect the daily closing prices for 452 stocks that are consistently in the S&P 500 index\nfrom January 1, 2003 through January 1, 2008. This gives us altogether 1,257 data points, each\ndata point corresponding to the vector of closing prices on a trading day. With St,j denoting the\nclosing price of stock j on day t, we consider the variables Xtj = log (St,j/St\u22121,j) and build\ngraphs over the indices j. Although the data form a time series, we treat the instances Xt as independent replicates.\n\nFigure 2: The graph estimated from the S&P 500 stock data from Jan. 1, 2003 to Jan. 1, 2008 using Pearson,\nKendall, Spearman, NPN, NS (left to right). 
The nodes are colored according to their GICS sector categories.\nThe 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sectors,\nincluding Consumer Discretionary (70 stocks), Consumer Staples (35 stocks),\nEnergy (37 stocks), Financials (74 stocks), Health Care (46 stocks), Industrials\n(59 stocks), Information Technology (64 stocks), Telecommunications Services\n(6 stocks), Materials (29 stocks), and Utilities (32 stocks).\nFigure 2 illustrates the estimated graphs using the same layout; the nodes are colored according to\nthe GICS sector of the corresponding stock. The tuning parameter is automatically selected using\na stability based approach named StARS [20]. We see that different methods get slightly different\ngraphs. The layout is drawn by a force-based algorithm using the estimated graph from the Kendall.\nWe see the stocks from the same GICS sector tend to be grouped with each other, suggesting that\nour method delivers an informative graph estimate.\n7 Discussion and Comparison with Related Work\nThe transelliptical distribution is also proposed by [12] for semiparametric scale-invariant principal\ncomponent analysis. Though both papers are based on the transelliptical family, the core idea and\nanalysis are fundamentally different. For scale-invariant principal component analysis, we impose\nstructural assumptions on the latent generalized correlation matrix; for graph estimation, we impose\nstructural assumptions on the latent generalized concentration matrix. Since the latent generalized\ncorrelation matrix encodes marginal uncorrelatedness while the latent generalized concentration matrix\nencodes conditional uncorrelatedness of the variables, the analyses of the population models are\northogonal and complementary to each other. In particular, for graphical models, we need to characterize\nthe properties of marginal and conditional distributions of a transelliptical distribution. 
These\nproperties are not needed for principal component analysis. Moreover, the model interpretation\nof the inferred transelliptical graph is nontrivial.\nIn a longer technical report [17],\nwe provide a three-layer hierarchical interpretation of the estimated transelliptical graphical model\nand sharply characterize the relationships between the nonparanormal, elliptical, meta-elliptical, and\ntranselliptical families. This research was supported by NSF award IIS-1116730.\n\nReferences\n[1] O. Banerjee, L. E. Ghaoui, and A. d\u2019Aspremont. Model selection through sparse maximum likelihood estimation. Journal of Machine Learning Research, 9(3):485\u2013516, 2008.\n[2] T. Cai, W. Liu, and X. Luo. A constrained \u21131 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594\u2013607, 2011.\n[3] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33\u201361, 1998.\n[4] David Christensen. Fast algorithms for the calculation of Kendall\u2019s \u03c4. Computational Statistics, 20(1):51\u201362, 2005.\n[5] A. Dempster. Covariance selection. Biometrics, 28:157\u2013175, 1972.\n[6] M. Drton and M. Perlman. Multiple testing and error control in Gaussian graphical model selection. Statistical Science, 22(3):430\u2013449, 2007.\n[7] M. Drton and M. Perlman. A SINful approach to Gaussian graphical model selection. Journal of Statistical Planning and Inference, 138(4):1179\u20131200, 2008.\n[8] KT Fang, S. Kotz, and KW Ng. Symmetric multivariate and related distributions. Chapman & Hall, London, 1990.\n[9] Michael A. Finegold and Mathias Drton. Robust graphical modeling with t-distributions. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI \u201909, pages 169\u2013176, 2009.\n[10] J. Friedman, T. Hastie, and R. 
Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432\u2013441, 2008.\n[11] P.R. Halmos. Measure theory, volume 18. Springer, 1974.\n[12] F. Han and H. Liu. TCA: Transelliptical principal component analysis for high dimensional non-gaussian data. Technical Report, 2012.\n[13] Wassily Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301):13\u201330, 1963.\n[14] A. Jalali, C. Johnson, and P. Ravikumar. High-dimensional sparse inverse covariance estimation using greedy methods. International Conference on Artificial Intelligence and Statistics, 2012. to appear.\n[15] C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37:42\u201354, 2009.\n[16] Steffen L. Lauritzen. Graphical Models. Oxford University Press, 1996.\n[17] H. Liu, F. Han, and C.-H. Zhang. Transelliptical graphical modeling under a hierarchical latent variable framework. Technical Report, 2012.\n[18] H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High dimensional semiparametric gaussian copula graphical models. Annals of Statistics, 2012.\n[19] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10:2295\u20132328, 2009.\n[20] Han Liu, Kathryn Roeder, and Larry Wasserman. Stability approach to regularization selection (StARS) for high dimensional graphical models. In Proceedings of the Twenty-Third Annual Conference on Neural Information Processing Systems (NIPS), 2010.\n[21] N. Meinshausen and P. B\u00fchlmann. High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436\u20131462, 2006.\n[22] P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu. 
High-dimensional covariance estimation by minimizing \u21131-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935\u2013980, 2011.\n[23] A. Rothman, P. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494\u2013515, 2008.\n[24] X. Shen, W. Pan, and Y. Zhu. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 2012. to appear.\n[25] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n[26] D. Vogel and R. Fried. Elliptical graphical modelling. Biometrika, 98(4):935\u2013951, December 2011.\n[27] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using \u21131-constrained quadratic programming. IEEE Transactions on Information Theory, 55(5):2183\u20132201, 2009.\n[28] L. Xue and H. Zou. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Annals of Statistics, 2012.\n[29] M. Yuan. High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11(8):2261\u20132286, 2010.\n[30] M. Yuan and Y. Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19\u201335, 2007.\n[31] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(11):2541\u20132563, 2006.\n[32] T. Zhao, H. Liu, K. Roeder, J. Lafferty, and L. Wasserman. The huge package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research, 2012. to appear.\n[33] H. Zou. The adaptive lasso and its oracle properties. 
Journal of the American Statistical Association, 101(476):1418\u20131429, 2006.\n", "award": [], "sourceid": 4822, "authors": [{"given_name": "Han", "family_name": "Liu", "institution": null}, {"given_name": "Fang", "family_name": "Han", "institution": null}, {"given_name": "Cun-hui", "family_name": "Zhang", "institution": null}]}