{"title": "Sparse Inverse Covariance Estimation with Calibration", "book": "Advances in Neural Information Processing Systems", "page_first": 2274, "page_last": 2282, "abstract": "We propose a semiparametric procedure for estimating high dimensional sparse inverse covariance matrix. Our method, named ALICE, is applicable to the elliptical family. Computationally, we develop an efficient dual inexact iterative projection (${\\rm D_2}$P) algorithm based on the alternating direction method of multipliers (ADMM). Theoretically, we prove that the ALICE estimator achieves the parametric rate of convergence in both parameter estimation and model selection. Moreover, ALICE calibrates regularizations when estimating each column of the inverse covariance matrix. So it not only is asymptotically tuning free, but also achieves an improved finite sample performance. We present numerical simulations to support our theory, and a real data example to illustrate the effectiveness of the proposed estimator.", "full_text": "Sparse Precision Matrix Estimation with Calibration\n\nTuo Zhao\n\nDepartment of Computer Science\n\nJohns Hopkins University\n\nDepartment of Operations Research and Financial Engineering\n\nHan Liu\n\nPrinceton University\n\nAbstract\n\nWe propose a semiparametric method for estimating sparse precision matrix of\nhigh dimensional elliptical distribution. The proposed method calibrates regular-\nizations when estimating each column of the precision matrix. Thus it not only\nis asymptotically tuning free, but also achieves an improved \ufb01nite sample per-\nformance. Theoretically, we prove that the proposed method achieves the para-\nmetric rates of convergence in both parameter estimation and model selection. 
We present numerical results on both simulated and real datasets to support our theory and illustrate the effectiveness of the proposed estimator.\n\n1 Introduction\n\nWe study the precision matrix estimation problem: let $X = (X_1, ..., X_d)^T$ be a d-dimensional random vector following some distribution with mean $\\mu \\in R^d$ and covariance matrix $\\Sigma \\in R^{d \\times d}$, where $\\Sigma_{kj} = E X_k X_j - E X_k E X_j$. We want to estimate $\\Omega = \\Sigma^{-1}$ from n independent observations. To make the estimation manageable in high dimensions (d/n \u2192 \u221e), we assume that \u2126 is sparse. That is, many off-diagonal entries of \u2126 are zeros.\n\nExisting literature in machine learning and statistics usually assumes that X follows a multivariate Gaussian distribution, i.e., X \u223c N(0, \u03a3). Such a distributional assumption naturally connects sparse precision matrices with Gaussian graphical models (Dempster, 1972), and has motivated numerous applications (Lauritzen, 1996). To estimate sparse precision matrices for Gaussian distributions, many methods based on the sample covariance estimator have been proposed in the past decade. Let $x_1, ..., x_n \\in R^d$ be n independent observations of X. The sample covariance estimator is defined as\n\n$$S = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar{x})(x_i - \\bar{x})^T \\quad \\text{with} \\quad \\bar{x} = \\frac{1}{n} \\sum_{i=1}^n x_i. \\qquad (1.1)$$\n\nBanerjee et al. (2008); Yuan and Lin (2007); Friedman et al. (2008) take advantage of the Gaussian likelihood, and propose the graphical lasso (GLASSO) estimator by solving\n\n$$\\hat{\\Omega} = \\mathop{\\mathrm{argmin}}_{\\Omega} \\; -\\log|\\Omega| + \\mathrm{tr}(S\\Omega) + \\lambda \\sum_{j,k} |\\Omega_{kj}|,$$\n\nwhere \u03bb > 0 is the regularization parameter. Scalable software packages for GLASSO have been developed, such as huge (Zhao et al., 2012).\n\nIn contrast, Cai et al. (2011); Yuan (2010) adopt the pseudo-likelihood approach to estimate the precision matrix. Their estimators follow a column-by-column estimation scheme, and possess better theoretical properties. More specifically, given a matrix $A \\in R^{d \\times d}$, let $A_{*j} = (A_{1j}, ..., A_{dj})^T$ denote the jth column of A, $\\|A_{*j}\\|_1 = \\sum_k |A_{kj}|$ and $\\|A_{*j}\\|_\\infty = \\max_k |A_{kj}|$. Cai et al. (2011) obtain the CLIME estimator by solving\n\n$$\\hat{\\Omega}_{*j} = \\mathop{\\mathrm{argmin}}_{\\Omega_{*j}} \\|\\Omega_{*j}\\|_1 \\quad \\text{s.t.} \\quad \\|S\\Omega_{*j} - I_{*j}\\|_\\infty \\le \\lambda, \\quad \\forall j = 1, ..., d. \\qquad (1.2)$$\n\nComputationally, (1.2) can be reformulated and solved by general linear program solvers. Theoretically, let $\\|A\\|_1 = \\max_j \\|A_{*j}\\|_1$ be the matrix \u21131 norm of A, and $\\|A\\|_2$ be the largest singular value of A (i.e., the spectral norm of A). Cai et al. (2011) show that if we choose\n\n$$\\lambda \\asymp \\|\\Omega\\|_1 \\sqrt{\\frac{\\log d}{n}}, \\qquad (1.3)$$\n\nthe CLIME estimator achieves the following rate of convergence under the spectral norm,\n\n$$\\|\\hat{\\Omega} - \\Omega\\|_2^2 = O_P\\left( \\|\\Omega\\|_1^{4-4q} s^2 \\left( \\frac{\\log d}{n} \\right)^{1-q} \\right), \\qquad (1.4)$$\n\nwhere q \u2208 [0, 1) and $s = \\max_j \\sum_k |\\Omega_{kj}|^q$.\n\nDespite these good properties, the CLIME estimator in (1.2) has three drawbacks: (1) The theoretical justification heavily relies on the subgaussian tail assumption. When this assumption is violated, the inference can be unreliable; (2) All columns are estimated using the same regularization parameter, even though these columns may have different sparseness. As a result, more estimation bias is introduced to the denser columns to compensate for the sparser columns. In other words, the estimation is not calibrated (Liu et al., 2013); (3) The selected regularization parameter in (1.3) involves the unknown quantity $\\|\\Omega\\|_1$. 
Thus we have to carefully tune the regularization parameter over a refined grid of potential values in order to get a good finite-sample performance. To overcome the above three drawbacks, we propose a new sparse precision matrix estimation method, named EPIC (Estimating Precision mIatrix with Calibration).\n\nTo relax the Gaussian assumption, our EPIC method adopts an ensemble of the transformed Kendall\u2019s tau estimator and Catoni\u2019s M-estimator (Kruskal, 1958; Catoni, 2012). Such a semiparametric combination makes EPIC applicable to the elliptical distribution family. The elliptical family (Cambanis et al., 1981; Fang et al., 1990) contains many multivariate distributions such as Gaussian, multivariate t-distribution, Kotz distribution, multivariate Laplace, Pearson type II and VII distributions. Many of these distributions do not have subgaussian tails, thus the commonly used sample covariance-based sparse precision matrix estimators often fail miserably.\n\nMoreover, our EPIC method adopts a calibration framework proposed in Gautier and Tsybakov (2011), which reduces the estimation bias by calibrating the regularization for each column. Meanwhile, the optimal regularization parameter selection under such a calibration framework does not require any prior knowledge of unknown quantities (Belloni et al., 2011). Thus our EPIC estimator is asymptotically tuning free (Liu and Wang, 2012). Our theoretical analysis shows that if the underlying distribution has a finite fourth moment, the EPIC estimator achieves the same rates of convergence as (1.4). Numerical experiments on both simulated and real datasets show that EPIC outperforms existing precision matrix estimation methods.\n\n2 Background\n\nWe first introduce some notations used throughout this paper. Given a vector v = (v1, . . . 
, vd)T \u2208 Rd, we define the following vector norms:\n\n$$\\|v\\|_1 = \\sum_j |v_j|, \\quad \\|v\\|_2^2 = \\sum_j v_j^2, \\quad \\|v\\|_\\infty = \\max_j |v_j|.$$\n\nGiven a matrix $A \\in R^{d \\times d}$, we use $A_{*j} = (A_{1j}, ..., A_{dj})^T$ to denote the jth column of A. We define the following matrix norms:\n\n$$\\|A\\|_1 = \\max_j \\|A_{*j}\\|_1, \\quad \\|A\\|_2 = \\max_j \\psi_j(A), \\quad \\|A\\|_F^2 = \\sum_{k,j} A_{kj}^2, \\quad \\|A\\|_{\\max} = \\max_{k,j} |A_{kj}|,$$\n\nwhere the $\\psi_j(A)$\u2019s are all singular values of A.\n\nWe then briefly review the elliptical family. As a generalization of the Gaussian distribution, it has the following definition.\n\nDefinition 2.1 (Fang et al. (1990)). Given $\\mu \\in R^d$ and $\\Xi \\in R^{d \\times d}$, where $\\Xi \\succeq 0$ and rank(\u039e) = r \u2264 d, we say that a d-dimensional random vector $X = (X_1, ..., X_d)^T$ follows an elliptical distribution with parameters \u00b5, \u039e, and \u03b2, if X has a stochastic representation\n\n$$X \\overset{d}{=} \\mu + \\beta B U,$$\n\nsuch that \u03b2 \u2265 0 is a continuous random variable independent of U, $U \\in S^{r-1}$ is uniformly distributed on the unit sphere in $R^r$, and $\\Xi = BB^T$.\n\nSince we are interested in precision matrix estimation, we assume that $\\max_j E X_j^2$ is finite. Note that the stochastic representation in Definition 2.1 is not unique, and existing literature usually imposes the constraint $\\max_j \\Xi_{jj} = 1$ to make the distribution identifiable (Fang et al., 1990). However, such a constraint does not necessarily make \u039e the covariance matrix. Here we present an alternative representation as follows.\n\nProposition 2.2. If X has the stochastic representation X = \u00b5 + \u03b2BU as in Definition 2.1, given $\\Xi = BB^T$, rank(\u039e) = r, and $E(\\beta^2) = \\alpha < \\infty$, X can be rewritten as X = \u00b5 + \u03beAU, where $\\xi = \\beta\\sqrt{r/\\alpha}$, $A = B\\sqrt{\\alpha/r}$ and $\\Sigma = AA^T$. 
Moreover we have\n\n$$E(\\xi^2) = r, \\quad E(X) = \\mu, \\quad \\text{and} \\quad \\mathrm{Cov}(X) = \\Sigma.$$\n\nAfter the reparameterization in Proposition 2.2, the distribution is identifiable with \u03a3 defined as the conventional covariance matrix.\n\nRemark 2.3. \u03a3 has the decomposition \u03a3 = \u0398Z\u0398, where Z is the Pearson correlation matrix, and $\\Theta = \\mathrm{diag}(\\theta_1, ..., \\theta_d)$ with $\\theta_j$ as the standard deviation of $X_j$. Since \u0398 is a diagonal matrix, the precision matrix \u2126 also has a similar decomposition $\\Omega = \\Theta^{-1} \\Gamma \\Theta^{-1}$, where $\\Gamma = Z^{-1}$ is the inverse correlation matrix.\n\n3 Method\n\nWe propose a three-step method: (1) We first use the transformed Kendall\u2019s tau estimator and Catoni\u2019s M-estimator to obtain $\\hat{Z}$ and $\\hat{\\Theta}$ respectively. (2) We then plug $\\hat{Z}$ into the calibrated inverse correlation matrix estimation to obtain $\\hat{\\Gamma}$. (3) At last, we assemble $\\hat{\\Gamma}$ and $\\hat{\\Theta}$ to obtain $\\hat{\\Omega}$.\n\n3.1 Correlation Matrix and Standard Deviation Estimation\n\nTo estimate Z, we adopt the transformed Kendall\u2019s tau estimator proposed in Liu et al. (2012). Given n independent observations, $x_1, ..., x_n$, where $x_i = (x_{i1}, ..., x_{id})^T$, we calculate the Kendall\u2019s tau statistic by\n\n$$\\hat{\\tau}_{kj} = \\begin{cases} \\frac{2}{n(n-1)} \\sum_{i < i'} \\mathrm{sign}\\left( (x_{ij} - x_{i'j})(x_{ik} - x_{i'k}) \\right) & \\text{if } j \\neq k; \\\\ 1 & \\text{otherwise.} \\end{cases}$$\n\nAfter a simple transformation, we obtain a correlation matrix estimator $\\hat{Z} = [\\hat{Z}_{kj}] = \\left[ \\sin\\left( \\frac{\\pi}{2} \\hat{\\tau}_{kj} \\right) \\right]$ (Liu et al., 2012; Zhao et al., 2013).\n\nTo estimate $\\Theta = \\mathrm{diag}(\\theta_1, ..., \\theta_d)$, we adopt the Catoni\u2019s M-estimator proposed in Catoni (2012). We define\n\n$$\\psi(t) = \\mathrm{sign}(t) \\log(1 + |t| + t^2/2),$$\n\nwhere sign(0) = 0. Let $\\hat{m}_j$ be the estimator of $E X_j^2$. We solve\n\n$$\\sum_{i=1}^n \\psi\\left( (x_{ij} - \\hat{\\mu}_j) \\sqrt{\\frac{2}{n K_{\\max}}} \\right) = 0, \\qquad \\sum_{i=1}^n \\psi\\left( (x_{ij}^2 - \\hat{m}_j) \\sqrt{\\frac{2}{n K_{\\max}}} \\right) = 0,$$\n\nwhere $K_{\\max}$ is an upper bound of $\\max_j \\mathrm{Var}(X_j)$ and $\\max_j \\mathrm{Var}(X_j^2)$. Since \u03c8(t) is a strictly increasing function in t, $\\hat{\\mu}_j$ and $\\hat{m}_j$ are unique and can be obtained by the efficient Newton-Raphson method (Stoer et al., 1993). Then we can obtain $\\hat{\\theta}_j$ using $\\hat{\\theta}_j = \\sqrt{\\hat{m}_j - \\hat{\\mu}_j^2}$.\n\n3.2 Calibrated Inverse Correlation Matrix Estimation\n\nWe plug $\\hat{Z}$ into the following convex program,\n\n$$(\\hat{\\Gamma}_{*j}, \\hat{\\tau}_j) = \\mathop{\\mathrm{argmin}}_{\\Gamma_{*j}, \\tau_j} \\|\\Gamma_{*j}\\|_1 + c\\tau_j \\quad \\text{s.t.} \\quad \\|\\hat{Z}\\Gamma_{*j} - I_{*j}\\|_\\infty \\le \\lambda\\tau_j, \\quad \\|\\Gamma_{*j}\\|_1 \\le \\tau_j, \\quad \\forall j = 1, ..., d, \\qquad (3.1)$$\n\nwhere c can be an arbitrary constant (e.g. c = 0.5). $\\tau_j$ works as an auxiliary variable to calibrate the regularization.\n\nRemark 3.1. If we know $\\tau_j = \\|\\Omega_{*j}\\|_1$ in advance, we can consider a simple variant of the CLIME estimator,\n\n$$\\hat{\\Omega}_{*j} = \\mathop{\\mathrm{argmin}}_{\\Omega_{*j}} \\|\\Omega_{*j}\\|_1 \\quad \\text{s.t.} \\quad \\|S\\Omega_{*j} - I_{*j}\\|_\\infty \\le \\lambda\\tau_j, \\quad \\forall j = 1, ..., d.$$\n\nSince we do not have any prior knowledge of the $\\tau_j$\u2019s, we consider the following replacement\n\n$$(\\hat{\\Omega}_{*j}, \\hat{\\tau}_j) = \\mathop{\\mathrm{argmin}}_{\\Omega_{*j}, \\tau_j} \\|\\Omega_{*j}\\|_1 \\quad \\text{s.t.} \\quad \\|S\\Omega_{*j} - I_{*j}\\|_\\infty \\le \\lambda\\tau_j, \\quad \\tau_j = \\|\\Omega_{*j}\\|_1, \\quad \\forall j = 1, ..., d. \\qquad (3.2)$$\n\nAs can be seen, (3.2) is nonconvex due to the constraint $\\tau_j = \\|\\Omega_{*j}\\|_1$. 
Thus no global optimum can be guaranteed in polynomial time.\n\nFrom a computational perspective, (3.1) can be viewed as a convex relaxation of (3.2). Both the objective function and the constraint in (3.1) contain $\\tau_j$ to prevent $\\tau_j$ from being chosen either too large or too small. Due to the complementary slackness, (3.1) eventually encourages the regularization to be proportional to the \u21131 norm of each column (weak sparseness). Therefore the estimation is calibrated.\n\nBy introducing the decomposition $\\Gamma_{*j} = \\Gamma_{*j}^+ - \\Gamma_{*j}^-$ with $\\Gamma_{*j}^+, \\Gamma_{*j}^- \\ge 0$, we can reformulate (3.1) as a linear program as follows,\n\n$$(\\hat{\\Gamma}_{*j}^+, \\hat{\\Gamma}_{*j}^-, \\hat{\\tau}_j) = \\mathop{\\mathrm{argmin}}_{\\Gamma_{*j}^+, \\Gamma_{*j}^-, \\tau_j} \\mathbf{1}^T \\Gamma_{*j}^+ + \\mathbf{1}^T \\Gamma_{*j}^- + c\\tau_j \\qquad (3.3)$$\n\nsubject to\n\n$$\\begin{bmatrix} \\hat{Z} & -\\hat{Z} & -\\boldsymbol{\\lambda} \\\\ -\\hat{Z} & \\hat{Z} & -\\boldsymbol{\\lambda} \\\\ \\mathbf{1}^T & \\mathbf{1}^T & -1 \\end{bmatrix} \\begin{bmatrix} \\Gamma_{*j}^+ \\\\ \\Gamma_{*j}^- \\\\ \\tau_j \\end{bmatrix} \\le \\begin{bmatrix} I_{*j} \\\\ -I_{*j} \\\\ 0 \\end{bmatrix}, \\quad \\Gamma_{*j}^+ \\ge 0, \\; \\Gamma_{*j}^- \\ge 0, \\; \\tau_j \\ge 0,$$\n\nwhere $\\boldsymbol{\\lambda} = (\\lambda, ..., \\lambda)^T \\in R^d$. (3.3) can be solved by existing linear program solvers, and further accelerated by parallel computing techniques.\n\nRemark 3.2. Though (3.1) looks more complicated than (1.2), it is not necessarily more computationally difficult. After the reparameterization, (3.3) contains 2d + 1 parameters to optimize, which is of a similar scale to the linear program formulation of the CLIME method in Cai et al. (2011).\n\nOur EPIC method does not guarantee the symmetry of the estimator $\\hat{\\Gamma}$. 
Thus we need the following symmetrization method to obtain the symmetric replacement $\\tilde{\\Gamma}$:\n\n$$\\tilde{\\Gamma}_{kj} = \\hat{\\Gamma}_{kj} I(|\\hat{\\Gamma}_{kj}| \\le |\\hat{\\Gamma}_{jk}|) + \\hat{\\Gamma}_{jk} I(|\\hat{\\Gamma}_{kj}| > |\\hat{\\Gamma}_{jk}|).$$\n\n3.3 Precision Matrix Estimation\n\nOnce we obtain the estimated inverse correlation matrix $\\tilde{\\Gamma}$, we can recover the precision matrix estimator by the ensemble rule,\n\n$$\\hat{\\Omega} = \\hat{\\Theta}^{-1} \\tilde{\\Gamma} \\hat{\\Theta}^{-1}.$$\n\nRemark 3.3. A possible alternative is to directly estimate \u2126 by plugging a covariance estimator\n\n$$\\hat{S} = \\hat{\\Theta} \\hat{Z} \\hat{\\Theta} \\qquad (3.4)$$\n\ninto (3.1) instead of $\\hat{Z}$, but this direct estimation procedure makes the regularization parameter selection sensitive to $\\mathrm{Var}(X_j^2)$.\n\n4 Statistical Properties\n\nIn this section, we study statistical properties of the EPIC estimator. We define the following class of sparse symmetric matrices,\n\n$$U_q(s, M) = \\left\\{ \\Gamma \\in R^{d \\times d} \\;\\Big|\\; \\Gamma \\succ 0, \\; \\Gamma = \\Gamma^T, \\; \\max_j \\sum_k |\\Gamma_{kj}|^q \\le s, \\; \\|\\Gamma\\|_1 \\le M \\right\\},$$\n\nwhere q \u2208 [0, 1) and (s, d, M) can scale with the sample size n. We also impose the following additional conditions:\n(A.1) $\\Gamma \\in U_q(s, M)$\n(A.2) $\\max_j |\\mu_j| \\le \\mu_{\\max}$, $\\max_j \\theta_j \\le \\theta_{\\max}$, $\\min_j \\theta_j \\ge \\theta_{\\min}$\n(A.3) $\\max_j E X_j^4 \\le K$\nwhere $\\mu_{\\max}$, K, $\\theta_{\\max}$, and $\\theta_{\\min}$ are constants.\n\nBefore we proceed with our main results, we first present the following key lemma.\n\nLemma 4.1. Suppose that X follows an elliptical distribution with mean \u00b5, and covariance \u03a3 = \u0398Z\u0398. Assume that (A.1)-(A.3) hold. Given the transformed Kendall\u2019s tau estimator and Catoni\u2019s M-estimator defined in Section 3, there exist universal constants $\\kappa_1$ and $\\kappa_2$ such that for large enough n,\n\n$$P\\left( \\max_j |\\hat{\\theta}_j^{-1} - \\theta_j^{-1}| \\le \\kappa_2 \\sqrt{\\frac{\\log d}{n}} \\right) \\ge 1 - \\frac{2}{d^3},$$\n\n$$P\\left( \\max_{j,k} |\\hat{Z}_{kj} - Z_{kj}| \\le \\kappa_1 \\sqrt{\\frac{\\log d}{n}} \\right) \\ge 1 - \\frac{1}{d^3}.$$\n\nLemma 4.1 implies that both the transformed Kendall\u2019s tau estimator and Catoni\u2019s M-estimator possess good concentration properties, which enable us to obtain a consistent estimator of \u2126.\n\nThe next theorem presents the rates of convergence under the matrix \u21131 norm, spectral norm, Frobenius norm, and max norm.\n\nTheorem 4.2. Suppose that X follows an elliptical distribution. Assume (A.1)-(A.3) hold. There exist universal constants $C_1$, $C_2$, and $C_3$ such that by taking\n\n$$\\lambda = \\kappa_1 \\sqrt{\\frac{\\log d}{n}}, \\qquad (4.1)$$\n\nfor large enough n and p = 1, 2, we have\n\n$$\\|\\hat{\\Omega} - \\Omega\\|_p^2 \\le C_1 M^{4-4q} s^2 \\left( \\frac{\\log d}{n} \\right)^{1-q},$$\n\n$$\\frac{1}{d} \\|\\hat{\\Omega} - \\Omega\\|_F^2 \\le C_2 M^{4-2q} s \\left( \\frac{\\log d}{n} \\right)^{1-q/2},$$\n\n$$\\|\\hat{\\Omega} - \\Omega\\|_{\\max} \\le C_3 M^2 \\sqrt{\\frac{\\log d}{n}},$$\n\nwith probability at least 1 \u2212 3 exp(\u22123 log d). Moreover, when the exact sparsity holds (i.e., q = 0), let $E = \\{(k, j) \\mid \\Omega_{kj} \\neq 0\\}$ and $\\hat{E} = \\{(k, j) \\mid \\hat{\\Omega}_{kj} \\neq 0\\}$. Then we have $P(E \\subseteq \\hat{E}) \\to 1$, if there exists a large enough constant $C_4$ such that\n\n$$\\min_{(k,j) \\in E} |\\Omega_{kj}| \\ge C_4 M^2 \\sqrt{\\frac{\\log d}{n}}.$$\n\nThe rates of convergence in Theorem 4.2 are comparable to those in Cai et al. (2011).\n\nRemark 4.3. 
The selected tuning parameter \u03bb in (4.1) does not involve any unknown quantity. Therefore our EPIC method is asymptotically tuning free.\n\n5 Numerical Simulations\n\nIn this section, we compare the proposed EPIC method with other methods including\n\n(1) GLASSO.RC: GLASSO + $\\hat{S}$ defined in (3.4) as the input covariance matrix\n(2) CLIME.RC: CLIME + $\\hat{S}$ as the input covariance matrix\n(3) CLIME.SC: CLIME + S defined in (1.1) as the input covariance matrix\n\nWe consider three different settings for the comparison: (1) d = 100; (2) d = 200; (3) d = 400. We adopt the following three graph generation schemes, as illustrated in Figure 1, to obtain precision matrices.\n\n(a) Chain\n\n(b) Erd\u0151s-R\u00e9nyi\n\n(c) Scale-free\n\nFigure 1: Three different graph patterns. To ease the visualization, we choose d = 100.\n\nWe then generate n = 200 independent samples from the t-distribution (footnote 1) with 5 degrees of freedom, mean 0 and covariance $\\Sigma = \\Omega^{-1}$. For the EPIC estimator, we set c = 0.5 in (3.1). For the Catoni\u2019s M-estimator, we set $K_{\\max} = 10^2$.\n\nTo evaluate the performance in parameter estimation, we repeatedly split the data into a training set of $n_1 = 160$ samples and a validation set of $n_2 = 40$ samples 10 times. We tune \u03bb over a refined grid, then the selected optimal regularization parameter is\n\n$$\\hat{\\lambda} = \\mathop{\\mathrm{argmin}}_{\\lambda} \\sum_{k=1}^{10} \\|\\hat{\\Omega}^{(\\lambda,k)} \\hat{\\Sigma}^{(k)} - I\\|_{\\max},$$\n\nwhere $\\hat{\\Omega}^{(\\lambda,k)}$ denotes the estimated precision matrix using the regularization parameter \u03bb and the training set in the kth split, and $\\hat{\\Sigma}^{(k)}$ denotes the estimated covariance matrix using the validation set in the kth split. Table 1 summarizes our experimental results averaged over 200 simulations. 
We see that EPIC outperforms the competing estimators throughout all settings.\n\nTo evaluate the performance in model selection, we calculate the ROC curve of each obtained regularization path. Figure 2 summarizes ROC curves of all methods averaged over 200 simulations. We see that EPIC also outperforms the competing estimators throughout all settings.\n\n6 Real Data Example\n\nTo illustrate the effectiveness of the proposed EPIC method, we adopt the breast cancer data (footnote 2), which is analyzed in Hess et al. (2006). The data set contains 133 subjects with 22,283 gene expression levels. Among the 133 subjects, 99 have achieved residual disease (RD) and the remaining 34 have achieved pathological complete response (pCR). Existing results have shown that the pCR subjects have a higher chance of cancer-free survival in the long term than the RD subjects. Thus we are interested in studying the response states of patients (with RD or pCR) to neoadjuvant (preoperative) chemotherapy.\n\n1 The marginal variances of the distribution vary from 0.5 to 2.\n2 Available at http://bioinformatics.mdanderson.org/.\n\n(a) d = 100  (b) d = 200  (c) d = 400\n(d) d = 100  (e) d = 200  (f) d = 400\n(g) d = 100  (h) d = 200  (i) d = 400\n\nFigure 2: Average ROC curves of different methods on the chain (a-c), Erd\u0151s-R\u00e9nyi (d-f), and scale-free (g-i) models. We can see that EPIC uniformly outperforms the competing estimators throughout all settings.\n\nWe randomly divide the data into a training set of 83 RD and 29 pCR subjects, and a testing set of the remaining 16 RD and 5 pCR subjects. Then by conducting a Wilcoxon test between two categories for each gene, we further reduce the dimension by choosing the 113 most significant genes with the smallest p-values. 
We assume that the gene expression data in each category is elliptically distributed, and the two categories have the same covariance matrix \u03a3 but different means $\\mu^{(k)}$, where k = 0 for RD and k = 1 for pCR. In Cai et al. (2011), the sample mean is adopted to estimate the $\\mu^{(k)}$\u2019s, and CLIME.RC is adopted to estimate $\\Omega = \\Sigma^{-1}$. In contrast, we adopt the Catoni\u2019s M-estimator to estimate the $\\mu^{(k)}$\u2019s, and EPIC is adopted to estimate \u2126. We classify a sample x to pCR if\n\n$$\\left( x - \\frac{\\hat{\\mu}^{(1)} + \\hat{\\mu}^{(0)}}{2} \\right)^T \\hat{\\Omega} \\left( \\hat{\\mu}^{(1)} - \\hat{\\mu}^{(0)} \\right) \\ge 0,$$\n\nand to RD otherwise. We use the testing set to evaluate the performance of CLIME.RC and EPIC. For the tuning parameter selection, we use a 5-fold cross validation on the training data to pick \u03bb with the minimum classification error rate.\n\nTo evaluate the classification performance, we use the criteria of specificity, sensitivity, and Matthews Correlation Coefficient (MCC). 
[Figure 2 panels: ROC curves (False Positive Rate vs. True Positive Rate) for EPIC, GLASSO.RC, CLIME.RC, and CLIME.SC.]\n\nTable 1: Quantitative comparison of EPIC, GLASSO.RC, CLIME.RC, and CLIME.SC on the chain, Erd\u0151s-R\u00e9nyi, and scale-free models. We see that EPIC outperforms the competing estimators throughout all settings.\n\nSpectral Norm: $\\|\\hat{\\Omega} - \\Omega\\|_2$\n\nModel  d  EPIC  GLASSO.RC  CLIME.RC  CLIME.SC\nChain  100  0.8405(0.1247)  1.1880(0.1003)  0.9337(0.5389)  3.2991(0.0512)\nChain  200  0.9147(0.1009)  1.3433(0.0870)  1.0716(0.4939)  3.7303(0.4477)\nChain  400  1.0058(0.1231)  1.4842(0.0760)  1.3567(0.3706)  3.8462(0.4827)\nErd\u0151s-R\u00e9nyi  100  0.9846(0.0970)  1.6037(0.2289)  1.6885(0.1704)  3.7158(0.0663)\nErd\u0151s-R\u00e9nyi  200  1.1944(0.0704)  1.6105(0.0680)  1.7507(0.0389)  3.5209(0.0419)\nErd\u0151s-R\u00e9nyi  400  1.9010(0.0462)  2.2613(0.1133)  2.6884(0.5988)  4.1342(0.1079)\nScale-free  100  0.9779(0.1379)  1.6619(0.1553)  2.1327(0.0986)  3.4548(0.0513)\nScale-free  200  2.9278(0.3367)  4.0882(0.0962)  4.5820(0.0604)  8.8904(0.0575)\nScale-free  400  1.1816(0.1201)  1.8304(0.0710)  2.1191(0.0629)  3.4249(0.0849)\n\nFrobenius Norm: $\\|\\hat{\\Omega} - \\Omega\\|_F$\n\nModel  d  EPIC  GLASSO.RC  CLIME.RC  CLIME.SC\nChain  100  3.3108(0.1521)  4.5664(0.1034)  3.4406(0.4319)  16.282(0.1346)\nChain  200  5.0309(0.1833)  7.2154(0.0831)  5.4776(0.2586)  23.403(0.2727)\nChain  400  7.5134(0.1205)  11.300(0.1851)  7.8357(1.2217)  33.504(0.1341)\nErd\u0151s-R\u00e9nyi  100  3.5122(0.0796)  3.9600(0.1459)  4.4212(0.1065)  13.734(0.0629)\nErd\u0151s-R\u00e9nyi  200  6.3000(0.0868)  7.3385(0.0994)  7.3501(0.1589)  20.151(0.1899)\nErd\u0151s-R\u00e9nyi  400  11.489(0.0858)  12.594(0.1633)  13.026(0.4124)  30.030(0.1289)\nScale-free  100  2.6369(0.1125)  3.1154(0.1001)  3.1363(0.1014)  10.717(0.0844)\nScale-free  200  4.1280(0.1389)  7.7543(0.0934)  7.8916(0.0556)  16.370(0.1490)\nScale-free  400  5.3440(0.0511)  6.3741(0.0723)  5.7643(0.0625)  20.687(0.1373)\n\nMore specifically, let the $y_i$\u2019s and $\\hat{y}_i$\u2019s be the true labels and predicted labels of the testing samples. We define\n\n$$\\text{Specificity} = \\frac{TN}{TN + FP}, \\quad \\text{Sensitivity} = \\frac{TP}{TP + FN}, \\quad \\text{MCC} = \\frac{TP \\cdot TN - FP \\cdot FN}{\\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$\n\nwhere\n\n$$TP = \\sum_i I(\\hat{y}_i = y_i = 1), \\quad FP = \\sum_i I(\\hat{y}_i = 1, y_i = 0), \\quad TN = \\sum_i I(\\hat{y}_i = y_i = 0), \\quad FN = \\sum_i I(\\hat{y}_i = 0, y_i = 1).$$\n\nTable 2 summarizes the performance of both methods over 100 replications. We see that EPIC outperforms CLIME.RC on the specificity. The overall classification performance measured by MCC shows that EPIC has a 4% improvement over CLIME.RC.\n\nTable 2: Quantitative comparison of EPIC and CLIME.RC in the breast cancer data analysis.\n\nMethod  Specificity  Sensitivity  MCC\nCLIME.RC  0.7412(0.0131)  0.7911(0.0251)  0.4905(0.0288)\nEPIC  0.7935(0.0211)  0.8087(0.0324)  0.5301(0.0375)\n\nReferences\n\nBANERJEE, O., EL GHAOUI, L. and D\u2019ASPREMONT, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research 9 485\u2013516.\n\nBELLONI, A., CHERNOZHUKOV, V. and WANG, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98 791\u2013806.\n\nCAI, T., LIU, W. and LUO, X. (2011). A constrained \u21131 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106 594\u2013607.\n\nCAMBANIS, S., HUANG, S. and SIMONS, G. (1981). On the theory of elliptically contoured distributions. Journal of Multivariate Analysis 11 368\u2013385.\n\nCATONI, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Annales de l\u2019Institut Henri Poincar\u00e9, Probabilit\u00e9s et Statistiques 48 1148\u20131185.\n\nDEMPSTER, A. P. (1972). Covariance selection. Biometrics 157\u2013175.\n\nFANG, K.-T., KOTZ, S. and NG, K. W. (1990). Symmetric Multivariate and Related Distributions, Monographs on Statistics and Applied Probability, 36. London: Chapman and Hall Ltd. MR1071174.\n\nFRIEDMAN, J., HASTIE, T. and TIBSHIRANI, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432\u2013441.\n\nGAUTIER, E. 
and TSYBAKOV, A. B. (2011). High-dimensional instrumental variables regression and confidence sets. Tech. rep., ENSAE ParisTech.\n\nHESS, K. R., ANDERSON, K., SYMMANS, W. F., VALERO, V., IBRAHIM, N., MEJIA, J. A., BOOSER, D., THERIAULT, R. L., BUZDAR, A. U., DEMPSEY, P. J. ET AL. (2006). Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. Journal of clinical oncology 24 4236\u20134244.\n\nKRUSKAL, W. H. (1958). Ordinal measures of association. Journal of the American Statistical Association 53 814\u2013861.\n\nLAURITZEN, S. L. (1996). Graphical models, vol. 17. Oxford University Press.\n\nLIU, H., HAN, F., YUAN, M., LAFFERTY, J. and WASSERMAN, L. (2012). High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics 40 2293\u20132326.\n\nLIU, H. and WANG, L. (2012). Tiger: A tuning-insensitive approach for optimally estimating gaussian graphical models. Tech. rep., Massachusetts Institute of Technology.\n\nLIU, H., WANG, L. and ZHAO, T. (2013). Multivariate regression with calibration. arXiv preprint arXiv:1305.2238.\n\nSTOER, J., BULIRSCH, R., BARTELS, R., GAUTSCHI, W. and WITZGALL, C. (1993). Introduction to numerical analysis, vol. 2. Springer New York.\n\nYUAN, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research 11 2261\u20132286.\n\nYUAN, M. and LIN, Y. (2007). Model selection and estimation in the gaussian graphical model. Biometrika 94 19\u201335.\n\nZHAO, T., LIU, H., ROEDER, K., LAFFERTY, J. and WASSERMAN, L. (2012). The huge package for high-dimensional undirected graph estimation in r. The Journal of Machine Learning Research 13 1059\u20131062.\n\nZHAO, T., ROEDER, K. and LIU, H. (2013). Positive semidefinite rank-based correlation matrix estimation with application to semiparametric graph estimation. 
Journal of Computational and\nGraphical Statistics To appear.\n\n9\n\n\f", "award": [], "sourceid": 1103, "authors": [{"given_name": "Tuo", "family_name": "Zhao", "institution": "Johns Hopkins University"}, {"given_name": "Han", "family_name": "Liu", "institution": "Princeton University"}]}