{"title": "Learning Additive Exponential Family Graphical Models via $\\ell_{2,1}$-norm Regularized M-Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 4367, "page_last": 4375, "abstract": "We investigate a subclass of exponential family graphical models of which the sufficient statistics are defined by arbitrary additive forms. We propose two $\\ell_{2,1}$-norm regularized maximum likelihood estimators to learn the model parameters from i.i.d. samples. The first one is a joint MLE estimator which estimates all the parameters simultaneously. The second one is a node-wise conditional MLE estimator which estimates the parameters for each node individually. For both estimators, statistical analysis shows that under mild conditions the extra flexibility gained by the additive exponential family models comes at almost no cost of statistical efficiency. A Monte-Carlo approximation method is developed to efficiently optimize the proposed estimators. The advantages of our estimators over Gaussian graphical models and Nonparanormal estimators are demonstrated on synthetic and real data sets.", "full_text": "Learning Additive Exponential Family Graphical Models via $\ell_{2,1}$-norm Regularized M-Estimation

Xiao-Tong Yuan† Ping Li‡§ Tong Zhang‡ Qingshan Liu† Guangcan Liu†

†B-DAT Lab, Nanjing University of Info. Sci.&Tech., Nanjing, Jiangsu, 210044, China
‡Depart. of Statistics and §Depart. of Computer Science, Rutgers University, Piscataway, NJ, 08854, USA

{xtyuan,qsliu,gcliu}@nuist.edu.cn, {pingli,tzhang}@stat.rutgers.edu

Abstract

We investigate a subclass of exponential family graphical models of which the sufficient statistics are defined by arbitrary additive forms. We propose two $\ell_{2,1}$-norm regularized maximum likelihood estimators to learn the model parameters from i.i.d. samples.
The first one is a joint MLE estimator which estimates all the parameters simultaneously. The second one is a node-wise conditional MLE estimator which estimates the parameters for each node individually. For both estimators, statistical analysis shows that under mild conditions the extra flexibility gained by the additive exponential family models comes at almost no cost of statistical efficiency. A Monte-Carlo approximation method is developed to efficiently optimize the proposed estimators. The advantages of our estimators over Gaussian graphical models and Nonparanormal estimators are demonstrated on synthetic and real data sets.

1 Introduction

As an important class of statistical models for exploring the interrelationship among a large number of random variables, undirected graphical models (UGMs) have enjoyed popularity in a wide range of scientific and engineering domains, including statistical physics, computer vision, data mining, and computational biology. Let $X = [X_1, ..., X_p]^\top$ be a p-dimensional random vector with each variable $X_i$ taking values in a set $\mathcal{X}$. Suppose $G = (V, E)$ is an undirected graph consisting of a set of vertices $V = \{1, ..., p\}$ and a set of unordered pairs $E$ representing edges between the vertices. The pairwise UGMs over $X$ corresponding to $G$ can be written as the following exponential family distribution:
$$\mathbb{P}(X; \theta) \propto \exp\left\{ \sum_{s \in V} \theta_s \phi_s(X_s) + \sum_{(s,t) \in E} \theta_{st} \varphi_{st}(X_s, X_t) \right\}. \qquad (1)$$
In such a pairwise model, $(X_s, X_t)$ are conditionally independent (given the rest of the variables) if and only if the weight $\theta_{st}$ is zero. The most popular instances of pairwise UGMs are Gaussian graphical models (GGMs) [19, 2] for real-valued random variables and Ising (or Potts) models [15] for binary or finite nominal discrete random variables.
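To fix ideas, the unnormalized log-density inside the exponent of (1) is cheap to evaluate even though the normalizing constant is not. Below is a minimal sketch, assuming the common product sufficient statistics $\phi_s(x) = x$ and $\varphi_{st}(x, y) = xy$ (an illustrative special case, not the general setting of this paper):

```python
import numpy as np

def pairwise_log_potential(x, theta_node, theta_edge):
    """Unnormalized log-density of the pairwise UGM (1), assuming the
    product statistics phi_s(x) = x and varphi_st(x, y) = x * y.

    theta_node: length-p vector of node weights theta_s.
    theta_edge: symmetric p x p matrix of edge weights theta_st with zero
    diagonal; entries are nonzero only on the edges E of the graph G.
    """
    node_term = float(theta_node @ x)
    edge_term = 0.5 * float(x @ theta_edge @ x)  # 1/2 so each edge is counted once
    return node_term + edge_term
```

Setting theta_edge[s, t] to zero then makes $X_s$ and $X_t$ conditionally independent given the rest, matching the Markov property stated above.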
More broadly, in order to derive multivariate graphical models from univariate exponential family distributions (such as the Gaussian, binomial/multinomial, Poisson, exponential distributions, etc.), the exponential family graphical models (EFGMs) [27, 21] were proposed as a unified framework to learn UGMs with node-wise conditional distributions arising from generalized linear models (GLMs).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1.1 Overview of contribution

A fundamental issue that arises in UGMs is to specify the sufficient statistics, i.e., $\{\phi_s(X_s), \varphi_{st}(X_s, X_t)\}$, for modeling the interactions among variables. It is noteworthy that most prior pairwise UGMs use the pairwise product of variables (or properly transformed variables) as pairwise sufficient statistics [16, 11, 27]. This is clearly restrictive in modern data analysis tasks where the underlying pairwise interactions among variables are more often than not highly complex and unknown a priori. The goal of this work is to remove such a restriction and explore the feasibility (in theory and practice) of defining sufficient statistics in an additive formulation to approximate the underlying unknown sufficient statistics. To this end, we consider the following Additive Exponential Family Graphical Model (AdEFGM) distribution with joint density function:
$$\mathbb{P}(X; f) = \exp\left\{ \sum_{s \in V} f_s(X_s) + \sum_{(s,t) \in E} f_{st}(X_s, X_t) - A(f) \right\}, \qquad (2)$$
where $f_s: \mathcal{X} \rightarrow \mathbb{R}$ and $f_{st}(\cdot,\cdot): \mathcal{X}^2 \rightarrow \mathbb{R}$ are respectively node-wise and pairwise statistics, and $A(f) := \log \int_{\mathcal{X}^p} \exp\left\{ \sum_{s \in V} f_s(X_s) + \sum_{(s,t) \in E} f_{st}(X_s, X_t) \right\} dX$ is the log-partition function. We require that the condition $A(f) < \infty$ holds so that the definition of probability is valid.
In this paper, we assume the formulations of the sufficient statistics $f_s$ and $f_{st}$ are unknown, but that they admit linear representations over two sets of pre-fixed basis functions $\{\phi_k(\cdot), k = 1, 2, ..., q\}$ and $\{\varphi_l(\cdot,\cdot), l = 1, 2, ..., r\}$, respectively. That is,
$$f_s(X_s) = \sum_{k=1}^{q} \theta_{s;k} \phi_k(X_s), \qquad f_{st}(X_s, X_t) = \sum_{l=1}^{r} \theta_{st;l} \varphi_l(X_s, X_t), \qquad (3)$$
where q and r are the truncation order parameters. In the formulation (3), the choice of basis functions and their sizes is flexible and task-dependent. For instance, if the mapping functions $f_s$ and $f_{st}$ are periodic, then we can choose $\{\phi_k(\cdot)\}$ as a 1-D Fourier basis and $\{\varphi_l(\cdot,\cdot)\}$ as a 2-D Fourier basis. As another instance, the basis $\{\varphi_l\}$ can be chosen as multiple kernels, which are commonly used in computer vision tasks. Specifically, when $q = r = 1$, $\varphi_l(X_s, X_t) = X_s X_t$ and $\phi_k(X_s)$ is fixed as a certain parametric function, AdEFGM reduces to the standard EFGM [27, 21]. In general cases, by imposing an additive structure on the sufficient statistics $f_s$ and $f_{st}$, AdEFGM is expected to be able to capture more complex interactions among variables beyond pairwise products.

As the core contribution of this paper, we propose two $\ell_{2,1}$-norm regularized maximum likelihood estimation (MLE) estimators to learn the weights of AdEFGM in high dimensional settings. The first estimator is formulated as an $\ell_{2,1}$-norm regularized MLE to jointly estimate all the parameters in the model. The second estimator is formulated as an $\ell_{2,1}$-norm regularized node-wise conditional MLE to estimate the parameters associated with each node individually.
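As a concrete illustration of the linear representation (3), the following sketch builds a small 2-D Fourier basis for the pairwise statistic $f_{st}$; the particular frequency layout, the period, and the function names are illustrative assumptions rather than the construction prescribed by the paper:

```python
import numpy as np

def fourier_basis_2d(xs, xt, r, period=10.0):
    """Evaluate r two-dimensional Fourier basis functions varphi_l(xs, xt).

    Hypothetical layout for illustration: sin/cos pairs at increasing
    integer frequencies of (xs - xt) and (xs + xt).
    """
    w = 2.0 * np.pi / period
    feats = []
    freq = 1
    while len(feats) < r:
        feats.append(np.sin(freq * w * (xs - xt)))
        if len(feats) < r:
            feats.append(np.cos(freq * w * (xs - xt)))
        if len(feats) < r:
            feats.append(np.sin(freq * w * (xs + xt)))
        if len(feats) < r:
            feats.append(np.cos(freq * w * (xs + xt)))
        freq += 1
    return np.array(feats)

def f_st(xs, xt, theta_st, r):
    """Additive pairwise statistic f_st = sum_l theta_{st;l} * varphi_l(xs, xt)."""
    return float(theta_st @ fourier_basis_2d(xs, xt, r))
```

With r = 8 this matches the basis size used in the simulations of §4; learning then reduces to estimating the coefficient vector theta_st for every pair (s, t).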
Theoretically, we prove that under mild conditions the joint MLE estimator achieves convergence rate $O(\sqrt{(2|E| + p) \ln p / n})$, where $|E|$ is the number of edges, while the node-wise conditional estimator achieves convergence rate $O(\sqrt{(d + 1) \ln p / n})$, in which d is the degree of the underlying graph G. Computationally, we propose a Monte-Carlo approximation scheme to efficiently optimize the estimators via proximal gradient descent methods.

We conduct numerical studies on simulated and real data to support our claims. The simulation results confirm that, when the data are drawn from an underlying UGM with highly nonlinear sufficient statistics, our estimators significantly outperform GGM and Nonparanormal [10] estimators in most cases. The experimental results on a stock price data set show that our estimators are able to recover more accurate category links among stocks than GGM and Nonparanormal estimators.

1.2 Related work

In order to model random variables beyond parametric UGMs such as GGMs and Ising models, researchers have recently investigated semi-parametric/nonparametric extensions of these parametric models. The Nonparanormal [11] and copula-based methods [5] are semi-parametric graphical models which assume that the data are Gaussian after applying a monotone transformation. More broadly, one could learn transformations of the variables and then fit any parametric UGM (like EFGMs) over the transformed variables. In [10, 26], two rank-based estimators were used to estimate the correlation matrix and then fit GGMs. In [24], a semi-parametric method was proposed to fit the conditional means of the features with an arbitrary additive formulation. The Semi-EFGM proposed in [28] is a semi-parametric rank-based conditional estimator for exponential family graphical models.
In [1], a kernel method was proposed for learning the structure of graphical models by treating variables as Gaussians in a mapped high-dimensional feature space. In [7], Gu proposed a functional minimization framework to estimate the nonparametric model (1) over a Reproducing Kernel Hilbert Space (RKHS). Nonparametric exponential family graphical models based on score matching loss were investigated in [9, 20]. The forest density estimation [8] is a fully nonparametric method for estimating UGMs with structure restricted to be a forest. In contrast to all these existing semi-parametric/nonparametric models, our approach is novel in model definition and computation: we impose a simple additive structure on sufficient statistics to describe complex interactions between variables and use Monte-Carlo approximation to estimate the intractable normalization constant for efficient optimization.

1.3 Notation and organization

Notation. Let $\theta = \{\theta_{s;k}, \theta_{st;l} : s \in V, k = 1, ..., q, (s, t) \in V^2, s \neq t, l = 1, ..., r\}$ be a vector of parameters associated with AdEFGM and $\mathcal{G} = \{\{(s, k)\}_k, \{(st, l)\}_l : s \in V, (s, t) \in V^2, s \neq t\}$ be a group structure induced by the additive structures of nodes and edges. We conventionally define the following grouped-norm related notations: $\|\theta\|_{2,1} = \sum_{g \in \mathcal{G}} \|\theta_g\|$, $\|\theta\|_{2,\infty} = \max_{g \in \mathcal{G}} \|\theta_g\|$, $\mathrm{supp}(\theta, \mathcal{G}) = \{g \in \mathcal{G} : \|\theta_g\| \neq 0\}$ and $\|\theta\|_{2,0} = |\mathrm{supp}(\theta, \mathcal{G})|$. For any $S \subseteq \mathcal{G}$, these notations can be defined restrictively over $\theta_S$. We denote $\bar{S} = \mathcal{G} \setminus S$ the complement of S in $\mathcal{G}$.

Organization. The remainder of this paper is organized as follows: In §2, we present two maximum likelihood estimators for learning the model parameters of AdEFGM. The statistical guarantees of the proposed estimators are analyzed in §3.
Monte-Carlo simulations and experimental results on real stock price data are presented in §4. Finally, we conclude this paper in §5. Due to the space limit, all the technical proofs of theoretical results are deferred to an appendix section which is included in the supplementary material.

2 $\ell_{2,1}$-norm Regularized MLE for AdEFGM

In this section, we investigate the problem of estimating the parameters of AdEFGM in high dimensional settings. By substituting (3) into (2), the distribution of an AdEFGM can be converted to the following form:
$$\mathbb{P}(X; \theta) = \exp\{B(X; \theta) - A(\theta)\}, \qquad (4)$$
where $\theta = \{\theta_{s;k}, \theta_{st;l}\}$, and
$$B(X; \theta) := \sum_{s \in V, k} \theta_{s;k} \phi_k(X_s) + \sum_{(s,t) \in E, l} \theta_{st;l} \varphi_l(X_s, X_t), \qquad A(\theta) := \log \int_{\mathcal{X}^p} \exp\{B(X; \theta)\} dX.$$
Suppose we have n i.i.d. samples $\mathcal{X}_n = \{X^{(i)}\}_{i=1}^n$ drawn from the following AdEFGM with true parameters $\theta^*$:
$$\mathbb{P}(X; \theta^*) = \exp\{B(X; \theta^*) - A(\theta^*)\}. \qquad (5)$$
An important goal of graphical model learning is to estimate the true parameters $\theta^*$ from the observed data $\mathcal{X}_n$. The more accurate the parameter estimation is, the more accurately we are able to recover the underlying true graph structure.
We next propose two $\ell_{2,1}$-norm regularized maximum likelihood estimation (MLE) methods for joint and node-conditional learning of parameters, respectively.

2.1 Joint MLE estimation

Given the sample set $\mathcal{X}_n = \{X^{(i)}\}_{i=1}^n$, the negative log-likelihood of the joint distribution (5) is:
$$\mathcal{L}(\theta; \mathcal{X}_n) = -\frac{1}{n} \sum_{i=1}^{n} B(X^{(i)}; \theta) + A(\theta).$$
It is trivial to verify that $\mathcal{L}(\theta; \mathcal{X}_n)$ has the following first order derivative (see, e.g., [25]):
$$\frac{\partial \mathcal{L}}{\partial \theta_{s;k}} = \mathbb{E}_\theta[\phi_k(X_s)] - \frac{1}{n} \sum_{i=1}^{n} \phi_k(X_s^{(i)}), \qquad \frac{\partial \mathcal{L}}{\partial \theta_{st;l}} = \mathbb{E}_\theta[\varphi_l(X_s, X_t)] - \frac{1}{n} \sum_{i=1}^{n} \varphi_l(X_s^{(i)}, X_t^{(i)}), \qquad (6)$$
where the expectation $\mathbb{E}_\theta[\cdot]$ is taken over the joint distribution (2). Also, it is well known that $\mathcal{L}(\theta; \mathcal{X}_n)$ is convex in $\theta$.

In order to estimate the parameters, which are expected to be sparse at the edge level due to the potentially sparse structure of the graph, we consider the following $\ell_{2,1}$-norm regularized MLE estimator:
$$\hat{\theta}^n = \arg\min_{\theta} \left\{ \mathcal{L}(\theta; \mathcal{X}_n) + \lambda_n \|\theta\|_{2,1} \right\}, \qquad (7)$$
where $\|\theta\|_{2,1} = \sum_{s \in V} \big(\sum_{k=1}^{q} \theta_{s;k}^2\big)^{1/2} + \sum_{(s,t) \in V^2, s \neq t} \big(\sum_{l=1}^{r} \theta_{st;l}^2\big)^{1/2}$ is the $\ell_{2,1}$-norm with respect to the basis statistics and $\lambda_n > 0$ is the regularization strength parameter dependent on n. The $\ell_{2,1}$-norm penalty is used to promote edgewise sparsity, as the graph structure is expected to be sparse in high dimensional settings.

2.2 Node-conditional MLE estimation

Recent state-of-the-art methods for learning UGMs suggest a natural procedure for deriving multivariate graphical models from univariate distributions [12, 15, 27].
The common idea in these methods is to learn the graph structure by estimating node-neighborhoods, or by fitting the node-conditional distribution of each individual node conditioned on the rest of the nodes. Indeed, these node-wise fitting methods have been shown to have strong statistical guarantees and attractive computational performance. Inspired by these approaches, we propose an alternative estimator to estimate the weights of the sufficient statistics associated with each individual node. With a slight abuse of notation, we denote $\theta_s$ a subvector of $\theta$ associated with node s, i.e.,
$$\theta_s := \{\theta_{s;k} \mid k = 1, ..., q\} \cup \{\theta_{st;l} \mid t \in \mathcal{N}(s), l = 1, ..., r\},$$
where $\mathcal{N}(s)$ is the neighborhood of s. Given the joint distribution (4), it is easy to show that the conditional distribution of $X_s$ given the rest of the variables, $X_{\setminus s}$, is written as:
$$\mathbb{P}(X_s \mid X_{\setminus s}; \theta_s) = \exp\left\{ C(X_s \mid X_{\setminus s}; \theta_s) - D(X_{\setminus s}; \theta_s) \right\}, \qquad (8)$$
where $C(X_s \mid X_{\setminus s}; \theta_s) := \sum_k \theta_{s;k} \phi_k(X_s) + \sum_{t \in \mathcal{N}(s), l} \theta_{st;l} \varphi_l(X_s, X_t)$, and $D(X_{\setminus s}; \theta_s) := \log \int_{\mathcal{X}} \exp\left\{ C(X_s \mid X_{\setminus s}; \theta_s) \right\} dX_s$ is the log-partition function which ensures normalization. We note that the condition $A(\theta) < \infty$ for the joint log-partition function implies $D(X_{\setminus s}; \theta_s) < \infty$.

In order to estimate the parameters associated with a node, we consider using sparsity regularized conditional maximum likelihood estimation. Given n independent samples $\mathcal{X}_n$ drawn from (5), we can write the negative log-likelihood of the conditional distribution as:
$$\tilde{\mathcal{L}}(\theta_s; \mathcal{X}_n) = \frac{1}{n} \sum_{i=1}^{n} \left\{ -C(X_s^{(i)} \mid X_{\setminus s}^{(i)}; \theta_s) + D(X_{\setminus s}^{(i)}; \theta_s) \right\}.$$
It is standard that $\tilde{\mathcal{L}}(\theta_s; \mathcal{X}_n)$ is convex with respect to $\theta_s$ and it has the following first-order derivative:
$$\frac{\partial \tilde{\mathcal{L}}(\theta_s; \mathcal{X}_n)}{\partial \theta_{s;k}} = \frac{1}{n} \sum_{i=1}^{n} \left\{ -\phi_k(X_s^{(i)}) + \mathbb{E}_{\theta_s}[\phi_k(X_s) \mid X_{\setminus s}^{(i)}] \right\}, \qquad \frac{\partial \tilde{\mathcal{L}}(\theta_s; \mathcal{X}_n)}{\partial \theta_{st;l}} = \frac{1}{n} \sum_{i=1}^{n} \left\{ -\varphi_l(X_s^{(i)}, X_t^{(i)}) + \mathbb{E}_{\theta_s}[\varphi_l(X_s, X_t^{(i)}) \mid X_{\setminus s}^{(i)}] \right\}, \qquad (9)$$
where the expectation $\mathbb{E}_{\theta_s}[\cdot \mid X_{\setminus s}]$ is taken over the node-wise conditional distribution (8).

Let us consider the following $\ell_{2,1}$-norm regularized conditional MLE formulation associated with the variable $X_s$:
$$\hat{\theta}_s^n = \arg\min_{\theta_s} \left\{ \tilde{\mathcal{L}}(\theta_s; \mathcal{X}_n) + \lambda_n \|\theta_s\|_{2,1} \right\}, \qquad (10)$$
where $\|\theta_s\|_{2,1} = \big(\sum_{k=1}^{q} \theta_{s;k}^2\big)^{1/2} + \sum_{t \neq s} \big(\sum_{l=1}^{r} \theta_{st;l}^2\big)^{1/2}$ is the grouped $\ell_{2,1}$-norm with respect to the node-wise and pairwise basis functions associated with s, and $\lambda_n > 0$ controls the regularization strength.

2.3 Computation via Monte-Carlo approximation

We consider using proximal gradient descent methods [22] to solve the composite optimization problems in (7) and (10). For both estimators, the major computational overhead is to iteratively calculate the expectation terms involved in the gradients $\nabla \mathcal{L}(\theta; \mathcal{X}_n)$ and $\nabla \tilde{\mathcal{L}}(\theta_s; \mathcal{X}_n)$. In general, these expectation terms have no closed form for exact calculation, and sampling methods such as importance sampling and MCMC are usually needed for approximate estimation.
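Besides the gradient, the proximal gradient methods just mentioned need the proximal operator of $\lambda_n \|\cdot\|_{2,1}$, which is available in closed form as group-wise soft-thresholding over the groups in $\mathcal{G}$. A minimal sketch (the flat parameter layout and the index groups are illustrative assumptions):

```python
import numpy as np

def prox_l21(theta, groups, tau):
    """Proximal operator of tau * ||theta||_{2,1}: group-wise soft-thresholding.

    groups: list of index arrays, one per group g in G (a node block or an
    edge block of basis coefficients). Each block is shrunk toward zero
    by tau in Euclidean norm; blocks with norm <= tau are zeroed out.
    """
    out = theta.copy()
    for g in groups:
        norm_g = np.linalg.norm(theta[g])
        if norm_g <= tau:
            out[g] = 0.0  # whole group (e.g., an edge) is pruned
        else:
            out[g] = (1.0 - tau / norm_g) * theta[g]
    return out
```

A single proximal gradient step with step size eta then reads theta = prox_l21(theta - eta * grad, groups, eta * lambda_n); groups whose norm falls below the threshold are set exactly to zero, which is how the edgewise sparsity promoted by (7) and (10) arises.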
There are, however, two challenging issues with such a sampling based optimization procedure: (1) the multivariate sampling methods typically suffer from high computational cost even when the dimensionality p is moderately large; and (2) the non-vanishing sampling error of the gradient will accumulate during the iterations, which according to the results in [18] will deteriorate the overall convergence performance. Obviously, the main source of these challenges is the intractable log-partition terms appearing in the estimators.

To more efficiently apply the first-order methods without suffering from iterative sampling and error accumulation, it is a natural idea to replace the log-partition terms by a Monte-Carlo approximation and minimize the resulting approximated formulation. Taking the joint estimator (7) as an example, we resort to the basic importance sampling method to approximately estimate the log-partition term $A(\theta) = \log \int_{\mathcal{X}^p} \exp\{B(X; \theta)\} dX$. Assume we have m i.i.d. samples $\mathcal{Y}_m = \{Y^{(j)}\}_{j=1}^m$ drawn from a random vector $Y \in \mathcal{X}^p$ with known probability density $\mathbb{P}(Y)$. Given $\theta$, an importance sampling estimate of $\exp\{A(\theta)\}$ is given by
$$\exp\{\hat{A}(\theta; \mathcal{Y}_m)\} = \frac{1}{m} \sum_{j=1}^{m} \frac{\exp\{B(Y^{(j)}; \theta)\}}{\mathbb{P}(Y^{(j)})}.$$
We consider the following Monte-Carlo approximation to the estimator (7):
$$\hat{\hat{\theta}}^n = \arg\min_{\theta} \left\{ \hat{\mathcal{L}}(\theta; \mathcal{X}_n, \mathcal{Y}_m) + \lambda_n \|\theta\|_{2,1} \right\}, \qquad (11)$$
where $\hat{\mathcal{L}}(\theta; \mathcal{X}_n, \mathcal{Y}_m) = -\frac{1}{n} \sum_{i=1}^{n} B(X^{(i)}; \theta) + \hat{A}(\theta; \mathcal{Y}_m)$. Since the random samples $\mathcal{Y}_m$ are fixed in (11), the sampling operation can be avoided in the computation of $\nabla \hat{\mathcal{L}}(\theta; \mathcal{X}_n, \mathcal{Y}_m)$.
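The estimate $\hat{A}(\theta; \mathcal{Y}_m)$ above is best computed in log-space to avoid overflow. A minimal sketch, assuming user-supplied callables for $B(\cdot; \theta)$, the proposal sampler, and the proposal log-density (hypothetical interfaces, not from the paper):

```python
import numpy as np

def log_partition_is(B, sample_Y, log_density_Y, theta, m, rng):
    """Importance sampling estimate of A(theta) = log of the integral
    of exp{B(X; theta)} dX.

    B(y, theta): unnormalized log-density evaluated at one sample y;
    sample_Y(m, rng): draws m proposal samples (an m x p array);
    log_density_Y(y): log of the proposal density P(y).
    """
    Y = sample_Y(m, rng)
    log_w = np.array([B(y, theta) - log_density_Y(y) for y in Y])
    c = log_w.max()  # log-mean-exp trick for numerical stability
    return c + np.log(np.mean(np.exp(log_w - c)))
```

As a sanity check, for $B(x; \theta) = -x^2/2$ on $\mathbb{R}$ with a uniform proposal on $[-5, 5]$, the estimate approaches $\log\sqrt{2\pi} \approx 0.919$ as m grows.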
Concerning the accuracy of the approximate estimator (11), the following result guarantees that, with high probability, the minimizer of the approximate estimator (11) is suboptimal to the population estimator (7) with suboptimality $O(1/\sqrt{m})$. A proof of this proposition is provided in Appendix A.1 (see the supplementary material).

Proposition 1. Assume that $\mathbb{P}(Y) > 0$. Then the following inequality holds with high probability:
$$\mathcal{L}(\hat{\hat{\theta}}^n; \mathcal{X}_n) + \lambda_n \|\hat{\hat{\theta}}^n\|_{2,1} \leq \mathcal{L}(\hat{\theta}^n; \mathcal{X}_n) + \lambda_n \|\hat{\theta}^n\|_{2,1} + \frac{2.58 \hat{\sigma}}{\sqrt{m}} \left( \exp\{-A(\hat{\hat{\theta}}^n)\} + \exp\{-\hat{A}(\hat{\theta}^n; \mathcal{Y}_m)\} \right),$$
where $\hat{\sigma}^2 = \frac{1}{m} \sum_{j=1}^{m} \left( \frac{\exp\{B(Y^{(j)}; \hat{\theta}^n)\}}{\mathbb{P}(Y^{(j)})} - \exp\{\hat{A}(\hat{\theta}^n; \mathcal{Y}_m)\} \right)^2$.

A similar Monte-Carlo approximation strategy can be applied to the node-wise MLE estimator (10).

3 Statistical Analysis

In this section, we provide statistical guarantees on the parameter estimation error for the joint MLE estimator (7) and the node-conditional estimator (10). In the large picture, our analysis follows the techniques presented in [13, 30] by specifying the conditions under which these techniques can be applied to our setting.

3.1 Analysis of the joint estimator

We are interested in the concentration bounds of the random variables defined by
$$Z_{s;k} := \phi_k(X_s) - \mathbb{E}_{\theta^*}[\phi_k(X_s)], \qquad Z_{st;l} := \varphi_l(X_s, X_t) - \mathbb{E}_{\theta^*}[\varphi_l(X_s, X_t)],$$
where the expectation $\mathbb{E}_{\theta^*}[\cdot]$ is taken over the underlying true distribution (5). By the "law of the unconscious statistician" we have $\mathbb{E}[Z_{s;k}] = \mathbb{E}[Z_{st;l}] = 0$. That is, $\{Z_{s;k}\}$ and $\{Z_{st;l}\}$ are zero-mean random variables.
We introduce the following technical condition on $\{Z_{s;k}, Z_{st;l}\}$, which we will show to guarantee that the gradient $\nabla \mathcal{L}(\theta^*; \mathcal{X}_n)$ vanishes exponentially fast, with high probability, as the sample size increases.

Assumption 1. For all (s, k) and all (s, t, l), we assume that there exist constants $\sigma > 0$ and $\zeta > 0$ such that for all $|\eta| \leq \zeta$,
$$\mathbb{E}[\exp\{\eta Z_{s;k}\}] \leq \exp\{\sigma^2 \eta^2 / 2\}, \qquad \mathbb{E}[\exp\{\eta Z_{st;l}\}] \leq \exp\{\sigma^2 \eta^2 / 2\}.$$
This assumption essentially imposes an exponential-type bound on the moment generating function of the random variables $Z_{s;k}$, $Z_{st;l}$.

It is well known that the Hessian $\nabla^2 \mathcal{L}(\theta; \mathcal{X}_n)$ is positive semidefinite at any $\theta$ and it is independent of the sample set $\mathcal{X}_n$. We also need the following condition, which guarantees the restricted positive definiteness of $\nabla^2 \mathcal{L}(\theta; \mathcal{X}_n)$ over a certain low dimensional subspace when $\theta$ is in the vicinity of $\theta^*$.

Assumption 2 (Locally Restricted Positive Definite Hessian). Let $S = \mathrm{supp}(\theta^*, \mathcal{G})$. There exist constants $\delta > 0$ and $\beta > 0$ such that for any $\theta \in \{\|\theta - \theta^*\| \leq \delta\}$, the inequality $\vartheta^\top \nabla^2 \mathcal{L}(\theta; \mathcal{X}_n) \vartheta \geq \beta \|\vartheta\|^2$ holds for any $\vartheta \in \mathcal{C}_S := \{\|\vartheta_{\bar{S}}\|_{2,1} \leq 3 \|\vartheta_S\|_{2,1}\}$.

Assumption 2 requires that the Hessian $\nabla^2 \mathcal{L}(\theta; \mathcal{X}_n)$ is positive definite in the cone $\mathcal{C}_S$ when $\theta$ lies in a local ball centered at $\theta^*$. This condition is a specification of the concept of restricted strong convexity [30] to AdEFGM.

Remark 1 (Minimal Representation). We say an AdEFGM has a minimal representation if there is a unique parameter vector $\theta$ associated with the distribution (4). This condition equivalently requires that there exists no non-zero $\vartheta$ such that $B(X; \vartheta)$ is equal to an absolute constant.
This implies that for any $\theta$ and all non-zero $\vartheta$,
$$\mathrm{Var}_\theta[B(X; \vartheta)] = \vartheta^\top \nabla^2 \mathcal{L}(\theta; \mathcal{X}_n) \vartheta > 0.$$
If AdEFGM has a minimal representation at $\theta^*$, then there must exist sufficiently small constants $\delta > 0$ and $\beta > 0$ such that for any $\theta \in \{\|\theta - \theta^*\| \leq \delta\}$, $\vartheta^\top \nabla^2 \mathcal{L}(\theta; \mathcal{X}_n) \vartheta \geq \beta \|\vartheta\|^2$. Therefore, Assumption 2 holds true when AdEFGM has a minimal representation at $\theta^*$.

The following theorem is our main result on the estimation error of the joint MLE estimator (7). A proof of this result is provided in Appendix A.2 in the supplementary material.

Theorem 1. Assume that the conditions in Assumption 1 and Assumption 2 hold. If the sample size n satisfies
$$n > \max\left\{ \frac{54 c_0^2 \sigma^2 \max\{q, r\} \|\theta^*\|_{2,0} \ln p}{\delta^2 \beta^2}, \frac{6 \max\{q, r\} \ln p}{\sigma^2 \zeta^2} \right\},$$
then with probability at least $1 - 2\max\{q, r\} p^{-1}$, the following inequality holds:
$$\|\hat{\theta}^n - \theta^*\| \leq 3 c_0 \beta^{-1} \sigma \sqrt{6 \max\{q, r\} \|\theta^*\|_{2,0} \ln p / n}.$$
Remark 2. The main message Theorem 1 conveys is that when n is sufficiently large, the estimation error $\|\hat{\theta}^n - \theta^*\|$ vanishes at the order of $O(\sqrt{\max\{q, r\}(2|E| + p) \ln p / n})$ with high probability. This convergence rate matches the results obtained in [17, 16] for GGMs and the results in [10, 26] for Nonparanormal.

3.2 Analysis of the node-conditional estimator

For the node-conditional estimator (10), we study the rate of convergence of the parameter estimation error $\|\hat{\theta}_s^n - \theta_s^*\|$ as a function of the sample size n. We need Assumption 1 and the following assumption in our analysis.
Assumption 3. For any node s, let $S = \mathrm{supp}(\theta_s^*, \mathcal{G})$. There exist constants $\tilde{\delta} > 0$ and $\tilde{\beta} > 0$ such that for any $\theta_s \in \{\|\theta_s - \theta_s^*\| < \tilde{\delta}\}$, the inequality $\vartheta_s^\top \nabla^2 \tilde{\mathcal{L}}(\theta_s; \mathcal{X}_n) \vartheta_s \geq \tilde{\beta} \|\vartheta_s\|^2$ holds for any $\vartheta_s \in \tilde{\mathcal{C}}_S := \{\|(\vartheta_s)_{\bar{S}}\|_{2,1} \leq 3 \|(\vartheta_s)_S\|_{2,1}\}$.

The following is our main result on the convergence rate of the node-conditional estimation error $\|\hat{\theta}_s^n - \theta_s^*\|$. A proof of this result is provided in Appendix A.3 in the supplementary material.

Theorem 2. Assume that the conditions in Assumption 1 and Assumption 3 hold. If the sample size n satisfies
$$n > \max\left\{ \frac{216 \tilde{c}_0^2 \tilde{\sigma}^2 \max\{q, r\} \|\theta_s^*\|_{2,0} \ln p}{\tilde{\delta}^2 \tilde{\beta}^2}, \frac{6 \max\{q, r\} \ln p}{\sigma^2 \zeta^2} \right\},$$
then with probability at least $1 - 4\max\{q, r\} p^{-2}$, the following inequality holds:
$$\|\hat{\theta}_s^n - \theta_s^*\| \leq 6 \tilde{c}_0 \tilde{\beta}^{-1} \sigma \sqrt{6 \max\{q, r\} \|\theta_s^*\|_{2,0} \ln p / n}.$$
Remark 3. Theorem 2 indicates that with overwhelming probability, the estimation error $\|\hat{\theta}_s^n - \theta_s^*\| = O(\sqrt{(d + 1) \ln p / n})$, where d is the degree of the underlying graph, i.e., $d = \max_{s \in V} \|\theta_s^*\|_{2,0} - 1$. We may combine the parameter estimation errors from all the nodes as a global measurement of accuracy. Indeed, by Theorem 2 and a union bound over probabilities we get that $\max_{s \in V} \|\hat{\theta}_s^n - \theta_s^*\| = O(\sqrt{(d + 1) \ln p / n})$ holds with probability at least $1 - 4\max\{q, r\} p^{-1}$.
This estimation error bound matches those for GGMs with neighborhood-selection-type estimators [29].

4 Experiments

This section is devoted to showing the actual learning performance of AdEFGM. We first investigate graph structure recovery accuracy using simulation data (for which we know the ground truth), and then we apply our method to a stock price data set for inferring the statistical dependency among stocks.

4.1 Monte-Carlo simulation

This is a proof-of-concept experiment. The purpose is to confirm that when the pairwise interactions of the underlying graphical models are highly nonlinear and unknown a priori, our additive estimator will be significantly superior to existing parametric/semi-parametric graphical models for inferring the structure of graphs. The numerical results of AdEFGM reported in this experiment are obtained by the joint MLE estimator in (7).

Simulated data. Our simulation study employs a graphical model of which the edges are generated independently with probability P. We will consider the model under different levels of sparsity by adjusting the probability P. For simplicity, we assume $f_s(X_s) \equiv 1$ and consider a nonlinear pairwise interaction function $f_{st}(X_s, X_t) = \cos(\pi (X_s - X_t)/5)$. We fit the data to the additive model (4) with a 2-D Fourier basis of size 8. Using Gibbs sampling, we generate a training sample of size n from the true graphical model, and an independent sample of the same size from the same distribution for tuning the strength parameter $\lambda_n$.
We compare performance for n = 200, varying values of $p \in \{50, 100, 150, 200, 250, 300\}$, and different sparsity levels under $P \in \{0.02, 0.05, 0.1\}$, replicated 10 times for each configuration.

Baselines. We compare the performance of our estimator to Graphical Lasso [6] as a GGM estimator and SKEPTIC [10] as a Nonparanormal estimator. In our implementation, we use a version of SKEPTIC with Kendall's tau to infer the correlation.

Evaluation metric. To evaluate the support recovery performance, we use the standard F-score from the information retrieval literature. The larger the F-score is, the better the support recovery performance. The numerical values over $10^{-3}$ in magnitude are considered to be nonzero.

Results. Figure 1 shows the support recovery F-scores of the considered methods on the synthetic data. From this group of results we can observe that, by using a 2-D Fourier basis to approximate the unknown cosine distance function, AdEFGM is able to more accurately recover the underlying graph structure than the other two considered methods. The advantage of AdEFGM illustrated here is as expected, because it is designed to automatically learn the unknown complex pairwise interactions while GGM and Nonparanormal are restricted to certain UGMs with known sufficient statistics.

Figure 1: Simulated data: Support recovery F-score curves. Left panels: P = 0.02. Middle panels: P = 0.05. Right panels: P = 0.1.

Figure 2: Stock price data S&P500: Category link precision, recall and F-score curves.

4.2 Stock price data

We further study the performance of AdEFGM on a stock price data set. This data set contains the historical prices of S&P500 stocks over 5 years, from January 1, 2008 to January 1, 2013. By taking out the stocks with less than 5 years of history, we end up with 465 stocks, each having daily closing prices over 1,260 trading days.
The prices are first adjusted for dividends and splits and then used to calculate daily log returns. Each day's return can be represented as a point in $\mathbb{R}^{465}$. To apply AdEFGM to this data, we consider the general model (4) with the 2-D Fourier basis being used to approximate the pairwise interaction between stocks $X_s$ and $X_t$. Since the category information of S&P500 is available, we measure the performance by the Precision, Recall and F-score of the top k links (edges) on the constructed graph. A link is regarded as true if and only if it connects two nodes belonging to the same category. We use the joint MLE estimator for this experiment. Figure 2 shows the curves of precision, recall and F-score as functions of k in a wide range $[10^3, 10^5]$. It is apparent that AdEFGM significantly outperforms GGM and Nonparanormal for identifying correct category links. This result suggests that the interactions among the S&P500 stocks are highly nonlinear.

5 Conclusions

In this paper, we proposed and analyzed AdEFGMs as a generic class of additive undirected graphical models. By expressing node-wise and pairwise sufficient statistics as linear representations over a set of basis statistics, AdEFGM is able to capture complex interactions among variables, which are not uncommon in modern engineering applications. We investigated two types of $\ell_{2,1}$-norm regularized MLE estimators for joint and node-conditional high dimensional estimation.
Based on our theoretical justification and empirical observation, we can draw the following two conclusions: 1) ℓ2,1-norm regularized AdEFGM learning is a powerful tool for inferring pairwise exponential family graphical models with unknown, arbitrary sufficient statistics; and 2) the extra flexibility gained by AdEFGM comes at almost no cost of statistical and computational efficiency.

Acknowledgments

Xiao-Tong Yuan and Ping Li were partially supported by NSF-Bigdata-1419210, NSF-III-1360971, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Xiao-Tong Yuan is also partially supported by NSFC-61402232, NSFC-61522308, and NSFJP-BK20141003. Tong Zhang is supported by NSF-IIS-1407939 and NSF-IIS-1250985. Qingshan Liu is supported by NSFC-61532009. Guangcan Liu is supported by NSFC-61622305, NSFC-61502238 and NSFJP-BK20160040.

References

[1] F. Bach and M. Jordan. Learning graphical models with Mercer kernels. In Proceedings of the 16th Annual Conference on Neural Information Processing Systems (NIPS'02), 2002.

[2] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

[3] R. G. Baraniuk, M. Davenport, R. A. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices.
Constructive Approximation, 28(3):253–263, 2008.

[4] E. J. Candès, Y. C. Eldar, D. Needell, and P. Randall. Compressed sensing with coherent and redundant dictionaries. Applied and Computational Harmonic Analysis, 31(1):59–73, 2011.

[5] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. The Annals of Applied Statistics, 5:969–993, 2011.

[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[7] C. Gu, Y. Jeon, and Y. Lin. Nonparametric density estimation in high-dimensions. Statistica Sinica, 23:1131–1153, 2013.

[8] J. Lafferty, H. Liu, and L. Wasserman. Sparse nonparametric graphical models. Statistical Science, 27(4):519–537, 2012.

[9] L. Lin, M. Drton, A. Shojaie, et al. Estimation of high-dimensional graphical models using regularized score matching. Electronic Journal of Statistics, 10(1):806–854, 2016.

[10] H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High dimensional semiparametric Gaussian copula graphical models. Annals of Statistics, 40(4):2293–2326, 2012.

[11] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10:2295–2328, 2009.

[12] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436–1462, 2006.

[13] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

[14] A. B. Owen. Monte Carlo Theory, Methods and Examples. 2013.

[15] P. Ravikumar, M. Wainwright, and J. Lafferty.
High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 38(3):1287–1319, 2010.

[16] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

[17] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.

[18] M. Schmidt, N. L. Roux, and F. R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS'11), pages 1458–1466, 2011.

[19] T. P. Speed and H. T. Kiiveri. Gaussian Markov distributions over finite graphs. Annals of Statistics, 14:138–150, 1986.

[20] S. Sun, M. Kolar, and J. Xu. Learning structured densities via infinite dimensional exponential families. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS'15), 2015.

[21] W. Tansey, O. H. M. Padilla, A. S. Suggala, and P. Ravikumar. Vector-space Markov random fields via exponential families. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pages 684–692, 2015.

[22] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.

[23] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. CoRR, arXiv:1011.3027, 2011.

[24] A. Voorman, A. Shojaie, and D. Witten. Graph estimation with joint additive models. Biometrika, 101(1):85–101, 2014.

[25] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[26] L.
Xue and H. Zou. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Annals of Statistics, 40(5):2541–2571, 2012.

[27] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16:3813–3847, 2015.

[28] Z. Yang, Y. Ning, and H. Liu. On semiparametric exponential family graphical models. arXiv preprint arXiv:1412.8697, 2014.

[29] M. Yuan. High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11:2261–2286, 2010.

[30] C.-H. Zhang and T. Zhang. A general framework of dual certificate analysis for structured sparse recovery problems. CoRR, arXiv:1201.3302, 2012.