{"title": "Nonparanormal Belief Propagation (NPNBP)", "book": "Advances in Neural Information Processing Systems", "page_first": 899, "page_last": 907, "abstract": "The empirical success of the belief propagation approximate inference algorithm has inspired numerous theoretical and algorithmic advances. Yet, for continuous non-Gaussian domains performing belief propagation remains a challenging task: recent innovations such as nonparametric or kernel belief propagation, while useful, come with a substantial computational cost and offer little theoretical guarantees, even for tree structured models.  In this work we present Nonparanormal BP  for performing efficient inference on distributions parameterized by  a Gaussian copulas network and any univariate marginals. For  tree structured networks, our approach is guaranteed to be exact for  this powerful class of non-Gaussian models. Importantly, the method  is as efficient as standard Gaussian BP, and its convergence properties do not depend on the complexity of the univariate marginals, even when a nonparametric representation is used.", "full_text": "Nonparanormal Belief Propagation (NPNBP)\n\nGal Elidan\n\nDepartment of Statistics\n\nHebrew University\n\nCobi Cario\n\ngalel@huji.ac.il\n\ncobi.cario@mail.huji.ac.il\n\nSchool of Computer Science and Engineering\n\nHebrew University\n\nAbstract\n\nThe empirical success of the belief propagation approximate inference algorithm\nhas inspired numerous theoretical and algorithmic advances. Yet, for continuous\nnon-Gaussian domains performing belief propagation remains a challenging task:\nrecent innovations such as nonparametric or kernel belief propagation, while use-\nful, come with a substantial computational cost and offer little theoretical guaran-\ntees, even for tree structured models. In this work we present Nonparanormal BP\nfor performing ef\ufb01cient inference on distributions parameterized by a Gaussian\ncopulas network and any univariate marginals. 
For tree structured networks, our\napproach is guaranteed to be exact for this powerful class of non-Gaussian models.\nImportantly, the method is as efficient as standard Gaussian BP, and its convergence\nproperties do not depend on the complexity of the univariate marginals,\neven when a nonparametric representation is used.\n\n1 Introduction\n\nProbabilistic graphical models [Pearl, 1988] are widely used to model and reason about phenomena\nin a variety of domains such as medical diagnosis, communication, machine vision and bioinformatics.\nThe usefulness of such models in complex domains, where exact computations are infeasible,\nrelies on our ability to perform efficient and reasonably accurate inference of marginal and conditional\nprobabilities. Perhaps the most popular approximate inference algorithm for graphical models\nis belief propagation (BP) [Pearl, 1988]. Guaranteed to be exact for trees, it is the surprising performance\nof the method when applied to general graphs (e.g., [McEliece et al., 1998, Murphy and\nWeiss, 1999]) that has inspired numerous works ranging from attempts to shed theoretical light\non propagation-based algorithms (e.g., [Weiss and Freeman, 2001, Heskes, 2004, Mooij and Kappen,\n2005]) to a wide range of algorithmic variants and generalizations (e.g., [Yedidia et al., 2001,\nWiegerinck and Heskes, 2003, Globerson and Jaakkola, 2007]).\nIn most works, the variables are either discrete or the distribution is assumed to be Gaussian [Weiss\nand Freeman, 2001]. However, many continuous real-world phenomena are far from Gaussian,\nand can have a complex multi-modal structure. This has inspired several innovative and practically\nuseful methods specifically aimed at the continuous non-Gaussian case such as expectation propagation\n[Minka, 2001], particle BP [Ihler and McAllester, 2009], nonparametric BP [Sudderth et al.,\n2010b], and kernel BP [Song et al., 2011]. 
Since these works are aimed at general unconstrained\ndistributions, they all come at a substantial computational price. Further, little can be said a priori\nabout their expected performance even in tree structured models. Naturally, we would like an inference\nalgorithm that is as general as possible while being as computationally convenient as simple\nGaussian BP [Weiss and Freeman, 2001].\nIn this work we present Nonparanormal BP (NPNBP), an inference method that strikes a balance\nbetween these competing desiderata. In terms of generality, we focus on the flexible class of Copula\nBayesian Networks (CBNs) [Elidan, 2010] that are defined via local Gaussian copula functions and\nany univariate densities (possibly nonparametric). Utilizing the power of the copula framework\n[Nelsen, 2007], these models can capture complex multi-modal and heavy-tailed phenomena.\n\nFigure 1: Samples from the bivariate Gaussian copula with correlation \u03b8 = 0.25: (left) with unit variance Gaussian and Gamma marginals; (right) with a mixture of Gaussian and exponential marginals.\n\nAlgorithmically, our approach enjoys the benefits of Gaussian BP (GaBP). First, it is guaranteed to\nconverge and return exact results on tree structured models, regardless of the form of the univariate\ndensities. Second, it is computationally comparable to performing GaBP on a graph with the same\nstructure. Third, its convergence properties on general graphs are similar to those of GaBP and, quite\nremarkably, do not depend on the complexity of the univariate marginals.\n\n2 Background\n\nIn this section we provide a brief background on copulas in general, the Gaussian copula in particular,\nand the Copula Bayesian Network model of Elidan [2010].\n\n2.1 The Gaussian Copula\n\nA copula function [Sklar, 1959] links marginal distributions to form a multivariate one. Formally:\n\nDefinition 2.1: Let U1, . . . 
, Un be real random variables marginally uniformly distributed on [0, 1].\nA copula function $C : [0, 1]^n \\to [0, 1]$ is a joint distribution\n\n$$C_\\theta(u_1, \\ldots, u_n) = P(U_1 \\le u_1, \\ldots, U_n \\le u_n),$$\n\nwhere \u03b8 are the parameters of the copula function.\nNow consider an arbitrary set X = {X1, . . . , Xn} of real-valued random variables (typically not\nmarginally uniformly distributed). Sklar\u2019s seminal theorem states that for any joint distribution\nFX (x), there exists a copula function C such that FX (x) = C(FX1 (x1), . . . , FXn (xn)). When the\nunivariate marginals are continuous, C is uniquely defined.\nThe constructive converse, which is of central interest from a modeling perspective, is also true.\nSince Ui \u2261 Fi is itself a random variable that is always uniformly distributed in [0, 1], any copula\nfunction taking any marginal distributions {Ui} as its arguments defines a valid joint distribution\nwith marginals {Ui}. Thus, copulas are \u201cdistribution generating\u201d functions that allow us to separate\nthe choice of the univariate marginals and that of the dependence structure, encoded in the copula\nfunction C. Importantly, this flexibility often results in a construction that is beneficial in practice.\n\nDefinition 2.2: The Gaussian copula distribution is defined by:\n\n$$C_\\Sigma(u_1, \\ldots, u_n) = \\Phi_\\Sigma\\left(\\Phi^{-1}(u_1), \\ldots, \\Phi^{-1}(u_n)\\right), \\qquad (1)$$\n\nwhere $\\Phi^{-1}$ is the inverse standard normal distribution, and $\\Phi_\\Sigma$ is a zero mean normal distribution\nwith correlation matrix $\\Sigma$.\n\nExample 2.3: The standard Gaussian distribution is mathematically convenient but limited due\nto its unimodal form and tail behavior. However, the Gaussian copula can give rise to complex\nvaried distributions and offers great flexibility. 
As an example, Figure 1 shows two bivariate distributions\nthat are constructed using the Gaussian copula and two different sets of univariate marginals.\nGenerally, any univariate marginal, parametric or nonparametric, can be used.\n\nLet $\\varphi_\\Sigma(x)$ denote the multivariate normal density with mean zero and covariance $\\Sigma$, and let $\\varphi(x)$\ndenote the univariate standard normal density. Using the chain rule and the\ninverse function theorem, the Gaussian copula density $c(u_1, \\ldots, u_n) = \\frac{\\partial^n C_\\Sigma(u_1, \\ldots, u_n)}{\\partial U_1 \\cdots \\partial U_n}$ is\n\n$$c(u_1, \\ldots, u_n) = \\varphi_\\Sigma\\left(\\Phi^{-1}(u_1), \\ldots, \\Phi^{-1}(u_n)\\right) \\prod_i \\frac{\\partial \\Phi^{-1}(u_i)}{\\partial U_i} = \\frac{\\varphi_\\Sigma\\left(\\Phi^{-1}(u_1), \\ldots, \\Phi^{-1}(u_n)\\right)}{\\prod_i \\varphi(\\Phi^{-1}(u_i))},$$\n\nwhere the second equality uses $\\partial \\Phi^{-1}(u_i) / \\partial U_i = 1 / \\varphi(\\Phi^{-1}(u_i))$.\n\nFor a distribution defined by a Gaussian copula FX (x1, . . . , xn) = C\u03a3(F1(x1), . . . , Fn(xn)), using\n\u2202Ui/\u2202Xi = fi, we have\n\n$$f_X(x_1, \\ldots, x_n) = \\frac{\\partial^n C_\\Sigma(F_1(x_1), \\ldots, F_n(x_n))}{\\partial X_1 \\cdots \\partial X_n} = \\frac{\\varphi_\\Sigma(\\tilde{x}_1, \\ldots, \\tilde{x}_n)}{\\prod_i \\varphi(\\tilde{x}_i)} \\prod_i f_i(x_i), \\qquad (2)$$\n\nwhere $\\tilde{x}_i \\equiv \\Phi^{-1}(u_i) \\equiv \\Phi^{-1}(F_i(x_i))$. We will use this compact notation in the rest of the paper.\n\n2.2 Copula Bayesian Networks\nLet G be a directed acyclic graph (DAG) whose nodes correspond to the random variables X =\n{X1, . . . , Xn}, and let Pai = {Pai1, . . . , Paiki} be the parents of Xi in G. 
As for standard BNs, G\nencodes the independence statements I(G) = {(Xi \u22a5 NonDescendantsi | Pai)}, where \u22a5 denotes\nthe independence relationship, and NonDescendantsi are nodes that are not descendants of Xi in G.\nDefinition 2.4: A Copula Bayesian Network (CBN) is a triplet C = (G, \u0398C, \u0398f ) that defines fX (x).\nG encodes the independencies assumed to hold in fX (x). \u0398C is a set of local copula functions\n$C_i(u_i, u_{pa_{i1}}, \\ldots, u_{pa_{ik_i}})$ that are associated with the nodes of G that have at least one parent. In\naddition, \u0398f is the set of parameters representing the marginal densities fi(xi) (and distributions\nui \u2261 Fi(xi)). The joint density fX (x) then takes the form\n\n$$f_X(x) = \\prod_{i=1}^n \\frac{c_i(u_i, u_{pa_{i1}}, \\ldots, u_{pa_{ik_i}})}{\\frac{\\partial^K C_i(1, u_{pa_{i1}}, \\ldots, u_{pa_{ik_i}})}{\\partial U_{pa_{i1}} \\cdots \\partial U_{pa_{ik_i}}}} f_i(x_i) \\equiv \\prod_{i=1}^n R_{c_i}(u_i, u_{pa_{i1}}, \\ldots, u_{pa_{ik_i}}) f_i(x_i). \\qquad (3)$$\n\nWhen Xi has no parents in G, $R_{c_i}(\\cdot) \\equiv 1$.\nNote that $R_{c_i}(\\cdot) f_i(x_i)$ is always a valid conditional density f (xi | pai), and can be easily computed.\nIn particular, when the copula density c(\u00b7) in the numerator has an explicit form, so does $R_{c_i}(\\cdot)$.\nElidan [2010] showed that a CBN defines a valid joint density. When the model is tree-structured,\n$\\prod_i R_{c_i}(u_i, u_{pa_{i1}}, \\ldots, u_{pa_{ik_i}})$ defines a valid copula so that the univariate marginals of the constructed\ndensity are fi(xi). More generally, the marginals may be skewed, though in practice only slightly so.\nIn this case the CBN model can be viewed as striking a balance between the fixed\nmarginals and the unconstrained maximum likelihood objectives. Practically, the model leads to\nsubstantial generalization advantages (see Elidan [2010] for more details).\n\n3 Nonparanormal Belief Propagation\n\nAs exemplified in Figure 1, the Gaussian copula can give rise to complex multi-modal joint distributions. 
When local Gaussian copulas are combined in a high-dimensional Gaussian Copula BN\n(GCBN), expressiveness is even greater. Yet, as we show in this section, tractable inference in this\nhighly non-Gaussian model is possible, regardless of the form of the univariate marginals.\n\n3.1 Inference for a Single Gaussian Copula\n\nWe start by showing how inference can be carried out in closed form for a single Gaussian copula.\nWhile all that is involved is a simple change of variables, the details are instructive. Let\nfX (x1, . . . , xn) be a density parameterized by a Gaussian copula. We start with the task of computing\nthe multivariate marginal over a subset of variables Y \u2282 X. For convenience and without loss\nof generality, we assume that Y = {X1, . . . , Xk} with k < n. From Eq. (2), we have\n\n$$f_{X_1, \\ldots, X_k}(x_1, \\ldots, x_k) = \\int_{\\mathbb{R}^{n-k}} f_X(x_1, \\ldots, x_n) \\, dx_{k+1} \\cdots dx_n = \\prod_{i=1}^k \\frac{f_i(x_i)}{\\varphi(\\tilde{x}_i)} \\int_{\\mathbb{R}^{n-k}} \\left[ \\varphi_\\Sigma\\left(\\Phi^{-1}(F_1(x_1)), \\ldots, \\Phi^{-1}(F_n(x_n))\\right) \\prod_{i=k+1}^n \\frac{f_i(x_i)}{\\varphi(\\Phi^{-1}(F_i(x_i)))} \\right] dx_{k+1} \\cdots dx_n.$$\n\nChanging the integral variables to Ui and using fi = \u2202Ui/\u2202Xi so that fi(xi)dxi = dui, we have\n\n$$f_{X_1, \\ldots, X_k}(x_1, \\ldots, x_k) = \\prod_{i=1}^k \\frac{f_i(x_i)}{\\varphi(\\tilde{x}_i)} \\int_{[0,1]^{n-k}} \\frac{\\varphi_\\Sigma\\left(\\Phi^{-1}(u_1), \\ldots, \\Phi^{-1}(u_n)\\right)}{\\prod_{i=k+1}^n \\varphi(\\Phi^{-1}(u_i))} \\, du_{k+1} \\cdots du_n.$$\n\nChanging variables once again to $\\tilde{x}_i = \\Phi^{-1}(u_i)$, and using $\\partial \\tilde{X}_i / \\partial U_i = \\varphi(\\tilde{x}_i)^{-1}$, we have\n\n$$f_{X_1, \\ldots, X_k}(x_1, \\ldots, x_k) = \\prod_{i=1}^k \\frac{f_i(x_i)}{\\varphi(\\tilde{x}_i)} \\int_{\\mathbb{R}^{n-k}} \\varphi_\\Sigma(\\tilde{x}_1, \\ldots, \\tilde{x}_n) \\, d\\tilde{x}_{k+1} \\cdots d\\tilde{x}_n.$$\n\nThe integral on the right hand side is now a standard marginalization of a multivariate Gaussian\n(over the $\\tilde{x}_i$\u2019s) and can be carried out in closed form.\nComputation of densities conditioned on evidence Z = z can also be easily carried out. Letting\nW = X \\ {Z \u222a Y} denote the variables that are neither query nor evidence variables, and plugging in the above, we have:\n\n$$f_{Y|Z}(y \\mid z) = \\frac{\\int f(x) \\, dw}{\\iint f(x) \\, dw \\, dy} = \\prod_{i \\in Y} \\frac{f_i(x_i)}{\\varphi(\\tilde{x}_i)} \\cdot \\frac{\\int \\varphi_\\Sigma(\\tilde{x}_1, \\ldots, \\tilde{x}_n) \\, d\\tilde{w}}{\\iint \\varphi_\\Sigma(\\tilde{x}_1, \\ldots, \\tilde{x}_n) \\, d\\tilde{w} \\, d\\tilde{y}}.$$\n\nThe conditional density is now easy to compute since a ratio of normal distributions is also normal.\nThe final answer, of course, does involve fi(xi). This is not only unavoidable but in fact desirable\nsince we would like to retain the complexity of the desired posterior.\n\n3.2 Tractability of Inference in Gaussian CBNs\n\nWe are now ready to consider inference in a Gaussian CBN (GCBN). In this case, the joint density\nof Eq. (3) takes, after cancellation of terms, the following form:\n\n$$f_X(x_1, \\ldots, x_n) = \\prod_i \\frac{f_i(x_i)}{\\varphi(\\tilde{x}_i)} \\prod_i \\frac{\\varphi_{\\Sigma_i}(\\tilde{x}_i, \\tilde{x}_{pa_{i1}}, \\ldots, \\tilde{x}_{pa_{ik_i}})}{\\varphi_{\\Sigma_i^-}(\\tilde{x}_{pa_{i1}}, \\ldots, \\tilde{x}_{pa_{ik_i}})},$$\n\nwhere $\\Sigma_i^-$ is used to denote the i\u2019th local covariance matrix excluding the i\u2019th row and column.\nWhen Xi has no parents, the ratio reduces to $\\varphi(\\tilde{x}_i)$. When the graph is tree structured, this density\nis also a copula and its marginals are fi(xi). In this case, the same change of variables as above\nresults in\n\n$$f_{\\tilde{X}}(\\tilde{x}_1, \\ldots, \\tilde{x}_n) = \\prod_i \\frac{\\varphi_{\\Sigma_i}(\\tilde{x}_i, \\tilde{x}_{pa_{i1}}, \\ldots, \\tilde{x}_{pa_{ik_i}})}{\\varphi_{\\Sigma_i^-}(\\tilde{x}_{pa_{i1}}, \\ldots, \\tilde{x}_{pa_{ik_i}})}.$$\n\nSince a ratio of Gaussians is also a Gaussian, the entire density is Gaussian in $\\tilde{x}_i$ space, and computation\nof any marginal $f_{\\tilde{Y}}(\\tilde{y})$ is easy. The required marginal in xi space is then recovered using\n\n$$f_Y(y) = f_{\\tilde{Y}}(\\tilde{y}) \\prod_{i \\in Y} \\frac{f_i(x_i)}{\\varphi(\\tilde{x}_i)}, \\qquad (4)$$\n\nwhich essentially summarizes the detailed derivation of the previous section.\nWhen we consider a non-tree structured CBN model, as noted in Section 2.2, the marginals may not\nequal fi(xi), and the above simplification is not applicable. However, for the Gaussian case, it is\nalways possible to estimate the local copulas in a topological order so that the univariate marginals\nare equal to fi(xi) (the model in this case is equivalent to the distribution-free continuous Bayesian\nbelief net model [Kurowicka and Cooke, 2005]). It follows that, for any structure,\n\nCorollary 3.1: The complexity of inference in a Gaussian CBN model is the same as that of inference\nin a multivariate Gaussian model of the same structure.\n\nAlgorithm 1: Nonparanormal Belief Propagation (NPNBP) for general CBNs.\nInput: {fi(xi)} for all i, $\\Sigma_i$ for all nodes with parents. Output: belief bS(xS) for each cluster S.\nCG \u2190 a valid cluster graph over the following potentials for all nodes i in the graph\n    \u2022 $\\varphi_{\\Sigma_i}(\\tilde{x}_i, \\tilde{x}_{pa_{i1}}, \\ldots, \\tilde{x}_{pa_{ik_i}})$\n    \u2022 $1 / \\varphi_{\\Sigma_i^-}(\\tilde{x}_{pa_{i1}}, \\ldots, \\tilde{x}_{pa_{ik_i}})$\nforeach cluster S in CG    // use black-box GaBP in $\\tilde{x}_i$ space\n    $b_G(\\tilde{x}_S)$ \u2190 GaBP belief over cluster S.\nforeach cluster S in CG    // change to xi space\n    $b_S(x_S) = b_G(\\tilde{x}_S) \\prod_{i \\in S} \\frac{f_i(x_i)}{\\varphi(\\tilde{x}_i)}$\n\nWhile mathematically this conclusion is quite straightforward, the implications are significant. 
A\nGCBN model is the only general purpose non-Gaussian continuous graphical model for which exact\ninference is tractable. At the same time, as is demonstrated in our experimental evaluation, the\nmodel is able to capture complex distributions well both qualitatively and quantitatively.\nA final note is worthwhile regarding the (possibly conditional) marginal density. As can be expected,\nthe result of Eq. (4) includes fi(xi) terms for all variables that have not been marginalized out. As\nnoted, this is indeed desirable as we would like to preserve the complexity of the density in the\nmarginal computation. The marginal term, however, is now in low dimension so that quantities of\ninterest (e.g., expectation) can be readily computed using naive grid-based evaluation or, if needed,\nusing more sophisticated sampling schemes (see, for example, [Robert and Casella, 2005]).\n\n3.3 Belief Propagation for Gaussian CBNs\n\nGiven the above observations, performing inference in a Gaussian CBN (GCBN) appears to be a\nsolved problem. However, inference in large-scale models can be problematic even in the Gaussian\ncase. First, the large joint covariance matrix may be ill conditioned and inverting it may not be\npossible. Second, matrix inversion can be slow when dealing with domains of sufficient dimension.\nA possible alternative is to consider the popular belief propagation algorithm [Pearl, 1988]. For a\ntree structured model represented as a product of singleton \u03a8i and pairwise \u03a8ij factors, the method\nrelies on the recursive computation of \u201cmessages\u201d\n\n$$m_{i \\to j}(x_j) \\leftarrow \\alpha \\int \\left[ \\Psi_{ij}(x_i, x_j) \\Psi_i(x_i) \\prod_{k \\in N(i) \\setminus j} m_{k \\to i}(x_i) \\right] dx_i,$$\n\nwhere \u03b1 is a normalization factor and N (i) are the indices of the neighbor nodes of Xi.\nIn the case of a GCBN model, performing belief propagation may seem difficult since \u03a8i(xi) \u2261\nfi(xi) can have a complex form. 
However, the change of variables used in the previous section\napplies here as well. That is, one can perform inference in $\\tilde{x}_i$ space using standard Gaussian BP\n(GaBP) [Weiss and Freeman, 2001], and then perform the needed change of variables. In fact, this\nis true regardless of the structure of the graph so that loopy GaBP can also be used to perform approximate\ncomputations for a general GCBN model in $\\tilde{x}_i$ space. The approach is summarized in\nAlgorithm 1, where we assume access to a black-box GaBP procedure and a cluster graph construction\nalgorithm. In our experiments we simply use a Bethe approximation construction (see [Koller\nand Friedman, 2009] for details on BP, GaBP and the cluster graph construction).\nGenerally, little can be said about the convergence of loopy BP or its variants, particularly for\nnon-Gaussian domains. Appealingly, the form of our NPNBP algorithm implies that its convergence can\nbe phrased in terms of standard Gaussian BP convergence. In particular:\n\n\u2022 Observation 1: NPNBP converges whenever GaBP converges for the model defined by $\\prod_i R_{c_i}$.\n\u2022 Observation 2: Convergence of NPNBP depends only on the covariance matrices $\\Sigma_i$ that parameterize\nthe local copula and does not depend on the univariate form.\n\nIt follows that convergence conditions identified for GaBP [Rusmevichientong and Van Roy, 2000, Weiss\nand Freeman, 2001, Malioutov et al., 2006] carry over to NPNBP for CBN models.\n\nFigure 2: Exact vs. Nonparametric BP marginals for the GCBN model learned from the wine quality\ndataset. Shown are the marginal densities for the first four variables.\n\n4 Experimental Evaluation\n\nWe now consider the merit of using our NPNBP algorithm for performing inference in a Gaussian\nCBN (GCBN) model. 
We learned a tree structured GCBN using a standard Chow-Liu approach\n[Chow and Liu, 1968], and a model with up to two parents for each variable using standard greedy\nstructure search. In both cases we use the Bayesian Information Criterion (BIC) [Schwarz, 1978]\nto guide the structure learning algorithm. For the univariate densities, we use a standard Gaussian\nkernel density estimator (see, for example, [Bowman and Azzalini, 1997]). Using an identical procedure,\nwe learn a linear Gaussian BN baseline where Xi \u223c N (\u03b1pai, \u03c3i) so that each variable Xi\nis normally distributed around a linear combination of its parents Pai (see [Koller and Friedman,\n2009] for details on this standard approach to structure learning).\nFor the GCBN model, we also compare to Nonparametric BP (NBP) [Sudderth et al., 2010a] using\nD. Bickson\u2019s code [Bickson, 2008] and A. Ihler\u2019s KDE Matlab package\n(http://www.ics.uci.edu/~ihler/code/kde.html), which relies on a mixture of Gaussians for message representation. In this case,\nsince our univariate densities are constructed using Gaussian kernels, there is no approximation\nin the NBP representation and all approximations are due to message computations. To carry out\nmessage products, we tried all 7 sampling-based methods available in the KDE package. In the experiments\nbelow we use only the multiresolution sequential Gibbs sampling method since all other\napproaches resulted in numerical overflows even for small domains.\n\n4.1 Qualitative Assessment\n\nWe start with a small domain where the qualitative nature of the inferred marginals is easily explored,\nand consider performance and running time in more substantial domains in the next section. 
We\nuse the wine quality data set from the UCI repository which includes 1599 measurements of 11\nphysicochemical properties and a quality variable of red \u201cVinho Verde\u201d [Cortez et al., 2009].\nWe first examine a tree structured GCBN model where our NPNBP method allows us to perform\nexact marginal computations. Figure 2 compares the first four marginals to the ones computed by the\nNBP method. As can be clearly seen, although the NBP marginals are not nonsensical, they are far\nfrom accurate (results for the other marginals in the domain are similar). Quantitatively, each NBP\nmarginal is 0.5 to 1.5 bits/instance less accurate than the exact ones. Thus, the accuracy of NPNBP\nin this case is approximately twice that of NBP per variable, amounting to a substantial per sample\nadvantage. We also note that NBP was approximately an order of magnitude slower than NPNBP in\nthis domain. In the larger domains considered in the next section, NBP proved impractical.\nFigure 3 demonstrates the quality of the bivariate marginals inferred by our NPNBP method relative\nto the ones of a linear Gaussian BN model where inference can also be carried out efficiently. The\nmiddle panel shows a Gaussian distribution constructed only over the two variables and is thus an\nupper bound on the quality that we can expect from a linear Gaussian BN. Clearly, the Gaussian\nrepresentation is not sufficiently flexible to reasonably capture the distribution of the true samples\n(left panel). In contrast, the bivariate marginals computed by our algorithm (right panel) demonstrate\nthe power of working with a copula-based construction and an effective inference procedure: in both\ncases the inferred marginals capture the non-Gaussian distributions quite accurately. 
Results were\nqualitatively similar for all other variable pairs (except for the few cases that are approximately\nGaussian in the original feature space and for which all models are equally beneficial).\n\nFigure 3: The bivariate density for two pairs of variables (Density vs. Alcohol; Free vs. Total Sulfur) in a tree structured GCBN model learned\nfrom the wine quality dataset. (a) empirical (true) samples; (b) maximum likelihood (optimal) Gaussian density; (c)\nexact GCBN marginal computed using our NPNBP algorithm.\n\nIn Figure 4 we repeat the comparison for another pair of variables in a non-tree GCBN (as before,\nresults were qualitatively similar for all pairs of variables). In this setting, the bivariate marginal\ncomputed by our algorithm (d) is approximate and we also compare to the exact marginal (c). As\nin the case of the tree-structured model, the GCBN model captures the true density quite accurately,\neven for this multi-modal example. NPNBP dampens some of this accuracy and results in marginal\ndensities that have the correct overall structure but with a reduced variance. This is not surprising\nsince it is well known that GaBP leads to reduced variances [Weiss and Freeman, 2001]. Nevertheless,\nthe approximate result of NPNBP is clearly better than the exact Gaussian model, which\nassigns very low probability to regions of high density (along the main vertical axis of the density).\nFinally, Figure 5 (left) shows the NPNBP vs. the exact expectations. As can be seen, the inferred\nvalues are quite accurate and it is plausible that the differences are due to numerical round-offs.\nThus, it is possible that, similarly to the case of standard GaBP [Weiss and Freeman, 2001], the\ninferred expectations are theoretically exact. 
The proof for the GaBP case, however, does not carry\nover to the CBN setting and shedding theoretical light on this issue remains a future challenge.\n\n4.2 Quantitative Assessment\n\nWe now consider several substantially larger domains with 100 to almost 2000 variables. For each\ndomain we learn a tree structured GCBN, and justify the need for the expressive copula-based model\nby reporting its average generalization advantage in terms of log-loss/instance over a standard linear\nGaussian model. We justify the use of NPNBP for performing inference by comparing the running\ntime of NPNBP to exact computations carried out using matrix inversion. For all datasets, we\nperformed 10-fold cross-validation and report average results. We use the following datasets:\n\u2022 Crime (UCI repository). 100 variables relating to crime ranging from household size to fraction\nof children born outside of a marriage, for 1994 communities across the U.S.\n\u2022 SP500. Daily changes of value of the 500 stocks comprising the Standard and Poor\u2019s index (S&P\n500) over a period of one year.\n\u2022 Gene. A compendium of gene expression experiments used in [Marion et al., 2004]. We chose\ngenes that have only 1, 2, and 3 missing values and only use full observations. This resulted in\ndatasets with 765, 1400, and 1945 variables (genes), and 1088, 956, and 876 samples, respectively.\n\nFigure 4: The bivariate density for a pair of variables (Sugar level vs. Density) in a non-tree GCBN model learned from the\nwine quality dataset. (a) empirical (true) samples; (b) maximum likelihood (optimal) Gaussian density; (c) exact\nCBN marginal; (d) marginal density computed (inferred) by our NPNBP algorithm.\n\nFigure 5: (left) exact vs. NPNBP expected values. (right) speedup relative to matrix inversion for a tree\nstructured GCBN model. 765, 1400, 1945 correspond to the three different datasets extracted from the gene\nexpression compendium.\n\nFor the 100 variable Crime domain, average test advantage of the GCBN model over the linear\nGaussian one was 0.39 bits/instance per variable (as in [Elidan, 2010]). For the 765 variable Gene\nexpression domain the advantage was around 0.1 bits/instance/variable (results were similar for the\nother gene expression datasets). In both cases, the differences are dramatic and each instance is\nmany orders of magnitude more likely given a GCBN model. For the SP500 domain, evaluation\nof the linear Gaussian model resulted in numerical overflows (due to the scarcity of the training\ndata), and the advantage of the GCBN cannot be quantified. These generalization advantages make\nit obvious that we would like to perform efficient inference in a GCBN model.\nAs discussed, a GCBN model is itself tractable in that inference can be carried out by first constructing\nthe inverse covariance matrix over all variables and then inverting it so as to facilitate\nmarginalization. Thus, using our NPNBP algorithm can only be justified practically. Figure 5 (right)\nshows the speedup of NPNBP relative to inference based on matrix inversion for the different domains.\nAlthough NPNBP is somewhat slower for the small domains (in which inference is carried\nout in less than a second), the speedup of NPNBP reaches an order of magnitude for the larger gene\nexpression domain. Appealingly, the advantage of NPNBP grows with the domain size due to the\ngrowth in complexity of matrix inversion. Finally, we note that we used a Matlab implementation\nwhere matrix inversion is highly optimized so that the gains reported are quite conservative.\n\n5 Summary\n\nWe presented Nonparanormal Belief Propagation (NPNBP), a propagation-based algorithm for performing\nhighly efficient inference in a powerful class of graphical models that are based on the\nGaussian copula. 
To our knowledge, ours is the first inference method for an expressive continuous\nnon-Gaussian representation that, like ordinary GaBP, is both highly efficient and provably correct\nfor tree structured models. Appealingly, the efficiency and convergence properties of our method do\nnot depend on the choice of univariate marginals, even when a nonparametric representation is used.\nThe Gaussian copula is a powerful model widely used to capture complex phenomena in fields\nranging from mainstream economics (e.g., Embrechts et al. [2003]) to flood analysis [Zhang and\nSingh, 2007]. Recent probabilistic graphical models that build on the Gaussian copula open the\ndoor for new high-dimensional non-Gaussian applications [Kirshner, 2007, Liu et al., 2010, Elidan,\n2010, Wilson and Ghahramani, 2010]. Our method offers the inference tools to make this practical.\n\nAcknowledgements\nG. Elidan and C. Cario were supported in part by an ISF center of research grant. G. Elidan was also supported\nby a Google grant. Many thanks to O. Meshi and A. Globerson for their comments on an earlier draft.\n\nReferences\nD. Bickson. Gaussian Belief Propagation: Theory and Application. PhD thesis, The Hebrew University of Jerusalem, Jerusalem, Israel, 2008.\nA. Bowman and A. Azzalini. Applied Smoothing Techniques for Data Analysis. Oxford Press, 1997.\nC. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14:462\u2013467, 1968.\nP. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547\u2013553, 2009.\nG. Elidan. Copula bayesian networks. In Neural Information Processing Systems (NIPS), 2010.\nP. Embrechts, F. Lindskog, and A. McNeil. Modeling dependence with copulas and applications to risk management. 
Handbook of Heavy Tailed Distributions in Finance, 2003.\nA. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for map lp-relaxations. In Neural Information Processing Systems (NIPS), 2007.\nT. Heskes. On the uniqueness of loopy belief propagation fixed points. Neural Comp, 16:2379\u20132413, 2004.\nA. Ihler and D. McAllester. Particle belief propagation. In Conf on AI and Statistics (AISTATS), 2009.\nS. Kirshner. Learning with tree-averaged densities and distributions. In Neural Info Proc Systems (NIPS), 2007.\nD. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.\nD. Kurowicka and R. M. Cooke. Distribution-free continuous bayesian belief nets. In Selected papers based on the presentation at the international conference on mathematical methods in reliability (MMR), 2005.\nH. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 2010.\nD. Malioutov, J. Johnson, and A. Willsky. Walk-sums and belief propagation in gaussian graphical models. Journal of Machine Learning Research, 7:2031\u20132064, 2006.\nR. Marion, A. Regev, E. Segal, Y. Barash, D. Koller, N. Friedman, and E. O\u2019Shea. Sfp1 is a stress- and nutrient-sensitive regulator of ribosomal protein gene expression. Proc Natl Acad Sci U S A, 101(40):14315\u201322, 2004.\nR. McEliece, D. MacKay, and J. Cheng. Turbo decoding as an instance of pearl\u2019s belief propagation algorithm. IEEE Journal on Selected Areas in Communication, 16:140\u2013152, 1998.\nT. P. Minka. Expectation propagation for approximate Bayesian inference. In Proc. Conference on Uncertainty in Artificial Intelligence (UAI), pages 362\u2013369, 2001.\nJ. Mooij and B. Kappen. Sufficient conditions for convergence of loopy belief propagation. In Proc. 
Conference on Uncertainty in Artificial Intelligence (UAI), 2005.\nK. Murphy and Y. Weiss. Loopy belief propagation for approximate inference: An empirical study. In Proc. Conference on Uncertainty in Artificial Intelligence (UAI), pages 467\u2013475, 1999.\nR. Nelsen. An Introduction to Copulas. Springer, 2007.\nJ. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.\nC. P. Robert and G. Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer, 2005.\nP. Rusmevichientong and B. Van Roy. An analysis of belief propagation on the turbo decoding graph with gaussian densities. IEEE Transactions on Information Theory, 47:745\u2013765, 2000.\nG. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461\u2013464, 1978.\nA. Sklar. Fonctions de repartition a n dimensions et leurs marges. Publications de l\u2019Institut de Statistique de L\u2019Universite de Paris, 8:229\u2013231, 1959.\nL. Song, A. Gretton, D. Bickson, Y. Low, and C. Guestrin. Kernel belief propagation. In Conference on Artificial Intelligence and Statistics (AIStats), 2011.\nE.B. Sudderth, A.T. Ihler, M. Isard, W.T. Freeman, and A.S. Willsky. Nonparametric belief propagation. Communications of the ACM, 53(10):95\u2013103, 2010a.\nErik Sudderth, Alexander Ihler, Michael Isard, William Freeman, and Alan Willsky. Nonparametric belief propagation. Communications of the ACM, 53(10):95\u2013103, October 2010b.\nY. Weiss and W. Freeman. Correctness of belief propagation in gaussian graphical models of arbitrary topology. Neural Computation, 13:2173\u20132200, 2001.\nW. Wiegerinck and T. Heskes. Fractional belief propagation. In Neural Information Processing Systems 15, Cambridge, Mass., 2003. MIT Press.\nA. Wilson and Z. Ghahramani. Copula processes. In Neural Information Processing Systems (NIPS), 2010.\nJ. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. 
In Neural Information Processing Systems 13, pages 689\u2013695, Cambridge, Mass., 2001. MIT Press.\nL. Zhang and V. Singh. Trivariate flood frequency analysis using the Gumbel-Hougaard copula. Journal of Hydrologic Engineering, 12, 2007.", "award": [], "sourceid": 417, "authors": [{"given_name": "Gal", "family_name": "Elidan", "institution": null}, {"given_name": "Cobi", "family_name": "Cario", "institution": null}]}