{"title": "Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1043, "page_last": 1051, "abstract": "Graphical model selection refers to the problem of estimating the unknown graph structure given observations at the nodes in the model. We consider a challenging instance of this problem when some of the nodes are latent or hidden.  We  characterize  conditions for tractable graph estimation and develop efficient methods with provable guarantees. We consider the class of Ising models Markov on  locally tree-like graphs, which are in the regime of correlation decay. We  propose an efficient method for graph estimation, and establish its structural consistency when the number of samples $n$ scales as $n = \\Omega(\\theta_{\\min}^{-\\delta \\eta(\\eta+1)-2}\\log p)$, where $\\theta_{\\min}$ is the minimum edge potential, $\\delta$ is the depth (i.e., distance from a hidden node to the nearest  observed nodes), and $\\eta$ is a parameter which depends on the minimum and maximum node and edge potentials in the Ising model. The proposed method is practical to implement and provides  flexibility to control  the number of latent variables and the cycle lengths in the output graph.  We also present necessary conditions for graph estimation by any method and show that our method nearly matches the lower bound  on sample requirements.", "full_text": "Latent Graphical Model Selection: Ef\ufb01cient Methods\n\nfor Locally Tree-like Graphs\n\nAnimashree Anandkumar\n\nUC Irvine\n\na.anandkumar@uci.edu\n\nRagupathyraj Valluvan\n\nUC Irvine\n\nrvalluva@uci.edu\n\nAbstract\n\nGraphical model selection refers to the problem of estimating the unknown graph\nstructure given observations at the nodes in the model. We consider a challenging\ninstance of this problem when some of the nodes are latent or hidden. 
We characterize conditions for tractable graph estimation and develop efficient methods with provable guarantees. We consider the class of Ising models Markov on locally tree-like graphs, which are in the regime of correlation decay. We propose an efficient method for graph estimation, and establish its structural consistency when the number of samples $n$ scales as $n = \\Omega(\\theta_{\\min}^{-\\delta \\eta(\\eta+1)-2} \\log p)$, where $\\theta_{\\min}$ is the minimum edge potential, $\\delta$ is the depth (i.e., distance from a hidden node to the nearest observed nodes), and $\\eta$ is a parameter which depends on the minimum and maximum node and edge potentials in the Ising model. The proposed method is practical to implement and provides flexibility to control the number of latent variables and the cycle lengths in the output graph. We also present necessary conditions for graph estimation by any method and show that our method nearly matches the lower bound on sample requirements.\n\nKeywords: Graphical model selection, latent variables, quartet methods, locally tree-like graphs.\n\n1 Introduction\n\nIt is widely recognized that the process of fitting observed data to a statistical model needs to incorporate latent or hidden factors, which are not directly observed. Learning latent variable models involves mainly two tasks: discovering structural relationships among the observed and hidden variables, and estimating the strength of such relationships. One of the simplest models is the latent class model (LCM), which incorporates a single hidden variable and the observed variables are conditionally independent given the hidden variable. Latent tree models extend this model class to incorporate many hidden variables in a hierarchical fashion. Latent trees have been effective in modeling data in a variety of domains, such as phylogenetics [1]. 
Their computational tractability is a key advantage: once the latent tree model is learned, inference can be carried out efficiently through belief propagation. There has been extensive work on learning latent trees, and some recent works, e.g. [2\u20134], demonstrate efficient learning in high dimensions. However, despite the advantages, the assumption of an underlying tree structure may be too restrictive. For instance, consider the example of topic-word models, where topics (which are hidden) are discovered using information about word co-occurrences. In this case, a latent tree model does not accurately represent the hierarchy of topics and words, since there are many common words across different topics. Here, we relax the latent tree assumption to incorporate cycles in the latent graphical model while retaining many advantages of latent tree models, including tractable learning and inference. Relaxing the tree constraint leads to many challenges: in general, learning these models is NP-hard, even when there are no latent variables, and developing tractable methods for such models is itself an area of active research, e.g. [5\u20137]. We consider structure estimation in latent graphical models Markov on locally tree-like graphs. These extensions of latent tree models are relevant in many settings: for instance, when there is a small overlap among different hierarchies of variables, the resulting graph has mostly long cycles. There are many questions to be addressed: are there parameter regimes where these models can be learnt consistently and efficiently? If so, are there practical learning algorithms? Are learning guarantees for loopy models comparable to those for latent trees? How does learning depend on various graph attributes such as node degrees, girth of the graph, and so on?\n\nOur Approach: We consider learning Ising models with latent variables Markov on locally tree-like graphs. 
We assume that the model parameters are in the regime of correlation decay. In this regime, there are no long-range correlations, and the local statistics converge to a tree limit. Hence, we can employ the available latent tree methods to learn \u201clocal\u201d subgraphs consistently, as long as they do not contain any cycles. However, merging these estimated local subgraphs (i.e., latent trees) remains a non-trivial challenge. It is not clear whether an efficient approach is possible for matching latent nodes during this process. We employ a different philosophy for building locally tree-like graphs with latent variables. We decouple the process of introducing cycles and latent variables in the output model. We initialize a loopy graph consisting of only the observed variables, and then iteratively add latent variables to local neighborhoods of the graph. We establish correctness of our method under a set of natural conditions. We establish that our method is structurally consistent when the number of samples $n$ scales as $n = \\Omega(\\theta_{\\min}^{-\\delta \\eta(\\eta+1)-2} \\log p)$, where $p$ is the number of observed variables, $\\theta_{\\min}$ is the minimum edge potential, $\\delta$ is the depth (i.e., graph distance from a hidden node to the nearest observed nodes), and $\\eta$ is a parameter which depends on the minimum and maximum node and edge potentials of the Ising model ($\\eta = 1$ for homogeneous models). The sample requirement for our method is comparable to the requirement for many popular latent tree methods, e.g. [2\u20134]. Moreover, note that when there are no hidden variables ($\\delta = 1$), the sample complexity of our method is strengthened to $n = \\Omega(\\theta_{\\min}^{-2} \\log p)$, which matches the sample complexity of existing algorithms for learning fully-observed Ising models [5\u20137]. Thus, we present an efficient method which bridges structure estimation in latent trees with estimation in fully observed loopy graphical models. 
Finally, we present necessary conditions for graph estimation by any method and show that our method nearly matches the lower bound. Our method has a number of attractive features: it is amenable to parallelization, making it efficient on large datasets; it provides flexibility to control the length of cycles and the number of latent variables in the output model; and it can incorporate penalty scores such as the Bayesian information criterion (BIC) [8] to trade off model complexity and fidelity. Preliminary experiments on the newsgroup dataset suggest that the method can discover intuitive relationships efficiently, and that it compares well with the popular latent Dirichlet allocation (LDA) [9] in terms of topic coherence and perplexity.\n\nRelated Work: Learning latent trees has been studied extensively, mainly in the context of phylogenetics. Efficient algorithms with provable guarantees are available (e.g. [2\u20134]). Our proposed method for learning loopy models is inspired by the efficient latent tree learning algorithm of [4]. Works on high-dimensional graphical model selection are more recent. They can be mainly classified into two groups: non-convex local approaches [5, 6, 10] and those based on convex optimization [7, 11, 12]. There is a general agreement that the success of these methods is related to the presence of correlation decay in the model [13]. This work makes the connection explicit: it relates the extent of correlation decay with the learning efficiency for latent models on large girth graphs. An analogous study of the effect of correlation decay for learning fully observed models is presented in [5]. This paper is the first work to provide provable guarantees for learning discrete graphical models on loopy graphs with latent variables (which can also be easily extended to Gaussian models, see the Remark following Theorem 1). 
The work in [12] considers learning latent Gaussian graphical models using a convex relaxation method, by exploiting a sparse-low rank decomposition of the Gaussian precision matrix. However, the method cannot be easily extended to discrete models. Moreover, the \u201cincoherence\u201d conditions required for the success of convex methods are hard to interpret and verify in general. In contrast, our conditions for success are transparent and based on the presence of correlation decay in the model.\n\n2 System Model\n\nIsing Models: A graphical model is a family of multivariate distributions Markov in accordance to a fixed undirected graph [14]. Each node in the graph $i \\in W$ is associated to a random variable $X_i$ taking value in a set $\\mathcal{X}$. The set of edges $E$ captures the set of conditional independence relations among the random variables. We say that a set of random variables $X_W := \\{X_i, i \\in W\\}$ with probability mass function (pmf) $P$ is Markov on the graph $G$ if\n\n$$P(x_i | x_{N(i)}) = P(x_i | x_{W \\setminus i}) \\qquad (1)$$\n\nholds for all nodes $i \\in W$, where $N(i)$ are the neighbors of node $i$ in graph $G$. The Hammersley-Clifford theorem [14] states that under the positivity condition, given by $P(x_W) > 0$ for all $x_W \\in \\mathcal{X}^{|W|}$, a distribution $P$ satisfies the Markov property according to a graph $G$ iff it factorizes according to the cliques of $G$. A special case of graphical models is the class of Ising models, where each node consists of a binary variable over $\\{-1, +1\\}$ and there are only pairwise interactions in the model. In this case, the joint distribution factorizes as\n\n$$P(x_W) = \\exp\\left( \\sum_{e \\in E} \\theta_{i,j} x_i x_j + \\sum_{i \\in V} \\phi_i x_i - A(\\theta) \\right), \\qquad (2)$$\n\nwhere $\\theta := \\{\\theta_{i,j}\\}$ and $\\phi := \\{\\phi_i\\}$ are known as edge and the node potentials, and $A(\\theta)$ is known as the log-partition function, which serves to normalize the probability distribution. 
We consider latent graphical models in which a subset of nodes is latent or hidden. Let $H \\subset W$ denote the hidden nodes and $V \\subset W$ denote the observed nodes. Our goal is to discover the presence of hidden variables $X_H$ and learn the unknown graph structure $G(W)$, given $n$ i.i.d. samples from observed variables $X_V$. Let $p := |V|$ denote the number of observed nodes and $m := |W|$ denote the total number of nodes.\n\nTractable Models for Learning: In general, structure estimation of graphical models is NP-hard. We now characterize a tractable class of models for which we can provide guarantees on graph estimation.\n\nGirth-Constrained Graph Families: We consider the family of graphs with a bound on the girth, which is the length of the shortest cycle in the graph. Let $\\mathcal{G}_{\\mathrm{Girth}}(m; g)$ denote the ensemble of graphs with girth at most $g$. There are many graph constructions which lead to a bound on girth. For example, the bipartite Ramanujan graph [15] and the random Cayley graphs [16] have bounds on the girth. Theoretical guarantees for our learning algorithm will depend on the girth of the graph. However, our experiments reveal that our method is able to construct models with short cycles as well.\n\nRegime of Correlation Decay: This work establishes tractable learning when the graphical model converges locally to a tree limit. A sufficient condition for the existence of such limits is the regime of correlation decay, which refers to the property that there are no long-range correlations in the model [5]. In this regime, the marginal distribution at a node is asymptotically independent of the configuration of a growing boundary. For the class of Ising models in (2), the regime of correlation decay can be explicitly characterized, in terms of the maximum edge potential $\\theta_{\\max}$ of the model and the maximum node degree $\\Delta_{\\max}$. Define $\\alpha := \\Delta_{\\max} \\tanh \\theta_{\\max}$. 
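As a quick numerical illustration of the quantity alpha defined above, the following Python sketch computes alpha = Delta_max * tanh(theta_max) and tests the condition alpha < 1 that the paper uses for the regime of correlation decay. The function names and the example values (maximum degree 3, maximum edge potential 0.2) are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def correlation_decay_alpha(max_degree, theta_max):
    # alpha := Delta_max * tanh(theta_max), as defined in the text
    return max_degree * math.tanh(theta_max)


def in_correlation_decay_regime(max_degree, theta_max):
    # The paper treats alpha < 1 as the regime of correlation decay
    return correlation_decay_alpha(max_degree, theta_max) < 1.0


# Hypothetical example values, chosen only for illustration
alpha = correlation_decay_alpha(max_degree=3, theta_max=0.2)
print(alpha, in_correlation_decay_regime(3, 0.2))
```

Because tanh is increasing, a larger maximum degree leaves room only for weaker edge potentials before the condition fails, which is the degree versus potential tradeoff the paper describes.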
When $\\alpha < 1$, the model is in the regime of correlation decay, and we provide learning guarantees in this regime.\n\n3 Method, Guarantees and Necessary Conditions\n\nBackground on Learning Latent Trees: Most latent tree learning methods are distance based, meaning they are based on the presence of an additive tree metric between any two nodes in the tree model. For Ising models (and more generally, any discrete model), the \u201cinformation\u201d distance between any two nodes $i$ and $j$ in a tree $T$ is defined as\n\n$$d(i, j; T) := -\\log |\\det(P_{i,j})|, \\qquad (3)$$\n\nwhere $P_{i,j}$ denotes the joint probability distribution between nodes $i$ and $j$. On a tree model $T$, it can be established that $\\{d(i, j)\\}$ is additive along any path in $T$. Learning latent trees can thus be reformulated as learning tree structure $T$ given end-to-end (estimated) distances $d := \\{\\widehat{d}(i, j) : i, j \\in V\\}$ between the observed nodes $V$. Various methods with performance guarantees have been proposed, e.g. [2\u20134]. They are usually based on local tests such as quartet tests, involving groups of four nodes. In [4], the so-called CLGrouping method is proposed, which organically grows the tree structure by adding latent nodes to local neighborhoods. In the initial step, the method constructs the minimum spanning tree $\\mathrm{MST}(V; d)$ over the observed nodes $V$ using distances $d$. The method then iteratively visits local neighborhoods of $\\mathrm{MST}(V; d)$ and adds latent nodes by conducting local distance tests. Since a tree structure is maintained in every iteration of the algorithm, we can parsimoniously add hidden variables by selecting neighborhoods which locally maximize scores such as the Bayesian information criterion (BIC) [8]. 
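The information distance in (3) can be evaluated directly for a single pair of Ising variables. The sketch below builds the 2x2 joint pmf of a two-node Ising model and computes -log|det P|. The helper names are hypothetical, and the two-node model is an illustrative assumption rather than the authors' implementation. With zero node potentials the determinant works out to tanh(theta)/4, so the distance shrinks as the edge potential grows.

```python
import math


def ising_pair_pmf(theta, phi1=0.0, phi2=0.0):
    # Joint pmf of a two-node Ising model over {-1, +1}, returned as a
    # 2x2 nested list indexed by [state of x1][state of x2], with the
    # states ordered as (-1, +1).
    states = (-1, 1)
    weights = [[math.exp(theta * a * b + phi1 * a + phi2 * b) for b in states]
               for a in states]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def information_distance(joint):
    # d(i, j) := -log |det P_ij| for a 2x2 joint pmf, following equation (3)
    det = joint[0][0] * joint[1][1] - joint[0][1] * joint[1][0]
    return -math.log(abs(det))


# Hypothetical edge potentials, chosen only for illustration
for theta in (0.2, 0.5, 1.0):
    print(theta, information_distance(ising_pair_pmf(theta)))
```

In practice the joint pmf would be replaced by an empirical estimate from samples, which is what the estimated distances in the next paragraph refer to.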
This method also allows for fast implementation by parallelization of latent tree reconstruction in different neighborhoods, see [17] for details.\n\nProposed Algorithm: We now propose a method for learning loopy latent graphical models. As in the case of latent tree methods, our method is also based on estimated information distances\n\n$$\\widehat{d}^{n}(i, j; G) := -\\log |\\det(\\widehat{P}^{n}_{i,j})|, \\quad \\forall i, j \\in V, \\qquad (4)$$\n\nwhere $\\widehat{P}^{n}_{i,j}$ denotes the empirical probability distribution at nodes $i$ and $j$ computed using $n$ i.i.d. samples. The presence of correlation decay in the Ising model implies that $\\widehat{d}^{n}(i, j; G)$ is approximately a tree metric when nodes $i$ and $j$ are \u201cclose\u201d on graph $G$ (compared to the girth $g$ of the graph). Thus, intuitively, local neighborhoods of $G$ can be constructed through latent tree methods. However, the challenge is in merging these local estimates together to get a global estimate $\\widehat{G}$: the presence of latent nodes in the local estimates makes merging challenging. Moreover, such a merging-based method cannot easily incorporate global penalties for the number of latent variables added in the output model, which is relevant to obtain parsimonious representations on real datasets. We overcome the above challenges as follows: our proposed method decouples the process of adding cycles and latent nodes to the output model. It initializes a loopy graph $\\widehat{G}_0$ and then iteratively adds latent variables to local neighborhoods. Given a parameter $r > 0$, for every node $i \\in V$, consider the set of nodes $B_r(i; \\widehat{d}^{n}) := \\{j : \\widehat{d}^{n}(i, j) < r\\}$. The initial graph estimate $\\widehat{G}_0$ is obtained by taking the union of local minimum spanning trees over these sets, as stated in (5). The method then adds latent variables by considering only local neighborhoods in $\\widehat{G}_0$ and running a latent tree reconstruction routine. 
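The initialization step can be sketched as follows, assuming the estimated distances are already available as a symmetric matrix. For each node the code collects the ball of nodes at distance below r, builds a minimum spanning tree on that ball with Prim's algorithm, and takes the union of the resulting edges. The function names, the choice to always include node i in its own ball, and the toy distance matrix are assumptions made for illustration; the step that adds latent nodes is not shown.

```python
def prim_mst(nodes, dist):
    # Minimum spanning tree over the given nodes, using dist[a][b] as the
    # weight of edge (a, b). Returns a set of edges (a, b) with a < b.
    if len(nodes) < 2:
        return set()
    in_tree = [nodes[0]]
    edges = set()
    while len(in_tree) < len(nodes):
        candidates = [(a, b) for a in in_tree for b in nodes if b not in in_tree]
        u, v = min(candidates, key=lambda e: dist[e[0]][e[1]])
        in_tree.append(v)
        edges.add((min(u, v), max(u, v)))
    return edges


def local_mst_union(dist, r):
    # Union over all nodes i of the MST built on the ball
    # {j : dist[i][j] < r}; node i itself is always included (an assumption).
    n = len(dist)
    edges = set()
    for i in range(n):
        ball = [j for j in range(n) if j == i or dist[i][j] < r]
        edges |= prim_mst(ball, dist)
    return edges


# Toy distances: shortest-path lengths on a 4-cycle 0-1-2-3-0 (hypothetical)
toy = [[0, 1, 2, 1],
       [1, 0, 1, 2],
       [2, 1, 0, 1],
       [1, 2, 1, 0]]
print(sorted(local_mst_union(toy, r=1.5)))  # small r: balls cover only neighbors
print(sorted(local_mst_union(toy, r=3.0)))  # large r: every ball is the whole node set
```

On this toy input the small threshold yields a graph that contains the cycle, while the large threshold makes every ball equal to the full node set, so the union collapses to a single spanning tree. This mirrors the role of r described in the text.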
By visiting all the neighborhoods, a graph estimate $\\widehat{G}$ is obtained. In symbols, the initialization is the union of local minimum spanning trees:\n\n$$\\widehat{G}_0 \\leftarrow \\cup_{i \\in V} \\mathrm{MST}(B_r(i; \\widehat{d}^{n}); \\widehat{d}^{n}). \\qquad (5)$$\n\nImplementation details about the algorithm are available in [17]. We subsequently establish the correctness of the proposed method under a set of natural conditions. We require that the parameter $r$, which determines the set $B_r(i; d)$ for each node $i$, be chosen as a function of the depth $\\delta$ (i.e., distance from a hidden node to its closest observed nodes) and girth $g$ of the graph. In practice, the parameter $r$ provides flexibility in tuning the length of cycles added to the graph estimate. When $r$ is large enough, we obtain a latent tree, while for small $r$, the graph estimate can contain many short cycles (and potentially many components). In experiments, we evaluate the performance of our method for different values of $r$. For more details, see Section 4.\n\n3.1 Conditions for Recovery\n\nWe present a set of natural conditions on the graph structure and model parameters under which our proposed method succeeds in structure estimation.\n\n(A1) Minimum Degree of Latent Nodes: We require that all latent nodes have degree at least three, which is a natural assumption for identifiability of hidden variables. Otherwise, the latent nodes can be marginalized to obtain an equivalent representation of the observed statistics.\n\n(A2) Bounded Potentials: The edge potentials $\\theta := \\{\\theta_{i,j}\\}$ of the Ising model are bounded, and let\n\n$$\\theta_{\\min} \\leq |\\theta_{i,j}| \\leq \\theta_{\\max}, \\quad \\forall (i, j) \\in G. \\qquad (6)$$\n\nSimilarly assume bounded node potentials.\n\n(A3) Correlation Decay: As described in Section 2, we assume correlation decay in the Ising model. 
We require\n\n$$\\alpha := \\Delta_{\\max} \\tanh \\theta_{\\max} < 1, \\qquad \\frac{\\alpha^{g/2}}{\\theta_{\\min}^{\\eta(\\eta+1)+2}} = o(1), \\qquad (7)$$\n\nwhere $\\Delta_{\\max}$ is the maximum node degree, $g$ is the girth and $\\theta_{\\min}$, $\\theta_{\\max}$ are the minimum and maximum (absolute) edge potentials in the model.\n\n(A4) Distance Bounds: We now define certain quantities which depend on the edge potential bounds. Given an Ising model $P$ with edge potentials $\\theta = \\{\\theta_{i,j}\\}$ and node potentials $\\phi = \\{\\phi_i\\}$, consider its attractive counterpart $\\bar{P}$ with edge potentials $\\bar{\\theta} := \\{|\\theta_{i,j}|\\}$ and node potentials $\\bar{\\phi} := \\{|\\phi_i|\\}$. Let $\\phi'_{\\max} := \\max_{i \\in V} \\mathrm{atanh}(\\bar{E}(X_i))$, where $\\bar{E}$ is the expectation with respect to the distribution $\\bar{P}$. Let $P(X_{1,2}; \\{\\theta, \\phi_1, \\phi_2\\})$ denote an Ising model on two nodes $\\{1, 2\\}$ with edge potential $\\theta$ and node potentials $\\{\\phi_1, \\phi_2\\}$. Our learning guarantees depend on $d_{\\min}$ and $d_{\\max}$ defined below.\n\n$$d_{\\min} := -\\log |\\det P(X_{1,2}; \\{\\theta_{\\max}, \\phi'_{\\max}, \\phi'_{\\max}\\})|, \\qquad d_{\\max} := -\\log |\\det P(X_{1,2}; \\{\\theta_{\\min}, 0, 0\\})|, \\qquad \\eta := \\frac{d_{\\max}}{d_{\\min}}.$$\n\n(A5) Girth vs. Depth: The depth $\\delta$ characterizes how close the latent nodes are to observed nodes on graph $G$: for each hidden node $h \\in H$, find a set of four observed nodes which form the shortest quartet with $h$ as one of the middle nodes, and consider the largest graph distance in that quartet. The depth $\\delta$ is the worst-case distance over all hidden nodes. We require the following tradeoff between the girth $g$ and the depth $\\delta$:\n\n$$\\frac{g}{4} - \\delta \\eta (\\eta + 1) = \\omega(1). \\qquad (8)$$\n\nFurther, the parameter $r$ in our algorithm is chosen as\n\n$$r > \\delta (\\eta + 1) d_{\\max} + \\epsilon, \\quad \\text{for some } \\epsilon > 0, \\qquad \\frac{g}{4} d_{\\min} - r = \\omega(1). \\qquad (9)$$\n\n(A1) is a natural assumption on the minimum degree of the hidden nodes for identifiability. 
(A2) assumes bounds on the edge potentials. It is natural that the sample requirement of any graph estimation algorithm depends on the \u201cweakest\u201d edge characterized by the minimum edge potential $\\theta_{\\min}$. Further, the maximum edge potential $\\theta_{\\max}$ characterizes the presence/absence of long range correlations in the model, and is made exact in (A3). Intuitively, there is a tradeoff between the maximum degree $\\Delta_{\\max}$ and the maximum edge potential $\\theta_{\\max}$ of the model. Moreover, (A3) prescribes that the extent of correlation decay be strong enough (i.e., a small $\\alpha$ and a large enough girth $g$) compared to the weakest edge in the model. Similar conditions have been imposed before for graphical model selection in the regime of correlation decay when there are no hidden variables [5]. (A4) defines certain distance bounds. Intuitively, $d_{\\min}$ and $d_{\\max}$ are bounds on information distances given by the local tree approximation of the loopy model. Note that $e^{-d_{\\max}} = \\Omega(\\theta_{\\min})$ and $e^{-d_{\\min}} = O(\\theta_{\\max})$. (A5) provides the tradeoff between the girth $g$ and the depth $\\delta$. Intuitively, the depth needs to be smaller than the girth to avoid encountering cycles during the process of graph reconstruction. Recall that the parameter $r$ in our algorithm determines the neighborhood over which local MSTs are built in the first step. It is chosen such that it is roughly larger than the depth $\\delta$ in order for all the hidden nodes to be discovered. The upper bound on $r$ ensures that the distortion from an additive metric is not too large. The parameters for latent tree learning routines (such as confidence intervals for quartet tests) are chosen appropriately depending on $d_{\\min}$ and $d_{\\max}$, see [17] for details.\n\n3.2 Guarantees\n\nWe now provide the main result of this paper that the proposed method correctly estimates the graph structure of a loopy latent graphical model in high dimensions. 
Recall that $\\delta$ is the depth (distance from a hidden node to its closest observed nodes), $\\theta_{\\min}$ is the minimum (absolute) edge potential and $\\eta = d_{\\max}/d_{\\min}$ is the ratio of distance bounds.\n\nTheorem 1 (Structural Consistency and Sample Requirements) Under (A1)\u2013(A5), the probability that the proposed method is structurally consistent tends to one, when the number of samples scales as\n\n$$n = \\Omega\\left( \\theta_{\\min}^{-\\delta \\eta (\\eta+1) - 2} \\log p \\right). \\qquad (10)$$\n\nThus, for learning Ising models on locally tree-like graphs, the sample complexity is dependent both on the minimum edge potential $\\theta_{\\min}$ and on the depth $\\delta$. Our method is efficient in high dimensions since the sample requirement is only logarithmic in the number of nodes $p$.\n\nDependence on Maximum Degree: For the correlation decay to hold (A3), we require $\\theta_{\\min} \\leq \\theta_{\\max} = \\Theta(1/\\Delta_{\\max})$. This implies that the sample complexity is at least $n = \\Omega(\\Delta_{\\max}^{\\delta \\eta (\\eta+1) + 2} \\log p)$.\n\nComparison with Fully Observed Models: In the special case when all the nodes are observed\u00b9 ($\\delta = 1$), we strengthen the results for our method and establish that the sample complexity is $n = \\Omega(\\theta_{\\min}^{-2} \\log p)$. This matches the best known sample complexity for learning fully observed Ising models [5, 6].\n\nComparison with Learning Latent Trees: Our method is an extension of latent tree methods for learning locally tree-like graphs. The sample complexity of our method matches the sample requirements for learning general latent tree models [2\u20134]. Thus, we establish that learning locally tree-like graphs is akin to learning latent trees in the regime of correlation decay.\n\nExtensions: We strengthen the above results to provide non-asymptotic sample complexity bounds and also consider general discrete models, see [17] for details. 
The above results can also be easily extended to Gaussian models using the notion of walk-summability in place of correlation decay (see [18]) and the negative logarithm of the correlation coefficient as the additive tree metric (see [4]).\n\nDependence on Fraction of Observed Nodes: In the special case when a fraction $\\rho$ of the nodes are uniformly selected as observed nodes, we can provide probabilistic bounds on the depth $\\delta$ in the resulting latent model, see [17] for details. For $\\eta = 1$ (homogeneous models) and regular graphs ($\\Delta_{\\min} = \\Delta_{\\max} = \\Delta$), the sample complexity simplifies to $n = \\Omega(\\Delta^2 \\rho^{-2} (\\log p)^3)$. Thus, we can characterize an explicit dependence on the fraction of observed nodes $\\rho$.\n\n3.3 Necessary Conditions for Graph Estimation\n\nWe have so far provided sufficient conditions for recovering locally tree-like graphs in latent Ising models. We now provide necessary conditions on the number of samples required by any algorithm to reconstruct the graph. Let $\\widehat{G}_n : (\\mathcal{X}^{|V|})^n \\to \\mathcal{G}_m$ denote any deterministic graph estimator using $n$ i.i.d. samples from the observed node set $V$, where $\\mathcal{G}_m$ is the set of all possible graphs on $m$ nodes. We first define the notion of the graph edit distance.\n\nDefinition 1 (Edit Distance) Let $G, \\widehat{G}$ be two graphs\u00b2 with adjacency matrices $A_G, A_{\\widehat{G}}$, and let $V$ be the set of labeled vertices in both the graphs (with identical labels). Then the edit distance between $G, \\widehat{G}$ is defined as\n\n$$\\mathrm{dist}(\\widehat{G}, G; V) := \\min_{\\pi} \\|A_{\\widehat{G}} - \\pi(A_G)\\|_1,$$\n\nwhere $\\pi$ is any permutation on the unlabeled nodes while keeping the labeled nodes fixed.\n\nIn other words, the edit distance is the minimum number of entries that are different in $A_{\\widehat{G}}$ and in any permutation of $A_G$ over the unlabeled nodes. In our context, the labeled nodes correspond to the observed nodes $V$ while the unlabeled nodes correspond to latent nodes $H$. 
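For small graphs the edit distance in Definition 1 can be computed by brute force. The sketch below tries every permutation of the unlabeled (latent) nodes while keeping the labeled nodes fixed, and returns the smallest entrywise difference between the two adjacency matrices. It assumes both matrices already have the same size, with the labeled nodes listed first; these conventions and the function name are hypothetical, and the factorial cost makes the approach suitable only for toy examples.

```python
from itertools import permutations


def edit_distance(adj_est, adj_true, num_labeled):
    # Both inputs are symmetric 0/1 adjacency matrices (nested lists) of the
    # same size; indices 0..num_labeled-1 are the labeled (observed) nodes.
    n = len(adj_true)
    unlabeled = list(range(num_labeled, n))
    best = None
    for perm in permutations(unlabeled):
        # relabel[a] is the node of the true graph matched to node a
        relabel = list(range(num_labeled)) + list(perm)
        cost = sum(abs(adj_est[a][b] - adj_true[relabel[a]][relabel[b]])
                   for a in range(n) for b in range(n))
        if best is None or cost < best:
            best = cost
    return best


def adjacency(n, edges):
    # Helper that builds a symmetric adjacency matrix from an edge list
    mat = [[0] * n for _ in range(n)]
    for a, b in edges:
        mat[a][b] = 1
        mat[b][a] = 1
    return mat


# Toy example: nodes 0 and 1 are observed, nodes 2 and 3 are latent
true_graph = adjacency(4, [(0, 2), (1, 3), (2, 3)])
swapped = adjacency(4, [(0, 3), (1, 2), (2, 3)])   # latent labels exchanged
missing = adjacency(4, [(0, 3), (1, 2)])           # one latent edge dropped
print(edit_distance(swapped, true_graph, 2))
print(edit_distance(missing, true_graph, 2))
```

Since the sum runs over all matrix entries, each differing undirected edge contributes two to the count, in line with the entrywise reading of the definition.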
We now provide necessary conditions for graph reconstruction up to a certain edit distance, that is, up to a given number of differing adjacency-matrix entries.\n\nTheorem 2 (Necessary Condition for Graph Estimation) For any deterministic estimator $\\widehat{G}_m : 2^{\\rho m n} \\mapsto \\mathcal{G}_m$ based on $n$ i.i.d. samples, where $\\rho \\in [0, 1]$ is the fraction of observed nodes and $m$ is the total number of nodes of an Ising model Markov on graph $G_m \\in \\mathcal{G}_{\\mathrm{Girth}}(m; g, \\Delta_{\\min}, \\Delta_{\\max})$ on $m$ nodes with girth $g$, minimum degree $\\Delta_{\\min}$ and maximum degree $\\Delta_{\\max}$, for all $\\epsilon > 0$, we have\n\n$$\\mathbb{P}[\\mathrm{dist}(\\widehat{G}_m, G_m; V) > \\epsilon m] \\geq 1 - \\frac{2^{n \\rho m} m^{(2\\epsilon+1)m} 3^{\\epsilon m}}{m^{0.5 \\Delta_{\\min} m} (m - g \\Delta_{\\max}^{g})^{0.5 \\Delta_{\\min} m}}, \\qquad (11)$$\n\nunder any sampling process used to choose the observed nodes.\n\nProof: The proof is based on counting arguments. See [17] for details.\n\n\u00b9 In the trivial case, when all the nodes are observed and the graph is locally tree-like, our method reduces to thresholding of information distances at each node, and building local MSTs. The threshold can be chosen as $r = d_{\\max} + \\epsilon$, for some $\\epsilon > 0$.\n\n\u00b2 We consider inexact graph matching where the unlabeled nodes can be unmatched. This is done by adding the required number of isolated unlabeled nodes in the other graph, and considering the modified adjacency matrices [19].\n\nLower Bound on Sample Requirements: The above result states that roughly\n\n$$n = \\Omega(\\Delta_{\\min} \\rho^{-1} \\log p) \\qquad (12)$$\n\nsamples are required for structural consistency under any estimation method. 
Thus, when $\\rho = \\Theta(1)$ (constant fraction of observed nodes), a polylogarithmic number of samples is necessary ($n = \\Omega(\\mathrm{poly} \\log p)$), while when $\\rho = \\Theta(m^{-\\gamma})$ for some $\\gamma > 0$ (i.e., a vanishing fraction of observed nodes), a polynomial number of samples is necessary for reconstruction ($n = \\Omega(\\mathrm{poly}(p))$).\n\nComparison with Sample Complexity of Proposed Method: For Ising models, under uniform sampling of observed nodes, we established that the sample complexity of the proposed method scales as $n = \\Omega(\\Delta^2 \\rho^{-2} (\\log p)^3)$ for regular graphs with degree $\\Delta$. Thus, we nearly match the lower bound on sample complexity in (12).\n\n4 Experiments\n\nWe employ latent graphical models for topic modeling. Each hidden variable in the model can be thought of as representing a topic, and topics and words in a document are drawn jointly from the graphical model. We conduct some preliminary experiments on the 20 newsgroup dataset with 16,242 binary samples of 100 selected keywords. Each binary sample indicates the appearance of the given words in each posting; these samples are divided into two equal groups for learning and testing purposes. We compare the performance with the popular latent Dirichlet allocation (LDA) model [9]. We evaluate performance in terms of perplexity and topic coherence. In addition, we also study the tradeoff between model complexity and data fitting through the Bayesian information criterion (BIC) [8].\n\nMethods: We consider a regularized variant of the proposed method for latent graphical model selection. Here, in every iteration, the decision to add hidden variables to a local neighborhood is based on the improvement of the overall BIC score. This allows us to trade off model complexity and data fitting. Note that our proposed method only deals with structure estimation and we use expectation maximization (EM) for parameter estimation. We compare the proposed method with the LDA model\u00b3. 
The proposed method is implemented in MATLAB. We used the modules for LBP, made available with the UGM\u2074 package. The LDA models are learnt using the lda package\u2075.\n\n\u00b3 Typically, LDA models the counts of different words in documents. Here, since we have binary data, we consider a binary LDA model where the observed variables are binary.\n\n\u2074 These codes are available at http://www.di.ens.fr/~mschmidt/Software/UGM.html\n\n\u2075 http://chasen.org/~daiti-m/dist/lda/\n\nTable 1: Comparison of proposed method under different thresholds (r) with LDA under different number of topics (i.e., number of hidden variables) on 20 newsgroup data. For definition of perplexity based on test likelihood and BIC scores, and PMI, see (13), (14), and (15).\n\nMethod | r | Hidden | Edges | PMI | Perp-LL | Perp-BIC\nProposed | 7 | 32 | 183 | 0.4313 | 1.1498 | 1.1518\nProposed | 9 | 24 | 129 | 0.6037 | 1.1543 | 1.1560\nProposed | 11 | 26 | 125 | 0.4585 | 1.1555 | 1.1571\nProposed | 13 | 24 | 123 | 0.4289 | 1.1560 | 1.1576\nLDA | NA | 10 | NA | 0.2921 | 1.1480 | 1.1544\nLDA | NA | 20 | NA | 0.1919 | 1.1348 | 1.1474\nLDA | NA | 30 | NA | 0.1653 | 1.1421 | 1.1612\nLDA | NA | 40 | NA | 0.1470 | 1.1494 | 1.1752\n\nPerformance Evaluation: We evaluate performance based on the test perplexity [20] given by\n\n$$\\text{Perp-LL} := \\exp\\left[ -\\frac{1}{np} \\sum_{k=1}^{n} \\log P(x_{\\mathrm{test}}(k)) \\right], \\qquad (13)$$\n\nwhere $n$ is the number of test samples and $p$ is the number of observed variables (i.e., words). Thus the perplexity is monotonically decreasing in the test likelihood and a lower perplexity indicates a better generalization performance. On lines of (13), we also define\n\n$$\\text{Perp-BIC} := \\exp\\left[ -\\frac{1}{np} \\mathrm{BIC}(x_{\\mathrm{test}}) \\right], \\qquad \\mathrm{BIC}(x_{\\mathrm{test}}) := \\sum_{k=1}^{n} \\log P(x_{\\mathrm{test}}(k)) - 0.5 (\\mathrm{df}) \\log n, \\qquad (14)$$\n\nwhere df is the degrees of freedom in the model. 
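The two scores in (13) and (14) are simple functions of the per-sample test log-likelihoods. The sketch below assumes those log-likelihoods have already been computed by some inference routine and are passed in as a list; the function and argument names are hypothetical, and the uniform-model example is only a sanity check, not a result from the paper.

```python
import math


def perp_ll(test_log_likelihoods, num_observed):
    # Perp-LL := exp( -(1 / (n p)) * sum_k log P(x_test(k)) ), as in (13)
    n = len(test_log_likelihoods)
    return math.exp(-sum(test_log_likelihoods) / (n * num_observed))


def perp_bic(test_log_likelihoods, num_observed, degrees_of_freedom):
    # BIC := sum_k log P(x_test(k)) - 0.5 * df * log n, then
    # Perp-BIC := exp( -(1 / (n p)) * BIC ), as in (14)
    n = len(test_log_likelihoods)
    bic = sum(test_log_likelihoods) - 0.5 * degrees_of_freedom * math.log(n)
    return math.exp(-bic / (n * num_observed))


# Sanity check with a hypothetical model that is uniform over p = 5 binary
# variables: every sample has log-likelihood -5 * log(2).
log_liks = [-5 * math.log(2.0)] * 4
print(perp_ll(log_liks, num_observed=5))
print(perp_bic(log_liks, num_observed=5, degrees_of_freedom=10))
```

With zero degrees of freedom the two scores coincide, and the BIC version grows as parameters are added, which is how the penalty on model complexity enters the comparison.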
For a graphical model, we set $\\mathrm{df}_{\\mathrm{GM}} := m + |E|$, where m is the total number of variables (both observed and hidden) and |E| is the number\nof edges in the model. For the LDA model, we set $\\mathrm{df}_{\\mathrm{LDA}} := (p(m - p) - 1)$, where p is the\nnumber of observed variables (i.e., words) and m \u2212 p is the number of hidden variables (i.e., topics).\nThis is because an LDA model is parameterized by a p \u00d7 (m \u2212 p) topic probability matrix and an\n(m \u2212 p)-length Dirichlet prior. Thus, the BIC perplexity in (14) is monotonically decreasing in\nthe BIC score, and a lower BIC perplexity indicates a better tradeoff between model complexity and\ndata fitting. However, the likelihood and BIC score in (13) and (14) are not tractable for exact\nevaluation in general graphical models since they involve the partition function. We employ loopy\nbelief propagation (LBP) to evaluate them. Note that LBP is exact on a tree model and approximate\nfor loopy models.\nIn addition, we also evaluate topic coherence, frequently considered in topic\nmodeling. It is based on the average pointwise mutual information (PMI) score\n\n$$\\mathrm{PMI} := \\frac{1}{45|H|} \\sum_{h \\in H} \\sum_{i,j \\in A(h),\\, i<j} \\mathrm{PMI}(X_i; X_j), \\quad \\mathrm{PMI}(X_i; X_j) := \\log \\frac{P(X_i = 1, X_j = 1)}{P(X_i = 1) P(X_j = 1)}, \\qquad (15)$$\n\nwhere the set A(h) represents the \u201ctop-10\u201d words associated with topic h \u2208 H. The number of such\nword pairs for each topic is $\\binom{10}{2} = 45$, and is used for normalization. In [21], it is found that the\nPMI scores are a good measure of human evaluated topic coherence when they are computed using an\nexternal corpus. We compute PMI scores based on the NYT articles bag-of-words dataset [22].\n\nExperimental Results: We learn the graph structures under different thresholds r \u2208\n{7, 9, 11, 13}, which control the length of cycles. At r = 13, we obtain a latent tree and for all\nother values, we obtain loopy models. 
The first long cycle appears at r = 9. At r = 7, we find a\ncombination of short and long cycles. We find that models with cycles are more effective in discovering\nintuitive relationships. For instance, in the latent tree (r = 13), the link between \u201ccomputer\u201d\nand \u201csoftware\u201d is missing due to the tree constraint, but is discovered when r \u2264 9. Moreover, we\nsee that common words across different topics tend to connect the local subgraphs, and thus loopy\nmodels are better at discovering such relationships. The graph structures from the experiments are\navailable in [17]. In Table 1, we present results under our method and under LDA modeling. For the\nLDA model, we vary the number of hidden variables (i.e., topics) as {10, 20, 30, 40}. In contrast,\nour method is designed to optimize for the number of hidden variables, and does not need this input.\nWe note that our method is competitive in terms of both perplexity and topic coherence. We find\nthat topic coherence (i.e., PMI) for our method is optimal at r = 9, where the graph has a single\nlong cycle and a few short cycles.\nThe above experiments confirm the effectiveness of our approach for discovering hidden topics, and\nare in line with the theoretical guarantees established earlier in the paper. Our analysis reveals that\na large class of loopy graphical models with latent variables can be learnt efficiently.\n\nAcknowledgement\n\nThis work is supported by NSF Award CCF-1219234, AFOSR Award FA9550-10-1-0310, ARO\nAward W911NF-12-1-0404, the setup funds at UCI, and ONR award N00014-08-1-1015.\n\nReferences\n[1] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models\nof Proteins and Nucleic Acids. Cambridge Univ. Press, 1999.\n\n[2] P. L. Erd\u0151s, L. A. Sz\u00e9kely, M. A. Steel, and T. J. Warnow. A few logs suffice to build (almost) all trees:\nPart i. 
Random Structures and Algorithms, 14:153\u2013184, 1999.\n\n[3] E. Mossel. Distorted metrics on trees and phylogenetic forests. IEEE/ACM Transactions on Computational\nBiology and Bioinformatics, pages 108\u2013116, 2007.\n\n[4] M.J. Choi, V.Y.F. Tan, A. Anandkumar, and A. Willsky. Learning latent tree graphical models. J. of\nMachine Learning Research, 12:1771\u20131812, May 2011.\n\n[5] A. Anandkumar, V. Y. F. Tan, F. Huang, and A. S. Willsky. High-dimensional structure learning of Ising\nmodels: local separation criterion. The Annals of Statistics, 40(3):1346\u20131375, 2012.\n\n[6] A. Jalali, C. Johnson, and P. Ravikumar. On learning discrete graphical models using greedy methods. In\nProc. of NIPS, 2011.\n\n[7] P. Ravikumar, M.J. Wainwright, and J. Lafferty. High-dimensional Ising Model Selection Using l1-Regularized\nLogistic Regression. Annals of Statistics, 2008.\n\n[8] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461\u2013464, 1978.\n[9] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. J. of Machine Learning Research,\n3:993\u20131022, 2003.\n\n[10] G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov random fields from samples: some observations\nand algorithms. In Intl. workshop APPROX Approximation, Randomization and Combinatorial\nOptimization, pages 343\u2013356. Springer, 2008.\n\n[11] N. Meinshausen and P. B\u00fchlmann. High dimensional graphs and variable selection with the lasso. Annals\nof Statistics, 34(3):1436\u20131462, 2006.\n\n[12] V. Chandrasekaran, P.A. Parrilo, and A.S. Willsky. Latent Variable Graphical Model Selection via Convex\nOptimization. Arxiv preprint, 2010.\n\n[13] J. Bento and A. Montanari. Which Graphical Models are Difficult to Learn? In Proc. of Neural Information\nProcessing Systems (NIPS), Vancouver, Canada, Dec. 2009.\n\n[14] M.J. Wainwright and M.I. Jordan. 
Graphical Models, Exponential Families, and Variational Inference.\nFoundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.\n\n[15] F.R.K. Chung. Spectral graph theory. Amer Mathematical Society, 1997.\n[16] A. Gamburd, S. Hoory, M. Shahshahani, A. Shalev, and B. Virag. On the girth of random cayley graphs.\nRandom Structures & Algorithms, 35(1):100\u2013117, 2009.\n\n[17] A. Anandkumar and R. Valluvan. Learning Loopy Graphical Models with Latent Variables: Efficient\nMethods and Guarantees. Under revision from Annals of Statistics. Available on ArXiv:1203.3887, Jan.\n2012.\n\n[18] A. Anandkumar, V. Y. F. Tan, F. Huang, and A. S. Willsky. High-Dimensional Gaussian Graphical Model\nSelection: Walk-Summability and Local Separation Criterion. Accepted to J. Machine Learning Research,\nArXiv 1107.1270, June 2012.\n\n[19] H. Bunke and G. Allermann. Inexact graph matching for structural pattern recognition. Pattern Recognition\nLetters, 1(4):245\u2013253, 1983.\n\n[20] D. Newman, E.V. Bonilla, and W. Buntine. Improving topic coherence with regularized topic models. In\nProc. of NIPS, 2011.\n\n[21] David Newman, Sarvnaz Karimi, and Lawrence Cavedon. External evaluation of topic models. In Proceedings\nof the 14th Australasian Computing Symposium (ACD2009), page 8, Sydney, Australia, December\n2009.\n\n[22] A. Frank and A. Asuncion. UCI machine learning repository, 2010.\n", "award": [], "sourceid": 507, "authors": [{"given_name": "Anima", "family_name": "Anandkumar", "institution": null}, {"given_name": "Ragupathyraj", "family_name": "Valluvan", "institution": null}]}