Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Mukund Narasimhan, Jeff A. Bilmes*

We investigate the problem of reducing the complexity of a graphical model (G, PG) by finding a subgraph H of G, chosen from a class of subgraphs H, such that H is optimal with respect to KL-divergence. We do this by first defining a decomposition tree representation for G, which is closely related to the junction-tree representation for G. We then give an algorithm which uses this representation to compute the optimal H H. Gavril [2] and Tarjan [3] have used graph separation properties to solve several combinatorial optimization problems when the size of the minimal separators in the graph is bounded. We present an extension of this technique which applies to some important choices of H even when the size of the minimal separators of G are arbitrarily large. In particular, this applies to problems such as finding an optimal subgraphical model over a (k - 1)-tree of a graphical model over a k-tree (for arbitrary k) and selecting an optimal subgraphical model with (a constant) d fewer edges with respect to KL-divergence can be solved in time polynomial in |V (G)| using this formulation. 1 Introduction and Preliminaries

The complexity of inference in graphical models is typically exponential in some parame- ter of the graph, such as the size of the largest clique. Therefore, it is often required to find a subgraphical model that has lower complexity (smaller clique size) without introducing a large error in inference results. The KL-divergence between the original probability dis- tribution and the probability distribution on the simplified graphical model is often used to measure the impact on inference. Existing techniques for reducing the complexity of graph- ical models including annihilation and edge-removal [4] are greedy in nature and cannot make any guarantees regarding the optimality of the solution. This problem is NP-complete [9] and so, in general, one cannot expect a polynomial time algorithm to find the optimal solution. However, we show that when we restrict the problem to some sets of subgraphs, the optimal solution can be found quite quickly using a dynamic programming algorithm in time polynomial in the tree-width of the graph.

1.1 Notation and Terminology

A graph G = (V, E) is said to be triangulated if every cycle of length greater than 3 has a chord. A clique of G is a non-empty set S V such that {a, b} E for all

```
This work was supported by NSF grant IIS-0093430 and an Intel Corporation Grant.
{b, c, d} {c, f, g} d {b, c} {f, c} {c, e} {b, e, c} {e, c, f } b c g {b, e}
{a, b, e} a e f
Figure 1: A triangulated graph G and a junction-tree for G
```

a, b S. A clique S is maximal if S is not properly contained in another clique. If and are non-adjacent vertices of G then a set of vertices S V \ {, } is called an (, )-separator if and are in distinct components of G[V \ S]. S is a minimal (, )-separator if no proper subset of S is an (, )-separator. S is said to be a minimal separator if S is a minimal (, )-separator for some non adjacent a, b V . If T = (K, S) is a junction-tree for G (see [7]), then the nodes K of T correspond to the maximal- cliques of G, while the links S correspond to minimal separators of G (We reserve the terms vertices/edges for elements of G, and nodes/links for the elements of T ). If G is triangulated, then the number of maximal cliques is at most |V |. For example, in the graph G shown in Figure 1, K = {{b, c, d} , {a, b, e} , {b, e, c} , {e, c, f } , {c, f, g}}. The links S of T correspond to minimal-separators of G in the following way. If ViVj S (where Vi, Vj K and hence are cliques of G), then Vi Vj = . We label each edge ViVj S with the set Vij = Vi Vj, which is a non-empty complete separator in G. The removal of any link ViVj S disconnects T into two subtrees which we denote T (i) and T (j) (chosen so that T (i) contains Vi). We will let K(i) be the nodes of T (i), and V (i) = V K(i)V be the set of vertices corresponding to the subtree T (i). The junction tree property ensures that V (i) V (j) = Vi Vj = Vij. We will let G(i) be the subgraph induced by V (i).

A graphical model is a pair (G, P ) where P is the joint probability distribution for random variables X1, X2, . . . , Xn, and G is a graph with vertex set V (G) = {X1, X2, . . . , Xn} such that the separators in G imply conditional independencies in P (so P factors according to G). If G is triangulated, then the junction-tree algorithm can be used for exact inference in the probability distribution P . The complexity of this algorithm grows with the treewidth of G (which is one less than the size of the largest clique in G when G is triangulated). The growth is exponential when P is a discrete probability distribution, thus rendering exact inference for graphs with large treewidth impractical. Therefore, we seek another graphical model (H, PH ) which allows tractable inference (so H should have lower treewidth than G has). The general problem of finding a graphical model of tree-width at most k so as to minimize the KL-divergence from a specified probability distribution is NP complete for general k ([9]) However, it is known that this problem is solvable in polynomial time (in |V (G)|) for some special cases cases (such as when G has bounded treewidth or when k = 1 [1]). If (G, PG) and (H, PH ) are graphical models, then we say that (H, PH ) is a subgraphical model of (G, PG) if H is a spanning subgraph of G. Note in particular that separators in G are separators in H, and hence (G, PH ) is also a graphical model.

2 Graph Decompositions and Divide-and-Conquer Algorithms

For the remainder of the paper, we will be assuming that G = (V, E) is some triangulated graph, with junction tree T = (K, S). As observed above, if ViVj S, then the removal

```
{b, c, d} {c, f, g} d {b, c} {f, c}
{b, e, c} {e, c, f } b c c g {b, e}
{a, b, e} a e e f
```

Figure 2: The graphs G(i), G(j) and junction-trees T (i) and T (j) resulting from the removal of the link Vij = {c, e}

of Vij = Vi Vj disconnects G into two (vertex-induced) subgraphs G(i) and G(j) which are both triangulated, with junction-trees T (i) and T (j) respectively. We can recursively decompose each of G(i) and G(j) into smaller and smaller subgraphs till the resulting sub- graphs are cliques. When the size of all the minimal separators are bounded, we may use these decompositions to easily solve problems that are hard in general. For example, in [5] it is shown that NP-complete problems like vertex coloring, and finding maximum inde- pendent sets can be solved in polynomial time on graphs with bounded tree-width (which are equivalent to spanning graphs with bounded size separators). We will be interested in finding (triangulated) subgraphs of G that satisfy some conditions, such as a bound on the number of edges, or a bound on the tree-width and which optimize separable objective functions (described in Section 2)

One reason why problems such as this can often be solved easily when the tree-width of G is bounded by some constant is this : If Vij is a separator decomposing G into G(i) and G(j), then a divide-and-conquer approach would suggest that we try and find optimal subgraphs of G(i) and G(j) and then splice the two together to get an optimal subgraph of G. There are two issues with this approach. First, the optimal subgraphs of G(i) and G(j) need not necessarily match up on Vij, the set of common vertices. Second, even if the two subgraphs agree on the set of common vertices, the graph resulting from splicing the two subgraphs together need not be triangulated (which could happen even if the two subgraphs individually are triangulated). To rectify the situation, we can do the following. We parti- tion the set of subgraphs of G(i) and G(j) into classes, so that any subgraph of G(i) and any subgraph G(j) corresponding to the same class are compatible in the sense that they match up on their intersection namely Vij, and so that by splicing the two subgraphs together, we get a subgraph of G which is acceptable (and in particular is triangulated). Then given op- timal subgraphs of both G(i) and G(j) corresponding to each class, we can enumerate over all the classes and pick the best one. Of course, to ensure that we do not repeatedly solve the same problem, we need to work bottom-up (a.k.a dynamic programming) or memoize our solutions. This procedure can be carried out in polynomial (in |V |) time as long as we have only a polynomial number of classes. Now, if we have a polynomial number of classes, these classes need not actually be a partition of all the acceptable subgraphs, though the union of the classes must cover all acceptable subgraphs (so the same subgraph can be contained in more than one class). For our application, every class can be thought of to be the set of subgraphs that satisfy some constraint, and we need to pick a polynomial number of constraints that cover all possibilities. The bound on the tree-width helps us here. If |V ) ij | = k, then in any subgraph H of G, H [Vij ] must be one of the 2(k 2 possible subgraphs of G[V ) ij ]. So, if k is sufficiently small (so 2(k 2 is bounded by some polynomial in |V |),

then this procedure results in a polynomial time algorithm. In this paper, we show that in some cases we can characterize the space H so that we still have a polynomial number of constraints even when the tree-width of G is not bounded by a small constant.

2.1 Separable objective functions

For cases where exact inference in the graphical model (G, PG) is intractable, it is natural to try to find a subgraphical model (H, PH ) such that D(PG PH ) is minimized, and inference using H is tractable. We will denote by H the set of subgraphs of G that are tractable for inference. For example, this set could be the set of subgraphs of G with treewidth one less than the treewidth of G, or perhaps the set of subgraphs of G with at d fewer edges. For a specified subgraph H of G, there is a unique probability distribution PH factoring over H that minimizes D(PG PH ). Hence, finding a optimal subgraphical model is equivalent to finding a subgraph H for which D(PG PH ) is minimized. If Vij is a separator of G, we will attempt to find optimal subgraphs of G by finding optimal subgraphs of G(i) and G(j) and splicing them together. However, to do this, we need to ensure that the objective criteria also decomposes along the separator Vij. Suppose that H is any triangulated subgraph of G. Let PG(i) and PG(j) be the (marginalized) distributions of PG on V (i) and V (j) respectively, and PH(i) and PH(j) be the (marginalized) distributions of the distribution PH on V (i) and V (j) where H(i) = H[V (i)] and H(j) = H[V (j)], The following result assures us that the KL-divergence also factors according to the separator Vij. Lemma 1. Suppose that (G, PG) is a graphical model, H is a triangulated subgraph of G, and PH factors over H. Then D(PG PH ) = D(PG(i) PH(i)) + D(PG(j) PH(j)) - D(PG[Vij] PH[Vij]).

Proof. Since H is a subgraph of G, and Vij is a separator of G, Vij must also be a sepa- P rator of H. Therefore, P H(i) ({Xv }vV (i) )PH(j) ({Xv }vV (j) ) H {Xv} = . The result vV PH[V ) ij ] ({Xv }vVij follows immediately.

Therefore, there is hope that we can reduce our our original problem of finding an optimal subgraph H H as one of finding subgraphs of H (i) G(i) and H(j) G(j) that are compatible, in the sense that they match up on the overlap Vij, and for which D(PG PH ) is minimized. Throughout this paper, for the sake of concreteness, we will assume that the objective criterion is to minimize the KL-divergence. However, all the results can be extended to other objective functions, as long as they "separate" in the sense that for any separator, the objective function is the sum of the objective functions of the two parts, possibly modulo some correction factor which is purely a function of the separator. Another example might be the complexity r(H) of representing the graphical model H. A very natural representation satisfies r(G) = r(G(i)) + r(G(j)) if G has a separator G(i) G(j). Therefore, the representation cost reduction would satisfy r(G) - r(H) = (r(G(i)) - r(H(i))) + (r(G(j)) - r(H(j))), and so also factors according to the separators. Finally note that any linear combinations of such separable functions is also separable, and so this technique could also be used to determine tradeoffs (representation cost vs. KL-divergence loss for example). In Section 4 we discuss some issues regarding computing this function.

2.2 Decompositions and decomposition trees

For the algorithms considered in this paper, we will be mostly interested in the decompo- sitions that are specified by the junction tree, and we will represent these decompositions by a rooted tree called a decomposition tree. This representation was introduced in [2, 3], and is similar in spirit to Darwiche's dtrees [6] which specify decompositions of directed acyclic graphs. In this section and the next, we show how a decomposition tree for a graph may be constructed, and show how it is used to solve a number of optimization problems.

```
abd; ce; gf
a; be; cd
e; cf ; g d; bc; e
abe dbc ebc cef cf g
Figure 3: The separator tree corresponding to Figure 1
```

A decomposition tree for G is a rooted tree whose vertices correspond to separators and cliques of G. We describe the construction of the decomposition tree in terms of a junction- tree T = (K, S) for G. The interior nodes of the decomposition tree R(T ) correspond to S (the links of T and hence the minimal separators of G). The leaf or terminal nodes represent the elements of K (the nodes of T and hence the maximal cliques of G). R(T ) can be recursively constructed from T as follows : If T consists of just one node K, (and hence no edges), then R consists of just one node, which is given the label K as well. If however, T has more than one node, then T must contain at least one link. To begin, let ViVj S be any link in T . Then removal of the link ViVj results in two disjoint junction- trees T (i) and T (j). We label the root of R by the decomposition (V (i); Vij; V (j)). The rest of R is recursively built by successively picking links of T (i) and T (j) (decompositions of G(i) and G(j)) to form the interior nodes of R. The effect of this procedure on the junction tree of Figure 1 is shown in Figure 3, where the decomposition associated with the interior nodes is shown inside the nodes. Let M be the set of all nodes of R(T ). For any interior node M induced by the the link ViVj S of T , then we will let M (i) and M (j) represent the left and right children of M , and R(i) and R(j) be the left and right trees below M .

3 Finding optimal subgraphical models

3.1 Optimal sub (k - 1)-trees of k-trees

Suppose that G is a k-tree. A sub (k - 1)-tree of G is a subgraph H of G that is (k - 1)- tree. Now, if Vij is any minimal separator of G, then both G(i) and G(j) are k-trees on vertex sets V (i) and V (j) respectively. It is clear that the induced subgraphs H[V (i)] and H[V (j)] are subgraphs of G(i) and G(j) and are partial (k - 1)-trees. We will be interested in finding sub (k - 1)-trees of k trees and this problem is trivial by the result of [1] when k = 2. Therefore, we assume that k 3. The following result characterizes the various possibilities for H[Vij] in this case.

Lemma 2. Suppose that G is a k-tree, and S = Vij is a minimal separator of G corre- sponding to the link ij of the junction-tree T . In any (k - 1)-tree H G either

```
1. There is a u S such that u is not connected to vertices in both V (i) \ S and V (j) \ S. Then S \ {u} is a minimal separator in H and hence is complete.
2. Every vertex in S is connected to vertices in both V (i) \S and V (j) \S. Then there are vertices {x, y} S such that the edge H[S] is missing only the edge {x, y}. Further either H[V (i)] or H[V (j)] does not contain a unchorded x-y path.
```

Proof. We consider two possibilities. In the first, there is some vertex u S such that u is not connected to vertices in both V (i) \ S and V (j). Since the removal of S disconnects G, the removal of S must also disconnect H. Therefore, S must contain a minimal separator of H. Since H is a (k - 1)-tree, all minimal separators of H must contain k - 1 vertices which must therefore be S {u}. This corresponds to case (1) above. Clearly this possiblity can occur.

If there is no such u S, then every vertex in S is connected to vertices in both V (i) \ S and V (j) \ S. If x S is connected to some yi V (i) \ S and yj V (j) \ S, then x is contained in every minimal yi/yj separator (see [5]). Therefore, every vertex in S is part of a minimal separator. Since each minimal separator contains k - 1 vertices, there must be at least two distinct minimum separators contained in S. Let Sx = S \ {x} and Sy = S \ {y} be two distinct minimal separators. We claim that H[S] contains all edges except the edge {x, y}. To see this, note that if z, w S, with z = w and {z, w} = {x, y} (as sets), then either {z, w} Sy or {z, w} Sx. Since both Sx and Sy are complete in H, this edge must be present in H. The edge {x, y} is not present in H[S] because all minimal separators in H must be of size k - 1. Further, if both V (i) and V (j) contain an unchorded path between x and y, then by joining the two paths at x and y, we get a unchorded cycle in H which contradicts the fact that H is triangulated.

Therefore, we may associate k 2 + 2 k constraints with each separator V 2 ij of G as follows. There are k possible constraints corresponding to case (1) above (one for each choice of x), and k 2 choices corresponding to case (2) above. This is because for each 2 pair {x, y} corresponding to the missing edge, we have either V (i) contains no unchorded xy paths or V (j) contains no unchorded xy paths. More explicitly, we can encode the set of constraints CM associated with each separator S corresponding to an interior node M of the decomposition tree as follows: CM = { (x, y, s) : x S, y S, s {i, j}}. If y = x, then this corresponds to case (1) of the above lemma. If s = i, then x is connected only to H(i) and if s = j, then x is connected only to H(j). If y = x, then this corresponds to case (2) in the above lemma. If s = i, then H (i) does not contain any unchorded path between x and y, and there is no constraint on H(j). Similarly if s = j, then H(j) does not contain any unchorded path between x and y, and there is no constraint on H (i).

Now suppose that H(i) and H(j) are triangulated subgraphs of G(i) and G(j) respectively, then it is clear that if H(i) and H(j) both satisfy the same constraint they must match up on the common vertices Vij. Therefore to splice together two solutions corresponding to the same constraint, we only need to check that the graph obtained by splicing the graphs is triangulated.

Lemma 3. Suppose that H(i) and H(j) are triangulated subgraphs of G(i) and G(j) re- spectively such that both of them satisfy the same constraint as described above. Then the graph H obtained by splicing H(i) and H(j) together is triangulated.

Proof. Suppose that both H(i) and H(j) are both triangulated and both satisfy the same constraint. If both H(i) and H(j) satisfy the same constraint corresponding to case (1) in Lemma 2 and H has an unchorded cycle, then this cycle must involve elements of both H(i) and H(j). Therefore, there must be two vertices of S {u} on the cycle, and hence this cycle has a chord as S \ {u} is complete. This contradiction shows that H is triangulated. So assume that both of them satisfy the constraint corresponding to case (2) of Lemma 2. Then if H is not triangulated, there must be a t-cycle (for t 4) with no chord. Now, since {x, y} is the only missing edge of S in H, and because H (i) and H(j) are individually triangulated, the cycle must contain x, y and vertices of both V (i) \ S and V (j) \ S. We

may split this unchorded cycle into two unchorded paths, one contained in V (i) and one in V (j) thus violating our assumption that both H(i) and H(j) satisfy the same constraint.

If |S| = k, then there are 2k + 2 k O(k2) O(n2). We can use a divide and conquer 2 strategy to find the optimal sub (k - 1) tree once we have taken care of the base case, where G is just a single clique (of k + 1) elements. However, for this case, it is easily checked that any subgraph of G obtained by deleting exactly one edge results in a (k - 1) tree, and every sub (k-1)-tree results from this operation. Therefore, the optimal (k-1)-tree can be found using Algorithm 1, and in this case, the complexity of Algorithm 1 is O(n(k + 1)2). This procedure can be generalized to find the optimal sub (k - d)- tree for any fixed d. However, the number of constraints grows exponentially with d (though it is still polynomial in n). Therefore for small, fixed values of d, we can compute the optimal sub (k - d)-tree of G. While we can compute (k - d)-trees of G by first going from a k tree to a (k - 1) tree, then from a (k - 1)-tree to a (k - 2)-tree, and so on in a greedy fashion, this will not be optimal in general. However, this might be a good algorithm to try when d is large.

3.2 Optimal triangulated subgraphs with |E(G)| - d edges

Suppose that we are interested in a (triangulated) subgraph of G that contains d fewer edges that G does. That is, we want to find an optimal subgraph H G such that |E(H)| = |E(G)| - d. Note that by the result of [4] there is always a triangulated subgraph with d fewer edges (if d < |E(G)|). Two possibilities for finding such an optimal subgraph are

```
1. Use the procedure described in [4]. This is a greedy procedure which works in d steps by deleting an edge at each step. At each state, the edge is picked from the set of edges whose deletion leaves a triangulated graph. Then the edge which causes the least increase in KL-divergence is picked at each stage.
2. For each possible subset A of E(G) of size d, whose deletion leaves a triangulated graph, compute the KL divergence using the formula above, and then pick the optimal one. Since there are |E(G)| such sets, this can be done in polynomial d time (in |V (G)|) when d is a constant.
```

The first greedy algorithm is not guaranteed to yield the optimal solution. The second takes time that is O(n2d). Now, let us solve this problem using the framework we've described.

Let H be the set of subgraphs of G which may be obtained by deletion of d edges. For each M = ij M corresponding to the separator Vij, let CM =

```
(l, r, c, s, A) : l + r - c = d, s a d bit string, A E(G[Vij]) . The constraint repre- c
```

sented by (l, r, c, A) is this : A is a set of d edges of G[Vij] that are missing in H, l edges are missing from the left subgraph, and r edges are missing from the right subgraph. c rep- resents the double count, and so is subtracted from the total. If k is the size of the largest ) clique, then the total number of such constraints is bounded by 2d 2d (k2 O(k2d) d which could be better than O(n2d) and is polynomial in |V | when d is constant. See [10] for additional details.

4 Conclusions

Algorithm 1 will compute the optimal H H for the two examples discussed above and is polynomial (for fixed constant d) even if k is O(n). In [10] a generalization is presented which will allow finding the optimal solution for other classes of subgraphical models. Now, we assume an oracle model for computing KL-divergences of probability distribu- tions on vertex sets of cliques. It is clear that these KL-divergences can be computed

R separator-tree for G; for each vertex M of R in order of increasing height (bottom up) do for each constraint cM of M do if M is an interior vertex of R corresponding to edge ij of the junction tree then Let Ml and Mr be the left and right children of M ; Pick constraint cl CM compatible with c l M to minimize table[Ml, cl]; Pick constraint cr CM compatible with c r M to minimize table[Mr , cr ]; loss D(PG[M ] PH [M ]); table[M, cM ] table[Ml, cl] + table[Mr, cr] - loss;

```
else table[M, cM ] D(PG[M ] PH [M ]);
end end end
Algorithm 1: Finding optimal set of constraints
```

efficiently for distributions like Gaussians, but for discrete distributions this may not be possible when k is large. However even in this case this algorithm will result in only polynomial calls to the oracle. The standard algorithm [3] which is exponential in the treewidth will make O(2k) calls to this oracle. Therefore, when the cost of computing the KL-divergence is large, this algorithm becomes even more attractive as it results in expo- nential speedup over the standard algorithm. Alternatively, if we can compute approximate KL-divergences, or approximately optimal solutions, then we can compute an approximate solution by using the same algorithm.

Do not remove: This comment is monitored to verify that the site is working properly