{"title": "EDML for Learning Parameters in Directed and Undirected Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1502, "page_last": 1510, "abstract": "EDML is a recently proposed algorithm for learning parameters in Bayesian networks.  It was originally derived in terms of approximate inference on a meta-network, which underlies the Bayesian approach to parameter estimation.  While this initial derivation helped discover EDML in the first place and provided a concrete context for identifying some of its properties (e.g., in contrast to EM), the formal setting was somewhat tedious in the number of concepts it drew on.  In this paper, we propose a greatly simplified perspective on EDML, which casts it as a general approach to continuous optimization. The new perspective has several advantages. First, it makes immediate some results that were non-trivial to prove initially. Second, it facilitates the design of EDML algorithms for new graphical models, leading to a new algorithm for learning parameters in Markov networks.  We derive this algorithm in this paper, and show, empirically, that it can sometimes learn better estimates from complete data, several times faster than commonly used optimization methods, such as conjugate gradient and L-BFGS.", "full_text": "EDML for Learning Parameters in\n\nDirected and Undirected Graphical Models\n\nKhaled S. Refaat, Arthur Choi, Adnan Darwiche\n\nComputer Science Department\n\nUniversity of California, Los Angeles\n\n{krefaat,aychoi,darwiche}@cs.ucla.edu\n\nAbstract\n\nEDML is a recently proposed algorithm for learning parameters in Bayesian net-\nworks. It was originally derived in terms of approximate inference on a meta-\nnetwork, which underlies the Bayesian approach to parameter estimation. 
While\nthis initial derivation helped discover EDML in the \ufb01rst place and provided a con-\ncrete context for identifying some of its properties (e.g., in contrast to EM), the\nformal setting was somewhat tedious in the number of concepts it drew on. In this\npaper, we propose a greatly simpli\ufb01ed perspective on EDML, which casts it as\na general approach to continuous optimization. The new perspective has several\nadvantages. First, it makes immediate some results that were non-trivial to prove\ninitially. Second, it facilitates the design of EDML algorithms for new graphical\nmodels, leading to a new algorithm for learning parameters in Markov networks.\nWe derive this algorithm in this paper, and show, empirically, that it can sometimes\nlearn estimates more ef\ufb01ciently from complete data, compared to commonly used\noptimization methods, such as conjugate gradient and L-BFGS.\n\n1\n\nIntroduction\n\nEDML is a recently proposed algorithm for learning MAP parameters of a Bayesian network from\nincomplete data [5, 16]. While it is procedurally very similar to Expectation Maximization (EM) [7,\n11], EDML was shown to have certain advantages, both theoretically and practically. Theoretically,\nEDML can in certain specialized cases provably converge in one iteration, whereas EM may require\nmany iterations to solve the same learning problem. Some empirical evaluations further suggested\nthat EDML and hybrid EDML/EM algorithms can sometimes \ufb01nd better parameter estimates than\nvanilla EM, in fewer iterations and less time. EDML was originally derived in terms of approximate\ninference on a meta-network used for Bayesian approaches to parameter estimation. This graphical\nrepresentation of the estimation problem lent itself to the initial derivation of EDML, as well to the\nidenti\ufb01cation of certain key theoretical properties, such as the one we just described. 
The formal\ndetails, however, can be somewhat tedious as EDML draws on a number of different concepts. We\nreview EDML in such terms in the supplementary appendix.\nIn this paper, we propose a new perspective on EDML, which views it more abstractly in terms of\na simple method for continuous optimization. This new perspective has a number of advantages.\nFirst, it makes immediate some results that were previously obtained for EDML, but through some\neffort. Second, it facilitates the design of new EDML algorithms for new classes of models, where\ngraphical formulations of parameter estimation, such as meta-networks, are lacking. Here, we de-\nrive, in particular, a new parameter estimation algorithm for Markov networks, which is in many\nways a more challenging task, compared to the case of Bayesian networks. Empirically, we \ufb01nd that\nEDML is capable of learning parameter estimates, under complete data, more ef\ufb01ciently than popu-\nlar methods such as conjugate-gradient and L-BFGS, and in some cases, by an order-of-magnitude.\n\n1\n\n\fThis paper is structured as follows. In Section 2, we highlight a simple iterative method for, approxi-\nmately, solving continuous optimization problems. In Section 3, we formulate the EDML algorithm\nfor parameter estimation in Bayesian networks, as an instance of this optimization method. In Sec-\ntion 4, we derive a new EDML algorithm for Markov networks, based on the same perspective. In\nSection 5, we contrast the two EDML algorithms for directed and undirected graphical models, in\nthe complete data case. We empirically evaluate our new algorithm for parameter estimation under\ncomplete data in Markov networks, in Section 6; review related work in Section 7; and conclude in\nSection 8. Proofs of theorems appear in the supplementary appendix.\n\n2 An Approximate Optimization of Real-Valued Functions\n\nConsider a real-valued objective function f (x) whose input x is a vector of components:\n\nx = (x1, . . . , xi, . . . 
, xn),\n\nwhere each component xi is a vector in $\\mathbb{R}^{k_i}$ for some $k_i$. Suppose further that we have a constraint\non the domain of function f (x) with a corresponding function g that maps an arbitrary point x to a\npoint g(x) satisfying the given constraint. We say in this case that g(x) is a feasibility function and\nrefer to the points in its range as feasible points.\nOur goal here is to find a feasible input vector x = (x1, . . . , xi, . . . , xn) that optimizes the function\nf (x). Given the difficulty of this optimization problem in general, we will settle for finding\nstationary points x in the constrained domain of function f (x).\nOne approach for finding such stationary points is as follows. Let $x^\\star = (x^\\star_1, \\ldots, x^\\star_i, \\ldots, x^\\star_n)$ be a\nfeasible point in the domain of function f (x). For each component xi, we define a sub-function\n\n$$f_{x^\\star}(x_i) = f(x^\\star_1, \\ldots, x^\\star_{i-1}, x_i, x^\\star_{i+1}, \\ldots, x^\\star_n).$$\n\nThat is, we use the n-ary function f (x) to generate n sub-functions $f_{x^\\star}(x_i)$. Each of these sub-functions\nis obtained by fixing all inputs xj of f (x), for $j \\neq i$, to their values in $x^\\star$, while keeping\nthe input xi free. We further assume that these sub-functions are subject to the same constraints that\nthe function f (x) is subject to.\nWe can now characterize all feasible points $x^\\star$ that are stationary with respect to the function f (x),\nin terms of local conditions on sub-functions $f_{x^\\star}(x_i)$.\n\nClaim 1 A feasible point $x^\\star = (x^\\star_1, \\ldots, x^\\star_i, \\ldots, x^\\star_n)$ is stationary for function f (x) iff for all i,\ncomponent $x^\\star_i$ is stationary for sub-function $f_{x^\\star}(x_i)$.\n\nThis is immediate from the definition of a stationary point. 
Assuming no constraints, at a stationary\npoint $x^\\star$, the gradient $\\nabla f(x^\\star) = 0$, i.e., $\\nabla_{x_i} f(x^\\star) = \\nabla f_{x^\\star}(x^\\star_i) = 0$ for all xi, where $\\nabla_{x_i} f(x^\\star)$\ndenotes the sub-vector of gradient $\\nabla f(x^\\star)$ with respect to component xi.1\nWith these observations, we can now search for feasible stationary points $x^\\star$ of the constrained\nfunction f (x) using an iterative method that searches instead for stationary points of the constrained\nsub-functions $f_{x^\\star}(x_i)$. The method works as follows:\n\n1. Start with some feasible point $x^t$ of function f (x) for t = 0\n2. While some $x^t_i$ is not a stationary point for constrained sub-function $f_{x^t}(x_i)$\n\n(a) Find a stationary point $y^{t+1}_i$ for each constrained sub-function $f_{x^t}(x_i)$\n(b) $x^{t+1} = g(y^{t+1})$\n(c) Increment t\n\nThe real computational work of this iterative procedure is in Steps 2(a) and 2(b), although we shall\nsee later that such steps can, in some cases, be performed efficiently. 
With an appropriate feasibility\nfunction g(y), one can guarantee that a \ufb01xed-point of this procedure yields a stationary point of the\nconstrained function f (x), by Claim 1.2 Further, any stationary point is trivially a \ufb01xed-point of this\nprocedure (one can seed this procedure with such a point).\n\n1Under constraints, we consider points that are stationary with respect to the corresponding Lagrangian.\n2We discuss this point further in the supplementary appendix.\n\n2\n\n\fAs we shall show in the next section, the EDML algorithm\u2014which has been proposed for parameter\nestimation in Bayesian networks\u2014is an instance of the above procedure with some notable obser-\nvations: (1) the sub-functions fxt(xi) are convex and have unique optima; (2) these sub-functions\nhave an interesting semantics, as they correspond to posterior distributions that are induced by Naive\nBayes networks with soft evidence asserted on them; (3) de\ufb01ning these sub-functions requires infer-\nence in a Bayesian network parameterized by the current feasible point xt; (4) there are already sev-\neral convergent, \ufb01xed-point iterative methods for \ufb01nding the unique optimum of these sub-functions;\nand (5) these convergent methods produce solutions that are always feasible and, hence, the feasi-\nbility function g(y) corresponds to the identity function g(y) = y in this case.\nWe next show this connection to EDML as proposed for parameter estimation in Bayesian networks.\nWe follow by deriving an EDML algorithm (another instance of the above procedure), but for param-\neter estimation in undirected graphical models. 
We will also study the impact of having complete\ndata on both versions of the EDML algorithm, and finally evaluate the new instance of EDML by\ncomparing it to conjugate gradient and L-BFGS when applied to complete datasets.\n\n3 EDML for Bayesian Networks\n\nFrom here on, we use upper case letters (X) to denote variables and lower case letters (x) to denote\ntheir values. Variable sets are denoted by bold-face upper case letters (X) and their instantiations\nby bold-face lower case letters (x). Generally, we will use X to denote a variable in a Bayesian\nnetwork and U to denote its parents. A network parameter will therefore have the general form\n\u03b8x|u, representing the probability Pr (X = x|U = u).\nConsider a (possibly incomplete) dataset D with examples d1, . . . , dN , and a Bayesian network with\nparameters \u03b8. Our goal is to find parameter estimates \u03b8 that minimize the negative log-likelihood:\n\n$$f(\\theta) = -\\ell\\ell(\\theta \\mid D) = -\\sum_{i=1}^{N} \\log \\Pr_{\\theta}(d_i). \\qquad (1)$$\n\nHere, \u03b8 = (. . . , \u03b8X|u, . . .) is a vector over the network parameters. Moreover, Pr \u03b8 is the distribution\ninduced by the Bayesian network structure under parameters \u03b8. As such, Pr \u03b8(di) is the probability\nof observing example di in dataset D under parameters \u03b8.\nEach component of \u03b8 is a parameter set \u03b8X|u, which defines a parameter \u03b8x|u for each value x of\nvariable X and instantiation u of its parents U. The feasibility constraint here is that each component\n\u03b8X|u satisfies the convex sum-to-one constraint: $\\sum_x \\theta_{x|u} = 1$.\n\nThe above parameter estimation problem is clearly in the form of the constrained optimization problem\nthat we phrased in the previous section and, hence, admits the same iterative procedure proposed\nin that section for finding stationary points. 
The relevant questions now are: What form do the sub-functions\n$f_{\\theta^\\star}(\\theta_{X|u})$ take in this context? What are their semantics? What properties do they have?\nHow do we find their stationary points? What is the feasibility function g(y) in this case? Finally,\nwhat is the connection to previous work on EDML? We address these questions next.\n\n3.1 Form\n\nWe start by characterizing the sub-functions of the negative log-likelihood given in Equation 1.\nTheorem 1 For each parameter set \u03b8X|u, the negative log-likelihood of Equation 1 has the sub-function:\n\n$$f_{\\theta^\\star}(\\theta_{X|u}) = -\\sum_{i=1}^{N} \\log \\Big( C^i_u + \\sum_x C^i_{x|u} \\cdot \\theta_{x|u} \\Big) \\qquad (2)$$\n\nwhere $C^i_u$ and $C^i_{x|u}$ are constants that are independent of parameter set \u03b8X|u, given by\n\n$$C^i_u = \\Pr_{\\theta^\\star}(d_i) - \\Pr_{\\theta^\\star}(u, d_i) \\quad \\text{and} \\quad C^i_{x|u} = \\Pr_{\\theta^\\star}(x, u, d_i)/\\theta^\\star_{x|u}$$\n\nTo compute the constants $C^i$, we require inference on a Bayesian network with parameters $\\theta^\\star$.3\n\n3Theorem 1 assumes tacitly that $\\theta^\\star_{x|u} \\neq 0$. More generally, however, $C^i_{x|u} = \\partial \\Pr_{\\theta^\\star}(d_i)/\\partial \\theta_{x|u}$, which\ncan also be computed using some standard inference algorithms [6, 14].\n\nFigure 1: Estimation given independent soft observations.\n\n3.2 Semantics\n\nEquation 2 has an interesting semantics, as it corresponds to the negative log-likelihood of a root\nvariable in a naive Bayes structure, on which soft, not necessarily hard, evidence is asserted [5].4\nThis model is illustrated in Figure 1, where our goal is to estimate a parameter set \u03b8X, given soft\nobservations \u03b7 = (\u03b71, . . . , \u03b7N ) on variables X1, . . . , XN , where each \u03b7i has a strength specified by\na weight on each value xi of Xi. 
If we denote the distribution of this model by P, then (1) P(\u03b8)\ndenotes a prior over parameter sets,5 (2) P(xi|\u03b8X = (. . . , \u03b8x, . . .)) = \u03b8x, and (3) weights P(\u03b7i|xi)\ndenote the strengths of soft evidence \u03b7i on value xi. The log likelihood of our soft observations \u03b7\nis:\n\n$$\\log P(\\eta \\mid \\theta_X) = \\sum_{i=1}^{N} \\log \\sum_{x_i} P(\\eta_i \\mid x_i) P(x_i \\mid \\theta_X) = \\sum_{i=1}^{N} \\log \\sum_{x_i} P(\\eta_i \\mid x_i) \\cdot \\theta_x \\qquad (3)$$\n\nThe following result connects Equation 2 to the above likelihood of a soft dataset, when we now\nwant to estimate the parameter set \u03b8X|u, for a particular variable X and parent instantiation u.\nTheorem 2 Consider Equations 2 and 3, and assume that each soft evidence \u03b7i has the strength\n$P(\\eta_i \\mid x_i) = C^i_u + C^i_{x|u}$. It then follows that\n\n$$f_{\\theta^\\star}(\\theta_{X|u}) = -\\log P(\\eta \\mid \\theta_{X|u}) \\qquad (4)$$\n\nThis theorem yields the following interesting semantics for EDML sub-functions. Consider a parameter\nset \u03b8X|u and example di in our dataset. The example can then be viewed as providing\n\u201cvotes\u201d on what this parameter set should be. In particular, the vote of example di for value x takes\nthe form of a soft evidence \u03b7i whose strength is given by\n\n$$P(\\eta_i \\mid x_i) = \\Pr_{\\theta^\\star}(d_i) - \\Pr_{\\theta^\\star}(u, d_i) + \\Pr_{\\theta^\\star}(x, u, d_i)/\\theta^\\star_{x|u}$$\n\nThe sub-function then aggregates these votes from different examples and produces a corresponding\nobjective function on parameter set \u03b8X|u. EDML optimizes this objective function to\nproduce the next estimate for each parameter set \u03b8X|u.\n\n3.3 Properties\n\nEquation 2 is a convex function, and thus has a unique optimum.6 In particular, we have logs of\nlinear functions, each of which is concave. 
A sum of concave functions is also concave, and the negation of a concave function is convex; thus our\nsub-function $f_{\\theta^\\star}(\\theta_{X|u})$ is convex, and is subject to a convex sum-to-one constraint [16]. Convex\nfunctions are relatively well understood, and there are a variety of methods and systems that can be\nused to optimize Equation 2; see, e.g., [3]. We describe one such approach, next.\n\n4Soft evidence is an observation that increases or decreases one's belief in an event, but not necessarily to\nthe point of certainty. For more on soft evidence, see [4].\n\n5Typically, we assume Dirichlet priors for MAP estimation. However, we focus on ML estimation here.\n6More specifically, strict convexity implies a unique optimum, and under certain assumptions, we can\nguarantee that Equation 2 is indeed strictly convex.\n\n3.4 Finding the Unique Optima\n\nIn every EDML iteration, and for each parameter set \u03b8X|u, we seek the unique optimum for each\nsub-function $f_{\\theta^\\star}(\\theta_{X|u})$, given by Equation 2. Refaat et al. have previously proposed a fixed-point\nalgorithm that monotonically improves the objective, and is guaranteed to converge [16]. Moreover,\nthe solutions it produces already satisfy the convex sum-to-one constraint and, hence, the feasibility\nfunction g ends up being the identity function g(\u03b8) = \u03b8.\nIn particular, we start with some initial feasible estimates $\\theta^t_{X|u}$ at iteration t = 0, and then apply the\nfollowing update equation until convergence:\n\n$$\\theta^{t+1}_{x|u} = \\frac{1}{N} \\sum_{i=1}^{N} \\frac{(C^i_u + C^i_{x|u}) \\cdot \\theta^t_{x|u}}{C^i_u + \\sum_{x'} C^i_{x'|u} \\cdot \\theta^t_{x'|u}} \\qquad (5)$$\n\nNote here that constants $C^i$ are computed by inference on a Bayesian network structure under parameters\n$\\theta^t$ (see Theorem 1 for the definitions of these constants). 
Moreover, while the above procedure\nis convergent when optimizing sub-functions f\u03b8(cid:63) (\u03b8X|u), the global EDML algorithm that is opti-\nmizing function f (\u03b8) may not be convergent in general.\n\n3.5 Connection to Previous Work\n\nEDML was originally derived by applying an approximate inference algorithm to a meta-network,\nwhich is typically used in Bayesian approaches to parameter estimation [5, 16]. This previous\nformulation of EDML, which is speci\ufb01c to Bayesian networks, now falls as a special instance of\nthe one given in Section 2. In particular, the \u201csub-problems\u201d de\ufb01ned by the original EDML [5, 16]\ncorrespond precisely to the sub-functions f\u03b8(cid:63) (\u03b8X|u) described here. Further, both versions of EDML\nare procedurally identical when they both use the same method for optimizing these sub-functions.\nThe new formulation of EDML is more transparent, however, at least in revealing certain properties\nof the algorithm. For example, it now follows immediately (from Section 2) that the \ufb01xed points\nof EDML are stationary points of the log-likelihood\u2014a fact that was not proven until [16], using a\ntechnique that appealed to the relationship between EDML and EM. Moreover, the proof that EDML\nunder complete data will converge immediately to the optimal estimates is also now immediate (see\nSection 5). More importantly though, this new formulation provides a systematic procedure for\nderiving new instances of EDML for additional models, beyond Bayesian networks. Indeed, in the\nnext section, we use this procedure to derive an EDML instance for Markov networks, which is\nfollowed by an empirical evaluation of the new algorithm under complete data.\n\n4 EDML for Undirected Models\n\nIn this section, we show how parameter estimation for undirected graphical models, such as Markov\nnetworks, can also be posed as an optimization problem, as described in Section 2.\nFor Markov networks, \u03b8 = (. . . 
, \u03b8Xa , . . .) is a vector over the network parameters. Component \u03b8Xa\nis a parameter set for a (tabular) factor a, assigning a number \u03b8xa \u2265 0 for each instantiation xa of\nvariables Xa. The negative log-likelihood $-\\ell\\ell(\\theta \\mid D)$ for a Markov network is:\n\n$$-\\ell\\ell(\\theta \\mid D) = N \\log Z_\\theta - \\sum_{i=1}^{N} \\log Z_\\theta(d_i) \\qquad (6)$$\n\nwhere Z\u03b8 is the partition function, and where Z\u03b8(di) is the partition function after conditioning on\nexample di, under parameterization \u03b8. Sub-functions with respect to Equation 6 may not be convex,\nunlike the sub-functions for Bayesian networks. Consider instead the following objective function, which we\nshall subsequently relate to the negative log-likelihood:\n\n$$f(\\theta) = -\\sum_{i=1}^{N} \\log Z_\\theta(d_i), \\qquad (7)$$\n\nwith a feasibility constraint that the partition function Z\u03b8 equals some constant \u03b1. The following result\ntells us that it suffices to optimize Equation 7 under the given constraint, to optimize Equation 6.\nTheorem 3 Let \u03b1 be a positive constant, and let g(\u03b8) be a (feasibility) function satisfying $Z_{g(\\theta)} = \\alpha$\nand g(\u03b8xa ) \u221d \u03b8xa for all \u03b8xa.7 For every point \u03b8, if g(\u03b8) is optimal for Equation 7, subject to its\n7Here, g(\u03b8xa ) denotes the component of g(\u03b8) corresponding to \u03b8xa. Moreover, the function g(\u03b8) can be\nconstructed, e.g., by simply multiplying all entries of one parameter set by \u03b1/Z\u03b8. In our experiments, we\n\n5\n\n\fconstraint, then it is also optimal for Equation 6. Moreover, a point \u03b8 is stationary for Equation 6\niff the point g(\u03b8) is stationary for Equation 7, subject to its constraint.\n\nWith Equation 7 as a new (constrained) objective function for estimating the parameters of a Markov\nnetwork, we can now cast it in the terms of Section 2. 
We start by characterizing its sub-functions.\n\nTheorem 4 For a given parameter set \u03b8Xa, the objective function of Equation 7 has sub-functions:\n\n$$f_{\\theta^\\star}(\\theta_{X_a}) = -\\sum_{i=1}^{N} \\log \\sum_{x_a} C^i_{x_a} \\cdot \\theta_{x_a} \\quad \\text{subject to} \\quad \\sum_{x_a} C_{x_a} \\cdot \\theta_{x_a} = \\alpha \\qquad (8)$$\n\nwhere $C^i_{x_a}$ and $C_{x_a}$ are constants that are independent of the parameter set \u03b8Xa:\n\n$$C^i_{x_a} = Z_{\\theta^\\star}(x_a, d_i)/\\theta^\\star_{x_a} \\quad \\text{and} \\quad C_{x_a} = Z_{\\theta^\\star}(x_a)/\\theta^\\star_{x_a}.$$\n\nNote that computing these constants requires inference on a Markov network with parameters $\\theta^\\star$.8\nInterestingly, this sub-function is convex, and its constraint is now linear (hence convex), resulting in\na unique optimum, as in Bayesian networks. However, even when $\\theta^\\star$ is a feasible point, the unique\noptima of these sub-functions may not be feasible when combined. Thus, the feasibility function\ng(\u03b8) of Theorem 3 must be utilized in this case.\nWe now have another instance of the iterative algorithm proposed in Section 2, but for undirected\ngraphical models. That is, we have just derived an EDML algorithm for such models.\n\n5 EDML under Complete Data\n\nWe now consider how EDML simplifies under complete data for both Bayesian and Markov networks,\nidentifying forms and properties of the corresponding sub-functions under complete data.\nWe start with Bayesian networks. Consider a variable X, and a parent instantiation u; and let\nD#(xu) represent the number of examples that contain xu in the complete dataset D. Equation 2\nof Theorem 1 then reduces to: $f_{\\theta^\\star}(\\theta_{X|u}) = -\\sum_x D\\#(xu) \\log \\theta_{x|u} + C$, where C is a constant that\nis independent of parameter set \u03b8X|u. Assuming that $\\theta^\\star$ is feasible (i.e., each \u03b8X|u satisfies the sum-to-one\nconstraint), the unique optimum of this sub-function is $\\theta_{x|u} = \\frac{D\\#(xu)}{D\\#(u)}$, which is guaranteed\nto yield a feasible point \u03b8, globally. Hence, EDML produces the unique optimal estimates in its first\niteration and terminates immediately thereafter.\nThe situation is different, however, for Markov networks. Under a complete dataset D, Equation 8\nof Theorem 4 reduces to: $f_{\\theta^\\star}(\\theta_{X_a}) = -\\sum_{x_a} D\\#(x_a) \\log \\theta_{x_a} + C$, where C is a constant that\nis independent of parameter set \u03b8Xa. Assuming that $\\theta^\\star$ is feasible (i.e., satisfies $Z_{\\theta^\\star} = \\alpha$), the\nunique optimum of this sub-function has the closed form: $\\theta_{x_a} = \\frac{\\alpha}{N} \\frac{D\\#(x_a)}{C_{x_a}}$, which is equivalent\nto the unique optimum one would obtain in a sub-function for Equation 6 [15, 13]. Contrary to\nBayesian networks, the collection of these optima for different parameter sets does not necessarily\nyield a feasible point \u03b8. Hence, the feasibility function g of Theorem 3 must be applied here.\nThe resulting feasible point, however, may no longer be a stationary point for the corresponding\nsub-functions, leading EDML to iterate further. Hence, under complete data, EDML for Bayesian\nnetworks converges immediately, while EDML for Markov networks may take multiple iterations.\nBoth results are consistent with what is already known in the literature on parameter estimation\nfor Bayesian and Markov networks. The result on Bayesian networks is useful in confirming that\nEDML performs optimally in this case. The result for Markov networks, however, gives rise to a\nnew algorithm for parameter estimation under complete data. We evaluate the performance of this\nnew EDML algorithm after considering the following example.\nLet D be a complete dataset over three variables A, B and C, specified in terms of the number\nof times that each instantiation a, b, c appears in D. 
In particular, we have the following counts:\nnormalize each parameter set to sum-to-one, but then update the constant \u03b1 = Z\u03b8t for the subsequent iteration.\n\nD#(xa)\n\nCxa\n\n8Theorem 4 assumes that \u03b8(cid:63)\n\nxa (cid:54)= 0. In general, C i\n\nxa = \u2202Z\u03b8(cid:63) (di)\n\n\u2202\u03b8xa\n\n, and Cxa = \u2202Z\u03b8(cid:63)\n\u2202\u03b8xa\n\n. See also Footnote 3.\n\n6\n\n\fTable 1: Speed-up results of EDML over CG and L-BFGS\ni(cid:48)\n\nproblem\n\nzero\none\ntwo\nthree\nfour\n\ufb01ve\nsix\n\nseven\neight\nnine\n\n54.wcsp\n\nor-chain-42\nor-chain-45\nor-chain-147\nor-chain-148\nor-chain-225\n\nrbm20\nSeg2-17\nSeg7-11\n\nFamily2Dominant.1.5loci\nFamily2Recessive.15.5loci\n\ngrid10x10.f5.wrap\ngrid10x10.f10.wrap\n\naverage\n\n#vars\n256\n256\n256\n256\n256\n256\n256\n256\n256\n256\n67\n385\n715\n410\n463\n467\n40\n228\n235\n385\n385\n100\n100\n275.65\n\nicg\n45\n104\n46\n43\n56\n43\n48\n57\n48\n56\n\niedml\n105\n73\n154\n169\n126\n155\n150\n147\n155\n168\n107.33 160.33\n120.33\n27\n33.67\n151\n18.67\n107.67\n42.33\n122.67\n58\n181.33\n41 30.98\n9\n63\n1.77\n83.66\n1.86\n54.3\n84\n2.39\n117.33\n88\n89.7\n111.6\n1.31\n136.67\n239 17.36\n101.33\n83.89 101.29\n\n(S)\ntcg\n3.62\n3.90x\n8.25 13.26x\n3.73\n2.83x\n2.52x\n3.58\n4.31x\n4.59\n2.70x\n3.48\n3.13x\n3.93\n4.64\n3.37x\n2.84x\n3.82\n3.15x\n4.46\n6.56\n2.78x\n0.12 31.27x\n0.14 12.52x\n3.27 80.72x\n1.00 49.04x\n0.79 44.14x\n2.38x\n7.00x\n2.84x\n5.90x\n3.85x\n6.26x\n62.33 12.39 20.92x\n5.39 13.55x\n\nil-bfgs\n24\n58\n21\n52\n61\n49\n20\n23\n57\n45\n68.33\n110\n94.33\n105\n80\n137.67\n\nedml\n74\n42\n87\n169\n115\n155\n90\n89\n154\n141\n172\n54.33\n36.33\n58.33\n32\n69\n30 107.22\n64.67\n73.33\n78.33\n81.67\n142 180.33\n59\n94.89\n\n46.67\n48.66\n85.67\n86.33\n\n92.67\n66.84\n\n(S(cid:48))\ntl-bfgs\n1.64\n1.98x\n3.87\n8.08x\n1.54\n1.54x\n3.55\n1.93x\n3.90\n3.22x\n3.20\n1.90x\n1.47\n1.40x\n1.65\n1.62x\n3.83\n2.28x\n2.90\n1.94x\n1.80\n0.72x\n0.06\n6.43x\n0.06\n4.85x\n1.63 12.77x\n0.28 
14.24x\n0.33 10.76x\n0.99x\n30.18\n0.74\n4.14x\n2.32x\n1.27\n2.69x\n1.04\n2.18x\n0.74\n4.63x\n10.30\n5.94\n9.70x\n4.45x\n3.56\n\n$D\\#(a, b, c) = 4$, $D\\#(a, b, \\bar{c}) = 18$, $D\\#(a, \\bar{b}, c) = 2$, $D\\#(a, \\bar{b}, \\bar{c}) = 13$, $D\\#(\\bar{a}, b, c) = 1$,\n$D\\#(\\bar{a}, b, \\bar{c}) = 1$, $D\\#(\\bar{a}, \\bar{b}, c) = 42$, and $D\\#(\\bar{a}, \\bar{b}, \\bar{c}) = 19$. Suppose we want to learn, from\nthis dataset, a Markov network with 3 edges, (A, B), (B, C) and (A, C), with the corresponding\nparameter sets $\\theta_{AB}$, $\\theta_{BC}$ and $\\theta_{AC}$. If the initial set of parameters $\\theta^\\star = (\\theta^\\star_{AB}, \\theta^\\star_{BC}, \\theta^\\star_{AC})$ is uniform,\ni.e., $\\theta^\\star_{XY} = (1, 1, 1, 1)$, then Equation 8 gives the sub-function $f_{\\theta^\\star}(\\theta_{AB}) = -22 \\cdot \\log \\theta_{ab} - 15 \\cdot \\log \\theta_{a\\bar{b}} - 2 \\cdot \\log \\theta_{\\bar{a}b} - 61 \\cdot \\log \\theta_{\\bar{a}\\bar{b}}$. Moreover, we have $Z_{\\theta^\\star} = 2 \\cdot \\theta_{ab} + 2 \\cdot \\theta_{a\\bar{b}} + 2 \\cdot \\theta_{\\bar{a}b} + 2 \\cdot \\theta_{\\bar{a}\\bar{b}}$.\nMinimizing $f_{\\theta^\\star}(\\theta_{AB})$ under $Z_{\\theta^\\star} = \\alpha = 2$ corresponds to solving a convex optimization problem,\nwhich has the unique solution: $(\\theta_{ab}, \\theta_{a\\bar{b}}, \\theta_{\\bar{a}b}, \\theta_{\\bar{a}\\bar{b}}) = (\\frac{22}{100}, \\frac{15}{100}, \\frac{2}{100}, \\frac{61}{100})$. We solve similar convex\noptimization problems for the other parameter sets $\\theta_{BC}$ and $\\theta_{AC}$, to update estimates $\\theta^\\star$. We then\napply an appropriate feasibility function g (see Footnote 7), and repeat until convergence.\n\n6 Experimental Results\n\nWe now evaluate the efficiency of EDML, conjugate gradient (CG) and Limited-memory BFGS\n(L-BFGS), when learning Markov networks under complete data.9 We first learned grid-structured\npairwise MRFs from the CEDAR dataset of handwritten digits, which has 10 datasets (one for each\ndigit) of 16\u00d716 binary images. 
We also simulated datasets from networks used in the probabilistic\ninference evaluations of UAI-2008, 2010 and 2012, that are amenable to jointree inference.10 For\neach network, we simulated 3 datasets of size 210 examples each, and learned parameters using the\noriginal structure. Experiments were run on a 3.6GHz Intel i5 CPU with access to 8GB RAM.\nWe used the CG implementation in the Apache Commons Math library, and the L-BFGS implemen-\ntation in Mallet.11 Both are Java libraries, and our implementation of EDML is also in Java. More\nimportantly, all of the CG, L-BFGS, and EDML methods rely on the same underlying engine for\n\nbenchmarks, as it invokes inference many times more often than the methods we considered.\n\n9We also considered Iterative Proportional Fitting (IPF) as a baseline. However, IPF does not scale to our\n10Network 54.wcsp is a weighted CSP problem; or-chain-{42, 45, 147, 148, 225} are from the Pro-\nmedas suite; rbm-20 is a restricted Boltzmann machine; Seg2-17, Seg7-11 are from the Segmentation\nsuite;\nfamily2-recessive.15.5loci are genetic linkage analysis networks; and\ngrid10x10.f5.wrap, grid10x10.10.wrap are 10x10 grid networks.\n\nfamily2-dominant.1.5loci,\n\n11Available at http://commons.apache.org/ and http://mallet.cs.umass.edu/.\n\n7\n\n\fexact inference.12 For EDML, we damped parameter estimates at each iteration, which is typical for\nalgorithms like loopy belief propagation, which EDML was originally inspired by [5].13 We used\nBrent\u2019s method with default settings for line search in CG, which was the most ef\ufb01cient over all\nunivariate solvers in Apache\u2019s library, which we evaluated in initial experiments.\nWe \ufb01rst run CG until convergence (or after exceeding 30 minutes) to obtain parameter estimates of\nsome quality qcg (in log likelihood), recording the number of iterations icg and time tcg required\nin minutes. 
EDML is then run until it obtains an estimate of the same quality qcg, or better,\nrecording also the number of iterations iedml and time tedml in minutes. The time speed-up S\nof EDML over CG is computed as tcg/tedml. We also performed the same comparison with L-BFGS\ninstead of CG, recording the corresponding number of iterations (il-bfgs, i'edml) and time\ntaken (tl-bfgs, t'edml), giving us the speed-up of EDML over L-BFGS as S' = tl-bfgs/t'edml.\nTable 1 shows results for both sets of experiments. It shows the number of variables in each network\n(#vars), the average number of iterations taken by each algorithm, and the average speed-up\nachieved by EDML over CG (L-BFGS).14 On the given benchmarks, we see that on average EDML\nwas roughly 13.5\u00d7 faster than CG, and 4.5\u00d7 faster than L-BFGS. EDML was up to an order of\nmagnitude faster than L-BFGS in some cases. In many cases, EDML required more iterations but\nwas still faster in time. This is due in part to the number of times inference is invoked by CG and\nL-BFGS (in line search), whereas EDML only needs to invoke inference once per iteration.\n\n7 Related Work\n\nSince EDML is an iterative fixed-point algorithm, we can view it as a Jacobi-type method, where updates\nare performed in parallel [1]. Alternatively, a version of EDML using Gauss-Seidel iterations would\nupdate each parameter set in sequence using the most recently computed update. This leads to an\nalgorithm that monotonically improves the log likelihood at each update. In this case, we obtain a\ncoordinate descent algorithm, Iterative Proportional Fitting (IPF) [9], as a special case of EDML.\nThe notion of fixing all parameters, except for one, has been exploited before for the purposes of\noptimizing the log likelihood of a Markov network, as a heuristic for structure learning [15]. This\nnotion also underlies the IPF algorithm; see, e.g., [13], Section 19.5.7. 
In the case of complete data,\nthe resulting sub-function is convex, yet for incomplete data, it is not necessarily convex.\nOptimization methods such as conjugate gradient and L-BFGS [12] are more commonly used to\noptimize the parameters of a Markov network. For relational Markov networks or Markov networks\nthat otherwise assume a feature-based representation [8], evaluating the likelihood is typically\nintractable, in which case one instead optimizes the pseudo-log-likelihood [2]. For more on\nparameter estimation in Markov networks, see [10, 13].\n\n8 Conclusion\n\nIn this paper, we provided an abstract and simple view of the EDML algorithm, originally proposed\nfor parameter estimation in Bayesian networks, as a particular method for continuous optimization.\nOne immediate consequence of this view is that fixed points of EDML are stationary points\nof the log-likelihood, and vice versa [16]. A more interesting consequence is that it allows us to\npropose an EDML algorithm for a new class of models, Markov networks. Empirically, we find that\nEDML can more efficiently learn parameter estimates for Markov networks under complete data,\ncompared to conjugate gradient and L-BFGS, sometimes by an order of magnitude. The empirical\nevaluation of EDML for Markov networks under incomplete data is left for future work.\n\nAcknowledgments\n\nThis work has been partially supported by ONR grant #N00014-12-1-0423.\n\n12 For exact inference in Markov networks, we employed a jointree algorithm from the SAMIAM inference library, http://reasoning.cs.ucla.edu/samiam/.\n13 We start with an initial factor of 1/2, which we tighten as we iterate.\n14 For CG, we used a threshold based on relative change in the likelihood at $10^{-4}$. We used Mallet\u2019s default convergence threshold for L-BFGS.\n\nReferences\n[1] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. 
Prentice-Hall, 1989.\n[2] J. Besag. Statistical Analysis of Non-Lattice Data. The Statistician, 24:179\u2013195, 1975.\n[3] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[4] Hei Chan and Adnan Darwiche. On the revision of probabilistic beliefs using uncertain evidence. AIJ, 163:67\u201390, 2005.\n[5] Arthur Choi, Khaled S. Refaat, and Adnan Darwiche. EDML: A method for learning parameters in Bayesian networks. In UAI, 2011.\n[6] Adnan Darwiche. A differential approach to inference in Bayesian networks. JACM, 50(3):280\u2013305, 2003.\n[7] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1\u201338, 1977.\n[8] Pedro Domingos and Daniel Lowd. Markov Logic: An Interface Layer for Artificial Intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.\n[9] Radim Jirousek and Stanislav Preucil. On the effective implementation of the iterative proportional fitting procedure. Computational Statistics & Data Analysis, 19(2):177\u2013189, 1995.\n[10] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.\n[11] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191\u2013201, 1995.\n[12] D. C. Liu and J. Nocedal. On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming, 45(3):503\u2013528, 1989.\n[13] Kevin Patrick Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.\n[14] James Park and Adnan Darwiche. A differential semantics for jointree algorithms. AIJ, 156:197\u2013216, 2004.\n[15] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random fields. 
IEEE Trans. Pattern Anal. Mach. Intell., 19(4):380\u2013393, 1997.\n[16] Khaled S. Refaat, Arthur Choi, and Adnan Darwiche. New advances and theoretical insights into EDML. In UAI, pages 705\u2013714, 2012.\n", "award": [], "sourceid": 749, "authors": [{"given_name": "Khaled", "family_name": "Refaat", "institution": "UCLA"}, {"given_name": "Arthur", "family_name": "Choi", "institution": "UCLA"}, {"given_name": "Adnan", "family_name": "Darwiche", "institution": "UCLA"}]}