{"title": "Majorization for CRFs and Latent Likelihoods", "book": "Advances in Neural Information Processing Systems", "page_first": 557, "page_last": 565, "abstract": "The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods.", "full_text": "Majorization for CRFs and Latent Likelihoods\n\nTony Jebara\n\nDepartment of Computer Science\n\nColumbia University\n\nAnna Choromanska\n\nDepartment of Electrical Engineering\n\nColumbia University\n\njebara@cs.columbia.edu\n\naec2163@columbia.edu\n\nAbstract\n\nThe partition function plays a key role in probabilistic modeling including condi-\ntional random \ufb01elds, graphical models, and maximum likelihood estimation. To\noptimize partition functions, this article introduces a quadratic variational upper\nbound. This inequality facilitates majorization methods: optimization of com-\nplicated functions through the iterative solution of simpler sub-problems. Such\nbounds remain ef\ufb01cient to compute even when the partition function involves\na graphical model (with small tree-width) or in latent likelihood settings. For\nlarge-scale problems, low-rank versions of the bound are provided and outper-\nform LBFGS as well as \ufb01rst-order methods. 
Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods.\n\n1 Introduction\n\nThe estimation of probability density functions over sets of random variables is a central problem in learning. Estimation often requires minimizing the partition function as is the case in conditional random fields (CRFs) and log-linear models [1, 2]. Training these models was traditionally done via iterative scaling and bound-majorization methods [3, 4, 5, 6, 1] which achieved monotonic convergence. These approaches were later surpassed by faster first-order methods [7, 8, 9] and then second-order methods such as LBFGS [10, 11, 12]. This article revisits majorization and repairs its slow convergence by proposing a tighter bound on the log-partition function. The improved majorization outperforms state-of-the-art optimization tools and admits multiple versatile extensions.\n\nMany decomposition methods for conditional random fields and structured prediction have sought to render the learning and prediction problems more manageable [13, 14, 15]. Our decomposition, however, hinges on bounding and majorization: decomposing an optimization of complicated functions through the iterative solution of simpler sub-problems [16, 17]. A tighter bound provides convergent monotonic minimization while outperforming first- and second-order methods in practice (footnote 1). The bound applies to graphical models [18], latent variable situations [17, 19, 20, 21] as well as high-dimensional settings [10]. It also accommodates convex constraints on the parameter space.\n\nThis article is organized as follows. Section 2 presents the bound and Section 3 uses it for majorization in CRFs. Extensions to latent likelihood are shown in Section 4. The bound is extended to graphical models in Section 5 and high dimensional problems in Section 6. 
Section 7 provides experiments and Section 8 concludes. The Supplement contains proofs and additional results.\n\nFootnote 1: Recall that some second-order methods like Newton-Raphson are not monotonic and may even fail to converge for convex cost functions [4] unless, of course, line searches are used.\n\n2 Partition Function Bound\n\nConsider a log-linear density model over discrete y \u2208 \u2126\n\n$$p(y|\\theta) = \\frac{1}{Z(\\theta)} h(y) \\exp\\left(\\theta^\\top f(y)\\right)$$\n\nwhich is parametrized by a vector \u03b8 \u2208 R^d of dimensionality d \u2208 N. Here, f : \u2126 \u21a6 R^d is any vector-valued function mapping an input y to some arbitrary vector. The prior h : \u2126 \u21a6 R+ is a fixed non-negative measure. The partition function Z(\u03b8) is a scalar that ensures that p(y|\u03b8) normalizes, i.e. $Z(\\theta) = \\sum_y h(y) \\exp(\\theta^\\top f(y))$. Assume that the number of configurations of y is |\u2126| = n and is finite (footnote 2). The partition function is clearly log-convex in \u03b8 and a linear lower-bound is given via Jensen\u2019s inequality. This article contributes an analogous quadratic upper-bound on the partition function. Algorithm 1 computes (footnote 3) the bound\u2019s parameters and Theorem 1 shows the precise guarantee it provides.\n\nAlgorithm 1 ComputeBound\nInput: parameters $\\tilde{\\theta}$, f(y), h(y) for all y \u2208 \u2126\nInit: $z \\to 0^+$, $\\mu = 0$, $\\Sigma = zI$\nFor each y \u2208 \u2126 {\n  $\\alpha = h(y) \\exp(\\tilde{\\theta}^\\top f(y))$\n  $l = f(y) - \\mu$\n  $\\Sigma \\mathrel{+}= \\frac{\\tanh(\\frac{1}{2}\\log(\\alpha/z))}{2\\log(\\alpha/z)} \\, l l^\\top$\n  $\\mu \\mathrel{+}= \\frac{\\alpha}{z+\\alpha} \\, l$\n  $z \\mathrel{+}= \\alpha$\n}\nOutput: z, \u03bc, \u03a3\n\n[Figure inside Algorithm 1: log(Z) and bounds plotted against \u03b8.]\n\nTheorem 1 Algorithm 1 finds z, \u03bc, \u03a3 such that $z \\exp\\left(\\frac{1}{2}(\\theta - \\tilde{\\theta})^\\top \\Sigma (\\theta - \\tilde{\\theta}) + (\\theta - \\tilde{\\theta})^\\top \\mu\\right)$ upper-bounds $Z(\\theta) = \\sum_y h(y) \\exp(\\theta^\\top f(y))$ for any $\\theta, \\tilde{\\theta}, f(y) \\in \\mathbb{R}^d$ and h(y) \u2208 R+ for all y \u2208 \u2126.\n\nProof 1 (Sketch, See Supplement for Formal Proof) Recall the bound $\\log(e^{\\theta} + e^{-\\theta}) \\le c\\theta^2$ [22]. Obtain a multivariate variant $\\log(e^{\\theta^\\top 1} + e^{-\\theta^\\top 1})$. Tilt the bound to handle $\\log(h_1 e^{\\theta^\\top f_1} + h_2 e^{\\theta^\\top f_2})$. Add an additional exponential term to get $\\log(h_1 e^{\\theta^\\top f_1} + h_2 e^{\\theta^\\top f_2} + h_3 e^{\\theta^\\top f_3})$. Iterate the last step to extend to n elements in the summation.\n\nThe bound improves previous inequalities and its proof is in the Supplement. It tightens [4, 19] since it avoids wasteful curvature tests (it uses duality theory to compare the bound and the optimized function rather than compare their Hessians). It generalizes [22] which only holds for n = 2 and h(y) constant; it generalizes [23] which only handles a simplified one-dimensional case. The bound is computed using Algorithm 1 by iterating over the y variables (\u201cfor each y \u2208 \u2126\u201d) according to an arbitrary ordering via the bijective function \u03c0 : \u2126 \u21a6 {1, . . . , n} which defines i = \u03c0(y). The order in which we enumerate over \u2126 slightly varies the \u03a3 in the bound (but not the \u03bc and z) when |\u2126| > 2. However, we empirically investigated the influence of various orderings on bound performance (in all the experiments presented in Section 7) and noticed no significant effect across ordering schemes. Recall that choosing $\\Sigma = \\sum_y \\frac{1}{z} h(y) \\exp(\\tilde{\\theta}^\\top f(y)) (f(y) - \\mu)(f(y) - \\mu)^\\top$ with \u03bc and z as in Algorithm 1 yields the second-order Taylor approximation (the Hessian) of the log-partition function. Algorithm 1 replaces a sum of log-linear models with a single log-quadratic model which makes monotonic majorization straightforward. The figure inside Algorithm 1 depicts the bound on log Z(\u03b8) for various choices of $\\tilde{\\theta}$. If there are no constraints on the parameters (i.e. any \u03b8 \u2208 R^d is admissible), a simple closed-form iterative update rule emerges: $\\tilde{\\theta} \\leftarrow \\tilde{\\theta} - \\Sigma^{-1}\\mu$. Alternatively, if \u03b8 must satisfy linear (convex) constraints it is straightforward to compute an update by solving a quadratic (convex) program. This update rule is interleaved with the bound computation.\n\nFootnote 2: Here, assume n is enumerable. Later, for larger spaces use O(n) to denote the time to compute Z.\n\nFootnote 3: By continuity, take $\\tanh(\\frac{1}{2}\\log(1))/(2\\log(1)) = \\frac{1}{4}$ and $\\lim_{z \\to 0^+} \\tanh(\\frac{1}{2}\\log(\\alpha/z))/(2\\log(\\alpha/z)) = 0$.\n\n3 Conditional Random Fields and Log-Linear Models\n\nThe partition function arises naturally in maximum entropy estimation or minimum relative entropy estimation (cf. Supplement) as well as in conditional extensions of the maximum entropy paradigm where the model is conditioned on an observed input. Such models are known as conditional random fields and have been useful for structured prediction problems [1, 24]. CRFs are given a data-set $\\{(x_1, y_1), \\ldots, (x_t, y_t)\\}$ of independent identically-distributed (iid) input-output pairs where $y_j$ is the observed sample in a (discrete) space $\\Omega_j$ conditioned on the observed input $x_j$. A CRF defines a distribution over all $y \\in \\Omega_j$ (of which $y_j$ is a single element) as the log-linear model\n\n$$p(y|x_j, \\theta) = \\frac{1}{Z_{x_j}(\\theta)} h_{x_j}(y) \\exp(\\theta^\\top f_{x_j}(y))$$\n\nwhere $Z_{x_j}(\\theta) = \\sum_{y \\in \\Omega_j} h_{x_j}(y) \\exp(\\theta^\\top f_{x_j}(y))$. For the j\u2019th training pair, we are given a non-negative function $h_{x_j}(y) \\in \\mathbb{R}^+$ and a vector-valued function $f_{x_j}(y) \\in \\mathbb{R}^d$ defined over the domain $y \\in \\Omega_j$. In this section, for simplicity, assume $n = \\max_{j=1,\\ldots,t} |\\Omega_j|$. Each partition function $Z_{x_j}(\\theta)$ is a function of \u03b8. The parameter \u03b8 for CRFs is estimated by maximizing the regularized conditional log-likelihood (footnote 4) or log-posterior: $\\sum_{j=1}^{t} \\log p(y_j|x_j, \\theta) - \\frac{t\\lambda}{2}\\|\\theta\\|^2$ where \u03bb \u2208 R+ is a regularizer set using prior knowledge or cross-validation. Rewriting gives the objective of interest\n\n$$J(\\theta) = \\sum_{j=1}^{t} \\left[ \\log \\frac{h_{x_j}(y_j)}{Z_{x_j}(\\theta)} + \\theta^\\top f_{x_j}(y_j) \\right] - \\frac{t\\lambda}{2}\\|\\theta\\|^2. \\qquad (1)$$\n\nIf prior knowledge (or constraints) restrict the solution vector to a convex hull \u039b, the maximization problem becomes $\\arg\\max_{\\theta \\in \\Lambda} J(\\theta)$.\n\nAlgorithm 2 proposes a method for maximizing the regularized conditional likelihood J(\u03b8) or, equivalently, minimizing the partition function Z(\u03b8). 
It solves the problem in Equation 1 subject to convex constraints by interleaving the quadratic bound with a quadratic programming procedure. Theorem 2 establishes the convergence of the algorithm and the proof is in the Supplement.\n\nAlgorithm 2 ConstrainedMaximization\n0: Input $x_j$, $y_j$ and functions $h_{x_j}$, $f_{x_j}$ for j = 1, . . . , t, regularizer \u03bb \u2208 R+ and convex hull \u039b \u2286 R^d\n1: Initialize $\\theta_0$ anywhere inside \u039b and set $\\tilde{\\theta} = \\theta_0$\nWhile not converged\n2:   For j = 1, . . . , t\n       Get $\\mu_j$, $\\Sigma_j$ from $h_{x_j}$, $f_{x_j}$, $\\tilde{\\theta}$ via Algorithm 1\n3:   Set $\\tilde{\\theta} = \\arg\\min_{\\theta \\in \\Lambda} \\sum_j \\frac{1}{2}(\\theta - \\tilde{\\theta})^\\top (\\Sigma_j + \\lambda I)(\\theta - \\tilde{\\theta}) + \\sum_j \\theta^\\top (\\mu_j - f_{x_j}(y_j) + \\lambda \\tilde{\\theta})$\n4: Output $\\hat{\\theta} = \\tilde{\\theta}$\n\nTheorem 2 For any $\\theta_0 \\in \\Lambda$, all $\\|f_{x_j}(y)\\| \\le r$ and all $|\\Omega_j| \\le n$, Algorithm 2 outputs a $\\hat{\\theta}$ such that $J(\\hat{\\theta}) - J(\\theta_0) \\ge (1 - \\epsilon) \\max_{\\theta \\in \\Lambda} (J(\\theta) - J(\\theta_0))$ in no more than\n\n$$\\left\\lceil \\log\\left(\\frac{1}{\\epsilon}\\right) \\Big/ \\log\\left(1 + \\frac{\\lambda}{2r^2} \\left( \\sum_{i=1}^{n-1} \\frac{\\tanh(\\log(i)/2)}{\\log(i)} \\right)^{-1} \\right) \\right\\rceil$$\n\niterations.\n\nThe series $\\sum_{i=1}^{n-1} \\frac{\\tanh(\\log(i)/2)}{\\log(i)} = \\sum_{i=1}^{n-1} \\frac{i-1}{(i+1)\\log(i)}$ is the logarithmic integral which is $O\\left(\\frac{n}{\\log n}\\right)$ asymptotically [26]. The next sections show how to handle hidden variables in the learning problem, exploit graphical modeling, and further accelerate the underlying algorithms.\n\n4 Latent Conditional Likelihood\n\nSection 3 showed how the partition function is useful for maximum conditional likelihood problems involving CRFs. In this section, maximum conditional likelihood is extended to the setting where some variables are latent. 
Latent models may provide more flexibility than fully observable models [21, 27, 28]. For instance, hidden conditional random fields were shown to outperform generative hidden-state and discriminative fully-observable models [21].\n\nConsider the latent setting where we are given t iid samples $x_1, \\ldots, x_t$ from some unknown distribution $\\bar{p}(x)$ and t corresponding samples $y_1, \\ldots, y_t$ drawn from identical conditional distributions $\\bar{p}(y|x_1), \\ldots, \\bar{p}(y|x_t)$ respectively. Assume that the true generating distributions $\\bar{p}(x)$ and $\\bar{p}(y|x)$ are unknown. Therefore, we aim to estimate a conditional distribution p(y|x) from some set of hypotheses that achieves high conditional likelihood given the data-set $D = \\{(x_1, y_1), \\ldots, (x_t, y_t)\\}$. We will select this conditional distribution by assuming it emerges from a conditioned joint distribution over x and y as well as a hidden variable m which is being marginalized as $p(y|x, \\Theta) = \\frac{\\sum_m p(x, y, m|\\Theta)}{\\sum_{y,m} p(x, y, m|\\Theta)}$. Here $m \\in \\Omega_m$ represents a discrete hidden variable, $x \\in \\Omega_x$ is an input and $y \\in \\Omega_y$ is a discrete output variable. The parameter \u0398 contains all parameters that explore the function class of such conditional distributions. The latent likelihood of the data L(\u0398) = p(D|\u0398) subsumes Equation 1 and is the new objective of interest\n\n$$L(\\Theta) = \\prod_{j=1}^{t} p(y_j|x_j, \\Theta) = \\prod_{j=1}^{t} \\frac{\\sum_m p(x_j, y_j, m|\\Theta)}{\\sum_{y,m} p(x_j, y, m|\\Theta)}. \\qquad (2)$$\n\nA good choice of the parameters is one that achieves a large conditional likelihood value (or posterior) on the data set D. Next, assume that each p(x|y, m, \u0398) is an exponential family distribution\n\n$$p(x|y, m, \\Theta) = h(x) \\exp\\left(\\theta_{y,m}^\\top \\phi(x) - a(\\theta_{y,m})\\right)$$\n\nwhere each conditional is specified by a function $h : \\Omega_x \\mapsto \\mathbb{R}^+$ and a feature mapping $\\phi : \\Omega_x \\mapsto \\mathbb{R}^d$ which are then used to derive the normalizer $a : \\mathbb{R}^d \\mapsto \\mathbb{R}^+$. A parameter $\\theta_{y,m} \\in \\mathbb{R}^d$ selects a specific distribution. Multiply each exponential family term by an unknown marginal distribution called the mixing proportions $p(y, m|\\pi) = \\frac{\\pi_{y,m}}{\\sum_{y,m} \\pi_{y,m}}$. This is parametrized by an unknown parameter $\\pi = \\{\\pi_{y,m}\\} \\; \\forall y, m$ where $\\pi_{y,m} \\in [0, \\infty)$. Finally, the collection of all parameters is $\\Theta = \\{\\theta_{y,m}, \\pi_{y,m}\\} \\; \\forall y, m$. Thus, we have the complete likelihood $p(x, y, m|\\Theta) = \\frac{\\pi_{y,m} h(x)}{\\sum_{y,m} \\pi_{y,m}} \\exp\\left(\\theta_{y,m}^\\top \\phi(x) - a(\\theta_{y,m})\\right)$. Insert this expression into Equation 2 and remove constant factors that appear in both denominator and numerator. Apply the change of variables $\\exp(\\nu_{y,m}) = \\pi_{y,m} \\exp(-a(\\theta_{y,m}))$ and rewrite the objective as a function (footnote 5) of a vector \u03b8:\n\n$$L(\\Theta) = \\prod_{j=1}^{t} \\frac{\\sum_m \\exp\\left(\\theta_{y_j,m}^\\top \\phi(x_j) + \\nu_{y_j,m}\\right)}{\\sum_{y,m} \\exp\\left(\\theta_{y,m}^\\top \\phi(x_j) + \\nu_{y,m}\\right)} = \\prod_{j=1}^{t} \\frac{\\sum_m \\exp\\left(\\theta^\\top f_{j,y_j,m}\\right)}{\\sum_{y,m} \\exp\\left(\\theta^\\top f_{j,y,m}\\right)}.$$\n\nThe last equality emerges by rearranging all \u0398 parameters as a vector $\\theta \\in \\mathbb{R}^{|\\Omega_y||\\Omega_m|(d+1)}$ equal to $[\\theta_{1,1}^\\top \\; \\nu_{1,1} \\; \\theta_{1,2}^\\top \\; \\nu_{1,2} \\; \\cdots \\; \\theta_{|\\Omega_y|,|\\Omega_m|}^\\top \\; \\nu_{|\\Omega_y|,|\\Omega_m|}]^\\top$ and introducing $f_{j,\\hat{y},\\hat{m}} \\in \\mathbb{R}^{|\\Omega_y||\\Omega_m|(d+1)}$ defined as $[[\\phi(x_j)^\\top \\; 1]\\,\\delta[(\\hat{y},\\hat{m})=(1,1)] \\; \\cdots \\; [\\phi(x_j)^\\top \\; 1]\\,\\delta[(\\hat{y},\\hat{m})=(|\\Omega_y|,|\\Omega_m|)]]^\\top$ (thus the feature vector $[\\phi(x_j)^\\top \\; 1]^\\top$ is positioned appropriately in the longer $f_{j,\\hat{y},\\hat{m}}$ vector which is elsewhere zero). We will now find a variational lower bound on $L(\\theta) \\ge Q(\\theta, \\tilde{\\theta})$ which is tight when $\\theta = \\tilde{\\theta}$ such that $L(\\tilde{\\theta}) = Q(\\tilde{\\theta}, \\tilde{\\theta})$. We proceed by bounding each numerator and each denominator in the product over j = 1, . . . , t. Apply Jensen\u2019s inequality to lower bound each numerator term as\n\n$$\\sum_m \\exp\\left(\\theta^\\top f_{j,y_j,m}\\right) \\ge \\exp\\left(\\theta^\\top \\sum_m \\eta_{j,m} f_{j,y_j,m} - \\sum_m \\eta_{j,m} \\log \\eta_{j,m}\\right)$$\n\nwhere $\\eta_{j,m} = \\exp(\\tilde{\\theta}^\\top f_{j,y_j,m}) / \\sum_{m'} \\exp(\\tilde{\\theta}^\\top f_{j,y_j,m'})$. Algorithm 1 then bounds the denominator\n\n$$\\sum_{y,m} \\exp\\left(\\theta^\\top f_{j,y,m}\\right) \\le z_j \\exp\\left(\\frac{1}{2}(\\theta - \\tilde{\\theta})^\\top \\Sigma_j (\\theta - \\tilde{\\theta}) + (\\theta - \\tilde{\\theta})^\\top \\mu_j\\right).$$\n\nThe overall lower bound on the likelihood is then\n\n$$Q(\\theta, \\tilde{\\theta}) = L(\\tilde{\\theta}) \\exp\\left(-\\frac{1}{2}(\\theta - \\tilde{\\theta})^\\top \\tilde{\\Sigma} (\\theta - \\tilde{\\theta}) - (\\theta - \\tilde{\\theta})^\\top \\tilde{\\mu}\\right)$$\n\nwhere $\\tilde{\\Sigma} = \\sum_{j=1}^{t} \\Sigma_j$ and $\\tilde{\\mu} = \\sum_{j=1}^{t} \\left(\\mu_j - \\sum_m \\eta_{j,m} f_{j,y_j,m}\\right)$. The right hand side is simply an exponentiated quadratic function in \u03b8 which is easy to maximize. This yields an iterative scheme similar to Algorithm 2 for monotonically maximizing latent conditional likelihood.\n\nFootnote 4: Alternatively, variational Bayesian approaches can be used instead of maximum likelihood via expectation propagation (EP) or power EP [25]. These, however, assume Gaussian posterior distributions over parameters, require approximations, are computationally expensive and are not necessarily more efficient than BFGS.\n\nFootnote 5: It is now easy to regularize L(\u03b8) by adding $-\\frac{t\\lambda}{2}\\|\\theta\\|^2$.\n\n5 Graphical Models for Large n\n\nThe bounds in the previous sections are straightforward to compute when \u2126 is small. However, for graphical models, enumerating over \u2126 can be daunting. This section provides faster algorithms that recover the bound efficiently for graphical models of bounded tree-width. A graphical model represents the factorization of a probability density function. This article will consider the factor graph notation of a graphical model. A factor graph is a bipartite graph G = (V, W, E) with variable vertices V = {1, . . . , k}, factor vertices W = {1, . . . , m} and a set of edges E between V and W. In addition, define a set of random variables Y = {y1, . . . 
, yk} each associated with the elements of V and a set of non-negative scalar functions $\\Psi = \\{\\psi_1, \\ldots, \\psi_m\\}$ each associated with the elements of W. The factor graph implies that p(Y) factorizes as $p(y_1, \\ldots, y_k) = \\frac{1}{Z} \\prod_{c \\in W} \\psi_c(Y_c)$ where Z is a normalizing partition function (the dependence on parameters is suppressed here) and $Y_c$ is a subset of the random variables that are associated with the neighbors of node c. In other words, $Y_c = \\{y_i \\,|\\, i \\in \\mathrm{Ne}(c)\\}$ where Ne(c) is the set of vertices that are neighbors of c. Inference in graphical models requires the evaluation and the optimization of Z. These computations can be NP-hard in general yet are efficient when G satisfies certain properties (low tree-width). Consider a log-linear model (a function class) indexed by a parameter \u03b8 \u2208 \u039b in a convex hull \u039b \u2286 R^d as follows\n\n$$p(Y|\\theta) = \\frac{1}{Z(\\theta)} \\prod_{c \\in W} h_c(Y_c) \\exp\\left(\\theta^\\top f_c(Y_c)\\right)$$\n\nwhere $Z(\\theta) = \\sum_Y \\prod_{c \\in W} h_c(Y_c) \\exp\\left(\\theta^\\top f_c(Y_c)\\right)$. The model is defined by a set of vector-valued functions $f_c(Y_c) \\in \\mathbb{R}^d$ and scalar-valued functions $h_c(Y_c) \\in \\mathbb{R}^+$. Choosing a function from the function class hinges on estimating \u03b8 by optimizing Z(\u03b8). However, Algorithm 1 may be inapplicable due to the large number of configurations in Y. Instead, consider a more efficient surrogate algorithm which computes the same bound parameters by efficiently exploiting the factorization of the graphical model. This is possible since exponentiated quadratics are closed under multiplication and the required bound computations distribute nicely across decomposable graphical models.\n\nAlgorithm 3 JunctionTreeBound\nInput: Reverse-topological tree T with c = 1, . . . , m factors $h_c(Y_c) \\exp(\\tilde{\\theta}^\\top f_c(Y_c))$ and $\\tilde{\\theta} \\in \\mathbb{R}^d$\nFor c = 1, . . . , m\n  If (c < m) { $Y_{both} = Y_c \\cap Y_{pa(c)}$, $Y_{solo} = Y_c \\setminus Y_{pa(c)}$ }\n  Else { $Y_{both} = \\{\\}$, $Y_{solo} = Y_c$ }\n  For each $u \\in Y_{both}$\n  { Initialize $z_{c|u} \\leftarrow 0^+$, $\\mu_{c|u} = 0$, $\\Sigma_{c|u} = z_{c|u} I$\n    For each $v \\in Y_{solo}$\n    { $w = u \\otimes v$;\n      $\\alpha_w = h_c(w) e^{\\tilde{\\theta}^\\top f_c(w)} \\prod_{b \\in ch(c)} z_{b|w}$;\n      $l_w = f_c(w) - \\mu_{c|u} + \\sum_{b \\in ch(c)} \\mu_{b|w}$;\n      $\\Sigma_{c|u} \\mathrel{+}= \\sum_{b \\in ch(c)} \\Sigma_{b|w} + \\frac{\\tanh(\\frac{1}{2}\\log(\\alpha_w / z_{c|u}))}{2\\log(\\alpha_w / z_{c|u})} \\, l_w l_w^\\top$;\n      $\\mu_{c|u} \\mathrel{+}= \\frac{\\alpha_w}{z_{c|u} + \\alpha_w} \\, l_w$;\n      $z_{c|u} \\mathrel{+}= \\alpha_w$; } }\nOutput: Bound as $z = z_m$, $\\mu = \\mu_m$, $\\Sigma = \\Sigma_m$\n\nBegin by assuming that the graphical model in question is a junction tree and satisfies the running intersection property [18]. In Algorithm 3 (the Supplement provides a proof of its correctness), take ch(c) to be the set of children-cliques of clique c and pa(c) to be the parent of c. Note that the algorithm enumerates over $u \\in Y_{pa(c)} \\cap Y_c$ and $v \\in Y_c \\setminus Y_{pa(c)}$. The algorithm stores a quadratic bound for each configuration of u (where u is the set of variables in common across both clique c and its parent). It then forms the bound by summing over $v \\in Y_c \\setminus Y_{pa(c)}$, each configuration of each variable a clique c has that is not shared with its parent clique. The algorithm also collects precomputed bounds from children of c. Also define $w = u \\otimes v \\in Y_c$ as the conjunction of both indexing variables u and v. Thus, the two inner for loops enumerate over all configurations $w \\in Y_c$ of each clique. Note that w is used to query the children b \u2208 ch(c) of a clique c to report their bound parameters $z_{b|w}$, $\\mu_{b|w}$, $\\Sigma_{b|w}$. This is done for each configuration w of the clique c. 
Note, however, that not every variable in clique c is present in each child b so only the variables in w that intersect $Y_b$ are relevant in indexing the parameters $z_{b|w}$, $\\mu_{b|w}$, $\\Sigma_{b|w}$ and the remaining variables do not change the values of $z_{b|w}$, $\\mu_{b|w}$, $\\Sigma_{b|w}$.\n\nAlgorithm 3 is efficient in the sense that computations involve enumerating over all configurations of each clique in the junction tree rather than over all configurations of Y. This shows that the computation involved is $O(\\sum_c |Y_c|)$ rather than O(|\u2126|) as in Algorithm 1. Thus, for estimating the computational efficiency of various algorithms in this article, take $n = \\sum_c |Y_c|$ for the graphical model case rather than n = |\u2126|. Algorithm 3 is a simple extension of the known recursions that are used to compute the partition function and its gradient vector. Thus, in addition to the \u03a3 matrix which represents the curvature of the bound, Algorithm 3 is recovering the partition function value z and the gradient since $\\mu = \\left.\\frac{\\partial \\log Z(\\theta)}{\\partial \\theta}\\right|_{\\theta = \\tilde{\\theta}}$.\n\n6 Low-Rank Bounds for Large d\n\nIn many realistic situations, the dimensionality d is large and this prevents the storage and inversion of the matrix \u03a3. We next present a low-rank extension that can be applied to any of the algorithms presented so far. As an example, consider Algorithm 4 which is a low-rank incarnation of Algorithm 2. Each iteration of Algorithm 2 requires $O(tnd^2 + d^3)$ time since step 2 computes several $\\Sigma_j \\in \\mathbb{R}^{d \\times d}$ matrices and step 3 performs inversion. Instead, the new algorithm provides a low-rank version of the bound which still majorizes the log-partition function but requires only $\\tilde{O}(tnd)$ complexity (putting it on par with LBFGS). 
First, note that step 3 in Algo-\n\n\u2202\u03b8\n\n.\n\nAlgorithm 4 LowRankBound\nInput Parameter \u02dc\u03b8, regularizer \u03bb \u2208 R+, model ft(y) \u2208 Rd and ht(y) \u2208 R+ and rank k \u2208 N\nInitialize S = 0 \u2208 Rk\u00d7k, V = orthonormal \u2208 Rk\u00d7d, D = t\u03bbI \u2208 diag(Rd\u00d7d)\nFor each t { Set z \u2192 0+; \u00b5 = 0;\n\nFor each y{\n\n\u221a\n\ntanh( 1\n\n\u22a4\n\n\u22a4\n\n} }\n\nft(y); r =\n\n\u03b1\nz ))\n\n\u221a\nV(i,\u00b7); r = r \u2212 p(i)V(i,\u00b7);\n\n(ft(y) \u2212 \u00b5) ;\n\n2 log(\n\u03b1\nz )\n\n2 log(\n\nelse\n\u00b5 + = \u03b1\n\n{ D = D + 1\n\n\u22a4|r|diag(|r|); } }\n\n\u22121u where u = t\u03bb \u02dc\u03b8 +\n\nz+\u03b1 (ft(y) \u2212 \u00b5); z + = \u03b1;\n\nAQ = svd(S); S \u2190 A; V \u2190 QV;\n\n\u22a4\n\u03b1 = ht(y)e \u02dc\u03b8\nFor i = 1, . . . , k : p(i) = r\nFor i = 1, . . . , k : For j = 1, . . . , k : S(i, j) = S(i, j) + p(i)p(j);\n\u22a4\nQ\ns = [S(1, 1), . . . , S(k, k),\u2225r\u22252]\n\u22a4\n; \u02dck = arg mini=1,...,k+1 s(i);\nif (\u02dck \u2264 k) { D = D + S(\u02dck, \u02dck)1\n\u22a4|V(j,\u00b7)| diag(|V(k,\u00b7)|);\nS(\u02dck, \u02dck) = \u2225r\u22252; r = \u2225r\u2225\u22121r; V(k,\u00b7) = r; }\n\u2211\nOutput S \u2208 diag(Rk\u00d7k), V \u2208 Rk\u00d7d, D \u2208 diag(Rd\u00d7d)\nj=1 \u00b5j \u2212 fxj (yj). Clearly, Al-\nrithm 2 can be written as \u02dc\u03b8 = \u02dc\u03b8 \u2212 (cid:6)\ngorithm 1 can recover u by only computing \u00b5j for j = 1, . . . , t and skipping all steps involving\nmatrices. This merely requires O(tnd) work. Second, we store (cid:6) using a low-rank represen-\nSV + D where V \u2208 Rk\u00d7d is orthonormal, S \u2208 Rk\u00d7k is positive semi-de\ufb01nite, and\ntation V\nD \u2208 Rd\u00d7d is non-negative diagonal. 
Rather than increment the matrix by a rank one update of the form $\\Sigma_i = \\Sigma_{i-1} + r_i r_i^\\top$ where $r_i = \\sqrt{\\frac{\\tanh(\\frac{1}{2}\\log(\\alpha/z))}{2\\log(\\alpha/z)}} \\, (f_i - \\mu_i)$, simply project $r_i$ onto each eigenvector in V and update matrix S and V via a singular value decomposition ($O(k^3)$ work). After removing k such projections, the remaining residual from $r_i$ forms a new eigenvector $e_{k+1}$ and its magnitude forms a new singular value. The resulting rank (k + 1) system is orthonormal with (k + 1) singular values. We discard its smallest singular value and corresponding eigenvector to revert back to an order k eigensystem. However, instead of merely discarding we can absorb the smallest singular value and eigenvector into the D component by bounding the remaining outer-product with a diagonal term. This provides a guaranteed overall upper bound in $\\tilde{O}(tnd)$ (k is assumed to be logarithmic in the dimension d). Finally, to invert \u03a3, we apply the Woodbury formula: $\\Sigma^{-1} = D^{-1} - D^{-1} V^\\top (S^{-1} + V D^{-1} V^\\top)^{-1} V D^{-1}$ which only requires $O(k^3)$ work. A proof of correctness for Algorithm 4 can be found in the Supplement.\n\n7 Experiments\n\nWe first focus on the logistic regression task and compare the performance of the bound (using the low-rank Algorithm 2) with first-order and second order methods such as LBFGS, conjugate gradient (CG) and steepest descent (SD). We use 4 benchmark data-sets: the SRBCT and Tumors data-sets from [29] as well as the Text and SecStr data-sets from http://olivier.chapelle.cc/ssl-book/benchmarks.html. For all experiments in this section, the setup is as follows. Each data-set is split into training (90%) and testing (10%) parts. All implementations are run on the same hardware with C++ code. 
The termination criterion for all algorithms is a change in estimated parameter or function values smaller than $10^{-6}$ (with a ceiling on the number of iterations of $10^6$). Results are averaged over 10 random initializations close to 0. The regularization parameter \u03bb, when used, was chosen through cross-validation. In Table 1 we report times in seconds and the number of iterations for each algorithm (including LBFGS) to achieve the LBFGS termination solution modulo a small constant \u03f5 (set to $10^{-4}$). Table 1 also provides data-set sizes and regularization values. The first 4 columns in Table 1 provide results for this experiment.\n\nData-set | SRBCT | Tumors | Text | SecStr | CoNLL | PennTree\nSize | n = 4, t = 83, d = 9236 | n = 26, t = 308, d = 390260 | n = 2, t = 1500, d = 23922 | n = 2, t = 83679, d = 632 | m = 9, t = 1000, d = 33615 | m = 45, t = 1000, d = 14175\nRegularizer | \u03bb = 10^1 | \u03bb = 10^1 | \u03bb = 10^2 | \u03bb = 10^1 | \u03bb = 10^1 | \u03bb = 10^1\nAlgorithm | time, iter | time, iter | time, iter | time, iter | time, iter | time, iter\nLBFGS | 6.10, 42 | 3246.83, 8 | 15.54, 7 | 881.31, 47 | 25661.54, 17 | 62848.08, 7\nSD | 7.27, 43 | 18749.15, 53 | 153.10, 69 | 1490.51, 79 | 93821.72, 12 | 156319.31, 12\nCG | 40.61, 100 | 14840.66, 42 | 57.30, 23 | 667.67, 36 | 88973.93, 23 | 76332.39, 18\nBound | 3.67, 8 | 1639.93, 4 | 6.18, 3 | 27.97, 9 | 16445.93, 4 | 27073.42, 2\n\nTable 1: Time in seconds and iterations required to obtain within \u03f5 of the LBFGS solution (where $\\epsilon = 10^{-4}$) for logistic regression problems (on SRBCT, Tumors, Text and SecStr data-sets where n is the number of classes) and Markov CRF problems (on CoNLL and PennTree data-sets, where m is the number of classes). 
Here, t is the total number of samples (training and testing), d is the dimensionality of the feature vector and \u03bb is the cross-validated regularization setting.\n\nStructured prediction problems are explored using two popular data-sets. The first one contains Spanish news wire articles from a session of the CoNLL 2002 conference. This corpus involves a named entity recognition problem and consists of sentences where each word is annotated with one of m = 9 possible labels. The second task is from the PennTree Bank. This corpus involves a tagging problem and consists of sentences where each word is labeled with one of m = 45 possible parts-of-speech. A conditional random field is estimated with a Markov chain structure to give word labels a sequential dependence. The features describing the words are constructed as in [30]. The last two columns of Table 1 provide results for this experiment. We used the low-rank version of Algorithm 3. In both experiments, the bound always remained fastest as indicated in bold.\n\nFigure 1: Classification boundaries using the bound and EM for a toy latent likelihood problem.\n\nWe next performed experiments with maximum latent conditional likelihood problems. We denote by m the number of hidden variables. Due to the non-concavity of this objective, we are most interested in finding good local maxima. We start with a simple toy experiment from [19] comparing the bound to the expectation-maximization (EM) algorithm in the binary classification problem presented in the left image of Figure 1. The model incorrectly uses only 2 Gaussians per class while the data is generated using 8 Gaussians total. In Figure 1 we show the decision boundary obtained using the bound (with m = 2) and EM. EM performs no better than random guessing while the bound classifies the data very well. 
The average test log-likelihood obtained by EM was -1.5e+06 while the bound obtained -21.8.

[Figure 1 panels: Data-set, Bound, EM.]

We next compared the algorithms (the bound, Newton-Raphson, BFGS, CG and SD) in maximum latent conditional likelihood problems on five benchmark data-sets. These included four UCI data-sets$^6$ (ion, bupa, hepatitis and wine) and the previously used SRBCT data-set. The feature mapping used was $\phi(x) = x \in \mathbb{R}^d$ which corresponds to a mixture of Gaussian-gated logistic regressions (obtained by conditioning a mixture of $m$ Gaussians per class). We used a value of $\lambda = 0$ throughout the latent experiments. We explored setting $m \in \{1, 2, 3, 4\}$. Table 2 shows the testing latent log-likelihood at convergence for $m$ chosen through cross-validation (the Supplement contains a more complete table). In bold, we show the algorithm that obtained the highest testing log-likelihood. The bound is the best performer overall and finds better solutions in less time. Figure 2 depicts the convergence on ion, hepatitis and SRBCT data sets.

Data-set     ion       bupa      hepatitis   wine      SRBCT
Algorithm    m = 3     m = 2     m = 2       m = 3     m = 4
BFGS         -5.88     -21.78    -5.28       -1.79     -6.06
SD           -5.56     -21.74    -5.14       -1.37     -5.61
CG           -5.57     -21.81    -4.84       -0.95     -5.76
Newton       -5.95     -21.85    -5.50       -0.71     -5.54
Bound        -4.18     -19.95    -4.40       -0.48     -0.11

Table 2: Test log-likelihood at convergence for ion, bupa, hepatitis, wine and SRBCT data-sets.

Figure 2: Convergence of test latent log-likelihood on ion, hepatitis and SRBCT data-sets.

8 Discussion
A simple quadratic upper bound for the partition function of log-linear models was proposed and makes majorization approaches competitive with state-of-the-art first- and second-order optimization methods. 
The bound is efficiently recoverable for graphical models and admits low-rank variants for high-dimensional data. It allows faster and monotonically convergent majorization in CRF learning and maximum latent conditional likelihood problems (where it also finds better local maxima). Future work will explore intractable partition functions where likelihood evaluation is hard but bound maximization may remain feasible. Furthermore, the majorization approach will be applied in stochastic [31] and distributed optimization settings.

[Figure 2 panels: ion, hepatitis, SRBCT; horizontal axis log(Time) [sec], vertical axis −log(J(θ)); curves: Bound, Newton, BFGS, Conjugate gradient, Steepest descent.]

$^6$ Downloaded from http://archive.ics.uci.edu/ml/

Acknowledgments
The authors thank A. Smola, M. Collins, D. Kanevsky and the referees for valuable feedback.

References
[1] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[2] A. Globerson, T. Koo, X. Carreras, and M. Collins. Exponentiated gradient algorithms for log-linear structured prediction. In ICML, 2007.
[3] J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Math. Stat., 43:1470–1480, 1972.
[4] D. Bohning and B. Lindsay. Monotonicity of quadratic approximation algorithms. Ann. Inst. Statist. Math., 40:641–663, 1988.
[5] A. Berger. The improved iterative scaling algorithm: A gentle introduction. Technical report, 1997.
[6] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE PAMI, 19(4), 1997.
[7] R. Malouf. 
A comparison of algorithms for maximum entropy parameter estimation. In CoNLL, 2002.
[8] H. Wallach. Efficient training of conditional random fields. Master’s thesis, University of Edinburgh, 2002.
[9] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL, 2003.
[10] C. Zhu, R. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. TOMS, 23(4), 1997.
[11] S. Benson and J. More. A limited memory variable metric method for bound constrained optimization. Technical report, Argonne National Laboratory, 2001.
[12] G. Andrew and J. Gao. Scalable training of ℓ1-regularized log-linear models. In ICML, 2007.
[13] D. Roth. Integer linear programming inference for conditional random fields. In ICML, 2005.
[14] Y. Mao and G. Lebanon. Generalized isotonic conditional random fields. Machine Learning, 77:225–248, 2009.
[15] C. Sutton and A. McCallum. Piecewise training for structured prediction. Machine Learning, 77:165–194, 2009.
[16] J. De Leeuw and W. Heiser. Convergence of correction matrix algorithms for multidimensional scaling, chapter Geometric representations of relational data. 1977.
[17] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. of the Royal Stat. Soc., B-39, 1977.
[18] M. Wainwright and M. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[19] T. Jebara and A. Pentland. On reversing Jensen’s inequality. In NIPS, 2000.
[20] J. Salojarvi, K. Puolamaki, and S. Kaski. Expectation maximization algorithms for conditional likelihoods. In ICML, 2005.
[21] A. Quattoni, S. Wang, L. P. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. IEEE PAMI, 29(10):1848–1852, October 2007.
[22] T. 
Jaakkola and M. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.
[23] G. Bouchard. Efficient bounds for the softmax and applications to approximate inference in hybrid models. In NIPS AIHM Workshop, 2007.
[24] B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. In NIPS, 2004.
[25] Y. Qi, M. Szummer, and T. P. Minka. Bayesian conditional random fields. In AISTATS, 2005.
[26] T. Bromwich and T. MacRobert. An Introduction to the Theory of Infinite Series. Chelsea, 1991.
[27] S. B. Wang, A. Quattoni, L.-P. Morency, and D. Demirdjian. Hidden conditional random fields for gesture recognition. In CVPR, 2006.
[28] Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In CVPR, pages 872–879. IEEE, 2009.
[29] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization for Machine Learning, chapter Convex optimization with sparsity-inducing norms. MIT Press, 2011.
[30] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In ICML, 2003.
[31] S. V. N. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In ICML, 2006.
[32] T. Jebara. Multitask sparsity via maximum entropy discrimination. JMLR, 12:75–110, 2011.

Majorization for CRFs and Latent Likelihoods
(Supplementary Material)

Tony Jebara, Department of Computer Science, Columbia University, jebara@cs.columbia.edu
Anna Choromanska, Department of Electrical Engineering, Columbia University, aec2163@columbia.edu

Abstract
This supplement presents additional details in support of the full article. 
These include the application of the majorization method to maximum entropy problems. It also contains proofs of the various theorems, in particular, a guarantee that the bound majorizes the partition function. In addition, a proof is provided guaranteeing convergence on (non-latent) maximum conditional likelihood problems. The supplement also contains supporting lemmas that show the bound remains applicable in constrained optimization problems. The supplement then proves the soundness of the junction tree implementation of the bound for graphical models with large $n$. It also proves the soundness of the low-rank implementation of the bound for problems with large $d$. Finally, the supplement contains additional experiments and figures to provide further empirical support for the majorization methodology.

Supplement for Section 2
Proof of Theorem 1 Rewrite the partition function as a sum over the integer index $j = 1, \ldots, n$ under the random ordering $\pi : \Omega \mapsto \{1, \ldots, n\}$. This defines $j = \pi(y)$ and associates $h$ and $f$ with $h_j = h(\pi^{-1}(j))$ and $f_j = f(\pi^{-1}(j))$. Next, write $Z(\theta) = \sum_{j=1}^{n} \alpha_j \exp(\lambda^\top f_j)$ by introducing $\lambda = \theta - \tilde\theta$ and $\alpha_j = h_j \exp(\tilde\theta^\top f_j)$. Define the partition function over only the first $i$ components as $Z_i(\theta) = \sum_{j=1}^{i} \alpha_j \exp(\lambda^\top f_j)$. When $i = 0$, a trivial quadratic upper bound holds
$$Z_0(\theta) \le z_0 \exp\left(\tfrac{1}{2} \lambda^\top \Sigma_0 \lambda + \lambda^\top \mu_0\right)$$
with the parameters $z_0 \to 0^+$, $\mu_0 = 0$, and $\Sigma_0 = z_0 I$. Next, add one term to the current partition function $Z_1(\theta) = Z_0(\theta) + \alpha_1 \exp(\lambda^\top f_1)$. 
Apply the current bound $Z_0(\theta)$ to obtain
$$Z_1(\theta) \le z_0 \exp\left(\tfrac{1}{2} \lambda^\top \Sigma_0 \lambda + \lambda^\top \mu_0\right) + \alpha_1 \exp(\lambda^\top f_1).$$
Consider the following change of variables
$$u = \Sigma_0^{1/2} \lambda - \Sigma_0^{-1/2} (f_1 - \mu_0), \qquad \gamma = \frac{\alpha_1}{z_0} \exp\left(\tfrac{1}{2} (f_1 - \mu_0)^\top \Sigma_0^{-1} (f_1 - \mu_0)\right)$$
and rewrite the logarithm of the bound as
$$\log Z_1(\theta) \le \log z_0 - \tfrac{1}{2} (f_1 - \mu_0)^\top \Sigma_0^{-1} (f_1 - \mu_0) + \lambda^\top f_1 + \log\left(\exp\left(\tfrac{1}{2}\|u\|^2\right) + \gamma\right).$$
Apply Lemma 1 (cf. [32] p. 100) to the last term to get
$$\log Z_1(\theta) \le \log z_0 - \tfrac{1}{2} (f_1 - \mu_0)^\top \Sigma_0^{-1} (f_1 - \mu_0) + \lambda^\top f_1 + \log\left(\exp\left(\tfrac{1}{2}\|v\|^2\right) + \gamma\right) + \frac{v^\top (u - v)}{1 + \gamma \exp(-\tfrac{1}{2}\|v\|^2)} + \tfrac{1}{2} (u - v)^\top \left(I + \Gamma v v^\top\right) (u - v).$$
The bound in [32] is tight when $u = v$. 
To achieve tightness when $\theta = \tilde\theta$ or, equivalently, $\lambda = 0$, we choose $v = \Sigma_0^{-1/2} (\mu_0 - f_1)$; here
$$\Gamma = \frac{\tanh\left(\tfrac{1}{2} \log(\gamma \exp(-\tfrac{1}{2}\|v\|^2))\right)}{2 \log(\gamma \exp(-\tfrac{1}{2}\|v\|^2))}$$
is the scalar appearing in Lemma 1. This choice yields
$$Z_1(\theta) \le z_1 \exp\left(\tfrac{1}{2} \lambda^\top \Sigma_1 \lambda + \lambda^\top \mu_1\right)$$
where we have
$$z_1 = z_0 + \alpha_1, \qquad \mu_1 = \frac{z_0}{z_0 + \alpha_1} \mu_0 + \frac{\alpha_1}{z_0 + \alpha_1} f_1, \qquad \Sigma_1 = \Sigma_0 + \frac{\tanh\left(\tfrac{1}{2} \log(\alpha_1 / z_0)\right)}{2 \log(\alpha_1 / z_0)} (\mu_0 - f_1)(\mu_0 - f_1)^\top.$$
This rule updates the bound parameters $z_0, \mu_0, \Sigma_0$ to incorporate an extra term in the sum over $i$ in $Z(\theta)$. The process is iterated $n$ times (replacing 1 with $i$ and 0 with $i - 1$) to produce an overall bound on all terms.

Lemma 1 (See [32] p. 100)
For all $u \in \mathbb{R}^d$, any $v \in \mathbb{R}^d$ and any $\gamma \ge 0$, the bound
$$\log\left(\exp\left(\tfrac{1}{2}\|u\|^2\right) + \gamma\right) \le \log\left(\exp\left(\tfrac{1}{2}\|v\|^2\right) + \gamma\right) + \frac{v^\top (u - v)}{1 + \gamma \exp(-\tfrac{1}{2}\|v\|^2)} + \tfrac{1}{2} (u - v)^\top \left(I + \Gamma v v^\top\right) (u - v)$$
holds when the scalar term $\Gamma = \frac{\tanh\left(\frac{1}{2} \log(\gamma \exp(-\|v\|^2/2))\right)}{2 \log(\gamma \exp(-\|v\|^2/2))}$. Equality is achieved when $u = v$.

Proof of Lemma 1 The proof is provided in [32].

Supplement for Section 3
Maximum entropy problem We show here that partition functions arise naturally in maximum entropy estimation or minimum relative entropy $RE(p\|h) = \sum_y p(y) \log \frac{p(y)}{h(y)}$ estimation. Consider the following problem:
$$\min_p RE(p\|h) \quad \text{s.t.} \quad \sum_y p(y) f(y) = 0, \quad \sum_y p(y) g(y) \ge 0.$$
Here, assume that $f : \Omega \mapsto \mathbb{R}^d$ and $g : \Omega \mapsto \mathbb{R}^{d'}$ are vector-valued functions over the sample space. 
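As a numerical illustration of the update rule from the proof of Theorem 1, the following Python sketch accumulates $z$, $\mu$ and $\Sigma$ over the terms of a small partition function and makes the resulting quadratic bound available for evaluation at arbitrary parameter values. It is only a sketch under stated assumptions: the names `log_partition`, `quadratic_bound` and `log_bound` are hypothetical, NumPy is assumed to be available, the sample space is assumed small enough to enumerate, and the initialization $z_0 \to 0^+$ is taken in the limit, so that the first term gives $z_1 = \alpha_1$, $\mu_1 = f_1$ and no curvature contribution.

```python
import numpy as np


def log_partition(theta, h, F):
    """log Z(theta) for Z(theta) = sum_j h[j] * exp(theta^T F[j])."""
    a = np.log(h) + F @ theta
    m = a.max()
    return m + np.log(np.exp(a - m).sum())


def quadratic_bound(theta_tilde, h, F):
    """Accumulate (z, mu, Sigma) term by term, following the update rule above.

    h is an (n,) array of positive weights and F an (n, d) array of feature
    vectors. Assumption: z_0 -> 0+ is taken in the limit, so the first term
    sets z = alpha_1, mu = f_1 and leaves Sigma at zero.
    """
    n, d = F.shape
    alpha = h * np.exp(F @ theta_tilde)
    z = alpha[0]
    mu = F[0].astype(float).copy()
    Sigma = np.zeros((d, d))
    for j in range(1, n):
        l = F[j] - mu
        t = np.log(alpha[j] / z)
        # tanh(t / 2) / (2 t) tends to 1/4 as t -> 0
        coef = 0.25 if abs(t) < 1e-8 else np.tanh(0.5 * t) / (2.0 * t)
        Sigma += coef * np.outer(l, l)
        mu += alpha[j] / (z + alpha[j]) * l
        z += alpha[j]
    return z, mu, Sigma


def log_bound(theta, theta_tilde, z, mu, Sigma):
    """log of z * exp(0.5 * lam^T Sigma lam + lam^T mu), with lam = theta - theta_tilde."""
    lam = theta - theta_tilde
    return np.log(z) + 0.5 * lam @ Sigma @ lam + lam @ mu


# Hypothetical toy problem: n = 6 configurations, d = 3 features.
rng = np.random.default_rng(0)
n, d = 6, 3
h = rng.uniform(0.5, 2.0, size=n)
F = rng.normal(size=(n, d))
theta_tilde = rng.normal(size=d)
z, mu, Sigma = quadratic_bound(theta_tilde, h, F)
```

In this sketch `z` coincides with $Z(\tilde\theta)$ and `mu` with the $\alpha$-weighted mean of the feature vectors, which is why the bound touches the partition function at $\tilde\theta$; comparing `log_partition(theta, h, F)` with `log_bound(theta, theta_tilde, z, mu, Sigma)` at other values of `theta` exercises the inequality.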
For arbitrary (non-constant) $f$ and $g$, the solution distribution $p(y) = h(y) \exp\left(\theta^\top f(y) + \vartheta^\top g(y)\right) / Z(\theta, \vartheta)$ is recovered by the dual optimization
$$\theta, \vartheta = \arg\max_{\vartheta \ge 0,\, \theta} \; -\log \sum_y h(y) \exp\left(\theta^\top f(y) + \vartheta^\top g(y)\right)$$
where $\theta \in \mathbb{R}^d$ and $\vartheta \in \mathbb{R}^{d'}$. These are obtained by minimizing $Z(\theta, \vartheta)$ or equivalently by maximizing its negative logarithm. Algorithm 1 permits variational maximization of the dual via the quadratic program
$$\min_{\vartheta \ge 0,\, \theta} \; \tfrac{1}{2} (\beta - \tilde\beta)^\top \Sigma (\beta - \tilde\beta) + \beta^\top \mu$$
where $\beta^\top = [\theta^\top \; \vartheta^\top]$. Note that any general convex hull of constraints $\beta \in \Lambda \subseteq \mathbb{R}^{d + d'}$ could be imposed without loss of generality.

Proof of Theorem 2 We begin by proving a lemma that will be useful later.

Lemma 2 If $\kappa \Psi \succeq \Phi \succ 0$ for $\Phi, \Psi \in \mathbb{R}^{d \times d}$, then
$$L(\theta) = -\tfrac{1}{2} (\theta - \tilde\theta)^\top \Phi (\theta - \tilde\theta) - (\theta - \tilde\theta)^\top \mu$$
$$U(\theta) = -\tfrac{1}{2} (\theta - \tilde\theta)^\top \Psi (\theta - \tilde\theta) - (\theta - \tilde\theta)^\top \mu$$
satisfy $\sup_{\theta \in \Lambda} L(\theta) \ge \frac{1}{\kappa} \sup_{\theta \in \Lambda} U(\theta)$ for any convex $\Lambda \subseteq \mathbb{R}^d$, $\tilde\theta \in \Lambda$, $\mu \in \mathbb{R}^d$ and $\kappa \in \mathbb{R}^+$.

Proof of Lemma 2 Define the primal problems of interest as $P_L = \sup_{\theta \in \Lambda} L(\theta)$ and $P_U = \sup_{\theta \in \Lambda} U(\theta)$. 
The constraints $\theta \in \Lambda$ can be summarized by a set of linear inequalities $A\theta \le b$ where $A \in \mathbb{R}^{k \times d}$ and $b \in \mathbb{R}^k$ for some (possibly infinite) $k \in \mathbb{Z}$. Apply the change of variables $z = \theta - \tilde\theta$. The constraint $A(z + \tilde\theta) \le b$ simplifies into $Az \le \tilde b$ where $\tilde b = b - A\tilde\theta$. Since $\tilde\theta \in \Lambda$, it is easy to show that $\tilde b \ge 0$. We obtain the equivalent primal problems $P_L = \sup_{Az \le \tilde b} -\tfrac{1}{2} z^\top \Phi z - z^\top \mu$ and $P_U = \sup_{Az \le \tilde b} -\tfrac{1}{2} z^\top \Psi z - z^\top \mu$. The corresponding dual problems are
$$D_L = \inf_{y \ge 0} \; \frac{y^\top A \Phi^{-1} A^\top y}{2} + y^\top A \Phi^{-1} \mu + y^\top \tilde b + \frac{\mu^\top \Phi^{-1} \mu}{2}$$
$$D_U = \inf_{y \ge 0} \; \frac{y^\top A \Psi^{-1} A^\top y}{2} + y^\top A \Psi^{-1} \mu + y^\top \tilde b + \frac{\mu^\top \Psi^{-1} \mu}{2}.$$
Due to strong duality, $P_L = D_L$ and $P_U = D_U$. Apply the inequalities $\Phi \preceq \kappa \Psi$ and $y^\top \tilde b > 0$ as
$$P_L \ge \sup_{Az \le \tilde b} -\frac{\kappa}{2} z^\top \Psi z - z^\top \mu = \inf_{y \ge 0} \; \frac{y^\top A \Psi^{-1} A^\top y}{2\kappa} + \frac{y^\top A \Psi^{-1} \mu}{\kappa} + y^\top \tilde b + \frac{\mu^\top \Psi^{-1} \mu}{2\kappa} \ge \frac{1}{\kappa} D_U = \frac{1}{\kappa} P_U.$$
This proves that $P_L \ge \frac{1}{\kappa} P_U$.

We will use the above to prove Theorem 2. First, we will upper-bound (in the Loewner ordering sense) the matrices $\Sigma_j$ in Algorithm 2. 
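Lemma 2 can be sanity-checked numerically in its simplest instance. The sketch below assumes the unconstrained case $\Lambda = \mathbb{R}^d$, where both suprema have the closed form $\tfrac{1}{2} \mu^\top M^{-1} \mu$; the helper name `sup_quadratic`, the random matrices and the value of `kappa` are hypothetical choices made only for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def sup_quadratic(M, mu):
    """sup over z in R^d of -0.5 * z^T M z - z^T mu, equal to 0.5 * mu^T M^{-1} mu."""
    return 0.5 * mu @ np.linalg.solve(M, mu)


rng = np.random.default_rng(1)
d, kappa = 4, 3.0
A = rng.normal(size=(d, d))
Psi = A @ A.T + np.eye(d)        # positive definite by construction
B = 0.1 * rng.normal(size=(d, d))
Phi = Psi + B @ B.T              # positive definite; kappa * Psi - Phi is PSD for this draw
mu = rng.normal(size=d)

P_L = sup_quadratic(Phi, mu)     # unconstrained sup of L
P_U = sup_quadratic(Psi, mu)     # unconstrained sup of U
```

Here $\Phi \succeq \Psi$, so `P_L` is no larger than `P_U`, yet it stays above `P_U / kappa`, which is the relationship the lemma asserts.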
Since $\|f_{x_j}(y)\|_2 \le r$ for all $y \in \Omega_j$ and since $\mu_j$ in Algorithm 1 is a convex combination of $f_{x_j}(y)$, the outer-product terms in the update for $\Sigma_j$ satisfy
$$(f_{x_j}(y) - \mu)(f_{x_j}(y) - \mu)^\top \preceq 4 r^2 I.$$
Thus, $\Sigma_j \preceq F(\alpha_1, \ldots, \alpha_n)\, 4 r^2 I$ holds where
$$F(\alpha_1, \ldots, \alpha_n) = \sum_{i=2}^{n} \frac{\tanh\left(\tfrac{1}{2} \log\left(\alpha_i / \sum_{k=1}^{i-1} \alpha_k\right)\right)}{2 \log\left(\alpha_i / \sum_{k=1}^{i-1} \alpha_k\right)}$$
using the definition of $\alpha_1, \ldots, \alpha_n$ in the proof of Theorem 1. The formula for $F$ starts at $i = 2$ since $z_0 \to 0^+$. The remainder of the argument bounds $F$ in expectation and arrives at the Loewner bound
$$\Sigma_j \preceq \left(2 r^2 \sum_{i=1}^{n-1} \frac{\tanh(\log(i)/2)}{\log(i)}\right) I = \omega I,$$
which, applied to each $\Sigma_j$, gives the following lower bound on $J(\theta)$, stated together with a corresponding upper bound:
$$J(\theta) \ge J(\tilde\theta) - \frac{t\omega + t\lambda}{2} \|\theta - \tilde\theta\|^2 - \sum_j (\theta - \tilde\theta)^\top (\mu_j - f_{x_j}(y_j))$$
$$J(\theta) \le J(\tilde\theta) - \frac{t\lambda}{2} \|\theta - \tilde\theta\|^2 - \sum_j (\theta - \tilde\theta)^\top (\mu_j - f_{x_j}(y_j)).$$
Assume permutation $\pi$ is sampled uniformly at random. The expected value of $F$ is then
$$E_\pi[F(\alpha_1, \ldots, \alpha_n)] = \frac{1}{n!} \sum_\pi \sum_{i=2}^{n} \frac{\tanh\left(\tfrac{1}{2} \log\left(\alpha_{\pi(i)} / \sum_{k=1}^{i-1} \alpha_{\pi(k)}\right)\right)}{2 \log\left(\alpha_{\pi(i)} / \sum_{k=1}^{i-1} \alpha_{\pi(k)}\right)}.$$
We claim that the expectation is maximized when all $\alpha_i = 1$ or any positive constant. Also, $F$ is invariant under uniform scaling of its arguments. Write the expected value of $F$ as $E$ for short. Next, consider $\frac{\partial E}{\partial \alpha_l}$ at the setting $\alpha_i = 1, \forall i$. Due to the expectation over $\pi$, we have $\frac{\partial E}{\partial \alpha_l} = \frac{\partial E}{\partial \alpha_o}$ for any $l, o$. Therefore, the gradient vector is constant when all $\alpha_i = 1$. Since F(α₁, …
, αₙ) is invariant to scaling, the gradient vector must therefore be the all zeros vector. Thus, the point when all $\alpha_i = 1$ is an extremum or a saddle. Next, consider $\frac{\partial}{\partial \alpha_o} \frac{\partial E}{\partial \alpha_l}$ for any $l, o$. At the setting $\alpha_i = 1$, $\frac{\partial^2 E}{\partial \alpha_l^2} = -c(n)$ and, $\frac{\partial}{\partial \alpha_o} \frac{\partial E}{\partial \alpha_l} = c(n)/(n - 1)$ for some non-negative constant function $c(n)$. Thus, the $\alpha_i = 1$ extremum is locally concave and is a maximum. This establishes that $E_\pi[F(\alpha_1, \ldots, \alpha_n)] \le E_\pi[F(1, \ldots, 1)]$ and yields the Loewner bound $\Sigma_j \preceq \omega I$. Apply this bound to each $\Sigma_j$ in the lower bound on $J(\theta)$ and also note a corresponding upper bound which follows from Jensen’s inequality. Define the current $\tilde\theta$ at time $\tau$ as $\theta_\tau$ and denote by $L_\tau(\theta)$ the above lower bound and by $U_\tau(\theta)$ the above upper bound at time $\tau$. Clearly, $L_\tau(\theta) \le J(\theta) \le U_\tau(\theta)$ with equality when $\theta = \theta_\tau$. Algorithm 2 maximizes $J(\theta)$ after initializing at $\theta_0$ and performing an update by maximizing a lower bound based on $\Sigma_j$. Since $L_\tau(\theta)$ replaces the definition of $\Sigma_j$ with $\omega I \succeq \Sigma_j$, $L_\tau(\theta)$ is a looser bound than the one used by Algorithm 2. Thus, performing $\theta_{\tau+1} = \arg\max_{\theta \in \Lambda} L_\tau(\theta)$ makes less progress than a step of Algorithm 1. Consider computing the slower update at each iteration $\tau$ and returning $\theta_{\tau+1} = \arg\max_{\theta \in \Lambda} L_\tau(\theta)$. 
Setting $\Phi = (t\omega + t\lambda) I$, $\Psi = t\lambda I$ and $\kappa = \frac{\omega + \lambda}{\lambda}$ allows us to apply Lemma 2 as follows
$$\sup_{\theta \in \Lambda} L_\tau(\theta) - L_\tau(\theta_\tau) \ge \frac{1}{\kappa} \left(\sup_{\theta \in \Lambda} U_\tau(\theta) - U_\tau(\theta_\tau)\right).$$
Since $L_\tau(\theta_\tau) = J(\theta_\tau) = U_\tau(\theta_\tau)$, $J(\theta_{\tau+1}) \ge \sup_{\theta \in \Lambda} L_\tau(\theta)$ and $\sup_{\theta \in \Lambda} U_\tau(\theta) \ge J(\theta^*)$, we obtain
$$J(\theta_{\tau+1}) - J(\theta^*) \ge \left(1 - \frac{1}{\kappa}\right) \left(J(\theta_\tau) - J(\theta^*)\right).$$
Iterate the above inequality starting at $\tau = 0$ to obtain
$$J(\theta_\tau) - J(\theta^*) \ge \left(1 - \frac{1}{\kappa}\right)^\tau \left(J(\theta_0) - J(\theta^*)\right).$$
A solution within a multiplicative factor of $\epsilon$ implies that $\epsilon = \left(1 - \frac{1}{\kappa}\right)^\tau$ or $\log(1/\epsilon) = \tau \log \frac{\kappa}{\kappa - 1}$. Inserting the definition for $\kappa$ shows that the number of iterations $\tau$ is at most $\left\lceil \frac{\log(1/\epsilon)}{\log \frac{\kappa}{\kappa - 1}} \right\rceil$ or $\left\lceil \frac{\log(1/\epsilon)}{\log(1 + \lambda/\omega)} \right\rceil$
. Inserting the definition for $\omega$ gives the bound.

Supplement for Section 5

[Figure 3 residue: root node $Y_1^{2,0}$ with children $Y_1^{1,1}, Y_2^{1,1}, Y_3^{1,1}, \ldots, Y_{m_{1,1}}^{1,1}$.]
Figure 3: Junction tree of depth 2.

Algorithm 5 SmallJunctionTree
Input Parameters $\tilde\theta$ and $h(u), f(u) \; \forall u \in Y_1^{2,0}$ and $z_i, \Sigma_i, \mu_i \; \forall i = 1, \ldots, m_{1,1}$
Initialize $z \to 0^+$, $\mu = 0$, $\Sigma = zI$
For each configuration $u \in Y_1^{2,0}$ {
    $\alpha = h(u) \left(\prod_{i=1}^{m_{1,1}} z_i \exp(-\tilde\theta^\top \mu_i)\right) \exp\left(\tilde\theta^\top \left(f(u) + \sum_{i=1}^{m_{1,1}} \mu_i\right)\right) = h(u) \exp(\tilde\theta^\top f(u)) \prod_{i=1}^{m_{1,1}} z_i$
    $l = f(u) + \sum_{i=1}^{m_{1,1}} \mu_i - \mu$
    $\Sigma \mathrel{+}= \sum_{i=1}^{m_{1,1}} \Sigma_i + \frac{\tanh(\frac{1}{2} \log(\alpha/z))}{2 \log(\alpha/z)} \, l l^\top$
    $\mu \mathrel{+}= \frac{\alpha}{z + \alpha} \, l$
    $z \mathrel{+}= \alpha$ }
Output $z, \mu, \Sigma$

Proof of correctness for Algorithm 3 Consider a simple junction tree of depth 2 shown in Figure 3. The notation $Y_c^{a,b}$ refers to the $c$th tree node located at tree level $a$ (the first level is considered to be the one with the tree leaves) whose parent is the $b$th node from the higher tree level (the root has no parent so $b = 0$). Let $\sum_{Y_{c_1}^{a_1,b_1}}$ refer to the sum over all configurations of variables in $Y_{c_1}^{a_1,b_1}$ and let $\sum_{Y_{c_1}^{a_1,b_1} \setminus Y_{c_2}^{a_2,b_2}}$ refer to the sum over all configurations of variables that are in $Y_{c_1}^{a_1,b_1}$ but not in $Y_{c_2}^{a_2,b_2}$. Let $m_{a,b}$ denote the number of children of the $b$th node located at tree level $a + 1$. For short-hand, use $\psi(Y) = h(Y) \exp(\theta^\top f(Y))$. 
The partition function can be expressed as:
$$Z(\theta) = \sum_{u \in Y_1^{2,0}} \left[\psi(u) \prod_{i=1}^{m_{1,1}} \left(\sum_{v \in Y_i^{1,1} \setminus Y_1^{2,0}} \psi(v)\right)\right]$$
$$\le \sum_{u \in Y_1^{2,0}} \left[\psi(u) \prod_{i=1}^{m_{1,1}} z_i \exp\left(\tfrac{1}{2} (\theta - \tilde\theta)^\top \Sigma_i (\theta - \tilde\theta) + (\theta - \tilde\theta)^\top \mu_i\right)\right]$$
$$= \sum_{u \in Y_1^{2,0}} \left[h(u) \exp(\theta^\top f(u)) \prod_{i=1}^{m_{1,1}} z_i \exp\left(\tfrac{1}{2} (\theta - \tilde\theta)^\top \Sigma_i (\theta - \tilde\theta) + (\theta - \tilde\theta)^\top \mu_i\right)\right]$$
where the upper-bound is obtained by applying Theorem 1 to each of the terms $\sum_{v \in Y_i^{1,1} \setminus Y_1^{2,0}} \psi(v)$. By simply rearranging terms we get:
$$Z(\theta) \le \sum_{u \in Y_1^{2,0}} \left[h(u) \left(\prod_{i=1}^{m_{1,1}} z_i \exp(-\tilde\theta^\top \mu_i)\right) \exp\left(\theta^\top \left(f(u) + \sum_{i=1}^{m_{1,1}} \mu_i\right)\right) \exp\left(\tfrac{1}{2} (\theta - \tilde\theta)^\top \left(\sum_{i=1}^{m_{1,1}} \Sigma_i\right) (\theta - \tilde\theta)\right)\right].$$

[Figure 4 residue: root node $Y_1^{3,0}$; second-level nodes $Y_1^{2,1}, Y_2^{2,1}, \ldots, Y_{m_{2,1}}^{2,1}$; leaves $Y_1^{1,i}, Y_2^{1,i}, \ldots, Y_{m_{1,i}}^{1,i}$ under the $i$th second-level node.]
Figure 4: Junction tree of depth 3.

One can prove that this expression can be upper-bounded by $z \exp\left(\tfrac{1}{2} (\theta - \hat\theta)^\top \Sigma (\theta - \hat\theta) + (\theta - \hat\theta)^\top \mu\right)$ where $z$, $\Sigma$ and $\mu$ can be computed using Algorithm 5 (a simplification of Algorithm 3). 
We will call this result Lemma A. The proof is similar to the proof of Theorem 1 so is not repeated here.

Consider enlarging the tree to a depth 3 as shown in Figure 4. The partition function is now
$$Z(\theta) = \sum_{u \in Y_1^{3,0}} \left[\psi(u) \prod_{i=1}^{m_{2,1}} \left(\sum_{v \in Y_i^{2,1} \setminus Y_1^{3,0}} \left(\psi(v) \prod_{j=1}^{m_{1,i}} \left(\sum_{w \in Y_j^{1,i} \setminus Y_i^{2,1}} \psi(w)\right)\right)\right)\right].$$
By Lemma A we can upper bound each $\sum_{v \in Y_i^{2,1} \setminus Y_1^{3,0}} \left(\psi(v) \prod_{j=1}^{m_{1,i}} \left(\sum_{w \in Y_j^{1,i} \setminus Y_i^{2,1}} \psi(w)\right)\right)$ term by the expression $z_i \exp\left(\tfrac{1}{2} (\theta - \hat\theta)^\top \Sigma_i (\theta - \hat\theta) + (\theta - \hat\theta)^\top \mu_i\right)$. This yields
$$Z(\theta) \le \sum_{u \in Y_1^{3,0}} \left[\psi(u) \prod_{i=1}^{m_{2,1}} z_i \exp\left(\tfrac{1}{2} (\theta - \tilde\theta)^\top \Sigma_i (\theta - \tilde\theta) + (\theta - \tilde\theta)^\top \mu_i\right)\right].$$
This process can be viewed as collapsing the sub-trees $S_1^{2,1}, S_2^{2,1}, \ldots, S_{m_{2,1}}^{2,1}$ to super-nodes that are represented by bound parameters, $z_i$, $\Sigma_i$ and $\mu_i$, $i = \{1, 2, \cdots, m_{2,1}\}$, where the sub-trees are defined as:
$S_1^{2,1} = \{Y_1^{2,1}, Y_1^{1,1}, Y_2^{1,1}, Y_3^{1,1}, \ldots, Y_{m_{1,1}}^{1,1}\}$
$S_2^{2,1} = \{Y_2^{2,1}, Y_1^{1,2}, Y_2^{1,2}, Y_3^{1,2}, \ldots, Y_{m_{1,2}}^{1,2}\}$
...
$S_{m_{2,1}}^{2,1} = \{Y_{m_{2,1}}^{2,1}, Y_1^{1,m_{2,1}}, Y_2^{1,m_{2,1}}, Y_3^{1,m_{2,1}}$, . . . 
, $Y_{m_{1,m_{2,1}}}^{1,m_{2,1}}\}$.

Notice that the obtained expression can be further upper bounded using again Lemma A (induction) yielding a bound of the form: $z \exp\left(\tfrac{1}{2} (\theta - \hat\theta)^\top \Sigma (\theta - \hat\theta) + (\theta - \hat\theta)^\top \mu\right)$.

Finally, for a general tree, follow the same steps described above, starting from leaves and collapsing nodes to super-nodes, each represented by bound parameters. This procedure effectively yields Algorithm 3 for the junction tree under consideration.

Supplement for Section 6
Proof of correctness for Algorithm 4 We begin by proving a lemma that will be useful later.

Lemma 3 For all $x \in \mathbb{R}^d$ and for all $l \in \mathbb{R}^d$,
$$\sum_{i=1}^{d} x(i)^2 l(i)^2 \ge \left(\sum_{i=1}^{d} x(i) \frac{l(i)^2}{\sqrt{\sum_{j=1}^{d} l(j)^2}}\right)^2.$$

Proof of Lemma 3 By Jensen’s inequality,
$$\sum_{i=1}^{d} x(i)^2 \frac{l(i)^2}{\sum_{j=1}^{d} l(j)^2} \ge \left(\sum_{i=1}^{d} \frac{x(i) l(i)^2}{\sum_{j=1}^{d} l(j)^2}\right)^2 \iff \sum_{i=1}^{d} x(i)^2 l(i)^2 \ge \left(\sum_{i=1}^{d} \frac{x(i) l(i)^2}{\sqrt{\sum_{j=1}^{d} l(j)^2}}\right)^2.$$

Now we prove the correctness of Algorithm 4. At the $i$th iteration, the algorithm stores $\Sigma_i$ using a low-rank representation $V_i^\top S_i V_i + D_i$ where $V_i \in \mathbb{R}^{k \times d}$ is orthonormal, $S_i \in \mathbb{R}^{k \times k}$ positive semi-definite and $D_i \in \mathbb{R}^{d \times d}$ is non-negative diagonal. The diagonal terms $D$ are initialized to $t\lambda I$ where $\lambda$ is the regularization term. To mimic Algorithm 1 we must increment the $\Sigma$ matrix by a rank one update of the form $\Sigma_i = \Sigma_{i-1} + r_i r_i^\top$. By projecting $r_i$ onto each eigenvector in $V$, we can decompose it as $r_i = \sum_{j=1}^{k} V_{i-1}(j, \cdot) r_i V_{i-1}(j, \cdot)^\top + g = V_{i-1}^\top V_{i-1} r_i + g$ where $g$ is the remaining residue. 
Thus the update rule can be rewritten as:
$$\Sigma_i = \Sigma_{i-1} + r_i r_i^\top = V_{i-1}^\top S_{i-1} V_{i-1} + D_{i-1} + (V_{i-1}^\top V_{i-1} r_i + g)(V_{i-1}^\top V_{i-1} r_i + g)^\top$$
$$= V_{i-1}^\top (S_{i-1} + V_{i-1} r_i r_i^\top V_{i-1}^\top) V_{i-1} + D_{i-1} + g g^\top = V_{i-1}'^\top S_{i-1}' V_{i-1}' + g g^\top + D_{i-1}$$
where we define $V_{i-1}' = Q_{i-1} V_{i-1}$ and define $Q_{i-1}$ in terms of the singular value decomposition, $Q_{i-1}^\top S_{i-1}' Q_{i-1} = \mathrm{svd}(S_{i-1} + V_{i-1} r_i r_i^\top V_{i-1}^\top)$. Note that $S_{i-1}'$ is diagonal and nonnegative by construction. The current formula for $\Sigma_i$ shows that we have a rank $(k + 1)$ system (plus diagonal term) which needs to be converted back to a rank $k$ system (plus diagonal term) which we denote by $\Sigma_i'$. We have two options as follows.

Case 1) Remove $g$ from $\Sigma_i$ to obtain
$$\Sigma_i' = V_{i-1}'^\top S_{i-1}' V_{i-1}' + D_{i-1} = \Sigma_i - g g^\top = \Sigma_i - c v v^\top$$
where $c = \|g\|^2$ and $v = \frac{1}{\|g\|} g$.

Case 2) Remove the $m$th (smallest) eigenvalue in $S_{i-1}'$ and its corresponding eigenvector:
$$\Sigma_i' = V_{i-1}'^\top S_{i-1}' V_{i-1}' + D_{i-1} + g g^\top - S'(m, m) V'(m, \cdot)^\top V'(m, \cdot) = \Sigma_i - c v v^\top$$
where $c = S'(m, m)$ and $v = V'(m, \cdot)^\top$.

Clearly, both cases can be written as an update of the form $\Sigma_i' = \Sigma_i - c v v^\top$ where $c \ge 0$ and $v^\top v = 1$. 
We choose the case with smaller $c$ value to minimize the change as we drop from a system of order $(k + 1)$ to order $k$. Discarding the smallest singular value and its corresponding eigenvector would violate the bound. Instead, consider absorbing this term into the diagonal component to preserve the bound. Formally, we look for a diagonal matrix $F$ such that $\Sigma_i'' = \Sigma_i' + F$ which also maintains $x^\top \Sigma_i'' x \ge x^\top \Sigma_i x$ for all $x \in \mathbb{R}^d$. Thus, we want to satisfy:
$$x^\top \Sigma_i'' x \ge x^\top \Sigma_i x \iff x^\top c v v^\top x \le x^\top F x \iff c \left(\sum_{i=1}^{d} x(i) v(i)\right)^2 \le \sum_{i=1}^{d} x(i)^2 F(i)$$
where, for ease of notation, we take $F(i) = F(i, i)$.

Define $v' = \frac{1}{w} v$ where $w = v^\top \mathbf{1}$. Consider the case where $v \ge 0$ though we will soon get rid of this assumption. We need an $F$ such that $\sum_{i=1}^{d} x(i)^2 F(i) \ge c \left(\sum_{i=1}^{d} x(i) v(i)\right)^2$. Equivalently, we need $\sum_{i=1}^{d} x(i)^2 \frac{F(i)}{c w^2} \ge \left(\sum_{i=1}^{d} x(i) v(i)'\right)^2$. Define $F(i)' = \frac{F(i)}{c w^2}$ for all $i = 1, \ldots, d$. So, we need an $F'$ such that $\sum_{i=1}^{d} x(i)^2 F(i)' \ge \left(\sum_{i=1}^{d} x(i) v(i)'\right)^2$. Using Lemma 3 it is easy to show that we may choose $F'(i) = v(i)'$. Thus, we obtain $F(i) = c w^2 F(i)' = c w v(i)$. Therefore, for all $x \in \mathbb{R}^d$, all $v \ge 0$, and for $F(i) = c v(i) \sum_{j=1}^{d} v(j)$ we have
$$\sum_{i=1}^{d} x(i)^2 F(i) \ge c \left(\sum_{i=1}^{d} x(i) v(i)\right)^2. \qquad (3)$$
To generalize the inequality to hold for all vectors $v \in \mathbb{R}^d$ with potentially negative entries, it is sufficient to set $F(i) = c |v(i)| \sum_{j=1}^{d} |v(j)|$. 
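The diagonal correction derived above can be checked numerically. The sketch below builds $F(i) = c |v(i)| \sum_j |v(j)|$ for a unit vector with mixed signs and forms the matrix $\mathrm{diag}(F) - c v v^\top$, which Inequality 3 requires to be positive semi-definite. The helper name `diagonal_absorption`, the dimension and the value of `c` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def diagonal_absorption(c, v):
    """Diagonal entries F(i) = c * |v(i)| * sum_j |v(j)|."""
    a = np.abs(v)
    return c * a * a.sum()


rng = np.random.default_rng(2)
d, c = 5, 0.7
v = rng.normal(size=d)
v /= np.linalg.norm(v)                   # unit vector with entries of either sign
F = diagonal_absorption(c, v)
gap = np.diag(F) - c * np.outer(v, v)    # should be PSD if Inequality 3 holds
```

Inspecting the smallest eigenvalue of `gap`, for example with `np.linalg.eigvalsh(gap).min()`, indicates whether replacing the discarded rank-one term $c v v^\top$ by the diagonal $F$ keeps the quadratic form from decreasing.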
To verify this, consider \ufb02ipping the sign of any v(i).\nThe left side of the Inequality 3 does not change. For the right side of this inequality, \ufb02ipping the\nsign of v(i) is equivalent to \ufb02ipping the sign of x(i) and not changing the sign of v(i). However, in\nthis case the inequality holds as shown before (it holds for any x \u2208 Rd). Thus for all x, v \u2208 Rd and\n\nd\nj=1\n\n|v(j)|, Inequality 3 holds.\n\nSupplement for Section 7\nSmall scale experiments In additional small-scale experiments, we compared Algorithm 2 with\nsteepest descent (SD), conjugate gradient (CG), BFGS and Newton-Raphson. Small-scale problems\nmay be interesting in real-time learning settings, for example, when a website has to learn from a\nuser\u2019s uploaded labeled data in a split second to perform real-time retrieval. We considered logistic\nregression on \ufb01ve UCI data sets where missing values were handled via mean-imputation. A range of\nregularization settings \u03bb \u2208 {100, 102, 104} was explored and all algorithms were initialized from the\nsame ten random start-points. Table 3 shows the average number of seconds each algorithm needed\nto achieve the same solution that BFGS converged to (all algorithms achieve the same solution due\nto concavity). 
The bound (last row) is the fastest algorithm in every setting.\n\ndata|\u03bb  a|10^0  a|10^2  a|10^4  b|10^0  b|10^2  b|10^4  c|10^0  c|10^2  c|10^4  d|10^0  d|10^2  d|10^4  e|10^0  e|10^2  e|10^4\nBFGS      1.90    0.89    2.45    3.14    2.00    1.60    4.09    1.03    1.90    5.62    2.88    3.28    2.63    2.01    1.49\nSD        1.74    0.92    1.60    2.18    6.17    5.83    1.92    0.64    0.56   12.04    1.27    1.94    2.68    2.49    1.54\nCG        0.78    0.83    0.85    0.70    0.67    0.83    0.65    0.64    0.72    1.36    1.21    1.23    0.48    0.55    0.43\nNewton    0.31    0.25    0.22    0.43    0.37    0.35    0.39    0.34    0.32    0.92    0.63    0.60    0.35    0.26    0.20\nBound     0.01    0.01    0.01    0.07    0.04    0.04    0.07    0.02    0.02    0.16    0.09    0.07    0.03    0.03    0.03\n\nTable 3: Convergence time in seconds under various regularization levels for a) Bupa (t = 345, dim = 7), b) Wine (t = 178, dim = 14), c) Heart (t = 187, dim = 23), d) Ion (t = 351, dim = 34), and e) Hepatitis (t = 155, dim = 20) data sets.\n\nInfluence of rank k on bound performance in large-scale experiments. We also examined the influence of k on bound performance and compared it with LBFGS, SD and CG. Several choices of k were explored. Table 4 shows results for the SRBCT data-set. In general, the bound performs best but slows down for superfluously large values of k. Steepest descent and conjugate gradient are slow and, as expected, do not vary with k. Note that each iteration of the bound takes less time with smaller k. However, we report overall runtime, which also depends on the number of iterations. Therefore, total runtime (a function of both) need not change monotonically with k.\n\nk          1     2     4     8    16    32    64\nLBFGS   1.37  1.32  1.39  1.35  1.46  1.40  1.54\nSD      8.80  8.80  8.80  8.80  8.80  8.80  8.80\nCG      4.39  4.39  4.39  4.39  4.39  4.39  4.39\nBound   0.56  0.56  0.67  0.96  1.34  2.11  4.57\n\nTable 4: Convergence time in seconds as a function of k.\n\nAdditional latent-likelihood results. For completeness, Figure 5 depicts two additional data-sets to complement Figure 2. 
Similarly, Table 5 shows all experimental settings explored in order to provide the summary Table 2 in the main article.\n\nFigure 5: Convergence of test latent log-likelihood for bupa and wine data-sets. [Panels: bupa, wine; x-axis: log(Time) [sec]; y-axis: -log(J(\u03b8)); curves: Bound, Newton, BFGS, Conjugate gradient, Steepest descent.]\n\nData-set         ion        ion        ion        ion       bupa       bupa       bupa       bupa  hepatitis  hepatitis  hepatitis  hepatitis\nAlgorithm        m=1        m=2        m=3        m=4        m=1        m=2        m=3        m=4        m=1        m=2        m=3        m=4\nBFGS           -4.96      -5.55      -5.88      -5.79     -22.07     -21.78     -21.92     -21.87      -4.42      -5.28      -4.95      -4.93\nSD            -11.80      -9.92      -5.56      -8.59     -21.76     -21.74     -21.73     -21.83      -4.93      -5.14      -5.01      -5.20\nCG             -5.47      -5.81      -5.57      -5.22     -21.81     -21.81     -21.81     -21.81      -4.84      -4.84      -4.84      -4.84\nNewton         -5.95      -5.95      -5.95      -5.95     -21.85     -21.85     -21.85     -21.85      -5.50      -5.50      -5.50      -4.50\nBound          -6.08      -4.84      -4.18      -5.17     -21.85     -19.95     -20.01     -19.97      -5.47      -4.40      -4.75      -4.92\n\nData-set        wine       wine       wine       wine      SRBCT      SRBCT      SRBCT      SRBCT\nAlgorithm        m=1        m=2        m=3        m=4        m=1        m=2        m=3        m=4\nBFGS           -0.90      -0.91      -1.79      -1.35      -5.99      -6.17      -6.09      -6.06\nSD             -1.61      -1.60      -1.37      -1.63      -5.61      -5.62      -5.62      -5.61\nCG             -0.51      -0.78      -0.95      -0.51      -5.62      -5.49      -5.36      -5.76\nNewton         -0.71      -0.71      -0.71      -0.71      -5.54      -5.54      -5.54      -5.54\nBound          -0.51      -0.51      -0.48      -0.51      -5.31      -5.31      -4.90      -0.11\n\nTable 5: Test latent log-likelihood at convergence for different values of m \u2208 {1, 2, 3, 4} on ion, bupa, hepatitis, wine and SRBCT data-sets.", "award": [], "sourceid": 279, "authors": [{"given_name": "Tony", "family_name": "Jebara", "institution": null}, {"given_name": "Anna", "family_name": "Choromanska", "institution": null}]}