{"title": "The variational hierarchical EM algorithm for clustering hidden Markov models", "book": "Advances in Neural Information Processing Systems", "page_first": 404, "page_last": 412, "abstract": "In this paper, we derive a novel algorithm to cluster  hidden Markov models (HMMs) according to their probability distributions. We propose a variational hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a ``cluster center'', i.e., a novel HMM that is representative for the group. We illustrate the benefits of the proposed algorithm on hierarchical clustering of motion capture sequences as well as on automatic music tagging.", "full_text": "The variational hierarchical EM algorithm for\n\nclustering hidden Markov models\n\nEmanuele Coviello\n\nECE Dept., UC San Diego\necoviell@ucsd.edu\n\nAntoni B. Chan\n\nCS Dept., CityU of Hong Kong\nabchan@cityu.edu.hk\n\nGert R.G. Lanckriet\n\nECE Dept., UC San Diego\ngert@ece.ucsd.edu\n\nAbstract\n\nIn this paper, we derive a novel algorithm to cluster hidden Markov models\n(HMMs) according to their probability distributions. We propose a variational\nhierarchical EM algorithm that i) clusters a given collection of HMMs into groups\nof HMMs that are similar, in terms of the distributions they represent, and ii) char-\nacterizes each group by a \u201ccluster center\u201d, i.e., a novel HMM that is representative\nfor the group. We illustrate the bene\ufb01ts of the proposed algorithm on hierarchical\nclustering of motion capture sequences as well as on automatic music tagging.\n\n1\n\nIntroduction\n\nThe hidden Markov model (HMM) [1] is a probabilistic model that assumes a signal is generated\nby a double embedded stochastic process. 
A discrete-time hidden state process, which evolves as a\nMarkov chain, encodes the dynamics of the signal, and an observation process, at each time condi-\ntioned on the current state, encodes the appearance of the signal. HMMs have successfully served\na variety of applications, including speech recognition [1], music analysis [2] and identi\ufb01cation [3],\nand clustering of time series data [4, 5].\nThis paper is about clustering HMMs. More precisely, we are interested in an algorithm that, given\na collection of HMMs, partitions them into K clusters of \u201csimilar\u201d HMMs, while also learning a\nrepresentative HMM \u201ccluster center\u201d that concisely and appropriately represents each cluster. This\nis similar to standard k-means clustering, except that the data points are HMMs now instead of\nvectors in Rd. Various applications motivate the design of HMM clustering algorithms, ranging from\nhierarchical clustering of sequential data (e.g., speech or motion sequences modeled by HMMs [4]),\nover hierarchical indexing for fast retrieval, to reducing the computational complexity of estimating\nmixtures of HMMs from large datasets (e.g., semantic annotation models for music and video) \u2014\nby clustering HMMs, ef\ufb01ciently estimated from many small subsets of the data, into a more compact\nmixture model of all data. However, there has been relatively little work on HMM clustering and,\ntherefore, its applications.\nExisting approaches to clustering HMMs operate directly on the HMM parameter space, by group-\ning HMMs according to a suitable pairwise distance de\ufb01ned in terms of the HMM parameters.\nHowever, as HMM parameters lie on a non-linear manifold, a simple application of the k-means al-\ngorithm will not succeed in the task, since it assumes real vectors in a Euclidean space. 
In addition,\nsuch an approach would have the additional complication that HMM parameters for a particular\ngenerative model are not unique, i.e., a permutation of the states leads to the same generative model.\nOne solution, proposed in [4], \ufb01rst constructs an appropriate similarity matrix between all HMMs\nthat are to be clustered (e.g., based on the Bhattacharyya af\ufb01nity, which depends non-linearly on the\nHMM parameters [6]), and then applies spectral clustering. While this approach has proven success-\nful to group HMMs into similar clusters [4], it does not allow to generate novel HMMs as cluster\ncenters. Each cluster can still be represented by choosing one of the given HMMs, e.g., the HMM\nwhich the spectral clustering procedure maps the closest to each spectral clustering center. However,\nthis may be suboptimal for various applications of HMM clustering, e.g., in hierarchical estimation\n\n1\n\n\fof HMM mixtures. Spectral clustering can be based on other af\ufb01nity scores between HMMs distri-\nbutions than Bhattacharyya af\ufb01nity, such as KL divergence approximated with sampling [7].\nInstead, in this paper we propose to cluster HMMs directly with respect to the probability distri-\nbutions they represent. We derive a hierarchical expectation maximization (HEM) algorithm that,\nstarting from a group of HMMs, estimates a smaller mixture model that concisely represents and\nclusters the input HMMs (i.e., the input HMM distributions guide the estimation of the output mix-\nture distribution). Historically, the \ufb01rst HEM algorithm was designed to cluster Gaussian probability\ndistributions [8]. This algorithm starts from a Gaussian mixture model (GMM) and reduces it to an-\nother GMM with fewer components, where each of the mixture components of the reduced GMM\nrepresents, i.e., clusters, a group of the original Gaussian mixture components. More recently, Chan\net al. 
[9] derived an HEM algorithm to cluster dynamic texture (DT) models (i.e., linear dynamical\nsystems, LDSs) through their probability distributions. HEM has been applied successfully to many\nmachine learning tasks for images [10], video [9] and music [11, 12]. The HEM algorithm is simi-\nlar in spirit to Bregman-clustering [13], which is based on assigning points to cluster centers using\nKL-divergence.\nTo extend the HEM framework for GMMs to hidden Markov mixture models (H3Ms), additional\nmarginalization of the hidden-state processes is required, as for DTMs. However, while Gaussians\nand DTs allow tractable inference in the E-step of HEM, this is no longer the case for HMMs.\nTherefore, in this work, we derive a variational formulation of the HEM algorithm (VHEM), and\nthen leverage a variational approximation derived in [14] (which has not been used in a learning con-\ntext so far) to make the inference in the E-step tractable. The proposed VHEM algorithm for H3Ms\n(VHEM-H3M) allows to cluster hidden Markov models, while also learning novel HMM centers that\nare representative of each cluster, in a way that is consistent with the underlying generative model\nof the input HMMs. The resulting VHEM algorithm can be generalized to handle other classes of\ngraphical models, for which exact computation of the E-step in standard HEM would be intractable,\nby leveraging similar variational approximations. The ef\ufb01cacy of the VHEM-H3M algorithm is\ndemonstrated on hierarchical motion clustering and semantic music annotation and retrieval.\nThe remainder of the paper is organized as follows. We review the hidden Markov model (HMM)\nand the hidden Markov mixture model (H3M) in Section 2. 
We present the derivation of the VHEM-H3M algorithm in Section 3, followed by a discussion and an experimental evaluation in Section 4.\n2 The hidden Markov (mixture) model\nA hidden Markov model (HMM) $\\mathcal{M}$ assumes a sequence of $\\tau$ observations $y_{1:\\tau}$ is generated by a double embedded stochastic process. The hidden state process $x_{1:\\tau}$ is a first order Markov chain on $S$ states, with transition matrix $A$ whose entries are $a_{\\beta,\\gamma} = P(x_{t+1} = \\gamma | x_t = \\beta)$, and initial state distribution $\\pi = [\\pi_1, \\ldots, \\pi_S]$, where $\\pi_\\beta = P(x_1 = \\beta | \\mathcal{M})$. Each state $\\beta$ generates observations according to an emission probability density function $p(y | x = \\beta, \\mathcal{M})$ which here we assume time-invariant and modeled as a Gaussian mixture with $M$ components, i.e., $p(y | x = \\beta, \\mathcal{M}) = \\sum_{m=1}^{M} c_{\\beta,m}\\, p(y | \\zeta = m, \\mathcal{M})$, where $\\zeta \\sim \\mathrm{multinomial}(c_{\\beta,1}, \\ldots, c_{\\beta,M})$ is the hidden variable that selects the mixture component, $c_{\\beta,m}$ the mixture weight of the $m$th Gaussian component, and $p(y | \\zeta = m, \\mathcal{M}) = \\mathcal{N}(y; \\mu_{\\beta,m}, \\Sigma_{\\beta,m})$ is the probability density function of a multivariate Gaussian distribution with mean $\\mu_{\\beta,m}$ and covariance matrix $\\Sigma_{\\beta,m}$. 
The HMM is specified by the parameters $\\mathcal{M} = \\{\\pi, A, \\{\\{c_{\\beta,m}, \\mu_{\\beta,m}, \\Sigma_{\\beta,m}\\}_{m=1}^{M}\\}_{\\beta=1}^{S}\\}$ which can be efficiently learned from an observation sequence $y_{1:\\tau}$ with the Baum-Welch algorithm [1].\nA hidden Markov mixture model (H3M) models a set of observation sequences as samples from a group of $K$ hidden Markov models, each associated to a specific sub-behavior [5]. For a given sequence, an assignment variable $z \\sim \\mathrm{multinomial}(\\omega_1, \\cdots, \\omega_K)$ selects the parameters of one of the $K$ HMMs. Each mixture component is parametrized by $\\mathcal{M}_z = \\{\\pi^z, A^z, \\{\\{c^z_{\\beta,m}, \\mu^z_{\\beta,m}, \\Sigma^z_{\\beta,m}\\}_{m=1}^{M}\\}_{\\beta=1}^{S}\\}$ and the H3M is parametrized by $\\mathcal{M} = \\{\\omega_z, \\mathcal{M}_z\\}_{z=1}^{K}$. The likelihood of a random sequence $y_{1:\\tau} \\sim \\mathcal{M}$ is\n\n$$p(y_{1:\\tau} | \\mathcal{M}) = \\sum_{i=1}^{K} \\omega_i\\, p(y_{1:\\tau} | z = i, \\mathcal{M}), \\qquad (1)$$\n\nwhere $p(y_{1:\\tau} | z = i, \\mathcal{M})$ is the likelihood of $y_{1:\\tau}$ under the $i$th HMM component. To reduce clutter, here we assume that all the HMMs have the same number $S$ of hidden states and that all emission probabilities have $M$ mixture components, though our derivation could be easily extended to the more general case, and in the remainder of the paper we use the notation in Table 1.\n\nTable 1: Notation. 
(b) base model, (r) reduced model.\nvariables | base model (b) | reduced model (r)\nindex for HMM comp. | $i$ | $j$\nHMM states | $\\beta$ | $\\rho$\nHMM state sequence | $\\beta_{1:\\tau} = \\{\\beta_1 \\cdots \\beta_\\tau\\}$ | $\\rho_{1:\\tau} = \\{\\rho_1 \\cdots \\rho_\\tau\\}$\nindex for comp. of GMM | $m$ | $\\ell$\n\nmodels | base model (b) | reduced model (r)\nH3M | $\\mathcal{M}^{(b)}$ | $\\mathcal{M}^{(r)}$\nHMM component | $\\mathcal{M}^{(b)}_i$ | $\\mathcal{M}^{(r)}_j$\nGMM emission | $\\mathcal{M}^{(b)}_{i,\\beta}$ | $\\mathcal{M}^{(r)}_{j,\\rho}$\ncomponent of GMM | $\\mathcal{M}^{(b)}_{i,\\beta,m}$ | $\\mathcal{M}^{(r)}_{j,\\rho,\\ell}$\n\nprobability distributions | notation | short-hand\nHMM state seq. (b) | $p(x_{1:\\tau} = \\beta_{1:\\tau} \\mid z^{(b)} = i, \\mathcal{M}^{(b)})$ | $\\pi^{(b),i}_{\\beta_{1:\\tau}}$\nHMM state seq. (r) | $p(x_{1:\\tau} = \\rho_{1:\\tau} \\mid z^{(r)} = j, \\mathcal{M}^{(r)})$ | $\\pi^{(r),j}_{\\rho_{1:\\tau}}$\nHMM obs. likelihood (r) | $p(y_{1:\\tau} \\mid z^{(r)} = j, \\mathcal{M}^{(r)})$ | $p(y_{1:\\tau} \\mid \\mathcal{M}^{(r)}_j)$\nGMM emit likelihood (r) | $p(y_t \\mid x_t = \\rho, \\mathcal{M}^{(r)}_j)$ | $p(y_t \\mid \\mathcal{M}^{(r)}_{j,\\rho})$\nGaussian likelihood (r) | $p(y_t \\mid \\zeta_t = \\ell, x_t = \\rho, \\mathcal{M}^{(r)}_j)$ | $p(y_t \\mid \\mathcal{M}^{(r)}_{j,\\rho,\\ell})$\n\nexpectations | notation | short-hand\nHMM obs. seq. | $\\mathrm{E}_{y_{1:\\tau} \\mid z^{(b)} = i, \\mathcal{M}^{(b)}}[\\cdot]$ | $\\mathrm{E}_{\\mathcal{M}^{(b)}_i}[\\cdot]$\nGMM emission | $\\mathrm{E}_{y_t \\mid x_t = \\beta, \\mathcal{M}^{(b)}_i}[\\cdot]$ | $\\mathrm{E}_{\\mathcal{M}^{(b)}_{i,\\beta}}[\\cdot]$\nGaussian component | $\\mathrm{E}_{y_t \\mid \\zeta_t = m, x_t = \\beta, \\mathcal{M}^{(b)}_i}[\\cdot]$ | $\\mathrm{E}_{\\mathcal{M}^{(b)}_{i,\\beta,m}}[\\cdot]$\n\n3 Clustering hidden Markov models\n\nWe now derive the variational hierarchical EM algorithm for clustering HMMs (VHEM-H3M). Let $\\mathcal{M}^{(b)} = \\{\\omega^{(b)}_i, \\mathcal{M}^{(b)}_i\\}_{i=1}^{K^{(b)}}$ be a base hidden Markov mixture model (H3M) with $K^{(b)}$ components. The goal of the VHEM-H3M algorithm is to find a reduced hidden Markov mixture model $\\mathcal{M}^{(r)} = \\{\\omega^{(r)}_j, \\mathcal{M}^{(r)}_j\\}_{j=1}^{K^{(r)}}$ with fewer components (i.e., $K^{(r)} < K^{(b)}$), that represents $\\mathcal{M}^{(b)}$ well. At a high level, the VHEM-H3M algorithm estimates the reduced H3M model $\\mathcal{M}^{(r)}$ from virtual samples distributed according to the base H3M model $\\mathcal{M}^{(b)}$. From this estimation procedure, the VHEM algorithm provides: (i) a (soft) clustering of the original $K^{(b)}$ HMMs into $K^{(r)}$ groups, encoded in assignment variables $\\hat{z}_{i,j}$, and (ii) novel HMM cluster centers, i.e., the HMM components of $\\mathcal{M}^{(r)}$, each of them representing a group of the original HMMs of $\\mathcal{M}^{(b)}$. 
Finally, because we take the expectation over the virtual samples, the estimation is carried out in an efficient manner that requires only knowledge of the parameters of the base model without the need of generating actual virtual samples.\n\n3.1 Parameter estimation\nWe consider a set $Y$ of $N$ virtual samples distributed according to the base model $\\mathcal{M}^{(b)}$, such that the $N_i = N \\omega^{(b)}_i$ samples $Y_i = \\{y^{(i,m)}_{1:\\tau}\\}_{m=1}^{N_i}$ are from the $i$th component (i.e., $y^{(i,m)}_{1:\\tau} \\sim \\mathcal{M}^{(b)}_i$). We denote the entire set of samples as $Y = \\{Y_i\\}_{i=1}^{K^{(b)}}$, and, in order to obtain a consistent clustering of the input HMMs $\\mathcal{M}^{(b)}_i$, we assume the entirety of samples $Y_i$ is assigned to the same component of the reduced model [8]. Note that, in this formulation, we are not using virtual samples $\\{x^{(i,m)}_{1:\\tau}, y^{(i,m)}_{1:\\tau}\\}$ for each base component, according to its joint distribution $p(x_{1:\\tau}, y_{1:\\tau} | \\mathcal{M}^{(b)}_i)$, but we treat $X_i = \\{x^{(i,m)}_{1:\\tau}\\}_{m=1}^{N_i}$ as \u201cmissing\u201d information, and estimate them in the E-step. The reason is that a basis mismatch between components of $\\mathcal{M}^{(b)}_i$ will cause problems when the parameters of $\\mathcal{M}^{(r)}_j$ are computed from virtual samples of the hidden states of $\\{\\mathcal{M}^{(b)}_i\\}_{i=1}^{K^{(b)}}$.\nThe original formulation of HEM [8] maximizes the log-likelihood of the virtual samples, i.e., $\\log p(Y | \\mathcal{M}^{(r)}) = \\sum_{i=1}^{K^{(b)}} \\log p(Y_i | \\mathcal{M}^{(r)})$, with respect to $\\mathcal{M}^{(r)}$, and uses the law of large numbers to turn the virtual samples into an expectation over the base model components $\\mathcal{M}^{(b)}_i$. In this paper, we will start with a slightly different objective function to derive the VHEM algorithm. 
To estimate $\\mathcal{M}^{(r)}$, we will maximize the expected log-likelihood of the virtual samples,\n\n$$\\mathcal{J}(\\mathcal{M}^{(r)}) = \\mathrm{E}_{\\mathcal{M}^{(b)}}\\left[\\log p(Y | \\mathcal{M}^{(r)})\\right] = \\sum_{i=1}^{K^{(b)}} \\mathrm{E}_{\\mathcal{M}^{(b)}_i}\\left[\\log p(Y_i | \\mathcal{M}^{(r)})\\right], \\qquad (2)$$\n\nwhere the expectation is over the base model components $\\mathcal{M}^{(b)}_i$.\nA general framework for maximum likelihood estimation in the presence of hidden variables (which is the case for H3Ms) is the EM algorithm [15]. In this work, we take a variational perspective [16, 17, 18], which views both E- and M-step as a maximization step. The variational E-step first obtains a family of lower bounds to the log-likelihood (i.e., to equation 2), indexed by variational parameters, and then optimizes over the variational parameters to find the tightest bound. The corresponding M-step then maximizes the lower bound (with the variational parameters fixed) with respect to the model parameters. One advantage of the variational formulation is that it allows replacing a difficult inference in the E-step with a variational approximation, by restricting the maximization to a smaller domain for which the lower bound is tractable.\n\n3.1.1 Lower bound to an expected log-likelihood\n\nBefore proceeding with the derivation of VHEM for H3Ms, we first need to derive a lower bound to an expected log-likelihood term (e.g., (2)). We will first consider the lower bound to a log-likelihood. In all generality, let $\\{O, H\\}$ be the observation and hidden variables of a probabilistic model, respectively, where $p(H)$ is the distribution of the hidden variables, $p(O|H)$ is the conditional likelihood of the observations, and $p(O) = \\sum_H p(O|H) p(H)$ is the observation likelihood. 
We can define a variational lower bound to the observation log-likelihood [18, 19]:\n\n$$\\log p(O) \\geq \\log p(O) - D(q(H) \\| p(H|O)) = \\sum_H q(H) \\log \\frac{p(H) p(O|H)}{q(H)}, \\qquad (3)$$\n\nwhere $p(H|O)$ is the posterior distribution of $H$ given observation $O$, and $q(H)$ is the variational distribution (i.e., $\\sum_H q(H) = 1$ and $q(H) \\geq 0$) or approximate posterior distribution. $D(p \\| q) = \\int p(y) \\log \\frac{p(y)}{q(y)} dy$ is the Kullback-Leibler (KL) divergence between two distributions, $p$ and $q$. When the variational distribution equals the true posterior, $q(H) = p(H|O)$, then the KL divergence is zero, and hence the lower bound reaches $\\log p(O)$. When the true posterior is not possible to calculate, then typically $q$ is restricted to some set $\\mathcal{Q}$ of approximate posterior distributions that are tractable, and the best lower bound is obtained by maximizing over $q$,\n\n$$\\log p(O) \\geq \\max_{q \\in \\mathcal{Q}} \\sum_H q(H) \\log \\frac{p(H) p(O|H)}{q(H)}. \\qquad (4)$$\n\nUsing the lower bound in (4), we can now derive a lower bound to an expected log-likelihood expression. Let $\\mathrm{E}_b[\\cdot]$ be the expectation of $O$ with respect to a distribution $p_b(O)$. Since $p_b(O)$ is non-negative, taking the expectation on both sides of (4) yields,\n\n$$\\mathrm{E}_b[\\log p(O)] \\geq \\mathrm{E}_b\\left[\\max_{q \\in \\mathcal{Q}} \\sum_H q(H) \\log \\frac{p(H) p(O|H)}{q(H)}\\right] \\geq \\max_{q \\in \\mathcal{Q}} \\mathrm{E}_b\\left[\\sum_H q(H) \\log \\frac{p(H) p(O|H)}{q(H)}\\right] \\qquad (5)$$\n$$= \\max_{q \\in \\mathcal{Q}} \\sum_H q(H) \\left\\{ \\log \\frac{p(H)}{q(H)} + \\mathrm{E}_b[\\log p(O|H)] \\right\\}, \\qquad (6)$$\n\nwhere (5) follows from Jensen\u2019s inequality (i.e., $f(\\mathrm{E}[x]) \\leq \\mathrm{E}[f(x)]$ when $f$ is convex), and the convexity of the max function.\n\n3.1.2 Variational lower bound\n\nWe now derive the lower bound of the expected log-likelihood cost function in (2). 
The derivation proceeds by successively applying the lower bound from (6) on each arising expected log-likelihood term, which results in a set of nested lower bounds. We first define the following three lower bounds:\n\n$$\\mathrm{E}_{\\mathcal{M}^{(b)}_i}[\\log p(Y_i | \\mathcal{M}^{(r)})] \\geq \\mathcal{L}^{i}_{H3M}, \\qquad (7)$$\n$$\\mathrm{E}_{\\mathcal{M}^{(b)}_i}[\\log p(y_{1:\\tau} | \\mathcal{M}^{(r)}_j)] \\geq \\mathcal{L}^{i,j}_{HMM}, \\qquad (8)$$\n$$\\mathrm{E}_{\\mathcal{M}^{(b)}_{i,\\beta_t}}[\\log p(y_t | \\mathcal{M}^{(r)}_{j,\\rho_t})] \\geq \\mathcal{L}^{(i,\\beta_t),(j,\\rho_t)}_{GMM}. \\qquad (9)$$\n\nThe first lower bound, $\\mathcal{L}^{i}_{H3M}$, is on the expected log-likelihood between an HMM and an H3M. The second lower bound, $\\mathcal{L}^{i,j}_{HMM}$, is on the expected log-likelihood of an HMM $\\mathcal{M}^{(r)}_j$, marginalized over observation sequences from a different HMM $\\mathcal{M}^{(b)}_i$. Although the data log-likelihood $\\log p(y_{1:\\tau} | \\mathcal{M}^{(r)}_j)$ can be computed exactly using the forward algorithm [1], calculating its expectation is not analytically tractable since $y_{1:\\tau} \\sim \\mathcal{M}^{(r)}_j$ is essentially an observation from a mixture with $O(S^\\tau)$ components. The third lower bound is between GMM emission densities $\\mathcal{M}^{(b)}_{i,\\beta_t}$ and $\\mathcal{M}^{(r)}_{j,\\rho_t}$.\n\nH3M lower bound - Looking at an individual term in (2), $p(Y_i | \\mathcal{M}^{(r)})$ is a mixture of HMMs, and thus the observation variable is $Y_i$ and the hidden variable is $z_i$ (the assignment of $Y_i$ to a component $\\mathcal{M}^{(r)}_j$). Hence, introducing the variational distribution $q_i(z_i)$ and applying (6), we have\n\n$$\\mathrm{E}_{\\mathcal{M}^{(b)}_i}\\left[\\log p(Y_i | \\mathcal{M}^{(r)})\\right] \\geq \\max_{q_i} \\sum_j q_i(z_i = j) \\left\\{ \\log \\frac{p(z_i = j)}{q_i(z_i = j)} + N_i\\, \\mathrm{E}_{\\mathcal{M}^{(b)}_i}[\\log p(y_{1:\\tau} | \\mathcal{M}^{(r)}_j)] \\right\\}$$\n$$\\geq \\max_{q_i} \\sum_j q_i(z_i = j) \\left\\{ \\log \\frac{p(z_i = j)}{q_i(z_i = j)} + N_i\\, \\mathcal{L}^{i,j}_{HMM} \\right\\} \\triangleq \\mathcal{L}^{i}_{H3M}, \\qquad (10)$$\n\nwhere we use the fact that $Y_i$ is a set of $N_i$ i.i.d. 
samples, and we use the lower bound (8) for the expectation of $\\log p(y_{1:\\tau} | \\mathcal{M}^{(r)}_j)$, which is the observation log-likelihood of an HMM and hence its expectation cannot be calculated directly. To compute $\\mathcal{L}^{i}_{H3M}$, we will restrict the variational distributions to the form $q_i(z_i = j) = z_{ij}$ for all $i$, where $\\sum_{j=1}^{K^{(r)}} z_{ij} = 1$, and $z_{ij} \\geq 0\\ \\forall j$.\nHMM lower bound - For the HMM likelihood $p(y_{1:\\tau} | \\mathcal{M}^{(r)}_j)$, the observation variable is $y_{1:\\tau}$ and the hidden variable is its state sequence $\\rho_{1:\\tau}$. Hence, for the lower bound $\\mathcal{L}^{i,j}_{HMM}$ we get\n\n$$\\mathrm{E}_{\\mathcal{M}^{(b)}_i}[\\log p(y_{1:\\tau} | \\mathcal{M}^{(r)}_j)] = \\sum_{\\beta_{1:\\tau}} \\pi^{(b),i}_{\\beta_{1:\\tau}}\\, \\mathrm{E}_{\\mathcal{M}^{(b)}_i | \\beta_{1:\\tau}}[\\log p(y_{1:\\tau} | \\mathcal{M}^{(r)}_j)] \\qquad (11)$$\n$$\\geq \\sum_{\\beta_{1:\\tau}} \\pi^{(b),i}_{\\beta_{1:\\tau}} \\max_{q^{i,j}} \\sum_{\\rho_{1:\\tau}} q^{i,j}(\\rho_{1:\\tau} | \\beta_{1:\\tau}) \\left\\{ \\log \\frac{p(\\rho_{1:\\tau} | \\mathcal{M}^{(r)}_j)}{q^{i,j}(\\rho_{1:\\tau} | \\beta_{1:\\tau})} + \\sum_t \\mathrm{E}_{\\mathcal{M}^{(b)}_{i,\\beta_t}}[\\log p(y_t | \\mathcal{M}^{(r)}_{j,\\rho_t})] \\right\\} \\qquad (12)$$\n$$\\geq \\sum_{\\beta_{1:\\tau}} \\pi^{(b),i}_{\\beta_{1:\\tau}} \\max_{q^{i,j}} \\sum_{\\rho_{1:\\tau}} q^{i,j}(\\rho_{1:\\tau} | \\beta_{1:\\tau}) \\left\\{ \\log \\frac{p(\\rho_{1:\\tau} | \\mathcal{M}^{(r)}_j)}{q^{i,j}(\\rho_{1:\\tau} | \\beta_{1:\\tau})} + \\sum_t \\mathcal{L}^{(i,\\beta_t),(j,\\rho_t)}_{GMM} \\right\\} \\triangleq \\mathcal{L}^{i,j}_{HMM}, \\qquad (13)$$\n\nwhere in (11) we first rewrite the expectation $\\mathrm{E}_{\\mathcal{M}^{(b)}_i}$ to explicitly marginalize over the HMM state sequence $\\beta_{1:\\tau}$ from $\\mathcal{M}^{(b)}_i$, in (12) we introduce a variational distribution $q^{i,j}_{\\beta_{1:\\tau}}(\\rho_{1:\\tau})$ on the state sequence $\\rho_{1:\\tau}$, which depends on the particular sequence $\\beta_{1:\\tau}$, and apply (6), and in the last line we use the lower bound, defined in (9), on each expectation.\nTo compute $\\mathcal{L}^{i,j}_{HMM}$ we will restrict the variational distributions to the form of a Markov chain [14],\n\n$$q^{i,j}(\\rho_{1:\\tau} | \\beta_{1:\\tau}) = \\phi^{i,j}(\\rho_{1:\\tau} | \\beta_{1:\\tau}) = \\phi^{i,j}_{\\beta_1}(\\rho_1) \\prod_{t=2}^{\\tau} \\phi^{i,j}_{\\beta_t}(\\rho_t | \\rho_{t-1}), \\qquad (14)$$\n\nwhere $\\sum_{\\rho_1=1}^{S} \\phi^{i,j}_{\\beta_1}(\\rho_1) = 1$ for each value of $\\beta_1$, and $\\sum_{\\rho_t=1}^{S} \\phi^{i,j}_{\\beta_t}(\\rho_t | \\rho_{t-1}) = 1$ for each value of $\\beta_t$ and $\\rho_{t-1}$. 
The variational distribution $q^{i,j}_{\\beta_{1:\\tau}}(\\rho_{1:\\tau})$ assigns state sequences $\\beta_{1:\\tau} \\sim \\mathcal{M}^{(b)}_i$ to state sequences $\\rho_{1:\\tau} \\sim \\mathcal{M}^{(r)}_j$, based on how well (in expectation) the state sequence $\\rho_{1:\\tau} \\sim \\mathcal{M}^{(r)}_j$ can explain an observation sequence generated by HMM $\\mathcal{M}^{(b)}_i$ evolving through state sequence $\\beta_{1:\\tau} \\sim \\mathcal{M}^{(b)}_i$, i.e., by $p(y_{1:\\tau} | \\mathcal{M}^{(b)}_i, \\beta_{1:\\tau})$.\nGMM lower bound - In [20] we derive the lower bound (9), by marginalizing $\\mathrm{E}_{\\mathcal{M}^{(b)}_{i,\\beta}}$ over GMM assignment $m$, introducing the variational distributions $q^{i,j}_{\\beta,\\rho}(\\zeta = \\ell | m)$, and applying (6). We will restrict the variational distributions to $q^{i,j}_{\\beta,\\rho}(\\zeta = \\ell | m) = \\eta^{(i,\\beta),(j,\\rho)}_{\\ell|m}$, where $\\sum_{\\ell=1}^{M} \\eta^{(i,\\beta),(j,\\rho)}_{\\ell|m} = 1\\ \\forall m$, and $\\eta^{(i,\\beta),(j,\\rho)}_{\\ell|m} \\geq 0\\ \\forall \\ell, m$. 
Intuitively, $\\eta^{(i,\\beta_t),(j,\\rho_t)}$ is the responsibility matrix between Gaussian observation components for state $\\beta_t$ in $\\mathcal{M}^{(b)}_i$ and state $\\rho_t$ in $\\mathcal{M}^{(r)}_j$, where $\\eta^{(i,\\beta_t),(j,\\rho_t)}_{\\ell|m}$ is the probability that an observation from component $m$ of $\\mathcal{M}^{(b)}_{i,\\beta_t}$ corresponds to component $\\ell$ of $\\mathcal{M}^{(r)}_{j,\\rho_t}$.\n\n3.2 Variational HEM algorithm\n\nFinally, the variational lower bound of the expected log-likelihood of the virtual samples in (2) is\n\n$$\\mathcal{J}(\\mathcal{M}^{(r)}) = \\mathrm{E}_{\\mathcal{M}^{(b)}}\\left[\\log p(Y | \\mathcal{M}^{(r)})\\right] \\geq \\sum_{i=1}^{K^{(b)}} \\mathcal{L}^{i}_{H3M}, \\qquad (15)$$\n\nwhich is composed of three nested lower bounds, corresponding to different model elements (the H3M, the component HMMs, and the emission GMMs). 
The VHEM algorithm for HMMs consists in coordinate ascent on the right hand side of (15).\nE-step - The variational E-step (see [20] for details) calculates the variational parameters $z_{ij}$, $\\phi^{i,j}(\\rho_{1:\\tau} | \\beta_{1:\\tau}) = \\phi^{i,j}_{\\beta_1}(\\rho_1) \\prod_{t=2}^{\\tau} \\phi^{i,j}_{\\beta_t}(\\rho_t | \\rho_{t-1})$, and $\\eta^{(i,\\beta),(j,\\rho)}$ for the lower bounds in (9), (13) and (10). In particular, given the nesting of the lower bounds, we proceed by first maximizing the GMM lower bound $\\mathcal{L}^{(i,\\beta_t),(j,\\rho_t)}_{GMM}$ for each $(i, j, \\beta_t, \\rho_t)$. Next, the HMM lower bound $\\mathcal{L}^{i,j}_{HMM}$ is maximized for each $(i, j)$, which is followed by maximizing $\\mathcal{L}^{i}_{H3M}$ for each $i$. The latter gives $\\hat{z}_{ij} \\propto \\omega^{(r)}_j \\exp(N_i \\mathcal{L}^{i,j}_{HMM})$, which is similar to the formula derived in [8, 9], but the expectation is now replaced with its lower bound. We then collect the summary statistics: $\\nu^{i,j}_1(\\rho_1, \\beta_1) = \\pi^{(b),i}_{\\beta_1} \\hat{\\phi}^{i,j}_1(\\rho_1 | \\beta_1)$, $\\xi^{i,j}_t(\\rho_{t-1}, \\rho_t, \\beta_t) = \\left(\\sum_{\\beta_{t-1}=1}^{S} \\nu^{i,j}_{t-1}(\\rho_{t-1}, \\beta_{t-1})\\, a^{(b),i}_{\\beta_{t-1},\\beta_t}\\right) \\hat{\\phi}^{i,j}_t(\\rho_t | \\rho_{t-1}, \\beta_t)$, and $\\nu^{i,j}_t(\\rho_t, \\beta_t) = \\sum_{\\rho_{t-1}=1}^{S} \\xi^{i,j}_t(\\rho_{t-1}, \\rho_t, \\beta_t)$, the last two for $t = 2, \\ldots, \\tau$ (here $\\hat{\\phi}^{i,j}_1(\\rho_1 | \\beta_1)$ and $\\hat{\\phi}^{i,j}_t(\\rho_t | \\rho_{t-1}, \\beta_t)$ denote the optimal factors $\\phi^{i,j}_{\\beta_1}(\\rho_1)$ and $\\phi^{i,j}_{\\beta_t}(\\rho_t | \\rho_{t-1})$ of (14) at time $t$), and their aggregates\n\n$$\\hat{\\nu}^{i,j}_1(\\sigma) = \\sum_{\\beta=1}^{S} \\nu^{i,j}_1(\\sigma, \\beta), \\quad \\hat{\\nu}^{i,j}(\\sigma, \\beta) = \\sum_{t=1}^{\\tau} \\nu^{i,j}_t(\\sigma, \\beta), \\quad \\hat{\\xi}^{i,j}(\\rho, \\rho') = \\sum_{t=2}^{\\tau} \\sum_{\\beta=1}^{S} \\xi^{i,j}_t(\\rho, \\rho', \\beta),$$\n\nwhich are necessary for the M-step. The statistic $\\hat{\\nu}^{i,j}_1(\\rho)$ is the expected number of times that the HMM $\\mathcal{M}^{(r)}_j$ starts from state $\\rho$, when modeling sequences generated by $\\mathcal{M}^{(b)}_i$. The quantity $\\hat{\\nu}^{i,j}(\\rho, \\beta)$ is the expected number of times that the HMM $\\mathcal{M}^{(r)}_j$ is in state $\\rho$ when the HMM $\\mathcal{M}^{(b)}_i$ is in state $\\beta$, when both are modeling sequences generated by $\\mathcal{M}^{(b)}_i$. Similarly, the quantity $\\hat{\\xi}^{i,j}(\\rho, \\rho')$ is the expected number of transitions from state $\\rho$ to state $\\rho'$ of $\\mathcal{M}^{(r)}_j$, when modeling sequences generated by $\\mathcal{M}^{(b)}_i$.\nM-step - The lower bound (15) is maximized with respect to the parameters $\\mathcal{M}^{(r)}$. 
Defining the weighted sum operator $\\Omega_{j,\\rho,\\ell}(x(i, \\beta, m)) = \\sum_{i=1}^{K^{(b)}} \\hat{z}_{i,j}\\, \\omega^{(b)}_i \\sum_{\\beta=1}^{S} \\hat{\\nu}^{i,j}(\\rho, \\beta) \\sum_{m=1}^{M} c^{(b),i}_{\\beta,m}\\, x(i, \\beta, m)$, the parameters $\\mathcal{M}^{(r)}$ are updated according to (derivation in [20]):\n\n$$\\omega^{(r)*}_j = \\frac{\\sum_{i=1}^{K^{(b)}} \\hat{z}_{i,j}}{K^{(b)}}, \\quad \\pi^{(r),j*}_{\\rho} = \\frac{\\sum_{i=1}^{K^{(b)}} \\hat{z}_{i,j}\\, \\omega^{(b)}_i\\, \\hat{\\nu}^{i,j}_1(\\rho)}{\\sum_{\\rho'=1}^{S} \\sum_{i=1}^{K^{(b)}} \\hat{z}_{i,j}\\, \\omega^{(b)}_i\\, \\hat{\\nu}^{i,j}_1(\\rho')}, \\quad a^{(r),j*}_{\\rho,\\rho'} = \\frac{\\sum_{i=1}^{K^{(b)}} \\hat{z}_{i,j}\\, \\omega^{(b)}_i\\, \\hat{\\xi}^{i,j}(\\rho, \\rho')}{\\sum_{\\sigma=1}^{S} \\sum_{i=1}^{K^{(b)}} \\hat{z}_{i,j}\\, \\omega^{(b)}_i\\, \\hat{\\xi}^{i,j}(\\rho, \\sigma)}, \\qquad (16)$$\n$$c^{(r),j*}_{\\rho,\\ell} = \\frac{\\Omega_{j,\\rho,\\ell}\\left(\\hat{\\eta}^{(i,\\beta),(j,\\rho)}_{\\ell|m}\\right)}{\\sum_{\\ell'=1}^{M} \\Omega_{j,\\rho,\\ell'}\\left(\\hat{\\eta}^{(i,\\beta),(j,\\rho)}_{\\ell'|m}\\right)}, \\quad \\mu^{(r),j*}_{\\rho,\\ell} = \\frac{\\Omega_{j,\\rho,\\ell}\\left(\\hat{\\eta}^{(i,\\beta),(j,\\rho)}_{\\ell|m}\\, \\mu^{(b),i}_{\\beta,m}\\right)}{\\Omega_{j,\\rho,\\ell}\\left(\\hat{\\eta}^{(i,\\beta),(j,\\rho)}_{\\ell|m}\\right)}, \\qquad (17)$$\n$$\\Sigma^{(r),j*}_{\\rho,\\ell} = \\frac{\\Omega_{j,\\rho,\\ell}\\left(\\hat{\\eta}^{(i,\\beta),(j,\\rho)}_{\\ell|m} \\left[\\Sigma^{(b),i}_{\\beta,m} + (\\mu^{(b),i}_{\\beta,m} - \\mu^{(r),j}_{\\rho,\\ell})(\\mu^{(b),i}_{\\beta,m} - \\mu^{(r),j}_{\\rho,\\ell})^{t}\\right]\\right)}{\\Omega_{j,\\rho,\\ell}\\left(\\hat{\\eta}^{(i,\\beta),(j,\\rho)}_{\\ell|m}\\right)}. \\qquad (18)$$\n\nEquations (17) and (18) are all weighted averages over all base models, model states, and Gaussian components. The covariance matrices of the reduced models (18) are never smaller in magnitude than the covariance matrices of the base models, due to the outer-product term. This regularization effect derives from the E-step, which averages all possible observations from the base model.\n4 Discussion, Experiments and Conclusions\nJebara et al. 
[4] cluster a collection of HMMs by applying spectral clustering to a probability product kernel (PPK) matrix between HMMs. While this has been proven successful in grouping HMMs into similar clusters, it cannot learn novel HMM cluster centers and therefore is suboptimal for hierarchical estimation of mixture models (see Section 4.2). A second limitation is that the cost of building the PPK matrix is quadratic in the number $K^{(b)}$ of input HMMs. Note that we extended the algorithm in [4] to support GMM observations instead of only Gaussians.\nThe VHEM-H3M algorithm clusters a collection of HMMs directly through the distributions they represent, by estimating a smaller mixture of novel HMMs that concisely models the distribution represented by the input HMMs. This is achieved by maximizing the log-likelihood of \u201cvirtual\u201d samples generated from the input HMMs. As a result, the VHEM cluster centers are consistent with the underlying generative probabilistic framework. As a first advantage, since VHEM-H3M estimates novel HMM cluster centers, we expect the learned cluster centers to retain more information on the clusters\u2019 structure and VHEM-H3M to produce better hierarchical clusterings than [4], which suffers from out-of-sample limitations. 
A second advantage is that VHEM does not build a kernel embedding as in [4], and is therefore expected to be more efficient, especially for large $K^{(b)}$.\nIn addition, VHEM-H3M allows for efficient estimation of HMM mixtures from large datasets using a hierarchical estimation procedure. In particular, in a first stage intermediate HMM mixtures are estimated in parallel by running standard EM on small independent portions of the dataset, and the final model is estimated from the intermediate models using the VHEM algorithm. Relative to direct EM estimation on the entire data, VHEM-H3M is more time- and memory-efficient. First, it does not need to evaluate the likelihood of all the samples at each iteration, and converges to effective estimates in shorter times. Second, it no longer requires storing in memory the entire data set during parameter estimation. Another advantage is that the intermediate models implicitly provide more \u201csamples\u201d (virtual variations of each time-series) to the final VHEM stage. This acts as a form of regularization that prevents over-fitting and improves robustness of the learned models. 
Therefore, we expect models learned using the hierarchical estimation procedure to perform better than those learned with EM directly on the entire data. Note that in the second stage we could use the spectral clustering algorithm in [4] instead of VHEM: run spectral clustering over intermediate models pooled together, and form the final H3M with the HMMs mapped the closest to the K cluster centers. VHEM, however, is expected to do better since it learns novel cluster centers. As an alternative to VHEM, we tested a version of HEM that, instead of marginalizing over virtual samples, uses actual sampling and the EM algorithm [5] to learn the reduced H3M. Despite its simplicity, the algorithm requires a large number of samples for learning accurate models, and has longer learning times (since it evaluates the likelihood of all samples at each iteration).\n4.1 Experiment on hierarchical motion clustering\n\nTable 2: Hierarchical clustering on Motion Capture data, using various algorithms. The Rand-index is the probability that any pair of motion sequences are correctly clustered with respect to each other. 
Results are averages of 10 trials.\nmethod (#samples) | Rand-index, Level 2 | Rand-index, Level 3 | Rand-index, Level 4 | log-likelihood (\u00d710^6), Level 2 | log-likelihood (\u00d710^6), Level 3 | log-likelihood (\u00d710^6), Level 4 | time (s)\nVHEM-H3M | 0.937 | 0.811 | 0.518 | -5.361 | -5.682 | -5.866 | 30.97\nPPK-SC | 0.956 | 0.740 | 0.393 | -5.399 | -5.845 | -6.068 | 37.69\nSHEM-H3M (560) | 0.714 | 0.359 | 0.234 | -13.632 | -69.746 | -275.650 | 843.89\nSHEM-H3M (2800) | 0.782 | 0.685 | 0.480 | -14.645 | -30.086 | -52.227 | 3849.72\nEM-H3M | 0.831 | 0.430 | 0.340 | -5.713 | -202.55 | -168.90 | 667.97\nHEM-DTM | 0.897 | 0.661 | 0.412 | -7.125 | -8.163 | -8.532 | 121.32\n\nTable 3: Annotation and retrieval on CAL500, for VHEM-H3M, PPK-SC, EM-H3M, HEM-DTM and HEM-GMM, averaged over the 97 tags with at least 30 examples in CAL500, and result of 5-fold cross-validation.\nmethod | annotation P | annotation R | annotation F | retrieval MAP | retrieval P@10 | time (h)\nVHEM-H3M | 0.446 | 0.211 | 0.260 | 0.440 | 0.451 | 678\nEM-H3M | 0.415 | 0.214 | 0.248 | 0.423 | 0.422 | 1860\nPPK-SC | 0.299 | 0.159 | 0.151 | 0.347 | 0.340 | 1033\nHEM-DTM | 0.430 | 0.202 | 0.252 | 0.439 | 0.453 | 426\nHEM-GMM | 0.374 | 0.205 | 0.213 | 0.417 | 0.425 | 5\n\nFigure 1: Hierarchical clustering of Motion Capture data (qualitative). Best in color.\n\nWe tested the VHEM algorithm on hierarchical motion clustering, where each of the input HMMs to be clustered is estimated on a sequence of motion capture data from the Motion Capture dataset (http://mocap.cs.cmu.edu/). In particular, we start from K1 = 56 motion examples from 8 different classes (\u201cjump\u201d, \u201crun\u201d, \u201cjog\u201d, \u201cwalk 1\u201d and \u201cwalk 2\u201d which are from two different subjects, \u201cbasket\u201d, \u201csoccer\u201d, \u201csit\u201d), and learn an HMM for each of them, forming the first level of the hierarchy. A tree-structure is formed by successively clustering HMMs with the VHEM algorithm, and using the learned cluster centers as the representative HMMs at the new level. 
Levels 2, 3, and 4 of the\nhierarchy correspond to K_2 = 8, K_3 = 4 and K_4 = 2 clusters.\nThe hierarchical clustering obtained with VHEM is illustrated in Figure 1 (top). In the first level,\neach vertical bar represents a motion sequence, and different colors indicate different ground-truth\nclasses. At Level 2, the 8 HMM clusters are shown with vertical bars, with the colors indicating the\nproportions of the motion classes in the cluster. At Level 2, VHEM produces clusters with examples\nfrom a single motion class (e.g., \u201crun\u201d, \u201cjog\u201d, \u201cjump\u201d), but mixes some \u201csoccer\u201d examples with\n\u201cbasket\u201d, possibly because both actions consist of a sequence of movement-shot-pause. Moving up\nthe hierarchy, VHEM clusters similar motion classes together (as indicated by the arrows), and at\nLevel 4 it creates a dichotomy between \u201csit\u201d and the other (more dynamic) motion classes. At the\nbottom of Figure 1, the same experiment is repeated using spectral clustering in tandem with PPK\nsimilarity (PPK-SC). PPK-SC clusters motion sequences properly; however, at Level 2 it incorrectly\naggregates \u201csit\u201d and \u201csoccer\u201d, which have quite different dynamics, and its Level 4 is not as interpretable\nas the one produced by VHEM. Table 2 provides a quantitative comparison. While VHEM has a lower\nRand-index than PPK-SC at Level 2 (0.937 vs. 0.956), it has a higher Rand-index at Level 3 (0.811 vs.\n0.740) and Level 4 (0.518 vs. 0.393). In addition, VHEM-H3M has higher data log-likelihood\nthan PPK-SC at each level, and is more efficient (30.97 s vs. 37.69 s). This suggests that the novel HMM cluster centers\nlearned by VHEM-H3M retain more information on the clusters\u2019 structure than the spectral cluster\ncenters, which is increasingly visible moving up the hierarchy. 
Finally, VHEM-H3M performs better\nand is more efficient than the HEM version based on actual sampling (SHEM-H3M), EM applied\ndirectly to the motion sequences (EM-H3M), and the HEM-DTM algorithm [9].\n\n4.2 Experiment on automatic music tagging\n\nWe evaluated VHEM-H3M on content-based music auto-tagging on CAL500 [11], a collection\nof 502 songs annotated with respect to a vocabulary V of 149 tags. For each song, we extract a\ntime series Y = {y_1, . . . , y_T} of 13 Mel frequency cepstral coefficients (MFCCs) [1] over\nhalf-overlapping windows of 46 ms, with first and second instantaneous derivatives. We formulate music\nauto-tagging as supervised multi-class labeling [10], where each class is a tag from V and is modeled\nas an H3M probability distribution estimated from audio sequences (of T = 125 audio features, i.e.,\napproximately 3 s of audio) extracted from the relevant songs in the database, using the VHEM-H3M\nalgorithm. First, for each song the EM algorithm is used to learn an H3M with K^(s) = 6\ncomponents (as many as the structural parts of most pop songs). Then, for each tag, the relevant\nsong-level H3Ms are pooled together and the VHEM-H3M algorithm is used to learn the final H3M\ntag model with K = 3 components.\nWe compare the proposed VHEM-H3M algorithm to PPK-SC (see footnote 1), direct EM estimation (EM-H3M)\n[5] from the relevant songs\u2019 audio sequences, HEM-DTM [12] and HEM-GMM [11]. The last two\nuse an efficient HEM algorithm for learning, and are state-of-the-art baselines for music tagging.\nWe were not able to successfully estimate tag-H3Ms with the sampling version of HEM-H3M.\nAnnotation (precision P, recall R, and f-score F) and retrieval (mean average precision MAP, and\ntop-10 precision P@10) are reported in Table 3. VHEM-H3M is the most efficient algorithm for\nlearning H3Ms, as it requires only 36% of the time of EM-H3M and 65% of the time of PPK-SC. 
VHEM-H3M capitalizes on the song-level H3Ms learned in the first stage (about one third of\nthe total time), by efficiently using them to learn the final tag models. The gain in computational\nefficiency does not negatively affect the quality of the resulting models. On the contrary, VHEM-H3M\nachieves better performance than EM-H3M (differences are statistically significant based on\na paired t-test with 95% confidence), since it has the benefit of regularization, and outperforms\nPPK-SC. Designed for clustering HMMs, PPK-SC does not produce accurate annotation models,\nsince it discards information on the clusters\u2019 structure by approximating each cluster with one of the original\nHMMs. Instead, VHEM-H3M generates novel HMM cluster centers that effectively summarize\neach cluster. VHEM-H3M also outperforms HEM-GMM, which does not model temporal information\nin the audio signal. Finally, HEM-DTM, based on LDSs (a continuous-state model), can model only\nstationary time series in a linear subspace. In contrast, VHEM-H3M uses HMMs with discrete states\nand GMM emissions, and can also adapt to non-stationary time series on a non-linear manifold.\nHence, VHEM-H3M outperforms HEM-DTM on the human MoCap data (see Table 2), which has\nnon-linear dynamics, while the two perform similarly on the music data (differences were statistically\nsignificant only on annotation P), where the audio features are stationary over short time frames.\n\n4.3 Conclusion\n\nWe presented a variational HEM algorithm that clusters HMMs through their distributions and\ngenerates novel HMM cluster centers. The efficacy of the algorithm was demonstrated on hierarchical\nmotion clustering and automatic music tagging, with improvement over current methods.\n\nAcknowledgments\n\nThe authors acknowledge support from Google, Inc. E.C. and G.R.G.L. 
acknowledge support from\nQualcomm, Inc., Yahoo! Inc., and the National Science Foundation (grants CCF-083053,\nIIS-1054960 and EIA-0303622). A.B.C. acknowledges support from the Research Grants Council of\nthe Hong Kong SAR, China (CityU 110610). G.R.G.L. acknowledges support from the Alfred P.\nSloan Foundation.\n\nFootnote 1: It was necessary to implement PPK-SC with song-level H3Ms with K^(s) = 1. K^(s) = 2 took about four times\nas long with no improvement in performance. Larger K^(s) would lead to impractical learning times.\n\nReferences\n\n[1] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Upper Saddle River (NJ, USA), 1993.\n\n[2] Y. Qi, J.W. Paisley, and L. Carin. Music analysis using hidden Markov mixture models. IEEE Transactions on Signal Processing, 55(11):5209\u20135224, 2007.\n\n[3] E. Batlle, J. Masip, and E. Guaus. Automatic song identification in noisy broadcast audio. In IASTED International Conference on Signal and Image Processing. Citeseer, 2002.\n\n[4] T. Jebara, Y. Song, and K. Thadani. Spectral clustering and embedding with hidden Markov models. Machine Learning: ECML 2007, pages 164\u2013175, 2007.\n\n[5] P. Smyth. Clustering sequences with hidden Markov models. In Advances in Neural Information Processing Systems, 1997.\n\n[6] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. The Journal of Machine Learning Research, 5:819\u2013844, 2004.\n\n[7] B. H. Juang and L. R. Rabiner. A probabilistic distance measure for hidden Markov models. AT&T Technical Journal, 64(2):391\u2013408, February 1985.\n\n[8] N. Vasconcelos and A. Lippman. Learning mixture hierarchies. In Advances in Neural Information Processing Systems, 1998.\n\n[9] A.B. Chan, E. Coviello, and G.R.G. Lanckriet. Clustering dynamic textures with the hierarchical EM algorithm. In Intl. Conference on Computer Vision and Pattern Recognition, 2010.\n\n[10] G. Carneiro, A.B. Chan, P.J. Moreno, and N. 
Vasconcelos. Supervised learning of semantic classes for image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):394\u2013410, 2007.\n\n[11] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech and Language Processing, 16(2):467\u2013476, February 2008.\n\n[12] E. Coviello, A. Chan, and G. Lanckriet. Time series models for semantic music annotation. IEEE Transactions on Audio, Speech, and Language Processing, 19(5):1343\u20131359, 2011.\n\n[13] A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705\u20131749, 2005.\n\n[14] J.R. Hershey, P.A. Olsen, and S.J. Rennie. Variational Kullback-Leibler divergence for hidden Markov models. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2007), pages 323\u2013328. IEEE, 2008.\n\n[15] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1\u201338, 1977.\n\n[16] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. NATO ASI Series D: Behavioural and Social Sciences, 89:355\u2013370, 1998.\n\n[17] I. Csisz\u00e1r and G. Tusn\u00e1dy. Information geometry and alternating minimization procedures. Statistics and Decisions, 1984.\n\n[18] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183\u2013233, 1999.\n\n[19] T.S. Jaakkola. Tutorial on Variational Approximation Methods. In Advanced Mean Field Methods: Theory and Practice, pages 129\u2013159. MIT Press, 2000.\n\n[20] Anonymous. Derivation of the Variational HEM Algorithm for Hidden Markov Mixture Models. 
Technical report, Anonymous, 2012.\n", "award": [], "sourceid": 206, "authors": [{"given_name": "Emanuele", "family_name": "Coviello", "institution": null}, {"given_name": "Gert", "family_name": "Lanckriet", "institution": null}, {"given_name": "Antoni", "family_name": "Chan", "institution": null}]}