{"title": "Factorial Learning and the EM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 617, "page_last": 624, "abstract": "", "full_text": "Factorial Learning and the EM Algorithm \n\nZoubin Ghahramani \nzoubin@psyche.mit.edu \n\nDepartment of Brain & Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nMany real world learning problems are best characterized by an \ninteraction of multiple independent causes or factors. Discover(cid:173)\ning such causal structure from the data is the focus of this paper. \nBased on Zemel and Hinton's cooperative vector quantizer (CVQ) \narchitecture, an unsupervised learning algorithm is derived from \nthe Expectation-Maximization (EM) framework. Due to the com(cid:173)\nbinatorial nature of the data generation process, the exact E-step \nis computationally intractable. Two alternative methods for com(cid:173)\nputing the E-step are proposed: Gibbs sampling and mean-field \napproximation, and some promising empirical results are presented. \n\n1 \n\nIntroduction \n\nMany unsupervised learning problems fall under the rubric of factorial learning-(cid:173)\nthat is, the goal of the learning algorithm is to discover multiple independent causes, \nor factors, that can well characterize the observed data (Barlow, 1989; Redlich, \n1993; Hinton and Zemel, 1994; Saund, 1995). Such learning problems often arise \nnaturally in response to the actual process by which the data have been generated. \nFor instance, images may be generated by combining multiple objects, or varying \ncolors, locations, and poses, with different light sources. Similarly, speech signals \nmay result from an interaction of factors such as the tongue position, lip aperture, \nglottal state, communication line, and background noises. 
The goal of factorial learning is to invert this data generation process, discovering a representation that will both parsimoniously describe the data and reflect its underlying causes. \n\nA recent approach to factorial learning uses the Minimum Description Length (MDL) principle (Rissanen, 1989) to extract a compact representation of the input (Zemel, 1993; Hinton and Zemel, 1994). This has resulted in a learning architecture called Cooperative Vector Quantization (CVQ), in which a set of vector quantizers cooperates to reproduce the input. Within each vector quantizer a competitive learning mechanism operates to select an appropriate vector code to describe the input. The CVQ is related to algorithms based on mixture models, such as soft competitive clustering, mixtures of experts (Jordan and Jacobs, 1994), and hidden Markov models (Baum et al., 1970), in that each vector quantizer in the CVQ is itself a mixture model. However, it generalizes this notion by allowing the mixture models to cooperate in describing features in the data set, thereby creating a distributed representation of the mixture components. The learning algorithm for the CVQ uses MDL to derive a cost function composed of a reconstruction cost (e.g. sum squared error), a representation cost (negative entropy of the vector code), and a model complexity cost (description length of the network weights), which is minimized by gradient descent. \n\nIn this paper we first formulate the factorial learning problem in the framework of statistical physics (section 2). Through this formalism, we derive a novel learning algorithm for the CVQ based on the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) (section 3). The exact EM algorithm is intractable for this and related factorial learning problems; however, a tractable mean-field approximation can be derived. 
Empirical results on Gibbs sampling and the mean-field approximation are presented in section 4. \n\n2 Statistical Physics Formulation \n\nThe CVQ architecture, shown in Figure 1, is composed of hidden and observable units, where the observable units, y, are real-valued, and the hidden units are discrete and organized into vectors s_i, i = 1, ..., d. The network models a data generation process which is assumed to proceed in two stages. First, a factor is independently sampled from each hidden unit vector, s_i, according to its prior probability distribution, \\pi_i. Within each vector the factors are mutually exclusive, i.e. if s_{ij} = 1 for some j, then s_{ik} = 0 for all k \\neq j. The observable is then generated from a Gaussian distribution with mean \\sum_{i=1}^{d} W_i s_i. \n\nNotation: \nd: number of vectors \nk: number of hidden units per vector \np: number of outputs \nN: number of patterns \ns_{ij}: hidden unit j in vector i \ns_i: vector i of units (s_i = [s_{i1}, ..., s_{ik}]) \nW_i: weight matrix from s_i to the output \ny: network output (observable) \n\nFigure 1. The factorial learning architecture. [Figure: d vectors of hidden units, s_1, s_2, ..., s_d, each connected through its weight matrix to the network output y.] \n\nDefining the energy of a particular configuration of hidden states and outputs as \n\nH(s, y) = \\frac{1}{2} \\left\\| y - \\sum_{i=1}^{d} W_i s_i \\right\\|^2 - \\sum_{i=1}^{d} \\sum_{j=1}^{k} s_{ij} \\log \\pi_{ij},   (1) \n\nthe Boltzmann distribution \n\nP(s, y) = \\frac{1}{Z_{free}} \\exp\\{-H(s, y)\\},   (2) \n\nexactly recovers the probability model for the CVQ. The causes or factors are represented in the multinomial variables s_i and the observable in the multivariate Gaussian y. 
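The two-stage generative process described above can be sketched in NumPy. This is an illustration only, not the authors' code; the dimensions, unit noise variance, and equal priors are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, p = 3, 4, 8  # hypothetical: vectors, hidden units per vector, outputs
W = [rng.normal(size=(p, k)) for _ in range(d)]  # one weight matrix W_i per vector
pi = np.full((d, k), 1.0 / k)                    # equal priors pi_ij over factors

def sample_pattern():
    # Stage 1: sample one factor per hidden vector (one-hot, mutually exclusive)
    s = np.zeros((d, k))
    for i in range(d):
        s[i, rng.choice(k, p=pi[i])] = 1.0
    # Stage 2: observable is Gaussian around the summed contributions sum_i W_i s_i
    mean = sum(W[i] @ s[i] for i in range(d))
    y = mean + rng.normal(size=p)  # unit-variance noise, matching energy (1)
    return s, y

s, y = sample_pattern()
```

Note that each pattern mixes contributions from all d quantizers, which is what makes the representation distributed rather than a single mixture.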
The unclamped partition function, Z_{free}, can be evaluated by summing and integrating over all the possible configurations of the system to obtain \n\nZ_{free} = \\sum_{s} \\int_{y} \\exp\\{-H(s, y)\\} \\, dy = (2\\pi)^{p/2},   (3) \n\nwhich is constant, independent of the weights. This constant partition function results in desirable properties, such as the lack of a Boltzmann machine-like sleep phase (Neal, 1992), which we will exploit in the learning algorithm. \n\nThe system described by equation (1) can be thought of as a special form of the Boltzmann machine (Ackley et al., 1985). Expanding out the quadratic term we see that there are pairwise interaction terms between every unit. The evaluation of the partition function (3) tells us that when y is unclamped the quadratic term can be integrated out and therefore all s_i are independent. However, when y is clamped all the s_i become dependent. \n\n3 The EM Algorithm \n\nGiven a set of observable vectors, the goal of the unsupervised learning algorithm is to find weight matrices such that the network is most likely to have generated the data. If the hidden causes for each observable were known, then the weight matrices could be easily estimated. However, the hidden causes cannot be inferred unless these weight matrices are known. This chicken-and-egg problem can be solved by iterating between computing the expectation of the hidden causes given the current weights and maximizing the likelihood of the weights given these expected causes, the two steps forming the basis of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). \n\nFormally, from (2) we obtain the expected log likelihood of the parameters \\phi': \n\nQ(\\phi, \\phi') = \\langle -H(s, y) - \\log Z_{free} \\rangle_{c, \\phi},   (4) \n\nwhere \\phi denotes the current parameters, \\phi = \\{W_i\\}_{i=1}^{d}, and \\langle \\cdot \\rangle_{c, \\phi} denotes expectation given \\phi and the clamped observables. The E-step of EM consists of computing this expected log likelihood. 
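The overall EM iteration therefore alternates expectations of the hidden causes with a least-squares fit of the weights. A minimal sketch, assuming a hypothetical `e_step(Y, W)` that returns the per-pattern expectations of the concatenated causes, \\langle s \\rangle and \\langle s s^T \\rangle (the exact, Gibbs, or mean-field computations would plug in here); this is an illustration under those assumptions, not the authors' implementation:

```python
import numpy as np

def m_step(Y, Es, Ess):
    """Fit W from expected sufficient statistics.
    Y:   (N, p)       observed patterns
    Es:  (N, kd)      <s> for each pattern (concatenated s_i)
    Ess: (N, kd, kd)  <s s^T> for each pattern
    """
    ys = Y.T @ Es            # (p, kd): sum over patterns of y <s>^T
    ss = Ess.sum(axis=0)     # (kd, kd): sum over patterns of <s s^T>
    return ys @ np.linalg.pinv(ss)   # W, shape (p, kd)

def em(Y, e_step, W0, iters=10):
    W = W0
    for _ in range(iters):
        Es, Ess = e_step(Y, W)   # E-step: expected causes under current W
        W = m_step(Y, Es, Ess)   # M-step: maximize Q via least squares
    return W
```

When the causes are known exactly (expectations are one-hot), the M-step reduces to ordinary least squares and recovers the generating weights from noiseless data.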
As the only random variables are the hidden causes, this simplifies to computing the \\langle s_i \\rangle_c and \\langle s_i s_j^T \\rangle_c terms appearing in the quadratic expansion of H. Once these terms have been computed, the M-step consists of maximizing Q with respect to the parameters. Setting the derivatives to zero we obtain a linear system, which can be solved via the normal equations, \n\nW_{(p \\times kd)} = \\langle y s^T \\rangle \\langle s s^T \\rangle^{-1}, \n\nwhere s is the vector of concatenated s_i and the subscripts denote matrix size. \n\nFootnote 1: For the remainder of the paper we will ignore the second term in (1), thereby assuming equal priors on the hidden states. Relaxing this assumption and estimating the priors from the data is straightforward. \n\nFor models in which the observable is a monotonic differentiable function of \\sum_i W_i s_i, i.e. generalized linear models, least squares estimates of the weights for the M-step can be obtained iteratively by the method of scoring (McCullagh and Nelder, 1989). \n\n3.1 E-step: Exact \n\nThe difficulty arises in the E-step of the algorithm. The expectation of hidden unit j in vector i given pattern y is: \n\nP(s_{ij} = 1 | y; W)