{"title": "Global Analysis of Expectation Maximization for Mixtures of Two Gaussians", "book": "Advances in Neural Information Processing Systems", "page_first": 2676, "page_last": 2684, "abstract": "Expectation Maximization (EM) is among the most popular algorithms for estimating parameters of statistical models.  However, EM, which is an iterative algorithm based on the maximum likelihood principle, is generally only guaranteed to find stationary points of the likelihood objective, and these points may be far from any maximizer.  This article addresses this disconnect between the statistical principles behind EM and its algorithmic properties.  Specifically, it provides a global analysis of EM for specific models in which the observations comprise an i.i.d. sample from a mixture of two Gaussians.  This is achieved by (i) studying the sequence of parameters from idealized execution of EM in the infinite sample limit, and fully characterizing the limit points of the sequence in terms of the initial parameters; and then (ii) based on this convergence analysis, establishing statistical consistency (or lack thereof) for the actual sequence of parameters produced by EM.", "full_text": "Global Analysis of Expectation Maximization\n\nfor Mixtures of Two Gaussians\n\nJi Xu\n\nColumbia University\n\njixu@cs.columbia.edu\n\nDaniel Hsu\n\nColumbia University\n\ndjhsu@cs.columbia.edu\n\nArian Maleki\n\nColumbia University\n\narian@stat.columbia.edu\n\nAbstract\n\nExpectation Maximization (EM) is among the most popular algorithms for estimat-\ning parameters of statistical models. However, EM, which is an iterative algorithm\nbased on the maximum likelihood principle, is generally only guaranteed to \ufb01nd\nstationary points of the likelihood objective, and these points may be far from any\nmaximizer. This article addresses this disconnect between the statistical principles\nbehind EM and its algorithmic properties. 
Specifically, it provides a global analysis of EM for specific models in which the observations comprise an i.i.d. sample from a mixture of two Gaussians. This is achieved by (i) studying the sequence of parameters from idealized execution of EM in the infinite sample limit, and fully characterizing the limit points of the sequence in terms of the initial parameters; and then (ii) based on this convergence analysis, establishing statistical consistency (or lack thereof) for the actual sequence of parameters produced by EM.\n\n1 Introduction\n\nSince Fisher\u2019s 1922 paper (Fisher, 1922), maximum likelihood estimators (MLE) have become one of the most popular tools in many areas of science and engineering. The asymptotic consistency and optimality of MLEs have provided users with the confidence that, at least in some sense, there is no better way to estimate parameters for many standard statistical models. Despite its appealing properties, computing the MLE is often intractable. Indeed, this is the case for many latent variable models $\\{f(\\mathbf{Y}, z; \\eta)\\}$, where the latent variables $z$ are not observed. For each setting of the parameters $\\eta$, the marginal distribution of the observed data $\\mathbf{Y}$ is (for discrete $z$)\n\n$$f(\\mathbf{Y}; \\eta) = \\sum_{z} f(\\mathbf{Y}, z; \\eta) .$$\n\nIt is this marginalization over latent variables that typically causes the computational difficulty. Furthermore, many algorithms based on the MLE principle are only known to find stationary points of the likelihood objective (e.g., local maxima), and these points are not necessarily the MLE.\n\n1.1 Expectation Maximization\n\nAmong the algorithms mentioned above, Expectation Maximization (EM) has attracted more attention for the simplicity of its iterations, and its good performance in practice (Dempster et al., 1977; Redner and Walker, 1984). 
EM is an iterative algorithm for climbing the likelihood objective starting from an initial setting of the parameters $\\hat{\\eta}^{\\langle 0 \\rangle}$. In iteration $t$, EM performs the following steps:\n\nE-step: $$\\hat{Q}(\\eta \\mid \\hat{\\eta}^{\\langle t \\rangle}) \\triangleq \\sum_{z} f(z \\mid \\mathbf{Y}; \\hat{\\eta}^{\\langle t \\rangle}) \\log f(\\mathbf{Y}, z; \\eta) , \\tag{1}$$\n\nM-step: $$\\hat{\\eta}^{\\langle t+1 \\rangle} \\triangleq \\arg\\max_{\\eta} \\hat{Q}(\\eta \\mid \\hat{\\eta}^{\\langle t \\rangle}) . \\tag{2}$$\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn many applications, each step is intuitive and can be performed very efficiently.\n\nDespite the popularity of EM, as well as the numerous theoretical studies of its behavior, many important questions about its performance, such as its convergence rate and accuracy, have remained unanswered. The goal of this paper is to address these questions for specific models (described in Section 1.2) in which the observation $\\mathbf{Y}$ is an i.i.d. sample from a mixture of two Gaussians.\n\nTowards this goal, we study an idealized execution of EM in the large sample limit, where the E-step is modified to be computed over an infinitely large i.i.d. sample from a Gaussian mixture distribution in the model. In effect, in the formula for $\\hat{Q}(\\eta \\mid \\hat{\\eta}^{\\langle t \\rangle})$, we replace the observed data $\\mathbf{Y}$ with a random variable $Y \\sim f(y; \\eta^\\star)$ for some Gaussian mixture parameters $\\eta^\\star$ and then take its expectation. 
The resulting E- and M-steps in iteration $t$ are\n\nE-step: $$Q(\\eta \\mid \\eta^{\\langle t \\rangle}) \\triangleq \\mathbb{E}_{Y} \\Big[ \\sum_{z} f(z \\mid Y; \\eta^{\\langle t \\rangle}) \\log f(Y, z; \\eta) \\Big] , \\tag{3}$$\n\nM-step: $$\\eta^{\\langle t+1 \\rangle} \\triangleq \\arg\\max_{\\eta} Q(\\eta \\mid \\eta^{\\langle t \\rangle}) . \\tag{4}$$\n\nThis sequence of parameters $(\\eta^{\\langle t \\rangle})_{t \\geq 0}$ is fully determined by the initial setting $\\eta^{\\langle 0 \\rangle}$. We refer to this idealization as Population EM, a procedure considered in previous works of Srebro (2007) and Balakrishnan et al. (2014). Not only does Population EM shed light on the dynamics of EM in the large sample limit, but it can also reveal some of the fundamental limitations of EM. Indeed, if Population EM cannot provide an accurate estimate for the parameters $\\eta^\\star$, then intuitively, one would not expect the EM algorithm with a finite sample size to do so either. (To avoid confusion, we refer to the original EM algorithm run with a finite sample as Sample-based EM.)\n\n1.2 Models and Main Contributions\n\nIn this paper, we study EM in the context of two simple yet popular and well-studied Gaussian mixture models. The two models, along with the corresponding Sample-based EM and Population EM updates, are as follows:\n\nModel 1. The observation $\\mathbf{Y}$ is an i.i.d. sample from the mixture distribution $0.5 \\mathcal{N}(-\\theta^\\star, \\Sigma) + 0.5 \\mathcal{N}(\\theta^\\star, \\Sigma)$; $\\Sigma$ is a known covariance matrix in $\\mathbb{R}^d$, and $\\theta^\\star$ is the unknown parameter of interest.\n\n1. Sample-based EM iteratively updates its estimate of $\\theta^\\star$ according to the following equation:\n\n$$\\hat{\\theta}^{\\langle t+1 \\rangle} = \\frac{1}{n} \\sum_{i=1}^{n} \\big( 2 w_d(y_i, \\hat{\\theta}^{\\langle t \\rangle}) - 1 \\big) y_i , \\tag{5}$$\n\nwhere $y_1, \\ldots, y_n$ are the independent draws that comprise $\\mathbf{Y}$,\n\n$$w_d(y, \\theta) \\triangleq \\frac{\\phi_d(y - \\theta)}{\\phi_d(y - \\theta) + \\phi_d(y + \\theta)} ,$$\n\nand $\\phi_d$ is the density of a Gaussian random vector with mean $0$ and covariance $\\Sigma$.\n\n2. Population EM iteratively updates its estimate according to the following equation:\n\n$$\\theta^{\\langle t+1 \\rangle} = \\mathbb{E} \\big( 2 w_d(Y, \\theta^{\\langle t \\rangle}) - 1 \\big) Y , \\tag{6}$$\n\nwhere $Y \\sim 0.5 \\mathcal{N}(-\\theta^\\star, \\Sigma) + 0.5 \\mathcal{N}(\\theta^\\star, \\Sigma)$.\n\nModel 2. The observation $\\mathbf{Y}$ is an i.i.d. sample from the mixture distribution $0.5 \\mathcal{N}(\\mu_1^\\star, \\Sigma) + 0.5 \\mathcal{N}(\\mu_2^\\star, \\Sigma)$. Again, $\\Sigma$ is known, and $(\\mu_1^\\star, \\mu_2^\\star)$ are the unknown parameters of interest.\n\n1. Sample-based EM iteratively updates its estimate of $\\mu_1^\\star$ and $\\mu_2^\\star$ at every iteration according to the following equations:\n\n$$\\hat{\\mu}_1^{\\langle t+1 \\rangle} = \\frac{\\sum_{i=1}^{n} v_d(y_i, \\hat{\\mu}_1^{\\langle t \\rangle}, \\hat{\\mu}_2^{\\langle t \\rangle}) \\, y_i}{\\sum_{i=1}^{n} v_d(y_i, \\hat{\\mu}_1^{\\langle t \\rangle}, \\hat{\\mu}_2^{\\langle t \\rangle})} , \\tag{7}$$\n\n$$\\hat{\\mu}_2^{\\langle t+1 \\rangle} = \\frac{\\sum_{i=1}^{n} \\big( 1 - v_d(y_i, \\hat{\\mu}_1^{\\langle t \\rangle}, \\hat{\\mu}_2^{\\langle t \\rangle}) \\big) \\, y_i}{\\sum_{i=1}^{n} \\big( 1 - v_d(y_i, \\hat{\\mu}_1^{\\langle t \\rangle}, \\hat{\\mu}_2^{\\langle t \\rangle}) \\big)} , \\tag{8}$$\n\nwhere $y_1, \\ldots, y_n$ are the independent draws that comprise $\\mathbf{Y}$, and\n\n$$v_d(y, \\mu_1, \\mu_2) \\triangleq \\frac{\\phi_d(y - \\mu_1)}{\\phi_d(y - \\mu_1) + \\phi_d(y - \\mu_2)} .$$\n\n2. 
Population EM iteratively updates its estimates according to the following equations:\n\n$$\\mu_1^{\\langle t+1 \\rangle} = \\frac{\\mathbb{E} \\, v_d(Y, \\mu_1^{\\langle t \\rangle}, \\mu_2^{\\langle t \\rangle}) \\, Y}{\\mathbb{E} \\, v_d(Y, \\mu_1^{\\langle t \\rangle}, \\mu_2^{\\langle t \\rangle})} , \\tag{9}$$\n\n$$\\mu_2^{\\langle t+1 \\rangle} = \\frac{\\mathbb{E} \\big( 1 - v_d(Y, \\mu_1^{\\langle t \\rangle}, \\mu_2^{\\langle t \\rangle}) \\big) \\, Y}{\\mathbb{E} \\big( 1 - v_d(Y, \\mu_1^{\\langle t \\rangle}, \\mu_2^{\\langle t \\rangle}) \\big)} , \\tag{10}$$\n\nwhere $Y \\sim 0.5 \\mathcal{N}(\\mu_1^\\star, \\Sigma) + 0.5 \\mathcal{N}(\\mu_2^\\star, \\Sigma)$.\n\nOur main contribution in this paper is a new characterization of the stationary points and dynamics of EM in both of the above models.\n\n1. We prove convergence for the sequence of iterates for Population EM from each model: the sequence $(\\theta^{\\langle t \\rangle})_{t \\geq 0}$ converges to either $\\theta^\\star$, $-\\theta^\\star$, or $0$; the sequence $((\\mu_1^{\\langle t \\rangle}, \\mu_2^{\\langle t \\rangle}))_{t \\geq 0}$ converges to either $(\\mu_1^\\star, \\mu_2^\\star)$, $(\\mu_2^\\star, \\mu_1^\\star)$, or $((\\mu_1^\\star + \\mu_2^\\star)/2, (\\mu_1^\\star + \\mu_2^\\star)/2)$. We also fully characterize the initial parameter settings that lead to each limit point.\n\n2. Using this convergence result for Population EM, we also prove that the limits of the Sample-based EM iterates converge in probability to the unknown parameters of interest, as long as Sample-based EM is initialized at points where Population EM would converge to these parameters as well.\n\nFormal statements of our results are given in Section 2.\n\n1.3 Background and Related Work\n\nThe EM algorithm was formally introduced by Dempster et al. (1977) as a general iterative method for computing parameter estimates from incomplete data. 
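As a concrete instance of such an iteration, the Sample-based EM update (5) for Model 1 can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes the identity covariance, in which case 2 w_d(y, theta) - 1 reduces to tanh(<y, theta>), and the helper names (sample_model1, em_step, run_em), the seed, the sample size, and the two starting points are hypothetical choices made only for this example.

```python
import math
import random


def sample_model1(n, theta_star, rng):
    # Draw n points from 0.5 N(-theta_star, I) + 0.5 N(theta_star, I).
    data = []
    for _ in range(n):
        sign = 1.0 if rng.random() < 0.5 else -1.0
        data.append([sign * m + rng.gauss(0.0, 1.0) for m in theta_star])
    return data


def em_step(data, theta):
    # One Sample-based EM update, equation (5), assuming Sigma = I.
    # With Sigma = I, 2 * w_d(y, theta) - 1 equals tanh(<y, theta>).
    d = len(theta)
    new_theta = [0.0] * d
    for y in data:
        weight = math.tanh(sum(yj * tj for yj, tj in zip(y, theta)))
        for j in range(d):
            new_theta[j] += weight * y[j]
    return [v / len(data) for v in new_theta]


def run_em(data, theta0, iterations=60):
    # Iterate the update a fixed number of times (hypothetical stopping rule).
    theta = list(theta0)
    for _ in range(iterations):
        theta = em_step(data, theta)
    return theta


rng = random.Random(0)
theta_star = [2.0, 0.0]
data = sample_model1(5000, theta_star, rng)

# The two starting points differ only in the sign of <theta0, theta_star>.
plus = run_em(data, [0.5, 1.0])
minus = run_em(data, [-0.5, 1.0])
print('start with positive inner product:', plus)
print('start with negative inner product:', minus)
```

With this setup the first run should end near theta_star and the second near -theta_star, which matches the limit points listed in the first contribution above; the finite-sample estimates differ from the population limits by sampling error.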
Although EM is billed as a procedure for maximum likelihood estimation, it is known that with certain initializations, the final parameters returned by EM may be far from the MLE, both in parameter distance and in log-likelihood value (Wu, 1983). Several works characterize convergence of EM to stationary points of the log-likelihood objective under certain regularity conditions (Wu, 1983; Tseng, 2004; Vaida, 2005; Chr\u00e9tien and Hero, 2008). However, these analyses do not distinguish between global maximizers and other stationary points (except, e.g., when the likelihood function is unimodal). Thus, as an optimization algorithm for maximizing the log-likelihood objective, the \u201cworst-case\u201d performance of EM is somewhat discouraging.\n\nFor a more optimistic perspective on EM, one may consider a \u201cbest-case\u201d analysis, where (i) the data are an i.i.d. sample from a distribution in the given model, (ii) the sample size is sufficiently large, and (iii) the starting point for EM is sufficiently close to the parameters of the data generating distribution. Conditions (i) and (ii) are ubiquitous in (asymptotic) statistical analyses, and (iii) is a generous assumption that may be satisfied in certain cases. Redner and Walker (1984) show that in such a favorable scenario, EM converges to the MLE almost surely for a broad class of mixture models. Moreover, recent work of Balakrishnan et al. (2014) gives non-asymptotic convergence guarantees in certain models; importantly, these results permit one to quantify the accuracy of a pilot estimator required to effectively initialize EM. 
Thus, EM may be used in a tractable two-stage estimation procedure given a first-stage pilot estimator that can be efficiently computed.\n\nIndeed, for the special case of Gaussian mixtures, researchers in theoretical computer science and machine learning have developed efficient algorithms that deliver highly accurate parameter estimates under appropriate conditions. Several of these algorithms, starting with that of Dasgupta (1999), assume that the means of the mixture components are well-separated, roughly at distance either $d^\\alpha$ or $k^\\beta$ for some $\\alpha, \\beta > 0$ for a mixture of $k$ Gaussians in $\\mathbb{R}^d$ (Dasgupta, 1999; Arora and Kannan, 2005; Dasgupta and Schulman, 2007; Vempala and Wang, 2004; Kannan et al., 2008; Achlioptas and McSherry, 2005; Chaudhuri and Rao, 2008; Brubaker and Vempala, 2008; Chaudhuri et al., 2009a). More recent work employs the method-of-moments, which permits the means of the mixture components to be arbitrarily close, provided that the sample size is sufficiently large (Kalai et al., 2010; Belkin and Sinha, 2010; Moitra and Valiant, 2010; Hsu and Kakade, 2013; Hardt and Price, 2015). In particular, Hardt and Price (2015) characterize the information-theoretic limits of parameter estimation for mixtures of two Gaussians, and show that they are achieved by a variant of the original method-of-moments of Pearson (1894).\n\nMost relevant to this paper are works that specifically analyze EM (or variants thereof) for Gaussian mixture models, especially when the mixture components are well-separated. Xu and Jordan (1996) show favorable convergence properties (akin to super-linear convergence near the MLE) for well-separated mixtures. In a related but different vein, Dasgupta and Schulman (2007) analyze a variant of EM with a particular initialization scheme, and prove fast convergence to the true parameters, again for well-separated mixtures in high dimensions. 
For mixtures of two Gaussians, it is possible to exploit symmetries to get sharper analyses. Indeed, Chaudhuri et al. (2009b) use these symmetries to prove that a variant of Lloyd\u2019s algorithm (MacQueen, 1967; Lloyd, 1982) (which may be regarded as a hard-assignment version of EM) very quickly converges to the subspace spanned by the two mixture component means, without any separation assumption. Lastly, for the specific case of our Model 1, Balakrishnan et al. (2014) prove linear convergence of EM (as well as a gradient-based variant of EM) when started in a sufficiently small neighborhood around the true parameters, assuming a minimum separation between the mixture components. Here, the permitted size of the neighborhood grows with the separation between the components, and a recent result of Klusowski and Brinda (2016) quantitatively improves this aspect of the analysis (but still requires a minimum separation). Remarkably, by focusing attention on the local region around the true parameters, they obtain non-asymptotic bounds on the parameter estimation error. Our work is complementary to their results in that we focus on asymptotic limits rather than finite sample analysis. This allows us to provide a global analysis of EM without separation or initialization conditions, which cannot be deduced from the results of Balakrishnan et al. or Klusowski and Brinda by taking limits.\n\nFinally, two related works have appeared following the initial posting of this article (Xu et al., 2016). First, Daskalakis et al. (2016) concurrently and independently proved a convergence result comparable to our Theorem 1 for Model 1; for this case, they also provide an explicit rate of linear convergence. Second, Jin et al. 
(2016) show that similar results do not hold in general for uniform mixtures of three or more spherical Gaussian distributions: common initialization schemes for (Population or Sample-based) EM may lead to local maxima that are arbitrarily far from the global maximizer. Similar results were well-known for Lloyd\u2019s algorithm, but were not previously established for Population EM (Srebro, 2007).\n\n2 Analysis of EM for Mixtures of Two Gaussians\n\nIn this section, we present our results for Population EM and Sample-based EM under both Model 1 and Model 2, and also discuss further implications about the expected log-likelihood function. Without loss of generality, we may assume that the known covariance matrix $\\Sigma$ is the identity matrix $I_d$. Throughout, we denote the Euclidean norm by $\\|\\cdot\\|$, and the signum function by $\\mathrm{sgn}(\\cdot)$ (where $\\mathrm{sgn}(0) = 0$, $\\mathrm{sgn}(z) = 1$ if $z > 0$, and $\\mathrm{sgn}(z) = -1$ if $z < 0$).\n\n2.1 Main Results for Population EM\n\nWe present results for Population EM for both models, starting with Model 1.\n\nTheorem 1. Assume $\\theta^\\star \\in \\mathbb{R}^d \\setminus \\{0\\}$. Let $(\\theta^{\\langle t \\rangle})_{t \\geq 0}$ denote the Population EM iterates for Model 1, and suppose $\\langle \\theta^{\\langle 0 \\rangle}, \\theta^\\star \\rangle \\neq 0$. There exists $\\kappa_\\theta \\in (0, 1)$, depending only on $\\theta^\\star$ and $\\theta^{\\langle 0 \\rangle}$, such that\n\n$$\\big\\| \\theta^{\\langle t+1 \\rangle} - \\mathrm{sgn}(\\langle \\theta^{\\langle 0 \\rangle}, \\theta^\\star \\rangle) \\theta^\\star \\big\\| \\leq \\kappa_\\theta \\cdot \\big\\| \\theta^{\\langle t \\rangle} - \\mathrm{sgn}(\\langle \\theta^{\\langle 0 \\rangle}, \\theta^\\star \\rangle) \\theta^\\star \\big\\| .$$\n\nThe proof of Theorem 1 and all other omitted proofs are given in the full version of this article (Xu et al., 2016). Theorem 1 asserts that if $\\theta^{\\langle 0 \\rangle}$ is not on the hyperplane $\\{x \\in \\mathbb{R}^d : \\langle x, \\theta^\\star \\rangle = 0\\}$, then the sequence $(\\theta^{\\langle t \\rangle})_{t \\geq 0}$ converges to either $\\theta^\\star$ or $-\\theta^\\star$.\n\nOur next result shows that if $\\langle \\theta^{\\langle 0 \\rangle}, \\theta^\\star \\rangle = 0$, then $(\\theta^{\\langle t \\rangle})_{t \\geq 0}$ still converges, albeit to $0$.\n\nTheorem 2. Let $(\\theta^{\\langle t \\rangle})_{t \\geq 0}$ denote the Population EM iterates for Model 1. If $\\langle \\theta^{\\langle 0 \\rangle}, \\theta^\\star \\rangle = 0$, then\n\n$$\\theta^{\\langle t \\rangle} \\to 0 \\quad \\text{as } t \\to \\infty .$$\n\nTheorems 1 and 2 together characterize the fixed points of Population EM for Model 1, and fully specify the conditions under which each fixed point is reached. The results are simply summarized in the following corollary.\n\nCorollary 1. If $(\\theta^{\\langle t \\rangle})_{t \\geq 0}$ denotes the Population EM iterates for Model 1, then\n\n$$\\theta^{\\langle t \\rangle} \\to \\mathrm{sgn}(\\langle \\theta^{\\langle 0 \\rangle}, \\theta^\\star \\rangle) \\theta^\\star \\quad \\text{as } t \\to \\infty .$$\n\nWe now discuss Population EM with Model 2. 
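For reference, the Sample-based EM updates (7) and (8) for Model 2 can be sketched as follows. This is a minimal one-dimensional illustration under stated assumptions, not the authors' implementation: the variance is fixed to 1, and the names responsibility, em_step, and run_em, as well as the seed, sample size, true means, and starting points, are hypothetical.

```python
import math
import random


def responsibility(y, mu1, mu2):
    # v_d(y, mu1, mu2) for unit variance, written as a logistic function of
    # the log-density difference so that no tiny densities are divided.
    diff = 0.5 * ((y - mu1) ** 2 - (y - mu2) ** 2)
    if diff >= 0.0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def em_step(data, mu1, mu2):
    # One Sample-based EM update, equations (7) and (8).
    # Assumes neither component receives zero total weight.
    v = [responsibility(y, mu1, mu2) for y in data]
    total1 = sum(v)
    total2 = len(data) - total1
    new_mu1 = sum(vi * y for vi, y in zip(v, data)) / total1
    new_mu2 = sum((1.0 - vi) * y for vi, y in zip(v, data)) / total2
    return new_mu1, new_mu2


def run_em(data, mu1, mu2, iterations=50):
    # Iterate the update a fixed number of times (hypothetical stopping rule).
    for _ in range(iterations):
        mu1, mu2 = em_step(data, mu1, mu2)
    return mu1, mu2


rng = random.Random(1)
mu1_star, mu2_star = -1.0, 3.0
data = [(mu1_star if rng.random() < 0.5 else mu2_star) + rng.gauss(0.0, 1.0)
        for _ in range(4000)]

mu1, mu2 = run_em(data, 0.0, 0.5)    # mu2 - mu1 starts positive
swapped = run_em(data, 0.5, 0.0)     # same start with the labels exchanged
print('estimates:', mu1, mu2)
print('estimates from swapped start:', swapped)
```

Exchanging the two starting values exchanges the roles of the two components, so the second run should return the same pair of estimates in the opposite order. The midpoint (mu1 + mu2) / 2 and the half-difference (mu2 - mu1) / 2 of the estimates are the natural quantities to track here.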
To state our results more concisely, we use the following re-parameterization of the model parameters and Population EM iterates:\n\n$$a^{\\langle t \\rangle} \\triangleq \\frac{\\mu_1^{\\langle t \\rangle} + \\mu_2^{\\langle t \\rangle}}{2} - \\frac{\\mu_1^\\star + \\mu_2^\\star}{2} , \\qquad b^{\\langle t \\rangle} \\triangleq \\frac{\\mu_2^{\\langle t \\rangle} - \\mu_1^{\\langle t \\rangle}}{2} , \\qquad \\theta^\\star \\triangleq \\frac{\\mu_2^\\star - \\mu_1^\\star}{2} . \\tag{11}$$\n\nIf the sequence of Population EM iterates $((\\mu_1^{\\langle t \\rangle}, \\mu_2^{\\langle t \\rangle}))_{t \\geq 0}$ converges to $(\\mu_1^\\star, \\mu_2^\\star)$, then we expect $b^{\\langle t \\rangle} \\to \\theta^\\star$. Hence, we also define $\\beta^{\\langle t \\rangle}$ as the angle between $b^{\\langle t \\rangle}$ and $\\theta^\\star$, i.e.,\n\n$$\\beta^{\\langle t \\rangle} \\triangleq \\arccos \\left( \\frac{\\langle b^{\\langle t \\rangle}, \\theta^\\star \\rangle}{\\|b^{\\langle t \\rangle}\\| \\|\\theta^\\star\\|} \\right) \\in [0, \\pi] .$$\n\n(This is well-defined as long as $b^{\\langle t \\rangle} \\neq 0$ and $\\theta^\\star \\neq 0$.)\n\nWe first present results on Population EM with Model 2 under the initial condition $\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle \\neq 0$.\n\nTheorem 3. Assume $\\theta^\\star \\in \\mathbb{R}^d \\setminus \\{0\\}$. Let $(a^{\\langle t \\rangle}, b^{\\langle t \\rangle})_{t \\geq 0}$ denote the (re-parameterized) Population EM iterates for Model 2, and suppose $\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle \\neq 0$. Then $b^{\\langle t \\rangle} \\neq 0$ for all $t \\geq 0$. Furthermore, there exist $\\kappa_a \\in (0, 1)$, depending only on $\\|\\theta^\\star\\|$ and $|\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle / \\|b^{\\langle 0 \\rangle}\\||$, and $\\kappa_\\beta \\in (0, 1)$, depending only on $\\|\\theta^\\star\\|$, $\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle / \\|b^{\\langle 0 \\rangle}\\|$, $\\|a^{\\langle 0 \\rangle}\\|$, and $\\|b^{\\langle 0 \\rangle}\\|$, such that\n\n$$\\|a^{\\langle t+1 \\rangle}\\|^2 \\leq \\kappa_a^2 \\cdot \\|a^{\\langle t \\rangle}\\|^2 + \\frac{\\|\\theta^\\star\\|^2 \\sin^2(\\beta^{\\langle t \\rangle})}{4} ,$$\n\n$$\\sin(\\beta^{\\langle t+1 \\rangle}) \\leq \\kappa_\\beta^{t} \\cdot \\sin(\\beta^{\\langle 0 \\rangle}) .$$\n\nBy combining the two inequalities from Theorem 3, we conclude\n\n$$\\begin{aligned} \\|a^{\\langle t+1 \\rangle}\\|^2 &\\leq \\kappa_a^{2t} \\|a^{\\langle 0 \\rangle}\\|^2 + \\frac{\\|\\theta^\\star\\|^2}{4} \\sum_{\\tau=0}^{t} \\kappa_a^{2\\tau} \\cdot \\sin^2(\\beta^{\\langle t-\\tau \\rangle}) \\\\ &\\leq \\kappa_a^{2t} \\|a^{\\langle 0 \\rangle}\\|^2 + \\frac{\\|\\theta^\\star\\|^2}{4} \\sum_{\\tau=0}^{t} \\kappa_a^{2\\tau} \\kappa_\\beta^{2(t-\\tau)} \\cdot \\sin^2(\\beta^{\\langle 0 \\rangle}) \\\\ &\\leq \\kappa_a^{2t} \\|a^{\\langle 0 \\rangle}\\|^2 + \\frac{\\|\\theta^\\star\\|^2}{4} \\, t \\big( \\max\\{\\kappa_a, \\kappa_\\beta\\} \\big)^{t} \\sin^2(\\beta^{\\langle 0 \\rangle}) . \\end{aligned}$$\n\nTheorem 3 shows that the re-parameterized Population EM iterates converge, at a linear rate, to the average of the two means $(\\mu_1^\\star + \\mu_2^\\star)/2$, as well as the line spanned by $\\theta^\\star$. The theorem, however, does not provide any information on the convergence of the magnitude of $b^{\\langle t \\rangle}$ to the magnitude of $\\theta^\\star$. This is given in the next theorem.\n\nTheorem 4. Assume $\\theta^\\star \\in \\mathbb{R}^d \\setminus \\{0\\}$. Let $(a^{\\langle t \\rangle}, b^{\\langle t \\rangle})_{t \\geq 0}$ denote the (re-parameterized) Population EM iterates for Model 2, and suppose $\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle \\neq 0$. Then there exist $T_0 > 0$, $\\kappa_b \\in (0, 1)$, and $c_b > 0$, all depending only on $\\|\\theta^\\star\\|$, $|\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle / \\|b^{\\langle 0 \\rangle}\\||$, $\\|a^{\\langle 0 \\rangle}\\|$, and $\\|b^{\\langle 0 \\rangle}\\|$, such that\n\n$$\\big\\| b^{\\langle t+1 \\rangle} - \\mathrm{sgn}(\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle) \\theta^\\star \\big\\|^2 \\leq \\kappa_b^2 \\cdot \\big\\| b^{\\langle t \\rangle} - \\mathrm{sgn}(\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle) \\theta^\\star \\big\\|^2 + c_b \\cdot \\|a^{\\langle t \\rangle}\\| \\qquad \\forall t > T_0 .$$\n\nIf $\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle = 0$, then we show convergence of the (re-parameterized) Population EM iterates to the degenerate solution $(0, 0)$.\n\nTheorem 5. Let $(a^{\\langle t \\rangle}, b^{\\langle t \\rangle})_{t \\geq 0}$ denote the (re-parameterized) Population EM iterates for Model 2. If $\\langle b^{\\langle 0 \\rangle}, \\theta^\\star \\rangle = 0$, then\n\n$$(a^{\\langle t \\rangle}, b^{\\langle t \\rangle}) \\to (0, 0) \\quad \\text{as } t \\to \\infty .$$\n\nTheorems 3, 4, and 5 together characterize the fixed points of Population EM for Model 2, and fully specify the conditions under which each fixed point is reached. The results are simply summarized in the following corollary.\n\nCorollary 2. 
If $(a^{\\langle t \\rangle}, b^{\\langle t \\rangle})_{t \\geq 0}$ denote the (re-parameterized) Population EM iterates for Model 2, then\n\n$$a^{\\langle t \\rangle} \\to \\frac{\\mu_1^\\star + \\mu_2^\\star}{2} \\quad \\text{as } t \\to \\infty ,$$\n\n$$b^{\\langle t \\rangle} \\to \\mathrm{sgn}(\\langle b^{\\langle 0 \\rangle}, \\mu_2^\\star - \\mu_1^\\star \\rangle) \\, \\frac{\\mu_2^\\star - \\mu_1^\\star}{2} \\quad \\text{as } t \\to \\infty .$$\n\n2.2 Main Results for Sample-based EM\n\nUsing the results on Population EM presented in the above section, we can now establish consistency of (Sample-based) EM. We focus attention on Model 2, as the same results for Model 1 easily follow as a corollary. First, we state a simple connection between the Population EM and Sample-based EM iterates.\n\nTheorem 6. Suppose Population EM and Sample-based EM for Model 2 have the same initial parameters: $\\hat{\\mu}_1^{\\langle 0 \\rangle} = \\mu_1^{\\langle 0 \\rangle}$ and $\\hat{\\mu}_2^{\\langle 0 \\rangle} = \\mu_2^{\\langle 0 \\rangle}$. Then for each iteration $t \\geq 0$,\n\n$$\\hat{\\mu}_1^{\\langle t \\rangle} \\to \\mu_1^{\\langle t \\rangle} \\quad \\text{and} \\quad \\hat{\\mu}_2^{\\langle t \\rangle} \\to \\mu_2^{\\langle t \\rangle} \\quad \\text{as } n \\to \\infty ,$$\n\nwhere convergence is in probability.\n\nNote that Theorem 6 does not necessarily imply that the fixed point of Sample-based EM (when initialized at $(\\hat{\\mu}_1^{\\langle 0 \\rangle}, \\hat{\\mu}_2^{\\langle 0 \\rangle}) = (\\mu_1^{\\langle 0 \\rangle}, \\mu_2^{\\langle 0 \\rangle})$) is the same as that of Population EM. It is conceivable that as $t \\to \\infty$, the discrepancy between (the iterates of) Sample-based EM and Population EM increases. We show that this is not the case: the fixed points of Sample-based EM indeed converge to the fixed points of Population EM.\n\nTheorem 7. 
Suppose Population EM and Sample-based EM for Model 2 have the same initial parameters: $\\hat{\\mu}_1^{\\langle 0 \\rangle} = \\mu_1^{\\langle 0 \\rangle}$ and $\\hat{\\mu}_2^{\\langle 0 \\rangle} = \\mu_2^{\\langle 0 \\rangle}$. If $\\langle \\mu_2^{\\langle 0 \\rangle} - \\mu_1^{\\langle 0 \\rangle}, \\theta^\\star \\rangle \\neq 0$, then\n\n$$\\limsup_{t \\to \\infty} |\\hat{\\mu}_1^{\\langle t \\rangle} - \\mu_1^{\\langle t \\rangle}| \\to 0 \\quad \\text{and} \\quad \\limsup_{t \\to \\infty} |\\hat{\\mu}_2^{\\langle t \\rangle} - \\mu_2^{\\langle t \\rangle}| \\to 0 \\quad \\text{as } n \\to \\infty ,$$\n\nwhere convergence is in probability.\n\n2.3 Population EM and Expected Log-likelihood\n\nDo the results we derived in the last section regarding the performance of EM provide any information on the performance of other ascent algorithms, such as gradient ascent, that aim to maximize the log-likelihood function? To address this question, we show how our analysis can determine the stationary points of the expected log-likelihood and characterize the shape of the expected log-likelihood in a neighborhood of the stationary points. Let $G(\\eta)$ denote the expected log-likelihood, i.e.,\n\n$$G(\\eta) \\triangleq \\mathbb{E}(\\log f_\\eta(Y)) = \\int f(y; \\eta^*) \\log f(y; \\eta) \\, dy ,$$\n\nwhere $\\eta^*$ denotes the true parameter value. 
Also consider the following standard regularity conditions:\n\nR1 The family of probability density functions $f(y; \\eta)$ has common support.\n\nR2 $\\nabla_\\eta \\int f(y; \\eta^*) \\log f(y; \\eta) \\, dy = \\int f(y; \\eta^*) \\nabla_\\eta \\log f(y; \\eta) \\, dy$, where $\\nabla_\\eta$ denotes the gradient with respect to $\\eta$.\n\nR3 $\\nabla_\\eta \\, \\mathbb{E} \\big[ \\sum_{z} f(z \\mid Y; \\eta^{\\langle t \\rangle}) \\log f(Y, z; \\eta) \\big] = \\mathbb{E} \\big[ \\sum_{z} f(z \\mid Y; \\eta^{\\langle t \\rangle}) \\nabla_\\eta \\log f(Y, z; \\eta) \\big]$.\n\nThese conditions can be easily confirmed for many models including the Gaussian mixture models. The following lemma connects the fixed points of the Population EM and the stationary points of the expected log-likelihood.\n\nLemma 1. Let $\\bar{\\eta} \\in \\mathbb{R}^d$ denote a stationary point of $G(\\eta)$. Also assume that $Q(\\eta \\mid \\eta^{\\langle t \\rangle})$ has a unique and finite stationary point in terms of $\\eta$ for every $\\eta^{\\langle t \\rangle}$, and this stationary point is its global maximizer. Then, if the model satisfies conditions R1\u2013R3, and the Population EM algorithm is initialized at $\\bar{\\eta}$, it will stay at $\\bar{\\eta}$. Conversely, any fixed point of Population EM is a stationary point of $G(\\eta)$.\n\nProof. Let $\\bar{\\eta}$ denote a stationary point of $G(\\eta)$. 
We first prove that $\\bar{\\eta}$ is a stationary point of $Q(\\eta \\mid \\bar{\\eta})$.\n\n$$\\begin{aligned} \\nabla_\\eta Q(\\eta \\mid \\bar{\\eta}) \\big|_{\\eta = \\bar{\\eta}} &= \\int \\sum_{z} f(z \\mid y; \\bar{\\eta}) \\frac{\\nabla_\\eta f(y, z; \\eta) \\big|_{\\eta = \\bar{\\eta}}}{f(y, z; \\bar{\\eta})} f(y; \\eta^*) \\, dy \\\\ &= \\int \\sum_{z} \\frac{\\nabla_\\eta f(y, z; \\eta) \\big|_{\\eta = \\bar{\\eta}}}{f(y; \\bar{\\eta})} f(y; \\eta^*) \\, dy \\\\ &= \\int \\frac{\\nabla_\\eta f(y; \\eta) \\big|_{\\eta = \\bar{\\eta}}}{f(y; \\bar{\\eta})} f(y; \\eta^*) \\, dy = 0 , \\end{aligned}$$\n\nwhere the second equality uses $f(z \\mid y; \\bar{\\eta}) = f(y, z; \\bar{\\eta}) / f(y; \\bar{\\eta})$, and the last equality uses the fact that $\\bar{\\eta}$ is a stationary point of $G(\\eta)$. Since $Q(\\eta \\mid \\bar{\\eta})$ has a unique stationary point, and we have assumed that the unique stationary point is its global maximizer, Population EM will stay at that point. The proof of the other direction is similar.\n\nRemark 1. The fact that $\\eta^*$ is the global maximizer of $G(\\eta)$ is well-known in the statistics and machine learning literature (e.g., Conniffe, 1987). Furthermore, the fact that $\\eta^*$ is a global maximizer of $Q(\\eta \\mid \\eta^*)$ is known as the self-consistency property (Balakrishnan et al., 2014).\n\nIt is straightforward to confirm the conditions of Lemma 1 for mixtures of Gaussians. This lemma confirms that Population EM may be trapped at any local maximum. However, less intuitively, it may get stuck at local minima or saddle points as well. Our next result characterizes the stationary points of $G(\\theta)$ for Model 1.\n\nCorollary 3. $G(\\theta)$ has only three stationary points. If $d = 1$ (so $\\theta \\in \\mathbb{R}$ is a scalar), then $0$ is a local minimum of $G(\\theta)$, while $\\theta^\\star$ and $-\\theta^\\star$ are global maxima. If $d > 1$, then $0$ is a saddle point, and $\\theta^\\star$ and $-\\theta^\\star$ are global maxima.\n\nThe proof is a straightforward result of Lemma 1 and Corollary 1. 
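The one-dimensional case of Corollary 3 can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes d = 1 with unit variance, approximates the integral defining G by the trapezoidal rule on a hypothetical interval [-12, 12], and uses a hypothetical true parameter theta_star = 1.5 and a coarse grid of candidate values; the function names are not from the paper.

```python
import math


def mixture_density(y, theta):
    # Density of 0.5 N(-theta, 1) + 0.5 N(theta, 1) at the point y.
    c = 0.5 / math.sqrt(2.0 * math.pi)
    return c * (math.exp(-0.5 * (y - theta) ** 2)
                + math.exp(-0.5 * (y + theta) ** 2))


def expected_loglik(theta, theta_star, lo=-12.0, hi=12.0, steps=2000):
    # Trapezoidal approximation of
    # G(theta) = integral of f(y; theta_star) * log f(y; theta) dy.
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        y = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += (weight * mixture_density(y, theta_star)
                  * math.log(mixture_density(y, theta)))
    return total * h


theta_star = 1.5
grid = [i / 10.0 for i in range(-30, 31)]
values = [expected_loglik(t, theta_star) for t in grid]

best = grid[values.index(max(values))]           # grid maximizer of G
g_zero = expected_loglik(0.0, theta_star)        # value at the stationary point 0
g_near = expected_loglik(0.1, theta_star)        # value just next to 0
print('grid maximizer:', best)
print('G(0.1) - G(0):', g_near - g_zero)
```

On this grid the maximizer should sit at theta_star or its negative, and G should increase when moving away from 0, which is the local-minimum behavior the corollary describes for d = 1.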
The phenomenon that Population EM may get stuck in local minima or saddle points also happens in Model 2. We can employ Corollary 2 and Lemma 1 to explain the shape of the expected log-likelihood function $G$. To simplify the notation, we consider the re-parametrization $a \\triangleq \\frac{\\mu_1 + \\mu_2}{2}$ and $b \\triangleq \\frac{\\mu_2 - \\mu_1}{2}$.\n\nCorollary 4. $G(a, b)$ has three stationary points:\n\n$$\\left( \\frac{\\mu_1^\\star + \\mu_2^\\star}{2} , \\frac{\\mu_2^\\star - \\mu_1^\\star}{2} \\right) , \\quad \\left( \\frac{\\mu_1^\\star + \\mu_2^\\star}{2} , \\frac{\\mu_1^\\star - \\mu_2^\\star}{2} \\right) , \\quad \\text{and} \\quad \\left( \\frac{\\mu_1^\\star + \\mu_2^\\star}{2} , \\frac{\\mu_1^\\star + \\mu_2^\\star}{2} \\right) .$$\n\nThe first two points are global maxima. The third point is a saddle point.\n\n3 Concluding Remarks\n\nOur analysis of Population EM and Sample-based EM shows that the EM algorithm can, at least for the Gaussian mixture models studied in this work, compute statistically consistent parameter estimates. Previous analyses of EM only established such results for specific methods of initializing EM (e.g., Dasgupta and Schulman, 2007; Balakrishnan et al., 2014); our results show that they are not really necessary in the large sample limit. However, in any real scenario, the large sample limit may not accurately characterize the behavior of EM. Therefore, these specific methods for initialization, as well as non-asymptotic analysis, are clearly still needed to understand and effectively apply EM.\n\nThere are several interesting directions concerning EM that we hope to pursue in follow-up work. The first considers the behavior of EM when the dimension $d = d_n$ may grow with the sample size $n$. Our proof of Theorem 7 reveals that the parameter error of the $t$-th iterate (in Euclidean norm) is of the order $\\sqrt{d/n}$ as $t \\to \\infty$. Therefore, we conjecture that the theorem still holds as long as $d_n = o(n)$. This would be consistent with results from statistical physics on the MLE for Gaussian mixtures, which characterize the behavior when $d_n \\propto n$ as $n \\to \\infty$ (Barkai and Sompolinsky, 1994). Another natural direction is to extend these results to more general Gaussian mixture models (e.g., with unequal mixing weights or unequal covariances) and other latent variable models.\n\nAcknowledgements. The second named author thanks Yash Deshpande and Sham Kakade for many helpful initial discussions. JX and AM were partially supported by NSF grant CCF-1420328. DH was partially supported by NSF grant DMREF-1534910 and a Sloan Fellowship.\n\nReferences\n\nD. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Eighteenth Annual Conference on Learning Theory, pages 458\u2013469, 2005.\n\nS. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability, 15(1A):69\u201392, 2005.\n\nS. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, August 2014.\n\nN. Barkai and H. Sompolinsky. Statistical mechanics of the maximum-likelihood density estimation. Physical Review E, 50(3):1766\u20131769, Sep 1994.\n\nM. Belkin and K. Sinha. Polynomial learning of distribution families. In Fifty-First Annual IEEE Symposium on Foundations of Computer Science, pages 103\u2013112, 2010.\n\nS. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In Forty-Ninth Annual IEEE Symposium on Foundations of Computer Science, 2008.\n\nK. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In Twenty-First Annual Conference on Learning Theory, pages 9\u201320, 2008.\n\nK. Chaudhuri, S. M. Kakade, K. 
Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, 2009a.\n\nK. Chaudhuri, S. Dasgupta, and A. Vattani. Learning mixtures of Gaussians using the k-means algorithm. CoRR, abs/0912.0086, 2009b.\n\nS. Chr\u00e9tien and A. O. Hero. On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics, 12:308\u2013326, May 2008.\n\nD. Conniffe. Expected maximum log likelihood estimation. Journal of the Royal Statistical Society. Series D, 36(4):317\u2013329, 1987.\n\nS. Dasgupta. Learning mixtures of Gaussians. In Fortieth Annual IEEE Symposium on Foundations of Computer Science, pages 634\u2013644, 1999.\n\nS. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8(Feb):203\u2013226, 2007.\n\nC. Daskalakis, C. Tzamos, and M. Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. arXiv preprint arXiv:1609.00368, September 2016.\n\nA. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B, 39:1\u201338, 1977.\n\nR. A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society, London, A., 222:309\u2013368, 1922.\n\nM. Hardt and E. Price. Tight bounds for learning a mixture of two Gaussians. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pages 753\u2013760, 2015.\n\nD. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Fourth Innovations in Theoretical Computer Science, 2013.\n\nC. Jin, Y. Zhang, S. Balakrishnan, M. J. Wainwright, and M. Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. arXiv preprint arXiv:1609.00978, September 2016.\n\nA. T. Kalai, A. Moitra, and G. 
Valiant. Efficiently learning mixtures of two Gaussians. In Forty-second ACM Symposium on Theory of Computing, pages 553\u2013562, 2010.\n\nR. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. SIAM Journal on Computing, 38(3):1141\u20131156, 2008.\n\nJ. M. Klusowski and W. D. Brinda. Statistical guarantees for estimating the centers of a two-component Gaussian mixture by EM. arXiv preprint arXiv:1608.02280, August 2016.\n\nS. P. Lloyd. Least squares quantization in PCM. IEEE Trans. Information Theory, 28(2):129\u2013137, 1982.\n\nJ. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281\u2013297. University of California Press, 1967.\n\nA. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Fifty-First Annual IEEE Symposium on Foundations of Computer Science, pages 93\u2013102, 2010.\n\nK. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society, London, A., 185:71\u2013110, 1894.\n\nR. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195\u2013239, 1984.\n\nN. Srebro. Are there local maxima in the infinite-sample likelihood of Gaussian mixture estimation? In 20th Annual Conference on Learning Theory, pages 628\u2013629, 2007.\n\nP. Tseng. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research, 29(1):27\u201344, Feb 2004.\n\nF. Vaida. Parameter convergence for EM and MM. Statistica Sinica, 15, 2005.\n\nS. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841\u2013860, 2004.\n\nC. F. J. Wu. On the convergence properties of the EM algorithm. 
The Annals of Statistics, 11(1):95\u2013103, Mar 1983.\n\nJ. Xu, D. Hsu, and A. Maleki. Global analysis of Expectation Maximization for mixtures of two Gaussians. arXiv preprint arXiv:1608.07630, 2016.\n\nL. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8:129\u2013151, 1996.", "award": [], "sourceid": 1374, "authors": [{"given_name": "Ji", "family_name": "Xu", "institution": "Columbia university"}, {"given_name": "Daniel", "family_name": "Hsu", "institution": "Columbia University"}, {"given_name": "Arian", "family_name": "Maleki", "institution": "Columbia University"}]}