{"title": "Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 2139, "page_last": 2147, "abstract": "While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering.", "full_text": "Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation\n\nMartin Azizyan\nMachine Learning Department\nCarnegie Mellon University\nmazizyan@cs.cmu.edu\n\nAarti Singh\nMachine Learning Department\nCarnegie Mellon University\naarti@cs.cmu.edu\n\nLarry Wasserman\nDepartment of Statistics\nCarnegie Mellon University\nlarry@stat.cmu.edu\n\nAbstract\n\nWhile several papers have investigated computationally and statistically efficient\nmethods for learning Gaussian mixtures, precise minimax bounds for their statistical\nperformance as well as fundamental limits in high-dimensional settings are not\nwell-understood. In this paper, we provide precise information theoretic bounds\non the clustering accuracy and sample complexity of learning a mixture of two\nisotropic Gaussians in high dimensions under small mean separation. 
If there is\na sparse subset of relevant dimensions that determine the mean separation, then\nthe sample complexity only depends on the number of relevant dimensions and\nmean separation, and can be achieved by a simple computationally efficient\nprocedure. Our results provide the first step of a theoretical basis for recent methods\nthat combine feature selection and clustering.\n\n1 Introduction\n\nGaussian mixture models provide a simple framework for several machine learning problems\nincluding clustering, density estimation and classification. Mixtures are especially appealing in high\ndimensional problems. Perhaps the most common use of Gaussian mixtures is for clustering. Of\ncourse, the statistical (and computational) behavior of these methods can degrade in high\ndimensions.\nInspired by the success of variable selection methods in regression, several authors have\nconsidered variable selection for clustering. However, there appear to be no theoretical results\njustifying the advantage of variable selection in the high-dimensional setting.\nTo see why some sort of variable selection might be useful, consider clustering n subjects using a\nvector of d genes for each subject. Typically d is much larger than n, which suggests that statistical\nclustering methods will perform poorly. However, it may be the case that there are only a small\nnumber of relevant genes, in which case we might expect better behavior by focusing on this small\nset of relevant genes.\nThe purpose of this paper is to provide precise bounds on clustering error with mixtures of\nGaussians. We consider both the general case where all features are relevant, and the special case where\nonly a subset of features are relevant. Mathematically, we model an irrelevant feature by requiring\nthe mean of that feature to be the same across clusters, so that the feature does not serve to\ndifferentiate the groups. 
Throughout this paper, we use the probability of misclustering an observation,\nrelative to the optimal clustering if we had known the true distribution, as our loss function. This is\nakin to using excess risk in classification.\nThis paper makes the following contributions:\n\n\u2022 We provide information theoretic bounds on the sample complexity of learning a mixture\nof two isotropic Gaussians with equal weight in the small mean separation setting that precisely\ncapture the dimension dependence, and match known sample complexity requirements\nfor some existing algorithms. This also debunks the myth that there is a gap between\nstatistical and computational complexity of learning a mixture of two isotropic Gaussians for\nsmall mean separation. Our bounds require non-standard arguments since our loss function\ndoes not satisfy the triangle inequality.\n\u2022 We consider the high-dimensional setting where only a subset of relevant dimensions determines\nthe mean separation between mixture components and show that learning is substantially\neasier as the sample complexity only depends on the sparse set of relevant dimensions.\nThis provides some theoretical basis for feature selection approaches to clustering.\n\u2022 We show that a simple computationally feasible procedure nearly achieves the information\ntheoretic sample complexity even in high-dimensional sparse mean separation settings.\n\nRelated Work. There is a long and continuing history of research on mixtures of Gaussians. A\ncomplete review is not feasible but we mention some highlights of the work most related to ours.\nPerhaps the most popular method for estimating a mixture distribution is maximum likelihood.\nUnfortunately, maximizing the likelihood is NP-Hard. This has led to a stream of work on alternative\nmethods for estimating mixtures. These new algorithms use pairwise distances, spectral methods or\nthe method of moments.\nPairwise methods are developed in Dasgupta (1999), Schulman and Dasgupta (2000) and Arora and\nKannan (2001). These methods require the mean separation to increase with dimension. The first\none requires the separation to be $\\sqrt{d}$ while the latter two improve it to $d^{1/4}$. To avoid this problem,\nVempala and Wang (2004) introduced the idea of using spectral methods for estimating mixtures\nof spherical Gaussians, which makes the required mean separation independent of dimension. The assumption\nthat the components are spherical was removed in Brubaker and Vempala (2008). Their method\nonly requires the components to be separated by a hyperplane and runs in polynomial time, but\nrequires $n = \\Omega(d^4 \\log d)$ samples. Other spectral methods include Kannan et al. (2005), Achlioptas\nand McSherry (2005) and Hsu and Kakade (2013). The latter uses clever spectral decompositions\ntogether with the method of moments to derive an effective algorithm.\nKalai et al. (2012) use the method of moments to get estimates without requiring separation between\nthe mixture components. A similar approach is given in Belkin and Sinha (2010).\nChaudhuri et al. (2009) give a modified k-means algorithm for estimating a mixture of two\nGaussians. For the large mean separation setting $\\mu > 1$, Chaudhuri et al. (2009) show that $n = \\tilde{\\Omega}(d/\\mu^2)$\nsamples are needed. They also provide an information theoretic bound on the necessary sample\ncomplexity of any algorithm which matches the sample complexity of their method (up to log factors) in\n$d$ and $\\mu$. If the mean separation is small, $\\mu < 1$, they show that $n = \\tilde{\\Omega}(d/\\mu^4)$ samples are sufficient.\nOur results for the small mean separation setting give a matching necessary condition. 
Assuming\nthe separation between the component means is not too sparse, Chaudhuri and Rao (2008) provide\nan algorithm for learning the mixture that has polynomial computational and sample complexity.\nMost of these papers are concerned with computational efficiency and do not give precise, statistical\nminimax upper and lower bounds. None of them deal with the case we are interested in, namely, a\nhigh-dimensional mixture with sparse mean separation.\nWe should also point out that the results in different papers are not necessarily comparable since\ndifferent authors use different loss functions. In this paper we use the probability of misclassifying\na future observation, relative to how the correct distribution clusters the observation, as our loss\nfunction. This should not be confused with the probability of attributing a new observation to the\nwrong component of the mixture. The latter loss does not typically tend to zero as the sample\nsize increases. Our loss is similar to the excess risk used in classification where we compare the\nmisclassification rate of a classifier to the misclassification rate of the Bayes optimal classifier.\nFinally, we remind the reader that our motivation for studying sparsely separated mixtures is that\nthis provides a model for variable selection in clustering problems. There are some relevant recent\npapers on this problem in the high-dimensional setting. Pan and Shen (2007) use penalized mixture\nmodels to do variable selection and clustering simultaneously. Witten and Tibshirani (2010) develop\na penalized version of k-means clustering. Related methods include Raftery and Dean (2006), Sun\net al. (2012) and Guo et al. (2010). The applied bioinformatics literature also contains a huge number\nof heuristic methods for this problem. 
None of these papers provide minimax bounds for the clustering\nerror or provide theoretical evidence of the benefit of using variable selection in unsupervised\nproblems such as clustering.\n\n2 Problem Setup\n\nIn this paper, we consider the simple setting of learning a mixture of two isotropic Gaussians with\nequal mixing weights,$^1$ given $n$ data points $X_1, \\ldots, X_n \\in \\mathbb{R}^d$ drawn i.i.d. from a $d$-dimensional\nmixture density function\n\n$$p_\\theta(x) = \\frac{1}{2} f(x; \\mu_1, \\sigma^2 I) + \\frac{1}{2} f(x; \\mu_2, \\sigma^2 I),$$\n\nwhere $f(\\cdot; \\mu, \\Sigma)$ is the density of $N(\\mu, \\Sigma)$, $\\sigma > 0$ is a fixed constant, and $\\theta := (\\mu_1, \\mu_2) \\in \\Theta$. We\nconsider two classes $\\Theta$ of parameters:\n\n$$\\Theta_\\lambda = \\{(\\mu_1, \\mu_2) : \\|\\mu_1 - \\mu_2\\| \\geq \\lambda\\}$$\n\n$$\\Theta_{\\lambda,s} = \\{(\\mu_1, \\mu_2) : \\|\\mu_1 - \\mu_2\\| \\geq \\lambda, \\ \\|\\mu_1 - \\mu_2\\|_0 \\leq s\\} \\subseteq \\Theta_\\lambda.$$\n\nThe first class defines mixtures where the components have a mean separation of at least $\\lambda > 0$.\nThe second class defines mixtures with mean separation of at least $\\lambda > 0$ along a sparse set of at most $s \\in \\{1, \\ldots, d\\}$\ndimensions. Also, let $P_\\theta$ denote the probability measure corresponding to $p_\\theta$.\nFor a mixture with parameter $\\theta$, the Bayes optimal classification, that is, assignment of a point\n$x \\in \\mathbb{R}^d$ to the correct mixture component, is given by the function\n\n$$F_\\theta(x) = \\operatorname{argmax}_{i \\in \\{1,2\\}} f(x; \\mu_i, \\sigma^2 I).$$\n\nGiven any other candidate assignment function $F : \\mathbb{R}^d \\to \\{1, 2\\}$, we define the loss incurred by $F$\nas\n\n$$L_\\theta(F) = \\min_\\pi P_\\theta(\\{x : F_\\theta(x) \\neq \\pi(F(x))\\})$$\n\nwhere the minimum is over all permutations $\\pi : \\{1, 2\\} \\to \\{1, 2\\}$. 
This is the probability of misclustering\nrelative to an oracle that uses the true distribution to do optimal clustering.\n\nWe denote by $\\widehat{F}_n$ any assignment function learned from the data $X_1, \\ldots, X_n$, also referred to as an\nestimator. The goal of this paper is to quantify how the minimax expected loss (worst case expected\nloss for the best estimator)\n\n$$R_n \\equiv \\inf_{\\widehat{F}_n} \\sup_{\\theta \\in \\Theta} E_\\theta L_\\theta(\\widehat{F}_n)$$\n\nscales with the number of samples $n$, the dimension of the feature space $d$, the number of relevant\ndimensions $s$, and the signal-to-noise ratio $\\lambda/\\sigma$, defined as the ratio of mean separation to standard deviation.\nWe will also demonstrate a specific estimator that achieves the minimax scaling.\nFor the purposes of this paper, we say that feature $j$ is irrelevant if $\\mu_1(j) = \\mu_2(j)$. Otherwise we\nsay that feature $j$ is relevant.\n\n3 Minimax Bounds\n\n3.1 Small mean separation setting without sparsity\n\nWe begin without assuming any sparsity, that is, all features are relevant. In this case, comparing\nthe projections of the data to the projection of the sample mean onto the first principal component\nsuffices to achieve both minimax optimal sample complexity and clustering loss.\nTheorem 1 (Upper bound). Define\n\n$$\\widehat{F}_n(x) = \\begin{cases} 1 & \\text{if } x^T v_1(\\widehat{\\Sigma}_n) \\geq \\widehat{\\mu}_n^T v_1(\\widehat{\\Sigma}_n) \\\\ 2 & \\text{otherwise,} \\end{cases}$$\n\nwhere $\\widehat{\\mu}_n = n^{-1} \\sum_{i=1}^n X_i$ is the sample mean, $\\widehat{\\Sigma}_n = n^{-1} \\sum_{i=1}^n (X_i - \\widehat{\\mu}_n)(X_i - \\widehat{\\mu}_n)^T$ is the sample\ncovariance and $v_1(\\widehat{\\Sigma}_n)$ denotes the eigenvector corresponding to the largest eigenvalue of $\\widehat{\\Sigma}_n$. If\n$n \\geq \\max(68, 4d)$, then\n\n$$\\sup_{\\theta \\in \\Theta_\\lambda} E_\\theta L_\\theta(\\widehat{F}_n) \\leq 600 \\max\\left(\\frac{4\\sigma^2}{\\lambda^2}, 1\\right) \\sqrt{\\frac{d \\log(nd)}{n}}.$$\n\nFurthermore, if $\\frac{\\lambda}{\\sigma} \\geq 2 \\max(80, 14\\sqrt{5d})$, then\n\n$$\\sup_{\\theta \\in \\Theta_\\lambda} E_\\theta L_\\theta(\\widehat{F}_n) \\leq 17 \\exp\\left(-\\frac{n}{32}\\right) + 9 \\exp\\left(-\\frac{\\lambda^2}{80\\sigma^2}\\right).$$\n\nWe note that the estimator in Theorem 1 (and that in Theorem 3) does not use knowledge of $\\sigma^2$.\nTheorem 2 (Lower bound). Assume that $d \\geq 9$ and $\\frac{\\lambda}{\\sigma} \\leq 0.2$. Then\n\n$$\\inf_{\\widehat{F}_n} \\sup_{\\theta \\in \\Theta_\\lambda} E_\\theta L_\\theta(\\widehat{F}_n) \\geq \\frac{1}{500} \\min\\left\\{ \\frac{\\sqrt{\\log 2}}{3} \\frac{\\sigma^2}{\\lambda^2} \\sqrt{\\frac{d-1}{n}}, \\ \\frac{1}{4} \\right\\}.$$\n\nWe believe that some of the constants (including the lower bound on $d$ and the exact upper bound on $\\lambda/\\sigma$)\ncan be tightened, but the results demonstrate matching scaling behavior of clustering error with $d$, $n$\nand $\\lambda/\\sigma$. Thus, we see (ignoring constants and log terms) that\n\n$$R_n \\approx \\frac{\\sigma^2}{\\lambda^2} \\sqrt{\\frac{d}{n}}, \\quad \\text{or equivalently} \\quad n \\approx \\frac{d}{\\lambda^4/\\sigma^4} \\quad \\text{for a constant target value of } R_n.$$\n\nThe result is quite intuitive: the dependence on dimension $d$ is as expected. Also we see that the rate\ndepends in a precise way on the signal-to-noise ratio $\\lambda/\\sigma$. In particular, the results imply that we\nneed $d \\leq n$.\nIn modern high-dimensional datasets, we often have $d > n$, i.e., a large number of features and not\nenough samples. However, inference is usually tractable since not all features are relevant to the\nlearning task at hand. This sparsity of the relevant feature set has been successfully exploited in supervised\nlearning problems such as regression and classification. We show next that the same is true for\nclustering under the Gaussian mixture model.\n\nFootnote 1: We believe our results should also hold in the unequal mixture weight setting without major modifications.\n\n3.2 Sparse and small mean separation setting\n\nNow we consider the case where there are $s < d$ relevant features. Let $S$ denote the set of relevant\nfeatures. We begin by constructing an estimator $\\widehat{S}_n$ of $S$ as follows. Define\n\n$$\\widehat{\\tau}_n = \\frac{1 + \\alpha}{1 - \\alpha} \\min_{i \\in \\{1, \\ldots, d\\}} \\widehat{\\Sigma}_n(i, i), \\quad \\text{where} \\quad \\alpha = \\sqrt{\\frac{6 \\log(nd)}{n}} + \\frac{2 \\log(nd)}{n}.$$\n\nNow let\n\n$$\\widehat{S}_n = \\{i \\in \\{1, \\ldots, d\\} : \\widehat{\\Sigma}_n(i, i) > \\widehat{\\tau}_n\\}.$$\n\nNow we use the same method as before, but using only the features in $\\widehat{S}_n$ identified as relevant.\nTheorem 3 (Upper bound). Define\n\n$$\\widehat{F}_n(x) = \\begin{cases} 1 & \\text{if } x_{\\widehat{S}_n}^T v_1(\\widehat{\\Sigma}_{\\widehat{S}_n}) \\geq \\widehat{\\mu}_{\\widehat{S}_n}^T v_1(\\widehat{\\Sigma}_{\\widehat{S}_n}) \\\\ 2 & \\text{otherwise} \\end{cases}$$\n\nwhere $x_{\\widehat{S}_n}$ are the coordinates of $x$ restricted to $\\widehat{S}_n$, and $\\widehat{\\mu}_{\\widehat{S}_n}$ and $\\widehat{\\Sigma}_{\\widehat{S}_n}$ are the sample mean and\ncovariance of the data restricted to $\\widehat{S}_n$. If $n \\geq \\max(68, 4s)$, $d \\geq 2$ and $\\alpha \\leq \\frac{1}{4}$, then\n\n$$\\sup_{\\theta \\in \\Theta_{\\lambda,s}} E_\\theta L_\\theta(\\widehat{F}_n) \\leq 603 \\max\\left(\\frac{16\\sigma^2}{\\lambda^2}, 1\\right) \\sqrt{\\frac{s \\log(ns)}{n}} + 220 \\frac{\\sigma \\sqrt{s}}{\\lambda} \\left(\\frac{\\log(nd)}{n}\\right)^{1/4}.$$\n\nNext we find the lower bound.\nTheorem 4 (Lower bound). Assume that $\\frac{\\lambda}{\\sigma} \\leq 0.2$, $d \\geq 17$, and that $5 \\leq s \\leq \\frac{d+3}{4}$. Then\n\n$$\\inf_{\\widehat{F}_n} \\sup_{\\theta \\in \\Theta_{\\lambda,s}} E_\\theta L_\\theta(\\widehat{F}_n) \\geq \\frac{1}{600} \\min\\left\\{ \\sqrt{\\frac{8}{45}} \\frac{\\sigma^2}{\\lambda^2} \\sqrt{\\frac{s-1}{n} \\log\\left(\\frac{d-1}{s-1}\\right)}, \\ \\frac{1}{2} \\right\\}.$$\n\nWe remark again that the constants in our bounds can be tightened, but the results suggest that\n\n$$\\frac{\\sigma}{\\lambda} \\left(\\frac{s^2 \\log d}{n}\\right)^{1/4} \\succsim R_n \\succsim \\frac{\\sigma^2}{\\lambda^2} \\sqrt{\\frac{s \\log d}{n}}, \\quad \\text{or} \\quad n = \\Omega\\left(\\frac{s^2 \\log d}{\\lambda^4/\\sigma^4}\\right) \\quad \\text{for a constant target value of } R_n.$$\n\nIn this case, we have a gap between the upper and lower bounds for the clustering loss. Also, the\nsample complexity can possibly be improved to scale as $s$ (instead of $s^2$) using a different method.\nHowever, notice that the dimension only enters logarithmically. If the number of relevant dimensions\nis small then we can expect good rates. This provides some justification for feature selection. We\nconjecture that the lower bound is tight and that the gap could be closed by using a sparse principal\ncomponent method as in Vu and Lei (2012) to find the relevant features. However, that method is\ncombinatorial and so far there is no known computationally efficient method for implementing it\nwith similar guarantees.\nWe note that the upper bound is achieved by a two-stage method that first finds the relevant dimensions\nand then estimates the clusters. This is in contrast to the methods described in the introduction\nwhich do clustering and variable selection simultaneously. This raises an interesting question: is it\nalways possible to achieve the minimax rate with a two-stage procedure or are there cases where a\nsimultaneous method outperforms a two-stage procedure? Indeed, it is possible that in the case of\ngeneral covariance matrices (non-spherical) two-stage methods might fail. 
We hope to address this\nquestion in future work.\n\n4 Proofs of the Lower Bounds\n\nThe lower bounds for estimation problems rely on a standard reduction from expected error to hypothesis\ntesting that assumes the loss function is a semi-distance, which the clustering loss isn\u2019t.\nHowever, a local triangle inequality-type bound can be shown (Proposition 2). This weaker condition\ncan then be used to lower-bound the expected loss, as stated in Proposition 1 (which follows\neasily from Fano\u2019s inequality).\nThe proof techniques of the sparse and non-sparse lower bounds are almost identical. The main difference\nis that in the non-sparse case, we use the Varshamov\u2013Gilbert bound (Lemma 1) to construct\na set of sufficiently dissimilar hypotheses, whereas in the sparse case we use an analogous result for\nsparse hypercubes (Lemma 2). See the supplementary material for complete proofs of all results.\nIn this section and the next, $\\phi$ and $\\Phi$ denote the univariate standard normal PDF and CDF.\nLemma 1 (Varshamov\u2013Gilbert bound). Let $\\Omega = \\{0, 1\\}^m$ for $m \\geq 8$. There exists a subset\n$\\{\\omega_0, \\ldots, \\omega_M\\} \\subseteq \\Omega$ such that $\\omega_0 = (0, \\ldots, 0)$, $\\rho(\\omega_i, \\omega_j) \\geq \\frac{m}{8}$ for all $0 \\leq i < j \\leq M$, and\n$M \\geq 2^{m/8}$, where $\\rho$ denotes the Hamming distance between two vectors (Tsybakov (2009)).\nLemma 2. Let $\\Omega = \\{\\omega \\in \\{0, 1\\}^m : \\|\\omega\\|_0 = s\\}$ for integers $m > s \\geq 1$ such that $s \\leq m/4$. There\nexist $\\omega_0, \\ldots, \\omega_M \\in \\Omega$ such that $\\rho(\\omega_i, \\omega_j) > s/2$ for all $0 \\leq i < j \\leq M$, and $M \\geq \\left(\\frac{m}{s}\\right)^{s/5} - 1$\n(Massart (2007), Lemma 4.10).\nProposition 1. Let $\\theta_0, \\ldots, \\theta_M \\in \\Theta_\\lambda$ (or $\\Theta_{\\lambda,s}$), $M \\geq 2$, $0 < \\alpha < 1/8$, and $\\gamma > 0$. If for all $1 \\leq i \\leq M$,\n$\\mathrm{KL}(P_{\\theta_i}, P_{\\theta_0}) \\leq \\frac{\\alpha \\log M}{n}$, and if $L_{\\theta_i}(\\widehat{F}) < \\gamma$ implies $L_{\\theta_j}(\\widehat{F}) \\geq \\gamma$ for all $0 \\leq i \\neq j \\leq M$ and\nclusterings $\\widehat{F}$, then $\\inf_{\\widehat{F}_n} \\max_{i \\in [0..M]} E_{\\theta_i} L_{\\theta_i}(\\widehat{F}_n) \\geq 0.07\\gamma$.\nProposition 2. For any $\\theta, \\theta' \\in \\Theta_\\lambda$, and any clustering $\\widehat{F}$, let $\\tau = L_\\theta(\\widehat{F}) + \\sqrt{\\mathrm{KL}(P_\\theta, P_{\\theta'})/2}$. If\n$L_\\theta(F_{\\theta'}) + \\tau \\leq 1/2$, then $L_\\theta(F_{\\theta'}) - \\tau \\leq L_{\\theta'}(\\widehat{F}) \\leq L_\\theta(F_{\\theta'}) + \\tau$.\nWe will also need the following two results. Let $\\theta = (\\mu_0 - \\mu/2, \\mu_0 + \\mu/2)$ and $\\theta' = (\\mu_0 - \\mu'/2, \\mu_0 + \\mu'/2)$\nfor $\\mu_0, \\mu, \\mu' \\in \\mathbb{R}^d$ such that $\\|\\mu\\| = \\|\\mu'\\|$, and let $\\cos\\beta = \\frac{|\\mu^T \\mu'|}{\\|\\mu\\|^2}$.\nProposition 3. Let $g(x) = \\phi(x)(\\phi(x) - x\\Phi(-x))$. Then $2g\\left(\\frac{\\|\\mu\\|}{2\\sigma}\\right) \\sin\\beta \\cos\\beta \\leq L_\\theta(F_{\\theta'}) \\leq \\frac{\\tan\\beta}{\\pi}$.\nProposition 4. Let $\\xi = \\frac{\\|\\mu\\|}{2\\sigma}$. Then $\\mathrm{KL}(P_\\theta, P_{\\theta'}) \\leq \\xi^4 (1 - \\cos\\beta)$.\n\nProof of Theorem 2. Let $\\xi = \\frac{\\lambda}{2\\sigma}$, and define $\\epsilon = \\min\\left\\{ \\frac{\\sqrt{\\log 2}}{3} \\frac{\\sigma^2}{\\lambda} \\frac{1}{\\sqrt{n}}, \\ \\frac{\\lambda}{4\\sqrt{d-1}} \\right\\}$. Define $\\lambda_0^2 = \\lambda^2 - (d-1)\\epsilon^2$.\nLet $\\Omega = \\{0, 1\\}^{d-1}$. For $\\omega = (\\omega(1), \\ldots, \\omega(d-1)) \\in \\Omega$, let $\\mu_\\omega = \\lambda_0 e_d + \\sum_{i=1}^{d-1} (2\\omega(i) - 1) \\epsilon e_i$\n(where $\\{e_i\\}_{i=1}^d$ is the standard basis for $\\mathbb{R}^d$). Let $\\theta_\\omega = \\left(-\\frac{\\mu_\\omega}{2}, \\frac{\\mu_\\omega}{2}\\right) \\in \\Theta_\\lambda$.\nBy Proposition 4, $\\mathrm{KL}(P_{\\theta_\\omega}, P_{\\theta_\\nu}) \\leq \\xi^4 (1 - \\cos\\beta_{\\omega,\\nu})$ where $\\cos\\beta_{\\omega,\\nu} = 1 - \\frac{2\\rho(\\omega,\\nu)\\epsilon^2}{\\lambda^2}$, $\\omega, \\nu \\in \\Omega$, and\n$\\rho$ is the Hamming distance, so $\\mathrm{KL}(P_{\\theta_\\omega}, P_{\\theta_\\nu}) \\leq \\xi^4 \\frac{2(d-1)\\epsilon^2}{\\lambda^2}$. By Proposition 3, since $\\cos\\beta_{\\omega,\\nu} \\geq \\frac{1}{2}$,\n\n$$L_{\\theta_\\omega}(F_{\\theta_\\nu}) \\leq \\frac{1}{\\pi} \\tan\\beta_{\\omega,\\nu} \\leq \\frac{1}{\\pi} \\frac{\\sqrt{1 + \\cos\\beta_{\\omega,\\nu}}}{\\cos\\beta_{\\omega,\\nu}} \\sqrt{1 - \\cos\\beta_{\\omega,\\nu}} \\leq \\frac{4}{\\pi} \\frac{\\sqrt{d-1}\\,\\epsilon}{\\lambda}, \\quad \\text{and}$$\n\n$$L_{\\theta_\\omega}(F_{\\theta_\\nu}) \\geq 2g(\\xi) \\sin\\beta_{\\omega,\\nu} \\cos\\beta_{\\omega,\\nu} \\geq g(\\xi) \\sqrt{1 + \\cos\\beta_{\\omega,\\nu}} \\sqrt{1 - \\cos\\beta_{\\omega,\\nu}} \\geq \\sqrt{2}\\, g(\\xi) \\frac{\\sqrt{\\rho(\\omega,\\nu)}\\,\\epsilon}{\\lambda}$$\n\nwhere $g(x) = \\phi(x)(\\phi(x) - x\\Phi(-x))$. By Lemma 1, there exist $\\omega_0, \\ldots, \\omega_M \\in \\Omega$ such that $M \\geq 2^{(d-1)/8}$\nand $\\rho(\\omega_i, \\omega_j) \\geq \\frac{d-1}{8}$ for all $0 \\leq i < j \\leq M$. For simplicity of notation, let $\\theta_i = \\theta_{\\omega_i}$ for\nall $i \\in [0..M]$. Then, for $i \\neq j \\in [0..M]$,\n\n$$\\mathrm{KL}(P_{\\theta_i}, P_{\\theta_j}) \\leq \\xi^4 \\frac{2(d-1)\\epsilon^2}{\\lambda^2}, \\quad L_{\\theta_i}(F_{\\theta_j}) \\leq \\frac{4}{\\pi} \\frac{\\sqrt{d-1}\\,\\epsilon}{\\lambda} \\quad \\text{and} \\quad L_{\\theta_i}(F_{\\theta_j}) \\geq \\frac{1}{2} g(\\xi) \\frac{\\sqrt{d-1}\\,\\epsilon}{\\lambda}.$$\n\nDefine $\\gamma = \\frac{1}{4}(g(\\xi) - 2\\xi^2) \\frac{\\sqrt{d-1}\\,\\epsilon}{\\lambda}$. Then for any $i \\neq j \\in [0..M]$, and any $\\widehat{F}$ such that $L_{\\theta_i}(\\widehat{F}) < \\gamma$,\n\n$$L_{\\theta_i}(F_{\\theta_j}) + L_{\\theta_i}(\\widehat{F}) + \\sqrt{\\frac{\\mathrm{KL}(P_{\\theta_i}, P_{\\theta_j})}{2}} < \\left(\\frac{4}{\\pi} + \\frac{1}{4}(g(\\xi) - 2\\xi^2) + \\xi^2\\right) \\frac{\\sqrt{d-1}\\,\\epsilon}{\\lambda} \\leq \\frac{1}{2}$$\n\nbecause, for $\\xi \\leq 0.1$, by definition of $\\epsilon$,\n\n$$\\left(\\frac{4}{\\pi} + \\frac{1}{4}(g(\\xi) - 2\\xi^2) + \\xi^2\\right) \\frac{\\sqrt{d-1}\\,\\epsilon}{\\lambda} \\leq 2 \\frac{\\sqrt{d-1}\\,\\epsilon}{\\lambda} \\leq \\frac{1}{2}.$$\n\nSo, by Proposition 2, $L_{\\theta_j}(\\widehat{F}) \\geq \\gamma$. Also, $\\mathrm{KL}(P_{\\theta_i}, P_{\\theta_0}) \\leq (d-1)\\xi^4 \\frac{2\\epsilon^2}{\\lambda^2} \\leq \\frac{\\log M}{9n}$ for all $1 \\leq i \\leq M$,\nbecause, by definition of $\\epsilon$, $\\xi^4 \\frac{2\\epsilon^2}{\\lambda^2} \\leq \\frac{\\log 2}{72n}$. So by Proposition 1 and the fact that $\\xi \\leq 0.1$,\n\n$$\\inf_{\\widehat{F}_n} \\max_{i \\in [0..M]} E_{\\theta_i} L_{\\theta_i}(\\widehat{F}_n) \\geq 0.07\\gamma \\geq \\frac{1}{500} \\min\\left\\{ \\frac{\\sqrt{\\log 2}}{3} \\frac{\\sigma^2}{\\lambda^2} \\sqrt{\\frac{d-1}{n}}, \\ \\frac{1}{4} \\right\\}$$\n\nand to complete the proof we use $\\sup_{\\theta \\in \\Theta_\\lambda} E_\\theta L_\\theta(\\widehat{F}_n) \\geq \\max_{i \\in [0..M]} E_{\\theta_i} L_{\\theta_i}(\\widehat{F}_n)$ for any $\\widehat{F}_n$. $\\Box$\n\nProof of Theorem 4. For simplicity, we state this construction for $\\Theta_{\\lambda,s+1}$, assuming $4 \\leq s \\leq \\frac{d-1}{4}$.\nLet $\\xi = \\frac{\\lambda}{2\\sigma}$, and define $\\epsilon = \\min\\left\\{ \\sqrt{\\frac{8}{45}} \\frac{\\sigma^2}{\\lambda} \\sqrt{\\frac{1}{n} \\log\\left(\\frac{d-1}{s}\\right)}, \\ \\frac{1}{2} \\frac{\\lambda}{\\sqrt{s}} \\right\\}$. Define $\\lambda_0^2 = \\lambda^2 - s\\epsilon^2$. Let $\\Omega =\n\\{\\omega \\in \\{0, 1\\}^{d-1} : \\|\\omega\\|_0 = s\\}$. For $\\omega = (\\omega(1), \\ldots, \\omega(d-1)) \\in \\Omega$, let $\\mu_\\omega = \\lambda_0 e_d + \\sum_{i=1}^{d-1} \\omega(i) \\epsilon e_i$\n(where $\\{e_i\\}_{i=1}^d$ is the standard basis for $\\mathbb{R}^d$). Let $\\theta_\\omega = \\left(-\\frac{\\mu_\\omega}{2}, \\frac{\\mu_\\omega}{2}\\right) \\in \\Theta_{\\lambda,s+1}$. By Lemma 2, there\nexist $\\omega_0, \\ldots, \\omega_M \\in \\Omega$ such that $M \\geq \\left(\\frac{d-1}{s}\\right)^{s/5} - 1$ and $\\rho(\\omega_i, \\omega_j) \\geq \\frac{s}{2}$ for all $0 \\leq i < j \\leq M$. The\nremainder of the proof is analogous to that of Theorem 2 with $\\gamma = \\frac{1}{4}(g(\\xi) - \\sqrt{2}\\xi^2) \\frac{\\sqrt{s}\\,\\epsilon}{\\lambda}$. $\\Box$\n\n5 Proofs of the Upper Bounds\n\nPropositions 5 and 6 below bound the error in estimating the mean and principal direction, and\ncan be obtained using standard concentration bounds and a variant of the Davis\u2013Kahan theorem.\nProposition 7 relates these errors to the clustering loss. For the sparse case, Propositions 8 and 9\nbound the added error induced by the support estimation procedure. See supplementary material for\nproof details.\nProposition 5. Let $\\theta = (\\mu_0 - \\mu, \\mu_0 + \\mu)$ for some $\\mu_0, \\mu \\in \\mathbb{R}^d$ and $X_1, \\ldots, X_n \\overset{\\text{i.i.d.}}{\\sim} P_\\theta$. For any\n$\\delta > 0$, we have $\\|\\mu_0 - \\widehat{\\mu}_n\\| \\leq \\sigma \\sqrt{\\frac{2 \\max(d, 8 \\log\\frac{1}{\\delta})}{n}} + \\|\\mu\\| \\sqrt{\\frac{2 \\log\\frac{1}{\\delta}}{n}}$ with probability at least $1 - 3\\delta$.\nProposition 6. Let $\\theta = (\\mu_0 - \\mu, \\mu_0 + \\mu)$ for some $\\mu_0, \\mu \\in \\mathbb{R}^d$ and $X_1, \\ldots, X_n \\overset{\\text{i.i.d.}}{\\sim} P_\\theta$ with\n$d > 1$ and $n \\geq 4d$. Define $\\cos\\beta = |v_1(\\sigma^2 I + \\mu\\mu^T)^T v_1(\\widehat{\\Sigma}_n)|$. For any $0 < \\delta < \\frac{d-1}{\\sqrt{e}}$, if\n$\\max\\left(\\frac{\\sigma^2}{\\|\\mu\\|^2}, \\frac{\\sigma}{\\|\\mu\\|}\\right) \\sqrt{\\frac{\\max(d, 8 \\log\\frac{1}{\\delta})}{n}} \\leq \\frac{1}{160}$, then with probability at least $1 - 12\\delta - 2\\exp\\left(-\\frac{n}{20}\\right)$,\n\n$$\\sin\\beta \\leq 14 \\max\\left(\\frac{\\sigma^2}{\\|\\mu\\|^2}, \\frac{\\sigma}{\\|\\mu\\|}\\right) \\sqrt{d} \\sqrt{\\frac{10}{n} \\log\\frac{d}{\\delta} \\max\\left(1, \\frac{10}{n} \\log\\frac{d}{\\delta}\\right)}.$$\n\nProposition 7. Let $\\theta = (\\mu_0 - \\mu, \\mu_0 + \\mu)$, and for some $x_0, v \\in \\mathbb{R}^d$ with $\\|v\\| = 1$, let $\\widehat{F}(x) = 1$\nif $x^T v \\geq x_0^T v$, and 2 otherwise. Define $\\cos\\beta = |v^T \\mu| / \\|\\mu\\|$. If $|(x_0 - \\mu_0)^T v| \\leq \\sigma\\epsilon_1 + \\|\\mu\\|\\epsilon_2$ for\nsome $\\epsilon_1 \\geq 0$ and $0 \\leq \\epsilon_2 \\leq \\frac{1}{4}$, and if $\\sin\\beta \\leq \\frac{1}{\\sqrt{5}}$, then\n\n$$L_\\theta(\\widehat{F}) \\leq \\exp\\left\\{ -\\frac{1}{2} \\max\\left(0, \\frac{\\|\\mu\\|}{2\\sigma} - 2\\epsilon_1\\right)^2 \\right\\} \\left[ 2\\epsilon_1 + \\epsilon_2 \\frac{\\|\\mu\\|}{\\sigma} + 2\\sin\\beta \\left( 2\\sin\\beta \\frac{\\|\\mu\\|}{\\sigma} + 1 \\right) \\right].$$\n\nProof. Let $r = \\left| \\frac{(x_0 - \\mu_0)^T v}{\\cos\\beta} \\right|$. Since the clustering loss is invariant to rotation and translation,\n\n$$L_\\theta(\\widehat{F}) \\leq \\frac{1}{2} \\int_{-\\infty}^{\\infty} \\frac{1}{\\sigma} \\phi\\left(\\frac{x}{\\sigma}\\right) \\left[ \\Phi\\left(\\frac{\\|\\mu\\| + |x| \\tan\\beta + r}{\\sigma}\\right) - \\Phi\\left(\\frac{\\|\\mu\\| - |x| \\tan\\beta - r}{\\sigma}\\right) \\right] dx$$\n\n$$\\leq \\int_{-\\infty}^{\\infty} \\phi(x) \\left[ \\Phi\\left(\\frac{\\|\\mu\\|}{\\sigma}\\right) - \\Phi\\left(\\frac{\\|\\mu\\| - r}{\\sigma} - |x| \\tan\\beta\\right) \\right] dx.$$\n\nSince $\\tan\\beta \\leq \\frac{1}{2}$ and $\\epsilon_2 \\leq \\frac{1}{4}$, we have $r \\leq 2\\sigma\\epsilon_1 + 2\\|\\mu\\|\\epsilon_2$, and\n$\\Phi\\left(\\frac{\\|\\mu\\|}{\\sigma}\\right) - \\Phi\\left(\\frac{\\|\\mu\\| - r}{\\sigma}\\right) \\leq 2\\left(\\epsilon_1 + \\epsilon_2 \\frac{\\|\\mu\\|}{\\sigma}\\right) \\phi\\left(\\max\\left(0, \\frac{\\|\\mu\\|}{2\\sigma} - 2\\epsilon_1\\right)\\right)$. Defining $A = \\left| \\frac{\\|\\mu\\| - r}{\\sigma} \\right|$,\n\n$$\\int_{-\\infty}^{\\infty} \\phi(x) \\left[ \\Phi\\left(\\frac{\\|\\mu\\| - r}{\\sigma}\\right) - \\Phi\\left(\\frac{\\|\\mu\\| - r}{\\sigma} - |x| \\tan\\beta\\right) \\right] dx \\leq 2 \\int_0^{\\infty} \\int_{A - x\\tan\\beta}^{A} \\phi(x)\\phi(y)\\,dy\\,dx$$\n\n$$= 2 \\int_{-A\\sin\\beta}^{\\infty} \\int_{A\\cos\\beta}^{A\\cos\\beta + (u + A\\sin\\beta)\\tan\\beta} \\phi(u)\\phi(v)\\,dv\\,du \\leq 2\\phi(A) \\tan\\beta\\,(A\\sin\\beta + 1)$$\n\n$$\\leq 2\\phi\\left(\\max\\left(0, \\frac{\\|\\mu\\|}{2\\sigma} - 2\\epsilon_1\\right)\\right) \\tan\\beta \\left( \\left(\\frac{\\|\\mu\\|}{\\sigma} + 2\\epsilon_1\\right) \\sin\\beta + 1 \\right)$$\n\nwhere we used $u = x\\cos\\beta - y\\sin\\beta$ and $v = x\\sin\\beta + y\\cos\\beta$ in the second step. The bound now\nfollows easily.\n\nProof of Theorem 1. Using Propositions 5 and 6 with $\\delta = \\frac{1}{\\sqrt{n}}$, Proposition 7, and the fact that\n$(C + x) \\exp(-\\max(0, x - 4)^2/8) \\leq (C + 6) \\exp(-\\max(0, x - 4)^2/10)$ for all $C, x > 0$,\n\n$$E_\\theta L_\\theta(\\widehat{F}_n) \\leq 600 \\max\\left(\\frac{4\\sigma^2}{\\lambda^2}, 1\\right) \\sqrt{\\frac{d \\log(nd)}{n}}$$\n\n(it is easy to verify that the bounds are decreasing with $\\|\\mu\\|$, so we use $\\|\\mu\\| = \\frac{\\lambda}{2}$ to bound the\nsupremum). In the $d = 1$ case Proposition 6 need not be applied, since the principal directions agree\ntrivially. The bound for $\\frac{\\lambda}{\\sigma} \\geq 2\\max(80, 14\\sqrt{5d})$ can be shown similarly, using $\\delta = \\exp\\left(-\\frac{n}{32}\\right)$. $\\Box$\n\nProposition 8. Let $\\theta = (\\mu_0 - \\mu, \\mu_0 + \\mu)$ for some $\\mu_0, \\mu \\in \\mathbb{R}^d$ and $X_1, \\ldots, X_n \\overset{\\text{i.i.d.}}{\\sim} P_\\theta$. For any\n$0 < \\delta < \\frac{1}{\\sqrt{e}}$ such that $\\sqrt{\\frac{6 \\log\\frac{1}{\\delta}}{n}} \\leq \\frac{1}{2}$, with probability at least $1 - 6d\\delta$, for all $i \\in [d]$,\n\n$$|\\widehat{\\Sigma}_n(i, i) - (\\sigma^2 + \\mu(i)^2)| \\leq \\sigma^2 \\sqrt{\\frac{6 \\log\\frac{1}{\\delta}}{n}} + 2\\sigma|\\mu(i)| \\sqrt{\\frac{2 \\log\\frac{1}{\\delta}}{n}} + (\\sigma + |\\mu(i)|)^2 \\frac{2 \\log\\frac{1}{\\delta}}{n}.$$\n\nProposition 9. Let $\\theta = (\\mu_0 - \\mu, \\mu_0 + \\mu)$ for some $\\mu_0, \\mu \\in \\mathbb{R}^d$ and $X_1, \\ldots, X_n \\overset{\\text{i.i.d.}}{\\sim} P_\\theta$. Define\n\n$$S(\\theta) = \\{i \\in [d] : \\mu(i) \\neq 0\\} \\quad \\text{and} \\quad \\widetilde{S}(\\theta) = \\{i \\in [d] : |\\mu(i)| \\geq 4\\sigma\\sqrt{\\alpha}\\}.$$\n\nAssume that $n \\geq 1$, $d \\geq 2$, and $\\alpha \\leq \\frac{1}{4}$. Then $\\widetilde{S}(\\theta) \\subseteq \\widehat{S}_n \\subseteq S(\\theta)$ with probability at least $1 - \\frac{6}{n}$.\n\nProof. By Proposition 8, with probability at least $1 - \\frac{6}{n}$,\n\n$$|\\widehat{\\Sigma}_n(i, i) - (\\sigma^2 + \\mu(i)^2)| \\leq \\sigma^2 \\sqrt{\\frac{6 \\log(nd)}{n}} + 2\\sigma|\\mu(i)| \\sqrt{\\frac{2 \\log(nd)}{n}} + (\\sigma + |\\mu(i)|)^2 \\frac{2 \\log(nd)}{n}$$\n\nfor all $i \\in [d]$. Assume the above event holds. If $S(\\theta) = [d]$, then of course $\\widehat{S}_n \\subseteq S(\\theta)$. Otherwise,\nfor $i \\notin S(\\theta)$, we have $(1 - \\alpha)\\sigma^2 \\leq \\widehat{\\Sigma}_n(i, i) \\leq (1 + \\alpha)\\sigma^2$, so it is clear that $\\widehat{S}_n \\subseteq S(\\theta)$. The\nremainder of the proof is trivial if $\\widetilde{S}(\\theta) = \\emptyset$ or $S(\\theta) = \\emptyset$. Assume otherwise. For any $i \\in S(\\theta)$,\n\n$$\\widehat{\\Sigma}_n(i, i) \\geq (1 - \\alpha)\\sigma^2 + \\left(1 - \\frac{2 \\log(nd)}{n}\\right) \\mu(i)^2 - 2\\alpha\\sigma|\\mu(i)|.$$\n\nBy definition, $|\\mu(i)| \\geq 4\\sigma\\sqrt{\\alpha}$ for all $i \\in \\widetilde{S}(\\theta)$, so $\\frac{(1+\\alpha)^2}{1-\\alpha} \\sigma^2 \\leq \\widehat{\\Sigma}_n(i, i)$ and $i \\in \\widehat{S}_n$ (we ignore strict\nequality above as a measure 0 event), i.e. $\\widetilde{S}(\\theta) \\subseteq \\widehat{S}_n$, which concludes the proof.\n\nProof of Theorem 3. Define $S(\\theta) = \\{i \\in [d] : \\mu(i) \\neq 0\\}$ and $\\widetilde{S}(\\theta) = \\{i \\in [d] : |\\mu(i)| \\geq 4\\sigma\\sqrt{\\alpha}\\}$.\nAssume $\\widetilde{S}(\\theta) \\subseteq \\widehat{S}_n \\subseteq S(\\theta)$ (by Proposition 9, this holds with probability at least $1 - \\frac{6}{n}$). If\n$\\widetilde{S}(\\theta) = \\emptyset$, then we simply have $E_\\theta L_\\theta(\\widehat{F}_n) \\leq \\frac{1}{2}$.\nAssume $\\widetilde{S}(\\theta) \\neq \\emptyset$. Let $\\cos\\widehat{\\beta} = |v_1(\\widehat{\\Sigma}_{\\widehat{S}_n})^T v_1(\\Sigma)|$, $\\cos\\widetilde{\\beta} = |v_1(\\Sigma_{\\widehat{S}_n})^T v_1(\\Sigma)|$, and $\\cos\\beta =\n|v_1(\\widehat{\\Sigma}_{\\widehat{S}_n})^T v_1(\\Sigma_{\\widehat{S}_n})|$ where $\\Sigma = \\sigma^2 I + \\mu\\mu^T$, and for simplicity we define $\\widehat{\\Sigma}_{\\widehat{S}_n}$ and $\\Sigma_{\\widehat{S}_n}$ to be\nthe same as $\\widehat{\\Sigma}_n$ and $\\Sigma$ in $\\widehat{S}_n$, respectively, and 0 elsewhere. Then $\\sin\\widehat{\\beta} \\leq \\sin\\widetilde{\\beta} + \\sin\\beta$, and\n\n$$\\sin\\widetilde{\\beta} = \\frac{\\|\\mu - \\mu_{\\widehat{S}_n}\\|}{\\|\\mu\\|} \\leq \\frac{\\|\\mu - \\mu_{\\widetilde{S}(\\theta)}\\|}{\\|\\mu\\|} \\leq \\frac{4\\sigma\\sqrt{\\alpha} \\sqrt{|S(\\theta)| - |\\widetilde{S}(\\theta)|}}{\\|\\mu\\|} \\leq 8 \\frac{\\sigma\\sqrt{s\\alpha}}{\\lambda}.$$\n\nUsing the same argument as the proof of Theorem 1, as long as the above bound is smaller than $\\frac{1}{2\\sqrt{5}}$,\n\n$$E_\\theta L_\\theta(\\widehat{F}_n) \\leq 600 \\max\\left( \\frac{\\sigma^2}{\\left(\\frac{\\lambda}{2} - 4\\sigma\\sqrt{s\\alpha}\\right)^2}, 1 \\right) \\sqrt{\\frac{s \\log(ns)}{n}} + 104 \\frac{\\sigma\\sqrt{s\\alpha}}{\\lambda} + \\frac{3}{n}.$$\n\nUsing the fact $L_\\theta(\\widehat{F}) \\leq \\frac{1}{2}$ always, and that $\\alpha \\leq \\frac{1}{4}$ implies $\\frac{\\log(nd)}{n} \\leq 1$, the bound follows. $\\Box$\n\n6 Conclusion\n\nWe have provided minimax lower and upper bounds for estimating high dimensional mixtures. 
The bounds show explicitly how the statistical difficulty of the problem depends on dimension d, sample size n, separation \u03bb and sparsity level s.\nFor clarity, we focused on the special case where there are two spherical components with equal mixture weights. In future work, we plan to extend the results to general mixtures of k Gaussians.\nOne of our motivations for this work is the recent interest in variable selection methods to facilitate clustering in high dimensional problems. Existing methods such as Pan and Shen (2007); Witten and Tibshirani (2010); Raftery and Dean (2006); Sun et al. (2012) and Guo et al. (2010) provide promising numerical evidence that variable selection does improve high dimensional clustering. Our results provide some theoretical basis for this idea.\nHowever, there is a gap between the results in this paper and the above methodology papers. Indeed, as of now, there is no rigorous proof that the methods in those papers outperform a two stage approach where the first stage screens for relevant features and the second stage applies standard clustering methods on the features found in the first stage. We conjecture that there are conditions under which simultaneous feature selection and clustering outperforms a two stage method. Settling this question will require the aforementioned extension of our results to the general mixture case.\n\nAcknowledgements\n\nThis research is supported in part by NSF grants IIS-1116458 and CAREER award IIS-1252412, as well as NSF Grant DMS-0806009 and Air Force Grant FA95500910373.\n\nReferences\n\nDimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In Learning Theory, pages 458\u2013469. Springer, 2005.\n\nSanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary gaussians. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 247\u2013257. ACM, 2001.\n\nMikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 103\u2013112. IEEE, 2010.\n\nS Charles Brubaker and Santosh S Vempala. Isotropic pca and affine-invariant clustering. In Building Bridges, pages 241\u2013281. Springer, 2008.\n\nKamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions using correlations and independence. In COLT, pages 9\u201320, 2008.\n\nKamalika Chaudhuri, Sanjoy Dasgupta, and Andrea Vattani. Learning mixtures of gaussians using the k-means algorithm. arXiv preprint arXiv:0912.0086, 2009.\n\nSanjoy Dasgupta. Learning mixtures of gaussians. In Foundations of Computer Science, 1999. 40th Annual Symposium on, pages 634\u2013644. IEEE, 1999.\n\nJian Guo, Elizaveta Levina, George Michailidis, and Ji Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793\u2013804, 2010.\n\nDaniel Hsu and Sham M Kakade. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 11\u201320. ACM, 2013.\n\nAdam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Disentangling gaussians. Communications of the ACM, 55(2):113\u2013120, 2012.\n\nRavindran Kannan, Hadi Salmasian, and Santosh Vempala. The spectral method for general mixture models. In Learning Theory, pages 444\u2013457. Springer, 2005.\n\nPascal Massart. Concentration inequalities and model selection. 2007.\n\nWei Pan and Xiaotong Shen. Penalized model-based clustering with application to variable selection. The Journal of Machine Learning Research, 8:1145\u20131164, 2007.\n\nAdrian E Raftery and Nema Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168\u2013178, 2006.\n\nLeonard J. Schulman and Sanjoy Dasgupta. A two-round variant of em for gaussian mixtures. In Proc. 16th UAI (Conference on Uncertainty in Artificial Intelligence), pages 152\u2013159, 2000.\n\nWei Sun, Junhui Wang, and Yixin Fang. Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electronic Journal of Statistics, 6:148\u2013167, 2012.\n\nAlexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.\n\nSantosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841\u2013860, 2004.\n\nVincent Q Vu and Jing Lei. Minimax sparse principal subspace estimation in high dimensions. arXiv preprint arXiv:1211.0373, 2012.\n\nDaniela M Witten and Robert Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 2010.\n", "award": [], "sourceid": 1051, "authors": [{"given_name": "Martin", "family_name": "Azizyan", "institution": "CMU"}, {"given_name": "Aarti", "family_name": "Singh", "institution": "CMU"}, {"given_name": "Larry", "family_name": "Wasserman", "institution": "CMU"}]}