{"title": "Moment-based Uniform Deviation Bounds for $k$-means and Friends", "book": "Advances in Neural Information Processing Systems", "page_first": 2940, "page_last": 2948, "abstract": "Suppose $k$ centers are fit to $m$ points by heuristically minimizing the $k$-means cost; what is the corresponding fit over the source distribution?  This question is resolved here for distributions with $p\\geq 4$ bounded moments; in particular, the difference between the sample cost and distribution cost decays with $m$ and $p$ as $m^{\\min\\{-1/4, -1/2+2/p\\}}$.  The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions.  To further demonstrate this mechanism, a soft clustering variant of $k$-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum.  Lastly, a rate with refined constants is provided for $k$-means instances possessing some cluster structure.", "full_text": "Moment-based Uniform Deviation Bounds for k-means and Friends\n\nMatus Telgarsky\n\nSanjoy Dasgupta\n\nComputer Science and Engineering, UC San Diego\n\n{mtelgars,dasgupta}@cs.ucsd.edu\n\nAbstract\n\nSuppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with $p \\geq 4$ bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as $m^{\\min\\{-1/4, -1/2+2/p\\}}$. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. 
To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for k-means instances possessing some cluster structure.\n\n1 Introduction\n\nSuppose a set of k centers $\\{p_i\\}_{i=1}^k$ is selected by approximate minimization of k-means cost; how does the fit over the sample compare with the fit over the distribution? Concretely: given m points sampled from a source distribution $\\rho$, what can be said about the quantities\n\n$\\left| \\frac{1}{m} \\sum_{j=1}^m \\min_i \\|x_j - p_i\\|_2^2 - \\int \\min_i \\|x - p_i\\|_2^2 \\, d\\rho(x) \\right|$ (k-means), (1.1)\n\n$\\left| \\frac{1}{m} \\sum_{j=1}^m \\ln\\left( \\sum_{i=1}^k \\alpha_i p_{\\theta_i}(x_j) \\right) - \\int \\ln\\left( \\sum_{i=1}^k \\alpha_i p_{\\theta_i}(x) \\right) d\\rho(x) \\right|$ (soft k-means), (1.2)\n\nwhere each $p_{\\theta_i}$ denotes the density of a Gaussian with a covariance matrix whose eigenvalues lie in some closed positive interval.\n\nThe literature offers a wealth of information related to this question. For k-means, there is firstly a consistency result: under some identifiability conditions, the global minimizer over the sample will converge to the global minimizer over the distribution as the sample size m increases [1]. Furthermore, if the distribution is bounded, standard tools can provide deviation inequalities [2, 3, 4]. 
For the second problem, which is maximum likelihood of a Gaussian mixture (thus amenable to EM [5]), classical results regarding the consistency of maximum likelihood again provide that, under some identifiability conditions, the optimal solutions over the sample converge to the optimum over the distribution [6].\n\nThe task here is thus: to provide finite sample guarantees for these problems, but eschewing boundedness, subgaussianity, and similar assumptions in favor of moment assumptions.\n\n1.1 Contribution\n\nThe results here are of the following form: given m examples from a distribution with a few bounded moments, and any set of parameters beating some fixed cost c, the corresponding deviations in cost (as in eq. (1.1) and eq. (1.2)) approach $O(m^{-1/2})$ with the availability of higher moments.\n\n\u2022 In the case of k-means (cf. Corollary 3.1), $p \\geq 4$ moments suffice, and the rate is $O(m^{\\min\\{-1/4, -1/2+2/p\\}})$. For Gaussian mixtures (cf. Theorem 5.1), $p \\geq 8$ moments suffice, and the rate is $O(m^{-1/2+3/p})$.\n\u2022 The parameter c allows these guarantees to hold for heuristics. For instance, suppose k centers are output by Lloyd's method. While Lloyd's method carries no optimality guarantees, the results here hold for the output of Lloyd's method simply by setting c to be the variance of the data, equivalently the k-means cost with a single center placed at the mean.\n\u2022 The k-means and Gaussian mixture costs are only well-defined when the source distribution has $p \\geq 2$ moments. 
The condition of $p \\geq 4$ moments, meaning the variance has a variance, allows consideration of many heavy-tailed distributions, which are ruled out by boundedness and subgaussianity assumptions.\n\nThe main technical byproduct of the proof is a mechanism to deal with the unboundedness of the cost function; this technique will be detailed in Section 3, but the difficulty and its resolution can be easily sketched here.\n\nFor a single set of centers $P$, the deviations in eq. (1.1) may be controlled with an application of Chebyshev's inequality. But this does not immediately grant deviation bounds on another set of centers $P'$, even if $P$ and $P'$ are very close: for instance, the difference between the two costs will grow as successively farther and farther away points are considered.\n\nThe resolution is to simply note that there is so little probability mass in those far reaches that the cost there is irrelevant. Consider a single center $p$ (and assume $x \\mapsto \\|x - p\\|_2^2$ is integrable); the dominated convergence theorem grants\n\n$\\int_{B_i} \\|x - p\\|_2^2 \\, d\\rho(x) \\to \\int \\|x - p\\|_2^2 \\, d\\rho(x)$, where $B_i := \\{x \\in \\mathbb{R}^d : \\|x - p\\|_2 \\leq i\\}$.\n\nIn other words, a ball $B_i$ may be chosen so that $\\int_{B_i^c} \\|x - p\\|_2^2 \\, d\\rho(x) \\leq 1/1024$. Now consider some $p'$ with $\\|p - p'\\|_2 \\leq i$. Then\n\n$\\int_{B_i^c} \\|x - p'\\|_2^2 \\, d\\rho(x) \\leq \\int_{B_i^c} (\\|x - p\\|_2 + \\|p - p'\\|_2)^2 \\, d\\rho(x) \\leq 4 \\int_{B_i^c} \\|x - p\\|_2^2 \\, d\\rho(x) \\leq \\frac{1}{256}$.\n\nIn this way, a single center may control the outer deviations of whole swaths of other centers. Indeed, those choices outperforming the reference score c will provide a suitable swath. Of course, it would be nice to get a sense of the size of $B_i$; this however is provided by the moment assumptions.\n\nThe general strategy is thus to split consideration into outer deviations, and local deviations. The local deviations may be controlled by standard techniques. 
To control outer deviations, a single pair of dominating costs (a lower bound and an upper bound) is controlled.\n\nThis technique can be found in the proof of the consistency of k-means due to Pollard [1]. The present work shows it can also provide finite sample guarantees, and moreover be applied outside hard clustering.\n\nThe content here is organized as follows. The remainder of the introduction surveys related work, and subsequently Section 2 establishes some basic notation. The core deviation technique, termed outer bracketing (to connect it to the bracketing technique from empirical process theory), is presented along with the deviations of k-means in Section 3. The technique is then applied in Section 5 to a soft clustering variant, namely log likelihood of Gaussian mixtures having bounded spectra. As a reprieve between these two heavier bracketing sections, Section 4 provides a simple refinement for k-means which can adapt to cluster structure.\n\nAll proofs are deferred to the appendices; however, the construction and application of outer brackets is sketched in the text.\n\n1.2 Related Work\n\nAs referenced earlier, Pollard's work deserves special mention, both since it can be seen as the origin of the outer bracketing technique, and since it handled k-means under similarly slight assumptions (just two moments, rather than the four here) [1, 7]. The present work hopes to be a spiritual successor, providing finite sample guarantees, and adapting technique to a soft clustering problem.\n\nIn the machine learning community, statistical guarantees for clustering have been extensively studied under the topic of clustering stability [4, 8, 9, 10]. One formulation of stability is: if parameters are learned over two samples, how close are they? 
The technical component of these works\nfrequently involves \ufb01nite sample guarantees, which in the works listed here make a boundedness\nassumption, or something similar (for instance, the work of Shamir and Tishby [9] requires the cost\nfunction to satisfy a bounded differences condition). Amongst these \ufb01nite sample guarantees, the\n\ufb01nite sample guarantees due to Rakhlin and Caponnetto [4] are similar to the development here after\nthe invocation of the outer bracket: namely, a covering argument controls deviations over a bounded\nset. The results of Shamir and Tishby [10] do not make a boundedness assumption, but the main\nresults are not \ufb01nite sample guarantees; in particular, they rely on asymptotic results due to Pollard\n[7].\nThere are many standard tools which may be applied to the problems here, particularly if a bound-\nedness assumption is made [11, 12]; for instance, Lugosi and Zeger [2] use tools from VC theory to\nhandle k-means in the bounded case. Another interesting work, by Ben-david [3], develops special-\nized tools to measure the complexity of certain clustering problems; when applied to the problems\nof the type considered here, a boundedness assumption is made.\nA few of the above works provide some negative results and related commentary on the topic of\nuniform deviations for distributions with unbounded support [10, Theorem 3 and subsequent discus-\nsion] [3, Page 5 above De\ufb01nition 2]. The primary \u201cloophole\u201d here is to constrain consideration to\nthose solutions beating some reference score c. It is reasonable to guess that such a condition en-\ntails that a few centers must lie near the bulk of the distribution\u2019s mass; making this guess rigorous\nis the \ufb01rst step here both for k-means and for Gaussian mixtures, and moreover the same conse-\nquence was used by Pollard for the consistency of k-means [1]. 
In Pollard's work, only optimal choices were considered, but the same argument relaxes to arbitrary c, which can thus encapsulate heuristic schemes, and not just nearly optimal ones. (The secondary loophole is to make moment assumptions; these sufficiently constrain the structure of the distribution to provide rates.)\n\nIn recent years, the empirical process theory community has produced a large body of work on the topic of maximum likelihood (see for instance the excellent overviews and recent work of Wellner [13], van der Vaart and Wellner [14], Gao and Wellner [15]). As stated previously, the choice of the term \u201cbracket\u201d is to connect to empirical process theory. Loosely stated, a bracket is simply a pair of functions which sandwich some set of functions; the bracketing entropy is then (the logarithm of) the number of brackets needed to control a particular set of functions. In the present work, brackets are paired with sets which identify the far away regions they are meant to control; furthermore, while there is potential for the use of many outer brackets, the approach here is able to make use of just a single outer bracket. The name bracket is suitable, as opposed to cover, since the bracketing elements need not be members of the function class being dominated. (By contrast, Pollard's use in the proof of the consistency of k-means was more akin to covering, in that remote fluctuations were compared to that of a single center placed at the origin [1].)\n\n2 Notation\n\nThe ambient space will always be the Euclidean space $\\mathbb{R}^d$, though a few results will be stated for a general domain $\\mathcal{X}$. The source probability measure will be $\\rho$, and when a finite sample of size m is available, $\\hat\\rho$ is the corresponding empirical measure. Occasionally, the variable $\\nu$ will refer to an arbitrary probability measure (where $\\rho$ and $\\hat\\rho$ will serve as relevant instantiations). 
Both integral and expectation notation will be used; for example, $E(f(X)) = E_\\rho(f(X)) = \\int f(x) \\, d\\rho(x)$; for integrals, $\\int_B f(x) \\, d\\rho(x) = \\int f(x) 1[x \\in B] \\, d\\rho(x)$, where $1$ is the indicator function. The moments of $\\rho$ are defined as follows.\n\nDefinition 2.1. Probability measure $\\rho$ has order-p moment bound M with respect to norm $\\|\\cdot\\|$ when $E_\\rho \\|X - E_\\rho(X)\\|^l \\leq M$ for $1 \\leq l \\leq p$.\n\nFor example, the typical setting of k-means uses norm $\\|\\cdot\\|_2$, and at least two moments are needed for the cost over $\\rho$ to be finite; the condition here of needing 4 moments can be seen as naturally arising via Chebyshev's inequality. Of course, the availability of higher moments is beneficial, dropping the rates here from $m^{-1/4}$ down to $m^{-1/2}$. Note that the basic controls derived from moments, which are primarily elaborations of Chebyshev's inequality, can be found in Appendix A.\n\nThe k-means analysis will generalize slightly beyond the single-center cost $x \\mapsto \\|x - p\\|_2^2$ via Bregman divergences [16, 17].\n\nDefinition 2.2. Given a convex differentiable function $f : \\mathcal{X} \\to \\mathbb{R}$, the corresponding Bregman divergence is $B_f(x, y) := f(x) - f(y) - \\langle \\nabla f(y), x - y \\rangle$.\n\nNot all Bregman divergences are handled; rather, the following regularity conditions will be placed on the convex function.\n\nDefinition 2.3. A convex differentiable function f is strongly convex with modulus $r_1$ and has Lipschitz gradients with constant $r_2$, both with respect to some norm $\\|\\cdot\\|$, when f (respectively) satisfies\n\n$f(\\alpha x + (1 - \\alpha) y) \\leq \\alpha f(x) + (1 - \\alpha) f(y) - \\frac{r_1 \\alpha (1 - \\alpha)}{2} \\|x - y\\|^2$,\n\n$\\|\\nabla f(x) - \\nabla f(y)\\|_* \\leq r_2 \\|x - y\\|$,\n\nwhere $x, y \\in \\mathcal{X}$, $\\alpha \\in [0, 1]$, and $\\|\\cdot\\|_*$ is the dual of $\\|\\cdot\\|$. 
(The Lipschitz gradient condition is sometimes called strong smoothness.)\n\nThese conditions are a fancy way of saying the corresponding Bregman divergence is sandwiched between two quadratics (cf. Lemma B.1).\n\nDefinition 2.4. Given a convex differentiable function $f : \\mathbb{R}^d \\to \\mathbb{R}$ which is strongly convex and has Lipschitz gradients with respective constants $r_1, r_2$ with respect to norm $\\|\\cdot\\|$, the hard k-means cost of a single point $x$ according to a set of centers $P$ is\n\n$\\phi_f(x; P) := \\min_{p \\in P} B_f(x, p)$.\n\nThe corresponding k-means cost of a set of points (or distribution) is thus computed as $E_\\nu(\\phi_f(X; P))$, and let $\\mathcal{H}_f(\\nu; c, k)$ denote all sets of at most k centers beating cost c, meaning\n\n$\\mathcal{H}_f(\\nu; c, k) := \\{P : |P| \\leq k, E_\\nu(\\phi_f(X; P)) \\leq c\\}$.\n\nFor example, choosing norm $\\|\\cdot\\|_2$ and convex function $f(x) = \\|x\\|_2^2$ (which has $r_1 = r_2 = 2$), the corresponding Bregman divergence is $B_f(x, y) = \\|x - y\\|_2^2$, and $E_{\\hat\\rho}(\\phi_f(X; P))$ denotes the vanilla k-means cost of some finite point set encoded in the empirical measure $\\hat\\rho$.\n\nThe hard clustering guarantees will work with $\\mathcal{H}_f(\\nu; c, k)$, where $\\nu$ can be either the source distribution $\\rho$, or its empirical counterpart $\\hat\\rho$. As discussed previously, it is reasonable to set c to simply the sample variance of the data, or a related estimate of the true variance (cf. Appendix A).\n\nLastly, the class of Gaussian mixture penalties is as follows.\n\nDefinition 2.5. 
Given Gaussian parameters $\\theta := (\\mu, \\Sigma)$, let $p_\\theta$ denote Gaussian density\n\n$p_\\theta(x) = \\frac{1}{\\sqrt{(2\\pi)^d |\\Sigma|}} \\exp\\left( -\\frac{1}{2} (x - \\mu)^T \\Sigma^{-1} (x - \\mu) \\right)$.\n\nGiven Gaussian mixture parameters $(\\alpha, \\Theta) = (\\{\\alpha_i\\}_{i=1}^k, \\{\\theta_i\\}_{i=1}^k)$ with $\\alpha \\geq 0$ and $\\sum_i \\alpha_i = 1$ (written $\\alpha \\in \\Delta$), the Gaussian mixture cost at a point $x$ is\n\n$\\phi_g(x; (\\alpha, \\Theta)) := \\phi_g(x; \\{(\\alpha_i, \\theta_i) = (\\alpha_i, \\mu_i, \\Sigma_i)\\}_{i=1}^k) := \\ln\\left( \\sum_{i=1}^k \\alpha_i p_{\\theta_i}(x) \\right)$.\n\nLastly, given a measure $\\nu$, bound k on the number of mixture parameters, and spectrum bounds $0 < \\sigma_1 \\leq \\sigma_2$, let $\\mathcal{S}_{\\mathrm{mog}}(\\nu; c, k, \\sigma_1, \\sigma_2)$ denote those mixture parameters beating cost c, meaning\n\n$\\mathcal{S}_{\\mathrm{mog}}(\\nu; c, k, \\sigma_1, \\sigma_2) := \\{(\\alpha, \\Theta) : \\sigma_1 I \\preceq \\Sigma_i \\preceq \\sigma_2 I, |\\alpha| \\leq k, \\alpha \\in \\Delta, E_\\nu(\\phi_g(X; (\\alpha, \\Theta))) \\leq c\\}$.\n\nWhile a condition of the form $\\Sigma \\succeq \\sigma_1 I$ is typically enforced in practice (say, with a Bayesian prior, or by ignoring updates which shrink the covariance beyond this point), the condition $\\Sigma \\preceq \\sigma_2 I$ is potentially violated. These conditions will be discussed further in Section 5.\n\n3 Controlling k-means with an Outer Bracket\n\nFirst consider the special case of k-means cost.\n\nCorollary 3.1. Set $f(x) := \\|x\\|_2^2$, whereby $\\phi_f$ is the k-means cost. Let real $c \\geq 0$ and probability measure $\\rho$ be given with order-p moment bound M with respect to $\\|\\cdot\\|_2$, where $p \\geq 4$ is a positive multiple of 4. Define the quantities\n\n$c_1 := (2M)^{1/p} + \\sqrt{2c}$, $M_1 := M^{1/(p-2)} + M^{2/p}$, $N_1 := 2 + 576 d (c_1 + c_1^2 + M_1 + M_1^2)$.\n\nThen with probability at least $1 - 3\\delta$ over the draw of a sample of size $m \\geq \\max\\{(p/(2^{p/4+2} e))^2, 9 \\ln(1/\\delta)\\}$, every set of centers $P \\in \\mathcal{H}_f(\\hat\\rho; c, k) \\cup \\mathcal{H}_f(\\rho; c, k)$ satisfies\n\n$\\left| \\int \\phi_f(x; P) \\, d\\rho(x) - \\int \\phi_f(x; P) \\, d\\hat\\rho(x) \\right| \\leq m^{-1/2 + \\min\\{1/4, 2/p\\}} \\left( 4 + (72 c_1^2 + 32 M_1^2) \\sqrt{\\frac{1}{2} \\ln\\left( \\frac{(m N_1)^{dk}}{\\delta} \\right)} + \\sqrt{\\frac{2^{p/4} e p}{8 m^{1/2}}} \\left( \\frac{2}{\\delta} \\right)^{4/p} \\right)$
.\n\nOne artifact of the moment approach (cf. Appendix A), heretofore ignored, is the term $(2/\\delta)^{4/p}$. While this may seem inferior to $\\ln(2/\\delta)$, note that the choice $p = 4 \\ln(2/\\delta) / \\ln(\\ln(2/\\delta))$ suffices to make the two equal.\n\nNext consider a general bound for Bregman divergences. This bound has a few more parameters than Corollary 3.1. In particular, the term $\\epsilon$, which is instantiated to $m^{-1/2+1/p}$ in the proof of Corollary 3.1, catches the mass of points discarded due to the outer bracket, as well as the resolution of the (inner) cover. The parameter $p'$, which controls the tradeoff between m and $1/\\delta$, is set to $p/4$ in the proof of Corollary 3.1.\n\nTheorem 3.2. Fix a reference norm $\\|\\cdot\\|$ throughout the following. Let probability measure $\\rho$ be given with order-p moment bound M where $p \\geq 4$, a convex function f with corresponding constants $r_1$ and $r_2$, reals c and $\\epsilon > 0$, and integer $1 \\leq p' \\leq p/2 - 1$ be given. Define the quantities\n\n$R_B := \\max\\left\\{ (2M)^{1/p} + \\sqrt{4c/r_1}, \\max_{i \\in [p']} (M/\\epsilon)^{1/(p-2i)} \\right\\}$,\n\n$R_C := \\sqrt{r_2/r_1} \\left( (2M)^{1/p} + \\sqrt{4c/r_1} + R_B \\right) + R_B$,\n\n$B := \\{x \\in \\mathbb{R}^d : \\|x - E(X)\\| \\leq R_B\\}$,\n\n$C := \\{x \\in \\mathbb{R}^d : \\|x - E(X)\\| \\leq R_C\\}$,\n\n$\\tau := \\min\\left\\{ \\sqrt{\\frac{\\epsilon}{2 r_2}}, \\frac{\\epsilon}{2 (R_B + R_C) r_2} \\right\\}$,\n\nand let $N$ be a cover of $C$ by $\\|\\cdot\\|$-balls with radius $\\tau$; in the case that $\\|\\cdot\\|$ is an $l_p$ norm, the size of this cover has bound\n\n$|N| \\leq \\left( 1 + \\frac{2 R_C d}{\\tau} \\right)^d$.\n\nThen with probability at least $1 - 3\\delta$ over the draw of a sample of size $m \\geq \\max\\{p'/(e 2^{p'} \\epsilon), 9 \\ln(1/\\delta)\\}$, every set of centers $P \\in \\mathcal{H}_f(\\rho; c, k) \\cup \\mathcal{H}_f(\\hat\\rho; c, k)$ satisfies\n\n$\\left| \\int \\phi_f(x; P) \\, d\\rho(x) - \\int \\phi_f(x; P) \\, d\\hat\\rho(x) \\right| \\leq 4\\epsilon + 4 r_2 R_C^2 \\sqrt{\\frac{1}{2m} \\ln\\left( \\frac{2 |N|^k}{\\delta} \\right)} + \\sqrt{\\frac{e 2^{p'} \\epsilon p'}{2m}} \\left( \\frac{2}{\\delta} \\right)^{1/p'}$.\n\n3.1 Compactification via Outer Brackets\n\nThe outer bracket is defined as follows.\n\nDefinition 3.3. 
An outer bracket for probability measure $\\nu$ at scale $\\epsilon$ consists of two triples, one each for lower and upper bounds.\n\n1. The function $\\ell$, function class $Z_\\ell$, and set $B_\\ell$ satisfy two conditions: if $x \\in B_\\ell^c$ and $\\phi \\in Z_\\ell$, then $\\ell(x) \\leq \\phi(x)$, and secondly $|\\int_{B_\\ell^c} \\ell(x) \\, d\\nu(x)| \\leq \\epsilon$.\n2. Similarly, function $u$, function class $Z_u$, and set $B_u$ satisfy: if $x \\in B_u^c$ and $\\phi \\in Z_u$, then $u(x) \\geq \\phi(x)$, and secondly $|\\int_{B_u^c} u(x) \\, d\\nu(x)| \\leq \\epsilon$.\n\nDirect from the definition, given bracketing functions $(\\ell, u)$, a bracketed function $\\phi_f(\\cdot; P)$, and the bracketing set $B := B_u \\cup B_\\ell$,\n\n$-\\epsilon \\leq \\int_{B^c} \\ell(x) \\, d\\nu(x) \\leq \\int_{B^c} \\phi_f(x; P) \\, d\\nu(x) \\leq \\int_{B^c} u(x) \\, d\\nu(x) \\leq \\epsilon$; (3.4)\n\nin other words, as intended, this mechanism allows deviations on $B^c$ to be discarded. Thus to uniformly control the deviations of the dominated functions $Z := Z_u \\cup Z_\\ell$ over the set $B^c$, it suffices to simply control the deviations of the pair $(\\ell, u)$.\n\nThe following lemma shows that a bracket exists for $\\{\\phi_f(\\cdot; P) : P \\in \\mathcal{H}_f(\\nu; c, k)\\}$ and compact $B$, and moreover that this allows sampled points and candidate centers in far reaches to be deleted.\n\nLemma 3.5. Consider the setting and definitions in Theorem 3.2, but additionally define\n\n$M' := 2^{p'} \\epsilon$, $\\ell(x) := 0$, $u(x) := 4 r_2 \\|x - E(X)\\|^2$, $\\epsilon_{\\hat\\rho} := \\epsilon + \\sqrt{\\frac{M' e p'}{2m}} \\left( \\frac{2}{\\delta} \\right)^{1/p'}$.\n\nThe following statements hold with probability at least $1 - 2\\delta$ over a draw of size $m \\geq \\max\\{p'/(M' e), 9 \\ln(1/\\delta)\\}$.\n\n1. $(u, \\ell)$ is an outer bracket for $\\rho$ at scale $\\epsilon_\\rho := \\epsilon$ with sets $B_\\ell = B_u = B$ and $Z_\\ell = Z_u = \\{\\phi_f(\\cdot; P) : P \\in \\mathcal{H}_f(\\hat\\rho; c, k) \\cup \\mathcal{H}_f(\\rho; c, k)\\}$, and furthermore the pair $(u, \\ell)$ is also an outer bracket for $\\hat\\rho$ at scale $\\epsilon_{\\hat\\rho}$ with the same sets.\n2. 
For every $P \\in \\mathcal{H}_f(\\hat\\rho; c, k) \\cup \\mathcal{H}_f(\\rho; c, k)$,\n\n$\\left| \\int \\phi_f(x; P) \\, d\\rho(x) - \\int_B \\phi_f(x; P \\cap C) \\, d\\rho(x) \\right| \\leq \\epsilon_\\rho = \\epsilon$\n\nand\n\n$\\left| \\int \\phi_f(x; P) \\, d\\hat\\rho(x) - \\int_B \\phi_f(x; P \\cap C) \\, d\\hat\\rho(x) \\right| \\leq \\epsilon_{\\hat\\rho}$.\n\nThe proof of Lemma 3.5 has roughly the following outline.\n\n1. Pick some ball $B'$ which has probability mass at least 1/4. It is not possible for an element of $\\mathcal{H}_f(\\hat\\rho; c, k) \\cup \\mathcal{H}_f(\\rho; c, k)$ to have all centers far from $B'$, since otherwise the cost is larger than c. (Concretely, \u201cfar from\u201d means at least $\\sqrt{4c/r_1}$ away; note that this term appears in the definitions of $B$ and $C$ in Theorem 3.2.) Consequently, at least one center lies near to $B'$; this reasoning was also the first step in the k-means consistency proof due to Pollard [1].\n2. It is now easy to dominate $P \\in \\mathcal{H}_f(\\hat\\rho; c, k) \\cup \\mathcal{H}_f(\\rho; c, k)$ far away from $B'$. In particular, choose any $p' \\in B' \\cap P$, which was guaranteed to exist in the preceding point; since $\\min_{p \\in P} B_f(x, p) \\leq B_f(x, p')$ holds for all $x$, it suffices to dominate $p'$. This domination proceeds exactly as discussed in the introduction; in fact, the factor 4 appeared there, and again appears in the $u$ here, for exactly the same reason. Once again, similar reasoning can be found in the proof by Pollard [1].\n3. Satisfying the integral conditions over $\\rho$ is easy: it suffices to make $B$ huge. To control the size of $B'$, as well as the size of $B$, and moreover the deviations of the bracket over $B$, the moment tools from Appendix A are used.\n\nNow turning consideration back to the proof of Theorem 3.2, the above bracketing allows the removal of points and centers outside of a compact set (in particular, the pair of compact sets $B$ and $C$, respectively). 
On the remaining truncated data and set of centers, any standard tool suffices; for mathematical convenience, and also to fit well with the norm structure used throughout (in the definition of moments, as well as in the conditions on the convex function $f$ providing the divergence $B_f$), covering arguments are used here. (For details, please see Appendix B.)\n\n4 Interlude: Refined Estimates via Clamping\n\nSo far, rates have been given that guarantee uniform convergence when the distribution has a few moments, and these rates improve with the availability of higher moments. These moment conditions, however, do not necessarily reflect any natural cluster structure in the source distribution. The purpose of this section is to propose and analyze another distributional property which is intended to capture cluster structure. To this end, consider the following definition.\n\nDefinition 4.1. Real number $R$ and compact set $C$ are a clamp for probability measure $\\nu$ and family of centers $Z$ and cost $\\phi_f$ at scale $\\epsilon > 0$ if every $P \\in Z$ satisfies\n\n$|E_\\nu(\\phi_f(X; P)) - E_\\nu(\\min\\{\\phi_f(X; P \\cap C), R\\})| \\leq \\epsilon$.\n\nNote that this definition is similar to the second part of the outer bracket guarantee in Lemma 3.5, and, predictably enough, will soon lead to another deviation bound.\n\nExample 4.2. If the distribution has bounded support, then choosing a clamping value $R$ and clamping set $C$ respectively slightly larger than the support size and set is sufficient: as was reasoned in the construction of outer brackets, if no centers are close to the support, then the cost is bad. Correspondingly, the clamped set of functions $Z$ should again be choices of centers whose cost is not too high.\n\nFor a more interesting example, suppose $\\rho$ is supported on k small balls of radius $R_1$, where the distance between their respective centers is some $R_2 \\gg R_1$. 
Then by reasoning similar to the\nbounded case, all choices of centers achieving a good cost will place centers near to each ball, and\n\u2305\nthus the clamping value can be taken closer to R1.\nOf course, the above gave the existence of clamps under favorable conditions. The following shows\nthat outer brackets can be used to show the existence of clamps in general. In fact, the proof is very\nshort, and follows the scheme laid out in the bounded example above: outer bracketing allows the\nrestriction of consideration to a bounded set, and some algebra from there gives a conservative upper\nbound for the clamping value.\nProposition 4.3. Suppose the setting and de\ufb01nitions of Lemma 3.5, and additionally de\ufb01ne\n\nR := 2((2M )2/p + R2\n\nB).\n\nThen (C, R) is a clamp for measure \u21e2 and center Hf (\u21e2; c, k) at scale \u270f, and with probability at least\n1  3 over a draw of size m  max{p0/(M0e), 9 ln(1/)}, it is also a clamp for \u02c6\u21e2 and centers\nHf (\u02c6\u21e2; c, k) at scale \u270f\u02c6\u21e2.\nThe general guarantee using clamps is as follows. The proof is almost the same as for Theorem 3.2,\nbut note that this statement is not used quite as readily, since it \ufb01rst requires the construction of\nclamps.\nTheorem 4.4. Fix a norm k\u00b7k . Let (R, C) be a clamp for probability measure \u21e2 and empirical\ncounterpart \u02c6\u21e2 over some center class Z and cost f at respective scales \u270f\u21e2 and \u270f\u02c6\u21e2, where f has\ncorresponding convexity constants r1 and r2. Suppose C is contained within a ball of radius RC,\nlet \u270f> 0 be given, de\ufb01ne scale parameter\n\n\u2327 := min\u21e2r \u270f\nand let N be a cover of C by k\u00b7k -balls of radius \u2327 (as per lemma B.4, if k\u00b7k is an lp norm, then\n|N| \uf8ff (1 + (2RCd)/\u2327 )d suf\ufb01ces). 
Then with probability at least 1  over the draw of a sample of\nsize m  p0/(M0e), every set of centers P 2 Z satis\ufb01es\n\n2r2\n\nr1\u270f\n\n2r2R3 ,\n\n,\n\nZ f (x; P )d\u21e2(x) Z f (x; P )d\u02c6\u21e2(x) \uf8ff 2\u270f + \u270f\u21e2 + \u270f\u02c6\u21e2 + R2s 1\n\n2m\n\nBefore adjourning this section, note that clamps and outer brackets disagree on the treatment of the\nouter regions: the former replaces the cost there with the \ufb01xed value R, whereas the latter uses the\nvalue 0. On the technical side, this is necessitated by the covering argument used to produce the\n\ufb01nal theorem: if the clamping operation instead truncated beyond a ball of radius R centered at each\np 2 P , then the deviations would be wild as these balls moved and suddenly switched the value at a\npoint from 0 to something large. This is not a problem with outer bracketing, since the same points\n(namely Bc) are ignored by every set of centers.\n\nln\u2713 2|N|k\n \u25c6.\n\n7\n\n\f5 Mixtures of Gaussians\n\nBefore turning to the deviation bound, it is a good place to discuss the condition 1I  \u2303  2I,\nwhich must be met by every covariance matrix of every constituent Gaussian in a mixture.\nThe lower bound 1I  \u2303, as discussed previously, is fairly common in practice, arising either\nvia a Bayesian prior, or by implementing EM with an explicit condition that covariance updates are\ndiscarded when the eigenvalues fall below some threshold. In the analysis here, this lower bound is\nused to rule out two kinds of bad behavior.\n\n1. Given a budget of at least 2 Gaussians, and a sample of at least 2 distinct points, arbitrarily\nlarge likelihood may be achieved by devoting one Gaussian to one point, and shrinking its\ncovariance. This issue destroys convergence properties of maximum likelihood, since the\nlikelihood score may be arbitrarily large over every sample, but is \ufb01nite for well-behaved\ndistributions. The condition 1I  \u2303 rules this out.\n2. 
Another phenomenon is a \u201cflat\u201d Gaussian, meaning a Gaussian whose density is high along a lower dimensional manifold, but small elsewhere. Concretely, consider a Gaussian over $\\mathbb{R}^2$ with covariance $\\Sigma = \\mathrm{diag}(\\sigma, 1)$; as $\\sigma$ decreases, the Gaussian has large density on a line, but low density elsewhere. This phenomenon is distinct from the preceding in that it does not produce arbitrarily large likelihood scores over finite samples. The condition $\\sigma_1 I \\preceq \\Sigma$ rules this situation out as well.\n\nIn both the hard and soft clustering analyses here, a crucial early step allows the assertion that good scores in some region mean the relevant parameter is nearby. For the case of Gaussians, the condition $\\sigma_1 I \\preceq \\Sigma$ makes this problem manageable, but there is still the possibility that some far away, fairly uniform Gaussian has reasonable density. This case is ruled out here via $\\sigma_2 I \\succeq \\Sigma$.\n\nTheorem 5.1. Let probability measure $\\rho$ be given with order-p moment bound M according to norm $\\|\\cdot\\|_2$ where $p \\geq 8$ is a positive multiple of 4, covariance bounds $0 < \\sigma_1 \\leq \\sigma_2$ with $\\sigma_1 \\leq 1$ for simplicity, and real $c \\leq 1/2$ be given. Then with probability at least $1 - 5\\delta$ over the draw of a sample of size $m \\geq \\max\\{(p/(2^{p/4+2} e))^2, 8 \\ln(1/\\delta), d^2 \\ln(\\pi \\sigma_2)^2 \\ln(1/\\delta)\\}$, every set of Gaussian mixture parameters $(\\alpha, \\Theta) \\in \\mathcal{S}_{\\mathrm{mog}}(\\hat\\rho; c, k, \\sigma_1, \\sigma_2) \\cup \\mathcal{S}_{\\mathrm{mog}}(\\rho; c, k, \\sigma_1, \\sigma_2)$ satisfies\n\n$\\left| \\int \\phi_g(x; (\\alpha, \\Theta)) \\, d\\rho(x) - \\int \\phi_g(x; (\\alpha, \\Theta)) \\, d\\hat\\rho(x) \\right| = O\\left( m^{-1/2+3/p} \\left( 1 + \\sqrt{\\ln(m) + \\ln(1/\\delta)} + (1/\\delta)^{4/p} \\right) \\right)$,\n\nwhere the $O(\\cdot)$ drops numerical constants, polynomial terms depending on c, M, d, and k, $\\sigma_2/\\sigma_1$, and $\\ln(\\sigma_2/\\sigma_1)$, but in particular has no sample-dependent quantities.\n\nThe proof follows the scheme of the hard clustering analysis. 
One distinction is that the outer bracket now uses both components; the upper component is the log of the largest possible density (indeed, it is $\\ln((2\\pi\\sigma_1)^{-d/2})$), whereas the lower component is a function mimicking the log density of the steepest possible Gaussian: concretely, the lower bracket's definition contains the expression $\\ln((2\\pi\\sigma_2)^{-d/2}) - 2\\|x - E_\\rho(X)\\|_2^2/\\sigma_1$, which lacks the normalization of a proper Gaussian, highlighting the fact that bracketing elements need not be elements of the class. Superficially, a second distinction with the hard clustering case is that far away Gaussians can not be entirely ignored on local regions; the influence is limited, however, and the analysis proceeds similarly in each case.\n\nAcknowledgments\n\nThe authors thank the NSF for supporting this work under grant IIS-1162581.\n\nReferences\n\n[1] David Pollard. Strong consistency of k-means clustering. The Annals of Statistics, 9(1):135\u2013140, 1981.\n\n[2] G\u00e1bor Lugosi and Kenneth Zeger. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Trans. Inform. Theory, 40:1728\u20131740, 1994.\n\n[3] Shai Ben-david. A framework for statistical clustering with a constant time approximation algorithms for k-median clustering. In COLT, pages 415\u2013426. Springer, 2004.\n\n[4] Alexander Rakhlin and Andrea Caponnetto. Stability of k-means clustering. In NIPS, pages 1121\u20131128, 2006.\n\n[5] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley, 2nd edition, 2001.\n\n[6] Thomas S. Ferguson. A course in large sample theory. Chapman & Hall, 1996.\n\n[7] David Pollard. A central limit theorem for k-means clustering. The Annals of Probability, 10(4):919\u2013926, 1982.\n\n[8] Shai Ben-david, Ulrike Von Luxburg, and D\u00e1vid P\u00e1l. A sober look at clustering stability. In COLT, pages 5\u201319. 
Springer, 2006.\n\n[9] Ohad Shamir and Naftali Tishby. Cluster stability for finite samples. In NIPS, 2007.\n\n[10] Ohad Shamir and Naftali Tishby. Model selection and stability in k-means clustering. In COLT, 2008.\n\n[11] St\u00e9phane Boucheron, Olivier Bousquet, and G\u00e1bor Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323\u2013375, 2005.\n\n[12] St\u00e9phane Boucheron, G\u00e1bor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford, 2013.\n\n[13] Jon Wellner. Consistency and rates of convergence for maximum likelihood estimators via empirical process theory. 2005.\n\n[14] Aad van der Vaart and Jon Wellner. Weak Convergence and Empirical Processes. Springer, 1996.\n\n[15] FuChang Gao and Jon A. Wellner. On the rate of convergence of the maximum likelihood estimator of a k-monotone density. Science in China Series A: Mathematics, 52(7):1525\u20131538, 2009.\n\n[16] Yair Censor and Stavros A. Zenios. Parallel Optimization: Theory, Algorithms and Applications. Oxford University Press, 1997.\n\n[17] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705\u20131749, 2005.\n\n[18] Terence Tao. 254a notes 1: Concentration of measure, January 2010. URL http://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/.\n\n[19] I. F. Pinelis and S. A. Utev. Estimates of the moments of sums of independent random variables. Teor. Veroyatnost. i Primenen., 29(3):554\u2013557, 1984. Translation to English by Bernard Seckler.\n\n[20] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, July 2007.\n\n[21] Jean-Baptiste Hiriart-Urruty and Claude Lemar\u00e9chal. 
Fundamentals of Convex Analysis. Springer Publishing Company, Incorporated, 2001.", "award": [], "sourceid": 1344, "authors": [{"given_name": "Matus", "family_name": "Telgarsky", "institution": "UC San Diego"}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": "UC San Diego"}]}