{"title": "Optimal kernel choice for large-scale two-sample tests", "book": "Advances in Neural Information Processing Systems", "page_first": 1205, "page_last": 1213, "abstract": "Abstract Given samples from distributions $p$ and $q$, a two-sample test determines whether to reject the null hypothesis that $p=q$, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is thus critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. 
In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.", "full_text": "Optimal kernel choice for large-scale two-sample tests

Arthur Gretton,1,3 Bharath Sriperumbudur,1 Dino Sejdinovic,1 Heiko Strathmann2
1Gatsby Unit and 2CSD, CSML, UCL, UK; 3MPI for Intelligent Systems, Germany
{arthur.gretton,bharat.sv,dino.sejdinovic,heiko.strathmann}@gmail

Sivaraman Balakrishnan, LTI, CMU, USA, sbalakri@cs.cmu.edu
Kenji Fukumizu, ISM, Japan, fukumizu@ism.ac.jp
Massimiliano Pontil, CSD, CSML, UCL, UK, m.pontil@cs.ucl.ac.uk

Abstract

Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory.
In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

1 Introduction

The two-sample problem addresses the question of whether two independent samples are drawn from the same distribution. In the setting of statistical hypothesis testing, this corresponds to choosing whether to reject the null hypothesis H0 that the generating distributions p and q are the same, vs. the alternative hypothesis HA that the distributions p and q are different, given a set of independent observations drawn from each.
A number of recent approaches to two-sample testing have made use of mappings of the distributions to a reproducing kernel Hilbert space (RKHS), or have sought out RKHS functions with large amplitude where the probability mass of p and q differs most [8, 10, 15, 17, 7]. The most straightforward test statistic is the norm of the difference between distribution embeddings, and is called the maximum mean discrepancy (MMD). One difficulty in using this statistic in a hypothesis test, however, is that the MMD depends on the choice of the kernel. If we are given a family of kernels, we obtain a different value of the MMD for each member of the family, and indeed for any positive definite linear combination of the kernels. When a radial basis function kernel (such as the Gaussian kernel) is used, one simple choice is to set the kernel width to the median distance between points in the aggregate sample [8, 7]. While this is certainly straightforward, it has no guarantees of optimality. An alternative heuristic is to choose the kernel that maximizes the test statistic [15]: in experiments, this was found to reliably outperform the median approach.
Since the MMD returns a smooth RKHS function that minimizes classification error under linear loss, maximizing the MMD corresponds to minimizing this classification error under a smoothness constraint. If the statistic is to be applied in hypothesis testing, however, this choice of kernel does not explicitly address the question of test performance.
We propose a new approach to kernel choice for hypothesis testing, which explicitly optimizes the performance of the hypothesis test. Our kernel choice minimizes Type II error (the probability of wrongly accepting H0 when p ≠ q), given an upper bound on Type I error (the probability of wrongly rejecting H0 when p = q). This corresponds to optimizing the asymptotic relative efficiency in the sense of Hodges and Lehmann [13, Ch. 10]. We address the case of the linear time statistic in [7, Section 6], for which both the test statistic and the parameters of the null distribution can be computed in O(n), for sample size n. This has a higher variance at a given n than the U-statistic estimate costing O(n²) used in [8, 7], since the latter is the minimum variance unbiased estimator. Thus, we would use the quadratic time statistic in the “limited data, unlimited time” scenario, as it extracts the most possible information from the data available. The linear time statistic is used in the “unlimited data, limited time” scenario, since it is the cheapest statistic that still incorporates each datapoint: it does not require the data to be stored, and is thus appropriate for analyzing data streams. As a further consequence of the streaming data setting, we learn the kernel parameter on a separate sample from the one used in testing; i.e., unlike the classical testing scenario, we use a training set to learn the kernel parameters.
An advantage of this setting is that our null distribution remains straightforward, and the test threshold can be computed without a costly bootstrap procedure.
We begin our presentation in Section 2 with a review of the maximum mean discrepancy, its linear time estimate, and the associated asymptotic distribution and test. In Section 3 we describe a criterion for kernel choice to maximize the Hodges and Lehmann asymptotic relative efficiency. We demonstrate the convergence of the empirical estimate of this criterion when the family of kernels is a linear combination of base kernels (with non-negative coefficients), and of the kernel coefficients themselves. In Section 4, we provide an optimization procedure to learn the kernel weights. Finally, in Section 5, we present experiments, in which we compare our kernel selection strategy with the approach of simply maximizing the test statistic subject to various constraints on the coefficients of the linear combination; and with a cross-validation approach, which follows from the interpretation of the MMD as a classifier. We observe that a principled kernel choice for testing outperforms competing heuristics, including the previous best-performing heuristic in [15]. A Matlab implementation is available at: www.gatsby.ucl.ac.uk/~gretton/adaptMMD/adaptMMD.htm

2 Maximum mean discrepancy, and a linear time estimate

We begin with a brief review of kernel methods, and of the maximum mean discrepancy [8, 7, 14]. We then describe the family of kernels over which we optimize, and the linear time estimate of the MMD.

2.1 MMD for a family of kernels

Let F_k be a reproducing kernel Hilbert space (RKHS) defined on a topological space X with reproducing kernel k, and p a Borel probability measure on X. The mean embedding of p in F_k is a unique element µ_k(p) ∈ F_k such that E_{x∼p} f(x) = ⟨f, µ_k(p)⟩_{F_k} for all f ∈ F_k [4].
By the Riesz representation theorem, a sufficient condition for the existence of µ_k(p) is that k be Borel-measurable and E_{x∼p} k^{1/2}(x, x) < ∞. We assume k is a bounded continuous function, hence this condition holds for all Borel probability measures. The maximum mean discrepancy (MMD) between Borel probability measures p and q is defined as the RKHS distance between the mean embeddings of p and q. An expression for the squared MMD is thus

η_k(p, q) = ‖µ_k(p) − µ_k(q)‖²_{F_k}   (1)
= E_{xx′} k(x, x′) + E_{yy′} k(y, y′) − 2 E_{xy} k(x, y),   (2)

where x, x′ are i.i.d. from p and y, y′ are i.i.d. from q. By introducing

h_k(x, x′, y, y′) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y),

we can write

η_k(p, q) = E_{xx′yy′} h_k(x, x′, y, y′) =: E_v h_k(v),

where we have defined the random vector v := [x, x′, y, y′]. If µ_k is an injective map, then k is said to be a characteristic kernel, and the MMD is a metric on the space of Borel probability measures, i.e., η_k(p, q) = 0 iff p = q [16]. The Gaussian kernels used in the present work are characteristic.
Our goal is to select a kernel for hypothesis testing from a particular family K of kernels, which we now define. Let {k_u}_{u=1}^d be a set of positive definite functions k_u : X × X → R. Let

K := { k : k = Σ_{u=1}^d β_u k_u, Σ_{u=1}^d β_u = D, β_u ≥ 0, ∀u ∈ {1, . . . , d} }   (3)

for some D > 0, where the constraint on the sum of coefficients is needed for the consistency proof (see Section 3). Each k ∈ K is associated uniquely with an RKHS F_k, and we assume the kernels are bounded, |k_u| ≤ K, ∀u ∈ {1, . . . , d}.
The squared MMD becomes

η_k(p, q) = ‖µ_k(p) − µ_k(q)‖²_{F_k} = Σ_{u=1}^d β_u η_u(p, q),

where η_u(p, q) := E_v h_u(v). It is clear that if every kernel k_u, u ∈ {1, . . . , d}, is characteristic and at least one β_u > 0, then k is characteristic. Where there is no ambiguity, we will write η_u := η_u(p, q) and E h_u := E_v h_u(v). We denote h = (h_1, h_2, . . . , h_d)ᵀ ∈ R^{d×1}, β = (β_1, β_2, . . . , β_d)ᵀ ∈ R^{d×1}, and η = (η_1, η_2, . . . , η_d)ᵀ ∈ R^{d×1}. With this notation, we may write

η_k(p, q) = E(βᵀh) = βᵀη.

2.2 Empirical estimate of the MMD, asymptotic distribution, and test

We now describe an empirical estimate of the maximum mean discrepancy, given i.i.d. samples X := {x_1, . . . , x_n} and Y := {y_1, . . . , y_n} from p and q, respectively. We use the linear time estimate of [7, Section 6], for which both the test statistic and the parameters of the null distribution can be computed in time O(n). This has a higher variance at a given n than a U-statistic estimate costing O(n²), since the latter is the minimum variance unbiased estimator [13, Ch. 5]. That said, it was observed experimentally in [7, Section 8.3] that the linear time statistic yields better performance at a given computational cost than the quadratic time statistic, when sufficient data are available (bearing in mind that consistent estimates of the null distribution in the latter case are computationally demanding [9]).
Moreover, the linear time statistic does not require the sample to be stored in memory, and is thus suited to data streaming contexts, where a large number of observations arrive in sequence.
The linear time estimate of η_k(p, q) is defined in [7, Lemma 14]: assuming for ease of notation that n is even,

η̌_k = (2/n) Σ_{i=1}^{n/2} h_k(v_i),   (4)

where v_i := [x_{2i−1}, x_{2i}, y_{2i−1}, y_{2i}] and h_k(v_i) := k(x_{2i−1}, x_{2i}) + k(y_{2i−1}, y_{2i}) − k(x_{2i−1}, y_{2i}) − k(x_{2i}, y_{2i−1}); this arrangement of the samples ensures we get an expectation over independent variables as in (2) with cost O(n). We use η̌_k to denote the empirical statistic computed over the samples being tested, to distinguish it from the training sample estimate η̂_k used in selecting the kernel. Given the family of kernels K in (3), this can be written η̌_k = βᵀη̌, where we again use the convention η̌ = (η̌_1, η̌_2, . . . , η̌_d)ᵀ ∈ R^{d×1}. The statistic η̌_k has expectation zero under the null hypothesis H0 that p = q, and has positive expectation under the alternative hypothesis HA that p ≠ q.
Since η̌_k is a straightforward average of independent random variables, its asymptotic distribution is given by the central limit theorem (e.g. [13, Section 1.9]).
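As an illustration, the linear time estimate (4) is straightforward to implement. The following Python sketch (with hypothetical helper names and a Gaussian base kernel; the paper's own Matlab code is linked in Section 1) computes η̌_k together with the per-block values h_k(v_i):

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Row-wise Gaussian kernel: k(a_i, b_i) = exp(-||a_i - b_i||^2 / (2 * bandwidth^2))."""
    sq = np.sum((a - b) ** 2, axis=1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def linear_time_mmd(X, Y, bandwidth):
    """Linear-time MMD estimate of eq. (4): average of h_k(v_i) over
    non-overlapping blocks v_i = [x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i}].
    Returns (mmd_estimate, h_values); the h values can also feed a variance estimate."""
    n = min(len(X), len(Y)) // 2 * 2   # use an even number of points from each sample
    x1, x2 = X[0:n:2], X[1:n:2]
    y1, y2 = Y[0:n:2], Y[1:n:2]
    h = (gaussian_kernel(x1, x2, bandwidth)
         + gaussian_kernel(y1, y2, bandwidth)
         - gaussian_kernel(x1, y2, bandwidth)
         - gaussian_kernel(x2, y1, bandwidth))
    return h.mean(), h

# Usage: samples from the same distribution give an estimate near zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
Y = rng.normal(size=(1000, 2))
mmd, h = linear_time_mmd(X, Y, bandwidth=1.0)
```

Under H0 the estimate fluctuates around zero at rate O(n^{-1/2}); under HA it concentrates on the positive population MMD.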
From [7, Corollary 16], under the assumption 0 < E(h_k²) < ∞ (which is true for bounded continuous k),

n^{1/2} (η̌_k − η_k(p, q)) →_D N(0, 2σ_k²),   (5)

where the factor of two arises since the average is over n/2 terms, and

σ_k² = E_v h_k²(v) − [E_v(h_k(v))]².   (6)

Unlike the case of a quadratic time statistic, the null and alternative distributions differ only in mean; by contrast, the quadratic time statistic has as its null distribution an infinite weighted sum of χ² variables [7, Section 5], and a Gaussian alternative distribution.
To obtain an estimate of the variance based on the samples X, Y, we will use an expression derived from the U-statistic of [13, p. 173] (although as earlier, we will express this as a simple average so as to compute it in linear time). The population variance can be written

σ_k² = E_v h_k²(v) − E_{v,v′}(h_k(v) h_k(v′)) = (1/2) E_{v,v′} (h_k(v) − h_k(v′))².

Expanding in terms of the kernel coefficients β, we get

σ_k² = βᵀ Q_k β,   (7)

where Q_k := cov(h) is the covariance matrix of h. A linear time estimate for the variance is σ̌_k² = βᵀ Q̌_k β, where

[Q̌_k]_{uu′} = (4/n) Σ_{i=1}^{n/4} h_{Δ,u}(w_i) h_{Δ,u′}(w_i),

and w_i := [v_{2i−1}, v_{2i}],¹ h_{Δ,k}(w_i) := h_k(v_{2i−1}) − h_k(v_{2i}).
We now address the construction of a hypothesis test. We denote by Φ the CDF of a standard Normal random variable N(0, 1), and by Φ⁻¹ the inverse CDF. From (5), a test of asymptotic level α using the statistic η̌_k will have the threshold

t_{k,α} = n^{−1/2} σ_k √2 Φ⁻¹(1 − α),   (8)

bearing in mind the asymptotic distribution of the test statistic, and that η_k(p, p) = 0.
This threshold is computed empirically by replacing σ_k with its estimate σ̌_k (computed using the data being tested), which yields a test of the desired asymptotic level.
The asymptotic distribution (5) holds only when the kernel is fixed, and does not depend on the sample X, Y. If the kernel were a function of the data, then a test would require large deviation probabilities over the supremum of the Gaussian process indexed by the kernel parameters (e.g. [1]). In practice, the threshold would be computed via a bootstrap procedure, which has a high computational cost. Instead, we set aside a portion of the data to learn the kernel (the “training data”), and use the remainder to construct a test using the learned kernel parameters.

3 Choice of kernel

The choice of kernel will affect both the test statistic itself, (4), and its asymptotic variance, (6). Thus, we need to consider how these statistics determine the power of a test with a given level α (the upper bound on the Type I error). We consider the case where p ≠ q. A Type II error occurs when the random variable η̌_k falls below the threshold t_{k,α} defined in (8). The asymptotic probability of a Type II error is therefore

P(η̌_k < t_{k,α}) = Φ( Φ⁻¹(1 − α) − η_k(p, q) √n / (σ_k √2) ).   (9)

As Φ is monotonic, the Type II error probability will decrease as the ratio η_k(p, q) σ_k⁻¹ increases. Therefore, the kernel minimizing this error probability is

k* = arg sup_{k∈K} η_k(p, q) σ_k⁻¹,

with the associated test threshold t_{k*,α}.
In practice, we do not have access to the population quantities η_k(p, q) and σ_k, but only their empirical estimates η̂_k, σ̂_k from m pairs of training points (x_i, y_i) (this training sample must be independent of the sample used to compute the test parameters η̌_k, σ̌_k). We therefore estimate t_{k*,α} by a regularized empirical estimate t_{k̂*,α}, where

k̂* = arg sup_{k∈K} η̂_k (σ̂_{k,λ})⁻¹,

and we define the regularized standard deviation σ̂_{k,λ} = (βᵀ(Q̂ + λ_m I)β)^{1/2} = (σ̂_k² + λ_m ‖β‖₂²)^{1/2}. The next theorem shows the convergence of sup_{k∈K} η̂_k (σ̂_{k,λ})⁻¹ to sup_{k∈K} η_k(p, q) σ_k⁻¹, and of k̂* to k*, for an appropriate schedule of decrease for λ_m with increasing m.

Theorem 1. Let K be defined as in (3). Assume sup_{k∈K, x,y∈X} |k(x, y)| < K and σ_k is bounded away from zero. Then if λ_m = Θ(m^{−1/3}),

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} | = O_P(m^{−1/3})   and   k̂* →_P k*.

Proof. Recall from the definition of K that ‖β‖₁ = D, and that ‖β‖₂ ≤ ‖β‖₁ and ‖β‖₁ ≤ √d ‖β‖₂ [11, Problem 3 p. 278], hence ‖β‖₂ ≥ D d^{−1/2}. We begin with the bound

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} |
≤ sup_{k∈K} | η̂_k σ̂_{k,λ}^{−1} − η_k σ_{k,λ}^{−1} | + sup_{k∈K} | η_k σ_{k,λ}^{−1} − η_k σ_k^{−1} |
≤ sup_{k∈K} (σ̂_k² + λ_m ‖β‖₂²)^{−1/2} | η̂_k − η_k | + sup_{k∈K} η_k | σ̂_{k,λ} − σ_{k,λ} | / (σ̂_{k,λ} σ_{k,λ}) + sup_{k∈K} η_k (σ_{k,λ}² − σ_k²) / (σ_{k,λ} σ_k (σ_{k,λ} + σ_k))
≤ (√d / (D √λ_m)) ( C₁ sup_{k∈K} | η̂_k − η_k | + C₂ sup_{k∈K} | σ̂_{k,λ} − σ_{k,λ} | ) + C₃ D² λ_m,

where the last step uses σ̂_{k,λ}, σ_{k,λ} ≥ √λ_m ‖β‖₂ ≥ D √λ_m d^{−1/2} and σ_{k,λ}² − σ_k² = λ_m ‖β‖₂² ≤ λ_m D², and the constants C₁, C₂, and C₃ follow from the boundedness of σ_k and η_k.

¹This vector is the concatenation of two four-dimensional vectors, and has eight dimensions.
The first result in the theorem follows from sup_{k∈K} |η̂_k − η_k| = O_P(m^{−1/2}) and sup_{k∈K} |σ̂_{k,λ} − σ_{k,λ}| = O_P(m^{−1/2}), which are proved using McDiarmid's Theorem [12] and results from [3]: see Appendix A of the supplementary material.
Convergence of k̂* to k*: for k ∈ K defined in (3), we show in Section 4 that k̂* and k* are unique optimizers of η̂_k σ̂_{k,λ}^{−1} and η_k σ_k^{−1}, respectively. Since sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} →_P sup_{k∈K} η_k σ_k^{−1}, the result follows from [18, Corollary 3.2.3(i)].

We remark that other families of kernels may be worth considering, besides K. For instance, we could use a family of RBF kernels with continuous bandwidth parameter θ ≥ 0. We return to this point in the conclusions (Section 6).

4 Optimization procedure

We wish to select the kernel k = Σ_{u=1}^d β̂*_u k_u ∈ K that maximizes the ratio η̂_k/σ̂_{k,λ}. We perform this optimization over training data, then use the resulting parameters β̂* to construct a hypothesis test on the data to be tested (which must be independent of the training data, and drawn from the same p, q). As discussed in Section 2.2, this gives us the test threshold without requiring a bootstrap procedure. Recall from Sections 2.2 and 3 that η̂_k = βᵀη̂ and σ̂_{k,λ} = (βᵀ(Q̂ + λ_m I)β)^{1/2}, and define

α(β; η̂, Q̂) := βᵀη̂ (βᵀ(Q̂ + λ_m I)β)^{−1/2},

where Q̂ is a linear-time empirical estimate of the covariance matrix cov(h).
Since the objective is a homogeneous function of order zero in β, we can omit the constraint ‖β‖₁ = D, and set

β̂* = arg max_{β⪰0} α(β; η̂, Q̂).   (10)

Figure 1: Left: Feature selection results, Type II error vs number of dimensions; average over 5000 trials, m = n = 10^4. Centre: 3 × 3 Gaussian grid, samples from p and q. Right: Gaussian grid results, Type II error vs ε, the eigenvalue ratio for the covariance of the Gaussians in q; average over 1500 trials, m = n = 10^4. The asymptotic test level was α = 0.05 in both experiments. Error bars give the 95% Wald confidence interval.

If η̂ has at least one positive entry, there exists β ⪰ 0 such that α(β; η̂, Q̂) > 0. Then clearly α(β̂*; η̂, Q̂) > 0, so we can write β̂* = arg max_{β⪰0} α²(β; η̂, Q̂). In this case, the problem (10) becomes equivalent to a (convex) quadratic program with a unique solution, given by

min{ βᵀ(Q̂ + λ_m I)β : βᵀη̂ = 1, β ⪰ 0 }.   (11)

Under the alternative hypothesis, we have η_u > 0, ∀u ∈ {1, . . .
, d}, so the same reasoning can be applied to the population version of the optimization problem, i.e., β* = arg max_{β⪰0} α(β; η, cov(h)), which implies the optimizer β* is unique. In the case where no entries in η̂ are positive, we obtain maximization of a quadratic form subject to a linear constraint, i.e.,

max{ βᵀ(Q̂ + λ_m I)β : βᵀη̂ = −1, β ⪰ 0 }.

While this problem is somewhat more difficult to solve, in practice its exact solution is irrelevant to the Type II error performance of the proposed two-sample test. Indeed, since all of the squared MMD estimates calculated on the training data using each of the base kernels are negative, it is unlikely that the statistic computed on the data used for the test will exceed the (always positive) threshold. Therefore, when no entries in η̂ are positive, we (arbitrarily) select the single base kernel k_u with largest η̂_u/σ̂_{u,λ}.
The key component of the optimization procedure is the quadratic program in (11). This problem can be solved by interior point methods or, if the number of kernels d is large, by proximal-gradient methods. In that case, an ε-minimizer can be found in O(d²/√ε) time. Therefore, the overall computational cost of the proposed test is linear in the number of samples, and quadratic in the number of kernels.

5 Experiments

We compared our kernel selection strategy to alternative approaches, with a focus on challenging problems that benefit from careful kernel choice. In our first experiment, we investigated a synthetic data set for which the best kernel in the family K of linear combinations in (3) outperforms the best individual kernel from the set {k_u}_{u=1}^d.
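For concreteness, the convex program (11) can be handed to an off-the-shelf solver. The sketch below is illustrative rather than the authors' code, and uses SciPy's SLSQP method instead of an interior point or proximal gradient solver:

```python
import numpy as np
from scipy.optimize import minimize

def solve_kernel_weights(eta_hat, Q_hat, lam):
    """Solve min beta' (Q_hat + lam*I) beta  s.t.  beta' eta_hat = 1, beta >= 0  (eq. 11),
    then rescale; the objective (10) is scale-invariant in beta."""
    d = len(eta_hat)
    A = Q_hat + lam * np.eye(d)
    res = minimize(
        lambda b: b @ A @ b,
        x0=np.full(d, 1.0 / max(eta_hat.sum(), 1e-12)),   # feasible uniform start
        jac=lambda b: 2.0 * A @ b,
        constraints=[{"type": "eq", "fun": lambda b: b @ eta_hat - 1.0}],
        bounds=[(0.0, None)] * d,
        method="SLSQP",
    )
    beta = np.maximum(res.x, 0.0)
    return beta / beta.sum()   # normalize; the ratio eta_hat_k / sigma_hat_k is unchanged

# Toy usage: the second base kernel has the better MMD-to-variance ratio,
# so it should receive most of the weight.
eta_hat = np.array([0.1, 0.5])
Q_hat = np.diag([1.0, 1.0])
beta = solve_kernel_weights(eta_hat, Q_hat, lam=1e-4)
```

Because the objective (10) is a homogeneous function of order zero in β, any positive rescaling of the solution yields the same test; normalizing the weights to sum to one matches the constraint in (3) with D = 1.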
Here p was a zero-mean Gaussian with unit covariance, and q was a mixture of two Gaussians with equal weight, one with mean 0.5 in the first coordinate and zero elsewhere, and the other with mean 0.5 in the second coordinate and zero elsewhere. Our base kernel set {k_u}_{u=1}^d contained only d univariate kernels with fixed bandwidth (one for each dimension): in other words, this was a feature selection problem. We used two kernel selection strategies arising from our criterion in (9): opt - the kernel from the set K that maximizes the ratio η̂_k/σ̂_{k,λ}, as described in Section 4, and max-ratio - the single base kernel k_u with largest η̂_u/σ̂_{u,λ}.

Figure 2: Left: amplitude modulated signals, four samples from each of p and q prior to noise being added. Right: AM results, Type II error vs added noise, average over 5000 trials, m = n = 10^4. The asymptotic test level was α = 0.05. Error bars give the 95% Wald confidence interval.

We used λ_n = 10^−4 in both cases. An alternative kernel selection procedure is simply to maximize the MMD on the training data, which is equivalent to minimizing the error in classifying p vs. q under linear loss [15]. In this case, it is necessary to bound the norm of β, since the test statistic can otherwise be increased without limit by rescaling the β entries. We employed two such kernel selection strategies: max-mmd - a single base kernel k_u that maximizes η̂_u (as proposed in [15]), and l2 - a kernel from the set K that maximizes η̂_k subject to the constraint ‖β‖₂ ≤ 1 on the vector of weights.
Our results are shown in Figure 1.
We see that opt and l2 perform much better than max-ratio and max-mmd, with the former each having large β̂* weights in both the relevant dimensions, whereas the latter are permitted to choose only a single kernel. The performance advantage decreases as more irrelevant dimensions are added. Also note that on these data, there is no statistically significant difference between opt and l2, or between max-ratio and max-mmd.
Difficult problems in two-sample testing arise when the main data variation does not reflect the difference between p and q; rather, this is encoded as perturbations at much smaller lengthscales. In these cases, a good choice of kernel becomes crucial. Both remaining experiments are of this type.
In the second experiment, p and q were both grids of Gaussians in two dimensions, where p had unit covariance matrices in each mixture component, and q was a grid of correlated Gaussians with a ratio ε of largest to smallest covariance eigenvalues. A sample dataset is provided in Figure 1. The testing problem becomes more difficult when the number of Gaussian centers in the grid increases, and when ε → 1. In experiments, we used a five-by-five grid.
We compared opt, max-ratio, max-mmd, and l2, as well as an additional approach, xval, for which we chose the best kernel from {k_u}_{u=1}^d by five-fold cross-validation, following [17]. In this case, we learned a witness function on four fifths of the training data, and used it to evaluate the linear loss on p vs. q for the rest of the training data (see [7, Section 2.3] for the witness function definition, and [15] for the classification interpretation of the MMD). We made repeated splits to obtain the average validation error, and chose the kernel with the highest average MMD on the validation sets (equivalently, the lowest average linear loss).
This procedure has cost O(m²), and is much more computationally demanding than the remaining approaches.
Our base kernels {k_u}_{u=1}^d in (3) were multivariate isotropic Gaussians with bandwidth varying between 2^−10 and 2^15, with a multiplicative step-size of 2^0.5, and we set λ_n = 10^−5. Results are plotted in Figure 1: opt and max-ratio are statistically indistinguishable, followed in order of decreasing performance by xval, max-mmd, and l2. The median heuristic fails entirely, yielding the 95% error expected under the null hypothesis. It is notable that the cross-validation approach performs less well than our criterion, which suggests that a direct approach addressing the Type II error is preferable to optimizing the classifier performance.
In our final experiment, the distributions p, q were short samples of amplitude modulated (AM) signals, which were carrier sinusoids with amplitudes scaled by different audio signals for p and q. These signals took the form

y(t) = cos(ω_c t) (A s(t) + o_c) + n(t),

where y(t) is the AM signal at time t, s(t) is an audio signal, ω_c is the frequency of the carrier signal, A is an amplitude scaling parameter, o_c is a constant offset, and n(t) is i.i.d. Gaussian noise with standard deviation σ_ε. The source audio signals were [5, Vol. 1, Track 2; Vol. 2, Track 17], and had the same singer but different accompanying instruments. Both songs were normalized to have unit standard deviation, to avoid a trivial distinction on the basis of sound volume. The audio was sampled at 8 kHz, the carrier was at 24 kHz, and the resulting AM signals were sampled at 120 kHz. Further settings were A = 0.5 and o_c = 2. We extracted signal fragments of length 1000, corresponding to a time duration of 8.3 × 10^−3 seconds in the original audio.
Our base kernels {k_u}_{u=1}^d in (3) were multivariate isotropic Gaussians with bandwidth varying between 2^{-15} and 2^{15}, with a multiplicative step-size of 2, and we set λ_n = 10^{-5}. Sample extracts from each source and Type II error vs. noise level σ_ε are shown in Figure 2. Here max-ratio does best, with successively decreasing performance by opt, max-mmd, l2, and median. We remark that in the second and third experiments, simply choosing the kernel k_u with the largest ratio η̂_u/σ̂_{u,λ} does as well as or better than solving for β̂^* in (11). The max-ratio strategy is thus recommended when a single best kernel exists in the set {k_u}_{u=1}^d, although it clearly fails when a linear combination of several kernels is needed (as in the first experiment).

Further experiments are provided in the supplementary material. These include an empirical verification that the Type I error is close to the design parameter α, a check that kernels are not chosen at extreme values when the null hypothesis holds, additional AM experiments, and further synthetic benchmarks.

6 Conclusions

We have proposed a criterion to explicitly optimize the Hodges and Lehmann asymptotic relative efficiency for the kernel two-sample test: the kernel parameters are chosen to minimize the asymptotic Type II error at a given Type I error. In experiments using linear combinations of kernels, this approach often performs significantly better than the simple strategy of choosing the kernel with largest MMD (the previous best approach), or maximizing the MMD subject to an ℓ2 constraint on the kernel weights, and yields good performance even when the median heuristic fails completely. A promising next step would be to optimize over the parameters of a single kernel (e.g., over the bandwidth of an RBF kernel).
This presents two challenges: first, in proving that a finite sample estimate of the kernel selection criterion converges, which might be possible following [15]; and second, in efficiently optimizing the criterion over the kernel parameter, where we could employ a DC programming [2] or semi-infinite programming [6] approach.

Acknowledgements: Part of this work was accomplished when S. B. was visiting the MPI for Intelligent Systems. We thank Samory Kpotufe and Bernhard Schölkopf for helpful discussions.

References

[1] R. Adler and J. Taylor. Random Fields and Geometry. Springer, 2007.

[2] A. Argyriou, R. Hauser, C. A. Micchelli, and M. Pontil. A DC-programming algorithm for kernel selection. In ICML, pages 41–48, 2006.

[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[4] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.

[5] Magnetic Fields. 69 Love Songs. Merge, MRG169, 1999.

[6] P. Gehler and S. Nowozin. Infinite kernel learning. Technical Report TR-178, Max Planck Institute for Biological Cybernetics, 2008.

[7] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.

[8] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520, Cambridge, MA, 2007. MIT Press.

[9] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems 22, Red Hook, NY, 2009. Curran Associates Inc.

[10] Z. Harchaoui, F. Bach, and E. Moulines.
Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.

[11] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1990.

[12] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.

[13] R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.

[14] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.

[15] B. Sriperumbudur, K. Fukumizu, A. Gretton, G. Lanckriet, and B. Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems 22, Red Hook, NY, 2009. Curran Associates Inc.

[16] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

[17] M. Sugiyama, T. Suzuki, Y. Itoh, T. Kanamori, and M. Kimura. Least-squares two-sample test. Neural Networks, 24(7):735–751, 2011.

[18] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes.
Springer, 1996.