{"title": "A Kernel Method for the Two-Sample-Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 513, "page_last": 520, "abstract": "", "full_text": "A Kernel Method for the Two-Sample-Problem\n\nArthur Gretton\n\nMPI for Biological Cybernetics\n\nT\u00a8ubingen, Germany\n\narthur@tuebingen.mpg.de\n\nKarsten M. Borgwardt\n\nLudwig-Maximilians-Univ.\n\nMunich, Germany\nkb@dbs.i\ufb01.lmu.de\n\nMalte Rasch\n\nmalte.rasch@igi.tu-graz.ac.at\n\nGraz Univ. of Technology,\n\nGraz, Austria\n\nBernhard Sch\u00a8olkopf\n\nMPI for Biological Cybernetics\n\nT\u00a8ubingen, Germany\nbs@tuebingen.mpg.de\n\nAlexander J. Smola\n\nNICTA, ANU\n\nCanberra, Australia\n\nAlex.Smola@anu.edu.au\n\nAbstract\n\nWe propose two statistical tests to determine if two samples are from different dis-\ntributions. Our test statistic is in both cases the distance between the means of the\ntwo samples mapped into a reproducing kernel Hilbert space (RKHS). The \ufb01rst\ntest is based on a large deviation bound for the test statistic, while the second is\nbased on the asymptotic distribution of this statistic. The test statistic can be com-\nputed in O(m2) time. We apply our approach to a variety of problems, including\nattribute matching for databases using the Hungarian marriage method, where our\ntest performs strongly. We also demonstrate excellent performance when compar-\ning distributions over graphs, for which no alternative tests currently exist.\n\n1 Introduction\nWe address the problem of comparing samples from two probability distributions, by proposing a\nstatistical test of the hypothesis that these distributions are different (this is called the two-sample\nor homogeneity problem). This test has application in a variety of areas. 
In bioinformatics, it is of interest to compare microarray data from different tissue types, either to determine whether two subtypes of cancer may be treated as statistically indistinguishable from a diagnosis perspective, or to detect differences in healthy and cancerous tissue. In database attribute matching, it is desirable to merge databases containing multiple fields, where it is not known in advance which fields correspond: the fields are matched by maximising the similarity in the distributions of their entries.\n\nIn this study, we propose to test whether distributions p and q are different on the basis of samples drawn from each of them, by finding a smooth function which is large on the points drawn from p, and small (as negative as possible) on the points from q. We use as our test statistic the difference between the mean function values on the two samples; when this is large, the samples are likely from different distributions. We call this statistic the Maximum Mean Discrepancy (MMD).\n\nClearly the quality of MMD as a statistic depends heavily on the class F of smooth functions that define it. On one hand, F must be \u201crich enough\u201d so that the population MMD vanishes if and only if p = q. On the other hand, for the test to be consistent, F needs to be \u201crestrictive\u201d enough for the empirical estimate of MMD to converge quickly to its expectation as the sample size increases. We shall use the unit balls in universal reproducing kernel Hilbert spaces [22] as our function class, since these will be shown to satisfy both of the foregoing properties. On a more practical note, MMD is cheap to compute: given m points sampled from p and n from q, the cost is O((m + n)^2) time.\nWe define two non-parametric statistical tests based on MMD. 
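As an illustration of how cheaply this statistic can be computed, the following is a minimal NumPy sketch of the (biased) kernel form of the empirical MMD derived in Section 2; the Gaussian kernel choice and all function names here are ours, not part of the paper:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_biased(X, Y, sigma=1.0):
    # Biased empirical MMD: the RKHS distance between the empirical
    # kernel mean embeddings of the two samples.
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    mmd2 = Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()
    return np.sqrt(max(mmd2, 0.0))  # guard against tiny negative rounding
```

The Gram-matrix sums cost O((m + n)^2) kernel evaluations, matching the stated complexity.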
The first, which uses distribution-independent uniform convergence bounds, provides finite sample guarantees of test performance, at the expense of being conservative in detecting differences between p and q. The second test is based on the asymptotic distribution of MMD, and is in practice more sensitive to differences in distribution at small sample sizes. These results build on our earlier work in [6] on MMD for the two sample problem, which addresses only the second kind of test. In addition, the present approach employs a more accurate approximation to the asymptotic distribution of the test statistic.\n\nWe begin our presentation in Section 2 with a formal definition of the MMD, and a proof that the population MMD is zero if and only if p = q when F is the unit ball of a universal RKHS. We also give an overview of hypothesis testing as it applies to the two-sample problem, and review previous approaches. In Section 3, we provide a bound on the deviation between the population and empirical MMD, as a function of the Rademacher averages of F with respect to p and q. This leads to a first hypothesis test. We take a different approach in Section 4, where we use the asymptotic distribution of an unbiased estimate of the squared MMD as the basis for a second test. Finally, in Section 5, we demonstrate the performance of our method on problems from neuroscience, bioinformatics, and attribute matching using the Hungarian marriage approach. Our approach performs well on high dimensional data with low sample size; in addition, we are able to successfully apply our test to graph data, for which no alternative tests exist. Proofs and further details are provided in [13], and software may be downloaded from http://www.kyb.mpg.de/bs/people/arthur/mmd.htm\n\n2 The Two-Sample-Problem\nOur goal is to formulate a statistical test that answers the following question:\n\nProblem 1 Let p and q be distributions defined on a domain X. 
Given observations X := {x_1, . . . , x_m} and Y := {y_1, . . . , y_n}, drawn independently and identically distributed (i.i.d.) from p and q respectively, is p \u2260 q?\n\nTo start with, we wish to determine a criterion that, in the population setting, takes on a unique and distinctive value only when p = q. It will be defined based on [10, Lemma 9.3.2].\n\nLemma 1 Let (X, d) be a separable metric space, and let p, q be two Borel probability measures defined on X. Then p = q if and only if E_p(f(x)) = E_q(f(x)) for all f \u2208 C(X), where C(X) is the space of continuous bounded functions on X.\n\nAlthough C(X) in principle allows us to identify p = q uniquely, such a rich function class is not practical in the finite sample setting. We thus define a more general class of statistic, for as yet unspecified function classes F, to measure the discrepancy between p and q, as proposed in [11].\n\nDefinition 2 Let F be a class of functions f : X \u2192 R and let p, q, X, Y be defined as above. Then we define the maximum mean discrepancy (MMD) and its empirical estimate as\n\nMMD [F, p, q] := sup_{f\u2208F} (E_{x\u223cp}[f(x)] \u2212 E_{y\u223cq}[f(y)]),   (1)\n\nMMD [F, X, Y] := sup_{f\u2208F} ((1/m) \u2211_{i=1}^{m} f(x_i) \u2212 (1/n) \u2211_{i=1}^{n} f(y_i)).   (2)\n\nWe must now identify a function class that is rich enough to uniquely establish whether p = q, yet restrictive enough to provide useful finite sample estimates (the latter property will be established in subsequent sections). To this end, we select F to be the unit ball in a universal RKHS H [22]; we will henceforth use F only to denote this function class. With the additional restriction that X be compact, a universal RKHS is dense in C(X) with respect to the L\u221e norm. 
It is shown in [22] that Gaussian and Laplace kernels are universal.\n\nTheorem 3 Let F be a unit ball in a universal RKHS H, defined on the compact metric space X, with associated kernel k(\u00b7,\u00b7). Then MMD [F, p, q] = 0 if and only if p = q.\n\nThis theorem is proved in [13]. We next express the MMD in a more easily computable form. This is simplified by the fact that in an RKHS, function evaluations can be written f(x) = \u27e8\u03c6(x), f\u27e9, where \u03c6(x) = k(x, \u00b7). Denote by \u00b5[p] := E_{x\u223cp}[\u03c6(x)] the expectation of \u03c6(x) (assuming that it exists \u2013 a sufficient condition for this is \u2016\u00b5[p]\u2016_H^2 < \u221e, which is rearranged as E_p[k(x, x\u2032)] < \u221e, where x and x\u2032 are independent random variables drawn according to p). Since E_p[f(x)] = \u27e8\u00b5[p], f\u27e9, we may rewrite\n\nMMD[F, p, q] = sup_{\u2016f\u2016_H\u22641} \u27e8\u00b5[p] \u2212 \u00b5[q], f\u27e9 = \u2016\u00b5[p] \u2212 \u00b5[q]\u2016_H.   (3)\n\nUsing \u00b5[X] := (1/m) \u2211_{i=1}^{m} \u03c6(x_i) and k(x, x\u2032) = \u27e8\u03c6(x), \u03c6(x\u2032)\u27e9, an empirical estimate of MMD is\n\nMMD [F, X, Y] = [(1/m^2) \u2211_{i,j=1}^{m} k(x_i, x_j) \u2212 (2/mn) \u2211_{i=1}^{m} \u2211_{j=1}^{n} k(x_i, y_j) + (1/n^2) \u2211_{i,j=1}^{n} k(y_i, y_j)]^{1/2}.   (4)\n\nEq. (4) provides us with a test statistic for p \u2260 q. We shall see in Section 3 that this estimate is biased, although it is straightforward to upper bound the bias (we give an unbiased estimate, and an associated test, in Section 4). Intuitively we expect MMD[F, X, Y] to be small if p = q, and the quantity to be large if the distributions are far apart. Computing (4) costs O((m + n)^2) time.\n\nOverview of Statistical Hypothesis Testing, and of Previous Approaches Having defined our test statistic, we briefly describe the framework of statistical hypothesis testing as it applies in the present context, following [9, Chapter 8]. Given i.i.d. 
samples X \u223c p of size m and Y \u223c q of size n, the statistical test, T(X, Y) : X^m \u00d7 X^n \u21a6 {0, 1} is used to distinguish between the null hypothesis H_0 : p = q and the alternative hypothesis H_1 : p \u2260 q. This is achieved by comparing the test statistic MMD[F, X, Y] with a particular threshold: if the threshold is exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population MMD indicates p = q). The acceptance region of the test is thus defined as any real number below the threshold. Since the test is based on finite samples, it is possible that an incorrect answer will be returned: we define the Type I error as the probability of rejecting p = q based on the observed sample, despite the null hypothesis being true. Conversely, the Type II error is the probability of accepting p = q despite the underlying distributions being different. The level \u03b1 of a test is an upper bound on the Type I error: this is a design parameter of the test, and is used to set the threshold to which we compare the test statistic (finding the test threshold for a given \u03b1 is the topic of Sections 3 and 4). A consistent test achieves a level \u03b1, and a Type II error of zero, in the large sample limit. We will see that both of the tests proposed in this paper are consistent.\n\nWe next give a brief overview of previous approaches to the two sample problem for multivariate data. Since our later experimental comparison is with respect to certain of these methods, we give abbreviated algorithm names in italics where appropriate: these should be used as a key to the tables in Section 5. We provide further details in [13]. A generalisation of the Wald-Wolfowitz runs test to the multivariate domain was proposed and analysed in [12, 17] (Wolf), which involves counting the number of edges in the minimum spanning tree over the aggregated data that connect points in X to points in Y. 
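For concreteness, this multivariate runs statistic can be sketched as follows, under the assumption of Euclidean data and using SciPy's MST routine (the function name is ours):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def fr_runs_statistic(X, Y):
    # Build the Euclidean minimum spanning tree over the pooled sample
    # and count edges that join a point of X to a point of Y.
    Z = np.vstack([X, Y])
    labels = np.array([0] * len(X) + [1] * len(Y))
    mst = minimum_spanning_tree(cdist(Z, Z)).tocoo()
    return int(np.sum(labels[mst.row] != labels[mst.col]))
```

Few cross edges suggest the two samples occupy separate regions; many cross edges suggest they are well mixed.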
The computational cost of this method using Kruskal\u2019s algorithm is\nO((m + n)2 log(m + n)), although more modern methods improve on the log(m + n) term. Two\npossible generalisations of the Kolmogorov-Smirnov test to the multivariate case were studied in\n[4, 12]. The approach of Friedman and Rafsky (Smir) in this case again requires a minimal spanning\ntree, and has a similar cost to their multivariate runs test. A more recent multivariate test was\nproposed in [20], which is based on the minimum distance non-bipartite matching over the aggregate\ndata, at cost O((m + n)3). Another recent test was proposed in [15] (Hall): for each point from\np, it requires computing the closest points in the aggregated data, and counting how many of these\nare from q (the procedure is repeated for each point from q with respect to points from p). The test\nstatistic is costly to compute; [15] consider only tens of points in their experiments.\nYet another approach is to use some distance (e.g. L1 or L2) between Parzen window estimates\nof the densities as a test statistic [1, 3], based on the asymptotic distribution of this distance given\np = q. When the L2 norm is used, the test statistic is related to those we present here, although it is\narrived at from a different perspective (see [13]: the test in [1] is obtained in a more restricted setting\nwhere the RKHS kernel is an inner product between Parzen windows. Since we are not doing density\nestimation, however, we need not decrease the kernel width as the sample grows. In fact, decreasing\nthe kernel width reduces the convergence rate of the associated two-sample test, compared with\nthe (m + n)\u22121/2 rate for \ufb01xed kernels). 
The L1 approach of [3] (Biau) requires the space to be partitioned into a grid of bins, which becomes difficult or impossible for high dimensional problems. Hence we use this test only for low-dimensional problems in our experiments.\n\n3 A Test based on Uniform Convergence Bounds\n\nIn this section, we establish two properties of the MMD. First, we show that regardless of whether or not p = q, the empirical MMD converges in probability at rate 1/\u221a(m + n) to its population value. This establishes the consistency of statistical tests based on MMD. Second, we give probabilistic bounds for large deviations of the empirical MMD in the case p = q. These bounds lead directly to a threshold for our first hypothesis test.\n\nWe begin our discussion of the convergence of MMD[F, X, Y] to MMD[F, p, q].\n\nTheorem 4 Let p, q, X, Y be defined as in Problem 1, and assume |k(x, y)| \u2264 K. Then\n\nPr{|MMD[F, X, Y] \u2212 MMD[F, p, q]| > 2((K/m)^{1/2} + (K/n)^{1/2}) + \u03b5} \u2264 2 exp(\u2212\u03b5^2 mn / (2K(m + n))).\n\nOur next goal is to refine this result in a way that allows us to define a test threshold under the null hypothesis p = q. Under this circumstance, the constants in the exponent are slightly improved.\n\nTheorem 5 Under the conditions of Theorem 4 where additionally p = q and m = n,\n\nMMD[F, X, Y] > B_1(F, p) + \u03b5 and MMD[F, X, Y] > B_2(F, p) + \u03b5,\n\nwhere B_1(F, p) := m^{\u22121/2} \u221a(2E_p[k(x, x) \u2212 k(x, x\u2032)]) and B_2(F, p) := 2(K/m)^{1/2}, both with probability less than exp(\u2212\u03b5^2 m / (4K)) (see [13] for the proof).\n\nIn this theorem, we illustrate two possible bounds B_1(F, p) and B_2(F, p) on the bias in the empirical estimate (4). 
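Setting the deviation probability exp(\u2212\u03b5^2 m/(4K)) of Theorem 5 equal to \u03b1 and solving for \u03b5 yields the distribution-free threshold of Lemma 6 below; a small sketch (the function name is ours; K = 1 holds for the Gaussian RBF kernel):

```python
import numpy as np

def mmd_threshold(m, K=1.0, alpha=0.05):
    # Level-alpha acceptance threshold 2*sqrt(K/m)*(1 + sqrt(log(1/alpha)))
    # for a kernel bounded by K, with m points in each sample (m = n).
    return 2.0 * np.sqrt(K / m) * (1.0 + np.sqrt(np.log(1.0 / alpha)))
```

The threshold shrinks as m^{-1/2}: the test becomes more sensitive as the sample grows, and a value of the biased statistic (4) above the threshold rejects p = q at level \u03b1.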
The first inequality is interesting inasmuch as it provides a link between the bias bound B_1(F, p) and kernel size (for instance, if we were to use a Gaussian kernel with large \u03c3, then k(x, x) and k(x, x\u2032) would likely be close, and the bias small). In the context of testing, however, we would need to provide an additional bound to show convergence of an empirical estimate of B_1(F, p) to its population equivalent. Thus, in the following test for p = q based on Theorem 5, we use B_2(F, p) to bound the bias.\n\nLemma 6 A hypothesis test of level \u03b1 for the null hypothesis p = q (equivalently MMD[F, p, q] = 0) has the acceptance region MMD[F, X, Y] < 2\u221a(K/m) (1 + \u221a(log \u03b1^{\u22121})).\n\nWe emphasise that Theorem 4 guarantees the consistency of the test, and that the Type II error probability decreases to zero at rate 1/\u221am (assuming m = n). To put this convergence rate in perspective, consider a test of whether two normal distributions have equal means, given they have unknown but equal variance [9, Exercise 8.41]. In this case, the test statistic has a Student-t distribution with n + m \u2212 2 degrees of freedom, and its error probability converges at the same rate as our test.\n\n4 An Unbiased Test Based on the Asymptotic Distribution of the U-Statistic\n\nWe now propose a second test, which is based on the asymptotic distribution of an unbiased estimate of MMD^2. We begin by defining this test statistic.\n\nLemma 7 Given x and x\u2032 independent random variables with distribution p, and y and y\u2032 independent random variables with distribution q, the population MMD^2 is\n\nMMD^2 [F, p, q] = E_{x,x\u2032\u223cp}[k(x, x\u2032)] \u2212 2E_{x\u223cp,y\u223cq}[k(x, y)] + E_{y,y\u2032\u223cq}[k(y, y\u2032)]   (5)\n\n(see [13] for details). Let Z := (z_1, . . . , z_m) be m i.i.d. random variables, where z_i := (x_i, y_i) (i.e. we assume m = n). 
An unbiased empirical estimate of MMD^2 is\n\nMMD^2_u [F, X, Y] = (1/(m(m \u2212 1))) \u2211_{i\u2260j}^{m} h(z_i, z_j),   (6)\n\nwhich is a one-sample U-statistic with h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) \u2212 k(x_i, y_j) \u2212 k(x_j, y_i).\n\nThe empirical statistic is an unbiased estimate of MMD^2, although it does not have minimum variance (the minimum variance estimate is almost identical: see [21, Section 5.1.4]). We remark that these quantities can easily be linked with a simple kernel between probability measures: (5) is a special case of the Hilbertian metric [16, Eq. (4)] with the associated kernel K(p, q) = E_{p,q}[k(x, y)] [16, Theorem 4]. The asymptotic distribution of this test statistic under H_1 is given by [21, Section 5.5.1], and the distribution under H_0 is computed based on [21, Section 5.5.2] and [1, Appendix]; see [13] for details.\n\nTheorem 8 We assume E(h^2) < \u221e. Under H_1, MMD^2_u converges in distribution (defined e.g. in [14, Section 7.2]) to a Gaussian according to\n\nm^{1/2} (MMD^2_u \u2212 MMD^2 [F, p, q]) \u2192_D N(0, \u03c3_u^2),\n\nwhere \u03c3_u^2 = 4(E_z[(E_{z\u2032} h(z, z\u2032))^2] \u2212 [E_{z,z\u2032}(h(z, z\u2032))]^2), uniformly at rate 1/\u221am [21, Theorem B, p. 193]. Under H_0, the U-statistic is degenerate, meaning E_{z\u2032} h(z, z\u2032) = 0. 
In this case, MMD^2_u converges in distribution according to\n\nm MMD^2_u \u2192_D \u2211_{l=1}^{\u221e} \u03bb_l [z_l^2 \u2212 2],   (7)\n\nwhere z_l \u223c N(0, 2) i.i.d., and the \u03bb_i are the solutions to the eigenvalue equation\n\n\u222b_X \u02dck(x, x\u2032) \u03c8_i(x) dp(x) = \u03bb_i \u03c8_i(x\u2032),\n\nwhere \u02dck(x_i, x_j) := k(x_i, x_j) \u2212 E_x[k(x_i, x)] \u2212 E_x[k(x, x_j)] + E_{x,x\u2032}[k(x, x\u2032)] is the centred RKHS kernel.\n\nOur goal is to determine whether the empirical test statistic MMD^2_u is so large as to be outside the 1 \u2212 \u03b1 quantile of the null distribution in (7) (consistency of the resulting test is guaranteed by the form of the distribution under H_1). One way to estimate this quantile is using the bootstrap [2] on the aggregated data. Alternatively, we may approximate the null distribution by fitting Pearson curves to its first four moments [18, Section 18.8]. Taking advantage of the degeneracy of the U-statistic, we obtain (see [13])\n\nE([MMD^2_u]^2) = (2/(m(m \u2212 1))) E_{z,z\u2032}[h^2(z, z\u2032)] and\n\nE([MMD^2_u]^3) = (8(m \u2212 2)/(m^2(m \u2212 1)^2)) E_{z,z\u2032}[h(z, z\u2032) E_{z\u2032\u2032}(h(z, z\u2032\u2032) h(z\u2032, z\u2032\u2032))] + O(m^{\u22124}).   (8)\n\nThe fourth moment E([MMD^2_u]^4) is not computed, since it is both very small (O(m^{\u22124})) and expensive to calculate (O(m^4)). Instead, we replace the kurtosis with its lower bound\n\nkurt(MMD^2_u) \u2265 (skew(MMD^2_u))^2 + 1.\n\n5 Experiments\n\nWe conducted distribution comparisons using our MMD-based tests on datasets from three real-world domains: database applications, bioinformatics, and neurobiology. We investigated the uniform convergence approach (MMD), the asymptotic approach with bootstrap (MMD^2_u B), and the asymptotic approach with moment matching to Pearson curves (MMD^2_u M). 
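The bootstrap variant can be sketched compactly: compute the U-statistic (6) from the aggregate Gram matrix, then estimate the null quantile by repeatedly permuting the pooled sample (all function names and defaults below are ours):

```python
import numpy as np

def gaussian_gram(Z, sigma=1.0):
    # Gram matrix of the Gaussian RBF kernel over the pooled sample Z.
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2_u(K, m):
    # Unbiased MMD^2 from the (2m x 2m) Gram matrix of Z = [X; Y]:
    # off-diagonal sums implement the sum over i != j in the U-statistic.
    off = lambda A: A.sum() - np.trace(A)
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    return (off(Kxx) + off(Kyy) - 2.0 * off(Kxy)) / (m * (m - 1))

def mmd2_u_bootstrap_test(X, Y, sigma=1.0, alpha=0.05, n_boot=300, seed=0):
    # Permute the aggregated sample to draw from the null distribution,
    # then compare the statistic with the empirical (1 - alpha) quantile.
    rng = np.random.default_rng(seed)
    m = len(X)
    K = gaussian_gram(np.vstack([X, Y]), sigma)
    stat = mmd2_u(K, m)
    null = [mmd2_u(K[np.ix_(p, p)], m)
            for p in (rng.permutation(2 * m) for _ in range(n_boot))]
    threshold = float(np.quantile(null, 1.0 - alpha))
    return stat, threshold, stat > threshold
```

Since the Gram matrix is computed once and only re-indexed per resample, each bootstrap iteration costs O(m^2).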
We also compared against several alternatives from the literature (where applicable): the multivariate t-test, the Friedman-Rafsky Kolmogorov-Smirnov generalisation (Smir), the Friedman-Rafsky Wald-Wolfowitz generalisation (Wolf), the Biau-Gy\u00f6rfi test (Biau), and the Hall-Tajvidi test (Hall). Note that we do not apply the Biau-Gy\u00f6rfi test to high-dimensional problems (see end of Section 2), and that MMD is the only method applicable to structured data such as graphs.\n\nAn important issue in the practical application of the MMD-based tests is the selection of the kernel parameters. We illustrate this with a Gaussian RBF kernel, where we must choose the kernel width \u03c3 (we use this kernel for univariate and multivariate data, but not for graphs). The empirical MMD is zero both for kernel size \u03c3 = 0 (where the aggregate Gram matrix over X and Y is a unit matrix), and also approaches zero as \u03c3 \u2192 \u221e (where the aggregate Gram matrix becomes uniformly constant). We set \u03c3 to be the median distance between points in the aggregate sample, as a compromise between these two extremes: this remains a heuristic, however, and the optimum choice of kernel size is an ongoing area of research.\n\nData integration As a first application of MMD, we performed distribution testing for data integration: the objective is to aggregate two datasets into a single sample, with the understanding that both original samples are generated from the same distribution. Clearly, it is important to check this last condition before proceeding, or an analysis could detect patterns in the new dataset that are caused by combining the two different source distributions, and not by real-world phenomena. 
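The median heuristic for the kernel width described above is one line with SciPy (the function name is ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic(X, Y):
    # sigma = median pairwise distance over the aggregated sample: a
    # compromise between sigma -> 0 and sigma -> infinity, at both of
    # which the empirical MMD degenerates to zero.
    return float(np.median(pdist(np.vstack([X, Y]))))
```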
We\nchose several real-world settings to perform this task: we compared microarray data from normal\nand tumor tissues (Health status), microarray data from different subtypes of cancer (Subtype), and\nlocal \ufb01eld potential (LFP) electrode recordings from the Macaque primary visual cortex (V1) with\nand without spike events (Neural Data I and II). In all cases, the two data sets have different statistical\nproperties, but the detection of these differences is made dif\ufb01cult by the high data dimensionality.\n\nWe applied our tests to these datasets in the following fashion. Given two datasets A and B, we either\nchose one sample from A and the other from B (attributes = different); or both samples from either\nA or B (attributes = same). We then repeated this process up to 1200 times. Results are reported\nin Table 1. Our asymptotic tests perform better than all competitors besides Wolf: in the latter case,\nwe have greater Type II error for one neural dataset, lower Type II error on the Health Status data\n(which has very high dimension and low sample size), and identical (error-free) performance on the\nremaining examples. We note that the Type I error of the bootstrap test on the Subtype dataset is far\nfrom its design value of 0.05, indicating that the Pearson curves provide a better threshold estimate\nfor these low sample sizes. For the remaining datasets, the Type I errors of the Pearson and Bootstrap\napproximations are close. Thus, for larger datasets, the bootstrap is to be preferred, since it costs\nO(m2), compared with a cost of O(m3) for Pearson (due to the cost of computing (8)). 
Finally, the uniform convergence-based test is too conservative, finding differences in distribution only for the data with largest sample size.\n\nDataset         Attr.      MMD    MMD^2_u B  MMD^2_u M  t-test  Wolf  Smir  Hall\nNeural Data I   Same      100.0     96.5       96.5     100.0   97.0  95.0  96.0\nNeural Data I   Different  50.0      0.0        0.0      42.0    0.0  49.0  10.0\nNeural Data II  Same      100.0     94.6       95.0     100.0   95.2  96.0  94.5\nNeural Data II  Different 100.0      3.3        0.8     100.0    3.4   5.9  31.8\nHealth status   Same      100.0     95.5       94.4     100.0   94.7  96.1  95.6\nHealth status   Different 100.0      1.0        0.8     100.0    2.8  35.7  44.0\nSubtype         Same      100.0     99.1       94.6     100.0   96.4  96.5  97.3\nSubtype         Different 100.0      0.0        0.0     100.0    0.0  28.4   0.2\n\nTable 1: Distribution testing for data integration on multivariate data. Numbers indicate the percentage of repetitions for which the null hypothesis (p=q) was accepted, given \u03b1 = 0.05. Sample size (dimension; repetitions of experiment): Neural I 4000 (63; 100); Neural II 1000 (100; 1200); Health Status 25 (12,600; 1000); Subtype 25 (2,118; 1000).\n\nAttribute matching Our second series of experiments addresses automatic attribute matching. Given two databases, we want to detect corresponding attributes in the schemas of these databases, based on their data-content (as a simple example, two databases might have respective fields Wage and Salary, which are assumed to be observed via a subsampling of a particular population, and we wish to automatically determine that both Wage and Salary denote the same underlying attribute). We use a two-sample test on pairs of attributes from two databases to find corresponding pairs.1 This procedure is also called table matching for tables from different databases. We performed attribute matching as follows: first, the dataset D was split into two halves A and B. Each of the n attributes\n\n1Note that corresponding attributes may have different distributions in real-world databases. 
Hence, schema\nmatching cannot solely rely on distribution testing. Advanced approaches to schema matching using MMD as\none key statistical test are a topic of current research.\n\n\fin A (and B, resp.) was then represented by its instances in A (resp. B). We then tested all pairs\nof attributes from A and from B against each other, to \ufb01nd the optimal assignment of attributes\nA1, . . . , An from A to attributes B1, . . . , Bn from B. We assumed that A and B contain the same\nnumber of attributes.\n\nAs a naive approach, one could assume that any possible pair of attributes might correspond, and\nthus that every attribute of A needs to be tested against all the attributes of B to \ufb01nd the optimal\nmatch. We report results for this naive approach, aggregated over all pairs of possible attribute\nmatches, in Table 2. We used three datasets: the census income dataset from the UCI KDD archive\n(CNUM), the protein homology dataset from the 2004 KDD Cup (BIO) [8], and the forest dataset\nfrom the UCI ML archive [5]. For the \ufb01nal dataset, we performed univariate matching of attributes\n(FOREST) and multivariate matching of tables (FOREST10D) from two different databases, where\neach table represents one type of forest. Both our asymptotic MMD2\nu-based tests perform as well as\nor better than the alternatives, notably for CNUM, where the advantage of MMD2\nu is large. Unlike\nin Table 1, the next best alternatives are not consistently the same across all data: e.g. in BIO they\nare Wolf or Hall, whereas in FOREST they are Smir, Biau, or the t-test. Thus, MMD2\nu appears to\nperform more consistently across the multiple datasets. The Friedman-Rafsky tests do not always\nreturn a Type I error close to the design parameter: for instance, Wolf has a Type I error of 9.7% on\nthe BIO dataset (on these data, MMD2\nu has the joint best Type II error without compromising the\ndesigned Type I performance). 
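Once a cost matrix of pairwise squared MMD values between the attributes of A and B has been computed, the optimal one-to-one matching is a linear assignment problem; a sketch using SciPy's solver (the helper name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_attributes(C):
    # C[i, j] = squared MMD between attribute i of A and attribute j of B.
    # Minimising sum_i C[i, pi(i)] over permutations pi is the linear
    # assignment problem (Hungarian method, O(n^3)).
    _, col = linear_sum_assignment(C)
    return col  # col[i] is the attribute of B matched to attribute i of A
```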
Finally, our uniform convergence approach performs much better than in Table 1, although surprisingly it fails to detect differences in FOREST10D.\n\nA more principled approach to attribute matching is also possible. Assume that \u03c6(A) = (\u03c6_1(A_1), \u03c6_2(A_2), ..., \u03c6_n(A_n)): in other words, the kernel decomposes into kernels on the individual attributes of A (and also decomposes this way on the attributes of B). In this case, MMD^2 can be written \u2211_{i=1}^{n} \u2016\u00b5_i(A_i) \u2212 \u00b5_i(B_i)\u2016^2, where we sum over the MMD terms on each of the attributes. Our goal of optimally assigning attributes from B to attributes of A via MMD is equivalent to finding the optimal permutation \u03c0 of attributes of B that minimizes \u2211_{i=1}^{n} \u2016\u00b5_i(A_i) \u2212 \u00b5_i(B_{\u03c0(i)})\u2016^2. If we define C_{ij} = \u2016\u00b5_i(A_i) \u2212 \u00b5_i(B_j)\u2016^2, then this is the same as minimizing the sum over C_{i,\u03c0(i)}. This is the linear assignment problem, which costs O(n^3) time using the Hungarian method [19].\n\nDataset     Attr.      MMD    MMD^2_u B  MMD^2_u M  t-test  Wolf   Smir  Hall  Biau\nBIO         Same      100.0     93.8       90.3      95.2   94.8   95.3  95.8  99.3\nBIO         Different  20.0     17.2       17.2      36.2   17.6   18.6  17.9  42.1\nFOREST      Same      100.0     96.4       94.6      97.4   96.0   95.5  99.8 100.0\nFOREST      Different   4.9      0.0        0.0       0.2    3.8    0.0  50.1   0.0\nCNUM        Same      100.0     94.5       98.4      94.0   93.8   91.2  97.5  98.5\nCNUM        Different  15.2      2.7        2.5      19.17  22.5   11.6  79.1  50.5\nFOREST10D   Same      100.0     94.0       93.5     100.0   94.0   97.0  96.5 100.0\nFOREST10D   Different 100.0      0.0        0.0       0.0    0.0    1.0  72.0 100.0\n\nTable 2: Naive attribute matching on univariate (BIO, FOREST, CNUM) and multivariate data (FOREST10D). Numbers indicate the percentage of accepted null hypothesis (p=q) pooled over attributes. \u03b1 = 0.05. 
Sample size (dimension; attributes; repetitions of experiment): BIO 377 (1; 6; 100); FOREST 538 (1; 10; 100); CNUM 386 (1; 13; 100); FOREST10D 1000 (10; 2; 100).\n\nWe tested this \u2019Hungarian approach\u2019 to attribute matching via MMD^2_u B on three univariate datasets (BIO, CNUM, FOREST) and for table matching on a fourth (FOREST10D). To study MMD^2_u B on structured data, we obtained two datasets of protein graphs (PROTEINS and ENZYMES) and used the graph kernel for proteins from [7] for table matching via the Hungarian method (the other tests were not applicable to this graph data). The challenge here is to match tables representing one functional class of proteins (or enzymes) from dataset A to the corresponding tables (functional classes) in B. Results are shown in Table 3. Apart from the BIO dataset, MMD^2_u B made no errors.\n\nDataset     Data type     No. attributes  Sample size  Repetitions  % correct matches\nBIO         univariate          6             377          100            90.0\nCNUM        univariate         13             386          100            99.8\nFOREST      univariate         10             538          100           100.0\nFOREST10D   multivariate        2            1000          100           100.0\nENZYME      structured          6              50           50           100.0\nPROTEINS    structured          2             200           50           100.0\n\nTable 3: Hungarian Method for attribute matching via MMD^2_u B on univariate (BIO, CNUM, FOREST), multivariate (FOREST10D), and structured data (ENZYMES, PROTEINS) (\u03b1 = 0.05; \u2018% correct matches\u2019 is the percentage of the correct attribute matches detected over all repetitions).\n\n6 Summary and Discussion\n\nWe have established two simple multivariate tests for comparing two distributions p and q. The test statistics are based on the maximum deviation of the expectation of a function evaluated on each of the random variables, taken over a sufficiently rich function class. We do not require density estimates as an intermediate step. Our method either outperforms competing methods, or is close to the best performing alternative. 
Finally, our test was successfully used to compare distributions on graphs, for which it is currently the only option.\n\nAcknowledgements: The authors thank Matthias Hein for helpful discussions, Patrick Warnat (DKFZ, Heidelberg) for providing the microarray datasets, and Nikos Logothetis for providing the neural datasets. NICTA is funded through the Australian Government\u2019s Backing Australia\u2019s Ability initiative, in part through the ARC. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.\n\nReferences\n\n[1] N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41\u201354, 1994.\n\n[2] M. Arcones and E. Gin\u00e9. On the bootstrap of U and V statistics. The Annals of Statistics, 20(2):655\u2013674, 1992.\n\n[3] G. Biau and L. Gy\u00f6rfi. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11):3965\u20133973, 2005.\n\n[4] P. Bickel. A distribution free version of the Smirnov two sample test in the p-variate case. The Annals of Mathematical Statistics, 40(1):1\u201323, 1969.\n\n[5] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.\n\n[6] K. M. Borgwardt, A. Gretton, M.J. Rasch, H.P. Kriegel, B. Sch\u00f6lkopf, and A.J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.\n\n[7] K. M. Borgwardt, C. S. Ong, S. Schonauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47\u2013i56, Jun 2005.\n\n[8] R. Caruana and T. Joachims. KDD Cup. http://kodiak.cs.cornell.edu/kddcup/index.html, 2004.\n\n[9] G. Casella and R. Berger. Statistical Inference. 
Duxbury, Pacific Grove, CA, 2nd edition, 2002.\n\n[10] R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.\n\n[11] R. Fortet and E. Mourier. Convergence de la r\u00e9partition empirique vers la r\u00e9partition th\u00e9orique. Ann. Scient. \u00c9cole Norm. Sup., 70:266\u2013285, 1953.\n\n[12] J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697\u2013717, 1979.\n\n[13] A. Gretton, K. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel method for the two sample problem. Technical Report 157, MPI for Biological Cybernetics, 2007.\n\n[14] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, third edition, 2001.\n\n[15] P. Hall and N. Tajvidi. Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89(2):359\u2013374, 2002.\n\n[16] M. Hein, T.N. Lal, and O. Bousquet. Hilbertian metrics on probability measures and their application in SVM\u2019s. In Proceedings of the 26th DAGM Symposium, pages 270\u2013277, Berlin, 2004. Springer.\n\n[17] N. Henze and M. Penrose. On the multivariate runs test. The Annals of Statistics, 27(1):290\u2013298, 1999.\n\n[18] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1 (Second Edition). John Wiley and Sons, 1994.\n\n[19] H.W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83\u201397, 1955.\n\n[20] P. Rosenbaum. An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society B, 67(4):515\u2013530, 2005.\n\n[21] R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.\n\n[22] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. 
Res., 2:67\u201393, 2002.\n\n\f", "award": [], "sourceid": 3110, "authors": [{"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Karsten", "family_name": "Borgwardt", "institution": ""}, {"given_name": "Malte", "family_name": "Rasch", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}