{"title": "Interpretable Distribution Features with Maximum Testing Power", "book": "Advances in Neural Information Processing Systems", "page_first": 181, "page_last": 189, "abstract": "Two semimetrics on probability distributions are proposed, given as the sum of differences of expectations of analytic functions evaluated at spatial or frequency locations (i.e, features). The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound on test power for a statistical test using these features. The result is a parsimonious and interpretable indication of how and where two distributions differ locally. An empirical estimate of the test power criterion converges with increasing sample size, ensuring the quality of the returned features. In real-world benchmarks on high-dimensional text and image data, linear-time tests using the proposed semimetrics achieve comparable performance to the state-of-the-art quadratic-time maximum mean discrepancy test, while returning human-interpretable features that explain the test results.", "full_text": "Interpretable Distribution Features\n\nwith Maximum Testing Power\n\nWittawat Jitkrittum, Zolt\u00e1n Szab\u00f3, Kacper Chwialkowski, Arthur Gretton\n\nwittawatj@gmail.com\n\nzoltan.szabo.m@gmail.com\n\nkacper.chwialkowski@gmail.com\n\narthur.gretton@gmail.com\n\nGatsby Unit, University College London\n\nAbstract\n\nTwo semimetrics on probability distributions are proposed, given as the sum of\ndifferences of expectations of analytic functions evaluated at spatial or frequency\nlocations (i.e, features). The features are chosen so as to maximize the distin-\nguishability of the distributions, by optimizing a lower bound on test power for a\nstatistical test using these features. The result is a parsimonious and interpretable\nindication of how and where two distributions differ locally. 
We show that the\nempirical estimate of the test power criterion converges with increasing sample size,\nensuring the quality of the returned features. In real-world benchmarks on high-\ndimensional text and image data, linear-time tests using the proposed semimetrics\nachieve comparable performance to the state-of-the-art quadratic-time maximum\nmean discrepancy test, while returning human-interpretable features that explain\nthe test results.\n\nIntroduction\n\n1\nWe address the problem of discovering features of distinct probability distributions, with which they\ncan most easily be distinguished. The distributions may be in high dimensions, can differ in non-trivial\nways (i.e., not simply in their means), and are observed only through i.i.d. samples. One application\nfor such divergence measures is to model criticism, where samples from a trained model are compared\nwith a validation sample: in the univariate case, through the KL divergence (Cinzia Carota and Polson,\n1996), or in the multivariate case, by use of the maximum mean discrepancy (MMD) (Lloyd and\nGhahramani, 2015). An alternative, interpretable analysis of a multivariate difference in distributions\nmay be obtained by projecting onto a discriminative direction, such that the Wasserstein distance on\nthis projection is maximized (Mueller and Jaakkola, 2015). 
Note that both recent works require low dimensionality, either explicitly (in the case of Lloyd and Ghahramani, the function becomes difficult to plot in more than two dimensions), or implicitly in the case of Mueller and Jaakkola, in that a large difference in distributions must occur in projection along a particular one-dimensional axis. Distances between distributions in high dimensions may be more subtle, however, and it is of interest to find interpretable, distinguishing features of these distributions.

In the present paper, we take a hypothesis testing approach to discovering features which best distinguish two multivariate probability measures P and Q, as observed by samples X := {x_i}_{i=1}^n ⊂ R^d drawn independently and identically (i.i.d.) from P, and Y := {y_i}_{i=1}^n ⊂ R^d from Q. Nonparametric two-sample tests based on RKHS distances (Eric et al., 2008; Fromont et al., 2012; Gretton et al., 2012a) or energy distances (Székely and Rizzo, 2004; Baringhaus and Franz, 2004) have as their test statistic an integral probability metric, the Maximum Mean Discrepancy (Gretton et al., 2012a; Sejdinovic et al., 2013). For this metric, a smooth witness function is computed, such that the amplitude is largest where the probability mass differs most (e.g. Gretton et al., 2012a, Figure 1).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Lloyd and Ghahramani (2015) used this witness function to compare the model output of the Automated Statistician (Lloyd et al., 2014) with a reference sample, yielding a visual indication of where the model fails. In high dimensions, however, the witness function cannot be plotted, and is less helpful. Furthermore, the witness function does not give an easily interpretable result for distributions with local differences in their characteristic functions. 
A more subtle shortcoming is that it does not provide a direct indication of the distribution features which, when compared, would maximize test power; rather, it is the witness function norm, and (broadly speaking) its variance under the null, that determine test power.

Our approach builds on the analytic representations of probability distributions of Chwialkowski et al. (2015), where differences in expectations of analytic functions at particular spatial or frequency locations are used to construct a two-sample test statistic, which can be computed in linear time. Despite the differences in these analytic functions being evaluated at random locations, the analytic tests have greater power than linear-time tests based on subsampled estimates of the MMD (Gretton et al., 2012b; Zaremba et al., 2013). Our first theoretical contribution, in Sec. 3, is to derive a lower bound on the test power, which can be maximized over the choice of test locations. We propose two novel tests, both of which significantly outperform the random feature choice of Chwialkowski et al. (2015). The Mean Embedding (ME) test evaluates the difference of mean embeddings at locations chosen to maximize the test power lower bound (i.e., spatial features); unlike the maxima of the MMD witness function, these features are directly chosen to maximize the distinguishability of the distributions, and take variance into account. The Smooth Characteristic Function (SCF) test uses as its statistic the difference of the two smoothed empirical characteristic functions, evaluated at points in the frequency domain so as to maximize the same criterion (i.e., frequency features). Optimization of the mean embedding kernels/frequency smoothing functions themselves is achieved on a held-out data set with the same consistent objective.

As our second theoretical contribution in Sec. 
3, we prove that the empirical estimate of the test\npower criterion asymptotically converges to its population quantity uniformly over the class of\nGaussian kernels. Two important consequences follow: \ufb01rst, in testing, we obtain a more powerful\ntest with fewer features. Second, we obtain a parsimonious and interpretable set of features that best\ndistinguish the probability distributions. In Sec. 4, we provide experiments demonstrating that the\nproposed linear-time tests greatly outperform all previous linear time tests, and achieve performance\nthat compares to or exceeds the more expensive quadratic-time MMD test (Gretton et al., 2012a).\nMoreover, the new tests discover features of text data (NIPS proceedings) and image data (distinct\nfacial expressions) which have a clear human interpretation, thus validating our feature elicitation\nprocedure in these challenging high-dimensional testing scenarios.\n\ni=1, Y := {yi}n\n\n2 ME and SCF tests\nIn this section, we review the ME and SCF tests (Chwialkowski et al., 2015) for two-sample testing.\nIn Sec. 3, we will extend these approaches to learn features that optimize the power of these tests.\nGiven two samples X := {xi}n\ni=1 \u2282 Rd independently and identically distributed\n(i.i.d.) according to P and Q, respectively, the goal of a two-sample test is to decide whether P is\ndifferent from Q on the basis of the samples. The task is formulated as a statistical hypothesis test\nproposing a null hypothesis H0 : P = Q (samples are drawn from the same distribution) against\nan alternative hypothesis H1 : P (cid:54)= Q (the sample generating distributions are different). A test\ncalculates a test statistic \u02c6\u03bbn from X and Y, and rejects H0 if \u02c6\u03bbn exceeds a predetermined test threshold\n(critical value). 
The threshold is given by the (1 − α)-quantile of the (asymptotic) distribution of λ̂_n under H0, i.e., the null distribution, and α is the significance level of the test.

ME test. The ME test uses as its test statistic λ̂_n, a form of Hotelling's T-squared statistic, defined as λ̂_n := n z̄_n⊤ S_n⁻¹ z̄_n, where z̄_n := (1/n) Σ_{i=1}^n z_i, S_n := (1/(n−1)) Σ_{i=1}^n (z_i − z̄_n)(z_i − z̄_n)⊤, and z_i := (k(x_i, v_j) − k(y_i, v_j))_{j=1}^J ∈ R^J. The statistic depends on a positive definite kernel k : X × X → R (with X ⊆ R^d), and a set of J test locations V = {v_j}_{j=1}^J ⊂ R^d. Under H0, λ̂_n asymptotically follows χ²(J), a chi-squared distribution with J degrees of freedom. The ME test rejects H0 if λ̂_n > T_α, where the test threshold T_α is given by the (1 − α)-quantile of the asymptotic null distribution χ²(J). Although the distribution of λ̂_n under H1 was not derived, Chwialkowski et al. (2015) showed that if k is analytic, integrable and characteristic (in the sense of Sriperumbudur et al. (2011)), under H1, λ̂_n can be arbitrarily large as n → ∞, allowing the test to correctly reject H0.

One can intuitively think of the ME test statistic as a squared L2(X, V_J) distance, normalized by the inverse covariance S_n⁻¹, between the mean embeddings (Smola et al., 2007) of the empirical measures P_n := (1/n) Σ_{i=1}^n δ_{x_i} and Q_n := (1/n) Σ_{i=1}^n δ_{y_i}, where V_J := (1/J) Σ_{j=1}^J δ_{v_j}, and δ_x is the Dirac measure concentrated at x. The unnormalized counterpart (i.e., without S_n⁻¹) was shown by Chwialkowski et al. (2015) to be a metric on the space of probability measures for any V. Both variants behave similarly for two-sample testing, with the normalized version being a semimetric having a more computationally tractable null distribution, i.e., χ²(J).

SCF test. The SCF test uses a statistic of the same form as the ME test statistic, with a modified z_i := [l̂(x_i) sin(x_i⊤v_j) − l̂(y_i) sin(y_i⊤v_j), l̂(x_i) cos(x_i⊤v_j) − l̂(y_i) cos(y_i⊤v_j)]_{j=1}^J ∈ R^{2J}, where l̂(x) = ∫_{R^d} exp(−iu⊤x) l(u) du is the Fourier transform of l(x), and l : R^d → R is an analytic translation-invariant kernel, i.e., l(x − y) defines a positive definite kernel for x and y. In contrast to the ME test defining the statistic in terms of spatial locations, the locations V = {v_j}_{j=1}^J ⊂ R^d in the SCF test are in the frequency domain. As a brief description, let ϕ_P(w) := E_{x∼P} exp(iw⊤x) be the characteristic function of P. Define a smooth characteristic function as φ_P(v) = ∫_{R^d} ϕ_P(w) l(v − w) dw (Chwialkowski et al., 2015, Definition 2). Then, similar to the ME test, the statistic defined by the SCF test can be seen as a normalized (by S_n⁻¹) version of the L2(X, V_J) distance between the empirical φ_P(v) and φ_Q(v). The SCF test statistic has asymptotic distribution χ²(2J) under H0. We will use J' to refer to the degrees of freedom of the chi-squared distribution, i.e., J' = J for the ME test, and J' = 2J for the SCF test.

In this work, we modify the statistic with a regularization parameter γ_n > 0, giving λ̂_n := n z̄_n⊤ (S_n + γ_n I)⁻¹ z̄_n, for stability of the matrix inverse. Using the multivariate Slutsky theorem, under H0, λ̂_n still asymptotically follows χ²(J') provided that γ_n → 0 as n → ∞.

3 Lower bound on test power, consistency of empirical power statistic

This section contains our main results. 
We propose to optimize the test locations V and kernel parameters (jointly referred to as θ) by maximizing a lower bound on the test power in Proposition 1. This criterion offers a simple objective function for fast parameter tuning. The bound may be of independent interest in other Hotelling's T-squared statistics, since apart from the Gaussian case (e.g. Bilodeau and Brenner, 2008, Ch. 8), the characterization of such statistics under the alternative distribution is challenging. The optimization procedure is given in Sec. 4. We use E_{xy} as a shorthand for E_{x∼P} E_{y∼Q} and let ‖·‖_F be the Frobenius norm.

Proposition 1 (Lower bound on ME test power). Let K be a uniformly bounded (i.e., ∃B < ∞ such that sup_{k∈K} sup_{(x,y)∈X²} |k(x, y)| ≤ B) family of k : X × X → R measurable kernels. Let V be a collection in which each element is a set of J test locations. Assume that c̃ := sup_{V∈V, k∈K} ‖Σ⁻¹‖_F < ∞. Then, the test power P(λ̂_n ≥ T_α) of the ME test satisfies P(λ̂_n ≥ T_α) ≥ L(λ_n), where

L(λ_n) := 1 − 2e^{−ξ1(λ_n−T_α)²/n} − 2e^{−[γ_n(λ_n−T_α)(n−1)−ξ2 n]²/(ξ3 n(2n−1)²)} − 2e^{−[(λ_n−T_α)/3−c3 n γ_n]² γ_n²/ξ4},

and c3, ξ1, . . . , ξ4 are positive constants depending on only B, J and c̃. The parameter λ_n := n μ⊤Σ⁻¹μ is the population counterpart of λ̂_n := n z̄_n⊤(S_n + γ_n I)⁻¹z̄_n, where μ = E_{xy}[z_1] and Σ = E_{xy}[(z_1 − μ)(z_1 − μ)⊤]. For large n, L(λ_n) is increasing in λ_n.

Proof (sketch). 
The idea is to construct a bound for |λ̂_n − λ_n| which involves bounding ‖z̄_n − μ‖_2 and ‖S_n − Σ‖_F separately using Hoeffding's inequality. The result follows after a reparameterization of the bound on P(|λ̂_n − λ_n| ≥ t) to have the form P(λ̂_n ≥ T_α). See Sec. F for details.

Proposition 1 suggests that for large n it is sufficient to maximize λ_n to maximize a lower bound on the ME test power. The same conclusion holds for the SCF test (result omitted due to space constraints). Assume that k is characteristic (Sriperumbudur et al., 2011). It can be shown that λ_n = 0 if and only if P = Q, i.e., λ_n is a semimetric for P and Q. In this sense, one can see λ_n as encoding the ease of rejecting H0: the higher λ_n, the easier for the test to correctly reject H0 when H1 holds. This observation justifies the use of λ_n as a maximization objective for parameter tuning.

Contributions. The statistic λ̂_n for both ME and SCF tests depends on a set of test locations V and a kernel parameter σ. We propose to set θ := {V, σ} = arg max_θ λ_n = arg max_θ μ⊤Σ⁻¹μ. The optimization of θ brings two benefits: first, it significantly increases the probability of rejecting H0 when H1 holds; second, the learned test locations act as discriminative features allowing an interpretation of how the two distributions differ. We note that optimizing parameters by maximizing a test power proxy (Gretton et al., 2012b) is valid under both H0 and H1 as long as the data used for parameter tuning and for testing are disjoint. If H0 holds, then θ = arg max 0 is arbitrary. Since the test statistic asymptotically follows χ²(J') for any θ, the optimization does not change the null distribution. Also, the rejection threshold T_α depends only on J' and is independent of θ.

To avoid creating a dependency between θ and the data used for testing (which would affect the null distribution), we split the data into two disjoint sets. Let D := (X, Y) and D_tr, D_te ⊂ D such that D_tr ∩ D_te = ∅ and D_tr ∪ D_te = D. In practice, since μ and Σ are unknown, we use λ̂^tr_{n/2} in place of λ_n, where λ̂^tr_{n/2} is the test statistic computed on the training set D_tr. For simplicity, we assume that each of D_tr and D_te has half of the samples in D. We perform an optimization of θ with a gradient ascent algorithm on λ̂^tr_{n/2}(θ). The actual two-sample test is performed using the test statistic λ̂^te_{n/2}(θ) computed on D_te. The full procedure from tuning the parameters to the actual two-sample test is summarized in Sec. A.

Since we use an empirical estimate λ̂^tr_{n/2} in place of λ_n for parameter optimization, we give a finite-sample bound in Theorem 2 guaranteeing the convergence of z̄_n⊤(S_n + γ_n I)⁻¹z̄_n to μ⊤Σ⁻¹μ as n increases, uniformly over all kernels k ∈ K (a family of uniformly bounded kernels) and all test locations in an appropriate class. Kernel classes satisfying the conditions of Theorem 2 include the widely used isotropic Gaussian kernel class K_g = {k_σ : (x, y) ↦ exp(−(2σ²)⁻¹‖x − y‖²) | σ > 0}, and the more general full Gaussian kernel class K_full = {k : (x, y) ↦ exp(−(x − y)⊤A(x − y)) | A is positive definite} (see Lemma 5 and Lemma 6).

Theorem 2 (Consistency of λ̂_n in the ME test). Let X ⊆ R^d be a measurable set, and V be a collection in which each element is a set of J test locations. 
All suprema over V and k are to be understood as sup_{V∈V} and sup_{k∈K} respectively. For a class of kernels K on X ⊆ R^d, define

F1 := {x ↦ k(x, v) | k ∈ K, v ∈ X},   F2 := {x ↦ k(x, v)k(x, v') | k ∈ K, v, v' ∈ X},   (1)

F3 := {(x, y) ↦ k(x, v)k(y, v') | k ∈ K, v, v' ∈ X}.   (2)

Assume that (1) K is a uniformly bounded (by B) family of k : X × X → R measurable kernels, (2) c̃ := sup_{V,k} ‖Σ⁻¹‖_F < ∞, and (3) F_i = {f_{θ_i} | θ_i ∈ Θ_i} is VC-subgraph with VC-index VC(F_i), and θ ↦ f_{θ_i}(m) is continuous (∀m, i = 1, 2, 3). Let c1 := 4B²J√J c̃, c2 := 4B√J c̃, and c3 := 4B²J c̃². Let C_i (i = 1, 2, 3) be the universal constants associated to the F_i according to Theorem 2.6.7 in van der Vaart and Wellner (2000). Then for any δ ∈ (0, 1), with probability at least 1 − δ,

sup_{V,k} |z̄_n⊤(S_n + γ_n I)⁻¹z̄_n − μ⊤Σ⁻¹μ| ≤ (2/γ_n) T_{F1} · (2n − 1)/(n − 1) + (2/γ_n) c1 J (T_{F2} + T_{F3}) + (8/γ_n) c1 B² J/(n − 1) + c3 γ_n, where

T_{Fj} = (16√2 B ζ_j/√n) c1 B J ( √(log[C_j × VC(F_j)(16e)^{VC(F_j)}]) + c2 √(2π[VC(F_j) − 1]) ) + B ζ_j √(2 log(5/δ)/n),

for j = 1, 2, 3 and ζ1 = 1, ζ2 = ζ3 = 2.

Proof (sketch). The idea is to lower bound the difference with an expression involving sup_{V,k} ‖z̄_n − μ‖_2 and sup_{V,k} ‖S_n − Σ‖_F. These two quantities can be seen as suprema of empirical processes, and can be bounded by Rademacher complexities of their respective function classes (i.e., F1, F2, and F3). Finally, the Rademacher complexities can be upper bounded using the Dudley entropy bound and VC-subgraph properties of the function classes. Proof details are given in Sec. 
D.

Theorem 2 implies that if we set γ_n = O(n^{−1/4}), then we have sup_{V,k} |z̄_n⊤(S_n + γ_n I)⁻¹z̄_n − μ⊤Σ⁻¹μ| = O_p(n^{−1/4}) as the rate of convergence. Both Proposition 1 and Theorem 2 require c̃ := sup_{V∈V, k∈K} ‖Σ⁻¹‖_F < ∞ as a precondition. To guarantee that c̃ < ∞, a concrete construction of K is the isotropic Gaussian kernel class K_g, where σ is constrained to be in a compact set. Also, consider V := {V | any two locations are at least ε distance apart, and all test locations have their norms bounded by ζ} for some ε, ζ > 0. Then, for any non-degenerate P, Q, we have c̃ < ∞ since (σ, V) ↦ λ_n is continuous, and thus attains its supremum over the compact sets K and V.

4 Experiments

Table 1: Four toy problems. H0 holds only in SG.

Data  | P           | Q
SG    | N(0_d, I_d) | N(0_d, I_d)
GMD   | N(0_d, I_d) | N((1, 0, . . . , 0)⊤, I_d)
GVD   | N(0_d, I_d) | N(0_d, diag(2, 1, . . . , 1))
Blobs | Gaussian mixtures in R^2 as studied in Chwialkowski et al. (2015); Gretton et al. (2012b).

In this section, we demonstrate the effectiveness of the proposed methods on both toy and real problems. We consider the isotropic Gaussian kernel class K_g in all kernel-based tests. We study seven two-sample test algorithms. For the SCF test, we set l̂(x) = k(x, 0). 
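With this choice of l̂, the SCF feature vector z_i from Sec. 2 reduces to sine/cosine projections damped by a Gaussian envelope. A minimal NumPy sketch of this feature map (function and variable names are ours, not those of the released code):

```python
import numpy as np

def scf_features(X, Y, V, sigma):
    """SCF features z_i in R^{2J}, using the smoothing function
    l-hat(x) = k(x, 0) = exp(-||x||^2 / (2 sigma^2)) as in the experiments.
    X, Y: (n, d) samples; V: (J, d) frequency-domain test locations."""
    lx = np.exp(-np.sum(X**2, axis=1) / (2 * sigma**2))[:, None]
    ly = np.exp(-np.sum(Y**2, axis=1) / (2 * sigma**2))[:, None]
    XV, YV = X @ V.T, Y @ V.T  # (n, J) inner products x_i^T v_j
    # z_i stacks the sine and cosine differences: total dimension 2J.
    return np.hstack([lx * np.sin(XV) - ly * np.sin(YV),
                      lx * np.cos(XV) - ly * np.cos(YV)])
```

The resulting features enter the same regularized Hotelling form n z̄_n⊤(S_n + γ_n I)⁻¹z̄_n as the ME test, now compared against a χ²(2J) threshold.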
Denote by ME-\nfull and SCF-full the ME and SCF tests whose\ntest locations and the Gaussian width \u03c3 are fully\noptimized using gradient ascent on a separate\ntraining sample (Dtr) of the same size as the\ntest set (Dte). ME-grid and SCF-grid are as in\nChwialkowski et al. (2015) where only the Gaus-\nsian width is optimized by a grid search,1and\nthe test locations are randomly drawn from a\nmultivariate normal distribution. MMD-quad\n(quadratic-time) and MMD-lin (linear-time) re-\nfer to the nonparametric tests based on maximum mean discrepancy of Gretton et al. (2012a), where\nto ensure a fair comparison, the Gaussian kernel width is also chosen so as to maximize a criterion\nfor the test power on training data, following the same principle as (Gretton et al., 2012b). For MMD-\nquad, since its null distribution is given by an in\ufb01nite sum of weighted chi-squared variables (no\nclosed-form quantiles), in each trial we randomly permute the two samples 400 times to approximate\nthe null distribution. Finally, T 2 is the standard two-sample Hotelling\u2019s T-squared test, which serves\nas a baseline with Gaussian assumptions on P and Q.\nIn all the following experiments, each problem is repeated for 500 trials. For toy problems, new\nsamples are generated from the speci\ufb01ed P, Q distributions in each trial. For real problems, samples\nare partitioned randomly into training and test sets in each trial. In all of the simulations, we report an\nn/2 \u2265 T\u03b1) which is the proportion of the number of times the statistic \u02c6\u03bbte\nempirical estimate of P(\u02c6\u03bbte\nn/2\nis above T\u03b1. This quantity is an estimate of type-I error under H0, and corresponds to test power\nwhen H1 is true. We set \u03b1 = 0.01 in all the experiments. 
All the code and preprocessed data are available at https://github.com/wittawatj/interpretable-test.

Optimization. The parameter tuning objective λ̂^tr_{n/2}(θ) is a function of θ, consisting of one real-valued σ and J test locations of d dimensions each. The parameters θ can thus be regarded as a (Jd + 1)-dimensional Euclidean vector. We take the derivative of λ̂^tr_{n/2}(θ) with respect to θ, and use gradient ascent to maximize it. J is pre-specified and fixed. For the ME test, we initialize the test locations with realizations from two multivariate normal distributions fitted to samples from P and Q; this ensures that the initial locations are well supported by the data. For the SCF test, initialization using the standard normal distribution is found to be sufficient. The parameter γ_n is not optimized; we set the regularization parameter γ_n to be as small as possible while being large enough to ensure that (S_n + γ_n I)⁻¹ can be stably computed. We emphasize that both the optimization and testing are linear in n: the testing cost is O(J³ + J²n + dJn), and the optimization costs O(J³ + dJ²n) per gradient ascent iteration. Runtimes of all methods are reported in Sec. C in the appendix.

¹ Chwialkowski et al. (2015) choose the Gaussian width that minimizes the median of the p-values, a heuristic that does not directly address test power. Here, we perform a grid search to choose the best Gaussian width by maximizing λ̂^tr_{n/2}, as done in ME-full and SCF-full.

[Figure: Blobs data. Sample from P (left); sample from Q (right).]

Figure 2: Plots of type-I error/test power against the test sample size nte in the four toy problems. Panels: (a) SG, d = 50; (b) GMD, d = 100; (c) GVD, d = 50; (d) Blobs.

1. Informative features: simple demonstration. We begin with a demonstration that the proxy λ̂^tr_{n/2}(θ) for the test power is informative for revealing the difference of the two samples in the ME test. We consider the Gaussian Mean Difference (GMD) problem (see Table 1), where both P and Q are two-dimensional normal distributions differing in their means. We use J = 2 test locations v1 and v2, where v1 is fixed to the location indicated by the black triangle in Fig. 1; the contour plot shows v2 ↦ λ̂^tr_{n/2}(v1, v2).

Fig. 1 (top) suggests that λ̂^tr_{n/2} is maximized when v2 is placed in either of the two regions that capture the difference of the two samples, i.e., the regions in which the probability masses of P and Q have less overlap. In Fig. 1 (bottom), we consider placing v1 in one of the two key regions; in this case, the contour plot shows that v2 should be placed in the other region to maximize λ̂^tr_{n/2}, implying that placing multiple test locations in the same neighborhood will not increase the discriminability. The two modes on the left and right suggest two ways to place a test location in a region that reveals the difference. The non-convexity of λ̂^tr_{n/2} is an indication of many informative ways to detect differences of P and Q, rather than a drawback; a convex objective would not capture this multimodality.

2. Test power vs. sample size n. We now demonstrate the rate of increase of test power with sample size. When the null hypothesis holds, the type-I error stays at the specified level α. We consider the following four toy problems: Same Gaussian (SG), Gaussian mean difference (GMD), Gaussian variance difference (GVD), and Blobs. The specifications of P and Q are summarized in Table 1. In the Blobs problem, P and Q are defined as a mixture of Gaussian distributions arranged on a 4 × 4 grid in R^2. 
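For illustration, a Blobs-style sampler can be sketched as follows; the grid spacing, the uniform mixture weights, and the orientation of the stretched component are our own assumptions, with the precise construction given in Chwialkowski et al. (2015); Gretton et al. (2012b):

```python
import numpy as np

def sample_blobs(n, rng, eig_ratio=1.0, spacing=5.0):
    """Mixture of 16 Gaussians centred on a 4 x 4 grid in R^2. `eig_ratio` is
    the eigenvalue ratio of each component covariance; 1.0 gives isotropic
    components. Spacing and axis orientation are illustrative choices."""
    centers = spacing * np.array([(i, j) for i in range(4)
                                  for j in range(4)], dtype=float)
    comp = rng.integers(0, 16, size=n)        # uniform component assignment
    L = np.diag([np.sqrt(eig_ratio), 1.0])    # covariance L L^T = diag(ratio, 1)
    return centers[comp] + rng.standard_normal((n, 2)) @ L.T
```

Drawing one sample with eig_ratio = 2.0 and the other with eig_ratio = 1.0 yields two distributions that agree on the global grid structure and differ only at the component length scale.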
This problem is challenging as the difference of P and Q is encoded\nat a much smaller length scale compared to the global structure (Gretton\net al., 2012b). Speci\ufb01cally, the eigenvalue ratio for the covariance of each\nGaussian distribution is 2.0 in P , and 1.0 in Q. We set J = 5 in this\nexperiment.\nThe results are shown in Fig. 2 where type-I error (for SG problem), and\ntest power (for GMD, GVD and Blobs problems) are plotted against test\nsample size. A number of observations are worth noting. In the SG\nproblem, we see that the type-I error roughly stays at the speci\ufb01ed level:\nthe rate of rejection of H0 when it is true is roughly at the speci\ufb01ed level\n\u03b1 = 0.01.\nGMD with 100 dimensions turns out to be an easy problem for all the\ntests except MMD-lin. In the GVD and Blobs cases, ME-full and SCF-\nfull achieve substantially higher test power than ME-grid and SCF-grid,\nrespectively, suggesting a clear advantage from optimizing the test locations. Remarkably, ME-full\nconsistently outperforms the quadratic-time MMD across all test sample sizes in the GVD case. When\nthe difference of P and Q is subtle as in the Blobs problem, ME-grid, which uses randomly drawn\ntest locations, can perform poorly (see Fig. 2d) since it is unlikely that randomly drawn locations will\nbe placed in the key regions that reveal the difference. In this case, optimization of the test locations\ncan considerably boost the test power (see ME-full in Fig. 2d). Note also that SCF variants perform\nsigni\ufb01cantly better than ME variants on the Blobs problem, as the difference in P and Q is localized\nin the frequency domain; ME-full and ME-grid would require many more test locations in the spatial\ndomain to match the test powers of the SCF variants. 
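The gain reported for ME-full comes from ascending the training-split proxy λ̂^tr_{n/2} over the test locations. A derivative-free caricature of that procedure (the paper's implementation uses analytic gradients; all names here are ours) accepts a step only when the criterion improves, so the proxy never decreases:

```python
import numpy as np

def power_criterion(X, Y, V, sigma, gamma=1e-5):
    # Test-power proxy: regularized ME statistic n zbar^T (S + gamma I)^{-1} zbar.
    def K(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))
    Z = K(X, V) - K(Y, V)
    n, J = Z.shape
    zbar = Z.mean(axis=0)
    S = (Z - zbar).T @ (Z - zbar) / (n - 1)
    return n * zbar @ np.linalg.solve(S + gamma * np.eye(J), zbar)

def optimize_locations(Xtr, Ytr, V0, sigma, steps=30, lr=1.0, eps=1e-4):
    """Finite-difference ascent with step halving: a step is kept only if it
    strictly improves the criterion, so the objective is non-decreasing."""
    V = V0.copy()
    best = power_criterion(Xtr, Ytr, V, sigma)
    for _ in range(steps):
        grad = np.zeros_like(V)
        for idx in np.ndindex(V.shape):
            Vp, Vm = V.copy(), V.copy()
            Vp[idx] += eps
            Vm[idx] -= eps
            grad[idx] = (power_criterion(Xtr, Ytr, Vp, sigma)
                         - power_criterion(Xtr, Ytr, Vm, sigma)) / (2 * eps)
        step = lr
        while step > 1e-8:
            cand = V + step * grad
            val = power_criterion(Xtr, Ytr, cand, sigma)
            if val > best:
                V, best = cand, val
                break
            step *= 0.5
    return V
```

On a GMD-like training split, the returned locations attain a criterion value at least as large as the initial one, mirroring the role of gradient ascent on λ̂^tr_{n/2} described above.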
For the same reason, SCF-full does much better than the quadratic-time MMD across most sample sizes, as the latter represents a weighted distance between characteristic functions integrated across the entire frequency domain (Sriperumbudur et al., 2010, Corollary 4).

Figure 1: A contour plot of λ̂^tr_{n/2} as a function of v2 when J = 2 and v1 is fixed (black triangle). The objective λ̂^tr_{n/2} is high in the regions that reveal the difference of the two samples.

Figure 3: Plots of type-I error/test power against the dimensions d in the four toy problems in Table 1. Panels: (a) SG; (b) GMD; (c) GVD.

Table 2: Type-I errors and powers of various tests in the problem of distinguishing NIPS papers from two categories. α = 0.01. J = 1. nte denotes the test sample size of each of the two samples.

Problem     | nte | ME-full | ME-grid | SCF-full | SCF-grid | MMD-quad | MMD-lin
Bayes-Bayes | 215 | .012    | .018    | .012     | .004     | .022     | .008
Bayes-Deep  | 216 | .954    | .034    | .688     | .180     | .906     | .262
Bayes-Learn | 138 | .990    | .774    | .836     | .534     | 1.00     | .238
Bayes-Neuro | 394 | 1.00    | .300    | .828     | .500     | .952     | .972
Learn-Deep  | 149 | .956    | .052    | .656     | .138     | .876     | .500
Learn-Neuro | 146 | .960    | .572    | .590     | .360     | 1.00     | .538

3. Test power vs. dimension d. We next investigate how the dimension (d) of the problem can affect type-I errors and test powers of ME and SCF tests. We consider the same artificial problems: SG, GMD and GVD. This time, we fix the test sample size to 10000, set J = 5, and vary the dimension. 
The results are shown in Fig. 3. Due to the large dimensions and sample size, it is computationally infeasible to run MMD-quad.

We observe that all the tests except the T-test can maintain type-I error at roughly the specified significance level α = 0.01 as dimension increases. The type-I performance of the T-test is incorrect at large d because of the difficulty in accurately estimating the covariance matrix in high dimensions. It is interesting to note the high performance of ME-full in the GMD problem in Fig. 3b: ME-full achieves the maximum test power of 1.0 throughout and matches the power of the T-test, in spite of being nonparametric and making no assumption on P and Q (the T-test is further advantaged by its excessive type-I error). However, this is true only with optimization of the test locations. This is reflected in the test power of ME-grid in Fig. 3b, which drops monotonically as dimension increases, highlighting the importance of test location optimization. The performance of MMD-lin degrades quickly with increasing dimension, as expected from Ramdas et al. (2015).

4. Distinguishing articles from two categories. We now turn to performance on real data. We first consider the problem of distinguishing two categories of publications at the conference on Neural Information Processing Systems (NIPS). Out of 5903 papers published in NIPS from 1988 to 2015, we manually select disjoint subsets related to Bayesian inference (Bayes), neuroscience (Neuro), deep learning (Deep), and statistical learning theory (Learn) (see Sec. B). Each paper is represented as a bag of words using TF-IDF (Manning et al., 2008) as features. We perform stemming, remove all stop words, and retain only nouns. A further filtering by document frequency (DF), keeping words that satisfy 5 ≤ DF ≤ 2000, yields approximately 5000 words, from which 2000 words (i.e., d = 2000 dimensions) are randomly selected. See Sec. 
B for more details on the preprocessing. For the ME and SCF tests, we use only one test location, i.e., we set J = 1. We perform 1000 permutations to approximate the null distribution of MMD-quad in this and the following experiments.

Type-I errors and test powers are summarized in Table 2. The first column indicates the categories of the papers in the two samples. In the Bayes-Bayes problem, papers on Bayesian inference are randomly partitioned into two samples in each trial; this task represents a case in which H0 holds. Among all the linear-time tests, we observe that ME-full has the highest test power in all the tasks, attaining a maximum test power of 1.0 in the Bayes-Neuro problem. This high performance assures that although different test locations V may be selected in different trials, these locations are each informative. It is interesting to observe that ME-full has performance close to or better than MMD-quad, which requires O(n²) runtime complexity. Besides the clear advantages in interpretability and linear runtime of the proposed tests, these results suggest that evaluating the differences in expectations of analytic functions at particular locations can yield an equally powerful test at a much lower cost, as opposed to computing the RKHS norm of the witness function as done in MMD. Unlike in the Blobs problem, however, Fourier features are less powerful in this setting.

We further investigate the interpretability of the ME test by the following procedure. For the learned test location v^t ∈ R^d (d = 2000) in trial t, we construct ṽ^t = (ṽ^t_1, . . . , ṽ^t_d) such that ṽ^t_j = |v^t_j|. Let η^t_j ∈ {0, 1} be an indicator variable taking value 1 if ṽ^t_j is among the top five largest entries of ṽ^t, for all j ∈ {1, . . . , d}, and 0 otherwise. Define η_j := Σ_t η^t_j as a proxy indicating the significance of word j, i.e., η_j is high if word j is frequently among the top five largest as measured by ṽ^t_j. The top seven words, sorted in descending order by η_j, in the Bayes-Neuro problem are spike, markov, cortex, dropout, recurr, iii, gibb, showing that the learned test locations are highly interpretable. Indeed, "markov" and "gibb" (i.e., stemmed from Gibbs) are discriminative terms in the Bayesian inference category, and "spike" and "cortex" are key terms in neuroscience. We give full lists of discriminative terms learned in all the problems in Sec. B.1. To show that not all of the 2000 randomly selected terms are informative, we modify the definition of η^t_j to consider the least important words (i.e., η_j is high if word j is frequently among the top five smallest as measured by ṽ^t_j); we then instead obtain circumfer, bra, dominiqu, rhino, mitra, kid, impostor, which are not discriminative.

5. Distinguishing positive and negative emotions. In the final experiment, we study how well the ME and SCF tests can distinguish two samples of photos of people showing positive and negative facial expressions. Our emphasis is on the discriminative features of the faces identified by the ME test, showing how the two groups differ. For this purpose, we use the Karolinska Directed Emotional Faces (KDEF) dataset (Lundqvist et al., 1998), containing 5040 aligned face images of 70 amateur actors, 35 females and 35 males. We use only photos showing front views of the faces. In the dataset, each actor displays seven expressions: happy (HA), neutral (NE), surprised (SU), sad (SA), afraid (AF), angry (AN), and disgusted (DI). We assign the HA, NE, and SU faces to the positive emotion group (i.e., samples from P), and the AF, AN, and DI faces to the negative emotion group (samples from Q). We denote this problem "+ vs. −". Examples of six facial expressions from one actor are shown in Fig. 4. Photos of the SA group are unused to keep the sizes of the two samples the same. Each image of size 562 × 762 pixels is cropped to exclude the background, resized to 48 × 34 = 1632 pixels (d), and converted to grayscale.

We run the tests 500 times with the same settings used previously, i.e., Gaussian kernels and J = 1. The type-I errors and test powers are shown in Table 3. In the table, "± vs. ±" is a problem in which all faces expressing the six emotions are randomly split into two samples of equal sizes, i.e., H0 is true. Both ME-full and SCF-full achieve high test powers while maintaining the correct type-I errors.

Table 3: Type-I errors and powers in the problem of distinguishing positive (+) and negative (−) facial expressions. α = 0.01. J = 1.

Problem    nte   ME-full  ME-grid  SCF-full  SCF-grid  MMD-quad  MMD-lin
± vs. ±    201   .010     .012     .014      .002      .018      .008
+ vs. −    201   .998     .656     1.00      .750      1.00      .578

As a way to interpret how positive and negative emotions differ, we take the average across trials of the learned test locations of ME-full in the "+ vs. −" problem. This average is shown in Fig. 4g. We see that the test locations faithfully capture the difference between positive and negative emotions by giving more weight to the regions of the nose, upper lip, and nasolabial folds (smile lines), confirming the interpretability of the test in a high-dimensional setting.

Figure 4: (a)-(f): Six facial expressions of actor AM05 in the KDEF data: (a) HA, (b) NE, (c) SU.
(d) AF, (e) AN, (f) DI. (g): Average across trials of the learned test locations v1.

Acknowledgement

We thank the Gatsby Charitable Foundation for the financial support.

References

L. Baringhaus and C. Franz. On a new multivariate two-sample test. Journal of Multivariate Analysis, 88:190–206, 2004.

M. Bilodeau and D. Brenner. Theory of Multivariate Statistics. Springer Science & Business Media, 2008.

S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O'Reilly Media, 1st edition, 2009.

O. Bousquet. New approaches to statistical learning theory. Annals of the Institute of Statistical Mathematics, 55:371–389, 2003.

K. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In NIPS, pages 1972–1980, 2015.

C. Carota, G. Parmigiani, and N. G. Polson. Diagnostic measures for model criticism. Journal of the American Statistical Association, 91(434):753–762, 1996.

M. Eric, F. R. Bach, and Z. Harchaoui. Testing for homogeneity with kernel Fisher discriminant analysis. In NIPS, pages 609–616, 2008.

M. Fromont, B. Laurent, M. Lerasle, and P. Reynaud-Bouret. Kernels based tests with non-asymptotic bootstrap approaches for two-sample problems. In COLT, pages 23.1–23.22, 2012.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012a.

A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In NIPS, pages 1205–1213, 2012b.

M. R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference. Springer, 2008.

J. R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. In NIPS, pages 829–837, 2015.

J.
R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In AAAI, pages 1242–1250, 2014.

D. Lundqvist, A. Flykt, and A. Öhman. The Karolinska Directed Emotional Faces (KDEF). Technical report, ISBN 91-630-7164-9, 1998.

C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

J. Mueller and T. Jaakkola. Principal differences analysis: Interpretable characterization of differences between distributions. In NIPS, pages 1693–1701, 2015.

A. Ramdas, S. Jakkam Reddi, B. Póczos, A. Singh, and L. Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, pages 3571–3577, 2015.

D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263–2291, 2013.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In ALT, pages 13–31, 2007.

N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In COLT, pages 169–183, 2006.

B. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.

G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, (5), 2004.

A. van der Vaart and J. Wellner.
Weak Convergence and Empirical Processes: With Applications to Statistics (Springer Series in Statistics). Springer, 2000.

W. Zaremba, A. Gretton, and M. Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In NIPS, pages 755–763, 2013.