{"title": "Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data", "book": "Advances in Neural Information Processing Systems", "page_first": 728, "page_last": 736, "abstract": "We propose an efficient, generalized, nonparametric, statistical Kolmogorov-Smirnov test for detecting distributional change in high-dimensional data. To implement the test, we introduce a novel, hierarchical, minimum-volume sets estimator to represent the distributions to be tested. Our work is motivated by the need to detect changes in data streams, and the test is especially efficient in this context. We provide the theoretical foundations of our test and show its superiority over existing methods.", "full_text": "Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data\n\nAssaf Glazer\nDepartment of Computer Science\nTechnion \u2013 Israel Institute of Technology\nHaifa 32000, Israel\nassafgr@cs.technion.ac.il\n\nMichael Lindenbaum\nDepartment of Computer Science\nTechnion \u2013 Israel Institute of Technology\nHaifa 32000, Israel\nmic@cs.technion.ac.il\n\nShaul Markovitch\nDepartment of Computer Science\nTechnion \u2013 Israel Institute of Technology\nHaifa 32000, Israel\nshaulm@cs.technion.ac.il\n\nAbstract\n\nWe propose an efficient, generalized, nonparametric, statistical Kolmogorov-Smirnov test for detecting distributional change in high-dimensional data. To implement the test, we introduce a novel, hierarchical, minimum-volume sets estimator to represent the distributions to be tested. Our work is motivated by the need to detect changes in data streams, and the test is especially efficient in this context. We provide the theoretical foundations of our test and show its superiority over existing methods.\n\n1 Introduction\n\nThe Kolmogorov-Smirnov (KS) test is efficient, simple, and often considered the method of choice for comparing distributions. Let X = {x1, . . . , xn} and X\u2032 = {x\u20321, . . . , x\u2032m} be two sets of feature vectors sampled i.i.d. from distributions F and F\u2032, respectively. The goal of the KS test is to determine whether F \u2260 F\u2032. For one-dimensional distributions, the KS statistics are based on the maximal difference between the cumulative distribution functions (CDFs) of the two distributions. However, nonparametric extensions of this test to high-dimensional data are hard to define since there are 2^d \u2212 1 ways to represent a d-dimensional distribution by a CDF. Indeed, due to this limitation, several extensions of the KS test to more than one dimension have been proposed [17, 9], but their practical applications are mostly limited to a few dimensions.\n\nOne prominent approach to generalizing the KS test beyond one-dimensional data is that of Polonik [18]. It is based on a generalized quantile transform to a set of high-density hierarchical regions. The transform is used to construct two sets of plots, expected and empirical, which serve as the two input CDFs for the KS test. Polonik\u2019s transform is based on a density estimation over X. It maps the input quantile in [0, 1] to a level-set of the estimated density such that the expected probability of feature vectors to lie within it is equal to its associated quantile. The expected plots are the quantiles, and the empirical plots are fractions of examples in X\u2032 that lie within each mapped region.\n\nPolonik\u2019s approach can handle multivariate data, but is hard to apply in high-dimensional or small-sample-sized settings where a reliable density estimation is hard. In this paper we introduce a generalized KS test, based on Polonik\u2019s theory, to determine whether two samples are drawn from different distributions. 
However, instead of a density estimator, we use a novel hierarchical minimum-volume sets estimator to estimate the set of high-density regions directly. Because the estimation of such regions is intrinsically simpler than density estimation, our test is more accurate than density-estimation approaches. In addition, whereas Polonik\u2019s work was largely theoretical, we take a practical approach and empirically show the superiority of our test over existing nonparametric tests in realistic, high-dimensional data.\n\nTo use Polonik\u2019s generalization of the KS test, the high-density regions should be hierarchical. Using classical minimum-volume set (MV-set) estimators, however, does not, in itself, guarantee this property. We present here a novel method for approximate MV-sets estimation that guarantees the hierarchy, thus allowing the KS test to be generalized to high dimensions. Our method uses classical MV-set estimators as a basic component. We test our method with two types of estimators: one-class SVMs (OCSVMs) and one-class neighbor machines (OCNMs).\n\nWhile the statistical test introduced in this paper traces distributional changes in high-dimensional data in general, it is effective in particular for change detection in data streams. Many real-world applications (e.g. process control) work in dynamic environments where streams of multivariate data are collected over time, during which unanticipated distributional changes in data streams might prevent the proper operation of these applications. Change-detection methods are thus required to trace such changes (e.g. [6]). We extensively evaluate our test on a collection of change-detection tasks. 
We also show that our proposed test can be used for the classical setting of the two-sample problem using symmetric and asymmetric variations of our test.\n\n2 Learning Hierarchical High-Density Regions\n\nOur approach for generalizing the KS test is based on estimating a hierarchical set of MV-sets in input space. In this section we introduce a method for finding such a set in high-dimensional data.\n\nFollowing the notion of multivariate quantiles [8], let X = {x1, . . . , xn} be a set of examples i.i.d. with respect to a probability distribution F defined on a measurable space (R^d, S). Let \u03bb be a real-valued function defined on C \u2282 S. Then, the minimum-volume set (MV-set) with respect to F, \u03bb, and C at level \u03b1 is\n\nC(\u03b1) = argmin_{C\u2032 \u2208 C} {\u03bb(C\u2032) : F(C\u2032) \u2265 \u03b1}. (1)\n\nIf more than one set attains the minimum, one will be picked. Equivalently, if F(C) is replaced with Fn(C) = (1/n) \u03a3_{i=1}^{n} 1_C(xi), then Cn(\u03b1) is one of the empirical MV-sets that attains the minimum. In the following we think of \u03bb as a Lebesgue measure on R^d.\n\nPolonik introduced a new approach that uses a hierarchical set of MV-sets to generalize the KS test beyond one dimension. Assume F has a density function f with respect to \u03bb, and let Lf(c) = {x : f(x) \u2265 c} be the level set of f at level c. Sufficient regularity conditions on f are assumed. Polonik observed that if Lf(c) \u2208 C, then Lf(c) is an MV-set of F at level \u03b1 = F(Lf(c)). He thus suggested that level-sets can be used as approximations of the MV-sets of a distribution. Hence, a density estimator was used to define a family of MV-sets {C(\u03b1), \u03b1 \u2208 [0, 1]} such that a hierarchy constraint C(\u03b1) \u2282 C(\u03b2) is satisfied for 0 \u2264 \u03b1 < \u03b2 \u2264 1.\n\nWe also use hierarchical MV-sets to represent distributions in our research. 
However, since a density estimation is hard to apply in high-dimensional data, a more practical solution is proposed. Instead of basing our method on the products of a density estimation method, we introduce a novel nonparametric method, which uses MV-set estimators (OCSVM and OCNM) as a basic component, to estimate hierarchical MV-sets without the need for a density estimation step.\n\n2.1 Learning Minimum-Volume Sets with One-Class SVM Estimators\n\nOCSVM is a nonparametric method for estimating a high-density region in a high-dimensional distribution [19]. Consider a function \u03a6 : R^d \u2192 F mapping the feature vectors in X to a hypersphere in an infinite-dimensional Hilbert space F. Let H be a hypothesis space of half-space decision functions fC(x) = sgn((w \u00b7 \u03a6(x)) \u2212 \u03c1) such that fC(x) = +1 if x \u2208 C, and \u22121 otherwise. To separate X from the origin, the learner is asked to solve this quadratic program:\n\nmin_{w \u2208 F, \u03be \u2208 R^n, \u03c1 \u2208 R} (1/2)||w||^2 \u2212 \u03c1 + (1/(\u03bdn)) \u03a3_{i} \u03bei, s.t. (w \u00b7 \u03a6(xi)) \u2265 \u03c1 \u2212 \u03bei, \u03bei \u2265 0, (2)\n\nwhere \u03be is the vector of the slack variables, and 0 < \u03bd < 1 is a regularization parameter related to the proportion of outliers in the training data. All training examples xi for which (w \u00b7 \u03a6(xi)) \u2212 \u03c1 \u2264 0 are called support vectors (SVs). Outliers are the examples that strictly satisfy (w \u00b7 \u03a6(xi)) \u2212 \u03c1 < 0. Since the algorithm only depends on the dot product in F, \u03a6 never needs to be explicitly computed, and a kernel function k(\u00b7,\u00b7) is used instead such that k(xi, xj) = (\u03a6(xi) \u00b7 \u03a6(xj))F.\n\nThe following theorem draws the connection between the \u03bd regularization parameter and the region C provided by the solution of Equation 2:\n\nTheorem 1 (Sch\u00f6lkopf et al. [19]). 
Assume the solution of Equation 2 satisfies \u03c1 \u2260 0. The following statements hold: (1) \u03bd is an upper bound on the fraction of outliers. (2) \u03bd is a lower bound on the fraction of SVs. (3) Suppose X were generated i.i.d. from a distribution F which does not contain discrete components. Suppose, moreover, that the kernel k is analytic and non-constant. Then, with probability 1, asymptotically, \u03bd is equal to both the fraction of SVs and to the fraction of outliers.\n\nThis theorem shows that we can use OCSVMs to estimate high-density regions in the input space while bounding the number of examples in X lying outside these regions. Thus, by setting \u03bd = 1 \u2212 \u03b1, we can use OCSVMs to estimate regions approximating C(\u03b1). We use this estimation method with its original quadratic optimization scheme to learn a family of MV-sets. However, a straightforward approach of training a set of OCSVMs, each with different \u03bd \u2208 (0, 1), would not necessarily satisfy the hierarchy requirement. In the following algorithm, we propose a modified construction of these regions such that both the hierarchical constraint and the density assumption (Theorem 1) will hold for each region.\n\nLet 0 < \u03b11 < \u03b12 < . . . < \u03b1q < 1 be a sequence of quantiles. Given X and a kernel function k(\u00b7,\u00b7), our hierarchical MV-sets estimator iteratively trains a set of q OCSVMs, one for each quantile, and returns a set of decision functions, \u02c6fC(\u03b11), . . . , \u02c6fC(\u03b1q), that satisfy both hierarchy and density requirements. Training starts from the largest quantile (\u03b1q). Let Di be the training set of the OCSVM trained for the \u03b1i quantile. Let fC(\u03b1i), SVbi be the decision function and the calculated outliers (bounded SVs) of the OCSVM trained for the i-th quantile. Let Oi = \u222a_{j=i}^{q} SVbj. At each iteration, Di contains examples in X that were not classified as outliers in previous iterations (not in Oi+1). In addition, \u03bd is set to the required fraction of outliers over Di that will keep the total fraction of outliers over X equal to 1 \u2212 \u03b1i. After each iteration, \u02c6fC(\u03b1i) corresponds to the intersection between the region associated with the previous decision function and the half-space associated with the current learned OCSVM. Thus \u02c6fC(\u03b1i) corresponds to the region specified by an intersection of half-spaces. The outliers in Oi are points that lie strictly outside the constructed region. The pseudo-code of our estimator is given in Algorithm 1.\n\nAlgorithm 1 Hierarchical MV-sets Estimator (HMVE)\n1: Input: X, 0 < \u03b11 < \u03b12 < . . . < \u03b1q < 1, k(\u00b7,\u00b7)\n2: Output: \u02c6fC(\u03b11), . . . , \u02c6fC(\u03b1q)\n3: Initialize: Dq \u2190 X, Oq+1 \u2190 \u2205\n4: for i = q to 1 do\n5:   \u03bd \u2190 ((1 \u2212 \u03b1i)|X| \u2212 |Oi+1|) / |Di|\n6:   fC(\u03b1i), SVbi \u2190 OCSVM(Di, \u03bd, k)\n7:   if i = q then\n8:     \u02c6fC(\u03b1i)(x) \u2190 fC(\u03b1i)(x)\n9:   else\n10:     \u02c6fC(\u03b1i)(x) \u2190 fC(\u03b1i)(x) if \u02c6fC(\u03b1i+1)(x) = +1, and \u22121 otherwise\n11:   Oi \u2190 Oi+1 \u222a SVbi, Di\u22121 \u2190 Di \\ SVbi\n12: return \u02c6fC(\u03b11), . . . , \u02c6fC(\u03b1q)\n\nThe following theorem shows that the regions specified by the decision functions \u02c6fC(\u03b11), . . . , \u02c6fC(\u03b1q) are: (a) approximations for the MV-sets in the same sense suggested by Sch\u00f6lkopf et al., and (b) hierarchically nested. In the following, \u02c6C(\u03b1i) denotes the estimate of C(\u03b1i) with respect to \u02c6fC(\u03b1i).\n\nTheorem 2. Let \u02c6fC(\u03b11), . . . 
, \u02c6fC(\u03b1q) be the decision functions returned by Algorithm 1 with parameters {\u03b11, . . . , \u03b1q}, X, k(\u00b7,\u00b7). Assume X is separable. Let \u02c6C(\u03b1i) be the region in the input space associated with \u02c6fC(\u03b1i), and SVubi be the set of (unbounded) SVs lying on the separating hyperplane in the region associated with fC(\u03b1i)(x). Then, the following statements hold: (1) \u02c6C(\u03b1i) \u2286 \u02c6C(\u03b1j) for \u03b1i < \u03b1j. (2) |Oi|/|X| \u2264 1 \u2212 \u03b1i \u2264 (|SVubi| + |Oi|)/|X|. (3) Suppose X were i.i.d. drawn from a distribution F which does not contain discrete components, and k is analytic and non-constant. Then, 1 \u2212 \u03b1i is asymptotically equal to |Oi|/|X|.\n\nProof. Statement (1) holds by definition of \u02c6fC(\u03b1i). Statements (2)-(3) are proved by induction on the number of iterations. In the first iteration \u02c6fC(\u03b1q) equals fC(\u03b1q). Thus, since Oq = SVbq and \u03bd = 1 \u2212 \u03b1q, statements (2)-(3) follow directly from Theorem 1 (see footnote 1). Then, by the induction hypothesis, statements (2)-(3) hold for the first n \u2212 1 iterations over the \u03b1q, . . . , \u03b1q\u2212n+1 quantiles. We now prove that statements (2)-(3) hold for \u02c6fC(\u03b1q\u2212n) in the next iteration. Since \u02c6fC(\u03b1q\u2212n+1)(x) = \u22121 implies \u02c6fC(\u03b1q\u2212n)(x) = \u22121, Oq\u2212n+1 are outliers with respect to \u02c6fC(\u03b1q\u2212n). In addition, \u03bd = ((1 \u2212 \u03b1q\u2212n)|X| \u2212 |Oq\u2212n+1|) / |Di|. Hence, following Theorem 1, the total number of outliers with respect to X is |Oq\u2212n| = |SVbq\u2212n| + |Oq\u2212n+1| \u2264 \u03bd|Di| + |Oq\u2212n+1| = (1 \u2212 \u03b1q\u2212n)|X|, and |SVubq\u2212n| + |Oq\u2212n+1| \u2265 (1 \u2212 \u03b1q\u2212n)|X|. Hence, |Oq\u2212n|/|X| \u2264 1 \u2212 \u03b1q\u2212n \u2264 (|SVubq\u2212n| + |Oq\u2212n|)/|X|. In the same manner, under the conditions of statement (3), |Oq\u2212n| is asymptotically equal to (1 \u2212 \u03b1q\u2212n)|X|, and hence, asymptotically, 1 \u2212 \u03b1q\u2212n = |Oq\u2212n|/|X|.\n\nFigure 1: Left: Estimated MV-sets \u02c6C(\u03b1i) in the original input space, q = 3. Right: the projected \u02c6C(\u03b1i) in F.\n\nFigure 2: Averaged symmetric differences against the number of training points for the OCSVM / OCNM versions of our estimator, and the KDE2d density estimator.\n\nFigure 1 illustrates the estimated MV-sets \u02c6C(\u03b1i) in both the original and the projected spaces. On the left, all \u02c6C(\u03b1i) regions in the original input space are colored with decreased gray levels. Note that \u02c6C(\u03b1i) is a subset of \u02c6C(\u03b1j) if i < j. On the right, the projected regions of all \u02c6C(\u03b1i)s in F are marked with the same colors. Examples xi in the input space and their mapped vectors \u03a6(xi) in F are contained in the same relative regions in both spaces. It can be seen that the projections of \u02c6C(\u03b1i) in F are the intersecting half-spaces learned by Algorithm 1.\n\n2.2 Learning Minimum-Volume Sets with One-Class Neighbor Machine Estimators\n\nOCNM [15] is an alternative method to the OCSVM estimator for finding regions close to C(\u03b1). Unlike OCSVM, the OCNM solution is proven to be asymptotically close to the MV-set specified (see footnote 2). Degenerated structures in data that may damage the generalization of SVMs could be another reason for choosing OCNM [24]. In practice, for finite sample size, it is not clear which estimator is more accurate.\n\n1 Note that the separability of the data implies that the solution of Equation 2 satisfies \u03c1 \u2260 0.\n2 Sch\u00f6lkopf et al. [19] proved that the set provided by OCSVM converges asymptotically to the correct probability and not to the correct MV-set. 
Although this property should be sufficient for the correctness of our test, Polonik observed that MV-sets are preferred.\n\nOCNM uses either a sparsity or a concentration neighborhood measure. M(R^d, X) \u2192 R is a sparsity measure if f(x) > f(y) implies lim_{|X| \u2192 \u221e} P(M(x, X) < M(y, X)) = 1. An example for a valid sparsity measure is the distance of x to its kth-nearest neighbor in X. When a sparsity measure is used, the OCNM estimator solves the following linear problem\n\nmax_{\u03be \u2208 R^n, \u03c1 \u2208 R} \u03bdn\u03c1 \u2212 \u03a3_{i=1}^{n} \u03bei, s.t. M(xi, X) \u2265 \u03c1 \u2212 \u03bei, \u03bei \u2265 0, (3)\n\nsuch that the resulting decision function fC(x) = sgn(\u03c1 \u2212 M(x, X)) satisfies bounds and convergence properties similar to those mentioned in Theorem 1 (\u03bd-property).\n\nOCNM can replace OCSVM in our hierarchical MV-sets estimator. In contrast to OCSVMs, when OCNMs are iteratively trained on X using a growing sequence of \u03bd values, outliers need not be moved from previous iterations to ensure that the \u03bd-property will hold for each decision function. Hence, a simpler version of Algorithm 1 can be used, where X is used for training all OCNMs and \u03bd = 1 \u2212 \u03b1i for each step (see footnote 3). 
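The simplified OCNM variant described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the kth-nearest-neighbor distance as the sparsity measure and, instead of solving the linear program of Equation 3, picks each threshold as an empirical quantile of the training measures so that at least the requested fraction of training points lies inside each region. The helper names sparsity, fit_thresholds, and in_region are hypothetical.

```python
import math

def sparsity(x, sample, k):
    # k-th nearest-neighbor distance of x within sample; one exact copy of x
    # (the point itself, when x is a training example) is skipped.
    dists = sorted(math.dist(x, s) for s in sample)
    if x in sample:
        dists = dists[1:]
    return dists[k - 1]

def fit_thresholds(sample, alphas, k):
    # One threshold per quantile: the smallest training measure such that at
    # least a fraction alpha of the training measures does not exceed it.
    measures = sorted(sparsity(x, sample, k) for x in sample)
    n = len(measures)
    return [measures[max(math.ceil(a * n) - 1, 0)] for a in alphas]

def in_region(x, i, sample, thresholds, k):
    # Membership in the i-th region, intersected with all larger-quantile
    # regions (mirroring the intersection step of Algorithm 1).
    m = sparsity(x, sample, k)
    return all(m <= rho for rho in thresholds[i:])

sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (1.0, 1.0), (3.0, 3.0), (-2.0, 2.5)]
alphas = [0.25, 0.5, 0.75]
thresholds = fit_thresholds(sample, alphas, k=2)
inside = [[in_region(x, i, sample, thresholds, 2) for x in sample]
          for i in range(len(alphas))]
```

Each row of inside marks which training points fall in the region for the corresponding quantile; the rows are nested by construction, and the explicit intersection only makes that property visible in the code.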
Since Theorem 2 relies on the \u03bd-property of the estimator, it can be shown that similar statements to those of Theorem 2 also hold when OCNM is used.\n\nAs previously discussed, since the estimation of MV-sets is simpler than density estimation, our test can achieve higher accuracy than approaches based on density estimation. To illustrate this hypothesis empirically, we conducted the following preliminary experiment. We sampled 10 to 50 i.i.d. points with respect to a two-dimensional mixture-of-Gaussians distribution p = (1/2) N(\u00b5 = (0.5, 0.5), \u03a3 = 0.1I) + (1/2) N(\u00b5 = (\u22120.5, \u22120.5), \u03a3 = 0.5I). We use the OCNM and OCSVM versions of our estimator to approximate hierarchical MV-sets for q\u03b1 = 9 quantiles: \u03b1 = 0.1, 0.2, . . . , 0.9 (detailed setup parameters are discussed in Section 4). MV-sets estimated with a KDE2d kernel-density estimation [2] were used for comparison. For each sample size, we measured the error of each method according to the mean weighted symmetric difference between the true MV-sets and their estimates, (1/q\u03b1) \u03a3_{\u03b1} \u222b_{x \u2208 C(\u03b1) \u2206 \u02c6C(\u03b1)} p(x)dx. Results, averaged over 50 simulations, are shown in Figure 2. The advantages of our approach can easily be seen: both versions of our estimator perform notably better, especially for small sample sizes.\n\n3 Generalized Kolmogorov-Smirnov Test\n\nWe now introduce a nonparametric, generalized Kolmogorov-Smirnov (GKS) statistical test for determining whether F \u2260 F\u2032 in high-dimensional data. Assume F, F\u2032 are one-dimensional continuous distributions and Fn, F\u2032m are empirical distributions estimated from n and m examples i.i.d. drawn from F, F\u2032. 
Then, the two-sample Kolmogorov-Smirnov (KS) statistic is\n\nKSn,m = sup_{x \u2208 R} |Fn(x) \u2212 F\u2032m(x)|, (4)\n\nand \u221a(nm/(n+m)) KSn,m is asymptotically distributed, under the null hypothesis, as the distribution of sup_{x \u2208 R} |B(F(x))| for a standard Brownian bridge B when F = F\u2032. Under the null hypothesis, assume F = F\u2032 and let F^{\u22121} be a quantile transform of F, i.e., the inverse of F. Then we can replace the supremum over x \u2208 R with the supremum over \u03b1 \u2208 [0, 1] as follows:\n\nKSn,m = sup_{\u03b1 \u2208 [0,1]} |Fn(F^{\u22121}(\u03b1)) \u2212 F\u2032m(F^{\u22121}(\u03b1))|. (5)\n\nNote that in the one-dimensional setting, F^{\u22121}(\u03b1) is the point x s.t. F(X \u2264 x) \u2264 \u03b1 where X is a random variable drawn from F. Equivalently, F^{\u22121}(\u03b1) can be identified with the interval [\u2212\u221e, x]. In a high-dimensional space these intervals can be replaced by hierarchical MV-sets C(\u03b1) [18], and hence, Equation 5 can be calculated regardless of the input space dimensionality. We suggest replacing KSn,m with\n\nTn,m = sup_{\u03b1 \u2208 [0,1]} |Fn(C(\u03b1)) \u2212 F\u2032m(C(\u03b1))|. (6)\n\n3 Note that intersection is still needed (Algorithm 1, line 10) to ensure the hierarchical property on \u02c6C(\u03b1i).\n\nFor estimating C(\u03b1) we use our nonparametric method from Section 2. \u02c6C(\u03b1) is learned with X and marked as \u02c6CX(\u03b1). In practice, when |X| is finite, the expected proportion of examples that lie within \u02c6CX(\u03b1i) is not guaranteed to be exactly \u03b1i. Therefore, after learning the decision functions, we estimate Fn(\u02c6CX(\u03b1i)) by a k-folds cross-validation procedure. 
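As a small worked illustration of Equation 6 restricted to a finite set of regions, the sketch below computes the largest gap between the empirical measures of two samples over a nested family. It is only a sketch under stated assumptions: the nested balls around the origin are hypothetical stand-ins for learned MV-sets, the function names are invented for this example, and neither the cross-validation estimate nor the p-value computation is included.

```python
def region_fraction(sample, contains):
    # Empirical measure of a region: fraction of sample points it contains.
    return sum(1 for x in sample if contains(x)) / len(sample)

def gks_statistic(sample_a, sample_b, regions):
    # Largest absolute gap between the two empirical measures over all regions.
    return max(abs(region_fraction(sample_a, c) - region_fraction(sample_b, c))
               for c in regions)

def ball(radius):
    # Hypothetical nested region: closed ball of the given radius around the origin.
    return lambda x: sum(v * v for v in x) <= radius * radius

regions = [ball(r) for r in (0.5, 1.0, 2.0)]
sample_a = [(0.1, 0.2), (0.3, -0.3), (0.9, 0.1), (1.5, 0.5)]
sample_b = [(1.2, 1.2), (1.8, 0.1), (0.2, 0.1), (3.0, 0.0)]
t = gks_statistic(sample_a, sample_b, regions)
print(t)  # 0.5, attained at the ball of radius 1.0
```

In this toy data the first sample places 3 of 4 points inside the unit ball while the second places 1 of 4 there, which gives the maximal gap of 0.5.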
Our final test statistic is\n\n\u02c6Tn,m = sup_{1 \u2264 i \u2264 q} |\u02c6Fn(\u02c6CX(\u03b1i)) \u2212 F\u2032m(\u02c6CX(\u03b1i))|, (7)\n\nwhere \u02c6Fn(\u02c6CX(\u03b1i)) is the estimate of Fn(\u02c6CX(\u03b1i)). The two-sample KS statistical test is used over \u02c6Tn,m to calculate the resulting p-value.\n\nThe test defined above works only in one direction by predicting whether distributions of the samples share the same \u201cconcentrations\u201d as regions estimated according to X, and not according to X\u2032. We may symmetrize it by running the non-symmetric test twice, once in each direction, and returning twice their minimum p-value (Bonferroni correction). Note that by doing so in the context of a change detection task, we pay in runtime required for learning MV-sets for each X\u2032.\n\n4 Empirical Evaluation\n\nWe first evaluated our test on concept-drift detection problems in data-stream classification tasks. Concept drifts are associated with distributional changes in data streams that occur due to hidden context [22], that is, changes of which the classifier is unaware. We used the 27 UCI datasets used in [6], and 6 additional high-dimensionality UCI datasets: arrhythmia, madelon, semeion, internet advertisement, hill-valley, and musk. The average number of features for all datasets is 123 (see footnote 4).\n\nFollowing the experimental setup used by [11, 6], we generated, for each dataset, a sequence \u27e8x1, . . . , xn+m\u27e9, where the first n examples are associated with the most frequent label, and the following m examples with the second most frequent. Within each label the examples were shuffled randomly. The first 100 examples \u27e8x1, . . . , x100\u27e9, associated, in all datasets, with the most common label, were used as the baseline dataset X. 
A sliding window of 50 consecutive examples over the following sequence of examples was iteratively used to define the most recent data X\u2032 at hand. Statistical tests were evaluated with X and all possible X\u2032 windows. In total, for each dataset, the set {\u27e8X, X\u2032i\u27e9 | X\u2032i = {xi, . . . , xi+49}, 101 \u2264 i \u2264 n + m \u2212 49} of pairs was used for evaluation. The following figure illustrates this setup:\n\n[Figure: timeline of the data stream, with the training set x1, . . . , x100 followed by the testing-set windows x101, . . . , x150 through xn+m\u221249, . . . , xn+m]\n\nThe pairs \u27e8X, X\u2032i\u27e9, i \u2264 n \u2212 49, where all examples in X\u2032i have the same labels as in X, are considered \u201cunchanged.\u201d The remaining pairs are considered \u201cchanged.\u201d Performance is evaluated using precision-recall values with respect to the change detection task.\n\nWe compare our one-directional (GKS1d) and two-directional (GKS2d) tests to the following 5 reference tests: kdq-tree test (KDQ) [4], Metavariable Wald-Wolfowitz test (WW) [10], Kernel change detection (KCD) [5], Maximum mean discrepancy test (MMD) [12], and PAC-Bayesian margin test (PBM) [6]. See Section 5 for details. All tests, except for MMD, were implemented, and their parameters were set in accordance with the settings suggested in their associated papers. The implementation of the MMD test provided by the authors (see footnote 5) was used with default parameters (RBF kernels with automatic kernel width detection) and Rademacher bounds. Similar results were also measured for asymptotic bounds. Note that we cannot compare our test to Polonik\u2019s test since density estimations and level-sets extractions are not practically feasible on high-dimensional data.\n\nThe LibSVM package [3] with a Gaussian kernel (\u03b3 = 2/#features) was used for the OCSVMs. A distance from a point to its kth-nearest neighbor was used as a sparsity measure for the OCNMs. k is set to 10% of the sample size (see footnote 6). \u03b1 = 0.1, 0.2, . . . , 0.9 were used for all experiments.\n\n4 Nominal features were transformed into numeric ones using binary encoding; missing values were replaced by their features\u2019 average values.\n5 The code can be downloaded at http://people.kyb.tuebingen.mpg.de/arthur/mmd.htm.\n6 Preliminary experiments show similar results obtained with k equal to 10, 20, . . . , 50% of |X|.\n\nFigure 3: Precision-recall curves averaged over all 33 experiments for GKS1d (OCSVMs), GKS2d (OCSVMs), and the 5 reference tests.\n\nFigure 4: Precision-recall curves averaged over all 33 experiments for GKS1d (OCSVMs), GKS2d (OCSVMs), GKS1d (OCNMs), and GKS2d (OCNMs).\n\n4.1 Results\n\nFor better visualization, results are shown in two separate figures: Figure 3 shows the precision-recall plots averaged over the 33 experiments for the OCSVM version of our tests, and the 5 reference tests. Figure 4 shows the precision-recall plots averaged over the 33 experiments for the OCSVM and OCNM versions of our tests. In both versions, GKS1d and GKS2d provide the best precision-recall compromise. For example, for the OCSVM version, at a recall of 0.86, GKS1d accurately detects distributional changes with 0.90 precision and GKS2d with 0.88 precision, while the second best competitor does so with 0.84 precision. In terms of their break even point (BEP) measures \u2013 the points at which precision equals recall \u2013 GKS1d outperforms the other 5 reference tests with a BEP of 0.89 while its second best competitor does so with a BEP of 0.84. Mean precisions for each dataset were compared using the Wilcoxon statistical test with \u03b1 = 0.05. 
Here, too, GKS1d performs significantly better than all others for both OCSVM and OCNM versions, except for the MMD with a p-value of 0.08 for GKS1d (OCSVM) and 0.12 for GKS1d (OCNM).\n\nAlthough the plots for our GKS1d (OCSVM) test (Figure 4) look better than GKS2d, no significant difference was found. This result is consistent with previous studies which claim that variants of solutions whose goal is to make the tests more symmetric have empirically shown no conclusive superiority [4]. We also found that the GKS1d (OCSVM) version of our test has the least runtime and scales well with dimensionality, while the GKS1d (OCNM) version suffers from increased time complexity, especially in high dimensions, due to its expensive neighborhood measure. However, note that this observation is true only when off-line computational processing on X is not considered. As opposed to the KCD and PBM tests, our GKS1d test need not be retrained on each X\u2032. Hence, in the context where X is treated as a baseline dataset, GKS1d (OCSVM) is relatively cheap, and estimated in O(nm) time (the total number of SVs used to calculate f\u2032C(\u03b11), . . . , f\u2032C(\u03b1q) is O(n)). In comparison to other tests, it is still the least computationally demanding (see footnote 7).\n\n4.2 Topic Change Detection among Documents\n\nWe evaluated our test on an additional setup of high-dimensionality problems pertaining to the detection of topic changes in streams of documents. We used the 20-Newsgroup document corpus (see footnote 8). 1000 words were randomly picked to generate 1000 bag-of-words features. 12 categories were used for the experiments (see footnote 9). Topic changes were simulated between all pairs of categories (66 pairs in total), using the same methodology as in the previous UCI experiments. Due to the excessive runtime of some of the tests, especially with high-dimensional data, we evaluated only 4 of the 7 methods: GKS1d (OCSVM), WW, MMD, and KDQ, whose expected runtime may be more reasonable.\n\nOnce again, our GKS1d test dominates the others with the best precision-recall compromise. With regard to BEP values, GKS1d outperforms the other reference tests with a BEP of 0.67 (0.70 precision on average), while its second best competitor (MMD) does so with a BEP of 0.62 (0.64 precision on average). According to the Wilcoxon statistical test with \u03b1 = 0.05, GKS1d performs significantly better than the others in terms of their average precision measures.\n\n7 MMD and WW complexities are estimated in O((n + m)^2) time where n, m are the sample sizes. KDQ uses bootstrapping for p-value estimations, and hence, is more expensive.\n8 The 20-Newsgroup corpus is at http://people.csail.mit.edu/jrennie/20Newsgroups/.\n9 The selection of these categories is based on the train/test split defined in http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html.\n\n5 Related Work\n\nOur proposed test belongs to a family of nonparametric tests for detecting change in multivariate data that compare distributions without the intermediate density estimation step. Our reference tests were thus taken from this family of studies. The kdq-tree test (KDQ) [4] uses a spatial scheme (called kdq-tree) to partition the data into small cells. Then, the Kullback-Leibler (KL) distance is used to measure the difference between data counts for the two samples in each cell. A permutation (bootstrapping) test [7] is used to calculate the significance of the difference (p-value). 
The multivariate Wald-Wolfowitz\ntest (WW) [10] measures the differences between two samples according to the minimum\nspanning tree in the graph of distances between all pairs in both samples. Then, the Wald-Wolfowitz\ntest statistics are computed over the number of components left in the graph after removing edges\nbetween examples of different samples. The kernel change detection test (KCD) [5] measures the\ndifference between two samples according to a \u201cFisher-like\u201d distance. This distance is\nbased on hypercircle characteristics of the resulting two OCSVMs, which were trained separately on\neach sample. The maximum mean discrepancy test (MMD) [12] measures discrepancy according to\na complete matrix of kernel-based dissimilarity measures between all examples, and test statistics\nare then computed. The PAC-Bayesian margin test (PBM) [6] measures the distance between\ntwo samples according to the average margins of a linear SVM classifier between the samples, and\ntest statistics are computed.\nAs discussed in detail before, our test follows the general approach of Polonik but differs in three\nimportant ways: (1) While Polonik uses a density estimator for specifying the MV-sets, we introduce\na simpler method that finds the MV-sets directly from the data. Our method is thus more practical\nand accurate in high-dimensional or small-sample-sized settings. (2) Once the MV-sets are defined,\nPolonik uses their hypothetical quantiles as the expected plots, and hence, runs the KS test in its\none-sample version (goodness-of-fit test). We take an approach that is more accurate in practice for\nfinite sample sizes, where approximations of MV-sets are not precise. Instead of using the hypothetical measures,\nwe estimate the expected plots of X empirically and use the two-sample KS test. (3) Unlike\nPolonik\u2019s work, ours was evaluated empirically and its superiority demonstrated over a wide range\nof nonparametric tests. 
Moreover, since Polonik\u2019s test relies on a density estimation and the ability\nto extract its level-sets, it is not practically feasible in high-dimensional settings.\nOther methods for estimating MV-sets exist in the literature [21, 1, 16, 13, 20, 23, 14]. Unfortunately,\nfor problems beyond two dimensions and non-convex sets, there is often a gap between their\ntheoretical and practical estimates [20]. We chose OCSVM and OCNM here because they perform\nwell on small, high-dimensional samples.\n\n6 Discussion and Summary\n\nThis paper makes two contributions. First, it proposes a new method that uses OCSVMs or OCNMs\nto represent high-dimensional distributions as a hierarchy of high-density regions. This method\nis used for statistical tests, but can also be used as a general, black-box method for efficient and\npractical representations of high-dimensional distributions. Second, it presents a nonparametric,\ngeneralized KS test that uses our representation method to detect distributional changes in\nhigh-dimensional data. Our test was found superior to competing tests in the sense of average precision\nand BEP measures, especially in the context of change-detection tasks.\nAn interesting and still open question is how we should set the input \u03b1 quantiles for our method.\nThe problem of determining the number of quantiles \u2013 and the gaps between consecutive ones \u2013 is\nrelated to the problem of histogram design.\n\nReferences\n[1] S. Ben-David and M. Lindenbaum. Learning distributions by their density levels: A paradigm\nfor learning without a teacher. Journal of Computer and System Sciences, 55(1):171\u2013182,\n1997.\n\n[2] Z.I. Botev, J.F. Grotowski, and D.P. Kroese. Kernel density estimation via diffusion. The Annals\nof Statistics, 38(5):2916\u20132957, 2010.\n\n[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001.\n\n[4] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi. 
An information-theoretic approach\nto detecting changes in multi-dimensional data streams. In INTERFACE, 2006.\n\n[5] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE\nTransactions on Signal Processing, 53(8):2961\u20132974, 2005.\n\n[6] Anton Dries and Ulrich R\u00fcckert. Adaptive concept drift detection. Statistical Analysis and\nData Mining, 2(5-6):311\u2013327, 2009.\n\n[7] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994.\n\n[8] J.H.J. Einmahl and D.M. Mason. Generalized quantile processes. The Annals of Statistics,\npages 1062\u20131078, 1992.\n\n[9] G. Fasano and A. Franceschini. A multidimensional version of the Kolmogorov-Smirnov test.\nMonthly Notices of the Royal Astronomical Society, 225:155\u2013170, 1987.\n\n[10] J.H. Friedman and L.C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and\nSmirnov two-sample tests. The Annals of Statistics, 7(4):697\u2013717, 1979.\n\n[11] J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In SBIA, pages\n66\u2013112. Springer, 2004.\n\n[12] A. Gretton, K.M. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A.J. Smola. A kernel method for\nthe two-sample-problem. Machine Learning, 1:1\u201310, 2008.\n\n[13] X. Huo and J.C. Lu. A network flow approach in finding maximum likelihood estimate of high\nconcentration regions. Computational Statistics & Data Analysis, 46(1):33\u201356, 2004.\n\n[14] D.M. Mason and W. Polonik. Asymptotic normality of plug-in level set estimates. The Annals\nof Applied Probability, 19(3):1108\u20131142, 2009.\n\n[15] A. Munoz and J.M. Moguerza. Estimation of high-density regions using one-class neighbor\nmachines. In PAMI, pages 476\u2013480, 2006.\n\n[16] J. Nunez Garcia, Z. Kutalik, K.H. Cho, and O. Wolkenhauer. Level sets and minimum volume\nsets of probability density functions. 
International Journal of Approximate Reasoning, 34(1):\n25\u201347, 2003.\n\n[17] J.A. Peacock. Two-dimensional goodness-of-fit testing in astronomy. Monthly Notices of the\nRoyal Astronomical Society, 202:615\u2013627, 1983.\n\n[18] W. Polonik. Concentration and goodness-of-fit in higher dimensions: (asymptotically)\ndistribution-free methods. The Annals of Statistics, 27(4):1210\u20131229, 1999.\n\n[19] Bernhard Sch\u00f6lkopf, John C. Platt, John C. Shawe-Taylor, Alex J. Smola, and Robert C.\nWilliamson. Estimating the support of a high-dimensional distribution. Neural Computation,\n13(7):1443\u20131471, 2001.\n\n[20] C.D. Scott and R.D. Nowak. Learning minimum volume sets. The Journal of Machine Learning\nResearch, 7:665\u2013704, 2006.\n\n[21] G. Walther. Granulometric smoothing. The Annals of Statistics, pages 2273\u20132299, 1997.\n\n[22] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts.\nMachine Learning, 23(1):69\u2013101, 1996.\n\n[23] R.M. Willett and R.D. Nowak. Minimax optimal level-set estimation. IEEE Transactions on\nImage Processing, 16(12):2965\u20132979, 2007.\n\n[24] John Wright, Yi Ma, Yangyu Tao, Zhouchen Lin, and Heung-Yeung Shum. Classification via\nminimum incremental coding length. SIAM J. Imaging Sciences, 2(2):367\u2013395, 2009.\n\n", "award": [], "sourceid": 336, "authors": [{"given_name": "Assaf", "family_name": "Glazer", "institution": null}, {"given_name": "Michael", "family_name": "Lindenbaum", "institution": null}, {"given_name": "Shaul", "family_name": "Markovitch", "institution": null}]}