{"title": "The Randomized Dependence Coefficient", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 9, "abstract": "We introduce the Randomized Dependence Coefficient (RDC), a measure of non-linear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-R\u00e9nyi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper.", "full_text": "The Randomized Dependence Coefficient\n\nDavid Lopez-Paz, Philipp Hennig, Bernhard Sch\u00f6lkopf\n\n{dlopez,phennig,bs}@tue.mpg.de\n\nMax Planck Institute for Intelligent Systems\n\nSpemannstra\u00dfe 38, T\u00fcbingen, Germany\n\nAbstract\n\nWe introduce the Randomized Dependence Coefficient (RDC), a measure of non-linear\ndependence between random variables of arbitrary dimension based on the\nHirschfeld-Gebelein-R\u00e9nyi Maximum Correlation Coefficient. RDC is defined in\nterms of correlation of random non-linear copula projections; it is invariant with\nrespect to marginal distribution transformations, has low computational cost and\nis easy to implement: just five lines of R code, included at the end of the paper.\n\n1 Introduction\n\nMeasuring statistical dependence between random variables is a fundamental problem in statistics.\nCommonly used measures of dependence, such as Pearson\u2019s rho, Spearman\u2019s rank or Kendall\u2019s tau, are\ncomputationally efficient and theoretically well understood, but consider only a limited class of\nassociation patterns, like linear or monotonically increasing functions. 
The development of non-linear\ndependence measures is challenging because of the far larger number of possible association\npatterns.\nDespite these difficulties, many non-linear statistical dependence measures have been developed\nrecently. Examples include the Alternating Conditional Expectations or backfitting algorithm (ACE)\n[2, 9], Kernel Canonical Correlation Analysis (KCCA) [1], (Copula) Hilbert-Schmidt Independence\nCriterion (CHSIC, HSIC) [6, 5, 15], Distance or Brownian Correlation (dCor) [24, 23] and the\nMaximal Information Coefficient (MIC) [18]. However, these methods exhibit high computational\ndemands (at least quadratic costs in the number of samples for KCCA, HSIC, CHSIC, dCor or\nMIC), are limited to measuring dependencies between scalar random variables (ACE, MIC) or can\nbe difficult to implement (ACE, MIC).\nThis paper develops the Randomized Dependence Coefficient (RDC), an estimator of the\nHirschfeld-Gebelein-R\u00e9nyi Maximum Correlation Coefficient (HGR) addressing the issues listed above. RDC\ndefines dependence between two random variables as the largest canonical correlation between random\nnon-linear projections of their respective empirical copula-transformations. RDC is invariant\nto monotonically increasing transformations, operates on random variables of arbitrary dimension,\nand has computational cost of O(n log n) with respect to the sample size. Moreover, it is easy to\nimplement: just five lines of R code, included in Appendix A.\nThe following Section reviews the classic work of Alfr\u00e9d R\u00e9nyi [17], who proposed seven desirable\nfundamental properties of dependence measures, proved to be satisfied by the\nHirschfeld-Gebelein-R\u00e9nyi Maximum Correlation Coefficient (HGR). Section 3 introduces the Randomized\nDependence Coefficient as an estimator designed in the spirit of HGR, since HGR itself is computationally\nintractable. 
Properties of RDC and its relationship to other non-linear dependence measures are\nanalysed in Section 4. Section 5 validates the empirical performance of RDC on a series of numerical\nexperiments on both synthetic and real-world data.\n\n2 Hirschfeld-Gebelein-R\u00e9nyi\u2019s Maximum Correlation Coefficient\n\nIn 1959 [17], Alfr\u00e9d R\u00e9nyi argued that a measure of dependence \u03c1* : X \u00d7 Y \u2192 [0, 1] between\nrandom variables X \u2208 X and Y \u2208 Y should satisfy seven fundamental properties:\n\n1. \u03c1*(X, Y) is defined for any pair of non-constant random variables X and Y.\n2. \u03c1*(X, Y) = \u03c1*(Y, X)\n3. 0 \u2264 \u03c1*(X, Y) \u2264 1\n4. \u03c1*(X, Y) = 0 iff X and Y are statistically independent.\n5. For bijective Borel-measurable functions f, g : R \u2192 R, \u03c1*(X, Y) = \u03c1*(f(X), g(Y)).\n6. \u03c1*(X, Y) = 1 if for Borel-measurable functions f or g, Y = f(X) or X = g(Y).\n7. If (X, Y) \u223c N(\u00b5, \u03a3), then \u03c1*(X, Y) = |\u03c1(X, Y)|, where \u03c1 is the correlation coefficient.\n\nR\u00e9nyi also showed the Hirschfeld-Gebelein-R\u00e9nyi Maximum Correlation Coefficient (HGR) [3, 17]\nto satisfy all these properties. HGR was defined by Gebelein in 1941 [3] as the supremum of Pearson\u2019s\ncorrelation coefficient \u03c1 over all Borel-measurable functions f, g of finite variance:\n\nhgr(X, Y) = sup_{f,g} \u03c1(f(X), g(Y)).    (1)\n\nSince the supremum in (1) is over an infinite-dimensional space, HGR is not computable.\nIt is an abstract concept, not a practical dependence measure. 
In the following we propose a scalable\nestimator with the same structure as HGR: the Randomized Dependence Coefficient.\n\n3 Randomized Dependence Coefficient\n\nThe Randomized Dependence Coefficient (RDC) measures the dependence between random samples\nX \u2208 R^{p\u00d7n} and Y \u2208 R^{q\u00d7n} as the largest canonical correlation between k randomly chosen\nnon-linear projections of their copula transformations. Before Section 3.4 defines this concept formally,\nwe describe the three necessary steps to construct the RDC statistic: copula-transformation of each\nof the two random samples (Section 3.1), projection of the copulas through k randomly chosen\nnon-linear maps (Section 3.2) and computation of the largest canonical correlation between the two sets\nof non-linear random projections (Section 3.3). Figure 1 offers a sketch of this process.\n\nFigure 1: RDC computation for a simple set of samples {(xi, yi)}_{i=1}^{100} drawn from a noisy circular\npattern: The samples are used to estimate the copula, then mapped with randomly drawn non-linear\nfunctions. The RDC is the largest canonical correlation between these non-linear projections.\n\n3.1 Estimation of Copula-Transformations\n\nTo achieve invariance with respect to transformations on marginal distributions (such as shifts or\nrescalings), we operate on the empirical copula transformation of the data [14, 15]. Consider a random\nvector X = (X1, . . . , Xd) with continuous marginal cumulative distribution functions (cdfs)\nPi, 1 \u2264 i \u2264 d. Then the vector U = (U1, . . . , Ud) := P (X) = (P1(X1), . . . 
, Pd(Xd)), known as\nthe copula transformation, has uniform marginals:\n\nTheorem 1. (Probability Integral Transform [14]) For a random variable X with cdf P, the random\nvariable U := P(X) is uniformly distributed on [0, 1].\n\nThe random variables U1, . . . , Ud are known as the observation ranks of X1, . . . , Xd. Crucially,\nU preserves the dependence structure of the original random vector X, but ignores each of its d\nmarginal forms [14]. The joint distribution of U is known as the copula of X:\n\nTheorem 2. (Sklar [20]) Let the random vector X = (X1, . . . , Xd) have continuous marginal cdfs\nPi, 1 \u2264 i \u2264 d. Then, the joint cumulative distribution of X is uniquely expressed as:\n\nP(X1, . . . , Xd) = C(P1(X1), . . . , Pd(Xd)),    (2)\n\nwhere the distribution C is known as the copula of X.\n\nA practical estimator of the univariate cdfs P1, . . . , Pd is the empirical cdf:\n\nPn(x) := (1/n) \u2211_{i=1}^{n} I(Xi \u2264 x),    (3)\n\nwhich gives rise to the empirical copula transformations of a multivariate sample:\n\nPn(x) = [Pn,1(x1), . . . , Pn,d(xd)].    (4)\n\nThe Massart-Dvoretzky-Kiefer-Wolfowitz inequality [13] can be used to show that empirical copula\ntransformations converge quickly to the true transformation as the sample size increases:\n\nTheorem 3. (Convergence of the empirical copula, [15, Lemma 7]) Let X1, . . . , Xn be an i.i.d.\nsample from a probability distribution over R^d with marginal cdfs P1, . . . , Pd. Let P(X) be the\ncopula transformation and Pn(X) the empirical copula transformation. 
Then, for any \u03b5 > 0:\n\nPr[ sup_{x \u2208 R^d} \u2016P(x) \u2212 Pn(x)\u2016_2 > \u03b5 ] \u2264 2d exp(\u22122n\u03b5^2 / d).    (5)\n\nComputing Pn(X) involves sorting the marginals of X \u2208 R^{d\u00d7n}, thus O(dn log(n)) operations.\n\n3.2 Generation of Random Non-Linear Projections\n\nThe second step of the RDC computation is to augment the empirical copula transformations with\nnon-linear projections, so that linear methods can subsequently be used to capture non-linear\ndependencies on the original data. This is a classic idea also used in other areas, particularly in regression.\nIn an elegant result, Rahimi and Recht [16] proved that linear regression on random, non-linear\nprojections of the original feature space can generate high-performance regressors:\n\nTheorem 4. (Rahimi-Recht) Let p be a distribution on \u2126 and |\u03c6(x; w)| \u2264 1. Let\nF = { f(x) = \u222b_\u2126 \u03b1(w)\u03c6(x; w)dw : |\u03b1(w)| \u2264 Cp(w) }. Draw w1, . . . , wk iid from p. Further let\n\u03b4 > 0, and c be some L-Lipschitz loss function, and consider data {xi, yi}_{i=1}^{n} drawn iid from some\narbitrary P(X, Y). The \u03b11, . . . , \u03b1k for which fk(x) = \u2211_{i=1}^{k} \u03b1i \u03c6(x; wi) minimizes the empirical\nrisk c(fk(x), y) have a distance from the c-optimal estimator in F bounded by\n\nE_P[c(fk(x), y)] \u2212 min_{f \u2208 F} E_P[c(f(x), y)] \u2264 O( (1/\u221an + 1/\u221ak) LC \u221a(log(1/\u03b4)) )    (6)\n\nwith probability at least 1 \u2212 2\u03b4.\n\nIntuitively, Theorem 4 states that randomly selecting wi in \u2211_{i=1}^{k} \u03b1i \u03c6(x; wi) instead of optimising\nthem causes only bounded error.\nThe choice of the non-linearities \u03c6 : R \u2192 R is the main and unavoidable assumption in RDC.\nThis choice is a well-known problem common to all non-linear regression methods and has been\nstudied extensively in the theory of regression as the selection of reproducing kernel Hilbert space\n[19, \u00a73.13]. The only way to favour one such family and distribution over another is to use prior\nassumptions about which kind of distributions the method will typically have to analyse.\nWe use random features instead of the Nystr\u00f6m method because of their smaller memory and\ncomputation requirements [11]. In our experiments, we will use sinusoidal projections, \u03c6(w^T x + b) :=\nsin(w^T x + b). Arguments favouring this choice are that shift-invariant kernels are approximated\nwith these features when using the appropriate random parameter sampling distribution [16], [4,\np. 208] [22, p. 24], and that functions with absolutely integrable Fourier transforms are approximated\nwith L2 error below O(1/\u221ak) by k of these features [10].\nLet the random parameters be wi \u223c N(0, sI), bi \u223c N(0, s). Choosing wi to be Normal is analogous\nto the use of the Gaussian kernel for HSIC, CHSIC or KCCA [16]. Tuning s is analogous to selecting\nthe kernel width, that is, to regularize the non-linearity of the random projections.\nGiven a data collection X = (x1, . . . , xn), we will denote by\n\n\u03a6(X; k, s) := [ \u03c6(w1^T x1 + b1)  \u00b7\u00b7\u00b7  \u03c6(wk^T x1 + bk)\n               \u22ee                     \u22ee\n               \u03c6(w1^T xn + b1)  \u00b7\u00b7\u00b7  \u03c6(wk^T xn + bk) ]^T    (7)\n\nthe k-th order random non-linear projection from X \u2208 R^{d\u00d7n} to \u03a6(X; k, s) \u2208 R^{k\u00d7n}. The\ncomputational complexity of computing \u03a6(X; k, s) with naive matrix multiplications is O(kdn). However,\nrecent techniques using fast Walsh-Hadamard transforms [11] allow computing these feature\nexpansions within a computational cost of O(k log(d)n) and O(k) storage.\n\n3.3 Computation of Canonical Correlations\n\nThe final step of RDC is to compute the linear combinations of the augmented empirical copula\ntransformations that have maximal correlation. Canonical Correlation Analysis (CCA, [7]) is the\ncalculation of pairs of basis vectors (\u03b1, \u03b2) such that the projections \u03b1^T X and \u03b2^T Y of two random\nsamples X \u2208 R^{p\u00d7n} and Y \u2208 R^{q\u00d7n} are maximally correlated. The correlations between the\nprojected (or canonical) random samples are referred to as canonical correlations. There exist up to\nmin(rank(X), rank(Y)) of them. Canonical correlations \u03c1^2 are the solutions to the eigenproblem:\n\n[ 0               Cxx^{\u22121} Cxy ] [ \u03b1 ]         [ \u03b1 ]\n[ Cyy^{\u22121} Cyx   0               ] [ \u03b2 ] = \u03c1^2 [ \u03b2 ],    (8)\n\nwhere Cxy = cov(X, Y) and the matrices Cxx and Cyy are assumed to be invertible. 
Therefore,\nthe largest canonical correlation \u03c11 between X and Y is the supremum of the correlation coefficients\nover their linear projections, that is: \u03c11(X, Y) = sup_{\u03b1,\u03b2} \u03c1(\u03b1^T X, \u03b2^T Y).\nWhen p, q \u226a n, the cost of CCA is dominated by the estimation of the matrices Cxx, Cyy and Cxy,\nhence being O((p + q)^2 n) for two random variables of dimensions p and q, respectively.\n\n3.4 Formal Definition of RDC\n\nGiven the random samples X \u2208 R^{p\u00d7n} and Y \u2208 R^{q\u00d7n} and the parameters k \u2208 N+ and s \u2208 R+, the\nRandomized Dependence Coefficient between X and Y is defined as:\n\nrdc(X, Y; k, s) := sup_{\u03b1,\u03b2} \u03c1(\u03b1^T \u03a6(P(X); k, s), \u03b2^T \u03a6(P(Y); k, s)).    (9)\n\n4 Properties of RDC\n\nComputational complexity: In the typical setup (very large n, large p and q, small k) the\ncomputational complexity of RDC is dominated by the calculation of the copula-transformations. Hence,\nwe achieve a cost in terms of the sample size of O((p+q)n log n + kn log(pq) + k^2 n) \u2248 O(n log n).\n\nEase of implementation: An implementation of RDC in R is included in Appendix A.\n\nRelationship to the HGR coefficient: It is tempting to wonder whether RDC is a consistent, or\neven an efficient estimator of the HGR coefficient. However, a simple experiment shows that it is not\ndesirable to approximate HGR exactly on finite datasets: Consider p(X, Y) = N(x; 0, 1)N(y; 0, 1),\nunder which X and Y are independent and thus, by both R\u00e9nyi\u2019s 4th and 7th properties, hgr(X, Y) = 0.\nHowever, for finitely many N samples from p(X, Y), almost surely, values in both X and Y are\npairwise different and separated by a finite difference. So there exist continuous (thus Borel\nmeasurable) functions f(X) and g(Y) mapping both X and Y to the sorting ranks of Y, i.e.\nf(xi) = g(yi) \u2200(xi, yi) \u2208 (X, Y). 
Therefore, the finite-sample version of Equation (1) is constant\nand equal to \u201c1\u201d for continuous random variables. Meaningful measures of dependence from\nfinite samples thus must rely on some form of regularization. RDC achieves this by approximating\nthe space of Borel measurable functions with the restricted function class F from Theorem 4:\nAssume the optimal transformations f and g (Equation 1) to belong to the Reproducing Kernel\nHilbert Space F (Theorem 4), with associated shift-invariant, positive semi-definite kernel function\nk(x, x\u2032) = \u27e8\u03c6(x), \u03c6(x\u2032)\u27e9_F \u2264 1. Then, with probability greater than 1 \u2212 2\u03b4:\n\nhgr(X, Y; F) \u2212 rdc(X, Y; k) = O( (\u2016m\u2016_F/\u221an + LC/\u221ak) \u221a(log(1/\u03b4)) ),    (10)\n\nwhere m := \u03b1\u03b1^T + \u03b2\u03b2^T and n, k denote the sample size and number of random features. The\nbound (10) is the sum of two errors. The error O(1/\u221an) is due to the convergence of CCA\u2019s\nlargest eigenvalue in the finite sample size regime. This result [8, Theorem 6] is originally obtained\nby posing CCA as a least squares regression on the product space induced by the feature map\n\u03c8(x, y) = [\u03c6(x)\u03c6(x)^T, \u03c6(y)\u03c6(y)^T, \u221a2 \u03c6(x)\u03c6(y)^T]^T. Because of approximating \u03c8 with k random\nfeatures, an additional error O(1/\u221ak) is introduced in the least squares regression [16, Lemma\n3]. Therefore, an equivalence between RDC and KCCA is established if RDC uses an infinite number\nof sinusoidal features, the random sampling distribution is set to the inverse Fourier transform\nof the shift-invariant kernel used by KCCA and the copula-transformations are discarded. 
However,\nwhen k \u2265 n regularization is needed to avoid spurious perfect correlations, as discussed above.\n\nRelationship to other estimators: Table 1 summarizes several state-of-the-art dependence measures\nshowing, for each measure, whether it allows for general non-linear dependence estimation,\nhandles multidimensional random variables, is invariant with respect to changes in marginal distributions,\nreturns a statistic in [0, 1], satisfies R\u00e9nyi\u2019s properties (Section 2), and how many parameters\nit requires. As parameters, we here count the kernel function for kernel methods, the basis function\nand number of random features for RDC, the stopping tolerance for ACE and the search-grid size for\nMIC, respectively. Finally, the table lists computational complexities with respect to sample size.\nWhen using random features \u03c6 linear for some neighbourhood around zero (like sinusoids or sigmoids),\nRDC converges to Spearman\u2019s rank correlation coefficient as s \u2192 0, for any k.\n\nTable 1: Comparison between non-linear dependence measures.\n\nName of Coeff. | Non-Linear | Vector Inputs | Marginal Invariant | R\u00e9nyi\u2019s Properties | Coeff. \u2208 [0, 1] | # Par. | Comp. Cost\nPearson\u2019s \u03c1 | \u00d7 | \u00d7 | \u00d7 | \u00d7 | \u2713 | 0 | n\nSpearman\u2019s \u03c1 | \u00d7 | \u00d7 | \u2713 | \u00d7 | \u2713 | 0 | n log n\nKendall\u2019s \u03c4 | \u00d7 | \u00d7 | \u2713 | \u00d7 | \u2713 | 0 | n log n\nCCA | \u00d7 | \u2713 | \u00d7 | \u00d7 | \u2713 | 0 | n\nKCCA [1] | \u2713 | \u2713 | \u00d7 | \u00d7 | \u2713 | 1 | n^3\nACE [2] | \u2713 | \u00d7 | \u00d7 | \u2713 | \u2713 | 1 | n\nMIC [18] | \u2713 | \u00d7 | \u00d7 | \u00d7 | \u2713 | 1 | n^1.2\ndCor [24] | \u2713 | \u2713 | \u00d7 | \u00d7 | \u2713 | 1 | n^2\nHSIC [5] | \u2713 | \u2713 | \u00d7 | \u00d7 | \u00d7 | 1 | n^2\nCHSIC [15] | \u2713 | \u2713 | \u2713 | \u00d7 | \u00d7 | 1 | n^2\nRDC | \u2713 | \u2713 | \u2713 | \u2713 | \u2713 | 2 | n log n\n\nTesting for independence with RDC: Consider the hypothesis \u201cthe two sets of non-linear projections\nare mutually uncorrelated\u201d. Under normality assumptions (or large sample sizes), Bartlett\u2019s\napproximation [12] can be used to show ((2k+3)/2 \u2212 n) log \u220f_{i=1}^{k} (1 \u2212 \u03c1i^2) \u223c \u03c7^2_{k^2}, where \u03c11, . . . , \u03c1k are the\ncanonical correlations between \u03a6(P(X); k, s) and \u03a6(P(Y); k, s). Alternatively, non-parametric\nasymptotic distributions can be obtained from the spectrum of the inner products of the non-linear\nrandom projection matrices [25, Theorem 3].\n\n5 Experimental Results\n\nWe performed experiments on both synthetic and real-world data to validate the empirical performance\nof RDC versus the non-linear dependence measures listed in Table 1. In some experiments\nwe do not compare against KCCA because we were unable to find a good set of hyperparameters.\n\nParameter selection: For RDC, the number of random features is set to k = 20 for both random\nsamples, since no significant improvements were observed for larger values. The random feature\nsampling parameter s is more crucial, and set as follows: when the marginals of u are standard\nuniforms, w \u223c N(0, sI) and b \u223c N(0, s), then V[w^T u + b] = s(1 + d/3); therefore, we opt to set\ns to a linear scaling of the input variable dimensionality. In all our experiments s = 1/(6d) worked well.\nThe development of better methods to set the parameters of RDC is left as future work.\nHSIC and CHSIC use Gaussian kernels k(z, z\u2032) = exp(\u2212\u03b3\u2016z \u2212 z\u2032\u2016_2^2) with \u03b3^{\u22121} set to the euclidean\ndistance median of each sample [5]. MIC\u2019s search-grid size is set to B(n) = n^0.6 as recommended\nby the authors [18], although speed improvements are achieved when using lower values. ACE\u2019s\ntolerance is set to \u03b5 = 0.01, the default value in the R package acepack.\n\n5.1 Synthetic Data\n\nResistance to additive noise: We define the power of a dependence measure as its ability to\ndiscern between dependent and independent samples that share equal marginal forms. In the spirit\nof Simon and Tibshirani^1, we conducted experiments to estimate the power of RDC as a measure\nof non-linear dependence. We chose 8 bivariate association patterns, depicted inside little boxes in\nFigure 3. For each of the 8 association patterns, 500 repetitions of 500 samples were generated,\nin which the input sample was uniformly distributed on the unit interval. Next, we regenerated\nthe input sample randomly, to generate independent versions of each sample with equal marginals.\nFigure 3 shows the power for the discussed non-linear dependence measures as the variance of some\nzero-mean Gaussian additive noise increases from 1/30 to 3. RDC shows worse performance in\nthe linear association pattern due to overfitting and in the step-function due to the smoothness prior\ninduced by the sinusoidal features, but has good performance in non-functional association patterns.\n\nRunning times: Table 2 shows running times for the considered non-linear dependence measures\non scalar, uniformly distributed, independent samples of sizes {10^3, . . . , 10^6} when averaging over\n100 runs. Single runs above ten minutes were cancelled. Pearson\u2019s \u03c1, ACE, dCor, KCCA and MIC\nare implemented in C, while RDC, HSIC and CHSIC are implemented as interpreted R code. 
KCCA\nis approximated using incomplete Cholesky decompositions as described in [1].\n\nTable 2: Average running times (in seconds) of dependence measures versus sample size.\n\nsample size | Pearson\u2019s \u03c1 | RDC | ACE | KCCA | dCor | HSIC | CHSIC | MIC\n1,000 | 0.0001 | 0.0047 | 0.0080 | 0.402 | 0.3417 | 0.3103 | 0.3501 | 1.0983\n10,000 | 0.0002 | 0.0557 | 0.0782 | 3.247 | 59.587 | 27.630 | 29.522 | \u2014\n100,000 | 0.0071 | 0.3991 | 0.5101 | 43.801 | \u2014 | \u2014 | \u2014 | \u2014\n1,000,000 | 0.0914 | 4.6253 | 5.3830 | \u2014 | \u2014 | \u2014 | \u2014 | \u2014\n\nValue of statistic in [0, 1]: Figure 4 shows RDC, ACE, dCor, MIC, Pearson\u2019s \u03c1, Spearman\u2019s rank\nand Kendall\u2019s \u03c4 dependence estimates for 14 different associations of two scalar random samples.\nRDC scores values close to one on all the proposed dependent associations, whilst scoring values\nclose to zero for the independent association, depicted last. When the associations are Gaussian (first\nrow), RDC scores values close to the Pearson\u2019s correlation coefficient (Section 2, 7th property).\n\n1 http://www-stat.stanford.edu/~tibs/reshef/comment.pdf\n\n5.2 Feature Selection in Real-World Data\n\nWe performed greedy feature selection via dependence maximization [21] on twelve real-world\ndatasets. More specifically, we attempted to construct the subset of features G \u2282 X that minimizes\nthe normalized mean squared regression error (NMSE) of a Gaussian process regressor. We\ndo so by selecting the feature x(i) maximizing dependence between the feature set Gi = {Gi\u22121, x(i)}\nand the target variable y at each iteration i \u2208 {1, . . . , 10}, such that G0 = {\u2205} and x(i) \u2209 Gi\u22121.\nWe considered 12 heterogeneous datasets, obtained from the UCI dataset repository^2, the Gaussian\nprocess web site Data^3 and the Machine Learning data set repository^4. 
Random training/test\npartitions are computed to be disjoint and equal sized.\nSince G can be multi-dimensional, we compare RDC to the non-linear methods dCor, HSIC and\nCHSIC. Given their quadratic computational demands, dCor, HSIC and CHSIC use up to 1,000\npoints when measuring dependence; this constraint only applied on the sarcos and abalone\ndatasets. Results are averages of 20 random training/test partitions.\n\nFigure 2: Feature selection experiments on real-world datasets. [Panels: abalone, automobile, autompg, breast, calhousing, cpuact, crime, housing, insurance, parkinson, sarcos, whitewine. x-axis: number of features; y-axis: Gaussian process NMSE; curves: dCor, HSIC, CHSIC, RDC.]\n\nFigure 2 summarizes the results for all datasets and algorithms as the number of selected features\nincreases. RDC performs best in most datasets, with much lower running time than its contenders.\n\n6 Conclusion\n\nWe have presented the randomized dependence coefficient, a lightweight non-linear measure of\ndependence between multivariate random samples. Constructed as a finite-dimensional estimator in\nthe spirit of the Hirschfeld-Gebelein-R\u00e9nyi maximum correlation coefficient, RDC performs well\nempirically, is scalable to very large datasets, and is easy to adapt to concrete problems.\nWe thank Alberto Su\u00e1rez, Theofanis Karaletsos and David Reshef for fruitful discussions.\n\n2 http://www.ics.uci.edu/~mlearn\n3 http://www.gaussianprocess.org/gpml/data/\n4 http://www.mldata.org\n\nFigure 3: Power of discussed measures on several bivariate association patterns as noise increases.\nInsets show the noise-free form of each association pattern. [x-axis: noise level; y-axis: power; curves: cor, dCor, MIC, ACE, HSIC, CHSIC, RDC.]\n\nFigure 4: RDC, ACE, dCor, MIC, Pearson\u2019s \u03c1, Spearman\u2019s rank and Kendall\u2019s \u03c4 estimates (numbers\nin tables above plots, in that order) for several bivariate association patterns. [Values per panel: (1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0); (0.8, 0.8, 0.7, 0.5, 0.8, 0.8, 0.6); (0.4, 0.4, 0.4, 0.2, 0.4, 0.4, 0.3); (0.1, 0.1, 0.1, 0.1, 0.0, 0.0, 0.0); (0.4, 0.4, 0.4, 0.2, -0.4, -0.4, -0.3); (0.8, 0.8, 0.7, 0.5, -0.8, -0.8, -0.6); (1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0); (1.0, 1.0, 0.4, 1.0, 0.0, 0.0, 0.0); (0.3, 0.3, 0.1, 0.2, 0.0, 0.0, -0.0); (0.5, 0.5, 0.1, 0.2, 0.0, 0.0, 0.0); (1.0, 1.0, 0.5, 0.9, 0.0, 0.0, 0.0); (1.0, 1.0, 0.3, 0.6, 0.1, 0.1, 0.1); (1.0, 1.0, 0.2, 0.6, -0.0, -0.0, -0.0); (0.1, 0.1, 0.0, 0.1, -0.0, -0.0, -0.0).]\n\nA R Source Code\n\nrdc <- function(x,y,k=20,s=1/6,f=sin) {\n  # Empirical copula transformation (ranks scaled by n), plus a column of ones for the bias term\n  x <- cbind(apply(as.matrix(x),2,function(u)rank(u)/length(u)),1)\n  y <- cbind(apply(as.matrix(y),2,function(u)rank(u)/length(u)),1)\n  # Random Gaussian linear projections, scaled by s divided by the number of columns\n  x <- s/ncol(x)*x%*%matrix(rnorm(ncol(x)*k),ncol(x))\n  y <- s/ncol(y)*y%*%matrix(rnorm(ncol(y)*k),ncol(y))\n  # Largest canonical correlation between the non-linear features\n  cancor(cbind(f(x),1),cbind(f(y),1))$cor[1]\n}\n\nReferences\n\n[1] F. R. Bach and M. I. Jordan. Kernel independent component analysis. JMLR, 3:1\u201348, 2002.\n[2] L. Breiman and J. H. Friedman. Estimating Optimal Transformations for Multiple Regression and Correlation. Journal of the American Statistical Association, 80(391):580\u2013598, 1985.\n[3] H. Gebelein. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. Zeitschrift f\u00fcr Angewandte Mathematik und Mechanik, 21(6):364\u2013379, 1941.\n[4] I. I. Gihman and A. V. Skorohod. The Theory of Stochastic Processes, volume 1. Springer, 1974.\n[5] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723\u2013773, 2012.\n[6] A. Gretton, O. Bousquet, A. Smola, and B. Sch\u00f6lkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th international conference on Algorithmic Learning Theory, pages 63\u201377. 
Springer-Verlag, 2005.\n[7] W. K. H\u00e4rdle and L. Simar. Applied Multivariate Statistical Analysis. Springer, 2nd edition, 2007.\n[8] D. Hardoon and J. Shawe-Taylor. Convergence analysis of kernel canonical correlation analysis: theory and practice. Machine Learning, 74(1):23\u201338, 2009.\n[9] T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science, 1:297\u2013310, 1986.\n[10] L. K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics, 20(1):608\u2013613, 1992.\n[11] Q. Le, T. Sarlos, and A. Smola. Fastfood \u2013 Approximating kernel expansions in loglinear time. In ICML, 2013.\n[12] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, 1979.\n[13] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3), 1990.\n[14] R. Nelsen. An Introduction to Copulas. Springer Series in Statistics, 2nd edition, 2006.\n[15] B. Poczos, Z. Ghahramani, and J. Schneider. Copula-based kernel dependency measures. In ICML, 2012.\n[16] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. NIPS, 2008.\n[17] A. R\u00e9nyi. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungaricae, 10:441\u2013451, 1959.\n[18] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518\u20131524, 2011.\n[19] B. Sch\u00f6lkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.\n[20] A. Sklar. Fonctions de r\u00e9partition \u00e0 n dimensions et leurs marges. Publ. Inst. Statis. Univ. Paris, 8(1):229\u2013231, 1959.\n[21] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. 
Feature selection via dependence maximization. JMLR, 13:1393\u20131434, June 2012.\n[22] M. L. Stein. Interpolation of Spatial Data. Springer, 1999.\n[23] G. J. Sz\u00e9kely and M. L. Rizzo. Rejoinder: Brownian distance covariance. Annals of Applied Statistics, 3(4):1303\u20131308, 2009.\n[24] G. J. Sz\u00e9kely, M. L. Rizzo, and N. K. Bakirov. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6), 2007.\n[25] K. Zhang, J. Peters, D. Janzing, and B. Sch\u00f6lkopf. Kernel-based conditional independence test and application in causal discovery. CoRR, abs/1202.3775, 2012.\n", "award": [], "sourceid": 14, "authors": [{"given_name": "David", "family_name": "Lopez-Paz", "institution": "MPI for Intelligent Systems & University of Cambridge"}, {"given_name": "Philipp", "family_name": "Hennig", "institution": "MPI T\u00fcbingen"}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": "MPI T\u00fcbingen"}]}