{"title": "Hypothesis Testing in Unsupervised Domain Adaptation with Applications in Alzheimer's Disease", "book": "Advances in Neural Information Processing Systems", "page_first": 2496, "page_last": 2504, "abstract": "Consider samples from two different data sources $\\{\\mathbf{x_s^i}\\} \\sim P_{\\rm source}$ and $\\{\\mathbf{x_t^i}\\} \\sim P_{\\rm target}$. We only observe their transformed versions $h(\\mathbf{x_s^i})$ and $g(\\mathbf{x_t^i})$, for some known function class $h(\\cdot)$ and $g(\\cdot)$. Our goal is to perform a statistical test checking if $P_{\\rm source}$ = $P_{\\rm target}$ while removing the distortions induced by the transformations. This problem is closely related to concepts underlying numerous domain adaptation algorithms, and in our case, is motivated by the need to combine clinical and imaging based biomarkers from multiple sites and/or batches, where this problem is fairly common and an impediment in the conduct of analyses with much larger sample sizes. We develop a framework that addresses this problem using ideas from hypothesis testing on the transformed measurements, where in the distortions need to be estimated {\\it in tandem} with the testing. We derive a simple algorithm and study its convergence and consistency properties in detail, and we also provide lower-bound strategies based on recent work in continuous optimization. On a dataset of individuals at risk for neurological disease, our results are competitive with alternative procedures that are twice as expensive and in some cases operationally infeasible to implement.", "full_text": "Hypothesis Testing in Unsupervised Domain\n\nAdaptation with Applications in Alzheimer\u2019s Disease\n\nHao Henry Zhou\u2020\nSterling C. Johnson\u00a7,\u2020\n\nSathya N. Ravi\u2020\n\nGrace Wahba\u2020\n\n\u00a7William S. Middleton Memorial VA Hospital\n\n\u2020University of Wisconsin\u2013Madison\n\nVamsi K. 
Ithapu\u2020\nVikas Singh\u2020\n\nAbstract\n\nConsider samples from two different data sources {x_s^i} \u223c Psource and {x_t^i} \u223c\nPtarget. We only observe their transformed versions h(x_s^i) and g(x_t^i), for some\nknown function class h(\u00b7) and g(\u00b7). Our goal is to perform a statistical test checking\nif Psource = Ptarget while removing the distortions induced by the transformations.\nThis problem is closely related to domain adaptation, and in our case, is motivated\nby the need to combine clinical and imaging based biomarkers from multiple sites\nand/or batches \u2013 a fairly common impediment in conducting analyses with much\nlarger sample sizes. We address this problem using ideas from hypothesis testing\non the transformed measurements, wherein the distortions need to be estimated in\ntandem with the testing. We derive a simple algorithm and study its convergence\nand consistency properties in detail, and provide lower-bound strategies based on\nrecent work in continuous optimization. On a dataset of individuals at risk for\nAlzheimer\u2019s disease, our framework is competitive with alternative procedures that\nare twice as expensive and in some cases operationally infeasible to implement.\n\n1 Introduction\n\nA first-order requirement in many estimation tasks is that the training and testing samples are from\nthe same underlying distribution and the associated features are directly comparable. But in many\nreal-world datasets, training/testing (or source/target) samples may come from different \u201cdomains\u201d:\nthey may be variously represented and involve different marginal distributions [8, 32]. \u201cDomain\nadaptation\u201d (DA) algorithms [24, 27] are often used to address such problems. 
For example, in\nvision, not accounting for systematic source/target variations in images due to commodity versus\nprofessional camera equipment yields poor accuracy for visual recognition; here, these schemes\ncan be used to match the source/target distributions or identify intermediate latent representations\n[12, 1, 9], often yielding superior performance [29, 12, 1, 9]. Such success has led to specialized\nformulations, for instance, when target annotations are absent (unsupervised) [11, 13] or minimally\navailable (semi-supervised) [7, 22]. With a mapping to compensate for this domain shift, we know\nthat the normalized (or transformed) features are sufficiently invariant and reliable in practice.\nIn numerous DA applications, the interest is in seamlessly translating a classifier across domains;\nconsequently, the model\u2019s test/target predictive performance serves the intended goals. However, in\nmany areas of science, issues concerning the statistical power of the experiment, the sample sizes\nneeded to achieve this power and whether we can derive p-values for the estimated domain adaptation\nmodel are equally, if not more, important. For instance, the differences in instrument calibration and\nreagents in wet lab experiments are potential DA applications except that the downstream analysis may\ninvolve little to no discrimination performance measures per se. Separately, in multi-site population\nstudies [17, 18, 21], where recruitment and data acquisition are distributed over multiple sites\n(even countries) for operational reasons, site-specific shifts in measurements and missing covariates\nare common [17, 18, 21]. Harmonizing such data requires some form of DA. 
While good\npredictive performance is useful, the ability to perform hypothesis tests and obtain interpretable\nstatistical quantities remains central to the conduct of experiments or analyses across a majority of\nscientific disciplines. We remark that constructs such as H\u2206H distance have been widely used to\nanalyze non-conservative DA and obtain probabilistic bounds on the performance of a classifier from\ncertain hypotheses classes, but the statistical considerations identified above are not well studied and\ndo not follow straightforwardly from the learning theoretic results derived in [2, 5].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nA Motivating Example from Neuroscience. The social and financial burden (of health-care) is\nprojected to grow considerably since the elderly are the fastest growing populace [28, 6], and age is the\nstrongest risk factor for neurological disorders such as Alzheimer\u2019s disease (AD). Although numerous\nlarge scale projects study the aging brain to identify early biomarkers for various types of dementia,\nwhen younger cohorts are analyzed (farther away from disease onset), the effect sizes become weaker.\nThis has led to multi-center research collaborations and clinical trials in an effort to increase sample\nsizes. Despite the promise, combining data across sites poses significant statistical challenges \u2013 for AD\nin particular, the need for harmonization or standardization (i.e., domain adaptation) was found to be\nessential [20, 34] in the analysis of multi-site cerebrospinal fluid (CSF) assays and brain volumetric\nmeasurements. These analyses refer to the use of AD related pathological biomarkers (\u03b2-amyloid\npeptide in CSF), but there is variability in absolute concentrations due to CSF collection and storage\nprocedures [34]. 
Similar variability issues exist for amyloid and structural brain imaging studies,\nand are impediments to pooling and analyzing multi-site data in totality. The temporary\nsolution emerging from [20] is to use a \u201cnormalization/anchor\u201d cohort of individuals which will\nthen be validated using test/retest variation. The goal of this paper is to provide a rigorous statistical\nframework for addressing these challenges that will make domain adaptation an analysis tool in\nneuroimaging as well as other experimental areas.\nThis paper makes the following key contributions. a) On the formulation side, we generalize\nexisting models which assume an identical transformation applied to both the source/target domains\nto compensate for the domain shift. Our proposal permits domain-specific transformations to align\nboth the marginal (and the conditional) data distributions; b) On the statistical side, we derive a\nprovably consistent hypothesis test to check whether the transformation model can indeed correct the\n\u2018shift\u2019, directly yielding p-values. We also show consistency of the model in that we can provably\nestimate the actual transformation parameters in an asymptotic sense; c) We identify some interesting\nlinks of our estimation with recent developments in continuous optimization and show how our model\npermits an analysis based on obtaining successively tighter lower bounds; d) Finally, we present\nexperiments on an AD study showing how CSF data from different batches (source/target) can be\nharmonized, enabling the application of standard statistical analysis schemes.\n\n2 Background\n\nConsider the unsupervised domain adaptation setting where the inputs/features/covariates in the\nsource and target domains are denoted by xs and xt respectively. The source and target feature\nspaces are related via some unknown mapping, which is recovered by applying some appropriate\ntransformations on the inputs. 
We denote these transformed inputs as \u02dcxs and \u02dcxt. Within this setting,\nour goal is two-fold: first, to estimate the source-to-target mapping, followed by performing some\nstatistical test about the \u2018goodness\u2019 of the estimate. Specifically, the problem is to first estimate\nsuitable transformations h \u2208 G, g \u2208 G\u2032, parameterized by some \u03bb and \u03b2 respectively, such that the\ntransformed data \u02dcxs := h(xs, \u03bb) and \u02dcxt := g(xt, \u03b2) have similar distributions. G and G\u2032 restrict\nthe allowable mappings (e.g., affine) between source and target. Clearly the goodness of domain\nadaptation depends on the nature and size of G, and the similarity measure used to compare the\ndistributions. The distance/similarity measure used in our model defines a statistic for comparing\ndistributions. Hence, using the estimated transformations, we then provide a hypothesis test for the\nexistence of \u03bb and \u03b2 such that Pr(\u02dcxs) = Pr(\u02dcxt), and finally assign p-values for the significance.\nTo set up this framework, we start with a statistic that measures the distance between two distributions.\nAs motivated in Section 1, we do not impose any parametric assumptions. Since we are interested in\nthe mismatch of Pr(\u02dcxs) and Pr(\u02dcxt), we use maximum mean discrepancy (MMD) which measures\nthe mean distance between {xs} and {xt} in a Hilbert space induced by a characteristic kernel K,\n\n$$\\mathrm{MMD}(x_s, x_t) = \\sup_{f \\in \\mathcal{F}} \\left( \\frac{1}{m} \\sum_{i=1}^{m} f(x_t^i) - \\frac{1}{n} \\sum_{i=1}^{n} f(x_s^i) \\right) = \\left\\| \\frac{1}{m} \\sum_{i=1}^{m} K(x_t^i, \\cdot) - \\frac{1}{n} \\sum_{i=1}^{n} K(x_s^i, \\cdot) \\right\\|_{\\mathcal{H}} \\qquad (1)$$\n\nwhere F = {f \u2208 HK, ||f||HK \u2264 1} and HK denotes the universal RKHS. The advantage of\nMMD over other nonparametric distance measures is discussed in [30, 15, 16, 31]. 
Specifically,\nthe MMD statistic defines a metric, and whenever MMD is large, the samples are \u201clikely\u201d from different\ndistributions. The simplicity of MMD and the statistical and asymptotic guarantees it provides\n[15, 16] largely drive our estimation and testing approach. In fact, our framework will operate on\n\u2018transformed\u2019 data \u02dcxs and \u02dcxt while estimating the appropriate transformations.\n\n2.1 Related Work\n\nThe body of work on domain adaptation is fairly extensive, even when restricted to the unsupervised\nversion. Below, we describe algorithms that are more closely related to our work and identify\nthe similarities/differences. A common feature of many unsupervised methods is to match the\nfeature/covariate distributions between the source and the target domains, and broadly, these fall\ninto two different categories. The first set of methods deal with feature distributions that may be\ndifferent but not due to the distortion of the inputs/features. Denoting the labels/outputs for the source\nand target domains as ys and yt respectively, here we have, Pr(ys|xs) \u2248 Pr(yt|xt) but Pr(xs) \u2260\nPr(xt) \u2013 this is sampling bias. The ideas in [19, 25, 2, 5] address this by \u2018re-weighting\u2019 the source\ninstances so as to minimize feature distribution differences between the source and the target. Such\nre-weighting schemes do not necessarily correspond to transforming the source and target inputs,\nand may simply scale or shift the appropriate loss functions. 
The central difference among these\napproaches is the distance metric used to measure the discrepancy of the feature distributions.\nThe second set of methods corresponds to the case where distributional differences are mainly caused\nby feature distortion such as change in pose, lighting, blur and resolution in visual recognition.\nUnder this scenario, Pr(ys|xs) \u2260 Pr(yt|xt) but Pr(\u02dcxs) \u2248 Pr(\u02dcxt) and the transformed conditional\ndistributions are close. The methods in [26, 1, 10, 14, 12] address this problem by learning the same feature\ntransformation on source and target domains to minimize the difference of Pr(\u02dcxs) and Pr(\u02dcxt) directly.\nOur proposed model fits better under this umbrella, where the distributional differences are mainly\ncaused by feature distortion due to site-specific acquisition and other experimental issues. While\nsome methods are purely data-driven such as those using geodesic flow [14, 12], backpropagation\n[10] and so on, other approaches estimate the transformation that minimizes distance metrics such\nas the Maximum Mean Discrepancy (MMD) [26, 1]. To our knowledge, no statistical consistency\nresults are known for any of the methods that fall in the second set.\nOverview: The idea in [1] is perhaps the most closely related to our proposal, but with a few important\ndifferences. First, we relax the condition that the same transformation must be applied to each\ndomain; instead, we permit domain-specific transformations. Second, we derive a provably consistent\nhypothesis test to check whether the transformation model can indeed correct the shift. We then prove\nthat the model is consistent when it is correct. These theoretical results apply directly to [1], which\nturns out to be a special case of our framework. 
We find that the extension of our results to [26] is\nproblematic since that method violates the requirement that the mean differences should be measured\nin a valid Reproducing Kernel Hilbert space (RKHS).\n\n3 Model\n\nWe first present the objective function of our estimation problem and provide a simple algorithm\nto compute the unknown parameters \u03bb and \u03b2. Recall the definition of MMD from (1). Given the\nkernel K and the source and target inputs xs and xt, we are interested in the MMD between the\n\u201ctransformed\u201d inputs \u02dcxs and \u02dcxt. We are only provided the class of the transformations; n and m\ndenote the sample sizes of the source and target inputs, respectively. So our objective function is simply\n\n$$\\min_{\\lambda \\in \\Omega_\\lambda} \\min_{\\beta \\in \\Omega_\\beta} \\left\\| E_{x_t} K(g(x_t, \\beta), \\cdot) - E_{x_s} K(h(x_s, \\lambda), \\cdot) \\right\\|_{\\mathcal{H}} \\qquad (2)$$\n\nwhere \u2126\u03bb and \u2126\u03b2 are the constraint sets of the unknowns \u03bb and \u03b2. Assuming that the parameters are\nbounded is reasonable (discussed in Section 4.3), and their approximations can be easily computed\nusing certain data statistics. The empirical estimate of the above objective would simply be\n\n$$\\min_{\\lambda \\in \\Omega_\\lambda} \\min_{\\beta \\in \\Omega_\\beta} \\left\\| \\frac{1}{m} \\sum_{i=1}^{m} K(g(x_t^i, \\beta), \\cdot) - \\frac{1}{n} \\sum_{i=1}^{n} K(h(x_s^i, \\lambda), \\cdot) \\right\\|_{\\mathcal{H}} \\qquad (3)$$\n\nRemarks: We note a few important observations about (3) to contrast it with (1). The power\nof MMD lies in differentiating feature distributions, and the correction factor is entirely dependent\non the choice of the kernel class \u2013 a richer one does a better job. Instead, our objective in (3) is\nshowing that complex distortions can be corrected before applying the kernel in an intra-domain\nmanner (as we show in Section 4). 
From the perspective of the complexity of distortions, this strategy\nmay correspond to a larger hypothesis space compared to the classical MMD setup. This is clearly\nbeneficial in settings where source and target are related by complex feature distortions.\nIt may be seen from the structure of the objective in (3) that designing an algorithm for any given\nK and G may not be straightforward. We present the estimation procedure for certain widely-used\nclasses of K and G in Section 4.3. For the remainder of the section, where we present our testing\nprocedure and describe technical results, we will assume that we can solve the above objective and\nthe corresponding estimates are denoted by \u02c6\u03bb and \u02c6\u03b2.\n\n3.1 Minimal MMD test statistic\n\nObserve that the objective in (3) is based on the assumption that the transformations h(\u00b7) \u2208 G and\ng(\u00b7) \u2208 G\u2032 (G and G\u2032 may be different if desired) are sufficient in some sense for \u2018correcting\u2019 the\ndiscrepancy between the source and target inputs. Hence, we need to specify a model checking\ntask on the recoverability of these transforms, while also concurrently checking the goodness of the\nestimates of \u03bb and \u03b2. This task will correspond to a hypothesis test where the two hypotheses being\ncompared are as follows.\n\nH0 : There exist \u03bb and \u03b2 such that Pr(g(xt, \u03b2)) = Pr(h(xs, \u03bb)).\nHA : There do not exist \u03bb and \u03b2 such that Pr(g(xt, \u03b2)) = Pr(h(xs, \u03bb)).\n\nSince the statistic for testing H0 here needs to measure the discrepancy of Pr(g(xt, \u03b2)) and\nPr(h(xs, \u03bb)), one can simply use the objective from (3). 
Hence our test statistic is given by\nthe minimal MMD estimate for a given h \u2208 G, g \u2208 G\u2032, xs, xt and computed at the estimates \u02c6\u03bb, \u02c6\u03b2\n\n$$(\\hat{\\lambda}, \\hat{\\beta}) := \\arg\\min_{\\lambda \\in \\Omega_\\lambda} \\min_{\\beta \\in \\Omega_\\beta} M(\\lambda, \\beta), \\qquad M(\\lambda, \\beta) := \\left\\| \\frac{1}{m} \\sum_{i=1}^{m} K(g(x_t^i, \\beta), \\cdot) - \\frac{1}{n} \\sum_{i=1}^{n} K(h(x_s^i, \\lambda), \\cdot) \\right\\|_{\\mathcal{H}} \\qquad (4)$$\n\nWe denote the population estimates of the parameters under the null and alternative hypotheses as\n(\u03bb0, \u03b20) and (\u03bbA, \u03b2A). Recall that the MMD corresponds to a statistic, and it has been used for\ntesting the equality of distributions in earlier works [15]. It is straightforward to see that the true\nminimal MMD M\u2217(\u03bb0, \u03b20) = 0 if and only if H0 is true. Observe that (4) is the empirical (and\nhence biased) \u2018approximation\u2019 of the true minimal MMD statistic M\u2217(\u00b7) from the objective in (2).\nThis will be used while presenting our technical results (in Section 4) on the consistency and the\ncorresponding statistical power guaranteed by this minimal MMD statistic based testing.\n\nRelationship to existing approaches. Hypothesis testing involves transforming the inputs before\ncomparing their distributions in some RKHS (while we solve for the transformation parameters).\nThe approach in [15, 16] applies the kernel to the input data directly and asks whether or not the\ndistributions are the same based on the MMD measure. Our approach derives from the intuition\nthat allowing for the two-step procedure of transforming the inputs first, followed by computing\ntheir distance in some RKHS is flexible, and in some sense is more general compared to directly\nusing MMD (or other distance measures) on the inputs. To see this, consider the simple example\nwhere xs \u223c N (0, 1) and xt = xs + 1. 
A simple application of MMD (from (1)) on the inputs xs\nand xt directly will reject the null hypothesis (where H0 states that the source and target are the\nsame distributions). Our algorithm allows for a transformation on the source/target and will correct\nthis discrepancy and accept H0. Further, our proposed model generalizes the approach taken in [1].\nSpecifically, their approach is a special case of (3) with h(xs) = W^T xs, g(xt) = W^T xt (\u03bb and \u03b2\ncorrespond to W here) with the constraint that W is orthogonal.\nSummary: Overall, our estimation followed by testing procedure will be two-fold. Given xs and\nxt, the kernel K and the function spaces G, G\u2032, we first estimate the unknowns \u03bb and \u03b2 (described\nin Section 4.3). The corresponding statistic M(\u02c6\u03bb, \u02c6\u03b2) at the estimates is then compared to a given\nsignificance threshold \u03b3. Whenever M(\u02c6\u03bb, \u02c6\u03b2) > \u03b3 the null H0 is rejected. This rejection simply\nindicates that G and/or G\u2032 are not sufficient in recovering the mismatch of source to target at\nthe Type I error of \u03b1. Clearly, the richness of these function classes is central to the power of\nthe testing procedure. We will further argue in Section 4 that even allowing h(\u00b7) and g(\u00b7) to be\nlinear transformations greatly enhances the ability to remove the distorted feature distributions and\nreliably test their difference or equivalence. Also the test is non-parametric and handles missing\n(systematic/noisy) features among the two distributions of interest (see appendix for more details).\n\n4 Consistency\n\nBuilding upon the two-fold estimating and testing procedure presented in the previous sections,\nwe provide several guarantees about the estimation consistency and the power of minimal MMD\nbased hypothesis testing, both in the asymptotic and finite sample regimes. 
The technical results\npresented here are applicable for large classes of transformation functions G with fairly weak and\nreasonable assumptions on K. Specifically we consider H\u00f6lder-continuous h(\u00b7) and g(\u00b7) functions on\ncompact sets \u2126\u03bb and \u2126\u03b2. Like [15], we have K to be a bounded non-negative characteristic kernel\ni.e., 0 \u2264 K(x, x\u2032) \u2264 K \u2200x, x\u2032, and we assume \u2202K to be bounded in a neighborhood of 0. We note\nthat technical results for an even more general class of kernels are fairly involved and so in this paper\nwe restrict ourselves to radial basis kernels. Nevertheless, even under the above assumptions our null\nhypothesis space is more general than the one considered in [15] because of the extra transformations\nthat we allow on the inputs. With these assumptions, and the H\u00f6lder-continuity of h(xs,\u00b7) and\ng(xt,\u00b7), we assume\n\n$$\\| K(h(x_s, \\lambda_1), \\cdot) - K(h(x_s, \\lambda_2), \\cdot) \\| \\le L_h \\, d(\\lambda_1, \\lambda_2)^{r_h} \\quad \\forall x_s;\\ \\lambda_1, \\lambda_2 \\in \\Omega_\\lambda \\qquad (A1)$$\n$$\\| K(g(x_t, \\beta_1), \\cdot) - K(g(x_t, \\beta_2), \\cdot) \\| \\le L_g \\, d(\\beta_1, \\beta_2)^{r_g} \\quad \\forall x_t;\\ \\beta_1, \\beta_2 \\in \\Omega_\\beta \\qquad (A2)$$\n\n4.1 Estimation Consistency\n\nObserve that the minimization of (3) assumes that the null is true i.e., the parameter estimates\ncorrespond to H0. Therefore, we discuss consistency in the context of existence of a unique set of\nparameters (\u03bb0, \u03b20) that match the distributions of \u02dcxs and \u02dcxt perfectly. By inspecting the structure of\nthe objective in (2) and (3), we see that the estimates will be asymptotically unbiased. Our first set\nof results summarized here provide consistency of the estimation whenever the assumptions (A1)\nand (A2) hold. This consistency result follows from the convergence of the objective. All the proofs are\nincluded in the appendix.\nTheorem 4.1 (MMD Convergence). 
Under H0, $\\| E_{x_s} K(h(x_s, \\hat{\\lambda}), \\cdot) - E_{x_t} K(g(x_t, \\hat{\\beta}), \\cdot) \\|_{\\mathcal{H}} \\to 0$\nat the rate $\\max\\left( \\frac{\\sqrt{\\log n}}{\\sqrt{n}}, \\frac{\\sqrt{\\log m}}{\\sqrt{m}} \\right)$.\nTheorem 4.2 (Consistency). Under H0, the estimators \u02c6\u03bb and \u02c6\u03b2 are consistent.\nRemarks: Theorem 4.1 shows the convergence rate of MMD distance between the source and the\ntarget after the transformations are applied. Recall that n and m are the sample sizes of the source and\ntarget respectively, and h(xs, \u02c6\u03bb) and g(xt, \u02c6\u03b2) are the recovered transformations.\n\n4.2 Power of the Hypothesis Test\n\nWe now discuss the statistical power of minimal MMD based testing. The next set of results establish\nthat the testing setup from Section 3.1 is asymptotically consistent. Recall that M\u2217(\u00b7) denotes the\n(unknown) expected statistic from (2) while M(\u00b7) is its empirical estimate from (4).\nTheorem 4.3 (Hypothesis Testing). (a) Whenever H0 is true, with probability at least 1 \u2212 \u03b1,\n\n$$0 \\le M(\\hat{\\lambda}, \\hat{\\beta}) \\le \\sqrt{\\frac{2K(m+n) \\log \\alpha^{-1}}{mn}} + \\frac{2\\sqrt{K}}{\\sqrt{n}} + \\frac{2\\sqrt{K}}{\\sqrt{m}} \\qquad (5)$$\n\n(b) Whenever HA is true, with probability at least 1 \u2212 \u03b5,\n\n$$M(\\hat{\\lambda}, \\hat{\\beta}) \\le M^*(\\lambda_A, \\beta_A) + \\sqrt{\\frac{2K(m+n) \\log \\epsilon^{-1}}{mn}} + \\frac{2\\sqrt{K}}{\\sqrt{n}} + \\frac{2\\sqrt{K}}{\\sqrt{m}}$$\n\n$$M(\\hat{\\lambda}, \\hat{\\beta}) \\ge M^*(\\lambda_A, \\beta_A) - \\frac{\\sqrt{K}}{\\sqrt{n}} \\left( 4 + \\sqrt{C^{(h,\\epsilon)} + \\frac{d_\\lambda}{2 r_h} \\log n} \\right) - \\frac{\\sqrt{K}}{\\sqrt{m}} \\left( 4 + \\sqrt{C^{(g,\\epsilon)} + \\frac{d_\\beta}{2 r_g} \\log m} \\right) \\qquad (6)$$\n\nwhere $C^{(h,\\epsilon)} = \\log(2|\\Omega_\\lambda|) + \\log \\epsilon^{-1} + \\frac{d_\\lambda}{r_h} \\log \\frac{L_h}{\\sqrt{K}}$, and $C^{(g,\\epsilon)} = \\log(2|\\Omega_\\beta|) + \\log \\epsilon^{-1} + \\frac{d_\\beta}{r_g} \\log \\frac{L_g}{\\sqrt{K}}$.\n\n
Remarks: We make a few comments about the theorem. Recall that the constant K is the kernel\nbound, and Lh, Lg, rh and rg are defined in (A1) and (A2). d\u03bb and d\u03b2 are the dimensions of \u03bb and \u03b2\nspaces respectively. Observe that whenever H0 is true, (5) shows that M(\u02c6\u03bb, \u02c6\u03b2) approaches 0 as\nthe sample size increases. Similarly, under HA the statistic converges to some positive (unknown)\nvalue M\u2217(\u03bbA, \u03b2A). Following these observations, Theorem 4.3 basically implies that the statistical\npower of our test (described in Section 3.1) increases to 1 as the sample size m, n increases. Up to\nconstants, the upper bounds under both H0 and HA have a rate of $\\max\\left( \\frac{1}{\\sqrt{n}}, \\frac{1}{\\sqrt{m}} \\right)$, while the lower\nbound under HA has the rate $\\max\\left( \\frac{\\sqrt{\\log n}}{\\sqrt{n}}, \\frac{\\sqrt{\\log m}}{\\sqrt{m}} \\right)$. In the appendix we show that (see Lemma 4.5)\nas m, n \u2192 \u221e, the constants |\u2126\u03bb|, |\u2126\u03b2| converge to a small positive number, thus removing the\ndependence of consistency on these constants.\nThe dependence on the sizes of search spaces \u2126\u03bb and \u2126\u03b2 may nevertheless make the bounds for\nHA loose. In practice, one can choose \u2018good\u2019 bound constraints based on some pre-processing on\nthe source and target inputs (e.g., comparison of median and modes). The loss in power due to\noverestimated \u2126\u03bb and \u2126\u03b2 will be compensated by \u2018large enough\u2019 sample sizes. Observe that this\ntrade-off of sample size versus complexity of hypothesis space is fundamental in statistical testing\nand is not specific to our model. We further investigate this trade-off for certain special cases of\ntransformations h(\u00b7) and g(\u00b7) that may be of interest in practice. For instance, consider the scenario\nwhere one of the transformations is identity and the other one is linear in the unknowns. 
Specifically,\n\u02dcxt = xt and h0(xs, \u03bb0) = \u03c6(xs)^T \u03bb0 where \u03c6(\u00b7) is some known transformation. Although restrictive,\nthis scenario is very common in medical data acquisition (refer to Section 1) where the source and\ntarget inputs are assumed to have linear/affine distortions. Within this setting, the assumptions for our\ntechnical results will be satisfied whenever \u03c6(xs) is bounded with high probability and with rh = 1/2.\nWe have the following result for this scenario (Var(\u00b7) denotes empirical variance).\nTheorem 4.4 (Linear transformation). Under H0, identity g(\u00b7) with h = \u03c6(xs)^T \u03bb, we have\n$$\\Omega_\\lambda := \\left\\{ \\lambda ;\\ \\frac{1}{n} \\sum_{i=1}^{n} \\| x_t^i - \\phi(x_s^i)^T \\lambda \\|^2 \\le 3 \\sum_{k=1}^{p} \\mathrm{Var}(x_{t,k}) + \\epsilon \\right\\}.$$\nFor any \u03b5, \u03b1 > 0 and sufficiently\nlarge sample size, a neighborhood of \u03bb0 is contained in \u2126\u03bb with probability at least 1 \u2212 \u03b1.\n\nObserve that subscript k in xt,k above denotes the kth dimensional feature of xt. The above result\nimplies that the search space for \u03bb reduces to a quadratic constraint in the above described example\nscenario. Clearly, this refined search region would enhance the statistical power for the test even when\nthe sample sizes are small (which is almost always the case in population studies). Note that such\nrefined sets may be computed using \u2018extra\u2019 information about the structure of the transformations\nand/or input data statistics, thereby allowing for better estimation and higher power. Lastly, we point\nout that the ideas presented in [16] for a finite sample testing setting translate to our model as well\nbut we do not present explicit details in this work.\n\n4.3 Optimization Lower Bounds\n\nWe see that it is valid to assume that the feasible set is compact and convex for our purposes\n(Theorem 4.4). 
This immediately allows us to use algorithms that exploit feasible set compactness to\nestimate model parameters, for instance, conditional gradient algorithms which have low per-iteration\ncomplexity [23]. Even though these algorithms offer practical benefits, with a non-convex objective,\nit is nontrivial to analyze their theoretical/convergence aspects, and as was noted earlier in Section\n3, except for simplistic G, G\u2032 and K, the minimization in (3) might involve a non-convex objective.\nWe turn to some recent results which have shown that specific classes of non-convex problems or\nNP-Hard problems can be solved to any desired accuracy using a sequence of convex optimization\nproblems [33]. This strategy is currently an active area of research and has already been shown to provide\nimpressive performance in practice [3].\nVery recently, [4] showed that one such class of problems called signomial programming can be solved\nusing successive relative entropy relaxations. Interestingly, we show that for the widely-used class of\nGaussian kernels, our objective can be optimized using these ideas. For notational simplicity, we do\nnot transform the targets i.e., \u02dcxt = xt or g(\u00b7) is identity and only allow for linear transformations h(\u00b7).\nObserve that, with respect to the estimation problem (refer to (3)) this is the same as transforming\nboth source and target inputs. 
When K is Gaussian, the objective in (3) with identity g(\u00b7) and linear\nh(\u00b7) (\u03bb corresponds to slope and intercept here) can be equivalently written as\n\n$$\\min_{\\lambda \\in \\Omega_\\lambda} \\left( \\frac{1}{n^2} \\sum_{i=1}^{n} \\sum_{j=1}^{n} K(h(x_s^i, \\lambda), h(x_s^j, \\lambda)) - \\frac{2}{mn} \\sum_{i=1}^{m} \\sum_{j=1}^{n} K(x_t^i, h(x_s^j, \\lambda)) \\right) := \\min_{\\lambda \\in \\Omega_\\lambda} \\sum_{j} \\frac{1}{n^2} \\exp\\left( -\\left( a_j^T \\lambda \\lambda^T a_j \\right) \\right) - \\sum_{i,j} \\frac{2}{mn} \\exp\\left( -\\left( b_{ij}^T \\lambda \\lambda^T b_{ij} + 2 c \\, b_{ij}^T \\lambda + c^2 \\right) \\right) \\qquad (7)$$\n\nfor appropriate aj, bij and c. Denoting \u03b3 = \u03bb\u03bb^T, the above objective can be made linear in the\ndecision variables \u03b3 and \u03bb thus putting it in the standard form of signomial optimization. The convex\nrelaxation of the quadratic equality constraint is \u03b3 \u2212 \u03bb\u03bb^T \u2ab0 0, hence we seek to solve\n\n$$\\min_{\\gamma, \\lambda} \\sum_{j} \\frac{1}{n^2} \\exp\\left( \\mathrm{tr}(A_j \\gamma) \\right) - \\sum_{i,j} \\frac{2}{mn} \\exp\\left( \\mathrm{tr}(B_{ij} \\gamma) + C_{ij}^T \\lambda + c \\right) \\quad \\text{s.t.} \\quad \\gamma - \\lambda \\lambda^T \\succeq 0 \\qquad (8)$$\n\nClearly the objective is exactly in the form that [4] solves, albeit we also have a convex constraint.\nNevertheless, using their procedure for the unconstrained signomial optimization we can write a\nsequence of convex relaxations for this objective. This sequence is hierarchical, in the sense that, as\nwe go down the sequence, each problem gives tighter bounds to the original nonconvex objective [4].\nFor our applications, since the confidence interval procedure (mentioned earlier) naturally\nsuggests a good initial point, any generic (local) numerical optimization scheme such as\ntrust region, gradient projection, etc. 
can be used to solve (7), whereas the hierarchy of (8) can be used\nin general when one does not have access to a good starting point.\n\n5 Experiments\n\nDesign and Overall Goals. We performed evaluations on both synthetic data and data from\nan AD study. (A) We first evaluate the goodness of our estimation procedure and the power of the\nminimal MMD based test when the source and target inputs are known transformations of samples\nfrom different distribution families (e.g., Normal, Laplace). Here, we seek to clearly identify the\ninfluence of the sample size as well as the effect of the transformations on recoverability. (B) After\nthese checks, we apply our proposed model to match CSF protein levels of 600 subjects.\nThese biomarkers were collected in two different batches; it is known that the measures for the same\nparticipant (across batches) have high variability [20]. In our data, fortunately, a subset of individuals\nhave data from both batches (the \u201creal\u201d measurement must be similar in both batches), whereas a fraction\nof individuals\u2019 CSF is only available in one batch. If we find a linear standardization between the\ntwo batches, it serves as a gold standard against which we compare our algorithm, which does not\nuse information about corresponding samples. Note that the standardization trick is unavailable in\nmulti-center studies; we use this data in this paper simply to simplify the description of our evaluation\ndesign, which, for multi-site data pooling, must be addressed using secondary analyses.\n\nFigure 1: (a,b) Acceptance Ratios, (c,d) Estimation errors, (e,f) Histograms of minimal MMD statistic. Panel titles: (a) Normal target vs. different sources; (b) Models linear in parameters; (c) Estimation errors, normal vs. normal; (d) Estimation error for 2D simulation; (e) Minimal MMD histogram (128 samples); (f) Minimal MMD histogram (1024 samples).\n\n
Synthetic data. Fig. 1 summarizes our results on synthetic data where the source samples are Normal and\nthe targets come from different families. First, observe that our testing procedure efficiently rejects\nH0 whenever the targets are not Normal (blue and black\ncurves in Fig. 1(a)). If the transformation class is beyond\nlinear (e.g., log), the null is efficiently rejected as samples\nincrease (see Fig. 1(b)). Beyond the testing power, Figs.\n1(c,d) show the error in the actual estimates, which decreases\nas the sample size increases (with tighter confidence\nintervals). The appendix includes additional model details.\nTo get a better idea about the minimal MMD statistic, we\nshow its histogram (over multiple bootstrap simulations)\nfor different targets in Fig. 1(e,f). The green line here denotes\nthe bootstrap significance threshold (0.05). In Fig.\n1(e,f), the red curve is always to the left of the threshold,\nas desired. However, the samples are not enough to reject\nthe null for the black and blue curves, and larger\nsample sizes are needed (Fig. 1(f)). 
If needed, the minimal MMD value can be used to obtain a better threshold.\nOverall, these plots show that the minimal MMD based estimation and testing setup robustly removes\nthe feature distortions and facilitates the statistical test.\nAD study. Fig. 2 shows the relative errors after correcting the feature distortions between the two\nbatches in the 12 CSF proteins. The bars correspond to a simple linear \u201cstandardization\u201d transformation,\nwhere we assume we have corresponding sample information (blue), and to our minimal MMD based\ndomain adaptation procedure on sets S1 and S2 (S1: participants available in both batches, S2: all\nparticipants). Our models perform as well as the gold standard (where some subjects have volunteered\nCSF sampling for both batches). Specifically, the trends in Fig. 2 indicate that our minimal MMD\nbased testing procedure is a powerful tool for conducting analyses on such pooled datasets.\nTo further validate these observations, we used the \u2018transformed\u2019\nCSF data from the two batches (our algorithm and\ngold standard) and performed a multiple regression to predict\nLeft and Right Hippocampal Volume (which are known to be\nAD markers). Table 1 shows that the correlations (predicted\nvs. actual) resulting from the minimal MMD corrected data\nare comparable to, or improve on, the alternatives. We\npoint out that the best correlations are achieved when all the\ndata is used with minimal MMD (which the gold standard\ncannot benefit from). Any downstream prediction tasks we\nwish to conduct are independent of the model presented here.\n\nTable 1: Performance of transformed (our vs. 
gold standard) CSF on a regression task.\nModel\nNone\nLinear\nM (S1)\nM (S2)\n\n0.46 \u00b1 0.15\n0.46 \u00b1 0.15\n0.48 \u00b1 0.15\n0.48 \u00b1 0.15\n\nRight\n\n0.37 \u00b1 0.16\n0.37 \u00b1 0.16\n0.39 \u00b1 0.15\n0.40 \u00b1 0.15\n\nLeft\n\nFigure 2: Relative error in transformation estimation between CSF batches. Panel title: Relative error on comparison to baseline; x-axis: proteins p1 to p12; y-axis: Relative Error; legend: Linear Model, Minimal MMD (S1), Minimal MMD (S2).\n\n6 Conclusions\n\nWe presented a framework for kernelized statistical testing on data from multiple sources when the\nobserved measurements/features have been systematically distorted/transformed. While there is a rich\nbody of work on kernel test statistics based on the maximum mean discrepancy and other measures,\nthe flexibility to account for a given class of transformations offers improvements in statistical power.\nWe analyze the statistical properties of the estimation and demonstrate how such a formulation may\nenable pooling datasets from multiple participating sites, and facilitate the conduct of neuroscience\nstudies with substantially larger sample sizes, which may otherwise be infeasible.\n\nAcknowledgments: This work is supported by NIH AG040396, NIH U54AI117924, NSF\nDMS1308847, NSF CAREER 1252725, NSF CCF 1320755 and UW CPCP AI117924. The authors\nare grateful for partial support from UW ADRC AG033514 and UW ICTR 1UL1RR025011. We\nthank Marilyn S. Albert (Johns Hopkins) and Anne Fagan (Washington University at St. Louis) for\ndiscussions at a preclinical Alzheimer\u2019s disease meeting in 2015 (supported by Stay Sharp fund).\n\nReferences\n[1] M Baktashmotlagh, M Harandi, B Lovell, and M Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE ICCV, 2013.\n[2] S Ben-David, J Blitzer, K Crammer, et al. A theory of learning from different domains. Machine learning, 2010.\n[3] B Chalise, Y Zhang, and M Amin. Successive convex approximation for system performance optimization in a multiuser network with multiple mimo relays. In IEEE CAMSAP, 2011.\n[4] V Chandrasekaran and P Shah. Relative entropy relaxations for signomial optimization. arXiv:1409.7640, 2014.\n[5] C Cortes and M Mohri. Domain adaptation in regression. In Algorithmic Learning Theory, 2011.\n[6] T Dall, P Gallo, R Chakrabarti, T West, A Semilla, and M Storm. An aging population and growing disease burden will require a large and specialized health care workforce by 2025. Health Affairs, 2013.\n[7] H Daum\u00e9 III, A Kumar, and A Saha. Frustratingly easy semi-supervised domain adaptation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, 2010.\n[8] P Doll\u00e1r, C Wojek, B Schiele, and P Perona. Pedestrian detection: A benchmark. In CVPR, 2009.\n[9] B Fernando, A Habrard, M Sebban, and T Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE ICCV, 2013.\n[10] Y Ganin and V Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv:1409.7495, 2014.\n[11] B Gong, K Grauman, and F Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, 2013.\n[12] B Gong, Y Shi, F Sha, and K Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.\n
[13] Boqing Gong. Kernel Methods for Unsupervised Domain Adaptation. PhD thesis, Citeseer, 2015.\n[14] R Gopalan, R Li, and R Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Proceedings of the IEEE ICCV, 2011.\n[15] A Gretton, K Borgwardt, M Rasch, B Sch\u00f6lkopf, and A Smola. A kernel two-sample test. JMLR, 2012.\n[16] A Gretton, K Fukumizu, Z Harchaoui, and B Sriperumbudur. A fast, consistent kernel two-sample test. In NIPS, 2009.\n[17] Glioma Meta-analysis Trialists GMT Group. Chemotherapy in adult high-grade glioma: a systematic review and meta-analysis of individual patient data from 12 randomised trials. The Lancet, 2002.\n[18] M Haase, R Bellomo, P Devarajan, P Schlattmann, et al. Accuracy of neutrophil gelatinase-associated lipocalin (ngal) in diagnosis and prognosis in acute kidney injury: a systematic review and meta-analysis. American Journal of Kidney Diseases, 2009.\n[19] J Huang, A Gretton, K Borgwardt, B Sch\u00f6lkopf, and A Smola. Correcting sample selection bias by unlabeled data. In NIPS, 2006.\n[20] W Klunk, R Koeppe, J Price, T Benzinger, M Devous, et al. The centiloid project: standardizing quantitative amyloid plaque estimation by pet. Alzheimer\u2019s & Dementia, 2015.\n[21] W Klunk, R Koeppe, J Price, T Benzinger, et al. The centiloid project: standardizing quantitative amyloid plaque estimation by pet. Alzheimer\u2019s & Dementia, 2015.\n[22] A Kumar, A Saha, and H Daume. Co-regularization based semi-supervised domain adaptation. In NIPS, 2010.\n[23] S Lacoste-Julien, M Jaggi, M Schmidt, and P Pletscher. Block-coordinate frank-wolfe optimization for structural svms. arXiv:1207.4747, 2012.\n
[24] Qi Li. Literature survey: domain adaptation algorithms for natural language processing. Department of CS The Graduate Center, The City University of New York, 2012.\n[25] X Nguyen, M Wainwright, and M Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 2010.\n[26] S Pan, I Tsang, J Kwok, and Q Yang. Domain adaptation via transfer component analysis. Neural Networks, IEEE Transactions on, 2011.\n[27] V Patel, R Gopalan, R Li, and R Chellappa. Visual domain adaptation: A survey of recent advances. Signal Processing Magazine, IEEE, 2015.\n[28] B Plassman, K Langa, G Fisher, S Heeringa, et al. Prevalence of dementia in the united states: the aging, demographics, and memory study. Neuroepidemiology, 2007.\n
[29] K Saenko, B Kulis, M Fritz, and T Darrell. Adapting visual category models to new domains. In ECCV, 2010.\n[30] D Sejdinovic, B Sriperumbudur, A Gretton, K Fukumizu, et al. Equivalence of distance-based and rkhs-based statistics in hypothesis testing. The Annals of Statistics, 2013.\n[31] B Sriperumbudur, K Fukumizu, A Gretton, G Lanckriet, and B Sch\u00f6lkopf. Kernel choice and classifiability for rkhs embeddings of probability distributions. In NIPS, 2009.\n[32] A Torralba and A Efros. Unbiased look at dataset bias. In CVPR, 2011.\n[33] L Tun\u00e7el. Polyhedral and semidefinite programming methods in combinatorial optimization. AMS, 2010.\n[34] H Vanderstichele, M Bibl, S Engelborghs, N Le Bastard, et al. Standardization of preanalytical aspects of cerebrospinal fluid biomarker testing for alzheimer\u2019s disease diagnosis: a consensus paper from the alzheimer\u2019s biomarkers standardization initiative. Alzheimer\u2019s & Dementia, 2012.\n", "award": [], "sourceid": 1302, "authors": [{"given_name": "Hao", "family_name": "Zhou", "institution": "University of Wisconsin Madiso"}, {"given_name": "Vamsi", "family_name": "Ithapu", "institution": "University of Wisconsin Madison"}, {"given_name": "Sathya Narayanan", "family_name": "Ravi", "institution": "University of Wisconsin Madiso"}, {"given_name": "Vikas", "family_name": "Singh", "institution": "UW Madison"}, {"given_name": "Grace", "family_name": "Wahba", "institution": "University of Wisconsin Madison"}, {"given_name": "Sterling", "family_name": "Johnson", "institution": "University of Wisconsin Madison"}]}