{"title": "Active Comparison of Prediction Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1754, "page_last": 1762, "abstract": "We address the problem of comparing the risks of two given predictive models - for instance, a baseline model and a challenger - as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values.", "full_text": "Active Comparison of Prediction Models\n\nChristoph Sawade, Niels Landwehr, and Tobias Scheffer\n\nAugust-Bebel-Strasse 89, 14482 Potsdam, Germany\n\n{sawade, landwehr, scheffer}@cs.uni-potsdam.de\n\nUniversity of Potsdam\n\nDepartment of Computer Science\n\nAbstract\n\nWe address the problem of comparing the risks of two given predictive\nmodels\u2014for instance, a baseline model and a challenger\u2014as con\ufb01dently as pos-\nsible on a \ufb01xed labeling budget. This problem occurs whenever models cannot\nbe compared on held-out training data, possibly because the training data are un-\navailable or do not re\ufb02ect the desired test distribution. In this case, new test in-\nstances have to be drawn and labeled at a cost. 
We devise an active comparison\nmethod that selects instances according to an instrumental sampling distribution.\nWe derive the sampling distribution that maximizes the power of a statistical test\napplied to the observed empirical risks, and thereby minimizes the likelihood of\nchoosing the inferior model. Empirically, we investigate model selection prob-\nlems on several classi\ufb01cation and regression tasks and study the accuracy of the\nresulting p-values.\n\n1\n\nIntroduction\n\nWe address situations in which an informed choice between candidate predictive models\u2014for in-\nstance, a baseline method and a challenger\u2014has to be made. In practice, it is not always possible to\ncompare the models\u2019 risks on held-out training data. For example, in computer vision it is common\nto acquire pre-trained object or face recognizers from third parties. Such recognizers do not typi-\ncally come with the image databases that have been used to train them. The suppliers of the models\ncould provide risk estimates based on held-out training data; however, such estimates might be bi-\nased because the training data would not necessarily re\ufb02ect the distribution of images the deployed\nmodels will be exposed to. Another example are domains where the input distribution changes\nover a period of time in which a baseline model, e.g., a spam \ufb01lter, has been employed. By the\ntime a new predictive model is considered, a previous risk estimate of the baseline model may no\nlonger be accurate.\nIn these example scenarios, new test data have to be drawn and labeled. 
The standard approach to comparing models would be to draw n test instances according to the test distribution which the model is exposed to in practice, label these data, and calculate the difference of the empirical risks $\\hat{\\Delta}_n$ and the sample variance $S_n^2$. Then, under the null hypothesis of identical risks, $\\sqrt{n}\\,\\hat{\\Delta}_n / S_n$ is asymptotically governed by a standard normal distribution, and we can compute a p-value which quantifies the likelihood that an observed empirical difference is due to chance, indicating how confidently the decision to prefer the apparently better model can be made.\nIn many application scenarios, unlabeled test instances are readily available whereas the process of labeling data is costly. We study an active model comparison process that, in analogy to active learning, selects instances from a pool of unlabeled test data and queries their labels. Instances are selected according to an instrumental sampling distribution q. The empirical difference of the models\u2019 risks is weighted appropriately to compensate for the discrepancy between instrumental and test distributions, which leads to consistent (that is, asymptotically unbiased) risk estimates.\nThe principal theoretical contribution of this paper is the derivation of a sampling distribution q that allows us to make the decision to prefer the superior model as confidently as possible given a fixed labeling budget n, if one of the models is in fact superior. Equivalently, one may use q to minimize the labeling costs n required to reach a correct decision at a prescribed level of confidence.\nThe active comparison problem that we study can be seen as an extreme case of active learning, in which the model space contains only two (or, more generally, a small number of) models. 
For the special case of classification with zero-one loss and two models under study, a simplified version of the sampling distribution we derive coincides with the sampling distribution used in the A\u00b2 and IWAL active learning algorithms proposed by Balcan et al. [1] and Beygelzimer et al. [2]. For A\u00b2 and IWAL, the derivation of this distribution is based on finite-sample complexity bounds, while in our approach, it is based on maximizing the power of a statistical test comparing the models under study. The latter approach has the advantage that it directly generalizes to regression problems. A further difference to active learning is that our goal is not only to choose the best model, but also to obtain a well-calibrated p-value indicating the confidence with which this decision can be made.\nOur method is also related to recent work on active data acquisition strategies for the evaluation of a single predictive model, in terms of standard risks [8] or generalized risks that subsume precision, recall, and f-measure [9]. The problem addressed in this paper is different in that we seek to assess the relative performance of two models, without necessarily determining absolute risks precisely. Madani et al. have studied active model selection, where the goal is also to identify a model with lowest risk [5]. However, in their setting costs are associated with obtaining predictions $\\hat{y} = f(x)$, while in our setting costs are associated with obtaining labels $y \\sim p(y|x)$. Hoeffding races [6] and sequential sampling algorithms [10] perform efficient model selection by keeping track of risk bounds for candidate models and removing models that are clearly outperformed from consideration. The goal of these methods is to reduce computational complexity, not labeling effort.\nThe rest of this paper is organized as follows. The problem setting is laid out in Section 2. 
Section 3 derives the instrumental distribution and details our theoretical findings. Section 4 explores active model comparison experimentally. Section 5 concludes.\n\n2 Problem Setting\n\nLet X denote the feature space and Y the label space; an unknown test distribution p(x, y) is defined over X \u00d7 Y. Let p(y|x; \u03b81) and p(y|x; \u03b82) be given \u03b8-parameterized models of p(y|x) and let $f_j : X \\to Y$ with $f_j(x) = \\arg\\max_y p(y|x; \\theta_j)$ be the corresponding predictive functions. The risks of f1, f2 are given by\n\n$$R[f_j] = \\iint \\ell(f_j(x), y)\\, p(x, y)\\, dy\\, dx \\qquad (1)$$\n\nfor a loss function $\\ell : Y \\times Y \\to \\mathbb{R}$. In a classification setting, the integral over Y reduces to a sum.\nThe standard approach to comparing models is to compare empirical risk estimates\n\n$$\\hat{R}_n[f_j] = \\frac{1}{n} \\sum_{i=1}^{n} \\ell(f_j(x_i), y_i), \\qquad (2)$$\n\nwhere n test instances $(x_i, y_i)$ are drawn from p(x, y) = p(x)p(y|x). We assume that unlabeled data are readily available, but acquiring labels y for selected instances x according to p(y|x) is a costly process that may involve a query to a human labeler.\nTest instances need not necessarily be drawn according to the input distribution p(x). We will focus on a data labeling process that draws test instances according to an instrumental distribution q(x) rather than p(x). Intuitively, q(x) should be designed so as to prefer instances that highlight differences between the models f1 and f2. Let q(x) denote an instrumental distribution with the property that p(x) > 0 implies q(x) > 0 for all $x \\in X$. A consistent risk estimate is then given by\n\n$$\\hat{R}_{n,q}[f_j] = \\frac{1}{W} \\sum_{i=1}^{n} \\frac{p(x_i)}{q(x_i)} \\ell(f_j(x_i), y_i), \\qquad (3)$$\n\nwhere $(x_i, y_i) \\sim q(x)p(y|x)$ and $W = \\sum_{i=1}^{n} \\frac{p(x_i)}{q(x_i)}$. 
Weighting factors $\\frac{p(x_i)}{q(x_i)}$ compensate for the discrepancy between test and instrumental distribution, and the normalizer is the sum of weights. Because of the weighting factors, Equation 3 defines a consistent risk estimate (see [4], Chapter 2). Consistency means that the expected value of $\\hat{R}_{n,q}[f_j]$ converges to the true risk $R[f_j]$ for $n \\to \\infty$.\nGiven estimates $\\hat{R}_{n,q}[f_1]$ and $\\hat{R}_{n,q}[f_2]$, the difference $\\hat{\\Delta}_{n,q} = \\hat{R}_{n,q}[f_1] - \\hat{R}_{n,q}[f_2]$ provides evidence on which model is preferable; a positive $\\hat{\\Delta}_{n,q}$ argues in favor of f2. In preferring one model over the other, one rejects the null hypothesis that the observed difference $\\hat{\\Delta}_{n,q}$ is only a random effect, and R[f1] = R[f2] holds. The null hypothesis implies that the mean of $\\hat{\\Delta}_{n,q}$ is asymptotically zero. Because $\\hat{\\Delta}_{n,q}$ is asymptotically normally distributed (see, e.g., [3]), it further implies that the statistic\n\n$$\\sqrt{n}\\,\\frac{\\hat{\\Delta}_{n,q}}{\\sigma_{n,q}} \\sim \\mathcal{N}(0, 1)$$\n\nis asymptotically standard-normally distributed, where $\\frac{1}{n}\\sigma_{n,q}^2 = \\mathrm{Var}[\\hat{\\Delta}_{n,q}]$ denotes the variance of $\\hat{\\Delta}_{n,q}$. In practice, $\\sigma_{n,q}^2$ is unknown. A consistent estimator of $\\sigma_{n,q}^2$ is given by\n\n$$S_{n,q}^2 = \\frac{1}{W} \\sum_{i=1}^{n} \\frac{p(x_i)^2}{q(x_i)^2} \\left( \\ell(f_1(x_i), y_i) - \\ell(f_2(x_i), y_i) - \\hat{\\Delta}_{n,q} \\right)^2, \\qquad (4)$$\n\nas shown, for example, by Geweke [3]. Substituting the empirical for the true standard deviation yields an observable statistic $\\sqrt{n}\\,\\hat{\\Delta}_{n,q}/S_{n,q}$. Because $S_{n,q}^2$ consistently estimates $\\sigma_{n,q}^2$, the null hypothesis also implies that the observable statistic is asymptotically standard normally distributed,\n\n$$\\sqrt{n}\\,\\frac{\\hat{\\Delta}_{n,q}}{S_{n,q}} \\sim \\mathcal{N}(0, 1).$$\n\nLet $\\Phi$ denote the cumulative distribution function of the standard normal distribution. Then,\n\n$$2 \\left( 1 - \\Phi\\left( \\sqrt{n}\\,\\frac{|\\hat{\\Delta}_{n,q}|}{S_{n,q}} \\right) \\right) \\qquad (5)$$\n\nis called the p-value of a two-sided paired Wald test (see, e.g., [12], Chapter 10). The p-value quantifies the likelihood of observing the given absolute value of the test statistic, or a higher value, by chance under the null hypothesis. Student\u2019s t-distribution can serve as a more popular approximation of the distribution of a test statistic under the null hypothesis, resulting in the common t-test. Note, however, that $S_{n,q}$ would have to be a sum of squared, normally distributed random variables for the test statistic to be asymptotically governed by the t-distribution. This assumption is reasonable for regression, but not for classification, and only for the case of p = q.\nIf the null hypothesis does not hold and the two models incur different risks, the distribution of the test statistic depends on the chosen sampling distribution q(x). Our goal is to find a distribution q(x) that allows us to tell the risks of f1 and f2 apart with high confidence. More formally, the power of a test when sampling from q(x) is the likelihood that the null hypothesis can be rejected, that is, the likelihood that the p-value falls below a pre-specified confidence threshold $\\alpha$. Our goal is to find the sampling distribution q that maximizes test power:\n\n$$q^* = \\arg\\max_q \\; p\\left( 2 \\left( 1 - \\Phi\\left( \\sqrt{n}\\,\\frac{|\\hat{\\Delta}_{n,q}|}{S_{n,q}} \\right) \\right) \\leq \\alpha \\right). \\qquad (6)$$\n\n3 Active Model Comparison\n\nWe now turn towards deriving an optimal sampling distribution $q^*$ according to Equation 6. Section 3.1 analytically derives an asymptotically optimal sampling distribution. 
Section 3.2 discusses the sampling distribution in a pool-based setting and presents the active comparison algorithm.\n\n3.1 Asymptotically Optimal Sampling\n\nLet $\\Delta = R[f_1] - R[f_2]$ denote the true risk difference, and assume $\\Delta \\neq 0$. Given a confidence threshold $\\alpha$, the test power equals the probability that the absolute value of the test statistic exceeds the corresponding critical value $z_\\alpha = \\Phi^{-1}\\left(1 - \\frac{\\alpha}{2}\\right)$:\n\n$$p\\left( 2 - 2\\Phi\\left( \\sqrt{n}\\,\\frac{|\\hat{\\Delta}_{n,q}|}{S_{n,q}} \\right) \\leq \\alpha \\right) = p\\left( \\sqrt{n}\\,\\frac{|\\hat{\\Delta}_{n,q}|}{S_{n,q}} \\geq z_\\alpha \\right). \\qquad (7)$$\n\nAsymptotically, it holds that\n\n$$\\frac{\\sqrt{n}\\,(\\hat{\\Delta}_{n,q} - \\Delta)}{\\sigma_{n,q}} \\sim \\mathcal{N}(0, 1).$$\n\nSince $S_{n,q}$ consistently estimates $\\sigma_{n,q}$, it follows that for large n the statistic $\\sqrt{n}\\,\\hat{\\Delta}_{n,q}/S_{n,q}$ is normally distributed with mean $\\sqrt{n}\\,\\Delta/\\sigma_{n,q}$ and unit variance,\n\n$$\\sqrt{n}\\,\\frac{\\hat{\\Delta}_{n,q}}{S_{n,q}} \\sim \\mathcal{N}\\left( \\frac{\\sqrt{n}\\,\\Delta}{\\sigma_{n,q}}, 1 \\right). \\qquad (8)$$\n\nEquation 8 implies that the absolute value $\\sqrt{n}\\,|\\hat{\\Delta}_{n,q}|/S_{n,q}$ of the test statistic follows a folded normal distribution with location parameter $\\sqrt{n}\\,\\Delta/\\sigma_{n,q}$ and scale parameter one. According to Equation 7, test power can thus be approximated in terms of the cumulative distribution of this folded normal distribution,\n\n$$p\\left( 2 - 2\\Phi\\left( \\sqrt{n}\\,\\frac{|\\hat{\\Delta}_{n,q}|}{S_{n,q}} \\right) \\leq \\alpha \\right) \\approx 1 - \\int_0^{z_\\alpha} f\\left( T; \\frac{\\sqrt{n}\\,\\Delta}{\\sigma_{n,q}}, 1 \\right) dT, \\qquad (9)$$\n\nwhere\n\n$$f(T; \\mu, 1) = \\frac{1}{\\sqrt{2\\pi}} \\exp\\left( -\\frac{1}{2}(T - \\mu)^2 \\right) + \\frac{1}{\\sqrt{2\\pi}} \\exp\\left( -\\frac{1}{2}(T + \\mu)^2 \\right)$$\n\ndenotes the density of a folded normal distribution with location parameter $\\mu$ and scale parameter one. We define the shorthand\n\n$$\\beta_{n,q} = 1 - \\int_0^{z_\\alpha} f\\left( T; \\frac{\\sqrt{n}\\,\\Delta}{\\sigma_{n,q}}, 1 \\right) dT$$\n\nfor the approximation of test power given by Equation 9. In the following, we derive a sampling distribution maximizing $\\beta_{n,q}$, thereby approximately solving the optimization problem of Equation 6.\n\nTheorem 1 (Optimal Sampling Distribution). Let $\\Delta = R[f_1] - R[f_2]$ with $\\Delta \\neq 0$. The distribution\n\n$$q^*(x) \\propto p(x) \\sqrt{ \\int \\left( \\ell(f_1(x), y) - \\ell(f_2(x), y) - \\Delta \\right)^2 p(y|x)\\, dy }$$\n\nasymptotically maximizes $\\beta_{n,q}$; that is, for any other sampling distribution $q \\neq q^*$ it holds that $\\beta_{n,q} < \\beta_{n,q^*}$ for sufficiently large n.\n\nBefore we prove Theorem 1, we show that a sampling distribution asymptotically maximizes $\\beta_{n,q}$ if and only if it minimizes the asymptotic variance of the estimator $\\hat{\\Delta}_{n,q}$.\n\nLemma 2 (Variance Optimality). Let q, q' denote two sampling distributions. 
Then it holds that $\\beta_{n,q} > \\beta_{n,q'}$ for sufficiently large n if and only if\n\n$$\\lim_{n \\to \\infty} n\\,\\mathrm{Var}\\left[ \\hat{\\Delta}_{n,q} \\right] < \\lim_{n \\to \\infty} n\\,\\mathrm{Var}\\left[ \\hat{\\Delta}_{n,q'} \\right]. \\qquad (10)$$\n\nA proof is included in the online appendix. Lemma 2 shows that in order to solve the optimization problem given by Equation 6, we need to find the sampling distribution minimizing the asymptotic variance of the estimator $\\hat{\\Delta}_{n,q}$. This asymptotic variance is characterized by the following lemma.\n\nLemma 3 (Asymptotic Variance). The asymptotic variance $\\sigma_q^2 = \\lim_{n \\to \\infty} n\\,\\mathrm{Var}[\\hat{\\Delta}_{n,q}]$ of $\\hat{\\Delta}_{n,q}$ is given by\n\n$$\\sigma_q^2 = \\iint \\frac{p(x)^2}{q(x)^2} \\left( \\ell(f_1(x), y) - \\ell(f_2(x), y) - \\Delta \\right)^2 p(y|x)\\, q(x)\\, dy\\, dx.$$\n\nA proof of Lemma 3 is included in the online appendix.\n\nProof of Theorem 1. We can now prove Theorem 1 by deriving the distribution $q^*$ that minimizes the asymptotic variance $\\sigma_q^2$ as given by Lemma 3. We minimize the functional $\\sigma_q^2$ in terms of q under the constraint $\\int q(x)\\, dx = 1$ using a Lagrange multiplier $\\beta$:\n\n$$L[q, \\beta] = \\sigma_q^2 + \\beta \\left( \\int q(x)\\, dx - 1 \\right) = \\int \\frac{c(x)}{q(x)} + \\beta\\, (q(x) - p(x))\\, dx,$$\n\nwhere $c(x) = p(x)^2 \\int \\left( \\ell(f_1(x), y) - \\ell(f_2(x), y) - \\Delta \\right)^2 p(y|x)\\, dy$. The optimal point for the constrained problem satisfies the Euler-Lagrange equation\n\n$$\\frac{\\partial}{\\partial q(x)} \\left( \\frac{c(x)}{q(x)} + \\beta\\, (q(x) - p(x)) \\right) = -\\frac{c(x)}{q(x)^2} + \\beta = 0. \\qquad (11)$$\n\nA solution for Equation 11 with respect to the normalization constraint is given by\n\n$$q^*(x) = \\frac{\\sqrt{c(x)}}{\\int \\sqrt{c(x)}\\, dx}. \\qquad (12)$$\n\nResubstitution of c(x) into Equation 12 implies the theorem.\n\n3.2 Empirical Sampling Distribution\n\nThe distribution $q^*$ also depends on the true conditional p(y|x) and the true difference in risks $\\Delta$. In order to implement the method, we have to approximate these quantities. Note that as long as p(x) > 0 implies q(x) > 0, any choice of q will yield consistent risk estimates because weighting factors account for the discrepancy between sampling and test distribution (Equation 3). That is, $\\hat{\\Delta}_{n,q}$ is guaranteed to converge to $\\Delta$ as n grows large; any approximation employed to compute $q^*$ will only affect the number of test examples required to reach a certain level of estimation accuracy. To approximate the true conditional p(y|x), we use the given predictive models p(y|x; \u03b81) and p(y|x; \u03b82), and assume a mixture distribution giving equal weight to both models:\n\n$$p(y|x) \\approx \\frac{1}{2} p(y|x; \\theta_1) + \\frac{1}{2} p(y|x; \\theta_2). \\qquad (13)$$\n\nThe risk difference $\\Delta$ is replaced by a difference $\\Delta_\\theta$ of introspective risks calculated from Equation 1, where the integral over X is replaced by a sum over the pool, $p(x) = \\frac{1}{m}$, and p(y|x) is approximated by Equation 13.\nWe will now derive the empirical sampling distribution for two standard loss functions.\n\nDerivation 4 (Sampling for Zero-one Loss). Let $\\ell$ be the zero-one loss for a binary prediction problem with label space Y = {0, 1}. 
When p(y|x) is approximated as in Equation 13, the sampling distribution asymptotically maximizing $\\beta_{n,q}$ in a pool-based setting resolves to\n\n$$q^*(x) \\propto \\begin{cases} |\\Delta_\\theta| & : f_1(x) = f_2(x) \\\\ \\sqrt{1 - 2\\Delta_\\theta (1 - 2p(y = 1|x; \\theta)) + \\Delta_\\theta^2} & : f_1(x) > f_2(x) \\\\ \\sqrt{1 + 2\\Delta_\\theta (1 - 2p(y = 1|x; \\theta)) + \\Delta_\\theta^2} & : f_1(x) < f_2(x) \\end{cases}$$\n\nfor all $x \\in D$.\nA proof is included in the online appendix. Instead of using Approximation 13, an uninformative approximation $p(y = 1|x) \\approx 0.5$ may be used. In this case $q^*$ degenerates to uniform sampling from the subset of the pool where $f_1(x) \\neq f_2(x)$. We denote this baseline as active\u2260. This baseline coincides with the A\u00b2 as well as the IWAL active learning algorithms, applied to the model space {f1, f2}, as can be seen from inspection of Algorithm 1 in [1] and Algorithms 1 and 2 in [2].\nWe now derive the optimal sampling distribution for regression problems with a squared loss function, assuming that the predictive distributions p(y|x; \u03b81) and p(y|x; \u03b82) are Gaussian:\n\nAlgorithm 1 Active Model Comparison\ninput Models f1, f2 with distributions p(y|x; \u03b81), p(y|x; \u03b82); pool D; labeling budget n.\n1: Compute sampling distribution $q^*$ (Derivation 4 or 5).\n2: for i = 1, . . . , n do\n3:   Draw $x_i \\sim q^*(x)$ from D with replacement.\n4:   Query label $y_i \\sim p(y|x_i)$ from oracle.\n5: end for\n6: Compute $\\hat{R}_{n,q}[f_1]$ and $\\hat{R}_{n,q}[f_2]$ (Equation 3).\n7: Determine $f^* \\leftarrow \\arg\\min_{f \\in \\{f_1, f_2\\}} \\hat{R}_{n,q}[f]$, compute p-value for sample (Equation 5).\noutput $f^*$, p-value.\n\nDerivation 5 (Sampling for Squared Loss). Let $\\ell$ be the squared loss, and let p(y|x; \u03b81) and p(y|x; \u03b82) be Gaussian. When p(y|x) is approximated as in Equation 13, then the sampling distribution asymptotically maximizing $\\beta_{n,q}$ in a pool-based setting resolves to\n\n$$q^*(x) \\propto \\sqrt{ 2 (f_1(x) - f_2(x))^2 \\left( f_1^2(x) + f_2^2(x) + \\tau_x^2 \\right) - \\left( f_1^2(x) - f_2^2(x) \\right)^2 } \\qquad (14)$$\n\nfor all $x \\in D$, where $\\tau_x^2$ denotes the sum of the variances of the predictive distributions at $x \\in D$.\nA proof is given in the online appendix. Variances of predictive distributions at instance x would be available from a probabilistic model such as a Gaussian process [7]. If only predictions $f_j(x)$ but no predictive distributions are available, we can assume peaked distributions with $\\tau_x^2 \\to 0$, leading to\n\n$$q^*(x) \\propto (f_1(x) - f_2(x))^2,$$\n\nor we can assume infinitely broad predictive distributions with $\\tau_x^2 \\to \\infty$, leading to\n\n$$q^*(x) \\propto |f_1(x) - f_2(x)|.$$\n\nWe refer to these baselines as active0 and active\u221e.\nAlgorithm 1 summarizes the active model comparison algorithm. It samples n instances with replacement from the pool according to the distribution prescribed by Derivation 4 (for zero-one loss) or Derivation 5 (for squared loss) and queries their labels. Note that instances can be drawn more than once; in the special case that the labeling process is deterministic, the actual labeling costs may thus stay below the sample size. In this case, the loop is continued until the labeling budget is exhausted.\nWe have so far focused on the problem of comparing the risks of two prediction models, such as a baseline and a challenger. We might also consider several alternative models; the objective of an evaluation could be to rank the models according to the risk incurred or to identify the model with lowest risk. 
Standard generalizations of the Wald test that compare multiple alternatives (for instance, within-subject ANOVA [11]) try to reject the null hypothesis that the means of all considered alternatives are equal. Rejection does not imply that all empirically observed differences are significant; for instance, the test could become significant because one of the alternatives performs clearly worst. Choosing a sampling distribution q that maximizes the power of such a test would thus in general not reflect the objectives of the empirical evaluation.\nIn practice, researchers often resort to pairwise hypothesis testing when comparing multiple prediction models. Accordingly, we derive a heuristic sampling distribution for the comparison of multiple models \u03b81, ..., \u03b8k as a mixture of pairwise-optimal sampling distributions,\n\n$$q^*(x) = \\frac{1}{k(k - 1)} \\sum_{i \\neq j} q^*_{i,j}(x), \\qquad (15)$$\n\nwhere $q^*_{i,j}$ denotes the optimal distribution for comparing the models \u03b8i and \u03b8j given by Theorem 1. When comparing multiple models, we replace Equation 13 by a mixture over all models \u03b81, ..., \u03b8k.\n\n4 Empirical Results\n\nWe study the empirical behavior of active comparison (Algorithm 1, labeled active in all diagrams) relative to a risk comparison based on a test sample drawn uniformly from the pool (labeled passive) and the baselines active\u2260, active0, and active\u221e discussed in Section 3.2. We also include the active risk estimator presented in [8] in our study, which infers optimal sampling distributions $q^*_1$ and $q^*_2$ for individually estimating the risks of the models \u03b81 and \u03b82. Test instances are sampled from a mixture distribution $q^*(x) = \\frac{1}{2} q^*_1(x) + \\frac{1}{2} q^*_2(x)$ (labeled ARE). Each comparison method returns the model with lower empirical risk and the p-value of a paired two-sided test. When studying classification, we also include the active learning algorithms A\u00b2 [1] and IWAL [2] as baselines by using them to sample test instances. Their model space is the set of predictive models that are to be compared.\nWe conduct experiments in two classification domains (spam filtering, object recognition) and two regression domains (inverse dynamics, Abalone) ranging from 4,109 to 169,612 instances. Kernelized logistic regression is employed for classification; Gaussian processes are employed for regression. In the spam filtering domain, we compare models that differ in the recency of their training data. In the object recognition domain, we compare SIFT-based recognizers using different interest point detectors (Harris operator, Canny edge detector, F\u00f6rstner operator) and visual vocabularies. For regression, we compare models that differ in the choice of their kernel function (linear versus Mat\u00e9rn, polynomial kernels of different degrees). Models are trained on part of the available data; the rest of the data serve as the pool of unlabeled test instances for which labels can be queried. Results are averaged over 5,000 repetitions of the evaluation process. Further details on the datasets and experimental setup are included in the online appendix.\n\nFigure 1: Model selection accuracy over labeling costs for comparison of two prediction models (top) and multiple prediction models (bottom). Error bars indicate the standard error.\n\n4.1 Identifying the Model With Lower True Risk\n\nWe measure model selection accuracy, defined as the fraction of experiments in which an evaluation method correctly identifies the model with lower true risk. The true risk is taken to be the risk over all test instances in the pool. 
Figure 1 (top) shows that for the comparison of two models active results in significantly higher model selection accuracy than passive, or, equivalently, saves between 70% and 90% of labeling effort. Differences between active and the simplified variants active0, active\u221e, and active\u2260 are marginal. These variants do not require an estimate of p(y|x), thus the method is applicable even if no such estimate is available. A\u00b2 and IWAL coincide with active\u2260 (cf. Section 3.2). Figure 1 (bottom) shows results when comparing multiple models. In the object recognition domain, active saves approximately 70% of labeling effort compared to passive. A\u00b2 and IWAL outperform passive but are less accurate than active. For the regression domains, active saves between 60% and 85% of labeling effort compared to passive.\n\n4.2 Significance Testing: Type I and Type II Errors\n\nWe now study how often a comparison method is able to reject the null hypothesis that two predictive models incur identical risks, and the calibration of the resulting p-values. 
For classification, the method active\u2260 is equivalent to passive applied to $D_{\\neq} = \\{x \\in D \\mid f_1(x) \\neq f_2(x)\\}$ (see Section 3.2). Labeling effort is thus simply reduced by a factor of $|D_{\\neq}|/|D|$. For regression, the analysis is less straightforward as typically $D = D_{\\neq}$. In this section we therefore focus on regression problems.\n\n[Figure 1 plots, x-axis: labeling costs n, y-axis: model selection accuracy. Panels: Spam Filtering (Classification, 2 Models); Abalone (Regression, 2 Models); Inverse Dynamics (Regression, 2 Models); Object Recognition (Classification, 13 Models); Abalone (Regression, 5 Models); Inverse Dynamics (Regression, 5 Models). Curves: active, passive, ARE, active\u2260, A\u00b2, IWAL for classification; active, passive, ARE, active0, active\u221e for regression.]\n\nFigure 2: True-positive significance rate for different test levels \u03b1 (left, left-center). Average p-value over labeling costs n (right-center, right). Error bars indicate the standard error.\n\nFigure 3: False-positive significance rate over test level \u03b1 (left, left-center). False-positive significance rate over labeling costs n (right-center, right). Error bars indicate the standard error.\n\nFigure 2 (left, left-center) shows how often the active and passive comparison methods are able to reject the null hypothesis that the two models incur identical risk. The true risks incurred are never equal in these experiments. We observe that active is able to reject the null hypothesis more often and with a higher confidence. 
In the Abalone domain, active rejects the null hypothesis at \u03b1 = 0.001 more often than passive is able to reject it at \u03b1 = 0.1. Figure 2 (right-center, right) shows that active comparison also results in lower average p-values, in particular for large n.\nWe also conduct experiments under the null hypothesis. Whenever a test instance x is sampled and the predictions y = f1(x) and y' = f2(x) are queried, the predicted labels y and y' are swapped with probability 0.5; this ensures that the true risks of f1 and f2 coincide. Figure 3 (left, left-center) shows that Type I errors are well calibrated for both tests, as the false-positive rate stays below the (ideal) diagonal line when plotted against \u03b1. Figure 3 (right-center, right) shows that both tests are slightly conservative for small n, and approach the expected false-positive rate as n grows larger.\nWe finally study a protocol in which test instances are drawn and labeled until the null hypothesis can be rejected or the labeling budget is exhausted. Results (included in the online appendix) indicate that active incurs the lowest average labeling costs, obtains significance results most often, and has the lowest likelihood of incorrectly choosing the model with higher true risk.\n\n5 Conclusion\n\nWe have derived the sampling distribution that asymptotically maximizes the power of a statistical test that compares the risk of two predictive models. The sampling distribution intuitively gives preference to test instances on which the models disagree strongly.\nEmpirically, we observed that the resulting active comparison method consistently outperforms a traditional comparison based on a uniform sample of test instances. 
Active comparison identifies the model with lower true risk more often, and is able to detect significant differences between the risks of two given models more quickly. In the four experimental domains that we studied, performing active comparison resulted in a saved labeling effort of between 60% and over 90%. We also performed experiments under the null hypothesis that both models incur identical risks, and verified that active comparison does not lead to increased false-positive significance results.\n\nAcknowledgements\n\nWe wish to thank Paul Prasse for his help with the experiments on object recognition data.\n\n[Figure 2 plots. Panels: True Positive Significance, Inverse Dynamics (Regression, n=800) and Abalone (Regression, n=800), frequency over \u03b1-level; Average p-value, Inverse Dynamics (Regression) and Abalone (Regression), average p-value over labeling costs n. Curves: passive, active.]\n\n[Figure 3 plots. Panels: False Positive Significance, Inverse Dynamics (Regression, n=800) and Abalone (Regression, n=800), frequency over \u03b1-level; False Positive Significance, Inverse Dynamics (Regression, \u03b1=0.05) and Abalone (Regression, \u03b1=0.05), frequency over labeling costs n. Curves: passive, active.]\n\nReferences\n[1] M. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006.\n[2] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.\n[3] J. Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317\u20131339, 1989.\n[4] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001.\n[5] O. Madani, D. J. Lizotte, and R. Greiner. Active model selection. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.\n[6] O. Maron and A. W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Proceedings of the 6th Annual Conference on Neural Information Processing Systems, 1993.\n[7] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[8] C. Sawade, N. Landwehr, S. Bickel, and T. Scheffer. Active risk estimation. In Proceedings of the 27th International Conference on Machine Learning, 2010.\n[9] C. Sawade, N. Landwehr, and T. Scheffer. Active estimation of f-measures. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, 2010.\n[10] T. Scheffer and S. Wrobel. Finding the most interesting patterns in a database quickly by using sequential sampling. Journal of Machine Learning Research, 3:833\u2013862, 2003.\n[11] D. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall, 2004.\n[12] L. Wasserman. All of Statistics: a Concise Course in Statistical Inference. Springer, 2004.\n", "award": [], "sourceid": 851, "authors": [{"given_name": "Christoph", "family_name": "Sawade", "institution": null}, {"given_name": "Niels", "family_name": "Landwehr", "institution": null}, {"given_name": "Tobias", "family_name": "Scheffer", "institution": null}]}