{"title": "Correcting sample selection bias in maximum entropy density estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 323, "page_last": 330, "abstract": null, "full_text": "Correcting sample selection bias in maximum\n\nentropy density estimation\n\nMiroslav Dud\u00b4\u0131k, Robert E. Schapire\n\nPrinceton University\n\nDepartment of Computer Science\n35 Olden St, Princeton, NJ 08544\n\nSteven J. Phillips\n\nAT&T Labs \u2212 Research\n\n180 Park Ave, Florham Park, NJ 07932\n\nphillips@research.att.com\n\n{mdudik,schapire}@princeton.edu\n\nAbstract\n\nWe study the problem of maximum entropy density estimation in the\npresence of known sample selection bias. We propose three bias cor-\nrection approaches. The \ufb01rst one takes advantage of unbiased suf\ufb01cient\nstatistics which can be obtained from biased samples. The second one es-\ntimates the biased distribution and then factors the bias out. The third one\napproximates the second by only using samples from the sampling distri-\nbution. We provide guarantees for the \ufb01rst two approaches and evaluate\nthe performance of all three approaches in synthetic experiments and on\nreal data from species habitat modeling, where maxent has been success-\nfully applied and where sample selection bias is a signi\ufb01cant problem.\n\nIntroduction\n\n1\nWe study the problem of estimating a probability distribution, particularly in the context of\nspecies habitat modeling. It is very common in distribution modeling to assume access to\nindependent samples from the distribution being estimated. In practice, this assumption is\nviolated for various reasons. For example, habitat modeling is typically based on known\noccurrence locations derived from collections in natural history museums and herbariums\nas well as biological surveys [1, 2, 3]. Here, the goal is to predict the species\u2019 distribution\nas a function of climatic and other environmental variables. 
To achieve this in a statis-\ntically sound manner using current methods, it is necessary to assume that the sampling\ndistribution and species distributions are not correlated. In fact, however, most sampling is\ndone in locations that are easier to access, such as areas close to towns, roads, airports or\nwaterways [4]. Furthermore, the independence assumption may not hold since roads and\nwaterways are often correlated with topography and vegetation which in\ufb02uence species dis-\ntributions. New unbiased sampling may be expensive, so much can be gained by using the\nextensive existing biased data, especially since it is becoming freely available online [5].\n\nAlthough the available data may have been collected in a biased manner, we usually\nhave some information available about the nature of the bias. For instance, in the case of\nhabitat modeling, some factors in\ufb02uencing the sampling distribution are well known, such\nas distance from roads, towns, etc. In addition, a list of visited sites may be available and\nviewed as a sample of the sampling distribution itself. If such a list is not available, the\nset of sites where any species from a large group has been observed may be a reasonable\napproximation of all visited locations.\n\nIn this paper, we study probability density estimation under sample selection bias. We\n\n\fassume that the sampling distribution (or an approximation) is known during training, but\nwe require that unbiased models not use any knowledge of sample selection bias during\ntesting. This requirement is vital for habitat modeling where models are often applied to\na different region or under different climatic conditions. 
To our knowledge this is the \ufb01rst\nwork addressing sample selection bias in a statistically sound manner and in a setup suitable\nfor species habitat modeling from presence-only data.\n\nWe propose three approaches that incorporate sample selection bias in a common den-\nsity estimation technique based on the principle of maximum entropy (maxent). Max-\nent with \u21131-regularization has been successfully used to model geographic distributions\nof species under the assumption that samples are unbiased [3]. We review \u21131-regularized\nmaxent with unbiased data in Section 2, and give details of the new approaches in Section 3.\nOur three approaches make simple modi\ufb01cations to unbiased maxent and achieve anal-\nogous provable performance guarantees. The \ufb01rst approach uses a bias correction technique\nsimilar to that of Zadrozny et al. [6, 7] to obtain unbiased con\ufb01dence intervals from biased\nsamples as required by our version of maxent. We prove that, as in the unbiased case, this\nproduces models whose log loss approaches that of the best possible Gibbs distribution\n(with increasing sample size).\n\nIn contrast, the second approach we propose \ufb01rst estimates the biased distribution and\nthen factors the bias out. When the target distribution is a Gibbs distribution, the solution\nagain approaches the log loss of the target distribution. When the target distribution is not\nGibbs, we demonstrate that the second approach need not produce the optimal Gibbs dis-\ntribution (with respect to log loss) even in the limit of in\ufb01nitely many samples. However,\nwe prove that it produces models that are almost as good as the best Gibbs distribution ac-\ncording to a certain Bregman divergence that depends on the selection bias. In addition, we\nobserve good empirical performance for moderate sample sizes. 
The third approach is an\napproximation of the second approach which uses samples from the sampling distribution\ninstead of the distribution itself.\n\nOne of the challenges in studying methods for correcting sample selection bias is that\nunbiased data sets, though not required during training, are needed as test sets to evaluate\nperformance. Unbiased data sets are dif\ufb01cult to obtain \u2014 this is the very reason why\nwe study this problem! Thus, it is almost inevitable that synthetic data must be used. In\nSection 4, we describe experiments evaluating performance of the three methods. We use\nboth fully synthetic data, as well as a biological dataset consisting of a biased training set\nand an independently collected reasonably unbiased test set.\nRelated work. Sample selection bias also arises in econometrics where it stems from\nfactors such as attrition, nonresponse and self selection [8, 9, 10]. It has been extensively\nstudied in the context of linear regression after Heckman\u2019s seminal paper [8] in which the\nbias is \ufb01rst estimated and then a transform of the estimate is used as an additional regressor.\nIn the machine learning community, sample selection bias has been recently considered\nfor classi\ufb01cation problems by Zadrozny [6]. Here the goal is to learn a decision rule from\na biased sample. The problem is closely related to cost-sensitive learning [11, 7] and the\nsame techniques such as resampling or differential weighting of samples apply.\n\nHowever, the methods of the previous two approaches do not apply directly to density\nestimation where the setup is \u201cunconditional\u201d, i.e. there is no dependent variable, or, in the\nclassi\ufb01cation terminology, we only have access to positive examples, and the cost function\n(log loss) is unbounded. 
In addition, in the case of modeling species habitats, we face the\nchallenge of sample sizes that are very small (2\u2013100) by machine learning standards.\n\n2 Maxent setup\n\nIn this section, we describe the setup for unbiased maximum entropy density estimation\nand review performance guarantees. We use a relaxed formulation which will yield an\n\u21131-regularization term in our objective function.\n\n\fThe goal is to estimate an unknown target distribution \u03c0 over a known sample space X\nbased on samples x1, . . . , xm \u2208 X . We assume that samples are independently distributed\naccording to \u03c0 and denote the empirical distribution by \u02dc\u03c0(x) = |{1 \u2264 i \u2264 m : xi =\nx}|/m. The structure of the problem is speci\ufb01ed by real valued functions fj : X \u2192 R,\nj = 1, . . . , n, called features and by a distribution q0 representing a default estimate. We\nassume that features capture all the relevant information available for the problem at hand\nand q0 is the distribution we would choose if we were given no samples. The distribution\nq0 is most often assumed uniform.\n\nFor a limited number of samples, we expect that \u02dc\u03c0 will be a poor estimate of \u03c0 under\nany reasonable distance measure. However, empirical averages of features will not be too\ndifferent from their expectations with respect to \u03c0. Let p[f ] denote the expectation of a\nfunction f (x) when x is chosen randomly according to distribution p. We would like to\n\ufb01nd a distribution p which satis\ufb01es\n\n|p[fj] \u2212 \u02dc\u03c0[fj]| \u2264 \u03b2j for all 1 \u2264 j \u2264 n,\n\n(1)\nfor some estimates \u03b2j of deviations of empirical averages from their expectations. Usually\nthere will be in\ufb01nitely many distributions satisfying these constraints. For the case when\nthe default distribution q0 is uniform, the maximum entropy principle tells us to choose\nthe distribution of maximum entropy satisfying these constraints. 
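The premise behind (1), that empirical feature averages concentrate around their expectations even when the empirical distribution itself is far from π, can be checked numerically; a minimal sketch with a hypothetical target distribution and feature:

```python
import random

random.seed(1)

# hypothetical target distribution pi over 6 points and one feature f
pi = [0.05, 0.10, 0.15, 0.20, 0.25, 0.25]
f = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
true_avg = sum(p * v for p, v in zip(pi, f))  # pi[f]

# draw m independent samples from pi and form the empirical average of f
m = 5000
xs = random.choices(range(6), weights=pi, k=m)
emp_avg = sum(f[x] for x in xs) / m

# the empirical average pi~[f] lands close to pi[f], which is exactly
# what the box constraints (1) rely on
print(abs(emp_avg - true_avg) < 0.05)  # prints True
```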
In general, we should minimize the relative entropy from q0. This corresponds to choosing the distribution that satisfies the constraints (1) but imposes as little additional information as possible when compared with q0. Allowing for asymmetric constraints, we obtain the formulation

    min_{p∈Δ} RE(p ‖ q0)  subject to  aj ≤ p[fj] ≤ bj for all 1 ≤ j ≤ n.    (2)

Here, Δ ⊆ R^X is the simplex of probability distributions and RE(p ‖ q) is the relative entropy (or Kullback-Leibler divergence) from q to p, an information theoretic measure of difference between the two distributions. It is non-negative, equal to zero only when the two distributions are identical, and convex in its arguments.

Problem (2) is a convex program. Using Lagrange multipliers, we obtain that the solution takes the form

    qλ(x) = q0(x) e^{λ·f(x)} / Zλ,    (3)

where Zλ = Σx q0(x) e^{λ·f(x)} is the normalization constant. Distributions qλ of the form (3) will be referred to as q0-Gibbs or just Gibbs when no ambiguity arises.

Instead of solving (2) directly, we solve its dual:

    min_{λ∈R^n} [ log Zλ − ½ Σj (bj + aj) λj + ½ Σj (bj − aj) |λj| ].    (4)

We can choose from a range of general convex optimization techniques or use some of the algorithms in [12]. For the symmetric case when

    [aj, bj] = [π̃[fj] − βj, π̃[fj] + βj],    (5)

the dual becomes

    min_{λ∈R^n} [ −π̃[log qλ] + Σj βj |λj| ].    (6)

The first term is the empirical log loss (negative log likelihood), the second term is an ℓ1-regularization. Small values of log loss mean a good fit to the data.
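For intuition, the symmetric dual (6) can be minimized on a toy domain with proximal (soft-thresholding) gradient steps; this is an illustrative sketch under a uniform default, not one of the solvers of [12], and the toy numbers are hypothetical:

```python
import math

def gibbs(q0, feats, lam):
    """The Gibbs form (3): q_lam(x) = q0(x) exp(lam . f(x)) / Z_lam."""
    w = [q0[x] * math.exp(sum(l * f[x] for l, f in zip(lam, feats)))
         for x in range(len(q0))]
    z = sum(w)
    return [v / z for v in w]

def fit(q0, feats, emp, beta, iters=3000, lr=0.5):
    """Minimize the symmetric dual (6) by proximal gradient steps."""
    n = len(q0)
    lam = [0.0] * len(feats)
    for _ in range(iters):
        q = gibbs(q0, feats, lam)
        for j, fj in enumerate(feats):
            # gradient of the smooth part: q_lam[f_j] - pi~[f_j]
            g = sum((q[x] - emp[x]) * fj[x] for x in range(n))
            t = lam[j] - lr * g
            lam[j] = math.copysign(max(abs(t) - lr * beta, 0.0), t)  # l1 prox
    return lam

# toy domain of 4 points, uniform default, one binary feature
emp = [0.1, 0.2, 0.3, 0.4]            # empirical distribution pi~
feats = [[0.0, 0.0, 1.0, 1.0]]        # pi~[f] = 0.7
lam = fit([0.25] * 4, feats, emp, beta=0.05)
q = gibbs([0.25] * 4, feats, lam)
```

In this toy example the constraint stays active, so the fitted qλ[f] lands on the boundary π̃[f] − β = 0.65 of the box (5), as the optimality conditions of (6) require.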
This is balanced by regularization forcing simpler models and hence preventing overfitting.

When all the primal constraints are satisfied by the target distribution π then the solution q̂ of the dual is guaranteed to be not much worse an approximation of π than the best Gibbs distribution q*. More precisely:

Theorem 1 (Performance guarantees, Theorem 1 of [12]). Assume that the distribution π satisfies the primal constraints (2). Let q̂ be the solution of the dual (4). Then for an arbitrary Gibbs distribution q* = qλ*,

    RE(π ‖ q̂) ≤ RE(π ‖ q*) + Σj (bj − aj) |λ*j|.

Input: finite domain X
       default estimate q0
       sampling distribution s
       features f1, . . . , fn where fj : X → [0, 1]
       regularization parameter β > 0
       samples x1, . . . , xm ∈ X
Output: qλ̂ approximating the target distribution
  Let β0 = (β/√m) · min{σ̃[1/s], (max 1/s − min 1/s)/2}
      [c0, d0] = [π̃s[1/s] − β0, π̃s[1/s] + β0] ∩ [min 1/s, max 1/s]
  For j = 1, . . . , n:
      βj = (β/√m) · min{σ̃[fj/s], (max fj/s − min fj/s)/2}
      [cj, dj] = [π̃s[fj/s] − βj, π̃s[fj/s] + βj] ∩ [min fj/s, max fj/s]
      [aj, bj] = [cj/d0, dj/c0] ∩ [min fj, max fj]
  Solve the dual (4)

Algorithm 1: DEBIASAVERAGES.

Table 1: Example 1. Comparison of distributions q* and q** minimizing RE(π ‖ qλ) and RE(πs ‖ qλs).

x   f(x)    π(x)  s(x)  πs(x)  q*(x)  q**s(x)  q**(x)
1   (0, 0)  0.4   0.4   0.64   0.25   0.544    0.34
2   (0, 1)  0.1   0.4   0.16   0.25   0.256    0.16
3   (1, 0)  0.1   0.1   0.04   0.25   0.136    0.34
4   (1, 1)  0.4   0.1   0.16   0.25   0.064    0.16

When features are bounded between 0 and 1, the symmetric box constraints (5) with βj = O(√((log n)/m)) are satisfied with high probability by Hoeffding's inequality and the union bound. Then the relative entropy from q̂ to π will not be worse than the relative entropy from any Gibbs distribution q* to π by more than O(‖λ*‖1 √((log n)/m)).

In practice, we set

    βj = (β/√m) · min{σ̃[fj], σmax[fj]}    (7)

where β is a tuned constant, σ̃[fj] is the sample deviation of fj, and σmax[fj] is an upper bound on the standard deviation, such as (maxx fj(x) − minx fj(x))/2. We refer to this algorithm for unbiased data as UNBIASEDMAXENT.

3 Maxent with sample selection bias

In the biased case, the goal is to estimate the target distribution π, but samples do not come directly from π. For nonnegative functions p1, p2 defined on X, let p1p2 denote the distribution obtained by multiplying weights p1(x) and p2(x) at every point and renormalizing:

    p1p2(x) = p1(x)p2(x) / Σx′ p1(x′)p2(x′).

Samples x1, . . . , xm come from the biased distribution πs where s is the sampling distribution. This setup corresponds to the situation when an event being observed occurs at the point x with probability π(x) while we perform an independent observation with probability s(x).
The probability of observing an event at x given that we observe an event is then equal to πs(x). The empirical distribution of m samples drawn from πs will be denoted by π̃s. We assume that s is known (principal assumption, see introduction) and strictly positive (technical assumption).

Approach I: Debiasing Averages. In our first approach, we use the same algorithm as for the unbiased case but employ a different method to obtain confidence intervals [aj, bj]. Since we do not have direct access to samples from π, we use a version of the Bias Correction Theorem of Zadrozny [6] to convert expectations with respect to πs to expectations with respect to π.

Theorem 2 (Bias Correction Theorem [6], Translation Theorem [7]).

    πs[f/s] / πs[1/s] = π[f].

Hence, it suffices to give confidence intervals for πs[f/s] and πs[1/s] to obtain confidence intervals for π[f].

Corollary 3. Assume that for some sample-derived bounds cj, dj, 0 ≤ j ≤ n, with high probability 0 < c0 ≤ πs[1/s] ≤ d0 and 0 ≤ cj ≤ πs[fj/s] ≤ dj for all 1 ≤ j ≤ n. Then with at least the same probability cj/d0 ≤ π[fj] ≤ dj/c0 for all 1 ≤ j ≤ n.

If s is bounded away from 0 then Chernoff bounds may be used to determine cj, dj. Corollary 3 and Theorem 1 then yield guarantees that this method's performance converges, with increasing sample sizes, to that of the "best" Gibbs distribution.

In practice, confidence intervals [cj, dj] may be determined using expressions analogous to (5) and (7) for random variables fj/s, 1/s and the empirical distribution π̃s. After first restricting the confidence intervals in a natural fashion, this yields Algorithm 1.
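Theorem 2 can be verified numerically on a small domain; a minimal sketch in which π, s and f are hypothetical:

```python
def expect(p, g):
    """E_p[g] over a finite domain given as parallel lists."""
    return sum(pi_x * g_x for pi_x, g_x in zip(p, g))

# hypothetical target pi, sampling distribution s, and feature f on 5 points
pi = [0.10, 0.20, 0.30, 0.25, 0.15]
s = [0.40, 0.10, 0.20, 0.10, 0.20]
f = [0.00, 1.00, 0.50, 0.25, 1.00]

# biased distribution pi.s: multiply pointwise and renormalize
w = [p * q for p, q in zip(pi, s)]
z = sum(w)
pis = [v / z for v in w]

# Theorem 2: pi_s[f/s] / pi_s[1/s] = pi[f]
lhs = expect(pis, [fx / sx for fx, sx in zip(f, s)]) / expect(pis, [1 / sx for sx in s])
rhs = expect(pi, f)
print(abs(lhs - rhs) < 1e-9)  # prints True
```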
Alternatively, we could use bootstrap or other types of estimates for the confidence intervals.

Approach II: Factoring Bias Out. The second algorithm does not approximate π directly, but uses maxent to estimate the distribution πs and then converts this estimate into an approximation of π. If the default estimate of π is q0, then the default estimate of πs is q0s. Applying unbiased maxent to the empirical distribution π̃s with the default q0s, we obtain a q0s-Gibbs distribution q0s e^{λ̂·f} approximating πs. We factor out s to obtain q0 e^{λ̂·f} as an estimate of π. This yields the algorithm FACTORBIASOUT.

This approach corresponds to ℓ1-regularized maximum likelihood estimation of π by q0-Gibbs distributions. When π itself is q0-Gibbs then the distribution πs is q0s-Gibbs. Performance guarantees for unbiased maxent imply that estimates of πs converge to πs as the number of samples increases. Now, if infx s(x) > 0 (which is the case for finite X) then estimates of π obtained by factoring out s converge to π as well.

When π is not q0-Gibbs then πs is not q0s-Gibbs either. We approximate π by a q0-Gibbs distribution q̂ = qλ̂ which, with an increasing number of samples, minimizes RE(πs ‖ qλs) rather than RE(π ‖ qλ). Our next example shows that these two minimizers may be different.

Example 1. Consider the space X = {1, 2, 3, 4} with two features f1, f2. Features f1, f2, target distribution π, sampling distribution s and the biased distribution πs are given in Table 1. We use the uniform distribution as a default estimate. The minimizer of RE(π ‖ qλ) is the unique uniform-Gibbs distribution q* such that q*[f] = π[f].
Similarly, the minimizer q**s of RE(πs ‖ qλs) is the unique s-Gibbs distribution for which q**s[f] = πs[f]. Solving for these exactly, we find that q* and q** are as given in Table 1, and that these two distributions differ.

Even though FACTORBIASOUT does not minimize RE(π ‖ qλ), we can show that it minimizes a different Bregman divergence. More precisely, it minimizes a Bregman divergence between certain projections of the two distributions. Bregman divergences generalize some common distance measures such as relative entropy or the squared Euclidean distance, and enjoy many of the same favorable properties. The Bregman divergence associated with a convex function F is defined as DF(u ‖ v) = F(u) − F(v) − ∇F(v)·(u − v).

Proposition 4. Define F : R^X_+ → R as F(u) = Σx s(x)u(x) log u(x). Then F is a convex function and for all p1, p2 ∈ Δ, RE(p1s ‖ p2s) = DF(p′1 ‖ p′2), where p′1(x) = p1(x)/Σx′ s(x′)p1(x′) and p′2(x) = p2(x)/Σx′ s(x′)p2(x′) are projections of p1, p2 along lines tp, t ∈ R, onto the hyperplane Σx s(x)p(x) = 1.

Approach III: Approximating FACTORBIASOUT. As mentioned in the introduction, knowing the sampling distribution s exactly is unrealistic. However, we often have access to samples from s. In this approach we assume that s is unknown but that, in addition to samples x1, . . . , xm from πs, we are also given a separate set of samples x(1), x(2), . . . , x(N) from s. We use the algorithm FACTORBIASOUT with the sampling distribution s replaced by the corresponding empirical distribution s̃.

To simplify the algorithm, we note that instead of using q0s̃ as a default estimate for πs, it suffices to replace the sample space X by X′ = {x(1), x(2), . . .
, x(N)} and use q0 restricted to X′ as a default. The last step of factoring out s̃ is equivalent to using the λ̂ returned for space X′ on the entire space X.

When the sampling distribution s is correlated with feature values, X′ might not cover all feature ranges. In that case, reprojecting on X may yield poor estimates outside of these ranges. We therefore do “clamping”, restricting values fj(x) to their ranges over X′ and capping values of the exponent λ̂·f(x) at its maximum over X′. The resulting algorithm is called APPROXFACTORBIASOUT.

[Figure 1: three panels showing learning curves for targets π1 (RE(π1 ‖ u) = 4.5), π2 (RE(π2 ‖ u) = 5.0) and π3 (RE(π3 ‖ u) = 3.3); y-axis: relative entropy to the target; x-axis: number of training samples (m), 10 to 1000; curves: unbiased maxent, debias averages, factor bias out, and approximate factor bias out with 1,000 and 10,000 samples.]

Figure 1: Learning curves for synthetic experiments. We use u to denote the uniform distribution. For the sampling distribution s, RE(s ‖ u) = 0.8. Performance is measured in terms of relative entropy to the target distribution as a function of an increasing number of training samples. The number of samples is plotted on a log scale.

4 Experiments

Conducting real data experiments to evaluate bias correction techniques is difficult, because bias is typically unknown and samples from unbiased distributions are not available. Therefore, synthetic experiments are often a necessity for precise evaluation.
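The clamping step of APPROXFACTORBIASOUT described above can be sketched as follows; the function name is hypothetical and a uniform default q0 is assumed:

```python
import math

def clamped_prediction(lam, feats, Xprime):
    """Apply weights fit on X' to all of X: restrict each feature to its
    range over X', cap the exponent lam . f(x) at its maximum over X',
    then exponentiate and renormalize (a sketch of the clamping step)."""
    lo = [min(f[x] for x in Xprime) for f in feats]
    hi = [max(f[x] for x in Xprime) for f in feats]

    def expo(x):
        return sum(l * min(max(feats[j][x], lo[j]), hi[j])
                   for j, l in enumerate(lam))

    cap = max(expo(x) for x in Xprime)
    scores = [math.exp(min(expo(x), cap)) for x in range(len(feats[0]))]
    z = sum(scores)  # renormalize over the full space X
    return [v / z for v in scores]

# a point outside X' with an out-of-range feature value receives the same
# capped score as the most extreme point inside X'
q = clamped_prediction([1.0], [[0.0, 0.5, 1.0, 2.0]], Xprime=[0, 1, 2])
```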
Nevertheless, in addition to synthetic experiments, we were also able to conduct experiments with real-world data for habitat modeling.

Synthetic experiments. In synthetic experiments, we generated three target uniform-Gibbs distributions π1, π2, π3 over a domain X of size 10,000. These distributions were derived from 65 features indexed as fi, 0 ≤ i ≤ 9, and fij, 0 ≤ i ≤ j ≤ 9. Values fi(x) were chosen independently and uniformly in [0, 1], and we set fij(x) = fi(x)fj(x). Fixing these features, we generated weights for each distribution. Weights λi and λii were generated jointly to capture a range of different behaviors for values of fi in the range [0, 1]. Let US denote a random variable uniform over the set S. Each instance of US corresponds to a new independent variable. We set λii = U{−1,0,1}U[1,5] and λi to be λiiU[−3,1] if λii ≠ 0, and U{−1,1}U[2,10] otherwise. Weights λij, i < j, were chosen to create correlations between fi's that would be observable, but not strong enough to dominate λi's and λii's. We set λij = −0.5 or 0 or 0.5 with respective probabilities 0.05, 0.9 and 0.05. In maxent, we used a subset of features specifying target distributions and some irrelevant features. We used features f′i, 0 ≤ i ≤ 9, and their squares f′ii, where f′i(x) = fi(x) for 0 ≤ i ≤ 5 (relevant features) and f′i(x) = U[0,1] for 6 ≤ i ≤ 9 (irrelevant features). Once generated, we used the same set of features in all experiments. We generated a sampling distribution s correlated with target distributions. More specifically, s was a Gibbs distribution generated from features f(s)i, 0 ≤ i ≤ 5, and their squares f(s)ii, where f(s)i(x) = U[0,1] for 0 ≤ i ≤ 1 and f(s)i = fi+2 for 2 ≤ i ≤ 5. We used weights λ(s)i = 0 and λ(s)ii = −1.

For every target distribution, we evaluated the performance of UNBIASEDMAXENT, DEBIASAVERAGES, FACTORBIASOUT and APPROXFACTORBIASOUT with 1,000 and 10,000 samples from the sampling distribution. The performance was evaluated in terms of relative entropy to the target distribution. We used training sets of sizes 10 to 1000. We considered five randomly generated training sets and took the average performance over these five sets for settings of β from the range [0.05, 4.64]. We report results for the best β, chosen separately for each average. The rationale behind this approach is that we want to explore the potential performance of each method.

Table 2: Results of real data experiments. Average performance of unbiased maxent and three bias correction approaches over all species in six regions. The uniform distribution would receive the log loss of 14.2 and AUC of 0.5. Results of bias correction approaches are italicized if they are significantly worse and set in boldface if they are significantly better than those of the unbiased maxent according to a paired t-test at the level of significance 5%.

                       average log loss                          average AUC
                      awt    can    nsw    nz     sa     swi    awt   can   nsw   nz    sa    swi
unbiased maxent      13.78  12.89  13.40  13.77  13.14  12.81   0.69  0.58  0.71  0.72  0.78  0.81
debias averages      13.92  13.10  13.88  14.31  14.10  13.59   0.67  0.64  0.65  0.67  0.68  0.78
factor bias out      13.90  13.13  14.06  14.20  13.66  13.46   0.71  0.69  0.72  0.72  0.78  0.83
apx. factor bias out 13.89  13.40  14.19  14.07  13.62  13.41   0.72  0.72  0.73  0.73  0.78  0.84

Figure 1 shows the results at the optimal β as a function of an increasing number of samples. FACTORBIASOUT is always better than UNBIASEDMAXENT. DEBIASAVERAGES is worse than UNBIASEDMAXENT for small sample sizes, but as the number of training samples increases, it soon outperforms UNBIASEDMAXENT and eventually also outperforms FACTORBIASOUT. APPROXFACTORBIASOUT improves as the number of samples from the sampling distribution increases from 1,000 to 10,000, but both versions of APPROXFACTORBIASOUT perform worse than UNBIASEDMAXENT for the distribution π2.

Real data experiments. In this set of experiments, we evaluated maxent in the task of estimating species habitats. The sample space is a geographic region divided into a grid of cells and samples are known occurrence localities — cells where a given species was observed. Every cell is described by a set of environmental variables, which may be categorical, such as vegetation type, or continuous, such as altitude or annual precipitation. Features are real-valued functions derived from environmental variables. We used binary indicator features for different values of categorical variables and binary threshold features for continuous variables. The latter are equal to one when the value of a variable is greater than a fixed threshold and zero otherwise.

Species sample locations and environmental variables were all produced and used as part of the “Testing alternative methodologies for modeling species' ecological niches and predicting geographic distributions” Working Group at the National Center for Ecological Analysis and Synthesis (NCEAS). The working group compared modeling methods across a variety of species and regions.
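The binary indicator and threshold features described above are straightforward to construct from the environmental variables; a minimal sketch with hypothetical function names:

```python
def indicator_features(values, categories):
    """Binary indicator features for a categorical variable:
    one feature per category, equal to 1 where the variable matches it."""
    return [[1.0 if v == c else 0.0 for v in values] for c in categories]

def threshold_features(values, thresholds):
    """Binary threshold features for a continuous variable: one feature per
    threshold, equal to 1 where the variable exceeds the threshold."""
    return [[1.0 if v > t else 0.0 for v in values] for t in thresholds]

# hypothetical cells described by vegetation type and altitude (m)
veg = indicator_features(["forest", "desert", "forest"], ["forest", "desert"])
alt = threshold_features([100.0, 250.0, 400.0], [200.0, 300.0])
```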
The training set contained presence-only data from un-\nplanned surveys or incidental records, including those from museums and herbariums. The\ntest set contained presence-absence data from rigorously planned independent surveys.\n\nWe compared performance of our bias correction approaches with that of the unbiased\nmaxent which was among the top methods in the NCEAS comparison [13]. We used the full\ndataset consisting of 226 species in 6 regions with 2\u20135822 training presences per species\n(233 on average) and 102\u201319120 test presences/absences. For more details see [13].\n\nWe treated training occurrence locations for all species in each region as sampling\ndistribution samples and used them directly in APPROXFACTORBIASOUT.\nIn order to\napply DEBIASAVERAGES and FACTORBIASOUT, we estimated the sampling distribution\nusing unbiased maxent. Sampling distribution estimation is also the \ufb01rst step of [6]. In\ncontrast with that work, however, our experiments do not use the sampling distribution\nestimate during evaluation and hence do not depend on its quality.\n\nThe resulting distributions were evaluated on test presences according to the log loss\nand on test presences and absences according to the area under an ROC curve (AUC) [14].\nAUC quanti\ufb01es how well the predicted distribution ranks test presences above test ab-\nsences. Its value is equal to the probability that a randomly chosen presence will be ranked\nabove a randomly chosen absence. The uniformly random prediction receives AUC of 0.5\nwhile a perfect prediction receives AUC of 1.0.\n\nIn Table 2 we show performance of our three approaches compared with the unbiased\nmaxent. All three algorithms yield on average a worse log loss than the unbiased maxent.\nThis can perhaps be attributed to the imperfect estimate of the sampling distribution or to\n\n\fthe sampling distribution being zero over large portions of the sample space. 
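The AUC as used here can be computed directly from its probabilistic interpretation; a minimal sketch:

```python
def auc(presences, absences):
    """Area under the ROC curve: the probability that a randomly chosen
    presence is ranked above a randomly chosen absence (ties count 1/2)."""
    wins = sum(1.0 if p > a else 0.5 if p == a else 0.0
               for p in presences for a in absences)
    return wins / (len(presences) * len(absences))

# hypothetical predicted scores at test presences and absences
print(auc([0.9, 0.8, 0.7], [0.1, 0.2, 0.8]))  # 7.5 wins of 9 pairs, ~0.833
```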
In contrast,\nwhen the performance is measured in terms of AUC, FACTORBIASOUT and APPROX-\nFACTORBIASOUT yield on average the same or better AUC as UNBIASEDMAXENT in all\nsix regions. Improvements in regions awt, can and swi are dramatic enough so that both of\nthese methods perform better than any method evaluated in [13].\n\n5 Conclusions\nWe have proposed three approaches that incorporate information about sample selection\nbias in maxent and demonstrated their utility in synthetic and real data experiments. Ex-\nperiments also raise several questions that merit further research: DEBIASAVERAGES has\nthe strongest performance guarantees, but it performs the worst in real data experiments\nand catches up with other methods only for large sample sizes in synthetic experiments.\nThis may be due to poor estimates of unbiased con\ufb01dence intervals and could be possibly\nimproved using a different estimation method. FACTORBIASOUT and APPROXFACTOR-\nBIASOUT improve over UNBIASEDMAXENT in terms of AUC over real data, but are worse\nin terms of log loss. This disagreement suggests that methods which aim to optimize AUC\ndirectly could be more successful in species modeling, possibly incorporating some con-\ncepts from FACTORBIASOUT and APPROXFACTORBIASOUT. APPROXFACTORBIAS-\nOUT performs the best on real world data, possibly due to the direct use of samples from the\nsampling distribution rather than a sampling distribution estimate. However, this method\ncomes without performance guarantees and does not exploit the knowledge of the full sam-\nple space. Proving performance guarantees for APPROXFACTORBIASOUT remains open\nfor future research.\n\nAcknowledgments\nThis material is based upon work supported by NSF under grant 0325463. Any opinions, \ufb01ndings, and conclusions\nor recommendations expressed in this material are those of the authors and do not necessarily re\ufb02ect the views\nof NSF. 
The NCEAS data was kindly shared with us by the members of the \u201cTesting alternative methodologies\nfor modeling species\u2019 ecological niches and predicting geographic distributions\u201d Working Group, which was\nsupported by the National Center for Ecological Analysis and Synthesis, a Center funded by NSF (grant DEB-\n0072909), the University of California and the Santa Barbara campus.\n\nReferences\n[1] Jane Elith. Quantitative methods for modeling species habitat: Comparative performance and an application\nto Australian plants. In Scott Ferson and Mark Burgman, editors, Quantitative Methods for Conservation\nBiology, pages 39\u201358. Springer-Verlag, 2002.\n\n[2] A. Guisan and N. E. Zimmerman. Predictive habitat distribution models in ecology. Ecological Modelling,\n\n135:147\u2013186, 2000.\n\n[3] Steven J. Phillips, Miroslav Dud\u00b4\u0131k, and Robert E. Schapire. A maximum entropy approach to species distri-\nbution modeling. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.\n[4] S. Reddy and L. M. D\u00b4avalos. Geographical sampling bias and its implications for conservation priorities in\n\nAfrica. Journal of Biogeography, 30:1719\u20131727, 2003.\n\n[5] Barbara R. Stein and John Wieczorek. Mammals of the world: MaNIS as an example of data integration in\n\na distributed network environment. Biodiversity Informatics, 1(1):14\u201322, 2004.\n\n[6] Bianca Zadrozny. Learning and evaluating classi\ufb01ers under sample selection bias. In Proceedings of the\n\nTwenty-First International Conference on Machine Learning, 2004.\n\n[7] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example\n\nweighting. In Proceedings of the Third IEEE International Conference on Data Mining, 2003.\n\n[8] James J. Heckman. Sample selection bias as a speci\ufb01cation error. Econometrica, 47(1):153\u2013161, 1979.\n[9] Robert M. Groves. Survey Errors and Survey Costs. 
Wiley, 1989.\n[10] Roderick J. Little and Donald B. Rubin. Statistical Analysis with Missing Data. Wiley, second edition, 2002.\n[11] Charles Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International\n\nJoint Conference on Arti\ufb01cial Intelligence, 2001.\n\n[12] Miroslav Dud\u00b4\u0131k, Steven J. Phillips, and Robert E. Schapire. Performance guarantees for regularized maxi-\n\nmum entropy density estimation. In 17th Annual Conference on Learning Theory, 2004.\n\n[13] J. Elith, C. Graham, and NCEAS working group. Comparing methodologies for modeling species\u2019 distribu-\n\ntions from presence-only data. In preparation.\n\n[14] J. A. Hanley and B. S. McNeil. The meaning and use of the area under a receiver operating characteristic\n\n(ROC) curve. Radiology, 143:29\u201336, 1982.\n\n\f", "award": [], "sourceid": 2929, "authors": [{"given_name": "Miroslav", "family_name": "Dud\u00edk", "institution": null}, {"given_name": "Steven", "family_name": "Phillips", "institution": null}, {"given_name": "Robert", "family_name": "Schapire", "institution": null}]}