{"title": "An urn model for majority voting in classification ensembles", "book": "Advances in Neural Information Processing Systems", "page_first": 4430, "page_last": 4438, "abstract": "In this work we analyze the class prediction of parallel randomized ensembles by majority voting as an urn model. For a given test instance, the ensemble can be viewed as an urn of marbles of different colors. A marble represents an individual classifier. Its color represents the class label prediction of the corresponding classifier. The sequential querying of classifiers in the ensemble can be seen as draws without replacement from the urn. An analysis of this classical urn model based on the hypergeometric distribution makes it possible to estimate the confidence on the outcome of majority voting when only a fraction of the individual predictions is known. These estimates can be used to speed up the prediction by the ensemble. Specifically, the aggregation of votes can be halted when the confidence in the final prediction is sufficiently high. If one assumes a uniform prior for the distribution of possible votes the analysis is shown to be equivalent to a previous one based on Dirichlet distributions. The advantage of the current approach is that prior knowledge on the possible vote outcomes can be readily incorporated in a Bayesian framework. 
We show how incorporating this type of problem-specific knowledge into the statistical analysis of majority voting leads to faster classification by the ensemble and allows us to estimate the expected average speed-up beforehand.", "full_text": "An urn model for majority voting\n\nin classi\ufb01cation ensembles\n\nVictor Soto\n\nComputer Science Department\n\nColumbia University\nNew York, NY, USA\n\nvsoto@cs.columbia.edu\n\n{gonzalo.martinez,alberto.suarez}@uam.es\n\nAlberto Su\u00e1rez and Gonzalo Mart\u00ednez-Mu\u00f1oz\n\nComputer Science Department\n\nUniversidad Aut\u00f3noma de Madrid\n\nMadrid, Spain\n\nAbstract\n\nIn this work we analyze the class prediction of parallel randomized ensembles by\nmajority voting as an urn model. For a given test instance, the ensemble can be\nviewed as an urn of marbles of different colors. A marble represents an individual\nclassi\ufb01er.\nIts color represents the class label prediction of the corresponding\nclassi\ufb01er. The sequential querying of classi\ufb01ers in the ensemble can be seen\nas draws without replacement from the urn. An analysis of this classical urn\nmodel based on the hypergeometric distribution makes it possible to estimate\nthe con\ufb01dence on the outcome of majority voting when only a fraction of the\nindividual predictions is known. These estimates can be used to speed up the\nprediction by the ensemble. Speci\ufb01cally, the aggregation of votes can be halted\nwhen the con\ufb01dence in the \ufb01nal prediction is suf\ufb01ciently high. If one assumes\na uniform prior for the distribution of possible votes the analysis is shown to be\nequivalent to a previous one based on Dirichlet distributions. The advantage of\nthe current approach is that prior knowledge on the possible vote outcomes can be\nreadily incorporated in a Bayesian framework. 
We show how incorporating this type of problem-specific knowledge into the statistical analysis of majority voting leads to faster classification by the ensemble and allows us to estimate the expected average speed-up beforehand.

1 Introduction

Combining the outputs of multiple predictors is, in many cases of interest, a successful strategy to improve the capabilities of artificial intelligence systems, ranging from agent architectures [19] to committee learning [13, 15, 8, 9]. A common approach is to build a collection of individual subsystems and then integrate their outputs into a final decision by means of a voting process. Specifically, in the machine learning literature, there is extensive empirical evidence on the improvements in generalization capacity that can be obtained using ensembles of learners [7, 11]. However, one of the drawbacks of these types of systems is the linear memory and time costs incurred in the computation of the final ensemble prediction by combination of the individual predictions. There are various strategies that alleviate these shortcomings. These techniques are grouped into static (or off-line) and dynamic (or online). In static pruning techniques, only a subset of complementary predictors from the original ensemble is kept [16, 21, 6]. By contrast, in dynamic pruning, the whole ensemble is retained. The prediction of the class label of a particular instance is accelerated by halting the sequential querying process when it is unlikely that the remaining (unknown) votes would change the output prediction [10, 20, 14, 12, 2, 3, 17]. These techniques are online in the sense that, as new individual predictions become known, the algorithm dynamically updates the estimated probability of having a stable prediction; i.e., a prediction that coincides with that of the complete ensemble. This is the basis of the Statistical Instance-Based Algorithm (SIBA) proposed in [14].
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In a similar approach, albeit with a different objective, Reyzin proposes to randomly sample hypotheses from the original AdaBoost ensemble. The goal is to minimize the number of features that are used for prediction, with a limited loss of accuracy [18]. This feature-efficient prediction is beneficial when access to the features of a new instance at test time is costly (e.g., in some medical problems). A different approach is followed in [3]. In that work, a policy is learned to decide which classifiers should be queried and which discarded in the prediction of the class label of a given instance.

The dynamic ensemble pruning method proposed in this work is closely related to SIBA [14]. In SIBA, the members of a committee are queried sequentially. At each step in the querying process, the votes recorded are used to estimate the probability that the majority decision of the classifiers queried up to that moment coincides with that of the complete ensemble. If this probability exceeds a specified confidence level, α, the voting process is halted. To compute this estimate, the probability that a single predictor outputs a given decision for the particular instance considered is modeled as a random variable. Starting from a uniform prior, Bayes' theorem is used to update the distribution of this variable with the information provided by the actual votes, as they become known. In most of the problems analyzed in [14], the assumption that the prior is uniform leads to conservative estimates of the confidence on the stability of the predictions when only a fraction of the classifiers have been queried.
Analyzing the results of those experiments, it is apparent that the actual disagreement percentages between the dynamic decision output and the decision made by the complete committee are significantly lower than the specified target, 1 − α. As a consequence, more queries are made than are actually needed.

The present work has two objectives. First, we propose an intuitive mathematical model of the voting process in ensembles of classifiers based on the hypergeometric distribution. Under the assumption that the distribution of possible vote outcomes is uniform, we prove that this derivation is equivalent to the one presented in [14]. However, the vote distribution is, in general, not uniform. Its shape depends on the classification task considered and on the base learning algorithm used to generate the predictors. Second, to take this dependence into account, we propose to approximate this distribution using a non-parametric prior. The use of this problem-specific prior knowledge leads to more accurate estimates of the disagreement rates between the dynamic sub-committee prediction and the complete committee, which are closer to the specified target, 1 − α. In this manner, faster classification can be achieved with minimal loss of accuracy. In addition, the use of priors allows us to estimate quite precisely the expected average number of trees that need to be queried.

2 Modeling ensemble voting processes as a classical urn problem

Consider the following classical urn model. Let us suppose we have marbles of l different colors in an urn. The number of marbles of color y_k in the urn is T_k, with k = 1, ..., l. The total number of marbles in the urn is T = Σ_{k=1}^l T_k. The contents of the urn can therefore be described by the vector T = ⟨T_1, T_2, ..., T_l⟩. Assume that t < T marbles are extracted from the urn without replacement. This extraction process can be characterized by the vector t = ⟨t_1, t_2, ..., t_l⟩, where t_k is the number of marbles of color y_k extracted, with t = Σ_{k=1}^l t_k. The probability of extracting a color distribution of marbles t, given the initial color distribution of the urn T, is described by the multivariate hypergeometric distribution

$$P(\mathbf{t}\mid\mathbf{T}) = \frac{\prod_{i=1}^{l}\binom{T_i}{t_i}}{\binom{T}{t}} = \frac{\binom{T_1}{t_1}\cdots\binom{T_l}{t_l}}{\binom{T}{t}}. \qquad (1)$$

Consider the case in which the total number of marbles in the urn, T, is known, but the color distribution, T, is unknown. In this case, the color distribution of the extracted marbles, t, can be used to estimate the contents of the urn by applying Bayes' theorem

$$P(\mathbf{T}\mid\mathbf{t}) = \frac{P(\mathbf{t}\mid\mathbf{T})\,P(\mathbf{T})}{P(\mathbf{t})} = \frac{\binom{T_1}{t_1}\cdots\binom{T_l}{t_l}\,P(\mathbf{T})}{\sum_{\mathbf{T}^{*}\in\Omega_{\mathbf{t}}}\binom{T_1^{*}}{t_1}\cdots\binom{T_l^{*}}{t_l}\,P(\mathbf{T}^{*})}, \qquad (2)$$

where Ω_t is the set of vectors T* such that T*_i ≥ t_i for all i and Σ_{i=1}^l T*_i = T.

This problem is equivalent to the voting process in an ensemble of classifiers: suppose we want to predict the class label of an instance by combining the individual predictions of the ensemble classifiers (marbles). Assuming that the individual predictions are deterministic, the class (color) that each classifier (marble) would output if queried is fixed, but unknown before the query. Therefore, for each instance considered we have a different "bag of colored marbles" with an unknown class distribution. After a partial count of the votes of the ensemble is known, Eq. 2 provides an estimate of the distribution of votes for the complete ensemble.
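For small urns, Eqs. 1 and 2 can be evaluated by direct enumeration of the candidate content vectors. The following Python sketch is our own illustration (the function and variable names are not part of the paper); it accepts an optional callable prior over content vectors, with the uniform prior as the default:

```python
from math import comb, prod

def hypergeom_pmf(t, T):
    """Eq. 1: probability of drawing the color counts t, without
    replacement, from an urn with per-color counts T."""
    return prod(comb(Ti, ti) for Ti, ti in zip(T, t)) / comb(sum(T), sum(t))

def urn_posterior(t, T_total, prior=None):
    """Eq. 2: posterior over urn contents T given the partial draw t.

    `prior` maps a candidate content vector to an unnormalized weight;
    a uniform prior is assumed when it is omitted."""
    def candidates(remaining, mins):
        # all vectors T* with T*_i >= t_i and sum(T*) == T_total
        if len(mins) == 1:
            if remaining >= mins[0]:
                yield (remaining,)
            return
        for first in range(mins[0], remaining - sum(mins[1:]) + 1):
            for rest in candidates(remaining - first, mins[1:]):
                yield (first,) + rest

    weights = {}
    for T in candidates(T_total, tuple(t)):
        w = prior(T) if prior is not None else 1.0
        weights[T] = w * prod(comb(Ti, ti) for Ti, ti in zip(T, t))
    Z = sum(weights.values())
    return {T: w / Z for T, w in weights.items()}
```

For instance, after observing t = (1, 1) from an urn of T = 4 marbles, the uniform-prior posterior assigns probability 0.4 to the balanced content (2, 2) and 0.3 to each of (1, 3) and (3, 1).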
This estimate can be used to compute the probability that the decision obtained using only a partial tally of votes, t, of size t < T, coincides with the final decision made using all T votes

$$P^{*}(\mathbf{t}, T) = \sum_{\mathbf{T}\in\mathcal{T}_{\mathbf{t}}} \frac{\binom{T_1}{t_1}\cdots\binom{T_l}{t_l}\,P(\mathbf{T})}{\sum_{\mathbf{T}^{*}\in\Omega_{\mathbf{t}}}\binom{T_1^{*}}{t_1}\cdots\binom{T_l^{*}}{t_l}\,P(\mathbf{T}^{*})}, \qquad (3)$$

where $\mathcal{T}_{\mathbf{t}}$ is the set of vectors of votes for the complete ensemble, T = ⟨T_1, T_2, ..., T_l⟩, such that the class predicted by the subensemble of size t and the class predicted by the complete committee coincide, with T_i ≥ t_i and Σ_{i=1}^l T_i = T.

If P*(t, T) = 1, then the classifications given by the partial ensemble and by the full ensemble coincide. This happens when the difference between the number of votes for the first and the second class in t is greater than the number of votes remaining in the urn. In such a case, the voting process can be halted with full confidence that the decision of the partial ensemble will not change when the predictions of the remaining classifiers are considered. In addition, if it is acceptable that, with a small probability 1 − α, the prediction of the partially polled ensemble and that of the complete ensemble disagree, then the voting process can be stopped as soon as P*(t, T) exceeds the specified confidence level α. The final classification is then given by the combined decisions of only those classifiers that have been polled up to that point.

2.1 Uniform prior

Assuming a uniform prior for the distribution of possible T vectors, P(T) = 1/‖T‖, where ‖T‖ stands for the number of possible T vectors, this derivation is equivalent to the one presented in [14].
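Before turning to that equivalence, note that for the binary case Eq. 3 reduces to a single sum over the number of class-1 votes in the complete ensemble, which makes the halting rule easy to evaluate directly. A minimal Python sketch of this computation (our own helper, not the authors' implementation; `prior` is an optional sequence of unnormalized weights indexed by the number of class-1 votes, uniform when omitted):

```python
from math import comb

def p_star(t1, t2, T, prior=None):
    """Eq. 3 for two classes: probability that the majority decision
    after observing (t1, t2) votes agrees with that of the full
    ensemble of size T (assumed odd, so no ties can occur)."""
    partial = 0 if t1 >= t2 else 1          # class favored by the tally
    num = den = 0.0
    for T1 in range(t1, T - t2 + 1):        # all urns consistent with t
        w = ((prior[T1] if prior is not None else 1.0)
             * comb(T1, t1) * comb(T - T1, t2))
        den += w
        full = 0 if T1 >= T - T1 else 1     # majority of the full urn
        if full == partial:
            num += w
    return num / den
```

When the current margin already exceeds the number of outstanding votes (e.g., 6 votes to 0 with T = 11), the function returns 1, matching the certainty case discussed above.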
That formulation assumes that the base classifiers of the ensemble are independent realizations from a pool of all possible classifiers given the training dataset. Assuming that an unlimited number of realizations can be performed, the distribution of class votes in the ensemble converges to a Dirichlet distribution in the limit of infinite ensemble size. Then, assuming a partial tally of t votes, the probability that the ensemble's decision will change if the predictions of the remaining T − t classifiers are considered can be estimated.

In order to prove the equivalence between both formulations, we first need to introduce three results, presented in the theorem and propositions below.

Theorem (Chu–Vandermonde identity). Let s, t, r ∈ ℕ. Then

$$\sum_{k=0}^{r}\binom{s}{k}\binom{t}{r-k} = \binom{s+t}{r}. \qquad (4)$$

Proposition 1 (Upper negation). Let r ∈ ℂ and k ∈ ℤ. Then

$$\binom{r}{k} = (-1)^{k}\binom{k-r-1}{k}. \qquad (5)$$

The previous theorem and proposition are used in the following proposition, which is the key to proving the equivalence between the two formulations:

Proposition 2. Let n_1 and n_2 be positive integers such that n_1 + n_2 = n and n ≤ N. Then

$$\sum_{i=n_1}^{N-n_2}\binom{i}{n_1}\binom{N-i}{n_2} = \binom{N+1}{N-n}. \qquad (6)$$

Proof.
First, the symmetry property of the binomial coefficient (i.e., $\binom{n}{k} = \binom{n}{n-k}$) is used to bring down the indices

$$\sum_{i=n_1}^{N-n_2}\binom{i}{n_1}\binom{N-i}{n_2} = \sum_{i=n_1}^{N-n_2}\binom{i}{i-n_1}\binom{N-i}{N-i-n_2}.$$

The upper indices are removed by applying the upper negation property of Proposition 1

$$\sum_{i=n_1}^{N-n_2}\binom{i}{i-n_1}\binom{N-i}{N-i-n_2} = \sum_{i=n_1}^{N-n_2}(-1)^{i-n_1}\binom{-n_1-1}{i-n_1}(-1)^{N-i-n_2}\binom{-n_2-1}{N-i-n_2}.$$

Now the Chu–Vandermonde identity can be applied with r = N − n_1 − n_2 and k = i − n_1

$$\sum_{i=n_1}^{N-n_2}(-1)^{i-n_1}(-1)^{N-i-n_2}\binom{-n_1-1}{i-n_1}\binom{-n_2-1}{N-i-n_2} = (-1)^{N-n}\binom{-n-2}{N-n}.$$

Finally, the upper negation is applied again

$$(-1)^{N-n}\binom{-n-2}{N-n} = (-1)^{N-n}(-1)^{N-n}\binom{N+1}{N-n} = \binom{N+1}{N-n}.$$

Proposition 3. Following the hypergeometric reformulation given by Equation 2 and assuming that P(T) follows a uniform distribution 1/‖T‖, where ‖T‖ stands for the number of possible T vectors, then

$$P(\mathbf{T}\mid\mathbf{t}) = \frac{(T-t)!}{\prod_{i=1}^{l}(T_i-t_i)!}\,\frac{\prod_{i=1}^{l}(t_i+1)_{T_i-t_i}}{(t+l)_{T-t}},$$

where (x)_n = x(x + 1)···(x + n − 1) is the Pochhammer symbol. This formulation is equivalent to the one proposed in [14].

Proof. Equation 2 can be simplified by taking into account the uniform prior P(T) = 1/‖T‖ as

$$P(\mathbf{T}\mid\mathbf{t}) = \frac{P(\mathbf{t}\mid\mathbf{T})\,P(\mathbf{T})}{P(\mathbf{t})} = \frac{\binom{T_1}{t_1}\cdots\binom{T_l}{t_l}}{\sum_{\mathbf{T}^{*}\in\Omega_{\mathbf{t}}}\binom{T_1^{*}}{t_1}\cdots\binom{T_l^{*}}{t_l}}. \qquad (7)$$

The indices of the summation, Ω_t, form the set of vectors T* such that T*_i ≥ t_i for all i and Σ_{i=1}^l T*_i = T. They can be changed for l classes to

$$P(\mathbf{T}\mid\mathbf{t}) = \frac{\binom{T_1}{t_1}\cdots\binom{T_l}{t_l}}{\sum_{T_1^{*}=t_1}^{\hat{T}_1}\sum_{T_2^{*}=t_2}^{\hat{T}_2}\cdots\sum_{T_{l-1}^{*}=t_{l-1}}^{\hat{T}_{l-1}}\binom{T_1^{*}}{t_1}\cdots\binom{T_l^{*}}{t_l}}, \qquad (8)$$

where $\hat{T}_k$, for k = 1, ..., l − 1, are the maximum values for T*_k in the summations. Note that the summation over T*_l is unnecessary, since the value of T*_l becomes fixed once the values of T*_1, ..., T*_{l−1} are fixed, because Σ_{i=1}^l T*_i = T. In this sense, the values of $\hat{T}_k$ have a dependency on T*_i for i < k: $\hat{T}_k = T - t + t_k - \sum_{i=1}^{k-1}(T_i^{*} - t_i)$, for k = 1, ..., l − 1. The summations in the denominator of Eq. 8 can be rearranged, and Proposition 2 (Eq. 6) can be used, with $N = T - \sum_{i=1}^{l-2} T_i^{*}$, $n_1 = t_{l-1}$ and $n_2 = t_l$, to express the summation over T*_{l−1} in closed form

$$\sum_{T_{l-1}^{*}=t_{l-1}}^{T-\sum_{i=1}^{l-2}T_i^{*}-t_l}\binom{T_{l-1}^{*}}{t_{l-1}}\binom{T-\sum_{i=1}^{l-2}T_i^{*}-T_{l-1}^{*}}{t_l} = \binom{T-\sum_{i=1}^{l-2}T_i^{*}+1}{T-\sum_{i=1}^{l-2}T_i^{*}-t_{l-1}-t_l} = \binom{T-\sum_{i=1}^{l-2}T_i^{*}+1}{t_{l-1}+t_l+1},$$

where the symmetry property of the binomial has been used in the last step. The subsequent summations are carried out in the same manner. The summation over T*_k requires the application of Eq. 6 with $N = T - \sum_{i=1}^{k-1} T_i^{*} + (l-k-1)$, $n_1 = t_k$ and $n_2 = \sum_{i=k+1}^{l} t_i + (l-k-1)$:

$$\sum_{T_1^{*}=t_1}^{\hat{T}_1}\cdots\sum_{T_{l-2}^{*}=t_{l-2}}^{\hat{T}_{l-2}}\binom{T_{l-2}^{*}}{t_{l-2}}\binom{T-\sum_{i=1}^{l-2}T_i^{*}+1}{t_{l-1}+t_l+1} = \cdots = \binom{T+l-1}{t+l-1}.$$

Employing this result in Eq. 8, one obtains

$$P(\mathbf{T}\mid\mathbf{t}) = \frac{\binom{T_1}{t_1}\cdots\binom{T_l}{t_l}}{\binom{T+l-1}{t+l-1}} = \frac{\frac{T_1!}{t_1!(T_1-t_1)!}\cdots\frac{T_l!}{t_l!(T_l-t_l)!}}{\frac{(T+l-1)!}{(t+l-1)!(T-t)!}} = \frac{(T-t)!}{\prod_{i=1}^{l}(T_i-t_i)!}\,\frac{\prod_{i=1}^{l}(t_i+1)_{T_i-t_i}}{(t+l)_{T-t}}.$$

2.2 Non-uniform prior

The distribution P(T) can be modeled using a non-parametric, non-uniform prior.
The values of this prior can be obtained from the training data by some form of validation, e.g., out-of-bag or cross-validation. Out-of-bag validation is faster because it does not require multiple generations of the ensemble. Therefore, it will be the validation method used in our implementation of the method. To compute the out-of-bag error, each training instance, x_n, is classified by the ensemble predictors that do not have that particular instance in their training set. Let $\tilde{T}^{n} = \tilde{T}^{n}_1 + \ldots + \tilde{T}^{n}_l$ be the number of such classifiers, where $\tilde{T}^{n}_i$ is the number of out-of-bag votes for class i, with i = 1, ..., l, assigned to instance x_n. The number of votes for each class for an ensemble of size T is estimated as $T^{n}_i \approx \mathrm{round}(T\,\tilde{T}^{n}_i / \tilde{T}^{n})$. To mitigate the influence of the random fluctuations that appear because of the finite size of the training set, and to avoid spurious numeric artifacts, the prior is subsequently smoothed using a sliding window of size 5 over the vote distribution.

As shown in Section 2, the response time of the ensemble can be reduced by using Eq. 3 if we allow a small fraction, 1 − α, of the predictions given by ensembles of size t and T not to coincide. Assuming this tolerance, when P*(t, T) > α, the voting process can be halted and the ensemble outputs the decision given by the t ≤ T queried classifiers. However, the computation of Eq. 3 is costly and should be performed off-line. In the SIBA formulation, a lookup table is used, indexed by the number of votes of the minority class (for binary problems) and whose values are the minimum number of votes of the majority class such that P*(t, T) > α. Using a precomputed lookup table to halt the voting process does not entail a significant overhead during classification: a single lookup operation in the table is needed for each vote.
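The estimation of the non-parametric prior from out-of-bag votes described above can be sketched as follows for the binary case. This is our own simplified illustration (the names and the exact smoothing details are assumptions): `oob_votes` holds, for each training instance, the pair of out-of-bag vote counts, and the result is a smoothed distribution over the number of class-1 votes in an ensemble of size T.

```python
def estimate_prior(oob_votes, T, window=5):
    """Build a smoothed non-parametric prior over the number of
    class-1 votes, T1 = 0..T, from out-of-bag vote counts."""
    counts = [0.0] * (T + 1)
    for v1, v2 in oob_votes:
        n = v1 + v2                  # out-of-bag ensemble size
        T1 = round(T * v1 / n)       # rescale to an ensemble of size T
        counts[T1] += 1.0
    # sliding-window smoothing to damp finite-sample fluctuations
    half = window // 2
    smoothed = [
        sum(counts[max(0, i - half): i + half + 1])
        / (min(T, i + half) - max(0, i - half) + 1)
        for i in range(T + 1)
    ]
    Z = sum(smoothed)
    return [s / Z for s in smoothed]
```

The sliding window of size 5 mirrors the smoothing step described in the text; window size and normalization details are our choices for the sketch.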
The consequence of using a uniform prior is that all classes are considered equivalent. Hence, it is sufficient to compute one lookup table and use the minority class for indexing. When prior knowledge is taken into account, the probability P*(t_1 = n, t_2 = m, T) is not necessarily equal to P*(t_1 = m, t_2 = n, T) for n ≠ m. Therefore, a different lookup table per class is necessary. In addition, it is necessary to compute a different set of tables for each dataset. In the original formulation, the lookup table values depend only on T and α; therefore, they are independent of the particular classification problem considered. In our case, the prior distribution is estimated from the training data; hence, it is problem dependent. However, the querying process is similar to SIBA. For instance, if we have 1 vote for class 1 and 7 votes for class 2, one determines whether the value in position 1 (the minority class at this moment) of the lookup table for class 1 is greater than or equal to 7. If it is, the querying process stops. As a side effect, for the experimental comparison, it is necessary to recompute the lookup tables for each realization of the data. Notwithstanding, in a real setting, these tables need to be computed only once, and this can be done offline. Therefore, the speed improvements in the classification phase are independent of the size of the training set.

The lookup table and the estimated non-parametric prior can also be used to estimate the average number of classifiers that are expected to be queried during testing. This estimation can be made using Monte Carlo simulation. To this end, one would perform the following experiment repeatedly and compute the average number of queries: extract a random vector T from the prior distribution; generate a vector of votes of size T that contains exactly T_i votes for class i, with i = 1, ..., l; finally, query a random permutation of this vector of votes until the process can be halted as indicated by the lookup table, and record the number of queries.

3 Experiments

In this section we present the results of an extensive empirical evaluation of the dynamic ensemble pruning method described in the previous section. The experiments are performed on a series of benchmark classification problems from the UCI Repository [1] and on synthetic data [4], using Random Forests [5]. The code is available at: https://github.com/vsoto/majority-ibp-prior.

The protocol for the experiments is as follows: for each problem, 100 partitions are created by 10 × 10-fold cross-validation for the real datasets and by random sampling for the synthetic datasets. All the classification tasks considered are binary, except for New-thyroid, Waveform and Wine, which have three classes. For each partition, the following steps are carried out: (i) a Random Forest ensemble of size T = 101 is built; (ii) we compute the generalization error rate of the complete ensemble on the test set and record the mean number of trees that are queried to determine the final prediction. Note that this number need not be T: the voting process can be halted when the remaining votes (i.e., the predictions of the classifiers that have not been queried up to that point) cannot modify the partial ensemble decision. This is the case when the number of remaining votes is smaller than the difference between the number of votes for the majority class and for the second most voted class; (iii) the SIBA algorithm [14] is applied to dynamically select the number of classifiers that are needed for each instance in the test set to achieve a level of confidence in the prediction above α = 0.99.
We use SIBA as the benchmark for\ncomparison since in previous studies it has been shown to provide the best overall results, especially\nfor T < 500 [2]; (iv) The process is repeated using the proposed method with non-uniform priors\nfor the class vote distribution, with the same con\ufb01dence threshold, \u03b1 = 0.99. The prior distribution\nP(T) is estimated in the training set using out-of-bag data. This prior is also used to estimate the\nexpected number of trees to be queried in the testing phase. In addition, for steps (iii) and (iv) we\ncompute the test error rate, the average number of queried trees, and the disagreement rates between\nthe predictions of the partially queried ensembles and the complete ones.\n\nTable 1: Error rates (left) and disagreement % (right). The statistical signi\ufb01cant differences, using\npaired t-tests at a signi\ufb01cance level \u03b1 = 0.05, are highlighted in boldface.\n\nRF\n\nProblem\n13.00\u00b13.7\nAustralian\n3.22\u00b12.1\nBreast\n24.34\u00b14.2\nDiabetes\nEchocardiogram 22.18\u00b114.3\n23.43\u00b13.5\nGerman\n18.30\u00b16.9\nHeart\n15.47\u00b15.6\nHorse-colic\n6.44\u00b14.1\nIonosphere\n6.33\u00b18.9\nLabor\n27.10\u00b16.7\nLiver\n0.00\u00b10.0\nMushroom\n4.29\u00b14.0\nNew-thyroid\n7.60\u00b11.3\nRingnorm\n16.25\u00b18.7\nSonar\n4.59\u00b11.5\nSpam\n17.85\u00b11.1\nThreenorm\n1.05\u00b11.1\nTic-tac-toe\n4.66\u00b10.6\nTwonorm\n4.05\u00b12.9\nVotes\n17.30\u00b10.9\nWaveform\n1.69\u00b12.8\nWine\n\nError 
rates\nSIBA\n13.09\u00b13.7\n3.23\u00b12.1\n24.25\u00b14.1\n22.05\u00b114.7\n23.65\u00b13.3\n18.37\u00b17.0\n15.44\u00b15.4\n6.44\u00b14.1\n6.17\u00b18.8\n27.09\u00b17.0\n0.00\u00b10.0\n4.38\u00b14.0\n7.72\u00b11.2\n16.45\u00b18.7\n4.63\u00b11.5\n18.04\u00b11.1\n1.16\u00b11.1\n4.77\u00b10.6\n4.12\u00b12.9\n17.36\u00b10.8\n1.74\u00b12.8\n\nHYPER\n13.25\u00b13.8\n3.76\u00b12.3\n24.23\u00b14.0\n22.18\u00b114.1\n23.62\u00b13.3\n18.37\u00b17.2\n15.44\u00b15.4\n6.52\u00b13.9\n6.43\u00b19.1\n27.01\u00b16.9\n0.08\u00b10.2\n4.66\u00b14.2\n7.82\u00b11.2\n16.45\u00b18.8\n4.86\u00b11.4\n17.97\u00b11.1\n1.72\u00b11.5\n4.90\u00b10.6\n4.30\u00b12.9\n17.45\u00b10.8\n2.30\u00b13.5\n\nDisagreement %\nSIBA\nHYPER\n0.3\u00b10.6\n0.9\u00b11.1\n1.0\u00b11.1\n0.1\u00b10.4\n0.8\u00b11.0\n0.6\u00b10.9\n0.7\u00b13.1\n1.4\u00b14.6\n0.8\u00b10.8\n0.8\u00b10.9\n0.8\u00b11.8\n1.0\u00b12.1\n0.7\u00b11.3\n0.4\u00b10.9\n0.1\u00b10.6\n0.7\u00b11.3\n0.2\u00b11.7\n1.2\u00b14.5\n1.0\u00b11.7\n0.9\u00b11.5\n0.1\u00b10.2\n0.0\u00b10.0\n0.7\u00b12.0\n0.1\u00b10.7\n0.5\u00b10.2\n0.8\u00b10.3\n0.9\u00b12.0\n0.8\u00b11.9\n0.1\u00b10.2\n0.7\u00b10.4\n0.8\u00b10.2\n1.0\u00b10.2\n0.1\u00b10.4\n0.7\u00b11.0\n0.4\u00b10.1\n0.7\u00b10.2\n0.1\u00b10.4\n1.0\u00b11.8\n1.0\u00b10.3\n0.6\u00b10.1\n0.1\u00b10.6\n1.1\u00b12.5\n\nIn Table 1, we compare the error rates of Random Forest (RF) and of the dynamically pruned\nensembles using the halting rule derived from assuming uniform priors (SIBA) and using non-\nuniform priors (HYPER), and the disagreement rates. The values displayed are averages over 100\nrealizations of the datasets The standard deviation is given after the \u00b1 symbol.\n\n6\n\n\fFigure 1: Vote distribution, P(T), and disagreement rates for Sonar (left) and Votes (right)\n\nFrom Table 1, one observes that the mean error rates of the pruned ensembles using SIBA and HYPER\nare only slightly worse than the rates obtained by the complete ensemble (RF). 
These differences should be expected, since we are allowing a small disagreement of 1 − α = 1% between the decisions of the partial and the complete ensemble. In any case, the differences in generalization error can be made arbitrarily small by increasing α. By design, the disagreement rates are expected to be below, but close to, 1%. From Table 1, one observes that the disagreement percentages of the proposed method (HYPER) are closer to the specified threshold (1 − α = 1%) than those of SIBA, except for Liver, Sonar and Threenorm, where the differences are small. In these problems (and, in general, in the problems where SIBA obtains disagreement rates closer to 1 − α), the distribution of T is closer to a uniform distribution (see Figure 1, left histogram). In consequence, the uniform prior assumed by SIBA is closer to the real one. However, when P(T) differs from the uniform distribution (see, for instance, Votes in Figure 1, right histogram), the results of SIBA are rather different from the expected disagreement rates.

Table 2: Number of queried trees and speed-up rate with respect to the full ensemble of 101 trees. The statistically significant differences between SIBA and HYPER, using paired t-tests at a significance level α = 0.05, are highlighted in boldface.

# of trees

RF∗

Problem
62.2±1.4
Australian
54.2±0.9
Breast
68.8±1.8
Diabetes
Echocardiogram
68.0\u00b14.6\n71.8\u00b11.3\nGerman\n67.2\u00b12.5\nHeart\n66.2\u00b12.1\nHorse-colic\n57.9\u00b11.5\nIonosphere\n61.6\u00b14.0\nLabor\n74.5\u00b12.3\nLiver\n51.0\u00b10.0\nMushroom\n55.2\u00b11.8\nNew-thyroid\n68.6\u00b10.8\nRingnorm\n73.9\u00b13.0\nSonar\n57.1\u00b10.3\nSpam\n76.6\u00b10.5\nThreenorm\n60.7\u00b10.9\nTic-tac-toe\n67.2\u00b10.2\nTwonorm\n54.5\u00b11.2\nVotes\n72.3\u00b10.7\nWaveform\n57.3\u00b12.1\nWine\n\nSIBA\n16.1\u00b12.1\n8.9\u00b11.4\n24.9\u00b13.2\n22.6\u00b18.2\n28.4\u00b12.8\n22.5\u00b14.2\n20.2\u00b13.5\n11.9\u00b12.3\n14.1\u00b16.0\n31.8\u00b14.5\n6.0\u00b10.0\n10.7\u00b12.6\n22.9\u00b11.1\n32.1\u00b16.6\n11.1\u00b10.5\n34.8\u00b11.0\n12.8\u00b11.4\n21.0\u00b10.5\n8.8\u00b11.8\n29.3\u00b11.1\n11.4\u00b12.7\n\nHYPER MC Estim RF\u2217\n12.8\u00b12.3\n1.6\n4.0\u00b11.0\n1.9\n24.0\u00b13.2\n1.5\n20.0\u00b18.0\n1.5\n27.7\u00b12.9\n1.4\n20.9\u00b14.2\n1.5\n17.5\u00b13.7\n1.5\n7.8\u00b12.1\n1.7\n9.7\u00b15.3\n1.6\n31.7\u00b14.5\n1.4\n1.0\u00b10.0\n2.0\n6.0\u00b12.3\n1.8\n20.4\u00b11.5\n1.5\n32.6\u00b16.8\n1.4\n7.2\u00b10.6\n1.8\n35.8\u00b11.6\n1.3\n7.8\u00b11.2\n1.7\n18.4\u00b10.9\n1.5\n4.1\u00b11.4\n1.9\n27.8\u00b11.7\n1.4\n5.8\u00b11.8\n1.8\n\n12.9\u00b10.9\n4.0\u00b10.4\n23.8\u00b11.1\n21.6\u00b13.2\n30.1\u00b11.0\n20.7\u00b11.7\n18.6\u00b11.5\n7.8\u00b10.6\n10.2\u00b12.0\n31.6\u00b12.0\n1.0\u00b10.0\n6.2\u00b11.4\n19.7\u00b12.3\n31.8\u00b12.4\n7.1\u00b10.5\n33.4\u00b12.5\n8.6\u00b10.7\n18.8\u00b11.7\n4.0\u00b10.7\n28.6\u00b12.8\n6.7\u00b11.4\n\nSpeed-up rate\n\nSIBA HYPER\n6.3\n11.3\n4.1\n4.5\n3.6\n4.5\n5.0\n8.5\n7.2\n3.2\n16.8\n9.4\n4.4\n3.1\n9.1\n2.9\n7.9\n4.8\n11.5\n3.4\n8.9\n\n7.9\n25.3\n4.2\n5.1\n3.6\n4.8\n5.8\n12.9\n10.4\n3.2\n101.0\n16.8\n5.0\n3.1\n14.0\n2.8\n12.9\n5.5\n24.6\n3.6\n17.5\n\nIn order to analyze this aspect in more detail, we have computed the disagreement rates for different\nvalues of alpha (\u03b1 = 0.999, 0.995, 0.99, 0.95). 
In Figure 1 the relation between the target 1 − α and the actual disagreement rate is presented. A diagonal solid line marks the expected upper limit for the disagreement. The results for SIBA, HYPER and for the case of using a fixed number of trees for all instances (FIXED), equal to the average number of trees used by HYPER in each task, are presented in these plots. This last case (FIXED) can be seen as a stochastic approximation to the prediction of the whole ensemble. From these plots, we observe that the results for HYPER are very close to the expected disagreement rates both for cases in which the prior is approximately uniform (Sonar) and for cases in which it is non-uniform (Votes). As expected, the results of SIBA are close to the target only in the approximately uniform case (Sonar). Finally, when the stochastic approximation is used (FIXED), the disagreement rates are clearly above the target threshold given by α. From these results we conclude that the proposed model provides a more accurate description of the voting process used to compute the prediction of the ensemble. This means that taking into account the prior distribution of possible vote outcomes, P(T), is important to obtain disagreement rates that are close to the established threshold.

[Figure 1: For Sonar (left) and Votes (right), histograms of the vote distribution P(t1) together with curves of disagreement rate versus 1 − α for HYPER, SIBA and FIXED.]

Finally, in Table 2, we present the average number of trees used by Random Forest (RF*), the SIBA method, the proposed method using non-parametric priors (HYPER), and the expected average number of trees to be queried by HYPER, estimated using Monte Carlo sampling (MC Estim).
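The MC Estim column can be understood through a small simulation: sample an urn composition from the prior, reveal the votes in random order, and record when the stopping confidence first exceeds α. The following is a two-class sketch under our own naming, not the authors' code; for simplicity it declares confidence in whichever class the posterior favors:

```python
import numpy as np
from scipy.stats import hypergeom

def expected_trees(prior, T=101, alpha=0.99, n_samples=200, seed=0):
    """Monte Carlo estimate of the average number of trees queried before
    the posterior confidence in the final majority reaches alpha.
    prior[k] is the assumed probability that exactly k of the T trees
    vote for class 1 (two-class sketch)."""
    rng = np.random.default_rng(seed)
    grid = np.arange(T + 1)               # candidate urn compositions T1
    counts = []
    for _ in range(n_samples):
        T1 = rng.choice(grid, p=prior)    # true (hidden) number of class-1 votes
        order = rng.permutation(np.r_[np.ones(T1), np.zeros(T - T1)])
        t1 = 0
        for t in range(1, T + 1):
            t1 += order[t - 1]
            post = prior * hypergeom.pmf(t1, T, grid, t)  # prior x likelihood
            post /= post.sum()
            p1 = post[grid > T // 2].sum()  # P(full-ensemble majority is class 1)
            if max(p1, 1.0 - p1) >= alpha:  # confident about the final decision
                break
        counts.append(t)
    return float(np.mean(counts))
```

With a prior concentrated on unanimous ensembles (the Mushroom-like situation described below), the simulation stops after a single queried tree, matching the 1.0±0.0 entry in Table 2.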
Note that the number of trees used by RF* is not necessarily T = 101: the voting process is halted when the remaining (unknown) predictions cannot alter the decision of the ensemble. The number of trees given for RF* is the same as the number SIBA or HYPER would use when α = 100%. Finally, the last three columns of Table 2 display the speed-up rates of the partial ensembles with respect to the full ensemble of size T = 101. From this table it is clear that HYPER reduces the number of queried classifiers with respect to SIBA in most of the tasks investigated. In addition, using only training data, the Monte Carlo estimates of the average number of trees are very precise: the largest average difference between this estimate and HYPER is 2.4 trees, for German and Threenorm. The speed-up rate of HYPER with respect to the full ensemble is remarkable: from 2.8 times faster for Threenorm to 101 times faster for Mushroom. This last dataset illustrates the benefits of using the prior distribution. For this problem, most classifiers agree in their predictions. HYPER takes advantage of this prior knowledge and queries only one classifier to cast the final decision: the probability that the prediction of a single classifier and the prediction of the complete ensemble differ is below 1%. Similar behavior (though not as extreme) is observed for Breast and Votes.

4 Conclusions

In this work, we present an intuitive, rigorous mathematical description of the voting process in an ensemble of classifiers: for a given instance, the process is equivalent to extracting marbles (the individual classifiers), without replacement, from a bag that contains a known number of marbles but whose color (class label prediction) distribution is unknown.
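A direct consequence of this urn view is the deterministic halting rule used for RF* in the experiments: querying can stop as soon as the marbles still in the bag cannot change the winning color. This reduces to a simple counting test; a minimal multi-class sketch (function name ours):

```python
def can_stop(vote_counts, n_remaining):
    """True when the classifiers not yet queried cannot alter the
    current majority decision, so querying may halt early."""
    counts = sorted(vote_counts.values(), reverse=True)
    runner_up = counts[1] if len(counts) > 1 else 0
    # Even if every remaining vote went to the runner-up,
    # the current leader would still win.
    return counts[0] > runner_up + n_remaining
```

For an ensemble of T = 101 trees, 51 votes for the same class make any further querying unnecessary, which is why RF* queries far fewer than 101 trees on average.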
In addition, we show that for the specific case of a uniform prior distribution of class votes this process is equivalent to the one developed in [14]. In the current description, which does not assume a uniform prior distribution for the class votes, the hypergeometric distribution plays a central role.
The results of this statistical description are then used to design a dynamic ensemble pruning method with the goal of speeding up predictions in the test phase. For a given instance, it is possible to compute the probability that the partial decision made on the basis of the known votes (i.e., the class label predictions of the subset of classifiers that have been queried) and the final ensemble decision coincide. If this probability is above a specified threshold, sufficiently close to 1, a reliable estimate of the class label that the complete ensemble would predict can be made on the basis of the known votes. The effectiveness of this dynamic ensemble pruning method is illustrated using random forests. The prior distribution of class votes is estimated using out-of-bag data. As a result of incorporating this problem-specific knowledge in the statistical analysis of the voting process, the differences between the predictions of the dynamically pruned ensemble and the complete ensemble are closer to the specified threshold than when a uniform distribution is assumed, as in SIBA [14].
In the empirical evaluation performed, this dynamic ensemble pruning algorithm consistently yields improvements in classification speed over SIBA without a significant deterioration of accuracy. Finally, the proposed statistical model is used to provide an accurate estimate of the average number of individual classifier predictions needed to reach a stable ensemble prediction.

Acknowledgments

The authors acknowledge financial support from the Comunidad de Madrid (project CASI-CAM-CM S2013/ICE-2845) and from the Spanish Ministerio de Economía y Competitividad (projects TIN2013-42351-P and TIN2015-70308-REDT).

References

[1] A. Asuncion and D. Newman. UCI machine learning repository, 2007.

[2] J. Basilico, M. Munson, T. Kolda, K. Dixon, and W. Kegelmeyer. COMET: A recipe for learning and using large ensembles on massive data. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 41-50, 2011.

[3] D. Benbouzid, R. Busa-Fekete, and B. Kégl. Fast classification using sparse decision DAGs. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), volume 1, pages 951-958, 2012.

[4] L. Breiman. Bias, variance, and arcing classifiers. Technical Report 460, Statistics Department, University of California, 1996.

[5] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[6] R. Caruana and A. Niculescu-Mizil. Ensemble selection from libraries of models. In Proceedings of the 21st International Conference on Machine Learning (ICML'04), 2004.

[7] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161-168, New York, NY, USA, 2006. ACM Press.

[8] T. G. Dietterich. Ensemble methods in machine learning.
In Multiple Classifier Systems: First International Workshop, pages 1-15, 2000.

[9] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.

[10] W. Fan, F. Chu, H. Wang, and P. S. Yu. Pruning and dynamic scheduling of cost-sensitive ensembles. In Proceedings of the 18th National Conference on Artificial Intelligence, pages 146-151. American Association for Artificial Intelligence, 2002.

[11] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133-3181, 2014.

[12] T. Gao and D. Koller. Active classification based on value of classifier. In NIPS, 2011.

[13] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:993-1001, 1990.

[14] D. Hernández-Lobato, G. Martínez-Muñoz, and A. Suárez. Statistical instance-based pruning in ensembles of independent classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):364-369, 2009.

[15] T. K. Ho, J. J. Hull, and S. N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66-75, 1994.

[16] D. D. Margineantu and T. G. Dietterich. Pruning adaptive boosting. In Proceedings of the 14th International Conference on Machine Learning, pages 211-218. Morgan Kaufmann, 1997.

[17] F. Markatopoulou, G. Tsoumakas, and I. Vlahavas. Dynamic ensemble pruning based on multi-label classification. Neurocomputing, 150(PB):501-512, 2015.

[18] L. Reyzin. Boosting on a budget: Sampling for feature-efficient prediction. In L. Getoor and T.
Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 529-536, New York, NY, USA, June 2011. ACM.

[19] R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Tumer, Yolum, Sonenberg, and Stone, editors, Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2011), pages 761-768, Taipei, Taiwan, 2011.

[20] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226-235, New York, NY, USA, 2003. ACM Press.

[21] Y. Zhang, S. Burer, and W. N. Street. Ensemble pruning via semi-definite programming. Journal of Machine Learning Research, 7:1315-1338, 2006.