{"title": "Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion", "book": "Advances in Neural Information Processing Systems", "page_first": 1457, "page_last": 1465, "abstract": "We introduce a new objective function for pool-based Bayesian active learning with probabilistic hypotheses. This objective function, called the policy Gibbs error, is the expected error rate of a random classifier drawn from the prior distribution on the examples adaptively selected by the active learning policy. Exact maximization of the policy Gibbs error is hard, so we propose a greedy strategy that maximizes the Gibbs error at each iteration, where the Gibbs error on an instance is the expected error of a random classifier selected from the posterior label distribution on that instance. We apply this maximum Gibbs error criterion to three active learning scenarios: non-adaptive, adaptive, and batch active learning. In each scenario, we prove that the criterion achieves near-maximal policy Gibbs error when constrained to a fixed budget. For practical implementations, we provide approximations to the maximum Gibbs error criterion for Bayesian conditional random fields and transductive Naive Bayes. Our experimental results on a named entity recognition task and a text classification task show that the maximum Gibbs error criterion is an effective active learning criterion for noisy models.", "full_text": "Active Learning for Probabilistic Hypotheses Using\n\nthe Maximum Gibbs Error Criterion\n\nNguyen Viet Cuong\n\nWee Sun Lee\n\nNan Ye\n\nDepartment of Computer Science\nNational University of Singapore\n\n{nvcuong,leews,yenan}@comp.nus.edu.sg\n\nKian Ming A. Chai\nHai Leong Chieu\n{ckianmin,chaileon}@dso.org.sg\n\nDSO National Laboratories, Singapore\n\nAbstract\n\nWe introduce a new objective function for pool-based Bayesian active learning\nwith probabilistic hypotheses. 
This objective function, called the policy Gibbs\nerror, is the expected error rate of a random classi\ufb01er drawn from the prior dis-\ntribution on the examples adaptively selected by the active learning policy. Exact\nmaximization of the policy Gibbs error is hard, so we propose a greedy strategy\nthat maximizes the Gibbs error at each iteration, where the Gibbs error on an\ninstance is the expected error of a random classi\ufb01er selected from the posterior\nlabel distribution on that instance. We apply this maximum Gibbs error criterion\nto three active learning scenarios: non-adaptive, adaptive, and batch active learn-\ning. In each scenario, we prove that the criterion achieves near-maximal policy\nGibbs error when constrained to a \ufb01xed budget. For practical implementations,\nwe provide approximations to the maximum Gibbs error criterion for Bayesian\nconditional random \ufb01elds and transductive Naive Bayes. Our experimental re-\nsults on a named entity recognition task and a text classi\ufb01cation task show that the\nmaximum Gibbs error criterion is an effective active learning criterion for noisy\nmodels.\n\n1\n\nIntroduction\n\nIn pool-based active learning [1], we select training data from a \ufb01nite set (called a pool) of unlabeled\nexamples and aim to obtain good performance on the set by asking for as few labels as possible. If a\nlarge enough pool is sampled from the true distribution, good performance of a classi\ufb01er on the pool\nimplies good generalization performance of the classi\ufb01er. Previous theoretical works on Bayesian\nactive learning mainly deal with the noiseless case, which assumes a prior distribution on a collection\nof deterministic mappings from observations to labels [2, 3]. 
A fixed deterministic mapping is then drawn from the prior, and it is used to label the examples.
In this paper, probabilistic hypotheses, rather than deterministic ones, are used to label the examples. We formulate the objective as a maximum coverage objective with a fixed budget: with a budget of k queries, we aim to select k examples such that the policy Gibbs error is maximal. The policy Gibbs error of a policy is the expected error rate of a Gibbs classifier¹ on the set adaptively selected by the policy. The policy Gibbs error is a lower bound of the policy entropy, a generalization of the Shannon entropy to general (both adaptive and non-adaptive) policies. For non-adaptive policies, the policy Gibbs error reduces to the Gibbs error for sets, which is a special case of a measure of uncertainty called the Tsallis entropy [4].

¹A Gibbs classifier samples a hypothesis from the prior for labeling.

Figure 1: An example of a non-adaptive policy tree (left) and an adaptive policy tree (right).

By maximizing policy Gibbs error, we hope to maximize the policy entropy, whose maximality implies the minimality of the posterior label entropy of the remaining unlabeled examples in the pool. Besides, by maximizing policy Gibbs error, we also aim to obtain a small expected error of a posterior Gibbs classifier (which samples a hypothesis from the posterior instead of the prior for labeling). A small expected error of the posterior Gibbs classifier is desirable, as it upper bounds the Bayes error but is at most twice of it.
Maximizing policy Gibbs error is hard, and we propose a greedy criterion, the maximum Gibbs error criterion (maxGEC), to solve it. With this criterion, the next query is made on the candidate (which may be one or several examples) that has maximum Gibbs error: the probability that a randomly sampled labeling does not match the actual labeling.
We investigate this criterion in three settings: the non-adaptive setting, the adaptive setting, and the batch setting (also called the batch mode setting) [5]. In the non-adaptive setting, the set of examples is not labeled until all examples in the set have been selected. In the adaptive setting, the examples are labeled as soon as they are selected, and the new information is used to select the next example. In the batch setting, we select a batch of examples, query their labels, and proceed to select the next batch taking the labels into account. In all these settings, we prove that maxGEC is near-optimal compared to the best policy that has maximal policy Gibbs error in the setting.
We examine how to compute the maxGEC criterion, particularly for large structured probabilistic models such as conditional random fields [6]. When inference in the conditional random field can be done efficiently, we show how to compute an approximation to the Gibbs error by sampling and efficient inference. We provide an approximation for maxGEC in the non-adaptive and batch settings with the Bayesian transductive Naive Bayes model. Finally, we conduct pool-based active learning experiments using maxGEC for a named entity recognition task with conditional random fields and a text classification task with Bayesian transductive Naive Bayes. The results show good performance of maxGEC in terms of the area under the curve (AUC).

2 Preliminaries

Let 𝒳 be a set of examples, 𝒴 be a fixed finite set of labels, and H be a set of probabilistic hypotheses. We assume H is finite, but our results extend readily to general H. For any probabilistic hypothesis h ∈ H, its application to an example x ∈ 𝒳 is a categorical random variable with support 𝒴, and we write P[h(x) = y|h] for the probability that h(x) has value y ∈ 𝒴.
We extend the notation to any sequence S of examples from 𝒳 and write P[h(S) = y|h] for the probability that h(S) has a labeling y ∈ 𝒴^{|S|}, where 𝒴^{|S|} is the set of all labelings of S. We operate within the Bayesian setting and assume a prior probability p_0[h] on H. We use p_D[h] to denote the posterior p_0[h|D] after observing a set D of labeled examples from 𝒳 × 𝒴.
A pool-based active learning algorithm is a policy for choosing training examples from a pool X ⊆ 𝒳. At the beginning, a fixed labeling y* of X is given by a hypothesis h drawn from the prior p_0[h] and is hidden from the learner. Equivalently, y* can be drawn from the prior label distribution p_0[y*; X]. For any distribution p[h], we use p[y; S] to denote the probability that the examples in S are assigned the labeling y by a hypothesis drawn randomly from p[h]. Formally, p[y; S] := Σ_{h∈H} p[h] P[h(S) = y|h]. When S is a singleton {x}, we write p[y; x] for p[{y}; {x}].
During the learning process, each time the learner selects an unlabeled example, its label will be revealed to the learner. A policy for choosing training examples is a mapping from a set of labeled examples to an unlabeled example to be queried. This can be represented by a policy tree, where a node represents the next example to be queried, and each edge from the node corresponds to a possible label. We use policy and policy tree as synonyms. Figure 1 illustrates two policy trees with their top three levels: in the non-adaptive setting, the policy ignores the labels of the previously selected examples, so all examples at the same depth of the policy tree are the same; in the adaptive setting, the policy takes into account the observed labels when choosing the next example.
A full policy tree for a pool X is a policy tree of height |X|.
A partial policy tree is a subtree of a full policy tree with the same root. The class of policies of height k will be denoted by Π_k. Our query criterion gives a method to build a full policy tree one level at a time. The main building block is the probability distribution p_0^π[·] over all possible paths from the root to the leaves for any (full or partial) policy tree π. This distribution over paths is induced from the uncertainty in the fixed labeling y* for X: since y* is drawn randomly from p_0[y*; X], the path ρ followed from the root to a leaf of the policy tree during the execution of π is also a random variable. If x_ρ (resp. y_ρ) is the sequence of examples (resp. labels) along path ρ, then the probability of ρ is p_0^π[ρ] := p_0[y_ρ; x_ρ].

3 Maximum Gibbs Error Criterion for Active Learning

A commonly used objective for active learning in the non-adaptive setting is to choose k training examples such that their Shannon entropy is maximal, as this reduces uncertainty in the later stage. We first give a generalization of the concept of Shannon entropy to general (both adaptive and non-adaptive) policies. Formally, the policy entropy of a policy π is

H(π) := E_{ρ∼p_0^π}[−ln p_0^π[ρ]].

From this definition, the policy entropy is the Shannon entropy of the paths in the policy. The policy entropy reduces to the Shannon entropy on a set of examples when the policy is non-adaptive. The following result gives a formal statement that maximizing policy entropy minimizes the uncertainty on the labels of the remaining unlabeled examples in the pool. Suppose a path ρ has been observed; the labels of the remaining examples in X \ x_ρ follow the distribution p_ρ[· ; X \ x_ρ], where p_ρ is the posterior obtained after observing (x_ρ, y_ρ).
The entropy of this distribution will be denoted by G(ρ) and will be called the posterior label entropy of the remaining examples given ρ. Formally, G(ρ) := −Σ_y p_ρ[y; X \ x_ρ] ln p_ρ[y; X \ x_ρ], where the summation is over all possible labelings y of X \ x_ρ. The posterior label entropy of a policy π is defined as G(π) := E_{ρ∼p_0^π} G(ρ).

Theorem 1. For any k ≥ 1, if a policy π in Π_k maximizes H(π), then π minimizes the posterior label entropy G(π).

Proof. It can be easily verified that H(π) + G(π) is the Shannon entropy of the label distribution p_0[· ; X], which is a constant (the detailed proof is in the supplementary). Thus, the theorem follows.

The usual maximum Shannon entropy criterion, which selects the next example x maximizing E_{y∼p_D[y;x]}[−ln p_D[y; x]], where D is the set of previously observed labeled examples, can be thought of as a greedy heuristic for building a policy π maximizing H(π). However, it is still unknown whether this greedy criterion has any theoretical guarantee, except for the non-adaptive case.
In this paper, we introduce a new objective for active learning: the policy Gibbs error. This new objective is a lower bound of the policy entropy, and there are near-optimal greedy algorithms to optimize it. Intuitively, the policy Gibbs error of a policy π is the expected probability for a Gibbs classifier to make an error on the set adaptively selected by π. Formally, we define the policy Gibbs error of a policy π as

V(π) := E_{ρ∼p_0^π}[1 − p_0^π[ρ]].   (1)

In the above equation, 1 − p_0^π[ρ] is the probability that a Gibbs classifier makes an error on the selected set along the path ρ. Theorem 2 below, which is straightforward from the inequality x ≥ 1 + ln x, states that the policy Gibbs error is a lower bound of the policy entropy.

Theorem 2. For any (full or partial) policy π, we have V(π) ≤ H(π).

Given a budget of k queries, our proposed objective is to find π* = arg max_{π∈Π_k} V(π), the height-k policy with maximum policy Gibbs error. By maximizing V(π), we hope to maximize the policy entropy H(π), and thus minimize the uncertainty in the remaining examples. Furthermore, we also hope to obtain a small expected error of a posterior Gibbs classifier, which upper bounds the Bayes error but is at most twice of it. Using this objective, we propose greedy algorithms for active learning that are provably near-optimal for probabilistic hypotheses. We will consider the non-adaptive, adaptive and batch settings.

3.1 The Non-adaptive Setting

In the non-adaptive setting, the policy π ignores the observed labels: it never updates the posterior. This is equivalent to selecting a set of examples before any labeling is done. In this setting, the examples selected along all paths of π are the same. Let x_π be the set of examples selected by π. The Gibbs error of a non-adaptive policy π is simply

V(π) = E_{y∼p_0[· ; x_π]}[1 − p_0[y; x_π]].

Thus, the optimal non-adaptive policy selects a set S of examples maximizing its Gibbs error, which is defined by ε_g^{p_0}(S) := 1 − Σ_y p_0[y; S]².
In general, the Gibbs error of a distribution P is 1 − Σ_i P[i]², where the summation is over the elements in the support of P. The Gibbs error is a special case of the Tsallis entropy used in nonextensive statistical mechanics [4] and is known to be monotone submodular [7]. From the properties of
From the properties of\nmonotone submodular functions [8], the greedy non-adaptive policy that selects the next example\n\ny p0[y; S]2.\n\nxi+1 = arg max\n\nx\n\n{\u0001p0\ng (Si \u222a {x})} = arg max\n\nx\n\np0[y; Si \u222a {x}]2},\n\n(2)\n\n{1 \u2212(cid:88)\n\ny\n\nwhere Si is the set of previously selected examples, is near-optimal compared to the best non-\nadaptive policy. This is stated below.\nTheorem 3. Given a budget of k \u2265 1 queries, let \u03c0n be the non-adaptive policy in \u03a0k selecting\nexamples using Equation (2), and let \u03c0\u2217\nn be the non-adaptive policy in \u03a0k with the maximum policy\nGibbs error. Then, V (\u03c0n) > (1 \u2212 1/e)V (\u03c0\u2217\nn).\n\n3.2 The Adaptive Setting\n\nIn the adaptive setting, a policy takes into account the observed labels when choosing the next\nexample. This is done via the posterior update after observing the label of a selected example.\nThe adaptive setting is the most common setting for active learning. We now describe a greedy\nadaptive algorithm for this setting that is near-optimal. Assume that the current posterior obtained\nafter observing the labeled examples D is pD. Our greedy algorithm selects the next example x that\nmaximizes \u0001pDg (x):\n\n\u2217\n\nx\n\n= arg max\n\nx\n\n\u0001pDg (x) = arg max\n\nx\n\npD[y; x]2}.\n\n(3)\n\n{1 \u2212(cid:88)\n\ny\u2208Y\n\nin Section 3.1, \u0001pDg (x) is in fact the Gibbs error of a 1-step policy with\nFrom the de\ufb01nition of \u0001pDg\nrespect to the prior pD. Thus, we call this greedy criterion the adaptive maximum Gibbs error\ncriterion (maxGEC). Note that in binary classi\ufb01cation where |Y| = 2, maxGEC selects the same\nexample as the maximum Shannon entropy and the least con\ufb01dence criteria. However, they are\ndifferent in the multi-class case. Theorem 4 below states that maxGEC is near-optimal compared to\nthe best adaptive policy with respect to the objective in Equation (1).\nTheorem 4. 
Given a budget of k ≥ 1 queries, let π_maxGEC be the adaptive policy in Π_k selecting examples using maxGEC, and let π* be the adaptive policy in Π_k with the maximum policy Gibbs error. Then V(π_maxGEC) > (1 − 1/e)V(π*).

The proof for this theorem is in the supplementary material. The main idea of the proof is to reduce probabilistic hypotheses to deterministic ones by expanding the hypothesis space. For deterministic hypotheses, we show that maxGEC is equivalent to maximizing the version space reduction objective, which is known to be adaptive monotone submodular [2]. Thus, we can apply a known result for optimizing adaptive monotone submodular functions [2] to obtain Theorem 4.

Algorithm 1 Batch maxGEC for Bayesian Batch Active Learning
  Input: Unlabeled pool X, prior p_0, number of iterations k, and batch size s.
  for i = 0 to k − 1 do
    S ← ∅
    for j = 0 to s − 1 do
      x* ← arg max_x ε_g^{p_i}(S ∪ {x});  S ← S ∪ {x*};  X ← X \ {x*}
    end for
    y_S ← Query-labels(S);  p_{i+1} ← Posterior-update(p_i, S, y_S)
  end for

3.3 The Batch Setting

In the batch setting [5], we query the labels of s (instead of 1) examples each time, and we do this for a given number k of iterations. After each iteration, we query the labeling of the selected batch and update the posterior based on this labeling. The new posterior can be used to select the next batch of examples. A non-adaptive policy can be seen as a batch policy that selects only one batch. Algorithm 1 describes a greedy algorithm for this setting, which we call the batch maxGEC algorithm. At iteration i of the algorithm with the posterior p_i, the batch S is first initialized to be empty; then s examples are greedily chosen one at a time using the criterion

x* = arg max_x ε_g^{p_i}(S ∪ {x}).   (4)

This is equivalent to running the non-adaptive greedy algorithm in Section 3.1 to select each batch. Query-labels(S) returns the true labeling y_S of S, and Posterior-update(p_i, S, y_S) returns the new posterior obtained from the prior p_i after observing y_S.
The following theorem states that batch maxGEC is near-optimal compared to the best batch policy with respect to the objective in Equation (1). The proof for this theorem is in the supplementary material. The proof also makes use of the reduction to deterministic hypotheses and the adaptive submodularity of version space reduction.

Theorem 5. Given a budget of k batches of size s, let π_b^maxGEC be the batch policy selecting k batches using batch maxGEC, and let π_b* be the batch policy selecting k batches with maximum policy Gibbs error. Then V(π_b^maxGEC) > (1 − e^{−(e−1)/e})V(π_b*).

This theorem has a different bounding constant than those in Theorems 3 and 4 because it uses two levels of approximation to compute the batch policy: at each iteration, it approximates the optimal batch by greedily choosing one example at a time using Equation (4) (the 1st approximation); then it uses these chosen batches to approximate the optimal batch policy (the 2nd approximation). In contrast, the fully adaptive case has batch size 1 and only needs the 2nd approximation, while the non-adaptive case chooses 1 batch and only needs the 1st approximation.
In the non-adaptive and batch settings, our algorithms need to sum over all labelings of the previously selected examples in a batch to choose the next example. This summation is usually expensive, and it restricts the algorithms to small batches.
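As a concrete illustration of Algorithm 1, the sketch below runs batch maxGEC on a toy Bayesian model with a finite hypothesis set. All names here are illustrative, not from the paper's implementation: the `like[h][x][y]` table stands in for P[h(x) = y | h], `oracle` stands in for Query-labels, and Posterior-update is exact Bayes rule over the finite hypothesis set. The greedy step enumerates every labeling of the current batch, which is exactly the summation over batch labelings discussed above; with s = 1 the loop reduces to adaptive maxGEC (Equation (3)).

```python
import itertools

def label_marginal(prior, like, S, ys):
    # p[y; S] = sum_h p[h] * prod_{x in S} P[h(x) = y_x | h]
    total = 0.0
    for h, ph in enumerate(prior):
        prob = ph
        for x, y in zip(S, ys):
            prob *= like[h][x][y]
        total += prob
    return total

def batch_gibbs_error(prior, like, S, labels):
    # Gibbs error of a batch S: 1 - sum over all labelings y of p[y; S]^2
    return 1.0 - sum(label_marginal(prior, like, S, ys) ** 2
                     for ys in itertools.product(labels, repeat=len(S)))

def batch_maxgec(prior, like, pool, labels, k, s, oracle):
    post = list(prior)
    pool = list(pool)
    for _ in range(k):
        S = []
        for _ in range(s):
            # Greedy step, Eq. (4): maximize the Gibbs error of S + {x}.
            best = max(pool, key=lambda x: batch_gibbs_error(post, like, S + [x], labels))
            S.append(best)
            pool.remove(best)
        yS = [oracle(x) for x in S]       # Query-labels(S)
        for x, y in zip(S, yS):           # Posterior-update(p_i, S, y_S)
            post = [p * like[h][x][y] for h, p in enumerate(post)]
        z = sum(post)
        post = [p / z for p in post]
    return post
```

The posterior returned after each iteration plays the role of the prior p_i for the next batch, exactly as in Algorithm 1.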
However, we note that small batches may be preferred in some real problems. For example, if there is a small number of annotators and labeling one example takes a long time, we may want to select a batch size that matches the number of annotators. In this case, the annotators can label the examples concurrently, and we can make use of the labels as soon as they are available. It would take a longer time to label a larger batch, and we cannot use the labels until all the examples in the batch are labeled.

4 Computing maxGEC

We now discuss how to compute maxGEC and batch maxGEC for some probabilistic models. Computing the values is often difficult, and we discuss some sampling methods for this task.

Algorithm 2 Approximation for Equation (4).
  Input: Selected unlabeled examples S, current unlabeled example x, current posterior p^c_D.
  Sample M label vectors (y^i)_{i=0}^{M−1} of (X \ T) ∪ T from p^c_D using Gibbs sampling, and set r ← 0.
  for i = 0 to M − 1 do
    for y ∈ 𝒴 do
      p̂^c_D[h(S) = y^i_S ∧ h(x) = y] ← M^{−1} |{y^j : y^j_S = y^i_S ∧ y^j_{{x}} = y}|
      r ← r + (p̂^c_D[h(S) = y^i_S ∧ h(x) = y])²
    end for
  end for
  return 1 − r

4.1 MaxGEC for Bayesian Conditional Exponential Models

A conditional exponential model defines the conditional probability P_λ[y⃗ | x⃗] of a structured labeling y⃗ given a structured input x⃗ as P_λ[y⃗ | x⃗] = exp(Σ_{i=1}^m λ_i F_i(y⃗, x⃗)) / Z_λ(x⃗), where λ = (λ_1, λ_2, ..., λ_m) is the parameter vector, F_i(y⃗, x⃗) is the total score of the i-th feature, and Z_λ(x⃗) = Σ_{y⃗} exp(Σ_{i=1}^m λ_i F_i(y⃗, x⃗)) is the partition function.
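For intuition, the sketch below instantiates this for a "flat" conditional exponential model with a single categorical output (multinomial logistic regression), where the partition function is a plain sum over the label set. The feature vectors are made up for illustration; the point is that the Gibbs error 1 − Σ_y P_λ[y|x]² of the label distribution can be computed from partition functions alone, since in this flat case Σ_y P_λ[y|x]² = Z_{2λ}(x) / Z_λ(x)² (an identity one can check directly from the softmax form).

```python
import math

def log_partition(lam, x):
    # log Z_lam(x) = log sum_y exp(lam_y . x), computed stably.
    scores = [sum(w * xi for w, xi in zip(lam_y, x)) for lam_y in lam]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def label_probs(lam, x):
    # P_lam[y | x] = exp(lam_y . x) / Z_lam(x)
    log_z = log_partition(lam, x)
    return [math.exp(sum(w * xi for w, xi in zip(lam_y, x)) - log_z)
            for lam_y in lam]

def gibbs_error_flat(lam, x):
    # 1 - sum_y P_lam[y|x]^2; for this flat model the sum of squares
    # equals Z_{2 lam}(x) / Z_lam(x)^2.
    return 1.0 - sum(p * p for p in label_probs(lam, x))
```

Doubling the parameter vector and reusing the same partition-function routine is the rearrangement that carries over to structured models, where Z_λ(x⃗) is computed by dynamic programming instead of an explicit sum.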
A well-known conditional exponential model is the linear-chain conditional random field (CRF) [6], in which x⃗ and y⃗ both have sequence structures. That is, x⃗ = (x_1, x_2, ..., x_{|x⃗|}) ∈ 𝒳^{|x⃗|} and y⃗ = (y_1, y_2, ..., y_{|x⃗|}) ∈ 𝒴^{|x⃗|}. In this model, F_i(y⃗, x⃗) = Σ_{j=1}^{|x⃗|} f_i(y_j, y_{j−1}, x⃗), where f_i(y_j, y_{j−1}, x⃗) is the score of the i-th feature at position j.
In the Bayesian setting, we assume a prior p_0[λ] = Π_{i=1}^m p_0[λ_i] on λ, where p_0[λ_i] = N(λ_i | 0, σ²) for a known σ. After observing the labeled examples D = {(x⃗^j, y⃗^j)}_{j=1}^t, we can obtain the posterior

p_D[λ] = p_0[λ | D] ∝ ( Π_{j=1}^t exp(Σ_{i=1}^m λ_i F_i(y⃗^j, x⃗^j)) / Z_λ(x⃗^j) ) · exp( −(1/2) Σ_{i=1}^m (λ_i / σ)² ).

For active learning, we need to estimate the Gibbs error in Equation (3) from the posterior p_D. For each x⃗, we can approximate the Gibbs error ε_g^{p_D}(x⃗) = 1 − Σ_{y⃗} p_D[y⃗; x⃗]² by sampling N hypotheses λ^1, λ^2, ..., λ^N from the posterior p_D. In this case,

ε_g^{p_D}(x⃗) ≈ 1 − N^{−2} Σ_{j=1}^N Σ_{t=1}^N Z_{λ^j+λ^t}(x⃗) / ( Z_{λ^j}(x⃗) Z_{λ^t}(x⃗) ).

The derivation of this formula is in the supplementary material. If we only use the MAP hypothesis λ* to approximate the Gibbs error (i.e., the non-Bayesian setting), then N = 1 and ε_g^{p_D}(x⃗) ≈ 1 − Z_{2λ*}(x⃗) / Z_{λ*}(x⃗)².
This approximation can be done efficiently if we can compute the partition functions Z_λ(x⃗) efficiently for any λ. This condition holds for a wide range of models, including logistic regression, linear-chain CRFs, semi-Markov CRFs [9], and sparse high-order semi-Markov CRFs [10].

4.2 Batch maxGEC for Bayesian Transductive Naive Bayes

We discuss an algorithm to approximate batch maxGEC for non-adaptive and batch active learning with Bayesian transductive Naive Bayes. First, we describe the Bayesian transductive Naive Bayes model for text classification. Let Y ∈ 𝒴 be a random variable denoting the label of a document and W ∈ 𝒲 be a random variable denoting a word. In a Naive Bayes model, the parameters are θ = {θ_y}_{y∈𝒴} ∪ {θ_{w|y}}_{w∈𝒲, y∈𝒴}, where θ_y = P[Y = y] and θ_{w|y} = P[W = w | Y = y]. For a document X and a label Y, if X = {W_1, W_2, ..., W_{|X|}}, where W_i is a word in the document, we model the joint distribution as P[X, Y] = θ_Y Π_{i=1}^{|X|} θ_{W_i|Y}.
In the Bayesian setting, we have a prior p_0[θ] such that θ_y ∼ Dirichlet(α) and θ_{w|y} ∼ Dirichlet(α_y) for each y. When we observe the labeled documents, we update the posterior by counting the labels and the words under each document label. The posterior parameters also follow Dirichlet distributions.
Let X be the original pool of training examples and T be the unlabeled testing examples. In the transductive setting, we work with the conditional prior p^c_0[θ] = p_0[θ | X; T]. For a set D = (T, y_T) of labeled examples, where T ⊆ X is the set of unlabeled examples and y_T is their labeling, the conditional posterior is p^c_D[θ] = p_0[θ | X; T; D] = p_D[θ | (X \ T) ∪ T], where p_D[θ] = p_0[θ | D] is the Dirichlet posterior of the non-transductive model.
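Because the Dirichlet priors are conjugate to the categorical distributions of the Naive Bayes model, the posterior update described above is literally counting: the posterior Dirichlet parameters are the prior pseudo-counts plus the observed counts. A minimal sketch, with hypothetical names and symmetric scalar pseudo-counts as a simplifying assumption:

```python
from collections import Counter

def posterior_params(alpha, alpha_y, docs, labels, classes, vocab):
    # Dirichlet-multinomial conjugacy: posterior parameter = prior
    # pseudo-count + observed count.
    a_class = {y: alpha + sum(1 for l in labels if l == y) for y in classes}
    a_word = {}
    for y in classes:
        counts = Counter(w for doc, l in zip(docs, labels) if l == y for w in doc)
        a_word[y] = {w: alpha_y + counts[w] for w in vocab}
    return a_class, a_word
```

Each θ_y and θ_{w|y} can then be sampled from Dirichlet distributions with these parameters.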
To implement the batch maxGEC algorithm, we need to estimate the Gibbs error in Equation (4) from the conditional posterior. Let S be the currently selected batch. For each unlabeled example x ∉ S, we need to estimate

1 − Σ_{y_S, y} (p^c_D[h(S) = y_S ∧ h(x) = y])² = 1 − E_{y_S}[ Σ_y (p^c_D[h(S) = y_S ∧ h(x) = y])² / p^c_D[y_S; S] ],

where the expectation is with respect to the distribution p^c_D[y_S; S]. We can use Gibbs sampling to approximate this expectation. First, we sample M label vectors y_{(X\T)∪T} of the remaining unlabeled examples from p^c_D using Gibbs sampling. Then, for each y_S, we estimate p^c_D[y_S; S] by counting the fraction of the M sampled vectors consistent with y_S. For each y_S and y, we also estimate p^c_D[h(S) = y_S ∧ h(x) = y] by counting the fraction of the M sampled vectors consistent with both y_S and y on S ∪ {x}. This approximation is equivalent to Algorithm 2. In the algorithm, y^i_S is the labeling of S according to y^i.

Table 1: AUC of different learning algorithms with batch size s = 10.

  Task                                      | TPass | maxGEC |  LC   | NPass | LogPass | LogFisher
  alt.atheism/comp.graphics                 | 87.43 |  91.69 | 91.66 | 84.98 |  91.63  |   93.92
  talk.politics.guns/talk.politics.mideast  | 84.92 |  92.03 | 92.16 | 80.80 |  86.07  |   88.36
  comp.sys.mac.hardware/comp.windows.x      | 73.17 |  93.60 | 92.27 | 74.41 |  85.87  |   88.71
  rec.motorcycles/rec.sport.baseball        | 93.82 |  96.40 | 96.23 | 92.33 |  89.46  |   93.90
  sci.crypt/sci.electronics                 | 60.46 |  85.51 | 85.86 | 60.85 |  82.89  |   87.72
  sci.space/soc.religion.christian          | 92.38 |  95.83 | 95.45 | 89.72 |  91.16  |   94.04
  soc.religion.christian/talk.politics.guns | 91.57 |  95.94 | 95.59 | 85.56 |  90.35  |   93.96
  Average                                   | 83.39 |  93.00 | 92.75 | 81.24 |  88.21  |   91.52

5 Experiments

5.1 Named Entity Recognition (NER) with CRF

In this experiment, we consider the NER task with the Bayesian CRF model described in Section 4.1.
We use a subset of the CoNLL 2003 NER task [11], which contains 1928 training and 969 test sentences. Following the setting in [12], we let the cost of querying the label sequence of each sentence be 1. We implement two versions of maxGEC with the approximation algorithm in Section 4.1: the first version approximates the Gibbs error by using only the MAP hypothesis (maxGEC-MAP), and the second version approximates the Gibbs error by using 50 hypotheses sampled from the posterior (maxGEC-50). We sample the hypotheses for maxGEC-50 from the posterior by the Metropolis-Hastings algorithm, with the MAP hypothesis as the initial point.
We compare the maxGEC algorithms with 4 other learning criteria: a passive learner (Passive), an active learner which chooses the longest unlabeled sequence (Longest), an active learner which chooses the unlabeled sequence with maximum Shannon entropy (SegEnt), and an active learner which chooses the unlabeled sequence with the least confidence (LeastConf). For SegEnt and LeastConf, the entropy and confidence are estimated from the MAP hypothesis. For all the algorithms, we use the MAP hypothesis for Viterbi decoding. To our knowledge, there is no simple way to compute the SegEnt or LeastConf criteria from a finite sample of hypotheses except by using only the MAP estimate. The difficulty is to compute a summation (a minimization for LeastConf) over all the outputs y⃗ in these complex structured models. For maxGEC, the summation can be rearranged to obtain the partition functions, which can be computed efficiently using known inference algorithms. This is thus an advantage of using maxGEC.
We compare the total area under the F1 curve (AUC) for each algorithm after querying the first 500 sentences. As a percentage of the maximum score of 500, the algorithms Passive, Longest, SegEnt, LeastConf, maxGEC-MAP and maxGEC-50 attain 72.8, 67.0, 75.4, 75.5, 75.8 and 76.0, respectively.
Hence, the maxGEC algorithms perform better than all the other algorithms, and significantly so compared with the Passive and Longest algorithms.

5.2 Text Classification with Bayesian Transductive Naive Bayes

In this experiment, we consider the text classification model in Section 4.2 with the meta-parameters α = (0.1, ..., 0.1) and α_y = (0.1, ..., 0.1) for all y. We implement batch maxGEC (maxGEC) with the approximation in Algorithm 2 and compare it with 5 other algorithms: a passive learner with the Bayesian transductive Naive Bayes model (TPass), a least-confidence active learner with the Bayesian transductive Naive Bayes model (LC), a passive learner with the Bayesian non-transductive Naive Bayes model (NPass), a passive learner with a logistic regression model (LogPass), and a batch active learner with the Fisher information matrix and a logistic regression model (LogFisher) [5]. To implement the least-confidence algorithm, we sample M label vectors as in Algorithm 2 and use them to estimate the label distribution for each unlabeled example. The algorithm then selects the s examples whose labels are least confident according to these estimates.
We run the algorithms on 7 binary tasks from the 20Newsgroups dataset [13] with batch sizes s = 10, 20, 30, and report the areas under the accuracy curve (AUC) for the case s = 10 in Table 1. The results for s = 20, 30 are in the supplementary material. The results are obtained by averaging over 5 different runs of the algorithms, and the AUCs are normalized so that their range is from 0 to 100. From the results, maxGEC obtains the best AUC scores on 4/7 tasks for each batch size, as well as the best average AUC scores. LC also performs well, and its scores are only slightly lower than those of maxGEC.
The passive learning algorithms are much worse than the active learning algorithms.

6 Related Work

Among pool-based active learning algorithms, greedy methods are the simplest and most common [14]. Often, the greedy algorithms try to maximize the uncertainty, e.g. the Shannon entropy, of the example to be queried [12]. For non-adaptive active learning, greedy optimization of the Shannon entropy guarantees near-optimal performance due to the submodularity of the entropy [2]. However, this has not been shown to extend to adaptive active learning, where each example is labeled as soon as it is selected, and the labeled examples are exploited in selecting the next example to label.

Although greedy algorithms work well in practice [12, 14], they usually do not have any theoretical guarantee except for the case where the data are noiseless. In the noiseless Bayesian setting, an algorithm called generalized binary search was proven to be near-optimal: its expected number of queries is within a factor of (ln(1/min_h p0[h]) + 1) of the optimum, where p0 is the prior [2]. This result was obtained using the adaptive submodularity of the version space reduction. Adaptive submodularity is an adaptive version of submodularity, a natural diminishing returns property. The adaptive submodularity of version space reduction was also applied to the batch setting to prove the near-optimality of a batch greedy algorithm that maximizes the average version space reduction for each selected batch [3]. The maxGEC and batch maxGEC algorithms that we proposed in this paper can be seen as generalizations of these version space reduction algorithms to the noisy setting. When the hypotheses are deterministic, our algorithms are equivalent to these version space reduction algorithms.

For the case of noisy data, a noisy version of the generalized binary search was proposed [15].
The algorithm was proven to be optimal under the neighborly condition, a very limited setting where "each hypothesis is locally distinguishable from all others" [15]. In another work, Bayesian active learning was modeled by the Equivalence Class Determination problem and a greedy algorithm called EC2 was proposed for this problem [16]. Although the cost of EC2 is provably near-optimal, this formulation requires an explicit noise model, and the near-optimality bound is only useful when the support of the noise model is small. Our formulation, in contrast, is simpler and does not require an explicit noise model: the noise model is implicit in the probabilistic model, and our algorithms are limited only by computational concerns.

7 Conclusion

We considered a new objective function for Bayesian active learning: the policy Gibbs error. With this objective, we described the maximum Gibbs error criterion for selecting the examples. The algorithm has near-optimality guarantees in the non-adaptive, adaptive and batch settings. We discussed algorithms to approximate the Gibbs error criterion for Bayesian CRFs and Bayesian transductive Naive Bayes. We also showed that the criterion is useful for NER with the CRF model and for text classification with the Bayesian transductive Naive Bayes model.

Acknowledgments

This work is supported by DSO grant DSOL11102 and the US Air Force Research Laboratory under agreement number FA2386-12-1-4031.

References

[1] Andrew McCallum and Kamal Nigam. Employing EM and Pool-Based Active Learning for Text Classification. In International Conference on Machine Learning (ICML), pages 350–358, 1998.

[2] Daniel Golovin and Andreas Krause. Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization. Journal of Artificial Intelligence Research, 42(1):427–486, 2011.

[3] Yuxin Chen and Andreas Krause.
Near-Optimal Batch Mode Active Learning and Adaptive Submodular Optimization. In International Conference on Machine Learning (ICML), pages 160–168, 2013.

[4] Constantino Tsallis and Edgardo Brigatti. Nonextensive statistical mechanics: A brief introduction. Continuum Mechanics and Thermodynamics, 16(3):223–235, 2004.

[5] Steven C. H. Hoi, Rong Jin, Jianke Zhu, and Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. In International Conference on Machine Learning (ICML), pages 417–424. ACM, 2006.

[6] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In International Conference on Machine Learning (ICML), pages 282–289, 2001.

[7] Bassem Sayrafi, Dirk Van Gucht, and Marc Gyssens. The implication problem for measure-based constraints. Information Systems, 33(2):221–239, 2008.

[8] G. L. Nemhauser and L. A. Wolsey. Best Algorithms for Approximating the Maximum of a Submodular Set Function. Mathematics of Operations Research, 3(3):177–188, 1978.

[9] Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. Advances in Neural Information Processing Systems (NIPS), 17:1185–1192, 2004.

[10] Viet Cuong Nguyen, Nan Ye, Wee Sun Lee, and Hai Leong Chieu. Semi-Markov Conditional Random Field with High-Order Features. In ICML Workshop on Structured Sparsity: Learning and Inference, 2011.

[11] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the 17th Conference on Natural Language Learning (HLT-NAACL 2003), pages 142–147, 2003.

[12] Burr Settles and Mark Craven.
An Analysis of Active Learning Strategies for Sequence Labeling Tasks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1070–1079. Association for Computational Linguistics, 2008.

[13] Thorsten Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Technical report, DTIC Document, 1996.

[14] Burr Settles. Active Learning Literature Survey. Technical Report 1648, University of Wisconsin-Madison, 2009.

[15] Robert Nowak. Noisy Generalized Binary Search. Advances in Neural Information Processing Systems (NIPS), 22:1366–1374, 2009.

[16] Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-Optimal Bayesian Active Learning with Noisy Observations. In Advances in Neural Information Processing Systems (NIPS), pages 766–774, 2010.