Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander Hartemink*

A graph-based prior is proposed for parametric semi-supervised classi- fication. The prior utilizes both labelled and unlabelled data; it also in- tegrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff be- tween the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is per- formed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors.

1 Introduction

In many pattern classification problems, the acquisition of labelled training data is costly and/or time consuming, whereas unlabelled samples can be obtained easily. Semi- supervised algorithms that learn from both labelled and unlabelled samples have been the focus of much research in the last few years; a comprehensive review up to 2001 can be found in [13], while more recent references include [1, 2, 6, 7, 1618].

Most recent semi-supervised learning algorithms work by formulating the assumption that "nearby" points, and points in the same structure (e.g., cluster), should have similar labels [6, 7, 16]. This can be seen as a form of regularization, pushing the class boundaries toward regions of low data density. This regularization is often implemented by associating the vertices of a graph to all the (labelled and unlabelled) samples, and then formulating the problem on the vertices of the graph [6, 1618].

While current graph-based algorithms are inherently transductive -- i.e., they cannot be used directly to classify samples not present when training -- our classifier is paramet- ric and the learned classifier can be used directly on new samples. Furthermore, our al- gorithm is trained discriminatively by maximizing a concave objective function; thus we avoid thorny local maxima issues that plague many earlier methods.

Unlike existing methods, our algorithm automatically learns the relative importance of the labelled and unlabelled data. When multiple views of the same sample are provided (e.g. features from different sensors), we develop a new Bayesian form of co-training [4]. In

addition, we also show how to exploit the unlabelled data and the redundant views of the sample (from co-training) in order to improve active label query selection [15].

The paper is organized as follows. Sec. 2 briefly reviews multinomial logistic regression. Sec. 3 describes the priors for semi-supervised learning and co-training. The EM algorithm derived to learn the classifiers is presented in Sec. 4. Active label selection is discussed in Sec. 5. Experimental results are shown in Sec. 6, followed by conclusions in Sec. 7.

2 Multinomial Logistic Regression

In an m-class supervised learning problem, one is given a labelled training set DL = {(x d 1, y ), . . . , (x )} 1 L, yL , where xi R is a feature vector and yi the corresponding (1) (m) class label. In "1-of-m" encoding, y = [y , . . . , y ] i is a binary vector, such that i i (c) (j) y = 1 and y = 0, for j = c, indicates that sample i belongs to class c. In multinomial i i logistic regression [5], the posterior class probabilities are modelled as

```
log P (y(c) = 1|x) = xT w(c) - log m exp(xT w(k)), for c = 1, . . . , m, (1) k=1
```

where w(c) d R is the class-c weight vector. Notice that since m P (y(c)= 1|x) = 1, c=1 one of the weight vectors is redundant; we arbitrarily choose to set w(m) = 0, and consider the (d (m-1))-dimensional vector w = [(w(1))T , ..., (w(m-1))T ]T . Estimation of w may be achieved by maximizing the log-likelihood (with Y {y , ..., y } 1 L ) [5]

```
(c) (w) log P (Y|w) = L m y xT w(c) - log m exp(xT w(j)) . (2) i=1 c=1 i i j=1 i
```

In the presence of a prior p(w), we seek a maximum a posteriori (MAP) estimate, w = arg max { (w) + log p(w)} w . Actually, if the training data is separable, (w) is unbounded, and a prior is crucial.

Although we focus on linear classifiers, we may see the d-dimensional feature vectors x as having resulted from some deterministic, maybe nonlinear, transformation of an input raw feature vector r; e.g., in a kernel classifier, xi = [1, K(ri, r1), ..., K(ri, rL)] (d = L + 1).

3 Graph-Based Data-Dependent Priors

3.1 Graph Laplacians and Regularization for Semi-Supervised Learning

Consider a scalar function f = [f1, ..., f|V |]T , defined on the set V = {1, 2, ..., |V |} of vertices of an undirected graph (V, E). Each edge of the graph, joining vertices i and j, is given a weight kij = kji 0, and we collect all the weights in a |V | |V | matrix K. A natural way to measure how much f varies across the graph is by the quantity

```
kij(fi - fj)2 = 2 f T f , (3) i j
```

where = diag{ k k j 1j , ..., j |V |j } - K is the so-called graph Laplacian [2]. Notice that kij 0 (for all i, j) guarantees that is positive semi-definite and also that has (at least) one null eigenvalue (1T1 = 0, where 1 has all elements equal to one).

In semi-supervised learning, in addition to DL, we are given U unlabelled samples DU = {xL+1, . . . , xL+U }. To use (3) for semi-supervised learning, the usual choice is to assign one vertex of the graph to each sample in X = [x1, . . . , xL+U ]T (thus |V | = L + U ), and to let kij represent some (non-negative) measure of "similarity" between xi and xj. A Gaussian random field (GRF) is defined on the vertices of V (with inverse variance )

```
p(f ) exp{- f T f /2},
```

in which configurations that vary more (according to (3)) are less probable. Most graph- based approaches estimate the values of f , given the labels, using p(f ) (or some modifica- tion thereof) as a prior. Accordingly, they work in a strictly transductive manner.

3.2 Non-Transductive Semi-Supervised Learning

We first consider two-class problems (m = 2, thus w d R ). In contrast to previous uses of graph-based priors, we define f as the real function f (defined over the entire observation space) evaluated at the graph nodes. Specifically, f is defined as a linear function of x, and at the graph node i, fi f (xi) = wT xi. Then, f = [f1, ..., f|V |]T = Xw, and p(f ) induces a Gaussian prior on w, with precision matrix A = XT X,

```
p(w) exp{-(/2) wT XT Xw} = exp{-(/2) wT Aw}. (4)
```

Notice that since is singular, A may also be singular, and the corresponding prior may therefore be improper. This is no problem for MAP estimation of w because (as is well known) the normalization factor of the prior plays no role in this estimate. If we include extra regularization, by adding a non-negative diagonal matrix to A, the prior becomes

```
p(w) exp -(1/2) wT (0A + ) w , (5)
```

where we may choose = diag{1, ..., d}, = 1I, or even = 0.

For m > 2, we define (m-1) identical independent priors, one for each w(c), c = 1, ..., m. The joint prior on w = [(w(1))T , ..., (w(m-1))T ]T is then

```
m-1 1 (c) 1 p(w|) exp{- (w(c))T A + (c) w(c)} = exp{- wT ()w}, (6) 2 0 2 c=1
(c) (c) (c) where is a vector containing all the parameters, (c) = diag{ , ..., }, and i 1 d
(1) (m-1) () = diag{ , ..., } A + block-diag{(1), ..., (m-1)}. (7) 0 0
```

Finally, since all the 's are inverses of variances, the conjugate priors are Gamma [3]: (c) (c) (c) (c) p( | | | | 0 0, 0) = Ga(0 0, 0), and p(i 1, 1) = Ga(i 1, 1), for c = 1, ..., m - 1 and i = 1, ..., d. Usually, 0, 0, 1, and 1 are given small values indicating diffuse priors. In the zero limit, we obtain scale-invariant (improper) Jeffreys hyper-priors.

Summarizing, our model for semi-supervised learning includes the log-likelihood (2), a prior (6), and Gamma hyper-priors. In Section 4, we present a simple and computationally efficient expectation-maximization (EM) algorithm for obtaining the MAP estimate of w.

3.3 Exploiting Features from Multiple Sensors: The Co-Training Prior

In some applications several sensors are available, each providing a different set of features. For simplicity, we assume two sensors s {1, 2}, but everything discussed here is easily (s) extended to any number of sensors. Denote the features from sensor s, for sample i, as x , i and Ss as the set of sample indices for which we have features from sensor s (S1 S2 = {1, ..., L + U }). Let O = S1 S2 be the indices for which both sensors are available, and OU = O {L + 1, ..., L + U } the unlabelled subset of O.

By using the samples in S1 and S2 as two independent training sets, we may obtain two sep- arate classifiers (denoted w1 and w2). However, we can coordinate the information from both sensors by using an idea known as co-training [4]: on the OU samples, classifiers w1 and w2 should agree as much as possible. Notice that, in a logistic regression framework, the disagreement between the two classifiers on the OU samples can be measured by (1) (2) [(w1)T x - (w2)T x ]2 = T C , (8) iOU i i

where = [(w1)T (w2)T ]T and C = [(x1)T (-x2)T ]T [(x1)T (-x2)T ]. This iOU i i i i suggests the "co-training prior" (where co is an inverse variance): p(w1, w2) = p() exp -(co/2) TC . (9)

This Gaussian prior can be combined with two smoothness Gaussian priors on w1 and w2 (obtained as described in Section 3.2); this leads to a prior which is still Gaussian,

```
p(w1, w2) = p() exp -(1/2) T coC + block-diag{1, 2} , (10)
```

where 1 and 2 are the two graph-based precision matrices (see (7)) for w1 and w2. We can again adopt a Gamma hyper-prior for co. Under this prior, and with a logistic regression likelihood as above, estimates of w1 and w2 can easily be found using minor modifications to the EM algorithm described in Section 4. Computationally, this is only slightly more expensive than separately training the two classifiers.

4 Learning Via EM

To find the MAP estimate w, we use the EM algorithm, with as missing data, which is equivalent to integrating out from the full posterior before maximization [8]. For simplicity, we will only describe the single sensor case (no co-training).

E-step: We compute the expected value of the complete log-posterior, given Y and the current parameter estimate w: Q(w|w) E[log p(w, |Y)|w]. Since

```
log p(w, |Y) = log p(Y|w) - (1/2)wT ()w + K, (11)
```

(where K collects all terms independent of w) is linear w.r.t. all the parameters (see (6) and (7)), we just have to plug their conditional expectations into (11):

```
Q(w|w) = log p(Y|w) - (1/2)wT E[()|w] w = (w) - (1/2)wT (w) w. (12) We consider several different choices for the structure of the matrix. The necessary expectations have well-known closed forms, due to the use of conjugate Gamma hyper- (c) priors [3]. For example, if the are m - 1 free non-negative parameters, we have 0 (c) (c) E[ |w] = (2 0 0 0 + d) [2 0 + (w(c))T Aw(c)]-1. (c) for c = 1, ..., m - 1. For = 0 0, we still have a simple closed-form expres- (c) sion for E[0|w], and the same is true for the parameters, for i > 0. Finally, i (w) E[()|w] results from replacing the 's in (7) by the corresponding conditional expectations.
```

M-step: Given matrix (w), the M-step reduces to a logistic regression problem with a quadratic regularizer, i.e., maximizing (12). To this end, we adopt the bound optimization approach (see details in [5, 11]). Let B be a positive definite matrix such that -B bounds below (in the matrix sense) the Hessian of (w), which is negative definite, and g(w) is the gradient of (w). Then, we have the following lower bound on Q(w|w):

```
Q(w|w) l(w) + (w - w)T g(w) - [(w - w)T B(w - w) + wT (w)w]/2. - The maximizer of this lower bound, wnew = (B + (w)) 1 (Bw + g(w)), is guaranteed to increase the Q-function, Q(wnew|w) Q(w|w), and we thus obtain a monotonic gen- eralized EM algorithm [5, 11]. This (maybe costly) matrix inversion can be avoided by a sequential approach where we only maximize w.r.t. one element of w at a time, preserving the monotonicity of the procedure. The sequential algorithm visits one particular element of w, say wu, and updates its estimate by maximizing the bound derived above, while keeping all other variables fixed at their previous values. This leads to - wnew = w ] [(B + (w)) 1 , u u + [gu(w) - ((w)w) (13) u uu] and wnew = w v v , for v = u. The total time required by a full sweep for all u = 1, ..., d is O(md(L + d)); this may be much better than the O((dm)3) of the matrix inversion.
```

5 Active Label Selection

If we are allowed to obtain the label for one of the unlabelled samples, the following ques- tion arises: which sample, if labelled, would provide the most information?

Consider the MAP estimate w provided by EM. Our approach uses a Laplace approxima- tion of the posterior p(w|Y) N (w|w, H-1), where H is the posterior precision matrix, i.e., the Hessian of minus the log-posterior H = 2(- log p(w|Y)). This approximation is known to be accurate for logistic regression under a Gaussian prior [14]. By treating (w) (the expectation of ()) as deterministic, we obtain an evidence-type approximation [14]

```
H = 2[- log(p(Y|w)p(w|(w)))] = (w) + L (diag{p } - p pT ) x , i=1 i i i ixT i
```

where pi is the (m - 1)-dimensional vector computed from (1), the c-th element of which indicates the probability that sample xi belongs to class c.

Now let x DU be an unlabelled sample and y its label. Assume that the MAP esti- mate w remains unchanged after including y. In Sec. 7 we will discuss the merits and shortcomings of this assumption, which is only strictly valid when L . Accepting it implies that after labeling x, and regardless of y, the posterior precision changes to H = H + (diag{p} - ppT ) xxT . (14) Since the entropy of a Gaussian with precision H is (-1/2) log |H| (up to an additive constant), the mutual information (MI) between y and w (i.e., the expected decrease in entropy of w when y is observed) is I(w; y) = (1/2) log {|H |/|H|}. Our criterion is then: the best sample to label is the one that maximizes I(w; y). Further insight into I(w; y) can be obtained in the binary case (where p is a scalar); here, the matrix identity |H + p(1 - p)xxT | = |H|(1 + p(1 - p)xT H-1x) yields I(w; y) = (1/2) log(1 + p(1 - p)xT H-1x). (15) This MI is larger when p 0.5, i.e., for samples with uncertain classifications. On the other hand, with p fixed, I(w; y) grows with xT H-1x, i.e., it is large for samples with high variance of the corresponding class probability estimate. Summarizing, (15) favors samples with uncertain class labels and high uncertainty in the class probability estimate.

6 Experimental Results

We begin by presenting two-dimensional synthetic examples to visually illustrate our semi- supervised classifier. Fig. 1 shows the utility of using unlabelled data to improve the deci-

Figure 1: Synthetic two-dimensional examples. (a) Comparison of the supervised logistic linear classifier (boundary shown as dashed line) learned only from the labelled data (shown in color) with the proposed semi-supervised classifier (boundary shown as solid line) which also uses the unlabelled samples (shown as dots). (b) A RBF kernel classifier obtained by our algorithm, using two labelled samples (shaded circles) and many unlabelled samples.

Figure 2: (a)-(c) Accuracy (on UCI datasets) of the proposed method, the supervised SVM, and the other semi-supervised classifiers mentioned in the text; a subset of samples is la- belled and the others are treated as unlabelled samples. In (d), a separate holdout set is used to evaluate the accuracy of our method versus the amount of labelled and unlabelled data.

sion boundary in linear and non-linear (kernel) classifiers (see figure caption for details).

Next we show results with linear classifiers on three UCI benchmark datasets. Results with nonlinear kernels are similar, and therefore omitted to save space. We compare our method against state-of-the-art semi-supervised classifiers: the GRF method of [18], the SGT method of [10], and the transductive SVM (TSVM) of [9]. For reference, we also present results for a standard SVM. To avoid unduly helping our method, we always use a k=5 nearest neighbors graph, though our algorithm is not very sensitive to k. To avoid disadvantaging other methods that do depend on such parameters, we use their best settings. Since these adjustments cannot be made in practice, the difference between our algorithm and the others is under-represented. Each point on the plots in Fig. 2(a)-(c) is an average of 20 trials: we randomly select 20 labelled sets which are used by every method. All remaining samples are used as unlabelled by the semi-supervised algorithms.

Figs. 2(a)-(c) are transductive, in the sense that the unlabelled and test data are the same. Our logistic GRF is non-transductive: after being trained, it may be applied to classify new data without re-training. In Fig. 2(d) we present non-transductive results for the Ionosphere data. Training took place using labelled and unlabelled data, and testing was performed on 200 new unseen samples. The results suggest that semi-supervised classifiers are most relevant when the labelled set is small relative to the unlabelled set (as is often the case).

Our final set of results address co-training (Sec. 3.3) and active learning (Sec. 5), applied to airborne sensing data for the detection of surface and subsurface land mines. Two sensors were used: (1) a 70-band hyper-spectral electro-optic (EOIR) sensor; (2) an X-band syn- thetic aperture radar (SAR). A simple (energy) "prescreener" detected potential targets; for each of these, two feature vectors were extracted, of sizes 420 and 9, for the EOIR and SAR sensors, respectively. 123 samples have features from the EOIR sensor alone, 398 from the

Figure 3: (a) Land mine detection ROC curves of classifiers designed using only hyper- spectral (EOIR) features, only SAR features, and both. (b) Number of landmines detected during the active querying process (dotted lines), for active training and random selection (for the latter the bars reflect one standard deviation about the mean). ROC curves (solid) are for the learned classifier as applied to the remaining samples.

SAR sensor alone, and 316 from both. This data will be made available upon request.

We first consider supervised and semi-supervised classification. For the purely supervised case, a sparseness prior is used (as in [14]). In both cases a linear classifier is employed. For the data for which only one sensor is available, 20% of it is labelled (selected randomly). For the data for which both sensors are available, 80% is labelled (again selected randomly). The results presented in Fig. 3(a) show that, in general, the semi-supervised classifiers outperform the corresponding supervised ones, and the classifier learned from both sensors is markedly superior to classifiers learned from either sensor alone.

In a second illustration, we use the active-learning algorithm (Sec. 5) to only acquire the 100 most informative labels. For comparison, we also show average results over 100 in- dependent realizations for random label query selection (error bars indicate one standard deviation). The results in Fig. 3(b) are plotted in two stages: first, mines and clutter are se- lected during the labeling process (dashed curves); then, the 100 labelled examples are used to build the final semi-supervised classifier, for which the ROC curve is obtained using the remaining unlabelled data (solid curves). Interestingly, the active-learning algorithm finds almost half of the mines while querying for labels. Due to physical limitations of the sen- sors, the rate at which mines are detected drops precipitously after approximately 90 mines are detected -- i.e., the remaining mines are poorly matched to the sensor physics.

Do not remove: This comment is monitored to verify that the site is working properly