{"title": "Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 800, "abstract": null, "full_text": "Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields\nChi-Hoon Lee Department of Computing Science University of Alberta chihoon@cs.ualberta.ca Feng Jiao Department of Computing Science University of Waterloo fjiao@cs.uwaterloo.ca Shaojun Wang  Department of Computer Science and Engineering Wright State University shaojun.wang@wright.edu Dale Schuurmans, Russell Greiner Department of Computing Science University of Alberta {dale, greiner}@cs.ualberta.ca\n\nAbstract\nWe present a novel, semi-supervised approach to training discriminative random fields (DRFs) that efficiently exploits labeled and unlabeled training data to achieve improved accuracy in a variety of image processing tasks. We formulate DRF training as a form of MAP estimation that combines conditional loglikelihood on labeled data, given a data-dependent prior, with a conditional entropy regularizer defined on unlabeled data. Although the training objective is no longer concave, we develop an efficient local optimization procedure that produces classifiers that are more accurate than ones based on standard supervised DRF training. We then apply our semi-supervised approach to train DRFs to segment both synthetic and real data sets, and demonstrate significant improvements over supervised DRFs in each case.\n\n1 Introduction\nRandom field models are a popular probabilistic framework for representing complex dependencies in natural image data. The two predominant types of random field models correspond to generative versus discriminative graphical models respectively. 
Classical Markov random fields (MRFs) [2] follow a traditional generative approach, where one models the joint probability of the observed image along with the hidden label field over the pixels. Discriminative random fields (DRFs) [11, 10], on the other hand, directly model the conditional probability over the pixel label field given an observed image. In this sense, a DRF is equivalent to a conditional random field [12] defined over a 2-D lattice. Following the basic tenet of Vapnik [18], it is natural to anticipate that learning an accurate joint model should be more challenging than learning an accurate conditional model. Indeed, recent experimental evidence shows that DRFs tend to produce more accurate image labeling models than MRFs, in many applications like gesture recognition [15] and object detection [11, 10, 19, 17]. Although DRFs tend to produce superior pixel labellings to MRFs, partly by relaxing the assumption of conditional independence of observed images given the labels, the approach relies more heavily on supervised training. DRF training typically uses labeled image data where each pixel label has been assigned. However, it is considerably more difficult to obtain labeled data for image analysis than for other classification tasks, such as document classification, since hand-labeling the individual pixels of each image is much harder than assigning class labels to objects like text documents.\n\n\nWork done while at University of Alberta\n\n\f\nRecently, semi-supervised training has taken on an important new role in many application areas due to the abundance of unlabeled data. Consequently, many researchers are now working on developing semi-supervised learning techniques for a variety of approaches, including generative models [14], self-learning [5], co-training [3], information-theoretic regularization [6, 8], and graph-based transduction [22, 23, 24]. 
However, most of these techniques have been developed for univariate classification problems, or class label classification with a structured input [22, 23, 24]. Unfortunately, semi-supervised learning for structured classification problems, where the prediction variables are interdependent in complex ways, has not been as widely studied, with few exceptions [1, 9]. Current work on semi-supervised learning for structured predictors [1, 9] has focused primarily on simple sequence prediction tasks where learning and inference can be efficiently performed using standard dynamic programming. Unfortunately, the problem we address is more challenging, since the spatial correlations in a 2-D grid structure create numerous dependency cycles. That is, our graphical model structure prevents exact inference from being feasible. Kumar et al. [10] and Vishwanathan et al. [19] argue that learning a model in the context of approximate inference creates a greater risk of over-fitting and over-estimation. In this paper, we extend the work on semi-supervised learning for sequence predictors [1, 9], particularly the CRF based approach [9], to semi-supervised learning of DRFs. There are several advantages of our approach to semi-supervised DRFs. (1) We inherit the standard advantage of discriminative conditional versus joint model training, while still being able to exploit unlabeled data. (2) The use of unlabeled data enhances our ability to avoid parameter over-fitting and over-estimation in grid based random fields, even when the learner uses only approximate inference methods. (3) We are still able to model spatial correlations in a 2-D lattice, despite the fact that this introduces dependency cycles in the model. That is, our semi-supervised training procedure can be interpreted as a MAP estimator, where the parameter prior for the model on labeled data is governed by the conditional entropy of the model on unlabeled data. 
This allows us to learn local potentials that capture spatial correlations while often avoiding local over-estimation. We demonstrate the robustness of our model by applying it to a pixel denoising problem on synthetic images, and also to the challenging real world problem of segmenting tumors in magnetic resonance images. In each case, we have obtained significant improvements over current baselines based on standard DRF training.\n\n2 Semi-Supervised DRFs (SSDRFs)\nWe formulate a new semi-supervised DRF training principle based on the standard supervised formulation of [11, 10]. Let $x$ be an observed input image, represented by $x = \\{x_i\\}_{i \\in S}$, where $S$ is the set of observed image pixels (nodes). Let $y = \\{y_i\\}_{i \\in S}$ be the joint set of labels over all pixels of an image. For simplicity we assume each component $y_i \\in y$ ranges over binary classes $\\mathcal{Y} = \\{-1, 1\\}$. For example, $x$ might be a magnetic resonance image of a brain and $y$ a realization of a joint labeling over all pixels that indicates whether each pixel is normal or tumor. In this case, $\\mathcal{Y}$ would be the set of pre-defined pixel categories (e.g. tumor versus non-tumor). A DRF is a conditional random field defined on the pixel labels, conditioned on the observation $x$. More explicitly, the joint distribution over the labels $y$ given the observations $x$ is written\n$$p_\\theta(y|x) = \\frac{1}{Z_\\theta(x)} \\exp\\Big( \\sum_{i \\in S} \\Phi_w(y_i, x) + \\sum_{i \\in S} \\sum_{j \\in N_i} \\Phi_\\nu(y_i, y_j, x) \\Big) \\qquad (1)$$\nHere $N_i$ denotes the neighboring pixels of $i$. $\\Phi_w(y_i, x) = \\log \\sigma(y_i w^T h_i(x))$ denotes the node potential at pixel $i$, which quantifies the belief that the class label is $y_i$ for the pre-defined feature vector $h_i(x)$, where $\\sigma(t) = \\frac{1}{1 + e^{-t}}$. $\\Phi_\\nu(y_i, y_j, x) = y_i y_j \\nu^T \\mu_{ij}(x)$ is an edge potential that captures spatial correlations among neighboring pixels (here, the ones at positions $i$ and $j$), where $\\mu_{ij}(x)$ is the pre-defined feature vector associated with observation $x$. 
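As a rough illustration of the node and edge potentials just defined, the following Python sketch evaluates them for a single pixel and one of its neighbours. The parameter values and feature vectors are made up for the example, and the function names are hypothetical rather than taken from the paper.

```python
import math

def sigmoid(t):
    # Logistic function sigma(t) = 1 / (1 + exp(-t)).
    return 1.0 / (1.0 + math.exp(-t))

def node_potential(y_i, w, h_i):
    # log sigma(y_i * w^T h_i(x)) for a label y_i in {-1, +1}.
    score = sum(wk * hk for wk, hk in zip(w, h_i))
    return math.log(sigmoid(y_i * score))

def edge_potential(y_i, y_j, v, mu_ij):
    # y_i * y_j * v^T mu_ij(x): the sign flips when the two labels disagree.
    return y_i * y_j * sum(vk * mk for vk, mk in zip(v, mu_ij))

# Illustrative (made-up) parameters and features.
w = [0.5, -1.0]
h_i = [1.0, 0.2]
v = [0.8]
mu_ij = [1.0]

p_pos = math.exp(node_potential(+1, w, h_i))
p_neg = math.exp(node_potential(-1, w, h_i))
agree = edge_potential(+1, +1, v, mu_ij)
disagree = edge_potential(+1, -1, v, mu_ij)
```

Because sigma(t) + sigma(-t) = 1, the exponentiated node potentials for the two labels sum to one, which is the sense in which a DRF without edge potentials behaves like a logistic regression classifier.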
$Z_\\theta(x)$ is the normalizing factor, also known as a (conditional) partition function, which is\n$$Z_\\theta(x) = \\sum_{y} \\exp\\Big( \\sum_{i \\in S} \\Phi_w(y_i, x) + \\sum_{i \\in S} \\sum_{j \\in N_i} \\Phi_\\nu(y_i, y_j, x) \\Big) \\qquad (2)$$\nFinally, $\\theta = (w, \\nu)$ are the model parameters. When the edge potentials are set to zero, a DRF yields a standard logistic regression classifier. The potentials in a DRF can use properties of the observed image, and thereby relax the conditional independence assumption of MRFs. Moreover, the edge potentials in a DRF can smooth discontinuities between heterogeneous class pixels, and also correct errors made by the node potentials.\n\nAssume we have a set of independent labeled images, $D^l = \\{(x^{(1)}, y^{(1)}), \\ldots, (x^{(M)}, y^{(M)})\\}$, and a set of independent unlabeled images, $D^u = \\{x^{(M+1)}, \\ldots, x^{(T)}\\}$. Our goal is to build a DRF model from the combined set of labeled and unlabeled examples, $D^l \\cup D^u$. The standard supervised DRF training procedure is based on maximizing the log of the posterior probability of the labeled examples in $D^l$\n$$CL(\\theta) = \\sum_{k=1}^{M} \\log P(y^{(k)}|x^{(k)}) - \\frac{\\nu^T \\nu}{2\\tau^2} \\qquad (3)$$\nA Gaussian prior over the edge parameters $\\nu$ is assumed, with a uniform prior over the parameters $w$. Here $p(\\nu) = \\mathcal{N}(\\nu; 0, \\tau^2 I)$, where $I$ is the identity matrix. The hyperparameter $\\tau^2$ controls the strength of the resulting regularization term. In effect, the Gaussian prior introduces a form of regularization to limit over-fitting on rare features and avoid degeneracy in the case of correlated features. There are a few issues regarding the supervised learning criterion (3). First, the value of $\\tau^2$ is critical to the final result, and unfortunately selecting the appropriate $\\tau^2$ is a non-trivial task, which in turn makes the learning procedures more challenging and costly [13]. Second, the Gaussian prior is data-independent, and is not associated with either the unlabeled or labeled observations a priori. 
Inspired by the work in [8] and [9], we propose a semi-supervised learning algorithm for DRFs that makes full use of the available data by exploiting a form of entropy regularization as a prior over the parameters on $D^u$. Specifically, for a semi-supervised DRF, we attempt to find $\\theta$ that maximizes the following objective function\n$$RL(\\theta) = \\sum_{m=1}^{M} \\log p_\\theta(y^{(m)}|x^{(m)}) + \\gamma \\sum_{m=M+1}^{T} \\sum_{y} p_\\theta(y|x^{(m)}) \\log p_\\theta(y|x^{(m)}) \\qquad (4)$$\nThe first term of (4) is the conditional likelihood over the labeled data set $D^l$, and the second term is a conditional entropy prior over the unlabeled data set $D^u$, weighted by a tradeoff parameter $\\gamma$. The resulting estimate is then formulated as a MAP estimate. The goal of the objective (4) is to minimize the uncertainty on possible configurations over parameters. That is, minimizing the conditional entropy over unlabeled instances provides more confidence to the algorithm that the hypothetical labellings for the unlabeled data are consistent with the supervised labels, as greater certainty on the estimated labellings coincides with greater conditional likelihood on the supervised labels, and vice versa. This criterion has been shown to be effective for univariate classification [8], and chain structured CRFs [9]; here we apply it to the 2-D lattice case.\n\n3 Parameter Estimation\nSeveral factors constrain the form of the training algorithm. Because of overhead and the risk of divergence, it was not practical to employ a Newton method. Iterative scaling was not possible because the updates no longer have a closed form. Although the criticisms of gradient descent are well taken, it is the most practical approach available, so we adopt it to optimize the semi-supervised MAP formulation (4), which allows us to improve on standard supervised DRF training. To formulate a local optimization procedure, we need to compute the gradient of the objective (4) with respect to the parameters. 
Unfortunately, because of the nonlinear mapping function $\\sigma(\\cdot)$, we are not able to represent the gradient of the objective function as compactly as [9], which was able to express the gradient as a product of the covariance matrix of features and the parameter vector $\\theta$. Nevertheless, it is straightforward to show that the derivative of the objective function with respect to the node parameters $w$ is given by\n$$\\begin{aligned} \\frac{\\partial RL(\\theta)}{\\partial w} = {} & \\sum_{m=1}^{M} \\sum_{i \\in S^m} \\Big( y_i^{(m)} \\big(1 - \\sigma(y_i^{(m)} w^T h_i(x^{(m)}))\\big) - \\sum_{y} p_\\theta(y|x^{(m)}) \\, y_i \\big(1 - \\sigma(y_i w^T h_i(x^{(m)}))\\big) \\Big) h_i(x^{(m)}) \\\\ & + \\gamma \\sum_{m=M+1}^{T} \\sum_{i \\in S^m} \\Big( \\sum_{y} p_\\theta(y|x^{(m)}) \\big( \\Phi_w(y_i, x^{(m)}) + \\sum_{j \\in N_i} \\Phi_\\nu(y_i, y_j, x^{(m)}) \\big) y_i \\big(1 - \\sigma(y_i w^T h_i(x^{(m)}))\\big) \\\\ & \\quad - \\Big[ \\sum_{y} p_\\theta(y|x^{(m)}) \\big( \\Phi_w(y_i, x^{(m)}) + \\sum_{j \\in N_i} \\Phi_\\nu(y_i, y_j, x^{(m)}) \\big) \\Big] \\Big[ \\sum_{y} p_\\theta(y|x^{(m)}) \\, y_i \\big(1 - \\sigma(y_i w^T h_i(x^{(m)}))\\big) \\Big] \\Big) h_i(x^{(m)}) \\end{aligned} \\qquad (5)$$\nwhere the first term in (5) is the gradient of the supervised component of the DRF over labeled data, and the second term is the gradient of the conditional entropy prior of the DRF over unlabeled data. (Footnote 1: the derivatives of the objective function with respect to the edge parameters $\\nu$ are computed analogously.) Given the lattice structure of the joint labels, it is intractable to compute the exact expectation terms in the above derivatives. It is also intractable to compute the conditional partition function $Z_\\theta(x)$. Therefore, as in standard supervised DRFs, we need to incorporate some form of approximation. 
Following [2, 11, 10], we incorporate the pseudo-likelihood approximation, which assumes that the joint conditional distribution can be approximated as a product of the local posterior probabilities given the neighboring nodes and the observation\n$$p_\\theta(y|x) \\approx \\prod_{i \\in S} p_\\theta(y_i|y_{N_i}, x) \\qquad (6)$$\n$$p_\\theta(y_i|y_{N_i}, x) = \\frac{1}{z_i(x)} \\exp\\Big( \\Phi_w(y_i, x) + \\sum_{j \\in N_i} \\Phi_\\nu(y_i, y_j, x) \\Big) \\qquad (7)$$\nUsing the factored approximation in (6) and (7), we can reformulate the training objective as\n$$RL_{PL}(\\theta) = \\sum_{m=1}^{M} \\sum_{i \\in S^m} \\log p_\\theta(y_i^{(m)}|y_{N_i}^{(m)}, x^{(m)}) + \\gamma \\sum_{m=M+1}^{T} \\sum_{i \\in S^m} \\sum_{y_i} p_\\theta(y_i|y_{N_i}, x^{(m)}) \\log p_\\theta(y_i|y_{N_i}, x^{(m)}) \\qquad (8)$$\nHere, the derivative of the second term in (8), with respect to the potential parameters $w$ and $\\nu$, can be reformulated as a factored conditional entropy, yielding\n$$\\begin{aligned} \\frac{\\partial RL_{PL}(\\theta)}{\\partial w} = {} & \\sum_{m=1}^{M} \\sum_{i \\in S^m} \\Big( y_i^{(m)} \\big(1 - \\sigma(y_i^{(m)} w^T h_i(x^{(m)}))\\big) - \\sum_{y_i} p_\\theta(y_i|y_{N_i}, x^{(m)}) \\, y_i \\big(1 - \\sigma(y_i w^T h_i(x^{(m)}))\\big) \\Big) h_i(x^{(m)}) \\\\ & + \\gamma \\sum_{m=M+1}^{T} \\sum_{i \\in S^m} \\Big( \\sum_{y_i} p_\\theta(y_i|y_{N_i}, x^{(m)}) \\big( \\Phi_w(y_i, x^{(m)}) + \\sum_{j \\in N_i} \\Phi_\\nu(y_i, y_j, x^{(m)}) \\big) y_i \\big(1 - \\sigma(y_i w^T h_i(x^{(m)}))\\big) \\\\ & \\quad - \\Big[ \\sum_{y_i} p_\\theta(y_i|y_{N_i}, x^{(m)}) \\big( \\Phi_w(y_i, x^{(m)}) + \\sum_{j \\in N_i} \\Phi_\\nu(y_i, y_j, x^{(m)}) \\big) \\Big] \\Big[ \\sum_{y_i} p_\\theta(y_i|y_{N_i}, x^{(m)}) \\, y_i \\big(1 - \\sigma(y_i w^T h_i(x^{(m)}))\\big) \\Big] \\Big) h_i(x^{(m)}) \\end{aligned} \\qquad (9)$$\nNote that $\\frac{\\partial RL_{PL}(\\theta)}{\\partial \\nu}$ is computed analogously. Assuming the factorization, the true conditional entropy and feature expectations can be computed in terms of local conditional distributions. This allows us to efficiently approximate the global conditional entropy over unlabeled data. Note that there may be an over-smoothing issue associated with the pseudo-likelihood approximation, as mentioned in [10, 19]. 
However, due to the fast and stable performance of this approximation in the supervised case [2, 10], we still employ it, but show below that the over-smoothing effect is mitigated by our data-dependent prior in the MAP objective (4).\n\n4 Inference\nAs a result of our formulation, the learning method is tightly coupled with the inference steps. That is, for the unlabeled data, $X^U$, each time we compute the local conditional covariance (9), we perform inference steps for each node $i$ and its neighboring nodes $N_i$. Our inference is based on iterated conditional modes (ICM) [2], and is given by\n$$y_i^{*} = \\arg\\max_{y_i \\in \\mathcal{Y}} P(y_i|y_{N_i}, X) \\qquad (10)$$\nwhere, for each position $i$, we assume that the labels of all of its neighbors $y_{N_i}$ are fixed. We could alternatively compute the marginal conditional probability $P(y_i|X) = \\sum_{y_{S \\setminus i}} P(y_i, y_{S \\setminus i}|X)$ for each node using the sum-product algorithm (i.e. loopy belief propagation), which iteratively propagates the belief of each node to its neighbors. Clearly, there is a range of approximation methods available, each entailing different accuracy-complexity tradeoffs. However, we have found that ICM yields good performance on our tasks below, and is probably one of the simplest possible alternatives.\n\n5 Experiments\nUsing standard supervised DRF models, Kumar and Hebert [11, 10] reported interesting experimental results for joint classification tasks on a 2-D lattice, which represents an image with a DRF model. Since labeling image data is expensive and tedious, we believe that better results can be obtained by formulating DRF training as MAP estimation that also uses the abundant unlabeled image data. In this section, we present a series of experiments on synthetic and real data sets using our novel semi-supervised DRFs (SSDRFs). In order to evaluate our model, we compare the results with those using maximum likelihood estimation of supervised DRFs [11]. 
We consider the standard MLE DRF from [11] instead of the parameter regularized DRFs from [10] for one main reason: we want to show the difference between the ML and MAP principles without using any regularization term that can be problematic [10, 13].\nTo quantify the performance of each model, we used the Jaccard score $J = \\frac{TP}{TP + FP + FN}$, where TP denotes true positives, FP false positives, and FN false negatives. Although there are many accuracy measures available, we used this score to penalize the false negatives since many imaging tasks are very imbalanced: that is, only a small percentage of pixels are in the \"positive\" class. The tradeoff parameter, $\\gamma$, was hand-tuned on one held out data set and then held fixed at 0.2 for all of the experiments.\n\n5.1 Synthetic image sets\nOur primary goal in using synthetic data sets was to demonstrate how well different models classified pixels as a binary classification over a 2-D lattice in the presence of noise. We generated 18 synthetic data sets, each with its own shape. The intensities of pixels in each image were independently corrupted by noise generated from a Gaussian $\\mathcal{N}(0, 1)$. Figure 1 shows the results of using supervised DRFs, as well as semi-supervised DRFs. Previous work [10, 19] reported over-smoothing effects from the local approximation approach of PL, while our experiments indicate that the over-smoothing is caused not only by the PL approximation, but also by the sensitivity of the regularization to the parameters. However, using our semi-supervised DRF as a MAP formulation, we have dramatically improved the performance over the standard supervised DRF. Note that the first row in Figure 1 shows good results from the standard DRF, while the oversmoothed outputs are presented in the last row. 
Although the ML approach may learn proper parameters from some of the data sets, its performance is not consistent, since the edge potentials learned by the standard DRF tend to be overestimated. For instance, the last row shows that the DRF with overestimated parameters assigns almost all pixels to a single class, due to the complicated edges and structures containing non-target area within the target area, while semi-supervised DRF performance is not degraded at all. Overall, by learning more statistics from unlabeled data, our model dominates the standard DRF in most cases. This is because our MAP formulation avoids the overestimate of potentials and uses the edge potential to correct the errors made by the node potential. Figure 2(a) shows the results over 18 synthetic data sets. Each point above the diagonal line in Figure 2(a) indicates SSDRF producing higher Jaccard scores for a data set. Note that our model stably converged as we increased the ratio (nU/nL) of unlabeled data sets in our learning, as in Figure 2(b), where nU denotes the number of unlabeled images and nL the number of labeled images. Similar results have also been reported in simple single variable classification tasks [8].\n\nFigure 1: Outputs from synthetic data sets. From left to right: Testing instance, Ground Truth, Logistic Regression (LR), DRF, and SSDRF. (Jaccard scores annotated in the figure panels: J: 0.933890, J: 0.933377, J: 0.729527, J: 0.957983, J: 0.008178, J: 0.923836.)\n\nFigure 2: Accuracy and Convergence. (a) Accuracy from DRF (X axis) and SSDRF (Y axis) for all 18 synthetic data sets. (b) Log likelihood values (Y axis) for a testing image by increasing ratio (X axis) of unlabeled instances for SSDRF.\n\n
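The Jaccard score used in these experiments can be computed directly from a predicted mask and a ground-truth mask. The sketch below is a minimal Python illustration on a made-up 3x3 example, not the evaluation code behind the reported results; the function name and the convention for empty masks are assumptions.

```python
def jaccard_score(predicted, truth):
    # predicted and truth are equally sized 2-D lists in which 1 marks the
    # positive class and any other value (0 or -1) marks the negative class.
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t != 1:
                fp += 1
            elif p != 1 and t == 1:
                fn += 1
    denom = tp + fp + fn
    # Assumed convention: an empty prediction of an empty target scores 1.
    return tp / denom if denom else 1.0

truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
predicted = [[0, 1, 1],
             [0, 0, 1],
             [1, 0, 0]]
score = jaccard_score(predicted, truth)
```

In this toy case TP = 3, FP = 1 and FN = 1, so the score is 3/5 = 0.6. True negatives never enter the formula, which is why the measure stays informative when only a small fraction of pixels is positive.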
5.2 Brain Tumor Segmentation\nWe have applied our semi-supervised DRF model to the challenging real world problem of segmenting tumor in medical images. Our goal here is to classify each pixel of a magnetic resonance (MR) image into a pre-defined category: tumor or non-tumor. This is a very important, yet notoriously difficult, task in surgical planning and radiation therapy which currently involves a significant amount of manual work by human medical experts. We applied three models to the classification of 9 studies from brain tumor MR images. For each study $i$ (footnote 2), we divided the MR images into $D_i^L$, $D_i^U$, and $D_i^S$, where an MR image (a.k.a. slice) has three modalities available: T1, T2, and T1 contrast. Note that each modality for each slice has 66,564 pixels. As with much of the related work on automatic brain tumor segmentation (such as [7, 21]), our training is based on patient-specific data, where training MR images for a classifier are obtained from the patient to be tested. Note that the training sets and testing sets for a classifier are disjoint. Specifically, LR and DRF take $D_i^L$ as the training set and $D_i^U$ and $D_i^S$ as testing sets, while SSDRF takes $D_i^L$ and $D_i^U$ for training and $D_i^U$ and $D_i^S$ for testing. We segmented the \"enhancing\" tumor area, the region that appears hyper-intense after injecting the contrast agent (we also included non-enhancing areas contained within the enhancing contour). Tables 1 and 2 present Jaccard scores for testing on $D_i^U$ and $D_i^S$, respectively, for each study $p_i$. While the standard supervised DRF improves over its degenerate model LR by 1%, the semi-supervised DRF improves over the supervised DRF by 11%, which is significant at p < 0.00566 using a paired example t test. Considering the fact that MR images contain much noise and the three modalities are not consistent among slices of the same patient, our improvement is considerable. 
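The size of these improvements can be checked against the per-study scores. The Python sketch below recomputes the column averages of Tables 1 and 2 and the resulting gains; it assumes that the quoted 1% and 11% refer to differences between average Jaccard scores, which is an interpretation rather than something stated explicitly, and the helper names are hypothetical.

```python
# Per-study Jaccard scores (LR, DRF, SSDRF) for p1..p9, copied from Table 1.
table1 = [
    (53.84, 59.81, 59.81), (83.24, 83.65, 84.67), (30.72, 30.17, 75.76),
    (72.04, 76.16, 79.02), (73.26, 73.59, 75.25), (88.39, 89.61, 87.01),
    (69.33, 69.91, 75.60), (58.49, 58.89, 73.03), (60.85, 56.49, 83.91),
]
# Per-study Jaccard scores (LR, DRF, SSDRF) for p1..p9, copied from Table 2.
table2 = [
    (68.01, 68.75, 68.75), (69.61, 69.73, 70.06), (23.11, 21.90, 71.13),
    (56.52, 63.07, 68.40), (51.38, 52.36, 51.29), (85.65, 86.35, 85.43),
    (66.71, 68.68, 70.27), (44.92, 45.36, 73.09), (21.11, 20.16, 38.06),
]

def column_means(rows):
    # Average each of the three model columns over the studies.
    n = len(rows)
    return [sum(row[k] for row in rows) / n for k in range(3)]

def gains(rows):
    # Returns (DRF minus LR, SSDRF minus DRF) in average Jaccard points.
    lr, drf, ssdrf = column_means(rows)
    return drf - lr, ssdrf - drf

gain_u = gains(table1)
gain_s = gains(table2)
```

On both test sets the DRF gains roughly one point over LR and the SSDRF gains roughly eleven points over the DRF, in line with the figures quoted in the text.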
Figure 3 shows the segmentation results by overlaying the testing slices with segmented outputs from the three models. Each row demonstrates the segmentation for a slice, where the white blob areas for the slice correspond to the enhancing tumor area.\n\nFootnote 2: Each study involves a number (typically 21) of images of a single patient, here parallel axial slices through the head.\n\nTable 1: Jaccard Scores for testing on $D_i^U$.\nStudy | LR | DRF | SSDRF\np1 | 53.84 | 59.81 | 59.81\np2 | 83.24 | 83.65 | 84.67\np3 | 30.72 | 30.17 | 75.76\np4 | 72.04 | 76.16 | 79.02\np5 | 73.26 | 73.59 | 75.25\np6 | 88.39 | 89.61 | 87.01\np7 | 69.33 | 69.91 | 75.60\np8 | 58.49 | 58.89 | 73.03\np9 | 60.85 | 56.49 | 83.91\nAverage | 65.57 | 66.48 | 77.12\n\nTable 2: Jaccard Scores for testing on $D_i^S$.\nStudy | LR | DRF | SSDRF\np1 | 68.01 | 68.75 | 68.75\np2 | 69.61 | 69.73 | 70.06\np3 | 23.11 | 21.90 | 71.13\np4 | 56.52 | 63.07 | 68.40\np5 | 51.38 | 52.36 | 51.29\np6 | 85.65 | 86.35 | 85.43\np7 | 66.71 | 68.68 | 70.27\np8 | 44.92 | 45.36 | 73.09\np9 | 21.11 | 20.16 | 38.06\nAverage | 54.11 | 55.15 | 66.27\n\nFigure 3: From Left to Right: Human Expert, LR, DRF, and SSDRF.\n\n6 Conclusion\nWe have proposed a new semi-supervised learning algorithm for DRFs, which was formulated as MAP estimation with conditional entropy over unlabeled data as a data-dependent prior regularization. Our approach is motivated by the information-theoretic argument [8, 16] that unlabeled examples can provide the most benefit when classes have small overlap. We introduced a simple approximation approach for this new learning procedure that exploits the local conditional probability to efficiently compute the derivative of the objective function. We have applied this new approach to the problem of image pixel classification tasks. By exploiting the availability of auxiliary unlabeled data, we are able to improve the performance of the state of the art supervised DRF approach. 
Our semi-supervised DRF approach shares all of the benefits of the standard DRF training, including the ability to exploit arbitrary potentials in the presence of dependency cycles, while improving accuracy through the use of the unlabeled data. The main drawback is the increased training time involved in computing the derivative of the conditional entropy over unlabeled data. Nevertheless, the algorithm is efficient enough to be trained on unlabeled data sets, and obtains a significant improvement in classification accuracy over standard supervised training of DRFs as well as iid logistic regression classifiers. To further improve accuracy, we may apply loopy belief propagation [20] or graph-cuts [4] as an inference tool. Since our model is tightly coupled with inference steps during the learning, the proper choice of an inference algorithm will most likely improve segmentation results.\n\nAcknowledgments\nThis research is supported by the Alberta Ingenuity Centre for Machine Learning, Cross Cancer Institute, and NSERC. We gratefully acknowledge many helpful suggestions from members of the Brain Tumor Analysis Project, including Dr. A. Murtha and Dr. J. Sander.\n\nReferences\n[1] Y. Altun, D. McAllester, and M. Belkin. Maximum margin semi-supervised learning for structured variables. In NIPS 18, 2006.\n[2] J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B, 48(3):259-302, 1986.\n[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.\n[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. In ICCV (1), pages 377-384, 1999.\n[5] G. Celeux and G. Govaert. A classification EM algorithm for clustering and two stochastic versions. Comput. Stat. Data Anal., 14(3):315-332, 1992.\n[6] A. Corduneanu and T. Jaakkola. Data dependent regularization. In O. Chapelle, B. Schoelkopf, and A. 
Zien, editors, Semi-Supervised Learning. MIT Press, 2006.\n[7] C. Garcia and J.A. Moreno. Kernel based method for segmentation and modeling of magnetic resonance images. LNCS, 3315:636-645, Oct 2004.\n[8] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS 17, 2004.\n[9] F. Jiao, S. Wang, C. Lee, R. Greiner, and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In COLING/ACL, 2006.\n[10] S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependencies in natural images. In NIPS 16, 2003.\n[11] S. Kumar and M. Hebert. Discriminative random fields: A discriminative framework for contextual interaction in classification. In CVPR, 2003.\n[12] J. Lafferty, F. Pereira, and A. McCallum. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.\n[13] C. Lee, R. Greiner, and O. Zaïane. Efficient spatial classification using decoupled conditional random fields. In 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 272-283, 2006.\n[14] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103-134, 2000.\n[15] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object recognition. In NIPS 17, 2004.\n[16] S. Roberts, R. Everson, and I. Rezek. Maximum certainty data partitioning, 2000.\n[17] A. Torralba, K. Murphy, and W. Freeman. Contextual models for object detection using boosted random fields. In NIPS 17, 2004.\n[18] V. Vapnik. Statistical Learning Theory. John-Wiley, 1998.\n[19] S.V.N. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In ICML, 2006.\n[20] J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS 13, pages 689-695, 2000.\n[21] J. Zhang, K. Ma, M.H. Er, and V. Chong. 
Tumor segmentation from magnetic resonance imaging by learning via one-class support vector machine. Intl. Workshop on Advanced Image Technology, 2004.\n[22] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS 16, 2004.\n[23] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In ICML, 2005.\n[24] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, 2003.\n", "award": [], "sourceid": 3029, "authors": [{"given_name": "Chi-hoon", "family_name": "Lee", "institution": null}, {"given_name": "Shaojun", "family_name": "Wang", "institution": null}, {"given_name": "Feng", "family_name": "Jiao", "institution": null}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": null}, {"given_name": "Russell", "family_name": "Greiner", "institution": null}]}