{"title": "Clustering Under Prior Knowledge with Application to Image Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 401, "page_last": 408, "abstract": null, "full_text": "Clustering Under Prior Knowledge with Application\n\nto Image Segmentation\n\nM\u00b4ario A. T. Figueiredo\n\nInstituto de Telecomunicac\u00b8 \u02dcoes\n\nInstituto Superior T\u00b4ecnico\n\nTechnical University of Lisbon\n\nPortugal\n\nDong Seon Cheng, Vittorio Murino\n\nVision, Image Processing, and Sound Laboratory\n\nDipartimento di Informatica\n\nUniversity of Verona\n\nItaly\n\nmario.\ufb01gueiredo@lx.it.pt\n\ncheng@sci.univr.it, vittorio.murino@univr.it\n\nAbstract\n\nThis paper proposes a new approach to model-based clustering under prior knowl-\nedge. The proposed formulation can be interpreted from two different angles: as\npenalized logistic regression, where the class labels are only indirectly observed\n(via the probability density of each class); as \ufb01nite mixture learning under a group-\ning prior. To estimate the parameters of the proposed model, we derive a (gener-\nalized) EM algorithm with a closed-form E-step, in contrast with other recent\napproaches to semi-supervised probabilistic clustering which require Gibbs sam-\npling or suboptimal shortcuts. We show that our approach is ideally suited for\nimage segmentation: it avoids the combinatorial nature Markov random \ufb01eld pri-\nors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based)\nin a simple and computationally ef\ufb01cient way. Finally, we extend our formulation\nto work in unsupervised, semi-supervised, or discriminative modes.\n\n1 Introduction\n\nMost approaches to semi-supervised learning (SSL) see the problem from one of two (dual) per-\nspectives: supervised classi\ufb01cation with additional unlabelled data (see [20] for a recent survey);\nclustering with prior information or constraints (e.g., [4, 10, 11, 15, 17]). 
The second perspective, usually termed semi-supervised clustering (SSC), is adopted when labels are totally absent, but there are (usually pair-wise) relations that one wishes to enforce or encourage.

Most SSC techniques work by incorporating the constraints (or prior) into classical algorithms such as K-means or EM for mixtures. The semi-supervision may be hard (i.e., grouping constraints [15, 17]), or take the form of a prior under which probabilistic clustering is performed [4, 11]. The latter is clearly the most natural formulation for cases where one wishes to encourage, not enforce, certain relations; an obvious example is image segmentation, seen as clustering under a spatial prior, where neighboring sites should be encouraged, but not constrained, to belong to the same cluster/segment. However, previous EM-type algorithms for this class of methods have a major drawback: the presence of the prior makes the E-step non-trivial, forcing the use of expensive Gibbs sampling [11] or suboptimal methods such as the iterated conditional modes algorithm [4].

In this paper, we introduce a new approach to mixture-based SSC, leading to a simple, fully deterministic, generalized EM (GEM) algorithm. The keystone is the formulation of SSC as a penalized logistic regression problem, where the labels are only indirectly observed. The linearity of the resulting complete log-likelihood, w.r.t. the missing group labels, underlies the simplicity of the resulting GEM algorithm. When applied to image segmentation, our method allows using spatial priors which are typical of image estimation problems (e.g., restoration/denoising), such as Gaussian fields or wavelet-based priors. 
Under these priors, the M-step of our GEM algorithm reduces to a simple image denoising procedure, for which there are several extremely efficient algorithms.

2 Formulation

We start from the standard formulation of finite mixture models: $\mathcal{X} = \{x_1, ..., x_n\}$ is an observed data set, where each $x_i \in \mathbb{R}^d$ was generated (independently) according to one of a set of $K$ probability (density or mass) functions $\{p(\cdot|\phi^{(1)}), ..., p(\cdot|\phi^{(K)})\}$. In image segmentation, each $x_i$ is a pixel value (gray scale, $d = 1$; color, $d = 3$) or a vector of local (e.g., texture) features. Associated with $\mathcal{X}$, there is a hidden label set $\mathcal{Y} = \{y_1, ..., y_n\}$, where $y_i = [y_i^{(1)}, ..., y_i^{(K)}]^T \in \{0,1\}^K$, with $y_i^{(k)} = 1$ if and only if $x_i$ was generated by source $k$ (the so-called "1-of-K" binary encoding). Thus,

$$p(\mathcal{X}|\mathcal{Y}, \phi) = \prod_{k=1}^{K} \prod_{i:\, y_i^{(k)}=1} p(x_i|\phi^{(k)}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[ p(x_i|\phi^{(k)}) \right]^{y_i^{(k)}},  (1)$$

where $\phi = (\phi^{(1)}, ..., \phi^{(K)})$ is the set of parameters of the generative models of the classes. In standard mixture models, all the $y_i$ are assumed to be independent and identically distributed samples following a multinomial distribution with probabilities $\{\eta^{(1)}, ..., \eta^{(K)}\}$, i.e., $P(\mathcal{Y}) = \prod_i \prod_k (\eta^{(k)})^{y_i^{(k)}}$. This is the part of standard mixture models that has to be modified in order to insert grouping constraints [15] or a grouping prior $p(\mathcal{Y})$ [4, 11]. However, such a prior destroys the simplicity of the standard E-step for finite mixtures, which is critically based on the independence assumption. 
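As a concrete illustration of the complete-data likelihood (1), here is a minimal numpy sketch; the one-dimensional Gaussian class densities and the hand-picked labels are illustrative, not from the paper:

```python
import numpy as np

# n = 4 points, K = 3 hypothetical 1-D Gaussian class models p(. | phi^(k))
x = np.array([0.1, 2.2, 4.0, 3.9])
means, std = np.array([0.0, 2.0, 4.0]), 0.6
labels = [0, 1, 2, 2]                      # source of each x_i (illustrative)
Y = np.eye(3)[labels]                      # "1-of-K" binary encoding y_i

# log p(x_i | phi^(k)) for all i, k (Gaussian log-density, pure numpy)
logp = -0.5 * ((x[:, None] - means[None, :]) / std) ** 2 \
       - 0.5 * np.log(2 * np.pi * std ** 2)

# log of eq. (1): sum_i sum_k y_i^(k) * log p(x_i | phi^(k))
complete_loglik = float(np.sum(Y * logp))
```

Because each $y_i$ is a one-hot vector, the double product in (1) simply picks out, for each point, the density of the class that generated it.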
We follow a different route to avoid that roadblock. Let the hidden labels $\mathcal{Y} = \{y_1, ..., y_n\}$ depend on a new set of variables $\mathcal{Z} = \{z_1, ..., z_n\}$, where each $z_i = [z_i^{(1)}, ..., z_i^{(K)}]^T \in \mathbb{R}^K$, following a multinomial logistic model [5]:

$$P(\mathcal{Y}|\mathcal{Z}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left( P[y_i^{(k)}=1|z_i] \right)^{y_i^{(k)}}, \quad \text{where} \quad P[y_i^{(k)}=1|z_i] = \frac{e^{z_i^{(k)}}}{\sum_{l=1}^{K} e^{z_i^{(l)}}}.  (2)$$

Due to the normalization, we can set (w.l.o.g.) $z_i^{(K)} = 0$, for $i = 1, ..., n$ [5]. We're thus left with $n(K-1)$ real variables, i.e., $\mathcal{Z} = \{z^{(1)}, ..., z^{(K-1)}\}$, where $z^{(k)} = [z_1^{(k)}, ..., z_n^{(k)}]^T$; of course, $\mathcal{Z}$ can be seen as an $n \times (K-1)$ matrix, where $z^{(k)}$ is the $k$-th column and $z_i$ is the $i$-th row.

With this formulation, certain grouping preferences may be expressed by a prior $p(\mathcal{Z})$. For example, preferred pair-wise relations can be easily embodied in a Gaussian prior

$$p(\mathcal{Z}) \propto \prod_{k=1}^{K-1} \exp\left[ -\frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{i,j} \left( z_i^{(k)} - z_j^{(k)} \right)^2 \right] = \prod_{k=1}^{K-1} \exp\left[ -\frac{1}{2} (z^{(k)})^T \Delta\, z^{(k)} \right],  (3)$$

where $A$ is a matrix (with a null diagonal) encoding pair-wise preferences ($A_{i,j} > 0$ expresses preference, with strength proportional to $A_{i,j}$, for having points $i$ and $j$ in the same cluster) and $\Delta$ is the well-known graph-Laplacian matrix [20],

$$\Delta = \text{diag}\left\{ \sum\nolimits_{j=1}^{n} A_{1,j},\; ...,\; \sum\nolimits_{j=1}^{n} A_{n,j} \right\} - A.  (4)$$

For image segmentation, each $z^{(k)}$ is an image with real-valued elements and a natural choice for $A$ is to have $A_{i,j} = \lambda$, if $i$ and $j$ are neighbors, and zero otherwise. 
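The Laplacian (4) for this image case can be assembled directly; a minimal numpy sketch on an illustrative 3x3 grid (not from the paper), which also checks the identity linking the two forms of (3), i.e., that the pair-wise sum equals the quadratic form:

```python
import numpy as np

h, w, lam = 3, 3, 1.0                      # illustrative grid size and weight
n = h * w
A = np.zeros((n, n))                       # pair-wise preference matrix
for i in range(h):
    for j in range(w):
        p = i * w + j
        if j + 1 < w:                      # horizontal neighbors
            A[p, p + 1] = A[p + 1, p] = lam
        if i + 1 < h:                      # vertical neighbors
            A[p, p + w] = A[p + w, p] = lam

# eq. (4): Delta = diag(row sums of A) - A
Delta = np.diag(A.sum(axis=1)) - A

# sanity check of (3): (1/4) sum_ij A_ij (z_i - z_j)^2 == (1/2) z^T Delta z
z = np.random.default_rng(1).normal(size=n)
pairwise = 0.25 * np.sum(A * (z[:, None] - z[None, :]) ** 2)
quadratic = 0.5 * z @ Delta @ z
```

Note the factor 1/4 in (3): the double sum counts each pair twice, and the symmetric double-counting is exactly what turns it into $\frac{1}{2} z^T \Delta z$.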
Assuming periodic boundary conditions for the neighborhood system, $\Delta$ is a block-circulant matrix with circulant blocks [2]. However, as shown below, other more sophisticated priors (such as wavelet-based priors) can also be used at no additional computational cost [1].

3 Model Estimation

3.1 Marginal Maximum A Posteriori and the GEM Algorithm

Based on the formulation presented in the previous section, SSC is performed by estimating $\mathcal{Z}$ and $\phi$, seeing $\mathcal{Y}$ as missing data. The marginal maximum a posteriori estimate is obtained by marginalizing out the hidden labels (over all the possible label configurations),

$$(\widehat{\mathcal{Z}}, \widehat{\phi}) = \arg\max_{\mathcal{Z}, \phi} \sum_{\mathcal{Y}} p(\mathcal{X}, \mathcal{Y}, \mathcal{Z}|\phi) = \arg\max_{\mathcal{Z}, \phi} \sum_{\mathcal{Y}} p(\mathcal{X}|\mathcal{Y}, \phi)\, P(\mathcal{Y}|\mathcal{Z})\, p(\mathcal{Z}),  (5)$$

where we're assuming a flat prior for $\phi$. One of the key advantages of this approach is that (5) is a continuous (not combinatorial) optimization problem. This is in contrast with Markov random field approaches to image segmentation, which lead to hard combinatorial problems, since they perform optimization directly with respect to the (discrete) label variables $\mathcal{Y}$. Finally, notice that once in possession of an estimate $\widehat{\mathcal{Z}}$, one may compute $P(\mathcal{Y}|\widehat{\mathcal{Z}})$, which gives the probability that each data point belongs to each class. 
By finding $\arg\max_k P[y_i^{(k)}=1|z_i]$, for every $i$, one may obtain a hard clustering/segmentation.

We handle (5) with a generalized EM (GEM) algorithm [13], i.e., by applying the following iterative procedure (until some convergence criterion is satisfied):

E-step: Compute the conditional expectation of the complete log-posterior, given the current estimates $(\widehat{\mathcal{Z}}, \widehat{\phi})$ and the observations $\mathcal{X}$:

$$Q(\mathcal{Z}, \phi|\widehat{\mathcal{Z}}, \widehat{\phi}) = E_{\mathcal{Y}}\left[ \log p(\mathcal{Y}, \mathcal{Z}, \phi|\mathcal{X}) \,\big|\, \widehat{\mathcal{Z}}, \widehat{\phi}, \mathcal{X} \right].  (6)$$

M-step: Update the estimate: $(\widehat{\mathcal{Z}}, \widehat{\phi}) \leftarrow (\widehat{\mathcal{Z}}_{\text{new}}, \widehat{\phi}_{\text{new}})$, with new values such that

$$Q(\widehat{\mathcal{Z}}_{\text{new}}, \widehat{\phi}_{\text{new}}|\widehat{\mathcal{Z}}, \widehat{\phi}) \geq Q(\widehat{\mathcal{Z}}, \widehat{\phi}|\widehat{\mathcal{Z}}, \widehat{\phi}).  (7)$$

Under mild conditions, it is well known that GEM algorithms converge to a local maximum of the marginal log-posterior [18].

3.2 E-step

The complete log-posterior is

$$\log p(\mathcal{Y}, \mathcal{Z}, \phi|\mathcal{X}) \doteq \log p(\mathcal{X}|\mathcal{Y}, \phi) + \log P(\mathcal{Y}|\mathcal{Z}) + \log p(\mathcal{Z}) \doteq \sum_{i=1}^{n} \sum_{k=1}^{K} y_i^{(k)} \log p(x_i|\phi^{(k)}) + \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_i^{(k)} z_i^{(k)} - \log \sum_{k=1}^{K} e^{z_i^{(k)}} \right] + \log p(\mathcal{Z}),  (8)$$

where $\doteq$ stands for "equal up to an additive constant". The key observation is that this function is linear w.r.t. the hidden variables $y_i^{(k)}$. Consequently, the E-step reduces to computing their conditional expectations, which are then plugged into (8).

As in standard mixtures, each missing $y_i^{(k)}$ is binary, thus its expectation (denoted $\widehat{y}_i^{(k)}$) equals its posterior probability of being equal to one, easily obtained via Bayes law:

$$\widehat{y}_i^{(k)} \equiv E[y_i^{(k)}|\widehat{\mathcal{Z}}, \widehat{\phi}, \mathcal{X}] = P[y_i^{(k)}=1|\widehat{z}_i, \widehat{\phi}, x_i] = \frac{p(x_i|\widehat{\phi}^{(k)})\; P[y_i^{(k)}=1|\widehat{z}_i]}{\sum_{j=1}^{K} p(x_i|\widehat{\phi}^{(j)})\; P[y_i^{(j)}=1|\widehat{z}_i]}.  (9)$$

Notice that this is the same as the E-step for a standard finite mixture, where the probabilities $P[y_i^{(k)}=1|\widehat{z}_i]$ (given by (2)) play the role of the probabilities of the classes/components. Finally, the Q function is obtained by plugging the expectations $\widehat{y}_i^{(k)}$ into (8).

3.3 M-Step

It's clear from (8) that the maximization w.r.t. $\phi$ can be performed separately for each $\phi^{(k)}$,

$$\widehat{\phi}^{(k)}_{\text{new}} = \arg\max_{\phi^{(k)}} \sum_{i=1}^{n} \widehat{y}_i^{(k)} \log p(x_i|\phi^{(k)}).  (10)$$

This is the familiar weighted maximum likelihood criterion, exactly as it appears in EM for standard mixtures. The explicit form of this update depends on the choice of $p(\cdot|\phi^{(k)})$; e.g., this step can be easily applied to any finite mixture of exponential family densities [3].

In supervised image segmentation, these parameters are known (e.g., previously estimated from training data) and thus it's not necessary to estimate them; the M-step reduces to the estimation of $\mathcal{Z}$. In unsupervised image segmentation, $\phi$ is unknown and (10) will have to be applied.

To update the estimate of $\mathcal{Z}$, we need to maximize (or at least improve, see (7))

$$L(\mathcal{Z}|\widehat{\mathcal{Y}}) \equiv \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} \widehat{y}_i^{(k)} z_i^{(k)} - \log \sum_{k=1}^{K} e^{z_i^{(k)}} \right] + \log p(\mathcal{Z}).  (11)$$

Without the prior, this would be a simple logistic regression (LR) problem, with an identity design matrix [5]; however, instead of the usual hard labels $y_i^{(k)} \in \{0,1\}$, we have "soft" labels $\widehat{y}_i^{(k)} \in [0,1]$. Arguably, the two standard approaches to maximum likelihood LR are the Newton-Raphson algorithm (a.k.a. iteratively reweighted least squares, IRLS [7]) and the minorize-maximize (MM) approach (formerly known as bound optimization) [5, 9]. We will show below that the MM approach can be easily modified to accommodate the presence of a prior.

Let's briefly review the MM approach for maximizing a twice differentiable concave function $E(\theta)$ with bounded Hessian [5, 9]. 
Let the Hessian $H(\theta)$ of $E(\theta)$ be bounded below by $-B$ (that is, $H(\theta) \succeq -B$, in the matrix sense, meaning that $H(\theta) + B$ is positive semi-definite), where $B$ is a positive definite matrix. It's trivial to show that $E(\theta) - R(\theta, \widehat{\theta})$ has a minimum at $\theta = \widehat{\theta}$, where

$$R(\theta, \widehat{\theta}) = -\frac{1}{2} \left( \theta - \widehat{\theta} - B^{-1} g(\widehat{\theta}) \right)^T B \left( \theta - \widehat{\theta} - B^{-1} g(\widehat{\theta}) \right),  (12)$$

with $g(\widehat{\theta})$ denoting the gradient of $E(\theta)$ at $\widehat{\theta}$. Thus, the iteration

$$\widehat{\theta}_{\text{new}} = \arg\max_{\theta} R(\theta, \widehat{\theta}) = \widehat{\theta} + B^{-1} g(\widehat{\theta})  (13)$$

is guaranteed to monotonically improve $E(\theta)$, i.e., $E(\widehat{\theta}_{\text{new}}) \geq E(\widehat{\theta})$.

It was shown in [5] that the gradient and the Hessian of the logistic log-likelihood function, i.e., (11) without the log-prior, verify (with $I_a$ denoting an $a \times a$ identity matrix and $1_a$ a vector of $a$ ones)

$$g(z) = \widehat{y} - \eta(z) \quad \text{and} \quad H(z) \succeq -\frac{1}{2} \left( I_{K-1} - \frac{1_{K-1}\, 1_{K-1}^T}{K} \right) \otimes I_n \equiv -B,  (14)$$

where $z = [z_1^{(1)}, ..., z_n^{(1)}, z_1^{(2)}, ..., z_n^{(K-1)}]^T$ denotes the lexicographic vectorization of $\mathcal{Z}$, $\widehat{y}$ denotes the corresponding lexicographic vectorization of $\widehat{\mathcal{Y}}$, and $\eta(z) = [p_1^{(1)}, ..., p_n^{(1)}, p_1^{(2)}, ..., p_n^{(K-1)}]^T$, with $p_i^{(k)} = P[y_i^{(k)}=1|z_i]$.

Defining $v = \widehat{z} + B^{-1}(\widehat{y} - \eta(\widehat{z}))$, the MM update equation for solving (11) is thus

$$\widehat{z}_{\text{new}}(v) = \arg\min_{z} \left\{ \frac{1}{2} (z - v)^T B\, (z - v) - \log p(z) \right\},  (15)$$

where $p(z)$ is equivalent to $p(\mathcal{Z})$, because $z$ is simply the lexicographic vectorization of $\mathcal{Z}$.

We now summarize our GEM algorithm:

E-step: compute $\widehat{y}_i^{(k)}$, using (9), for all $i = 1, ..., n$ and $k = 1, ..., K-1$.

(Generalized) M-step: Apply one or more iterations of (15), keeping $\widehat{y}$ fixed, that is, loop through the following two steps: $v \leftarrow \widehat{z} + B^{-1}(\widehat{y} - \eta(\widehat{z}))$ and $\widehat{z} \leftarrow \widehat{z}_{\text{new}}(v)$.

3.4 Speeding Up the Algorithm

In image segmentation, the MM update equation (15) is formally equivalent to the MAP estimation of an image with $n$ pixels in $\mathbb{R}^{K-1}$, under prior $p(z)$, where $v$ plays the role of the observed image, and $B$ is the inverse covariance matrix of the noise. Due to the structure of $B$, even if the prior models the several $z^{(k)}$ as independent, i.e., if $\log p(z) = \log p(z^{(1)}) + \cdots + \log p(z^{(K-1)})$, (15) cannot be decoupled into the several components $\{z^{(1)}, ..., z^{(K-1)}\}$. We sidestep this difficulty, at the cost of using a less tight bound in (14), based on the following lemma:

Lemma 1 Let $\xi_K = 1/2$, if $K > 2$, and $\xi_K = 1/4$, if $K = 2$. Then, $B \preceq \xi_K I_{n(K-1)}$.

Proof: Inserting $K = 2$ in (14) yields $B = I/4$, which proves the case $K = 2$. 
For $K > 2$, the inequality $I/2 \succeq B$ is equivalent to $\lambda_{\min}(I/2 - B) \geq 0$, which is equivalent to $\lambda_{\max}(B) \leq 1/2$. Since the eigenvalues of a Kronecker product are the products of the eigenvalues of the matrices, $\lambda_{\max}(B) = \lambda_{\max}(I - (1/K)\, 1 1^T)/2$. Since $1 1^T$ is a rank-1 matrix with eigenvalues $\{0, ..., 0, K-1\}$, the eigenvalues of $(I - (1/K)\, 1 1^T)$ are $\{1, ..., 1, 1/K\}$, thus $\lambda_{\max}(I - (1/K)\, 1 1^T) = 1$, and $\lambda_{\max}(B) = 1/2$.

This lemma allows replacing $B$ with $\xi_K I_{n(K-1)}$ in (15), which (assuming independent priors, as is the case of (3)) becomes decoupled, leading to

$$\widehat{z}^{(k)}_{\text{new}}(v^{(k)}) = \arg\min_{z^{(k)}} \left\{ \frac{\xi_K}{2} \left\| z^{(k)} - v^{(k)} \right\|^2 - \log p(z^{(k)}) \right\}, \quad \text{for} \quad k = 1, ..., K-1,  (16)$$

where $v^{(k)} = \widehat{z}^{(k)} + (1/\xi_K)(\widehat{y}^{(k)} - \eta^{(k)}(\widehat{z}^{(k)}))$. Moreover, the "noise" in each of these "denoising" problems is white and Gaussian, of variance $1/\xi_K$.

3.5 Stationary Gaussian Field Priors

Consider a Gaussian prior of form (3), where $A_{i,j}$ only depends on the relative position of $i$ and $j$ and the neighborhood system defined by $A$ has periodic boundary conditions. In this case, both $A$ and $\Delta$ are block-circulant matrices, with circulant blocks [2], thus diagonalizable by a 2D discrete Fourier transform (2D-DFT). Formally, $\Delta = U^H D U$, where $D$ is diagonal, $U$ is the unitary matrix representing the 2D-DFT, and $(\cdot)^H$ denotes conjugate transpose. 
The log-prior is then expressed in the DFT domain as $\log p(z^{(k)}) \doteq -\frac{1}{2} (U z^{(k)})^H D\, (U z^{(k)})$, and the solution of (16) is

$$\widehat{z}^{(k)}_{\text{new}}(v^{(k)}) = \xi_K\, U^H \left[ \xi_K I_n + D \right]^{-1} U\, v^{(k)}, \quad \text{for} \quad k = 1, ..., K-1.  (17)$$

Observe that (17) corresponds to filtering each image $v^{(k)}$, in the DFT domain, with a fixed filter with frequency response $\xi_K [\xi_K I_n + D]^{-1}$; this inversion can be computed off-line and is trivial because $\xi_K I_n + D$ is diagonal. Finally, it's worth stressing that the matrix-vector products by $U$ and $U^H$ are not carried out explicitly, but more efficiently via the FFT algorithm, with cost $O(n \log n)$.

3.6 Wavelet-Based Priors for Segmentation

It's known that piece-wise smooth images have sparse wavelet-based representations (see [12] and the many references therein); this fact underlies the state-of-the-art denoising performance of wavelet-based methods. Piece-wise smoothness of the $z^{(k)}$ translates into segmentations in which pixels in each class tend to form connected regions. Consider a wavelet expansion of each $z^{(k)}$,

$$z^{(k)} = W \theta^{(k)}, \quad k = 1, ..., K-1,  (18)$$

where the $\theta^{(k)}$ are sets of coefficients and $W$ is the matrix representation of an inverse wavelet transform; $W$ may be orthogonal or have more columns than rows (over-complete representations) [12]. A wavelet-based prior for $z^{(k)}$ is induced by placing a prior on the coefficients $\theta^{(k)}$. A classical choice for $p(\theta^{(k)})$ is a generalized Gaussian [14]. Without going into details, under this class of priors (and others), (16) becomes a non-linear wavelet-based denoising step, which has been widely studied in the image processing literature. 
For several choices of $p(\theta^{(k)})$ and $W$, this denoising step has a very simple closed form, which essentially corresponds to computing a wavelet transform of the observations, applying a coefficient-wise non-linear shrinkage/thresholding operation, and applying the inverse transform to the processed coefficients. This is computationally very efficient, due to the existence of fast algorithms for computing direct and inverse wavelet transforms; e.g., $O(n)$ for an orthogonal wavelet transform or $O(n \log n)$ for a shift-invariant redundant transform.

4 Extensions

4.1 Semi-Supervised Segmentation

For semi-supervised image segmentation, the user defines regions in the image for which the true label is known. Our GEM algorithm is trivially modified to handle this case: if at location $i$ the label is known to be (say) $k$, we freeze $\widehat{y}_i^{(k)} = 1$ and $\widehat{y}_i^{(j)} = 0$, for $j \neq k$. The E-step is only applied to those locations for which the label is unknown. The M-step remains unchanged.

4.2 Discriminative Features

Our formulation (as most probabilistic segmentation methods) adopts a generative perspective, where each $p(\cdot|\phi^{(k)})$ models the data generation mechanism in the corresponding class. However, discriminative methods (such as support vector machines) are seen as the current state-of-the-art in classification [7]. We will now show how a pre-trained discriminative classifier can be used in our GEM algorithm instead of the generative likelihoods.

The E-step (see (9)) obtains the posterior probability that $x_i$ was generated by the $k$-th model, by combining (via Bayes law) the corresponding likelihood $p(x_i|\widehat{\phi}^{(k)})$ with the local prior probability $P[y_i^{(k)}=1|\widehat{z}_i]$. Consider that, instead of likelihoods derived from generative models, we have a discriminative classifier, i.e., one that directly provides estimates of the posterior class probabilities $P[y_i^{(k)}=1|x_i]$. To use these values in our segmentation algorithm, we need a way to bias these estimates according to the local prior probabilities $P[y_i^{(k)}=1|\widehat{z}_i]$, which are responsible for encouraging spatial coherence. Let us assume that we know that the discriminative classifier was trained using $m_k$ samples from the $k$-th class. It can thus be assumed that these posterior class probabilities verify $P[y_i^{(k)}=1|x_i] \propto m_k\, p(x_i|y_i^{(k)}=1)$. It is then possible to "bias" these classifiers with the local prior probabilities $P[y_i^{(k)}=1|\widehat{z}_i]$, simply by computing

$$P[y_i^{(k)}=1|x_i, \widehat{z}_i] = \frac{P[y_i^{(k)}=1|x_i]\; P[y_i^{(k)}=1|\widehat{z}_i]}{m_k} \left( \sum_{j=1}^{K} \frac{P[y_i^{(j)}=1|x_i]\; P[y_i^{(j)}=1|\widehat{z}_i]}{m_j} \right)^{-1}.$$

5 Experiments

In this section we show experimental results of image segmentation in supervised, unsupervised, semi-supervised, and discriminative modes. Assessing the performance of a segmentation method is not a trivial task. Moreover, the performance of segmentation algorithms depends more critically on the adopted features (which are not the focus of this paper) than on the spatial coherence prior. For these reasons, we do not present a careful comparative study, but simply a set of experimental examples attesting to the promising behavior of the proposed approach.

5.1 Supervised and Unsupervised Image Segmentation

The first experiment, reported in Fig. 1, illustrates the algorithm on a synthetic gray scale image with four Gaussian classes of means 1, 2, 3, and 4, and standard deviation 0.6. For this image, supervised and unsupervised segmentation lead to almost visually indistinguishable results, so we only show the supervised segmentation results. In the Gaussian prior, matrix $A$ corresponds to a first order neighborhood, that is, $A_{i,j} = \gamma$ if and only if $j$ is one of the four nearest neighbors of $i$. 
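With this stationary first-order prior, the M-step update is the fixed DFT-domain filter of (17). A minimal numpy sketch, under the assumption of a periodic 4-neighbor Laplacian (whose 2D-DFT eigenvalues have the standard closed form); sizes and the weight `gamma` are illustrative:

```python
import numpy as np

h, w, gamma, K = 8, 8, 10.0, 3             # illustrative sizes and weight
xi = 0.5 if K > 2 else 0.25                # xi_K from Lemma 1

# DFT eigenvalues of the periodic 4-neighbor graph Laplacian (gamma * Delta)
u = np.arange(h)[:, None]
v = np.arange(w)[None, :]
d = 2 * gamma * (2 - np.cos(2 * np.pi * u / h) - np.cos(2 * np.pi * v / w))

def m_step_update(img):
    """Eq. (17): apply the fixed filter xi_K / (xi_K + D) in the DFT domain."""
    return np.real(np.fft.ifft2(xi * np.fft.fft2(img) / (xi + d)))

noisy = np.random.default_rng(0).normal(size=(h, w))
z_new = m_step_update(noisy)               # smoothed pseudo-observation v^(k)
```

The filter passes the DC component unchanged (the Laplacian eigenvalue at zero frequency is zero) and attenuates high frequencies, which is exactly the denoising behavior described in Sec. 3.4.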
For wavelet-based segmentation, we have used undecimated Haar wavelets and the BayesShrink denoising procedure [6].

Figure 1: From left to right: observed image, maximum likelihood segmentation, GEM result with Gaussian prior, GEM result with wavelet-based prior.

5.2 Semi-supervised Image Segmentation

We illustrate the semi-supervised mode of our approach on two real RGB images, shown in Fig. 2. Each region is modelled by a single multivariate Gaussian density in RGB space. In the example in the first row, the goal is to segment the image into skin, cloth, and background regions; in the second example, the goal is to segment the horses from the background. These examples show how the semi-supervised mode of our algorithm is able to segment the image into regions which "look like" the seed regions provided by the user.

Figure 2: From left to right (in each row): observed image with regions indicated by the user as belonging to each class, segmentation result, region boundaries.

5.3 Discriminative Texture Segmentation

Finally, we illustrate the behavior of the algorithm when used with discriminative classifiers by applying it to texture segmentation. We build on the work in [8], where SVM classifiers are used for texture classification (see [8] for complete details about the kernels and texture features used). Fig. 3 shows two experiments: one with a two-texture 256x512 image and the other with a 5-texture 256x256 image. In the two-class case, one binary SVM was trained on 1000 random patterns from each class. For the 5-class case, 5 binary SVMs were trained in the "1-vs-all" mode, with 500 samples from each class. In the 2-class and 5-class cases, the error rates of the SVM classifier are 12.69% and 13.92%, respectively. 
Our GEM algorithm achieves 0.51% and 2.22%, respectively. These examples show that our method is able to take class predictions produced by a classifier lacking any spatial prior and produce segmentations with a high degree of spatial coherence.

6 Conclusions

We have introduced an approach to probabilistic semi-supervised clustering which is particularly suited for image segmentation. The formulation allows supervised, unsupervised, semi-supervised, and discriminative modes, and can be used with classical image priors (such as Gaussian fields or wavelet-based priors). Unlike the usual Markov random field approaches, which involve combinatorial optimization, our segmentation algorithm consists of a simple generalized EM algorithm. Several experimental examples illustrated the promising behavior of our method. Ongoing work includes a thorough experimental comparison with state-of-the-art segmentation algorithms, namely, spectral methods [16] and techniques based on "graph-cuts" [19].

Acknowledgement: This work was partially supported by the (Portuguese) Fundação para a Ciência e Tecnologia (FCT), grant POSC/EEA-SRI/61924/2004.

Figure 3: From left to right (in each row): observed image, direct SVM segmentation, segmentation produced by our algorithm.

References

[1] M. Figueiredo. "Bayesian image segmentation using wavelet-based priors", Proc. IEEE Conf. Computer Vision and Pattern Recognition - CVPR'2005, San Diego, CA, 2005.

[2] N. Balram, J. Moura. "Noncausal Gauss-Markov random fields: parameter structure and estimation", IEEE Trans. Information Theory, vol. 39, pp. 1333-1355, 1993.

[3] A. Banerjee, S. Merugu, I. Dhillon, J. Ghosh. "Clustering with Bregman divergences." Proc. SIAM Intern. Conf. Data Mining - SDM'2004, Lake Buena Vista, FL, 2004.

[4] S. Basu, M. Bilenko, R. Mooney. 
"A probabilistic framework for semi-supervised clustering." Proc. of KDD-2004, Seattle, WA, 2004.

[5] D. Böhning. "Multinomial logistic regression", Annals Inst. Stat. Math., vol. 44, pp. 197-200, 1992.

[6] G. Chang, B. Yu, M. Vetterli. "Adaptive wavelet thresholding for image denoising and compression." IEEE Trans. Image Proc., vol. 9, pp. 1532-1546, 2000.

[7] T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, Springer, 2001.

[8] K. I. Kim, K. Jung, S. H. Park, H. J. Kim. "Support vector machines for texture classification." IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 1542-1550, 2002.

[9] D. Hunter, K. Lange. "A tutorial on MM algorithms", The American Statistician, vol. 58, pp. 30-37, 2004.

[10] M. Law, A. Topchy, A. K. Jain. "Model-based clustering with probabilistic constraints." In Proc. of the SIAM Conf. on Data Mining, pp. 641-645, Newport Beach, CA, 2005.

[11] Z. Lu, T. Leen. "Probabilistic penalized clustering." In NIPS 17, MIT Press, 2005.

[12] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, San Diego, CA, 1998.

[13] G. McLachlan, T. Krishnan. The EM Algorithm and Extensions, Wiley, New York, 1997.

[14] P. Moulin, J. Liu. "Analysis of multiresolution image denoising schemes using generalized-Gaussian and complexity priors," IEEE Trans. Inform. Theory, vol. 45, pp. 909-919, 1999.

[15] N. Shental, A. Bar-Hillel, T. Hertz, D. Weinshall. "Computing Gaussian mixture models with EM using equivalence constraints." In NIPS 15, MIT Press, Cambridge, MA, 2003.

[16] J. Shi, J. Malik. "Normalized cuts and image segmentation." IEEE-TPAMI, vol. 22, pp. 888-905, 2000.

[17] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl. "Constrained K-means clustering with background knowledge." In Proc. 
of ICML'2001, Williamstown, MA, 2001.

[18] C. Wu. "On the convergence properties of the EM algorithm," Ann. Statistics, vol. 11, pp. 95-103, 1983.

[19] R. Zabih, V. Kolmogorov. "Spatially coherent clustering with graph cuts." Proc. IEEE-CVPR, vol. II, pp. 437-444, 2004.

[20] X. Zhu. "Semi-Supervised Learning Literature Survey", TR-1530, Comp. Sci. Dept., Univ. of Wisconsin, Madison, 2006. Available at www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
", "award": [], "sourceid": 3008, "authors": [{"given_name": "Dong", "family_name": "Cheng", "institution": null}, {"given_name": "Vittorio", "family_name": "Murino", "institution": null}, {"given_name": "M\u00e1rio", "family_name": "Figueiredo", "institution": null}]}