{"title": "Latent Maximum Margin Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 28, "page_last": 36, "abstract": "We present a maximum margin framework that clusters data using latent variables. Using latent representations enables our framework to model unobserved information embedded in the data. We implement our idea by large margin learning, and develop an alternating descent algorithm to effectively solve the resultant non-convex optimization problem. We instantiate our latent maximum margin clustering framework with tag-based video clustering tasks, where each video is represented by a latent tag model describing the presence or absence of video tags. Experimental results obtained on three standard datasets show that the proposed method outperforms non-latent maximum margin clustering as well as conventional clustering approaches.", "full_text": "Latent Maximum Margin Clustering\n\nGuang-Tong Zhou, Tian Lan, Arash Vahdat, and Greg Mori\n\nSchool of Computing Science\n\n{gza11,tla58,avahdat,mori}@cs.sfu.ca\n\nSimon Fraser University\n\nAbstract\n\nWe present a maximum margin framework that clusters data using latent vari-\nables. Using latent representations enables our framework to model unobserved\ninformation embedded in the data. We implement our idea by large margin learn-\ning, and develop an alternating descent algorithm to effectively solve the resultant\nnon-convex optimization problem. We instantiate our latent maximum margin\nclustering framework with tag-based video clustering tasks, where each video is\nrepresented by a latent tag model describing the presence or absence of video tags.\nExperimental results obtained on three standard datasets show that the proposed\nmethod outperforms non-latent maximum margin clustering as well as conven-\ntional clustering approaches.\n\nIntroduction\n\n1\nClustering is a major task in machine learning and has been extensively studied over decades of\nresearch [11]. 
Given a set of observations, clustering aims to group data instances of similar structures or patterns together. Popular clustering approaches include the k-means algorithm [7], mixture models [22], normalized cuts [27], and spectral clustering [18]. Recent progress has been made using maximum margin clustering (MMC) [32], which extends the supervised large margin theory (e.g. SVM) to the unsupervised scenario. MMC performs clustering by simultaneously optimizing cluster-specific models and instance-specific labeling assignments, and often generates better performance than conventional methods [33, 29, 37, 38, 16, 6].

Modeling data with latent variables is common in many applications. Latent variables are often defined to have intuitive meaning, and are used to capture unobserved semantics in the data. Compared with ordinary linear models, latent variable models feature the ability to exploit a richer representation of the space of instances. Thus, they often achieve superior performance in practice. In computer vision, this is best exemplified by the success of deformable part models (DPMs) [5] for object detection. DPMs enhance the representation of an object class by capturing viewpoint and pose variations. They utilize a root template describing the entire object appearance and several part templates. Latent variables are used to capture deformations and appearance variations of the root template and parts. DPMs perform object detection by searching for the best locations of the root and part templates.

Latent variable models are often coupled with supervised learning to learn models incorporating the unobserved variables. For example, DPMs are learned in a latent SVM framework [5] for object detection; similar models have been shown to improve human action recognition [31].
A host of other applications of latent SVMs have obtained state-of-the-art performance in computer vision. Motivated by their success in supervised learning, we believe latent variable models can also help in unsupervised clustering: data instances with similar latent representations should be grouped together in one cluster.

As the latent variables are unobserved in the original data, we need a learning framework to handle this latent knowledge. To implement this idea, we develop a novel clustering algorithm based on MMC that incorporates latent variables; we call this latent maximum margin clustering (LMMC). The LMMC algorithm results in a non-convex optimization problem, for which we introduce an iterative alternating descent algorithm. Each iteration involves three steps: inferring latent variables for each sample point, optimizing cluster assignments, and updating cluster model parameters.

To evaluate the efficacy of this clustering algorithm, we instantiate LMMC for tag-based video clustering, where each video is modeled with latent variables controlling the presence or absence of a set of descriptive tags. We conduct experiments on three standard datasets, TRECVID MED 11 [19], KTH Actions [26] and UCF Sports [23], and show that LMMC outperforms non-latent MMC and conventional clustering methods.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 formulates the LMMC framework in detail. We describe tag-based video clustering in Section 4, followed by experimental results reported in Section 5. Finally, Section 6 concludes this paper.

2 Related Work

Latent variable models. There has been much work in recent years using latent variable models. The definition of latent variables is usually task-dependent. Here we focus on the learning part only. Andrews et al. [1] propose multiple-instance SVM to learn latent variables in positive bags. Felzenszwalb et al.
[5] formulate latent SVM by extending binary linear SVM with latent variables. Yu and Joachims [36] handle structural outputs with latent structural SVM. This model is also known as maximum margin hidden conditional random fields (MMHCRF) [31]. Kumar et al. [14] propose self-paced learning, an optimization strategy that focuses on simple models first. Yang et al. [35] kernelize latent SVM for better performance. All of this work demonstrates the power of max-margin latent variable models for supervised learning; our framework conducts unsupervised clustering while modeling data with latent variables.

Maximum margin clustering. MMC was first proposed by Xu et al. [32] to extend supervised large margin methods to unsupervised clustering. Different from the supervised case, where the optimization is convex, MMC results in non-convex problems. To solve it, Xu et al. [32] and Valizadegan and Jin [29] reformulate the original problem as a semi-definite programming (SDP) problem. Zhang et al. [37] employ alternating optimization, finding labels and optimizing a support vector regression (SVR). Li et al. [16] iteratively generate the most violated labels, and combine them via multiple kernel learning. Note that the above methods can only solve binary-cluster clustering problems. To handle the multi-cluster case, Xu and Schuurmans [33] extend the SDP method in [32]. Zhao et al. [38] propose a cutting-plane method which uses the constrained convex-concave procedure (CCCP) to relax the non-convex constraint. Gopalan and Sankaranarayanan [6] examine data projections to identify the maximum margin. Our framework deals with multi-cluster clustering, and we model data instances with latent variables to exploit rich representations. It is also worth mentioning that MMC leads naturally to the semi-supervised SVM framework [12] by assuming a training set of labeled instances [32, 33].
Using the same idea, we could extend LMMC to semi-supervised learning.

MMC has also shown its success in various computer vision applications. For example, Zhang et al. [37] conduct MMC-based image segmentation. Farhadi and Tabrizi [4] find different view points of human activities via MMC. Wang and Cao [30] incorporate MMC to discover geographical clusters of beach images. Hoai and Zisserman [8] form a joint framework of maximum margin classification and clustering to improve sub-categorization.

Tag-based video analysis. Tagging videos with relevant concepts or attributes is common in video analysis. Qi et al. [20] predict multiple correlative tags in a structural SVM framework. Yang and Toderici [34] exploit latent sub-categories of tags in large-scale videos. The obtained tags can assist in recognition. For example, Liu et al. [17] use semantic attributes (e.g. up-down motion, torso motion, twist) to recognize human actions (e.g. walking, hand clapping). Izadinia and Shah [10] model low-level event tags (e.g. people dancing, animal eating) as latent variables to recognize complex video events (e.g. wedding ceremony, grooming animal). Instead of supervised recognition of tags or video categories, we focus on unsupervised tag-based video clustering. In fact, recent research collects various sources of tags for video clustering. Schroff et al. [25] cluster videos by their capturing locations. Hsu et al. [9] build hierarchical clustering using user-contributed comments.
Our paper uses latent tag models, and our LMMC framework is general enough to handle various types of tags.

3 Latent Maximum Margin Clustering

As stated above, modeling data with latent variables can be beneficial in a variety of supervised applications. For unsupervised clustering, we believe it also helps to group data instances based on latent representations. To implement this idea, we propose the LMMC framework.

LMMC models instances with latent variables. When fitting an instance to a cluster, we find the optimal values for latent variables and use the corresponding latent representation of the instance. To best fit different clusters, an instance is allowed to flexibly take different latent variable values when being compared to different clusters. This enables LMMC to explore a rich latent space when forming clusters. Note that in conventional clustering algorithms, an instance is usually restricted to have the same representation in all clusters. Furthermore, as the latent variables are unobserved in the original data, we need a learning framework to exploit this latent knowledge. Here we develop a large margin learning framework based on MMC, and learn a discriminative model for each cluster. The resultant LMMC optimization is non-convex, and we design an alternating descent algorithm to approximate the solution. Next we briefly introduce MMC in Section 3.1, followed by detailed descriptions of the LMMC framework and optimization respectively in Sections 3.2 and 3.3.

3.1 Maximum Margin Clustering

MMC [32, 37, 38] extends the maximum margin principle popularized by supervised SVMs to unsupervised clustering, where the input instances are unlabeled. The idea of MMC is to find a labeling so that the margin obtained would be maximal over all possible labelings. Suppose there are N instances \{x_i\}_{i=1}^N to be clustered into K clusters. MMC is formulated as follows [33, 38]:

\[
\begin{aligned}
\min_{W, Y, \xi \ge 0} \quad & \frac{1}{2} \sum_{t=1}^{K} \|w_t\|^2 + \frac{C}{K} \sum_{i=1}^{N} \sum_{r=1}^{K} \xi_{ir} \\
\text{s.t.} \quad & \sum_{t=1}^{K} y_{it} w_t^\top x_i - w_r^\top x_i \ge 1 - y_{ir} - \xi_{ir}, \;\forall i, r \\
& y_{it} \in \{0, 1\}, \;\forall i, t \qquad \sum_{t=1}^{K} y_{it} = 1, \;\forall i
\end{aligned}
\tag{1}
\]

where W = \{w_t\}_{t=1}^K are the linear model parameters for each cluster, \xi = \{\xi_{ir}\} (i \in \{1, \dots, N\}, r \in \{1, \dots, K\}) are the slack variables to allow a soft margin, and C is a trade-off parameter. We denote the labeling assignment by Y = \{y_{it}\} (i \in \{1, \dots, N\}, t \in \{1, \dots, K\}), where y_{it} = 1 indicates that the instance x_i is clustered into the t-th cluster, and y_{it} = 0 otherwise. By convention, we require that each instance is assigned to one and only one cluster, i.e. the last constraint in Eq. 1. Moreover, the first constraint in Eq. 1 enforces a large margin between clusters by constraining that the score of x_i under the assigned cluster is sufficiently larger than the score of x_i under any other cluster. Note that MMC is an unsupervised clustering method, which jointly estimates the model parameters W and finds the best labeling Y.

Enforcing balanced clusters. Unfortunately, solving Eq. 1 could end up with trivial solutions where all instances are simply assigned to the same cluster, and we obtain an unbounded margin. To address this problem, we add cluster balance constraints to Eq. 1 that require Y to satisfy

\[
L \le \sum_{i=1}^{N} y_{it} \le U, \;\forall t
\tag{2}
\]

where L and U are the lower and upper bounds controlling the size of a cluster. Note that we explicitly enforce cluster balance using a hard constraint on the cluster sizes. This is different from [38], a representative multi-cluster MMC method, where the cluster balance constraints are implicitly imposed on the accumulated model scores (i.e. \sum_{i=1}^{N} w_t^\top x_i). We found empirically that explicitly enforcing balanced cluster sizes led to better results.

3.2 Latent Maximum Margin Clustering

We now extend MMC to include latent variables. The latent variable of an instance is cluster-specific. Formally, we denote h as the latent variable of an instance x associated to a cluster parameterized by w. Following the latent SVM formulation [5, 36, 31], scoring x w.r.t. w is to solve an inference problem of the form:

\[
f_w(x) = \max_{h} \; w^\top \Phi(x, h)
\tag{3}
\]

where \Phi(x, h) is the feature vector defined for the pair (x, h). To simplify the notation, we assume the latent variable h takes its value from a discrete set of labels. However, our formulation can be easily generalized to handle more complex latent variables (e.g. graph structures [36, 31]).

To incorporate the latent variable models into clustering, we replace the linear model w^\top x in Eq. 1 by the latent variable model f_w(x). We call the resultant framework latent maximum margin clustering (LMMC). LMMC finds clusters via the following optimization:

\[
\begin{aligned}
\min_{W, Y, \xi \ge 0} \quad & \frac{1}{2} \sum_{t=1}^{K} \|w_t\|^2 + \frac{C}{K} \sum_{i=1}^{N} \sum_{r=1}^{K} \xi_{ir} \\
\text{s.t.} \quad & \sum_{t=1}^{K} y_{it} f_{w_t}(x_i) - f_{w_r}(x_i) \ge 1 - y_{ir} - \xi_{ir}, \;\forall i, r \\
& y_{it} \in \{0, 1\}, \;\forall i, t \qquad \sum_{t=1}^{K} y_{it} = 1, \;\forall i \qquad L \le \sum_{i=1}^{N} y_{it} \le U, \;\forall t
\end{aligned}
\tag{4}
\]

We adopt the notation Y from the MMC formulation to denote the labeling assignment.
Similar to MMC, the first constraint in Eq. 4 enforces the large margin criterion whereby the score of fitting x_i to the assigned cluster is marginally larger than the score of fitting x_i to any other cluster. Cluster balance is enforced by the last constraint in Eq. 4. Note that LMMC jointly optimizes the model parameters W and finds the best labeling assignment Y, while inferring the optimal latent variables.

3.3 Optimization

It is easy to verify that the optimization problem described in Eq. 4 is non-convex due to the optimization over the labeling assignment variables Y and the latent variables H = \{h_{it}\} (i \in \{1, \dots, N\}, t \in \{1, \dots, K\}). To solve it, we first eliminate the slack variables \xi, and rewrite Eq. 4 equivalently as:

\[
\min_{W} \; \frac{1}{2} \sum_{t=1}^{K} \|w_t\|^2 + \frac{C}{K} R(W)
\tag{5}
\]

where R(W) is the risk function defined by:

\[
\begin{aligned}
R(W) = \min_{Y} \quad & \sum_{i=1}^{N} \sum_{r=1}^{K} \max\Big(0, \; 1 - y_{ir} + f_{w_r}(x_i) - \sum_{t=1}^{K} y_{it} f_{w_t}(x_i)\Big) \\
\text{s.t.} \quad & y_{it} \in \{0, 1\}, \;\forall i, t \qquad \sum_{t=1}^{K} y_{it} = 1, \;\forall i \qquad L \le \sum_{i=1}^{N} y_{it} \le U, \;\forall t
\end{aligned}
\tag{6}
\]

Note that Eq. 5 minimizes over the model parameters W, and Eq. 6 minimizes over the labeling assignment variables Y while inferring the latent variables H. We develop an alternating descent algorithm to find an approximate solution. In each iteration, we first evaluate the risk function R(W) given the current model parameters W, and then update W with the obtained risk value. Next we describe each step in detail.

Risk evaluation: The first step of learning is to compute the risk function R(W) with the model parameters W fixed. We first infer the latent variables H and then optimize the labeling assignment Y. According to Eq. 3, the latent variable h_{it} of an instance x_i associated to cluster t can be obtained via: \operatorname{argmax}_{h_{it}} w_t^\top \Phi(x_i, h_{it}). Note that the inference problem is task-dependent.
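As a concrete illustration of this inference step, the following sketch enumerates a small discrete latent set and scores an instance by f_w(x) = max_h w^T Phi(x, h) as in Eq. 3. All names (`score_instance`, `phi`) and the toy feature map are our own illustrative assumptions, not part of the paper's implementation.

```python
# Sketch of the latent scoring rule f_w(x) = max_h w^T Phi(x, h) (Eq. 3),
# assuming the latent variable h ranges over a small discrete set.
# All names and the toy feature map below are illustrative only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score_instance(w, x, latent_values, phi):
    """Return (best score, best latent value) for instance x under model w.

    phi(x, h) builds the joint feature vector for the pair (x, h)."""
    best_h = max(latent_values, key=lambda h: dot(w, phi(x, h)))
    return dot(w, phi(x, best_h)), best_h

# Toy example: h chooses between two orderings of the raw features.
def phi(x, h):
    return x if h == 0 else list(reversed(x))

w = [1.0, -1.0]
score, h_star = score_instance(w, [0.2, 0.9], [0, 1], phi)
```

The same `score_instance` call is what a cluster-specific model w_t would run on every instance during the risk-evaluation step.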
For our latent tag model, we present an efficient inference method in Section 4.

After obtaining the latent variables H, we optimize the labeling assignment Y from Eq. 6. Intuitively, this is to minimize the total risk of labeling all instances while maintaining the cluster balance constraints. We reformulate Eq. 6 as an integer linear programming (ILP) problem by introducing a variable \psi_{it} to capture the risk of assigning an instance x_i to a cluster t. The ILP can be written as:

\[
R(W) = \min_{Y} \; \sum_{i=1}^{N} \sum_{t=1}^{K} \psi_{it} y_{it}
\quad \text{s.t.} \quad y_{it} \in \{0, 1\}, \;\forall i, t \qquad \sum_{t=1}^{K} y_{it} = 1, \;\forall i \qquad L \le \sum_{i=1}^{N} y_{it} \le U, \;\forall t
\tag{7}
\]

where \psi_{it} = \sum_{r=1, r \ne t}^{K} \max(0, 1 + f_{w_r}(x_i) - f_{w_t}(x_i)). This captures the total "mis-clustering" penalties: suppose that we regard t as the "ground truth" cluster label for an instance x_i; then \psi_{it} measures the sum of hinge losses over all incorrect predictions r (r \ne t), which is consistent with the supervised multi-class SVM at a higher level [2]. Eq. 7 is a standard ILP problem with N \times K variables and N + K constraints.

[Figure 1: Two videos represented by the latent tag model, each shown with its binary tag indicators h over the tag set T (e.g. board, car, dog, food, grass, man, snow, tree). Please refer to the text for details about T and h. Note that the cluster labels (i.e. "feeding animal", "board trick") are unknown beforehand. They are added for a better understanding of the video content and the latent tag representations.]
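The penalties \psi_{it} are cheap to compute from the cluster scores; the assignment itself is the ILP of Eq. 7, which the paper hands to an exact solver. As a lightweight, hedged sketch, the code below computes \psi and then uses a greedy capacity-constrained pass (upper bound U only) in place of the exact ILP; the greedy pass is our own illustrative approximation, not the paper's method.

```python
# Sketch of the risk-evaluation step: compute the mis-clustering penalties
# psi[i][t] = sum_{r != t} max(0, 1 + f_r(x_i) - f_t(x_i)) from Eq. 7,
# then assign instances to clusters. The greedy pass below (capacity U per
# cluster, lower bound L not enforced) only approximates the exact ILP.

def compute_psi(scores):
    """scores[i][t] = f_{w_t}(x_i); returns the penalty matrix psi."""
    N, K = len(scores), len(scores[0])
    return [[sum(max(0.0, 1.0 + scores[i][r] - scores[i][t])
                 for r in range(K) if r != t)
             for t in range(K)]
            for i in range(N)]

def greedy_assign(psi, U):
    """Assign each instance to its cheapest cluster with spare capacity,
    processing the most clearly decided instances first."""
    N, K = len(psi), len(psi[0])
    # Instances whose best and second-best risks differ most go first.
    order = sorted(range(N),
                   key=lambda i: sorted(psi[i])[1] - sorted(psi[i])[0],
                   reverse=True)
    counts = [0] * K
    labels = [None] * N
    for i in order:
        t = min((c for c in range(K) if counts[c] < U),
                key=lambda c: psi[i][c])
        labels[i] = t
        counts[t] += 1
    return labels

scores = [[2.0, 0.0], [1.5, 0.2], [0.1, 1.9], [0.0, 2.2]]
psi = compute_psi(scores)
labels = greedy_assign(psi, U=2)
```

With the toy scores above, the first two instances prefer cluster 0 and the last two prefer cluster 1, and the capacity U = 2 keeps the clusters balanced.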
We use the GNU Linear Programming Kit (GLPK) to obtain an approximate solution to this problem.

Updating W: The next step of learning is the optimization over the model parameters W (Eq. 5). The learning problem is non-convex, and we use the non-convex bundle optimization solver in [3]. In a nutshell, this method builds a piecewise quadratic approximation to the objective function of Eq. 5 by iteratively adding a linear cutting plane at the current optimum and updating the optimum. Now the key issue is to compute the subgradient \partial_{w_t} f_{w_t}(x_i) for a particular w_t. Let h^*_{it} be the optimal solution to the inference problem: h^*_{it} = \operatorname{argmax}_{h_{it}} w_t^\top \Phi(x_i, h_{it}). Then the subgradient can be calculated as \partial_{w_t} f_{w_t}(x_i) = \Phi(x_i, h^*_{it}). Using this subgradient, we optimize Eq. 5 by the algorithm in [3].

4 Tag-Based Video Clustering

In this section, we introduce an application of LMMC: tag-based video clustering. Our goal is to jointly learn video clusters and tags in a single framework. We treat the tags of a video as latent variables and capture the correlations between clusters and tags. Intuitively, videos with a similar set of tags should be assigned to the same cluster. We assume there exists a separate training dataset consisting of videos with ground-truth tag labels, from which we train tag detectors independently. During clustering, we are given a set of new videos without ground-truth tag labels, and our goal is to assign cluster labels to these videos.

We employ a latent tag model to represent videos. We are particularly interested in tags which describe different aspects of videos. For example, a video from the cluster "feeding animal" (see Figure 1) may be annotated with "dog", "food", "man", etc. Assume we collect all the tags in a set T.
For a video assigned to a particular cluster, we know it could have a number of tags from T describing visual content related to the cluster. However, we do not know which tags are present in the video. To address this problem, we associate latent variables with the video to denote the presence and absence of tags.

Formally, given a cluster parameterized by w, we associate a latent variable h with a video x, where h = \{h_t\}_{t \in T} and h_t \in \{0, 1\} is a binary variable denoting the presence/absence of each tag t: h_t = 1 means x has the tag t, while h_t = 0 means x does not. Figure 1 shows the latent tag representations of two sample videos. We score the video x according to the model in Eq. 3: f_w(x) = \max_h w^\top \Phi(x, h), where the potential function w^\top \Phi(x, h) is defined as follows:

\[
w^\top \Phi(x, h) = \frac{1}{|T|} \sum_{t \in T} h_t \cdot \omega_t^\top \phi_t(x)
\tag{8}
\]

This potential function measures the compatibility between the video x and the tags associated with the current cluster. Note that w = \{\omega_t\}_{t \in T} are the cluster-specific model parameters, and \Phi = \{h_t \cdot \phi_t(x)\}_{t \in T} is the feature vector depending on the video x and its tags h. Here \phi_t(x) \in \mathbb{R}^d is the feature vector extracted from the video x, and the parameter \omega_t is a template for tag t. In our current implementation, instead of keeping \phi_t(x) as a high dimensional vector of video features, we simply represent it as a scalar score of detecting tag t on x by a pre-trained binary tag detector. To learn biases between different clusters, we append a constant 1 to make \phi_t(x) two-dimensional.

Now we describe how to infer the latent variable h^* = \operatorname{argmax}_h w^\top \Phi(x, h). As there is no dependency between tags, we can infer each latent variable separately. According to Eq. 8, the term corresponding to tag t is h_t \cdot \omega_t^\top \phi_t(x). Considering that h_t is binary, we set h_t to 1 if \omega_t^\top \phi_t(x) > 0; otherwise, we set h_t to 0.

5 Experiments

We evaluate the performance of our method on three standard video datasets: TRECVID MED 11 [19], KTH Actions [26] and UCF Sports [23]. We briefly describe our experimental setup before reporting the experimental results in Section 5.1.

TRECVID MED 11 dataset [19]: This dataset contains web videos collected by the Linguistic Data Consortium from various web video hosting sites. There are 15 complex event categories including "board trick", "feeding animal", "landing fish", "wedding ceremony", "woodworking project", "birthday party", "changing tire", "flash mob", "getting vehicle unstuck", "grooming animal", "making sandwich", "parade", "parkour", "repairing appliance", and "sewing project". TRECVID MED 11 has three data collections: Event-Kit, DEV-T and DEV-O. DEV-T and DEV-O are dominated by videos of the null category, i.e. background videos that do not contain the events of interest. Thus, we use the Event-Kit data collection in the experiments. After removing 13 short videos that contain no visual content, we have a total of 2,379 videos for clustering.

We use tags that were generated in Vahdat and Mori [28] for the TRECVID MED 11 dataset. Specifically, this dataset includes "judgment files" that contain a short one-sentence description for each video. A sample description is: "A man and a little boy lie on the ground after the boy has fallen off his bike". This sentence provides us with information about the presence of objects such as "man", "boy", "ground" and "bike", which could be used as tags.
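Returning to the latent tag model of Section 4: because the tags are independent, inference reduces to per-tag thresholding (h_t = 1 exactly when \omega_t^\top \phi_t(x) > 0). A minimal sketch follows; the tag names, weights, and detector scores are made-up illustrative values, not from the paper.

```python
# Sketch of inference in the latent tag model (Eq. 8): tags are
# independent, so each latent bit h_t is set to 1 exactly when its term
# omega_t^T phi_t(x) is positive. All values below are illustrative.

def infer_tags(omega, phi):
    """omega[t]: per-tag template (weight, bias); phi[t]: per-tag feature
    (detector score, constant 1). Returns (f_w(x), h) with h the inferred
    tag-presence bits."""
    h = {}
    total = 0.0
    for t in omega:
        contrib = omega[t][0] * phi[t][0] + omega[t][1] * phi[t][1]
        h[t] = 1 if contrib > 0 else 0
        total += h[t] * contrib          # absent tags contribute nothing
    return total / len(omega), h         # 1/|T| averaging as in Eq. 8

# phi_t(x) = (z-scored detector output, 1); omega_t = (weight, bias).
omega = {"dog": (0.8, -0.1), "snow": (0.5, -0.6)}
phi = {"dog": (1.2, 1.0), "snow": (0.3, 1.0)}
score, h = infer_tags(omega, phi)
```

Here "dog" is switched on (positive contribution) and "snow" is switched off, and only the active tag contributes to the cluster score.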
In [28], text analysis tools are employed to extract binary tags based on frequent nouns in the judgment files. Examples of the 74 frequent tags used in this work are: "music", "person", "food", "kitchen", "bird", "bike", "car", "street", "boat", "water", etc. The complete list of tags is available on our website.

To train tag detectors, we use the DEV-T and DEV-O videos that belong to the 15 event categories. There are 1,675 videos in total. We extract HOG3D descriptors [13] and form a 1,000 word codebook. Each video is then represented by a 1,000-dimensional feature vector. We train a linear SVM for each tag, and predict the detection scores on the Event-Kit videos. To remove biases between tag detectors, we normalize the detection scores by z-score normalization. Note that we make no use of the ground-truth tags on the Event-Kit videos that are to be clustered.

KTH Actions dataset [26]: This dataset contains a total of 599 videos of 6 human actions: "walking", "jogging", "running", "boxing", "hand waving", and "hand clapping". Our experiments use all the videos for clustering.

We use Action Bank [24] to generate tags for this dataset. Action Bank has 205 template actions with various action semantics and viewpoints. Randomly selected examples of template actions are: "hula1", "ski5", "clap3", "fence2", "violin6", etc. In our experiments, we treat the template actions as tags. Specifically, for each video and each template action, we use the set of Action Bank action detection scores collected at different spatiotemporal scales and correlation volumes. We perform max-pooling on the scores to obtain the corresponding tag detection score.
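The score preprocessing used throughout these setups is simple to state in code: max-pool a tag's detection scores across scales, then z-score normalize each tag's pooled scores across videos to remove detector bias. The sketch below uses made-up numbers purely for illustration.

```python
# Sketch of the tag-score preprocessing: max-pool a tag's detection scores
# (e.g. across spatiotemporal scales), then z-score normalize the pooled
# scores across videos so different detectors are comparable.
import math

def max_pool(per_scale_scores):
    return max(per_scale_scores)

def z_score(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# One tag, three videos, two scales each (illustrative values).
raw = [[0.1, 0.4], [0.9, 0.2], [0.5, 0.6]]
pooled = [max_pool(s) for s in raw]
normalized = z_score(pooled)
```

After normalization each tag's scores have zero mean and unit variance across the video collection, which is what lets a single threshold (Section 4) behave consistently across tags.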
Again, for each tag, we normalize the detection scores by z-score normalization.

UCF Sports dataset [23]: This dataset consists of 140 videos from 10 action classes: "diving", "golf swinging", "kicking", "lifting", "horse riding", "running", "skating", "swinging (on the pommel horse)", "swinging (at the high bar)", and "walking". We use all the videos for clustering. The tags and tag detection scores are generated from Action Bank, in the same way as for KTH Actions.

Baselines: To evaluate the efficacy of LMMC, we implement three conventional clustering methods for comparison: the k-means algorithm (KM), normalized cut (NC) [27], and spectral clustering (SC) [18]. For NC, the implementation and parameter settings are the same as in [27], which uses a Gaussian similarity function with all instances considered as neighbors. For SC, we use a 5-nearest neighborhood graph and set the width of the Gaussian similarity function as the average distance over all the 5-nearest neighbors. Note that these three methods do not use latent variable models. Therefore, for a fair comparison with LMMC, they are performed directly on the data where each video is represented by a vector of tag detection scores. We have also tried KM, NC and SC on the 1,000-dimensional HOG3D features. However, the performance is worse and is not reported here. Furthermore, to mitigate the effect of randomness, KM, NC and SC are run 10 times with different initial seeds and the average results are recorded in the experiments.

In order to show the benefits of incorporating latent variables, we further develop a baseline called MMC by replacing the latent variable model f_w(x) in Eq. 4 with a linear model w^\top x. This is equivalent to running an ordinary maximum margin clustering algorithm on the video data represented by tag detection scores. For a fair comparison, we use the same solver for learning MMC and LMMC. The trade-off parameter C in Eq. 4 is selected as the best from the range {10^1, 10^2, 10^3}. The lower and upper bounds of the cluster-balance constraint (i.e. L and U in Eq. 4) are set to 0.9 N/K and 1.1 N/K respectively to enforce balanced clusters.

Performance measures: Following the convention of maximum margin clustering [32, 33, 29, 37, 38, 16, 6], we set the number of clusters to be the ground-truth number of classes for all the compared methods. The clustering quality is evaluated by four standard measures: purity (PUR) [32], normalized mutual information (NMI) [15], Rand index (RI) [21] and balanced F-measure (FM). They assess different aspects of a given clustering: PUR measures the accuracy of the dominating class in each cluster; NMI takes an information-theoretic perspective and calculates the mutual dependence of the predicted clustering and the ground-truth partitions; RI evaluates true positives within clusters and true negatives between clusters; and FM considers both precision and recall. The higher the four measures, the better the performance.

Table 1: Clustering results (in %) on the three datasets. The best performance among all the compared methods is achieved by LMMC in every column.

              TRECVID MED 11           KTH Actions              UCF Sports
        PUR   NMI   RI    FM     PUR   NMI   RI    FM     PUR   NMI   RI    FM
LMMC    39.0  28.7  89.5  22.1   92.5  87.0  95.8  87.2   76.4  71.2  92.0  60.0
MMC     36.0  26.6  89.3  20.3   91.3  86.5  95.2  85.5   63.6  62.2  89.2  46.1
SC      23.6  28.6  87.1  20.3   60.8  61.0  75.6  58.2   70.8  69.9  90.6  58.1
KM      27.0  23.8  85.9  20.4   64.8  60.7  84.0  60.6   63.1  66.2  87.9  58.7
NC       5.7  12.9  31.6  12.7   33.9  48.0  72.9  35.1   55.8  60.7  83.4  41.8

5.1 Results

The clustering results are listed in Table 1.
It shows that LMMC consistently outperforms the MMC baseline and the conventional clustering methods on all three datasets. Specifically, by incorporating latent variables, LMMC improves the MMC baseline in terms of PUR by 3% on TRECVID MED 11, 1% on KTH Actions, and 13% on UCF Sports. This demonstrates that learning the latent presence and absence of tags can exploit rich representations of videos and boost clustering performance. Moreover, LMMC performs better than the three conventional methods, SC, KM and NC, showing the efficacy of the proposed LMMC framework for unsupervised data clustering.

Note that MMC runs on the same non-latent representation as the three conventional methods, SC, KM and NC. However, MMC outperforms them on the two largest datasets, TRECVID MED 11 and KTH Actions, and is comparable with them on UCF Sports. This provides evidence for the effectiveness of maximum margin clustering as well as the proposed alternating descent algorithm for optimizing the non-convex objective.

Visualization: We select four clusters from TRECVID MED 11, and visualize the results in Figure 2. Please refer to the caption for more details.

6 Conclusion

We have presented a latent maximum margin framework for unsupervised clustering. By representing instances with latent variables, our method features the ability to exploit the unobserved information embedded in data.
We formulate our framework by large margin learning, and an alternating descent algorithm is developed to solve the resultant non-convex objective. We instantiate our framework with tag-based video clustering, where each video is represented by a latent tag model with latent presence and absence of video tags.

[Figure 2: Four sample clusters from TRECVID MED 11. We label each cluster by the dominating video class, e.g. "woodworking project", "birthday party", "parade", "landing fish", and visualize the top-3 scored videos. A check mark indicates that the video label is consistent with the cluster label; otherwise, a cross mark is used. The two "mis-clustered" videos are on "parkour" (left) and "feeding animal" (right). Below each video, we show the top eight inferred tags sorted by the potential calculated from Eq. 8.]
Our experiments conducted on three standard video datasets validate the efficacy of the proposed framework. We believe our solution is general enough to be applied in other applications with latent representations, e.g. video clustering with latent key segments, image clustering with latent regions of interest, etc. It would also be interesting to extend our framework to semi-supervised learning by assuming a training set of labeled instances.

Acknowledgments

This work was supported by a Google Research Award, NSERC, and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.

References

[1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2002.
[2] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[3] T. M. T. Do and T. Artières. Large margin training for hidden Markov models with partially observed states. In ICML, 2009.
[4] A. Farhadi and M. K. Tabrizi. Learning to recognize activities from the wrong view point. In ECCV, 2008.
[5] P. F. Felzenszwalb, D. A. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[6] R. Gopalan and J. Sankaranarayanan. Max-margin clustering: Detecting margins from projections of points on lines. In CVPR, 2011.
[7] J. A. Hartigan and M. A. Wong.
A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
[8] M. Hoai and A. Zisserman. Discriminative sub-categorization. In CVPR, 2013.
[9] C.-F. Hsu, J. Caverlee, and E. Khabiri. Hierarchical comments-based clustering. In SAC, 2011.
[10] H. Izadinia and M. Shah. Recognizing complex events using large margin joint low-level event model. In ECCV, 2012.
[11] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[12] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.
[13] A. Kläser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008.
[14] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.
[15] T. O. Kvalseth. Entropy and correlation: Some comments. IEEE Transactions on Systems, Man and Cybernetics, 17(3):517–519, 1987.
[16] Y.-F. Li, I. W. Tsang, J. T.-Y. Kwok, and Z.-H. Zhou. Tighter and convex maximum margin clustering. In AISTATS, 2009.
[17] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, 2011.
[18] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.
[19] P. Over, G. Awad, J. Fiscus, A. F. Smeaton, W. Kraaij, and G. Quenot. TRECVID 2011 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID, 2011.
[20] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang. Correlative multi-label video annotation. In ACM Multimedia, 2007.
[21] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[22] R. Redner and H. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
[23] M. D. Rodriguez, J. Ahmed, and M. Shah.
Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
[24] S. Sadanand and J. J. Corso. Action Bank: A high-level representation of activity in video. In CVPR, 2012.
[25] F. Schroff, C. L. Zitnick, and S. Baker. Clustering videos by location. In BMVC, 2009.
[26] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
[27] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[28] A. Vahdat and G. Mori. Handling uncertain tags in visual recognition. In ICCV, 2013.
[29] H. Valizadegan and R. Jin. Generalized maximum margin clustering and unsupervised kernel learning. In NIPS, 2006.
[30] Y. Wang and L. Cao. Discovering latent clusters from geotagged beach images. In MMM, 2013.
[31] Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In CVPR, 2009.
[32] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In NIPS, 2004.
[33] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In AAAI, 2005.
[34] W. Yang and G. Toderici. Discriminative tag learning on YouTube videos with latent sub-tags. In CVPR, 2011.
[35] W. Yang, Y. Wang, A. Vahdat, and G. Mori. Kernel latent SVM for visual recognition. In NIPS, 2012.
[36] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.
[37] K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. In ICML, 2007.
[38] B. Zhao, F. Wang, and C. Zhang. Efficient multiclass maximum margin clustering.
In ICML, 2008.