{"title": "Metric Learning by Collapsing Classes", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 458, "abstract": null, "full_text": "Metric Learning by Collapsing Classes\n\nAmir Globerson\n\nSchool of Computer Science and Engineering,\nInterdisciplinary Center for Neural Computation\nThe Hebrew University Jerusalem, 91904, Israel\n\ngamir@cs.huji.ac.il\n\nSam Roweis\n\nMachine Learning Group\n\nDepartment of Computer Science\n\nUniversity of Toronto, Canada\n\nroweis@cs.toronto.edu\n\nAbstract\n\nWe present an algorithm for learning a quadratic Gaussian metric (Maha-\nlanobis distance) for use in classi\ufb01cation tasks. Our method relies on the\nsimple geometric intuition that a good metric is one under which points\nin the same class are simultaneously near each other and far from points\nin the other classes. We construct a convex optimization problem whose\nsolution generates such a metric by trying to collapse all examples in the\nsame class to a single point and push examples in other classes in\ufb01nitely\nfar away. We show that when the metric we learn is used in simple clas-\nsi\ufb01ers, it yields substantial improvements over standard alternatives on\na variety of problems. We also discuss how the learned metric may be\nused to obtain a compact low dimensional feature representation of the\noriginal input space, allowing more ef\ufb01cient classi\ufb01cation with very little\nreduction in performance.\n\n1 Supervised Learning of Metrics\n\nThe problem of learning a distance measure (metric) over an input space is of fundamental\nimportance in machine learning [10, 9], both supervised and unsupervised. When such\nmeasures are learned directly from the available data, they can be used to improve learn-\ning algorithms which rely on distance computations such as nearest neighbour classi\ufb01-\ncation [5], supervised kernel machines (such as GPs or SVMs) and even unsupervised\nclustering algorithms [10]. 
Good similarity measures may also provide insight into the underlying structure of data (e.g. inter-protein distances), and may aid in building better data visualizations via embedding. In fact, there is a close link between distance learning and feature extraction since whenever we construct a feature f(x) for an input space X, we can measure distances between x_1, x_2 \in X using a simple distance function (e.g. Euclidean) d[f(x_1), f(x_2)] in feature space. Thus by fixing d, any feature extraction algorithm may be considered a metric learning method. Perhaps the simplest illustration of this approach is when f(x) is a linear projection of x \in R^d so that f(x) = Wx. The Euclidean distance between f(x_1) and f(x_2) is then the Mahalanobis distance ||f(x_1) - f(x_2)||^2 = (x_1 - x_2)^T A (x_1 - x_2), where A = W^T W is a positive semidefinite matrix. Much of the recent work on metric learning has indeed focused on learning Mahalanobis distances, i.e. learning the matrix A. This is also the goal of the current work.

A common approach to learning metrics is to assume some knowledge in the form of equivalence relations, i.e. which points should be close and which should be far (without specifying their exact distances). In the classification setting there is a natural equivalence relation, namely whether two points are in the same class or not. One of the classical statistical methods which uses this idea for the Mahalanobis distance is Fisher's Linear Discriminant Analysis (see e.g. [6]). Other more recent methods are [10, 9, 5] which seek to minimize various separation criteria between the classes under the new metric.

In this work, we present a novel approach to learning such a metric. Our approach, the Maximally Collapsing Metric Learning algorithm (MCML), relies on the simple geometric intuition that if all points in the same class could be mapped into a single location in feature space and all points in other classes mapped to other locations, this would result in an ideal approximation of our equivalence relation. Our algorithm approximates this scenario via a stochastic selection rule, as in Neighborhood Component Analysis (NCA) [5]. However, unlike NCA, the optimization problem is convex and thus our method is completely specified by our objective function. Different initialization and optimization techniques may affect the speed of obtaining the solution but the final solution itself is unique. We also show that our method approximates the local covariance structure of the data, as opposed to Linear Discriminant Analysis methods which use only global covariance structure.

2 The Approach of Collapsing Classes

Given a set of n labeled examples (x_i, y_i), where x_i \in R^d and y_i \in {1, ..., k}, we seek a similarity measure between two points in X space. We focus on Mahalanobis form metrics

    d(x_i, x_j | A) = d_{ij}^A = (x_i - x_j)^T A (x_i - x_j),        (1)

where A is a positive semidefinite (PSD) matrix.

Intuitively, what we want from a good metric is that it makes elements of X in the same class look close whereas those in different classes appear far. Our approach starts with the ideal case when this is true in the most optimistic sense: same class points are at zero distance, and different class points are infinitely far. Alternatively this can be viewed as mapping x via a linear projection Wx (A = W^T W), such that all points in the same class are mapped into the same point. This intuition is related to the analysis of spectral clustering [8], where the ideal case analysis of the algorithm results in all same cluster points being mapped to a single point.

To learn a metric which approximates the ideal geometric setup described above, we introduce, for each training point, a conditional distribution over other points (as in [5]). Specifically, for each x_i we define a conditional distribution over points j != i such that

    p^A(j|i) = (1/Z_i) e^{-d_{ij}^A} = e^{-d_{ij}^A} / \sum_{k != i} e^{-d_{ik}^A},   j != i.        (2)

If all points in the same class were mapped to a single point and infinitely far from points in different classes, we would have the ideal "bi-level" distribution:

    p_0(j|i) \propto 1 if y_i = y_j ;  0 if y_i != y_j.        (3)

Furthermore, under very mild conditions, any set of points which achieves the above distribution must have the desired geometry. In particular, assume there are at least \hat{d}+2 points in each class, where \hat{d} = rank[A] (note that \hat{d} <= d). Then p^A(j|i) = p_0(j|i) (for all i, j) implies that under A all points in the same class will be mapped to a single point, infinitely far from other class points 1.

Thus it is natural to seek a matrix A such that p^A(j|i) is as close as possible to p_0(j|i). Since we are trying to match distributions, we minimize the KL divergence KL[p_0 | p^A]:

    min_A \sum_i KL[ p_0(j|i) | p^A(j|i) ]   s.t.  A is PSD.        (4)

The crucial property of this optimization problem is that it is convex in the matrix A.

1 Proof sketch: The infinite separation between points of different classes follows simply from p_0(j|i) = 0 when y_j != y_i. For a given point x_i, all the points j in its class satisfy p(j|i) \propto 1. Due to the structure of p(j|i) in Equation 2, and because it is obeyed for all points in x_i's class, this implies that all the points in that class are equidistant from each other. However, it is easy to show that the maximum number of different equidistant points (also known as the equilateral dimension [1]) in \hat{d} dimensions is \hat{d}+1. Since by assumption we have at least \hat{d}+2 points in the class of x_i, and A maps points into R^{\hat{d}}, it follows that all points are identical.
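For concreteness, the distributions of Eqs. (2)-(3) and the objective of Eq. (4) can be sketched in a few lines of numpy (an illustration, not the authors' code; function names are ours, and we assume every class has at least two points):

```python
# Sketch of Eqs. (2)-(4): conditional distributions and the KL objective.
import numpy as np

def conditional_p(X, A):
    """p^A(j|i) of Eq. (2): softmax over negative Mahalanobis distances."""
    diff = X[:, None, :] - X[None, :, :]             # (n, n, d) pairwise x_i - x_j
    d = np.einsum('ijk,kl,ijl->ij', diff, A, diff)   # d^A_ij = (x_i-x_j)^T A (x_i-x_j)
    np.fill_diagonal(d, np.inf)                      # exclude j == i
    P = np.exp(-d)
    return P / P.sum(axis=1, keepdims=True)

def ideal_p(y):
    """p_0(j|i) of Eq. (3): uniform over same-class points, zero elsewhere.
    Assumes every class has at least two points."""
    P0 = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(P0, 0.0)
    return P0 / P0.sum(axis=1, keepdims=True)

def mcml_objective(X, y, A):
    """Sum of KL divergences of Eq. (4), up to the additive entropy constant."""
    P, P0 = conditional_p(X, A), ideal_p(y)
    mask = P0 > 0                                    # zero-probability terms drop out
    return float(np.sum(P0[mask] * np.log(P0[mask] / P[mask])))
```

Scaling A up on well-separated classes drives p^A toward p_0, so the objective decreases, matching the collapsing intuition.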
To see this, first note that any convex combination of feasible solutions A = \alpha A_0 + (1 - \alpha) A_1, 0 <= \alpha <= 1, is still a feasible solution, since the set of PSD matrices is convex. Next, we show that the objective f(A) itself is convex. To do this, we rewrite the objective function f(A) = \sum_i KL[p_0(j|i) | p(j|i)] in the form 2:

    f(A) = \sum_{i,j: y_j = y_i} -\log p(j|i) = \sum_{i,j: y_j = y_i} d_{ij}^A + \sum_i \log Z_i,        (5)

where we assumed for simplicity that classes are equi-probable, yielding a multiplicative constant. To see why f(A) is convex, first note that d_{ij}^A = (x_i - x_j)^T A (x_i - x_j) is linear in A, and thus convex. The function \log Z_i is a log-sum-exp function of affine functions of A and is therefore also convex (see [4], page 74).

2.1 Convex Duality

Since our optimization problem is convex, it has an equivalent convex dual. Specifically, the convex dual of Eq. (4) is the following entropy maximization problem:

    max_p  \sum_i H[p(j|i)]
    s.t.   \sum_i E_{p_0(j|i)}[v_{ji} v_{ji}^T] - \sum_i E_{p(j|i)}[v_{ji} v_{ji}^T] \succeq 0,
           \sum_j p(j|i) = 1  for all i,

where v_{ji} = x_j - x_i and H[.] is the entropy function.

To prove this duality we start with the proposed dual and obtain the original problem in Equation 4 as its dual. Write the Lagrangian for the above problem (where \lambda is PSD) 3:

    L(p, \lambda, \beta) = \sum_{i,j} p(j|i) \log p(j|i) - Tr( \lambda \sum_i ( E_{p_0}[v_{ji} v_{ji}^T] - E_p[v_{ji} v_{ji}^T] ) ) - \sum_i \beta_i ( \sum_j p(j|i) - 1 ).

The dual function is defined as g(\lambda, \beta) = min_p L(p, \lambda, \beta). To derive it, we first solve for the minimizing p by setting the derivative of L(p, \lambda, \beta) w.r.t. p(j|i) equal to zero:

    0 = 1 + \log p(j|i) + Tr(\lambda v_{ji} v_{ji}^T) - \beta_i   =>   p(j|i) = e^{\beta_i - 1 - Tr(\lambda v_{ji} v_{ji}^T)}.

Plugging this solution into L(p, \lambda, \beta) we get

    g(\lambda, \beta) = -Tr( \lambda \sum_i E_{p_0}[v_{ji} v_{ji}^T] ) + \sum_i \beta_i - \sum_{i,j} p(j|i).

The dual problem is to maximize g(\lambda, \beta). We can do this analytically w.r.t. \beta_i, yielding

    \beta_i - 1 = -\log \sum_j e^{-Tr(\lambda v_{ji} v_{ji}^T)}.

Now note that Tr(\lambda v_{ji} v_{ji}^T) = v_{ji}^T \lambda v_{ji} = d_{ji}^\lambda, so we can write

    g(\lambda) = - \sum_{i,j: y_i = y_j} d_{ji}^\lambda - \sum_i \log \sum_j e^{-d_{ji}^\lambda},

which is minus our original target function. Since g(\lambda) should be maximized, and \lambda \succeq 0, we have the desired duality result (identifying \lambda with A).

2 Up to an additive constant \sum_i H[p_0(j|i)].
3 We consider the equivalent problem of minimizing minus entropy.
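The per-point "spread" matrices E_{p_0(j|i)}[v_{ji} v_{ji}^T] appearing in the dual constraint can be computed as follows (an illustrative numpy sketch, not the authors' code; it assumes at least two points per class):

```python
# Spread matrices of the dual: for each point i, the average outer product of
# v_ji = x_j - x_i over same-class neighbours j, centred on x_i itself.
import numpy as np

def spread_matrices(X, y):
    """Return one d x d matrix per point: E_{p_0(j|i)}[v_ji v_ji^T].
    Assumes every class has at least two points."""
    n, d = X.shape
    S = np.zeros((n, d, d))
    for i in range(n):
        same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        v = X[same] - X[i]                # rows are v_ji for same-class j
        S[i] = v.T @ v / len(same)        # uniform (p_0) average of outer products
    return S
```

Unlike the within-class covariance of Fisher's LDA, each matrix here is centred on an individual point rather than on its class mean.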
2.1.1 Relation to covariance based and embedding methods

The convex dual derived above reveals an interesting relation to covariance based learning methods. The sufficient statistics used by the algorithm are a set of n "spread" matrices. Each matrix is of the form E_{p_0(j|i)}[v_{ji} v_{ji}^T]. The algorithm tries to find a maximum entropy distribution which matches these matrices when averaged over the sample.

This should be contrasted with the covariance matrices used in metric learning such as Fisher's Discriminant Analysis. The latter uses the within and between class covariance matrices. The within covariance matrix is similar to the covariance matrix used here, but is calculated with respect to the class means, whereas here it is calculated separately for every point, and is centered on this point. This highlights the fact that MCML is not based on Gaussian assumptions where it is indeed sufficient to calculate a single class covariance.

Our method can also be thought of as a supervised version of the Stochastic Neighbour Embedding algorithm [7] in which the "target" distribution is p_0 (determined by the class labels) and the embedding points are not completely free but are instead constrained to be of the form W x_i.

2.2 Optimizing the Convex Objective

Since the optimization problem in Equation 4 is convex, it is guaranteed to have only a single minimum which is the globally optimal solution 4. It can be optimized using any appropriate numerical convex optimization machinery; all methods will yield the same solution although some may be faster than others. One standard approach is to use interior point Newton methods. However, these algorithms require the Hessian to be calculated, which would require O(d^4) resources, and could be prohibitive in our case. Instead, we have experimented with using a first order gradient method, specifically the projected gradient approach as in [10]. At each iteration we take a small step in the direction of the negative gradient of the objective function 5, followed by a projection back onto the PSD cone. This projection is performed simply by taking the eigen-decomposition of A and removing the components with negative eigenvalues. The algorithm is summarized below:

Input: Set of labeled data points (x_i, y_i), i = 1 ... n.
Output: PSD metric which optimally collapses classes.
Initialization: Initialize A_0 to some PSD matrix (randomly or using some initialization heuristic).
Iterate:
- Set A_{t+1} = A_t - \epsilon \nabla f(A_t), where \nabla f(A) = \sum_{ij} ( p_0(j|i) - p(j|i) ) (x_j - x_i)(x_j - x_i)^T.
- Calculate the eigen-decomposition A_{t+1} = \sum_k \lambda_k u_k u_k^T, then set A_{t+1} = \sum_k max(\lambda_k, 0) u_k u_k^T.

Of course in principle it is possible to optimize over the dual instead of the primal but in our case, if the training data consists of n points in d-dimensional space then the primal has only O(d^2 / 2) variables while the dual has O(n^2) so it will almost always be more efficient to operate on the primal A directly. One exception to this case may be the kernel version (Section 4) where the primal is also of size O(n^2).

4 When the data can be exactly collapsed into single class points, there will be multiple solutions at infinity. However, this is very unlikely to happen in real data.
5 In the experiments, we used an Armijo like step size rule, as described in [3].

3 Low Dimensional Projections for Feature Extraction

The Mahalanobis distance under a metric A can be interpreted as a linear projection of the original inputs by the square root of A, followed by Euclidean distance in the projected space. Matrices A which have less than full rank correspond to Mahalanobis distances based on low dimensional projections. Such metrics and the induced distances can be advantageous for several reasons [5]. First, low dimensional projections can substantially reduce the storage and computational requirements of a supervised method since only the projections of the training points must be stored and the manipulations at test time all occur in the lower dimensional feature space. Second, low dimensional projections re-represent the inputs, allowing for a supervised embedding or visualization of the original data.
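A minimal numpy sketch of the projected-gradient loop summarized in Section 2.2 follows (illustrative, not the authors' code; it uses a fixed step size in place of the Armijo rule of footnote 5, and assumes every class has at least two points):

```python
# Projected gradient for MCML: gradient step on A, then projection onto the
# PSD cone by clipping negative eigenvalues (fixed step; Armijo rule omitted).
import numpy as np

def mcml_fit(X, y, n_iter=200, step=0.01):
    n, d = X.shape
    A = np.eye(d)                                   # PSD initialization
    diff = X[:, None, :] - X[None, :, :]            # (n, n, d) pairwise x_j - x_i
    P0 = (y[:, None] == y[None, :]).astype(float)   # ideal p_0(j|i), Eq. (3)
    np.fill_diagonal(P0, 0.0)
    P0 /= P0.sum(axis=1, keepdims=True)             # assumes >= 2 points per class
    for _ in range(n_iter):
        dist = np.einsum('ijk,kl,ijl->ij', diff, A, diff)
        np.fill_diagonal(dist, np.inf)
        P = np.exp(-dist)
        P /= P.sum(axis=1, keepdims=True)           # current p^A(j|i), Eq. (2)
        # gradient: sum_ij (p_0(j|i) - p(j|i)) v_ji v_ji^T
        G = np.einsum('ij,ijk,ijl->kl', P0 - P, diff, diff)
        A = A - step * G
        # project back onto the PSD cone: drop negative eigenvalues
        w, V = np.linalg.eigh(A)
        A = (V * np.clip(w, 0.0, None)) @ V.T
    return A
```

Each term of the gradient is a symmetric outer product, so A stays symmetric and `eigh` applies throughout.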
If we consider matrices A with rank at most q, we can always represent them in the form A = W^T W for some projection matrix W of size q x d. This corresponds to projecting the original data into a q-dimensional space specified by the rows of W. However, rank constraints on a matrix are not convex [4], and hence the rank constrained problem is not convex and is likely to have local minima which make the optimization difficult and ill-defined since it becomes sensitive to initial conditions and choice of optimization method.

Luckily, there is an alternative approach to obtaining low dimensional projections, which does specify a unique solution by sequentially solving two globally tractable problems. This is the approach we follow here. First we solve for a (potentially) full rank metric A using the convex program outlined above, and then obtain a low rank projection from it via spectral decomposition. This is done by diagonalizing A into the form A = \sum_{i=1}^d \lambda_i v_i v_i^T, where \lambda_1 >= \lambda_2 >= ... >= \lambda_d are the eigenvalues of A and v_i are the corresponding eigenvectors. To obtain a low rank projection we constrain the sum above to include only the q terms corresponding to the q largest eigenvalues: A_q = \sum_{i=1}^q \lambda_i v_i v_i^T. The resulting projection is uniquely defined (up to an irrelevant unitary transformation) as W = diag(\sqrt{\lambda_1}, ..., \sqrt{\lambda_q}) [v_1^T; ...; v_q^T].

In general, the projection returned by this approach is not guaranteed to be the same as the projection corresponding to minimizing our objective function subject to a rank constraint on A, unless the optimal metric A is of rank less than or equal to q. However, as we show in the experimental results, it is often the case that for practical problems the optimal A has an eigen-spectrum which is rapidly decaying, so that many of its eigenvalues are indeed very small, suggesting the low rank solution will be close to optimal.

4 Learning Metrics with Kernels

It is interesting to consider the case where the x_i are mapped into a high dimensional feature space \phi(x_i) and a Mahalanobis distance is sought in this space. We focus on the case where dot products in the feature space may be expressed via a kernel function, such that \phi(x_i) . \phi(x_j) = k(x_i, x_j) for some kernel k. We now show how our method can be changed to accommodate this setting, so that optimization depends only on dot products.

Consider the regularized target function:

    f_{Reg}(A) = \sum_i KL[ p_0(j|i) | p(j|i) ] + \lambda Tr(A),        (6)

where the regularizing factor is equivalent to the Frobenius norm of the projection matrix W since Tr(A) = ||W||_F^2. Deriving w.r.t. W we obtain W = U X, where U is some matrix which specifies W as a linear combination of sample points, and the i-th row of the matrix X is x_i. Thus A is given by A = X^T U^T U X. Defining the PSD matrix \hat{A} = U^T U, we can recast our optimization as looking for a PSD matrix \hat{A}, where the Mahalanobis distance is (x_i - x_j)^T X^T \hat{A} X (x_i - x_j) = (k_i - k_j)^T \hat{A} (k_i - k_j), where we define k_i = X x_i. This is exactly our original distance, with x_i replaced by k_i, which depends only on dot products in X space. The regularization term also depends solely on the dot products, since Tr(A) = Tr(X^T \hat{A} X) = Tr(X X^T \hat{A}) = Tr(K \hat{A}), where K is the kernel matrix given by K = X X^T. Note that the trace is a linear function of \hat{A}, keeping the problem convex.
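The reformulated distance (k_i - k_j)^T \hat{A} (k_i - k_j) and the trace identity Tr(A) = Tr(K \hat{A}) can be checked numerically (an illustrative sketch with random stand-in matrices, not the authors' code):

```python
# Numerical check of the kernelized distance: with A = X^T A_hat X and
# k_i = X x_i (the i-th row of the Gram matrix K), the Mahalanobis distance
# is computable from dot products alone.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))           # rows are sample points x_i
U = rng.normal(size=(4, 6))           # arbitrary coefficient matrix
A_hat = U.T @ U                       # PSD matrix \hat{A} = U^T U (6 x 6)
A = X.T @ A_hat @ X                   # induced metric in input space (3 x 3)

K = X @ X.T                           # Gram (kernel) matrix; row i is k_i = X x_i

i, j = 0, 3
v = X[i] - X[j]
lhs = v @ A @ v                                  # (x_i - x_j)^T A (x_i - x_j)
rhs = (K[i] - K[j]) @ A_hat @ (K[i] - K[j])      # (k_i - k_j)^T \hat{A} (k_i - k_j)
assert np.isclose(lhs, rhs)

# the regularizer as well: Tr(A) = Tr(K \hat{A})
assert np.isclose(np.trace(A), np.trace(K @ A_hat))
```

Both identities hold exactly, which is what lets the optimization run on \hat{A} and K without ever forming the feature space explicitly.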
Thus, as long as dot products can be represented via kernels, the optimization can be carried out without explicitly using the high dimensional space.

To obtain a low dimensional solution, we follow the approach in Section 3: obtain a decomposition A = V^T D V 6, and take the projection matrix to be the first q rows of D^{0.5} V. As a first step, we calculate a matrix B such that \hat{A} = B^T B, and thus A = X^T B^T B X. Since A is a correlation matrix for the rows of B X, it can be shown (as in Kernel PCA) that its (left) eigenvectors are linear combinations of the rows of B X. Denoting by V = \alpha B X the eigenvector matrix, we obtain, after some algebra, that \alpha B K B^T = D \alpha. We conclude that \alpha is an eigenvector of the matrix B K B^T. Denote by \hat{\alpha} the matrix whose rows are orthonormal eigenvectors of B K B^T. Then V can be shown to be orthonormal if we set V = D^{-0.5} \hat{\alpha} B X. The final projection will then be D^{0.5} V x_i = \hat{\alpha} B k_i. Low dimensional projections will be obtained by keeping only the first q components of this projection.

6 Where V is orthonormal, and the eigenvalues in D are sorted in decreasing order.

5 Experimental Results

We compared our method to several metric learning algorithms on a supervised classification task. Training data was first used to learn a metric over the input space. Then this metric was used in a 1-nearest-neighbor algorithm to classify a test set. The datasets we investigated were taken from the UCI repository and have been used previously in evaluating supervised methods for metric learning [10, 5]. To these we added the USPS handwritten digits (downsampled to 8x8 pixels) and the YALE faces [2] (downsampled to 31x22).

The algorithms used in the comparative evaluation were:
- Fisher's Linear Discriminant Analysis (LDA), which projects on the eigenvectors of S_W^{-1} S_B, where S_W, S_B are the within and between class covariance matrices.
- The method of Xing et al [10], which minimizes the mean within class distance, while keeping the mean between class distance larger than one.
- Principal Component Analysis (PCA). There are several possibilities for scaling the PCA projections. We tested several, and report results of the empirically superior one (PCAW), which scales the projection components so that the covariance matrix after projection is the identity. PCAW often performs poorly on high dimensions, but globally outperforms all other variants.

We also evaluated the kernel version of MCML with an RBF kernel (denoted by KMCML) 7. Since all methods allow projections to lower dimensions we compared performance for different projection dimensions 8.

The out-of-sample performance results (based on 40 random splits of the data taking 70% for training and 30% for testing 9) are shown in Figure 1. It can be seen that when used in a simple nearest-neighbour classifier, the metric learned by MCML almost always performs as well as, or significantly better than those learned by all other methods, across most dimensions. Furthermore, the kernel version of MCML outperforms the linear one on most datasets.

7 The regularization parameter \lambda and the width of the RBF kernel were chosen using 5 fold cross-validation. KMCML was only evaluated for datasets with less than 1000 training points.
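The evaluation protocol just described (learn a metric on the training split, then plug it into a 1-nearest-neighbour classifier) can be sketched as follows (an illustrative helper with names of our choosing, not the paper's code):

```python
# 1-NN evaluation under a learned Mahalanobis metric A: each test point is
# assigned the label of its nearest training point under d(.,.|A).
import numpy as np

def one_nn_error(A, X_train, y_train, X_test, y_test):
    """Return the 1-nearest-neighbour error rate under metric A."""
    diff = X_test[:, None, :] - X_train[None, :, :]          # (m, n, d)
    dist = np.einsum('ijk,kl,ijl->ij', diff, A, diff)        # Mahalanobis distances
    pred = y_train[np.argmin(dist, axis=1)]                  # label of nearest neighbour
    return float(np.mean(pred != y_test))
```

With A = I this reduces to ordinary Euclidean 1-NN, which is the baseline the learned metrics are compared against.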
8 To obtain low dimensional mappings we used the approach outlined in Section 3.
9 Except for the larger datasets where 1000 random samples were used for training.

Figure 1: Classification error rate on several UCI datasets, USPS digits and YALE faces, for different projection dimensions. Algorithms are our Maximally Collapsing Metric Learning (MCML), Xing et al. [10], PCA with whitening transformation (PCAW) and Fisher's Discriminant Analysis (LDA). Standard errors of the means shown on curves. No results given for XING on YALE and KMCML on Digits and Spam due to the data size. [Panels: Wine, Ion, Balance, Spam, Digits, Soybean-small, Protein, Yale7, Housing; axes are projection dimension vs. error rate; curves for MCML, PCAW, LDA, XING, KMCML.]

5.1 Comparison to non convex procedures

The methods in the previous comparison are all well defined, in the sense that they are not susceptible to local minima in the optimization. They also have the added advantage of obtaining projections to all dimensions using one optimization run. Below, we also compare the MCML results to the results of two non-convex procedures.
The first is the Non Convex variant of MCML (NMCML): the objective function of MCML can be optimized w.r.t. the projection matrix W, where A = W^T W. Although this is no longer a convex problem, it is not constrained and is thus easier to optimize. The second non convex method is Neighbourhood Components Analysis (NCA) [5], which attempts to directly minimize the error incurred by a nearest neighbor classifier.

For both methods we optimized the matrix W by restarting the optimization separately for each size of W. Minimization was performed using a conjugate gradient algorithm, initialized by LDA or randomly. Figure 2 shows results on a subset of the UCI datasets. It can be seen that the performance of NMCML is similar to that of MCML, although it is less stable, possibly due to local minima, and both methods usually outperform NCA. The inset in each figure shows the spectrum of the MCML matrix A, revealing that it often drops quickly after a few dimensions. This illustrates the effectiveness of our two stage optimization procedure, and suggests its low dimensional solutions are close to optimal.

6 Discussion and Extensions

We have presented an algorithm for learning maximally collapsing metrics (MCML), based on the intuition of collapsing classes into single points. MCML assumes that each class
MCML assumes that each class\n\n\fWine\n\nMCML\nNMCML\nNCA\n\n2\nProjection Dimension\n\n4\n\n8\n\n6\n\n10\n\nProtein\n\n5\n\n10\n\n15\n\n20\n\n0.2\n\n0.1\n\nt\n\ne\na\nR\n\n \nr\no\nr\nr\n\nE\n\n0\n\n0.6\n\n0.5\n\n0.4\n\n0.3\n\nBalance\n\nSoybean\n\n2\n\n3\n\n4\n\nIon\n\n0.1\n\n0.05\n\n0\n\n0.4\n\n0.35\n\n0.3\n\n0.25\n\n5\n\n10\n\n15\n\n20\n\nHousing\n\n10\n\n20\n\n30\n\n5\n\n10\n\n15\n\n0.4\n\n0.3\n\n0.2\n\n0.1\n1\n\n0.3\n\n0.2\n\n0.1\n\nFigure 2: Classi\ufb01cation error for non convex procedures, and the MCML method.\nEigen-spectra for the MCML solution are shown in the inset.\n\nmay be collapsed to a single point, at least approximately, and thus is only suitable for uni-\nmodal class distributions (or for simply connected sets if kernelization is used). However,\nif points belonging to a single class appear in several disconnected clusters in input (or\nfeature) space, it is unlikely that MCML could collapse the class into a single point. It is\npossible that using a mixture of distributions, an EM-like algorithm can be constructed to\naccommodate this scenario.\n\nThe method can also be used to learn low dimensional projections of the input space. We\nshowed that it performs well, even across a range of projection dimensions, and consistently\noutperforms existing methods. Finally, we have shown how the method can be extended\nto projections in high dimensional feature spaces using the kernel trick. The resulting\nnonlinear method was shown to improve classi\ufb01cation results over the linear version.\n\n[1] N. Alon and P. Pudlak. Equilateral sets in\u0002\u0004 . Geom. Funct. Anal., 13(3), 2003.\n\nReferences\n\n[2] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces:\n\nRecognition using class speci\ufb01c linear projection. In ECCV (1), 1996.\n\n[3] D.P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE\n\nTransaction on Automatic Control, 21(2):174\u2013184, 1976.\n\n[4] S. Boyd and L. Vandenberghe. 
Convex Optimization. Cambridge Univ. Press, 2004.
[5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004.
[6] T. Hastie, R. Tibshirani, and J.H. Friedman. The elements of statistical learning: data mining, inference, and prediction. New York: Springer-Verlag, 2001.
[7] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), 2002.
[8] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2001.
[9] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proc. of ECCV, 2002.
[10] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2004.
", "award": [], "sourceid": 2947, "authors": [{"given_name": "Amir", "family_name": "Globerson", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}]}