{"title": "Approximate Correspondences in High Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 512, "abstract": null, "full_text": "Approximate Correspondences in High Dimensions\n\nKristen Grauman\n\nDepartment of Computer Sciences\n\nUniversity of Texas at Austin\ngrauman@cs.utexas.edu\n\nTrevor Darrell\n\nCS and AI Laboratory\n\nMassachusetts Institute of Technology\n\ntrevor@csail.mit.edu\n\nAbstract\n\nPyramid intersection is an ef\ufb01cient method for computing an approximate partial\nmatching between two sets of feature vectors. We introduce a novel pyramid em-\nbedding based on a hierarchy of non-uniformly shaped bins that takes advantage\nof the underlying structure of the feature space and remains accurate even for sets\nwith high-dimensional feature vectors. The matching similarity is computed in\nlinear time and forms a Mercer kernel. Whereas previous matching approxima-\ntion algorithms suffer from distortion factors that increase linearly with the fea-\nture dimension, we demonstrate that our approach can maintain constant accuracy\neven as the feature dimension increases. When used as a kernel in a discrimina-\ntive classi\ufb01er, our approach achieves improved object recognition results over a\nstate-of-the-art set kernel.\n\n1 Introduction\n\nWhen a single data object is described by a set of feature vectors, it is often useful to consider\nthe matching or \u201ccorrespondence\u201d between two sets\u2019 elements in order to measure their overall\nsimilarity or recover the alignment of their parts. 
For example, in computer vision, images are often\nrepresented as collections of local part descriptions extracted from regions or patches (e.g., [11, 12]),\nand many recognition algorithms rely on establishing the correspondence between the parts from\ntwo images to quantify similarity between objects or localize an object within the image [2, 3, 7].\nLikewise, in text processing, a document may be represented as a bag of word-feature vectors; for\nexample, Latent Semantic Analysis can be used to recover a \u201cword meaning\u201d subspace on which\nto project the co-occurrence count vectors for every word [9]. The relationship between documents\nmay then be judged in terms of the matching between the sets of local meaning features.\n\nThe critical challenge, however, is to compute the correspondences between the feature sets in an\nef\ufb01cient way. The optimal correspondences\u2014those that minimize the matching cost\u2014require cubic\ntime to compute, which quickly becomes prohibitive for sizeable sets and makes processing realistic\nlarge data sets impractical. Due to the optimal matching\u2019s complexity, researchers have developed\napproximation algorithms to compute close solutions for a fraction of the computational cost [4, 8,\n1, 7]. However, previous approximations suffer from distortion factors that increase linearly with\nthe dimension of the features, and they fail to take advantage of structure in the feature space.\n\nIn this paper we present a new algorithm for computing an approximate partial matching between\npoint sets that can remain accurate even for sets with high-dimensional feature vectors, and bene\ufb01ts\nfrom taking advantage of the underlying structure in the feature space. The main idea is to derive a\nhierarchical, data-dependent decomposition of the feature space that can be used to encode feature\nsets as multi-resolution histograms with non-uniformly shaped bins. 
For two such histograms (pyra-\nmids), the matching cost is ef\ufb01ciently calculated by counting the number of features that intersect\nin each bin, and weighting these match counts according to geometric estimates of inter-feature dis-\ntances. Our method allows for partial matchings, which means that the input sets can have varying\nnumbers of features in them, and outlier features from the larger set can be ignored with no penalty\n\n\fto the matching cost. The matching score is computed in time linear in the number of features per\nset, and it forms a Mercer kernel suitable for use within existing kernel-based algorithms.\n\nIn this paper we demonstrate how, unlike previous set matching approximations (including our orig-\ninal pyramid match algorithm [7]), the proposed approach can maintain consistent accuracy as the\ndimension of the features within the sets increases. We also show how the data-dependent hierarchi-\ncal decomposition of the feature space produces more accurate correspondence \ufb01elds than a previous\napproximation that uses a uniform decomposition. Finally, using our matching measure as a kernel\nin a discriminative classi\ufb01er, we achieve improved object recognition results over a state-of-the-art\nset kernel on a benchmark data set.\n\n2 Related Work\n\nSeveral previous matching approximation methods have also considered a hierarchical decomposi-\ntion of the feature space to reduce matching complexity, but all suffer from distortion factors that\nscale linearly with the feature dimension [4, 8, 1, 7]. In this work we show how to alleviate this\ndecline in accuracy for high-dimensional data by tuning the hierarchical decomposition according\nto the particular structure of the data, when such structure exists.\n\nWe build on our pyramid match algorithm [7], a partial matching approximation that also uses\nhistogram intersection to ef\ufb01ciently count matches implicitly formed by the bin structures. 
However,\nin contrast to [7], our use of data-dependent, non-uniform bins and a more precise weighting scheme\nresults in matchings that are consistently accurate for structured, high-dimensional data.\n\nThe idea of partitioning a feature space with vector quantization (VQ) is fairly widely used in prac-\ntice; in the vision literature in particular, VQ has been used to establish a vocabulary of prototypical\nimage features, from \u201ctextons\u201d to the \u201cvisual words\u201d of [16]. A variant of the pyramid match ap-\nplied to spatial features was shown to be effective for matching quantized features in [10]. More\nrecently, the authors of [13] have shown that a tree-structured vector quantization (TSVQ [5]) of im-\nage features provides a scalable means of indexing into a very large feature vocabulary. The actual\ntree structure employed is similar to the one constructed in this work; however, whereas the authors\nof [13] are interested in matching individual features to one another to access an inverted \ufb01le, our\napproach computes approximate correspondences between sets of features. Note the distinction be-\ntween the problem we are addressing\u2014approximate matchings between sets\u2014and the problem of\nef\ufb01ciently identifying approximate or exact nearest neighbor feature vectors (e.g., via k-d trees): in\nthe former, the goal is a one-to-one correspondence between sets of vectors, whereas in the latter, a\nsingle vector is independently matched to a nearby vector.\n\n3 Approach\n\nThe main contribution of this work is a new very ef\ufb01cient approximate bipartite matching method\nthat measures the correspondence-based similarity between unordered, variable-sized sets of vec-\ntors, and can optionally extract an explicit correspondence \ufb01eld. 
We call our algorithm the\nvocabulary-guided (VG) pyramid match, since the histogram pyramids are defined by the \u201cvocabulary\u201d\nor structure of the feature space, and the pyramids are used to count implicit matches.\n\nThe basic idea is to first partition the given feature space into a pyramid of non-uniformly shaped\nregions based on the distribution of a provided corpus of feature vectors. Point sets are then encoded as\nmulti-resolution histograms determined by that pyramid, and an efficient intersection-based\ncomputation between any two histogram pyramids yields an approximate matching score for the original\nsets. The implicit matching version of our method estimates the inter-feature distances based on\ntheir respective distances to the bin centers. To produce an explicit correspondence field between\nthe sets, we use the pyramid construct to divide-and-conquer the optimal matching computation. As\nour experiments will show, the proposed algorithm in practice provides a good approximation to the\noptimal partial matching, but is orders of magnitude faster to compute.\n\nPreliminaries: We consider a feature space F of d-dimensional vectors, $F \\subseteq \\mathbb{R}^d$. The point sets\nour algorithm matches will come from the input space S, which contains sets of feature vectors\ndrawn from F: $S = \\{X \\mid X = \\{x_1, \\ldots, x_m\\}\\}$, where each $x_i \\in F$, and the value m = |X| may\nvary across instances of sets in S. Throughout the text we will use the terms feature, vector, and\npoint interchangeably to refer to the elements within a set.\n\n(a) Uniform bins\n\n(b) Vocabulary-guided bins\n\nFigure 1: Rather than carve the feature space into uniformly-shaped partitions (left), we let the vocabulary\n(structure) of the feature space determine the partitions (right). As a result, the bins are better concentrated on\ndecomposing the space where features cluster, particularly for high-dimensional feature spaces. 
These figures\ndepict the grid boundaries for two resolution levels for a 2-D feature space. In both (a) and (b), the left plot\ncontains the coarser resolution level, and the right plot contains the finer one. Features are red points, bin\ncenters are larger black points, and blue lines denote bin boundaries.\n\nA partial matching between two point sets is an assignment that maps all points in the smaller set\nto some subset of the points in the larger (or equally-sized) set. Given point sets X and Y, where\nm = |X|, n = |Y|, and m \u2264 n, a partial matching $M(X, Y; \\pi) = \\{(x_1, y_{\\pi_1}), \\ldots, (x_m, y_{\\pi_m})\\}$\npairs each point in X to some unique point in Y according to the permutation of indices specified\nby $\\pi = [\\pi_1, \\ldots, \\pi_m]$, $1 \\le \\pi_i \\le n$, where $\\pi_i$ specifies which point $y_{\\pi_i} \\in Y$ is matched to $x_i \\in X$,\nfor $1 \\le i \\le m$. The cost of a partial matching is the sum of the distances between matched points:\n$C(M(X, Y; \\pi)) = \\sum_{x_i \\in X} \\|x_i - y_{\\pi_i}\\|_2$. The optimal partial matching $M(X, Y; \\pi^*)$ uses the\nassignment $\\pi^*$ that minimizes this cost: $\\pi^* = \\operatorname{argmin}_{\\pi} C(M(X, Y; \\pi))$. It is this matching that\nwe wish to efficiently approximate. In Section 3.2 we describe how our algorithm approximates\nthe cost $C(M(X, Y; \\pi^*))$; for a small increase in computational cost we can also extract explicit\ncorrespondences to estimate $\\pi^*$ itself.\n\n3.1 Building Vocabulary-Guided Pyramids\nThe first step is to generate the structure of the vocabulary-guided (VG) pyramid to define the bin\nplacement for the multi-resolution histograms used in the matching. This is a one-time process\nperformed before any matching takes place. We would like the bins in the pyramid to follow the\nfeature distribution and concentrate partitions where the features actually fall. 
To accomplish this,\nwe perform hierarchical clustering on a sample of representative feature vectors from F.\nWe randomly select some example feature vectors from the feature type of interest to form the\nrepresentative feature corpus, and perform hierarchical k-means clustering with the Euclidean distance to\nbuild the pyramid tree. Other hierarchical clustering techniques, such as agglomerative clustering,\nare also possible and do not change the operation of the method. For this unsupervised clustering\nprocess there are two parameters: the number of levels in the tree L, and the branching factor k.\nThe initial corpus of features is clustered into k top-level groups, where group membership is\ndetermined by the Voronoi partitioning of the feature corpus according to the k cluster centers. Then the\nclustering is repeated recursively L \u2212 1 times on each of these groups, filling out a tree with L total\nlevels containing $k^i$ bins (nodes) at level i, where levels are counted from the root (i = 0) to the\nleaves (i = L \u2212 1). The bins are irregularly shaped and sized, and their boundaries are determined\nby the Voronoi cells surrounding the cluster centers. (See Figure 1.) For each bin in the VG pyramid\nwe record its diameter, which we estimate empirically based on the maximal inter-feature distance\nbetween any points from the initial feature corpus that were assigned to it.\n\nOnce we have constructed a VG pyramid, we can embed point sets from S as multi-resolution\nhistograms. A point\u2019s placement in the pyramid is determined by comparing it to the appropriate k\nbin centers at each of the L pyramid levels. The histogram count is incremented for the bin (among\nthe k choices) that the point is nearest to in terms of the same distance function used to cluster the\ninitial corpus. We then push the point down the tree and continue to increment finer level counts\nonly along the branch (bin center) that is chosen at each level. 
So a point is first assigned to one of\nthe top-level clusters, then it is assigned to one of its children, and so on recursively. This amounts\nto a total of $kL$ distances (k comparisons at each of the L levels) that must be computed between a point and the pyramid\u2019s bin centers.\nGiven the bin structure of the VG pyramid, a point set X is mapped to its pyramid: $\\Psi(X) = [H_0(X), \\ldots, H_{L-1}(X)]$,\nwith $H_i(X) = [\\langle p, n, d \\rangle_1, \\ldots, \\langle p, n, d \\rangle_{k^i}]$, and where $H_i(X)$ is a $k^i$-dimensional\nhistogram associated with level i in the pyramid, $p \\in \\mathbb{Z}^i$ for entries in $H_i(X)$, and\n$0 \\le i < L$. Each entry in this histogram is a triple $\\langle p, n, d \\rangle$ giving the bin index, the bin count, and\nthe bin\u2019s points\u2019 maximal distance to the bin center, respectively.\n\nStoring the VG pyramid itself requires space for $O(k^L)$ d-dimensional feature vectors, i.e., all of\nthe cluster centers. However, each point set\u2019s histogram is stored sparsely, meaning only $O(mL)$\nnonzero bin counts are maintained to encode the entire pyramid for a set with m features. This is\nan important point: we do not store $O(k^L)$ counts for every point set; $H_i(X)$ is represented by at\nmost m triples having n > 0. We achieve a sparse implementation as follows: each vector in a set is\npushed through the tree as described above. At every level i, we record a $\\langle p, n, d \\rangle$ triple describing\nthe nonzero entry for the current bin. The vector $p = [p_1, \\ldots, p_i]$, $p_j \\in [1, k]$ denotes the indices\nof the clusters traversed from the root so far, $n \\in \\mathbb{Z}^+$ denotes the count for the bin (initially 1),\nand $d \\in \\mathbb{R}$ denotes the distance computed between the inserted point and the current bin\u2019s center.\nUpon reaching the leaf level, p is an L-dimensional path-vector indicating which of the k bins were\nchosen at each level, and every path-vector uniquely identifies some bin on the pyramid.\n\nInitially, an input set with m features yields a total of $mL$ such triples: there is one nonzero entry\nper level per point, and each has n = 1. Then each of the L lists of entries is sorted by the index\nvectors (p in the triple), and they are collapsed to a list of sorted nonzero entries with unique indices:\nwhen two or more entries with the same index are found, they are replaced with a single entry with\nthe same index for p, the summed counts for n, and the maximum distance for d. The sorting is done\nin linear time using integer sorting algorithms. Maintaining the maximum distance of any point in a\nbin to the bin center will allow us to efficiently estimate inter-point distances at the time of matching,\nas described in Section 3.2.\n\n3.2 Vocabulary-Guided Pyramid Match\nGiven two point sets\u2019 pyramid encodings, we efficiently compute the approximate matching score\nusing a simple weighted intersection measure. The VG pyramid\u2019s multi-resolution partitioning of\nthe feature space is used to direct the matching. The basic intuition is to start collecting groups of\nmatched points from the bottom of the pyramid up, i.e., from within increasingly larger partitions.\nIn this way, we will first consider matching the closest points (at the leaves), and as we climb to\nthe higher-level clusters in the pyramid we will allow increasingly further points to be matched. 
We\ndefine the number of new matches within a bin to be a count of the minimum number of points either\nof the two input sets contributes to that bin, minus the number of matches already counted by any of\nits child bins. A weighted sum of these counts yields an approximate matching score.\nLet $n_{ij}(X)$ denote the element n from $\\langle p, n, d \\rangle_j$, the jth bin entry of histogram $H_i(X)$, and let\n$c_h(n_{ij}(X))$ denote the element n for the hth child bin of that entry, $1 \\le h \\le k$. Similarly, let\n$d_{ij}(X)$ refer to the element d from the same triple. Given point sets X and Y, we compute the\nmatching score via their pyramids $\\Psi(X)$ and $\\Psi(Y)$ as follows:\n\n$C(\\Psi(X), \\Psi(Y)) = \\sum_{i=0}^{L-1} \\sum_{j=1}^{k^i} w_{ij} \\left[ \\min(n_{ij}(X), n_{ij}(Y)) - \\sum_{h=1}^{k} \\min(c_h(n_{ij}(X)), c_h(n_{ij}(Y))) \\right]$ (1)\n\nThe outer sum loops over the levels in the pyramids; the second sum loops over the bins at a given\nlevel, and the innermost sum loops over the children of a given bin. The first min term reflects\nthe number of matchable points in the current bin, and the second min term tallies the number of\nmatches already counted at finer resolutions (in child bins). Note that as the leaf nodes have no\nchildren, when i = L \u2212 1 the last sum is zero. All matches are new at the leaves. The matching\nscores are normalized according to the size of the input sets in order to not favor larger sets.\nThe number of new matches calculated for a bin is weighted by $w_{ij}$, an estimate of the distance\nbetween points contained in the bin.1 With a VG pyramid match there are two alternatives for the\ndistance estimate: (a) weights based on the diameters of the pyramid\u2019s bins, or (b) input-dependent\nweights based on the maximal distances of the points in the bin to its center. 
Option (a) is a con-\nservative estimate of the actual inter-point distances in the bin if the corpus of features used to build\nthe pyramid is representative of the feature space; its advantages are that it provides a guaranteed\nMercer kernel (see below) and eliminates the need to store a distance d in the entry triples. Option\n(b)\u2019s input-speci\ufb01c weights estimate the distance between any two points in the bin as the sum of the\nstored maximal to-center distances from either input set: wij = dij(X) + dij(Y). This weighting\n\n1To use our matching as a cost function, weights are set as the distance estimates; to use as a similarity\n\nmeasure or kernel, weights are set as (some function of) the inverse of the distance estimates.\n\n\fgives a true upper bound on the furthest any two points could be from one another, and it has the po-\ntential to provide tighter estimates of inter-feature distances (as we con\ufb01rm experimentally below);\nhowever, we cannot guarantee this weighting will yield a Mercer kernel.\n\nJust as we encode the pyramids sparsely, we derive a means to compute intersections in Eqn. 1\nwithout ever traversing the entire pyramid tree. Given two sparse lists Hi(X) and Hi(Y) which have\nbeen sorted according to the bin indices, we obtain the minimum counts in linear time by moving\npointers down the lists and processing only those nonzero entries that share an index, making the\ntime required to compute a matching between two pyramids O(mL). A key aspect of our method is\nthat we obtain a measure of matching quality between two point sets without computing pair-wise\ndistances between their features\u2014an O(m2) savings over sub-optimal greedy matchings. 
Instead,\nwe exploit the fact that the points\u2019 placement in the pyramid re\ufb02ects their distance from one another.\nThe only inter-feature distances computed are the kL distances needed to insert a point into the\npyramid, and this small one-time cost is amortized every time we re-use a histogram to approximate\nanother matching against a different point set.\n\nWe \ufb01rst suggested the idea of using histogram intersection to count implicit matches in a multi-\nresolution grid in [7]. However, in [7], bins are constructed to uniformly partition the space, bin\ndiameters exponentially increase over the levels, and intersections are weighted indistinguishably\nacross an entire level. In contrast, here we have developed a pyramid embedding that partitions\naccording to the distribution of features, and weighting schemes that allow more precise approxima-\ntions of the inter-feature costs. As we will show in Section 4, our VG pyramid match remains accu-\nrate and ef\ufb01cient even for high-dimensional feature spaces, while the uniform-bin pyramid match is\nlimited in practice to relatively low-dimensional features.\n\nFor the increased accuracy our method provides, there are some complexity trade-offs versus [7],\nwhich does not require computing any distances to place the points into bins; their uniform shape\nand size allows points to be placed directly via division by bin size. On the other hand, sorting the\nbin indices with the VG method has a lower complexity, since the values only range to k, the branch\nfactor, which is typically much smaller than the aspect ratio that bounds the range in [7]. 
In addition,\nas we show in Section 4, in practice the cost of extracting an explicit correspondence field using the\nuniform-bin pyramid in high dimensions approaches the cubic cost of the optimal measure, whereas\nit remains linear with the proposed approach, assuming features are not uniformly distributed.\n\nOur approximation can be used to compare sets of vectors in any case where the presence of low-cost\ncorrespondences indicates their similarity (e.g., nearest-neighbor retrieval). We can also employ\nthe measure as a kernel function for structured inputs. According to Mercer\u2019s theorem, a kernel is\np.s.d. if and only if it corresponds to an inner product in some feature space [15]. We can re-write\nEqn. 1 as: $C(\\Psi(X), \\Psi(Y)) = \\sum_{i=0}^{L-1} \\sum_{j=1}^{k^i} (w_{ij} - p_{ij}) \\min(n_{ij}(X), n_{ij}(Y))$, where $p_{ij}$ refers\nto the weight associated with the parent bin of the jth node at level i. Since the min operation is\np.d. [14], and since kernels are closed under summation and scaling by a positive constant [15], we\nhave that the VG pyramid match is a Mercer kernel if $w_{ij} \\ge p_{ij}$. This inequality holds if every\nchild bin receives a similarity weight that is greater than its parent bin, or rather that every child\nbin has a distance estimate that is less than that of its parent. Indeed this is the case for weighting\noption (a), where $w_{ij}$ is inversely proportional to the diameter of the bin. It holds by definition of the\nhierarchical clustering: the diameter of a subset of points must be less than or equal to the diameter\nof all those points. We cannot make this guarantee for weighting option (b).\n\nIn addition to scalar matching scores, we can optionally extract explicit correspondence fields\nthrough the pyramid. In this case, the VG pyramid decomposes the required matching computation\ninto a hierarchy of smaller matchings. 
Upon encountering a bin with a nonzero intersection,\nthe optimal matching is computed between only those features from the two sets that fall into that\nparticular bin. All points that are used in that per-bin matching are then flagged as matched and may\nnot take part in subsequent matchings at coarser resolutions of the pyramid.\n\n4 Results\nIn this section, we provide results to empirically demonstrate our matching\u2019s accuracy and efficiency\non real data, and we compare it to a pyramid match using a uniform partitioning of the feature\nspace. In addition to directly evaluating the matching scores and correspondence fields, we show\nthat our method leads to improved object recognition performance when used as a kernel within a\ndiscriminative classifier.\n\n[Figure 2 plots. Left: Ranking quality over feature dimensions; x-axis: feature dimension (d); y-axis: Spearman rank correlation with optimal match (R); curves: uniform bin pyramid, VG pyramid with input-specific weights, VG pyramid with global weights. Right: optimal ranks plotted against uniform bin pyramid match ranks, d=8 (R=0.86) and d=128 (R=0.78), and against VG pyramid match (input-specific weights) ranks, d=8 (R=0.92) and d=128 (R=0.95).]\n\nFigure 2: Comparison of optimal and approximate matching rankings on image data. Left: The set rankings\nproduced with the VG pyramid match are consistently accurate for increasing feature dimensions, while the\naccuracy with uniform bins degrades about linearly in the feature dimension. Right: Example rankings for both\napproximations at d = [8, 128].\n\nApproximate Matching Scores: In these experiments, we extracted local SIFT [11] features from\nimages in the ETH-80 database, producing an unordered set of about m = 256 vectors for every\nexample. In this case, F is the space of SIFT image features. We sampled some features from 300 of\nthe images to build the VG pyramid, and 100 images were used to test the matching. In order to test\nacross varying feature dimensions, we also used some training features to establish a PCA subspace\nthat was used to project features onto varying numbers of bases. For each feature dimension, we\nbuilt a VG pyramid with k = 10 and L = 5, encoded the 100 point sets as pyramids, and computed\nthe pair-wise matching scores with both our method and the optimal least-cost matching.\n\nIf our measure is approximating the optimal matching well, we should find the ranking we induce\nto be highly correlated with the ranking produced by the optimal matching for the same data. In\nother words, the images should be sorted similarly by either method. Spearman\u2019s rank correlation\ncoefficient R provides a good quantitative measure to evaluate this: $R = 1 - \\frac{6 \\sum_{1}^{N} D^2}{N(N^2 - 1)}$,\nwhere D is the difference in rank for the N corresponding ordinal values assigned by the two\nmeasures. 
The left plot in Figure 2 shows the Spearman correlation scores against the optimal measure\nfor both our method (with both weighting options) and the approximation in [7] for varying feature\ndimensions for the 10,000 pair-wise matching scores for the 100 test sets. Due to the randomized\nelements of the algorithms, for each method we have plotted the mean and standard deviation of the\ncorrelation for 10 runs on the same data.\n\nWhile the VG pyramid match remains consistently accurate for high feature dimensions (R = 0.95\nwith input-speci\ufb01c weights), the accuracy of the uniform bins degrades rapidly for dimensions\nover 10. The ranking quality of the input-speci\ufb01c weighting scheme (blue diamonds) is somewhat\nstronger than that of the \u201cglobal\u201d bin diameter weighting scheme (green squares). The four plots\non the right of Figure 2 display the actual ranks computed for both approximations for two of the\n26 dimensions summarized in the left plot. The black diagonals denote the optimal performance,\nwhere the approximate rankings would be identical to the optimal ones; higher Spearman correla-\ntions have points clustered more tightly along this diagonal. For the low-dimensional features, the\nmethods perform fairly comparably; however, for the full 128-D features, the VG pyramid match\nis far superior (rightmost column). The optimal measure requires about 1.25s per match, while our\napproximation is about 2500x faster at 5 \u00d7 10\u22124s per match. Computing the pyramid structure from\nthe feature corpus took about three minutes in Matlab; this is a one-time of\ufb02ine cost.\n\nFor a pyramid matching to work well, the gradation in bin sizes up the pyramid must be such\nthat at most levels of the pyramid we can capture distinct groups of points to match within the\nbins. 
That is, unless all the points in two sets are equidistant, the bin placement must allow us to\nmatch very near points at the finest resolutions, and gradually add matches that are more distant\nat coarser resolutions. In low dimensions, both uniform or data-dependent bins can achieve this.\nIn high dimensions, however, uniform bin placement and exponentially increasing bin diameters\nfail to capture such a gradation: once any features from different point sets are close enough to\nmatch (share bins), the bins are so large that almost all of them match. The matching score is then\napproximately the number of points weighted by a single bin size. In contrast, when we tailor the\nfeature space partitions to the distribution of the data, even in high dimensions the match counts\nincrease gradually across levels, thereby yielding more discriminating implicit matches. Figure 3\nconfirms this intuition, again using the ETH-80 image data from above.\n\n[Figure 3 plots. Four panels (d = 3, d = 8, d = 13, d = 128); x-axis: pyramid level; y-axis: number of new matches formed; curves: vocabulary-guided bins, uniform bins.]\n\nFigure 3: Number of new matches formed at each pyramid level for either uniform (dashed red) or VG (solid\nblue) bins for increasing feature dimensions. Points represent mean counts per level for 10,000 matches. In low\ndimensions, both partition styles gradually collect matches up the pyramid. In high dimensions with uniform\npartitions, points begin sharing a bin \u201call at once\u201d; in contrast, the VG bins still accrue new matches consistently\nacross levels since the decomposition is tailored to where points cluster in the feature space.\n\n[Figure 4 plots. Left: Error in approximate correspondence fields; x-axis: feature dimension (d); y-axis: mean error per match (\u00d7 10^5). Right: Computation time; x-axis: feature dimension (d); y-axis: mean time per match (s), log scale. Curves: optimal (right plot only); uniform bins and vocabulary-guided bins, each with random or optimal per-bin assignment.]\n\nFigure 4: Comparison of correspondence field errors (left) and associated computation times (right). This\nfigure is best viewed in color. (Note that errors level out with d for all methods due to PCA.)\n\nApproximate Correspondence Fields: For the same image data, we ran the explicit matching\nvariant of our method and compared the induced correspondences to those produced by the globally\noptimal measure. For comparison, we also applied the same variant to pyramids with uniform bins.\nWe measure the error of an approximate matching $\\hat{\\pi}$ by the sum of the errors at every link in the field:\n$E(M(X, Y; \\hat{\\pi}), M(X, Y; \\pi^*)) = \\sum_{x_i \\in X} \\|y_{\\hat{\\pi}_i} - y_{\\pi^*_i}\\|_2$. Figure 4 compares the correspondence\nfield error and computation times for the VG and uniform pyramids. For each approximation, there\nare two variations tested: in one, an optimal assignment is computed for all points in the same bin;\nfor the other, a random assignment is made. The left plot shows the mean error per match for each\nmethod, and the right plot shows the corresponding mean time required to compute those matches.\n\nThe computation times are as we would expect: the optimal matching is orders of magnitude more\nexpensive than the approximations. Using the random assignment variation, both approximations\nhave negligible costs, since they simply choose any combination of points within a bin. However, in\nhigh dimensions, the time required by the uniform bin pyramid with the optimal per-bin matching\napproaches the time required by the optimal matching itself. This occurs for similar reasons as the\npoorer matching score accuracy exhibited by the uniform bins, both in the left plot and above in\nFigure 2; since most or all of the points begin to match at a certain level, the pyramid does not help\nto divide-and-conquer the computation, and for high dimensions, the optimal matching in its entirety\nmust be computed. In contrast, the expense of the VG pyramid matching remains steady and low,\neven for high dimensions, since data-dependent pyramids better divide the matching labor into the\nnatural segments in the feature space.\n\nFor similar reasons, the errors are comparable for the optimal per-bin variation with either the VG\nor uniform bins. The VG bins divide the computation so it can be done inexpensively, while the\nuniform bins divide the computation poorly and must compute it expensively, but about as\naccurately. 
Likewise, the error for the uniform bins when using a per-bin random assignment is very\nhigh for any but the lowest dimensions (red line on left plot), since such a large number of points\nare being randomly assigned to one another. In contrast, the VG bins actually result in similar errors\nwhether the points in a bin are matched optimally or randomly (blue and pink lines on left plot).\nThis again indicates that tuning the pyramid bins to the data\u2019s distribution achieves a much more\nsuitable breakdown of the computation, even in high dimensions.\n\nPyramid matching method | Mean recognition rate/class (d=128 / d=10) | Time/match (s) (d=128 / d=10)\nVocabulary-guided bins | 99.0 / 97.7 | 6.1e-4 / 6.2e-4\nUniform bins | 64.9 / 96.5 | 1.5e-3 / 5.7e-4\n\nRealizing Improvements in Recognition: Finally, we have experimented with the VG pyramid\nmatch within a discriminative classifier for an object recognition task. We trained an SVM with\nour matching as the kernel to recognize the four categories in the Caltech-4 benchmark data set.\nWe trained with 200 images per class and tested with all the remaining images. We extracted\nfeatures using both the Harris and MSER [12] detectors and the 128-D SIFT [11] descriptor. We also\ngenerated lower-dimensional (d = 10) features using PCA. To form a Mercer kernel, the weights\nwere set according to each bin diameter $A_{ij}$: $w_{ij} = e^{-A_{ij}/\\sigma}$, with $\\sigma$ set automatically as the mean\ndistance between a sample of features from the training set. The table shows our improvements\nover the uniform-bin pyramid match kernel. The VG pyramid match is more accurate and requires\nminor additional computation. 
Our near-perfect performance on this data set is comparable to that\nreached by others in the literature; the real significance of the result is that it distinguishes what\ncan be achieved with a VG pyramid embedding as opposed to the uniform histograms used in [7],\nparticularly for high-dimensional features. In addition, here the optimal matching requires 0.31 s per\nmatch, over 500x the cost of our method.\nConclusion: We have introduced a linear-time method to compute a matching between point sets\nthat takes advantage of the underlying structure in the feature space and remains consistently\naccurate and efficient for high-dimensional inputs on real image data. Our results demonstrate the\nstrength of the approximation empirically, compare it directly against an alternative state-of-the-art\napproximation, and successfully use it as a Mercer kernel for an object recognition task. We have\ncommented most on potential applications in vision and text, but in fact it is a generic matching\nmeasure that can be applied whenever it is meaningful to compare sets by their correspondence.\nAcknowledgments: We thank Ben Kuipers for suggesting the use of Spearman\u2019s rank correlation.\n\nReferences\n[1] P. Agarwal and K. R. Varadarajan. A Near-Linear Algorithm for Euclidean Bipartite Matching. In Symposium on Computational Geometry, 2004.\n[2] S. Belongie, J. Malik, and J. Puzicha. Shape Matching and Object Recognition Using Shape Contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(4):509\u2013522, April 2002.\n[3] A. Berg, T. Berg, and J. Malik. Shape Matching and Object Recognition using Low Distortion Correspondences. In Proc. IEEE Conf. on Comp. Vision and Pattern Recognition, San Diego, CA, June 2005.\n[4] M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, 2002.\n[5] A. Gersho and R. Gray. 
Vector Quantization and Signal Compression. Springer, 1992.\n[6] K. Grauman. Matching Sets of Features for Efficient Retrieval and Recognition. PhD thesis, MIT, 2006.\n[7] K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. In Proc. IEEE Int. Conf. on Computer Vision, Beijing, China, Oct 2005.\n[8] P. Indyk and N. Thaper. Fast Image Retrieval via Embeddings. In 3rd International Workshop on Statistical and Computational Theories of Vision, Nice, France, Oct 2003.\n[9] T. K. Landauer, P. W. Foltz, and D. Laham. Introduction to LSA. Discourse Processes, 25:259\u201384, 1998.\n[10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Scene Categories. In Proc. IEEE Conf. on Comp. Vision and Pattern Recognition, June 2006.\n[11] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91\u2013110, Jan 2004.\n[12] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. In British Machine Vision Conference, Cardiff, UK, Sept 2002.\n[13] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In Proc. IEEE Conf. on Comp. Vision and Pattern Recognition, New York City, NY, June 2006.\n[14] F. Odone, A. Barla, and A. Verri. Building Kernels from Binary Strings for Image Matching. IEEE Trans. on Image Processing, 14(2):169\u2013180, Feb 2005.\n[15] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.\n[16] J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proc. IEEE Int. Conf. 
on Computer Vision, Nice, Oct 2003.", "award": [], "sourceid": 3030, "authors": [{"given_name": "Kristen", "family_name": "Grauman", "institution": null}, {"given_name": "Trevor", "family_name": "Darrell", "institution": null}]}