{"title": "Learning to Prune in Metric and Non-Metric Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 1574, "page_last": 1582, "abstract": "Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-to-prune approaches: density estimation through sampling and \u201cstretching\u201d of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality.", "full_text": "Learning to Prune in Metric and Non-Metric Spaces\n\nLeonid Boytsov\nCarnegie Mellon University\nPittsburgh, PA, USA\nsrchvrs@cmu.edu\n\nBilegsaikhan Naidan\nNorwegian University of Science and Technology\nTrondheim, Norway\nbileg@idi.ntnu.no\n\nAbstract\n\nOur focus is on approximate nearest neighbor retrieval in metric and non-metric\nspaces. We employ a VP-tree and explore two simple yet effective learning-to-prune\napproaches: density estimation through sampling and \u201cstretching\u201d of the\ntriangle inequality. Both methods are evaluated using data sets with metric (Euclidean)\nand non-metric (KL-divergence and Itakura-Saito) distance functions.\nConditions on spaces where the VP-tree is applicable are discussed. The VP-tree\nwith a learned pruner is compared against the recently proposed state-of-the-art\napproaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and\npermutation methods.\n
Our method was competitive against state-of-the-art methods\nand, in most cases, was more efficient for the same rank approximation quality.\n\n1 Introduction\n\nSimilarity search algorithms are essential to multimedia retrieval, computational biology, and\nstatistical machine learning. Resemblance between objects x and y is typically expressed in the form\nof a distance function d(x, y), where smaller values indicate less dissimilarity. In our work we\nuse the Euclidean distance (L2), the KL-divergence (\u2211 xi log(xi/yi)), and the Itakura-Saito distance\n(\u2211 (xi/yi \u2212 log(xi/yi) \u2212 1)). KL-divergence is commonly used in text analysis, image classification,\nand machine learning [6]. Both KL-divergence and the Itakura-Saito distance belong to a class of\ndistances called Bregman divergences.\nOur interest is in the nearest neighbor (NN) search, i.e., we aim to retrieve the object o that is closest\nto the query q. For the KL-divergence and other non-symmetric distances two types of NN-queries\nare defined. The left NN-query returns the object o that minimizes the distance d(o, q), while the\nright NN-query finds o that minimizes d(q, o).\nThe distance function can be computationally expensive. There was a considerable effort to reduce\ncomputational costs through approximating the distance function, projecting data in a\nlow-dimensional space, and/or applying a hierarchical space decomposition. In the case of the\nhierarchical space decomposition, a retrieval process is a recursion that employs an \u201coracle\u201d procedure. At\neach step of the recursion, retrieval can continue in one or more partitions. The oracle allows one\nto prune partitions without directly comparing the query against data points in these partitions. To\nthis end, the oracle assesses the query and estimates which partitions may contain an answer and,\ntherefore, should be recursively analyzed.\n
A pruning algorithm is essentially a binary classifier. In\nmetric spaces, one can use the classifier based on the triangle inequality. In non-metric spaces, a\nclassifier can be learned from data.\nThere are numerous data structures that speed up the NN-search by creating hierarchies of partitions\nat index time, most notably the VP-tree [28, 31] and the KD-tree [4]. A comprehensive review of\nthese approaches can be found in books by Zezula et al. [32] and Samet [27]. As dimensionality\nincreases, the filtering efficiency of space-partitioning methods decreases rapidly, which is known\nas the \u201ccurse of dimensionality\u201d [30]. This happens because in high-dimensional spaces histograms\nof distances and 1-Lipschitz function values become concentrated [25]. The negative effect can be\npartially offset by creating overlapping partitions (see, e.g., [21]) and, thus, trading index size for\nretrieval time. The approximate NN-queries are less affected by the curse of dimensionality,\nbecause it is possible to reduce retrieval time at the cost of missing some relevant answers [18, 9, 25].\nLow-dimensional data sets embedded into a high-dimensional space do not exhibit high\nconcentration of distances, i.e., their intrinsic dimensionality is low. In metric spaces, it was proposed to\ncompute the intrinsic dimensionality as half of the squared signal-to-noise ratio (for the distance\ndistribution) [10].\nA well-known approximate NN-search method is the locality sensitive hashing (LSH) [18, 17]. It is\nbased on the idea of random projections [18, 20]. There is also an extension of the LSH for\nsymmetric non-metric distances [23]. The LSH employs several hash functions: It is likely that close objects\nhave the same hash values and distant objects have different hash values.\n
In the classic LSH index, the\nprobability of finding an element in one hash table is small and, consequently, many hash tables\nare to be created during indexing. To reduce space requirements, Lv et al. proposed a multi-probe\nversion of the LSH, which can query multiple buckets of the same hash table [22]. Performance of\nthe LSH depends on the choice of parameters, which can be tuned to fit the distribution of data [11].\nFor approximate searching it was demonstrated that an early termination strategy could rely on\ninformation about distances from typical queries to their respective nearest neighbors [33, 1]. Amato et\nal. [1] showed that density estimates can be used to approximate a pruning function in metric spaces.\nThey relied on a hierarchical decomposition method (an M-tree) and proposed to visit partitions in\nthe order defined by density estimates. Ch\u00e1vez and Navarro [9] proposed to relax triangle-inequality\nbased lower bounds for distances to potential nearest neighbors. The approach, which they dubbed\nas stretching of the triangle inequality, involves multiplying an exact bound by \u03b1 > 1.\nFew methods were designed to work in non-metric spaces. One common indexing approach involves\nmapping the data to a low-dimensional Euclidean space. The goal is to find the mapping without\nlarge distortions of the original similarity measure [19, 16]. Jacobs et al. [19] review various\nprojection methods and argue that such a coercion is often against the nature of a similarity measure,\nwhich can be, e.g., intrinsically non-symmetric. A mapping can be found using machine learning\nmethods. This can be done either separately for each data point [12, 24] or by computing one global\nmodel [3]. There are also a number of approaches, where machine learning is used to estimate\noptimal parameters of classic search methods [7].\n
Vermorel [29] applied VP-trees to searching in\nundisclosed non-metric spaces without trying to learn a pruning function. Like Amato et al. [1], he\nproposed to visit partitions in the order defined by density estimates and employed the same early\ntermination method as Zezula et al. [33].\nCayton [6] proposed a Bregman ball tree (bbtree), which is an exact search method for Bregman\ndivergences. The bbtree divides data into two clusters (each covered by a Bregman ball) and\nrecursively repeats this procedure for each cluster until the number of data points in a cluster falls below\na threshold (a bucket size). At search time, the method relies on properties of Bregman divergences\nto compute the shortest distances to covering balls. This is an expensive iterative procedure that\nmay require several computations of direct and inverse gradients, as well as of several distances.\nAdditionally, Cayton [6] employed an early termination method: The algorithm can be told to stop\nafter processing a pre-specified number of buckets. The resulting method is an approximate search\nprocedure. Zhang et al. [34] proposed an exact search method based on estimating the maximum\ndistance to a bounding rectangle, but it works with left queries only. The most efficient variant of\nthis method relies on an optimization technique applicable only to certain decomposable Bregman\ndivergences (a decomposable distance is a sum of values computed separately for each coordinate).\nCh\u00e1vez et al. [8] as well as Amato and Savino [2] independently proposed permutation-based search\nmethods. These approximate methods do not involve learning, but, nevertheless, are applicable to\nnon-metric spaces. At index time, k pivots are selected.\n
For every data point, we create a list, called\na permutation, where pivots are sorted in the order of increasing distances from the data point.\nAt query time, a rank correlation (e.g., Spearman\u2019s) is computed between the permutation of the\nquery and permutations of data points. Candidate points, which have sufficiently small correlation\nvalues, are then compared directly with the query (by computing the original distance function).\nOne can sequentially scan the list of permutations and compute the rank correlation between the\npermutation of the query and the permutation of every data point [8]. Data points are then sorted\nby rank-correlation values. This approach can be improved by incremental sorting [14], storing\npermutations as inverted files [2], or prefix trees [13].\nIn this work we experiment with two approaches to learning a pruning function of the VP-tree,\nwhich to our knowledge was not attempted previously. We compare the resulting method, which\ncan be applied to both metric and non-metric spaces, with the following state-of-the-art methods:\nthe multi-probe LSH, permutation methods, and the bbtree.\n\n2 Proposed Method\n\n2.1 Classic VP-tree\n\nIn the VP-tree (also known as a ball tree) the space is partitioned with respect to a (usually randomly)\nchosen pivot \u03c0 [28, 31]. Assume that we have computed distances from all points to the pivot \u03c0 and\nR is a median of these distances. The sphere centered at \u03c0 with the radius R divides the space\ninto two partitions, each of which contains approximately half of all points. Points inside the\npivot-centered sphere are placed into the left subtree, while points outside the pivot-centered sphere are\nplaced into the right subtree (points on the border may be placed arbitrarily). The search algorithm\nproceeds recursively. When the number of data points is below a certain threshold (the bucket size),\nthese data points are stored as a single bucket.\n
The obtained hierarchical partition is represented by\nthe binary tree, where buckets are leaves.\nThe NN-search is a recursive traversal procedure that\nstarts from the root of the tree and iteratively updates\nthe distance r to the closest object found. When it\nreaches a bucket (i.e., a leaf), bucket elements are\nsearched sequentially. Each internal node stores the\npivot \u03c0 and the radius R.\nIn a metric space with\nthe distance d(x, y), we use the triangle inequality\nto prune the search space. We visit:\n\n\u2022 only the left subtree if d(\u03c0, q) < R \u2212 r;\n\u2022 only the right subtree if d(\u03c0, q) > R + r;\n\u2022 both subtrees if R \u2212 r \u2264 d(\u03c0, q) \u2264 R + r.\n\nIn the third case, we first visit the partition that contains\nq. These three cases are illustrated in Fig. 1. Let D\u03c0,R(x) = |R \u2212 x|. Then we need to visit\nboth partitions if and only if r \u2265 D\u03c0,R(d(\u03c0, q)). If r < D\u03c0,R(d(\u03c0, q)), we visit only the partition\ncontaining the query point. In this case, we prune the other partition. Pruning is a classification task\nwith three classes, where the prediction function is defined through D\u03c0,R(x). The only argument of\nthis function is a distance between the pivot and the query, i.e., d(\u03c0, q). The function value is equal\nto the maximum radius of the query ball that fits inside the partition containing the query (see the\nred and the blue sample balls in Fig. 1).\n\nFigure 1: Three types of query balls in the\nVP-tree. The black circle (centered at the\npivot \u03c0) is the sphere that divides the space.\n\n2.2 Approximating D\u03c0,R(x) with a Piece-wise Linear Function\n\nIn Section 2 of the supplemental materials, we describe a straightforward sampling algorithm to\nlearn the decision function D\u03c0,R(x) for every pivot \u03c0. This method turned out to be inferior to\nmost state-of-the-art approaches.\n
It is, nevertheless, instructive to examine the decision functions\nD\u03c0,R(x) learned by sampling for the Euclidean distance and KL-divergence (see Table 1 for details\non data sets).\nEach point in Fig. 2a-2c is a value of the decision function obtained by sampling. Blue curves are\nfit to these points. For the Euclidean data (Fig. 2a), D\u03c0,R(x) resembles a piece-wise linear function\napproximately equal to |R \u2212 x|. For the KL-divergence data (Fig. 2b and 2c), D\u03c0,R(x) looks like a\nU-shape and a hockey-stick curve, respectively. Yet, most data points concentrate around the median\n(denoted by a dashed red line). In this area, a piece-wise linear approximation of D\u03c0,R(x) could\nstill be reasonable.\n\nFigure 2: The empirically obtained decision function D\u03c0,R(x). Each point is a value of the function\nlearned by sampling (see Section 2 of the supplemental materials). Blue curves are fit to these points.\nThe red dashed line denotes a median distance R from data set points to the pivot \u03c0.\nPanels: (a) Colors, L2; (b) RCV-8, KL-divergence; (c) RCV-16, gen. KL-divergence.\n\nFormally, we define the decision function as:\n\nD\u03c0,R(x) = \u03b1left |x \u2212 R|, if x \u2264 R;\nD\u03c0,R(x) = \u03b1right |x \u2212 R|, if x \u2265 R.    (1)\n\nOnce we obtain the values of \u03b1left and \u03b1right that permit near exact searching, we can induce more\naggressive pruning by increasing \u03b1left and/or \u03b1right, thus, exploring trade-offs between retrieval\nefficiency and effectiveness. This is similar to stretching of the triangle inequality proposed by\nCh\u00e1vez and Navarro [9].\nOptimal \u03b1left and \u03b1right are determined using a grid search. To this end, we index a small subset of\nthe data points and seek to obtain parameters that give the shortest retrieval time at a specified recall\nthreshold. The grid search is initialized by values a and b.\n
Then, recall values and retrieval times for\nall \u03b1left = a\u00b7\u03c1^(i/m \u2212 0.5) and \u03b1right = b\u00b7\u03c1^(j/m \u2212 0.5) are obtained (1 \u2264 i, j \u2264 m). The values of m and\n\u03c1 are chosen so that: (1) the grid step is reasonably small (i.e., \u03c1^(1/m) is close to one); (2) the search\nspace is manageable (i.e., m is not large).\nIf the obtained recall values are considerably larger than a specified threshold, the procedure repeats\nthe grid search using larger values of a and b. Similarly, if the recall is not sufficient, the values\nof a and b are decreased and the grid search is repeated. One can see that the perfect recall can be\nachieved with \u03b1left = 0 and \u03b1right = 0. In this case, no pruning is done and the data set is searched\nsequentially. Values of \u03b1left = \u221e and \u03b1right = \u221e represent an (almost) zero recall, because one\nof the partitions is always pruned.\n\n2.3 Applicability Conditions\n\nIt is possible to apply the classic VP-tree algorithm only to data sets such that D\u03c0,R(d(\u03c0, q)) > 0\nwhen d(\u03c0, q) \u2260 R. In a relaxed version of this applicability condition, we require that\nD\u03c0,R(d(\u03c0, q)) > 0 for almost all queries and a large subset of data points. More formally:\nProperty 1. For any pivot \u03c0, probability \u03b1, and distance x \u2260 R, there exists a radius r > 0\nsuch that, if two randomly selected points q (a potential query) and u (a potential nearest neighbor)\nsatisfy d(\u03c0, q) = x and d(u, q) \u2264 r, then both u and q belong to the same partition (defined by \u03c0\nand R) with a probability at least \u03b1.\n\nProperty 1, which is true for all metric spaces due to the triangle inequality, holds in the case of\nthe KL-divergence and data points u sampled randomly and uniformly from the simplex {xi | xi \u2265 0,\n\u2211 xi = 1}. The proof, which is given in Section 1 of supplemental materials, can be trivially\nextended to other non-negative distance functions d(x, y) \u2265 0 (e.g., to the Itakura-Saito distance)\nthat satisfy (additional compactness requirements may be required): (1) d(x, y) = 0 \u21d4 x = y; (2)\nthe set of discontinuities of d(x, y) has measure zero in L2. This suggests that the VP-tree could be\napplicable to a wide class of non-metric spaces.\n\n[Figure 2 panels: the x-axis is the distance to pivot; the y-axis is the max distance to query.]\n\nTable 1: Description of the data sets\n\nName | d(x, y) | Data set size | Dimensionality | Source\nColors | L2 | 1.1 \u00b7 10^5 | 112 | Metric Space Library\u00b9\nRCV-i | KL-div, L2 | 0.5 \u00b7 10^6 | i \u2208 {8, 16, 32, 128, 256} | Cayton [6]\nSIFT-signat. | KL-div, L2 | 1 \u00b7 10^4 | 1111 | Cayton [6]\nUniform | L2 | 0.5 \u00b7 10^6 | 64 | Sampled from U^64[0, 1]\n\n3 Experiments\n\nWe ran experiments on a Linux server equipped with Intel Core i7 2600 (3.40 GHz, 8192 KB of\nL3 CPU cache) and 16 GB of DDR3 RAM (transfer rate is 20GB/sec). The software (including\nscripts that can be used to reproduce our results) is available online, as a part of the Non-Metric\nSpace Library\u00b2 [5]. The code was written in C++, compiled using GNU C++ 4.7 (optimization\nflag -Ofast), and executed in a single thread.\n
SIMD instructions were enabled using the flags -msse2\n-msse4.1 -mssse3.\nAll distance and rank correlation functions are highly optimized and employ SIMD instructions.\nVector elements were single-precision numbers. For the KL-divergence, though, we also\nimplemented a slower version, which computes logarithms on-line, i.e., for each distance computation.\nThe faster version computes logarithms of vector elements off-line, i.e., during indexing, and stores\nthem with the vectors. Additionally, we need to compute logarithms of query vector elements, but this is\ndone only once per query. The optimized implementation is about 30 times faster than the slower\none.\nThe data sets are described in Table 1. Each data set is randomly divided into two parts. The\nsmaller part (containing 1,000 elements) is used as queries, while the larger part is indexed. This\nprocedure is repeated 5 times (for each data set) and results are aggregated using a classic\nfixed-effect model [15]. Improvement in efficiency due to indexing is measured as a reduction in retrieval\ntime compared to a sequential, i.e., exhaustive, search. The effectiveness was measured using a\nsimple rank error metric proposed by Cayton [6]. It is equal to the number of NN-points closer to\nthe query than the nearest point returned by the search method. This metric is appropriate mostly for\n1-NN queries. We present results only for left queries, but we also verified that in the case of right\nqueries the VP-tree provides similar effectiveness/efficiency trade-offs. We ran benchmarks for L2,\nthe KL-divergence,\u00b3 and the Itakura-Saito distance. Implemented methods included:\n\n\u2022 The novel search algorithm based on the VP-tree and a piece-wise linear approximation for\nD\u03c0,R(x) as described in Section 2.2. The parameters of the grid search algorithm were:\nm = 7 and \u03c1 = 8.\n\u2022 The permutation method with incremental sorting [14]. The near-optimal performance was\nobtained by using 16 pivots.\n\u2022 The permutation prefix index, where permutation profiles are stored in a prefix tree of\nlimited depth [13]. We used 16 pivots and the maximal prefix length 4 (again selected for\nbest performance).\n\u2022 The bbtree [6], which is designed for Bregman divergences, and, thus, it was not used with\nL2.\n\u2022 The multi-probe LSH, which is designed to work only for L2. The implementation employs\nthe LSHKit,\u2074 which is embedded in the Non-Metric Space Library. The best-performing\nconfiguration that we could find used 10 probes and 50 hash tables. The remaining\nparameters were selected automatically using the cost model proposed by Dong et al. [11].\n\n\u00b2 https://github.com/searchivarius/NonMetricSpaceLib\n\u00b3 In the case of SIFT signatures, we use generalized KL-divergence (similarly to Cayton).\n\u2074 Downloaded from http://lshkit.sourceforge.net/\n\nFigure 3: Performance of NN-search for L2\n\nFigure 4: Performance of NN-search for the KL-divergence and Itakura-Saito distance\n\nFor the bbtree and the VP-tree, vectors in the same bucket were stored in contiguous chunks of\nmemory (allowing for about 1.5-2x reduction in retrieval times). It is typically more efficient to search\nelements of a small bucket sequentially, rather than using an index. A near-optimal performance\nwas obtained with 50 elements in a bucket.\n
The same optimization approach was also used for both\npermutation methods.\nSeveral parameters were manually selected to achieve various effectiveness/efficiency trade-offs.\nThey included: the minimal number/percentage of candidates in permutation methods, the desired\nrecall in the multi-probe LSH and in the VP-tree, as well as the maximum number of processed\nbuckets in the bbtree.\n\n[Figures 3 and 4: each panel plots the improvement in efficiency (log. scale) against the number of points closer (log. scale). Fig. 3 panels (L2): Colors, RCV-16, RCV-128, RCV-256, SIFT signatures, Uniform; methods: multi-probe LSH, pref. index, vp-tree, permutation. Fig. 4 panels (KL-div and Itakura-Saito): RCV-16, RCV-256, SIFT signatures; methods: pref. index, bbtree, vp-tree, permutation.]\n\nTable 2: Improvement in efficiency and retrieval time (ms) for the bbtree without early termination\n\nData set | RCV-16 impr. | RCV-16 time | RCV-32 impr. | RCV-32 time | RCV-128 impr. | RCV-128 time | RCV-256 impr. | RCV-256 time | SIFT sign. impr. | SIFT sign. time\nSlow KL-divergence | 15.7 | 8 | 6.7 | 36 | 1.6 | 613 | 1.1 | 1700 | 0.9 | 164\nFast KL-divergence | 4.6 | 2.5 | 1.9 | 9.6 | 0.5 | 108 | 0.4 | 274 | 0.4 | 18\n\nThe results for L2 are given in Fig. 3. Even though a representational dimensionality of the Uniform\ndata set is only 64, it has the highest intrinsic dimensionality among all sets in Table 1 (according to\nthe definition of Ch\u00e1vez et al. [10]). Thus, for the Uniform data set, no method achieved more than\na 10x speedup over sequential searching without substantial quality degradation. For instance, for\nthe VP-tree, a 160x speedup was only possible, when a retrieved object was a 40-th nearest neighbor\n(on average) instead of the first one. The multi-probe LSH can be twice as fast as the VP-tree at the\nexpense of having a 4.7x larger index. All the remaining data sets have low or moderate intrinsic\ndimensionality (smaller than eight). For example, the SIFT signatures have the representational\ndimensionality of 1111, but the intrinsic dimensionality is only four. For data sets with low and\nmoderate intrinsic dimensionality, the VP-tree is faster than the other methods most of the time. For\nthe data sets Colors and RCV-16 there is a two orders of magnitude difference.\nThe results for the KL-divergence and Itakura-Saito distance are summarized in Fig. 4. The bbtree\nis never substantially faster than the VP-tree, while being up to an order of magnitude slower\nfor RCV-16 and RCV-256 in the case of Itakura-Saito distance. Similar to results in L2, in most\ncases, the VP-tree is at least as fast as other methods.\n
Yet, for the SIFT signatures data set and the\nItakura-Saito distance, permutation methods can be twice as fast.\nAdditional analysis has shown that the VP-tree is a good rank-approximation method, but it is not\nnecessarily the best approach in terms of recall. When the VP-tree misses the nearest neighbor, it\noften returns the second nearest or the third nearest neighbor instead. However, when other\nexamined methods miss the nearest neighbor, they frequently return elements that are far from the true\nresult. For example, the multi-probe LSH may return a true nearest neighbor 50% of the time, and\n50% of the time it would return the 100-th nearest neighbor. This observation about the LSH is in\nline with previous findings [26].\nFinally, we measured improvement in efficiency (over exhaustive search) for the bbtree, where the\nearly termination algorithm was disabled. This was done using both the slow and the fast\nimplementation of the KL-divergence. The results are given in Table 2. Improvements in efficiency for the case\nof the slower KL-divergence (reported in the first row) are consistent with those reported by Cayton\n[6]. The second row shows improvements in efficiency for the case of the faster KL-divergence and\nthese improvements are substantially smaller than those reported in the first row, despite the fact\nthat using the faster KL-divergence greatly reduces retrieval times. The reason is that the pruning\nalgorithm of the bbtree is quite expensive. It involves computations of logarithms/exponents for\ncoordinates of unknown vectors, and, thus, these computations cannot be deferred to index time.\n\n4 Discussion and conclusions\n\nWe evaluated two simple yet effective learning-to-prune methods and showed that the resulting\napproach was competitive against state-of-the-art methods in both metric and non-metric spaces.\n
In\nmost cases, this method provided better trade-offs between rank approximation quality and retrieval\nspeed. For data sets with low or moderate intrinsic dimensionality, the VP-tree could be one to two\norders of magnitude faster than other methods (for the same rank approximation quality). We discussed\napplicability of our method (a VP-tree with the learned pruner) and proved a theorem supporting the\npoint of view that our method can be applicable to a class of non-metric distances, which includes\nthe KL-divergence. We also showed that a simple trick of pre-computing logarithms at index time\nsubstantially improved performance of existing methods (e.g., bbtree) for the studied distances.\nIt should be possible to improve over basic learning-to-prune methods (employed in this work)\nusing: (1) a better pivot-selection strategy [31]; (2) a more sophisticated sampling strategy; (3) a\nmore accurate (non-linear) approximation for the decision function D\u03c0,R(x) (see Section 2.1).\n\n5 Acknowledgements\n\nWe thank Lawrence Cayton for providing the data sets, the bbtree code, and answering our questions;\nAnna Belova for checking the proof of Property 1 (supplemental materials) and editing the paper.\n\nReferences\n[1] G. Amato, F. Rabitti, P. Savino, and P. Zezula. Region proximity in metric spaces and its use\nfor approximate similarity search. ACM Trans. Inf. Syst., 21(2):192\u2013227, Apr. 2003.\n\n[2] G. Amato and P. Savino. Approximate similarity search in metric spaces using inverted files.\nIn Proceedings of the 3rd international conference on Scalable information systems, InfoScale\n\u201908, pages 28:1\u201328:10, ICST, Brussels, Belgium, 2008. ICST (Institute for Computer\nSciences, Social-Informatics and Telecommunications Engineering).\n\n[3] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios. BoostMap: A method for efficient approximate\nsimilarity rankings. In Computer Vision and Pattern Recognition, 2004.\n
CVPR 2004.\nProceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II\u2013268 \u2013\nII\u2013275, June-July 2004.\n\n[4] J. Bentley. Multidimensional binary search trees used for associative searching.\nCommunications of the ACM, 18(9):509\u2013517, 1975.\n\n[5] L. Boytsov and B. Naidan. Engineering efficient and effective Non-Metric Space Library. In\nN. Brisaboa, O. Pedreira, and P. Zezula, editors, Similarity Search and Applications, volume\n8199 of Lecture Notes in Computer Science, pages 280\u2013293. Springer Berlin Heidelberg, 2013.\n\n[6] L. Cayton. Fast nearest neighbor retrieval for Bregman divergences. In Proceedings of the\n25th international conference on Machine learning, ICML \u201908, pages 112\u2013119, New York,\nNY, USA, 2008. ACM.\n\n[7] L. Cayton and S. Dasgupta. A learning framework for nearest neighbor search. Advances in\nNeural Information Processing Systems, 20, 2007.\n\n[8] E. Ch\u00e1vez, K. Figueroa, and G. Navarro. Effective proximity retrieval by ordering permutations.\nPattern Analysis and Machine Intelligence, IEEE Transactions on, 30(9):1647\u20131658,\nSept. 2008.\n\n[9] E. Ch\u00e1vez and G. Navarro. Probabilistic proximity search: Fighting the curse of dimensionality\nin metric spaces. Information Processing Letters, 85(1):39\u201346, 2003.\n\n[10] E. Ch\u00e1vez, G. Navarro, R. Baeza-Yates, and J. L. Marroquin. Searching in metric spaces. ACM\nComputing Surveys, 33(3):273\u2013321, 2001.\n\n[11] W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li. Modeling LSH for performance tuning.\nIn Proceedings of the 17th ACM conference on Information and knowledge management,\nCIKM \u201908, pages 669\u2013678, New York, NY, USA, 2008. ACM.\n\n[12] O. Edsberg and M. L. Hetland. Indexing inexact proximity search with distance regression in\npivot space.\n
In Proceedings of the Third International Conference on SImilarity Search and\nAPplications, SISAP \u201910, pages 51\u201358, New York, NY, USA, 2010. ACM.\n\n[13] A. Esuli. Use of permutation prefixes for efficient and scalable approximate similarity search.\nInf. Process. Manage., 48(5):889\u2013902, Sept. 2012.\n\n[14] E. Gonzalez, K. Figueroa, and G. Navarro. Effective proximity retrieval by ordering permutations.\nPattern Analysis and Machine Intelligence, IEEE Transactions on, 30(9):1647\u20131658,\n2008.\n\n[15] L. V. Hedges and J. L. Vevea. Fixed- and random-effects models in meta-analysis.\nPsychological methods, 3(4):486\u2013504, 1998.\n\n[16] G. Hjaltason and H. Samet. Properties of embedding methods for similarity searching in metric\nspaces. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(5):530\u2013549,\n2003.\n\n[17] P. Indyk. Nearest neighbors in high-dimensional spaces. In J. E. Goodman and J. O\u2019Rourke,\neditors, Handbook of discrete and computational geometry, pages 877\u2013892. Chapman and\nHall/CRC, 2004.\n\n[18] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of\ndimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing,\nSTOC \u201998, pages 604\u2013613, New York, NY, USA, 1998. ACM.\n\n[19] D. Jacobs, D. Weinshall, and Y. Gdalyahu. Classification with nonmetric distances: Image\nretrieval and class representation. Pattern Analysis and Machine Intelligence, IEEE Transactions\non, 22(6):583\u2013600, 2000.\n\n[20] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor\nin high dimensional spaces. In Proceedings of the 30th annual ACM symposium on Theory of\ncomputing, STOC \u201998, pages 614\u2013623, New York, NY, USA, 1998. ACM.\n\n[21] H. Lejsek, F. \u00c1smundsson, B. J\u00f3nsson, and L. Amsaleg.\n
NV-Tree: An efficient disk-based index for approximate search in very large high-dimensional collections. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):869\u2013883, May 2009.\n\n[22] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd international conference on Very large data bases, VLDB \u201907, pages 950\u2013961. VLDB Endowment, 2007.\n\n[23] Y. Mu and S. Yan. Non-metric locality-sensitive hashing. In AAAI, 2010.\n\n[24] T. Murakami, K. Takahashi, S. Serita, and Y. Fujii. Versatile probability-based indexing for approximate similarity search. In Proceedings of the Fourth International Conference on SImilarity Search and APplications, SISAP \u201911, pages 51\u201358, New York, NY, USA, 2011. ACM.\n\n[25] V. Pestov. Indexability, concentration, and VC theory. Journal of Discrete Algorithms, 13(0):2\u201318, 2012. Best Papers from the 3rd International Conference on Similarity Search and Applications (SISAP 2010).\n\n[26] P. Ram, D. Lee, H. Ouyang, and A. G. Gray. Rank-approximate nearest neighbor search: Retaining meaning and speed in high dimensions. In Advances in Neural Information Processing Systems, pages 1536\u20131544, 2009.\n\n[27] H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., 2005.\n\n[28] J. Uhlmann. Satisfying general proximity similarity queries with metric trees. Information Processing Letters, 40:175\u2013179, 1991.\n\n[29] J. Vermorel. Near neighbor search in metric and nonmetric space, 2005. http://hal.archives-ouvertes.fr/docs/00/03/04/85/PDF/densitree.pdf, last accessed on Nov 1st 2012.\n\n[30] R. Weber, H. J. Schek, and S. Blott. 
A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 194\u2013205. Morgan Kaufmann, August 1998.\n\n[31] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, SODA \u201993, pages 311\u2013321, Philadelphia, PA, USA, 1993. Society for Industrial and Applied Mathematics.\n\n[32] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.\n\n[33] P. Zezula, P. Savino, G. Amato, and F. Rabitti. Approximate similarity retrieval with M-trees. The VLDB Journal, 7(4):275\u2013293, Dec. 1998.\n\n[34] Z. Zhang, B. C. Ooi, S. Parthasarathy, and A. K. H. Tung. Similarity search on Bregman divergence: towards non-metric indexing. Proc. VLDB Endow., 2(1):13\u201324, Aug. 2009.", "award": [], "sourceid": 784, "authors": [{"given_name": "Leonid", "family_name": "Boytsov", "institution": "CMU"}, {"given_name": "Bilegsaikhan", "family_name": "Naidan", "institution": "Norwegian University of Science and Technology (NTNU)"}]}