{"title": "The Power of Asymmetry in Binary Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 2823, "page_last": 2831, "abstract": "When approximating binary similarity using the hamming distance between short binary hashes, we show that even if the similarity is symmetric, we can have shorter and more accurate hashes by using two distinct code maps. I.e.~by approximating the similarity between $x$ and $x'$ as the hamming distance between $f(x)$ and $g(x')$, for two distinct binary codes $f,g$, rather than as the hamming distance between $f(x)$ and $f(x')$.", "full_text": "The Power of Asymmetry in Binary Hashing\n\nBehnam Neyshabur\n\nPayman Yadollahpour\n\nYury Makarychev\n\nToyota Technological Institute at Chicago\n\n[btavakoli,pyadolla,yury]@ttic.edu\n\nRuslan Salakhutdinov\n\nDepartments of Statistics and Computer Science\n\nUniversity of Toronto\n\nrsalakhu@cs.toronto.edu\n\nNathan Srebro\n\nToyota Technological Institute at Chicago\n\nand Technion, Haifa, Israel\n\nnati@ttic.edu\n\nAbstract\n\nWhen approximating binary similarity using the hamming distance between short\nbinary hashes, we show that even if the similarity is symmetric, we can have\nshorter and more accurate hashes by using two distinct code maps. I.e. by\napproximating the similarity between x and x\u2032 as the hamming distance between f(x)\nand g(x\u2032), for two distinct binary codes f, g, rather than as the hamming distance\nbetween f(x) and f(x\u2032).\n\n1 Introduction\n\nEncoding high-dimensional objects using short binary hashes can be useful for fast approximate\nsimilarity computations and nearest neighbor searches. Calculating the hamming distance between\ntwo short binary strings is an extremely cheap computational operation, and the communication cost\nof sending such hash strings for lookup on a server (e.g. sending hashes of all features or patches in\nan image taken on a mobile device) is low. 
Furthermore, it is also possible to quickly look up nearby\nhash strings in populated hash tables. Indeed, it only takes a fraction of a second to retrieve a shortlist\nof similar items from a corpus containing billions of data points, which is important in image, video,\naudio, and document retrieval tasks [11, 9, 10, 13]. Moreover, compact binary codes are remarkably\nstorage efficient, and allow one to store massive datasets in memory. It is therefore desirable to find\nshort binary hashes that correspond well to some target notion of similarity. Pioneering work on\nLocality Sensitive Hashing used random linear thresholds for obtaining bits of the hash [1]. Later\nwork suggested learning hash functions attuned to the distribution of the data [15, 11, 5, 7, 3].\nMore recent work focuses on learning hash functions so as to optimize agreement with the target\nsimilarity measure on specific datasets [14, 8, 9, 6]. It is important to obtain accurate and short\nhashes: the computational and communication costs scale linearly with the length of the hash, and\nmore importantly, the memory cost of the hash table can scale exponentially with the length.\nIn all the above-mentioned approaches, similarity S(x, x\u2032) between two objects is approximated by\nthe hamming distance between the outputs of the same hash function, i.e. between f(x) and f(x\u2032),\nfor some f \u2208 {\u00b11}^k. The emphasis here is that the same hash function is applied to both x and x\u2032\n(in methods like LSH multiple hashes might be used to boost accuracy, but the comparison is still\nbetween outputs of the same function).\nThe only exception we are aware of is where a single mapping of objects to fractional vectors\n\u02dcf(x) \u2208 [\u22121, 1]^k is used, its thresholding f(x) = sign \u02dcf(x) \u2208 {\u00b11}^k is used in the database,\nand similarity between x and x\u2032 is approximated using \u27e8f(x), \u02dcf(x\u2032)\u27e9. This has become known\nas \u201casymmetric hashing\u201d [2, 4], but even with such asymmetry, both mappings are based on the\nsame fractional mapping \u02dcf(\u00b7). That is, the asymmetry is in that one side of the comparison gets\nthresholded while the other is fractional, but not in the actual mapping.\nIn this paper, we propose using two distinct mappings f(x), g(x) \u2208 {\u00b11}^k and approximating the\nsimilarity S(x, x\u2032) by the hamming distance between f(x) and g(x\u2032). We refer to such hashing\nschemes as \u201casymmetric\u201d. Our main result is that even if the target similarity function is\nsymmetric and \u201cwell behaved\u201d (e.g., even if it is based on Euclidean distances between objects), using\nasymmetric binary hashes can be much more powerful, and allow better approximation of the\ntarget similarity with shorter code lengths. In particular, we show extreme examples of collections\nof points in Euclidean space, where the neighborhood similarity S(x, x\u2032) can be realized using an\nasymmetric binary hash (based on a pair of distinct functions) of length O(r) bits, but where a\nsymmetric hash (based on a single function) would require at least \u2126(2^r) bits. Although actual data is\nnot as extreme, our experimental results on real data sets demonstrate significant benefits from using\nasymmetric binary hashes.\nAsymmetric hashes can be used in almost all places where symmetric hashes are typically used,\nusually without any additional storage or computational cost. Consider the typical application of\nstoring hash vectors for all objects in a database, and then calculating similarities to queries by\ncomputing the hash of the query and its hamming distance to the stored database hashes. Using\nan asymmetric hash means using different hash functions for the database and for the query. 
This\nneither increases the size of the database representation, nor the computational or communication\ncost of populating the database or performing a query, as the exact same operations are required.\nIn fact, when hashing the entire database, asymmetric hashes provide even more opportunity for\nimprovement. We argue that using two different hash functions to encode database objects and\nqueries allows for much more flexibility in choosing the database hash. Unlike the query hash,\nwhich has to be stored compactly and efficiently evaluated on queries as they appear, if the database\nis fixed, an arbitrary mapping of database objects to bit strings may be used. We demonstrate that\nthis can indeed increase similarity accuracy while reducing the bit length required.\n\n2 Minimum Code Lengths and the Power of Asymmetry\nLet S : X \u00d7 X \u2192 {\u00b11} be a binary similarity function over a set of objects X, where we can\ninterpret S(x, x\u2032) to mean that x and x\u2032 are \u201csimilar\u201d or \u201cdissimilar\u201d, or to indicate whether they are\n\u201cneighbors\u201d. A symmetric binary coding of X is a mapping f : X \u2192 {\u00b11}^k, where k is the\nbit-length of the code. We are interested in constructing codes such that the hamming distance between\nf(x) and f(x\u2032) corresponds to the similarity S(x, x\u2032). That is, for some threshold \u03b8 \u2208 R, S(x, x\u2032) \u2248\nsign(\u27e8f(x), f(x\u2032)\u27e9 \u2212 \u03b8). Although discussing the hamming distance, it is more convenient for us\nto work with the inner product \u27e8u, v\u27e9, which is equivalent to the hamming distance d_h(u, v) since\n\u27e8u, v\u27e9 = k \u2212 2d_h(u, v) for u, v \u2208 {\u00b11}^k.\nIn this section, we will consider the problem of capturing a given similarity using an arbitrary binary\ncode. That is, we are given the entire similarity mapping S, e.g. 
as a matrix S \u2208 {\u00b11}^{n\u00d7n} over\na finite domain X = {x_1, ..., x_n} of n objects, with S_ij = S(x_i, x_j). We ask for an encoding\nu_i = f(x_i) \u2208 {\u00b11}^k of each object x_i \u2208 X, and a threshold \u03b8, such that S_ij = sign(\u27e8u_i, u_j\u27e9 \u2212 \u03b8),\nor at least such that equality holds for as many pairs (i, j) as possible. It is important to emphasize\nthat the goal here is purely to approximate the given matrix S using a short binary code; there is no\nout-of-sample generalization (yet).\nWe now ask: Can allowing an asymmetric coding enable approximating a symmetric similarity\nmatrix S with a shorter code length?\nDenoting U \u2208 {\u00b11}^{k\u00d7n} for the matrix whose columns contain the codewords u_i, the minimal\nbinary code length that allows exactly representing S is then given by the following matrix\nfactorization problem:\n\nk_s(S) = min_{k,U,\u03b8} k   s.t.   U \u2208 {\u00b11}^{k\u00d7n},   \u03b8 \u2208 R,   Y \u225c U^\u22a4 U \u2212 \u03b8 1_n,   \u2200ij S_ij Y_ij > 0     (1)\n\nwhere 1_n is an n \u00d7 n matrix of ones.\n\nWe begin demonstrating the power of asymmetry by considering an asymmetric variant of the above\nproblem. That is, even if S is symmetric, we allow associating with each object x_i two distinct\nbinary codewords, u_i \u2208 {\u00b11}^k and v_i \u2208 {\u00b11}^k (we can think of this as having two arbitrary\nmappings u_i = f(x_i) and v_i = g(x_i)), such that S_ij = sign(\u27e8u_i, v_j\u27e9 \u2212 \u03b8). The minimal asymmetric\nbinary code length is then given by:\n\nk_a(S) = min_{k,U,V,\u03b8} k   s.t.   U, V \u2208 {\u00b11}^{k\u00d7n},   \u03b8 \u2208 R,   Y \u225c U^\u22a4 V \u2212 \u03b8 1_n,   \u2200ij S_ij Y_ij > 0     (2)\n\nWriting the binary coding problems as matrix factorization problems is useful for understanding\nthe power we can get by asymmetry: even if S is symmetric, and even if we seek a symmetric Y,\ninsisting on writing Y as a square of a binary matrix might be a tough constraint. 
This is captured\nin the following Theorem, which establishes that there could be an exponential gap between the\nminimal asymmetric binary code length and the minimal symmetric code length, even if the matrix\nS is symmetric and very well behaved:\nTheorem 1. For any r, there exists a set of n = 2^r points in Euclidean space, with similarity matrix\nS_ij = 1 if \u2016x_i \u2212 x_j\u2016 \u2264 1 and S_ij = \u22121 if \u2016x_i \u2212 x_j\u2016 > 1, such that k_a(S) \u2264 2r but k_s(S) \u2265 2^r/2.\n\nProof. Let I_1 = {1, ..., n/2} and I_2 = {n/2 + 1, ..., n}. Consider the matrix G defined by\nG_ii = 1/2, G_ij = \u22121/(2n) if i \u2260 j and either i, j \u2208 I_1 or i, j \u2208 I_2, and G_ij = 1/(2n) otherwise. Matrix G is\ndiagonally dominant. By the Gershgorin circle theorem, G is positive definite. Therefore, there exist\nvectors x_1, ..., x_n such that \u27e8x_i, x_j\u27e9 = G_ij (for every i and j). Define\n\nS_ij = 1 if \u2016x_i \u2212 x_j\u2016 \u2264 1 and S_ij = \u22121 if \u2016x_i \u2212 x_j\u2016 > 1.\n\nNote that if i = j then S_ij = 1; if i \u2260 j and (i, j) \u2208 I_1 \u00d7 I_1 \u222a I_2 \u00d7 I_2 then \u2016x_i \u2212 x_j\u2016^2 =\nG_ii + G_jj \u2212 2G_ij = 1 + 1/n > 1 and therefore S_ij = \u22121. Finally, if i \u2260 j and (i, j) \u2208 I_1 \u00d7 I_2 \u222a I_2 \u00d7 I_1\nthen \u2016x_i \u2212 x_j\u2016^2 = G_ii + G_jj \u2212 2G_ij = 1 \u2212 1/n < 1 and therefore S_ij = 1. We show that\nk_a(S) \u2264 2r. Let B be an r \u00d7 n matrix whose column vectors are the vertices of the cube {\u00b11}^r\n(in any order); let C be an r \u00d7 n matrix defined by C_ij = 1 if j \u2208 I_1 and C_ij = \u22121 if j \u2208 I_2. Let\nU = [B; C] and V = [B; \u2212C], i.e. B stacked on top of C and of \u2212C respectively. For Y = U^\u22a4 V \u2212 \u03b8 1_n where threshold \u03b8 = \u22121, we have that Y_ij \u2265 1\nif S_ij = 1 and Y_ij \u2264 \u22121 if S_ij = \u22121. Therefore, k_a(S) \u2264 2r.\nNow we show that k_s = k_s(S) \u2265 n/2. Consider Y, U and \u03b8 as in (1). Let Y\u2032 = U^\u22a4 U. Note\nthat Y\u2032_ij \u2208 [\u2212k_s, k_s] and thus \u03b8 \u2208 [\u2212k_s + 1, k_s \u2212 1]. Let q = [1, ..., 1, \u22121, ..., \u22121]^\u22a4 (n/2 ones\nfollowed by n/2 minus ones). We have,\n\n0 \u2264 q^\u22a4 Y\u2032 q = \u2211_{i=1}^{n} Y\u2032_ii + \u2211_{i,j: S_ij=\u22121} Y\u2032_ij \u2212 \u2211_{i,j: S_ij=1, i\u2260j} Y\u2032_ij\n  \u2264 \u2211_{i=1}^{n} k_s + \u2211_{i,j: S_ij=\u22121} (\u03b8 \u2212 1) \u2212 \u2211_{i,j: S_ij=1, i\u2260j} (\u03b8 + 1)\n  = n k_s + (0.5n^2 \u2212 n)(\u03b8 \u2212 1) \u2212 0.5n^2(\u03b8 + 1)\n  = n k_s \u2212 n^2 \u2212 n(\u03b8 \u2212 1)\n  \u2264 2n k_s \u2212 n^2.\n\nWe conclude that k_s \u2265 n/2.\n\nThe construction of Theorem 1 shows that there exist data sets for which an asymmetric binary hash\nmight be much shorter than a symmetric hash. This is an important observation as it demonstrates\nthat asymmetric hashes could be much more powerful, and should prompt us to consider them\ninstead of symmetric hashes. The precise construction of Theorem 1 is of course rather extreme (in\nfact, the most extreme construction possible) and we would not expect actual data sets to have this\nexact structure, but we will show later significant gaps also on real data sets.\n\n[Figure 1 panels: 10-D Uniform (left), LabelMe (right); axes: bits vs. Average Precision; curves: Symmetric, Asymmetric]\n\nFigure 1: Number of bits required for approximating two similarity matrices (as a function of average\nprecision). Left: uniform data in the 10-dimensional hypercube, similarity represents a thresholded Euclidean\ndistance, set such that 30% of the similarities are positive. Right: Semantic similarity of a subset of LabelMe\nimages, thresholded such that 5% of the similarities are positive.\n\n3 Approximate Binary Codes\n\nAs we turn to real data sets, we also need to depart from seeking a binary coding that exactly\ncaptures the similarity matrix. 
Rather, we are usually satisfied with merely approximating S, and\nfor any fixed code length k seek the (symmetric or asymmetric) k-bit code that \u201cbest captures\u201d the\nsimilarity matrix S. This is captured by the following optimization problem:\n\nmin_{U,V,\u03b8} L(Y; S) \u225c \u03b2 \u2211_{i,j: S_ij=1} \u2113(Y_ij) + (1 \u2212 \u03b2) \u2211_{i,j: S_ij=\u22121} \u2113(\u2212Y_ij)   s.t.   U, V \u2208 {\u00b11}^{k\u00d7n},   \u03b8 \u2208 R,   Y \u225c U^\u22a4 V \u2212 \u03b8 1_n     (3)\n\nwhere \u2113(z) = 1_{z\u22640} is the zero-one-error and \u03b2 is a parameter that allows us to weight positive\nand negative errors differently. Such weighting can compensate for S_ij being imbalanced (typically\nmany more pairs of points are non-similar rather than similar), and allows us to obtain different\nbalances between precision and recall.\nThe optimization problem (3) is a discrete, discontinuous and highly non-convex problem. In our\nexperiments, we replace the zero-one loss \u2113(\u00b7) with a continuous loss and perform local search\nby greedily updating single bits so as to improve this objective. Although the resulting objective\n(let alone the discrete optimization problem) is still not convex even if \u2113(z) is convex, we found it\nbeneficial to use a loss function that is not flat on z < 0, so as to encourage moving towards the\ncorrect sign. In our experiments, we used the square root of the logistic loss, \u2113(z) = log^{1/2}(1 + e^{\u2212z}).\nBefore moving on to out-of-sample generalizations, we briefly report on the number of bits needed\nempirically to find good approximations of actual similarity matrices with symmetric and\nasymmetric codes. We experimented with several data sets, attempting to fit them with both symmetric and\nasymmetric codes, and then calculating average precision by varying the threshold \u03b8 (while keeping\nU and V fixed). 
Results for two similarity matrices, one based on Euclidean distances between\npoints uniformly distributed in a hypercube, and the other based on semantic similarity between\nimages, are shown in Figure 1.\n\n4 Out of Sample Generalization: Learning a Mapping\n\nSo far we focused on learning binary codes over a fixed set of objects by associating an arbitrary\ncode word with each object and completely ignoring the input representation of the objects x_i.\nWe discussed only how well binary hashing can approximate the similarity, but did not consider\ngeneralizing to additional new objects. However, in most applications, we would like to be able to\nhave such an out-of-sample generalization. That is, we would like to learn a mapping f : X \u2192\n{\u00b11}^k over an infinite domain X using only a finite training set of objects, and then apply the\nmapping to obtain binary codes f(x) for future objects to be encountered, such that S(x, x\u2032) \u2248\nsign(\u27e8f(x), f(x\u2032)\u27e9 \u2212 \u03b8). Thus, the mapping f : X \u2192 {\u00b11}^k is usually limited to some constrained\nparametric class, both so we could represent and evaluate it efficiently on new objects, and to ensure\ngood generalization. For example, when X = R^d, we can consider linear threshold mappings\nf_W(x) = sign(Wx), where W \u2208 R^{k\u00d7d} and sign(\u00b7) operates elementwise, as in Minimal Loss\nHashing [8]. Or, we could also consider more complex classes, such as multilayer networks [11, 9].\nWe already saw that asymmetric binary codes can allow for better approximations using shorter\ncodes, so it is natural to seek asymmetric codes here as well. 
That is, instead of learning a single\nparametric map f(x) we can learn a pair of maps f : X \u2192 {\u00b11}^k and g : X \u2192 {\u00b11}^k, both\nconstrained to some parametric class, and a threshold \u03b8, such that S(x, x\u2032) \u2248 sign(\u27e8f(x), g(x\u2032)\u27e9 \u2212 \u03b8).\nThis has the potential of allowing for better approximating the similarity, and thus better overall\naccuracy with shorter codes (despite possibly slightly harder generalization due to the increase in\nthe number of parameters).\nIn fact, in a typical application where a database of objects is hashed for similarity search over\nfuture queries, asymmetry allows us to go even further. Consider the following setup: We are given\nn objects x_1, ..., x_n \u2208 X from some infinite domain X and the similarities S(x_i, x_j) between\nthese objects. Our goal is to hash these objects using short binary codes which would allow us to\nquickly compute approximate similarities between these objects (the \u201cdatabase\u201d) and future objects\nx (the \u201cquery\u201d). That is, we would like to generate and store compact binary codes for objects in a\ndatabase. Then, given a new query object, we would like to efficiently compute a compact binary\ncode for a given query and retrieve similar items in the database very fast by finding binary codes\nin the database that are within small hamming distance from the query binary code. 
Recall that it\nis important to ensure that the bit length of the hashes is small, as short codes allow for very fast\nhamming distance calculations and low communication costs if the codes need to be sent remotely.\nMore importantly, if we would like to store the database in a hash table allowing immediate lookup,\nthe size of the hash table is exponential in the code length.\nThe symmetric binary hashing approach (e.g. [8]) would be to find a single parametric mapping\nf : X \u2192 {\u00b11}^k such that S(x, x_i) \u2248 sign(\u27e8f(x), f(x_i)\u27e9 \u2212 \u03b8) for future queries x and database\nobjects x_i, calculate f(x_i) for all database objects x_i, and store these hashes (perhaps in a hash table\nallowing for fast retrieval of codes within a short hamming distance). The asymmetric approach\ndescribed above would be to find two parametric mappings f : X \u2192 {\u00b11}^k and g : X \u2192 {\u00b11}^k\nsuch that S(x, x_i) \u2248 sign(\u27e8f(x), g(x_i)\u27e9 \u2212 \u03b8), and then calculate and store g(x_i).\nBut if the database is fixed, we can go further. There is actually no need for g(\u00b7) to be in a constrained\nparametric class, as we do not need to generalize g(\u00b7) to future objects, nor do we have to efficiently\ncalculate it on-the-fly nor communicate g(x) to the database. Hence, we can consider allowing the\ndatabase hash function g(\u00b7) to be an arbitrary mapping. That is, we aim to find a simple parametric\nmapping f : X \u2192 {\u00b11}^k and n arbitrary codewords v_1, ..., v_n \u2208 {\u00b11}^k for each x_1, ..., x_n\nin the database, such that S(x, x_i) \u2248 sign(\u27e8f(x), v_i\u27e9 \u2212 \u03b8) for future queries x and for the objects\nx_1, ..., x_n in the database. 
This form of asymmetry can allow for greater approximation power,\nand thus better accuracy with shorter codes, at no additional computational or storage cost.\nIn Section 6 we evaluate empirically both of the above asymmetric strategies and demonstrate their\nbenefits. But before doing so, in the next Section, we discuss a local-search approach for finding the\nmappings f, g, or the mapping f and the codes v_1, ..., v_n.\n\n5 Optimization\n\nWe focus on x \u2208 X \u2282 R^d and linear threshold hash maps of the form f(x) = sign(Wx), where\nW \u2208 R^{k\u00d7d}. Given training points x_1, ..., x_n, we consider the two models discussed above:\n\nLIN:LIN We learn two linear threshold functions f(x) = sign(W_q x) and g(x) = sign(W_d x).\nI.e. we need to find the parameters W_q, W_d \u2208 R^{k\u00d7d}.\n\nLIN:V We learn a single linear threshold function f(x) = sign(W_q x) and n codewords\nv_1, ..., v_n \u2208 R^k. I.e. we need to find W_q \u2208 R^{k\u00d7d}, as well as V \u2208 R^{k\u00d7n} (where v_i\nare the columns of V).\n\nIn either case we denote u_i = f(x_i), and in LIN:LIN also v_i = g(x_i), and learn by attempting to\nminimize the objective in (3), where \u2113(\u00b7) is again a continuous loss function such as the square\nroot of the logistic. That is, we learn by optimizing the problem (3) with the additional constraint\nU = sign(W_q X), and possibly also V = sign(W_d X) (for LIN:LIN), where X = [x_1 ... x_n] \u2208\nR^{d\u00d7n}.\nWe optimize these problems by alternately updating rows of W_q and either rows of W_d (for\nLIN:LIN) or of V (for LIN:V). To understand these updates, let us first return to (3) (with\nunconstrained U, V), and consider updating a row u^(t) \u2208 R^n of U. Denote\n\nY^(t) = U^\u22a4 V \u2212 \u03b8 1_n \u2212 u^(t)\u22a4 v^(t),\n\nthe prediction matrix with component t subtracted away. It is easy to verify that we can write:\n\nL(U^\u22a4 V \u2212 \u03b8 1_n; S) = C \u2212 u^(t) M v^(t)\u22a4     (4)\n\nwhere C = (1/2)(L(Y^(t) + 1_n; S) + L(Y^(t) \u2212 1_n; S)) does not depend on u^(t) and v^(t), and M \u2208 R^{n\u00d7n}\nalso does not depend on u^(t), v^(t) and is given by:\n\nM_ij = (\u03b2_ij / 2) (\u2113(S_ij (Y^(t)_ij \u2212 1)) \u2212 \u2113(S_ij (Y^(t)_ij + 1))),\n\nwith \u03b2_ij = \u03b2 or \u03b2_ij = (1 \u2212 \u03b2) depending on S_ij. This implies that we can optimize over the entire\nrow u^(t) concurrently by maximizing u^(t) M v^(t)\u22a4, and so the optimum (conditioned on \u03b8, V and all\nother rows of U) is given by\n\nu^(t) = sign(M v^(t)).     (5)\n\nSymmetrically, we can optimize over the row v^(t) conditioned on \u03b8, U and the rest of V, or in the\ncase of LIN:V, conditioned on \u03b8, W_q and the rest of V.\nSimilarly, optimizing over a row w^(t) of W_q amounts to optimizing:\n\narg max_{w^(t) \u2208 R^d} sign(w^(t) X) M v^(t)\u22a4 = arg max_{w^(t) \u2208 R^d} \u2211_i \u27e8M_i, v^(t)\u27e9 sign(\u27e8w^(t), x_i\u27e9).     (6)\n\nThis is a weighted zero-one-loss binary classification problem, with targets sign(\u27e8M_i, v^(t)\u27e9) and\nweights |\u27e8M_i, v^(t)\u27e9|. We approximate it as a weighted logistic regression problem, and at each\nupdate iteration attempt to improve the objective using a small number (e.g. 10) of epochs of stochastic\ngradient descent on the logistic loss. For LIN:LIN, we also symmetrically update rows of W_d.\nWhen optimizing the model for some bit-length k, we initialize to the optimal (k \u2212 1)-length model.\nWe initialize the new bit either randomly, or by thresholding the rank-one projection of M (for\nunconstrained U, V) or the rank-one projection after projecting the columns of M (for LIN:V) or\nboth rows and columns of M (for LIN:LIN) to the column space of X. We take the initialization\n(random, or rank-one based) that yields a lower objective value.\n\n6 Empirical Evaluation\n\nIn order to empirically evaluate the benefits of asymmetry in hashing, we replicate the experiments\nof [8], which were in turn based on [5], on six datasets using learned (symmetric) linear threshold\ncodes. These datasets include LabelMe and Peekaboom, collections of images represented as\n512D GIST features [13]; Photo-tourism, a database of image patches represented as 128 SIFT\nfeatures [12]; MNIST, a collection of 785D greyscale handwritten images; and Nursery, which contains\n8D features. Similar to [8, 5], we also constructed a synthetic 10D Uniform dataset, containing\n4000 points uniformly sampled from a 10D hypercube. We used 1000 points for training and 3000 for\ntesting.\nFor each dataset, we find the Euclidean distance at which each point has, on average, 50 neighbours.\nThis defines our ground-truth similarity in terms of neighbours and non-neighbours. So for each\ndataset, we are given a set of n points x_1, ..., x_n, represented as vectors in X = R^d, and the binary\nsimilarities S(x_i, x_j) between the points, with +1 corresponding to x_i and x_j being neighbors and\n\u22121 otherwise. Based on these n training points, [8] present a sophisticated optimization approach\nfor learning a thresholded linear hash function of the form f(x) = sign(Wx), where W \u2208 R^{k\u00d7d}.\nThis hash function is then applied and f(x_1), ..., f(x_n) are stored in the database. [8] evaluate\nthe quality of the hash by considering an independent set of test points and comparing S(x, x_i) to\nsign(\u27e8f(x), f(x_i)\u27e9 \u2212 \u03b8) on the test points x and the database objects (i.e. training points) x_i.\nIn our experiments, we followed the same protocol, but with the two asymmetric variations LIN:LIN\nand LIN:V, using the optimization method discussed in Sec. 
5. In order to obtain different balances\nbetween precision and recall, we should vary \u03b2 in (3), obtaining different codes for each value of\n\u03b2. However, as in the experiments of [8], we actually learn a code (i.e. mappings f(\u00b7) and g(\u00b7), or\na mapping f(\u00b7) and matrix V) using a fixed value of \u03b2 = 0.7, and then only vary the threshold \u03b8 to\nobtain the precision-recall curve.\n\n[Figure 2 panels: 10-D Uniform, LabelMe, MNIST, Peekaboom, Photo-tourism, Nursery]\n\nFigure 2: Average Precision (AP) of points retrieved using Hamming distance as a function of code length\nfor six datasets. Six curves represent: LSH, BRE, KSH, MLH, and two variants of our method: Asymmetric\nLIN:LIN and Asymmetric LIN:V. (Best viewed in color.)\n\n[Figure 3 panels: LabelMe, MNIST, Peekaboom]\n\nFigure 3: Code length required as a function of Average Precision (AP) for three datasets.\n\nIn all of our experiments, in addition to Minimal Loss Hashing (MLH), we also compare our\napproach to three other widely used methods: Kernel-Based Supervised Hashing (KSH) of [6], Binary\nReconstructive Embedding (BRE) of [5], and Locality-Sensitive Hashing (LSH) of [1].\u00b9\nIn our first set of experiments, we test performance of the asymmetric hash codes as a function of\nthe bit length. Figure 2 displays Average Precision (AP) of data points retrieved using Hamming\ndistance as a function of code length. These results are similar to ones reported by [8], where MLH\nyields higher precision compared to BRE and LSH. Observe that for all six datasets both variants\nof our method, asymmetric LIN:LIN and asymmetric LIN:V, consistently outperform all other\nmethods for different binary code lengths. The gap is particularly large for short codes. For example,\nfor the LabelMe dataset, MLH and KSH with 16 bits achieve AP of 0.52 and 0.54 respectively,\nwhereas LIN:V already achieves AP of 0.54 with only 8 bits. 
Figure 3 shows that similar performance\ngains appear for a number of other datasets. We also note that across all datasets LIN:V improves upon\nLIN:LIN for short-sized codes. These results clearly show that an asymmetric binary hash can be\nmuch more compact than a symmetric hash.\n\n\u00b9 We used the BRE, KSH and MLH implementations available from the original authors. For each method,\nwe followed the instructions provided by the authors. More specifically, we set the number of points for each\nhash function in BRE to 50 and the number of anchors in KSH to 300 (the default values). For MLH, we learn\nthe threshold and shrinkage parameters by cross-validation and other parameters are initialized to the suggested\nvalues in the package.\n\n[Figure 4 panels: LabelMe (16 bits, 64 bits), MNIST (16 bits, 64 bits)]\n\nFigure 4: Precision-Recall curves for LabelMe and MNIST datasets using 16-bit and 64-bit binary codes. (Best\nviewed in color.)\n\nFigure 5: Left: Precision-Recall curves for the Semantic 22K LabelMe dataset. Right: Percentage of 50\nground-truth neighbours as a function of retrieved images. 
(Best viewed in color.)\n\nNext, we show, in Figure 4, the full Precision-Recall curves for two datasets, LabelMe and MNIST,\nand for two specific code lengths: 16 and 64 bits. The performance of LIN:LIN and LIN:V is almost\nuniformly superior to that of MLH, KSH and BRE methods. We observed similar behavior also for\nthe four other datasets across various code lengths.\nResults on the previous six datasets show that asymmetric binary codes can significantly outperform\nother state-of-the-art methods on relatively small scale datasets. We now consider a much larger\nLabelMe dataset [13], called Semantic 22K LabelMe. It contains 20,019 training images and 2,000\ntest images, where each image is represented by a 512D GIST descriptor. The dataset also provides a\nsemantic similarity S(x, x\u2032) between two images based on semantic content (object labels overlap in\ntwo images). As argued by [8], hash functions learned using semantic labels should be more useful\nfor content-based image retrieval compared to Euclidean distances. Figure 5 shows that LIN:V with\n64 bits substantially outperforms MLH and KSH with 64 bits.\n\n7 Summary\n\nThe main point we would like to make is that when considering binary hashes in order to\napproximate similarity, even if the similarity measure is entirely symmetric and \u201cwell behaved\u201d, much power\ncan be gained by considering asymmetric codes. We substantiate this claim by both a theoretical\nanalysis of the possible power of asymmetric codes, and by showing, in a fairly direct experimental\nreplication, that asymmetric codes outperform state-of-the-art results obtained for symmetric codes.\nThe optimization approach we use is very crude. However, even using this crude approach, we could\nfind asymmetric codes that outperformed well-optimized symmetric codes. 
It should certainly be\npossible to develop much better, and more well-founded, training and optimization procedures.\nAlthough we demonstrated our results in a specific setting using linear threshold codes, we believe\nthe power of asymmetry is far more widely applicable in binary hashing, and view the experiments\nhere as merely a demonstration of this power. Using asymmetric codes instead of symmetric codes\ncould be much more powerful, and allow for shorter and more accurate codes, and is usually\nstraightforward and does not require any additional computational, communication or significant additional\nmemory resources when using the code. We would therefore encourage the use of such asymmetric\ncodes (with two distinct hash mappings) wherever binary hashing is used to approximate similarity.\n\nAcknowledgments\n\nThis research was partially supported by NSF CAREER award CCF-1150062 and NSF grant\nIIS-1302662.\n\nReferences\n\n[1] M. Datar, N. Immorlica, P. Indyk, and V.S. Mirrokni. Locality-sensitive hashing scheme based\non p-stable distributions. In Proceedings of the twentieth annual symposium on Computational\ngeometry, pages 253\u2013262. ACM, 2004.\n\n[2] W. Dong and M. Charikar. Asymmetric distance estimation with sketches for similarity search\nin high-dimensional spaces. SIGIR, 2008.\n\n[3] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean\napproach to learning binary codes for large-scale image retrieval. TPAMI, 2012.\n\n[4] A. Gordo and F. Perronnin. Asymmetric distances for binary embeddings. CVPR, 2011.\n[5] B. Kulis and T. Darrell. 
Learning to hash with binary reconstructive embeddings. NIPS, 2009.\n[6] W. Liu, R. Ji, J. Wang, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. CVPR,\n2012.\n\n[7] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. ICML, 2011.\n[8] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. ICML, 2011.\n[9] M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming distance metric learning. NIPS,\n2012.\n\n[10] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels.\nNIPS, 2009.\n\n[11] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate\nReasoning, 2009.\n\n[12] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3d. In\nProc. SIGGRAPH, 2006.\n\n[13] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition.\nCVPR, 2008.\n\n[14] J. Wang, S. Kumar, and S. Chang. Sequential projection learning for hashing with compact\ncodes. ICML, 2010.\n\n[15] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. NIPS, 2008.\n", "award": [], "sourceid": 1294, "authors": [{"given_name": "Behnam", "family_name": "Neyshabur", "institution": "TTI Chicago"}, {"given_name": "Nati", "family_name": "Srebro", "institution": "TTI Chicago"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "University of Toronto"}, {"given_name": "Yury", "family_name": "Makarychev", "institution": "TTI Chicago"}, {"given_name": "Payman", "family_name": "Yadollahpour", "institution": "TTI Chicago"}]}