{"title": "Efficient Sampling for Bipartite Matching Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1313, "page_last": 1321, "abstract": "Bipartite matching problems characterize many situations, ranging from ranking in information retrieval to correspondence in vision. Exact inference in real-world applications of these problems is intractable, making efficient approximation methods essential for learning and inference. In this paper we propose a novel {\\it sequential matching} sampler based on the generalization of the Plackett-Luce model, which can effectively make large moves in the space of matchings. This allows the sampler to match the difficult target distributions common in these problems: highly multimodal distributions with well separated modes. We present experimental results with bipartite matching problems - ranking and image correspondence - which show that the sequential matching sampler efficiently approximates the target distribution, significantly outperforming other sampling approaches.", "full_text": "Ef\ufb01cient Sampling for Bipartite Matching Problems\n\nMaksims N. Volkovs\nUniversity of Toronto\n\nmvolkovs@cs.toronto.edu\n\nRichard S. Zemel\nUniversity of Toronto\n\nzemel@cs.toronto.edu\n\nAbstract\n\nBipartite matching problems characterize many situations, ranging from ranking\nin information retrieval to correspondence in vision. Exact inference in real-\nworld applications of these problems is intractable, making ef\ufb01cient approxi-\nmation methods essential for learning and inference. In this paper we propose\na novel sequential matching sampler based on a generalization of the Plackett-\nLuce model, which can effectively make large moves in the space of matchings.\nThis allows the sampler to match the dif\ufb01cult target distributions common in\nthese problems: highly multimodal distributions with well separated modes. 
We\npresent experimental results with bipartite matching problems\u2014ranking and im-\nage correspondence\u2014which show that the sequential matching sampler ef\ufb01ciently\napproximates the target distribution, signi\ufb01cantly outperforming other sampling\napproaches.\n\n1\n\nIntroduction\n\nBipartite matching problems (BMPs), which involve mapping one set of items to another, are ubiq-\nuitous, with applications ranging from computational biology to information retrieval to computer\nvision. Many problems in these domains can be expressed as a bipartite graph, with one node for\neach of the items, and edges representing the compatibility between pairs.\nIn a typical BMP a set of labeled instances with target matches is provided together with feature\ndescriptions of the items. The features for any two items do not provide a natural measure of\ncompatibility between the items, i.e., should they be matched or not. Consequently the goal of\nlearning is to create a mapping from the item features to the target matches such that when an\nunlabeled instance is presented the same mapping can be applied to accurately infer the matches.\nProbabilistic formulations of this problem, which involve specifying a distribution over possible\nmatches, have become increasingly popular, e.g., [23, 26, 1], and these models have been applied to\nproblems ranging from preference aggregation in social choice and information retrieval [7, 13] to\nmultiple sequence protein alignment in computational biology [24, 27].\nHowever, exact learning and inference in real-world applications of these problems quickly become\nintractable because the state space is typically factorial in the number of items. Approximate infer-\nence methods are also problematic in this domain. Variational approaches, in which aspects of the\njoint distribution are treated independently, may miss important contingencies in the joint. 
On the\nother hand sampling is hard, plagued by the multimodality and strict constraints inherent in discrete\ncombinatorial spaces.\nRecently there has been a \ufb02urry of new methods for sampling for bipartite matching problems.\nSome of these have strong theoretical properties [10, 9], while others are appealingly simple [6, 13].\nHowever, to the best of our knowledge, even for simple versions of bipartite matching problems,\nno ef\ufb01cient sampler exists. In this paper we propose a novel Markov Chain Monte Carlo (MCMC)\nmethod applicable to a wide subclass of BMPs. We compare the ef\ufb01ciency and performance of our\nsampler to others on two applications.\n\n1\n\n\f2 Problem Formulation\nA standard BMP consists of the two sets of N items U = {u1, ..., uN} and V = {v1, ..., vN}. The\ngoal is to \ufb01nd an assignment of the items so that every item in U is matched to exactly one item in V\nand no two items share the same match. In this problem an assignment corresponds to a permutation\n\u03c0 where \u03c0 is a bijection {1, ..., N} \u2192 {1, ..., N}, mapping each item in U to its match in V ; we\nuse the terms assignment and permutation interchangeably. We de\ufb01ne \u03c0(i) = j to denote the index\nof a match v\u03c0(i) = vj for item ui in \u03c0 and use \u03c0\u22121(j) = i to denote the reverse. Permutations\nhave a useful property that any subset of the permutation also constitutes a valid permutation with\nrespect to the items in the subset. We will utilize this property in later sections; here we introduce\nthe notation. 
Given a full permutation \u03c0 we de\ufb01ne \u03c01:t (with \u03c01:0 = \u2205) as the partial permutation over only the \ufb01rst t items in U.\nTo express uncertainty over assignments, we use the standard Gibbs form to de\ufb01ne the probability of a permutation \u03c0:\n\nP(\u03c0|\u03b8) = (1/Z(\u03b8)) exp(\u2212E(\u03c0, \u03b8)),  Z(\u03b8) = \u03a3_\u03c0 exp(\u2212E(\u03c0, \u03b8))  (1)\n\nwhere \u03b8 is the set of model parameters and E(\u03c0, \u03b8) is the energy. We assume, without loss of generality, that the energy E(\u03c0, \u03b8) is given by a sum of single and/or higher order potentials.\nMany important problems can be formulated in this form. For example, in information retrieval the crucial problem of learning a ranking function can be modeled as a BMP [12, 26]. In this domain U corresponds to a set of documents and V to a set of ranks. The energy of a given assignment is typically formulated as a combination of the ranks and the model's output from the query-document features. For example, in [12] the energy is de\ufb01ned as:\n\nE(\u03c0, \u03b8) = \u2212(1/N) \u03a3_{i=1}^N \u03b8_i (N \u2212 \u03c0(i) + 1)  (2)\n\nwhere \u03b8_i is a score assigned by the model to u_i. Similarly, in computer vision the problem of \ufb01nding a correspondence between sets of images can be expressed as a BMP [5, 3, 17]. Here U and V are typically sets of points in the two images and the energy is de\ufb01ned on the feature descriptors of these points. For example, in [17] the energy is given by:\n\nE(\u03c0, \u03b8) = (1/|\u03c8|) \u03a3_{i=1}^N \u27e8\u03b8, (\u03c8^u_i \u2212 \u03c8^v_{\u03c0(i)})\u00b2\u27e9  (3)\n\nwhere \u03c8^u_i and \u03c8^v_{\u03c0(i)} are feature descriptors for points u_i and v_{\u03c0(i)}. Finally, some clustering problems can also be expressed in the form of a BMP [8]. It is important to note here that for all models where the energy is additive we can compute the energy E(\u03c01:t, \u03b8) of any partial permutation \u03c01:t by summing the potentials only over the t assignments in \u03c01:t. For instance, for the energy in Equation 3, E(\u03c01:t, \u03b8) = (1/|\u03c8|) \u03a3_{i=1}^t \u27e8\u03b8, (\u03c8^u_i \u2212 \u03c8^v_{\u03c01:t(i)})\u00b2\u27e9, with E(\u03c01:0, \u03b8) = 0.\nLearning in these models typically involves maximizing the log probability of the correct match as a function of \u03b8. To do this one generally needs the gradient of the log probability with respect to \u03b8:\n\n\u2202 log P(\u03c0|\u03b8)/\u2202\u03b8 = \u2212\u2202E(\u03c0, \u03b8)/\u2202\u03b8 \u2212 \u2202 log Z(\u03b8)/\u2202\u03b8\n\nUnfortunately, computing the gradient of the partition function requires a summation over all N! valid assignments, which very quickly becomes intractable. For example, for N = 20, \ufb01nding \u2202 log Z(\u03b8)/\u2202\u03b8 requires over 10^17 summations. Thus effective approximation techniques are necessary to learn such models.\nA particular instance of BMP that has been studied extensively is the maximum weight bipartite matching problem (WBMP). In WBMP the energy reduces to a single unary potential \u03c6:\n\nE_unary(\u03c0, \u03b8) = \u03a3_i \u03c6(u_i, v_{\u03c0(i)}, \u03b8)  (4)\n\nEquations 2 and 3 are both examples of WBMP energies. Finding the optimal assignment is tractable and can be solved in O(N\u00b3) [16]. Determining the partition function in a WBMP is equivalent to \ufb01nding the permanent of the edge weight matrix (de\ufb01ned by the unary potential), a well-known #P-complete problem [25]. The majority of the proposed samplers are designed for WBMPs and cannot be applied to the more general BMPs where the energy includes higher order potentials.
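To make the scale of the summation concrete, the following sketch (the names `phi`, `unary_energy`, and the toy weights are illustrative assumptions, not from the paper) enumerates Z(\u03b8) by brute force and checks that, for a purely unary (WBMP) energy, Z(\u03b8) coincides with the permanent of the edge-weight matrix W_ij = exp(\u2212\u03c6(u_i, v_j)). This is only feasible for toy N, which is exactly the point:

```python
import itertools
import math

def partition_function(energy, n):
    """Brute-force Z(theta): enumerate all n! assignments (illustration only)."""
    return sum(math.exp(-energy(pi)) for pi in itertools.permutations(range(n)))

# Hypothetical 3x3 unary potentials phi(u_i, v_j); any values work here.
phi = [[0.5, 1.2, 0.3],
       [0.9, 0.1, 1.5],
       [0.4, 0.7, 0.2]]

def unary_energy(pi):
    """WBMP energy of Eq. (4): a sum of unary potentials along the matching."""
    return sum(phi[i][j] for i, j in enumerate(pi))

def permanent(W):
    """Naive O(n * n!) permanent of a square matrix."""
    n = len(W)
    return sum(math.prod(W[i][p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))

# For a unary energy, Z(theta) equals the permanent of W[i][j] = exp(-phi[i][j]).
W = [[math.exp(-c) for c in row] for row in phi]
assert abs(partition_function(unary_energy, 3) - permanent(W)) < 1e-9
```

The equivalence follows directly from exp(\u2212\u03a3_i \u03c6) = \u03a0_i exp(\u2212\u03c6); for N = 20 the same loop would already require on the order of 10^18 terms.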
However, distributions based on higher order potentials allow greater \ufb02exibility and have been actively used in problems ranging from computer vision and robotics [20, 2] to information retrieval [19, 26]. There is thus an evident need for an effective sampler applicable to any BMP distribution.\n\n3 Related Approaches\n\nIn this section we brie\ufb02y describe existing sampling approaches, some of which have been developed speci\ufb01cally for bipartite matching problems while others come from matrix permanent research.\n\n3.1 Gibbs Sampling\n\nGibbs and block-Gibbs sampling can be applied straightforwardly to sample from distributions de\ufb01ned by Equation 1. We start with some initial assignment \u03c0 and consider a subset of items in U; for illustration purposes we use two items u_i and u_j. Given the selected subset, the Gibbs sampler considers all possible assignment swaps within it. In our example there are only two possibilities: leave \u03c0 unchanged, or swap \u03c0(i) with \u03c0(j) to produce a new permutation \u03c0\u2032. Conditioned on the assignment of all the other items in U that were not selected, the probability of each permutation is:\n\np(\u03c0\u2032|\u03c0\\{i,j}) = exp(\u2212E(\u03c0\u2032, \u03b8)) / (exp(\u2212E(\u03c0, \u03b8)) + exp(\u2212E(\u03c0\u2032, \u03b8))),  p(\u03c0|\u03c0\\{i,j}) = 1 \u2212 p(\u03c0\u2032|\u03c0\\{i,j})\n\nwhere \u03c0\\{i,j} is the permutation \u03c0 with u_i and u_j removed. We sample using these probabilities to either stay at \u03c0 or move to \u03c0\u2032, and repeat the process.\nGibbs sampling has been applied to a wide range of energy-based probabilistic models. It is often found to mix very slowly and to get trapped in local modes [22]. The main reason is that the path from one probable assignment to another using only pairwise swaps is likely to go through regions that have very low probability [5].
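As a concrete illustration, the pairwise-swap update above might be sketched as follows (the generic `energy` callable and the ranking-energy usage with made-up scores are illustrative assumptions, not the authors' code):

```python
import math
import random

def gibbs_swap_step(pi, energy, rng=random):
    """One pairwise-swap Gibbs update: pick two items of U at random and
    swap their matches with probability proportional to the Boltzmann
    weights of the two resulting assignments."""
    i, j = rng.sample(range(len(pi)), 2)
    pi_swap = list(pi)
    pi_swap[i], pi_swap[j] = pi_swap[j], pi_swap[i]
    e_stay, e_swap = energy(pi), energy(pi_swap)
    p_swap = math.exp(-e_swap) / (math.exp(-e_stay) + math.exp(-e_swap))
    return pi_swap if rng.random() < p_swap else list(pi)

# Usage with a ranking-style unary energy in the spirit of Eq. (2),
# using 0-indexed ranks and arbitrary example scores theta.
theta = [2.0, 0.5, 1.0]
N = len(theta)
rank_energy = lambda pi: -sum(theta[i] * (N - pi[i]) / N for i in range(N))
pi = list(range(N))
for _ in range(100):
    pi = gibbs_swap_step(pi, rank_energy)
```

Note that every step preserves a complete, valid assignment; the weakness discussed above is not validity but the locality of the moves.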
This makes it very unlikely that such moves will be accepted, which typically traps the sampler in one mode. Thus the local structure of the Gibbs sampler is likely to be inadequate for problems of the type considered here, in which several probable assignments produce well-separated modes.\n\n3.2 Chain-Based Approaches\n\nChain-based methods extend the assignment-swap idea behind the Gibbs sampler to generate samples more ef\ufb01ciently from WBMP distributions. Instead of randomly choosing subsets of items to swap, chain-based methods generate a sequence (chain) of interdependent swaps. Given a (random) starting permutation \u03c0, an item u_i (currently matched with v_\u03c0(i)) is selected at random and a new match v_j is proposed with probability p(u_i, v_j|\u03b8), where p depends on the unary potential \u03c6(u_i, v_j, \u03b8) in the WBMP energy (see Equation 4). Assuming that the match {u_i, v_j} is selected, the matches {u_i, v_\u03c0(i)} and {u_\u03c0\u22121(j), v_j} are removed from \u03c0 and {u_i, v_j} is added to make \u03c0\u2032. After this change u_\u03c0\u22121(j) and v_\u03c0(i) are no longer matched to any item, so \u03c0\u2032 is a partial assignment. The procedure then \ufb01nds a new match for u_\u03c0\u22121(j) using p. This chain-like match sampling is repeated either until \u03c0\u2032 is a complete assignment or a termination criterion is reached.\nSeveral chain-based methods have been proposed, including the chain \ufb02ipping approach [5] and the Markov Chain approach [11]. Dellaert et al. [5] empirically demonstrated that the chain \ufb02ipping sampler can mix better than the Gibbs sampler when applied to multimodal distributions. However, chain-based methods also have several drawbacks that signi\ufb01cantly affect their performance. First, unlike the Gibbs sampler, which always maintains a valid assignment, the intermediate assignments \u03c0\u2032 in chain-based methods are incomplete.
This means that the chain either has to be run until a valid assignment is generated [5] or terminated early, producing an incomplete assignment [11]. In the \ufb01rst case the sampler has a non-deterministic run time, whereas in the second case the incomplete assignment cannot be taken as a valid sample from the model. Finally, to the best of our knowledge no chain-based method can be applied to general BMPs because they are speci\ufb01cally designed for E_unary (see Equation 4).\n\nFigure 1 (panels (a)\u2013(d): t = 0, 1, 2, 3): Top row: Plackett-Luce generative process viewed as rank matching. Bottom row: sequential matching procedure. Items are U = {u1, u2, u3} and V = {v1, v2, v3}; the reference permutation is \u03c3 = {2, 3, 1}. Proposed matches are shown as red dotted arrows and accepted matches as black arrows.\n\n3.3 Recursive Partitioning Algorithm\n\nThe recursive partitioning algorithm [10] was developed to obtain exact samples from the distribution for WBMP. This method is considered the state of the art in matrix permanent research and, to the best of our knowledge, has the lowest expected run time. Recursive partitioning proceeds by splitting the space of all valid assignments \u03a9 into K subsets \u03a91, ..., \u03a9K with corresponding partition functions Z1, ..., ZK. It then samples one of these subsets and repeats the partitioning procedure recursively, generating exact samples from a WBMP distribution.\nDespite strong theoretical guarantees, the recursive partitioning procedure has a number of limitations that signi\ufb01cantly affect its applicability. First, the running time of the sampler is non-deterministic, as the algorithm has to be restarted every time a sample falls outside of \u03a9.
The probability of a restart increases with N, which is an undesirable property, especially for training large models where one typically needs precise control over the time spent in each training phase. Moreover, this algorithm is also speci\ufb01c to WBMP and cannot be generalized to sample from arbitrary BMP distributions with higher order potentials.\n\n3.4 Plackett-Luce Model\n\nOur proposed sampling approach is based on a generalization of the well-established Plackett-Luce model [18, 14], a generative model for permutations. Given a set of items V = {v1, ..., vN}, a Plackett-Luce model is parametrized by a set of weights (one per item) W = {w1, ..., wN}. Under this model a permutation \u03c0 is generated by \ufb01rst selecting item v_\u03c0(1) from the set of N items and placing it in the \ufb01rst position, then selecting v_\u03c0(2) from the remaining N \u2212 1 items and placing it in the second position, and so on until all N items are placed. The probability of \u03c0 under this model is given by:\n\nQ(\u03c0) = [exp(w_\u03c0(1)) / \u03a3_{i=1}^N exp(w_\u03c0(i))] \u00d7 [exp(w_\u03c0(2)) / \u03a3_{i=2}^N exp(w_\u03c0(i))] \u00d7 ... \u00d7 [exp(w_\u03c0(N)) / exp(w_\u03c0(N))]  (5)\n\nHere exp(w_\u03c0(t)) / \u03a3_{i=t}^N exp(w_\u03c0(i)) is the probability of choosing the item v_\u03c0(t) out of the N \u2212 t + 1 remaining items. It can be shown that Q is a valid distribution with \u03a3_\u03c0 Q(\u03c0) = 1. Moreover, it is very easy to draw exact samples from Q by applying the sequential procedure described above. In the next section we show how this model can be generalized to draw samples from any BMP distribution.\n\n4 Sampling by Sequentially Matching Vertices\n\nIn this section we introduce a class of proposal distributions that can be used effectively in conjunction with the Metropolis-Hastings algorithm to obtain samples from a BMP distribution.
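The sequential generative procedure and the product form of Equation 5 can be sketched as follows (illustrative code, not the authors' implementation; the weights `w` are arbitrary example values):

```python
import itertools
import math
import random

def sample_plackett_luce(w, rng=random):
    """Draw a permutation from a Plackett-Luce model: repeatedly pick one
    of the remaining items with probability proportional to exp(w_i)."""
    remaining = list(range(len(w)))
    pi = []
    while remaining:
        weights = [math.exp(w[i]) for i in remaining]
        choice = rng.choices(remaining, weights=weights)[0]
        pi.append(choice)
        remaining.remove(choice)
    return pi

def plackett_luce_prob(pi, w):
    """Probability of a full permutation under Eq. (5)."""
    prob, remaining = 1.0, list(pi)
    for item in pi:
        z = sum(math.exp(w[j]) for j in remaining)  # normalizer over items left
        prob *= math.exp(w[item]) / z
        remaining.remove(item)
    return prob

w = [0.3, 1.0, -0.5]
pi = sample_plackett_luce(w)
# Sanity check that Q is a proper distribution over all 3! permutations.
total = sum(plackett_luce_prob(list(p), w) for p in itertools.permutations(range(3)))
```

The `total` computed at the end should equal 1, which is the \u03a3_\u03c0 Q(\u03c0) = 1 property stated above.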
Our approach is based on the observation that the sequential procedure behind the Plackett-Luce model can also be extended to generate matches between item sets. Instead of placing items into ranked positions, we can think of the Plackett-Luce generative process as sequentially matching ranks to the items in V, as illustrated in the top row of Figure 1. To generate the permutation \u03c0 = {2, 3, 1} the Plackett-Luce model \ufb01rst matches rank 1 with v_\u03c0(1) = v2, then rank 2 with v_\u03c0(2) = v3, and \ufb01nally rank 3 with v_\u03c0(3) = v1. Taking this one step further, we can replace the ranks with a general item set U and repeat the same process. Unlike ranks, the items in U do not have a natural order, so we use a reference permutation \u03c3, which speci\ufb01es the order in which the items in U are matched. We refer to this procedure as sequential matching. The bottom row of Figure 1 illustrates this process.\nFormally, the sequential matching process proceeds as follows: given some reference permutation \u03c3, we start with an empty assignment \u03c01:0 = \u2205. Then at each iteration t = 1, ..., N the corresponding item u_\u03c3(t) is matched with one of the items in V \\ \u03c01:t\u22121 = {v_jt, ..., v_jN}, the set of items not matched in \u03c01:t\u22121. Note that, as in the Plackett-Luce model, |V \\ \u03c01:t\u22121| = N \u2212 t + 1, so at each iteration u_\u03c3(t) has N \u2212 t + 1 remaining items in V \\ \u03c01:t\u22121 to match with. We de\ufb01ne the conditional probability of each such match to be p(v_j|u_\u03c3(t), \u03c01:t\u22121), with \u03a3_{v_j \u2208 V\\\u03c01:t\u22121} p(v_j|u_\u03c3(t), \u03c01:t\u22121) = 1. After N iterations the permutation \u03c01:N = \u03c0 is produced with probability:\n\nQ(\u03c0|\u03c3) = \u03a0_{t=1}^N p(v_\u03c0(\u03c3(t))|u_\u03c3(t), \u03c01:t\u22121)  (6)\n\nwhere v_\u03c0(\u03c3(t)) is the match for u_\u03c3(t) in \u03c0.
The conditional match probabilities depend both on the current item u_\u03c3(t) and on the partial assignment \u03c01:t\u22121. Introducing this dependency generalizes the Plackett-Luce model, which only takes into account that the items in \u03c01:t\u22121 are already matched, not how they are matched. This dependency becomes very important when the energy contains pairwise and/or higher order potentials, as it allows us to compute the change in energy for each new match, in turn allowing close approximations to the target BMP distribution.\nWe can show that the distribution Q de\ufb01ned by the p's is a valid distribution over assignments:\n\nProposition 1. For any reference permutation \u03c3 and any choice of matching probabilities that satisfy \u03a3_{v_j \u2208 V\\\u03c01:t\u22121} p(v_j|u_\u03c3(t), \u03c01:t\u22121) = 1, the distribution given by Q(\u03c0|\u03c3) = \u03a0_{t=1}^N p(v_\u03c0(\u03c3(t))|u_\u03c3(t), \u03c01:t\u22121) is a valid probability distribution over assignments.\u00b9\n\nThe important consequence of this proposition is that it allows us to work with a very rich class of matching probabilities with arbitrary dependencies and still obtain a valid distribution over assignments, together with a simple way to generate exact samples from it. This opens many avenues for tailoring proposal distributions for MCMC applications to speci\ufb01c BMPs. In the next section we propose one such approach.\n\n4.1 Proposal Distribution\n\nGiven the general matching probabilities, the goal is to de\ufb01ne them so that the resulting proposal distribution Q matches the target distribution as closely as possible. One way of achieving this is through the partial energy E(\u03c01:t, \u03b8) (see Section 2).
The partial energy ignores all the items that are not matched in \u03c01:t and thus provides an estimate of the "current" energy at each iteration t. Using partial energies we can also \ufb01nd the changes in energy when a given item is matched. Given that our goal is to explore low-energy (high-probability) modes, we de\ufb01ne the matching probabilities as:\n\np(v_j|u_\u03c3(t), \u03c01:t\u22121) = exp(\u2212E(H(v_j, u_\u03c3(t), \u03c01:t\u22121), \u03b8)) / Z_t(u_\u03c3(t), \u03c01:t\u22121),  Z_t(u_\u03c3(t), \u03c01:t\u22121) = \u03a3_{v_j \u2208 V\\\u03c01:t\u22121} exp(\u2212E(H(v_j, u_\u03c3(t), \u03c01:t\u22121), \u03b8))  (7)\n\nwhere H(v_j, u_\u03c3(t), \u03c01:t\u22121) is the partial assignment that results when the match {u_\u03c3(t), v_j} is added to \u03c01:t\u22121. The normalizing constant Z_t ensures that the probabilities sum to 1, the necessary condition for Proposition 1 to apply. It is useful to rewrite the matching probabilities as:\n\np(v_j|u_\u03c3(t), \u03c01:t\u22121) = exp(\u2212E(H(v_j, u_\u03c3(t), \u03c01:t\u22121), \u03b8) + E(\u03c01:t\u22121, \u03b8)) / Z*_t(u_\u03c3(t), \u03c01:t\u22121),  Z*_t(u_\u03c3(t), \u03c01:t\u22121) = \u03a3_{v_j \u2208 V\\\u03c01:t\u22121} exp(\u2212E(H(v_j, u_\u03c3(t), \u03c01:t\u22121), \u03b8) + E(\u03c01:t\u22121, \u03b8))\n\nAdding E(\u03c01:t\u22121, \u03b8) to each item's energy does not change the probabilities because this term cancels during normalization (it does, however, change the partition function, denoted Z*_t here). In this form we see that p(v_j|u_\u03c3(t), \u03c01:t\u22121) is directly related to the change in the partial energy from \u03c01:t\u22121 to H(v_j, u_\u03c3(t), \u03c01:t\u22121): the larger the decrease in energy, the larger the resulting probability. Thus, the matching choices are made solely on the basis of changes in the partial energy. Reorganizing the terms yields the proposal distribution:\n\nQ(\u03c0|\u03c3) = [exp(\u2212E(\u03c01:1, \u03b8) + E(\u03c01:0, \u03b8)) / Z*_1(u_\u03c3(1), \u03c01:0)] \u00d7 ... \u00d7 [exp(\u2212E(\u03c01:N, \u03b8) + E(\u03c01:N\u22121, \u03b8)) / Z*_N(u_\u03c3(N), \u03c01:N\u22121)] = exp(\u2212E(\u03c0, \u03b8)) / Z*(\u03c0, \u03c3)\n\nHere Z*(\u03c0, \u03c3) is the normalization factor, which depends both on the reference permutation \u03c3 and on the generated assignment \u03c0. The resulting proposal distribution is essentially a renormalized version of the target distribution: the numerator remains the exponentiated negative energy, but the denominator is no longer a constant; rather, it is a function of the generated assignment and the reference permutation. Note that the proposal distribution de\ufb01ned above can be used to generate samples for any target distribution with an arbitrary energy consisting of single and/or higher order potentials. To the best of our knowledge, aside from the Gibbs sampler this is the only sampling procedure that can be applied to arbitrary BMP distributions.\n\n\u00b9The proof is in the supplementary material.\n\n4.2 Temperature and Chain Properties\n\nThe acceptance rate, a key property of any sampler, is typically controlled by a parameter that either sharpens or \ufb02attens the proposal distribution. To achieve this effect with the sequential matching model we introduce an additional temperature parameter \u03c1: p(v_j|u_\u03c3(t), \u03c01:t\u22121, \u03c1) \u221d exp(\u2212E(H(v_j, u_\u03c3(t), \u03c01:t\u22121), \u03b8)/\u03c1). Decreasing \u03c1 leads to sharp proposal distributions, typically highly skewed towards one speci\ufb01c assignment, while increasing \u03c1 makes the proposal distribution approach the uniform distribution.
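A minimal sketch of drawing a tempered proposal \u03c0 ~ Q(\u00b7|\u03c3) as described above, assuming an additive energy function that scores partial assignments (all names, including the toy `phi` and `partial_energy`, are illustrative, not the authors' implementation):

```python
import math
import random

def sequential_matching_proposal(sigma, energy, rho=1.0, rng=random):
    """Draw pi ~ Q(.|sigma): match the items of U in the order given by
    sigma, scoring each candidate by the tempered energy of the resulting
    partial assignment H(v_j, u_sigma(t), pi_1:t-1).
    Returns the proposal and its probability Q(pi|sigma)."""
    n = len(sigma)
    pi = [None] * n            # pi[u] = matched item of V, None = unmatched
    unmatched = list(range(n))  # items of V not yet used
    q = 1.0
    for t in range(n):
        u = sigma[t]
        weights = []
        for v in unmatched:     # score each candidate partial assignment
            pi[u] = v
            weights.append(math.exp(-energy(pi) / rho))
            pi[u] = None
        z = sum(weights)        # the normalizer Z_t of Eq. (7)
        probs = [w / z for w in weights]
        k = rng.choices(range(len(unmatched)), weights=probs)[0]
        pi[u] = unmatched.pop(k)
        q *= probs[k]
    return pi, q

# Hypothetical unary potentials; the energy sums over matched pairs only,
# so it is well defined on partial assignments (E(pi_1:0) = 0).
phi = [[0.2, 1.0, 0.4], [0.8, 0.1, 0.9], [0.5, 0.6, 0.3]]
def partial_energy(pi):
    return sum(phi[i][j] for i, j in enumerate(pi) if j is not None)

pi, q = sequential_matching_proposal([2, 0, 1], partial_energy, rho=0.5)
```

The returned `q` is exactly the forward probability Q(\u03c0|\u03c3) needed in the Metropolis-Hastings acceptance ratio; lowering `rho` concentrates each step on the lowest-energy candidate.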
By adjusting \u03c1 we can control the range of the proposed moves, and therefore the acceptance rate.\nTo ensure that the SM sampler converges to the required distribution, we verify that it satis\ufb01es the three requisite properties: detailed balance, ergodicity, and aperiodicity [15]. Detailed balance is satis\ufb01ed because every Metropolis-Hastings algorithm satis\ufb01es detailed balance. Ergodicity follows from the fact that the insertion probabilities are always strictly greater than 0, so any \u03c0 is reachable from any \u03c3 in one proposal cycle. Finally, aperiodicity follows from the fact that the chain allows self-transitions.\n\n4.3 Reference Permutation\n\nFixing the reference permutation \u03c3 yields a state-independent sampler. Empirically we found that setting \u03c3 to the MAP permutation gives good performance for WBMP problems. However, for the general energy-based distributions considered here, \ufb01nding the MAP state can be very expensive and in many cases intractable. Moreover, even if the MAP state can be found ef\ufb01ciently, there is no guarantee that using it as the reference permutation will lead to a good sampler. To avoid these problems we use a state-dependent sampler in which the reference permutation \u03c3 is updated every time a sample is accepted. In the matching example (bottom row of Figure 1), if the new match at t = 3 is accepted then \u03c3 would be updated to {3, 1, 2}. Empirically we found the state-dependent sampler to be more stable, with consistent performance across different random initializations of the reference permutation. Algorithm 1 summarizes the Metropolis-Hastings procedure for the state-dependent sequential matching sampler.\n\nAlgorithm 1 Sequential Matching (SM)\nInput: \u03c3, M, \u03c1\nfor m = 1 to M do\n  Initialize \u03c01:0 = \u2205\n  for t = 1 to N do {generate sample from Q(\u00b7|\u03c3)}\n    Find a match v_j for u_\u03c3(t) using p(v_j|u_\u03c3(t), \u03c01:t\u22121, \u03c1)\n    Add {u_\u03c3(t), v_j} to \u03c01:t\u22121 to get \u03c01:t\n  end for\n  Calculate forward probability: Q(\u03c0|\u03c3) = \u03a0_{t=1}^N p(v_\u03c0(\u03c3(t))|u_\u03c3(t), \u03c01:t\u22121, \u03c1)\n  Calculate backward probability: Q(\u03c3|\u03c0) = \u03a0_{t=1}^N p(v_\u03c3(\u03c0(t))|u_\u03c0(t), \u03c31:t\u22121, \u03c1)\n  if Uniform(0, 1) < [exp(\u2212E(\u03c0, \u03b8)) Q(\u03c3|\u03c0)] / [exp(\u2212E(\u03c3, \u03b8)) Q(\u03c0|\u03c3)] then\n    \u03c3 \u2190 \u03c0\n  end if\nend for\nReturn: \u03c3\n\n5 Experiments\n\nTo test the sequential matching sampling approach we conducted extensive experiments. We considered document ranking and image matching, two popular applications of BMP; and for the sake of\n\nTable 1: Average Hellinger distances for the learning to rank (left half) and image matching (right half) problems. Statistically signi\ufb01cant results are underlined. Note that Hellinger distances for N = 8 are not directly comparable to those for N = 25, 50, since approximate normalization is used for N > 8.
For N = 50 we were unable to obtain a single sample from the RP sampler for any c within the allocated time limit (over 5 minutes).\n\nLearning to Rank      c=20   c=40   c=60   c=80   c=100\nN = 8:  GB  0.7948  0.6211  0.4635  0.4218  0.3737\n        CF  0.9012  0.8987  0.8887  0.8714  0.8748\n        RP  0.7945  0.6209  0.4629  0.4986  0.3734\n        SM  0.7902  0.6188  0.4636  0.4474  0.3725\nN = 25: GB  0.9533  0.9728  0.9646  0.9449  0.9486\n        CF  0.9767  0.9990  0.9937  0.9953  0.9781\n        RP  0.9533  0.9728  0.9694  0.9462  0.9673\n        SM  0.1970  0.1937  0.2899  0.4166  0.3858\nN = 50: GB  0.9983  0.9991  0.9988  0.9974  0.9985\n        CF  0.9841  0.9995  0.9993  0.9906  0.9305\n        SM  0.1617  0.2335  0.3462  0.4931  0.4895\n\nImage Matching        c=0.2  c=0.4  c=0.6  c=0.8  c=1\nN = 8:  GB  0.9108  0.8868  0.8320  0.7616  0.6533\n        CF  0.9112  0.8882  0.8336  0.7672  0.6623\n        RP  0.9110  0.8870  0.8312  0.7623  0.6548\n        SM  0.9109  0.8866  0.8307  0.7621  0.6557\nN = 25: GB  0.7246  0.8669  0.9902  0.9960  0.9976\n        CF  0.7243  0.8675  0.9904  0.9950  0.9807\n        RP  0.7279  0.9788  0.9896  0.9988  0.9969\n        SM  0.7234  0.8471  0.8472  0.6350  0.5576\nN = 50: GB  0.6949  0.9646  1.0000  1.0000  1.0000\n        CF  0.6960  0.9635  1.0000  1.0000  0.9992\n        SM  0.6941  0.9243  0.7016  0.3550  0.1677\n\ncomparison we concentrated on WBMP, as most of the methods cannot be applied to general BMP problems. When comparing the samplers we concentrated on evaluating how well the Monte Carlo estimates of probabilities produced by the samplers approximate the true distribution P. When the target probabilities are known, this method of evaluation provides a good estimate of performance, since the ultimate goal of any sampler is to approximate P as closely as possible.\nFor all experiments the Hellinger distance was used to compare the true distributions with the approximations produced by the samplers. We chose this metric because it is symmetric and bounded; furthermore, it avoids the log(0) problems that arise in cross-entropy measures. For any two distributions P and Q the Hellinger distance is given by D = (1 \u2212 \u03a3_\u03c0 (P(\u03c0)Q(\u03c0))^{1/2})^{1/2}. Note that 0 \u2264 D \u2264 1, where 0 indicates that P = Q. Computing D exactly quickly becomes intractable as the number of items grows. To overcome this problem we note that if a given permutation \u03c0 is not generated by any of the samplers, then the term P(\u03c0)Q(\u03c0) is 0 and does not affect the resulting estimate of D for any sampler. Hence we can locally approximate D up to a constant for all samplers by changing Equation 1 to P(\u03c0|\u03b8) \u2248 exp(\u2212E(\u03c0, \u03b8)) / \u03a3_{\u03c0\u2032\u2208\u03a9*} exp(\u2212E(\u03c0\u2032, \u03b8)), where \u03a9* is the union of all distinct permutations produced by the samplers. The Hellinger distance is then estimated with respect to \u03a9*.\nFor all experiments we ran the samplers on small (N = 8), medium (N = 25) and large (N = 50) scale problems. The sampling chains for each method were run in parallel using 4 cores; the use of multiprocessor boards such as GPUs allows our method to scale to large problems. We compare the SM approach with the Gibbs (GB), chain \ufb02ipping (CF) and recursive partitioning (RP) samplers. To run RP we used the code available from the author's webpage. These methods cover the primary leading approaches in WBMP and matrix permanent research.\nSince any valid sampler will eventually produce samples from the target distribution, we tested the methods with short chain lengths. This regime also simulates real applications of the methods where, due to computational time limits, the user is typically unable to run long chains. Note that this is especially relevant if the distributions are being sampled as an inner loop during parameter optimization. Furthermore, to make comparisons fair we used the block GB sampler with a block size of 7 (the largest computationally feasible size) as the reference point.
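The locally renormalized Hellinger estimate over \u03a9* described above might be computed as follows (a sketch under the assumption that the target is represented by per-permutation energies and the sampler's output by sample counts; all names and the toy two-permutation example are illustrative):

```python
import math

def hellinger_over_support(energies_p, counts_q):
    """Estimate D = sqrt(1 - sum_pi sqrt(P(pi) * Q(pi))), with the target P
    renormalized over the support Omega* (here: the keys of energies_p) and
    Q taken as the sampler's empirical distribution."""
    zp = sum(math.exp(-e) for e in energies_p.values())   # local partition fn.
    zq = sum(counts_q.values())                            # total sample count
    bc = sum(math.sqrt((math.exp(-energies_p[pi]) / zp) *
                       (counts_q.get(pi, 0) / zq))
             for pi in energies_p)                         # Bhattacharyya coeff.
    return math.sqrt(max(0.0, 1.0 - bc))                   # guard tiny negatives

# Toy example: two equally likely permutations, sampled equally often,
# so the empirical Q matches P and the distance should be ~0.
E = {(0, 1): 1.0, (1, 0): 1.0}
C = {(0, 1): 50, (1, 0): 50}
d = hellinger_over_support(E, C)
```

Permutations outside \u03a9* contribute a zero term to the sum, which is what makes the restriction to the sampled support a consistent comparison across samplers.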
We used 2N swaps for each GB chain, setting the number of iterations for the other methods to match the total run time of GB (in all experiments the difference in running times between GB and SM did not exceed 10%). The run times of the CF and RP methods are dif\ufb01cult to control as they are non-deterministic. To deal with this we set an upper bound on the running time (consistent with the other methods) after which CF and RP were terminated. Finally, the temperature for SM was chosen in the [0.1, 1] interval to keep the acceptance rate approximately between 20% and 60%.\n\n5.1 Learning to Rank\n\nFor the learning to rank problem we used the Yahoo! Learning To Rank dataset [4]. For each query the distribution over assignments was parametrized by the energy given in Equation 2, where \u03b8_i is the output of a neural network scoring function trained on query-document features. After pretraining the network on the full dataset, we randomly selected 50 queries with N = 8, 25, 50 documents and used the GB, CF, RP and SM methods to generate 1000 samples for each query. To gain insight into sampling accuracy we experimented with different distribution shapes by introducing an additional scaling constant c, so that P(\u03c0|\u03b8, c) \u221d exp(\u2212c \u00d7 E(\u03c0, \u03b8)). In this form c controls the "peakiness" of the distribution, with large values resulting in highly peaked distributions; we used c \u2208 {20, 40, 60, 80, 100}.\nThe left half of Table 1 shows Hellinger distances for N = 8, 25, 50, averaged across the 50 queries.\u00b2 The table shows that all the samplers perform equally well when the number of items is small (N = 8). However, as the number of items increases, SM signi\ufb01cantly outperforms all the other samplers. Throughout the experiments we found that the CF and RP samplers often reached the allocated time limit and had to be forced to terminate early.
For N = 50 we were unable to get a single sample from the RP sampler after running it for over 5 minutes. This is likely due to the fact that at each matching step t = 1, ..., N the RP sampler has a non-zero probability of failing (rejecting). Consequently the total rejection probability increases linearly with the number of items N. Even for N = 25 we found the RP sampler to reject over 95% of the time. This further suggests that approaches with non-deterministic run times are not suitable for this problem because their worst-case performance can be extremely slow. Overall, the results indicate that SM can produce higher quality samples more rapidly, a crucial property for learning large-scale models.

5.2 Image Matching

For an image matching task we followed the framework of Petterson et al. [17]. Here, we used the Giraffe dataset [21], which is a video sequence of a walking giraffe. From this data we randomly selected 50 pairs of images that were at least 20 frames apart. Using the available set of 61 hand-labeled points we then randomly selected three sets of correspondence points for each image pair, containing 8, 25 and 50 points respectively, and extracted SIFT feature descriptors at each point. The target distribution over matchings was parametrized by the energy given by Equation 3, where the ψ's are the SIFT feature descriptors. We also experimented with different scale settings: c ∈ {0.2, 0.4, 0.6, 0.8, 1}. Figure 2 shows an example pair of images with 25 labeled points and the inferred MAP assignment.

The results for N = 8, 25, 50 are shown in the right half of Table 1. We see that when the distributions are relatively flat (c < 0.6) all samplers have comparable performance. However, as the distributions become sharper with several well defined modes (c ≥ 0.6), the SM sampler significantly outperforms all other samplers.
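A proposal that constructs a complete matching stage by stage, rather than through local swaps, can be sketched as follows. This is a minimal illustration under our own assumptions: `score_fn` is a hypothetical stand-in for the per-stage potentials conditioned on the partial assignment, and the code is not the authors' implementation.

```python
import numpy as np

def sequential_matching_proposal(score_fn, n, rng):
    """Build a full assignment one position at a time.  At stage t the
    next match is drawn from a softmax over the still-unmatched items,
    with scores that may depend on the partial assignment built so far
    (the generalized Plackett-Luce step).  Returns the permutation and
    its log proposal probability, as a Metropolis-Hastings correction
    would require."""
    remaining = list(range(n))
    assignment, log_q = [], 0.0
    for _ in range(n):
        scores = np.array([score_fn(assignment, j) for j in remaining])
        p = np.exp(scores - scores.max())  # softmax over unmatched items
        p /= p.sum()
        k = rng.choice(len(remaining), p=p)
        log_q += np.log(p[k])
        assignment.append(remaining.pop(k))
    return assignment, log_q
```

Because every draw produces an entirely new permutation, a single proposal can move between well separated modes without passing through low probability intermediate states.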
As mentioned above, when the distribution has well defined modes the path from one mode to another using only local swaps is likely to pass through low probability regions. This is the likely cause of the poor performance of the GB and CF samplers, as both propose new assignments through local moves. As in the learning to rank experiments, we found the rejection rate for the RP sampler to increase significantly for N ≥ 25. We were unable to obtain any samples in the allocated time (over 5 minutes) from the RP sampler for N = 50. Overall, the results further show that the SM method is able to generate higher quality samples faster than the other methods.

Figure 2: Example image pair with N = 25. Green lines show the inferred MAP assignment.

6 Conclusion

In this paper we introduced a new sampling approach for bipartite matching problems based on a generalization of the Plackett-Luce model. In this approach the matching probabilities at each stage are conditioned on the partial assignment made to that point. This global dependency allows us to define a rich class of proposal distributions that accurately approximate the target distribution. Empirically we found that our method is able to generate good quality samples faster and is less prone to getting stuck in local modes. Future work involves applying the sampler during inference while learning BMP models. We also plan to investigate the relationship between the proposal distribution produced by sequential matching and the target one.

²Trace and Hellinger distance plots (for both experiments) are in the supplementary material.

References

[1] A. Bouchard-Cote and M. I. Jordan. Variational inference over combinatorial spaces. In NIPS, 2010.
[2] C. Cadena, D. Galvez-Lopez, F. Ramos, J. D. Tardos, and J. Neira. Robust place recognition with stereo cameras. In IROS, 2010.
[3] T. S. Caetano, L. Cheng, Q. V. Le, and A. J. Smola.
Learning graph matching. In ICML, 2009.
[4] O. Chapelle, Y. Chang, and T.-Y. Liu. The Yahoo! Learning to Rank Challenge. 2010.
[5] F. Dellaert, S. M. Seitz, C. E. Thorpe, and S. Thrun. EM, MCMC, and chain flipping for structure from motion with unknown correspondence. Machine Learning, 50, 2003.
[6] J.-P. Doignon, A. Pekec, and M. Regenwetter. The repeated insertion model for rankings: Missing link between two subset choice models. Psychometrika, 69, 2004.
[7] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In WWW, 2001.
[8] B. Huang and T. Jebara. Loopy belief propagation for bipartite maximum weight b-matching. In AISTATS, 2007.
[9] J. Huang, C. Guestrin, and L. Guibas. Fourier theoretic probabilistic inference over permutations. Journal of Machine Learning Research, 10, 2009.
[10] M. Huber and J. Law. Fast approximation of the permanent for very dense problems. In SODA, 2008.
[11] M. Jerrum, A. Sinclair, and E. Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with non-negative entries. Journal of the ACM, 51, 2004.
[12] Q. V. Le and A. Smola. Direct optimization of ranking measures. arXiv:0704.3359, 2007.
[13] T. Lu and C. Boutilier. Learning Mallows models with pairwise preferences. In ICML, 2011.
[14] R. D. Luce. Individual choice behavior: A theoretical analysis. Wiley, 1959.
[15] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, University of Toronto, 1993.
[16] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: Algorithms and complexity. Prentice-Hall, 1982.
[17] J. Petterson, T. S. Caetano, J. J. McAuley, and J. Yu. Exponential family graph matching and ranking. In NIPS, 2009.
[18] R. Plackett. The analysis of permutations. Applied Statistics, 24, 1975.
[19] T. Qin, T.-Y. Liu, X.-D. Zhang, D.-S. Wang, and H. Li. Global ranking using continuous conditional random fields.
In NIPS, 2008.
[20] F. Ramos, D. Fox, and H. Durrant-Whyte. CRF-Matching: Conditional random fields for feature-based scan matching. In Robotics: Science and Systems, 2007.
[21] D. A. Ross, D. Tarlow, and R. S. Zemel. Learning articulated structure and motion. International Journal of Computer Vision, 88, 2010.
[22] R. Salakhutdinov. Learning deep Boltzmann machines using adaptive MCMC. In ICML, 2010.
[23] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: Optimizing non-smooth rank metrics. In WSDM, 2008.
[24] W. R. Taylor. Protein structure comparison using bipartite graph matching and its application to protein structure classification. Molecular & Cellular Proteomics, 2002.
[25] L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8, 1979.
[26] M. N. Volkovs and R. S. Zemel. BoltzRank: Learning to maximize expected ranking gain. In ICML, 2009.
[27] Y. Wang, F. Makedon, and J. Ford. A bipartite graph matching framework for finding correspondences between structural elements in two proteins. In IEBMS, 2004.