{"title": "Mapping Estimation for Discrete Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 4197, "page_last": 4205, "abstract": "We are interested in the computation of the transport map of an Optimal Transport problem. Most of the computational approaches of Optimal Transport use the Kantorovich relaxation of the problem to learn a probabilistic coupling $\\mgamma$ but do not address the problem of learning the underlying transport map $\\funcT$ linked to the original Monge problem. Consequently, it lowers the potential usage of such methods in contexts where out-of-samples computations are mandatory. In this paper we propose a new way to jointly learn the coupling and an approximation of the transport map. We use a jointly convex formulation which can be efficiently optimized. Additionally, jointly learning the coupling and the transport map allows to smooth the result of the Optimal Transport and generalize it to out-of-samples examples. Empirically, we show the interest and the relevance of our method in two tasks: domain adaptation and image editing.", "full_text": "Mapping Estimation for Discrete Optimal Transport\n\nMicha\u00a8el Perrot\n\nUniv Lyon, UJM-Saint-Etienne, CNRS,\nLab. Hubert Curien UMR 5516, F-42023\nmichael.perrot@univ-st-etienne.fr\n\nNicolas Courty\n\nUniversit\u00b4e de Bretagne Sud,\nIRISA, UMR 6074, CNRS,\n\ncourty@univ-ubs.fr\n\nR\u00b4emi Flamary\n\nUniversit\u00b4e C\u02c6ote d\u2019Azur,\n\nLagrange, UMR 7293 , CNRS, OCA\n\nremi.flamary@unice.fr\n\nAmaury Habrard\n\nUniv Lyon, UJM-Saint-Etienne, CNRS,\nLab. Hubert Curien UMR 5516, F-42023\namaury.habrard@univ-st-etienne.fr\n\nAbstract\n\nWe are interested in the computation of the transport map of an Optimal Transport\nproblem. 
Most computational approaches to Optimal Transport use the Kantorovich relaxation of the problem to learn a probabilistic coupling γ, but do not address the problem of learning the underlying transport map T linked to the original Monge problem. This lowers the potential usage of such methods in contexts where out-of-sample computations are mandatory. In this paper we propose a new way to jointly learn the coupling and an approximation of the transport map. We use a jointly convex formulation which can be efficiently optimized. Additionally, jointly learning the coupling and the transport map allows us to smooth the result of the Optimal Transport and to generalize it to out-of-sample examples. Empirically, we show the interest and the relevance of our method on two tasks: domain adaptation and image editing.

1 Introduction

In recent years Optimal Transport (OT) [1] has received a lot of attention in the machine learning community [2, 3, 4, 5]. This gain of interest comes from several nice properties of OT when used as a divergence to compare discrete distributions: (i) it provides a sound and theoretically grounded way of comparing multivariate probability distributions without the need for estimating parametric versions and (ii) by considering the geometry of the underlying space through a cost metric, it can encode useful information about the nature of the problem.

OT is usually expressed as an optimal cost functional but it also enjoys a dual variational formulation [1, Chapter 5]. It has been proven useful in several settings. As a first example, it corresponds to the Wasserstein distance in the space of probability distributions.
Using this distance it is possible to compute means and barycentres [6, 7] or to perform a PCA in the space of probability measures [8]. This distance has also been used in subspace identification problems for analysing the differences between distributions [9], in graph based semi-supervised learning to propagate histogram labels across nodes [4], or as a way to define a loss function for multi-label learning [5]. As a second example, OT enjoys a variety of bounds on the convergence rate of empirical to population measures, which can be used to derive new probabilistic bounds for the performance of unsupervised learning algorithms such as k-means [2]. As a last example, OT is a means of interpolation between distributions [10] that has been used in Bayesian inference [11], color transfer [12] or domain adaptation [13].

On the computational side, despite some results with finite difference schemes [14], one of the major gains is the recent development of regularized versions that lead to efficient algorithms [3, 7, 15]. Most OT formulations are based on the computation of a (probabilistic) coupling matrix that can be seen as a bi-partite graph between the bins of the distributions. This coupling, also called the transportation matrix, corresponds to an empirical transport map which suffers from a drawback: it can only be applied to the examples used to learn it. In other words, when a new dataset (or sample) is available, one has to solve a new OT problem to deal with the new instances, which can be prohibitive for some applications, in particular when the tasks are similar or related. From a machine learning standpoint, this also means that we do not know how to find a good approximation of a transport map computed from a small sample that can be generalized to unseen data.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
This is particularly critical when one considers medium or large scale applications such as image editing problems. In this paper, we propose to bridge this gap by learning an explicit transformation that can be interpreted as a good approximation of the transport map. As far as we know, this is the first approach that directly addresses this problem of out-of-sample mapping.

Our formulation is based on classic regularized regression and admits two appealing interpretations. On the one hand, it can be seen as learning a transformation regularized by a transport map. On the other hand, we can see it as the computation of the transport map regularized w.r.t. the definition of a transformation (e.g. linear, non-linear, ...). This results in an optimization problem that jointly learns both the transport map and the transformation. This formulation can be efficiently solved thanks to alternating block-coordinate descent and actually benefits both models: (i) we obtain smoother transport maps that must be compliant with a transformation usable on out-of-sample examples and (ii) the transformation is able to take into account some geometrical information captured by OT. See Figure 1 for an illustration. We provide some empirical evidence for the usefulness of our approach in domain adaptation and image editing. Beyond that, we think that this paper can open the door to new research on the generalization ability of OT.

The rest of the paper is organized as follows. Section 2 introduces some notations and preliminaries in optimal transport. We present our approach in Section 3. Our experimental evaluation is given in Section 4 and we conclude in Section 5.

2 Background

Monge problem  Let Ω_S ⊆ R^{d_s} and Ω_T ⊆ R^{d_t} be two separable metric spaces such that any probability measure on Ω_S, respectively Ω_T, is a Radon measure.
By considering a cost function c : Ω_S × Ω_T → [0, ∞[, Monge's formulation of the OT problem is to find a transport map T : Ω_S → Ω_T (also known as a push-forward operator) between two probability measures μ_S on Ω_S and μ_T on Ω_T realizing the infimum of the following functional:

    inf_T { ∫_{Ω_S} c(x, T(x)) dμ_S(x) : T#μ_S = μ_T }.        (1)

When reaching this infimum, the corresponding map T is an optimal transport map. It associates one point from Ω_S to a single point in Ω_T. Therefore, the existence of this map is not always guaranteed, as for example when μ_S is a Dirac and μ_T is not. As such, the existence of solutions for this problem can in general not be established when μ_S and μ_T are supported on a different number of Diracs. Yet, in a machine learning context, data samples usually form discrete distributions, but can be seen as observations of a regular, continuous (with respect to the Lebesgue measure) underlying distribution, thus fulfilling existence conditions (see [1, Chapter 9]). As such, assuming the existence of T calls for a relaxation of the previous problem.

Kantorovich relaxation  The Kantorovich formulation of OT [16] is a convex relaxation of the Monge problem. Let us define Π as the set of all probabilistic couplings in P(Ω_S × Ω_T), the space of all joint distributions with marginals μ_S and μ_T. The Kantorovich problem seeks a general coupling γ ∈ Π between Ω_S and Ω_T:

    γ_0 = argmin_{γ ∈ Π} ∫_{Ω_S × Ω_T} c(x^s, x^t) dγ(x^s, x^t).        (2)

The optimal coupling always exists [1, Theorem 4.1]. This leads to a simple formulation of the OT problem in the discrete case, i.e. whenever μ_S and μ_T are only accessible through discrete samples X_s = {x_i^s}_{i=1}^{n_s} and X_t = {x_i^t}_{i=1}^{n_t}.
The corresponding empirical distributions can be written as μ̂_S = Σ_{i=1}^{n_s} p_i^s δ_{x_i^s} and μ̂_T = Σ_{i=1}^{n_t} p_i^t δ_{x_i^t}, where δ_x is the Dirac function at location x ∈ Ω. The masses p_i^s and p_i^t are the probabilities associated to the i-th samples and belong to the probability simplex, i.e. Σ_{i=1}^{n_s} p_i^s = Σ_{i=1}^{n_t} p_i^t = 1. Let Π̂ be the set of probabilistic couplings between the two empirical distributions, defined as Π̂ = {γ ∈ (R_+)^{n_s×n_t} | γ 1_{n_t} = μ̂_S, γ^⊤ 1_{n_s} = μ̂_T}, where 1_n is an n-dimensional vector of ones. Problem (2) becomes:

    γ_0 = argmin_{γ ∈ Π̂} ⟨γ, C⟩_F,        (3)

where ⟨·, ·⟩_F is the Frobenius dot product, i.e. ⟨A, B⟩_F = Tr(A^⊤ B), and C ≥ 0 is the cost matrix related to the function c.

Figure 1: Illustration of the mappings estimated on the clown dataset with a linear (top) and nonlinear (bottom) mapping (best viewed in color).

Barycentric mapping  Once the probabilistic coupling γ_0 has been computed, one needs to map the examples from Ω_S to Ω_T. This mapping can be conveniently expressed with respect to the set of examples X_t as the following barycentric mapping [11, 13, 12]:

    x̂_i^s = argmin_{x ∈ Ω_T} Σ_{j=1}^{n_t} γ_0(i, j) c(x, x_j^t),        (4)

where x_i^s is a given source sample and x̂_i^s is its corresponding image. When the cost function is the squared ℓ2 distance, i.e. c(x, x') = ||x − x'||_2^2, this barycentre corresponds to a weighted average and the sample is mapped into the convex hull of the target examples.
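Since problem (3) is a finite linear program and, under the squared ℓ2 cost, the barycentres (4) reduce to weighted averages of the target samples, both steps can be illustrated with a generic LP solver. The following is a minimal sketch, not the paper's code; the function names are ours, and uniform or arbitrary masses are passed in explicitly:

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_coupling(Xs, Xt, ps, pt):
    """Solve the discrete Kantorovich problem (3) as a linear program."""
    ns, nt = len(Xs), len(Xt)
    # Squared Euclidean cost matrix C_ij = ||x_i^s - x_j^t||^2.
    C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    # Equality constraints encode the two marginals of the coupling.
    A_eq = np.zeros((ns + nt, ns * nt))
    for i in range(ns):
        A_eq[i, i * nt:(i + 1) * nt] = 1.0   # row sums:    gamma 1 = ps
    for j in range(nt):
        A_eq[ns + j, j::nt] = 1.0            # column sums: gamma^T 1 = pt
    b_eq = np.concatenate([ps, pt])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(ns, nt)

def barycentric_map(gamma, Xt):
    """Barycentres (4) under the squared l2 cost: diag(gamma 1)^{-1} gamma Xt."""
    return (gamma @ Xt) / gamma.sum(axis=1, keepdims=True)
```

With uniform masses p_i^s = 1/n_s, the rows of `barycentric_map(gamma, Xt)` are exactly the barycentres x̂_i^s of (4).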
For all source samples, this barycentric mapping can therefore be expressed as:

    X̂_s = B_{γ_0}(X_s) = diag(γ_0 1_{n_t})^{-1} γ_0 X_t.        (5)

In the rest of the paper we will focus on a uniform sampling, i.e. the examples are drawn i.i.d. from μ_S and μ_T, whence X̂_s = n_s γ_0 X_t. The main drawback of the mapping (5) is that it does not allow the projection of out-of-sample examples which have not been seen during the learning process of γ_0. It means that to transport a new example x^s drawn from μ_S, one has to compute the coupling matrix γ_0 again using this new example. Also, while some authors consider specific regularizations of γ [3, 13] to control the nature of the coupling, inducing specific properties of the transformation T (i.e. regularity, divergence-freeness, etc.) is hard to achieve.

In the next section we present a relaxation of the OT problem, which consists in jointly learning γ and T. We derive the corresponding optimization problem, and show its usefulness in specific scenarios.

3 Contributions

3.1 Joint learning of T and γ

In this paper we propose to solve the problem of optimal transport by jointly learning the matrix γ and the transformation function T. First of all, we denote by H the space of transformations from Ω_S to Ω_T and, using a slight abuse of notation, X_s and X_t are matrices where each line is an example respectively drawn from Ω_S and Ω_T.
We propose the following optimisation problem:

    argmin_{T ∈ H, γ ∈ Π̂}  f(γ, T) = 1/(n_s d_t) ||T(X_s) − n_s γ X_t||_F^2 + λ_γ/max(C) ⟨γ, C⟩_F + λ_T/(d_s d_t) R(T),        (6)

where T(X_s) is a short-hand for the application of T on each example in X_s, R(·) is a regularization term on T, and λ_γ, λ_T are hyper-parameters controlling the trade-off between the three terms of the optimization problem. The first term in (6) depends on both T and γ and controls the closeness between the transformation induced by T and the barycentric interpolation obtained from γ. The second term only depends on γ and corresponds to the standard optimal transport loss. The third term regularizes T to ensure a better generalization.

A standard approach to solve problem (6) is block-coordinate descent (BCD) [17], where the idea is to alternately optimize for T and γ. In the next theorem we show that under some mild assumptions on the regularization term R(·) and the function space H this problem is jointly convex. Note that in this case we are guaranteed to converge to the optimal solution only if the problem is strictly convex w.r.t. T and γ. While this is not the case for γ, the algorithm works well in practice and a small regularization term can be added if theoretical convergence is required. The proof of the following theorem can be found in the supplementary.

Theorem 1. Let H be a convex space and R(·) be a convex function. Problem (6) is jointly convex in T and γ.

As discussed above we propose to solve optimization problem (6) using a block-coordinate descent approach. As such we need to find an efficient way to solve: (i) for γ when T is fixed and (ii) for T when γ is fixed. To solve the problem w.r.t.
\u03b3 with a \ufb01xed T , a common approach is to use the\nFrank-Wolfe algorithm [12, 18]. It is a procedure for solving any convex constrained optimization\nproblem with a convex and continuously differentiable objective function over a compact convex\nsubset of any vector space. This algorithm can \ufb01nd an \u0001 approximation of the optimal solution in\nO(1/\u0001) iterations [19]. A detailed algorithm is given in the supplementary material. In the next\nsection we discuss the solution of the minimization w.r.t. T with \ufb01xed \u03b3 for different functional\nspaces.\n3.2 Choosing H\nIn the previous subsection we presented our method when considering a general set of transformations\nH. In this section we propose several possibilities for the choice of a convex set H. On the one hand,\nwe propose to de\ufb01ne H as a set of linear transformations from \u2126S to \u2126T . On the other hand, using\nthe kernel trick, we propose to consider non-linear transformations. A summary of the approach can\nbe found in Algorithm 1.\nLinear transformations A \ufb01rst way to de\ufb01ne H is to consider linear transformations induced by a\nds \u00d7 dt real matrix L:\n\nH =\n\nT : \u2203 L \u2208 Rds\u00d7dt\n\n,\u2200xs \u2208 \u2126S , T (xs) = xsT L\n\n(7)\nFurthermore, we de\ufb01ne R(T ) = (cid:107)L \u2212 I(cid:107)2F where I is the identity matrix. We choose to bias L\ntoward I in order to ensure that the examples are not moved too far away from their initial position.\nIn this case we can rewrite optimization problem (6) as:\n\n.\n\n(cid:110)\n\n(cid:111)\n\narg min\n\nL\u2208Rds\u00d7dt ,\u03b3\u2208 \u02c6\u03a0\n\n1\n\nnsdt\n\n(cid:107)XsL \u2212 ns\u03b3Xt(cid:107)2F +\n\n\u03bb\u03b3\n\nmax(C)\n\n(cid:104)\u03b3, C(cid:105)F +\n\n\u03bbT\ndsdt\n\n(8)\n\n(cid:107)L \u2212 I(cid:107)2F .\n(cid:19)\n\n(cid:18) 1\n\nAccording to Algorithm 1 a part of our procedure requires to solve optimization problem (8) when \u03b3\nis \ufb01xed. 
One solution is to use the following closed form for L:

    L = ( 1/(n_s d_t) X_s^⊤ X_s + λ_T/(d_s d_t) I )^{-1} ( 1/(n_s d_t) X_s^⊤ n_s γ X_t + λ_T/(d_s d_t) I ),        (9)

where (·)^{-1} is the matrix inverse (Moore-Penrose pseudo-inverse when the matrix is singular). In the previous definition of H, we considered non-biased linear transformations. However, it is sometimes desirable to add a bias to the transformation. The equations being very similar in spirit to the non-biased case, we refer the interested reader to the supplementary material.

Algorithm 1: Joint Learning of L and γ.
input  : X_s, X_t source and target examples and λ_γ, λ_T hyper-parameters.
output : L, γ.
1 begin
2    Initialize k = 0, γ^0 ∈ Π̂ and L^0 = I
3    repeat
4        Learn γ^{k+1} solving problem (6) with fixed L^k using a Frank-Wolfe approach.
5        Learn L^{k+1} using Equation (9), (12) or their biased counterparts with fixed γ^{k+1}.
6        Set k = k + 1.
7    until convergence

Non-linear transformations  In some cases a linear transformation is not sufficient to approximate the transport map. Hence, we propose to consider non-linear transformations. Let φ be a non-linear function associated to a kernel function k : Ω_S × Ω_S → R such that k(x^s, x^{s'}) = ⟨φ(x^s), φ(x^{s'})⟩_H. We can define H for a given set of examples X_s as:

    H = { T : ∃ L ∈ R^{n_s×d_t}, ∀ x^s ∈ Ω_S, T(x^s) = k_{X_s}(x^{s⊤}) L },        (10)

where k_{X_s}(x^{s⊤}) is a short-hand for the row vector (k(x^s, x_1^s), ..., k(x^s, x_{n_s}^s)) with x_1^s, ..., x_{n_s}^s ∈ X_s.
In this case optimization problem (6) becomes:

    argmin_{L ∈ R^{n_s×d_t}, γ ∈ Π̂}  1/(n_s d_t) ||k_{X_s}(X_s) L − n_s γ X_t||_F^2 + λ_γ/max(C) ⟨γ, C⟩_F + λ_T/(n_s d_t) ||k_{X_s}(·) L||_F^2,        (11)

where k_{X_s}(·) is a short-hand for the vector (k(·, x_1^s), ..., k(·, x_{n_s}^s)) = (φ(x_1^s), ..., φ(x_{n_s}^s)). As in the linear case there is a closed form solution for L when γ is fixed:

    L = ( 1/(n_s d_t) k_{X_s}(X_s) + λ_T/(n_s d_t) I )^{-1} 1/(n_s d_t) n_s γ X_t.        (12)

As in the linear case, it might be interesting to use a bias (presented in the supplementary material).

3.3 Discussion on the quality of the transport map approximation

In this section we discuss some theoretical considerations about our framework, and more precisely the quality of the learned transformation T. To assess this quality we consider the Frobenius norm between T and the true transport map, denoted T*, that we would obtain if we could solve Monge's problem. Let B_γ̂ be the empirical barycentric mapping of X_s using the probabilistic coupling γ̂ learned between X_s and X_t. Similarly, let B_{γ_0} be the theoretical barycentric mapping associated with the probabilistic coupling γ_0 learned on the whole distributions μ_S, μ_T, which corresponds to the solution of Kantorovich's problem. Using a slight abuse of notation we denote by B_γ̂(x^s) and B_{γ_0}(x^s) the projection of x^s ∈ X_s by these barycentric mappings.
Using the triangle inequality, some standard properties of the square function, the definition of H and [20, Theorem 2], we have with high probability that (see the supplementary material for a justification):

    E_{x^s∼μ_S} ||T(x^s) − T*(x^s)||_F^2 ≤ 4 Σ_{x^s∈X_s} ||T(x^s) − B_γ̂(x^s)||_F^2 + O(1/√n_s)
                                          + 4 Σ_{x^s∈X_s} ||B_γ̂(x^s) − B_{γ_0}(x^s)||_F^2 + 2 E_{x^s∼μ_S} ||B_{γ_0}(x^s) − T*(x^s)||_F^2.        (13)

From Inequality (13) we can assess the quality of the learned transformation T w.r.t. three key quantities. The first quantity, Σ_{x^s∈X_s} ||T(x^s) − B_γ̂(x^s)||_F^2, measures the difference between the learned transformation and the empirical barycentric mapping; we minimize it in Problem (6). The second and third quantities are theoretical and hard to bound because, as far as we know, there is a lack of theoretical results related to these terms in the literature. Nevertheless, we expect Σ_{x^s∈X_s} ||B_γ̂(x^s) − B_{γ_0}(x^s)||_F^2 to decrease uniformly with respect to the number of examples, as it measures how well the empirical barycentric mapping estimates the theoretical one. Similarly, we expect E_{x^s∼μ_S} ||B_{γ_0}(x^s) − T*(x^s)||_F^2 to be small, as it characterizes how well the theoretical barycentric mapping approximates the true transport map. This of course depends on the expressiveness of the set H considered. We think that this discussion opens up new theoretical perspectives for OT in Machine Learning, but these are beyond the scope of this paper.

Table 1: Accuracy on the Moons dataset.
Color-code: the darker the result, the better.

Columns: Angle, 1NN, GFK, SA, OT, L1L2, OTE, and T/γ pairs for OTLin, OTLinB, OTKer, OTKerB; rows: rotation angles from 10 to 90 degrees. [The individual accuracy values of this table could not be recovered from the garbled extraction.]

4 Experiments

4.1 Domain Adaptation

Datasets  We consider two domain adaptation (DA) datasets, namely Moons [21] and Office-Caltech [22]. The Moons dataset is a binary classification task where the source domain corresponds to two intertwined moons, each one representing a class. The target domain is built by rotating the source domain by an angle ranging from 10 to 90 degrees. This leads to 9 different adaptation tasks of increasing difficulty. The examples are two-dimensional and we consider 300 source and target examples for training and 1000 target examples for testing. The Office-Caltech dataset is a 10-class image classification task with 4 domains corresponding to images coming from different sources: amazon (A), dslr (D), webcam (W) and Caltech10 (C). There are 12 adaptation tasks where each domain is in turn considered as the source or the target (denoted source → target). To represent the images we use the deep learning features of size 4096 named decaf6 [23].
During the training process we consider all the examples from the source domain and half of the examples from the target domain, the other half being used as the test set.

Methods  We consider 6 baselines. The first one is a simple 1-Nearest-Neighbour (1NN) classifier using the original source examples only. The second and third ones are two widely used DA approaches, namely Geodesic Flow Kernel (GFK) [22] and Subspace Alignment (SA) [24]. The fourth to sixth baselines are OT based approaches: the classic OT method (OT), OT with entropy based regularization (OTE) [3] and OT with ℓ1ℓ2 regularization (L1L2) [13]. We present the results of our approach with the linear (OTLin) and kernel (OTKer) versions of T and their biased counterparts (*B). For OT based methods the idea is to (i) compute the transport map between the source and the target, (ii) project the source examples and (iii) classify the target examples using a 1NN on the projected source.

Experimental Setup  We consider the following experimental setup for all the methods and datasets. All the results presented in this section are averaged over 10 trials. For each trial we consider three sets of examples: a labelled source training set denoted X_s, y_s, an unlabelled target training set denoted X_t^train, and a labelled target testing set X_t^test. The model is learned on X_s, y_s and X_t^train and evaluated on X_t^test with a 1NN learned on X_s, y_s. All the hyper-parameters are tuned according to a grid search on the source and target training instances using a circular validation procedure derived from [21, 25] and described in the supplementary material. For GFK and SA we choose the dimension of the subspace d ∈ {3, 6, ..., 30}, for L1L2 and OTE we set the parameter for entropy regularization in {10^{-6}, 10^{-5}, ..., 10^5}, for L1L2 we choose the class-related parameter η ∈ {10^{-5}, 10^{-4}, ...
, 10^2}, and for all our methods we choose λ_T, λ_γ ∈ {10^{-3}, 10^{-2}, ..., 10^0}.

The results on the Moons and Office-Caltech datasets are respectively given in Tables 1 and 2. A first important remark is that the coupling γ and the transformation T almost always obtain the same results. It shows that our method is able to learn a good approximation T of the transport map induced by γ. In terms of accuracy our approach tends to give the best results, which shows that we are effectively able to bring the distributions closer in a relevant way. For the Moons dataset, the last 6 approaches (including ours) based on OT obtain similar results until 40 degrees, while the other methods already fail to obtain good results at 20 degrees. Beyond 50 degrees, our approaches give significantly better results than the others. Furthermore, they are more stable when the difficulty of the problem increases, which

Table 2: Accuracy on the Office-Caltech dataset.
Color-code: the darker the result, the better.

Task    1NN   GFK   SA    OT    L1L2  OTE   OTLin(γ/T)  OTLinB(γ/T)  OTKer(γ/T)  OTKerB(γ/T)
D→W    89.5  93.3  95.6  77.0  95.7  95.7  97.3/97.3   97.3/97.3    98.4/98.5   98.5/98.5
D→A    62.5  77.2  88.5  70.8  74.9  74.8  85.7/85.7   85.8/85.8    89.9/89.9   89.5/89.5
D→C    51.8  69.7  79.0  68.1  67.8  68.0  77.2/77.2   77.4/77.4    69.1/69.2   69.3/69.3
W→D    99.2  99.8  99.6  74.1  94.4  94.4  99.4/99.4   99.8/99.8    97.2/97.2   96.9/96.9
W→A    62.5  72.4  79.2  67.6  71.3  71.3  81.5/81.5   81.4/81.4    78.5/78.3   78.5/78.8
W→C    59.5  63.7  55.0  63.1  67.8  67.8  75.9/75.9   75.4/75.4    72.7/72.7   65.1/63.3
A→D    65.2  75.9  83.8  64.6  70.1  70.5  80.6/80.6   80.4/80.5    65.6/65.5   71.9/71.5
A→W    56.8  68.0  74.6  66.8  67.2  67.3  74.6/74.6   74.4/74.4    66.4/64.8   70.0/68.9
A→C    70.1  75.7  79.2  70.4  74.1  74.3  81.8/81.8   81.6/81.6    84.4/84.4   84.5/84.5
C→D    75.9  79.5  85.0  66.0  69.8  70.2  87.1/87.1   87.2/87.2    70.1/70.0   78.6/78.6
C→W    65.2  70.7  74.4  59.2  63.8  63.8  78.3/78.3   78.5/78.5    80.0/80.4   73.5/73.4
C→A    85.8  87.1  89.3  75.2  76.6  76.7  89.9/89.9   89.7/89.7    82.4/82.2   83.6/83.5
Mean   70.3  77.8  81.9  68.6  74.5  74.6  84.1/84.1   84.1/84.1    79.6/79.4   80.0/79.7

can be interpreted as a benefit from our regularization. In the supplementary material we propose an illustration of the transformation learned by our approach. For Office-Caltech, our methods are significantly better than the other approaches, which illustrates the potential of our method for difficult tasks. To conclude, forcing OT to simultaneously learn the coupling and the transformation seems beneficial.

4.2 Seamless copy in images with gradient adaptation

We propose here a direct application of our mapping estimation in the context of image editing. While several papers using OT have focused on color adaptation [12, 26], we explore here a new variant in the domain of image editing: seamless editing or cloning in images.
In this context, one may desire to import a region from a given source image into a target image. As a direct copy of the region leads to inaccurate results in the final image near the boundaries of the copied selection, a very popular method, proposed by Pérez and co-workers [27], allows one to seamlessly blend the target image and the selection. This technique, coined Poisson Image Editing, operates in the gradient domain of the image. Hence, the gradients of the selection operate as a guidance field for an image reconstruction based on membrane interpolation with appropriate boundary conditions extracted from the target image (see the supplementary material for more details).

Though appealing, this technique is prone to errors due to local contrast changes or false colors resulting from the integration. While some solutions combining both gradient and color domains exist [28], this editing technique usually requires the source and target images to have similar colors and contrast. Here, we propose to enhance the genericity of this technique by forcing the gradient distribution of the source image to follow the gradient distribution of the target image. As a result, the seamless cloning not only blends the copied region smoothly into the target domain, but also constrains the color dynamics to those of the target image. Hence, a part of the style of the target image is preserved. We start by learning a transfer function T_{s→t} : R^6 → R^6 with our method, where 6 corresponds to the vertical and horizontal components of the gradient per color channel, and we then directly solve the same system as [27]. When dealing with images, the numbers of source and target gradients largely exceed tens of thousands, and it is mandatory to consider methods that scale appropriately. As such, our technique can readily learn the transfer function T_{s→t} over a limited set of gradients and generalizes appropriately to unseen gradients.
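The gradient-domain pipeline can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `gradient_features` builds the 6-dimensional representation (vertical and horizontal gradients per color channel), and `moment_matching_map` is a deliberately simple stand-in for the learned map T_{s→t}, which in the paper is produced by the joint estimation of Section 3:

```python
import numpy as np

def gradient_features(img):
    """Stack vertical and horizontal gradients of each color channel of an
    (H, W, 3) image into the (H*W, 6) representation used for adaptation."""
    gy, gx = np.gradient(img, axis=(0, 1))          # each of shape (H, W, 3)
    return np.concatenate([gy.reshape(-1, 3), gx.reshape(-1, 3)], axis=1)

def moment_matching_map(Gs, Gt):
    """Illustrative stand-in for T_{s->t}: align the per-dimension mean and
    standard deviation of the source gradients with those of the target."""
    mu_s, sd_s = Gs.mean(0), Gs.std(0) + 1e-12      # guard against zero std
    mu_t, sd_t = Gt.mean(0), Gt.std(0)
    return lambda G: (G - mu_s) / sd_s * sd_t + mu_t
```

The adapted source gradients would then replace the raw selection gradients as the guidance field of the Poisson reconstruction of [27], restricted to the selection mask Ω; as in the paper, the map can be fitted on a small random subsample of gradients (e.g. 500) and applied to all of them.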
Three illustrations of this method are proposed in the context of face swapping in Figure 2. As one can observe, the original method of Poisson image editing [27] (3rd column) tends to preserve the color dynamics of the original image and fails to copy the style of the target image. Our method was tested with a linear and a kernel version of T_{s→t}, learned with only 500 gradients sampled randomly from both sources (λ_T = 10^{-2} and λ_T = 10^3 for the linear and kernel versions respectively, and λ_γ = 10^{-7} in both cases). As a general qualitative comment, one can observe that the kernel version of T_{s→t} is better at preserving the dynamics of the gradient, while the linear version tends to flatten the colors. In this low-dimensional space, this illustrates the need for a non-linear transformation. Regarding the computational time, the gradient adaptation is of the same order of magnitude as the Poisson equation solving, and each example is computed in less than 30s on a standard personal laptop. In the supplementary material we give other examples of the method.

Figure 2: Illustrations of seamless copies with gradient adaptation. Each row is composed of the source image, the corresponding selection zone Ω described as a binary mask, and the target image. We compare the two linear (4th column) and kernel (5th column) versions of the map T_{s→t} with the original method of [27] (2nd column) (best viewed in color).

5 Conclusion

In this paper we proposed a jointly convex approach to learn both the coupling γ and a transformation T approximating the transport map given by γ. It allows us to apply a learned transport to a set of out-of-sample examples not seen during the learning process. Furthermore, jointly learning the coupling and the transformation allows us to regularize the transport by enforcing a certain smoothness on the transport map.
We also proposed several possibilities for choosing H, the set of admissible transformations, and presented theoretical considerations on the generalization ability of the learned transformation T: under the assumption that the barycentric mapping generalizes well and is a good estimate of the true transformation, the map T learned with our method should be a good approximation of the true transformation. We have shown that our approach is efficient in practice on two different tasks: domain adaptation and image editing.
The framework presented in this paper opens the door to several perspectives. First, from a theoretical standpoint, the proposed bound raises questions about the generalization ability of the barycentric mapping and about estimating the quality of the true barycentric mapping with respect to the target transformation. On a more practical side, note that in recent years regularized OT has attracted growing interest and several methods have been proposed to control the behaviour of the transport; as long as these regularization terms are convex, one could imagine using them in our framework. Another perspective is to use our framework in a mini-batch setting where, instead of learning from the whole dataset, a single function T is estimated from several couplings γ optimized on different splits of the examples. Finally, we believe that our framework could allow the use of OT in deep architectures since, contrary to the coupling γ, the function T can be applied to out-of-sample examples.

Acknowledgments

This work was supported in part by the French ANR project LIVES ANR-15-CE23-0026-03.

References

[1] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2009.
[2] G. Canas and L. Rosasco. Learning probability measures with respect to optimal transport metrics. In NIPS, 2012.
[3] M. Cuturi.
Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
[4] J. Solomon, R. Rustamov, G. Leonidas, and A. Butscher. Wasserstein propagation for semi-supervised learning. In ICML, 2014.
[5] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. Poggio. Learning with a Wasserstein loss. In NIPS, 2015.
[6] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In ICML, 2014.
[7] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SISC, 2015.
[8] V. Seguy and M. Cuturi. Principal geodesic analysis for probability measures under the optimal transport metric. In NIPS, 2015.
[9] J. Mueller and T. Jaakkola. Principal differences analysis: Interpretable characterization of differences between distributions. In NIPS, 2015.
[10] R. McCann. A convexity principle for interacting gases. Advances in Mathematics, 128(1), 1997.
[11] S. Reich. A nonparametric ensemble transform method for Bayesian inference. SISC, 2013.
[12] S. Ferradans, N. Papadakis, G. Peyré, and J.-F. Aujol. Regularized discrete optimal transport. SIIMS, 2014.
[13] N. Courty, R. Flamary, and D. Tuia. Domain adaptation with regularized optimal transport. In ECML PKDD, 2014.
[14] J.-D. Benamou, B. D. Froese, and A. M. Oberman. Numerical solution of the optimal transportation problem using the Monge–Ampère equation. Journal of Computational Physics, 260, 2014.
[15] M. Cuturi and G. Peyré. A smoothed dual approach for variational Wasserstein problems. SIIMS, 2016.
[16] L. Kantorovich. On the translocation of masses. C.R. (Doklady) Acad. Sci. URSS (N.S.), 37, 1942.
[17] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3), 2001.
[18] M. Frank and P. Wolfe. An algorithm for quadratic programming.
NRL, 3(1-2), 1956.
[19] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
[20] M. Perrot and A. Habrard. Regressive virtual metric learning. In NIPS, 2015.
[21] L. Bruzzone and M. Marconcini. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE PAMI, 32(5), 2010.
[22] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[23] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[24] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.
[25] E. Zhong, W. Fan, Q. Yang, O. Verscheure, and J. Ren. Cross validation framework to choose amongst models and datasets for transfer learning. In ECML PKDD, 2010.
[26] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances. ACM Trans. on Graphics, 34(4), 2015.
[27] P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. ACM Trans. on Graphics, 22(3), 2003.
[28] F. Deng, S. J. Kim, Y.-W. Tai, and M. Brown. Color-aware regularization for gradient domain image manipulation. In ACCV. Springer, 2012.