{"title": "Stochastic Optimization for Large-scale Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 3440, "page_last": 3448, "abstract": "Optimal transport (OT) defines a powerful framework to compare probability distributions in a geometrically faithful way. However, the practical impact of OT is still limited because of its computational burden. We propose a new class of stochastic optimization algorithms to cope with large-scale problems routinely encountered in machine learning applications. These methods are able to manipulate arbitrary distributions (either discrete or continuous) by simply requiring the ability to draw samples from them, which is the typical setup in high-dimensional learning problems. This alleviates the need to discretize these densities, while giving access to provably convergent methods that output the correct distance without discretization error. These algorithms rely on two main ideas: (a) the dual OT problem can be re-cast as the maximization of an expectation; (b) entropic regularization of the primal OT problem results in a smooth dual optimization which can be addressed with algorithms that have a provably faster convergence. We instantiate these ideas in three different computational setups: (i) when comparing a discrete distribution to another, we show that incremental stochastic optimization schemes can beat the current state-of-the-art finite-dimensional OT solver (Sinkhorn's algorithm); (ii) when comparing a discrete distribution to a continuous density, a re-formulation (semi-discrete) of the dual program is amenable to averaged stochastic gradient descent, leading to better performance than approximately solving the problem by discretization; (iii) when dealing with two continuous densities, we propose a stochastic gradient descent over a reproducing kernel Hilbert space (RKHS).
This is currently the only known method to solve this problem, and is more efficient than discretizing the two densities beforehand. We back up these claims on a set of discrete, semi-discrete and continuous benchmark problems.", "full_text": "Stochastic Optimization for Large-scale Optimal Transport\n\nAude Genevay\nCEREMADE, Université Paris-Dauphine\nINRIA – Mokaplan project-team\ngenevay@ceremade.dauphine.fr\n\nMarco Cuturi\nCREST, ENSAE\nUniversité Paris-Saclay\nmarco.cuturi@ensae.fr\n\nGabriel Peyré\nCNRS and DMA, École Normale Supérieure\nINRIA – Mokaplan project-team\ngabriel.peyre@ens.fr\n\nFrancis Bach\nINRIA – Sierra project-team\nDI, ENS\nfrancis.bach@inria.fr\n\nAbstract\n\nOptimal transport (OT) defines a powerful framework to compare probability distributions in a geometrically faithful way. However, the practical impact of OT is still limited because of its computational burden. We propose a new class of stochastic optimization algorithms to cope with large-scale OT problems. These methods can handle arbitrary distributions (either discrete or continuous) as long as one is able to draw samples from them, which is the typical setup in high-dimensional learning problems. This alleviates the need to discretize these densities, while giving access to provably convergent methods that output the correct distance without discretization error.
These algorithms rely on two main ideas: (a) the dual OT problem can be re-cast as the maximization of an expectation; (b) the entropic regularization of the primal OT problem yields a smooth dual optimization which can be addressed with algorithms that have a provably faster convergence. We instantiate these ideas in three different setups: (i) when comparing a discrete distribution to another, we show that incremental stochastic optimization schemes can beat Sinkhorn's algorithm, the current state-of-the-art finite-dimensional OT solver; (ii) when comparing a discrete distribution to a continuous density, a semi-discrete reformulation of the dual program is amenable to averaged stochastic gradient descent, leading to better performance than approximately solving the problem by discretization; (iii) when dealing with two continuous densities, we propose a stochastic gradient descent over a reproducing kernel Hilbert space (RKHS). This is currently the only known method to solve this problem, apart from computing OT on finite samples. We back up these claims on a set of discrete, semi-discrete and continuous benchmark problems.\n\n1 Introduction\n\nMany problems in computational sciences require the comparison of probability measures or histograms. As a set of representative examples, let us quote: bag-of-visual-words comparison in computer vision [17], color and shape processing in computer graphics [21], bag-of-words for natural language processing [11] and multi-label classification [9]. In all of these problems, a geometry between the features (words, visual words, labels) is usually known, and can be leveraged to compare probability distributions in a geometrically faithful way. This underlying geometry might be for instance the planar Euclidean domain for 2-D shapes, a perceptual 3D color metric space for image processing or a high-dimensional semantic embedding for words.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nOptimal transport (OT) [24] is the canonical way to automatically lift this geometry to define a metric for probability distributions. That metric is known as the Wasserstein or earth mover's distance. As an illustrative example, OT can use a metric between words to build a metric between documents that are represented as frequency histograms of words (see [11] for details). All the above-cited lines of work advocate, among others, that OT is the natural choice to solve these problems, and that it leads to performance improvements when compared to geometrically-oblivious distances such as the Euclidean or χ² distances or the Kullback-Leibler divergence. However, these advantages come at the price of an enormous computational overhead. This is especially true because current OT solvers require these distributions to be sampled beforehand on a pre-defined set of points, or on a grid. This is both inefficient (in terms of storage and speed) and counter-intuitive. Indeed, most high-dimensional computational scenarios naturally represent distributions as objects from which one can sample, not as density functions to be discretized. Our goal is to alleviate these shortcomings. We propose a class of provably convergent stochastic optimization schemes that can handle both discrete and continuous distributions through sampling.\n\nPrevious works. The prevalent way to compute OT distances is by solving the so-called Kantorovitch problem [10] (see Section 2 for a short primer on the basics of OT formulations), which boils down to a large-scale linear program when dealing with discrete distributions (i.e., finite weighted sums of Dirac masses).
This linear program can be solved using network flow solvers, which can be further refined to assignment problems when comparing measures of the same size with uniform weights [3]. Recently, regularized approaches that solve OT with an entropic penalization [6] have been shown to be extremely efficient to approximate OT solutions at a very low computational cost. These regularized approaches have supported recent applications of OT to computer graphics [21] and machine learning [9]. These methods apply the celebrated Sinkhorn algorithm [20], and can be extended to solve more exotic transportation-related problems such as the computation of barycenters [21]. Their chief computational advantage over competing solvers is that each iteration boils down to matrix-vector multiplications, which can be easily parallelized, stream extremely well on GPUs, and enjoy a linear-time implementation on regular grids or triangulated domains [21].\n\nThese methods are however purely discrete and cannot cope with continuous densities. The only known class of methods that can overcome this limitation are the so-called semi-discrete solvers [1], which can be implemented efficiently using computational geometry primitives [12]. They can compute the distance between a discrete distribution and a continuous density. Nonetheless, they are restricted to the squared Euclidean cost, and can only be implemented in low dimensions (2-D and 3-D). Solving these semi-discrete problems efficiently could have a significant impact for applications to density fitting with an OT loss [2] for machine learning applications, see [13]. Lastly, let us point out that there is currently no method that can compute OT distances between two continuous densities, which is thus an open problem we tackle in this article.\n\nContributions.
This paper introduces stochastic optimization methods to compute large-scale optimal transport in all three possible settings: discrete OT, to compare a discrete vs. another discrete measure; semi-discrete OT, to compare a discrete vs. a continuous measure; and continuous OT, to compare a continuous vs. another continuous measure. These methods can be used to solve classical OT problems, but they enjoy faster convergence properties when considering their entropic-regularized versions. We show that the discrete regularized OT problem can be tackled using incremental algorithms, and we consider in particular the stochastic averaged gradient (SAG) method [19]. Each iteration of that algorithm requires N operations (N being the size of the supports of the input distributions), which makes it scale better in large-scale problems than the state-of-the-art Sinkhorn algorithm, while still enjoying a convergence rate of O(1/k), k being the number of iterations. We show that the semi-discrete OT problem can be solved using averaged stochastic gradient descent (SGD), whose convergence rate is O(1/√k). This approach is numerically advantageous over the brute force approach consisting in sampling first the continuous density to solve next a discrete OT problem. Lastly, for continuous optimal transport, we propose a novel method which makes use of an expansion of the dual variables in a reproducing kernel Hilbert space (RKHS). This allows us for the first time to compute OT distances between two arbitrary densities with a convergent algorithm, under the assumption that the two potentials belong to such an RKHS.\n\nNotations. In the following we consider two metric spaces X and Y. We denote by M^1_+(X) the set of positive Radon probability measures on X, and C(X) the space of continuous functions on X.
Let µ ∈ M^1_+(X), ν ∈ M^1_+(Y); we define\n\nΠ(µ, ν) def.= { π ∈ M^1_+(X × Y) ; ∀(A, B) ⊂ X × Y, π(A × Y) = µ(A), π(X × B) = ν(B) },\n\nthe set of joint probability measures on X × Y with marginals µ and ν. The Kullback-Leibler divergence between joint probabilities is defined as\n\n∀(π, ξ) ∈ M^1_+(X × Y)², KL(π|ξ) def.= ∫_{X×Y} (dπ/dξ)(x, y) ( log( (dπ/dξ)(x, y) ) − 1 ) dξ(x, y),\n\nwhere we denote by dπ/dξ the relative density of π with respect to ξ, and by convention KL(π|ξ) def.= +∞ if π does not have a density with respect to ξ. The Dirac measure at point x is δ_x. For a set C, ι_C(x) = 0 if x ∈ C and ι_C(x) = +∞ otherwise. The probability simplex of N bins is Σ_N = { µ ∈ R^N_+ ; Σ_i µ_i = 1 }. Element-wise multiplication of vectors is denoted by ⊙ and K^T denotes the transpose of a matrix K. We denote 1_N = (1, . . . , 1)^T ∈ R^N and 0_N = (0, . . . , 0)^T ∈ R^N.\n\n2 Optimal Transport: Primal, Dual and Semi-dual Formulations\n\nWe consider the optimal transport problem between two measures µ ∈ M^1_+(X) and ν ∈ M^1_+(Y), defined on metric spaces X and Y. No particular assumption is made on the form of µ and ν; we only assume that both can be sampled from, which is all that is needed to apply our algorithms.\n\nPrimal, Dual and Semi-dual Formulations. The Kantorovich formulation [10] of OT and its entropic regularization [6] can be conveniently written in a single convex optimization problem as follows:\n\n∀(µ, ν) ∈ M^1_+(X) × M^1_+(Y), W_ε(µ, ν) def.= min_{π ∈ Π(µ,ν)} ∫_{X×Y} c(x, y) dπ(x, y) + ε KL(π|µ⊗ν).   (P_ε)\n\nHere c ∈ C(X × Y) and c(x, y) should be interpreted as the “ground cost” to move a unit of mass from x to y. This c is typically application-dependent, and reflects some prior knowledge on the data to process. We refer to the introduction for a list of previous works where various examples (in imaging, vision, graphics or machine learning) of such costs are given.\n\nWhen X = Y, ε = 0 and c = d^p for p ≥ 1, where d is a distance on X, then W_0(µ, ν)^{1/p} is known as the p-Wasserstein distance on M^1_+(X). Note that this definition can be used for any type of measure, both discrete and continuous. When ε > 0, problem (P_ε) is strongly convex, so that the optimal π is unique, and algebraic properties of the KL regularization result in computations that can be tackled using the Sinkhorn algorithm [6].\n\nFor any c ∈ C(X × Y), we define the following constraint set\n\nU_c def.= { (u, v) ∈ C(X) × C(Y) ; ∀(x, y) ∈ X × Y, u(x) + v(y) ≤ c(x, y) },\n\nand define its indicator function as well as its “smoothed” approximation\n\nι^ε_{U_c}(u, v) def.= { ι_{U_c}(u, v) if ε = 0 ; ε ∫_{X×Y} exp( (u(x) + v(y) − c(x, y))/ε ) dµ(x) dν(y) if ε > 0 }.   (1)\n\nFor any v ∈ C(Y), we define its c-transform and its “smoothed” approximation\n\n∀x ∈ X, v^{c,ε}(x) def.= { min_{y∈Y} c(x, y) − v(y) if ε = 0 ; −ε log( ∫_Y exp( (v(y) − c(x, y))/ε ) dν(y) ) if ε > 0 }.   (2)\n\nThe proposition below describes two dual problems.
It is central to our analysis and paves the way for the application of stochastic optimization methods.\n\nProposition 2.1 (Dual and semi-dual formulations). For ε ≥ 0, one has\n\nW_ε(µ, ν) = max_{u∈C(X), v∈C(Y)} F_ε(u, v) def.= ∫_X u(x) dµ(x) + ∫_Y v(y) dν(y) − ι^ε_{U_c}(u, v),   (D_ε)\n\n= max_{v∈C(Y)} H_ε(v) def.= ∫_X v^{c,ε}(x) dµ(x) + ∫_Y v(y) dν(y) − ε,   (S_ε)\n\nwhere ι^ε_{U_c} is defined in (1) and v^{c,ε} in (2). Furthermore, u solving (D_ε) is recovered from an optimal v solving (S_ε) as u = v^{c,ε}. For ε > 0, the solution π of (P_ε) is recovered from any (u, v) solving (D_ε) as dπ(x, y) = exp( (u(x) + v(y) − c(x, y))/ε ) dµ(x) dν(y).\n\nProof. Problem (D_ε) is the convex dual of (P_ε), and is derived using Fenchel-Rockafellar's theorem. The relation between u and v is obtained by writing the first-order optimality condition for v in (D_ε). Plugging this expression back into (D_ε) yields (S_ε).\n\nProblem (P_ε) is called the primal while (D_ε) is its associated dual problem. We refer to (S_ε) as the “semi-dual” problem, because in the special case ε = 0, (S_ε) boils down to the so-called semi-discrete OT problem [1]. Both dual problems are concave maximization problems. The optimal dual variables (u, v)—known as Kantorovitch potentials—are not unique, since for any solution (u, v) of (D_ε), (u + λ, v − λ) is also a solution for any λ ∈ R. When ε > 0, they can be shown to be unique up to this scalar translation [6].
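For a discrete ν, the smoothed c-transform (2) is a weighted log-sum-exp, best evaluated with the usual max-shift stabilization. A minimal sketch of this evaluation (NumPy; the function name and signature are ours, not from the paper):

```python
import numpy as np

def smoothed_c_transform(v, c_x, nu, eps):
    """v^{c,eps}(x) of Eq. (2) for a discrete measure nu = sum_j nu_j delta_{y_j}.

    v    : (J,) dual potential values v(y_j)
    c_x  : (J,) ground costs c(x, y_j) for a fixed x
    nu   : (J,) weights of nu
    eps  : regularization strength; eps == 0 gives the hard c-transform
    """
    if eps == 0:
        return np.min(c_x - v)
    # -eps * log( sum_j nu_j exp((v_j - c(x, y_j)) / eps) ), max-shifted
    z = (v - c_x) / eps
    m = z.max()
    return -eps * (m + np.log(np.sum(nu * np.exp(z - m))))
```

As ε → 0 the smoothed value tends to min_j (c(x, y_j) − v_j) (up to an ε log J term for uniform weights), consistently with the two cases of (2).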
We refer to the supplementary material for a discussion (and proofs) of the convergence of the solutions of (P_ε), (D_ε) and (S_ε) towards those of (P_0), (D_0) and (S_0) as ε → 0.\n\nA key advantage of (S_ε) over (D_ε) is that, when ν is a discrete density (but not necessarily µ), then (S_ε) is a finite-dimensional concave maximization problem, which can thus be solved using stochastic programming techniques, as highlighted in Section 4. By contrast, when both µ and ν are continuous densities, these dual problems are intrinsically infinite-dimensional, and we propose in Section 5 more advanced techniques based on RKHSs.\n\nStochastic Optimization Formulations. The fundamental property needed to apply stochastic programming is that both dual problems (D_ε) and (S_ε) must be rephrased as maximizing expectations:\n\n∀ε > 0, F_ε(u, v) = E_{X,Y}[ f_ε(X, Y, u, v) ] and ∀ε ≥ 0, H_ε(v) = E_X[ h_ε(X, v) ],   (3)\n\nwhere the random variables X and Y are independent and distributed according to µ and ν respectively, and where, for (x, y) ∈ X × Y and (u, v) ∈ C(X) × C(Y),\n\n∀ε > 0, f_ε(x, y, u, v) def.= u(x) + v(y) − ε exp( (u(x) + v(y) − c(x, y))/ε ),\n\n∀ε ≥ 0, h_ε(x, v) def.= ∫_Y v(y) dν(y) + v^{c,ε}(x) − ε.\n\nThis reformulation is at the heart of the methods detailed in the remainder of this article.
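The expectation form (3) is what makes sampling-based methods applicable: an unbiased estimate of F_ε only requires independent draws from µ and ν. A small sketch of such an estimator (helper names are ours; u, v and c are passed as callables):

```python
import numpy as np

def f_eps(x, y, u, v, c, eps):
    # integrand of (3): u(x) + v(y) - eps * exp((u(x) + v(y) - c(x, y)) / eps)
    s = u(x) + v(y)
    return s - eps * np.exp((s - c(x, y)) / eps)

def F_eps_estimate(xs, ys, u, v, c, eps):
    # Monte Carlo average over i.i.d. pairs (x_k, y_k) drawn from mu and nu
    return float(np.mean([f_eps(x, y, u, v, c, eps) for x, y in zip(xs, ys)]))
```

Replacing the average by stochastic ascent steps on (u, v) recovers the schemes of the following sections.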
Note that the dual problem (D_ε) cannot be cast as an unconstrained expectation maximization problem when ε = 0, because of the constraint on the potentials which arises in that case.\n\nWhen ν is discrete, i.e. ν = Σ_{j=1}^J ν_j δ_{y_j}, the potential v is a J-dimensional vector (v_j)_{j=1,...,J} and we can compute the gradient of h_ε. When ε > 0 the gradient reads ∇_v h_ε(v, x) = ν − π(x), where π(x)_i = ν_i exp( (v_i − c(x, y_i))/ε ) ( Σ_{j=1}^J ν_j exp( (v_j − c(x, y_j))/ε ) )^{−1}, and the Hessian is given by ∇²_v h_ε(v, x) = (1/ε) ( π(x)π(x)^T − diag(π(x)) ). The eigenvalues of the Hessian are bounded in magnitude by 1/ε, which guarantees a Lipschitz gradient, but are not bounded away from 0, which does not ensure strong concavity. In the unregularized case, h_0 is not smooth and a subgradient is given by ∇_v h_0(v, x) = ν − π̃(x), where π̃(x)_i = 1_{i=j*} and j* = arg min_{j∈{1,...,J}} c(x, y_j) − v_j (when several elements are in the arg min, we arbitrarily choose one of them to be j*). We insist on the lack of strong concavity of the semi-dual problem, as it impacts the convergence properties of the stochastic algorithms (stochastic averaged gradient and stochastic gradient descent) described below.\n\n3 Discrete Optimal Transport\n\nWe assume in this section that both µ and ν are discrete measures, i.e. finite sums of Diracs, of the form µ = Σ_{i=1}^I µ_i δ_{x_i} and ν = Σ_{j=1}^J ν_j δ_{y_j}, where (x_i)_i ⊂ X and (y_j)_j ⊂ Y, and the histogram vector weights are µ ∈ Σ_I and ν ∈ Σ_J. These discrete measures may come from the evaluation of continuous densities on a grid, counting features in a structured object, or be empirical measures based on samples.
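The semi-dual gradient derived above can be checked numerically. A sketch for discrete ν (our own helper name; π(x) is the softmax-like vector of the text, computed with a max-shift for stability):

```python
import numpy as np

def grad_h_eps(v, c_x, nu, eps):
    """Gradient nu - pi(x) of h_eps(x, .) for discrete nu, with
    pi(x)_i proportional to nu_i * exp((v_i - c(x, y_i)) / eps)."""
    z = (v - c_x) / eps
    w = nu * np.exp(z - z.max())   # max-shift for numerical stability
    return nu - w / w.sum()
```

A finite-difference comparison against h_ε itself confirms the formula; note also that the gradient entries always sum to zero, reflecting the invariance of the semi-dual to constant shifts of v.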
This setting is relevant for several applications, including all known applications of the earth mover's distance. We show in this section that our stochastic formulation can prove extremely efficient to compare measures with a large number of points.\n\nDiscrete Optimization and Sinkhorn. In this setup, the primal (P_ε), dual (D_ε) and semi-dual (S_ε) problems can be rewritten as finite-dimensional optimization problems involving the cost matrix c ∈ R^{I×J}_+ defined by c_{i,j} = c(x_i, y_j):\n\nW_ε(µ, ν) = min_{π ∈ R^{I×J}_+ ; π 1_J = µ, π^T 1_I = ν} Σ_{i,j} c_{i,j} π_{i,j} + ε Σ_{i,j} ( log( π_{i,j} / (µ_i ν_j) ) − 1 ) π_{i,j},   (P̄_ε)\n\n= max_{u ∈ R^I, v ∈ R^J} Σ_i u_i µ_i + Σ_j v_j ν_j − ε Σ_{i,j} exp( (u_i + v_j − c_{i,j})/ε ) µ_i ν_j (for ε > 0),   (D̄_ε)\n\n= max_{v ∈ R^J} H̄_ε(v) = Σ_{i ∈ I} h̄_ε(x_i, v) µ_i,   (S̄_ε)\n\nwhere h̄_ε(x, v) = Σ_{j ∈ J} v_j ν_j + { −ε log( Σ_{j ∈ J} exp( (v_j − c(x, y_j))/ε ) ν_j ) − ε if ε > 0 ; min_j ( c(x, y_j) − v_j ) if ε = 0 }.   (4)\n\nAlgorithm 1 SAG for Discrete OT\nInput: step size C\nOutput: v\n  v ← 0_J, d ← 0_J, ∀i, g_i ← 0_J\n  for k = 1, 2, . . . do\n    Sample i ∈ {1, 2, . . . , I} uniformly.\n    d ← d − g_i\n    g_i ← µ_i ∇_v h̄_ε(x_i, v)\n    d ← d + g_i ; v ← v + C d\n  end for\n\nThe state-of-the-art method to solve the discrete regularized OT problem (i.e. when ε > 0) is Sinkhorn's algorithm [6, Alg.1], which has a linear convergence rate [8]. It corresponds to a block coordinate maximization, successively optimizing (D̄_ε) with respect to either u or v.
Each iteration of this algorithm is however costly, because it requires a matrix-vector multiplication. Indeed, this corresponds to a “batch” method where all the samples (x_i)_i and (y_j)_j are used at each iteration, which thus has complexity O(N²) where N = max(I, J). We now detail how to alleviate this issue using online stochastic optimization methods.\n\nIncremental Discrete Optimization when ε > 0. Stochastic gradient descent (SGD), in which an index k is drawn from the distribution µ at each iteration, can be used to maximize the finite sum that appears in (S̄_ε). The gradient of that term, h̄_ε(x_k, ·), can be used as a proxy for the full gradient in a standard gradient ascent step to maximize H̄_ε.\n\nWhen ε > 0, the finite sum appearing in (S̄_ε) suggests to use incremental gradient methods—rather than purely stochastic ones—which are known to converge faster than SGD. We propose to use the stochastic averaged gradient (SAG) [19]. As SGD, SAG operates at each iteration by sampling a point x_k from µ, to compute the gradient corresponding to that sample for the current estimate v. Unlike SGD, SAG keeps in memory a copy of that gradient. Another difference is that SAG applies a fixed-length update, in the direction of the average of all gradients stored so far, which provides a better proxy of the gradient corresponding to the entire sum. This improves the convergence rate to |H̄_ε(v*_ε) − H̄_ε(v_k)| = O(1/k), where v*_ε is a maximizer of H̄_ε, at the expense of storing the gradient for each of the I points. This expense can be mitigated by considering mini-batches instead of individual points. Note that the SAG algorithm is adaptive to strong-convexity and will be linearly convergent around the optimum. The pseudo-code for SAG is provided in Algorithm 1, and we defer more details on SGD to Section 4, in which it will be shown to play a crucial role.
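As a concrete sketch, Algorithm 1 can be transcribed as follows (our own NumPy transcription with a fixed step size, not the authors' reference code):

```python
import numpy as np

def sag_discrete_ot(C_cost, mu, nu, eps, step, n_iter=20000, seed=0):
    """SAG ascent on the discrete semi-dual: maximize sum_i mu_i h_eps(x_i, v).

    C_cost : (I, J) cost matrix c_{i,j}; mu, nu : histograms; step : e.g. 1/L.
    """
    rng = np.random.default_rng(seed)
    I, J = C_cost.shape
    v = np.zeros(J)
    g = np.zeros((I, J))   # stored gradient for each sample i
    d = np.zeros(J)        # running sum of the stored gradients
    for _ in range(n_iter):
        i = rng.integers(I)
        z = (v - C_cost[i]) / eps
        w = nu * np.exp(z - z.max())
        grad_i = mu[i] * (nu - w / w.sum())   # mu_i * grad_v h_eps(x_i, v)
        d += grad_i - g[i]
        g[i] = grad_i
        v += step * d
    return v
```

On a small random problem, the marginal violation ν − Σ_i µ_i π(x_i) (the full gradient) shrinks towards 0 as iterations proceed.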
Note that the Lipschitz constant of all these terms is upper bounded by L = max_i µ_i/ε.\n\nNumerical Illustrations on Bags of Word-Embeddings. Comparing texts using a Wasserstein distance on their representations as clouds of word embeddings has been recently shown to yield state-of-the-art accuracy for text classification [11]. The authors of [11] have however highlighted that this accuracy comes at a large computational cost. We test our stochastic approach to discrete OT in this scenario, using the complete works of 35 authors (names in the supplementary material). We use GloVe word embeddings [14] to represent words, namely X = Y = R^300. We discard the 1,000 most frequent words, which appear at the top of the file glove.840B.300d provided on the authors' website. We sample N = 20,000 words (found within the remaining huge dictionary of relatively rare words) from each author's complete work. Each author is thus represented as a cloud of 20,000 points in R^300. The cost function c between the word embeddings is the squared Euclidean distance, re-scaled so that it has a unit empirical median on 2,000 points sampled randomly among all vector embeddings. We set ε to 0.01 (other values are considered in the supplementary material). We compute all (35 × 34/2 = 595) pairwise regularized Wasserstein distances using both the Sinkhorn algorithm and SAG. Following the recommendations in [19], SAG's stepsize is tested for 3 different settings, 1/L, 3/L and 5/L.
The convergence of each algorithm is measured by computing the ℓ1 norm of the gradient of the full sum (which also corresponds to the marginal violation of the primal transport solution that can be recovered with these dual variables [6]), as well as the ℓ2 norm of the deviation to the optimal scaling found after 4,000 passes for any of the three methods.\n\nFigure 1: We compute all 595 pairwise word mover's distances [11] between 35 very large corpora of text, each represented as a cloud of I = 20,000 word embeddings. We compare the Sinkhorn algorithm with SAG, tuned with different stepsizes. Each pass corresponds to an I × I matrix-vector product. We used minibatches of size 200 for SAG. Left plot: convergence of the gradient ℓ1 norm (average and ± standard deviation error bars). A stepsize of 3/L achieves a substantial speed-up of ≈ 2.5, as illustrated in the boxplots in the center plot. Convergence to v* (the best dual variable across all variables after 4,000 passes) in ℓ2 norm is given in the right plot, up to 2,000 ≈ 2^11 steps.\n\nResults are presented in Fig. 1 and suggest that SAG can be more than twice as fast as Sinkhorn on average for all tolerance thresholds. Note that SAG retains exactly the same parallel properties as Sinkhorn: all of these computations can be streamlined on GPUs. We used 4 Tesla K80 cards to compute both SAG and Sinkhorn results. For each computation, all 4,000 passes take less than 3 minutes (far fewer are needed if the goal is only to approximate the Wasserstein distance itself, as proposed in [11]).\n\n4 Semi-Discrete Optimal Transport\n\nIn this section, we assume that µ is an arbitrary measure (in particular, it need not be discrete) and that ν = Σ_{j=1}^J ν_j δ_{y_j} is a discrete measure.
This corresponds to the semi-discrete OT problem [1, 12]. The semi-dual problem (S_ε) is then a finite-dimensional maximization problem, written in expectation form as W_ε(µ, ν) = max_{v∈R^J} E_X[ h̄_ε(X, v) ] where X ∼ µ and h̄_ε is defined in (4).\n\nStochastic Semi-discrete Optimization. Since the expectation is taken over an arbitrary measure, neither the Sinkhorn algorithm nor incremental algorithms such as SAG can be used. An alternative is to approximate µ by an empirical measure µ̂_N def.= (1/N) Σ_{i=1}^N δ_{x_i}, where (x_i)_{i=1,...,N} are i.i.d. samples from µ, and to compute W_ε(µ̂_N, ν) using the discrete methods (Sinkhorn or SAG) detailed in Section 3. However this introduces a discretization noise in the solution, as the discrete problem is now different from the original one and thus has a different solution. Averaged SGD, on the other hand, does not require µ to be discrete and is thus perfectly adapted to this semi-discrete setting. The algorithm is detailed in Algorithm 2 (the expression for ∇h̄_ε being given in Equation (4)). The convergence rate is O(1/√k) thanks to the averaging of the iterates ṽ_k [15].\n\nAlgorithm 2 Averaged SGD for Semi-Discrete OT\nInput: step size C\nOutput: v\n  ṽ ← 0_J, v ← ṽ\n  for k = 1, 2, . . . do\n    Sample x_k from µ\n    ṽ ← ṽ + (C/√k) ∇_v h̄_ε(x_k, ṽ)\n    v ← (1/k) ṽ + ((k − 1)/k) v\n  end for\n\nNumerical Illustrations. Simulations are performed in X = Y = R^3. Here µ is a Gaussian mixture (a continuous density) and ν = (1/J) Σ_{j=1}^J δ_{y_j} with J = 10, where the (y_j)_j are i.i.d. samples from another Gaussian mixture. Each mixture is composed of three Gaussians whose means are drawn randomly in [0, 1]^3, and their covariance matrices are constructed as Σ = 0.01(R^T + R) + 3I_3 where R is 3 × 3 with random entries in [0, 1].
In the following, we denote by v*_ε a solution of (S_ε), which is approximated by running SGD for 10^7 iterations, 100 times more than those plotted, to ensure reliable convergence curves. Both plots are averaged over 50 runs; lighter lines show the variability in a single run.\n\nFigure 2: (a) SGD: plot of ‖v_k − v*_0‖_2/‖v*_0‖_2 as a function of k, for SGD and different values of ε (ε = 0 being un-regularized). (b) SGD vs. SAG: plot of ‖v_k − v*_ε‖_2/‖v*_ε‖_2 as a function of k, for SGD and SAG with different numbers N of samples, for regularized OT using ε = 10^−2.\n\nFigure 2 (a) shows the evolution of ‖v_k − v*_0‖_2/‖v*_0‖_2 as a function of k. It highlights the influence of the regularization parameter ε on the iterates of SGD. While the regularized iterates converge faster, they do not converge to the correct unregularized solution. This figure also illustrates the convergence theorem for the solutions of (S_ε) toward those of (S_0) as ε → 0, which can be found in the supplementary material. Figure 2 (b) shows the evolution of ‖v_k − v*_ε‖_2/‖v*_ε‖_2 as a function of k, for a fixed regularization parameter value ε = 10^−2. It compares SGD to SAG using different numbers N of samples for the empirical measures µ̂_N. While SGD converges to the true solution of the semi-discrete problem, the solution computed by SAG is biased because of the approximation error which comes from the discretization of µ.
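Algorithm 2 can be sketched as follows (NumPy, with a squared-Euclidean cost; the sampler interface is our assumption — µ is accessed only through draws, never discretized):

```python
import numpy as np

def averaged_sgd_semidiscrete(sample_mu, Y, nu, eps, C0, n_iter=20000):
    """Averaged SGD on the semi-dual (S_eps): mu enters only via sample_mu().

    Y : (J, d) support points of nu; nu : (J,) weights; C0 : step constant.
    Returns the averaged iterate, the estimate of a solution of (S_eps)."""
    v_tilde = np.zeros(len(nu))   # current SGD iterate
    v_bar = np.zeros(len(nu))     # running average, returned
    for k in range(1, n_iter + 1):
        x = sample_mu()
        c_x = np.sum((Y - x) ** 2, axis=1)      # c(x, y_j), squared Euclidean
        z = (v_tilde - c_x) / eps
        w = nu * np.exp(z - z.max())
        v_tilde = v_tilde + (C0 / np.sqrt(k)) * (nu - w / w.sum())
        v_bar = v_bar + (v_tilde - v_bar) / k   # online average of the iterates
    return v_bar
```

At a solution, the expected gradient E_X[ν − π(X)] vanishes; on a simple 1-D example this can be verified by quadrature.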
This error decreases when the sample size N is increased, as the approximation of µ by µ̂_N becomes more accurate.\n\n5 Continuous optimal transport using RKHS\n\nIn the case where neither µ nor ν is discrete, problem (S_ε) is infinite-dimensional, so it cannot be solved directly using SGD. We propose in this section to solve the initial dual problem (D_ε), using expansions of the dual variables in two reproducing kernel Hilbert spaces (RKHS). Choosing dual variables (or test functions) in an RKHS is the fundamental assumption underlying the Maximum Mean Discrepancy (MMD) [22]. It is thus tempting to draw parallels between the approach in this section and the MMD. The two methods do not, however, share much beyond using RKHSs. Indeed, unlike the MMD, problem (D_ε) involves two different dual (test) functions u and v, one for each measure; these are furthermore linked through a regularizer ι^ε_{U_c}. Recall finally that, contrary to the semi-discrete setting, we can only solve the regularized problem here (i.e. ε > 0), since (D_ε) cannot be cast as an expectation maximization problem when ε = 0.\n\nStochastic Continuous Optimization. We consider two RKHSs H and G defined on X and on Y, with kernels κ and ℓ, associated with norms ‖·‖_H and ‖·‖_G. Recall the two main properties of an RKHS: (a) if u ∈ H, then u(x) = ⟨u, κ(·, x)⟩_H, and (b) κ(x, x′) = ⟨κ(·, x), κ(·, x′)⟩_H.\n\nThe dual problem (D_ε) is conveniently re-written in (3) as the maximization of the expectation of f_ε(X, Y, u, v) with respect to the random variables (X, Y) ∼ µ ⊗ ν.
The SGD algorithm applied to this problem reads, starting with u_0 = 0 and v_0 = 0,

  (u_k, v_k) := (u_{k−1}, v_{k−1}) + (C/√k) ∇f_ε(x_k, y_k, u_{k−1}, v_{k−1}) ∈ H × G,   (5)

where (x_k, y_k) are i.i.d. samples from µ ⊗ ν. The following proposition shows that these (u_k, v_k) iterates can be expressed as finite sums of kernel functions, with a simple recursion formula.

Proposition 5.1. The iterates (u_k, v_k) defined in (5) satisfy

  (u_k, v_k) := Σ_{i=1}^{k} α_i (κ(·, x_i), ℓ(·, y_i)),  where  α_i := Π_{B_r}( (C/√i) (1 − e^{(u_{i−1}(x_i) + v_{i−1}(y_i) − c(x_i, y_i))/ε}) ),   (6)

where (x_i, y_i)_{i=1...k} are i.i.d. samples from µ ⊗ ν and Π_{B_r} is the projection on the centered ball of radius r. If the solutions of (D_ε) are in H × G and if r is large enough, the iterates (u_k, v_k) converge to a solution of (D_ε).

The proof of Proposition 5.1 can be found in the supplementary material.

Figure 3: (a) Plot of dµ/dx and dν/dx. (b) Plot of ‖u_k − û⋆‖² / ‖û⋆‖² as a function of k with SGD in the RKHS, for regularized OT using ε = 10⁻¹. (c) Plot of the iterates u_k for k = 10³, 10⁴, 10⁵ and the proxy for the true potential û⋆, evaluated on a grid where µ has non-negligible mass.

Algorithm 3 Kernel SGD for continuous OT
Input: C, kernels κ and ℓ
Output: (α_k, x_k, y_k)_{k=1,...}
  for k = 1, 2, . . . do
    Sample x_k from µ
    Sample y_k from ν
    u_{k−1}(x_k) := Σ_{i=1}^{k−1} α_i κ(x_k, x_i)
    v_{k−1}(y_k) := Σ_{i=1}^{k−1} α_i ℓ(y_k, y_i)
    α_k := (C/√k) (1 − e^{(u_{k−1}(x_k) + v_{k−1}(y_k) − c(x_k, y_k))/ε})
  end for

Algorithm 3 describes our kernel SGD approach, in which both potentials u and v are approximated by a linear combination of kernel functions. The main cost lies in the computation of the terms u_{k−1}(x_k) and v_{k−1}(y_k), which implies a quadratic complexity O(k²). Several methods exist to alleviate the running time complexity of kernel algorithms, e.g. random Fourier features [16] or incremental incomplete Cholesky decomposition [25].

Kernels that are associated with dense RKHSs are called universal [23] and can approach any arbitrary potential. In Euclidean spaces X = Y = R^d, where d > 0, a natural choice of universal kernel is the kernel defined by κ(x, x′) = exp(−‖x − x′‖²/σ²). Tuning its bandwidth σ is crucial to obtain a good convergence of the algorithm.

Finally, let us note that, while entropy regularization of the primal problem (P_ε) was instrumental to be able to apply semi-discrete methods in Sections 3 and 4, this is not the case here. Indeed, since the kernel SGD algorithm is applied to the dual (D_ε), it is possible to replace KL(π | µ ⊗ ν) appearing in (P_ε) by other regularizing divergences. A typical example would be a χ² divergence ∫_{X×Y} (dπ/d(µ⊗ν)(x, y))² dµ(x) dν(y) (with positivity constraints on π).

Numerical Illustrations. We consider optimal transport in 1D between a Gaussian µ and a Gaussian mixture ν whose densities are represented in Figure 3 (a).
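Algorithm 3 can be sketched directly in numpy. This is an illustrative implementation under stated simplifications: the Gaussian kernel is used for both κ and ℓ, the 1D squared cost and the sampling distributions are stand-ins, the projection Π_{B_r} of Proposition 5.1 is crudely replaced by clipping each coefficient, and the exponent is capped as a numerical guard; none of these choices come from the paper itself.

```python
import numpy as np

def kernel_sgd(sample_mu, sample_nu, iters, C=1.0, eps=1e-1, sigma=0.5, r=10.0,
               cost=lambda x, t: (x - t) ** 2):
    # Same Gaussian kernel used for both κ and ℓ (illustrative choice)
    kappa = lambda a, b: np.exp(-((a - b) ** 2) / sigma ** 2)
    alphas, xs, ys = [], [], []
    for k in range(1, iters + 1):
        x, y = sample_mu(), sample_nu()                  # (x_k, y_k) i.i.d. from µ ⊗ ν
        a = np.array(alphas)
        # evaluate the current iterates through their kernel expansions (O(k) per step)
        u_x = a @ kappa(np.array(xs), x) if alphas else 0.0
        v_y = a @ kappa(np.array(ys), y) if alphas else 0.0
        t = (u_x + v_y - cost(x, y)) / eps
        alpha_k = C / np.sqrt(k) * (1.0 - np.exp(np.minimum(t, 30.0)))  # capped exponent
        alphas.append(float(np.clip(alpha_k, -r, r)))    # crude clipping in lieu of Π_{B_r}
        xs.append(x)
        ys.append(y)
    return alphas, xs, ys

rng = np.random.default_rng(0)
alphas, xs, ys = kernel_sgd(lambda: rng.normal(), lambda: rng.uniform(-1.0, 1.0),
                            iters=300)
```

The O(k) evaluation of u_{k−1}(x_k) and v_{k−1}(y_k) inside each iteration is what yields the overall O(k²) complexity mentioned in the text.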
Since there is no existing benchmark for continuous transport, we use the solution of the semi-discrete problem W_ε(µ, ν̂_N) with N = 10³ computed with SGD as a proxy for the solution, and we denote it by û⋆. We focus on the convergence of the potential u, as it is continuous in both problems, contrary to v. Figure 3 (b) represents the plot of ‖u_k − û⋆‖² / ‖û⋆‖², where the potentials are evaluated on a sample (x_i)_{i=1...N′} drawn from µ. This gives more emphasis to the norm on points where µ has more mass. The convergence is rather slow but still noticeable. The iterates u_k are plotted on a grid for different values of k in Figure 3 (c), to emphasize the convergence to the proxy û⋆. We can see that the iterates computed with the RKHS converge faster where µ has more mass, which is actually where the value of u has the greatest impact in F_ε (u being integrated against µ).

Conclusion

We have shown in this work that the computations behind (regularized) optimal transport can be considerably alleviated, or simply enabled, using a stochastic optimization approach. In the discrete case, we have shown that incremental gradient methods can surpass the Sinkhorn algorithm in terms of efficiency, provided that the (constant) step size has been correctly selected, which should be possible in practical applications. We have also proposed the first known methods that can address the challenging semi-discrete and continuous cases. All three of these settings open new perspectives for the application of OT to high-dimensional problems.

Acknowledgements GP was supported by the European Research Council (ERC SIGMA-Vision); AG by Région Ile-de-France; MC by JSPS grant 26700002.

References
[1] F. Aurenhammer, F.
Hoffmann, and B. Aronov. Minkowski-type theorems and least-squares clustering. Algorithmica, 20(1):61–76, 1998.
[2] F. Bassetti, A. Bodini, and E. Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006.
[3] R. Burkard, M. Dell'Amico, and S. Martello. Assignment Problems. SIAM, 2009.
[4] G. Carlier, V. Duval, G. Peyré, and B. Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. arXiv preprint arXiv:1512.02783, 2015.
[5] R. Cominetti and J. San Martin. Asymptotic analysis of the exponential penalty trajectory in linear programming. Mathematical Programming, 67(1-3):169–187, 1994.
[6] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Adv. in Neural Information Processing Systems, pages 2292–2300, 2013.
[7] A. Dieuleveut and F. Bach. Non-parametric stochastic approximation with large step sizes. arXiv preprint arXiv:1408.0361, 2014.
[8] J. Franklin and J. Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735, 1989.
[9] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. Poggio. Learning with a Wasserstein loss. In Adv. in Neural Information Processing Systems, pages 2044–2052, 2015.
[10] L. Kantorovich. On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2):227–229, 1942.
[11] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger. From word embeddings to document distances. In ICML, 2015.
[12] Q. Mérigot. A multiscale approach to optimal transport. Comput. Graph. Forum, 30(5):1583–1592, 2011.
[13] G. Montavon, K.-R. Müller, and M. Cuturi. Wasserstein training of restricted Boltzmann machines. In Adv. in Neural Information Processing Systems, 2016.
[14] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. Proc.
of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543, 2014.
[15] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[16] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Adv. in Neural Information Processing Systems, pages 1177–1184, 2007.
[17] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. IJCV, 40(2):99–121, November 2000.
[18] F. Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, NY, 2015.
[19] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2016.
[20] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist., 35:876–879, 1964.
[21] J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (SIGGRAPH), 34(4):66:1–66:11, 2015.
[22] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, G. R. G. Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
[23] I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[24] C. Villani. Topics in Optimal Transportation. Graduate Studies in Math. AMS, 2003.
[25] G. Wu, E. Chang, Y. K. Chen, and C. Hughes. Incremental approximate matrix factorization for speeding up support vector machines. In Proc. of the 12th ACM SIGKDD Intern. Conf.
on Knowledge Discovery and Data Mining, pages 760–766, 2006.