This post was largely inspired by Gabriel Peyré and Marco Cuturi’s excellent book Computational Optimal Transport, which is freely available on arXiv [1].
A simple problem?
Let’s start at the beginning: what is optimal transport (OT)?
It all begins with Gaspard Monge, reading his mémoire [2] in front of eminent scientists and engineers of the time — including famous Enlightenment philosopher Condorcet — at the French Académie Royale des Sciences in 1776.
The memoir is in French and is probably not relevant to someone interested in modern optimal transport. However, Monge’s problem can be explained in simple terms as follows:
Given two excavations $D_1$ and $D_2$, and the cost of moving a unit mass of dirt from any point of $D_1$ to any point of $D_2$,
what is the cheapest way to move all the mass from $D_1$ to $D_2$?
The answer, it turns out, is far from simple. Monge didn’t solve it in his mémoire, although he did lay the foundations of modern optimal transport theory.
Let’s now formalize the problem.
Discrete Optimal Transport
The easiest way to think of the problem is in terms of atomic items.
Assignment problem
In its simplest form, it can be viewed as an assignment problem: among all possible assignments, which one is the cheapest? This is already quite restrictive, because we can only work with two sets of the same size. Yet even this simple version is surprisingly hard to solve.
Each set (the above “excavations”) can be represented as a histogram (or vector) $\mathbf{a}$ that belongs to the probability simplex, i.e. the set of vectors whose nonnegative components sum to 1:
\[ \mathbf{a} \in \left\{ x = (x_1, …, x_N) \in \mathbb{R}_{+}^N : \sum_{i=1}^N x_i = 1 \right\} \]
If we write $\mathbf{C}_{i,j}$ for the cost of moving an element from bin $i$ to bin $j$, the quantity we want to minimize is $\sum_{i} \mathbf{C}_{i,\sigma(i)}$, where $\sigma$ is a permutation of the set $\{1, …, N\}$. This permutation represents an assignment of each bin $i$ of the first histogram to the bin $\sigma(i)$ of the second histogram.
In this form, optimal transport is fundamentally combinatorial, and might be summarized like so:
How can we assign every element $i$ of $\{1, …, N\}$ to elements $j$ of $\{1, …, N\}$ so as to minimize the cumulative cost of this assignment?
The result of this search is called the optimal assignment. As you may already know, there are exactly $N!$ possible solutions to this problem, which makes it very difficult to solve when $N$ grows large: e.g., with histograms of size 20, there are already $2.43\times10^{18}$ candidate assignments.
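If you want to see both sides of this in code, here is a minimal sketch (assuming NumPy and SciPy are available; the size and cost matrix are made up for illustration). It enumerates all $N!$ assignments by brute force and compares the result with SciPy’s polynomial-time Hungarian solver:

```python
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N = 6                                  # small enough to enumerate all N! assignments
C = rng.random((N, N))                 # C[i, j]: cost of assigning element i to element j

# Brute force: scan every permutation sigma and keep the cheapest one.
best_cost, best_sigma = min(
    (sum(C[i, sigma[i]] for i in range(N)), sigma)
    for sigma in itertools.permutations(range(N))
)

# Same problem solved with the Hungarian algorithm (polynomial time).
rows, cols = linear_sum_assignment(C)
print(best_cost, C[rows, cols].sum())  # the two optimal costs should match
```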
It is interesting to note that the optimal assignment is not necessarily unique, as shown in the example below where 2 elements are mapped to 2 others that together form the four corners of a 2D square with sides of length 1.
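To make the non-uniqueness concrete, here is one configuration consistent with that description (the exact coordinates are my own illustrative choice): take $x_1 = (0,0)$ and $x_2 = (1,1)$ on one diagonal of the unit square, and $y_1 = (1,0)$ and $y_2 = (0,1)$ on the other. With the Euclidean cost, the assignment $x_1 \to y_1,\ x_2 \to y_2$ and the assignment $x_1 \to y_2,\ x_2 \to y_1$ both cost $1 + 1 = 2$, so both are optimal.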
Working with asymmetric histograms: the Monge Problem
Using two equally sized histograms is a very strong limitation. By extending the above definition to a slightly larger family of histograms, we obtain the Monge problem. In this problem, several points $x_i$ can map to the same $y_j$ and their weights are summed.
In that case, the mapping between inputs and outputs is no longer a permutation, but a surjective map $T$. If points $x_1, …, x_n$ have weights $\mathbf{a} = (a_1, …, a_n)$ and points $y_1, …, y_m$ have weights $\mathbf{b} = (b_1, …, b_m)$, then $T$ must satisfy:
\[ \forall j \in \{1, … m\},\quad b_j = \sum_{i:T(x_i) = y_j} a_i \]
This formulation still requires the mass conservation constraint to hold, and it is no easier to solve, because we are still assigning whole elements $x_i$ to elements $y_j$.
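To make the constraint concrete, here is a tiny NumPy sketch (the weights and the map below are made up for illustration) that checks whether a candidate map $T$ pushes the source weights $\mathbf{a}$ onto the target weights $\mathbf{b}$:

```python
import numpy as np

a = np.array([0.2, 0.3, 0.1, 0.4])   # weights of the source points x_1..x_4
b = np.array([0.5, 0.5])             # weights of the target points y_1, y_2
T = np.array([0, 0, 1, 1])           # T[i] = j means x_i is sent to y_j

# Push-forward of a by T: sum the weights of all x_i that map to each y_j.
b_push = np.zeros_like(b)
np.add.at(b_push, T, a)

print(np.allclose(b_push, b))  # True: this T satisfies the mass-conservation constraint
```

A map for which this check fails is simply not admissible for the Monge problem.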
Kantorovitch relaxation
Even with the above extension, this formulation of the optimal transport problem is still too constrained to be practically useful in many cases. In 1942, Leonid Kantorovitch [3] introduced another key idea: relaxing the deterministic nature of transportation. A source point $x_i$ no longer has to map to a single target point and can be fragmented into smaller pieces; this is called mass splitting.
This new formulation is much better suited to some real-world situations, for instance logistics problems. Frank L. Hitchcock stated a version of this problem in [4] as follows:
When several factories supply a product to a number of cities we desire the least costly manner of distribution. Due to freight rates and other matters the cost of a ton of product to a particular city will vary according to which factory supplies it, and will also vary from city to city.
To reflect this change, we modify our previous formulation slightly, by replacing the permutation $\sigma$ with a coupling matrix $\mathbf{P} = (\mathbf{P}_{i,j}) \in \mathbb{R}_{+}^{n\times m}$. In the example above, each $\mathbf{P}_{i,j}$ would be the weight of the arrow from factory $i$ to city $j$. The set of admissible couplings for our problem can therefore be written as
\[ \mathbf{U}(\mathbf{a}, \mathbf{b}) = \left\{\mathbf{P} \in \mathbb{R}_{+}^{n \times m}: \mathbf{P}\mathbb{1}_m = \mathbf{a} \text{ and } \mathbf{P}^\text{T}\mathbb{1}_n = \mathbf{b} \right\} \]
This new formulation, contrary to the Monge formulation, is symmetric: if $\mathbf{P} \in \mathbf{U}(\mathbf{a}, \mathbf{b})$, then $\mathbf{P}^{\text{T}} \in \mathbf{U}(\mathbf{b}, \mathbf{a})$, so the direction of transport can simply be reversed, which was impossible with the surjective map of the problem above.
We can now formulate the problem in a mathematically cleaner way, and it reads:
\[ L_\mathbf{C}(\mathbf{a}, \mathbf{b}) = \min_{\mathbf{P}\in\mathbf{U}(\mathbf{a}, \mathbf{b})} \sum_{i,j} \mathbf{C}_{i,j} \mathbf{P}_{i,j} = \min_{\mathbf{P} \in \mathbf{U}(\mathbf{a}, \mathbf{b})} \langle \mathbf{C}, \mathbf{P} \rangle \]
If you are familiar with optimization, you might have recognized a linear program: the objective is linear in $\mathbf{P}$, and the admissible set $\mathbf{U}(\mathbf{a}, \mathbf{b})$ is described by $m+n$ equality constraints (or, equivalently, $2(m+n)$ inequalities) together with nonnegativity. These constraints define a convex polytope.
This is good news, because we have moved from the dreadful realm of combinatorics to the comfortable world of convex optimization. Optimization over a matrix space might sound hard, but it is usually much simpler than searching through an exponentially large set of possible assignments.
I will not go into the details of the tools used to solve this problem, because they are widely used and taught in optimization and operations research courses. You will find in [1] a detailed explanation of the simplex algorithm and of other algorithmic tools for solving the OT problem, such as dual ascent methods or the auction algorithm.
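To make the linear-program structure tangible, here is a minimal sketch (assuming NumPy and SciPy, with an illustrative random cost matrix) that solves the Kantorovitch problem by feeding the flattened coupling matrix and the $m+n$ marginal constraints to scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 4, 3
a = np.full(n, 1 / n)                  # source histogram
b = np.full(m, 1 / m)                  # target histogram
C = rng.random((n, m))                 # ground cost between bins

# Equality constraints: row sums of P equal a, column sums equal b.
# The unknown is P flattened row by row into a vector of length n*m.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j P[i, j] = a[i]
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # sum_i P[i, j] = b[j]
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
P = res.x.reshape(n, m)                # optimal coupling
print(res.fun)                         # optimal transport cost <C, P>
```

This generic solver is only meant to show the structure of the problem; dedicated network-flow solvers are far more efficient in practice.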
Regularized optimal transport
This is great: we have transformed our exponentially hard problem into a much more pleasant-looking one! Unfortunately, it is still not easy to solve: the algorithms mentioned above mostly have polynomial complexities with exponents larger than 2, and sometimes exponential worst-case complexities.
Entropic regularization
Regularizing optimal transport was proposed in [4]. It is a method for approximating solutions to the optimal transport problem by adding a regularizing term to the objective function. This term makes the objective strongly convex and biases the solution towards smoother, more spread-out couplings, which makes the problem much easier to solve.
The entropy of a coupling matrix $\mathbf{P}$ is defined as
\[ \mathbf{H}(\mathbf{P}) = - \sum_{i,j} \mathbf{P}_{i,j}(\log(\mathbf{P}_{i,j}) - 1) \]
and the objective function for two histograms $\mathbf{a}$ and $\mathbf{b}$ now reads
\[ L^{\varepsilon}_\mathbf{C}(\mathbf{a}, \mathbf{b}) = \min_{\mathbf{P} \in \mathbf{U}(\mathbf{a}, \mathbf{b})} \langle \mathbf{P}, \mathbf{C} \rangle - \varepsilon \mathbf{H}(\mathbf{P}) \]
This addition makes the objective function an $\varepsilon$-strongly convex function. There is therefore a unique optimal solution $\mathbf{P}$. It can be shown that the solution to this regularized problem has the following form:
\[ \forall (i,j) \in \{1,…,n\}\times \{1, …, m\},\quad \mathbf{P}_{i,j} = \mathbf{u}_i \mathbf{K}_{i,j} \mathbf{v}_j \]
where $\mathbf{K}_{i,j} = e^{-\frac{\mathbf{C}_{i,j}}{\varepsilon}}$ and $\mathbf{u}$ and $\mathbf{v}$ are unknown scaling variables. This is a really big improvement, because we now have an explicit formula for an optimal coupling.
Sinkhorn’s algorithm
The problem is known as the matrix scaling problem: we are trying to find two scaling vectors that, when multiplied with $\mathbf{K}$ as above, give $\mathbf{P}$. This can be achieved by alternately updating $\mathbf{u}$ and $\mathbf{v}$ with Sinkhorn’s updates:
\[ \mathbf{u}^{(\ell+1)} = \dfrac{\mathbf{a}}{\mathbf{Kv}^{(\ell)}}\ \text{and} \ \mathbf{v}^{(\ell + 1)} = \dfrac{\mathbf{b}}{\mathbf{K}^{T}\mathbf{u}^{(\ell + 1)}} \]
Although this algorithm likely appeared at the beginning of the 20th century, the proof of its convergence is attributed to Sinkhorn [5]. The algorithm not only converges, it does so at a linear rate. Don’t be fooled by the name: linear convergence means the error shrinks geometrically, so it is fast! In other words, there exist a factor $\mu \in [0, 1)$ and a constant $C$ such that the iterates $\mathbf{u}^{(\ell)}$ approach the solution $\mathbf{u}^*$ at the speed of a geometric progression:
\[ \left\|\mathbf{u}^{(\ell)} - \mathbf{u}^* \right\| \leq C\mu^\ell \]
Regularized OT and Sinkhorn’s algorithm received renewed attention in the machine learning and data science community following a 2013 paper by Marco Cuturi [6], which showed that Sinkhorn’s updates give an efficient and scalable approximation to OT that can be easily parallelized, for instance on GPUs.
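Here is a minimal NumPy sketch of these updates (the histograms, cost matrix and value of $\varepsilon$ are made up for illustration); the coupling is recovered at the end through the explicit form $\mathbf{P}_{i,j} = \mathbf{u}_i \mathbf{K}_{i,j} \mathbf{v}_j$ given above:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iters=1000):
    """Entropy-regularized OT between histograms a and b for cost C, via Sinkhorn updates."""
    K = np.exp(-C / eps)               # Gibbs kernel K_ij = exp(-C_ij / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                # enforce the row marginal    P 1 = a
        v = b / (K.T @ u)              # enforce the column marginal P^T 1 = b
    P = u[:, None] * K * v[None, :]    # P_ij = u_i K_ij v_j
    return P, np.sum(P * C)

rng = np.random.default_rng(0)
n, m = 5, 7
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
C = rng.random((n, m))

P, cost = sinkhorn(a, b, C)
print(np.allclose(P.sum(axis=1), a), np.allclose(P.sum(axis=0), b))  # marginals (approximately) satisfied
print(cost)                                                          # regularized transport cost
```

Note that every operation is a matrix-vector product or an elementwise division, which is exactly why these updates map so well to GPUs.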
Going further: Wasserstein distances, WGANs, Variational problems
The machine learning community is getting excited about the possibilities that optimal transport has to offer. Wasserstein distances (more on these below) can be used as loss functions, leveraging physical and geometric ideas (mass displacement, ground cost, and so on) that are often more natural than information-based divergences between probability measures (such as the Kullback-Leibler divergence). This excitement is illustrated by the number of papers accepted at NeurIPS mentioning the concept.
With more than a thousand citations in 2018-2019 (according to Semantic Scholar), Wasserstein Generative Adversarial Networks (WGANs) are a good illustration of optimal transport ideas going mainstream, although the fact that it is a paper about GANs might account for those citations more than the maths does.
Wasserstein distances
When the cost is derived from a distance $\mathbf{D}$ between bins (as in most examples above), the optimal transport cost between two measures (or histograms here) is itself a distance, called the Wasserstein distance. Formally, it is written
\[ W_p(\mathbf{a}, \mathbf{b}) = L _{\mathbf{D}^p}(\mathbf{a}, \mathbf{b})^{1/p} \]
This is the form of OT that has had the most applications in machine learning and data science, because it turns a ground metric between bins into a metric between histograms over those bins. The earth mover’s distance is the particular case $W_1$ obtained with the Euclidean distance in $\mathbb{R}^d$.
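In practice you rarely have to code this yourself. As a small sketch, assuming the POT (Python Optimal Transport) package is installed and using made-up point clouds with uniform weights, the ground-cost matrix and the exact Wasserstein distance can be computed roughly as follows (ot.dist and ot.emd2 are the helpers I believe POT exposes for this):

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(0)
x = rng.random((5, 2))            # positions of the bins of the first histogram
y = rng.random((7, 2))            # positions of the bins of the second histogram
a = np.full(5, 1 / 5)             # weights of the first histogram
b = np.full(7, 1 / 7)             # weights of the second histogram

M = ot.dist(x, y, metric='euclidean')   # ground cost D_ij = ||x_i - y_j||
W1 = ot.emd2(a, b, M)                   # W_1(a, b): optimal transport cost for p = 1
print(W1)
```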
Wasserstein GANs
WGANs [8] use the Wasserstein distance in GAN training instead of the Jensen-Shannon divergence. This results in improved training stability and a loss that converges, with the added benefit that, for applications such as computer vision, this loss should correspond better to image quality because of the properties of OT.
Variational problems
When using a Wasserstein distance as a loss function, the main technical challenge lies in differentiating it efficiently. Many algorithmic and mathematical tricks make this possible in some settings, unlocking very interesting applications.
For example, in [9] the authors are able to compute Wasserstein barycenters between 2D or 3D shapes. This barycenter is an intermediate shape, positioned between a weighted set of starting shapes (where distances are measured with the Wasserstein distance), with a very natural-looking “mass displacement” resembling interpolation. The figure below is from this paper.
The animation below illustrates this interpolation process: you can see parts of the shapes being split and merged to fit the other shape in a smooth and natural-looking way.
Conclusion
There is a lot more to say about optimal transport, but this post only aims to introduce the concept and is already long enough. The theory and mathematics are beautiful and become very advanced the deeper you go into the subject (see e.g. [10]). There is a lot of ongoing research into the theory and applications of optimal transport, especially in the machine learning community. If you are interested in getting started, I suggest you check out Gabriel Peyré and Marco Cuturi’s book Computational Optimal Transport. It gives a very interesting overview of optimal transport and its applications, can be read at different levels of mathematical detail depending on your comfort, and is particularly well suited to data science and machine learning practitioners who wish to learn more about the subject and the theory behind it.