*This post was largely inspired by Gabriel Peyré and Marco Cuturi’s excellent book
Computational Optimal Transport, which is freely available on arXiv [1].*

## A simple problem?

Let’s start at the beginning: what is optimal transport (OT)?

It all begins with Gaspard Monge,
reading his *mémoire* [2] in front of
eminent scientists and engineers of the time — including famous
Enlightenment philosopher Condorcet
— at the French *Académie Royale des Sciences* in 1776.

The memoir is in French, and is probably not relevant to someone interested in modern optimal transport. However, Monge’s problem can be explained in simple terms as follows:

> Given two excavations $D_1$ and $D_2$ and a cost of moving a unit mass of dirt from any point of $D_1$ to any point of $D_2$, what is the cheapest way to move all the dirt from $D_1$ to $D_2$?

The answer, it turns out, is far from simple. Monge didn’t solve
it in his *mémoire*, although he did lay the foundations of
modern optimal transport theory.

Let’s now formalize the problem.

## Discrete Optimal Transport

The easiest way to think of the problem is in terms of atomic items.

### Assignment problem

In its simplest form, it can be viewed as an assignment problem: among all the ways of pairing each element of one set with an element of the other, which one has the lowest total cost? This is already quite restrictive, because we can only work with two sets of the same size. Yet even this simple version is incredibly hard to solve.

Each set (the above “excavations”) can be represented as a histogram (or vector) $\mathbf{a}$ that belongs to the probability simplex, i.e. the set of vectors with nonnegative components summing to 1:

\[ \mathbf{a} \in \left\{ x = (x_1, …, x_N) \in \mathbb{R}_+^N : \sum_{i=1}^N x_i = 1 \right\} \]
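To make this concrete, here is a tiny Python snippet (the helper name `to_simplex` is mine, purely for illustration) that turns raw nonnegative weights into such a histogram by normalization:

```python
import numpy as np

def to_simplex(counts):
    """Turn nonnegative counts into a histogram on the probability simplex."""
    x = np.asarray(counts, dtype=float)
    if (x < 0).any() or x.sum() == 0:
        raise ValueError("counts must be nonnegative and not all zero")
    return x / x.sum()

a = to_simplex([3, 1, 4])   # array([0.375, 0.125, 0.5])
print(a.sum())              # 1.0
```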

If we write $\mathbf{C}_{i,j}$ for the cost of moving an element
from bin $i$ to bin $j$, the quantity we want to minimize is $\sum_{i}
\mathbf{C}_{i,\sigma(i)}$, where $\sigma$ is a
**permutation** of the set $\{1, …, N\}$. This
permutation represents an assignment of each bin $i$ of the first
histogram to the position $\sigma(i)$ in the second histogram.

In this form, optimal transport is fundamentally
**combinatorial**, and might be summarized like
so:

> How can we assign each element $i$ of $\{1, …, N\}$ to an element $j$ of $\{1, …, N\}$ so as to minimize the cumulative cost of this assignment?

The result of this search is called the **optimal
assignment**. As you may already know, there are exactly
$N!$ possible solutions to this problem, which makes it very
difficult to solve when $N$ grows large: e.g., with histograms of
size 20, there are already $20! \approx 2.43\times10^{18}$ possible
assignments.

It is interesting to note that the optimal assignment is **not
unique** in general, as shown in the example below, where two elements
are mapped to two others and the four points together form the
corners of a unit square in 2D.
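To get a feel for these numbers, here is a minimal Python sketch (with a random cost matrix) comparing brute-force enumeration of all permutations against scipy’s polynomial-time assignment solver:

```python
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N = 8
C = rng.random((N, N))  # C[i, j]: cost of assigning element i to position j

# Brute force: enumerate all N! permutations (hopeless beyond small N).
best = min(itertools.permutations(range(N)),
           key=lambda s: sum(C[i, j] for i, j in enumerate(s)))

# Hungarian-style solver: polynomial time in N.
rows, cols = linear_sum_assignment(C)
# Both find the same optimum (unique here, since the costs are random floats).
assert tuple(cols) == best
```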

### Working with asymmetric histograms: the Monge Problem

Using two equally sized histograms is a very strong limitation.
By extending the above definition to a slightly larger family of
histograms, we obtain the Monge problem. In this problem,
**several points $x_i$ can map to the same $y_j$** and
their weights are summed.

In that case, the mapping between inputs and outputs is no longer a permutation, but a surjective map $T$. If points $x_1, …, x_n$ have weights $\mathbf{a} = (a_1, …, a_n)$ and points $y_1, …, y_m$ have weights $\mathbf{b} = (b_1, …, b_m)$, $T$ must satisfy:

\[ \forall j \in \{1, … m\},\quad b_j = \sum_{i:T(x_i) = y_j} a_i \]

This formulation only works when mass is conserved, i.e. when the total weight of the sources equals the total weight of the targets. It is also no easier to solve, because we are still assigning elements $x_i$ to elements $y_j$.
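Numerically, checking whether a candidate map $T$ is admissible just means computing this push-forward and comparing it to $\mathbf{b}$; here is a minimal numpy sketch (the helper `pushforward` is mine, for illustration):

```python
import numpy as np

def pushforward(T, a, m):
    """Weights induced on m target points by the map i -> T[i]."""
    b = np.zeros(m)
    np.add.at(b, T, a)  # accumulate the weight of every source sent to each target
    return b

a = np.array([0.25, 0.25, 0.25, 0.25])  # four source points
T = np.array([0, 0, 1, 1])              # two sources mapped onto each target
print(pushforward(T, a, m=2))           # [0.5 0.5]: T is admissible for b = (0.5, 0.5)
```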

### Kantorovitch relaxation

Even with the above extension, this formulation of the optimal
transport problem is still too constrained to be practically useful
in many cases. In 1942, Leonid
Kantorovitch [3] introduced another
key idea: relaxing the deterministic nature of
transportation. A source point $x_i$ no longer has to map to a
single target point; its mass can be *fragmented* into smaller
pieces. This is called **mass splitting**.

This new formulation is much better suited to some real-world
situations, for instance **logistics problems**. Frank
L. Hitchcock stated a version of this problem in [4] as follows:

> When several factories supply a product to a number of cities we desire the least costly manner of distribution. Due to freight rates and other matters the cost of a ton of product to a particular city will vary according to which factory supplies it, and will also vary from city to city.

To reflect this change, we modify our previous formulation slightly, by replacing the permutation $\sigma$ with a coupling matrix $\mathbf{P} = (\mathbf{P}_{ij}) \in \mathbb{R}_{+}^{n\times m}$. In the example above, each $\mathbf{P}_{ij}$ would be the weight of the arrow from factory $i$ to city $j$. The set of admissible couplings for our problem can therefore be written as

\[ \mathbf{U}(\mathbf{a}, \mathbf{b}) = \left\{\mathbf{P} \in \mathbb{R}_{+}^{n \times m}: \mathbf{P}\mathbb{1}_m = \mathbf{a} \text{ and } \mathbf{P}^\text{T}\mathbb{1}_n = \mathbf{b} \right\} \]

This new formulation, contrary to the Monge formulation, is
**symmetric**: whereas a Monge map cannot in general be
inverted, a coupling can simply be transposed, since if $\mathbf{P}
\in \mathbf{U}(\mathbf{a}, \mathbf{b})$, then
$\mathbf{P}^{\text{T}} \in \mathbf{U}(\mathbf{b}, \mathbf{a})$.

We can now formulate the problem in a mathematically cleaner way, and it reads:

\[ L_\mathbf{C}(\mathbf{a}, \mathbf{b}) = \min_{\mathbf{P}\in\mathbf{U}(\mathbf{a}, \mathbf{b})} \sum_{i,j} \mathbf{C}_{i,j}\mathbf{P}_{i,j} = \min_{\mathbf{P}\in\mathbf{U}(\mathbf{a}, \mathbf{b})} \langle \mathbf{C}, \mathbf{P} \rangle \]

If you are familiar with optimization, you might have recognized
a **linear
program**: the objective is linear in $\mathbf{P}$, and the
admissible set is defined by $m+n$ equality constraints (or,
equivalently, $2(m+n)$ inequalities). These constraints
define a convex polytope.
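As a sanity check, the problem can be handed directly to a generic LP solver. Below is a minimal sketch using scipy; it is fine for small instances, but it is not how dedicated OT solvers work:

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(a, b, C):
    """Solve min <C, P> over the couplings U(a, b) as a linear program."""
    n, m = C.shape
    # Row-sum constraints P @ 1_m = a, one per source bin...
    row_sums = np.kron(np.eye(n), np.ones((1, m)))
    # ...and column-sum constraints P.T @ 1_n = b, one per target bin.
    col_sums = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(C.ravel(),
                  A_eq=np.vstack([row_sums, col_sums]),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None))
    return res.x.reshape(n, m)  # the optimal coupling P
```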

This is good news, because we have moved from the dreadful realm
of **combinatorics** to the comfortable world of
**convex optimization**. Optimizing over a matrix
space might sound hard, but it is usually much simpler than
searching through an exponentially large set of possible assignments.

I will not go into the details of the tools used to solve this
problem, because they are widely used and taught in optimization and
operations research courses. You will find in [1] a detailed explanation
of the **simplex algorithm**, along with other algorithmic
tools for the OT problem, such as dual ascent methods and the
auction algorithm.

## Regularized optimal transport

This is great: we have transformed our exponentially hard
problem into a much more pleasant-looking one! Unfortunately,
it is *still* not easy to solve. The algorithms
mentioned above mostly have polynomial complexities with exponents
larger than 2, and sometimes exponential worst-case
complexities.

### Entropic regularization

Regularizing optimal transport was proposed in [7]. The idea is to approximate the solution of the optimal transport problem by adding a regularization term to the objective function. This term makes the objective strongly convex, which makes the problem much easier to solve, at the price of a slightly more diffuse coupling.

The **entropy of a coupling matrix** $\mathbf{P}$
is defined as

\[ \mathbf{H}(\mathbf{P}) = - \sum_{i,j} \mathbf{P}_{i,j}(\log(\mathbf{P}_{i,j}) - 1) \]

and the objective function for two histograms $\mathbf{a}$ and $\mathbf{b}$ now reads

\[ L^{\varepsilon}_\mathbf{C}(\mathbf{a}, \mathbf{b}) = \min_{\mathbf{P} \in \mathbf{U}(\mathbf{a}, \mathbf{b})} \langle \mathbf{P}, \mathbf{C} \rangle - \varepsilon \mathbf{H}(\mathbf{P}) \]

This addition makes the objective function an
**$\varepsilon$-strongly convex function**. There is
therefore a unique optimal solution $\mathbf{P}$. It can be shown
that the solution to this regularized problem has the following
form:

\[ \forall (i,j) \in \{1,…,n\}\times \{1, …, m\},\quad \mathbf{P}_{i,j} = \mathbf{u}_i \mathbf{K}_{i,j} \mathbf{v}_j \]

where $\mathbf{K}_{i,j} = e^{-\frac{\mathbf{C}_{i,j}}{\varepsilon}}$
and $\mathbf{u}$ and $\mathbf{v}$ are unknown scaling vectors. This
is a really big improvement, because we now have an
**explicit formula** for the optimal coupling.

### Sinkhorn’s algorithm

This problem is known as the **matrix scaling
problem**: we are looking for two scaling vectors that, when
multiplied with $\mathbf{K}$, give $\mathbf{P}$. It can be solved by
alternately updating $\mathbf{u}$ and $\mathbf{v}$ with
**Sinkhorn’s updates**, where divisions are entrywise:

\[ \mathbf{u}^{(\ell+1)} = \dfrac{\mathbf{a}}{\mathbf{Kv}^{(\ell)}}\ \text{and} \ \mathbf{v}^{(\ell + 1)} = \dfrac{\mathbf{b}}{\mathbf{K}^{T}\mathbf{u}^{(\ell + 1)}} \]
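In code, the whole algorithm fits in a few lines of numpy. Here is a minimal sketch with a fixed number of iterations (no stopping criterion or numerical stabilization):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Approximate the optimal coupling of entropic OT by matrix scaling."""
    K = np.exp(-C / eps)        # Gibbs kernel
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)         # entrywise divisions, as in the updates above
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # P = diag(u) K diag(v)
```

Note that for small $\varepsilon$ the kernel $\mathbf{K}$ underflows, and the iterations are better carried out in the log domain; see [1] for stabilized variants.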

Although this algorithm likely appeared at the beginning of the
20th century, the proof of its convergence is attributed to
Sinkhorn [5]. The algorithm not only
converges, but it does so at a **linear rate**. Don’t
be fooled by the name: it is fast! It means that there exist a
factor $\mu \in [0, 1)$ and a constant $C$ such that the iterates
$\mathbf{u}^{(\ell)}$ approach the solution $\mathbf{u}^*$ at
the speed of a geometric progression:

\[ \left\|\mathbf{u}^{(\ell)} - \mathbf{u}^* \right\| \leq C\mu^\ell \]

Regularized OT and Sinkhorn’s algorithm received renewed
attention in the machine learning and data science community
following a 2013 paper by Marco Cuturi [6], which showed that Sinkhorn’s updates are an
**efficient and scalable** approximation to OT that
can easily be **parallelized**, for instance on
GPUs.

## Going further: Wasserstein distances, WGANs, Variational problems

The machine learning community is getting excited about the
possibilities that optimal transport has to offer. Wasserstein
distances (defined below)
can be used as loss functions, leveraging **physical and
geometric ideas** such as mass displacement and ground cost,
which are more natural than information-based divergences
between probability measures (such as the
Kullback-Leibler divergence). This excitement is illustrated by
the **number of papers accepted at NeurIPS**
mentioning the concept.

With more than a thousand citations in 2018-2019 (according to Semantic Scholar), Wasserstein Generative Adversarial Networks (WGANs) are a good illustration of optimal transport ideas going mainstream, although the fact that it is a paper about GANs might account for more of those citations than the maths.

### Wasserstein distances

When the cost matrix is derived from a distance $\mathbf{D}$
between bins, as in most examples above, the optimal value of the
transport problem defines a distance between the two measures
(or histograms here), called the
**Wasserstein distance**. Formally, for $p \geq 1$, it is written

\[ W_p(\mathbf{a}, \mathbf{b}) = L_{\mathbf{D}^p}(\mathbf{a}, \mathbf{b})^{1/p} \]

where $\mathbf{D}^p$ is the entrywise $p$-th power of the distance matrix.

This is the form of OT that has had the most applications in
machine learning and data science, because of its ability to take a
ground metric between bins and turn it into a metric between
histograms of such bins. The **Earth
mover’s distance** is the particular case of $W_1$ with
the Euclidean distance in $\mathbb{R}^d$.
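In practice, you rarely need to code the solver yourself. For instance, using the POT (Python Optimal Transport) library and its `ot.emd2` exact solver, computing $W_1$ between two small point clouds looks like this:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(0)
xs, xt = rng.normal(size=(5, 2)), rng.normal(size=(6, 2))   # two point clouds
a, b = np.full(5, 1 / 5), np.full(6, 1 / 6)                 # uniform weights

D = ot.dist(xs, xt, metric='euclidean')  # ground distances between points
w1 = ot.emd2(a, b, D)                    # W_1: exact OT cost, solved as an LP
print(w1)
```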

### Wasserstein GANs

WGANs [8] use the Wasserstein distance in GAN training instead of the Jensen-Shannon divergence. This results in improved training stability and a more meaningful loss curve, with the added benefit that, for applications such as computer vision, the loss should correlate better with image quality because of the properties of OT.

### Variational problems

When using a Wasserstein distance as a loss function, the main
technical challenge lies in **differentiating it
efficiently**. Many algorithmic and mathematical tricks
make this possible in some settings, unlocking very interesting
applications.
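One common trick is to unroll a fixed number of Sinkhorn iterations inside an automatic differentiation framework and differentiate straight through them. Here is a minimal JAX sketch of that idea (an illustration, not necessarily the method used in the papers below):

```python
import jax
import jax.numpy as jnp

def sinkhorn_loss(a, b, C, eps=0.1, n_iters=100):
    """Entropic OT cost, differentiable because autodiff unrolls the loop."""
    K = jnp.exp(-C / eps)
    v = jnp.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return jnp.sum(P * C)

# Gradient of the transport cost with respect to the source histogram:
grad_a = jax.grad(sinkhorn_loss, argnums=0)
```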

For example, in [9] the authors are able
to compute **Wasserstein barycenters** between 2D or
3D shapes. Such a barycenter is an intermediate shape, equidistant
in Wasserstein distance from a weighted set of starting shapes,
with a very natural-looking “mass
displacement” resembling interpolation. The figure below is from
this paper.

The animation below illustrates this interpolation process: you can see parts of the shapes being split and merged to fit the other shape in a smooth, natural-looking way.

## Conclusion

There is a lot more to say about optimal transport, but this
post only aims at introducing the concept and is already long
enough. The theory and mathematics are beautiful and get very
advanced the deeper you go into the subject (see e.g. [10]). There is a lot of ongoing research into the
theory and applications of optimal transport, especially in the
machine learning community. If you are interested in getting
started, I suggest you check out Gabriel Peyré and
Marco Cuturi’s book *Computational
Optimal Transport* [1]. It gives a very interesting overview of
optimal transport and its applications, and can be read at different
levels of mathematical complexity depending on your level of
comfort. It is particularly well suited to data science and machine
learning people who wish to learn more about the subject and the
theory behind it.