Introduction: \(2\)-Wasserstein Distances
Recall that the Monge-Kantorovich formulation of the OT problem is defined as
$$ \begin{align} \mathrm{P} (\mu, \nu) &= \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(\boldsymbol{x}, \boldsymbol{y}) \mathrm{d} \pi. \end{align} $$
In today’s post, we will focus on the specific case of Wasserstein distances with \(p = 2\), which can be written by using the cost function \(c(\boldsymbol{x}, \boldsymbol{y}) = \lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}\) and taking the square root:
$$ \begin{align}
\mathcal{W}_{2} (\mu, \nu) &:= \inf_{\pi \in \Pi(\mu, \nu)} \left( \int \lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2} \mathrm{d} \pi \right)^{1/2}.
\end{align} $$
For technical reasons, we will rewrite the above into the following equivalent OT problem:
$$\begin{align}
\mathrm{P} (\mu, \nu) &:= \frac{1}{2} \mathcal{W}_{2}^{2} (\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} \mathrm{d} \pi. \tag{P}
\end{align}$$
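To make \((\mathrm{P})\) concrete, here is a minimal numerical sketch (not part of the original derivation) that solves the discrete Kantorovich problem as a linear program with SciPy; the point clouds and weights are arbitrary toy data.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, d = 4, 5, 2
X = rng.normal(size=(n, d))          # support of mu
Y = rng.normal(size=(m, d))          # support of nu
mu = np.full(n, 1.0 / n)             # uniform weights on X
nu = np.full(m, 1.0 / m)             # uniform weights on Y

# Cost matrix C[i, j] = ||x_i - y_j||^2 / 2, matching the integrand of (P).
C = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

# Marginal constraints on the coupling pi (flattened row-major):
# row sums equal mu, column sums equal nu.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([mu, nu])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
P_value = res.fun                    # = (1/2) * W_2(mu, nu)^2
W2 = np.sqrt(2.0 * P_value)
print(W2)
```

The minimizer `res.x.reshape(n, m)` is a discrete optimal coupling \(\pi^{\star}\).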
The Dual Problem of \((\mathrm{P})\)
Let us write \(L^{1} (\mu)\) for the collection of \(\mu\)-integrable functions. Recalling the definition of \(\Pi (\mu, \nu)\), we can reformulate the RHS of \((\mathrm{P})\) as follows:
$$\begin{align}
\inf_{\pi \in \mathcal{M}^{+} (\mathbb{R}^{d} \times \mathbb{R}^{d})} \sup_{\substack{f \in L^{1} (\mu) \\ g \in L^{1} (\nu)}} \left\{ \int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} \mathrm{d} \pi + \int f \ \mathrm{d} \mu + \int g \ \mathrm{d} \nu {}-{} \int (f(\boldsymbol{x}) + g(\boldsymbol{y})) \ \mathrm{d} \pi \right\}.
\end{align}$$
Note that this is equivalent to \((\mathrm{P})\): the supremum of the additional terms involving \(f\) and \(g\) diverges to \(+\infty\) whenever \(\pi \notin \Pi (\mu, \nu)\), and equals \(0\) whenever \(\pi \in \Pi (\mu, \nu)\). Now, we can switch the \(\sup\) and the \(\inf\) to obtain the dual of this problem:
$$\begin{align}
&\sup_{\substack{f \in L^{1} (\mu) \\ g \in L^{1} (\nu)}} \inf_{\pi \in \mathcal{M}^{+} (\mathbb{R}^{d} \times \mathbb{R}^{d})} \left\{ \int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} \mathrm{d} \pi + \int f \ \mathrm{d} \mu + \int g \ \mathrm{d} \nu {}-{} \int f \oplus g \ \mathrm{d} \pi \right\} \\
&= \sup_{\substack{f \in L^{1} (\mu) \\ g \in L^{1} (\nu)}} \left\{ \int f \ \mathrm{d} \mu + \int g \ \mathrm{d} \nu + \inf_{\pi \in \mathcal{M}^{+} (\mathbb{R}^{d} \times \mathbb{R}^{d})} \left\{ \int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} {}-{} (f \oplus g) \ \mathrm{d} \pi \right\} \right\},
\end{align}$$
where we write \((f \oplus g)(\boldsymbol{x}, \boldsymbol{y}) = f(\boldsymbol{x}) + g(\boldsymbol{y})\) for simplicity. Now, the \(\inf\) part diverges to \(-\infty\) whenever there exists some \((\boldsymbol{x}, \boldsymbol{y})\) such that \(\frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} {}-{} (f \oplus g) < 0\) (take \(\pi = t \delta_{(\boldsymbol{x}, \boldsymbol{y})}\) and let \(t \to \infty\)), and equals \(0\) otherwise (take \(\pi = 0\)). Thus it suffices to take the \(\sup\) over pairs \((f, g)\) satisfying \(\frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} {}-{} (f \oplus g) \ge 0\) for all \((\boldsymbol{x}, \boldsymbol{y})\). Converting this back into a constraint on \(f\) and \(g\), we can concisely write down the dual problem as shown below.
Definition 1.12. The dual problem of \((\mathrm{P})\) is of the form
$$\begin{align}
\mathrm{D} (\mu, \nu) &= \sup_{(f, g) \in \mathcal{D} (\mu, \nu)} \left\{ \int f \ \mathrm{d} \mu + \int g \ \mathrm{d} \nu \right\}, \tag{D}
\end{align}$$
where
$$\begin{align}
\mathcal{D}(\mu, \nu) &:= \left\{ (f, g) \in L^{1} (\mu) \times L^{1} (\nu) : f \oplus g \le \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} \right\}.
\end{align}$$
Remark. You may simply understand the functions \(f, g\) as the (infinite-dimensional) Lagrange multipliers corresponding to the constraint \(\pi \in \Pi (\mu, \nu)\) of \(\mathrm{(P)}\). In fact, we can derive the dual for general OT problems by substituting \(\frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2}\) with any cost \(c(\boldsymbol{x}, \boldsymbol{y})\).
Strong Duality and Brenier’s Theorem
This switch between the \(\sup\) and the \(\inf\) immediately implies \(\mathrm{D} (\mu, \nu) \le \mathrm{P} (\mu, \nu)\), which is often called weak duality. An inequality of this type holds for any convex optimization problem. In the special cases where we can show that \(\mathrm{D} (\mu, \nu) = \mathrm{P} (\mu, \nu)\) (i.e., swapping the \(\sup\) and the \(\inf\) is lossless, as in Sion's minimax theorem), we say that strong duality holds.
As you may have expected, we can show that the optimal transport problem in \((\mathrm{P})\) admits strong duality. Moreover, we can also show that the optimal coupling \(\pi^{\star}\) for \((\mathrm{P})\) can be characterized with the gradient of a convex function, which is famously known as Brenier’s theorem. Quite interestingly, we can merge these properties into a single theorem and prove these facts simultaneously!
Theorem 1.13 (Fundamental theorem of OT). Consider the OT problem in \((\mathrm{P})\), where \(\mu\) is absolutely continuous w.r.t. the Lebesgue measure. Then the following are equivalent.
- The coupling \(\pi^{\star}\) attains the optimum of \((\mathrm{P})\).
- (Brenier’s Theorem) There exists a convex, closed, proper (CCP), and \(\mu\)-a.e. differentiable function \(\varphi\) such that \((\boldsymbol{X}, \nabla \varphi (\boldsymbol{X})) \sim \pi^{\star}\) for \(\boldsymbol{X} \sim \mu\).
- (Strong Duality) We have \(\mathrm{D} (\mu, \nu) = \mathrm{P} (\mu, \nu)\), and the functions \(f = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi\) and \(g = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi^{\star}\) attain the optimum of \((\mathrm{D})\), where \(\varphi^{\star}\) is the Fenchel conjugate of \(\varphi\).
As the proof (based on the proof of Theorem 1.14 [CNR25]) is quite lengthy, we will sequentially prove the directions 1 → 2 (Part I), 2 → 3 (Part II), and 3 → 1 (Part III) in the subsequent sections.
Part I. Brenier’s theorem (1 → 2)
Brenier’s theorem [Bre87] shows that the solution to the optimal transport problem for the Wasserstein distance can be fully characterized by the gradient of a convex function.
Cyclical Monotonicity
The proof starts with the following characterization of convex subdifferentials via cyclical monotonicity. (You can see this post for more details about cyclical monotonicity!)
Theorem 1.14. ([Roc66]) An operator \(\mathbb{T} : \mathbb{R}^{d} \rightrightarrows \mathbb{R}^{d}\) is cyclically monotone, i.e.,
$$ \sum_{i=1}^{n} \langle \boldsymbol{x}_{i} {}-{} \boldsymbol{x}_{i+1}, \boldsymbol{y}_{i} \rangle \ge 0, \quad \forall \boldsymbol{y}_{i} \in \mathbb{T} (\boldsymbol{x}_{i}) $$
for any finite collection of points \((\boldsymbol{x}_{i})_{i \in [n]}\) and \(\boldsymbol{x}_{n+1} := \boldsymbol{x}_{1}\), if and only if there exists a convex, closed, and proper (CCP) function \(\varphi\) such that \(\mathbb{T} (\boldsymbol{x}) \subseteq \partial \varphi (\boldsymbol{x})\) for all \(\boldsymbol{x} \in \mathbb{R}^{d}\).
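As a quick sanity check of Theorem 1.14 (illustrative only, with a toy convex quadratic), the gradient of \(\varphi(\boldsymbol{x}) = \frac{1}{2} \boldsymbol{x}^{\top} Q \boldsymbol{x}\) with \(Q\) positive semi-definite should satisfy the cyclical monotonicity inequality on every cycle of points:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
A = rng.normal(size=(d, d))
Q = A.T @ A                      # PSD matrix, so phi(x) = x^T Q x / 2 is convex

def grad_phi(x):
    return Q @ x                 # gradient of the convex quadratic phi

# Check sum_i <x_i - x_{i+1}, grad phi(x_i)> >= 0 on random cycles.
for _ in range(100):
    n = rng.integers(2, 8)
    xs = rng.normal(size=(n, d))
    ys = np.array([grad_phi(x) for x in xs])
    total = sum(np.dot(xs[i] - xs[(i + 1) % n], ys[i]) for i in range(n))
    assert total >= -1e-9        # cyclical monotonicity (up to float error)
print("cyclical monotonicity verified on 100 random cycles")
```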
Now we will show that the support of the optimal coupling \(\pi^{\star}\), denoted by \(\mathrm{supp} (\pi^{\star})\), must satisfy cyclical monotonicity. The basic idea is that, whenever there exists a set of points violating cyclical monotonicity, there is a way to rearrange the measure in a certain way to obtain a smaller cost.
Lemma 1.15. For \(\pi^{\star} \in \Pi(\mu, \nu)\) that attains the optimum of \((\mathrm{P})\), i.e.,
$$\begin{align}
\mathrm{P} (\mu, \nu) &= \int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} \mathrm{d} \pi^{\star},
\end{align}$$
the operator \(\mathbb{T}\) defined by \(\mathrm{Gra} (\mathbb{T}) = \mathrm{supp} (\pi^{\star})\) must be cyclically monotone.
Proof. We prove by contradiction. Suppose that there exists some \(n \ge 2\) and \((\boldsymbol{x}_{i}, \boldsymbol{y}_{i}) \in \mathrm{supp} (\pi^{\star})\) such that
$$ \sum_{i=1}^{n} \langle \boldsymbol{x}_{i} {}-{} \boldsymbol{x}_{i+1}, \boldsymbol{y}_{i} \rangle < 0, $$
with \(\boldsymbol{x}_{n+1} := \boldsymbol{x}_{1}\), which is equivalent to
$$ \sum_{i=1}^{n} \lVert \boldsymbol{x}_{i} {}-{} \boldsymbol{y}_{i} \rVert^{2} > \sum_{i=1}^{n} \lVert \boldsymbol{x}_{i+1} {}-{} \boldsymbol{y}_{i} \rVert^{2}. $$
Since the Euclidean norm is continuous, for each \(i \in [n]\) we can find neighborhoods \(U_{i}, V_{i}\) of \(\boldsymbol{x}_{i}, \boldsymbol{y}_{i}\), respectively, such that
$$ \sum_{i=1}^{n} \lVert \tilde{\boldsymbol{x}}_{i} {}-{} \tilde{\boldsymbol{y}}_{i} \rVert^{2} > \sum_{i=1}^{n} \lVert \tilde{\boldsymbol{x}}_{i+1}' {}-{} \tilde{\boldsymbol{y}}_{i}' \rVert^{2}, \quad \forall \tilde{\boldsymbol{x}}_{i}, \tilde{\boldsymbol{x}}_{i}' \in U_{i}, \ \tilde{\boldsymbol{y}}_{i}, \tilde{\boldsymbol{y}}_{i}' \in V_{i}, \tag{$\spadesuit$}$$
where indices are taken cyclically (\(n + 1 \equiv 1\)). Note that \(\pi^{\star} (U_{i} \times V_{i}) > 0\) for each \(i\), since \((\boldsymbol{x}_{i}, \boldsymbol{y}_{i}) \in \mathrm{supp} (\pi^{\star})\). Now, we define conditional probability distributions \(\pi_{i} \in \mathcal{P}(\mathbb{R}^{d} \times \mathbb{R}^{d})\) for \(i \in [n]\) by \(\pi_{i} (A) = \pi^{\star} (A \ | \ U_{i} \times V_{i})\) for all Borel sets \(A \subset \mathbb{R}^{d} \times \mathbb{R}^{d}\). Then, we define the perturbed measure
$$\begin{align}
\pi &:= \pi^{\star} + \frac{c}{n} \sum_{i=1}^{n} \left( \pi_{i+1}^{(1)} \otimes \pi_{i}^{(2)} {}-{} \pi_{i} \right), \tag{$\diamondsuit$}
\end{align}$$
where \(\pi_{i}^{(1)}, \pi_{i}^{(2)}\) are defined as the first and second marginals of \(\pi_{i}\), respectively, and \(c > 0\) is a small constant. The basic idea is to transport little bits of mass from \(U_{i} \times V_{i}\) to \(U_{i+1} \times V_{i}\), so that by \((\spadesuit)\) the new measure \(\pi\) has a smaller value of \(\int \lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2} \mathrm{d} \pi\) (and thus of \(\int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} \mathrm{d} \pi\)). To formalize the argument, we must choose the constant \(c\) so that \(\pi\) becomes a valid measure. For any Borel set \(A \subset \mathbb{R}^{d} \times \mathbb{R}^{d}\), we have
$$\begin{align}
\pi(A) &\ge \pi^{\star} (A) {}-{} \frac{c}{n} \sum_{i=1}^{n} \pi_{i} (A) \\
&= \pi^{\star} (A) {}-{} \frac{c}{n} \sum_{i=1}^{n} \pi^{\star} (A \ | \ U_{i} \times V_{i}) \\
&= \pi^{\star} (A) {}-{} \frac{c}{n} \sum_{i=1}^{n} \frac{\pi^{\star} (A \cap (U_{i} \times V_{i}))}{\pi^{\star} (U_{i} \times V_{i})} \\
&\ge \pi^{\star} (A) \left( 1 {}-{} \frac{c}{n} \sum_{i=1}^{n} \frac{1}{\pi^{\star} (U_{i} \times V_{i})} \right),
\end{align}$$
and thus we can choose \(c < \min_{i \in [n]} \pi^{\star} (U_{i} \times V_{i})\), which makes the parenthesized factor positive, so that we always have \(\pi(A) \ge 0\).
Also, for any Borel set \(B \subset \mathbb{R}^{d}\), we can compute
$$\begin{align}
\pi (B \times \mathbb{R}^{d}) &= \mu(B) + \frac{c}{n} \sum_{i=1}^{n} \left( \pi_{i+1}^{(1)} (B) {}-{} \pi_{i} (B \times \mathbb{R}^{d}) \right) \\
&= \mu(B) + \frac{c}{n} \sum_{i=1}^{n} \left( \pi_{i+1}^{(1)} (B) {}-{} \pi_{i}^{(1)} (B) \right) \\
&= \mu(B)
\end{align}$$
and similarly \(\pi(\mathbb{R}^{d} \times B) = \nu(B)\), which implies that \(\pi \in \Pi (\mu, \nu)\).
Finally, we can compute
$$\begin{align}
&\int \lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2} \mathrm{d} \pi {}-{} \int \lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2} \mathrm{d} \pi^{\star} \\
&= \frac{c}{n} \sum_{i=1}^{n} \left( \int_{U_{i+1} \times V_{i}} \lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2} \mathrm{d} (\pi_{i+1}^{(1)} \otimes \pi_{i}^{(2)}) {}-{} \int_{U_{i} \times V_{i}} \lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2} \mathrm{d} \pi_{i} \right) \\
&< 0
\end{align}$$
where the equality follows from \((\diamondsuit)\) and the strict inequality from \((\spadesuit)\), which contradicts the optimality of \(\pi^{\star}\). \(\square\)
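The exchange argument in the proof can be seen in a two-point toy example (illustrative, with hand-picked coordinates, not from the text): a pairing whose support violates cyclical monotonicity is beaten by the swapped pairing.

```python
import numpy as np

# Two source points and two target points in the plane.
x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
y1, y2 = np.array([2.0, 0.0]), np.array([0.5, 0.0])

# The pairing (x1, y1), (x2, y2) violates cyclical monotonicity:
# sum_i <x_i - x_{i+1}, y_i> < 0.
lhs = np.dot(x1 - x2, y1) + np.dot(x2 - x1, y2)
assert lhs < 0

# Swapping the targets strictly lowers the squared-distance cost,
# exactly as Lemma 1.15 predicts.
cost_before = np.sum((x1 - y1) ** 2) + np.sum((x2 - y2) ** 2)
cost_after = np.sum((x1 - y2) ** 2) + np.sum((x2 - y1) ** 2)
print(cost_before, cost_after)
```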
Proof of Brenier’s Theorem
Now we are ready to prove Brenier’s theorem, concluding the direction 1 → 2.
Theorem 1.16 (Brenier’s theorem). Suppose that \(\mu\) is absolutely continuous w.r.t. the Lebesgue measure. Then, for \(\pi^{\star} \in \Pi(\mu, \nu)\) that attains the optimum of \((\mathrm{P})\), there exists a CCP, \(\mu\)-a.e. differentiable function \(\varphi\) such that \((\boldsymbol{X}, \nabla \varphi (\boldsymbol{X})) \sim \pi^{\star}\) for \(\boldsymbol{X} \sim \mu\).
Proof. By Theorem 1.14 and Lemma 1.15, there exists a CCP function \(\varphi\) such that \(\boldsymbol{y} \in \partial \varphi (\boldsymbol{x})\) for all \((\boldsymbol{x}, \boldsymbol{y}) \in \mathrm{supp} (\pi^{\star})\). Since \(\varphi\) is convex, it is locally Lipschitz on the interior of its domain, and hence differentiable a.e. w.r.t. the Lebesgue measure (Rademacher’s theorem). As \(\mu\) is absolutely continuous w.r.t. the Lebesgue measure, \(\varphi\) is also differentiable \(\mu\)-a.e., so \(\partial \varphi (\boldsymbol{x}) = \{ \nabla \varphi (\boldsymbol{x}) \}\) for \(\mu\)-a.e. \(\boldsymbol{x}\). Hence \(\pi^{\star} (\boldsymbol{Y} = \nabla \varphi(\boldsymbol{X})) = 1\), or equivalently \((\boldsymbol{X}, \nabla \varphi (\boldsymbol{X})) \sim \pi^{\star}\). \(\square\)
Remark. If you think about this carefully, you can realize that the problem \((\mathrm{P})\) actually admits a solution of the Monge formulation (almost everywhere) as well. This is because we have shown that \(\mathrm{supp} (\pi^{\star})\) must be a subset of
$$\begin{align}
\left\{ (\boldsymbol{x}, \nabla \varphi (\boldsymbol{x})) : \boldsymbol{x} \in \mathbb{R}^{d} \right\},
\end{align}$$
and thus we can think of the map \(T = \nabla \varphi\), which is uniquely defined \(\mu\)-a.e., for which we have \(\pi^{\star} = (\mathrm{id} \times T)_{\sharp} (\mu)\) and \(\nu = T_{\sharp} (\mu)\).
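In one dimension, this Monge map is the monotone rearrangement: assuming empirical measures with equal uniform weights (an illustrative special case, not from the text), the sorted matching is the transport induced by the gradient of a convex function, and a brute-force search over all permutations confirms it is optimal for the squared cost.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
n = 6
x = np.sort(rng.normal(size=n))      # support of mu, sorted
y = np.sort(rng.normal(size=n))      # support of nu, sorted

def cost(perm):
    # Squared cost of matching x_i -> y_{perm(i)}.
    return np.sum((x - y[list(perm)]) ** 2)

sorted_cost = cost(range(n))         # monotone matching x_(i) -> y_(i)
best = min(cost(p) for p in permutations(range(n)))
print("W2^2 =", sorted_cost / n)
```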
Part II. Strong Duality (2 → 3)
Here we will show that strong duality holds, starting from the conclusion of Theorem 1.16. We start by formally defining the Fenchel conjugate of (convex) functions.
Fenchel Conjugates
Definition 1.17. Given a (not necessarily convex) function \(\phi: \mathbb{R}^{d} \rightarrow \mathbb{R} \cup \{ +\infty \}\), the Fenchel conjugate of \(\phi\) is defined as
$$\begin{align}
\phi^{\star} (\boldsymbol{y}) := \sup_{\boldsymbol{x} \in \mathbb{R}^{d}} \left\{ \langle \boldsymbol{x}, \boldsymbol{y} \rangle {}-{} \phi (\boldsymbol{x}) \right\}.
\end{align}$$
Note that \(\phi^{\star}\) is always a convex function as it is the supremum of a collection of affine functions. Also, the Fenchel-Young inequality immediately follows from the definition:
$$\begin{align}
\phi (\boldsymbol{x}) + \phi^{\star} (\boldsymbol{y}) \ge \langle \boldsymbol{x}, \boldsymbol{y} \rangle \quad \forall \boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^{d},
\end{align}$$
and, if \(\phi\) is convex, then equality holds if and only if \(\boldsymbol{y} \in \partial \phi (\boldsymbol{x})\). This is because
$$\begin{align}
\phi (\boldsymbol{x}') \ge \phi (\boldsymbol{x}) + \langle \boldsymbol{x}' {}-{} \boldsymbol{x}, \boldsymbol{y} \rangle
\end{align}$$
for all \(\boldsymbol{x}' \in \mathbb{R}^{d}\) and \(\boldsymbol{y} \in \partial \phi (\boldsymbol{x})\), and therefore
$$\begin{align}
\langle \boldsymbol{x}, \boldsymbol{y} \rangle {}-{} \phi (\boldsymbol{x}) = \sup_{\boldsymbol{x}' \in \mathbb{R}^{d}} \left\{ \langle \boldsymbol{x}', \boldsymbol{y} \rangle {}-{} \phi (\boldsymbol{x}') \right\} = \phi^{\star} (\boldsymbol{y}).
\end{align}$$
All of these facts will be useful throughout the proof of 2 → 3.
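A quick numerical illustration (not from the text): \(\phi(x) = x^{2}/2\) is its own Fenchel conjugate, and the Fenchel-Young gap vanishes exactly at \(y = \phi'(x)\). The grid-based `conjugate` helper below is a hypothetical stand-in for the exact supremum.

```python
import numpy as np

xs = np.linspace(-5, 5, 2001)        # grid over which the sup is approximated
phi = 0.5 * xs ** 2

def conjugate(y):
    # phi*(y) = sup_x { <x, y> - phi(x) }, approximated on the grid.
    return np.max(xs * y - phi)

# The conjugate of x^2/2 is y^2/2 (self-conjugacy).
for y in [-1.5, 0.0, 2.0]:
    assert np.isclose(conjugate(y), 0.5 * y ** 2, atol=1e-4)

# Fenchel-Young: phi(x) + phi*(y) - <x, y> >= 0, with equality at y = phi'(x).
x0 = 1.3
for y in [-2.0, 0.5, x0]:
    gap = 0.5 * x0 ** 2 + conjugate(y) - x0 * y
    assert gap >= -1e-8
assert np.isclose(0.5 * x0 ** 2 + conjugate(x0) - x0 * x0, 0.0, atol=1e-4)
```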
Proof of Strong Duality
As a reminder, the dual problem is of the form
$$\begin{align}
\mathrm{D} (\mu, \nu) &= \sup_{(f, g) \in \mathcal{D} (\mu, \nu)} \left\{ \int f \ \mathrm{d} \mu + \int g \ \mathrm{d} \nu \right\}, \tag{D}
\end{align}$$
where
$$\begin{align}
\mathcal{D}(\mu, \nu) &:= \left\{ (f, g) \in L^{1} (\mu) \times L^{1} (\nu) : f \oplus g \le \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} \right\}.
\end{align}$$
Observe that the inequality defining the constraint \(\mathcal{D} (\mu, \nu)\) is equivalent to
$$\begin{align}
g (\boldsymbol{y}) &\le \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} {}-{} f (\boldsymbol{x}).
\end{align}$$
In order to attain the \(\sup\) in the dual problem \((\mathrm{D})\) given a function \(f\), we can think of the point-wise largest possible choice of \(g\) satisfying the constraint via
$$\begin{align}
g (\boldsymbol{y}) &= \inf_{\boldsymbol{x} \in \mathbb{R}^{d}} \left\{ \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} {}-{} f (\boldsymbol{x}) \right\} \\
&= \frac{\lVert \boldsymbol{y} \rVert^{2}}{2} {}-{} \sup_{\boldsymbol{x} \in \mathbb{R}^{d}} \left\{ \langle \boldsymbol{x}, \boldsymbol{y} \rangle {}-{} \left( \frac{\lVert \boldsymbol{x} \rVert^{2}}{2} {}-{} f (\boldsymbol{x}) \right) \right\} \\
&= \frac{\lVert \boldsymbol{y} \rVert^{2}}{2} {}-{} \sup_{\boldsymbol{x} \in \mathbb{R}^{d}} \left\{ \langle \boldsymbol{x}, \boldsymbol{y} \rangle {}-{} \phi (\boldsymbol{x}) \right\} \\
&= \frac{\lVert \boldsymbol{y} \rVert^{2}}{2} {}-{} \phi^{\star} (\boldsymbol{y}),
\end{align}$$
where we define \(\phi := \frac{\lVert \cdot \rVert^{2}}{2} {}-{} f\). This means that, when searching for the \(\sup\) of \((\mathrm{D})\), we may restrict our attention to pairs of the form \(f = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \phi\) and \(g = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \phi^{\star}\) for some function \(\phi\), since replacing \(g\) by this point-wise largest admissible choice can only increase the objective. Hence we have
$$\begin{align}
\mathrm{D} (\mu, \nu) &= \sup_{(f, g) \in \mathcal{D} (\mu, \nu)} \left\{ \int f \ \mathrm{d} \mu + \int g \ \mathrm{d} \nu \right\} \\
&= \sup_{\phi \in L^{1} (\mu)} \left\{ \int \left( \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \phi \right) \mathrm{d} \mu + \int \left( \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \phi^{\star} \right) \mathrm{d} \nu \right\} \\
&= \int \frac{\lVert \cdot \rVert^{2}}{2} \mathrm{d} \mu + \int \frac{\lVert \cdot \rVert^{2}}{2} \mathrm{d} \nu {}-{} \inf_{\phi \in L^{1} (\mu)} \left\{ \int \phi \ \mathrm{d} \mu + \int \phi^{\star} \ \mathrm{d} \nu \right\},
\end{align}$$
and thus it suffices to consider the problem
$$\begin{align}
\mathrm{S} (\mu, \nu) &:= \inf_{\phi \in L^{1} (\mu)} \left\{ \int \phi \ \mathrm{d} \mu + \int \phi^{\star} \ \mathrm{d} \nu \right\}.
\end{align}$$
Following this idea, we prove strong duality as follows.
Theorem 1.18 (Strong duality). Suppose that there exists a CCP, \(\mu\)-a.e. differentiable function \(\varphi\) such that \((\boldsymbol{X}, \nabla \varphi (\boldsymbol{X})) \sim \pi^{\star}\) for \(\boldsymbol{X} \sim \mu\). Then we have \(\mathrm{D} (\mu, \nu) = \mathrm{P} (\mu, \nu)\), and the functions \(f = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi\) and \(g = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi^{\star}\) attain the optimum of \((\mathrm{D})\), where \(\varphi^{\star}\) is the Fenchel conjugate of \(\varphi\).
Proof. Consider \(\varphi\) such that \((\boldsymbol{X}, \nabla \varphi (\boldsymbol{X})) \sim \pi^{\star}\) for \(\boldsymbol{X} \sim \mu\). We can observe that
$$\varphi (\boldsymbol{x}) + \varphi^{\star} (\nabla \varphi (\boldsymbol{x})) = \langle \nabla \varphi (\boldsymbol{x}), \boldsymbol{x} \rangle $$
by the equality condition of the Fenchel-Young inequality, where \(\nabla \varphi (\boldsymbol{x})\) uniquely exists \(\mu\)-a.e. Therefore we have
$$\begin{align}
\begin{aligned}
\int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^2}{2} \mathrm{d} \pi^{\star} &= \int \frac{\lVert \boldsymbol{x} {}-{} \nabla \varphi (\boldsymbol{x}) \rVert^2}{2} \mathrm{d} \mu \\
&= \int \left( \frac{\lVert \boldsymbol{x} \rVert^{2}}{2} {}-{} \langle \nabla \varphi (\boldsymbol{x}), \boldsymbol{x} \rangle + \frac{\lVert \nabla \varphi (\boldsymbol{x}) \rVert^{2}}{2} \right) \mathrm{d} \mu \\
&= \int \left( \frac{\lVert \boldsymbol{x} \rVert^{2}}{2} {}-{} \varphi (\boldsymbol{x}) {}-{} \varphi^{\star} (\nabla \varphi (\boldsymbol{x})) + \frac{\lVert \nabla \varphi (\boldsymbol{x}) \rVert^{2}}{2} \right) \mathrm{d} \mu \\
&= \int f (\boldsymbol{x}) \mathrm{d} \mu + \int g (\nabla \varphi (\boldsymbol{x})) \mathrm{d} \mu \\
&= \int f(\boldsymbol{x}) \mathrm{d} \mu + \int g (\boldsymbol{y}) \mathrm{d} \nu,
\end{aligned} \tag{$\heartsuit$}
\end{align}$$
where we define \(f = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi\) and \(g = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi^{\star}\). Now we are only left to check that \((f, g) \in \mathcal{D}(\mu, \nu)\). By the Fenchel-Young inequality, we have
$$\begin{align}
\begin{aligned}
f(\boldsymbol{x}) + g(\boldsymbol{y}) &= \frac{\lVert \boldsymbol{x} \rVert^{2}}{2} + \frac{\lVert \boldsymbol{y} \rVert^{2}}{2} {}-{} \varphi (\boldsymbol{x}) {}-{} \varphi^{\star} (\boldsymbol{y}) \\
&\le \frac{\lVert \boldsymbol{x} \rVert^{2}}{2} + \frac{\lVert \boldsymbol{y} \rVert^{2}}{2} {}-{} \langle \boldsymbol{x}, \boldsymbol{y} \rangle \\
&= \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2}.
\end{aligned}
\tag{$\clubsuit$}
\end{align}$$
Moreover, since \(f = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi\) and the convex function \(\varphi\) is lower-bounded by an affine function, \(f\) is upper-bounded by a quadratic, so \(f_{+} := \max \{ f, 0 \}\) satisfies \(f_{+} \in L^{1} (\mu)\) (recall that \(\mu\) has a finite second moment, as is standard for \(\mathcal{W}_{2}\)). Similarly, since \(g = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi^{\star}\) and \(\varphi^{\star}\) is also convex, we have \(g_{+} := \max \{ g, 0 \} \in L^{1} (\nu)\). Finally, \(\int f \ \mathrm{d} \mu + \int g \ \mathrm{d} \nu = \int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^2}{2} \mathrm{d} \pi^{\star} \ge 0\) together with \(\int g \ \mathrm{d} \nu \le \int g_{+} \ \mathrm{d} \nu < \infty\) yields \(\int f \ \mathrm{d} \mu > {-} \infty\) and
$$\int |f| \ \mathrm{d} \mu = 2 \int f_{+} \ \mathrm{d} \mu {}-{} \int f \ \mathrm{d} \mu < \infty,$$
which assures \(f \in L^{1} (\mu)\) (here we used \(|f| = 2 f_{+} {}-{} f\)). Similarly we also have \(g \in L^{1} (\nu)\), and therefore \((f, g) \in \mathcal{D}(\mu, \nu)\). \(\square\)
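For discrete measures, strong duality can be checked directly (an illustrative sketch using SciPy's LP solver on toy data, not the proof technique above): the primal Kantorovich LP and its dual attain the same value.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, m, d = 4, 4, 2
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))
mu, nu = np.full(n, 1 / n), np.full(m, 1 / m)
C = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

# Primal (P): minimize <C, pi> over couplings pi with marginals mu, nu.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1
for j in range(m):
    A_eq[n + j, j::m] = 1
primal = linprog(C.ravel(), A_eq=A_eq,
                 b_eq=np.concatenate([mu, nu]), bounds=(0, None))

# Dual (D): maximize <f, mu> + <g, nu> subject to f_i + g_j <= C_ij.
# Row (i, j) of A_eq.T picks out exactly f_i + g_j.
dual = linprog(-np.concatenate([mu, nu]), A_ub=A_eq.T, b_ub=C.ravel(),
               bounds=(None, None))

print(primal.fun, -dual.fun)   # the two optimal values coincide
```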
Remark. There are also ways to prove strong duality without relying on Brenier’s theorem; see [BLS12, Kel84]. Specifically, [Kel84] shows that strong duality holds in the more general case where \(\mathcal{X}, \mathcal{Y}\) are Polish spaces and the cost function \(c(\boldsymbol{x}, \boldsymbol{y})\) is lower semi-continuous, and [BLS12] show a stronger notion of duality for relaxed cases where we only transfer a portion \(1 - \epsilon\) of the mass from \(\mu\) to \(\nu\).
Part III. Final Touch (3 → 1)
Now we conclude the proof of Theorem 1.13 with the easiest direction 3 → 1.
Lemma 1.19. Suppose that \(\mathrm{D} (\mu, \nu) = \mathrm{P} (\mu, \nu)\), and the functions \(f = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi\) and \(g = \frac{\lVert \cdot \rVert^{2}}{2} {}-{} \varphi^{\star}\) attain the optimum of \((\mathrm{D})\), where \(\varphi\) is a CCP function such that \((\boldsymbol{X}, \nabla \varphi (\boldsymbol{X})) \sim \pi^{\star}\) for \(\boldsymbol{X} \sim \mu\). Then \(\pi^{\star}\) attains the optimum of \((\mathrm{P})\).
Proof. From \((\heartsuit)\) and \((\clubsuit)\), which follow from our choice of \(\varphi\), \(f\), and \(g\), we can compute
$$\begin{align}
\int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} \mathrm{d} \pi^{\star} &= \int f \ \mathrm{d} \mu + \int g \ \mathrm{d} \nu \\
&= \int \left( f (\boldsymbol{x}) + g (\boldsymbol{y}) \right) \mathrm{d} \pi \\
&\le \int \frac{\lVert \boldsymbol{x} {}-{} \boldsymbol{y} \rVert^{2}}{2} \mathrm{d} \pi
\end{align}$$
for any choice of \(\pi \in \Pi (\mu, \nu)\), where the second equality holds because \(\pi\) has marginals \(\mu\) and \(\nu\), and the inequality follows from \((\clubsuit)\). \(\square\)
Today we have shown the fundamental theorem of OT (Theorem 1.13), which states that the optimal solution to \((\mathrm{P})\) can be characterized through convex gradients (Brenier’s theorem, Theorem 1.16), and that strong duality holds between \((\mathrm{P})\) and \((\mathrm{D})\) (Theorem 1.18).
References
[BLS12] M. Beiglböck, C. Léonard, and W. Schachermayer. A general duality theorem for the Monge-Kantorovich transport problem. Studia Mathematica 209.2 (2012): 151-167.
[Bre87] Y. Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs. C. R. Acad. Sci. Paris Sér. I Math., 305(19):805–808, 1987.
[CNR25] S. Chewi, J. Niles-Weed, and P. Rigollet. Statistical optimal transport. École d’Été de Probabilités de Saint-Flour XLIX, 2025.
[Kel84] H.G. Kellerer. Duality theorems for marginal problems. Z. Wahrscheinlichkeitstheorie verw Gebiete 67, 399–432 (1984).
[Roc66] R. T. Rockafellar. Characterization of the subdifferentials of convex functions. Pacific J. Math. 17 (1966), 497–510.