Sampling methods and generative modeling, including the well-known idea of diffusion models, aim to learn the data distribution underlying a given set of (often high-dimensional) data points and, ideally, to generate more samples from that distribution.

One straightforward approach to this problem is to minimize a function defined on the space of probability measures. Formally, we can write our objective as
$$ \begin{align} \min_{\mu \in \mathcal{P}_{2} (\mathbb{R}^{d})} \mathcal{F}(\mu), \tag{W} \end{align} $$
where \(\mathcal{P}_2 (\mathbb{R}^{d})\) is the collection of probability measures over \(\mathbb{R}^{d}\), often referred to as the Wasserstein space (when equipped with a certain metric called the Wasserstein distance, which we will formally define later). For instance, we might choose to set the objective simply as \(\mathcal{F}(\mu) = \mathrm{KL} (\mu \| \pi)\), where \(\pi\) is the target distribution and \(\mathrm{KL}\) is the Kullback-Leibler (KL) divergence between measure \(\mu\) and \(\pi\). If we have a way to get an approximate solution of \(\mathrm{(W)}\), then we will obtain a distribution close to the target \(\pi\) with which we could, say, happily generate more samples.
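For concreteness, recall that whenever \(\mu\) is absolutely continuous with respect to \(\pi\), the KL divergence is
$$ \mathrm{KL}(\mu \| \pi) = \int_{\mathbb{R}^{d}} \log \frac{\mathrm{d}\mu}{\mathrm{d}\pi} \, \mathrm{d}\mu, $$
which is nonnegative and equals zero if and only if \(\mu = \pi\). Hence, with this choice of objective, the unique minimizer of \((\mathrm{W})\) is exactly the target \(\pi\).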
The problem is: how can we even do this? The biggest difference between \((\mathrm{W})\) and the typical optimization problems we are fond of is that the domain \(\mathcal{P}_2 (\mathbb{R}^{d})\) of \((\mathrm{W})\) is an infinite-dimensional, non-Euclidean space with a complex structure. In fact, it is not even straightforward to see how we should define the gradient of a function like \(\mathcal{F}(\mu)\), let alone what algorithms running on this weird space should even look like!
It turns out that, surprisingly, there is a subtle but powerful way to define such a gradient, which we call the Wasserstein gradient, and from it we can define continuous-time dynamics called Wasserstein gradient flows (WGFs). WGFs are closely related to fun concepts not only from functional analysis and probability theory but also from many other fields such as PDEs, convex optimization, and Riemannian geometry. WGFs also share similarities with (and, of course, some differences from) classical gradient flows in Euclidean space, and they can be discretized in time and space to yield real-world algorithms.
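As a small preview of what "discretized in time and space" can look like: it is known that the WGF of \(\mathcal{F}(\mu) = \mathrm{KL}(\mu \| \pi)\) is the Fokker-Planck equation, and a simple time-space discretization of it gives the unadjusted Langevin algorithm (ULA). Below is a minimal sketch, assuming for illustration a standard Gaussian target \(\pi \propto e^{-V}\) with \(V(x) = x^{2}/2\); all names in the code are my own.

```python
import numpy as np

# ULA: a time discretization of the WGF of KL(mu || pi), where pi ~ exp(-V).
# Space discretization: represent mu by a cloud of particles.
def grad_V(x):
    return x  # gradient of V(x) = x^2 / 2, i.e., a standard Gaussian target

rng = np.random.default_rng(0)
h = 0.05            # step size of the time discretization
n_steps = 1000
# initialize the particle cloud far away from the target
particles = rng.normal(loc=5.0, scale=1.0, size=5000)

for _ in range(n_steps):
    noise = rng.standard_normal(particles.shape)
    # one noisy gradient step per particle: x <- x - h * grad V(x) + sqrt(2h) * xi
    particles = particles - h * grad_V(particles) + np.sqrt(2 * h) * noise

# the empirical mean and variance should now be close to those of pi (0 and 1)
print(particles.mean(), particles.var())
```

Each particle follows a noisy gradient step on \(V\), and collectively the empirical distribution of the particles approximately follows the WGF of the KL objective toward \(\pi\). We will make all of this precise later in the series.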

Image Source: Carrillo, J.A., Craig, K., Wang, L., and Wei, C. Primal Dual Methods for Wasserstein Gradient Flows. Found Comput Math 22, 389–443 (2022).
I believe that WGFs are both an interesting mathematical object and a practically useful tool whenever we want to work directly on the space of probability measures. (Although not 100% inspired by WGFs, diffusion models show very well how such theory-driven ideas can be powerful in practice.) However, learning even the very basics of WGFs is quite difficult to do on one’s own, especially for innocent non-mathematicians like me. Below is what I plan to write in the succeeding posts to help fellow noobs understand WGFs more easily.
First, we will take a look at the basics of optimal transport (OT) theory.
- Understand the Monge-Kantorovich formulation of OT problems and define Wasserstein distances as an instance of an OT problem
- Prove the fundamental theorems of OT (Brenier’s theorem and strong duality) that characterize the solutions of the OT problem defining Wasserstein distances
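As a teaser for the first bullet: the 2-Wasserstein distance between \(\mu, \nu \in \mathcal{P}_2 (\mathbb{R}^{d})\) will be defined through the Kantorovich formulation of OT with the quadratic cost,
$$ W_2^{2}(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int_{\mathbb{R}^{d} \times \mathbb{R}^{d}} \| x - y \|^{2} \, \mathrm{d}\gamma(x, y), $$
where \(\Pi(\mu, \nu)\) denotes the set of couplings of \(\mu\) and \(\nu\), i.e., joint distributions whose marginals are \(\mu\) and \(\nu\). We will define everything carefully in the corresponding posts.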
Second, we will take a look at Wasserstein gradient flows via the following steps.
- Understand the Wasserstein space \(\mathcal{P}_2 (\mathbb{R}^{d})\) and its properties
- Define the Wasserstein gradient of functions \(\mathcal{F} : \mathcal{P}_2 (\mathbb{R}^{d}) \rightarrow \mathbb{R}\)
- Construct and learn about Wasserstein gradient flows (WGF)
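To give a rough preview of where this is headed: under suitable regularity conditions, the WGF of a functional \(\mathcal{F}\) will turn out to be a curve of measures \((\mu_t)_{t \ge 0}\) solving the continuity equation
$$ \partial_t \mu_t = \nabla \cdot \left( \mu_t \nabla \frac{\delta \mathcal{F}}{\delta \mu}(\mu_t) \right), $$
where \(\delta \mathcal{F} / \delta \mu\) is the first variation of \(\mathcal{F}\). For example, for \(\mathcal{F} = \mathrm{KL}(\cdot \| \pi)\) this recovers the Fokker-Planck equation.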
Third, we will study the dynamics of WGF in the following order.
- Convergence rates of continuous-time WGF under various assumptions
- Convergence rates for time-space discretized algorithms
- Several ways to achieve acceleration
Lastly, we will explore some applications of WGF in various areas.
- Wasserstein-Fisher-Rao (WFR) gradient flow
- WGF and Diffusion Models
- WGF and Mean-field Neural Networks
(These topics are tentatively chosen and may change.)
I am currently planning a long series of around 15 posts (excluding this intro), with each bullet point in the above list covered in 1-2 posts. I will update this page whenever anything changes.
References
Finally, these are the three main references I have read to study for and write this series of posts. I would strongly recommend reading the first two; both are available online through the links. The third is a collection of lecture notes and slides from a graduate math course on optimal transport that I took in 2023. (The materials do not seem to be uploaded anywhere online, and I’m not sure I am allowed to share them with other people.)
[Che23] S. Chewi. Log-concave sampling. 2023.
[CNR25] S. Chewi, J. Niles-Weed, and P. Rigollet. Statistical optimal transport. École d’Été de Probabilités de Saint-Flour XLIX, 2025.
[KP23] Y.-H. Kim and S. Pal. PIMS graduate course on OT+GF. 2023.