The generative-modeling problem
We have samples from pdata. We want more. What machinery do we need?
What we are actually trying to do
Concretely: you hand me a finite set {x⁽¹⁾, …, x⁽ᴺ⁾} drawn i.i.d. from an unknown distribution pdata(x). I should give you a procedure that produces fresh samples that look like they came from the same distribution. Not a likelihood model, not a density estimator — a sampler.
That word choice matters. Modeling the density p(x) at every point of ℝᴰ is dramatically harder than producing one sample per call. A density model has to normalize over the whole space; a sampler only has to walk to plausible spots. Every method in this series is a sampler first and a density model second (or never).
The figure below shows two interleaved crescents (“two moons”): bimodal, non-Gaussian, and curved. The code in data.py standardizes each axis to zero mean and unit variance so that the prior N(0, I) and the data live on the same scale. If you skip standardization, the noise scale and the data scale disagree and training silently fails. That is gotcha #1.
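The standardization step can be sketched as follows. This is a minimal illustration, not the actual contents of data.py: `make_two_moons` and `standardize` are hypothetical names, and the crescent geometry and noise level are assumptions chosen to produce a similar-looking dataset.

```python
import numpy as np

def make_two_moons(n, noise=0.05, seed=0):
    """Hypothetical stand-in for the dataset: two interleaved half-circles plus jitter."""
    rng = np.random.default_rng(seed)
    n_upper = n // 2
    n_lower = n - n_upper
    theta_upper = rng.uniform(0.0, np.pi, n_upper)
    theta_lower = rng.uniform(0.0, np.pi, n_lower)
    # Upper crescent opens downward; lower crescent is shifted and flipped.
    upper = np.stack([np.cos(theta_upper), np.sin(theta_upper)], axis=1)
    lower = np.stack([1.0 - np.cos(theta_lower), 0.5 - np.sin(theta_lower)], axis=1)
    x = np.concatenate([upper, lower], axis=0)
    return x + noise * rng.normal(size=x.shape)

def standardize(x):
    """Shift and scale each axis to zero mean and unit variance."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / std, mean, std

raw = make_two_moons(2000)
data, mean, std = standardize(raw)
```

Keeping `mean` and `std` around lets you map generated samples back to the original coordinates with `samples * std + mean`.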
Why “just fit a neural net to pdata” doesn’t work
The naive plan: pick a parametric family pθ(x), maximize log-likelihood on the training set. Two obstacles:
- Normalization. A general neural net fθ(x) ∈ ℝ is not a density: it doesn’t integrate to 1. You can define pθ(x) = exp(fθ(x)) / Z(θ), but Z(θ) = ∫ exp(fθ(x)) dx is intractable. Energy-based models live here and pay this price.
- Sampling. Even if you had pθ, drawing samples from a general high-dimensional density is hard. MCMC mixes badly in multimodal landscapes (each mode is a metastable basin); rejection sampling needs a proposal that already looks like the target.
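To make the normalization obstacle concrete, here is a small sketch. The function `f` is a toy stand-in for a neural net (an assumption for illustration), and the 784-dimensional case is just an example size, roughly a 28×28 image. In one dimension a grid sum recovers Z easily; the same grid in D dimensions needs a number of evaluations that grows exponentially in D.

```python
import math
import numpy as np

def f(x):
    """Toy unnormalized log-density; a stand-in for a neural net f_theta(x)."""
    return -0.5 * x**2

# 1D: a Riemann sum over a grid recovers Z = sqrt(2 * pi) for this toy f.
grid = np.linspace(-10.0, 10.0, 2001)
dx = grid[1] - grid[0]
Z_1d = float(np.sum(np.exp(f(grid))) * dx)

def grid_evaluations(points_per_axis, dim):
    """Number of f evaluations a full tensor-product grid needs in `dim` dimensions."""
    return points_per_axis ** dim

evals_2d = grid_evaluations(100, 2)       # 10_000 evaluations: fine
evals_image = grid_evaluations(100, 784)  # 100**784 evaluations: hopeless
```

Smarter quadrature does not rescue this; the exponential dependence on dimension is the core of the problem.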
VAEs and normalizing flows fix this by constraining the architecture: VAEs introduce a tractable latent and a Jensen-bound surrogate (the ELBO); flows force fθ to be invertible with a tractable Jacobian. Both work; both pay an architectural tax. Diffusion and flow matching keep the architecture free and pay a process tax instead.
The reframe: density paths
Here is the move. Don’t try to model pdata directly. Instead, build a one-parameter family of distributions {pt}, t ∈ [0, 1], that connects a tractable prior to the target:
p0 = N(0, I) (the prior), p1 = pdata (the target).
Now learn a local operator — at each point x and time t, what should I do to move probability mass along the path? Two natural choices for that operator:
| Method | Operator the net predicts | How sampling moves |
|---|---|---|
| DDPM | noise ε added at this step | subtract the predicted noise (with stochasticity) |
| Flow matching | velocity v at this point in time | step in the direction of v by dt |
Why is this easier? Because each pt looks Gaussian-ish within a small neighborhood, even when pdata is multimodal and complicated. The net only has to learn a smooth field over (x, t); it never has to confront the global density. That is the whole conceptual win.
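The flow-matching row of the table, "step in the direction of v by dt", amounts to Euler integration. The sketch below assumes a hand-written velocity field in place of a trained net: for the toy path that shifts N(0, I) to N(MU, I), the correct field is the constant MU. The names `velocity`, `euler_sample`, and `MU` are hypothetical.

```python
import numpy as np

MU = np.array([3.0, -1.0])  # assumed toy target mean, not from the lesson

def velocity(x, t):
    """Stand-in for a trained net v_theta(x, t).

    For the path p_t = N(t * MU, I), moving every point at constant
    velocity MU transports the prior onto the target.
    """
    return np.broadcast_to(MU, x.shape)

def euler_sample(velocity_fn, x0, n_steps=50):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(1000, 2))   # samples from the prior N(0, I)
x1 = euler_sample(velocity, x0)   # samples from (approximately) the target
```

The sampler only ever queries the field at the current (x, t). With a learned net in place of `velocity`, the loop stays the same.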
Interactive · pick a path, watch a particle drift
The widget below shows the linear interpolation path that flow matching will use in lesson 5: xt = (1 − t) x0 + t x1 with x0 ∼ N(0, I) and x1 ∼ pdata. Drag t: the cloud morphs from the standard normal at t = 0 to the two moons at t = 1. Each particle moves on a straight line. That is going to matter a lot.
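The interpolation path is short enough to write out. In this sketch `x1` is a placeholder Gaussian blob rather than the two-moons data (an assumption to keep the block self-contained), and `interpolate` is a hypothetical helper name. Differentiating xt with respect to t gives x1 − x0, which does not depend on t; that is what "straight line" means here.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Linear path x_t = (1 - t) * x0 + t * x1; t is a scalar or broadcastable array."""
    return (1.0 - t) * x0 + t * x1

rng = np.random.default_rng(0)
x0 = rng.normal(size=(512, 2))        # prior samples, x0 ~ N(0, I)
x1 = rng.normal(size=(512, 2)) + 4.0  # placeholder for data samples (not two moons)

x_start = interpolate(x0, x1, 0.0)    # equals x0
x_mid = interpolate(x0, x1, 0.5)      # halfway along each segment
x_end = interpolate(x0, x1, 1.0)      # equals x1

# d x_t / dt = x1 - x0: each particle has a constant velocity along its own segment.
displacement = x1 - x0
```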
The path as a 3D stack
One way to see the path of distributions: plot pt as slices stacked along a time axis. Below, each translucent disk is one snapshot of the cloud at a particular t; the disks are placed in oblique 3D projection along t. The stack shows at a glance what the slider above shows over time.
What this lets us avoid
- No likelihood normalization. We never write pθ(x) as a normalized density. The training loss is a regression target, not a log-likelihood.
- No mode-mixing. Sampling is a deterministic walk (FM) or a controlled noisy walk (DDPM); both move smoothly from the prior to data. Multimodal targets are fine — for FM, different starting x0 generally end at different modes (the path is deterministic in x0); for DDPM, the mode is also shaped by the intermediate noise injections, but the same “no mode-mixing” intuition holds.
- No architecture constraints. The denoiser/velocity net can be an MLP, a UNet, a transformer — any function of (x, t) with the right output shape. The first time we’ll see anyone care about architecture is lesson 8 (DiT) and it’s a separable concern from the loss.
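As an illustration of the last point, the only contract is the signature: take (x, t), return something shaped like x. Below is a minimal untrained MLP in NumPy under that contract. The names `init_mlp` and `field` are hypothetical, and feeding raw t as an extra input column is a simplifying assumption; practical models often use a time embedding instead.

```python
import numpy as np

def init_mlp(dim, hidden, seed=0):
    """Random weights for a one-hidden-layer MLP over the concatenated input (x, t)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.1, size=(dim + 1, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, dim)),
        "b2": np.zeros(dim),
    }

def field(params, x, t):
    """Map x of shape (N, D) and t of shape (N,) to an output of shape (N, D)."""
    h = np.concatenate([x, t[:, None]], axis=1)       # append t as an extra column
    h = np.tanh(h @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

params = init_mlp(dim=2, hidden=32)
x = np.zeros((8, 2))
t = np.linspace(0.0, 1.0, 8)
out = field(params, x, t)   # same shape as x: one output vector per point
```

Whether the output is read as predicted noise or as a velocity is decided by the loss, not by this function.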
The cost we pay
What’s next
We now have a frame: pick a path, learn a local operator. The next five lessons fill that in for DDPM (lessons 2–4) and flow matching (lessons 5–6). Lesson 7 then shows they’re the same object viewed two ways. The 2D toy from above is the one we’ll keep returning to: small enough to plot, complex enough to make every bug visible.