The generative-modeling problem

We have samples from p_data. We want more. What machinery do we need?

What we are actually trying to do

Concretely: you hand me a finite set {x⁽¹⁾, …, x⁽ᴺ⁾} drawn i.i.d. from an unknown distribution p_data(x). I should give you a procedure that produces fresh samples that look like they came from the same distribution. Not a likelihood model, not a density estimator — a sampler.

That word choice matters. Modeling the density p(x) at every point of ℝ^D is dramatically harder than producing one sample per call. A density model has to normalize over the whole space; a sampler only has to walk to plausible spots. Every method in this series is a sampler first and a density model second (or never).

The 2D toy we’ll use

The figure below shows two interleaved crescents (“two moons”): bimodal, non-Gaussian, and curved. The code in data.py standardizes the samples to zero mean and unit variance per axis so that the prior N(0, I) and the data live on the same scale. If you skip standardization, the noise scale and the data scale disagree and training silently fails. That is gotcha #1.
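To make the standardization step concrete, here is a minimal sketch of what such a data file might do. The contents of data.py are not shown in this lesson, so the helper names make_two_moons and standardize, the noise level, and the crescent offsets are all assumptions chosen for illustration.

```python
import math
import random

def make_two_moons(n, noise=0.05, seed=0):
    """Return n 2D points on two interleaved half-circles (hypothetical helper)."""
    rng = random.Random(seed)
    points = []
    for i in range(n):
        angle = rng.uniform(0.0, math.pi)
        if i % 2 == 0:
            x, y = math.cos(angle), math.sin(angle)              # upper crescent
        else:
            x, y = 1.0 - math.cos(angle), 0.5 - math.sin(angle)  # lower crescent, shifted
        points.append((x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise)))
    return points

def standardize(points):
    """Shift and scale each axis to zero mean and unit variance."""
    n = len(points)
    axes = []
    for axis in range(2):
        values = [p[axis] for p in points]
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        axes.append([(v - mean) / std for v in values])
    return list(zip(axes[0], axes[1]))

moons = standardize(make_two_moons(2000))
```

After this step each axis of moons has the same mean and variance as a draw from N(0, I), which is the "same scale" condition the paragraph above asks for.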

Why “just fit a neural net to p_data” doesn’t work

The naive plan: pick a parametric family p_θ(x), maximize log-likelihood on the training set. Two obstacles:

Normalization. A neural net outputs a raw score f_θ(x) ∈ ℝ, not a probability; nothing forces the exponentiated scores to integrate to 1 over the whole space. The standard fix is to exponentiate and divide by the total, p_θ(x) = exp(f_θ(x)) / Z(θ), but that total, Z(θ) = ∫ exp(f_θ(x)) dx, is an integral over the entire space, which is intractable for a general net. Energy-based models live here and pay this price.
Sampling. Even if you somehow had p_θ, being able to score a point is not the same as being able to produce one. Drawing fresh samples from a general high-dimensional density is its own hard problem. MCMC mixes badly in multimodal landscapes (each mode is a metastable basin it rarely escapes); rejection sampling needs a proposal that already looks like the target, and in high dimensions even a slight mismatch crushes the acceptance rate.

Intuition · linear unpacking

Claim: the harmless-looking constant Z(θ) is what makes the “just normalize a neural net” plan intractable.

Why we need it. The net’s raw output f_θ(x) ranks points but doesn’t obey the one rule a probability must: the values have to sum to 1 over the whole space. Z(θ) is just “the current total,” and dividing by it forces that rule to hold.
Why it’s hard. Computing the total means adding up exp(f_θ(x)) at every point of an enormous space: a D-dimensional integral with no closed form. You can’t visit that many points, and you can’t shortcut it for a general net.
Why it won’t go away. Z depends on θ, so every gradient step changes it. You can’t compute it once and forget it; it sits inside the loss and follows you through training.

Central point. Turning an arbitrary neural net into a normalized density forces you to pay for the global total Z(θ) on every step — and that bill is exactly what the path-based methods in this series refuse to pick up.
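A small numerical sketch of why the total is the expensive part. The function f below is a toy stand-in for a network output (an assumption, chosen so the exact answer is known), and the grid bounds and resolution are arbitrary. In one dimension brute-force quadrature is easy; the last two lines count how many evaluations the same grid would need in D dimensions.

```python
import math

def f(x):
    """Toy unnormalized log-density; stands in for a network output f_theta(x)."""
    return -0.5 * x * x

def grid_partition_1d(lo=-8.0, hi=8.0, n=2001):
    """Riemann-sum estimate of Z = integral of exp(f(x)) dx on a 1D grid."""
    dx = (hi - lo) / (n - 1)
    return sum(math.exp(f(lo + i * dx)) for i in range(n)) * dx

z_1d = grid_partition_1d()  # for this toy f the exact value is sqrt(2 * pi)

# The same grid resolution in D dimensions needs points_per_axis ** D evaluations.
points_per_axis = 100
grid_sizes = {d: points_per_axis ** d for d in (1, 2, 10, 100)}
```

At 100 points per axis, a 10-dimensional grid already needs 10^20 evaluations, and all of them would have to be redone whenever θ changes.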

Intuition · linear unpacking

Claim: knowing the density p_θ still doesn’t tell you how to draw samples from it — and the usual workarounds (MCMC, rejection) break down in high dimensions.

Two different questions. Evaluating a density answers a local question: “how much probability is piled up right here, at this point?” Sampling is a global task: find all the regions that matter and visit them in the right proportions. A height-meter that reports the altitude wherever you stand never tells you where the mountains are.
Why MCMC mixes badly. MCMC explores by taking small local steps — it readily wanders within a high-probability region but only rarely accepts a move that crosses a low-probability stretch. So once it settles into one mode it sits there: the low-probability valleys between separated modes act like walls. Each mode becomes a metastable basin the chain almost never leaves, so the samples over-represent whichever mode you happened to start in.
Why rejection sampling collapses. Rejection sampling draws from an easy proposal q and keeps a fraction of the draws so the survivors look like the target. That only works if q already overlaps the target well. In D dimensions a small per-axis mismatch compounds: a modest per-dimension envelope factor raised to the D-th power becomes exponentially large, so the acceptance rate (its reciprocal) becomes exponentially tiny and you reject essentially everything.
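The rejection-sampling collapse can be worked out exactly in a toy case. This sketch assumes the target is N(0, I) and the proposal is N(0, σ²I) with σ = 1.2, a deliberately mild mismatch; both choices are illustrative. The density ratio peaks at the origin with value σ^D, so the tightest envelope constant is M = σ^D and the acceptance rate is 1/M.

```python
import math
import random

def acceptance_rate(dim, sigma=1.2):
    """Exact acceptance rate for target N(0, I) and proposal N(0, sigma^2 I)."""
    return sigma ** (-dim)

def simulate_acceptance(dim, sigma=1.2, n=20000, seed=0):
    """Monte Carlo check of the same rate by actually running rejection sampling."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(n):
        x = [rng.gauss(0.0, sigma) for _ in range(dim)]
        sq_norm = sum(v * v for v in x)
        # p(x) / (M q(x)) simplifies to exp(-0.5 * |x|^2 * (1 - 1 / sigma^2)).
        accept_prob = math.exp(-0.5 * sq_norm * (1.0 - 1.0 / sigma ** 2))
        if rng.random() < accept_prob:
            accepted += 1
    return accepted / n

rates = {d: acceptance_rate(d) for d in (1, 2, 10, 100, 1000)}
empirical_2d = simulate_acceptance(2)
```

A 20% per-axis mismatch costs little in 2D, but rates[100] is already on the order of 10^-8: roughly one accepted draw per hundred million proposals.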

Central point. A density model tells you where probability is high at the points you query; it does not tell you how to find all those points in the first place. Sampling is the harder, global half of the problem — and it is exactly the half the rest of this series is built to solve.
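The "finding the points" problem shows up even in one dimension. Below is a minimal random-walk Metropolis sketch on a two-mode mixture; the mode separation, width, step size, and chain length are assumptions picked to make the effect obvious. The target puts half its mass on each side of zero, yet a chain started in the left mode never reports the right one.

```python
import math
import random

def log_density(x, sep=5.0, std=0.5):
    """Unnormalized log-density of an equal mixture of N(-sep, std^2) and N(+sep, std^2)."""
    a = -0.5 * ((x + sep) / std) ** 2
    b = -0.5 * ((x - sep) / std) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log-sum-exp

def metropolis(n_steps, start, step=0.3, seed=0):
    """Random-walk Metropolis: propose a small Gaussian move, accept with prob min(1, p'/p)."""
    rng = random.Random(seed)
    x, chain = start, []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_density(proposal) - log_density(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        chain.append(x)
    return chain

chain = metropolis(20000, start=-5.0)
fraction_right = sum(1 for x in chain if x > 0) / len(chain)  # true value is 0.5
```

Every density evaluation in the chain is correct; the failure is purely that local moves never discover the second mode.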

VAEs and normalizing flows fix this by constraining the architecture: VAEs introduce a tractable latent and a Jensen-bound surrogate (the ELBO); flows force f_θ to be invertible with a tractable Jacobian. Both work; both pay an architectural tax. Diffusion and flow matching keep the architecture free and pay a process tax instead.

The reframe: density paths

Here is the move. Don’t try to model p_data directly. Instead, build a one-parameter family of distributions {p_t}_{t ∈ [0, 1]} that connects a tractable prior to the target:

p₀ = N(0, I), p₁ = p_data, p_t for t ∈ (0, 1) is anything you like that interpolates.

Now learn a local operator — at each point x and time t, what should I do to move probability mass along the path? Two natural choices for that operator:

Method	Operator the net predicts	How sampling moves
DDPM	noise ε mixed into the current noisy sample	subtract the predicted noise (with stochasticity)
Flow matching	velocity v at this point in time	step in the direction of v by dt

Why is this easier? Because each p_t looks Gaussian-ish within a small neighborhood, even when p_data is multimodal and complicated. The net only has to learn a smooth field over (x, t); it never has to confront the global density. That is the whole conceptual win.

Interactive · pick a path, watch a particle drift

The widget below shows the linear interpolation path that flow matching will use in lesson 5: x_t = (1 − t) x₀ + t x₁ with x₀ ∼ N(0, I) and x₁ ∼ p_data. Drag t: the cloud morphs from the standard normal at t = 0 to the two moons at t = 1. Each particle moves on a straight line. That is going to matter a lot.

A linear density path

Each grey dot is one (x₀, x₁) pair frozen at the start; orange dots are the same particles at the current t. Watch a Gaussian deform into two moons in real time.

The grey lines are the trajectories x_t = (1 − t)·x₀ + t·x₁ traced through time. On this path they are straight, and the velocity along each trajectory is the constant x₁ − x₀. Straight trajectories are cheap to integrate, which is a large part of why flow matching can sample in roughly 50 steps where vanilla DDPM uses around 1000.
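The straight-line claim is easy to check numerically. This sketch uses one prior draw and one placeholder data point (the value of x1 is made up for illustration) and compares a finite-difference slope of x_t against x₁ − x₀.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight line from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity(x0, x1):
    """Time derivative of interpolate: constant in t, equal to x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # prior sample
x1 = [1.3, -0.4]                                 # placeholder for a data sample

# Finite-difference check: the slope between two nearby times matches x1 - x0.
t, dt = 0.37, 1e-3
xa, xb = interpolate(x0, x1, t), interpolate(x0, x1, t + dt)
finite_diff = [(b - a) / dt for a, b in zip(xa, xb)]
```

Changing t to any other value in [0, 1) gives the same finite_diff, which is what "constant velocity" means for a single (x₀, x₁) pair.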

The path as a 3D stack

One way to see the path of distributions: plot p_t as slices stacked along a time axis. Below, each translucent disk is one snapshot of the cloud at a particular t; the disks are placed in oblique 3D projection along t. The bundle shows in one glance what the slider above shows over time.

What this lets us avoid

No likelihood normalization. We never write p_θ(x) as a normalized density. The training loss is a regression target, not a log-likelihood.
No mode-mixing. Sampling is a deterministic walk (FM) or a controlled noisy walk (DDPM); both move smoothly from the prior to data. Multimodal targets are fine — for FM, different starting x₀ generally end at different modes (the path is deterministic in x₀); for DDPM, the mode is also shaped by the intermediate noise injections, but the same “no mode-mixing” intuition holds.
No architecture constraints. The denoiser/velocity net can be an MLP, a UNet, a transformer — any function of (x, t) with the right output shape. The first time we’ll see anyone care about architecture is lesson 8 (DiT) and it’s a separable concern from the loss.
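To see what "a regression target" and "any function of (x, t)" mean in practice, here is a rough sketch of a velocity-regression loss on the linear path. It assumes the target is the straight-line velocity x₁ − x₀ and that x₀ and x₁ are paired independently; the loss actually used is developed in later lessons, and the function names, toy data, and stand-in models here are all hypothetical.

```python
import random

def flow_matching_loss(model, data, n_pairs=1000, seed=0):
    """Mean squared error between model(x_t, t) and the straight-line velocity x1 - x0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        x1 = rng.choice(data)                       # data sample
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]      # prior sample
        t = rng.random()                            # time in [0, 1)
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        target = [b - a for a, b in zip(x0, x1)]
        pred = model(x_t, t)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
    return total / n_pairs

def zero_model(x, t):
    """Trivial baseline: predicts zero velocity everywhere."""
    return [0.0 for _ in x]

def pull_to_origin_model(x, t):
    """Another arbitrary stand-in with the right output shape."""
    return [-v for v in x]

toy_data = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
loss_zero = flow_matching_loss(zero_model, toy_data)
loss_pull = flow_matching_loss(pull_to_origin_model, toy_data)
```

Nothing in flow_matching_loss inspects the model beyond calling it, and no normalizing constant appears anywhere: the loss is an average of squared errors over sampled (x₀, x₁, t) triples.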

The cost we pay

There is no free lunch — the cost is sampling

Both DDPM and FM produce a single sample by running the net many times (10–1000 forward passes). For a 256×256 image with a 1B-parameter DiT, that’s several seconds per image, vs. a GAN’s single shot. The recent diffusion literature (DDIM, EDM, consistency models, rectified flow) is largely about chipping away at this cost.
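The cost is visible in the shape of the sampling loop itself. This sketch integrates a velocity field with plain Euler steps and counts evaluations; toy_velocity is a hypothetical stand-in for a trained network (it simply points at a fixed target), and the step count of 50 is an arbitrary choice.

```python
class CountingField:
    """Wraps a velocity function and counts how many times it is evaluated."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        return self.fn(x, t)

def toy_velocity(x, t, target=(2.0, -1.0)):
    """Hypothetical stand-in for a trained net: points straight at a fixed target."""
    return [(c - v) / (1.0 - t) for v, c in zip(x, target)]

def euler_sample(field, x0, n_steps):
    """Integrate dx/dt = field(x, t) from t = 0 to t = 1 with n_steps Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for k in range(n_steps):
        v = field(x, k * dt)  # one forward pass per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

field = CountingField(toy_velocity)
sample = euler_sample(field, x0=[0.3, 0.8], n_steps=50)
```

One sample costs field.calls evaluations, here 50. With a real network each of those is a full forward pass, which is where the seconds per image come from.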

What’s next

We now have a frame: pick a path, learn a local operator. The next five lessons fill that in for DDPM (lessons 2–4) and flow matching (lessons 5–6). Lesson 7 then shows they’re the same object viewed two ways. The 2D toy from above is the one we’ll keep returning to: small enough to plot, complex enough to make every bug visible.