DDPM — ancestral sampling
A trained εθ in hand, how do we get from N(0, I) to a sample?
The reverse-step formula
Last lesson we picked pθ(xt−1 | xt) = N(μθ, σt2 I) and derived the predicted mean via the ε-parameterization:
To generate, start at xT ∼ N(0, I) and at each step sample
with σt = 0 at the last step (no point adding noise to the final output). That is the entire sampler. The cost is T forward passes through the network, sequentially.
Where does Eq. ♦ come from, intuitively?
Two equivalent reads:
- From the forward equation. Eq. ⋆ says xt = √ᾱt x0 + √(1 − ᾱt) ε. Solve for x0, plug into the true posterior mean μ̃t(xt, x0), simplify. The β/√(1−ᾱ) coefficient and the 1/√α prefactor fall out.
- As score subtraction. The score of q(xt | x0) is −ε / √(1 − ᾱt). The Langevin-style update xt−1 ≈ xt + (βt/2) · ∇x log q(xt) + √βt z is the same line, up to the 1/√αt rescale that preserves variance. (This is why “score-based” and “denoising-diffusion” are the same model in two languages.)
What variance to use? Two valid choices
Two reverse-process variances pass the math:
| Choice | Formula | Comment |
|---|---|---|
| upper-bound (DDPM default) | σt2 = βt | matches noise added at step t. Simple, slightly noisier samples than the tight choice. |
| posterior variance (tighter) | σt2 = β̃t = βt·(1 − ᾱt−1) / (1 − ᾱt) | the actual Var(xt−1 | xt, x0). Slightly less stochastic; same asymptotic samples. |
Both are valid: Ho et al. 2020 §3.2 note these are the two “extreme” reverse variances corresponding to the deterministic x0 case (lower bound β̃t) and the maximum-entropy case (upper bound βt). The optimal variance for an arbitrary x0-entropy regime sits between them; either endpoint gives the same ELBO bound up to a constant. diffusion.py uses the simpler σt2 = βt. Learning σt (the “improved DDPM” trick, Nichol & Dhariwal 2021) gives modest sample-quality gains and is the “learned sigma” that most production codebases use today.
Interactive · drive the sampler with a fake denoiser
We don’t need a trained model to see the sampler in action — replace εθ with an oracle that knows the closed-form expected noise given xt and a fixed target distribution. Below, the “target” is the two moons; the oracle computes 𝔼[ε | xt] by averaging over plausible x0’s from a kernel-density estimate. Watch noise denoise.
3D · the trajectory bundle
The flat scatter above hides the fact that each particle has a history — a trajectory through time from noise to data. Below is the same sampler with time drawn as the third axis. Each curve is one particle’s (x, y, t) trace. Notice how the bundle is wide at t = T (everyone starts near Gaussian) and pinches into the two moons at t = 0. The wiggles are the stochastic σt z term; set σ-choice to “zero” for noise-free trajectories (still useful — that’s DDIM with η = 0).
Why T = 1000 hurts
Each step is one forward pass through εθ. For a 1B-parameter DiT-XL/2 on a single H100, one forward at 256×256 is ~30 ms. 30 s per image. Generation is 100× slower than a comparably-sized GAN. The literature’s response:
| Trick | Steps | What it does |
|---|---|---|
| DDIM (Song et al. 2021) | 50–100 | noise-free deterministic reverse; same training, different sampler |
| DPM-Solver (Lu et al. 2022) | 10–20 | higher-order ODE solver on the same network |
| Distillation (Salimans & Ho 2022) | 1–4 | train a student that maps directly to K-step ahead |
| Consistency models (Song et al. 2023) | 1–2 | net learns to jump to x0 from any xt |
| Flow matching | 20–50 | straight conditional paths → simple Euler suffices (lesson 6) |
This series’ DDPM.sample is the simple ancestral version. For a 2D toy it’s fast even at T=1000.
One way to see why straight paths win — the curvature problem
The reverse process Eq. ♦ is locally linear in xt: take a step in the score direction, perturb with noise. If the true trajectory x(t) from prior to data is curved, the linear step at xt hops off the trajectory; we need more steps to recover. The VP path (DDPM’s) is curved. The linear path (FM’s) is straight by construction. That’s the entire argument for fewer FM steps.