generative_continuous / 04 · DDPM sampling lesson 4 / 15

DDPM — ancestral sampling

A trained εθ in hand, how do we get from N(0, I) to a sample?

The reverse-step formula

Last lesson we picked pθ(xt−1 | xt) = N(μθ, σt2 I) and derived the predicted mean via the ε-parameterization:

μθ(xt, t) = (1/√αt) · ( xt − (βt / √(1 − ᾱt)) · εθ(xt, t) )   [Eq. ♦]

To generate, start at xT ∼ N(0, I) and at each step sample

xt−1 = μθ(xt, t) + σt · z,   z ∼ N(0, I)

with σt = 0 at the last step (no point adding noise to the final output). That is the entire sampler. The cost is T forward passes through the network, sequentially.

Where does Eq. ♦ come from, intuitively?

Two equivalent reads:

  • From the forward equation. Eq. ⋆ says xt = √ᾱt x0 + √(1 − ᾱt) ε. Solve for x0, plug into the true posterior mean μ̃t(xt, x0), simplify. The β/√(1−ᾱ) coefficient and the 1/√α prefactor fall out.
  • As score subtraction. The score of q(xt | x0) is −ε / √(1 − ᾱt). The Langevin-style update xt−1 ≈ xt + (βt/2) · ∇x log q(xt) + √βt z is the same line, up to the 1/√αt rescale that preserves variance. (This is why “score-based” and “denoising-diffusion” are the same model in two languages.)

What variance to use? Two valid choices

Two reverse-process variances pass the math:

ChoiceFormulaComment
upper-bound (DDPM default)σt2 = βtmatches noise added at step t. Simple, slightly noisier samples than the tight choice.
posterior variance (tighter)σt2 = β̃t = βt·(1 − ᾱt−1) / (1 − ᾱt)the actual Var(xt−1 | xt, x0). Slightly less stochastic; same asymptotic samples.

Both are valid: Ho et al. 2020 §3.2 note these are the two “extreme” reverse variances corresponding to the deterministic x0 case (lower bound β̃t) and the maximum-entropy case (upper bound βt). The optimal variance for an arbitrary x0-entropy regime sits between them; either endpoint gives the same ELBO bound up to a constant. diffusion.py uses the simpler σt2 = βt. Learning σt (the “improved DDPM” trick, Nichol & Dhariwal 2021) gives modest sample-quality gains and is the “learned sigma” that most production codebases use today.

Interactive · drive the sampler with a fake denoiser

We don’t need a trained model to see the sampler in action — replace εθ with an oracle that knows the closed-form expected noise given xt and a fixed target distribution. Below, the “target” is the two moons; the oracle computes 𝔼[ε | xt] by averaging over plausible x0’s from a kernel-density estimate. Watch noise denoise.

Ancestral sampler — oracle ε, T sweep
Press Run. T steps of Eq. ♦. The fewer steps, the worse it looks — that’s the cost we’ll pay until DDIM/FM gives us a way out.
step (of T)
t
empirical Var (per axis)

3D · the trajectory bundle

The flat scatter above hides the fact that each particle has a history — a trajectory through time from noise to data. Below is the same sampler with time drawn as the third axis. Each curve is one particle’s (x, y, t) trace. Notice how the bundle is wide at t = T (everyone starts near Gaussian) and pinches into the two moons at t = 0. The wiggles are the stochastic σt z term; set σ-choice to “zero” for noise-free trajectories (still useful — that’s DDIM with η = 0).

Ancestral trajectories in (x, y, t) — oblique projection
Time runs upward and to the right (into the picture). Watch the bundle squeeze: ~Gaussian at the top → two moons at the bottom. Hit run to redraw with a fresh batch.

Why T = 1000 hurts

Each step is one forward pass through εθ. For a 1B-parameter DiT-XL/2 on a single H100, one forward at 256×256 is ~30 ms. 30 s per image. Generation is 100× slower than a comparably-sized GAN. The literature’s response:

TrickStepsWhat it does
DDIM (Song et al. 2021)50–100noise-free deterministic reverse; same training, different sampler
DPM-Solver (Lu et al. 2022)10–20higher-order ODE solver on the same network
Distillation (Salimans & Ho 2022)1–4train a student that maps directly to K-step ahead
Consistency models (Song et al. 2023)1–2net learns to jump to x0 from any xt
Flow matching20–50straight conditional paths → simple Euler suffices (lesson 6)

This series’ DDPM.sample is the simple ancestral version. For a 2D toy it’s fast even at T=1000.

One way to see why straight paths win — the curvature problem

The reverse process Eq. ♦ is locally linear in xt: take a step in the score direction, perturb with noise. If the true trajectory x(t) from prior to data is curved, the linear step at xt hops off the trajectory; we need more steps to recover. The VP path (DDPM’s) is curved. The linear path (FM’s) is straight by construction. That’s the entire argument for fewer FM steps.

Path curvature ↔ integration step cost
Two paths between a prior point and a target point. The straight one needs ~1 Euler step; the curved one needs many. This is in miniature what changes between FM’s 50 steps and DDPM’s 1000.

Black = true path. Orange dots = where a forward-Euler integrator with K steps actually lands. As curvature ↑ or K ↓, the dots drift off-path.

Punchline
Sampling is Eq. ♦ in a loop. The cost is one network forward per step, and the number of steps is set by how curved the path is. DDPM’s VP path is curved, so 1000 steps; flow matching’s linear path is straight, so 50. We’ll build that next.