DDPM — ancestral sampling

With a trained ε_θ in hand, how do we get from N(0, I) to a sample?

The reverse-step formula

Last lesson we picked p_θ(x_t−1 | x_t) = N(μ_θ, σ_t² I) and derived the predicted mean via the ε-parameterization:

μ_θ(x_t, t) = (1/√α_t) · ( x_t − (β_t / √(1 − ᾱ_t)) · ε_θ(x_t, t) ) [Eq. ♦]

To generate, start at x_T ∼ N(0, I) and at each step sample

x_t−1 = μ_θ(x_t, t) + σ_t · z, z ∼ N(0, I)

with σ_t = 0 at the last step, t = 1 (no point adding noise to the final output). That is the entire sampler. The cost is T sequential forward passes through the network.
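The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the series' DDPM.sample: the linear β schedule, the helper names (make_schedule, ancestral_sample, toy_eps_model) and the stand-in denoiser are all assumptions made for the example. The stand-in uses the fact that when the clean data is itself N(0, I), the exact expected noise is √(1 − ᾱ_t) · x_t.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed default); index i holds step t = i + 1."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def ancestral_sample(eps_model, shape, T=1000, seed=0):
    """Run x_T -> x_0 with sigma_t^2 = beta_t and no noise at the last step."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(shape)               # x_T ~ N(0, I)
    for i in reversed(range(T)):                 # i = T-1, ..., 0  <->  t = T, ..., 1
        eps_hat = eps_model(x, i + 1)
        # Eq. (diamond): predicted mean of x_{t-1}
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        sigma = np.sqrt(betas[i]) if i > 0 else 0.0
        x = mean + sigma * rng.standard_normal(shape)
    return x

# Hypothetical stand-in for a trained network: the exact E[eps | x_t]
# when the clean data is N(0, I), which is sqrt(1 - abar_t) * x_t.
_, _, ALPHA_BARS = make_schedule()

def toy_eps_model(x, t):
    return np.sqrt(1.0 - ALPHA_BARS[t - 1]) * x

samples = ancestral_sample(toy_eps_model, shape=(5000, 2))
```

With this stand-in every intermediate x_t stays approximately N(0, I), so the output should have mean near 0 and standard deviation near 1. Swapping in a real network changes only the eps_model argument.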

Where does Eq. ♦ come from, intuitively?

Two equivalent reads:

From the forward equation. Eq. ⋆ says x_t = √ᾱ_t x₀ + √(1 − ᾱ_t) ε. Solve for x₀, plug into the true posterior mean μ̃_t(x_t, x₀), simplify. The β/√(1−ᾱ) coefficient and the 1/√α prefactor fall out.
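As score subtraction. The score of q(x_t | x₀) is −ε / √(1 − ᾱ_t), and averaging it over the x₀ that could have produced x_t gives the marginal score ∇_x log q(x_t) = −𝔼[ε | x_t] / √(1 − ᾱ_t), which is what ε_θ estimates. Substituting into Eq. ♦ gives μ_θ = (1/√α_t) · ( x_t + β_t · ∇_x log q(x_t) ): a Langevin-style step up the score, followed by the √β_t z noise, with the 1/√α_t rescale preserving variance. The step size on the score is β_t rather than plain Langevin's β_t/2, because half of it holds the sample at the current noise level and the other half moves it to the less noisy one. (This is why “score-based” and “denoising-diffusion” are the same model in two languages.)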
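The first read can be checked numerically. The sketch below assumes a linear β schedule and an arbitrary step t = 500 (both choices are mine, not the lesson's), builds x_t from the forward equation, and compares the true posterior mean μ̃_t(x_t, x₀) with Eq. ♦ evaluated at the true noise.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)            # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

i = 499                                          # array index for step t = 500
beta_t, alpha_t = betas[i], alphas[i]
abar_t, abar_prev = alpha_bars[i], alpha_bars[i - 1]

x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps    # forward equation

# True posterior mean of q(x_{t-1} | x_t, x0).
posterior_mean = (
    np.sqrt(abar_prev) * beta_t / (1.0 - abar_t) * x0
    + np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t) * x_t
)

# Eq. (diamond) evaluated with the true noise in place of eps_theta.
eq_diamond = (x_t - beta_t / np.sqrt(1.0 - abar_t) * eps) / np.sqrt(alpha_t)

max_gap = np.max(np.abs(posterior_mean - eq_diamond))
```

The two expressions agree to floating-point precision, which is the "simplify" step done by the machine rather than by hand.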

Intuition · linear unpacking

Claim: predicting the noise ε is secretly predicting which way “uphill toward real data” points, so the reverse step is just a nudge in that direction.

What the score is. ∇_x log q(x_t) is the direction in which the noisy point becomes more likely — it points toward the higher-density, less-noised regions (the blurred data manifold at this noise level, which sharpens toward the real data as t shrinks). Follow it and you climb out of the noise.
Noise is the score, flipped. Because x_t = √ᾱ_t x₀ + √(1−ᾱ_t) ε, the leftover noise ε is exactly the vector pointing away from data. So −ε (up to a scale) is the uphill direction. A net that predicts ε has therefore learned the score for free.
The reverse step is one nudge. “Move a little way uphill, then jostle with a bit of fresh noise so you don’t collapse onto a single point”: that is the Langevin line, and it becomes Eq. ♦ once the uphill step size is set to β_t and the result is rescaled by 1/√α_t to keep the variance honest.

Central point. “Subtract the predicted noise” and “take a step up the data-likelihood hill” are the same instruction — which is why the noise-prediction and score-based stories describe one model.
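The "noise is the score, flipped" claim can be verified in a case where everything is in closed form. The sketch below assumes 1-D Gaussian data x₀ ∼ N(0, s²) with arbitrary values for s² and ᾱ_t (both hypothetical), so that q(x_t) is Gaussian and 𝔼[ε | x_t] follows from ordinary Gaussian conditioning. It compares a finite-difference score with −𝔼[ε | x_t] / √(1 − ᾱ_t).

```python
import numpy as np

abar_t = 0.3        # any value in (0, 1); chosen arbitrarily for the demo
s2 = 0.25           # assumed variance of the clean 1-D data, x0 ~ N(0, s2)

var_t = abar_t * s2 + (1.0 - abar_t)             # Var(x_t) under the forward equation

def log_q(x):
    """Log-density of the noisy marginal q(x_t) = N(0, var_t)."""
    return -0.5 * x**2 / var_t - 0.5 * np.log(2.0 * np.pi * var_t)

def expected_eps(x):
    """E[eps | x_t] for jointly Gaussian (eps, x_t): Cov(eps, x_t) / Var(x_t) * x_t."""
    return np.sqrt(1.0 - abar_t) / var_t * x

x = np.linspace(-2.0, 2.0, 9)
h = 1e-5
score_fd = (log_q(x + h) - log_q(x - h)) / (2.0 * h)      # finite-difference score
score_from_eps = -expected_eps(x) / np.sqrt(1.0 - abar_t)

max_gap = np.max(np.abs(score_fd - score_from_eps))
```

For non-Gaussian data the same identity holds, but 𝔼[ε | x_t] no longer has a one-line formula, which is the job the network takes over.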

What variance to use? Two valid choices

Two reverse-process variances pass the math:

Choice	Formula	Comment
upper-bound (DDPM default)	σ_t² = β_t	matches noise added at step t. Simple, slightly noisier samples than the tight choice.
posterior variance (tighter)	σ_t² = β̃_t = β_t·(1 − ᾱ_t−1) / (1 − ᾱ_t)	the actual Var(x_t−1 | x_t, x₀). Slightly less stochastic; same asymptotic samples.

Why are there two answers, and why is picking either one fine? The mean tells you where to step; the variance only says how much to jitter once you’re there. Ho et al. 2020 §3.2 note that the “right” amount of jitter depends on something the sampler doesn’t know, namely how spread out the clean data x₀ was, so they bracket it with two extremes. The lower bound β̃_t is optimal when the data is a single fixed point (least jitter); the upper bound β_t is optimal when the data is x₀ ∼ N(0, I) (most jitter). For a real dataset scaled to roughly unit variance, the optimal value sits between them, and because only the jitter differs, never the mean, Ho et al. found both endpoints give comparable sample quality in practice. (They are not identical likelihood bounds, since the bracket is loose, which is exactly why learning the variance later squeezes out a little more.)

So you just pick one. diffusion.py uses the simpler σ_t² = β_t. Learning σ_t (the “improved DDPM” trick, Nichol & Dhariwal 2021) mainly improves log-likelihood and lets the sampler run with far fewer steps at little cost in sample quality; it is the “learned sigma” option found in many codebases today.
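Both schedules are one line each once ᾱ_t is available. The sketch below assumes a linear β schedule with T = 1000 (an assumption for the example; it does not reproduce diffusion.py) and uses the convention ᾱ_0 = 1, which makes β̃_1 = 0.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)            # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))   # abar_0 = 1

sigma2_upper = betas                                          # sigma_t^2 = beta_t
sigma2_posterior = betas * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)  # beta_tilde_t

# Ratio beta_tilde_t / beta_t: 0 at t = 1, close to 1 at large t.
ratio = sigma2_posterior / sigma2_upper
```

The two choices differ most at small t, where little noise remains, and are nearly identical over most of the trajectory, which is consistent with the small visible difference between them.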

Intuition · linear unpacking

Claim: two different reverse-step variances are both “correct” because the variance is a free knob whose ideal setting depends on data the sampler never sees.

The step has two parts. x_t−1 = μ_θ + σ_t z: a deterministic aim (μ_θ) plus a random jitter (σ_t z). The aim is what the network learned; the jitter size is what we’re choosing here.
The ideal jitter is unknowable. The exactly-right jitter is the spread of x_t−1 given x_t, and that depends on how varied the original clean data was — information the sampler doesn’t have mid-run.
So bracket it. Pretend the data was one fixed point and you get the smallest sensible jitter β̃_t; pretend it was a unit Gaussian N(0, I) and you get the largest, β_t. For data scaled to about unit variance, the real answer is somewhere in between.
Either bracket is “legal.” Because only the jitter differs — never the aim — both endpoints are principled and, empirically, give comparable samples. They are not exactly equal likelihood bounds (learning the variance does a touch better), but the difference is small. The visible effect is cosmetic: β_t samples look a touch noisier, β̃_t a touch smoother.

Central point. Since the perfect variance hinges on data the sampler can’t access at generation time, the pragmatic move is to pick either principled endpoint — neither is exactly optimal, and the visible difference is only how grainy the output looks.
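The bracket can be made concrete for 1-D Gaussian data x₀ ∼ N(0, s²), where the true Var(x_t−1 | x_t) has a closed form from Gaussian conditioning. The sketch below assumes a linear β schedule and a hypothetical helper true_reverse_variance; it checks that s² = 0 recovers β̃_t, s² = 1 recovers β_t, and an intermediate s² lands between them. Data with variance above 1 would exceed β_t, so the bracket presumes data scaled to at most unit variance.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)            # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))

def true_reverse_variance(s2):
    """Var(x_{t-1} | x_t) for 1-D data x0 ~ N(0, s2), for every step t."""
    var_prev = alpha_bars_prev * s2 + (1.0 - alpha_bars_prev)   # Var(x_{t-1})
    var_t = alphas * var_prev + betas                           # Var(x_t)
    return betas * var_prev / var_t                             # Gaussian conditioning

beta_tilde = betas * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)

point_mass = true_reverse_variance(0.0)     # data is a single point
unit_gauss = true_reverse_variance(1.0)     # data is N(0, 1)
in_between = true_reverse_variance(0.25)    # less spread than N(0, 1)
```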

Interactive · drive the sampler with a fake denoiser

We don’t need a trained model to see the sampler in action — replace ε_θ with an oracle that knows the closed-form expected noise given x_t and a fixed target distribution. Below, the “target” is the two moons; the oracle computes 𝔼[ε | x_t] by averaging over plausible x₀’s from a kernel-density estimate. Watch noise denoise.
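One way such an oracle could be written is sketched below. It is a simplification of what the text describes: instead of a smoothed kernel-density estimate it treats the target as an empirical distribution over a finite set of points (a KDE with zero bandwidth), and the two-arc generator is a hypothetical stand-in for the two-moons dataset. The names make_two_arcs and oracle_eps are illustrative only.

```python
import numpy as np

def make_two_arcs(n, rng):
    """Hypothetical two-moons-like target: two interleaved half circles plus jitter."""
    theta = rng.uniform(0.0, np.pi, size=n)
    upper = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    lower = np.stack([1.0 - np.cos(theta), 0.5 - np.sin(theta)], axis=1)
    pts = np.where(rng.random((n, 1)) < 0.5, upper, lower)
    return pts + 0.05 * rng.standard_normal((n, 2))

def oracle_eps(x_t, abar_t, data):
    """E[eps | x_t] when x0 is drawn uniformly from the rows of `data`."""
    # Posterior weight of each data point: q(x_t | x0_j) = N(sqrt(abar) x0_j, (1 - abar) I).
    diff = x_t[:, None, :] - np.sqrt(abar_t) * data[None, :, :]   # (batch, n_data, dim)
    logw = -0.5 * np.sum(diff**2, axis=-1) / (1.0 - abar_t)
    logw -= logw.max(axis=1, keepdims=True)                       # stabilise the softmax
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    x0_hat = w @ data                                             # E[x0 | x_t]
    return (x_t - np.sqrt(abar_t) * x0_hat) / np.sqrt(1.0 - abar_t)

rng = np.random.default_rng(0)
data = make_two_arcs(500, rng)
x_t = rng.standard_normal((8, 2))
eps_hat = oracle_eps(x_t, abar_t=0.5, data=data)
```

A function like this can be passed wherever the sampler expects ε_θ, provided the caller supplies ᾱ_t for the current step.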

3D · the trajectory bundle

The flat scatter above hides the fact that each particle has a history: a trajectory through time from noise to data. Below is the same sampler with time drawn as the third axis. Each curve is one particle’s (x, y, t) trace. Notice how the bundle is wide at t = T (everyone starts near Gaussian) and pinches into the two moons at t = 0. The wiggles are the stochastic σ_t z term; set σ-choice to “zero” for noise-free trajectories (still useful: this is the deterministic limit that DDIM with η = 0 makes precise, although DDIM also adjusts the mean when it removes the noise).
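One subtlety about the noise-free setting is worth checking numerically. In the DDIM update of Song et al. (2021), η scales the injected noise and also shifts the deterministic part, so η = 1 reproduces the Eq. ♦ mean while η = 0 gives a slightly different one. The sketch below assumes a linear β schedule and an arbitrary step; ddim_step and ddpm_mean are illustrative names, and the noise argument z is passed as zeros so that only the deterministic parts are compared.

```python
import numpy as np

def ddim_step(x_t, eps_hat, abar_t, abar_prev, eta, z):
    """One DDIM update (Song et al. 2021); eta = 0 is fully deterministic."""
    sigma = eta * np.sqrt((1.0 - abar_prev) / (1.0 - abar_t) * (1.0 - abar_t / abar_prev))
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
    direction = np.sqrt(1.0 - abar_prev - sigma**2) * eps_hat
    return np.sqrt(abar_prev) * x0_hat + direction + sigma * z

def ddpm_mean(x_t, eps_hat, beta_t, abar_t):
    """Eq. (diamond), the ancestral-sampling mean."""
    return (x_t - beta_t / np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(1.0 - beta_t)

betas = np.linspace(1e-4, 0.02, 1000)            # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
i = 499                                          # array index for step t = 500
rng = np.random.default_rng(0)
x_t, eps_hat = rng.standard_normal(3), rng.standard_normal(3)
zero = np.zeros(3)

eq_diamond = ddpm_mean(x_t, eps_hat, betas[i], alpha_bars[i])
ddim_eta1_mean = ddim_step(x_t, eps_hat, alpha_bars[i], alpha_bars[i - 1], eta=1.0, z=zero)
ddim_eta0 = ddim_step(x_t, eps_hat, alpha_bars[i], alpha_bars[i - 1], eta=0.0, z=zero)

gap_eta1 = np.max(np.abs(ddim_eta1_mean - eq_diamond))   # agrees with Eq. (diamond)
gap_eta0 = np.max(np.abs(ddim_eta0 - eq_diamond))        # small but nonzero
```

The per-step difference at η = 0 is small, so trajectories drawn with the noise simply switched off look qualitatively like DDIM ones, but the two updates are not identical unless the mean is adjusted as well.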

Why T = 1000 hurts

Each step is one forward pass through ε_θ. For a DiT-XL/2 (about 675M parameters) on a single H100, one forward at 256×256 is ~30 ms, so T = 1000 steps cost 1000 × 30 ms = 30 s per image. A comparably sized GAN needs a single forward pass, so generation is orders of magnitude slower. The literature’s response:

Trick	Steps	What it does
DDIM (Song et al. 2021)	50–100	noise-free deterministic reverse; same training, different sampler
DPM-Solver (Lu et al. 2022)	10–20	higher-order ODE solver on the same network
Distillation (Salimans & Ho 2022)	1–4	train a student whose single step matches several teacher steps, then repeat
Consistency models (Song et al. 2023)	1–2	net learns to jump to x₀ from any x_t
Flow matching	20–50	straight conditional paths → simple Euler suffices (lesson 6)
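To put the step counts in wall-clock terms, the back-of-envelope arithmetic can be written out. The sketch assumes the ~30 ms per forward pass quoted above, one network evaluation per step, and the same network cost for every method, none of which holds exactly in practice; it takes the upper end of each range in the table.

```python
MS_PER_FORWARD = 30.0   # the ~30 ms per forward pass quoted above

# Upper end of each step range from the table; 1000 is plain ancestral DDPM.
steps_per_sampler = {
    "DDPM (ancestral)": 1000,
    "DDIM": 100,
    "Flow matching": 50,
    "DPM-Solver": 20,
    "Distillation": 4,
    "Consistency models": 2,
}

# Seconds per image, assuming one network evaluation per step.
seconds_per_image = {
    name: steps * MS_PER_FORWARD / 1000.0 for name, steps in steps_per_sampler.items()
}

for name, seconds in seconds_per_image.items():
    print(f"{name:20s} {seconds:6.2f} s")
```

Under these assumptions the range runs from 30 s for the ancestral sampler down to a fraction of a second for the few-step methods.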

This series’ DDPM.sample is the simple ancestral version. For a 2D toy it’s fast even at T=1000.

One way to see why straight paths win — the curvature problem

The reverse process Eq. ♦ is locally linear in x_t: take a step in the score direction, perturb with noise. If the true trajectory x(t) from prior to data is curved, the linear step at x_t hops off the trajectory; we need more steps to recover. The VP path (DDPM’s) is curved. The linear path (FM’s) is straight by construction. That’s the entire argument for fewer FM steps.
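The curvature argument can be seen in a toy setting. The sketch below follows a single conditional path for one fixed pair (x₀, ε) instead of the learned marginal field, and it uses a trigonometric VP-style schedule √ᾱ(t) = cos(πt/2), √(1 − ᾱ(t)) = sin(πt/2) chosen for convenience; both are assumptions of the example. It integrates each path's velocity from t = 1 to t = 0 with explicit Euler and reports how far the endpoint lands from x₀.

```python
import numpy as np

def euler_endpoint_error(velocity, x_start, x_target, n_steps):
    """Integrate dx/dt = velocity(t) from t = 1 down to t = 0 with explicit Euler."""
    x = x_start.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        x = x - dt * velocity(t)
    return np.linalg.norm(x - x_target)

rng = np.random.default_rng(0)
x0 = np.array([1.0, -0.5])          # a single "data" point (toy assumption)
eps = rng.standard_normal(2)        # the noise sample the path starts from

# Curved VP-style path: x(t) = cos(pi t / 2) x0 + sin(pi t / 2) eps.
def vp_velocity(t):
    return 0.5 * np.pi * (-np.sin(0.5 * np.pi * t) * x0 + np.cos(0.5 * np.pi * t) * eps)

# Straight path: x(t) = (1 - t) x0 + t eps, so the velocity is constant.
def straight_velocity(t):
    return eps - x0

for n in (1, 5, 50):
    err_vp = euler_endpoint_error(vp_velocity, eps, x0, n)
    err_straight = euler_endpoint_error(straight_velocity, eps, x0, n)
    print(f"steps={n:3d}  curved error={err_vp:.4f}  straight error={err_straight:.2e}")
```

On the straight path a single Euler step already lands on x₀ up to rounding, while the error on the curved path shrinks only as the step count grows. Real flow-matching models learn a marginal field whose trajectories are not perfectly straight, so the gap in practice is smaller than this idealized picture suggests.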

Punchline

Sampling is Eq. ♦ in a loop. The cost is one network forward per step, and the number of steps is set by how curved the path is. DDPM’s VP path is curved, so 1000 steps; flow matching’s linear path is straight, so 50. We’ll build that next.