DDPM — the forward chain
A fixed noising process, designed so we never have to simulate it at training time.
The design move
We want a path of distributions from pdata back to a tractable prior. The trick: instead of designing the path directly, design a noising process that does it for us. Pick a Markov chain
q(xt | xt−1) = N(xt; √(1 − βt) · xt−1, βt I)
with small per-step variance βt > 0, and run it for T steps. Each step shrinks the signal slightly and adds Gaussian noise. The induced marginals q(xt) form a path that takes pdata (at t = 0) to a near-Gaussian (at t = T).
Note the careful scaling. If we wrote q(xt | xt−1) = N(xt−1, βt I) (just adding noise), the variance would grow with every step: with a constant β, the noise accumulated after T steps has variance Tβ · I rather than I. The √(1 − βt) shrink factor is what keeps the chain variance-preserving: if Var(xt−1) = I, then Var(xt) = (1 − βt) · I + βt · I = I. Unit variance is a fixed point of the update, so the terminal distribution stays near unit variance regardless of T.
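The scaling argument can be checked with a few lines of arithmetic. The sketch below tracks only the per-coordinate variance under the two update rules. The constant β = 0.02, T = 1000, and the starting variance 0.25 are illustrative assumptions, not values taken from diffusion.py.

```python
# Track per-coordinate variance under two update rules.
# beta, T and the starting variance are illustrative assumptions.

def variance_after(T, beta, var0, shrink):
    """Return Var(x_T) when each step adds N(0, beta) noise.

    shrink=True  uses x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps
    shrink=False uses x_t = x_{t-1} + sqrt(beta) * eps
    """
    var = var0
    for _ in range(T):
        scale = (1.0 - beta) if shrink else 1.0  # squared signal coefficient
        var = scale * var + beta
    return var

T, beta, var0 = 1000, 0.02, 0.25
var_preserving = variance_after(T, beta, var0, shrink=True)
var_additive = variance_after(T, beta, var0, shrink=False)

print(f"with shrink:    {var_preserving:.4f}")  # approaches 1
print(f"without shrink: {var_additive:.4f}")    # var0 + T * beta
```

With the shrink factor the variance is pulled toward 1 from any starting value; without it, every step adds β and nothing removes it.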
diffusion.py implements this chain.
The miracle: a closed-form marginal
Naively, to get a sample xt from q(xt | x0) we’d simulate t Gaussian steps. No network is involved, but the loop would have to be rerun for every training example at every training step. For Gaussians this is unnecessary.
Composing linear-Gaussian maps stays linear-Gaussian. Let αt = 1 − βt and ᾱt = ∏s ≤ t αs. Then:
q(xt | x0) = N(xt; √ᾱt · x0, (1 − ᾱt) I)
Equivalently, with ε ∼ N(0, I):
xt = √ᾱt · x0 + √(1 − ᾱt) · ε    (★)
One Gaussian sample, no simulation, jumps straight to any t. This is the line that makes diffusion practical — without it, training would mean simulating T steps for every minibatch element.
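A minimal, dependency-free sketch of the one-shot jump follows. It is not the DDPM.q_sample from diffusion.py: the function names, the list-based representation of a data point, and the linear β schedule (1e-4 to 0.02 over 1000 steps) are all assumptions made for illustration.

```python
import math
import random
from itertools import accumulate
from operator import mul

def make_alpha_bar(betas):
    """Cumulative product of alpha_s = 1 - beta_s, one entry per step."""
    return list(accumulate((1.0 - b for b in betas), mul))

def q_sample(x0, t, alpha_bar, noise):
    """Draw x_t given x_0 in one shot (t is a 0-based index into alpha_bar).

    x_t = sqrt(alpha_bar[t]) * x_0 + sqrt(1 - alpha_bar[t]) * noise
    """
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    return [signal * x + sigma * n for x, n in zip(x0, noise)]

# Assumed schedule: T = 1000 steps, beta rising linearly from 1e-4 to 0.02.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = make_alpha_bar(betas)

rng = random.Random(0)
x0 = [0.5, -1.0]                              # one 2-D data point
noise = [rng.gauss(0.0, 1.0) for _ in x0]     # eps ~ N(0, I)
x_mid = q_sample(x0, 499, alpha_bar, noise)   # no loop over earlier steps
print(x_mid)
```

The cost of q_sample does not depend on t, which is the point: the cumulative products are computed once and indexed afterwards.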
Derivation (induction on t)
Base case t = 1: x1 = √α1 · x0 + √β1 · ε1. Conditional on x0, this is Gaussian with mean √α1 · x0 = √ᾱ1 · x0 and variance β1 · I = (1 − ᾱ1) · I. Matches the claim.
Inductive step: assume xt−1 = √ᾱt−1 · x0 + √(1 − ᾱt−1) · η with η ∼ N(0, I). Then
xt = √αt · xt−1 + √βt · εt = √αt ( √ᾱt−1 · x0 + √(1 − ᾱt−1) · η ) + √βt · εt
= √ᾱt · x0 + ( √(αt(1 − ᾱt−1)) · η + √βt · εt )
The bracketed term is a sum of independent Gaussians with total variance αt(1 − ᾱt−1) + βt = αt − αt·ᾱt−1 + (1 − αt) = 1 − ᾱt. So xt = √ᾱt · x0 + √(1 − ᾱt) · ε as claimed. ▪
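The induction can also be checked numerically. The sketch below applies the one-step update to the coefficient on x0 and to the variance of the accumulated noise, then compares both against the closed form at every step. The schedule is an assumption; any βt in (0, 1) should behave the same way.

```python
import math

# Assumed schedule for the check; any betas in (0, 1) would do.
T = 200
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

signal = 1.0      # coefficient on x_0 (x_0 itself has coefficient 1)
noise_var = 0.0   # variance of the accumulated noise term
alpha_bar = 1.0   # running product of alpha_s

max_gap = 0.0
for beta in betas:
    alpha = 1.0 - beta
    # One chain step: x_t = sqrt(alpha) * x_{t-1} + sqrt(beta) * eps_t
    signal = math.sqrt(alpha) * signal
    noise_var = alpha * noise_var + beta
    # Closed form predicts sqrt(alpha_bar) and 1 - alpha_bar.
    alpha_bar *= alpha
    gap = max(abs(signal - math.sqrt(alpha_bar)),
              abs(noise_var - (1.0 - alpha_bar)))
    max_gap = max(max_gap, gap)

print(f"largest disagreement over {T} steps: {max_gap:.2e}")
```

Any disagreement should be at the level of floating-point rounding, since the two computations are algebraically identical.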
The schedule, visualized
The two things you care about across the chain are the signal coefficient √ᾱt (how much of x0 remains in xt) and the noise coefficient √(1 − ᾱt). A good schedule keeps both above zero across most of the chain — if signal collapses to zero too early, mid-range timesteps are pure noise and the net has nothing to denoise.
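To see the two coefficients for a concrete schedule, the sketch below tabulates them at a few timesteps. The linear schedule (β from 1e-4 to 0.02, T = 1000) is an assumption; diffusion.py may use different endpoints, a different T, or a different shape entirely.

```python
import math

def schedule_coefficients(betas):
    """Return two lists: sqrt(alpha_bar_t) and sqrt(1 - alpha_bar_t)."""
    signal, noise = [], []
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        signal.append(math.sqrt(alpha_bar))
        noise.append(math.sqrt(1.0 - alpha_bar))
    return signal, noise

# Assumed linear schedule; the real one may differ.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
signal, noise = schedule_coefficients(betas)

for t in (0, 249, 499, 749, 999):
    print(f"t={t:4d}  signal={signal[t]:.3f}  noise={noise[t]:.3f}")
```

The squares of the two coefficients always sum to 1, so a schedule is fully described by how quickly the signal column falls.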
Interactive · drive the chain on the two moons
Below is Eq. ★ applied to actual two-moons samples. Pull the t slider: as ᾱt drops, the two crescents wash out into a Gaussian. The opposite direction is what the network will learn to do.
The path as a heatmap
Above we visualized one slice of the path at a chosen t. Here’s the whole path at once: project the moons to their x-coordinate, bin the histogram, and stack the histograms left-to-right as columns of a heatmap. Each column is q(xt) (marginalized over y) at one timestep. Reading left-to-right is the forward process; reading right-to-left is what the reverse model has to learn.
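The construction just described can be sketched without any plotting library. Everything named here is hypothetical: the two_moons sampler is a simple stand-in for whatever dataset code the lesson uses, and the schedule, bin count, and bin range are illustrative choices. Because the noise is isotropic, noising the x-coordinate alone gives the same marginal as noising the full point and then projecting.

```python
import math
import random

def two_moons(n, rng, jitter=0.05):
    """Hypothetical two-moons sampler: two interleaved half-circles."""
    points = []
    for i in range(n):
        angle = rng.uniform(0.0, math.pi)
        if i % 2 == 0:
            x, y = math.cos(angle), math.sin(angle)
        else:
            x, y = 1.0 - math.cos(angle), 0.5 - math.sin(angle)
        points.append((x + rng.gauss(0.0, jitter), y + rng.gauss(0.0, jitter)))
    return points

def path_heatmap(xs, betas, n_bins, lo, hi, rng):
    """Return heat[t][b]: count of noised x-coordinates in bin b at step t."""
    heat = []
    alpha_bar = 1.0
    width = (hi - lo) / n_bins
    for beta in betas:
        alpha_bar *= 1.0 - beta
        signal, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        column = [0] * n_bins
        for x in xs:
            xt = signal * x + sigma * rng.gauss(0.0, 1.0)
            b = int((xt - lo) / width)
            column[min(max(b, 0), n_bins - 1)] += 1  # clamp outliers to edge bins
        heat.append(column)
    return heat

rng = random.Random(0)
xs = [x for x, _ in two_moons(500, rng)]  # project onto the x-axis
T = 100
betas = [1e-4 + (0.1 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
heat = path_heatmap(xs, betas, n_bins=40, lo=-4.0, hi=4.0, rng=rng)
print(len(heat), len(heat[0]))
```

Here heat is stored one timestep per row, so it would need to be transposed before plotting to get timesteps as columns.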
Three claims to take away
- Eq. ★ is the entire forward process. Code-wise that’s the four lines of DDPM.q_sample: index ᾱ at the chosen t, multiply x0 by its square root, add √(1 − ᾱt) · noise.
- Training picks t uniformly. t ~ U{0, …, T−1} in DDPM.loss. We don’t weight by anything; that’s a deliberate choice (the “L_simple” choice) we’ll defend in lesson 3.
- The forward process is not learned. It’s a fixed schedule baked in as buffers (register_buffer in diffusion.py). Only the denoiser’s weights are trained.
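The three claims fit together in a single training step, sketched below in plain Python. This is not DDPM.loss from diffusion.py, which the text does not show. It assumes the loss is the mean squared error between the true and predicted noise (the usual meaning of “L_simple”), and the denoiser argument is a hypothetical callable standing in for the network.

```python
import math
import random

def simple_loss(x0, alpha_bar, denoiser, rng):
    """One-sample estimate of an unweighted noise-prediction loss.

    denoiser(x_t, t) is a hypothetical callable returning a noise estimate.
    """
    T = len(alpha_bar)
    t = rng.randrange(T)                        # t ~ U{0, ..., T-1}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]     # target noise
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    x_t = [signal * x + sigma * e for x, e in zip(x0, eps)]  # closed-form jump
    eps_hat = denoiser(x_t, t)
    # Plain mean squared error, with no per-timestep weight.
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)

# Fixed (non-learned) schedule, computed once up front. Values are assumed.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

rng = random.Random(0)
x0 = [0.5, -1.0]
zero_denoiser = lambda x_t, t: [0.0] * len(x_t)  # stand-in for a network
print(simple_loss(x0, alpha_bar, zero_denoiser, rng))
```

Only the denoiser would carry trainable parameters in a real implementation; alpha_bar is computed once and never updated.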