
DDPM — the forward chain

A fixed noising process, designed so we never have to simulate it at training time.

The design move

We want a path of distributions from pdata back to a tractable prior. The trick: instead of designing the path directly, design a noising process that does it for us. Pick a Markov chain

q(xt | xt−1) = N( xt ;  √(1 − βt) · xt−1,  βt · I )

with small per-step variance βt > 0, run for T steps. The chain progressively adds Gaussian noise and shrinks the signal. The induced marginals q(xt) are a path that takes pdata (at t = 0) to a near-Gaussian (at t = T).
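
As a minimal sketch of what one step of this chain looks like in code (framework-free and illustrative only; `forward_step` is a hypothetical helper, not something taken from diffusion.py), here is the kernel applied repeatedly to a single toy point, using the linear schedule with T = 1000, β₁ = 1e-4, β_T = 0.02 quoted later in this lesson:

```python
import math
import random


def forward_step(x_prev, beta, rng):
    """Draw x_t ~ N(sqrt(1 - beta) * x_prev, beta * I) for one point (list of floats)."""
    shrink = math.sqrt(1.0 - beta)  # signal shrink factor
    sigma = math.sqrt(beta)         # per-step noise standard deviation
    return [shrink * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


rng = random.Random(0)
x = [1.0, -0.5]  # a toy 2-D stand-in for a data point x_0 (assumption)

# Linear beta schedule with T = 1000 steps.
betas = [1e-4 + (0.02 - 1e-4) * s / 999 for s in range(1000)]

for beta in betas:
    x = forward_step(x, beta, rng)  # simulate the chain one step at a time
```

This step-by-step loop is exactly the simulation the rest of the lesson shows how to avoid.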

Note the careful scaling. If we wrote q(xt | xt−1) = N(xt−1, βt I) (just adding noise), the variance would accumulate: each step adds βt · I, so after T steps the added noise has variance equal to the sum of all βt (Tβ · I for a constant β) on top of the data variance, and it keeps growing with T instead of settling at I. The √(1 − βt) shrink-by-a-bit factor is what keeps the chain variance-preserving: if Var(xt−1) = I, then Var(xt) = (1 − βt) · I + βt · I = I. The terminal distribution stays at unit variance regardless of T.
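
The variance bookkeeping can be checked with a few lines of arithmetic. The sketch below tracks only the per-coordinate variance (no sampling) and assumes, purely for illustration, a constant β = 0.01 over T = 1000 steps and unit-variance data; `variance_after` is a hypothetical helper.

```python
def variance_after(betas, var0=1.0, shrink=True):
    """Per-coordinate variance after running the chain, by recurrence.

    shrink=True  : x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps
    shrink=False : x_t = x_{t-1} + sqrt(beta) * eps   (just adding noise)
    """
    var = var0
    for beta in betas:
        keep = (1.0 - beta) if shrink else 1.0
        var = keep * var + beta
    return var


T = 1000
betas = [0.01] * T  # constant schedule, chosen only to keep the arithmetic obvious

vp_variance = variance_after(betas, shrink=True)         # stays at 1
additive_variance = variance_after(betas, shrink=False)  # 1 + T * 0.01 = 11
```

With the shrink factor the variance sits at its fixed point of 1; without it, the result depends directly on T and β.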

VP vs. VE
This is the variance-preserving (VP) schedule. The other family, variance-exploding (VE, used in NCSN / EDM), drops the shrink factor and lets variance grow. VE’s terminal distribution is N(0, σmax² I) with big σmax. Both work; VP is what the DDPM paper used and what diffusion.py implements.

The miracle: a closed-form marginal

Naively, to get a sample xt from q(xt | x0) we’d simulate t Gaussian steps — that’s t network-free passes, but we’d need to do it freshly for every training example at every step. For Gaussians this is unnecessary.

Composing linear-Gaussian maps stays linear-Gaussian. Let αt = 1 − βt and ᾱt = ∏s ≤ t αs. Then:

q(xt | x0) = N( xt ;  √ᾱt · x0,  (1 − ᾱt) · I )

Equivalently, with ε ∼ N(0, I):

xt = √ᾱt · x0 + √(1 − ᾱt) · ε    [Eq. ★]

One Gaussian sample, no simulation, jumps straight to any t. This is the line that makes diffusion practical — without it, training would mean simulating T steps for every minibatch element.
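
A quick numerical sanity check of Eq. ★, as a sketch: iterate the one-step kernel while tracking the coefficient on x0 and the accumulated noise variance, then compare against √ᾱt and 1 − ᾱt. The helper names are hypothetical, and the linear schedule values are the defaults quoted later in this lesson.

```python
import math


def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced betas from beta_1 to beta_T (inclusive)."""
    return [beta_1 + (beta_T - beta_1) * s / (T - 1) for s in range(T)]


def alpha_bars(betas):
    """Cumulative products of alpha_s = 1 - beta_s; entry t-1 holds alpha_bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def chain_moments(betas):
    """Iterate the one-step kernel, tracking x_t = m * x_0 + (noise with variance v)."""
    m, v, out = 1.0, 0.0, []
    for beta in betas:
        m = math.sqrt(1.0 - beta) * m  # the mean coefficient shrinks each step
        v = (1.0 - beta) * v + beta    # old noise shrinks, new noise is added
        out.append((m, v))
    return out


betas = linear_betas()
abar = alpha_bars(betas)
moments = chain_moments(betas)

# Largest disagreement between the simulated chain and the closed form, over all t.
max_gap = max(
    max(abs(m - math.sqrt(a)), abs(v - (1.0 - a)))
    for (m, v), a in zip(moments, abar)
)
```

`max_gap` should be at the level of floating-point round-off, which is the closed form and the step-by-step chain agreeing at every t.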

Derivation (induction on t)

Base case t = 1: x1 = √α1 · x0 + √β1 · ε1. Conditional on x0, this is Gaussian with mean √α1 · x0 = √ᾱ1 · x0 and variance β1 · I = (1 − ᾱ1) · I. Matches the claim.

Inductive step: assume xt−1 = √ᾱt−1 · x0 + √(1 − ᾱt−1) · η with η ∼ N(0, I). Then

xt = √αt · xt−1 + √βt · εt = √αt ( √ᾱt−1 · x0 + √(1 − ᾱt−1) · η ) + √βt · εt

= √ᾱt · x0 + ( √(αt(1 − ᾱt−1)) · η + √βt · εt )

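The bracketed term is a sum of independent zero-mean Gaussians (η depends only on ε1, …, εt−1, so it is independent of εt), hence itself Gaussian, with total variance αt(1 − ᾱt−1) + βt = αt − αt·ᾱt−1 + (1 − αt) = 1 − ᾱt, using αt·ᾱt−1 = ᾱt. It can therefore be written as √(1 − ᾱt) · ε with ε ∼ N(0, I). So xt = √ᾱt · x0 + √(1 − ᾱt) · ε as claimed. ▪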

The schedule, visualized

The two things you care about across the chain are the signal coefficient √ᾱt (how much of x0 remains in xt) and the noise coefficient √(1 − ᾱt). A good schedule keeps both above zero across most of the chain — if signal collapses to zero too early, mid-range timesteps are pure noise and the net has nothing to denoise.

Schedule explorer: what do T and (β₁, β_T) buy you?
DDPM’s default is T=1000, β₁=1e-4, β_T=0.02 linear. Drag the knobs; watch where ᾱ_t hits zero. For images (high-D data) the linear schedule collapses signal too early — cosine fixes that (lesson 9).
Readouts: ᾱ at T/4, T/2, 3T/4, and T.
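
Without the interactive widget, the same readouts can be reproduced offline. The following is a rough sketch, assuming the linear schedule described above and that ᾱ at step t is the product of the first t values of 1 − β; `readouts` and its helpers are hypothetical names.

```python
def linear_betas(T, beta_1, beta_T):
    """Linearly spaced betas from beta_1 to beta_T (inclusive)."""
    return [beta_1 + (beta_T - beta_1) * s / (T - 1) for s in range(T)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta); entry t-1 holds alpha_bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def readouts(T=1000, beta_1=1e-4, beta_T=0.02):
    """alpha_bar at a quarter, half, three quarters, and the end of the chain."""
    abar = alpha_bars(linear_betas(T, beta_1, beta_T))
    marks = {"T/4": T // 4, "T/2": T // 2, "3T/4": 3 * T // 4, "T": T}
    return {name: abar[t - 1] for name, t in marks.items()}


defaults = readouts()                # DDPM default: T=1000, 1e-4 -> 0.02
gentler = readouts(beta_T=0.01)      # smaller final beta keeps more signal

for name, value in defaults.items():
    print(f"alpha_bar at {name}: {value:.6f}")
```

Changing `T`, `beta_1`, or `beta_T` plays the role of dragging the knobs: a larger T or β_T pushes ᾱ toward zero earlier in the chain.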

Interactive · drive the chain on the two moons

Below is Eq. ★ applied to actual two-moons samples. Pull the t slider: as t rises, the two crescents wash out into a Gaussian. The opposite direction is what the network will learn to do.

Forward jump x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
Same N=2000 points throughout; the only thing changing is t. Notice the orange dots stay on the same scale across t — that’s the variance-preserving property.
Readouts at the start of the chain (ᾱ_t = 1): √ᾱ_t (signal) = 1.000; √(1−ᾱ_t) (noise) = 0.000; empirical Var (avg axis) = 1.000.
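
For readers without the widget, here is a rough offline version of the same experiment. It is a sketch under stated assumptions: the two-moons generator is a hand-rolled stand-in (two noisy half-circles), the points are standardized to unit variance per axis (which the readout of 1.000 suggests but the lesson does not state), and all function names are hypothetical.

```python
import math
import random


def two_moons(n, rng, noise=0.05):
    """Two interleaved noisy half-circles (a stand-in for the lesson's dataset)."""
    pts = []
    for i in range(n):
        theta = rng.uniform(0.0, math.pi)
        if i % 2 == 0:
            x, y = math.cos(theta), math.sin(theta)              # upper moon
        else:
            x, y = 1.0 - math.cos(theta), 0.5 - math.sin(theta)  # lower moon
        pts.append((x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise)))
    return pts


def axis_stats(pts, axis):
    """Mean and (population) variance of one coordinate."""
    col = [p[axis] for p in pts]
    mean = sum(col) / len(col)
    return mean, sum((c - mean) ** 2 for c in col) / len(col)


def standardize(pts):
    """Shift and scale each axis to zero mean and unit variance."""
    stats = [axis_stats(pts, axis) for axis in range(2)]
    return [tuple((p[a] - stats[a][0]) / math.sqrt(stats[a][1]) for a in range(2))
            for p in pts]


def q_jump(pts, abar_t, rng):
    """Eq. ★: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, per point."""
    signal, sigma = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [tuple(signal * v + sigma * rng.gauss(0.0, 1.0) for v in p) for p in pts]


def avg_axis_var(pts):
    """Empirical variance averaged over the two axes."""
    return sum(axis_stats(pts, axis)[1] for axis in range(2)) / 2.0


rng = random.Random(0)
x0 = standardize(two_moons(2000, rng))  # same N = 2000 points throughout

# Empirical variance at several noise levels, from clean data to pure noise.
variances = {abar: avg_axis_var(q_jump(x0, abar, rng)) for abar in (1.0, 0.5, 0.1, 0.0)}
```

Each entry of `variances` should land close to 1, up to sampling noise: the point cloud changes shape as ᾱ_t falls but not scale.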

The path as a heatmap

Above we visualized one slice of the path at a chosen t. Here’s the whole path at once: project the moons to their x-coordinate, bin the histogram, and stack the histograms left-to-right as columns of a heatmap. Each column is q(xt) (marginalized over y) at one timestep. Reading left-to-right is the forward process; reading right-to-left is what the reverse model has to learn.

Density evolution: q(xt) projected onto the x-coordinate, over t
Left edge = t = 0 (the two moons projected to one dim — you can see the two humps). Right edge = t = T (a single Gaussian hump centered at 0). The transition between them is the path.
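
The heatmap construction can be sketched as follows. To keep it short, a hypothetical two-hump 1-D mixture stands in for the projected moons, the schedule is the linear default, and the bin range [−3, 3] and the 50-step spacing between columns are arbitrary choices, not the widget's settings.

```python
import math
import random


def linear_alpha_bars(T=1000, beta_1=1e-4, beta_T=0.02):
    """alpha_bar_t for a linear beta schedule; entry t-1 holds alpha_bar_t."""
    out, prod = [], 1.0
    for s in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * s / (T - 1))
        out.append(prod)
    return out


def histogram(values, lo=-3.0, hi=3.0, bins=30):
    """Count values per equal-width bin; values outside [lo, hi) are dropped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


rng = random.Random(0)

# Stand-in for the x-coordinates of the data: two humps near -1 and +1.
x0 = [rng.choice((-1.0, 1.0)) + rng.gauss(0.0, 0.3) for _ in range(2000)]

abar = linear_alpha_bars()
ts = list(range(0, 1001, 50))  # one heatmap column per chosen timestep

columns = []
for t in ts:
    a = 1.0 if t == 0 else abar[t - 1]  # t = 0 is the clean data
    xt = [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]
    columns.append(histogram(xt))       # one column of the heatmap
```

`columns[0]` shows two peaks with an empty middle, while `columns[-1]` has a single peak at the center; plotting `columns` as an image with t on the horizontal axis gives the picture described above.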

Try the cosine schedule: with linear-β at high T the right side of the heatmap pins to pure noise too fast — most of the heatmap is wasted “already-Gaussian” columns. Cosine spreads the transition over more of the path.

Three claims to take away

  1. Eq. ★ is the entire forward process. Code-wise that’s the four lines of DDPM.q_sample: index ᾱ at the chosen t, multiply x0 by its square root, add √(1 − ᾱt) · noise.
  2. Training picks t uniformly. t ~ U{0, …, T−1} in DDPM.loss. We don’t weight by anything — that’s a deliberate choice (the “L_simple” choice) we’ll defend in lesson 3.
  3. The forward process is not learned. It’s a fixed schedule baked in as buffers (register_buffer in diffusion.py). Only the denoiser’s weights are trained.
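
The three claims fit in one small class. What follows is a framework-free sketch, not the actual diffusion.py: `ForwardProcess` and its methods are hypothetical names that only mirror the roles of DDPM.q_sample and the timestep draw in DDPM.loss, and a plain Python list stands in for the registered buffer. It assumes t indexes the ᾱ table directly, so t = 0 is the first noising step.

```python
import math
import random


class ForwardProcess:
    """Hypothetical stand-in for the fixed (untrained) forward process."""

    def __init__(self, T=1000, beta_1=1e-4, beta_T=0.02):
        self.T = T
        self.alpha_bar = []  # fixed lookup table; nothing here is ever trained
        prod = 1.0
        for s in range(T):
            prod *= 1.0 - (beta_1 + (beta_T - beta_1) * s / (T - 1))
            self.alpha_bar.append(prod)

    def sample_t(self, rng):
        """Claim 2: draw t uniformly from {0, ..., T-1}, with no weighting."""
        return rng.randrange(self.T)

    def q_sample(self, x0, t, noise):
        """Claim 1: Eq. ★ for one point; x0 and noise are equal-length lists."""
        a = self.alpha_bar[t]                      # index alpha_bar at t
        signal, sigma = math.sqrt(a), math.sqrt(1.0 - a)
        return [signal * v + sigma * e for v, e in zip(x0, noise)]


rng = random.Random(0)
fp = ForwardProcess()

x0 = [0.3, -1.2]                            # toy data point (assumption)
t = fp.sample_t(rng)                        # random timestep for this example
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # eps ~ N(0, I)
xt = fp.q_sample(x0, t, noise)              # jump straight to x_t
```

In a tensor framework the table would be built once and stored as a non-trainable buffer, and `q_sample` would index it with a batch of timesteps instead of a single integer.
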
Punchline
We have built a path: q(xt | x0) = N(√ᾱt x0, (1 − ᾱt) I). It costs nothing to sample any point on it. Next lesson: invert the path. Given xt, what is xt−1 — and what should the net learn to predict to make that reversal work?