DDPM — the forward chain

A fixed noising process, designed so we never have to simulate it at training time.

The design move

We want a path of distributions from p_data back to a tractable prior. The trick: instead of designing the path directly, design a noising process that does it for us. Pick a Markov chain

q(x_t | x_t−1) = N( x_t ; √(1 − β_t) · x_t−1, β_t · I )

with small per-step variance β_t > 0, run for T steps. The chain progressively adds Gaussian noise and shrinks the signal. The induced marginals q(x_t) are a path that takes p_data (at t = 0) to a near-Gaussian (at t = T).
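The one-step kernel can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's `diffusion.py`: the names `q_step` and `run_chain`, the constant β of 0.02, and the chain length of 200 are all assumptions chosen for readability.

```python
import math
import random

def q_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I) for a point given as a list of floats."""
    shrink = math.sqrt(1.0 - beta_t)      # scales the old sample
    noise_std = math.sqrt(beta_t)         # standard deviation of the added noise
    return [shrink * v + noise_std * rng.gauss(0.0, 1.0) for v in x_prev]

def run_chain(x0, betas, rng):
    """Apply q_step once per entry of betas and return the trajectory [x_0, ..., x_T]."""
    trajectory = [list(x0)]
    for beta_t in betas:
        trajectory.append(q_step(trajectory[-1], beta_t, rng))
    return trajectory

# Illustrative values only: a constant beta and a short chain (assumed, not from the lesson).
rng = random.Random(0)
betas = [0.02] * 200
trajectory = run_chain([2.0, -1.0], betas, rng)
print("start:", trajectory[0], "end:", trajectory[-1])
```

Running the chain this way is exactly the step-by-step simulation that training is designed to avoid; it is useful mainly as a reference to check faster routes against.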

Why the odd-looking √(1 − β_t) factor in front of x_t−1? Because we want a known destination. We are going to start generation from pure N(0, I) noise, so the chain had better land there no matter how many steps we take. If we just kept piling on noise, q(x_t | x_t−1) = N(x_t−1, β_t I), the spread would grow without bound: after T steps we’d sit at N(x₀, (Σ_t β_t) I), which is still centered on the data and has a variance that moves as we change T. The fix is to shrink the old signal by a hair at each step before adding the noise. The shrink factor is tuned so that the variance lost to shrinking exactly equals the variance we add. This is what makes the chain variance-preserving: if Var(x_t−1) = I, then Var(x_t) = (1 − β_t) · I + β_t · I = I. For unit-variance data the total spread never budges, and for any other data variance each step pulls it toward 1, so the terminal distribution sits at essentially unit variance once enough noise has accumulated, regardless of T.

Intuition · linear unpacking

Claim: the √(1 − β_t) factor isn’t cosmetic — it’s what pins the end of the chain to a fixed N(0, I) we can sample from for free.

We need a known starting line for generation. Sampling later means drawing a fresh x_T from a prior and denoising. That prior has to be something trivial to sample — standard Gaussian noise. So the forward chain’s endpoint must be N(0, I).
Plain noise-adding overshoots. Each step contributes its own variance β_t. If nothing pushes back, those variances stack up and the cloud keeps spreading — the endpoint depends on T and isn’t unit-variance.
So shrink a little before adding. Multiplying the old sample by √(1 − β_t) deflates its variance by a factor (1 − β_t). Adding noise of variance β_t tops it right back up: (1 − β_t) + β_t = 1.
The accounting closes every step. Start at variance 1, and you stay at variance 1 forever — the loss from shrinking is precisely the gain from the noise, by design.

Central point. The shrink factor is the thermostat that holds total variance at 1, so the chain always ends at the same easy-to-sample distribution. That endpoint is very close to N(0, I) but not exactly equal to it, because a tiny fraction of the original signal survives all T steps. This fixed endpoint is the only reason we can later begin generation from scratch.
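The variance bookkeeping above can be checked without any sampling, by pushing a single number (the per-coordinate variance) through the recursion. The sketch below assumes a constant β of 0.01 and a hypothetical helper name `variance_after`; neither comes from the lesson's code.

```python
def variance_after(betas, shrink, var0=1.0):
    """Track the per-coordinate variance through the chain exactly (no sampling).

    shrink=True  : x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps   (the DDPM step)
    shrink=False : x_t = x_{t-1} + sqrt(beta_t) * eps                      (plain noise-adding)
    """
    var = var0
    for beta_t in betas:
        scale_sq = (1.0 - beta_t) if shrink else 1.0   # variance scales by the square of the factor
        var = scale_sq * var + beta_t                  # independent noise adds its variance
    return var

for T in (100, 1000):
    betas = [0.01] * T                                 # assumed constant schedule, for illustration
    print(T,
          round(variance_after(betas, shrink=True), 6),
          round(variance_after(betas, shrink=False), 6))

# Starting away from unit variance, the shrinking chain is still pulled toward 1.
print(round(variance_after([0.01] * 1000, shrink=True, var0=4.0), 6))
```

With shrinking, the variance holds at 1 for both chain lengths; without it, the variance is 1 plus the sum of the β values, so it changes whenever T does.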

VP vs. VE

This is the variance-preserving (VP) schedule. The other family, variance-exploding (VE, used in NCSN / EDM), drops the shrink factor and lets variance grow. VE’s terminal distribution is approximately N(0, σ²_max I), with σ_max chosen large enough to swamp the data. Both work; VP is what the DDPM paper used and what diffusion.py implements.

The miracle: a closed-form marginal

Naively, to get a sample x_t from q(x_t | x₀) we’d simulate t Gaussian steps. No network is involved in any of them, but we’d have to repeat the simulation for every training example at every training iteration. For Gaussians this is unnecessary.

Here is the reason it’s unnecessary. Each step does the same two simple things: scale the current point a little, then add independent Gaussian noise. Chaining many such steps is still just “scale and add noise” — a sum of Gaussians is Gaussian, and scaling a Gaussian keeps it Gaussian. So all the intermediate randomness collapses into a single equivalent Gaussian that depends only on x₀ and how far we’ve travelled. We just need to track the running product of the per-step scale factors. Let α_t = 1 − β_t be the per-step survival factor and ᾱ_t = ∏_{s ≤ t} α_s the cumulative one — the fraction of the original signal’s variance still alive at step t. Then:

q(x_t | x₀) = N( x_t ; √ᾱ_t · x₀, (1 − ᾱ_t) · I )

Equivalently, with ε ∼ N(0, I):

x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε [Eq. ★]

One Gaussian sample, no simulation, jumps straight to any t. This is the line that makes diffusion practical: without it, training would mean simulating up to T steps for every minibatch element.

Intuition · linear unpacking

Claim: Eq. ★ lets you teleport from clean data x₀ to any noise level x_t in one shot, instead of walking t steps.

The naive picture is a long walk. To reach step t you’d apply the one-step rule t times — t separate noise draws and rescalings, redone for every training example.
But the whole walk is one Gaussian. Every step only scales and adds noise, so the combined effect of all t steps is itself just “keep a fraction of x₀, add one lump of noise.” Nothing about the intermediate points matters to the endpoint.
How much signal survives: √ᾱ_t, the running product of the per-step shrink factors. As t grows, ᾱ_t drifts toward 0 and the original picture fades out.
How much noise replaces it: √(1 − ᾱ_t) — the leftover, set so signal-variance plus noise-variance stays at 1 (the same variance-preserving bookkeeping as before). One draw ε ∼ N(0, I), scaled and added, reproduces the exact distribution the t-step walk would have given.

Central point. Because the chain is linear-Gaussian, the entire forward process at any timestep is a single weighted blend — √ᾱ_t of the data plus √(1 − ᾱ_t) of fresh noise — so training can sample any noise level in O(1), not O(t).
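As a sanity check on Eq. ★, the sketch below compares the one-shot draw against the step-by-step walk on a scalar x₀. It is an illustration under stated assumptions: the function names, the constant β of 0.02, and the choice x₀ = 3, t = 40 are all made up for the example, and t is 1-indexed here to match the math (t = 0 is clean data).

```python
import math
import random

def cumulative_alpha_bar(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T], the running product of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Eq. ★ for a scalar x0: one Gaussian draw, no loop over steps (t is 1-indexed)."""
    a_bar = alpha_bars[t - 1]
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps

def q_simulate(x0, t, betas, rng):
    """The slow route: apply the one-step kernel t times."""
    x = x0
    for beta in betas[:t]:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

def mean_and_var(values):
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
betas = [0.02] * 100                      # assumed constant schedule, for illustration
alpha_bars = cumulative_alpha_bar(betas)
x0, t, n = 3.0, 40, 20000

fast = mean_and_var([q_sample(x0, t, alpha_bars, rng) for _ in range(n)])
slow = mean_and_var([q_simulate(x0, t, betas, rng) for _ in range(n)])
theory = (math.sqrt(alpha_bars[t - 1]) * x0, 1.0 - alpha_bars[t - 1])
print("closed form:", fast)
print("simulated:  ", slow)
print("theory:     ", theory)
```

The two empirical (mean, variance) pairs should agree with each other and with the theoretical values up to sampling error, while the closed-form route uses one noise draw per sample instead of t.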

Show the derivation (induction on t)

Base case t = 1: x₁ = √α₁ · x₀ + √β₁ · ε₁. Conditional on x₀, this is Gaussian with mean √α₁ · x₀ = √ᾱ₁ · x₀ and variance β₁ · I = (1 − ᾱ₁) · I. Matches the claim.

Inductive step: assume x_t−1 = √ᾱ_t−1 · x₀ + √(1 − ᾱ_t−1) · η with η ∼ N(0, I). Then

x_t = √α_t · x_t−1 + √β_t · ε_t = √α_t ( √ᾱ_t−1 · x₀ + √(1 − ᾱ_t−1) · η ) + √β_t · ε_t

= √ᾱ_t · x₀ + ( √(α_t(1 − ᾱ_t−1)) · η + √β_t · ε_t )

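The bracketed term is a sum of two independent zero-mean Gaussians (η depends only on ε_1, …, ε_t−1, so it is independent of ε_t) with total variance α_t(1 − ᾱ_t−1) + β_t = α_t − α_t·ᾱ_t−1 + (1 − α_t) = 1 − ᾱ_t. It can therefore be written as √(1 − ᾱ_t) · ε for a single ε ∼ N(0, I), so x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε as claimed. ▪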
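The variance identity at the heart of the inductive step can also be checked numerically for every t of a concrete schedule. The sketch below assumes a linearly increasing β from 1e-4 to 0.02 over 1000 steps (values not stated in the lesson) and a hypothetical helper name `check_inductive_step`.

```python
def check_inductive_step(betas, tol=1e-12):
    """Check alpha_t * (1 - alpha_bar_{t-1}) + beta_t == 1 - alpha_bar_t for every t.

    Returns (all_ok, worst_absolute_error).
    """
    alpha_bar_prev = 1.0   # empty product: alpha_bar_0 = 1, which also covers the base case
    worst = 0.0
    for beta_t in betas:
        alpha_t = 1.0 - beta_t
        alpha_bar_t = alpha_t * alpha_bar_prev
        merged_variance = alpha_t * (1.0 - alpha_bar_prev) + beta_t
        worst = max(worst, abs(merged_variance - (1.0 - alpha_bar_t)))
        alpha_bar_prev = alpha_bar_t
    return worst <= tol, worst

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]   # assumed linear schedule
ok, worst = check_inductive_step(betas)
print("identity holds at every step:", ok, "| worst error:", worst)
```

The only discrepancy is floating-point rounding, which is the numerical counterpart of the algebra in the derivation.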

The schedule, visualized

The two things you care about across the chain are the signal coefficient √ᾱ_t (how much of x₀ remains in x_t) and the noise coefficient √(1 − ᾱ_t). Both are strictly positive at every finite t, so a good schedule is one that keeps both comfortably away from zero across most of the chain. If signal collapses toward zero too early, mid-range timesteps are essentially pure noise and the net has nothing to denoise.
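To make the two coefficients concrete, the sketch below tabulates them at a few timesteps. The linear β schedule with endpoints 1e-4 and 0.02 over T = 1000 is an assumption (the lesson does not state its values), and `linear_betas` and `coefficients` are hypothetical helper names.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T. The endpoints are assumed, not taken from the lesson."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def coefficients(betas):
    """Return [(signal, noise)] = [(sqrt(alpha_bar_t), sqrt(1 - alpha_bar_t))] for t = 1..T."""
    out, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        out.append((math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)))
    return out

coeffs = coefficients(linear_betas(1000))
for t in (1, 250, 500, 750, 1000):
    signal, noise = coeffs[t - 1]
    print(f"t={t:4d}  signal={signal:.4f}  noise={noise:.4f}")
```

The squares of the two coefficients always sum to 1, so the table is really a one-parameter curve: as the signal column falls, the noise column rises to compensate.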

Interactive · drive the chain on the two moons

Below is Eq. ★ applied to actual two-moons samples. Pull the t slider: as ᾱ_t drops, the two crescents wash out into a Gaussian. The opposite direction is what the network will learn to do.

The path as a heatmap

Above we visualized one slice of the path at a chosen t. Here’s the whole path at once: project the moons to their x-coordinate, bin the histogram, and stack the histograms left-to-right as columns of a heatmap. Each column is q(x_t) (marginalized over y) at one timestep. Reading left-to-right is the forward process; reading right-to-left is what the reverse model has to learn.

Density evolution: q(x_t^(x)) over t, where x_t^(x) is the x-coordinate of x_t

Left edge = t = 0 (the two moons projected to one dim — you can see the two humps). Right edge = t = T (a single Gaussian hump centered at 0). The transition between them is the path.
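The heatmap construction described above (project to the x-coordinate, histogram, stack columns) might look roughly like this. Everything here is a stand-in: `two_moons` is a hand-rolled generator rather than the lesson's data loader, the linear β schedule and the bin range [−4, 4) are assumed, and no plotting is done; the result is just a list of histogram columns.

```python
import math
import random

def two_moons(n, noise, rng):
    """Hand-rolled stand-in for a two-moons dataset (not the lesson's data loader)."""
    points = []
    for i in range(n):
        theta = math.pi * rng.random()
        if i % 2 == 0:
            x, y = math.cos(theta), math.sin(theta)              # upper crescent
        else:
            x, y = 1.0 - math.cos(theta), 0.5 - math.sin(theta)  # lower crescent
        points.append((x + noise * rng.gauss(0, 1), y + noise * rng.gauss(0, 1)))
    return points

def histogram(values, lo, hi, bins):
    """Fraction of values in each of `bins` equal-width bins on [lo, hi); outliers are dropped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]

def density_heatmap(points, alpha_bars, n_slices, bins, rng, lo=-4.0, hi=4.0):
    """One histogram column per chosen timestep, from t = 0 (data) to t = T.

    The noise is isotropic, so noising only the x-coordinate gives the same
    marginal as noising both coordinates and then projecting.
    """
    T = len(alpha_bars)
    columns = []
    for k in range(n_slices):
        t = round(k * T / (n_slices - 1)) if n_slices > 1 else 0
        a_bar = 1.0 if t == 0 else alpha_bars[t - 1]
        xs = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0, 1)
              for x, _ in points]
        columns.append(histogram(xs, lo, hi, bins))
    return columns

rng = random.Random(0)
T = 1000
alpha_bars, running = [], 1.0
for i in range(T):                                   # assumed linear beta schedule
    running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    alpha_bars.append(running)

heatmap = density_heatmap(two_moons(2000, 0.05, rng), alpha_bars,
                          n_slices=100, bins=40, rng=rng)
print(len(heatmap), "columns of", len(heatmap[0]), "bins")
```

Passing `heatmap` (transposed) to any image-plotting routine gives the picture described: structured columns on the left, a single bell-shaped column on the right.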

Controls: samples per slice (2000), t-slices (100), projection axis, cosine schedule toggle.

Try the cosine schedule: with linear-β at high T the right side of the heatmap pins to pure noise too fast — most of the heatmap is wasted “already-Gaussian” columns. Cosine spreads the transition over more of the path.
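One rough way to quantify "wasted" columns is to count the timesteps at which less than 1% of the signal variance survives. The sketch below does this for a linear β schedule and for the cosine schedule of Nichol and Dhariwal, ᾱ_t = f(t) / f(0) with f(t) = cos²(((t/T) + s) / (1 + s) · π/2). The linear endpoints (1e-4 and 0.02), the offset s = 0.008, the 1% threshold, and all function names are assumptions for illustration.

```python
import math

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_1..alpha_bar_T for linearly spaced betas (endpoints assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_1..alpha_bar_T defined directly by the cosine formula."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def fraction_nearly_noise(alpha_bars, threshold=0.01):
    """Share of timesteps where less than `threshold` of the signal variance remains."""
    return sum(1 for a in alpha_bars if a < threshold) / len(alpha_bars)

T = 1000
frac_linear = fraction_nearly_noise(linear_alpha_bars(T))
frac_cosine = fraction_nearly_noise(cosine_alpha_bars(T))
print(f"linear: {frac_linear:.3f} of steps nearly pure noise")
print(f"cosine: {frac_cosine:.3f} of steps nearly pure noise")
```

Under these assumed settings the linear schedule spends a several-times larger share of the chain in the nearly-pure-noise regime than the cosine one, which matches the band of "already-Gaussian" columns on the right of the heatmap.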

Three claims to take away

Eq. ★ is the entire forward process. Code-wise that’s the four lines of DDPM.q_sample: index ᾱ at the chosen t, multiply x₀ by its square root, add √(1 − ᾱ_t) · noise.
Training picks t uniformly. t ~ U{0, …, T−1} in DDPM.loss. We don’t weight by anything — that’s a deliberate choice (the “L_simple” choice) we’ll defend in lesson 3.
The forward process is not learned. It’s a fixed schedule baked in as buffers (register_buffer in diffusion.py). Only the denoiser’s weights are trained.
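The three claims fit together as in the following dependency-free sketch. It is not the lesson's `DDPM` class: `ForwardProcess` is a hypothetical stand-in, the ᾱ table is a plain list where the real module would register a buffer, and the loss assumes the network predicts the noise ε (the DDPM paper's L_simple), a choice the lesson itself defers. Here t is 0-indexed to match t ~ U{0, …, T−1}, so index 0 selects the first noising step.

```python
import math
import random

class ForwardProcess:
    """Fixed (non-learned) forward process: the alpha_bar table is computed once and never updated."""

    def __init__(self, betas):
        self.alpha_bars, running = [], 1.0
        for beta in betas:
            running *= 1.0 - beta
            self.alpha_bars.append(running)
        self.T = len(self.alpha_bars)

    def q_sample(self, x0, t, noise):
        """Eq. ★: index alpha_bar at t, scale x0 by its square root, add scaled noise."""
        a_bar = self.alpha_bars[t]
        signal, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
        return [signal * x + sigma * n for x, n in zip(x0, noise)]

    def loss(self, predict_noise, x0, rng):
        """Unweighted squared error between true and predicted noise at a uniformly drawn t."""
        t = rng.randrange(self.T)                      # t ~ U{0, ..., T-1}
        noise = [rng.gauss(0.0, 1.0) for _ in x0]
        x_t = self.q_sample(x0, t, noise)
        pred = predict_noise(x_t, t)
        return sum((p - n) ** 2 for p, n in zip(pred, noise)) / len(x0)

# Usage with a trivial stand-in "network" that always predicts zero noise.
fp = ForwardProcess([1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)])
rng = random.Random(0)
zero_predictor = lambda x_t, t: [0.0] * len(x_t)
losses = [fp.loss(zero_predictor, [0.5, -1.0], rng) for _ in range(5000)]
print("mean loss of the zero predictor:", sum(losses) / len(losses))
```

The zero predictor averages a loss near 1, the variance of the noise it fails to predict; a trained denoiser has to beat that baseline at every noise level.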

Punchline

We have built a path: q(x_t | x₀) = N(√ᾱ_t x₀, (1 − ᾱ_t) I). It costs nothing to sample any point on it. Next lesson: invert the path. Given x_t, what is x_t−1 — and what should the net learn to predict to make that reversal work?