DDPM — from ELBO to L_simple

Why the loss is a one-liner MSE, when the underlying object is a 1000-step variational bound.

Set up the reverse process

The forward chain q destroys information. To generate, we run the reverse: start at x_T ∼ N(0, I) and denoise step by step until x₀. We don’t know the true reverse q(x_t−1 | x_t) — that’s the intractable thing — but for small β_t it is approximately Gaussian. So we parameterize a Gaussian reverse

p_θ(x_t−1 | x_t) = N( x_t−1 ; μ_θ(x_t, t), σ_t² I )

with mean μ_θ a neural network and variance σ_t² a hand-chosen scalar at each t (we’ll defend the choice in lesson 4).
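To make the shape of the reverse chain concrete, here is a minimal sketch of ancestral draws from p_θ(x_t−1 | x_t). The names reverse_step and toy_mu are illustrative only, and toy_mu is a hypothetical stand-in for the trained network, so the output is not a meaningful sample.

```python
import random

def reverse_step(x_t, t, mu_theta, sigma_t, rng):
    """Draw x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I) for a vector x_t (list of floats)."""
    mean = mu_theta(x_t, t)
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]

def toy_mu(x_t, t):
    # Hypothetical stand-in for the neural network: shrink x_t slightly.
    return [0.9 * v for v in x_t]

rng = random.Random(0)
T, D = 5, 2
x = [rng.gauss(0.0, 1.0) for _ in range(D)]   # x_T ~ N(0, I)
for t in range(T, 0, -1):                      # t = T, ..., 1
    x = reverse_step(x, t, toy_mu, sigma_t=0.1, rng=rng)
# x now plays the role of x_0
```

The only learned piece is the mean; the noise scale sigma_t is passed in as a fixed number, mirroring the hand-chosen σ_t² above.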

The ELBO, before any clever tricks

We would like to maximize log p_θ(x₀), the probability the model assigns to real data. But under the joint p_θ(x_0:T) = p(x_T) ∏ p_θ(x_t−1 | x_t), writing that number down means integrating over every latent trajectory x₁, …, x_T that could have led to this x₀: an integral over T variables, each as large as x₀ itself, which we cannot do. So we don’t compute it; we lower-bound it. The trick: the forward chain q is a free, exact “answer key” for how clean data turns into noise, so we use q as the proposal, and Jensen’s inequality buys us a bound we can evaluate.

log p_θ(x₀) ≥ 𝔼_q [ log p_θ(x_0:T) − log q(x_1:T | x₀) ] = − L_ELBO
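A toy check of the inequality may help. The sketch below assumes a made-up model with a single binary latent z (the numbers in p_joint are arbitrary), not the diffusion chain; it only shows that the bound holds for any proposal and is tight when the proposal equals the true posterior.

```python
import math

# Toy model with one binary latent z: joint p(x, z) for a fixed observed x.
p_joint = {0: 0.10, 1: 0.30}               # p(x, z=0), p(x, z=1)
log_px = math.log(sum(p_joint.values()))   # exact log-likelihood, log p(x)

def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a proposal q over z."""
    return sum(q[z] * (math.log(p_joint[z]) - math.log(q[z])) for z in q)

q_rough = {0: 0.5, 1: 0.5}                 # arbitrary proposal
q_exact = {0: 0.25, 1: 0.75}               # true posterior p(z | x)

print(log_px, elbo(q_rough), elbo(q_exact))
```

In DDPM the proposal is the forward chain q(x_1:T | x₀), which is generally not the model's own posterior, so the bound is not expected to be tight.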

Expand the joint and the q chain, regroup by t. After a half page of algebra (Ho et al. 2020, Appendix A), the bound decomposes into

L_ELBO = 𝔼_q [ KL(q(x_T | x₀) ‖ p(x_T)) + Σ_t>1 KL(q(x_t−1 | x_t, x₀) ‖ p_θ(x_t−1 | x_t)) − log p_θ(x₀ | x₁) ]

The first term doesn’t involve θ (the schedule guarantees it’s ≈ 0). The last term is one Gaussian log-density. The middle sum is the work: a sum of KL divergences between Gaussians, indexed by t.
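To see how small the first term is, here is a rough numeric check. It assumes the linear schedule from Ho et al. 2020 (β_t from 1e-4 to 0.02 over T = 1000 steps) and looks at a single coordinate of x₀; the schedule is an assumption here, since this lesson has not fixed one.

```python
import math

T = 1000
# Assumed schedule: beta_t linear from 1e-4 to 0.02 (the setting used by Ho et al. 2020).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_bar_T = 1.0
for b in betas:
    alpha_bar_T *= 1.0 - b            # alpha_bar_T = prod_t (1 - beta_t)

def prior_kl_per_dim(x0):
    """KL( N(sqrt(abar_T) x0, 1 - abar_T) || N(0, 1) ) for one coordinate."""
    mean_sq = alpha_bar_T * x0 ** 2
    var = 1.0 - alpha_bar_T
    return 0.5 * (var + mean_sq - 1.0 - math.log(var))

print(alpha_bar_T, prior_kl_per_dim(1.0))
```

Both printed numbers are tiny, which is the sense in which the schedule makes q(x_T | x₀) indistinguishable from N(0, I).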

Intuition · linear unpacking

Claim: maximizing an intractable likelihood over T-step trajectories collapses into a stack of tiny “match this Gaussian to that Gaussian” problems.

The wall. log p_θ(x₀) can’t be computed directly — it sums over every trajectory that could have noised x₀ up to x_T. Too many paths.
The dodge. Don’t compute it; bound it. The forward chain q already tells us, exactly, one plausible way the noise was added. Feed that in as the proposal and Jensen turns a log of an expectation (hard) into an expectation of logs (easy). The price is a nonnegative gap; the ELBO is what’s left.
It factorizes. Once written through q, the bound splits by timestep. Each piece only ever compares the model’s one-step reverse to the true one-step reverse given x₀, never the whole chain at once.
Three kinds of term. (a) the endpoint at T: does fully-noised data match N(0, I)? The schedule forces yes ⇒ ≈ 0, no parameters. (b) the final decode at t = 1: one Gaussian log-density. (c) the middle sum: at every t, push the model’s guessed reverse Gaussian onto the true reverse Gaussian.

Central point. An impossible global integral becomes a pile of small, local matching problems — one per timestep. The next section shows each of those matches is, after the dust settles, a plain squared error.

KL between two Gaussians (with the same variance) is just MSE

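Why does this collapse? Because the conditional reverse q(x_t−1 | x_t, x₀) is exactly Gaussian (Bayes’ rule on Gaussians is Gaussian) and we’ve chosen p_θ Gaussian with the same variance σ_t² (if σ_t² differs from the posterior variance, the KL only picks up an extra constant that doesn’t depend on θ). The KL divergence between two Gaussians with the same covariance σ²I is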

KL( N(μ_q, σ²I) ‖ N(μ_p, σ²I) ) = ‖μ_q − μ_p‖² / (2σ²)
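A quick sanity check of this identity in one dimension, using the general closed-form Gaussian KL and arbitrary illustrative numbers:

```python
import math

def kl_gauss_1d(mu_q, var_q, mu_p, var_p):
    """General closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) in one dimension."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

mu_q, mu_p, var = 0.7, -0.2, 0.05
general = kl_gauss_1d(mu_q, var, mu_p, var)
shortcut = (mu_q - mu_p) ** 2 / (2.0 * var)   # the equal-variance formula
print(general, shortcut)
```

With equal variances the log term and the variance-ratio term cancel, leaving only the squared distance between means. In D dimensions with covariance σ²I the per-coordinate terms add up to ‖μ_q − μ_p‖² / (2σ²).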

So we’re left with a sum of weighted MSEs between the true posterior mean μ̃_t(x_t, x₀) and our predicted mean μ_θ(x_t, t). The true posterior mean has the closed form

μ̃_t(x_t, x₀) = (√ᾱ_t−1 β_t) / (1 − ᾱ_t) · x₀ + (√α_t (1 − ᾱ_t−1)) / (1 − ᾱ_t) · x_t

(derived from Bayes: q(x_t−1 | x_t, x₀) ∝ q(x_t | x_t−1) q(x_t−1 | x₀), both Gaussian; complete the square).
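The complete-the-square step can be checked numerically. This sketch works in one dimension with assumed values for β_t and ᾱ_t−1, multiplies the two Gaussians by adding precisions, and compares the result to the closed form. The variance formula in the last line is the standard posterior variance, which this section has not written out.

```python
import math

# Assumed scalar values for one timestep t (any 0 < beta_t < 1 and 0 < abar_prev < 1 work).
beta_t, abar_prev = 0.02, 0.6            # beta_t and alpha_bar_{t-1}
alpha_t = 1.0 - beta_t
abar_t = alpha_t * abar_prev             # alpha_bar_t = alpha_t * alpha_bar_{t-1}
x0, x_t = 1.5, 0.4

# Bayes on Gaussians: precisions add, means are precision-weighted.
prec_lik = alpha_t / beta_t              # q(x_t | x_{t-1}) viewed as a function of x_{t-1}
prec_pri = 1.0 / (1.0 - abar_prev)       # q(x_{t-1} | x_0)
post_var = 1.0 / (prec_lik + prec_pri)
post_mean = post_var * (prec_lik * x_t / math.sqrt(alpha_t)
                        + prec_pri * math.sqrt(abar_prev) * x0)

# Closed-form posterior mean, plus the standard posterior variance.
closed_mean = (math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)) * x0 \
            + (math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)) * x_t
closed_var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
print(post_mean, closed_mean, post_var, closed_var)
```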

The ε-parameterization

Here is the key practical move. Instead of having the network predict μ_θ directly, we have it predict the noise ε that was added to x₀ to make x_t. Then μ_θ is read off by rearranging Eq. ⋆, the forward jump x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, and substituting the result into μ̃_t:

x₀ = (x_t − √(1 − ᾱ_t) · ε) / √ᾱ_t ⇒ μ_θ(x_t, t) = (1/√α_t) · ( x_t − (β_t / √(1 − ᾱ_t)) · ε_θ(x_t, t) )
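The substitution is easy to get wrong by a factor, so here is a small numeric check in one dimension. The values of β_t and ᾱ_t−1 are assumed, and the function names are illustrative. If the true ε is plugged in, the ε-form of the mean should reproduce the posterior mean exactly.

```python
import math

beta_t, abar_prev = 0.02, 0.6            # assumed values of beta_t and alpha_bar_{t-1}
alpha_t = 1.0 - beta_t
abar_t = alpha_t * abar_prev
x0, eps = 1.5, -0.8                      # a clean point and the noise added to it

x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps   # forward jump

def posterior_mean(x_t, x0):
    """Closed-form mean of q(x_{t-1} | x_t, x_0)."""
    return (math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)) * x0 \
         + (math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)) * x_t

def mean_from_eps(x_t, eps_hat):
    """Mean implied by a noise prediction eps_hat."""
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(alpha_t)

x0_rec = (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)   # recover x_0
print(x0_rec, posterior_mean(x_t, x0), mean_from_eps(x_t, eps))
```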

Substitute into the KL loss and the cross-cancellations leave

L_t = w_t · 𝔼_{x₀, ε} ‖ ε − ε_θ(x_t, t) ‖²

for a t-dependent weight w_t. Why pick the ε-parameterization over x₀-prediction or μ-prediction? Compare the three candidate targets:

Target	Scale at t = 1	Scale at t = T	Verdict
x₀ direct	‖x₀‖ ≈ √D	‖x₀‖ ≈ √D	scale OK, but the net sees x_t ≈ noise at large t and has to hallucinate x₀ from nothing → high variance there
μ direct	≈ x₀	≈ √α_T · x_t (≈ x_t scale)	two regimes: mostly x₀ at small t, mostly x_t at large t; gradient norm has a t-dependent kink
ε	‖ε‖ ≈ √D	‖ε‖ ≈ √D	target is unit-variance Gaussian at every t → gradients well-scaled across timesteps

The third row is the win. We don’t need per-t loss weighting because the target itself is scale-stable.

The L_simple choice: throw away the weights

The ELBO has w_t = β_t² / (2 σ_t² α_t (1 − ᾱ_t)) — large at small t (fine-detail steps), small at large t (coarse-structure steps). Ho et al. observed that dropping the weights:

L_simple = 𝔼_{x₀, ε, t ∼ U{1, …, T}} ‖ ε − ε_θ(x_t, t) ‖²

produces visibly better samples than the weighted ELBO, even though it’s a worse likelihood bound. Why?

The likelihood-vs-perception trade-off

The ELBO weights upweight small t, where the model is denoising tiny perturbations — those steps carry most of the likelihood mass (they pin down the fine-grained pixel values, which is where the bits of log-likelihood actually live). But they don’t carry most of the perceptual content; that’s the medium-t steps where coarse shapes get fixed.
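To get a feel for how lopsided the ELBO weights are, the sketch below evaluates w_t under two assumptions not fixed by this lesson: the linear schedule of Ho et al. 2020 (β_t from 1e-4 to 0.02, T = 1000) and the variance choice σ_t² = β_t.

```python
T = 1000
# Assumed schedule: beta_t linear from 1e-4 to 0.02, with sigma_t^2 = beta_t.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

weights = []
alpha_bar = 1.0
for beta in betas:
    alpha = 1.0 - beta
    alpha_bar *= alpha
    sigma2 = beta                                   # one common hand-chosen variance
    weights.append(beta ** 2 / (2.0 * sigma2 * alpha * (1.0 - alpha_bar)))

for t in (1, 10, 100, 500, 1000):                   # timesteps are 1-indexed in the text
    print(t, weights[t - 1])
```

Under these assumptions the weight peaks at t = 1 and is far smaller over the rest of the chain, so the weighted objective concentrates on the earliest steps.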

By unweighting, L_simple spends gradient on the perceptual middle. The model gives you samples that look like the data even though the data get slightly lower likelihood under p_θ.

This is the deliberate choice in DDPM.loss: ((eps - eps_pred) ** 2).mean(), with no w_t and t sampled uniformly. A handful of lines.

Interactive · the parameterization horse race

Below we set up two tiny random regression problems: one with ε as target, one with x₀. We don’t fit a net; we just compute the loss for a fixed random “model” and watch the per-t magnitude of the gradient. Notice how the x₀-target loss balloons at large t while the ε-target loss stays flat. This is what saves you from per-t reweighting.
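One way to reproduce the qualitative picture offline is sketched below. It is a guess at the setup rather than the widget's code: it assumes 1-D standard-normal data, the linear schedule of Ho et al. 2020, and a "model" that outputs an independent random ε guess, which is then converted to an x₀ guess through the forward-jump formula.

```python
import math
import random

T = 1000
# Assumed schedule: beta_t linear from 1e-4 to 0.02.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def losses_at(t, n=5000, seed=0):
    """Mean squared error in eps-space and in x0-space for a random 'model' at step t."""
    rng = random.Random(seed + t)
    abar = alpha_bars[t - 1]
    eps_loss = x0_loss = 0.0
    for _ in range(n):
        x0, eps = rng.gauss(0, 1), rng.gauss(0, 1)
        x_t = math.sqrt(abar) * x0 + math.sqrt(1 - abar) * eps
        eps_hat = rng.gauss(0, 1)                       # random guess, no learning
        x0_hat = (x_t - math.sqrt(1 - abar) * eps_hat) / math.sqrt(abar)
        eps_loss += (eps - eps_hat) ** 2
        x0_loss += (x0 - x0_hat) ** 2
    return eps_loss / n, x0_loss / n

for t in (1, 250, 500, 750, 1000):
    print(t, losses_at(t))
```

The same error in ε is multiplied by (1 − ᾱ_t) / ᾱ_t when measured in x₀-space, which is why the second number grows without any change in the model.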

Interactive · what does the denoiser actually see?

The widget below shows the three quantities flowing through one training step, side by side:

Green — x₀ sampled from the data (the two moons).
Orange — x_t obtained by Eq. ⋆ for the current t.
Blue — predicted clean data x̂₀ = (x_t − √(1−ᾱ_t) · ε_θ) / √ᾱ_t (Tweedie’s formula).

For ε_θ we use the oracle 𝔼[ε | x_t] (the perfect-model fantasy). At small t, blue overlaps green almost exactly — the denoiser’s job is trivial. At large t, x_t is essentially Gaussian and the “denoised” estimate collapses to the data mean (rotating around the origin). The mid range is where the actual work happens, and where L_simple spends most of its perceptual budget.
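The collapse to the data mean can be seen in closed form on a simpler dataset. The sketch below swaps the two moons for an assumed 1-D toy dataset, x₀ = +1 or −1 with equal probability, for which the oracle 𝔼[x₀ | x_t] is a tanh. The ᾱ values are picked by hand to stand for small, middle, and large t.

```python
import math

def oracle_x0_hat(x_t, abar):
    """E[x0 | x_t] when x0 is +1 or -1 with equal probability (assumed toy data)."""
    return math.tanh(math.sqrt(abar) * x_t / (1.0 - abar))

def oracle_eps(x_t, abar):
    """E[eps | x_t], obtained from the oracle x0 estimate via the forward jump."""
    return (x_t - math.sqrt(abar) * oracle_x0_hat(x_t, abar)) / math.sqrt(1.0 - abar)

def x0_hat_from_eps(x_t, abar, eps_hat):
    """Predicted clean data from a noise estimate (the blue points in the widget)."""
    return (x_t - math.sqrt(1.0 - abar) * eps_hat) / math.sqrt(abar)

for abar, x_t in [(0.999, 1.02), (0.5, 0.9), (0.001, 1.3)]:   # small, mid, large t
    eps_hat = oracle_eps(x_t, abar)
    print(abar, x_t, x0_hat_from_eps(x_t, abar, eps_hat))
```

At ᾱ near 1 the estimate snaps to the nearest data point; at ᾱ near 0 it sits close to 0, the mean of this toy dataset.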

Putting it all together: the trainer in a few lines

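# DDPM.loss() as a PyTorch sketch; alpha_bar is a length-T tensor and index 0 holds t = 1.
import torch

def ddpm_loss(model, x0, alpha_bar):
    B, T = x0.shape[0], alpha_bar.shape[0]
    t    = torch.randint(0, T, (B,), device=x0.device)    # sample timestep uniformly
    eps  = torch.randn_like(x0)                           # sample Gaussian noise
    ab   = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))  # reshape so it broadcasts over x0
    xt   = ab.sqrt() * x0 + (1 - ab).sqrt() * eps         # forward jump, Eq. ⋆
    eps_pred = model(xt, t)                               # ε_θ(x_t, t)
    return ((eps - eps_pred) ** 2).mean()                 # L_simple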

That’s the entire training step. The net is trained to undo Gaussian noise of arbitrary magnitude, one timestep at a time, and the schedule guarantees those magnitudes span everything from “barely perturbed” to “pure noise.” What we haven’t answered yet: how do you turn a trained ε_θ into samples?

Punchline

The DDPM loss is a one-liner MSE because (a) KL of equal-variance Gaussians is squared distance, (b) the ε-parameterization gives a scale-stable target, and (c) we deliberately drop the per-t weights to trade likelihood for perceptual quality. The bound is heavy machinery; the loss it gives back is trivial.