
DDPM — from ELBO to L_simple

Why the loss is a one-liner MSE, when the underlying object is a 1000-step variational bound.

Set up the reverse process

The forward chain q destroys information. To generate, we run the reverse: start at x_T ∼ N(0, I) and denoise step by step until x_0. We don’t know the true reverse q(x_{t−1} | x_t) (that’s the intractable thing), but for small β_t it is approximately Gaussian. So we parameterize a Gaussian reverse

p_θ(x_{t−1} | x_t) = N( x_{t−1} ;  μ_θ(x_t, t),  σ_t² I )

with mean μ_θ a neural network and variance σ_t² a hand-chosen scalar at each t (we’ll defend the choice in lesson 4).

The ELBO, before any clever tricks

The data likelihood under the joint p_θ(x_{0:T}) = p(x_T) ∏_t p_θ(x_{t−1} | x_t) is intractable (it requires integrating over the whole latent chain x_{1:T}). The standard variational dodge: bound it from below using q as the proposal.

log p_θ(x_0) ≥ 𝔼_q [ log p_θ(x_{0:T}) − log q(x_{1:T} | x_0) ] = − L_ELBO

Expand the joint and the q chain, regroup by t. After a half page of algebra (Ho et al. 2020, Appendix A), the bound decomposes into

L_ELBO = 𝔼_q [ KL(q(x_T | x_0) ‖ p(x_T)) + Σ_{t>1} KL(q(x_{t−1} | x_t, x_0) ‖ p_θ(x_{t−1} | x_t)) − log p_θ(x_0 | x_1) ]

The first term doesn’t involve θ (the schedule guarantees it’s ≈ 0). The last term is one Gaussian log-density. The middle sum is the work: a sum of KL divergences between Gaussians, indexed by t.
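The half page of algebra hinges on one Bayes-rule rewrite and a telescoping product. The block below is a condensed sketch of those steps in the spirit of Ho et al.'s Appendix A, not a verbatim copy; the intermediate grouping is a restatement chosen for brevity.

```latex
\begin{aligned}
L_{\mathrm{ELBO}}
&= \mathbb{E}_q\!\left[ -\log p(x_T)
   - \sum_{t>1} \log \frac{p_\theta(x_{t-1}\mid x_t)}{q(x_t\mid x_{t-1})}
   - \log \frac{p_\theta(x_0\mid x_1)}{q(x_1\mid x_0)} \right] \\[4pt]
&\text{Markov property and Bayes' rule, for } t>1:\quad
q(x_t\mid x_{t-1})
   = \frac{q(x_{t-1}\mid x_t, x_0)\, q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)} \\[4pt]
&\text{telescoping:}\quad
\prod_{t=2}^{T} \frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}
   = \frac{q(x_T\mid x_0)}{q(x_1\mid x_0)} \\[4pt]
L_{\mathrm{ELBO}}
&= \mathbb{E}_q\!\left[ -\log \frac{p(x_T)}{q(x_T\mid x_0)}
   - \sum_{t>1} \log \frac{p_\theta(x_{t-1}\mid x_t)}{q(x_{t-1}\mid x_t, x_0)}
   - \log p_\theta(x_0\mid x_1) \right]
\end{aligned}
```

Taking the expectation of each log-ratio over the matching q turns the first two groups into the KL terms of the decomposition; the q(x_1 | x_0) factors cancel, which is why the last term is a bare log-density.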

KL between two Gaussians (with the same variance) is just MSE

Why does this collapse? Because the conditional reverse q(x_{t−1} | x_t, x_0) is exactly Gaussian (Bayes’ rule on Gaussians is Gaussian) and we’ve chosen p_θ Gaussian with the same variance σ_t² (if σ_t² is fixed to a different value, the KL only picks up an extra constant that does not depend on θ). KL between two N’s with the same Σ is

KL( N(μ_q, σ²I) ‖ N(μ_p, σ²I) ) = ‖μ_q − μ_p‖² / (2σ²)
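As a sanity check on this identity, the sketch below compares it against the general closed-form KL between diagonal Gaussians. The function names and the test values are illustrative choices, not part of the lesson.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """General KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def kl_equal_variance(mu_q, mu_p, sigma2):
    """Special case from the text: both covariances equal sigma2 * I."""
    return np.sum((mu_q - mu_p) ** 2) / (2.0 * sigma2)

rng = np.random.default_rng(0)
D, sigma2 = 8, 0.3                       # arbitrary illustrative values
mu_q, mu_p = rng.normal(size=D), rng.normal(size=D)
var = np.full(D, sigma2)

kl_general = kl_diag_gaussians(mu_q, var, mu_p, var)
kl_special = kl_equal_variance(mu_q, mu_p, sigma2)
print(kl_general, kl_special)            # the two values should agree
```

With equal variances the log-ratio and trace terms cancel, so only the scaled squared distance between the means survives.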

So we’re left with a sum of weighted MSEs between the true posterior mean μ̃_t(x_t, x_0) and our predicted mean μ_θ(x_t, t). The true posterior mean has the closed form

μ̃_t(x_t, x_0) = (√ᾱ_{t−1} β_t) / (1 − ᾱ_t) · x_0 + (√α_t (1 − ᾱ_{t−1})) / (1 − ᾱ_t) · x_t

(derived from Bayes: q(x_{t−1} | x_t, x_0) ∝ q(x_t | x_{t−1}) q(x_{t−1} | x_0), both Gaussian; complete the square).
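Completing the square amounts to adding precisions and precision-weighted means of the two Gaussian factors. The sketch below checks numerically that this route reproduces the closed form. The linear β schedule with T = 1000 is an assumption made for illustration, and the helper names are hypothetical.

```python
import numpy as np

# Assumed toy schedule (not from the lesson): linear betas, T = 1000.
T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

def posterior_mean_closed_form(xt, x0, t):
    """Closed-form posterior mean from the text; t is a 0-indexed step, t >= 1."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
    c0 = np.sqrt(ab_prev) * beta[t] / (1.0 - ab_t)
    ct = np.sqrt(alpha[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    return c0 * x0 + ct * xt

def posterior_mean_bayes(xt, x0, t):
    """Same mean via a precision-weighted product of two Gaussians in x_{t-1}."""
    ab_prev = alpha_bar[t - 1]
    # q(x_t | x_{t-1}) viewed as a function of x_{t-1}.
    prec_lik = alpha[t] / beta[t]
    lin_lik = np.sqrt(alpha[t]) * xt / beta[t]
    # q(x_{t-1} | x_0).
    prec_prior = 1.0 / (1.0 - ab_prev)
    lin_prior = np.sqrt(ab_prev) * x0 / (1.0 - ab_prev)
    return (lin_lik + lin_prior) / (prec_lik + prec_prior)

rng = np.random.default_rng(0)
x0, xt = rng.normal(size=4), rng.normal(size=4)
for t in (1, 500, 999):
    a = posterior_mean_closed_form(xt, x0, t)
    b = posterior_mean_bayes(xt, x0, t)
    print(t, np.max(np.abs(a - b)))      # differences should be at floating-point level
```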

The ε-parameterization

Here is the key practical move. Instead of having the network predict μ_θ directly, we have it predict the noise ε that was added to x_0 to make x_t. Then μ_θ is read off by rearranging Eq. ★, the forward jump x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε, and substituting the result into μ̃_t:

x_0 = (x_t − √(1 − ᾱ_t) · ε) / √ᾱ_t   ⇒   μ_θ(x_t, t) = (1/√α_t) · ( x_t − (β_t / √(1 − ᾱ_t)) · ε_θ(x_t, t) )
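The substitution can be checked numerically: build x_t from a known x_0 and ε, then compare the posterior mean written in terms of x_0 with the same mean written in terms of ε. This is a minimal sketch under an assumed linear β schedule (T = 1000); the function names are hypothetical.

```python
import numpy as np

# Assumed toy schedule (not from the lesson): linear betas, T = 1000.
T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

def mu_from_x0(xt, x0, t):
    """Posterior mean in terms of x_0 (closed form from the text)."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
    return (np.sqrt(ab_prev) * beta[t] * x0
            + np.sqrt(alpha[t]) * (1.0 - ab_prev) * xt) / (1.0 - ab_t)

def mu_from_eps(xt, eps, t):
    """The same mean rewritten in terms of the noise eps."""
    return (xt - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])

rng = np.random.default_rng(0)
t = 300                                   # arbitrary 0-indexed step, t >= 1
x0, eps = rng.normal(size=5), rng.normal(size=5)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps  # forward jump

gap = np.max(np.abs(mu_from_x0(xt, x0, t) - mu_from_eps(xt, eps, t)))
print(gap)                                # should be at floating-point level
```

The two forms agree only when x_t really was produced from x_0 and ε by the forward jump, which is exactly the situation during training.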

Substitute into the KL loss and the cross-cancellations leave

L_t = w_t · 𝔼_{x_0, ε} ‖ ε − ε_θ(x_t, t) ‖²

for a t-dependent weight w_t. The table below compares the ε-parameterization with x_0-prediction and μ-prediction as regression targets:

| Target | Scale at t = 0 | Scale at t = T | Verdict |
|---|---|---|---|
| x₀ direct | ‖x₀‖ ≈ √D | ‖x₀‖ ≈ √D | scale OK, but the net sees x_t ≈ noise at large t and has to hallucinate x₀ from nothing → high variance there |
| μ direct | ≈ x₀ | ≈ √α_T · x_t (≈ x_t scale) | two regimes: mostly x₀ at small t, mostly x_t at large t; gradient norm has a t-dependent kink |
| ε | ‖ε‖ ≈ √D | ‖ε‖ ≈ √D | target is unit-variance Gaussian at every t → gradients well-scaled across timesteps |

The third row is the win. We don’t need per-t loss weighting because the target itself is scale-stable.
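The scales in the table can be reproduced with a few lines of NumPy. The sketch below is illustrative only: it assumes a linear β schedule with T = 1000 and toy data drawn from N(0, 3² I), chosen so that data and noise live on visibly different scales. With unit-variance data the μ row would look nearly flat as well.

```python
import numpy as np

# Assumptions (not from the lesson): linear betas, T = 1000, and toy data
# drawn from N(0, 3^2 I) so that x0 and the noise have different scales.
T, D, N, data_std = 1000, 64, 2000, 3.0
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

rng = np.random.default_rng(0)
x0 = data_std * rng.normal(size=(N, D))

def target_norms(t):
    """Mean Euclidean norm of each candidate target at 0-indexed step t >= 1."""
    eps = rng.normal(size=(N, D))
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    mu = (np.sqrt(alpha_bar[t - 1]) * beta[t] * x0
          + np.sqrt(alpha[t]) * (1.0 - alpha_bar[t - 1]) * xt) / (1.0 - alpha_bar[t])
    return {name: np.linalg.norm(v, axis=1).mean()
            for name, v in (("x0", x0), ("mu", mu), ("eps", eps))}

norms = {t: target_norms(t) for t in (1, 250, 500, 999)}
for t, row in norms.items():
    print(t, {k: round(float(v), 2) for k, v in row.items()})
```

Under these assumptions the ε norm should stay near √D at every step, while the μ norm should drift from the data scale at small t to the noise scale at large t.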

The L_simple choice: throw away the weights

The ELBO has w_t = β_t² / (2 σ_t² α_t (1 − ᾱ_t)), which is large at small t (fine-detail steps) and small at large t (coarse-structure steps). Ho et al. observed that dropping the weights:

L_simple = 𝔼_{x_0, ε, t ∼ U{1, …, T}} ‖ ε − ε_θ(x_t, t) ‖²

produces visibly better samples than the weighted ELBO, even though it is no longer the likelihood bound and gives worse log-likelihoods. Why?

The likelihood-vs-perception trade-off

The ELBO weights upweight small t, where the model is denoising tiny perturbations. Those steps carry most of the likelihood mass because they pin down fine-grained pixel values. But they don’t carry most of the perceptual content; that lives in the medium-t steps, where coarse shapes get fixed.

By unweighting, L_simple spends gradient on the perceptual middle. The model gives you samples that look like the data even though they have slightly less probability under p_θ.

This is the deliberate choice in DDPM.loss: ((eps - eps_pred) ** 2).mean(), with no w_t and t sampled uniformly.
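To make the discarded weights concrete, the sketch below tabulates w_t and checks at one step that the θ-dependent part of the KL term equals w_t times the squared ε error. The linear β schedule, T = 1000, and the choice σ_t² = β_t are assumptions made for illustration; the lesson defers the variance choice to lesson 4.

```python
import numpy as np

# Assumptions (not from the lesson): linear betas, T = 1000, sigma_t^2 = beta_t.
T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)
sigma2 = beta

# ELBO weight w_t = beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)).
w = beta ** 2 / (2.0 * sigma2 * alpha * (1.0 - alpha_bar))

for t in (0, 9, 99, 499, 999):            # 0-indexed steps
    print(f"step {t + 1:4d}  w_t = {w[t]:.4f}")
print("ratio w_1 / w_T:", w[0] / w[-1])

# Check the identity at one step with a random stand-in for the prediction.
rng = np.random.default_rng(0)
t = 300
x0, eps, eps_pred = (rng.normal(size=6) for _ in range(3))
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
mu_true = (np.sqrt(alpha_bar[t - 1]) * beta[t] * x0
           + np.sqrt(alpha[t]) * (1.0 - alpha_bar[t - 1]) * xt) / (1.0 - alpha_bar[t])
mu_pred = (xt - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha[t])

kl_mean_term = np.sum((mu_true - mu_pred) ** 2) / (2.0 * sigma2[t])
weighted_mse = w[t] * np.sum((eps - eps_pred) ** 2)
print(kl_mean_term, weighted_mse)         # the two values should agree
```

Under these assumptions the first step should carry a weight well above the late steps, which is the imbalance that L_simple removes by setting every weight to 1.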

Interactive · the parameterization horse race

Below we set up two tiny random regression problems: one with ε as target, one with x_0. We don’t fit a net; we just compute the loss for a fixed random “model” and watch the per-t magnitude of the gradient. Notice how the x_0-target loss balloons at large t while the ε-target loss stays flat. This is what saves you from per-t reweighting.

ε vs. x₀ vs. μ — gradient norm across t
A single random “predictor” (zero, so the squared error is the target norm) is evaluated at every t. Lower variance across t = better-behaved gradients. The ε curve is flat by construction.

Interactive · what does the denoiser actually see?

The widget below shows the three quantities flowing through one training step, side by side: the clean data x_0 (green), the noised sample x_t (orange), and the denoised estimate x̂_0 (blue).

For ε_θ we use the oracle 𝔼[ε | x_t] (the perfect-model fantasy). At small t, blue overlaps green almost exactly, so the denoiser’s job is trivial. At large t, x_t is essentially Gaussian and the “denoised” estimate collapses to the data mean (rotating around the origin). The mid range is where the actual work happens, and where L_simple spends most of its perceptual budget.

x_0 → x_t → predicted x̂_0 for a single t
Drag t. Green points are real data; orange points are noised; blue points are what the oracle “denoises” them back to. The three lines from a few sample particles show how far the noise pushed them and where the denoiser tries to send them back.
Readouts: √ᾱ_t, √(1 − ᾱ_t), mean ‖x̂_0 − x_0‖
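The widget's data distribution is not recoverable here, so the sketch below substitutes the simplest case where the oracle has a closed form: data assumed to be N(m, s² I), with m and s picked arbitrarily. For such data 𝔼[x_0 | x_t] is linear in x_t, and 𝔼[ε | x_t] follows from the rearranged forward jump. It illustrates the two limits described above and is not a reconstruction of the widget.

```python
import numpy as np

# Assumption (not from the lesson): data is N(m, s^2 I), so the oracle
# E[x0 | x_t] is linear in x_t and has a closed form.
m, s = np.array([2.0, -1.0]), 0.5

def oracle_x0(xt, alpha_bar_t):
    """Posterior mean E[x0 | x_t] for Gaussian data under the forward jump."""
    gain = np.sqrt(alpha_bar_t) * s**2 / (alpha_bar_t * s**2 + 1.0 - alpha_bar_t)
    return m + gain * (xt - np.sqrt(alpha_bar_t) * m)

rng = np.random.default_rng(0)
x0 = m + s * rng.normal(size=(1000, 2))
errors = {}
for alpha_bar_t in (0.999, 0.5, 0.001):      # small t, mid t, large t
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    x0_hat = oracle_x0(xt, alpha_bar_t)
    errors[alpha_bar_t] = np.linalg.norm(x0_hat - x0, axis=1).mean()
    print(alpha_bar_t, errors[alpha_bar_t], x0_hat.mean(axis=0), x0_hat.std(axis=0))
```

When ᾱ_t is close to 1 the estimate should sit almost on top of the clean points; when ᾱ_t is close to 0 the gain vanishes and every estimate shrinks toward m.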

Putting it all together: the training step

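# Sketch of DDPM.loss() in PyTorch. alpha_bar is a length-T tensor of ᾱ_t values,
# assumed to live on the same device as x0.
import torch

def ddpm_loss(model, x0, alpha_bar):
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)     # sample timestep uniformly (0-indexed)
    eps = torch.randn_like(x0)                           # sample Gaussian noise
    ab = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))   # reshape so it broadcasts over x0
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # forward jump, Eq. ★
    eps_pred = model(xt, t)                              # ε_θ(x_t, t)
    return ((eps - eps_pred) ** 2).mean()                # L_simple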

That’s the entire training step. The net is trained to undo Gaussian noise of arbitrary magnitude, one timestep at a time, and the schedule guarantees those magnitudes span everything from “barely perturbed” to “pure noise.” What we haven’t answered yet: how do you turn a trained ε_θ into samples?

Punchline
The DDPM loss is a one-liner MSE because (a) KL of equal-variance Gaussians is squared distance, (b) the ε-parameterization gives a scale-stable target, and (c) we deliberately drop the per-t weights to trade likelihood for perceptual quality. The bound is heavy machinery; the loss it gives back is trivial.