
DDPM ↔ Flow matching

They look like two different methods. They are the same object, viewed from two angles.

Both fit a local operator on a density path

Strip away the surface differences:

| | DDPM | Flow matching |
|---|---|---|
| density path | q(x_t ∣ x₀) = N(√ᾱ_t x₀, (1−ᾱ_t) I), the VP “curved” path | chosen by you, e.g. (1−t)x₀ + t·x₁, the linear (OT) path |
| training target | noise ε that perturbed x₀ → x_t | velocity (x₁ − x₀) along the segment |
| training loss | ‖ε − ε_θ(x_t, t)‖² | ‖v_θ(x_t, t) − (x₁ − x₀)‖² |
| sampler | ancestral Gaussian, 1000 steps | Euler ODE, 50 steps |
| derivation framework | ELBO + Gaussian KL collapse | continuity equation + conditional E[·] |

The cells that look different are surface: path shape, what the net predicts, which solver you use. The cells that matter (training-by-regression-against-a-tractable-conditional-target, sampling-by-iterated-local-update) are identical.
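
To make the "identical cells" concrete, here is a minimal single-sample sketch of the two losses from the table, in plain Python on lists. The function names (`ddpm_loss`, `fm_loss`, `zero_model`) and the model call signature `model(x_t, t)` are assumptions made for illustration, not code from this repo. Both functions have the same shape: build x_t from a tractable conditional, then regress a known target.

```python
import math
import random


def ddpm_loss(eps_model, x0, alpha_bar_t, t, rng):
    """Single-sample DDPM loss: regress the noise that carried x0 to x_t."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * xi + s * ei for xi, ei in zip(x0, eps)]  # VP path sample
    pred = eps_model(x_t, t)
    return sum((ei - pi) ** 2 for ei, pi in zip(eps, pred))


def fm_loss(v_model, x0, x1, t):
    """Single-sample FM loss on the linear path: regress the velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # linear path sample
    target = [b - a for a, b in zip(x0, x1)]
    pred = v_model(x_t, t)
    return sum((pi - ui) ** 2 for pi, ui in zip(pred, target))


def zero_model(x, t):
    """Placeholder network (hypothetical) that always predicts zero."""
    return [0.0 for _ in x]


rng = random.Random(0)
loss_ddpm = ddpm_loss(zero_model, [1.0, -2.0], 0.5, 10, rng)
loss_fm = fm_loss(zero_model, [0.0, 0.0], [3.0, 4.0], 0.5)
```

With the zero placeholder, each loss is just the squared norm of its target, which is a quick way to see that only the target and the path differ between the two.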

The mapping: ε ↔ v

For the VP path used by DDPM, the conditional velocity along the path is

u_t(x_t | x₀, ε) = d/dt ( √ᾱ_t x₀ + √(1 − ᾱ_t) ε ) = (d√ᾱ_t/dt) · x₀ + (d√(1 − ᾱ_t)/dt) · ε

If you replace x₀ with its DDPM-form value x₀ = (x_t − √(1−ᾱ_t) ε) / √ᾱ_t (Eq. ⋆ inverted) you get a velocity expressed in (x_t, ε). So any DDPM ε-predictor implicitly defines a flow-matching velocity along the VP path, and vice versa. Write α_t := ᾱ_t (the cumulative product itself, not its square root), σ_t := √(1 − ᾱ_t), and a dot for d/dt. Since (d√ᾱ_t/dt) / √ᾱ_t = α̇_t / (2α_t), the substitution gives:

v_VP(x, t) = (α̇_t / (2α_t)) · x  +  ( σ̇_t − (α̇_t / (2α_t)) · σ_t ) · ε_θ(x, t)

Don’t memorize the algebra; remember the structural claim: v is an affine function of x and ε with time-dependent coefficients fully determined by the schedule. Score, noise, velocity are three names for the same object, up to a known affine change of variables fixed by the path.
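
The affine map from ε to v can be checked numerically. The sketch below assumes a continuous-time linear β schedule on t ∈ [0, 1] with ᾱ(t) = exp(−∫β); the constants `BETA_MIN`, `BETA_MAX` and all function names are illustrative choices, not values taken from this lesson. It compares the formula against a finite-difference derivative of the VP path in one dimension.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed schedule endpoints, not from the lesson


def beta(t):
    """Linear beta(t) on t in [0, 1] (assumption)."""
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def alpha_bar(t):
    """alpha_bar(t) = exp(-integral of beta from 0 to t)."""
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))


def vp_point(x0, eps, t):
    """Point on the conditional VP path for fixed (x0, eps)."""
    ab = alpha_bar(t)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps


def eps_to_velocity(x, eps, t):
    """v = (ab_dot / (2 ab)) x + (sigma_dot - (ab_dot / (2 ab)) sigma) eps."""
    ab = alpha_bar(t)
    ab_dot = -beta(t) * ab               # d(alpha_bar)/dt for this schedule
    sigma = math.sqrt(1.0 - ab)
    sigma_dot = -ab_dot / (2.0 * sigma)  # d/dt sqrt(1 - alpha_bar)
    c = ab_dot / (2.0 * ab)
    return c * x + (sigma_dot - c * sigma) * eps


x0, eps, t, h = 1.5, -0.7, 0.4, 1e-5
v_formula = eps_to_velocity(vp_point(x0, eps, t), eps, t)
# Central finite difference of the path itself, for comparison.
v_numeric = (vp_point(x0, eps, t + h) - vp_point(x0, eps, t - h)) / (2.0 * h)
```

If the two values agree to within finite-difference error, the velocity really is recoverable from (x, ε) and the schedule alone, with no access to x₀.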

Score, noise, velocity — the three-way identity

The score of q(x_t | x₀), i.e. the gradient of its log density, is

s_t(x | x₀) = ∇_x log q(x | x₀) = −(x − √ᾱ_t x₀) / (1 − ᾱ_t)

Plug in x = √ᾱ_t x₀ + √(1−ᾱ_t) ε and the cancellation gives

s_t(x | x₀) = −ε / √(1 − ᾱ_t)

So noise prediction = rescaled score prediction, and along the VP path the velocity is an affine function of x and that same prediction. The same network with three name tags.
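
A small numeric check of the score-noise identity, as a sketch in one dimension. The helper names are hypothetical; the numbers are arbitrary and chosen so the square roots are exact (ᾱ_t = 0.64).

```python
import math


def conditional_score(x, x0, alpha_bar_t):
    """Gradient of log N(x; sqrt(alpha_bar) x0, (1 - alpha_bar)) with respect to x."""
    return -(x - math.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)


def eps_to_score(eps, alpha_bar_t):
    """Score expressed through the noise: -eps / sqrt(1 - alpha_bar)."""
    return -eps / math.sqrt(1.0 - alpha_bar_t)


def score_to_eps(score, alpha_bar_t):
    """Inverse conversion: eps = -sqrt(1 - alpha_bar) * score."""
    return -math.sqrt(1.0 - alpha_bar_t) * score


x0, eps, alpha_bar_t = 2.0, 0.3, 0.64
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
score_direct = conditional_score(x_t, x0, alpha_bar_t)  # approximately -0.5
score_from_eps = eps_to_score(eps, alpha_bar_t)         # approximately -0.5
```

The conversion is a pure rescaling by 1/√(1 − ᾱ_t), so a trained ε-predictor can be read as a score model without retraining.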

The deeper view: SDE limits

Take DDPM’s discrete chain and let the number of steps T → ∞ while the per-step β shrinks in proportion to 1/T, so the total noise injected over the chain stays fixed. The chain becomes a continuous-time SDE:

dx = −½ β(t) x dt + √(β(t)) dW

(Song et al. 2021, “Score-Based Generative Modeling Through Stochastic Differential Equations”.) For every SDE there’s a corresponding probability-flow ODE — the deterministic ODE whose marginals match the SDE’s. That probability-flow ODE is the flow-matching ODE for the VP path. The mapping above is the one between the SDE’s drift and the ODE’s velocity, made explicit.

So: DDPM chain → (T → ∞) VP SDE → probability-flow ODE = flow-matching ODE on the VP path.

This is the “DDPM is a special case of flow matching” line from the README. It’s correct, but more useful as a perspective than as a derivation aid: the SDE language is heavier than either DDPM or FM alone.
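
The drift-to-velocity correspondence can be sketched numerically. For the VP SDE above, the probability-flow ODE velocity is −½β(t)x − ½β(t)·score. The check below plugs in score = −ε/σ_t and compares against the affine ε-formula from earlier in this lesson. The linear β(t) schedule, its constants, and the function names are assumptions for illustration only.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed schedule endpoints, not from the lesson


def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def alpha_bar(t):
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))


def velocity_from_eps(x, eps, t):
    """Affine VP velocity written in terms of x and eps."""
    ab = alpha_bar(t)
    ab_dot = -beta(t) * ab
    sigma = math.sqrt(1.0 - ab)
    sigma_dot = -ab_dot / (2.0 * sigma)
    c = ab_dot / (2.0 * ab)
    return c * x + (sigma_dot - c * sigma) * eps


def pf_ode_velocity(x, score, t):
    """Probability-flow ODE for dx = -0.5 beta x dt + sqrt(beta) dW."""
    return -0.5 * beta(t) * x - 0.5 * beta(t) * score


x, eps, t = 0.9, -1.2, 0.3
sigma_t = math.sqrt(1.0 - alpha_bar(t))
v_from_eps = velocity_from_eps(x, eps, t)
v_from_score = pf_ode_velocity(x, -eps / sigma_t, t)
```

The two agree because ᾱ_t + σ_t² = 1 collapses the ε coefficient to β/(2σ_t). The same identity carries over to the learned quantities, since the network's ε_θ approximates E[ε | x_t] and the marginal score is −E[ε | x_t]/σ_t.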

Interactive · two paths, same destination

The widget below schematically contrasts two routes from a fixed x₀ to a fixed x₁: the straight FM linear path and a curved VP-flavored route. Strictly speaking, the VP forward marginal mean 𝔼[x_t | x₀] = √ᾱ_t · x₀ shrinks monotonically toward the origin (it does not swing out to a fixed x₁); the orange curve here is a Bézier schematic of “shrink toward zero, then bloom out” that captures the qualitative curvature of the sampling-time trajectory. The takeaway is just: curved routes need more integrator steps.

VP vs. linear: same x₀, same x₁, two routes (schematic)
Blue path = exact FM linear segment. Orange path = pedagogical Bézier proxy for the curvature you’d see in a VP-style sampler trajectory — not a literal q(x_t|x_0) marginal mean.

Each dot on the orange path marks one of T = 200 DDPM steps along the schematic curve (not a literal expected position). The blue dots are 50 linear-FM positions. More curvature (orange) means many more integrator steps are required (see the lesson 4 widget).
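
The "curved routes need more steps" claim can be tested without a widget. The sketch below runs explicit Euler with oracle velocities in one dimension: on the straight path the velocity is constant, so Euler is exact for any step count; on the conditional VP path (integrated backward from t = 1 to a small `t_min`) the velocity changes along the way and the error depends on the step count. The schedule constants, `t_min`, and all names are assumptions for illustration. The VP velocity uses the simplified form −½βx + (β/(2σ))ε, which follows from the affine formula when ᾱ̇ = −βᾱ.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed schedule endpoints, not from the lesson


def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def alpha_bar(t):
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))


def vp_point(x_data, eps, t):
    ab = alpha_bar(t)
    return math.sqrt(ab) * x_data + math.sqrt(1.0 - ab) * eps


def vp_velocity(x, eps, t):
    """Oracle VP velocity for a fixed eps (simplified affine form)."""
    sigma = math.sqrt(1.0 - alpha_bar(t))
    return -0.5 * beta(t) * x + beta(t) / (2.0 * sigma) * eps


def euler(velocity, x, t_start, t_end, steps):
    """Explicit Euler from t_start to t_end (dt is negative when going backward)."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


def straight_error(x0, x1, steps):
    """Endpoint error of Euler on the linear path with tangent x1 - x0."""
    return abs(euler(lambda x, t: x1 - x0, x0, 0.0, 1.0, steps) - x1)


def curved_error(x_data, eps, steps, t_min=1e-3):
    """Endpoint error of Euler on the VP path, integrated from t = 1 down to t_min."""
    start = vp_point(x_data, eps, 1.0)
    end = euler(lambda x, t: vp_velocity(x, eps, t), start, 1.0, t_min, steps)
    return abs(end - vp_point(x_data, eps, t_min))


err_straight = straight_error(-1.0, 2.0, 10)
err_curved_10 = curved_error(2.0, -0.5, 10)
err_curved_1000 = curved_error(2.0, -0.5, 1000)
```

This uses oracle velocities, so it isolates discretization error from model error; a learned velocity field on the linear path is only approximately straight, so real FM samplers still need more than one step.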

Interactive · the two trajectory bundles, side by side

Below: two 3D plots of the same particles’ trajectories under the two methods, each drawn at its own step count rather than at matched compute. Left is DDPM ancestral (T ≈ 200 steps, curved + jittery); right is FM Euler (K ≈ 30 steps, straight + clean). Same endpoints (same final mode assignments), different routes. The picture is the argument for FM’s lower sampling cost.

DDPM (curved + stochastic) vs. FM (straight + deterministic)
Press “render both”. Pause and rotate by changing the obliqueness slider. Notice: DDPM trajectories wiggle and curve; FM trajectories are nearly straight lines.

DDPM uses an oracle ε along the VP schedule with T = 200 and σ_t² = β_t. FM uses the exact linear-path tangent x₁ − x₀ with the same (x₀, x₁) pair, the “mode-matched” choice that lets us draw the same endpoints on both sides.

Practical consequence: pick by sampling-step budget

| Budget (forward passes) | Pick | Why |
|---|---|---|
| 1000+ | DDPM ancestral | Reference quality; you have the budget for the curved path |
| 50–100 | DDIM or FM-Euler | Either works; FM is one design choice cleaner |
| 10–20 | FM-RK4 or DPM-Solver++ | Higher-order solver on a straight path |
| 1–4 | Distilled student or consistency model | Below this, you’re fundamentally re-training the model to be a one-shot sampler |

What the code shows

In this repo, the same DiT class (diffusion_transformer.py) is used as both the ε-predictor for DDPM (make_ddpm_dit) and the v-predictor for flow matching (make_flow_dit). The architecture doesn’t know which one it is. The two factories differ in one line:

# diffusion_transformer.py  (make_ddpm_dit, line ~289)
denoiser = DiT(..., t_scale=1.0)

# flow_matching_transformer.py  (make_flow_dit, line ~56)
velocity = DiT(..., t_scale=1000.0)

That’s the difference. Everything else — patch embed, attention, adaLN-Zero, unpatch, optimizer — is shared. Whether the same parameters mean “predict noise” or “predict velocity” is decided entirely by which loss you train under. We’ll dig into the architecture next lesson.
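
The excerpt does not show what `t_scale` does inside `DiT`. One plausible reading, offered here purely as an assumption, is that it rescales the time input before the timestep embedding, so that DDPM's integer steps (0 to 999) and flow matching's continuous t ∈ [0, 1] land on the same embedding range. The function `timestep_embedding` below is a hypothetical stand-in written for this sketch, not the repo's implementation.

```python
import math


def timestep_embedding(t, t_scale, dim=8, max_period=10000.0):
    """Hypothetical sinusoidal embedding of the rescaled time t * t_scale."""
    half = dim // 2
    scaled = t * t_scale
    # Geometrically spaced frequencies, as in common sinusoidal embeddings.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(scaled * f) for f in freqs] + [math.sin(scaled * f) for f in freqs]


# DDPM side: integer step index, no rescaling (assumed usage).
ddpm_emb = timestep_embedding(500, t_scale=1.0)
# Flow-matching side: continuous time in [0, 1], rescaled by 1000 (assumed usage).
flow_emb = timestep_embedding(0.5, t_scale=1000.0)
```

Under this reading, the halfway point of either process produces the same embedding, which would be consistent with the claim that the architecture itself does not know which objective it is serving.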

Punchline
DDPM and FM share a parent: regress a local operator that pushes samples along a chosen density path. DDPM picks a curved path (the VP chain) and predicts the noise that defines it; FM picks a straight path and predicts the velocity that defines it. The mathematical objects (score, noise, velocity) are convertible. The architectures are interchangeable. The choice is mostly “what path do I want my sampler to follow?”