DDPM ↔ Flow matching
They look like two different methods. They are the same object, viewed from two angles.
Both fit a local operator on a density path
Strip away the surface differences:
| | DDPM | Flow matching |
|---|---|---|
| density path | q(x_t \| x₀) = N(√ᾱ_t x₀, (1−ᾱ_t) I), the VP “curved” path | chosen by you, e.g. (1−t)x₀ + t·x₁, the linear (OT) path |
| training target | noise ε that perturbed x₀ → x_t | velocity (x₁ − x₀) along the segment |
| training loss | ‖ε − ε_θ(x_t, t)‖² | ‖v_θ(x_t, t) − (x₁ − x₀)‖² |
| sampler | ancestral Gaussian, 1000 steps | Euler ODE, 50 steps |
| derivation framework | ELBO + Gaussian KL collapse | continuity equation + conditional E[·] |
The cells that look different are surface: path shape, what the net predicts, which solver you use. What matters is identical in both columns: each method trains by regression against a tractable conditional target, and each samples by iterating a local update.
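To make the shared structure concrete, here is a minimal sketch of how one training pair is built under each method, using plain Python lists in place of tensors. The helper names (`ddpm_example`, `fm_example`, `regression_loss`) and the stand-in `zero_model` are hypothetical, and the flow-matching endpoints are assumed to be x₀ = a noise sample and x₁ = a data sample.

```python
import math
import random

def ddpm_example(x0, alpha_bar_t, rng):
    """Build one DDPM training pair: noisy input x_t and regression target eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * xi + s * ei for xi, ei in zip(x0, eps)]
    return x_t, eps

def fm_example(x0, x1, t):
    """Build one flow-matching pair on the linear path: x_t and target x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target

def regression_loss(model, x_t, t, target):
    """Shared by both methods: squared norm of (prediction - conditional target)."""
    pred = model(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target))

rng = random.Random(0)
data = [0.5, -1.2, 2.0]

def zero_model(x_t, t):
    # Hypothetical stand-in for the network: always predicts zeros.
    return [0.0 for _ in x_t]

# DDPM: data sample in, noise target out.
x_t, eps = ddpm_example(data, alpha_bar_t=0.7, rng=rng)
ddpm_loss = regression_loss(zero_model, x_t, 0.3, eps)

# Flow matching: assumed convention x0 = noise, x1 = data.
noise = [rng.gauss(0.0, 1.0) for _ in data]
x_t_fm, vel = fm_example(noise, data, t=0.3)
fm_loss = regression_loss(zero_model, x_t_fm, 0.3, vel)
```

Only the pair construction differs; `regression_loss` is the same function in both cases.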
The mapping: ε ↔ v
For the VP path used by DDPM, xt = √ᾱt x0 + √(1−ᾱt) ε, the conditional velocity along the path is the time derivative of xt with x0 and ε held fixed:
ut(xt | x0, ε) = (d√ᾱt/dt) x0 + (d√(1−ᾱt)/dt) ε
If you replace x0 with its DDPM-form value x0 = (xt − √(1−ᾱt) ε) / √ᾱt (Eq. ⋆ inverted), you get a velocity expressed in (xt, ε). So any DDPM ε-predictor implicitly defines a flow-matching velocity along the VP path, and vice versa. With the standard shorthand αt := √ᾱt and σt := √(1 − ᾱt) (dot = d/dt):
v(xt, ε) = (α̇t / αt) xt + (σ̇t − σt α̇t / αt) ε
Don’t memorize the algebra; remember the structural claim: v is an affine function of x and ε with time-dependent coefficients fully determined by the schedule. Score, noise, velocity are three names for the same object up to the path’s known scaling.
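The affine relationship can be checked numerically. The sketch below assumes a continuous-time linear β(t) schedule on t ∈ [0, 1] with β_min = 0.1 and β_max = 20 (an illustrative choice, not taken from this lesson), and works in one dimension; all function names are hypothetical. It converts ε to v, inverts the map, and compares against differentiating the path directly.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear beta(t) schedule on t in [0, 1]

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha_sigma(t):
    """alpha_t = sqrt(alpha_bar_t), sigma_t = sqrt(1 - alpha_bar_t)."""
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t
    alpha = math.exp(-0.5 * integral)
    return alpha, math.sqrt(1.0 - alpha * alpha)

def alpha_sigma_dot(t):
    """Time derivatives of alpha_t and sigma_t for this schedule."""
    alpha, sigma = alpha_sigma(t)
    alpha_dot = -0.5 * beta(t) * alpha
    sigma_dot = -alpha * alpha_dot / sigma  # from alpha^2 + sigma^2 = 1
    return alpha_dot, sigma_dot

def eps_to_velocity(x_t, eps, t):
    """Affine map (x_t, eps) -> v with schedule-determined coefficients."""
    alpha, sigma = alpha_sigma(t)
    alpha_dot, sigma_dot = alpha_sigma_dot(t)
    return (alpha_dot / alpha) * x_t + (sigma_dot - sigma * alpha_dot / alpha) * eps

def velocity_to_eps(x_t, v, t):
    """Inverse of eps_to_velocity for fixed x_t and t."""
    alpha, sigma = alpha_sigma(t)
    alpha_dot, sigma_dot = alpha_sigma_dot(t)
    return (v - (alpha_dot / alpha) * x_t) / (sigma_dot - sigma * alpha_dot / alpha)

x0, eps, t = 1.5, -0.4, 0.35
alpha, sigma = alpha_sigma(t)
x_t = alpha * x0 + sigma * eps

# Velocity obtained by differentiating the path with x0 and eps held fixed.
alpha_dot, sigma_dot = alpha_sigma_dot(t)
v_direct = alpha_dot * x0 + sigma_dot * eps

v_from_eps = eps_to_velocity(x_t, eps, t)
eps_back = velocity_to_eps(x_t, v_from_eps, t)
```

For this particular schedule the two coefficients simplify to −β(t)/2 on x_t and β(t)/(2σ_t) on ε, which is one way to sanity-check an implementation.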
Score, noise, velocity — the three-way identity
The score of q(xt | x0) — the gradient of its log density — is
st(x | x0) = ∇x log q(x | x0) = −(x − √ᾱt x0) / (1 − ᾱt)
Plug in x = √ᾱt x0 + √(1−ᾱt) ε and the cancellation gives
st(x | x0) = −ε / √(1 − ᾱt)
So noise prediction and score prediction differ only by the known scale −1/√(1 − ᾱt), and along the VP path velocity prediction is a known affine function of xt and the predicted noise. The same network with three name tags.
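A quick one-dimensional check of the score identity, as a sketch: it compares the closed-form score, the −ε/√(1 − ᾱt) form, and a finite-difference gradient of the Gaussian log density. The function names and the sample values are illustrative assumptions.

```python
import math

def log_q(x, x0, alpha_bar):
    """Log density of N(sqrt(alpha_bar) * x0, 1 - alpha_bar) in one dimension."""
    mean, var = math.sqrt(alpha_bar) * x0, 1.0 - alpha_bar
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def score_closed_form(x, x0, alpha_bar):
    """Gradient of log_q with respect to x."""
    return -(x - math.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)

def score_from_eps(eps, alpha_bar):
    """The same score, written in terms of the noise that produced x."""
    return -eps / math.sqrt(1.0 - alpha_bar)

x0, eps, alpha_bar = 0.8, 1.3, 0.6
x = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

# Central finite difference of the log density as an independent reference.
h = 1e-5
score_numeric = (log_q(x + h, x0, alpha_bar) - log_q(x - h, x0, alpha_bar)) / (2.0 * h)

score_a = score_closed_form(x, x0, alpha_bar)
score_b = score_from_eps(eps, alpha_bar)
```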
The deeper view: SDE limits
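Take DDPM’s discrete chain and let the number of steps T → ∞ while the per-step β shrinks in proportion to 1/T, so the total noise injected stays fixed. The chain becomes a continuous-time SDE, the variance-preserving (VP) SDE:
dx = −½ β(t) x dt + √β(t) dW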
(Song et al. 2021, “Score-Based Generative Modeling Through Stochastic Differential Equations”.) For every SDE there’s a corresponding probability-flow ODE — the deterministic ODE whose marginals match the SDE’s. That probability-flow ODE is the flow-matching ODE for the VP path. The mapping above is the one between the SDE’s drift and the ODE’s velocity, made explicit.
So:
- DDPM (discrete) is the discretization of the VP SDE.
- The VP probability-flow ODE is flow matching with the VP path.
- FM with the linear path is the same machinery on a different path — one that’s straighter at sampling time.
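The probability-flow ODE can be exercised without a network when the data distribution is Gaussian, because the marginal score is then available in closed form. The sketch below assumes toy data x₀ ~ N(0, 2²) in one dimension and a linear β(t) schedule with β_min = 0.1, β_max = 20 (both illustrative choices); it Euler-integrates dx/dt = −½β(t)(x + score) from t = 1 back to t = 0. For this toy case the exact ODE solution scales x in proportion to the marginal standard deviation, which gives a reference value to compare against.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed linear beta(t) on t in [0, 1]
DATA_STD = 2.0                   # assumed toy data distribution N(0, DATA_STD^2)

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha_bar(t):
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))

def marginal_var(t):
    """Variance of x_t when x_0 ~ N(0, DATA_STD^2) under the VP path."""
    ab = alpha_bar(t)
    return ab * DATA_STD ** 2 + (1.0 - ab)

def pf_ode_velocity(x, t):
    """dx/dt = -0.5 * beta(t) * (x + score), using the exact Gaussian score."""
    score = -x / marginal_var(t)
    return -0.5 * beta(t) * (x + score)

def sample_backward(x_T, n_steps=1000):
    """Euler-integrate the probability-flow ODE from t = 1 down to t = 0."""
    x, h = x_T, 1.0 / n_steps
    for k in range(n_steps, 0, -1):
        x -= h * pf_ode_velocity(x, k * h)
    return x

x_T = 0.7
x_0 = sample_backward(x_T)

# Exact solution for this toy case: x scales with the marginal standard deviation.
expected = x_T * DATA_STD / math.sqrt(marginal_var(1.0))
```

In a trained model the analytic `score` would be replaced by the network's estimate (for example −ε_θ/σ_t); everything else in the loop stays the same.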
This is the “DDPM is a special case of flow matching” line from the README. It’s correct, but more useful as a perspective than as a derivation aid: the SDE language is heavier than either DDPM or FM alone.
Interactive · two paths, same destination
The widget below schematically contrasts two routes from a fixed x0 to a fixed x1: the straight FM linear path and a curved VP-flavored route. Strictly speaking, the VP forward marginal mean 𝔼[xt | x0] = √ᾱt · x0 shrinks monotonically toward the origin (it does not swing out to a fixed x1); the orange curve here is a Bézier schematic of “shrink toward zero, then bloom out” that captures the qualitative curvature of the sampling-time trajectory. The takeaway is just: curved routes need more integrator steps.
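The step-count claim can be reproduced with a toy one-dimensional experiment. The sketch below Euler-integrates a conditional velocity between fixed endpoints along two routes: the straight linear path, and a curved path with cosine/sine coefficients (an assumed VP-style stand-in whose squared coefficients sum to 1, not the DDPM schedule itself). The endpoints and function names are illustrative.

```python
import math

def euler_integrate(velocity, x_start, n_steps):
    """Forward Euler from t = 0 to t = 1 for a velocity that depends only on t."""
    x, h = x_start, 1.0 / n_steps
    for k in range(n_steps):
        x += h * velocity(k * h)
    return x

x0, x1 = -1.0, 2.0

def linear_velocity(t):
    # d/dt [(1 - t) * x0 + t * x1] = x1 - x0, constant in t
    return x1 - x0

def curved_velocity(t):
    # d/dt [cos(pi t / 2) * x0 + sin(pi t / 2) * x1]
    w = 0.5 * math.pi
    return w * (-math.sin(w * t) * x0 + math.cos(w * t) * x1)

def endpoint_errors(n_steps):
    """Absolute error at t = 1 for the (linear, curved) routes."""
    err_linear = abs(euler_integrate(linear_velocity, x0, n_steps) - x1)
    err_curved = abs(euler_integrate(curved_velocity, x0, n_steps) - x1)
    return err_linear, err_curved

for n_steps in (1, 4, 16, 64):
    err_linear, err_curved = endpoint_errors(n_steps)
    print(f"{n_steps:3d} steps  linear err={err_linear:.2e}  curved err={err_curved:.2e}")
```

With a constant velocity, a single Euler step already lands on the endpoint; on the curved route the error shrinks only as the step count grows. A learned marginal velocity field is not exactly constant even on the linear path, so this is an idealized picture of the conditional case.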
Interactive · the two trajectory bundles, side by side
Below: two 3D plots of the same particles’ trajectories under the two methods, each drawn at its own step budget. Left is DDPM ancestral (T ≈ 200 steps, curved and jittery); right is FM Euler (K ≈ 30 steps, straight and clean). Same endpoints (same final mode assignments), different routes. The picture is the argument for FM’s sampling cost.
Practical consequence: pick by sampling-step budget
| Budget (forward passes) | Pick | Why |
|---|---|---|
| 1000+ | DDPM ancestral | Reference quality; you have the budget for the curved path |
| 50–100 | DDIM or FM-Euler | Either works; FM is one design choice cleaner |
| 10–20 | FM-RK4 or DPM-Solver++ | Higher-order solver on a straight path |
| 1–4 | Distilled student or consistency model | At this budget you are effectively re-training the model to be a one-shot (or few-step) sampler |
What the code shows
In this repo, the same DiT class (diffusion_transformer.py) is used as both the ε-predictor for DDPM (make_ddpm_dit) and the v-predictor for flow matching (make_flow_dit). The architecture doesn’t know which one it is. The two factories differ in one line:
```python
# diffusion_transformer.py (make_ddpm_dit, line ~289)
denoiser = DiT(..., t_scale=1.0)

# flow_matching_transformer.py (make_flow_dit, line ~56)
velocity = DiT(..., t_scale=1000.0)
```
That’s the difference. Everything else — patch embed, attention, adaLN-Zero, unpatch, optimizer — is shared. Whether the same parameters mean “predict noise” or “predict velocity” is decided entirely by which loss you train under. We’ll dig into the architecture next lesson.
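The excerpt does not say what `t_scale` does inside `DiT`. One plausible reading, stated here purely as an assumption, is that it multiplies the time input before a sinusoidal timestep embedding, so that flow matching's continuous t ∈ [0, 1] lands in the same numeric range as DDPM's integer steps 0…999. The sketch below illustrates that reading only; `timestep_embedding` and `embed_time` are hypothetical helpers, not functions from the repo.

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of a scalar time, in the style of Transformer position encodings."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

def embed_time(t, t_scale):
    # Assumption: t_scale multiplies t before the embedding; the repo may differ.
    return timestep_embedding(t * t_scale)

ddpm_emb = embed_time(500, t_scale=1.0)       # integer step 500 out of 1000
flow_emb = embed_time(0.5, t_scale=1000.0)    # continuous t = 0.5 in [0, 1]
```

Under this assumption the two calls produce the same embedding, which would explain why nothing else in the architecture needs to change.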