DDPM ↔ Flow matching

They look like two different methods. They are the same object, viewed from two angles.

Both fit a local operator on a density path

Strip away the surface differences:

	DDPM	Flow matching
density path	q(x_t | x₀) = N(√ᾱ_t x₀, (1−ᾱ_t) I), the VP “curved” path	chosen by you, e.g. (1−t)x₀ + t·x₁, the linear (OT) path
training target	noise ε that perturbed x₀ → x_t	velocity (x₁ − x₀) along the segment
training loss	‖ε − ε_θ(x_t, t)‖²	‖v_θ(x_t, t) − (x₁ − x₀)‖²
sampler	ancestral Gaussian, 1000 steps	Euler ODE, 50 steps
derivation framework	ELBO + Gaussian KL collapse	continuity equation + conditional E[·]

The cells that look different are surface: path shape, what the net predicts, which solver you use. The cells that matter (training-by-regression-against-a-tractable-conditional-target, sampling-by-iterated-local-update) are identical.
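The two loss rows of the table can be written side by side in a few lines. This is a minimal NumPy sketch, not the repo's training code: `model` is a hypothetical callable standing in for the network, and the names `ddpm_loss` and `fm_loss` are made up for illustration.

```python
import numpy as np

def ddpm_loss(model, x0, eps, alpha_bar_t, t):
    """Noise-prediction loss on the VP path: ||eps - eps_theta(x_t, t)||^2."""
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return np.mean(np.sum((eps - model(x_t, t)) ** 2, axis=-1))

def fm_loss(model, x0, x1, t):
    """Velocity loss on the linear path: ||v_theta(x_t, t) - (x1 - x0)||^2."""
    x_t = (1.0 - t) * x0 + t * x1
    return np.mean(np.sum((model(x_t, t) - (x1 - x0)) ** 2, axis=-1))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 2))    # one endpoint of each path
x1 = rng.normal(size=(8, 2))    # other endpoint of the FM segment
eps = rng.normal(size=(8, 2))   # noise that perturbs x0 in DDPM

# Hypothetical stand-in network that always outputs zeros.
zero_model = lambda x, t: np.zeros_like(x)

loss_eps = ddpm_loss(zero_model, x0, eps, alpha_bar_t=0.5, t=0.3)
loss_v = fm_loss(zero_model, x0, x1, t=0.3)
```

Both functions have the same shape: build x_t from a tractable conditional, then regress the network output against a target you can compute in closed form. Only the path and the target differ.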

The mapping: ε ↔ v

For the VP path used by DDPM, the conditional velocity along the path is

u_t(x_t | x₀, ε) = d/dt ( √ᾱ_t x₀ + √(1 − ᾱ_t) ε ) = (d√ᾱ_t/dt) · x₀ + (d√(1 − ᾱ_t)/dt) · ε

Here is the whole trick in words. A point on the path is built from two ingredients: a shrinking piece of the clean data and a growing piece of the noise. The velocity is just how fast that mixture is changing, so it too is built from those same two ingredients. But we don’t have the clean data x₀ at sampling time; we only have the noisy point x_t and the network’s noise guess ε_θ. So we trade x₀ away: substitute its DDPM-form value x₀ = (x_t − √(1−ᾱ_t) ε) / √ᾱ_t (Eq. ⋆ inverted), and the velocity comes back out written purely in (x_t, ε). The upshot: any DDPM ε-predictor already is a flow-matching velocity along the VP path, and vice versa; you just rewrite it. With the shorthand α_t := ᾱ_t (the cumulative product, not the per-step 1 − β_t) and σ_t := √(1 − ᾱ_t), and a dot for d/dt, the factor (d√α_t/dt) / √α_t simplifies to α̇_t / (2α_t), giving:

v_VP(x, t) = (α̇_t / (2α_t)) · x + ( σ̇_t − (α̇_t / (2α_t)) · σ_t ) · ε_θ(x, t)

Don’t memorize the algebra; remember the structural claim: v is an affine function of x and ε with time-dependent coefficients fully determined by the schedule. Score, noise, and velocity are three names for the same object, up to affine maps that the path fixes in advance.
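If you would rather check the formula than trust the algebra, a finite difference does it. The sketch below assumes a continuous-time linear β(t) schedule with β_min = 0.1 and β_max = 20 (an illustrative choice, not something this lesson fixes) and compares the closed-form v against the numerical time-derivative of the path. The function names are hypothetical.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear beta(t) on t in [0, 1]

def alpha_bar(t):
    # alpha_bar(t) = exp(-integral_0^t beta(s) ds)
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))

def sigma(t):
    return np.sqrt(1.0 - alpha_bar(t))

def path(x0, eps, t):
    """Point on the VP path for a fixed (x0, eps) pair."""
    return np.sqrt(alpha_bar(t)) * x0 + sigma(t) * eps

def v_from_eps(x, eps, t):
    """v = (a_dot / (2a)) x + (s_dot - (a_dot / (2a)) s) eps, a = alpha_bar, s = sigma."""
    beta = BETA_MIN + (BETA_MAX - BETA_MIN) * t
    a = alpha_bar(t)
    a_dot = -beta * a              # d alpha_bar / dt
    s = sigma(t)
    s_dot = -a_dot / (2.0 * s)     # d sqrt(1 - alpha_bar) / dt
    k = a_dot / (2.0 * a)
    return k * x + (s_dot - k * s) * eps

x0 = np.array([1.0, -0.5])
eps = np.array([0.3, 1.2])
t, h = 0.4, 1e-5

v_formula = v_from_eps(path(x0, eps, t), eps, t)
v_numeric = (path(x0, eps, t + h) - path(x0, eps, t - h)) / (2.0 * h)
max_gap = np.max(np.abs(v_formula - v_numeric))
```

Note that `v_from_eps` never sees x₀: it only needs the noisy point and the noise, which is exactly what a sampler has on hand.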

Intuition · linear unpacking

Claim: a DDPM noise-predictor and a flow-matching velocity-predictor are the same network wearing different labels — converting one to the other is just multiply-and-add with numbers the schedule already fixes.

One path, two readings. Both methods walk the same curve from data to noise. DDPM reports “which noise ε sits at this point?”; flow matching reports “which direction v is this point moving?” Same curve — so knowing one answer must pin down the other.
Why it’s only multiply-and-add. The point x_t = √ᾱ_t x₀ + √(1−ᾱ_t) ε is a straight-line mix of data and noise. Velocity is its time-derivative, and the derivative of a straight mix is again a straight mix — no curves enter. So v can only be x_t and ε each scaled by some number, then added.
Where the numbers come from. Those scaling numbers are pure schedule — the coefficients α̇_t/(2α_t) and the σ̇_t term — chosen up front, identical for every datapoint. The network never has to learn the conversion; it’s arithmetic on known constants.
Hence interchangeable. Predict ε and you can read off v on the spot; train for v and you can read off ε. Neither is more fundamental; the path picks the dictionary between them.

Central point. “Noise” and “velocity” aren’t rival quantities to choose between — they are one quantity in two coordinate systems, and the schedule is the fixed exchange rate.
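The “exchange rate” runs both ways, which a round trip makes concrete. In this sketch the schedule values at a single time are picked by hand (β = 4, σ = 0.6, both illustrative assumptions), and the helper names are hypothetical.

```python
import numpy as np

def v_from_eps(x, eps, k, s, s_dot):
    """Affine map eps -> v; k = alpha_dot / (2 alpha), s = sigma, s_dot = d sigma / dt."""
    return k * x + (s_dot - k * s) * eps

def eps_from_v(x, v, k, s, s_dot):
    """Inverse map v -> eps; valid whenever s_dot - k * s is nonzero."""
    return (v - k * x) / (s_dot - k * s)

# Schedule values at one fixed time (assumed for illustration).
beta = 4.0
s = 0.6                        # sigma_t
a = 1.0 - s**2                 # alpha_bar_t on the VP path
k = -beta / 2.0                # alpha_dot / (2 alpha) = -beta / 2
s_dot = beta * a / (2.0 * s)   # d sigma / dt

x = np.array([0.7, -1.3, 0.2])
eps = np.array([0.5, 0.1, -0.9])

v = v_from_eps(x, eps, k, s, s_dot)
eps_back = eps_from_v(x, v, k, s, s_dot)
```

The inverse only divides by a schedule constant, so a network trained for v can serve ε at inference time without retraining, and the reverse.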

Score, noise, velocity — the three-way identity

The score of q(x_t | x₀) — the gradient of its log density — is

s_t(x | x₀) = ∇_x log q(x | x₀) = −(x − √ᾱ_t x₀) / (1 − ᾱ_t)

Plug in x = √ᾱ_t x₀ + √(1−ᾱ_t) ε and the cancellation gives

s_t(x | x₀) = −ε / √(1 − ᾱ_t)

So noise prediction = scaled score prediction = velocity prediction up to the affine map above (along the VP path). The same network with three name tags.
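The score-to-noise identity is easy to confirm numerically. A minimal sketch, assuming an arbitrary ᾱ_t = 0.36 and a hand-picked (x₀, ε) pair; `log_q` and `numeric_grad` are illustrative helpers, not repo functions.

```python
import numpy as np

def log_q(x, x0, alpha_bar_t):
    """Log density of N(sqrt(alpha_bar) x0, (1 - alpha_bar) I), up to a constant."""
    diff = x - np.sqrt(alpha_bar_t) * x0
    return -0.5 * np.sum(diff**2) / (1.0 - alpha_bar_t)

def numeric_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

alpha_bar_t = 0.36
x0 = np.array([1.0, -0.5])
eps = np.array([0.3, 1.2])
x = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

score_numeric = numeric_grad(lambda y: log_q(y, x0, alpha_bar_t), x)
score_from_eps = -eps / np.sqrt(1.0 - alpha_bar_t)
```

The two arrays agree to numerical precision, so a noise predictor divided by −√(1 − ᾱ_t) is a score predictor.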

The deeper view: SDE limits

Take DDPM’s discrete chain and let the number of steps T → ∞ while each per-step β shrinks in proportion to 1/T, so the total amount of noise injected stays fixed. The chain becomes a continuous-time SDE:

dx = −½ β(t) x dt + √(β(t)) dW

(Song et al. 2021, “Score-Based Generative Modeling Through Stochastic Differential Equations”.) Every such diffusion SDE has a corresponding probability-flow ODE: a deterministic ODE whose marginals match the SDE’s at every time. Its velocity is the SDE’s drift plus a score correction, dx/dt = −½ β(t) x − ½ β(t) ∇_x log p_t(x). That probability-flow ODE is the flow-matching ODE for the VP path: substituting the score-to-noise identity s = −ε / √(1 − ᾱ_t) turns its right-hand side into the v_VP formula above.

So:

DDPM (discrete) is a discretization of the VP SDE.
The VP probability-flow ODE is flow matching with the VP path.
FM with the linear path is the same machinery on a different path — one that’s straighter at sampling time.
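The second bullet can be checked directly: write the probability-flow ODE velocity (drift minus half the squared diffusion times the score) and compare it with the ε-to-v formula. This sketch again assumes the linear β(t) schedule with β_min = 0.1 and β_max = 20, and plugs the same arbitrary ε into both sides, since the identity is purely algebraic. Function names are hypothetical.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear beta(t) on t in [0, 1]

def beta(t):
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t

def alpha_bar(t):
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))

def pf_ode_velocity(x, eps, t):
    """dx/dt = f - 0.5 g^2 * score, with f = -0.5 beta x and g^2 = beta."""
    score = -eps / np.sqrt(1.0 - alpha_bar(t))
    return -0.5 * beta(t) * x - 0.5 * beta(t) * score

def v_vp(x, eps, t):
    """Flow-matching velocity on the VP path, written in (x, eps)."""
    a = alpha_bar(t)
    a_dot = -beta(t) * a
    s = np.sqrt(1.0 - a)
    s_dot = -a_dot / (2.0 * s)
    k = a_dot / (2.0 * a)
    return k * x + (s_dot - k * s) * eps

x = np.array([0.7, -1.3])
eps = np.array([0.5, 0.1])
max_gap = max(
    np.max(np.abs(pf_ode_velocity(x, eps, t) - v_vp(x, eps, t)))
    for t in (0.1, 0.5, 0.9)
)
```

The gap is at floating-point level: the x coefficient collapses to −β/2 and the ε coefficient to β/(2σ_t) in both expressions.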

Intuition · linear unpacking

Claim: “DDPM is a special case of flow matching” means DDPM is one particular path, sampled with extra randomness, where flow matching lets you pick the path.

Make the chain continuous. DDPM’s 1000 discrete denoising steps are a staircase. Shrink each step toward zero and the staircase becomes a smooth ramp — an SDE. Nothing new is added; we just stop pretending time is chunky.
Split off the dice-rolling. That smooth process has two parts: a deterministic drift (where each sample is pushed on average) and a random jitter (the noise added at each instant). Both shape how the cloud of samples evolves: the drift shrinks it toward the origin, the jitter spreads it out.
Swap the dice for a correction. There is a purely deterministic ODE, the probability-flow ODE, that moves each point so the overall distribution at every time matches the noisy process exactly. It is the drift plus a score-based term that reproduces the spreading the jitter used to do. Same destinations, no jitter. That deterministic ODE is a flow-matching velocity field, just for the curved VP path.
So FM is the bigger room. DDPM = this one curved path, run with the jitter still on. Flow matching = the freedom to choose any path (a straight one runs the same machinery with fewer, cleaner solver steps). DDPM is the special case; FM is the general setting.

Central point. The SDE story isn’t new math — it’s the same DDPM process, made smooth and stripped of its randomness, which reveals it was a flow-matching velocity field on a particular path all along.
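The “staircase becomes a ramp” step can be watched numerically. The sketch below assumes per-step β_i = β(i/T)/T with a linear β(t) from 0.1 to 20 (an illustrative choice; with T = 1000 it gives per-step values from roughly 1.2e-4 up to 0.02) and compares the discrete product ᾱ_T with the continuous-time value exp(−∫β).

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear beta(t) on t in [0, 1]

def discrete_alpha_bar(T):
    """Product of (1 - beta_i) over T steps, with beta_i = beta(i / T) / T."""
    t = np.arange(1, T + 1) / T
    beta_i = (BETA_MIN + (BETA_MAX - BETA_MIN) * t) / T
    return np.prod(1.0 - beta_i)

# Continuous limit at t = 1: exp(-integral_0^1 beta(s) ds).
continuous_alpha_bar = np.exp(-(BETA_MIN + 0.5 * (BETA_MAX - BETA_MIN)))

# Compare on a log scale, since alpha_bar at t = 1 is tiny.
gaps = {
    T: abs(np.log(discrete_alpha_bar(T)) - np.log(continuous_alpha_bar))
    for T in (100, 1000, 10000)
}
```

The gap shrinks roughly in proportion to 1/T, which is the sense in which the discrete chain and the SDE describe the same process.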

This is the “DDPM is a special case of flow matching” line from the README. It’s correct, but more useful as a perspective than as a derivation aid: the SDE language is heavier than either DDPM or FM alone.

Interactive · two paths, same destination

The widget below schematically contrasts two routes from a fixed x₀ to a fixed x₁: the straight FM linear path and a curved VP-flavored route. Strictly speaking, the VP forward marginal mean 𝔼[x_t | x₀] = √ᾱ_t · x₀ shrinks monotonically toward the origin (it does not swing out to a fixed x₁); the orange curve here is a Bézier schematic of “shrink toward zero, then bloom out” that captures the qualitative curvature of the sampling-time trajectory. The takeaway is just: curved routes need more integrator steps.
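The step-count claim can be made quantitative on conditional paths with known endpoints. This is a sketch under stated assumptions, not a trained sampler: it uses oracle velocities, a linear β(t) schedule from 0.1 to 20, and starts the VP integration at t = 0.01 because σ̇_t is unbounded at t = 0. All names are illustrative.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear beta(t) on t in [0, 1]
T0 = 0.01                       # start just after 0, where sigma_dot blows up

def alpha_bar(t):
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))

def vp_point(x0, eps, t):
    a = alpha_bar(t)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def vp_velocity(x0, eps, t):
    """Exact time-derivative of vp_point for a fixed (x0, eps) pair."""
    a = alpha_bar(t)
    a_dot = -(BETA_MIN + (BETA_MAX - BETA_MIN) * t) * a
    return a_dot / (2.0 * np.sqrt(a)) * x0 - a_dot / (2.0 * np.sqrt(1.0 - a)) * eps

def euler(velocity, x_start, t_start, t_end, n_steps):
    """Forward Euler for dx/dt = velocity(t)."""
    x, h = x_start.copy(), (t_end - t_start) / n_steps
    for k in range(n_steps):
        x = x + h * velocity(t_start + k * h)
    return x

x0 = np.array([1.0, -0.5])
eps = np.array([0.3, 1.2])
x1 = np.array([2.0, 1.0])

def vp_error(n_steps):
    end = euler(lambda t: vp_velocity(x0, eps, t), vp_point(x0, eps, T0), T0, 1.0, n_steps)
    return np.linalg.norm(end - vp_point(x0, eps, 1.0))

def linear_error(n_steps):
    end = euler(lambda t: x1 - x0, x0, 0.0, 1.0, n_steps)
    return np.linalg.norm(end - x1)

errors = {"vp_10": vp_error(10), "vp_100": vp_error(100), "linear_1": linear_error(1)}
```

On the straight segment the velocity is constant, so a single Euler step lands on the endpoint. On the VP path the velocity changes with t, and the error only shrinks as the step count grows.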

VP vs. linear: same x₀, same x₁, two routes (schematic)

Blue path = exact FM linear segment. Orange path = pedagogical Bézier proxy for the curvature you’d see in a VP-style sampler trajectory — not a literal q(x_t|x_0) marginal mean.

Each dot on the orange path marks the schematic position at one of T = 200 DDPM steps. The blue dots are 50 linear-FM positions. Curvature (orange) ↔ many more integrator steps required (lesson 4 widget).

Interactive · the two trajectory bundles, side by side

Below: two 3D plots of the same particles’ trajectories under the two methods, each drawn at its own step budget rather than at matched compute. Left is DDPM ancestral (~T = 200 steps, curved + jittery); right is FM Euler (~K = 30 steps, straight + clean). Same endpoints (same final mode assignments), different routes. The picture is the argument for FM’s lower sampling cost.

DDPM (curved + stochastic) vs. FM (straight + deterministic)

Press “render both”. Pause and rotate by changing the obliqueness slider. Notice: DDPM trajectories wiggle and curve; FM trajectories are nearly straight lines.

DDPM uses an oracle ε along the VP schedule with T = 200 and σ_t² = β_t. FM uses the exact linear-path tangent x₁ − x₀ with the same (x₀, x₁) pair — the “mode-matched” choice that lets us draw the same endpoints on both sides.
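For reference, here is roughly what one update looks like on each side. This is a sketch of the standard DDPM ancestral step with σ_t² = β_t and a plain Euler step on the linear path; it is not the widget’s source, and the function names and numeric values are assumptions for illustration. With an oracle ε and the noise switched off, the ancestral step should reproduce the Gaussian posterior mean of q(x_{t−1} | x_t, x₀), which the sketch includes as a cross-check.

```python
import numpy as np

def ancestral_step(x_t, eps_hat, beta_t, alpha_bar_t, z):
    """One DDPM reverse step with sigma_t^2 = beta_t; z is standard normal noise."""
    alpha_t = 1.0 - beta_t
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)
    return mean + np.sqrt(beta_t) * z

def posterior_mean(x_t, x0, beta_t, alpha_bar_t):
    """Mean of q(x_{t-1} | x_t, x0), used here only as a cross-check."""
    alpha_t = 1.0 - beta_t
    alpha_bar_prev = alpha_bar_t / alpha_t
    c0 = np.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    ct = np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return c0 * x0 + ct * x_t

def fm_euler_step(x, x0, x1, h):
    """One Euler step along the exact linear-path tangent x1 - x0."""
    return x + h * (x1 - x0)

beta_t, alpha_bar_t = 0.02, 0.5   # illustrative schedule values at one step
x0 = np.array([1.0, -0.5])
x1 = np.array([2.0, 1.0])
eps = np.array([0.3, 1.2])        # oracle noise for this (x0, x_t) pair
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

ddpm_mean = ancestral_step(x_t, eps, beta_t, alpha_bar_t, z=np.zeros(2))

K = 30
x = x0.copy()
for _ in range(K):
    x = fm_euler_step(x, x0, x1, 1.0 / K)
```

The DDPM step needs the schedule at every call and adds fresh noise; the FM step is a single multiply-add along a fixed direction.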

Practical consequence: pick by sampling-step budget

Budget (forward passes)	Pick	Why
1000+	DDPM ancestral	Reference quality; you have the budget for the curved path
50–100	DDIM or FM-Euler	Either works; FM is one design choice cleaner
10–20	FM-RK4 or DPM-Solver++	Higher-order solver on a straight path
1–4	Distilled student or consistency model	At this budget, you’re fundamentally re-training the model to be a few-step sampler

What the code shows

In this repo, the same DiT class (diffusion_transformer.py) is used as both the ε-predictor for DDPM (make_ddpm_dit) and the v-predictor for flow matching (make_flow_dit). The architecture doesn’t know which one it is. The two factories differ in one line:

# diffusion_transformer.py  (make_ddpm_dit, line ~289)
denoiser = DiT(..., t_scale=1.0)

# flow_matching_transformer.py  (make_flow_dit, line ~56)
velocity = DiT(..., t_scale=1000.0)

That’s the difference. Everything else — patch embed, attention, adaLN-Zero, unpatch, optimizer — is shared. Whether the same parameters mean “predict noise” or “predict velocity” is decided entirely by which loss you train under. We’ll dig into the architecture next lesson.

Punchline

DDPM and FM share a parent: regress a local operator that pushes samples along a chosen density path. DDPM picks a curved path (the VP chain) and predicts the noise that defines it; FM picks a straight path and predicts the velocity that defines it. The mathematical objects (score, noise, velocity) are convertible. The architectures are interchangeable. The choice is mostly “what path do I want my sampler to follow?”