
Flow matching — pick a path, not a chain

A different design choice with a shockingly short derivation: the entire training objective is one regression line.

The frame shift

DDPM built a path by designing a noise process: pick βt, induce q(xt), derive everything else. The path is implicit, curved, and tied to Gaussian noising.

Flow matching flips it: design the path directly. Pick any one-parameter family of distributions {pt}t∈[0,1] with p0 = N(0, I) and p1 = pdata. Learn the velocity field vθ(x, t) that pushes mass along it. Sample by ODE integration.

The objective doesn’t depend on the path being Gaussian, Markov, or anything in particular. That freedom is the design lever — we’ll use it to pick a path that’s easy to integrate (lesson 6).

The continuity equation

The compatibility condition: a time-varying density pt and velocity ut are compatible (the field actually moves mass along the path) iff they satisfy

∂pt(x)/∂t + ∇ · ( pt(x) · ut(x) ) = 0

This is just conservation of probability mass written in PDE form, the same continuity equation used in fluid dynamics. It says: the rate of change of density at x equals the negative divergence of the flux pt ut.
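A quick numerical sanity check of the continuity equation may help here. The sketch below assumes a toy one-dimensional path that is not part of the lesson: xt = σ(t) · x0 with x0 ∼ N(0, 1) and a made-up linear schedule σ(t). The names sigma, density, velocity and continuity_residual are illustrative only. For this path the density and the velocity are both known, so the residual of the PDE can be estimated with finite differences.

```python
import numpy as np

def sigma(t):
    # Hypothetical schedule: standard deviation shrinks linearly from 1.0 to 0.5.
    return 1.0 - 0.5 * t

def density(x, t):
    """p_t for the toy path x_t = sigma(t) * x_0 with x_0 ~ N(0, 1)."""
    s = sigma(t)
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def velocity(x, t):
    # d/dt x_t = sigma'(t) * x_0 = (sigma'(t) / sigma(t)) * x_t, with sigma'(t) = -0.5.
    return (-0.5 / sigma(t)) * x

def continuity_residual(x, t, h=1e-4):
    """Central-difference estimate of d/dt p_t + d/dx (p_t * u_t)."""
    dp_dt = (density(x, t + h) - density(x, t - h)) / (2 * h)
    flux = lambda y: density(y, t) * velocity(y, t)
    dflux_dx = (flux(x + h) - flux(x - h)) / (2 * h)
    return dp_dt + dflux_dx

xs = np.linspace(-3.0, 3.0, 61)
residual = continuity_residual(xs, t=0.4)
print(np.max(np.abs(residual)))  # small, limited only by finite-difference error
```

Swapping in a velocity that does not match the path (for example, zero everywhere) leaves a residual of order one, which is what "incompatible" means in practice.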

If we knew the “marginal velocity” ut(x) at every point, sampling would be easy: integrate dx/dt = ut(x) starting at x0 ∼ p0; you land in p1. The trouble: ut(x) has no closed form for any non-trivial pdata. We can’t evaluate it, so we can’t regress against it directly.
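To make "integrate dx/dt = ut(x)" concrete, here is a minimal forward-Euler sketch. It assumes a deliberately trivial one-dimensional target, p1 = N(M, S²), chosen only because that is one of the few cases where the marginal velocity of the linear path has a closed form (the Gaussian conditional mean). M, S, marginal_velocity and euler_sample are names invented for this illustration.

```python
import numpy as np

M, S = 2.0, 0.5  # hypothetical toy target: p1 = N(M, S^2) in one dimension

def marginal_velocity(x, t):
    """Closed-form E[x1 - x0 | x_t = x] for independent x0 ~ N(0, 1), x1 ~ N(M, S^2)
    on the linear path x_t = (1 - t) * x0 + t * x1."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2      # Var[x_t]
    cov = t * S ** 2 - (1.0 - t)               # Cov[x1 - x0, x_t]
    return M + (cov / var_t) * (x - t * M)     # Gaussian conditional mean

def euler_sample(velocity, x0, n_steps=200):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x

rng = np.random.default_rng(0)
samples = euler_sample(marginal_velocity, rng.standard_normal(20_000))
print(samples.mean(), samples.std())  # should land close to M and S
```

In the real setting the closed-form marginal_velocity is replaced by the learned vθ; the integration loop stays the same.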

The CFM trick: conditional paths

Pick a conditional path that’s trivial: for each pair (x0, x1) with x0 ∼ p0, x1 ∼ pdata, define a path from x0 to x1. The linear (OT) path is

xt = (1 − t) · x0 + t · x1   [Eq. ▲]

This is a straight segment; the conditional velocity is its time derivative, which is constant:

ut(xt | x0, x1) = d/dt xt = x1 − x0   [Eq. ●]

Both quantities are evaluable: given a sampled (x0, x1, t), the conditional xt is one linear combination and the conditional velocity is one subtraction. Zero networks needed.
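As a sketch of how cheap these two quantities are, the snippet below builds xt from Eq. ▲ and the target from Eq. ● for a small batch. The function name sample_conditional and the stand-in data batch are assumptions made for illustration; the only real ingredients are one Gaussian draw, one uniform draw, and two lines of arithmetic.

```python
import numpy as np

def sample_conditional(x1, rng):
    """Draw (x0, t), then return x_t from Eq. ▲ and the target from Eq. ●.

    x1 is a batch of data points with shape (B, D); everything else is sampled.
    """
    B = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)   # x0 ~ N(0, I)
    t = rng.uniform(size=(B, 1))         # one t per pair, shaped to broadcast over D
    xt = (1.0 - t) * x0 + t * x1         # Eq. ▲: point on the straight segment
    target = x1 - x0                     # Eq. ●: constant conditional velocity
    return x0, t, xt, target

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 2)) + 3.0   # stand-in "data" batch, purely illustrative
x0, t, xt, target = sample_conditional(x1, rng)

# Moving along the segment for a small time step dt shifts x_t by dt * target.
dt = 1e-3
xt_next = (1.0 - (t + dt)) * x0 + (t + dt) * x1
print(np.allclose((xt_next - xt) / dt, target))
```

Note the (B, 1) shape of t: a flat vector of length B would not broadcast correctly against a (B, D) batch.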

The marginal-conditional equivalence

Why does regressing against the easy conditional thing give us the impossible marginal one? Because conditional expectation is the minimizer of squared error. The marginal velocity that’s compatible with pt is

ut(x) = 𝔼[ x1 − x0 | xt = x ],   (x0, x1) ∼ q = p0 × pdata

That’s the average of the constant conditional velocities, weighted by which (x0, x1) pairs are likely to have produced this particular xt. And the regression

argmin_v   𝔼_{t, x0, x1} ‖ v(xt, t) − (x1 − x0) ‖²

has minimizer exactly ut(x) = 𝔼[x1 − x0 | xt = x] — the same thing! So we can regress against the easy target and recover the impossible marginal one. This is the entire CFM identity.
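The claim that the averaged conditional velocity is compatible with pt can be checked against the continuity equation. A sketch of the argument, using notation introduced here for illustration: z = (x0, x1) is a pair, q(z) its joint density, pt(x | z) the conditional density of xt given the pair, and ut(x | z) = x1 − x0 the conditional velocity. For the linear path pt(x | z) is a point mass at xt, so the manipulations below should be read in the weak (distributional) sense.

```latex
\begin{aligned}
p_t(x) &= \int p_t(x \mid z)\, q(z)\, dz, \qquad z = (x_0, x_1), \\
\partial_t p_t(x)
  &= \int \partial_t p_t(x \mid z)\, q(z)\, dz
   = -\int \nabla \cdot \big( p_t(x \mid z)\, u_t(x \mid z) \big)\, q(z)\, dz \\
  &= -\nabla \cdot \Big( p_t(x)\,
     \underbrace{\frac{\int u_t(x \mid z)\, p_t(x \mid z)\, q(z)\, dz}{p_t(x)}}_{=\ \mathbb{E}[\,x_1 - x_0 \mid x_t = x\,]\ =\ u_t(x)} \Big).
\end{aligned}
```

The second equality applies the continuity equation to each conditional path separately; the last line only multiplies and divides by pt(x). The bracketed ratio is the posterior average of the conditional velocities, which is the conditional expectation stated above.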

Why is the minimizer of E‖f(X) − Y‖² over functions f of X exactly E[Y | X]?

For any function f(X):

E ‖f(X) − Y‖² = E ‖f(X) − E[Y | X]‖² + E ‖E[Y | X] − Y‖²

This is the orthogonal (Pythagorean) decomposition in L²: the cross-term E[(f(X) − E[Y|X]) · (E[Y|X] − Y)] vanishes because, conditional on X, the first factor is constant while the second has zero conditional mean. The second term doesn’t depend on f; the first is nonnegative and equals zero exactly at f(X) = E[Y | X]. ▪

Plug in X = (xt, t) and Y = x1 − x0; the regression optimum is the marginal velocity by definition.
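The decomposition can also be checked by Monte Carlo. The sketch below assumes a toy one-dimensional Gaussian target, x1 ∼ N(M, S²), and a fixed time T, because in that case E[Y | X] is known in closed form and happens to be linear in X. M, S, T, cond_mean and the competitor f are all choices made for this illustration.

```python
import numpy as np

M, S, T = 2.0, 0.5, 0.5   # hypothetical toy setup: x1 ~ N(M, S^2), fixed time T

def cond_mean(x):
    """E[x1 - x0 | x_t = x] for independent x0 ~ N(0, 1), x1 ~ N(M, S^2)."""
    var_t = (1.0 - T) ** 2 + (T * S) ** 2
    cov = T * S ** 2 - (1.0 - T)
    return M + (cov / var_t) * (x - T * M)

rng = np.random.default_rng(0)
n = 400_000
x0 = rng.standard_normal(n)
x1 = M + S * rng.standard_normal(n)
X = (1.0 - T) * x0 + T * x1      # X = x_t
Y = x1 - x0                      # Y = conditional velocity target

f = np.sin(X) + 1.0              # an arbitrary competitor f(X)
lhs = np.mean((f - Y) ** 2)
rhs = np.mean((f - cond_mean(X)) ** 2) + np.mean((cond_mean(X) - Y) ** 2)

# Least squares over straight lines recovers the conditional mean here,
# because E[Y | X] is itself linear in X for jointly Gaussian (X, Y).
slope, intercept = np.polyfit(X, Y, deg=1)
print(lhs, rhs)                  # equal up to Monte Carlo error
print(slope, intercept)          # close to the coefficients inside cond_mean
```

The regression never sees cond_mean; it only sees noisy targets Y, yet its fitted line agrees with the closed form. That is the CFM identity in miniature.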

The training objective in one line

LCFM = 𝔼_{t ∼ U(0, 1), x0 ∼ N(0, I), x1 ∼ pdata}   ‖ vθ(xt, t) − (x1 − x0) ‖²

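That’s the whole thing. In Python (FlowMatching.loss):

import torch

def loss(self, x1):  # FlowMatching.loss; x1 is a data batch of shape (B, ...)
    x0 = torch.randn_like(x1)                       # x0 ~ N(0, I)
    t  = torch.rand(x1.shape[0], device=x1.device)  # t ~ U(0, 1), one per sample
    tb = t.view(-1, *([1] * (x1.dim() - 1)))        # reshape so t broadcasts over features
    xt = (1 - tb) * x0 + tb * x1                    # point on the linear path
    target = x1 - x0                                # conditional velocity
    pred   = self.model(xt, t)
    return ((pred - target) ** 2).mean()

Seven short statements. The only subtle one is the reshape: a flat t of shape (B,) does not broadcast against a (B, D) batch. Compare to the half-page ELBO derivation behind DDPM’s identical-looking but conceptually different MSE.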

Interactive · the conditional velocity field

The widget below visualizes Eq. ●. Pick a t; we sample many (x0, x1) pairs, place each xt at (1 − t) x0 + t x1, and draw an arrow showing the constant conditional velocity x1 − x0. Notice how, at each point in the cloud, several arrows of different directions exist — the average of those arrows is the marginal velocity field, which is what the net learns.

Conditional velocity scatter (a.k.a. “the targets the net sees”)
Each grey dot is one x_t at the chosen t; the orange arrow is its conditional target x₁ − x₀. Hit show marginal field to average arrows in bins — that’s what the trained velocity net produces.
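The "average arrows in bins" step can be reproduced offline. The sketch below assumes a one-dimensional toy dataset that is not the one in the widget: x1 is +1 or −1 with equal probability. For that choice the marginal velocity has a closed form, so the binned average of the conditional targets can be compared against it. T, exact_marginal_velocity and the bin layout are illustrative choices.

```python
import numpy as np

T = 0.5  # time slice to inspect

def exact_marginal_velocity(x, t):
    """E[x1 - x0 | x_t = x] when x1 is +1 or -1 with equal probability, x0 ~ N(0, 1)."""
    posterior_mean_x1 = np.tanh(t * x / (1.0 - t) ** 2)   # E[x1 | x_t = x]
    return (posterior_mean_x1 - x) / (1.0 - t)

rng = np.random.default_rng(0)
n = 400_000
x0 = rng.standard_normal(n)
x1 = rng.choice([-1.0, 1.0], size=n)
xt = (1.0 - T) * x0 + T * x1
target = x1 - x0                       # one conditional arrow per sample

# Average the arrows inside narrow bins of x_t.
edges = np.linspace(-1.0, 1.0, 21)
which = np.digitize(xt, edges) - 1     # bin index per sample; -1 or 20 means outside
centers = 0.5 * (edges[:-1] + edges[1:])
binned = np.array([target[which == b].mean() for b in range(len(centers))])

print(np.max(np.abs(binned - exact_marginal_velocity(centers, T))))
```

Inside a single bin near x = 0 the individual targets have both signs, since either endpoint could have produced that xt. Only their average is a well-defined function of x, and that average is what the regression converges to.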

Interactive · train a velocity net, see the field

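The previous widget shows the target field (averaged from sampled pairs). Here’s what a trained network actually learns: a smooth field vθ(x, t) defined on the whole plane. Hit train, then drag t: each arrow is the velocity the model would give a sample sitting at that grid point at that time. For the ideal marginal field with independent (x0, x1), at t = 0 we have xt = x0, which says nothing about x1, so the field is 𝔼[x1] − x and every arrow points at the data mean. At t = 1 we have xt = x1 and the noise averages out, so the field is x − 𝔼[x0] = x: it does not shrink to zero, because each straight segment is traversed at constant speed right up to its endpoint. Expect the trained net to match this only where samples actually land.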

Learned v_θ(x, t) as an arrow field
After ~3000 CFM updates on the moons, the field stabilizes. Then slide t and watch how the field changes character with time.

Why this is “better” than diffusion (sometimes)

…and where it doesn’t obviously win
Empirically, flow-matching DiTs (Stable Diffusion 3, Flux, Esser et al.-style) roughly match DDPM-DiTs in quality at matched compute. The practical advantage is sampling cost: on the order of 50 steps vs. 1000 at comparable quality. For training, the two cost about the same, so the choice is mostly an engineering one.

Trade-offs vs. DDPM, summarized

Axis | DDPM | Flow matching (linear)
Path | VP Gaussian chain (curved) | Linear interpolation (straight)
Time domain | discrete t ∈ {0, …, T−1} | continuous t ∈ [0, 1]
Target | noise ε | velocity x₁ − x₀
Schedule knobs | β_start, β_end, T, cosine vs. linear | none (just the path choice)
Math prereq | ELBO + Gaussian KL algebra | continuity equation + conditional E[·]
Loss derivation | ~half page | ~3 lines
Theoretical floor | same minimizer (the true marginal velocity / score) | same minimizer

Punchline
Conditional expectation is the L₂ minimizer. So regressing against an easy conditional target (constant velocity along a linear segment) gives you the impossible marginal target for free. The whole flow-matching loss is one regression line, with one Gaussian sample and one uniform t.