Flow matching — pick a path, not a chain
A different design choice with a remarkably short derivation: the entire training objective is a single regression loss.
The frame shift
DDPM built a path by designing a noise process: pick βt, induce q(xt), derive everything else. The path is implicit, curved, and tied to Gaussian noising.
Flow matching flips it: design the path directly. Pick any one-parameter family of distributions {pt}t∈[0,1] with p0 = N(0, I) and p1 = pdata. Learn the velocity field vθ(x, t) that pushes mass along it. Sample by ODE integration.
The objective doesn’t depend on the path being Gaussian, Markov, or anything in particular. That freedom is the design lever — we’ll use it to pick a path that’s easy to integrate (lesson 6).
The continuity equation
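The compatibility condition: a time-varying density pt and a velocity field ut are compatible (the field actually moves mass along the path) if and only if they satisfy the continuity equation

$$\partial_t p_t(x) + \nabla \cdot \big( p_t(x)\, u_t(x) \big) = 0.$$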
This is just conservation of probability mass written in PDE form — the same equation as fluid dynamics. It says: the rate of change of density at x equals the negative divergence of the flux pt ut.
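As a quick sanity check of the continuity equation, here is a worked 1-D case. The setup is an illustrative choice, not something from the lesson: a zero-mean Gaussian whose standard deviation σ_t changes over time, paired with the linear field u_t(x) = (σ̇_t / σ_t) x.

$$
\begin{aligned}
p_t(x) &= \frac{1}{\sqrt{2\pi}\,\sigma_t} \exp\!\Big( -\frac{x^2}{2\sigma_t^2} \Big), \qquad u_t(x) = \frac{\dot\sigma_t}{\sigma_t}\, x, \\
\partial_t p_t(x) &= p_t(x) \Big( -\frac{\dot\sigma_t}{\sigma_t} + \frac{x^2\, \dot\sigma_t}{\sigma_t^3} \Big), \\
\partial_x \big( p_t(x)\, u_t(x) \big) &= \frac{\dot\sigma_t}{\sigma_t} \Big( p_t(x) + x\, \partial_x p_t(x) \Big) = p_t(x) \Big( \frac{\dot\sigma_t}{\sigma_t} - \frac{x^2\, \dot\sigma_t}{\sigma_t^3} \Big), \\
\partial_t p_t(x) + \partial_x \big( p_t(x)\, u_t(x) \big) &= 0.
\end{aligned}
$$

The third line uses ∂_x p_t(x) = −(x / σ_t²) p_t(x). The two terms cancel exactly, so this field transports mass along the Gaussian family.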
If we knew the “marginal velocity” ut(x) at every point, sampling would be easy: integrate dx/dt = ut(x) starting at x0 ∼ p0; you land in p1. The trouble: ut(x) has no closed form for any non-trivial pdata. We can’t evaluate it, so we can’t regress against it directly.
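The sampling recipe above can be sketched on a toy case where the marginal velocity is actually known. The assumptions here are ours, not the lesson's: the data distribution is a 1-D Gaussian N(M, S²), one of the few targets for which u_t(x) has a closed form under the linear path introduced in the next section. The names `marginal_velocity` and `euler_sample` are hypothetical.

```python
import random
import statistics

# Toy assumption: p_data = N(M, S^2) in 1-D, so u_t(x) is available in closed form.
M, S = 2.0, 0.5

def marginal_velocity(x, t):
    """u_t(x) = E[x1 - x0 | x_t = x] for independent x0 ~ N(0, 1), x1 ~ N(M, S^2)."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2   # Var(x_t) for x_t = (1 - t) x0 + t x1
    cov = t * S ** 2 - (1.0 - t)            # Cov(x1 - x0, x_t)
    return M + cov / var_t * (x - t * M)    # Gaussian conditional mean

def euler_sample(velocity, x0, n_steps=100):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x

rng = random.Random(0)
samples = [euler_sample(marginal_velocity, rng.gauss(0.0, 1.0)) for _ in range(5000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
print(sample_mean, sample_std)  # should land near M and S
```

Starting from standard normal noise, the integrated samples end up with roughly the mean and spread of the target. For real data the same loop applies, with a learned vθ in place of `marginal_velocity`.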
The CFM trick: conditional paths
Pick a conditional path that’s trivial: for each pair (x0, x1) with x0 ∼ p0, x1 ∼ pdata, define a path from x0 to x1. The linear (OT) path is

$$x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1].$$

This is a straight segment; the conditional velocity is its time derivative, which is constant:

$$u_t(x_t \mid x_0, x_1) = \frac{d x_t}{dt} = x_1 - x_0.$$
Both quantities are evaluable: given a sampled (x0, x1, t), the conditional xt is one linear combination and the conditional velocity is one subtraction. Zero networks needed.
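A minimal sketch of that sampling step for a single data point, in plain Python. The helper name `sample_conditional_pair` and the 2-D point are made up for illustration.

```python
import random

def sample_conditional_pair(x1, rng):
    """Draw x0 ~ N(0, I) and t ~ U(0, 1); return (x0, t, x_t, target) for the linear path."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # noise endpoint, same dimension as x1
    t = rng.random()                                        # one time value per pair
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]    # one linear combination
    target = [b - a for a, b in zip(x0, x1)]                # one subtraction
    return x0, t, xt, target

rng = random.Random(0)
x1 = [1.5, -0.5]  # a made-up 2-D "data" point
x0, t, xt, target = sample_conditional_pair(x1, rng)
print(t, xt, target)
```

Because the conditional velocity is constant, following `target` from `xt` for the remaining time 1 − t lands exactly on `x1`, and following it backward for time t returns to `x0`.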
The marginal-conditional equivalence
Why does regressing against the easy conditional thing give us the impossible marginal one? Because conditional expectation is the minimizer of squared error. The marginal velocity that’s compatible with pt is

$$u_t(x) = \mathbb{E}\big[\, x_1 - x_0 \mid x_t = x \,\big].$$

That’s the average of the constant conditional velocities, weighted by which (x0, x1) pairs are likely to have produced this particular xt. And the regression

$$\min_\theta \; \mathbb{E}_{t,\, x_0,\, x_1} \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2$$

has, over all functions of (x, t), minimizer exactly ut(x) = 𝔼[x1 − x0 | xt = x]: the same thing! So we can regress against the easy target and recover the impossible marginal one. This is the entire CFM identity.
Why is the minimizer of E‖f(X) − Y‖² over functions f of X exactly E[Y | X]?
For any function f(X):
E ‖f(X) − Y‖² = E ‖f(X) − E[Y | X]‖² + E ‖E[Y | X] − Y‖²
This is the orthogonal (Pythagorean) decomposition in L²: the cross-term E[(f(X) − E[Y|X]) · (E[Y|X] − Y)] vanishes because, conditional on X, the first factor is constant while the second has zero conditional mean. The second term doesn’t depend on f; the first is nonnegative and equals zero exactly when f(X) = E[Y | X]. ▪
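The decomposition can be checked numerically. The sketch below uses a made-up toy model in which the conditional mean is known by construction (Y = X² + noise, so E[Y | X] = X²); `competitor` stands in for an arbitrary other function f.

```python
import random

rng = random.Random(0)
n = 20000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [x * x + rng.gauss(0.0, 0.5) for x in xs]   # toy model: E[Y | X] = X^2

def cond_mean(x):
    return x * x

def competitor(x):
    return x * x + 0.3 * x                        # any other function of X

def mse(f):
    """Monte Carlo estimate of E (f(X) - Y)^2."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / n

lhs = mse(competitor)                                             # E (f - Y)^2
gap = sum((competitor(x) - cond_mean(x)) ** 2 for x in xs) / n    # E (f - E[Y|X])^2
floor = mse(cond_mean)                                            # E (E[Y|X] - Y)^2
print(lhs, gap + floor)
```

The two printed numbers agree up to Monte Carlo noise, and `floor` is the smallest error any function of X can reach.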
Plug in X = (xt, t) and Y = x1 − x0; the regression optimum is the marginal velocity by definition.
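To see this plug-in step on an actual CFM regression, here is a sketch under toy assumptions of our own: 1-D data x1 ∼ N(M, S²) and a single fixed time T. In this Gaussian case the marginal velocity is linear in x, so ordinary least squares on the conditional targets is the right function class and its fit can be compared against the closed form.

```python
import random

M, S, T = 2.0, 0.5, 0.3  # toy data mean/std and a fixed time, all made up

rng = random.Random(0)
xs, ys = [], []
for _ in range(20000):
    x0 = rng.gauss(0.0, 1.0)
    x1 = rng.gauss(M, S)
    xs.append((1.0 - T) * x0 + T * x1)  # regression input x_t
    ys.append(x1 - x0)                  # easy conditional target

# Ordinary least squares for y ~ slope * x + intercept (closed form in 1-D).
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
slope = sxy / sxx
intercept = my - slope * mx

# Closed-form marginal velocity for this toy: u_T(x) = M + c * (x - T * M).
c = (T * S ** 2 - (1.0 - T)) / ((1.0 - T) ** 2 + (T * S) ** 2)
print(slope, c)                   # fitted vs. analytic slope
print(intercept, M - c * T * M)   # fitted vs. analytic intercept
```

The regression never sees the marginal velocity, only individual x1 − x0 targets, yet the fitted line matches it up to sampling noise.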
The training objective in one line
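$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t \sim U(0,1),\; x_0 \sim p_0,\; x_1 \sim p_{\mathrm{data}}} \big\| v_\theta\big( (1 - t)\, x_0 + t\, x_1,\; t \big) - (x_1 - x_0) \big\|^2$$

That’s the whole thing. In Python (FlowMatching.loss):

    import torch

    def loss(self, x1):  # method of FlowMatching; x1 is a data batch of shape (B, ...)
        x0 = torch.randn_like(x1)                      # x0 ~ N(0, I)
        t = torch.rand(x1.shape[0], device=x1.device)  # t ~ U(0, 1), one per sample
        tb = t.view(-1, *([1] * (x1.dim() - 1)))       # reshape so t broadcasts over (B, ...)
        xt = (1 - tb) * x0 + tb * x1                   # point on the linear path
        target = x1 - x0                               # constant conditional velocity
        pred = self.model(xt, t)                       # v_theta(x_t, t)
        return ((pred - target) ** 2).mean()

A handful of lines. Compare to the half-page ELBO derivation behind DDPM’s identical-looking but conceptually different MSE.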
Interactive · the conditional velocity field
The widget below visualizes the linear path and its conditional velocity. Pick a t; we sample many (x0, x1) pairs, place each xt at (1 − t) x0 + t x1, and draw an arrow showing the constant conditional velocity x1 − x0. Notice how, at each point in the cloud, several arrows pointing in different directions overlap. The average of those arrows is the marginal velocity field, which is what the net learns.
Interactive · train a velocity net, see the field
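The previous widget shows the target field (averaged from sampled pairs). Here’s what a trained network actually learns: a smooth field vθ(x, t) defined on the whole plane. Hit train, then drag t: each arrow is the velocity the model would apply to a sample sitting at that grid point at that time. Under the independent (x0, x1) pairing used here, the ideal field at t = 0 is E[x1] − x, so every arrow points toward the data mean (noise gets pulled toward data). At t = 1 it is x − E[x0] = x on the data, so the arrows do not shrink to zero: the velocity along each path stays x1 − x0 right up to the endpoint, and sampling simply stops integrating at t = 1.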
Why this is “better” than diffusion (sometimes)
- Path is straight. Each conditional trajectory is a line. The marginal field inherits much of that straightness, so a low-order ODE integrator (Euler with 20–50 steps) is enough. Lesson 6 demos this.
- Time is just U(0, 1). No β schedule, no choice of T, no special β-spacing tricks. The path is the design choice; everything else follows from it.
- Scale-stable targets. The target x1 − x0 does not depend on t, so its norm is O(1) at every t, just like ε in DDPM. No reweighting.
- Composes cleanly with anything. Want a different path? Swap one line. Want video? Replace (x0, x1) with (z0, z1) in some latent space. Want guidance? Same trick as classifier-free guidance in DDPM, mechanically.
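The "swap one line" point can be sketched as follows. Any differentiable interpolant with the right endpoints yields a valid conditional path, and the regression target is simply its time derivative. `trig_path` is a hypothetical alternative chosen for illustration, not a path the lesson uses.

```python
import math

def linear_path(x0, x1, t):
    """Straight-line path: returns (x_t, d x_t / dt)."""
    return (1.0 - t) * x0 + t * x1, x1 - x0

def trig_path(x0, x1, t):
    """Hypothetical alternative: quarter-circle interpolant with the same endpoints."""
    a = 0.5 * math.pi * t
    xt = math.cos(a) * x0 + math.sin(a) * x1
    dxt = 0.5 * math.pi * (-math.sin(a) * x0 + math.cos(a) * x1)
    return xt, dxt

# Check each target against a central finite difference of the path itself.
x0, x1, t, h = -0.7, 1.3, 0.4, 1e-6
for path in (linear_path, trig_path):
    xt, velocity = path(x0, x1, t)
    finite_diff = (path(x0, x1, t + h)[0] - path(x0, x1, t - h)[0]) / (2 * h)
    print(path.__name__, xt, velocity, finite_diff)
```

In a training loss, only the two lines computing `xt` and `target` would change. The straight path keeps the target constant in t, which the alternative gives up.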
Trade-offs vs. DDPM, summarized
| Axis | DDPM | Flow matching (linear) |
|---|---|---|
| Path | VP Gaussian chain (curved) | Linear interpolation (straight) |
| Time domain | discrete t ∈ {0, …, T−1} | continuous t ∈ [0, 1] |
| Target | noise ε | velocity x₁ − x₀ |
| Schedule knobs | β_start, β_end, T, cosine vs. linear | none (just the path choice) |
| Math prereq | ELBO + Gaussian KL algebra | continuity equation + conditional E[·] |
| Loss derivation | ~half page | ~3 lines |
| Regression optimum | E[ε \| x_t] (equivalent to the score) | E[x₁ − x₀ \| x_t] (the marginal velocity) |