Flow matching — pick a path, not a chain

A different design choice with a shockingly short derivation: the entire training objective is one regression line.

The frame shift

DDPM built a path by designing a noise process: pick β_t, induce q(x_t), derive everything else. The path is implicit, curved, and tied to Gaussian noising.

Flow matching flips it: design the path directly. Pick any one-parameter family of distributions {p_t}_{t ∈ [0, 1]} with p₀ = N(0, I) and p₁ = p_data. Learn the velocity field v_θ(x, t) that pushes mass along it. Sample by ODE integration.

The objective doesn’t depend on the path being Gaussian, Markov, or anything in particular. That freedom is the design lever — we’ll use it to pick a path that’s easy to integrate (lesson 6).

The continuity equation

The compatibility condition: a time-varying density p_t and velocity u_t are compatible (the field actually moves mass along the path) iff they satisfy

∂_t p_t(x) + ∇ · ( p_t(x) · u_t(x) ) = 0

This is just conservation of probability mass written in PDE form — the same equation as fluid dynamics. It says: the rate of change of density at x equals the negative divergence of the flux p_t u_t.

If we knew the “marginal velocity” u_t(x) at every point, sampling would be easy: integrate dx/dt = u_t(x) starting at x₀ ∼ p₀; you land in p₁. The trouble: u_t(x) has no closed form for any non-trivial p_data. We can’t evaluate it, so we can’t regress against it directly.
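A quick numerical sanity check can make the compatibility condition concrete. The sketch below is an illustration, not part of the lesson's code: it assumes a hypothetical 1-D path p_t = N(0, σ(t)²) with σ(t) = 1 + t, for which the compatible velocity is u_t(x) = (σ′(t)/σ(t)) · x, and checks with finite differences that ∂_t p_t + ∂_x(p_t u_t) is numerically zero.

```python
import math

def sigma(t):
    # Hypothetical path: the standard deviation grows linearly from 1 to 2.
    return 1.0 + t

def density(x, t):
    # p_t(x) for the zero-mean Gaussian with standard deviation sigma(t).
    s = sigma(t)
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def velocity(x, t):
    # u_t(x) = (sigma'(t) / sigma(t)) * x, and sigma'(t) = 1 for this path.
    return x / sigma(t)

def flux(x, t):
    # Probability flux p_t(x) * u_t(x).
    return density(x, t) * velocity(x, t)

def continuity_residual(x, t, h=1e-4):
    # Central finite differences for d/dt p_t(x) and d/dx [p_t(x) u_t(x)].
    dp_dt = (density(x, t + h) - density(x, t - h)) / (2.0 * h)
    dflux_dx = (flux(x + h, t) - flux(x - h, t)) / (2.0 * h)
    return dp_dt + dflux_dx

max_residual = max(
    abs(continuity_residual(x, t))
    for x in (-2.0, -0.5, 0.0, 1.0, 3.0)
    for t in (0.1, 0.5, 0.9)
)
print(f"largest |residual| on the grid: {max_residual:.2e}")
```

The residual is only as small as the finite-difference error allows; swapping in a velocity that does not match the path (say, twice the one above) makes it clearly non-zero.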

Intuition · linear unpacking

Claim: a velocity field is just an instruction sheet for moving probability mass, and one PDE is the rule that keeps the moving from leaking mass.

A density is a pile of sand. Picture p_t as sand spread over the plane, taller where data is dense. Generating a sample is asking: where does a single grain end up?
The velocity field is the wind. u_t(x) says which way and how fast a grain sitting at x moves at time t. Follow the wind from a random start at t = 0 to t = 1 and you have drawn a sample — that’s the ODE dx/dt = u_t(x).
The continuity equation just forbids cheating. Sand can be pushed around but not created or destroyed. So if the pile gets shorter somewhere, the wind must have carried that sand out; if it grows, the wind blew sand in. That bookkeeping, drop in height equals net flow out, is the whole PDE; the divergence term is just “net flow out of a point.”
The catch. We get to choose the pile’s shape over time (the path), but the wind that’s consistent with it has no formula for real data. We know it exists; we just can’t print its value — which is why the next section learns it instead of computing it.

Central point. Sampling is blowing a grain of sand along a wind field; the continuity equation is only the no-mass-lost rule that ties the wind to the pile it’s shaping.
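To see "follow the wind" as code, here is a minimal sketch for the rare case where the wind does have a formula. It assumes a hypothetical 1-D Gaussian target N(M, S²) and the path p_t = N(t·M, σ(t)²) with σ(t) = (1 − t) + t·S; the compatible field for that path is u_t(x) = M + (σ′(t)/σ(t)) · (x − t·M). The names M, S, sigma, velocity and sample are illustrative choices, not taken from the lesson.

```python
import random
import statistics

M, S = 3.0, 0.5  # hypothetical target p_1 = N(M, S^2); p_0 = N(0, 1)

def sigma(t):
    # Standard deviation of p_t, interpolated linearly from 1 to S.
    return (1.0 - t) + t * S

def velocity(x, t):
    # Field compatible with p_t = N(t * M, sigma(t)^2):
    # drift of the mean plus a rescaling about the current mean.
    return M + (S - 1.0) / sigma(t) * (x - t * M)

def sample(rng, n_steps=50):
    # Start from noise and follow dx/dt = u_t(x) with forward Euler.
    x = rng.gauss(0.0, 1.0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x += dt * velocity(x, k * dt)
    return x

rng = random.Random(0)
samples = [sample(rng) for _ in range(20000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
print(f"mean ~ {sample_mean:.3f}, std ~ {sample_std:.3f}")
```

The empirical mean and standard deviation should land close to M and S. For real data no such formula for the field is available, which is the gap the learned v_θ is meant to fill.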

The CFM trick: conditional paths

Pick a conditional path that’s trivial: for each pair (x₀, x₁) with x₀ ∼ p₀, x₁ ∼ p_data, define a path from x₀ to x₁. The linear (OT) path is

x_t = (1 − t) · x₀ + t · x₁ [Eq. ▲]

This is a straight segment; the conditional velocity is its time derivative, which is constant:

u_t(x_t | x₀, x₁) = d/dt x_t = x₁ − x₀ [Eq. ●]

Both quantities are evaluable: given a sampled (x₀, x₁, t), the conditional x_t is one linear combination and the conditional velocity is one subtraction. Zero networks needed.
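As a minimal sketch of Eq. ▲ and Eq. ●, the snippet below builds one conditional path in the plane and checks by finite differences that its time derivative is the constant x₁ − x₀. The helper names and the data point (2, −1) are hypothetical, chosen only for illustration.

```python
import random

def conditional_point(x0, x1, t):
    # Eq. (triangle): x_t = (1 - t) * x0 + t * x1, coordinate by coordinate.
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def conditional_velocity(x0, x1):
    # Eq. (circle): d/dt x_t = x1 - x0, independent of t.
    return [b - a for a, b in zip(x0, x1)]

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(2)]  # noise endpoint
x1 = [2.0, -1.0]                              # hypothetical data endpoint
t = rng.random()

xt = conditional_point(x0, x1, t)
u = conditional_velocity(x0, x1)

# Finite-difference check that the path really moves with velocity u.
h = 1e-6
xt_next = conditional_point(x0, x1, t + h)
fd = [(b - a) / h for a, b in zip(xt, xt_next)]
print("x_t =", xt)
print("velocity =", u, "finite difference =", fd)
```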

The marginal-conditional equivalence

Here is the worry, stated plainly: the target we can actually compute, x₁ − x₀, is not the field we want. Many different (x₀, x₁) segments cross through the same point x_t, each carrying a different constant velocity. We hand the net a single one of those conflicting arrows on each draw — so how can it ever learn the one true field? The answer is that a least-squares regression, fed conflicting targets at the same input, cannot fit them all; the best it can do is settle on their average. And the average of the conditional arrows passing through x_t is, by construction, exactly the marginal velocity we wanted. The principle behind this is one line: conditional expectation is the minimizer of squared error. Spelled out — the marginal velocity that’s compatible with p_t is

u_t(x) = 𝔼_{(x₀, x₁) ∼ q | x_t = x} [ x₁ − x₀ ]

that is, the average of the constant conditional velocities, weighted by which (x₀, x₁) pairs are likely to have produced this particular x_t. And the regression

argmin_v 𝔼_{t, x₀, x₁} ‖ v(x_t, t) − (x₁ − x₀) ‖²

has minimizer exactly u_t(x) = 𝔼[x₁ − x₀ | x_t = x] — the same thing! So we can regress against the easy target and recover the impossible marginal one. This is the entire CFM identity.

Intuition · linear unpacking

Claim: fitting the net to the cheap per-pair velocity x₁ − x₀ teaches it the expensive marginal field u_t we could never write down.

What we want is unreachable. The marginal velocity u_t(x) is the one true arrow that moves the whole cloud along the path. For any real data it has no formula — we can’t print its value, so we can’t aim the net at it directly.
What we can reach is ambiguous. Pick a noise point and a data point, draw the straight segment, and its velocity is just the difference x₁ − x₀ — one subtraction, free. But it’s the velocity of one segment. Lots of segments pass through the same midpoint, each with its own difference. So the cheap target disagrees with itself.
A regression cannot please conflicting targets — so it averages them. Show a least-squares fit the same input x_t paired with many different answers, and no single output can match them all. The loss is smallest when the net outputs the mean of those answers. That is just what squared error does: it parks the prediction at the balance point of whatever it’s shown.
The mean of the cheap targets is the expensive one. The marginal velocity is exactly the average of every conditional arrow through x_t, weighted by how often each arrow actually lands there — and the regression gets that weighting for free, because the more often a segment passes through x_t, the more times that target shows up in the loss. So the balance point the net settles on is u_t(x) — we never had to evaluate it, we only had to let the (frequency-weighted) averaging happen.

Central point. We never compute the field we want; we feed the net a swarm of easy, conflicting guesses and let least-squares average them into exactly that field.
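The averaging claim can be checked numerically in a case where the marginal field happens to be known. The sketch below assumes hypothetical 1-D Gaussian data N(M, S²) and a single fixed time T. For that special case, conditioning the jointly Gaussian pair (x₁ − x₀, x_t) on x_t gives a marginal velocity that is linear in x, so an ordinary least-squares line fitted to the cheap targets should recover its slope and intercept.

```python
import random

M, S, T = 3.0, 0.5, 0.7   # hypothetical p_data = N(M, S^2), fixed time T
rng = random.Random(0)

xs, ys = [], []
for _ in range(200000):
    x0 = rng.gauss(0.0, 1.0)            # noise endpoint
    x1 = rng.gauss(M, S)                # data endpoint, drawn independently
    xs.append((1.0 - T) * x0 + T * x1)  # regression input x_t
    ys.append(x1 - x0)                  # cheap conditional target

# Ordinary least-squares line y ~ intercept + slope * x.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
slope_fit = cov_xy / var_x
intercept_fit = mean_y - slope_fit * mean_x

# Closed-form marginal velocity u_T(x) = M + slope_true * (x - T * M),
# from conditioning the jointly Gaussian pair (x1 - x0, x_t) on x_t.
slope_true = (T * S**2 - (1.0 - T)) / ((1.0 - T) ** 2 + (T * S) ** 2)
intercept_true = M - slope_true * T * M
print(f"slope: fit {slope_fit:.3f} vs true {slope_true:.3f}")
print(f"intercept: fit {intercept_fit:.3f} vs true {intercept_true:.3f}")
```

No individual target x₁ − x₀ equals the marginal velocity at its x_t; only the fit, which averages them, does. A straight line is enough here only because the data are Gaussian; for real data the conditional mean is nonlinear, which is why a network plays this role.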

Why is the minimizer of E‖f(X) − Y‖² over all functions f of X exactly E[Y | X]?

For any function f(X):

E ‖f(X) − Y‖² = E ‖f(X) − E[Y | X]‖² + E ‖E[Y | X] − Y‖²

This is the orthogonal (Pythagorean) decomposition of L²: the cross-term E[(f(X) − E[Y|X]) · (E[Y|X] − Y)] vanishes because, conditional on X, the first factor is constant while the second has zero conditional mean. The second term doesn’t depend on f; the first is minimized exactly at f(X) = E[Y | X]. ▪

Plug in X = (x_t, t) and Y = x₁ − x₀; the regression optimum is the marginal velocity by definition.
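The decomposition can also be verified on samples. The sketch below uses a hypothetical discrete X with three values and a noisy Y, takes the per-group sample mean as the empirical E[Y | X], and compares both sides of the identity for an arbitrary competitor f. With empirical group means the cross-term cancels exactly within each group, so the two sides agree up to floating-point rounding.

```python
import random
from collections import defaultdict

rng = random.Random(0)
xs = [rng.randrange(3) for _ in range(5000)]      # X takes values 0, 1, 2
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # Y = 2X + noise

# Empirical conditional mean E[Y | X = x]: the average of Y within each group.
groups = defaultdict(list)
for x, y in zip(xs, ys):
    groups[x].append(y)
cond_mean = {x: sum(v) / len(v) for x, v in groups.items()}

def mse(pred, target):
    # Mean squared difference between two equally long sequences.
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def f(x):
    # An arbitrary competitor that is not the conditional mean.
    return 1.5 * x + 0.3

f_vals = [f(x) for x in xs]
m_vals = [cond_mean[x] for x in xs]

lhs = mse(f_vals, ys)                        # E || f(X) - Y ||^2
rhs = mse(f_vals, m_vals) + mse(m_vals, ys)  # the two Pythagorean terms
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```

The second term, mse(m_vals, ys), is the floor no choice of f can go below.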

The training objective in one line

L_CFM = 𝔼_{t ∼ U(0, 1), x₀ ∼ N(0, I), x₁ ∼ p_data} ‖ v_θ(x_t, t) − (x₁ − x₀) ‖²

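That’s the whole thing. In Python (FlowMatching.loss), assuming PyTorch and a model that takes t with shape (B,):

import torch

def loss(self, x1):
    x0 = torch.randn_like(x1)                      # noise sample paired with each data point
    t = torch.rand(x1.shape[0], device=x1.device)  # one t per example, shape (B,)
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))      # reshape so t broadcasts over the data dims
    xt = (1 - t_b) * x0 + t_b * x1                 # Eq. ▲: point on the straight path
    target = x1 - x0                               # Eq. ●: constant conditional velocity
    pred = self.model(xt, t)
    return ((pred - target) ** 2).mean()

A handful of lines. Compare to the half-page ELBO derivation behind DDPM’s identical-looking but conceptually different MSE.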

Interactive · the conditional velocity field

The widget below visualizes Eq. ●. Pick a t; we sample many (x₀, x₁) pairs, place each x_t at (1 − t) x₀ + t x₁, and draw an arrow showing the constant conditional velocity x₁ − x₀. Notice how, at each point in the cloud, several arrows with different directions overlap; the average of those arrows is the marginal velocity field, which is what the net learns.
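The same experiment can be reproduced without graphics. The sketch below assumes a hypothetical 1-D data distribution with two equally likely points, −2 and +2, fixes t = 0.5, and collects the conditional velocities of all pairs whose x_t lands near x = 0. Each arrow has magnitude close to 4, pointing left or right, while their average, which is what the regression targets, is close to 0 by symmetry.

```python
import random

T = 0.5                      # time at which we inspect the arrows
DATA = (-2.0, 2.0)           # hypothetical two-point data distribution
rng = random.Random(0)

targets_near_zero = []
for _ in range(200000):
    x0 = rng.gauss(0.0, 1.0)             # noise endpoint
    x1 = rng.choice(DATA)                # data endpoint
    xt = (1.0 - T) * x0 + T * x1         # where this pair sits at time T
    if abs(xt) < 0.05:                   # keep only arrows based near x = 0
        targets_near_zero.append(x1 - x0)

pointing_right = [u for u in targets_near_zero if u > 0]
pointing_left = [u for u in targets_near_zero if u < 0]
mean_arrow = sum(targets_near_zero) / len(targets_near_zero)
print(len(pointing_right), len(pointing_left), round(mean_arrow, 3))
```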

Interactive · train a velocity net, see the field

The previous widget shows the conditional arrows whose local average is the target field. Here’s what a trained network actually learns: a smooth field v_θ(x, t) defined on the whole plane. Hit train, then drag t: each arrow is the velocity the model would give a sample sitting at that grid point at that time. With x₀ and x₁ sampled independently, the target at t = 0 is 𝔼[x₁] − x, so the arrows point from every noise location toward the data mean; at t = 1 it is x − 𝔼[x₀] = x, so the field does not shrink to zero: the arrows point away from the origin with length ‖x‖.

Why this is “better” than diffusion (sometimes)

Path is straight. Each conditional trajectory is a line. The marginal field inherits much of that straightness, so a low-order ODE integrator (Euler with 20–50 steps) is enough. Lesson 6 demos this.
Time is just U(0, 1). No β schedule, no choice of T, no special β-spacing tricks. The path is the design choice; everything else follows from it.
Scale-stable targets. ‖x₁ − x₀‖ is O(1) at every t, just like ε in DDPM. No reweighting.
Composes cleanly with anything. Want a different path? Swap one line. Want video? Replace (x₀, x₁) with (z₀, z₁) in some latent space. Want guidance? Same trick as classifier-free guidance in DDPM, mechanically.
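The "low-order integrator is enough" point from the list above can be probed in a toy case. The sketch below assumes hypothetical 1-D Gaussian data N(M, S²) with independently sampled endpoints, for which the marginal field of the linear path has a closed form and the exact flow map at t = 1 is x₀ ↦ M + S · x₀. It measures how far forward Euler lands from that exact endpoint for several step counts.

```python
M, S = 3.0, 0.5  # hypothetical p_data = N(M, S^2); p_0 = N(0, 1)

def velocity(x, t):
    # Marginal field of the linear path with independent Gaussian endpoints.
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return M + (t * S**2 - (1.0 - t)) / var_t * (x - t * M)

def euler_endpoint(x0, n_steps):
    # Forward Euler from t = 0 to t = 1 with a uniform step.
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x += dt * velocity(x, k * dt)
    return x

x0 = 1.0
exact = M + S * x0   # exact flow map at t = 1 for this field
errors = {n: abs(euler_endpoint(x0, n) - exact) for n in (5, 20, 50, 200)}
for n, err in errors.items():
    print(f"{n:4d} Euler steps: |error| = {err:.4f}")
```

The error shrinks steadily as the step count grows. This is a 1-D toy with a known field, so it says nothing definitive about image models, but it shows the kind of check one can run before settling on a step count.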

…and where it doesn’t obviously win

Empirically, flow-matching DiTs (Stable Diffusion 3, Flux, in the style of Esser et al.) are roughly on par in quality with DDPM-DiTs at matched compute. The real advantage is sampling cost: 50 steps vs. 1000, holding quality. For training, both are similar; the choice is mostly engineering.

Trade-offs vs. DDPM, summarized

Axis	DDPM	Flow matching (linear)
Path	VP Gaussian chain (curved)	Linear interpolation (straight)
Time domain	discrete t ∈ {0, …, T−1}	continuous t ∈ [0, 1]
Target	noise ε	velocity x₁ − x₀
Schedule knobs	β_start, β_end, T, cosine vs. linear	none (just the path choice)
Math prereq	ELBO + Gaussian KL algebra	continuity equation + conditional E[·]
Loss derivation	~half page	~3 lines
Theoretical floor	same minimizer (the true marginal velocity / score)	same minimizer

Punchline

Conditional expectation is the L₂ minimizer. So regressing against an easy conditional target (constant velocity along a linear segment) gives you the impossible marginal target for free. The whole flow-matching loss is one regression line, with one Gaussian sample and one uniform t.