
The generative-modeling problem

We have samples from p_data. We want more. What machinery do we need?

What we are actually trying to do

Concretely: you hand me a finite set {x⁽¹⁾, …, x⁽ᴺ⁾} drawn i.i.d. from an unknown distribution p_data(x). I should give you a procedure that produces fresh samples that look like they came from the same distribution. Not a likelihood model, not a density estimator: a sampler.

That word choice matters. Modeling the density p(x) at every point of the data space is dramatically harder than producing one sample per call. A density model has to normalize over the whole space; a sampler only has to walk to plausible spots. Every method in this series is a sampler first and a density model second (or never).

The 2D toy we’ll use

The figure below shows two interleaved crescents (“two moons”): bimodal, non-Gaussian, and curved. The code in data.py standardizes the samples to zero mean and unit variance per axis so that the prior N(0, I) and the data live on the same scale. If you skip standardization, the noise scale and the data scale disagree and training fails silently. That is gotcha #1.
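
A minimal sketch of that standardization step, assuming only NumPy. The source does not show the contents of data.py, so the generator `make_two_moons` and the helper `standardize` are hypothetical stand-ins, not the lesson's actual code.

```python
import numpy as np

def make_two_moons(n, noise=0.05, seed=0):
    """Hypothetical generator: two interleaved half-circles plus Gaussian jitter."""
    rng = np.random.default_rng(seed)
    n_upper = n // 2
    n_lower = n - n_upper
    theta_upper = rng.uniform(0.0, np.pi, n_upper)
    theta_lower = rng.uniform(0.0, np.pi, n_lower)
    # Upper crescent opens downward; lower crescent is shifted and flipped.
    upper = np.stack([np.cos(theta_upper), np.sin(theta_upper)], axis=1)
    lower = np.stack([1.0 - np.cos(theta_lower), 0.5 - np.sin(theta_lower)], axis=1)
    x = np.concatenate([upper, lower], axis=0)
    return x + noise * rng.normal(size=x.shape)

def standardize(x):
    """Shift and scale each axis to zero mean and unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / std, mean, std

raw = make_two_moons(2000)
data, mean, std = standardize(raw)
print("raw  mean/std:", raw.mean(axis=0).round(3), raw.std(axis=0).round(3))
print("data mean/std:", data.mean(axis=0).round(3), data.std(axis=0).round(3))
```

Keeping `mean` and `std` around lets you map generated samples back to the original coordinates with `samples * std + mean`.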

Two moons — the target distribution
Drag noise to see how the crescents fatten; tap resample for fresh draws. This is what every model in the series tries to learn to reproduce.

Why “just fit a neural net to p_data” doesn’t work

The naive plan: pick a parametric family p_θ(x), maximize log-likelihood on the training set. Two obstacles:

  1. Normalization. A general neural net f_θ(x) ∈ ℝ is not a density: it doesn’t integrate to 1. You can define p_θ(x) = exp(f_θ(x)) / Z(θ), but Z(θ) = ∫ exp(f_θ(x)) dx is intractable in high dimensions. Energy-based models live here and pay this price.
  2. Sampling. Even if you had p_θ, drawing samples from a general high-dimensional density is hard. MCMC mixes badly in multimodal landscapes (each mode is a metastable basin); rejection sampling needs a proposal that already looks like the target.
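
To make the normalization obstacle concrete, here is a small sketch that estimates Z by brute-force summation on a grid. The toy score `log_unnormalized` (an unnormalized standard Gaussian, chosen because its exact Z is known) and the helper `grid_partition` are illustrative assumptions, not part of the lesson's code.

```python
import numpy as np

def log_unnormalized(x):
    """Toy score f(x) = -||x||^2 / 2, so exp(f) is an unnormalized Gaussian."""
    return -0.5 * np.sum(x**2, axis=-1)

def grid_partition(dim, points_per_axis=41, half_width=6.0):
    """Approximate Z = integral of exp(f(x)) dx with a Riemann sum on a regular grid."""
    axis = np.linspace(-half_width, half_width, points_per_axis)
    cell = (axis[1] - axis[0]) ** dim
    mesh = np.stack(np.meshgrid(*([axis] * dim), indexing="ij"), axis=-1)
    values = np.exp(log_unnormalized(mesh.reshape(-1, dim)))
    return values.sum() * cell, values.size

for dim in (1, 2, 3):
    z_hat, n_evals = grid_partition(dim)
    z_true = (2.0 * np.pi) ** (dim / 2)  # exact value for this toy score
    print(f"D={dim}: Z approx {z_hat:.4f} (exact {z_true:.4f}), evaluations={n_evals}")
```

The number of evaluations is `points_per_axis ** dim`, so the same grid that is trivial in 2D is already out of reach for a few dozen dimensions, let alone for images.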

VAEs and normalizing flows fix this by constraining the architecture: VAEs introduce a tractable latent and a Jensen-bound surrogate (the ELBO); flows force f_θ to be invertible with a tractable Jacobian. Both work; both pay an architectural tax. Diffusion and flow matching keep the architecture free and pay a process tax instead.

The reframe: density paths

Here is the move. Don’t try to model p_data directly. Instead, build a one-parameter family of distributions {pₜ}, t ∈ [0, 1], that connects a tractable prior to the target:

p₀ = N(0, I),   p₁ = p_data,   pₜ for t ∈ (0, 1) is anything you like that interpolates.

Now learn a local operator — at each point x and time t, what should I do to move probability mass along the path? Two natural choices for that operator:

Method        | Operator the net predicts        | How sampling moves
DDPM          | noise ε added at this step       | subtract the predicted noise (with stochasticity)
Flow matching | velocity v at this point in time | step in the direction of v by dt
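
The two rows of the table can be sketched as single update steps. This is a hedged illustration, not the series' implementation: the function names are hypothetical, the placeholder functions stand in for trained networks, and the DDPM step uses the common noise-prediction update with variance β_k. Its step index k counts down toward the data, which is the usual DDPM convention and runs opposite to the t used in this lesson.

```python
import numpy as np

def flow_matching_euler_step(x, t, dt, velocity_fn):
    """Move x forward by dt along the predicted velocity (deterministic)."""
    return x + dt * velocity_fn(x, t)

def ddpm_ancestral_step(x, k, betas, noise_fn, rng):
    """One reverse step of a DDPM-style sampler, from step k to step k - 1."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    eps_hat = noise_fn(x, k)
    # Subtract the scaled noise estimate, then undo the per-step shrinkage.
    mean = (x - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps_hat) / np.sqrt(alphas[k])
    if k == 0:
        return mean  # no fresh noise is injected on the final step
    return mean + np.sqrt(betas[k]) * rng.normal(size=x.shape)

def placeholder_velocity(x, t):
    """Stand-in for a trained velocity network (assumption for the demo)."""
    return -x

def placeholder_noise(x, k):
    """Stand-in for a trained noise-prediction network (assumption for the demo)."""
    return np.zeros_like(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
betas = np.linspace(1e-4, 0.02, 1000)  # a linear schedule, chosen for illustration

x_fm = flow_matching_euler_step(x, t=0.0, dt=0.02, velocity_fn=placeholder_velocity)
x_ddpm = ddpm_ancestral_step(x, k=999, betas=betas, noise_fn=placeholder_noise, rng=rng)
print(x_fm.shape, x_ddpm.shape)
```

The structural difference is visible in the signatures: the flow matching step needs only a step size, while the DDPM step needs a noise schedule and a random number generator.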

Why is this easier? Because each pₜ looks Gaussian-ish within a small neighborhood, even when p_data is multimodal and complicated. The net only has to learn a smooth field over (x, t); it never has to confront the global density. That is the whole conceptual win.

Interactive · pick a path, watch a particle drift

The widget below shows the linear interpolation path that flow matching will use in lesson 5: xₜ = (1 − t)·x₀ + t·x₁ with x₀ ∼ N(0, I) and x₁ ∼ p_data. Drag t: the cloud morphs from the standard normal at t = 0 to the two moons at t = 1. Each particle moves on a straight line. That is going to matter a lot.

A linear density path
Each grey dot is one (x₀, x₁) pair frozen at the start; orange dots are the same particles at the current t. Watch a Gaussian deform into two moons in real time.

The grey lines are the trajectories xₜ = (1 − t)·x₀ + t·x₁ traced through time. On this path they are straight, and the velocity along each trajectory is the constant x₁ − x₀. Straight trajectories tolerate coarse time steps, which is a large part of why flow matching typically gets by with around 50 sampling steps where the original DDPM sampler uses 1000.
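
The constant-velocity claim can be checked numerically. A small sketch, assuming NumPy and using shifted Gaussian draws as a placeholder for data samples; `interpolate` and `conditional_velocity` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(5, 2))        # draws from the prior N(0, I)
x1 = rng.normal(size=(5, 2)) + 3.0  # placeholder for draws from the data

def interpolate(x0, x1, t):
    """Point on the linear path at time t."""
    return (1.0 - t) * x0 + t * x1

def conditional_velocity(x0, x1):
    """Time derivative of the linear path for a fixed (x0, x1) pair."""
    return x1 - x0

# Finite-difference derivative at an arbitrary time matches x1 - x0.
t, h = 0.3, 1e-3
finite_diff = (interpolate(x0, x1, t + h) - interpolate(x0, x1, t)) / h
print(np.allclose(finite_diff, conditional_velocity(x0, x1)))

# With the exact per-pair velocity, one Euler step of size 1 lands on x1.
one_step = x0 + 1.0 * conditional_velocity(x0, x1)
print(np.allclose(one_step, x1))
```

One caveat: this velocity is defined per (x₀, x₁) pair. A trained net sees only (x, t) and learns the average over all pairs passing through that point, so the trajectories it generates are generally not perfectly straight, and sampling still takes more than one step.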

The path as a 3D stack

One way to see the path of distributions: plot pₜ as slices stacked along a time axis. Below, each translucent disk is one snapshot of the cloud at a particular t; the disks are placed in oblique 3D projection along t. The bundle shows in one glance what the slider above shows over time.

Density-path stack: pₜ at K time slices
Each disk is one t-slice; trajectories tie corresponding particles across slices. The front slice (t=0) is a unit Gaussian; the back slice (t=1) is the two moons. Linear path = straight ties between slices.

What this lets us avoid

The cost we pay

There is no free lunch — the cost is sampling
Both DDPM and flow matching produce a single sample by running the net many times (10–1000 forward passes). For a 256×256 image with a 1B-parameter DiT, that can mean several seconds per image, versus a GAN’s single forward pass. The recent diffusion literature (DDIM, EDM, consistency models, rectified flow) is largely about chipping away at this cost.
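
A toy sketch of that cost: an Euler sampler that counts how many times the velocity field is evaluated. To keep it self-contained, the "net" is replaced by a closed-form field that transports N(0, I) to a Gaussian N(M, S²·I) along the linear path; the target, the class `CountingField`, and the helper `euler_sample` are all assumptions made for illustration.

```python
import numpy as np

M = np.array([2.0, -1.0])  # target mean (toy assumption)
S = 0.5                    # target standard deviation (toy assumption)

class CountingField:
    """Closed-form velocity for N(0, I) -> N(M, S^2 I) on the linear path; counts calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        var_t = (1.0 - t) ** 2 + (t * S) ** 2        # variance of x_t
        slope = (t * S**2 - (1.0 - t)) / var_t       # Cov(x1 - x0, x_t) / Var(x_t)
        return M + slope * (x - t * M)               # E[x1 - x0 | x_t = x]

def euler_sample(field, n_samples, n_steps, seed=0):
    """Integrate dx/dt = field(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, 2))
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * field(x, k * dt)
    return x

for n_steps in (1, 10, 100):
    field = CountingField()
    samples = euler_sample(field, 4000, n_steps)
    print(n_steps, field.calls, samples.mean(axis=0).round(3), samples.std(axis=0).round(3))
```

Each step costs exactly one field evaluation, so the step count is the bill. With a single step every sample collapses onto the target mean and the spread is lost; more steps recover the target spread at proportionally higher cost.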

What’s next

We now have a frame: pick a path, learn a local operator. The next five lessons fill that in for DDPM (lessons 2–4) and flow matching (lessons 5–6). Lesson 7 then shows they’re the same object viewed two ways. The 2D toy from above is the one we’ll keep returning to: small enough to plot, complex enough to make every bug visible.