15 lessons · ~4.5h read

Generative Models, From First Principles

Part A: DDPM and Flow Matching on continuous data. Part B: discrete tokens, VQ-VAEs, masked diffusion, and the integrated reasoning + image-generation systems (Nano Banana Pro, GPT-Image-2 family) built on top.

This series is the companion to generative_continuous/: four models on one 2×2 grid, {DDPM, Flow Matching} × {MLP, DiT}. Each lesson takes one knob in that grid, derives it from scratch, shows you the trade-off it forces, and gives you a widget where you can drive it until it breaks.

Who this is for
You know calculus and what a neural net is. You may have seen diffusion as a black box. By the end, you will be able to derive DDPM’s loss in half a page; explain why flow matching’s derivation is shorter; tell when a DiT block earns its parameters and when an MLP wins; and read every line of the four .py files in this folder and say why it is there.

The one picture

Both DDPM and Flow Matching pose the same problem: build a path of distributions from a tractable prior to data, then learn a local operator that pushes probability mass along the path.

[Figure: a chosen path of distributions p_t, t ∈ [0, 1], from the tractable prior p₀ = N(0, I) to p₁ = p_data (samples only). The forward direction is fixed; the reverse direction is learned.]

Two design choices: (1) which path? (2) what does the net predict at every point on it?

DDPM’s answer: the path is a Gaussian noising chain, and the net predicts the noise that was added. Flow Matching’s answer: the path is whatever you like, and the net predicts the velocity that moves mass along it. Despite the cosmetic differences, these are the same object viewed from two angles: DDPM is flow matching with a specific curved path. Lesson 7 makes that explicit.
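To make the two answers concrete, here is a minimal sketch of how one training example could be built under each. It is not taken from the companion .py files: the function names, the `alpha_bar` argument (the cumulative signal fraction of the variance-preserving chain), and the choice of a straight-line path for Flow Matching are all assumptions made for illustration.

```python
import math
import random


def ddpm_training_pair(data, alpha_bar, rng):
    """Noise a data point with the closed-form VP marginal.

    Returns (x_t, target) where the regression target is the added noise.
    `alpha_bar` in (0, 1] is an assumed name for the cumulative signal fraction.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * d + s * e for d, e in zip(data, noise)]
    return x_t, noise


def fm_training_pair(data, t, rng):
    """Place a point on an assumed straight path from noise (t=0) to data (t=1).

    Returns (x_t, target) where the regression target is the path's velocity.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    x_t = [(1.0 - t) * e + t * d for d, e in zip(data, noise)]
    velocity = [d - e for d, e in zip(data, noise)]
    return x_t, velocity


rng = random.Random(0)
point = [1.5, -0.5]  # one 2D "data" sample
x_ddpm, eps_target = ddpm_training_pair(point, alpha_bar=0.25, rng=rng)
x_fm, v_target = fm_training_pair(point, t=0.3, rng=rng)
```

In both cases the net sees `x_t` (plus the time) and regresses on the second return value; only the path and the target differ.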

Part A · Continuous data — the math that underlies everything (lessons 01–09)

01
The generative-modeling problem
Why “model p_data” is hard, and why transporting a tractable prior along a chosen density path is a productive way to think about it. Why we work in 2D before images.
02
DDPM — the forward chain
Variance-preserving Gaussian Markov chain. Why composing linear-Gaussian maps stays Gaussian. The closed-form marginal that makes training simulation-free.
03
DDPM — ELBO to L_simple
The KL-between-Gaussians collapse. Why we predict ε instead of x₀ or μ. What “dropping the weights” actually trades away (likelihood for perceptual quality).
04
DDPM — ancestral sampling
Bayes’ rule gives you the posterior-mean formula. Why σ² = β_t is the “upper-bound” choice and when β̃_t is tighter. The ~1000-step cost and where it goes.
05
Flow matching — pick a path, not a chain
The continuity equation as the compatibility condition between a path and its velocity field. The CFM trick: the conditional expectation is the L₂ minimizer, so regressing on the tractable conditional target recovers the otherwise intractable marginal one.
06
Flow matching — Euler ODEs
Forward Euler on the learned velocity field. Why straight conditional paths buy you 20× fewer integration steps. RK4 as a one-line upgrade.
07
DDPM ↔ flow matching
DDPM is flow matching with the variance-preserving curved path and a stochastic sampler. The conversion formula between ε and v. Where the “deeper” theory (score matching, SDE limits) actually lives.
08
DiT — transformer denoiser
PatchEmbed as a single GEMM. Bidirectional attention vs. UNet’s convs. adaLN-Zero: the modulation MLP that starts every block at identity and is the reason DiT trains without warmup.
09
Practical knobs & trade-offs
Linear vs. cosine β. ε vs. v vs. x₀ targets. Step-count vs. solver order. t_scale = 1000 (and why). Classifier-free guidance in five lines. The full extension menu.
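As a taste of lessons 05–06, the sketch below runs forward Euler on a velocity field. A trained network is replaced here by a toy field that is exact for one special case: the data distribution is a single point and the conditional path is the straight line from noise to that point. The function names and that toy setup are assumptions for illustration, not the contents of flow_matching.py.

```python
import random


def velocity_to_point(x, t, target):
    """Exact marginal velocity when all data sits at `target` and the path is
    x_t = (1 - t) * noise + t * target, which gives v = (target - x) / (1 - t)."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, target)]


def euler_sample(velocity, x, n_steps):
    """Forward Euler from t = 0 to t = 1 with a fixed step size."""
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # largest t used is 1 - dt, so the division above is safe
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target = [2.0, -1.0]
start = [rng.gauss(0.0, 1.0) for _ in target]  # a draw from the prior


def field(x, t):
    return velocity_to_point(x, t, target)


one_step = euler_sample(field, start, 1)
ten_steps = euler_sample(field, start, 10)
```

Because every trajectory of this toy field is a straight line, one Euler step already lands on `target` up to rounding, and ten steps give the same answer. A learned field on real data is only approximately straight, so the step count becomes a knob again.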

Part B · Discrete tokens & integrated systems (lessons 10–15)

Once you can generate continuous things, you can compress images into a sequence of discrete tokens and let an LLM model them directly. That move — tokenize the image — is what makes “native multimodal” reasoning models possible: a single transformer that consumes and produces both text and image tokens, with chain-of-thought reasoning before image generation. Nano Banana Pro, the GPT-Image-2 family, Chameleon, and JanusFlow are all instances. This part covers the tokenizer, the discrete-generation algorithm, the unified architecture, and the reasoning layer on top.
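The "tokenize the image" move can be sketched in a few lines: each patch feature vector is replaced by the index of its nearest entry in a codebook, and those indices are the tokens a transformer models. The hand-written codebook and the function names below are hypothetical; a real tokenizer learns both the encoder and the codebook, which is the subject of lesson 11.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]


def dequantize(tokens, codebook):
    """Look the tokens back up; this is the lossy reconstruction a decoder sees."""
    return [codebook[k] for k in tokens]


# Toy 2D codebook with four entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_features = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]

tokens = quantize(patch_features, codebook)  # [0, 3, 2]
reconstruction = dequantize(tokens, codebook)
```

The integer list is what gets interleaved with text tokens; the gap between `patch_features` and `reconstruction` is the information the tokenizer throws away.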

10
Why discrete?
When tokens beat pixels: text, audio codecs, image patches as a sequence. The four-way menu of discrete generative methods (autoregressive, MaskGIT, discrete diffusion, hybrids) and what each is good for.
11
VQ-VAE · VQGAN · FSQ
Vector quantization as a learned dictionary. The straight-through gradient and why it works. Codebook collapse, EMA updates, FSQ as the modern simplification. Interactive 2D codebook trainer.
12
Masked & discrete diffusion
D3PM as the discrete analog of DDPM. MaskGIT’s parallel decoding: start fully masked, unmask the K most confident per round. Cost: ~10 steps vs. autoregression’s N. The accuracy-vs-throughput knob.
13
Unified-token transformers
One transformer over [text tokens, image tokens, audio tokens]. Chameleon, JanusFlow, LWM. Interleaved sequences, BOI/EOI markers, modality-specific decoders. Why this beats two-tower pipelines.
14
Reasoning + image generation
Chain-of-thought before image tokens: the “think then draw” pattern. Nano Banana Pro, GPT-Image-2 family. Why this beats prompt rewriting. Editing as token surgery. Reasoning over multiple input images.
15
Hybrid pipelines
When you don’t go fully autoregressive: diffusion conditioned on LLM features (DALL-E 3 + GPT-4, SD3 + T5, Imagen 3). The conditioning interface and where the failures live. Decision tree for picking an architecture.
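The parallel decoding loop from lesson 12 (start fully masked, unmask the most confident positions each round) can be sketched as follows. The model is replaced by a random stand-in, and the constant number of positions unmasked per round is a simplifying assumption; all names are illustrative.

```python
import random

MASK = -1  # sentinel for a position that has not been decided yet


def toy_predict(tokens, rng, vocab_size):
    """Stand-in for a trained transformer: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def maskgit_decode(length, n_rounds, vocab_size, rng):
    """Fill `length` positions in about `n_rounds` parallel rounds."""
    tokens = [MASK] * length
    per_round = -(-length // n_rounds)  # ceiling division
    rounds = 0
    while MASK in tokens:
        guesses = toy_predict(tokens, rng, vocab_size)
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # Commit only the most confident guesses among still-masked positions.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_round]:
            tokens[i] = guesses[i][0]
        rounds += 1
    return tokens, rounds


tokens, rounds = maskgit_decode(length=64, n_rounds=10, vocab_size=512,
                                rng=random.Random(0))
# rounds == 10, versus 64 sequential steps for token-by-token autoregression
```

Fewer rounds means more tokens are committed per model call on less context, which is where the accuracy-vs-throughput trade-off comes from.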

How to use this

  1. Linearly. Each lesson assumes the previous one’s vocabulary. Lessons 02–04 build DDPM; 05–06 build FM; 07 connects them; 08–09 cover architecture and practice; 10–15 move on to discrete tokens and integrated systems.
  2. Touch every knob. Each widget has a configuration that produces visibly wrong samples or visibly fails to integrate. Find it — the failure is the lesson.
  3. Open the code. Every claim corresponds to lines you can read in diffusion.py, flow_matching.py, or diffusion_transformer.py. The lessons explain the why; the code is the what.

Companion reading
The README gives the same content in dense form. Use it as a refresher after the lessons; the lessons exist to make the README’s one-liners believable.