Generative Models, From First Principles

Part A: DDPM and Flow Matching on continuous data. Part B: discrete tokens, VQ-VAEs, masked diffusion, and the integrated reasoning + image-generation systems (Nano Banana Pro, GPT-Image-2 family) built on top. Part C: the production text-to-image stack — score/SDE theory, fast samplers, guidance, latent diffusion, CLIP, evaluation, and the Stable Diffusion lineage + adapter ecosystem.

This series is the companion to generative_continuous/: four models laid out on a two-axis grid, {DDPM, Flow Matching} × {MLP, DiT}. Each lesson takes one knob in that grid, derives it from scratch, shows you the trade-off it forces, and gives you a widget where you can drive it until it breaks.

Who this is for

You know calculus and what a neural net is. You may have seen diffusion as a black box. By the end, you can: derive DDPM’s loss in a half-page; explain why flow matching is shorter; tell when a DiT block earns its parameters and when an MLP wins; and read every line of the four .py files in this folder and say why it is there.

The one picture

Both DDPM and Flow Matching pose the same problem: build a path of distributions from a tractable prior to data, then learn a local operator that pushes probability mass along the path.

DDPM’s answer: the path is a Gaussian noising chain, and the net predicts the noise that was added. Flow Matching’s answer: the path is whatever you like, and the net predicts the velocity that moves mass along it. Despite the cosmetic differences, these are the same object viewed from two angles: DDPM is flow matching with a specific curved path. Lesson 7 makes that explicit.

Part A · Continuous data — the math that underlies everything (lessons 01–09)

The generative-modeling problem

Why “model p_data” is hard, and why transporting a tractable prior along a chosen density path is a productive way to think about it. Why we work in 2D before images.

DDPM — the forward chain

Variance-preserving Gaussian Markov chain. Why composing linear-Gaussian maps stays Gaussian. The closed-form marginal that makes training simulation-free.
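A minimal sketch of the closed-form marginal, on a scalar and in plain Python. The linear schedule endpoints (1e-4 to 0.02 over 1000 steps) and all function names are assumptions for illustration; they are not taken from diffusion.py. The check is that walking the chain one step at a time gives the same mean and variance as jumping straight to step t.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed schedule: evenly spaced betas (illustrative defaults).
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    # alpha_bar_t = product over s <= t of (1 - beta_s)
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def chain_moments(x0, betas):
    # Push the mean and variance of x_t | x_0 through the chain step by step:
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise.
    mean, var = x0, 0.0
    for b in betas:
        mean = math.sqrt(1.0 - b) * mean
        var = (1.0 - b) * var + b
    return mean, var

def closed_form_moments(x0, abar_t):
    # q(x_t | x_0) = N(sqrt(abar_t) * x0, 1 - abar_t)
    return math.sqrt(abar_t) * x0, 1.0 - abar_t

betas = linear_betas()
abars = alpha_bars(betas)
step_by_step = chain_moments(2.0, betas[:500])      # 500 single steps
one_shot = closed_form_moments(2.0, abars[499])     # one jump to step 500
```

Because the two agree, training can sample any x_t directly from x_0 without simulating the chain.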

DDPM — ELBO to L_simple

The KL-between-Gaussians collapse. Why we predict ε instead of x₀ or μ. What “dropping the weights” actually trades away (likelihood for perceptual quality).
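For reference, the simplified objective is usually written as below. The notation is assumed here rather than quoted from the lesson: $\bar\alpha_t$ is the cumulative product of $1-\beta_s$, and $\epsilon_\theta$ is the network.

```latex
L_{\text{simple}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,\epsilon}
    \left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t \right) \right\|^2 \right],
  \qquad \epsilon \sim \mathcal{N}(0, I)
```

"Dropping the weights" means replacing the per-timestep ELBO coefficient $\beta_t^2 / \big(2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)\big)$ with 1.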

DDPM — ancestral sampling

Bayes’ rule gives you the posterior-mean formula. Why σ² = β_t is the “upper-bound” choice and when β̃_t is tighter. The ~1000-step cost and where it goes.
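A small numeric sketch of the posterior mean and the two variance choices, on a scalar. The function names and the example values are illustrative assumptions. It checks that the ε form of the posterior mean agrees with the Bayes'-rule form written in terms of x_0, and that β̃_t never exceeds β_t.

```python
import math

def posterior_mean_from_eps(x_t, eps, alpha_t, abar_t):
    # Mean of q(x_{t-1} | x_t, x_0), written in terms of the noise eps.
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

def posterior_mean_from_x0(x_t, x0, alpha_t, abar_t):
    # The same mean as a weighted average of x_0 and x_t.
    beta_t = 1.0 - alpha_t
    abar_prev = abar_t / alpha_t
    w0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    wt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return w0 * x0 + wt * x_t

def beta_tilde(alpha_t, abar_t):
    # Posterior variance: (1 - abar_{t-1}) / (1 - abar_t) * beta_t.
    abar_prev = abar_t / alpha_t
    return (1.0 - abar_prev) / (1.0 - abar_t) * (1.0 - alpha_t)

alpha_t, abar_t = 0.98, 0.5          # illustrative values
x0, eps = 1.5, -0.3
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps
mean_eps_form = posterior_mean_from_eps(x_t, eps, alpha_t, abar_t)
mean_x0_form = posterior_mean_from_x0(x_t, x0, alpha_t, abar_t)
```

At the first step, where ᾱ_{t-1} = 1, β̃_t is exactly zero, which is one reason the two variance choices behave differently near the data end of the chain.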

Flow matching — pick a path, not a chain

The continuity equation as the compatibility condition. The CFM trick: conditional expectation is the L₂ minimizer, so regressing on the easy conditional target gives you the impossible marginal one.
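A sketch of the two ingredients of the CFM trick, on scalars. It assumes the linear conditional path with t=0 at the prior and t=1 at data (conventions differ between codebases), and the names are hypothetical. The second half illustrates numerically that the mean of a set of targets is the squared-error minimizer, which is why regressing on conditional velocities recovers the marginal one.

```python
import random

def cfm_training_pair(x0, x1, t):
    # Linear conditional path from a prior sample x0 to a data sample x1.
    x_t = (1.0 - t) * x0 + t * x1
    target_v = x1 - x0              # time derivative of the path; constant in t
    return x_t, target_v

def mse(pred, targets):
    return sum((pred - u) ** 2 for u in targets) / len(targets)

# Many conditional targets can pass through (nearly) the same x_t. The best
# single prediction under squared error is their mean.
rng = random.Random(0)
targets = [rng.gauss(1.0, 2.0) for _ in range(1000)]
mean_target = sum(targets) / len(targets)
loss_at_mean = mse(mean_target, targets)
loss_nearby = [mse(mean_target + d, targets) for d in (-0.5, -0.1, 0.1, 0.5)]
```

Any prediction other than the mean pays an extra squared-offset penalty, so the network trained on easy per-pair targets converges toward the conditional expectation.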

Flow matching — Euler ODEs

Forward Euler on the learned velocity field. Why straight conditional paths buy you 20× fewer integration steps. RK4 as a one-line upgrade.
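A minimal sketch of both integrators on a scalar state. The velocity function `v(x, t)` stands in for the trained network, and the two test fields are made up for illustration. A constant field (a perfectly straight path) is integrated exactly by one Euler step; a curved field shows the gap between the two solvers at the same step count.

```python
import math

def euler(v, x, n_steps, t0=0.0, t1=1.0):
    # Forward Euler: x <- x + h * v(x, t)
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = x + h * v(x, t)
        t += h
    return x

def rk4(v, x, n_steps, t0=0.0, t1=1.0):
    # Classical fourth-order Runge-Kutta: four velocity evaluations per step.
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        k1 = v(x, t)
        k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = v(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return x

straight = euler(lambda x, t: 3.0, 0.0, n_steps=1)      # exact in one step
curved_euler = euler(lambda x, t: x, 1.0, n_steps=10)   # true answer is e
curved_rk4 = rk4(lambda x, t: x, 1.0, n_steps=10)
```

Note that RK4 costs four network calls per step, so the fair comparison is at equal function evaluations, not equal step counts.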

DDPM ↔ flow matching

DDPM is flow matching with the variance-preserving curved path and a stochastic sampler. The conversion formula between ε and v. Where the “deeper” theory (score matching, SDE limits) actually lives.
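One way to write the conversion, for any path of the form $x_t = a_t\,x_{\text{data}} + s_t\,\epsilon$. The symbols $a_t$ and $s_t$ are notation assumed here, and the lesson's time direction may be reversed relative to this sketch.

```latex
v_t = \dot a_t\,x_{\text{data}} + \dot s_t\,\epsilon,
\qquad
x_{\text{data}} = \frac{x_t - s_t\,\epsilon}{a_t}
\;\;\Longrightarrow\;\;
v_t = \frac{\dot a_t}{a_t}\,x_t + \left( \dot s_t - \frac{\dot a_t\,s_t}{a_t} \right)\epsilon
```

The variance-preserving path is the case $a_t = \sqrt{\bar\alpha_t}$, $s_t = \sqrt{1-\bar\alpha_t}$; the straight path is $a_t = t$, $s_t = 1 - t$, for which $v_t = x_{\text{data}} - \epsilon$.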

DiT — transformer denoiser

PatchEmbed as a single GEMM. Bidirectional attention vs. UNet’s convs. adaLN-Zero: the modulation MLP that starts every block at identity and is the reason DiT trains without warmup.
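A toy sketch of the adaLN-Zero gating on a single token vector, in plain Python. It simplifies the modulation network to one linear layer and uses a made-up sublayer; none of the names come from diffusion_transformer.py. The point it shows: with the modulation weights initialized to zero, the gate is zero and the block returns its input unchanged.

```python
import math

def layer_norm(x, eps=1e-6):
    # Parameter-free normalization over the feature dimension.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(W, b, c):
    return [sum(w * ci for w, ci in zip(row, c)) + bi for row, bi in zip(W, b)]

def adaln_zero_block(x, c, W_mod, b_mod, sublayer):
    # The modulation layer maps the conditioning vector c to (shift, scale, gate).
    d = len(x)
    mod = linear(W_mod, b_mod, c)
    shift, scale, gate = mod[:d], mod[d:2 * d], mod[2 * d:]
    h = [n * (1.0 + s) + sh for n, s, sh in zip(layer_norm(x), scale, shift)]
    out = sublayer(h)                  # attention or MLP in a real block
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]

d, c_dim = 3, 2
x = [0.5, -1.0, 2.0]
c = [1.0, 2.0]
W_zero = [[0.0] * c_dim for _ in range(3 * d)]   # the "Zero" in adaLN-Zero
b_zero = [0.0] * (3 * d)
y_init = adaln_zero_block(x, c, W_zero, b_zero,
                          sublayer=lambda h: [10.0 * v for v in h])
```

Whatever the sublayer computes, it is multiplied by a zero gate at initialization, so a deep stack of such blocks starts as the identity map.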

Practical knobs & trade-offs

Linear vs. cosine β. ε vs. v vs. x₀ targets. Step-count vs. solver order. t_scale = 1000 (and why). Classifier-free guidance in five lines. The full extension menu.

Part B · Discrete tokens & integrated systems (lessons 10–15)

Once you can generate continuous things, you can compress images into a sequence of discrete tokens and let an LLM model them directly. That move — tokenize the image — is what makes “native multimodal” reasoning models possible: a single transformer that consumes and produces both text and image tokens, with chain-of-thought reasoning before image generation. Nano Banana Pro, the GPT-Image-2 family, Chameleon, and JanusFlow are all instances. This part covers the tokenizer, the discrete-generation algorithm, the unified architecture, and the reasoning layer on top.

Why discrete?

When tokens beat pixels: text, audio codecs, image patches as a sequence. The four-way menu of discrete generative methods (autoregressive, MaskGIT, discrete diffusion, hybrids) and what each is good for.

VQ-VAE · VQGAN · FSQ

Vector quantization as a learned dictionary. The straight-through gradient and why it works. Codebook collapse, EMA updates, FSQ as the modern simplification. Interactive 2D codebook trainer.
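A sketch of the quantization lookup and a dead-code counter, in plain Python with a hand-written toy codebook. The names are hypothetical. The straight-through estimator needs an autograd framework, so it is described in a comment rather than executed.

```python
def nearest_code(z, codebook):
    # Index of the codebook vector closest to z in squared Euclidean distance.
    def dist2(e):
        return sum((zi - ei) ** 2 for zi, ei in zip(z, e))
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))

def quantize(z, codebook):
    k = nearest_code(z, codebook)
    z_q = list(codebook[k])
    # Straight-through estimator (not executed here): in an autograd framework
    # the forward value is z + stop_gradient(z_q - z). It equals z_q, but the
    # gradient with respect to z_q is passed to the encoder output z unchanged.
    return k, z_q

def usage_counts(zs, codebook):
    # Codes with a count of zero are "dead"; many dead codes signals collapse.
    counts = [0] * len(codebook)
    for z in zs:
        counts[nearest_code(z, codebook)] += 1
    return counts

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
k, z_q = quantize([0.9, 1.2], codebook)
counts = usage_counts([[0.1, 0.0], [0.9, 1.2], [1.1, 0.8]], codebook)
```

In this toy run the third code is never selected, which is the small-scale version of codebook collapse.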

Masked & discrete diffusion

D3PM as the discrete analog of DDPM. MaskGIT’s parallel decoding: start fully masked, unmask the K most confident per round. Cost: ~10 steps vs. autoregression’s N. The accuracy-vs-throughput knob.
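A sketch of the parallel decoding loop with a fixed K per round, as described above. The predictor is a hypothetical stand-in for the transformer, and the fixed K is a simplification: MaskGIT itself varies the number of tokens unmasked per round with a schedule.

```python
MASK = -1

def maskgit_decode(n_tokens, predict, k):
    # Start fully masked; each round, commit the k most confident predictions
    # among the positions that are still masked.
    tokens = [MASK] * n_tokens
    rounds = 0
    while MASK in tokens:
        preds = predict(tokens)     # one (token, confidence) pair per position
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
        rounds += 1
    return tokens, rounds

def toy_predict(tokens):
    # Hypothetical model: position i always predicts token i % 7, with
    # confidence decreasing in i. A real model would condition on `tokens`.
    return [(i % 7, 1.0 / (1 + i)) for i in range(len(tokens))]

decoded, rounds = maskgit_decode(16, toy_predict, k=4)
```

With 16 tokens and K=4 the loop finishes in 4 forward passes instead of the 16 an autoregressive decoder would need; raising K trades accuracy for throughput.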

Unified-token transformers

One transformer over [text tokens, image tokens, audio tokens]. Chameleon, JanusFlow, LWM. Interleaved sequences, BOI/EOI markers, modality-specific decoders. Why this beats two-tower pipelines.

Reasoning + image generation

Chain-of-thought before image tokens: the “think then draw” pattern. Nano Banana Pro, GPT-Image-2 family. Why this beats prompt rewriting. Editing as token surgery. Reasoning over multiple input images.

Hybrid pipelines

When you don’t go fully autoregressive: diffusion conditioned on LLM features (DALL-E 3 + GPT-4, SD3 + T5, Imagen 3). The conditioning interface and where the failures live. Decision tree for picking an architecture.

Part C · The production text-to-image stack (lessons 16–23)

Part A gave you the math of unconditional continuous diffusion; Part B took the discrete fork. Part C returns to continuous diffusion and builds the actual Stable-Diffusion-class product. It branches off Part A — every lesson is forced by a limitation an earlier lesson named, and reduces an “applied AIGC” topic (samplers, guidance, VAEs, CLIP, FID, LoRA, ControlNet…) back to a first principle. This is the answer key to the diffusion/AIGC interview canon, derived rather than memorized.

The score & SDE view

The “deeper theory” lesson 7 deferred. ε-prediction is score estimation (s = −ε/√(1−ᾱ_t)). The forward chain is an SDE; its stochastic reverse = DDPM ancestral sampling, its probability-flow ODE = the deterministic samplers. One object, three currencies.
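The ε-to-score identity follows in two lines from the Gaussian marginal. The notation is assumed to match the DDPM lessons, with $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$.

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left( \sqrt{\bar\alpha_t}\,x_0,\; (1-\bar\alpha_t)\,I \right)
\;\;\Longrightarrow\;\;
\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
  = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}
```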

Fast & deterministic sampling

DDIM as the non-Markovian re-derivation and the PF-ODE’s Euler step. DPM-Solver as an exponential integrator on the semilinear ODE. The smoothness wall, then distillation: progressive, consistency, LCM, Turbo. The whole sampler zoo demystified.
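A sketch of the deterministic (η = 0) DDIM update on a scalar, with hypothetical names. The check uses the true noise in place of a network prediction: in that case one step lands exactly on the same (x_0, ε) trajectory at the earlier noise level, even when the two levels are far apart.

```python
import math

def ddim_step(x_t, eps_hat, abar_t, abar_prev):
    # 1) Estimate x_0 from the current sample and the predicted noise.
    x0_hat = (x_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)
    # 2) Re-noise that estimate to the earlier level, reusing the same noise.
    return math.sqrt(abar_prev) * x0_hat + math.sqrt(1.0 - abar_prev) * eps_hat

x0, eps = 0.7, -1.2
abar_t, abar_prev = 0.2, 0.6          # need not be adjacent timesteps
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
expected = math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * eps
```

With a learned ε̂ the step is only approximate, and the error grows with the size of the jump.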

Guidance

Conditioning = add a vector to the score (Bayes split). Classifier guidance builds the classifier; classifier-free fakes it with (cond−uncond) + 10% dropout, giving ε̂ = ε_∅+w(ε_y−ε_∅). Why w>1 oversharpens. Negative prompts and offset noise.
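The guidance formula and the training-time dropout can be sketched in a few lines. The function names and the `null_cond` placeholder are assumptions for illustration; the 10% rate is the one quoted above.

```python
import random

def cfg_eps(eps_uncond, eps_cond, w):
    # eps_hat = eps_uncond + w * (eps_cond - eps_uncond), elementwise.
    return [eu + w * (ec - eu) for eu, ec in zip(eps_uncond, eps_cond)]

def maybe_drop_condition(cond, rng, p_drop=0.1, null_cond=None):
    # Training side: with probability p_drop, swap the condition for a null
    # token so the same network also learns the unconditional prediction.
    return null_cond if rng.random() < p_drop else cond

eps_u, eps_c = [1.0, 2.0], [3.0, 0.0]
unguided = cfg_eps(eps_u, eps_c, w=0.0)      # pure unconditional
plain = cfg_eps(eps_u, eps_c, w=1.0)         # pure conditional
pushed = cfg_eps(eps_u, eps_c, w=2.0)        # extrapolates past conditional

rng = random.Random(0)
drops = sum(maybe_drop_condition("a cat", rng) is None for _ in range(10000))
```

In this convention w = 1 reproduces the conditional prediction, and anything larger extrapolates along the (cond − uncond) direction.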

VAE, reparameterization & latent diffusion

The “Latent” in Latent Diffusion. AE vs VAE vs VQ-VAE = what distribution the latent is forced into. The reparameterization trick (same move as DDPM’s). The perceptual + adversarial autoencoder SD actually uses, and why its KL is tiny.
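A scalar sketch of the reparameterization trick and the KL term it is paired with. The names are hypothetical. The sampling line has the same shape as DDPM's x_t = √ᾱ_t x_0 + √(1−ᾱ_t) ε: a deterministic function of the parameters plus scaled standard noise.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, 1); the randomness sits in eps,
    # so gradients can flow to mu and log_var.
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ) for one dimension.
    return 0.5 * (mu ** 2 + math.exp(log_var) - 1.0 - log_var)

rng = random.Random(0)
samples = [reparameterize(2.0, math.log(0.25), rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
kl_at_prior = kl_to_standard_normal(0.0, 0.0)
kl_shifted = kl_to_standard_normal(1.0, 0.0)
```

A very small weight on the KL term leaves the latent close to an ordinary autoencoder's, which is the regime the lesson refers to.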

CLIP & text conditioning

Contrastive pretraining and the symmetric InfoNCE loss, read off the similarity matrix. How the text vector enters the denoiser: cross-attention (Q=image, K/V=text) vs self-attention. CLIP’s pros/cons and why SD3 bolts on a T5.
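A sketch of the symmetric InfoNCE loss computed directly from a similarity matrix, in plain Python. The temperature of 0.07 is an assumed illustrative value (in CLIP it is a learned parameter), and the function names are hypothetical.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def symmetric_info_nce(sim, temperature=0.07):
    # sim[i][j] = similarity of image i with text j; matched pairs sit on
    # the diagonal. Cross-entropy over rows (image -> text) and over columns
    # (text -> image), then averaged.
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    img_to_txt = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)

uniform = [[0.0] * 4 for _ in range(4)]
aligned = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
loss_uniform = symmetric_info_nce(uniform)   # no information: log(batch size)
loss_aligned = symmetric_info_nce(aligned)   # diagonal dominates: near zero
```

An uninformative matrix scores log N, and the loss falls toward zero as the diagonal pulls away from the off-diagonal entries.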

Evaluation & the GAN contrast

Inception Score (sharp-and-varied, blind to real data and intra-class collapse). FID as the Fréchet distance between Gaussian fits of Inception features — and what its covariance term catches that IS misses. GANs, mode collapse, and why diffusion won.
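For reference, the Fréchet distance between the two Gaussian fits has a closed form. The subscripts $r$ (real) and $g$ (generated) are notation assumed here.

```latex
\mathrm{FID} = \left\| \mu_r - \mu_g \right\|_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The first term compares feature means; the trace term compares covariances, which is the part that penalizes a generator whose samples have too little spread.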

SD → SDXL → SD3

Stable Diffusion = lessons 16–21 wired together (the full training loop). SD vs DALL·E (latent diffusion vs unCLIP prior vs autoregression). SDXL’s scaling + size/crop conditioning + refiner. SD3’s switch to rectified flow on an MM-DiT with T5.

Adapt & control a frozen model

One axis: where do you inject new information? Textual Inversion (embedding) → LoRA (low-rank deltas) → IP-Adapter (parallel attention) → HyperNetwork (generated layers) → ControlNet (zero-conv branch) → DreamBooth (all weights) → inpainting (latent surgery). Why LoRA sits at the knee.
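A sketch of the LoRA forward pass on one linear layer, in plain Python with toy shapes. The names, the α value, and the example matrices are assumptions for illustration. It shows two things: zero-initializing B leaves the frozen model's output unchanged at the start of fine-tuning, and the trainable parameter count scales with r(d_in + d_out) rather than d_in · d_out.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha, r):
    # y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train.
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def lora_param_count(d_out, d_in, r):
    # A is r x d_in, B is d_out x r.
    return r * (d_in + d_out)

d_out, d_in, r = 3, 4, 2
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]
A = [[0.5, -0.5, 0.25, 1.0], [1.0, 0.0, -1.0, 0.5]]    # r x d_in, nonzero
B = [[0.0] * r for _ in range(d_out)]                    # d_out x r, zero init
x = [1.0, 2.0, 3.0, 4.0]
y_init = lora_forward(W, A, B, x, alpha=4.0, r=r)
trainable = lora_param_count(768, 768, 8)
full = 768 * 768
```

For a 768 × 768 layer at rank 8, that is 12,288 trainable parameters against 589,824 in the full matrix.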

How to use this

Linearly. Each lesson assumes the previous one’s vocabulary. Lessons 02–04 build DDPM; 05–06 build FM; 07 connects them; 08–09 talk architecture and practice.
Touch every knob. Each widget has a configuration that produces visibly wrong samples or visibly fails to integrate. Find it — the failure is the lesson.
Open the code. Every claim corresponds to lines you can read in diffusion.py, flow_matching.py, or diffusion_transformer.py. The lessons explain why; the code is what.

Companion reading

The README gives the same content in dense form. Use it as a refresher after the lessons; the lessons exist to make the README’s one-liners believable.