Why discrete?
When tokens beat pixels, and the four-way menu of discrete generation methods that follows from that choice.
The frame shift
Part A treated the data as continuous, x ∈ ℝ^D, and learned a vector field or denoiser that operated on real numbers. That’s the right move for raw pixels at high resolution. But there’s a different option that turns out to be very productive: quantize the data into a sequence of discrete symbols from a finite vocabulary, then model the symbol sequence with a transformer.
Why bother? Three reasons:
- Compression. A 256×256 RGB image is 196,608 floats (256 × 256 × 3). A typical VQ tokenizer turns it into 256–1024 tokens drawn from a vocabulary of ~16k entries, a sequence roughly 200–800× shorter. The per-token cost of a transformer shrinks by that same factor, and self-attention, which scales quadratically with sequence length, shrinks by far more.
- Compatibility with language models. An LLM is already a sequence-of-tokens model. If you tokenize images into the same alphabet (or a parallel alphabet), one transformer can jointly reason over text and image tokens. This is the architecture under Chameleon, Gemini 2.5 Flash Image (Nano Banana), GPT-Image-1/2 and friends. We get to it in lessons 13–14.
- Discrete objectives are easier to interpret. Categorical cross-entropy gives you per-token loss, per-token sample probabilities, per-token attention — the same telemetry an LLM gives you. Continuous diffusion makes you reason about score fields and KL bounds.
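The compression arithmetic from the first bullet can be checked in a few lines. This is a rough sketch: the downsampling factors 16 and 8 are illustrative assumptions chosen to give 256 and 1024 tokens, and the "attention pairs" figure only counts pairwise comparisons, not total FLOPs.

```python
# Back-of-the-envelope sizes for a 256x256 RGB image vs. its token grid.
# The downsampling factors below are illustrative assumptions.

def pixel_values(height: int, width: int, channels: int = 3) -> int:
    """Number of scalar values in a raw image."""
    return height * width * channels

def token_count(height: int, width: int, downsample: int) -> int:
    """Tokens when each downsample x downsample patch becomes one token."""
    return (height // downsample) * (width // downsample)

raw = pixel_values(256, 256)
for factor in (16, 8):
    n = token_count(256, 256, factor)
    length_ratio = raw / n            # how much shorter the sequence is
    pair_ratio = length_ratio ** 2    # self-attention compares all pairs
    print(f"f={factor}: {n} tokens, {length_ratio:.0f}x shorter, "
          f"{pair_ratio:,.0f}x fewer attention pairs")
```

The linear ratio is the saving in per-token work; the squared ratio is why long pixel sequences are so much harder on attention than short token grids.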
The two-stage pattern
Almost every discrete generative system follows the same shape:
Stage 1 is a tokenizer (lesson 11): a VAE-style encoder that produces a small spatial grid of discrete indices, plus a decoder that reverses the map. The tokenizer is trained separately on raw images, then frozen.
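To make "discrete indices" concrete, here is a minimal sketch of the quantization step alone, assuming a nearest-neighbor codebook lookup in the style of VQ-VAE. The 2-D codebook and latent vectors are made-up values; a real tokenizer learns both the encoder and the codebook.

```python
# Toy vector quantization: map each encoder output vector to the index of
# its nearest codebook entry, and decode by looking that entry back up.
# Codebook and latents are made-up illustrative values.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latents, codebook):
    """Return one codebook index (token) per latent vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in latents
    ]

def dequantize(tokens, codebook):
    """Replace each token with its codebook vector (the decoder's input)."""
    return [codebook[t] for t in tokens]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]
tokens = quantize(latents, codebook)
print(tokens, dequantize(tokens, codebook))
```

Stage 2 only ever sees the integer list `tokens`; the continuous vectors come back when the decoder looks them up.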
Stage 2 is a sequence model over the discrete tokens. Four flavors:
| Method | How it samples | Steps | Quality vs. AR | Used by |
|---|---|---|---|---|
| Autoregressive (next-token) | one token at a time, left to right | N (= seq length) | highest fidelity, slowest | Chameleon, GPT-Image (text-to-image branch), DALL-E 1, ImageGPT |
| MaskGIT / parallel decoding | start fully masked, unmask top-K confident per round | ~10–20 | nearly AR-quality at 10× speed | MaskGIT, Muse, MAGVIT-v2 |
| Discrete diffusion (D3PM, MD4) | iteratively edit a sequence with mask/swap transitions | ~20–100 | matches AR on some benchmarks | D3PM, MD4, SEDD (Lou et al. 2024) |
| Hybrid (token model + diffusion decoder) | generate discrete tokens with AR or parallel decoding, then run a continuous diffusion decoder on top → pixels | (N or ~10) + ~50 | state of the art at the cost of two-model engineering | several 2025–2026 “pro” tiers compose this way (the discrete model handles structure/text; diffusion handles texture). Note: Parti by itself is pure-AR-with-VQ-decoder (no diffusion); Imagen by itself is cascaded pixel diffusion (no AR) |
Autoregression — the simplest baseline
Given tokens z_1, …, z_N, factorize the joint by the chain rule:
$$p(z_1, \dots, z_N) = \prod_{i=1}^{N} p(z_i \mid z_1, \dots, z_{i-1})$$
Train with cross-entropy; sample left-to-right. This is literally GPT applied to image tokens. The pros are obvious: transformers are well understood, the loss is convex in each token’s logits, and teacher forcing makes parallel training trivial. The con is sampling cost: with a tokenizer that maps a 256×256 image to 256 tokens, generation takes 256 sequential forward passes through the network.
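The sampling loop can be sketched as follows. `toy_next_token_probs` is a hypothetical stand-in for the transformer (it simply favors repeating the previous token), and the vocabulary size and sequence length are toy assumptions. The point is the shape of the loop: one model call per token, with the sequence log-probability built up as a sum of log-conditionals.

```python
import math
import random

VOCAB = 8      # toy vocabulary size (assumption)
SEQ_LEN = 16   # toy sequence length (assumption)

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a transformer: p(z_i | z_<i) over VOCAB
    symbols. Rule used here: favor repeating the previous token."""
    logits = [0.0] * VOCAB
    if prefix:
        logits[prefix[-1]] = 2.0
    total = sum(math.exp(l) for l in logits)
    return [math.exp(l) / total for l in logits]

def sample_autoregressive(rng):
    tokens, log_prob = [], 0.0
    for _ in range(SEQ_LEN):                  # one model call per token
        probs = toy_next_token_probs(tokens)
        z = rng.choices(range(VOCAB), weights=probs)[0]
        log_prob += math.log(probs[z])        # chain rule in log space
        tokens.append(z)
    return tokens, log_prob

tokens, log_prob = sample_autoregressive(random.Random(0))
print(tokens, round(log_prob, 3))
```

The loop cannot be parallelized across positions because each call needs the token sampled by the previous one, which is exactly the sampling cost described above.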
Parallel decoding (MaskGIT) — the practical win
Treat the sequence as a partially filled grid; train the model to predict the masked positions given the unmasked ones. At inference, start with everything masked, predict every position at once, keep only the top-K most confident predictions, re-mask the rest, and repeat. After ~10 rounds the whole grid is filled.
This is much faster (1.6 seconds vs. 30 seconds on a typical text-to-image task at matched resolution), and quality is competitive with autoregression on standard benchmarks. Lesson 12 walks through the iterative-unmasking algorithm with an interactive widget. The key insight: order matters less for images than for text, so parallel decoding doesn’t hurt much.
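A minimal sketch of the iterative-unmasking loop, under loud assumptions: `toy_predict` is a hypothetical stand-in for the masked transformer and returns random guesses with random confidences, and the cosine schedule for how many positions stay masked follows the MaskGIT paper's general recipe rather than any particular implementation.

```python
import math
import random

MASK = -1       # sentinel for "not yet decoded"
VOCAB = 8       # toy vocabulary size (assumption)
SEQ_LEN = 16    # toy grid, flattened (assumption)

def toy_predict(tokens, rng):
    """Hypothetical stand-in for the masked transformer: returns a
    (token, confidence) guess for every position in one call."""
    return [(rng.randrange(VOCAB), rng.random()) for _ in tokens]

def maskgit_decode(rounds, rng):
    tokens = [MASK] * SEQ_LEN
    for r in range(1, rounds + 1):
        masked = [i for i, z in enumerate(tokens) if z == MASK]
        if not masked:
            break
        preds = toy_predict(tokens, rng)      # one forward pass per round
        # Cosine schedule: how many positions stay masked after round r.
        still_masked = int(SEQ_LEN * math.cos(math.pi / 2 * r / rounds))
        n_reveal = min(len(masked), max(1, len(masked) - still_masked))
        # Commit the most confident predictions among masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_reveal]:
            tokens[i] = preds[i][0]
    return tokens

tokens = maskgit_decode(rounds=8, rng=random.Random(0))
print(tokens)
```

The number of model calls is `rounds`, independent of `SEQ_LEN`, which is where the speedup over left-to-right sampling comes from.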
Discrete diffusion
The natural generalization of MaskGIT: instead of just “masked or revealed,” allow the corruption process to edit tokens (swap one symbol for another) at each step. Then train the model to predict the clean tokens given the corrupted ones, much as a DDPM denoiser recovers clean data from noisy data. D3PM (Austin et al. 2021) was the first big paper here; SEDD (Lou et al. 2024) made it competitive with autoregression on text.
For images, masked-only diffusion (= MaskGIT with a longer schedule) usually wins; for text and code, swap-based diffusion is more interesting because tokens have natural “neighbors” (synonyms, edit-distance-1 swaps).
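The two corruption styles can be contrasted with a small sketch. It assumes a single noise level `t` in [0, 1] applied independently per token (the marginal of the forward process, not a step-by-step transition matrix as in D3PM), and a uniform swap as the simplest "edit"; neighbor-aware swaps would need a domain-specific similarity structure not shown here.

```python
import random

MASK = -1    # absorbing "mask" symbol
VOCAB = 8    # toy vocabulary size (assumption)

def corrupt_absorbing(tokens, t, rng):
    """Masked diffusion: each token becomes MASK with probability t."""
    return [MASK if rng.random() < t else z for z in tokens]

def corrupt_uniform(tokens, t, rng):
    """Swap-style diffusion: with probability t, resample the token
    uniformly from the vocabulary (it may land on the same symbol)."""
    return [rng.randrange(VOCAB) if rng.random() < t else z for z in tokens]

rng = random.Random(0)
clean = [rng.randrange(VOCAB) for _ in range(16)]
for t in (0.0, 0.5, 1.0):
    print(t, corrupt_absorbing(clean, t, rng), corrupt_uniform(clean, t, rng))
```

In both cases the training pair is (corrupted sequence, `clean`). With the absorbing version the model can see which positions were corrupted; with the swap version it has to work that out itself.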
Interactive · the four samplers, in miniature
Below, four samplers fill a 16-token grid (think: one flattened row of patches). The “target” is a synthetic distribution where each cell prefers a specific color. Watch how each method fills the grid step-by-step. Autoregression is strictly left-to-right; MaskGIT fills from confident regions outward; discrete diffusion edits everywhere at once; the hybrid does AR for token IDs then refines via a smoothing pass.
How to pick
| Situation | Pick | Why |
|---|---|---|
| Need joint text+image reasoning | autoregressive over unified tokens | simplest path to chain-of-thought + image-as-output (lessons 13–14) |
| Need fastest image generation at known resolution | MaskGIT / Muse | ~10 forward passes; competitive quality |
| Want absolute best quality, throughput is secondary | discrete token model + frozen diffusion decoder on the tokens | standard recipe for several 2026 “pro” tiers (Parti and Imagen are not a single pipeline — this row describes the architectural pattern, not a specific product) |
| Domain has natural token neighbors (text, code) | SEDD / discrete diffusion | edits-during-corruption matches what text wants to do |
| Audio codec output | autoregressive or RVQ + diffusion | autoregressive over EnCodec tokens has been the dominant baseline |
What this part of the series is doing
Lesson 11 covers the tokenizer in depth (VQ-VAE, VQGAN, FSQ — the modern simplification). Lesson 12 covers discrete diffusion / MaskGIT mechanics. Lesson 13 zooms out to unified-token transformers (Chameleon, JanusFlow). Lesson 14 covers the “chain-of-thought before drawing” pattern that gives Nano Banana Pro and the GPT-Image-2 family their reasoning. Lesson 15 covers hybrid pipelines (LLM-conditioned diffusion: DALL-E 3, SD3+T5, Imagen 3) for when full autoregression isn’t the right answer.