Masked & discrete diffusion

The discrete analog of DDPM. MaskGIT’s parallel decoding. How to sample N tokens in ~10 forward passes instead of N.

The setup, by analogy

Continuous DDPM corrupts data by adding Gaussian noise in T steps; the model learns to denoise. Discrete diffusion does the same with a categorical corruption. Define a forward Markov chain on token sequences z₀, z₁, …, z_T with

q(z_t | z_t−1) = Cat( z_t ; transition_matrix(β_t) )

where the transition matrix says “with probability 1 − β_t keep the token unchanged; with probability β_t, do something corrupting.” The main choices for “something”:

Choice	What corruption looks like	Used in
Absorbing (mask)	token becomes a special [MASK] symbol; once masked, stays masked	MaskGIT, MDLM, MD4, most image work
Uniform (swap)	token becomes a uniformly random vocab entry	D3PM uniform variant, less used in practice
Edit distance (swap to neighbor)	token swaps to a similar one (e.g. semantically near in embedding)	D3PM nearest-neighbor (embedding) variant, text-focused work

For images, absorbing (mask-only) wins almost always. The reason: image patches don’t have meaningful “edit distance” in token space, while “present vs. missing” is a clean signal the model can condition on.
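To make the first two rows of the table concrete, here is a minimal NumPy sketch of the one-step transition matrices. The function names, the convention that [MASK] is the last state, and the row-stochastic layout (row = previous token, column = next token) are assumptions made for illustration, not taken from any particular codebase.

```python
import numpy as np

def absorbing_matrix(vocab_size: int, beta: float) -> np.ndarray:
    """One-step transition matrix for absorbing (mask) corruption.

    States 0..vocab_size-1 are real tokens; state vocab_size is [MASK].
    Row i is the distribution of z_t given z_{t-1} = i.
    """
    k = vocab_size + 1
    q = (1.0 - beta) * np.eye(k)   # keep the token with probability 1 - beta
    q[:, -1] += beta               # otherwise jump to [MASK]
    q[-1, :] = 0.0
    q[-1, -1] = 1.0                # [MASK] is absorbing: it never leaves
    return q

def uniform_matrix(vocab_size: int, beta: float) -> np.ndarray:
    """One-step transition matrix for uniform corruption (no mask state).

    With probability beta the token is resampled uniformly from the vocabulary,
    which includes a 1/vocab_size chance of landing on the same token again.
    """
    v = vocab_size
    return (1.0 - beta) * np.eye(v) + (beta / v) * np.ones((v, v))

q_abs = absorbing_matrix(vocab_size=4, beta=0.3)
q_uni = uniform_matrix(vocab_size=4, beta=0.3)
```

Every row of both matrices sums to 1. The structural difference is the last row of `q_abs`, which is what makes masking one-way.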

The closed-form marginal — discrete Eq. ⋆

For absorbing corruption, the marginal probability of token z_t given z₀ is:

q(z_t = MASK | z₀) = γ_t, q(z_t = z₀ | z₀) = 1 − γ_t

where γ_t ∈ [0, 1] is the cumulative mask probability at step t — the discrete analog of the noise-mass fraction (1 − ᾱ_t) from DDPM (Eq. ⋆ in lesson 2). Both are probabilities / fractions in [0, 1] measuring accumulated corruption — neither is a standard deviation. Pick any monotone γ_t with γ₀ = 0, γ_T = 1; linear, cosine, sigmoid all work. To sample z_t, independently mask each token with probability γ_t. Closed-form, simulation-free, exactly like DDPM’s Eq. ⋆.
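The link between the per-step β_t and the cumulative γ_t can be checked numerically: a token is still unmasked at step t only if it survived every step, so γ_t = 1 − ∏_{s≤t} (1 − β_s). The sketch below compares that formula against a step-by-step simulation. The constant β = 0.2 and T = 10 are hypothetical values chosen only for the check; with a constant β the chain approaches γ = 1 only asymptotically, so this is not meant as a usable schedule.

```python
import numpy as np

T = 10
betas = np.full(T, 0.2)                    # hypothetical per-step mask probabilities
gammas = 1.0 - np.cumprod(1.0 - betas)     # gamma_t = 1 - prod_{s<=t} (1 - beta_s)

# Monte Carlo check: run the chain one step at a time for many independent tokens.
rng = np.random.default_rng(0)
masked = np.zeros(200_000, dtype=bool)
simulated = []
for beta in betas:
    masked |= rng.random(masked.shape) < beta   # masked tokens stay masked
    simulated.append(masked.mean())
simulated = np.array(simulated)
```

The simulated fractions agree with `gammas` up to sampling noise, which is why the chain never has to be simulated during training.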

Intuition · linear unpacking

Claim: the whole forward process is a single coin flip per token — no chain to simulate.

What corruption can do. Under masking, a token has exactly two fates: it’s still itself, or it’s [MASK]. Once masked it can never come back (absorbing). So the only question about z_t is: has this token been masked yet?
One number answers it. “Has it been masked by step t?” is a yes/no with probability γ_t. That single number rolls up all t steps of the chain — just as (1 − ᾱ_t) rolls up t Gaussian kicks into one noise fraction.
So you skip the chain. To get z_t you don’t step through z₁, …, z_t−1. You flip one biased coin per token at rate γ_t and mask the heads. Every token is independent, so the whole sequence is one vectorized masking pass.

Central point. γ_t is just “the chance a token is masked by now,” and because masking is one-way, that one probability lets you jump straight to any step.
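The "one coin flip per token" claim translates into a few lines of NumPy. This is a sketch under assumptions: the sentinel `MASK = -1`, the function names, and the particular cosine form of γ are illustrative choices, since the text only requires γ to be monotone with γ₀ = 0 and γ_T = 1.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for [MASK]; real tokenizers reserve a vocab id

def gamma_cosine(t: float) -> float:
    """One monotone schedule with gamma(0) = 0 and gamma(1) = 1; t is t/T in [0, 1]."""
    return 1.0 - np.cos(0.5 * np.pi * t)

def corrupt(z0: np.ndarray, gamma: float, rng: np.random.Generator) -> np.ndarray:
    """Sample z_t directly from z_0: mask each token independently with probability gamma."""
    hit = rng.random(z0.shape) < gamma
    return np.where(hit, MASK, z0)

rng = np.random.default_rng(0)
z0 = rng.integers(0, 1024, size=256)           # a stand-in "clean" token sequence
zt = corrupt(z0, gamma_cosine(0.5), rng)       # jump straight to the halfway step
```

Every position in `zt` is either its original token or `MASK`, and no intermediate z₁, …, z_t−1 was ever materialized.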

The training objective

Strip away the variational machinery and the training task is embarrassingly familiar: show the model a sequence with some tokens masked out, and ask it to name the missing ones. The unmasked tokens are free (the model already sees them), so the only thing worth scoring is the guesses at the masked slots, and the natural score for a categorical guess is cross-entropy. That is what the discrete ELBO reduces to, up to a schedule-dependent weight on each step t that is often dropped in practice (MaskGIT-style training uses the unweighted form). Just as DDPM’s Gaussian ELBO collapsed to an MSE, the discrete ELBO collapses to a sum of cross-entropies over the masked positions:

L = 𝔼_{t, z₀, mask} [ Σ_{i : masked} −log p_θ(z_0,i | z_t, t) ]

Read it left to right: average over a random step t and a random masking, then for each masked slot i add the negative log-probability the model assigned to the true clean token z_0,i. In words: predict the clean token at every masked position, given the partially-masked sequence. That’s BERT’s masked-language-modeling objective with a single twist: the mask rate is sampled per example (so the model sees every fraction from 0% to 100% masked, not just a fixed 15%).

Intuition · linear unpacking

Claim: the scary categorical ELBO is just “fill in the blanks, graded by cross-entropy” — BERT with a dial.

Only the blanks count. In z_t some tokens are visible and some are [MASK]. The visible ones carry no learning signal — the model can copy them. All the loss lives at the masked slots, which is why the sum runs over i : masked only.
Each blank is a classification. At a masked slot the model outputs a distribution over the vocabulary and there’s one right answer, the original token z_0,i. Scoring “how much probability did you put on the right answer?” is exactly cross-entropy, −log p_θ. No KL formula to evaluate by hand.
The expectation is just the training loop. “𝔼_{t, z₀, mask}” reads as: each step, grab a real sequence, pick a random mask rate, mask accordingly, and average the blank-filling loss. Sampling t is what turns a fixed-15%-mask BERT into a diffusion model that has seen every corruption level.

Central point. The objective isn’t new math — it’s masked-token prediction graded by cross-entropy, with the mask rate randomized so the model learns to denoise from any amount of masking.

If you train at a fixed mask rate you’ve trained BERT. If you train at variable mask rate, you’ve trained a discrete diffusion model. The architecture is the same.
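A single Monte Carlo sample of the objective can be sketched as follows. Assumptions to note: `model` is any callable returning logits of shape (N, V), `MASK = -1` is a hypothetical sentinel, the mask rate is drawn uniformly from [0, 1), and the loss is the unweighted sum from the formula above. None of these choices are prescribed by the text beyond the formula itself.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for [MASK]

def masked_cross_entropy(logits: np.ndarray, z0: np.ndarray, is_masked: np.ndarray) -> float:
    """Sum of -log p(z0_i | z_t) over masked positions.

    logits: (N, V) model outputs, z0: (N,) clean token ids, is_masked: (N,) bool.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)             # stabilise the softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(z0)), z0]                          # per-position loss, shape (N,)
    return float(nll[is_masked].sum())                                # visible slots contribute nothing

def training_loss(model, z0: np.ndarray, rng: np.random.Generator) -> float:
    """One sample of the objective for a single sequence."""
    gamma = rng.random()                         # random mask rate, the "dial" that BERT lacks
    is_masked = rng.random(z0.shape) < gamma     # independent coin flip per token
    zt = np.where(is_masked, MASK, z0)
    logits = model(zt, gamma)                    # expected shape (N, V)
    return masked_cross_entropy(logits, z0, is_masked)
```

Fixing `gamma = 0.15` in `training_loss` would give a BERT-style objective; sampling it is the only change needed to cover every corruption level.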

Sampling: the MaskGIT loop

Given a trained p_θ(z₀ | z_t), here is the sampler in a few lines of pseudocode:

# start with everything masked
z = [MASK] * N
for r in range(R):                       # R rounds ~ 10
    # one forward pass over the whole sequence
    logits = model(z, t=(R - r) / R)
    probs  = softmax(logits, axis=-1)
    pred   = sample(probs)                   # categorical draw per position
    conf   = probs[range(N), pred]           # confidence = probability of the drawn token
    # cosine mask schedule: mask_frac shrinks from cos(pi / (2R)) after round 0 to 0 after the last round
    mask_frac = cos((r + 1) / R * pi / 2)
    # floor, not ceil: cos(pi / 2) is ~6e-17 in floating point, and ceil would leave one cell masked forever
    n_masked  = floor(N * mask_frac)      # cells still masked after this round
    n_keep    = N - n_masked              # cells revealed after this round
    # take the n_keep most confident positions; mask the rest back to [MASK]
    # (cells committed in earlier rounds keep their token and count as confidence = +inf)
    z = commit_top_k_by_confidence(pred, conf, n_keep)

Reading the schedule the right way: n_masked is what stays masked after round r. The cosine starts near 1 (almost everything is still masked after round 0), shrinks slowly in the early rounds and fastest in the late ones, and reaches 0 after the final round (r = R − 1). The dual quantity n_keep = N − n_masked is what the model has committed to after the round. The original MaskGIT paper writes this as a “mask schedule” rather than an “unmask schedule”; both are valid as long as you keep the polarity straight.

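Each round costs one forward pass. After R rounds, the sequence is fully filled. For a typical image at 256 tokens and R = 12, that’s 12 forward passes instead of 256, about 21× fewer than autoregression needs. The wall-clock gain is smaller than the pass count suggests when the autoregressive decoder uses a KV cache, because each of its passes then processes only one new token.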
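For readers who want to run the loop end to end, here is a self-contained NumPy sketch. It is not the MaskGIT reference implementation: `toy_model` returns random logits in place of a trained network, `MASK = -1` is a hypothetical sentinel, and sampling uses the Gumbel-max trick (argmax of logits plus Gumbel noise is a draw from the softmax). The point is only to show the bookkeeping of the schedule and the commit step.

```python
import math
import numpy as np

MASK = -1  # hypothetical sentinel id for [MASK]

def toy_model(z: np.ndarray, vocab_size: int, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for a trained network: random logits of shape (N, V)."""
    return rng.normal(size=(len(z), vocab_size))

def maskgit_sample(n_tokens: int, vocab_size: int, n_rounds: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    z = np.full(n_tokens, MASK)
    for r in range(n_rounds):
        logits = toy_model(z, vocab_size, rng)             # one forward pass per round
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        # Gumbel-max trick: a categorical draw from softmax(logits) at every position
        pred = (logits + rng.gumbel(size=logits.shape)).argmax(axis=-1)
        conf = probs[np.arange(n_tokens), pred]            # probability of the drawn token
        # tokens committed in earlier rounds are kept and can never be re-masked
        committed = z != MASK
        pred = np.where(committed, z, pred)
        conf = np.where(committed, np.inf, conf)
        # cosine schedule: how many cells remain masked after this round
        n_masked = math.floor(n_tokens * math.cos((r + 1) / n_rounds * math.pi / 2))
        n_keep = n_tokens - n_masked
        keep = np.argsort(-conf)[:n_keep]                  # most confident positions first
        z = np.full(n_tokens, MASK)
        z[keep] = pred[keep]
    return z

tokens = maskgit_sample(n_tokens=256, vocab_size=1024, n_rounds=12)
```

Because `n_keep` never decreases from one round to the next and committed cells sort first, a token that has been revealed stays revealed. After the final round `n_masked` is 0 and no `MASK` entries remain.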

Why “keep the most confident”?

At each round the model gives you a guess for every masked cell. Some guesses are confident (sharp posterior), some are diffuse. If you commit to all of them at once, the diffuse ones are essentially random. By committing only to the confident ones and remasking the rest, you give the model another round of conditioning on the committed cells: the diffuse cells’ posteriors sharpen because they now have more context. The idea is similar in spirit to search-based decoders such as beam search, which also avoid locking in uncertain decisions too early.

Interactive · MaskGIT rounds on a synthetic 64-token grid

[Interactive demo not reproduced here.] The demo shows an 8×8 token grid representing a small image. The “model” is an oracle: every position has a true preferred color, and the predicted confidence is a function of how much of the neighborhood is already filled (so committed cells help neighbors become more confident). Varying the number of rounds from low to high exposes the trade-off between rounds and quality.

What the schedule does

The unmask schedule decides how fast we commit. Three common choices:

Linear: at round r reveal (r+1)/R of cells. Even pacing.
Cosine (default in MaskGIT): reveal slowly at first, then fast. Lets the model take many small bites early when context is poor, large bites late when context is rich.
Sqrt: reveal fast at first, slow at the end. Front-loads commitments, which is rarely the right choice for images: the early rounds have the least context, so they are exactly when commitments most deserve deferral.

The trade-off is “number of rounds” vs. “quality.” A common operating point is R in the 8–12 range with cosine, in line with the original MaskGIT paper.
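The three schedules are easiest to compare as cumulative reveal counts. The sketch below uses the same polarity as the sampler (compute the fraction still masked, floor it, subtract from N). The function name and the exact functional forms for "linear" and "sqrt" are assumptions read off the descriptions in the list above.

```python
import math

def revealed_after(round_idx: int, n_rounds: int, n_tokens: int, kind: str) -> int:
    """Cumulative number of committed tokens after round `round_idx` (0-based)."""
    x = (round_idx + 1) / n_rounds
    if kind == "linear":
        masked_frac = 1.0 - x
    elif kind == "cosine":
        masked_frac = math.cos(x * math.pi / 2)
    elif kind == "sqrt":
        masked_frac = 1.0 - math.sqrt(x)
    else:
        raise ValueError(f"unknown schedule: {kind}")
    return n_tokens - math.floor(n_tokens * masked_frac)

# Reveal curves for N = 256 tokens and R = 12 rounds
curves = {
    kind: [revealed_after(r, 12, 256, kind) for r in range(12)]
    for kind in ("linear", "cosine", "sqrt")
}
```

With N = 256 and R = 12, the first round commits 3 tokens under cosine, 22 under linear, and 74 under sqrt, and all three reach 256 after the last round. That first-round gap is the "small bites early" behaviour in numbers.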

Discrete diffusion beyond masking

D3PM (Austin et al. 2021) generalized to arbitrary discrete corruption matrices — not just masking. The math is identical to DDPM but with categorical KLs instead of Gaussian KLs:

L_t = 𝔼_{z₀, z_t} [ KL( q(z_t−1 | z_t, z₀) ‖ p_θ(z_t−1 | z_t) ) ]

For absorbing corruption the KL reduces to the masked cross-entropy above (up to a schedule-dependent weight); for other corruptions you compute the categorical KL directly. Recent work (SEDD, MD4) has made discrete diffusion competitive on text; for image tokens, the absorbing case dominates.
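A small worked example shows how the KL term is assembled for a general transition matrix and what it collapses to under masking. This is a sketch with assumed conventions: matrices are row-stochastic (row = source state), [MASK] is the last state, and the reverse model is built from a predicted p_θ(z₀ | z_t) by averaging the true posterior over it, in the style of D3PM's x₀-parameterization. All numbers are hypothetical.

```python
import numpy as np

def absorbing(vocab_size: int, p: float) -> np.ndarray:
    """Absorbing transition matrix with mask probability p; [MASK] is the last state."""
    k = vocab_size + 1
    q = (1.0 - p) * np.eye(k)
    q[:, -1] += p
    q[-1] = 0.0
    q[-1, -1] = 1.0
    return q

def posterior(q_step: np.ndarray, q_bar_prev: np.ndarray, z0: int, zt: int) -> np.ndarray:
    """q(z_{t-1} | z_t, z_0) by Bayes' rule.

    q_step[i, j]     = q(z_t = j | z_{t-1} = i)
    q_bar_prev[i, j] = q(z_{t-1} = j | z_0 = i)
    """
    unnorm = q_step[:, zt] * q_bar_prev[z0, :]
    return unnorm / unnorm.sum()

def categorical_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) for categorical distributions, skipping zero-mass entries of p."""
    support = p > 0
    return float(np.sum(p[support] * (np.log(p[support]) - np.log(q[support] + eps))))

V, MASK = 3, 3
q_step = absorbing(V, 0.5)        # beta_t = 0.5
q_bar_prev = absorbing(V, 0.5)    # gamma_{t-1} = 0.5, so gamma_t = 0.75
target = posterior(q_step, q_bar_prev, z0=1, zt=MASK)

x0_probs = np.array([0.1, 0.7, 0.2])   # hypothetical model output p_theta(z_0 | z_t)
model_post = sum(x0_probs[k] * posterior(q_step, q_bar_prev, k, MASK) for k in range(V))
kl = categorical_kl(target, model_post)
```

Here `kl` equals (1/3) · (−log 0.7): the cross-entropy at the masked slot, scaled by (γ_t − γ_t−1)/γ_t = 1/3. Swapping `absorbing` for a uniform or embedding-based matrix leaves `posterior` and `categorical_kl` unchanged, but the KL then no longer simplifies this way.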

Why this matters for the multimodal stack

Parallel decoding is the standard research choice for token-based image generators that prioritize throughput (Muse, MAGVIT-v2 are the canonical examples). Flagship product systems — Nano Banana Pro, the GPT-Image-2 family — don’t publicly document their decoder strategy; reasonable inferences from latency profiles and reported capabilities suggest hybrid approaches (parallel decoding for the bulk of image tokens, with autoregressive or diffusion stages around the edges). The architectural option is what matters here: any transformer trained with a masked image-token objective can run either decoder. Reasons to lean parallel for the image span:

Throughput. 12 forwards beats 256 forwards by a lot, especially when image generation latency dominates user-perceived latency.
Editing. Parallel decoders trivially support “regenerate this region”: just mask the region and run a few more rounds. Autoregression has to either start over or use a clever inpainting recipe.
Bidirectional context. Image patches are not naturally ordered. Parallel decoding lets every position attend to every other from round 1; autoregression has to scan a fixed order.
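The editing point can be sketched in the same style: re-mask a region, then run a few rounds that only ever write into that region. As before, `toy_logits` is a random stand-in for a trained model, `MASK = -1` is a hypothetical sentinel, and the function name is made up for this example. For brevity the fill is greedy (argmax token, with its logit as the confidence) rather than sampled.

```python
import math
import numpy as np

MASK = -1  # hypothetical sentinel id for [MASK]

def toy_logits(z: np.ndarray, vocab_size: int, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for a trained network; a real model would condition on the visible tokens."""
    return rng.normal(size=(len(z), vocab_size))

def regenerate_region(z: np.ndarray, region: np.ndarray, vocab_size: int,
                      n_rounds: int = 4, seed: int = 0) -> np.ndarray:
    """Re-mask the positions in `region` and refill them; all other tokens stay fixed."""
    rng = np.random.default_rng(seed)
    z = z.copy()
    z[region] = MASK
    n_edit = int((z == MASK).sum())
    for r in range(n_rounds):
        logits = toy_logits(z, vocab_size, rng)
        pred = logits.argmax(axis=-1)                  # greedy guess per position
        conf = logits.max(axis=-1)                     # confidence of that same guess
        open_pos = np.flatnonzero(z == MASK)           # only masked cells may change
        # cosine schedule applied to the edited region only
        n_left = math.floor(n_edit * math.cos((r + 1) / n_rounds * math.pi / 2))
        n_commit = len(open_pos) - n_left
        best = open_pos[np.argsort(-conf[open_pos])[:n_commit]]
        z[best] = pred[best]
    return z

image_tokens = np.arange(64) % 7                       # stand-in for an 8x8 token grid
edited = regenerate_region(image_tokens, region=np.arange(18, 30), vocab_size=7)
```

Tokens outside `region` are never touched, which is the property that makes region editing cheap for a parallel decoder.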

The text branch of these models usually stays autoregressive (that’s where chain-of-thought lives — lesson 14). The image branch flips to parallel. A single transformer can do both: the only thing that changes is whether you decode left-to-right or all-at-once.

Trade-offs in summary

Axis	Autoregressive	MaskGIT / parallel	Continuous diffusion
Forward passes per image	N (= seq length)	~10–20	~20–50 (flow matching) or ~1000 (DDPM)
Editing	hard	trivial (remask region)	inpainting via masked sampling
Bidirectional context	no (causal mask)	yes (no causal mask in attention)	n/a (operates on whole image at once)
Reasoning + text in same model	natural (just one LM)	natural (same LM, parallel-decoded image span)	requires separate conditioning interface (lesson 15)
Token-level loss	cross-entropy per position	cross-entropy on masked positions	MSE (continuous target)
Sample quality (image, matched compute)	highest among token-based decoders	close to AR at far fewer passes	highest in pixel space; competitive in latent space

Punchline

In its masked form, discrete diffusion is “BERT-style masked prediction with a variable mask rate at training, plus an iterative top-K-confident unmask loop at sampling.” It runs in ~10 forwards instead of N, gives you bidirectional context, and supports region editing by remasking. That makes it a natural sampling backbone for token-based image generators such as Muse and MAGVIT-v2.