DiT — transformer denoiser

Patch the image, run attention between patches, condition on time via adaLN-Zero. Why this is the modern default for both DDPM and FM.

The minimal moves

A diffusion / flow-matching model is any function f_θ(x, t) whose output has the same shape as x. For 2D points the MLP from lessons 2–6 is enough. For images, we want a function class that:

Treats the image as a sequence of patches (token structure that attention can mix), not raw pixels.
Lets any patch attend to any other patch in a single layer — far-flung structure (e.g. circles spanning the whole image) shouldn’t require a deep stack of convs to integrate.
Conditions on t in a way that’s strong enough to modulate every layer, but cheap and stable to train.

DiT (Peebles & Xie 2023) does this with three components: a patch embedding, an LLM-style transformer body with bidirectional attention, and the adaLN-Zero conditioning trick. diffusion_transformer.py is a faithful, pedagogically stripped-down implementation.

Shape flow at a glance
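As a rough trace of the shapes, the sketch below pushes a batch through a patch embedding, a stand-in body, and an output head followed by an unpatchify step. It assumes the sizes this lesson uses for the toy (single-channel 16×16 images, p = 4, d = 64). The names `body` and `head` and the unpatchify layout are illustrative assumptions, not taken from diffusion_transformer.py.

```python
import torch
import torch.nn as nn

B, C, H, W = 8, 1, 16, 16   # batch of single-channel 16x16 images
p, d = 4, 64                # patch size and embedding dim
g = H // p                  # patches per side (square image assumed)
N = g * g                   # 16 tokens per image

x = torch.randn(B, C, H, W)

# Patch embedding: (B, C, H, W) -> (B, d, H/p, W/p) -> (B, N, d)
patch_embed = nn.Conv2d(C, d, kernel_size=p, stride=p)
tokens = patch_embed(x).flatten(2).transpose(1, 2)

# Stand-in for the transformer body: any map that keeps (B, N, d).
body = nn.Identity()
tokens_out = body(tokens)

# Hypothetical output head: project each token back to p*p*C pixel values.
head = nn.Linear(d, p * p * C)
patches = head(tokens_out)                 # (B, N, p*p*C)

# Unpatchify: (B, N, p*p*C) -> (B, C, H, W)
out = patches.reshape(B, g, g, p, p, C)    # (B, grid_h, grid_w, p_h, p_w, C)
out = out.permute(0, 5, 1, 3, 2, 4)        # (B, C, grid_h, p_h, grid_w, p_w)
out = out.reshape(B, C, H, W)

print(tokens.shape, patches.shape, out.shape)
```

The only requirement the diffusion wrapper places on this pipeline is that `out` has the same shape as `x`.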

Patch embedding — three steps fused into one GEMM

Conceptually, patch embedding is:

Reshape (B, C, H, W) into non-overlapping patches of size p × p: (B, C, H/p, p, W/p, p).
Flatten each patch to a vector of size p²·C: (B, N, p²·C) where N = (H/p)·(W/p), which is (H/p)² for square images.
Apply a Linear from p²·C to d: (B, N, d).

A Conv2d with kernel_size = stride = p does all three in one GEMM (this is PatchEmbed). The result is mathematically identical to the three-step version and typically more efficient on GPU.

For img_size=16, patch_size=4 (this repo’s defaults): N = (16/4)² = 16 tokens per image, embedding dim d = 64. Sixteen tokens is exactly enough for attention to do something interesting without making the toy slow.
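The equivalence between the strided convolution and the explicit reshape, flatten, Linear route can be checked numerically. The following is a minimal sketch, assuming PyTorch and a single input channel; the variable names are illustrative and the Linear layer simply reuses the conv's weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, C, H, W, p, d = 2, 1, 16, 16, 4, 64
x = torch.randn(B, C, H, W)

conv = nn.Conv2d(C, d, kernel_size=p, stride=p)

# Route 1: strided conv, then flatten the patch grid into a token axis.
tokens_conv = conv(x).flatten(2).transpose(1, 2)               # (B, N, d)

# Route 2: explicit reshape -> flatten -> Linear, sharing the conv's weights.
patches = x.reshape(B, C, H // p, p, W // p, p)                # step 1
patches = patches.permute(0, 2, 4, 1, 3, 5)                    # (B, H/p, W/p, C, p, p)
patches = patches.reshape(B, (H // p) * (W // p), C * p * p)   # step 2
linear = nn.Linear(C * p * p, d)
with torch.no_grad():
    # A conv kernel of shape (d, C, p, p) is a (d, C*p*p) matrix in disguise.
    linear.weight.copy_(conv.weight.reshape(d, C * p * p))
    linear.bias.copy_(conv.bias)
tokens_linear = linear(patches)                                # step 3

max_diff = (tokens_conv - tokens_linear).abs().max().item()
print(tokens_conv.shape, max_diff)
```

Any difference between the two routes should be at the level of floating-point rounding.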

Attention — bidirectional, no mask

Standard transformer self-attention, but no causal mask. Each patch attends to every other patch. For images this is the correct inductive bias: pixels at the top can correlate with pixels at the bottom, and the model shouldn’t have to learn that connection through three convs and a downsample first.
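To make "no causal mask" concrete, here is a single-head sketch in PyTorch. The random projection matrices stand in for learned weights and are purely illustrative; the causal variant is included only for contrast.

```python
import math
import torch

torch.manual_seed(0)
B, N, d = 1, 16, 64
tokens = torch.randn(B, N, d)

# Hypothetical random projections standing in for learned Q/K/V weights.
Wq, Wk, Wv = (torch.randn(d, d) / math.sqrt(d) for _ in range(3))
q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv

scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (B, N, N)
weights = scores.softmax(dim=-1)                  # bidirectional: no mask applied
out = weights @ v                                 # (B, N, d)

# For contrast: a causal mask removes every key that comes after the query.
causal = torch.tril(torch.ones(N, N, dtype=torch.bool))
causal_weights = scores.masked_fill(~causal, float("-inf")).softmax(dim=-1)
```

In the bidirectional case every entry of `weights` is strictly positive, so each patch draws on all 16 patches in a single layer. In the causal case the upper triangle is exactly zero.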

Attention sparsity — what does each patch look at?

Click a patch (orange highlight) to see a synthetic attention pattern, styled after what a trained DiT-style model might produce on a 16×16 circle image. Patches inside the circle attend to other in-circle patches; background patches attend to themselves and their neighbours.

The left grid is a 16-token image (a 4×4 grid of patches at p=4 on a 16×16 image). The right grid is the attention pattern from the selected token (orange box on the left). Brighter = more attention. This is illustrative, not from a real model; the point is to show the “every patch reaches everywhere” freedom.

Full attention matrix — every patch’s view at once

The previous widget shows one patch’s attention. Here’s the full N × N matrix: rows are query patches, columns are keys. Bright entry (i, j) means patch i looks at patch j. Click a column or row to highlight the corresponding patch on the image.

N × N attention matrix on the 16-patch image

Click any cell. The corresponding query patch (row) and key patch (column) light up on the 4×4 image. In this synthetic pattern the diagonal is always bright: every patch attends to itself most.


Synthetic: the pattern is built from a content-similarity + distance kernel, scaled by a sharpness knob. (This is not the conventional softmax temperature, which divides the logits; sharpness multiplies the distance penalty, so larger sharpness means more concentrated attention.) The pattern is softened with depth. Real DiT attention has rich layer-dependent structure; this widget’s point is the topology: every cell is non-zero, every patch can reach every other in a single layer.
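One way such a synthetic pattern could be generated is sketched below. This is a guess at the general shape of the kernel, not the widget's actual code: the function name, the per-patch scalar `content`, and the exact form of the logits are all assumptions, and the depth-dependent softening is left out.

```python
import torch

def synthetic_attention(content, sharpness=1.0, grid=4):
    """content: (N,) per-patch scalar, e.g. 1.0 inside the circle, 0.0 outside."""
    N = grid * grid
    rows = torch.arange(N) // grid
    cols = torch.arange(N) % grid
    coords = torch.stack([rows, cols], dim=-1).float()               # (N, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).norm(dim=-1)    # (N, N) grid distance
    similarity = -(content[:, None] - content[None, :]).abs()        # 0 when content matches
    logits = similarity - sharpness * dist    # sharpness multiplies the distance penalty
    return logits.softmax(dim=-1)             # each row sums to 1

content = torch.zeros(16)
content[[5, 6, 9, 10]] = 1.0                  # the four centre patches are "in the circle"
A1 = synthetic_attention(content, sharpness=1.0)
A4 = synthetic_attention(content, sharpness=4.0)
```

Under these assumptions the diagonal is the largest entry in every row, raising `sharpness` concentrates more mass on it, and no entry is ever exactly zero.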

adaLN-Zero — the conditioning trick

The hardest part of a denoiser/velocity-net is conditioning on t. Think of t as a single dial that says “how noisy is this input?”, and the whole network has to behave differently depending on where the dial sits. At t = 0 the input is basically clean, so the net should pass x through almost unchanged; at t = T the input is basically pure noise, so the net should aggressively predict structure. The puzzle: one scalar has to reach into every layer and retune it, without that being expensive or making training blow up.

Three reasonable approaches:

Approach	Cost	Issues
Concat (B, d) time embedding as extra token	+1 token per block	attention has to discover that the token is special; conditioning is “soft”
FiLM at every layer	2·d modulation values per layer (scale, shift)	works, but no principled init; deep stacks can be unstable
adaLN-Zero	6·d modulation values per block (scale, shift, gate for each of the two sub-layers)	each block starts as identity ⇒ stable training without warmup

The adaLN-Zero formula, for one DiT block:

x ← x + gate_i ⊙ f_i( (1 + scale_i) ⊙ LayerNorm(x) + shift_i )

where i ∈ {1, 2} indexes the two sub-layers: f₁ is attention, f₂ is the MLP, and (scale_i, shift_i, gate_i) come from a small MLP applied to the time embedding c.
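The formula maps onto code fairly directly. Below is a minimal sketch of one block, assuming PyTorch, a (B, d) time embedding c, and `nn.MultiheadAttention` as the attention sub-layer; the class and attribute names are illustrative and may differ from diffusion_transformer.py. The modulation layer keeps PyTorch's default initialization here.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, d=64, heads=4, mlp_ratio=4):
        super().__init__()
        # LayerNorm without its own affine params: scale and shift come from c instead.
        self.norm1 = nn.LayerNorm(d, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(d, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d)
        )
        # Small MLP on the time embedding -> six vectors of size d.
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(d, 6 * d))

    def forward(self, x, c):
        # x: (B, N, d) tokens, c: (B, d) time embedding
        mod = self.modulation(c).unsqueeze(1)          # (B, 1, 6d), broadcasts over tokens
        shift1, scale1, gate1, shift2, scale2, gate2 = mod.chunk(6, dim=-1)
        h = (1 + scale1) * self.norm1(x) + shift1      # i = 1: attention
        x = x + gate1 * self.attn(h, h, h, need_weights=False)[0]
        h = (1 + scale2) * self.norm2(x) + shift2      # i = 2: MLP
        x = x + gate2 * self.mlp(h)
        return x

block = DiTBlock()
x = torch.randn(2, 16, 64)
c = torch.randn(2, 64)
y = block(x, c)
```

Note that c never joins the token sequence: it only produces the six modulation vectors, which are broadcast across all N tokens.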

Intuition · linear unpacking

Claim: “condition by modulating normalization” just means the time signal sets three knobs — stretch, slide, and volume — on each block, instead of being mixed into the data itself.

LayerNorm wipes the slate. Before each sub-layer, LayerNorm rescales the tokens to a fixed, standardized shape. That standardized output is a clean surface to write a condition onto, because it has no leftover scale of its own to fight.
Three knobs per sub-layer. The time embedding c is turned into scale (stretch the normalized features), shift (slide them), and gate (turn the whole sub-layer’s output up or down). So “conditioning” is literally retuning the normalization, not adding an extra token for attention to hunt for.
Why this is the strong version. Because the knobs sit on every block’s normalization, the single scalar t gets a direct lever on every layer at once. It is also cheap: the lever is just three d-dimensional vectors per sub-layer.

Central point. The time signal never enters as data; it enters as the dials that tell each block how hard to push, so one scalar can steer the whole network without being something the network has to decode.

The “-Zero” trick

Initialize the modulation MLP’s output layer to zero. At step 0:

1 + scale = 1 ⇒ the LayerNorm output is not rescaled
shift = 0 ⇒ no additive change
gate = 0 ⇒ the residual contribution is multiplied by zero; the block degenerates to the identity

Every block starts as the identity, so the whole transformer body maps its input tokens to themselves at step 0. Each block learns how much to contribute over training, gated by its own gate. This has the same effect as careful residual-scale inits (Fixup, etc.) but is more interpretable.
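The identity-at-initialization property is easy to verify. The sketch below uses a single sub-layer with an MLP as a stand-in for f, assuming PyTorch; `to_mod` is a hypothetical name for the output layer of the modulation MLP.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
norm = nn.LayerNorm(d, elementwise_affine=False)
f = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))  # stand-in sub-layer
to_mod = nn.Linear(d, 3 * d)    # output layer of the modulation MLP (hypothetical name)

# The "-Zero" part: zero the modulation output layer.
nn.init.zeros_(to_mod.weight)
nn.init.zeros_(to_mod.bias)

def sublayer(x, c):
    scale, shift, gate = to_mod(c).unsqueeze(1).chunk(3, dim=-1)
    return x + gate * f((1 + scale) * norm(x) + shift)

x = torch.randn(2, 16, d)   # (B, N, d) tokens
c = torch.randn(2, d)       # (B, d) time embedding
y = sublayer(x, c)
is_identity = torch.equal(y, x)
```

Even though `f` has random weights and `c` is arbitrary, the output equals the input because the gate is exactly zero.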

Intuition · linear unpacking

Claim: zero-initializing the gate helps training because it lets a deep stack of blocks start as a do-nothing passthrough and switch itself on one safe step at a time.

A fresh block is noise. At step 0 its weights are random, so whatever it outputs is garbage. In a deep stack, garbage from each block compounds layer after layer, and the signal arriving at the top is wild — the usual cause of needing learning-rate warmup and delicate tuning.
Gate = 0 mutes the garbage. Multiplying each block’s output by zero means it contributes nothing at first. The residual path carries x straight through untouched, so the stack of blocks starts as the identity, which is automatically sane no matter how deep it is.
The gate opens gradually. The gate is a learned quantity (an output of the modulation MLP) that starts at zero. It only moves once gradients flowing back through the block show that using the block lowers the loss, so each block fades its influence in over training instead of being forced on from the start.

Central point. Starting every block muted turns “train a deep network from scratch” into “start from something that already works and let each block dial itself up,” which is why adaLN-Zero trains stably with no warmup. (The widget below animates exactly these gates rising from zero.)
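A small experiment shows which parameters receive gradient at step 0. This is a sketch under assumed names (`to_mod`, `f`) with an arbitrary regression target, not the repo's training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
norm = nn.LayerNorm(d, elementwise_affine=False)
f = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
to_mod = nn.Linear(d, 3 * d)
nn.init.zeros_(to_mod.weight)
nn.init.zeros_(to_mod.bias)

x = torch.randn(8, 16, d)
c = torch.randn(8, d)
target = torch.randn(8, 16, d)   # arbitrary target, for illustration only

scale, shift, gate = to_mod(c).unsqueeze(1).chunk(3, dim=-1)
y = x + gate * f((1 + scale) * norm(x) + shift)
loss = (y - target).pow(2).mean()
loss.backward()

# The bias gradient splits the same way as the output: [scale | shift | gate].
g_scale, g_shift, g_gate = to_mod.bias.grad.chunk(3)
f_grad_norm = sum(p.grad.norm() for p in f.parameters()).item()
print(g_gate.norm().item(), g_scale.norm().item(), f_grad_norm)
```

With the gate at exactly zero, only the gate receives a non-zero gradient on the first step. The sub-layer's own weights, and the scale and shift, see zero gradient until the gate has moved, which is one way to read "each block dials itself up".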

Interactive · watch the adaLN-Zero gates open

Below is a tiny simulator of an adaLN-Zero training trajectory. We don’t train a real network; we just animate gate magnitudes growing from zero as training progresses, and show what the predicted output looks like (identity at step 0, eventually some shaped output).

What changes between MLP and DiT

Axis	MLP (2D toy)	DiT (image)
Net class	MLPDenoiser / MLPVelocity	DiT
Input/output shape	(B, 2)	(B, 1, 16, 16)
Time conditioning	concat	adaLN-Zero
Parameter count (this repo)	~25k	~250k (d=64, depth=4)
DDPM / FM wrapper	unchanged	unchanged

Crucially, the wrapper (DDPM, FlowMatching) is shape-agnostic. The same DDPM.loss(x) and DDPM.sample(n, shape=…) work for both 2D and image data. q_sample uses while ab.dim() < x0.dim(): ab = ab.unsqueeze(-1) to broadcast the schedule scalar across whatever trailing dimensions the data has. That is the entire mechanism that lets the 2D and image cases share the same code.
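The broadcasting loop can be seen in isolation. The following sketch assumes a standard DDPM forward process with a linear beta schedule; the `q_sample` signature and the schedule values are assumptions for illustration and may not match the repo.

```python
import torch

def q_sample(x0, t, alpha_bar, noise):
    """Hypothetical signature. x0: (B, ...), t: (B,) integer steps, alpha_bar: (T,)."""
    ab = alpha_bar[t]                      # (B,)
    while ab.dim() < x0.dim():             # (B,) -> (B, 1) or (B, 1, 1, 1)
        ab = ab.unsqueeze(-1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

T = 100
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear schedule
alpha_bar = torch.cumprod(1 - betas, dim=0)
t = torch.randint(0, T, (8,))

points = torch.randn(8, 2)                 # 2D toy data
images = torch.randn(8, 1, 16, 16)         # image data
xt_points = q_sample(points, t, alpha_bar, torch.randn_like(points))
xt_images = q_sample(images, t, alpha_bar, torch.randn_like(images))
```

The same function handles both inputs because the loop keeps appending singleton dimensions until the schedule value lines up with the data's rank.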

Why DiT vs. UNet?

	UNet	DiT
Strong locality bias?	yes (convs)	no (attention is content-based, not spatial)
Long-range receptive field?	through downsample + conv stack	one attention layer
Conditioning style	FiLM, cross-attn, ad-hoc	adaLN-Zero (uniform across blocks)
Scales with (depth, width)?	middling — bottleneck dominates	clean LLM-like curves
Best when	data is small or strongly spatially local	data is large, structure is global, you want one architecture across modalities

Concrete recent evidence: many of the largest image and video generators released since 2023 (Stable Diffusion 3, Flux, Sora, Veo) are reported to use a DiT or DiT-derived backbone. UNets remain competitive at small scale and for very local denoising (e.g. very low-resolution toys with limited compute).

Punchline

DiT is “ε_θ / v_θ as a vision transformer.” PatchEmbed turns image into tokens; bidirectional attention mixes them globally; adaLN-Zero conditions on time at every block with a clean identity init. The loss machinery is unchanged. That clean factorization — architecture-independent objective, objective-independent architecture — is the takeaway.