Practical knobs & trade-offs
A tour of the design decisions in the four .py files: what was picked, what alternatives exist, when you would change them.
The decision tree, abbreviated
The β schedule: linear vs. cosine
DDPM’s “default” in this repo: linear β_t ∈ [10⁻⁴, 0.02] over T = 1000. The cosine schedule (Nichol & Dhariwal 2021) is ᾱ_t = f(t) / f(0), with f(t) = cos²(((t/T + s) / (1 + s)) · π/2) and s = 0.008.
Why cosine wins on real images: linear’s ᾱ_T collapses to ≈ 4·10⁻⁵, so on a 32×32 image (D = 3072) x_T retains effectively zero signal relative to the noise norm. The chain reaches that state well before t = T, so many late timesteps sit in “pure noise” territory where the model has nothing to learn. Cosine keeps signal alive longer.
For the 2D toy in this repo (D = 2), linear is fine because the data’s norm and the noise’s norm naturally stay comparable.
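To make the comparison concrete, here is a minimal sketch that tabulates ᾱ_t for both schedules. It assumes the linear endpoints quoted above and the cosine form from Nichol & Dhariwal with offset s = 0.008; the function names are illustrative and are not taken from the repo.

```python
import math

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Running product of (1 - beta_t) for a linear beta schedule."""
    abar, out = 1.0, []
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        abar *= 1.0 - beta
        out.append(abar)
    return out

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

lin_ab = linear_alpha_bar()
cos_ab = cosine_alpha_bar()

# Compare how much signal survives halfway through the chain and at the end.
for name, ab in (("linear", lin_ab), ("cosine", cos_ab)):
    mid = ab[len(ab) // 2 - 1]  # entry for t = T/2
    print(f"{name}: alpha_bar(T/2) = {mid:.3f}, alpha_bar(T) = {ab[-1]:.1e}")
```

Practical implementations usually also clip the per-step β derived from the cosine ᾱ (for example at 0.999) so the last step stays finite; that detail is omitted here.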
What does the net predict? ε vs. v vs. x₀
| Target | Used in | Pros | Cons |
|---|---|---|---|
| ε (noise) | DDPM (this repo) | scale-stable across t; standard | at very small t (low noise), prediction noisy — tiny ε to recover |
| v = α̇ x₀ + σ̇ ε (velocity) | Stable Diffusion 2, EDM, FM-style models | balances ε and x₀ across t; better-behaved at endpoints | derivation more involved; need ad-hoc weighting |
| x₀ (clean data) | some EDM variants, consistency models | directly interpretable; useful for consistency training | at large t scale of target is huge; needs explicit reweighting |
The trio is interconvertible: given any one of (ε, v, x₀) and the schedule, you can compute the other two. The choice is a numerical-stability + weighting story, not a fundamental one.
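The conversions can be sketched in a few lines. This assumes a variance-preserving path x_t = α x₀ + σ ε with α² + σ² = 1 and the Salimans & Ho convention v = α ε − σ x₀; other conventions differ by sign or scale. The helper names are hypothetical, not the repo’s.

```python
import math

def from_eps(x_t, eps, alpha, sigma):
    """Given x_t and an eps prediction, recover (x0, v)."""
    x0 = (x_t - sigma * eps) / alpha
    v = alpha * eps - sigma * x0
    return x0, v

def from_v(x_t, v, alpha, sigma):
    """Given x_t and a v prediction, recover (x0, eps). Relies on alpha^2 + sigma^2 = 1."""
    x0 = alpha * x_t - sigma * v
    eps = sigma * x_t + alpha * v
    return x0, eps

# Round trip on scalars: build x_t from known x0 and eps, then convert back and forth.
alpha, sigma = math.cos(0.3), math.sin(0.3)
x0_true, eps_true = 1.5, -0.7
x_t = alpha * x0_true + sigma * eps_true

x0_a, v_a = from_eps(x_t, eps_true, alpha, sigma)
x0_b, eps_b = from_v(x_t, v_a, alpha, sigma)
print(x0_a, x0_b, eps_b)
```

The division by α in from_eps is where ε-prediction becomes ill-conditioned near pure noise (α → 0), which is one reason the choice of target matters numerically.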
Why t_scale = 1000 for continuous time
This shows up in two places, and the repo’s README spends a paragraph on it.
The sinusoidal time embedding (SinusoidalTimeEmbedding) uses frequencies log-spaced from 1 to 1/10000. For DDPM’s integer t ∈ [0, 1000), the embedding is sin(freq · t) (no 2π factor — matches diffusion.py’s args.sin()) with freq covering several decades — different entries oscillate at different rates across the chain.
For flow matching t ∈ [0, 1], those same frequencies produce almost-linear embeddings for any one t (since sin(freq · t) ≈ freq · t for small arguments). The whole embedding collapses to ~rank 2 in t, wasting most of its dimensions.
The fix: rescale t by 1000 before the embedding so it covers the same range diffusion saw: te = self.time(t * 1000.0). The same applies to DiT in FM mode via t_scale=1000.
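A rough way to see the effect is to count how many embedding entries change noticeably between two times. The sketch below assumes a 64-dimensional sin/cos embedding with log-spaced frequencies and no 2π factor; the exact layout of SinusoidalTimeEmbedding in the repo may differ, and responsive_dims is a hypothetical helper.

```python
import math

def sinusoidal_embedding(t, dim=64, max_period=10000.0):
    """sin/cos features with frequencies log-spaced from 1 down toward 1/max_period."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(f * t) for f in freqs] + [math.cos(f * t) for f in freqs]

def responsive_dims(t_a, t_b, threshold=0.1):
    """Number of embedding entries that differ by more than `threshold` between two times."""
    emb_a, emb_b = sinusoidal_embedding(t_a), sinusoidal_embedding(t_b)
    return sum(abs(a - b) > threshold for a, b in zip(emb_a, emb_b))

# Same pair of times, first as raw flow-matching t in [0, 1], then rescaled by 1000.
raw = responsive_dims(0.25, 0.75)
scaled = responsive_dims(0.25 * 1000.0, 0.75 * 1000.0)
print(f"entries that move: raw t = {raw}, t * 1000 = {scaled}")
```

The threshold of 0.1 is arbitrary; the point is only that far more entries respond to t once it is rescaled.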
How many sampling steps do I actually need?
| Quality target | DDPM | DDIM | FM-Euler | FM-RK4 |
|---|---|---|---|---|
| indistinguishable from many-step | ~1000 | 50–100 | 50 | 20 (×4 net calls = 80) |
| visibly good | ~250 | 20 | 20 | 10 (×4 = 40) |
| recognizable but rough | ~50 | 10 | 5–10 | 4 (×4 = 16) |
These are rules of thumb on diverse image datasets at 32×32 to 256×256. The 2D toy in this repo converges visibly to two-moons-like clouds at K = 5 with FM-Euler; DDPM needs ~200 even with the oracle.
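The gap between the FM-Euler and FM-RK4 columns comes from integrator order. The sketch below compares the two on a toy ODE with a closed-form solution; the velocity function is a stand-in for the trained network, chosen only so the error can be measured exactly, and it is not the repo’s FlowMatching.sample.

```python
import math

def velocity(x, t):
    # Stand-in for v_theta(x, t). dx/dt = -x has the exact solution x(1) = x(0) * exp(-1).
    return -x

def euler(x, K):
    """K Euler steps from t = 0 to t = 1 (one velocity call per step)."""
    dt = 1.0 / K
    for k in range(K):
        x = x + dt * velocity(x, k * dt)
    return x

def rk4(x, K):
    """K classical Runge-Kutta steps from t = 0 to t = 1 (four velocity calls per step)."""
    dt = 1.0 / K
    for k in range(K):
        t = k * dt
        k1 = velocity(x, t)
        k2 = velocity(x + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = velocity(x + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = velocity(x + dt * k3, t + dt)
        x = x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return x

exact = math.exp(-1.0)
for K in (5, 10, 20):
    print(K, abs(euler(1.0, K) - exact), abs(rk4(1.0, K) - exact))
```

On a real model the velocity field is less smooth than this toy, so the advantage of RK4 per network call is smaller than the printed errors suggest.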
Figure (interactive): K-budget grid, same task at different step counts. The same oracle DDPM sampler runs at five different K (= number of denoising steps), all from the same starting noise and toward the same target. Each panel is the final sample cloud; reading left to right shows how quality recovers as you spend more sampling compute.
Classifier-free guidance (CFG) in five lines
Standard recipe (Ho & Salimans 2022) for conditional generation:
```python
import random

# Training: replace the class label with the null token 10% of the time
if random.random() < 0.10:
    cls = NULL

# Sampling, at each step: one conditional and one unconditional forward pass
eps_cond = model(x, t, cls)
eps_uncond = model(x, t, NULL)
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)  # w depends on task
```
That’s the entire trick. At w = 1, you get standard conditional sampling. w > 1 pushes samples toward the class manifold (sharper class adherence at the cost of diversity). Sweet spots reported in the literature: w ≈ 1.5–4 for class-conditional ImageNet (DiT, ADM), w ≈ 5–9 for text-to-image (Imagen, SD). For FM, replace eps with v — same line. The repo doesn’t implement it; it’s a 5-line extension.
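For a version that runs end to end, here is a small sketch with a toy stand-in for the network. The names NULL, toy_model, maybe_drop_label and guided_eps are all hypothetical; a real model would take tensors and a learned null-class embedding.

```python
import random

NULL = -1  # hypothetical "no label" token

def toy_model(x, t, cls):
    """Stand-in for the trained network: a fake noise prediction that depends on the label."""
    pull = 0.0 if cls == NULL else float(cls)
    return x - pull * (1.0 - t)

def maybe_drop_label(cls, rng, p_drop=0.10):
    """Training-time label dropout: return NULL with probability p_drop."""
    return NULL if rng.random() < p_drop else cls

def guided_eps(model, x, t, cls, w):
    """Classifier-free guidance mix: w = 0 is unconditional, w = 1 is plain conditional."""
    eps_cond = model(x, t, cls)
    eps_uncond = model(x, t, NULL)
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = random.Random(0)
dropped = sum(maybe_drop_label(3, rng) == NULL for _ in range(10000))
print("fraction of labels dropped:", dropped / 10000)

for w in (0.0, 1.0, 3.0):
    print(w, guided_eps(toy_model, x=0.5, t=0.2, cls=3, w=w))
```

Each guided step costs two forward passes; implementations often batch the conditional and unconditional inputs together to keep that to a single call.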
The full extension menu
| Extension | LOC | What you change | Why bother |
|---|---|---|---|
| RK4 sampler | ~10 | FlowMatching.sample | O(dt⁴) vs O(dt); free quality at fixed K |
| Cosine β | ~5 | buffer init in DDPM.__init__ | delays ᾱ collapse on images |
| VP flow matching | ~8 | FlowMatching.loss path | recovers diffusion as FM exercise |
| Class conditioning | ~8 | extra embedding, concat in MLP / add to c in DiT | conditional generation |
| CFG | ~5 | 10% label drop at train; mix at sample | sharper conditional samples |
| FlashAttention | ~1 | F.scaled_dot_product_attention | O(N) memory vs O(N²) explicit attention |
| EMA on weights | ~10 | maintain decay-averaged params, use for sampling | sample quality 1–3 FID points better |
| Latent diffusion | ~80 | train VAE → diffuse in latent space | 10× faster training at same quality |
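As an illustration of the EMA row, the update is one line per parameter. This sketch uses plain dicts of floats rather than model tensors, and the decay value is an assumption; in a training loop the same update would run over the model’s parameters after each optimizer step, with the averaged copy used only for sampling.

```python
def ema_update(ema_params, params, decay=0.999):
    """In place: ema <- decay * ema + (1 - decay) * current, for every named parameter."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

# Toy "training": the raw weight jitters around 1.0, the EMA copy follows it smoothly.
params = {"w": 0.0}
ema = dict(params)  # the EMA copy starts from the initial weights
for step in range(5000):
    params["w"] = 1.0 + (0.1 if step % 2 == 0 else -0.1)  # noisy raw weight
    ema_update(ema, params)

print("raw:", params["w"], "ema:", ema["w"])
```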
Common gotchas, expanded
data.py standardizes for this reason.
See the t_scale = 1000 section above. Without it, ~half the embedding dimensions are wasted and the model converges more slowly or not at all.
If you keep the learnable nn.LayerNorm γ/β, they fight with the modulation MLP’s (scale, shift). Two knobs for the same job → instability. DiTBlock sets elementwise_affine=False on both norms.
A self-test menu
If you can answer these from memory, you have the material:
- Write down Eq. ⋆. Why does the closed-form marginal exist?
- Why is ε the “right” thing for the net to predict, vs. x₀ or μ?
- Derive the CFM loss. What is the trick that makes regressing against the conditional target legal?
- How does DDPM’s sampling cost scale, and why does FM cost less?
- State the adaLN-Zero formula. Why “-Zero”?
- What single line in diffusion_transformer.py would you change to use FlashAttention?
- What happens if you forget to standardize data.py’s output? Trace the bug forward to either method.
- Convert a trained ε-predictor into a v-predictor (under the VP path).