Practical knobs & trade-offs

A tour of the design decisions in the four .py files: what was picked, what alternatives exist, when you would change them.

The decision tree, abbreviated

data domain?
    low-D (≤32)     → MLP, either DDPM or FM
    image / token   → DiT or UNet
    latent (VAE)    → DiT in latent space
step budget?        → pick objective + sampler

The β schedule: linear vs. cosine

DDPM’s “default” in this repo: linear βt ∈ [10⁻⁴, 0.02] over T = 1000. The cosine schedule (Nichol & Dhariwal 2021) instead defines the cumulative signal level ᾱt directly:

ᾱt = f(t) / f(0),   f(t) = cos²((t/T + s) / (1 + s) · π/2),   s = 0.008

Why cosine wins on real images: linear’s ᾱT collapses to ≈ 4·10⁻⁵ on a 32×32 image (D = 3072), meaning xT retains effectively zero signal relative to the noise, and ᾱt is already tiny well before t = T. You waste many timesteps in “pure noise” territory where the model has nothing to learn. Cosine keeps signal alive longer.

For the 2D toy in this repo (D = 2), linear is fine because the data’s norm and the noise’s norm naturally stay comparable.
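To make the comparison concrete, here is a minimal sketch that tabulates ᾱt for both schedules using only the standard library. The function names are illustrative and not taken from the repo; the cosine version evaluates the closed form directly and skips the clipping of βt that practical implementations apply near t = T.

```python
import math

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i) for linearly spaced beta."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

linear, cosine = linear_alpha_bar(), cosine_alpha_bar()
for frac in (0.25, 0.5, 0.75, 1.0):
    i = int(frac * 1000) - 1  # index of timestep t = frac * T
    print(f"t/T = {frac:.2f}   linear = {linear[i]:.2e}   cosine = {cosine[i]:.2e}")
```

Reading the printed rows side by side shows where each schedule spends its timesteps: the linear column drops toward zero much earlier in the chain than the cosine column.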

What does the net predict? ε vs. v vs. x₀

| Target | Used in | Pros | Cons |
| --- | --- | --- | --- |
| ε (noise) | DDPM (this repo) | scale-stable across t; standard | at very small t (low noise) the prediction is noisy: only a tiny ε to recover |
| v = α̇ x₀ + σ̇ ε (velocity) | Stable Diffusion 2, EDM, FM-style models | balances ε and x₀ across t; better-behaved at endpoints | derivation more involved; needs ad-hoc weighting |
| x₀ (clean data) | some EDM variants, consistency models | directly interpretable; useful for consistency training | at large t the input is almost pure noise, so the target is barely determined by xt; needs explicit reweighting |

The trio is interconvertible: given any one of (ε, v, x₀) and the schedule, you can compute the other two. The choice is a numerical-stability + weighting story, not a fundamental one.
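The conversions can be written out in a few lines. This is a sketch under one specific assumption: a variance-preserving path xt = α x₀ + σ ε with α² + σ² = 1, and the v-prediction convention v = α ε − σ x₀. The helper names are hypothetical; other paths or conventions change the formulas.

```python
import math

def eps_to_x0(x_t, eps, alpha, sigma):
    # Invert x_t = alpha * x0 + sigma * eps for x0.
    return (x_t - sigma * eps) / alpha

def eps_to_v(x_t, eps, alpha, sigma):
    # v = alpha * eps - sigma * x0 (assumed VP convention).
    return alpha * eps - sigma * eps_to_x0(x_t, eps, alpha, sigma)

def v_to_x0(x_t, v, alpha, sigma):
    # Uses alpha^2 + sigma^2 = 1.
    return alpha * x_t - sigma * v

def v_to_eps(x_t, v, alpha, sigma):
    # Uses alpha^2 + sigma^2 = 1.
    return sigma * x_t + alpha * v

# Round trip on a scalar example.
alpha_bar = 0.3
alpha, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
x0, eps = 1.7, -0.4
x_t = alpha * x0 + sigma * eps
v = eps_to_v(x_t, eps, alpha, sigma)
print(v_to_x0(x_t, v, alpha, sigma), v_to_eps(x_t, v, alpha, sigma))
```

Note the division by α in eps_to_x0: as α → 0 (large t) a small error in ε becomes a large error in x₀, which is one face of the numerical-stability story.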

Why t_scale = 1000 for continuous time

This shows up in two places (and the repo’s README spends a paragraph on it — here’s the picture).

The sinusoidal time embedding (SinusoidalTimeEmbedding) uses frequencies log-spaced from 1 to 1/10000. For DDPM’s integer t ∈ [0, 1000), the embedding is sin(freq · t) (no 2π factor — matches diffusion.py’s args.sin()) with freq covering several decades — different entries oscillate at different rates across the chain.

For flow matching t ∈ [0, 1], those same frequencies produce almost-linear embeddings for any one t (since sin(freq · t) ≈ freq · t for small arguments). The whole embedding collapses to ~rank 2 in t, wasting most of its dimensions.

The fix: rescale t by 1000 before the embedding so it covers the same range diffusion saw. te = self.time(t * 1000.0). Same for DiT in FM mode via t_scale=1000.
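A small standard-library sketch of the effect, assuming a 32-dim embedding laid out as sines followed by cosines with frequencies 10000^(−i/16). That layout is an assumption for illustration and may differ in detail from the repo's SinusoidalTimeEmbedding. The helper full_cycle_count reports how many frequencies complete at least one full period as t sweeps its range.

```python
import math

def frequencies(dim=32, max_period=10000.0):
    # Log-spaced from 1 down toward 1 / max_period.
    half = dim // 2
    return [math.exp(-math.log(max_period) * i / half) for i in range(half)]

def sinusoidal_embedding(t, dim=32, max_period=10000.0):
    # sin(freq * t) and cos(freq * t), no 2*pi factor.
    args = [f * t for f in frequencies(dim, max_period)]
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]

def full_cycle_count(t_max, dim=32, max_period=10000.0):
    """Number of frequencies that complete at least one period over [0, t_max]."""
    return sum(1 for f in frequencies(dim, max_period) if f * t_max >= 2 * math.pi)

t = 0.5                                       # a flow-matching time in [0, 1]
raw = sinusoidal_embedding(t)                 # without rescale
scaled = sinusoidal_embedding(t * 1000.0)     # with t_scale = 1000
print("full cycles, t in [0, 1]:   ", full_cycle_count(1.0))
print("full cycles, t in [0, 1000]:", full_cycle_count(1000.0))
```

Without the rescale no entry completes a full oscillation over the whole time range; with it, the higher frequencies do.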

Sinusoidal time embedding — with and without rescale
Each column is the 32-dim sinusoidal embedding at one of 60 evenly-spaced t; rows are the 32 embedding entries. Top: t ∈ [0, 1] without rescale (almost-linear, low rank). Bottom: t · 1000 (uses the full frequency ladder).

How many sampling steps do I actually need?

| Quality target | DDPM | DDIM | FM-Euler | FM-RK4 |
| --- | --- | --- | --- | --- |
| indistinguishable from many-step | ~1000 | 50–100 | 50 | 20 (×4 net calls = 80) |
| visibly good | ~250 | 20 | 20 | 10 (×4 = 40) |
| recognizable but rough | ~50 | 10 | 5–10 | 4 (×4 = 16) |

These are rules of thumb on diverse image datasets at 32×32 to 256×256. The 2D toy in this repo converges visibly to two-moons-like clouds at K = 5 with FM-Euler; DDPM needs ~200 even with the oracle.
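The "×4 net calls" bookkeeping is easiest to see in code. The sketch below integrates from t = 0 to t = 1 with Euler and with classical RK4 and counts velocity evaluations. The velocity field is a stand-in with a known solution (dx/dt = −2x), not a trained network, and the function names are illustrative.

```python
import math

def euler_sample(v, x, K):
    # K steps, one velocity evaluation per step.
    dt = 1.0 / K
    for k in range(K):
        x = x + dt * v(x, k * dt)
    return x

def rk4_sample(v, x, K):
    # K steps, four velocity evaluations per step.
    dt = 1.0 / K
    for k in range(K):
        t = k * dt
        k1 = v(x, t)
        k2 = v(x + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = v(x + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = v(x + dt * k3, t + dt)
        x = x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return x

class CountingField:
    """Stand-in for the network: dx/dt = -2x, so x(1) = x(0) * exp(-2)."""
    def __init__(self):
        self.calls = 0
    def __call__(self, x, t):
        self.calls += 1
        return -2.0 * x

exact = math.exp(-2.0)
f_euler, f_rk4 = CountingField(), CountingField()
x_euler = euler_sample(f_euler, 1.0, K=16)   # 16 net calls
x_rk4 = rk4_sample(f_rk4, 1.0, K=4)          # 4 steps x 4 = 16 net calls
print(f_euler.calls, abs(x_euler - exact))
print(f_rk4.calls, abs(x_rk4 - exact))
```

Both runs spend the same number of evaluations, so the error columns are a like-for-like comparison on this toy field. How that carries over to a learned velocity field depends on how smooth the field is.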

K-budget grid: same task, different step counts

How much does K matter? Below: the same oracle DDPM sampler runs at five different K (= number of denoising steps), all on the same starting noise and the same target. Each panel is the final sample cloud; reading left to right shows how quality recovers as you spend more sampling compute.

Final samples at K ∈ {3, 10, 30, 100, 400}
Tiny K shreds the modes; ~30 starts to look right; ~400 is indistinguishable from full ancestral sampling. The cost is linear in K (forward passes).

Five panels, each one independent run. With σ = 0 (deterministic), the samples at low K cluster on identifiable failure modes (the integrator under-runs the curved path). With σ = β_t the failures look noisier — you can tell which one is real.

Classifier-free guidance (CFG) in five lines

Standard recipe (Ho & Salimans 2022) for conditional generation:

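import random

# Training: replace the class label with the null token 10% of the time
if random.random() < 0.10:
    cls = NULL   # NULL: reserved "no label" index

# Sampling, at each step: two forward passes, then extrapolate
eps_cond = model(x, t, cls)
eps_uncond = model(x, t, NULL)
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)   # w: guidance scale, depends on task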

That’s the entire trick. At w = 1, you get standard conditional sampling. w > 1 pushes samples toward the class manifold (sharper class adherence at the cost of diversity). Sweet spots reported in the literature: w ≈ 1.5–4 for class-conditional ImageNet (DiT, ADM), w ≈ 5–9 for text-to-image (Imagen, SD). For FM, replace eps with v — same line. The repo doesn’t implement it; it’s a 5-line extension.

The full extension menu

| Extension | LOC | What you change | Why bother |
| --- | --- | --- | --- |
| RK4 sampler | ~10 | FlowMatching.sample | O(dt⁴) vs O(dt); free quality at fixed K |
| Cosine β | ~5 | buffer init in DDPM.__init__ | delays ᾱ collapse on images |
| VP flow matching | ~8 | FlowMatching.loss path | recovers diffusion as FM exercise |
| Class conditioning | ~8 | extra embedding, concat in MLP / add to c in DiT | conditional generation |
| CFG | ~5 | 10% label drop at train; mix at sample | sharper conditional samples |
| FlashAttention | ~1 | F.scaled_dot_product_attention | O(N) memory vs O(N²) explicit attention |
| EMA on weights | ~10 | maintain decay-averaged params, use for sampling | sample quality 1–3 FID points better |
| Latent diffusion | ~80 | train VAE → diffuse in latent space | 10× faster training at same quality |
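As one example from the menu, the EMA row amounts to the update below. This is a framework-free sketch on a plain dict of floats; the name ema_update and the decay value are illustrative, and in the repo the same update would run over the model's parameter tensors after each optimizer step.

```python
def ema_update(ema_params, params, decay=0.999):
    """In-place update: ema <- decay * ema + (1 - decay) * current."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

params = {"w": 0.0}
ema = dict(params)   # EMA copy starts from the initial weights
for step in range(1, 2001):
    # Noisy stand-in for training iterates that jitter around w = 1.
    params["w"] = 1.0 + (0.5 if step % 2 else -0.5)
    ema_update(ema, params, decay=0.99)
print(params["w"], ema["w"])
```

The raw weight ends on one of its jittered values while the averaged copy sits close to the value the iterates oscillate around, which is why the averaged copy is the one used for sampling.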

Common gotchas, expanded

1. Not centering the data
Prior is N(0, I). If pdata has mean 5 and std 0.1, either the noise scale is wrong (DDPM: noise dominates at low t) or the path endpoints don’t match (FM). data.py standardizes for this reason.
2. Sampling from the wrong endpoint in FM
Training goes x₀ → x₁ (prior to data). Sampling starts at x₀ ∼ N(0, I) and integrates forward in t, from 0 to 1. Reverse-time confusion is the most common bug. Mnemonic: start where you sample from the prior.
3. Sinusoidal embedding range mismatch
See the t_scale = 1000 section above. Without it, most of the embedding dimensions are wasted and the model converges more slowly or not at all.
4. Forgetting elementwise_affine=False in adaLN’s LayerNorm
If you keep the default nn.LayerNorm γ/β, they fight with the modulation MLP’s (scale, shift). Two knobs for the same job → instability. DiTBlock sets elementwise_affine=False on both norms.
5. Variance-exploding pitfall
If you build a VE schedule (no shrink factor), the data and the prior need to share that scale. People often set up VE with a unit-variance prior; the schedule then crushes xt at large t into a numerically nasty regime. VP avoids this almost by construction.
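For gotcha 1, the fix is a per-dimension standardization fitted on the training set and inverted after sampling. The sketch below uses only the standard library on a synthetic cloud with mean 5 and std 0.1; the function names are hypothetical and this is not the code in data.py.

```python
import random
import statistics

def fit_standardizer(points):
    # Per-dimension mean and (population) standard deviation.
    dims = list(zip(*points))
    mean = [statistics.fmean(d) for d in dims]
    std = [statistics.pstdev(d) for d in dims]
    return mean, std

def standardize(points, mean, std):
    return [[(x - m) / s for x, m, s in zip(p, mean, std)] for p in points]

def unstandardize(points, mean, std):
    # Apply to generated samples to map them back to data units.
    return [[x * s + m for x, m, s in zip(p, mean, std)] for p in points]

rng = random.Random(0)
raw = [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(2000)]
mean, std = fit_standardizer(raw)
z = standardize(raw, mean, std)          # train DDPM / FM on this
restored = unstandardize(z, mean, std)   # same map, inverted
print(mean, std)
```

Keeping mean and std alongside the model matters: samples come out in standardized units and need the inverse map before they are compared with the original data.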

A self-test menu

If you can answer these from memory, you have the material:

  1. Write down Eq. ⋆. Why does the closed-form marginal exist?
  2. Why is ε the “right” thing for the net to predict, vs. x₀ or μ?
  3. Derive the CFM loss. What is the trick that makes regressing against the conditional target legal?
  4. How does DDPM’s sampling cost scale, and why does FM cost less?
  5. State the adaLN-Zero formula. Why “-Zero”?
  6. What single line in diffusion_transformer.py would you change to use FlashAttention?
  7. What happens if you forget to standardize data.py’s output? Trace the bug forward to either method.
  8. Convert a trained ε-predictor into a v-predictor (under the VP path).

Punchline

Both DDPM and Flow Matching come down to fitting a local operator on a chosen density path. The interesting knobs are: which path, what target the net predicts, and which solver you use at test time. The architecture (MLP vs. DiT) is orthogonal: same loss, same sampler, different function class.