Practical knobs & trade-offs

A tour of the design decisions in the four .py files: what was picked, what alternatives exist, when you would change them.

The decision tree, abbreviated

The β schedule: linear vs. cosine

DDPM’s “default” in this repo: linear β_t ∈ [10⁻⁴, 0.02] over T=1000. The cosine schedule (Nichol & Dhariwal 2021) is

ᾱ_t = f(t) / f(0), f(t) = cos²((t/T + s) / (1 + s) · π/2), s = 0.008

Why cosine wins on real images: ᾱ_t is the fraction of the original signal variance still present at step t (the rest is noise). The linear schedule drives ᾱ_t toward zero too fast: it is already close to zero well before the final step and ends at ᾱ_T ≈ 4·10⁻⁵, so on a 32×32 image (D = 3072) a big block of the late timesteps all describe the same thing: “essentially pure noise.” The model gets almost no usable training signal from those steps, and at sampling time it must do most of the real reconstruction in the few remaining non-degenerate steps. Cosine bends the curve so the signal decays gently and evenly, spreading the hard work across all T steps instead of cramming it into a handful. Same step budget, more of it spent on steps that teach the model something.

Intuition · linear unpacking

Claim: a “better” noise schedule is not about adding more noise — it’s about not wasting timesteps on noise levels that all look the same.

What a timestep is for. Each step t is one training example: “given data corrupted to noise-level ᾱ_t, predict the noise.” The steps are only useful if their noise levels are different from one another.
The linear failure. On high-dimensional data the linear schedule slams ᾱ_t down to near-zero early, then keeps it pinned there. A long run of the late steps is all “basically pure noise”: indistinguishable examples, so the model learns nothing new from them.
Why that hurts at sampling. Those wasted steps still cost a forward pass each, but contribute almost nothing, so the genuine denoising gets compressed into the few steps where the signal still varies. Crowded work = rougher samples.
What cosine fixes. It makes ᾱ_t fall smoothly from 1 to 0, so consecutive steps always differ by a meaningful amount. Every step is a distinct, learnable noise level.

Central point. The schedule’s job is to ration a fixed budget of timesteps so each one sits at a useful, distinct noise level; cosine simply rations better than linear on high-dimensional data.

For the 2D toy in this repo (D = 2), linear is fine because the data’s norm and the noise’s norm naturally stay comparable — with so few dimensions the late steps don’t all collapse into the same “pure noise” regime, so there is little waste to fix.
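To make the "wasted timesteps" argument concrete, here is a minimal sketch that tabulates ᾱ_t for both schedules and counts how many steps sit below a small signal fraction. It is not the repo's code: the function names and the 0.01 threshold are assumptions chosen for illustration, and the cosine curve is normalized by f(0) as in Nichol & Dhariwal.

```python
import math

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i) for linearly spaced betas."""
    alpha_bar, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def steps_below(alpha_bar, threshold=0.01):
    """Count timesteps whose remaining signal fraction is under `threshold`.
    The threshold is an arbitrary cut-off for "essentially pure noise"."""
    return sum(a < threshold for a in alpha_bar)

linear = linear_alpha_bar()
cosine = cosine_alpha_bar()
print(f"final alpha_bar (linear): {linear[-1]:.2e}")
print("steps with alpha_bar < 0.01:",
      steps_below(linear), "(linear) vs", steps_below(cosine), "(cosine)")
```

With these settings the linear schedule spends a far larger share of its 1000 steps below 1% signal than the cosine schedule does, which is the imbalance the section describes. Practical implementations usually also clip the resulting β_t near the end of the cosine schedule; that detail is omitted here.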

What does the net predict? ε vs. v vs. x₀

Target	Used in	Pros	Cons
ε (noise)	DDPM (this repo)	scale-stable across t; standard	at very small t (low noise) the noise component σ_t ε in x_t is tiny, so ε is hard to estimate
v = α̇_t x₀ + σ̇_t ε (velocity)	Stable Diffusion 2, EDM, FM-style models	balances ε and x₀ across t; better-behaved at endpoints	derivation more involved; need ad-hoc weighting
x₀ (clean data)	some EDM variants, consistency models	directly interpretable; useful for consistency training	at large t (high noise) x_t carries almost no signal, so the prediction is close to a guess; needs explicit reweighting

The trio is interconvertible: given any one of (ε, v, x₀) and the schedule, you can compute the other two. The choice is a numerical-stability + weighting story, not a fundamental one.

Intuition · linear unpacking

Claim: v-prediction is well-behaved at both ends of the chain precisely because it is a blend of ε and x₀ that automatically shifts toward whichever one is the “easy” target at that t.

Each pure target has one bad end. Predicting ε is easy at high noise but near-pointless at low noise (the noise component σ_t ε in the input is so small you can barely see it). Predicting x₀ is easy at low noise but a wild guess at high noise (the input is almost pure noise, so the clean image is anyone’s guess).
v mixes them by the schedule. The velocity v = α̇_t x₀ + σ̇_t ε is the time derivative of x_t = α_t x₀ + σ_t ε: a weighted combination of the data and the noise directions, with the weights set by the schedule coefficients at that t. On the VP path with α_t = cos φ, σ_t = sin φ it reduces to v = α_t ε − σ_t x₀.
So the mix never collapses to a degenerate target. The schedule coefficients shift the balance between the data and noise directions as t moves, so at neither endpoint is v dominated by the “blind guess” component alone — there is always a recoverable part carrying the target.
Net effect. The target never degenerates into “predict almost nothing” or “predict pure guesswork” at the extremes — its scale and informativeness stay roughly even across the whole chain.

Central point. v isn’t a new quantity to learn — it’s ε and x₀ packaged into a fixed, schedule-set blend whose scale stays even across the chain, which is why its endpoints behave where the pure targets don’t.
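The interconvertibility claim can be checked numerically. The sketch below assumes the variance-preserving convention x_t = α_t x₀ + σ_t ε with α_t² + σ_t² = 1 and the v-target v = α_t ε − σ_t x₀ (the form used by Salimans & Ho); the function names are hypothetical and work on plain scalars, not on the repo's tensors.

```python
import math

def eps_to_x0(x_t, eps, alpha, sigma):
    """Invert x_t = alpha * x0 + sigma * eps for x0 (divides by alpha)."""
    return (x_t - sigma * eps) / alpha

def eps_to_v(x_t, eps, alpha, sigma):
    """v = alpha * eps - sigma * x0, with x0 recovered from (x_t, eps)."""
    x0 = eps_to_x0(x_t, eps, alpha, sigma)
    return alpha * eps - sigma * x0

def v_to_x0_eps(x_t, v, alpha, sigma):
    """Recover both pure targets from v; valid when alpha^2 + sigma^2 = 1."""
    x0 = alpha * x_t - sigma * v
    eps = sigma * x_t + alpha * v
    return x0, eps

# Round trip at one noise level (alpha_bar = 0.3 is an arbitrary choice).
alpha_bar = 0.3
alpha, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
x0_true, eps_true = 1.7, -0.4
x_t = alpha * x0_true + sigma * eps_true

v = eps_to_v(x_t, eps_true, alpha, sigma)
x0_rec, eps_rec = v_to_x0_eps(x_t, v, alpha, sigma)
print(x0_rec, eps_rec)
```

Note that going from ε to x₀ divides by α_t, which blows up as α_t → 0, while going from v to (x₀, ε) needs no division at all. That is one way to see why v stays well-conditioned at both ends of the chain.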

Why t_scale = 1000 for continuous time

This shows up in two places (and the repo’s README spends a paragraph on it — here’s the picture).

The sinusoidal time embedding (SinusoidalTimeEmbedding) uses frequencies log-spaced from 1 to 1/10000. For DDPM’s integer t ∈ [0, 1000), the embedding is sin(freq · t) (no 2π factor — matches diffusion.py’s args.sin()) with freq covering several decades — different entries oscillate at different rates across the chain.

For flow matching, t ∈ [0, 1] and every frequency is at most 1, so every argument freq · t stays at or below 1 radian. In that range sin(freq · t) ≈ freq · t (and any cosine entries are ≈ 1), so each entry is almost linear or constant in t. The whole embedding collapses to ~rank 2 in t, wasting most of its dimensions.

The fix: rescale t by 1000 before the embedding so it covers the same range diffusion saw. te = self.time(t * 1000.0). Same for DiT in FM mode via t_scale=1000.
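A quick way to see the effect is to count how many frequencies ever leave the small-angle regime. This is a stand-alone sketch, not the repo's SinusoidalTimeEmbedding: the embedding width of 64, the exact frequency spacing, the sin/cos layout, and the 1-radian cut-off are all assumptions made for illustration.

```python
import math

def frequencies(dim=64, max_period=10000.0):
    """Log-spaced frequencies from 1 down to 1/max_period (assumed spacing)."""
    half = dim // 2
    return [max_period ** (-i / (half - 1)) for i in range(half)]

def sinusoidal_embedding(t, dim=64, max_period=10000.0):
    """[sin(freq * t) ..., cos(freq * t) ...], no 2*pi factor."""
    args = [f * t for f in frequencies(dim, max_period)]
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]

def active_frequencies(t_max, dim=64, max_period=10000.0):
    """Count frequencies whose argument exceeds 1 radian somewhere on [0, t_max].
    Below ~1 radian, sin is close to linear and cos is close to constant."""
    return sum(f * t_max > 1.0 for f in frequencies(dim, max_period))

unscaled = active_frequencies(t_max=1.0)          # flow-matching t, no rescale
scaled = active_frequencies(t_max=1.0 * 1000.0)   # same t after t_scale = 1000
print(unscaled, "frequencies leave the near-linear regime without t_scale")
print(scaled, "frequencies do with t_scale = 1000")
```

Without the rescale, no frequency gets past the near-linear regime; with it, most of them oscillate at least partially over the range, which is what the network needs to tell different t apart.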

How many sampling steps do I actually need?

Quality target	DDPM	DDIM	FM-Euler	FM-RK4
indistinguishable from many-step	~1000	50–100	50	20 (×4 net calls = 80)
visibly good	~250	20	20	10 (×4 = 40)
recognizable but rough	~50	10	5–10	4 (×4 = 16)

These are rules of thumb on diverse image datasets at 32×32 to 256×256. The 2D toy in this repo converges visibly to two-moons-like clouds at K = 5 with FM-Euler; DDPM needs ~200 even with the oracle.
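The "×4 net calls" bookkeeping in the FM-RK4 column can be sketched with a toy integrator. The velocity field below (dx/dt = x, exact answer x(1) = e · x(0)) is only a stand-in for a trained network, and all names are hypothetical; the point is the error-versus-call-count trade, not the specific field.

```python
import math

def euler_step(v, x, t, dt):
    """One explicit Euler step: 1 velocity evaluation."""
    return x + dt * v(x, t)

def rk4_step(v, x, t, dt):
    """One classical Runge-Kutta step: 4 velocity evaluations."""
    k1 = v(x, t)
    k2 = v(x + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = v(x + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = v(x + dt * k3, t + dt)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

def integrate(step, v, x0, K):
    """Integrate from t = 0 to t = 1 in K equal steps."""
    x, dt = x0, 1.0 / K
    for k in range(K):
        x = step(v, x, k * dt, dt)
    return x

class CountingField:
    """Toy velocity dx/dt = x that counts how often it is called."""
    def __init__(self):
        self.calls = 0
    def __call__(self, x, t):
        self.calls += 1
        return x

euler_field, rk4_field = CountingField(), CountingField()
euler_err = abs(integrate(euler_step, euler_field, 1.0, K=10) - math.e)
rk4_err = abs(integrate(rk4_step, rk4_field, 1.0, K=10) - math.e)
print(f"Euler: error {euler_err:.2e} with {euler_field.calls} calls")
print(f"RK4:   error {rk4_err:.2e} with {rk4_field.calls} calls")
```

At the same K, RK4 is far more accurate but pays four evaluations per step, so a fair comparison is at equal total network calls rather than equal K.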

Interactive · K-budget grid — same task, different step counts

How much does K matter? Below: the same oracle DDPM sampler runs at five different K (= number of denoising steps), all on the same starting noise and the same target. Each panel is the final sample cloud; reading left to right shows how quality recovers as you spend more sampling compute.

Final samples at K ∈ {3, 10, 30, 100, 400}

Tiny K shreds the modes; ~30 starts to look right; ~400 is indistinguishable from full ancestral. The cost is linear in K (forward passes).

Five panels, each one independent run. With σ = 0 (deterministic), the samples at low K cluster on identifiable failure modes (the integrator under-runs the curved path). With σ = β_t the failures look noisier — you can tell which one is real.

Classifier-free guidance (CFG) in five lines

Standard recipe (Ho & Salimans 2022) for conditional generation:

import random

# Training: swap the class label for the NULL token 10% of the time (per sample)
if random.random() < 0.10:
    cls = NULL

# Sampling, at each step: two forward passes, then extrapolate
eps_cond = model(x, t, cls)
eps_uncond = model(x, t, NULL)
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)   # w = guidance scale, task-dependent

That’s the entire trick. At w = 1 you get standard conditional sampling, and at w = 0 plain unconditional sampling. w > 1 pushes samples toward the class manifold (sharper class adherence at the cost of diversity). Sweet spots reported in the literature: w ≈ 1.5–4 for class-conditional ImageNet (DiT, ADM), w ≈ 5–9 for text-to-image (Imagen, SD). For FM, replace eps with v in the same line. The repo doesn’t implement it; it’s a 5-line extension.
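Since the repo does not ship CFG, the following is only a stand-alone sketch of the two pieces, using plain Python lists in place of tensors. NULL_CLASS, drop_labels, and guide are hypothetical names, and the reserved id of -1 is an assumption; a real model would typically use an extra embedding row for the null label.

```python
import random

NULL_CLASS = -1  # hypothetical id reserved for "no label"

def drop_labels(labels, p_drop=0.10, rng=random):
    """Training side: replace each label by NULL_CLASS with probability p_drop."""
    return [NULL_CLASS if rng.random() < p_drop else c for c in labels]

def guide(eps_uncond, eps_cond, w):
    """Sampling side: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Label dropout on a batch of identical labels, with a seeded generator.
rng = random.Random(0)
labels = drop_labels([3] * 10_000, rng=rng)
drop_fraction = labels.count(NULL_CLASS) / len(labels)
print(f"dropped fraction: {drop_fraction:.3f}")

# Guidance on two toy predictions.
eps_u, eps_c = [0.0, 1.0], [1.0, 3.0]
print(guide(eps_u, eps_c, 0.0))  # unconditional prediction
print(guide(eps_u, eps_c, 1.0))  # conditional prediction
print(guide(eps_u, eps_c, 3.0))  # extrapolated past the conditional one
```

The w = 0 and w = 1 cases recover the two raw predictions, which is a convenient sanity check when wiring guidance into a sampler.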

The full extension menu

Extension	LOC	What you change	Why bother
RK4 sampler	~10	`FlowMatching.sample`	O(dt⁴) vs O(dt) error; better quality at fixed K, at 4 net calls per step
Cosine β	~5	buffer init in `DDPM.__init__`	delays ᾱ collapse on images
VP flow matching	~8	`FlowMatching.loss` path	recovers diffusion as FM exercise
Class conditioning	~8	extra embedding, concat in MLP / add to `c` in DiT	conditional generation
CFG	~5	10% label drop at train; mix at sample	sharper conditional samples
FlashAttention	~1	`F.scaled_dot_product_attention`	O(N) memory vs O(N²) explicit attention
EMA on weights	~10	maintain decay-averaged params, use for sampling	sample quality 1–3 FID points better
Latent diffusion	~80	train VAE → diffuse in latent space	10× faster training at same quality
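As one example from the menu, the EMA row amounts to a single update rule applied after every optimizer step. The sketch below uses a dict of floats in place of model parameters; ema_update and the decay value are assumptions for illustration, not the repo's code.

```python
def ema_update(ema_params, params, decay=0.999):
    """In place: ema <- decay * ema + (1 - decay) * current, per parameter."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

# Toy run: the "trained" weight sits at 1.0 while the EMA starts at 0.0.
params = {"w": 1.0}
ema = {"w": 0.0}
n_steps, decay = 1000, 0.99
for _ in range(n_steps):
    ema_update(ema, params, decay=decay)

# Closed form for this toy case: 1 - decay ** n_steps.
print(ema["w"], 1.0 - decay ** n_steps)
```

At sampling time you would load the averaged values into the model instead of the raw ones. Decays close to 1 average over more steps but take longer to forget the initialization.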

Common gotchas, expanded

1. Not centering the data

Prior is N(0, I). If p_data has mean 5 and std 0.1, either the noise scale is wrong (DDPM: noise dominates at low t) or the path endpoints don’t match (FM). data.py standardizes for this reason.
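A minimal sketch of the standardization step, using the mean-5, std-0.1 example from above. This is not data.py's actual code: the function names are hypothetical, and the per-dimension statistics are computed with the standard library for self-containment.

```python
import random
import statistics

def fit_standardizer(points):
    """Per-dimension mean and (population) standard deviation."""
    dims = list(zip(*points))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds

def standardize(points, means, stds):
    """Map data to zero mean, unit variance per dimension."""
    return [[(x - m) / s for x, m, s in zip(p, means, stds)] for p in points]

def unstandardize(points, means, stds):
    """Inverse map, applied to generated samples before plotting or saving."""
    return [[x * s + m for x, m, s in zip(p, means, stds)] for p in points]

# Synthetic 2D data with mean 5 and std 0.1, as in the example above.
rng = random.Random(0)
raw = [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(1000)]

means, stds = fit_standardizer(raw)
z = standardize(raw, means, stds)
back = unstandardize(z, means, stds)
print("fitted means:", means, "fitted stds:", stds)
```

The model then trains and samples in the standardized space, where an N(0, I) prior is on the same scale as the data, and the inverse map is applied once at the end.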

2. Sampling from the wrong endpoint in FM

Training goes x₀ → x₁ (prior to data). Sampling starts at x₀ ∼ N(0, I) and integrates forward in t. Reverse-time confusion is the most common bug. Mnemonic: start where you sample from the prior.

3. Sinusoidal embedding range mismatch

See the t_scale = 1000 section above. Without it, most of the embedding dimensions are wasted and the model converges more slowly or not at all.

4. Forgetting elementwise_affine=False in adaLN’s LayerNorm

If you keep the default nn.LayerNorm γ/β, they fight with the modulation MLP’s (scale, shift). Two knobs for the same job → instability. DiTBlock sets elementwise_affine=False on both norms.

5. Variance-exploding pitfall

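If you build a VE schedule (no shrink factor on the signal, so x_t = x₀ + σ_t ε and the variance of x_t grows with t), the data and the prior need to share that scale: the prior should be N(0, σ_max² I), not N(0, I). People often set up VE with a unit-variance prior; sampling then starts from a scale that does not match what the schedule produces at large t, which lands in a numerically nasty regime. VP avoids this almost by construction.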

A self-test menu

If you can answer these from memory, you have the material:

Write down Eq. ⋆. Why does the closed-form marginal exist?
Why is ε the “right” thing for the net to predict, vs. x₀ or μ?
Derive the CFM loss. What is the trick that makes regressing against the conditional target legal?
How does DDPM’s sampling cost scale, and why does FM cost less?
State the adaLN-Zero formula. Why “-Zero”?
What single line in diffusion_transformer.py would you change to use FlashAttention?
What happens if you forget to standardize data.py’s output? Trace the bug forward to either method.
Convert a trained ε-predictor into a v-predictor (under the VP path).

Punchline

Both DDPM and Flow Matching come down to fitting a local operator on a chosen density path. The interesting knobs are: which path, what target the net predicts, which solver you use at test time. The architecture (MLP vs. DiT) is orthogonal: same loss, same sampler, different function class.