Capstone — design a reasoning-model platform

One brief, threaded through every lesson. We take a reasoning model from training to RL to serving to the feedback loop, on a fixed cluster and budget, and at each stage we run the five-step loop and name the one constraint that binds. No new mechanics — just the method, applied end to end, the way a design interview or a real planning doc demands.

The brief

Ship a 70B reasoning model (long chain-of-thought, average 4,000 output tokens) as a product. You have 1,024 H100s (128 nodes × 8, NVLink intra-node, InfiniBand between). Serving SLO: p95 TTFT < 1 s, TPOT < 50 ms, expected 500 req/s at peak. Budget pressure is real: leadership wants $/1M tokens reported weekly. Design the platform.

Before reading on, try it yourself. How many GPUs for serving alone? Which parallelism for the 70B? Is the RL loop rollout- or train-bound? The gaps between your guesses and the numbers below are exactly what this track was for.

Stage 0 · Requirements → numbers (lesson 03)

Turn the brief into the inputs the arithmetic needs. The output length is the headline: 4,000 tokens is about 13× a typical ~300-token chat reply, and by Little's Law it drives everything.

W ≈ TTFT + TPOT · output = 1 + 0.05 · 4000 = 201 s per request

That is a three-minute request. Little's Law: L = λ·W = 500 · 201 ≈ 100,500 requests in flight at once. The reasoning workload has quietly turned a modest 500 req/s into a six-figure concurrency problem — the single most important number in the whole design, and it came from one workload parameter, not the model.
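The arithmetic above can be checked in a few lines. This is a minimal sketch that plugs the SLO ceilings from the brief into W and Little's Law; the function names are illustrative only.

```python
def time_in_system_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """W ≈ TTFT + TPOT · output, using the SLO ceilings as the per-request time."""
    return ttft_s + tpot_s * output_tokens


def requests_in_flight(arrival_rate_per_s: float, time_in_system: float) -> float:
    """Little's Law: L = λ · W."""
    return arrival_rate_per_s * time_in_system


w = time_in_system_s(ttft_s=1.0, tpot_s=0.05, output_tokens=4000)
concurrency = requests_in_flight(arrival_rate_per_s=500, time_in_system=w)
print(f"W = {w:.0f} s, L = {concurrency:,.0f} requests in flight")
```

Changing `output_tokens` is the fastest way to see how strongly the concurrency figure depends on that one parameter.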

Binding constraint, stage 0

Output length. Everything downstream scales with it. If product can cap "thinking" tokens or use adaptive length, that decision saves more money than any kernel. Always attack the requirement before the implementation.

Stage 1 · The serving fleet (lessons 02, 04, 05)

Fit (02): 70B fp16 = 140 GB > 80 GB, so it does not fit one H100. Minimum ⌈140/(80−overhead)⌉ = 2, but to hit the TPOT SLO we want more bandwidth per token, so use TP = 4 within a node (NVLink): decode latency ≈ model_bytes / (TP · BW) drops ~4×, and the weights now occupy 35 GB/GPU, leaving ~40 GB each for KV.
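A rough sketch of the fit and bandwidth arithmetic follows. Two numbers here are assumptions rather than part of the brief: a ~5 GB per-GPU reserve (chosen so that 35 GB of weights leaves ~40 GB for KV, as above) and the H100 SXM peak HBM bandwidth of about 3.35 TB/s. The decode figure is a weights-only floor; it ignores KV reads, all-reduce time, and kernel overheads.

```python
import math

HBM_GB = 80.0         # H100 capacity, from the brief
USABLE_GB = 75.0      # assumption: ~5 GB/GPU reserved for runtime and activations
HBM_BW_GB_S = 3350.0  # assumption: H100 SXM peak HBM bandwidth (~3.35 TB/s)


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param


def min_gpus_to_fit(total_weights_gb: float) -> int:
    """Smallest GPU count whose usable memory holds the weights alone."""
    return math.ceil(total_weights_gb / USABLE_GB)


def kv_room_per_gpu_gb(total_weights_gb: float, tp: int) -> float:
    """Memory left for KV cache on each GPU once its weight shard is resident."""
    return USABLE_GB - total_weights_gb / tp


def decode_floor_ms(total_weights_gb: float, tp: int) -> float:
    """Time to stream every weight once per token, split across tp GPUs."""
    return total_weights_gb / (tp * HBM_BW_GB_S) * 1000.0


w_gb = weights_gb(70, 2)  # fp16
print("min GPUs to fit:", min_gpus_to_fit(w_gb))
for tp in (2, 4, 8):
    print(f"TP={tp}: {kv_room_per_gpu_gb(w_gb, tp):.1f} GB KV room/GPU, "
          f"decode floor {decode_floor_ms(w_gb, tp):.1f} ms/token")
```

The floor at TP = 4 sits well under the 50 ms TPOT target, which leaves room for the KV reads and communication that this sketch leaves out.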

Per-replica capacity (04): KV/token for 70B at GQA-8 ≈ 320 KB. A 4,000-token request that has also read, say, a 1,000-token prompt caches ~5,000 tokens ≈ 1.6 GB at its peak. The TP-4 replica's four GPUs have ~40 GB free each, so ~160 GB of KV room in total → on the order of 160/1.6 ≈ 100 concurrent requests per TP-4 replica (decode-side, optimistic).
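The ≈ 320 KB figure can be reproduced from a model shape. The shape below (80 layers, 8 KV heads, head dimension 128) is an assumption, typical of Llama-style 70B models with GQA-8; it is not stated in the brief. Using exact bytes gives a ceiling slightly under 100, consistent with "on the order of".

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies; the factor 2 covers K and V."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Assumed shape: 80 layers, 8 KV heads (GQA-8), head_dim 128, fp16 cache.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)
per_request = per_token * (1000 + 4000)   # prompt + output tokens at peak
kv_room = 4 * 40 * 10**9                  # ~40 GB free on each of 4 GPUs
concurrent = kv_room // per_request

print(f"KV/token   = {per_token / 1024:.0f} KiB")
print(f"KV/request = {per_request / 1e9:.2f} GB")
print(f"concurrent requests per TP-4 replica ≈ {concurrent}")
```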

Fleet (05): 100,500 / 100 ≈ 1,005 TP-4 replicas → ~4,000 GPUs. We have 1,024. The naive design is 4× over budget. This is the moment the loop earns its keep — iterate on the binding constraint instead of asking for more hardware.

Disaggregate prefill/decode (05): long prompts + tight TTFT and TPOT is exactly the disaggregation case. A few prefill nodes feed many decode nodes; decode is where the 100k concurrency lives.
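Quantize to fp8 (06): weights drop from 140 GB to 70 GB and KV/token halves to ~160 KB. Each request then needs half the KV and each TP-4 replica gains ~70 GB of KV room, so the per-replica request ceiling at least doubles (230/0.8 ≈ 290 versus 160/1.6 ≈ 100). Validate quality on the regression suite (10); that check is non-negotiable for a reasoning model.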
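Shrink the workload (00→03): push back on the 4,000-token average. Adaptive thinking that averages 2,500 cuts W from 201 s to 126 s (about 37%) and peak KV per request from ~5,000 to ~3,500 tokens; the two effects together more than halve the fleet.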
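The fleet estimate and two of the three levers can be put into one small what-if function. This is a sketch under stated assumptions: 75 GB usable per GPU, every request holding its peak KV for its whole lifetime, and the rounded 320 KB (fp16) and 160 KB (fp8) per-token figures. Disaggregation is not modeled here, since its benefit depends on the prefill/decode split rather than on bytes alone.

```python
import math


def serving_gpus(req_per_s: int, prompt_tokens: int, output_tokens: int,
                 weight_bytes_per_param: int, kv_bytes_per_token: int,
                 tp: int = 4, params: int = 70 * 10**9,
                 usable_bytes_per_gpu: int = 75 * 10**9,  # assumption
                 ttft_ms: int = 1000, tpot_ms: int = 50) -> int:
    """GPUs needed when decode KV memory is the binding constraint."""
    w_ms = ttft_ms + tpot_ms * output_tokens            # time in system
    in_flight = req_per_s * w_ms / 1000                 # Little's Law
    kv_room = tp * usable_bytes_per_gpu - params * weight_bytes_per_param
    kv_per_request = (prompt_tokens + output_tokens) * kv_bytes_per_token
    per_replica = kv_room // kv_per_request             # concurrent requests
    replicas = math.ceil(in_flight / per_replica)
    return replicas * tp


scenarios = {
    "fp16, 4,000 out": serving_gpus(500, 1000, 4000, 2, 320_000),
    "fp8,  4,000 out": serving_gpus(500, 1000, 4000, 1, 160_000),
    "fp16, 2,500 out": serving_gpus(500, 1000, 2500, 2, 320_000),
    "fp8,  2,500 out": serving_gpus(500, 1000, 2500, 1, 160_000),
}
for name, gpus in scenarios.items():
    print(f"{name}: {gpus:,} GPUs")
```

Under these assumptions, fp8 plus the shorter average is the first combination that fits inside the 1,024-GPU cluster, before counting prefill capacity, spares, or the training and RL pools.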

Binding constraint, stage 1

Decode KV memory × concurrency (not compute). fp8 + disaggregation + a length cap together bring the serving fleet from ~4,000 toward a feasible few-hundred-GPU footprint. Notice we never reached for a fancier kernel — the wall was bytes, so the fixes were byte-shaped.

Stage 2 · Training the policy (lessons 02, 07, 08)

Assume we continue-train / specialize the 70B (not pretrain from scratch). Memory (02, 07): 16 bytes/param · 70B = 1,120 GB of train state, ~14 H100s' worth before activations. Recipe from 07: TP = 8 in-node for weights/activations, FSDP/ZeRO-3 across the data-parallel dimension to shard the 1,120 GB, activation checkpointing to fit long-context reasoning traces. Score by MFU; if it sits below ~40%, the first suspect is a collective that isn't overlapping with compute.
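The 16 bytes per parameter can be itemized. The breakdown below is the usual mixed-precision Adam accounting and is an assumption about how the figure is composed. The per-GPU number assumes the state is sharded evenly across all 1,024 GPUs, which would only hold if the whole cluster were given to training.

```python
import math

# Assumed composition of the 16 bytes/param (mixed-precision Adam).
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def train_state_gb(params_billion: float) -> float:
    """Total training state in GB, excluding activations."""
    return params_billion * sum(BYTES_PER_PARAM.values())


def sharded_state_gb_per_gpu(params_billion: float, n_gpus: int) -> float:
    """Per-GPU share when the state is sharded evenly (ZeRO-3 style)."""
    return train_state_gb(params_billion) / n_gpus


total_gb = train_state_gb(70)
print(f"train state: {total_gb:,.0f} GB")
print(f"H100s to hold it unsharded-by-design: {math.ceil(total_gb / 80)}")
print(f"per GPU across 1,024 GPUs: {sharded_state_gb_per_gpu(70, 1024):.2f} GB")
```

The small per-GPU share is why activations, not optimizer state, decide whether long reasoning traces fit once sharding is in place.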

Data plane (08): at a 4M-token global batch and ~2 s steps, ~2M tokens/s must stream pre-tokenized from object store, double-buffered, deterministically resumable — because at this scale a node dies every few hours and a non-deterministic restart corrupts the epoch.
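One way to get deterministic resumption is to make each rank's batch a pure function of (seed, epoch, step, rank), so a restarted job recomputes exactly the indices it would have read. The sketch below illustrates that idea with Python's standard library; all names are hypothetical, and a real pipeline would permute shards rather than materialize a full index list.

```python
import random


def batch_indices(seed: int, epoch: int, step_in_epoch: int, rank: int,
                  world_size: int, dataset_size: int,
                  per_rank_batch: int) -> list[int]:
    """Sample indices for one rank at one step; depends only on the arguments."""
    rng = random.Random(seed * 1_000_003 + epoch)  # same seed+epoch, same order
    order = list(range(dataset_size))
    rng.shuffle(order)
    global_batch = per_rank_batch * world_size
    start = step_in_epoch * global_batch + rank * per_rank_batch
    return order[start:start + per_rank_batch]


# Required streaming rate from the figures above: 4M tokens every ~2 s.
tokens_per_s = 4_000_000 / 2.0

# A "restart" at step 7 reproduces the batch the original run would have read.
before_crash = batch_indices(seed=1234, epoch=0, step_in_epoch=7, rank=2,
                             world_size=4, dataset_size=1000, per_rank_batch=8)
after_restart = batch_indices(seed=1234, epoch=0, step_in_epoch=7, rank=2,
                              world_size=4, dataset_size=1000, per_rank_batch=8)
print(before_crash == after_restart, f"{tokens_per_s:,.0f} tokens/s")
```

Because no loader state has to be checkpointed, the only thing a restart needs is the step counter that the training checkpoint already stores.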

Binding constraint, stage 2

Memory capacity first (forces TP×FSDP), then MFU (forces comms overlap). The data plane must not become the hidden third wall — instrument data-wait time from day one.

Stage 3 · RL post-training (lesson 09)

Reasoning quality comes from RL (e.g. GRPO on verifiable rewards). Now the system is inference + training in one loop. The diagnosis from 09 is decisive: rollouts generate 4,000-token reasoning traces — generation dominates wall-clock, so the loop is heavily rollout-bound. Therefore most GPUs become actors (inference engines), a minority are learners.

Balance the producer and consumer (08's framing, 09's split): size actor and learner pools so rollout tok/s ≈ train-consumed tok/s, or one side idles. GRPO drops the critic (09), saving a 70B model's worth of memory versus PPO — which is what lets policy + reference + reward + optimizer co-exist. Weight sync broadcasts the updated 140 GB policy to the actors each step; consider async RL with bounded staleness to overlap actor and learner, trading a little on-policyness for a lot of throughput.
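The balance condition can be sketched as a two-line solve. The per-GPU throughputs and the link rate below are hypothetical placeholders, not measurements; the point is the shape of the calculation, which should be rerun with profiled numbers.

```python
def split_pools(total_gpus: int, rollout_tok_s_per_gpu: float,
                train_tok_s_per_gpu: float, node_size: int = 8) -> tuple[int, int]:
    """Pick (actors, learners) so rollout tok/s ≈ train-consumed tok/s.

    Solves actors * rollout = learners * train with actors + learners = total,
    then rounds the learner pool to whole nodes.
    """
    ideal_learners = (total_gpus * rollout_tok_s_per_gpu
                      / (rollout_tok_s_per_gpu + train_tok_s_per_gpu))
    learners = max(node_size, round(ideal_learners / node_size) * node_size)
    return total_gpus - learners, learners


def sync_floor_s(weights_gb: float, tp: int, link_gb_per_s: float) -> float:
    """Lower bound on weight sync if every actor GPU pulls its shard in parallel."""
    return (weights_gb / tp) / link_gb_per_s


# Hypothetical rates: 500 tok/s/GPU generating, 4,000 tok/s/GPU training.
actors, learners = split_pools(1024, rollout_tok_s_per_gpu=500,
                               train_tok_s_per_gpu=4000)
print(f"actors={actors}, learners={learners}")
print(f"produced {actors * 500:,} tok/s vs consumed {learners * 4000:,} tok/s")
# Hypothetical 50 GB/s per-GPU link for the 140 GB policy at TP=4.
print(f"weight sync floor ≈ {sync_floor_s(140, 4, 50):.1f} s per step")
```

With generation roughly an order of magnitude slower per GPU than training, the solve puts most of the cluster in the actor pool, which matches the rollout-bound diagnosis.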

Binding constraint, stage 3

Rollout (generation) throughput. The fix is more actor GPUs and faster decode — i.e. every serving optimization from stage 1 pays off again here. Inference and RL share a bottleneck; that's why this track teaches them with one method.

Stage 4 · The feedback flywheel & production (lessons 10, 11)

Eval (10): a reasoning model can regress silently — gate every deploy on an offline reasoning regression suite (decontaminated), use LLM-as-judge with position/length-bias controls for open-ended quality, and run online A/B with guardrails (TTFT p99, refusal rate, $/req). Production traces, filtered by the verifier, become the next RL dataset — closing the flywheel back into stage 3.
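A common position-bias control is to ask the judge twice with the answers swapped and to count a win only when both orderings agree. The sketch below shows that control alone; the `judge` callable and its "first"/"second" return values are hypothetical stand-ins for whatever judge model is used, and length bias would need its own control.

```python
from typing import Callable

# A judge takes (prompt, answer_shown_first, answer_shown_second)
# and returns "first" or "second". This interface is hypothetical.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie"; a win must survive swapping the order."""
    verdict_ab = judge(prompt, a, b)
    verdict_ba = judge(prompt, b, a)
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    return "tie"  # the verdict flipped with position, so it is not trusted


def always_first(prompt: str, x: str, y: str) -> str:
    """Stub judge that is purely position-biased."""
    return "first"


def prefers_longer(prompt: str, x: str, y: str) -> str:
    """Stub judge with a consistent, position-independent preference."""
    return "first" if len(x) >= len(y) else "second"


print(debiased_preference(always_first, "q", "short", "a longer answer"))
print(debiased_preference(prefers_longer, "q", "short", "a longer answer"))
```

The first stub shows a position-only judge collapsing to a tie; the second shows that a consistent preference passes through unchanged.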

Production (11): watch queue depth and KV-utilization (not GPU util) as the leading incident signals; blue/green deploys with fast rollback wired to the guardrails; size for peak + k_spare + deploy_headroom remembering a TP group fails as a unit; and report $/1M tokens weekly, with autoscaling for the diurnal swing so you aren't buying peak capacity 24/7.
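The sizing rule and the weekly cost metric reduce to two small functions. The replica count, spare count, headroom percentage, and GPU price below are hypothetical inputs chosen only to exercise the arithmetic; the peak token rate comes from the brief (500 req/s × 4,000 output tokens). Capacity is counted in whole TP groups because one failed GPU takes its group out of service.

```python
import math


def gpus_to_provision(peak_replicas: int, k_spare: int,
                      deploy_headroom_pct: int, tp: int = 4) -> int:
    """peak + k_spare + deploy_headroom, counted in replicas, returned in GPUs."""
    headroom = math.ceil(peak_replicas * deploy_headroom_pct / 100)
    return (peak_replicas + k_spare + headroom) * tp


def usd_per_million_tokens(gpus: int, usd_per_gpu_hour: float,
                           tokens_per_s: float) -> float:
    """Fleet cost per second divided by tokens per second, scaled to 1M tokens."""
    return gpus * usd_per_gpu_hour / 3600 / tokens_per_s * 1e6


# Hypothetical: 150 peak replicas, 4 spares, 10% deploy headroom, $2/GPU-hour.
gpus = gpus_to_provision(peak_replicas=150, k_spare=4, deploy_headroom_pct=10)
peak_tok_s = 500 * 4000
cost = usd_per_million_tokens(gpus, usd_per_gpu_hour=2.0, tokens_per_s=peak_tok_s)
print(f"{gpus} GPUs, ${cost:.3f} per 1M output tokens at peak")
```

The same function evaluated at off-peak traffic with a fixed fleet gives a higher $/1M figure, which is the quantitative case for autoscaling with the diurnal swing.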

The whole platform, one diagram

Interactive · the design self-test

Eleven questions, one per design lesson (01–11). Answer from the method, not memory: each tests whether you reach for the binding constraint first.

Where to go next

Down into the mechanisms. Each decision here links to the track that builds it: System ML (parallelism), vLLM / SGLang (serving), RL Post-Training (the loop), Data Engineering (the plane).
Back to the loop. Re-read lesson 00: requirements → arithmetic → topology → bottleneck → iterate. You now have the numbers (02), the requirements language (03), and worked designs for every subsystem. That's the whole skill.
Apply it cold — and then check your work. Part VII (lessons 13–18) does exactly this: six more design case studies — code assistant, consumer chatbot, RAG, agentic platform, long-context, batch — each running the loop on a workload where a different wall binds. Same model, six different systems. Read them to see the method generalize.