Capstone — design a reasoning-model platform
One brief, threaded through every lesson. We take a reasoning model from training to RL to serving to the feedback loop, on a fixed cluster and budget, and at each stage we run the five-step loop and name the one constraint that binds. No new mechanics — just the method, applied end to end, the way a design interview or a real planning doc demands.
Before reading on, try it yourself. How many GPUs for serving alone? Which parallelism for the 70B? Is the RL loop rollout- or train-bound? The gaps between your guesses and the numbers below are exactly what this track was for.
Stage 0 · Requirements → numbers (lesson 03)
Turn the brief into the inputs the arithmetic needs. The output length is the headline: 4,000 tokens is about 13× a typical chat reply of roughly 300 tokens, and by Little's Law it drives everything.
End-to-end latency W comes to about 201 s, so this is a three-minute request. Little's Law: L = λ·W = 500 req/s · 201 s ≈ 100,500 requests in flight at once. The reasoning workload has quietly turned a modest 500 req/s into a six-figure concurrency problem. It is the single most important number in the whole design, and it came from one workload parameter, not the model.
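The Little's Law step can be checked in a few lines. This is a minimal sketch using only the two numbers quoted above (500 req/s and 201 s); the function and variable names are illustrative.

```python
# Little's Law: mean requests in flight = arrival rate * mean time in system.
ARRIVAL_RATE_RPS = 500   # requests per second, from the brief
LATENCY_S = 201          # end-to-end seconds per request, as used in the text


def in_flight(arrival_rate_rps: float, latency_s: float) -> float:
    """Return L = lambda * W."""
    return arrival_rate_rps * latency_s


concurrency = in_flight(ARRIVAL_RATE_RPS, LATENCY_S)
print(f"requests in flight: {concurrency:,.0f}")  # requests in flight: 100,500
```

Because L is linear in W, any change to the average output length moves the concurrency target by nearly the same proportion.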
Stage 1 · The serving fleet (lessons 02, 04, 05)
Fit (02): 70B fp16 = 140 GB > 80 GB, so it does not fit on one H100. The minimum is ⌈140/(80−overhead)⌉ = 2 GPUs, but to hit the TPOT SLO we want more memory bandwidth per token, so use TP = 4 within a node (NVLink): per-token decode latency ≈ model_bytes / (TP · BW) drops ~4×, and the weights now occupy 35 GB/GPU, leaving ~40 GB each for KV.
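The fit arithmetic above can be sketched as follows. Two inputs are assumptions, not figures from the brief: a 5 GB per-GPU reserve for runtime overhead (chosen so the result matches the ~40 GB of KV room quoted above) and an HBM bandwidth of about 3.35 TB/s per H100. The decode estimate is the bandwidth-only bound model_bytes / (TP · BW) and ignores communication cost.

```python
import math

PARAMS = 70e9
BYTES_PER_PARAM = 2      # fp16
HBM_GB = 80              # H100 memory capacity
OVERHEAD_GB = 5          # ASSUMPTION: per-GPU reserve for runtime overhead
HBM_BW_GB_S = 3350       # ASSUMPTION: ~3.35 TB/s HBM bandwidth per GPU

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9                  # 140.0
min_gpus = math.ceil(weights_gb / (HBM_GB - OVERHEAD_GB))    # 2


def per_gpu(tp: int) -> tuple[float, float, float]:
    """Return (weights GB/GPU, KV room GB/GPU, bandwidth-bound ms/token)."""
    weights = weights_gb / tp
    kv_room = HBM_GB - OVERHEAD_GB - weights
    decode_ms = weights_gb / (tp * HBM_BW_GB_S) * 1e3
    return weights, kv_room, decode_ms


for tp in (2, 4, 8):
    w, kv, ms = per_gpu(tp)
    print(f"TP={tp}: weights {w:.1f} GB/GPU, KV room {kv:.1f} GB/GPU, "
          f"~{ms:.1f} ms/token")
```

At TP = 2 the weights fit but leave almost no KV room, which is a second reason to prefer TP = 4 beyond the latency gain.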
Per-replica capacity (04): KV/token for 70B at GQA-8 ≈ 320 KB. A 4,000-token request that has also read, say, a 1,000-token prompt caches ~5,000 tokens ≈ 1.6 GB at its peak. Across 4 GPUs that's ~160 GB of KV room → on the order of 160/1.6 ≈ 100 concurrent requests per TP-4 replica (decode-side, optimistic).
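A short sketch reproduces the per-replica estimate. The architecture numbers (80 layers, 8 KV heads, head dimension 128) are assumptions typical of a Llama-style 70B with GQA-8; they are not stated in the lesson, but they yield the 320 KB per token quoted above.

```python
# ASSUMPTION: Llama-style 70B shape; not given in the text.
LAYERS = 80
KV_HEADS = 8          # GQA-8
HEAD_DIM = 128
BYTES_PER_ELEM = 2    # fp16 KV cache

# Factor of 2 covers the separate K and V tensors in every layer.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

PROMPT_TOKENS = 1_000
OUTPUT_TOKENS = 4_000
kv_per_request_gb = kv_bytes_per_token * (PROMPT_TOKENS + OUTPUT_TOKENS) / 1e9

KV_ROOM_GB = 4 * 40   # TP-4 replica, ~40 GB of KV room per GPU (from the text)
concurrent_requests = int(KV_ROOM_GB // kv_per_request_gb)

print(f"KV per token:   {kv_bytes_per_token / 1024:.0f} KiB")   # 320 KiB
print(f"KV per request: {kv_per_request_gb:.2f} GB")            # 1.64 GB
print(f"requests per replica at peak length: {concurrent_requests}")  # 97
```

The exact figure is 97, which the text rounds to about 100. It is optimistic because it assumes every byte of KV room is usable and every request sits at its peak length.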
Fleet (05): 100,500 / 100 ≈ 1,005 TP-4 replicas → ~4,000 GPUs. We have 1,024. The naive design is 4× over budget. This is the moment the loop earns its keep — iterate on the binding constraint instead of asking for more hardware.
- Disaggregate prefill/decode (05): long prompts + tight TTFT and TPOT is exactly the disaggregation case. A few prefill nodes feed many decode nodes; decode is where the 100k concurrency lives.
- Quantize to fp8 (06): weights drop to 70 GB and KV/token halves → roughly double the per-replica request ceiling. Validate quality on the regression suite (10), which is non-negotiable for a reasoning model.
- Shrink the workload (00→03): push back on the 4,000-token average; adaptive thinking that averages 2,500 cuts W, and thus the fleet, by more than a third (halving both would take an average near 2,000).
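The fleet estimate and two of the levers above can be compared with one function. This is a rough sketch under stated assumptions: fp8 is modeled as exactly doubling the per-replica ceiling (the text's "roughly double"), and W is assumed to scale in proportion to output length. Disaggregation is not modeled.

```python
import math

BUDGET_GPUS = 1024


def fleet_gpus(rate_rps: float, latency_s: float,
               reqs_per_replica: int, tp: int = 4) -> int:
    """GPUs needed to hold rate * latency requests in flight."""
    in_flight = rate_rps * latency_s
    replicas = math.ceil(in_flight / reqs_per_replica)
    return replicas * tp


naive = fleet_gpus(500, 201, reqs_per_replica=100)
fp8 = fleet_gpus(500, 201, reqs_per_replica=200)   # ASSUMPTION: ceiling doubles

# ASSUMPTION: W scales linearly with average output length.
shorter_w = 201 * 2500 / 4000                       # ~125.6 s
both = fleet_gpus(500, shorter_w, reqs_per_replica=200)

for name, gpus in [("naive", naive), ("fp8", fp8), ("fp8 + 2,500 tokens", both)]:
    print(f"{name:>20}: {gpus:>5} GPUs ({gpus / BUDGET_GPUS:.1f}x budget)")
```

Under these assumptions the two levers together land near 1,260 GPUs, still above 1,024, so the loop has to run again on whatever constraint binds next.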
Stage 2 · Training the policy (lessons 02, 07, 08)
Assume we continue-train / specialize the 70B (not pretrain from scratch). Memory (02, 07): 16 bytes/param · 70B params = 1,120 GB of train state, or ~14 H100s' worth before activations. Recipe from 07: TP = 8 in-node for weights/activations, FSDP/ZeRO-3 across the data-parallel dimension to shard the 1,120 GB, and activation checkpointing to fit long-context reasoning traces. Score by MFU; if it sits below ~40%, suspect a collective that isn't overlapping with compute.
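The memory figure and the MFU score can be sketched together. The MFU formula uses the common approximation of 6 FLOPs per parameter per training token. The peak of about 989 TFLOP/s (dense BF16 on an H100) and the example throughput of 1M tokens/s are assumptions for illustration, not numbers from the lesson.

```python
import math

PARAMS = 70e9
TRAIN_BYTES_PER_PARAM = 16     # from the text: weights, grads, optimizer state
HBM_GB = 80

train_state_gb = PARAMS * TRAIN_BYTES_PER_PARAM / 1e9    # 1120.0
h100_equivalents = math.ceil(train_state_gb / HBM_GB)    # 14


def sharded_state_gb(num_gpus: int) -> float:
    """Train state per GPU if it is fully sharded (ZeRO-3 style)."""
    return train_state_gb / num_gpus


def mfu(tokens_per_s: float, num_gpus: int,
        peak_flops_per_gpu: float = 989e12) -> float:
    """Model FLOPs utilization, using ~6 * params FLOPs per token.

    ASSUMPTION: 989e12 is an approximate dense BF16 peak for one H100.
    """
    achieved = 6 * PARAMS * tokens_per_s
    return achieved / (num_gpus * peak_flops_per_gpu)


# HYPOTHETICAL throughput, chosen only to exercise the formula.
example_mfu = mfu(tokens_per_s=1.0e6, num_gpus=1024)
print(f"train state: {train_state_gb:.0f} GB (~{h100_equivalents} H100s)")
print(f"per GPU when sharded over 1,024: {sharded_state_gb(1024):.2f} GB")
print(f"example MFU: {example_mfu:.1%}")
```

Sharding makes the state itself small per GPU; what remains to fit is activations, which is why the recipe pairs ZeRO-3 with checkpointing.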
Data plane (08): at a 4M-token global batch and ~2 s steps, ~2M tokens/s must stream pre-tokenized from object store, double-buffered, deterministically resumable — because at this scale a node dies every few hours and a non-deterministic restart corrupts the epoch.
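A small sketch of the two requirements above: the required token rate, and a shard order that depends only on a seed and the epoch, so that a restart at a given step reads the same data. The 4-byte token width, the helper names, and the fixed shards-per-step layout are illustrative assumptions.

```python
import random

GLOBAL_BATCH_TOKENS = 4_000_000
STEP_TIME_S = 2.0
BYTES_PER_TOKEN = 4    # ASSUMPTION: int32 token ids

tokens_per_s = GLOBAL_BATCH_TOKENS / STEP_TIME_S          # 2,000,000
read_mb_per_s = tokens_per_s * BYTES_PER_TOKEN / 1e6      # aggregate MB/s


def shard_order(num_shards: int, epoch: int, seed: int = 0) -> list[int]:
    """Shuffle shard ids as a pure function of (seed, epoch)."""
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_shards))
    rng.shuffle(order)
    return order


def resume_position(step: int, shards_per_step: int,
                    num_shards: int) -> tuple[int, int]:
    """Map a global step to (epoch, offset into that epoch's shard order)."""
    return divmod(step * shards_per_step, num_shards)


epoch, offset = resume_position(step=10, shards_per_step=3, num_shards=8)
next_shards = shard_order(8, epoch)[offset:]
print(f"{tokens_per_s:,.0f} tokens/s, ~{read_mb_per_s:.0f} MB/s of token ids")
print(f"resume at epoch {epoch}, offset {offset}, next shards {next_shards}")
```

The design choice is that no loader state needs to be checkpointed beyond the step counter: the position in the data is recomputed from it.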
Stage 3 · RL post-training (lesson 09)
Reasoning quality comes from RL (e.g. GRPO on verifiable rewards). Now the system is inference + training in one loop. The diagnosis from 09 is decisive: rollouts generate 4,000-token reasoning traces — generation dominates wall-clock, so the loop is heavily rollout-bound. Therefore most GPUs become actors (inference engines), a minority are learners.
Balance the producer and consumer (08's framing, 09's split): size actor and learner pools so rollout tok/s ≈ train-consumed tok/s, or one side idles. GRPO drops the critic (09), saving a 70B model's worth of memory versus PPO — which is what lets policy + reference + reward + optimizer co-exist. Weight sync broadcasts the updated 140 GB policy to the actors each step; consider async RL with bounded staleness to overlap actor and learner, trading a little on-policyness for a lot of throughput.
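The producer/consumer balance reduces to one ratio. In this sketch the per-GPU throughputs are hypothetical placeholders; in practice they would be measured on the real rollout engine and trainer, and the split would be rounded to whole TP groups.

```python
def split_pool(total_gpus: int, rollout_tok_s_per_gpu: float,
               train_tok_s_per_gpu: float) -> tuple[int, int]:
    """Split GPUs so actor output rate matches learner consumption rate.

    Balance condition: actors * rollout_rate == learners * train_rate,
    which gives actors / total = train_rate / (rollout_rate + train_rate).
    """
    actor_fraction = train_tok_s_per_gpu / (
        rollout_tok_s_per_gpu + train_tok_s_per_gpu)
    actors = round(total_gpus * actor_fraction)
    return actors, total_gpus - actors


# HYPOTHETICAL rates: generation is 3x slower per GPU than training.
ROLLOUT_RATE = 500     # generated tokens/s per actor GPU
TRAIN_RATE = 1500      # consumed tokens/s per learner GPU

actors, learners = split_pool(1024, ROLLOUT_RATE, TRAIN_RATE)
print(f"actors={actors}, learners={learners}")            # actors=768, learners=256
print(f"produced {actors * ROLLOUT_RATE:,} tok/s, "
      f"consumed {learners * TRAIN_RATE:,} tok/s")
```

The slower generation is per GPU, the larger the actor share, which is the quantitative form of "most GPUs become actors".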
Stage 4 · The feedback flywheel & production (lessons 10, 11)
Eval (10): a reasoning model can regress silently — gate every deploy on an offline reasoning regression suite (decontaminated), use LLM-as-judge with position/length-bias controls for open-ended quality, and run online A/B with guardrails (TTFT p99, refusal rate, $/req). Production traces, filtered by the verifier, become the next RL dataset — closing the flywheel back into stage 3.
Production (11): watch queue depth and KV-utilization (not GPU util) as the leading incident signals; blue/green deploys with fast rollback wired to the guardrails; size for peak + k_spare + deploy_headroom remembering a TP group fails as a unit; and report $/1M tokens weekly, with autoscaling for the diurnal swing so you aren't buying peak capacity 24/7.
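The sizing rule and the weekly cost metric can be written down directly. This is a sketch with hypothetical inputs (peak concurrency, spare count, headroom percentage, GPU price, and token rate are all placeholders). Capacity is counted in whole TP groups because a group fails as a unit.

```python
import math

TP = 4  # GPUs per replica; one failed GPU takes out the whole group


def fleet_replicas(peak_in_flight: int, reqs_per_replica: int,
                   k_spare: int, deploy_headroom_pct: int) -> int:
    """Replicas = peak serving + spares for failures + deploy headroom."""
    serving = math.ceil(peak_in_flight / reqs_per_replica)
    headroom = math.ceil(serving * deploy_headroom_pct / 100)
    return serving + k_spare + headroom


def usd_per_million_tokens(gpus: int, usd_per_gpu_hour: float,
                           tokens_per_s: float) -> float:
    """Fleet cost per hour divided by millions of tokens served per hour."""
    millions_per_hour = tokens_per_s * 3600 / 1e6
    return gpus * usd_per_gpu_hour / millions_per_hour


# HYPOTHETICAL inputs, for illustration only.
replicas = fleet_replicas(peak_in_flight=20_000, reqs_per_replica=200,
                          k_spare=2, deploy_headroom_pct=10)
gpus = replicas * TP
cost = usd_per_million_tokens(gpus, usd_per_gpu_hour=2.0, tokens_per_s=100_000)
print(f"{replicas} replicas = {gpus} GPUs, ${cost:.2f} per 1M tokens")
```

Running the cost function on the actual diurnal token rate rather than the peak shows what autoscaling is worth: the numerator shrinks off-peak while the denominator follows real traffic.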
The whole platform, one diagram
Interactive · the design self-test
Eleven questions, one per design lesson (01–11). Answer from the method, not memory: each tests whether you reach for the binding constraint first.
Where to go next
- Down into the mechanisms. Each decision here links to the track that builds it: System ML (parallelism), vLLM / SGLang (serving), RL Post-Training (the loop), Data Engineering (the plane).
- Back to the loop. Re-read lesson 00: requirements → arithmetic → topology → bottleneck → iterate. You now have the numbers (02), the requirements language (03), and worked designs for every subsystem. That's the whole skill.
- Apply it cold — and then check your work. Part VII (lessons 13–18) does exactly this: six more design case studies — code assistant, consumer chatbot, RAG, agentic platform, long-context, batch — each running the loop on a workload where a different wall binds. Same model, six different systems. Read them to see the method generalize.