Serving framework: synthesis
The kernel hot loop sits inside a much larger pipeline: HTTP → tokenize → queue → schedule → worker → sample → detokenize → stream. This lesson closes the chain. It also lets you read vLLM and SGLang configurations as different compositions of the same primitives we built in lessons 01–07.
The question this lesson answers
If a teammate says "decode is slow" or "TTFT is high under load," which knob do you turn first? The answer depends on which stage actually owns the time — and most of those stages are not kernels at all. This lesson installs the framework-level mental model and the debugging discipline that walks symptoms back to a specific lesson in this track.
End-to-end request path
What lives where: vLLM vs SGLang as compositions
Both engines implement every box above; they just emphasize different ones. The columns below identify which lesson's mechanism each engine pushes hardest. Read them as indications of where each design puts its emphasis, not as features one engine has and the other lacks.
| Stage | vLLM emphasis | SGLang emphasis | From lesson… |
|---|---|---|---|
| API / entry | OpenAI-compatible HTTP server | OpenAI-compatible HTTP + native SGL DSL | — |
| Router | External (your load balancer) | Built-in Model Gateway with cache-aware routing | 05 |
| KV memory | Paged KV cache + block manager | Paged KV pool + radix prefix cache | 04, 05 |
| Scheduler | Continuous batching, chunked prefill, token budget | Continuous batching, prefix-aware scheduling, branch-aware | 06 |
| Attention backend | FlashAttention / FlashInfer / Triton / MLA family | FlashInfer / FA3 / Triton / FlashMLA | 03, 04 |
| Other kernels | Marlin/Machete quant, GPU sampling, CUDA graphs | FlashInfer ops, GPU sampling, CUDA graphs | 07, 06 |
| Multi-process arch | API server / engine core / GPU worker / DP coord | Gateway / worker servers | — |
Where the milliseconds actually go
A realistic latency budget for a chat request with a 2K prompt and 200-token response, on a healthy serving cluster:
| Stage | Typical time | What dominates it | Where it lives in this track |
|---|---|---|---|
| API + tokenize | 1–5 ms | HTTP parsing, tokenizer speed, multimodal preprocessing | — |
| Queue | 0–500 ms | Admission, concurrency limit, routing decisions | — |
| Prefill | 20–200 ms | Prompt length, prefix cache hit rate, chunking | 03, 05, 06 |
| Decode (200 tokens) | 200 × (10–50 ms) = 2–10 s | KV bandwidth, weight bandwidth, batch size, graphs | 02, 04, 06, 07 |
| Stream + detokenize | ~constant per chunk | SSE flushing, detokenizer, client backpressure | — |
Two numbers worth memorizing: time-to-first-token (TTFT) ≈ queue + prefill, and inter-token latency (ITL) ≈ decode-step time. Which one matters depends on the workload: interactive chat suffers most from high TTFT, while long-form generation suffers most from high ITL.
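To make the two numbers concrete, here is a minimal sketch that derives TTFT, ITL, and total latency from per-stage costs. The class name `RequestBudget`, its field names, and the example values are all hypothetical, chosen to fall inside the ranges of the table above; they are not measurements. The sketch also includes the small API + tokenize cost in TTFT, which the approximation in the text leaves out, and it follows the table's simplification of charging one decode step per output token.

```python
from dataclasses import dataclass


@dataclass
class RequestBudget:
    """Per-stage costs in milliseconds for one request (illustrative only)."""

    api_tokenize_ms: float
    queue_ms: float
    prefill_ms: float
    decode_step_ms: float  # time per generated token
    output_tokens: int

    def ttft_ms(self) -> float:
        # Everything that must finish before the first token can be sent.
        return self.api_tokenize_ms + self.queue_ms + self.prefill_ms

    def itl_ms(self) -> float:
        # Inter-token latency is approximated by the decode-step time.
        return self.decode_step_ms

    def total_ms(self) -> float:
        # Table convention: one decode step per output token.
        return self.ttft_ms() + self.output_tokens * self.decode_step_ms


# Hypothetical values inside the ranges of the latency-budget table.
budget = RequestBudget(
    api_tokenize_ms=3.0,
    queue_ms=50.0,
    prefill_ms=100.0,
    decode_step_ms=20.0,
    output_tokens=200,
)
print(budget.ttft_ms())   # 153.0
print(budget.itl_ms())    # 20.0
print(budget.total_ms())  # 4153.0
```

With these assumed numbers, decode accounts for 4000 of 4153 ms, yet a chat user would mostly perceive the 153 ms before the first token appears.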
A decision tree for "decode is slow"
The tree's three branches correspond to the three terms in lesson 01's roofline: launch, memory, compute. Every fix is a lever introduced earlier in the track.
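One way to pick a branch is to estimate the three terms separately and see which is largest. The sketch below assumes each term can be approximated independently: launch as kernel count times per-launch overhead, memory as bytes moved over HBM bandwidth, compute as FLOPs over peak throughput. The function name `dominant_term`, its parameters, and every number in the example are hypothetical; lesson 01's exact formulation may differ.

```python
def dominant_term(
    num_kernels: int,
    launch_us: float,
    bytes_moved: float,
    hbm_gb_per_s: float,
    flops: float,
    peak_tflops: float,
) -> tuple[str, dict[str, float]]:
    """Estimate launch, memory, and compute time (ms) for one decode step.

    Returns the name of the largest term and all three estimates.
    """
    terms = {
        # Kernel launches per step times an assumed per-launch overhead.
        "launch": num_kernels * launch_us / 1e3,
        # Bytes that must cross HBM (weights + KV) over sustained bandwidth.
        "memory": bytes_moved / (hbm_gb_per_s * 1e9) * 1e3,
        # Arithmetic work over peak throughput.
        "compute": flops / (peak_tflops * 1e12) * 1e3,
    }
    return max(terms, key=terms.get), terms


# Hypothetical decode step: 16 GB of weights read per step, 500 kernels
# at 5 us each, 1.28e11 FLOPs, on a device with 2000 GB/s and 400 TFLOP/s.
branch, estimates = dominant_term(
    num_kernels=500,
    launch_us=5.0,
    bytes_moved=16e9,
    hbm_gb_per_s=2000.0,
    flops=1.28e11,
    peak_tflops=400.0,
)
print(branch, estimates)
```

Under these assumed inputs the memory term (about 8 ms) exceeds launch (2.5 ms) and compute (about 0.32 ms), so the memory branch would be the first to follow.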
Common symptom → first-suspect table
| Symptom | First suspect | Check | Lever |
|---|---|---|---|
| Inter-token latency higher than the lesson-02 lower bound | Eager dispatch overhead | kernels/step × launch time vs total | CUDA graphs (06) |
| Throughput plateaus at small batch despite spare HBM | Kernel-launch-bound decode | Per-step kernel count, batch padding shapes | Continuous batching + graphs (06) |
| TTFT high during traffic spikes | Queue + admission | Queue depth, router load distribution | Cache-aware routing (05), scale out |
| One large prompt freezes other users | Unchunked prefill | Per-step time spikes | Chunked prefill, PD disaggregation (06) |
| HBM near full at modest concurrency | Contiguous KV or low pool fraction | Block-pool occupancy | Paged KV with sensible block size (04) |
| Quantized model not faster than bf16 | Dequant overhead or invalid backend | Profile GEMM kernel name + shape | Switch backend, check group size (07) |
| MoE latency 2× higher than dense | Routing imbalance / all-to-all | Per-expert token counts | Capacity factor, EP topology (07) |
| RL rollouts give different logprobs than training | Backend / shape mismatch between rollout and trainer | Same tokens, compare per-token logprobs | Align backends; mismatch-aware algorithm |
| Streaming feels jerky | Detokenization or network | Time between flushed chunks vs ITL | Batch detokenizer, server-side flush policy |
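The last row's check, time between flushed chunks versus ITL, can be sketched as follows. The helper names, the timestamps, and the threshold `factor=3.0` are hypothetical; in practice the arrival times would come from client-side logging of the stream, and the right threshold depends on the flush policy.

```python
def chunk_gaps_ms(arrival_times_s: list[float]) -> list[float]:
    """Gaps in milliseconds between consecutive streamed chunks."""
    return [
        (later - earlier) * 1000.0
        for earlier, later in zip(arrival_times_s, arrival_times_s[1:])
    ]


def streaming_looks_bursty(
    arrival_times_s: list[float], itl_ms: float, factor: float = 3.0
) -> bool:
    """True if any gap between chunks is much larger than the decode ITL.

    A large gap with a steady ITL points away from decode and toward
    detokenization, flush policy, or the network.
    """
    gaps = chunk_gaps_ms(arrival_times_s)
    if not gaps:
        return False
    return max(gaps) > factor * itl_ms


# Hypothetical client-side timestamps in seconds, with ITL measured at 20 ms.
smooth = [0.00, 0.02, 0.04, 0.06]
jerky = [0.00, 0.02, 0.04, 0.20, 0.22]
print(streaming_looks_bursty(smooth, itl_ms=20.0))  # False
print(streaming_looks_bursty(jerky, itl_ms=20.0))   # True
```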
Interactive · end-to-end stage attribution
Set per-stage millisecond costs for one request. The widget identifies the largest stage and tells you which lesson's lever to pull first.
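The same attribution can be done offline. This is a minimal sketch, not the widget's actual implementation: the stage keys, the function name `attribute`, and the example costs are hypothetical, and the stage-to-lesson mapping is copied from the last column of the latency-budget table above.

```python
# Stage -> lessons, taken from the latency-budget table ("—" becomes []).
STAGE_TO_LESSONS: dict[str, list[str]] = {
    "api_tokenize": [],
    "queue": [],
    "prefill": ["03", "05", "06"],
    "decode": ["02", "04", "06", "07"],
    "stream_detokenize": [],
}


def attribute(stage_ms: dict[str, float]) -> tuple[str, float, list[str]]:
    """Return the largest stage, its share of total time, and its lessons."""
    total = sum(stage_ms.values())
    if total <= 0:
        raise ValueError("stage costs must sum to a positive number")
    stage = max(stage_ms, key=stage_ms.get)
    return stage, stage_ms[stage] / total, STAGE_TO_LESSONS.get(stage, [])


# Hypothetical per-stage costs in milliseconds for one request.
stage, share, lessons = attribute(
    {
        "api_tokenize": 3.0,
        "queue": 50.0,
        "prefill": 100.0,
        "decode": 4000.0,
        "stream_detokenize": 20.0,
    }
)
print(f"{stage}: {share:.1%} of total, see lessons {lessons}")
```

Stages with an empty lesson list (queue, streaming) are the ones where the fix is usually operational rather than a kernel-level lever.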
Putting the track together
You now have one coherent picture:
- Hardware sets the rules. HBM bandwidth and the roofline decide which side of every kernel is the bottleneck (01).
- The transformer forward is a chain of well-known kernels, each with a derivable byte and FLOP cost; the KV cache is what makes decode bandwidth-bound (02).
- FlashAttention rescues the attention kernel's natural arithmetic intensity by streaming softmax (03).
- Paged KV makes that kernel work on non-contiguous storage so the engine can use HBM efficiently (04).
- Prefix caches and radix trees let many requests share the same KV (05).
- Continuous batching, chunked prefill, and CUDA graphs make the scheduler's traffic look regular and cheap to launch (06).
- Quantized GEMM, MoE kernels, and batched sampling attack what is left of the decode chain (07).
- The serving framework turns HTTP requests into the well-shaped batches the previous layers can run, and decides where milliseconds actually live (08).
Where to go next
- Run a profiler (NVIDIA Nsight Systems, or your engine's built-in timeline) on a real decode step and identify which lesson 02 row dominates.
- Pick a knob (block size, prefix cache fraction, chunked prefill size, attention backend) and predict what will move before changing it. Compare to measurement.
- For RL contexts, pay extra attention to lesson 05 (prefix reuse) and the RL row in the symptom table — rollout/trainer mismatch is its own beast.