Serving framework: synthesis

The kernel hot loop sits inside a much larger pipeline: HTTP → tokenize → queue → schedule → worker → sample → detokenize → stream. This lesson closes the chain. It also lets you read vLLM and SGLang configurations as different compositions of the same primitives we built in lessons 01–07.

The question this lesson answers

If a teammate says "decode is slow" or "TTFT is high under load," which knob do you turn first? The answer depends on which stage actually owns the time — and most of those stages are not kernels at all. This lesson installs the framework-level mental model and the debugging discipline that walks symptoms back to a specific lesson in this track.

End-to-end request path

Stages 1–5 happen once per request. Stages 6–10 happen once per token. Anything on the per-token path multiplies by output length, which is why "small" sampling or detokenization costs add up. Kernels live in one box. Everything else is framework. Most "slow" reports point at framework, not kernel.
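The multiplication is easy to make concrete. The sketch below is only illustrative: the stage names, their grouping into per-request and per-token costs, and every millisecond value are assumptions chosen for the example, not measurements.

```python
def end_to_end_ms(per_request_ms, per_token_ms, output_tokens):
    """Total latency = once-per-request costs + output_tokens * per-token costs."""
    once = sum(per_request_ms.values())
    repeated = output_tokens * sum(per_token_ms.values())
    return once + repeated


# Hypothetical costs in milliseconds (assumed, not measured).
per_request = {"http": 1.0, "tokenize": 2.0, "queue": 10.0, "prefill": 80.0}
per_token = {"decode_step": 20.0, "sample": 0.3, "detokenize": 0.2}

total = end_to_end_ms(per_request, per_token, output_tokens=200)

# The non-kernel part of the per-token path, multiplied by output length.
framework_per_token = 200 * (per_token["sample"] + per_token["detokenize"])

print(f"total: {total:.1f} ms, sampling + detokenize alone: {framework_per_token:.1f} ms")
```

With these assumed numbers, half a millisecond of per-token framework work grows to more than the entire per-request path once it is repeated 200 times.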

What lives where: vLLM vs SGLang as compositions

Both engines implement every box above; they just emphasize different ones. The columns below identify which lesson's mechanism each engine pushes hardest. Read them as indications of where each design points, not as features one engine has and the other lacks.

Stage	vLLM emphasis	SGLang emphasis	From lesson…
API / entry	OpenAI-compatible HTTP server	OpenAI-compatible HTTP + native SGL DSL	—
Router	External (your load balancer)	Built-in Model Gateway with cache-aware routing	05
KV memory	Paged KV cache + block manager	Paged KV pool + radix prefix cache	04, 05
Scheduler	Continuous batching, chunked prefill, token budget	Continuous batching, prefix-aware and branch-aware scheduling	06
Attention backend	FlashAttention / FlashInfer / Triton / MLA family	FlashInfer / FA3 / Triton / FlashMLA	03, 04
Other kernels	Marlin/Machete quant, GPU sampling, CUDA graphs	FlashInfer ops, GPU sampling, CUDA graphs	07, 06
Multi-process arch	API server / engine core / GPU worker / DP coord	Gateway / worker servers	—

Don't pick by mascot

"Which engine is faster" is the wrong question. Ask "which stage is my bottleneck?" Independent traffic with diverse prompts often favors vLLM-style paged scheduling. Branching, prefix-heavy workloads (tool agents, RAG with fixed docs, RL) often favor SGLang-style cache-aware routing. Both can be tuned to win in the other's regime; both can be misconfigured to lose.

Where the milliseconds actually go

A realistic latency budget for a chat request with a 2K prompt and 200-token response, on a healthy serving cluster:

Stage	Typical time	What dominates it	Where it lives in this track
API + tokenize	1–5 ms	HTTP parsing, tokenizer speed, multimodal preprocessing	—
Queue	0–500 ms	Admission, concurrency limit, routing decisions	—
Prefill	20–200 ms	Prompt length, prefix cache hit rate, chunking	03, 05, 06
Decode (200 tokens)	200 × (10–50 ms)	KV bandwidth, weight bandwidth, batch size, graphs	02, 04, 06, 07
Stream + detokenize	~constant per chunk	SSE flushing, detokenizer, client backpressure	—

Two numbers worth memorizing: time-to-first-token (TTFT) ≈ queue + prefill; inter-token latency (ITL) ≈ decode-step time. Different users care about different ones — interactive chat hates high TTFT; long-form generation hates high ITL.
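As a rough sketch, the budget table and the two approximations can be turned into a small attribution helper. The function name `attribute` and the input values are hypothetical; the lesson lists are copied from the table, and the stream + detokenize row is left out because the table gives it no millisecond figure.

```python
# Lessons per stage, copied from the budget table above.
LESSONS = {
    "api_tokenize": [],
    "queue": [],
    "prefill": ["03", "05", "06"],
    "decode": ["02", "04", "06", "07"],
}


def attribute(api_tokenize_ms, queue_ms, prefill_ms, decode_step_ms, output_tokens):
    """Return TTFT, ITL, total time, and the stage that owns the most time."""
    totals = {
        "api_tokenize": api_tokenize_ms,
        "queue": queue_ms,
        "prefill": prefill_ms,
        "decode": decode_step_ms * output_tokens,  # per-token cost times length
    }
    largest = max(totals, key=totals.get)
    return {
        "ttft_ms": queue_ms + prefill_ms,  # TTFT ~ queue + prefill
        "itl_ms": decode_step_ms,          # ITL ~ decode-step time
        "total_ms": sum(totals.values()),
        "largest_stage": largest,
        "lessons": LESSONS[largest],
    }


# Assumed inputs: a steady-state chat request and a short reply during a spike.
steady = attribute(3, 50, 100, 20, output_tokens=200)
spike = attribute(3, 500, 100, 20, output_tokens=10)

print(steady["largest_stage"], steady["ttft_ms"], steady["lessons"])
print(spike["largest_stage"], spike["ttft_ms"], spike["lessons"])
```

In the steady case decode owns the time; in the spike case the queue does, and no kernel-level lesson applies to it.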

A decision tree for "decode is slow"

The decision tree has three branches, one for each of the three terms in lesson 01's roofline: launch, memory, compute. Every fix is a lever introduced earlier in the track.
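One way to pick a branch is to estimate each term for a single decode step and see which is largest. This is a minimal sketch under stated assumptions: the function name, the per-kernel launch cost, and the hardware and model numbers are all hypothetical, and it treats the three terms as independent estimates rather than modelling overlap.

```python
def dominant_term(num_kernels, launch_us, bytes_moved, flops, hbm_gb_per_s, peak_tflops):
    """Estimate launch, memory, and compute time for one decode step (in ms)."""
    terms_ms = {
        "launch": num_kernels * launch_us / 1e3,
        "memory": bytes_moved / (hbm_gb_per_s * 1e9) * 1e3,
        "compute": flops / (peak_tflops * 1e12) * 1e3,
    }
    return max(terms_ms, key=terms_ms.get), terms_ms


# Assumed example: about 16 GB of weights read per step, batch size 1,
# 500 kernel launches at 5 microseconds each, 2000 GB/s HBM, 400 TFLOP/s peak.
branch, terms_ms = dominant_term(
    num_kernels=500,
    launch_us=5.0,
    bytes_moved=16e9,
    flops=1.6e10,
    hbm_gb_per_s=2000.0,
    peak_tflops=400.0,
)

print(branch, {name: round(ms, 2) for name, ms in terms_ms.items()})
```

If the measured step time is far above the largest estimate, the gap itself is the finding: something outside these three terms (scheduling, synchronization, host work) is taking the time.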

Common symptom → first-suspect table

Symptom	First suspect	Check	Lever
Inter-token latency higher than the lesson-02 lower bound	Eager dispatch overhead	kernels/step × launch time vs total	CUDA graphs (06)
Throughput plateaus at small batch despite spare HBM	Kernel-launch-bound decode	Per-step kernel count, batch padding shapes	Continuous batching + graphs (06)
TTFT high during traffic spikes	Queue + admission	Queue depth, router load distribution	Cache-aware routing (05), scale out
One large prompt freezes other users	Unchunked prefill	Per-step time spikes	Chunked prefill, PD disaggregation (06)
HBM near full at modest concurrency	Contiguous KV or low pool fraction	Block-pool occupancy	Paged KV with sensible block size (04)
Quantized model not faster than bf16	Dequant overhead or invalid backend	Profile GEMM kernel name + shape	Switch backend, check group size (07)
MoE latency 2× higher than dense	Routing imbalance / all-to-all	Per-expert token counts	Capacity factor, EP topology (07)
RL rollouts give different logprobs than training	Backend / shape mismatch between rollout and trainer	Same tokens, compare per-token logprobs	Align backends; mismatch-aware algorithm
Streaming feels jerky	Detokenization or network	Time between flushed chunks vs ITL	Batch detokenizer, server-side flush policy
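The check in the last row (time between flushed chunks vs ITL) can be sketched as below. The function name, the `burst_factor` threshold of 3, and the timestamps are assumptions for illustration; in practice the flush times would come from client-side or server-side logs.

```python
from statistics import median


def chunk_gap_report(flush_times_ms, itl_ms, burst_factor=3):
    """Compare gaps between flushed chunks against the inter-token latency."""
    gaps = [later - earlier for earlier, later in zip(flush_times_ms, flush_times_ms[1:])]
    worst = max(gaps)
    return {
        "median_gap_ms": median(gaps),
        "max_gap_ms": worst,
        # A gap far above ITL suggests the stall is in flushing or the network,
        # not in the decode step itself.
        "bursty": worst > burst_factor * itl_ms,
    }


# Assumed timestamps (ms) at which chunks reached the client; ITL measured at 20 ms.
report = chunk_gap_report([0, 20, 40, 200, 220], itl_ms=20)
print(report)
```

Here the median gap matches ITL but one gap is eight times larger, which points at the streaming path rather than at decode.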

Putting the track together

You now have one coherent picture:

Hardware sets the rules. HBM bandwidth and the roofline decide which side of every kernel is the bottleneck (01).
The transformer forward is a chain of well-known kernels, each with a derivable byte and FLOP cost; the KV cache is what makes decode bandwidth-bound (02).
FlashAttention rescues the attention kernel's natural arithmetic intensity by streaming softmax (03).
Paged KV makes that kernel work on non-contiguous storage so the engine can use HBM efficiently (04).
Prefix caches and radix trees let many requests share the same KV (05).
Continuous batching, chunked prefill, and CUDA graphs make the scheduler's traffic look regular and cheap to launch (06).
Quantized GEMM, MoE kernels, and batched sampling attack what is left of the decode chain (07).
The serving framework turns HTTP requests into the well-shaped batches the previous layers can run, and decides where milliseconds actually live (08).

Final mental model

vLLM and SGLang are not competing kernels; they are competing compositions. Each box in lessons 01–07 is something both engines implement. The differences are emphasis: which routing strategy, which prefix cache structure, which backend matrix, which scheduler policy. Once you can read a config flag and trace it back to a box in this track, you have what you need to design your own.
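As an exercise in tracing flags back to boxes, here is a small lookup table. The flag names are given as I recall them from recent vLLM and SGLang releases and may differ in your installed version, so check each server's `--help` output; the lesson assignments are my own reading of the stage table, not something either project documents.

```python
# (engine, flag) -> (stage, lessons). Flag names are assumptions to verify
# against your installed version; lesson mappings are an interpretation.
FLAG_TO_BOX = {
    ("vllm", "--block-size"): ("KV memory", ["04"]),
    ("vllm", "--enable-prefix-caching"): ("KV memory", ["05"]),
    ("vllm", "--enable-chunked-prefill"): ("Scheduler", ["06"]),
    ("vllm", "--max-num-batched-tokens"): ("Scheduler", ["06"]),
    ("vllm", "--enforce-eager"): ("Other kernels", ["06"]),  # turns CUDA graphs off
    ("vllm", "--quantization"): ("Other kernels", ["07"]),
    ("sglang", "--mem-fraction-static"): ("KV memory", ["04"]),
    ("sglang", "--disable-radix-cache"): ("KV memory", ["05"]),
    ("sglang", "--chunked-prefill-size"): ("Scheduler", ["06"]),
    ("sglang", "--attention-backend"): ("Attention backend", ["03", "04"]),
    ("sglang", "--disable-cuda-graph"): ("Other kernels", ["06"]),
}


def trace_flag(engine, flag):
    """Return (stage, lessons) for a known flag, or None if it is not in the table."""
    return FLAG_TO_BOX.get((engine.lower(), flag))


print(trace_flag("vLLM", "--block-size"))
print(trace_flag("SGLang", "--attention-backend"))
```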

Where to go next

Run a profiler (NVIDIA Nsight Systems, or your engine's built-in timeline) on a real decode step and identify which lesson 02 row dominates.
Pick a knob (block size, prefix cache fraction, chunked prefill size, attention backend) and predict what will move before changing it. Compare to measurement.
For RL contexts, pay extra attention to lesson 05 (prefix reuse) and the RL row in the symptom table — rollout/trainer mismatch is its own beast.
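The check in the RL row of the symptom table (same tokens, compare per-token logprobs) can be sketched as follows. The function name and the sample values are hypothetical; the per-token ratio is included on the assumption that your algorithm weights samples by a trainer-to-rollout probability ratio.

```python
import math


def logprob_mismatch(rollout_logprobs, trainer_logprobs):
    """Summarize per-token disagreement between rollout and trainer logprobs."""
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("both sequences must score the same tokens")
    diffs = [abs(r - t) for r, t in zip(rollout_logprobs, trainer_logprobs)]
    # Per-token probability ratio trainer / rollout.
    ratios = [math.exp(t - r) for r, t in zip(rollout_logprobs, trainer_logprobs)]
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "min_ratio": min(ratios),
        "max_ratio": max(ratios),
    }


# Assumed logprobs for the same three tokens, scored by each side.
rollout = [-0.10, -2.30, -0.50]
trainer = [-0.10, -2.25, -0.52]

stats = logprob_mismatch(rollout, trainer)
print({name: round(value, 4) for name, value in stats.items()})
```

A ratio that drifts away from 1.0 on identical tokens is the signal to align backends or switch to a mismatch-aware algorithm.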