
Serving framework: synthesis

The kernel hot loop sits inside a much larger pipeline: HTTP → tokenize → queue → schedule → worker → sample → detokenize → stream. This lesson closes the chain. It also lets you read vLLM and SGLang configurations as different compositions of the same primitives we built in lessons 01–07.

The question this lesson answers

If a teammate says "decode is slow" or "TTFT is high under load," which knob do you turn first? The answer depends on which stage actually owns the time — and most of those stages are not kernels at all. This lesson installs the framework-level mental model and the debugging discipline that walks symptoms back to a specific lesson in this track.

End-to-end request path

  1. HTTP API: OpenAI-compatible
  2. tokenize: validate, hash
  3. router: cache-aware?
  4. queue: priority / FIFO
  5. scheduler: batch / chunk
  6. worker · kernels: attention, GEMM, …
  7. sampling: top-k/p, RNG (lesson 07)
  8. detokenize: ids → text
  9. stream chunk: SSE / gRPC
  10. stop check: EOS, max, stop
  11. free KV: decrement refcounts

The tokens-not-done loop runs from the stop check back to the worker; free KV follows once the loop exits.

Stages 1–5 happen once per request. Stages 6–10 happen once per token. Anything on the per-token path multiplies by output length, which is why "small" sampling or detokenization costs add up. Kernels live in one box. Everything else is framework. Most "slow" reports point at framework, not kernel.
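
The multiplication by output length can be made concrete with a minimal sketch. The stage names follow the path above, but every millisecond value is an invented placeholder rather than a measurement, and prefill is ignored to keep the focus on the per-token loop.

```python
# Hypothetical per-stage costs in milliseconds; all values are illustrative only.
PER_REQUEST_MS = {
    "http": 1.0, "tokenize": 2.0, "router": 0.5, "queue": 20.0, "scheduler": 0.5,
}
PER_TOKEN_MS = {
    "kernels": 20.0, "sampling": 0.3, "detokenize": 0.2, "stream": 0.4, "stop_check": 0.1,
}


def request_latency_ms(output_tokens: int) -> float:
    """Once-per-request stages plus per-token stages times output length."""
    return sum(PER_REQUEST_MS.values()) + output_tokens * sum(PER_TOKEN_MS.values())


def framework_per_token_ms(output_tokens: int) -> float:
    """Time spent in per-token stages other than the kernels themselves."""
    non_kernel = sum(ms for name, ms in PER_TOKEN_MS.items() if name != "kernels")
    return output_tokens * non_kernel


if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, round(request_latency_ms(n), 1), round(framework_per_token_ms(n), 1))
```

With these placeholder numbers, about 1 ms of non-kernel work per token grows to roughly 200 ms over a 200-token response, which is more than all once-per-request stages combined.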

What lives where: vLLM vs SGLang as compositions

Both engines implement every box above; they just emphasize different ones. The table below identifies which lesson's mechanism each engine pushes hardest. Read each cell as where that engine's design leans, not as a feature one engine has and the other lacks.

| Stage | vLLM emphasis | SGLang emphasis | From lesson… |
|---|---|---|---|
| API / entry | OpenAI-compatible HTTP server | OpenAI-compatible HTTP + native SGL DSL | |
| Router | External (your load balancer) | Built-in Model Gateway with cache-aware routing | 05 |
| KV memory | Paged KV cache + block manager | Paged KV pool + radix prefix cache | 04, 05 |
| Scheduler | Continuous batching, chunked prefill, token budget | Continuous batching, prefix-aware scheduling, branch-aware | 06 |
| Attention backend | FlashAttention / FlashInfer / Triton / MLA family | FlashInfer / FA3 / Triton / FlashMLA | 03, 04 |
| Other kernels | Marlin/Machete quant, GPU sampling, CUDA graphs | FlashInfer ops, GPU sampling, CUDA graphs | 07, 06 |
| Multi-process arch | API server / engine core / GPU worker / DP coord | Gateway / worker servers | |

Don't pick by mascot

"Which engine is faster" is the wrong question. Ask "which stage is my bottleneck?" Independent traffic with diverse prompts often favors vLLM-style paged scheduling. Branching, prefix-heavy workloads (tool agents, RAG with fixed docs, RL) often favor SGLang-style cache-aware routing. Both can be tuned to win in the other's regime; both can be misconfigured to lose.

Where the milliseconds actually go

A realistic latency budget for a chat request with a 2K prompt and 200-token response, on a healthy serving cluster:

| Stage | Typical cost | What dominates it | Where it lives in this track |
|---|---|---|---|
| API + tokenize | 1–5 ms | HTTP parsing, tokenizer speed, multimodal preprocessing | |
| Queue | 0–500 ms | Admission, concurrency limit, routing decisions | |
| Prefill | 20–200 ms | Prompt length, prefix cache hit rate, chunking | 03, 05, 06 |
| Decode (200 tokens) | 200 × (10–50 ms) | KV bandwidth, weight bandwidth, batch size, graphs | 02, 04, 06, 07 |
| Stream + detokenize | ~constant per chunk | SSE flushing, detokenizer, client backpressure | |

Two numbers worth memorizing: time-to-first-token (TTFT) ≈ queue + prefill; inter-token latency (ITL) ≈ decode-step time. Different users care about different ones — interactive chat hates high TTFT; long-form generation hates high ITL.
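
As a worked example of these two approximations, the sketch below plugs in the low and high ends of the ranges from the budget table. The constant and function names are made up for illustration; only the numeric ranges come from the table.

```python
# (low, high) bounds in milliseconds, copied from the latency budget table.
QUEUE_MS = (0.0, 500.0)
PREFILL_MS = (20.0, 200.0)
DECODE_STEP_MS = (10.0, 50.0)   # one decode step, i.e. roughly the ITL
OUTPUT_TOKENS = 200


def ttft_bounds_ms():
    """TTFT is approximated as queue + prefill, at both ends of the range."""
    return (QUEUE_MS[0] + PREFILL_MS[0], QUEUE_MS[1] + PREFILL_MS[1])


def decode_bounds_ms(tokens=OUTPUT_TOKENS):
    """Total decode time is approximated as tokens * ITL."""
    return (tokens * DECODE_STEP_MS[0], tokens * DECODE_STEP_MS[1])


if __name__ == "__main__":
    print("TTFT bounds (ms):", ttft_bounds_ms())
    print("decode bounds (ms):", decode_bounds_ms())
```

Under these ranges TTFT falls between 20 ms and 700 ms, while the 200-token decode takes 2 to 10 seconds. The decode phase dominates total wall time even when TTFT is at its worst, yet TTFT is what an interactive user notices first.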

A decision tree for "decode is slow"

User says "decode is slow" → profile first; do not guess.

  1. Large GPU idle gaps? → CPU/scheduler/launch (lesson 06) → turn on graphs & padding: capture stable shapes
  2. HBM saturated? → KV/weights (lessons 02, 04, 07) → reduce bytes: prefix reuse, KV dtype, weight quant
  3. Tensor cores idle? → GEMM/backend shape (lessons 03, 07) → change backend / shape: FlashInfer / FA path / quant kernel

If the symptom is TTFT or queueing instead, take the side branch. Queue or TTFT high? That is a different question: routing, admission, chunked prefill, scaling.

The tree's three branches correspond to the three terms in lesson 01's roofline: launch, memory, compute. Every fix is a lever introduced earlier in the track.
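
The tree can be written down as a small triage function. This is a sketch under stated assumptions: the `DecodeProfile` fields and all three thresholds are hypothetical placeholders, not values from any profiler or from this track, and would need calibrating against your own hardware.

```python
from dataclasses import dataclass


@dataclass
class DecodeProfile:
    """Hypothetical summary of one profiled decode window (fractions in 0..1)."""
    gpu_idle_fraction: float        # share of wall time with no kernel running
    hbm_bw_utilization: float       # achieved / peak memory bandwidth
    tensor_core_utilization: float  # share of time tensor cores are busy


def triage(p: DecodeProfile,
           idle_thresh: float = 0.3,   # placeholder threshold (assumption)
           hbm_thresh: float = 0.8,    # placeholder threshold (assumption)
           tc_thresh: float = 0.3) -> str:  # placeholder threshold (assumption)
    """Walk the three branches in order: launch, memory, compute."""
    if p.gpu_idle_fraction > idle_thresh:
        return "launch: turn on graphs and padding (lesson 06)"
    if p.hbm_bw_utilization > hbm_thresh:
        return "memory: reduce bytes via prefix reuse, KV dtype, weight quant (lessons 02, 04, 07)"
    if p.tensor_core_utilization < tc_thresh:
        return "compute: change backend or shape (lessons 03, 07)"
    return "no clear bottleneck: profile a longer window"


if __name__ == "__main__":
    print(triage(DecodeProfile(0.45, 0.40, 0.10)))
```

Checking idle gaps first is a design choice in this sketch: a GPU that sits idle between kernels will also show low bandwidth and tensor-core utilization, so the launch branch has to be ruled out before the other two readings mean much.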

Common symptom → first-suspect table

| Symptom | First suspect | Check | Lever |
|---|---|---|---|
| Inter-token latency higher than the lesson-02 lower bound | Eager dispatch overhead | kernels/step × launch time vs total | CUDA graphs (06) |
| Throughput plateaus at small batch despite spare HBM | Kernel-launch-bound decode | Per-step kernel count, batch padding shapes | Continuous batching + graphs (06) |
| TTFT high during traffic spikes | Queue + admission | Queue depth, router load distribution | Cache-aware routing (05), scale out |
| One large prompt freezes other users | Unchunked prefill | Per-step time spikes | Chunked prefill, PD disaggregation (06) |
| HBM near full at modest concurrency | Contiguous KV or low pool fraction | Block-pool occupancy | Paged KV with sensible block size (04) |
| Quantized model not faster than bf16 | Dequant overhead or invalid backend | Profile GEMM kernel name + shape | Switch backend, check group size (07) |
| MoE latency 2× higher than dense | Routing imbalance / all-to-all | Per-expert token counts | Capacity factor, EP topology (07) |
| RL rollouts give different logprobs than training | Backend / shape mismatch between rollout and trainer | Same tokens, compare per-token logprobs | Align backends; mismatch-aware algorithm |
| Streaming feels jerky | Detokenization or network | Time between flushed chunks vs ITL | Batch detokenizer, server-side flush policy |
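
The check in the first row, kernels/step × launch time vs total, is simple arithmetic. The sketch below is a rough first-pass estimate that assumes launches are issued eagerly and one after another; in practice launches can overlap with GPU execution, so treat the result as an indicator, not a measurement. The function name and the example numbers are hypothetical.

```python
def launch_overhead_fraction(kernels_per_step: int,
                             launch_us: float,
                             step_ms: float) -> float:
    """Estimated share of one decode step spent on kernel launches.

    kernels_per_step: kernels launched per decode step (from a profiler trace)
    launch_us:        assumed CPU-side cost of one launch, in microseconds
    step_ms:          measured wall time of one decode step, in milliseconds
    """
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    launch_ms = kernels_per_step * launch_us / 1000.0
    return launch_ms / step_ms


if __name__ == "__main__":
    # Illustrative inputs only: 400 kernels, 5 us per launch, 10 ms per step.
    print(launch_overhead_fraction(400, 5.0, 10.0))
```

A large fraction points at the CUDA graphs lever from lesson 06; a small one suggests moving on to the memory and compute rows.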

Interactive · end-to-end stage attribution

Set per-stage millisecond costs for one request. The widget identifies the largest stage and tells you which lesson's lever to pull first.

Where does the request spend its time?

Try TTFT-heavy (big queue + prefill) vs streaming-heavy (200+ decode tokens) configurations.
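
The same attribution can be reproduced offline. This is a minimal sketch, not the page's widget: the stage-to-lesson mapping is copied from the latency budget table, while the stage keys, the function name, and the two example configurations are made up for illustration.

```python
from typing import Dict, List, Tuple

# Stage -> lessons, taken from the latency budget table.
# An empty list means the table names no lesson: the fix is framework-level.
STAGE_LESSONS: Dict[str, List[str]] = {
    "api_tokenize": [],
    "queue": [],
    "prefill": ["03", "05", "06"],
    "decode": ["02", "04", "06", "07"],
    "stream_detokenize": [],
}


def attribute(stage_ms: Dict[str, float]) -> Tuple[str, float, List[str]]:
    """Return (largest stage, its share of total time, lessons to revisit)."""
    unknown = set(stage_ms) - set(STAGE_LESSONS)
    if unknown:
        raise KeyError(f"unknown stages: {sorted(unknown)}")
    total = sum(stage_ms.values())
    stage = max(stage_ms, key=stage_ms.get)
    share = stage_ms[stage] / total if total > 0 else 0.0
    return stage, share, STAGE_LESSONS[stage]


if __name__ == "__main__":
    ttft_heavy = {"api_tokenize": 3, "queue": 400, "prefill": 150,
                  "decode": 100, "stream_detokenize": 5}
    streaming_heavy = {"api_tokenize": 3, "queue": 10, "prefill": 50,
                       "decode": 200 * 20, "stream_detokenize": 40}
    print(attribute(ttft_heavy))
    print(attribute(streaming_heavy))
```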

Putting the track together

You now have one coherent picture:

  1. Hardware sets the rules. HBM bandwidth and the roofline decide which side of every kernel is the bottleneck (01).
  2. The transformer forward is a chain of well-known kernels, each with a derivable byte and FLOP cost; the KV cache is what makes decode bandwidth-bound (02).
  3. FlashAttention rescues the attention kernel's natural arithmetic intensity by streaming softmax (03).
  4. Paged KV makes that kernel work on non-contiguous storage so the engine can use HBM efficiently (04).
  5. Prefix caches and radix trees let many requests share the same KV (05).
  6. Continuous batching, chunked prefill, and CUDA graphs make the scheduler's traffic look regular and cheap to launch (06).
  7. Quantized GEMM, MoE kernels, and batched sampling attack what is left of the decode chain (07).
  8. The serving framework turns HTTP requests into the well-shaped batches the previous layers can run, and decides where milliseconds actually live (08).

Final mental model

vLLM and SGLang are not competing kernels; they are competing compositions. Each box in lessons 01–07 is something both engines implement. The differences are emphasis: which routing strategy, which prefix cache structure, which backend matrix, which scheduler policy. Once you can read a config flag and trace it back to a box in this track, you have what you need to design your own.

Where to go next