SGLang, from first principles
A linearized tour of the serving framework that treats the program — not the single forward pass — as the unit of work. Each lesson is justified from a workload, then from arithmetic, then from a picture.
The thesis SGLang is built around
vLLM, TensorRT-LLM, and most "serving" frameworks optimize the unit of work that arrives at the HTTP boundary: one prompt → one completion. SGLang's bet is that the real unit of work is a program made of many model calls — agents that take 30 turns, tree-search planners that fork 32 branches off a shared prefix, RAG pipelines that share a 4K system prompt across every retrieval. Once you commit to that view, three optimizations stop being optional:
Lessons 01–02 motivate the DSL. Lessons 03–05 build RadixAttention and the scheduler that drives it. Lessons 06–07 build constrained decoding from a finite automaton up. Lessons 08–11 cover the kernel, parallelism, and speculative-decoding stack, then close with a head-to-head against vLLM.
Part I · Why the program is the unit of work
Part II · RadixAttention and the scheduler
Part III · Structured decoding
Part IV · The runtime stack
Part V · Synthesis
Optimization ranking, by throughput impact
| Rank | Technique | Lesson | Effect and mechanism |
|---|---|---|---|
| 1 | RadixAttention | 04 | 2–6× on multi-call workloads via prefix reuse |
| 2 | Cache-aware scheduling | 05 | 30–60% higher hit rate vs FCFS at same cache size |
| 3 | Continuous batching + paged KV | 03 | The baseline every modern engine ships |
| 4 | FlashInfer attention | 08 | Ragged + paged + MLA in one kernel family |
| 5 | Compressed FSM decoding | 06 | 2× on heavy-JSON workloads via fast-forward |
| 6 | CUDA graph decode | 08 | Removes Python launch overhead at small batch |
| 7 | EAGLE speculative decoding | 10 | ~2× decode tokens/sec at acceptance ≥ 0.6 |
| 8 | xgrammar CFG masks | 07 | 10–100× cheaper masks than naive grammar parsing |
| 9 | DP attention for MLA | 09 | Unlocks DeepSeek-V3 with full KV reuse per rank |
| 10 | Expert parallelism | 09 | Fits 256 experts across 8 GPUs without replicating expert weights |
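The prefix reuse behind the top-ranked row comes down to one question per request: how many leading tokens are already cached? The sketch below answers it with a toy token-level trie and counts prefill tokens for a tree-search workload that forks 32 branches off a shared prefix. Everything here is an illustrative assumption: `ToyPrefixCache`, `prefill_cost`, the 1000-token prefix, and the 20-token branch suffixes are hypothetical, and a real radix tree would compress edges, hold KV block references, and evict entries. It counts prefill tokens only, not end-to-end throughput.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # One child per next token ID. A real radix tree would compress
    # single-child chains into one edge and attach KV block handles.
    children: dict[int, "Node"] = field(default_factory=dict)


class ToyPrefixCache:
    """Hypothetical token-level prefix index (illustrative only)."""

    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record a sequence so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


def prefill_cost(requests: list[list[int]]) -> tuple[int, int]:
    """Tokens prefilled without reuse vs. with reuse, in arrival order."""
    cache = ToyPrefixCache()
    naive = reused = 0
    for req in requests:
        naive += len(req)
        reused += len(req) - cache.match_prefix(req)
        cache.insert(req)
    return naive, reused


# 32 branches forked off a shared 1000-token prefix, 20 unique tokens each.
shared = list(range(1000))
branches = [shared + [10_000 + b * 100 + i for i in range(20)] for b in range(32)]
naive, reused = prefill_cost(branches)
print(naive, reused)  # 32640 1640
```

In this toy setup only the first branch pays for the shared prefix; the other 31 prefill just their 20 unique tokens.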
Common misconceptions
- RadixAttention ≠ FlashAttention. RadixAttention is a data structure over KV blocks. The attention kernel that reads those blocks is FlashInfer (or FlashAttention-3). The two are orthogonal and ship together.
- RadixAttention ≠ vLLM's automatic prefix cache. Both reuse prefixes. vLLM hashes block-aligned prefixes into a hashmap; SGLang stores them in a radix tree. The tree captures partial overlaps that the hashmap misses.
- SGLang's DSL isn't a separate model. It's plain Python that records calls. The runtime sees the recording — the model itself is unchanged.
- Constrained decoding doesn't change sampling. The mask sets the logits of illegal tokens to −∞ before the softmax, so they get probability zero and the legal tokens' probabilities are renormalized. Greedy, temperature, and top-p sampling all still apply to the masked distribution.
- Speculative decoding is exact. The verification rule (rejection sampling against the target) preserves the target model's output distribution. EAGLE adds a learned draft head, not a different acceptance rule.
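The masking step in the constrained-decoding item above can be sketched in a few lines. This is a minimal pure-Python illustration, not SGLang's or xgrammar's code: `masked_probs`, the four-token vocabulary, and the allowed set are all hypothetical. It shows that masking to −∞ before the softmax gives illegal tokens probability zero while preserving the relative odds among legal tokens.

```python
import math
import random


def masked_probs(
    logits: list[float], allowed: set[int], temperature: float = 1.0
) -> list[float]:
    """Softmax over logits after setting disallowed tokens to -inf."""
    if not allowed or not all(0 <= i < len(logits) for i in allowed):
        raise ValueError("mask must allow at least one in-range token")
    masked = [
        (x / temperature if i in allowed else -math.inf)
        for i, x in enumerate(logits)
    ]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0, -1.0]  # toy 4-token vocabulary
allowed = {1, 2}                # tokens the grammar permits at this step

probs = masked_probs(logits, allowed)

# Greedy decoding picks the best legal token, not the global argmax (token 0).
greedy = max(range(len(probs)), key=probs.__getitem__)

# Stochastic sampling draws from the same masked distribution.
rng = random.Random(0)
sampled = rng.choices(range(len(probs)), weights=probs, k=1)[0]

print(probs, greedy, sampled)
```

The ratio `probs[1] / probs[2]` equals `exp(logits[1] - logits[2])`, the same ratio the unmasked softmax would give, which is why the usual sampling strategies carry over unchanged.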
How to use this
- Linear is the recommended path. Lessons 01–05 are tightly coupled — 04 only makes sense after 03's framing of the prefix problem, and 05 needs 04's tree to schedule on. 06–11 stand alone after that.
- Touch the widgets. Each lesson has one knob whose extreme settings break the system. Find the break; that's the lesson.
- Read the source second. When a lesson references a file path in SGLang, read it after the lesson. The lesson tells you why each line exists; the code tells you exactly how.