11 lessons · ~3h read

SGLang, from first principles

A linearized tour of the serving framework that treats the program — not the single forward pass — as the unit of work. Each lesson is justified from a workload, then from arithmetic, then from a picture.

Who this is for
You've served an LLM, you know roughly what a KV cache is, and you've heard the names SGLang, vLLM, RadixAttention, and FlashInfer thrown around without their meanings fully clicking. By the end of this series you should be able to read SGLang's scheduler and runtime source, predict its behavior on your workload, and explain, in concrete bytes per second, why it wins (or doesn't) against vLLM for the workload you actually run.

The thesis SGLang is built around

vLLM, TensorRT-LLM, and most "serving" frameworks optimize the unit of work that arrives at the HTTP boundary: one prompt → one completion. SGLang's bet is that the real unit of work is a program made of many model calls — agents that take 30 turns, tree-search planners that fork 32 branches off a shared prefix, RAG pipelines that share a 4K system prompt across every retrieval. Once you commit to that view, three optimizations stop being optional:

FRONTEND DSL: gen · select · fork · system. Programs over calls, not strings, because the runtime sees program structure, not a stateless POST.
RADIXATTENTION: radix tree over token sequences. KV blocks shared by any prefix, because prefixes form a tree; every call's KV is reusable for some other call.
STRUCTURED DECODING: compressed FSM + xgrammar. JSON / regex / CFG at sample-time, because most agent outputs aren't free text; JSON masks belong in the sampler.

Lessons 01–02 motivate the DSL. Lessons 03–05 build RadixAttention and the scheduler that drives it. Lessons 06–07 build constrained decoding from a finite automaton up. Lessons 08–11 cover the kernel, parallelism, and speculative-decoding stack, then close with a head-to-head against vLLM.

Part I · Why the program is the unit of work

01
The workload that demands a new framework
Agents, tree-of-thought, self-consistency, RAG. Quantify how much of every modern workload is redundant prefill on a shared prefix. The single-call model leaves 50–90% of compute on the floor.
02
The frontend DSL — programs over LLM calls
gen, select, fork, system. A minimal embedded Python DSL whose IR the runtime can see. Why a stateless POST cannot express what a tree-search agent is doing.
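
The redundancy claim in lesson 01 is easy to check by hand. The sketch below is back-of-envelope arithmetic, not a measurement: it counts prefill tokens only (attention cost grows faster than token count), and the function name and workload numbers are illustrative assumptions.

```python
def redundant_prefill_fraction(prefix_tokens: int, unique_tokens: int, calls: int) -> float:
    """Fraction of prefill tokens that recompute an already-seen shared prefix."""
    # Single-call model: every call prefills the shared prefix again.
    total = calls * (prefix_tokens + unique_tokens)
    # With perfect prefix reuse, the shared prefix is prefilled once.
    needed = prefix_tokens + calls * unique_tokens
    return (total - needed) / total

# Assumed workload: 32 branches forked off a 4096-token prefix,
# each adding 512 tokens of its own.
frac = redundant_prefill_fraction(prefix_tokens=4096, unique_tokens=512, calls=32)
print(f"{frac:.1%} of prefill tokens are redundant")  # 86.1%
```

Shrink the shared prefix or the number of calls and the fraction falls quickly; with a single call it is exactly zero, which is the regime a one-prompt, one-completion engine is designed for.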

Part II · RadixAttention and the scheduler

03
KV recap and the prefix-sharing problem
Bytes per token, paging, why vLLM's hashmap prefix cache misses sharing across different system prompts. The data structure we actually want is a trie of tokens.
04
RadixAttention — the radix tree of KV
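Compressed trie over token sequences. Each node carries a refcount and KV block pointers. Insert and lookup cost time proportional to the length of the token sequence, not to the number of cached sequences, and eviction proceeds from the leaves. The KV pool is unchanged; only the index over it is new.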
05
Cache-aware scheduling — order is the optimization
FCFS leaves cache hits on the floor. Longest-prefix-match ordering, LRU eviction at the leaves, and the fairness / hit-rate trade-off. Why the scheduler and the cache are one system, not two.
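
To make lessons 04 and 05 concrete, here is a minimal sketch of a compressed trie over token IDs plus longest-prefix-match ordering. It is not SGLang's implementation: the names `Node`, `RadixIndex`, and `match_prefix` are hypothetical, and refcounts, KV block pointers, and eviction are left out so the index logic stays visible.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tokens: tuple = ()                            # edge label leading into this node
    children: dict = field(default_factory=dict)  # first token of edge -> Node

def _common_len(a, b):
    """Length of the shared prefix of two token tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class RadixIndex:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already indexed."""
        tokens, node, matched = tuple(tokens), self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            k = _common_len(child.tokens, tokens[matched:])
            matched += k
            if k < len(child.tokens):
                break                             # diverged partway along an edge
            node = child
        return matched

    def insert(self, tokens):
        tokens, node, i = tuple(tokens), self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens[i:])
                return
            k = _common_len(child.tokens, tokens[i:])
            if k < len(child.tokens):
                # Split the edge at the divergence point.
                mid = Node(child.tokens[:k], {child.tokens[k]: child})
                child.tokens = child.tokens[k:]
                node.children[tokens[i]] = mid
                child = mid
            node, i = child, i + k

index = RadixIndex()
index.insert([1, 2, 3, 4, 5])   # a finished request left its KV in the cache
waiting = {"a": [9, 9, 9], "b": [1, 2, 3, 7], "c": [1, 2, 3, 4, 5, 6]}
# Cache-aware ordering: serve the longest cached prefix first.
order = sorted(waiting, key=lambda r: index.match_prefix(waiting[r]), reverse=True)
print(order)  # ['c', 'b', 'a']
```

An FCFS scheduler would run "a" first and gain nothing from the cache. Sorting by matched length runs the requests that reuse the most KV while that KV is still resident, which is the coupling between scheduler and cache that lesson 05 argues for.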

Part III · Structured decoding

06
Constrained decoding — regex, FSMs, compressed FSM
Mask the logits at every step to keep the output on a finite automaton. Then notice that most FSM states are deterministic — fast-forward through them. Up to 2× speedup on JSON.
07
xgrammar — context-free grammars at sample-time
Regex isn't enough for nested JSON, balanced brackets, or SQL. Pushdown automata plus a token-aligned bitmask cache. The bitmask is per-state, vocab-sized, and built once.
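
The fast-forward idea in lesson 06 fits in a toy example. The sketch below assumes a character-level "vocabulary" and a trie-shaped automaton; real engines mask token logits and compile the automaton from a regex or schema. `build_fsm`, `constrained_decode`, and `pick` are hypothetical names, and `pick` stands in for masked sampling from the model.

```python
import random

def build_fsm(strings):
    """Build a trie-shaped FSM that accepts exactly the given strings."""
    fsm, next_id = {0: {}}, 1
    for s in strings:
        state = 0
        for ch in s:
            if ch not in fsm[state]:
                fsm[state][ch] = next_id
                fsm[next_id] = {}
                next_id += 1
            state = fsm[state][ch]
    return fsm

def constrained_decode(fsm, pick, fast_forward=True):
    """Decode until a state with no outgoing edges; return (text, model_calls)."""
    state, out, calls = 0, [], 0
    while fsm[state]:
        allowed = fsm[state]               # the mask: only these symbols are legal
        if fast_forward and len(allowed) == 1:
            ch = next(iter(allowed))       # forced transition, no model call
        else:
            ch = pick(sorted(allowed))     # stand-in for sampling under the mask
            calls += 1
        out.append(ch)
        state = allowed[ch]
    return "".join(out), calls

fsm = build_fsm(['{"ok":true}', '{"ok":false}'])
rng = random.Random(0)
text, calls = constrained_decode(fsm, rng.choice)
naive_text, naive_calls = constrained_decode(fsm, rng.choice, fast_forward=False)
print(text, calls, naive_calls)
```

With fast-forward enabled, the only model call happens at the single branching state (true versus false); without it, every character costs a call. Real JSON has far more branching than this toy, so this ratio is not a prediction of real speedups.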

Part IV · The runtime stack

08
The kernel stack — FlashInfer, Triton, CUDA graphs
SGLang's attention is FlashInfer (ragged + paged + MLA-aware). Sampling is Triton. Decode steps are CUDA-graphed. Each choice is a measured response to the bottleneck the previous one exposed.
09
Tensor, data, and expert parallelism
TP shards the linear layers. DP attention replicates the KV path per rank — the right answer for MLA. EP shards experts across GPUs. How DeepSeek-V3/R1 fits on 8× H200 in practice.
10
Speculative decoding — EAGLE trees in SGLang
A draft model proposes K tokens; the target verifies them in one forward pass. EAGLE-style tree verification accepts more tokens per step. The arithmetic of when this is worth it.
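
The "when is it worth it" arithmetic for lesson 10 can be sketched with the standard chain-draft model. Two assumptions are baked in: each drafted token is accepted independently with probability `alpha`, and one draft step costs `draft_cost` target forward passes. Tree verification in the EAGLE style aims to beat this chain estimate, so treat the numbers as a baseline rather than a description of SGLang's behavior.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k drafted tokens."""
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per unit of target-model time, relative to plain decoding."""
    return expected_tokens_per_step(alpha, k) / (1 + k * draft_cost)

# Assumed values: acceptance 0.6, 4 drafted tokens, draft step = 5% of a target step.
print(f"{expected_tokens_per_step(0.6, 4):.2f} tokens per step")  # 2.31
print(f"{speedup(0.6, 4, 0.05):.2f}x decode speedup")             # 1.92x
```

Raising `k` helps only while `alpha ** k` is still meaningfully above zero; past that point the extra draft steps add cost without adding accepted tokens.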

Part V · Synthesis

11
SGLang vs vLLM — picking a side for your workload
Two frameworks, two theses. vLLM optimizes one call. SGLang optimizes the program. Where each wins, where neither does, and how to read a benchmark that claims one is "faster" than the other.

Optimization ranking, by throughput impact

Rank | Technique | Lesson | Mechanism
1 | RadixAttention | 04 | 2–6× on multi-call workloads via prefix reuse
2 | Cache-aware scheduling | 05 | 30–60% higher hit rate vs FCFS at same cache size
3 | Continuous batching + paged KV | 03 | The baseline every modern engine ships
4 | FlashInfer attention | 08 | Ragged + paged + MLA in one kernel family
5 | Compressed FSM decoding | 06 | 2× on heavy-JSON workloads via fast-forward
6 | CUDA graph decode | 08 | Removes Python launch overhead at small batch
7 | EAGLE speculative decoding | 10 | ~2× decode tokens/sec at acceptance ≥ 0.6
8 | xgrammar CFG masks | 07 | 10–100× cheaper masks than naive grammar parsing
9 | DP attention for MLA | 09 | Unlocks DeepSeek-V3 with full KV reuse per rank
10 | Expert parallelism | 09 | Fits 256 experts across 8 GPUs without weight-replication

How to use this

  1. Linear is the recommended path. Lessons 01–05 are tightly coupled — 04 only makes sense after 03's framing of the prefix problem, and 05 needs 04's tree to schedule on. 06–11 stand alone after that.
  2. Touch the widgets. Each lesson has one knob whose extreme settings break the system. Find the break; that's the lesson.
  3. Read the source second. When a lesson references a file path in SGLang, read it after the lesson. The lesson tells you why each line exists; the code tells you exactly how.
Prerequisites
If "bytes per token" or "paged KV" sound vague, read vLLM lesson 01 and vLLM lesson 02 first — they take 20 minutes and make everything here load-bearing instead of mysterious.
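
As a quick self-check on "bytes per token", the usual accounting for a standard (non-MLA) attention stack is sketched below. The model shape is an assumed example in the style of an 8B model with grouped-query attention, not a figure taken from the lessons.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache bytes stored for one token across all layers."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Assumed shape: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token // 1024, "KiB per token")                    # 128 KiB
print(per_token * 4096 // 2**20, "MiB for a 4K-token prefix")  # 512 MiB
```

If multiplying that per-token figure by your prompt length and concurrency feels routine, you are ready for lesson 03.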