SGLang, from first principles

A linearized tour of the serving framework that treats the program — not the single forward pass — as the unit of work. Each lesson is justified from a workload, then from arithmetic, then from a picture.

Who this is for

You've served an LLM, you know roughly what a KV cache is, and you've heard the names SGLang, vLLM, RadixAttention, and FlashInfer thrown around without their meanings fully clicking. By the end of this series you should be able to read SGLang's scheduler and runtime source, predict its behavior on your workload, and explain, in concrete bytes per second, why it wins (or doesn't) against vLLM for the workload you actually run.

The thesis SGLang is built around

vLLM, TensorRT-LLM, and most "serving" frameworks optimize the unit of work that arrives at the HTTP boundary: one prompt → one completion. SGLang's bet is that the real unit of work is a program made of many model calls — agents that take 30 turns, tree-search planners that fork 32 branches off a shared prefix, RAG pipelines that share a 4K system prompt across every retrieval. Once you commit to that view, three optimizations stop being optional:

Lessons 01–02 motivate the DSL. Lessons 03–05 build RadixAttention and the scheduler that drives it. Lessons 06–07 build constrained decoding from a finite automaton up. Lessons 08–11 cover the kernel, parallelism, and speculative-decoding stack, then close with a head-to-head against vLLM.

Part I · Why the program is the unit of work

The workload that demands a new framework

Agents, tree-of-thought, self-consistency, RAG. Quantify how much of every modern workload is redundant prefill on a shared prefix. The single-call model leaves 50–90% of compute on the floor.
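
To make the redundancy concrete, here is a back-of-the-envelope sketch. The workload shape (32 branches, a 4,096-token shared prefix, 512 unique tokens per branch) is an illustrative assumption, not a measurement, and `redundant_prefill_fraction` is a hypothetical helper.

```python
def redundant_prefill_fraction(shared_prefix: int, unique_per_call: int, calls: int) -> float:
    """Fraction of prefill tokens that recompute an already-seen shared prefix."""
    # Without prefix reuse, every call prefills the shared prefix plus its own tokens.
    total = calls * (shared_prefix + unique_per_call)
    # With perfect reuse, the shared prefix is prefilled exactly once.
    needed = shared_prefix + calls * unique_per_call
    return (total - needed) / total


frac = redundant_prefill_fraction(shared_prefix=4096, unique_per_call=512, calls=32)
print(f"{frac:.1%} of prefill tokens are redundant")
```

Under these assumed numbers the result sits inside the 50–90% range quoted above; shrinking the shared prefix or the number of calls pulls it toward zero.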

The frontend DSL — programs over LLM calls

gen, select, fork, system. A minimal embedded Python DSL whose IR the runtime can see. Why a stateless POST cannot express what a tree-search agent is doing.
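
The idea of a DSL whose IR the runtime can see can be illustrated with a toy recorder. This is not SGLang's API: `Program`, `Op`, and the method signatures below are hypothetical and only mirror the four primitives named above.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    kind: str        # "system", "gen", "select", or "fork"
    payload: object  # arguments of the recorded call


@dataclass
class Program:
    """Toy recorder: each primitive appends to an op list instead of calling a model."""
    ops: list = field(default_factory=list)

    def system(self, text: str) -> "Program":
        self.ops.append(Op("system", text))
        return self

    def gen(self, name: str, max_tokens: int = 64) -> "Program":
        self.ops.append(Op("gen", {"name": name, "max_tokens": max_tokens}))
        return self

    def select(self, name: str, choices: list) -> "Program":
        self.ops.append(Op("select", {"name": name, "choices": list(choices)}))
        return self

    def fork(self, n: int) -> list:
        # Each branch starts from a copy of the ops so far, so the shared prefix is explicit.
        self.ops.append(Op("fork", n))
        return [Program(ops=list(self.ops)) for _ in range(n)]


root = Program().system("You are a planner.").gen("plan")
branches = root.fork(3)
for i, branch in enumerate(branches):
    branch.gen(f"step_{i}").select(f"verdict_{i}", ["keep", "prune"])

shared = len(root.ops)  # number of ops every branch has in common
print([op.kind for op in branches[0].ops])
```

A runtime that receives this recording can see that all three branches share their first `shared` ops. A sequence of independent POSTs carries the same text but hides that structure.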

Part II · RadixAttention and the scheduler

KV recap and the prefix-sharing problem

Bytes per token, paging, why vLLM's hashmap prefix cache misses sharing across different system prompts. The data structure we actually want is a trie of tokens.
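
As a rough sketch of the bytes-per-token arithmetic for standard multi-head or grouped-query attention (not MLA): the model shape below (32 layers, 8 KV heads, head dimension 128, fp16) is an assumed example, and `kv_bytes_per_token` is a hypothetical helper.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor of 2: one K vector and one V vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Assumed shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
prefix_bytes = per_token * 4096  # KV footprint of one 4K-token prefix
print(per_token // 1024, "KiB per token;", prefix_bytes // 2**20, "MiB for a 4K prefix")
```

At that size, every request that recomputes or re-stores the same 4K prefix pays for it again, which is the prefix-sharing problem in one number.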

RadixAttention — the radix tree of KV

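Compressed trie over token sequences. Each node carries refcount + KV block pointers. Insert and lookup walk one edge per shared run of tokens, so their cost grows with the length of the sequence being matched rather than with the number of cached sequences; eviction removes unreferenced leaves. The KV pool is unchanged; only the index over it is new.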
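
A minimal sketch of such a tree, assuming tokens are plain integers and leaving out eviction and real KV storage. The class and method names are hypothetical and are not taken from SGLang's source.

```python
class Node:
    def __init__(self, key=()):
        self.key = tuple(key)  # run of tokens stored on the edge into this node
        self.children = {}     # first token of a child's key -> child Node
        self.ref_count = 0     # unused here; would pin the node against eviction
        self.kv_blocks = []    # unused here; would point into the KV pool


def _common_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class RadixTree:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already in the tree."""
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = _common_len(child.key, tokens[matched:])
            matched += n
            if n < len(child.key):  # diverged partway along this edge
                break
            node = child
        return matched

    def insert(self, tokens):
        """Insert a token sequence; return the number of newly added tokens."""
        tokens = tuple(tokens)
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens[i:])
                return len(tokens) - i
            n = _common_len(child.key, tokens[i:])
            if n < len(child.key):
                # Split the edge: the shared run moves into a new middle node.
                mid = Node(child.key[:n])
                child.key = child.key[n:]
                mid.children[child.key[0]] = child
                node.children[tokens[i]] = mid
                child = mid
            node, i = child, i + n
        return 0


tree = RadixTree()
system = [1, 2, 3, 4]
tree.insert(system + [10, 11])
added = tree.insert(system + [20])           # only the new suffix is added
hit = tree.match_prefix(system + [30, 31])   # the shared system prompt is found
print(added, hit)
```

The second insert splits the existing edge at the end of the shared system prompt, which is how two requests end up pointing at one copy of the prefix.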

Cache-aware scheduling — order is the optimization

FCFS leaves cache hits on the floor. Longest-prefix-match ordering, LRU eviction at the leaves, and the fairness / hit-rate trade-off. Why the scheduler and the cache are one system, not two.
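
Longest-prefix-match ordering can be sketched in a few lines. This toy version compares each waiting request against a flat list of cached sequences instead of a radix tree, and the function names are hypothetical.

```python
def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_cached_prefix(request, cached):
    # Best match of this request against anything currently cached.
    return max((shared_prefix_len(request, c) for c in cached), default=0)


def order_by_prefix_match(queue, cached):
    # sorted() is stable, so requests with equal match length keep arrival order.
    return sorted(queue, key=lambda r: -longest_cached_prefix(r, cached))


cached = [[1, 2, 3, 4, 5]]
queue = [[9, 9], [1, 2, 3, 7], [1, 2, 3, 4, 5, 6], [8]]  # arrival (FCFS) order
ordered = order_by_prefix_match(queue, cached)
print(ordered)
```

Requests with no cached prefix sink to the back of the queue. If hits keep arriving they can wait indefinitely, which is the fairness side of the trade-off.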

Part III · Structured decoding

Constrained decoding — regex, FSMs, compressed FSM

Mask the logits at every step to keep the output on a finite automaton. Then notice that most FSM states are deterministic — fast-forward through them. Up to 2× speedup on JSON.
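
Both halves of that idea fit in a toy example. The six-token vocabulary and hand-written automaton below are assumptions for illustration, and the fast-forward here works at token level, which is a simplification of what a production compressed FSM does.

```python
import math

# Toy vocabulary and FSM for the outputs {"ok":true} and {"ok":false}.
VOCAB = ['{"', 'ok', '":', 'true', 'false', '}']
FSM = {
    0: {'{"': 1},
    1: {'ok': 2},
    2: {'":': 3},
    3: {'true': 4, 'false': 4},  # the only state with a real choice
    4: {'}': 5},
    5: {},                       # accepting state, no outgoing edges
}


def mask_logits(logits, state):
    """Set the logits of tokens the FSM forbids in `state` to -inf."""
    allowed = FSM[state]
    return [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fast_forward(state):
    """Follow states with exactly one legal token; no model call is needed for them."""
    emitted = []
    while len(FSM[state]) == 1:
        (tok, nxt), = FSM[state].items()
        emitted.append(tok)
        state = nxt
    return emitted, state


forced, state = fast_forward(0)  # skips the deterministic prefix
probs = softmax(mask_logits([0.0, 1.0, 2.0, 0.5, 1.5, 3.0], state))
print(forced, state, probs)
```

Three of the five tokens in a valid output are emitted without running the model, and the one real decision is sampled from a distribution that only covers `true` and `false`.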

xgrammar — context-free grammars at sample-time

Regex isn't enough for nested JSON, balanced brackets, or SQL. Pushdown automata plus a token-aligned bitmask cache. The bitmask is per-state, vocab-sized, and built once.

Part IV · The runtime stack

The kernel stack — FlashInfer, Triton, CUDA graphs

SGLang's attention is FlashInfer (ragged + paged + MLA-aware). Sampling is Triton. Decode steps are CUDA-graphed. Each choice is a measured response to the bottleneck the previous one exposed.

Tensor, data, and expert parallelism

TP shards the linear layers. DP attention replicates the KV path per rank — the right answer for MLA. EP shards experts across GPUs. How DeepSeek-V3/R1 fits on 8× H200 in practice.

Speculative decoding — EAGLE trees in SGLang

A draft model proposes K tokens; the target verifies them in one forward pass. EAGLE-style tree verification accepts more tokens per step. The arithmetic of when this is worth it.
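
The "when is this worth it" arithmetic can be sketched for the simpler chain (non-tree) case. It assumes each drafted token is accepted independently with the same probability `alpha`, which real workloads only approximate; `draft_cost_ratio` (cost of one draft step relative to one target step) is a hypothetical parameter name.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k drafted tokens."""
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series 1 + alpha + ... + alpha**k: accepted drafts plus one target token.
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    # One verification pass plus k draft steps, measured in units of a target step.
    return expected_tokens_per_step(alpha, k) / (1 + k * draft_cost_ratio)


tokens = expected_tokens_per_step(alpha=0.6, k=4)
speedup = estimated_speedup(alpha=0.6, k=4, draft_cost_ratio=0.05)
print(f"{tokens:.2f} tokens per step, {speedup:.2f}x estimated speedup")
```

The estimate falls below 1 when acceptance is low or the draft is expensive, which is the regime where speculation costs more than it saves. Tree verification changes the numerator, not the shape of the trade-off.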

Part V · Synthesis

SGLang vs vLLM — picking a side for your workload

Two frameworks, two theses. vLLM optimizes one call. SGLang optimizes the program. Where each wins, where neither does, and how to read a benchmark that claims one is "faster" than the other.

Optimization ranking, by throughput impact

Rank	Technique	Lesson	Mechanism and impact
1	RadixAttention	04	2–6× on multi-call workloads via prefix reuse
2	Cache-aware scheduling	05	30–60% higher hit rate vs FCFS at same cache size
3	Continuous batching + paged KV	03	The baseline every modern engine ships
4	FlashInfer attention	08	Ragged + paged + MLA in one kernel family
5	Compressed FSM decoding	06	2× on heavy-JSON workloads via fast-forward
6	CUDA graph decode	08	Removes per-kernel CPU launch overhead at small batch
7	EAGLE speculative decoding	10	~2× decode tokens/sec at acceptance ≥ 0.6
8	xgrammar CFG masks	07	10–100× cheaper masks than naive grammar parsing
9	DP attention for MLA	09	Unlocks DeepSeek-V3 with full KV reuse per rank
10	Expert parallelism	09	Fits 256 experts across 8 GPUs without weight-replication

Common misconceptions

RadixAttention ≠ FlashAttention. RadixAttention is a data structure over KV blocks. The attention kernel that reads those blocks is FlashInfer (or FlashAttention-3). The two are orthogonal and ship together.
RadixAttention ≠ vLLM's automatic prefix cache. Both reuse prefixes. vLLM hashes block-aligned prefixes into a hashmap; SGLang stores them in a radix tree. The tree captures partial overlaps that the hashmap misses.
SGLang's DSL isn't a separate model. It's plain Python that records calls. The runtime sees the recording — the model itself is unchanged.
Constrained decoding doesn't change sampling. The mask sets the logits of illegal tokens to negative infinity before the softmax, so they receive zero probability and the legal tokens are renormalized among themselves. Greedy, temperature, and top-p all still apply to the masked distribution.
Spec decoding is exact. The verification rule (rejection sampling against the target) preserves the target's distribution. EAGLE adds a learned draft head, not a different acceptance rule.
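
The exactness claim for speculative decoding can be checked numerically for a single token. The sketch below uses the standard speculative-sampling rule (accept with probability min(1, p/q), otherwise resample from the normalized residual) and computes the resulting distribution in closed form. The two example distributions are made up, and `output_distribution` is a hypothetical helper.

```python
def output_distribution(p, q):
    """Exact distribution of one token emitted by draft-then-verify.

    p: target probabilities, q: draft probabilities (same vocabulary).
    """
    # The draft proposes x with probability q[x]; it is accepted with min(1, p[x] / q[x]).
    accept = [qx * min(1.0, px / qx) if qx > 0 else 0.0 for px, qx in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, px - qx) for px, qx in zip(p, q)]
    z = sum(residual)
    if z == 0.0:  # p == q: nothing is ever rejected
        return accept
    return [a + reject_mass * r / z for a, r in zip(accept, residual)]


p = [0.5, 0.3, 0.2]  # target distribution (made-up numbers)
q = [0.2, 0.2, 0.6]  # draft distribution (made-up numbers)
out = output_distribution(p, q)
print(out)
```

Even with a draft that badly overweights the last token, the emitted distribution matches the target. A worse draft lowers the acceptance rate, not the quality of the output.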

How to use this

Linear is the recommended path. Lessons 01–05 are tightly coupled — 04 only makes sense after 03's framing of the prefix problem, and 05 needs 04's tree to schedule on. 06–11 stand alone after that.
Touch the widgets. Each lesson has one knob whose extreme settings break the system. Find the break; that's the lesson.
Read the source second. When a lesson references a file path in SGLang, read it after the lesson. The lesson tells you why each line exists; the code tells you exactly how.

Prerequisites

If "bytes per token" or "paged KV" sound vague, read vLLM lesson 01 and vLLM lesson 02 first — they take 20 minutes and make everything here load-bearing instead of mysterious.