all_lessons / sglang / 05 · scheduling lesson 5 / 11

Cache-aware scheduling — order is the optimization

A radix tree is a passive index. The scheduler is what turns it into a throughput win. By admitting requests in the order that maximizes prefix overlap, SGLang can convert the tree's capability for sharing into actual hit rate — often 30–60 percentage points above first-come-first-served.

The shape of the win

Suppose 32 requests arrive simultaneously and 24 of them share a 2,000-token system prompt while 8 are unrelated. The radix tree can in principle hold all 32 prefixes; the question is whether the scheduler ever lets it. Three scheduling policies, three outcomes:

PolicyAdmission orderWhat the tree seesEffective hit rate
FCFSarrival orderrandom walk over prompts~50% (depends on luck)
Shortest job firstby predicted output lengthstill random over prefixes~50%
Longest-prefix-match (LPM)by depth of cache hitshared-prefix requests first>95%

The reason LPM dominates is mechanical: the first request through a fresh tree pays the full prefill cost; the rest piggyback. Letting the cache-hitting requests run first means everyone benefits, including the ones that would have missed.

The scheduling problem, formally

At each step the engine has:

The decision: which waiting requests to admit into the running set this step. Constraints: the running set's KV must fit in the budget; some fairness floor (no request waits forever); some prefill-throughput cap (so prefill chunks don't starve decodes).

The objective: maximize aggregate useful tokens generated per GPU-second, where "useful" means tokens that aren't redundant prefill. This collapses, roughly, to:

maximize   ∑r∈running (input_len(r) − cached_prefix_len(r)) + decode_tokens(r)

The cache term is the entire knob the scheduler has — every other term is workload-given. Maximize it by picking, from the waiting queue, the requests whose cached_prefix_len is deepest.

The LPM-first algorithm

def schedule_one_step(waiting, running, kv_budget, tree):
    # 1. for each waiting request, look up its deepest cache hit
    for r in waiting:
        r.cached_len, r.node = tree.lookup(r.tokens)

    # 2. sort by cached_len descending (LPM-first)
    waiting.sort(key=lambda r: r.cached_len, reverse=True)

    # 3. greedy-admit while KV budget holds; aged requests bypass the budget
    admitted = []
    for r in waiting:
        new_blocks = blocks_needed(r) - r.node.cached_blocks  # what we have to allocate
        if r.wait_age > FAIRNESS_T:                # always admit aged requests
            admitted.append(r); kv_budget -= new_blocks
        elif new_blocks <= kv_budget:              # otherwise gate on budget
            admitted.append(r); kv_budget -= new_blocks

    return admitted

Three pieces beyond a simple sort:

  1. "Cached blocks" is subtracted from the budget. An LPM-first request that hits 2,000 tokens of cache costs the pool only the new suffix blocks — the cached blocks already live in the tree and have their refcounts bumped, not freshly allocated.
  2. Greedy admission has a fairness escape. Any request waiting more than FAIRNESS_T jumps the queue. Without this, a hot prefix can starve cold requests forever.
  3. Eviction reverses the order. When the pool is tight, evict leaves whose subtrees have lowest aggregate hit benefit (cold prefixes). The LPM-first admit and LPM-aware evict cooperate.

The fairness ↔ hit-rate trade-off

cache hit rate vs. p99 wait time, as FAIRNESS_T varies p99 wait time → hit rate ↑ FCFS · hit≈50%, wait p99 low FAIRNESS_T = 200ms · the sweet spot pure LPM · hit≈98%, but p99 wait → ∞ left edge: never reorder. right edge: always reorder for cache. middle: bounded reordering.

SGLang ships with FAIRNESS_T tuned for typical online workloads (a few hundred milliseconds in default config). For offline batch workloads — where there is no SLA — you can raise it freely and push hit rates above 95%. The right answer depends on whether you are pricing latency or throughput.

Why FCFS leaves so much on the floor — a concrete trace

Set up: 6 requests, 2 system prompts (A: 2,000 tokens · B: 1,500 tokens), arrival order alternates A,B,A,B,A,B. KV budget allows 4 concurrent.

FCFS schedule (arrival order): A1: full prefill B1: full prefill A2: cached B2: cached Total prefill compute: 2 × full + 2 × cached lookup. If KV budget is tight, A1's blocks may be evicted before A2/A3 arrive — losing the hit. LPM-first schedule (after first miss): A1: full prefill (seed) A2: cached (re-order) A3: cached B1: full prefill (seed) B2: cached A-cluster runs first; B-cluster after. Hits are saturated within each cluster.

How the scheduler talks to the tree

Three queries on the tree from the scheduler, two updates:

CallPurposeCost
tree.lookup(tokens)For ranking: how deep does this request hit?O(T) compares
tree.evictable_size()For admission: how many blocks could the pool reclaim if needed?O(1) cached field
tree.protect(node)For correctness: pin a node during a step so eviction won't free it.O(1)
tree.insert(tokens, blocks)After prefill: add the new suffix to the tree.O(suffix length)
tree.release(leaf)On request completion: decrement refcounts.O(depth)

The scheduler is essentially a producer for these calls, plus the LPM-sort. SGLang's implementation lives in python/sglang/srt/managers/scheduler.py and tree_cache.py — about a thousand lines total. Worth reading after this lesson.

Continuous batching, briefly

All of this happens inside an iteration-level batched loop (Orca / vLLM lesson 04). At each decode step:

  1. Compute decode for currently-running requests (one token each).
  2. Pick up to max_prefill_tokens worth of waiting requests, prefill them in the same forward pass (chunked prefill, vLLM lesson 10).
  3. Drop finished requests from the running set; release their KV via the tree.
  4. Repeat.

Cache-aware scheduling only changes step 2 — which waiting requests are picked up next. The continuous-batch machinery is unchanged.

What can go wrong

Failure modes worth knowing

Interactive · LPM-first vs FCFS

Simulate a stream of requests with adjustable shared-prefix rate. Compare hit rate and average wait time under FCFS vs LPM-first with bounded reordering.

FCFS vs LPM-first hit rate

As shared-prefix probability rises, LPM-first opens a large gap over FCFS in hit rate — at the cost of slightly higher wait variance.