The frontend DSL — programs over LLM calls

A stateless POST cannot tell the runtime "I'm about to fork 32 branches off this prefix." SGLang's DSL is the smallest possible vocabulary that lets the client say so — five primitives, embedded in Python, recorded as an IR the runtime can plan against.

The problem the DSL exists to solve

Lesson 01 ended with the observation that 50–95% of every modern workload is redundant prefill. Whose job is it to notice the redundancy?

The runtime could try. Hash incoming prompts, look for shared prefixes, schedule them together. This is what vLLM's automatic prefix cache (APC) does. It works if the requests arrive close enough in time that nothing has been evicted, and if the framework can hash blocks fast enough to keep up.
The client could declare it. If the client knows it's about to fan out 32 samples from one prompt, it can tell the runtime — "here is a prefix; here are 32 continuations from it." The runtime then needs no inference, just allocation.

SGLang offers both. The runtime does prefix detection (lesson 04 — RadixAttention). But for workloads where the structure is known at the client (which is most of them), declaring is strictly cheaper, more reliable, and unlocks scheduler optimizations the runtime cannot do from string-comparison alone.

The five primitives

The DSL is small on purpose. The whole vocabulary fits in this table:

Primitive	Meaning	What the runtime sees
`+= "text"`	Append plain text to the current state.	Bytes appended to the active prefix.
`gen(name, …)`	Sample tokens until stop, bind to `name`.	A decode segment with constraints, sampling params, max tokens.
`select(name, choices=[…])`	Pick one of N strings by log-prob.	N parallel cheap forward passes, then a logical fork.
`fork(k)`	Branch the current state into k independent children.	k requests, all rooted at the same KV.
`system / user / assistant`	Open a chat-role-tagged section.	Same as `+=` with role tokens added.

That's the entire surface area. Five primitives is enough to express agent loops, tree search, self-consistency, RAG, and constrained-output JSON generation — which is the entire workload palette from lesson 01.

What an actual program looks like

Here's a self-consistency program — sample n=5 chains-of-thought from one prompt and majority-vote the answer:

import sglang as sgl

@sgl.function
def majority_vote(s, question):
    s += sgl.system("You are a careful mathematician.")
    s += sgl.user(question)

    forks = s.fork(5)                       # ← runtime sees: 5 children, same KV up to here
    for f in forks:
        f += sgl.assistant("Let me reason step by step. ")
        f += sgl.gen("reasoning", max_tokens=400, stop="\n\n")
        f += "\nFinal answer: "
        f += sgl.gen("answer", max_tokens=20, regex=r"[0-9]+")   # constraint detail in lesson 06; treat as "only digits"

    answers = [f["answer"] for f in forks]
    return max(set(answers), key=answers.count)

What the runtime sees, in order:

System + user prompt assembled. KV grows linearly. Total: 1 prefill.
fork(5): not 5 copies of the KV. The runtime takes a refcount on the blocks holding the prompt and gives each child a block table that initially points to them.
5 parallel gen calls run as continuous-batched decodes. They read the shared prefix's KV through their block tables; they write new blocks for the new tokens they emit.
Inside each gen("answer", regex=…), a finite-automaton mask is applied so only digits are sampled (lesson 06).
The 5 answers come back to the client; Python does the vote.

The key observation: the client never wrote any of the cache management. It said "fork" and "gen" — the runtime turned that into refcount-and-decode.

Why this isn't expressible in the chat-completions API

guess: • hash request bodies • compare block-by-block • hope none has been evicted inference, not declaration. SGLang program (recorded IR) s += system + user # 1 prefill forks = s.fork(5) # refcount for f in forks: # 5 decodes f += gen("ans", regex=…) runtime knows: • fan-out factor = 5 • shared prefix = entire context up to fork • constraint on each leaf already declaration — no inference needed.

The chat-completions API was designed so each request stands alone. That's a fine abstraction for a chatbot. For a program of 96 related calls it forces the runtime to recover structure that the client already had.

What the IR actually is

A SGLang program is a sequence of operations against a State object. Internally, when you write s += "foo" or s += gen(…), the State records an op:

# conceptually (simplified)
class Op: pass
class AppendText(Op): text: str
class GenTokens(Op): name: str; max_tokens: int; constraint: Optional[FSM]
class Fork(Op):      k: int
class Select(Op):    name: str; choices: list[str]

# the program is just a list[Op] plus a parent pointer per state

The runtime turns this list into scheduler work items. AppendText grows the prefix; GenTokens creates a decode request; Fork creates k child decode requests pointing at the parent's KV; Select issues n short forward passes and picks the best. Nothing here is novel as a programming model — it's a small embedded effect system. What's novel is that every op is something the scheduler benefits from seeing.

The OpenAI-compatible escape hatch

You don't have to rewrite your application to use SGLang. The server speaks the OpenAI chat-completions protocol; you can point your existing client at it and immediately benefit from RadixAttention (lesson 04) and the fast kernel stack (lesson 08). What you give up:

The runtime no longer knows fork structure ahead of time; it has to discover prefix overlap via the radix tree.
You cannot express select or constrained gen directly — though JSON / regex masks can still be passed in a request header (see lesson 06).

The decision tree, in one line: If your workload is high-shared-prefix and you can change the client, use the DSL. Otherwise use the OpenAI endpoint and let RadixAttention catch what it can.

Interactive · what the runtime sees

Toggle between API-mode and DSL-mode for the same workload. The widget shows what the runtime can plan against in each case — and what it has to guess.

What this gives the next lesson

With the DSL on the table, we can ask the central technical question: how does the runtime physically share KV between a parent and its forked children, between two unrelated requests that happen to share the first 1,800 tokens, between today's RAG call and tomorrow's? The answer is a data structure — a radix tree over token sequences — and that is lesson 04. Lesson 03 first re-establishes the KV recap you need to read lesson 04.