
Case study — an agentic tool-use platform

Every prior case bound on a property of a single model call: the code assistant (13) on prefill latency, the chatbot (14) on concurrency, the RAG assistant (15) on retrieval + long-context prefill. Here the wall moves up a level. A request is no longer one call — it is a sequence of them, and the binding constraint is end-to-end task latency built from sequential fan-out and tail amplification (lesson 03), compounded by statefulness: a task is pinned to its KV. This is a reliability-and-scheduling problem wearing a serving problem's clothes — not a throughput one.

The brief
A platform running autonomous agents. A task = 10–30 sequential model calls (think → call tool → observe → think …), each call's context = system prompt + growing history + tool outputs (history reaches 20–40K tokens by the end). Tools run in sandboxes (exec, web, code) with their own latency, 100 ms – several seconds. No human watches mid-task → nothing streams to a person, so token-level SLOs are irrelevant. The SLO is task p95 E2E < 60 s and tasks/hour throughput. Model: an ordinary 70B. Design the platform.

Run the loop. Before reading on: what binds — TTFT, TPOT, memory, throughput? None of them directly. The wall is a property of the composition of calls, and it forces a stateful design.

Stage 0 · Sequential fan-out and tail amplification (lesson 03)

The calls are sequential, not parallel — call k needs call k−1's output (the model must see the tool result before it reasons again). So task latency is a sum, not a max:

T_task = Σ (call_latency_k + tool_latency_k), k = 1 … n

A sum of n calls is exposed to the worst of them. Lesson 03's fan-out result applies directly: assuming call latencies are independent, the chance the task hits at least one p99-slow call is 1 − 0.99ⁿ. At n = 20 that is 1 − 0.99²⁰ ≈ 18%. Nearly one task in five trips a p99 call, so the call-level p99 sits well inside the task-level p95 rather than out in some rare corner. Designing to the call-level median is self-deception; the task-level SLO is set by the call-level tail.
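The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming call latencies are independent draws (the same assumption behind 1 − 0.99ⁿ); the function names are illustrative, not part of any library.

```python
def p_task_hits_slow_call(n_calls: int, call_quantile: float = 0.99) -> float:
    """Chance that at least one of n independent calls lands beyond the quantile."""
    return 1.0 - call_quantile ** n_calls


def task_latency(call_latencies_s, tool_latencies_s) -> float:
    """T_task: calls are sequential, so latencies add (a sum, not a max)."""
    return sum(c + t for c, t in zip(call_latencies_s, tool_latencies_s))


# The brief's range of 10 to 30 calls per task.
for n in (10, 20, 30):
    print(f"n={n}: P(at least one p99 call) = {p_task_hits_slow_call(n):.1%}")
```

Running the loop over the brief's 10 to 30 calls shows how quickly the exposure grows with n, which is why cutting n appears as a lever in its own right.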

Binding constraint, stage 0
Sequential fan-out + tail amplification. Because latencies add and the tail compounds, the only two levers on task latency are (a) cut the number of calls n and (b) tighten per-call p99 (not median). Everything else in this design serves one of those two. A median-fast / p99-slow call distribution is the failure mode — you must design and validate at call p99.

Stage 1 · Statefulness & KV residency (lessons 05, 13)

Now the redemption. Each call's context is the previous call's context plus the new tool output — i.e. call k's prompt is call k−1's prompt as a prefix. This is the exact prefix-caching premise from the code assistant (13) and SGLang (SGLang 04), but along the time axis of one task. With the history KV resident, call k re-prefills only the new tool-output + reasoning tokens (a few hundred), not the whole 30K history:

prefill_warm ≈ new_tokens / 11,300 tok/s ≈ 300 / 11,300 ≈ 27 ms
prefill_cold ≈ history_tokens / 11,300 tok/s ≈ 30,000 / 11,300 ≈ 2.7 s
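The same comparison as a short Python sketch, using only the figures quoted above (11,300 tok/s prefill rate, 300 new tokens, 30,000 history tokens). The constant and function names are illustrative.

```python
PREFILL_TOK_PER_S = 11_300  # prefill rate quoted in the text


def prefill_seconds(tokens: int, rate: float = PREFILL_TOK_PER_S) -> float:
    """Time to prefill `tokens` at a fixed prefill rate."""
    return tokens / rate


warm = prefill_seconds(300)     # history KV resident: only the new tokens
cold = prefill_seconds(30_000)  # history KV gone: the whole history again

print(f"warm {warm * 1e3:.0f} ms, cold {cold:.2f} s, ratio {cold / warm:.0f}x")
```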

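That is a ~100× collapse in the prefill component of every call after the first. But it is conditional on two things, both from lesson 05: the task's history KV must still be resident when the next call arrives, and that call must be routed to the replica that holds it.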

Binding constraint, stage 1
KV residency per long-lived task. At ~9.6 GB/task, a TP-4 replica with ~160 GB of KV room (capstone, lesson 12) holds only ~16 concurrent tasks before eviction would force ruinous re-prefills. Residency, not compute, caps task concurrency — and routing must be sticky to use it at all.
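The residency cap is simple division, sketched below with the two figures from the box above. The per-token figure is derived under the assumption that the ~9.6 GB corresponds to a 30,000-token history; the text does not state that pairing explicitly.

```python
import math

KV_ROOM_GB = 160.0       # KV room on a TP-4 replica, from the text
KV_PER_TASK_GB = 9.6     # resident KV per long-lived task, from the text
HISTORY_TOKENS = 30_000  # assumption: the history length behind the 9.6 GB figure

# Tasks that fit before eviction starts forcing cold re-prefills.
max_resident_tasks = math.floor(KV_ROOM_GB / KV_PER_TASK_GB)

# Implied KV footprint per token (decimal units: 1 GB = 1e6 KB).
kv_per_token_kb = KV_PER_TASK_GB * 1e6 / HISTORY_TOKENS

print(f"resident tasks: {max_resident_tasks}, KV per token: {kv_per_token_kb:.0f} KB")
```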

Stage 2 · Tool latency doesn't overlap — so multiplex across tasks (lesson 04)

While a sandboxed tool runs (100 ms – several s), generation for that task is blocked on a sequential dependency. If a replica served one task, the GPU would sit idle for that entire tool window — pure waste. The fix is the same continuous batching from lesson 04, but applied across tasks: the replica decodes other tasks while this one waits on its tool. A replica is a multiplexer over many tasks at different stages (some thinking, most waiting on tools).

This is why high task-concurrency per replica is good for utilization even though each task is mostly idle: with tool windows averaging, say, 2 s against ~1 s of active generation per call, a task spends ~⅔ of its life waiting — so you need ~3× the concurrent tasks just to keep the GPU busy. The 16-task residency cap from stage 1 and this multiplexing requirement are in direct tension: residency limits how many tasks you can keep warm, and you need many warm tasks to stay utilized. That tension is the heart of sizing this fleet.
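A back-of-envelope sketch of that tension, using the illustrative 2 s tool window and ~1 s of generation per call from the paragraph above. It assumes tasks are at independent, uniformly spread phases of their think/wait cycle; the function names are hypothetical.

```python
def duty_cycle(gen_s: float, tool_s: float) -> float:
    """Fraction of a task's life spent generating rather than waiting on a tool."""
    return gen_s / (gen_s + tool_s)


def expected_active_tasks(resident_tasks: int, gen_s: float, tool_s: float) -> float:
    """Average number of resident tasks generating at any instant."""
    return resident_tasks * duty_cycle(gen_s, tool_s)


GEN_S, TOOL_S = 1.0, 2.0  # the text's "say" figures, not measurements
RESIDENCY_CAP = 16        # from stage 1

print(f"waiting fraction: {1 - duty_cycle(GEN_S, TOOL_S):.2f}")
print(f"tasks needed per busy decode slot: {1 / duty_cycle(GEN_S, TOOL_S):.1f}")
print(f"active tasks at the cap: {expected_active_tasks(RESIDENCY_CAP, GEN_S, TOOL_S):.1f}")
```

Under these numbers a replica at its residency cap has only about a third of its tasks decoding at any moment, so longer tool windows shrink the effective batch even when KV room is full.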

| Lever | Effect on task latency | Effect on utilization / cost |
| --- | --- | --- |
| Prefix-cache the history (sticky routing) | ~100× per-call prefill cut | frees compute, but pins 9.6 GB KV/task |
| Multiplex tasks across tool waits | none (per-task) | recovers the idle GPU during tool exec |
| Cut number of calls n | linear and tail reduction | fewer KV-pinned seconds per task |

Stage 3 · Failure & statefulness is the hard part (lesson 11)

Stateless serving (cases 13–15) lets you treat replicas as cattle: drain, kill, reroute, autoscale freely. Statefulness breaks all of that.

Binding constraint, stage 3
Pinned KV turns replicas from interchangeable cattle into stateful hosts. Sticky routing, slow drains, and a failover blast radius proportional to tasks-per-replica are the real engineering cost of this platform, harder than any kernel. The design question stops being "how fast is a call?" and becomes "what happens to a 45-second task when its replica dies at second 40?"
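To make sticky routing and blast radius concrete, here is a minimal sketch of a router that pins each task to the replica holding its KV. Everything in it (`Replica`, `StickyRouter`, the method names, the queue-instead-of-evict policy) is hypothetical and chosen for illustration; it is not the API of any serving framework, and a real system would also need drains, health checks, and checkpointing.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    max_tasks: int = 16  # residency cap from stage 1
    tasks: set = field(default_factory=set)


class StickyRouter:
    """Hypothetical router: every call of a task goes to the replica holding its KV."""

    def __init__(self, replicas):
        self.replicas = {r.name: r for r in replicas}
        self.pinned = {}  # task_id -> replica name

    def route(self, task_id):
        """Return (replica_name, warm). warm=False means a cold prefill is needed."""
        name = self.pinned.get(task_id)
        if name in self.replicas:
            return name, True
        # New or orphaned task: place it on the least-loaded replica with KV room.
        candidates = [r for r in self.replicas.values() if len(r.tasks) < r.max_tasks]
        if not candidates:
            # Assumed policy: refuse rather than evict a warm task.
            raise RuntimeError("no KV room: queue the task")
        target = min(candidates, key=lambda r: len(r.tasks))
        target.tasks.add(task_id)
        self.pinned[task_id] = target.name
        return target.name, False

    def fail(self, name):
        """Drop a dead replica and return its orphaned task ids (the blast radius)."""
        dead = self.replicas.pop(name)
        for task_id in dead.tasks:
            self.pinned.pop(task_id, None)
        return sorted(dead.tasks)


router = StickyRouter([Replica("a"), Replica("b")])
first = router.route("task-1")   # placed for the first time: cold
second = router.route("task-1")  # same replica, KV resident: warm
orphans = router.fail("a")       # replica dies mid-task
third = router.route("task-1")   # re-placed elsewhere: cold again
```

The last call is the "second 40" scenario in miniature: the task survives, but only by paying a cold re-prefill of its full history on a different replica, and every task pinned to the dead replica pays it at once.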

Stage 4 · Levers to cut calls and tighten the tail (lessons 06, 13)

Stage 0 said the only levers are fewer calls and lower call p99. Concretely:

Interactive · task latency & the tail

Build a task out of calls and watch the tail. Turn prefix caching off and see re-prefill of the growing history wreck both latency and cost; raise the call count and watch 1 − 0.99ⁿ push the p95 estimate past 60 s even when the median looks fine.

Per-call latency = generation + re-prefill, where re-prefill is zero with the prefix cache on and the full growing history (~30,000 tokens / 11,300 tok/s ≈ 2.7 s extra per call) with it off. Task median ≈ n · (call median + tool latency). The p95 estimate adds the tail term (1 − 0.99ⁿ) · (call p99 − call median) per lesson 03. Order-of-magnitude per the series' ±30% contract.
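The estimate described above can be reproduced offline. The sketch below follows the stated formulas literally; the input values in the usage lines (20 calls, 0.8 s call median, 6 s call p99, 1.5 s tool latency) are hypothetical and chosen only to show the cache-on and cache-off verdicts, and `estimate_task` is an illustrative name.

```python
PREFILL_TOK_PER_S = 11_300  # prefill rate quoted in the text
HISTORY_TOKENS = 30_000     # history size used for the cache-off penalty


def estimate_task(n_calls, call_median_s, call_p99_s, tool_s,
                  prefix_cache=True, slo_s=60.0):
    """Order-of-magnitude task latency estimate, per the formulas in the text."""
    # Cache off: every call re-prefills the full history.
    reprefill_s = 0.0 if prefix_cache else HISTORY_TOKENS / PREFILL_TOK_PER_S
    median_s = n_calls * (call_median_s + reprefill_s + tool_s)
    p_hit = 1.0 - 0.99 ** n_calls  # P(task hits a p99 call)
    p95_s = median_s + p_hit * (call_p99_s - call_median_s)
    return {"median_s": median_s, "p95_s": p95_s,
            "p_hit": p_hit, "meets_slo": p95_s < slo_s}


cache_on = estimate_task(20, 0.8, 6.0, 1.5)
cache_off = estimate_task(20, 0.8, 6.0, 1.5, prefix_cache=False)
print(cache_on)
print(cache_off)
```

With these inputs the cache-on task stays under 60 s while the cache-off task does not, because the ~2.7 s re-prefill is paid on every one of the 20 calls.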

[Interactive widget outputs: task median · task p95 (est.) · P(task hits a p99 call) · verdict vs 60 s]

What this case teaches