Part VI - Knowledge, communication, and optimization

Resource-aware optimization - cost, latency, budgets

Up to now the agent has cared only about whether it reaches the goal. This lesson adds the second axis every production system is graded on: at what cost. Compute, wall-clock time, money, context window, tool quotas, and human attention are all finite. Resource-aware optimization is the policy layer that spends each of them deliberately — choosing the cheapest path that is still good enough, and degrading gracefully instead of failing when a budget runs dry.

Book source

Chapter 16 - Resource-Aware Optimization (资源感知优化); PDF outline pages 175-185. The chapter's worked examples — the financial-analyst agent, the Gemini Pro / Flash router built on Google ADK, the three-way OpenAI classifier (simple / reasoning / internet_search), and OpenRouter's auto-selection and ordered fallback — are threaded through below.

Linear position

Prerequisite: lesson 17 (A2A) — the agent can already retrieve (16), call tools (07), delegate to other agents (09, 17), set goals and monitor progress (13), and recover from failure (14). It knows how to act.
New capability: a budget-aware policy that, for each step, picks the model, tool, context size, retry depth, and fallback path that meets the quality bar at the lowest resource spend — and that can shed quality on purpose when constraints tighten.

The plan

Five moves. (1) Frame the problem the book frames: LLM work is expensive and slow, so spending the strongest model on every step is waste, not safety. (2) Build the central mechanism — the routing agent that classifies a request and sends it to a cheap or strong model — and do the cost arithmetic that justifies it. (3) Add the two reliability mechanisms the book stresses: fallback chains (survive rate limits and outages) and the critique agent (catch the cheap route silently dropping quality). (4) Widen from model routing to the book's full taxonomy: tool choice, context pruning, workload prediction, energy-aware deployment, distributed compute, learned allocation, and graceful degradation. (5) Make it auditable: every resource decision must be recorded in the trace so evaluation can compare routes, not just answers. We close with the one rule that keeps this from becoming quality debt.

1 · Why "always use the best model" is a bug, not a safety margin

The book opens with a distinction worth internalizing. Ordinary planning (lesson 08) cares about the sequence of actions — which step comes next. Resource-aware optimization cares about how each action is executed under a budget: more accurate but expensive model, or faster and cheaper one? Spend extra compute for a finer answer, or return a coarser answer now? The agent is no longer just choosing what to do; it is choosing how much to spend doing it.

The book's anchor example is an agent serving a financial analyst over a large dataset. If the analyst wants a quick preliminary read, the agent summarizes the key trends with a fast, cheap model. If the analyst needs a high-precision forecast to back a major investment decision — and has the budget and time — the agent allocates more resources and reaches for the slower, stronger model. Same agent, same goal shape, radically different resource policy depending on the stakes and constraints of the moment.

The mental model: think of the agent as a contractor with a fixed budget for a job. A good contractor does not hire the master craftsman to hang a single picture frame, and does not hand a load-bearing wall to the cheapest day laborer. The skill is matching the worker to the task. An agent that routes every request to the strongest model is the contractor who hires the master for everything — it will produce fine work and then go bankrupt. "Use the best model everywhere" feels safe but is the most common and most expensive anti-pattern in the field.

The resources on the table

The book is explicit that this is not only about money. The optimization surface includes compute (GPU/token cost), latency (wall-clock, for real-time systems), money (API spend and quotas), energy (battery on edge devices), bandwidth/storage (download a summary, not the full dataset), context window (tokens are finite and priced), tool quotas, and human attention (every escalation to a person is a scarce, costly resource — see lesson 15).

2 · The routing agent — the core mechanism, with the arithmetic

The book's standardized solution is a routing agent: a lightweight front component that first classifies the incoming request's complexity, then dispatches it to the most appropriate and economical downstream model or tool. Simple questions go to a fast, cheap model; complex reasoning goes to a strong, expensive one. The book's concrete pairing is Google's Gemini 2.5 Flash (the economical model) versus Gemini 2.5 Pro (the high-capability model), wired together with the Google Agent Development Kit (ADK), whose multi-agent orchestration supports exactly this kind of dynamic, LLM-driven routing.

The book sketches two routers. The trivial one uses a cheap heuristic — query length in words:

# Google ADK style router (conceptual). The book's simplest signal: word count.
class QueryRouterAgent(BaseAgent):
    async def _run_async_impl(self, ctx):
        query = ctx.current_message.text
        if len(query.split()) < 20:        # short → assume simple
            return await gemini_flash_agent.run_async(ctx.current_message)
        else:                              # long → assume complex
            return await gemini_pro_agent.run_async(ctx.current_message)

The book is honest that word count is a placeholder. A stronger router uses an LLM or a fine-tuned classifier to judge complexity directly, sending fact-recall queries to Flash and deep-analysis queries to Pro. The book also notes that you can prompt-engineer or fine-tune the router itself on a dataset of (question → best model) pairs to raise its accuracy. In other words, the router is a model whose only job is to allocate the other models well.

Worked numbers: what the route actually saves

Routing is only worth its complexity if the arithmetic is large, so let us do it. Take an illustrative price gap of strong model at $10 per million input tokens / $30 per million output versus cheap model at $0.50 / $1.50 — a 20× gap, typical of a flagship-vs-mini pairing. Suppose a daily workload of 100,000 requests, each averaging 800 input + 400 output tokens.

cost per strong request = 800·($10/1e6) + 400·($30/1e6) = $0.0080 + $0.0120 = $0.0200
cost per cheap request = 800·($0.50/1e6) + 400·($1.50/1e6) = $0.00040 + $0.00060 = $0.0010

Everything on the strong model: 100,000 × $0.0200 = $2,000/day (~$730k/year).

Now route. Suppose classification finds 70% of requests are "simple" (safe on the cheap model) and 30% need the strong one. Add the router's own classification call — say it runs on the cheap model at the same $0.0010 each, so 100,000 × $0.0010 = $100/day of routing overhead.

routed = 100,000·$0.0010 (router) + 70,000·$0.0010 (simple→cheap) + 30,000·$0.0200 (complex→strong)
= $100 + $70 + $600 = $770/day

That is a reduction of about 61% ($2,000 → $770; $1,230 / $2,000 = 61.5%), and the $100 of router overhead is dwarfed by the $1,230 net saving. The lesson the numbers teach: the router pays for itself the moment a meaningful fraction of traffic is genuinely simple, and the savings scale with both the price gap and the simple-traffic share. The widget below lets you turn those two dials and watch the daily bill and the quality risk move together.
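The same arithmetic can be checked in a few lines of Python. This is a minimal sketch under the illustrative prices above. The helper names (`request_cost`, `routed_daily_cost`, `break_even_simple_share`) are hypothetical and not from the book, and the break-even formula assumes perfect routing and a router call charged on every request.

```python
def request_cost(in_tok, out_tok, in_price_per_m, out_price_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return (in_tok * in_price_per_m + out_tok * out_price_per_m) / 1e6


def routed_daily_cost(n, simple_share, cheap, strong, router=None):
    """Daily bill when every request pays the router, then its routed model."""
    router = cheap if router is None else router  # assumption: router runs on the cheap model
    n_simple = n * simple_share
    return n * router + n_simple * cheap + (n - n_simple) * strong


def break_even_simple_share(cheap, strong, router=None):
    """Smallest simple-traffic share s at which routing beats all-strong.

    n*router + s*n*cheap + (1-s)*n*strong < n*strong
    rearranges to s > router / (strong - cheap).
    """
    router = cheap if router is None else router
    return router / (strong - cheap)


cheap = request_cost(800, 400, 0.50, 1.50)    # about $0.0010
strong = request_cost(800, 400, 10.0, 30.0)   # about $0.0200
all_strong = 100_000 * strong
routed = routed_daily_cost(100_000, 0.70, cheap, strong)
print(f"all-strong ${all_strong:,.0f}/day, routed ${routed:,.0f}/day")
print(f"break-even simple share: {break_even_simple_share(cheap, strong):.1%}")
```

Under these assumed prices the break-even share is 1/19, a little over 5%, which is why even a modest fraction of simple traffic covers the router's overhead. With imperfect routing the real threshold would be higher.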

Routing economics — trade daily cost against quality risk

A daily workload is split into simple and complex requests. You set the router's complexity threshold (how aggressively it sends traffic to the cheap model) and the price gap. The simulator sends each bucket to a model, charges for it, and estimates quality: the cheap model answers simple requests well but degrades on complex ones it should not have received. Push the threshold too aggressive and you save money while quietly shipping wrong answers — exactly the trap the book's critique agent exists to catch.

[Interactive widget. Controls: aggressiveness (% sent to cheap model), price gap (strong / cheap), true % simple in traffic. Readouts: all-strong cost/day, routed cost/day, savings, misrouted requests (complex→cheap), estimated answer quality.]

Core JS of the simulator:

// Inputs are the three dials: agg and truePctSimple are fractions in [0, 1]; gap is strong/cheap.
function simulate(agg, gap, truePctSimple) {
  const N = 100000, inTok = 800, outTok = 400;
  const cheap = inTok * 0.5e-6 + outTok * 1.5e-6;       // $0.0010 per request
  const strong = cheap * gap;                            // price-gap dial
  const toCheap = Math.round(N * agg);                   // aggressiveness dial
  const toStrong = N - toCheap;
  // The router runs the cheap classifier on every request.
  const cost = N * cheap + toCheap * cheap + toStrong * strong;
  // Quality: the cheap model is fine on simple requests; complex requests sent to it are misrouted.
  const trulySimple = Math.round(N * truePctSimple);
  const misrouted = Math.max(0, toCheap - trulySimple);  // assumes simple traffic fills the cheap bucket first
  const quality = 1 - 0.5 * (misrouted / N);             // each miss counts as half credit
  return { cost, misrouted, quality };
}

3 · Fallback chains and the critique agent — reliability and quality control

Routing for cost is only half the chapter. The book pairs it with two mechanisms that keep the optimized system trustworthy.

Fallback chains: surviving the real world

The book's recurring reliability strategy is the fallback mechanism: when the preferred model is unavailable — overloaded, rate-limited, or its output is filtered — the system automatically switches to a default or more economical model so the service continues instead of failing outright. This is graceful degradation at the model layer.

The chapter's concrete vehicle is OpenRouter, a unified gateway over hundreds of models that offers two routing styles:

Automatic selection ("model": "openrouter/auto"): the gateway picks the best available model for the request content, and — crucially — returns metadata naming the model that actually served it.
Ordered fallback ("models": ["anthropic/claude-3.5-sonnet", "gryphe/mythomax-l2-13b"]): try the first; on failure (outage, rate limit, content filter) fall to the next, until one succeeds or the list is exhausted. The final cost and model id are those of whichever model actually completed the request.

Why "which model actually answered" is load-bearing

Both OpenRouter modes return the realized model id. Without it you cannot attribute cost, you cannot reproduce a result, and you cannot evaluate quality by route — your dashboard says "we used Sonnet" while the fallback quietly served a weaker model for an hour. Record the realized model, not the requested one.

# Ordered fallback as a budget-and-reliability policy.
# call(), state.trace, and the exception classes are placeholders for your own client code.
PREFERRED = ["anthropic/claude-3.5-sonnet", "openai/gpt-4o-mini", "local/llama-3-8b"]

def call_with_fallback(prompt, state):
    for model in PREFERRED:
        try:
            resp = call(model, prompt, timeout=8)
        except (RateLimited, ServiceDown, Filtered):
            continue        # degrade to the next, cheaper/available option
        # Record the realized model, not just the requested one.
        state.trace.record(requested=PREFERRED[0], realized=model, cost=resp.cost)
        return resp
    raise AllModelsExhausted()  # only now do we truly fail

The critique agent: catching silent quality loss

The danger of an aggressive router is invisible: it does not crash, it just returns slightly worse answers and a smaller bill. The book's answer is a dedicated critique agent (批判智能体) — a component whose system prompt makes it the quality-assurance reviewer of the answering agent's output. The book gives it a concrete job: scrutinize research results for factual correctness, completeness, and bias; flag missing data and inconsistent reasoning; and feed that judgment back to improve the system.

The critique agent does not manage the budget directly. It improves resource allocation indirectly: if it keeps finding that Flash answers are inadequate for a class of question, that signal is used to retune the router, or even to train the routing policy through fine-tuning or reinforcement learning. Conversely, if it finds that Pro was overkill for questions Flash would have nailed, the router has learned to over-spend and should be pulled back. In the book's words, the critique agent catches the two routing errors that matter: simple questions sent to the expensive model (waste) and complex questions sent to the cheap model (quality debt).
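One way to picture that feedback loop is a tally of critique verdicts per route. The sketch below is an assumption-laden illustration, not the book's code: the verdict tuple shape, the tier names `"cheap"` and `"strong"`, the class labels, and both thresholds are hypothetical.

```python
from collections import defaultdict


def route_stats(verdicts):
    """Tally critique verdicts. verdicts: iterable of (question_class, tier, passed)."""
    stats = defaultdict(lambda: [0, 0])  # (class, tier) -> [failures, total]
    for question_class, tier, passed in verdicts:
        entry = stats[(question_class, tier)]
        entry[0] += 0 if passed else 1
        entry[1] += 1
    return stats


def classes_to_escalate(verdicts, max_failure_rate=0.10, min_samples=20):
    """Question classes whose cheap-route answers fail critique too often."""
    return sorted(
        question_class
        for (question_class, tier), (failures, total) in route_stats(verdicts).items()
        if tier == "cheap" and total >= min_samples and failures / total > max_failure_rate
    )


verdicts = (
    [("multi_source_conflict", "cheap", False)] * 8
    + [("multi_source_conflict", "cheap", True)] * 12
    + [("fact_lookup", "cheap", True)] * 30
)
escalate = classes_to_escalate(verdicts)
print(escalate)
```

This covers only the quality-debt direction. Detecting waste would need the critic to judge whether the cheap model would have sufficed on strong-route answers, which this sketch does not attempt.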

The book's three-way OpenAI router, end to end

The chapter's fullest worked code is a question-answering system that classifies every prompt into one of three buckets before choosing a model — the cleanest illustration of routing as a complexity-to-resource map:

Class	Meaning	Model chosen	Extra resource
`simple`	Direct factual answer, no reasoning or fresh data	`gpt-4o-mini` (cheapest)	none
`reasoning`	Logic, math, or multi-step thinking	`o4-mini` (reasoning model)	none
`internet_search`	Needs current / post-training info	`gpt-4o` (strong)	Google Custom Search call, results injected as context

# The book's pattern: classify first (cheap), then bind model + tools to the class.
def handle_prompt(prompt):
    cls = classify_prompt(prompt)["classification"]   # one cheap call returns JSON
    search = google_search(prompt) if cls == "internet_search" else None
    answer, model = generate_response(prompt, cls, search)
    return {"classification": cls, "response": answer, "model": model}  # model recorded

Note three things the book builds in: the classifier returns strict JSON (so routing is parseable, not vibes); a tool — Google search — is attached only to the class that needs it (paying for a search API only when freshness is required); and the realized model is returned in the result for attribution.
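The "strict JSON" point is worth making concrete, because a classifier that occasionally returns prose or an unknown label would otherwise break routing. The following is a minimal sketch, assuming the classifier's raw text output is passed in as a string. The model map follows the table above, while `parse_classification` and the choice of `"reasoning"` as the default class are assumptions for illustration.

```python
import json

# Class-to-model map taken from the table above.
MODEL_FOR_CLASS = {
    "simple": "gpt-4o-mini",
    "reasoning": "o4-mini",
    "internet_search": "gpt-4o",
}
# Assumption: when the label is unusable, fall back to the reasoning route.
DEFAULT_CLASS = "reasoning"


def parse_classification(raw):
    """Return a valid class label from the classifier's raw text output."""
    try:
        label = json.loads(raw)["classification"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return DEFAULT_CLASS  # not JSON, wrong shape, or missing key
    if isinstance(label, str) and label in MODEL_FOR_CLASS:
        return label
    return DEFAULT_CLASS      # unknown or non-string label


label = parse_classification('{"classification": "internet_search"}')
print(label, MODEL_FOR_CLASS[label])
```

Which class to default to is a policy choice: defaulting to the cheapest route saves money on malformed output, while defaulting upward protects quality.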

4 · Beyond model switching — the full optimization taxonomy

The chapter's real teaching payload is that dynamic model switching is just the first item in a spectrum. A mature resource manager is a policy over many levers. The book enumerates them; here they are with the running coding/research agent in mind:

Dynamic model switching

Route by complexity and available compute. Flash for "what's this file's license?", Pro for "design a migration plan."

Adaptive tool selection

Pick the tool by API cost, latency, and runtime — a cached index lookup over a paid web search when both would answer.

Context pruning & summarization

Summarize and selectively keep history to cut tokens and inference cost — without deleting the evidence the answer rests on.

Proactive workload prediction

Forecast demand and pre-allocate (warm a model, prefetch docs) to avoid bottlenecks and cold-start latency.

Cost-sensitive exploration

In multi-agent systems, minimize communication + compute — share findings instead of every agent re-deriving them.

Energy-aware deployment

On edge / battery devices, optimize the pipeline to extend runtime and cut cost.

Distributed-compute awareness

Spread work across machines/processors for throughput when the job parallelizes (lesson 05's fan-out, priced).

Learned resource allocation

Use feedback and metrics to keep improving the allocation policy over time — the critique agent's signal closes this loop.

Graceful degradation

When resources are critically scarce, keep core function alive via reduced performance and alternative strategies rather than failing.
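Of these levers, adaptive tool selection is the easiest to sketch. The example below assumes each candidate tool carries an estimated cost and latency plus a flag saying whether it is expected to answer the current query. The `Tool` fields, tool names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    cost_usd: float     # estimated cost per call
    latency_s: float    # estimated latency per call
    can_answer: bool    # whether this tool is expected to answer the current query


def pick_tool(tools, max_latency_s):
    """Cheapest tool that can answer within the latency budget, else None."""
    eligible = [t for t in tools if t.can_answer and t.latency_s <= max_latency_s]
    # Ties on cost are broken by lower latency.
    return min(eligible, key=lambda t: (t.cost_usd, t.latency_s), default=None)


tools = [
    Tool("paid_web_search", cost_usd=0.005, latency_s=1.2, can_answer=True),
    Tool("cached_index", cost_usd=0.0, latency_s=0.05, can_answer=True),
]
chosen = pick_tool(tools, max_latency_s=2.0)
print(chosen.name)
```

Returning `None` when nothing fits leaves the degradation decision to the caller instead of silently picking a tool that breaks the budget.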

Worked example: context pruning with a token budget

Context is a priced, finite resource, so it gets its own budget. Suppose the coding agent's window is 128k tokens and its retrieved working set has grown to 90k tokens of file contents and tool logs. At the illustrative $10/M input price, a single Pro call costs 90,000 × $10/1e6 = $0.90 just to read that context, on every turn. Prune: summarize the 60k tokens of old tool logs down to a 4k-token running summary, and keep the 30k tokens of source files verbatim because they are the evidence. New context = 34k tokens → $0.34/turn, a 62% cut (before counting the one-off cost of the summarization call itself). The discipline is in what you keep: pruning that discards a citation or a safety check is not optimization, it is corner-cutting (see the failure modes below).
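The pruning rule above can be written down as a small budget calculation. This is a sketch under the example's numbers. The segment kinds (`"evidence"`, `"tool_log"`) and the helper names are hypothetical, and the summarization call itself is not modeled.

```python
def prune_context(segments, summary_tokens):
    """Token count after pruning.

    segments: list of (kind, tokens). Segments of kind 'evidence' are kept
    verbatim; everything else is replaced by one running summary.
    """
    kept = sum(tokens for kind, tokens in segments if kind == "evidence")
    has_other = any(kind != "evidence" for kind, _ in segments)
    return kept + (summary_tokens if has_other else 0)


def read_cost(tokens, price_per_m=10.0):
    """Input cost in dollars of reading this many tokens once."""
    return tokens * price_per_m / 1e6


segments = [("evidence", 30_000), ("tool_log", 60_000)]
before = sum(tokens for _, tokens in segments)           # 90,000 tokens
after = prune_context(segments, summary_tokens=4_000)    # 34,000 tokens
saving = 1 - read_cost(after) / read_cost(before)
print(f"${read_cost(before):.2f} -> ${read_cost(after):.2f} per turn ({saving:.0%} cut)")
```

Tagging segments by kind up front is the design choice that matters here: the pruner never has to guess which tokens are evidence.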

5 · Make every decision auditable — or it is just quality debt

The chapter is emphatic on one governance point, and it is the rule that separates real optimization from cost-cutting theater. A strong implementation records every resource decision in the trace: why this model, why this context size, why this tool, why this fallback fired. And evaluation (lesson 21) must compare routes, not only final answers.

The governing rule

A cheaper route that drops a citation, skips a verification, or misses a safety check is not an optimization; it is quality debt disguised as savings. You only know which it is if every resource decision is explainable, measurable, and reversible, and if your evals score quality per route. Optimizing cost without a quality regression test is how you ship a system that is 60% cheaper and 20% wrong, and do not find out until a customer does.
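A per-route quality gate can be sketched directly from trace records. The record fields (`realized_model`, `cost_usd`, `quality`), the 0-to-1 quality scale, and the `max_drop` tolerance below are all assumptions for illustration, not a format the book prescribes.

```python
from collections import defaultdict
from statistics import mean


def per_route_summary(trace):
    """Group trace records by the model that actually answered."""
    by_route = defaultdict(list)
    for record in trace:
        by_route[record["realized_model"]].append(record)
    return {
        model: {
            "n": len(records),
            "mean_cost": mean(r["cost_usd"] for r in records),
            "mean_quality": mean(r["quality"] for r in records),
        }
        for model, records in by_route.items()
    }


def passes_quality_gate(summary, baseline_model, candidate_model, max_drop=0.02):
    """True if the candidate route's mean quality is within max_drop of the baseline's."""
    drop = summary[baseline_model]["mean_quality"] - summary[candidate_model]["mean_quality"]
    return drop <= max_drop


trace = [
    {"realized_model": "strong", "cost_usd": 0.020, "quality": 0.95},
    {"realized_model": "strong", "cost_usd": 0.020, "quality": 0.95},
    {"realized_model": "cheap", "cost_usd": 0.001, "quality": 0.90},
    {"realized_model": "cheap", "cost_usd": 0.001, "quality": 0.90},
]
summary = per_route_summary(trace)
print(passes_quality_gate(summary, "strong", "cheap"))
```

Comparing routes on live traffic is confounded, since each route sees different questions. A fairer gate would replay the same evaluation set through both routes before comparing.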

1. Represent budgets explicitly in task state: remaining tokens, dollars, latency target, retry count, escalation quota.
2. Estimate the step's complexity, risk, and latency need (heuristic, LLM classifier, or fine-tuned router).
3. Choose the model, tool path, context size, retry depth, and parallelism that meet the quality bar at least cost; attach a fallback chain.
4. Record the decision and the realized resource use in the trace (requested vs actual model, tokens, cost, latency).
5. Degrade gracefully when a budget tightens (smaller model, summarized context, coarser answer with stated uncertainty), never with a hard crash.
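The steps above can be sketched as a budget object carried in task state plus an ordered list of routes. Everything here is a hypothetical illustration: the `Budget` fields, the route estimates, and the degradation label are assumptions, and real cost and latency estimates would come from your own measurements.

```python
from dataclasses import dataclass, field


@dataclass
class Budget:
    dollars: float
    seconds: float
    trace: list = field(default_factory=list)

    def charge(self, step, model, cost_usd, latency_s, reason):
        """Deduct realized spend and record the decision in the trace."""
        self.dollars -= cost_usd
        self.seconds -= latency_s
        self.trace.append({"step": step, "model": model, "cost_usd": cost_usd,
                           "latency_s": latency_s, "reason": reason})


# Routes ordered from preferred to most degraded; estimates are made up.
ROUTES = [
    {"model": "strong", "est_cost": 0.30, "est_latency": 8.0, "label": None},
    {"model": "cheap", "est_cost": 0.02, "est_latency": 2.0,
     "label": "preliminary: confidence reduced under budget"},
]


def choose_route(budget):
    """First route whose estimates fit the remaining budget, else None."""
    for route in ROUTES:
        if route["est_cost"] <= budget.dollars and route["est_latency"] <= budget.seconds:
            return route
    return None  # caller returns a partial answer with stated uncertainty


budget = Budget(dollars=0.50, seconds=12.0)
first = choose_route(budget)
budget.charge("synthesize", first["model"], 0.30, 8.0, reason="fits budget")
second = choose_route(budget)  # strong no longer fits, so the cheap route is chosen
print(first["model"], second["model"], second["label"])
```

Keeping the trace on the budget object means every charge carries its reason, which is what lets a later evaluation compare routes instead of only final answers.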

Running example — the research / coding agent, made resource-aware

Threading the whole chapter through one agent: a research assistant uses a small model to classify each source (relevant? primary? paywalled?), a strong model only to synthesize conflicting evidence, and deterministic, code-based citation checks before emitting the final answer (free compared to an LLM verifier). It carries a per-task budget in state — say $0.50 and 12 seconds. When the budget tightens mid-task it degrades on purpose: switch synthesis from Pro to Flash, prune old tool logs to a summary, and label the result "preliminary — confidence reduced under budget" rather than overrunning or failing. A critique agent samples outputs and, when it finds Flash syntheses are too thin for multi-source conflicts, feeds that back to make the router send conflict-heavy tasks to Pro. Every hop records its model, tokens, and cost, so the eval harness can answer the only question that matters: did the cheaper route stay correct?

Try it (checkpoint exercise)

Take your own agent and, for each step, write down the cheapest model or tool that would still be sufficient, and mark the few steps that genuinely require the strongest model. Then add one line of trace per step recording the model, token count, and cost. You now have the raw material for both the router and the per-route eval.

Failure modes

Strong-model-everywhere: routing is hard, so the team defaults to the flagship for all traffic and burns the budget (our worked example: $2,000/day vs $770).
Over-aggressive routing: the threshold sends complex work to the cheap model; answers degrade silently because nothing crashes. (The critique agent exists for this.)
Context pruned past the evidence: summarization cuts tokens until the citation or constraint the answer depends on is gone.
Untracked fallback: a fallback fires for an hour; dashboards still show the preferred model; cost and quality attribution are both wrong.
Cost optimized without a quality gate: savings ship with no per-route regression test, so quality debt accrues invisibly.

Implementation checklist

What budgets exist (tokens, dollars, latency, retries, human escalations), and are they in task state?
Which steps can use a cheaper model or tool, and which truly need the strong one?
What caches / cheaper tools exist, and does the router prefer them when sufficient?
Is there a fallback chain, and does the trace record the realized model, not the requested one?
How is cost attributed per request and per route?
Do evals compare quality by route, not just final answers?
What is the graceful-degradation path when each budget is exhausted?

Where this points next

We have learned to spend less by routing easy work to cheap models and hard work to strong ones. But notice the asymmetry: the "complex" route still treats the strong model as a black box and just lets it think once. The next lesson, 19 · Reasoning techniques — decomposition, search, verification, is the other side of this coin. Where resource-aware optimization decides how little compute to spend, reasoning is about spending extra compute deliberately — decomposing a hard problem, searching over candidate solutions, and verifying before committing — precisely on the high-stakes "complex" path this lesson's router just identified. Routing tells you when a problem is worth more compute; reasoning tells you how to spend it well.

Takeaway

Resource-aware optimization is a policy layer over scarce resources — compute, time, money, context, tools, energy, and human attention — that chooses the cheapest path good enough for each step and degrades gracefully instead of failing. Its core mechanism is a routing agent that classifies request complexity and dispatches cheap-vs-strong (the book's Gemini Flash/Pro on Google ADK, the OpenAI simple/reasoning/internet_search router); routing easy traffic to a cheap model can cut spend ~60% the moment a meaningful fraction is genuinely simple. Pair it with fallback chains (OpenRouter's auto-selection and ordered fallback) for reliability and a critique agent for quality, widen it across the full taxonomy (adaptive tools, context pruning, learned allocation, graceful degradation), and record every decision in the trace so evals compare routes, not just answers. The governing rule: a cheaper route that drops a citation or a safety check is quality debt, not optimization.

Interview prompts

Why is "always use the strongest model" an anti-pattern rather than a safe default? (§1, §2 — it is pure waste on the large fraction of simple traffic; routing easy work to a cheap model cut our worked workload from $2,000 to $770/day, a 61% saving, because the router overhead is dwarfed by the gap it captures.)
How does a routing agent work, and what signal does it route on? (§2 — it classifies request complexity first, then dispatches: cheap/fast model for simple, strong/expensive for complex. The signal can be a heuristic like query length, an LLM classifier, or a fine-tuned router trained on (question → best model) pairs.)
What is a fallback chain and why must you record which model actually answered? (§3 — an ordered list of models tried until one succeeds, surviving rate limits/outages/filtering. OpenRouter returns the realized model id; without it you cannot attribute cost, reproduce results, or eval by route — the dashboard lies.)
What problem does the critique agent solve that the router cannot? (§3 — silent quality loss. The router does not crash on a bad route; the critique agent reviews answers for correctness/completeness/bias and feeds that back to retune routing, catching both waste and quality debt.)
Beyond model switching, name three more resource levers and one risk of context pruning. (§4 — adaptive tool selection, context pruning/summarization, proactive workload prediction, cost-sensitive exploration, energy-aware deployment, distributed compute, learned allocation, graceful degradation. Risk: pruning past the evidence — dropping a citation or safety check the answer depends on.)
How do you tell an optimization apart from quality debt? (§5 — record every resource decision (explainable, measurable, reversible) and run evals that compare quality per route. A cheaper route that drops a citation, verification, or safety check is debt, not savings; without a per-route quality gate you cannot distinguish them.)
Your agent has a $0.50 / 12s budget and is about to overrun. What does graceful degradation look like? (§4, running example — downshift the model (Pro→Flash), prune old logs to a summary while keeping evidence, and return a coarser answer labeled with reduced confidence — never a hard crash or silent overrun.)