Part III - Action and grounding

Tool use - function calling as controlled action

Every pattern so far kept the agent inside language: it chained prompts, routed, ran branches in parallel, and critiqued its own drafts. But a model that only emits text cannot read today's stock price, run your test suite, or send an email. Tool use is the pattern that lets the loop leave language — it gives the model a controlled, audited channel for taking real action and bringing back facts it could never have known.

Book source

Chapter 5 - Tool Use / Function Calling (工具使用); PDF outline pages 63-75. The chapter's running examples are a weather lookup, a stock-price lookup, code execution, and enterprise search, implemented across LangChain, CrewAI, and the Google Agent Developer Kit (ADK).

Linear position

Prerequisite: Lesson 06 (Reflection) and lesson 02's prompt/context contracts — you already know how to make the model emit a typed, validated structured output and how to fold a returned observation back into the next step.
New capability: a validated action channel that turns a structured model decision into a call against a live external system, and turns the system's reply into an observation the loop can reason over.

The plan

Five moves. (1) Name the wall a text-only model hits, and why "function calling" is the bridge over it. (2) Walk the six-step round-trip the book defines — define, decide, emit, execute, observe, continue — and locate exactly where control leaves the model and re-enters it. (3) Build the tool contract as a teaching object: name, description, schema, side-effect class, permissions, timeout, error shape. (4) Do the economics with real numbers — how observation size and retry budgets blow up your token bill and latency. (5) Make it concrete with the running coding/research assistant: a narrow, typed, auditable tool set instead of one run_anything. We close with the broader "tool calling" view (a tool can be an API, a database, even another agent) that lesson 08's planner and lesson 09's multi-agent system will lean on.

1 · The wall: a model is sealed off from the world

An LLM is a powerful text generator, but it is, in the book's framing, fundamentally isolated from the outside world: its knowledge is static, frozen at training time, and it cannot perform an action or fetch a live fact. Ask it "what is London's weather right now?" and it can only guess from stale training data; ask it "what is the exact profit on 100 shares of AAPL?" and it will produce a confident, plausibly-formatted, and frequently wrong number, because arithmetic over fresh data is exactly what next-token prediction is bad at.

Tool use (implemented through the function-calling mechanism) is the standardized way over this wall. We describe a set of external functions to the model; the model, reading the user request and the available tool descriptions, decides whether a tool is needed and emits a structured request naming the tool and its arguments; an orchestration layer outside the model executes the real function and feeds the result back. The model never runs anything itself — it only ever produces text. The text just happens to be a machine-readable instruction that something else acts on.

Mental model

Think of the model as a brilliant analyst locked in a room with a phone but no hands. It can describe precisely which lever to pull ("call get_stock_price with ticker="AAPL""), but a trusted operator outside the room is the one who actually pulls it and reads the result back through the phone. Function calling is the phone protocol; the operator is your orchestrator; the levers are your tools. Safety lives entirely in the operator, never in the analyst's good intentions.

2 · The six-step round-trip

The book lays out tool use as a precise loop. Memorize the seam in the middle — it is where every safety and reliability decision lives.

┌───────────────────── inside the model ─────────────────────┐ (1) TOOL (2) LLM (3) CALL (6) LLM DEFINITION ──▶ DECISION ──────▶ GENERATION ──┐ ┌──▶ PROCESSES name, purpose, "do I need a structured │ │ folds observation arg schema, tool? which?" JSON: tool + │ │ into next step → side effects args │ │ answer / re-call └─────────────────────┬─────────────┘ │ ════════════ SEAM: control leaves the model ═══════│════ │ │ ▼ │ (4) TOOL EXECUTION ──────▶ (5) OBSERVATION ─┘ orchestrator validates real result (or args + permissions, runs error) returned to the real function/API the model as context

Tool definition. Describe each external function to the model: its purpose, name, argument types, and meaning. This is a prompt-time fact — the descriptions become part of the context the model reasons over.
LLM decision. Given the request plus the tool catalog, the model judges whether one or more tools are needed.
Call generation. If yes, the model emits structured output (typically JSON) naming the tool and extracting the arguments from the request.
Tool execution. The framework / orchestration layer intercepts that structured output, identifies the requested tool, and actually runs the external function with the supplied arguments. This is outside the model.
Observation / result. The output (or error) is returned to the agent.
LLM processing (optional but usual). The model takes the observation as new context and produces the final reply, or decides to call another tool, reflect (lesson 06), or answer directly.

The dashed line between (3) and (4) is the entire lesson. Up to step 3, all you have is a string that claims to be a tool call — it has the trust level of any other model output, which is to say none. Steps 4 and 5 are ordinary software: parsing, validation, permission checks, execution, error handling. The discipline of tool use is refusing to let the model's intent become an action without that software in between.

Trap: hallucinated observations

The most insidious failure is the model fabricating step 5. If the orchestrator does not force the model to wait for a real observation, the model will happily continue as if the tool ran and returned a plausible value. The book's rule: after a tool call, the next model step must reason over the returned facts, never over the hoped-for effect. Make the real observation the only thing in context after a call.

3 · The tool contract — the real teaching object

The book's deepest point is that tool use is an interface design problem, not an API trick. The model only knows what the tool's description tells it, so the description is the API. A good tool contract pins down seven things:

Field	What it answers	Why the model / orchestrator needs it
name	Which lever?	The token the model emits; must be unambiguous (`get_stock_price`, not `tool3`).
description	When to call it?	The model's only basis for the step-2 decision. "Use for questions like 'capital of France' or 'weather in London'" is the book's own style — examples beat adjectives.
arg schema	What goes in?	Types + meaning so the model can extract args from the request and the orchestrator can validate them.
side-effect class	Read-only? Reversible write? Irreversible?	Decides whether a dry-run, log, or human approval gate is required.
permissions	Allowed in this task / for this user?	Enforced by the orchestrator, outside the model — never by prompt text.
timeout	How long before we give up?	A hung tool must become a recoverable error (lesson 14), not a stuck agent.
error shape	What does failure look like?	Errors are observations too. The book's CrewAI tool raises `ValueError` when a ticker is missing — so the model can say "I couldn't find that price" instead of inventing one.

That last row is worth dwelling on. In the book's CrewAI example, get_stock_price("AAPL") returns 178.15, but an unknown ticker raises an exception, and the task explicitly instructs the agent to say so plainly rather than guess. Designing the failure path is as much a part of the contract as the success path — an error that reaches the model as a clean, typed observation is recoverable; one that is swallowed or truncated becomes a silent wrong answer.

Frameworks, briefly. The chapter shows three. LangChain wraps a plain Python function with a @tool decorator (the docstring becomes the description), then create_tool_calling_agent + an AgentExecutor manage the round-trip — its search_information tool is a stubbed fact lookup. CrewAI attaches a @tool-decorated function to an Agent with a role/goal/backstory and runs it inside a Crew — the stock-price analyst above. Google ADK ships built-in tools you bind directly: google_search for web lookup, a BuiltInCodeExecutor that runs Python in a sandbox for exact arithmetic, and VSearchAgent for Vertex AI Search over a private datastore with source attribution. The abstractions differ; the six-step loop underneath is identical.

Function calling vs. Vertex Extensions

The book draws one sharp distinction. Plain function calling hands the structured request back to you — the client/orchestrator executes it (steps 4–5 are yours). Google's Vertex AI Extensions are structured API wrappers where the platform auto-executes the call with enterprise security and data-privacy guarantees built in. Same pattern, different owner of the execution step.

4 · The economics — observations and retries cost real money

Tool use is not free. Every observation you feed back in step 5 becomes input tokens on the next model call, and the whole conversation (system prompt + tool catalog + every prior call and observation) is re-sent on each turn because the model is stateless. This is where naive tool design quietly bankrupts a project, so let us do the arithmetic.

Worked example — the coding agent reads a file. Suppose the agent calls read_file on a 1,200-line source file. At roughly 10 tokens per line that observation is about 12{,}000 tokens. The agent then needs three more reasoning turns to finish the task. Because context is re-sent every turn, that one fat observation is paid for four times:

4 turns × 12{,}000 obs-tokens = 48{,}000 input tokens, just for one file.

Now contrast a read_file_slice(path, start, end) tool that returns only the 40 relevant lines (~400 tokens):

4 turns × 400 = 1{,}600 tokens → a 30× reduction, same task solved.

At a representative input price of $3 per million tokens, the fat version costs 48{,}000 / 10^6 × \$3 ≈ \$0.14 for that single file read, the lean version ≈ \$0.005. Multiply by thousands of files across thousands of sessions and the difference is the company's cloud bill. This is why the book insists the observation be designed for the next decision, not for raw completeness.

Retries multiply everything. Give a flaky tool a retry budget of 3. If each attempt round-trips ~5,000 tokens of context, a tool that fails twice before succeeding spends 3 × 5{,}000 = 15{,}000 tokens and triples the latency. A retry budget is a real budget — cap it, and prefer a clean error observation (which the model can route around) over silent infinite retries. The widget below lets you feel both knobs at once.

Tool-call economics — observation size × turns × retries

A single tool call's observation is re-sent on every subsequent turn (the model is stateless), and a retry budget re-pays the round-trip on each attempt. Drag the sliders to see total input tokens, cost, and added latency. Watch how trimming the observation beats almost everything else.

observation tokens: 12000 reasoning turns after: 4 retries per call: 1

Total input tokens

48000

Cost @ $3 / 1M

$0.144

Added latency

2.4 s

vs. 400-tok slice

30×

Show the core JS

// context is re-sent every turn; retries re-pay the round-trip
const totalTokens = obs * turns * retries;        // dominant term
const cost        = totalTokens / 1e6 * 3.0;        // $3 / 1M input tokens
const latency     = totalTokens / 20000;            // ~20k tok/s throughput → seconds
const lean        = 400 * turns * retries;          // a tight read_file_slice
const ratio       = totalTokens / lean;             // how much fat costs you

5 · Running example — a typed, auditable tool set for the coding agent

Tie it together with the track's running coding/research assistant. The lazy design is one tool: run_anything(command), "protected" by a line in the prompt that says please be careful. That is the book's canonical anti-pattern — a broad capability gated only by text the model can ignore or be talked out of. The disciplined design replaces it with a small set of narrow, typed, side-effect-labelled tools:

tools = [
  {"name": "list_files",          "args": {"dir": "str"},                       "side_effect": "read_only",   "approval": False},
  {"name": "read_file_slice",     "args": {"path": "str", "start": "int", "end": "int"}, "side_effect": "read_only", "approval": False},
  {"name": "apply_patch",         "args": {"path": "str", "diff": "str"},        "side_effect": "reversible",   "approval": False},
  {"name": "run_read_only_tests", "args": {"cmd": "str", "timeout_s": "int=60"}, "side_effect": "read_only",   "approval": False},
  {"name": "request_approval",    "args": {"action": "str", "reason": "str"},    "side_effect": "none",         "approval": True},
]

# the seam: model intent -> validated action
request    = parse_tool_call(model_output)          # step 3 output is just a string
validate_schema(request, tools)                     # types, required args, ranges
check_permissions(request, task_state)              # outside the model
if needs_approval(request): request_human_ok(...)   # irreversible -> lesson 15
observation = execute_tool(request, timeout=...)    # step 4, with timeout + error capture
task_state.observations.append(trim(observation))   # step 5, sized for the next decision
log.append({"call": request, "obs_hash": hash(observation)})  # replayable audit trail

The model can still act — edit code, run tests — but the action space is enumerable, every argument is validated, every side effect is labelled, anything irreversible routes through a human (lesson 15), and every call is logged with a hash of its observation so the whole episode is replayable. Notice read_file_slice rather than read_file: section 4's economics are baked straight into the tool surface.

Try it

Take one real API you might expose to an agent. Rewrite it as a model-facing tool contract: name, description (with a "use when…" example), argument types, side-effect class, error shape, timeout, and approval rule. Then ask: what is the worst thing a malformed or adversarial argument could do, and which line of the orchestrator stops it?

Failure modes

A broad run_anything / execute_sql tool gated only by prompt text — the prompt is not a permission system.
Hallucinated observations: the model continues as if the tool ran. Force a real result into context before the next step.
Tool results that hide errors or silently truncate — the model then reasons over a half-truth.
No side-effect taxonomy, so an irreversible delete is treated like a harmless read.
Fat observations re-sent every turn — cost and latency balloon (section 4).
Vague descriptions, so the model calls the wrong tool or skips a needed one (the step-2 decision is only as good as the description).

Implementation checklist

What side effects can the tool produce — read-only, reversible, irreversible?
Who validates arguments and permissions, and where (it must be outside the model)?
What does a clean, typed error look like, and does it reach the model as an observation?
What timeout and retry budget bound the call?
How large is the observation, and is it trimmed to what the next decision needs?
Which calls require human approval before execution?
Is every call logged with its arguments and observation for replay?

Where this points next

We now have a single validated action: the model decides, the orchestrator executes, the observation comes back. But real tasks need many such actions in the right order — read the failing test, find the bug, patch it, re-run the tests — and the model rarely knows that whole sequence up front. The broader "tool calling" view we ended on (a tool can be an API, a database, or even an instruction to another agent) is the hinge: once an agent can call tools, it can call plans. Lesson 08 (Planning — from goal to executable path) takes exactly this step, turning a goal into a multi-step path of tool calls when the path is not known in advance; lesson 09 then lets one of those "tools" be a whole specialist agent.

Takeaway

Tool use, implemented via function calling, breaks the model out of its sealed text-only world by letting it emit a structured request that an orchestrator validates and executes against a live system, returning a real observation. The six-step loop — define, decide, emit, execute, observe, continue — has one critical seam: between the model's emitted call and its execution, all trust, validation, permission, and side-effect handling must live in software, never in the prompt. The tool contract (name, description, schema, side-effect class, permissions, timeout, error shape) is the real interface; observations must be designed for the next decision, not for completeness, because every one is re-paid in tokens on every subsequent turn. LangChain (@tool + executor), CrewAI (tool-equipped agents), and Google ADK (built-in google_search, code executor, Vertex AI Search) all wrap the same loop.

Interview prompts

Walk through the six steps of a tool-use round-trip and point to where control leaves the model. (§2 — define, decide, emit (still inside the model), then the seam: execute and observe happen in the orchestrator, then the model processes the result. Everything after the emitted call is ordinary, untrusted-input software.)
Why is a tool gated only by a sentence in the system prompt a security bug? (§5 — the prompt is model context, not an enforcement boundary; the model can be argued out of it. Permissions and validation must run in the orchestrator outside the model.)
The model returns a final answer right after requesting a tool, without waiting. What went wrong? (§2 — hallucinated observation; the loop did not force a real result into context before the next step, so the model invented the tool output.)
Your agent's token bill is 30× higher than expected on file-heavy tasks. First thing to check? (§4 — observation size. Fat observations are re-sent every turn and re-paid per retry; switch read_file to read_file_slice and trim observations to what the next decision needs.)
Why does the book make an unknown stock ticker raise an error instead of returning null? (§3 — the error shape is part of the contract; a clean typed error becomes a recoverable observation ("I couldn't find that price") instead of the model inventing a number.)
What's the difference between plain function calling and a Vertex AI Extension? (§3 — same pattern, different owner of execution: function calling hands the structured request back to your client to run; an Extension auto-executes it on the platform with built-in enterprise security.)
How would you decide whether an agent action needs human approval? (§3, §5 — by its side-effect class; read-only runs freely, reversible writes are logged, irreversible actions route through request_approval and lesson 15's human-in-the-loop gate.)