all_lessons/agentic_systems/02 · prompt & context contractslesson 2 / 25

Part I - Foundations

Prompt and context contracts - the interface to the model

Lesson 01 drew a hard line between the model (a stateless function from text to text) and the control system around it. This lesson builds the only wire that crosses that line: the prompt going in and the structured decision coming out. We turn that wire from prose into an engineered contract — typed inputs, a typed output schema, a validation step, and an explicit context budget — because everything downstream (chaining, routing, planning, tools) is just code reading the model's answer, and code cannot read a paragraph.

The plan
Five moves. (1) Re-frame a "prompt" as an interface: the model is a function, and a function needs a typed signature. (2) Separate the two halves of the input — durable policy (the system prompt, set once) from dynamic context (what changes per call: state summary, retrieved evidence, tool output) — the book calls the second half context engineering. (3) Make the output a schema, not prose, and adopt the book's "parse, don't validate" rule with Pydantic so an invalid answer becomes a caught error, never a silent action. (4) Treat the prompting techniques the appendix catalogs — zero/one/few-shot, CoT, ReAct — not as magic phrases but as fields inside the contract. (5) Put real numbers on the context budget so "just give it more context" stops sounding free. We close on the rule that governs the rest of the track: a prompt is incomplete until its inputs, output schema, refusal mode, and evaluator are all named.
Linear position
Prerequisite: The agent boundary from lesson 01 — model vs. policy vs. state vs. environment. You must already see the model as a pure function the control loop calls, not as "the agent."
New capability: A reusable contract for every model call — named input slots, a typed output schema, a validation gate, and a sized context budget — so the next lessons can compose model calls into pipelines without each one re-inventing how to talk to the model.

1 · A prompt is a function signature, not a wish

In lesson 01 we established that the model is a stateless function: tokens in, tokens out, no memory between calls. The instinctive way to use such a function is to write it a paragraph — "You're a support assistant, look at this conversation and figure out what to do." That works in a chat window because a human reads the reply and supplies all the judgment. It fails the instant a program has to read the reply, because the program needs to branch on the answer, and you cannot reliably if on a paragraph.

So the first reframe of this lesson: stop thinking of a prompt as a request and start thinking of it as a function signature. A function signature names its inputs, names its return type, and tells the caller what happens on bad input. A production prompt must do the same three things. The book (Appendix A) opens with this exact spirit — prompt engineering is "structuring requests, providing relevant context, specifying output format" — and it lists the plain principles that make the signature legible to the model:

These are table stakes. The engineering — the part that turns a chat trick into an agent component — is everything that makes the signature machine-enforceable, which is the rest of this lesson.

2 · The input has two halves: durable policy vs. dynamic context

The most common beginner bug is one giant prompt that mixes everything: the persona, the rules, three examples, the conversation history, the current database row, and the last tool's raw output, all concatenated into a single string rebuilt from scratch on every turn. It is brittle (one edit risks all of it), expensive (you resend the unchanging parts every call), and the model loses the thread. The fix is to split the input by lifetime.

The book draws this line as prompt engineering vs. context engineering. The system prompt is the durable layer: it sets the model's role, rules, tone, and safety stance, and it is the same on every call ("You are a technical writer; tone must be formal and precise"). Context engineering is the dynamic layer: it assembles, per call, the background the model needs right now — conversation history, retrieved documents (RAG), tool outputs, and implicit data like user identity or environment state. The book's thesis is striking and worth internalizing: output quality depends more on the richness of the assembled context than on the model architecture. A frontier model with a thin context view underperforms a weaker model given the right operational picture.

role / policy
Durable. Who the model is, the rules it obeys, the refusal stance. Set once; ideally cached so you don't pay to resend it.
output schema
Durable. The exact shape the answer must take, with field descriptions. Also set once.
few-shot exemplars
Mostly durable. 2–5 input→output examples that pin format and style (§4). Curated, not regenerated.
objective
Dynamic. The specific thing this call must accomplish right now.
state summary
Dynamic. A compressed view of where the loop is — not the raw transcript (see §5 on budget).
evidence
Dynamic. Retrieved docs and tool outputs, each tagged with its source so the model can cite, not invent.
available actions
Dynamic. The legal moves right now (which tools, which routes) — this constrains the output schema's enums.

Two payoffs fall out of the split. First, the durable half is identical across calls, so providers can cache it and you stop paying full price to resend your 1,200-token system prompt on every turn (we cost this in §5). Second — the book's warning, restated — a bigger context window is not memory. Pasting the entire transcript into the dynamic half because "the window is huge now" is not context engineering; it is hoarding. Real memory is scoped storage with explicit read/write policies, which is its own pattern (lesson 10, Memory management). For now, treat the dynamic half as something you deliberately assemble and size, not a dumping ground.

Delimiters are not decoration
When you do assemble several pieces into one prompt, separate them unambiguously — the book recommends triple backticks, XML-style tags, or --- markers — so the model can tell the instruction from the article from the tool output. A retrieved document that bleeds into your instructions is a classic prompt-injection surface (revisited under Guardrails, lesson 20): untrusted evidence must be visibly fenced off from trusted policy.

3 · The output must be a schema — parse, don't validate

This is the load-bearing idea of the lesson. The book states it directly: in an agent, structured output is what lets the model's answer become the next component's input. Without it, the model's "decision" is prose that some regex or your own re-prompting has to interpret — and that interpretation layer is where agents rot.

Concretely: do not ask the model to "explain what it wants to do." Make it return a typed object whose every field the downstream code already knows how to consume. For our running coding/research assistant, the very first decision each turn is a route, and its contract looks like this:

route_schema = {
  "route":       "answer_directly | retrieve | call_tool | ask_human | stop",
  "tool":        "name of the tool to call, required iff route == call_tool",
  "args":        "object matching the chosen tool's parameter schema",
  "confidence":  "number in [0, 1]",
  "reason":      "short, evidence-grounded justification",
  "missing_info":"list of facts still needed before acting",
  "risk":        "low | medium | high"
}

# The model does NOT execute anything. It emits this object;
# the control loop (lesson 01) reads it and acts.
# Invalid JSON is not a route. It is a validation failure.
parsed = parse_json(model_output)        # may raise
decision = validate(parsed, route_schema)  # may raise

The book's recommended mechanism is Pydantic: define the schema as a typed Python model, then call Model.model_validate_json(llm_output) to parse and validate in one step. A field like confidence: float = Field(..., ge=0, le=1) or email: EmailStr is checked at the boundary; malformed output raises ValidationError instead of flowing onward as a plausible-looking lie.

from pydantic import BaseModel, Field, ValidationError
from typing import Literal, Optional

class Route(BaseModel):
    route: Literal["answer_directly","retrieve","call_tool","ask_human","stop"]
    tool: Optional[str] = Field(None, description="required iff route == call_tool")
    confidence: float = Field(..., ge=0.0, le=1.0)
    reason: str
    risk: Literal["low","medium","high"] = "low"

try:
    decision = Route.model_validate_json(model_output)   # parse + validate, one step
except ValidationError as e:
    decision = handle_invalid(e)   # retry with the error fed back, or escalate

The book names the principle: parse, don't validate at the system boundary. You don't accept a loose dict and sprinkle if-checks downstream; you convert the model's text into a known-good typed object once, at the edge, and everything inside the boundary can trust it. Three concrete wins:

The free-form foot-gun
The single most expensive mistake in agent code is letting prose from the model drive control flow — "if the reply contains the word 'search', call the search tool." It works in the demo and detonates in production the first time the model writes "I won't need to search for this." Free-form output consumed directly by code is the agentic equivalent of SQL string concatenation. Every model call that updates state or fires a tool MUST pass a schema gate first.

4 · Prompting techniques are fields in the contract, not the architecture

Appendix A is a catalog of techniques, and beginners tend to treat them as a bag of incantations. The framing this track insists on: each technique is a decision about what goes in a slot of the contract, not a replacement for having a contract.

Zero / one / few-shot
How many input→output exemplars you put in the durable half. Zero-shot for tasks the model has clearly seen; few-shot (the book says 3–5, diverse, correct, with shuffled class order for classification) when output format or style is unusual.
System / role prompting
Sets authority and stance in the durable policy slot ("Act as a senior data analyst"). It shapes tone; it does not relax the output schema.
Chain-of-Thought (CoT)
"Let's think step by step." Buys accuracy on multi-step reasoning at the cost of more output tokens. Best practice from the book: put the final answer after the reasoning, and use temperature 0 for single-answer tasks.
Self-consistency
Run the CoT prompt N times at higher temperature and majority-vote the answers. More reliable, N× the cost — a resource trade we revisit in lesson 18.
ReAct (Reason + Act)
Interleave thought → action (a tool call) → observation in a loop. The book's France-population example shows it; in our terms it is the route contract (§3) executed turn after turn. This is the backbone of tool use (lesson 07).
Decomposition
Split a hard task into sub-prompts and merge — the seam where this lesson hands off to prompt chaining (lesson 03).

The crucial point: CoT and the schema are not in tension. A common failure is to "add CoT" by letting the model ramble, then trying to parse a decision out of the ramble. Instead, give the reasoning its own field. The model writes its thinking into a reasoning string and its decision into the typed fields; the parser ignores the prose and reads the structure:

class ToolDecision(BaseModel):
    reasoning: str                       # CoT lives here, after the analysis
    route: Literal["call_tool","answer_directly","ask_human"]
    tool: Optional[str] = None
    confidence: float = Field(..., ge=0, le=1)
# The model "thinks step by step" in `reasoning`, then commits to typed fields.
# You keep the reasoning for debugging; control flow reads only the typed part.

So ReAct, few-shot, CoT, RAG, persona — every one of them resolves to "which slot, with what content, returning what schema." The architecture is the contract; the techniques are how you fill it.

5 · The context budget — putting numbers on "just add more context"

"Give the model more context" sounds free. It is not: every token in the prompt costs money and latency, and the dynamic half grows every turn. The book treats context as a managed resource (the explicit subject of lesson 18); here we just learn to size the contract so it cannot quietly blow up. A worked example for our coding/research assistant on a long session:

System policy (durable)
1,200 tok
Few-shot exemplars
800 tok
State summary (dynamic)
600 tok
Retrieved evidence
3,000 tok
Output (schema + reasoning)
400 tok

The naive transcript approach. Suppose instead of a 600-token summary you paste the raw transcript, and it grows ~700 tokens per turn. By turn 20 the dynamic state alone is ~14,000 tokens. Add the durable 2,000 and the 3,000 of evidence and you are sending ~19,000 input tokens per call. At an illustrative $3 / 1M input tokens that is ~$0.057 per call, and you re-pay almost all of it every turn because nothing is cached and the transcript keeps growing — call it ~$1.10 across the 20-turn session, dominated by re-sent history the model barely uses.

The engineered approach. Hold a compressed 600-token state summary instead of the transcript, and cache the 2,000 durable tokens (policy + exemplars). Per-call input is ~5,200 tokens, of which only ~3,200 are charged at full rate after the cache discount. That is roughly $0.011 per call versus $0.057 — about 5× cheaper, plus lower latency because the prompt is a quarter the size and the model isn't hunting for the needle in 14k tokens of chat. The contract's "context budget" is exactly this discipline: decide the token ceiling for each slot, summarize state instead of accumulating it, cache the durable half, and reject calls that would exceed the budget rather than letting them silently degrade.

CONTRACT (one model call) ┌─────────────────────────────────────────────────────────────┐ │ DURABLE role/policy ─┐ │ │ (cached) output schema ├─ same every call → cache, pay once │ │ few-shot ┘ │ ├─────────────────────────────────────────────────────────────┤ │ DYNAMIC objective │ │ (sized) state SUMMARY ← compressed, NOT raw transcript │ │ evidence [src] ← fenced, cited │ │ actions ← constrains output enums │ └──────────────────────────────┬──────────────────────────────┘ ▼ model (stateless fn) raw text output ▼ parse + validate (Pydantic) ── invalid ──▶ retry / escalate ▼ valid typed Decision object ───────▶ control loop acts (lesson 01)

Running example — from paragraph to contract

Put it together on the support/coding assistant. The user types: "my order hasn't shipped." The wrong design lets the model reply with a paragraph — "I think I may need to look up your order in the database, could you confirm..." — which the control loop then has to interpret. The contracted design forces the model to emit a validated object the loop can act on directly:

# Either a concrete, executable action…
{ "route": "call_tool", "tool": "lookup_order",
  "args": { "order_id": "A-2291" }, "confidence": 0.91,
  "reason": "user gave order id A-2291 earlier in session", "risk": "low" }

# …or an explicit request for the missing input, never a guess.
{ "route": "ask_human", "tool": null, "args": {},
  "confidence": 0.4, "reason": "no order id in context",
  "missing_info": ["order_id"], "risk": "low" }

Notice the contract makes uncertainty first-class: a low confidence plus a populated missing_info is a legal, well-typed answer that routes to ask_human — the model is allowed to say "I don't have enough to act," and the loop has a defined place to send that. This is the refusal mode the contract rule demands, and it is the seam into human-in-the-loop (lesson 15).

Checkpoint exercise

Try it
Take one prompt you already use in a chat window and rewrite it as a contract. Name the durable slots (role, schema, exemplars) and the dynamic slots (objective, state, evidence). Write the output as a Pydantic model with at least one Literal field and one numeric range. Then answer the four contract questions explicitly: What fields enter? What schema must leave? What happens on invalid output? Which context is durable vs. dynamic? If you cannot answer all four, the prompt is not yet a component.

Failure modes

  • The mega-prompt. One string mixing policy, examples, history, and raw tool output — brittle, costly, and the model loses the thread.
  • Prose-driven control flow. Code branching on free-form text ("if reply contains 'search'…"); detonates on the first negation.
  • Window-as-memory. Pasting the full transcript because the context window is large — cost and latency balloon while relevance drops.
  • Unfenced evidence. Retrieved/untrusted text concatenated next to instructions with no delimiter — an injection surface.
  • No refusal mode. A schema with no legal way to say "not enough info," so the model fills the gap with a confident guess.
  • Unbudgeted context. Slots with no token ceiling, so the dynamic half silently grows until quality and economics both crater.

Implementation checklist

  • What named fields enter the prompt, and which are durable vs. dynamic?
  • What typed schema must leave it? (Use Literal/enums + numeric ranges.)
  • What happens on invalid output — retry with the error fed back, or escalate?
  • Is the durable half cached so you don't re-pay for it every call?
  • Is evidence fenced with delimiters and tagged with its source?
  • Is there a legal "I'm not sure / missing X" path (confidence + missing_info)?
  • What is the token budget per slot, and what is the evaluator that grades the output?

Where this points next

We now have a single, reliable model call: a contract with named inputs, a durable/dynamic split, a typed schema gated by a parser, and a sized budget. But one well-formed call rarely solves a real task — "analyze this report, extract the data points, then draft an email" overloads a single call and the model drops steps. The natural next move is to chain contracts: make the typed output of one call the typed input of the next, so a hard task becomes a sequence of focused, individually-validated steps. That dependency — output-as-input, with structured handoffs between stages — is exactly what makes the chain reliable, and it is the subject of lesson 03. The book's market-research pipeline (summarize → identify trends as JSON → draft email) is the canonical first chain, and it only works because each link's output is the contract we just built.

Takeaway
A prompt becomes an engineering artifact when it is a contract, not a wish. Treat the model as a typed function: split the input into a durable half (role, schema, exemplars — cache it) and a dynamic half (objective, a compressed state summary, fenced/cited evidence, available actions) — the latter is the book's context engineering, and a rich context beats a bigger model. Make the output a schema, not prose, and parse, don't validate at the boundary (Pydantic) so a malformed answer is a caught error, never a silent action. Prompting techniques (few-shot, CoT, ReAct, RAG) are fields in the contract, not the architecture. Size every slot against a token budget so context cannot quietly balloon. The governing rule: a prompt is incomplete until its inputs, output schema, refusal mode, and evaluator are all named.

Interview prompts