Part VII - Reasoning, safety, and evaluation
Reasoning techniques - decomposition, search, verification
Up to now the agent loop has been about structure — routing, tools, memory, budgets. This lesson is about spending compute to think harder on a single hard decision. Reasoning techniques are the named ways an agent can trade more inference-time work for more accuracy: break a problem into steps, branch and backtrack, act-observe-act, debate, self-correct, and — the engineering crux — verify. The book's thesis is that performance no longer depends only on model size; it depends on how much "thinking time" you allocate and how you allocate it.
BuiltInCodeExecutor), Chain/Graph of Debate, RLVR reasoning models, MASS, and the Inference Scaling Laws.
Running code: the Google gemini-fullstack-langgraph-quickstart "DeepSearch" agent built on LangGraph.
New capability: Choosing and implementing a reasoning strategy (decomposition, search over thoughts, reason-act loops, multi-agent debate, self-correction) gated by a verifier and a thinking budget, with inspectable artifacts left behind in the trace.
1 · The one idea: reasoning is budgeted computation
A standard model call is a single forward pass: prompt in, answer out. For an arithmetic word problem or a multi-hop research question, that single pass often produces a confident wrong answer, because the model has committed to the first token of the answer before it has "worked anything out." Reasoning techniques are interventions that force the model to spend more inference-time computation — more tokens, more candidate paths, more tool round-trips — before committing. The book's framing: at inference time you can allocate variable "thinking time," and for hard problems that allocation buys accuracy, coherence, and robustness.
The governing principle is what the book calls the inference scaling law (推理扩展定律 in the original). Unlike the training scaling law (bigger model, more data → better), the inference scaling law describes a runtime trade-off: by spending more compute when generating an answer, whether by sampling multiple candidates and selecting among them or by searching deeper, even a small model can match or beat a larger model run once. The lever is not better hardware; it is a smarter inference algorithm (diverse sampling, self-consistency, verification). This is the economic heart of the chapter: it challenges "bigger is always better" and says to allocate thinking deliberately.
2 · The ladder of techniques, cheapest to most expensive
The chapter is essentially a cost-ordered menu. Each rung spends more compute and unlocks a harder class of problem. Define each as you climb; the engineering job is to stop at the cheapest rung that solves your task.
| Technique | What it does | Relative cost | Use when |
|---|---|---|---|
| Direct answer | Single forward pass, no scratchpad. | 1× | Easy, low-risk, factual-lookup tasks. |
| Chain-of-Thought (CoT) | Emit intermediate reasoning steps before the answer; "think step by step." Turns one hard problem into a sequence of easy ones, and makes reasoning auditable. | ~2-5× tokens | Arithmetic, multi-step logic, anything where the answer depends on a derivation. |
| Self-Consistency | Sample k independent CoT chains and take a majority vote on the final answer. Averages out one-off slips. | k× passes | The model is mostly right but occasionally derails; you have a discrete answer to vote on. |
| Tree-of-Thoughts (ToT) | Explore multiple reasoning branches as a tree, evaluate partial states, backtrack from dead ends, keep the best. Search, not a single line. | k×depth, with pruning | Planning, puzzles, strategy — where you must compare and abandon partial solutions. |
| ReAct (reason + act) | Interleave Thought → Action (tool call) → Observation → Thought… so the model grounds each step in real evidence and adapts to feedback. | per-step tool latency | Knowledge-intensive or environment-coupled tasks needing live data, calculation, or APIs. |
| Debate (CoD / GoD) | Multiple model instances propose, critique, and rebut, converging on a checked consensus. Chain-of-Debate is a roundtable; Graph-of-Debate is a non-linear support/rebut network. | N agents × rounds | High-value, bias-prone, or contested questions where one chain is too fragile. |
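The Self-Consistency row in the table can be made concrete with a few lines of code. This is a minimal sketch, assuming each chain has already been reduced to a discrete final answer; `fake_chain` and its canned answers are hypothetical stand-ins for real model calls.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(sample_answer: Callable[[int], str], k: int) -> Tuple[str, float]:
    """Sample k final answers and return (majority answer, share of votes)."""
    answers: List[str] = [sample_answer(i) for i in range(k)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / k


# Hypothetical stand-in for "run one CoT chain and extract its final answer".
# Three chains agree, two derail in different directions.
CANNED = ["42", "42", "41", "42", "17"]


def fake_chain(i: int) -> str:
    return CANNED[i % len(CANNED)]


answer, share = self_consistency(fake_chain, k=5)
print(answer, share)
```

The vote share is worth keeping alongside the answer: a narrow majority is a cheap signal that the task may need a stronger rung or a real verifier.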
Chain-of-Thought is the foundation — the book calls it the agent's "internal monologue." Its value is transparency as much as accuracy: a difficult problem becomes a series of simple, inspectable steps, which is exactly what makes agent actions auditable. The chapter's worked CoT example is an information-retrieval agent given a fixed five-step procedure (analyze the question → formulate retrieval queries → simulate retrieval and self-check → synthesize → review and refine) answering "what is the difference between classical and quantum computers, and name one application." The agent's visible thoughts walk through identifying key terms, drafting sub-queries, surfacing concepts (bits vs. qubits, superposition, entanglement), synthesizing, and a final review pass before emitting the answer.
Tree-of-Thoughts generalizes the single chain into a branching search: the model can fork into several candidate continuations, evaluate how promising each partial state is, backtrack out of dead ends, and select the strongest leaf. Self-correction is the inner loop that makes this work — the agent critiques its own draft against the original requirements, finds gaps, and rewrites. The chapter's self-correction example takes a flat social-media draft ("We have new products. They are eco and tech.") and runs a five-step critique that diagnoses weak verbs, missing eco-emphasis, and a soft call-to-action, then rewrites it into a punchy 148-character post. That is reflection (lesson 06) deployed as a reasoning primitive.
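The branch, evaluate, backtrack cycle can be sketched as a small beam search. This is an illustrative toy, not the book's implementation: the puzzle (reach a target number from 1 using "+3" and "*2"), the distance-based `score`, and all function names are assumptions, and a deterministic heuristic stands in for a model-based state evaluator.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Tuple[int, Tuple[str, ...]]  # (current value, moves taken so far)

MOVES: Dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def expand(state: State) -> List[State]:
    """Fork a partial solution into one child per available move."""
    value, moves = state
    return [(fn(value), moves + (name,)) for name, fn in MOVES.items()]


def score(state: State, target: int) -> int:
    """Evaluate a partial state: closer to the target is better."""
    return -abs(target - state[0])


def tree_search(start: int, target: int, beam_width: int, max_depth: int = 6) -> Optional[State]:
    frontier: List[State] = [(start, ())]
    for _ in range(max_depth):
        children = [child for state in frontier for child in expand(state)]
        # Both moves only increase the value, so overshooting is a dead end.
        children = [c for c in children if c[0] <= target]
        for child in children:
            if child[0] == target:
                return child
        children.sort(key=lambda c: score(c, target), reverse=True)
        frontier = children[:beam_width]  # keep only the most promising branches
        if not frontier:
            return None  # every branch died
    return None


greedy = tree_search(1, 10, beam_width=1)  # a single chain: follows 4 -> 8, then dead-ends
beam = tree_search(1, 10, beam_width=2)    # keeps 7 as a second branch and recovers
print(greedy, beam)
```

With a beam of one, the search commits to the locally best state and cannot recover; keeping a second branch is what lets it abandon the dead end.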
ReAct is the rung that fuses reasoning with the tool use you built in lesson 07. Instead of reasoning in a vacuum, the agent runs a Thought → Action → Observation loop: it thinks, picks an action from a fixed set (search, query a knowledge base, call an API, or "finish"), observes the result, and thinks again, repeating until it decides it has the answer. The book notes you can tune the thinking frequency: fact-checking inserts a thought before every step; navigation-style tasks can think less often. For tasks that need repeated environment interaction, ReAct is generally more robust than linear CoT, because every step is anchored to a real observation rather than to the model's prior.
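A Thought → Action → Observation loop fits in a few dozen lines. The sketch below is a toy under stated assumptions: `scripted_policy` is a hypothetical stand-in for the model's next-step decision, and the two tools (`lookup`, `c_to_f`) are invented for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Decision = Tuple[str, str, str]  # (thought, action name, action input)


def c_to_f(celsius: float) -> float:
    return celsius * 9 / 5 + 32


# Hypothetical tool set; "finish" is handled by the loop, not listed here.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup": lambda key: {"water_boiling_point_c": "100"}.get(key, "unknown"),
    "c_to_f": lambda value: str(c_to_f(float(value))),
}


def scripted_policy(question: str, observations: List[str]) -> Decision:
    """Stand-in for the model: picks the next step from what has been observed."""
    if not observations:
        return ("I need the boiling point in Celsius first.", "lookup", "water_boiling_point_c")
    if len(observations) == 1:
        return ("Now convert that reading to Fahrenheit.", "c_to_f", observations[-1])
    return ("I have the converted value.", "finish", observations[-1])


def react(question: str, policy: Callable[[str, List[str]], Decision],
          tools: Dict[str, Callable[[str], str]], max_steps: int = 5
          ) -> Tuple[Optional[str], List[dict]]:
    observations: List[str] = []
    trace: List[dict] = []
    for _ in range(max_steps):  # the step budget doubles as a stop rule
        thought, action, arg = policy(question, observations)
        if action == "finish":
            trace.append({"thought": thought, "action": action, "answer": arg})
            return arg, trace
        observation = tools[action](arg)  # act, then observe
        observations.append(observation)
        trace.append({"thought": thought, "action": action, "observation": observation})
    return None, trace  # budget exhausted without an answer


answer, trace = react("What is the boiling point of water in Fahrenheit?", scripted_policy, TOOLS)
print(answer, len(trace))
```

Each decision sees only the accumulated observations, which is the point of the pattern: the next thought is conditioned on evidence, not on the previous thought alone.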
Debate (Chain-of-Debate from Microsoft, and the richer Graph-of-Debate) is the multi-agent rung, building on lesson 09. Several models argue — propose, critique, rebut — and converge on an answer backed by agreement, search verification, or accepted fact. It trades a lot of compute for reduced bias and a transparent reasoning record, and is the bridge from single-agent reasoning to multi-agent collaboration.
3 · Watching the scaling law: best-of-k with a verifier
The inference scaling law is easy to state and easy to get wrong, so let's make it numeric. Suppose a base model answers a math task correctly with probability p = 0.55 on any single attempt. Run it once and you get 55% accuracy. Now sample k independent chains and select among them — but how you select changes everything:
- Majority vote (self-consistency). You have no oracle, so you trust the most common answer. This helps only when correct answers cluster and wrong ones scatter.
- Best-of-k with a verifier. You have a checker — unit tests, a math evaluator, a rubric model — that can recognize a correct answer when it sees one. Then you succeed if any of the k attempts is correct: accuracy = 1 - (1-p)^k.
Worked number. With p = 0.55, independent attempts, and a perfect verifier, best-of-4 gives 1 - 0.45^4 = 1 - 0.041 = 0.959, about 96%, for 4× the inference cost. A larger model that scores p = 0.80 in a single pass costs, say, 6× the small model per call. So which is cheaper for 95% accuracy? The small model at best-of-4 costs 4 × 1 = 4 units and hits 96%; the big model at one pass costs 6 units and hits only 80%. Provided the verifier itself is cheap to run, the small model with a verifier is both cheaper and more accurate. That is the inference scaling law in one comparison, and it is why "spend thinking time" beats "buy a bigger model" for verifiable tasks.
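The arithmetic is easy to check in code. This sketch assumes independent attempts and a perfect verifier, as in the worked number; the cost units (1 per small-model call, 6 per big-model call) come from the example above, and the function names are illustrative.

```python
def best_of_k_accuracy(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts is correct."""
    return 1 - (1 - p) ** k


def min_samples(p: float, target: float, k_max: int = 100) -> int:
    """Smallest k whose best-of-k accuracy reaches the target."""
    for k in range(1, k_max + 1):
        if best_of_k_accuracy(p, k) >= target:
            return k
    raise ValueError("target not reachable within k_max samples")


small_p, small_cost_per_call = 0.55, 1
big_p, big_cost_per_call = 0.80, 6

k = min_samples(small_p, target=0.95)
small_accuracy = best_of_k_accuracy(small_p, k)
small_cost = k * small_cost_per_call

print(f"small model: k={k}, accuracy={small_accuracy:.3f}, cost={small_cost}")
print(f"big model:   k=1, accuracy={big_p:.3f}, cost={big_cost_per_call}")
```

Three samples are not enough (about 0.909), which is why the comparison lands on best-of-4.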
4 · The engineering rule: no verifier, no reasoning
The single most important sentence in this chapter, restated for engineers: a reasoning method without a verifier is expensive guessing. ToT that explores ten branches but cannot tell a good branch from a bad one wastes ten times the compute to return the model's original guess. Debate without a way to score arguments converges on whichever agent is most fluent, not most correct. Self-consistency on a task with no canonical answer just amplifies the dominant bias. So the pattern shape is always the same four steps:
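One way to read that pattern is propose, verify, select, stop. That reading is an assumption drawn from the surrounding text, not a quotation of the book's four steps. A minimal sketch of it, with a hypothetical integer-square-root task as the verifiable problem:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Result:
    answer: Optional[int]
    attempts: List[dict] = field(default_factory=list)
    stop_reason: str = ""


def gated_reasoning(propose: Callable[[int], int],
                    verify: Callable[[int], bool],
                    max_samples: int) -> Result:
    attempts: List[dict] = []
    for i in range(max_samples):
        candidate = propose(i)   # 1. propose a candidate
        passed = verify(candidate)  # 2. verify it with an external check
        attempts.append({"candidate": candidate, "passed": passed})
        if passed:               # 3. select the verified one, 4. stop
            return Result(candidate, attempts, "verifier_passed")
    return Result(None, attempts, "budget_exhausted")


# Hypothetical task: find the integer square root of 1369.
guesses = [35, 36, 37, 38]
result = gated_reasoning(
    propose=lambda i: guesses[i],
    verify=lambda x: x * x == 1369,  # cheap, exact check
    max_samples=len(guesses),
)
print(result.answer, result.stop_reason, len(result.attempts))
```

The loop returns a stop reason and the list of attempts, so a budget-exhausted run stays distinguishable from a verified one.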
Where do verifiers come from? The book points at RLVR — Reinforcement Learning from Verifiable Rewards — as the reason modern "reasoning models" are good: they are trained on problems with checkable answers (math, code) so the model learns to generate long, self-correcting, backtracking reasoning traces, rewarded by an automatic checker rather than a human. The same insight applies at your layer: prefer tasks and decompositions where a step's correctness can be checked cheaply. Concretely, a verifier can be:
BuiltInCodeExecutor), which offloads exact computation to deterministic code.
5 · Externalize artifacts, not raw chain text
Appendix F asks six leading assistants (Gemini, ChatGPT, Grok, Kimi, Claude, DeepSeek) to describe their own reasoning, and they converge on a near-identical staged pipeline: parse the prompt → activate/retrieve relevant knowledge → choose a reasoning mode → simulate step-by-step thinking → generate → review and refine. Kimi's transcript is the cleanest illustration — asked whether 3⁴ or 4³ is larger, it shows stages 0-5: tokenize, classify the task as integer-power comparison, choose the cheapest safe strategy (direct computation, switching to a log comparison only if the exponents were large), retrieve 3³=27 and 4³=64, compute 3⁴=81, cross-check via 81 mod 5 = 1 vs 64 mod 5 = 4, and emit a confidence score of 0.99 with a noted edge case. Several models (Claude, DeepSeek) explicitly caveat that this is simulated reasoning — pattern prediction "walking along reasoning trails laid down in the training data," not human cognition.
The engineering takeaway is sharp: reasoning is an internal control resource, not a monologue to dump on the user verbatim. Raw chain text is unstructured, sometimes confabulated, and unsafe to expose. What you store and surface are artifacts — the structured by-products of reasoning that users and evaluators can actually inspect and score:
- Decomposition — the subproblems the task was split into.
- Assumptions — what the agent took as given (so a human can challenge them).
- Search branches — which alternatives were tried and why each was abandoned.
- Checks performed — the tests, queries, or rubric scores, and their results.
- Confidence — a calibrated score (Kimi's 0.99) and the noted edge cases.
These artifacts are what lesson 21 (Evaluation) scores — the trajectory, not just the final answer — and what lesson 14's recovery loop replays. Storing them is the difference between an agent you can debug and one you can only restart.
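A structured record makes the five artifact types concrete. The schema below is a sketch with hypothetical field names, not a format the book prescribes; the sample values reuse the 3⁴ versus 4³ example described above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Branch:
    description: str
    abandoned_because: str = ""  # empty when the branch was kept


@dataclass
class Check:
    name: str
    passed: bool
    detail: str = ""


@dataclass
class ReasoningArtifact:
    task: str
    decomposition: List[str]
    assumptions: List[str]
    branches: List[Branch]
    checks: List[Check]
    confidence: float
    edge_cases: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into the nested Branch and Check dataclasses.
        return json.dumps(asdict(self), indent=2)


artifact = ReasoningArtifact(
    task="Which is larger, 3**4 or 4**3?",
    decomposition=["compute 3**4", "compute 4**3", "compare the two values"],
    assumptions=["exponents are small enough for direct computation"],
    branches=[
        Branch("direct computation"),
        Branch("log comparison", abandoned_because="not needed for small exponents"),
    ],
    checks=[Check("direct computation", 3**4 > 4**3, detail=f"{3**4} vs {4**3}")],
    confidence=0.99,
)
print(artifact.to_json())
```

Because the record is plain JSON, it can be stored in the trace, replayed, and scored without parsing any free-form chain text.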
6 · Running example: the coding/research agent thinks
Take our running agent two ways. As a debugger: a test is failing. The agent does not patch blindly. It forms two hypotheses (H1: an off-by-one in the loop bound; H2: a null returned by the upstream fetch), writes a targeted assertion for each, executes both — H2's assertion fails — patches the upstream guard, reruns the suite green, and stores {hypotheses, the two test results, the chosen fix, confidence} in the trace. The executable test is the verifier; the artifact is replayable. That is ReAct (think-act-observe) plus self-correction, and it stops the moment the suite passes.
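The debugger walk-through can be sketched as hypothesis checks whose results go straight into the trace. Everything here is hypothetical: `total` and `fetch_scores` stand in for the code under suspicion, and each assertion states the healthy behavior, so a failing assertion confirms its hypothesis.

```python
from typing import Callable, Dict, List, Optional


def total(values: List[int]) -> int:
    # Suspect for H1: an off-by-one in the loop bound would change this sum.
    result = 0
    for i in range(len(values)):
        result += values[i]
    return result


def fetch_scores(user_id: str) -> Optional[List[int]]:
    # Suspect for H2: hypothetical upstream fetch that returns None for unknown users.
    return {"alice": [1, 2, 3]}.get(user_id)


def check_h1() -> None:
    assert total([1, 2, 3]) == 6, "loop bound drops or repeats an element"


def check_h2() -> None:
    assert fetch_scores("bob") is not None, "upstream fetch returned None"


def run_checks(checks: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Execute each targeted assertion and record the outcome instead of crashing."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "passed"
        except AssertionError as exc:
            results[name] = f"failed: {exc}"
    return results


results = run_checks({"H1": check_h1, "H2": check_h2})
trace = {
    "hypotheses": ["H1: off-by-one in the loop bound", "H2: None returned by the upstream fetch"],
    "results": results,
    "chosen_fix": "guard the upstream fetch" if results["H2"] != "passed" else None,
}
print(trace)
```

A real agent would then apply the fix and rerun the full suite; the trace entry is what makes that decision replayable.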
As a researcher: the book's concrete template is Google's open-source DeepSearch agent (the gemini-fullstack-langgraph-quickstart repo), a LangGraph state machine whose nodes are exactly the reasoning loop — generate_query → web_research → reflection → finalize_answer — with a conditional edge after reflection that loops back to more searching if a knowledge gap remains, or proceeds to the answer if not:
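from langgraph.graph import StateGraph, START, END
# OverallState, Configuration, and the node and routing functions used below are defined in the quickstart repo and are not shown here.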
builder = StateGraph(OverallState, config_schema=Configuration)
builder.add_node("generate_query", generate_query)
builder.add_node("web_research", web_research)
builder.add_node("reflection", reflection) # <- the verify/self-correct step
builder.add_node("finalize_answer", finalize_answer)
builder.add_edge(START, "generate_query")
builder.add_conditional_edges("generate_query", continue_to_web_research, ["web_research"])
builder.add_edge("web_research", "reflection")
# reflection decides: knowledge gap remains -> search again, else -> finalize
builder.add_conditional_edges("reflection", evaluate_research, ["web_research", "finalize_answer"])
builder.add_edge("finalize_answer", END)
graph = builder.compile(name="pro-search-agent")
The reflection node is the verifier: it inspects what was found, names the remaining gap, and either spends more thinking budget (loop) or stops. The book also notes the advanced MASS framework (Multi-Agent System Search), which automates this design: optimize each agent's prompt in isolation, then search for the best interaction topology (it found that for the MBPP coding benchmark, "iterative self-correction combined with external verification" was optimal), then globally optimize the system prompts. MASS is the meta-lesson — the reasoning structure itself is a thing you can tune.
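The conditional edge after reflection can be illustrated as a plain routing function. This is a simplified sketch, not the repo's actual `evaluate_research`: the state fields (`is_sufficient`, `research_loop_count`, and the rest) and the `max_research_loops` default are assumptions for illustration, and the function returns only the name of the next node.

```python
from typing import List, TypedDict


class ReflectionResult(TypedDict):
    is_sufficient: bool            # did reflection judge the evidence complete?
    knowledge_gap: str             # what is still missing, in words
    follow_up_queries: List[str]   # searches that would close the gap
    research_loop_count: int       # how many search rounds have run so far


def route_after_reflection(state: ReflectionResult, max_research_loops: int = 2) -> str:
    """Return the name of the next node: loop back to search, or finalize."""
    if state["is_sufficient"]:
        return "finalize_answer"   # verifier passed
    if state["research_loop_count"] >= max_research_loops:
        return "finalize_answer"   # thinking budget exhausted
    return "web_research"          # gap remains and budget allows another round


gap_state: ReflectionResult = {
    "is_sufficient": False,
    "knowledge_gap": "no source for the 2024 figures",
    "follow_up_queries": ["2024 figures primary source"],
    "research_loop_count": 1,
}
print(route_after_reflection(gap_state))
print(route_after_reflection({**gap_state, "research_loop_count": 2}))
```

The two exits encode the two stop rules: the verifier is satisfied, or the loop budget is spent.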
Failure modes
- Length ≠ quality. Longer output mistaken for better reasoning; the model rambles to look thorough while the answer is unchanged.
- Search without a selector. ToT or debate with no verifier — N× compute to return the original guess; the fluent branch wins, not the correct one.
- Verifier rot. A weak judge model (precision < 1) confidently selects wrong candidates; the best-of-k ceiling silently collapses.
- Hidden artifacts. Reasoning kept only as raw chain text — un-replayable, un-scorable, sometimes confabulated, unsafe to expose.
- Budget blowout. No stop rule, so the reason-act loop or debate runs until the context window or the cost ceiling, not until the task is solved.
- Reasoning on a non-verifiable task. Self-consistency on a subjective question just amplifies the dominant bias.
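The verifier-rot failure mode can be quantified with a small closed form. The selection policy here is an assumption made for illustration: candidates are checked in order, the first one the verifier accepts is returned, and a run in which nothing is accepted counts as a failure. `tpr` is the chance the verifier accepts a correct candidate and `fpr` the chance it accepts a wrong one.

```python
def first_accepted_accuracy(p: float, k: int, tpr: float = 1.0, fpr: float = 0.0) -> float:
    """Probability that the first verifier-accepted candidate among k is correct."""
    hit = p * tpr                  # candidate is correct and accepted
    false_alarm = (1 - p) * fpr    # candidate is wrong but accepted
    accept = hit + false_alarm
    if accept == 0:
        return 0.0                 # the verifier never accepts anything
    reject = 1 - accept
    # Geometric series over "rejected i times, then a correct candidate accepted".
    return hit * (1 - reject ** k) / accept


perfect = first_accepted_accuracy(0.55, 4)          # reduces to 1 - (1 - p)**k
noisy = first_accepted_accuracy(0.55, 4, fpr=0.3)   # verifier passes 30% of wrong answers
ceiling = 0.55 / (0.55 + 0.45 * 0.3)                # limit as k grows: the verifier's precision

print(f"perfect={perfect:.3f} noisy={noisy:.3f} ceiling={ceiling:.3f}")
```

Under this policy, more samples cannot push accuracy past the verifier's precision, so improving the selector matters more than raising k.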
Implementation checklist
- What is the cheapest rung (direct / CoT / self-consistency / ToT / ReAct / debate) that fits this task's difficulty and risk?
- What is the verifier — test, rubric, search, or human — and how precise is it?
- What is the thinking budget (max samples k, max tool steps, max debate rounds, cost cap)?
- What is the explicit stop rule — verifier passes, budget hit, or confidence threshold?
- Which artifacts (decomposition, assumptions, branches, checks, confidence) get stored for replay and eval?
- What gets exposed to the user vs. kept internal?
- How is reasoning quality measured downstream (lesson 21)?
Where this points next
We can now make an agent think harder on demand — and crucially, decide when not to. But more reasoning power is also more ways to go wrong: a ReAct loop that calls a destructive tool, a debate that talks itself into an unsafe conclusion, a self-correction step that "improves" an answer into a policy violation. Reasoning expands the action space; the next job is to bound it. Lesson 20 (Guardrails and safety) wraps the reasoning loop in layered constraints — input checks, tool permissions, output filters, review, and monitoring — so that thinking harder never means acting more dangerously. After that, lesson 21 (Evaluation and monitoring) scores the reasoning trajectory — using exactly the artifacts we agreed to externalize here.
Interview prompts
- What is the inference scaling law, and when does a small model beat a bigger one? (§1, §3 — for verifiable tasks, spending inference compute on k samples plus a verifier gives accuracy 1−(1−p)^k, which can exceed a bigger model's single-pass accuracy at lower total cost; the lever is the inference algorithm, not parameters.)
- Order CoT, ToT, ReAct, self-consistency, and debate by cost and say when each pays. (§2 — CoT (cheap, multi-step logic) → self-consistency (k votes, fixes one-off slips) → ToT (search + backtrack, planning) → ReAct (tool-grounded, live data) → debate (N agents, high-value bias-prone questions); stop at the cheapest rung that solves the task.)
- Why is ReAct more robust than plain Chain-of-Thought? (§2 — ReAct interleaves Thought→Action→Observation so each step is anchored to a real environment observation and the plan adapts to feedback, rather than reasoning entirely from the model's prior.)
- "We added Tree-of-Thoughts and accuracy didn't move." Diagnose it. (§4 — almost certainly no usable verifier: branching explores alternatives but cannot tell a good branch from a bad one, so it returns the original guess at N× the cost; add an executable test, rubric, or search-based selector.)
- How does verifier precision change the best-of-k benefit? (§3, §4 — a perfect verifier reaches 1−(1−p)^k; as precision drops it starts selecting confidently-wrong candidates and the ceiling collapses toward (or below) majority-vote; the selector is the scarce resource, not the samples.)
- What reasoning should an agent expose vs. keep internal, and why? (§5 — expose structured artifacts (decomposition, assumptions, branches tried, checks and their results, calibrated confidence); keep raw chain text internal because it is unstructured, sometimes confabulated, unsafe, and the models themselves describe it as simulated pattern-prediction.)
- What is RLVR and why does it matter for agent reasoning? (§4 — Reinforcement Learning from Verifiable Rewards trains reasoning models on checkable problems (math, code) so they learn long, self-correcting, backtracking traces rewarded by an automatic checker; it's why modern reasoning models are strong and why you should prefer cheaply-checkable decompositions in your own system.)