Part IV - State, collaboration, and protocols

Learning and adaptation - improving from experience

Lesson 10 gave the agent a memory: scoped storage it can write to and read from across turns and sessions. But memory only remembers; it does not yet improve the agent's behaviour. This lesson closes that gap. Learning and adaptation is the loop that turns the traces sitting in memory into measured changes to how the agent acts — and the book's headline claim is that this loop targets prompts, tools, code, routing, and evals long before it ever touches model weights.

Book source

Chapter 9 - Learning and Adaptation (学习与适应); PDF outline pages 111-119. Central case study: SICA, the Self-Improving Coding Agent (Robeyns, Aitchison & Szummer, 2025). Also AlphaEvolve and OpenEvolve.

The plan

Five moves. (1) Define learning as a measured change loop and place it relative to memory, so we never confuse "logging more" with "improving." (2) Catalogue the book's six learning modes (RL, supervised, unsupervised, few/zero-shot, online, memory-based) and the one distinction that matters most for agents: what layer changes. (3) Anchor on the book's two alignment algorithms, PPO and DPO, with the clip/trust-region intuition and a worked preference-update number. (4) Walk the SICA case study in full — the versioned archive, benchmark-weighted selection, self-editing tools that evolved, and the asynchronous supervisor that keeps it safe. (5) Generalize to AlphaEvolve / OpenEvolve: generate, evaluate, select, repeat. We close by threading the running coding/research assistant through every layer and handing off to MCP.

Linear position

Prerequisites: Lesson 10 (Memory management), which provides scoped session, project, and long-term stores with explicit write/read policies, plus the trace records they hold; Lesson 06 (Reflection), the within-a-run critique loop; and Lesson 09 (Multi-agent), which explains why complete trajectories matter when several agents share an outcome.
New capability: An across-run feedback loop that converts traces and feedback into validated changes to the agent's prompts, tools, code, routing policy, memories, evals, and — only sometimes — model weights.

1 · Learning is a measured change loop, not a longer log

It is tempting to think the agent "learns" simply because lesson 10 lets it remember. It does not. Memory is state that persists; learning is behaviour that changes because of that state. The book frames learning and adaptation as the agent altering its thinking, behaviour, or knowledge in response to new experience and environment interaction — moving from "merely executing instructions" to becoming genuinely more capable. The keyword is change: if nothing about the agent's future decisions is different, no learning happened.

Two clarifying distinctions the book leans on:

Learning vs adaptation. Learning is the acquisition of new knowledge or skill from data and experience. Adaptation is the observable behaviour change that results — the agent altering its strategy, its understanding, or even its goals to fit an environment that is unpredictable, shifting, or new. You learn from a thousand traces; you adapt by, say, raising one routing threshold.
Reflection (lesson 06) vs learning (this lesson). Reflection is a within-run loop: generate, critique, repair, all inside one task before you answer. Learning is an across-run loop: collect traces from many completed tasks, find a systematic failure, change the system, and verify the change against a regression suite before it ships. Reflection makes one answer better; learning makes the next thousand answers better.

So the unit of learning is a closed loop with a gate at the end. Skip the gate — change the system off one impressive or embarrassing run — and you are gambling, not learning.

┌─────────────┐  traces &    ┌──────────────┐  labelled     ┌───────────────┐
│  RUN AGENT  │──feedback───▶│   ANALYSE    │──failures────▶│   PROPOSE     │
│ (production)│              │  classify,   │               │  a change to  │
└─────────────┘              │  cluster     │               │  ONE layer    │
       ▲                     └──────────────┘               └───────┬───────┘
       │                                                            │
       │  ship only if better   ┌──────────────────┐                │
       └────────────────────────│ REGRESSION GATE  │◀───────────────┘
                                │ evals must pass  │
                                │ + rollback path  │
                                └──────────────────┘

The gate is the difference between learning and luck.
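The gate at the end of the loop can be sketched in a few lines. This is a minimal illustration, not the book's implementation: the names `GateDecision` and `regression_gate`, the eval names, and the `min_gain` threshold are all hypothetical, and the policy (no eval may regress, and the mean must improve) is one reasonable assumption among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    ship: bool
    reason: str


def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    min_gain: float = 0.01) -> GateDecision:
    """Ship only if no eval regresses and the mean score improves by min_gain.

    Hypothetical policy for illustration; scores are per-eval pass rates in 0..1.
    """
    if not baseline:
        return GateDecision(False, "no evals defined, nothing to gate on")
    missing = sorted(set(baseline) - set(candidate))
    if missing:
        return GateDecision(False, f"candidate skipped evals: {missing}")
    regressed = sorted(n for n in baseline if candidate[n] < baseline[n])
    if regressed:
        return GateDecision(False, f"regressions: {regressed}")
    gain = sum(candidate[n] - baseline[n] for n in baseline) / len(baseline)
    if gain < min_gain:
        return GateDecision(False, f"mean gain {gain:.3f} below {min_gain}")
    return GateDecision(True, f"mean gain {gain:.3f}")


# Hypothetical eval names and scores.
baseline = {"read_failing_test": 0.60, "minimal_patch": 0.70}
better = {"read_failing_test": 0.80, "minimal_patch": 0.70}
lucky = {"read_failing_test": 0.95, "minimal_patch": 0.50}

print(regression_gate(baseline, better))  # improves one eval, regresses none
print(regression_gate(baseline, lucky))   # impressive on one eval, worse on another
```

The `lucky` candidate is the "one impressive run" case: a large gain on one eval does not buy it past a regression on another.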

2 · The six learning modes — and the question that actually matters

The book opens with a taxonomy of how agents can learn. Memorize the list, but understand that for an LLM agent the interesting axis is not which algorithm but which layer of the system the change lands in.

Mode	What it does (book)	Fits an agent when…
Reinforcement learning	Try actions, get rewarded for good outcomes and penalized for bad, converge on an optimal policy in a changing environment.	Controlling robots or playing games — anywhere outcomes are scored and the action space is explorable.
Supervised	Learn input→output from labelled examples for decisions and pattern recognition.	Email classification, trend prediction — you have ground-truth labels.
Unsupervised	Find hidden structure in unlabelled data; build a cognitive map of the environment.	Exploratory data work with no explicit target.
Few-shot / zero-shot (LLM)	An LLM agent adapts to a new task from a handful of examples or a plain instruction.	Fast response to a new command or scenario with almost no data — the everyday agent mode.
Online learning	Continuously update knowledge as new data streams in.	Real-time, dynamic environments and continuous data streams.
Memory-based	Recall past experiences and adjust current behaviour in similar situations.	The agent has the retrieval ability from lesson 10 to look up analogous cases.

The book's practical-applications list is worth keeping concrete, because it shows the same loop under different names: a personalized assistant that refines its interaction protocol from long-term behaviour analysis; a trading bot that retunes parameters against high-resolution market data; a fraud-detection agent that folds newly-discovered fraud patterns into its predictor; a recommender that sharpens precision from preference learning; and a knowledge-base learning agent that uses RAG (lesson 16) to maintain a living store of successful strategies and past obstacles, consulting it at decision time. Same loop; different layer changing.

The layer question

Before you "train" anything, ask: what is the cheapest layer that fixes this class of failure? The ladder, cheapest and safest first: memory entry → prompt/instruction → tool design → routing policy/threshold → evaluation suite → model weights. The book's whole argument is that agents climb this ladder, and the top rung (weight training, RL, fine-tuning) is the last resort, not the first instinct.
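One way to make the ladder operational is to encode the rungs in order and pick the lowest one judged sufficient. A minimal sketch: the `Layer` enum and `cheapest_sufficient_layer` are hypothetical names, and the booleans stand in for a human (or eval-backed) judgement about whether a change at that layer would fix the whole failure class.

```python
from enum import IntEnum
from typing import Optional


class Layer(IntEnum):
    """Rungs of the ladder, cheapest and safest first."""
    MEMORY = 1
    PROMPT = 2
    TOOL = 3
    ROUTING = 4
    EVAL_SUITE = 5
    WEIGHTS = 6


def cheapest_sufficient_layer(fixes_failure: dict[Layer, bool]) -> Optional[Layer]:
    """Return the lowest rung judged able to fix the whole failure class."""
    for layer in sorted(Layer):
        if fixes_failure.get(layer, False):
            return layer
    return None


# Hypothetical diagnosis: full-file rewrites for one-line fixes.
# A prompt tweak was judged insufficient; a better edit tool or
# fine-tuning would both work, so the cheaper of the two wins.
diagnosis = {Layer.PROMPT: False, Layer.TOOL: True, Layer.WEIGHTS: True}
print(cheapest_sufficient_layer(diagnosis).name)
```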

3 · PPO and DPO — when the change really is the weights

When you do reach the top rung, the book names two algorithms. You do not need to implement them here (the RL track does that in depth), but you must be able to explain the intuition in an interview, because they encode why weight updates are dangerous and how each tames the danger.

PPO (Proximal Policy Optimization) trains a policy by trial and error; the book's examples are environments with continuous actions, such as robot joints or game characters, though the algorithm also handles discrete actions. Its core problem: a naive policy-gradient step can be too large and collapse performance, destroying knowledge the agent already had. PPO's fix is the clip: it tracks how far the new policy's action probabilities have moved from the old policy's (the probability ratio) and stops rewarding movement beyond a small band around 1, which acts as an approximate trust region around the current policy. The book's analogy is a safety brake: you improve, but the objective gives you no incentive to take a single catastrophic step. The loop is: (1) collect a batch of experience (state, action, reward) with the current policy; (2) estimate how a candidate update would change expected reward; (3) clip that update so the new policy stays close to the old. Stable improvement, no cliff.
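The clip is easiest to see on a single sample. The sketch below evaluates the standard PPO clipped surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the new-to-old probability ratio for the action taken and A is its advantage estimate. The function name is hypothetical, and ε = 0.2 is a commonly used value rather than something the book prescribes.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO clipped objective for one (state, action) sample.

    ratio     = pi_new(action | state) / pi_old(action | state)
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum makes the objective pessimistic: moving the policy
    # further than the clip range earns no extra credit.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (advantage > 0): credit stops growing once ratio passes 1 + epsilon.
for r in (1.0, 1.1, 1.2, 1.5, 3.0):
    print(f"ratio={r:.1f}  objective={clipped_surrogate(r, advantage=1.0):.2f}")

# Bad action (advantage < 0): credit stops growing once ratio drops below 1 - epsilon.
for r in (1.0, 0.9, 0.8, 0.5):
    print(f"ratio={r:.1f}  objective={clipped_surrogate(r, advantage=-1.0):.2f}")
```

Once the objective is flat, its gradient with respect to the policy is zero, so the optimizer has no reason to push that sample's probability any further in one update.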

DPO (Direct Preference Optimization) is the LLM-alignment shortcut. The classical PPO-based alignment pipeline is two steps: first train a reward model on human preference data ("response A is better than response B"), then fine-tune the LLM with PPO to score high under that reward model acting as judge. Two steps, complex, unstable, and the LLM can game the reward model — produce text that scores high but is actually low quality. DPO skips the reward model entirely: it uses the preference pairs to update the policy directly, with a loss whose math says "raise the probability of the preferred response, lower the probability of the rejected one." Fewer moving parts, more stable alignment.

Worked number — what "raise the probability" means

Suppose for a prompt the agent currently assigns the human-preferred answer a log-prob of −2.30 (probability ≈ e^−2.30 ≈ 0.10) and the rejected answer −1.61 (prob ≈ 0.20). The model currently prefers the worse answer 2:1. DPO's gradient pushes the preferred-minus-rejected margin, (−2.30) − (−1.61) = −0.69, toward positive. After updates that move the preferred log-prob to −1.20 (prob ≈ 0.30) and the rejected to −2.30 (prob ≈ 0.10), the margin is (−1.20) − (−2.30) = +1.10, so the model now prefers the human-favoured answer about 3:1 (e^1.10 ≈ 3.0). No reward model was trained; the preference pairs moved the policy directly. DPO measures each log-prob relative to a frozen reference copy of the model and scales the result by a coefficient β; that implicit KL penalty (the DPO analogue of PPO's clip) keeps the model from drifting so far it forgets everything else.
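The same numbers can be checked in code. The sketch below uses the standard DPO loss, −log σ(β·[(log π(chosen) − log π_ref(chosen)) − (log π(rejected) − log π_ref(rejected))]). Two things are assumptions for illustration only: the frozen reference model is taken to be the starting policy, and β = 0.1 is a typical value, not one the book specifies.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair; all inputs are sequence log-probs."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Log-probs from the worked example; the reference is the starting policy (assumption).
ref_chosen, ref_rejected = -2.30, -1.61
new_chosen, new_rejected = -1.20, -2.30

raw_before = ref_chosen - ref_rejected   # -0.69: rejected answer is favoured
raw_after = new_chosen - new_rejected    # +1.10: preferred answer is favoured
print(f"odds before: {math.exp(raw_before):.2f} : 1")
print(f"odds after:  {math.exp(raw_after):.2f} : 1")

loss_before = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)
loss_after = dpo_loss(new_chosen, new_rejected, ref_chosen, ref_rejected)
print(f"loss before: {loss_before:.4f}  loss after: {loss_after:.4f}")
```

Before any update the policy equals the reference, so the margin inside the sigmoid is zero and the loss is log 2; the update lowers it.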

Trap — reward hacking

Both the reward-model judge and any automated eval you build for the learning loop can be gamed. The book's warning about PPO-aligned LLMs ("the model may exploit the reward model to win points with low-quality output") is the same hazard you face when you let an agent optimize against your own regression suite. An agent that learns to pass your evals without doing the task is the headline failure mode of self-improvement. Hold out evals; rotate them; have a human spot-check the "improved" version.
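Holding out and rotating evals can be as simple as a deterministic, hash-based split whose membership changes with a rotation counter. This is a minimal sketch under stated assumptions: `is_held_out`, `split_suite`, the case IDs, and the 20% fraction are all hypothetical, and a real suite would also need the held-out cases kept out of anything the agent can read.

```python
import hashlib


def is_held_out(case_id: str, rotation: int, fraction: float = 0.2) -> bool:
    """Deterministically assign a case to the held-out set for this rotation."""
    digest = hashlib.sha256(f"{rotation}:{case_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


def split_suite(case_ids: list[str], rotation: int,
                fraction: float = 0.2) -> tuple[list[str], list[str]]:
    """Return (visible, held_out): the loop optimizes against visible only."""
    visible = [c for c in case_ids if not is_held_out(c, rotation, fraction)]
    held_out = [c for c in case_ids if is_held_out(c, rotation, fraction)]
    return visible, held_out


cases = [f"case-{i}" for i in range(200)]  # hypothetical eval case IDs
visible_0, held_0 = split_suite(cases, rotation=0)
visible_1, held_1 = split_suite(cases, rotation=1)
print(len(visible_0), len(held_0))
print(len(set(held_0) & set(held_1)), "cases held out in both rotations")
```

Bumping `rotation` reshuffles which cases are hidden, so a change that merely memorized the visible set is likely to be exposed on the next rotation.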

4 · The book's case study: SICA, the self-improving coding agent

This is the chapter's centrepiece and the clearest demonstration that agent learning is not the same as model training. SICA (Robeyns, Aitchison & Szummer, 2025) is unusual: instead of one agent training another, SICA is both the modifier and the modified. It improves at coding benchmarks by rewriting its own source code, with the base LLM unchanged.

The loop, faithfully:

┌─────────────────────────────────────────────────────────────────────┐
│ 1. REVIEW the archive of past versions + their benchmark scores     │
│ 2. SELECT the best version, scored by a WEIGHTED blend of:          │
│      success rate · wall-clock time · compute cost                  │
│ 3. That version SELF-MODIFIES: analyse the archive for improvement  │
│      opportunities, then directly edit its own codebase             │
│ 4. BENCHMARK the modified agent                                     │
│ 5. ARCHIVE the new version + its scores                             │
│ 6. GOTO 1  ── all of it watched by an asynchronous supervisor ──    │
└─────────────────────────────────────────────────────────────────────┘

Notice every property of the section 1 change loop is present and made concrete: the traces are archived versions with scores; the analysis mines the archive for what to change; the proposal is an actual code edit; the gate is the benchmark. It is versioned, evaluated, and replayable, which is the gold standard for any learning loop you build.
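Step 2 of the loop, benchmark-weighted selection from a versioned archive, can be sketched as below. Everything numeric here is an illustrative assumption rather than a value from the book: the weights, the time and cost budgets used for normalization, and the archived scores. `ArchivedVersion`, `utility`, and `select_best` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedVersion:
    version: int
    success_rate: float  # fraction of benchmark tasks solved, 0..1
    wall_clock_s: float  # mean seconds per task
    cost_usd: float      # mean compute cost per task


def utility(v: ArchivedVersion,
            w_success: float = 0.5, w_time: float = 0.25, w_cost: float = 0.25,
            time_budget_s: float = 300.0, cost_budget_usd: float = 10.0) -> float:
    """Weighted blend of success, speed, and cost (weights are assumptions)."""
    time_score = max(0.0, 1.0 - v.wall_clock_s / time_budget_s)
    cost_score = max(0.0, 1.0 - v.cost_usd / cost_budget_usd)
    return w_success * v.success_rate + w_time * time_score + w_cost * cost_score


def select_best(archive: list[ArchivedVersion]) -> ArchivedVersion:
    """Pick the version that will perform the next self-modification."""
    return max(archive, key=utility)


# Hypothetical archive: version 2 solves the most tasks but is slow and costly.
archive = [
    ArchivedVersion(0, success_rate=0.17, wall_clock_s=130.0, cost_usd=4.0),
    ArchivedVersion(1, success_rate=0.40, wall_clock_s=90.0, cost_usd=3.0),
    ArchivedVersion(2, success_rate=0.45, wall_clock_s=280.0, cost_usd=9.5),
]
best = select_best(archive)
print(f"selected version {best.version} with utility {utility(best):.3f}")
```

The point of the blend is visible in the toy data: the version with the highest raw success rate is not the one selected, because it pays too much in time and cost.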

4.1 · What SICA actually learned to change — its own tools

The most instructive detail for this track: SICA's biggest gains came not from "being smarter" but from improving how it edits and navigates code. The base model never changed. The evolution of its editing tool, in order:

Naive full-file overwrite (slow, error-prone, burns tokens).
A smart editor doing context-aware edits.
A diff-enhanced smart editor — targeted, pattern-based edits using diffs.
A quick-overwrite tool to cut processing overhead, plus minimal-diff output and context-sensitive diff minimization using AST (abstract syntax tree) parsing, and input normalization for the editor.
For navigation: an AST symbol locator that finds definitions via code structure, then a hybrid symbol locator combining fast search with AST inspection, then optimized AST parsing within that locator to focus on relevant code and speed up search.
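To see why an AST-based locator beats plain text search, here is a minimal sketch using Python's standard `ast` module. It is not SICA's tool; `locate_symbol` and the sample source are hypothetical, and the sketch only handles Python function and class definitions.

```python
import ast


def locate_symbol(source: str, name: str) -> list[tuple[str, int]]:
    """Return (node kind, line number) for each definition of `name`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        is_definition = isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        )
        if is_definition and node.name == name:
            hits.append((type(node).__name__, node.lineno))
    return sorted(hits, key=lambda hit: hit[1])


SOURCE = '''\
# parse_date is mentioned in this comment
class Parser:
    def parse_date(self, text):
        return text.strip()

def main():
    note = "call parse_date here"
    return Parser().parse_date(note)
'''

# Plain text search matches the comment, the string literal, and the call site too.
text_hits = [
    lineno for lineno, line in enumerate(SOURCE.splitlines(), start=1)
    if "parse_date" in line
]
print("text search lines:", text_hits)
print("AST definitions:  ", locate_symbol(SOURCE, "parse_date"))
```

The structural search returns only the definition, which is what an editing agent needs before it patches a function.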

This is the teaching point in physical form: adaptation targets prompts, memory, tools, code, and routing — base-model weights are optional. An agent gets dramatically better by changing how it touches the world, not by learning new facts.
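The gap between a full-file overwrite and a diff-based edit is easy to measure. The sketch below uses the standard library's `difflib` to produce a unified diff for a one-line change in a 200-line file; it illustrates the idea only and is not SICA's editor. `make_patch` and the file contents are hypothetical.

```python
import difflib


def make_patch(old: str, new: str, path: str = "module.py") -> str:
    """Return a unified diff that turns `old` into `new`."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=1,  # one line of context on each side of a change
    )
    return "".join(diff)


# A hypothetical 200-line file with a one-line fix.
old = "".join(f"line_{i} = {i}\n" for i in range(200))
new = old.replace("line_120 = 120\n", "line_120 = 121\n")

patch = make_patch(old, new)
print(patch)
print(f"full overwrite: {len(new)} chars, patch: {len(patch)} chars")
```

The patch carries only the changed line plus a little context, so the model emits far fewer tokens and has far fewer chances to corrupt untouched lines.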

4.2 · The architecture that makes self-modification safe and legible

SICA pairs a modular agent with three safety/observability mechanisms the book stresses. The agent itself has a base toolset (file ops, command execution, arithmetic) plus specialized sub-agents for coding, problem-solving, and reasoning, which are invoked like tools to decompose tasks and manage LLM context length. The three mechanisms:

Asynchronous supervisor. A second LLM runs in parallel with the main agent, periodically receiving a text representation of system state — the call graph, the event stream of LLM messages, tool calls and responses. It watches for loops, stagnation, and inefficient or repeated work, and can send a notification or terminate execution. This is the safety net that makes an agent rewriting its own code tolerable.
Docker isolation. The agent runs in a dedicated container, isolated from the host, so its ability to execute shell commands cannot wreck the host filesystem.
Observability. An interactive web view of the event bus and call graph lets a human inspect events, read supervisor messages, and collapse sub-agent traces — self-improvement is inspectable, not mysterious.
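In SICA the supervisor is a second LLM reading a text rendering of system state. As a much simpler stand-in, the sketch below shows the kind of check it performs: scanning the recent event stream for repeated identical tool calls and deciding whether to let the run continue. The rule, the window size, and the names `ToolCall` and `supervise` are assumptions for illustration, not SICA's mechanism.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: str


def supervise(events: list[ToolCall], window: int = 6, max_repeats: int = 3) -> str:
    """Return "continue" or a termination message based on recent tool calls."""
    recent = events[-window:]
    if not recent:
        return "continue"
    call, count = Counter(recent).most_common(1)[0]
    if count >= max_repeats:
        return f"terminate: {call.tool}({call.args}) repeated {count}x"
    return "continue"


# Hypothetical event streams.
healthy = [
    ToolCall("read_file", "a.py"), ToolCall("edit_file", "a.py"),
    ToolCall("run_tests", "tests/"), ToolCall("read_file", "b.py"),
    ToolCall("edit_file", "b.py"), ToolCall("run_tests", "tests/"),
]
stuck = healthy + [ToolCall("run_tests", "tests/")] * 3

print(supervise(healthy))
print(supervise(stuck))
```

A rule like this catches only exact repetition; the appeal of an LLM supervisor is that it can also notice stagnation that never repeats a call verbatim.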

The book is honest about the hard part: the initial challenge was getting the agent to propose modifications that are genuinely novel, innovative, feasible, and interesting on each meta-improvement round. Open-ended creativity in self-improvement is still an open research problem — a useful thing to say out loud in an interview rather than overselling autonomy.

5 · Generalizing the loop: AlphaEvolve and OpenEvolve

SICA improves one agent's code. AlphaEvolve (Google) generalizes the pattern to discovering and optimizing algorithms. It fuses three pieces: an LLM ensemble (Gemini Flash generates many candidate solutions broadly; Pro does deep analysis and refinement), an automated evaluation system that scores candidates against preset criteria, and an evolutionary framework that feeds scores back to iterate toward novel, efficient algorithms.

The reported results make the loop's payoff concrete: deployed on Google infrastructure for data-center scheduling, it cut global compute resource usage by 0.7%; it proposed Verilog optimizations for TPU hardware design; it sped up a Gemini-architecture core kernel by 23% and FlashAttention GPU instructions by up to 32.5%; and in pure math it found a way to multiply two 4×4 complex matrices in 48 scalar multiplications, beating the prior best, rediscovered the best known solution on 75% of a set of open problems, and improved on 20% (including the "kissing number" problem).

OpenEvolve is the open-source evolutionary coding agent of the same shape: an LLM-driven generate → evaluate → select pipeline that evolves whole code files (not just single functions), across languages, with multi-objective optimization, flexible prompting, and distributed evaluation, coordinated by a controller over a program sampler, a program database, an evaluation pool, and an LLM cluster. The library's surface is tiny — which is exactly why it is a good mental model for the loop:

import asyncio

from openevolve import OpenEvolve


async def main() -> None:
    evolve = OpenEvolve(
        initial_program_path="path/to/initial_program.py",
        evaluation_file="path/to/evaluator.py",   # the GATE: scores each candidate
        config_path="path/to/config.yaml",
    )

    # generate→evaluate→select ×1000; run() is awaited, so it needs an event loop
    best_program = await evolve.run(iterations=1000)
    for name, value in best_program.metrics.items():
        print(f"  {name}: {value:.4f}")


asyncio.run(main())

The evaluation_file is the whole point: evolution is only as good as the evaluator, and a weak evaluator gets gamed (section 3's trap, at scale). The strongest systems, the book repeats, combine generation with evaluation and selection — never generation alone.
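Stripped of the LLM, the generate → evaluate → select loop is a few lines of ordinary evolutionary search. The toy below is not OpenEvolve's or AlphaEvolve's algorithm: random Gaussian mutation stands in for LLM-generated candidates, the objective is a hypothetical two-parameter function, and all names and hyperparameters are assumptions chosen for illustration.

```python
import random


def evaluate(candidate: list[float]) -> float:
    """The gate: higher is better. Hypothetical objective, optimum at [3, -1]."""
    x, y = candidate
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)


def evolve(generations: int = 200, population: int = 20,
           survivors: int = 5, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    pool = [[rng.uniform(-10, 10), rng.uniform(-10, 10)] for _ in range(population)]
    for _ in range(generations):
        # SELECT: keep the top scorers; they survive unchanged into the next round.
        pool.sort(key=evaluate, reverse=True)
        parents = pool[:survivors]
        # GENERATE: mutate random parents (an LLM would propose edits here).
        children = [
            [gene + rng.gauss(0, 0.5) for gene in rng.choice(parents)]
            for _ in range(population - survivors)
        ]
        pool = parents + children
    return max(pool, key=evaluate)


best = evolve()
print(f"best candidate: {best}, score: {evaluate(best):.4f}")
```

Because the survivors are carried forward unchanged, the best score never gets worse from one generation to the next. It also only ever improves in the direction `evaluate` rewards, which is why a flawed evaluator produces confidently wrong winners.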

6 · Running example: the coding/research assistant climbs the ladder

Thread the change-loop through our running assistant and the layer question answers itself. Each row is a real failure cluster found in traces, and the cheapest layer that fixes it:

Observed failure (from traces)	Diagnosis	Layer that changes
In one repo, the agent keeps running `npm test` when the repo uses `pytest`.	Repo-local convention the agent can't infer.	Project memory — write the test command once (lesson 10 store). Cheapest rung.
Across many repos, the agent skips reading the failing test before patching.	Systematic process gap.	Prompt / instruction — add a "read the failing test first" step to the system prompt.
Full-file rewrites for one-line fixes burn tokens and introduce regressions.	Tool too blunt (SICA's exact problem).	Tool design — add a diff/patch edit tool; keep overwrite for new files.
Trivial questions get routed to the expensive deep-research path.	Router threshold miscalibrated.	Routing policy — raise the complexity threshold (lesson 04).
A whole bug class (off-by-one in date math) recurs and no test catches it.	Eval blind spot.	Evaluation suite — add a regression case so future changes can't reintroduce it.
Tool-call formatting is wrong often enough to hurt, even with good prompts.	Base capability gap.	Model weights — fine-tune / DPO on preference pairs. Last resort, top rung.

Every one of those proposals then passes through the same gate: run the regression suite, ship only if quality improves, keep a rollback path. And one subtlety from lesson 09 carries straight in: for the multi-agent version of the assistant, the tuning data must capture the complete trajectory — every agent's inputs and outputs — because each contributed to the shared outcome. Crediting only the final agent's output mislabels the cause.
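Capturing the complete trajectory is mostly a matter of recording every agent's step against the shared outcome. A minimal sketch, with hypothetical names (`Step`, `Trajectory`, `training_examples`) and a deliberately naive credit rule in which every step inherits the final outcome score; real credit assignment is harder than this.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    agent: str
    inputs: str
    outputs: str


@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    outcome_score: float = 0.0

    def record(self, agent: str, inputs: str, outputs: str) -> None:
        self.steps.append(Step(agent, inputs, outputs))

    def training_examples(self) -> list[dict]:
        """One example per step, so no contributing agent is dropped."""
        return [
            {"agent": s.agent, "inputs": s.inputs, "outputs": s.outputs,
             "outcome_score": self.outcome_score}
            for s in self.steps
        ]


# Hypothetical three-agent run of the coding assistant.
traj = Trajectory(task_id="fix-date-bug")
traj.record("planner", "bug report", "plan: read failing test, then patch")
traj.record("coder", "plan + failing test", "diff for date_utils.py")
traj.record("reviewer", "diff", "approved")
traj.outcome_score = 1.0

for example in traj.training_examples():
    print(example["agent"], example["outcome_score"])
```

Keeping only the reviewer's "approved" would discard the planner and coder steps that actually produced the fix.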

Where this points next

We can now improve the agent from experience: collect traces, find the systematic failure, change the cheapest sufficient layer, and gate it on evals. But notice what made SICA's tools improvable: they were well-defined, discoverable interfaces the agent could reason about and rewire. As agents start sharing tools, prompts, and resources across systems, ad-hoc tool wiring becomes the bottleneck. Lesson 12 (MCP, the Model Context Protocol) standardizes exactly that: how an agent discovers and uses external tools, prompts, and resources through a common contract. It is the integration layer that makes a well-adapted tool portable across agents.

Failure modes

Live self-modification with no sandbox or approval — SICA's whole safety story (Docker isolation + async supervisor) exists to prevent this. Self-editing without a sandbox is a host-compromise waiting to happen.
Optimizing from one run — changing the system off a single impressive or embarrassing trace. That is overfitting to noise, not learning.
Reward / eval hacking: the agent learns to pass your regression suite without doing the task (the PPO reward-model exploit, generalized). Hold out and rotate evals.
Mixing private user memory into global policy — promoting one user's personal adaptation into the shared prompt leaks data and degrades others.
Crediting the wrong agent — in multi-agent learning, tuning on the final output alone instead of the full trajectory.
Skipping the gate — shipping a proposed change without the regression run or a rollback path.

Implementation checklist

What traces and feedback are collected, and is the full trajectory captured?
How are failures labelled and clustered into classes?
Which single layer will change — memory, prompt, tool, router, eval, or weights?
Is it the cheapest layer that fixes the class?
Which held-out evals must pass before it ships?
Is there a versioned archive and a rollback path?
If the agent edits itself, is it sandboxed (Docker) and supervised (async watcher)?
Are personal / project / global adaptations kept in separate scopes?

Takeaway

Learning and adaptation is a measured change loop — collect traces and feedback, label failures, propose a change to the cheapest sufficient layer, and gate it on a regression suite with a rollback path — that runs across runs, distinct from reflection's within-run repair. The book's defining claim is that the change usually lands in memory, prompts, tools, code, routing, or evals — not model weights; when it does reach weights, PPO uses a clip / trust region to update safely and DPO aligns to preferences directly without a separate reward model. SICA proves the point: it self-improves at coding by rewriting its own editing and navigation tools (full-file overwrite → diff-aware → AST symbol locator) with a fixed base model, kept safe by Docker isolation and an asynchronous supervisor. AlphaEvolve / OpenEvolve generalize the loop to algorithm discovery: generate, evaluate, select, repeat — and are only ever as good as the evaluator, because a weak gate gets gamed.

Interview prompts

How is learning different from memory, and from reflection? (§1 — memory persists state; learning changes future behaviour because of that state; reflection is a within-run critique loop, learning is an across-run change loop gated by regression evals.)
An agent keeps making the same mistake. What's your first move? (§2, §6 — climb the cheapest-layer ladder: memory → prompt → tool → router → eval → weights. Find the cheapest layer that fixes the whole failure class; weight training is the last resort.)
Explain PPO's clip and DPO's shortcut. (§3 — PPO clips updates to a trust region so one big step can't collapse the policy; DPO skips the separate reward model and updates the policy directly from preference pairs, raising the preferred response's probability and lowering the rejected one's, with a KL term for stability.)
What does SICA prove about agent learning? (§4 — that an agent can improve substantially with a fixed base model by rewriting its own tools and navigation, e.g. evolving full-file overwrite into diff-aware and AST-based editing; learning targets code/tools, not just weights.)
Why does SICA need an asynchronous supervisor and Docker? (§4.2 — a second LLM watches the call graph / event stream for loops and stagnation and can terminate; Docker isolates the self-editing agent from the host so shell commands can't damage the filesystem. Safety and legibility for self-modification.)
What is the single biggest risk in an automated self-improvement loop? (§3, §5 — a gameable evaluator: the agent learns to pass the eval without doing the task. Mitigate with held-out and rotated evals plus human spot-checks; the system must combine generation with honest evaluation and selection.)
Why do multi-agent systems need full trajectories for learning? (§6 — each agent contributes to the shared outcome, so tuning data must capture every agent's inputs and outputs; crediting only the final output mislabels the cause.)