Part IV - State, collaboration, and protocols
Learning and adaptation - improving from experience
Lesson 10 gave the agent a memory: scoped storage it can write to and read from across turns and sessions. But memory only remembers; it does not yet improve the agent's behaviour. This lesson closes that gap. Learning and adaptation is the loop that turns the traces sitting in memory into measured changes to how the agent acts — and the book's headline claim is that this loop targets prompts, tools, code, routing, and evals long before it ever touches model weights.
New capability: An across-run feedback loop that converts traces and feedback into validated changes to the agent's prompts, tools, code, routing policy, memories, evals, and — only sometimes — model weights.
1 · Learning is a measured change loop, not a longer log
It is tempting to think the agent "learns" simply because lesson 10 lets it remember. It does not. Memory is state that persists; learning is behaviour that changes because of that state. The book frames learning and adaptation as the agent altering its thinking, behaviour, or knowledge in response to new experience and environment interaction — moving from "merely executing instructions" to becoming genuinely more capable. The keyword is change: if nothing about the agent's future decisions is different, no learning happened.
Two clarifying distinctions the book leans on:
- Learning vs adaptation. Learning is the acquisition of new knowledge or skill from data and experience. Adaptation is the observable behaviour change that results — the agent altering its strategy, its understanding, or even its goals to fit an environment that is unpredictable, shifting, or new. You learn from a thousand traces; you adapt by, say, raising one routing threshold.
- Reflection (lesson 06) vs learning (this lesson). Reflection is a within-run loop: generate, critique, repair, all inside one task before you answer. Learning is an across-run loop: collect traces from many completed tasks, find a systematic failure, change the system, and verify the change against a regression suite before it ships. Reflection makes one answer better; learning makes the next thousand answers better.
So the unit of learning is a closed loop with a gate at the end. Skip the gate — change the system off one impressive or embarrassing run — and you are gambling, not learning.
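A minimal sketch of that gate, assuming a hypothetical regression suite that yields one score per case (higher is better). The function name `gate` and the thresholds `min_gain` and `max_case_drop` are illustrative choices, not something the book specifies.

```python
from statistics import mean

def gate(baseline: dict[str, float], candidate: dict[str, float],
         min_gain: float = 0.01, max_case_drop: float = 0.05) -> bool:
    """Return True only if the candidate change should ship.

    baseline / candidate map regression-case id -> score in [0, 1].
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same regression cases")
    # 1. The aggregate must improve by a margin, not by noise.
    gain = mean(candidate.values()) - mean(baseline.values())
    if gain < min_gain:
        return False
    # 2. No single case may regress badly, even if the mean went up.
    worst_drop = max(baseline[c] - candidate[c] for c in baseline)
    return worst_drop <= max_case_drop


baseline = {"date_math": 0.60, "read_test_first": 0.70, "small_patch": 0.80}
better = {"date_math": 0.90, "read_test_first": 0.70, "small_patch": 0.80}
lopsided = {"date_math": 1.00, "read_test_first": 0.90, "small_patch": 0.40}

print(gate(baseline, better))    # mean rises, nothing regresses
print(gate(baseline, lopsided))  # mean rises, but one case collapses
```

The second check is the part a single impressive run hides: an average can improve while one failure class gets much worse.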
2 · The six learning modes — and the question that actually matters
The book opens with a taxonomy of how agents can learn. Memorize the list, but understand that for an LLM agent the interesting axis is not which algorithm but which layer of the system the change lands in.
| Mode | What it does (book) | Fits an agent when… |
|---|---|---|
| Reinforcement learning | Try actions, get rewarded for good outcomes and penalized for bad, converge on an optimal policy in a changing environment. | Controlling robots or playing games — anywhere outcomes are scored and the action space is explorable. |
| Supervised | Learn input→output from labelled examples for decisions and pattern recognition. | Email classification, trend prediction — you have ground-truth labels. |
| Unsupervised | Find hidden structure in unlabelled data; build a cognitive map of the environment. | Exploratory data work with no explicit target. |
| Few-shot / zero-shot (LLM) | An LLM agent adapts to a new task from a handful of examples or a plain instruction. | Fast response to a new command or scenario with almost no data — the everyday agent mode. |
| Online learning | Continuously update knowledge as new data streams in. | Real-time, dynamic environments and continuous data streams. |
| Memory-based | Recall past experiences and adjust current behaviour in similar situations. | The agent has the retrieval ability from lesson 10 to look up analogous cases. |
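The last two rows often combine in practice: cases recalled from memory become few-shot examples in the prompt. A minimal sketch, assuming a hypothetical in-memory list of past cases and a crude word-overlap similarity standing in for real retrieval; `jaccard` and `build_few_shot_prompt` are illustrative names.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for embedding search."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def build_few_shot_prompt(task: str, memory: list[dict], k: int = 2) -> str:
    """Recall the k most similar past cases and prepend them as examples."""
    recalled = sorted(memory, key=lambda m: jaccard(task, m["task"]),
                      reverse=True)[:k]
    shots = "\n\n".join(f"Task: {m['task']}\nWhat worked: {m['fix']}"
                        for m in recalled)
    return f"{shots}\n\nTask: {task}\nWhat worked:"

memory = [
    {"task": "tests fail in python repo", "fix": "run pytest, not npm test"},
    {"task": "deploy the docs site", "fix": "build with mkdocs first"},
    {"task": "python tests fail after date change",
     "fix": "check off-by-one in date math"},
]
prompt = build_few_shot_prompt("python tests fail in this repo", memory)
print(prompt)
```

Nothing about the model changes here; the adaptation lives entirely in what the prompt contains at decision time.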
The book's practical-applications list is worth keeping concrete, because it shows the same loop under different names: a personalized assistant that refines its interaction protocol from long-term behaviour analysis; a trading bot that retunes parameters against high-resolution market data; a fraud-detection agent that folds newly-discovered fraud patterns into its predictor; a recommender that sharpens precision from preference learning; and a knowledge-base learning agent that uses RAG (lesson 16) to maintain a living store of successful strategies and past obstacles, consulting it at decision time. Same loop; different layer changing.
3 · PPO and DPO — when the change really is the weights
When the change really does have to land in model weights (the top rung of the ladder in section 6), the book names two algorithms. You do not need to implement them here (the RL track does that in depth), but you must be able to explain the intuition in an interview, because they encode why weight updates are dangerous and how each tames the danger.
PPO (Proximal Policy Optimization) trains a policy by reinforcement learning; the book presents it for environments with a continuous range of actions, such as robot joints or game characters. Its core problem: a naive policy-gradient step can be too large and collapse performance, destroying knowledge the agent already had. PPO's fix is the clip: the objective stops rewarding any update that moves the new policy's action probabilities more than a small factor away from the old policy's, which keeps each step inside an approximate trust region. The book's analogy is a safety brake: you improve, but you are prevented from taking a single catastrophic step. The loop is: (1) collect a batch of experience (state, action, reward) with the current policy; (2) estimate how a candidate update would change expected reward; (3) clip that update so the new policy stays close to the old. Stable improvement, no cliff.
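The clip can be written down for a single sample. This is a sketch of the standard PPO-Clip surrogate objective, not code from the book; `eps` (the clip range) and the function name are illustrative, and a real implementation averages this over a batch and differentiates it.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-Clip objective for one sample (to be maximized).

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the min makes the objective pessimistic: the policy earns no
    # extra credit for pushing the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, policy already moved far toward it: credit is capped.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Good action, policy moved away from it: the full penalty still applies.
print(clipped_surrogate(ratio=0.5, advantage=1.0))
```

The asymmetry is the safety brake: gains from a large step are capped, while the cost of a step in the wrong direction is never hidden.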
DPO (Direct Preference Optimization) is the LLM-alignment shortcut. The classical PPO-based alignment pipeline is two steps: first train a reward model on human preference data ("response A is better than response B"), then fine-tune the LLM with PPO to score high under that reward model acting as judge. Two steps, complex, unstable, and the LLM can game the reward model by producing text that scores high but is actually low quality. DPO skips the reward model entirely: it uses the preference pairs to update the policy directly, with a loss that says "raise the probability of the preferred response, lower the probability of the rejected one", each measured relative to a frozen reference copy of the model so the policy cannot drift arbitrarily far from where it started. Fewer moving parts, more stable alignment.
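For one preference pair the DPO loss is short enough to read in full. A sketch of the standard formulation, assuming the log-probabilities have already been computed; the frozen reference model and the temperature `beta` come from the DPO formulation itself, and the function name is illustrative.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (to be minimized).

    logp_*     : log-prob of each response under the policy being trained
    ref_logp_* : log-prob under a frozen reference copy of the model
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written as log(1 + exp(-z)); fine for a sketch,
    # though a production version would guard against overflow.
    return math.log1p(math.exp(-z))

# Policy identical to the reference: no preference expressed yet.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

There is no reward model anywhere in the function: the preference pair and the reference log-probabilities are the whole training signal.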
4 · The book's case study: SICA, the self-improving coding agent
This is the chapter's centrepiece and the clearest demonstration that agent learning is not the same as model training. SICA (Robeyns, Aitchison & Szummer, 2025) is unusual: instead of one agent training another, SICA is both the modifier and the modified. It improves at coding benchmarks by rewriting its own source code, with the base LLM unchanged.
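The loop, faithfully: (1) SICA reviews an archive of its own past versions and their benchmark results; (2) it selects the best-scoring version, using a weighted score over success, time, and compute cost; (3) that version analyzes the archive to decide what to change; (4) it edits its own codebase directly; (5) the modified agent is run against the benchmarks and the result is recorded in the archive. Then the cycle repeats.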
Notice every property of the section-1 change loop is present and made concrete: the traces are archived versions with scores; the analysis mines the archive for what to change; the proposal is an actual code edit; the gate is the benchmark. It is versioned, evaluated, and replayable: the gold standard for any learning loop you build.
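The archive-driven shape of that loop can be sketched in a few lines. This is a toy under stated assumptions, not SICA's code: `Version`, `self_improve`, `toy_edit`, and `toy_benchmark` are hypothetical names, the "agent" is just a string, and the benchmark is a stand-in that counts a character.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Version:
    code: str      # stand-in for the agent's source tree
    score: float   # benchmark utility (higher is better)

def self_improve(archive: list[Version],
                 propose_edit: Callable[[str, list[Version]], str],
                 run_benchmark: Callable[[str], float],
                 rounds: int) -> list[Version]:
    """Archive-driven improvement loop; every candidate is recorded."""
    for _ in range(rounds):
        best = max(archive, key=lambda v: v.score)    # select the best so far
        new_code = propose_edit(best.code, archive)   # it edits its own code
        archive.append(Version(new_code, run_benchmark(new_code)))  # score + record
    return archive

# Toy stand-ins: the benchmark rewards the letter "x", and the second
# proposal is deliberately harmful to show why the archive matters.
def toy_edit(code: str, archive: list[Version]) -> str:
    return code[:-1] if len(archive) == 2 else code + "x"

def toy_benchmark(code: str) -> float:
    return float(code.count("x"))

history = self_improve([Version("x", 1.0)], toy_edit, toy_benchmark, rounds=4)
print([(v.code, v.score) for v in history])
```

The bad edit stays in the archive with its low score, but it is never chosen as a parent, so one regression cannot derail later rounds.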
4.1 · What SICA actually learned to change — its own tools
The most instructive detail for this track: SICA's biggest gains came not from "being smarter" but from improving how it edits and navigates code. The base model never changed. The evolution of its editing tool, in order:
- Naive full-file overwrite (slow, error-prone, burns tokens).
- A smart editor doing context-aware edits.
- A diff-enhanced smart editor — targeted, pattern-based edits using diffs.
- A quick-overwrite tool to cut processing overhead, plus minimal-diff output and context-sensitive diff minimization using AST (abstract syntax tree) parsing, and input normalization for the editor.
- For navigation: an AST symbol locator that finds definitions via code structure, then a hybrid symbol locator combining fast search with AST inspection, then optimized AST parsing within that locator to focus on relevant code and speed up search.
This is the teaching point in physical form: adaptation targets prompts, memory, tools, code, and routing, and changing base-model weights is optional. An agent can get dramatically better by changing how it touches the world, without retraining the model underneath it.
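To make the navigation tools concrete, here is a minimal sketch of an AST-based symbol locator using Python's standard `ast` module. It is an illustration of the idea, not SICA's implementation; `locate_symbol` is a hypothetical name.

```python
import ast

DEFS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

def locate_symbol(source: str, name: str) -> list[tuple[str, int]]:
    """Return (kind, line) for every function or class definition named `name`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DEFS) and node.name == name:
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            hits.append((kind, node.lineno))
    return sorted(hits, key=lambda h: h[1])

sample = '''
def parse(text):
    return text.split()

class Parser:
    def parse(self, text):   # same name, different scope
        return parse(text)
'''
print(locate_symbol(sample, "parse"))
```

A plain text search for `parse` would also match the call on the last line; walking the syntax tree returns only the definitions, which is what an agent needs before it edits.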
4.2 · The architecture that makes self-modification safe and legible
SICA pairs a modular agent with three safety/observability mechanisms the book stresses. The agent itself is a base toolset (file ops, command execution, arithmetic) plus specialized sub-agents for coding, problem-solving, and reasoning, invoked like tools to decompose tasks and manage LLM context length. The three mechanisms:
- Asynchronous supervisor. A second LLM runs in parallel with the main agent, periodically receiving a text representation of system state — the call graph, the event stream of LLM messages, tool calls and responses. It watches for loops, stagnation, and inefficient or repeated work, and can send a notification or terminate execution. This is the safety net that makes an agent rewriting its own code tolerable.
- Docker isolation. The agent runs in a dedicated container, isolated from the host, so its ability to execute shell commands cannot wreck the host filesystem.
- Observability. An interactive web view of the event bus and call graph lets a human inspect events, read supervisor messages, and collapse sub-agent traces — self-improvement is inspectable, not mysterious.
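The supervisor pattern can be sketched with `asyncio`. One loud assumption: SICA's supervisor is a second LLM reading a text view of system state, whereas this toy replaces it with a simple repeat counter so the example stays runnable. `worker`, `supervisor`, and the thresholds are hypothetical.

```python
import asyncio
from collections import Counter

async def worker(events: list[str]) -> None:
    """Stand-in for the main agent: stuck repeating one tool call."""
    while True:
        events.append("tool:read_file(main.py)")
        await asyncio.sleep(0.01)

async def supervisor(events: list[str], task: asyncio.Task,
                     max_repeats: int = 5, period: float = 0.02) -> str:
    """Periodically inspect the event stream; cancel the agent if it loops."""
    while not task.done():
        await asyncio.sleep(period)
        recent = Counter(events[-20:])
        event, count = recent.most_common(1)[0] if recent else ("", 0)
        if count >= max_repeats:
            task.cancel()   # the "terminate execution" option
            return f"terminated: '{event}' repeated {count}x"
    return "agent finished on its own"

async def main() -> str:
    events: list[str] = []
    task = asyncio.create_task(worker(events))
    verdict = await supervisor(events, task)
    await asyncio.gather(task, return_exceptions=True)  # let cancellation settle
    return verdict

print(asyncio.run(main()))
```

The supervisor never blocks the worker: it samples the shared event stream on its own schedule, which is what "asynchronous" buys you.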
The book is honest about the hard part: the initial challenge was getting the agent to propose modifications that are genuinely novel, innovative, feasible, and interesting on each meta-improvement round. Open-ended creativity in self-improvement is still an open research problem — a useful thing to say out loud in an interview rather than overselling autonomy.
5 · Generalizing the loop: AlphaEvolve and OpenEvolve
SICA improves one agent's code. AlphaEvolve (Google) generalizes the pattern to discovering and optimizing algorithms. It fuses three pieces: an LLM ensemble (Gemini Flash generates many candidate solutions broadly; Gemini Pro does deep analysis and refinement), an automated evaluation system that scores candidates against preset criteria, and an evolutionary framework that feeds scores back to iterate toward novel, efficient algorithms.
The reported results make the loop's payoff concrete: deployed on Google infrastructure for data-center scheduling, it cut global compute resource usage by 0.7%; it proposed Verilog optimizations for TPU hardware design; it sped up a Gemini-architecture core kernel by 23% and FlashAttention GPU instructions by up to 32.5%; and in pure math it found a way to multiply two 4×4 complex matrices in 48 scalar multiplications, beating the prior best, rediscovered the best known solution on 75% of a set of open problems, and improved on 20% (including the "kissing number" problem).
OpenEvolve is the open-source evolutionary coding agent of the same shape: an LLM-driven generate → evaluate → select pipeline that evolves whole code files (not just single functions), across languages, with multi-objective optimization, flexible prompting, and distributed evaluation, coordinated by a controller over a program sampler, a program database, an evaluation pool, and an LLM cluster. The library's surface is tiny — which is exactly why it is a good mental model for the loop:
```python
import asyncio
from openevolve import OpenEvolve

async def main():
    evolve = OpenEvolve(
        initial_program_path="path/to/initial_program.py",
        evaluation_file="path/to/evaluator.py",  # the GATE: scores each candidate
        config_path="path/to/config.yaml",
    )
    # generate→evaluate→select ×1000
    best_program = await evolve.run(iterations=1000)
    for name, value in best_program.metrics.items():
        print(f"  {name}: {value:.4f}")

asyncio.run(main())  # evolve.run() is awaited, so it needs an event loop
```
The evaluation_file is the whole point: evolution is only as good as the evaluator, and a weak evaluator gets gamed (section 3's trap, at scale). The strongest systems, the book repeats, combine generation with evaluation and selection — never generation alone.
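The generate, evaluate, select cycle itself fits in a few lines. This is a generic toy, not OpenEvolve's API: candidates are lists of numbers rather than programs, `mutate` stands in for the LLM proposing an edit, and every name here is hypothetical.

```python
import random

def evaluate(candidate: list[float]) -> float:
    """The gate: score a candidate (higher is better). Toy target: all 3.0."""
    return -sum((x - 3.0) ** 2 for x in candidate)

def mutate(parent: list[float], rng: random.Random) -> list[float]:
    """Generation step; in an LLM-driven system a model proposes the edit."""
    child = parent[:]
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, 0.5)
    return child

def evolve(initial: list[float], iterations: int, population: int = 8,
           seed: int = 0) -> tuple[list[float], float]:
    rng = random.Random(seed)
    pool = [(evaluate(initial), initial)]
    for _ in range(iterations):
        parent = rng.choice(pool)[1]              # sample a parent
        child = mutate(parent, rng)               # generate
        pool.append((evaluate(child), child))     # evaluate
        pool = sorted(pool, key=lambda p: p[0], reverse=True)[:population]  # select
    best_score, best = pool[0]
    return best, best_score

best, score = evolve([0.0, 0.0, 0.0], iterations=500)
print([round(x, 2) for x in best], round(score, 4))
```

Everything the loop can discover is defined by `evaluate`. Swap in a scorer that rewards the wrong thing and the same loop will optimize toward it just as efficiently.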
6 · Running example: the coding/research assistant climbs the ladder
Thread the change-loop through our running assistant and the layer question answers itself. Each row is a real failure cluster found in traces, and the cheapest layer that fixes it:
| Observed failure (from traces) | Diagnosis | Layer that changes |
|---|---|---|
| In one repo, the agent keeps running npm test when the repo uses pytest. | Repo-local convention the agent can't infer. | Project memory — write the test command once (lesson 10 store). Cheapest rung. |
| Across many repos, the agent skips reading the failing test before patching. | Systematic process gap. | Prompt / instruction — add a "read the failing test first" step to the system prompt. |
| Full-file rewrites for one-line fixes burn tokens and introduce regressions. | Tool too blunt (SICA's exact problem). | Tool design — add a diff/patch edit tool; keep overwrite for new files. |
| Trivial questions get routed to the expensive deep-research path. | Router threshold miscalibrated. | Routing policy — raise the complexity threshold (lesson 04). |
| A whole bug class (off-by-one in date math) recurs and no test catches it. | Eval blind spot. | Evaluation suite — add a regression case so future changes can't reintroduce it. |
| Tool-call formatting is wrong often enough to hurt, even with good prompts. | Base capability gap. | Model weights — fine-tune / DPO on preference pairs. Last resort, top rung. |
Every one of those proposals then passes through the same gate: run the regression suite, ship only if quality improves, keep a rollback path. And one subtlety from lesson 09 carries straight in: for the multi-agent version of the assistant, the tuning data must capture the complete trajectory — every agent's inputs and outputs — because each contributed to the shared outcome. Crediting only the final agent's output mislabels the cause.
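Choosing the layer can be made mechanical once candidate fixes have been measured. A sketch under assumptions: the fix rates are hypothetical numbers from regression runs, the 0.9 target is arbitrary, and `cheapest_sufficient_layer` is an illustrative name.

```python
from typing import Optional

# Cheapest first, in the order this lesson climbs.
LADDER = ["memory", "prompt", "tool", "router", "eval", "weights"]

def cheapest_sufficient_layer(fix_rate: dict[str, float],
                              target: float = 0.9) -> Optional[str]:
    """Pick the lowest rung whose candidate change fixes enough of the cluster.

    fix_rate maps layer -> fraction of the failure cluster that a candidate
    change in that layer fixed on the regression suite.
    """
    for layer in LADDER:
        if fix_rate.get(layer, 0.0) >= target:
            return layer
    return None  # nothing tried so far clears the bar

# Full-file rewrites for one-line fixes: a prompt nudge helps a little,
# a diff-edit tool fixes almost all of it, and fine-tuning would too.
measured = {"prompt": 0.35, "tool": 0.95, "weights": 0.97}
print(cheapest_sufficient_layer(measured))
```

Fine-tuning scores slightly higher in this made-up data, but the tool change already clears the bar at a fraction of the cost, so the ladder stops there.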
Where this points next
We can now improve the agent from experience: collect traces, find the systematic failure, change the cheapest sufficient layer, and gate it on evals. But notice what made SICA's tools improvable: they were well-defined, discoverable interfaces the agent could reason about and rewire. As agents start sharing tools, prompts, and resources across systems, ad-hoc tool wiring becomes the bottleneck. Lesson 12 (MCP, the Model Context Protocol) standardizes exactly that: how an agent discovers and uses external tools, prompts, and resources through a common contract. That integration layer is what makes a well-adapted tool portable across agents.
Failure modes
- Live self-modification with no sandbox or approval — SICA's whole safety story (Docker isolation + async supervisor) exists to prevent this. Self-editing without a sandbox is a host-compromise waiting to happen.
- Optimizing from one run — changing the system off a single impressive or embarrassing trace. That is overfitting to noise, not learning.
- Reward / eval hacking — the agent learns to pass your regression suite without doing the task (the PPO reward-model exploit, generalized). Hold-out and rotate evals.
- Mixing private user memory into global policy — promoting one user's personal adaptation into the shared prompt leaks data and degrades others.
- Crediting the wrong agent — in multi-agent learning, tuning on the final output alone instead of the full trajectory.
- Skipping the gate — shipping a proposed change without the regression run or a rollback path.
Implementation checklist
- What traces and feedback are collected, and is the full trajectory captured?
- How are failures labelled and clustered into classes?
- Which single layer will change — memory, prompt, tool, router, eval, or weights?
- Is it the cheapest layer that fixes the class?
- Which held-out evals must pass before it ships?
- Is there a versioned archive and a rollback path?
- If the agent edits itself, is it sandboxed (Docker) and supervised (async watcher)?
- Are personal / project / global adaptations kept in separate scopes?
Interview prompts
- How is learning different from memory, and from reflection? (§1 — memory persists state; learning changes future behaviour because of that state; reflection is a within-run critique loop, learning is an across-run change loop gated by regression evals.)
- An agent keeps making the same mistake. What's your first move? (§2, §6 — climb the cheapest-layer ladder: memory → prompt → tool → router → eval → weights. Find the cheapest layer that fixes the whole failure class; weight training is the last resort.)
- Explain PPO's clip and DPO's shortcut. (§3 — PPO clips updates to a trust region so one big step can't collapse the policy; DPO skips the separate reward model and updates the policy directly from preference pairs, raising the preferred response's probability and lowering the rejected one's, with a KL term for stability.)
- What does SICA prove about agent learning? (§4 — that an agent can improve substantially with a fixed base model by rewriting its own tools and navigation, e.g. evolving full-file overwrite into diff-aware and AST-based editing; learning targets code/tools, not just weights.)
- Why does SICA need an asynchronous supervisor and Docker? (§4.2 — a second LLM watches the call graph / event stream for loops and stagnation and can terminate; Docker isolates the self-editing agent from the host so shell commands can't damage the filesystem. Safety and legibility for self-modification.)
- What is the single biggest risk in an automated self-improvement loop? (§3, §5 — a gameable evaluator: the agent learns to pass the eval without doing the task. Mitigate with held-out and rotated evals plus human spot-checks; the system must combine generation with honest evaluation and selection.)
- Why do multi-agent systems need full trajectories for learning? (§6 — each agent contributes to the shared outcome, so tuning data must capture every agent's inputs and outputs; crediting only the final output mislabels the cause.)