Environments & verifiers — where reward really comes from
A reward function is the most consequential design decision in any RL system. Pick a leaky verifier and the policy will find the leak before it solves the task. This lesson is a tour of the verifier landscape: math, code, web, tool-use, multi-turn — and the failure modes each invites.
The single most important principle
From Selection in Open-Ended Evolution and basically every RL paper from the last forty years:
Every environment in this lesson lives somewhere on the spectrum between "verifier captures the task" (math: pretty close) and "verifier is a sieve the policy can squeeze through" (open-ended writing: very far). The further from "captures the task", the more KL anchoring, reward shaping, and human review you'll need.
The five environment archetypes
Modern RL training environments break into a small number of recurring shapes. Each shape has a characteristic verifier, a characteristic failure mode, and a characteristic system cost.
| Archetype | Verifier | Per-rollout cost | Famous in… |
|---|---|---|---|
| Math | String-equal final answer; CAS for symbolic | ~ms | R1, GSM8K, MATH, AIME |
| Code | Run unit tests in sandbox | 0.1–5s | HumanEval, MBPP, SWE-bench |
| Tool-use | End-state assertion after tool calls | 0.5–10s | τ-bench, ToolBench, ALFWorld |
| Web / browse | Goal-state predicate on rendered page | 5–60s | WebArena, Mind2Web |
| Open-ended (writing, dialog) | LLM-as-judge or reward model | ~100ms | RLHF, MT-Bench, AlpacaEval |
The most useful way to map these is on two axes: how tightly does the verifier match the actual task? (vertical), and how cheap is each rollout's reward? (horizontal). The top-left corner is paradise; the bottom-right is where most of the hard problems live.
The two axes pull in opposite directions: making the verifier tighter usually means running more of the actual task (slower); making it cheaper usually means substituting a proxy (more hackable). Where on this grid your task lives determines almost everything else about your RL setup — how many rollouts you can afford, how much KL anchoring you need, whether you can do RLVR or have to use a reward model.
1 · Math verifiers
The cleanest case. The standard pipeline:
- Prompt formats the problem with an explicit answer slot, typically `\boxed{...}` for MATH-style or "the answer is X" for GSM8K-style.
- The verifier regex-extracts the boxed expression.
- Optionally normalize (strip units, simplify fractions, canonicalize symbols).
- Compare against the ground-truth answer with string-equality or, for symbolic problems, a Computer Algebra System like SymPy.
This is the verifier that powers RLVR (lessons 09–14). It is binary, fast, and almost-unhackable for elementary number answers. Trouble starts when the gold answer is "1/2" and the model writes "0.5" — naive string-equal gives reward 0. The fix is normalization (parse as a fraction, compare numerically); the failure mode if you don't is the policy learning to memorize the exact format of the training set's gold answers rather than to compute.
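The pipeline above can be sketched in a few lines. This is a minimal illustration, not a production grader: the function names (`extract_boxed`, `to_number`, `verify_math`) are hypothetical, a brace-matching scan stands in for the regex so that nested braces such as `\frac{1}{2}` survive extraction, and normalization only covers plain numbers and simple fractions.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_boxed(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, allowing nested braces."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in response[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as "no answer"


def to_number(text: str) -> Optional[Fraction]:
    """Parse '0.5', '1/2', or '\\frac{1}{2}' into an exact Fraction."""
    text = text.strip().replace(",", "").replace("$", "")
    match = re.fullmatch(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}", text)
    if match:
        text = f"{match.group(1)}/{match.group(2)}"
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None


def verify_math(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the gold answer."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    pred_num, gold_num = to_number(predicted), to_number(gold)
    if pred_num is not None and gold_num is not None:
        return float(pred_num == gold_num)  # numeric comparison
    return float(predicted.strip() == gold.strip())  # string fallback
```

With this normalization, a response ending in `\boxed{0.5}` earns full reward against a gold answer of `1/2`, which is exactly the case naive string equality gets wrong. Symbolic answers would still need a CAS comparison on top of the string fallback.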
2 · Code verifiers
Run the model's code against unit tests in an isolated sandbox; reward = fraction of tests that pass. Three engineering challenges that don't exist in math:
- Sandboxing. The model can and will write `os.system("rm -rf /")` if it cuts down on errors. Production code verifiers run in Firecracker / gVisor microVMs (or at least seccomp-restricted containers) with no network, ephemeral filesystems, and a hard CPU timeout (typically 5–30 seconds). Raw Docker / `runc` alone is not considered sufficient isolation for untrusted code from a learning policy.
- Test leakage. If the test cases appear in the prompt, the policy will learn to print them directly. Hidden held-out tests are the standard mitigation.
- Throughput. Typical verifier latency is 0.1–5s per rollout (with the 5–30s timeout as a hard cap for runaway code). At 256 rollouts per step you're CPU-bound. Production setups maintain a pool of pre-warmed Firecracker / gVisor microVMs and dispatch verifications in parallel — this is often the bottleneck of a code-RL training run, not the GPU.
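The reward computation itself is small; the isolation around it is the hard part. The sketch below shows only the "fraction of tests that pass" logic with a hard timeout, using a plain subprocess. It is explicitly not a sandbox: the name `run_tests` and the convention that each test is a snippet of assertions appended to the solution are assumptions for illustration, and real untrusted code should only ever run inside the microVM-style isolation described above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(solution: str, tests: list[str], timeout_s: float = 5.0) -> float:
    """Reward = fraction of hidden tests that pass.

    WARNING: a bare subprocess offers no real isolation. This only
    illustrates the scoring and timeout logic.
    """
    if not tests:
        return 0.0
    passed = 0
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        for test in tests:
            # Each test runs in a fresh interpreter so failures do not interact.
            script.write_text(solution + "\n\n" + test + "\n")
            try:
                proc = subprocess.run(
                    [sys.executable, "-I", str(script)],  # -I: isolated mode
                    cwd=workdir,
                    capture_output=True,
                    timeout=timeout_s,
                )
                passed += proc.returncode == 0
            except subprocess.TimeoutExpired:
                pass  # runaway code counts as a failed test
    return passed / len(tests)
```

Keeping the tests outside the prompt and passing them only to the verifier is what makes them "hidden": the policy sees the reward, never the assertions.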
SWE-bench-style verifiers
The frontier case: the "task" is a multi-file repo edit, the verifier is "do the original test suite plus the bug-fix tests all pass after applying the model's patch?" Per-rollout cost is 10s–10min. This pushes the verifier from "small cost relative to rollout" to "dominant cost", and forces async-verification design — the trainer can't wait per-rollout.
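One way to picture async verification: submit every rollout to a pool of verifier workers and consume rewards in completion order, so one slow test suite does not block the rest. The sketch below uses a local thread pool as a stand-in for a fleet of remote verifier machines; `verify_async` and its parameters are hypothetical names, and a real system would more likely use a job queue with retries.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterator, Sequence


def verify_async(
    rollouts: Sequence[str],
    verify: Callable[[str], float],
    max_workers: int = 8,
) -> Iterator[tuple[int, float]]:
    """Yield (rollout_index, reward) pairs as soon as each one finishes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(verify, r): i for i, r in enumerate(rollouts)}
        for future in as_completed(futures):
            # The index lets the trainer match rewards back to rollouts
            # even though they arrive out of order.
            yield futures[future], future.result()
```

The trainer can start building a batch from the first rewards that arrive while long-running suites are still executing.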
3 · Tool-use environments
The model is given a set of callable tools (search, calculator, code interpreter, database, …) and a goal. A rollout is a sequence of {model token, tool call, tool result token, model token, …}. Reward is typically an end-state predicate: "did the model book the flight?", "did the database end in state X?", "is the file content equal to Y?".
From an RL perspective this is multi-turn rollout (lesson 08). The non-obvious system pieces:
- Response mask. Tokens emitted by the tool (not the model) must have `response_mask = 0` in the loss; they are observations, not actions. This is the single most common bug in agentic RL implementations; if you forget, the policy gets gradient credit for tokens it didn't generate.
- Deterministic replay. Many tools (search, web) return different results on different calls. For training you typically pin a cache so the same rollout is reproducible; for evaluation you let it be stochastic.
- Tool timeouts. A flaky tool that hangs for five seconds amortizes badly when you have 1000 rollouts in flight. Pool the tool workers, enforce a timeout, and return a structured error token when the call fails.
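The response mask is easiest to see on a concrete rollout. The following is a framework-free sketch: the segment format (a list of `(source, token_ids)` pairs), the function names, and the simple `-advantage * logprob` loss are all assumptions for illustration, standing in for whatever policy-gradient loss the trainer actually uses.

```python
from typing import Sequence


def build_response_mask(
    segments: Sequence[tuple[str, Sequence[int]]],
) -> tuple[list[int], list[int]]:
    """Flatten a multi-turn rollout into token ids plus a 0/1 response mask.

    Tokens from the "model" source are actions (mask 1); everything else,
    such as tool results, is an observation (mask 0).
    """
    token_ids: list[int] = []
    mask: list[int] = []
    for source, ids in segments:
        token_ids.extend(ids)
        mask.extend([1 if source == "model" else 0] * len(ids))
    return token_ids, mask


def masked_pg_loss(
    logprobs: Sequence[float],
    advantages: Sequence[float],
    response_mask: Sequence[int],
) -> float:
    """Mean of -advantage * logprob over model-generated tokens only."""
    if not (len(logprobs) == len(advantages) == len(response_mask)):
        raise ValueError("logprobs, advantages, and mask must align per token")
    total, count = 0.0, 0
    for logprob, advantage, keep in zip(logprobs, advantages, response_mask):
        if keep:
            total += -advantage * logprob
            count += 1
    return total / count if count else 0.0
```

Note that the mean divides by the number of model tokens, not the sequence length; dividing by the full length would silently shrink the gradient for rollouts with long tool outputs.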
4 · Web / browser environments
The most expensive verifier per rollout. The model controls a headless browser via accessibility-tree actions; reward comes from inspecting the resulting page (URL, DOM, screenshot OCR). WebArena, Mind2Web, and the Anthropic computer-use benchmarks all fit here.
Practical realities:
- Per-rollout: 5–60 seconds. The GPU spends most of its time idle waiting for the browser.
- Verifier is often itself an LLM — "does this rendered page contain a confirmation that flight X was booked?" — which means you have a reward model in the loop and inherit all of lesson 17's reward-hacking risks.
- You cannot run 10⁶ rollouts. You must be very sample-efficient — which is part of why offline preference distillation (lesson 16) and behavior cloning are still dominant for web agents in 2025.
5 · LLM-as-judge for open-ended tasks
For "write a friendly response to this customer email", there is no verifier. The pragmatic substitute is an LLM judge — usually a stronger model (or an ensemble) that scores responses on a rubric.
- Cheap (~100ms per call), but strongly hackable. Models learn to emit responses with judge-pleasing surface features: bullet points, hedged caveats, specific phrases.
- Position bias: judges prefer the first response in a pair. Always randomize order.
- Verbosity bias: judges prefer longer responses. Apply length normalization or counter-bias.
- Self-bias: a judge prefers responses from its own model family. Cross-family pairs (e.g., using Claude as the judge for a GPT-trained policy) help.
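Order randomization is the cheapest of these mitigations to implement. In the sketch below, the judge is a hypothetical callable that takes `(prompt, first, second)` and returns `0` if the first shown response wins or `1` if the second does; no real judge API is implied. The wrapper shuffles presentation order and maps the verdict back to the original pair.

```python
import random
from typing import Callable

# Hypothetical judge signature: returns 0 if the first shown response wins,
# 1 if the second shown response wins.
Judge = Callable[[str, str, str], int]


def prefers_a(
    prompt: str,
    resp_a: str,
    resp_b: str,
    judge: Judge,
    rng: random.Random,
) -> bool:
    """Ask the judge with randomized order; return True if resp_a wins."""
    swap = rng.random() < 0.5
    first, second = (resp_b, resp_a) if swap else (resp_a, resp_b)
    winner = judge(prompt, first, second)
    a_position = 1 if swap else 0  # where resp_a was actually shown
    return winner == a_position
```

A judge that always picks the first slot then gives each response roughly a 50% win rate instead of a systematic edge, so position bias turns into noise rather than signal. It does not address verbosity or self-bias, which need their own controls.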
Interactive · how reward hacking looks in practice
Below: a toy two-channel verifier. Channel A is the "real" signal (correctness). Channel B is a leak — a surface feature correlated with correctness in the training set but not in held-out data. A policy can either learn the real channel or exploit the leak. Adjust the leak strength to see which it picks.
Designing a verifier — a short checklist
Before you turn on RL with a new verifier, ask:
- What's the trivial exploit? If "always return 42" gets reward > 0, your verifier is broken before you start. Run a uniform-random baseline and a constant-string baseline through it; both should score near zero.
- What's the format brittleness? If small format variants (capitalisation, trailing newlines, decimal vs fraction) flip the reward, normalize them out before comparing.
- What's the throughput? Multiply (per-rollout latency × rollouts per step) and compare to your GPU step time. If verifier-bound, parallelize, cache, or async-verify.
- What's the false-negative rate? Run your trained model on its own training prompts and have a human spot-check — every false negative is a token of gradient pointed in the wrong direction.
- What's the held-out delta? The held-out task accuracy is the only signal you should trust. If train reward goes up and held-out doesn't, you're hacking the verifier.
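The first and third checks in this list are easy to automate. The sketch below is one possible shape for such an audit: `audit_verifier`, `verifier_seconds_per_step`, the 32-character random strings, and the 0.05 tolerance are illustrative assumptions, and the throughput estimate assumes perfectly parallel workers.

```python
import random
import string
from typing import Callable, Sequence

Verify = Callable[[str, str], float]  # (prompt, response) -> reward


def audit_verifier(
    verify: Verify,
    prompts: Sequence[str],
    constant: str = "42",
    seed: int = 0,
    tolerance: float = 0.05,
) -> dict:
    """Score a constant-string and a random-string baseline; both should be ~0."""
    if not prompts:
        raise ValueError("need at least one prompt to audit")
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + " "

    def random_response() -> str:
        return "".join(rng.choice(alphabet) for _ in range(32))

    constant_mean = sum(verify(p, constant) for p in prompts) / len(prompts)
    random_mean = sum(verify(p, random_response()) for p in prompts) / len(prompts)
    return {
        "constant": constant_mean,
        "random": random_mean,
        "ok": constant_mean <= tolerance and random_mean <= tolerance,
    }


def verifier_seconds_per_step(
    latency_s: float, rollouts_per_step: int, parallel_workers: int = 1
) -> float:
    """Idealized verifier wall-clock per step, to compare with GPU step time."""
    return latency_s * rollouts_per_step / parallel_workers
```

For example, 256 rollouts at 2 s each is 512 s of verification if run serially, and 8 s with 64 ideal parallel workers. If that number exceeds the GPU step time, the run is verifier-bound.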
Where this lives in our framework
In RL/framework/environment.py, the Environment class is just a callable: env.verify(prompt, response) → float. Every archetype above swaps in a different verify body — a regex matcher, a sandboxed Python interpreter, a browser harness, an LLM judge. The rest of the framework doesn't care. That single seam is what lets the same training loop power AIME-math, SWE-bench, web agents, and RLHF without code surgery elsewhere.
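As a rough picture of that seam, here is a sketch of what such an interface could look like. It is not the actual contents of `RL/framework/environment.py`: only the `verify(prompt, response) -> float` signature comes from the text, while the subclasses `ExactMatchEnv` and `JudgeEnv` and the helper `score_batch` are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable, Mapping, Sequence


class Environment(ABC):
    """The single seam: everything else only calls verify()."""

    @abstractmethod
    def verify(self, prompt: str, response: str) -> float:
        """Return the scalar reward for one (prompt, response) pair."""


class ExactMatchEnv(Environment):
    """Math-style archetype: compare against a stored gold answer."""

    def __init__(self, gold: Mapping[str, str]) -> None:
        self.gold = gold  # prompt -> gold answer

    def verify(self, prompt: str, response: str) -> float:
        return float(response.strip() == self.gold[prompt].strip())


class JudgeEnv(Environment):
    """Open-ended archetype: delegate scoring to a judge callable."""

    def __init__(self, judge: Callable[[str, str], float]) -> None:
        self.judge = judge  # hypothetical scorer, e.g. an LLM-judge wrapper

    def verify(self, prompt: str, response: str) -> float:
        return float(self.judge(prompt, response))


def score_batch(env: Environment, pairs: Sequence[tuple[str, str]]) -> list[float]:
    """The training loop's view: it never needs to know which archetype it has."""
    return [env.verify(prompt, response) for prompt, response in pairs]
```

Swapping `ExactMatchEnv` for `JudgeEnv` changes where the reward comes from without touching `score_batch` or anything downstream of it.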
Where Part III goes from here
Lessons 15–18 covered where the reward comes from. We've now closed that half of Part III. The next four lessons (19–22) shift to the systems side: given a fixed reward signal, what does the cluster look like that runs the loop at scale?
- Lesson 19 · System topology. Rollout and training have opposite hardware profiles. How you co-locate or disaggregate them on the cluster decides 2–4× of throughput.
- Lessons 20–21 · Inference engines. What's under `rollout.generate()`: the KV cache and PagedAttention (lesson 20), and the scheduling tricks on top (continuous batching, prefix caching, chunked prefill, and speculative decoding in lesson 21). The verifier of lesson 18 may dominate per-step cost; the generator of lessons 20–21 is the next-largest slice.
- Lesson 22 · Memory math. Whether any of this fits on the GPUs you actually have.
Then lesson 23 puts the reward side (15–18) and the systems side (19–22) together into the famous reasoning recipes — R1, Tülu 3, Qwen, the o-series — each of which is a particular choice on the reward axis crossed with a particular choice on the systems axis.