Environments & verifiers — where reward really comes from

A reward function is the most consequential design decision in any RL system. Pick a leaky verifier and the policy will find the leak before it solves the task. This lesson is a tour of the verifier landscape: math, code, web, tool-use, multi-turn — and the failure modes each invites.

Where this lesson sits

Lessons 15–17 covered the shape of the reward signal: trajectory-level (RLHF, DPO), step-level (PRM), and how to search over each. This lesson goes one level lower: what is the actual function that emits the number? For verifiable tasks, that function is the verifier itself; for non-verifiable tasks, it's a reward model trained on top of one. Nearly every reward in modern RL post-training falls into one of the five archetypes below.

The single most important principle

From Selection in Open-Ended Evolution and basically every RL paper from the last forty years:

Goodhart's law, RL flavor

"Any measure becomes a target the moment you optimize for it." If the verifier returns 1 because the answer is actually correct, the policy will learn to be correct. If the verifier returns 1 because the answer matches a string regex, the policy will learn to match the regex. The verifier is the task as far as the policy is concerned. Design accordingly.

Every environment in this lesson lives somewhere on the spectrum between "verifier captures the task" (math: pretty close) and "verifier is a sieve the policy can squeeze through" (open-ended writing: very far). The further from "captures the task", the more KL anchoring, reward shaping, and human review you'll need.

The five environment archetypes

Modern RL training environments break into a small number of recurring shapes. Each shape has a characteristic verifier, a characteristic failure mode, and a characteristic system cost.

Archetype	Verifier	Per-rollout cost	Famous in…
Math	String-equal final answer; CAS for symbolic	~ms	R1, GSM8K, MATH, AIME
Code	Run unit tests in sandbox	0.1–5s	HumanEval, MBPP, SWE-bench
Tool-use	End-state assertion after tool calls	0.5–10s	τ-bench, ToolBench, ALFWorld
Web / browse	Goal-state predicate on rendered page	5–60s	WebArena, Mind2Web
Open-ended (writing, dialog)	LLM-as-judge or reward model	~100ms	RLHF, MT-Bench, AlpacaEval

The most useful way to map these is on two axes: how tightly does the verifier match the actual task? (vertical), and how cheap is each rollout's reward? (horizontal). The top-left corner is paradise; the bottom-right is where most of the hard problems live.

The two axes pull in opposite directions: making the verifier tighter usually means running more of the actual task (slower); making it cheaper usually means substituting a proxy (more hackable). Where on this grid your task lives determines almost everything else about your RL setup — how many rollouts you can afford, how much KL anchoring you need, whether you can do RLVR or have to use a reward model.

1 · Math verifiers

The cleanest case. The standard pipeline:

Prompt formats the problem with an explicit answer slot — typically \boxed{...} for MATH-style, or "the answer is X" for GSM8K-style.
The verifier regex-extracts the boxed expression.
Optionally normalize (strip units, simplify fractions, canonicalize symbols).
Compare against the ground-truth answer with string-equality or, for symbolic problems, a Computer Algebra System like SymPy.

This is the verifier that powers RLVR (lessons 09–14). It is binary, fast, and nearly unhackable for simple numeric answers. Trouble starts when the gold answer is "1/2" and the model writes "0.5": naive string equality gives reward 0 for a correct answer. The fix is normalization (parse both sides as fractions and compare numerically). Without it, the policy learns to memorize the exact format of the training set's gold answers rather than to compute.
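
The four pipeline steps above can be sketched in a few lines. This is a minimal illustration, not a production verifier: the function names (`extract_boxed`, `to_number`, `math_reward`) are hypothetical, the regex does not handle nested braces such as `\boxed{\frac{1}{2}}`, and symbolic answers would need a CAS such as SymPy instead of the exact-match fallback used here.

```python
import re
from fractions import Fraction
from typing import Optional

# Matches \boxed{...} with no nested braces (a known limitation of this sketch).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_boxed(response: str) -> Optional[str]:
    """Return the contents of the last boxed expression, or None if absent."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None

def to_number(text: str) -> Optional[Fraction]:
    """Parse '1/2', '0.5', or '3' into an exact Fraction; None if not numeric."""
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        return None

def math_reward(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the boxed answer equals the gold answer."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    a, g = to_number(answer), to_number(gold)
    if a is not None and g is not None:
        return 1.0 if a == g else 0.0  # numeric comparison: 0.5 equals 1/2
    # Non-numeric answers fall back to exact match after trimming the ends.
    return 1.0 if answer == gold.strip() else 0.0
```

Comparing `Fraction` values rather than strings is what lets "0.5" earn full reward against a gold answer of "1/2".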

Common math-verifier bug

Stripping all whitespace from answers, including inside expressions, then string-comparing. The policy learns to emit answers with weird spacing that happens to match because of the strip. Trace: reward goes up, held-out accuracy doesn't move. Fix: normalize, then compare a small AST/numeric form, not raw strings.
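
To make the bug concrete, here is a small side-by-side sketch. Both function names are hypothetical; the second is only one possible fix for numeric answers, not a general parser.

```python
from fractions import Fraction

def naive_equal(answer: str, gold: str) -> bool:
    # Buggy: deletes whitespace inside the expression before comparing.
    return "".join(answer.split()) == "".join(gold.split())

def parsed_equal(answer: str, gold: str) -> bool:
    # Safer for numeric answers: trim only the ends, then compare parsed values.
    try:
        return Fraction(answer.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return False

# "1 2" is not the number 12, but the naive check rewards it.
leaked = naive_equal("1 2", "12")
rejected = parsed_equal("1 2", "12")
```

With the naive check, `leaked` is `True`; with the parsed check, `rejected` is `False`, while ordinary variants such as `" 12 "` or `"12.0"` still match.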

2 · Code verifiers

Run the model's code against unit tests in an isolated sandbox; reward = fraction of tests that pass. Three engineering challenges that don't exist in math:

Sandboxing. The model can and will write os.system("rm -rf /") if it cuts down on errors. Production code verifiers run in Firecracker / gVisor microVMs (or at least seccomp-restricted containers) with no network, ephemeral filesystems, and a hard CPU timeout (typically 5–30 seconds). Raw Docker / runc alone is not considered sufficient isolation for untrusted code from a learning policy.
Test leakage. If the test cases appear in the prompt, the policy will learn to print them directly. Hidden held-out tests are the standard mitigation.
Throughput. Typical verifier latency is 0.1–5s per rollout (with the 5–30s timeout as a hard cap for runaway code). At 256 rollouts per step you're CPU-bound. Production setups maintain a pool of pre-warmed Firecracker / gVisor microVMs and dispatch verifications in parallel — this is often the bottleneck of a code-RL training run, not the GPU.
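
The reward computation itself is simple; the isolation is the hard part. The sketch below shows only the reward logic, using a plain subprocess with a timeout. A subprocess is NOT a sandbox and must not be used on untrusted model output; in a real setup the `run_test` body would dispatch to a microVM or restricted container as described above. The names `run_test` and `code_reward` are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence

def run_test(solution: str, test: str, timeout_s: float = 5.0) -> bool:
    """Run solution + one test in a fresh interpreter; True if it exits with 0.

    Illustration only: this provides a timeout and a scratch directory,
    not isolation from the host.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + test + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,  # hard cap for runaway code
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0

def code_reward(solution: str, hidden_tests: Sequence[str]) -> float:
    """Reward = fraction of hidden tests that pass."""
    if not hidden_tests:
        return 0.0
    passed = sum(run_test(solution, t) for t in hidden_tests)
    return passed / len(hidden_tests)
```

Keeping `hidden_tests` out of the prompt is what addresses the test-leakage item; the function only ever sees them on the verifier side.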

SWE-bench-style verifiers

The frontier case: the "task" is a multi-file repo edit, the verifier is "do the original test suite plus the bug-fix tests all pass after applying the model's patch?" Per-rollout cost is 10s–10min. This pushes the verifier from "small cost relative to rollout" to "dominant cost", and forces async-verification design — the trainer can't wait per-rollout.
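
One way to picture async verification is a bounded pool of concurrent verifier calls whose results are consumed as they finish. The sketch below uses `asyncio` with a stand-in verifier; `verify_all`, `fake_verify`, and the concurrency limit are assumptions for illustration, and a real SWE-bench-style verifier would run a test suite in an isolated environment instead of sleeping.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Verifier = Callable[[str], Awaitable[float]]

async def verify_all(
    patches: List[str], verify: Verifier, max_concurrent: int = 8
) -> List[Tuple[int, float]]:
    """Score patches concurrently; results are collected in completion order."""
    gate = asyncio.Semaphore(max_concurrent)  # bounds in-flight verifications

    async def one(idx: int, patch: str) -> Tuple[int, float]:
        async with gate:
            return idx, await verify(patch)

    tasks = [asyncio.create_task(one(i, p)) for i, p in enumerate(patches)]
    results = []
    for finished in asyncio.as_completed(tasks):
        # A trainer could consume each (index, reward) pair as soon as it lands.
        results.append(await finished)
    return results

async def fake_verify(patch: str) -> float:
    """Stand-in for a slow test-suite run (hypothetical scoring rule)."""
    await asyncio.sleep(0.01 * len(patch))
    return 1.0 if "fix" in patch else 0.0

scores = asyncio.run(verify_all(["fix a", "bad", "fix bb"], fake_verify))
```

Each result carries its rollout index, so rewards can be matched back to trajectories even though they arrive out of order.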

3 · Tool-use environments

The model is given a set of callable tools (search, calculator, code interpreter, database, …) and a goal. A rollout is a sequence of {model token, tool call, tool result token, model token, …}. Reward is typically an end-state predicate: "did the model book the flight?", "did the database end in state X?", "is the file content equal to Y?".

From an RL perspective this is multi-turn rollout (lesson 08). The non-obvious system pieces:

Response mask. Tokens emitted by the tool (not the model) must have response_mask = 0 in the loss — they are observations, not actions. This is the single most common bug in agentic RL implementations; if you forget, the policy gets gradient credit for tokens it didn't generate.
Deterministic replay. Many tools (search, web) return different results on different calls. For training you typically pin a cache so the same rollout is reproducible; for evaluation you let it be stochastic.
Tool timeouts. A flaky tool that hangs five seconds amortizes badly when you have 1000 rollouts in flight. Pool, timeout, return a structured error token.
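
The response-mask rule is easy to get wrong, so here is a minimal sketch in plain Python. The segment representation and function names are assumptions for illustration; a real implementation would apply the same mask to a tensor of per-token losses.

```python
from typing import List, Tuple

# (source, token ids); source is "model" for actions, "tool" for observations.
Segment = Tuple[str, List[int]]

def build_response_mask(segments: List[Segment]) -> Tuple[List[int], List[int]]:
    """Flatten a multi-turn rollout and mark which tokens the model generated."""
    tokens: List[int] = []
    mask: List[int] = []
    for source, ids in segments:
        tokens.extend(ids)
        mask.extend([1 if source == "model" else 0] * len(ids))
    return tokens, mask

def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over model-generated tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept

rollout = [("model", [1, 2]), ("tool", [3, 4, 5]), ("model", [6])]
tokens, mask = build_response_mask(rollout)
loss = masked_mean([1.0, 1.0, 9.0, 9.0, 9.0, 4.0], mask)
```

In this example the three tool tokens carry large losses, but the masked mean ignores them and averages only over the three model tokens.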

4 · Web / browser environments

The most expensive verifier per rollout. The model controls a headless browser via accessibility-tree actions; reward comes from inspecting the resulting page (URL, DOM, screenshot OCR). WebArena, Mind2Web, and the Anthropic computer-use benchmarks all fit here.

Practical realities:

Per-rollout: 5–60 seconds. The GPU spends most of its time idle waiting for the browser.
Verifier is often itself an LLM — "does this rendered page contain a confirmation that flight X was booked?" — which means you have a reward model in the loop and inherit all of lesson 17's reward-hacking risks.
You cannot run 10⁶ rollouts. You must be very sample-efficient — which is part of why offline preference distillation (lesson 16) and behavior cloning are still dominant for web agents in 2025.

5 · LLM-as-judge for open-ended tasks

For "write a friendly response to this customer email", there is no verifier. The pragmatic substitute is an LLM judge — usually a stronger model (or an ensemble) that scores responses on a rubric.

Cheap (~100ms per call), but strongly hackable. Models learn to emit responses with judge-pleasing surface features: bullet points, hedged caveats, specific phrases.
Position bias: judges tend to favor a response based on where it appears in the pair, often the first. Always randomize order.
Verbosity bias: judges tend to prefer longer responses. Apply length normalization or counter-bias.
Self-bias: a judge tends to prefer responses from its own model family. Cross-family pairs (e.g., using Claude as the judge for a GPT-trained policy) help.
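
Two of these mitigations fit in a few lines. The sketch below assumes a hypothetical judge callable that takes `(prompt, first, second)` and returns `"first"` or `"second"`; the length-penalty form and its constants are illustrative choices, not recommended values.

```python
import random
from typing import Callable

# Hypothetical judge interface: returns "first" or "second" for a pair.
PairJudge = Callable[[str, str, str], str]

def judge_pair(
    prompt: str, a: str, b: str, judge: PairJudge, rng: random.Random
) -> str:
    """Show the pair in random order and map the verdict back to 'a' or 'b'."""
    swapped = rng.random() < 0.5
    first, second = (b, a) if swapped else (a, b)
    picked_first = judge(prompt, first, second) == "first"
    if swapped:
        return "b" if picked_first else "a"
    return "a" if picked_first else "b"

def length_penalized(
    score: float, response: str, target_words: int = 150, alpha: float = 0.001
) -> float:
    """Subtract a small penalty for each word beyond a target length."""
    extra = max(0, len(response.split()) - target_words)
    return score - alpha * extra
```

With a judge that always answers "first", `judge_pair` splits wins roughly evenly between `a` and `b`, so a purely positional preference no longer produces a systematic reward for either side.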

Interactive · how reward hacking looks in practice

Below: a toy two-channel verifier. Channel A is the "real" signal (correctness). Channel B is a leak — a surface feature correlated with correctness in the training set but not in held-out data. A policy can either learn the real channel or exploit the leak. Adjust the leak strength to see which it picks.

Reward hacking under a leaky verifier

"Real" reward is +1 when the answer is correct. "Leak" reward is +ρ when the answer contains a specific phrase that happens to correlate with correctness in 60% of training prompts. The policy starts uniform over two actions: "solve the problem" (correctness 0.7, no phrase) and "emit the phrase" (correctness 0.4, always phrase). On held-out prompts, the leak no longer correlates — and the policy is judged on real correctness only.

[Interactive widget: a slider for leak strength ρ (default 0.50), with readouts for π(solve) and π(emit phrase) (both 0.50 at the start), train reward (proxy), and held-out correctness.]
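
For readers without the interactive version, the same toy can be approximated offline. The sketch below is an assumption-laden reconstruction, not the widget's actual code: it treats the proxy reward as expected correctness plus ρ whenever the phrase is emitted, ignores the 60% correlation detail, and uses the exact expected policy-gradient update for a two-action softmax policy.

```python
import math

# Correctness rates taken from the description above.
P_CORRECT = {"solve": 0.7, "phrase": 0.4}

def softmax(logits):
    top = max(logits.values())
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def train(rho, steps=2000, lr=0.5):
    """Return (policy, train proxy reward, held-out correctness) after training."""
    # Assumed proxy: expected correctness, plus rho when the phrase is emitted.
    proxy = {"solve": P_CORRECT["solve"], "phrase": P_CORRECT["phrase"] + rho}
    logits = {"solve": 0.0, "phrase": 0.0}  # uniform start
    for _ in range(steps):
        pi = softmax(logits)
        baseline = sum(pi[a] * proxy[a] for a in pi)
        for a in logits:
            # Expected policy-gradient step with a mean-reward baseline.
            logits[a] += lr * pi[a] * (proxy[a] - baseline)
    pi = softmax(logits)
    train_reward = sum(pi[a] * proxy[a] for a in pi)
    held_out = sum(pi[a] * P_CORRECT[a] for a in pi)  # leak pays nothing here
    return pi, train_reward, held_out

hacked = train(rho=0.5)
honest = train(rho=0.1)
```

Under these assumptions the tipping point is ρ = 0.3, where 0.4 + ρ equals 0.7. Above it the policy drifts toward "emit the phrase": train reward rises while held-out correctness falls toward 0.4. Below it the policy converges on "solve" and held-out correctness approaches 0.7.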

Designing a verifier — a short checklist

Before you turn on RL with a new verifier, ask:

What's the trivial exploit? If "always return 42" gets reward > 0, your verifier is broken before you start. Run a uniform-random baseline and a constant-string baseline through it; both should score near zero.
What's the format brittleness? If small format variants (capitalisation, trailing newlines, decimal vs fraction) flip the reward, normalize them out before comparing.
What's the throughput? Multiply (per-rollout latency × rollouts per step) and compare to your GPU step time. If verifier-bound, parallelize, cache, or async-verify.
What's the false-negative rate? Run your trained model on its own training prompts and have a human spot-check the responses that scored 0. Every false negative pushes the gradient away from a correct answer.
What's the held-out delta? The held-out task accuracy is the only signal you should trust. If train reward goes up and held-out doesn't, you're hacking the verifier.
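
The first and third checks are mechanical enough to script before any training run. This is a rough sketch under stated assumptions: `baseline_score` and `verifier_seconds_per_step` are hypothetical helpers, the verifier is any `(prompt, response) -> float` callable, and the throughput estimate assumes perfectly parallel workers with uniform latency.

```python
import math
from typing import Callable, Sequence

Verify = Callable[[str, str], float]  # (prompt, response) -> reward

def baseline_score(verify: Verify, prompts: Sequence[str], constant: str) -> float:
    """Mean reward of a 'policy' that returns the same string for every prompt."""
    return sum(verify(p, constant) for p in prompts) / len(prompts)

def verifier_seconds_per_step(latency_s: float, rollouts: int, workers: int) -> float:
    """Wall-clock verifier time per step, assuming ideal parallelism."""
    return latency_s * math.ceil(rollouts / workers)

# A deliberately leaky verifier: rewards any response that mentions "42".
gold = {"6*7?": "42", "2+2?": "4"}
leaky = lambda prompt, response: 1.0 if "42" in response else 0.0
strict = lambda prompt, response: 1.0 if response.strip() == gold[prompt] else 0.0

leaky_baseline = baseline_score(leaky, list(gold), "42")
strict_baseline = baseline_score(strict, list(gold), "42")
step_cost = verifier_seconds_per_step(latency_s=2.0, rollouts=256, workers=64)
```

Here the constant-string baseline scores 1.0 on the leaky verifier and 0.5 on the strict one. Neither is near zero with only two prompts, which is a reminder to run the baseline on a prompt set large enough that no single answer dominates. The throughput line gives 8 seconds of verifier time per step, the figure to compare against GPU step time.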

Where this lives in our framework

In RL/framework/environment.py, the Environment class is just a callable: env.verify(prompt, response) → float. Every archetype above swaps in a different verify body — a regex matcher, a sandboxed Python interpreter, a browser harness, an LLM judge. The rest of the framework doesn't care. That single seam is what lets the same training loop power AIME-math, SWE-bench, web agents, and RLHF without code surgery elsewhere.
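
As a rough picture of that seam, the sketch below defines a structural interface with a single `verify` method and one trivial implementation. It is not the actual contents of `RL/framework/environment.py`; `ExactMatchEnv` and `score_batch` are hypothetical names used only to show how a training loop can stay agnostic to the archetype.

```python
from typing import Dict, List, Protocol, Tuple

class Environment(Protocol):
    """Anything with verify(prompt, response) -> float counts as an environment."""
    def verify(self, prompt: str, response: str) -> float: ...

class ExactMatchEnv:
    """Toy environment: reward 1.0 if the trimmed response equals the gold answer."""
    def __init__(self, gold: Dict[str, str]) -> None:
        self.gold = gold

    def verify(self, prompt: str, response: str) -> float:
        return 1.0 if response.strip() == self.gold.get(prompt) else 0.0

def score_batch(env: Environment, pairs: List[Tuple[str, str]]) -> List[float]:
    """The only call the training loop needs, regardless of archetype."""
    return [env.verify(prompt, response) for prompt, response in pairs]

env = ExactMatchEnv({"2+2?": "4"})
rewards = score_batch(env, [("2+2?", " 4 "), ("2+2?", "5")])
```

Swapping in a sandboxed code runner or an LLM judge would mean writing another class with the same `verify` signature; `score_batch` would not change.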

Where Part III goes from here

Lessons 15–18 covered where the reward comes from. We've now closed that half of Part III. The next four lessons (19–22) shift to the systems side: given a fixed reward signal, what does the cluster look like that runs the loop at scale?

Lesson 19 · System topology. Rollout and training have opposite hardware profiles. How you co-locate or disaggregate them on the cluster decides 2–4× of throughput.
Lessons 20–21 · Inference engines. What's under rollout.generate(): the KV cache + PagedAttention (lesson 20) and the scheduling tricks on top (continuous batching, prefix caching, chunked prefill, and speculative decoding in lesson 21). The verifier of lesson 18 may dominate per-step cost; the generator of lessons 20–21 typically accounts for the next-largest slice.
Lesson 22 · Memory math. Whether any of this fits on the GPUs you actually have.

Then lesson 23 puts the reward side (15–18) and the systems side (19–22) together into the famous reasoning recipes — R1, Tülu 3, Qwen, the o-series — each of which is a particular choice on the reward axis crossed with a particular choice on the systems axis.

Takeaway

The verifier is the task. Pick one that captures what you actually want, normalize aggressively, sandbox what executes, and always read your held-out delta — that's the only number that tells you whether you're learning the task or learning the verifier.