
Environments & verifiers — where reward really comes from

A reward function is the most consequential design decision in any RL system. Pick a leaky verifier and the policy will find the leak before it solves the task. This lesson is a tour of the verifier landscape: math, code, web, tool-use, multi-turn — and the failure modes each invites.

Where this lesson sits
Lessons 15–17 covered the shape of the reward signal: trajectory-level (RLHF, DPO), step-level (PRM), and how to search over each. This lesson goes one level lower: what is the actual function that emits the number? For verifiable tasks, that function is the verifier itself; for non-verifiable tasks, it's a reward model trained on top of one. Nearly every reward signal in modern RL post-training falls into one of the five archetypes below.

The single most important principle

From Selection in Open-Ended Evolution and basically every RL paper from the last forty years:

Goodhart's law, RL flavor
"Any measure becomes a target the moment you optimize for it." If the verifier returns 1 because the answer is actually correct, the policy will learn to be correct. If the verifier returns 1 because the answer matches a string regex, the policy will learn to match the regex. The verifier is the task as far as the policy is concerned. Design accordingly.

Every environment in this lesson lives somewhere on the spectrum between "the verifier captures the task" (math: pretty close) and "the verifier is a sieve the policy can squeeze through" (open-ended writing: very far). The further from "captures the task", the more KL anchoring, reward shaping, and human review you'll need.

The five environment archetypes

Modern RL training environments break into a small number of recurring shapes. Each shape has a characteristic verifier, a characteristic failure mode, and a characteristic system cost.

Archetype | Verifier | Per-rollout cost | Famous in…
Math | String-equal final answer; CAS for symbolic | ~ms | R1, GSM8K, MATH, AIME
Code | Run unit tests in sandbox | 0.1–5s | HumanEval, MBPP, SWE-bench
Tool-use | End-state assertion after tool calls | 0.5–10s | τ-bench, ToolBench, ALFWorld
Web / browse | Goal-state predicate on rendered page | 5–60s | WebArena, Mind2Web
Open-ended (writing, dialog) | LLM-as-judge or reward model | ~100ms | RLHF, MT-Bench, AlpacaEval

The most useful way to map these is on two axes: how tightly does the verifier match the actual task? (vertical), and how cheap is each rollout's reward? (horizontal). The top-left corner is paradise; the bottom-right is where most of the hard problems live.

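[Figure: the five archetypes on a two-axis grid. Vertical axis: verifier captures the task (high fidelity at the top, leaky proxy at the bottom). Horizontal axis: cost per rollout (cheap, ms, on the left; slow, 10s+, on the right). Labelled points: Math (\boxed{} + normalize), Code (sandboxed unit tests), Tool-use (end-state predicate), Web (DOM / screenshot), LLM-as-judge (RM scoring).]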

The two axes pull in opposite directions: making the verifier tighter usually means running more of the actual task (slower); making it cheaper usually means substituting a proxy (more hackable). Where on this grid your task lives determines almost everything else about your RL setup — how many rollouts you can afford, how much KL anchoring you need, whether you can do RLVR or have to use a reward model.

1 · Math verifiers

The cleanest case. The standard pipeline:

  1. Prompt formats the problem with an explicit answer slot — typically \boxed{...} for MATH-style, or "the answer is X" for GSM8K-style.
  2. The verifier regex-extracts the boxed expression.
  3. Optionally normalize (strip units, simplify fractions, canonicalize symbols).
  4. Compare against the ground-truth answer with string-equality or, for symbolic problems, a Computer Algebra System like SymPy.

This is the verifier that powers RLVR (lessons 09–14). It is binary, fast, and nearly unhackable for simple numeric answers. Trouble starts when the gold answer is "1/2" and the model writes "0.5": naive string equality gives reward 0. The fix is normalization (parse both as fractions and compare numerically). Skip it, and the policy learns to memorize the exact format of the training set's gold answers rather than to compute.
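The four pipeline steps above can be sketched with the standard library alone. This is a minimal illustration, not the lesson's actual verifier: the function names (extract_boxed, to_number, verify_math) are hypothetical, only plain numbers and simple \frac{a}{b} forms are normalized, and a real symbolic comparison would need a CAS such as SymPy.

```python
import re
from fractions import Fraction


def extract_boxed(text):
    """Return the contents of the last \\boxed{...}, honouring nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as "no answer"


# Matches \frac{a}{b} and \dfrac{a}{b} with integer numerator and denominator.
_FRAC = re.compile(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}")


def to_number(s):
    """Parse a string as an exact rational, or return None if it is not one."""
    s = s.strip().strip("$")          # trim the ends only, never the inside
    s = _FRAC.sub(r"\1/\2", s)        # \frac{1}{2} -> 1/2
    try:
        return Fraction(s)            # accepts "3", "0.5", "1/2", "1e3"
    except (ValueError, ZeroDivisionError):
        return None


def verify_math(response, gold):
    """Binary reward: 1.0 if the boxed answer equals the gold answer."""
    pred = extract_boxed(response)
    if pred is None:
        return 0.0
    a, b = to_number(pred), to_number(gold)
    if a is not None and b is not None:
        return 1.0 if a == b else 0.0  # numeric comparison, format-independent
    return 1.0 if pred.strip() == gold.strip() else 0.0  # string fallback
```

Under these assumptions, verify_math("... \\boxed{0.5}", "1/2") returns 1.0, while an answer such as "1 2" is not silently collapsed into "12" because whitespace inside the expression is never stripped.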

Common math-verifier bug
Stripping all whitespace from answers, including inside expressions, then string-comparing. The policy learns to emit answers with odd spacing that match only because of the strip. Symptom: training reward goes up while held-out accuracy doesn't move. Fix: normalize, then compare a small AST or numeric form rather than raw strings.

2 · Code verifiers

Run the model's code against unit tests in an isolated sandbox; reward = fraction of tests that pass. Three engineering challenges that don't exist in math:

SWE-bench-style verifiers

The frontier case: the "task" is a multi-file repo edit, the verifier is "do the original test suite plus the bug-fix tests all pass after applying the model's patch?" Per-rollout cost is 10s–10min. This pushes the verifier from "small cost relative to rollout" to "dominant cost", and forces async-verification design — the trainer can't wait per-rollout.
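Two ideas from this section can be sketched together: reward as the fraction of tests that pass, and verification submitted to a worker pool so the trainer does not block on each rollout. Everything here is illustrative. The function names are hypothetical, tests are assumed to be plain assert snippets, and a fresh subprocess with a timeout is NOT a real sandbox (it has no filesystem, network, or memory isolation); it only stands in for one.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one_test(solution_code, test_code, timeout_s=5.0):
    """Run one test in a fresh interpreter; True if it exits with status 0."""
    program = solution_code + "\n" + test_code
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", program],  # -I: isolated mode
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # infinite loops and slow code count as failures
    return proc.returncode == 0


def code_reward(solution_code, tests, timeout_s=5.0):
    """Reward = fraction of tests that pass (0.0 when there are no tests)."""
    if not tests:
        return 0.0
    passed = sum(run_one_test(solution_code, t, timeout_s) for t in tests)
    return passed / len(tests)


def submit_rewards(pool, batch):
    """Queue verification for a batch of (solution, tests) pairs.

    Returns futures immediately, so the caller can keep generating rollouts
    and collect rewards later with future.result().
    """
    return [pool.submit(code_reward, code, tests) for code, tests in batch]
```

A usage sketch under the same assumptions: create a ThreadPoolExecutor(max_workers=4), call submit_rewards(pool, batch), continue with other work, and read each reward with f.result() once it is needed. Threads are enough here because the heavy work happens in the child processes.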

3 · Tool-use environments

The model is given a set of callable tools (search, calculator, code interpreter, database, …) and a goal. A rollout is a sequence of {model token, tool call, tool result token, model token, …}. Reward is typically an end-state predicate: "did the model book the flight?", "did the database end in state X?", "is the file content equal to Y?".
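A minimal sketch of the end-state idea, assuming a hypothetical toy environment with a single booking tool. The reward never looks at the transcript, only at the state the tool calls left behind; BookingEnv, run_tool_calls, and end_state_reward are names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class BookingEnv:
    """Hypothetical environment: one tool that mutates a bookings table."""
    bookings: dict = field(default_factory=dict)

    def book_flight(self, passenger: str, flight: str) -> str:
        self.bookings[passenger] = flight
        return f"booked {flight} for {passenger}"

    def tools(self):
        # Explicit registry: only these names are callable by the model.
        return {"book_flight": self.book_flight}


def run_tool_calls(env, calls):
    """Execute (tool_name, kwargs) pairs and return the tool-result transcript."""
    tools = env.tools()
    transcript = []
    for name, kwargs in calls:
        if name not in tools:
            transcript.append((name, "error: unknown tool"))
            continue
        try:
            result = tools[name](**kwargs)
        except TypeError as exc:  # wrong or missing arguments
            result = f"error: {exc}"
        transcript.append((name, result))
    return transcript


def end_state_reward(env, predicate):
    """Binary reward from a predicate over the final environment state."""
    return 1.0 if predicate(env) else 0.0
```

For the question "did the model book the flight?", the predicate could be lambda e: e.bookings.get("ada") == "UA100". A rollout that talks about booking but never calls the tool scores 0.0.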

From an RL perspective this is multi-turn rollout (lesson 08). The non-obvious system pieces:

4 · Web / browser environments

The most expensive verifier per rollout. The model controls a headless browser via accessibility-tree actions; reward comes from inspecting the resulting page (URL, DOM, screenshot OCR). WebArena, Mind2Web, and the Anthropic computer-use benchmarks all fit here.
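As a rough illustration of a goal-state predicate, the sketch below checks a page snapshot (final URL plus HTML) rather than driving a browser. It assumes the harness can hand the verifier those two strings; the names TextById and goal_reached, and the idea of keying on an element id, are choices made for this example only.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collects the text inside any element carrying the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def goal_reached(url, html, want_path, element_id, want_text):
    """1.0 only if the URL path matches AND the target element shows the text."""
    if urlparse(url).path != want_path:
        return 0.0
    parser = TextById(element_id)
    parser.feed(html)
    text = " ".join("".join(parser.chunks).split())  # collapse whitespace
    return 1.0 if want_text in text else 0.0
```

Scoping the text check to one element is a deliberate choice: a predicate that searches the whole page for "Order confirmed" can be satisfied by a help article or a search result that merely contains the phrase.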

Practical realities:

5 · LLM-as-judge for open-ended tasks

For "write a friendly response to this customer email", there is no verifier. The pragmatic substitute is an LLM judge — usually a stronger model (or an ensemble) that scores responses on a rubric.
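A sketch of rubric scoring with an ensemble, under stated assumptions: each judge is a hypothetical callable taking (rubric, prompt, response) and returning free text, the rubric asks for a line of the form "Score: <n>", and the ensemble is combined with a median. No real model API is called or implied.

```python
import re
from statistics import median

# Hypothetical rubric for the customer-email example.
RUBRIC = (
    "Rate the response from 1 to 10 for friendliness and helpfulness. "
    "Reply with a line of the form 'Score: <n>'."
)

_SCORE = re.compile(r"Score:\s*(\d+)")


def parse_score(text, lo=1, hi=10):
    """Extract the judge's score and rescale it to [0, 1]; None if invalid."""
    m = _SCORE.search(text)
    if m is None:
        return None
    n = int(m.group(1))
    if not lo <= n <= hi:
        return None  # out-of-range scores are discarded, not clamped
    return (n - lo) / (hi - lo)


def judge_reward(prompt, response, judges):
    """Median of the valid judge scores; 0.0 if no judge gave a usable score."""
    scores = []
    for judge in judges:
        s = parse_score(judge(RUBRIC, prompt, response))
        if s is not None:
            scores.append(s)
    return median(scores) if scores else 0.0
```

The median is one reasonable choice because a single judge that the policy has learned to flatter cannot move the reward on its own; a mean or a minimum would trade that robustness off differently.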

How reward hacking looks in practice

Consider a toy two-channel verifier. Channel A is the "real" signal (correctness). Channel B is a leak: a surface feature correlated with correctness in the training set but not in held-out data. A policy can either learn the real channel or exploit the leak, and which one it picks depends on the leak strength ρ.

Reward hacking under a leaky verifier
"Real" reward is +1 when the answer is correct. "Leak" reward is +ρ when the answer contains a specific phrase that happens to correlate with correctness in 60% of training prompts. The policy starts uniform over two actions: "solve the problem" (correctness 0.7, no phrase) and "emit the phrase" (correctness 0.4, always phrase). On held-out prompts, the leak no longer correlates — and the policy is judged on real correctness only.
[Widget readouts: π(solve) and π(emit phrase), both starting at 0.50, alongside train reward (proxy) and held-out correctness.]
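The toy setup can be reproduced offline with a two-action softmax policy trained by the exact expected policy gradient. This is a sketch under one simplifying assumption that is not in the lesson: during training the leak pays +ρ every time the phrase is emitted, and on held-out prompts it pays nothing. The function name simulate and the step count and learning rate are arbitrary choices.

```python
import math


def simulate(rho, steps=2000, lr=0.5):
    """Train a two-action softmax policy on the proxy reward.

    Actions: index 0 = "solve the problem", index 1 = "emit the phrase".
    """
    correctness = [0.7, 0.4]          # real correctness of each action
    train_reward = [0.7, 0.4 + rho]   # proxy reward seen by the policy
    logits = [0.0, 0.0]               # uniform start: pi = (0.5, 0.5)

    def softmax(ls):
        m = max(ls)
        exps = [math.exp(l - m) for l in ls]
        z = sum(exps)
        return [e / z for e in exps]

    for _ in range(steps):
        pi = softmax(logits)
        baseline = sum(p * r for p, r in zip(pi, train_reward))
        # Exact gradient of expected reward for a softmax policy:
        # d E[r] / d logit_i = pi_i * (r_i - baseline)
        logits = [l + lr * p * (r - baseline)
                  for l, p, r in zip(logits, pi, train_reward)]

    pi = softmax(logits)
    return {
        "pi_solve": pi[0],
        "train_reward": sum(p * r for p, r in zip(pi, train_reward)),
        "heldout_correctness": sum(p * c for p, c in zip(pi, correctness)),
    }
```

Under this assumption the crossover sits at ρ = 0.3, where 0.4 + ρ equals 0.7. Below it the policy drifts toward solving; above it the policy drifts toward the phrase, so train reward rises while held-out correctness falls toward 0.4.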

Designing a verifier — a short checklist

Before you turn on RL with a new verifier, ask:

  1. What's the trivial exploit? If "always return 42" gets reward > 0, your verifier is broken before you start. Run a uniform-random baseline and a constant-string baseline through it; both should score near zero.
  2. What's the format brittleness? If small format variants (capitalisation, trailing newlines, decimal vs fraction) flip the reward, normalize them out before comparing.
  3. What's the throughput? Multiply (per-rollout latency × rollouts per step) and compare to your GPU step time. If verifier-bound, parallelize, cache, or async-verify.
  4. What's the false-negative rate? Run your trained model on its own training prompts and have a human spot-check — every false negative is a token of gradient pointed in the wrong direction.
  5. What's the held-out delta? The held-out task accuracy is the only signal you should trust. If train reward goes up and held-out doesn't, you're hacking the verifier.
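Items 1 to 3 of the checklist are mechanical enough to script before training starts. The helpers below are a sketch, assuming a verifier with the signature verify(prompt, response) returning a float; the function names and the workers parameter (parallel verifier processes) are assumptions added for illustration.

```python
def constant_baselines(verify, prompts, constants):
    """Item 1: mean reward of always answering each constant string."""
    return {
        c: sum(verify(p, c) for p in prompts) / len(prompts)
        for c in constants
    }


def format_brittle(verify, prompt, equivalent_answers):
    """Item 2: True if answers that should be equivalent get different rewards."""
    rewards = {verify(prompt, a) for a in equivalent_answers}
    return len(rewards) > 1


def verifier_seconds_per_step(latency_s, rollouts_per_step, workers=1):
    """Item 3: wall-clock verifier time per training step.

    Compare the result to the GPU step time; if it is larger, the loop is
    verifier-bound.
    """
    return latency_s * rollouts_per_step / workers
```

For example, a 2 s verifier with 512 rollouts per step costs 1024 s per step when run serially and 16 s with 64 workers, so whether it is the bottleneck depends entirely on how that compares to the GPU step time.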

Where this lives in our framework

In RL/framework/environment.py, the Environment class is just a callable: env.verify(prompt, response) → float. Every archetype above swaps in a different verify body — a regex matcher, a sandboxed Python interpreter, a browser harness, an LLM judge. The rest of the framework doesn't care. That single seam is what lets the same training loop power AIME-math, SWE-bench, web agents, and RLHF without code surgery elsewhere.
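The seam described above might look roughly like the following. This is not the contents of RL/framework/environment.py, which is not shown here; only the verify(prompt, response) → float signature comes from the text, and FunctionEnv and score_rollouts are hypothetical names.

```python
from typing import Callable, Iterable, List, Protocol, Tuple


class Environment(Protocol):
    """Anything with verify(prompt, response) -> float counts as an environment."""

    def verify(self, prompt: str, response: str) -> float: ...


class FunctionEnv:
    """Wraps any (prompt, response) -> number function as an environment."""

    def __init__(self, fn: Callable[[str, str], float]):
        self._fn = fn

    def verify(self, prompt: str, response: str) -> float:
        return float(self._fn(prompt, response))


def score_rollouts(env: Environment,
                   rollouts: Iterable[Tuple[str, str]]) -> List[float]:
    """The only call the training loop needs to make on an environment."""
    return [env.verify(prompt, response) for prompt, response in rollouts]
```

With this shape, swapping a regex matcher for a sandboxed test runner or an LLM judge means passing a different object to score_rollouts; the calling code does not change.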

Where Part III goes from here

Lessons 15–18 covered where the reward comes from. We've now closed that half of Part III. The next four lessons (19–22) shift to the systems side: given a fixed reward signal, what does the cluster look like that runs the loop at scale?

Then lesson 23 puts the reward side (15–18) and the systems side (19–22) together into the famous reasoning recipes — R1, Tülu 3, Qwen, the o-series — each of which is a particular choice on the reward axis crossed with a particular choice on the systems axis.

Takeaway
The verifier is the task. Pick one that captures what you actually want, normalize aggressively, sandbox what executes, and always read your held-out delta — that's the only number that tells you whether you're learning the task or learning the verifier.