Outcome rewards, process rewards, and search
A verifier tells you the answer is wrong. A process reward model tells you where it went wrong. The first is enough to train; the second is what makes test-time search and trajectory-level credit assignment possible — and is the conceptual core of OpenAI's o-series and DeepSeek-R1 distillations.
The credit-assignment problem, one more time
From lesson 04: you sampled a 600-token reasoning trace, the final answer was wrong, and now you have a single scalar r = 0 attached to all 600 tokens. The gradient is shared equally. But the trace probably contained 580 correct tokens and 20 wrong ones — the algorithm has no way to know which.
This is fine when reward is dense (lesson 01) or when you can afford enough rollouts that variance averages out (GRPO). It is not fine when:
- Trajectories are long. 32k-token reasoning chains amplify noise — most variance reduction techniques scale with the inverse of trajectory length.
- Errors are localized. A single arithmetic slip nine steps in invalidates a 1200-token chain that was otherwise correct.
- You want to search at test time. If a chain is going wrong at step 5, you'd like to back up and try a different step 5 — but you need a way to know it's going wrong before the final answer is in.
Two different kinds of reward model exist to address two of these. We'll define them, then look at how search uses them.
ORM · Outcome Reward Model
An Outcome Reward Model is the obvious thing: a model that, given the prompt and the full response, predicts a scalar score for "did this response solve the problem?". It is trained on (prompt, full response, 0/1) tuples, with a binary cross-entropy loss. At inference time, you feed it a candidate response, you get back a confidence that the answer is correct.
ORMs are the cheap option:
- Training data is easy to collect — for math, run a verifier; for code, run unit tests; for general tasks, ask a stronger model to judge.
- They're a drop-in for the BT reward model in RLHF (lesson 15) — same architecture, same scalar head, different training objective.
- At test time they let you do Best-of-N: sample N responses, score each with the ORM, keep the highest-scoring. A surprisingly strong baseline.
What ORMs cannot do: localize errors. If the ORM says "p = 0.08 that this 800-token response is correct", it has no way to point at the bad token. That's what PRMs are for.
PRM · Process Reward Model
A Process Reward Model scores each intermediate step of a reasoning trace, not just the final answer. Given a partial chain ending at step k, it returns a scalar — typically the probability that this prefix can still lead to a correct final answer.
Training data is the expensive part:
- Human-labeled (PRM800K, OpenAI '23). Annotators read each step and label it as correct / incorrect / neutral. Highest quality, very expensive.
- Monte-Carlo labeled (Math-Shepherd '23). For each prefix, sample N completions from the base policy, run the verifier on each, and use the empirical success rate as the soft label. Cheap, noisy, scales.
- Self-supervised step labels. If the final answer is correct, optimistically label all steps positive — then re-label using the trained PRM. Iterate.
The architecture is usually the same SFT-init scalar head, but applied at every step boundary (often a newline token) instead of just the final token.
How a PRM enables search at test time
Once you can score partial chains, you can search:
- Best-of-N rerank. Sample N full chains, score each with the PRM (e.g., min or mean of step scores), keep the best. Strictly better than ORM rerank when the PRM is well-calibrated, because it punishes chains with one broken step in the middle that the ORM might miss.
- Beam search over steps. Maintain B partial chains. At each step, extend each by sampling K continuations, then keep the top-B by PRM score. Vastly more compute than greedy decoding but with diminishing returns past B ≈ 8.
- MCTS (Monte-Carlo Tree Search). The o1 / R1-distillation idea. Expand a tree of partial chains. At each node, balance exploration (try less-explored branches) against exploitation (pursue branches with high PRM score). Roll out from leaves, back-propagate observed verifier rewards up the tree, re-expand. Continue until budget is exhausted; return the best chain found.
Visually, these three are points on the same axis — how much of the tree do you keep around?:
The thicker the branch, the more compute you spent on it — and the more confident the search is that the right answer lies under it. Best-of-N spends compute uniformly. Beam search prunes greedily by PRM score. MCTS spends adaptively, guided by the UCB formula q(node) + c·√(ln Nparent / nnode) — high-PRM nodes attract visits, but un-visited siblings get a bonus that decays as the parent is explored. The same compute, distributed very differently.
Interactive · Best-of-N rerank vs PRM-guided search
Below is a toy with a 6-step reasoning chain. At each step the policy samples a "good" or "bad" token. A bad step in the middle makes the final answer wrong; the ORM only sees the final right/wrong bit, the PRM sees each step. Compare Best-of-N rerank under both verifiers and PRM-guided beam search.
What PRMs give you at training time
The above is all test-time inference. The training-time benefits are subtler and arguably more important.
Dense reward signal
Instead of one r ∈ {0, 1} per trajectory, you have one per step. Plugging that into REINFORCE/PPO gives per-step advantages — credit assignment now lives at step granularity, the gradient-estimator variance drops by roughly a factor of L for an L-step chain (stdev by √L), and KL-anchored policy updates become much sharper.
Bootstrap improvement
If your PRM can identify "this step was wrong", you can delete the bad step and let the model regenerate from the previous state. This is the engine behind iterative self-improvement loops: generate, find errors with PRM, regenerate at error site, repeat. R1's "rejection sampling + SFT" stage is a discrete version of exactly this.
The cost
PRMs are much harder to train well than ORMs. They are also reward-hackable in step-granular ways the ORM is not — a policy can learn to emit confident-looking-but-wrong step prefixes that the PRM scores high. DeepSeek explicitly tried PRM-based training for R1 and reported (in the R1 paper) that they abandoned PRMs because of three obstacles: hard-to-define fine-grained steps, hard-to-label intermediate correctness, and reward hacking on the PRM. They ended up using the rule-based verifier directly (exact-match for math, compiler / unit tests for code) — not a learned ORM at all — paired with pure GRPO. This is a live area; the field has not converged.
Trade-offs · ORM vs PRM vs verifier
| Signal | Granularity | Cost to build | Hackability | When to use |
|---|---|---|---|---|
| Verifier | End of trajectory | Free (already exists for the task) | Low — it's the ground truth | Math, code, anything with an automatic checker. |
| ORM | End of trajectory | Cheap (label data with verifier or LLM judge) | Medium | RLHF replacement; Best-of-N at inference. |
| PRM | Each reasoning step | Expensive (per-step labels) | High | Test-time search (MCTS); dense training reward where you can afford the noise. |
How modern systems actually use these
The current pragmatic consensus, as of 2025:
- RLVR (math/code). Use the verifier directly during training (GRPO/DAPO). No reward model. PRMs explored but not yet standard.
- RLHF (helpful/harmless). ORM-as-RM; PPO or DPO against it.
- Test-time scaling. Best-of-N with an ORM is the easy win. PRM + beam search adds another 3–8% on hard math at high compute budget. MCTS adds more but is a complex piece of infra.
- Reasoning training (o1-style, R1-style). Use the verifier to label long chain-of-thought rollouts, then distill the successful chains back into the model via SFT, then RL on top. PRMs are sometimes used to filter or annotate intermediate steps in the distillation set.