Outcome rewards, process rewards, and search

A verifier tells you the answer is wrong. A process reward model tells you where it went wrong. The first is enough to train; the second is what makes test-time search and trajectory-level credit assignment possible — and is the conceptual core of OpenAI's o-series and DeepSeek-R1 distillations.

Where this lesson sits

Lesson 15 introduced trajectory-level reward signal (a single scalar per response, from a Bradley–Terry RM). Lesson 16 collapsed RM + RL into one step (DPO). This lesson asks: what if you scored every step of a reasoning chain, not just the final answer? That's a finer-grained signal that unlocks (a) test-time search over reasoning trees and (b) denser training-time credit. Together with lesson 18 it closes the question "what does the reward signal look like and where does it come from".

The credit-assignment problem, one more time

From lesson 04: you sampled a 600-token reasoning trace, the final answer was wrong, and now you have a single scalar r = 0 attached to all 600 tokens. The gradient is shared equally. But the trace probably contained 580 correct tokens and 20 wrong ones — the algorithm has no way to know which.

This is fine when reward is dense (lesson 01) or when you can afford enough rollouts that variance averages out (GRPO). It is not fine when:

Trajectories are long. 32k-token reasoning chains amplify noise — most variance reduction techniques scale with the inverse of trajectory length.
Errors are localized. A single arithmetic slip nine steps in invalidates a 1200-token chain that was otherwise correct.
You want to search at test time. If a chain is going wrong at step 5, you'd like to back up and try a different step 5 — but you need a way to know it's going wrong before the final answer is in.

Two different kinds of reward model exist to address two of these. We'll define them, then look at how search uses them.

ORM · Outcome Reward Model

An Outcome Reward Model is the obvious thing: a model that, given the prompt and the full response, predicts a scalar score for "did this response solve the problem?". It is trained on (prompt, full response, 0/1) tuples, with a binary cross-entropy loss. At inference time, you feed it a candidate response, you get back a confidence that the answer is correct.

ORMs are the cheap option:

Training data is easy to collect — for math, run a verifier; for code, run unit tests; for general tasks, ask a stronger model to judge.
They're a drop-in for the BT reward model in RLHF (lesson 15) — same architecture, same scalar head, different training objective.
At test time they let you do Best-of-N: sample N responses, score each with the ORM, keep the highest-scoring. A surprisingly strong baseline.

What ORMs cannot do: localize errors. If the ORM says "p = 0.08 that this 800-token response is correct", it has no way to point at the bad token. That's what PRMs are for.

PRM · Process Reward Model

A Process Reward Model scores each intermediate step of a reasoning trace, not just the final answer. Given a partial chain ending at step k, it returns a scalar — typically the probability that this prefix can still lead to a correct final answer.

Training data is the expensive part:

Human-labeled (PRM800K, OpenAI '23). Annotators read each step and label it as correct / incorrect / neutral. Highest quality, very expensive.
Monte-Carlo labeled (Math-Shepherd '23). For each prefix, sample N completions from the base policy, run the verifier on each, and use the empirical success rate as the soft label. Cheap, noisy, scales.
Self-supervised step labels. If the final answer is correct, optimistically label all steps positive — then re-label using the trained PRM. Iterate.

The architecture is usually the same SFT-init scalar head, but applied at every step boundary (often a newline token) instead of just the final token.

The PRM contract

At step boundary k, the PRM should output approximately P( verifier(full trajectory) = 1 | prefix up to step k ). That's it. Everything else — search, advantage estimation, dense training reward — is built on top of that one probabilistic interpretation.

How a PRM enables search at test time

Once you can score partial chains, you can search:

Best-of-N rerank. Sample N full chains, score each with the PRM (e.g., min or mean of step scores), keep the best. Strictly better than ORM rerank when the PRM is well-calibrated, because it punishes chains with one broken step in the middle that the ORM might miss.
Beam search over steps. Maintain B partial chains. At each step, extend each by sampling K continuations, then keep the top-B by PRM score. Vastly more compute than greedy decoding but with diminishing returns past B ≈ 8.
MCTS (Monte-Carlo Tree Search). The o1 / R1-distillation idea. Expand a tree of partial chains. At each node, balance exploration (try less-explored branches) against exploitation (pursue branches with high PRM score). Roll out from leaves, back-propagate observed verifier rewards up the tree, re-expand. Continue until budget is exhausted; return the best chain found.

Visually, these three are points on the same axis — how much of the tree do you keep around?:

The thicker the branch, the more compute you spent on it — and the more confident the search is that the right answer lies under it. Best-of-N spends compute uniformly. Beam search prunes greedily by PRM score. MCTS spends adaptively, guided by the UCB formula q(node) + c·√(ln N_parent / n_node) — high-PRM nodes attract visits, but un-visited siblings get a bonus that decays as the parent is explored. The same compute, distributed very differently.

Interactive · Best-of-N rerank vs PRM-guided search

Below is a toy with a 6-step reasoning chain. At each step the policy samples a "good" or "bad" token. A bad step in the middle makes the final answer wrong; the ORM only sees the final right/wrong bit, the PRM sees each step. Compare Best-of-N rerank under both verifiers and PRM-guided beam search.

ORM rerank vs. PRM rerank vs. PRM beam (toy 6-step chain)

Each chain has 6 steps; each step is "good" with probability p_good. A single bad step in any position fails the chain. The ORM only knows the final pass/fail (with 10% noise). The PRM scores each step (with 15% per-step noise). Beam search uses the PRM to prune at each step.

p_good per step: 0.75 N samples: 16 Beam width: 4

Greedy (one sample)

—

Best-of-N (ORM)

—

Best-of-N (PRM)

—

PRM beam

—

What PRMs give you at training time

The above is all test-time inference. The training-time benefits are subtler and arguably more important.

Dense reward signal

Instead of one r ∈ {0, 1} per trajectory, you have one per step. Plugging that into REINFORCE/PPO gives per-step advantages — credit assignment now lives at step granularity, the gradient-estimator variance drops by roughly a factor of L for an L-step chain (stdev by √L), and KL-anchored policy updates become much sharper.

Bootstrap improvement

If your PRM can identify "this step was wrong", you can delete the bad step and let the model regenerate from the previous state. This is the engine behind iterative self-improvement loops: generate, find errors with PRM, regenerate at error site, repeat. R1's "rejection sampling + SFT" stage is a discrete version of exactly this.

The cost

PRMs are much harder to train well than ORMs. They are also reward-hackable in step-granular ways the ORM is not — a policy can learn to emit confident-looking-but-wrong step prefixes that the PRM scores high. DeepSeek explicitly tried PRM-based training for R1 and reported (in the R1 paper) that they abandoned PRMs because of three obstacles: hard-to-define fine-grained steps, hard-to-label intermediate correctness, and reward hacking on the PRM. They ended up using the rule-based verifier directly (exact-match for math, compiler / unit tests for code) — not a learned ORM at all — paired with pure GRPO. This is a live area; the field has not converged.

Trade-offs · ORM vs PRM vs verifier

Signal	Granularity	Cost to build	Hackability	When to use
Verifier	End of trajectory	Free (already exists for the task)	Low — it's the ground truth	Math, code, anything with an automatic checker.
ORM	End of trajectory	Cheap (label data with verifier or LLM judge)	Medium	RLHF replacement; Best-of-N at inference.
PRM	Each reasoning step	Expensive (per-step labels)	High	Test-time search (MCTS); dense training reward where you can afford the noise.

How modern systems actually use these

The current pragmatic consensus, as of 2025:

RLVR (math/code). Use the verifier directly during training (GRPO/DAPO). No reward model. PRMs explored but not yet standard.
RLHF (helpful/harmless). ORM-as-RM; PPO or DPO against it.
Test-time scaling. Best-of-N with an ORM is the easy win. PRM + beam search adds another 3–8% on hard math at high compute budget. MCTS adds more but is a complex piece of infra.
Reasoning training (o1-style, R1-style). Use the verifier to label long chain-of-thought rollouts, then distill the successful chains back into the model via SFT, then RL on top. PRMs are sometimes used to filter or annotate intermediate steps in the distillation set.

A word on MCTS for LLM reasoning

Tree search over reasoning is conceptually elegant and empirically powerful, but the engineering surface is brutal: you need to back up partial KV cache state at every branch, mask out future tokens, share prefix cache across siblings, and avoid blowing through context length when traces are long. Most production systems do linear search (Best-of-N) or a constrained beam and call it a day. MCTS over reasoning chains is still mostly a research artifact, with a few notable exceptions.

Takeaway

ORMs score whole responses; PRMs score every reasoning step. The verifier is the strongest signal you can use during training. PRMs unlock test-time search but cost more to build, harder to keep honest. In practice the simplest setup that works is: verifier during training (RLVR), ORM for tasks without verifiers (RLHF/DPO), and PRMs only when you have a clear test-time-search budget that justifies them.