rl_lessons / 17 · PRM & search lesson 3 / 9 · part III

Outcome rewards, process rewards, and search

A verifier tells you the answer is wrong. A process reward model tells you where it went wrong. The first is enough to train; the second is what makes test-time search and trajectory-level credit assignment possible — and is the conceptual core of OpenAI's o-series and DeepSeek-R1 distillations.

Where this lesson sits
Lesson 15 introduced trajectory-level reward signal (a single scalar per response, from a Bradley–Terry RM). Lesson 16 collapsed RM + RL into one step (DPO). This lesson asks: what if you scored every step of a reasoning chain, not just the final answer? That's a finer-grained signal that unlocks (a) test-time search over reasoning trees and (b) denser training-time credit. Together with lesson 18 it closes the question "what does the reward signal look like and where does it come from".

The credit-assignment problem, one more time

From lesson 04: you sampled a 600-token reasoning trace, the final answer was wrong, and now you have a single scalar r = 0 attached to all 600 tokens. The gradient is shared equally. But the trace probably contained 580 correct tokens and 20 wrong ones — the algorithm has no way to know which.

This is fine when reward is dense (lesson 01) or when you can afford enough rollouts that variance averages out (GRPO). It is not fine when:

Two different kinds of reward model exist to address two of these. We'll define them, then look at how search uses them.

ORM · Outcome Reward Model

An Outcome Reward Model is the obvious thing: a model that, given the prompt and the full response, predicts a scalar score for "did this response solve the problem?". It is trained on (prompt, full response, 0/1) tuples, with a binary cross-entropy loss. At inference time, you feed it a candidate response, you get back a confidence that the answer is correct.

ORMs are the cheap option:

What ORMs cannot do: localize errors. If the ORM says "p = 0.08 that this 800-token response is correct", it has no way to point at the bad token. That's what PRMs are for.

PRM · Process Reward Model

A Process Reward Model scores each intermediate step of a reasoning trace, not just the final answer. Given a partial chain ending at step k, it returns a scalar — typically the probability that this prefix can still lead to a correct final answer.

Training data is the expensive part:

The architecture is usually the same SFT-init scalar head, but applied at every step boundary (often a newline token) instead of just the final token.

The PRM contract
At step boundary k, the PRM should output approximately P( verifier(full trajectory) = 1 | prefix up to step k ). That's it. Everything else — search, advantage estimation, dense training reward — is built on top of that one probabilistic interpretation.

How a PRM enables search at test time

Once you can score partial chains, you can search:

  1. Best-of-N rerank. Sample N full chains, score each with the PRM (e.g., min or mean of step scores), keep the best. Strictly better than ORM rerank when the PRM is well-calibrated, because it punishes chains with one broken step in the middle that the ORM might miss.
  2. Beam search over steps. Maintain B partial chains. At each step, extend each by sampling K continuations, then keep the top-B by PRM score. Vastly more compute than greedy decoding but with diminishing returns past B ≈ 8.
  3. MCTS (Monte-Carlo Tree Search). The o1 / R1-distillation idea. Expand a tree of partial chains. At each node, balance exploration (try less-explored branches) against exploitation (pursue branches with high PRM score). Roll out from leaves, back-propagate observed verifier rewards up the tree, re-expand. Continue until budget is exhausted; return the best chain found.

Visually, these three are points on the same axis — how much of the tree do you keep around?:

Best-of-N rerank no sharing · score at end PRM scores leaves only Beam search over steps keep top-B at each level PRM scores after every step MCTS expand · simulate · back-prop n=24, q=0.6 n=4 n=14 n=6 back-prop UCB(q + c·√(ln N / n))

The thicker the branch, the more compute you spent on it — and the more confident the search is that the right answer lies under it. Best-of-N spends compute uniformly. Beam search prunes greedily by PRM score. MCTS spends adaptively, guided by the UCB formula q(node) + c·√(ln Nparent / nnode) — high-PRM nodes attract visits, but un-visited siblings get a bonus that decays as the parent is explored. The same compute, distributed very differently.

Interactive · Best-of-N rerank vs PRM-guided search

Below is a toy with a 6-step reasoning chain. At each step the policy samples a "good" or "bad" token. A bad step in the middle makes the final answer wrong; the ORM only sees the final right/wrong bit, the PRM sees each step. Compare Best-of-N rerank under both verifiers and PRM-guided beam search.

ORM rerank vs. PRM rerank vs. PRM beam (toy 6-step chain)
Each chain has 6 steps; each step is "good" with probability p_good. A single bad step in any position fails the chain. The ORM only knows the final pass/fail (with 10% noise). The PRM scores each step (with 15% per-step noise). Beam search uses the PRM to prune at each step.
Greedy (one sample)
Best-of-N (ORM)
Best-of-N (PRM)
PRM beam

What PRMs give you at training time

The above is all test-time inference. The training-time benefits are subtler and arguably more important.

Dense reward signal

Instead of one r ∈ {0, 1} per trajectory, you have one per step. Plugging that into REINFORCE/PPO gives per-step advantages — credit assignment now lives at step granularity, the gradient-estimator variance drops by roughly a factor of L for an L-step chain (stdev by √L), and KL-anchored policy updates become much sharper.

Bootstrap improvement

If your PRM can identify "this step was wrong", you can delete the bad step and let the model regenerate from the previous state. This is the engine behind iterative self-improvement loops: generate, find errors with PRM, regenerate at error site, repeat. R1's "rejection sampling + SFT" stage is a discrete version of exactly this.

The cost

PRMs are much harder to train well than ORMs. They are also reward-hackable in step-granular ways the ORM is not — a policy can learn to emit confident-looking-but-wrong step prefixes that the PRM scores high. DeepSeek explicitly tried PRM-based training for R1 and reported (in the R1 paper) that they abandoned PRMs because of three obstacles: hard-to-define fine-grained steps, hard-to-label intermediate correctness, and reward hacking on the PRM. They ended up using the rule-based verifier directly (exact-match for math, compiler / unit tests for code) — not a learned ORM at all — paired with pure GRPO. This is a live area; the field has not converged.

Trade-offs · ORM vs PRM vs verifier

SignalGranularityCost to buildHackabilityWhen to use
VerifierEnd of trajectoryFree (already exists for the task)Low — it's the ground truthMath, code, anything with an automatic checker.
ORMEnd of trajectoryCheap (label data with verifier or LLM judge)MediumRLHF replacement; Best-of-N at inference.
PRMEach reasoning stepExpensive (per-step labels)HighTest-time search (MCTS); dense training reward where you can afford the noise.

How modern systems actually use these

The current pragmatic consensus, as of 2025:

A word on MCTS for LLM reasoning
Tree search over reasoning is conceptually elegant and empirically powerful, but the engineering surface is brutal: you need to back up partial KV cache state at every branch, mask out future tokens, share prefix cache across siblings, and avoid blowing through context length when traces are long. Most production systems do linear search (Best-of-N) or a constrained beam and call it a day. MCTS over reasoning chains is still mostly a research artifact, with a few notable exceptions.
Takeaway
ORMs score whole responses; PRMs score every reasoning step. The verifier is the strongest signal you can use during training. PRMs unlock test-time search but cost more to build, harder to keep honest. In practice the simplest setup that works is: verifier during training (RLVR), ORM for tasks without verifiers (RLHF/DPO), and PRMs only when you have a clear test-time-search budget that justifies them.