Algorithm — the plugin interface

The only piece of the system that's actually different across PPO, GRPO, RLOO, DAPO, and Dr.GRPO. Everything else is identical. This lesson defines the interface; Part II walks each algorithm in depth.

The two-method contract

In this framework, an Algorithm answers exactly two questions:

Advantage: Given a group of trajectories with rewards, what advantage does each token get?
Loss: Given a batch with advantages, old log-probs, and reference log-probs, what is the loss as a function of the current policy's log-probs?

That's it. The rollout engine doesn't know which algorithm is running. The trainer doesn't either. Swapping GRPO for DAPO is one file change — exactly the design pressure that makes algorithm.py a plugin and not a tangle of conditionals throughout the trainer.

The protocol, from algorithm.py

from typing import Protocol
from torch import Tensor

# Trajectory and Batch are the framework's own types, defined elsewhere.
class Algorithm(Protocol):
    def compute_advantages(self, groups: list[list[Trajectory]]) -> None:
        """Write traj.advantage on every Trajectory in place."""

    def compute_loss(self, batch: Batch, new_logp: Tensor) -> tuple[Tensor, dict]:
        """Return (scalar_loss, metrics_dict). Gradient flows through new_logp."""
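To make the contract concrete, here is a minimal sketch of a class that satisfies the protocol structurally, plus a caller that only ever sees the two methods. Nothing here is the framework's code: ToyTrajectory, ToyBatch, MeanBaselineReinforce, and train_step are hypothetical names, and plain Python lists stand in for tensors so the example runs without torch.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ToyTrajectory:            # hypothetical stand-in for Trajectory
    reward: float
    advantage: float = 0.0


@dataclass
class ToyBatch:                 # hypothetical stand-in for a packed Batch
    advantage: list[float]      # one advantage per token
    mask: list[int]             # 1 = token counts toward the loss


class Algorithm(Protocol):
    def compute_advantages(self, groups: list[list[ToyTrajectory]]) -> None: ...
    def compute_loss(self, batch: ToyBatch, new_logp: list[float]) -> tuple[float, dict]: ...


class MeanBaselineReinforce:
    """REINFORCE with a group-mean baseline. No inheritance is needed."""

    def compute_advantages(self, groups):
        for group in groups:
            mean = sum(t.reward for t in group) / len(group)
            for t in group:
                t.advantage = t.reward - mean      # written in place

    def compute_loss(self, batch, new_logp):
        n_tokens = sum(batch.mask)
        total = sum(-a * lp * m
                    for a, lp, m in zip(batch.advantage, new_logp, batch.mask))
        loss = total / n_tokens                    # token-level masked mean
        return loss, {"pg_loss": loss}


def train_step(algo: Algorithm, groups, batch, new_logp):
    """The caller depends on the two methods, never on the concrete class."""
    algo.compute_advantages(groups)
    return algo.compute_loss(batch, new_logp)


groups = [[ToyTrajectory(0.0), ToyTrajectory(0.0), ToyTrajectory(1.0)]]
batch = ToyBatch(advantage=[1.0, -1.0, 0.5], mask=[1, 1, 0])
loss, metrics = train_step(MeanBaselineReinforce(), groups, batch, [-0.5, -2.0, -1.0])
print(loss, [t.advantage for t in groups[0]])
```

Swapping in another algorithm would mean passing a different object to train_step; the function body would not change.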

Why this is one role, not five

Look at any two algorithms in RL/algorithms/ side by side — GRPO vs DAPO, or REINFORCE vs Dr.GRPO. They differ in:

What the advantage is. (Raw reward? r − mean? Divided by std? Leave-one-out? GAE from a value head?)
What the loss looks like. (Multiply A · log π? Use a PPO ratio? Clip symmetrically? Asymmetrically? Normalize per-token or per-sequence?)

They don't differ in: how rollouts are sampled, what the reference is for, where gradients flow, when weights sync. Those live in other roles. The whole point of factoring algorithm out is that everything algorithmic sits behind these two methods, and the trainer / rollout / controller never need to know about it.

The shared signature — what flows in

By the time compute_advantages is called, every Trajectory has:

Field	Set by	Shape
`prompt_ids`, `response_ids`	rollout	list of ints
`response_mask`	rollout	(R,) — 1 for assistant tokens, 0 for tool / EOS-pad
`old_logp`	rollout (at sampling time, lesson 02)	(R,)
`ref_logp`	reference (lesson 03)	(R,)
`reward`	env.verify (lesson 03)	scalar (single-turn) or per-turn sum (agentic, lesson 08)

The algorithm's job is to fill in one more field — traj.advantage — then later, when given a packed Batch and the trainer's fresh new_logp, produce a scalar loss.
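The field table above can be sketched as a dataclass. This is an assumed shape for illustration, not the framework's actual Trajectory definition: the name TrajectorySketch, the use of plain lists instead of tensors, the scalar type chosen for advantage, and the check helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrajectorySketch:
    prompt_ids: list[int]
    response_ids: list[int]             # length R
    response_mask: list[int]            # (R,) 1 = assistant token, 0 = tool / EOS-pad
    old_logp: list[float]               # (R,) recorded at sampling time
    ref_logp: list[float]               # (R,) from the reference model
    reward: float                       # scalar
    advantage: Optional[float] = None   # assumed scalar; filled in by compute_advantages

    def check(self) -> None:
        """Verify that every per-token field has the same length R."""
        r = len(self.response_ids)
        for name in ("response_mask", "old_logp", "ref_logp"):
            got = len(getattr(self, name))
            if got != r:
                raise ValueError(f"{name} has length {got}, expected {r}")


traj = TrajectorySketch(
    prompt_ids=[1, 2, 3],
    response_ids=[10, 11, 12, 13],
    response_mask=[1, 1, 0, 1],
    old_logp=[-0.2, -1.1, -0.7, -0.4],
    ref_logp=[-0.3, -1.0, -0.9, -0.5],
    reward=1.0,
)
traj.check()   # passes: all per-token fields have length 4
```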

Interactive · feel the advantage assignment

Three rollouts. You set their rewards. The table shows what advantage each algorithm assigns. This is a preview of lesson 11's widget; see that lesson for the depth. The point here: this table, together with the loss form later in this lesson, is the only thing that differs across algorithms.

Advantage assignment, three rollouts

Reasonable real-world distributions: 0,0,1 (one lucky correct), 1,1,1 (degenerate-positive), 0,0,0 (degenerate-negative). Set them all equal and every baselined row collapses to zero, while the GRPO row becomes 0/0 (unless an epsilon is added to the std). That is the "degenerate group" case from lesson 02.

r₀: 0.00 r₁: 0.00 r₂: 1.00

algorithm	A₀	A₁	A₂	idea
REINFORCE	—	—	—	raw reward; always pushes up. L09
+ group baseline	—	—	—	r − mean; above-mean rollouts pushed up, below-mean pushed down. Variance ↓.
GRPO	—	—	—	(r − mean) / std. Scale-invariant. L11
RLOO	—	—	—	r − mean(others). Strictly unbiased. L12
Dr.GRPO	—	—	—	r − mean (no /std). L14
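For the default rewards 0, 0, 1, the rows of this table can be worked out directly. The sketch below does so in plain Python. It assumes the population standard deviation plus a small eps for the GRPO row, which is one common convention and not necessarily what this framework uses; all function names are hypothetical.

```python
import math


def reinforce(rewards):
    return list(rewards)                       # raw reward


def group_baseline(rewards):
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]         # r - mean


def grpo(rewards, eps=1e-6):
    mean = sum(rewards) / len(rewards)
    # Assumption: population std (divide by K), with eps to avoid 0/0.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def rloo(rewards):
    k, total = len(rewards), sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]   # r - mean(others)


dr_grpo = group_baseline                       # r - mean, no /std

rewards = [0.0, 0.0, 1.0]
rows = {
    "REINFORCE": reinforce(rewards),
    "+ group baseline": group_baseline(rewards),
    "GRPO": grpo(rewards),
    "RLOO": rloo(rewards),
    "Dr.GRPO": dr_grpo(rewards),
}
for name, adv in rows.items():
    print(f"{name:<18}" + "  ".join(f"{a:+.3f}" for a in adv))
```

Under these assumptions the rows come out as 0, 0, 1 for REINFORCE; −0.333, −0.333, +0.667 for the group baseline and Dr.GRPO; −0.707, −0.707, +1.414 for GRPO; and −0.5, −0.5, +1.0 for RLOO. With all rewards equal, every row except REINFORCE is zero, and only the eps keeps the GRPO row from dividing by zero.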

The loss — same skeleton, different fillings

Once advantages are written, every algorithm produces its loss the same way:

# Simplified from RL/framework/algorithm.py: the shared skeleton
mask = batch.response_mask                # assumed field name; 1 = token counts toward the loss
A    = batch.advantage
log_ratio = (new_logp - batch.old_logp) * mask
ratio     = torch.exp(log_ratio)

# ── ONE OF THESE LINES is the algorithm-specific part ──
# REINFORCE:  pg_tok = -A * new_logp
# GRPO / PPO: pg_tok = -torch.min(ratio * A, torch.clamp(ratio, 1 - eps, 1 + eps) * A)
# DAPO:       same min, but the clamp is asymmetric: (1 - eps_low, 1 + eps_high)
# Dr.GRPO:    same min, but A is built without /std

pg_loss = _masked_mean(pg_tok, mask)                    # token-level
kl_loss = beta * _masked_mean(compute_kl_k3(new_logp, batch.ref_logp), mask)  # beta: KL coefficient
loss    = pg_loss + kl_loss
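To see what the commented lines do to a single token, here is a small worked sketch in plain Python, with floats instead of tensors. The function name clipped_pg_token and the values eps_low=0.2 and eps_high=0.28 are illustrative assumptions, not settings taken from this framework.

```python
import math


def clipped_pg_token(new_logp, old_logp, adv, eps_low=0.2, eps_high=0.2):
    """Per-token clipped surrogate loss; symmetric when eps_low == eps_high."""
    ratio = math.exp(new_logp - old_logp)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice: take the smaller objective, then negate for a loss.
    return -min(ratio * adv, clipped * adv)


# A token whose probability rose from 0.2 to 0.3, so ratio is about 1.5.
old_logp, new_logp = math.log(0.2), math.log(0.3)

sym_pos = clipped_pg_token(new_logp, old_logp, adv=1.0)                  # capped at 1.2
asym_pos = clipped_pg_token(new_logp, old_logp, adv=1.0, eps_high=0.28)  # capped at 1.28
sym_neg = clipped_pg_token(new_logp, old_logp, adv=-1.0)                 # not capped

print(sym_pos, asym_pos, sym_neg)
```

With a positive advantage the symmetric clip stops rewarding the token once the ratio passes 1.2, while the asymmetric variant lets it run to 1.28. With a negative advantage and the same ratio, the min picks the unclipped term, so the full penalty of about 1.5 applies.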

The KL anchor (lesson 03) is added by every algorithm. It doesn't really "belong to" any of them, but we apply it inside compute_loss because that is where new_logp and the batch's ref_logp are both available. Same arithmetic, one fewer call site for the masking.
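The name compute_kl_k3 suggests the low-variance "k3" KL estimator. Assuming that is what the framework's helper computes (the source does not show its body), a per-token sketch in plain Python looks like this; kl_k3_token is a hypothetical name.

```python
import math


def kl_k3_token(new_logp: float, ref_logp: float) -> float:
    """k3 estimate of KL(policy || reference) at one sampled token.

    With log_r = ref_logp - new_logp, k3 = exp(log_r) - log_r - 1.
    It is zero when the two log-probs agree and non-negative otherwise.
    """
    log_r = ref_logp - new_logp
    return math.exp(log_r) - log_r - 1.0


same = kl_k3_token(-1.0, -1.0)       # identical log-probs
drifted = kl_k3_token(-0.5, -1.5)    # policy has moved away from the reference
print(same, drifted)
```

Because exp(x) − x − 1 is never negative, each token contributes a non-negative penalty, which is not true of the naive per-token estimate new_logp − ref_logp.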

Token-level vs sequence-level loss

We sum then average at the token level: (per_token_loss * mask).sum() / mask.sum(). The alternative, averaging within each sequence (dividing by its length |y_i|) and then averaging across sequences, introduces a length bias: a token in a short response gets more weight than a token in a long one. DAPO and Dr.GRPO (lessons 13 and 14) make removing it a headline fix. The token-level form is what this framework uses everywhere.
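A small numeric sketch of the two aggregations, using plain Python lists and hypothetical helper names. The second helper is the per-sequence mean, that is, the /|y_i| term that the Dr.GRPO row in the table below deletes.

```python
def token_level_mean(losses, masks):
    """(per_token_loss * mask).sum() / mask.sum() over the whole batch."""
    total = sum(l * m for seq_l, seq_m in zip(losses, masks)
                for l, m in zip(seq_l, seq_m))
    count = sum(m for seq_m in masks for m in seq_m)
    return total / count


def per_sequence_mean(losses, masks):
    """Average inside each sequence first, then average across sequences."""
    per_seq = [sum(l * m for l, m in zip(seq_l, seq_m)) / sum(seq_m)
               for seq_l, seq_m in zip(losses, masks)]
    return sum(per_seq) / len(per_seq)


# One short sequence (2 tokens, loss 1.0 each) and one long one (8 tokens, loss 0.0).
losses = [[1.0] * 2, [0.0] * 8]
masks = [[1] * 2, [1] * 8]

tok = token_level_mean(losses, masks)     # 2.0 / 10 tokens
seq = per_sequence_mean(losses, masks)    # (1.0 + 0.0) / 2 sequences
print(tok, seq)
```

In the token-level form every token carries weight 1/10. In the per-sequence form each token of the short sequence carries weight 1/4 and each token of the long one 1/16, so the short response dominates the result.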

Where to read each algorithm

Algorithm	One-line change vs predecessor	Depth
REINFORCE	baseline = 0 (no patches; the floor)	Lesson 09
PPO	add value head + importance ratio + symmetric clip	Lesson 10
GRPO	drop value head; baseline = group mean, then divide by group std	Lesson 11
RLOO	baseline = mean of other K-1 rollouts	Lesson 12
DAPO	4 patches: clip-higher, dynamic sampling, token-level loss, overlong shaping	Lesson 13
Dr.GRPO	delete /std and /|y_i| from GRPO	Lesson 14

Takeaway

The algorithm is a plugin with two methods: compute_advantages and compute_loss. Six algorithms in this curriculum, one interface, one trainer that doesn't know which one is running. The depth — variance arguments, clip mechanics, length-bias debates — lives in Part II. Read this lesson as "where the plugin fits"; read Part II as "what fills the plugin".