
Agentic RL — multi-turn + tool masking

When the trajectory contains tokens the model didn't generate, the algorithm doesn't need to change — but the mask does.

The shift from single-turn to agentic

Lessons 1–7 lived in a single-turn world: prompt in, response out, reward at the end. Modern agentic systems — Claude with tools, OpenAI function calling, SWE-bench solvers, Voyager — are multi-turn. The model emits, the environment responds with tool output or new observations, the model emits again, and so on. The trajectory is a sequence of alternating assistant turns and environment turns.

turn 0 (assistant):   "let me compute   c(123+456+789)#"
turn 1 (env tool):    "=1368#"
turn 2 (assistant):   "the answer is 1368#"   ← terminal turn
                                              reward = verify("1368") = 1
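The environment side of this exchange can be sketched as a small function. This is only an illustration under stated assumptions: `calculator_step` and `CALL_PATTERN` are hypothetical names, the tool handles only sums of non-negative integers, and the return value is simplified to `(obs_text, reward, done)` rather than the framework's actual `env.step` signature.

```python
import re

# Hypothetical toy tool: evaluates a sum of non-negative integers inside c(...)#.
CALL_PATTERN = re.compile(r"c\(([0-9+]+)\)#")


def calculator_step(assistant_text: str, expected: str):
    """Return (obs_text, reward, done) for one assistant turn (simplified sketch)."""
    call = CALL_PATTERN.search(assistant_text)
    if call is not None:
        # Tool call: the environment, not the model, produces these tokens.
        total = sum(int(term) for term in call.group(1).split("+") if term)
        return f"={total}#", 0.0, False
    # No tool call: treat the turn as the final answer and verify it.
    answer = re.search(r"(\d+)#", assistant_text)
    reward = 1.0 if answer is not None and answer.group(1) == expected else 0.0
    return "", reward, True


turn1 = calculator_step("let me compute   c(123+456+789)#", expected="1368")
turn2 = calculator_step("the answer is 1368#", expected="1368")
```

Here `turn1` carries the tool output that gets injected into the token stream, and `turn2` carries the terminal reward.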

The trajectory is one continuous token stream. From the model's perspective it has emitted some tokens and "observed" others. The framework's job is to record per-token old_logp on tokens the model emitted and mark the rest as not-trainable.

The one-line key idea

Why tool-output tokens get response_mask = 0
We want the policy to learn when to call a tool, not to imitate the tool's output. Masking tool tokens makes the loss include exactly the assistant-generated tokens — the gradient flows only through what the model could have done differently.

Without this mask, two failure modes appear:

  1. Tool mimicry. The model learns to guess the tool output from the query alone — defeating the point of having a tool, and failing to generalize to inputs the model didn't see during training.
  2. Gradient pollution. Tool-output tokens carry the same advantage as the rest of the trajectory but the model didn't produce them — moving their log-probabilities is just noise on the gradient.
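Both failure modes come from the same place: without the mask, tool tokens receive a nonzero gradient weight. A minimal sketch, assuming a REINFORCE-style loss of the form -mean(advantage * logp) over the scored tokens (the function name and toy lengths are illustrative, not taken from the framework):

```python
def logp_grad_weights(advantage, mask, use_mask=True):
    """dLoss/dlogp_t for the loss -mean_t(advantage * logp_t) over scored tokens."""
    scored = mask if use_mask else [1.0] * len(mask)
    n_scored = max(sum(scored), 1.0)  # guard against an all-zero mask
    return [-advantage * m / n_scored for m in scored]


# Toy lengths: 3 assistant tokens, 2 tool tokens, 2 assistant tokens.
mask = [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
masked = logp_grad_weights(advantage=1.0, mask=mask, use_mask=True)
unmasked = logp_grad_weights(advantage=1.0, mask=mask, use_mask=False)
```

With the mask on, positions 3 and 4 get zero weight and the five assistant tokens share the update. With it off, the two tool tokens are pushed just as hard as everything else, which is the "imitate the tool" signal described above.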

The fix is one line in agent_loop.MultiTurnRolloutEngine._one_trajectory:

response_mask.extend([0.0] * len(obs_ids))    # ← the key line: tool tokens, not trainable
old_logp     .extend([0.0] * len(obs_ids))    #    logp of tool tokens is undefined anyway
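To show where those two lines sit, here is a hedged sketch of a multi-turn rollout loop. It is not the framework's `_one_trajectory`: `rollout`, `scripted_generate`, and `toy_env_step` are hypothetical stand-ins, tokens are character codes, and the "policy" just replays a script with a constant log-prob.

```python
def rollout(prompt_ids, generate, env_step, max_turns=4):
    """Build one token stream plus a per-token mask and old log-probs."""
    ids = list(prompt_ids)
    response_ids, response_mask, old_logp = [], [], []
    reward_total = 0.0
    for _turn in range(max_turns):
        gen_ids, gen_logp = generate(ids)            # sampled by the policy
        ids.extend(gen_ids)
        response_ids.extend(gen_ids)
        response_mask.extend([1.0] * len(gen_ids))   # trainable
        old_logp.extend(gen_logp)
        obs_ids, reward, done = env_step(gen_ids)
        reward_total += reward
        if done:
            break
        ids.extend(obs_ids)
        response_ids.extend(obs_ids)
        response_mask.extend([0.0] * len(obs_ids))   # injected, not trainable
        old_logp.extend([0.0] * len(obs_ids))        # placeholder, never used
    return response_ids, response_mask, old_logp, reward_total


script = iter(["c(123+456+789)#", "the answer is 1368#"])


def scripted_generate(ids):
    # Stand-in for sampling: replay the next scripted turn with logp = -0.5.
    text = next(script)
    return [ord(ch) for ch in text], [-0.5] * len(text)


def toy_env_step(gen_ids):
    text = "".join(chr(i) for i in gen_ids)
    if text.startswith("c("):
        return [ord(ch) for ch in "=1368#"], 0.0, False
    return [], (1.0 if "1368" in text else 0.0), True


tokens, mask, logp, reward = rollout(
    [ord(ch) for ch in "q:"], scripted_generate, toy_env_step
)
```

The three lists stay the same length by construction, so downstream code can multiply them elementwise without knowing where the turn boundaries are.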

Interactive · the trajectory and its mask

Below is a complete multi-turn trajectory on the calculator-arithmetic task. Each tile is one token, colored by who emitted it. Toggle masking and watch the loss change — when masking is off, the model is being trained to imitate =1368#.

[Interactive widget: multi-turn trajectory · tokens, mask, contribution to loss. Blue = assistant (mask=1, contributes to loss); orange = tool output (mask=0, ignored). Rows: prompt; turn 0 · assistant (sampled, mask=1); turn 1 · env tool output (injected, mask=0); turn 2 · assistant (sampled, mask=1); terminal reward. Readouts: tokens scored, assistant tokens, tool tokens, expected learning.]

What the rest of the framework didn't have to change

This is the prize. Going from single-turn to agentic RL required:

That's it. Everything else is unchanged:

This is the framework's value proposition delivering on a real workload: add a new task shape, and the algorithms you already have just work.
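One way to see why the algorithm side is untouched: a group-relative advantage in the style of GRPO depends only on each trajectory's scalar reward, so the turn structure never enters the computation. The sketch below is an assumption-laden illustration, not the framework's code; the function names are hypothetical, and it uses the population standard deviation where real implementations differ.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize scalar rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def per_token_advantages(advantage, response_mask):
    """Broadcast one trajectory's advantage over its tokens; tool tokens get 0."""
    return [advantage * m for m in response_mask]


rewards = [1.0, 0.0, 0.0, 1.0]            # four multi-turn rollouts, one prompt
advantages = group_advantages(rewards)
token_adv = per_token_advantages(advantages[0], [1.0, 1.0, 0.0, 0.0, 1.0])
```

The only place the multi-turn shape shows up is the final multiplication by `response_mask`.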

One subtlety: per-turn rewards

The single-turn case has one terminal reward. Multi-turn allows per-turn rewards: maybe the environment gives partial credit for an intermediate observation, or penalizes a malformed tool call. In our toy task the only reward comes at the terminal turn, but MultiTurnRolloutEngine accumulates rewards across turns into traj.reward:

# Excerpt: ids, info, env, and cfg are set up before the loop.
reward_total = 0.0
for _turn in range(cfg.max_turns):
    gen_ids, gen_logp = self._generate_assistant(ids)
    obs_text, reward, done, info = env.step(decode(gen_ids), info)
    reward_total += reward          # ← accumulated across turns
    if done:
        break
return Trajectory(..., reward=reward_total)

For dense per-turn rewards (e.g., from a process reward model) you'd want a per-token reward field instead of a scalar, plus a different advantage estimator (GAE-style) that does temporal credit assignment. That's a one-field-on-Trajectory change and a new Algorithm subclass; the rest of the framework is unchanged. Same pattern.
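As a rough illustration of what a per-token reward field buys you, the sketch below computes a discounted reward-to-go at every token. This is a simpler stand-in for GAE, not GAE itself (there is no value baseline), and `reward_to_go`, the reward placement, and the choice to discount across tool tokens are all assumptions made for the example.

```python
def reward_to_go(token_rewards, response_mask, gamma=1.0):
    """Discounted return at each token; tool tokens are zeroed by the mask."""
    returns = [0.0] * len(token_rewards)
    running = 0.0
    for t in reversed(range(len(token_rewards))):
        running = token_rewards[t] + gamma * running
        returns[t] = running * response_mask[t]
    return returns


# Turn 0 (3 tokens) ends with a -0.5 penalty, e.g. a malformed tool call;
# turn 1 (2 tool tokens) is injected; turn 2 (2 tokens) ends with reward 1.0.
token_rewards = [0.0, 0.0, -0.5, 0.0, 0.0, 0.0, 1.0]
response_mask = [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
returns = reward_to_go(token_rewards, response_mask)
```

Tokens in turn 0 see the penalty plus the later reward, tokens in turn 2 see only the terminal reward, and the mask still decides which positions count.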

Why the mask "belongs to" the trajectory, not the trainer

A design question worth answering: why does the rollout engine emit a per-token mask, instead of the algorithm or the trainer figuring out which tokens are assistant tokens?

Because only the rollout engine knows. By the time the trajectory reaches the trainer, all tokens look the same — a tensor of ids. The fact that token 47 was generated by the policy and token 48 came from env.step is information that exists at sampling time and would be expensive (and bug-prone) to reconstruct later.

Storing the mask on the trajectory at sampling time is the canonical solution. Every algorithm that needs to be "correct" on agentic data simply multiplies by the mask everywhere — exactly what algorithm.py's _masked_mean already does.
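A masked mean of this kind can be sketched in a few lines. This is a plain-Python illustration of the idea, not the actual `_masked_mean` in algorithm.py; the name `masked_mean`, the `eps` guard, and the toy numbers are assumptions.

```python
def masked_mean(values, mask, eps=1e-8):
    """Average values over positions where mask == 1; 0.0 if nothing is scored."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / max(count, eps)


per_token_loss = [0.2, 0.4, 9.0, 9.0, 0.6]   # positions 2 and 3 are tool tokens
mask = [1.0, 1.0, 0.0, 0.0, 1.0]

scored = masked_mean(per_token_loss, mask)            # assistant tokens only
naive = sum(per_token_loss) / len(per_token_loss)     # tool tokens dominate
```

Dividing by the number of scored tokens rather than the sequence length keeps the loss scale independent of how much tool output a trajectory happens to contain.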

Reading the code

Two files implement the agentic surface:

The whole agentic addition is ~200 lines. Most of it is bookkeeping around the per-token mask.

Where the framework ends and the next steps begin

The eight lessons cover the full surface of the framework: six roles plus the orchestrator plus the agentic extension. Beyond this, the natural next questions are:

Each of these is a single-file change because of the role boundaries, which is the framework's whole value proposition. The neighboring folders flesh out specific pieces: rl_sota/ for the algorithm space, vllm/ for the rollout backend, gpt_mini/ for the full SFT → CoT → DPO → RLVR pipeline.

Takeaway
Agentic RL is single-turn RL plus one bit of bookkeeping: which tokens did the model produce, and which did the environment inject? Get the mask right and every existing algorithm — REINFORCE, GRPO, DAPO, Dr.GRPO — just works on multi-turn trajectories without modification.

You've now seen the complete framework. Open RL/framework/ — every file should read as a single role with a single responsibility, and the controller's seven-step loop should feel inevitable. Part II of the series (lessons 09–14) drills into the algorithms that live inside algorithm.py — start with REINFORCE to see what the loss can actually be.