
RL infrastructure engineer — what the role actually is

A modern post-training RL loop is the most expensive, most fragile, most under-tooled software you'll ever own. The engineer who builds it sits between research, systems, and ops, and is the only person who knows why a step is slow today. This lesson lays out the role from first principles.

What you're actually optimizing

Before listing duties, derive the objective. A post-training RL run repeats a single loop:

for step in range(N_steps):
    prompts   ← sample batch
    rollouts  ← INFERENCE_ENGINE.generate(policy, prompts)
    rewards   ← ENVIRONMENT.score(rollouts)
    logπ_θ    ← TRAINER.forward(policy, rollouts)
    logπ_ref  ← REF_MODEL.forward(rollouts)              ─┐ KL anchor
    advantage = compute_advantage(rewards, logπ_*)         │
    loss      = PPO/GRPO/RLOO surrogate                    │
    TRAINER.backward(loss); optimizer.step()
    WEIGHT_SYNC.push(policy → INFERENCE_ENGINE)

Six distinct compute roles, four communication paths, and one shared piece of state (the model weights) that lives in two formats simultaneously. The number you minimize is seconds per RL step, subject to fixed correctness and stability. Every duty below traces back to speed, to correctness and stability, or to both.
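Since the objective is seconds per step, a per-phase timer is the first instrument worth having. The sketch below is a minimal, framework-free version; the class name `StepTimer` and the phase labels are illustrative, and the `time.sleep` calls stand in for real work. On GPUs the work is launched asynchronously, so a real version would need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulates wall-clock seconds per named phase of an RL step."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the phase raised.
            self.seconds[name] += time.perf_counter() - start

    def shares(self):
        """Return each phase's fraction of the total measured time."""
        total = sum(self.seconds.values())
        if total == 0:
            return {}
        return {name: secs / total for name, secs in self.seconds.items()}


timer = StepTimer()
with timer.phase("rollout"):
    time.sleep(0.02)    # stand-in for INFERENCE_ENGINE.generate(...)
with timer.phase("train"):
    time.sleep(0.005)   # stand-in for forward + backward + optimizer.step()
print(timer.shares())
```

Wrapping every line of the loop in its own phase gives the per-step breakdown that every later decision in this lesson depends on.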

First-principles framing
An RL step is dominated by rollout time. For a 7B model generating 1024 tokens × 64 prompts × 16 rollouts, the engine decodes ~1M tokens (1,024 sequences × 1,024 positions), each needing its own forward pass through the policy, while training makes one forward and one backward pass over those same sequences. If the rollout engine isn't >5× faster per token than the trainer, rollout takes more than 80% of step time. This single ratio decides the architecture of the entire system.
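The token count in that example is plain arithmetic, and a two-phase toy model shows how the rollout share moves with throughput. The sketch below is illustrative only: `rollout_share` and both throughput figures are hypothetical, and the model ignores the reference forward, reward scoring, and weight sync.

```python
# Worked numbers for the example above: 64 prompts, 16 rollouts each, 1024 tokens.
prompts, rollouts_per_prompt, gen_tokens = 64, 16, 1024

sequences = prompts * rollouts_per_prompt      # 1,024 sequences in flight
generated_tokens = sequences * gen_tokens      # 1,048,576 tokens to decode
sequential_decode_steps = gen_tokens           # positions are decoded one after another


def rollout_share(tokens, rollout_tok_per_s, trainer_tok_per_s):
    """Fraction of step time spent in rollout under a two-phase toy model.

    Both throughputs are hypothetical end-to-end rates (tokens per second)
    for the rollout phase and for the trainer's forward + backward phase.
    """
    t_rollout = tokens / rollout_tok_per_s
    t_train = tokens / trainer_tok_per_s
    return t_rollout / (t_rollout + t_train)


print(generated_tokens)
# Hypothetical rates: the trainer processes tokens 4x faster than the engine decodes them.
print(rollout_share(generated_tokens, 20_000, 80_000))
```

Plugging in measured rates from your own cluster turns this into a quick check of which phase deserves attention first.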

The two duties you were asked about — yes

The two duties in the prompt are exactly the headline activities. Restated honestly:

| Duty (as written) | What it actually means | Time share |
| --- | --- | --- |
| Improve and support RL infra for agentic use | Extend the rollout loop from single-turn ("prompt → completion") to multi-turn agent trajectories with tool calls, structured outputs, branchy environments, per-turn rewards, and tool-output masking. Build the environment/verifier substrate (sandboxed code execution, deterministic graders, async tool I/O). Wire it into the trainer's loss masking so gradients only flow through model-generated tokens, never tool outputs. See 08_agentic, 18_environments. | ~30–40% |
| Develop new kernels to speed up rollout generation and training | Two distinct kernel surfaces. Rollout-side: attention variants (FlashAttention, FlashInfer, paged), prefix caching, speculative decoding, GQA/MQA, KV quantization, chunked prefill. Training-side: chunked cross-entropy, fused PPO ratio+clip, fused KL estimators, sequence packing (varlen attention), low-precision (bf16/fp8) gradients. Plus the bridge: log-prob compute kernels that produce byte-identical values across rollout and trainer. The next lesson drills into this. | ~25–35% |

So yes — these are core. They're also incomplete. The role spans five layers, only two of which are listed above.

The five layers of the job

| # | Layer | What lives here | Who else cares |
| --- | --- | --- | --- |
| 1 | Framework | The control flow above. Rollout engine integration (vLLM, SGLang). Trainer integration (FSDP, Megatron-LM, DeepSpeed, TorchTitan). Algorithm library: PPO, GRPO, RLOO, DAPO, Dr.GRPO, DPO. Controller / orchestrator. Logging, checkpointing, resume. In 2026 most teams build on top of an existing post-train framework (verl, OpenRLHF, NeMo-Aligner, AReaL, TRL) and extend it rather than starting from scratch. Knowing what each is good for (verl: scalable production; OpenRLHF: clean RLHF reference; AReaL: async; NeMo-Aligner: Megatron-native; TRL: HF-ecosystem default) is part of the job. | Research |
| 2 | Environment | Verifiers (math grader, code runner, unit-test harness). Tool calling protocol (function schemas, parsers). Sandboxes (firejail / gVisor / Docker-in-Docker for arbitrary code). Reward computation pipeline (often per-trajectory, not per-token). Multi-turn state management. Lesson 18 covers this. | Research, Safety |
| 3 | Performance | The role owns: which kernels exist in the stack, what parallelism the cluster uses (TP/PP/DP/EP for trainer, TP/replication for rollout), and the SLA on step time. See 19_topology, 22_memory, 28_bottlenecks. | Kernel/SysML |
| 4 | Correctness | Log-prob match between rollout engine and trainer (silent killer). Off-policy correction. Numerical stability (advantage normalization, KL clipping, gradient clipping). Reproducibility (seeded sampling, deterministic mode). Tokenizer consistency from SFT → RL. Loss-mask bugs in agentic settings. | Research |
| 5 | Operations | Long runs (days to weeks) fail. Need: per-component health checks, retry logic, partial checkpoint recovery, reward-hacking detection, divergence dashboards, throughput regression alarms. In frontier 2026 setups, rollers and trainers usually share the same GPU generation (H100/H200/B200) with mode-swap and reshard at sync; the older "rollers on cheap A100s, trainer on H100" pattern is mostly seen only in academic / small-scale runs because the cross-fabric weight transfer eats the GPU-cost saving. | Infra, SRE |
The role failure mode
Junior infra engineers focus only on layer 3 (kernels) because it's the most measurable. Senior engineers spend 40%+ on layer 4 (correctness) because a correctness bug invalidates every layer-3 win. The most expensive RL bug ever shipped was a log-prob mismatch that silently biased gradients for three weeks of training — discovered when researchers noticed reward going down after every weight sync.

The four communication paths — each is its own engineering problem

Pull the loop diagram apart. The compute roles (rollout / trainer / ref / reward) exchange data along four paths:

| Path | Direction | Payload | Volume | Failure mode |
| --- | --- | --- | --- | --- |
| Rollouts → trainer | inference engine → trainer | token IDs + sampled logprobs + advantage tags | K · B · T tokens per step (typically from ~100 MB into the GB range) | Logprobs from rollout engine don't match trainer's recompute → biased importance ratio. |
| Rollouts → reward | inference engine → environment | generated trajectories | same as above | Verifier hangs / OOMs on adversarial outputs; rewards return slow → bubble in pipeline. |
| Trainer → ref | shared or separate forward | token IDs to score | same as above, no grads | Ref-model weights drift (if you accidentally update them); ref forward dominates if it's same-size and not amortized. |
| Trainer → rollouts (weight sync) | parameter broadcast | fp32 master → bf16 (or fp8) inference dtype | full model size (140 GB for 70B bf16) per sync | Sync timing: starting a rollout while sync is incomplete → policy mismatch → off-policy training. |
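The weight-sync volume in the last row is easy to check by hand, and the same arithmetic gives a rough lower bound on transfer time. In the sketch below the helper names are illustrative and the 50 GB/s link rate is a hypothetical placeholder; real syncs are usually sharded across many links in parallel, so a single-link figure is pessimistic.

```python
def sync_payload_gb(n_params, bytes_per_param):
    """Size of one full weight push in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


def sync_seconds(payload_gb, link_gb_per_s):
    """Lower bound on transfer time over a single link of the given rate."""
    return payload_gb / link_gb_per_s


bf16_payload = sync_payload_gb(70e9, 2)   # 70B params x 2 bytes = 140.0 GB
fp8_payload = sync_payload_gb(70e9, 1)    # 70B params x 1 byte  =  70.0 GB

# Hypothetical link rate; substitute the measured bandwidth of your fabric.
assumed_link_gb_per_s = 50.0
print(bf16_payload, sync_seconds(bf16_payload, assumed_link_gb_per_s))
print(fp8_payload, sync_seconds(fp8_payload, assumed_link_gb_per_s))
```

Comparing that bound to the measured step time shows quickly whether sync can be paid every step or has to be overlapped.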

The weight-sync path is the load-bearing one. It controls whether your loop is colocated, disaggregated, or async — three architectures with very different cost profiles. Lesson 06 derives the trade-offs; the engineer's job is making whichever architecture was picked actually work.

Three reference architectures the engineer must support

Colocated (same GPUs): GPU 0..7 run rollout + trainer and swap mode each phase.
  + cheap sync
  - mode switch overhead
Disaggregated (separate GPU pools): ROLLERS (cheap) ◀─ sync (100ms) ─ TRAINER (H100s).
  + scale independently
  - sync over fabric
Async / off-policy (separate, decoupled): ROLLERS (cheap) sync every N steps; TRAINER (H100s) steps every step.
  + pipeline throughput
  - off-policy correction needed (PPO clipping must hold)

| Architecture | Sync cost | GPU efficiency | Off-policy? | Where it shines |
| --- | --- | --- | --- | --- |
| Colocated | ~free (same NVLink/HBM) | Bad: the GPUs alternate roles, so the trainer side idles during rollout and the rollout side idles during the training step | ~On-policy | Small clusters (≤8 GPUs); research iteration |
| Disaggregated | Inter-node bandwidth (IB / Ethernet) | Good: each pool sized for its role | On-policy if synchronous | Production at scale |
| Async / off-policy | Hidden by overlap | Best throughput | Significantly off-policy | Frontier training; only justified when other modes saturate |

The engineer's choice here is the single biggest architectural decision in the entire framework. It dictates: weight-sync kernel design, scheduler logic, off-policy correction, debug tooling, and cluster cost.

What "agentic" actually changes

The single-turn RL recipe assumes one prompt → one completion → one reward. Agentic RL breaks every part of that flow. Concretely, what you must build:

  1. Multi-turn rollout state machine. Each rollout is now a trajectory of (assistant_msg, tool_call, tool_response, …). The rollout engine must (a) accept partial trajectories, (b) inject tool outputs at the right position, (c) handle branchy structure when tools error.
  2. Tool sandbox. Code execution, web fetch, search, calculator. Each tool has its own latency (often tens to hundreds of milliseconds) and failure modes, so the rollout pipeline must be async to overlap tool I/O with other rollouts' generation.
  3. Loss masking, per-token. The trainer should compute gradients only over tokens the model generated, not over tool outputs (which the model didn't produce) or system prompts. A wrong mask = "learn to mimic the tool output", which is catastrophic.
  4. Reward shape. Sparse (only at trajectory end) or per-step? For long trajectories, sparse rewards have high variance — process reward models (PRMs, lesson 17) help.
  5. Variable-length packing. Trajectories have wildly different lengths (200–10,000 tokens). Padding wastes compute; packing requires varlen attention.
  6. Trajectory deduplication. K rollouts often share a long prefix (the system prompt, few-shot examples, the user task). Prefix caching is mandatory: KV cache reuse cuts rollout cost by 3–10× depending on prefix length.
  7. Per-turn KV cache management. Across the turns of a single trajectory, the KV cache must persist; across rollouts it must be released. KV-cache memory scales with the product batch × max-turns × max-tokens-per-turn, so doubling any one factor doubles the footprint.
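Item 5 (variable-length packing) can be sketched without any GPU code. The example below uses first-fit-decreasing bin packing, which is one common heuristic and not necessarily what a given framework does; the function names are illustrative. It also builds the cumulative sequence offsets that varlen attention kernels typically take to locate sequence boundaries inside a packed row.

```python
from itertools import accumulate


def pack_first_fit(lengths, max_tokens):
    """Greedy first-fit-decreasing packing of sequences into token-budgeted bins.

    Returns a list of bins; each bin is a list of indices into `lengths`.
    """
    bins, loads = [], []
    # Place the longest sequences first; short ones then fill the gaps.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        if n > max_tokens:
            raise ValueError(f"sequence {idx} has {n} tokens, budget is {max_tokens}")
        for b, load in enumerate(loads):
            if load + n <= max_tokens:
                bins[b].append(idx)
                loads[b] += n
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([idx])
            loads.append(n)
    return bins


def cumulative_offsets(lengths, bin_indices):
    """Start offset of each sequence in a packed bin, plus the final end offset."""
    return [0, *accumulate(lengths[i] for i in bin_indices)]


lengths = [200, 10_000, 3_000, 7_000, 500]
bins = pack_first_fit(lengths, max_tokens=10_000)
padded_cost = len(lengths) * 10_000   # one padded row per trajectory
packed_cost = len(bins) * 10_000      # one row per bin
print(bins, padded_cost, packed_cost)
print(cumulative_offsets(lengths, bins[1]))
```

With these five lengths, padding processes 50,000 token slots for 20,700 real tokens, while packing needs 30,000.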
The agentic-RL bug you'll meet
Tool outputs accidentally included in the policy gradient. Symptom: the model "learns" to emit fake tool outputs verbatim. Diagnosis: print the loss mask for a few trajectories, confirm 0/1 pattern matches assistant tokens exactly. Trigger: tokenizer differs between trainer and tool-output formatter (e.g., trainer adds BOS, formatter doesn't). The fix is a half-day; the bug eats a week of training.
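The diagnosis step above ("print the loss mask") is simple to make concrete. The sketch below assumes a trajectory is stored as a list of `(role, token_ids)` segments; that layout, the role names, and the token IDs are all hypothetical. The explicit length check is there because a stray BOS token on one side shows up first as a length mismatch. In a real causal-LM trainer the mask must also be shifted consistently with the next-token targets.

```python
def build_loss_mask(segments):
    """Return 1 for tokens the model generated and 0 for everything else.

    `segments` is a list of (role, token_ids) pairs in trajectory order.
    """
    mask = []
    for role, token_ids in segments:
        keep = 1 if role == "assistant" else 0
        mask.extend([keep] * len(token_ids))
    return mask


def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked tokens only."""
    if len(per_token_loss) != len(mask):
        raise ValueError("loss and mask lengths differ; check BOS/EOS handling")
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept


trajectory = [
    ("system", [1, 2]),
    ("assistant", [3, 4, 5]),   # model-generated tool call
    ("tool", [6, 7]),           # tool output: must not receive gradient
    ("assistant", [8]),
]
mask = build_loss_mask(trajectory)
print(mask)
print(masked_mean([9.0, 9.0, 1.0, 2.0, 3.0, 9.0, 9.0, 2.0], mask))
```

Printing the mask next to the decoded tokens for a handful of trajectories is usually enough to spot a misaligned boundary.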

The correctness traps that distinguish a senior

| Bug | Symptom | First-principles diagnosis |
| --- | --- | --- |
| Log-prob mismatch | Reward goes down right after each weight sync; PPO ratio histogram shows π/π_old skewed. | Rollout engine and trainer use different attention kernels → slightly different logprobs. Importance ratio is biased. Either: recompute logprobs on the trainer at rollout time (slow, correct), or guarantee the engines produce byte-identical logprobs (hard, fast). See next lesson. |
| Off-by-one in advantage | Loss looks healthy; rewards plateau early. | Advantage is computed for token t using reward at trajectory level, but you assigned it to position t-1 or t+1. The shift means the gradient pushes the wrong token. Spot via single-token sanity check. |
| Reference model drift | KL anchor gradually loses meaning; loss spikes during decay schedule. | Ref params accidentally seeing optimizer updates: forgotten requires_grad=False, ref params in the trainer's parameter group, or shared parameter tying with the policy. (FSDP sharding ref params is fine; what matters is whether they're in the optimizer's param list.) KL → 0 prematurely, then explodes when reference catches up. |
| Tokenizer drift | Loss explodes ~step 1 of RL on a freshly SFT'd model. | SFT model used tokenizer A; RL pipeline initialized with tokenizer B (different special tokens or added vocabulary). Token IDs no longer index the embedding rows the model was trained with. |
| Stale rollouts in async | π_old is from 5 steps ago; clipping fraction is 80% → almost no learning. | Async architecture without bounded staleness. Fix: cap staleness (drop rollouts older than N steps), or reduce gap between sync events. |
| EOS / stop-token asymmetry | Trainer's logprob over the rollout includes / excludes the EOS in a way the rollout engine doesn't match → ratios drift on the last token and a small number of "phantom" tokens past EOS. | Rollout engine truncates at EOS or at any stop-string and may or may not include the EOS token in the returned sequence. Trainer must reconstruct exactly the same sequence (incl. EOS) and mask the same positions. Equal in frequency to tokenizer drift; less famous. |
| Reward hacking | Reward goes up; quality (as judged by a held-out human eval) goes down. | The reward signal is misspecified. The model found a shortcut. Mitigations: KL anchor (which you already have), reward-model robustness, regularly evaluate on a held-out grader. |
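For the first row, a cheap diagnostic is to score the same tokens under the same weights in both engines and look at the implied importance ratio, which should be 1 everywhere. The sketch below shows only the comparison; `ratio_stats` is an illustrative name, and how you extract per-token logprobs from each engine is left open. Any tolerance you alert on is a choice you would have to calibrate yourself.

```python
import math


def ratio_stats(rollout_logprobs, trainer_logprobs):
    """Summarize the mismatch between two logprob traces of the same tokens."""
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("traces cover different token counts; check EOS handling")
    diffs = [t - r for r, t in zip(rollout_logprobs, trainer_logprobs)]
    ratios = [math.exp(d) for d in diffs]   # trainer prob / rollout prob per token
    return {
        "max_abs_logprob_diff": max(abs(d) for d in diffs),
        "mean_ratio": sum(ratios) / len(ratios),
        "min_ratio": min(ratios),
        "max_ratio": max(ratios),
    }


# Made-up traces: the second token disagrees by 0.1 nats.
print(ratio_stats([-1.0, -2.0, -0.5], [-1.0, -1.9, -0.5]))
```

Running this right after a weight sync, before any optimizer step, separates a kernel mismatch from genuine policy movement.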

Performance & operations — pointers to where these live

Two cross-cutting concerns, performance and operations, belong to the engineer but get their own depth lessons.

This lesson is about ownership: the boundaries of the role and the layers of the system the engineer is responsible for. Lesson 28 is about action: what an engineer does when the run is slow or wrong.

The skill matrix — what the role hires for

| Skill | Why it matters | What "good" looks like |
| --- | --- | --- |
| PyTorch internals | You'll patch FSDP, write custom autograd.Function, debug strided-tensor surprises. | Can explain dispatcher → autograd → kernel launch path; has written a custom Function with hand-derived backward. |
| CUDA / Triton | The kernels you'll write or port don't exist yet at the level you need. See next lesson. | Comfortable with FlashAttention internals; has written a fused kernel in Triton; can read an Nsight profile. |
| Distributed systems | The four communication paths above are network operations. | Can diagnose an NCCL hang; understands NVLink / IB / PCIe bandwidth differences; has tuned a collective. |
| RL algorithms | You'll bias-debug. PPO clipping behaviour, KL estimator choice, advantage normalization, importance-sampling instability: without algorithm fluency you can't tell a bug from a feature. | Can derive PPO surrogate; can explain why Dr.GRPO removes GRPO's per-response length and standard-deviation normalization; knows the k1/k2/k3 KL estimators. |
| Inference engine internals | vLLM and SGLang are the rollout layer. You'll integrate, extend, or replace them. | Can read vLLM's scheduler; has implemented a prefix-cache feature; understands paged-attention block tables. |
| Numerical analysis | Log-prob mismatch, advantage stability, KL clipping, low-precision gradients all break on numerical edges. | Knows when fp16 underflows; can explain why a fused log_softmax avoids the underflow that a separate softmax followed by log runs into; understands gradient-clipping interaction with Adam's v̂. |
| Profiling | "It's slow today" is the most common ticket. You need to find the cause without restarting. | Lives in Nsight, py-spy, DCGM. Has a default profiling script for every component. |
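The k1/k2/k3 estimators named in the RL algorithms row fit in a few lines. The sketch below follows the commonly cited definitions for a sample drawn from the policy, with r = π_ref(x) / π_θ(x); the function name and the scalar interface are illustrative. k1 is unbiased but can go negative on a single sample, k2 is biased but always non-negative, and k3 is both unbiased and non-negative.

```python
import math


def kl_estimators(logp_policy, logp_ref):
    """Per-sample k1, k2, k3 estimates of KL(policy || ref).

    The sample whose logprobs are passed in must have been drawn from the policy.
    """
    log_r = logp_ref - logp_policy      # log of pi_ref(x) / pi_theta(x)
    r = math.exp(log_r)
    k1 = -log_r                         # unbiased, high variance, can be negative
    k2 = 0.5 * log_r ** 2               # biased, low variance, never negative
    k3 = (r - 1.0) - log_r              # unbiased, never negative
    return k1, k2, k3


# A token the policy likes more than the reference does ...
print(kl_estimators(logp_policy=-1.0, logp_ref=-1.5))
# ... and one it likes less: k1 turns negative, k3 stays positive.
print(kl_estimators(logp_policy=-1.5, logp_ref=-1.0))
```

Averaging any of the three over many sampled tokens gives the KL term that enters the loss; which one a framework uses changes the variance of that term, so it is worth checking when comparing runs.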

The role's distinguishing question

If you can answer this question end-to-end on a whiteboard, you can do the job:

The whiteboard question
"You're given a 70B base model and a 1000-prompt math benchmark. Budget: 64 H100s for 7 days. Design the post-training RL pipeline. Walk through: rollout architecture, algorithm choice, weight-sync strategy, environment design, expected throughput, the three things most likely to go wrong, and how you'd diagnose them. End with the kernel you'd write first."

A weak answer names libraries. A strong answer derives every choice from constraints (64 GPUs ÷ ? roles, model dtype, sequence length budget, batch math), names the trade-offs ignored, and identifies the first kernel from a back-of-envelope profile estimate. The kernel work itself is the subject of the next lesson.
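The "derive from constraints" step usually starts with a memory budget. The sketch below assumes the common mixed-precision Adam layout of 16 bytes per parameter (bf16 weights and gradients, fp32 master weights, two fp32 Adam moments); the function names are illustrative, activations and KV cache are deliberately left out, and some frameworks keep fp32 gradients, which would raise the figure.

```python
import math


def training_state_gb(n_params, weight_bytes=2, grad_bytes=2,
                      master_bytes=4, adam_bytes=8):
    """Model-state memory for mixed-precision Adam, excluding activations."""
    per_param = weight_bytes + grad_bytes + master_bytes + adam_bytes
    return n_params * per_param / 1e9


def min_gpus_for_state(state_gb, usable_gb_per_gpu=80):
    """Smallest GPU count whose combined memory holds the sharded model state.

    Pass a lower `usable_gb_per_gpu` to reserve headroom for activations.
    """
    return math.ceil(state_gb / usable_gb_per_gpu)


state = training_state_gb(70e9)          # 70B params x 16 bytes = 1120.0 GB
trainer_floor = min_gpus_for_state(state)
cluster_gb = 64 * 80                     # 64 H100s with 80 GB each
print(state, trainer_floor, cluster_gb - state)
```

Under these assumptions the trainer needs at least 14 of the 64 GPUs for state alone, before any activation memory, which already constrains how many GPUs are left for rollers.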

Interview prompts you should be ready for

  1. "Walk through a single RL step from start to finish. Where does the time go?" (The six-role loop, rollout dominates 60–75%, trainer forward + ref forward another 15–26%. Backward and sync are small but block.)
  2. "Colocated vs disaggregated vs async: which would you pick for a 70B model on 8 H100s?" (Colocated. Inference-only 70B fits on 4 H100s with TP, but the trainer side needs roughly 1 TB for params + grads + Adam fp32 state, plus activations, so a 4-roller + 4-trainer split leaves the trainer with nowhere to put it. With all 8 GPUs shared, you swap modes and reshard at each phase boundary; even then 8 × 80 GB = 640 GB is less than that ~1 TB, so full-parameter training also needs optimizer-state offload or a parameter-efficient method. Disaggregated becomes the right answer around 32+ GPUs, where the trainer pool can be sized properly. The lesson: sizing decides architecture before personal preference does.)
  3. "Your rewards go down right after each weight sync. Diagnose." (Almost always: logprob mismatch between rollout engine and trainer. PPO ratio is biased. Fix: recompute logprobs on the trainer for the importance ratio, or guarantee numerical equivalence between the engines.)
  4. "You see clipping fraction = 90% in PPO. What does it mean?" (π/π_old is way outside the clip range. Possibilities: too-stale rollouts in async, too-large learning rate, sampling temperature mismatch between rollout and training, or a logprob bug. Each has a different fix.)
  5. "How do you handle an environment that takes 500ms per reward call?" (Async pipeline: overlap reward computation with the next rollout step. If the calls run serially and 500ms × batch_size exceeds rollout time, the verifier is on the critical path → batch the verifier, GPU-accelerate it, or use a learned proxy + occasional ground-truth verification.)
  6. "What's the first thing you'd profile on a fresh RL framework?" (Per-step breakdown of rollout / ref forward / trainer forward / backward / sync. Until you know where the time goes, no kernel decision is justified.)
  7. "What's the first kernel you'd write?" (Whatever your profile told you. Typically: prefix-aware paged attention if K-rollouts share long prompts, or chunked cross-entropy if vocab is large and trainer forward is HBM-bound. The next lesson is precisely this question.)
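For prompt 4, it helps to be precise about what "clipping fraction" measures. The sketch below uses the simple definition that is commonly logged, the share of tokens whose ratio π/π_old falls outside [1 − ε, 1 + ε]; the function name is illustrative and ε = 0.2 is a conventional default rather than anything this lesson prescribes. Note that the PPO objective only clips a token when the advantage sign makes the clip binding, so this metric is an upper bound on tokens that actually lose gradient.

```python
import math


def clip_fraction(new_logprobs, old_logprobs, clip_eps=0.2):
    """Share of tokens whose importance ratio lies outside [1 - eps, 1 + eps]."""
    if len(new_logprobs) != len(old_logprobs):
        raise ValueError("logprob traces must cover the same tokens")
    ratios = [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]
    outside = sum(1 for r in ratios if abs(r - 1.0) > clip_eps)
    return outside / len(ratios)


old = [-1.0, -1.0, -1.0, -1.0]
# Ratios of 1.0, 1.5, 0.5 and 1.1: two of the four fall outside the clip range.
new = [-1.0, -1.0 + math.log(1.5), -1.0 + math.log(0.5), -1.0 + math.log(1.1)]
print(clip_fraction(new, old))
```

Tracking this per step, and separately per rollout age in async setups, distinguishes the stale-rollout cause from the learning-rate and logprob-bug causes.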
Takeaway
An RL infra engineer owns five layers: framework, environment, performance, correctness, operations. The two duties in the prompt — agentic infra and rollout/training kernels — are accurate but partial. The role is defined by the four communication paths between rollout, trainer, reference, and reward, each of which is its own engineering surface. The senior signal is being able to derive each architectural choice from constraints (model size, fabric, latency budget), profile the loop in 60 seconds, and name the correctness traps before they ship. Kernels are the last 30% of speed — measurable, individual, and where the next lesson lives.