RL infrastructure engineer — what the role actually is

A modern post-training RL loop is the most expensive, most fragile, most under-tooled software you'll ever own. The engineer who builds it sits between research, systems, and ops, and is the only person who knows why a step is slow today. This lesson lays out the role from first principles.

What you're actually optimizing

Before listing duties, derive the objective. A post-training RL run repeats a single loop:

┌─────────────────────────────────────────────────────────────────┐
│ for step in range(N_steps):                                     │
│   prompts   ← sample batch                                      │
│   rollouts  ← INFERENCE_ENGINE.generate(policy, prompts)        │
│   rewards   ← ENVIRONMENT.score(rollouts)                       │
│   logπ_θ    ← TRAINER.forward(policy, rollouts)                 │
│   logπ_ref  ← REF_MODEL.forward(rollouts)        ─┐ KL anchor   │
│   advantage = compute_advantage(rewards, logπ_*)  │             │
│   loss      = PPO/GRPO/RLOO surrogate             │             │
│   TRAINER.backward(loss); optimizer.step()                      │
│   WEIGHT_SYNC.push(policy → INFERENCE_ENGINE)                   │
└─────────────────────────────────────────────────────────────────┘

Six distinct compute roles, four communication paths, one shared piece of state (the model weights) that lives in two formats simultaneously. The number you minimize is seconds per RL step at fixed correctness and stability. Every duty below traces back to step time, to correctness and stability, or to both.
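To make the shared-state point concrete, here is a minimal runnable sketch of the loop with toy stand-ins. `ToyEngine`, `ToyTrainer`, `score`, and `run` are hypothetical names invented for illustration, and "weights" are reduced to a version counter. The only thing it shows is that the engine and the trainer each hold their own copy of the policy, and that the sync at the end of a step is what keeps them aligned.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float = 0.0


class ToyEngine:
    """Stand-in for the inference engine; holds its own copy of the weights."""

    def __init__(self):
        self.weights_version = 0

    def generate(self, prompts):
        return [Rollout(p, f"answer-v{self.weights_version}") for p in prompts]


class ToyTrainer:
    """Stand-in for the trainer; a 'training step' just bumps a version counter."""

    def __init__(self):
        self.weights_version = 0

    def step(self, rollouts):
        self.weights_version += 1


def score(rollouts):
    # Hypothetical environment: assigns a random reward in [0, 1).
    for r in rollouts:
        r.reward = random.random()


def run(n_steps, prompts):
    engine, trainer = ToyEngine(), ToyTrainer()
    versions_used_for_rollout = []
    for _ in range(n_steps):
        rollouts = engine.generate(prompts)                # rollout phase
        score(rollouts)                                    # reward phase
        versions_used_for_rollout.append(engine.weights_version)
        trainer.step(rollouts)                             # forward / backward / optimizer
        engine.weights_version = trainer.weights_version   # weight sync
    return versions_used_for_rollout
```

With the sync line in place, `run(3, ["a", "b"])` generates each step's rollouts with the weights produced by the previous step. Remove the sync line and every rollout comes from version 0, which is the off-policy failure the rest of this lesson keeps returning to.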

First-principles framing

An RL step is dominated by rollout time. For a 7B model generating 1024 tokens × 64 prompts × 16 rollouts, the rollout phase produces 1024 × 64 × 16 ≈ 1M tokens, and each token needs its own autoregressive forward pass through the policy; training covers the same tokens with one batched forward + one backward. If the rollout engine isn't >5× faster per token than the trainer, rollout accounts for more than 80% of step time. This single ratio decides the architecture of the entire system.
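The arithmetic behind that token count, as a small sketch. `rollout_workload` is an illustrative helper, not part of any framework. It only counts work; it says nothing about throughput.

```python
def rollout_workload(prompts, rollouts_per_prompt, gen_tokens):
    """Count the work in one rollout phase for fixed-length generations."""
    sequences = prompts * rollouts_per_prompt
    return {
        "sequences": sequences,
        "generated_tokens": sequences * gen_tokens,
        # Even with every sequence batched together, decode steps are
        # sequential: token t+1 cannot start before token t is sampled.
        "sequential_decode_steps": gen_tokens,
    }


workload = rollout_workload(prompts=64, rollouts_per_prompt=16, gen_tokens=1024)
```

Here `workload["generated_tokens"]` is 1,048,576 across 1,024 sequences. The trainer sees the same tokens, but teacher-forced, so it can process a whole sequence in one pass instead of 1,024 dependent steps.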

The two duties you were asked about — yes

The two duties in the prompt are exactly the headline activities. Restated honestly:

Duty (as written)	What it actually means	Time share
Improve and support RL infra for agentic use	Extend the rollout loop from single-turn ("prompt → completion") to multi-turn agent trajectories with tool calls, structured outputs, branchy environments, per-turn rewards, and tool-output masking. Build the environment/verifier substrate (sandboxed code execution, deterministic graders, async tool I/O). Wire it into the trainer's loss masking so gradients only flow through model-generated tokens, never tool outputs. See 08_agentic, 18_environments.	~30–40%
Develop new kernels to speed up rollout generation and training	Two distinct kernel surfaces. Rollout-side: attention variants (FlashAttention, FlashInfer, paged), prefix caching, speculative decoding, GQA/MQA, KV quantization, chunked prefill. Training-side: chunked cross-entropy, fused PPO ratio+clip, fused KL estimators, sequence packing (varlen attention), low-precision (bf16/fp8) gradients. Plus the bridge: log-prob compute kernels that produce byte-identical values across rollout and trainer. The next lesson (24) drills into this.	~25–35%

So yes — these are core. They're also incomplete. The role spans five layers, only two of which are listed above.

The five layers of the job

#	Layer	What lives here	Who else cares
1	Framework	The control flow above. Rollout engine integration (vLLM, SGLang). Trainer integration (FSDP, Megatron-LM, DeepSpeed, TorchTitan). Algorithm library: PPO, GRPO, RLOO, DAPO, Dr.GRPO, DPO. Controller / orchestrator. Logging, checkpointing, resume. In 2026 most teams build on top of an existing post-train framework (`verl`, `OpenRLHF`, `NeMo-Aligner`, `AReaL`, `TRL`) and extend — not from scratch. Knowing what each is good for (verl: scalable production; OpenRLHF: clean RLHF reference; AReaL: async; NeMo-Aligner: Megatron-native; TRL: HF-ecosystem default) is part of the job.	Research
2	Environment	Verifiers (math grader, code runner, unit-test harness). Tool calling protocol (function schemas, parsers). Sandboxes (firejail / gVisor / Docker-in-Docker for arbitrary code). Reward computation pipeline (often per-trajectory not per-token). Multi-turn state management. Lesson 18 covers this.	Research, Safety
3	Performance	The role owns: which kernels exist in the stack, what parallelism the cluster uses (TP/PP/DP/EP for trainer, TP/replication for rollout), and the SLA on step time. See 19_topology, 22_memory, 28_bottlenecks.	Kernel/SysML
4	Correctness	Log-prob match between rollout engine and trainer (silent killer). Off-policy correction. Numerical stability (advantage normalization, KL clipping, gradient clipping). Reproducibility (seeded sampling, deterministic mode). Tokenizer consistency from SFT → RL. Loss-mask bugs in agentic settings.	Research
5	Operations	Long runs (days to weeks) fail. Need: per-component health checks, retry logic, partial checkpoint recovery, reward-hacking detection, divergence dashboards, throughput regression alarms. In frontier 2026 setups, rollers and trainers usually share the same GPU generation (H100/H200/B200) with mode-swap and reshard at sync — the older "rollers on cheap A100s, trainer on H100" pattern is mostly only seen in academic / small-scale runs because the cross-fabric weight transfer eats the GPU-cost saving.	Infra, SRE
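As one concrete entry from layer 1's algorithm library, here is a minimal sketch of a GRPO-style group-relative advantage: each rollout's reward is centered and scaled by the statistics of the K rollouts sampled for the same prompt. The choice of population standard deviation and the `eps` value are assumptions; implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Advantages for the K rollouts of ONE prompt (one group).

    rewards: list of scalar trajectory rewards, one per rollout.
    eps:     guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std, not sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group where every rollout got the same reward yields all-zero advantages, so it contributes no gradient. That is worth logging: a high fraction of such groups means the batch is doing less learning than its size suggests.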

The role failure mode

Junior infra engineers focus only on layer 3 (kernels) because it's the most measurable. Senior engineers spend 40%+ on layer 4 (correctness) because a correctness bug invalidates every layer-3 win. The representative worst case is a log-prob mismatch that silently biases gradients for three weeks of training and is discovered only when researchers notice reward going down after every weight sync.
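A check for this failure can be very small. The sketch below compares the logprobs the rollout engine reported against the trainer's recompute for the same tokens, before any optimizer step. `logprob_mismatch_report` and its return keys are illustrative names, and what counts as an acceptable `max_abs_diff` is something you would calibrate on your own stack.

```python
import math


def logprob_mismatch_report(rollout_logprobs, trainer_logprobs, mask):
    """Compare per-token logprobs from the two engines on the same tokens.

    mask: 1 for model-generated tokens to compare, 0 for tokens to skip.
    With identical weights and no update in between, the importance ratio
    exp(trainer - rollout) should sit at 1.0 for every compared token.
    """
    if not (len(rollout_logprobs) == len(trainer_logprobs) == len(mask)):
        raise ValueError("all three sequences must have the same length")
    diffs = [t - r for r, t, m in zip(rollout_logprobs, trainer_logprobs, mask) if m]
    if not diffs:
        raise ValueError("mask selects no tokens")
    ratios = [math.exp(d) for d in diffs]
    return {
        "n_tokens": len(diffs),
        "max_abs_diff": max(abs(d) for d in diffs),
        "mean_ratio": sum(ratios) / len(ratios),
    }
```

Run it on a handful of trajectories at step 0 and again right after a weight sync. A `mean_ratio` that drifts away from 1.0 only after syncs points at the sync path; one that is off from the start points at kernel or dtype differences between the engines.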

The four communication paths — each is its own engineering problem

Pull the loop diagram apart. The compute roles (rollout / trainer / ref / reward) exchange data along four paths:

Path	Direction	Payload	Volume	Failure mode
Rollouts → trainer	inference engine → trainer	token IDs + sampled logprobs + advantage tags	K · B · T tokens per step (typically 100MB–GB)	Logprobs from rollout engine don't match trainer's recompute → biased importance ratio.
Rollouts → reward	inference engine → environment	generated trajectories	same as above	Verifier hangs / OOMs on adversarial outputs; rewards return slow → bubble in pipeline.
Trainer → ref	shared or separate forward	token IDs to score	same as above, no grads	Ref-model weights drift (if you accidentally update them); ref forward dominates if it's same-size and not amortized.
Trainer → rollouts (weight sync)	parameter broadcast	fp32 master → bf16 (or fp8) inference dtype	full model size (140 GB for 70B bf16) per sync	Sync timing — start a rollout while sync is incomplete → policy mismatch → off-policy training.

The weight-sync path is the load-bearing one. It controls whether your loop is colocated, disaggregated, or async — three architectures with very different cost profiles. Lesson 06 derives the trade-offs; the engineer's job is making whichever architecture was picked actually work.
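A back-of-envelope for the weight-sync payload, as a sketch. The 50 GB/s figure is a hypothetical effective bandwidth (roughly one 400 Gb/s link), not a measurement. The result is a lower bound that ignores resharding, dtype casts, and any parallelism across links.

```python
def weight_sync_seconds(n_params, bytes_per_param, bandwidth_gb_per_s):
    """Return (payload in GB, lower-bound transfer time in seconds)."""
    payload_gb = n_params * bytes_per_param / 1e9
    return payload_gb, payload_gb / bandwidth_gb_per_s


# 70B parameters in bf16 (2 bytes each) over an assumed 50 GB/s path.
payload_gb, sync_s = weight_sync_seconds(70e9, 2, 50.0)
```

This gives a 140 GB payload and about 2.8 s per sync under the assumed bandwidth. Whether that is negligible or dominant depends entirely on how long a step takes, which is why sync frequency and architecture are decided together.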

Three reference architectures the engineer must support

   Colocated                 Disaggregated            Async / off-policy
  (same GPUs)            (separate GPU pools)        (separate, decoupled)
┌──────────────┐        ┌──────────┐ ┌────────┐     ┌──────────┐ ┌────────┐
│   GPU 0..7   │        │ ROLLERS  │ │TRAINER │     │ ROLLERS  │ │TRAINER │
│              │        │ (cheap)  │ │(H100s) │     │ (cheap)  │ │(H100s) │
│  rollout +   │        │          │ │        │     │    ▲     │ │   │    │
│  trainer     │        │  ◀─sync──│─│        │     │    │     │ │   │    │
│              │        │  100ms   │ │        │     │   sync   │ │  step  │
│  swap mode   │        │          │ │        │     │ every N  │ │ every  │
│  each phase  │        └──────────┘ └────────┘     │  steps   │ │  step  │
└──────────────┘                                    └──────────┘ └────────┘
+ cheap sync            + scale independently       + pipeline throughput
- mode switch overhead  - sync over fabric          - off-policy correction needed
                                                      (PPO clipping must hold)

Architecture	Sync cost	GPU efficiency	Off-policy?	Where it shines
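Colocated	~free (same NVLink/HBM)	Mixed: no GPU waits on another pool, but every GPU pays the mode-switch and reshard overhead at each phase boundary and must hold rollout state and training state in turn	~On-policy	Small clusters (≤8 GPUs); research iteration
Disaggregated	Inter-node bandwidth (IB / Ethernet)	Good when each pool is sized for its role; in strictly synchronous mode one pool waits while the other works	On-policy if synchronous	Production at scale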
Async / off-policy	Hidden by overlap	Best throughput	Significantly off-policy	Frontier training; only justified when other modes saturate

The engineer's choice here is the single biggest architectural decision in the entire framework. It dictates: weight-sync kernel design, scheduler logic, off-policy correction, debug tooling, and cluster cost.
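Sizing is usually the first filter on this decision. Below is a minimal sketch of the memory check, assuming the common mixed-precision accounting of 16 bytes per parameter (bf16 weights and gradients, an fp32 master copy, two fp32 Adam moments) and 80 GB of HBM per GPU. Both numbers are assumptions, activations are ignored, and `training_state_gb` and `fits` are illustrative helpers.

```python
def training_state_gb(n_params, bytes_per_param=16):
    """Params + grads + optimizer state in GB; activations NOT included."""
    return n_params * bytes_per_param / 1e9


def fits(n_params, n_gpus, hbm_gb=80, bytes_per_param=16):
    """Return (GB needed, GB available, whether it fits) for a GPU pool."""
    need = training_state_gb(n_params, bytes_per_param)
    have = n_gpus * hbm_gb
    return need, have, need <= have


# 70B model: full training state on 8 GPUs, and bf16 inference weights on 4.
train_need, train_have, train_ok = fits(70e9, n_gpus=8)
infer_need, infer_have, infer_ok = fits(70e9, n_gpus=4, bytes_per_param=2)
```

Under these assumptions the training state is about 1,120 GB against 640 GB of HBM on 8 GPUs, while the bf16 inference weights (140 GB) fit comfortably on 4. A gap like that forces offload or a smaller optimizer footprint, and it narrows the architecture choice before preference enters.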

What "agentic" actually changes

The single-turn RL recipe assumes one prompt → one completion → one reward. Agentic RL breaks every part of that flow. Concretely, what you must build:

Multi-turn rollout state machine. Each rollout is now a trajectory of (assistant_msg, tool_call, tool_response, …). The rollout engine must (a) accept partial trajectories, (b) inject tool outputs at the right position, (c) handle branchy structure when tools error.
Tool sandbox. Code execution, web fetch, search, calculator. Each tool has its own latency (often tens to hundreds of ms) and its own failure modes, so the rollout pipeline must be async to overlap tool I/O with other rollouts' generation.
Loss masking, per-token. The trainer should compute gradients only over tokens the model generated, not over tool outputs (which the model didn't produce) or system prompts. A wrong mask = "learn to mimic the tool output", which is catastrophic.
Reward shape. Sparse (only at trajectory end) or per-step? For long trajectories, sparse rewards have high variance — process reward models (PRMs, lesson 17) help.
Variable-length packing. Trajectories have wildly different lengths (200–10,000 tokens). Padding wastes compute; packing requires varlen attention.
Trajectory deduplication. K rollouts often share a long shared prefix (the system prompt, few-shot examples, the user task). Prefix caching is non-optional — KV cache reuse cuts rollout cost by 3–10× depending on prefix length.
Per-turn KV cache management. Across the turns of a single trajectory, the KV cache must persist; across rollouts it must be released. Memory pressure scales with the product batch × max-turns × max-tokens-per-turn, so doubling any one factor doubles the KV-cache footprint.
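The async requirement on the tool sandbox can be sketched with the standard library alone. `fake_tool` stands in for a real sandboxed call, and the latencies and timeout are made-up values. The point is only that concurrent tool calls cost roughly the slowest call, not the sum, and that a hung tool becomes a per-call exception instead of stalling the batch.

```python
import asyncio
import time


async def fake_tool(call_id, latency_s):
    """Hypothetical tool call; sleeps to simulate I/O latency."""
    await asyncio.sleep(latency_s)
    return f"result-{call_id}"


async def run_tool_calls(latencies, timeout_s=1.0):
    """Issue all tool calls concurrently, each with its own timeout."""
    calls = [
        asyncio.wait_for(fake_tool(i, lat), timeout=timeout_s)
        for i, lat in enumerate(latencies)
    ]
    # return_exceptions=True: a timed-out call yields an exception object
    # in its slot instead of cancelling the whole batch.
    return await asyncio.gather(*calls, return_exceptions=True)


def timed(latencies, timeout_s=1.0):
    start = time.perf_counter()
    results = asyncio.run(run_tool_calls(latencies, timeout_s))
    return results, time.perf_counter() - start
```

Calling `timed([0.05] * 8)` finishes in roughly the time of one call. In a real pipeline the same structure lets trajectories waiting on tools yield the GPU to trajectories that are ready to decode.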

The agentic-RL bug you'll meet

Tool outputs accidentally included in the policy gradient. Symptom: the model "learns" to emit fake tool outputs verbatim. Diagnosis: print the loss mask for a few trajectories, confirm 0/1 pattern matches assistant tokens exactly. Trigger: tokenizer differs between trainer and tool-output formatter (e.g., trainer adds BOS, formatter doesn't). The fix is a half-day; the bug eats a week of training.
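The diagnosis step above can be turned into an assertion that runs on every batch. This is a sketch under the assumption that trajectories are available as role-tagged token segments; `build_loss_mask` and `check_mask` are hypothetical helpers, and the role strings are placeholders for whatever your chat template uses.

```python
def build_loss_mask(segments):
    """segments: list of (role, token_ids). Returns (tokens, mask).

    mask is 1 only on tokens the model generated (role == "assistant").
    """
    tokens, mask = [], []
    for role, ids in segments:
        tokens.extend(ids)
        mask.extend([1 if role == "assistant" else 0] * len(ids))
    return tokens, mask


def check_mask(segments, tokens, mask):
    """True only if tokens and mask line up with the segments exactly."""
    pos = 0
    for role, ids in segments:
        expected = 1 if role == "assistant" else 0
        end = pos + len(ids)
        if tokens[pos:end] != list(ids):
            return False  # token stream shifted, e.g. an extra BOS
        if any(m != expected for m in mask[pos:end]):
            return False  # mask covers the wrong role
        pos = end
    return pos == len(tokens) == len(mask)
```

If the trainer prepends a BOS token that the formatter did not account for, the token stream shifts by one while the mask stays put, and `check_mask` returns `False` on the first segment. That is the one-position misalignment that lets tool-output tokens leak into the gradient.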

The correctness traps that distinguish a senior

Bug	Symptom	First-principles diagnosis
Log-prob mismatch	Reward goes down right after each weight sync; PPO ratio histogram shows π/π_old skewed.	Rollout engine and trainer use different attention kernels → slightly different logprobs. Importance ratio is biased. Either: recompute the rollout tokens' logprobs on the trainer before the update and use those as π_old (slow, correct), or guarantee the engines produce byte-identical logprobs (hard, fast). See next lesson.
Off-by-one in advantage	Loss looks healthy; rewards plateau early.	Advantage is computed for token t using reward at trajectory level — but you assigned it to position t-1 or t+1. The shift means gradient pushes the wrong token. Spot via single-token sanity check.
Reference model drift	KL anchor gradually loses meaning; loss spikes during decay schedule.	Ref params accidentally seeing optimizer updates: forgotten `requires_grad=False`, ref params in the trainer's parameter group, or shared parameter tying with the policy. (FSDP sharding ref params is fine; what matters is whether they're in the optimizer's param list.) KL → 0 prematurely, then explodes when reference catches up.
Tokenizer drift	Loss explodes ~step 1 of RL on a freshly SFT'd model.	SFT model used tokenizer A; RL pipeline initialized with tokenizer B (different special tokens or added vocabulary). Every embedding lookup is off-by-token.
Stale rollouts in async	π_old is from 5 steps ago; clipping fraction is 80% → almost no learning.	Async architecture without bounded staleness. Fix: cap staleness (drop rollouts older than N steps), or reduce gap between sync events.
EOS / stop-token asymmetry	Trainer's logprob over the rollout includes / excludes the EOS in a way the rollout engine doesn't match → ratios drift on the last token and a small number of "phantom" tokens past EOS.	Rollout engine truncates at EOS or at any stop-string and may or may not include the EOS token in the returned sequence. Trainer must reconstruct exactly the same sequence (incl. EOS) and mask the same positions. Equal in frequency to tokenizer drift; less famous.
Reward hacking	Reward goes up; quality (as judged by a held-out human eval) goes down.	The reward signal is misspecified. The model found a shortcut. Mitigations: KL anchor (which you already have), reward-model robustness, regularly evaluate on a held-out grader.
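The staleness cap from the "Stale rollouts in async" row can be sketched as a filter on the rollout queue. `RolloutBatch`, `filter_stale`, and the version-counter convention are assumptions for illustration; a real system would tag each batch with whatever identifies the policy weights that generated it.

```python
from dataclasses import dataclass


@dataclass
class RolloutBatch:
    policy_version: int  # trainer step whose weights generated this batch
    n_tokens: int


def filter_stale(batches, trainer_version, max_staleness):
    """Split batches into (fresh, dropped) by age in trainer steps."""
    fresh, dropped = [], []
    for batch in batches:
        age = trainer_version - batch.policy_version
        (fresh if age <= max_staleness else dropped).append(batch)
    return fresh, dropped
```

Track the dropped token count as a metric. If it climbs, the rollout pool is falling behind the trainer and the fix is on the sync or capacity side, not in the filter.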

Performance & operations — pointers to where these live

Two cross-cutting concerns the engineer owns but which get their own depth lesson:

Performance — where the wall-clock goes & how to find it. See lesson 28. Short version: rollout dominates 60–75% of step time, trainer forward + ref forward another 15–26%, backward and sync are small but block. The diagnosis playbook (wall-clock attribution → nvidia-smi → torch.profiler → memory peak → Nsight) and the ROI-ordered optimization menu both live in that lesson.
Operations — what fails on day 10 of training. Component health metrics, distributed checkpointing, failure isolation, reward-hacking detection, throughput regression alarms — covered in lesson 28's long-run failure-mode section.
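Wall-clock attribution, the first step of that playbook, needs very little code. The sketch below is a hypothetical `StepTimer` using only the standard library. For GPU phases you would also need to synchronize the device before reading the clock (for example with `torch.cuda.synchronize()`), otherwise asynchronous kernel launches make the phase look shorter than it is.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulates wall-clock seconds per named phase of the RL step."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the phase raises, so failures still show up.
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of total measured time spent in each phase."""
        total = sum(self.totals.values())
        return {k: v / total for k, v in self.totals.items()} if total else {}
```

Wrap each phase of the loop in `with timer.phase("rollout"):` and so on, then log `timer.shares()` every step. A regression alarm on those shares catches "it's slow today" before anyone files the ticket.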

This lesson is about ownership: the boundaries of the role and the layers of the system the engineer is responsible for. Lesson 28 is about action: what an engineer does when the run is slow or wrong.

The skill matrix — what the role hires for

Skill	Why it matters	What "good" looks like
PyTorch internals	You'll patch FSDP, write custom autograd.Function, debug strided-tensor surprises.	Can explain dispatcher → autograd → kernel launch path; has written a custom `Function` with hand-derived backward.
CUDA / Triton	The kernels you'll write or port don't exist yet at the level you need. See next lesson.	Comfortable with FlashAttention internals; has written a fused kernel in Triton; can read a Nsight profile.
Distributed systems	The four communication paths above are network operations.	Can diagnose an NCCL hang; understands NVLink / IB / PCIe bandwidth differences; has tuned a collective.
RL algorithms	You'll bias-debug. PPO clipping behaviour, KL estimator choice, advantage normalization, importance-sampling instability: without algorithm fluency you can't tell a bug from a feature.	Can derive PPO surrogate; can explain why Dr.GRPO drops GRPO's per-response length normalization and reward-std scaling; knows the k1/k2/k3 KL estimators.
Inference engine internals	vLLM and SGLang are the rollout layer. You'll integrate, extend, or replace them.	Can read vLLM's scheduler; has implemented a prefix-cache feature; understands paged-attention block tables.
Numerical analysis	Log-prob mismatch, advantage stability, KL clipping, low-precision gradients all break on numerical edges.	Knows when fp16 underflows; can show why a fused log_softmax (log-sum-exp form) avoids the underflow that log(softmax(x)) hits; understands gradient-clipping interaction with Adam's v̂.
Profiling	"It's slow today" is the most common ticket. You need to find the cause without restarting.	Lives in Nsight, py-spy, dcgm. Has a default profiling script for every component.
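For the RL-algorithms row, the two formulas an interviewer is most likely to ask for fit in a few lines. This is a scalar, per-token sketch in plain Python. The function names are illustrative, and the KL estimators follow the usual k1/k2/k3 convention with r = π_ref / π_θ evaluated on a token sampled from π_θ.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-token PPO clipped surrogate (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Pessimistic bound: take the smaller of the two objectives.
    return min(ratio * advantage, clipped * advantage)


def kl_estimators(logp_policy, logp_ref):
    """Single-sample estimators of KL(π_θ || π_ref) for one sampled token."""
    log_r = logp_ref - logp_policy
    r = math.exp(log_r)
    return {
        "k1": -log_r,             # unbiased, high variance, can be negative
        "k2": 0.5 * log_r ** 2,   # biased, low variance, non-negative
        "k3": (r - 1.0) - log_r,  # unbiased, non-negative
    }
```

Note the asymmetry in `ppo_clip_term`: with a positive advantage the gain is capped once the ratio passes 1 + ε, but with a negative advantage a large ratio is not clipped, so the penalty passes through in full.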

The role's distinguishing question

If you can answer this question end-to-end on a whiteboard, you can do the job:

The whiteboard question

"You're given a 70B base model and a 1000-prompt math benchmark. Budget: 64 H100s for 7 days. Design the post-training RL pipeline. Walk through: rollout architecture, algorithm choice, weight-sync strategy, environment design, expected throughput, the three things most likely to go wrong, and how you'd diagnose them. End with the kernel you'd write first."

A weak answer names libraries. A strong answer derives every choice from constraints (64 GPUs ÷ ? roles, model dtype, sequence length budget, batch math), names the trade-offs ignored, and identifies the first kernel from a back-of-envelope profile estimate. The kernel work itself is the subject of the next lesson.

Interview prompts you should be ready for

"Walk through a single RL step from start to finish. Where does the time go?" (The six-role loop, rollout dominates 60–75%, trainer forward + ref forward another 15–26%. Backward and sync are small but block.)
"Colocated vs disaggregated vs async: which would you pick for a 70B model on 8 H100s?" (Colocated. Inference-only 70B fits on 4 H100s with TP, but the trainer side needs roughly 1 TB for params + grads + Adam fp32 state, plus activations. A 4-roller + 4-trainer split leaves the trainer pool with 320 GB of HBM and nowhere to put that state. With all 8 GPUs shared, you swap modes and reshard at each phase boundary; even then 8 × 80 GB = 640 GB is short of 1 TB, so expect to offload optimizer state to host memory or shrink the optimizer footprint. Disaggregated becomes the right answer around 32+ GPUs where the trainer pool can be sized properly. The lesson: sizing decides architecture before personal preference does.)
"Your rewards go down right after each weight sync. Diagnose." (Almost always: logprob mismatch between rollout engine and trainer. PPO ratio is biased. Fix: recompute logprobs on the trainer for the importance ratio, or guarantee numerical equivalence between the engines.)
"You see clipping fraction = 90% in PPO. What does it mean?" (π/π_old is way outside the clip range. Possibilities: too-stale rollouts in async, too-large learning rate, sampling temperature mismatch between rollout and training, or a logprob bug. Each has a different fix.)
"How do you handle an environment that takes 500ms per reward call?" (Async pipeline: overlap reward computation with the next rollout step. If the verifier's time per batch (500ms × batch_size ÷ verifier concurrency) exceeds rollout time, the verifier is on the critical path → batch the verifier, GPU-accelerate it, or use a learned proxy + occasional ground-truth verification.)
"What's the first thing you'd profile on a fresh RL framework?" (Per-step breakdown of rollout / ref forward / trainer forward / backward / sync. Until you know where the time goes, no kernel decision is justified.)
"What's the first kernel you'd write?" (Whatever your profile told you. Typically: prefix-aware paged attention if K-rollouts share long prompts, or chunked cross-entropy if vocab is large and trainer forward is HBM-bound. The next lesson is precisely this question.)

Takeaway

An RL infra engineer owns five layers: framework, environment, performance, correctness, operations. The two duties in the prompt — agentic infra and rollout/training kernels — are accurate but partial. The role is defined by the four communication paths between rollout, trainer, reference, and reward, each of which is its own engineering surface. The senior signal is being able to derive each architectural choice from constraints (model size, fabric, latency budget), profile the loop in 60 seconds, and name the correctness traps before they ship. Kernels are the last 30% of speed — measurable, individual, and where the next lesson lives.