The RL online dataplane
Lessons 01–09 built a batch pipeline: fixed sources land in bronze, get cleaned into silver, and are packed into gold for the trainer. RL breaks the core assumptions of that model: the data is generated by the policy during training, which turns the pipeline into an online loop that runs inside every step.
The shift: from batch ETL to an online loop
In SFT and preference training the data is static: a human or a distillation pipeline builds the dataset once, you run it through the batch pipeline (lessons 01–09), and the trainer reads from a frozen gold store. The policy does not influence its own training data.
In RL, and especially RLVR (RL with verifiable rewards), the policy generates its own data every step. A prompt is sampled, the current policy produces a rollout (a candidate response), a verifier scores it, and the scored trajectory is immediately fed back to the trainer to compute the next gradient update. The "dataset" is not a file on disk: it is a stream of trajectories produced by the current policy, flowing through a pipeline that runs in real time.
The batch ETL loop:
SOURCES ──▶ bronze ──▶ silver ──▶ gold ──▶ TRAINER (reads once, trains)
The RL online loop:
┌───────────────────────────────────────────────────────────────────────┐
│                                                                       │
│  prompt      rollout     verify/     online      replay               │
│  sampler ─▶  engine ──▶  reward ──▶  filter ──▶  buffer ──▶  trainer  │
│                ▲                                                │     │
│                └────────────── weight sync ◀────────────────────┘     │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
The loop has the same logical stages as batch ETL — ingest (prompt sample), transform (rollout), quality-gate (verify/filter), buffer (replay buffer) — but all of them must run at training speed, with low latency, and with data that is constantly becoming stale as the model's weights advance.
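The stages above can be sketched as a single synchronous loop. This is a toy illustration, not a real trainer: the "policy" is just a probability of answering an addition prompt correctly, and the names `Trajectory`, `sample_prompt`, `rollout`, `verify`, and `run_loop` are hypothetical. The weight update is a placeholder that stands in for a gradient step.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float
    policy_step: int  # trainer step whose weights produced this rollout


def sample_prompt(rng: random.Random) -> str:
    # Ingest stage: draw a prompt (here, a toy addition problem).
    return f"{rng.randint(1, 9)}+{rng.randint(1, 9)}"


def rollout(prompt: str, skill: float, rng: random.Random) -> str:
    # Transform stage: the toy "policy" answers correctly with probability `skill`.
    a, b = map(int, prompt.split("+"))
    return str(a + b if rng.random() < skill else a + b + 1)


def verify(prompt: str, response: str) -> float:
    # Quality gate: a deterministic verifier returning a 0/1 reward.
    a, b = map(int, prompt.split("+"))
    return 1.0 if response == str(a + b) else 0.0


def run_loop(num_steps: int = 5, batch_size: int = 8, seed: int = 0):
    rng = random.Random(seed)
    skill = 0.5  # stand-in for the policy weights (assumption)
    history = []
    for step in range(num_steps):
        batch = []
        for _ in range(batch_size):
            prompt = sample_prompt(rng)
            response = rollout(prompt, skill, rng)
            batch.append(Trajectory(prompt, response, verify(prompt, response), step))
        # Online filter: drop malformed responses before they reach the trainer.
        batch = [t for t in batch if t.response.isdigit()]
        mean_reward = sum(t.reward for t in batch) / len(batch)
        # Placeholder for a gradient step followed by weight sync.
        skill = min(1.0, skill + 0.05)
        history.append((step, len(batch), mean_reward))
    return history


if __name__ == "__main__":
    for step, n, mean_reward in run_loop():
        print(f"step={step} batch={n} mean_reward={mean_reward:.2f}")
```

Every trajectory carries the `policy_step` that generated it, which is the field the staleness accounting later in this lesson relies on.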
For the system topology that implements this loop in practice, see the RL series' topology lesson (19). For the async weight-sync mechanism that keeps rollout engines from blocking the trainer, see RL lesson 22c.
Which batch assumptions break — and what survives
| Batch assumption | What happens in RL | Replacement |
|---|---|---|
| Bronze is immutable (write once, read many) | Rollout data is ephemeral: it is consumed once (or a handful of times) and then discarded. Under on-policy assumptions, re-using old samples introduces bias. | A transient replay buffer (ring buffer or FIFO queue) replaces the bronze layer. Data ages out rather than accumulating. |
| The dataset is finite (a known number of rows) | Rollouts stream in continuously for the whole run. The "dataset" is an unbounded sequence of (prompt, response, reward) triples. | Think in rates (samples/s) and buffer occupancy, not row counts. Backpressure between producer and consumer replaces the static partition model. |
| The data is stationary (the distribution does not change under your feet) | Every gradient step changes the policy weights, which changes the rollout distribution. Data from step t describes policy π_t; it is off-policy by step t+k. | On-policy: discard data after one use. Off-policy: apply importance sampling to correct the distribution mismatch. |
| Quality gates are offline (run as a slow batch scan) | Filtering must happen inline. A slow quality gate becomes a bottleneck that starves the trainer. | Online, low-latency filters: rule-based (length, format), a deterministic verifier (math answer check), or a fast reward model. Cheap heuristics stay inline; slow model-scored filters move upstream to the prompt-sampler side. |
What carries over. Schema validation (every trajectory has the expected fields), dedup (avoid re-processing the same prompt repeatedly), and basic quality gates all still apply. They are just enforced inline, in the hot path, with per-sample budgets on the order of milliseconds or less instead of minutes.
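As a rough illustration of these checks running inline, the sketch below combines schema validation, a length rule, and a per-prompt repeat cap in one cheap function. The field names, the thresholds, and the helper `make_inline_gate` are assumptions made for this example. A repeat cap is used instead of strict dedup because group-based methods deliberately sample several responses per prompt.

```python
REQUIRED_FIELDS = ("prompt", "response", "reward", "policy_step")  # assumed schema


def make_inline_gate(max_response_chars: int = 2000, max_repeats_per_prompt: int = 4):
    """Return a cheap accept/reject function for use in the hot path."""
    seen_counts: dict[str, int] = {}

    def gate(traj: dict) -> bool:
        # Schema validation: every expected field must be present.
        if any(field not in traj for field in REQUIRED_FIELDS):
            return False
        # Rule-based quality gate: reject empty or over-long responses.
        response = traj["response"]
        if not response or len(response) > max_response_chars:
            return False
        # Dedup: cap how many accepted trajectories share the same prompt.
        count = seen_counts.get(traj["prompt"], 0)
        if count >= max_repeats_per_prompt:
            return False
        seen_counts[traj["prompt"]] = count + 1
        return True

    return gate


if __name__ == "__main__":
    gate = make_inline_gate(max_response_chars=10, max_repeats_per_prompt=2)
    ok = {"prompt": "2+3", "response": "5", "reward": 1.0, "policy_step": 0}
    print(gate(ok), gate({"prompt": "2+3"}))
```

Each check is a dictionary lookup or a length comparison, so the gate adds negligible latency per trajectory. Anything slower belongs outside this function.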
On-policy freshness and the staleness budget
Policy-gradient theory assumes on-policy data: the gradient estimate is unbiased only if the trajectories were sampled from the current policy π_θ. In practice, fully synchronous systems waste GPU time. The trainer idles while rollouts run, and rollout engines idle while the trainer updates weights.
Async systems solve this by letting rollout workers continue generating while the trainer trains, keeping the GPUs busy. The cost is staleness: by the time a rollout reaches the trainer, the weights may have advanced by Δ steps since the rollout was generated.
Define staleness formally:
Δ = (trainer step at consumption) − (trainer step at rollout generation)

Δ = 0 → fully on-policy (synchronous; no staleness, but trainer and rollout engines take turns idling)
Δ = 1 → one step off-policy (common in practice)
Δ > k → significantly off-policy for a chosen budget k; importance sampling needed, or clip and discard
The staleness budget is a tuning parameter, not a flaw. Many asynchronous setups tolerate Δ ≤ 1–2 without importance sampling corrections and accept a small bias in exchange for much higher hardware utilization. Larger Δ requires explicit correction (via importance weights π_θ(a|s) / π_old(a|s)) or clipping (PPO's clip ratio does exactly this). See RL lesson 22c for the engineering mechanics of async weight sync.
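The definitions above translate into a few lines of arithmetic. The sketch below computes Δ, applies a staleness budget, and evaluates the standard PPO clipped surrogate for a single action from its log-probabilities. The function names and the default values (`max_staleness=2`, `clip_eps=0.2`) are illustrative assumptions, not settings prescribed by this lesson.

```python
import math


def staleness(consume_step: int, generate_step: int) -> int:
    # Delta = trainer step at consumption minus trainer step at generation.
    return consume_step - generate_step


def within_budget(consume_step: int, generate_step: int, max_staleness: int = 2) -> bool:
    # Admit a rollout only if its staleness is inside the budget.
    return 0 <= staleness(consume_step, generate_step) <= max_staleness


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip_eps: float = 0.2) -> float:
    # Importance ratio pi_theta(a|s) / pi_old(a|s), computed from log-probs.
    ratio = math.exp(logp_new - logp_old)
    # PPO clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    print(staleness(consume_step=12, generate_step=10))
    print(within_budget(12, 10), within_budget(15, 10))
    print(clipped_surrogate(math.log(2.0), 0.0, advantage=1.0))
```

With a positive advantage, a ratio above 1 + ε earns no additional objective, which is how clipping limits the influence of samples that have drifted away from the current policy.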
The replay buffer: on-policy vs off-policy
The replay buffer sits between the rollout engines and the trainer. Its size and eviction policy determine the freshness-throughput trade.
| | On-policy buffer | Off-policy buffer |
|---|---|---|
| Size | Small: hold one or a few batches, then discard | Larger: keep a replay window of recent trajectories |
| Reuse | Use each sample once (or a fixed small number of epochs — PPO's mini-epoch loop) | Uniform or priority-weighted sample from the replay window; samples used many times |
| Staleness | Minimal: data is nearly on-policy by construction | Higher: older samples may be significantly off-policy |
| Correction | None needed (PPO clip handles small deviations) | Importance sampling weights required for unbiased gradient |
| RL use case | GRPO, PPO, RLVR — short-horizon tasks with fast verifiers | Soft actor-critic, Q-learning, some RLHF variants |
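A minimal sketch of the on-policy column, assuming trajectories arrive in non-decreasing `policy_step` order so that the stalest items always sit at the front of the queue. The class name `FreshnessBuffer` and its methods are hypothetical. The buffer is bounded, evicts anything beyond the staleness budget, and hands each sample to the trainer exactly once.

```python
from collections import deque


class FreshnessBuffer:
    """Bounded FIFO buffer that drops trajectories older than a staleness budget."""

    def __init__(self, capacity: int, max_staleness: int):
        # deque(maxlen=...) silently discards the oldest item when full.
        self._items = deque(maxlen=capacity)
        self.max_staleness = max_staleness

    def add(self, trajectory, policy_step: int) -> None:
        self._items.append((policy_step, trajectory))

    def pop_batch(self, batch_size: int, trainer_step: int):
        # Evict from the front while the oldest item exceeds the budget.
        while self._items and trainer_step - self._items[0][0] > self.max_staleness:
            self._items.popleft()
        if len(self._items) < batch_size:
            return None  # not enough fresh data: the trainer would have to wait
        # Single-use semantics: popped samples are never served again.
        return [self._items.popleft()[1] for _ in range(batch_size)]

    def __len__(self) -> int:
        return len(self._items)


if __name__ == "__main__":
    buf = FreshnessBuffer(capacity=4, max_staleness=1)
    for name, step in [("a", 0), ("b", 0), ("c", 1), ("d", 2), ("e", 2)]:
        buf.add(name, step)
    print(buf.pop_batch(batch_size=2, trainer_step=2), len(buf))
```

An off-policy variant would keep a larger window and sample from it without removal, storing the behavior policy's log-probabilities alongside each trajectory so that importance weights can be computed at training time.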
Backpressure: matching the producer to the consumer
The replay buffer is a producer/consumer queue. The rollout engine is the producer; the trainer is the consumer. Rate mismatch in either direction is expensive:
rollout_rate < train_rate  →  TRAINER STARVES: trainer blocks waiting for data.
                              GPUs sit idle. Effective throughput = rollout_rate.
rollout_rate > train_rate  →  BUFFER FILLS: data queues up. If buffer is bounded,
                              oldest samples are evicted (staleness spikes) or the
                              rollout engine is back-pressured to slow down.
rollout_rate ≈ train_rate  →  STEADY STATE: buffer occupancy is stable, staleness
                              is bounded by async depth, GPUs stay fed.
This is the same producer/consumer overlap that lesson 05 described for Ray Data's streaming executor, where CPU stages and GPU stages run concurrently. Here the rollout engine plays the producer role and the trainer plays the consumer role, with the buffer as the rate-decoupling mechanism. The difference is that in RL the producer is itself a GPU-heavy generation pass, so matching the rates requires careful allocation of GPU capacity between rollout and training.
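One simple way to get backpressure is a bounded blocking queue. The sketch below uses Python's standard `queue.Queue` with a `maxsize`: `put` blocks the producer when the buffer is full, and `get` blocks the consumer when it is empty. The function `run_pipeline` and the integer "rollouts" are stand-ins; a real system would pass trajectories between processes or machines rather than threads.

```python
import queue
import threading


def run_pipeline(num_items: int = 50, buffer_size: int = 4):
    buf = queue.Queue(maxsize=buffer_size)  # bounded buffer = async depth limit
    consumed = []
    peak = 0

    def producer():
        nonlocal peak
        for i in range(num_items):
            buf.put(i)  # blocks while the buffer is full (backpressure)
            peak = max(peak, buf.qsize())  # approximate occupancy snapshot
        buf.put(None)  # sentinel: tells the consumer to stop

    def consumer():
        while True:
            item = buf.get()  # blocks while the buffer is empty (starvation)
            if item is None:
                break
            consumed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, peak


if __name__ == "__main__":
    items, peak = run_pipeline()
    print(len(items), "items consumed; peak occupancy", peak)
```

Because the producer can never get more than `buffer_size` items ahead, the queue bound doubles as a bound on how stale a sample can be when the consumer finally reads it.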
Interactive · freshness and throughput simulator
Adjust rollout throughput, trainer consumption rate, and the maximum async depth allowed. The simulator reports steady-state buffer occupancy, effective step throughput, and whether the system is in a starving, stale, or balanced regime.
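The same quantities can be approximated offline with a small fluid model. This is a sketch under simplifying assumptions and not the simulator's actual implementation: rates are in batches per second, the buffer holds at most `max_async_depth` batches, the rollout engine is back-pressured when the buffer is full, and the 5% thresholds used to label the regime are arbitrary. The names `simulate` and `SimResult` are hypothetical.

```python
from typing import NamedTuple


class SimResult(NamedTuple):
    occupancy: float   # batches in the buffer right after production, final tick
    throughput: float  # batches consumed per second by the trainer
    regime: str        # "starving", "stale", or "balanced"


def simulate(rollout_rate: float, train_rate: float, max_async_depth: int,
             seconds: float = 10.0, dt: float = 0.01) -> SimResult:
    if rollout_rate <= 0 or train_rate <= 0 or max_async_depth < 1:
        raise ValueError("rates must be positive and max_async_depth >= 1")
    capacity = float(max_async_depth)
    occupancy = 0.0
    after_production = 0.0
    consumed_total = 0.0
    ticks = int(round(seconds / dt))
    for _ in range(ticks):
        # Producer: generate rollouts, limited by free buffer space (backpressure).
        occupancy += min(rollout_rate * dt, capacity - occupancy)
        after_production = occupancy
        # Consumer: train on whatever is available, up to the trainer's rate.
        consumed = min(train_rate * dt, occupancy)
        occupancy -= consumed
        consumed_total += consumed
    throughput = consumed_total / (ticks * dt)
    if throughput < 0.95 * train_rate:
        regime = "starving"  # trainer is waiting on rollouts
    elif after_production >= 0.95 * capacity:
        regime = "stale"     # buffer pinned at capacity, samples wait the full depth
    else:
        regime = "balanced"
    return SimResult(after_production, throughput, regime)


if __name__ == "__main__":
    for rollout, train in [(5, 10), (20, 10), (10, 10)]:
        print(rollout, train, simulate(rollout, train, max_async_depth=2))
```

If the trainer consumes one batch per step, occupancy in batches is a rough proxy for staleness in steps. The model runs for a finite horizon, so a producer that is only slightly faster than the trainer may not have filled the buffer by the end of the run.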