The RL online dataplane

Lessons 01–09 built a batch pipeline: fixed sources land in bronze, get cleaned into silver, and are packed into gold for the trainer. RL breaks the core assumptions of that model: the policy generates its own data during training, so the pipeline becomes an online loop that runs inside every step.

Where we are

You have built the whole batch pipeline: storage formats (04), distributed transforms (05), dedup (06), tokenization and packing (07), quality gates (08), and an orchestration DAG (09). This lesson is the pivot. In RL, the ETL pipeline is not a nightly job — it is the inner loop of training itself. Everything you built still applies, but now it runs at sub-second latency, continuously, while the model's own weights are changing underneath it.

The shift: from batch ETL to an online loop

In SFT and preference training the data is static: a human or a distillation pipeline builds the dataset once, you run it through the batch pipeline (lessons 01–09), and the trainer reads from a frozen gold store. The policy does not influence its own training data.

In RL, and especially RLVR (RL with verifiable rewards), the policy generates its own data every step. A prompt is sampled, the current policy produces a rollout (a candidate response), a verifier scores it, and that scored trajectory is immediately fed back to the trainer as a gradient signal. The "dataset" is not a file on disk; it is a stream of trajectories produced by the current policy, flowing through a pipeline that runs in real time.

The batch ETL loop:

  SOURCES  ──▶  bronze  ──▶  silver  ──▶  gold  ──▶  TRAINER (reads once, trains)

The RL online loop:

  ┌─────────────────────────────────────────────────────────────────┐
  │                                                                 │
  │  prompt    rollout     verify/     online     replay    trainer │
  │  sampler ─▶ engine ──▶ reward  ──▶ filter ──▶ buffer ──▶        │
  │                ▲                                         │      │
  │                └─────── weight sync ◀────────────────────┘      │
  │                                                                 │
  └─────────────────────────────────────────────────────────────────┘

The loop has the same logical stages as batch ETL — ingest (prompt sample), transform (rollout), quality-gate (verify/filter), buffer (replay buffer) — but all of them must run at training speed, with low latency, and with data that is constantly becoming stale as the model's weights advance.
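The stages above can be sketched as a single-process toy loop. This is only an illustration under stated assumptions: the function names (`sample_prompt`, `rollout`, `verify`, `passes_filter`, `run_loop`) are hypothetical, the "policy" is a random guesser, and the gradient step is a placeholder. The loop is fully synchronous, so every batch is consumed at the same policy version that generated it.

```python
import random
from collections import deque


def sample_prompt(rng):
    """Ingest: draw a toy arithmetic prompt with a known answer."""
    a, b = rng.randint(0, 9), rng.randint(0, 9)
    return {"prompt": f"{a}+{b}=", "answer": a + b}


def rollout(item, policy_version, rng):
    """Transform: stand-in for generation; a real engine would call the model."""
    guess = item["answer"] + rng.choice([0, 0, 1])
    return {**item, "response": str(guess), "policy_version": policy_version}


def verify(traj):
    """Quality gate: a deterministic verifier turns the response into a reward."""
    reward = 1.0 if traj["response"] == str(traj["answer"]) else 0.0
    return {**traj, "reward": reward}


def passes_filter(traj):
    """Quality gate: cheap inline format check."""
    return traj["response"].isdigit()


def run_loop(num_steps=5, batch_size=4, seed=0):
    rng = random.Random(seed)
    buffer = deque()
    rollout_version = 0  # weights currently loaded in the rollout engine
    log = []
    for trainer_step in range(num_steps):
        # Fill the buffer with rollouts from the current rollout weights.
        while len(buffer) < batch_size:
            traj = verify(rollout(sample_prompt(rng), rollout_version, rng))
            if passes_filter(traj):
                buffer.append(traj)
        batch = [buffer.popleft() for _ in range(batch_size)]
        # Placeholder for the gradient step; here we only record statistics.
        log.append({
            "trainer_step": trainer_step,
            "versions": sorted({t["policy_version"] for t in batch}),
            "mean_reward": sum(t["reward"] for t in batch) / batch_size,
        })
        rollout_version = trainer_step + 1  # weight sync back to the rollout engine
    return log
```

Calling `run_loop()` returns one log entry per trainer step. Because the sketch syncs weights after every step and drains the buffer completely, each batch contains a single policy version equal to the trainer step.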

For the system topology that implements this loop in practice, see the RL series' topology lesson (19). For the async weight-sync mechanism that keeps rollout engines from blocking the trainer, see RL lesson 22c.

Which batch assumptions break — and what survives

Batch assumption	What happens in RL	Replacement
Bronze is immutable (write once, read many)	Rollout data is ephemeral: consumed once (or a handful of times) and then discarded. On-policy RL theory says re-using old samples introduces bias.	A transient replay buffer (ring buffer or FIFO queue) replaces the bronze layer. Data ages out rather than accumulating.
The dataset is finite (a known number of rows)	Rollouts stream in continuously for the whole run, so the "dataset" is an unbounded sequence of (prompt, response, reward) triples.	Think in rates (samples/s) and buffer occupancy, not row counts. Backpressure between producer and consumer replaces the static partition model.
The data is stationary (the distribution does not change under your feet)	Every gradient step changes the policy weights, which changes the rollout distribution. Data from step t describes policy π_t; it becomes off-policy by step t+k.	On-policy: discard data after one use. Off-policy: apply importance sampling to correct the distribution mismatch.
Quality gates are offline (run as a slow batch scan)	Filtering must happen inline: a slow quality gate becomes a bottleneck that starves the trainer.	Online, low-latency filters: rule-based (length, format), deterministic verifier (math answer check), or a fast reward model. Cheap heuristics stay inline; slow model-scored filters move to the prompt-sampler side.

What carries over. Schema validation (every trajectory has the expected fields), dedup (avoid re-processing the same prompt repeatedly), and basic quality gates all still apply — they are just enforced inline, in the hot path, with microsecond budgets instead of minutes.
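A minimal sketch of what such an inline gate could look like. The field names in `REQUIRED_FIELDS`, the `InlineGate` class, and the character limit are all assumptions made for illustration. The dedup key is also an assumption: it uses the (prompt, response) pair so that several distinct responses to one prompt survive; keying on the prompt alone would be the stricter variant.

```python
# Hypothetical trajectory schema: field name -> expected Python type.
REQUIRED_FIELDS = {"prompt": str, "response": str, "reward": float, "policy_step": int}


def validate_schema(traj):
    """True if every required field is present with the expected type."""
    return all(isinstance(traj.get(name), typ) for name, typ in REQUIRED_FIELDS.items())


class InlineGate:
    """Cheap per-trajectory checks meant to run in the hot path."""

    def __init__(self, max_response_chars=4096):
        self.max_response_chars = max_response_chars
        self.seen = set()  # exact-match dedup keys seen so far
        self.rejects = {"schema": 0, "length": 0, "duplicate": 0}

    def accept(self, traj):
        if not validate_schema(traj):
            self.rejects["schema"] += 1
            return False
        if not 0 < len(traj["response"]) <= self.max_response_chars:
            self.rejects["length"] += 1
            return False
        key = (traj["prompt"], traj["response"])
        if key in self.seen:
            self.rejects["duplicate"] += 1
            return False
        self.seen.add(key)
        return True
```

The `rejects` counters double as a monitoring signal: a sudden rise in any one of them is worth an alert. In a long run the `seen` set grows without bound, so a real gate would need to cap or expire it.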

Non-stationarity is the silent killer

In a batch pipeline a data quality regression is visible on the next audit. In RL, a poisoned or miscalibrated reward signal is immediately incorporated into the next gradient step, and the policy can degrade within dozens of steps before anyone notices. Online monitoring of reward distribution, response length, and format-error rates is mandatory — not a nice-to-have.
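One way to make that monitoring concrete is a rolling window over the three signals named above. The sketch below is an assumption-laden illustration: the `RollingMonitor` class, the alert names, and every threshold are hypothetical and would need tuning per task.

```python
from collections import deque
from statistics import fmean


class RollingMonitor:
    """Tracks reward, response length, and format errors over a sliding window."""

    def __init__(self, window=256, min_samples=32, max_reward_drop=0.2,
                 max_length_ratio=1.5, max_format_error_rate=0.1):
        self.rewards = deque(maxlen=window)
        self.lengths = deque(maxlen=window)
        self.format_errors = deque(maxlen=window)
        self.min_samples = min_samples
        self.max_reward_drop = max_reward_drop
        self.max_length_ratio = max_length_ratio
        self.max_format_error_rate = max_format_error_rate
        self.baseline = None  # (mean reward, mean length) once frozen

    def update(self, reward, response_len, format_ok):
        self.rewards.append(float(reward))
        self.lengths.append(float(response_len))
        self.format_errors.append(0.0 if format_ok else 1.0)

    def freeze_baseline(self):
        """Record the current window as the healthy reference (window must be non-empty)."""
        self.baseline = (fmean(self.rewards), fmean(self.lengths))

    def alerts(self):
        if len(self.rewards) < self.min_samples:
            return []  # not enough data to judge
        found = []
        if self.baseline is not None:
            base_reward, base_len = self.baseline
            if base_reward - fmean(self.rewards) > self.max_reward_drop:
                found.append("reward_drop")
            if base_len > 0 and fmean(self.lengths) / base_len > self.max_length_ratio:
                found.append("length_growth")
        if fmean(self.format_errors) > self.max_format_error_rate:
            found.append("format_error_rate")
        return found
```

Calling `update` once per trajectory and `alerts` once per trainer step keeps the check in the loop rather than in a later audit.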

On-policy freshness and the staleness budget

Policy-gradient theory assumes on-policy data: the gradient estimate is unbiased only if the trajectories were sampled from the current policy π_θ. In practice, fully synchronous systems waste GPU time: the trainer idles while rollouts run, and rollout engines idle while the trainer updates weights.

Async systems solve this by letting rollout workers continue generating while the trainer trains, keeping the GPUs busy. The cost is staleness: by the time a rollout reaches the trainer, the weights may have advanced by Δ steps since the rollout was generated.

Define staleness formally:

  Δ = (trainer step at consumption) − (trainer step at rollout generation)

  Δ = 0   → fully on-policy  (synchronous; no staleness bias, but GPUs idle while waiting)
  Δ = 1   → one step off-policy (common in practice)
  Δ > k   → beyond the staleness budget k: significantly off-policy; importance sampling needed, or clip and discard
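The definition translates directly into code if each trajectory is stamped with the trainer step at which it was generated. In this sketch the `Trajectory` dataclass, its `generated_at_step` field, and the helper names are assumptions for illustration; `k` is the threshold from the Δ > k case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    prompt: str
    response: str
    reward: float
    generated_at_step: int  # trainer step of the weights that produced this rollout


def staleness(traj, trainer_step):
    """Delta = (trainer step at consumption) - (trainer step at rollout generation)."""
    delta = trainer_step - traj.generated_at_step
    if delta < 0:
        raise ValueError("rollout is stamped with a future trainer step")
    return delta


def split_by_budget(trajs, trainer_step, k):
    """Partition into (within budget, over budget) using the threshold k."""
    fresh, stale = [], []
    for traj in trajs:
        (fresh if staleness(traj, trainer_step) <= k else stale).append(traj)
    return fresh, stale
```

For example, at trainer step 5 with `k = 1`, rollouts stamped 5 and 4 land in the fresh list and a rollout stamped 2 lands in the stale list, where it can be discarded or routed to a corrected update.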

The staleness budget is a tuning parameter, not a flaw. Many production systems tolerate Δ ≤ 1–2 without importance sampling corrections, accepting a small bias in exchange for much higher hardware utilization. Larger Δ requires explicit correction via importance weights π_θ(a|s) / π_old(a|s), or clipping of that ratio (PPO's clipped objective does exactly this). See RL lesson 22c for the engineering mechanics of async weight sync.
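A small sketch of the two corrections, assuming the rollout engine stored the log-probability of each action under the policy that generated it. The function names are hypothetical, and `eps=0.2` is only a commonly used default, not a value taken from this lesson. The clipped term is the standard PPO surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A).

```python
import math


def importance_ratio(logp_new, logp_old):
    """pi_theta(a|s) / pi_old(a|s), computed from log-probabilities for stability."""
    return math.exp(logp_new - logp_old)


def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, a ratio of 1.5 is capped at 1.2, so a stale sample cannot push the policy further than the clip range allows. With a negative advantage the unclipped term is kept when it is the more pessimistic of the two.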

The staleness/bias trade-off

Higher async depth (larger maximum Δ) means rollout engines are rarely idle and hardware utilization rises toward 100%. But the gradient estimate drifts off-policy: the policy at step t+Δ is not the policy that generated the rollout. Beyond a threshold, gradient bias dominates and training destabilizes. Every system has a sweet spot between starvation (Δ = 0, synchronous) and bias (Δ ≫ 1).

The replay buffer: on-policy vs off-policy

The replay buffer sits between the rollout engines and the trainer. Its size and eviction policy determine the freshness-throughput trade.

	On-policy buffer	Off-policy buffer
Size	Small: hold one or a few batches, then discard	Larger: keep a replay window of recent trajectories
Reuse	Use each sample once (or a fixed small number of epochs — PPO's mini-epoch loop)	Uniform or priority-weighted sample from the replay window; samples used many times
Staleness	Minimal: data is nearly on-policy by construction	Higher: older samples may be significantly off-policy
Correction	None needed (PPO clip handles small deviations)	Importance sampling weights required for unbiased gradient
RL use case	GRPO, PPO, RLVR — short-horizon tasks with fast verifiers	Soft actor-critic, Q-learning, some RLHF variants
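The two columns can be sketched as two small classes built on `collections.deque`. Both class names and their methods are hypothetical. The on-policy variant hands each sample out once and drops anything over the staleness cap at read time; the off-policy variant keeps a bounded window and samples from it repeatedly.

```python
import random
from collections import deque


class OnPolicyBuffer:
    """FIFO buffer: each sample is consumed at most once."""

    def __init__(self, capacity, max_delta):
        self._items = deque(maxlen=capacity)  # a full deque evicts its oldest entry
        self.max_delta = max_delta

    def add(self, traj, generated_at_step):
        self._items.append((generated_at_step, traj))

    def take(self, batch_size, trainer_step):
        """Pop up to batch_size samples, silently dropping ones that are too stale."""
        batch = []
        while self._items and len(batch) < batch_size:
            step, traj = self._items.popleft()
            if trainer_step - step <= self.max_delta:
                batch.append(traj)
        return batch

    def __len__(self):
        return len(self._items)


class ReplayWindow:
    """Bounded window of recent trajectories; samples may be reused many times."""

    def __init__(self, capacity, seed=0):
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, traj, generated_at_step):
        self._items.append((generated_at_step, traj))

    def sample(self, batch_size):
        """Uniform sample without removal; returns (generated_at_step, traj) pairs."""
        pool = list(self._items)
        return self._rng.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        return len(self._items)
```

`ReplayWindow.sample` returns the generation step alongside each trajectory so the trainer can compute staleness and apply importance weights. Priority-weighted sampling is left out of this sketch.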

Backpressure: matching the producer to the consumer

The replay buffer is a producer/consumer queue. The rollout engine is the producer; the trainer is the consumer. Rate mismatch in either direction is expensive:

  rollout_rate < train_rate  →  TRAINER STARVES: trainer blocks waiting for data.
                                 GPUs sit idle. Effective throughput = rollout_rate.

  rollout_rate > train_rate  →  BUFFER FILLS: data queues up. If buffer is bounded,
                                 oldest samples are evicted (staleness spikes) or the
                                 rollout engine is back-pressured to slow down.

  rollout_rate ≈ train_rate  →  STEADY STATE: buffer occupancy is stable, staleness
                                 is bounded by async depth, GPUs stay fed.
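A bounded queue gives both behaviours for free. The sketch below uses the standard library's `queue.Queue`: `put` blocks when the buffer is full (backpressure on the rollout engine) and `get` blocks when it is empty (the trainer starves). The function name `run_pipeline` and the integer "samples" are placeholders for illustration.

```python
import queue
import threading


def run_pipeline(num_samples=100, capacity=8):
    buf = queue.Queue(maxsize=capacity)  # bounded buffer between producer and consumer
    consumed = []
    max_seen = 0

    def rollout_engine():
        for i in range(num_samples):
            buf.put(i)   # blocks while the buffer is full: backpressure
        buf.put(None)    # sentinel: no more rollouts

    def trainer():
        nonlocal max_seen
        while True:
            item = buf.get()  # blocks while the buffer is empty: starvation
            if item is None:
                break
            max_seen = max(max_seen, buf.qsize())
            consumed.append(item)

    threads = [threading.Thread(target=rollout_engine), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, max_seen
```

Blocking the producer trades rollout utilization for bounded staleness. The alternative named above, evicting the oldest samples instead of blocking, keeps the rollout engine busy but throws away work.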

This is the same producer/consumer overlap that lesson 05 described for Ray Data's streaming executor: CPU stages (rollout) and GPU stages (trainer) should run concurrently, with the buffer as the rate-decoupling mechanism. The difference is that in RL the "CPU stage" is itself a GPU-heavy generation pass — managing the rate match requires careful allocation of GPU capacity to rollout vs training.

Interactive · freshness and throughput simulator

Adjust rollout throughput, trainer consumption rate, and the maximum async depth allowed. The simulator reports steady-state buffer occupancy, effective step throughput, and whether the system is in a starving, stale, or balanced regime.

RL dataplane: buffer fill & staleness simulator

Rollout engine produces samples; the trainer consumes them. The async depth cap (max Δ) determines how many trainer steps a sample may lag before it is discarded as too stale. Find the sweet spot between starving the trainer and letting data go off-policy.

Default settings:

  rollout throughput (samples/s):      60
  trainer consumption (samples/s):     80
  buffer capacity (samples):           400
  max async depth Δ (trainer steps):   2
  samples per trainer step:            32

Reported outputs: steady-state Δ, buffer occupancy, effective throughput, rollout GPU utilization, regime, and staleness risk.
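For readers without the interactive widget, here is a rough offline stand-in. It is not the page's simulator: it is a simple fluid model written for this note, and everything in it is an assumption, including the `steady_state` function, the 5% `tolerance` band for "balanced", and the rule that the buffer sits empty when the trainer is at least as fast as the rollout engine and sits at its usable limit otherwise. The usable limit is the smaller of the buffer capacity and max Δ steps' worth of samples, and overproduction is assumed to be throttled by backpressure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    regime: str                 # "starving", "stale", or "balanced"
    effective_samples_per_s: float
    effective_steps_per_s: float
    buffer_occupancy: float     # samples
    steady_state_delta: float   # trainer steps of lag at consumption
    rollout_util: float
    trainer_util: float


def steady_state(rollout_rate, train_rate, capacity, max_delta,
                 samples_per_step, tolerance=0.05):
    effective = min(rollout_rate, train_rate)
    # Samples lagging more than max_delta steps are discarded, so only this
    # much of the buffer can ever hold usable data.
    usable = min(capacity, max_delta * samples_per_step)
    # Fluid-queue assumption: empty if the consumer keeps up, full otherwise.
    occupancy = float(usable) if rollout_rate > train_rate else 0.0
    if rollout_rate < train_rate * (1 - tolerance):
        regime = "starving"
    elif rollout_rate > train_rate * (1 + tolerance):
        regime = "stale"
    else:
        regime = "balanced"
    return SteadyState(
        regime=regime,
        effective_samples_per_s=effective,
        effective_steps_per_s=effective / samples_per_step,
        buffer_occupancy=occupancy,
        steady_state_delta=occupancy / samples_per_step,
        rollout_util=min(1.0, train_rate / rollout_rate),
        trainer_util=min(1.0, rollout_rate / train_rate),
    )
```

With the widget's default inputs (60, 80, 400, 2, 32) this model reports a starving regime: the trainer runs at 75% utilization and the buffer stays empty. Raising rollout throughput to 100 samples/s flips it to the stale regime, with the buffer pinned at 64 samples and Δ at the cap of 2.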

Takeaway

What to carry to lesson 11

The RL online dataplane is not a new pipeline — it is the batch pipeline running inside the training loop, with three extra dimensions: freshness (data ages as the policy advances), backpressure (producer and consumer must rate-match or one starves the other), and non-stationarity (the data distribution drifts every step, so quality and reward monitoring must be continuous). Lesson 11 builds the cost and throughput model for the whole system — batch stages and online loop together — and shows where the bottleneck lives and what it costs per million tokens.