Platforms & tools — from OpenAI Gym to Ray
Every equation in this course turns into a loop of two function calls. This lesson is the engineering substrate that makes that loop runnable, reproducible, and — when you have a thousand machines — scalable.
What this lesson reuses
Nothing here is a new algorithm. This is the lesson where the whole course meets a keyboard:
- Lesson 01: the agent–environment loop. Observe state → choose action → receive reward and next state → repeat. The Gym / Gymnasium API is that exact loop, frozen into two functions: reset() and step(). We will show they are the same object.
- Lesson 09: A3C's parallel actors. Many workers rolling out in parallel, decorrelating experience, feeding one learner. Ray / RLlib is that idea productionized: rollout workers on many machines feeding a central learner.
- Lessons 05–21: every algorithm. You almost never re-implement DQN, PPO, or SAC from scratch in production. Libraries (Stable-Baselines3, CleanRL) are the reference implementations everyone forks.
So this lesson is glue. But the glue is where research becomes results — and where the famous reproducibility crisis in RL lives.
The environment API is lesson 01
Recall the loop from orientation. On the left is the abstract loop; on the right is the literal Gymnasium code that implements it. Read them side by side — they are the same five lines.
repeat:
    observe state s
    choose action a ~ π(·|s)
    receive r, s'
    learn from (s,a,r,s')
obs, info = env.reset(seed=0)
done = False
while not done:
    action = policy(obs)                          # a ~ π(·|s)
    next_obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    agent.learn(obs, action, reward, next_obs)    # (s,a,r,s')
    obs = next_obs                                # s ← s'
The contract is tiny and that is the point — it is the same contract for CartPole, Atari, a robot arm, or an LLM rollout harness:
| Call | Returns | Maps to (lesson 01) |
|---|---|---|
| env.reset(seed) | obs, info | the initial state s0 of a fresh episode |
| env.step(a) | obs, reward, terminated, truncated, info | one tick: next state s', reward r, and whether the episode ended |
Why two flags? Older Gym versions returned a single done. Gymnasium splits it: terminated means the MDP reached a true terminal state (the pole fell, the goal was reached); there is no future return, so you bootstrap with 0. truncated means you hit a time limit but the world would have continued, so you should bootstrap with γ·V(s'), not 0. Collapsing them into one done silently biases your value targets on every time-limited env. The API shape encodes a piece of the math from lesson 11.
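The bootstrapping rule can be sketched as a one-step TD target. This is an illustrative sketch, not code from any library: the function names td_target and td_target_collapsed and the example numbers are assumptions. The point it shows is that only terminated masks the bootstrap term; truncated never does.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target that respects the terminated/truncated split.

    Only `terminated` zeroes the bootstrap. A truncated step is treated
    like any other step, because the world would have continued.
    """
    if terminated:
        return reward                      # true terminal state: no future return
    return reward + gamma * next_value     # continuing or truncated: bootstrap


def td_target_collapsed(reward, next_value, done, gamma=0.99):
    """Biased variant: a single `done` flag also zeroes time-limit steps."""
    return reward if done else reward + gamma * next_value


# A step that hit the time limit: terminated=False, truncated=True.
reward, next_value = 1.0, 10.0
correct = td_target(reward, next_value, terminated=False, gamma=0.5)
biased = td_target_collapsed(reward, next_value, done=True, gamma=0.5)
print(correct, biased)  # the collapsed version drops gamma * V(s')
```

The gap between the two targets is exactly γ·V(s'), which is the bias a collapsed done flag injects on every time-limited episode.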
Spaces — typing the state and action sets
Lesson 01's s and a need concrete shapes so a neural net can consume them. Every env declares two:
- env.observation_space: what an obs looks like.
- env.action_space: what a valid action looks like.
The common space types are exactly the two cases our algorithms split on:
| Space | Example | Which algorithms (this course) |
|---|---|---|
| Discrete(n) | 4 moves in a gridworld | Q-learning / DQN (L02), softmax policy (L03): argmax over n |
| Box(low, high, shape) | a torque vector ∈ ℝ⁶ | continuous control: DDPG / TD3 / SAC (L15); argmax is impossible |
This is why lesson 05 and lesson 18 exist as separate lessons: the type of the action space decides whether you can take a max over actions. The platform makes that type a first-class, checkable object — env.action_space.sample() gives you a random legal action, the simplest possible exploration policy.
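To make the fork concrete, here is a small sketch using toy stand-ins for the two space types. These classes are hypothetical and written for illustration; they are not Gymnasium's Discrete and Box (which, for example, take no rng argument in sample()). The function greedy_action is likewise an assumption, showing how the space type decides which selection rule is available.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Discrete:
    """Toy stand-in for a discrete action set {0, ..., n-1}."""
    n: int

    def sample(self, rng):
        return rng.randrange(self.n)

    def contains(self, a):
        return isinstance(a, int) and 0 <= a < self.n


@dataclass(frozen=True)
class Box:
    """Toy stand-in for a bounded continuous space in R^dim."""
    low: float
    high: float
    dim: int

    def sample(self, rng):
        return [rng.uniform(self.low, self.high) for _ in range(self.dim)]

    def contains(self, a):
        return len(a) == self.dim and all(self.low <= x <= self.high for x in a)


def greedy_action(space, q_values=None, actor_output=None):
    """Pick an action; the space type decides which branch is possible."""
    if isinstance(space, Discrete):
        # Finite set: enumerate the actions and take the argmax over Q(s, a).
        return max(range(space.n), key=lambda a: q_values[a])
    # Continuous set: nothing to enumerate, so take the policy network's
    # output and clip it into the legal range.
    return [min(max(x, space.low), space.high) for x in actor_output]


rng = random.Random(0)
moves, torque = Discrete(4), Box(-1.0, 1.0, 6)
assert moves.contains(moves.sample(rng))      # a random legal discrete action
assert torque.contains(torque.sample(rng))    # a random legal torque vector
```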
Vectorized envs — the cheapest parallelism
One env steps once per call. A neural-net policy on a GPU would love to score a batch of states at once. A vectorized env bundles N copies and steps them together:
import gymnasium as gym

envs = gym.make_vec("CartPole-v1", num_envs=8)
obs, info = envs.reset(seed=0)                         # obs.shape == (8, 4)
actions = policy(obs)                                  # one batched forward pass → (8,); policy is a placeholder
obs, rewards, term, trunc, info = envs.step(actions)   # 8 transitions at once
Now one policy forward pass produces 8 actions, and you collect 8 transitions per tick. This is pure throughput: the GPU stays busy, and the 8 streams are decorrelated — which, as lesson 09 argued, is what lets on-policy methods learn stably without a replay buffer. Vectorized envs are A3C's decorrelation idea on a single machine.
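Under the hood, the simplest vectorized env is just a loop over N ordinary envs. The sketch below is a pure-Python illustration, not Gymnasium's implementation: CountdownEnv and SyncVecEnv are hypothetical names, results are batched as plain lists rather than arrays, and a finished sub-env is reset in the same step (Gymnasium's own autoreset behavior differs between versions, so treat that detail as an assumption).

```python
class CountdownEnv:
    """Hypothetical toy env: obs is the number of steps left; ends at 0."""

    def __init__(self, length):
        self.length = length
        self.remaining = length

    def reset(self, seed=None):
        self.remaining = self.length
        return self.remaining, {}

    def step(self, action):
        self.remaining -= 1
        terminated = self.remaining == 0
        return self.remaining, 1.0, terminated, False, {}


class SyncVecEnv:
    """Steps N envs in lockstep and returns batched (list) results."""

    def __init__(self, envs):
        self.envs = list(envs)

    def reset(self, seed=None):
        obs = []
        for i, env in enumerate(self.envs):
            # Give each copy its own seed so the streams differ.
            o, _ = env.reset(seed=None if seed is None else seed + i)
            obs.append(o)
        return obs, {}

    def step(self, actions):
        obs, rewards, terms, truncs = [], [], [], []
        for env, action in zip(self.envs, actions):
            o, r, term, trunc, _ = env.step(action)
            if term or trunc:
                o, _ = env.reset()   # keep every slot holding a live episode
            obs.append(o)
            rewards.append(r)
            terms.append(term)
            truncs.append(trunc)
        return obs, rewards, terms, truncs, {}


envs = SyncVecEnv([CountdownEnv(n) for n in (2, 3, 4)])
obs, info = envs.reset(seed=0)
obs, rewards, term, trunc, info = envs.step([0, 0, 0])
obs, rewards, term, trunc, info = envs.step([0, 0, 0])
print(obs, term)   # the first env finished and was reset; the others continue
```

Because the sub-envs finish at different times, the batch at any tick mixes early and late parts of different episodes, which is where the decorrelation comes from.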
Ray & RLlib — A3C, productionized
Vectorized envs scale until one machine runs out of cores. Lesson 09's A3C answered "more actors" with asynchronous workers. Ray is a general distributed-execution framework; RLlib is its RL library, and its architecture is lesson 09 drawn at cluster scale:
Workers run rollouts (each one is just the lesson-01 reset/step loop), ship trajectories to the learner, the learner takes a gradient step and broadcasts fresh weights back. Swap "trajectories" for "tokens" and "learner" for "trainer" and you have the LLM post-training topology — which is exactly where the systems course picks up.
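The broadcast/gather topology can be sketched on one machine with the standard library. This is not Ray or RLlib code and uses none of their API: a thread pool stands in for the cluster, rollout_worker and learner_update are hypothetical names, the "transitions" are fake rewards, and the loop is synchronous for simplicity, whereas A3C-style systems let workers run asynchronously.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def rollout_worker(worker_id, weights, steps):
    """Collect `steps` fake transitions using a snapshot of the weights."""
    rng = random.Random(1000 * weights["version"] + worker_id)
    # Tag each transition with the weight version that produced it.
    return [(weights["version"], rng.random()) for _ in range(steps)]


def learner_update(weights, batch):
    """Stand-in for a gradient step: consume the batch, bump the version."""
    mean_reward = sum(r for _, r in batch) / len(batch)
    return {"version": weights["version"] + 1, "last_mean_reward": mean_reward}


def train(num_workers=4, steps_per_worker=5, iterations=3):
    weights = {"version": 0, "last_mean_reward": None}
    batch = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(iterations):
            # Broadcast: every worker receives the same weight snapshot.
            futures = [
                pool.submit(rollout_worker, i, weights, steps_per_worker)
                for i in range(num_workers)
            ]
            # Gather: trajectories flow back to the single learner.
            batch = [t for f in futures for t in f.result()]
            weights = learner_update(weights, batch)
    return weights, len(batch)


weights, batch_size = train()
print(weights["version"], batch_size)
```

In a real deployment the thread pool would be replaced by processes on many machines, and the weights dictionary by serialized network parameters; the shape of the loop stays the same.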
Algorithm libraries — don't re-derive, fork
You have spent 29 lessons learning why PPO clips and why SAC adds entropy. In practice you reach for a reference implementation:
| Library | Philosophy | Best for |
|---|---|---|
| Stable-Baselines3 | polished, modular, batteries-included (DQN, PPO, SAC, TD3, …) | applying a known-good algorithm to your env |
| CleanRL | single-file, no abstraction — one algorithm = one readable script | understanding and modifying an algorithm; research |
| RLlib (Ray) | distributed-first, scales to clusters | throughput, many workers, production |
The tradeoff is the usual one: SB3's abstraction saves you typing but hides the loop; CleanRL's single file shows you everything but you copy-paste to extend it. For learning, read CleanRL — its PPO file is the cleanest way to see lessons 10–14 as runnable code.
Seeds & the reproducibility crisis
RL results are notoriously hard to reproduce. The estimator is high-variance (lesson 10 was an entire lesson about that), so two runs that differ only in random seed can land in completely different places — one "solves" the task, one never learns. A 2018 study reran the same algorithm on the same benchmark with different seeds and got results that, taken individually, would support opposite conclusions.
The practical rule: seed every source of randomness, meaning the environment (env.reset(seed=k)), the action sampling, the network init, and the framework RNG (NumPy / PyTorch). And you must report results over many seeds with their spread; a single learning curve from one lucky seed is not evidence. The variance from lesson 10 is not just a training nuisance; it is a scientific one.
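A minimal sketch of the seed-and-report habit, using only the standard library. The function run_experiment is a hypothetical stand-in for a full training run (it just produces a noisy score), and summarize is an assumed helper name; the structure is what matters: one seed fully determines one run, and the reported number is a mean with spread over several seeds.

```python
import random
import statistics


def run_experiment(seed, episodes=200):
    """Hypothetical noisy 'training run' returning a final score in [0, 1]."""
    rng = random.Random(seed)   # one RNG per run, derived from the seed
    # In a real run you would also seed the environment
    # (env.reset(seed=seed)), NumPy (np.random.seed) and PyTorch
    # (torch.manual_seed) at this point.
    score = 0.0
    for _ in range(episodes):
        score += 0.05 * (rng.random() - score)   # noisy running average
    return score


def summarize(seeds):
    """Run once per seed and report mean, sample std, and the raw scores."""
    scores = [run_experiment(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores), scores


mean, std, scores = summarize(range(5))
print(f"score over {len(scores)} seeds: {mean:.3f} ± {std:.3f}")
```

Re-running with the same seed reproduces a score exactly, while different seeds give different scores; the spread across seeds is the quantity a single curve hides.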
Experiment-tracking tools (TensorBoard, Weights & Biases) exist for the same reason: when a run's outcome depends on a seed, you must log the seed, the exact config, the code version, and the full curve — not just the final number — so a result can be re-run and trusted.
Interactive · a tiny Gym-style env you can step
Below is a real reset()/step()/done contract — a 4×4 gridworld, written in vanilla JS, exposing the exact API above. The agent (blue) starts top-left; the goal (green) is bottom-right; one wall (gray) blocks a cell. Reward is −0.04 per step (a nudge to hurry) and +1.0 on reaching the goal, which terminates the episode. Hitting the 50-step cap truncates it.
Step it by hand with the arrow buttons, let a random policy (action_space.sample()) drive it, or let a greedy policy head for the goal. Watch the transition log: every line is one step() return — (obs, action, reward, terminated, truncated). The done flag is what stops the loop.
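For readers following along without the widget, the same contract can be sketched in Python. This is not the page's JavaScript; it is an illustrative re-creation under stated assumptions: the wall is placed at cell (1, 1), actions are ordered up/right/down/left, bumping into the wall or the border leaves the agent in place, and the goal step returns +1.0 alone rather than +1.0 plus the step penalty.

```python
import random


class GridWorld:
    """4x4 gridworld exposing the reset()/step() contract described above."""

    SIZE = 4
    GOAL = (3, 3)
    WALL = (1, 1)        # assumption: the text does not say which cell is blocked
    MAX_STEPS = 50
    # action index -> (row delta, col delta): up, right, down, left
    MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.pos = (0, 0)            # top-left
        self.t = 0
        return self.pos, {}

    def sample_action(self):
        """Random legal action, playing the role of action_space.sample()."""
        return self.rng.randrange(len(self.MOVES))

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        inside = 0 <= r < self.SIZE and 0 <= c < self.SIZE
        if inside and (r, c) != self.WALL:
            self.pos = (r, c)        # otherwise the agent stays put
        self.t += 1
        terminated = self.pos == self.GOAL
        truncated = not terminated and self.t >= self.MAX_STEPS
        reward = 1.0 if terminated else -0.04
        return self.pos, reward, terminated, truncated, {}


env = GridWorld()
obs, info = env.reset(seed=0)
total = 0.0
for action in [1, 1, 1, 2, 2, 2]:    # right x3, then down x3
    obs, reward, terminated, truncated, info = env.step(action)
    total += reward
print(obs, terminated, truncated, round(total, 2))
```

Five penalized steps plus the goal reward give a return of 0.8 for this shortest path; a policy that wanders until step 50 is truncated instead, with terminated still False.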
Mapping back to the spine
This lesson added no theory; it made the theory run:
- The Gym/Gymnasium API (reset/step/done) is lesson 01's agent–environment loop, frozen into code. The terminated/truncated split even encodes the bootstrapping rule from lesson 11.
- Spaces (Discrete vs Box) are the typed version of the value-vs-policy fork: discrete actions admit an argmax (DQN, L02), continuous ones force a parameterized policy (SAC, L15).
- Vectorized envs and Ray/RLlib are lesson 09's parallel actors, first on one machine, then across a cluster, feeding one learner.
- Seeds and tracking are the operational answer to lesson 10's villain: the estimator's variance is a reproducibility problem, not just a training one.
The reset/step/done contract is lesson 01's MDP loop made literal, and Ray/RLlib is lesson 09's parallel actors productionized into rollout workers feeding a learner. The algorithms were the hard part; the platforms are what let those algorithms touch a thousand environments at once, and what make a reported result reproducible enough to believe.