Anatomy of an environment — spaces, the step/reset contract, and episodes
Lesson 01 wrote the agent–environment loop and gave the environment two jobs: return a reward, return the next state. This lesson opens the environment up. It is not a vague "world" — it is a small, precise object with an interface, two type signatures, and an episode structure. Get that object right and every algorithm in the course has something correct to talk to; get it wrong and they all learn nonsense from a clean-looking bug.
Who owns what
RL's hardest bugs come from blurring the line between the two halves of lesson 01's loop. So draw it sharply. The agent is the only thing you train; it is a function from observations to actions. The environment is everything else, and crucially it holds the parts of the MDP (𝒮, 𝒜, P, R, γ) the agent is not allowed to see directly:
                 a_t  (the only thing the agent emits)
┌─────────┐ ───────────────► ┌──────────────────────────┐
│  AGENT  │                  │ ENVIRONMENT              │
│         │                  │ owns the true state s_t  │
│ π(a|o)  │                  │ owns P(s'|s,a) (dynamics)│
│         │                  │ owns R(s,a)    (reward)  │
└─────────┘ ◄─────────────── └──────────────────────────┘
              o_{t+1}, r_{t+1}, done
   (an OBSERVATION, maybe not the full state; lesson 03)
Notice the return arrow says observation, not state. In lesson 01 we assumed they were the same — a Markov state. They often aren't, and that gap is the whole of lesson 03. For now hold the boundary: the agent proposes an action; the environment is the sole authority on what happens next and what it's worth.
The contract: reset and step
Lesson 01's loop ("observe, act, receive reward and next state, repeat") becomes two function calls. This is the contract that Gym-style environments implement across the major RL frameworks (we meet the real Gym/Gymnasium version in lesson 65; here is its skeleton):
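obs, info = env.reset()    # start a fresh episode → first observation (plus an info dict)
done = False               # must exist before the loop tests it
while not done:
    a = policy(obs)        # the AGENT's job
    obs, reward, terminated, truncated, info = env.step(a)
    done = terminated or truncated   # either flag ends the loop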
Two calls, and they map one-to-one onto the MDP. reset() samples a start state s_0 and returns its observation. step(a) does four things at once: sample s' ∼ P(·|s,a), compute r = R(s,a), decide whether the episode is over, and return the observation of s'. The agent never touches P or R; it only ever sees what step chooses to reveal. That asymmetry is what makes the problem reinforcement learning and not optimization.
One subtlety the contract hides: step samples the next state from P(·|s,a). In general the same action in the same state can land you in different places: wheels slip, a card is drawn, an opponent moves. Our corridor below is deterministic for clarity, but many realistic environments are stochastic, which is exactly why the agent maximizes expected return (lesson 01) and why two runs of one policy can diverge.
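The reset/step contract can be made concrete with a minimal sketch of such a deterministic corridor. Everything here is an illustrative assumption rather than the lesson's own implementation: the class name Corridor, the helper run_episode, the default length of 5 cells, the step limit of 20, and the choice to have reset return an (observation, info) pair as Gymnasium does. The observation is simply the agent's position, so for this toy the observation is the full state.

```python
from dataclasses import dataclass, field


@dataclass
class Corridor:
    """Hypothetical 1-D corridor: cells 0..length-1, goal at the right end."""
    length: int = 5        # assumed size, not from the lesson
    max_steps: int = 20    # assumed step limit
    _pos: int = field(default=0, init=False)
    _t: int = field(default=0, init=False)

    def reset(self):
        """Start a fresh episode at the left end; return (observation, info)."""
        self._pos, self._t = 0, 0
        return self._pos, {}

    def step(self, action):
        """action: 0 = left, 1 = right. Returns the five-value step tuple."""
        move = 1 if action == 1 else -1
        # Deterministic dynamics P: move one cell, clipped at the walls.
        self._pos = min(max(self._pos + move, 0), self.length - 1)
        self._t += 1
        terminated = self._pos == self.length - 1   # the MDP itself ended
        reward = 1.0 if terminated else 0.0         # R: +1 only at the goal
        # External cutoff: only counts if the goal was not reached this step.
        truncated = (not terminated) and self._t >= self.max_steps
        return self._pos, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Drive one episode; return (total reward, steps, terminated, truncated)."""
    obs, _ = env.reset()
    total, steps, terminated, truncated = 0.0, 0, False, False
    while not (terminated or truncated):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        steps += 1
    return total, steps, terminated, truncated


always_right = lambda obs: 1
always_left = lambda obs: 0
print(run_episode(Corridor(), always_right))  # (1.0, 4, True, False)
print(run_episode(Corridor(), always_left))   # (0.0, 20, False, True)
```

The two policies end their episodes for different reasons: walking right reaches the goal and terminates, while walking left never gets anywhere and is cut off by the step limit. The agent-side code never reads _pos or the reward rule directly; it only sees what step returns.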
Two type signatures: the action and observation spaces
Before you can call step(a) you must know what an a is, and what shape the observation comes back in. Those two types — the action space 𝒜 and the observation space 𝒪 — are the first thing you read off any environment, because the action space decides which family of algorithms you are even allowed to use.
| Space | Looks like | Example env | Which methods fit |
|---|---|---|---|
| Discrete(n) | one of n choices: {0, 1, …, n−1} | gridworld moves, Atari buttons, "buy/hold/sell" | value-based (you can compute argmax_a Q(s, a)); lesson 05 |
| Box (continuous) | a real vector, e.g. [−1, 1]^k | joint torques, steering, portfolio weights | policy-based / actor-critic; argmax is intractable; lesson 18 |
| MultiDiscrete / Dict | several discrete or named sub-spaces | strategy games, tool-calling agents | factored or structured policies |
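To see why the action space pre-selects the algorithm, here is a small sketch of the first two rows of the table. The Discrete and Box classes below are hypothetical stand-ins written for this example, not a real library's API, and greedy_action is an illustrative helper. The point is that a greedy choice over a discrete space is a plain enumeration, while a continuous box offers nothing to enumerate.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Discrete:
    """Hypothetical stand-in for a Discrete(n) space: actions 0..n-1."""
    n: int

    def sample(self, rng):
        return rng.randrange(self.n)

    def contains(self, a):
        return isinstance(a, int) and 0 <= a < self.n


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a continuous box [low, high]^k."""
    low: float
    high: float
    k: int

    def sample(self, rng):
        return [rng.uniform(self.low, self.high) for _ in range(self.k)]

    def contains(self, a):
        return len(a) == self.k and all(self.low <= x <= self.high for x in a)


def greedy_action(q_values):
    """argmax over a Discrete space: enumerate the n entries, keep the best."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
moves = Discrete(2)              # e.g. left / right
torques = Box(-1.0, 1.0, k=3)    # e.g. three joint torques

print(moves.contains(moves.sample(rng)))      # True
print(torques.contains(torques.sample(rng)))  # True
print(greedy_action([0.1, 0.7]))              # 1
```

There is no counterpart to greedy_action for Box: the set of candidate actions is uncountable, so the policy has to output an action directly instead of scoring every option.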
Episodes: where the loop stops — and the bug that hides here
Lesson 01 introduced a terminal state that ends the loop and pins its own value to 0. Real environments end episodes for two different reasons, and modern APIs return them as two separate flags because conflating them quietly corrupts learning:
- terminated — the MDP itself reached an end: the goal, a win, a crash, death. This is a true terminal state; there is genuinely no future, so its value is 0.
- truncated — an external cutoff fired: a step limit, a wall-clock timeout, the simulator was reset for batching. The MDP did not end — the pole was still balancing, the robot was still walking — you just stopped looking.
Why split hairs? Because of how value is learned. From lesson 01, the value of a state is the immediate reward plus the discounted value of where you go next: target = r + γ·V(s'). At a true terminal there is no "next," so the bootstrap term vanishes: target = r. At a truncation, the next state is perfectly real — you just aren't visiting it — so you must keep the bootstrap: target = r + γ·V(s').
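The two targets, and what happens when the flags are conflated, can be sketched in a few lines. The function names and the numbers (γ = 0.9, V(s') = 0.5) are illustrative assumptions, not values from the lesson.

```python
def td_target(reward, gamma, v_next, terminated):
    """One-step value target: bootstrap unless the MDP truly ended."""
    return reward if terminated else reward + gamma * v_next


def td_target_conflated(reward, gamma, v_next, terminated, truncated):
    """The buggy variant: treats a truncation like a true terminal."""
    done = terminated or truncated
    return reward if done else reward + gamma * v_next


gamma, v_next = 0.9, 0.5   # illustrative numbers, not from the lesson

# Final transition hit the step limit: reward 0, but the next state still has value.
correct = td_target(0.0, gamma, v_next, terminated=False)
buggy = td_target_conflated(0.0, gamma, v_next, terminated=False, truncated=True)
print(correct, buggy)      # 0.45 0.0

# Final transition reached the goal: no future, so no bootstrap either way.
print(td_target(1.0, gamma, v_next, terminated=True))   # 1.0
```

Only the terminated flag gates the bootstrap term. The conflated version silently reports a value of 0 for a state that was in fact worth 0.45, which is how states near the time limit come to look worse than they are.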
Getting this wrong is a one-line mistake (done = terminated or truncated fed straight into the value target) that produces a subtly broken, timid policy and no error message. The widget below makes the dropped term visible.
Interactive · the environment contract, one step at a time
A 1-D corridor environment. reset() puts the agent at the left; each step moves it left or right; reaching the right end pays +1 and terminates; hitting the step limit truncates. Step through it and watch the contract tick: observation, reward, and the two end-flags. Then read the two value targets for the final transition — and flip "treat truncation as terminal" to watch the bootstrap term get thrown away exactly when the episode hit the clock, not the goal.
Where this goes next
You now have the environment as a concrete object: a reset/step contract, an action space that pre-selects your algorithm, an observation space, and an episode boundary with two honest end-conditions. Two pieces of that object are deep enough to get their own lessons. The observation it hands back may not be the Markov state lesson 01 assumed — that is lesson 03. And the reward it returns is a design decision that can quietly break everything — that is lesson 04.