RL overview — the MDP & the agent–environment loop
One scalar of feedback, a loop that runs forever, and the single recursive equation that ties a future you can't see to a number you can compute. This lesson defines the symbols the other 31 reuse.
Start from the only thing that's different
Supervised learning hands you the answer. For every input there is a label, the gradient points straight at it, and learning is "make the output match." Reinforcement learning takes that crutch away. You act, and the world replies with a single number — a reward — that says better or worse, never here is what you should have done.
That one swap — evaluative feedback instead of instructive feedback — is the entire subject. It is why RL is hard, and it forces three problems that supervised learning never has to face:
- Exploration. You only see the reward of the action you took. To know whether something else is better, you must try it.
- Credit assignment. Reward often arrives late — a chess game is won at move 40, not 12. Which earlier actions deserve the credit?
- Acting vs. learning. Every step you spend exploring is a step you didn't spend exploiting what you already know.
The loop
The world and the learner take turns. The learner is the agent; everything else is the environment. At each timestep t the agent sees a state, picks an action, and the environment answers with a reward and the next state. Then it repeats.
action a_t
┌───────────────────────►┌─────────────┐
┌──┴──┐ │ │
│agent│ │ environment │
└──┬──┘ │ │
◄───────────────────────┐└──────┬──────┘
state s_{t+1}, reward r_{t+1} │
(and the loop runs again from s_{t+1})
That picture is the agent–environment loop, and it is all of RL's mechanics. Pseudocode:
observe s_0
for t = 0, 1, 2, ...:
a_t ~ π(·|s_t) # the policy chooses an action
r_{t+1}, s_{t+1} ~ env # environment returns reward and next state
# ... learn from (s_t, a_t, r_{t+1}, s_{t+1}) ...
The five symbols (used identically across the course)
Everything below is reused by lessons 02–32. Learn the notation once here.
- s ∈ 𝒮 — the state. What the agent observes; enough to decide.
- a ∈ 𝒜 — the action. What the agent does.
- r — the reward. The scalar the environment returns. We write the expected reward of taking a in s as R(s,a).
- P(s′ | s, a) — the transition. The probability the environment lands in s′ after a in s. Together (𝒮, 𝒜, P, R, γ) is a Markov decision process (MDP).
- γ ∈ [0, 1) — the discount. How much a reward one step later is worth now. This is the knob the widget below lets you drag.
The Markov property
The "Markov" in MDP is a promise about the state: the future depends on the past only through the present state. Formally,
If that holds, the agent never needs to remember history — the current state is a sufficient statistic. This is what makes the recursive Bellman equation below even possible. When it fails (you can't see everything that matters), you have a partially observable MDP; we'll meet that later (financial trading, lesson 24). For now: a good state is a Markov state.
Trajectory, return, and why we discount
Run the loop and you produce a trajectory τ = (s0, a0, r1, s1, a1, r2, …). The agent does not care about any single reward; it cares about the total it can collect from now on. That total, discounted, is the return:
Why discount at all? Two reasons, one mathematical and one human. Mathematically, if rewards never stop and γ < 1, the geometric weighting keeps the sum finite: a bounded reward |r| ≤ rmax gives |Gt| ≤ rmax / (1 − γ). Push γ → 1 in a loop that never ends and that bound explodes. Intuitively, γ sets a horizon: it is roughly how many steps into the future the agent "sees." Small γ is myopic; large γ is far-sighted. The widget makes this concrete — and makes the explosion visible.
Policy
The agent's behavior is its policy: a (possibly random) rule for choosing actions from states.
A policy can be deterministic (always the same action) or stochastic. When we later learn it as a neural net with weights θ, we write πθ. The agent's entire job is to find a good policy — and "good" needs a number.
Value: turning "good policy" into a number
How good is a policy? Run it and measure the return it earns. Because returns are random (the policy and the environment both flip coins), we take the expectation. Two value functions do this — one keyed on a state, one on a state–action pair.
The state-value Vπ(s) — expected return if you start in s and follow π forever:
The action-value Qπ(s,a) — expected return if you start in s, take a right now (even if π wouldn't have), then follow π:
They are linked by the obvious averaging step — V is Q averaged over the actions the policy would pick: Vπ(s) = ∑a π(a|s) Qπ(s,a). Keep Q in mind: knowing Qπ lets you compare actions, which is the seed of the value-based family.
The goal: the optimal policy π*
Among all policies there is at least one that is best in every state at once — the optimal policy π*, with value functions V* and Q* that no other policy beats:
That is the entire objective of reinforcement learning, stated once. Every algorithm in this course is a route to π*.
The Bellman expectation equation
The return is an infinite sum reaching into a future you can't see. The escape is a self-reference: the return from now is the immediate reward plus the discounted return from the next state. Taking expectations of Gt = rt+1 + γ Gt+1 gives the Bellman expectation equation:
and the same idea for actions:
This is the workhorse of the whole field. It turns an infinite-horizon problem into a one-step consistency condition that value must satisfy: the value of where I am must equal what I get now plus the discounted value of where I go next. Lesson 02 swaps the ∑a π(a|s) averaging for a maxa — the Bellman optimality equation — which removes the policy from the equation entirely and lets us learn Q* directly. That single change is the next lesson.
The fork — the spine of everything that follows
There are exactly two ways to extract a good policy from an MDP, and they organize the entire course.
(A) Value-based learns a value function (the Qπ above) and lets the policy fall out of it: in any state, take the action with the highest value. (B) Policy-based skips the value and learns the policy πθ directly, adjusting θ to push up the return. The two keep reconverging — Actor-Critic, GAE, PPO are all "a learned policy with a value function critiquing it." Lesson 02 takes branch (A); lesson 03 takes branch (B).
Interactive · discounting in a 5×5 gridworld
A robot lives on a 5×5 grid. The green cell is a +1 goal; every other step pays 0. We run policy evaluation (sweeping the Bellman expectation equation above) under a uniform-random policy and shade each cell by its value Vπ(s). Drag γ and watch value propagate out from the goal: small γ keeps it local (myopic), large γ spreads it across the whole grid (far-sighted). The number on each cell is its value; the bar shows the effective horizon 1/(1−γ).
Now the misbehavior. In the episodic setting the goal is terminal — reaching it ends the episode, so its value is pinned at 0 and the sweeps settle to a finite answer for any γ < 1. Tick looping (non-episodic) and the goal stops being terminal: you collect +1, get bounced back onto the grid, and can collect it again, forever. Now push γ toward 1 and run sweeps — the values diverge, climbing without bound. There is no finite return to converge to, because ∑ γk·1 → ∞ as γ → 1. That is the discount earning its keep: it is the only thing keeping an infinite future finite.