Exploration vs exploitation — from multi-armed bandits to Thompson sampling
Lessons 02–04 all assumed the experience just appeared: a stream of (s, a, r, s') to learn from. But the agent chooses its own actions, so it chooses its own data. Pick the action you currently think is best and you may never discover a better one. Strip the MDP down to its smallest non-trivial form — one state, no transitions — and that tension becomes the entire problem.
What broke: the data is not given, it is collected
Q-learning (02) updates Q from whatever (s, a, r, s') tuples land in the replay buffer. Policy gradient (03) averages over trajectories the policy itself sampled. MCTS (04) expands the tree along the moves it chose to simulate. In every case the learner is also the data collector — and a learner that always acts on its current best guess will starve every alternative of evidence. To learn what is good you must sometimes do things that look bad. That is exploration; acting on your current best estimate is exploitation; balancing them is the oldest problem in RL.
To study the trade-off without the distraction of states and transitions, collapse the MDP to its purest core.
The K-armed bandit: an MDP with one state
A multi-armed bandit is an MDP with a single state, no transitions, and K actions ("arms"). Each step you pull one arm a ∈ {1, …, K} and receive a reward drawn from that arm's fixed but unknown distribution. Because there is one state, there is no credit assignment over time and no discounting: γ is irrelevant, P(s'|s,a) is trivial. All that remains is the explore/exploit dilemma, naked.
- q*(a) = 𝔼[ r | a ] — the arm's true mean reward (hidden). This is just Q(s,a) with the single state dropped.
- Qt(a) — our estimate of q*(a) after t pulls, e.g. the running sample mean.
- Nt(a) — how many times arm a has been pulled so far.
- a* = argmaxa q*(a) — the best arm (unknown to the agent).
The objective is regret, not reward
Maximizing total reward and minimizing regret are the same goal stated two ways, but regret is the more useful lens because it measures the cost of not knowing. The regret after T pulls is the reward you forfeited by not playing the best arm every time:
Every pull of a suboptimal arm adds its gap Δa to cumulative regret. A perfect agent has zero regret; an agent that explores forever has regret growing linearly in T. The bar we want to clear is sublinear regret — Regret(T)/T → 0 — meaning the fraction of mistakes vanishes as you keep playing. The plot in the widget below is exactly this quantity; lower is better.
Strategy 1 — ε-greedy: the same ε from lesson 02
The simplest rule reuses the exploration knob you already met in Q-learning. With probability 1 − ε exploit (pull the arm with the highest current estimate); with probability ε explore (pull a uniformly random arm):
It works, and it is one line of code. But it wastes pulls in two ways. First, its exploration is undirected: when it explores, it picks uniformly at random, spending just as many pulls on an arm it has already proven terrible as on a promising under-sampled one. Second, with a fixed ε it keeps exploring at the same rate forever — so it incurs ≈ ε · Δ̄ regret every step, giving regret that grows linearly in T. (Decaying ε over time fixes the asymptotics but adds a schedule to tune.) We can do better by letting uncertainty decide where to look.
Strategy 2 — optimism: UCB and the √(ln t / n) bonus from lesson 04
The principle is optimism in the face of uncertainty: prefer arms that could be good, where "could" means "we have not pulled them enough to rule it out." Add to each estimate a bonus that is large when the arm is under-explored and shrinks as evidence accumulates. This is the Upper Confidence Bound (UCB1) rule:
You have seen that square-root term already. It is exactly the UCT bonus from MCTS in lesson 04 — selection in the tree picks the child maximizing Q + c√(ln(parent visits)/child visits). MCTS is running a UCB bandit at every node; the bandit is the atom, the tree is the molecule.
Why this shape? The bonus is a confidence interval. By a concentration inequality (Hoeffding), the sample mean of an arm pulled N times is within ≈ √(ln t / N) of its truth with high probability. So Q + bonus is a plausible upper bound on the arm's true value. Acting on the upper bound means: either the arm really is that good (great, we exploit it) or it is over-rated and we will pull it, watch the estimate fall, and the bonus shrink — so being wrong is self-correcting. The bonus grows like √(ln t) in the numerator (slowly, so we never stop probing entirely) and falls like 1/√N in the denominator (fast, as evidence piles up). UCB achieves regret growing only as O(ln T) — logarithmic, hence sublinear. The constant c tunes how optimistic to be: bigger c = more exploration.
Strategy 3 — the Bayesian view: Thompson sampling
UCB summarizes uncertainty with a single bonus. The Bayesian view keeps the whole distribution. Maintain a posterior over each arm's mean given what you have seen, then on each step do something delightfully simple — sample once from each posterior and pull the arm whose sample is highest:
This is posterior sampling, a.k.a. Thompson sampling. It is automatically self-balancing: an arm gets pulled in proportion to the posterior probability that it is the best arm. A barely-tried arm has a wide posterior, so its sample sometimes comes out high — it gets explored. An arm pulled thousands of times has a tight posterior pinned near its mean — if that mean is the best, it gets exploited; if not, its samples almost never win. Exploration and exploitation fall out of one sample-then-argmax, with no ε and no bonus constant to schedule.
For the common case of Bernoulli rewards (each pull is a 0/1 success), the natural posterior is a Beta distribution. Start each arm with a Beta(α0, β0) prior; after seeing S successes and F failures the posterior is Beta(α0 + S, β0 + F) — just count and update. The prior strength α0 + β0 is the only knob: weak priors trust early data fast (eager, occasionally over-committing to noise); strong priors stay cautious longer. Thompson sampling matches UCB's logarithmic regret and is often better in practice.
Interactive · race three strategies, watch the regret
Eight arms, Bernoulli rewards with fixed hidden means (one clearly best). Three agents pull in lockstep: ε-greedy, UCB, Thompson. The left plot is cumulative regret over time — flatter is better; a flat line means the agent has locked onto the best arm. The right panel shows each strategy's per-arm pull counts (the gold tick marks the true best arm). Turn the knobs and find the settings that break:
Two failure modes to provoke. ε = 1.0: ε-greedy pulls uniformly at random forever — it never exploits what it learned, so its regret line is a straight ramp at the average gap, the worst of the three by far. ε = 0.0: it becomes pure greedy; whichever arm happens to look best after the first lucky pulls gets pulled forever, and if that was not the true best, regret ramps linearly and never recovers — no mechanism exists to revisit the abandoned arms. UCB and Thompson have no such trap: their exploration is built into the action rule, so they keep probing under-sampled arms automatically and their regret curves bend flat.
The bridge out: contextual bandits
The bandit ignored state entirely. Put a little state back — but still no transitions — and you get the contextual bandit: each step you observe a context x (a feature vector), choose an arm, and the reward depends on both x and the arm. The exploration ideas survive almost unchanged: LinUCB adds the √(ln t / N)-style bonus on top of a linear reward model; Thompson sampling draws from a posterior over the model's weights. This is the formal model behind recommendation systems — context = the user, arm = the item to show, reward = a click — where the cold-start problem is precisely exploration. We return to it in lesson 19. Add transitions back on top of context and you are all the way back to the full MDP of lesson 01: the bandit was the floor of the same building.