rl_foundations / lessons / 05 · exploration lesson 05 / 32

Exploration vs exploitation — from multi-armed bandits to Thompson sampling

Lessons 02–04 all assumed the experience just appeared: a stream of (s, a, r, s') to learn from. But the agent chooses its own actions, so it chooses its own data. Pick the action you currently think is best and you may never discover a better one. Strip the MDP down to its smallest non-trivial form — one state, no transitions — and that tension becomes the entire problem.

What broke: the data is not given, it is collected

Q-learning (02) updates Q from whatever (s, a, r, s') tuples land in the replay buffer. Policy gradient (03) averages over trajectories the policy itself sampled. MCTS (04) expands the tree along the moves it chose to simulate. In every case the learner is also the data collector — and a learner that always acts on its current best guess will starve every alternative of evidence. To learn what is good you must sometimes do things that look bad. That is exploration; acting on your current best estimate is exploitation; balancing them is the oldest problem in RL.

To study the trade-off without the distraction of states and transitions, collapse the MDP to its purest core.

The K-armed bandit: an MDP with one state

A multi-armed bandit is an MDP with a single state, no transitions, and K actions ("arms"). Each step you pull one arm a ∈ {1, …, K} and receive a reward drawn from that arm's fixed but unknown distribution. Because there is one state, there is no credit assignment over time and no discounting: γ is irrelevant, P(s'|s,a) is trivial. All that remains is the explore/exploit dilemma, naked.

Notation, reduced from lesson 01

The objective is regret, not reward

Maximizing total reward and minimizing regret are the same goal stated two ways, but regret is the more useful lens because it measures the cost of not knowing. The regret after T pulls is the reward you forfeited by not playing the best arm every time:

Regret(T) = T · q*(a*)  −  ∑t=1T q*(at)  =  ∑t=1T Δat,    where Δa = q*(a*) − q*(a)

Every pull of a suboptimal arm adds its gap Δa to cumulative regret. A perfect agent has zero regret; an agent that explores forever has regret growing linearly in T. The bar we want to clear is sublinear regret — Regret(T)/T → 0 — meaning the fraction of mistakes vanishes as you keep playing. The plot in the widget below is exactly this quantity; lower is better.

Strategy 1 — ε-greedy: the same ε from lesson 02

The simplest rule reuses the exploration knob you already met in Q-learning. With probability 1 − ε exploit (pull the arm with the highest current estimate); with probability ε explore (pull a uniformly random arm):

at =   argmaxa Qt(a)   with prob. 1 − ε,     a uniform random arm   with prob. ε

It works, and it is one line of code. But it wastes pulls in two ways. First, its exploration is undirected: when it explores, it picks uniformly at random, spending just as many pulls on an arm it has already proven terrible as on a promising under-sampled one. Second, with a fixed ε it keeps exploring at the same rate forever — so it incurs ≈ ε · Δ̄ regret every step, giving regret that grows linearly in T. (Decaying ε over time fixes the asymptotics but adds a schedule to tune.) We can do better by letting uncertainty decide where to look.

Strategy 2 — optimism: UCB and the √(ln t / n) bonus from lesson 04

The principle is optimism in the face of uncertainty: prefer arms that could be good, where "could" means "we have not pulled them enough to rule it out." Add to each estimate a bonus that is large when the arm is under-explored and shrinks as evidence accumulates. This is the Upper Confidence Bound (UCB1) rule:

at = argmaxa  [   Qt(a)  +  c · √( ln t / Nt(a) )   ]

You have seen that square-root term already. It is exactly the UCT bonus from MCTS in lesson 04 — selection in the tree picks the child maximizing Q + c√(ln(parent visits)/child visits). MCTS is running a UCB bandit at every node; the bandit is the atom, the tree is the molecule.

Why this shape? The bonus is a confidence interval. By a concentration inequality (Hoeffding), the sample mean of an arm pulled N times is within ≈ √(ln t / N) of its truth with high probability. So Q + bonus is a plausible upper bound on the arm's true value. Acting on the upper bound means: either the arm really is that good (great, we exploit it) or it is over-rated and we will pull it, watch the estimate fall, and the bonus shrink — so being wrong is self-correcting. The bonus grows like √(ln t) in the numerator (slowly, so we never stop probing entirely) and falls like 1/√N in the denominator (fast, as evidence piles up). UCB achieves regret growing only as O(ln T) — logarithmic, hence sublinear. The constant c tunes how optimistic to be: bigger c = more exploration.

ε-greedy vs UCB, in one line
ε-greedy explores blindly and at a constant rate (linear regret). UCB explores where the uncertainty is, and tapers as it learns (logarithmic regret). Same goal; directed beats undirected.

Strategy 3 — the Bayesian view: Thompson sampling

UCB summarizes uncertainty with a single bonus. The Bayesian view keeps the whole distribution. Maintain a posterior over each arm's mean given what you have seen, then on each step do something delightfully simple — sample once from each posterior and pull the arm whose sample is highest:

for each arm a:   θ̃a ∼ posterior(q*(a) | data)    →    pull at = argmaxa θ̃a

This is posterior sampling, a.k.a. Thompson sampling. It is automatically self-balancing: an arm gets pulled in proportion to the posterior probability that it is the best arm. A barely-tried arm has a wide posterior, so its sample sometimes comes out high — it gets explored. An arm pulled thousands of times has a tight posterior pinned near its mean — if that mean is the best, it gets exploited; if not, its samples almost never win. Exploration and exploitation fall out of one sample-then-argmax, with no ε and no bonus constant to schedule.

For the common case of Bernoulli rewards (each pull is a 0/1 success), the natural posterior is a Beta distribution. Start each arm with a Beta(α0, β0) prior; after seeing S successes and F failures the posterior is Beta(α0 + S,   β0 + F) — just count and update. The prior strength α0 + β0 is the only knob: weak priors trust early data fast (eager, occasionally over-committing to noise); strong priors stay cautious longer. Thompson sampling matches UCB's logarithmic regret and is often better in practice.

Interactive · race three strategies, watch the regret

Eight arms, Bernoulli rewards with fixed hidden means (one clearly best). Three agents pull in lockstep: ε-greedy, UCB, Thompson. The left plot is cumulative regret over time — flatter is better; a flat line means the agent has locked onto the best arm. The right panel shows each strategy's per-arm pull counts (the gold tick marks the true best arm). Turn the knobs and find the settings that break:

8-armed bandit: ε-greedy vs UCB vs Thompson
Press Run to race them. Watch which regret curve bends flat first. Then push ε to 1.0 (pure exploration — never exploits) or to 0.0 (greedy — can lock onto a wrong arm and never recover). UCB's c and Thompson's prior strength trade caution for speed.
Pulls (t)
0
ε-greedy regret
0.0
UCB regret
0.0
Thompson regret
0.0
Show the core JS (one step of all three)
// arms: hidden Bernoulli means q*[a]; bestQ = max(q*)
function pull(a){ return Math.random() < qStar[a] ? 1 : 0; }   // 0/1 reward
function regretStep(a){ return bestQ - qStar[a]; }              // gap Δ_a added

// ε-greedy: argmax estimate, but random arm with prob ε
function pickEps(Q, N, eps){
  if (Math.random() < eps) return (Math.random()*K)|0;
  return argmax(Q);
}
// UCB1: estimate + c·sqrt(ln t / N)   (the UCT bonus from lesson 04)
function pickUCB(Q, N, t, c){
  let best=-1, bv=-Infinity;
  for (let a=0;a<K;a++){
    if (N[a]===0) return a;                       // try each arm once
    const u = Q[a] + c*Math.sqrt(Math.log(t)/N[a]);
    if (u>bv){ bv=u; best=a; }
  }
  return best;
}
// Thompson: sample each Beta posterior, pull the argmax sample
function pickTS(S, F){
  let best=-1, bv=-Infinity;
  for (let a=0;a<K;a++){
    const s = sampleBeta(a0+S[a], b0+F[a]);
    if (s>bv){ bv=s; best=a; }
  }
  return best;
}

Two failure modes to provoke. ε = 1.0: ε-greedy pulls uniformly at random forever — it never exploits what it learned, so its regret line is a straight ramp at the average gap, the worst of the three by far. ε = 0.0: it becomes pure greedy; whichever arm happens to look best after the first lucky pulls gets pulled forever, and if that was not the true best, regret ramps linearly and never recovers — no mechanism exists to revisit the abandoned arms. UCB and Thompson have no such trap: their exploration is built into the action rule, so they keep probing under-sampled arms automatically and their regret curves bend flat.

The bridge out: contextual bandits

The bandit ignored state entirely. Put a little state back — but still no transitions — and you get the contextual bandit: each step you observe a context x (a feature vector), choose an arm, and the reward depends on both x and the arm. The exploration ideas survive almost unchanged: LinUCB adds the √(ln t / N)-style bonus on top of a linear reward model; Thompson sampling draws from a posterior over the model's weights. This is the formal model behind recommendation systems — context = the user, arm = the item to show, reward = a click — where the cold-start problem is precisely exploration. We return to it in lesson 19. Add transitions back on top of context and you are all the way back to the full MDP of lesson 01: the bandit was the floor of the same building.

Takeaway
Because the agent collects its own data, it must explore to learn and exploit to earn; the multi-armed bandit isolates that trade-off as an MDP with one state, scored by regret. ε-greedy explores blindly at a fixed rate (linear regret, wastes pulls — and breaks at ε=0 or ε=1); UCB explores by optimism, adding the same c·√(ln t / N) bonus that UCT used inside MCTS (logarithmic regret); Thompson sampling explores by posterior sampling, letting one sample-then-argmax balance the two for free. That completes the foundations — the MDP, value-based and policy-based solutions, planning, and exploration. Part II now scales every one of them with deep neural networks, starting with Deep RL.