Recommendation systems (上) — personalized recommendation

The applications part begins. Each lesson here takes one domain and shows it is a method you already know wearing a costume. Today: a recommender is the lesson 08 bandit with a context vector bolted on — and the cold-start problem is nothing but the explore/exploit dilemma you already met.

What this lesson reuses

Contextual bandits — lesson 08. The bridge-out section of 05 promised we would return here. A recommender is a contextual bandit: observe a context, pick one arm, see a reward, repeat — with no state transitions to credit-assign across (that is lesson 55).
Off-policy thinking — lesson 12. Every recommender learns from logs generated by yesterday's policy, not today's. That gap is exactly the on-policy/off-policy distinction, and it is why a greedy recommender quietly poisons its own training data.

The mapping: recommendation IS a contextual bandit

Strip a recommendation feed to its loop. A user arrives. You see some features about them — who they are, what they just did, the time of day. You pick one item to show. They click or they don't. Then the next user arrives. That is the contextual-bandit loop from lesson 08, term for term:

Bandit (lesson 08)	Recommender	Symbol
context x	user features: history, demographics, session, time	x ∈ ℝ^d
arm / action a	the item shown (from a catalog of K items)	a ∈ {1…K}
reward r	click / dwell time / purchase	r ∈ {0,1} or ℝ
policy π(a\|x)	the ranking/serving model	π_θ(a\|x)

The thing being learned is the expected reward of an item given the context — the click-through rate conditioned on who is asking:

Q(x, a) = 𝔼[ r | context x, item a ]

In the plain bandit, Q(a) was just a number per arm. The only change is the conditioning on x: now each arm carries a small reward model — most simply a linear one, Q(x,a) ≈ θ_a·x, fit from the clicks logged for that item. This is exactly LinUCB from the end of lesson 08. Everything else — the explore/exploit machinery — is inherited unchanged.

Why not just supervised learning?

Predicting CTR from logged clicks is a supervised problem — if the log is a fair sample of every item under every context. It is not. You only ever logged a reward for the item you chose to show. Items you never show generate no data, so their CTR estimate never improves, so you keep not showing them. The feedback is evaluative, not instructive (the orientation's first sentence), and that bandit nature is what forces exploration into the picture.

Cold-start = the explore/exploit dilemma, renamed

A brand-new user has no history; a brand-new item has no clicks. Both are the same disease: an arm with zero pulls. Lesson 08 told you exactly what happens if you act greedily on a zero-pull arm — you never pull it, its estimate stays at its uninformed prior, and you can lock onto a worse arm forever (the ε=0 failure mode from the 8-armed race).

So cold-start is not a special recsys problem requiring a special recsys trick. It is the explore/exploit trade-off, and the same three tools apply:

ε-greedy: with probability ε, show a random (or under-served) item instead of the current best. Blind, but it guarantees every item gets sampled.
Optimism (LinUCB): add the √(ln t / N) confidence bonus to a new item's predicted CTR, so an unproven item is shown because we are unsure — and the uncertainty self-corrects as clicks arrive.
Thompson sampling: keep a posterior over each item's CTR and sample; a fresh item with a wide posterior occasionally wins the draw and gets its shot.

The same regret, one state taller

In lesson 08 the regret was "reward lost vs always pulling the best arm." Here it is "clicks lost vs always showing each user their truly-best item." Pure exploitation minimizes regret today on the items you already understand — and maximizes it tomorrow, because you never learned about the rest of the catalog.

The bug at scale: the feedback loop / filter bubble

Here is what makes recommendation worse than a textbook bandit. In lesson 08 the arms' true rewards were fixed and the data was an innocent record of pulls. In a recommender, the policy that picks items also generates the only data the next model is trained on. That is the off-policy hazard from lesson 12 turned into a closed loop:

  greedy policy shows only items it ALREADY scores high
            │
            ▼
  users can only click on what they were shown
            │
            ▼
  logs contain clicks for popular items, ~nothing for the rest
            │
            ▼
  next model learns "popular items work, the rest are unknown/bad"
            │
            ▼
  shows even fewer distinct items  ──►  (loop tightens)

The catalog the user actually sees collapses to a narrow shell — the filter bubble. Two damages compound:

Coverage collapse. Most of the catalog is never shown, so its CTR is never measured. A genuinely better item that happened to start cold is starved of the exposure it needs to prove itself — it can never escape.
Biased data. Tomorrow's training set is not a sample of the world; it is a sample of what this policy chose to show. Off-policy correction (importance weighting from lesson 12; doubly-robust estimators in lesson 55) exists precisely to undo this bias — but you cannot reweight toward items with zero logged exposure. Exploration has to put the data there in the first place.

The bug is the lesson

A recommender that maximizes only short-term CTR is the ε=0 greedy bandit. It looks great on next-day metrics and is quietly broken: it stops exploring, starves its own future data, and locks the catalog into a bubble it can never see out of. Maximizing click-through today is the very thing that prevents it from finding the better items that would raise click-through tomorrow.

Interactive · the explore knob — CTR vs catalog coverage

A catalog of items, each with a hidden true CTR. Users arrive with a context; for each user the recommender scores items, then shows one according to the explore rate below. At explore = 0 it is pure greedy: it shows whatever it currently thinks is best. Turn the knob up and it sometimes probes an under-served item instead. Watch two meters move in opposite directions at first — and notice which policy wins on the long-run click rate.

Contextual-bandit recommender: exploit ⇄ explore

Left bars: each item's estimated CTR (height) and how often it has been shown (opacity) — gold marks the item with the highest true CTR. Right plot: realized click-through rate (blue) and catalog coverage = fraction of items shown at least once (green), over the session. Set explore to 0 to watch the filter bubble form; raise it to break out.

explore ε: 0.00

Users served

CTR (realized)

0.0%

Coverage

Found best item?

—

Show the JS that runs this widget (≈25 lines)

// Each item i has a hidden true CTR p[i]. We keep an estimate from clicks.
function serve(eps){
  const Q = items.map(it => it.clicks / Math.max(1, it.shown));   // est. CTR
  let a;
  if (Math.random() < eps){                 // EXPLORE: probe a thin item
    a = argmin(items.map(it => it.shown));   // least-served item
  } else {                                   // EXPLOIT: greedy on estimate
    a = argmax(Q);
  }
  const clicked = Math.random() < p[a];      // user reacts to TRUE CTR
  items[a].shown  += 1;
  items[a].clicks += clicked ? 1 : 0;
  totalClicks += clicked; totalServed += 1;  // realized CTR = clicks/served
}

What you should see: at ε = 0 the realized CTR shoots up fast — and coverage flatlines low. The recommender commits to the first item that looks good and shows almost nothing else; if the truly-best item started cold, the "Found best item?" meter stays no and CTR plateaus below what was achievable. Nudge ε to ~0.05–0.15 and coverage climbs, the best item gets discovered, and the long-run CTR overtakes the greedy run. Push ε too high (toward 0.6) and you are back to lesson 08's ε=1 pathology: you spend so many impressions probing that realized CTR sags — over-exploration wastes clicks just as surely as under-exploration loses them.

Map back to the spine

Pin this domain onto the course's value / policy / model map:

Spine concept	In a recommender
MDP, collapsed to one state (lesson 08)	contextual bandit — context in, item out, click back, no transitions
State s	the user context vector x (history, session, time)
Action a	the item shown, from a catalog of K arms
Reward r	click / dwell / purchase
Value Q(x,a)	conditional CTR — a per-item reward model, e.g. LinUCB's θ_a·x
Exploration (lesson 08)	the cold-start fix: ε-greedy / UCB / Thompson over items
Off-policy data (lesson 12)	the feedback loop — today's policy biases tomorrow's logs

So this is value-based RL on the simplest possible MDP: learn Q(x,a), act near-greedily, but keep an exploration bonus alive so the data never goes stale. Two things are deliberately missing, and they are exactly what lesson 55 adds back. First, real feeds show a slate of items, not one, and a click now changes what the user wants later — that is transitions returning, and the bandit grows back into a full MDP where long-term value, not greedy CTR, is the objective. Second, you cannot A/B-test every candidate policy on live users, so you must evaluate off-policy from logs — the importance-sampling and doubly-robust estimators built directly on lesson 12.

Takeaway

A personalized recommender is the lesson-05 multi-armed bandit with a context vector on the input: state = user features, action = item, reward = click, value = conditional CTR Q(x,a). Cold-start is the explore/exploit dilemma in disguise, so the same ε-greedy / UCB / Thompson tools apply. The trap unique to the domain is the feedback loop: a purely greedy recommender starves its own exploration, collapses catalog coverage into a filter bubble, and biases the off-policy logs it must learn from — maximizing CTR today is precisely how it fails to find the better items that would raise CTR tomorrow.