all lessons / reinforcement learning / 54 · Recommendation systems (上) lesson 54 / 87

Recommendation systems (上) — personalized recommendation

The applications part begins. Each lesson here takes one domain and shows it is a method you already know wearing a costume. Today: a recommender is the lesson 08 bandit with a context vector bolted on — and the cold-start problem is nothing but the explore/exploit dilemma you already met.

What this lesson reuses

The mapping: recommendation IS a contextual bandit

Strip a recommendation feed to its loop. A user arrives. You see some features about them — who they are, what they just did, the time of day. You pick one item to show. They click or they don't. Then the next user arrives. That is the contextual-bandit loop from lesson 08, term for term:

Bandit (lesson 08)RecommenderSymbol
context xuser features: history, demographics, session, timex ∈ ℝd
arm / action athe item shown (from a catalog of K items)a ∈ {1…K}
reward rclick / dwell time / purchaser ∈ {0,1} or ℝ
policy π(a|x)the ranking/serving modelπθ(a|x)

The thing being learned is the expected reward of an item given the context — the click-through rate conditioned on who is asking:

Q(x, a) = 𝔼[ r | context x, item a ]

In the plain bandit, Q(a) was just a number per arm. The only change is the conditioning on x: now each arm carries a small reward model — most simply a linear one, Q(x,a) ≈ θa·x, fit from the clicks logged for that item. This is exactly LinUCB from the end of lesson 08. Everything else — the explore/exploit machinery — is inherited unchanged.

Why not just supervised learning?
Predicting CTR from logged clicks is a supervised problem — if the log is a fair sample of every item under every context. It is not. You only ever logged a reward for the item you chose to show. Items you never show generate no data, so their CTR estimate never improves, so you keep not showing them. The feedback is evaluative, not instructive (the orientation's first sentence), and that bandit nature is what forces exploration into the picture.

Cold-start = the explore/exploit dilemma, renamed

A brand-new user has no history; a brand-new item has no clicks. Both are the same disease: an arm with zero pulls. Lesson 08 told you exactly what happens if you act greedily on a zero-pull arm — you never pull it, its estimate stays at its uninformed prior, and you can lock onto a worse arm forever (the ε=0 failure mode from the 8-armed race).

So cold-start is not a special recsys problem requiring a special recsys trick. It is the explore/exploit trade-off, and the same three tools apply:

The same regret, one state taller
In lesson 08 the regret was "reward lost vs always pulling the best arm." Here it is "clicks lost vs always showing each user their truly-best item." Pure exploitation minimizes regret today on the items you already understand — and maximizes it tomorrow, because you never learned about the rest of the catalog.

The bug at scale: the feedback loop / filter bubble

Here is what makes recommendation worse than a textbook bandit. In lesson 08 the arms' true rewards were fixed and the data was an innocent record of pulls. In a recommender, the policy that picks items also generates the only data the next model is trained on. That is the off-policy hazard from lesson 12 turned into a closed loop:

  greedy policy shows only items it ALREADY scores high
            │
            ▼
  users can only click on what they were shown
            │
            ▼
  logs contain clicks for popular items, ~nothing for the rest
            │
            ▼
  next model learns "popular items work, the rest are unknown/bad"
            │
            ▼
  shows even fewer distinct items  ──►  (loop tightens)

The catalog the user actually sees collapses to a narrow shell — the filter bubble. Two damages compound:

The bug is the lesson
A recommender that maximizes only short-term CTR is the ε=0 greedy bandit. It looks great on next-day metrics and is quietly broken: it stops exploring, starves its own future data, and locks the catalog into a bubble it can never see out of. Maximizing click-through today is the very thing that prevents it from finding the better items that would raise click-through tomorrow.

Interactive · the explore knob — CTR vs catalog coverage

A catalog of items, each with a hidden true CTR. Users arrive with a context; for each user the recommender scores items, then shows one according to the explore rate below. At explore = 0 it is pure greedy: it shows whatever it currently thinks is best. Turn the knob up and it sometimes probes an under-served item instead. Watch two meters move in opposite directions at first — and notice which policy wins on the long-run click rate.

Contextual-bandit recommender: exploit ⇄ explore
Left bars: each item's estimated CTR (height) and how often it has been shown (opacity) — gold marks the item with the highest true CTR. Right plot: realized click-through rate (blue) and catalog coverage = fraction of items shown at least once (green), over the session. Set explore to 0 to watch the filter bubble form; raise it to break out.
Users served
0
CTR (realized)
0.0%
Coverage
0%
Found best item?
Show the JS that runs this widget (≈25 lines)
// Each item i has a hidden true CTR p[i]. We keep an estimate from clicks.
function serve(eps){
  const Q = items.map(it => it.clicks / Math.max(1, it.shown));   // est. CTR
  let a;
  if (Math.random() < eps){                 // EXPLORE: probe a thin item
    a = argmin(items.map(it => it.shown));   // least-served item
  } else {                                   // EXPLOIT: greedy on estimate
    a = argmax(Q);
  }
  const clicked = Math.random() < p[a];      // user reacts to TRUE CTR
  items[a].shown  += 1;
  items[a].clicks += clicked ? 1 : 0;
  totalClicks += clicked; totalServed += 1;  // realized CTR = clicks/served
}

What you should see: at ε = 0 the realized CTR shoots up fast — and coverage flatlines low. The recommender commits to the first item that looks good and shows almost nothing else; if the truly-best item started cold, the "Found best item?" meter stays no and CTR plateaus below what was achievable. Nudge ε to ~0.05–0.15 and coverage climbs, the best item gets discovered, and the long-run CTR overtakes the greedy run. Push ε too high (toward 0.6) and you are back to lesson 08's ε=1 pathology: you spend so many impressions probing that realized CTR sags — over-exploration wastes clicks just as surely as under-exploration loses them.

Map back to the spine

Pin this domain onto the course's value / policy / model map:

Spine conceptIn a recommender
MDP, collapsed to one state (lesson 08)contextual bandit — context in, item out, click back, no transitions
State sthe user context vector x (history, session, time)
Action athe item shown, from a catalog of K arms
Reward rclick / dwell / purchase
Value Q(x,a)conditional CTR — a per-item reward model, e.g. LinUCB's θa·x
Exploration (lesson 08)the cold-start fix: ε-greedy / UCB / Thompson over items
Off-policy data (lesson 12)the feedback loop — today's policy biases tomorrow's logs

So this is value-based RL on the simplest possible MDP: learn Q(x,a), act near-greedily, but keep an exploration bonus alive so the data never goes stale. Two things are deliberately missing, and they are exactly what lesson 55 adds back. First, real feeds show a slate of items, not one, and a click now changes what the user wants later — that is transitions returning, and the bandit grows back into a full MDP where long-term value, not greedy CTR, is the objective. Second, you cannot A/B-test every candidate policy on live users, so you must evaluate off-policy from logs — the importance-sampling and doubly-robust estimators built directly on lesson 12.

Takeaway
A personalized recommender is the lesson-05 multi-armed bandit with a context vector on the input: state = user features, action = item, reward = click, value = conditional CTR Q(x,a). Cold-start is the explore/exploit dilemma in disguise, so the same ε-greedy / UCB / Thompson tools apply. The trap unique to the domain is the feedback loop: a purely greedy recommender starves its own exploration, collapses catalog coverage into a filter bubble, and biases the off-policy logs it must learn from — maximizing CTR today is precisely how it fails to find the better items that would raise CTR tomorrow.