Recommendation systems (上) — personalized recommendation
The applications part begins. Each lesson here takes one domain and shows it is a method you already know wearing a costume. Today: a recommender is the lesson 08 bandit with a context vector bolted on — and the cold-start problem is nothing but the explore/exploit dilemma you already met.
- Contextual bandits — lesson 08. The bridge-out section of 05 promised we would return here. A recommender is a contextual bandit: observe a context, pick one arm, see a reward, repeat — with no state transitions to credit-assign across (that is lesson 55).
- Off-policy thinking — lesson 12. Every recommender learns from logs generated by yesterday's policy, not today's. That gap is exactly the on-policy/off-policy distinction, and it is why a greedy recommender quietly poisons its own training data.
The mapping: recommendation IS a contextual bandit
Strip a recommendation feed to its loop. A user arrives. You see some features about them — who they are, what they just did, the time of day. You pick one item to show. They click or they don't. Then the next user arrives. That is the contextual-bandit loop from lesson 08, term for term:
| Bandit (lesson 08) | Recommender | Symbol |
|---|---|---|
| context x | user features: history, demographics, session, time | x ∈ ℝd |
| arm / action a | the item shown (from a catalog of K items) | a ∈ {1…K} |
| reward r | click / dwell time / purchase | r ∈ {0,1} or ℝ |
| policy π(a|x) | the ranking/serving model | πθ(a|x) |
The thing being learned is the expected reward of an item given the context — the click-through rate conditioned on who is asking:
In the plain bandit, Q(a) was just a number per arm. The only change is the conditioning on x: now each arm carries a small reward model — most simply a linear one, Q(x,a) ≈ θa·x, fit from the clicks logged for that item. This is exactly LinUCB from the end of lesson 08. Everything else — the explore/exploit machinery — is inherited unchanged.
Cold-start = the explore/exploit dilemma, renamed
A brand-new user has no history; a brand-new item has no clicks. Both are the same disease: an arm with zero pulls. Lesson 08 told you exactly what happens if you act greedily on a zero-pull arm — you never pull it, its estimate stays at its uninformed prior, and you can lock onto a worse arm forever (the ε=0 failure mode from the 8-armed race).
So cold-start is not a special recsys problem requiring a special recsys trick. It is the explore/exploit trade-off, and the same three tools apply:
- ε-greedy: with probability ε, show a random (or under-served) item instead of the current best. Blind, but it guarantees every item gets sampled.
- Optimism (LinUCB): add the √(ln t / N) confidence bonus to a new item's predicted CTR, so an unproven item is shown because we are unsure — and the uncertainty self-corrects as clicks arrive.
- Thompson sampling: keep a posterior over each item's CTR and sample; a fresh item with a wide posterior occasionally wins the draw and gets its shot.
The bug at scale: the feedback loop / filter bubble
Here is what makes recommendation worse than a textbook bandit. In lesson 08 the arms' true rewards were fixed and the data was an innocent record of pulls. In a recommender, the policy that picks items also generates the only data the next model is trained on. That is the off-policy hazard from lesson 12 turned into a closed loop:
greedy policy shows only items it ALREADY scores high
│
▼
users can only click on what they were shown
│
▼
logs contain clicks for popular items, ~nothing for the rest
│
▼
next model learns "popular items work, the rest are unknown/bad"
│
▼
shows even fewer distinct items ──► (loop tightens)
The catalog the user actually sees collapses to a narrow shell — the filter bubble. Two damages compound:
- Coverage collapse. Most of the catalog is never shown, so its CTR is never measured. A genuinely better item that happened to start cold is starved of the exposure it needs to prove itself — it can never escape.
- Biased data. Tomorrow's training set is not a sample of the world; it is a sample of what this policy chose to show. Off-policy correction (importance weighting from lesson 12; doubly-robust estimators in lesson 55) exists precisely to undo this bias — but you cannot reweight toward items with zero logged exposure. Exploration has to put the data there in the first place.
Interactive · the explore knob — CTR vs catalog coverage
A catalog of items, each with a hidden true CTR. Users arrive with a context; for each user the recommender scores items, then shows one according to the explore rate below. At explore = 0 it is pure greedy: it shows whatever it currently thinks is best. Turn the knob up and it sometimes probes an under-served item instead. Watch two meters move in opposite directions at first — and notice which policy wins on the long-run click rate.
What you should see: at ε = 0 the realized CTR shoots up fast — and coverage flatlines low. The recommender commits to the first item that looks good and shows almost nothing else; if the truly-best item started cold, the "Found best item?" meter stays no and CTR plateaus below what was achievable. Nudge ε to ~0.05–0.15 and coverage climbs, the best item gets discovered, and the long-run CTR overtakes the greedy run. Push ε too high (toward 0.6) and you are back to lesson 08's ε=1 pathology: you spend so many impressions probing that realized CTR sags — over-exploration wastes clicks just as surely as under-exploration loses them.
Map back to the spine
Pin this domain onto the course's value / policy / model map:
| Spine concept | In a recommender |
|---|---|
| MDP, collapsed to one state (lesson 08) | contextual bandit — context in, item out, click back, no transitions |
| State s | the user context vector x (history, session, time) |
| Action a | the item shown, from a catalog of K arms |
| Reward r | click / dwell / purchase |
| Value Q(x,a) | conditional CTR — a per-item reward model, e.g. LinUCB's θa·x |
| Exploration (lesson 08) | the cold-start fix: ε-greedy / UCB / Thompson over items |
| Off-policy data (lesson 12) | the feedback loop — today's policy biases tomorrow's logs |
So this is value-based RL on the simplest possible MDP: learn Q(x,a), act near-greedily, but keep an exploration bonus alive so the data never goes stale. Two things are deliberately missing, and they are exactly what lesson 55 adds back. First, real feeds show a slate of items, not one, and a click now changes what the user wants later — that is transitions returning, and the bandit grows back into a full MDP where long-term value, not greedy CTR, is the objective. Second, you cannot A/B-test every candidate policy on live users, so you must evaluate off-policy from logs — the importance-sampling and doubly-robust estimators built directly on lesson 12.