
Recommender-system cold start

A brand-new user opens the app. You have no click history, roughly ten interactions before they decide whether to come back, and a real-time budget of tens of milliseconds per recommendation. Casting this as RL is natural — recommend, observe, update — but every piece of the MDP is awkward: the state starts empty and must be built on-device from non-sensitive priors, the reward (a purchase or next-day return) arrives tens of steps late through a conversion funnel, and the very exploration that learns a cold user faster is what scares them away. Each tension names a mechanism.

The method — five steps, every lesson
Applied RL is the same loop in every domain. (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one or two properties that make this MDP hard. (3) Engineer the mechanism that removes exactly that difficulty. (4) Guard it in production — detect when it breaks, monitor, fall back. (5) Iterate. For cold start the difficulties are an empty, low-dimensional state under a privacy/latency budget, a sparse and delayed funnel reward, and an exploration–retention conflict. We run the loop on each.

1 · Formulate — the MDP behind a cold-start session

Intuition. The platform shows a new user a feed, watches what they touch, and updates. That is an MDP: the user's situation is the state, the recommended slate is the action, engagement (and eventually a purchase or a return visit) is the reward, and the user's reaction plus the world is the transition. The catch is that at step 0 the state is almost empty, the reward you actually care about will not arrive for many steps, and the horizon is brutally short — about ten interactions to prove the app is worth keeping.

MDP = (S, A, R, P, γ)  ·  maximize  E[ Σₜ γᵗ rₜ ]  ·  horizon H ≈ 10 first interactions
Piece | For a cold-start recommender | The awkward part
--- | --- | ---
State S | at step 0: only signup-channel, device, IP-province, hour-of-day; then a short behaviour sequence builds up | starts empty & low-dimensional; must be assembled on-device without uploading raw PII, in <30 ms
Action A | the slate of items to show, chosen from a huge, mostly-unseen catalogue | most items are themselves cold → every action is partly an exploration decision
Reward R | click (weak), add-to-cart (medium), purchase / next-day return (the real signal) | sparse & delayed: the payment that matters can be 30 steps downstream, or arrive after the session ends
Transition P | the user's reaction, drifting interests, and operations campaigns (coupons, promos) | partially observed & non-stationary; confounders (a coupon popup) can fake the credit

Same pattern as every lesson: the MDP table writes itself, and the rightmost column is the lesson. Three rows each carry a named difficulty — empty state, delayed reward, exploration risk — and each has a mechanism that answers it.
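
As a concrete reading of the objective E[ Σₜ γᵗ rₜ ] over a roughly ten-step horizon, here is a minimal sketch. The `Step` record, its field names, and the default γ = 0.9 are illustrative assumptions, not part of the lesson.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One interaction of a cold-start session (field names are illustrative)."""
    state: List[float]  # feature vector available before recommending
    action: int         # index of the slate that was shown
    reward: float       # click / cart / purchase signal observed at this step


def discounted_return(steps: List[Step], gamma: float = 0.9) -> float:
    """Sum of gamma**t * r_t over the session, truncated at the session length."""
    return sum((gamma ** t) * s.reward for t, s in enumerate(steps))


# A 10-step session whose only reward (a purchase worth 1.0) arrives at the last step.
session = [Step(state=[0.0], action=0, reward=0.0) for _ in range(9)]
session.append(Step(state=[0.0], action=0, reward=1.0))
print(discounted_return(session))  # equals 0.9 ** 9
```

With the reward landing only at step 9, the first action sees it discounted by 0.9⁹, which is one way to see why a delayed funnel reward gives early recommendations so little direct signal.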

2 · Diagnose — three difficulties that bind, in order

Intuition. Cold start is not one hard thing; it is three, and they fire in sequence. First you cannot even represent the user (empty state). Once you can, you cannot tell which recommendation earned the eventual purchase (delayed reward). Once you can assign credit, you discover that the exploration needed to learn fast is precisely what makes a fragile new user churn (exploration–retention conflict). Solve them in that order — a clever credit-assignment scheme is worthless if your state vector is noise, and a brilliant explorer is worthless if it drives users away before it learns anything.

The three binding difficulties
(D1) Empty, constrained state. Zero history, a privacy regime that forbids shipping raw behaviour, and a hard sub-30 ms inference budget. (D2) Sparse, delayed funnel reward. The honest reward (purchase, next-day retention) is rare and arrives many steps after the action that caused it; clicks are dense but only weakly correlated with value. (D3) Exploration–retention conflict. A cold user is information-poor, so exploration has high value — but a new user is also the most likely to churn after one bad recommendation, so exploration has high cost.

3 · Engineer — build a state from nothing

Intuition. If you have no history, manufacture a usable state out of three things you legally do have: priors available at step 0, a compressed summary of the few interactions you collect, and a counter telling the network how much exploration budget is left. The trick is that none of this requires uploading raw, sensitive data — the vector is assembled on the device.

Engineering detail — the three-stage state. A concrete, ship-ready representation that is usable from step 0, stays under 30 ms, and is privacy-compliant:

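$s_0 = [\, v_{\text{channel}},\; v_{\text{device}},\; v_{\text{ip-prov}},\; t_{\text{hour}} \,]$  →  3×32-dim embeddings + 24-dim one-hot ≈ 128-dim
$z = \sum_{i=1}^{k} \gamma^{\,k-i}\, e_i,\quad \gamma = 0.9,\quad k \le 10$  →  Linear(128) → ReLU $= v_{\text{behav}}$
$s = [\, s_0 \,;\; v_{\text{behav}} \,;\; k/10 \,]$  ≈  257-dim   (k/10 tells the net how much budget remains)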
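
The three stages can be sketched in NumPy as follows. This is a rough illustration under stated assumptions: the embedding tables are random stand-ins for learned ones, the vocabulary sizes (8 channels, 4 devices, 34 provinces) are hypothetical, and item embeddings are assumed to be 32-dimensional. With exactly 3×32 + 24 prior dimensions the vector comes out at 249 rather than the rounded ≈257 quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 32      # width of each prior embedding (from the lesson)
K_MAX = 10    # interaction budget
DECAY = 0.9   # pooling decay (the gamma in the formula for z)

# Hypothetical embedding tables; in a real system these would be learned.
channel_emb = rng.normal(size=(8, EMB))
device_emb = rng.normal(size=(4, EMB))
province_emb = rng.normal(size=(34, EMB))
W = rng.normal(size=(EMB, 128)) * 0.1  # Linear(128) weights, assuming 32-dim item embeddings
b = np.zeros(128)


def prior_state(channel, device, province, hour):
    """Stage 1: step-0 priors, three embeddings plus a 24-dim hour one-hot."""
    hour_onehot = np.zeros(24)
    hour_onehot[hour] = 1.0
    return np.concatenate(
        [channel_emb[channel], device_emb[device], province_emb[province], hour_onehot]
    )


def behaviour_vector(item_embs):
    """Stage 2: decay-weighted sum of item embeddings, then Linear -> ReLU."""
    k = len(item_embs)
    z = np.zeros(EMB)
    for i, e in enumerate(item_embs, start=1):
        z += DECAY ** (k - i) * e  # the most recent item gets weight 1
    return np.maximum(z @ W + b, 0.0)


def full_state(channel, device, province, hour, item_embs):
    """Stage 3: concatenate priors, behaviour summary, and the k/10 counter."""
    item_embs = item_embs[-K_MAX:]  # keep at most the last K_MAX interactions
    k = len(item_embs)
    return np.concatenate(
        [prior_state(channel, device, province, hour), behaviour_vector(item_embs), [k / K_MAX]]
    )


s = full_state(0, 1, 2, 13, [rng.normal(size=EMB) for _ in range(3)])
print(s.shape)
```

Because every input is either a coarse categorical prior or an embedding of an item the user just touched, nothing in this sketch requires raw behaviour logs to leave the device.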
Dimension is not free — the cold-start regret bound
More state dimensions look like more signal, but with only ~10 interactions they are mostly variance. In the linear-bandit setting, reaching an ε-optimal policy needs the state dimension d and horizon T to satisfy roughly d ≤ O(√T). Since √10 ≈ 3.2, the ~32 ceiling quoted for a 10-step budget depends on the constant hidden in the O(·); treat it as a rule of thumb rather than a derived threshold, beyond which the minimax regret bound degrades sharply. Shipping 257 dims works only because a deep model's generalisation buys headroom, and even then you must add dropout or other regularisation or you simply overfit ten clicks. The lesson: justify every dimension against the horizon; do not pile them on.

4 · Engineer — assign credit across a delayed funnel

Intuition. The reward you care about — a purchase, or the user coming back tomorrow — lands far downstream of the recommendation that caused it, and most steps produce nothing at all. Three moves turn this delayed, sparse signal into something a policy can learn from: give honest intermediate credit (shaping), correct for the gap between the logging policy and the policy you are training (off-policy correction), and bound the variance of the long return (truncation).

Engineering detail — potential-based, GMV-anchored shaping. Model the whole funnel as a POMDP whose true reward fires only on payment. Give intermediate signal scaled to the eventual payoff and decayed by step, so early actions are not over-credited:

$r_{\text{click}} = 0.01 \cdot R_{\text{pay}} \cdot \alpha_k, \qquad r_{\text{cart}} = 0.1 \cdot R_{\text{pay}} \cdot \alpha_k$   ($\alpha_k$ = position decay, k = step)
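
A minimal sketch of the GMV-anchored intermediate rewards above. It assumes the position decay is geometric (decay factor `alpha` raised to the step k, with a hypothetical default of 0.9); the lesson does not pin down the exact form of the decay.

```python
# Fractions of the eventual payment credited to weaker funnel events (from the formula above).
EVENT_SCALE = {"click": 0.01, "cart": 0.1}


def shaped_reward(event: str, r_pay: float, k: int, alpha: float = 0.9) -> float:
    """Intermediate reward for a funnel event at step k.

    Assumes a geometric position decay alpha ** k; events other than
    click / cart get no shaped reward here.
    """
    scale = EVENT_SCALE.get(event, 0.0)
    return scale * r_pay * alpha ** k


# A click and an add-to-cart on an item worth 100, at steps 0 and 2.
print(shaped_reward("click", 100.0, k=0))
print(shaped_reward("cart", 100.0, k=2, alpha=0.5))
```

Anchoring both coefficients to the same payment value keeps a click on an expensive item from being worth more than an add-to-cart on a cheap one only by accident of scale.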

To guarantee the shaping does not change the optimal policy, prefer potential-based shaping. With a learned conversion model $F(k \mid s, a)$ (the probability of paying within k more steps, fit by a discrete-time survival network), the per-step reward is a potential difference, and a terminal correction $R_{\text{terminal}} = 1 - \sum_t r_t$ snaps the expected total return back to true GMV so you never over-estimate:

$r_t = \gamma^{k} F(k \mid s_t, a_t) - \gamma^{k-1} F(k-1 \mid s_{t-1}, a_{t-1})$   (potential-based ⇒ optimal policy unchanged)
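
Why a potential difference leaves the optimal policy alone can be checked numerically: the discounted shaping terms telescope, so their sum depends only on the first and last potential. The sketch below uses the textbook form γ·Φ(s′) − Φ(s), which is an assumption about how the formula above is indexed, and the potentials are made-up numbers standing in for conversion-model outputs.

```python
import numpy as np


def shaping_terms(phi, gamma):
    """Per-step shaping F_t = gamma * phi[t + 1] - phi[t] for t = 0..T-1."""
    phi = np.asarray(phi, dtype=float)
    return gamma * phi[1:] - phi[:-1]


def discounted_sum(values, gamma):
    """Sum of gamma**t * values[t]."""
    return float(sum(gamma ** t * v for t, v in enumerate(values)))


# Hypothetical potentials along one session; the terminal potential is set to 0.
phi = [0.05, 0.10, 0.30, 0.60, 0.0]
gamma = 0.95

total = discounted_sum(shaping_terms(phi, gamma), gamma)
# Telescoping: the total equals gamma**T * phi[T] - phi[0], whatever happened in between.
print(total, gamma ** 4 * phi[-1] - phi[0])
```

Because the total shaped bonus is the same for every path that starts in the same state, it cannot change which policy ranks best; it only redistributes when the credit arrives.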

Engineering detail — off-policy correction and variance control. Sessions are logged under the serving policy but trained against a newer one, so reweight each step with a clipped importance ratio (V-trace) and truncate the return with an n-step/λ cutoff to bound its variance.
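
One common way to combine the importance ratio with truncation is V-trace, which the through-line at the end of this lesson names. The following is a minimal NumPy sketch of V-trace value targets over one logged trajectory of length n; the function name, argument layout, and default clipping levels are illustrative assumptions.

```python
import numpy as np


def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets for one n-step trajectory.

    rewards[t], values[t] = V(x_t), ratios[t] = pi(a_t | x_t) / mu(a_t | x_t),
    bootstrap_value = V(x_n). Ratios are clipped at rho_bar and c_bar.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    ratios = np.asarray(ratios, dtype=float)

    rhos = np.minimum(rho_bar, ratios)  # clipped weight on each TD error
    cs = np.minimum(c_bar, ratios)      # clipped trace coefficients
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    targets = np.zeros_like(values)
    acc = 0.0
    # Backward recursion: v_t - V(x_t) = delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1})).
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        targets[t] = values[t] + acc
    return targets


# On-policy sanity check: with all ratios equal to 1 this is the plain n-step return.
print(vtrace_targets([0, 0, 1], [0.2, 0.3, 0.5], 0.0, [1, 1, 1], gamma=0.9))
```

Clipping the ratios trades a small bias for bounded variance, and the trajectory length n plays the role of the truncation horizon.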

5 · Engineer — explore without churning the user

Intuition. A cold user is information-poor, so each exploratory recommendation is worth a lot. But a new user is also the most fragile — one irrelevant slate and they leave before you have learned anything. So you cannot explore freely (you lose users) and you cannot exploit only the few global hits (you never personalise, and you stay cold forever). The fix is a budget: front-load a little exploration, cap it hard, and decay it — then warm-start from a meta-policy so you need far less of it.

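Engineering detail — ε-First with a hard floor, then meta warm-start. Three layers: an annealed ε-First schedule with a hard 5% floor, Thompson sampling to choose among uncertain items, and a meta-learned warm start so the policy needs far less exploration to begin with.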
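
A minimal sketch of the first two layers: an annealed exploration rate that never drops below a 5% floor, wrapped around a Beta-Bernoulli Thompson sampler. The starting rate (0.3), the decay (0.7), and the way the two are combined (uniform random slate with probability ε, Thompson sampling otherwise) are assumptions made for illustration; the meta warm-start is not shown.

```python
import numpy as np


def epsilon_at(step, eps_start=0.3, decay=0.7, floor=0.05):
    """Front-loaded exploration rate that anneals but never goes below the floor."""
    return max(floor, eps_start * decay ** step)


class BetaThompson:
    """Beta-Bernoulli Thompson sampling over a small set of candidate slates."""

    def __init__(self, n_arms, rng):
        self.a = np.ones(n_arms)  # 1 + observed successes per arm
        self.b = np.ones(n_arms)  # 1 + observed failures per arm
        self.rng = rng

    def select(self):
        # Draw one plausible click rate per arm and play the best draw.
        return int(np.argmax(self.rng.beta(self.a, self.b)))

    def update(self, arm, reward):
        self.a[arm] += reward
        self.b[arm] += 1 - reward


def choose(step, sampler, rng):
    """With probability epsilon pick uniformly at random, else Thompson-sample."""
    if rng.random() < epsilon_at(step):
        return int(rng.integers(len(sampler.a)))
    return sampler.select()


rng = np.random.default_rng(0)
sampler = BetaThompson(n_arms=4, rng=rng)
for k in range(10):
    arm = choose(k, sampler, rng)
    sampler.update(arm, reward=1 if arm == 2 else 0)  # toy user who only likes arm 2
print([round(epsilon_at(k), 3) for k in range(10)])
```

A warm start would replace the uniform Beta(1, 1) priors with priors learned from earlier users, which is where the reduction in required exploration would come from.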

The widget makes the conflict quantitative: more exploration learns the user faster but costs retention up front, and warm-start shifts the whole curve. Feel where the optimum sits.

Cold-start exploration budget → learned-value vs. early-churn

Exploration buys information (faster personalisation, more long-run value) but a fraction of exploratory slates miss and nudge a fragile new user toward churn. Net cold-start value = value learned by exploring × users who survive to realise it. Push exploration too hard and the surviving population collapses; too little and you stay cold. Warm-start lowers the exploration you need.
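
The product "value learned × users who survive" can be made concrete with a toy model. Everything about the functional forms below is a hypothetical choice for illustration (saturating learning curve, per-step survival that falls linearly with the exploration rate, warm start modelled as value already known before exploring); only the product structure comes from the text above.

```python
import numpy as np


def net_cold_value(explore, warm_start=0.0, learn_rate=5.0, miss_cost=0.3, horizon=10):
    """Toy net value: learned value times the fraction of users who survive.

    explore    : exploration rate in [0, 1]
    warm_start : share of the attainable value already known at step 0
    """
    learned = warm_start + (1.0 - warm_start) * (1.0 - np.exp(-learn_rate * explore))
    survive = (1.0 - miss_cost * explore) ** horizon
    return learned * survive


grid = np.linspace(0.0, 1.0, 101)
best_cold = grid[np.argmax(net_cold_value(grid, warm_start=0.0))]
best_warm = grid[np.argmax(net_cold_value(grid, warm_start=0.5))]
print(best_cold, best_warm)
```

In this toy setting the best exploration rate sits strictly between the two extremes, and the warm-started curve peaks at a lower rate than the cold one, which matches the qualitative claim that warm start lowers the exploration you need.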

[Interactive widget readouts: steps to converge, D1 retention, net cold value, verdict]

[Figure: the five-step loop. 1 · Formulate (S, A, R, P; empty state) → 2 · Diagnose (delayed reward, explore vs churn) → 3 · Engineer (3-stage state, shaping, ε-First + warm-start) → 4 · Guard (drift detect, ε floor + rollback) → 5 · Iterate (re-diagnose). Solving one difficulty exposes the next; re-run the loop.]

6 · Guard — drift detection, monitoring, and fallback

Intuition. A cold-start policy ships into a moving world: new-user behaviour shifts on holidays and big sale days, interests drift, and operations campaigns change what "good" even means. Guarding is two jobs — detect that the distribution moved, and adapt without taking the live service down or losing what the model already knew.

Engineering detail — detect. Track a KL divergence between the current interest distribution and a reference. Under local differential privacy the interest vector is noised before upload, so apply a consistency calibration before computing the KL; otherwise the noise itself triggers false alarms. For online change detection, run a CUSUM on an exponentially weighted reward (deviation statistic S_t, threshold h calibrated offline to a ~1% false-positive rate via Neyman–Pearson). Check a calibration plot daily so that predicted conversion F(k) stays within ~2% of actual.
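
A one-sided CUSUM on an exponentially weighted reward might look like the sketch below. The slack, threshold, and smoothing constant are placeholder values; in the setup described above the threshold h would come from offline calibration rather than being hard-coded.

```python
def cusum_alarm_step(rewards, baseline, slack=0.01, threshold=0.5, ewma_alpha=0.2):
    """Return the first index at which a downward reward shift is flagged, else None.

    rewards   : stream of per-step (or per-batch) rewards
    baseline  : reference reward level the EWMA is compared against
    slack     : tolerated deviation before evidence starts to accumulate
    threshold : alarm level h for the cumulative statistic S_t
    """
    ewma = baseline
    s = 0.0
    for t, r in enumerate(rewards):
        ewma = ewma_alpha * r + (1.0 - ewma_alpha) * ewma
        # Accumulate only shortfalls below the baseline; reset at zero.
        s = max(0.0, s + (baseline - ewma - slack))
        if s > threshold:
            return t
    return None


stable = [0.1] * 200
dropped = [0.1] * 50 + [0.0] * 50  # conversion collapses at step 50
print(cusum_alarm_step(stable, baseline=0.1))
print(cusum_alarm_step(dropped, baseline=0.1))
```

The slack term keeps small fluctuations from accumulating, while a sustained drop pushes the statistic over the threshold a handful of steps after the change.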

Engineering detail — adapt without downtime. Fork a side network from a hot backup that shares the embedding layer but has an independent MLP head (~30% memory saved); route ~5% of hash-bucketed (unbiased) traffic to it; fine-tune only the MLP head with PPO-clip (lr 3e-5, batch 256), allowing a step only while KL(π_new ‖ π_old) ≤ 0.01. Evaluate the side model off-policy with a Doubly-Robust estimator; promote only if v̂ − v_old ≥ +0.5% at p < 0.05, else discard. Hot-swap atomically (distributed lock + versioned SavedModel reload, <100 ms), keeping the old model 72 h for rollback.
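
The evaluation and gating logic can be sketched as three small functions. The doubly robust estimator is written in its single-step (contextual-bandit) form, and the +0.5% criterion is read as a relative lift over the old value; both are assumptions, as are the function names.

```python
import numpy as np


def doubly_robust_value(rewards, q_logged, q_target, ratios):
    """Single-step doubly robust estimate of the side policy's value.

    rewards  : observed reward for each logged action
    q_logged : reward-model prediction for the logged action
    q_target : reward-model prediction averaged over the side policy's actions
    ratios   : pi_new(a | x) / pi_old(a | x) for the logged action
    """
    rewards, q_logged, q_target, ratios = (
        np.asarray(x, dtype=float) for x in (rewards, q_logged, q_target, ratios)
    )
    # Model-based value plus an importance-weighted correction of the model's error.
    return float(np.mean(q_target + ratios * (rewards - q_logged)))


def may_take_step(kl_new_old, kl_limit=0.01):
    """Allow a fine-tuning step only while the policy stays close to the old one."""
    return kl_new_old <= kl_limit


def should_promote(v_new, v_old, p_value, min_lift=0.005, alpha=0.05):
    """Promote only on a significant relative lift of at least 0.5% (v_old > 0 assumed)."""
    lift = (v_new - v_old) / v_old
    return lift >= min_lift and p_value < alpha


v_hat = doubly_robust_value(
    rewards=[1, 0, 0, 1], q_logged=[0.6, 0.2, 0.1, 0.7],
    q_target=[0.7, 0.3, 0.2, 0.8], ratios=[1.2, 0.8, 1.0, 1.1],
)
print(v_hat, should_promote(v_hat, v_old=0.5, p_value=0.01))
```

The doubly robust form stays accurate if either the reward model or the importance ratios are good, which is useful when the side policy has only seen about 5% of traffic.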

The through-line
Every section is one row of the MDP table turned into a mechanism: empty, constrained state → three-stage on-device representation (priors + decayed pooling + budget counter) under a dimension-vs-horizon bound; sparse, delayed funnel reward → potential-based GMV-anchored shaping + V-trace + n-step/λ truncation; exploration–retention conflict → annealed ε-First with a hard 5% floor + Thompson sampling + meta warm-start; non-stationary world → KL/CUSUM drift detection + side-network fine-tuning with KL-gated promotion and sub-100 ms rollback. You never reached for a tool until a row of the table demanded it.

Further considerations