Recommender-system cold start
A brand-new user opens the app. You have no click history, roughly ten interactions before they decide whether to come back, and a real-time budget of tens of milliseconds per recommendation. Casting this as RL is natural — recommend, observe, update — but every piece of the MDP is awkward: the state starts empty and must be built on-device from non-sensitive priors, the reward (a purchase or next-day return) arrives tens of steps late through a conversion funnel, and the very exploration that learns a cold user faster is what scares them away. Each tension names a mechanism.
1 · Formulate — the MDP behind a cold-start session
Intuition. The platform shows a new user a feed, watches what they touch, and updates. That is an MDP: the user's situation is the state, the recommended slate is the action, engagement (and eventually a purchase or a return visit) is the reward, and the user's reaction plus the world is the transition. The catch is that at step 0 the state is almost empty, the reward you actually care about will not arrive for many steps, and the horizon is brutally short — about ten interactions to prove the app is worth keeping.
| Piece | For a cold-start recommender | The awkward part |
|---|---|---|
| State S | at step 0: only signup-channel, device, IP-province, hour-of-day; then a short behaviour sequence builds up | starts empty & low-dimensional, must be assembled on-device without uploading raw PII, in <30 ms |
| Action A | the slate of items to show; over a huge, mostly-unseen catalogue | most items are themselves cold → every action is partly an exploration decision |
| Reward R | click (weak), add-to-cart (medium), purchase / next-day return (the real signal) | sparse & delayed — the payment that matters can be 30 steps downstream, or arrive after the session ends |
| Transition P | the user's reaction, drifting interests, and operations campaigns (coupons, promos) | partially observed & non-stationary; confounders (a coupon popup) can fake the credit |
Same pattern as every lesson: the MDP table writes itself, and the rightmost column is the lesson. Three rows each carry a named difficulty — empty state, delayed reward, exploration risk — and each has a mechanism that answers it.
2 · Diagnose — three difficulties that bind, in order
Intuition. Cold start is not one hard thing; it is three, and they fire in sequence. First you cannot even represent the user (empty state). Once you can, you cannot tell which recommendation earned the eventual purchase (delayed reward). Once you can assign credit, you discover that the exploration needed to learn fast is precisely what makes a fragile new user churn (exploration–retention conflict). Solve them in that order — a clever credit-assignment scheme is worthless if your state vector is noise, and a brilliant explorer is worthless if it drives users away before it learns anything.
3 · Engineer — build a state from nothing
Intuition. If you have no history, manufacture a usable state out of three things you legally do have: priors available at step 0, a compressed summary of the few interactions you collect, and a counter telling the network how much exploration budget is left. The trick is that none of this requires uploading raw, sensitive data — the vector is assembled on the device.
Engineering detail — the three-stage state. A concrete, ship-ready representation that is usable from step 0, stays under 30 ms, and is privacy-compliant:
- (1) Static priors, usable at step 0. Pre-trained 32-dim embeddings for signup channel, device model and IP province, plus a 24-dim one-hot for hour-of-day — concatenated on-device so no raw identifier leaves the phone.
- (2) Dynamic incremental compression, steps 1–10. Each interaction yields
{item_id, click_pos, dwell_time, reward}. Mapitem_idthrough a frozen 64-dim Item2Vec, then collapse the short sequence with a time-decayed sum-pool and a small projection:
- (3) Inference acceleration. INT8 quantization brings batch-1 latency to ~18 ms on a mid-range phone; the pooling is SIMD-vectorised (ARM NEON); features live in a ring buffer so each step is an O(1) update, never an O(k) recompute of the whole sequence.
4 · Engineer — assign credit across a delayed funnel
Intuition. The reward you care about — a purchase, or the user coming back tomorrow — lands far downstream of the recommendation that caused it, and most steps produce nothing at all. Three moves turn this delayed, sparse signal into something a policy can learn from: give honest intermediate credit (shaping), correct for the gap between the logging policy and the policy you are training (off-policy correction), and bound the variance of the long return (truncation).
Engineering detail — potential-based, GMV-anchored shaping. Model the whole funnel as a POMDP whose true reward fires only on payment. Give intermediate signal scaled to the eventual payoff and decayed by step, so early actions are not over-credited:
To guarantee the shaping does not change the optimal policy, prefer potential-based shaping: with a learned conversion model F(k | s,a) (the probability of paying within k more steps, fit by a discrete-time survival network), the per-step reward is a potential difference, and a terminal correction Rterminal = 1 − Σ rt snaps the expected total return back to true GMV so you never over-estimate:
Engineering detail — off-policy correction and variance control. Sessions are logged under the serving policy but trained against a newer one, so reweight with the importance ratio and truncate the return:
- V-trace / clipped IS. Replay whole sessions; after a payment fires, back-fill the trajectory and weight each step by a clipped πtarget/πbehavior ratio, giving a corrected multi-step return that keeps offline training consistent with the live policy.
- n-step truncation. A pure Monte-Carlo return over a 30-step funnel has enormous variance. Bootstrap after n steps: Ĝt = Σk=0..n−1 γkrt+k+1 + γnVφ(st+n). Grid-search n over 5→20 (step 2) and pick the variance-elbow; in one project this cut policy-gradient variance from 18.7 to 2.4 (≈3.2× sample efficiency). The elegant single-knob version is the λ-return (eligibility traces): one λ ∈ [0,1] smoothly trades bias for variance.
- Deployment. A multi-task actor-critic (SAC/PPO): the critic's main head estimates long-horizon payment value Vpay; click / add-to-cart heads act as auxiliary regularisers. GAE(λ) with λ≈0.8 covers the ~30-step delay. A delayed back-fill queue updates the replay buffer when payment arrives — hourly incremental training, minute-level hot-swap of embeddings.
5 · Engineer — explore without churning the user
Intuition. A cold user is information-poor, so each exploratory recommendation is worth a lot. But a new user is also the most fragile — one irrelevant slate and they leave before you have learned anything. So you cannot explore freely (you lose users) and you cannot exploit only the few global hits (you never personalise, and you stay cold forever). The fix is a budget: front-load a little exploration, cap it hard, and decay it — then warm-start from a meta-policy so you need far less of it.
Engineering detail — ε-First with a hard floor, then meta warm-start. Three layers, straight from the playbook:
- Annealed ε-First over the first interactions. First ~10 items at ε≈0.3 (mostly explore), next ~10 decaying to ε≈0.1, then exploit. The base schedule εt = ε₀/(1+βt)α with α≈0.7 makes Σεt² converge (a convergence condition) while matching a "see effect within a week" cadence. For long-tail items with visit-count N(s,a)<100, locally inflate ε by 1.5× so rare actions keep getting tried.
- Thompson Sampling for categorical items with Dirichlet posteriors; under drift, decay-then-increment αi ← γ·αi + ek (γ∈0.95–0.999) so old data is forgotten exponentially. Sampling is O(K) via one Gamma draw + normalisation; weak priors (α₀≈0.5) avoid over-confidence. Bayesian regret is O(√(KT·logT)), same order as UCB but lower empirical variance.
- A hard ε floor and a safety net. Lock εmin = 5% (never fully deterministic, or the policy gets gamed). Run ε±1pp buckets, watch CTR / GMV / complaint-rate; if a core metric drops >2%, roll back the schedule within ~10 min from the config centre.
- Meta warm-start. Train an offline user-level meta-policy πmeta(θ) on legally-available non-sensitive features (channel + device + region) that outputs initial Q-values; warm-starting from it shortens convergence from ~50 steps to ~10 (≈80% less wasted exploration). When the budget is spent but uncertainty is still high, use "pseudo-exploration": route cold items only to look-alike high-activity clusters to validate them, instead of spending a fragile new user's patience.
The widget makes the conflict quantitative: more exploration learns the user faster but costs retention up front, and warm-start shifts the whole curve. Feel where the optimum sits.
6 · Guard — drift detection, monitoring, and fallback
Intuition. A cold-start policy ships into a moving world: new-user behaviour shifts on holidays and big sale days, interests drift, and operations campaigns change what "good" even means. Guarding is two jobs — detect that the distribution moved, and adapt without taking the live service down or losing what the model already knew.
Engineering detail — detect. Track a KL divergence between the current interest distribution and a reference; under local differential privacy the interest vector is noised before upload, so do a consistency calibration before the KL or the noise itself triggers false alarms. For online change detection use a CUSUM on an exponentially-weighted reward (deviation St, threshold h calibrated offline to ~1% false-positive via Neyman–Pearson); a calibration plot is checked daily so predicted conversion F(k) stays within ~2% of actual.
Engineering detail — adapt without downtime. Fork a side network from a hot backup that shares the embedding layer but has an independent MLP head (~30% memory saved); route ~5% of hash-bucketed (unbiased) traffic to it; fine-tune only the MLP head with PPO-clip (lr 3e-5, batch 256), allowing a step only while KL(πnew‖πold) ≤ 0.01. Evaluate the side model off-policy with a Doubly-Robust estimator; promote only if v̂ − vold ≥ +0.5% at p<0.05, else discard. Hot-swap atomically (distributed lock + versioned SavedModel reload, <100 ms), keeping the old model 72 h for rollback.
Further considerations
- Cross-domain cold start. If the company runs both e-commerce and short-video, an anonymised, federated alignment can encrypt and transfer a "category-preference vector" from one side as an extra ~128-dim prior — additional signal before the user has done anything, without raw data leaving its domain.
- Causal credit, not correlational. If a coupon-popup campaign runs alongside recommendations, a purchase lift may be the coupon's, not the policy's. Use double/debiased RL or a causal forest to strip confounders; for coupons specifically, estimate the individual treatment effect (CATE) with a causal survival forest and subtract the cannibalisation ΔF (purchases merely pulled forward) so the policy is not rewarded for short-termism.
- Longer-horizon value. Replace the immediate payment reward with a 60-day LTV: fit an LTV distribution (e.g. Gamma–Poisson) and feed it as a stochastic return, pushing credit assignment to a much longer delay. As the modelled user lifetime lengthens (e.g. 180-day LTV), the cost of exploration is amortised — so relax the exploration cap over the lifecycle with a sliding-window regret estimate.
- Cold items, not just cold users. A brand-new item has no clicks either; meta-learning (e.g. MAML) at the category level learns an initialisation so the model produces a reasonable value estimate from very few interactions.
- Strict causality in sequence encoders. If a Transformer encodes the behaviour sequence, a causal mask (lower-triangular −∞ before softmax, hard-coded so framework fusion cannot drop it) must keep any future timestep from leaking into the current hidden state; with relative position bias, truncate offsets to Δt ∈ [−k, 0] so no positive (future) position difference exists at all.
- Privacy as a hard constraint. Data-minimisation rules push computation on-device: do local Bayesian posterior updates and upload only DP-noised α-increments (e.g. Laplace, ε≈0.1) for the server to aggregate — federated Thompson sampling. Survival/risk models additionally need monotonicity constraints (a larger coupon must not lower predicted risk) to pass an explainability audit.