Recommender-system cold start

A brand-new user opens the app. You have no click history, roughly ten interactions before they decide whether to come back, and a real-time budget of tens of milliseconds per recommendation. Casting this as RL is natural — recommend, observe, update — but every piece of the MDP is awkward: the state starts empty and must be built on-device from non-sensitive priors, the reward (a purchase or next-day return) arrives tens of steps late through a conversion funnel, and the very exploration that learns a cold user faster is what scares them away. Each tension names a mechanism.

The method — five steps, every lesson

Applied RL is the same loop in every domain. (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one or two properties that make this MDP hard. (3) Engineer the mechanism that removes exactly that difficulty. (4) Guard it in production — detect when it breaks, monitor, fall back. (5) Iterate. For cold start the difficulties are an empty, low-dimensional state under a privacy/latency budget, a sparse and delayed funnel reward, and an exploration–retention conflict. We run the loop on each.

1 · Formulate — the MDP behind a cold-start session

Intuition. The platform shows a new user a feed, watches what they touch, and updates. That is an MDP: the user's situation is the state, the recommended slate is the action, engagement (and eventually a purchase or a return visit) is the reward, and the user's reaction plus the world is the transition. The catch is that at step 0 the state is almost empty, the reward you actually care about will not arrive for many steps, and the horizon is brutally short — about ten interactions to prove the app is worth keeping.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ] · horizon H ≈ 10 first interactions

Piece	For a cold-start recommender	The awkward part
State S	at step 0: only signup-channel, device, IP-province, hour-of-day; then a short behaviour sequence builds up	starts empty & low-dimensional, must be assembled on-device without uploading raw PII, in <30 ms
Action A	the slate of items to show; over a huge, mostly-unseen catalogue	most items are themselves cold → every action is partly an exploration decision
Reward R	click (weak), add-to-cart (medium), purchase / next-day return (the real signal)	sparse & delayed — the payment that matters can be 30 steps downstream, or arrive after the session ends
Transition P	the user's reaction, drifting interests, and operations campaigns (coupons, promos)	partially observed & non-stationary; confounders (a coupon popup) can fake the credit

Same pattern as every lesson: the MDP table writes itself, and the rightmost column is the lesson. Three rows each carry a named difficulty — empty state, delayed reward, exploration risk — and each has a mechanism that answers it.

2 · Diagnose — three difficulties that bind, in order

Intuition. Cold start is not one hard thing; it is three, and they fire in sequence. First you cannot even represent the user (empty state). Once you can, you cannot tell which recommendation earned the eventual purchase (delayed reward). Once you can assign credit, you discover that the exploration needed to learn fast is precisely what makes a fragile new user churn (exploration–retention conflict). Solve them in that order — a clever credit-assignment scheme is worthless if your state vector is noise, and a brilliant explorer is worthless if it drives users away before it learns anything.

The three binding difficulties

(D1) Empty, constrained state. Zero history, a privacy regime that forbids shipping raw behaviour, and a hard sub-30 ms inference budget. (D2) Sparse, delayed funnel reward. The honest reward (purchase, next-day retention) is rare and arrives many steps after the action that caused it; clicks are dense but only weakly correlated with value. (D3) Exploration–retention conflict. A cold user is information-poor, so exploration has high value — but a new user is also the most likely to churn after one bad recommendation, so exploration has high cost.

3 · Engineer — build a state from nothing

Intuition. If you have no history, manufacture a usable state out of three things you legally do have: priors available at step 0, a compressed summary of the few interactions you collect, and a counter telling the network how much exploration budget is left. The trick is that none of this requires uploading raw, sensitive data — the vector is assembled on the device.

Engineering detail — the three-stage state. A concrete, ship-ready representation that is usable from step 0, stays under 30 ms, and is privacy-compliant:

s₀ = [ v_channel, v_device, v_ip-prov, t_hour ] → 3×32-dim embeddings + 24-dim one-hot ≈ 128-dim

(1) Static priors, usable at step 0. Pre-trained 32-dim embeddings for signup channel, device model and IP province, plus a 24-dim one-hot for hour-of-day — concatenated on-device so no raw identifier leaves the phone.
(2) Dynamic incremental compression, steps 1–10. Each interaction yields {item_id, click_pos, dwell_time, reward}. Map item_id through a frozen 64-dim Item2Vec, then collapse the short sequence with a time-decayed sum-pool and a small projection:

z = Σ_i=1..k γ^k−i · e_i, γ = 0.9, k ≤ 10 → Linear(128)→ReLU = v_behav

s = [ s₀ ; v_behav ; k/10 ] ≈ 257-dim (k/10 tells the net how much budget remains)

(3) Inference acceleration. INT8 quantization brings batch-1 latency to ~18 ms on a mid-range phone; the pooling is SIMD-vectorised (ARM NEON); features live in a ring buffer so each step is an O(1) update, never an O(k) recompute of the whole sequence.

Dimension is not free — the cold-start regret bound

More state dimensions look like more signal, but with only ~10 interactions they are mostly variance. In the linear-bandit setting, reaching an ε-optimal policy needs the state dimension d and horizon T to satisfy roughly d ≤ O(√T) — so for a 10-step budget the informative dimension should not exceed ~32; beyond that the minimax regret bound degrades sharply. Shipping 257 dims works only because a deep model's generalisation buys headroom — and even then you must add Dropout / regularisation or you simply overfit ten clicks. The lesson: justify every dimension against the horizon, do not pile them on.

4 · Engineer — assign credit across a delayed funnel

Intuition. The reward you care about — a purchase, or the user coming back tomorrow — lands far downstream of the recommendation that caused it, and most steps produce nothing at all. Three moves turn this delayed, sparse signal into something a policy can learn from: give honest intermediate credit (shaping), correct for the gap between the logging policy and the policy you are training (off-policy correction), and bound the variance of the long return (truncation).

Engineering detail — potential-based, GMV-anchored shaping. Model the whole funnel as a POMDP whose true reward fires only on payment. Give intermediate signal scaled to the eventual payoff and decayed by step, so early actions are not over-credited:

r_click = 0.01·R_pay·α^k, r_cart = 0.1·R_pay·α^k (α^k = position decay, k = step)

To guarantee the shaping does not change the optimal policy, prefer potential-based shaping: with a learned conversion model F(k | s,a) (the probability of paying within k more steps, fit by a discrete-time survival network), the per-step reward is a potential difference, and a terminal correction R_terminal = 1 − Σ r_t snaps the expected total return back to true GMV so you never over-estimate:

r_t = γ^kF(k | s_t,a_t) − γ^k−1F(k−1 | s_t−1,a_t−1) (potential-based ⇒ optimal policy unchanged)

Engineering detail — off-policy correction and variance control. Sessions are logged under the serving policy but trained against a newer one, so reweight with the importance ratio and truncate the return:

V-trace / clipped IS. Replay whole sessions; after a payment fires, back-fill the trajectory and weight each step by a clipped π_target/π_behavior ratio, giving a corrected multi-step return that keeps offline training consistent with the live policy.
n-step truncation. A pure Monte-Carlo return over a 30-step funnel has enormous variance. Bootstrap after n steps: Ĝ_t = Σ_k=0..n−1 γ^kr_t+k+1 + γⁿV_φ(s_t+n). Grid-search n over 5→20 (step 2) and pick the variance-elbow; in one project this cut policy-gradient variance from 18.7 to 2.4 (≈3.2× sample efficiency). The elegant single-knob version is the λ-return (eligibility traces): one λ ∈ [0,1] smoothly trades bias for variance.
Deployment. A multi-task actor-critic (SAC/PPO): the critic's main head estimates long-horizon payment value V_pay; click / add-to-cart heads act as auxiliary regularisers. GAE(λ) with λ≈0.8 covers the ~30-step delay. A delayed back-fill queue updates the replay buffer when payment arrives — hourly incremental training, minute-level hot-swap of embeddings.

5 · Engineer — explore without churning the user

Intuition. A cold user is information-poor, so each exploratory recommendation is worth a lot. But a new user is also the most fragile — one irrelevant slate and they leave before you have learned anything. So you cannot explore freely (you lose users) and you cannot exploit only the few global hits (you never personalise, and you stay cold forever). The fix is a budget: front-load a little exploration, cap it hard, and decay it — then warm-start from a meta-policy so you need far less of it.

Engineering detail — ε-First with a hard floor, then meta warm-start. Three layers, straight from the playbook:

Annealed ε-First over the first interactions. First ~10 items at ε≈0.3 (mostly explore), next ~10 decaying to ε≈0.1, then exploit. The base schedule ε_t = ε₀/(1+βt)^α with α≈0.7 makes Σε_t² converge (a convergence condition) while matching a "see effect within a week" cadence. For long-tail items with visit-count N(s,a)<100, locally inflate ε by 1.5× so rare actions keep getting tried.
Thompson Sampling for categorical items with Dirichlet posteriors; under drift, decay-then-increment α_i ← γ·α_i + e_k (γ∈0.95–0.999) so old data is forgotten exponentially. Sampling is O(K) via one Gamma draw + normalisation; weak priors (α₀≈0.5) avoid over-confidence. Bayesian regret is O(√(KT·logT)), same order as UCB but lower empirical variance.
A hard ε floor and a safety net. Lock ε_min = 5% (never fully deterministic, or the policy gets gamed). Run ε±1pp buckets, watch CTR / GMV / complaint-rate; if a core metric drops >2%, roll back the schedule within ~10 min from the config centre.
Meta warm-start. Train an offline user-level meta-policy π_meta(θ) on legally-available non-sensitive features (channel + device + region) that outputs initial Q-values; warm-starting from it shortens convergence from ~50 steps to ~10 (≈80% less wasted exploration). When the budget is spent but uncertainty is still high, use "pseudo-exploration": route cold items only to look-alike high-activity clusters to validate them, instead of spending a fragile new user's patience.

The widget makes the conflict quantitative: more exploration learns the user faster but costs retention up front, and warm-start shifts the whole curve. Feel where the optimum sits.

Cold-start exploration budget → learned-value vs. early-churn

Exploration buys information (faster personalisation, more long-run value) but a fraction of exploratory slates miss and nudge a fragile new user toward churn. Net cold-start value = value learned by exploring × users who survive to realise it. Push exploration too hard and the surviving population collapses; too little and you stay cold. Warm-start lowers the exploration you need.

explore rate ε 0.30 per-miss churn risk (%) 1.5 warm-start step cut (%) 0

steps to converge

–

D1 retention

–

net cold value

–

verdict

–

6 · Guard — drift detection, monitoring, and fallback

Intuition. A cold-start policy ships into a moving world: new-user behaviour shifts on holidays and big sale days, interests drift, and operations campaigns change what "good" even means. Guarding is two jobs — detect that the distribution moved, and adapt without taking the live service down or losing what the model already knew.

Engineering detail — detect. Track a KL divergence between the current interest distribution and a reference; under local differential privacy the interest vector is noised before upload, so do a consistency calibration before the KL or the noise itself triggers false alarms. For online change detection use a CUSUM on an exponentially-weighted reward (deviation S_t, threshold h calibrated offline to ~1% false-positive via Neyman–Pearson); a calibration plot is checked daily so predicted conversion F(k) stays within ~2% of actual.

Engineering detail — adapt without downtime. Fork a side network from a hot backup that shares the embedding layer but has an independent MLP head (~30% memory saved); route ~5% of hash-bucketed (unbiased) traffic to it; fine-tune only the MLP head with PPO-clip (lr 3e-5, batch 256), allowing a step only while KL(π_new‖π_old) ≤ 0.01. Evaluate the side model off-policy with a Doubly-Robust estimator; promote only if v̂ − v_old ≥ +0.5% at p<0.05, else discard. Hot-swap atomically (distributed lock + versioned SavedModel reload, <100 ms), keeping the old model 72 h for rollback.

The through-line

Every section is one row of the MDP table turned into a mechanism: empty, constrained state → three-stage on-device representation (priors + decayed pooling + budget counter) under a dimension-vs-horizon bound; sparse, delayed funnel reward → potential-based GMV-anchored shaping + V-trace + n-step/λ truncation; exploration–retention conflict → annealed ε-First with a hard 5% floor + Thompson sampling + meta warm-start; non-stationary world → KL/CUSUM drift detection + side-network fine-tuning with KL-gated promotion and sub-100 ms rollback. You never reached for a tool until a row of the table demanded it.

Further considerations

Cross-domain cold start. If the company runs both e-commerce and short-video, an anonymised, federated alignment can encrypt and transfer a "category-preference vector" from one side as an extra ~128-dim prior — additional signal before the user has done anything, without raw data leaving its domain.
Causal credit, not correlational. If a coupon-popup campaign runs alongside recommendations, a purchase lift may be the coupon's, not the policy's. Use double/debiased RL or a causal forest to strip confounders; for coupons specifically, estimate the individual treatment effect (CATE) with a causal survival forest and subtract the cannibalisation ΔF (purchases merely pulled forward) so the policy is not rewarded for short-termism.
Longer-horizon value. Replace the immediate payment reward with a 60-day LTV: fit an LTV distribution (e.g. Gamma–Poisson) and feed it as a stochastic return, pushing credit assignment to a much longer delay. As the modelled user lifetime lengthens (e.g. 180-day LTV), the cost of exploration is amortised — so relax the exploration cap over the lifecycle with a sliding-window regret estimate.
Cold items, not just cold users. A brand-new item has no clicks either; meta-learning (e.g. MAML) at the category level learns an initialisation so the model produces a reasonable value estimate from very few interactions.
Strict causality in sequence encoders. If a Transformer encodes the behaviour sequence, a causal mask (lower-triangular −∞ before softmax, hard-coded so framework fusion cannot drop it) must keep any future timestep from leaking into the current hidden state; with relative position bias, truncate offsets to Δt ∈ [−k, 0] so no positive (future) position difference exists at all.
Privacy as a hard constraint. Data-minimisation rules push computation on-device: do local Bayesian posterior updates and upload only DP-noised α-increments (e.g. Laplace, ε≈0.1) for the server to aggregate — federated Thompson sampling. Survival/risk models additionally need monotonicity constraints (a larger coupon must not lower predicted risk) to pass an explainability audit.