Game-AI agent training

The first domain, and the one that built modern deep RL (Atari, AlphaGo, OpenAI Five, AlphaStar). We use it to install the method the whole track runs on: write the MDP, name the one thing that makes this MDP hard, reach for the mechanism that removes exactly that difficulty — and not before. For a MOBA bot the binding difficulties are a hybrid action space, partial observability, a sparse, hackable reward, a non-stationary self-play opponent, and a 16 ms frame budget. Each one names a tool.

The method — five steps, every lesson

Applied RL is not a grab-bag of tricks. Every domain in this track is the same loop: (1) Formulate the MDP — state, action, reward, transition, horizon. (2) Diagnose the one property that makes the MDP hard. (3) Engineer the mechanism that removes that difficulty. (4) Guard it in production — detect when it breaks and fall back. (5) Iterate. The art is only in which difficulty binds first. This lesson runs the loop five times on one game.

1 · Formulate — the MDP behind a MOBA bot

Intuition. A game agent sees a screen, presses buttons, and gets a score later. That is already a Markov Decision Process: the screen is the state, the buttons are the action, the score change is the reward, and the game engine is the transition. Everything below is a consequence of one of these four pieces being awkward in a real game.

MDP = (S, A, R, P, γ) · maximize E[ Σₜ γᵗ rₜ ]

Piece	For a 5v5 MOBA bot	The awkward part
State S	minimap + unit stats + cooldowns, ~1k-dim feature image	you only see your team's vision → partially observable
Action A	which skill (discrete) × where to aim (continuous) × charge level	hybrid discrete-continuous, and most actions are illegal most frames
Reward R	win/loss, but shaped with kills, towers, farm	sparse (one signal per ~30 min) and hackable once shaped
Transition P	the game engine + four allies + five enemies	opponent is another learning agent → non-stationary

Notice the pattern we will reuse for all 20 domains: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it.
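To see why the reward row is called sparse, it helps to compute the objective Σₜ γᵗ rₜ for a win/loss-only episode. The sketch below is illustrative only: the decision rate (4 per second) and the discount (γ = 0.999) are assumptions chosen for the example, not values from this lesson.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one episode."""
    g = 0.0
    # Accumulate backwards so each step costs one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: the only reward is the win signal on the final step.
steps = 30 * 60 * 4          # assumed 4 decisions per second for 30 minutes
sparse_rewards = [0.0] * (steps - 1) + [1.0]
value_at_start = discounted_return(sparse_rewards, gamma=0.999)
```

Under these assumed numbers the win signal is worth well under 1% of its face value at the first step, which is the gap that shaping is later asked to fill.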

2 · The hybrid action space — model the structure, don't flatten it

Intuition. "Press Q toward that bush, half-charged" is three decisions of different types: which skill (a choice from a menu), where (a continuous 2-D point), and how hard (a continuous scalar). Flattening all of that into one giant discrete menu explodes combinatorially; treating it as one continuous vector throws away the menu structure. The fix is to keep the structure: a discrete head for the skill, conditioned continuous heads for aim and charge.

a = (a_skill, a_aim, a_charge) | a_skill ~ Categorical(logits), a_aim ~ 𝒩(μ, σ), a_charge ~ Beta(α, β)

Engineering detail. All heads share one CNN backbone; only the small output heads are separate. The composite loss mixes the discrete and continuous pieces — negative-log-likelihood × advantage for the skill choice, clipped PPO surrogate + entropy bonus for the continuous aim/charge. A Beta distribution is preferred over a clipped Gaussian for the charge ∈ [0,1] because its support is already bounded — no probability mass is lost at the clip. When the discrete menu itself grows past ~20 entries (active items + summoner spells), switch the skill head to an autoregressive or pointer-network output so the logit count stops growing with the cross-product.
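The factorization above can be sketched without any deep-learning framework. The version below assumes the network has already emitted its outputs and that "conditioned" means the aim and charge parameters are indexed by the chosen skill; that indexing, the function name, and all numbers are hypothetical. The joint log-probability is the sum of the three heads' log-probabilities, which is what the composite loss differentiates.

```python
import math
import random

def sample_hybrid_action(logits, aim_mu, aim_sigma, alpha, beta, rng):
    """Sample (skill, aim, charge) and return it with its joint log-probability."""
    # Discrete head: softmax over skill logits (max-subtracted for stability).
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    skill = rng.choices(range(len(logits)), weights=probs)[0]
    logp = math.log(probs[skill])

    # Aim head: independent Gaussians for x and y, parameters picked by skill.
    aim = []
    for mu, sigma in zip(aim_mu[skill], aim_sigma[skill]):
        x = rng.gauss(mu, sigma)
        aim.append(x)
        logp += -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

    # Charge head: Beta is supported on [0, 1], so no clipping is needed.
    a, b = alpha[skill], beta[skill]
    charge = rng.betavariate(a, b)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    logp += log_norm + (a - 1) * math.log(charge) + (b - 1) * math.log(1 - charge)
    return (skill, tuple(aim), charge), logp

# Hypothetical head outputs for a three-skill menu.
rng = random.Random(0)
action, logp = sample_hybrid_action(
    logits=[0.2, 1.5, -0.3],
    aim_mu=[(0.0, 0.0), (0.5, -0.5), (1.0, 1.0)],
    aim_sigma=[(0.1, 0.1), (0.2, 0.2), (0.3, 0.3)],
    alpha=[2.0, 3.0, 2.0],
    beta=[2.0, 2.0, 5.0],
    rng=rng,
)
```

Sampling the skill first and then reading off that skill's continuous parameters keeps the logit count at the menu size instead of the cross-product.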

3 · Action masking — the detail that decides whether training even converges

Intuition. Most frames, most actions are illegal: the skill is on cooldown, you lack mana, there's no target. If the network can put probability on illegal actions, it wastes capacity learning "don't do that" and, worse, can produce NaN gradients. A mask says "these outputs are forbidden this frame" before the agent ever samples.

Engineering detail. The mask is a fixed-length byte array built engine-side (e.g. 12,288 = 64×64 grid × 3 command types) and carried alongside the observation for zero-copy. In the policy you set illegal logits to a large negative number before the softmax so their probability is exactly zero:

logits = logits.masked_fill(mask == 0, -1e9) → softmax
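As a framework-free illustration of the same idea, here is a minimal masked softmax in plain Python. The function name and the example logits are made up for this sketch. The fill value is large enough that the exponential underflows to 0.0, so illegal entries end up with a probability of exactly zero.

```python
import math

def masked_softmax(logits, mask, fill=-1e9):
    """Softmax over logits where mask[i] == 0 marks an illegal action."""
    filled = [l if m else fill for l, m in zip(logits, mask)]
    top = max(filled)                       # subtract the max for stability
    exps = [math.exp(v - top) for v in filled]
    total = sum(exps)
    return [e / total for e in exps]

# Actions 1 and 3 are illegal this frame, even though action 3 has the top logit.
probs = masked_softmax([2.0, 0.5, -1.0, 3.0], mask=[1, 0, 1, 0])
```

One caveat of this sketch: if every entry is masked, the result is uniform over illegal actions, so it assumes the engine always leaves at least one legal action (for example a no-op).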

The two failure modes masking must avoid

(1) NaN in PPO. The importance ratio is π_new/π_old, usually computed as exp(log π_new − log π_old). If an action is illegal under both policies, both probabilities are 0, so the ratio is 0/0 (or −∞ − (−∞) in log space), which is NaN. Fix: treat illegal-action probability as identically 0 on both old and new policies, and zero its gradient on the backward pass. (2) Over-masking. Mask only rule-illegality, never information-incompleteness. In an RTS, "attack empty ground" is legal; if you mask it because no enemy is visible, the agent can never learn to fire blind into fog. The mask enforces the rules; it must not encode strategy.
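One way to realize the first fix is to never evaluate a logarithm at an illegal index: the ratio is computed only for the action that was actually taken, and the entropy sums over legal entries only. This is a sketch under that assumption, with hypothetical function names and probabilities, not a full PPO loss.

```python
import math

def ppo_ratio(new_probs, old_probs, action, mask):
    """Importance ratio pi_new(a) / pi_old(a) for the action that was taken."""
    if not mask[action]:
        # A masked action can never be sampled, so reaching this is a data bug.
        raise ValueError("sampled action is illegal under the stored mask")
    return math.exp(math.log(new_probs[action]) - math.log(old_probs[action]))

def masked_entropy(probs, mask):
    """Entropy over legal actions only; illegal entries contribute exactly 0."""
    return -sum(p * math.log(p) for p, m in zip(probs, mask) if m and p > 0.0)

mask = [1, 0, 1, 0]
old_probs = [0.6, 0.0, 0.4, 0.0]
new_probs = [0.5, 0.0, 0.5, 0.0]
ratio = ppo_ratio(new_probs, old_probs, action=2, mask=mask)
entropy = masked_entropy(new_probs, mask)
```

The same stored mask is applied to both policies, so an action that was illegal when the sample was collected stays illegal when the update is computed.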

4 · Reward — sparse by nature, hackable once you shape it

Intuition. The only honest reward is win/loss, once per match. That is far too sparse to learn from, so you shape it: small bonuses for kills, towers, farm. But every shaping term is a new objective the agent will optimize literally — and the literal optimum is rarely what you meant. This is reward hacking.

Engineering detail — shaping that decays. Early in training the agent needs dense guidance; late in training the dense bonuses distort strategy. So shaping weights decay over training time, and they scale with game state — e.g. tower-reward weight is amplified when behind so the agent learns comebacks rather than only snowballing. A typical structure:

r_kill = r_kill^base · w_kill(t) · S | r_tower = r_tower^base · w_tower(t) · (2 − S)

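Here S is the game-state factor, which is small when the team is behind, so (2 − S) amplifies the tower term in exactly that situation. On top of this come a moral cutoff (zero the kill reward if the victim's recent gold/level gain is trivial, which removes the incentive for kill-farming) and a pseudo-count exploration bonus 0.5/√(n+1) that is on during training and off in production.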
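The structure above translates almost line for line into code. Several details in this sketch are assumptions made for illustration: the exponential form and half-life of w(t), the convention that S lies in [0, 2] with 1 meaning an even game, and the names victim_gain and min_gain for the cutoff test.

```python
import math

def shaping_weight(t, half_life):
    """Assumed schedule: exponential decay, halving every `half_life` steps."""
    return 0.5 ** (t / half_life)

def kill_reward(base, t, s, victim_gain, min_gain, half_life=1e6):
    # Cutoff: a kill on a victim with trivial recent gold/level gain pays nothing.
    if victim_gain < min_gain:
        return 0.0
    return base * shaping_weight(t, half_life) * s

def tower_reward(base, t, s, half_life=1e6):
    # (2 - s) grows as s shrinks, so towers pay more when the team is behind.
    return base * shaping_weight(t, half_life) * (2.0 - s)

def exploration_bonus(visit_count, training=True):
    # Pseudo-count bonus 0.5 / sqrt(n + 1); switched off in production.
    return 0.5 / math.sqrt(visit_count + 1) if training else 0.0

behind_early = tower_reward(1.0, t=0, s=0.5)       # behind, start of training
behind_late = tower_reward(1.0, t=1e6, s=0.5)      # same state, one half-life later
```

Keeping the decay, the state scaling, and the cutoff as separate factors makes each one easy to disable when diagnosing a suspected reward hack.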

Detecting reward hacking — three layers

(1) Offline correlation. Track the Pearson ρ between the shaped term and native return; if ρ collapses (0.75 → 0.3) while win-rate doesn't improve, the agent is gaming the shaping. Cross-check with the counterfactual return gap: blank the shaping and re-evaluate native return; a widening gap is a red flag. (2) Online distribution drift. Stream a behavioral ratio (e.g. last-hits / total-damage); if it falls below the human mean minus 3σ while shaped reward dominates, fire an alert and auto-rollback. (3) Prior-rule backstop. A hard constraint in the reward emitter: if farm is low yet shaped score is huge, force shaping to zero and make the policy re-explore. Together these reach ~98% hacking detection at a ~2% false-positive rate.
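The first two layers reduce to a few lines of arithmetic. In the sketch below the 0.3 correlation floor comes from the example in the text, while share_floor (how much of the total reward must come from shaping before it "dominates") and all function names are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def correlation_alarm(shaped, native, win_rate_gain, rho_floor=0.3):
    """Layer 1: shaping decoupled from native return with no win-rate gain."""
    return pearson(shaped, native) <= rho_floor and win_rate_gain <= 0.0

def drift_alarm(ratio, human_mean, human_std, shaped_share, share_floor=0.5):
    """Layer 2: behavioral ratio under human mean - 3 sigma while shaping dominates."""
    return ratio < human_mean - 3.0 * human_std and shaped_share > share_floor

# Hypothetical per-match series: shaped reward no longer tracks native return.
layer1 = correlation_alarm([1, 2, 3, 4], [4, 1, 3, 2], win_rate_gain=0.0)
layer2 = drift_alarm(ratio=0.1, human_mean=0.5, human_std=0.1, shaped_share=0.8)
```

Both checks require two conditions at once, which is what keeps a noisy correlation or an unusual but honest play style from triggering a rollback on its own.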

5 · Self-play — the opponent that learns back

Intuition. There is no fixed opponent to train against; the best opponent is a copy of yourself. Self-play creates an automatic curriculum — as you improve, so does your sparring partner. But it introduces a subtle systems problem: in a distributed cluster the opponent's strength (its Elo rating) is measured with delay, and a stale opponent rating injects noise straight into your advantage estimates.

Engineering detail — staleness is a variance multiplier. Model Elo as a delayed stochastic-approximation process. If the learner uses an opponent rating that is Δk games stale, the rating error is a random walk with variance ∝ Δk, and that error propagates into GAE, amplifying advantage variance by roughly:

variance amplification ≈ 1 + 2·Δk / τ

where τ is the Elo autocorrelation time, measured in games like Δk. More variance means you need a bigger batch to keep the same convergence speed, and batch size is GPU dollars: a half-day sync delay can multiply your training bill.
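The napkin math fits in two small functions. The sketch assumes that Δk and τ are both counted in games and that the variance of a batch-averaged estimate scales as 1/batch, so keeping the noise level fixed means scaling the batch by the amplification factor. The numbers (τ = 2,000 games, a base batch of 4,096) are hypothetical.

```python
def variance_amplification(stale_games, tau_games):
    """1 + 2 * delta_k / tau, with both arguments in games."""
    return 1.0 + 2.0 * stale_games / tau_games

def batch_for_same_noise(base_batch, stale_games, tau_games):
    # Variance of a batch mean scales as 1 / batch, so matching the
    # fresh-rating noise level needs `amplification` times more samples.
    return round(base_batch * variance_amplification(stale_games, tau_games))

mild = variance_amplification(stale_games=500, tau_games=2000)
severe_batch = batch_for_same_noise(4096, stale_games=3000, tau_games=2000)
```

How many games a given wall-clock delay corresponds to depends on cluster throughput, so the conversion from "half a day" to Δk has to be measured rather than assumed.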

6 · The 16 ms wall — real-time inference is part of the MDP

Intuition. A 60-FPS game gives the policy ~16 ms per frame to see, think, and act. A model that is 1% stronger but misses the frame deadline is useless — the agent stutters. So inference latency is not an afterthought; it is a hard constraint that shapes the architecture.

Engineering detail. Three levers, each from the source playbook: (1) INT8 quantization of the policy, with KL(π_int8 ‖ π_fp32) tracked as an auxiliary loss and an FP32 "shadow engine" comparing logits in production; auto-recalibrate if KL drifts past ~0.02. (2) LSTM state reuse: keep hidden states GPU-resident in a ring buffer, capture the "load-state → infer → write-state" pipeline as a CUDA graph (CPU-side cost ~52 µs), reaching ~0.2 ms/frame. (3) Warp-level batch-1 sampling on mobile GPUs: pack 32 pseudo-environments into one warp and use fp16 vector loads and a piecewise-linear tanh, bringing a full step to under 5 ms on a phone.
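The shadow-engine comparison in lever (1) amounts to a KL check between two action distributions. Below is a minimal sketch of that check in plain Python; the function names and the example logits are hypothetical, and the 0.02 threshold is the one quoted above. It assumes KL is measured in nats.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def needs_recalibration(int8_logits, fp32_logits, threshold=0.02):
    """True when KL(pi_int8 || pi_fp32) exceeds the drift threshold."""
    return kl_divergence(softmax(int8_logits), softmax(fp32_logits)) > threshold

# Small rounding error stays under the threshold; a swapped argmax does not.
small_drift = needs_recalibration([2.01, 0.0, -1.0], [2.0, 0.0, -1.0])
large_drift = needs_recalibration([0.0, 2.0, -1.0], [2.0, 0.0, -1.0])
```

In production this comparison would run on sampled frames only, since running the FP32 model on every frame would spend the latency budget that quantization was meant to save.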

The through-line

Every section above is one row of the MDP table turned into a mechanism: hybrid action with mostly illegal choices → structured heads + legality masking; partial obs → recurrent memory; sparse/hackable reward → decaying shaping + a 3-layer detector; non-stationary opponent → self-play with staleness-aware syncing; real-time transition → quantization, CUDA graphs, warp sampling. You never reached for a tool until a row of the table demanded it. That discipline is the whole track.

Further considerations

Multi-agent credit assignment. In 5v5, who gets the reward for a tower that an assist enabled? A counterfactual baseline measures each agent's marginal contribution before the shaping decay is applied, which stops agents from farming "fake assists."
Value decomposition limits. QMIX's monotonicity constraint cannot represent "sacrifice-then-burst" coordination; at 5 agents the value error compounds. Weighted-QMIX, monotonic-attention hypernets, or dropping monotonicity (QPLEX) buy expressiveness at the price of sample and compute cost: a decision, not a default.
Policy cycling. Self-play can fall into rock-paper-scissors loops. Detect via the effective rank of the win-rate matrix and α-divergence between policy-embedding snapshots; on detection, soft-restart (perturb policy weights, spike entropy, reset LR to warm-up) while keeping the value network.
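For the credit-assignment item above, one common way to build a counterfactual baseline is in the style of COMA (counterfactual multi-agent policy gradients); the lesson does not name a specific algorithm, so treat that choice, the function name, and the numbers as assumptions. The idea is to hold teammates' actions fixed and compare the joint value of what the agent did against the average over what it could have done.

```python
def counterfactual_advantage(q_values, policy, taken):
    """COMA-style advantage for one agent with teammates' actions held fixed.

    q_values[a] is the joint value if this agent alone had played action a;
    policy[a] is this agent's probability of playing a.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[taken] - baseline

# Hypothetical tower push: action 1 is the assist that made the tower fall.
q_values = [1.0, 3.0, 2.0]
policy = [0.5, 0.25, 0.25]
assist_advantage = counterfactual_advantage(q_values, policy, taken=1)
idle_advantage = counterfactual_advantage(q_values, policy, taken=0)
```

An agent that merely stood nearby gets a negative advantage here, which is the property that discourages fake assists.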