Game-AI agent training
The first domain, and the one that built modern deep RL (Atari, AlphaGo, OpenAI Five, AlphaStar). We use it to install the method the whole track runs on: write the MDP, name the one thing that makes this MDP hard, reach for the mechanism that removes exactly that difficulty — and not before. For a MOBA bot the binding difficulties are a hybrid action space, partial observability, a sparse, hackable reward, a non-stationary self-play opponent, and a 16 ms frame budget. Each one names a tool.
1 · Formulate — the MDP behind a MOBA bot
Intuition. A game agent sees a screen, presses buttons, and gets a score later. That is already a Markov Decision Process: the screen is the state, the buttons are the action, the score change is the reward, and the game engine is the transition. Everything below is a consequence of one of these four pieces being awkward in a real game.
| Piece | For a 5v5 MOBA bot | The awkward part |
|---|---|---|
| State S | minimap + unit stats + cooldowns, ~1k-dim feature image | you only see your team's vision → partially observable |
| Action A | which skill (discrete) × where to aim (continuous) × charge level | hybrid discrete-continuous, and most actions are illegal most frames |
| Reward R | win/loss, but shaped with kills, towers, farm | sparse (one signal per ~30 min) and hackable once shaped |
| Transition P | the game engine + four allies + five enemies | opponent is another learning agent → non-stationary |
Notice the pattern we will reuse for all 20 domains: the MDP table writes itself, and the rightmost column — the awkward part — is the lesson. Each row's difficulty has a named mechanism that answers it.
2 · The hybrid action space — model the structure, don't flatten it
Intuition. "Press Q toward that bush, half-charged" is three decisions of different types: which skill (a choice from a menu), where (a continuous 2-D point), and how hard (a continuous scalar). Flattening all of that into one giant discrete menu explodes combinatorially; treating it as one continuous vector throws away the menu structure. The fix is to keep the structure: a discrete head for the skill, conditioned continuous heads for aim and charge.
Engineering detail. All heads share one CNN backbone; only the small output heads are separate. The composite loss mixes the discrete and continuous pieces — negative-log-likelihood × advantage for the skill choice, clipped PPO surrogate + entropy bonus for the continuous aim/charge. A Beta distribution is preferred over a clipped Gaussian for the charge ∈ [0,1] because its support is already bounded — no probability mass is lost at the clip. When the discrete menu itself grows past ~20 entries (active items + summoner spells), switch the skill head to an autoregressive or pointer-network output so the logit count stops growing with the cross-product.
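A minimal sketch of the factorized policy described above, in plain Python rather than a deep-learning framework. The skill head is a categorical, aim is a per-skill Gaussian, charge is a per-skill Beta, and the joint log-probability is the sum of the three pieces. The function names, the per-skill parameter tables, and the isotropic Gaussian for aim are assumptions for illustration; in a real agent these parameters would come from the network heads.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def gaussian_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def beta_logpdf(x, a, b, eps=1e-9):
    # Clamp away from the endpoints so log() stays finite.
    x = min(max(x, eps), 1.0 - eps)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)

def sample_hybrid_action(skill_logits, aim_params, charge_params, rng):
    """Sample (skill, aim, charge) and return the joint log-probability.

    aim_params[skill]    = ((mean_x, mean_y), std)  -- hypothetical layout
    charge_params[skill] = (alpha, beta)            -- hypothetical layout
    """
    probs = softmax(skill_logits)
    # Discrete head: pick the skill from the menu.
    skill = rng.choices(range(len(probs)), weights=probs)[0]
    # Continuous heads are conditioned on the chosen skill.
    (mean_x, mean_y), std = aim_params[skill]
    aim = (rng.gauss(mean_x, std), rng.gauss(mean_y, std))
    alpha, beta = charge_params[skill]
    charge = rng.betavariate(alpha, beta)  # support is already [0, 1]
    log_prob = (math.log(probs[skill])
                + gaussian_logpdf(aim[0], mean_x, std)
                + gaussian_logpdf(aim[1], mean_y, std)
                + beta_logpdf(charge, alpha, beta))
    return skill, aim, charge, log_prob
```

Because the log-probability is a sum, the discrete and continuous loss terms can be weighted separately while still sharing one advantage estimate.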
3 · Action masking — the detail that decides whether training even converges
Intuition. Most frames, most actions are illegal: the skill is on cooldown, you lack mana, there's no target. If the network can put probability on illegal actions, it wastes capacity learning "don't do that" and, worse, can produce NaN gradients. A mask says "these outputs are forbidden this frame" before the agent ever samples.
Engineering detail. The mask is a fixed-length byte array built engine-side (e.g. 12,288 = 64×64 grid × 3 command types) and carried alongside the observation so it can be passed without copying. In the policy, illegal logits are set to a large negative number (or −∞) before the softmax, so their probability underflows to zero and illegal actions are never sampled.
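A minimal sketch of the masking step, assuming the mask arrives as a `bytes` object with 1 for legal and 0 for illegal. The function name `masked_softmax` is hypothetical. It uses −∞ rather than a large finite negative; in floating point both give a probability of 0.0 after the exponential.

```python
import math

NEG_INF = float("-inf")

def masked_softmax(logits, legal_mask):
    """Softmax over logits with illegal entries forced to probability 0.

    legal_mask is any sequence of truthy/falsy flags, e.g. a bytes object.
    """
    if not any(legal_mask):
        # With every action masked the softmax would divide by zero,
        # so surface the problem instead of returning NaN.
        raise ValueError("no legal action this frame")
    masked = [x if ok else NEG_INF for x, ok in zip(logits, legal_mask)]
    m = max(masked)  # finite, because at least one action is legal
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

# Usage: the middle action is on cooldown this frame.
probs = masked_softmax([1.0, 2.0, 3.0], bytes([1, 0, 1]))
```

The explicit all-masked check matters in practice: a frame with no legal action is the usual source of NaNs, so the engine should always expose at least a no-op.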
4 · Reward — sparse by nature, hackable once you shape it
Intuition. The only honest reward is win/loss, once per match. That is far too sparse to learn from, so you shape it: small bonuses for kills, towers, farm. But every shaping term is a new objective the agent will optimize literally — and the literal optimum is rarely what you meant. This is reward hacking.
Engineering detail: shaping that decays. Early in training the agent needs dense guidance; late in training the dense bonuses distort strategy. So shaping weights decay over training time, and they scale with game state: for example, the tower-reward weight is amplified when the team is behind, so the agent learns comebacks rather than only snowballing. A typical structure combines these decaying, state-scaled shaping terms with a moral cutoff (zero the kill reward if the victim's recent gold/level gain is trivial, to stop "kill-farming") and a pseudo-count exploration bonus 0.5/√(n+1), where n is the pseudo-count, that is on during training and off in production.
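The decay, state scaling, moral cutoff, and exploration bonus described above could be sketched as follows. Only the bonus 0.5/√(n+1) comes from the text. The half-life schedule, the gold-deficit scaling, every function name, and every coefficient are assumptions chosen for illustration, not the schedule a production system would use.

```python
import math

def shaping_weight(step, base=1.0, half_life=1_000_000):
    """Dense-shaping weight that halves every `half_life` steps (assumed schedule)."""
    return base * 0.5 ** (step / half_life)

def tower_scale(gold_diff, boost=0.5, full_deficit=5000.0):
    """Amplify the tower reward when behind; gold_diff < 0 means the team trails."""
    deficit = min(1.0, max(0.0, -gold_diff) / full_deficit)
    return 1.0 + boost * deficit

def exploration_bonus(pseudo_count, training=True):
    """Pseudo-count bonus 0.5 / sqrt(n + 1); switched off outside training."""
    return 0.5 / math.sqrt(pseudo_count + 1) if training else 0.0

def shaped_reward(win_loss, kills, towers, step, gold_diff,
                  victim_gain_trivial=False, pseudo_count=0, training=True,
                  kill_bonus=0.2, tower_bonus=0.5):
    w = shaping_weight(step)
    # Moral cutoff: a kill on a victim with trivial recent gain earns nothing.
    kill_term = 0.0 if victim_gain_trivial else kill_bonus * kills
    tower_term = tower_bonus * towers * tower_scale(gold_diff)
    # The win/loss signal is never decayed; only the shaping terms are.
    return win_loss + w * (kill_term + tower_term) + exploration_bonus(pseudo_count, training)
```

Keeping the win/loss term outside the decayed weight means the agent ends training optimizing the honest objective, whatever the shaping did early on.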
5 · Self-play — the opponent that learns back
Intuition. There is no fixed opponent to train against; the best opponent is a copy of yourself. Self-play creates an automatic curriculum — as you improve, so does your sparring partner. But it introduces a subtle systems problem: in a distributed cluster the opponent's strength (its Elo rating) is measured with delay, and a stale opponent rating injects noise straight into your advantage estimates.
Engineering detail: staleness is a variance multiplier. Model Elo as a delayed stochastic-approximation process. If the learner uses an opponent rating that is Δk games stale, the rating error is a random walk with variance ∝ Δk, and that error propagates into GAE, amplifying advantage variance by a factor that grows with Δk and depends on τ, the Elo autocorrelation.
More variance means you need a bigger batch to keep the same convergence speed, and batch size is GPU dollars. Napkin math: a half-day sync delay can multiply your training bill.
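Two pieces of that napkin math can be checked directly, without assuming any particular form for the amplification factor. The sketch below simulates a Gaussian random walk to show its variance growing linearly with the number of stale games, then scales a batch size by a given variance multiplier, since the variance of a batch mean is the per-sample variance divided by the batch size. The function names, step size, and trial count are assumptions for illustration.

```python
import math
import random

def random_walk_variance(steps, trials=4000, step_std=1.0, seed=0):
    """Empirical variance of a Gaussian random walk after `steps` steps."""
    rng = random.Random(seed)
    finals = []
    for _ in range(trials):
        pos = 0.0
        for _ in range(steps):
            pos += rng.gauss(0.0, step_std)
        finals.append(pos)
    mean = sum(finals) / trials
    return sum((x - mean) ** 2 for x in finals) / (trials - 1)

def batch_for_same_noise(base_batch, variance_multiplier):
    """Batch size that keeps the variance of the batch mean unchanged."""
    return math.ceil(base_batch * variance_multiplier)

# Usage: four times the staleness gives roughly four times the rating-error variance.
ratio = random_walk_variance(40) / random_walk_variance(10)
```

Whatever the exact multiplier turns out to be for a given cluster, the cost relationship is linear: a multiplier of m on advantage variance needs about m times the batch to hold gradient noise constant.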
6 · The 16 ms wall — real-time inference is part of the MDP
Intuition. A 60-FPS game gives the policy ~16 ms per frame to see, think, and act. A model that is 1% stronger but misses the frame deadline is useless — the agent stutters. So inference latency is not an afterthought; it is a hard constraint that shapes the architecture.
Engineering detail. Three levers, each from the source playbook: (1) INT8 quantization of the policy, with KL(π_int8 ‖ π_fp32) tracked as an auxiliary loss and an FP32 "shadow engine" comparing logits in production; auto-recalibrate if KL drifts past ~0.02. (2) LSTM state reuse: keep hidden states GPU-resident in a ring buffer, capture the "load-state → infer → write-state" pipeline as a CUDA graph (CPU-side cost ~52 µs), reaching ~0.2 ms/frame. (3) Warp-level batch-1 sampling on mobile GPUs: pack 32 pseudo-environments into one warp, use fp16 vector loads and a piecewise-linear tanh, bringing a full step to under 5 ms on a phone.
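The shadow-engine check in lever (1) reduces to a KL divergence between two categorical distributions built from the same frame's logits. A minimal sketch, assuming both engines expose raw logits; the function names are hypothetical, and the 0.02 threshold is the approximate figure quoted above.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(logits_p, logits_q):
    """KL(p || q) for categorical distributions given as logits."""
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def needs_recalibration(int8_logits, fp32_logits, threshold=0.02):
    """True when the quantized policy has drifted too far from the FP32 shadow."""
    return kl_divergence(int8_logits, fp32_logits) > threshold

# Usage: a small logit perturbation stays under the threshold.
drifted = needs_recalibration([2.0, 0.0, -1.0], [2.05, 0.0, -1.0])
```

In production the check would more likely run on a sampled fraction of frames and be averaged over a window, so that one noisy frame does not trigger a recalibration.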
Further considerations
- Multi-agent credit assignment. In 5v5, who gets the reward for a tower that an assist enabled? A counterfactual baseline measures each agent's marginal contribution before the decay is applied, which stops agents from farming "fake assists."
- Value decomposition limits. QMIX's monotonicity constraint cannot represent "sacrifice-then-burst" coordination; at 5 agents the value error compounds. Weighted-QMIX, monotonic-attention hypernets, or dropping monotonicity (QPLEX) buy expressiveness at the price of sample and compute cost: a decision, not a default.
- Policy cycling. Self-play can fall into rock-paper-scissors loops. Detect via the effective rank of the win-rate matrix and α-divergence between policy-embedding snapshots; on detection, soft-restart (perturb policy weights, spike entropy, reset LR to warm-up) while keeping the value network.
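For the credit-assignment point above, one common form of counterfactual baseline is the COMA-style advantage. It compares the joint action's value with the expected value when only one agent's action is resampled from its own policy. The text does not say which variant is intended, so treat this as one possible reading; `q_fn` and the toy tower example are hypothetical.

```python
def counterfactual_advantage(q_fn, joint_action, agent, policy_probs):
    """COMA-style advantage: Q(joint) minus the agent-marginalized baseline.

    q_fn maps a tuple of per-agent actions to a team value.
    policy_probs[a] is this agent's probability of choosing action a.
    """
    baseline = 0.0
    for alt_action, prob in enumerate(policy_probs):
        alt_joint = list(joint_action)
        alt_joint[agent] = alt_action  # swap only this agent's action
        baseline += prob * q_fn(tuple(alt_joint))
    return q_fn(tuple(joint_action)) - baseline

# Toy case: the tower falls only if agent 0 attacks (action 1);
# agent 1's action has no effect on the outcome.
def toy_q(joint):
    return 1.0 if joint[0] == 1 else 0.0

adv_attacker = counterfactual_advantage(toy_q, (1, 1), 0, [0.5, 0.5])
adv_bystander = counterfactual_advantage(toy_q, (1, 1), 1, [0.5, 0.5])
```

In the toy case the bystander's advantage is zero because changing its action never changes the team value, which is exactly the "fake assist" the baseline is meant to filter out.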