Quant-trading order execution
You are handed a parent order ("buy 200k shares before the close") and an RL agent that decides, microsecond by microsecond, what to post and where. The domain looks like a clean MDP until you write it down. The state is a level-2 order book that leaks the future if you encode it carelessly. The action is a hybrid price-and-size decision constrained by a 500 µs exchange round-trip. The reward conflates your own market impact with the market's noise. You can never run the policy live to evaluate it, so everything rides on off-policy estimation under a non-stationary market. And every action is gated by regulation that can change at 9am. Each binding difficulty names a tool.
1 · Formulate — the MDP behind an execution agent
Intuition. A parent order has to be sliced into hundreds of child orders over a horizon (say, the trading day). At each tick the agent sees the order book, decides where and how much to post, and is rewarded for buying cheap relative to a benchmark while not moving the price against itself. That is an MDP — but every one of the four pieces is awkward in a way that does not show up in a game.
| Piece | For an execution agent | The awkward part |
|---|---|---|
| State S | top-10 book levels per side (price + size per level → 20 level rows × 2 channels), inventory remaining, time left, realized volatility, stacked as a (K×21×2) causal tensor over the last K snapshots | naïve features leak the future; a persistent order-queue ID breaks the Markov property |
| Action A | which price level to post at (discrete, 5–10 ticks) × price offset and size fraction (continuous) | hybrid discrete-continuous, and it must clear in under the 500 µs exchange round-trip |
| Reward R | execution price vs. VWAP/arrival benchmark, minus slippage | your fill moves the price — reward = your impact + market noise, two confounded signals |
| Transition P | the matching engine + every other participant in the book | non-stationary regime; you can never A/B the live market → evaluation is off-policy only |
Same pattern as every domain: the MDP table writes itself and the rightmost column is the lesson. The four rows above become four mechanisms: causal encoding, structured masked heads, impact-separated reward, and off-policy evaluation with a regulatory guard.
2 · The state leaks the future — build a causal tensor
Intuition. The order book is a snapshot of prices and sizes at ten levels on each side. If you hand the network raw prices it learns scale, not structure; worse, if any feature is computed from data that only existed after the decision instant, the backtest looks brilliant and the live system loses money. The whole job here is to encode the book so the network sees only the past, in a form that is stationary across price regimes.
Engineering detail. Encode each level's price as a log ratio to the mid-price, for example log(price/mid), which is scale-free and stable across a stock at $5 or $500, and encode each level's size as log(size + 1).
The +1 guards against log(0) on an empty level. Maintain a deque of the last K=20 snapshots: append on arrival and drop the oldest entry from the left when full (a deque with maxlen=K does both in O(1) per tick, which the microsecond budget needs). Stack the snapshots into a (K, 21, 2) tensor: axis 0 is time (index 0 oldest, K−1 = now, strict causal order), axis 1 is the 20 book rows plus a 21st row of relative timestamps Δτ = tᵢ − t₀, and axis 2 is the price/volume channel, so a 2-D CNN can convolve over (time, depth) with kernel=(3,3), dilation=(2,1), which widens the temporal receptive field at no extra parameters. A Transformer variant reshapes to (K, 42) and masks the upper triangle for causal self-attention.
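As a minimal sketch of the rolling window described above, the following dependency-free Python keeps the last K encoded snapshots in causal order. The names `encode_snapshot` and `CausalBookBuffer` are hypothetical. The sketch assumes ten levels per side (so 20 level rows), uses log(price/mid) and log(size + 1) as the per-level features, and writes Δτ into both channels of the 21st row; nested lists stand in for whatever array type a production system would use.

```python
from collections import deque
import math

K = 20  # window length used in the text


def encode_snapshot(bids, asks):
    """Encode one book snapshot as 20 rows of [price_feature, size_feature].

    bids / asks: lists of (price, size) tuples, best level first (assumed layout).
    """
    mid = 0.5 * (bids[0][0] + asks[0][0])
    # Log ratio to mid is scale-free; +1 keeps the log finite on empty levels.
    return [[math.log(p / mid), math.log(v + 1.0)] for p, v in bids + asks]


class CausalBookBuffer:
    """Keeps the last k snapshots, oldest at index 0 and newest at index -1."""

    def __init__(self, k=K):
        # append is O(1); the oldest entry is evicted automatically when full
        self.frames = deque(maxlen=k)

    def push(self, t, bids, asks):
        self.frames.append((t, encode_snapshot(bids, asks)))

    def tensor(self):
        """Return nested lists shaped (n_frames, 21, 2); row 20 holds t_i - t_0."""
        t0 = self.frames[0][0]
        return [rows + [[t - t0, t - t0]] for t, rows in self.frames]


# Usage: push 25 synthetic snapshots; only the last 20 are kept.
buf = CausalBookBuffer()
for i in range(25):
    bids = [(100.0 - 0.01 * (j + 1), 10.0 * j) for j in range(10)]
    asks = [(100.0 + 0.01 * (j + 1), 10.0 * j) for j in range(10)]
    buf.push(float(i), bids, asks)
x = buf.tensor()
shape = (len(x), len(x[0]), len(x[0][0]))
```

Because the deque only ever appends on the right, no feature in frame i can depend on data that arrived after frame i, which is the causality property the section asks for.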
Engineering detail — irregular arrivals. A tick-by-tick order feed arrives at non-uniform intervals, so absolute Δτ misleads. Replace it with an exponential time-decay encoding, exp(−α·(t_now − tᵢ)), with decay rate α = ln 2/τhalf and half-life τhalf ≈ 500 ms, which keeps causality while expressing continuous time. On an edge inference accelerator the (K=20, 21, 2) tensor runs FP16 in under ~400 µs, inside the 500 µs exchange round-trip. Push K past ~40 and latency grows linearly; if you need a longer memory, distill to a student network at K=10 fed differenced inputs to retain the information at a fraction of the cost.
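The time-decay encoding can be sketched in a few lines. The function name `time_decay_features` is hypothetical, and the sketch treats τhalf as a true half-life, so the decay rate is ln 2/τhalf; if τhalf is meant as an e-folding time constant instead, the rate would be 1/τhalf.

```python
import math


def time_decay_features(timestamps_ms, half_life_ms=500.0):
    """Map each snapshot's age to exp(-alpha * age); the newest snapshot maps to 1.0.

    timestamps_ms must be in arrival order (oldest first), so each feature
    depends only on the past.
    """
    alpha = math.log(2.0) / half_life_ms  # decay rate implied by a half-life
    t_now = timestamps_ms[-1]
    return [math.exp(-alpha * (t_now - t)) for t in timestamps_ms]


# Irregular arrivals: gaps of 500 ms, 400 ms and 100 ms.
w = time_decay_features([0.0, 500.0, 900.0, 1000.0])
```

A snapshot exactly one half-life old gets weight 0.5 and one two half-lives old gets 0.25, regardless of how many ticks arrived in between.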
3 · The action is hybrid and latency-bounded — structure it, mask it, schedule it
Intuition. "Post 30% of what's left, one tick inside the bid" is two decisions of different types: which price level (a choice from a menu of 5–10 ticks) and how far in and how much (a continuous offset and size fraction). Flattening into one giant discrete grid explodes; treating it as one continuous vector throws away the tick structure. Keep both — a discrete head for the level and conditioned continuous heads for offset and size — and never let the agent emit an action it can't legally place.
Engineering detail. A shared two-layer LayerNorm+ReLU MLP (256 units → 64-dim latent) feeds a discrete head (one logit per price level: 5 levels on the main board, up to 10 on a tick-finer board) and a continuous head (2-D: price offset Δp, size fraction Δq). Sample the discrete part with Gumbel-Softmax (τ=0.5, straight-through: one-hot forward, soft gradient backward) and the continuous part with a Tanh-Normal; the final order is priced at level-center price + round(Δp/tick)·tick and sized at Δq × remaining quantity. Critically, the replay buffer stores the raw continuous outputs, not the rounded prices, because rounding before storage severs the gradient. Anneal τ downward on a linear schedule, stepping every ~500 updates, so Gumbel-Softmax converges toward a true one-hot. The loss combines policy NLL × GAE advantage, entropy bonuses for the discrete and continuous heads (subtracted from the loss), value MSE, and a hard compliance penalty λ·ReLU(price − limit) for an upper price limit (mirrored for a lower limit), with λ=1e3 so the gradient satisfies regulation before it optimizes return. Against a TWAP baseline this lifts the information ratio by ~0.8 and cuts the cancel rate ~35%.
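The decoding step from sampled action to placed order might look like the sketch below. The function `decode_order`, the `max_offset_ticks` bound, and the mapping of the size output from (−1, 1) to (0, 1) are illustrative assumptions; the tick snapping and the "Δq × remaining quantity" sizing follow the text.

```python
import math


def decode_order(level_idx, raw_dp, raw_dq, level_prices, tick,
                 remaining_qty, max_offset_ticks=2):
    """Turn a sampled hybrid action into (price, quantity).

    raw_dp / raw_dq are the pre-squash network outputs. These raw values,
    not the rounded price, are what should be stored in the replay buffer.
    """
    dp = math.tanh(raw_dp) * max_offset_ticks * tick  # offset within +/- max ticks
    dq = 0.5 * (math.tanh(raw_dq) + 1.0)              # size fraction in (0, 1)
    price = level_prices[level_idx] + round(dp / tick) * tick  # snap to tick grid
    qty = int(dq * remaining_qty)
    return round(price, 10), qty                      # trim float noise


levels = [99.98, 99.99, 100.00]
order_a = decode_order(1, 0.0, 0.0, levels, tick=0.01, remaining_qty=1000)
order_b = decode_order(1, 10.0, 0.0, levels, tick=0.01, remaining_qty=1000)
```

Here `order_a` posts half the remaining quantity at the chosen level, while `order_b` saturates the offset and lands two ticks above it. A production version would likely do the tick arithmetic in integers to avoid floating-point drift.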
4 · The reward confounds your impact with the market's noise — separate them
Intuition. When you buy, the price ticks up — partly because you pushed it (your market impact), partly because the market was moving anyway (noise). A raw price-improvement reward mixes the two, so the agent can't tell whether a bad fill was its fault or the market's. Worse, impact has two parts: a temporary component that relaxes after you stop trading, and a permanent component that stays — and the regulator wants the permanent part reported. So you separate the signal, normalize the learnable part, and keep the reportable part on the side.
Engineering detail — online impact separation. On each transition (s,a,r,s′), track the permanent-impact mean μ̂perm with an EWMA, take the residual as temporary impact, and standardize that residual by its running scale σ̂temp to get the shaped reward r̃.
The shaped reward r̃ has mean 0 and variance 1, ready for any deep-RL algorithm. Feed μ̂perm into the critic as an extra state feature so the value head only learns "de-drifted" value, while the actor's gradient uses the standardized r̃ to cut variance. For the compliance report, add — outside the Bellman target, so it stays unbiased — a logged term λ·sign(μ̂perm)·log(1+|μ̂perm|), with λ calibrated by risk control, and persist μ̂perm and σ̂temp per child order to generate the regulator's impact-cost curve.
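One way to realize the online separation is sketched below. The class `ImpactSeparator`, the EWMA step size `beta`, and the choice to track the residual's variance with a second EWMA are assumptions made for illustration; the text fixes only the overall recipe (EWMA mean, residual, standardize).

```python
import math
import random


class ImpactSeparator:
    """Online split of a raw reward into a drift (permanent) and a residual (temporary)."""

    def __init__(self, beta=0.05, eps=1e-8):
        self.beta = beta      # EWMA step size (hypothetical default)
        self.eps = eps
        self.mu_perm = 0.0    # running mean: permanent-impact estimate
        self.var_temp = 1.0   # running variance of the residual

    def update(self, r):
        """Consume one raw reward and return the standardized reward."""
        self.mu_perm += self.beta * (r - self.mu_perm)
        resid = r - self.mu_perm  # temporary-impact residual
        self.var_temp += self.beta * (resid * resid - self.var_temp)
        return resid / math.sqrt(self.var_temp + self.eps)

    @property
    def sigma_temp(self):
        return math.sqrt(self.var_temp)


# Usage on synthetic rewards with a persistent drift of -1.5.
rng = random.Random(0)
sep = ImpactSeparator()
shaped = [sep.update(rng.gauss(-1.5, 0.3)) for _ in range(5000)]
tail = shaped[1000:]  # skip the warm-up while the EWMAs settle
tail_mean = sum(tail) / len(tail)
tail_var = sum((v - tail_mean) ** 2 for v in tail) / len(tail)
```

After warm-up, `sep.mu_perm` sits near the injected drift and the shaped series is roughly zero-mean with unit-order variance. `mu_perm` and `sigma_temp` are the two quantities the text says to persist per child order.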
Engineering detail — taming VWAP-deviation outliers. When the reward is deviation from VWAP, a single fat-tailed print blows up the gradient. Defend in layers: (1) Winsorize — linearly penalize only the part of the deviation beyond a threshold δ to cap gradient magnitude. (2) Switch exploration noise from 𝒩(0,σ²) to a truncated normal and halve σ when the anomaly rate exceeds 15%. (3) Down-weight extreme samples with importance weight w = exp(−|r|/τ) so outlier rewards decay exponentially, then map the winsorized reward through tanh into [−1,1] so the gradient is bounded but never identically zero. On index-constituent backtests this cuts reward variance ~62% and max drawdown ~1.8pp while leaving the information coefficient essentially unchanged.
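The three defensive layers could be combined roughly as follows. This is a sketch under stated assumptions: `tame_reward`, `exploration_sigma`, and the default values of `delta`, `slope`, and `tau` are hypothetical, and "linearly penalize only the part beyond δ" is interpreted here as letting a damped linear share of the excess through. The truncated-normal sampler itself is omitted; only the σ-halving rule is shown.

```python
import math


def tame_reward(r, delta=1.0, slope=0.1, tau=5.0):
    """Return (bounded_reward, sample_weight) for one VWAP-deviation reward."""
    # (1) Soft winsorization: keep |r| up to delta, pass only a damped
    #     linear share of the excess (slope < 1 is an assumed choice).
    excess = max(abs(r) - delta, 0.0)
    r_w = math.copysign(min(abs(r), delta) + slope * excess, r)
    # (3) Exponential down-weighting of extreme samples: w = exp(-|r| / tau).
    weight = math.exp(-abs(r) / tau)
    # tanh bounds the reward in (-1, 1) while keeping a nonzero gradient.
    return math.tanh(r_w), weight


def exploration_sigma(sigma, anomaly_rate, threshold=0.15):
    """(2) Halve the exploration scale when the anomaly rate exceeds the threshold."""
    return 0.5 * sigma if anomaly_rate > threshold else sigma


normal, w_normal = tame_reward(0.5)
spike, w_spike = tame_reward(50.0)
```

An ordinary reward passes through almost unchanged with a weight near 1, while a fat-tailed print is squashed below 1 in magnitude and carries a negligible sample weight.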
Engineering detail — slippage discount that adapts to inventory. The slippage penalty should matter less when little is left to trade. Make the potential-shaping discount γslip a tanh function of the remaining-quantity fraction qrem/Q, bounded between γmin and γmax, and fold it into the advantage so the policy-gradient theorem still holds.
Versus a fixed γslip=0.9 baseline, this adaptive form cut slippage cost ~18.7% and lifted fill rate ~4.2% while holding the cancel rate under 15%. For power-law-liquidity names, swap tanh for a stretched exponential γslip = γmin + (γmax−γmin)·exp(−α·(qrem/Q)^β) and meta-learn α, β across stocks.
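The stretched-exponential variant is easy to check numerically. The sketch below implements the formula exactly as written above; the function name `gamma_slip` and the default values for γmin, γmax, α, and β are placeholders, since the text leaves them to calibration or meta-learning.

```python
import math


def gamma_slip(q_rem, q_total, gamma_min=0.5, gamma_max=0.99, alpha=3.0, beta=0.7):
    """Stretched-exponential slippage discount as a function of remaining quantity."""
    frac = min(max(q_rem / q_total, 0.0), 1.0)  # remaining fraction, clamped to [0, 1]
    return gamma_min + (gamma_max - gamma_min) * math.exp(-alpha * frac ** beta)


# Discount along an execution schedule, from nothing left to everything left.
curve = [gamma_slip(q, 1000.0) for q in (0.0, 250.0, 500.0, 1000.0)]
```

With these placeholder values the discount equals γmax when nothing is left and decays monotonically toward γmin as the remaining fraction grows; α sets how fast and β sets how heavy the tail is.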
5 · You can never run the policy live to test it — off-policy evaluation under non-stationarity
Intuition. You cannot A/B a new execution policy against the real market without spending real money and moving real prices. So you must estimate a new policy's value purely from logs generated by the old one (off-policy evaluation, OPE) in a market whose statistics drift across the day. The naïve estimator (importance-weighted return) is unbiased only if the logs come from the same stationary market the policy will face, and it is usable only if the weights don't explode. Both assumptions fail, so you build an estimator that survives drift and a test that tells you when "unbiased" stops being true.
Engineering detail — IPS, then doubly robust. Estimate value with importance-weighted returns over a rolling 60-day window, V̂(π₁) = (1/n)·Σᵢ ρᵢRᵢ, where ρᵢ is the ratio of new-policy to logging-policy action probabilities, and tame the weights with truncation (ρmax=20) and Pareto smoothing. To detect when non-stationarity has broken unbiasedness, take the error series εᵢ against a same-seed simulator, run CUSUM to cut it into locally stationary blocks, t-test each block, and control the false-discovery rate across blocks with Benjamini–Hochberg at 5%. Reduce residual variance with the doubly robust (DR) estimator, which uses a fitted Q model as a control variate; the Q model only has to be approximately right.
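For reference, the standard one-step doubly robust form is V̂_DR = (1/n)·Σᵢ [ V̂(sᵢ) + ρᵢ·(Rᵢ − Q̂(sᵢ,aᵢ)) ], where V̂(sᵢ) is the fitted Q averaged over the new policy's actions. The sketch below computes it next to truncated IPS. The log-record keys (`rho`, `reward`, `q_hat`, `v_hat`) are hypothetical, Pareto smoothing is left out for brevity, and whether the original text intends this one-step form or a per-step recursive variant is an assumption.

```python
def ips_value(logs, rho_max=20.0):
    """Truncated importance-weighted return: mean of min(rho, rho_max) * reward."""
    return sum(min(d["rho"], rho_max) * d["reward"] for d in logs) / len(logs)


def dr_value(logs, rho_max=20.0):
    """One-step doubly robust estimate.

    Each record needs:
      rho    - pi_new(a|s) / pi_log(a|s) for the logged action
      reward - realized return
      q_hat  - fitted Q(s, a) at the logged action
      v_hat  - fitted Q(s, .) averaged over the new policy's actions
    """
    total = 0.0
    for d in logs:
        rho = min(d["rho"], rho_max)
        # Model baseline plus an importance-weighted correction of the model's error.
        total += d["v_hat"] + rho * (d["reward"] - d["q_hat"])
    return total / len(logs)


logs = [
    {"rho": 2.0, "reward": 1.0, "q_hat": 0.8, "v_hat": 0.9},
    {"rho": 50.0, "reward": -0.5, "q_hat": -0.4, "v_hat": -0.3},  # weight gets truncated to 20
]
v_ips = ips_value(logs)
v_dr = dr_value(logs)
```

On this toy log the IPS estimate is dominated by the single heavy-weight record, while the DR estimate only importance-weights the Q model's residual, which is why an approximately right Q model is enough to cut variance.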
Push variance down further by estimating the Q correction with quasi-Monte-Carlo rollouts (Sobol sequences via torch.quasirandom.SobolEngine, or Halton sequences via scipy.stats.qmc.Halton; Gray-code the discrete actions to preserve space-filling), giving Var(V̂DR-QMC) ≤ (1+‖ρ‖∞)²·O((log K)^d/K²) + εQ, where K here counts QMC points and d is their dimension. Use the DR-QMC value as a safety gate: if a candidate policy improves the estimated value by less than 2% and reduces variance by less than 10%, reject it before any capital is risked.
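To show the idea without external dependencies, the sketch below swaps the library generators for a hand-rolled base-2 radical inverse (the one-dimensional Halton, or van der Corput, sequence) and uses it to estimate the policy-averaged Q term for a Gaussian continuous action. The function names, the toy Q function, and the Gaussian policy are all illustrative assumptions; a real pipeline would more likely use the Sobol or Halton generators from an established library.

```python
from statistics import NormalDist


def radical_inverse(i, base=2):
    """i-th point of the van der Corput sequence (1-D Halton), in (0, 1) for i >= 1."""
    inv, f = 0.0, 1.0 / base
    while i > 0:
        inv += (i % base) * f
        i //= base
        f /= base
    return inv


def qmc_expected_q(q_hat, mean, std, n_points=1024):
    """Estimate E[q_hat(a)] for a ~ Normal(mean, std) with low-discrepancy draws."""
    policy = NormalDist(mean, std)
    # Start at i = 1: radical_inverse(0) is 0, which has no finite inverse CDF.
    actions = [policy.inv_cdf(radical_inverse(i)) for i in range(1, n_points + 1)]
    return sum(q_hat(a) for a in actions) / n_points


# Toy check: for Q(a) = -(a - 0.3)^2 and a ~ N(0.1, 0.2^2),
# the exact expectation is -((0.1 - 0.3)^2 + 0.2^2) = -0.08.
est = qmc_expected_q(lambda a: -(a - 0.3) ** 2, mean=0.1, std=0.2)
```

The low-discrepancy points cover the unit interval evenly, so pushing them through the inverse CDF samples the policy without the clumping that plain pseudo-random draws show at the same sample count.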
6 · Guard in production — the regulator is part of the environment
Intuition. An execution policy that is profitable but non-compliant is worthless: it gets the desk's trading license pulled. Self-trading, "flickering" orders that cancel within milliseconds to fake liquidity, and breaching a price limit are all illegal, and the rules can change overnight. So compliance is not a post-hoc filter — it is wired into the action space, the loss, and a runtime circuit breaker.
Engineering detail — three layers against flickering / fake fills. (1) Reward: penalize cancels of orders that rested <50 ms. (2) Environment hard wall: the Gym step() embeds the matching-engine rule engine, returns a "forbidden" signal for sub-50 ms cancels, and randomizes feed latency over 0–500 ms so the policy can't stably exploit a fixed delay. (3) Conservative offline training: train with BCQ/CQL on historical flickering negatives so the policy assigns low value to high-cancel regions, and every 10k steps run a replay evaluation — if replay fill-rate diverges from training fill-rate by >3%, auto-rollback and freeze that period's samples. Together this compresses the inflated fill rate from flickering to under 1%.
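Layers (1) and (2) can share one small piece of bookkeeping inside the environment. The sketch below is hypothetical: `FlickerGuard`, its method names, and the unit penalty are illustrative, and the latency randomization and the offline BCQ/CQL training from layer (3) are out of scope here.

```python
MIN_REST_MS = 50.0  # minimum resting time from the text


class FlickerGuard:
    """Tracks order placement times and blocks cancels of orders that rested too briefly."""

    def __init__(self, penalty=1.0):
        self.penalty = penalty  # reward penalty per blocked cancel (assumed value)
        self.placed_at = {}     # order_id -> placement time in ms

    def place(self, order_id, now_ms):
        self.placed_at[order_id] = now_ms

    def cancel(self, order_id, now_ms):
        """Return (allowed, reward_adjustment) for a cancel request."""
        rested = now_ms - self.placed_at[order_id]
        if rested < MIN_REST_MS:
            # Hard wall: the order stays live and the agent pays the penalty.
            return False, -self.penalty
        del self.placed_at[order_id]
        return True, 0.0


guard = FlickerGuard()
guard.place("child-1", now_ms=0.0)
too_early = guard.cancel("child-1", now_ms=10.0)
later = guard.cancel("child-1", now_ms=60.0)
```

An environment's `step()` could call `cancel` before forwarding the request to the matching-engine simulator, returning the "forbidden" signal whenever the first element is `False`.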
Engineering detail — a differentiable compliance mask. Build a binary mask M ∈ {0,1} over the action space marking actions that would cause self-trading (price/size in the self-cross band). Set the forbidden logits to −∞ (a zero logit would still leave the action with positive probability), so the softmax renormalizes over legal actions only; for continuous actions, project the masked dimension to the nearest legal region and record the projection distance as a penalty. The total loss is L = LRL + λ·(‖araw − alegal‖² + Iself·Rpenalty), with λ ∈ [1e2, 1e3] so compliance is satisfied before return is optimized. Register the mask in the graph (Torch FX / ONNX export) so the broker's gateway can do a simulation-vs-live consistency check with zero drift.
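Both halves of the mask, discrete and continuous, can be sketched in plain Python. The helper names are hypothetical, forbidden discrete actions are removed by sending their logits to −∞ before the softmax, and the legal continuous region is assumed to be a single interval; the squared distance matches the ‖araw − alegal‖² term in the loss.

```python
import math


def masked_softmax(logits, legal):
    """Softmax over legal actions only; illegal actions get probability exactly 0."""
    masked = [z if ok else -math.inf for z, ok in zip(logits, legal)]
    m = max(masked)
    if m == -math.inf:
        raise ValueError("no legal action available")
    exps = [math.exp(z - m) for z in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def project_to_legal(x, lo, hi):
    """Clamp a continuous action into [lo, hi]; return (projected, squared distance)."""
    x_legal = min(max(x, lo), hi)
    return x_legal, (x - x_legal) ** 2


probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
a_legal, penalty = project_to_legal(1.5, 0.0, 1.0)
```

In a tensor framework the same effect usually comes from filling masked logits with a large negative value before the softmax, which keeps the operation inside the exported graph.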
Engineering detail — Lagrangian multipliers that re-tune in minutes. When a rule changes, you cannot afford to retrain. Treat the constraint as a Lagrangian and update the multiplier online: each trajectory, compute mean constraint violation ĉ = (1/T)·Σₜ g(sₜ,aₜ) − ḡnew, then λₖ₊₁ = [λₖ + κ·ĉ]₊, with the positive-part projection smoothed by a softplus (κ=0.1 damps oscillation, softplus stops small violations from amplifying noise) and λ capped by the regulator's worst-case violation cost. The actor maximizes Rₜ − λₖ₊₁·cₜ + γ·H(π), with the entropy coefficient γ rising as λ grows so the policy doesn't collapse onto a degenerate "do nothing" action. Hold a sliding-window compliance rate ≥99.5% and a return drawdown <2% as dual gates; tripping either rolls back to the last stable multiplier, a "gray-scale rollback" that re-calibrates to a new rule in ~15 minutes with zero offline retraining.
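A single multiplier update might look like the sketch below. The function names and the cap value are hypothetical, and applying the softplus directly to λ + κ·ĉ is one reading of "through a softplus"; the dual gates and the rollback logic are not shown.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)), a smooth version of max(x, 0)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def update_multiplier(lam, g_values, g_limit, kappa=0.1, lam_cap=1e3):
    """One dual-ascent step; returns (new_lambda, mean_violation)."""
    c_hat = sum(g_values) / len(g_values) - g_limit  # > 0 means the rule was breached
    lam_next = softplus(lam + kappa * c_hat)          # smooth positive-part projection
    return min(lam_next, lam_cap), c_hat


lam_up, c_up = update_multiplier(5.0, [3.0, 1.0], g_limit=0.0)      # breached
lam_down, c_down = update_multiplier(5.0, [0.0, 0.0], g_limit=1.0)  # compliant
lam_capped, _ = update_multiplier(2000.0, [1.0], g_limit=0.0)       # hits the cap
```

The multiplier rises after a breach and relaxes when the trajectory is compliant. Note that a softplus never returns exactly zero, so λ stays slightly positive even under sustained compliance.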
Further considerations
- Message stream vs. snapshot input. Mixing raw event messages with periodic snapshots gives finer microstructure but costs you time-alignment (maintain dual event-time + processing-time axes, or off-policy IS weights diverge and gradient variance jumps an order of magnitude), storage blow-up (raw messages dwarf snapshots; use differential snapshots + protobuf deltas), and hyperparameter sensitivity (the snapshot mix ratio α is best in 0.2–0.4; above 0.5 policy entropy collapses into local optima). Monitor the KL between mixed and snapshot-only inputs and auto-rollback past ~0.02. Compliance often demands keeping the raw-message + snapshot + model-version triple for 5 years with tamper-evident hashes.
- Multi-instrument joint execution. Trading an ETF and an index future together stacks their book tensors along the channel axis into (K, 21, 4) over a shared CNN backbone, routed by a mixture-of-experts to per-instrument value heads — saving memory and curbing overfitting. For cross-market market-making, a Multi-Agent PPO with a central critic handles the coupled hybrid action.
- Belief states under partial observability. Iceberg orders and random cancels make the true book partially observable; an LSTM/attention policy can infer an approximate belief state and extend μ̂perm into a belief vector for sharper impact separation. But many risk desks forbid black-box memory in production and require a feature-importance report — so keep the state as Markov as you can and treat learned memory as the exception, not the default.
- When latency-arbitrage disappears. If exchanges compress feed latency under 100 µs, flickering edge vanishes and the policy's alpha must move to microstructure prediction — multi-level queue imbalance, cancel-rate gradients, broker-ID heat — computed with asynchronous PPO and GPU feed decoding at microsecond granularity. A regulatory sandbox (live operation under injected extreme conditions, license granted only after, say, 7 consecutive days of cancel-rate <30% and fill-rate error <2%) is a clean way to earn the right to trade live.
- Time-segmented circuit breakers. If a venue adds tiered breakers (e.g. a 30% halt), the piecewise cost function becomes a non-convex constraint — reach for Lyapunov-based RL to guarantee recursive feasibility. At microsecond frequency the Safety-Layer QP solve becomes the bottleneck; pre-train a neural KKT approximator offline (a "safety mapper") to bring online projection under ~5 µs.