Lesson 60 · Financial trading (part 2): portfolio optimization

Lesson 59 traded one stock with a discrete buy/hold/sell button. Real money managers do something harder and more continuous: split a dollar across many assets at once. The action is no longer a button press — it is a vector of weights that must sum to one. That single change drops us straight back into continuous control, and the fact that the dollars are real drops us into offline RL.

What this lesson reuses
This is an applications lesson: it invents no new algorithm. It is the meeting point of two earlier frontiers: continuous control (lesson 18) and offline RL (lesson 19).

The action is a simplex, not a button

In lesson 59 the action set was tiny and discrete: {buy, hold, sell} one instrument. Portfolio optimization asks the manager's real question: given n assets, what fraction of the book belongs in each? The action is a weight vector

$$w = (w_1, \dots, w_n) \in \mathbb{R}^n, \qquad \sum_i w_i = 1$$

where $w_i$ is the fraction of capital in asset $i$ (a cash or risk-free slot is just one more entry). With long-only weights ($w_i \ge 0$) this is the probability simplex $\Delta^{n-1}$, which has uncountably many points. There is no list of actions to sweep, so Q-learning's $\arg\max_a Q(s,a)$ (lesson 05) cannot be computed by enumeration here, for exactly the reason it could not for a robot arm in lesson 18.

Same wall, different domain (lesson 18)
A robot's action was a torque vector in $\mathbb{R}^d$; a portfolio's action is a weight vector on the simplex. Both are continuous, so both need the same fix: a deterministic actor $\mu_\phi(s)$ that outputs the allocation in one forward pass, trained by riding the critic's gradient $\nabla_a Q$ (the Deterministic Policy Gradient). The only twist is the output layer: a softmax over assets makes the weights automatically sum to one, so the actor lands on the simplex by construction. This is the same trick the lesson-03 policy used to land on a categorical distribution.
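A minimal sketch of that output layer, assuming a hypothetical linear actor. The matrix `W`, bias `b`, and the two state features are made-up illustration values, not anything from the lesson's widget; a real actor would be a trained network.

```javascript
// Numerically stable softmax: maps any real logits to weights >= 0 that sum to 1.
function softmax(z) {
  const m = Math.max(...z);                 // subtract the max before exponentiating
  const e = z.map(x => Math.exp(x - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / s);
}

// Hypothetical linear actor: logits = W * state + b, one row of W per asset.
function actor(state, W, b) {
  const logits = W.map((row, i) =>
    row.reduce((acc, wij, j) => acc + wij * state[j], b[i]));
  return softmax(logits);                   // the allocation, on the simplex
}

// Illustrative numbers only: 3 assets, 2 state features.
const W = [[0.5, -1.0], [0.1, 0.3], [-0.4, 0.8]];
const b = [0.0, 0.2, -0.1];
const w = actor([0.02, -0.01], W, b);
console.log(w, w.reduce((a, x) => a + x, 0));
```

Whatever the logits are, the output needs no argmax and no post-hoc normalization: the softmax guarantees a valid long-only, fully invested allocation.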

The MDP extends lesson 59's: the state is the market context (recent returns, volatilities, holdings, cash); the action w is the new target allocation; the transition applies next period's returns and charges transaction costs to rebalance from the old weights; the reward is the period's gain — adjusted for risk, the crux of the lesson.
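One way to picture a single transition is the sketch below. It assumes a proportional transaction cost on turnover (`costRate` is a hypothetical parameter) and uses the log-return reward defined later in this lesson; the function name `step` and the sample numbers are illustrative only.

```javascript
// One transition: rebalance to the target weights, pay costs, apply next-period returns.
function step(prevWeights, targetWeights, assetReturns, costRate = 0.001) {
  // Turnover: total fraction of the book traded to move from old to new weights.
  const turnover = targetWeights.reduce(
    (s, w, i) => s + Math.abs(w - prevWeights[i]), 0);
  const cost = costRate * turnover;         // assumed proportional cost model

  // Simple portfolio return over the period, before costs.
  const gross = targetWeights.reduce((s, w, i) => s + w * assetReturns[i], 0);
  const reward = Math.log(1 + gross) - cost;

  // Weights drift with prices; these are the "old weights" for the next step.
  const drifted = targetWeights.map(
    (w, i) => w * (1 + assetReturns[i]) / (1 + gross));
  return { reward, cost, drifted };
}

const out = step([0.5, 0.5], [0.6, 0.4], [0.10, -0.05]);
console.log(out);
```

In the offline setting this function is never called against a live market; it only describes how each logged tuple in the historical dataset was generated.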

Why "maximize return" is a trap

The obvious reward is the portfolio's log-return over the period:

$$r_t = \log(1 + w_t \cdot \rho_{t+1}) - \mathrm{cost}(w_t, w_{t-1})$$

where $\rho_{t+1}$ is the vector of next-period asset returns. Plug this straight into any of our objectives, $J(\theta) = \mathbb{E}[\sum_t \gamma^t r_t]$, and the optimizer does something technically correct and financially insane. For small per-period returns the log is close to the simple return $w_t \cdot \rho_{t+1}$, so expected reward scales almost linearly with leverage: the optimizer borrows as much as it can and concentrates everything into whatever asset had the highest mean return in the data.

Maximizing raw return takes ruinous leverage
Expected return is linear in position size: double the bet, double the expected gain. An agent rewarded only on return therefore has no reason to ever stop sizing up — it will lever to the hilt and pile into one name. That maximizes the average outcome and also maximizes the chance of a wipe-out: one bad period and a levered, concentrated book hits a margin call, is force-liquidated, and the compounding game is over. Average return is the wrong objective because investing compounds, and a single −100% erases every prior gain.
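A toy calculation makes the gap between average and compounded outcomes concrete. The two-outcome asset below (+30% or −20% with equal probability) and the leverage multiples are invented for illustration, and borrowing is assumed to be free.

```javascript
// Hypothetical asset: +30% or -20% each period, equally likely.
const up = 0.30, down = -0.20, p = 0.5;

// Average one-period return at leverage k: linear in k, so it never peaks.
function expectedReturn(k) {
  return k * (p * up + (1 - p) * down);
}

// Expected log growth per period at leverage k: what compounding actually earns.
function expectedLogGrowth(k) {
  return p * Math.log(1 + k * up) + (1 - p) * Math.log(1 + k * down);
}

for (const k of [1, 2, 3, 5]) {
  console.log(k, expectedReturn(k).toFixed(3), expectedLogGrowth(k));
}
// At k = 5 a single down move loses 100% of the book, so log growth is -Infinity.
```

The average return keeps rising with every extra turn of leverage, while the compounded growth rate falls and then collapses at the leverage where one bad period wipes out the account.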

The fix: a risk-adjusted reward

The fix is a better reward, not a new algorithm: reward return per unit of risk. The canonical choice is the Sharpe ratio — mean excess return over its standard deviation:

$$\text{Sharpe} = \frac{\mathbb{E}[r] - r_f}{\sigma(r)}$$
or, as a per-step reward:
$$r_t = w_t \cdot \rho_{t+1} - \lambda\,\sigma^2_{\text{portfolio}}(w_t)$$

The mean-variance form (the per-step reward) is the one we will use in the widget: expected return minus a risk-aversion coefficient λ times portfolio variance. Now the objective is concave in position size: variance grows with the square of leverage, so past some point adding risk costs more than it earns, and the optimizer chooses a finite, diversified allocation on its own. Diversification falls out for free: spreading across imperfectly-correlated assets lowers $\sigma^2$ without lowering the mean.
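Both claims can be checked with a few lines of arithmetic. The sketch below assumes a single risky asset held at leverage `k` for the concavity check and two uncorrelated, equal-variance assets for the diversification check; all numbers are hypothetical.

```javascript
// Mean-variance objective for one risky asset at leverage k:
// k * mean - lambda * k^2 * variance, a downward parabola in k.
function objective(k, mean, variance, lambda) {
  return k * mean - lambda * k * k * variance;
}

const mean = 0.06, variance = 0.04, lambda = 2;      // illustrative values
// Setting the derivative mean - 2*lambda*k*variance to zero gives the peak.
const kStar = mean / (2 * lambda * variance);
console.log("peak leverage", kStar, "objective at peak", objective(kStar, mean, variance, lambda));

// Portfolio variance w' Sigma w.
function portfolioVariance(w, Sigma) {
  return w.reduce((s, wi, i) =>
    s + wi * Sigma[i].reduce((t, sij, j) => t + sij * w[j], 0), 0);
}

// Two uncorrelated assets with the same variance (and, by assumption, the same mean).
const Sigma2 = [[0.04, 0], [0, 0.04]];
console.log(portfolioVariance([1, 0], Sigma2), portfolioVariance([0.5, 0.5], Sigma2));
```

With these numbers the objective peaks at a finite leverage, and the 50/50 split carries half the variance of the single-asset book while the mean is unchanged.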

Reward | Shape in position size | What the optimizer does
raw return 𝔼[r] | linear, no maximum | infinite leverage, one asset, eventual ruin
mean − λ·variance | concave, has a peak | finite, diversified allocation; λ sets where the peak is
This is the same move as SAC's entropy bonus (lesson 18)
SAC did not maximize reward alone — it maximized reward plus an entropy term, changing the objective so the agent behaved sensibly (kept exploring). Here we maximize reward minus a variance term, changing the objective so the agent behaves sensibly (stops levering). In both cases the headline metric is a poor objective on its own; the regularizer is what makes the learned policy usable.

Constraints: no shorting, position limits

Real mandates add hard constraints the action must respect: no shorting ($w_i \ge 0$), position limits ($w_i \le c$; no single name above, say, 20%), and fully invested ($\sum_i w_i = 1$). A softmax actor handles the first and last for free (outputs are non-negative and sum to one); position caps are enforced by clipping the actor's output and renormalizing, repeated until no weight exceeds the cap, since a single renormalization can push a clipped weight back over it. This is the same "project the action onto the feasible set" instinct as the safety-clip in lesson 58. These constraints are also a free defense against leverage: a long-only, fully-invested book simply cannot reach the ruinous region.
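A sketch of one way to enforce a position cap on a softmax output. The function name `capWeights` and the redistribution rule (excess is handed to the uncapped assets in proportion to their current weight) are choices made here for illustration, not the lesson's implementation; other projections onto the capped simplex are possible.

```javascript
// Naive version: clip once, renormalize once. Can violate the cap again.
function naiveClip(weights, cap) {
  const clipped = weights.map(w => Math.min(w, cap));
  const s = clipped.reduce((a, b) => a + b, 0);
  return clipped.map(w => w / s);
}

// Iterative version: clip, hand the excess to still-uncapped assets, repeat.
// Assumes the input is already on the simplex and that cap * n >= 1.
function capWeights(weights, cap) {
  const n = weights.length;
  if (cap * n < 1 - 1e-12) throw new Error("infeasible: cap * n must be >= 1");
  const w = weights.slice();
  const capped = new Array(n).fill(false);
  for (let pass = 0; pass <= n; pass++) {
    let excess = 0;
    for (let i = 0; i < n; i++) {
      if (w[i] > cap) { excess += w[i] - cap; w[i] = cap; capped[i] = true; }
    }
    if (excess === 0) break;                 // every weight respects the cap
    const freeMass = w.reduce((s, x, i) => s + (capped[i] ? 0 : x), 0);
    const freeCount = capped.filter(c => !c).length;
    for (let i = 0; i < n; i++) {
      if (capped[i]) continue;
      // Proportional share of the excess; equal split if the free assets hold nothing.
      w[i] += freeMass > 0 ? excess * w[i] / freeMass : excess / freeCount;
    }
  }
  return w;
}

const raw = [0.5, 0.3, 0.1, 0.1];
console.log(naiveClip(raw, 0.4));   // first weight ends up above 0.4 again
console.log(capWeights(raw, 0.4));  // all weights <= 0.4, still summing to 1
```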

Why offline & conservative methods are not optional here

Every continuous-control method from lesson 18 — DDPG, TD3, SAC — assumes a live environment to roll out in and fill a replay buffer. A trading book has no such thing. You cannot "explore" by placing speculative trades with the client's money to see what happens; one exploratory rollout is real, irreversible loss. So portfolio learning is structurally the offline setting of lesson 19: a fixed dataset of historical market states, allocations, and realized returns, and you must produce a good policy without ever calling env.step().

And we know exactly how that breaks. The Bellman backup queries Q(s, w) at allocations w the historical data never contained — out-of-distribution actions — where the critic's estimate is unchecked extrapolation. With no live trade to ever feel the consequence, those phantom values feed back through the bootstrap and inflate. In driving (lesson 58) that OOD blow-up was a phantom-safe maneuver; here it is a phantom-profitable allocation.

The OOD blow-up is real money (lesson 19)
An offline learner that over-estimates Q on an allocation the market history never showed will recommend that allocation. There is no correcting rollout — the next thing that happens is you put real capital into a position the model only ever imagined was good. Lesson 19's over-optimistic value estimate, in finance, is a strategy that looked brilliant in backtest and loses money the moment it touches the market. This is backtest overfitting (lesson 59) seen through the offline-RL lens.

The cures are the conservatism of lessons 20–21, and they read naturally as risk management:

Fix family | Idea | For a portfolio
BCQ, policy-constraint (L20) | only consider actions the data supports. | "only hold allocations resembling ones that have actually been traded in markets like this."
CQL, value-pessimism (L21) | push Q down on the policy's own (possibly OOD) actions. | "refuse to believe a strategy is profitable until the history proves it" (pessimism as prudence).
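The two ideas in the table can be sketched in a few lines. This is a simplified illustration, not either algorithm as published: the CQL-style term below uses a log-sum-exp over a handful of sampled allocations as a stand-in for the sum over all actions, and the BCQ-style filter uses a nearest-neighbour distance check where BCQ itself uses a learned generative model. All function names and numbers are hypothetical.

```javascript
// Stable log-sum-exp.
function logSumExp(xs) {
  const m = Math.max(...xs);
  return m + Math.log(xs.reduce((s, x) => s + Math.exp(x - m), 0));
}

// CQL-style regularizer for one state: large when the critic rates some sampled
// (possibly unseen) allocation far above the allocation actually in the data.
// Adding it to the critic loss pushes those optimistic values down.
function cqlPenalty(qOnSampledActions, qOnDataAction, alpha = 1.0) {
  return alpha * (logSumExp(qOnSampledActions) - qOnDataAction);
}

// BCQ-style filter: keep only candidate allocations within L1 distance eps
// of some allocation that appears in the historical data.
function supportedCandidates(candidates, dataActions, eps) {
  const l1 = (a, b) => a.reduce((s, x, i) => s + Math.abs(x - b[i]), 0);
  return candidates.filter(c => dataActions.some(d => l1(c, d) <= eps));
}

// The critic imagines Q = 5 for an allocation the history never contained.
console.log(cqlPenalty([1, 5, 2], 1));
// Only the candidate close to a traded allocation survives the filter.
console.log(supportedCandidates([[0.5, 0.5], [1, 0]], [[0.6, 0.4]], 0.25));
```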

Conservatism, which was a mathematical necessity for correctness in offline RL, is here a fiduciary necessity for survival. An offline finance agent that is not conservative does not merely learn slowly — it confidently bets on fantasies.

Interactive · the risk-aversion knob

The widget below is a one-period mean-variance allocator over four assets plus cash. The actor chooses weights to maximize $\mathbb{E}[r] - \lambda\,\sigma^2$; the risk-aversion knob λ slides from greedy (λ ≈ 0: chase return) to risk-averse (large λ: smooth the ride). For each setting it computes the optimal long-only allocation, then simulates an equity curve by drawing correlated returns, and plots the resulting risk/return point against a sampled efficient frontier.
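The equity curve and the max-drawdown readout can be computed as in the sketch below. It assumes the book is rebalanced to the same weights every period and takes the return path as given rather than sampling it; the function names and the three-period path are illustrative, not the widget's code.

```javascript
// Compound a constant-weight portfolio through a path of per-period asset returns.
function equityCurve(weights, returnsPath, start = 1) {
  const curve = [start];
  for (const r of returnsPath) {
    const portRet = weights.reduce((s, w, i) => s + w * r[i], 0);
    curve.push(curve[curve.length - 1] * (1 + portRet));
  }
  return curve;
}

// Max drawdown: the largest fractional fall from a running peak.
function maxDrawdown(curve) {
  let peak = curve[0], worst = 0;
  for (const v of curve) {
    peak = Math.max(peak, v);
    worst = Math.max(worst, (peak - v) / peak);
  }
  return worst;
}

// Hypothetical path: a volatile asset and a flat cash slot.
const path = [[0.10, 0], [-0.50, 0], [0.20, 0]];
console.log(maxDrawdown(equityCurve([1, 0], path)));      // concentrated book
console.log(maxDrawdown(equityCurve([0.5, 0.5], path)));  // half in cash
```

On this path the concentrated book loses half its value from the peak, while the book that holds half in cash loses a quarter.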

Widget: Portfolio allocator, return vs volatility, with a risk-aversion knob
Left: the optimal long-only weights for the current λ (bars) and one simulated equity curve. Right: a sampled efficient frontier (grey cloud) with your portfolio marked. Slide λ from 0 (greedy → big drawdowns) toward risk-averse (smooth curve). "↻ New market" redraws the return path; the allocation is deterministic given λ.
Readouts: expected return, volatility σ, Sharpe, max drawdown.
Core JS:
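// mu = expected returns, Sigma = covariance (incl. a cash slot, mu=rf, var=0).
// mu, Sigma and rf are assumed to be defined by the widget's setup code (not shown).
const n = mu.length;
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const matVec = (M, v) => M.map(row => dot(row, v));
function softmax(z){
  const m = Math.max(...z);                  // subtract the max for numerical stability
  const e = z.map(x => Math.exp(x - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / s);
}

// Actor maximizes mu·w - lambda * w'Sigma w on the long-only simplex.
// Adding the w-gradient to the logits is an exponentiated-gradient (mirror-ascent)
// step, so the weights stay >= 0 and sum to 1 without an explicit projection.
function allocate(lambda){
  let z = mu.map(()=> 0);                    // logits
  for (let it = 0; it < 400; it++){
    const w = softmax(z);                    // on the simplex by construction
    const Sw = matVec(Sigma, w);
    for (let i = 0; i < n; i++){
      // d/dw_i [ mu·w - lambda w'Sigma w ] = mu_i - 2*lambda*(Sigma w)_i
      const grad = mu[i] - 2*lambda*Sw[i];
      z[i] += 0.5 * grad;                    // step size 0.5, applied to the logits
    }
  }
  return softmax(z);
}

function stats(w){                           // risk-adjusted, not raw
  const ret = dot(mu, w);
  const vol = Math.sqrt(dot(w, matVec(Sigma, w)));
  // An all-cash book has vol = 0; report Sharpe 0 instead of dividing by zero.
  return { ret, vol, sharpe: vol > 0 ? (ret - rf)/vol : 0 };
}
// lambda = 0  → grad = mu_i  → nearly all mass on the single highest-mu asset
//             → max expected return, max volatility, deepest drawdown. The bug.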
Notice what the knob is really doing. At λ=0 the gradient is just $\mu_i$, so softmax sends nearly all the mass to the top-mean asset: concentration and the steepest, most fragile equity curve. As λ grows, the $-2\lambda(\Sigma w)_i$ term penalizes whatever you already hold a lot of, pushing mass toward uncorrelated assets and cash. The Sharpe ratio peaks at a moderate λ, not at zero: chasing raw return actively destroys risk-adjusted performance.
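To see the effect outside the widget, here is a self-contained sweep over λ. It reuses the same logit-update scheme as the core JS, but the inputs are hypothetical: three uncorrelated risky assets plus a cash slot, with means and variances invented for illustration rather than taken from the widget.

```javascript
// Hypothetical market: three uncorrelated risky assets plus a cash slot.
const rf = 0.02;
const mu = [0.12, 0.08, 0.05, rf];
const Sigma = [
  [0.09, 0,      0,      0],
  [0,    0.0225, 0,      0],
  [0,    0,      0.0064, 0],
  [0,    0,      0,      0],   // cash: zero variance
];

const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const matVec = (M, v) => M.map(row => dot(row, v));
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(x => Math.exp(x - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / s);
}

// Maximize mu·w - lambda * w'Sigma w by adding the w-gradient to the logits.
function allocate(lambda, steps = 400, lr = 0.5) {
  let z = mu.map(() => 0);
  for (let it = 0; it < steps; it++) {
    const Sw = matVec(Sigma, softmax(z));
    z = z.map((zi, i) => zi + lr * (mu[i] - 2 * lambda * Sw[i]));
  }
  return softmax(z);
}

function sharpe(w) {
  const vol = Math.sqrt(dot(w, matVec(Sigma, w)));
  return vol > 0 ? (dot(mu, w) - rf) / vol : 0;   // all-cash book: report 0
}

for (const lambda of [0, 1, 5, 20]) {
  const w = allocate(lambda);
  console.log(lambda, w.map(x => x.toFixed(2)).join(" "), sharpe(w).toFixed(2));
}
```

With these inputs the λ = 0 allocation sits almost entirely in the highest-mean asset, and raising λ spreads the book across the lower-volatility assets and cash, which improves the Sharpe ratio relative to the greedy setting.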

Mapping back to the spine

Place each piece on the value / policy / model map from orientation:

Portfolio piece | Core method | Place on the map
Allocation = weights on the simplex | Continuous control, DPG/DDPG (L18) | policy: a deterministic actor $\mu_\phi(s)$ emits w via softmax, no argmax
Reward = mean − λ·variance (Sharpe) | Risk-aware reward design | shapes R so the objective is concave → bounded, diversified
No shorting / position limits | Action projection / clip | constrain the policy to the feasible simplex (L58's clip instinct)
Can't trade live to explore | Offline RL (L19) | value learning on a fixed history, no env.step()
OOD allocation looks profitable | BCQ / CQL conservatism (L20–21) | pull value down on unsupported actions; pessimism = prudence

Read together: portfolio optimization is continuous control with a risk-adjusted reward, learned offline and conservatively. Lesson 59 gave us trading as an MDP with discrete actions; replace the button with a simplex and you are back in lesson 18. Reward raw return and the actor levers into ruin; reward Sharpe and it diversifies. And because the dollars are real, you cannot explore — so the offline conservatism of lessons 19–21, which was about correctness elsewhere, is here about not losing the client's money.

Takeaway
A portfolio's action is a continuous weight vector on the simplex ($w_i \ge 0$, $\sum_i w_i = 1$), so allocation is a continuous-control problem (lesson 18): a softmax actor emits the weights, no argmax. Reward raw return and the objective is linear in leverage: the agent borrows to the hilt, concentrates, and eventually blows up. Reward risk-adjusted return (mean − λ·variance / Sharpe) and the objective turns concave, so the optimal book is finite and diversified, exactly as the widget shows when you lift λ off zero. And because you cannot trade live to explore, learning is offline (lesson 19) on a fixed history, where an over-optimistic value on an unseen allocation is not a bad plot but a real loss, which is why the BCQ/CQL conservatism of lessons 20–21 is mandatory here, not optional.