Logistic regression & GLMs
Linear regression handled continuous outcomes. For binary and multi-class labels, you change the link function and the noise model, and the same machinery — MLE, convex optimization, regularization — carries over. That's the entire GLM story, and logistic regression is its most-asked instance.
Why not linear regression on a 0/1 label?
Encode the label as y ∈ {0,1} and fit ŷ = βᵀx by OLS. Two things break:
- Predictions leave [0,1]. Nothing in OLS forces βᵀx ∈ [0,1]. You get negative "probabilities" and values above 1. Clipping is a band-aid that breaks calibration.
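- Noise is heteroscedastic. A Bernoulli with mean p has variance p(1−p): maximal at p = 0.5 and shrinking toward 0 as p approaches 0 or 1. That violates the constant-variance assumption behind Gauss-Markov, so OLS is no longer the best linear unbiased estimator and its default standard errors are wrong.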
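A quick way to see the first failure is to run OLS on a synthetic 0/1 label and look at the fitted range. The sketch below is illustrative only: the data-generating slope of 3, the sample size, and the seed are assumptions, not values from this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)

# Synthetic binary labels drawn from a logistic model (assumed slope 3).
p_true = 1.0 / (1.0 + np.exp(-3.0 * x))
y = (rng.uniform(size=n) < p_true).astype(float)

# Ordinary least squares on the 0/1 label, with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Fraction of fitted "probabilities" that are not valid probabilities.
frac_outside = np.mean((y_hat < 0) | (y_hat > 1))
print(f"OLS fitted range: [{y_hat.min():.2f}, {y_hat.max():.2f}]")
print(f"fraction of fitted values outside [0, 1]: {frac_outside:.1%}")
```

Any input far enough from the centre of the data gets a fitted value below 0 or above 1, which is the behaviour the sigmoid is introduced to remove.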
Keep the linear score z = βᵀx and pass it through the sigmoid:
σ(z) = 1 / (1 + e^{−z}), so that p = P(y=1 | x) = σ(βᵀx).
The sigmoid squashes ℝ → (0,1). The model is still linear in the logit; only the output is non-linear.
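As a minimal sketch, here is the sigmoid σ(z) = 1 / (1 + e^{−z}) written so that large |z| does not overflow, plus a check that the logit inverts it. The function name `sigmoid` and the branch-on-sign trick are choices made for this example, not something the lesson prescribes.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function, applied elementwise."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so this form cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so exp stays below 1.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-2.0, 0.5, 3.0])
p = sigmoid(z)
recovered_logit = np.log(p / (1.0 - p))  # the logit should recover z

print(p)
print(recovered_logit)
print(sigmoid([-1000.0, 0.0, 1000.0]))   # extreme inputs stay finite
```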
MLE: the loss isn't a choice
Juniors say "we use BCE because it works." Seniors know BCE is the negative log-likelihood of a Bernoulli — it falls out of MLE, not out of preference.
Assume y_i | x_i ∼ Bernoulli(p_i) with p_i = σ(βᵀx_i). Likelihood:
L(β) = Π_i p_i^{y_i} (1 − p_i)^{1 − y_i}
Negative log:
NLL(β) = −Σ_i [ y_i log p_i + (1 − y_i) log(1 − p_i) ]
That is exactly binary cross-entropy. The loss in your DL framework is the MLE of a Bernoulli.
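The equivalence can be checked numerically. The sketch below computes the Bernoulli likelihood as a product, takes its negative log, and compares it with the summed BCE formula and with a logit-based form, softplus(z) − y·z, that is the usual numerically safer variant. The four probabilities and labels are made-up illustration values.

```python
import numpy as np

def bernoulli_likelihood(p, y):
    """Product over samples of p^y * (1 - p)^(1 - y)."""
    return np.prod(p**y * (1.0 - p) ** (1.0 - y))

def bce_sum(p, y):
    """Summed binary cross-entropy from probabilities."""
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bce_from_logits(z, y):
    """Same quantity from logits: softplus(z) - y*z, with softplus = log(1 + e^z)."""
    return np.sum(np.logaddexp(0.0, z) - y * z)

p = np.array([0.9, 0.2, 0.7, 0.4])   # illustrative predicted probabilities
y = np.array([1.0, 0.0, 1.0, 0.0])   # illustrative labels
z = np.log(p / (1.0 - p))            # corresponding logits

nll = -np.log(bernoulli_likelihood(p, y))
print(nll, bce_sum(p, y), bce_from_logits(z, y))
```

All three numbers agree up to floating-point error. In practice the logit form is preferred because the raw product underflows for large n.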
No closed form — but convex
Unlike OLS, where β̂ = (XᵀX)⁻¹Xᵀy, there is no closed form here. Setting the gradient to zero gives ∇_β NLL = Xᵀ(σ(Xβ) − y) = 0, which is non-linear in β. But the Hessian is:
H = ∇²_β NLL = Xᵀ diag(σ(Xβ) ⊙ (1 − σ(Xβ))) X
The diagonal weights σ(1−σ) ∈ (0, 1/4] are positive, so H is PSD and the NLL is convex. Any local optimum is global. Optimizers:
- IRLS (Iteratively Reweighted Least Squares): a Newton step in disguise. Each iteration is a weighted OLS with weights σ(βᵀx)(1 − σ(βᵀx)). It is what R's glm() uses. Second-order; each iteration costs O(nd² + d³), so it scales poorly past a few thousand features.
- L-BFGS: quasi-Newton, no Hessian materialization. scikit-learn's default.
- SGD / mini-batch: for very large n or streaming. Same algorithm as deep nets, applied to one linear layer.
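To make the Newton/IRLS option concrete, here is a minimal NumPy sketch that uses the gradient Xᵀ(σ(Xβ) − y) and the Hessian Xᵀ diag(σ(1−σ)) X directly. The function name `fit_logistic_newton`, the tiny `ridge` term added for numerical safety, and the synthetic data are all assumptions of this example; it is not a reproduction of what glm() does internally.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, ridge=1e-8, tol=1e-10):
    """Newton's method on the Bernoulli NLL (equivalent to IRLS updates)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (p - y)                      # gradient of the NLL
        w = p * (1.0 - p)                         # IRLS weights, each in (0, 1/4]
        # Hessian X^T diag(w) X; forming it is O(n d^2), solving is O(d^3).
        H = X.T @ (X * w[:, None]) + ridge * np.eye(d)
        step = np.linalg.solve(H, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic, non-separable data (assumed ground truth for illustration).
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -2.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

beta_hat = fit_logistic_newton(X, y)
grad_at_solution = X.T @ (1.0 / (1.0 + np.exp(-(X @ beta_hat))) - y)
print("estimate:", np.round(beta_hat, 3))
print("max |gradient| at solution:", np.max(np.abs(grad_at_solution)))
```

On perfectly separable data the unregularized MLE does not exist (coefficients diverge), so a real implementation would add a penalty or a step-size safeguard.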
The logit and calibration
Invert the sigmoid: log(p/(1−p)) = βᵀx. The logit (log-odds) is linear in x. A unit increase in x_j raises log-odds by β_j — equivalently, multiplies the odds by e^{β_j} (the "odds ratio"). "A 50-point FICO drop multiplies default odds by e^{0.4} ≈ 1.5×" is a sentence you can defend to a regulator. A GBDT SHAP value is not. The model is its coefficients.
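The odds-ratio reading can be verified with a few lines of arithmetic. In this sketch the coefficient 0.4 is taken from the FICO sentence above as an illustrative value, and `shift_probability` is a hypothetical helper. It shows that the odds multiply by the same factor at every baseline, while the change in probability depends on where you start.

```python
import numpy as np

beta_j = 0.4                      # illustrative coefficient (log-odds per unit change)
odds_ratio = np.exp(beta_j)       # multiplicative effect on the odds

def shift_probability(p, delta_logit):
    """Move a probability by delta_logit on the log-odds scale."""
    logit = np.log(p / (1.0 - p))
    return 1.0 / (1.0 + np.exp(-(logit + delta_logit)))

observed_ratios = []
for p0 in (0.02, 0.20, 0.50):     # assumed baseline probabilities
    p1 = shift_probability(p0, beta_j)
    ratio = (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))
    observed_ratios.append(ratio)
    print(f"baseline p={p0:.2f} -> p={p1:.4f}, odds multiplied by {ratio:.4f}")

print(f"exp(beta_j) = {odds_ratio:.4f}")
```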
Trained with BCE (the MLE), a correctly specified model is asymptotically calibrated, and with enough data it is well calibrated in practice: among inputs where the model outputs 0.3, roughly 30% are positive. No post-hoc adjustment is needed. Random forests (votes compress toward 0.5), GBDTs (over-confident leaves), and deep nets (systematically over-confident; Guo et al. 2017) typically need Platt, isotonic, or temperature scaling.
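A reliability table makes the calibration claim checkable. The sketch below simulates a well-specified problem (an assumption: the labels really do come from a logistic model with slope 2), fits by Newton's method, and compares mean predicted probability with observed positive rate in ten equal-width bins. The bin count and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.0, 2.0])                     # synthetic ground truth
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

# Fit the MLE with a fixed number of Newton steps on the Bernoulli NLL.
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    H = X.T @ (X * (p * (1.0 - p))[:, None])
    beta -= np.linalg.solve(H, X.T @ (p - y))

p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))

# Reliability table: bucket predictions into 10 equal-width bins.
bins = np.minimum((p_hat * 10).astype(int), 9)
gaps = []
for b in range(10):
    mask = bins == b
    mean_pred, pos_rate = p_hat[mask].mean(), y[mask].mean()
    gaps.append(abs(mean_pred - pos_rate))
    print(f"bin {b}: n={mask.sum():5d}  mean predicted={mean_pred:.3f}  observed={pos_rate:.3f}")
```

With an intercept in the model, the score equation also forces the average prediction to equal the base rate exactly at the MLE, which is a cheap sanity check in production.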
Regularization — same picture as linear regression
Add a penalty to the NLL exactly as in lesson 2: L2 (+λ‖β‖², Gaussian prior, shrinks all toward zero), L1 (+λ‖β‖₁, Laplace prior, drives some to exactly zero — feature selection for free), elastic net (mix; better with correlated features). All convex. The bias-variance trade from lesson 1 transfers without change; pick λ by CV on log-loss.
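As a rough sketch of the L2 case, the code below runs plain gradient descent on the mean NLL plus λ‖β‖² and reports how the coefficient norm shrinks as λ grows. The function name `fit_l2_logistic`, the learning rate, the λ grid, and the synthetic data are assumptions for illustration; the intercept is omitted to keep the example short (in practice it is usually left unpenalized).

```python
import numpy as np

def fit_l2_logistic(X, y, lam, lr=0.1, n_iter=3000):
    """Gradient descent on mean NLL + lam * ||beta||^2 (no intercept)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        # Data term gradient plus the derivative of the L2 penalty.
        grad = X.T @ (p - y) / n + 2.0 * lam * beta
        beta -= lr * grad
    return beta

rng = np.random.default_rng(0)
n, d = 400, 5
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, -1.0, 0.5, 0.0, 0.0])     # synthetic ground truth
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

lams = [0.0, 0.01, 0.1, 1.0]
norms = [np.linalg.norm(fit_l2_logistic(X, y, lam)) for lam in lams]
for lam, norm in zip(lams, norms):
    print(f"lambda={lam:<5} ||beta||={norm:.3f}")
```

Selecting λ would then be a matter of repeating the fit across folds and comparing held-out log-loss, which this sketch leaves out.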
Multi-class: three options, one usually right
| Approach | Model | Trade-off |
|---|---|---|
| Softmax | P(y=k|x) = e^{β_kᵀx} / Σ_j e^{β_jᵀx} | Joint MLE; probabilities sum to 1; calibrated; one model. The default. |
| One-vs-rest | K binary classifiers, argmax | Trivially parallel; per-class calibration knob. Probabilities don't sum to 1; decisions inconsistent in overlap regions. |
| One-vs-one | K(K−1)/2 pairwise, vote | Quadratic in K. Useful for SVMs that scale poorly in n; rarely the right call for logistic regression. |
Default softmax. OvR when you want per-class calibration knobs or parallel-by-class training.
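The "probabilities sum to 1" difference is easy to see in code. This sketch applies a max-shifted softmax and independent sigmoids to the same made-up class scores. Treating one score matrix as the output of both a softmax model and K one-vs-rest models is a simplification for illustration, since real OvR classifiers are trained separately.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative per-class linear scores for three inputs, K = 3.
scores = np.array([[2.0, 1.0, -1.0],
                   [0.5, 0.4, 0.3],
                   [1000.0, 999.0, 998.0]])

softmax_sums = softmax(scores).sum(axis=1)   # joint model: each row sums to 1
ovr_sums = sigmoid(scores).sum(axis=1)       # independent sigmoids: no such constraint

print("softmax row sums:", softmax_sums)
print("OvR sigmoid row sums:", ovr_sums)

# With K = 2 and one score pinned at 0, softmax reduces to the sigmoid.
two_class = softmax(np.array([[1.3, 0.0]]))[0, 0]
print(two_class, sigmoid(1.3))
```

To get a distribution out of OvR you have to renormalize the K outputs, which is one more place for calibration to drift.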
The GLM family — logistic regression is one row in a table
A Generalized Linear Model has three parts: a linear predictor η = βᵀx; a link function g with g(E[y|x]) = η; and an exponential-family noise distribution for y. Pick the link and distribution to match your outcome — the same machinery (MLE, IRLS) just works.
| Outcome | Distribution | Link | Name | Typical use |
|---|---|---|---|---|
| Continuous real | Gaussian | identity | Linear regression | Anything from lesson 2. |
| Binary {0,1} | Bernoulli | logit | Logistic regression | Click, conversion, default, churn. |
| Counts (0,1,2,…) | Poisson | log | Poisson regression | Visits/day, claims/policy, defects. |
| Positive continuous | Gamma | log | Gamma regression | Claim size, time-to-event, LTV. |
| Zero-inflated + positive | Tweedie | log | Tweedie GLM | Insurance pure premium. |
| K categorical | Multinomial | generalized logit (inverse: softmax) | Multinomial | Multi-class classification. |
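The "same machinery" claim can be sketched as one Newton/IRLS routine that only needs the inverse link and the variance function. The version below covers the three canonical-link rows (Gaussian/identity, Bernoulli/logit, Poisson/log), where the NLL gradient is Xᵀ(μ − y) and the Hessian is Xᵀ diag(V(μ)) X. Gamma and Tweedie with a log link are not canonical and need an extra weight term, so they are deliberately left out. The names `fit_canonical_glm` and `FAMILIES` and the synthetic data are assumptions of this example.

```python
import numpy as np

def fit_canonical_glm(X, y, inv_link, variance, n_iter=50, tol=1e-10):
    """Newton/IRLS for a canonical-link GLM with unit dispersion."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = inv_link(X @ beta)                    # mean on the response scale
        grad = X.T @ (mu - y)                      # canonical-link gradient
        H = X.T @ (X * variance(mu)[:, None])      # X^T diag(V(mu)) X
        step = np.linalg.solve(H, grad)
        beta -= step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# (inverse link, variance function) per family.
FAMILIES = {
    "gaussian": (lambda eta: eta, lambda mu: np.ones_like(mu)),
    "bernoulli": (lambda eta: 1.0 / (1.0 + np.exp(-eta)), lambda mu: mu * (1.0 - mu)),
    "poisson": (np.exp, lambda mu: mu),
}

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.3, 0.5])                   # synthetic ground truth

# Poisson regression on simulated counts.
y_counts = rng.poisson(np.exp(X @ beta_true)).astype(float)
beta_pois = fit_canonical_glm(X, y_counts, *FAMILIES["poisson"])

# Gaussian family: the same routine should reproduce OLS.
y_real = X @ beta_true + rng.normal(size=n)
beta_gauss = fit_canonical_glm(X, y_real, *FAMILIES["gaussian"])
beta_ols, *_ = np.linalg.lstsq(X, y_real, rcond=None)

print("poisson:", np.round(beta_pois, 3))
print("gaussian via Newton:", np.round(beta_gauss, 3), " OLS:", np.round(beta_ols, 3))
```

Swapping the family changes two small functions and nothing else, which is the practical content of the table above.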
Interactive · decision boundary visualizer
Pick a dataset (linear, slight overlap, XOR, concentric). Choose linear features, polynomial features (adds x_1², x_2², x_1 x_2), or 3-class softmax. Slide L2. The widget fits logistic regression by gradient descent and draws the decision boundary, train accuracy, log-loss, and the learned weights.
When logistic regression beats more complex models
The reflex to reach for XGBoost is often wrong. Logistic regression is the right baseline — and often the production choice — when data is small-to-medium and high-dimensional sparse (text, hashed categoricals, click logs); when interpretability is mandated; when calibration matters (auctions, risk pricing); when drift monitoring is in scope (coefficient drift is trivial — tree-ensemble drift is opaque); and when the signal is linearly separable after feature engineering. Cross-features (lesson 10) close most of the gap with tree models. Where it loses: when interactions are dense and you can't enumerate them — trees and nets find them automatically; logistic regression needs them engineered.
Trade-offs against the usual suspects
| Dimension | Logistic | Random forest | GBDT | Neural net |
|---|---|---|---|---|
| Interpretability | Very high (odds ratios) | Medium (importance, not directional) | Medium (SHAP per-prediction) | Low |
| Calibration out of the box | Excellent | Poor (votes → 0.5) | Poor (over-confident leaves) | Poor (over-confident) |
| Training time | Fast (convex) | Medium | Medium-fast | Slow, brittle |
| Captures interactions | Only if engineered | Automatically | Automatically | Automatically |
| Small-data behaviour | Very strong | Decent | Decent | Often overfits |
| Feature-scaling sensitivity | Yes | None | None | Yes |
| Missing values | Impute first | Workable | Native | Impute first |
| Online / streaming | Yes (SGD) | No | Difficult | Yes (SGD) |
Interview prompts you should be ready for
- "Why BCE, not MSE, for binary classification?" (BCE is the NLL of a Bernoulli — principled. MSE assumes Gaussian noise on a 0/1 label. Bonus: BCE+sigmoid is convex; MSE+sigmoid isn't.)
- "Derive the gradient of the logistic NLL." (∇_β NLL = Xᵀ(σ(Xβ) − y). Same shape as the OLS gradient with sigmoid in place of the linear prediction — the canonical-link property.)
- "Train logistic regression on XOR — what happens, how do you fix it?" (Linear features → boundary is a line → ~50%. Add x_1·x_2; boundary becomes a conic; perfect separation. The widget shows this.)
- "When does logistic regression give calibrated probabilities by default?" (Trained with BCE and well-specified. The big production advantage over trees and deep nets, which usually need Platt or isotonic.)
- "Softmax vs one-vs-rest multi-class?" (Softmax: joint MLE, probabilities sum to 1, single model. OvR: K binary models, parallelizable, per-class calibration knobs. Default softmax.)
- "Logistic vs XGBoost for credit-default — defend a choice." (Calibration, interpretability, monitoring, regulator audit — not raw AUC.)
- "Modelling daily visits per user — what GLM?" (Counts → Poisson + log link. Variance ≫ mean: negative binomial. Don't run linear regression on log(visits+1).)
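For the XOR prompt, a four-point version is small enough to reason about exactly. The sketch below fits logistic regression by gradient descent twice: once on linear features and once with the x_1·x_2 cross-feature added. The helper names, learning rate, and iteration count are arbitrary choices for this illustration.

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.5, n_iter=500):
    """Plain gradient descent on the mean Bernoulli NLL."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        beta -= lr * X.T @ (p - y) / len(y)
    return beta

def accuracy(X, y, beta):
    """Predict class 1 when the linear score is strictly positive."""
    return np.mean(((X @ beta) > 0) == (y == 1))

# The four XOR corners: label is 1 when the signs of x_1 and x_2 differ.
pts = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

X_lin = np.hstack([np.ones((4, 1)), pts])                       # [1, x1, x2]
X_cross = np.hstack([X_lin, (pts[:, 0] * pts[:, 1])[:, None]])  # + x1*x2

beta_lin = fit_logistic_gd(X_lin, y)
beta_cross = fit_logistic_gd(X_cross, y)
acc_lin = accuracy(X_lin, y, beta_lin)
acc_cross = accuracy(X_cross, y, beta_cross)

print("linear features:", beta_lin, "accuracy", acc_lin)
print("with x1*x2:     ", np.round(beta_cross, 3), "accuracy", acc_cross)
```

With linear features the gradient is zero at β = 0 by symmetry, so the model never leaves p = 0.5. With the cross-feature the data become separable and the x_1·x_2 coefficient keeps growing in magnitude, which is why a penalty is advisable in practice.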