Logistic regression & GLMs

Linear regression handled continuous outcomes. For binary and multi-class labels, you change the link function and the noise model, and the same machinery — MLE, convex optimization, regularization — carries over. That's the entire GLM story, and logistic regression is its most-asked instance.

Why not linear regression on a 0/1 label?

Encode the label as y ∈ {0,1} and fit ŷ = βᵀx by OLS. Two things break:

Predictions leave [0,1]. Nothing in OLS forces βᵀx ∈ [0,1]. You get negative "probabilities" and values above 1. Clipping is a band-aid that breaks calibration.
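Noise is heteroscedastic. A Bernoulli with mean p has variance p(1−p), which is largest at p = 0.5 and shrinks toward 0 as p approaches 0 or 1. The constant-variance assumption behind Gauss-Markov fails, so OLS is no longer the best linear unbiased estimator and its default standard errors are wrong.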
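Both failures are easy to reproduce. The sketch below fits one-feature OLS in closed form on a small made-up dataset (the `xs` and `ys` values are illustrative assumptions, not data from this lesson) and evaluates the Bernoulli variance at a few means.

```python
# Simple 1-D OLS on a 0/1 label (illustrative data, not from the lesson).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form OLS for one feature plus an intercept.
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean

# Fitted values at the training points; nothing constrains them to [0, 1].
ols_preds = [intercept + slope * x for x in xs]

# Bernoulli variance p(1 - p) at a few means: it changes with p.
bernoulli_var = {p: p * (1 - p) for p in (0.1, 0.5, 0.9)}
```

On this data the fitted line dips below 0 at the smallest x and rises above 1 at the largest, which is the "negative probability" problem in miniature.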

The fix: keep the linear score z = βᵀx and pass it through the sigmoid:

P(y=1 | x) = σ(βᵀx) = 1 / (1 + e^{−βᵀx})

The sigmoid squashes ℝ → (0,1). The model is still linear in the logit; only the output is non-linear.
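A minimal sketch of the sigmoid and its inverse, assuming plain Python floats. The two-branch form is one common way to avoid overflow in `exp`; the function names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written so exp is never called on a large positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def logit(p: float) -> float:
    """Inverse of the sigmoid: the log-odds of p, for p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))
```

Mathematically the output lies strictly inside (0,1); in floating point it can round to exactly 0.0 or 1.0 for extreme scores, which is why log-loss code usually works with the score z directly or clips probabilities.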

MLE: the loss isn't a choice

Juniors say "we use BCE because it works." Seniors know BCE is the negative log-likelihood of a Bernoulli — it falls out of MLE, not out of preference.

Assume y_i | x_i ∼ Bernoulli(p_i) with p_i = σ(βᵀx_i). Likelihood:

L(β) = Π_i σ(βᵀx_i)^{y_i} (1 − σ(βᵀx_i))^{1 − y_i}

Negative log:

NLL(β) = − Σ_i [ y_i log σ(βᵀx_i) + (1 − y_i) log(1 − σ(βᵀx_i)) ]

That is exactly binary cross-entropy. The BCE loss in your DL framework is the Bernoulli negative log-likelihood, so minimizing it is maximum-likelihood estimation.
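The identity can be checked numerically. This sketch computes the summed BCE and the negative log of the Bernoulli likelihood product for a handful of made-up labels and predicted probabilities (both lists are illustrative assumptions).

```python
import math

def bce(y_true, p_pred):
    """Summed binary cross-entropy for labels in {0, 1} and probabilities in (0, 1)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    )

def bernoulli_likelihood(y_true, p_pred):
    """Product of Bernoulli probabilities p^y (1 - p)^(1 - y)."""
    lik = 1.0
    for y, p in zip(y_true, p_pred):
        lik *= p if y == 1 else 1 - p
    return lik

y_true = [1, 0, 1, 1, 0]            # illustrative labels
p_pred = [0.9, 0.2, 0.6, 0.7, 0.4]  # illustrative model outputs

nll_from_likelihood = -math.log(bernoulli_likelihood(y_true, p_pred))
bce_value = bce(y_true, p_pred)
```

The two numbers agree up to floating-point rounding. Frameworks often report the mean rather than the sum, which only rescales the loss by 1/n.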

Why this matters in an interview

"Why BCE and not MSE?" The answer isn't "MSE doesn't work." It's: BCE is the NLL under the Bernoulli model — the principled estimator. MSE assumes Gaussian noise around a 0/1 label, which is wrong. Bonus: MSE composed with sigmoid is non-convex; BCE composed with sigmoid is convex. Principled, then tractable.

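The convexity claim in the BCE-versus-MSE answer can be spot-checked with a single training point. The sketch assumes x = 1 and y = 1, so the score is just β, and tests the midpoint inequality f((a+b)/2) ≤ (f(a)+f(b))/2 that every convex function must satisfy. The interval endpoints are chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One training point with x = 1, y = 1, so the linear score equals beta.
def mse_loss(beta):
    return (sigmoid(beta) - 1.0) ** 2

def bce_loss(beta):
    return -math.log(sigmoid(beta))

a, b = -6.0, -1.0
mid = (a + b) / 2

# A convex function never sits above its chord at the midpoint.
mse_violates_convexity = mse_loss(mid) > (mse_loss(a) + mse_loss(b)) / 2
bce_respects_convexity = bce_loss(mid) <= (bce_loss(a) + bce_loss(b)) / 2
```

One violated midpoint is enough to show MSE with a sigmoid is not convex in β. The BCE check on one interval is only a sanity check; the PSD Hessian is the actual proof.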
No closed form — but convex

Unlike OLS, there is no closed-form solution analogous to β̂ = (XᵀX)⁻¹Xᵀy. Setting the gradient to zero gives ∇_β NLL = Xᵀ(σ(Xβ) − y) = 0, which is non-linear in β. But the Hessian is:

H = Xᵀ diag(σ(Xβ)(1 − σ(Xβ))) X ⪰ 0

The diagonal weights σ(1−σ) ∈ (0, 1/4] are positive, so H is PSD and the NLL is convex (strictly convex when X has full column rank). Any local optimum is global. Optimizers:

IRLS (Iteratively Reweighted Least Squares): a Newton step in disguise. Each iteration is a weighted OLS with weights σ(βᵀx)(1 − σ(βᵀx)). What R's glm() uses. Second-order; O(nd² + d³) per iteration, so it doesn't scale past a few thousand features.
L-BFGS: quasi-Newton, no Hessian materialization. scikit-learn's default.
SGD / mini-batch: for very large n or streaming. Same algorithm as deep nets, applied to one linear layer.

Perfect separation: the gotcha

Strict convexity (under full-rank X) guarantees at most one minimizer, not that one exists. Under perfect separation, when some hyperplane classifies the training set perfectly, the MLE doesn't exist: the likelihood keeps increasing as ‖β‖ → ∞ along the separating direction. The standard fix is L2 regularization (or Firth's penalized likelihood in small-sample settings). This is the canonical "logistic regression with no convergence" interview gotcha.
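To see the perfect-separation failure and the L2 fix numerically, here is a minimal sketch using full-batch gradient descent with the gradient Xᵀ(σ(Xβ) − y). The dataset, step size, and `fit_gd` helper are illustrative assumptions. For brevity the penalty also covers the intercept, which real implementations usually leave unpenalized.

```python
import math

def sigmoid(z):
    # Two-branch form so exp never overflows as the weights grow.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def fit_gd(xs, ys, lam=0.0, steps=2000, lr=0.1):
    """Full-batch gradient descent on NLL + lam * (b0^2 + b1^2), one feature plus intercept."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Residuals sigma(x^T beta) - y; the NLL gradient is X^T residuals.
        res = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(res) + 2 * lam * b0
        g1 = sum(r * x for r, x in zip(res, xs)) + 2 * lam * b1
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

def norm(beta):
    return math.hypot(*beta)

# Illustrative, perfectly separable data: x < 0 is class 0, x > 0 is class 1.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]

unreg_short = norm(fit_gd(xs, ys, lam=0.0, steps=2000))
unreg_long = norm(fit_gd(xs, ys, lam=0.0, steps=20000))
ridge_short = norm(fit_gd(xs, ys, lam=0.1, steps=2000))
ridge_long = norm(fit_gd(xs, ys, lam=0.1, steps=20000))
```

Without the penalty, ‖β‖ is still growing after ten times as many steps, so there is nothing to converge to. With λ = 0.1 the two runs agree, because the penalized objective has a finite minimizer.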

The logit and calibration

Invert the sigmoid: log(p/(1−p)) = βᵀx. The logit (log-odds) is linear in x. A unit increase in x_j raises log-odds by β_j — equivalently, multiplies the odds by e^{β_j} (the "odds ratio"). "A 50-point FICO drop multiplies default odds by e^{0.4} ≈ 1.5×" is a sentence you can defend to a regulator. A GBDT SHAP value is not. The model is its coefficients.
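The odds-ratio reading can be verified directly. The sketch uses the coefficient 0.4 from the FICO sentence above; the baseline scores are hypothetical values standing in for the rest of the linear predictor.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

beta_j = 0.4  # coefficient from the FICO example (per 50-point drop)
odds_ratio = math.exp(beta_j)

# Hypothetical values of the rest of the linear score.
baseline_scores = [-3.0, 0.0, 2.0]

# The odds multiplier is the same at every baseline...
ratios = [odds(sigmoid(z + beta_j)) / odds(sigmoid(z)) for z in baseline_scores]
# ...but the change in probability is not.
prob_changes = [sigmoid(z + beta_j) - sigmoid(z) for z in baseline_scores]
```

This is the reason coefficients are reported as odds ratios rather than as probability deltas: the former is constant across the input space, the latter depends on where you start.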

Trained with BCE (the MLE), a correctly specified model is asymptotically calibrated, and in practice it is well calibrated once there is enough data: among inputs where the model outputs 0.3, roughly 30% are positive. No post-hoc adjustment is needed. Random forests (votes compress toward 0.5), GBDTs (over-confident leaves), and deep nets (systematically over-confident; Guo et al. 2017) typically need Platt, isotonic, or temperature scaling.
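Calibration in this sense is usually checked with a reliability table: bin the predictions and compare each bin's mean predicted probability with its observed positive rate. The sketch below is a minimal version on synthetic data. The `reliability_table` helper, the bin count, and the "over-confident" transform are illustrative assumptions, not part of any library.

```python
import random

def reliability_table(probs, labels, n_bins=10):
    """Per non-empty bin: (mean predicted probability, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            pos_rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, pos_rate, len(members)))
    return rows

# Synthetic calibrated scores: each label is drawn with its own predicted probability.
rng = random.Random(0)
probs = [rng.random() for _ in range(50000)]
labels = [1 if rng.random() < p else 0 for p in probs]

table = reliability_table(probs, labels)
max_gap = max(abs(mean_p - pos_rate) for mean_p, pos_rate, _ in table)

# An over-confident scorer for contrast: push every probability away from 0.5.
overconfident = [min(max(0.5 + 2 * (p - 0.5), 0.0), 1.0) for p in probs]
max_gap_over = max(abs(m - r) for m, r, _ in reliability_table(overconfident, labels))
```

The calibrated scores show only sampling noise between the two columns, while the over-confident ones show a large gap in the extreme bins. That gap is what Platt, isotonic, or temperature scaling is fitted to remove.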

Why CTR systems still love logistic regression

Many production CTR predictors are logistic regression on massive hashed feature crosses. The reason is not AUC, where GBDTs win, but calibration: an ads auction multiplies predicted CTR by a bid, so a 2× miscalibration is a 2× error in expected revenue. Logistic regression's calibration is free; everywhere else you pay for a Platt/isotonic stage and the tax of keeping it synced.

Regularization — same picture as linear regression

Add a penalty to the NLL exactly as in lesson 2: L2 (+λ‖β‖², Gaussian prior, shrinks all coefficients toward zero), L1 (+λ‖β‖₁, Laplace prior, drives some to exactly zero, which gives feature selection for free), elastic net (a mix of the two; better with correlated features). All three keep the objective convex. The bias-variance trade-off from lesson 1 transfers without change; pick λ by CV on log-loss.
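Why L1 produces exact zeros while L2 only shrinks can be seen from the per-coefficient update each penalty induces. The sketch assumes a proximal-gradient view of L1 (soft-thresholding), which is one standard way to handle the non-smooth penalty and not something this lesson prescribes. The weights, step size, and λ are hypothetical.

```python
def soft_threshold(w: float, t: float) -> float:
    """Proximal operator of t * |w|: shrink toward zero, clamp small values to exactly zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def l2_shrink(w: float, lr: float, lam: float) -> float:
    """One gradient step on lam * w^2 alone: multiplicative shrinkage, never exactly zero."""
    return w * (1.0 - 2.0 * lr * lam)

weights = [0.80, -0.03, 0.02, -0.50]  # hypothetical coefficients after a gradient step
lr, lam = 0.1, 0.5

after_l1 = [soft_threshold(w, lr * lam) for w in weights]
after_l2 = [l2_shrink(w, lr, lam) for w in weights]
```

The two small coefficients land on exactly 0.0 under L1 and stay small but non-zero under L2.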

Multi-class: three options, one usually right

Approach	Model	Trade-off
Softmax	P(y=k|x) = e^{β_kᵀx} / Σ_j e^{β_jᵀx}	Joint MLE; probabilities sum to 1; calibrated; one model. The default.
One-vs-rest	K binary classifiers, argmax	Trivially parallel; per-class calibration knob. Probabilities don't sum to 1; decisions inconsistent in overlap regions.
One-vs-one	K(K−1)/2 pairwise, vote	Quadratic in K. Useful for SVMs that scale poorly in n; rarely the right call for logistic regression.

Default softmax. OvR when you want per-class calibration knobs or parallel-by-class training.
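The "probabilities don't sum to 1" point about one-vs-rest is easy to demonstrate. The sketch applies softmax and K independent sigmoids to the same hypothetical per-class scores. In practice OvR scores come from separately trained models, so this is only an illustration of the normalization difference.

```python
import math

def softmax(scores):
    """Softmax with the max subtracted first so exp never overflows."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

scores = [2.0, 1.0, -1.0]  # hypothetical per-class linear scores beta_k^T x

softmax_probs = softmax(scores)
ovr_probs = [sigmoid(s) for s in scores]  # K independent binary outputs
ovr_renormalized = [p / sum(ovr_probs) for p in ovr_probs]
```

Renormalizing the OvR outputs restores a valid distribution, but the result is no longer the MLE of any single joint model, which is part of why softmax is the default.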

The GLM family — logistic regression is one row in a table

A Generalized Linear Model has three parts: a linear predictor η = βᵀx; a link function g with g(E[y|x]) = η; and an exponential-family noise distribution for y. Pick the link and distribution to match your outcome — the same machinery (MLE, IRLS) just works.

Outcome	Distribution	Link	Name	Typical use
Continuous real	Gaussian	identity	Linear regression	Anything from lesson 2.
Binary {0,1}	Bernoulli	logit	Logistic regression	Click, conversion, default, churn.
Counts (0,1,2,…)	Poisson	log	Poisson regression	Visits/day, claims/policy, defects.
Positive continuous	Gamma	log	Gamma regression	Claim size, time-to-event, LTV.
Zero-inflated + positive	Tweedie	log	Tweedie GLM	Insurance pure premium.
K categorical	Multinomial	multinomial logit (inverse: softmax)	Multinomial (softmax) regression	Multi-class classification.
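To make "the same machinery just works" concrete, here is a minimal sketch of Newton's method (equivalent to IRLS for a canonical link) on the Poisson row of the table. Swapping Bernoulli/logit for Poisson/log changes the mean to μ = exp(Xβ) and the Hessian weights to μ; the score keeps the form Xᵀ(μ − y). The data, the `fit_poisson_newton` name, and the fixed iteration count are illustrative assumptions.

```python
import math

def fit_poisson_newton(xs, ys, iters=25):
    """Newton's method for Poisson regression with a log link, one feature plus intercept."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0  # start at the intercept-only fit
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * x) for x in xs]
        # Score equations X^T (mu - y).
        g0 = sum(m - y for m, y in zip(mu, ys))
        g1 = sum((m - y) * x for m, y, x in zip(mu, ys, xs))
        # Hessian X^T diag(mu) X as a 2x2 matrix [[h00, h01], [h01, h11]].
        h00 = sum(mu)
        h01 = sum(m * x for m, x in zip(mu, xs))
        h11 = sum(m * x * x for m, x in zip(mu, xs))
        det = h00 * h11 - h01 * h01
        # Newton update beta <- beta - H^{-1} g, with the 2x2 inverse written out.
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # illustrative feature
ys = [1, 2, 2, 4, 7]            # illustrative counts

b0, b1 = fit_poisson_newton(xs, ys)
mu_hat = [math.exp(b0 + b1 * x) for x in xs]
score = (
    sum(m - y for m, y in zip(mu_hat, ys)),
    sum((m - y) * x for m, y, x in zip(mu_hat, ys, xs)),
)
```

At the fitted coefficients both score components are numerically zero, so the fitted means reproduce Σy and Σxy exactly. A production fit would add a convergence test and step-size safeguard rather than a fixed number of iterations.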

Interview-grade depth

Most candidates don't know linear and logistic regression are siblings in the same family. Saying "this is a GLM problem; counts → Poisson + log link, not linear regression on log-counts" signals you've thought past the syllabus. Bonus: the canonical link (logit for Bernoulli, log for Poisson, identity for Gaussian) is the one that makes sufficient statistics line up with Xᵀy and the score equations clean.

Interactive · decision boundary visualizer

Pick a dataset (linear, slight overlap, XOR, concentric). Choose linear features, polynomial features (adds x_1², x_2², x_1 x_2), or 3-class softmax. Slide L2. The widget fits logistic regression by gradient descent and draws the decision boundary, train accuracy, log-loss, and the learned weights.

Logistic regression — decision boundary

Try XOR with linear features (impossible — boundary is a line, accuracy ≈ 50%). Switch to polynomial features (perfect separation — boundary is a hyperbola). This is the canonical "feature engineering closes the gap on a linear model" demo.
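The XOR demo can be reproduced without the widget. The sketch uses the four noiseless XOR corners rather than the widget's datasets, and adds only the x_1·x_2 cross term (the widget's polynomial option also adds squares). The `fit_logistic` helper, step count, and learning rate are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, ys, steps=5000, lr=0.3):
    """Full-batch gradient descent on the unregularized NLL; rows include a leading 1."""
    beta = [0.0] * len(rows[0])
    for _ in range(steps):
        res = [sigmoid(sum(b * v for b, v in zip(beta, r))) - y for r, y in zip(rows, ys)]
        grad = [sum(e * r[j] for e, r in zip(res, rows)) for j in range(len(beta))]
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

def accuracy(beta, rows, ys):
    # Predict class 1 when the linear score is positive (probability above 0.5).
    preds = [1 if sum(b * v for b, v in zip(beta, r)) > 0 else 0 for r in rows]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [0, 1, 1, 0]  # XOR labels

linear_rows = [(1.0, x1, x2) for x1, x2 in points]
cross_rows = [(1.0, x1, x2, x1 * x2) for x1, x2 in points]

acc_linear = accuracy(fit_logistic(linear_rows, ys), linear_rows, ys)
acc_cross = accuracy(fit_logistic(cross_rows, ys), cross_rows, ys)
```

With linear features the gradient is zero at β = 0 on this symmetric dataset, so the model never leaves the all-0.5 prediction and accuracy stays at 50%. With the cross term the four points become separable and accuracy reaches 100%.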

When logistic regression beats more complex models

The reflex to reach for XGBoost is often wrong. Logistic regression is the right baseline, and often the production choice, when the data is small-to-medium in size and high-dimensional but sparse (text, hashed categoricals, click logs); when interpretability is mandated; when calibration matters (auctions, risk pricing); when drift monitoring is in scope (coefficient drift is trivial to track, tree-ensemble drift is opaque); and when the signal is linearly separable after feature engineering. Cross-features (lesson 10) close most of the gap with tree models. Where it loses: when interactions are dense and you can't enumerate them. Trees and nets find them automatically; logistic regression needs them engineered.

Trade-offs against the usual suspects

Dimension	Logistic	Random forest	GBDT	Neural net
Interpretability	Very high (odds ratios)	Medium (importance, not directional)	Medium (SHAP per-prediction)	Low
Calibration out of the box	Excellent	Poor (votes → 0.5)	Poor (over-confident leaves)	Poor (over-confident)
Training time	Fast (convex)	Medium	Medium-fast	Slow, brittle
Captures interactions	Only if engineered	Automatically	Automatically	Automatically
Small-data behaviour	Very strong	Decent	Decent	Often overfits
Feature-scaling sensitivity	Yes	None	None	Yes
Missing values	Impute first	Workable	Native	Impute first
Online / streaming	Yes (SGD)	No	Difficult	Yes (SGD)

The trap senior interviewers set

"100k-row credit-default dataset, 40 features, regulator-facing. Logistic or XGBoost?" The right answer isn't a model — it's framing. Logistic: calibrated, interpretable, defensible coefficients, easy to monitor. XGBoost: ~1–3% more AUC, opaque to audit, needs SHAP narratives and post-hoc calibration. For regulator-facing, the AUC delta rarely justifies the operational cost. Defend the call; don't recite "XGBoost is better."

Interview prompts you should be ready for

"Why BCE, not MSE, for binary classification?" (BCE is the NLL of a Bernoulli — principled. MSE assumes Gaussian noise on a 0/1 label. Bonus: BCE+sigmoid is convex; MSE+sigmoid isn't.)
"Derive the gradient of the logistic NLL." (∇_β NLL = Xᵀ(σ(Xβ) − y). Same shape as the OLS gradient with sigmoid in place of the linear prediction — the canonical-link property.)
"Train logistic regression on XOR — what happens, how do you fix it?" (Linear features → boundary is a line → ~50%. Add x_1·x_2; boundary becomes a conic; perfect separation. The widget shows this.)
"When does logistic regression give calibrated probabilities by default?" (Trained with BCE and well-specified. The big production advantage over trees and deep nets, which usually need Platt or isotonic.)
"Softmax vs one-vs-rest multi-class?" (Softmax: joint MLE, probabilities sum to 1, single model. OvR: K binary models, parallelizable, per-class calibration knobs. Default softmax.)
"Logistic vs XGBoost for credit-default — defend a choice." (Calibration, interpretability, monitoring, regulator audit — not raw AUC.)
"Modelling daily visits per user — what GLM?" (Counts → Poisson + log link. Variance ≫ mean: negative binomial. Don't run linear regression on log(visits+1).)

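For the "derive the gradient" prompt, a finite-difference check is a quick way to confirm a derivation before an interview or a code review. The sketch compares Xᵀ(σ(Xβ) − y) with central differences of the NLL. The rows, labels, and test point are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(beta, rows, ys):
    """Bernoulli negative log-likelihood summed over the rows."""
    total = 0.0
    for r, y in zip(rows, ys):
        p = sigmoid(sum(b * v for b, v in zip(beta, r)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def analytic_grad(beta, rows, ys):
    """X^T (sigma(X beta) - y), written out with plain lists."""
    res = [sigmoid(sum(b * v for b, v in zip(beta, r))) - y for r, y in zip(rows, ys)]
    return [sum(e * r[j] for e, r in zip(res, rows)) for j in range(len(beta))]

def numeric_grad(beta, rows, ys, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(len(beta)):
        up = beta[:j] + [beta[j] + eps] + beta[j + 1:]
        down = beta[:j] + [beta[j] - eps] + beta[j + 1:]
        grad.append((nll(up, rows, ys) - nll(down, rows, ys)) / (2 * eps))
    return grad

# Illustrative rows with a leading 1 for the intercept.
rows = [(1.0, 0.5, -1.2), (1.0, -0.3, 0.8), (1.0, 1.5, 0.1), (1.0, -2.0, -0.4)]
ys = [1, 0, 1, 0]
beta = [0.1, -0.4, 0.7]  # arbitrary test point

max_abs_diff = max(
    abs(a - n) for a, n in zip(analytic_grad(beta, rows, ys), numeric_grad(beta, rows, ys))
)
```

A small `max_abs_diff` at a few random points is strong evidence that the sign and the transpose in the derivation are right.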
Takeaway

Logistic regression isn't a stand-alone trick; it's the Bernoulli row in the GLM table. Pick the noise model that matches your outcome, pick the canonical link, and MLE + a convex optimizer gives you a calibrated, interpretable model. Reach for trees and nets only when interactions are dense and you can't engineer them — and price in the post-hoc calibration tax you'll pay.