Linear regression — OLS & regularization

The place where every bias-variance lever from lesson 1 has a closed-form analogue. If you can't derive OLS in four lines and explain why ridge has a unique solution but lasso does feature selection, the rest of these interviews will be uphill.

OLS from first principles

Design matrix X ∈ ℝ^{n×p}, targets y ∈ ℝ^n. Assume y = Xβ + ε with ε ~ (0, σ²I). The squared-error loss is L(β) = ‖y − Xβ‖² = (y − Xβ)ᵀ(y − Xβ). Take the gradient and set to zero — four lines:

∂L/∂β = −2 Xᵀ(y − Xβ) = 0 ⇒ Xᵀy = XᵀX β ⇒ β̂ = (XᵀX)⁻¹ Xᵀy.

Those are the normal equations. The Hessian 2XᵀX is PSD, so the stationary point is a minimum — unique exactly when XᵀX is invertible, i.e., the columns of X are linearly independent.

The geometric reading is sharper than the algebra. Xβ̂ lives in the column space of X; OLS picks the closest such combination to y, so the residual y − Xβ̂ is perpendicular to every column of X — exactly Xᵀ(y − Xβ̂) = 0. OLS is orthogonal projection of y onto the column space of X.

Why the projection picture is worth memorizing

Many "why does X work?" questions reduce to "because OLS is a projection". Why are residuals uncorrelated with the fit? Orthogonality. Why does adding a feature never increase training error? Larger subspace. Why is multicollinearity bad? Near-parallel columns make projection coefficients explode while the projection itself stays stable.

Gauss-Markov & the assumptions you must name

Under four conditions OLS is the BLUE — Best (minimum-variance) Linear Unbiased Estimator.

Assumption	Statement	Violation → fix
Linearity	E[y\|X] = Xβ	Polynomial / basis-function expansion, splines, GAMs, or a nonlinear model.
Exogeneity	E[ε\|X] = 0	Endogeneity → β̂ biased and inconsistent. Fix with IV, control functions, or randomized experiments.
Homoscedasticity	Var(ε\|X) = σ²I	β̂ still unbiased but textbook SEs wrong. Use weighted LS or sandwich (robust) SEs.
Uncorrelated errors	Cov(ε_i, ε_j) = 0	Time series / panel → GLS, Newey-West SEs, or model the AR structure.

Gauss-Markov is a conditional statement: among linear unbiased estimators, OLS has smallest variance. It does not say OLS is best overall — biased estimators (ridge, lasso) can have lower MSE. That gap is the entire point of regularization.

Normal equations vs gradient descent

Method	Cost	When
Normal equations	O(np² + p³)	p ≲ a few thousand, XᵀX well-conditioned. Solve, don't invert.
QR / SVD	O(np²)	Stable even when XᵀX is near-singular. Default in serious libraries.
(S)GD	O(np · iters)	p large (≳10⁴), n streaming, or X sparse.
Coordinate descent	O(np · iters)	Lasso / elastic-net via soft-thresholding.

Conditioning matters: κ(XᵀX) = κ(X)². Near-collinear columns make XᵀX nearly singular, and direct inversion amplifies floating-point noise. QR / SVD work on X directly and avoid the squaring. Ridge fixes it analytically by adding λI.

Multicollinearity — the failure mode that motivates regularization

Suppose features 3 and 4 are almost identical. The fits … + 5x_3 − 4x_4 … and … + 105x_3 − 104x_4 … make near-identical predictions because the difference of the coefficients is what acts on x_3 ≈ x_4. OLS can't choose; small data perturbations swing β̂ across this huge equivalent set. Predictions stay fine; coefficients become uninterpretable and high-variance.

Diagnostic: variance inflation factor VIF_j = 1/(1 − R²_j), where R²_j is the R² of regressing feature j on the others. VIF=1 means orthogonal; VIF≫10 is the conventional alarm. Fixes: drop one, combine them, or regularize.

Ridge regression — L2

L(β) = ‖y − Xβ‖² + λ‖β‖² ⇒ β̂_ridge = (XᵀX + λI)⁻¹ Xᵀy.

Three views: (1) constrained — minimise ‖y − Xβ‖² s.t. ‖β‖² ≤ t(λ); (2) Bayesian — MAP under Gaussian prior β ~ N(0, τ²I) with λ = σ²/τ²; (3) bias-variance — λ↑ shrinks β̂ toward 0, raising bias and lowering variance.

The miracle: XᵀX + λI is positive definite for any λ > 0, even when XᵀX is singular. Ridge always has a unique solution. The analytic fix for multicollinearity — correlated features are shrunk together rather than fighting.

Lasso regression — L1

L(β) = ‖y − Xβ‖² + λ‖β‖₁, ‖β‖₁ = Σ_j |β_j|.

No closed form: the L1 penalty is non-differentiable at zero. Standard solver is coordinate descent with soft-thresholding:

β_j ← S(ρ_j, λ) / ‖x_j‖², S(ρ, λ) = sign(ρ) · max(|ρ| − λ, 0),

where ρ_j = x_jᵀ(y − X_{−j}β_{−j}) is the partial residual correlation. The max(·, 0) is the crucial piece: when |ρ_j| ≤ λ, the coordinate is set exactly to zero. That's how lasso does feature selection.

Geometrically: the L1 ball is a diamond with corners on the axes; the L2 ball is a smooth sphere. Loss contours are ellipses centred at OLS, and the penalized optimum is where the smallest contour touches the constraint set. For the diamond, the touch point is very often a corner — coordinates are exactly zero. For the sphere, the touch point is generically interior — all coefficients nonzero, just shrunk.

Same OLS optimum, same loss contours, different penalty geometry. L2's smooth ball admits tangency anywhere on its surface — generically off-axis, so all coefficients shrink but stay nonzero. L1's diamond has corners on the axes, and the smallest reachable contour very often hits a corner — exact zero, feature selection by geometry.

Caveat: with several highly-correlated features, lasso arbitrarily picks one and zeros the others; the choice can flip across re-samples. Elastic net handles this case.

Elastic net

L(β) = ‖y − Xβ‖² + λ₁‖β‖₁ + λ₂‖β‖².

Zou & Hastie 2005. The L2 piece groups correlated features (shrinks them toward each other instead of arbitrary one-survivor selection); the L1 piece preserves feature selection. Two hyperparameters: total strength and L1/L2 mix.

Bayesian re-statement

Prior on β + MAP ⇒ penalty in log-space. Gaussian prior → L2 (ridge); Laplace prior → L1 (lasso). The Laplace has heavier tails and a sharper peak at zero — that's why it pushes coefficients all the way to zero rather than just shrinking. Regularization strength is inverse prior variance; "add a penalty" becomes "encode a prior belief that most coefficients are small".

Interactive · ridge vs lasso coefficient path

Synthetic dataset, 10 features: x0–x2 carry signal, x3 carries signal and is 95% correlated with the noise twin x4, x5–x9 are pure noise. Move λ along a log scale, flip between ridge / lasso / elastic-net.

Coefficient path · 10-feature toy

Start near λ=0 (OLS): coefficients on x3–x4 look noisy. Bump λ under lasso: x5–x9 (noise) collapse to zero first, then one of {x3, x4} zeros out. Same λ under ridge: nothing zero, all shrunk. Elastic-net: x3 and x4 shrink toward each other instead of one dying.

log₁₀(λ): -1.0 penalty: noise σ: 0.5

train MSE

—

val MSE

—

nonzero coefs

—

cond(XᵀX)

—

Reading

—

When linear is enough

Against the modern reflex to reach for XGBoost: linear models with thoughtful preprocessing are often competitive on small clean tabular data, and they are the only models with closed-form standard errors — CIs on predictions, coefficients, and contrasts for free. The recipe: (1) encode categoricals well — one-hot for low cardinality, target / mean encoding with CV for high (lesson 10); naïve label-encoding into a linear model is almost always wrong; (2) hand-craft cross-features for known interactions — linear models don't discover interactions, that's tree-shaped intelligence; (3) regularize — ridge by default, elastic net if p > n or features correlated. When not to: large p with unknown interaction structure, heavy non-linearities, or anywhere you can afford XGBoost and don't need CIs.

Trade-offs at a glance

	OLS	Ridge	Lasso	Elastic net
Handles multicollinearity	no — β̂ explodes	yes — shrinks together	partial — picks one	yes — groups
Feature selection	no	no	yes (exact zeros)	yes
Closed form	yes	yes	no	no
Unique solution	only if XᵀX invertible	always (λ>0)	convex; generically unique, non-unique under exact collinearity (the L1 ball can be tangent to a flat face of the loss)	always
Closed-form CIs	yes	biased; bootstrap	bootstrap	bootstrap
Cost	O(np² + p³)	O(np² + p³)	O(np · iters)	O(np · iters)
Hyperparameters	none	λ	λ	λ, mix

A common trap

"Standardize before regularizing" is right — but say why. The penalty ‖β‖² / ‖β‖₁ is not scale-invariant. A feature in millimetres has a 1000× smaller coefficient than the same feature in metres; the penalty treats them very differently. Center y, standardize columns of X, and don't penalize the intercept.

Interview prompts you should be ready for

"Derive the OLS estimator from scratch." (L = ‖y − Xβ‖², ∂L/∂β = −2Xᵀ(y − Xβ) = 0. Mention projection; uniqueness needs XᵀX invertible.)
"Why is ridge equivalent to a Gaussian prior on β?" (Log-posterior under y|β ~ N(Xβ, σ²I), β ~ N(0, τ²I); MAP is OLS + ‖β‖² with weight σ²/τ². Laplace prior gives lasso.)
"When pick lasso over ridge?" (True β sparse, or you need feature selection. Dense signal / correlated predictors → ridge or elastic net.)
"R²=0.95 in-sample, 0.40 out-of-sample. Diagnose." (High variance / overfit. Likely p ≫ n, leakage, or multicollinearity. Regularize, more data, check leakage, check coefficient stability across CV folds.)
"Two features perfectly correlated. OLS? Ridge? Lasso?" (OLS: XᵀX singular, no unique solution. Ridge: invertible, coefficients split equally. Lasso: arbitrarily picks one, zeros the other.)
"When use linear in 2026 instead of XGBoost?" (Small clean tabular with main-effects; need calibrated CIs; cheap inference; regulatory / interpretability; baseline; meta-learner over GBDT features.)

Takeaway

OLS is orthogonal projection of y onto the column space of X. Multicollinearity makes projection coefficients unstable without breaking the projection itself. Ridge adds λI to XᵀX — unique solution always, correlated features shrunk together. Lasso replaces L2 with L1; the diamond constraint forces exact zeros at corners — feature selection by geometry. Elastic net combines both. Every penalty has a Bayesian prior interpretation; regularization strength is inverse prior variance.