Naive Bayes & generative vs discriminative

The simplest classifier that still ships in 2026. A first-principles derivation, the "naive" assumption that should be wrong but isn't, and the Ng-Jordan crossover — the canonical interview question on generative vs discriminative learning.

Bayes rule, applied to classification

We want P(y | x) — the posterior over class labels given features. Bayes rule rewrites it:

P(y | x) = P(x | y) P(y) / P(x) ∝ P(x | y) P(y)

The denominator P(x) doesn't depend on y, so for prediction (argmax over classes) we can drop it. Two factors remain:

Class prior P(y). Trivial to estimate: count of class y divided by N.
Class-conditional likelihood P(x | y). This is the hard part. For d features, the joint distribution over x = (x_1, …, x_d) conditioned on y has — for discrete features with V values each — on the order of V^d free parameters per class. Exponential in d. You cannot estimate it from any realistic dataset.

Everything that follows is about how to get around this curse-of-dimensionality wall on the likelihood.

The naive assumption

Assume features are conditionally independent given the class:

P(x | y) = P(x_1, x_2, …, x_d | y) = Π_j P(x_j | y)

Each P(x_j | y) is now a 1D distribution: V parameters per class per feature, total O(d · V · K) for K classes. Linear in d. Tractable from tiny datasets.

The assumption is, in essentially every real dataset, wildly wrong. "Free" and "shipping" co-occur in spam emails far more than independence predicts. NB commits to a joint with zero off-diagonal mutual information given y.

So why does it work? Because classification only needs the argmax, not the calibrated probability. Domingos & Pazzani (1997, "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss") proved: NB returns the correct class whenever the most-likely class under the true correlated distribution coincides with the most-likely class under the independence approximation. This holds for many more cases than the assumption itself — including cases where NB's posterior probabilities are off by orders of magnitude. Domingos & Pazzani's contribution is to identify sufficient conditions for this to hold even under strong feature dependence — typically when the approximation's errors are roughly uniform across classes, so the argmax is preserved.

The right mental model

Naive Bayes is best read as a strong inductive bias, not as a faithful model of the world. The bias says: "ignore feature correlations; just look at how each feature lines up with each class on its own." It throws away information, but the information it keeps is robust to small n, and the decision rule is often still correct even when the underlying likelihood is badly wrong.

Three flavors

Variant	Per-feature model P(x_j \| y)	Use case
Gaussian NB	N(μ_{jy}, σ²_{jy}) — per-class mean and variance	Continuous features. Cheap baseline for tabular data.
Multinomial NB	Per-class word-count distribution: P(x_j \| y) = θ_{jy}	Bag-of-words text classification. The classical spam/topic baseline.
Bernoulli NB	Per-class probability that x_j = 1	Binary feature presence/absence. Short documents, click features.

Training is just counting

The MLE for each P(x_j | y) is a closed-form count. For a categorical feature with value v:

P̂(x_j = v | y = k) = count(x_j = v, y = k) / count(y = k)

For Gaussian NB: per-class sample mean and variance. No optimization, no gradient descent, no hyperparameter search. Training is O(n · d), one pass. Prediction is O(d · K) per example. This is why NB still ships: when you need a baseline on 100M rows in 5 minutes, nothing else is comparable.

Laplace smoothing — the one knob

If a feature value never appears in training for a class, MLE gives P̂ = 0. The product Π_j P(x_j | y = k) is then zero regardless of every other feature — one unseen token vetoes the entire class. Fix: add a pseudocount α to every cell:

P̂(x_j = v | y = k) = (count(x_j = v, y = k) + α) / (count(y = k) + α V)

where V is the number of distinct feature values. This is the MAP estimate under a symmetric Dirichlet prior with concentration α. α = 1 (add-one smoothing) is the default; tune on validation for text where most counts are zero.

A trap to avoid

Forgetting to smooth is the single most common production bug with NB. Symptom: in offline eval the model is fine; at serving time it predicts the majority class for any input containing a token unseen during training. Always smooth. Always test on a held-out set that contains out-of-vocabulary tokens.

Why it works despite the wrong assumption

Three reasons, each worth being able to cite:

The error is uniform across classes. If feature correlations are similar across the classes you're comparing, the bias affects every P(y | x) roughly equally and the argmax is unchanged. Probabilities can be off by orders of magnitude while the winning class stays correct.
Sparse high-dimensional data has limited redundancy. In a 50,000-token vocabulary most pairs almost never co-occur and independence is closer to true than intuition suggests. NB ships for text and rarely for image pixels.
Low variance from few parameters. NB has O(d · K) parameters and a strong inductive bias acts as a regularizer. When n is small, low variance dominates low bias and NB wins.

Calibration — NB's probabilities are wrong even when its decisions are right

Multiplying d small probabilities pushes the unnormalized posterior to extremes; after normalizing you get probabilities very close to 0 or 1. The argmax is unaffected but the magnitudes are. NB classifiers are systematically overconfident.

If downstream code consumes the probability (ranking, threshold tuning, ad auctions) you must calibrate. Standard fix: Platt-scale a sigmoid on top of log P(y=1|x) − log P(y=0|x) using a held-out set. Or isotonic regression — non-parametric and more flexible at the cost of more calibration data.

The deeper question: generative vs discriminative

NB is the canonical example of a generative classifier. To make this clean:

	Generative	Discriminative
What it models	Joint P(x, y) = P(x \| y) P(y)	Conditional P(y \| x) directly
Decision rule	Apply Bayes to derive P(y \| x)	Read off model output
Examples	Naive Bayes, LDA (linear discriminant analysis), GMM classifiers, HMMs	Logistic regression, SVMs, trees, GBDT, neural nets
Can sample new x?	Yes — sample y ~ P(y), then x ~ P(x \| y)	No — there's no model of P(x)
Anomaly detection?	Yes — low P(x \| y) under all classes flags anomaly	No — classifier still outputs a probability

The trade is sharper than it looks. Generative models commit to a story about how the data was produced; discriminative models commit only to how to separate classes. The generative commitment is more informative when correct (you can sample, detect anomalies, encode priors) and more costly when wrong (a bad P(x | y) degrades classification). Caveat: NB is a poor DENSITY estimator even when its argmax is good — the independence assumption can give absurdly low likelihoods to genuinely-typical points. So "low P(x|y) under all classes → anomaly" inherits the independence problem.

Ng & Jordan 2001 — the canonical result

Ng & Jordan ("On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes") gave the result that every senior MLE should be able to state:

The crossover

Naive Bayes (generative) converges to its asymptotic error faster than logistic regression (discriminative) — it needs less data to reach its best. But logistic regression has a lower asymptotic error. So as the training set grows, there is a crossover: NB wins at small n, LR wins at large n. The sample complexity to approach asymptotic error is roughly O(log d) for NB versus O(d) for LR.

NB's strong inductive bias reaches its best fit with very little data — but that "best fit" is capped by the independence assumption and is worse than LR's asymptotic best. LR makes no such assumption; it pays in samples needed but its ceiling is higher.

Interactive · the Ng-Jordan crossover

Two 2D Gaussian classes. Slide the training-set size n and the class overlap. The widget fits both a Gaussian NB and a logistic regression on the same data, draws the decision boundaries, and reports validation accuracy for both. At small n, NB should win; at large n, LR should pull ahead.

NB vs LR vs LDA — a senior cheat sheet

Aspect	Naive Bayes	Logistic regression	LDA
Family	Generative	Discriminative	Generative
Parameters	O(d · K)	O(d · K)	O(d² + d · K) (shared covariance)
Key assumption	Features cond. independent given y	Log-odds linear in x	Class-conditional Gaussians, shared covariance
Asymptotic error	Higher (assumption forces worse fit)	Lower	When the Gaussian-shared-covariance assumption holds exactly, LDA is asymptotically MORE efficient than LR (lower asymptotic error). The decision boundary is linear for both, but LDA's stronger generative assumption — when correct — gives better parameter efficiency. When the assumption is wrong, LR is more robust.
Small-n behavior	Best — strong bias, low variance	Worst — needs samples to fit	Middle — fewer params than LR when d small
Probability calibration	Bad (overconfident)	Good when assumption holds	Good when Gaussian assumption holds
Anomaly detection	Yes (via P(x \| y))	No	Yes (via Mahalanobis dist.)
Interpretability	Per-feature class weights are immediate	Coefficients are log-odds	Discriminant axes are interpretable

When to pick which, in practice

Pick generative (NB, LDA, GMM) when…	Pick discriminative (LR, GBDT, NN) when…
n is very small and you cannot collect more.	n is moderate or large.
You have a domain story about the data-generating process and want to encode it as a prior.	You don't trust any assumption you'd make about P(x).
You need to detect anomalies — points unlikely under all classes.	You only care about ranking or classification accuracy.
You want to generate synthetic data (resample P(x \| y)).	You have unlabeled covariates whose distribution you don't want to model.
You want a 5-minute baseline.	You're optimizing the last 1% of a deployed system.

Interview prompts you should be ready for

"Why is naive Bayes called 'naive'?" (The independence assumption: P(x | y) = Π_j P(x_j | y). Features in real data are essentially never independent. It's naive about the joint distribution.)
"Why does NB sometimes outperform logistic regression on the same dataset?" (Strong inductive bias → low variance → wins when n is small even though it has higher asymptotic error. Cite Ng-Jordan crossover.)
"Walk me through the Ng-Jordan result." (Generative models converge to asymptotic error faster — roughly O(log d) samples vs O(d) for LR — but the asymptotic error itself is higher for NB because of the independence assumption. Crossover at small-to-moderate n.)
"Your NB classifier outputs P(class) = 0 for some inputs. What's happening?" (A feature value unseen in training for that class is giving P̂ = 0, which zeroes out the entire product. Fix: Laplace/additive smoothing — add α to every count, equivalent to a Dirichlet prior MAP estimate.)
"Why are NB probabilities miscalibrated, and what do you do about it?" (Multiplying many small probabilities pushes the normalized posterior toward 0/1 → overconfident. Fix: Platt scaling or isotonic regression on a held-out set if downstream consumers need calibrated probabilities.)
"When would you pick NB over GBDT in 2026?" (Tiny n, very high d sparse features like raw text, need-something-in-an-hour baseline, streaming setting where online updates are trivial, or you genuinely care about P(x | y) for anomaly detection. Otherwise GBDT.)

Takeaway

NB survives because of three things: a closed-form count-based fit; an inductive bias strong enough to dominate when data is scarce; and a decision rule that's often right even when the probability model is wrong. The deeper lesson — generative vs discriminative — is that committing to a story about P(x | y) trades flexibility for sample efficiency, and the right side of that trade depends on how much data you have. Ng-Jordan is the sentence that captures it.