Naive Bayes & generative vs discriminative
The simplest classifier that still ships in 2026. A first-principles derivation, the "naive" assumption that should be wrong but isn't, and the Ng-Jordan crossover — the canonical interview question on generative vs discriminative learning.
Bayes rule, applied to classification
We want P(y | x) — the posterior over class labels given features. Bayes rule rewrites it:
The denominator P(x) doesn't depend on y, so for prediction (argmax over classes) we can drop it. Two factors remain:
- Class prior P(y). Trivial to estimate: count of class y divided by N.
- Class-conditional likelihood P(x | y). This is the hard part. For d features, the joint distribution over x = (x_1, …, x_d) conditioned on y has — for discrete features with V values each — on the order of V^d free parameters per class. Exponential in d. You cannot estimate it from any realistic dataset.
Everything that follows is about how to get around this curse-of-dimensionality wall on the likelihood.
The naive assumption
Assume features are conditionally independent given the class:
Each P(x_j | y) is now a 1D distribution: V parameters per class per feature, total O(d · V · K) for K classes. Linear in d. Tractable from tiny datasets.
The assumption is, in essentially every real dataset, wildly wrong. "Free" and "shipping" co-occur in spam emails far more than independence predicts. NB commits to a joint with zero off-diagonal mutual information given y.
So why does it work? Because classification only needs the argmax, not the calibrated probability. Domingos & Pazzani (1997, "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss") proved: NB returns the correct class whenever the most-likely class under the true correlated distribution coincides with the most-likely class under the independence approximation. This holds for many more cases than the assumption itself — including cases where NB's posterior probabilities are off by orders of magnitude. Domingos & Pazzani's contribution is to identify sufficient conditions for this to hold even under strong feature dependence — typically when the approximation's errors are roughly uniform across classes, so the argmax is preserved.
Three flavors
| Variant | Per-feature model P(x_j | y) | Use case |
|---|---|---|
| Gaussian NB | N(μ_{jy}, σ²_{jy}) — per-class mean and variance | Continuous features. Cheap baseline for tabular data. |
| Multinomial NB | Per-class word-count distribution: P(x_j | y) = θ_{jy} | Bag-of-words text classification. The classical spam/topic baseline. |
| Bernoulli NB | Per-class probability that x_j = 1 | Binary feature presence/absence. Short documents, click features. |
Training is just counting
The MLE for each P(x_j | y) is a closed-form count. For a categorical feature with value v:
For Gaussian NB: per-class sample mean and variance. No optimization, no gradient descent, no hyperparameter search. Training is O(n · d), one pass. Prediction is O(d · K) per example. This is why NB still ships: when you need a baseline on 100M rows in 5 minutes, nothing else is comparable.
Laplace smoothing — the one knob
If a feature value never appears in training for a class, MLE gives P̂ = 0. The product Π_j P(x_j | y = k) is then zero regardless of every other feature — one unseen token vetoes the entire class. Fix: add a pseudocount α to every cell:
where V is the number of distinct feature values. This is the MAP estimate under a symmetric Dirichlet prior with concentration α. α = 1 (add-one smoothing) is the default; tune on validation for text where most counts are zero.
Why it works despite the wrong assumption
Three reasons, each worth being able to cite:
- The error is uniform across classes. If feature correlations are similar across the classes you're comparing, the bias affects every P(y | x) roughly equally and the argmax is unchanged. Probabilities can be off by orders of magnitude while the winning class stays correct.
- Sparse high-dimensional data has limited redundancy. In a 50,000-token vocabulary most pairs almost never co-occur and independence is closer to true than intuition suggests. NB ships for text and rarely for image pixels.
- Low variance from few parameters. NB has O(d · K) parameters and a strong inductive bias acts as a regularizer. When n is small, low variance dominates low bias and NB wins.
Calibration — NB's probabilities are wrong even when its decisions are right
Multiplying d small probabilities pushes the unnormalized posterior to extremes; after normalizing you get probabilities very close to 0 or 1. The argmax is unaffected but the magnitudes are. NB classifiers are systematically overconfident.
If downstream code consumes the probability (ranking, threshold tuning, ad auctions) you must calibrate. Standard fix: Platt-scale a sigmoid on top of log P(y=1|x) − log P(y=0|x) using a held-out set. Or isotonic regression — non-parametric and more flexible at the cost of more calibration data.
The deeper question: generative vs discriminative
NB is the canonical example of a generative classifier. To make this clean:
| Generative | Discriminative | |
|---|---|---|
| What it models | Joint P(x, y) = P(x | y) P(y) | Conditional P(y | x) directly |
| Decision rule | Apply Bayes to derive P(y | x) | Read off model output |
| Examples | Naive Bayes, LDA (linear discriminant analysis), GMM classifiers, HMMs | Logistic regression, SVMs, trees, GBDT, neural nets |
| Can sample new x? | Yes — sample y ~ P(y), then x ~ P(x | y) | No — there's no model of P(x) |
| Anomaly detection? | Yes — low P(x | y) under all classes flags anomaly | No — classifier still outputs a probability |
The trade is sharper than it looks. Generative models commit to a story about how the data was produced; discriminative models commit only to how to separate classes. The generative commitment is more informative when correct (you can sample, detect anomalies, encode priors) and more costly when wrong (a bad P(x | y) degrades classification). Caveat: NB is a poor DENSITY estimator even when its argmax is good — the independence assumption can give absurdly low likelihoods to genuinely-typical points. So "low P(x|y) under all classes → anomaly" inherits the independence problem.
Ng & Jordan 2001 — the canonical result
Ng & Jordan ("On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes") gave the result that every senior MLE should be able to state:
NB's strong inductive bias reaches its best fit with very little data — but that "best fit" is capped by the independence assumption and is worse than LR's asymptotic best. LR makes no such assumption; it pays in samples needed but its ceiling is higher.
Interactive · the Ng-Jordan crossover
Two 2D Gaussian classes. Slide the training-set size n and the class overlap. The widget fits both a Gaussian NB and a logistic regression on the same data, draws the decision boundaries, and reports validation accuracy for both. At small n, NB should win; at large n, LR should pull ahead.
NB vs LR vs LDA — a senior cheat sheet
| Aspect | Naive Bayes | Logistic regression | LDA |
|---|---|---|---|
| Family | Generative | Discriminative | Generative |
| Parameters | O(d · K) | O(d · K) | O(d² + d · K) (shared covariance) |
| Key assumption | Features cond. independent given y | Log-odds linear in x | Class-conditional Gaussians, shared covariance |
| Asymptotic error | Higher (assumption forces worse fit) | Lower | When the Gaussian-shared-covariance assumption holds exactly, LDA is asymptotically MORE efficient than LR (lower asymptotic error). The decision boundary is linear for both, but LDA's stronger generative assumption — when correct — gives better parameter efficiency. When the assumption is wrong, LR is more robust. |
| Small-n behavior | Best — strong bias, low variance | Worst — needs samples to fit | Middle — fewer params than LR when d small |
| Probability calibration | Bad (overconfident) | Good when assumption holds | Good when Gaussian assumption holds |
| Anomaly detection | Yes (via P(x | y)) | No | Yes (via Mahalanobis dist.) |
| Interpretability | Per-feature class weights are immediate | Coefficients are log-odds | Discriminant axes are interpretable |
When to pick which, in practice
| Pick generative (NB, LDA, GMM) when… | Pick discriminative (LR, GBDT, NN) when… |
|---|---|
| n is very small and you cannot collect more. | n is moderate or large. |
| You have a domain story about the data-generating process and want to encode it as a prior. | You don't trust any assumption you'd make about P(x). |
| You need to detect anomalies — points unlikely under all classes. | You only care about ranking or classification accuracy. |
| You want to generate synthetic data (resample P(x | y)). | You have unlabeled covariates whose distribution you don't want to model. |
| You want a 5-minute baseline. | You're optimizing the last 1% of a deployed system. |
Interview prompts you should be ready for
- "Why is naive Bayes called 'naive'?" (The independence assumption: P(x | y) = Π_j P(x_j | y). Features in real data are essentially never independent. It's naive about the joint distribution.)
- "Why does NB sometimes outperform logistic regression on the same dataset?" (Strong inductive bias → low variance → wins when n is small even though it has higher asymptotic error. Cite Ng-Jordan crossover.)
- "Walk me through the Ng-Jordan result." (Generative models converge to asymptotic error faster — roughly O(log d) samples vs O(d) for LR — but the asymptotic error itself is higher for NB because of the independence assumption. Crossover at small-to-moderate n.)
- "Your NB classifier outputs P(class) = 0 for some inputs. What's happening?" (A feature value unseen in training for that class is giving P̂ = 0, which zeroes out the entire product. Fix: Laplace/additive smoothing — add α to every count, equivalent to a Dirichlet prior MAP estimate.)
- "Why are NB probabilities miscalibrated, and what do you do about it?" (Multiplying many small probabilities pushes the normalized posterior toward 0/1 → overconfident. Fix: Platt scaling or isotonic regression on a held-out set if downstream consumers need calibrated probabilities.)
- "When would you pick NB over GBDT in 2026?" (Tiny n, very high d sparse features like raw text, need-something-in-an-hour baseline, streaming setting where online updates are trivial, or you genuinely care about P(x | y) for anomaly detection. Otherwise GBDT.)