
Minimum Description Length

Model selection as data compression — the shortest total description identifies the best model

Overview & Motivation

Every statistician and every machine learning practitioner faces the same core problem: given data, which model should we use? Fit a polynomial of degree 3 or degree 15? Use a Gaussian mixture with 2 components or 20? The model that fits the training data best — the one with the highest likelihood — is almost never the right answer, because it overfits. We need a principled way to balance fit against complexity.

The Minimum Description Length (MDL) principle, introduced by Jorma Rissanen in 1978, provides exactly that balance by reframing model selection as a compression problem. The idea is beautifully concrete: imagine you want to send your dataset to a colleague using as few bits as possible. You can first describe a model (costing L(M) bits), then describe the data given that model (costing L(D \mid M) bits). The best model is the one that minimizes the total description length:

L(M) + L(D \mid M)

A simple model costs few bits to describe but leaves large residuals; a complex model describes the data well but costs many bits itself. The minimum is the sweet spot — the model that captures the genuine structure in the data without encoding noise.

This is Occam’s Razor made precise: among models that explain the data equally well, prefer the simplest one — and “simplest” means “shortest to describe.”

Prerequisites: This topic builds on KL Divergence & f-Divergences, particularly the information-theoretic interpretation of divergence as coding redundancy. Familiarity with Shannon Entropy & Mutual Information is essential — especially source coding and the connection between entropy and optimal code lengths.

What We Cover

  1. Two-part codes and crude MDL — the description-length trade-off L(M) + L(D \mid M), Rissanen’s universal code for integers, and model selection as compression
  2. Kolmogorov complexity — the ideal but uncomputable measure of descriptive complexity, and why MDL provides a practical alternative
  3. Stochastic complexity and refined MDL — the Normalized Maximum Likelihood distribution, parametric complexity, and Shtarkov’s minimax optimality theorem
  4. Fisher information and asymptotic complexity — the geometric structure hidden inside model complexity, and why BIC is a crude MDL approximation
  5. MDL and Bayesian model selection — the surprising asymptotic equivalence of stochastic complexity and the marginal likelihood under Jeffreys prior
  6. Prequential codes and online MDL — sequential model selection that avoids the normalization problem entirely
  7. MDL for practical model selection — feature selection, distribution family selection, and overfitting detection

Two-Part Codes & Crude MDL

The simplest version of MDL works with two-part codes: first encode the model, then encode the data given the model. We need to make both parts precise.

Encoding Models

Suppose our model class is a family of polynomials of degree d. To encode “degree d,” we need a code for positive integers. Using a fixed-length code wastes bits on small values; a variable-length code does better.

Definition 1 (Two-Part Code / Crude MDL).

Given a countable collection of models \{M_1, M_2, \ldots\} and a coding scheme that assigns code length L(M) to each model and L(D \mid M) to the data given model M, the crude MDL principle selects the model minimizing:

\hat{M} = \arg\min_{M} \left[ L(M) + L(D \mid M) \right]

where L(D \mid M) is typically the negative log-likelihood under the maximum likelihood estimator:

L(D \mid M) = -\log_2 p(x^n \mid \hat{\theta}(x^n), M)

For the model code length, Rissanen proposed a universal code for positive integers:

Definition 2 (Universal Code for Integers).

Rissanen’s universal code assigns to positive integer k the code length:

L^*(k) = \log_2 k + \log_2 \log_2 k + \log_2 \log_2 \log_2 k + \cdots + \log_2 c_0

where the sum includes only positive terms. The constant c_0 \approx 2.865 normalizes the code so that the implied distribution 2^{-L^*(k)} sums to 1 over positive integers. This is a universal code: it does not assume any prior distribution on model complexity.
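The code length is easy to compute directly. A minimal sketch in Python, summing the positive iterated logarithms and adding the normalizing constant from the definition above:

```python
import math

def log_star(k: int, c0: float = 2.865) -> float:
    """Rissanen's universal code length L*(k) for positive integer k, in bits.

    Accumulates log2 k + log2 log2 k + ... while the terms stay positive,
    then adds the normalizing constant log2 c0.
    """
    assert k >= 1
    bits = math.log2(c0)
    term = math.log2(k)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits

# Small integers are cheap; large ones cost only logarithmically more:
for k in (1, 2, 16, 65536):
    print(k, round(log_star(k), 2))
```

Note how slowly the cost grows: encoding 65536 takes only about 23 bits more than encoding 1, which is why this code is a reasonable default when nothing is known about the model index in advance.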

Example: Polynomial Regression

Consider fitting a polynomial of degree d to n noisy observations. The model M_d has d + 1 free parameters (coefficients). Using b bits of precision per coefficient, the two-part code length is:

L(M_d) + L(D \mid M_d) = \underbrace{L^*(d+1) + (d+1) \cdot b}_{\text{model complexity}} + \underbrace{\frac{n}{2} \log_2(2\pi e \hat{\sigma}^2)}_{\text{data fit}}

where \hat{\sigma}^2 = \text{RSS}/n is the residual variance under the degree-d fit. The model code grows with d; the data code shrinks as the fit improves. The minimum total code length selects a degree that balances these competing pressures.
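A sketch of this selection on synthetic data (the cubic signal, noise level, and the choice b = 16 bits per coefficient are illustrative assumptions, not part of the theory):

```python
import math
import numpy as np

def log_star(k, c0=2.865):
    # Rissanen's universal integer code length, in bits
    bits, term = math.log2(c0), math.log2(k)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits

rng = np.random.default_rng(0)
n = 100
x = np.linspace(-1, 1, n)
y = 1.0 - 2.0 * x + 2.0 * x**3 + rng.normal(0.0, 0.3, n)  # cubic signal + noise

b = 16  # illustrative precision per coefficient, in bits

def two_part_bits(d):
    coeffs = np.polyfit(x, y, d)
    resid = y - np.polyval(coeffs, x)
    sigma2 = max(float(np.mean(resid**2)), 1e-12)
    model_bits = log_star(d + 1) + (d + 1) * b
    data_bits = (n / 2) * math.log2(2 * math.pi * math.e * sigma2)
    return model_bits + data_bits

lengths = {d: two_part_bits(d) for d in range(11)}
best = min(lengths, key=lengths.get)
print("MDL-selected degree:", best)
```

High degrees keep shaving the residual variance, but each extra coefficient costs b bits plus the integer-code increment, so the total bottoms out near the true degree rather than at the maximum.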

The interactive visualization below demonstrates this trade-off. Try increasing the noise level — MDL selects simpler models when the signal-to-noise ratio is low.

Three-panel figure showing polynomial fits, two-part code decomposition, and the Occam's sweet spot

The left panel shows how the polynomial fit tightens as degree increases; the right panel reveals the two-part code decomposition. The total code length (purple line) reaches its minimum at a degree that neither underfits nor overfits — this is the MDL-optimal model.


Kolmogorov Complexity

Before we can refine MDL, we need to understand the ideal notion of descriptive complexity — one that Rissanen’s practical MDL approximates.

Definition 3 (Kolmogorov Complexity).

The Kolmogorov complexity of a binary string x with respect to a universal Turing machine U is:

K_U(x) = \min \{ |p| : U(p) = x \}

where |p| denotes the length of program p in bits. In words: K_U(x) is the length of the shortest program that outputs x and halts.

K(x) captures the intuitive notion of the “true complexity” of a string. A string of all zeros has low Kolmogorov complexity (a short loop produces it), while a random string has complexity close to its length (no program shorter than the string itself can produce it).

Theorem 1 (Invariance Theorem).

For any two universal Turing machines U and V, there exists a constant c_{U,V} (depending only on U and V, not on x) such that:

|K_U(x) - K_V(x)| \leq c_{U,V}

for all strings x.

Proof.

The key idea is simulation. Since V is universal, there exists a program s that makes V simulate U. For any string x with U(p) = x, we can construct a program for V: the simulator s followed by p. This gives K_V(x) \leq |s| + K_U(x). By symmetry, K_U(x) \leq |s'| + K_V(x) for some simulator s'. Setting c_{U,V} = \max(|s|, |s'|) completes the proof.

The invariance theorem means that Kolmogorov complexity is well-defined up to an additive constant — the choice of reference machine doesn’t matter asymptotically. We drop the subscript U and write simply K(x).

The fundamental connection between Kolmogorov complexity and Shannon entropy:

Proposition 1 (Kolmogorov Complexity and Entropy).

If X_1, X_2, \ldots are i.i.d. draws from a distribution P with entropy H(P), then:

\frac{K(X_1 X_2 \cdots X_n)}{n} \to H(P) \quad \text{almost surely as } n \to \infty

Proof.

The upper bound K(x^n)/n \leq H(P) + \epsilon follows from the asymptotic equipartition property (AEP): with high probability, x^n lies in the typical set, which has at most 2^{n(H(P) + \epsilon)} elements. A program that encodes the index within the typical set uses n(H(P) + \epsilon) bits. The lower bound comes from the incompressibility lemma: random strings cannot be systematically compressed below their entropy rate.

This proposition reveals the deep connection: Kolmogorov complexity is the individual-sequence analogue of Shannon entropy. But there’s a fatal practical problem: K(x) is uncomputable. No algorithm can compute K(x) for all inputs — this follows from the halting problem. MDL sidesteps uncomputability by restricting to computable coding schemes, providing a practical upper bound on K(x).
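Any real compressor provides such a computable upper bound: the compressed length (plus the fixed cost of the decompressor) bounds K(x) from above. A quick illustration with zlib — a highly structured string collapses to a few dozen bytes, while incompressible random bytes do not:

```python
import os
import zlib

structured = b"01" * 5000          # 10,000 bytes of pure repetition
random_data = os.urandom(10000)    # 10,000 bytes with no exploitable structure

# len(zlib.compress(x)) + |decompressor| upper-bounds K(x)
print(len(zlib.compress(structured, 9)))   # tiny: the repetition is captured
print(len(zlib.compress(random_data, 9)))  # roughly the original length
```

This mirrors the two extremes in the text: the all-zeros-like string has a short description, while the random string resists compression, exactly as the incompressibility lemma predicts.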

Three-panel figure showing string compressibility, K(x^n)/n convergence to entropy, and MDL as a computable approximation


Stochastic Complexity & Refined MDL

Crude MDL has a troubling dependence on arbitrary choices: the precision b used to encode coefficients, the integer code for model indices. Refined MDL eliminates these choices by using a single universal code — the Normalized Maximum Likelihood (NML) distribution.

Definition 4 (Stochastic Complexity).

The stochastic complexity of a data sequence x^n relative to a parametric model class M_k = \{p_\theta : \theta \in \Theta_k\} is:

\mathrm{SC}(x^n; M_k) = -\log_2 p_{\mathrm{NML}}(x^n \mid M_k)

where p_{\mathrm{NML}} is the Normalized Maximum Likelihood distribution defined below.

Definition 5 (Normalized Maximum Likelihood (NML)).

The NML distribution for model class M_k is:

p_{\mathrm{NML}}(x^n \mid M_k) = \frac{p(x^n \mid \hat{\theta}(x^n), M_k)}{\mathrm{COMP}(M_k)}

where \hat{\theta}(x^n) is the maximum likelihood estimator for data x^n, and the parametric complexity is the normalizing constant:

\mathrm{COMP}(M_k) = \sum_{z^n \in \mathcal{X}^n} p(z^n \mid \hat{\theta}(z^n), M_k)

The sum ranges over all possible data sequences of length n. (For continuous data, the sum becomes an integral.)

The NML distribution is remarkable: it assigns to each dataset the maximum likelihood for that dataset, but normalized so the result is a proper probability distribution. Why is this a good universal code?

Theorem 2 (Shtarkov's Minimax Optimality).

The NML distribution achieves constant regret across all data sequences:

\log_2 \frac{p(x^n \mid \hat{\theta}(x^n))}{p_{\mathrm{NML}}(x^n)} = \log_2 \mathrm{COMP}(M_k)

for all x^n. Moreover, NML is the unique distribution achieving the minimax regret:

p_{\mathrm{NML}} = \arg\min_{q} \max_{x^n} \log_2 \frac{p(x^n \mid \hat{\theta}(x^n))}{q(x^n)}

Proof.

The regret of NML is:

\log_2 \frac{p(x^n \mid \hat{\theta}(x^n))}{p_{\mathrm{NML}}(x^n)} = \log_2 \frac{p(x^n \mid \hat{\theta}(x^n)) \cdot \mathrm{COMP}}{p(x^n \mid \hat{\theta}(x^n))} = \log_2 \mathrm{COMP}

This is constant — it does not depend on x^n. For any other distribution q, the maximum regret over all x^n is at least \log_2 \mathrm{COMP}. To see this, suppose q achieves maximum regret R^* < \log_2 \mathrm{COMP}. Then for all x^n:

q(x^n) \geq p(x^n \mid \hat{\theta}(x^n)) \cdot 2^{-R^*}

Summing over all x^n: 1 = \sum_{x^n} q(x^n) \geq 2^{-R^*} \cdot \mathrm{COMP} > 1, a contradiction. Therefore NML achieves the minimax regret, and this regret equals \log_2 \mathrm{COMP}.
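For the Bernoulli model the sum over sequences collapses to a sum over the number of ones, since all sequences with k ones share the MLE k/n. A sketch that computes COMP exactly and checks that NML is a proper distribution (Python's convention 0**0 == 1 handles the boundary MLEs):

```python
import math

def bernoulli_comp(n):
    """Exact parametric complexity of the Bernoulli model at sample size n."""
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n
        max_lik = p_hat**k * (1 - p_hat)**(n - k)  # maximized likelihood
        total += math.comb(n, k) * max_lik          # count of sequences with k ones
    return total

n = 20
comp = bernoulli_comp(n)
# NML assigns each sequence its maximized likelihood divided by COMP;
# summing these probabilities over all sequences must give exactly 1.
total_prob = sum(math.comb(n, k) * (k / n)**k * (1 - k / n)**(n - k) / comp
                 for k in range(n + 1))
print(round(total_prob, 10), round(math.log2(comp), 3))
```

The regret of any sequence under this code is log2 COMP by construction — the flat purple line in the visualization below.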

The interactive visualization below shows NML in action for the Bernoulli model. Notice that the NML regret (purple line) is perfectly flat — constant across all possible numbers of successes. The comparison codes (Bayes, plug-in) achieve lower regret for some outcomes but much higher regret for others.

Three-panel figure showing NML vs Bernoulli distributions, parametric complexity growth, and minimax regret


Fisher Information & Asymptotic Complexity

Computing \mathrm{COMP} exactly requires summing over all possible datasets — exponential in n. Rissanen’s asymptotic expansion reveals the geometric structure hiding inside this sum.

Theorem 3 (Asymptotic Parametric Complexity (Rissanen 1996)).

For a k-dimensional parametric model class M_k with parameter space \Theta_k and Fisher information matrix I(\theta), as n \to \infty:

\log \mathrm{COMP}(M_k) = \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta_k} \sqrt{\det I(\theta)} \, d\theta + o(1)

This expansion has a beautiful geometric interpretation:

  • The \frac{k}{2} \log \frac{n}{2\pi} term is the dominant contribution — it grows with both the number of parameters k and the sample size n. This is essentially the BIC penalty.
  • The integral \int \sqrt{\det I(\theta)} \, d\theta is the volume of the model manifold in the Fisher information metric. Models with larger parameter spaces (in the Fisher metric sense) receive a larger complexity penalty. This is the term that distinguishes refined MDL from BIC.
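For the Bernoulli model every ingredient of the expansion is available in closed form: k = 1, I(\theta) = 1/(\theta(1-\theta)), and \int_0^1 \sqrt{I(\theta)} \, d\theta = \pi. A sketch comparing the exact parametric complexity against the asymptotic formula:

```python
import math

def bernoulli_comp(n):
    # Exact COMP: sum of maximized likelihoods over all sequences (0**0 == 1)
    return sum(math.comb(n, k) * (k / n)**k * (1 - k / n)**(n - k)
               for k in range(n + 1))

n = 1000
exact = math.log2(bernoulli_comp(n))
# (k/2) log2(n / 2π) + log2 ∫ sqrt(det I) dθ, with k = 1 and integral = π
approx = 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)
print(round(exact, 3), round(approx, 3))  # the o(1) gap is already small
```

Even at moderate sample sizes the two agree to a small fraction of a bit, which is why the asymptotic expansion is the workhorse for refined MDL in practice.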

Corollary 1 (Stochastic Complexity ≈ BIC + Geometry).

The stochastic complexity can be approximated as:

\mathrm{SC}(x^n; M_k) \approx -\log_2 p(x^n \mid \hat{\theta}) + \frac{k}{2} \log_2 \frac{n}{2\pi} + \log_2 \int_{\Theta_k} \sqrt{\det I(\theta)} \, d\theta

The first two terms match the BIC (up to additive constants); the third term is the geometric correction that BIC misses.

Proposition 2 (BIC as Crude MDL Approximation).

The Bayesian Information Criterion:

\mathrm{BIC} = -2 \log L(\hat{\theta}) + k \log n

is a rough approximation to twice the stochastic complexity, dropping the Fisher information integral and using \log n in place of \log(n / 2\pi). BIC is consistent (selects the true model as n \to \infty) but generally less accurate than refined MDL for finite samples.

Remark (BIC as Practical MDL).

When to use BIC vs. refined MDL: BIC is appropriate for quick model comparison when exact Fisher information computation is infeasible, or when the models being compared have the same functional form (the Fisher integral cancels in the difference). Refined MDL (stochastic complexity) is preferable when comparing models of different types — e.g., Gaussian vs. Poisson — where the Fisher volumes differ.

The visualization below compares all three criteria. Note that MDL and BIC agree on the selected model degree most of the time, but AIC tends to overfit — it selects higher-degree models because its penalty (2k) doesn’t grow with sample size.

Three-panel figure showing BIC vs AIC vs MDL, parametric complexity growth, and BIC approximation quality


MDL and Bayesian Model Selection

MDL and Bayesian model selection appear to come from completely different philosophical traditions — coding theory vs. probabilistic inference. Yet they converge asymptotically in a surprising way.

Theorem 4 (Bayesian Marginal Likelihood and Stochastic Complexity).

Under Jeffreys prior \pi_J(\theta) \propto \sqrt{\det I(\theta)}, the log marginal likelihood satisfies:

-\log \int p(x^n \mid \theta) \pi_J(\theta) \, d\theta = \mathrm{SC}(x^n; M_k) + o(1)

as n \to \infty. That is, the stochastic complexity and the negative log marginal likelihood under Jeffreys prior are asymptotically equivalent.

Proof.

By Laplace approximation, the marginal likelihood concentrates around the MLE:

\int p(x^n \mid \theta) \pi_J(\theta) \, d\theta \approx p(x^n \mid \hat{\theta}) \cdot \pi_J(\hat{\theta}) \cdot (2\pi)^{k/2} \cdot |\mathcal{I}_n(\hat{\theta})|^{-1/2}

where \mathcal{I}_n(\hat{\theta}) = n \cdot I(\hat{\theta}) is the observed Fisher information. Taking the negative logarithm:

-\log \int p(x^n \mid \theta) \pi_J(\theta) \, d\theta \approx -\log p(x^n \mid \hat{\theta}) + \frac{k}{2} \log \frac{n}{2\pi} + \log \frac{\int \sqrt{\det I(\theta)} \, d\theta}{\sqrt{\det I(\hat{\theta})}} + \frac{1}{2} \log \det I(\hat{\theta})

Collecting terms and using \pi_J(\hat{\theta}) = \sqrt{\det I(\hat{\theta})} / \int \sqrt{\det I(\theta)} \, d\theta, we recover the stochastic complexity up to o(1) terms.

This equivalence explains a deep fact: Jeffreys prior is not just a convenient default — it is the unique prior that makes Bayesian model selection asymptotically equivalent to MDL. Despite their different foundations (MDL is frequentist and prior-free; Bayesian inference requires a prior), the two frameworks agree in the large-sample limit when the Bayesian uses the “right” prior.

The philosophical differences remain important:

  • MDL treats the code length as fundamental — models are evaluated by their compression efficiency. No prior is needed; the parametric complexity arises from the structure of the model class itself.
  • Bayesian model selection treats the posterior probability as fundamental — models are evaluated by how well they predicted the data, averaged over the prior. The prior encodes genuine beliefs about which parameter values are more plausible.
  • When the prior is misspecified (i.e., not Jeffreys), the two methods can disagree — and MDL is often more robust because it doesn’t rely on prior correctness.

Three-panel figure showing MDL vs Bayesian selection, marginal likelihood curves, and model selection agreement


Prequential Codes & Online MDL

There’s a practical problem with NML: computing \mathrm{COMP} requires summing over all possible datasets of length n. For many model classes, this sum is intractable. Prequential (predictive) MDL bypasses this entirely by processing data sequentially.

Definition 6 (Prequential Code).

The prequential code length for a data sequence x^n = (x_1, x_2, \ldots, x_n) under a sequential prediction strategy q is:

L_{\mathrm{pre}}(x^n) = -\sum_{i=1}^{n} \log_2 q(x_i \mid x^{i-1})

where q(x_i \mid x^{i-1}) is the predictive probability assigned to x_i given the first i-1 observations.

The prequential code is always a valid code (Kraft inequality is satisfied by the chain rule of probability), and it’s always computable — no normalization over all possible datasets is needed.

A natural prediction strategy is the prequential plug-in code: use a parameter estimate fitted on the past data to predict the next observation, with smoothing to keep early predictions away from zero probability. For the Bernoulli model with Laplace smoothing:

q(x_i \mid x^{i-1}) = \frac{k_{i-1} + 1}{i - 1 + 2} \cdot \mathbf{1}[x_i = 1] + \frac{i - 1 - k_{i-1} + 1}{i - 1 + 2} \cdot \mathbf{1}[x_i = 0]

where k_{i-1} is the number of ones in the first i-1 observations.
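A sketch of the resulting code length. A useful sanity check: the Laplace predictor is exchangeable — it is the Bayes mixture under a uniform prior — so the total code length depends only on the number of ones k, and it telescopes to log2((n+1) · C(n, k)):

```python
import math

def prequential_bits(xs):
    """Code length (bits) of a binary sequence under the Laplace predictor."""
    bits, ones = 0.0, 0
    for i, x in enumerate(xs):
        p_one = (ones + 1) / (i + 2)   # rule of succession: (k + 1) / (i + 2)
        bits -= math.log2(p_one if x == 1 else 1 - p_one)
        ones += x
    return bits

xs = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]    # n = 10, k = 4
n, k = len(xs), sum(xs)
closed_form = math.log2((n + 1) * math.comb(n, k))
print(round(prequential_bits(xs), 6), round(closed_form, 6))
```

No normalization over all 2^n sequences is ever needed — each prediction uses only the running count of ones.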

Theorem 5 (Prequential Code and Stochastic Complexity).

The prequential plug-in code achieves:

L_{\mathrm{pre}}(x^n) = \mathrm{SC}(x^n; M_k) + o(\log n)

as n \to \infty. That is, the prequential code converges to the stochastic complexity, with the same leading-order complexity penalty.

Proposition 3 (Consistency of MDL).

Let M_1 \subset M_2 \subset \cdots be a nested sequence of model classes with M_{k_0} being the smallest class containing the true distribution. Both NML-based and prequential MDL select M_{k_0} with probability tending to 1 as n \to \infty, provided the parametric complexity grows as \frac{k}{2} \log n + O(1).

Proof.

For any model M_k with k > k_0, the difference in stochastic complexities satisfies:

\mathrm{SC}(x^n; M_k) - \mathrm{SC}(x^n; M_{k_0}) = \frac{k - k_0}{2} \log n + O(1) - [\log L_k(\hat{\theta}_k) - \log L_{k_0}(\hat{\theta}_{k_0})]

The log-likelihood difference is O_P(1) (bounded in probability by Wilks’ theorem), while the complexity penalty grows as \frac{k - k_0}{2} \log n \to \infty. Therefore overly complex models are eventually penalized out. For k < k_0, the log-likelihood penalty grows linearly in n (model misspecification), outpacing any complexity savings. The true model minimizes stochastic complexity asymptotically.

Proposition 4 (Prequential Code via Bayes Mixture).

The prequential code using a Bayesian predictive distribution with prior π\pi:

q_\pi(x_i \mid x^{i-1}) = \frac{\int p(x^i \mid \theta) \pi(\theta) \, d\theta}{\int p(x^{i-1} \mid \theta) \pi(\theta) \, d\theta}

yields a prequential code length equal to the negative log marginal likelihood:

L_{\mathrm{pre}}(x^n) = -\log_2 \int p(x^n \mid \theta) \pi(\theta) \, d\theta

This is a consequence of the chain rule: p(x^n) = \prod_{i=1}^{n} p(x_i \mid x^{i-1}).

The visualization below shows prequential MDL in action for change-point detection. Two models compete: a single Bernoulli (one parameter for the whole sequence) and a two-piece Bernoulli (different parameters before and after a change point). Watch how the two-piece model initially incurs higher overhead but eventually wins when the distributional shift becomes evident.


Three-panel figure showing prequential code comparison, online model selection, and sequential adaptation
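The mechanics of the comparison can be sketched directly. Here the change point is assumed known at i = 100 for simplicity (the probabilities 0.2 and 0.8 are illustrative synthetic choices):

```python
import math
import random

random.seed(42)
seq = [int(random.random() < 0.2) for _ in range(100)] \
    + [int(random.random() < 0.8) for _ in range(100)]   # shift at i = 100

def prequential_bits(xs):
    # Laplace-smoothed plug-in prequential code for a Bernoulli sequence
    bits, ones = 0.0, 0
    for i, x in enumerate(xs):
        p_one = (ones + 1) / (i + 2)
        bits -= math.log2(p_one if x == 1 else 1 - p_one)
        ones += x
    return bits

one_piece = prequential_bits(seq)                          # single Bernoulli
two_piece = prequential_bits(seq[:100]) + prequential_bits(seq[100:])
print(round(one_piece, 1), round(two_piece, 1))
```

The two-piece code pays the overhead of restarting its counts, but once the shift arrives its predictions adapt while the one-piece model keeps paying for a parameter stuck near 0.5 — so the two-piece total comes out shorter.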


MDL for Practical Model Selection

How does MDL translate to the everyday tasks of machine learning: choosing features, selecting the number of components, detecting overfitting?

Proposition 5 (MDL for Linear Regression).

For linear regression with k features, n observations, and Gaussian errors, the refined MDL criterion reduces to:

\mathrm{MDL}(M_k) = \frac{n}{2} \log(2\pi e \hat{\sigma}_k^2) + \frac{k}{2} \log \frac{n}{2\pi} + c_k

where \hat{\sigma}_k^2 = \text{RSS}_k / n and c_k depends on the Fisher information geometry. When c_k is ignored (equal across models), this reduces to BIC up to a constant.

Feature selection via MDL. In forward feature selection, we greedily add the feature that reduces the total MDL criterion the most. The MDL penalty automatically trades off the improvement in fit against the cost of an additional parameter. Unlike cross-validation, MDL requires no held-out data.
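A minimal sketch of this greedy loop on synthetic data, using BIC as the MDL score (the data-generating coefficients and the true support {0, 3} are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0.0, 1.0, n)  # true support {0, 3}

def bic_score(features):
    """BIC of the least-squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in features])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    k = len(features) + 1
    return n * np.log(rss / n) + k * np.log(n)

selected = []
remaining = list(range(p))
while remaining:
    best_j = min(remaining, key=lambda j: bic_score(selected + [j]))
    if bic_score(selected + [best_j]) >= bic_score(selected):
        break  # no candidate pays for its own description length
    selected.append(best_j)
    remaining.remove(best_j)

print("selected features:", sorted(selected))
```

The loop stops as soon as the best remaining feature fails to reduce the score — the k log n penalty is doing the work that a held-out validation set would otherwise do.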

Theorem 6 (MDL Consistent Model Selection).

Let the true model have k_0 nonzero features. Forward selection using the MDL (or BIC) criterion selects k_0 features with probability tending to 1 as n \to \infty, provided:

  1. The penalty grows as \frac{k}{2} \log n (satisfied by both BIC and MDL).
  2. The penalty grows slower than n (ensuring we don’t underfit).

AIC is not consistent: its fixed penalty 2k allows overfitting to persist even as n \to \infty.

Proof.

The proof follows the same logic as Proposition 3. For any model with k > k_0 features, the likelihood improvement from the k - k_0 spurious features is O_P(\chi^2_{k-k_0}) (by Wilks’ theorem), while the penalty cost is \frac{k-k_0}{2} \log n \to \infty. So spurious features are eventually rejected. For AIC, the penalty is k - k_0 (constant), which does not dominate the O_P(1) likelihood improvement, so AIC selects too many features with non-vanishing probability.

Three-panel figure showing MDL feature selection, distribution family selection, and MDL tracking test error


Computational Notes

MDL in Practice: BIC as a Proxy

For most practical applications, BIC is a reasonable proxy for MDL:

\mathrm{BIC} = -2 \log L(\hat{\theta}) + k \log n

BIC is fast to compute, consistent, and readily available in every statistics package. Use it when:

  • Comparing models of the same type (the Fisher volume term cancels)
  • The sample size n is large enough that asymptotic approximations hold
  • You don’t need the geometric correction from the Fisher information

Remark (MDL and Regularization).

The MDL penalty on model complexity has a natural connection to L_0 regularization — penalizing the number of nonzero parameters. The refined MDL score -\log L(\hat{\theta}) + \frac{k}{2} \log n assigns a cost of roughly \frac{1}{2} \log n bits per parameter. In the regression setting, this is equivalent to an L_0 penalty with strength \lambda = \frac{1}{2} \log n. Unlike L_1 or L_2 regularization (which shrink parameters toward zero), the MDL/L_0 approach performs hard selection: each feature is either in the model or out.

Remark (MDL and Neural Network Compression).

MDL provides an information-theoretic framework for understanding neural network compression. The total description length of a network is L(\text{architecture}) + L(\text{weights} \mid \text{architecture}) + L(\text{data} \mid \text{network}). Pruning (removing weights), quantization (reducing precision), and knowledge distillation (compressing into a smaller network) can all be understood as reducing the first two terms. The lottery ticket hypothesis — that sparse subnetworks can match the performance of the full network — is a statement about MDL: the compressed description (sparse architecture + remaining weights) has lower total code length than the full network. This connects modern deep learning compression to Rissanen’s 1978 insight that the best model is the one with the shortest total description.

Summary Comparison

| Criterion | Formula | Penalty | Consistent? | Requires |
| --- | --- | --- | --- | --- |
| AIC | -2 \log L + 2k | Fixed (2k) | No | Only \hat{\theta}, k |
| BIC | -2 \log L + k \log n | O(\log n) | Yes | Only \hat{\theta}, k, n |
| Refined MDL | -2\log L + k\log\frac{n}{2\pi} + 2\log \int\sqrt{\det I(\theta)} \, d\theta | O(\log n) + geometry | Yes | Fisher information |
| NML (exact) | -\log p_{\mathrm{NML}} | Minimax optimal | Yes | COMP enumeration |
| Prequential MDL | -\sum \log q(x_i \mid x^{i-1}) | Adaptive | Yes | Sequential predictions |

Connections & Further Reading

Connection Map

| Topic | Relationship |
| --- | --- |
| Shannon Entropy & Mutual Information | MDL formalizes source coding for model selection: the shortest description length equals the entropy H(P) under the true model. Shannon’s source coding theorem underpins the coding interpretation of MDL. |
| KL Divergence & f-Divergences | The regret of a universal code is bounded by D_{\mathrm{KL}}(p_{\theta^*} \Vert p_{\mathrm{NML}}). The stochastic complexity decomposes into the maximized log-likelihood plus the KL divergence from the NML distribution to the MLE model. |
| Rate-Distortion Theory | Rate-distortion theory and MDL are dual perspectives on compression: rate-distortion minimizes bits for a fidelity constraint, while MDL uses total description length for model selection. The information bottleneck connects both viewpoints. |
| Convex Analysis | The NML optimization is a minimax problem solved by convex duality. The parametric complexity integral involves the Fisher information geometry, connecting to the convex structure of exponential families. |
| Information Geometry & Fisher Metric | The asymptotic parametric complexity \log \mathrm{COMP} involves \int \sqrt{\det I(\theta)} \, d\theta — the volume of the model manifold in the Fisher metric. MDL model complexity has a geometric interpretation as the “size” of the model in information space. |
| PAC Learning & Computational Learning Theory (coming soon) | MDL and PAC learning provide complementary perspectives on generalization: MDL penalizes model complexity via description length, while PAC bounds generalization error via VC dimension. Occam’s razor lemma connects both: short descriptions imply small generalization error. |

Notation Reference

| Symbol | Meaning |
| --- | --- |
| L(M) | Code length for model M (complexity) |
| L(D \mid M) | Code length for data D given model M (goodness-of-fit) |
| L^*(k) | Rissanen’s universal code for positive integer k |
| K(x) | Kolmogorov complexity of string x |
| \mathrm{SC}(x^n; M_k) | Stochastic complexity of x^n under model M_k |
| p_{\mathrm{NML}} | Normalized Maximum Likelihood distribution |
| \mathrm{COMP}(M_k) | Parametric complexity: \sum_{z^n} p(z^n \mid \hat{\theta}(z^n)) |
| I(\theta) | Fisher information matrix at parameter \theta |
| \hat{\theta}(x^n) | Maximum likelihood estimator for data x^n |
| L_{\mathrm{pre}}(x^n) | Prequential (predictive) code length |
| \mathrm{BIC} | Bayesian Information Criterion: -2 \log L + k \log n |


References & Further Reading

  • Grünwald, P. (2007). The Minimum Description Length Principle. The definitive textbook on MDL — covers two-part codes, refined MDL, NML, prequential codes, and applications.
  • Rissanen, J. (2007). Information and Complexity in Statistical Modeling. Rissanen’s own monograph on stochastic complexity and refined MDL.
  • Rissanen, J. (1978). “Modeling by Shortest Data Description.” The foundational paper introducing the MDL principle.
  • Barron, A., Rissanen, J. & Yu, B. (1998). “The Minimum Description Length Principle in Coding and Modeling.” Comprehensive treatment of MDL, NML, and connections to Bayesian inference.
  • Grünwald, P. & Roos, T. (2020). “Minimum Description Length Revisited.” Modern survey of MDL with connections to deep learning and neural compression.
  • Li, M. & Vitányi, P. (2019). An Introduction to Kolmogorov Complexity and Its Applications. Standard reference on algorithmic information theory — Kolmogorov complexity and its connections to MDL.