formalML

The mathematical machinery behind modern machine learning

Deep-dive explainers combining rigorous mathematics, interactive visualizations, and working code. Built for practitioners, graduate students, and researchers.

Latest Topics

advanced information-theory

Minimum Description Length

Model selection as data compression — the shortest total description identifies the best model

The Minimum Description Length principle formalizes Occam's Razor through coding theory: the best model is the one that provides the shortest total description of the data, balancing model complexity L(M) against goodness-of-fit L(D|M). We develop MDL from Rissanen's two-part codes through refined MDL, where the Normalized Maximum Likelihood (NML) distribution achieves minimax optimal regret, and the parametric complexity COMP(M_k) provides a principled, prior-free measure of model complexity. The asymptotic expansion of COMP reveals the geometric structure of model complexity: the (k/2) log n term (recovering BIC) captures dimensionality, while the integral of the Fisher information volume element measures the effective size of the model manifold. Prequential (predictive) MDL provides a computationally tractable alternative that processes data sequentially and asymptotically converges to the same complexity penalty. MDL connects to Bayesian model selection through the equivalence of stochastic complexity and the marginal likelihood under Jeffreys prior, and to algorithmic information theory through the interpretation of code lengths as computable upper bounds on Kolmogorov complexity. Applications to machine learning include BIC-based model selection, sparse feature selection via description-length penalization, and the information-theoretic interpretation of neural network compression.
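A small numerical sketch of the complexity expansion above (the function names and the choice n = 1000 are illustrative, not from the text): for the one-parameter Bernoulli model, COMP can be summed exactly over sequences grouped by their count of ones, and compared against the asymptotic expansion (k/2) log(n/2π) + log ∫ √I(θ) dθ, where k = 1 and the Fisher information volume is ∫₀¹ dθ/√(θ(1−θ)) = π.

```python
import math

def bernoulli_parametric_complexity(n):
    """Exact COMP for the Bernoulli model: log of the sum over all x^n of
    the maximized likelihood, grouping sequences by their count k of ones."""
    terms = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_ml = 0.0  # maximized log-likelihood; equals 0 at k = 0 or k = n
        if 0 < k < n:
            log_ml = k * math.log(k / n) + (n - k) * math.log(1 - k / n)
        terms.append(log_binom + log_ml)
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def bernoulli_complexity_asymptotic(n):
    """Rissanen's expansion (k/2) log(n / 2pi) + log ∫ sqrt(I(θ)) dθ,
    with k = 1 and Fisher volume ∫_0^1 dθ / sqrt(θ(1-θ)) = π."""
    return 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)

n = 1000
print(bernoulli_parametric_complexity(n))   # ≈ 3.70 nats
print(bernoulli_complexity_asymptotic(n))   # ≈ 3.68 nats
```

The two values agree to within a few hundredths of a nat at n = 1000, while the bare BIC term (1/2) log n ≈ 3.45 misses the constant contribution of the Fisher volume.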

2 prerequisites
intermediate information-theory

Rate-Distortion Theory

The fundamental limits of lossy compression — how many bits per symbol when we tolerate distortion?

Rate-distortion theory answers the fundamental question of lossy compression: how many bits per source symbol are necessary and sufficient when we tolerate an average distortion of at most D? The rate-distortion function R(D) — defined as the minimum mutual information I(X; X̂) over all test channels satisfying the distortion constraint — is convex, non-increasing, and equals H(X) at D = 0, recovering Shannon's lossless source coding limit. We derive closed-form solutions for the binary source with Hamming distortion (R(D) = H_b(p) - H_b(D) for 0 ≤ D ≤ min(p, 1-p)) and the Gaussian source with squared error (R(D) = (1/2) log(σ²/D) for 0 ≤ D ≤ σ²), prove Shannon's rate-distortion theorem establishing R(D) as the exact achievability boundary, and develop the Blahut–Arimoto algorithm for numerical computation via alternating minimization. The information bottleneck method extends rate-distortion to compression with relevance: minimizing I(X; T) while preserving I(T; Y), unifying lossy compression with representation learning. Applications to machine learning include the VAE loss as a rate-distortion Lagrangian, β-VAE as rate-distortion trade-off, and neural image compression as learned R(D)-optimal coding.
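The alternating minimization can be sketched in a few lines (the source bias p = 0.3 and multiplier β = 2.0 are illustrative choices, not from the text): fixing a Lagrange multiplier β traces out one point on the R(D) curve, which can then be checked against the binary closed form R(D) = H_b(p) − H_b(D), here in nats.

```python
import math

def h_b(x):
    """Binary entropy in nats."""
    return 0.0 if x <= 0 or x >= 1 else -x * math.log(x) - (1 - x) * math.log(1 - x)

def blahut_arimoto(p_x, dist, beta, iters=5000):
    """Alternating minimization for one point on the R(D) curve; the
    Lagrange multiplier beta fixes the curve's slope -beta at that point."""
    nx, nxh = len(p_x), len(dist[0])
    q = [1.0 / nxh] * nxh  # output marginal q(xhat), initialized uniform
    for _ in range(iters):
        # optimal test channel p(xhat | x) for the current output marginal
        cond = []
        for i in range(nx):
            w = [q[j] * math.exp(-beta * dist[i][j]) for j in range(nxh)]
            z = sum(w)
            cond.append([v / z for v in w])
        # optimal output marginal for the current test channel
        q = [sum(p_x[i] * cond[i][j] for i in range(nx)) for j in range(nxh)]
    D = sum(p_x[i] * cond[i][j] * dist[i][j]
            for i in range(nx) for j in range(nxh))
    R = sum(p_x[i] * cond[i][j] * math.log(cond[i][j] / q[j])
            for i in range(nx) for j in range(nxh) if cond[i][j] > 0)
    return R, D

p = 0.3
R, D = blahut_arimoto([p, 1 - p], [[0, 1], [1, 0]], beta=2.0)
print(R, D, h_b(p) - h_b(D))  # converged R matches H_b(p) - H_b(D)
```

At convergence the returned pair (R, D) lies on the curve, so R agrees with H_b(p) − H_b(D) evaluated at the achieved distortion.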

2 prerequisites
intermediate category-theory

Adjunctions

The free-forgetful paradigm, unit-counit pairs, and the universal optimization pattern that connects algebra, logic, and machine learning

An adjunction is a pair of functors F: C → D and G: D → C that are optimally related: a natural isomorphism Hom(F(A), B) ≅ Hom(A, G(B)) establishes a bijection between morphisms in D out of F(A) and morphisms in C into G(B). Equivalently, an adjunction consists of natural transformations η: Id → GF (the unit) and ε: FG → Id (the counit) satisfying the triangle identities — the zig-zag equations that ensure the two round-trip compositions collapse to identities. The free-forgetful adjunction between Set and Vec is the prototypical example: a linear map from the free vector space F(S) is determined by where the basis elements go, which is just a function from S. This free construction pattern — where a left adjoint builds structure freely and a right adjoint forgets it — pervades algebra (free groups, free modules), topology (discrete ⊣ forgetful ⊣ indiscrete), and logic (existential ⊣ substitution ⊣ universal). Galois connections — adjunctions between posets — give the floor-ceiling connection between the reals and integers, the closure-interior duality in topology, and the classical Galois correspondence between subfields and subgroups. The RAPL theorem (right adjoints preserve limits) is one of the most useful results in category theory, explaining why forgetful functors preserve products and why free functors preserve coproducts. Every adjunction generates a monad, connecting this topic to the capstone. For machine learning, Lagrangian duality is a Galois connection between primal and dual optimization problems, the encoder-decoder paradigm is an adjunction where the unit measures reconstruction error, and the tensor-hom adjunction underlies the currying that makes attention mechanisms work.
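The free-construction pattern can be sketched in code. As a stand-in for the Set/Vec example (which would need linear algebra), this uses the free monoid: lists over S with concatenation, where the unit η embeds a generator as a one-element list, and the adjunction bijection extends any function f: S → U(M) to the unique monoid homomorphism f*: List(S) → M with f* ∘ η = f. All names here are illustrative.

```python
from functools import reduce

# Free monoid on a set S: lists over S, with concatenation as the operation.
def eta(s):
    """Unit of the free-forgetful adjunction: embed a generator as [s]."""
    return [s]

def extend(f, combine, identity):
    """The bijection Hom_Set(S, U(M)) -> Hom_Mon(List(S), M): a monoid
    homomorphism out of the free monoid is determined by f on generators."""
    return lambda xs: reduce(combine, (f(s) for s in xs), identity)

# Target monoid M = (int, *, 1); f sends a string to its length.
f = len
f_star = extend(f, lambda a, b: a * b, 1)

print(f_star(["ab", "xyz", "q"]))        # 2 * 3 * 1 = 6
print(f_star(eta("abcd")) == f("abcd"))  # round trip f* . eta = f: True

# The tensor-hom (product-exponential) adjunction Hom(A x B, C) ≅ Hom(A, C^B)
# is currying; the two directions of the bijection are mutually inverse.
def curry(g):
    return lambda a: lambda b: g(a, b)

def uncurry(h):
    return lambda a, b: h(a)(b)

print(uncurry(curry(pow))(2, 10))        # 1024
```

The round trip f* ∘ η = f is the element-level content of one triangle identity: extending f and then restricting to generators gives back f.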

1 prerequisite