formalML

The mathematical machinery behind modern machine learning

Deep-dive explainers combining rigorous mathematics, interactive visualizations, and working code. Built for practitioners, graduate students, and researchers.

Latest Topics

advanced information-theory

Minimum Description Length

Model selection as data compression — the shortest total description identifies the best model

The Minimum Description Length principle formalizes Occam's Razor through coding theory: the best model is the one that provides the shortest total description of the data, balancing model complexity L(M) against goodness-of-fit L(D|M). We develop MDL from Rissanen's two-part codes through refined MDL, where the Normalized Maximum Likelihood (NML) distribution achieves minimax optimal regret, and the parametric complexity COMP(M_k) provides a principled, prior-free measure of model complexity. The asymptotic expansion of COMP reveals the geometric structure of model complexity: the (k/2) log n term (recovering BIC) captures dimensionality, while the integral of the Fisher information volume element measures the effective size of the model manifold. Prequential (predictive) MDL provides a computationally tractable alternative that processes data sequentially and asymptotically converges to the same complexity penalty. MDL connects to Bayesian model selection through the equivalence of stochastic complexity and the marginal likelihood under Jeffreys prior, and to algorithmic information theory through the interpretation of code lengths as computable upper bounds on Kolmogorov complexity. Applications to machine learning include BIC-based model selection, sparse feature selection via description-length penalization, and the information-theoretic interpretation of neural network compression.
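A small numerical sketch of the complexity expansion above (the function names and the choice n = 1000 are illustrative, not from the text): for the one-parameter Bernoulli model, COMP can be summed exactly over sequences grouped by their count of ones, and compared against the asymptotic expansion (k/2) log(n/2π) + log ∫ √I(θ) dθ, where k = 1 and the Fisher information volume is ∫₀¹ dθ/√(θ(1−θ)) = π.

```python
import math

def bernoulli_parametric_complexity(n):
    """Exact COMP for the Bernoulli model: log of the sum over all x^n of
    the maximized likelihood, grouping sequences by their count k of ones."""
    terms = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_ml = 0.0  # maximized log-likelihood; equals 0 at k = 0 or k = n
        if 0 < k < n:
            log_ml = k * math.log(k / n) + (n - k) * math.log(1 - k / n)
        terms.append(log_binom + log_ml)
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def bernoulli_complexity_asymptotic(n):
    """Rissanen's expansion (k/2) log(n / 2pi) + log ∫ sqrt(I(θ)) dθ,
    with k = 1 and Fisher volume ∫_0^1 dθ / sqrt(θ(1-θ)) = π."""
    return 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)

n = 1000
print(bernoulli_parametric_complexity(n))   # ≈ 3.70 nats
print(bernoulli_complexity_asymptotic(n))   # ≈ 3.68 nats
```

The two values agree to within a few hundredths of a nat at n = 1000, while the bare BIC term (1/2) log n ≈ 3.45 misses the constant contribution of the Fisher volume.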

2 prerequisites
intermediate information-theory

Rate-Distortion Theory

The fundamental limits of lossy compression — how many bits per symbol when we tolerate distortion?

Rate-distortion theory answers the fundamental question of lossy compression: how many bits per source symbol are necessary and sufficient when we tolerate an average distortion of at most D? The rate-distortion function R(D) — defined as the minimum mutual information I(X; X̂) over all test channels satisfying the distortion constraint — is convex, non-increasing, and equals H(X) at D = 0, recovering Shannon's lossless source coding limit. We derive closed-form solutions for the binary source with Hamming distortion (R(D) = H_b(p) - H_b(D) for 0 ≤ D ≤ min(p, 1-p)) and the Gaussian source with squared error (R(D) = (1/2) log(σ²/D) for 0 ≤ D ≤ σ²), prove Shannon's rate-distortion theorem establishing R(D) as the exact achievability boundary, and develop the Blahut–Arimoto algorithm for numerical computation via alternating minimization. The information bottleneck method extends rate-distortion to compression with relevance: minimizing I(X; T) while preserving I(T; Y), unifying lossy compression with representation learning. Applications to machine learning include the VAE loss as a rate-distortion Lagrangian, β-VAE as rate-distortion trade-off, and neural image compression as learned R(D)-optimal coding.
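The alternating minimization can be sketched in a few lines (the source bias p = 0.3 and multiplier β = 2.0 are illustrative choices, not from the text): fixing a Lagrange multiplier β traces out one point on the R(D) curve, which can then be checked against the binary closed form R(D) = H_b(p) − H_b(D), here in nats.

```python
import math

def h_b(x):
    """Binary entropy in nats."""
    return 0.0 if x <= 0 or x >= 1 else -x * math.log(x) - (1 - x) * math.log(1 - x)

def blahut_arimoto(p_x, dist, beta, iters=5000):
    """Alternating minimization for one point on the R(D) curve; the
    Lagrange multiplier beta fixes the curve's slope -beta at that point."""
    nx, nxh = len(p_x), len(dist[0])
    q = [1.0 / nxh] * nxh  # output marginal q(xhat), initialized uniform
    for _ in range(iters):
        # optimal test channel p(xhat | x) for the current output marginal
        cond = []
        for i in range(nx):
            w = [q[j] * math.exp(-beta * dist[i][j]) for j in range(nxh)]
            z = sum(w)
            cond.append([v / z for v in w])
        # optimal output marginal for the current test channel
        q = [sum(p_x[i] * cond[i][j] for i in range(nx)) for j in range(nxh)]
    D = sum(p_x[i] * cond[i][j] * dist[i][j]
            for i in range(nx) for j in range(nxh))
    R = sum(p_x[i] * cond[i][j] * math.log(cond[i][j] / q[j])
            for i in range(nx) for j in range(nxh) if cond[i][j] > 0)
    return R, D

p = 0.3
R, D = blahut_arimoto([p, 1 - p], [[0, 1], [1, 0]], beta=2.0)
print(R, D, h_b(p) - h_b(D))  # converged R matches H_b(p) - H_b(D)
```

At convergence the returned pair (R, D) lies on the curve, so R agrees with H_b(p) − H_b(D) evaluated at the achieved distortion.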

2 prerequisites
intermediate category-theory

Adjunctions

The free-forgetful paradigm, unit-counit pairs, and the universal optimization pattern that connects algebra, logic, and machine learning

An adjunction is a pair of functors F: C → D and G: D → C that are optimally related: a natural isomorphism Hom(F(A), B) ≅ Hom(A, G(B)) establishes a bijection between morphisms in D out of F(A) and morphisms in C into G(B). Equivalently, an adjunction consists of natural transformations η: Id → GF (the unit) and ε: FG → Id (the counit) satisfying the triangle identities — the zig-zag equations that ensure the two round-trip compositions collapse to identities. The free-forgetful adjunction between Set and Vec is the prototypical example: a linear map from the free vector space F(S) is determined by where the basis elements go, which is just a function from S. This free construction pattern — where a left adjoint builds structure freely and a right adjoint forgets it — pervades algebra (free groups, free modules), topology (discrete ⊣ forgetful ⊣ indiscrete), and logic (existential ⊣ substitution ⊣ universal). Galois connections — adjunctions between posets — give the floor-ceiling connection between the reals and integers, the closure-interior duality in topology, and the classical Galois correspondence between subfields and subgroups. The RAPL theorem (right adjoints preserve limits) is one of the most useful results in category theory, explaining why forgetful functors preserve products and why free functors preserve coproducts. Every adjunction generates a monad, connecting this topic to the capstone. For machine learning, Lagrangian duality is a Galois connection between primal and dual optimization problems, the encoder-decoder paradigm is an adjunction where the unit measures reconstruction error, and the tensor-hom adjunction underlies the currying that makes attention mechanisms work.
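The free-construction pattern can be sketched in code. As a stand-in for the Set/Vec example (which would need linear algebra), this uses the free monoid: lists over S with concatenation, where the unit η embeds a generator as a one-element list, and the adjunction bijection extends any function f: S → U(M) to the unique monoid homomorphism f*: List(S) → M with f* ∘ η = f. All names here are illustrative.

```python
from functools import reduce

# Free monoid on a set S: lists over S, with concatenation as the operation.
def eta(s):
    """Unit of the free-forgetful adjunction: embed a generator as [s]."""
    return [s]

def extend(f, combine, identity):
    """The bijection Hom_Set(S, U(M)) -> Hom_Mon(List(S), M): a monoid
    homomorphism out of the free monoid is determined by f on generators."""
    return lambda xs: reduce(combine, (f(s) for s in xs), identity)

# Target monoid M = (int, *, 1); f sends a string to its length.
f = len
f_star = extend(f, lambda a, b: a * b, 1)

print(f_star(["ab", "xyz", "q"]))        # 2 * 3 * 1 = 6
print(f_star(eta("abcd")) == f("abcd"))  # round trip f* . eta = f: True

# The tensor-hom (product-exponential) adjunction Hom(A x B, C) ≅ Hom(A, C^B)
# is currying; the two directions of the bijection are mutually inverse.
def curry(g):
    return lambda a: lambda b: g(a, b)

def uncurry(h):
    return lambda a, b: h(a)(b)

print(uncurry(curry(pow))(2, 10))        # 1024
```

The round trip f* ∘ η = f is the element-level content of one triangle identity: extending f and then restricting to generators gives back f.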

1 prerequisite