Topics

Browse all published topics in advanced ML mathematics.

intermediate category-theory

Adjunctions

The free-forgetful paradigm, unit-counit pairs, and the universal optimization pattern that connects algebra, logic, and machine learning

An adjunction is a pair of functors F and G between categories C and D that are optimally related: a natural isomorphism Hom(F(A), B) ≅ Hom(A, G(B)) establishes a bijection between morphisms in D out of F(A) and morphisms in C into G(B). Equivalently, an adjunction consists of natural transformations η: Id → GF (the unit) and ε: FG → Id (the counit) satisfying the triangle identities — the zig-zag equations that ensure the two round-trip compositions collapse to identities. The free-forgetful adjunction between Set and Vec is the prototypical example: a linear map from the free vector space F(S) is determined by where the basis elements go, which is just a function from S. This free construction pattern — where a left adjoint builds structure freely and a right adjoint forgets it — pervades algebra (free groups, free modules), topology (discrete ⊣ forgetful ⊣ indiscrete), and logic (existential ⊣ substitution ⊣ universal). Galois connections — adjunctions between posets — give the floor-ceiling connection between the reals and integers, the closure-interior duality in topology, and the classical Galois correspondence between subfields and subgroups. The RAPL theorem (right adjoints preserve limits) is one of the most useful results in category theory, explaining why forgetful functors preserve products and why free functors preserve coproducts. Every adjunction generates a monad, connecting this topic to the capstone. For machine learning, Lagrangian duality is a Galois connection between primal and dual optimization problems, the encoder-decoder paradigm is an adjunction where the unit measures reconstruction error, and the tensor-hom adjunction underlies the currying that makes attention mechanisms work.

1 prerequisite
intermediate topology

Barcodes & Bottleneck Distance

Comparing persistence diagrams — the metrics that make TDA a rigorous statistical tool

Persistence barcodes decompose a persistence module into interval summands, giving a complete invariant of the multiscale topology captured by a filtration. The bottleneck and Wasserstein distances turn the space of persistence diagrams into a metric space, and the Stability Theorem guarantees that small perturbations of the input produce small changes in the diagram — the theoretical foundation that makes topological data analysis a rigorous tool for real-world data.

1 prerequisite
advanced bayesian-ml

Bayesian Neural Networks

Weight-space posteriors over neural networks: Laplace approximation, MC-dropout as approximate VI, deep ensembles, stochastic-gradient MCMC (SGLD and SGHMC), calibration diagnostics, and the function-space view via NNGP and NTK

A point-estimate neural network produces a single function and confidently extrapolates that function into regions where no training data lives. Bayesian neural networks replace the single weight vector w* with a distribution p(w | D) over weights and integrate predictions against it, so predictive variance grows where the data has nothing to say. The catch is that p(w | D) is intractable in four distinct ways for any non-trivial network — no closed-form marginal likelihood, deeply non-convex log-posterior, high dimension, and full-data gradients. As the T5 flagship, this topic develops Laplace approximation (asymptotic local Gaussian), MC-dropout (Bernoulli variational family), deep ensembles (function-space samples via independent training), and stochastic-gradient MCMC (SGLD and SGHMC, the asymptotically-exact methods that scale to deep-learning regimes), then evaluates all four head-to-head on calibration metrics, then closes with the function-space view from NNGP and NTK that ties weight-space methods to Gaussian processes.

5 prerequisites
advanced probability

Bayesian Nonparametrics

From the Dirichlet process to Gaussian processes and posterior consistency

Bayesian nonparametrics places priors directly on infinite-dimensional parameter spaces, letting the effective model complexity grow with the data rather than being fixed a priori. We develop the theory from foundations: the Dirichlet process as a distribution over probability distributions with three equivalent constructive representations (stick-breaking, Chinese Restaurant Process, and Pólya urn), Dirichlet process mixture models for adaptive density estimation and clustering, Gaussian processes as nonparametric priors on function spaces with closed-form posterior inference, and the Indian Buffet Process for latent feature models. The capstone section on posterior consistency and contraction rates connects Bayesian nonparametrics to the PAC learning framework, showing that both address the same fundamental question — when and how fast does learning from data succeed — through different mechanisms: worst-case bounds versus prior-dependent average-case rates.
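
As a quick illustration of the stick-breaking representation mentioned above, a minimal NumPy sketch draws an approximate Dirichlet process sample; the truncation level, concentration parameter, and base measure are illustrative choices, not values from the topic.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_dp(alpha, base_sampler, truncation=1000):
    """Draw an (approximately) DP-distributed random measure via stick-breaking.

    Returns atom locations and weights; the truncation level controls how much
    of the unit stick is accounted for.
    """
    betas = rng.beta(1.0, alpha, size=truncation)             # Beta(1, alpha) stick fractions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                                # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = base_sampler(truncation)                           # iid draws from the base measure G0
    return atoms, weights

# Base measure G0 = N(0, 1); a small alpha concentrates mass on a few atoms.
atoms, weights = stick_breaking_dp(alpha=2.0, base_sampler=lambda n: rng.normal(size=n))
print("mass captured by the truncation:", weights.sum())
print("five largest weights:", np.sort(weights)[-5:][::-1])
```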

2 prerequisites
foundational category-theory

Categories & Functors

Objects, morphisms, and functors — the abstract language that unifies disparate ML structures through composition

A category organizes mathematical objects and the structure-preserving maps between them into a framework where composition is the fundamental operation. We develop the theory from the axioms — associativity of composition and the existence of identity morphisms — through the taxonomy of morphism types (isomorphisms, monomorphisms, epimorphisms) to functors, the structure-preserving maps between categories themselves. The gallery of examples spans the categories that ML practitioners work in daily: Set (data transformations), Vec (linear layers), Meas (probabilistic models), and Top (topological data analysis). Universal properties — the categorical way of characterizing constructions by what they do rather than what they are — give us products, coproducts, and initial and terminal objects, unifying Cartesian products of sets, direct sums of vector spaces, and meets and joins of posets under a single definition. The Hom functor reframes linear algebra: Hom(R^m, R^n) is the space of n-by-m matrices, and the covariant Hom functor captures post-composition. Endofunctors — functors from a category to itself — are the gateway to monads, the categorical structure underlying probabilistic programming and computational effects. Throughout, we emphasize that category theory is not an alternative to concrete mathematics but a language for seeing the compositional patterns that connect seemingly disparate ML structures.

2 prerequisites
advanced learning-theory

Causal Inference Methods

Identification, doubly-robust estimation, and inference under confounding — from potential outcomes and IPW through AIPW, TMLE, and DML, with instrumental variables, front-door identification, heterogeneous treatment effects, and sensitivity analysis

How do we estimate causal effects from observational data? This topic develops the modern toolkit for the binary-treatment average treatment effect under ignorability — potential outcomes, identification, inverse-probability weighting (IPW), outcome regression, augmented IPW (AIPW) with its doubly-robust property, targeted maximum likelihood (TMLE), and double / debiased machine learning (DML) — then extends to instrumental variables and front-door identification when ignorability fails, to heterogeneous treatment effects with meta-learners and causal forests, and to sensitivity analysis via E-values and Cinelli–Hazlett robustness values. The signature theorem is the doubly-robust property of AIPW: consistency for the average treatment effect whenever either the propensity model or the outcome model is correctly specified — a remarkable two-shots-on-goal guarantee no single-model estimator can match. The framework is threaded through a single Robinson partially-linear DGP from §1 onward, so that every estimator can be compared empirically on the same canonical workbench.
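
A minimal sketch of the AIPW score described above, on synthetic confounded data with scikit-learn nuisance models; the data-generating process, nuisance-model choices, and sample size are illustrative and are not the Robinson partially-linear DGP the topic uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=(n, 3))                                  # observed confounders
propensity = 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))
t = rng.binomial(1, propensity)                              # confounded binary treatment
y = 2.0 * t + x @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)   # true ATE = 2

# Nuisance fits: propensity model e(x) and outcome models m1(x), m0(x).
e_hat = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
m1 = LinearRegression().fit(x[t == 1], y[t == 1]).predict(x)
m0 = LinearRegression().fit(x[t == 0], y[t == 0]).predict(x)

# AIPW / doubly-robust score: outcome-regression term plus IPW residual correction.
psi = (m1 - m0
       + t * (y - m1) / e_hat
       - (1 - t) * (y - m0) / (1 - e_hat))
print("naive difference in means:", y[t == 1].mean() - y[t == 0].mean())
print("AIPW estimate of the ATE:", psi.mean())
```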

2 prerequisites
intermediate topology

Čech Complexes & Nerve Theorem

The geometrically exact construction for topology from point clouds — and the deep theorem that makes it work

The Čech complex builds topology from point clouds by testing whether balls centered at data points share a common intersection — a geometrically exact construction that the Nerve Theorem guarantees faithfully captures the topology of the underlying space. We give the full construction, prove the Nerve Theorem via partition-of-unity maps, and demonstrate computationally where Čech and Vietoris-Rips diverge.

2 prerequisites
intermediate probability

Concentration Inequalities

Quantitative convergence rates — from Markov's inequality to sub-Gaussian theory and generalization bounds

Concentration inequalities provide quantitative bounds on how random variables deviate from their expectations, bridging the qualitative convergence theorems of measure-theoretic probability to the sample complexity results of statistical learning theory. We develop the theory progressively: Markov and Chebyshev inequalities with polynomial rates, the Chernoff method via moment generating functions for exponential bounds, Hoeffding's inequality for bounded random variables with full proof, sub-Gaussian and sub-exponential random variable theory with maximal inequalities, Bernstein's variance-adaptive inequality, McDiarmid's bounded differences inequality for general functions of independent variables, and applications to machine learning including generalization bounds, uniform convergence via the union bound, and sample complexity for finite hypothesis classes.
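
As a quick illustration of the kind of statement these inequalities make, a short simulation (sample size, deviation threshold, and repetition count are arbitrary) compares the empirical deviation probability of a Bernoulli sample mean with the two-sided Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, t, trials = 200, 0.3, 0.1, 100_000

# Empirical frequency of |mean - p| >= t over many repetitions.
means = rng.binomial(n, p, size=trials) / n
empirical = np.mean(np.abs(means - p) >= t)

# Two-sided Hoeffding bound for [0, 1]-valued variables: 2 exp(-2 n t^2).
hoeffding = 2 * np.exp(-2 * n * t**2)

print(f"empirical P(|mean - p| >= {t}): {empirical:.4f}")
print(f"Hoeffding bound:                {hoeffding:.4f}")
```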

1 prerequisite
intermediate nonparametric-ml

Conformal Prediction

Distribution-free prediction sets with finite-sample coverage from exchangeability alone

Conformal prediction wraps any black-box predictor in a procedure that produces valid prediction intervals from a single nonparametric assumption: that the training and test data are exchangeable. The marginal coverage guarantee is exact in finite samples — no asymptotics, no distributional assumptions, no bounded-noise hypotheses. We develop the theory progressively: split conformal with a full proof of marginal coverage via rank symmetry; jackknife+ and CV+ for the leave-one-out regime with the BCRT 2021 coverage bound; conformalized quantile regression (CQR) for locally adaptive width; adaptive prediction sets (APS) for classification; and the Foygel-Barber 2021 impossibility theorem for distribution-free conditional coverage.
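
The split-conformal construction the topic opens with fits in a few lines. A minimal sketch follows; the black-box regressor (a scikit-learn random forest), the simulated data, and the miscoverage level α = 0.1 are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(x[:, 0]) + 0.3 * rng.normal(size=n)

# Split the data: fit the black-box model on one half, calibrate on the other.
fit_idx, cal_idx = np.arange(n) < n // 2, np.arange(n) >= n // 2
model = RandomForestRegressor(random_state=0).fit(x[fit_idx], y[fit_idx])

# Conformity scores on the calibration set: absolute residuals.
scores = np.abs(y[cal_idx] - model.predict(x[cal_idx]))
alpha = 0.1
n_cal = scores.size
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal, method="higher")

# Prediction interval for a new point: point prediction plus/minus the calibrated quantile.
x_new = np.array([[1.0]])
pred = model.predict(x_new)[0]
print(f"interval: [{pred - q:.3f}, {pred + q:.3f}]  (marginal coverage >= {1 - alpha})")
```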

2 prerequisites
foundational optimization

Convex Analysis

Convex sets, convex functions, subdifferentials, and separation theorems — the geometric foundation of tractable optimization

Convex analysis provides the geometric and analytic infrastructure that makes modern optimization tractable. We develop the theory from its two primitives — convex sets (closed under line segments) and convex functions (epigraph is a convex set) — through three layers of increasing power: the first-order and second-order characterizations that connect convexity to gradients and Hessians, the operations that preserve convexity and underpin disciplined convex programming, and the duality machinery of conjugate functions and subdifferentials that extends calculus to non-smooth settings. The separating and supporting hyperplane theorems provide the geometric backbone, while the fundamental result — every local minimum of a convex function is a global minimum — explains why convexity transforms optimization from intractable search into efficient computation. The Spectral Theorem enters through the second-order condition: a twice-differentiable function is convex if and only if its Hessian is positive semidefinite everywhere.

1 prerequisite
advanced learning-theory

Double Descent

The interpolation threshold, the minimum-norm interpolator, and the modern overparameterized regime

Double descent is the empirical and analytic phenomenon that test error, plotted as a function of model complexity, has two descents rather than one — a classical descent in the underparameterized regime, a spike at the interpolation threshold where the number of parameters equals the number of training points, and a second descent into the overparameterized regime that classical learning theory said was hopeless. This topic develops the linear theory of double descent in full: §§1–3 establish the empirical phenomenon and pin down what makes the interpolation threshold special. §4 introduces the minimum-norm interpolator. §§5–6 develop the Marchenko–Pastur random-matrix machinery and the Hastie–Montanari–Rosset–Tibshirani 2022 closed form that gives the analytic double-descent curve. §7 separates model-wise from sample-wise double descent and the 'more data can hurt' pathology. §8 generalizes to random-features models and the kernel limit. §9 proves that gradient descent from zero initialization is implicitly equivalent to the minimum-norm interpolator. §10 is a sidebar on deep double descent. §§11–12 introduce effective dimension and revise the classical capacity-control framework. §§13–14 close with computational notes and an honest accounting of what the linear theory predicts versus what remains open in the feature-learning regime.
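
The minimum-norm interpolator and the two-descent shape can be reproduced in miniature with the pseudoinverse; a rough sketch follows, with arbitrary dimensions and noise level, using the first p coordinates as the feature set so that p sweeps through the interpolation threshold at p = n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 400, 0.5
beta = rng.normal(size=d) / np.sqrt(d)
X_train, X_test = rng.normal(size=(n, d)), rng.normal(size=(2000, d))
y_train = X_train @ beta + sigma * rng.normal(size=n)
y_test = X_test @ beta

for p in [20, 50, 90, 100, 110, 200, 400]:
    # Use the first p features; pinv gives ordinary least squares for p < n and
    # the minimum-norm interpolator for p >= n.
    b_hat = np.linalg.pinv(X_train[:, :p]) @ y_train
    err = np.mean((X_test[:, :p] @ b_hat - y_test) ** 2)
    print(f"p = {p:4d}  test MSE = {err:.3f}")
```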

6 prerequisites
intermediate graph-theory

Expander Graphs

Sparse graphs with paradoxically strong connectivity — from the Expander Mixing Lemma to Ramanujan optimality and O(log n) mixing

Expander graphs are sparse graphs with paradoxically strong connectivity: every vertex subset has a large boundary relative to its size. Three equivalent perspectives — vertex expansion, edge expansion (the Cheeger constant), and spectral gap — capture this idea, linked by Cheeger's inequality. The Expander Mixing Lemma makes expansion quantitative: in an (n, d, λ)-expander, the number of edges between any two vertex subsets S and T deviates from the expected value d|S||T|/n by at most λ√(|S||T|), where λ is the second-largest eigenvalue in absolute value. The Alon–Boppana bound establishes 2√(d-1) as the theoretical floor for λ in d-regular graphs, and Ramanujan graphs — achieving this bound — represent optimal expanders. Explicit constructions from Cayley graphs and algebraic number theory (Lubotzky–Phillips–Sarnak) demonstrate that optimal expanders can be built deterministically. Random walks on expanders mix in O(log n) steps, enabling the expander walk sampling theorem: t steps on an expander yield nearly independent samples using only O(log n + t log d) random bits. These properties power applications from error-correcting codes and derandomization to network design and graph neural network architectures.

2 prerequisites
advanced nonparametric-ml

Extreme Value Theory

Asymptotic theory of sample maxima and tail-based inference — Fisher–Tippett–Gnedenko trichotomy, generalized Pareto distribution, peaks-over-threshold, tail-index estimation, and ML applications including tail-aware prediction intervals, OOD detection, and tail-risk quantification

The asymptotic theory of sample maxima and tail-based inference. Where the central limit theorem classifies the limit distribution of normalized sample means as Gaussian, extreme value theory classifies the limit distribution of normalized sample maxima as a three-parameter family — the generalized extreme value distribution — via the Fisher–Tippett–Gnedenko trichotomy. Every parent distribution whose normalized maxima admit a non-degenerate limit lands in one of three domains (Gumbel, Fréchet, reverse-Weibull), determined entirely by the rate at which its upper tail decays. The peaks-over-threshold framework, anchored by the Pickands–Balkema–de Haan theorem, extends the story to general tail modeling via the generalized Pareto distribution. The topic develops the asymptotic theory through full proofs of necessity-of-max-stability and the Fréchet domain-of-attraction characterization, then turns to inference: GEV maximum-likelihood and probability-weighted-moments estimators in §4, GPD MLE plus the Hill, Pickands, and Dekkers–Einmahl–de Haan tail-index estimators in §5, with closed-form delta-method standard errors and worked examples on standard-normal and Pareto data threading §§4–5. Modern ML applications thread §6: tail-aware prediction intervals as a refinement of conformal prediction for heavy-tailed residuals, out-of-distribution detection via calibrated tail probabilities of in-distribution scores, and tail-risk quantification — Value-at-Risk and Expected Shortfall — for deployed models with finite operational records.
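
A minimal sketch of the Hill tail-index estimator mentioned among the §5 estimators, applied to simulated Pareto data where the true index is known; the sample size and the grid of order-statistic counts k are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true = 2.5                                            # true Pareto tail index
x = (1 / rng.uniform(size=50_000)) ** (1 / alpha_true)      # Pareto(alpha_true) samples

def hill_estimator(sample, k):
    """Hill estimator of the tail index from the k largest order statistics."""
    order = np.sort(sample)[::-1]
    logs = np.log(order[:k]) - np.log(order[k])
    return 1.0 / logs.mean()                                # alpha-hat = 1 / gamma-hat

for k in [50, 200, 1000, 5000]:
    print(f"k = {k:5d}  Hill alpha-hat = {hill_estimator(x, k):.3f}  (true {alpha_true})")
```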

1 prerequisite
intermediate bayesian-ml

Gaussian Processes

Function-space priors over regression functions: kernel-driven posteriors, marginal-likelihood hyperparameter learning, and three responses to the O(n^3) Cholesky bottleneck

A Gaussian process is a probability distribution over functions: every finite collection of function values is jointly multivariate Normal, with covariances determined by a kernel. The function-space view it makes possible — place the prior on the function, not on the parameters — is the canonical Bayesian regression model and the second pillar of the T5 Bayesian ML track. We motivate the move from parametric Bayesian regression (whose finite-rank covariance caps the function class at the basis dimension) to kernel-defined function-space priors with no such cap; verify that positive-definite kernels yield internally consistent finite-dimensional joints via the Kolmogorov extension theorem; derive the closed-form conjugate-Gaussian regression posterior in one application of multivariate-Normal conditioning; survey the kernel zoo (squared-exponential, Matérn family, periodic, ARD) and the inductive bias each kernel encodes; turn the kernel hyperparameters from quantities we set by hand into quantities we learn from the marginal likelihood (with analytic gradients via the Jacobi formula and multi-restart L-BFGS-B); and close with the O(n³) Cholesky bottleneck and three structural responses — Nyström, sparse variational GPs (Titsias 2009), and random Fourier features. GPs are the substrate of Bayesian optimization, the infinite-width limit of Bayesian neural networks, and the conceptual scaffold for thinking about uncertainty in regression.
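
The closed-form conjugate-Gaussian posterior described above is short enough to write out; a minimal NumPy sketch with a squared-exponential kernel follows, where the lengthscale, noise level, and data are illustrative rather than learned from the marginal likelihood.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between 1-D input vectors a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=20)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=20)
x_test = np.linspace(-3, 3, 5)
noise = 0.1**2

# GP regression posterior: mean = K*ᵀ (K + σ²I)⁻¹ y, cov = K** − K*ᵀ (K + σ²I)⁻¹ K*.
K = sq_exp_kernel(x_train, x_train) + noise * np.eye(x_train.size)
K_star = sq_exp_kernel(x_train, x_test)
K_ss = sq_exp_kernel(x_test, x_test)

L = np.linalg.cholesky(K)                       # the O(n³) step the abstract refers to
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
post_mean = K_star.T @ alpha
v = np.linalg.solve(L, K_star)
post_cov = K_ss - v.T @ v

print("posterior mean:", np.round(post_mean, 3))
print("posterior sd:  ", np.round(np.sqrt(np.diag(post_cov)), 3))
```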

3 prerequisites
intermediate learning-theory

Generalization Bounds

Rademacher complexity, uniform convergence, and the limits of classical theory

Generalization bounds are the formal answer to the question every practitioner asks: will my model do as well on new data as it did on training? We develop the theory progressively from the finite-class warmup (union bound + Hoeffding gives O(log|H|/ε²) sample complexity) to the full Rademacher-complexity machinery — symmetrization, Massart's lemma, Talagrand's contraction, McDiarmid concentration — that yields the canonical uniform-convergence bound. We extend to real-valued function classes through pseudo-dimension, fat-shattering, and Dudley's entropy integral; sharpen to fast rates via local Rademacher complexity under a Bernstein-type variance condition; specialize to margin bounds for SVMs and norm-based linear classifiers; and present algorithmic stability as the orthogonal route that bypasses hypothesis-class complexity altogether. A worked example computes empirical Rademacher complexity for thresholds and confronts the topic's signature open puzzle: classical bounds become vacuous for moderately overparameterized neural networks despite trivial empirical generalization.

2 prerequisites
intermediate geometry

Geodesics & Curvature

Geodesic equations, curvature tensors, and the Gauss–Bonnet theorem

Geodesics are the 'straightest possible' curves on a Riemannian manifold — the paths of zero acceleration with respect to the Levi-Civita connection. The geodesic equation is a second-order ODE whose solutions encode the manifold's intrinsic geometry: on the sphere, geodesics are great circles; on the hyperbolic plane, they are semicircles in the upper half-plane model. The exponential map packages geodesics into a local diffeomorphism from each tangent space to the manifold, and the resulting normal coordinates make the metric Euclidean to first order, with curvature appearing at second order. The Riemann curvature tensor measures the failure of parallel transport to be path-independent, and its contractions — sectional, Ricci, and scalar curvature — capture progressively coarser geometric information. The Gauss–Bonnet theorem connects curvature to topology: the total Gaussian curvature of a closed surface equals 2π times the Euler characteristic, making it a topological invariant independent of the metric. Jacobi fields describe how nearby geodesics spread or converge, with the sign of sectional curvature controlling the behavior: positive curvature focuses geodesics (leading to conjugate points), while negative curvature causes exponential divergence. The Bonnet–Myers and Cartan–Hadamard theorems draw global topological conclusions from curvature bounds, and the Rauch comparison theorem makes these estimates precise.

1 prerequisite
intermediate optimization

Gradient Descent & Convergence

The workhorse of continuous optimization: convergence rates under smoothness and strong convexity, acceleration, coordinate descent, mirror descent, and stochastic variants

Gradient descent is the computational engine behind modern optimization and machine learning. We develop the algorithm from its geometric core — the negative gradient as the direction of steepest descent — and build a complete convergence theory in two stages. For L-smooth convex functions, the Descent Lemma yields an O(1/k) sublinear rate; adding μ-strong convexity tightens this to a linear rate governed by the condition number κ = L/μ. We then extend the framework in three directions: Nesterov's accelerated gradient achieves the optimal O(1/k²) rate through a momentum mechanism, coordinate descent exploits problem structure by updating one variable at a time, and mirror descent replaces Euclidean geometry with Bregman divergence to handle constrained domains like the simplex. The stochastic gradient descent section bridges theory and practice, showing how noisy gradient estimates with decreasing step sizes still converge. Throughout, we connect convergence guarantees to the convex analysis foundations — smoothness, strong convexity, and subdifferentials — established in the prerequisite topic.
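
A small numerical check of the linear rate under strong convexity, on a quadratic whose smoothness and strong-convexity constants are chosen by hand (dimension, κ = L/μ, and iteration counts are arbitrary): the iterate distance to the minimizer is compared against the (1 − μ/L)^k contraction predicted for the 1/L step size.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
mu, L = 1.0, 100.0                               # strong-convexity and smoothness constants
eigs = np.linspace(mu, L, d)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = Q @ np.diag(eigs) @ Q.T                      # f(x) = 0.5 xᵀAx, minimizer x* = 0

x0 = rng.normal(size=d)
x = x0.copy()
step = 1.0 / L                                   # the classical 1/L step size

for k in range(1, 501):
    x = x - step * (A @ x)                       # gradient step; grad f(x) = A x
    if k % 100 == 0:
        dist = np.linalg.norm(x)
        bound = (1 - mu / L) ** k * np.linalg.norm(x0)
        print(f"iter {k:3d}  ||x_k - x*|| = {dist:.3e}  linear-rate bound = {bound:.3e}")
```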

1 prerequisite
foundational graph-theory

Graph Laplacians & Spectrum

The eigenvalues and eigenvectors of graph Laplacians — from connectivity to clustering to graph neural networks

The graph Laplacian L = D - A encodes a graph's connectivity structure in a real symmetric matrix whose spectrum reveals global properties invisible to local inspection. The smallest eigenvalue is always zero, with multiplicity equal to the number of connected components. The second-smallest eigenvalue — the algebraic connectivity or Fiedler value — quantifies how well-connected the graph is, and its eigenvector provides a principled way to bipartition the graph. Cheeger's inequality makes this precise: the Cheeger constant is bounded above and below in terms of the spectral gap, linking linear algebra to combinatorial optimization. The normalized Laplacian connects these ideas to random walks and entropy: its eigenvalues govern mixing times, and the stationary distribution's entropy reflects the graph's regularity. Spectral clustering exploits this theory by embedding vertices into the eigenspace of the Laplacian and clustering in the embedding — the mathematical foundation of modern graph-based learning methods including graph neural networks, where message passing is Laplacian smoothing in disguise.
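
A minimal sketch of the Fiedler-vector bipartition described above, on a graph built by hand from two dense blocks joined by a single bridge edge; the graph and its size are purely illustrative.

```python
import numpy as np

# Two 10-node cliques connected by one bridge edge.
n = 20
A = np.zeros((n, n))
A[:10, :10] = 1 - np.eye(10)
A[10:, 10:] = 1 - np.eye(10)
A[9, 10] = A[10, 9] = 1

D = np.diag(A.sum(axis=1))
L = D - A                                       # combinatorial graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
print("smallest eigenvalues:", np.round(eigvals[:3], 4))   # one zero eigenvalue: the graph is connected

fiedler = eigvecs[:, 1]                         # eigenvector of the second-smallest eigenvalue
partition = fiedler > 0                         # sign pattern bipartitions the graph
print("partition by Fiedler sign:", partition.astype(int))
```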

2 prerequisites
advanced supervised-learning

High-Dimensional Regression

The lasso at $p \gg n$ — definition, geometry, and ISTA / FISTA / coordinate-descent solvers; the Bickel–Ritov–Tsybakov (2009) oracle inequality at the $\sigma^2 s \log(p)/n$ rate under the restricted-eigenvalue condition; the variable-selection story under the irrepresentable condition; ridge / elastic-net / adaptive-lasso variants; and the Zhang–Zhang / Javanmard–Montanari / van de Geer–Bühlmann–Ritov–Dezeure (2014) debiased lasso for valid $\sqrt n$-confidence inference on individual coefficients.

When the number of features $p$ is comparable to or much larger than the sample size $n$, ordinary least squares fails — the normal equations are rank-deficient and predictions explode. The lasso (Tibshirani 1996) replaces the squared-loss objective with an L1-penalized version, producing sparse solutions whose support adapts to the unknown active set of the true regression vector. We work the lasso pipeline end-to-end: (i) the estimator's geometry — convex objective, L1-ball corners, KKT subgradient conditions, soft-thresholding closed form on orthogonal designs — and the ISTA / FISTA / coordinate-descent solvers that handle the general case; (ii) the headline non-asymptotic prediction-risk theory — the Bickel-Ritov-Tsybakov (2009) oracle inequality at rate $\sigma^2 s \log(p)/n$ under the restricted-eigenvalue condition, plus the parallel variable-selection story under the irrepresentable condition (Zhao-Yu 2006); (iii) practical $\lambda$-selection by cross-validation and information criteria, plus the ridge / elastic-net / adaptive-lasso variants that fix the lasso's failure modes on correlated features and biased active coefficients; (iv) the inferential payoff — the debiased lasso (Zhang-Zhang 2014; Javanmard-Montanari 2014; van de Geer-Bühlmann-Ritov-Dezeure 2014), a one-step Newton correction that produces $\sqrt n$-consistent normal estimates of individual coefficients suitable for confidence intervals and hypothesis tests in the $p \gg n$ regime where naive post-selection inference undercovers by 20-50 percentage points. The lasso is the workhorse high-dimensional regression tool of the modern statistical toolkit and the canonical sparse nuisance estimator in semiparametric and double/debiased machine-learning inference.
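
The soft-thresholding map and the ISTA solver from item (i) can be sketched directly; a minimal version on synthetic sparse data follows, where the dimensions, sparsity level, and penalty λ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 100, 500, 5                                 # p >> n, s-sparse truth
X = rng.normal(size=(n, p)) / np.sqrt(n)
beta_true = np.zeros(p)
beta_true[:s] = rng.normal(size=s) * 3
y = X @ beta_true + 0.1 * rng.normal(size=n)

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

lam = 0.05
step = 1.0 / np.linalg.norm(X, 2) ** 2                # 1/L with L the largest eigenvalue of XᵀX
beta = np.zeros(p)
for _ in range(2000):                                 # ISTA: gradient step on the squared loss, then prox
    grad = X.T @ (X @ beta - y)
    beta = soft_threshold(beta - step * grad, step * lam)

print("nonzeros recovered:", np.flatnonzero(np.abs(beta) > 1e-6))
print("true support:      ", np.arange(s))
```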

2 prerequisites
advanced geometry

Information Geometry & Fisher Metric

The Fisher information metric, α-connections, and divergence functions on statistical manifolds

Information geometry equips the space of probability distributions with a Riemannian metric — the Fisher information — revealing that statistical inference is fundamentally geometric. The Fisher metric, defined as the covariance of score functions, is the unique Riemannian metric (up to scale) that is invariant under sufficient statistics (Čencov's theorem). For the Gaussian family, the Fisher metric is the Poincaré upper half-plane metric with constant negative curvature K = -1/2, and the Fisher-Rao distance recovers the Mahalanobis distance. Amari's α-connections extend the Levi-Civita connection to a one-parameter family, with the e-connection (α = 1) and m-connection (α = -1) forming a dual pair whose dually flat structure on exponential families underlies the Legendre duality between natural and expectation parameters. Divergence functions — KL divergence, α-divergences, and Bregman divergences — encode non-symmetric notions of distance whose infinitesimal form recovers the Fisher metric, and the generalized Pythagorean theorem governs optimal projections in variational inference. The Cramér–Rao bound is a direct geometric consequence of Cauchy-Schwarz in the Fisher inner product, connecting curvature to estimation precision. Natural gradient descent, which premultiplies gradients by the inverse Fisher matrix, achieves reparameterization-invariant optimization and is approximated by Adam through its diagonal second-moment estimates.

3 prerequisites
intermediate supervised-learning

Kernel Regression

Nonparametric estimation of conditional means via the Nadaraya–Watson estimator — bias-variance decomposition with the design-density correction, AMISE-optimal bandwidth selection, the curse of dimensionality at higher d, and the local-linear fix for boundary bias.

Kernel regression estimates the conditional mean function m(x) = E[Y | X = x] from iid pairs (X_i, Y_i) without assuming a parametric form for m. The Nadaraya–Watson estimator is the central object — a kernel-weighted average of Y values whose bandwidth controls the smoothing. We work through the full bias-variance machinery: the second-order Taylor expansion that produces an h² bias with a design-density correction term f_X'(x)/f_X(x), the variance scaling 1/(nh) controlled by the kernel roughness R(K) = ∫K(u)² du, the AMISE-optimal bandwidth h* ≍ n^(-1/(4+d)), and the curse-of-dimensionality rate AMISE(h*) ≍ n^(-4/(4+d)) that limits naive kernel methods at d ≥ 10. We compare four bandwidth selectors on the §1 toy (Silverman, plug-in, leave-one-out CV, GCV), verify Marron–Nolan's canonical-kernel theorem that makes kernel choice a calibration on h, and derive the local-linear extension that fixes the O(h) boundary-bias degradation back to uniform O(h²). The topic closes by placing kernel regression in its broader family — kernel ridge regression as the RKHS counterpart, GP regression as the Bayesian counterpart, semi-parametric inference as the standard ML application, conformal prediction for distribution-free intervals — and forward-points to T2's local-regression and density-ratio-estimation topics that build on the same kernel substrate.
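
A minimal Nadaraya–Watson implementation with a Gaussian kernel; the simulated curve, grid points, and bandwidth values below are illustrative choices, not the topic's §1 toy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

def nadaraya_watson(x_query, x_data, y_data, h):
    """Kernel-weighted average: m_hat(x) = sum K((x - X_i)/h) Y_i / sum K((x - X_i)/h)."""
    u = (x_query[:, None] - x_data[None, :]) / h
    w = np.exp(-0.5 * u**2)                     # Gaussian kernel (normalizing constant cancels)
    return (w * y_data).sum(axis=1) / w.sum(axis=1)

grid = np.linspace(0.05, 0.95, 7)
for h in [0.02, 0.08, 0.3]:                     # bandwidth sweep: undersmoothed to oversmoothed
    m_hat = nadaraya_watson(grid, x, y, h)
    print(f"h = {h:.2f}  m_hat on grid = {np.round(m_hat, 2)}")
print("true m on grid:   ", np.round(np.sin(2 * np.pi * grid), 2))
```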

1 prerequisite
intermediate information-theory

KL Divergence & f-Divergences

Statistical divergences — measuring distributional mismatch through coding excess, convex generators, and variational representations

KL divergence measures the expected excess cost of encoding samples from one distribution using a code optimized for another — the gap between cross-entropy and entropy. We develop the theory from Gibbs' inequality through the asymmetry that distinguishes forward KL (mode-covering, used in maximum likelihood) from reverse KL (mode-seeking, used in variational inference). The f-divergence framework unifies KL, reverse KL, chi-squared, Hellinger, total variation, and Jensen–Shannon divergences through convex generator functions, inheriting non-negativity and the data processing inequality from Jensen's inequality alone. Variational representations via Fenchel conjugates turn divergence estimation into optimization problems — the foundation of f-GANs and modern density ratio estimation. Rényi divergences provide a one-parameter family that interpolates between these special cases, with monotonicity in the order parameter governing the privacy-utility trade-off in differential privacy. Applications to machine learning include cross-entropy loss as forward KL minimization, the ELBO as reverse KL minimization, and GAN training as Jensen–Shannon divergence minimization.

1 prerequisite
advanced optimization

Lagrangian Duality & KKT Conditions

The Lagrangian, weak and strong duality, Slater's condition, KKT optimality conditions, sensitivity analysis, and applications to SVMs, water-filling, and portfolio optimization

Lagrangian duality transforms constrained optimization into a two-player game between primal and dual variables. We build the theory from the Lagrangian function — which prices constraint violations via dual variables — through the dual problem, whose optimal value always lower-bounds the primal (weak duality). For convex problems satisfying Slater's constraint qualification, the bound is tight: strong duality gives p* = d*, and the Karush–Kuhn–Tucker conditions become both necessary and sufficient for optimality. The four KKT conditions — stationarity, primal feasibility, dual feasibility, and complementary slackness — encode the geometric insight that at an optimal point, the objective gradient lies in the cone spanned by the active constraint gradients. The dual variables double as sensitivity measures: each λ_i* is the shadow price of constraint i, which gives the rate at which the optimal value changes under perturbations. We ground the theory in three applications where the dual formulation is essential: the support vector machine (where the dual reveals the kernel trick and identifies support vectors via complementary slackness), water-filling for channel capacity allocation (where KKT yields the closed-form solution), and mean-variance portfolio optimization (where the dual traces the efficient frontier). Throughout, the convex analysis foundations — subdifferentials, conjugate functions, and the separation theorems — provide the analytical backbone.

1 prerequisite
intermediate supervised-learning

Local Polynomial Regression

Degree-$p$ generalization of local-linear regression — the Fan–Gijbels family, the equivalent-kernel formulation that makes every degree-$p$ estimator a kernel smoother with a degree-aware reweighting $K^*_p$, the Ruppert–Wand bias ladder, and the odd-vs-even degree story that distills the practical recommendation to $p \in \{1, 3\}$.

Local polynomial regression generalizes the Nadaraya–Watson and local-linear estimators of kernel regression to degree-p polynomial fits in kernel-weighted neighborhoods. We work through the Fan–Gijbels family from definition to deployment: the kernel-weighted least-squares normal equations and their (p+1)×(p+1) Hankel-matrix form, the equivalent-kernel formulation that recasts every degree-p estimator as a kernel smoother with a degree-aware reweighting K*_p, the Ruppert–Wand (1994) bias formula b_p(K) = ∫ u^(p+1) K*_p(u) du, the interior bias-rate ladder with the parity-bonus pattern at even p (h^(p+2) via the moment-matching identities), and the variance constants R*_p(K) that grow with degree. The §5 odd-vs-even degree story is the topic's sharpest result — at the boundary the parity-bonus collapses, but odd-p estimators retain design-adaptivity (no f_X-derivative contamination) that even-p does not, and the practical recommendation distills to p = 1 (local-linear, the production default) and p = 3 (local-cubic, when derivative estimates are needed). We extend bandwidth selection (LOO-CV, GCV, plug-in) to general p, generalize to multivariate covariates with the n^(-2(p+1)/(2(p+1)+d)) Stone-minimax rate, derive derivative estimates as a single-fit byproduct, prove the Silverman (1984) asymptotic equivalence between cubic smoothing splines and local-cubic regression, and bridge to GAMs via backfitting on the partial residuals. Local polynomial regression is the workhorse univariate nonparametric smoother of the modern toolkit and the canonical nuisance estimator in semiparametric and debiased machine-learning inference.

1 prerequisite
intermediate probability

Measure-Theoretic Probability

From sigma-algebras to martingales — the rigorous foundations that underpin modern statistics and machine learning

Measure-theoretic probability provides the rigorous foundations for modern statistics and machine learning. We develop the theory from first principles: sigma-algebras and measurable spaces, measures and Kolmogorov's axioms, the Lebesgue integral via simple functions, the Monotone and Dominated Convergence Theorems, the four modes of convergence and their hierarchy, the Laws of Large Numbers and Central Limit Theorem, product measures and Fubini's theorem, conditional expectation via the Radon-Nikodym theorem with its L2 projection interpretation, and a preview of martingale theory with connections to financial applications including GARCH volatility modeling and regime detection.

advanced graph-theory

Message Passing & GNNs

From spectral graph convolutions to neighborhood aggregation — the mathematical foundations of graph neural networks

Message passing neural networks learn graph representations by iteratively aggregating information along edges: at each layer, every node collects features from its neighbors, transforms the result, and produces an updated embedding. This framework unifies the Graph Theory track. The GCN update rule is a first-order polynomial of the normalized Laplacian — Laplacian smoothing with learnable weights. Repeated message passing drives node features toward the stationary distribution of the random walk on the graph, with the spectral gap controlling the convergence rate — the over-smoothing phenomenon. The Weisfeiler-Leman isomorphism test provides the expressiveness ceiling: no message passing GNN can distinguish graphs that 1-WL cannot, and GIN (sum aggregation + MLP) achieves this ceiling. Graph Attention Networks generalize GCN by learning data-dependent edge weights, effectively performing soft graph rewiring. Expander-based rewiring methods (FoSR, SDRF) explicitly increase the spectral gap by adding shortcut edges, exploiting the O(log n) mixing time of expanders for faster information propagation — but inheriting the O(log n) over-smoothing depth as a fundamental tradeoff.
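
The GCN-style update — symmetrically normalized, self-loop-augmented aggregation followed by a learnable linear map — and the over-smoothing effect can both be sketched in a few lines of NumPy; the graph, feature dimension, and random weight matrix below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small graph: 6 nodes on a cycle, with self-loops added before normalization.
n, d_in, d_out = 6, 8, 4
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
A_tilde = A + np.eye(n)                                      # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^{-1/2} (A + I) D^{-1/2}

X = rng.normal(size=(n, d_in))                               # node features
W = rng.normal(size=(d_in, d_out)) * 0.1                     # learnable weight matrix (random here)

# One message-passing layer: aggregate neighbor features, transform, apply ReLU.
H1 = np.maximum(A_hat @ X @ W, 0.0)
print("one-layer embeddings shape:", H1.shape)

# Over-smoothing in miniature: repeated aggregation alone drives node features toward
# the dominant eigenvector of A_hat, so embeddings become nearly indistinguishable.
for k in [1, 5, 20, 80]:
    H = np.linalg.matrix_power(A_hat, k) @ X
    rel_spread = np.linalg.norm(H - H.mean(axis=0)) / np.linalg.norm(H)
    print(f"after {k:2d} aggregation steps: relative spread of embeddings = {rel_spread:.4f}")
```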

3 prerequisites
advanced information-theory

Minimum Description Length

Model selection as data compression — the shortest total description identifies the best model

The Minimum Description Length principle formalizes Occam's Razor through coding theory: the best model is the one that provides the shortest total description of the data, balancing model complexity L(M) against goodness-of-fit L(D|M). We develop MDL from Rissanen's two-part codes through refined MDL, where the Normalized Maximum Likelihood (NML) distribution achieves minimax optimal regret, and the parametric complexity COMP(M_k) provides a principled, prior-free measure of model complexity. The asymptotic expansion of COMP reveals the geometric structure of model complexity: the (k/2) log n term (recovering BIC) captures dimensionality, while the integral of the Fisher information volume element measures the effective size of the model manifold. Prequential (predictive) MDL provides a computationally tractable alternative that processes data sequentially and asymptotically converges to the same complexity penalty. MDL connects to Bayesian model selection through the equivalence of stochastic complexity and the marginal likelihood under Jeffreys prior, and to algorithmic information theory through the interpretation of code lengths as computable upper bounds on Kolmogorov complexity. Applications to machine learning include BIC-based model selection, sparse feature selection via description-length penalization, and the information-theoretic interpretation of neural network compression.

2 prerequisites
intermediate bayesian-ml

Mixed-Effects Models

Partial-pooling theory in its own right: Henderson's mixed-model equations and the BLUP shrinkage formula, REML for variance components via flat-prior Bayesian marginalization, and the frequentist–Bayesian bridge that fits the same hierarchical model two ways and shows them agree numerically

Mixed-effects models develop the partial-pooling theory in its own right. We start with a six-classroom example showing the three classical responses to grouped data — complete pooling, no pooling, and partial pooling — and the geometric reason partial pooling sits between the other two with the location set by group size. We write the linear mixed model in matrix form and identify the marginal covariance V = ZGZᵀ + R that drives generalized least squares. We derive Henderson's mixed-model equations from the joint penalized log-density and recover the closed-form BLUP shrinkage formula λⱼ = τ²/(τ² + σ²/nⱼ) — the single number that characterizes how mixed-effects partial-pool. We derive restricted maximum likelihood (REML) for variance components by marginalizing the fixed effects out under a flat improper prior, showing the bias correction is a Bayesian-marginalization-of-β in disguise and structurally identical to the marginal-likelihood machinery Gaussian processes use for hyperparameter learning. We close by running the same six-classroom model through both statsmodels.MixedLM (REML) and PyMC (full Bayesian) and showing the two views agree numerically, with the centered-vs-non-centered story from probabilistic programming reappearing in its native habitat — small group counts and unknown variance components — and the GLMM, factor-model, GP, and variational alternatives for regimes the linear mixed model cannot reach.
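
The BLUP shrinkage formula above is simple enough to compute directly; a minimal sketch with made-up classroom means and variance components follows (all numbers are illustrative, not the topic's six-classroom example).

```python
import numpy as np

# Illustrative variance components and per-group sample sizes.
tau2 = 4.0                      # between-group variance
sigma2 = 16.0                   # within-group (residual) variance
n_j = np.array([3, 8, 15, 40, 100, 400])            # group sizes
group_means = np.array([72.0, 65.0, 70.0, 60.0, 68.0, 63.0])
grand_mean = 66.0

# BLUP shrinkage factor: lambda_j = tau^2 / (tau^2 + sigma^2 / n_j).
lam = tau2 / (tau2 + sigma2 / n_j)

# Partial pooling: each group estimate is pulled toward the grand mean,
# more strongly for small groups (small lambda_j) than for large ones.
blup = lam * group_means + (1 - lam) * grand_mean
for nj, l, m, b in zip(n_j, lam, group_means, blup):
    print(f"n_j = {nj:3d}  lambda_j = {l:.3f}  raw mean = {m:5.1f}  partially pooled = {b:5.1f}")
```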

3 prerequisites
advanced category-theory

Monads & Comonads

The categorical calculus of effects and contexts — from Bayesian inference to backpropagation, from probability monads to graph comonads

A monad on a category C is a triple (T, η, μ) consisting of an endofunctor T: C → C, a unit η: Id ⇒ T, and a multiplication μ: T² ⇒ T, satisfying associativity and unit laws — equivalently, a monoid in the monoidal category of endofunctors [C, C] under composition. Every adjunction F ⊣ G yields a monad T = GF, with the unit inherited from the adjunction and the multiplication μ = GεF constructed from the counit. The Kleisli category C_T captures 'effectful computations' — its morphisms A → TB model computations that produce a B with a side effect encoded by T. The Maybe monad models partiality, the List monad models nondeterminism, and the Giry monad on Meas models probabilistic computation, with Kleisli arrows as Markov kernels and Kleisli composition as the Chapman-Kolmogorov equation. The Eilenberg-Moore category C^T captures 'T-algebraic structures' — Maybe-algebras are pointed sets, List-algebras are monoids, and Giry-algebras are convex spaces. Beck's Monadicity Theorem characterizes when a functor is monadic: it must have a left adjoint, reflect isomorphisms, and preserve G-split coequalizers. This theorem draws the fundamental boundary between algebraic structures (groups, vector spaces, modules) and non-algebraic ones (topological spaces, partial orders). Dually, a comonad (W, ε, δ) encodes 'contextual computations' — the Stream comonad models signal processing, the Store comonad models indexed state, and the Neighborhood comonad on a graph models exactly the message-passing operation in graph neural networks: a coKleisli arrow WA → B extracts features from a neighborhood, and the coKleisli extension applies this extraction at every node. For machine learning, the Giry monad provides the categorical foundation of Bayesian inference, the continuation monad reveals backpropagation as Kleisli composition (the chain rule as functoriality of CPS), and the neighborhood comonad unifies attention mechanisms and GNN message passing under a single comonadic framework.

1 prerequisite
intermediate category-theory

Natural Transformations

Morphisms between functors — the naturality condition that distinguishes canonical constructions from arbitrary choices

A natural transformation is a morphism between functors — a family of arrows in the target category, indexed by objects of the source, that commutes with every morphism. The naturality condition captures what it means for a construction to be canonical: the double dual embedding V → V** is natural because it requires no choice of basis, while the isomorphism V ≅ V* is not. We develop vertical and horizontal composition, functor categories, and the interchange law that gives Cat the structure of a 2-category. The Yoneda lemma — the deepest result in basic category theory — says that an object is completely determined by its morphisms to all other objects, establishing a bijection Nat(Hom(A, -), F) ≅ F(A) natural in both A and F. The Yoneda embedding sends each object to its representable presheaf and is fully faithful, meaning that category theory's abstract objects carry as much information as their concrete incarnations. For machine learning, equivariance — the property that a function commutes with a group action — is precisely naturality: CNN translation equivariance, GNN permutation equivariance, and spherical CNN rotation equivariance are all instances of the naturality square. Shannon entropy defines a natural transformation from probability distributions to the reals, and the data processing inequality follows from naturality.

1 prerequisite
advanced unsupervised

Normalizing Flows

Invertible neural networks for explicit-density generative modeling — from the change-of-variables formula to coupling layers (NICE/RealNVP), autoregressive flows (MAF/IAF), and Glow's multi-scale architecture, with the 2-moons worked example end-to-end

Normalizing flows learn an invertible transformation that pushes a simple base distribution (almost always a standard Gaussian) into the data distribution. Unlike VAEs which give a bound on the density and GANs which give no density at all, flows give exact log p_φ(x) via the change-of-variables formula. We derive the framework from the multivariable substitution rule, build the dominant architectures (coupling layers / RealNVP, autoregressive flows / MAF / IAF, Glow's multi-scale primitives), train a flow on the 2-moons dataset end-to-end, and discuss applications (image generation, molecular sampling, RL policies, simulation-based inference), computational trade-offs, and limits. The 2-moons worked example in §11 trains a 6-layer affine-coupling RealNVP from scratch in under thirty seconds of CPU time and visualizes the learned diffeomorphism's forward and inverse maps simultaneously.

1 prerequisite
advanced probability

PAC Learning Framework

From finite hypothesis classes to VC dimension and the Fundamental Theorem of Statistical Learning

The Probably Approximately Correct (PAC) learning framework provides the foundational theory for understanding when and how efficiently machine learning algorithms can generalize from finite training data. We develop the theory progressively: realizable PAC learning for finite hypothesis classes with sample complexity bounds via the union bound, agnostic PAC learning through uniform convergence and Hoeffding's inequality, the Vapnik–Chervonenkis (VC) dimension as the combinatorial measure of hypothesis class complexity with the Sauer–Shelah lemma bounding growth functions, the Fundamental Theorem of Statistical Learning establishing the equivalence of PAC learnability and finite VC dimension, and Rademacher complexity as a data-dependent refinement using McDiarmid's inequality and symmetrization arguments. Applications to linear classifiers, neural networks, decision trees, and structural risk minimization connect the abstract theory to practical machine learning.

1 prerequisite
advanced learning-theory

PAC-Bayes Bounds

Posterior-averaged generalization theory — the change-of-measure inequality, McAllester / Seeger / Catoni bounds, the Gibbs distribution, and the first non-vacuous deep-network certificates

PAC-Bayes is the framework that closes the gap classical generalization theory cannot. Rademacher complexity goes vacuous when the hypothesis class is rich enough to memorize noise — exactly the regime modern over-parameterized networks live in. PAC-Bayes side-steps the worst-case-over-H charge by averaging the gap under a posterior Q over hypotheses and paying for Q's drift from a prior P fixed before the data. The price is the KL divergence between Q and P, finite for any Q ≪ P even when |H| is infinite. We develop the framework from first principles: the Donsker–Varadhan change-of-measure inequality as the master tool; McAllester's 1999 square-root bound, Maurer's 2004 tightening, Seeger's binary-KL form that beats McAllester near the realizable regime, and Catoni's linear bound whose RHS minimizer is the Gibbs distribution; the four-step refinement table (variance-adaptive Tolstikhin–Seldin, jointly-quasiconvex Thiemann, small-n Foong–Bruinsma–Burt–Turner); the Bayesian–PAC-Bayes correspondence at temperature λ = n with log-likelihood loss and the cold-posterior debate it clarifies; a Dziugaite–Roy reproduction certifying a stochastic MLP on binary MNIST below the Rademacher floor; and finally the four ML applications where the framework lands — meta-learning, compression, information-theoretic generalization, and tighter empirical neural-net bounds — with honest limits on what PAC-Bayes can and cannot say.

4 prerequisites
intermediate linear-algebra

PCA & Low-Rank Approximation

From variance maximization to modern dimensionality reduction — the mathematical theory behind the most widely used unsupervised learning method

Principal Component Analysis (PCA) finds the orthogonal directions of maximum variance in a dataset, providing the optimal linear dimensionality reduction in the least-squares sense. We derive PCA from first principles via the Rayleigh quotient (Spectral Theorem), connect it to the SVD through the Eckart–Young–Mirsky theorem (optimal low-rank approximation), develop the scree plot and parallel analysis for choosing dimensionality, and extend classical PCA to four modern variants: Kernel PCA for nonlinear manifolds, Probabilistic PCA for missing data and generative modeling, Robust PCA (Principal Component Pursuit) for separating low-rank structure from sparse corruption, and Sparse PCA for interpretable loadings. The mathematical foundation is the Spectral Theorem applied to the sample covariance matrix, computed via the SVD for numerical stability.
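
A minimal sketch of PCA computed via the SVD, including a numerical check of the Eckart–Young–Mirsky identity that the squared Frobenius error of the rank-k reconstruction equals the sum of the discarded squared singular values; the data dimensions and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 3

# Data with an approximately rank-3 signal plus noise.
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))

Xc = X - X.mean(axis=0)                          # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)
print("variance explained by first 5 PCs:", np.round(explained[:5], 3))

# Rank-k reconstruction and the Eckart-Young-Mirsky identity in Frobenius norm.
Xk = (U[:, :k] * S[:k]) @ Vt[:k]
frob_err = np.linalg.norm(Xc - Xk) ** 2
print("squared reconstruction error:      ", round(frob_err, 4))
print("sum of discarded squared sing vals:", round(np.sum(S[k:] ** 2), 4))
```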

1 prerequisite
intermediate topology

Persistent Homology

Tracking topological features across scales — the workhorse of topological data analysis

Persistent homology solves the fundamental problem of topological data analysis: at what scale should we analyze shape? By tracking how topological features — connected components, loops, voids — are born and die across a filtration of simplicial complexes, it produces a multiscale signature that is both theoretically grounded (via the Stability Theorem) and practically useful as input to machine learning pipelines.

1 prerequisite
intermediate nonparametric-ml

Prediction Intervals

Constructing finite-sample-valid prediction intervals under exchangeability, iid, or location-shift symmetry — split conformal, pure QR, CQR, and Hodges-Lehmann test-inversion compared on coverage, width, conditional behavior, and cost

How to construct a finite-sample-valid prediction interval around a black-box predictor's output, and how the three featured constructions compare on coverage, width, conditional behavior, and cost. Three constructions are developed in order of weakest assumption first: split conformal under exchangeability (Theorem 1, finite-sample marginal); pure quantile-regression intervals under iid plus smoothness (Theorem 2, asymptotic conditional); Hodges-Lehmann test-inversion under iid plus symmetric residuals (Theorem 3, finite-sample conditional). CQR appears as the bridge between the first two — split conformal under the QR score — and earns its own analysis through three bridge theorems (Theorem 5.1: CQR's conditional-coverage gap is bounded by twice the QR base learner's pointwise error; Theorem 5.2: CQR's expected width is bounded above by split conformal's under heteroscedasticity; Theorem 5.3: HL and split conformal converge to the same band on location-shift symmetric problems). A four-scenario empirical comparison consolidates the practical recommendations: use CQR under heteroscedasticity, default to split conformal otherwise, avoid pure QR alone.

3 prerequisites
intermediate bayesian-ml

Probabilistic Programming

The PPL abstraction in PyMC, NumPyro, and Stan: trace-based joint log-densities, automatic constrained-to-unconstrained reparameterization, and inference dispatch from NUTS to ADVI to MAP through the eight-schools workflow

A probabilistic programming language is an answer to a structural problem in Bayesian inference: in practice, the modeling step and the inference step have come to be entangled. Running HMC requires the joint log-density and its gradient on unconstrained space; running mean-field VI requires deriving the variational updates; running Metropolis–Hastings requires a proposal kernel. The PPL move is to cleanly separate the two — the user writes only the generative model, and the language manufactures the joint log-density, the gradients, and the constrained-to-unconstrained reparameterizations the inference engines need. We motivate the PPL abstraction by computing a Beta–Binomial posterior algebraically and then declaring the same model in PyMC; name the three engine-derived objects (joint log-density, automatic gradient, execution trace) and contrast PyMC's PyTensor symbolic graph with NumPyro's effect-handler messengers and Stan's compiled C++; develop the change-of-variables theorem for the constrained-to-unconstrained transforms HMC needs and walk through the log, logit, and stick-breaking transforms PPLs use most often; dispatch the same generative model to NUTS, ADVI, and MAP from a single PyMC model spec on Iris; and walk the full Bayesian workflow on Rubin's eight-schools dataset, including the centered-vs-non-centered reparameterization that turns a divergence-littered fit into a clean one. The topic closes with the regimes where PP breaks down — large data, intractable likelihoods, high-dimensional latents, discrete latents, multimodal posteriors — and pointers to the planned T5 topics that pick up where PP runs out of road.

1 prerequisite
intermediate optimization

Proximal Methods

Proximal operators, the Moreau envelope, forward-backward splitting, ISTA/FISTA for composite objectives, Douglas–Rachford splitting, and ADMM — the algorithmic toolkit for non-smooth and constrained optimization

Proximal methods extend gradient descent to non-smooth and constrained optimization by replacing the gradient step with a proximal operator — the point that best balances minimizing the objective and staying close to the current iterate. We develop the theory from the proximal operator definition through three layers of increasing generality: forward-backward splitting (proximal gradient descent) handles composite objectives where one term is smooth and one is prox-friendly, yielding the ISTA algorithm for the Lasso; FISTA accelerates this from O(1/k) to the optimal O(1/k²) rate via Nesterov momentum; Douglas–Rachford splitting and ADMM extend the framework to problems where both terms are non-smooth or where variable splitting enables parallel computation. The Moreau envelope provides the smoothing bridge that connects proximal operators to gradient-based reasoning, while the soft-thresholding operator emerges as the workhorse proximal map underlying sparse recovery. Throughout, we connect each algorithm to the convex analysis foundations — subdifferentials, conjugate functions, and the fundamental optimality condition 0 ∈ ∂f(x*) — and to the convergence machinery developed in the Gradient Descent topic.

2 prerequisites
intermediate nonparametric-ml

Quantile Regression

The Koenker–Bassett 1978 estimator: pinball-loss minimization, LP reformulation, asymptotic normality, multi-quantile rearrangement, and the base learner of CQR

Quantile regression replaces OLS's squared-error loss with the asymmetric pinball loss, recovering the conditional τ-quantile of Y given X rather than the conditional mean. We develop the theory progressively: pinball-loss minimization recovers the population quantile; the Koenker-Bassett 1978 estimator is the empirical sample analog; the LP reformulation makes QR exact rather than iterative; conditional quantiles are equivariant under monotone transformations of Y, a clean robustness gain over OLS; Knight 1998's asymptotic normality has the Bernoulli factor τ(1−τ) replacing the noise variance and a density-weighted Gram matrix replacing the standard Gram matrix in the sandwich covariance; Chernozhukov-Fernandez-Val-Galichon 2010 rearrangement fixes multi-quantile crossing; Belloni-Chernozhukov 2011 establishes oracle rates for L1-penalized QR. The topic closes with QR as the base learner inside Conformalized Quantile Regression, the bridge back into the conformal-prediction topic.
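
The pinball loss and its minimization are easy to sketch. The snippet below checks that minimizing the pinball loss over a constant recovers the sample τ-quantile, then fits a toy conditional-quantile line on made-up heteroscedastic data; a smooth SciPy optimizer stands in for the exact LP reformulation, and τ = 0.9 and all data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
tau = 0.9

def pinball(residual, tau):
    """Asymmetric pinball (check) loss: tau * r if r >= 0, (tau - 1) * r otherwise."""
    return np.where(residual >= 0, tau * residual, (tau - 1) * residual)

# 1. Minimizing the pinball loss over a constant recovers the sample tau-quantile.
y = rng.normal(size=10_000)
c_grid = np.linspace(-3, 3, 2001)
losses = [pinball(y - c, tau).mean() for c in c_grid]
print("pinball minimizer:", round(c_grid[int(np.argmin(losses))], 3),
      " sample quantile:", round(np.quantile(y, tau), 3))

# 2. A toy conditional-quantile fit with heteroscedastic noise.
x = rng.uniform(0, 2, size=2000)
y = 1.0 + 2.0 * x + (0.5 + x) * rng.normal(size=2000)
obj = lambda b: pinball(y - (b[0] + b[1] * x), tau).mean()
fit = minimize(obj, x0=np.zeros(2), method="Nelder-Mead")
print("fitted 0.9-quantile line: intercept, slope =", np.round(fit.x, 3))
```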

3 prerequisites
intermediate graph-theory

Random Walks & Mixing

Markov chains on graphs — from the transition matrix to mixing times, hitting times, and the spectral gap

A random walk on a graph is a Markov chain whose transition matrix P = D⁻¹A encodes local structure: at each step, the walker moves to a uniformly random neighbor. The stationary distribution π assigns probability proportional to degree, and the Perron–Frobenius theorem guarantees convergence for connected, non-bipartite graphs. The mixing time — how many steps until the walk's distribution is close to π in total variation — is governed by the spectral gap γ = 1 − λ₂(P), tying random walk convergence directly to the Laplacian eigenvalues from spectral graph theory. Hitting times and commute times provide complementary measures of graph distance with deep connections to electrical networks: the commute time between two vertices equals the effective resistance times the total edge weight. These ideas power modern graph algorithms from PageRank to DeepWalk, and the mixing time perspective motivates expander graphs — sparse graphs that mix as fast as dense ones.
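
A minimal sketch of the mixing statement above: build P = D⁻¹A for a small hand-constructed graph, compute the spectral gap, and watch the total-variation distance to the degree-proportional stationary distribution decay; the graph and step counts are illustrative.

```python
import numpy as np

# A small connected, non-bipartite graph: a 7-cycle with one chord.
n = 7
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
A[0, 3] = A[3, 0] = 1

deg = A.sum(axis=1)
P = A / deg[:, None]                              # transition matrix P = D^{-1} A
pi = deg / deg.sum()                              # stationary distribution, proportional to degree

# Spectral gap: 1 minus the second-largest eigenvalue of P in absolute value.
# P is similar to a symmetric matrix, so its eigenvalues are real.
eigs = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
gap = 1 - max(abs(eigs[1]), abs(eigs[-1]))
print("spectral gap:", round(gap, 4))

# Total-variation distance from pi after t steps, starting from vertex 0.
mu = np.zeros(n); mu[0] = 1.0
for t in range(1, 31):
    mu = mu @ P
    if t % 5 == 0:
        tv = 0.5 * np.abs(mu - pi).sum()
        print(f"t = {t:2d}  TV distance to stationarity = {tv:.5f}")
```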

1 prerequisite
intermediate nonparametric-ml

Rank Tests & Permutation Inference

Distribution-free hypothesis testing through ranks and permutations — Wilcoxon, Mann-Whitney, Kruskal-Wallis, Pitman ARE, Hodges-Lehmann, and randomization inference for A/B testing

Distribution-free hypothesis testing through ranks and permutations. We develop the theory progressively: the sign test and Wilcoxon signed-rank for paired data with full enumeration of the discrete null at small n; the permutation principle (Theorem 3) as the umbrella covering all rank tests as special cases; Mann-Whitney U for two-sample comparisons with closed-form null moments; Kruskal-Wallis for k-sample comparison via the rank analog of one-way ANOVA; Pitman asymptotic relative efficiency, with the celebrated 12σ²(∫f²)² formula yielding 3/π ≈ 0.955 under normality, 1.5 under Laplace, and unboundedly large under heavy tails (subject to the Hodges-Lehmann 0.864 floor); the Hodges-Lehmann estimator as the median of Walsh averages, recovered by inverting the Wilcoxon test, with an exact distribution-free CI; permutation tests with non-rank statistics for the trade-off between rank robustness and parametric efficiency; and randomization-inference framing of online A/B testing, with CUPED variance reduction. The topic closes with the covariate problem as the natural transition point to quantile regression and conformal prediction, the two T4 directions that pick up where rank-and-permutation leaves off.
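
A minimal randomization-test sketch with synthetic data and a plain difference-in-means statistic (the same permutation scheme accepts rank statistics or any other test statistic):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=40)               # control
y = rng.normal(0.5, 1.0, size=40)               # treatment with a small shift

observed = y.mean() - x.mean()
pooled = np.concatenate([x, y])
n = len(x)

# Under the sharp null, group labels are exchangeable: permute and recompute.
n_perm = 10_000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    idx = rng.permutation(len(pooled))
    perm_stats[i] = pooled[idx[n:]].mean() - pooled[idx[:n]].mean()

p_value = (1 + np.sum(np.abs(perm_stats) >= abs(observed))) / (n_perm + 1)
print("observed diff:", round(observed, 3), " permutation p-value:", round(p_value, 4))
```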

2 prerequisites
intermediate information-theory

Rate-Distortion Theory

The fundamental limits of lossy compression — how many bits per symbol when we tolerate distortion?

Rate-distortion theory answers the fundamental question of lossy compression: how many bits per source symbol are necessary and sufficient when we tolerate an average distortion of at most D? The rate-distortion function R(D) — defined as the minimum mutual information I(X; X̂) over all test channels satisfying the distortion constraint — is convex, non-increasing, and equals H(X) at D = 0, recovering Shannon's lossless source coding limit. We derive closed-form solutions for the binary source with Hamming distortion (R(D) = H_b(p) - H_b(D)) and the Gaussian source with squared error (R(D) = (1/2) log(σ²/D)), prove Shannon's rate-distortion theorem establishing R(D) as the exact achievability boundary, and develop the Blahut–Arimoto algorithm for numerical computation via alternating minimization. The information bottleneck method extends rate-distortion to compression with relevance: minimizing I(X; T) while preserving I(T; Y), unifying lossy compression with representation learning. Applications to machine learning include the VAE loss as a rate-distortion Lagrangian, β-VAE as rate-distortion trade-off, and neural image compression as learned R(D)-optimal coding.
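
A compact Blahut–Arimoto sketch for a Bernoulli(0.3) source under Hamming distortion, with an illustrative slope parameter, checked against the closed form R(D) = H_b(p) − H_b(D):

```python
import numpy as np

def blahut_arimoto(p_x, dist, beta, n_iter=500):
    """One point on the R(D) curve for a discrete source at slope parameter beta."""
    q = np.full(dist.shape[1], 1.0 / dist.shape[1])    # output marginal q(x_hat)
    for _ in range(n_iter):
        W = q * np.exp(-beta * dist)                   # unnormalized test channel
        W /= W.sum(axis=1, keepdims=True)              # Q(x_hat | x)
        q = p_x @ W                                    # update the output marginal
    D = np.sum(p_x[:, None] * W * dist)                # average distortion
    R = np.sum(p_x[:, None] * W * np.log2(W / q))      # mutual information in bits
    return R, D

h = lambda t: -t * np.log2(t) - (1 - t) * np.log2(1 - t)
p = 0.3
p_x = np.array([1 - p, p])
hamming = np.array([[0.0, 1.0], [1.0, 0.0]])

R, D = blahut_arimoto(p_x, hamming, beta=2.5)
print("BA point:        R =", round(R, 4), " D =", round(D, 4))
print("closed form R(D) =", round(h(p) - h(D), 4))
```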

2 prerequisites
advanced geometry

Riemannian Geometry

Metric tensors, connections, and parallel transport on smooth manifolds

A Riemannian metric equips each tangent space of a smooth manifold with an inner product that varies smoothly from point to point — this single piece of structure unlocks lengths, angles, areas, curvature, and a canonical notion of differentiation. We construct the metric tensor in local coordinates and prove that every smooth manifold admits a Riemannian metric via partitions of unity. The musical isomorphisms (flat and sharp) provide the bridge between tangent and cotangent spaces, revealing that the gradient depends on the metric. The Fundamental Theorem of Riemannian Geometry establishes the existence and uniqueness of the Levi-Civita connection — the unique torsion-free, metric-compatible covariant derivative — whose Christoffel symbols are computed explicitly for the sphere and Poincaré disk. Parallel transport along curves, governed by a first-order ODE, is path-dependent on curved manifolds: the holonomy of a closed loop encodes curvature. The Riemannian volume form enables coordinate-invariant integration. Isometries and Killing vector fields capture the symmetries of a Riemannian manifold, with maximally symmetric spaces saturating the bound dim(Isom) = n(n+1)/2. The Fisher information metric on statistical parameter spaces connects this machinery to natural gradient descent and information geometry.
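
A small symbolic sketch, assuming the round metric on the unit sphere in (θ, φ) coordinates, that computes the Levi-Civita Christoffel symbols directly from the metric:

```python
import sympy as sp

theta, phi = sp.symbols('theta phi', positive=True)
coords = [theta, phi]

# Round metric on the unit sphere in (theta, phi) coordinates.
g = sp.Matrix([[1, 0],
               [0, sp.sin(theta) ** 2]])
g_inv = g.inv()

def christoffel(k, i, j):
    """Gamma^k_{ij} = 1/2 g^{kl} (d_i g_{lj} + d_j g_{li} - d_l g_{ij})."""
    return sp.simplify(sum(
        sp.Rational(1, 2) * g_inv[k, l] * (
            sp.diff(g[l, j], coords[i]) +
            sp.diff(g[l, i], coords[j]) -
            sp.diff(g[i, j], coords[l]))
        for l in range(2)))

for k in range(2):
    for i in range(2):
        for j in range(2):
            val = christoffel(k, i, j)
            if val != 0:
                print(f"Gamma^{coords[k]}_{{{coords[i]}{coords[j]}}} =", val)
```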

1 prerequisite
foundational information-theory

Shannon Entropy & Mutual Information

Information content, entropy, and mutual information — the quantities that measure uncertainty, surprise, and statistical dependence

Shannon entropy quantifies the irreducible uncertainty in a random variable — the minimum average number of bits needed to encode its outcomes. We develop the theory from the axioms that uniquely characterize entropy (Khinchin's theorem), through joint and conditional entropy with their chain rules, to mutual information as the symmetric measure of statistical dependence between random variables. The data processing inequality establishes that post-processing cannot create information, providing the theoretical foundation for representation learning. Differential entropy extends these ideas to continuous distributions, where the maximum-entropy distribution for fixed variance is Gaussian — a result that connects to the central role of Gaussians throughout statistics and machine learning. Shannon's source coding theorem makes entropy operationally meaningful: the entropy rate of a source is the fundamental limit of lossless compression, achieved by prefix codes satisfying Kraft's inequality. Applications to machine learning include cross-entropy loss as the KL divergence from the true distribution, mutual information for feature selection, and the InfoMax principle for representation learning.
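
A minimal numerical illustration on an arbitrary two-by-two joint distribution, computing the entropies and checking the identity I(X;Y) = H(X) + H(Y) − H(X,Y) against the KL form:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution of two dependent binary variables (illustrative numbers).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())
I_identity = H_X + H_Y - H_XY                                # I(X;Y) = H(X) + H(Y) - H(X,Y)
I_kl = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))     # I(X;Y) as KL(p_xy || p_x p_y)

print("H(X) =", round(H_X, 4), " H(X,Y) =", round(H_XY, 4))
print("I(X;Y):", round(I_identity, 4), "==", round(I_kl, 4))
```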

1 prerequisite
advanced topology

Sheaf Theory

From local data to global consistency — the algebraic geometry of gluing

Sheaves formalize the passage from local data to global consistency. We develop cellular sheaves on graphs from scratch — assigning vector spaces to nodes and edges with linear restriction maps — then construct the sheaf Laplacian, whose spectrum measures how far a data assignment is from being globally consistent. Sheaf cohomology gives algebraic invariants (H⁰ counts consistent sections, H¹ measures obstructions), and sheaf diffusion provides a dynamical process that drives local observations toward global agreement. Applications to sensor networks, opinion dynamics, and multi-modal fusion demonstrate why sheaves are the natural language for data that lives on a network.
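
A toy sketch of a cellular sheaf on a three-vertex path with one-dimensional stalks and identity restriction maps (the constant sheaf), building the coboundary and sheaf Laplacian and reading off H⁰ from the kernel:

```python
import numpy as np

# A cellular sheaf on the path graph 0 -- 1 -- 2 with 1-dimensional stalks.
# Restriction maps are scalars here; identity maps give the constant sheaf.
edges = [(0, 1), (1, 2)]
restriction = {                                 # (edge, endpoint) -> map into the edge stalk
    ((0, 1), 0): 1.0, ((0, 1), 1): 1.0,
    ((1, 2), 1): 1.0, ((1, 2), 2): 1.0,
}

n_vertices, n_edges = 3, len(edges)
delta = np.zeros((n_edges, n_vertices))         # coboundary: vertex data -> edge disagreements
for row, (u, v) in enumerate(edges):
    delta[row, u] = -restriction[((u, v), u)]
    delta[row, v] = +restriction[((u, v), v)]

L = delta.T @ delta                             # sheaf Laplacian

# H^0 = ker(delta) = ker(L): globally consistent assignments.
eigvals, eigvecs = np.linalg.eigh(L)
kernel_dim = int(np.sum(eigvals < 1e-10))
print("dim H^0 =", kernel_dim)                  # 1 for the constant sheaf on a connected graph
print("a global section:", np.round(eigvecs[:, 0], 3))      # constant up to scaling
```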

1 prerequisite
foundational topology

Simplicial Complexes

The combinatorial scaffolding that turns point clouds into topology

Simplicial complexes are the bridge between raw point cloud data and topological invariants. We build them from scratch — starting with the geometric intuition of simplices as generalized triangles, then constructing the Vietoris-Rips complex that underlies most of modern topological data analysis. Along the way, we develop the Euler characteristic and its relationship to Betti numbers, survey alternative complex constructions (alpha complexes, witness complexes), and connect the combinatorial theory to spectral graph theory, category theory, and sheaf theory.
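
A small sketch that builds the Vietoris-Rips complex (up to dimension 2) on a noisy circle and tracks the Euler characteristic as the scale parameter grows; the point cloud and scales are illustrative:

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(3)
# Noisy circle: a point cloud whose topology (one loop) we want to capture.
angles = rng.uniform(0, 2 * np.pi, 60)
points = np.c_[np.cos(angles), np.sin(angles)] + 0.05 * rng.normal(size=(60, 2))
D = squareform(pdist(points))

def rips_euler(D, eps):
    """Euler characteristic V - E + T of the Vietoris-Rips complex, truncated at dimension 2."""
    n = D.shape[0]
    edges = [(i, j) for i, j in combinations(range(n), 2) if D[i, j] <= eps]
    adj = {frozenset(e) for e in edges}
    triangles = sum(
        1 for i, j, k in combinations(range(n), 3)
        if {frozenset((i, j)), frozenset((j, k)), frozenset((i, k))} <= adj)
    return n - len(edges) + triangles

for eps in (0.2, 0.5, 1.0):
    print(f"eps = {eps}: chi =", rips_euler(D, eps))
```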

intermediate linear-algebra

Singular Value Decomposition

From rectangular matrices to the geometry of linear maps — the most useful factorization in data science

The Singular Value Decomposition (SVD) factorizes any matrix A = UΣVᵀ into orthogonal rotations and diagonal stretching — generalizing the Spectral Theorem from symmetric square matrices to all matrices. The singular values σ₁ ≥ ⋯ ≥ σᵣ > 0 measure the importance of each dimension, and the Eckart–Young–Mirsky theorem guarantees that the truncated SVD is the optimal low-rank approximation in both Frobenius and spectral norms. We prove the SVD via the Spectral Theorem applied to AᵀA, give the geometric rotate–stretch–rotate interpretation, derive the four fundamental subspaces and the pseudoinverse, implement truncated SVD for image compression and Latent Semantic Analysis, introduce the randomized SVD algorithm, and establish the SVD–PCA connection that makes the SVD the computational backbone of principal component analysis.
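
A quick numerical check of the Eckart–Young–Mirsky bounds on a random matrix with an illustrative truncation rank:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(60, 40))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]            # rank-k truncated SVD

# Eckart-Young-Mirsky: the truncation error is governed by the discarded singular values.
fro_err = np.linalg.norm(A - A_k, 'fro')
spec_err = np.linalg.norm(A - A_k, 2)
print("Frobenius error:", round(fro_err, 4),
      "== sqrt(sum of tail sigma^2):", round(float(np.sqrt(np.sum(s[k:] ** 2))), 4))
print("spectral error :", round(spec_err, 4), "== sigma_{k+1}:", round(float(s[k]), 4))
```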

1 prerequisite
intermediate geometry

Smooth Manifolds

Charts, tangent spaces, and the language of calculus on curved spaces

Smooth manifolds formalize the idea of a space that is locally Euclidean but globally curved — the setting where calculus, linear algebra, and topology converge. We build the theory from coordinate charts and atlases, which encode how to do calculus on curved spaces by transferring computations to flat Euclidean patches. Smooth compatibility of transition maps ensures that the notion of differentiability is chart-independent. Tangent vectors, defined as derivations on smooth functions, organize into tangent spaces that recover the linear algebra of the Spectral Theorem at every point. The differential (pushforward) of a smooth map generalizes the Jacobian to manifold-valued maps, and diffeomorphisms capture when two manifolds are 'the same' as smooth objects. Partitions of unity provide the technical bridge from local constructions to global results. The Whitney embedding theorem guarantees that every abstract manifold lives inside some Euclidean space, connecting the intrinsic and extrinsic viewpoints.

2 prerequisites
advanced bayesian-ml

Sparse Bayesian Priors

The global-local prior framework for sparse coefficients — horseshoe geometry and the Beta(1/2,1/2) shrinkage identity, spike-and-slab as theoretical anchor, the funnel pathology and the regularized horseshoe fix, R2-D2 as the variance-decomposition alternative, and VBMS-over-priors as the practitioner workflow.

Sparse regression in p >> n regimes asks for a prior on the coefficient vector that aggressively shrinks noise toward zero while leaving genuine signals untouched. The Bayesian sparse-prior menu — horseshoe (Carvalho-Polson-Scott 2010), regularized horseshoe (Piironen-Vehtari 2017), spike-and-slab (Mitchell-Beauchamp 1988), R2-D2 (Zhang-Reich-Bondell 2022) — implements this through the global-local shrinkage framework (Polson-Scott 2010), which characterizes sparsity-inducing priors via mass-at-zero and heavy-tail behavior on the implied marginal density. We develop the Polson-Scott criterion as the unifying language, prove the horseshoe's Beta(1/2, 1/2) shrinkage-factor identity that gives the prior its name, characterize the funnel pathology that breaks naive HMC and the non-centered + regularized-horseshoe fix that resolves it, work the four-prior comparison on a synthetic high-dimensional regression and on the diabetes benchmark with PyMC NUTS, and close with the VBMS escalation hierarchy applied to rank the four families on the same dataset using PSIS Pareto-k as the gating diagnostic.
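
A short Monte Carlo check of the Beta(1/2, 1/2) shrinkage identity, assuming unit global scale and noise variance so that the shrinkage factor is κ = 1/(1 + λ²) with λ drawn from a standard half-Cauchy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Local scales from a standard half-Cauchy (global scale tau = 1, unit noise variance).
lam = np.abs(stats.cauchy.rvs(size=200_000, random_state=rng))
kappa = 1.0 / (1.0 + lam ** 2)                  # shrinkage factor for each coefficient

# Compare the empirical distribution of kappa to Beta(1/2, 1/2).
ks = stats.kstest(kappa, stats.beta(0.5, 0.5).cdf)
print("KS statistic vs Beta(1/2,1/2):", round(ks.statistic, 4))
print("mean (Beta(1/2,1/2) says 0.5):", round(float(kappa.mean()), 4))
```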

6 prerequisites
intermediate bayesian-ml

Stacking & Predictive Ensembles

The M-open generalization of Bayesian Model Averaging: Yao 2018's predictive-density-weighted stacking, PSIS-LOO machinery for tractable leave-one-out from a single full-data fit per candidate, and the Wolpert-Breiman heritage that maps onto knowledge distillation as the deployment-time complement

Bayesian Model Averaging is the textbook answer to having multiple plausible models — average over them weighted by posterior model probability. In M-open settings, where the truth lies outside the candidate set, BMA's posterior collapses exponentially fast onto whichever candidate is closest in KL divergence to the truth, throwing away the predictive contributions of every other candidate. Yao, Vehtari, Simpson, and Gelman 2018 introduced Bayesian stacking as the predictive-density-weighted alternative that closes this oracle gap. We define the stacking weights as the simplex-constrained maximizers of the leave-one-out log predictive score, prove the underlying convex-optimization theorem with concavity, existence, and uniqueness, develop the PSIS-LOO machinery that makes the optimization tractable from a single full-data fit per candidate, fit four PyMC candidates (Bayesian linear regression, polynomial degree four, Gaussian process, BART) on a synthetic M-open ground truth, demonstrate that stacking dominates BMA on held-out predictive log-score while BMA's weights collapse onto the GP, historicize against Wolpert 1992 stacked generalization and Breiman 1996 NNLS-stacking and the Kaggle-era blending pattern, and close with knowledge distillation as the deployment-time complement that compresses the K-candidate ensemble into a single deployable student. The topic's running theme: when the truth is not in the candidate set, the best response is not to average over candidates by how confident we are that each is correct, but to average over candidates by how well each contributes to predicting held-out data.
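
A minimal sketch of the simplex-constrained weight optimization, using synthetic held-out predictive densities as a stand-in for the PSIS-LOO values a real workflow would supply:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)

# dens[i, k]: held-out predictive density p(y_i | y_{-i}, M_k) for point i under candidate k.
# Synthetic stand-in for the PSIS-LOO output of K = 3 fitted candidates.
n, K = 200, 3
dens = np.exp(rng.normal(loc=[-1.2, -1.0, -1.8], scale=0.5, size=(n, K)))

def neg_log_score(w):
    """Negative leave-one-out log predictive score of the mixture sum_k w_k p_k."""
    return -np.sum(np.log(dens @ w))

w0 = np.full(K, 1.0 / K)
res = minimize(neg_log_score, w0, method='SLSQP',
               bounds=[(0.0, 1.0)] * K,
               constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}])
print("stacking weights:", np.round(res.x, 3))
```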

4 prerequisites
advanced nonparametric-ml

Statistical Depth

Multivariate centrality scores — Tukey halfspace depth, the zoo of alternatives, asymptotic theory and NP-hardness, and ML applications via DD-classification, depth-based outlier detection, and functional depth

Statistical depth assigns each point in $\mathbb{R}^d$ a centrality score with respect to an empirical or population distribution, generalizing the median, quantile, and rank to dimensions where ordering is no longer well defined. The Tukey halfspace depth — the canonical construction — measures how surrounded a point is by the data: it is the smallest fraction of the sample on either side of any hyperplane through that point. Depth functions package multivariate location, spread, and outlyingness into a single object that obeys a clean axiomatic specification (Zuo and Serfling 2000), supports robust median estimation with breakdown $\ge 1/(d+1)$ (Donoho and Gasko 1992), and underwrites a small zoo of multivariate ML tools — the DD-classifier, depth-based outlier detection, and functional depth for curves and time series. The topic develops the asymptotic theory through full proofs of the Zuo–Serfling axioms for halfspace depth, contour convexity, and consistency, then sketches the Massé asymptotic distribution and the Aloupis NP-hardness result. Modern ML applications close the topic: distribution-free DD-classification on the harder Iris pair, masking-resistant outlier detection, and modified band depth on noisy functional data.
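
An approximate sketch using random projections (the random-Tukey-depth approximation rather than the exact, NP-hard computation); the data and direction count are illustrative:

```python
import numpy as np

def random_tukey_depth(x, data, n_dirs=2000, rng=None):
    """Approximate halfspace depth: minimize, over random directions u, the fraction
    of the sample lying in the closed halfspace {p : <p - x, u> >= 0}."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.normal(size=(n_dirs, data.shape[1]))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    proj = (data - x) @ u.T                     # shape (n_points, n_dirs)
    frac = np.minimum((proj >= 0).mean(axis=0), (proj <= 0).mean(axis=0))
    return float(frac.min())

rng = np.random.default_rng(7)
data = rng.normal(size=(500, 2))
print("depth of the sample mean:", round(random_tukey_depth(data.mean(axis=0), data, rng=rng), 3))
print("depth of a far outlier  :", round(random_tukey_depth(np.array([4.0, 4.0]), data, rng=rng), 3))
```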

3 prerequisites
advanced topology

Statistical TDA

Doing inference with persistence diagrams — from stability guarantees to hypothesis testing

Persistence diagrams from finite samples are random objects — some bars represent genuine topological features, others are sampling noise. Statistical TDA provides the tools to tell them apart: stability theorems guarantee robustness, bootstrap confidence sets separate signal from noise, persistence landscapes and images vectorize diagrams for standard statistical methods, and permutation tests enable topological hypothesis testing. We develop the full pipeline from theory to application, including a financial market regime detection case study.

3 prerequisites
advanced bayesian-ml

Stochastic-Gradient MCMC

From the Ma–Chen–Fox complete recipe to SGLD, SGHMC, RSGLD, and the head-to-head with NUTS — a unified Itô-SDE treatment of scalable Bayesian posterior sampling.

The Bayesian Neural Networks topic introduced SGLD and SGHMC as recipes that scale Bayesian posterior sampling to deep-learning regimes. This topic supplies the theory: SGLD and SGHMC are not two unrelated tricks but two instantiations of a single object — a stochastic differential equation whose stationary distribution is the posterior we want to sample. The Ma–Chen–Fox (2015) complete-recipe framework organizes every SG-MCMC variant inside one (D, Q) skew decomposition; specializing it gives overdamped Langevin → SGLD, underdamped Langevin → SGHMC, and Riemann-manifold variants in turn. The Vollmer–Zygalakis–Teh (2016) bound quantifies the price of mini-batching as O(η + 1/B), and a head-to-head with NUTS on Bayesian logistic regression pins down the regime in which SG-MCMC is the only practical option.
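
A minimal SGLD sketch on a conjugate Gaussian-mean model where the exact posterior is known; the step size and batch size are illustrative, and the visible over-dispersion of the chain is a direct look at the O(η + 1/B) discretization-plus-minibatch price:

```python
import numpy as np

rng = np.random.default_rng(8)

# Data: x_i ~ N(theta, 1) with a N(0, 10^2) prior on theta, so the exact posterior is Gaussian.
N, theta_true, prior_var = 10_000, 2.0, 100.0
x = rng.normal(theta_true, 1.0, size=N)
post_var = 1.0 / (1.0 / prior_var + N)          # exact posterior variance
post_mean = post_var * x.sum()                  # exact posterior mean

# SGLD: half-step on a mini-batch estimate of the log-posterior gradient plus injected noise.
eps, B, n_steps = 1e-5, 100, 20_000
theta, samples = 0.0, []
for t in range(n_steps):
    batch = x[rng.integers(0, N, size=B)]
    grad = -theta / prior_var + (N / B) * np.sum(batch - theta)   # stochastic grad of log posterior
    theta += 0.5 * eps * grad + np.sqrt(eps) * rng.normal()
    if t > n_steps // 2:
        samples.append(theta)

print("exact posterior: mean", round(post_mean, 4), " sd", round(float(np.sqrt(post_var)), 5))
print("SGLD estimate  : mean", round(float(np.mean(samples)), 4), " sd", round(float(np.std(samples)), 5))
```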

5 prerequisites
intermediate learning-theory

Structural Risk Minimization

The framework for choosing model complexity from data: penalize-and-pick across a nested hypothesis-class family, with Vapnik VC, Bartlett–Mendelson Rademacher, AIC/BIC/MDL, PAC-Bayes, and cross-validation as parallel instantiations of the same oracle inequality

Structural Risk Minimization (SRM) is the framework for choosing model complexity from data. Given a sequence of nested hypothesis classes of increasing capacity, SRM picks the index that minimizes training error plus a capacity penalty, then returns the ERM in the picked class. The penalty is a stand-in for the (unobservable) gap between training error and population risk; calibrating it correctly yields adaptive guarantees with rates matching the best-in-family choice — without knowing which class is best. We develop the framework at the intermediate level on a single anchor example (polynomial regression on a sin(πx) target with 50 noisy points) carried through every section: the bias-variance U-curve and the nested-family setup; the SRM oracle inequality and its proof; Vapnik's VC-dimension penalty and consistency; Bartlett–Mendelson's data-dependent Rademacher penalty; the continuous soft-SRM reformulation as penalized ERM, identifying Tikhonov ridge and lasso as soft SRM on the ℓ² and ℓ¹ balls; the cross-walk to AIC, BIC, and MDL as parametric SRM analogs; PAC-Bayes with KL divergence as a continuous capacity functional; cross-validation as a data-driven SRM surrogate; the agreement matrix across six rules; soft-margin SVMs as the canonical classical-SRM application and the regime where classical SRM goes loose on overparameterized neural networks; and finally the tightness, non-nested-family, and computational-cost considerations that bound what SRM can tell you.
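
A toy penalize-and-pick run on the anchor example, with a sqrt((d+1)/n) penalty as an illustrative stand-in for the VC and Rademacher penalties developed in the topic:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=n)

def train_mse(degree):
    """Training error of the ERM in the degree-d polynomial class."""
    coef = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

# Penalize-and-pick: training error plus a capacity penalty growing with degree.
degrees = range(1, 11)
scores = {d: train_mse(d) + 0.5 * np.sqrt((d + 1) / n) for d in degrees}
best = min(scores, key=scores.get)

print("picked degree:", best)
for d in sorted({1, 3, best, 10}):
    print(f"  degree {d:2d}: train MSE {train_mse(d):.3f}  penalized {scores[d]:.3f}")
```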

4 prerequisites
advanced linear-algebra

Tensor Decompositions

From multi-way arrays to modern factorization — generalizing the SVD and PCA to higher-order data

Tensor decompositions generalize matrix factorizations to multi-way arrays, preserving multi-mode structure while achieving compression, denoising, and interpretable factor extraction. We develop five decomposition methods: CP (CANDECOMP/PARAFAC) with Kruskal's essential uniqueness theorem and Alternating Least Squares, the Tucker decomposition as multilinear PCA with its connection to mode-wise dimensionality reduction, the HOSVD (Higher-Order SVD) as the natural generalization of the matrix SVD with approximation bounds, the Tensor Train (TT) format that scales linearly in the number of modes, breaking the exponential curse of dimensionality, and the t-SVD (tensor SVD via the Fourier domain), which provides the first exact Eckart–Young optimality theorem for tensors. The mathematical foundation draws on the Spectral Theorem (eigendecomposition of mode-n scatter matrices), the SVD (applied independently to each mode unfolding or Fourier-domain frontal slice), and PCA (Tucker as multilinear PCA along each tensor mode).
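
A compact HOSVD sketch in numpy (mode-n unfoldings, per-mode SVDs, core via mode products), with illustrative tensor sizes and multilinear ranks:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding: mode-n fibers become the columns of a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, M, mode):
    """Multiply tensor X by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, mode)), 0, mode)

def hosvd(X, ranks):
    """Truncated higher-order SVD: factor matrices from each unfolding, then the core."""
    factors = [np.linalg.svd(unfold(X, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = X
    for m, U in enumerate(factors):
        core = mode_product(core, U.T, m)
    return core, factors

rng = np.random.default_rng(10)
# Low-multilinear-rank tensor plus noise (illustrative sizes and ranks).
G = rng.normal(size=(3, 4, 2))
A, B, C = (rng.normal(size=(s, r)) for s, r in [(30, 3), (25, 4), (20, 2)])
X = mode_product(mode_product(mode_product(G, A, 0), B, 1), C, 2)
X += 0.01 * rng.normal(size=X.shape)

core, factors = hosvd(X, ranks=(3, 4, 2))
X_hat = core
for m, U in enumerate(factors):
    X_hat = mode_product(X_hat, U, m)
print("relative reconstruction error:",
      round(float(np.linalg.norm(X - X_hat) / np.linalg.norm(X)), 5))
```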

3 prerequisites
advanced topology

The Mapper Algorithm

A topological lens for high-dimensional data — building interpretable graphs from point clouds via the Nerve Theorem

The Mapper algorithm constructs a compressed graph summary of high-dimensional data by pulling back an interval cover through a filter function, clustering within each pullback, and taking the nerve of the resulting cover. The Nerve Theorem guarantees that this graph faithfully captures the data's topology. We develop the full construction from first principles, build Mapper from scratch, compare with the KeplerMapper library, analyze parameter sensitivity, and apply the algorithm to financial market regime detection.
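
A from-scratch toy Mapper, assuming a filter given by the first coordinate, an overlapping interval cover, and single-linkage clustering at a fixed scale inside each preimage; all parameters and the sample point cloud are illustrative:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mapper(points, filt, n_intervals=6, overlap=0.3, eps=0.4):
    """Minimal Mapper: overlapping interval cover of the filter range, connected
    components at scale eps inside each preimage, then the nerve graph."""
    lo, hi = filt.min(), filt.max()
    length = (hi - lo) / n_intervals
    nodes = []                                          # each node = a set of point indices
    for i in range(n_intervals):
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        idx = np.where((filt >= a) & (filt <= b))[0]
        if len(idx) == 0:
            continue
        # Connected components of the eps-neighborhood graph restricted to this preimage.
        D = squareform(pdist(points[idx]))
        unvisited, comps = set(range(len(idx))), []
        while unvisited:
            stack, comp = [unvisited.pop()], set()
            while stack:
                j = stack.pop(); comp.add(j)
                nbrs = set(np.where(D[j] <= eps)[0]) & unvisited
                unvisited -= nbrs; stack.extend(nbrs)
            comps.append(set(idx[list(comp)]))
        nodes.extend(comps)
    # Nerve: connect two nodes whenever their point sets overlap.
    edges = {(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    return nodes, edges

rng = np.random.default_rng(12)
theta = rng.uniform(0, 2 * np.pi, 400)
circle = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(400, 2))
nodes, edges = mapper(circle, filt=circle[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")        # a cycle-shaped graph for a circle
```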

3 prerequisites
foundational linear-algebra

The Spectral Theorem

From eigenvalues to geometry — why symmetric matrices are the best-behaved objects in linear algebra

The Spectral Theorem guarantees that every real symmetric matrix admits an orthogonal eigendecomposition A = QΛQᵀ with real eigenvalues and orthonormal eigenvectors. This foundational result underpins PCA, spectral clustering, kernel methods, and optimization via Hessians. We develop the theorem from first principles — proving the three key lemmas (real eigenvalues, orthogonal eigenvectors, invariant complements), giving the full inductive proof, establishing the Rayleigh quotient and Courant–Fischer variational characterization, classifying quadratic forms via definiteness, implementing power iteration and the QR algorithm, and applying the theorem to spectral clustering and PCA.
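
A minimal power-iteration sketch on a random symmetric matrix, checked against numpy's eigensolver:

```python
import numpy as np

def power_iteration(A, n_iter=1000):
    """Dominant eigenpair of a symmetric matrix by repeated multiplication and normalization."""
    v = np.random.default_rng(0).normal(size=A.shape[0])
    for _ in range(n_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v, v              # Rayleigh quotient and eigenvector

rng = np.random.default_rng(11)
B = rng.normal(size=(6, 6))
A = (B + B.T) / 2                    # symmetric test matrix

lam, v = power_iteration(A)
w = np.linalg.eigvalsh(A)
print("power iteration estimate:", round(float(lam), 6))
print("eigenvalue of largest magnitude:", round(float(w[np.argmax(np.abs(w))]), 6))
```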

advanced bayesian-ml

Variational Bayes for Model Selection

The variational evidence approximation as a model-selection criterion — Laplace bridge to the BIC, KL projection bias, importance-weighted and annealed bias correction, and the bits-back coding interpretation

Variational inference replaces the intractable log-evidence with a free lower bound, the ELBO. Variational Bayes for model selection asks: can we use that lower bound to compare models? The answer is yes, but with a caveat — the ELBO is a biased estimator of log p(x | M), the bias has structure (the reverse-KL gap between the variational family and the true posterior), and the bias-direction is informative when the variational family treats competing models symmetrically and misleading when it doesn't. We develop the variational evidence approximation as a model-selection criterion, derive its asymptotic equivalence to the BIC via Laplace expansion, work through the Bayesian-GMM automatic-relevance-determination case (Bishop §10.2) and the polynomial-regression Bayes-factor case end-to-end, characterize when the KL projection bias preserves rankings versus when it flips them, develop importance-weighted ELBO and annealed importance sampling as bias-correction tools with monotone tightness guarantees, and close with the bits-back coding interpretation that places VBMS within the MDL framework and a comparison to BIC, WAIC, LOO-CV, BMA, and stacking as alternative model-selection criteria.

5 prerequisites
intermediate bayesian-ml

Variational Inference

Posterior approximation by reverse-KL projection onto a tractable family — the ELBO, mean-field CAVI, stochastic and reparameterization gradients, and normalizing-flow variational families

Variational inference is the projection-based response to the intractable evidence in Bayesian inference: pick a tractable family of distributions over the latent variables, find the member closest to the posterior under reverse KL, and report it. The recipe trades the exact-but-slow of MCMC for the approximate-but-fast of optimization, with structural error controlled by the variational family. We develop the evidence lower bound (ELBO) from a single application of Jensen's inequality, prove the evidence decomposition $\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q \,\|\, p(\cdot \mid x))$ that turns reverse-KL minimization into a tractable optimization, and develop the mean-field family with coordinate-ascent variational inference (CAVI) on a Bayesian Gaussian mixture. Modern VI moves to stochastic and black-box gradients: stochastic variational inference for big-data conjugate models, the score-function gradient for non-conjugate likelihoods, and the reparameterization gradient that powers automatic differentiation variational inference (ADVI) and every modern variational autoencoder. The topic closes with the structural error of mean-field — Gaussian targets are projected onto axis-aligned Gaussians whose marginal variances are systematically too small — and lifts the limitation through full-rank Gaussian VI and normalizing-flow variational families on a banana-shaped target where mean-field, full-rank, and flows form a strict hierarchy of approximation quality. Modern Bayesian computation is increasingly hybrid: VI for development and large-scale models, MCMC as the inferential reference whenever the bias matters.
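
A closed-form check of the mean-field pathology on a correlated two-dimensional Gaussian target: the reverse-KL-optimal factorized Gaussian keeps the conditional precisions, so its marginal variances come out too small:

```python
import numpy as np

rho = 0.9
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])                  # correlated Gaussian target
Lambda = np.linalg.inv(Sigma)                   # target precision matrix

# Reverse-KL-optimal factorized Gaussian: each factor has precision Lambda_ii,
# i.e. marginal variance 1 / Lambda_ii, which understates the true marginal Sigma_ii.
mf_var = 1.0 / np.diag(Lambda)

print("true marginal variances:", np.diag(Sigma))
print("mean-field variances   :", np.round(mf_var, 4))      # 1 - rho^2 = 0.19 here
print("underestimation factor :", np.round(np.diag(Sigma) / mf_var, 3))
```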

3 prerequisites
intermediate learning-theory

VC Dimension

Combinatorial capacity, Sauer–Shelah, and uniform convergence for binary hypothesis classes

The Vapnik–Chervonenkis dimension is the combinatorial summary of a binary hypothesis class's capacity — the largest sample size on which the class produces every conceivable labeling. Finiteness of this single integer is the necessary and sufficient condition for distribution-free uniform convergence of empirical risk to population risk, and the Sauer–Shelah lemma converts it into a polynomial growth function Π(n) ≤ (en/d)^d that powers the Fundamental Theorem of Statistical Learning. This topic develops VC dimension at the intermediate level on two anchor classes (half-planes and axis-aligned rectangles in the plane) carried through every section. §§1–4 introduce the motivation, shattering, growth function, and VC dimension itself. §5 proves the Sauer–Shelah lemma in full via Pajor's shifting trick. §6 chains Sauer–Shelah with Hoeffding through symmetrization to produce the FTSL (both upper and lower bound directions). §7 separates realizable (1/ε) from agnostic (1/ε²) rates. §8 computes the VC dimension for canonical classes — half-spaces via Radon's theorem, axis-rectangles via pigeonhole, and neural networks via the Bartlett bound. §9 bridges to Rademacher complexity. §10 is the integrative empirical-shatter-check experiment that verifies every bound on the same two anchor classes. §§11–13 cover ML applications (SVM margins, decision-tree pruning, the deep-learning puzzle), computational notes, and forward-pointers.
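
A brute-force shatter check for half-planes in the plane, testing each labeling with a linear-separability feasibility LP; the point configurations are illustrative:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    """Feasibility LP: does some half-plane y = sign(w.x + b) realize this labeling?"""
    A_ub = -labels[:, None] * np.c_[points, np.ones(len(points))]
    b_ub = -np.ones(len(points))
    res = linprog(np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method='highs')
    return res.success

def shattered(points):
    """A set is shattered if every +/-1 labeling is realizable."""
    return all(separable(points, np.array(lab))
               for lab in product([-1, 1], repeat=len(points)))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])       # general position
square   = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print("3 points shattered by half-planes:", shattered(triangle))   # True
print("4 points shattered by half-planes:", shattered(square))     # False: VC dimension is 3
```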

2 prerequisites