
Structural Risk Minimization

The framework for choosing model complexity from data: penalize-and-pick across a nested hypothesis-class family, with Vapnik VC, Bartlett–Mendelson Rademacher, AIC/BIC/MDL, PAC-Bayes, and cross-validation as parallel instantiations of the same oracle inequality

Motivation: why capacity control matters

The fixed-class ERM trap

Empirical risk minimization (ERM) — picking the hypothesis $\hat h \in \mathcal{H}$ that minimizes training error — is the foundational learning algorithm, and it behaves as advertised as long as the class $\mathcal{H}$ is small enough that uniform convergence kicks in (the Fundamental Theorem of Statistical Learning from PAC Learning is the formal statement). But "small enough" is doing all the work. If $\mathcal{H}$ is too small, we can't represent the regression function well and bias dominates. If $\mathcal{H}$ is too rich, we fit the noise and variance dominates. The whole game in learning theory is navigating this trade-off, and the irritating fact is that we can't navigate it by choosing $\mathcal{H}$ upfront: the best class to run ERM on depends on the data we haven't seen yet — the sample size, the noise level, the smoothness of the target.

Make this concrete. Fit polynomials to $n = 50$ noisy points drawn from $m(x) = \sin(\pi x)$. Degree-1 polynomials underfit; they can't bend. Degree-15 polynomials interpolate the noise and oscillate wildly between data points. Somewhere in between is a degree that captures the smooth structure without chasing the wiggles. ERM on $\mathcal{H}_1$ gives us a bad linear fit; ERM on $\mathcal{H}_{15}$ gives us a wild oscillator. "Which hypothesis has the lowest training error?" is the wrong question. The right question is: which class $\mathcal{H}_k$ should we run ERM on?

Bias and variance as functions of capacity

Fix a regression target $m: \mathcal{X} \to \mathbb{R}$ and a noise process $Y = m(X) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. For a nested sequence of classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$, let $\hat h_k$ be the ERM in $\mathcal{H}_k$ on a sample of size $n$. The pointwise mean-squared test error against a fresh $Y' = m(x) + \varepsilon'$ decomposes as the usual bias-variance identity (see formalStatistics: linear-regression for the parametric derivation):

$$\mathbb{E}\left[(\hat h_k(x) - Y')^2\right] = \underbrace{\bigl(\mathbb{E}[\hat h_k(x)] - m(x)\bigr)^2}_{\text{bias}^2(k, x)} + \underbrace{\mathrm{Var}(\hat h_k(x))}_{\text{variance}(k, x)} + \sigma^2,$$

where the expectation is over the training sample $S \sim D^n$ and the fresh noise $\varepsilon'$, and $\sigma^2$ is the irreducible noise floor. The error against the regression function alone — $\mathbb{E}[(\hat h_k(x) - m(x))^2]$ — drops the $\sigma^2$ term and is just $\text{bias}^2(k, x) + \text{variance}(k, x)$. Two observations carry the section. Bias is monotone non-increasing in $k$: a larger nested class can only approximate $m$ at least as well as a smaller one. Variance is monotone non-decreasing in $k$: a richer class has more freedom to chase noise, so $\hat h_k$ moves around more across resamplings of $S$. Add them and integrate over $x$ to get the expected MSE; the result is a U-curve in $k$.
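A minimal Monte Carlo sketch of this decomposition on the §1 toy, assuming the figure's settings ($n = 50$, $\sigma = 0.2$, $B = 100$ resamples); the evaluation grid and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, B = 50, 0.2, 100            # sample size, noise level, Monte Carlo resamples
x_grid = np.linspace(-1, 1, 200)      # evaluation points for the pointwise decomposition
m = lambda x: np.sin(np.pi * x)       # regression target

mse = {}
for k in range(16):                   # polynomial degrees 0..15
    preds = np.empty((B, x_grid.size))
    for b in range(B):
        X = rng.uniform(-1, 1, n)
        Y = m(X) + sigma * rng.standard_normal(n)
        coef = np.polyfit(X, Y, deg=k)                 # ERM in H_k (least squares)
        preds[b] = np.polyval(coef, x_grid)
    bias2 = (preds.mean(axis=0) - m(x_grid)) ** 2      # squared bias, pointwise in x
    var = preds.var(axis=0)                            # variance across resamplings of S
    mse[k] = float((bias2 + var).mean())               # integrate over x; sigma^2 noise floor dropped

k_star = min(mse, key=mse.get)
print(f"bias-variance optimum k* = {k_star}")
```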

The minimum of that U-curve — the true optimal $k^*(n)$ — is what we'd pick if we could see the population. The problem is that we can't; the training error $\hat L_n(\hat h_k)$ is monotone non-increasing in $k$ (more flexibility never hurts in-sample) and bottoms out at zero by interpolation once $k$ is large enough. Empirical risk alone tells us a strictly wrong story about how to pick $k$.

Why the "true" optimal capacity depends on $n$

The U-curve isn't fixed; it shifts with $n$. For ordinary least squares with $k+1$ parameters, the variance term at degree $k$ scales like $\sigma^2 (k+1) / n$ — the textbook Gauss–Markov consequence (formalStatistics: linear-regression derives it). Doubling $n$ halves the variance contribution without touching the bias, so the U-curve flattens on the right and the optimum $k^*(n)$ shifts up: with more data, we can afford more complexity.

This is the central reason no fixed choice of $\mathcal{H}$ works across problem sizes. A degree-3 ERM that's near-optimal at $n = 50$ leaves bias on the table at $n = 500$; a degree-9 ERM that's near-optimal at $n = 500$ overfits catastrophically at $n = 50$. The optimal capacity is a function of $n$, so the model class itself has to be chosen as a function of the data.

SRM in one paragraph: penalize-and-pick

Structural Risk Minimization (SRM) is the recipe for choosing $\mathcal{H}_k$ from data. (1) Fix in advance a nested family $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$, ordered by increasing capacity. (2) For each $\mathcal{H}_k$, define a capacity-dependent penalty $\mathrm{pen}(\mathcal{H}_k, n, \delta)$ that grows with the capacity of $\mathcal{H}_k$ and shrinks with $n$. (3) Pick the class $\hat k$ that minimizes training error plus penalty, and return $\hat h_{\hat k}$, the ERM on $\mathcal{H}_{\hat k}$:

$$\hat k \in \arg\min_{k \ge 1} \left\{\hat L_n(\hat h_k) + \mathrm{pen}(\mathcal{H}_k, n, \delta)\right\}, \qquad \hat h_{\hat k} = \arg\min_{h \in \mathcal{H}_{\hat k}} \hat L_n(h).$$

The penalty plays the role of a stand-in for the (unobservable) variance. When it's calibrated correctly — meaning the penalty upper-bounds the gap $L(\hat h_k) - \hat L_n(\hat h_k)$ with probability $\ge 1 - \delta$ uniformly across $k$ — the picked class $\hat k$ provably tracks $k^*(n)$ as $n$ grows. The rest of this topic is about how to calibrate the penalty (VC dimension in §4, Rademacher complexity in §5, PAC-Bayes in §9) and what the trade-offs are.

Figure: the bias-variance U-curve as a function of polynomial degree $k$. Left: bias², variance, and MSE at $n = 50$, $\sigma = 0.2$; the MSE bottoms out at $k^* \approx 5$. Right: MSE at $n \in \{25, 50, 100, 500\}$; the optimum shifts rightward as $n$ grows — the central claim of §1.3.

The nested-family setup

Definition: nested hypothesis-class family

A nested hypothesis-class family is a sequence $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ of hypothesis classes indexed by $k \in \mathbb{N}$, satisfying two conditions: nesting ($\mathcal{H}_j \subseteq \mathcal{H}_k$ whenever $j \le k$), and capacity monotonicity (a capacity functional $C: \{\mathcal{H}_k\} \to \mathbb{R}_{\ge 0}$ — VC dimension, Rademacher complexity, or effective degrees of freedom — non-decreasing in $k$ with $C(\mathcal{H}_k) \to \infty$).

Definition 1 (Nested hypothesis-class family).

A nested hypothesis-class family is a sequence $(\mathcal{H}_k)_{k \ge 1}$ of hypothesis classes with $\mathcal{H}_j \subseteq \mathcal{H}_k$ whenever $j \le k$, equipped with a capacity functional $C: \{\mathcal{H}_k\}_{k \ge 1} \to \mathbb{R}_{\ge 0}$ that is non-decreasing in $k$ with $C(\mathcal{H}_k) \to \infty$ as $k \to \infty$. The ambient class is $\mathcal{H}_\infty = \bigcup_{k \ge 1} \mathcal{H}_k$.

SRM never runs ERM on $\mathcal{H}_\infty$; it always runs it on a class $\mathcal{H}_{\hat k}$ of finite capacity chosen from the family. Older Russian-school literature (Vapnik 1995) calls a nested family a filtration of the ambient class; the word is borrowed from probability theory, where it likewise denotes an increasing nested family.

The indexing doesn't have to be discrete — continuous-parameter families like RKHS norm balls $\mathcal{H}_r = \{f \in \mathcal{H}_K : \|f\|_K \le r\}$ are nested in $r$ and admit the same SRM analysis; this is the soft-SRM view we'll formalize in §6. The choice of which nested family to use is a structural modeling decision that happens before seeing the data — the analog of choosing a parametric model in classical statistics. SRM picks $\hat k$ from the family, not the family itself. Picking the family poorly is a different (and unaddressed) failure mode.

Canonical examples

Four families recur throughout the topic and across the broader ML literature.

Polynomials by degree. $\mathcal{H}_k = \{p: [-1, 1] \to \mathbb{R} \mid \deg p \le k\}$, the toy from §1. The capacity is $C(\mathcal{H}_k) = k + 1$ — the dimension of the parameter space; for the threshold-classification variant of the same family, the VC dimension is also $k + 1$. This is the example we'll use for all numerics through §11.

SVMs by inverse margin. $\mathcal{H}_\gamma = \{h \mid h(x) = \mathrm{sign}(\langle w, x \rangle + b),\ \|w\| \le 1/\gamma\}$, the large-margin family from Vapnik (1995). Larger $\gamma$ means a smaller class — the linear classifiers are restricted to those that separate with margin at least $\gamma$. Capacity scales as $1/\gamma^2$ for the worst-case Rademacher bound, independent of input dimension. This is the celebrated dimension-free property that motivates kernel methods, and it anchors §12.1.

Neural networks by width. $\mathcal{H}_w = \{\text{depth-}L \text{ feedforward nets with} \le w \text{ neurons per layer}\}$. The nesting is honest — a width-$w$ net is a special case of a width-$(w+1)$ net by zeroing out the last neuron — but the classical capacity measures are loose enough that the SRM bound is uninformative for any realistic deep net. This is the family where classical theory breaks (§12.2–12.4) and where implicit regularization takes over.

RKHS norm balls. For a reproducing-kernel Hilbert space $\mathcal{H}_K$ with kernel $K$, the family $\mathcal{H}_r = \{f \in \mathcal{H}_K : \|f\|_K \le r\}$ is nested in $r$, with empirical Rademacher complexity $\hat{\mathfrak{R}}_n(\mathcal{H}_r) \le r \sqrt{\sum_i K(x_i, x_i)}\,/n$ (Bartlett and Mendelson 2002). The squared-norm penalty $\lambda \|f\|_K^2$ implements SRM on $\{\mathcal{H}_r\}_{r \ge 0}$ in soft form — the §7.1 connection.

Others fit the same template — decision trees by depth (with pruning), boosted ensembles by round count, sparse linear models by $\ell_0$-norm. The unifying picture is an indexed family of classes, monotone in some capacity functional, with the index $k$ the free parameter the algorithm gets to set.
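For the RKHS norm-ball example above, the kernel-trace bound is one line of arithmetic. A small sketch, with the radius $r$, the sample, and the kernel choices as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 1.0                                  # sample size and ball radius r (assumed)
X = rng.uniform(-1, 1, n)

# Bartlett–Mendelson bound for the ball {f in H_K : ||f||_K <= r}:
#   R_hat_n(H_r) <= r * sqrt(sum_i K(x_i, x_i)) / n
rbf_diag = np.ones(n)                           # RBF kernel: K(x, x) = 1 for every x
poly_diag = (1 + X**2) ** 3                     # cubic polynomial kernel: K(x, x) = (1 + x^2)^3
for name, diag in [("RBF", rbf_diag), ("poly-3", poly_diag)]:
    print(f"{name:6s} bound: {r * np.sqrt(diag.sum()) / n:.4f}")
```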

Capacity measures: VC dimension, Rademacher complexity, effective DoF

Different SRM bounds need different capacity functionals. Three are standard, and the choice between them in a specific bound has both theoretical and practical consequences.

VC dimension ($\dim_{\mathrm{VC}}(\mathcal{H})$). For binary classification, defined as the size of the largest set $\mathcal{H}$ can shatter — fully developed in VC Dimension and introduced in PAC Learning. VC dimension is a distribution-free, worst-case measure: it captures the largest possible generalization gap over all data-generating distributions. The Vapnik-style SRM bound (§4) uses VC dimension — universal but typically loose.

Rademacher complexity ($\hat{\mathfrak{R}}_n(\mathcal{H})$, $\mathfrak{R}_n(\mathcal{H})$). Defined as $\hat{\mathfrak{R}}_n(\mathcal{H}) = \mathbb{E}_{\sigma}[\sup_{h \in \mathcal{H}} \tfrac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)]$, where the $\sigma_i$ are i.i.d. Rademacher variables (uniform on $\{\pm 1\}$); the population version takes a further expectation over $X$ (Generalization Bounds). Rademacher complexity is distribution-dependent — it sees the actual training sample — and is typically sharper than VC dimension when $\mathcal{H}$ doesn't shatter the data the distribution puts mass on. The Bartlett–Mendelson SRM bound (§5) uses it.

Effective degrees of freedom. For linear smoothers $\hat Y = S Y$ — least squares, ridge, kernel smoothers — the effective DoF is $\mathrm{tr}(S)$. For ordinary least squares with $k+1$ parameters this equals $k+1$ exactly; for ridge regression with penalty $\lambda$ it's $\mathrm{tr}(\mathbf{X}(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top)$, which is strictly less than $k+1$ for $\lambda > 0$ and decreases as $\lambda$ grows. Effective DoF is the capacity functional underlying AIC, BIC, and the SRM-as-regularization story in §7; it's specific to linear or near-linear estimators but is the practitioner's everyday measure.

The three are related but not interchangeable. For a class of VC dimension $d$, the worst-case Rademacher complexity satisfies $\mathfrak{R}_n(\mathcal{H}) = O(\sqrt{d \log n / n})$ — a Sauer–Shelah consequence. For our polynomial-regression toy, all three measures grow linearly in $k$, which is why the same SRM picture works regardless of which one we use. For richer model classes — especially deep nets — they diverge sharply, which is part of §12's story.

What goes wrong with non-nested families

Some natural model families aren't nested. The clean example is $k$-nearest neighbors: the class $\mathcal{H}_k^{k\mathrm{NN}}$ of $k$-NN rules indexed by neighbor count is not nested under inclusion — a 3-NN rule is not a special case of a 5-NN rule. Decision trees indexed by depth are nested under pruning; trees indexed by leaf count typically are not. SVMs with different kernels (linear, RBF, polynomial) form a discrete unordered set, not a chain.

The fix is to drop the literal-inclusion requirement and keep only the capacity-indexed part. Given any countable indexed family $\{\mathcal{H}_k\}_{k \ge 1}$ with capacity $C_k$ monotone non-decreasing, the union-bound argument that produces the SRM penalty (§3) goes through unchanged — we never actually used $\mathcal{H}_j \subset \mathcal{H}_k$ in the derivation, only that we'd allocated a per-class confidence budget $\delta_k$ with $\sum_k \delta_k \le \delta$. The same SRM estimator $\hat k = \arg\min_k \{\hat L_n(\hat h_k) + \mathrm{pen}(\mathcal{H}_k)\}$ is well-defined, and the same oracle inequality (§3.3) holds with the same proof.

In Vapnik's original formulation the families are nested for cleanliness, but everything generalizes to a capacity-indexed sequence (Shawe-Taylor et al. 1998 treat the data-dependent and non-nested cases explicitly). The cost of relaxing nesting is interpretive, not formal: the bias term doesn't decompose monotonically when classes overlap unpredictably, so the "more data lets us afford more capacity" story becomes less crisp. For most practical applications — degree-indexed polynomials, depth-indexed trees, width-indexed nets, $C$-indexed SVMs — the families are nested and the cleaner version applies.

Visualizing the polynomial ladder

The figure below makes the nested family concrete. On a single training sample of $n = 50$ from the §1 toy, we fit polynomials of degrees $k \in \{0, 1, 3, 5, 10, 15\}$ and plot the six fits as a 2×3 panel grid, with the training data and the true function $m(x) = \sin(\pi x)$ overlaid on each panel. The reader should see the ladder of capacity unfold: $k = 0$ is a horizontal line (the mean of $Y$), $k = 1$ is a tilted line that still misses the curvature, $k = 3$ is the first fit that captures the sinusoidal shape (matching the leading-order Taylor expansion $\sin(\pi x) \approx \pi x - (\pi x)^3 / 6$), $k = 5$ is essentially the optimum, $k = 10$ starts oscillating between training points, and $k = 15$ exhibits catastrophic Runge-style oscillation near the endpoints. The training error (printed in each panel title) decreases monotonically with $k$, confirming the "ERM alone tells the wrong story" thesis from §1.

Figure: the polynomial ladder — fits at degrees 0, 1, 3, 5, 10, 15 to $n = 50$ noisy samples from $\sin(\pi x)$, training MSE printed per panel. The fit captures the sinusoidal shape from $k = 3$ onward; $k = 10$ and $k = 15$ visibly chase noise. Training error decreases monotonically in $k$, so ERM picks the lowest training error and is silently wrong.
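A sketch that reproduces the ladder's numbers: one sample, ERM at each rung, training MSE printed per degree (seed and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 0.2
X = rng.uniform(-1, 1, n)
Y = np.sin(np.pi * X) + sigma * rng.standard_normal(n)

for k in (0, 1, 3, 5, 10, 15):
    coef = np.polyfit(X, Y, deg=k)                            # ERM in H_k
    train_mse = np.mean((np.polyval(coef, X) - Y) ** 2)
    print(f"degree {k:2d}: training MSE = {train_mse:.4f}")   # monotone non-increasing in k
```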

The SRM principle

The per-class confidence allocation

The SRM principle starts from a per-class uniform convergence bound. For each class $\mathcal{H}_k$ in our nested family, suppose we have a high-probability bound on the worst-case gap between training error and population risk: there is a function $\varepsilon_k(\delta_k, n)$ such that, on a sample of size $n$,

$$\Pr_{S \sim D^n}\left[\sup_{h \in \mathcal{H}_k} \bigl|L(h) - \hat L_n(h)\bigr| \le \varepsilon_k(\delta_k, n)\right] \ge 1 - \delta_k. \qquad (\dagger)$$

The function $\varepsilon_k$ is what the §4 (Vapnik VC) and §5 (Bartlett–Mendelson Rademacher) developments specify; for now it's a placeholder. The reader can think of $\varepsilon_k(\delta_k, n) = c \sqrt{(d_k + \log(1/\delta_k))/n}$ for some constant $c$ and the VC dimension $d_k$ of $\mathcal{H}_k$, but the SRM derivation that follows doesn't depend on the specific form — only on the existence of some per-class uniform convergence bound.

The catch: $(\dagger)$ holds with probability $\ge 1 - \delta_k$ for the fixed class $\mathcal{H}_k$ in isolation. We're going to pick from among all the classes $\mathcal{H}_1, \mathcal{H}_2, \ldots$ based on the same sample $S$, so we need a bound that holds simultaneously across all classes. By the union bound,

$$\Pr_{S}\left[\exists k \ge 1:\ \sup_{h \in \mathcal{H}_k} \bigl|L(h) - \hat L_n(h)\bigr| > \varepsilon_k(\delta_k, n)\right] \le \sum_{k=1}^\infty \delta_k.$$

For the right-hand side to equal our target failure probability $\delta$, we need an allocation $\{\delta_k\}_{k \ge 1}$ of confidence mass across classes with $\sum_k \delta_k \le \delta$. The canonical choice — Vapnik 1995 — is

$$\delta_k = \frac{6 \delta}{\pi^2 k^2}, \qquad k = 1, 2, 3, \ldots,$$

which sums to $\delta$ exactly via the Basel identity $\sum_{k \ge 1} 1/k^2 = \pi^2/6$.

The choice isn't unique. The telescoping allocation $\delta_k = \delta/(k(k+1))$ also sums to $\delta$ exactly; the geometric allocation $\delta_k = \delta \cdot 2^{-k}$ does too. What distinguishes the $1/k^2$ choice is its growth rate inside the $\log(1/\delta_k)$ term that ends up in the penalty:

$$\log(1/\delta_k) = \log\frac{\pi^2 k^2}{6 \delta} = 2 \log k + \log\frac{\pi^2}{6 \delta} = 2 \log k + O(\log(1/\delta)).$$

So when we substitute back into $\varepsilon_k$, we pick up an additive $2 \log k$ term in the bound — logarithmic growth in $k$. The geometric allocation $\delta_k = \delta \cdot 2^{-k}$ would give $k \log 2$ — linear growth in $k$, a much harsher penalty for large classes. The polynomial-decay allocation is canonical precisely because it gives the slowest-growing penalty consistent with the union bound.

This $2 \log k$ term is the "price" we pay for considering an infinite nested family rather than a single fixed class. In practical applications $k$ ranges over a small finite set ($k \in \{0, \ldots, 15\}$ in our polynomial toy), and the $2 \log k$ contribution is dominated by the capacity term $d_k$.
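A quick numerical check of the union-bound cost $\log(1/\delta_k)$ under the three allocations, at $\delta = 0.05$:

```python
import numpy as np

delta = 0.05
ks = np.arange(1, 16)

log_inv_poly = np.log(np.pi**2 * ks**2 / (6 * delta))    # delta_k = 6*delta/(pi^2 k^2): ~ 2 log k
log_inv_tele = np.log(ks * (ks + 1) / delta)             # delta_k = delta/(k(k+1)):     ~ 2 log k
log_inv_geom = ks * np.log(2) + np.log(1 / delta)        # delta_k = delta * 2^{-k}:     ~ k log 2

for k, a, b, c in zip(ks, log_inv_poly, log_inv_tele, log_inv_geom):
    print(f"k = {k:2d}   polynomial {a:5.2f}   telescoping {b:5.2f}   geometric {c:6.2f}")
```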

Definition: the SRM estimator

Equipped with a confidence allocation, the SRM penalty is the per-class bound at the allocated level.

Definition 2 (Capacity penalty and SRM estimator).

Given a nested family $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ with per-class uniform convergence functions $\varepsilon_k(\delta, n)$ satisfying $(\dagger)$, a confidence parameter $\delta \in (0, 1)$, and the canonical allocation $\delta_k = 6\delta/(\pi^2 k^2)$, the capacity penalty of class $\mathcal{H}_k$ at sample size $n$ is

$$\mathrm{pen}(\mathcal{H}_k, n, \delta) \;=\; \varepsilon_k\!\left(\frac{6 \delta}{\pi^2 k^2},\ n\right).$$

The SRM estimator is the pair $(\hat k, \hat h_{\hat k})$ defined by

$$\hat k \in \arg\min_{k \ge 1} \bigl\{\hat L_n(\hat h_k) \,+\, \mathrm{pen}(\mathcal{H}_k, n, \delta)\bigr\}, \qquad \hat h_{\hat k} = \arg\min_{h \in \mathcal{H}_{\hat k}} \hat L_n(h),$$

where $\hat h_k$ is the ERM in $\mathcal{H}_k$ (any minimizer; ties broken arbitrarily).

A practical note: the outer infimum over $k$ is taken over the (countably infinite) family, but $\mathrm{pen}(\mathcal{H}_k, n, \delta)$ grows in $k$ — the capacity term goes to infinity, and even if the capacity stays bounded the $2 \log k$ term grows without bound. So the minimum is achieved at some finite $\hat k$, and the search is effectively over a finite prefix of the family.

When the inf inside $\hat h_k$ isn't achieved (e.g., for some non-parametric classes), the analysis goes through with $\hat h_k$ defined as an $\eta$-approximate ERM for any $\eta > 0$; the additional $\eta$ term in the oracle inequality is taken to zero. From here on we assume the inf is achieved in each $\mathcal{H}_k$ — true for all the parametric examples in §2.
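A minimal sketch of Definition 2 on the polynomial family, using the §3.1 placeholder $\varepsilon_k(\delta_k, n) = c\sqrt{(d_k + \log(1/\delta_k))/n}$ with $d_k = k + 1$. The constant $c = 1$ and the degree range are illustrative assumptions, and the allocation index is shifted to $k + 1$ so that degree 0 gets a valid budget:

```python
import numpy as np

def srm_select(X, Y, max_degree=15, delta=0.05, c=1.0):
    """Penalize-and-pick over the nested polynomial family H_0 ⊂ H_1 ⊂ ... (Definition 2)."""
    n = len(X)
    best_total, best_k, best_coef = np.inf, None, None
    for k in range(max_degree + 1):
        coef = np.polyfit(X, Y, deg=k)                           # ERM in H_k
        train = np.mean((np.polyval(coef, X) - Y) ** 2)
        delta_k = 6 * delta / (np.pi**2 * (k + 1)**2)            # canonical confidence allocation
        pen = c * np.sqrt(((k + 1) + np.log(1 / delta_k)) / n)   # placeholder eps_k(delta_k, n)
        if train + pen < best_total:
            best_total, best_k, best_coef = train + pen, k, coef
    return best_k, best_coef

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 50)
Y = np.sin(np.pi * X) + 0.2 * rng.standard_normal(50)
print("SRM pick:", srm_select(X, Y)[0])
```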

The SRM oracle inequality

The oracle inequality is the load-bearing theorem of SRM. It says the SRM estimator’s risk is comparable to the best risk achievable in any class in the family, plus twice that class’s penalty — and we don’t have to know which class is best to enjoy this guarantee.

Theorem 1 (SRM oracle inequality).

Let $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ be a nested family with per-class uniform convergence bounds $(\dagger)$ valid at every level $\delta_k > 0$, and let $\mathrm{pen}(\mathcal{H}_k, n, \delta) = \varepsilon_k(6\delta/(\pi^2 k^2), n)$ as in Definition 2. Write $L^*(\mathcal{H}_k) = \inf_{h \in \mathcal{H}_k} L(h)$ for the best risk achievable in class $\mathcal{H}_k$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the sample $S \sim D^n$, the SRM estimator $\hat h_{\hat k}$ satisfies

$$L(\hat h_{\hat k}) \;\le\; \inf_{k \ge 1} \bigl\{L^*(\mathcal{H}_k) \,+\, 2 \cdot \mathrm{pen}(\mathcal{H}_k, n, \delta)\bigr\}.$$

The right-hand side is what we'd get from an oracle who knows the population risk and picks the optimal $k$ — modulo the factor of 2 on the penalty. The factor of 2 is the cost of learning $\hat k$ from data: we need uniform convergence twice over, once to validate our choice of $\hat k$ and once to compare it to the oracle's choice.

Two consequences worth flagging. Adaptive guarantee: the bound holds for the best $k$ in the family, which can depend on $n$, the noise level, and the target — and we don't have to specify it in advance. Approximation–estimation decomposition: $L^*(\mathcal{H}_k)$ is the approximation error of class $\mathcal{H}_k$ (how well the class can approximate the Bayes-optimal predictor), and $\mathrm{pen}(\mathcal{H}_k, n, \delta)$ is the estimation error (how well we can identify the best element of $\mathcal{H}_k$ from a finite sample). SRM trades off approximation vs estimation automatically — taking $k$ larger improves approximation (since $L^*(\mathcal{H}_k)$ is monotone non-increasing in $k$ by nesting) but worsens estimation (since the penalty grows in $k$).

Proof of the oracle inequality

The proof is the union-bound argument written out carefully.

Proof.

Define the event

$$\Omega \;=\; \left\{\,\sup_{h \in \mathcal{H}_k} \bigl|L(h) - \hat L_n(h)\bigr| \le \mathrm{pen}(\mathcal{H}_k, n, \delta) \quad \forall\, k \ge 1\,\right\}.$$

By $(\dagger)$ applied to each $\mathcal{H}_k$ at confidence level $\delta_k = 6\delta/(\pi^2 k^2)$, and a union bound over the family,

$$\Pr[\Omega^c] \;\le\; \sum_{k=1}^\infty \delta_k \;=\; \delta \cdot \frac{6}{\pi^2} \sum_{k=1}^\infty \frac{1}{k^2} \;=\; \delta \cdot \frac{6}{\pi^2} \cdot \frac{\pi^2}{6} \;=\; \delta. \tag{1}$$

So $\Pr[\Omega] \ge 1 - \delta$. We work on the event $\Omega$ for the rest of the proof; all subsequent claims hold simultaneously with probability $\ge 1 - \delta$.

Fix any reference class index $k^* \ge 1$. We will show

$$L(\hat h_{\hat k}) \;\le\; L^*(\mathcal{H}_{k^*}) \,+\, 2 \cdot \mathrm{pen}(\mathcal{H}_{k^*}, n, \delta),$$

and since $k^*$ is arbitrary, taking the infimum over $k^*$ on the right will yield the theorem.

Step 1 — Bound the population risk of $\hat h_{\hat k}$ by its empirical risk plus the SRM-class penalty. On $\Omega$, applied to class $\mathcal{H}_{\hat k}$,

$$L(\hat h_{\hat k}) \;\le\; \hat L_n(\hat h_{\hat k}) \,+\, \mathrm{pen}(\mathcal{H}_{\hat k}, n, \delta). \tag{2}$$

Step 2 — Invoke the SRM estimator's minimality. By the definition of $\hat k$ as the minimizer of the penalized empirical risk,

$$\hat L_n(\hat h_{\hat k}) \,+\, \mathrm{pen}(\mathcal{H}_{\hat k}, n, \delta) \;\le\; \hat L_n(\hat h_{k^*}) \,+\, \mathrm{pen}(\mathcal{H}_{k^*}, n, \delta). \tag{3}$$

Combining (2) and (3),

$$L(\hat h_{\hat k}) \;\le\; \hat L_n(\hat h_{k^*}) \,+\, \mathrm{pen}(\mathcal{H}_{k^*}, n, \delta). \tag{4}$$

The $\hat k$-dependent terms have cancelled, and the right-hand side now depends only on the reference class $k^*$.

Step 3 — Bound $\hat L_n(\hat h_{k^*})$ by the best population risk in $\mathcal{H}_{k^*}$. Let $h^*_{k^*} \in \arg\min_{h \in \mathcal{H}_{k^*}} L(h)$ (existence by the inf-is-achieved assumption from Definition 2). Since $\hat h_{k^*}$ is the ERM in $\mathcal{H}_{k^*}$, its training error is no larger than that of any other element of $\mathcal{H}_{k^*}$, in particular not larger than that of $h^*_{k^*}$:

$$\hat L_n(\hat h_{k^*}) \;\le\; \hat L_n(h^*_{k^*}).$$

On $\Omega$, applied to $\mathcal{H}_{k^*}$,

$$\hat L_n(h^*_{k^*}) \;\le\; L(h^*_{k^*}) \,+\, \mathrm{pen}(\mathcal{H}_{k^*}, n, \delta) \;=\; L^*(\mathcal{H}_{k^*}) \,+\, \mathrm{pen}(\mathcal{H}_{k^*}, n, \delta).$$

Chaining the two,

$$\hat L_n(\hat h_{k^*}) \;\le\; L^*(\mathcal{H}_{k^*}) \,+\, \mathrm{pen}(\mathcal{H}_{k^*}, n, \delta). \tag{5}$$

Step 4 — Combine. Plugging (5) into (4),

$$L(\hat h_{\hat k}) \;\le\; L^*(\mathcal{H}_{k^*}) \,+\, 2 \cdot \mathrm{pen}(\mathcal{H}_{k^*}, n, \delta).$$

Since $k^*$ was arbitrary, taking the infimum over $k^* \ge 1$ on the right gives the theorem.

Three things worth noting about the proof's structure. The factor of 2 on $\mathrm{pen}(\mathcal{H}_{k^*})$ comes from invoking uniform convergence twice — once for the SRM-picked class in (2), once for the reference class in (5). The penalty for the SRM-picked class $\mathcal{H}_{\hat k}$ cancels between (2) and (3), so the final bound only sees the reference-class penalty — this algebraic cancellation is what makes the bound adaptive: it gives us the best penalty in the family for free. And the $2 \log k$ contribution from the confidence allocation is hidden inside $\mathrm{pen}(\mathcal{H}_{k^*})$, since we evaluated $\varepsilon_{k^*}$ at level $\delta_{k^*}$ rather than $\delta$.

Penalty decomposition

The figure below decomposes the SRM penalty into its three additive contributions on the polynomial-regression toy with $d_k = k+1$. At $\delta = 0.05$, $n = 50$, $k \in \{1, \ldots, 15\}$, the capacity term $d_k$ is linear and reaches 16 at $k = 15$; the log-class term $2 \log k$ is logarithmic and reaches $\approx 5.4$ at $k = 15$ (about a third of the capacity term); the log-confidence term $\log(1/\delta) \approx 3$ is constant. Capacity dominates the penalty; the $\log k$ price for considering an infinite family is real but small in practice.

The second panel compares three confidence-allocation choices for $\delta_k$. The canonical polynomial decay $\delta_k = 6\delta/(\pi^2 k^2)$ gives $\log(1/\delta_k) \sim 2 \log k$. The telescoping choice $\delta_k = \delta/(k(k+1))$ gives essentially the same asymptotic shape. The geometric choice $\delta_k = \delta \cdot 2^{-k}$ gives $\log(1/\delta_k) \sim k \log 2$ — linear in $k$, dramatically harsher. This is the visualization of why the polynomial decay is canonical: it's the slowest-growing $\log(1/\delta_k)$ subject to $\sum_k \delta_k \le \delta$.

Figure: the SRM penalty decomposition. Left: the three additive contributions inside the Vapnik square root — the capacity term linear in $d_k$, $2\log k$ logarithmic, $\log(1/\delta)$ constant. Right: three confidence allocations — only the polynomial-decay (canonical) and telescoping allocations keep the union-bound cost logarithmic in $k$; the geometric allocation grows linearly.

Classical VC-bound SRM (Vapnik 1995/1998)

The Vapnik penalty

The SRM oracle inequality (Theorem 1) is loss-agnostic and capacity-agnostic — it works for any per-class uniform convergence bound $(\dagger)$ we can produce. The Fundamental Theorem of Statistical Learning (PAC Learning, agnostic-PAC two-sided form) gives one for any class of finite VC dimension. For a class $\mathcal{H}$ with VC dimension $d$, any distribution $D$, and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over $S \sim D^n$,

$$\sup_{h \in \mathcal{H}} \bigl|L_D(h) - \hat L_S(h)\bigr| \;\le\; C \sqrt{\frac{d \log(2en/d) + \log(4/\delta)}{n}}, \tag{4.1}$$

where $C$ is a universal constant (the value depends on the loss function and the route taken through the FTSL proof — Sauer–Shelah plus Hoeffding gives one constant, chaining bounds give a smaller one; the qualitative shape is what matters here). Plug in the canonical confidence allocation $\delta_k = 6\delta/(\pi^2 k^2)$ from §3.

Definition 3 (Vapnik SRM penalty).

Given a nested family $\{\mathcal{H}_k\}_{k \ge 1}$ with VC dimensions $d_k < \infty$, sample size $n$, and confidence parameter $\delta \in (0, 1)$, the Vapnik SRM penalty is

$$\mathrm{pen}_V(\mathcal{H}_k, n, \delta) \;=\; C \sqrt{\frac{d_k \log(2n/d_k) \,+\, 2\log k \,+\, \log\!\bigl(\pi^2 / (6\delta)\bigr)}{n}},$$

where $C$ is the universal constant from (4.1).

Each term in the numerator has a meaning. $d_k \log(2n / d_k)$ is the capacity contribution — it comes from the Sauer–Shelah count of dichotomies a VC-$d_k$ class can produce on $n$ points and is the dominant term for moderate $k$ and $n$. $2 \log k$ is the union-bound cost from the confidence allocation $\delta_k = 6\delta/(\pi^2 k^2)$, the "price for considering infinitely many classes." $\log(\pi^2/(6\delta))$ is the confidence-level cost. All inside a $\sqrt{1/n}$ envelope — the classical "slow rate" of statistical learning.

For regression with squared loss the same shape holds provided predictions and labels are bounded (truncate to $[-B, B]$ and the penalty picks up a multiplicative factor of $B^2$). The polynomial-regression toy from §1 satisfies $|m(x)| \le 1$ and $|\varepsilon| \le 1$ with overwhelming probability, so the truncation is invisible. For the polynomial family, the VC dimension of the threshold-classification variant equals the parameter dimension $d_k = k + 1$ — the same shape as the regression effective DoF.
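A sketch of Definition 3 on the toy, with $d_k = k + 1$ and the pedagogical constant $C = 1$ used in the §4.5 figure; both are demo settings, not part of the theorem:

```python
import numpy as np

def vapnik_penalty(k, n, delta=0.05, C=1.0):
    """Definition 3 with d_k = k + 1; C = 1 is a pedagogical choice, not the FTSL constant."""
    d = k + 1
    return C * np.sqrt((d * np.log(2 * n / d) + 2 * np.log(k) + np.log(np.pi**2 / (6 * delta))) / n)

rng = np.random.default_rng(0)
n = 50
X = rng.uniform(-1, 1, n)
Y = np.sin(np.pi * X) + 0.2 * rng.standard_normal(n)

totals = []
for k in range(1, 16):
    coef = np.polyfit(X, Y, deg=k)                            # ERM in H_k
    train = np.mean((np.polyval(coef, X) - Y) ** 2)
    totals.append(train + vapnik_penalty(k, n))
print("Vapnik SRM pick:", 1 + int(np.argmin(totals)))
```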

Derivation from FTSL

The derivation is the union-bound argument from §3 with $\varepsilon_k$ instantiated by (4.1). For each $\mathcal{H}_k$ at confidence level $\delta_k$,

$$\Pr\!\left[\sup_{h \in \mathcal{H}_k} \bigl|L(h) - \hat L_n(h)\bigr| > C \sqrt{\frac{d_k \log(2en/d_k) + \log(4/\delta_k)}{n}}\right] \;\le\; \delta_k.$$

Substituting $\delta_k = 6\delta/(\pi^2 k^2)$,

$$\log(4/\delta_k) \;=\; \log 4 + \log\!\frac{\pi^2 k^2}{6\delta} \;=\; 2\log k + \log\!\frac{4 \pi^2}{6\delta} \;=\; 2\log k + \log\!\frac{\pi^2}{6\delta} + \log 4.$$

The additive $\log 4$ rolls into the universal constant $C$ when we resolve the logs into the form of Definition 3. The result is $(\dagger)$ with $\varepsilon_k(\delta_k, n) = \mathrm{pen}_V(\mathcal{H}_k, n, \delta)$. Applying Theorem 1 then gives the Vapnik SRM oracle inequality: with probability $\ge 1 - \delta$,

$$L(\hat h_{\hat k}) \;\le\; \inf_{k \ge 1} \bigl\{L^*(\mathcal{H}_k) \,+\, 2 \, \mathrm{pen}_V(\mathcal{H}_k, n, \delta)\bigr\}.$$

This is the original SRM bound from Vapnik (1995, 1998) — the form that motivated the entire framework.

SRM consistency

The oracle inequality is a finite-sample guarantee: for the given $n$ and $\delta$, the SRM estimator's risk is no worse than the best in-family choice plus twice that class's penalty. We'd also like an asymptotic guarantee — that as $n \to \infty$, the SRM estimator's risk approaches the Bayes-optimal risk $L^* = \inf_h L(h)$, taken over all measurable $h$.

Theorem 2 (SRM consistency; Vapnik 1995).

Let $\{\mathcal{H}_k\}_{k \ge 1}$ be a nested family with VC dimensions $d_k < \infty$ for every $k$. Suppose

(a) Universal approximation: $\inf_{k \ge 1} L^*(\mathcal{H}_k) = L^*$.

(b) Confidence schedule: $\delta = \delta_n$ depends on $n$ with $\delta_n \to 0$ and $\log(1/\delta_n) = o(n)$ (e.g., $\delta_n = 1/n$).

Let $\hat h_{\hat k_n}$ be the SRM estimator at sample size $n$ with the Vapnik penalty (Definition 3). Then

$$L(\hat h_{\hat k_n}) \;\xrightarrow{P}\; L^* \quad \text{as } n \to \infty.$$

Two consequences. Convergence rate: if condition (a) is strengthened to a quantitative approximation rate — $L^*(\mathcal{H}_k) - L^* = O(k^{-\alpha})$ for some $\alpha > 0$ — then the convergence in Theorem 2 happens at a quantifiable rate, determined by trading $k$ off against the penalty (Lugosi and Zeger 1995). No rate without smoothness: without quantitative approximation conditions, the convergence in Theorem 2 can be arbitrarily slow (Devroye, Györfi, and Lugosi 1996, §7).

Proof of consistency

The argument is: pick a reference class $K$ that's good enough in approximation, observe that its penalty vanishes as $n \to \infty$, plug both into the oracle inequality, take the limit.

Proof.

Fix $\varepsilon > 0$. We will show $\Pr[L(\hat h_{\hat k_n}) > L^* + \varepsilon] \to 0$.

Step 1 — Pick a reference class. By (a), there exists $K = K(\varepsilon) \in \mathbb{N}$ with

$$L^*(\mathcal{H}_K) \;\le\; L^* + \varepsilon/2. \tag{1}$$

This $K$ is fixed once and for all — it does not depend on $n$.

Step 2 — The penalty at $K$ vanishes. The Vapnik penalty at sample size $n$ and the fixed class $K$ is

$$\mathrm{pen}_V(\mathcal{H}_K, n, \delta_n) \;=\; C \sqrt{\frac{d_K \log(2n/d_K) + 2\log K + \log\!\bigl(\pi^2 / (6\delta_n)\bigr)}{n}}.$$

$d_K$ is fixed (Step 1), so $d_K \log(2n/d_K) = O(\log n)$. $K$ is fixed, so $2 \log K$ is a constant. By (b), $\log(1/\delta_n) = o(n)$. The numerator inside the square root is $O(\log n) + O(1) + o(n) = o(n)$, so

$$\mathrm{pen}_V(\mathcal{H}_K, n, \delta_n) \;\to\; 0 \quad \text{as } n \to \infty. \tag{2}$$

Choose $N = N(\varepsilon)$ such that for all $n \ge N$, $2 \, \mathrm{pen}_V(\mathcal{H}_K, n, \delta_n) \le \varepsilon/2$.

Step 3 — Apply the oracle inequality. By Theorem 1 (applied to the Vapnik penalty), with probability at least $1 - \delta_n$ over the sample $S_n \sim D^n$,

$$L(\hat h_{\hat k_n}) \;\le\; \inf_{k \ge 1} \bigl\{L^*(\mathcal{H}_k) + 2 \, \mathrm{pen}_V(\mathcal{H}_k, n, \delta_n)\bigr\} \;\le\; L^*(\mathcal{H}_K) + 2 \, \mathrm{pen}_V(\mathcal{H}_K, n, \delta_n),$$

where the second inequality is the trivial upper bound by the value at $k = K$. For $n \ge N$, applying (1) and (2),

$$L(\hat h_{\hat k_n}) \;\le\; (L^* + \varepsilon/2) + \varepsilon/2 \;=\; L^* + \varepsilon. \tag{3}$$

This holds with probability at least $1 - \delta_n$.

Step 4 — In-probability convergence. From (3),

$$\Pr\bigl[L(\hat h_{\hat k_n}) > L^* + \varepsilon\bigr] \;\le\; \delta_n. \tag{4}$$

By (b), $\delta_n \to 0$, so the right-hand side vanishes. Since $\varepsilon > 0$ was arbitrary, $L(\hat h_{\hat k_n}) \xrightarrow{P} L^*$.

Two structural notes about the proof. The reference $K$ is data-independent. This is essential: we pick $K$ based on the approximation property (a), apply the oracle inequality with $K$ on the right, and absorb the gap into the penalty. Trying to take $K = \hat k_n$ would defeat the argument, since the right-hand side of the oracle inequality is fixed before the data is seen. The proof says nothing about $\hat k_n$ itself. We never claim $\hat k_n \to \infty$ or $\hat k_n \to K^*$ for some optimal $K^*$ — what's controlled is the risk, not the index. $\hat k_n$ can oscillate arbitrarily as long as the corresponding risk converges.

Vapnik SRM on the polynomial toy

The figure below plots training MSE, the Vapnik penalty, and their sum as functions of $k$ on the polynomial-regression toy, at $n \in \{50, 100, 500\}$. The picked $\hat k$ is the argmin of the total; we also compute the bias-variance optimum $k^*$ from a small Monte Carlo for comparison.

The reader should see two things. The U-shape exists. Training MSE decreases monotonically, the penalty grows monotonically, and the sum has an interior minimum. The SRM rule is well-defined and computable from the sample alone. Vapnik SRM is conservative. The picked $\hat k$ is consistently smaller than the bias-variance optimum $k^*$ computed from §1's Monte Carlo. The Vapnik bound is the worst-case uniform-convergence rate over all distributions; on a benign distribution like ours, the actual generalization gap is much smaller than the bound, and the rule pays for the safety margin by under-fitting.

This sets up §5 directly. The Bartlett–Mendelson Rademacher SRM uses a data-dependent capacity measure (empirical Rademacher complexity on the actual sample) instead of the distribution-free VC dimension. On distributions where the worst case is not realized (which is most distributions), the Rademacher bound is tighter, and the picked $\hat k$ moves closer to $k^*$ — though at very small $n$ a McDiarmid confidence term can dominate and flip the relative tightness (§5.5).

The figure uses the pedagogical constant $C = 1$, which keeps the U-shape interior so the picked $\hat k$ is interpretable. The literal FTSL constant from chaining bounds is roughly 2–8, depending on the proof route; practitioners typically don't use the literal Vapnik bound — they use Rademacher (§5), AIC/BIC (§8), or CV (§10) instead.

Figure: the Vapnik SRM curve on the polynomial toy at $n = 50, 100, 500$. Training MSE drops monotonically, the penalty grows, and the total has an interior minimum at $\hat k_V$ ($\hat k_V = 3$ at $n = 50$, $\delta = 0.05$, $C = 1$). The pick is consistently smaller than the bias-variance optimum $k^* \approx 5$ on this benign distribution — the worst-case bound pays for distribution-freeness.

Rademacher-complexity SRM (Bartlett–Mendelson 2002)

From worst-case to data-dependent

The Vapnik penalty is the worst case over all distributions on $\mathcal{X}$ — it bounds the generalization gap of $\mathcal{H}_k$ uniformly over every conceivable input distribution. For any distribution where $\mathcal{H}_k$ exhibits less complexity than its worst case — because the data is on a low-dimensional manifold, or because the realized inputs lie in a benign region, or just because the noise structure happens to be helpful — the Vapnik bound overestimates the generalization gap, and the SRM rule built on it under-fits (§4.5).

The Rademacher complexity is empirical — it's defined on the realized sample $X_1, \ldots, X_n$ and measures how well functions in $\mathcal{H}_k$ can align with random sign noise on that specific sample. If the class can't fit Rademacher noise well on this particular $X$ — and for benign distributions, it usually can't, by a $\sqrt{\log(n/d)}$ factor — the resulting penalty is tighter and the SRM rule picks a less conservative $\hat k$.

The replacement is mechanical: keep everything from §3 and §4, just swap the per-class uniform convergence bound. The Bartlett–Mendelson (2002) bound (Generalization Bounds, which establishes the bound via McDiarmid’s inequality and a symmetrization argument) replaces (4.1).

The Bartlett–Mendelson SRM penalty

For a function class $\mathcal{F}$ defined on a sample $S = (Z_1, \ldots, Z_n)$, the empirical Rademacher complexity is

$$\hat{\mathfrak{R}}_S(\mathcal{F}) \;=\; \mathbb{E}_{\sigma}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)\right],$$

where $\sigma = (\sigma_1, \ldots, \sigma_n)$ are i.i.d. Rademacher variables uniform on $\{\pm 1\}$ (also developed in PAC Learning and Generalization Bounds). The Bartlett–Mendelson generalization bound: for the loss class $\mathcal{F}_k = \{(x, y) \mapsto \ell(h(x), y) : h \in \mathcal{H}_k\}$, with probability at least $1 - \delta$ over $S \sim D^n$,

$$\sup_{h \in \mathcal{H}_k} \bigl(L(h) - \hat L_n(h)\bigr) \;\le\; 2\hat{\mathfrak{R}}_S(\mathcal{F}_k) + 3 \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{5.1}$$

The factor of 2 in front of $\hat{\mathfrak{R}}_S(\mathcal{F}_k)$ comes from the symmetrization step (passing from the sample $S$ to an i.i.d. ghost sample $S'$ and bounding by $2\mathfrak{R}_n$ — see Generalization Bounds). The $3\sqrt{\log(2/\delta)/(2n)}$ comes from two applications of McDiarmid's bounded-differences inequality.

Plug $\delta_k = 6\delta/(\pi^2 k^2)$ into (5.1).

Definition 4 (Bartlett–Mendelson SRM penalty).

Given a nested family $\{\mathcal{H}_k\}_{k \ge 1}$, sample $S = (Z_1, \ldots, Z_n)$, and confidence $\delta \in (0, 1)$, the empirical Rademacher SRM penalty is

$$\mathrm{pen}_R(\mathcal{H}_k, S, \delta) \;=\; 2 \hat{\mathfrak{R}}_S(\mathcal{F}_k) \,+\, 3 \sqrt{\frac{\log(2 \pi^2 k^2 / (6\delta))}{2n}},$$

where $\mathcal{F}_k = \{(x, y) \mapsto \ell(h(x), y) : h \in \mathcal{H}_k\}$ is the loss class associated with $\mathcal{H}_k$.

Three observations. The penalty is sample-dependent. The capacity term $\hat{\mathfrak{R}}_S(\mathcal{F}_k)$ has no $\log(n/d)$ factor: for nice classes its scaling is $\sqrt{d/n}$, not $\sqrt{d \log(n/d) / n}$ as in Vapnik; that $\sqrt{\log(n/d)}$ factor is the savings on benign distributions. For Lipschitz losses, Talagrand's contraction lemma (Generalization Bounds) gives $\hat{\mathfrak{R}}_S(\mathcal{F}_k) \le L \cdot \hat{\mathfrak{R}}_S(\mathcal{H}_k)$, so we can compute the Rademacher complexity on the hypothesis class and absorb the Lipschitz constant $L$ into a constant.

The Rademacher SRM oracle inequality

Theorem 3 (Bartlett–Mendelson SRM oracle inequality).

Under the setup of Definition 4, with probability at least $1 - \delta$ over the sample $S \sim D^n$, the SRM estimator $\hat h_{\hat k}$ with penalty $\mathrm{pen}_R$ satisfies

$$L(\hat h_{\hat k}) \;\le\; \inf_{k \ge 1} \bigl\{L^*(\mathcal{H}_k) \,+\, 2 \cdot \mathrm{pen}_R(\mathcal{H}_k, S, \delta)\bigr\}.$$

Same shape as the Vapnik oracle inequality (§4), with $\mathrm{pen}_V$ replaced by $\mathrm{pen}_R$. The only thing that has changed is which uniform-convergence bound feeds into Theorem 1.

The bound is data-dependent in a sharper sense than Vapnik's. The Vapnik bound depends on the sample only through $n$; the Rademacher bound depends on the sample through $\hat{\mathfrak{R}}_S$. Two practical consequences. Adaptivity: on benign samples, $\hat{\mathfrak{R}}_S(\mathcal{F}_k)$ is small even when $d_k$ is large. Sample-by-sample variability: the picked $\hat k$ can differ across samples even when the underlying distribution is the same — a feature, not a bug.

Proof

Proof.

Define the event

$$\Omega_R = \left\{\,\sup_{h \in \mathcal{H}_k}\bigl(L(h) - \hat L_n(h)\bigr) \le \mathrm{pen}_R(\mathcal{H}_k, S, \delta) \quad \forall\, k \ge 1\,\right\}.$$

By (5.1) applied to each $\mathcal{H}_k$ at confidence level $\delta_k = 6\delta/(\pi^2 k^2)$, $\log(2/\delta_k) = \log(2\pi^2 k^2 / (6\delta))$, so the per-class bound is exactly $\mathrm{pen}_R(\mathcal{H}_k, S, \delta)$ as in Definition 4. A union bound over $k$ gives $\Pr[\Omega_R^c] \le \sum_k \delta_k = \delta$. The remainder of the argument is identical to the proof of Theorem 1 (§3) with $\mathrm{pen}$ replaced by $\mathrm{pen}_R$.

The proof is short because all the work is delegated. The McDiarmid + symmetrization machinery that establishes (5.1) lives in Generalization Bounds; the union-bound algebra lives in §3.4. What §5 contributes is the instantiation.

Empirical Rademacher vs the VC upper bound

The polynomial-regression toy admits a clean closed form for the empirical Rademacher complexity of the $L^2(P_n)$-unit-ball polynomial class $\mathcal{H}_k^\circ = \{h \in \mathcal{H}_k : \|h\|_n \le 1\}$. A direct computation (Cauchy–Schwarz in the $V^\top V$ inner product, where $V$ is the Vandermonde matrix of the realized sample) gives

$$\sup_{h \in \mathcal{H}_k^\circ} \frac{1}{n}\sum_{i=1}^n \sigma_i h(X_i) \;=\; \frac{1}{\sqrt{n}} \|P_V \sigma\|,$$

where $P_V$ is the orthogonal projection onto the column space of $V$. So $\hat{\mathfrak{R}}_n(\mathcal{H}_k^\circ) = \mathbb{E}_\sigma[\|P_V \sigma\| / \sqrt{n}]$, estimated by Monte Carlo over $B = 500$ Rademacher draws. The closed form makes the computation deterministic — no actual fitting required, just orthogonal projection.

For comparison, the VC-implied upper bound (Massart's lemma plus Sauer–Shelah) is $\sqrt{2(k+1)\log(en/(k+1))/n}$. The factor that distinguishes them is the $\sqrt{\log(en/(k+1))}$ — the log savings from data-dependent vs distribution-free analysis.
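A sketch of the projection computation next to the VC-implied bound, with $B = 500$ Rademacher draws as in the text (sample and degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 50, 500
X = rng.uniform(-1, 1, n)

for k in (1, 3, 5, 10, 15):
    V = np.vander(X, N=k + 1, increasing=True)             # n x (k+1) Vandermonde design
    Q, _ = np.linalg.qr(V)                                 # orthonormal basis of col(V)
    sigma = rng.choice([-1.0, 1.0], size=(B, n))           # Rademacher draws
    rad_hat = np.linalg.norm(sigma @ Q, axis=1).mean() / np.sqrt(n)   # E ||P_V sigma|| / sqrt(n)
    vc_bound = np.sqrt(2 * (k + 1) * np.log(np.e * n / (k + 1)) / n)  # Massart + Sauer–Shelah
    print(f"k = {k:2d}   R_hat = {rad_hat:.3f}   VC-implied bound = {vc_bound:.3f}")
```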

The expected behavior — Rademacher tighter than Vapnik on benign distributions — is asymptotic. At small $n$ the Bartlett–Mendelson confidence term $3 \sqrt{\log(2\pi^2 k^2/(6\delta))/(2n)}$ can dominate the data-dependent capacity savings, and the Rademacher-picked $\hat k_R$ can come out smaller than $\hat k_V$ rather than larger. On our polynomial toy, $\hat k_R = 1$ at $n = 50$ and $n = 100$, while $\hat k_V = 3$ across all three sample sizes; the cross-over happens around $n \ge 200$. The honest reading: Rademacher is tighter than Vapnik asymptotically, on benign distributions; at small $n$ the McDiarmid confidence term can dominate and produce a conservative pick. The picture sharpens visibly as $n$ grows.

Figure: empirical Rademacher complexity vs the VC-implied upper bound for the polynomial unit-ball class at $n = 50, 100, 500$, with the Rademacher SRM pick $\hat k_R$ marked. The data-dependent capacity sits uniformly below the VC bound — the gap is the $\sqrt{\log(n/d_k)}$ savings — but at small $n$ the McDiarmid confidence term in Definition 4 dominates and $\hat k_R \le \hat k_V$ ($\hat k_R = 1$ at $n = 50, 100$; cross-over near $n \ge 200$). As $n$ grows the bound sharpens and $\hat k_R$ rises toward $k^*$.

Penalized ERM and soft SRM

From hard nesting to soft penalty

Sections 3–5 derived SRM as a discrete rule. The discreteness is a notational artifact. Every practical model class can be smoothly interpolated by a continuous capacity functional $C: \mathcal{H}_\infty \to \mathbb{R}_{\ge 0}$, and the discrete family is recovered as the level sets $\mathcal{H}_r = \{h \in \mathcal{H}_\infty : C(h) \le r\}$ of the capacity functional.

With a continuous capacity functional in hand, two natural problems present themselves. The constrained form:

$$\min_{h \in \mathcal{H}_\infty} \hat L_n(h) \quad \text{subject to} \quad C(h) \le r. \tag{6.1}$$

The penalized form replaces the constraint with a Lagrange multiplier:

$$\min_{h \in \mathcal{H}_\infty} \bigl\{\hat L_n(h) + \lambda C(h)\bigr\}, \qquad \lambda \ge 0. \tag{6.2}$$

Under mild convexity assumptions on $\hat L_n$ and $C$ — both satisfied for squared loss plus a convex norm penalty — strong Lagrangian duality holds: every solution of (6.1) at level $r$ is a solution of (6.2) at some $\lambda = \lambda(r)$, and vice versa (Boyd and Vandenberghe 2004, §5.5).

For SRM the practical consequence is that we don't need to enumerate the discrete family. We can solve (6.2) for a continuous grid of $\lambda$ values, trace out the entire regularization path of solutions $\hat h_\lambda$, and pick $\hat\lambda$ to minimize a soft SRM objective — training error plus a penalty that depends on the effective capacity at $\lambda$.

The penalty parameter as a continuous SRM level

Let $\hat h_\lambda$ denote the solution of (6.2) at penalty parameter $\lambda$. As $\lambda$ varies from $0$ to $\infty$, $C(\hat h_\lambda)$ varies monotonically: $\lambda_1 < \lambda_2$ implies $C(\hat h_{\lambda_1}) \ge C(\hat h_{\lambda_2})$ (Tikhonov 1963).

So $\lambda$ acts as a continuous capacity dial. Small $\lambda$ = high capacity, large $\lambda$ = low capacity. This monotonicity is the soft analog of the nested-family inclusion: as $\lambda$ shrinks, the effective class $\mathcal{H}_{r(\lambda)} = \{h : C(h) \le r(\lambda)\}$ grows.

A useful capacity measure for the penalized-ERM regime is the effective degrees of freedom: $\mathrm{tr}(S_\lambda)$ for linear smoothers. For ridge regression with Vandermonde features $V$ and penalty $\lambda \|\alpha\|_2^2$,

$$S_\lambda = V (V^\top V + \lambda I)^{-1} V^\top, \qquad \mathrm{tr}(S_\lambda) = \sum_{j=1}^{k+1} \frac{s_j^2}{s_j^2 + \lambda},$$

where the $s_j$ are the singular values of $V$. This trace decreases smoothly from $\mathrm{rank}(V) = k+1$ at $\lambda = 0$ to $0$ at $\lambda = \infty$, providing exactly the continuous interpolation we want.
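The trace formula in code: a small sketch computing $\mathrm{tr}(S_\lambda)$ from the singular values of the toy's Vandermonde design (design and $\lambda$ grid are illustrative):

```python
import numpy as np

def effective_dof(V, lam):
    """tr(S_lambda) = sum_j s_j^2 / (s_j^2 + lambda) for ridge on design V."""
    s = np.linalg.svd(V, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 50)
V = np.vander(X, N=16, increasing=True)                    # degree-15 Vandermonde features
for lam in (0.0, 1e-4, 1e-2, 1.0, 100.0):
    print(f"lambda = {lam:8.4f}   effective DoF = {effective_dof(V, lam):6.2f}")
```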

The soft SRM estimator and its relationship to $\hat k$

Definition 5 (Soft SRM estimator).

Given a capacity functional $C: \mathcal{H}_\infty \to \mathbb{R}_{\ge 0}$, a continuous penalty $\mathrm{pen}_\lambda(\lambda, n, \delta)$, and a confidence parameter $\delta \in (0, 1)$, the soft SRM estimator is

$$\hat\lambda \in \arg\min_{\lambda \ge 0} \bigl\{\hat L_n(\hat h_\lambda) + \mathrm{pen}_\lambda(\lambda, n, \delta)\bigr\}, \qquad \hat h_\lambda = \arg\min_{h \in \mathcal{H}_\infty}\bigl\{\hat L_n(h) + \lambda C(h)\bigr\}.$$

The natural form replaces the discrete capacity term $d_k$ with effective DoF. The hard/soft correspondence: for each $\lambda$, Lagrangian duality identifies a unique $r(\lambda)$ such that $\hat h_\lambda$ solves the constrained problem at level $r(\lambda)$. For benign penalty constants, the picked effective DoF at $\hat\lambda$ should be close to the discrete SRM's $\hat k + 1$.

Implicit nested families

Given any penalty functional $C$ and any $\lambda > 0$, the level set $\mathcal{H}_{r(\lambda)} = \{h : C(h) \le r(\lambda)\}$ is a well-defined hypothesis class, and as $\lambda$ varies these classes are nested. This is the implicit nested family generated by the penalty.

The consequence: any penalty-based method automatically inherits the SRM framework. The penalty type — $\ell_2$ for ridge, $\ell_1$ for lasso, Sobolev norm for spline smoothing, $\ell_0$ pseudo-norm for best-subset selection — determines the geometry of the implicit family, but the soft-SRM rule and its oracle inequality apply regardless.

Soft-SRM path on the polynomial toy

The figure below uses the degree-15 Vandermonde feature matrix as the ambient parameter space and fits ridge regression $\min_\alpha \|Y - V\alpha\|_2^2 + \lambda \|\alpha\|_2^2$ for a logarithmic grid of $\lambda \in [10^{-8}, 10^6]$. For each $\lambda$ we compute training MSE, effective DoF $\mathrm{tr}(S_\lambda)$, and a simplified soft-SRM penalty $\sqrt{(\mathrm{tr}(S_\lambda) + \log(1/\delta))/n}$.

Figure: the soft-SRM path on degree-15 Vandermonde features. Top: ridge fits at three $\lambda$ values overlaid on the training data — $\lambda$ acts as a continuous capacity dial, with small $\lambda$ overfitting and large $\lambda$ underfitting. Bottom: training MSE, the simplified capacity penalty, and their total as functions of $\lambda$ (log scale); the picked $\hat\lambda$ sits at the interior minimum, with effective DoF ($\approx 5.6$ here) close to the hard-SRM $\hat k + 1$. The continuous regularization path replaces the discrete family enumeration with a single sweep.
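A sketch of the sweep under the settings described above; conditioning at the smallest $\lambda$ values is poor, which is acceptable for an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 50, 0.05
X = rng.uniform(-1, 1, n)
Y = np.sin(np.pi * X) + 0.2 * rng.standard_normal(n)
V = np.vander(X, N=16, increasing=True)                    # degree-15 Vandermonde features
s = np.linalg.svd(V, compute_uv=False)                     # singular values, reused for tr(S_lambda)

best = (np.inf, None, None)
for lam in np.logspace(-8, 6, 200):
    alpha = np.linalg.solve(V.T @ V + lam * np.eye(16), V.T @ Y)   # ridge solution at lambda
    train = np.mean((V @ alpha - Y) ** 2)
    dof = np.sum(s**2 / (s**2 + lam))                      # effective degrees of freedom
    total = train + np.sqrt((dof + np.log(1 / delta)) / n) # simplified soft-SRM objective
    if total < best[0]:
        best = (total, lam, dof)
print(f"picked lambda = {best[1]:.3g}, effective DoF = {best[2]:.2f}")
```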

SRM as regularization

Tikhonov regularization as soft SRM

Tikhonov (1963) introduced the regularization functional $\min_\theta \{\|Y - X\theta\|_2^2 + \lambda \|L\theta\|_2^2\}$ for some operator $L$. With $L = I$ this is standard ridge regression. With $L$ a discrete-derivative operator it becomes spline smoothing. With $L = D$ a finite-difference matrix it becomes total-variation denoising.

For SRM, the relevant fact is that Tikhonov with $L = I$ is precisely soft SRM (Definition 5) with capacity functional $C(\theta) = \|\theta\|_2^2$. The implicit nested family is the parameter-space ball $\mathcal{H}_r = \{\theta : \|\theta\|_2^2 \le r\}$. Geometrically the $\ell_2$ ball is rotationally symmetric — solutions shrink isotropically toward zero. No coefficient is ever set to exactly zero.

Ridge effective degrees of freedom

The effective DoF $\mathrm{tr}(S_\lambda)$ from §6 plays a dual role for ridge regression. First, it's the capacity of the implicit family at $\lambda$. Second, it's the Stein's unbiased risk estimator (SURE) bias-correction term: the expected training MSE is biased downward from the population risk by $2 \sigma^2 \mathrm{tr}(S_\lambda)/n$.

For ridge on polynomial-Vandermonde features, the SVD gives $\mathrm{tr}(S_\lambda) = \sum_j s_j^2/(s_j^2 + \lambda)$ — a smooth monotone function from $k+1$ (OLS) to $0$. The SURE interpretation also gives a sample-only way to estimate $\hat\lambda$: minimize $\hat L_n(\hat h_\lambda) + 2 \hat\sigma^2 \mathrm{tr}(S_\lambda)/n$. This is Mallows's $C_p$ in disguise (§8).
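The same sweep with the SURE/Mallows $C_p$ objective in place of the §6.5 penalty; only the criterion changes, and the plug-in $\hat\sigma^2$ is taken to be the true noise variance, an assumption of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.uniform(-1, 1, n)
Y = np.sin(np.pi * X) + 0.2 * rng.standard_normal(n)
V = np.vander(X, N=16, increasing=True)
s = np.linalg.svd(V, compute_uv=False)
sigma2_hat = 0.04                                          # plug-in noise variance (true value here)

best_lam, best_cp = None, np.inf
for lam in np.logspace(-8, 6, 200):
    alpha = np.linalg.solve(V.T @ V + lam * np.eye(16), V.T @ Y)
    train = np.mean((V @ alpha - Y) ** 2)
    dof = np.sum(s**2 / (s**2 + lam))
    cp = train + 2 * sigma2_hat * dof / n                  # Mallows C_p / SURE bias correction
    if cp < best_cp:
        best_lam, best_cp = lam, cp
print(f"C_p pick: lambda = {best_lam:.3g}")
```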

Lasso and the 1\ell_1 ball

Lasso (Tibshirani 1996) replaces Tikhonov’s 2\ell_2 penalty with 1\ell_1:

θ^λLasso    argminθ{YXθ22+λθ1}.\hat\theta_\lambda^{\mathrm{Lasso}} \;\in\; \arg\min_\theta \bigl\{\|Y - X\theta\|_2^2 + \lambda \|\theta\|_1\bigr\}.

The implicit nested family is the 1\ell_1 ball Hr={θ:θ1r}\mathcal{H}_r = \{\theta : \|\theta\|_1 \le r\} — a cross-polytope. The geometry forces solutions to lie on low-dimensional faces with most coordinates exactly zero. The natural capacity measure for lasso is sparsity — the number of nonzero coefficients. Zou, Hastie, and Tibshirani (2007) prove that this count equals the effective DoF in the SURE sense:

eff_dof(θ^λLasso)  =  θ^λLasso0.\mathrm{eff\_dof}(\hat\theta_\lambda^{\mathrm{Lasso}}) \;=\; \|\hat\theta_\lambda^{\mathrm{Lasso}}\|_0.

The Bickel-Ritov-Tsybakov (2009) lasso oracle inequality has exactly the SRM shape: X(θ^λLassoθ)n2cslogp/n\|X(\hat\theta_\lambda^{\mathrm{Lasso}} - \theta^*)\|_n^2 \le c \cdot s^* \log p / n, with s=θ0s^* = \|\theta^*\|_0 the true sparsity — the capacity term slogps \log p plays the role of dklog(n/dk)d_k \log(n/d_k) from §4.
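
A sketch of the lasso capacity readout with scikit-learn (Lasso and its coef_ attribute are the standard API; the alpha grid and solver settings are illustrative, and the note in the Computational notes about a generous max_iter applies):

import numpy as np
from sklearn.linear_model import Lasso

def lasso_capacity_path(X, Y, k_max=15, alphas=None):
    """For each penalty level, record training MSE and the number of nonzero
    coefficients -- the Zou-Hastie-Tibshirani effective DoF for the lasso."""
    V = np.vander(X, k_max + 1, increasing=True)
    if alphas is None:
        alphas = np.logspace(-4, 1, 30)
    path = []
    for a in alphas:
        model = Lasso(alpha=a, max_iter=100_000).fit(V, Y)
        mse = float(np.mean((Y - model.predict(V)) ** 2))
        nnz = int(np.count_nonzero(model.coef_))   # intercept excluded from the count
        path.append((a, mse, nnz))
    return path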

In linear models, weight decay — adding λθ22\lambda \|\theta\|_2^2 to the loss — is literally Tikhonov regularization with L=IL = I, hence soft SRM on the 2\ell_2 ball. The terminology comes from the gradient-descent update rule: θt+1=(1ηλ)θtηL^n(θt)\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta\nabla\hat L_n(\theta_t). In deep networks the connection to classical SRM is murkier (§12), but the soft-SRM intuition still motivates the practice. For our polynomial-regression toy, weight decay is just ridge regression and the ridge effective-DoF analysis applies verbatim.

Ridge vs lasso on the polynomial toy

Fit ridge and lasso on the polynomial Vandermonde with the same training sample. Compute training MSE, effective DoF (smooth tr(Sλ)\mathrm{tr}(S_\lambda) for ridge, integer-valued θ^0\|\hat\theta\|_0 for lasso), and the soft-SRM total. The figure compares ridge λ^\hat\lambda + effective DoF, lasso λ^\hat\lambda + number of nonzeros, §4 Vapnik hard-SRM k^+1\hat k + 1, and the §1 bias-variance oracle k+1k^* + 1. All four should fall in a narrow band on this benign toy.

Ridge and lasso are both soft SRM on different implicit families. Ridge ℓ² ball: rotationally symmetric, smooth effective DoF. Lasso ℓ¹ ball: cross-polytope, integer-valued capacity. The user picks the penalty geometry; SRM picks the level.
Ridge vs lasso paths on the polynomial-regression toy. Two columns: left for ridge, right for lasso. Top row shows fits at the picked $\hat\lambda$ overlaid on training data. Bottom row shows the soft-SRM total as a function of λ (log scale) with $\hat\lambda$ marked and the picked effective DoF (continuous for ridge, integer for lasso) labeled.
Ridge and lasso are both soft SRM, on different implicit families. Ridge picks a smooth effective DoF; lasso picks an integer number of nonzero coefficients. The two pick comparable capacity, but the geometry of the regularization is different — the user chooses the penalty geometry; SRM picks the level.

SRM and information criteria

AIC

The Akaike Information Criterion (Akaike 1973) is AICk=2logp(Yθ^k)+2dk\mathrm{AIC}_k = -2 \log p(Y \mid \hat\theta_k) + 2 d_k. For Gaussian-noise regression with plug-in σ2\sigma^2, AIC is equivalent to picking k^\hat k minimizing L^n(h^k)+2σ2dk/n\hat L_n(\hat h_k) + 2 \sigma^2 d_k / n. So AIC is exactly the SRM rule with capacity penalty penAIC(dk,n)=2σ2dk/n\mathrm{pen}_{\mathrm{AIC}}(d_k, n) = 2\sigma^2 d_k / n — linear in dkd_k, no log\log factor, no confidence parameter.

Motivation: Akaike derived AIC by asking, asymptotically, which model gives the smallest Kullback–Leibler divergence from the true distribution. The rule is calibrated for prediction risk, not selection consistency: AIC is asymptotically efficient for prediction.

BIC

The Bayesian Information Criterion (Schwarz 1978) is BICk=2logp(Yθ^k)+dklogn\mathrm{BIC}_k = -2 \log p(Y \mid \hat\theta_k) + d_k \log n — penalty dklognd_k \log n rather than 2dk2 d_k. For Gaussian-noise regression, BIC penalty is σ2dklogn/n\sigma^2 d_k \log n / n. For ne27.4n \ge e^2 \approx 7.4, logn>2\log n > 2, so BIC is strictly more conservative than AIC.

Motivation: Schwarz derived BIC as a Laplace approximation to the negative log marginal likelihood of a Bayesian model with a flat prior. BIC is consistent for selection: if the true model lies in the family, BIC picks it with probability 1\to 1 as nn \to \infty. The AIC vs BIC distinction: AIC minimizes prediction risk asymptotically, BIC identifies the true model asymptotically.

MDL

The Minimum Description Length principle (Rissanen 1978) views model selection as data compression. The “best” coding scheme for parameters (Rissanen’s universal Bayesian code) yields MDLklogp(Yθ^k)+(dk/2)logn+O(1)\mathrm{MDL}_k \approx -\log p(Y \mid \hat\theta_k) + (d_k/2) \log n + O(1). Doubling and comparing to BIC: 2MDLkBICk2 \mathrm{MDL}_k \approx \mathrm{BIC}_k asymptotically. MDL and BIC are the same rule from different starting points. See Minimum Description Length for the coding-theoretic substrate.

Unification

All five SRM rules fit the same template — pick k^\hat k minimizing L^n(h^k)+pen(dk,n,δ)\hat L_n(\hat h_k) + \mathrm{pen}(d_k, n, \delta). They differ in the penalty function:

Rule | Penalty | Shape
AIC | 2\sigma^2 d_k / n | linear in d_k; no \log n; no \delta
BIC, MDL | \sigma^2 d_k \log n / n | linear in d_k; \log n factor; no \delta
Vapnik | C \sqrt{(d_k \log(n/d_k) + \log(1/\delta)) / n} | square-root in d_k; \log n factor; finite-sample \delta
Rademacher | 2 \hat{\mathfrak{R}}_n + 3 \sqrt{\log(1/\delta_k)/(2n)} | empirical; data-dependent; finite-sample \delta

Three structural differences. Capacity shape: AIC/BIC are linear (parametric, fast rate); Vapnik/Rademacher are square-root (non-parametric, slow rate). logn\log n factor: BIC and Vapnik have it; AIC and Rademacher don’t. Confidence parameter: Vapnik and Rademacher are finite-sample bounds; AIC and BIC are asymptotic.

Each targets a different goal: AIC for asymptotic prediction efficiency; BIC/MDL for selection consistency; Vapnik for worst-case finite-sample generalization; Rademacher for data-dependent finite-sample generalization.
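
All four calibrations plug into the same selection routine. A sketch using the toy's σ = 0.2 (so σ² = 0.04) and δ = 0.05; the function and penalty names are illustrative:

import numpy as np

def srm_pick(train_mse, dims, n, penalty):
    """Shared template: pick the index minimizing training MSE + penalty(d_k, n)."""
    totals = [mse + penalty(d, n) for mse, d in zip(train_mse, dims)]
    return int(np.argmin(totals))

# Example calibrations from the table above (sigma^2 = 0.04, delta = 0.05):
aic_pen = lambda d, n: 2.0 * 0.04 * d / n
bic_pen = lambda d, n: 0.04 * d * np.log(n) / n
vapnik_pen = lambda d, n: np.sqrt((d * np.log(n / max(d, 1)) + np.log(1 / 0.05)) / n)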

Penalty shapes and pick-agreement

For n{50,100,500}n \in \{50, 100, 500\} on the polynomial toy, compute training MSE and each of the four penalties as functions of kk. The figure plots the four penalty shapes on a log y-scale (the bound-based and asymptotic penalties span ~50× in magnitude) and tabulates the picked k^\hat k for each rule.

AIC pick: 5 · BIC pick: 3 · Vapnik pick: 3 · Rademacher pick: 1
Same SRM template, different calibrations, different picks. AIC/BIC are parametric/linear; Vapnik/Rademacher are non-parametric/square-root and dominate AIC/BIC by 1-2 orders of magnitude.
Penalty shapes for AIC, BIC, Vapnik (C = 1), and Rademacher as functions of polynomial degree k at n = 50 and n = 500, on a log y-scale. The Vapnik and Rademacher penalties dominate AIC/BIC by 1–2 orders of magnitude; AIC and BIC are linear in d_k while the bound-based penalties are square-root.
Four penalty shapes on a log y-scale. The structural difference between linear (AIC, BIC) and square-root (Vapnik, Rademacher) capacity penalties is visible across two orders of magnitude. Same SRM template, different calibrations, different picks.

SRM and PAC-Bayes

From class-indexed nesting to posterior-averaged capacity

PAC-Bayes generalizes SRM substantially: the rule produces a probability distribution QQ over hypotheses, with the predictor being either the posterior-mean function hˉQ(x)=EhQ[h(x)]\bar h_Q(x) = \mathbb{E}_{h \sim Q}[h(x)] or a stochastic sample from QQ.

The setup. Fix a prior PP over H\mathcal{H} chosen before seeing the data. After observing SS, choose a posterior QQ — any data-dependent distribution, subject to QPQ \ll P. The PAC-Bayes framework bounds the posterior-averaged generalization error EhQ[L(h)]\mathbb{E}_{h \sim Q}[L(h)] in terms of posterior-averaged training error and KL divergence KL(QP)\mathrm{KL}(Q \| P). The implicit nested family is indexed by KL: Qr={Q:KL(QP)r}\mathcal{Q}_r = \{Q : \mathrm{KL}(Q \| P) \le r\}. PAC-Bayes is smooth SRM with KL as the capacity functional.

The PAC-Bayes bound (Catoni–McAllester family)

Theorem 4 (PAC-Bayes generalization bound (McAllester)).

Fix a prior PP over H\mathcal{H} chosen independently of the sample. Suppose the loss takes values in [0,1][0, 1]. For any δ(0,1)\delta \in (0, 1), with probability at least 1δ1 - \delta over SDnS \sim D^n, every posterior QQ satisfies

EhQ[L(h)]    EhQ[L^n(h)]+KL(QP)+log(2n/δ)2n.(9.1)\mathbb{E}_{h \sim Q}[L(h)] \;\le\; \mathbb{E}_{h \sim Q}[\hat L_n(h)] \,+\, \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(2 \sqrt{n} / \delta)}{2 n}}. \tag{9.1}

The bound is uniform over all QQ by a continuous-QQ MGF argument (PAC-Bayes Bounds contains the full derivation via Donsker–Varadhan). Catoni’s tighter 2007 form replaces the square-root capacity term with an exponentially-tilted bound sharper for small EQ[L^n]\mathbb{E}_Q[\hat L_n]; for SRM purposes the qualitative shape is identical.

KL divergence as a continuous capacity measure

For Gaussian posteriors and priors, KL has a closed form. With P=N(0,σP2Id)P = \mathcal{N}(0, \sigma_P^2 I_d) and Q=N(μQ,σQ2Id)Q = \mathcal{N}(\mu_Q, \sigma_Q^2 I_d),

KL(QP)  =  12(dσQ2+μQ2σP2d+dlogσP2σQ2).(9.2)\mathrm{KL}(Q \| P) \;=\; \frac{1}{2}\left(\frac{d \sigma_Q^2 + \|\mu_Q\|^2}{\sigma_P^2} \,-\, d \,+\, d \log \frac{\sigma_P^2}{\sigma_Q^2}\right). \tag{9.2}

KL is monotone in μQ2\|\mu_Q\|^2 — the connection to Tikhonov regularization from §7 becomes explicit at the PAC-Bayes level. Minimizing EQ[L^n]+KL/(2n)\mathbb{E}_Q[\hat L_n] + \sqrt{\mathrm{KL}/(2n)} over a Gaussian QQ centered at the ridge estimator produces a penalty α^λ2/σP2\propto \|\hat\alpha_\lambda\|^2 / \sigma_P^2.
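
The closed form (9.2) is two lines of code; a direct transcription (the function name is illustrative):

import numpy as np

def kl_isotropic_gaussians(mu_q, sigma_q, sigma_p):
    """KL( N(mu_q, sigma_q^2 I_d) || N(0, sigma_p^2 I_d) ), eq. (9.2)."""
    d = len(mu_q)
    return 0.5 * ((d * sigma_q ** 2 + float(mu_q @ mu_q)) / sigma_p ** 2
                  - d + d * np.log(sigma_p ** 2 / sigma_q ** 2))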

Soft-to-hard SRM limit

Posterior concentrates on a single hypothesis: KL(QP)=\mathrm{KL}(Q \| P) = \infty for any continuous prior PP. To recover hard SRM, use a discrete prior: PP uniform over {h^1,,h^K}\{\hat h_1, \ldots, \hat h_K\}, QQ point-mass on one h^k\hat h_k, KL(QP)=logK\mathrm{KL}(Q \| P) = \log K — the union-bound rule from §3 as a special case.

Posterior equals prior: KL(QP)=0\mathrm{KL}(Q \| P) = 0 gives a tight bound on the prior-mean predictor, which has no data-adaptation.

The PAC-Bayes-optimal posterior is the Gibbs posterior Q(h)P(h)exp(βL^n(h))Q^*(h) \propto P(h) \exp(-\beta \cdot \hat L_n(h)) — the analog of the Bayesian posterior, with β\beta controlling concentration. The PAC-Bayes-optimal β\beta trades off the two extremes: exactly the SRM trade-off. See PAC-Bayes Bounds for the full Gibbs construction.

PAC-Bayes on the polynomial toy

The figure uses the polynomial Vandermonde features of degree 15 as the parameter space (d=16d = 16). Prior P=N(0,σP2I16)P = \mathcal{N}(0, \sigma_P^2 I_{16}) with σP=1\sigma_P = 1. For a logarithmic grid of λ\lambda, the posterior mean is set to the ridge estimate and the posterior covariance to τ2I16\tau^2 I_{16} with τ=0.1\tau = 0.1. We compute KL via (9.2), posterior-averaged training error, and PAC-Bayes total.

The τ=0.1\tau = 0.1 choice is a pedagogical simplification — the proper Bayesian-ridge posterior covariance is σ2(VV+λI)1\sigma^2 (V^\top V + \lambda I)^{-1}, which depends on λ\lambda. Decoupling τ\tau from λ\lambda keeps the §9 demo focused on the KL-as-capacity story without re-engineering for the full Bayesian-ridge posterior.
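
A sketch of the demo's scoring loop, reusing the kl_isotropic_gaussians sketch above. Squared error is not bounded in [0, 1], so treat the total as an illustrative soft-SRM score in the spirit of (9.1) rather than a literal instance of Theorem 4:

import numpy as np

def pac_bayes_total(X, Y, lam, k_max=15, sigma_p=1.0, tau=0.1, delta=0.05):
    """Posterior-averaged training MSE + sqrt((KL + log(2 sqrt(n)/delta)) / (2n))
    for a Gaussian posterior centred at the ridge estimate with covariance tau^2 I."""
    n = len(X)
    V = np.vander(X, k_max + 1, increasing=True)
    mu_q = np.linalg.solve(V.T @ V + lam * np.eye(k_max + 1), V.T @ Y)
    # E_{theta ~ Q}[(y_i - v_i' theta)^2] = (y_i - v_i' mu_q)^2 + tau^2 ||v_i||^2
    avg_mse = float(np.mean((Y - V @ mu_q) ** 2 + tau ** 2 * np.sum(V ** 2, axis=1)))
    kl = kl_isotropic_gaussians(mu_q, tau, sigma_p)
    return avg_mse + np.sqrt((kl + np.log(2 * np.sqrt(n) / delta)) / (2 * n))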

Picked effective DoF at λ̂: 5.67. KL is the continuous capacity measure; the prior σ_P chooses *which* implicit family, the posterior τ chooses *where* on the family to land.
Three panels showing the PAC-Bayes Catoni-McAllester soft-SRM path on degree-15 ridge fits. Left: posterior-averaged training MSE and KL divergence as functions of effective DoF. Middle: PAC-Bayes total = MSE + sqrt((KL + log(2√n/δ))/(2n)) with the picked $\hat\lambda$ marked. Right: comparison of picked degree across PAC-Bayes, Vapnik, and oracle k*.
PAC-Bayes soft SRM on the polynomial toy. KL grows with effective DoF; the PAC-Bayes total has an interior minimum that tracks the bias-variance oracle as σ_P widens. The prior is the choice of *which* implicit family; the posterior is the choice of *where* on the family to land.

Cross-validation as data-driven SRM

CV as a complexity-penalty surrogate

Cross-validation takes a different route from the bound-based rules: estimate the population risk by holding out part of the sample. The KK-fold cross-validation procedure for Hk\mathcal{H}_k partitions SS into KK folds, fits h^k(j)\hat h_k^{(-j)} on the other K1K-1 folds, and evaluates on the held-out fold:

L^kCV  =  1Kj=1K1S(j)iS(j)(h^k(j)(Xi),Yi).\hat L^{\mathrm{CV}}_k \;=\; \frac{1}{K} \sum_{j=1}^K \frac{1}{|S^{(j)}|} \sum_{i \in S^{(j)}} \ell\bigl(\hat h_k^{(-j)}(X_i), Y_i\bigr).

CV picks k^CVargminkL^kCV\hat k_{\mathrm{CV}} \in \arg\min_k \hat L^{\mathrm{CV}}_k and returns h^k^CV\hat h_{\hat k_{\mathrm{CV}}} refit on the full sample. The key structural fact: no explicit penalty. L^kCV\hat L^{\mathrm{CV}}_k is an unbiased estimator of the population risk of the (K1)/K(K-1)/K-subsample fit. The trade-off is information loss: CV uses 1/K1/K of the data for testing per fold, so it has higher statistical variance than bound-based rules.
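
A minimal K-fold sketch for the polynomial family (the fold split and the function name are illustrative; refitting the winner on the full sample is left to the caller):

import numpy as np

def cv_pick_degree(X, Y, k_grid, K=5, rng=None):
    """K-fold CV: fit OLS on K-1 folds, score MSE on the held-out fold,
    pick the degree minimizing the fold-averaged held-out error."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    folds = np.array_split(rng.permutation(n), K)
    cv_scores = []
    for k in k_grid:
        fold_mse = []
        for j in range(K):
            test = folds[j]
            train = np.setdiff1d(np.arange(n), test)
            V_tr = np.vander(X[train], k + 1, increasing=True)
            V_te = np.vander(X[test], k + 1, increasing=True)
            coef, *_ = np.linalg.lstsq(V_tr, Y[train], rcond=None)
            fold_mse.append(np.mean((Y[test] - V_te @ coef) ** 2))
        cv_scores.append(float(np.mean(fold_mse)))
    return int(k_grid[int(np.argmin(cv_scores))]), cv_scores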

A CV oracle inequality

Theorem 5 (CV oracle inequality (finite-class)).

Let {Hk}k=1M\{\mathcal{H}_k\}_{k=1}^M be a finite family with bounded loss [0,B]\ell \in [0, B]. For KK-fold CV with fold size ntest=n/Kn_{\mathrm{test}} = n/K, with probability at least 1δ1 - \delta over the random fold partition,

L(h^k^CV)    minkL(h^k)+B2log(2M/δ)ntest.L\bigl(\hat h_{\hat k_{\mathrm{CV}}}\bigr) \;\le\; \min_k L(\hat h_k) \,+\, B \sqrt{\frac{2 \log(2M/\delta)}{n_{\mathrm{test}}}}.

The logM/ntest\sqrt{\log M / n_{\mathrm{test}}} rate matches SRM up to constants. For infinite families, the analog (Bartlett, Lugosi, Mendelson 2002; Yang 2007) replaces logM\log M with a stability-based or VC-based measure. Proof sketch: Hoeffding on each fold plus union bound over MM candidates.

The CV-variance vs SRM-stability trade-off

CV has higher statistical variance — the score depends on the random fold partition, with width 1–3 degrees on the polynomial toy across 100 partitions. Bound-based SRM depends on the algorithm’s stability — Vapnik is loose for stable algorithms, tight for unstable. Practical rule: CV for complex algorithms (deep nets, ensembles); bound-based SRM for simple families with clean capacity analysis.

When CV picks differently from SRM

Three sources of disagreement: loose bound constants (Vapnik’s CC slack); asymptotic vs finite-nn calibration (AIC/BIC are asymptotic, CV is finite-sample); fold-partition noise (CV picks vary by 1–2 degrees across rerolls). Agreement is reassuring; disagreement is diagnostic.

CV vs Vapnik vs Rademacher

Run 5-fold CV at n=50n = 50 with 100 random fold partitions. The figure aggregates to the mean CV curve, ±1\pm 1 std band across partitions, and the distribution of picked k^\hat k. The histogram of CV picks shows fold-to-fold variability concretely.

CV pick mode: 3 across 50 fold rerolls at K=5, n=50. The histogram shows fold-partition variance directly — 1–3 degrees of spread is typical at small n.
Cross-validation as data-driven SRM. Left: 5-fold CV curve (mean ± 1 std across 100 fold partitions) vs polynomial degree k at n = 50, with comparison curves for Vapnik and Rademacher SRM. Right: histogram of CV picks across 100 fold partitions showing fold-to-fold variability of 1–2 degrees.
CV is the practitioner's empirical SRM. The pick varies across fold partitions; the histogram captures the statistical variance directly. On the polynomial toy CV agrees with the Rademacher pick to within ±1 degree at n = 50.

Worked example: polynomial regression with SRM

Setup recap

The toy from §1, used throughout: XUniform(1,1)X \sim \mathrm{Uniform}(-1, 1), m(x)=sin(πx)m(x) = \sin(\pi x), εN(0,0.22)\varepsilon \sim \mathcal{N}(0, 0.2^2), n{50,100,500}n \in \{50, 100, 500\}, Hk=\mathcal{H}_k = polynomials of degree k\le k.

Bias-variance Monte Carlo

The bias-variance decomposition from §1, with B=200B = 200 replicates per sample size. The argmin of MSE over kk defines the bias-variance oracle k(n)k^*(n).
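
A sketch of that Monte Carlo; the error is measured against m(x) itself, so the σ² floor is dropped as in §1 (names and grid sizes are illustrative):

import numpy as np

def oracle_k_star(n, sigma, k_grid, B=200, seed=0):
    """Bias-variance oracle: average integrated squared error of the degree-k
    OLS fit against m(x) = sin(pi x), over B independent training samples."""
    rng = np.random.default_rng(seed)
    xg = np.linspace(-1.0, 1.0, 400)     # evaluation grid
    mg = np.sin(np.pi * xg)              # regression function on the grid
    mse = np.zeros(len(k_grid))
    for _ in range(B):
        X = rng.uniform(-1.0, 1.0, size=n)
        Y = np.sin(np.pi * X) + sigma * rng.standard_normal(n)
        for i, k in enumerate(k_grid):
            coef, *_ = np.linalg.lstsq(np.vander(X, k + 1, increasing=True), Y, rcond=None)
            pred = np.vander(xg, k + 1, increasing=True) @ coef
            mse[i] += np.mean((pred - mg) ** 2) / B
    return int(k_grid[int(np.argmin(mse))]), mse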

The agreement matrix

For each rule and each nn, the picked k^\hat k on the executed notebook (NumPy PCG64 seed 20260512):

Rule (this seed) | n = 50 | n = 100 | n = 500
AIC | 5 | 5 | 5
BIC | 5 | 5 | 5
Vapnik (C = 1) | 3 | 3 | 3
Rademacher | 1 | 1 | 3
5-fold CV (mode, 100 rerolls) | 4 | 5 | 5
PAC-Bayes (degree = \lfloor \mathrm{DoF}\rfloor - 1) | 4 | 5 | 6
Oracle k^* | 5 | 5 | 5

Three patterns: the picks cluster within a 4-degree band; AIC, BIC, and the oracle sit at the top of that band while the bound-based rules pick smaller degrees at moderate nn; all picks shift upward (or hold steady) with nn (consistency in action). The one surprise is Rademacher at small nn — the McDiarmid confidence term in penR\mathrm{pen}_R dominates the data-dependent capacity savings on this benign distribution, and the rule picks k^R=1\hat k_R = 1 rather than the k^Rk^V\hat k_R \ge \hat k_V asymptotic prediction.

Sensitivity to noise, sample size, and confidence

Noise variance: larger σ\sigma shifts oracle kk^* down. AIC and BIC have explicit σ2\sigma^2 factors. Vapnik/Rademacher are σ\sigma-free but indirectly affected via the training-error curve. Sample size: all rules shift up with nn. Confidence parameter: only Vapnik, Rademacher, PAC-Bayes depend on δ\delta; AIC, BIC, CV are δ\delta-free.

Agreement matrix, money shot, sensitivity sweep

Every rule, every pick, at this (n, σ, K, δ). Set the headline-rule selector to swap whose fit drives the money shot. The agreement matrix reveals when rules disagree (small n) and when they cluster (large n).
The flagship figure of the topic: Rademacher SRM-picked polynomial fit at n = 50 overlaid on training data, true function m(x) = sin(πx), and a ±2-std envelope from a B = 200 bias-variance Monte Carlo. The picked fit captures the sinusoidal shape; the envelope is tight near the center and widens toward the endpoints.
The money shot. The Rademacher-SRM-picked polynomial fit at n = 50, with the bias-variance envelope from a Monte Carlo of B = 200 replicates. The fit captures the sin(πx) shape; the envelope is the visualisation of the variance term in §1.2.
Sensitivity sweep: each rule's picked $\hat k$ as a function of noise standard deviation σ in [0.05, 0.5] at n = 50. All rules shift downward as σ grows, consistent with the bias-variance optimum shifting down with noise.
Sensitivity sweep. Each rule's $\hat k$ as a function of noise σ at n = 50. All rules are weakly decreasing in σ; the spread between rules is roughly 2–3 degrees across this σ range.

SRM in practice: SVMs and beyond

SVMs and the CC-parameter as SRM

Soft-margin SVMs (Cortes and Vapnik 1995) are the canonical classical-SRM application:

minw,b,ξ    12w22+Ci=1nξisubject toyi(w,ϕ(xi)+b)1ξi,  ξi0.\min_{w, b, \xi} \;\;\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i (\langle w, \phi(x_i) \rangle + b) \ge 1 - \xi_i, \;\xi_i \ge 0.

This is soft SRM (Definition 5) with hinge loss and capacity functional C(w)=12w22C(w) = \tfrac{1}{2}\|w\|_2^2. The implicit nested family is the margin-balls Hγ={(w,b):w21/γ}\mathcal{H}_\gamma = \{(w, b) : \|w\|_2 \le 1/\gamma\}. The capacity calculation is celebrated: for an RKHS with bounded kernel K(x,x)R2K(x, x) \le R^2,

R^n(Hγ)    Rγn.\hat{\mathfrak{R}}_n(\mathcal{H}_\gamma) \;\le\; \frac{R}{\gamma \sqrt{n}}.

Dimension-free — this is why kernel methods work in infinite-dimensional spaces without curse-of-dimensionality penalty.

Neural networks: where classical SRM goes loose

For a depth-LL neural network with WW total parameters, dimVC=Θ(WLlogW)\dim_{\mathrm{VC}} = \Theta(W L \log W) (Bartlett, Maiorov, Meir 1998). For W=107W = 10^7, L=50L = 50, that’s 5×109\approx 5 \times 10^9 — vacuous bound. Why classical SRM fails: the bound is uniform over Hw\mathcal{H}_w but SGD explores only a small subset; implicit regularization biases SGD toward low-norm solutions. Active research: spectral norms (Bartlett, Foster, Telgarsky 2017), PAC-Bayes with data-dependent priors (Dziugaite and Roy 2017), sharpness (Foret et al. 2021).

The overparameterized regime

Modern ML operates with WnW \gg n. Classical SRM says generalization should be terrible; empirically it’s fine. Three theory threads address the puzzle: benign overfitting (Bartlett, Long, Lugosi, Tsigler 2020), implicit regularization, norm-based margin bounds. None is yet predictive. Practical model selection still uses CV.

Double descent as the modern complication

Belkin, Hsu, Ma, Mandal (2019); Nakkiran et al. (2020): test risk as a function of capacity has two regimes. Classical regime (WnW \ll n): U-curve. Interpolation threshold (WnW \approx n): catastrophic peak. Modern regime (WnW \gg n): second descent, often below the classical minimum. Detailed treatment: Double Descent (coming soon).

The double-descent picture: classical U-curve (bias → variance) up to the interpolation threshold W ≈ n, catastrophic peak at the threshold, then a second descent in the modern overparameterised regime (W ≫ n). Sketch only — full treatment in Double Descent (coming soon).

SVM-CC sweep on a 2D classification toy

A 2D binary classification toy (n=100n = 100, two Gaussian clusters). RBF-kernel SVM across C[102,102]C \in [10^{-2}, 10^2]. 5-fold CV picks C^\hat C. The four-panel figure shows decision boundaries at three CC values plus the CV error vs CC path.
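
A sketch of that sweep with scikit-learn (SVC and cross_val_score are the standard API; the cluster means and the C grid are illustrative stand-ins for the toy's exact settings):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),    # class 0 cluster
               rng.normal(+1.0, 1.0, size=(n // 2, 2))])   # class 1 cluster
y = np.repeat([0, 1], n // 2)

C_grid = np.logspace(-2, 2, 21)
cv_error = [1.0 - cross_val_score(SVC(C=C, kernel="rbf"), X, y, cv=5).mean()
            for C in C_grid]
C_hat = C_grid[int(np.argmin(cv_error))]    # SRM-picked margin-capacity level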

Four-panel figure: three RBF-SVM decision boundaries at C ∈ {0.01, $\hat C$, 100} on a 2D Gaussian-cluster toy showing the regularization-to-overfit transition; fourth panel shows 5-fold CV error vs C on a log scale with $\hat C$ marked.
SVM C-parameter sweep on a 2D binary classification toy. Small C: underfit; large C: overfit; $\hat C$ at the CV minimum: the SRM-picked margin-capacity sweet spot. The SVM is the cleanest classical-SRM application: hinge loss plus norm penalty plus margin-ball implicit family.

Connections and limits

Tightness of the bounds

The Vapnik bound is loose by orders of magnitude on benign distributions — the price of being distribution-free. Rademacher tightens by going data-dependent (factor of log(n/dk)\sqrt{\log(n/d_k)} for nice distributions). PAC-Bayes tightens further by leveraging the prior. Lower bounds: the Vapnik rate is tight up to constants for some problem instances — the worst case is real, just not the average case. Practitioner hierarchy: Vapnik when a distribution-free guarantee is wanted and looseness is tolerable; Rademacher for a tighter, data-dependent guarantee on the specific dataset; PAC-Bayes when prior knowledge is available; CV when the empirically right answer is all that matters.

Non-nested families

Many families aren’t nested: kk-NN, different SVM kernels, decision trees by leaf count, bagging. The §2.4 workaround: drop the strict inclusion requirement and keep only the capacity-indexed part. The union-bound argument (§3.1) never used HjHk\mathcal{H}_j \subset \mathcal{H}_k — only kδkδ\sum_k \delta_k \le \delta. The same SRM estimator and oracle inequality work for any countable capacity-indexed family. What’s lost is interpretive: the bias term doesn’t decompose monotonically. For unions of finite collections of nested chains (e.g., SVM with multiple kernels), hierarchical SRM with δm=δ/M\delta_m = \delta/M across chains works cleanly.

Computational cost

Bound-based SRM: per-class fit cost × number of classes. Cheap for polynomial regression; manageable for SVMs; dominated by training cost for deep nets. CV: K×K \times per-class fit cost. Expensive for large neural nets. Implicit regularization (early stopping, dropout): essentially free but theoretically opaque. The trade-off: clean theory and low cost (bounds) vs universal applicability (CV) vs cheap-but-opaque (implicit).

What SRM doesn’t tell you

Distribution shift: SRM assumes i.i.d. training and test data. Real-world shift breaks this; domain adaptation is an active research area (Ben-David et al. 2010). Agnostic-PAC gap: SRM bounds L(h^k^)L(Hk^)L(\hat h_{\hat k}) - L^*(\mathcal{H}_{\hat k}), not L(h^k^)LL(\hat h_{\hat k}) - L^* (Bayes-optimal gap); if the family doesn’t approximate Bayes-optimal well, SRM is vacuous on absolute risk. Model misspecification: classical SRM assumes well-specified models; under misspecification, “best-in-class” L(Hk)L^*(\mathcal{H}_k) may be far from any meaningful target. Other: family choice (vs index within family), loss-function choice, adversarial robustness, active learning. SRM answers one precise question: given a fixed family and i.i.d. sampling, how to pick the index adaptively. It is silent on everything else.

Where to go next

  • Double Descent (coming soon) — the modern overparameterized regime; benign overfitting; implicit regularization.
  • PAC-Bayes Bounds — full PAC-Bayes theory; Donsker-Varadhan, Maurer’s lemma, Catoni.
  • Generalization Bounds — McDiarmid plus symmetrization; Bartlett-Mendelson Rademacher bounds in detail.
  • PAC Learning — foundational VC plus Sauer-Shelah plus FTSL plus Rademacher.
  • Stacking and Predictive Ensembles — ensembling and Bayesian model averaging as soft model averaging across the SRM family.

Beyond formalML: Massart’s Concentration Inequalities and Model Selection (Springer 2007); Mohri, Rostamizadeh, Talwalkar’s Foundations of Machine Learning (MIT 2018); Vapnik’s Statistical Learning Theory (Wiley 1998).

Computational notes

Every numerical experiment in this topic runs in under a second per slider commit on a 2020-era CPU. The computational bottleneck across §§4–11 is the polynomial Vandermonde-based linear algebra; below is the Python skeleton used to produce the figures.

import numpy as np

def make_data(n, sigma, rng):
    """Polynomial-regression toy: X ~ U(-1,1), Y = sin(πX) + N(0, sigma²)."""
    X = rng.uniform(-1.0, 1.0, size=n)
    Y = np.sin(np.pi * X) + sigma * rng.standard_normal(n)
    return X, Y


def vapnik_penalty(d_k, n, k, delta, C=1.0):
    """Vapnik penalty (Definition 3) with universal constant C exposed."""
    capacity = d_k * np.log(2 * n / np.maximum(d_k, 1))
    klog = np.where(np.asarray(k) <= 1, 0.0, 2 * np.log(np.maximum(k, 1)))
    conf = np.log((np.pi ** 2) / (6 * delta))
    return C * np.sqrt((capacity + klog + conf) / n)


def empirical_rademacher_polynomial(X, k, B, rng):
    """Closed-form empirical Rademacher complexity for the polynomial unit ball."""
    V = np.vander(X, k + 1, increasing=True)
    Q, _ = np.linalg.qr(V)  # thin QR — orthonormal basis for col(V)
    n = len(X)
    norms = np.empty(B)
    for b in range(B):
        sigma = rng.choice([-1, 1], size=n)
        norms[b] = np.linalg.norm(Q.T @ sigma) / np.sqrt(n)
    return norms.mean(), norms.std() / np.sqrt(B)


def estimate_sigma2(X, Y, k_max):
    """Unbiased plug-in noise variance from SVD-pseudoinverse OLS at k_max."""
    V = np.vander(X, k_max + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(V, Y, rcond=None)
    rss = float(np.sum((Y - V @ coef) ** 2))
    return rss / max(len(Y) - (k_max + 1), 1)


def aic_penalty(d_k, n, sigma2):
    return 2.0 * sigma2 * d_k / n


def bic_penalty(d_k, n, sigma2):
    return sigma2 * d_k * np.log(n) / n


def ridge_fit_predict(X, Y, k_max, lam):
    V = np.vander(X, k_max + 1, increasing=True)
    n, d = V.shape
    A = V.T @ V + lam * np.eye(d)
    alpha = np.linalg.solve(A, V.T @ Y)
    return V @ alpha, alpha


def ridge_effective_dof(X, k_max, lam):
    V = np.vander(X, k_max + 1, increasing=True)
    s = np.linalg.svd(V, compute_uv=False)
    return float(np.sum(s ** 2 / (s ** 2 + lam)))

Numerical pitfalls worth flagging. Vandermonde conditioning at high degree: the monomial Vandermonde on n=50n = 50 points at degree 15 has condition number 1020\approx 10^{20}; use SVD-pseudoinverse (numpy.linalg.lstsq with rcond=None) or QR rather than the normal equations directly. Numerical lasso convergence: at very small λ\lambda, the lasso fit on a degree-15 Vandermonde converges slowly under coordinate descent; set max_iter generously and suppress benign convergence warnings, or use the Chebyshev basis as the working basis. Plug-in σ^2\hat\sigma^2: at small nn relative to dmaxd_{\max} the unbiased estimator RSS/(ndmax)\mathrm{RSS}/(n - d_{\max}) has non-trivial variance, and BIC picks can shift by 1–2 degrees across reseeds. The agreement matrix above gives the central pick at the notebook seed; expect drift across alternative seeds.

Connections

  • PAC-learning's Fundamental Theorem of Statistical Learning supplies the per-class uniform convergence bound (4.1) that SRM converts into a capacity penalty via §3.1's confidence allocation. Without uniform convergence at every level of the nested family, the union-bound argument that produces the SRM oracle inequality has nothing to work with. pac-learning
  • The McDiarmid-plus-symmetrization machinery (`generalization-bounds` §6.3) that establishes (5.1) is the foundation of §5's Bartlett–Mendelson SRM penalty. The Rademacher complexity definition (`generalization-bounds` §3, §5) and Talagrand's contraction lemma (`generalization-bounds` §5.2) are used as given. SRM is the model-selection layer built on top of generalization-bounds' uniform-convergence layer. generalization-bounds
  • PAC-Bayes is the continuous-capacity generalization of discrete SRM, with KL divergence between posterior and prior playing the role of the capacity penalty. §9 of this topic reproduces the McAllester bound in SRM form; the Catoni-Gibbs construction that pac-bayes-bounds develops in detail is the soft-to-hard limit of §9.4. pac-bayes-bounds
  • Hoeffding's inequality is the workhorse of every uniform-convergence-based SRM penalty. McDiarmid's bounded-differences inequality is required for the Rademacher penalty via symmetrization. §3.4's proof uses the union bound over countably many classes; the per-class Pr[gap > pen] ≤ δ_k comes from Hoeffding-type concentration inside each class. concentration-inequalities
  • KL divergence is the load-bearing capacity measure in §9's PAC-Bayes SRM. The closed-form KL between two isotropic Gaussians (eq 9.2) is the workhorse of the §9.5 polynomial-regression PAC-Bayes demo; minimizing KL(Q∥P) under a posterior-averaged training-error constraint is exactly what soft SRM does on the implicit nested family generated by the prior. kl-divergence
  • Ensembling and Bayesian model averaging are soft model averaging across the SRM family: instead of picking one $\hat k$, the practitioner combines the predictions of all $\hat h_k$ with weights that reflect each class's posterior probability or stacking score. The SRM oracle inequality bounds the loss of the single picked class; the stacking analogue bounds the loss of the convex combination. stacking-and-predictive-ensembles

References & Further Reading