
Generalization Bounds

Rademacher complexity, uniform convergence, and the limits of classical theory

Motivation: why bound the generalization gap?

There is a question every practitioner has asked, and most have asked it under their breath: the model fits the training set, but will it work on data I haven’t seen yet? The training error is a number we can compute; the test error is the number we actually care about. This chapter is about the gap between them — how big it can be, what makes it bigger or smaller, and what kinds of guarantees we can write down in advance.

The gap matters even when training accuracy is excellent. A neural network with a billion parameters can interpolate ImageNet — zero training error, over a million training images memorized — and still fail miserably on a held-out test image. The same network can sometimes generalize beautifully. Why? When we ask “why,” what we want is a bound: a number \epsilon such that the test error is at most the training error plus \epsilon, with high probability over the random draw of the training set. The size of \epsilon should depend on something — the size of the training set, the complexity of the model, the noise in the data, the choice of algorithm. Producing such bounds is what generalization theory does, and the rest of this chapter is the story of how we produce them.

A first principle. The chapter does not deliver bounds in the practitioner’s sense of “this number will be the test error to within ten percent.” Almost no classical bound, applied to a real neural network on a real dataset, gives a number you would stake money on. The classical bounds are too loose; the practical estimates (held-out validation, cross-validation) come without theoretical guarantees on the bound’s tightness. The classical theory’s value is structural, not numerical: it tells us what features of a learning problem tighten or loosen the gap, and which proof techniques bite hardest on which kinds of hypothesis class. When we eventually look at a 2-layer MLP on a binary MNIST subset (§12) and see that the classical Rademacher bound exceeds one — i.e., is uninformative — that’s not a failure of the chapter; that’s the chapter delivering an honest reckoning of what the classical machinery does and does not yet say about deep nets. We end (§14) with pointers to the post-classical bounds (PAC-Bayes, norm-based, compression-based) that improve on this picture.

A second principle. Bounds come in flavors, and a meaningful bound has to be both valid (it really is an upper bound on the gap, with high probability) and non-vacuous (its numerical value is less than the trivial bound of one). Both conditions matter. A valid bound that says “the gap is at most 4.7” is useless when losses live in [0, 1]. A non-vacuous bound that holds only on average, not with high probability, is a different (weaker) statement than what we want. The bar to clear is non-vacuous high-probability validity, and the rest of the chapter is engineered to clear that bar in as many regimes as the classical theory can manage.

The test-error question — what we actually want to estimate

Formally, the data are iid draws (X_i, Y_i) \sim P for i = 1, \ldots, n from some unknown joint distribution P on \mathcal{X} \times \mathcal{Y}. A hypothesis h: \mathcal{X} \to \mathcal{Y} assigns predictions to inputs; the loss \ell(h(x), y) measures how badly hypothesis h does at the example (x, y). The risk (also: true risk, population risk, test error in expectation) is

R(h) = \mathbb{E}_{(X, Y) \sim P}[\ell(h(X), Y)].

This is the quantity we want to control. We never see it directly: PP is unknown, and the expectation under PP is the thing the training set is supposed to approximate.

Training error as an optimistic estimator

What we do see is the empirical risk on the training sample S = \{(X_i, Y_i)\}_{i=1}^n:

\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^n \ell(h(X_i), Y_i).

For any fixed hypothesis h, the law of large numbers says \hat{R}_S(h) \to R(h) as n \to \infty, and Hoeffding's inequality says the convergence is exponential in n. So if the learner picked h before seeing the training set, we'd be done. But the learner uses the training set to pick h — that's what learning means. The hypothesis \hat h_S output by empirical risk minimization,

\hat h_S \in \arg\min_{h \in \mathcal{H}} \hat{R}_S(h),

is a function of the data, and the law of large numbers does not directly apply to \hat{R}_S(\hat h_S) as an estimator of R(\hat h_S). The empirical risk of the ERM is systematically optimistic: the learner has shopped over \mathcal{H} for the hypothesis that fits this particular S best, and that hypothesis is by construction not a typical member of \mathcal{H} on S. We have to control the worst-case gap

\sup_{h \in \mathcal{H}} \big| R(h) - \hat{R}_S(h) \big|

across the entire class, not the gap for any single hypothesis. That switch — from pointwise to uniform — is the whole game.

Vacuous vs. non-vacuous bounds: the bar to clear

A generalization bound is a statement of the form: with probability at least 1 - \delta over the draw of S,

R(\hat h_S) \le \hat{R}_S(\hat h_S) + \epsilon(\mathcal{H}, n, \delta),

where ϵ\epsilon depends on the complexity of the hypothesis class H\mathcal{H}, the sample size nn, and the failure probability δ\delta. The bound is valid if the inequality holds with the stated probability; it is non-vacuous if the numerical value of ϵ\epsilon leaves room below the trivial range of the loss (e.g., ϵ<1\epsilon < 1 for 0-1 loss). All the classical machinery we build in §§3–11 produces valid bounds; whether those bounds are non-vacuous depends on H\mathcal{H}, nn, and the problem. The vacuousness puzzle for deep nets (§12) is that valid classical bounds for a typical overparameterized network on a standard image dataset come out larger than one — valid, but uninformative.

What this chapter delivers

The chapter develops seven generalization-bound recipes, each tightening the previous in some regime:

  1. Union bound + Hoeffding for finite H\mathcal{H} (§3) — the warmup.
  2. Uniform convergence via Glivenko–Cantelli (§4) — the right notion, before the right tooling.
  3. Rademacher complexity (§§5–7) — the tooling; the canonical bound.
  4. Metric-entropy / Dudley (§8) — beyond {0,1}\{0, 1\} losses.
  5. Local Rademacher complexity (§9) — fast O(1/n)O(1/n) rates under variance restrictions.
  6. Margin bounds (§10) — what changes when classifiers come with a confidence score.
  7. Algorithmic stability (§11) — bypassing H\mathcal{H}-complexity entirely.

By the end of §11 we have seven different lenses on the same gap. §12 then runs them all on two examples — a small threshold class where they all bite tightly, and a small MLP where the classical ones go vacuous — and §14 honestly reports the open questions about what tightens the bounds for deep networks.

The generalization gap and its decomposition

§1 introduced the gap informally as the difference between training and test error. §2 makes that gap a formal mathematical object, decomposes it into three structurally distinct sources of error, and presents the first picture of the gap as a function of sample size — the empirical anchor every subsequent bound will refine.

Risk and empirical risk, formally

Throughout this chapter we fix a data distribution PP on X×Y\mathcal{X} \times \mathcal{Y} and a loss function :Y×Y[0,M]\ell: \mathcal{Y} \times \mathcal{Y} \to [0, M], bounded above by some constant MM (almost always M=1M = 1 for the 0-1 loss (y,y)=1[yy]\ell(y', y) = \mathbb{1}[y' \ne y]). A hypothesis class H\mathcal{H} is a set of functions h:XYh: \mathcal{X} \to \mathcal{Y}. The reader can keep H\mathcal{H} concrete: it might be all decision stumps, all linear classifiers in Rd\mathbb{R}^d, or all 2-layer ReLU networks of width ww.

Definition 1 (Risk).

The risk, or generalization error, or true error of a hypothesis hh is

R(h) = \mathbb{E}_{(X, Y) \sim P}\big[\ell(h(X), Y)\big].

When \ell is the 0-1 loss, R(h)=Pr(X,Y)P[h(X)Y]R(h) = \Pr_{(X,Y) \sim P}[h(X) \ne Y] is the misclassification probability.

Definition 2 (Empirical risk).

Given an iid sample S={(Xi,Yi)}i=1nS = \{(X_i, Y_i)\}_{i=1}^n from PP, the empirical risk of hh on SS is

\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^n \ell(h(X_i), Y_i).

For fixed hh, R^S(h)\hat{R}_S(h) is an unbiased estimator of R(h)R(h): ES[R^S(h)]=R(h)\mathbb{E}_S[\hat{R}_S(h)] = R(h). By Hoeffding’s inequality (cited from concentration-inequalities), the empirical risk concentrates exponentially: for any hh and any t>0t > 0,

\Pr_S\big[|\hat{R}_S(h) - R(h)| \ge t\big] \le 2 \exp(-2nt^2 / M^2).

This is the pointwise bound — it controls one hh at a time, with hh fixed in advance.

The generalization gap as the central object

Definition 3 (Generalization gap).

The generalization gap of a hypothesis hh on sample SS is

\mathrm{gap}_S(h) = R(h) - \hat{R}_S(h).

A positive gap means hh does worse on test data than on training; a negative gap means hh got lucky on SS. For a fixed hh, the gap has mean zero (ES[gapS(h)]=0\mathbb{E}_S[\mathrm{gap}_S(h)] = 0) and is exponentially concentrated by Hoeffding. The problem is that we never use a fixed hh — we pick h^S\hat{h}_S as a function of SS, and the gap of a data-dependent hypothesis is not mean-zero. In fact, ES[gapS(h^S)]0\mathbb{E}_S[\mathrm{gap}_S(\hat{h}_S)] \ge 0 in general: the ERM systematically overfits.
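
A quick simulation makes the optimism concrete. The sketch below is a minimal illustration, not anything prescribed by the text: it assumes a synthetic threshold problem with 10% label noise, a 101-point grid of threshold hypotheses, and n = 50, and compares the average training error of the ERM with its average true risk. The first sits systematically below the second.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, n, trials = 0.1, 50, 2000                    # label-noise rate, sample size, Monte Carlo repetitions
grid = np.linspace(0.0, 1.0, 101)                 # finite class: h_tau(x) = 1[x >= tau]

def true_risk(tau):
    # For X ~ Uniform[0,1], truth 1[x >= 0.5], labels flipped with prob eta:
    # R(h_tau) = eta + (1 - 2*eta) * |tau - 0.5|
    return eta + (1 - 2 * eta) * abs(tau - 0.5)

train_err, test_err = [], []
for _ in range(trials):
    x = rng.uniform(size=n)
    y = ((x >= 0.5).astype(int) + (rng.uniform(size=n) < eta)) % 2   # noisy labels
    emp = ((x[None, :] >= grid[:, None]).astype(int) != y).mean(axis=1)
    best = emp.argmin()                           # ERM over the finite class
    train_err.append(emp[best])
    test_err.append(true_risk(grid[best]))

print(f"E[empirical risk of ERM] ~ {np.mean(train_err):.3f}")   # optimistic
print(f"E[true risk of ERM]      ~ {np.mean(test_err):.3f}")    # larger: the gap
```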

To get a handle on gapS(h^S)\mathrm{gap}_S(\hat{h}_S) we control the uniform deviation

\Phi_S(\mathcal{H}) = \sup_{h \in \mathcal{H}} \big| R(h) - \hat{R}_S(h) \big|,

which sandwiches the gap of any data-dependent h^S\hat{h}_S:

|\mathrm{gap}_S(\hat{h}_S)| \le \Phi_S(\mathcal{H}).

A bound on ΦS(H)\Phi_S(\mathcal{H}) — a single number that controls the gap simultaneously across all hypotheses in H\mathcal{H} — is what we need. The rest of the chapter is the story of how to bound ΦS(H)\Phi_S(\mathcal{H}) tightly for various classes H\mathcal{H}.

Approximation–estimation–optimization

The gap is one piece of a larger decomposition. Following Bottou & Bousquet (2008), let

  • hh^* be the Bayes-optimal hypothesis (the minimizer of RR over all measurable functions);
  • hH=argminhHR(h)h^*_{\mathcal{H}} = \arg\min_{h \in \mathcal{H}} R(h) be the best in class (the minimizer restricted to H\mathcal{H});
  • h^S=argminhHR^S(h)\hat{h}_S = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h) be the ERM;
  • halgh_{\mathrm{alg}} be the algorithm’s output — typically not exact ERM (SGD doesn’t run to convergence; even when it does, gradient descent on non-convex losses finds only local minima).

Proposition 1 (Excess-risk decomposition).

The excess risk of the algorithm’s output decomposes as

\underbrace{R(h_{\mathrm{alg}}) - R(h^*)}_{\text{excess risk}} = \underbrace{R(h^*_{\mathcal{H}}) - R(h^*)}_{\text{approximation}} + \underbrace{R(\hat{h}_S) - R(h^*_{\mathcal{H}})}_{\text{estimation}} + \underbrace{R(h_{\mathrm{alg}}) - R(\hat{h}_S)}_{\text{optimization}}.
Proof.

All three terms on the right telescope, leaving R(halg)R(h)R(h_{\mathrm{alg}}) - R(h^*) on the left.

The three terms have orthogonal sources:

  • Approximation error depends only on H\mathcal{H} and PP. A class too small to express the truth (linear classifiers for a quadratic boundary) has irreducible approximation error.
  • Estimation error depends on H\mathcal{H}, nn, and the proof technique. This is the gap-controlled term — the rest of this chapter is dedicated to bounding it.
  • Optimization error depends on the algorithm. SGD-on-a-deep-net is the most studied modern case; convex programming (linear regression, SVM) typically achieves 00 optimization error.

Generalization theory’s job is the middle term. The approximation term is the model-selection literature’s problem; the optimization term is the optimization literature’s problem. We will (almost always) assume halg=h^Sh_{\mathrm{alg}} = \hat{h}_S — i.e., optimization is exact — and bound the estimation term.

Connection to bias–variance

For square loss in regression — \ell(y', y) = (y' - y)^2 — and a class indexed by parameters, the excess risk over the Bayes predictor decomposes additively into squared bias plus variance:

\mathbb{E}_S\big[R(\hat{h}_S)\big] - R(h^*) = \underbrace{\mathbb{E}_S\big[(\bar h_S - h^*)^2\big]}_{\text{bias}^2} + \underbrace{\mathbb{E}_S\big[(\hat{h}_S - \bar h_S)^2\big]}_{\text{variance}},

where hˉS=ES[h^S]\bar h_S = \mathbb{E}_S[\hat{h}_S] is the average estimator across data draws. The bias term parallels the approximation error (how well does the class express the truth?); the variance term parallels the estimation error (how variable is the ERM around its mean?). The parallel is exact for square loss; for 0-1 loss the decomposition does not factor as cleanly (the loss is non-additive), but the intuition transfers: a low-complexity class has small variance but possibly large bias; a high-complexity class has small bias but large variance; the trade-off chooses H\mathcal{H}. Classical generalization theory makes the variance/estimation side of this trade-off rigorous.
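
The square-loss decomposition can be watched directly in a small simulation. The following sketch is illustrative only (the sine ground truth, noise level, sample size, and polynomial degrees are all assumptions of the example): it fits polynomial classes of increasing capacity by least squares over many data draws and reports averaged squared bias and variance on a test grid.

```python
import numpy as np

rng = np.random.default_rng(1)
f_star = lambda x: np.sin(2 * np.pi * x)          # assumed ground-truth regression function
n, sigma, trials = 30, 0.3, 500                   # sample size, noise level, data draws
x_test = np.linspace(0.0, 1.0, 200)

for degree in (1, 3, 9):                          # three hypothesis classes of increasing capacity
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        x = rng.uniform(size=n)
        y = f_star(x) + sigma * rng.normal(size=n)
        coefs = np.polyfit(x, y, degree)          # least-squares ERM in the polynomial class
        preds[t] = np.polyval(coefs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f_star(x_test)) ** 2)   # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))                         # variance, averaged over x
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```

Low-degree fits show large bias and small variance; high-degree fits reverse the pattern, matching the approximation/estimation reading above.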

Finite-class warmup

We can write down our first generalization bound, today. The recipe is two ingredients we already have: Hoeffding’s inequality for a fixed hypothesis, and the union bound to make it uniform across a finite class. The result — sometimes called the Occam bound — controls the gap by logH/n\sqrt{\log|\mathcal{H}|/n}, an O(1/n)O(1/\sqrt{n}) rate with logarithmic dependence on class size. It is the right starting point for two reasons: it is genuinely useful in its own right (it bites for small finite classes), and the proof technique generalizes — every later bound is some refinement of “control fixed-hh deviations, then make the bound uniform across H\mathcal{H}.”

Hoeffding for a single hypothesis

Fix one hHh \in \mathcal{H}. The empirical risk R^S(h)=1ni(h(Xi),Yi)\hat{R}_S(h) = \frac{1}{n}\sum_i \ell(h(X_i), Y_i) is the mean of nn iid random variables, each bounded in [0,M][0, M]. Hoeffding’s inequality (proved in concentration-inequalities) gives a two-sided bound: for any t>0t > 0,

\Pr_S\big[|R(h) - \hat{R}_S(h)| \ge t\big] \le 2 \exp\!\Big({-}\frac{2 n t^2}{M^2}\Big).

Choosing t = M\sqrt{\log(2/\delta)/(2n)} makes the right-hand side equal \delta, so with probability at least 1 - \delta,

|R(h) - \hat{R}_S(h)| \le M\sqrt{\frac{\log(2/\delta)}{2n}}.

This is a pointwise bound — it holds for any specific hh chosen before SS was drawn. It does not hold uniformly across H\mathcal{H}.

Union bound across H|\mathcal{H}|

If \mathcal{H} is finite with |\mathcal{H}| = N hypotheses, label them h_1, \ldots, h_N. For each h_i, Hoeffding bounds the failure probability \Pr_S[|R(h_i) - \hat{R}_S(h_i)| \ge t] \le 2\exp(-2nt^2/M^2). The probability that any of the N hypotheses fails is at most the sum:

\Pr_S\big[\exists i : |R(h_i) - \hat{R}_S(h_i)| \ge t\big] \le \sum_{i=1}^N \Pr_S\big[|R(h_i) - \hat{R}_S(h_i)| \ge t\big] \le 2N \exp\!\Big({-}\frac{2nt^2}{M^2}\Big).

Set the right-hand side equal to \delta and solve for t:

2N \exp(-2nt^2 / M^2) = \delta \quad \Longrightarrow \quad t = M\sqrt{\frac{\log(2N/\delta)}{2n}}.

Theorem 1 (Finite-class generalization bound).

Let H\mathcal{H} be a finite hypothesis class with H=N|\mathcal{H}| = N, let SS be an iid sample of size nn from PP, and let \ell be a loss bounded in [0,M][0, M]. For any δ(0,1)\delta \in (0, 1), with probability at least 1δ1 - \delta over SS,

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_S(h)| \le M \sqrt{\frac{\log(2N/\delta)}{2n}}.
Proof.

Apply Hoeffding pointwise to each h \in \mathcal{H} with t = M\sqrt{\log(2N/\delta)/(2n)}; the per-hypothesis failure probability is \delta/N. The union bound across the N hypotheses gives total failure probability at most N \cdot \delta/N = \delta. On the complement event — no hypothesis fails — the stated bound holds.

Sample complexity n=O(logH/ϵ2)n = O(\log|\mathcal{H}|/\epsilon^2)

Theorem 1 inverts to a sample-complexity statement: how many samples does H\mathcal{H} need so the gap is at most ϵ\epsilon with confidence 1δ1 - \delta?

Corollary 1 (Sample complexity for finite classes).

To guarantee suphR(h)R^S(h)ϵ\sup_h |R(h) - \hat{R}_S(h)| \le \epsilon with probability 1δ\ge 1 - \delta, it suffices that

n \ge \frac{M^2 \log(2|\mathcal{H}|/\delta)}{2\epsilon^2}.
Proof.

Solve Theorem 1’s bound for nn given the target ϵ\epsilon.

Three observations sharpen the corollary into intuition:

  1. Dependence on class size is logarithmic. Doubling H|\mathcal{H}| increases the required sample size by an additive constant log2/(2ϵ2)\log 2 / (2\epsilon^2), not a multiplicative factor. This is the Occam in Occam bound: complexity is cheap, in the sense that very large but finite hypothesis classes are still PAC-learnable with modest samples.
  2. Dependence on ϵ\epsilon is quadratic. Halving ϵ\epsilon quadruples the required nn. The ϵ2\epsilon^{-2} rate is the canonical slow rate of distribution-free generalization theory; §9 shows when fast rates (ϵ1\epsilon^{-1}) are possible.
  3. Dependence on δ\delta is logarithmic. Halving δ\delta adds log2/(2ϵ2)\log 2 / (2\epsilon^2) to the required nn. High-confidence bounds come cheap.
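
Theorem 1 and Corollary 1 are easy to turn into a calculator. The helper functions below are a minimal sketch (the function names and the default \delta = 0.05 are my own choices); they make the logarithmic dependence on |\mathcal{H}| and the quadratic dependence on \epsilon visible.

```python
import numpy as np

def occam_gap(n, N, delta=0.05, M=1.0):
    """Theorem 1: with prob. >= 1 - delta, sup_h |R - R_hat| <= M * sqrt(log(2N/delta) / (2n))."""
    return M * np.sqrt(np.log(2 * N / delta) / (2 * n))

def occam_sample_size(eps, N, delta=0.05, M=1.0):
    """Corollary 1: a sample size sufficient for a uniform gap of at most eps."""
    return int(np.ceil(M**2 * np.log(2 * N / delta) / (2 * eps**2)))

# Large increases in the class size barely move the bound; halving eps quadruples n.
for N in (10, 10**3, 10**6):
    print(f"|H| = {N:>7}: gap bound at n = 1000 is {occam_gap(1000, N):.3f}, "
          f"n needed for eps = 0.05 is {occam_sample_size(0.05, N)}, "
          f"for eps = 0.025 is {occam_sample_size(0.025, N)}")
```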

Why logH\log|\mathcal{H}| misleads for infinite classes

Theorem 1 needs H\mathcal{H} finite. Every interesting class in machine learning — linear classifiers in Rd\mathbb{R}^d, decision trees, neural networks — is uncountably infinite. A naive application of the bound to an infinite class gives a useless log\log \infty.

A natural workaround discretizes. Take the threshold class \mathcal{H}_{\mathrm{thresh}} = \{x \mapsto \mathbb{1}[x \ge \tau] : \tau \in [0, 1]\}, replace it with the grid \mathcal{H}_N = \{x \mapsto \mathbb{1}[x \ge k/N] : k = 0, \ldots, N\} of N + 1 thresholds, apply Theorem 1, and take N \to \infty. The bound rate becomes \sqrt{\log N / n}, which diverges as N \to \infty. So discretization alone doesn't work — the bound goes vacuous as the discretization refines.

But the true worst-case gap of Hthresh\mathcal{H}_{\mathrm{thresh}} on an nn-sample is bounded; it grows like logn/n\sqrt{\log n / n} (we’ll prove this in §5 via Rademacher complexity). The discretization argument has lost information about the combinatorial structure of the class. Two thresholds τ,τ\tau, \tau' are different elements of HN\mathcal{H}_N, but if no sample point falls between them, they produce identical predictions on SS — they are effectively the same hypothesis from the point of view of the data. The right complexity is not “how many hypotheses are there?” but “how many distinct labelings of an nn-sample can the class produce?” That switch — from counting hypotheses to counting behaviors — is the conceptual move §4 and §5 develop, via Glivenko–Cantelli classes and Rademacher complexity respectively. By the end of §5 we will have replaced logH\log|\mathcal{H}| with Rn(H)\mathfrak{R}_n(\mathcal{H}), a quantity that stays finite for the genuinely-infinite classes ML actually uses.
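
A small experiment shows both halves of this paragraph at once. The sketch below is illustrative (the noiseless truth at \tau = 0.5, the uniform input distribution, and the grid used to approximate the supremum are all assumptions of the example): the measured uniform deviation of the threshold class stays near a fixed, small value, while the discretized union bound keeps growing with the grid size N.

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta, trials = 200, 0.05, 200
taus = np.linspace(0.0, 1.0, 2001)            # fine grid approximating sup over tau

# Noiseless thresholds, truth at 0.5, P = Uniform[0,1]: R(h_tau) = |tau - 0.5|.
true_risk = np.abs(taus - 0.5)

devs = []
for _ in range(trials):
    x = rng.uniform(size=n)
    y = x >= 0.5
    emp_risk = ((x[None, :] >= taus[:, None]) != y).mean(axis=1)
    devs.append(np.max(np.abs(true_risk - emp_risk)))
print(f"measured E[sup_tau |R - R_hat|] at n = {n}: {np.mean(devs):.3f}")

# Union-bound rate for an N-point discretization: grows without bound as the grid refines.
for k in (1, 3, 6, 12, 24):
    N = 10.0 ** k
    bound = np.sqrt(np.log(2 * N / delta) / (2 * n))
    print(f"grid size N = 10^{k:>2}: union bound {bound:.3f}")
```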

Uniform convergence and Glivenko–Cantelli

§3 showed that for finite hypothesis classes, the gap is controlled uniformly. §4 elevates “uniform convergence” from a proof technique to a property of a hypothesis class. We will see that some infinite classes have the property — they are Glivenko–Cantelli classes — and others do not. The classical Glivenko–Cantelli theorem (1933) is the prototype: the class of half-line indicators on R\mathbb{R} has uniform convergence, even though the class is uncountably infinite. The right complexity is structural, not cardinal — and §5 (Rademacher) will give the structural complexity measure we need.

Uniform convergence: the right notion

For a class of functions F\mathcal{F} on a space Z\mathcal{Z}, the empirical measure of fFf \in \mathcal{F} on sample S={Z1,,Zn}S = \{Z_1, \ldots, Z_n\} is Pnf=1nif(Zi)P_n f = \frac{1}{n}\sum_i f(Z_i), and its population mean is Pf=E[f(Z)]Pf = \mathbb{E}[f(Z)].

Definition 4 (Uniform convergence; Glivenko–Cantelli class).

The class F\mathcal{F} has the uniform convergence property with respect to a distribution PP on Z\mathcal{Z}, written F\mathcal{F} is a Glivenko–Cantelli class for PP, if

\sup_{f \in \mathcal{F}} |P_n f - P f| \xrightarrow[n \to \infty]{} 0 \quad \text{in probability (equivalently, a.s. for our purposes).}

The connection to risk and empirical risk is direct: take F={(x,y)(h(x),y):hH}\mathcal{F} = \{(x, y) \mapsto \ell(h(x), y) : h \in \mathcal{H}\}, the loss class induced by H\mathcal{H}. Then Pf=R(h)P f = R(h), Pnf=R^S(h)P_n f = \hat{R}_S(h), and F\mathcal{F} being GC means suphR(h)R^S(h)0\sup_h |R(h) - \hat{R}_S(h)| \to 0 — uniform convergence of empirical risks to true risks across the entire hypothesis class. Distribution-free uniform convergence (GC for every PP) is what makes a hypothesis class PAC-learnable in the agnostic setting.

Classical Glivenko–Cantelli (empirical CDF)

The original GC theorem is the special case of half-line indicators on R\mathbb{R}.

Theorem 2 (Glivenko–Cantelli, 1933).

Let X1,,XnX_1, \ldots, X_n be iid from a distribution on R\mathbb{R} with CDF F(t)=Pr[Xt]F(t) = \Pr[X \le t], and let F^n(t)=1ni1[Xit]\hat F_n(t) = \frac{1}{n}\sum_i \mathbb{1}[X_i \le t] be the empirical CDF. Then

\sup_{t \in \mathbb{R}} |\hat F_n(t) - F(t)| \xrightarrow[n \to \infty]{\mathrm{a.s.}} 0.

The qualitative version (a.s. convergence with no rate) was proved by Glivenko and Cantelli independently in 1933. A quantitative version with an exponential rate is the DKW inequality (Dvoretzky–Kiefer–Wolfowitz 1956, sharpened by Massart 1990): for all n and \epsilon > 0,

\Pr\!\left[\sup_{t \in \mathbb{R}} |\hat F_n(t) - F(t)| \ge \epsilon \right] \le 2 \exp(-2n\epsilon^2).

Two observations make this quantitatively striking:

  1. The rate is the same as Hoeffding for a single function. The supremum across the uncountable family {1[t]:tR}\{\mathbb{1}[\cdot \le t] : t \in \mathbb{R}\} costs no extra log factor compared to a single hh. The combinatorial structure of half-line indicators on R\mathbb{R} is rich enough to be uncountable but lean enough to behave like a single function in the bound.
  2. The constant is distribution-free. The bound has no dependence on F, the underlying distribution — uniform convergence at this rate holds for any continuous, discrete, or mixed distribution on \mathbb{R}.

The connection of Theorem 2 to learning theory is the half-line classifier: H={x1[xt]:tR}\mathcal{H} = \{x \mapsto \mathbb{1}[x \le t] : t \in \mathbb{R}\} is the simplest non-trivial classifier class. The DKW inequality says this class has O(1/n)O(1/\sqrt{n}) uniform-convergence rate independent of the data distribution — a property the discretization argument from §3 failed to recover. The reason the half-line class succeeds where naive discretization fails is the combinatorial richness of half-line indicators: on a sample of size nn, the class produces exactly n+1n + 1 distinct labelings (one for each “break” between consecutive ordered sample points). This count grows polynomially in nn — the seed of the Sauer–Shelah polynomial growth lemma.
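
The DKW inequality is easy to check by simulation. The snippet below is a minimal sketch under one assumed distribution (Uniform[0, 1], where F(t) = t and the supremum is attained at the order statistics); it compares the observed tail probability of the sup deviation with the DKW bound.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 500, 2000

def sup_cdf_deviation(x):
    """sup_t |F_hat_n(t) - t| for a sample from Uniform[0,1], computed exactly
    from the sorted sample (the sup is attained at the jump points)."""
    xs = np.sort(x)
    i = np.arange(1, xs.size + 1)
    return max(np.max(i / xs.size - xs), np.max(xs - (i - 1) / xs.size))

sups = np.array([sup_cdf_deviation(rng.uniform(size=n)) for _ in range(trials)])
for eps in (0.03, 0.05, 0.08):
    observed = (sups >= eps).mean()
    dkw = min(1.0, 2 * np.exp(-2 * n * eps**2))
    print(f"eps = {eps:.2f}: observed Pr[sup dev >= eps] ~ {observed:.3f}, DKW bound {dkw:.3f}")
```

At these scales the bound is not only valid but close to the observed tail, which is the quantitative sense in which the supremum over an uncountable family costs nothing extra here.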

GC classes and Donsker classes

The classical Glivenko–Cantelli theorem generalizes well beyond half-lines. The empirical-processes literature studies broad classes F\mathcal{F} for which uniform convergence holds:

  • VC classes (indicator classes with finite VC dimension; for real-valued functions, the analogous notion is the VC-subgraph class) are GC for every P. Half-lines are the simplest example; half-spaces in \mathbb{R}^d, intervals, and axis-aligned rectangles in \mathbb{R}^d are also GC.
  • Donsker classes are a strictly stronger notion: the centered and scaled empirical process n(PnP)\sqrt{n}(P_n - P) converges, as a process indexed by fFf \in \mathcal{F}, to a Gaussian process on F\mathcal{F}. Donsker implies GC (and gives the right rate); the converse is false.
  • Bracketing entropy and uniform covering numbers are the two technical workhorses that certify a class as GC or Donsker. We use them in §8 for real-valued function classes.

For now the takeaway is structural: GC-ness is a property of the combinatorial / metric structure of F\mathcal{F}, not of its raw cardinality. The next two sections develop a quantitative notion that subsumes GC: a class is GC if and only if its Rademacher complexity vanishes as nn \to \infty.

A non-GC counterexample

Some classes are not Glivenko–Cantelli — and naming one cleanly is the test of whether the definition has bite. The cleanest counterexample is the class of indicators of finite sets:

\mathcal{F}_{\mathrm{fin}} = \{ \mathbb{1}_A : A \subset \mathcal{X}, |A| < \infty \}.

Example 1 (A non-GC class).

For any continuous distribution PP on X\mathcal{X} — say uniform on [0,1][0, 1] — the class Ffin\mathcal{F}_{\mathrm{fin}} fails the uniform convergence property catastrophically. The argument: for any sample S={X1,,Xn}S = \{X_1, \ldots, X_n\}, the indicator of AS={X1,,Xn}A_S = \{X_1, \ldots, X_n\} (a finite set) is in Ffin\mathcal{F}_{\mathrm{fin}}, and

P_n \mathbb{1}_{A_S} = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[X_i \in A_S] = 1, \qquad P \mathbb{1}_{A_S} = P(A_S) = 0,

where P(AS)=0P(A_S) = 0 because ASA_S is a finite set under a continuous distribution. Therefore supfFfinPnfPf1\sup_{f \in \mathcal{F}_{\mathrm{fin}}} |P_n f - P f| \ge 1 for every nn — uniform convergence fails maximally.

What went wrong? The class Ffin\mathcal{F}_{\mathrm{fin}} is too rich — it shatters every sample, in the VC sense. Whatever the labels on SS, some AFfinA \in \mathcal{F}_{\mathrm{fin}} produces them. A class that can produce any labeling of any sample is unconstrained, and unconstrained means no uniform convergence. The VC Dimension topic makes this precise via the Sauer–Shelah lemma: a class with finite VC dimension produces only polynomially many distinct labelings on an nn-sample, while a shattering class produces 2n2^n — and only the polynomial regime is GC. A GC class is one whose behavior on nn samples is constrained, not one whose cardinality is small.

Rademacher complexity

We have now established two things. First, that uniform convergence is the right notion for a hypothesis class — the property that simultaneously controls the gap across all hHh \in \mathcal{H}. Second, that the cardinality H|\mathcal{H}| is the wrong complexity measure for infinite classes — what matters is the combinatorial behavior of H\mathcal{H} on samples. §5 introduces the quantity that puts a finger on that combinatorial behavior: the Rademacher complexity of H\mathcal{H}. It is the canonical complexity measure for distribution-free generalization theory, the data-dependent quantity that we will see in every subsequent bound, and the one that allows us to compute — not just upper-bound — the relevant complexity of finite-and-infinite classes alike.

The random-labels intuition

Here is the picture. Fix a sample S={x1,,xn}S = \{x_1, \ldots, x_n\} in X\mathcal{X}. Forget the true labels; draw fresh ones uniformly at random: σi{1,+1}\sigma_i \in \{-1, +1\} iid with probability 1/21/2 each. Each Rademacher label vector σ=(σ1,,σn){1,+1}n\sigma = (\sigma_1, \ldots, \sigma_n) \in \{-1, +1\}^n is a different “random labeling” of the sample. There are 2n2^n such labelings.

For a binary classifier h: \mathcal{X} \to \{-1, +1\}, the quantity \frac{1}{n}\sum_i \sigma_i h(x_i) is the correlation between the classifier's outputs and the random labels \sigma. It is +1 if h's outputs match \sigma exactly, -1 if they are everywhere opposite, and 0 in expectation if h is independent of \sigma. The supremum across h \in \mathcal{H},

\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i),

measures the best fit any h \in \mathcal{H} can produce to the random label vector \sigma. A class that can fit any labeling perfectly achieves a supremum of 1 for every \sigma. A class with no labeling flexibility achieves a value that concentrates around 0 at rate O(1/\sqrt n). The expected supremum over the random \sigma interpolates between these extremes — and that expectation is the empirical Rademacher complexity.

Empirical and population Rademacher complexity

Definition 5 (Empirical Rademacher complexity).

Given a class F\mathcal{F} of real-valued functions f:ZRf: \mathcal{Z} \to \mathbb{R} and a sample S={Z1,,Zn}S = \{Z_1, \ldots, Z_n\}, the empirical Rademacher complexity of F\mathcal{F} on SS is

\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma\!\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(Z_i) \,\Big|\, S \right],

where σ=(σ1,,σn)\sigma = (\sigma_1, \ldots, \sigma_n) is an iid Rademacher vector (σi{1,+1}\sigma_i \in \{-1, +1\} with Pr[σi=+1]=1/2\Pr[\sigma_i = +1] = 1/2), independent of SS.

Definition 6 (Population Rademacher complexity).

The population (or expected) Rademacher complexity is the expectation of the empirical Rademacher over the draw of SS:

\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})] = \mathbb{E}_{S, \sigma}\!\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(Z_i) \right].

Three basic observations:

  1. Translation invariance / centering. Adding a constant cc to every fFf \in \mathcal{F} shifts the sum by c1nσic \cdot \frac{1}{n}\sum \sigma_i, which has mean zero — so Rademacher complexity is unchanged. It measures variability across the class, not absolute level.
  2. Convex hull invariance. R^S(conv(F))=R^S(F)\hat{\mathfrak{R}}_S(\mathrm{conv}(\mathcal{F})) = \hat{\mathfrak{R}}_S(\mathcal{F}): taking convex combinations of functions cannot help fit random labels better than the extreme points already do.
  3. Monotone in the class. F1F2    R^S(F1)R^S(F2)\mathcal{F}_1 \subset \mathcal{F}_2 \implies \hat{\mathfrak{R}}_S(\mathcal{F}_1) \le \hat{\mathfrak{R}}_S(\mathcal{F}_2): a bigger class fits random labels better.

The relationship to the uniform deviation ΦS(H)\Phi_S(\mathcal{H}) is made precise in §6 (symmetrization) and §7 (the canonical bound). Punch line: ES[ΦS(H)]2Rn(LH)\mathbb{E}_S[\Phi_S(\mathcal{H})] \le 2 \mathfrak{R}_n(\mathcal{L} \circ \mathcal{H}) — Rademacher of the loss class is what shows up.
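
The definition is directly computable for small classes. The sketch below is a Monte Carlo estimate, not a closed form, and the encoding choices (thresholds as \pm 1-valued classifiers, scanning only the n + 1 distinct behaviours on the sample) are assumptions of the example; it also prints the Massart-type bound \sqrt{2\log(n+1)/n} from the next subsection for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_rademacher_thresholds(x, n_sigma=1000):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    threshold class {x -> +1 if x >= tau else -1} on the sample x.
    The sup over tau depends only on where tau falls relative to the sorted
    sample, so scanning the n + 1 distinct behaviours is enough."""
    xs = np.sort(x)
    n = xs.size
    taus = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    outputs = np.where(xs[None, :] >= taus[:, None], 1.0, -1.0)      # (n+1) x n
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))               # random labelings
    return (sigma @ outputs.T / n).max(axis=1).mean()                # E_sigma[ sup_h correlation ]

for n in (10, 100, 1000):
    x = rng.uniform(size=n)
    est = empirical_rademacher_thresholds(x)
    massart = np.sqrt(2 * np.log(n + 1) / n)
    print(f"n = {n:>4}: Monte Carlo estimate {est:.3f}, Massart bound {massart:.3f}")
```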

Massart’s finite-class lemma

Lemma 1 (Massart's finite-class lemma).

Let ARnA \subset \mathbb{R}^n be a finite set with A=N|A| = N, and let B=supaAa2B = \sup_{a \in A} \|a\|_2. Then

\mathbb{E}_\sigma\!\left[ \max_{a \in A} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i \right] \le \frac{B \sqrt{2 \log N}}{n}.
Proof.

Let Ya=iσiaiY_a = \sum_i \sigma_i a_i. By Hoeffding’s lemma for Rademacher random variables, E[exp(λYa)]exp(λ2a22/2)exp(λ2B2/2)\mathbb{E}[\exp(\lambda Y_a)] \le \exp(\lambda^2 \|a\|_2^2 / 2) \le \exp(\lambda^2 B^2 / 2). For any λ>0\lambda > 0,

\exp\!\big( \lambda \mathbb{E}[\max_a Y_a] \big) \;\overset{(\text{Jensen})}{\le}\; \mathbb{E}\!\big[ \exp(\lambda \max_a Y_a) \big] = \mathbb{E}\!\big[ \max_a \exp(\lambda Y_a) \big] \le \sum_{a \in A} \mathbb{E}[\exp(\lambda Y_a)] \le N \exp(\lambda^2 B^2 / 2).

Taking logarithms and dividing by λ\lambda,

\mathbb{E}[\max_a Y_a] \le \frac{\log N}{\lambda} + \frac{\lambda B^2}{2}.

Setting the derivative to zero gives λ=2logN/B2\lambda^* = \sqrt{2 \log N / B^2} and substituting back,

\mathbb{E}[\max_a Y_a] \le B \sqrt{2 \log N}.

Dividing by nn gives Massart’s lemma.

Corollary 2 (Rademacher complexity of a finite hypothesis class).

For a binary classifier class H\mathcal{H} with h:X{1,+1}h: \mathcal{X} \to \{-1, +1\} and H=N|\mathcal{H}| = N,

\hat{\mathfrak{R}}_S(\mathcal{H}) \le \sqrt{\frac{2 \log N}{n}}.
Proof.

Apply Lemma 1 with A = \{(h(X_1), \ldots, h(X_n)) : h \in \mathcal{H}\}. Each a \in A has \|a\|_2^2 = n, so B = \sqrt n.

Massart's bound is the basis of the Rademacher refinement of §3. Even if |\mathcal{H}| = 2^{100}, on a sample of n = 1000 the Rademacher complexity is at most \sqrt{200 \log 2 / 1000} \approx 0.37 — informative, and with the same rate as the union-bound treatment of §3, but routed through a single computable quantity rather than through enumeration of \mathcal{H}.
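
A direct numerical check of Lemma 1: draw a finite set of vectors, estimate the expectation on the left by Monte Carlo, and compare with the bound. The set of Gaussian vectors and the sizes used are arbitrary choices of the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, n_sigma = 50, 200, 5000

A = rng.normal(size=(N, n))                    # a finite set of N vectors in R^n
B = np.linalg.norm(A, axis=1).max()            # B = max ell_2 norm over the set

sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))
lhs = (sigma @ A.T / n).max(axis=1).mean()     # Monte Carlo E_sigma[ max_a (1/n) sum_i sigma_i a_i ]
rhs = B * np.sqrt(2 * np.log(N)) / n           # Massart's bound
print(f"Monte Carlo value {lhs:.3f} <= Massart bound {rhs:.3f}")
```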

Talagrand’s contraction lemma

Lemma 2 (Talagrand's contraction lemma).

Let ϕ:RR\phi: \mathbb{R} \to \mathbb{R} be LL-Lipschitz with ϕ(0)=0\phi(0) = 0, and let F\mathcal{F} be a class of real-valued functions on Z\mathcal{Z}. Then for any sample SS,

\hat{\mathfrak{R}}_S(\phi \circ \mathcal{F}) \le L \cdot \hat{\mathfrak{R}}_S(\mathcal{F}),

where ϕF={zϕ(f(z)):fF}\phi \circ \mathcal{F} = \{ z \mapsto \phi(f(z)) : f \in \mathcal{F} \}.

Proof.

The proof goes by induction on nn. The base case n=1n = 1 is direct from Lipschitzness. For the inductive step, one conditions on σ1,,σn1\sigma_1, \ldots, \sigma_{n-1} and uses a comparison inequality that reduces to the Lipschitz condition. The full argument is in Ledoux & Talagrand (1991), Ch. 4.

The contraction lemma is how we move from the predictor class H\mathcal{H} to the loss class LH\mathcal{L} \circ \mathcal{H}. For the 0-1 loss on binary classifiers, rewrite (h(x),y)=12(1yh(x))\ell(h(x), y) = \frac{1}{2}(1 - y h(x)) — which is 12\frac{1}{2}-Lipschitz in h(x)h(x) — and conclude R^S(LH)12R^S(H)\hat{\mathfrak{R}}_S(\mathcal{L} \circ \mathcal{H}) \le \frac{1}{2} \hat{\mathfrak{R}}_S(\mathcal{H}). For margin losses (§10), ϕ(t)=max(0,1t/γ)\phi(t) = \max(0, 1 - t/\gamma) is 1/γ1/\gamma-Lipschitz, and the contraction lemma gives the γ\gamma-in-the-denominator dependence of the margin bound.

Symmetrization

§5 introduced Rademacher complexity as a data-dependent complexity measure. §6 is the proof technique that connects Rademacher complexity to the uniform deviation: symmetrization, also called the ghost-sample trick. The lemma proved here is the structural backbone of every Rademacher generalization bound — including the canonical one in §7. The technique is clever in a way that rewards a careful, slow read. We will lay every step out.

The ghost-sample trick

Recall the central object: the uniform deviation ΦS(F)=supfF(PfPnf)\Phi_S(\mathcal{F}) = \sup_{f \in \mathcal{F}} (Pf - P_n f) (we work one-sided here; the two-sided version follows by applying the argument to F(F)\mathcal{F} \cup (-\mathcal{F})). We want to bound ES[ΦS(F)]\mathbb{E}_S[\Phi_S(\mathcal{F})] by something computable. The challenge is that Pf=EZ[f(Z)]Pf = \mathbb{E}_{Z}[f(Z)] is a population expectation — invisible to us, and a single deterministic number per ff. The empirical PnfP_n f is a sample average. The supremum couples both, and the asymmetry between the two is what makes the bound hard.

The trick is to introduce a ghost sample S=(Z1,,Zn)S' = (Z'_1, \ldots, Z'_n), iid from PP, independent of SS. The ghost sample never gets drawn — it is a mathematical fiction we use to symmetrize the difference. The key observation: Pf=ES[P^Sf]Pf = \mathbb{E}_{S'}[\hat P_{S'} f], where P^Sf=1nif(Zi)\hat P_{S'} f = \frac{1}{n} \sum_i f(Z'_i) is the ghost empirical measure. The pair (Zi,Zi)(Z_i, Z'_i) is iid from P×PP \times P, exchangeable under coordinate swap — and that exchangeability is what lets us introduce iid Rademacher signs without changing the joint distribution.

Generalization-gap moments → Rademacher averages

For any fixed function g(z,z)=f(z)f(z)g(z, z') = f(z') - f(z), the pair (Zi,Zi)(Z_i, Z'_i) and its swap (Zi,Zi)(Z'_i, Z_i) have the same joint distribution; the function gg is antisymmetric under the swap, so g(Zi,Zi)=g(Zi,Zi)g(Z'_i, Z_i) = -g(Z_i, Z'_i). Equivalently, f(Zi)f(Zi)f(Z'_i) - f(Z_i) has the same distribution as its negation, conditional on the pair being drawn iid. Inserting an iid Rademacher sign σi\sigma_i leaves the joint distribution unchanged for each fixed ff — and the supremum can be moved through by Jensen-like arguments. The full chain follows.

A full proof, written out

Lemma 3 (Symmetrization lemma).

For any class F\mathcal{F} of bounded functions on Z\mathcal{Z} and any n1n \ge 1,

\mathbb{E}_S\!\Big[\sup_{f \in \mathcal{F}} (Pf - P_n f)\Big] \le 2 \, \mathfrak{R}_n(\mathcal{F}).
Proof.

Let S = (Z_1, \ldots, Z_n) be the original sample (iid from P), and let S' = (Z'_1, \ldots, Z'_n) be an independent ghost sample (also iid from P). For any f \in \mathcal{F},

Pf = \mathbb{E}_{Z' \sim P}[f(Z')] = \mathbb{E}_{S'}\!\Big[\frac{1}{n}\sum_{i=1}^n f(Z'_i)\Big] = \mathbb{E}_{S'}[\hat P_{S'} f]. \qquad (1)

Substituting (1) into the uniform deviation:

\sup_f (Pf - P_n f) = \sup_f \mathbb{E}_{S'}\!\big[\hat P_{S'} f - P_n f\big]. \qquad (2)

Each bracketed term in (2) is an expectation over the ghost sample, and the supremum of an expectation is at most the expectation of the supremum (Jensen's inequality applied to the convex function \sup_f, a pointwise supremum of linear functions):

\sup_f \mathbb{E}_{S'}[\hat P_{S'} f - P_n f] \le \mathbb{E}_{S'}\!\Big[\sup_f (\hat P_{S'} f - P_n f)\Big]. \qquad (3)

Taking \mathbb{E}_S on both sides of (3) and combining with (2),

\mathbb{E}_S\!\Big[\sup_f (Pf - P_n f)\Big] \le \mathbb{E}_{S, S'}\!\Big[\sup_f \frac{1}{n} \sum_{i=1}^n (f(Z'_i) - f(Z_i))\Big]. \qquad (4)

Now introduce iid Rademacher random variables \sigma_1, \ldots, \sigma_n \in \{-1, +1\} with \Pr[\sigma_i = +1] = 1/2, independent of (S, S'). Because Z_i and Z'_i are iid, swapping the two coordinates of the pair (Z_i, Z'_i) (which is exactly what multiplying the difference f(Z'_i) - f(Z_i) by \sigma_i = -1 amounts to) leaves the joint distribution unchanged for any fixed \sigma. Therefore

\mathbb{E}_{S, S'}\!\Big[\sup_f \frac{1}{n} \sum_i (f(Z'_i) - f(Z_i))\Big] = \mathbb{E}_{S, S', \sigma}\!\Big[\sup_f \frac{1}{n} \sum_i \sigma_i (f(Z'_i) - f(Z_i))\Big]. \qquad (5)

Split the supremum (the supremum of a sum is at most the sum of the suprema):

\sup_f \frac{1}{n} \sum_i \sigma_i (f(Z'_i) - f(Z_i)) \le \sup_f \frac{1}{n} \sum_i \sigma_i f(Z'_i) + \sup_f \frac{1}{n} \sum_i \sigma_i (-f(Z_i)).

Taking expectations,

\mathbb{E}_{S, S', \sigma}\!\Big[\sup_f \frac{1}{n} \sum_i \sigma_i (f(Z'_i) - f(Z_i))\Big] \le \mathbb{E}_{S', \sigma}\!\Big[\sup_f \frac{1}{n} \sum_i \sigma_i f(Z'_i)\Big] + \mathbb{E}_{S, \sigma}\!\Big[\sup_f \frac{1}{n} \sum_i \sigma_i (-f(Z_i))\Big]. \qquad (6)

The first term in (6) is the definition of \mathfrak{R}_n(\mathcal{F}). For the second term: -\sigma_i has the same distribution as \sigma_i (the Rademacher distribution is symmetric under negation), so the second term also equals \mathfrak{R}_n(\mathcal{F}).

Combining (4), (5), and (6):

\mathbb{E}_S\!\big[\sup_f (Pf - P_n f)\big] \le \mathfrak{R}_n(\mathcal{F}) + \mathfrak{R}_n(\mathcal{F}) = 2 \, \mathfrak{R}_n(\mathcal{F}).

Remark.

The one-sided bound in Lemma 3 is sometimes stated as a two-sided bound ES[supfPfPnf]2Rn(F)\mathbb{E}_S[\sup_f |Pf - P_n f|] \le 2 \mathfrak{R}_n(\mathcal{F}). The two-sided version follows by applying the lemma to the class F(F)\mathcal{F} \cup (-\mathcal{F}). A more refined symmetrization yields the empirical Rademacher bound ES[supf(PfPnf)]2ES[R^S(F)]\mathbb{E}_S[\sup_f (Pf - P_n f)] \le 2 \, \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})], which is what shows up in the canonical bound in §7.
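
Symmetrization can also be sanity-checked numerically for a class where both sides are computable. The sketch below assumes the half-line class {x -> 1[x <= t]} under Uniform[0, 1], so Pf = t and the sup over t reduces to a scan over order statistics; the Rademacher side depends only on the n + 1 behaviours on the sample. All of this setup is an assumption of the example, not part of the lemma.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials, n_sigma = 100, 500, 2000

def sup_deviation_halflines(x):
    # F = {x -> 1[x <= t]}, P = Uniform[0,1]: Pf = t, P_n f = F_hat(t).
    # sup_t (t - F_hat(t)) is approached just below the order statistics.
    xs = np.sort(x)
    return max(0.0, np.max(xs - np.arange(xs.size) / xs.size))

def rademacher_halflines(n, n_sigma):
    # the n+1 distinct behaviours of 1[x <= t] on any sample without ties;
    # the resulting value depends only on n, not on the sample itself
    outputs = (np.arange(n)[None, :] < np.arange(n + 1)[:, None]).astype(float)
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))
    return (sigma @ outputs.T / n).max(axis=1).mean()

lhs = np.mean([sup_deviation_halflines(rng.uniform(size=n)) for _ in range(trials)])
rhs = 2 * rademacher_halflines(n, n_sigma)
print(f"E[sup_f (Pf - P_n f)] ~ {lhs:.3f} <= 2 R_n(F) ~ {rhs:.3f}")
```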

McDiarmid + Rademacher → uniform convergence

We have two ingredients: symmetrization (Lemma 3) controls ES[ΦS(F)]\mathbb{E}_S[\Phi_S(\mathcal{F})] by 2Rn(F)2 \mathfrak{R}_n(\mathcal{F}); concentration (McDiarmid, from concentration-inequalities) promotes “expectation \le …” into “with probability 1δ\ge 1 - \delta, … \le …”. Combine them and we get the canonical generalization bound — the theorem the whole chapter has been building toward.

Bounded differences for suph(RR^S)\sup_h (R - \hat R_S)

The function SΦS(F)S \mapsto \Phi_S(\mathcal{F}) has the bounded-differences property for losses bounded in [0,M][0, M]. Replacing one sample ZiZ_i with a different value ZiZ'_i changes R^S(h)\hat{R}_S(h) by at most M/nM/n for any fixed hh (the loss term for the ii-th sample contributes at most M/nM/n to the average). Therefore suph(R(h)R^S(h))\sup_h (R(h) - \hat R_S(h)) changes by at most M/nM/n — the supremum operation preserves the bounded-differences constant.

McDiarmid concentration on ΦS\Phi_S

By McDiarmid’s bounded-differences inequality, for any t>0t > 0,

\Pr_S\big[\Phi_S(\mathcal{F}) - \mathbb{E}[\Phi_S(\mathcal{F})] \ge t\big] \le \exp\!\Big({-}\frac{2 n t^2}{M^2}\Big).

This is a concentration statement: \Phi_S is tightly concentrated around its mean. To deduce a high-probability upper bound on \Phi_S itself, we use the (one-sided) failure inversion: with probability at least 1 - \delta',

\Phi_S(\mathcal{F}) \le \mathbb{E}[\Phi_S(\mathcal{F})] + M \sqrt{\frac{\log(1/\delta')}{2n}}.

The bound on E[ΦS]\mathbb{E}[\Phi_S] comes from Lemma 3 (symmetrization). The remaining task is to upgrade the population Rademacher in Lemma 3 to the empirical Rademacher — the latter is what we can compute from a single sample.

The two-step recipe and the canonical bound

Apply McDiarmid a second time, this time to R^S(F)\hat{\mathfrak{R}}_S(\mathcal{F}) as a function of SS. The empirical Rademacher also has bounded differences: replacing ZiZ_i with ZiZ'_i changes the inner 1nσjf(Zj)\frac{1}{n}\sum \sigma_j f(Z_j) by at most M/nM/n for any fixed ff and σ\sigma, and supremum + expectation over σ\sigma both preserve the bound. So with probability 1δ\ge 1 - \delta'',

\mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})] \le \hat{\mathfrak{R}}_S(\mathcal{F}) + M \sqrt{\frac{\log(1/\delta'')}{2n}}.

Theorem 3 (Canonical Rademacher generalization bound).

Let F\mathcal{F} be a class of functions f:Z[0,M]f: \mathcal{Z} \to [0, M], let S={Z1,,Zn}S = \{Z_1, \ldots, Z_n\} be an iid sample from PP, and let δ(0,1)\delta \in (0, 1). With probability at least 1δ1 - \delta over SS,

\sup_{f \in \mathcal{F}} (Pf - P_n f) \;\le\; 2\, \hat{\mathfrak{R}}_S(\mathcal{F}) \;+\; 3 M \sqrt{\frac{\log(2/\delta)}{2n}}.
Proof.

Step (i): bounded-differences verification of ΦS(F)\Phi_S(\mathcal{F}), done above; the constant is ci=M/nc_i = M/n for each coordinate.

Step (ii): McDiarmid on ΦS\Phi_S. By the bounded-differences inequality with δ=δ/2\delta' = \delta/2, with probability 1δ/2\ge 1 - \delta/2,

\Phi_S(\mathcal{F}) \le \mathbb{E}_S[\Phi_S(\mathcal{F})] + M \sqrt{\frac{\log(2/\delta)}{2n}}. \qquad \text{(I)}

Step (iii): symmetrization (Lemma 3, data-dependent form),

\mathbb{E}_S[\Phi_S(\mathcal{F})] \le 2 \, \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})]. \qquad \text{(II)}

Step (iv): McDiarmid on R^S\hat{\mathfrak{R}}_S. By the bounded-differences inequality with δ=δ/2\delta'' = \delta/2, with probability 1δ/2\ge 1 - \delta/2,

\mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})] \le \hat{\mathfrak{R}}_S(\mathcal{F}) + M \sqrt{\frac{\log(2/\delta)}{2n}}. \qquad \text{(III)}

Step (v): combine. By the union bound over the two failure events (each \delta/2), with probability \ge 1 - \delta both (I) and (III) hold simultaneously. Chaining (I) \to (II) \to (III), the deviation term from (III) is doubled by the factor 2 in (II), so the total deviation is M\sqrt{\log(2/\delta)/(2n)} + 2M\sqrt{\log(2/\delta)/(2n)} = 3M\sqrt{\log(2/\delta)/(2n)}.

Corollary 3 (For binary classification with 0-1 loss).

Let H\mathcal{H} be a class of binary classifiers h:X{1,+1}h: \mathcal{X} \to \{-1, +1\} and \ell the 0-1 loss (M=1M = 1). For any δ(0,1)\delta \in (0, 1), with probability 1δ\ge 1 - \delta, for every hHh \in \mathcal{H},

R(h) \le \hat{R}_S(h) + \hat{\mathfrak{R}}_S(\mathcal{H}) + 3 \sqrt{\frac{\log(2/\delta)}{2n}}.
Proof.

Apply Theorem 3 to the loss class LH\mathcal{L} \circ \mathcal{H}. By Talagrand’s contraction lemma (Lemma 2) and the 12\frac{1}{2}-Lipschitz form (h(x),y)=12(1yh(x))\ell(h(x), y) = \frac{1}{2}(1 - y h(x)),

\hat{\mathfrak{R}}_S(\mathcal{L} \circ \mathcal{H}) \le \tfrac{1}{2} \hat{\mathfrak{R}}_S(\mathcal{H}).

Theorem 3 then gives R(h)R^S(h)212R^S(H)+3log(2/δ)/(2n)R(h) - \hat R_S(h) \le 2 \cdot \frac{1}{2} \hat{\mathfrak{R}}_S(\mathcal{H}) + 3\sqrt{\log(2/\delta)/(2n)}.

When the bound bites — and when it doesn’t

Three sanity checks:

  1. The bound vanishes as n \to \infty. Both \hat{\mathfrak{R}}_S(\mathcal{H}) and the \sqrt{\log(2/\delta)/n} term decay as O(1/\sqrt n), so the right-hand side of Corollary 3 approaches \hat R_S(h) \approx R(h). Good: the bound is consistent.
  2. The bound is informative when \hat{\mathfrak{R}}_S(\mathcal{H}) is small. For the threshold class on n samples, \hat{\mathfrak{R}}_S(\mathcal{H}_{\mathrm{thresh}}) \approx \sqrt{\log n / n}. At n = 1000, this is \approx 0.083 — comfortably below the trivial bound of 1 (the bound is non-vacuous). At n = 10, this is \approx 0.48 — already large but still informative.
  3. The bound is uninformative (“vacuous”) when \hat{\mathfrak{R}}_S(\mathcal{H}) \ge 1/2 or so. For a moderately overparameterized neural network, \hat{\mathfrak{R}}_S(\mathcal{H}) approaches 1 on standard image datasets (Zhang et al. 2017’s signature observation: deep nets can memorize random labels). The classical bound therefore gives R(h) - \hat R_S(h) \le 1 + 3\sqrt{\log(2/\delta)/(2n)} \approx 1, which is the trivial bound. §12 makes this vacuousness explicit, and §14 points to PAC-Bayes (coming soon) as the post-classical remedy.

The empirical-Rademacher term R^S(H)\hat{\mathfrak{R}}_S(\mathcal{H}) is the whole generalization-bound story for classical theory. Sample size nn, confidence δ\delta, and the loss range MM enter cleanly and obviously; the hypothesis class H\mathcal{H} enters only through R^S(H)\hat{\mathfrak{R}}_S(\mathcal{H}).
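
The three sanity checks can be tabulated. The helper below is a minimal sketch: it plugs the Massart-type estimate \sqrt{2\log(n+1)/n} in for the empirical Rademacher term of the threshold class (a slightly different constant from the \sqrt{\log n / n} figure quoted above) and evaluates the full right-hand side of Corollary 3 beyond the training error.

```python
import numpy as np

def corollary3_slack(n, delta=0.05):
    """Excess over the training error in Corollary 3 for the threshold class,
    using sqrt(2 log(n+1) / n) as an upper bound on the Rademacher term."""
    rad = np.sqrt(2 * np.log(n + 1) / n)
    conf = 3 * np.sqrt(np.log(2 / delta) / (2 * n))
    return rad, conf

for n in (10, 100, 1000, 100_000):
    rad, conf = corollary3_slack(n)
    total = rad + conf
    verdict = "non-vacuous" if total < 1 else "vacuous"
    print(f"n = {n:>6}: Rademacher {rad:.3f} + confidence {conf:.3f} = {total:.3f}  ({verdict})")
```

For a deep network the Rademacher term alone would already be near 1, so the same arithmetic gives a vacuous answer at any realistic n.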

Real-valued function classes

§§5–7 developed Rademacher complexity for {1,+1}\{-1, +1\}-valued classifiers. §8 extends to real-valued function classes — regression problems where the predictor f:XRf: \mathcal{X} \to \mathbb{R} and the loss (f(x),y)=(f(x)y)2\ell(f(x), y) = (f(x) - y)^2 are both continuously valued; margin classifiers where ff is a real score and the decision is sign(f)\mathrm{sign}(f); neural networks viewed as real-valued functions before the final activation. The combinatorial machinery (shattering, VC dimension) doesn’t apply directly. The right substitutes are pseudo-dimension / fat-shattering (the real-valued analogs of VC dim) and covering numbers (the metric-entropy substitute), combined through Dudley’s entropy integral.

Beyond {0,1}\{0, 1\} losses: margin and regression

Two motivating settings:

Regression. F={f:RdR}\mathcal{F} = \{f: \mathbb{R}^d \to \mathbb{R}\} with square loss (f(x),y)=(f(x)y)2\ell(f(x), y) = (f(x) - y)^2. The loss is bounded only if f(x)|f(x)| and y|y| are bounded; the predictor class is real-valued.

Margin classification. F={f:RdR}\mathcal{F} = \{f: \mathbb{R}^d \to \mathbb{R}\} with margin loss γ(f(x),y)=max(0,1yf(x)/γ)\ell_\gamma(f(x), y) = \max(0, 1 - y f(x) / \gamma) for some margin parameter γ>0\gamma > 0. The classifier is sign(f(x))\mathrm{sign}(f(x)), but the score yf(x)y f(x) — the “confidence” — drives the bound. This is the setting for §10.

Both settings require complexity measures finer than VC dimension.

Pseudo-dimension and fat-shattering

Definition 7 (Pseudo-dimension).

A class F\mathcal{F} of real-valued functions on X\mathcal{X} has pseudo-dimension Pdim(F)d\mathrm{Pdim}(\mathcal{F}) \ge d if there exist x1,,xdXx_1, \ldots, x_d \in \mathcal{X} and witness values r1,,rdRr_1, \ldots, r_d \in \mathbb{R} such that for every ϵ{1,+1}d\epsilon \in \{-1, +1\}^d there exists fFf \in \mathcal{F} with sign(f(xi)ri)=ϵi\mathrm{sign}(f(x_i) - r_i) = \epsilon_i for every ii. The pseudo-dimension is the largest such dd.

Pseudo-dimension is the VC dimension of the subgraph class {{(x,t):tf(x)}:fF}\{ \{(x, t) : t \le f(x)\} : f \in \mathcal{F} \} in X×R\mathcal{X} \times \mathbb{R}. For linear regression in Rd\mathbb{R}^d, Pdim=d+1\mathrm{Pdim} = d + 1. For neural networks, Pdim\mathrm{Pdim} depends on the architecture and is polynomial in the parameter count.

Definition 8 (Fat-shattering dimension).

For γ>0\gamma > 0, F\mathcal{F} γ\gamma-fat-shatters a set {x1,,xd}\{x_1, \ldots, x_d\} with witness values rir_i if for every ϵ{1,+1}d\epsilon \in \{-1, +1\}^d there exists fFf \in \mathcal{F} such that f(xi)ri+γf(x_i) \ge r_i + \gamma when ϵi=+1\epsilon_i = +1 and f(xi)riγf(x_i) \le r_i - \gamma when ϵi=1\epsilon_i = -1. The fat-shattering dimension at scale γ\gamma is fatγ(F)=\mathrm{fat}_\gamma(\mathcal{F}) = the largest dd that is γ\gamma-fat-shattered.

Fat-shattering is the right notion for margin classifiers: at margin \gamma, only separability with margin at least \gamma matters, not infinitesimal differences. The dimension is non-increasing in \gamma (a larger margin is harder to achieve, so fewer points can be fat-shattered).

Covering numbers and Dudley’s entropy integral

For a class F\mathcal{F} of functions on Z\mathcal{Z}, a pseudo-metric dd on F\mathcal{F}, and ϵ>0\epsilon > 0, the covering number N(ϵ,F,d)N(\epsilon, \mathcal{F}, d) is the smallest cardinality of an ϵ\epsilon-net — a set GF\mathcal{G} \subset \mathcal{F} such that every fFf \in \mathcal{F} has some gGg \in \mathcal{G} with d(f,g)ϵd(f, g) \le \epsilon. The metric entropy is logN(ϵ,F,d)\log N(\epsilon, \mathcal{F}, d). For a sample SS, the canonical metric is L2(Pn)L_2(P_n): d2(f,g)=(1ni(f(zi)g(zi))2)1/2d_2(f, g) = \big(\frac{1}{n}\sum_i (f(z_i) - g(z_i))^2\big)^{1/2}.

Theorem 4 (Dudley's entropy integral).

For any class F\mathcal{F} of functions f:ZRf: \mathcal{Z} \to \mathbb{R} and any sample SS,

\hat{\mathfrak{R}}_S(\mathcal{F}) \;\le\; \frac{12}{\sqrt n} \int_0^{D_n} \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P_n))} \, d\epsilon,

where Dn=supfFPnf2D_n = \sup_{f \in \mathcal{F}} \sqrt{P_n f^2} is the L2L_2-diameter of F\mathcal{F} on SS.

Proof.

The full proof goes by chaining: construct ϵk\epsilon_k-nets Gk\mathcal{G}_k at dyadic scales ϵk=2kDn\epsilon_k = 2^{-k} D_n, and telescope each fFf \in \mathcal{F} as f=g0(f)+k(gk+1(f)gk(f))f = g_0(f) + \sum_k (g_{k+1}(f) - g_k(f)). Each telescoping increment is bounded in L2(Pn)L_2(P_n) by 3ϵk+13 \epsilon_{k+1}, so by Massart’s lemma applied to the finite set of differences, the Rademacher complexity of the increments is geometric in kk and the sum converges to Dudley’s integral. The constant 1212 is sharp up to small factors. Full argument: Wainwright (2019) Theorem 5.22.

For VC-subgraph classes (finite pseudo-dimension dd), the covering number satisfies logN(ϵ,F,L2(Pn))cdlog(1/ϵ)\log N(\epsilon, \mathcal{F}, L_2(P_n)) \le c \cdot d \cdot \log(1/\epsilon) for some constant cc, and Dudley’s integral evaluates to R^S(F)=O(d/n)\hat{\mathfrak{R}}_S(\mathcal{F}) = O(\sqrt{d/n}) — the parametric rate. This recovers and generalizes the VC bound to real-valued classes.
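
Dudley's integral is simple to evaluate numerically once a metric-entropy model is fixed. The sketch below assumes the VC-subgraph entropy model \log N(\epsilon) = d \log(1/\epsilon) with the constant c = 1 and diameter 1 (all assumptions of the example) and shows that the resulting bound tracks \sqrt{d/n}, with a sizable constant in front.

```python
import numpy as np

def dudley_bound(n, d, diameter=1.0, n_grid=20_000):
    """(12 / sqrt(n)) * integral of sqrt(log N(eps)) from 0 to D, with
    log N(eps) = d * log(1/eps), evaluated by a simple Riemann sum."""
    eps = np.linspace(1e-6, diameter, n_grid)
    entropy_root = np.sqrt(d * np.log(1.0 / eps))
    integral = entropy_root.sum() * (eps[1] - eps[0])
    return 12.0 / np.sqrt(n) * integral

for n in (100, 10_000):
    for d in (2, 10):
        print(f"n = {n:>6}, d = {d:>2}: Dudley bound {dudley_bound(n, d):.3f}, "
              f"sqrt(d/n) = {np.sqrt(d / n):.3f}")
```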

Bracketing vs uniform covering

Two notions of covering co-exist in the empirical-processes literature:

  • Uniform covering N(ϵ,F,L2(Pn))N(\epsilon, \mathcal{F}, L_2(P_n)): ϵ\epsilon-net under the sample L2L_2 metric. What shows up in Rademacher / Dudley bounds.
  • Bracketing covering N[](ϵ,F,L2(P))N_{[\,]}(\epsilon, \mathcal{F}, L_2(P)): ϵ\epsilon-net of “brackets” [,u][\ell, u] with uL2(P)ϵ\|u - \ell\|_{L_2(P)} \le \epsilon, such that every fFf \in \mathcal{F} lies in some bracket. Bracketing controls supfPfPnf\sup_{f} |Pf - P_n f| directly through the supremum over bracket-pairs.

Bracketing is strictly stronger than uniform covering (every ϵ\epsilon-bracket-net is a 2ϵ2\epsilon-uniform-covering net, but not conversely). Bracketing is more natural for non-Donsker classes where uniform covering doesn’t control the empirical process. For our chapter we mostly use uniform covering, which suffices for VC-subgraph and parametric classes; bracketing matters in more delicate settings.

Data-dependent bounds

The canonical bound of §7 has rate O(R^S+1/n)O(\hat{\mathfrak{R}}_S + 1/\sqrt n) — uniformly across the entire class H\mathcal{H}. But the ERM h^S\hat h_S is not just any hh: it has small empirical risk, hence small variance (in the sense that (h^S(X),Y)\ell(\hat h_S(X), Y) has small variance under PP). For such low-variance hypotheses, the bound can be sharpened from the slow O(1/n)O(1/\sqrt n) to the fast rate O(1/n)O(1/n). This is the local Rademacher complexity technology of Bartlett, Bousquet, and Mendelson (2005). The price is a Bernstein-type variance condition that ties the loss variance to its mean; under that condition, the local Rademacher complexity — restricted to the variance-small subclass — replaces the global one in the bound, and a fixed-point characterization gives the fast-rate scaling.

Why classical bounds are slow

Theorem 3 gives R(\hat h_S) - \hat R_S(\hat h_S) \le 2\,\hat{\mathfrak{R}}_S(\mathcal{L} \circ \mathcal{H}) + 3\sqrt{\log(2/\delta)/(2n)}. Both terms decay as O(1/\sqrt n). This is the slow rate of distribution-free generalization: no matter how small the empirical risk, the bound's deviation term is \Omega(1/\sqrt n).

For realizable problems (the Bayes classifier hh^* is in H\mathcal{H}, label noise η=0\eta = 0), R^S(h^S)=0\hat R_S(\hat h_S) = 0 and R(h^S)0R(\hat h_S) \to 0 as nn \to \infty. The classical bound says R(h^S)0+O(1/n)R(\hat h_S) \le 0 + O(1/\sqrt n) — informative but loose; the actual rate for realizable PAC learning is O(1/n)O(1/n) (covered in pac-learning’s realizable section). The slow-rate gap is the price the agnostic / worst-case analysis pays for not exploiting low-noise structure.

Local Rademacher complexity

The fix: replace the global Rademacher complexity R^S(H)\hat{\mathfrak{R}}_S(\mathcal{H}) with a local version computed over the low-variance subclass.

Definition 9 (Local Rademacher complexity).

For a class F\mathcal{F} and a level r>0r > 0,

\hat{\mathfrak{R}}_S(\mathcal{F}_r) = \mathbb{E}_\sigma\!\left[ \sup_{f \in \mathcal{F},\, P_n f^2 \le r} \frac{1}{n} \sum_i \sigma_i f(Z_i) \,\Big|\, S \right],

where the supremum is over functions with bounded empirical second moment Pnf2rP_n f^2 \le r.

The local Rademacher restricts attention to a small ball in L2(Pn)L_2(P_n) around 00; functions far away (large variance) are excluded. The empirical risk of the ERM is small, so the ERM lies in this small ball — the local complexity is what controls its gap.

A Bernstein-type variance condition

For the local Rademacher to give a meaningful bound, we need a relation between the variance of ff (the radius rr) and its mean (the risk). The cleanest such condition:

Bernstein condition. A class F\mathcal{F} satisfies the Bernstein condition with parameter β[0,1]\beta \in [0, 1] if there exists c>0c > 0 such that for every fFf \in \mathcal{F},

P f^2 \le c \, (P f)^\beta.

The condition couples the second moment to the first. The two extremes:

  • \beta = 0: Pf^2 \le c — only the boundedness of f is used, with no link to Pf. This gives the classical slow rate.
  • \beta = 1: Pf^2 \le c \, Pf — Massart's case. Holds for the 0-1 excess loss under Massart's noise condition (|\eta(x) - 1/2| bounded away from zero, the extreme case of Tsybakov's low-noise condition), where the label noise never approaches a coin flip near the Bayes boundary.

For β(0,1)\beta \in (0, 1), the Bernstein condition interpolates: the fast rate becomes O(n1/(2β))O(n^{-1/(2 - \beta)}), which is O(1/n)O(1/n) for β=1\beta = 1 and O(1/n)O(1/\sqrt n) for β=0\beta = 0. Tsybakov noise conditions give specific values of β\beta.
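To make the interpolation concrete, here is a minimal sketch (plain NumPy; the specific values of nn and β\beta are illustrative only) that tabulates the rate n1/(2β)n^{-1/(2-\beta)}:

```python
import numpy as np

# Rate under the Bernstein condition: excess risk ~ n^(-1/(2 - beta)).
# beta = 0 recovers the slow rate n^(-1/2); beta = 1 the fast rate n^(-1).
def bernstein_rate(n, beta):
    return n ** (-1.0 / (2.0 - beta))

ns = np.array([100, 1_000, 10_000])
for beta in (0.0, 0.5, 1.0):
    rates = bernstein_rate(ns, beta)
    print(f"beta={beta:.1f}:", "  ".join(f"n={n}: {r:.4f}" for n, r in zip(ns, rates)))
```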

Fast rates via the fixed-point theorem

Theorem 5 (Local Rademacher fast-rate bound (Bartlett–Bousquet–Mendelson 2005)).

Suppose F\mathcal{F} satisfies the Bernstein condition with parameter β=1\beta = 1 (the canonical Massart case), and let rr^* be the largest solution of the fixed-point equation

r=c1R^S(Fr)+c2log(1/δ)nr = c_1 \, \hat{\mathfrak{R}}_S(\mathcal{F}_r) + c_2 \frac{\log(1/\delta)}{n}

for universal constants c1,c2c_1, c_2. Then with probability 1δ\ge 1 - \delta,

R(h^S)R(hH)c3r=O(1/n).R(\hat h_S) - R(h^*_{\mathcal{H}}) \le c_3 \, r^* = O(1/n).
Proof.

The proof is technical and runs many pages; the full argument is BBM 2005 §3. The picture to keep is the fixed point: R^S(Fr)\hat{\mathfrak{R}}_S(\mathcal{F}_r) is a sub-root function of rr (it shrinks as rr shrinks, and it grows more slowly than r\sqrt r), so the equation has a unique positive fixed point rr^*. For a parametric-scale class the local complexity behaves like rd/n\sqrt{rd/n}, and the fixed point then scales as O(1/n)O(1/n) rather than O(1/n)O(1/\sqrt n) — that fixed point is the excess-risk bound.
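A minimal numeric sketch of the fixed-point mechanism, assuming (purely for illustration) that the local complexity behaves like the sub-root function rd/n\sqrt{rd/n}; the constants c1,c2c_1, c_2 and this parametric model are assumptions, not BBM’s actual values:

```python
import numpy as np

def local_rademacher(r, d, n):
    # Illustrative sub-root model of the local complexity of a d-parameter class.
    return np.sqrt(r * d / n)

def fixed_point(d, n, delta=0.05, c1=1.0, c2=1.0, iters=100):
    # Iterate r <- c1 * R_hat(F_r) + c2 * log(1/delta) / n.
    # Sub-rootness makes this converge to the unique positive fixed point r*.
    r = 1.0
    for _ in range(iters):
        r = c1 * local_rademacher(r, d, n) + c2 * np.log(1.0 / delta) / n
    return r

for n in (100, 1_000, 10_000):
    r_star = fixed_point(d=10, n=n)
    print(f"n={n:>6}  r* ≈ {r_star:.5f}   d/n = {10 / n:.5f}")
# r* tracks d/n (the O(1/n) fast rate) rather than sqrt(d/n).
```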

Margin bounds

A linear classifier h(x)=sign(wx)h(x) = \mathrm{sign}(w^\top x) gives a discrete output, but the score wxw^\top x has continuous information: its magnitude is the confidence, and points far from the boundary (large wx|w^\top x| with the correct sign) are “easy.” A margin bound exploits this confidence: classifiers that not only get the right answer but get it with margin γ\gamma generalize better than ones that just barely separate the data. The result is the foundation of kernel methods (SVM), boosting margin theory, and the modern norm-based generalization bounds for deep nets.

From 0-1 loss to margin / ramp loss

Define the margin of ff on (x,y)(x, y) as ρ(f;x,y)=yf(x)\rho(f; x, y) = y f(x) (positive when ff classifies correctly; magnitude is the confidence). For a margin parameter γ>0\gamma > 0, the ramp loss (also: margin loss; a clipped version of the hinge loss) is

γ(t)={1t01t/γ0<t<γ0tγ.\ell_\gamma(t) = \begin{cases} 1 & t \le 0 \\ 1 - t/\gamma & 0 < t < \gamma \\ 0 & t \ge \gamma \end{cases}.

Equivalently γ(t)=min(1,max(0,1t/γ))\ell_\gamma(t) = \min(1, \max(0, 1 - t/\gamma)). Key properties: γ\ell_\gamma is 1/γ1/\gamma-Lipschitz, γ(0)=1\ell_\gamma(0) = 1, γ(t)1[t0]\ell_\gamma(t) \ge \mathbb{1}[t \le 0] (ramp loss upper-bounds 0-1 loss), and γ(t)1[tγ]\ell_\gamma(t) \le \mathbb{1}[t \le \gamma] (ramp loss lower-bounds the “margin-violation” indicator).

Definition 10 (Margin / margin loss).

The margin of ff on (x,y)(x, y) is ρ(f;x,y)=yf(x)\rho(f; x, y) = y f(x). The empirical margin loss at margin γ\gamma is

R^Sγ(f)=1ni=1nγ(yif(xi))=1ni=1nmin ⁣(1,max ⁣(0,1yif(xi)γ)).\hat R_S^\gamma(f) = \frac{1}{n} \sum_{i=1}^n \ell_\gamma(y_i f(x_i)) = \frac{1}{n} \sum_{i=1}^n \min\!\Big(1, \max\!\Big(0, 1 - \frac{y_i f(x_i)}{\gamma}\Big)\Big).

Two boundary cases: R^S0+(f)=R01emp(f)\hat R_S^{0^+}(f) = R_{0-1}^{\mathrm{emp}}(f) (the empirical 0-1 risk), and R^Sγ\hat R_S^\gamma is increasing in γ\gamma — a larger margin demand counts borderline correct classifications as errors more readily.
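A minimal implementation of the ramp loss and the empirical margin loss of Definition 10 — a sketch; the toy scores and labels below are made up for illustration:

```python
import numpy as np

def ramp_loss(t, gamma):
    # ell_gamma(t) = min(1, max(0, 1 - t/gamma)): 1/gamma-Lipschitz, sandwiched
    # between the 0-1 loss 1[t <= 0] and the margin-violation indicator 1[t <= gamma].
    return np.clip(1.0 - t / gamma, 0.0, 1.0)

def empirical_margin_loss(scores, labels, gamma):
    # R_hat^gamma(f) = (1/n) sum_i ell_gamma(y_i f(x_i)), labels in {-1, +1}.
    margins = labels * scores
    return ramp_loss(margins, gamma).mean()

# Toy check: confident correct, borderline correct, and wrong predictions.
scores = np.array([2.3, 0.4, -1.1, 0.05])
labels = np.array([+1, +1, -1, -1])
for gamma in (0.1, 0.5, 1.0):
    print(f"gamma={gamma}: margin loss = {empirical_margin_loss(scores, labels, gamma):.3f}")
# Larger gamma counts the borderline examples (small |score|) as violations,
# so the empirical margin loss is non-decreasing in gamma.
```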

SVM margin bound (kernel methods)

Consider the kernel-method class FB={xw,ϕ(x):w2B}\mathcal{F}_B = \{ x \mapsto \langle w, \phi(x) \rangle : \|w\|_2 \le B \} for a feature map ϕ:XH\phi: \mathcal{X} \to \mathcal{H} into a reproducing kernel Hilbert space with kernel K(x,x)=ϕ(x),ϕ(x)K(x, x') = \langle \phi(x), \phi(x') \rangle. Two key Rademacher facts (proofs in Mohri et al. 2018 Ch. 5):

  1. R^S(FB)Btr(K)nBsupxK(x,x)n\hat{\mathfrak{R}}_S(\mathcal{F}_B) \le \frac{B \sqrt{\mathrm{tr}(K)}}{n} \le \frac{B \sqrt{\sup_x K(x, x)}}{\sqrt n}, where KK is the kernel Gram matrix on SS.
  2. By Talagrand contraction (Lemma 2) applied to the 1/γ1/\gamma-Lipschitz γ\ell_\gamma: R^S(γFB)1γR^S(FB)\hat{\mathfrak{R}}_S(\ell_\gamma \circ \mathcal{F}_B) \le \frac{1}{\gamma} \hat{\mathfrak{R}}_S(\mathcal{F}_B).

Theorem 6 (Margin-based generalization bound for kernel classifiers).

Let FB={xw,ϕ(x):w2B}\mathcal{F}_B = \{x \mapsto \langle w, \phi(x) \rangle : \|w\|_2 \le B\} with kernel KK, and let R=supxK(x,x)R = \sqrt{\sup_x K(x, x)}. For any γ>0\gamma > 0 and δ(0,1)\delta \in (0, 1), with probability 1δ\ge 1 - \delta over SS, for every fFBf \in \mathcal{F}_B,

Pr(X,Y)P[Yf(X)0]R^Sγ(f)+2BRγn+3log(2/δ)2n.\Pr_{(X, Y) \sim P}[Y f(X) \le 0] \le \hat R_S^\gamma(f) + \frac{2 B R}{\gamma \sqrt n} + 3\sqrt{\frac{\log(2/\delta)}{2n}}.
Proof.

Apply Theorem 3 to the loss class γFB\ell_\gamma \circ \mathcal{F}_B (range [0,1][0, 1]). The bound from Theorem 3 has 2R^S(γFB)2γR^S(FB)2BRγn2 \hat{\mathfrak{R}}_S(\ell_\gamma \circ \mathcal{F}_B) \le \frac{2}{\gamma} \hat{\mathfrak{R}}_S(\mathcal{F}_B) \le \frac{2BR}{\gamma\sqrt n} via the two facts above. The 0-1 risk is Pr[Yf(X)0]E[γ(Yf(X))]=Rγ(f)\Pr[Y f(X) \le 0] \le \mathbb{E}[\ell_\gamma(Y f(X))] = R_\gamma(f) because γ\ell_\gamma upper-bounds the 0-1 loss; Theorem 3 bounds Rγ(f)R^Sγ(f)R_\gamma(f) - \hat R_S^\gamma(f) by the displayed expression.

The γ\gamma in the denominator makes the bound tight for high-margin classifiers: doubling the margin halves the complexity term. For SVMs, BB corresponds to the norm of the weight vector and γ\gamma to the margin width — both standard quantities in the SVM training objective.

Linear classifiers and the norm-based bound

The kernel-method bound specializes to plain linear classifiers when ϕ(x)=x\phi(x) = x (identity feature map) and XRd\mathcal{X} \subset \mathbb{R}^d. Then K(x,x)=xxK(x, x') = x^\top x' and R=supxx2R = \sup_x \|x\|_2. The bound becomes

Pr(X,Y)[Yf(X)0]R^Sγ(f)+2w2supxx2γn+3log(2/δ)2n.\Pr_{(X, Y)}[Y f(X) \le 0] \le \hat R_S^\gamma(f) + \frac{2 \|w\|_2 \sup_x \|x\|_2}{\gamma \sqrt n} + 3\sqrt{\frac{\log(2/\delta)}{2n}}.

Two observations:

  1. The complexity term scales as w2/γ\|w\|_2 / \gammanot as dd or any explicit dimension. This is dimension-free; it depends only on the norm of the weight vector and the margin. Linear classifiers in million-dimensional feature spaces still get a meaningful bound, as long as the norm of the optimal weight vector is moderate.
  2. SVM training minimizes w22\|w\|_2^2 subject to a margin constraint yi(wxi+b)1y_i (w^\top x_i + b) \ge 1 — this is exactly the quantity that controls the generalization bound. SVMs aren’t just “good classifiers”; their training objective directly optimizes the bound.
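A sketch that assembles the right-hand side of the linear-classifier bound above (empirical margin loss, norm/margin complexity term, deviation term); the synthetic data, the weight vector, and the choice of γ\gamma are placeholders, and the sample maximum stands in for supxx2\sup_x \|x\|_2:

```python
import numpy as np

def linear_margin_bound(X, y, w, gamma, delta=0.05):
    """Right-hand side of the norm-based margin bound for f(x) = w.x."""
    n = len(y)
    margins = y * (X @ w)
    emp_margin_loss = np.clip(1.0 - margins / gamma, 0.0, 1.0).mean()
    R_x = np.linalg.norm(X, axis=1).max()            # sup ||x||_2 on the sample
    complexity = 2.0 * np.linalg.norm(w) * R_x / (gamma * np.sqrt(n))
    deviation = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return emp_margin_loss + complexity + deviation

# Synthetic linearly separable data, purely illustrative.
rng = np.random.default_rng(0)
n, d = 20_000, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true)
print("margin-bound value:", linear_margin_bound(X, y, w_true, gamma=1.0))
```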

Margin bounds for neural networks: a preview

Margin bounds for neural networks have an analogous form but with w2\|w\|_2 replaced by a product of layer norms:

R^S(FNN)RWn(small factors)\hat{\mathfrak{R}}_S(\mathcal{F}_{\mathrm{NN}}) \le \frac{R \prod_\ell \|W_\ell\|}{\sqrt n} \cdot (\text{small factors})

(Bartlett 1998, Neyshabur–Bhojanapalli–McAllester–Srebro 2017, Bartlett–Foster–Telgarsky 2017). The picture: a deep network is a composition of Lipschitz maps; Talagrand contraction propagates margin through the layers, multiplying Lipschitz constants. For a network with LL layers and per-layer norm W1\|W_\ell\| \approx 1, the product is 1\sim 1 and the bound is informative; for networks with large per-layer norms (typical of unregularized training), the product blows up. §12 shows this empirically — the margin bound for a 2-layer MLP trained without explicit norm control comes out larger than 11, vacuous. Modern post-classical bounds (PAC-Bayes; see the PAC-Bayes Bounds topic, coming soon) use flatness of the loss landscape instead of layer norms, recovering non-vacuous bounds for some deep nets.

Algorithmic stability bounds

The whole §§5–10 program routes generalization through hypothesis-class complexity. But generalization doesn’t have to go that way. A different route — algorithmic stability — controls generalization through a property of the learning algorithm: how much does the algorithm’s output change if we perturb one training example? If the answer is “not much,” the empirical risk is close to the population risk, and we get a generalization bound that doesn’t mention hypothesis-class complexity at all. This is the Bousquet–Elisseeff (2002) framework: regularization implies stability, and stability implies generalization.

Uniform β\beta-stability defined

A learning algorithm is a map A:ZnHA: \mathcal{Z}^n \to \mathcal{H} from samples to hypotheses. For i{1,,n}i \in \{1, \ldots, n\}, let S(i)S^{(i)} denote SS with the ii-th sample replaced by an iid copy ZiZ'_i.

Definition 11 (Uniform β-stability).

AA is uniformly β\beta-stable with respect to loss \ell if for every SZnS \in \mathcal{Z}^n, every i{1,,n}i \in \{1, \ldots, n\}, and every ZiZZ'_i \in \mathcal{Z},

sup(x,y)Z(A(S)(x),y)(A(S(i))(x),y)β.\sup_{(x, y) \in \mathcal{Z}} \big| \ell(A(S)(x), y) - \ell(A(S^{(i)})(x), y) \big| \le \beta.

The supremum is over all test points, not just SS — uniform stability is a strong condition. β=β(n)\beta = \beta(n) typically decreases with nn (more data \Rightarrow less sensitive to any one example).

Regularization implies stability (ridge regression)

Example 2 (Ridge regression is β-stable).

Consider ridge regression

A(S)=argminwRd1ni=1n(yiwxi)2+λw2.A(S) = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2.

The closed-form solution is wS=(XX+nλI)1Xyw_S = (X^\top X + n\lambda I)^{-1} X^\top y. Suppose x2RX\|x\|_2 \le R_X for all xx and yRY|y| \le R_Y. The stability constant is β=O(RX2RY2/(λn))\beta = O(R_X^2 R_Y^2 / (\lambda n)).

Three observations:

  1. β\beta decreases linearly in nn — more data, more stability.
  2. β\beta decreases linearly in λ\lambda — stronger regularization, more stability.
  3. β\beta does not depend on dd — dimension-free, even in the under-determined case dnd \gg n.

The proof goes through the closed-form (XX+nλI)1(X^\top X + n\lambda I)^{-1} and a perturbation argument: replacing one xix_i changes the inverse by an outer-product rank-one update, and the spectral norm of that update is bounded by xi22/(nλ)\|x_i\|_2^2 / (n \lambda) (full derivation: Bousquet–Elisseeff 2002 §3).
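A sketch that measures ridge regression’s stability empirically: fit on SS and on S(i)S^{(i)} with one example replaced, then take the largest loss change over a random grid of test points (a proxy for the supremum in Definition 11). The data-generating process, grid size, and λ\lambda are illustrative assumptions; the point is that the measured β\beta shrinks as nn grows:

```python
import numpy as np

def ridge_fit(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def empirical_stability(n, lam, d=5, trials=20, seed=0):
    """Largest observed change in squared loss after replacing one training point."""
    rng = np.random.default_rng(seed)
    beta_hat = 0.0
    for _ in range(trials):
        w_star = rng.normal(size=d)
        X = rng.uniform(-1, 1, size=(n, d))
        y = X @ w_star + 0.1 * rng.normal(size=n)
        w = ridge_fit(X, y, lam)
        # S^(i): replace the first training example by a fresh draw and refit.
        X2, y2 = X.copy(), y.copy()
        X2[0] = rng.uniform(-1, 1, size=d)
        y2[0] = X2[0] @ w_star + 0.1 * rng.normal()
        w2 = ridge_fit(X2, y2, lam)
        # Proxy for the sup over (x, y): a random grid of test points.
        Xt = rng.uniform(-1, 1, size=(200, d))
        yt = Xt @ w_star + 0.1 * rng.normal(size=200)
        diff = np.abs((yt - Xt @ w) ** 2 - (yt - Xt @ w2) ** 2)
        beta_hat = max(beta_hat, diff.max())
    return beta_hat

for n in (50, 200, 800):
    print(f"n={n:>4}  lam=0.1  beta_hat ≈ {empirical_stability(n, lam=0.1):.4f}")
```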

Stability implies generalization (Bousquet–Elisseeff)

Theorem 7 (Bousquet–Elisseeff stability bound).

If AA is uniformly β\beta-stable with respect to a loss bounded in [0,M][0, M], then for any δ(0,1)\delta \in (0, 1), with probability 1δ\ge 1 - \delta over the iid sample SS,

R(A(S))R^S(A(S))+β+(2nβ+M)log(1/δ)2n.R(A(S)) \le \hat R_S(A(S)) + \beta + (2 n \beta + M) \sqrt{\frac{\log(1/\delta)}{2n}}.
Proof.

Let g(S)=R(A(S))R^S(A(S))g(S) = R(A(S)) - \hat R_S(A(S)). The proof verifies three properties:

  1. ES[g(S)]β\mathbb{E}_S[g(S)] \le \beta: a stability-based version of symmetrization (the ghost-sample trick) gives the expected gap directly.
  2. g(S)g(S) has bounded differences in the McDiarmid sense: replacing ZiZ_i with ZiZ'_i changes g(S)g(S) by at most 2β+M/n2\beta + M/n (the 2β2\beta comes from stability, the M/nM/n from the single-coordinate change of R^S\hat R_S).
  3. McDiarmid concentration: Pr[g(S)E[g(S)]t]exp(2t2/(n(2β+M/n)2))\Pr[g(S) - \mathbb{E}[g(S)] \ge t] \le \exp(-2t^2 / (n \cdot (2\beta + M/n)^2)), giving with probability 1δ\ge 1 - \delta,
g(S)E[g(S)]+(2nβ+M)log(1/δ)2n.g(S) \le \mathbb{E}[g(S)] + (2n\beta + M)\sqrt{\frac{\log(1/\delta)}{2n}}.

Combining with E[g(S)]β\mathbb{E}[g(S)] \le \beta gives the theorem. Full proof: Bousquet–Elisseeff 2002 §4.

For ridge regression with β=O(L2/(λn))\beta = O(L^2/(\lambda n)) (writing L2L^2 for the constant RX2RY2R_X^2 R_Y^2 from Example 2), Theorem 7 yields

R(wS)R^S(wS)+L2λn+(2L2λ+M)log(1/δ)2n.R(w_S) \le \hat R_S(w_S) + \frac{L^2}{\lambda n} + \left(\frac{2 L^2}{\lambda} + M\right) \sqrt{\frac{\log(1/\delta)}{2n}}.

The L2/λL^2/\lambda in the deviation term is the regularization-stability cost; choosing λ\lambda to balance this term against the increase in empirical risk that stronger regularization causes gives an optimized excess-risk bound.

Why this route bypasses class complexity

Theorem 7 contains no R^S(H)\hat{\mathfrak{R}}_S(\mathcal{H}), no VCdim(H)\mathrm{VCdim}(\mathcal{H}), no covering number. The hypothesis class H\mathcal{H} doesn’t appear at all — only β\beta does. This has three consequences:

  1. The bound is algorithm-specific. Two different algorithms applied to the same class produce different bounds. Ridge regression with λ=1\lambda = 1 has β=O(1/n)\beta = O(1/n), so its expected generalization gap is O(1/n)O(1/n) (the high-probability deviation term in Theorem 7 still scales as O(1/n)O(1/\sqrt n)); the ERM over the same class (unregularized least squares) is not uniformly stable and gets only the slow O(1/n)O(1/\sqrt n) Rademacher bound. The choice of algorithm matters.
  2. The bound is dimension-free for ridge. For under-determined least squares (d>nd > n), the ERM over H=Rd\mathcal{H} = \mathbb{R}^d has R^S(H)\hat{\mathfrak{R}}_S(\mathcal{H}) \to \infty — the Rademacher bound is useless. But ridge with λ>0\lambda > 0 has β=O(1/(λn))\beta = O(1/(\lambda n)) — finite, useful, dimension-free. Regularization buys generalization through stability, not through restricting H\mathcal{H}.
  3. The bound applies to randomized / SGD algorithms. Define stability over the randomness of the algorithm too; SGD has approximate stability properties (Hardt–Recht–Singer 2016) that give generalization bounds for deep nets — non-trivial in regimes where Rademacher bounds are vacuous.

The stability framework is one of the post-classical paths to non-vacuous deep-net bounds.

Worked example: empirical Rademacher and vacuousness

We have built seven bound recipes. §12 puts them on two examples — one where they bite tightly, one where they go vacuous — and lets the reader see the classical theory’s reach and its honest limits in the same picture.

Threshold classifier — Rademacher complexity in closed form

Recap from §5: on a sample of nn uniform [0,1][0,1] points, the threshold class produces exactly n+1n+1 distinct labelings. Massart’s lemma gives R^S(Hthresh)2log(n+1)/n\hat{\mathfrak{R}}_S(\mathcal{H}_{\mathrm{thresh}}) \le \sqrt{2 \log(n+1)/n}, matching the true rate O(logn/n)O(\sqrt{\log n / n}) to a constant factor. Corollary 3 plugs into the canonical bound for any nn, δ\delta: the bound is non-vacuous (under 11) for n10n \gtrsim 10, and tightens rapidly with nn.

Gap-vs-nn scaling vs the theoretical envelope

For the threshold class with η=0.10\eta = 0.10 noise, the empirical worst-case gap matches the Corollary 3 envelope across n{30,100,300,1000,3000}n \in \{30, 100, 300, 1000, 3000\} — §7 already showed this. The classical theory works as advertised in the tabular small-class regime: bounds are informative, the rate is correct, and the constant slack is a small multiple.

A 2-layer MLP on binary MNIST: classical bounds become vacuous

We now apply the same machinery to a modest neural network on a real dataset and observe that the classical bound exceeds 11 — vacuous — despite the model having trivial generalization gap. The setup: subsample n=1000n = 1000 examples from the MNIST digits 0 and 1 (a binary classification task; the easiest non-trivial vision problem). Train a 2-layer ReLU MLP with hidden width w=64w = 64. Evaluate generalization gap by held-out test accuracy. Compute the classical norm-based Rademacher upper bound and the Corollary 3 envelope.

For a 2-layer MLP f(x)=w2ReLU(W1x)f(x) = w_2^\top \mathrm{ReLU}(W_1 x) with w2Rww_2 \in \mathbb{R}^w and W1Rw×dW_1 \in \mathbb{R}^{w \times d}, the layer-norm-based Rademacher upper bound (Bartlett 1998 / Neyshabur et al. 2017) is

R^S(FNN)RXW1opw22nO(logw),\hat{\mathfrak{R}}_S(\mathcal{F}_{\mathrm{NN}}) \le \frac{R_X \cdot \|W_1\|_{\mathrm{op}} \cdot \|w_2\|_2}{\sqrt n} \cdot O(\sqrt{\log w}),

where RX=supxx2R_X = \sup_x \|x\|_2 is the input radius and op\|\cdot\|_{\mathrm{op}} is the spectral / operator norm. For MNIST with pixels normalized to [0,1][0, 1], RX78428R_X \le \sqrt{784} \approx 28. After training, the layer norms produce a classical Rademacher upper bound that comfortably exceeds the trivial range of the loss — vacuous. Meanwhile the actual generalization gap (test error minus train error) is 0.005\sim 0.005. Classical theory says nothing useful about why this network generalizes.
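A sketch of the layer-norm computation behind this bound: given weights W1,w2W_1, w_2 it assembles RXW1opw22/nlogwR_X \cdot \|W_1\|_{\mathrm{op}} \cdot \|w_2\|_2 / \sqrt n \cdot \sqrt{\log w}. The weights below are random stand-ins (no MLP is trained in this snippet), so the printed number only illustrates the mechanics, not the §12 experiment:

```python
import numpy as np

def two_layer_norm_bound(W1, w2, R_X, n, width):
    """Layer-norm-based Rademacher upper bound, up to universal constants."""
    op_norm = np.linalg.norm(W1, ord=2)   # spectral norm of the first layer
    l2_norm = np.linalg.norm(w2)          # Euclidean norm of the output layer
    return R_X * op_norm * l2_norm / np.sqrt(n) * np.sqrt(np.log(width))

# Stand-in weights for a width-64 MLP on 784-dimensional inputs (not trained weights).
rng = np.random.default_rng(0)
d, width, n = 784, 64, 1_000
W1 = rng.normal(scale=1.0 / np.sqrt(d), size=(width, d))
w2 = rng.normal(scale=1.0 / np.sqrt(width), size=width)
R_X = np.sqrt(d)   # pixels in [0, 1], so ||x||_2 <= sqrt(784) ≈ 28

print("norm-based Rademacher upper bound ≈", two_layer_norm_bound(W1, w2, R_X, n, width))
# With realistic post-training layer norms this quantity typically exceeds 1: vacuous.
```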


What the failure tells us

Three honest takeaways:

  1. The bound isn’t wrong. Corollary 3 holds with high probability for any H\mathcal{H}, including 2-layer MLPs. The MLP could, in principle, fit random labels — Zhang et al. 2017 demonstrated this empirically — and a classical bound has to cover that worst case. So Rademacher complexity for typical neural network classes really is close to 11, and the bound really is vacuous.
  2. The classical bound charges H\mathcal{H} uniformly. Rademacher complexity is a property of the entire class. It does not care that the specific trained network h^S\hat h_S is in a “good neighborhood” of H\mathcal{H} where networks all generalize. Data-dependent and algorithm-dependent refinements — PAC-Bayes (Dziugaite & Roy 2017), the implicit bias of SGD (Neyshabur et al. 2017), and compression-based bounds (Arora et al. 2018) — locate the trained network in a tighter implicit hypothesis class and recover non-vacuous bounds for some deep nets.
  3. Stability is an alternative route. Stability bounds (§11) for SGD on smooth losses (Hardt–Recht–Singer 2016) give non-vacuous bounds without going through R^S\hat{\mathfrak{R}}_S at all. The trade-off: stability bounds work only for certain optimization schedules and don’t yet match the empirical accuracy of practical deep nets.

The vacuousness puzzle is the chapter’s central honest reckoning. Classical generalization theory delivers crisp bounds for tabular and kernel methods; it falls silent for modern deep nets. The post-classical bounds are an active research area, surveyed in §14.

Computational notes

Rademacher Monte Carlo: cost and structure

The naive Rademacher estimator R^S(H)1Bb=1Bsuph1niσi(b)h(xi)\hat{\mathfrak{R}}_S(\mathcal{H}) \approx \frac{1}{B}\sum_{b=1}^B \sup_h \frac{1}{n}\sum_i \sigma_i^{(b)} h(x_i) costs O(BHn)O(B \cdot |\mathcal{H}| \cdot n) time when H\mathcal{H} is finite, O(BOPT(H,n))O(B \cdot \mathrm{OPT}(\mathcal{H}, n)) when suph\sup_h requires optimization. For “nice” classes (linear, kernel) the supremum is a quadratic program in the Rademacher labels; for thresholds / intervals it has a closed-form O(nlogn)O(n \log n) solution per Rademacher draw. The standard error of the MC estimator decreases as 1/B1/\sqrt{B}, so B1000B \sim 1000 gives roughly 2-decimal-place accuracy. For high-dimensional or NN classes, the supremum can’t be optimized exactly; upper bounds via Massart, Dudley, or layer-norm propagation are the practical alternatives.

Class-evaluation matrices and structured shortcuts

For finite or finitely-enumerable classes, build H{0,1}H×nH \in \{0, 1\}^{|\mathcal{H}| \times n} where H[h,i]=h(Xi)H[h, i] = h(X_i). Empirical Rademacher reduces to a single matrix multiplication (Hσ)/n(H \cdot \sigma^\top) / n followed by row-wise max and average. NumPy broadcasting handles this in vectorized form. For real-valued classes use HRH×nH \in \mathbb{R}^{|\mathcal{H}| \times n}. For continuous classes (linear, kernel) replace the enumeration by the minimal class that realizes all distinct labelings on SS — for thresholds, that’s n+1n + 1 rows; for half-spaces in Rd\mathbb{R}^d, polynomially many rows (the Sauer–Shelah lemma bounds the number of distinct labelings by i=0d(ni)=O(nd)\sum_{i=0}^{d}\binom{n}{i} = O(n^d) when the VC dimension is dd); for sufficiently rich classes, 2n2^n.
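A minimal NumPy sketch of the class-evaluation-matrix recipe for the threshold class: enumerate the n+1n+1 distinct labelings as rows of HH (in ±1 encoding), multiply by a batch of Rademacher vectors, and average the row-wise maxima. The Monte Carlo batch size BB is an assumption:

```python
import numpy as np

def empirical_rademacher_thresholds(x, B=1_000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    threshold class on the sample x, using its n+1 distinct labelings."""
    rng = np.random.default_rng(seed)
    n = len(x)
    order = np.argsort(x)
    # Row k labels the k smallest points +1 and the rest -1 (±1 encoding).
    H = -np.ones((n + 1, n))
    for k in range(n + 1):
        H[k, order[:k]] = 1.0
    sigma = rng.choice([-1.0, 1.0], size=(n, B))   # B Rademacher draws
    sup_per_draw = (H @ sigma).max(axis=0) / n      # row-wise sup, one per draw
    return sup_per_draw.mean()

x = np.random.default_rng(1).uniform(size=200)
est = empirical_rademacher_thresholds(x)
massart = np.sqrt(2 * np.log(len(x) + 1) / len(x))
print(f"MC estimate ≈ {est:.3f}   Massart bound = {massart:.3f}")
```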

Pitfalls

Three common errors: (i) confusing empirical and population Rademacher — the Corollary 3 bound uses empirical; (ii) forgetting to standardize h(x){0,1}h(x) \in \{0, 1\} to {1,+1}\{-1, +1\} for the Rademacher inner product (writing the ±1 version as 2h12h - 1, the extra hh-independent term averages to zero over the Rademacher draw, but the complexity is rescaled: the {0,1}\{0, 1\} value is exactly half the {1,+1}\{-1, +1\} value, so bounds must use a consistent encoding); (iii) computing covering numbers on the population metric when the bound uses the empirical metric (Dudley with L2(Pn)L_2(P_n), not L2(P)L_2(P)). For high-dimensional Monte Carlo, prefer Rademacher antithetic pairs σ,σ\sigma, -\sigma for variance reduction.

Connections and limits

§14 is the chapter’s closing reckoning.

PAC-Bayes bounds (coming soon)

The PAC-Bayesian framework (McAllester 1999, Catoni 2007, Maurer 2004) does not bound the gap of a single hypothesis; it bounds the expected gap under a posterior distribution QQ over hypotheses. The bound has the form

EhQ[R(h)]EhQ[R^S(h)]+KL(QP)+log(1/δ)2n,\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat R_S(h)] + \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(1/\delta)}{2n}},

where PP is a prior fixed before seeing the data. The bound is tighter than Rademacher precisely when QQ concentrates on hypotheses with small empirical risk and small KL to the prior. For deep networks, QQ as a Gaussian posterior around the SGD output produces non-vacuous bounds (Dziugaite & Roy 2017), giving the first classical theory bounds that match the empirical generalization of trained networks.
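A one-function sketch that evaluates the displayed PAC-Bayes bound given an average empirical risk, a KL divergence, nn, and δ\delta; the example numbers are invented just to show the arithmetic:

```python
import numpy as np

def pac_bayes_bound(emp_risk_Q, kl_Q_P, n, delta=0.05):
    """Simplified McAllester-style form displayed above:
    E_Q[R] <= E_Q[R_hat] + sqrt((KL(Q||P) + log(1/delta)) / (2n))."""
    return emp_risk_Q + np.sqrt((kl_Q_P + np.log(1.0 / delta)) / (2.0 * n))

# Invented numbers: a posterior with small empirical risk and moderate KL to the prior.
print(pac_bayes_bound(emp_risk_Q=0.02, kl_Q_P=500.0, n=55_000, delta=0.05))
# ≈ 0.02 + sqrt(503 / 110000) ≈ 0.09 — non-vacuous when KL is small relative to n.
```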

Remark.

PAC-Bayes is the standard route to non-vacuous deep-net bounds and is the natural extension of generalization theory for modern overparameterized models. The PAC-Bayes Bounds topic (coming soon) develops it in detail.

VC dimension

The Vapnik–Chervonenkis dimension is the combinatorial complexity measure for {0,1}\{0, 1\}-valued classes — the largest dd such that some dd-point set is shattered by H\mathcal{H} (every 2d2^d labeling is realizable). VC dimension underlies the Sauer–Shelah polynomial growth lemma, the Fundamental Theorem of Statistical Learning, and the realizable PAC bounds. For Rademacher-flavored bounds, the standard chain is: finite VC dimension \Rightarrow polynomial growth function \Rightarrow polynomial covering numbers \Rightarrow Dudley integral evaluates to O(d/n)O(\sqrt{d/n}). The VC Dimension topic develops all this in detail and complements generalization-bounds from the combinatorial direction (this topic develops it from the Rademacher direction).

The deep-network puzzle (Zhang et al. 2017)

Remark.

Zhang et al. (2017) “Understanding deep learning requires rethinking generalization” demonstrated that standard convolutional networks can perfectly fit random labels on CIFAR-10 — meaning R^S(FNN)1\hat{\mathfrak{R}}_S(\mathcal{F}_{\mathrm{NN}}) \approx 1 and the classical Rademacher bound is identically vacuous. Yet the same networks trained on real labels generalize well. This is the central empirical observation that classical theory cannot explain. Active responses include: (a) implicit-bias arguments — SGD prefers low-norm or “flat” solutions out of the infinitely many ERMs (Neyshabur et al. 2017, Soudry et al. 2018); (b) compression-based bounds (Arora et al. 2018); (c) PAC-Bayes — the SGD posterior has small KL to a reasonable prior; (d) the neural tangent kernel regime — overparameterized nets approximate kernel methods with margin bounds that are meaningful (Jacot et al. 2018). None has fully solved the puzzle. Tractable, computable non-vacuous deep-net bounds remain an active research target.

Implicit bias, double descent, and beyond

The classical regime d<nd < n has clean theory; the modern regime dnd \gg n has surprising empirical phenomena like double descent (Belkin et al. 2019): test error follows the classical U-shape, peaks near the interpolation threshold, then decreases again as the model grows further. Classical Rademacher and VC bounds predict a generalization gap that grows monotonically with model size, contradicting the second descent. Resolution requires data-dependent bounds (§9) that depend on which solution the algorithm finds — the implicit bias of SGD toward minimum-norm interpolators (Bartlett et al. 2020; benign overfitting literature). The post-classical landscape is still being mapped; this chapter ends with the classical theory’s honest limit and the post-classical roadmap as the open frontier.

Connections

  • Direct prerequisite — ERM, the Fundamental Theorem of Statistical Learning, and the boundary marker for Rademacher complexity. pac-learning introduces Rademacher in one shallow section; this topic goes deep into the symmetrization argument, the canonical bound, fast rates, margin bounds, and stability. pac-learning
  • Direct prerequisite — Hoeffding's inequality drives the §3 finite-class warmup; McDiarmid's bounded-differences inequality is the concentration engine that powers the §7 canonical uniform-convergence bound and the §11 Bousquet–Elisseeff stability bound. concentration-inequalities

References & Further Reading