Generalization Bounds
Rademacher complexity, uniform convergence, and the limits of classical theory
Motivation: why bound the generalization gap?
There is a question every practitioner has asked, and most have asked it under their breath: the model fits the training set, but will it work on data I haven’t seen yet? The training error is a number we can compute; the test error is the number we actually care about. This chapter is about the gap between them — how big it can be, what makes it bigger or smaller, and what kinds of guarantees we can write down in advance.
The gap matters even when training accuracy is excellent. A neural network with a billion parameters can interpolate ImageNet — zero training error, ten million images memorized — and still fail miserably on a held-out test image. The same network can sometimes generalize beautifully. Why? When we ask “why,” what we want is a bound: a number $\varepsilon$ such that the test error is at most the training error plus $\varepsilon$, with high probability over the random draw of the training set. The size of $\varepsilon$ should depend on something — the size of the training set, the complexity of the model, the noise in the data, the choice of algorithm. Producing such a bound is what generalization theory does, and the rest of this chapter is the story of how we produce them.
A first principle. The chapter does not deliver bounds in the practitioner’s sense of “this number will be the test error to within ten percent.” Almost no classical bound, applied to a real neural network on a real dataset, gives a number you would stake money on. The classical bounds are too loose; the practical estimates (held-out validation, cross-validation) come without theoretical guarantees on the bound’s tightness. The classical theory’s value is structural, not numerical: it tells us what features of a learning problem tighten or loosen the gap, and which proof techniques bite hardest on which kinds of hypothesis class. When we eventually look at a 2-layer MLP on a binary MNIST subset (§12) and see that the classical Rademacher bound exceeds one — i.e., is uninformative — that’s not a failure of the chapter; that’s the chapter delivering an honest reckoning of what the classical machinery does and does not yet say about deep nets. We end (§14) with pointers to the post-classical bounds (PAC-Bayes, norm-based, compression-based) that improve on this picture.
A second principle. Bounds come in flavors, and a meaningful bound has to be both valid (it really is an upper bound on the gap, with high probability) and non-vacuous (its numerical value is less than the trivial bound of one). Both conditions matter. A valid bound whose $\varepsilon$ exceeds one is useless when losses live in $[0,1]$. A non-vacuous bound that holds only on average, not with high probability, is a different (weaker) statement than what we want. The bar to clear is non-vacuous high-probability validity, and the rest of the chapter is engineered to clear that bar in as many regimes as the classical theory can manage.
The test-error question — what we actually want to estimate
Formally, the data are iid draws $(x_i, y_i)$ for $i = 1, \dots, n$ from some unknown joint distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$. A hypothesis $h : \mathcal{X} \to \mathcal{Y}$ assigns predictions to inputs; the loss $\ell(h, (x, y))$ measures how badly hypothesis $h$ does at the example $(x, y)$. The risk (also: true risk, population risk, test error in expectation) is

$$R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[ \ell(h, (x, y)) \big].$$
This is the quantity we want to control. We never see it directly: $\mathcal{D}$ is unknown, and the expectation under $\mathcal{D}$ is the thing the training set is supposed to approximate.
Training error as an optimistic estimator
What we do see is the empirical risk of $h$ on the training sample $S = ((x_1, y_1), \dots, (x_n, y_n))$:

$$\widehat{R}_S(h) = \frac{1}{n} \sum_{i=1}^n \ell(h, (x_i, y_i)).$$
For any fixed hypothesis $h$, the law of large numbers says $\widehat{R}_S(h) \to R(h)$ as $n \to \infty$, and Hoeffding’s inequality says the convergence is exponential in $n$. So if the learner picked $h$ before seeing the training set, we’d be done. But the learner uses the training set to pick $h$ — that’s what learning means. The hypothesis output by empirical risk minimization,

$$\hat{h} = \operatorname*{arg\,min}_{h \in \mathcal{H}} \widehat{R}_S(h),$$
is a function of the data, and the law of large numbers does not directly apply to $\widehat{R}_S(\hat{h})$ as an estimator of $R(\hat{h})$. The empirical risk of the ERM is systematically optimistic: the learner has shopped over $\mathcal{H}$ for the hypothesis that fits this particular $S$ best, and that hypothesis is by construction not a typical member of $\mathcal{H}$ on $S$. We have to control the worst-case gap

$$\sup_{h \in \mathcal{H}} \big( R(h) - \widehat{R}_S(h) \big)$$
across the entire class, not the gap for any single hypothesis. That switch — from pointwise to uniform — is the whole game.
Vacuous vs. non-vacuous bounds: the bar to clear
A generalization bound is a statement of the form: with probability at least $1 - \delta$ over the draw of $S$,

$$R(h) \le \widehat{R}_S(h) + \varepsilon(\mathcal{H}, n, \delta) \quad \text{for all } h \in \mathcal{H},$$
where $\varepsilon(\mathcal{H}, n, \delta)$ depends on the complexity of the hypothesis class $\mathcal{H}$, the sample size $n$, and the failure probability $\delta$. The bound is valid if the inequality holds with the stated probability; it is non-vacuous if the numerical value of $\varepsilon$ leaves room below the trivial range of the loss (e.g., $[0,1]$ for 0-1 loss). All the classical machinery we build in §§3–11 produces valid bounds; whether those bounds are non-vacuous depends on $\mathcal{H}$, $n$, and the problem. The vacuousness puzzle for deep nets (§12) is that valid classical bounds for a typical overparameterized network on a standard image dataset come out larger than one — valid, but uninformative.
What this chapter delivers
The chapter develops seven generalization-bound recipes, each tightening the previous in some regime:
- Union bound + Hoeffding for finite $\mathcal{H}$ (§3) — the warmup.
- Uniform convergence via Glivenko–Cantelli (§4) — the right notion, before the right tooling.
- Rademacher complexity (§§5–7) — the tooling; the canonical bound.
- Metric-entropy / Dudley (§8) — beyond $\{0,1\}$ losses.
- Local Rademacher complexity (§9) — fast rates under variance restrictions.
- Margin bounds (§10) — what changes when classifiers come with a confidence score.
- Algorithmic stability (§11) — bypassing $\mathcal{H}$-complexity entirely.
By the end of §11 we have seven different lenses on the same gap. §12 then runs them all on two examples — a small threshold class where they all bite tightly, and a small MLP where the classical ones go vacuous — and §14 honestly reports the open questions about what tightens the bounds for deep networks.
The generalization gap and its decomposition
§1 introduced the gap informally as the difference between training and test error. §2 makes that gap a formal mathematical object, decomposes it into three structurally distinct sources of error, and presents the first picture of the gap as a function of sample size — the empirical anchor every subsequent bound will refine.
Risk and empirical risk, formally
Throughout this chapter we fix a data distribution $\mathcal{D}$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and a loss function $\ell$, bounded above by some constant $B$ (almost always $B = 1$ for the 0-1 loss $\ell(h,(x,y)) = \mathbf{1}\{h(x) \ne y\}$). A hypothesis class $\mathcal{H}$ is a set of functions $h : \mathcal{X} \to \mathcal{Y}$. The reader can keep $\mathcal{H}$ concrete: it might be all decision stumps, all linear classifiers in $\mathbb{R}^d$, or all 2-layer ReLU networks of width $k$.
Definition 1 (Risk).
The risk, or generalization error, or true error of a hypothesis $h$ is

$$R(h) = \mathbb{E}_{z \sim \mathcal{D}}\big[ \ell(h, z) \big].$$
When $\ell$ is the 0-1 loss, $R(h) = \Pr_{(x,y) \sim \mathcal{D}}\big( h(x) \ne y \big)$ is the misclassification probability.
Definition 2 (Empirical risk).
Given an iid sample $S = (z_1, \dots, z_n)$ from $\mathcal{D}$, the empirical risk of $h$ on $S$ is

$$\widehat{R}_S(h) = \frac{1}{n} \sum_{i=1}^n \ell(h, z_i).$$
For fixed $h$, $\widehat{R}_S(h)$ is an unbiased estimator of $R(h)$: $\mathbb{E}_S[\widehat{R}_S(h)] = R(h)$. By Hoeffding’s inequality (cited from concentration-inequalities), the empirical risk concentrates exponentially: for any fixed $h$ and any $t > 0$,

$$\Pr\Big( \big| \widehat{R}_S(h) - R(h) \big| \ge t \Big) \le 2 \exp\!\left( -\frac{2 n t^2}{B^2} \right).$$
This is the pointwise bound — it controls one $h$ at a time, with $h$ fixed in advance.
The generalization gap as the central object
Definition 3 (Generalization gap).
The generalization gap of a hypothesis $h$ on sample $S$ is

$$\operatorname{gap}_S(h) = R(h) - \widehat{R}_S(h).$$
A positive gap means $h$ does worse on test data than on training; a negative gap means $h$ got lucky on $S$. For a fixed $h$, the gap has mean zero ($\mathbb{E}_S[\operatorname{gap}_S(h)] = 0$) and is exponentially concentrated by Hoeffding. The problem is that we never use a fixed $h$ — we pick $\hat{h}$ as a function of $S$, and the gap of a data-dependent hypothesis is not mean-zero. In fact, $\mathbb{E}_S[\operatorname{gap}_S(\hat{h})] > 0$ in general: the ERM systematically overfits.
To get a handle on $\operatorname{gap}_S(\hat{h})$ we control the uniform deviation

$$\Delta_n(S) = \sup_{h \in \mathcal{H}} \big| R(h) - \widehat{R}_S(h) \big|,$$
which sandwiches the gap of any data-dependent $\hat{h}$:

$$-\Delta_n(S) \le \operatorname{gap}_S(\hat{h}) \le \Delta_n(S).$$
A bound on $\Delta_n(S)$ — a single number that controls the gap simultaneously across all hypotheses in $\mathcal{H}$ — is what we need. The rest of the chapter is the story of how to bound $\Delta_n(S)$ tightly for various classes $\mathcal{H}$.
Approximation–estimation–optimization
The gap is one piece of a larger decomposition. Following Bottou & Bousquet (2008), let
- $h^*$ be the Bayes-optimal hypothesis (the minimizer of $R$ over all measurable functions);
- $h^*_{\mathcal{H}}$ be the best in class (the minimizer of $R$ restricted to $\mathcal{H}$);
- $\hat{h}$ be the ERM;
- $\tilde{h}$ be the algorithm’s output — typically not exact ERM (SGD doesn’t run to convergence; even when it does, gradient descent on non-convex losses finds only local minima).
Proposition 1 (Excess-risk decomposition).
The excess risk of the algorithm’s output decomposes as

$$R(\tilde{h}) - R(h^*) = \underbrace{R(h^*_{\mathcal{H}}) - R(h^*)}_{\text{approximation}} + \underbrace{R(\hat{h}) - R(h^*_{\mathcal{H}})}_{\text{estimation}} + \underbrace{R(\tilde{h}) - R(\hat{h})}_{\text{optimization}}.$$
Proof.
All three terms on the right telescope, leaving $R(\tilde{h}) - R(h^*)$ on the left.
∎
The three terms have orthogonal sources:
- Approximation error depends only on $\mathcal{H}$ and $\mathcal{D}$. A class too small to express the truth (linear classifiers for a quadratic boundary) has irreducible approximation error.
- Estimation error depends on $\mathcal{H}$, $n$, and the proof technique. This is the gap-controlled term — the rest of this chapter is dedicated to bounding it.
- Optimization error depends on the algorithm. SGD-on-a-deep-net is the most studied modern case; convex programming (linear regression, SVM) typically achieves negligible optimization error.
Generalization theory’s job is the middle term. The approximation term is the model-selection literature’s problem; the optimization term is the optimization literature’s problem. We will (almost always) assume $\tilde{h} = \hat{h}$ — i.e., optimization is exact — and bound the estimation term.
Connection to bias–variance
For square loss in regression — $\ell(h, (x, y)) = (h(x) - y)^2$ — and a class indexed by parameters, the excess risk at a point $x$ decomposes additively into squared bias plus variance:

$$\mathbb{E}_S\big[ (h_S(x) - h^*(x))^2 \big] = \underbrace{\big( \bar{h}(x) - h^*(x) \big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_S\big[ (h_S(x) - \bar{h}(x))^2 \big]}_{\text{variance}},$$
where $\bar{h}(x) = \mathbb{E}_S[h_S(x)]$ is the average estimator across data draws. The bias term parallels the approximation error (how well does the class express the truth?); the variance term parallels the estimation error (how variable is the ERM around its mean?). The parallel is exact for square loss; for 0-1 loss the decomposition does not factor as cleanly (the loss is non-additive), but the intuition transfers: a low-complexity class has small variance but possibly large bias; a high-complexity class has small bias but large variance; the trade-off chooses the complexity in between. Classical generalization theory makes the variance/estimation side of this trade-off rigorous.
Finite-class warmup
We can write down our first generalization bound, today. The recipe is two ingredients we already have: Hoeffding’s inequality for a fixed hypothesis, and the union bound to make it uniform across a finite class. The result — sometimes called the Occam bound — controls the gap by $\sqrt{\frac{\ln |\mathcal{H}| + \ln(2/\delta)}{2n}}$, an $O(1/\sqrt{n})$ rate with logarithmic dependence on class size. It is the right starting point for two reasons: it is genuinely useful in its own right (it bites for small finite classes), and the proof technique generalizes — every later bound is some refinement of “control fixed-$h$ deviations, then make the bound uniform across $\mathcal{H}$.”
Hoeffding for a single hypothesis
Fix one $h \in \mathcal{H}$. The empirical risk $\widehat{R}_S(h)$ is the mean of $n$ iid random variables $\ell(h, z_i)$, each bounded in $[0, 1]$. Hoeffding’s inequality (proved in concentration-inequalities) gives a two-sided bound: for any $t > 0$,

$$\Pr\Big( \big| R(h) - \widehat{R}_S(h) \big| \ge t \Big) \le 2 e^{-2 n t^2}.$$
Choosing $t = \sqrt{\ln(2/\delta)/(2n)}$ makes the right-hand side equal $\delta$, so with probability at least $1 - \delta$,

$$\big| R(h) - \widehat{R}_S(h) \big| \le \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
This is a pointwise bound — it holds for any specific $h$ chosen before $S$ was drawn. It does not hold uniformly across $\mathcal{H}$.
Union bound across $\mathcal{H}$
If $\mathcal{H}$ is finite with $|\mathcal{H}| = M$ hypotheses, label them $h_1, \dots, h_M$. For each $h_j$, Hoeffding bounds the failure probability $\Pr\big( |R(h_j) - \widehat{R}_S(h_j)| \ge t \big) \le 2 e^{-2 n t^2}$. The probability that any of the $M$ hypotheses fails is at most the sum:

$$\Pr\Big( \exists j : \big| R(h_j) - \widehat{R}_S(h_j) \big| \ge t \Big) \le 2 M e^{-2 n t^2}.$$
Set the right-hand side equal to $\delta$ and solve for $t$:

$$t = \sqrt{\frac{\ln M + \ln(2/\delta)}{2n}}.$$
Theorem 1 (Finite-class generalization bound).
Let $\mathcal{H}$ be a finite hypothesis class with $|\mathcal{H}| = M$, let $S$ be an iid sample of size $n$ from $\mathcal{D}$, and let $\ell$ be a loss bounded in $[0, 1]$. For any $\delta > 0$, with probability at least $1 - \delta$ over $S$,

$$\big| R(h) - \widehat{R}_S(h) \big| \le \sqrt{\frac{\ln M + \ln(2/\delta)}{2n}} \quad \text{for every } h \in \mathcal{H}.$$
Proof.
Apply Hoeffding pointwise to each $h_j$ with $t = \sqrt{(\ln M + \ln(2/\delta))/(2n)}$; the per-hypothesis failure probability is $\delta / M$. The union bound across the $M$ hypotheses gives total failure probability at most $\delta$. The contrapositive — that no hypothesis fails — gives the stated bound.
∎
Sample complexity
Theorem 1 inverts to a sample-complexity statement: how many samples $n$ does one need so the gap is at most $\varepsilon$ with confidence $1 - \delta$?
Corollary 1 (Sample complexity for finite classes).
To guarantee $|R(h) - \widehat{R}_S(h)| \le \varepsilon$ for every $h \in \mathcal{H}$ with probability $1 - \delta$, it suffices that

$$n \ge \frac{\ln M + \ln(2/\delta)}{2 \varepsilon^2}.$$
Proof.
Solve Theorem 1’s bound for $n$ given the target $\varepsilon$.
∎
Three observations sharpen the corollary into intuition (a numeric sketch follows the list):
- Dependence on class size is logarithmic. Doubling $M$ increases the required sample size by an additive constant $\ln 2 / (2\varepsilon^2)$, not a multiplicative factor. This is the Occam in Occam bound: complexity is cheap, in the sense that very large but finite hypothesis classes are still PAC-learnable with modest samples.
- Dependence on $\varepsilon$ is quadratic. Halving $\varepsilon$ quadruples the required $n$. The $1/\varepsilon^2$ rate is the canonical slow rate of distribution-free generalization theory; §9 shows when fast rates ($1/\varepsilon$) are possible.
- Dependence on $\delta$ is logarithmic. Halving $\delta$ adds $\ln 2 / (2\varepsilon^2)$ to the required $n$. High-confidence bounds come cheap.
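To make the arithmetic concrete, here is a minimal sketch in plain Python (the function names are ours, for illustration) that evaluates the Theorem 1 gap and the Corollary 1 sample complexity:

```python
import math

def occam_gap(M, n, delta):
    """Theorem 1 gap: sqrt((ln M + ln(2/delta)) / (2n))."""
    return math.sqrt((math.log(M) + math.log(2 / delta)) / (2 * n))

def occam_sample_complexity(M, eps, delta):
    """Corollary 1: smallest n guaranteeing gap <= eps w.p. 1 - delta."""
    return math.ceil((math.log(M) + math.log(2 / delta)) / (2 * eps ** 2))

# Doubling M adds only an additive ln(2)/(2 eps^2) to the required n.
for M in (10, 10**6, 2 * 10**6):
    print(M, occam_sample_complexity(M, eps=0.05, delta=0.05))
```

At $\varepsilon = 0.05$, doubling $M$ from $10^6$ to $2 \times 10^6$ moves the required $n$ by only $\ln 2 / (2 \varepsilon^2) \approx 139$ samples.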
Why $\ln |\mathcal{H}|$ misleads for infinite classes
Theorem 1 needs $\mathcal{H}$ finite. Every interesting class in machine learning — linear classifiers in $\mathbb{R}^d$, decision trees, neural networks — is uncountably infinite. A naive application of the bound to an infinite class gives a useless $\ln |\mathcal{H}| = \infty$.
A natural workaround discretizes. Take the threshold class $\mathcal{H} = \{ h_\theta(x) = \mathbf{1}\{x \ge \theta\} : \theta \in [0, 1] \}$, replace it with the $N$-point grid $\{ \theta_j = j/N : j = 1, \dots, N \}$, apply Theorem 1, and take $N \to \infty$. The bound rate becomes $\sqrt{\ln N / (2n)}$, which diverges as $N \to \infty$. So discretization alone doesn’t work — the bound goes vacuous as the discretization refines.
But the true worst-case gap of $\mathcal{H}$ on an $n$-sample is bounded; it grows like $\sqrt{2 \ln(n+1)/n}$ (we’ll prove this in §5 via Rademacher complexity). The discretization argument has lost information about the combinatorial structure of the class. Two thresholds $\theta \ne \theta'$ are different elements of $\mathcal{H}$, but if no sample point falls between them, they produce identical predictions on $S$ — they are effectively the same hypothesis from the point of view of the data. The right complexity is not “how many hypotheses are there?” but “how many distinct labelings of an $n$-sample can the class produce?” That switch — from counting hypotheses to counting behaviors — is the conceptual move §4 and §5 develop, via Glivenko–Cantelli classes and Rademacher complexity respectively. By the end of §5 we will have replaced $\ln |\mathcal{H}|$ with $\mathfrak{R}_n(\mathcal{H})$, a quantity that stays finite for the genuinely-infinite classes ML actually uses.
Uniform convergence and Glivenko–Cantelli
§3 showed that for finite hypothesis classes, the gap is controlled uniformly. §4 elevates “uniform convergence” from a proof technique to a property of a hypothesis class. We will see that some infinite classes have the property — they are Glivenko–Cantelli classes — and others do not. The classical Glivenko–Cantelli theorem (1933) is the prototype: the class of half-line indicators on $\mathbb{R}$ has uniform convergence, even though the class is uncountably infinite. The right complexity is structural, not cardinal — and §5 (Rademacher) will give the structural complexity measure we need.
Uniform convergence: the right notion
For a class $\mathcal{F}$ of functions $f : \mathcal{Z} \to \mathbb{R}$ on a space $\mathcal{Z}$, the empirical mean of $f$ on sample $S = (z_1, \dots, z_n)$ is $P_n f = \frac{1}{n} \sum_{i=1}^n f(z_i)$, and its population mean is $P f = \mathbb{E}_{z \sim \mathcal{D}}[f(z)]$.
Definition 4 (Uniform convergence; Glivenko–Cantelli class).
The class $\mathcal{F}$ has the uniform convergence property with respect to a distribution $\mathcal{D}$ on $\mathcal{Z}$ — $\mathcal{F}$ is a Glivenko–Cantelli class for $\mathcal{D}$ — if

$$\sup_{f \in \mathcal{F}} \big| P_n f - P f \big| \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty.$$
The connection to risk and empirical risk is direct: take $\mathcal{F} = \ell \circ \mathcal{H} = \{ z \mapsto \ell(h, z) : h \in \mathcal{H} \}$, the loss class induced by $\mathcal{H}$. Then $P_n f = \widehat{R}_S(h)$, $P f = R(h)$, and $\mathcal{F}$ being GC means $\sup_{h \in \mathcal{H}} |\widehat{R}_S(h) - R(h)| \to 0$ — uniform convergence of empirical risks to true risks across the entire hypothesis class. Distribution-free uniform convergence (GC for every $\mathcal{D}$) is what makes a hypothesis class PAC-learnable in the agnostic setting.
Classical Glivenko–Cantelli (empirical CDF)
The original GC theorem is the special case of half-line indicators on $\mathbb{R}$.
Theorem 2 (Glivenko–Cantelli, 1933).
Let $x_1, \dots, x_n$ be iid from a distribution on $\mathbb{R}$ with CDF $F$, and let $F_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i \le t\}$ be the empirical CDF. Then

$$\sup_{t \in \mathbb{R}} \big| F_n(t) - F(t) \big| \xrightarrow{\text{a.s.}} 0.$$
The qualitative version (a.s. convergence with no rate) was proved by Glivenko and Cantelli independently in 1933. A quantitative version with an exponential rate is the DKW inequality (Dvoretzky–Kiefer–Wolfowitz 1956, sharpened by Massart 1990): for all $n$ and $\varepsilon > 0$,

$$\Pr\Big( \sup_{t \in \mathbb{R}} \big| F_n(t) - F(t) \big| \ge \varepsilon \Big) \le 2 e^{-2 n \varepsilon^2}.$$
Two observations make this quantitatively striking:
- The $e^{-2n\varepsilon^2}$ rate is the same as Hoeffding for a single function. The supremum across the uncountable family of half-line indicators costs no extra log factor compared to a single fixed $t$. The combinatorial structure of half-line indicators on $\mathbb{R}$ is rich enough to be uncountable but lean enough to behave like a single function in the bound.
- The constant is distribution-free. The bound has no dependence on $F$, the underlying distribution — uniform convergence holds at the same rate for any continuous, discrete, or mixed distribution on $\mathbb{R}$.
The connection of Theorem 2 to learning theory is the half-line classifier: $\mathcal{H} = \{ \mathbf{1}\{x \le \theta\} : \theta \in \mathbb{R} \}$ is the simplest non-trivial classifier class. The DKW inequality says this class has uniform-convergence rate $O(1/\sqrt{n})$ independent of the data distribution — a property the discretization argument from §3 failed to recover. The reason the half-line class succeeds where naive discretization fails is the combinatorial leanness of half-line indicators: on a sample of size $n$, the class produces exactly $n + 1$ distinct labelings (one for each “break” between consecutive ordered sample points). This count grows polynomially in $n$ — the seed of the Sauer–Shelah polynomial growth lemma.
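A quick simulation makes the distribution-free claim tangible. The sketch below (NumPy; an illustrative check with helper names of our own) draws samples from two quite different distributions and verifies that the empirical frequency of a large sup-deviation sits under the DKW envelope for both:

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_cdf_deviation(sample, cdf):
    """sup_t |F_n(t) - F(t)|, checked at both sides of each jump of F_n."""
    x = np.sort(sample)
    n = len(x)
    F = cdf(x)
    upper = np.arange(1, n + 1) / n - F   # right limit of F_n at x_(i)
    lower = F - np.arange(0, n) / n       # left limit of F_n at x_(i)
    return max(upper.max(), lower.max())

n, trials, eps = 1000, 2000, 0.05
dkw_bound = 2 * np.exp(-2 * n * eps**2)   # 2 e^{-2 n eps^2} ~ 0.013
for name, sampler, cdf in [
    ("uniform", lambda: rng.uniform(0, 1, n), lambda t: t),
    ("exponential", lambda: rng.exponential(1.0, n), lambda t: 1 - np.exp(-t)),
]:
    devs = np.array([sup_cdf_deviation(sampler(), cdf) for _ in range(trials)])
    print(name, "empirical failure rate:", (devs >= eps).mean(), "<=", dkw_bound)
```

Both distributions exhibit the same envelope, as the theorem promises.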
GC classes and Donsker classes
The classical Glivenko–Cantelli theorem generalizes well beyond half-lines. The empirical-processes literature studies broad classes for which uniform convergence holds:
- VC-subgraph classes (loosely: classes with finite VC dimension applied to indicator functions) are GC for every $\mathcal{D}$. Half-lines are the simplest example; half-spaces in $\mathbb{R}^d$, intervals, and axis-aligned rectangles in $\mathbb{R}^d$ are also GC.
- Donsker classes are a strictly stronger notion: the centered and scaled empirical process $\sqrt{n}(P_n f - P f)$ converges, as a process indexed by $f \in \mathcal{F}$, to a Gaussian process on $\mathcal{F}$. Donsker implies GC (and gives the $1/\sqrt{n}$ rate); the converse is false.
- Bracketing entropy and uniform covering numbers are the two technical workhorses that certify a class as GC or Donsker. We use them in §8 for real-valued function classes.
For now the takeaway is structural: GC-ness is a property of the combinatorial / metric structure of $\mathcal{F}$, not of its raw cardinality. The next two sections develop a quantitative notion that subsumes GC: a class is GC if and only if its Rademacher complexity vanishes as $n \to \infty$.
A non-GC counterexample
Some classes are not Glivenko–Cantelli — and naming one cleanly is the test of whether the definition has bite. The cleanest counterexample is the class of indicators of finite sets:

$$\mathcal{F} = \{ \mathbf{1}_A : A \subset \mathbb{R},\ A \text{ finite} \}.$$
Example 1 (A non-GC class).
For any continuous distribution $\mathcal{D}$ on $\mathbb{R}$ — say uniform on $[0, 1]$ — the class fails the uniform convergence property catastrophically. The argument: for any sample $S = (z_1, \dots, z_n)$, the indicator $f_S = \mathbf{1}_{\{z_1, \dots, z_n\}}$ of $S$ (a finite set) is in $\mathcal{F}$, and

$$P_n f_S = 1, \qquad P f_S = 0,$$
where $P f_S = 0$ because $\{z_1, \dots, z_n\}$ is a null set under a continuous distribution. Therefore $\sup_{f \in \mathcal{F}} |P_n f - P f| = 1$ for every $n$ — uniform convergence fails maximally.
What went wrong? The class is too rich — it shatters every sample, in the VC sense. Whatever the labels on $S$, some $f \in \mathcal{F}$ produces them. A class that can produce any labeling of any sample is unconstrained, and unconstrained means no uniform convergence. The VC Dimension topic makes this precise via the Sauer–Shelah lemma: a class with finite VC dimension produces only polynomially many distinct labelings on an $n$-sample, while a shattering class produces all $2^n$ — and only the polynomial regime is GC. A GC class is one whose behavior on samples is constrained, not one whose cardinality is small.
Rademacher complexity
We have now established two things. First, that uniform convergence is the right notion for a hypothesis class — the property that simultaneously controls the gap across all $h \in \mathcal{H}$. Second, that the cardinality $|\mathcal{H}|$ is the wrong complexity measure for infinite classes — what matters is the combinatorial behavior of $\mathcal{H}$ on samples. §5 introduces the quantity that puts a finger on that combinatorial behavior: the Rademacher complexity of $\mathcal{H}$. It is the canonical complexity measure for distribution-free generalization theory, the data-dependent quantity that we will see in every subsequent bound, and the one that allows us to compute — not just upper-bound — the relevant complexity of finite-and-infinite classes alike.
The random-labels intuition
Here is the picture. Fix a sample $x_1, \dots, x_n$ in $\mathcal{X}$. Forget the true labels; draw fresh ones uniformly at random: $\sigma_1, \dots, \sigma_n$ iid with $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$ each. Each Rademacher label vector $\sigma \in \{-1, +1\}^n$ is a different “random labeling” of the sample. There are $2^n$ such labelings.
For a binary classifier $h : \mathcal{X} \to \{-1, +1\}$, the quantity $\frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)$ is the correlation between the classifier’s outputs and the random labels $\sigma$. It is $+1$ if $h$’s outputs match $\sigma$ exactly, $-1$ if they are everywhere opposite, $0$ in expectation if $h$ is independent of $\sigma$. The supremum across $\mathcal{H}$,

$$\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i),$$
measures the best fit any $h \in \mathcal{H}$ can produce to the random label vector $\sigma$. A class that can fit any labeling perfectly achieves a supremum of $1$ for every $\sigma$. A class with no labeling flexibility achieves a value that concentrates around $0$ at rate $1/\sqrt{n}$. The expected supremum over the random $\sigma$ interpolates between these extremes — and that expectation is the empirical Rademacher complexity.
Empirical and population Rademacher complexity
Definition 5 (Empirical Rademacher complexity).
Given a class $\mathcal{F}$ of real-valued functions on $\mathcal{Z}$ and a sample $S = (z_1, \dots, z_n)$, the empirical Rademacher complexity of $\mathcal{F}$ on $S$ is

$$\widehat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(z_i) \right],$$
where $\sigma = (\sigma_1, \dots, \sigma_n)$ is an iid Rademacher vector ($\sigma_i \in \{-1, +1\}$ with $\Pr(\sigma_i = +1) = 1/2$), independent of $S$.
Definition 6 (Population Rademacher complexity).
The population (or expected) Rademacher complexity is the expectation of the empirical Rademacher over the draw of $S$:

$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_{S \sim \mathcal{D}^n} \big[ \widehat{\mathfrak{R}}_S(\mathcal{F}) \big].$$
Three basic observations:
- Translation invariance / centering. Adding a constant $c$ to every $f \in \mathcal{F}$ shifts the supremand by $\frac{c}{n} \sum_i \sigma_i$, which has mean zero — so Rademacher complexity is unchanged. It measures variability across the class, not absolute level.
- Convex hull invariance. $\widehat{\mathfrak{R}}_S(\operatorname{conv} \mathcal{F}) = \widehat{\mathfrak{R}}_S(\mathcal{F})$: taking convex combinations of functions cannot help fit random labels better than the extreme points already do.
- Monotone in the class. $\mathcal{F} \subseteq \mathcal{F}' \implies \widehat{\mathfrak{R}}_S(\mathcal{F}) \le \widehat{\mathfrak{R}}_S(\mathcal{F}')$: a bigger class fits random labels better.
The relationship to the uniform deviation is made precise in §6 (symmetrization) and §7 (the canonical bound). Punch line: $\mathbb{E}_S\big[ \sup_{h \in \mathcal{H}} (R(h) - \widehat{R}_S(h)) \big] \le 2 \mathfrak{R}_n(\ell \circ \mathcal{H})$ — Rademacher of the loss class is what shows up.
Massart’s finite-class lemma
Lemma 1 (Massart's finite-class lemma).
Let $A \subset \mathbb{R}^n$ be a finite set with $|A| = N$, and let $r = \max_{a \in A} \|a\|_2$. Then

$$\mathbb{E}_\sigma \left[ \sup_{a \in A} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i \right] \le \frac{r \sqrt{2 \ln N}}{n}.$$
Proof.
Let $Z = \sup_{a \in A} \sum_{i=1}^n \sigma_i a_i$. By Hoeffding’s lemma for Rademacher random variables, $\mathbb{E}[e^{\lambda \sigma_i a_i}] \le e^{\lambda^2 a_i^2 / 2}$. For any $\lambda > 0$,

$$e^{\lambda \mathbb{E}[Z]} \le \mathbb{E}\big[ e^{\lambda Z} \big] \le \sum_{a \in A} \prod_{i=1}^n \mathbb{E}\big[ e^{\lambda \sigma_i a_i} \big] \le N e^{\lambda^2 r^2 / 2}.$$
Taking logarithms and dividing by $\lambda$,

$$\mathbb{E}[Z] \le \frac{\ln N}{\lambda} + \frac{\lambda r^2}{2}.$$
Setting the derivative to zero gives $\lambda = \sqrt{2 \ln N} / r$ and substituting back,

$$\mathbb{E}[Z] \le r \sqrt{2 \ln N}.$$
Dividing by $n$ gives Massart’s lemma.
∎
Corollary 2 (Rademacher complexity of a finite hypothesis class).
For a binary classifier class $\mathcal{H}$ with $|\mathcal{H}| = M$ and any sample $S$ of size $n$,

$$\widehat{\mathfrak{R}}_S(\mathcal{H}) \le \sqrt{\frac{2 \ln M}{n}}.$$
Proof.
Apply Lemma 1 with $A = \{ (h(x_1), \dots, h(x_n)) : h \in \mathcal{H} \}$. Each $a \in A$ has $\|a\|_2 = \sqrt{n}$, so $\widehat{\mathfrak{R}}_S(\mathcal{H}) \le \frac{\sqrt{n} \sqrt{2 \ln M}}{n} = \sqrt{2 \ln M / n}$.
∎
Massart’s bound is the basis of the Rademacher refinement of §3. Even if $\mathcal{H}$ is huge or infinite, on a sample of size $n$ the class realizes at most $|\mathcal{H}_{|S}|$ distinct prediction vectors, and the Rademacher complexity is at most $\sqrt{2 \ln |\mathcal{H}_{|S}| / n}$ — informative, where the union-bound treatment is identical in rate but routes through enumeration of $\mathcal{H}$ rather than a single computable quantity.
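A minimal Monte Carlo check of Corollary 2 (an illustrative sketch; the random prediction matrix is a stand-in class, not from the text): estimate the empirical Rademacher complexity of a small finite class by simulation and compare it with the Massart bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, m = 200, 50, 20000          # sample size, class size, Rademacher draws

# Rows are the prediction vectors (h(x_1), ..., h(x_n)) of M classifiers.
H = rng.choice([-1, 1], size=(M, n))

sigma = rng.choice([-1, 1], size=(m, n))    # m Rademacher vectors
corr = H @ sigma.T / n                      # (M, m): (1/n) sum_i sigma_i h(x_i)
rad_hat = corr.max(axis=0).mean()           # E_sigma sup_h correlation

massart = np.sqrt(2 * np.log(M) / n)
print(f"MC estimate {rad_hat:.3f} <= Massart bound {massart:.3f}")
```

The estimate lands below the bound, with the gap reflecting Massart’s worst-case constant.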
Talagrand’s contraction lemma
Lemma 2 (Talagrand's contraction lemma).
Let $\varphi : \mathbb{R} \to \mathbb{R}$ be $L$-Lipschitz with $\varphi(0) = 0$, and let $\mathcal{F}$ be a class of real-valued functions on $\mathcal{Z}$. Then for any sample $S$,

$$\widehat{\mathfrak{R}}_S(\varphi \circ \mathcal{F}) \le L \, \widehat{\mathfrak{R}}_S(\mathcal{F}),$$
where $\varphi \circ \mathcal{F} = \{ \varphi \circ f : f \in \mathcal{F} \}$.
Proof.
The proof goes by induction on $n$. The base case $n = 1$ is direct from Lipschitzness. For the inductive step, one conditions on $\sigma_1, \dots, \sigma_{n-1}$ and uses a comparison inequality that reduces to the Lipschitz condition. The full argument is in Ledoux & Talagrand (1991), Ch. 4.
∎
The contraction lemma is how we move from the predictor class $\mathcal{H}$ to the loss class $\ell \circ \mathcal{H}$. For the 0-1 loss on binary classifiers, rewrite $\mathbf{1}\{h(x) \ne y\} = \frac{1 - y h(x)}{2}$ — which is $\frac{1}{2}$-Lipschitz in $h(x)$ — and conclude $\widehat{\mathfrak{R}}_S(\ell \circ \mathcal{H}) \le \frac{1}{2} \widehat{\mathfrak{R}}_S(\mathcal{H})$. For margin losses (§10), the ramp loss is $\frac{1}{\gamma}$-Lipschitz, and the contraction lemma gives the $\gamma$-in-the-denominator dependence of the margin bound.
Symmetrization
§5 introduced Rademacher complexity as a data-dependent complexity measure. §6 is the proof technique that connects Rademacher complexity to the uniform deviation: symmetrization, also called the ghost-sample trick. The lemma proved here is the structural backbone of every Rademacher generalization bound — including the canonical one in §7. The technique is clever in a way that rewards a careful, slow read. We will lay every step out.
The ghost-sample trick
Recall the central object: the uniform deviation $\sup_{f \in \mathcal{F}} (P f - P_n f)$ (we work one-sided here; the two-sided version follows by applying the argument to $\mathcal{F} \cup (-\mathcal{F})$). We want to bound its expectation by something computable. The challenge is that $P f$ is a population expectation — invisible to us, and a single deterministic number per $f$. The empirical $P_n f$ is a sample average. The supremum couples both, and the asymmetry between the two is what makes the bound hard.
The trick is to introduce a ghost sample $S' = (z'_1, \dots, z'_n)$, iid from $\mathcal{D}$, independent of $S$. The ghost sample never gets drawn — it is a mathematical fiction we use to symmetrize the difference. The key observation: $P f = \mathbb{E}_{S'}[P'_n f]$, where $P'_n$ is the ghost empirical measure. The pair $(S, S')$ is iid from $\mathcal{D}^{2n}$, exchangeable under coordinate swaps — and that exchangeability is what lets us introduce iid Rademacher signs without changing the joint distribution.
Generalization-gap moments → Rademacher averages
For any fixed function $f$ and index $i$, the pair $(z_i, z'_i)$ and its swap $(z'_i, z_i)$ have the same joint distribution; the difference $f(z'_i) - f(z_i)$ is antisymmetric under the swap, so it has the same distribution as its negation. Inserting an iid Rademacher sign $\sigma_i$ in front of each difference therefore leaves the joint distribution unchanged for each fixed $f$ — and the supremum can be moved through by Jensen-like arguments. The full chain follows.
A full proof, written out
Lemma 3 (Symmetrization lemma).
For any class $\mathcal{F}$ of bounded functions on $\mathcal{Z}$ and any $n$,

$$\mathbb{E}_S \left[ \sup_{f \in \mathcal{F}} \big( P f - P_n f \big) \right] \le 2 \mathfrak{R}_n(\mathcal{F}).$$
Proof.
Let $S = (z_1, \dots, z_n)$ be the original sample (iid from $\mathcal{D}$), and let $S' = (z'_1, \dots, z'_n)$ be an independent ghost sample (also iid from $\mathcal{D}$). For any $f \in \mathcal{F}$,

$$P f = \mathbb{E}_{S'}\left[ \frac{1}{n} \sum_{i=1}^n f(z'_i) \right] = \mathbb{E}_{S'}[P'_n f]. \tag{1}$$
Substituting (1) into the uniform deviation:

$$\mathbb{E}_S\left[ \sup_{f \in \mathcal{F}} \big( P f - P_n f \big) \right] = \mathbb{E}_S\left[ \sup_{f \in \mathcal{F}} \mathbb{E}_{S'}\big[ P'_n f - P_n f \big] \right]. \tag{2}$$
The quantity inside the supremum depends on the ghost sample only through its expectation. By Jensen’s inequality applied to the convex function $\sup_{f \in \mathcal{F}}(\cdot)$ (a pointwise supremum of linear functions),

$$\sup_{f \in \mathcal{F}} \mathbb{E}_{S'}\big[ P'_n f - P_n f \big] \le \mathbb{E}_{S'}\left[ \sup_{f \in \mathcal{F}} \big( P'_n f - P_n f \big) \right]. \tag{3}$$
Taking $\mathbb{E}_S$ on both sides of (3) and combining with (2),

$$\mathbb{E}_S\left[ \sup_{f \in \mathcal{F}} \big( P f - P_n f \big) \right] \le \mathbb{E}_{S, S'}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big( f(z'_i) - f(z_i) \big) \right]. \tag{4}$$
Now introduce iid Rademacher random variables $\sigma_1, \dots, \sigma_n$ with $\Pr(\sigma_i = \pm 1) = 1/2$, independent of $(S, S')$. Because each pair $(z_i, z'_i)$ is exchangeable under coordinate swap (iid from $\mathcal{D}$), the swapped pair has the same joint distribution as the original for any fixed sign pattern. Therefore

$$\mathbb{E}_{S,S'}\left[ \sup_{f} \frac{1}{n} \sum_{i=1}^n \big( f(z'_i) - f(z_i) \big) \right] = \mathbb{E}_{S,S',\sigma}\left[ \sup_{f} \frac{1}{n} \sum_{i=1}^n \sigma_i \big( f(z'_i) - f(z_i) \big) \right]. \tag{5}$$
Apply the triangle inequality on the supremum:

$$\sup_{f} \frac{1}{n} \sum_{i=1}^n \sigma_i \big( f(z'_i) - f(z_i) \big) \le \sup_{f} \frac{1}{n} \sum_{i=1}^n \sigma_i f(z'_i) + \sup_{f} \frac{1}{n} \sum_{i=1}^n (-\sigma_i) f(z_i).$$
Taking expectations,

$$\mathbb{E}\left[ \sup_{f} \frac{1}{n} \sum_{i} \sigma_i \big( f(z'_i) - f(z_i) \big) \right] \le \mathbb{E}\left[ \sup_{f} \frac{1}{n} \sum_{i} \sigma_i f(z'_i) \right] + \mathbb{E}\left[ \sup_{f} \frac{1}{n} \sum_{i} (-\sigma_i) f(z_i) \right]. \tag{6}$$
The first term in (6) is the definition of $\mathfrak{R}_n(\mathcal{F})$. For the second term: $-\sigma$ has the same distribution as $\sigma$ (the Rademacher distribution is symmetric under negation), so the second term also equals $\mathfrak{R}_n(\mathcal{F})$.
Combining (4), (5), and (6):

$$\mathbb{E}_S\left[ \sup_{f \in \mathcal{F}} \big( P f - P_n f \big) \right] \le 2 \mathfrak{R}_n(\mathcal{F}).$$
∎
Remark.
The one-sided bound in Lemma 3 is sometimes stated as a two-sided bound $\mathbb{E}_S\big[ \sup_{f} |P f - P_n f| \big] \le 2 \mathfrak{R}_n(\mathcal{F})$. The two-sided version follows by applying the lemma to the class $\mathcal{F} \cup (-\mathcal{F})$. A more refined symmetrization replaces the population Rademacher $\mathfrak{R}_n$ by the empirical $\widehat{\mathfrak{R}}_S$ plus a concentration term, which is what shows up in the canonical bound in §7.
McDiarmid + Rademacher → uniform convergence
We have two ingredients: symmetrization (Lemma 3) controls the expected uniform deviation by $2 \mathfrak{R}_n(\mathcal{F})$; concentration (McDiarmid, from concentration-inequalities) promotes “in expectation, the deviation is at most …” into “with probability $1 - \delta$, the deviation is at most … plus a $\sqrt{\ln(1/\delta)/n}$ term”. Combine them and we get the canonical generalization bound — the theorem the whole chapter has been building toward.
Bounded differences for the uniform deviation
The function $\Phi(S) = \sup_{f \in \mathcal{F}} (P f - P_n f)$ has the bounded-differences property for losses bounded in $[0, 1]$. Replacing one sample $z_i$ with a different value $z'_i$ changes $P_n f$ by at most $1/n$ for any fixed $f$ (the loss term for the $i$-th sample contributes at most $1/n$ to the average). Therefore $\Phi$ changes by at most $1/n$ — the supremum operation preserves the bounded-differences constant.
McDiarmid concentration on $\Phi$
By McDiarmid’s bounded-differences inequality, for any $t > 0$,

$$\Pr\big( \Phi(S) \ge \mathbb{E}[\Phi(S)] + t \big) \le e^{-2 n t^2}.$$
This is a concentration statement: $\Phi(S)$ is tightly concentrated around its mean. To deduce a high-probability upper bound on $\Phi(S)$ itself, we use the (one-sided) failure inversion: with probability at least $1 - \delta$,

$$\Phi(S) \le \mathbb{E}[\Phi(S)] + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$
The bound on $\mathbb{E}[\Phi(S)]$ comes from Lemma 3 (symmetrization). The remaining task is to upgrade the population Rademacher in Lemma 3 to the empirical Rademacher — the latter is what we can compute from a single sample.
The two-step recipe and the canonical bound
Apply McDiarmid a second time, this time to $\widehat{\mathfrak{R}}_S(\mathcal{F})$ as a function of $S$. The empirical Rademacher also has bounded differences: replacing $z_i$ with $z'_i$ changes the inner average $\frac{1}{n} \sum_j \sigma_j f(z_j)$ by at most $1/n$ for any fixed $f$ and $\sigma$, and supremum + expectation preserve the bound. So with probability $1 - \delta$,

$$\mathfrak{R}_n(\mathcal{F}) \le \widehat{\mathfrak{R}}_S(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$
Theorem 3 (Canonical Rademacher generalization bound).
Let $\mathcal{F}$ be a class of functions $f : \mathcal{Z} \to [0, 1]$, let $S$ be an iid sample of size $n$ from $\mathcal{D}$, and let $\delta > 0$. With probability at least $1 - \delta$ over $S$, for every $f \in \mathcal{F}$,

$$P f \le P_n f + 2 \widehat{\mathfrak{R}}_S(\mathcal{F}) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
Proof.
Step (i): bounded-differences verification of $\Phi(S) = \sup_{f \in \mathcal{F}} (P f - P_n f)$, done above; the constant is $1/n$ for each coordinate.
Step (ii): McDiarmid on $\Phi$. By the bounded-differences inequality at level $\delta/2$, with probability $1 - \delta/2$,

$$\Phi(S) \le \mathbb{E}[\Phi(S)] + \sqrt{\frac{\ln(2/\delta)}{2n}}. \tag{I}$$
Step (iii): symmetrization (Lemma 3),

$$\mathbb{E}[\Phi(S)] \le 2 \mathfrak{R}_n(\mathcal{F}). \tag{II}$$
Step (iv): McDiarmid on $\widehat{\mathfrak{R}}_S(\mathcal{F})$. By the bounded-differences inequality at level $\delta/2$, with probability $1 - \delta/2$,

$$\mathfrak{R}_n(\mathcal{F}) \le \widehat{\mathfrak{R}}_S(\mathcal{F}) + \sqrt{\frac{\ln(2/\delta)}{2n}}. \tag{III}$$
Step (v): combine. By the union bound over the two failure events (each $\delta/2$), with probability $1 - \delta$ both (I) and (III) hold simultaneously. Chaining (I) → (II) → (III) and summing the three deviation terms ($1 + 2 = 3$ copies of $\sqrt{\ln(2/\delta)/(2n)}$) gives the stated bound.
∎
Corollary 3 (For binary classification with 0-1 loss).
Let $\mathcal{H}$ be a class of binary classifiers $h : \mathcal{X} \to \{-1, +1\}$ and $\ell$ the 0-1 loss ($\ell(h, (x, y)) = \mathbf{1}\{h(x) \ne y\}$). For any $\delta > 0$, with probability $1 - \delta$, for every $h \in \mathcal{H}$,

$$R(h) \le \widehat{R}_S(h) + \widehat{\mathfrak{R}}_S(\mathcal{H}) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
Proof.
Apply Theorem 3 to the loss class $\ell \circ \mathcal{H}$. By Talagrand’s contraction lemma (Lemma 2) and the $\frac{1}{2}$-Lipschitz form $\ell(h, (x, y)) = \frac{1 - y h(x)}{2}$,

$$\widehat{\mathfrak{R}}_S(\ell \circ \mathcal{H}) \le \tfrac{1}{2} \widehat{\mathfrak{R}}_S(\mathcal{H}).$$
Theorem 3 then gives $R(h) \le \widehat{R}_S(h) + 2 \cdot \tfrac{1}{2} \widehat{\mathfrak{R}}_S(\mathcal{H}) + 3 \sqrt{\ln(2/\delta)/(2n)}$.
∎
When the bound bites — and when it doesn’t
Three sanity checks:
- The bound vanishes as $n \to \infty$. Both $\widehat{\mathfrak{R}}_S(\mathcal{H})$ (for a fixed-complexity class) and the $3\sqrt{\ln(2/\delta)/(2n)}$ term decay as $n$ grows, so the right-hand side of Corollary 3 approaches $\widehat{R}_S(h)$. Good: the bound is consistent.
- The bound is informative when $\widehat{\mathfrak{R}}_S(\mathcal{H})$ is small. For the threshold class on $n = 1000$ samples, $\widehat{\mathfrak{R}}_S(\mathcal{H}) \le \sqrt{2 \ln(n+1)/n} \approx 0.12$ — comfortably below the trivial bound of $1$ (the bound is non-vacuous). At $n = 100$, this is $\approx 0.30$ — already large but still informative.
- The bound is uninformative (“vacuous”) when $\widehat{\mathfrak{R}}_S(\mathcal{H})$ is of order $1$. For a moderately overparameterized neural network, $\widehat{\mathfrak{R}}_S(\mathcal{H})$ approaches $1$ on standard image datasets (Zhang et al. 2017’s signature observation: deep nets can memorize random labels). The classical bound therefore gives $R(h) \le \widehat{R}_S(h) + 1 + \cdots$, which is no better than the trivial bound. §12 makes this vacuousness explicit, and §14 points to PAC-Bayes (coming soon) as the post-classical remedy.
The empirical-Rademacher term is the whole generalization-bound story for classical theory. Sample size $n$, confidence $\delta$, and the loss range enter cleanly and obviously; the hypothesis class enters only through $\widehat{\mathfrak{R}}_S(\mathcal{H})$.
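The sketch below (illustrative; it assumes the Massart bound $\sqrt{2 \ln(n+1)/n}$ for the threshold class, derived in §5 and used again in §12) evaluates the full Corollary 3 deviation term as a function of $n$:

```python
import numpy as np

def corollary3_deviation(n, delta=0.05):
    """Rademacher + confidence terms of Corollary 3 for the threshold class."""
    rad = np.sqrt(2 * np.log(n + 1) / n)          # Massart: n+1 labelings
    conf = 3 * np.sqrt(np.log(2 / delta) / (2 * n))
    return rad + conf

for n in (50, 100, 1000, 10000):
    print(n, round(corollary3_deviation(n), 3))
# The deviation drops below 1 around n in the tens, and below 0.1 only for
# n in the tens of thousands -- the slow sqrt(log n / n) rate in action.
```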
Real-valued function classes
§§5–7 developed Rademacher complexity for $\{-1, +1\}$-valued classifiers. §8 extends to real-valued function classes — regression problems where the predictor and the loss are both continuously valued; margin classifiers where $f(x)$ is a real score and the decision is $\operatorname{sign}(f(x))$; neural networks viewed as real-valued functions before the final activation. The combinatorial machinery (shattering, VC dimension) doesn’t apply directly. The right substitutes are pseudo-dimension / fat-shattering (the real-valued analogs of VC dimension) and covering numbers (the metric-entropy substitute), combined through Dudley’s entropy integral.
Beyond 0-1 losses: margin and regression
Two motivating settings:
Regression. $\mathcal{Y} = \mathbb{R}$ with square loss $\ell(f, (x, y)) = (f(x) - y)^2$. The loss is bounded only if $f$ and $y$ are bounded; the predictor class is real-valued.
Margin classification. $\mathcal{Y} = \{-1, +1\}$ with margin loss $\phi_\gamma(y f(x))$ for some margin parameter $\gamma > 0$. The classifier is $\operatorname{sign}(f(x))$, but the score $f(x)$ — the “confidence” — drives the bound. This is the setting for §10.
Both settings require complexity measures finer than VC dimension.
Pseudo-dimension and fat-shattering
Definition 7 (Pseudo-dimension).
A class $\mathcal{F}$ of real-valued functions on $\mathcal{X}$ pseudo-shatters points $x_1, \dots, x_d$ if there exist witness values $t_1, \dots, t_d \in \mathbb{R}$ such that for every sign pattern $(s_1, \dots, s_d) \in \{-1, +1\}^d$ there exists $f \in \mathcal{F}$ with $\operatorname{sign}(f(x_i) - t_i) = s_i$ for every $i$. The pseudo-dimension $\operatorname{Pdim}(\mathcal{F})$ is the largest such $d$.
Pseudo-dimension is the VC dimension of the subgraph class $\{ (x, t) \mapsto \mathbf{1}\{f(x) > t\} : f \in \mathcal{F} \}$ in $\mathcal{X} \times \mathbb{R}$. For linear regression in $\mathbb{R}^d$ (with intercept), $\operatorname{Pdim} = d + 1$. For neural networks, $\operatorname{Pdim}$ depends on the architecture and is polynomial in the parameter count.
Definition 8 (Fat-shattering dimension).
For $\gamma > 0$, $\mathcal{F}$ $\gamma$-fat-shatters a set $\{x_1, \dots, x_d\}$ with witness values $t_1, \dots, t_d$ if for every sign pattern $(s_1, \dots, s_d) \in \{-1, +1\}^d$ there exists $f \in \mathcal{F}$ such that $f(x_i) \ge t_i + \gamma$ when $s_i = +1$ and $f(x_i) \le t_i - \gamma$ when $s_i = -1$. The fat-shattering dimension $\operatorname{fat}_\gamma(\mathcal{F})$ at scale $\gamma$ is the largest $d$ that is $\gamma$-fat-shattered.
Fat-shattering is the right notion for margin classifiers: at margin $\gamma$, only the separability above margin matters, not infinitesimal differences. The dimension typically decreases as $\gamma$ increases (a larger margin is harder to achieve, so fewer points can be fat-shattered).
Covering numbers and Dudley’s entropy integral
For a class $\mathcal{F}$ of functions on $\mathcal{Z}$, a pseudo-metric $\rho$ on $\mathcal{F}$, and $\varepsilon > 0$, the covering number $N(\varepsilon, \mathcal{F}, \rho)$ is the smallest cardinality of an $\varepsilon$-net — a set $\mathcal{F}_\varepsilon$ such that every $f \in \mathcal{F}$ has some $g \in \mathcal{F}_\varepsilon$ with $\rho(f, g) \le \varepsilon$. The metric entropy is $\ln N(\varepsilon, \mathcal{F}, \rho)$. For a sample $S = (z_1, \dots, z_n)$, the canonical metric is the empirical $L_2$ metric: $\rho_S(f, g) = \big( \frac{1}{n} \sum_{i=1}^n (f(z_i) - g(z_i))^2 \big)^{1/2}$.
Theorem 4 (Dudley's entropy integral).
For any class $\mathcal{F}$ of functions and any sample $S$ of size $n$,

$$\widehat{\mathfrak{R}}_S(\mathcal{F}) \le 12 \int_0^{D} \sqrt{\frac{\ln N(\varepsilon, \mathcal{F}, \rho_S)}{n}} \, d\varepsilon,$$
where $D$ is the $\rho_S$-diameter of $\mathcal{F}$ on $S$.
Proof.
The full proof goes by chaining: construct $\varepsilon_k$-nets at dyadic scales $\varepsilon_k = 2^{-k} D$, and telescope each $f$ as $f = \pi_0(f) + \sum_k \big( \pi_{k+1}(f) - \pi_k(f) \big)$, where $\pi_k(f)$ is the nearest net point at scale $\varepsilon_k$. Each telescoping increment is bounded in $\rho_S$ by a multiple of $\varepsilon_k$, so by Massart’s lemma applied to the finite set of differences, the Rademacher complexity of the increments is geometric in $k$ and the sum converges to Dudley’s integral. The constant is sharp up to small factors. Full argument: Wainwright (2019) Theorem 5.22.
∎
For VC-subgraph classes (finite pseudo-dimension $d$), the covering number satisfies $N(\varepsilon, \mathcal{F}, \rho_S) \le C (1/\varepsilon)^{O(d)}$ for some constant $C$, and Dudley’s integral evaluates to $O(\sqrt{d/n})$ — the parametric rate. This recovers and generalizes the VC bound to real-valued classes.
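As a numerical sanity check on that claim, the sketch below (illustrative; it assumes the covering model $\ln N(\varepsilon) = d \ln(C/\varepsilon)$ with hypothetical constants) evaluates the entropy integral by the trapezoid rule and compares it with $\sqrt{d/n}$:

```python
import numpy as np

def dudley_bound(d, n, C=2.0, D=1.0, grid=10000):
    """Evaluate 12 * integral_0^D sqrt(ln N(eps) / n) d(eps) by the trapezoid
    rule, under the covering model ln N(eps) = d * ln(C / eps)."""
    eps = np.linspace(1e-6, D, grid)
    vals = np.sqrt(np.maximum(d * np.log(C / eps), 0.0) / n)
    return 12 * np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(eps))

for d in (1, 5, 20):
    for n in (100, 10000):
        print(d, n, round(dudley_bound(d, n), 3), round(np.sqrt(d / n), 3))
# The integral tracks sqrt(d/n) up to a constant factor: the parametric rate.
```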
Bracketing vs uniform covering
Two notions of covering co-exist in the empirical-processes literature:
- Uniform covering $N(\varepsilon, \mathcal{F}, \rho_S)$: $\varepsilon$-net under the sample metric. What shows up in Rademacher / Dudley bounds.
- Bracketing covering $N_{[\,]}(\varepsilon, \mathcal{F}, L_2(\mathcal{D}))$: $\varepsilon$-net of “brackets” $[l, u]$ with $\|u - l\| \le \varepsilon$, such that every $f \in \mathcal{F}$ lies in some bracket ($l \le f \le u$ pointwise). Bracketing controls the empirical process directly through the supremum over bracket-pairs.
Bracketing is strictly stronger than uniform covering (every $\varepsilon$-bracket-net yields an $\varepsilon$-uniform-covering net, but not conversely). Bracketing is more natural for non-Donsker classes where uniform covering doesn’t control the empirical process. For our chapter we mostly use uniform covering, which suffices for VC-subgraph and parametric classes; bracketing matters in more delicate settings.
Data-dependent bounds
The canonical bound of §7 has rate $1/\sqrt{n}$ — uniformly across the entire class $\mathcal{H}$. But the ERM is not just any $h$: it has small empirical risk, hence small variance (in the sense that $\ell(\hat{h}, z)$ has small variance under $\mathcal{D}$). For such low-variance hypotheses, the bound can be sharpened from the slow $1/\sqrt{n}$ to the fast rate $1/n$. This is the local Rademacher complexity technology of Bartlett, Bousquet, and Mendelson (2005). The price is a Bernstein-type variance condition that ties the loss variance to its mean; under that condition, the local Rademacher complexity — restricted to the variance-small subclass — replaces the global one in the bound, and a fixed-point characterization gives the fast-rate scaling.
Why classical bounds are slow
Theorem 3 gives $P f \le P_n f + 2 \widehat{\mathfrak{R}}_S(\mathcal{F}) + 3\sqrt{\ln(2/\delta)/(2n)}$. Both added terms decay as $1/\sqrt{n}$. This is the slow rate of distribution-free generalization: no matter how small the empirical risk, the bound’s deviation term is $\Omega(1/\sqrt{n})$.
For realizable problems (the Bayes classifier is in $\mathcal{H}$, label noise zero), $\widehat{R}_S(\hat{h}) = 0$ and $R(\hat{h}) \to 0$ as $n \to \infty$. The classical bound says $R(\hat{h}) = O(1/\sqrt{n})$ — informative but loose; the actual rate for realizable PAC learning is $O(1/n)$ up to logs (covered in pac-learning’s realizable section). The slow-rate gap is the price the agnostic / worst-case analysis pays for not exploiting low-noise structure.
Local Rademacher complexity
The fix: replace the global Rademacher complexity with a local version computed over the low-variance subclass.
Definition 9 (Local Rademacher complexity).
For a class $\mathcal{F}$ and a level $r > 0$,

$$\widehat{\mathfrak{R}}_S(\mathcal{F}; r) = \mathbb{E}_\sigma \left[ \sup_{\substack{f \in \mathcal{F} \\ P_n f^2 \le r}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(z_i) \right],$$
where the supremum is over functions with bounded empirical second moment $P_n f^2 \le r$.
The local Rademacher restricts attention to a small ball in $\mathcal{F}$ (functions with small second moment); functions far away (large variance) are excluded. The empirical risk of the ERM is small, so the ERM lies in this small ball — the local complexity is what controls its gap.
A Bernstein-type variance condition
For the local Rademacher to give a meaningful bound, we need a relation between the variance of $f$ (the radius $r$) and its mean (the risk). The cleanest such condition:
Bernstein condition. A class $\mathcal{F}$ satisfies the Bernstein condition with parameter $\beta \in [0, 1]$ if there exists $B < \infty$ such that for every $f \in \mathcal{F}$,

$$P f^2 \le B \, (P f)^{\beta}.$$
The condition couples the second moment to the first. The two extremes:
- $\beta = 0$: $P f^2 \le B$ — only the boundedness of $f$ is used, no link to $P f$. This gives the classical slow rate.
- $\beta = 1$: $P f^2 \le B \, P f$ — Massart’s case. Holds for 0-1 losses with Tsybakov’s low-noise condition (the margin separates classes cleanly): the regression function $\eta(x) = \Pr(y = 1 \mid x)$ moves away from $1/2$ rapidly near the Bayes boundary.
For $\beta \in (0, 1)$, the Bernstein condition interpolates: the fast rate becomes $n^{-1/(2 - \beta)}$, which is $n^{-1/2}$ for $\beta = 0$ and $n^{-1}$ for $\beta = 1$. Tsybakov noise conditions give specific values of $\beta$.
Fast rates via the fixed-point theorem
Theorem 5 (Local Rademacher fast-rate bound (Bartlett–Bousquet–Mendelson 2005)).
Suppose $\mathcal{F}$ satisfies the Bernstein condition with parameter $\beta = 1$ (the canonical Massart case), and let $r^*$ be the largest solution of the fixed-point equation

$$r = c_1 \, \mathfrak{R}_n(\mathcal{F}; r) + \frac{c_2 \ln(1/\delta)}{n}$$
for universal constants $c_1, c_2$. Then with probability $1 - \delta$,

$$R(\hat{h}) - R(h^*_{\mathcal{H}}) \le C \left( r^* + \frac{\ln(1/\delta)}{n} \right)$$

for a universal constant $C$.
Proof.
The proof is technical and runs many pages; the full argument is BBM 2005 §3. The picture to keep is the fixed point: the local Rademacher complexity shrinks as $r$ shrinks (smaller class, smaller Rademacher), so the equation has a unique fixed point $r^*$ that scales as $1/n$ rather than $1/\sqrt{n}$ — and that fixed point is the excess-risk bound.
∎
Margin bounds
A linear classifier $h(x) = \operatorname{sign}(w \cdot x)$ gives a discrete output, but the score $w \cdot x$ has continuous information: its magnitude is the confidence, and points far from the boundary (large $|w \cdot x|$ with the correct sign) are “easy.” A margin bound exploits this confidence: classifiers that not only get the right answer but get it with margin $\gamma$ generalize better than ones that just barely separate the data. The result is the foundation of kernel methods (SVM), boosting margin theory, and the modern norm-based generalization bounds for deep nets.
From 0-1 loss to margin / ramp loss
Define the margin of $f$ on $(x, y)$ as $m = y f(x)$ (positive when $f$ classifies correctly; magnitude is the confidence). For a margin parameter $\gamma > 0$, the ramp loss (also: margin loss; a clipped relative of the hinge loss) is

$$\phi_\gamma(m) = \begin{cases} 1 & m \le 0, \\ 1 - m/\gamma & 0 < m < \gamma, \\ 0 & m \ge \gamma. \end{cases}$$
Equivalently $\phi_\gamma(m) = \min\big( 1, \max(0, 1 - m/\gamma) \big)$. Key properties: $\phi_\gamma$ is $\frac{1}{\gamma}$-Lipschitz, $\phi_\gamma \in [0, 1]$, $\mathbf{1}\{m \le 0\} \le \phi_\gamma(m)$ (ramp loss upper-bounds 0-1 loss), and $\phi_\gamma(m) \le \mathbf{1}\{m \le \gamma\}$ (ramp loss lower-bounds the “margin-violation” indicator).
Definition 10 (Margin / margin loss).
The margin of $f$ on $(x, y)$ is $y f(x)$. The empirical margin loss at margin $\gamma$ is

$$\widehat{R}_S^{\gamma}(f) = \frac{1}{n} \sum_{i=1}^n \phi_\gamma\big( y_i f(x_i) \big).$$
Two boundary cases: $\lim_{\gamma \to 0} \widehat{R}_S^{\gamma}(f)$ recovers the empirical 0-1 risk, and $\widehat{R}_S^{\gamma}(f)$ is increasing in $\gamma$ — a larger margin demand more readily counts borderline classifications as misclassified.
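A minimal implementation of the ramp loss and the empirical margin loss (an illustrative sketch; the function names are ours), including a check of the sandwich property from above:

```python
import numpy as np

def ramp_loss(margins, gamma):
    """phi_gamma(m): 1 for m <= 0, 1 - m/gamma on (0, gamma), 0 for m >= gamma."""
    return np.clip(1.0 - margins / gamma, 0.0, 1.0)

def empirical_margin_loss(scores, y, gamma):
    """(1/n) sum_i phi_gamma(y_i f(x_i)); y in {-1,+1}, scores real-valued."""
    return ramp_loss(y * scores, gamma).mean()

# Sandwich check on random scores: 0-1 loss <= ramp loss <= violation rate.
rng = np.random.default_rng(2)
y = rng.choice([-1, 1], size=1000)
scores = rng.normal(size=1000)
m, gamma = y * scores, 0.5
assert (m <= 0).mean() <= empirical_margin_loss(scores, y, gamma) <= (m <= gamma).mean()
```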
SVM margin bound (kernel methods)
Consider the kernel-method class $\mathcal{F}_B = \{ x \mapsto \langle w, \Phi(x) \rangle : \|w\| \le B \}$ for a feature map $\Phi$ into a reproducing kernel Hilbert space with kernel $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$. Two key Rademacher facts (proofs in Mohri et al. 2018 Ch. 5):
- $\widehat{\mathfrak{R}}_S(\mathcal{F}_B) \le \frac{B \sqrt{\operatorname{tr}(K)}}{n}$, where $K_{ij} = k(x_i, x_j)$ is the kernel Gram matrix on $S$.
- By Talagrand contraction (Lemma 2) applied to the $\frac{1}{\gamma}$-Lipschitz $\phi_\gamma$: $\widehat{\mathfrak{R}}_S(\phi_\gamma \circ \mathcal{F}_B) \le \frac{1}{\gamma} \widehat{\mathfrak{R}}_S(\mathcal{F}_B)$.
Theorem 6 (Margin-based generalization bound for kernel classifiers).
Let $\mathcal{F}_B = \{ x \mapsto \langle w, \Phi(x) \rangle : \|w\| \le B \}$ with kernel $k$, and let $\gamma > 0$. For any $\delta > 0$, with probability $1 - \delta$ over $S$, for every $f \in \mathcal{F}_B$,

$$\Pr_{(x, y) \sim \mathcal{D}}\big( y f(x) \le 0 \big) \le \widehat{R}_S^{\gamma}(f) + \frac{2 B \sqrt{\operatorname{tr}(K)}}{\gamma n} + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
Proof.
Apply Theorem 3 to the loss class $\phi_\gamma \circ \mathcal{F}_B$ (range $[0, 1]$). The Rademacher term from Theorem 3 is $2 \widehat{\mathfrak{R}}_S(\phi_\gamma \circ \mathcal{F}_B) \le \frac{2 B \sqrt{\operatorname{tr}(K)}}{\gamma n}$ via the two facts above. The 0-1 risk is at most the $\phi_\gamma$-risk because $\phi_\gamma$ upper-bounds the 0-1 loss; Theorem 3 bounds the $\phi_\gamma$-risk by the displayed expression.
∎
The $\gamma$ in the denominator makes the bound tight for high-margin classifiers: doubling the margin halves the complexity term. For SVMs, $B$ corresponds to the norm of the weight vector and $\gamma$ to the margin width — both standard quantities in the SVM training objective.
Linear classifiers and the norm-based bound
The kernel-method bound specializes to plain linear classifiers when $\Phi(x) = x$ (identity feature map) and $\|x_i\| \le r$. Then $k(x, x') = x \cdot x'$ and $\operatorname{tr}(K) \le n r^2$. The bound becomes

$$\Pr\big( y \, (w \cdot x) \le 0 \big) \le \widehat{R}_S^{\gamma}(w) + \frac{2 B r}{\gamma \sqrt{n}} + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
Two observations:
- The complexity term scales as $\frac{B r}{\gamma \sqrt{n}}$ — not as $\sqrt{d/n}$ or any explicit dimension. This is dimension-free; it depends only on the norm of the weight vector and the margin. Linear classifiers in million-dimensional feature spaces still get a meaningful bound, as long as the norm of the optimal weight vector is moderate.
- SVM training minimizes $\|w\|$ subject to a margin constraint — this is exactly the quantity that controls the generalization bound. SVMs aren’t just “good classifiers”; their training objective directly optimizes the bound (a numeric sketch follows this list).
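To see the dimension-free scaling numerically, a small sketch (illustrative; parameter values are hypothetical) that evaluates the deviation terms of the displayed linear margin bound:

```python
import numpy as np

def linear_margin_bound_terms(B, r, gamma, n, delta=0.05):
    """Complexity + confidence terms of the linear margin bound:
    2*B*r/(gamma*sqrt(n)) + 3*sqrt(ln(2/delta)/(2n)). Dimension d never enters."""
    complexity = 2 * B * r / (gamma * np.sqrt(n))
    confidence = 3 * np.sqrt(np.log(2 / delta) / (2 * n))
    return complexity + confidence

# Doubling the margin gamma halves the complexity term, whatever d is.
for gamma in (0.5, 1.0, 2.0):
    print(gamma, round(linear_margin_bound_terms(B=1.0, r=1.0, gamma=gamma, n=10000), 4))
```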
Margin bounds for neural networks: a preview
Margin bounds for neural networks have an analogous form but with $B$ replaced by a product of layer norms:

$$R(f) \;\lesssim\; \widehat{R}_S^{\gamma}(f) + \frac{r \prod_{l=1}^{L} \|W_l\|}{\gamma \sqrt{n}} \cdot (\text{polylog factors})$$
(Bartlett 1998, Neyshabur–Bhojanapalli–McAllester–Srebro 2017, Bartlett–Foster–Telgarsky 2017). The picture: a deep network is a composition of Lipschitz maps; Talagrand contraction propagates margin through the layers, multiplying Lipschitz constants. For a network with $L$ layers and per-layer norm $c$, the product is $c^L$ and the bound is informative when $c$ is near $1$; for networks with large per-layer norms (typical of unregularized training), the product blows up. §12 shows this empirically — the margin bound for a 2-layer MLP trained without explicit norm control comes out larger than $1$, vacuous. Modern post-classical bounds (PAC-Bayes, PAC-Bayes Bounds (coming soon)) use flatness of the loss landscape instead of layer norms, recovering non-vacuous bounds for some deep nets.
Algorithmic stability bounds
The whole §§5–10 program routes generalization through hypothesis-class complexity. But generalization doesn’t have to go that way. A different route — algorithmic stability — controls generalization through a property of the learning algorithm: how much does the algorithm’s output change if we perturb one training example? If the answer is “not much,” the empirical risk is close to the population risk, and we get a generalization bound that doesn’t mention hypothesis-class complexity at all. This is the Bousquet–Elisseeff (2002) framework: regularization implies stability, and stability implies generalization.
Uniform $\beta$-stability defined
A learning algorithm $\mathcal{A}$ is a map from samples to hypotheses. For $i \in \{1, \dots, n\}$, let $S^{(i)}$ denote $S$ with the $i$-th sample replaced by an iid copy $z'_i$.
Definition 11 (Uniform β-stability).
$\mathcal{A}$ is uniformly $\beta$-stable with respect to loss $\ell$ if for every $S$, every $i$, and every $z \in \mathcal{Z}$,

$$\big| \ell(\mathcal{A}(S), z) - \ell(\mathcal{A}(S^{(i)}), z) \big| \le \beta.$$
The supremum is over all test points $z$, not just the sample — uniform stability is a strong condition. $\beta$ typically decreases with $n$ (more data, less sensitivity to any one example).
Regularization implies stability (ridge regression)
Example 2 (Ridge regression is β-stable).
Consider ridge regression

$$\mathcal{A}(S) = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2.$$
The closed-form solution is $\hat{w} = \big( \tfrac{1}{n} X^\top X + \lambda I \big)^{-1} \tfrac{1}{n} X^\top y$. Suppose $\|x_i\| \le r$ for all $i$ and $|y| \le M$. The stability constant is $\beta = O\!\left( \frac{r^2 M^2}{\lambda n} \right)$.
Three observations:
- $\beta$ decreases linearly in $n$ — more data, more stability.
- $\beta$ scales as $1/\lambda$ — stronger regularization, more stability.
- $\beta$ does not depend on $d$ — dimension-free, even in the under-determined case $d > n$.
The proof goes through the closed-form and a perturbation argument: replacing one $(x_i, y_i)$ changes the inverse by a rank-one outer-product update, and the spectral norm of that update is bounded by $O(r^2 / (\lambda n))$ (full derivation: Bousquet–Elisseeff 2002 §3).
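The rank-one-perturbation picture can be checked empirically. The sketch below (illustrative; synthetic Gaussian data and helper names of our own) refits ridge after replacing one training point and measures the worst loss change over held-out points:

```python
import numpy as np

rng = np.random.default_rng(3)

def ridge_fit(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def stability_beta(n, d, lam, trials=50):
    """Empirical sup_z |loss(A(S), z) - loss(A(S^i), z)| over a test set."""
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)
    Xt, yt = rng.normal(size=(500, d)), rng.normal(size=500)
    w = ridge_fit(X, y, lam)
    worst = 0.0
    for _ in range(trials):
        i = rng.integers(n)
        X2, y2 = X.copy(), y.copy()
        X2[i], y2[i] = rng.normal(size=d), rng.normal()   # replace z_i
        w2 = ridge_fit(X2, y2, lam)
        worst = max(worst, np.abs((Xt @ w - yt)**2 - (Xt @ w2 - yt)**2).max())
    return worst

for lam in (0.1, 1.0):
    for n in (100, 1000):
        print(f"lambda={lam}, n={n}: beta ~ {stability_beta(n, 20, lam):.4f}")
# beta shrinks roughly like 1/(lambda * n), matching the Bousquet-Elisseeff rate.
```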
Stability implies generalization (Bousquet–Elisseeff)
Theorem 7 (Bousquet–Elisseeff stability bound).
If $\mathcal{A}$ is uniformly $\beta$-stable with respect to a loss bounded in $[0, 1]$, then for any $\delta > 0$, with probability $1 - \delta$ over the iid sample $S$,

$$R(\mathcal{A}(S)) \le \widehat{R}_S(\mathcal{A}(S)) + 2\beta + (4 n \beta + 1) \sqrt{\frac{\ln(1/\delta)}{2n}}.$$
Proof.
Let $\Phi(S) = R(\mathcal{A}(S)) - \widehat{R}_S(\mathcal{A}(S))$. The proof verifies three properties:
- $\mathbb{E}_S[\Phi(S)] \le 2\beta$: a stability-based version of symmetrization (the ghost-sample trick) gives the expected gap directly.
- $\Phi$ has bounded differences in the McDiarmid sense: replacing $z_i$ with $z'_i$ changes $\Phi$ by at most $4\beta + 1/n$ (the $4\beta$ comes from stability, the $1/n$ from the single-coordinate change of $\widehat{R}_S$).
- McDiarmid concentration: $\Pr\big( \Phi \ge \mathbb{E}[\Phi] + t \big) \le \exp\!\big( -2 t^2 / (n (4\beta + 1/n)^2) \big)$, giving with probability $1 - \delta$,

$$\Phi(S) \le \mathbb{E}[\Phi(S)] + (4 n \beta + 1) \sqrt{\frac{\ln(1/\delta)}{2n}}.$$
Combining with $\mathbb{E}[\Phi(S)] \le 2\beta$ gives the theorem. Full proof: Bousquet–Elisseeff 2002 §4.
∎
For ridge regression with $\beta = O\!\big( \frac{r^2 M^2}{\lambda n} \big)$, Theorem 7 yields

$$R(\hat{w}) \le \widehat{R}_S(\hat{w}) + O\!\left( \frac{r^2 M^2}{\lambda n} \right) + O\!\left( \frac{r^2 M^2}{\lambda} \right) \sqrt{\frac{\ln(1/\delta)}{n}}.$$
The $1/\lambda$ in the deviation term is the regularization-stability cost; choosing $\lambda$ to balance this against the empirical risk’s dependence on $\lambda$ gives an optimized excess-risk bound.
Why this route bypasses class complexity
Theorem 7 contains no $\ln |\mathcal{H}|$, no $\widehat{\mathfrak{R}}_S(\mathcal{H})$, no covering number. The hypothesis class doesn’t appear at all — only $\beta$ does. This has three consequences:
- The bound is algorithm-specific. Two different algorithms applied to the same class produce different bounds. Ridge regression with $\lambda > 0$ has $\beta = O(1/(\lambda n))$ and gets the fast rate; the ERM over the same class (unregularized least squares) is not uniformly stable and gets only the slow Rademacher bound. The choice of algorithm matters.
- The bound is dimension-free for ridge. For under-determined least squares ($d > n$), the ERM over $\mathbb{R}^d$ has unbounded complexity — the Rademacher bound is useless. But ridge with $\lambda > 0$ has $\beta = O(1/(\lambda n))$ — finite, useful, dimension-free. Regularization buys generalization through stability, not through restricting $\mathcal{H}$.
- The bound applies to randomized / SGD algorithms. Define stability over the randomness of the algorithm too; SGD has approximate stability properties (Hardt–Recht–Singer 2016) that give generalization bounds for deep nets — non-trivial in regimes where Rademacher bounds are vacuous.
The stability framework is one of the post-classical paths to non-vacuous deep-net bounds.
Worked example: empirical Rademacher and vacuousness
We have built seven bound recipes. §12 puts them on two examples — one where they bite tightly, one where they go vacuous — and lets the reader see the classical theory’s reach and its honest limits in the same picture.
Threshold classifier — Rademacher complexity in closed form
Recap from §5: on a sample of $n$ uniform points, the threshold class produces exactly $n + 1$ distinct labelings. Massart’s lemma gives $\widehat{\mathfrak{R}}_S(\mathcal{H}) \le \sqrt{2 \ln(n+1)/n}$, matching the true rate to a constant factor. Corollary 3 plugs this into the canonical bound for any $n$ and $\delta$: the bound is non-vacuous (under $1$) for $n$ beyond a few dozen, and tightens rapidly with $n$.
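Because the class realizes exactly $n + 1$ prediction vectors on the sample, the supremum over the class is exact and only the expectation over $\sigma$ needs Monte Carlo. A minimal sketch (illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 200, 20000

x = np.sort(rng.uniform(0, 1, n))
# The n+1 distinct prediction vectors of h_theta(x) = 1{x >= theta}, in {-1,+1}:
# row k predicts -1 on the first k (sorted) points and +1 on the rest.
H = np.ones((n + 1, n))
for k in range(n + 1):
    H[k, :k] = -1

sigma = rng.choice([-1, 1], size=(m, n))
rad_hat = (H @ sigma.T / n).max(axis=0).mean()   # exact sup, MC over sigma
print(f"empirical Rademacher ~ {rad_hat:.3f}, "
      f"Massart bound {np.sqrt(2 * np.log(n + 1) / n):.3f}")
```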
Gap-vs-$n$ scaling vs the theoretical envelope
For the threshold class with label noise, the empirical worst-case gap matches the Corollary 3 envelope across $n$ — §7 already showed this. The classical theory works as advertised in the tabular small-class regime: bounds are informative, the rate is correct, and the constant slack is a small multiple.
A 2-layer MLP on binary MNIST: classical bounds become vacuous
We now apply the same machinery to a modest neural network on a real dataset and observe that the classical bound exceeds $1$ — vacuous — despite the model having trivial generalization gap. The setup: subsample $n$ examples from the MNIST digits 0 and 1 (a binary classification task; the easiest non-trivial vision problem). Train a 2-layer ReLU MLP with hidden width $k$. Evaluate the generalization gap by held-out test accuracy. Compute the classical norm-based Rademacher upper bound and the Corollary 3 envelope.
For a 2-layer MLP $f(x) = W_2 \, \mathrm{ReLU}(W_1 x)$ with weight matrices $W_1$ and $W_2$, the layer-norm-based Rademacher upper bound (Bartlett 1998 / Neyshabur et al. 2017) is of the form

$$\widehat{\mathfrak{R}}_S(\mathcal{F}) \;\lesssim\; \frac{r \, \|W_1\| \, \|W_2\|}{\sqrt{n}},$$
where $r$ is the input radius and $\|\cdot\|$ is the spectral / operator norm (Frobenius-norm variants exist). For MNIST with pixels normalized to $[0, 1]$, $r \le \sqrt{784} = 28$. After training, the layer norms produce a classical Rademacher upper bound that comfortably exceeds the trivial range of the loss — vacuous. Meanwhile the actual generalization gap (test error minus train error) is tiny — on the order of a percent or less. Classical theory says nothing useful about why this network generalizes.
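A sketch of the vacuousness arithmetic (illustrative; random weights stand in for trained ones, the margin scale is an assumption, and we use the Frobenius-norm variant of the bound):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, width = 2000, 784, 512

# Stand-ins for trained weights; trained norms are typically this order or larger.
W1 = rng.normal(0, 0.05, size=(width, d))
W2 = rng.normal(0, 0.05, size=(1, width))

r = np.sqrt(d)                     # input radius for pixels in [0, 1]
gamma = 1.0                        # assumed margin scale (hypothetical)
norm_product = np.linalg.norm(W1) * np.linalg.norm(W2)   # Frobenius norms

bound = r * norm_product / (gamma * np.sqrt(n))
print(f"norm-based Rademacher bound ~ {bound:.1f}  (trivial bound: 1.0)")
```

Even with small random weights the bound lands far above $1$; the point survives any reasonable choice of the stand-in constants.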
What the failure tells us
Three honest takeaways:
- The bound isn’t wrong. Corollary 3 holds with high probability for any $\mathcal{H}$, including 2-layer MLPs. The MLP could, in principle, fit random labels — Zhang et al. 2017 demonstrated this empirically — and a classical bound has to cover that worst case. So Rademacher complexity for typical neural network classes really is close to $1$, and the bound really is vacuous.
- The classical bound charges for the whole class uniformly. Rademacher complexity is a property of the entire class $\mathcal{H}$. It does not care that the specific trained network is in a “good neighborhood” of weight space where networks all generalize. Data-dependent and algorithm-dependent refinements — PAC-Bayes (Dziugaite & Roy 2017), the implicit bias of SGD (Neyshabur et al. 2017), and compression-based bounds (Arora et al. 2018) — locate the trained network in a tighter implicit hypothesis class and recover non-vacuous bounds for some deep nets.
- Stability is an alternative route. Stability bounds (§11) for SGD on smooth losses (Hardt–Recht–Singer 2016) give non-vacuous bounds without going through $\mathcal{H}$-complexity at all. The trade-off: stability bounds work only for certain optimization schedules and don’t yet match the empirical accuracy of practical deep nets.
The vacuousness puzzle is the chapter’s central honest reckoning. Classical generalization theory delivers crisp bounds for tabular and kernel methods; it falls silent for modern deep nets. The post-classical bounds are an active research area, surveyed in §14.
Computational notes
Rademacher Monte Carlo: cost and structure
The naive Monte Carlo Rademacher estimator costs $O(m \cdot |\mathcal{H}| \cdot n)$ time when $\mathcal{H}$ is finite (with $m$ Rademacher draws), more when the supremum requires optimization. For “nice” classes (linear, kernel) the supremum is a quadratic program in the Rademacher labels; for thresholds / intervals it has a closed-form solution per Rademacher draw. Variance of the MC estimator decreases as $1/m$, so $m \approx 10^4$ gives 2-decimal-place accuracy. For high-dimensional or NN classes, the supremum can’t be optimized exactly; upper bounds via Massart, Dudley, or layer-norm propagation are the practical alternatives.
Class-evaluation matrices and structured shortcuts
For finite or finitely-enumerable classes, build the class-evaluation matrix $H \in \{-1, +1\}^{|\mathcal{H}| \times n}$ where $H_{jk} = h_j(x_k)$. Empirical Rademacher reduces to a single matrix multiplication followed by row-wise max and average over draws. NumPy broadcasting handles this in vectorized form. For real-valued classes use $H_{jk} = f_j(x_k)$. For continuous classes (linear, kernel) replace the enumeration by the minimal class that realizes all distinct labelings on $S$ — for thresholds, that’s $n + 1$ rows; for half-planes in $\mathbb{R}^d$, $O(n^d)$ rows by the Sauer–Shelah lemma; for sufficiently rich classes, $2^n$.
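The matrix recipe in NumPy, as a minimal sketch of the description above (the function name is ours):

```python
import numpy as np

def empirical_rademacher(H, m=10000, seed=0):
    """Monte Carlo empirical Rademacher complexity from a class-evaluation
    matrix H (rows: hypotheses, columns: sample points, entries f_j(x_k))."""
    rng = np.random.default_rng(seed)
    n = H.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n, m))   # one column per draw
    # (H @ sigma) / n has shape (|H|, m); sup over rows, average over draws.
    return ((H @ sigma) / n).max(axis=0).mean()
```

For the threshold class, $H$ is the $(n+1) \times n$ matrix of §12; for half-planes, enumerate the polynomially many realizable labelings.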
Pitfalls
Three common errors: (i) confusing empirical and population Rademacher — the Corollary 3 bound uses the empirical one; (ii) forgetting to standardize outputs to $\{-1, +1\}$ for the Rademacher inner product — with $\{0, 1\}$-valued outputs the sum shifts by a class-independent term that cancels in expectation, but the resulting complexity comes out rescaled by $1/2$; (iii) computing covering numbers in the population metric when the bound uses the empirical metric (Dudley with $\rho_S$, not $L_2(\mathcal{D})$). For high-dimensional Monte Carlo, prefer antithetic Rademacher pairs $(\sigma, -\sigma)$ for variance reduction.
Connections and limits
§14 is the chapter’s closing reckoning.
PAC-Bayes bounds (coming soon)
The PAC-Bayesian framework (McAllester 1999, Catoni 2007, Maurer 2004) does not bound the gap of a single hypothesis; it bounds the expected gap under a posterior distribution $Q$ over hypotheses. The bound has the form

$$\mathbb{E}_{h \sim Q}\big[ R(h) \big] \le \mathbb{E}_{h \sim Q}\big[ \widehat{R}_S(h) \big] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}},$$
where $P$ is a prior fixed before seeing the data. The bound is tighter than Rademacher precisely when $Q$ concentrates on hypotheses with small empirical risk and small KL to the prior. For deep networks, taking $Q$ as a Gaussian posterior around the SGD output produces non-vacuous bounds (Dziugaite & Roy 2017), giving the first rigorous bounds that track the empirical generalization of trained networks.
Remark.
PAC-Bayes is the standard route to non-vacuous deep-net bounds and is the natural extension of generalization theory for modern overparameterized models. The forthcoming PAC-Bayes Bounds (coming soon) topic develops it in detail.
VC dimension
The Vapnik–Chervonenkis dimension is the combinatorial complexity measure for $\{0, 1\}$-valued classes — the largest $d$ such that some $d$-point set is shattered by $\mathcal{H}$ (every labeling is realizable). VC dimension underlies the Sauer–Shelah polynomial growth lemma, the Fundamental Theorem of Statistical Learning, and the realizable PAC bounds. For Rademacher-flavored bounds, the standard chain is: finite VC dimension → polynomial growth function → polynomial covering numbers → Dudley integral evaluates to $O(\sqrt{d/n})$. The VC Dimension topic develops all this in detail and complements generalization-bounds from the combinatorial direction (this topic develops it from the Rademacher direction).
The deep-network puzzle (Zhang et al. 2017)
Remark.
Zhang et al. (2017) “Understanding deep learning requires rethinking generalization” demonstrated that standard convolutional networks can perfectly fit random labels on CIFAR-10 — meaning $\widehat{\mathfrak{R}}_S(\mathcal{H}) \approx 1$ and the classical Rademacher bound is identically vacuous. Yet the same networks trained on real labels generalize well. This is the central empirical observation that classical theory cannot explain. Active responses include: (a) implicit-bias arguments — SGD prefers low-norm or “flat” solutions out of the infinitely many ERMs (Neyshabur et al. 2017, Soudry et al. 2018); (b) compression-based bounds (Arora et al. 2018); (c) PAC-Bayes — the SGD posterior has small KL to a reasonable prior; (d) the neural tangent kernel regime — overparameterized nets approximate kernel methods with margin bounds that are meaningful (Jacot et al. 2018). None has fully solved the puzzle. Tractable, computable non-vacuous deep-net bounds remain an active research target.
Implicit bias, double descent, and beyond
The classical regime has clean theory; the modern regime has surprising empirical phenomena like double descent (Belkin et al. 2019): generalization error first increases (as model size approaches interpolation threshold) then decreases again as the model grows further. Classical Rademacher and VC bounds predict monotone risk in model size, contradicting the empirical curve. Resolution requires data-dependent bounds (§9) that depend on which solution the algorithm finds — the implicit bias of SGD toward minimum-norm interpolators (Bartlett et al. 2020; benign overfitting literature). The post-classical landscape is still being mapped; this chapter ends with the classical theory’s honest limit and the post-classical roadmap as the open frontier.
Connections
- Direct prerequisite — ERM, the Fundamental Theorem of Statistical Learning, and the boundary marker for Rademacher complexity. pac-learning introduces Rademacher in one shallow section; this topic goes deep into the symmetrization argument, the canonical bound, fast rates, margin bounds, and stability. pac-learning
- Direct prerequisite — Hoeffding's inequality drives the §3 finite-class warmup; McDiarmid's bounded-differences inequality is the concentration engine that powers the §7 canonical uniform-convergence bound and the §11 Bousquet–Elisseeff stability bound. concentration-inequalities
References & Further Reading
- paper On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities — Vapnik & Chervonenkis (1971) The foundational paper of statistical learning theory; the uniform-convergence framework that underwrites §§4–7 originates here (Theory of Probability & Its Applications).
- paper Rademacher and Gaussian Complexities: Risk Bounds and Structural Results — Bartlett & Mendelson (2002) The reference modernization of Rademacher complexity as the standard distribution-free complexity measure; §§5–7 follow this paper's structure (JMLR).
- paper Rademacher Penalties and Structural Risk Minimization — Koltchinskii (2001) Independent contemporaneous development of the Rademacher-penalty framework (IEEE TIT).
- paper Local Rademacher Complexities — Bartlett, Bousquet & Mendelson (2005) The fast-rate framework via local Rademacher; §9 follows this paper's fixed-point argument (Annals of Statistics).
- paper Stability and Generalization — Bousquet & Elisseeff (2002) The algorithmic-stability framework developed in §11; Theorem 7 is from this paper (JMLR).
- paper Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator — Dvoretzky, Kiefer & Wolfowitz (1956) The DKW inequality used in §4 (Annals of Mathematical Statistics).
- paper The Tight Constant in the Dvoretzky–Kiefer–Wolfowitz Inequality — Massart (1990) The sharp constant for DKW (Annals of Probability); pairs with Massart's finite-class lemma (§5).
- book Foundations of Machine Learning — Mohri, Rostamizadeh & Talwalkar (2018) Ch. 3–5 give a textbook treatment of Rademacher and margin bounds (MIT Press, 2nd ed.).
- book Understanding Machine Learning: From Theory to Algorithms — Shalev-Shwartz & Ben-David (2014) The standard pedagogical companion to PAC and Rademacher theory (Cambridge).
- book High-Dimensional Statistics: A Non-Asymptotic Viewpoint — Wainwright (2019) Ch. 5 develops the chaining argument behind Dudley's entropy integral (§8) (Cambridge).
- book Empirical Processes in M-Estimation — van de Geer (2000) The empirical-process companion that the cross-site formalstatistics:empirical-processes topic develops in detail (Cambridge).
- book Probability in Banach Spaces: Isoperimetry and Processes — Ledoux & Talagrand (1991) Source for Talagrand's contraction lemma used in §5 (Springer).
- paper The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights Is More Important than the Size of the Network — Bartlett (1998) Original norm-based generalization bound for neural networks — the §10.4 preview and §12.3 vacuousness diagnosis use this style (IEEE TIT).
- paper Exploring Generalization in Deep Learning — Neyshabur, Bhojanapalli, McAllester & Srebro (2017) The modern norm-based / margin-based deep-net bound; §14.3 references it (NeurIPS).
- paper Understanding Deep Learning Requires Rethinking Generalization — Zhang, Bengio, Hardt, Recht & Vinyals (2017) The empirical observation driving §12's vacuousness puzzle and §14.3's open-questions discussion (ICLR).
- paper Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data — Dziugaite & Roy (2017) PAC-Bayes route to non-vacuous deep-net bounds; the §14.1 outlook (UAI).
- paper Train Faster, Generalize Better: Stability of Stochastic Gradient Descent — Hardt, Recht & Singer (2016) SGD-based algorithmic stability for non-convex losses; §14 references it (ICML).
- paper Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off — Belkin, Hsu, Ma & Mandal (2019) Empirical double-descent phenomena that classical bounds do not yet predict (PNAS); §14.4.
- paper Deep Learning: A Statistical Viewpoint — Bartlett, Montanari & Rakhlin (2021) Modern survey of the deep-learning generalization landscape (Acta Numerica).
- paper The Tradeoffs of Large Scale Learning — Bottou & Bousquet (2008) The approximation–estimation–optimization decomposition Proposition 1 follows (NeurIPS).
- paper Spectrally-Normalized Margin Bounds for Neural Networks — Bartlett, Foster & Telgarsky (2017) The product-of-layer-norms Rademacher bound previewed in §10.4 (NeurIPS).
- paper Stronger Generalization Bounds for Deep Nets via a Compression Approach — Arora, Ge, Neyshabur & Zhang (2018) Compression-based deep-net bounds; §14.3 reference (ICML).
- paper Neural Tangent Kernel: Convergence and Generalization in Neural Networks — Jacot, Gabriel & Hongler (2018) The NTK regime where overparameterized networks behave like kernel methods; §14.3 reference (NeurIPS).
- paper The Implicit Bias of Gradient Descent on Separable Data — Soudry, Hoffer, Nacson, Gunasekar & Srebro (2018) The implicit-bias-of-SGD line referenced in §14.4 (JMLR).
- paper Benign Overfitting in Linear Regression — Bartlett, Long, Lugosi & Tsigler (2020) The benign-overfitting story behind §14.4 (PNAS).