
VC Dimension

Combinatorial capacity, Sauer–Shelah, and uniform convergence for binary hypothesis classes

Motivation: why finite-class arguments break for infinite hypothesis classes

The PAC Learning topic developed the sharpest tool we have for finite hypothesis classes: take Hoeffding’s inequality on a single hypothesis, apply the union bound over all $|\mathcal{H}|$ of them, and a sample size of $n = O((\log|\mathcal{H}| + \log(1/\delta))/\epsilon^2)$ suffices for uniform convergence at level $\epsilon$ with confidence $1-\delta$. The bound is tight in $|\mathcal{H}|$, transparent in its derivation, and useless the moment we try to use it on anything we’d want to learn: every real-parameter class has $|\mathcal{H}| = \infty$, and $\log\infty$ shuts the argument down. The question for this whole topic: what is the right combinatorial object to put in place of $|\mathcal{H}|$ when the cardinality counts the wrong thing?

The finite-class union-bound recap

Fix the standard setup. We have a labeled sample $S = ((X_1, Y_1), \dots, (X_n, Y_n))$ drawn i.i.d. from a distribution $D$ on $\mathcal{X} \times \{0,1\}$, a binary hypothesis class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$, the population risk $R(h) = \Pr_D[h(X) \ne Y]$, and the empirical risk $\hat R(h) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[h(X_i) \ne Y_i]$. Hoeffding’s inequality applied to the $n$ Bernoulli indicators gives $\Pr(|R(h) - \hat R(h)| > \epsilon) \le 2\exp(-2n\epsilon^2)$ for any single fixed $h$. The PAC framework wants this uniformly over the whole class, so we union-bound across $\mathcal{H}$:

$$\Pr\bigl(\exists h \in \mathcal{H} : |R(h) - \hat R(h)| > \epsilon\bigr) \le 2\,|\mathcal{H}|\exp(-2n\epsilon^2).$$

Setting the right-hand side to $\delta$ and solving for $n$ gives the canonical finite-class sample-complexity bound

$$n \ge \frac{\log|\mathcal{H}| + \log(2/\delta)}{2\epsilon^2}, \quad (1.1)$$

the headline result of PAC Learning §3. The logarithm is the union bound’s signature: every extra hypothesis costs us a $\log$ in the sample size. The bound stays usable for surprisingly large finite classes (billions of hypotheses cost twenty-something samples), but it requires $|\mathcal{H}| < \infty$.
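
A quick numerical check of (1.1) makes the logarithm’s mildness concrete; a minimal sketch in Python (the helper name is ours):

```python
import math

def finite_class_sample_size(card_H: int, eps: float, delta: float) -> int:
    """Finite-class bound (1.1): n >= (log|H| + log(2/delta)) / (2 eps^2)."""
    return math.ceil((math.log(card_H) + math.log(2 / delta)) / (2 * eps**2))

# A billion hypotheses at eps = 0.1, delta = 0.05 ...
print(finite_class_sample_size(10**9, 0.1, 0.05))  # 1221
# ... costs only ~4x more samples than ten hypotheses:
print(finite_class_sample_size(10, 0.1, 0.05))     # 300
```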

What |H| should “really” mean

Take the simplest infinite class: half-lines on the real line, $\mathcal{H}_{\text{half-line}} = \{h_a : a \in \mathbb{R}\}$ with $h_a(x) = \mathbb{1}[x \ge a]$. There are uncountably many hypotheses and bound (1.1) instantly degenerates: $\log|\mathcal{H}_{\text{half-line}}| = \infty$ forces $n = \infty$. Yet half-lines are intuitively a small class: one real degree of freedom, so a handful of points should suffice. The bound is counting something it shouldn’t be counting.

The issue is that the union bound treats every $h_a$ as a fresh draw from concentration’s perspective, but on any fixed sample $S$ most of those thresholds are behaviorally indistinguishable. Two thresholds $a, a'$ in the same gap between consecutive sample points produce the same labels on $S$. The hypotheses are different as functions on $\mathbb{R}$, but identical as labelings of $S$. If $\mathcal{H}_{\text{half-line}}$ has uncountably many functions but only $n + 1$ distinct labelings of any $n$-point sample, the right effective cardinality for the union bound is $n + 1$, not $\infty$.

The right object: labelings on a fixed sample, not functions on the input space

For a sample $S = (x_1, \dots, x_n)$ and a class $\mathcal{H}$, the restriction of $\mathcal{H}$ to $S$ is

$$\mathcal{H}_{|S} = \bigl\{(h(x_1), \dots, h(x_n)) : h \in \mathcal{H}\bigr\} \subseteq \{0,1\}^n. \quad (1.2)$$

Three immediate consequences: $|\mathcal{H}_{|S}| \le 2^n$; $|\mathcal{H}_{|S}| \le |\mathcal{H}|$; and the latter inequality can be enormously loose. The hope is to replace $|\mathcal{H}|$ in the union bound with some worst-case-over-$S$ version of $|\mathcal{H}_{|S}|$, the growth function $\Pi_\mathcal{H}(n)$ of §3. If $\Pi_\mathcal{H}(n)$ is polynomial in $n$ even when $|\mathcal{H}|$ is infinite, the union bound buys a finite sample complexity.
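
Computing the restriction directly for half-lines shows the collapse from uncountably many functions to at most $n + 1$ labelings; a minimal sketch (helper name is ours):

```python
import numpy as np

def halfline_restriction(sample: np.ndarray) -> set:
    """All distinct labelings of `sample` by h_a(x) = 1[x >= a].
    Only one threshold per gap between consecutive sorted points
    (plus two sentinels) can matter, so enumerate those."""
    xs = np.sort(sample)
    cuts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    return {tuple((sample >= a).astype(int)) for a in cuts}

rng = np.random.default_rng(0)
S = rng.uniform(0, 1, size=20)
print(len(halfline_restriction(S)))  # 21 = n + 1, however fine the threshold grid
```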

Roadmap: shattering, growth, Sauer–Shelah, uniform convergence

The four-step ladder this topic climbs: (1) shattering in §2: given $S$ of size $n$, $\mathcal{H}$ shatters $S$ when $|\mathcal{H}_{|S}| = 2^n$; (2) the growth function in §3: $\Pi_\mathcal{H}(n) = \max_{|S|=n} |\mathcal{H}_{|S}|$; (3) the VC dimension in §4: $d_{\mathrm{VC}}(\mathcal{H})$, the largest $n$ for which $\Pi_\mathcal{H}(n) = 2^n$; (4) the Sauer–Shelah lemma and the FTSL in §§5–6: finite VC dimension forces polynomial growth, which combined with Hoeffding via a doubled-sample union bound gives the agnostic uniform-convergence bound $|R(\hat h) - \hat R(\hat h)| = O(\sqrt{d_{\mathrm{VC}} \log(n/d_{\mathrm{VC}})/n})$. The shape of the ladder is the entire story in one substitution: $\log|\mathcal{H}|$ becomes $d_{\mathrm{VC}} \log(n/d_{\mathrm{VC}})$.

Demo — log cardinality blows up; restricted cardinality plateaus

Take an $n$-point sample $S$ on $[0,1]$ and discretized half-line classes $\mathcal{H}_k = \{h_a : a \in \{1/k, 2/k, \dots, k/k\}\}$ with $k$ thresholds. As $k \to \infty$, $|\mathcal{H}_k| \to \infty$ (red, diverging on the log-log plot) and the finite-class bound (1.1) degenerates. But $|\mathcal{H}_{k|S}|$ saturates at exactly $n+1$ (blue plateau) as soon as $k$ is large enough to put a threshold in every gap. The red curve is the wrong quantity for learning theory; the blue curve is the right one.

Figure: discretized half-line cardinality (red) diverges as $k$ grows; the restricted cardinality on a 20-point sample (blue) plateaus at $n + 1 = 21$ once every gap between sample points contains a threshold.

The takeaway is exactly the diagnosis from §1.2: refining $\mathcal{H}$ adds hypotheses, but doesn’t add information that distinguishes them on $S$. The behaviorally distinct hypotheses on $S$, the dichotomies of §2, are what matter.

Shattering: the combinatorial substitute for |H|

Restriction to S: dichotomies

The set of labelings $\mathcal{H}_{|S} \subseteq \{0,1\}^n$ produced by $\mathcal{H}$ on $S$ is also called the set of dichotomies (Vapnik–Chervonenkis 1971). Three properties: $\mathcal{H}_{|S} \subseteq \{0,1\}^n$, so $|\mathcal{H}_{|S}| \le 2^n$; $|\mathcal{H}_{|S}| \le |\mathcal{H}|$; and the second bound can be loose by an unbounded factor.

Definition: H shatters S

Definition 1 (Shattering).

A class $\mathcal{H}$ shatters a sample $S = (x_1, \dots, x_n)$ when $\mathcal{H}_{|S} = \{0,1\}^n$, i.e., every $b \in \{0,1\}^n$ is realized by some $h \in \mathcal{H}$.

Two pieces of commentary: shattering is a property of the pair $(\mathcal{H}, S)$, not of $\mathcal{H}$ or $S$ alone; and the definition is existential in $h$: for each target labeling, we need some witnessing hypothesis.
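
Shattering is mechanically checkable once the class restricts to finitely many distinct hypotheses on $S$; a minimal sketch for half-lines (function names are ours):

```python
def shatters(sample, hypotheses) -> bool:
    """True iff the hypothesis list realizes all 2^n labelings of `sample`."""
    labelings = {tuple(int(h(x)) for x in sample) for h in hypotheses}
    return len(labelings) == 2 ** len(sample)

def halfline_hypotheses(sample):
    """One canonical threshold per gap between sorted points, plus sentinels."""
    xs = sorted(sample)
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return [lambda x, a=a: x >= a for a in cuts]

S1, S2 = [0.3], [0.3, 0.7]
print(shatters(S1, halfline_hypotheses(S1)))  # True: one point is shattered
print(shatters(S2, halfline_hypotheses(S2)))  # False: (1, 0) is unrealizable
```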

The three-point picture: half-lines vs half-planes

Three configurations across two classes paint the picture:

  • A. Half-lines on three collinear 1D points $\{1, 2, 3\}$: only the 4 threshold patterns $(0,0,0), (0,0,1), (0,1,1), (1,1,1)$ are realized. The labelings $(1,0,0), (0,1,0), (1,0,1), (1,1,0)$ are unrealizable. Half-lines do not shatter.
  • B. Half-planes on three collinear 2D points $(1,0), (2,0), (3,0)$: same 4 patterns; half-planes on collinear points behave like half-lines on the projection.
  • C. Half-planes on three triangle vertices $(0,0), (1,0), (0.5,1)$: all 8 labelings realized, each separation achieved by a half-plane through edges or vertices. Half-planes shatter triangles.
  • D. Half-planes on any 4 points in the plane: previewed here; the full proof goes through §8.1 via Radon’s theorem.

Shattering depends on the configuration, not just the cardinality

Cases A–C carry the §2 punchline: shattering is not a function of $|S|$ alone. To prove $d_{\mathrm{VC}}(\mathcal{H}) = d$, we’ll need two pieces: a constructive direction (exhibit some $d$-point set that $\mathcal{H}$ shatters: an existence claim, usually easy) and a destructive direction (prove that every $(d+1)$-point set has some unrealizable labeling: a universal claim, usually hard).

Demo — the shatter atlas

A 3 × 8 grid: rows for the three (configuration, class) cases, columns for the 8 elements of $\{0,1\}^3$, each cell marked with a realizing classifier when one exists. Rows A and B each tally 4 of 8; row C tallies 8 of 8 and shatters.

Figure: shatter atlas. Row A: half-lines on three collinear 1D points realize 4 of 8 labelings (the monotone prefix patterns). Row B: half-planes on three collinear 2D points realize the same 4. Row C: half-planes on a triangle realize all 8 and shatter.

The atlas makes the §2.4 asymmetry concrete. Rows A and B each top out at 4 of the 8 labelings; row C realizes all 8. Drag a point in row B onto the line $y = x$ and the row jumps from 4 to 8: shattering switched on by a one-pixel move. The class alone doesn’t determine shattering; the configuration matters every bit as much.

The growth function Π(n)

Definition: maximum number of labelings on an n-point sample

Definition 2 (Growth function).

The growth function of $\mathcal{H}$ is

$$\Pi_\mathcal{H}(n) = \max_{S \subseteq \mathcal{X},\, |S| = n} |\mathcal{H}_{|S}|. \quad (3.1)$$

It is a function of $n$ alone, and monotone non-decreasing in $n$ (adding a point to a sample can only preserve or increase the number of distinct labelings).

The trivial bound, and the shattering equivalence

Proposition 1 (Trivial bound and shattering).

$\Pi_\mathcal{H}(n) \le 2^n$, with equality iff some $n$-point sample is shattered.

The proposition converts shattering into an algebraic question and sets up the VC dimension as “the largest $n$ for which the bound is tight.”

Closed-form growth functions for the canonical classes

Four canonical classes have explicitly computable growth functions:

  • Half-lines on $\mathbb{R}$: $\Pi(n) = n + 1$. The threshold falls in one of the $n+1$ gaps determined by the sorted sample points.
  • Intervals on $\mathbb{R}$: $\Pi(n) = \binom{n+1}{2} + 1$. Each non-empty interval picks a contiguous run of labels (verified by brute force in the sketch below).
  • Half-planes in $\mathbb{R}^2$: $\Pi(n) = n^2 - n + 2$ (Cover 1965), derived from the recursion $C(n+1) = C(n) + 2n$ on the number of regions cut by $n$ lines in general position, summed with the two trivial all-0 and all-1 dichotomies.
  • Axis-rectangles in $\mathbb{R}^2$: no closed form, but $\Pi(n) = O(n^4)$ via the bounding-box argument: every rectangle dichotomy is determined by its four extreme points along the two axes.
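
The interval closed form is easy to confirm by enumerating canonical intervals, one per pair of cut points; a minimal sketch (names are ours):

```python
import numpy as np
from math import comb

def interval_restriction_size(sample: np.ndarray) -> int:
    """Distinct labelings of `sample` by indicators of intervals [l, r],
    with canonical endpoints at midpoints between sorted points."""
    xs = np.sort(sample)
    cuts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    labelings = {tuple(((sample >= l) & (sample <= r)).astype(int))
                 for i, l in enumerate(cuts) for r in cuts[i:]}
    return len(labelings)

rng = np.random.default_rng(0)
for n in (3, 5, 8):
    S = rng.uniform(0, 1, size=n)
    print(n, interval_restriction_size(S), comb(n + 1, 2) + 1)  # counts agree
```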

The phase transition: exponential below the VC dimension, polynomial above

Each canonical class tracks $2^n$ until $n = d_{\mathrm{VC}}$, then peels off to polynomial. Half-lines peel at $n = 2$, intervals at $n = 3$, half-planes at $n = 4$, rectangles at $n = 5$. The polynomial degree is at most $d_{\mathrm{VC}}$ (half-planes, with quadratic growth despite $d_{\mathrm{VC}} = 3$, show it can be strictly smaller). The phase transition is what Sauer–Shelah (§5) upgrades to a theorem.

Figure: growth-function gallery on log scale for half-lines, intervals, half-planes, and axis-rectangles, $n = 1$ to $16$, with the trivial $2^n$ bound and the Sauer–Shelah binomial-sum bounds overlaid; arrows mark the phase transition at $n = d_{\mathrm{VC}} + 1$ for each class.

The picture matches the §3.4 prediction exactly: each curve tracks $2^n$ on the log scale until $n = d_{\mathrm{VC}}$, then peels off to polynomial. The Sauer–Shelah binomial-sum bounds (dotted) sit above each class’s true growth function: loose, but of the right polynomial character.

The VC dimension

Definition: largest shattered set

Definition 3 (VC dimension).

The VC dimension of $\mathcal{H}$ is

$$d_{\mathrm{VC}}(\mathcal{H}) = \max\bigl\{n \in \mathbb{N} : \Pi_\mathcal{H}(n) = 2^n\bigr\},$$

or $+\infty$ if the maximum does not exist. Conventions: $d_{\mathrm{VC}}(\emptyset) = -\infty$ and $d_{\mathrm{VC}}(\{h\}) = 0$.

The VC dimension is integer-valued (or $+\infty$); for binary classes it is the canonical capacity notion, and the real-valued generalizations of §9.4 reduce back to it.

The two-sided proof template

To prove $d_{\mathrm{VC}}(\mathcal{H}) = d$, we need two ingredients: (1) constructively exhibit some $d$-point set that $\mathcal{H}$ shatters, and (2) destructively prove that every $(d+1)$-point set has some unrealizable labeling. The destructive direction is universal, quantified over all $(d+1)$-point samples, and is where Radon’s theorem (§8.1) and pigeonhole arguments (§8.2) earn their keep.

Worked examples: half-lines, intervals, half-planes

  • $d_{\mathrm{VC}}(\mathcal{H}_{\text{half-line}}) = 1$. Constructive: a single point is shattered (one labeling each way). Destructive: any two points $x_1 < x_2$ have the unrealizable labeling $(1, 0)$: no threshold realizes “1 then 0.”
  • $d_{\mathrm{VC}}(\mathcal{H}_{\text{interval}}) = 2$. Constructive: any two points are shattered via four intervals. Destructive: any three points $x_1 < x_2 < x_3$ have the unrealizable labeling $(1, 0, 1)$: an interval is a contiguous set, so it cannot label the endpoints “in” and the middle “out.”
  • $d_{\mathrm{VC}}(\mathcal{H}_{\text{half-plane}}) = 3$. Constructive: a triangle is shattered (§2.5 row C). Destructive: every 4-point configuration has an unrealizable labeling, following from Radon’s theorem in §8.1.

Pathological cases: H_sin has 1 parameter but infinite VC dimension

The sine-wave family $\mathcal{H}_{\sin} = \{h_\omega(x) = \mathbb{1}[\sin(\omega x) \ge 0] : \omega \in \mathbb{R}\}$ has one real parameter but $d_{\mathrm{VC}}(\mathcal{H}_{\sin}) = \infty$. The construction (Vapnik 1998 §3.4) uses geometrically spaced points $x_i = 2^{-i}$; for any labeling $b \in \{0,1\}^n$, the choice $\omega = \pi\bigl(1 + \sum_{i : b_i = 0} 2^i\bigr)$ realizes $b$. Parameter count is not a reliable proxy for VC dimension.
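
The construction is easy to sanity-check numerically; a minimal sketch (assuming the 1-based indexing above):

```python
import math
from itertools import product

n = 8
xs = [2.0 ** -(i + 1) for i in range(n)]  # x_i = 2^{-i} for i = 1..n

for b in product([0, 1], repeat=n):       # every labeling in {0,1}^n
    omega = math.pi * (1 + sum(2 ** (i + 1) for i, bi in enumerate(b) if bi == 0))
    realized = tuple(int(math.sin(omega * x) >= 0) for x in xs)
    assert realized == b

print(f"all {2**n} labelings realized on {n} points")  # the set is shattered
```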

Demo — destructive direction for each canonical class

Figure: the smallest unrealizable labeling for each canonical class (blue circles labeled 1, red circles labeled 0): half-lines fail (1,0) on two points (monotonicity); intervals fail (1,0,1) on three points (connectivity); half-planes fail the XOR labeling on the unit square (convex hull); axis-rectangles fail four-cardinals-versus-origin on five points (bounding box).

The four obstructions are exemplars of the four most common destructive-direction arguments in VC theory: monotonicity (no realization of decreasing patterns), connectivity (no disconnected positive set), convex-hull (no separation across the hull’s interior), and bounding-box (no positive set excluding an interior point).

The Sauer–Shelah lemma

The numerical phase transition in §3.4 told us that finite VC dimension appears to force polynomial growth of $\Pi_\mathcal{H}(n)$. The Sauer–Shelah lemma is that observation upgraded to a theorem. Two independent discoveries, Sauer (1972) and Shelah (1972), with antecedents in Vapnik–Chervonenkis (1971), established the bound; the modern proof we give here is due to Frankl–Pach (1983) and Pajor (1985), using a clever operation on $\{0,1\}^n$ called the shifting operator.

Statement: finite VC dimension forces polynomial growth

Theorem 1 (Sauer–Shelah).

Let $\mathcal{H}$ have $d_{\mathrm{VC}}(\mathcal{H}) = d < \infty$. Then for every $n \ge 1$,

$$\Pi_\mathcal{H}(n) \le \sum_{i=0}^{d} \binom{n}{i}. \quad (5.1)$$

Consequently $\Pi_\mathcal{H}(n) \le (en/d)^d$ for $n \ge d$.

The binomial-sum form (5.1) is tight in general, i.e., some classes achieve it; the $(en/d)^d$ closed form (Corollary 1) is the convenient version for downstream applications. For any specific class it can be loose (half-planes have $\Pi(n) = n^2 - n + 2$ against the Sauer–Shelah ceiling $\binom{n}{\le 3}$), but it is universal across all classes with $d_{\mathrm{VC}} = d$.

The shifting operator

Identify $\mathcal{H}$ with its restriction $\mathcal{H}_{|S} \subseteq \{0,1\}^n$ on a fixed sample $S$ of size $n$.

Definition 4 (Shifting operator).

For $i \in \{1, \dots, n\}$, the shifting operator $S_i(\mathcal{H}) \subseteq \{0,1\}^n$ is constructed as follows. For each $v \in \mathcal{H}$ with $v_i = 1$, let $v'$ denote the down-shifted vector obtained by flipping $v_i$ to $0$. Replace $v$ with $v'$ in $S_i(\mathcal{H})$ if $v' \notin \mathcal{H}$; otherwise leave both $v$ and $v'$ in place.

Lemma 1 (Size preservation).

$|S_i(\mathcal{H})| = |\mathcal{H}|$.

Proof.

The map $v \mapsto v'$ on “lonely” vectors (those with $v_i = 1$ whose down-shift $v'$ is not in $\mathcal{H}$) is a bijection between the vectors leaving $\mathcal{H}$ and the vectors entering $S_i(\mathcal{H})$; all other vectors stay put.

Lemma 2 (VC dimension preservation).

If $S_i(\mathcal{H})$ shatters a set $T \subseteq \{1, \dots, n\}$, then $\mathcal{H}$ shatters $T$.

Proof.

Suppose $S_i(\mathcal{H})$ shatters $T$. For each $u \in \{0,1\}^T$, we must produce a vector $v \in \mathcal{H}$ with $v|_T = u$. By shattering, there is some $w \in S_i(\mathcal{H})$ with $w|_T = u$. Two cases.

Case A: $w \in \mathcal{H}$. Take $v = w$. Done.

Case B: $w \in S_i(\mathcal{H}) \setminus \mathcal{H}$. Then $w$ is a down-shifted vector: its $i$-th coordinate is $0$, and there exists $v^\star \in \mathcal{H}$ with $v^\star_i = 1$ and $v^\star_j = w_j$ for $j \ne i$.

  • Sub-case B1: $i \notin T$. Take $v = v^\star \in \mathcal{H}$; since the disagreement is at coordinate $i \notin T$, we have $v^\star|_T = w|_T = u$.

  • Sub-case B2: $i \in T$ and $u_i = 0$. We need $v \in \mathcal{H}$ with $v_i = 0$ and $v|_T = u$. Apply the shattering of $T$ by $S_i(\mathcal{H})$ to the flipped labeling $u'$ defined by $u'_i = 1$ and $u'_j = u_j$ for $j \ne i$. The witness is $w^\dagger \in S_i(\mathcal{H})$ with $w^\dagger|_T = u'$, so $w^\dagger_i = 1$. Now $w^\dagger$ cannot have been produced by a down-shift (a down-shift would set the $i$-th coordinate to $0$, not $1$), so $w^\dagger \in \mathcal{H}$ directly. Consider the down-shift $w^{\dagger\prime}$ of $w^\dagger$: it has $w^{\dagger\prime}_i = 0 = u_i$ and $w^{\dagger\prime}|_{T \setminus \{i\}} = w^\dagger|_{T \setminus \{i\}} = u|_{T \setminus \{i\}}$. We claim $w^{\dagger\prime} \in \mathcal{H}$. Suppose not: then $w^\dagger$ would be “lonely” in $\mathcal{H}$ (its down-shift would be missing), so the shifting operator $S_i$ would replace $w^\dagger$ with $w^{\dagger\prime}$ in $S_i(\mathcal{H})$, putting $w^{\dagger\prime} \in S_i(\mathcal{H})$ and removing $w^\dagger$ from $S_i(\mathcal{H})$. But $w^\dagger \in S_i(\mathcal{H})$ by hypothesis, a contradiction. Therefore $w^{\dagger\prime} \in \mathcal{H}$, and $v = w^{\dagger\prime}$ realizes $u$ on $T$.

  • Sub-case B2 with $u_i = 1$ would require $v_i = 1$, and $v = v^\star$ from Case B works directly.

The Sub-case B2 contradiction is the only delicate moment in the entire proof: we need $w^{\dagger\prime}$ to live in $\mathcal{H}$, and the only way to extract that fact is to suppose otherwise and watch the shifting operator’s definition force a contradiction with $w^\dagger \in S_i(\mathcal{H})$.

Full proof of Sauer–Shelah

Proof.

Five steps.

Step 1. Apply the shifting operators $S_1, S_2, \dots, S_n$ iteratively, cycling until no further changes occur. The total weight $\sum_v \sum_i v_i$ decreases by at least $1$ whenever a shift moves a lonely vector down, and is bounded below by $0$, so the process terminates in finitely many cycles. Call the resulting class $\mathcal{H}^\star$ (the “stable class”). By construction, $S_i(\mathcal{H}^\star) = \mathcal{H}^\star$ for every $i$.

Step 2. Lemma 1 applied iteratively: $|\mathcal{H}^\star| = |\mathcal{H}_{|S}|$. Lemma 2 applied iteratively: $d_{\mathrm{VC}}(\mathcal{H}^\star) \le d_{\mathrm{VC}}(\mathcal{H}) = d$ (if $\mathcal{H}^\star$ shattered some set, so would $\mathcal{H}$).

Step 3: the stable class is down-closed. Suppose $v \in \mathcal{H}^\star$ has $v_i = 1$. By stability $S_i(\mathcal{H}^\star) = \mathcal{H}^\star$, so the down-shift $v'$ must be in $\mathcal{H}^\star$: otherwise $S_i$ would replace $v$ with $v'$, contradicting stability. Iterating one coordinate at a time, every $u \le v$ (coordinatewise) lies in $\mathcal{H}^\star$.

Step 4: count by support sizes. For any $v \in \mathcal{H}^\star$, every subset $U \subseteq \mathrm{supp}(v) = \{i : v_i = 1\}$ has its indicator $\mathbb{1}_U \in \mathcal{H}^\star$ by down-closedness. So the restriction $\mathcal{H}^\star_{|\mathrm{supp}(v)}$ contains all $2^{|\mathrm{supp}(v)|}$ labelings: $\mathcal{H}^\star$ shatters $\mathrm{supp}(v)$. By Step 2, this forces $|\mathrm{supp}(v)| \le d$.

Step 5. The number of $v \in \{0,1\}^n$ with $|\mathrm{supp}(v)| \le d$ is exactly $\sum_{i=0}^d \binom{n}{i}$. So $|\mathcal{H}_{|S}| = |\mathcal{H}^\star| \le \binom{n}{\le d}$. Taking the maximum over $S$ gives $\Pi_\mathcal{H}(n) \le \binom{n}{\le d}$.

The shifting trick does two pieces of work simultaneously: Lemma 1 conserves the count we want to bound, and Lemma 2 conserves the VC dimension. The stable class is structurally much simpler, a down-closed family of subsets, and is easy to count by support size.
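
The whole argument is small enough to execute; a minimal sketch that stabilizes a random class and checks size preservation, down-closedness, and the binomial-sum count (all names ours):

```python
from math import comb
import random

def shift(H: set, i: int) -> set:
    """One application of the shifting operator S_i to a set of 0/1 tuples."""
    out = set()
    for v in H:
        if v[i] == 1:
            down = v[:i] + (0,) + v[i + 1:]
            out.add(down if down not in H else v)  # lonely vectors shift down
        else:
            out.add(v)
    return out

def stabilize(H: set, n: int) -> set:
    """Cycle S_1..S_n to a fixed point; the total weight strictly decreases."""
    changed = True
    while changed:
        changed = False
        for i in range(n):
            H2 = shift(H, i)
            if H2 != H:
                H, changed = H2, True
    return H

n = 10
random.seed(0)
cube = [tuple((x >> j) & 1 for j in range(n)) for x in range(2 ** n)]
H = set(random.sample(cube, 200))
Hstar = stabilize(H, n)
assert len(Hstar) == len(H)                        # Lemma 1, iterated
assert all(v[:i] + (0,) + v[i + 1:] in Hstar       # Step 3: down-closed
           for v in Hstar for i in range(n) if v[i])
d = max(sum(v) for v in Hstar)                     # largest support in H*
print(len(H), "<=", sum(comb(n, i) for i in range(d + 1)))  # Step 5 count
```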

The (en/d)^d closed form

Corollary 1 (The (en/d)^d bound).

For $n \ge d \ge 1$,

$$\sum_{i=0}^d \binom{n}{i} \le (en/d)^d.$$
Proof.

Since $d/n \le 1$, every term with $i \le d$ satisfies $\binom{n}{i} \le (n/d)^d \binom{n}{i} (d/n)^i$; summing and then extending the sum over all $i$:

$$\sum_{i \le d} \binom{n}{i} \le (n/d)^d \sum_{i \le d} \binom{n}{i} (d/n)^i \le (n/d)^d \sum_{i=0}^n \binom{n}{i} (d/n)^i = (n/d)^d (1 + d/n)^n.$$

The standard inequality $(1 + x/n)^n \le e^x$ at $x = d$ gives $(1 + d/n)^n \le e^d$. Combining: $\sum_{i \le d} \binom{n}{i} \le (n/d)^d e^d = (en/d)^d$.

The slack between $\binom{n}{\le d}$ and $(en/d)^d$ settles, for fixed $d$ as $n$ grows, at $e^d d!/d^d \approx \sqrt{2\pi d}$ by Stirling: constant in $n$ and growing only like $\sqrt{d}$. The closed form gives the right asymptotic order in $n$ and the right exponent in $d$.
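
The limiting ratio is quick to verify numerically (this is the Stirling factor discussed above; names ours):

```python
from math import comb, e, factorial, pi, sqrt

def binom_sum(n: int, d: int) -> int:
    return sum(comb(n, i) for i in range(d + 1))

for d in (1, 3, 5, 10):
    n = 200 * d                                # n >> d, so C(n, d) dominates
    ratio = (e * n / d) ** d / binom_sum(n, d)
    exact = e**d * factorial(d) / d**d         # limit of the ratio as n grows
    print(d, round(ratio, 2), round(exact, 2), round(sqrt(2 * pi * d), 2))
```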

Demo — the three bounds compared

Figure: left, the binomial-sum bound and the $(en/d)^d$ closed form against the trivial $2^n$ bound on log scale, for $d \in \{1, 2, 3, 4, 5, 10\}$; right, the ratio of the closed form to the binomial sum.

The left panel shows the bounds themselves on log scale: each $\binom{n}{\le d}$ sits below $2^n$ for every $n > d$, and the $(en/d)^d$ closed form sits above it. The right panel tracks the Stirling-derived gap: the ratio of closed form to binomial sum settles onto $e^d d!/d^d \approx \sqrt{2\pi d}$, the multiplicative slack we accept in exchange for a clean polynomial.

The Fundamental Theorem of Statistical Learning

Hoeffding for one $h$ + symmetrization + union bound over labelings + Sauer–Shelah substitution → the agnostic uniform-convergence bound. The result is one direction of the equivalence theorem that Blumer–Ehrenfeucht–Haussler–Warmuth (1989) named the Fundamental Theorem of Statistical Learning.

Setup: agnostic PAC, ERM, and the worst-case gap

For $S \sim D^n$, hypothesis class $\mathcal{H}$, and 0-1 loss: $R(h) = \Pr_D[h(X) \ne Y]$ and $\hat R(h) = (1/n)\sum_i \mathbb{1}[h(X_i) \ne Y_i]$. The empirical risk minimizer is $\hat h \in \arg\min_h \hat R(h)$. The worst-case generalization gap is

$$\Phi(S) = \sup_{h \in \mathcal{H}} |R(h) - \hat R(h)|.$$

A high-probability bound on $\Phi$ gives ERM’s agnostic guarantee $R(\hat h) \le R(h^\star) + 2\epsilon$, where $h^\star \in \arg\min_h R(h)$ is the population-best hypothesis.

Symmetrization: from population risk to a doubled empirical risk

Lemma 3 (Symmetrization (Vapnik–Chervonenkis 1971)).

For $\epsilon > 0$ and $n \ge 8/\epsilon^2$,

$$\Pr_S\bigl[\Phi(S) > \epsilon\bigr] \le 2\,\Pr_{S,S'}\!\Bigl[\sup_{h \in \mathcal{H}} |\hat R_S(h) - \hat R_{S'}(h)| > \epsilon/2\Bigr]. \quad (6.1)$$
Proof.

Fix $h$ with $|R(h) - \hat R_S(h)| > \epsilon$. Chebyshev’s inequality applied to the empirical risk on the ghost sample $S'$ gives

$$\Pr_{S'}\bigl[|\hat R_{S'}(h) - R(h)| > \epsilon/2\bigr] \le \frac{4}{n\epsilon^2} \le \frac{1}{2}$$

for $n \ge 8/\epsilon^2$. So with probability $\ge 1/2$ over $S'$, the ghost-sample risk is within $\epsilon/2$ of $R(h)$. Combined with $|R(h) - \hat R_S(h)| > \epsilon$, the triangle inequality yields $|\hat R_S(h) - \hat R_{S'}(h)| > \epsilon/2$. The factor 2 on the right absorbs the “with probability $\ge 1/2$” step. The full derivation is in Vapnik (1998) Theorem 4.1.

The payoff: the right side of (6.1) is a statement about empirical averages on $S \cup S'$ of size $2n$, where only the $\Pi_\mathcal{H}(2n)$ distinct labelings matter: finitely many, by Sauer–Shelah.

The agnostic uniform-convergence bound

Theorem 2 (FTSL upper bound).

Let $d_{\mathrm{VC}}(\mathcal{H}) = d < \infty$. For $\delta \in (0,1)$ and $n \ge \max(8/\epsilon^2, d)$, with probability at least $1 - \delta$ over $S \sim D^n$,

$$\sup_{h \in \mathcal{H}} |R(h) - \hat R(h)| \le \sqrt{\frac{8\bigl(d \log(2en/d) + \log(4/\delta)\bigr)}{n}}. \quad (6.2)$$
Proof.

From symmetrization (6.1),

$$\Pr_S[\Phi > \epsilon] \le 2\,\Pr_{S,S'}\!\Bigl[\sup_h |\hat R_S(h) - \hat R_{S'}(h)| > \epsilon/2\Bigr].$$

The supremum on the right involves at most $\Pi_\mathcal{H}(2n)$ distinct random variables, one for each labeling of $S \cup S'$. Union-bound over labelings, then apply Hoeffding to the difference of two $n$-sample averages of $[-1,+1]$-bounded variables:

$$\Pr_{S,S'}\bigl[|\hat R_S(h) - \hat R_{S'}(h)| > \epsilon/2\bigr] \le 2\exp(-n\epsilon^2/8).$$

Combining:

$$\Pr_S[\Phi > \epsilon] \le 4\,\Pi_\mathcal{H}(2n)\,\exp(-n\epsilon^2/8).$$

Apply Corollary 1: $\Pi_\mathcal{H}(2n) \le (2en/d)^d$. Setting the right side to $\delta$ and solving for $\epsilon$ gives $\epsilon = \sqrt{8(d\log(2en/d) + \log(4/\delta))/n}$.

The rate is $O(\sqrt{d\log n/n})$: $\sqrt{1/n}$ from Hoeffding, $\sqrt{d}$ from Sauer–Shelah, $\sqrt{\log n}$ from polynomial growth. The constants (the 8 and the 4) are loose by factors of 2–4 relative to modern refinements (Massart, McDiarmid, Talagrand). Inverting for sample complexity with a self-bounding argument gives $n = O((d\log(d/\epsilon) + \log(1/\delta))/\epsilon^2)$: the agnostic-PAC bound that matches (1.1) with $d_{\mathrm{VC}}$ replacing $\log|\mathcal{H}|$.
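
Evaluating (6.2) directly shows how slowly the bound becomes non-vacuous; a minimal sketch reproducing the number quoted in the demo below:

```python
from math import e, log, sqrt

def ftsl_epsilon(n: int, d: int, delta: float) -> float:
    """Agnostic uniform-convergence bound (6.2)."""
    return sqrt(8 * (d * log(2 * e * n / d) + log(4 / delta)) / n)

print(f"{ftsl_epsilon(120, 3, 0.05):.3f}")      # 1.170: vacuous at n = 120
print(f"{ftsl_epsilon(100_000, 3, 0.05):.3f}")  # 0.057: shrinks like sqrt(log n / n)
```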

The lower-bound direction and the FTSL equivalence

Theorem 3 (No-free-lunch).

If $d_{\mathrm{VC}}(\mathcal{H}) = \infty$, then for every learning algorithm $A$ and every $n$, there exists a distribution $D$ such that $\mathbb{E}_{S \sim D^n}[R(A(S))] \ge 1/4 + \min_h R(h)$.

Proof.

Pick any $2n$-point set $C$ shattered by $\mathcal{H}$ (one exists because $d_{\mathrm{VC}} = \infty$). Let $D$ be uniform on $C$ with an adversarially chosen labeling drawn after $A$ commits to its strategy. By shattering, some $h \in \mathcal{H}$ realizes the labeling exactly, so $\min_h R(h) = 0$. Algorithm $A$ sees $n$ of the $2n$ points; on the unseen half, $A$ has no information about the labeling and must guess, incurring error at least $1/2$ on the unseen half, equivalently at least $1/4$ overall. Full derivation in Shalev-Shwartz–Ben-David Theorem 5.1.

Infinite VC dimension is fatal for distribution-free learning. The sine-wave family $\mathcal{H}_{\sin}$ from §4.4 is a concrete instance: one real parameter is enough to defeat any learning algorithm in the agnostic setting.

Theorem 4 (FTSL equivalence (Blumer–Ehrenfeucht–Haussler–Warmuth 1989)).

For a binary class $\mathcal{H}$, the following are equivalent: (1) $\mathcal{H}$ has finite VC dimension; (2) $\mathcal{H}$ has the uniform convergence property; (3) $\mathcal{H}$ is agnostically PAC-learnable; (4) ERM is a successful agnostic PAC learner for $\mathcal{H}$; (5) $\mathcal{H}$ is PAC-learnable in the realizable case.

The proof chain is (1) ⇒ (2) ⇒ (3) ⇒ (4) via Theorem 2 and standard ERM analysis; (3) ⇒ (5) a fortiori; and (5) ⇒ (1) by the contrapositive of Theorem 3. The full development is in Shalev-Shwartz–Ben-David §6.4. The equivalence is what gives the theorem its “fundamental” name: every interesting learnability property reduces to the single integer $d_{\mathrm{VC}}$.

Demo — the FTSL bound versus the empirical generalization gap

Figure: left, the FTSL bound (6.2) versus $n$ on log-log for $d \in \{1, 3, 5, 10\}$, all with slope $-1/2$; right, the empirical worst-case generalization gap on half-planes (15 Monte Carlo replicates at six sample sizes) against the bound at $d = 3$, $\delta = 0.05$; at $n = 120$ the bound evaluates to $\epsilon = 1.170$.

The bound has the right asymptotic shape: both the empirical worst-case gap and the FTSL bound scale as $O(\sqrt{\log n / n})$, but the bound is loose by roughly a factor of 10. The Hoeffding constant is the largest source of slack; the polynomial-growth substitution is comparatively tight. Modern refinements via Talagrand and McDiarmid close the constant gap by another factor of 2–4, but the order-of-magnitude looseness persists.

Realizable versus agnostic rates

The FTSL sample complexity is $O((d\log(d/\epsilon) + \log(1/\delta))/\epsilon^2)$: quadratic in $1/\epsilon$. The realizable PAC case has linear dependence: $O((d\log(1/\epsilon) + \log(1/\delta))/\epsilon)$. The factor-$1/\epsilon$ gap is the headline practical consequence of the realizability assumption.

Realizable PAC: the 1/ε rate

Theorem 5 (Realizable PAC bound).

Under realizability ($\exists h^\star \in \mathcal{H} : R(h^\star) = 0$), with probability at least $1 - \delta$, any ERM $\hat h$ satisfies $R(\hat h) \le \epsilon$ provided

$$n \ge \frac{8\bigl(d\log(2e/\epsilon) + \log(2/\delta)\bigr)}{\epsilon}. \quad (7.1)$$
Proof.

Define the bad subclass $\mathcal{H}_\epsilon = \{h \in \mathcal{H} : R(h) > \epsilon\}$: the hypotheses ERM might erroneously pick. The union bound on consistency gives

$$\Pr\bigl[\exists h \in \mathcal{H}_\epsilon : \hat R(h) = 0\bigr] \le \Pi_\mathcal{H}(n)(1 - \epsilon)^n \le (en/d)^d e^{-n\epsilon},$$

using $(1 - \epsilon)^n \le e^{-n\epsilon}$. Set this $\le \delta$ and solve for $n$ to get (7.1). The exponential decay is in $n\epsilon$, not $n\epsilon^2$: that’s where the linear-versus-quadratic split comes from.

The crucial step is $\Pr[\hat R(h) = 0] = (1 - R(h))^n$: the probability that a Bernoulli($R(h)$) random variable is zero $n$ times in a row. This is not Hoeffding; the “rare-event” bound has $e^{-n\epsilon}$ tails, not $e^{-n\epsilon^2}$.

Agnostic PAC: the 1/ε² rate

Theorem 2’s bound inverts to $n = O((d\log(d/\epsilon) + \log(1/\delta))/\epsilon^2)$. The practical contrast at $\epsilon = 0.01$, $\delta = 0.05$, $d = 10$ is dramatic: realizable needs $n \approx 5{,}000$ while agnostic needs $n \approx 500{,}000$, two orders of magnitude more.
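
The headline numbers follow from the constant-free $O(\cdot)$ rates; a minimal sketch (leading constants dropped, which is how the 5,000 and 500,000 figures arise):

```python
from math import log

def n_realizable(d, eps, delta):
    """O((d log(1/eps) + log(1/delta)) / eps), leading constant dropped."""
    return (d * log(1 / eps) + log(1 / delta)) / eps

def n_agnostic(d, eps, delta):
    """O((d log(1/eps) + log(1/delta)) / eps^2), leading constant dropped."""
    return (d * log(1 / eps) + log(1 / delta)) / eps**2

print(round(n_realizable(10, 0.01, 0.05)))  # ~4,900
print(round(n_agnostic(10, 0.01, 0.05)))    # ~490,000
```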

Bernstein versus Hoeffding: where the gap comes from

Hoeffding’s bound is variance-blind: $\Pr(|\bar Z - p| > t) \le 2\exp(-2nt^2)$ regardless of $p$. Bernstein’s bound is variance-aware:

$$\Pr(|\bar Z - p| > t) \le 2\exp\!\Bigl(-\frac{nt^2}{2p(1-p) + (2/3)t}\Bigr).$$

When $p \to 0$, which is what realizability says about the variance of the error, the denominator vanishes and the bound tightens. The realizable analysis effectively uses Bernstein in disguise: at $t = \epsilon$, $p = \epsilon$, the exponent is of order $n\epsilon$ (about $3n\epsilon/8$), matching the $(1-\epsilon)^n \approx e^{-n\epsilon}$ rate. Hoeffding ignores the variance information and pays the $1/\epsilon^2$ price.
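
Plugging in numbers makes the variance-awareness visible; a minimal sketch:

```python
import math

def hoeffding_tail(n, t):
    return 2 * math.exp(-2 * n * t**2)

def bernstein_tail(n, t, p):
    return 2 * math.exp(-n * t**2 / (2 * p * (1 - p) + (2 / 3) * t))

n, eps = 2000, 0.01
print(hoeffding_tail(n, eps))         # 2*exp(-0.4) ~ 1.34: vacuous
print(bernstein_tail(n, eps, p=eps))  # 2*exp(-7.6) ~ 0.001: exponent ~ 3*n*eps/8
```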

When realizable rates degrade to agnostic

Realizability is fragile. Three concrete failure modes:

  1. Label noise. Even $\eta = 0.01$ flip probability, one in a hundred labels corrupted, destroys the realizable assumption: every $h \in \mathcal{H}$ has $R(h) \ge \eta > 0$, so the rare-event bound $e^{-n\epsilon}$ does not apply for $\epsilon < \eta$. Natarajan et al. (2013) develop the noisy-label theory in detail.
  2. Misspecification. If $\min_{h \in \mathcal{H}} R(h) > 0$, i.e., the class $\mathcal{H}$ doesn’t contain a perfect classifier, Theorem 5 doesn’t apply and the agnostic FTSL bound (6.2) takes over.
  3. Intermediate regime: Tsybakov margin. Between realizable ($p = 0$) and agnostic ($p \approx 1/2$), the Tsybakov margin condition (Mammen–Tsybakov 1999) interpolates with parameters $(c, \alpha)$: $\Pr(|\eta(X) - 1/2| \le t) \le c\,t^\alpha$ for $t > 0$. The resulting rate is $n = O(d^{1/(2-\alpha)}\epsilon^{-(2-\alpha)})$ for $\alpha \in [0, 1]$: linear at $\alpha = 1$ (full margin), quadratic at $\alpha = 0$ (no margin). Full development in Generalization Bounds §9.

Demo — sample complexity versus accuracy, two regimes

Figure: realizable versus agnostic sample-complexity curves at $d = 5$, $\delta = 0.05$; left panel on a linear scale, right panel on log-log.

The left panel makes the practical gap visible: at $\epsilon = 0.01$, the realizable bound asks for around 5,000 samples; the agnostic bound asks for around 500,000. The right panel decomposes the gap into its asymptotic shape: slope $-1$ versus slope $-2$ on log-log. The variance-aware, Bernstein-style analysis that the realizable case implicitly uses is what buys the extra factor.

VC dimension of canonical classes

Four canonical classes with full VC-dimension proofs: half-spaces in $\mathbb{R}^d$ via Radon’s theorem, axis-aligned rectangles in $\mathbb{R}^d$ via pigeonhole, decision stumps and Boolean classes (the easy cases), and neural networks via the Bartlett bounds.

Half-spaces in R^d via Radon’s theorem

Theorem 6 (VC dimension of half-spaces).

$d_{\mathrm{VC}}(\text{half-spaces in } \mathbb{R}^d) = d + 1$.

The destructive direction depends on the following geometric fact.

Theorem 7 (Radon's theorem (1921)).

Any $d + 2$ points in $\mathbb{R}^d$ admit a partition $(I_+, I_-)$ such that $\mathrm{conv}(I_+) \cap \mathrm{conv}(I_-) \ne \emptyset$.

Proof.

Consider the linear system in $\lambda \in \mathbb{R}^{d+2}$:

$$\sum_i \lambda_i x_i = 0, \qquad \sum_i \lambda_i = 0.$$

This is $d + 1$ scalar equations in $d + 2$ unknowns, so a nontrivial solution $\lambda^\star \ne 0$ exists. Let $I_+ = \{i : \lambda^\star_i > 0\}$ and $I_- = \{i : \lambda^\star_i < 0\}$; both are non-empty because $\sum_i \lambda^\star_i = 0$. Let $\Lambda = \sum_{i \in I_+} \lambda^\star_i = -\sum_{i \in I_-} \lambda^\star_i > 0$. The point

$$p = \frac{1}{\Lambda}\sum_{i \in I_+} \lambda^\star_i x_i = \frac{1}{\Lambda}\sum_{i \in I_-} (-\lambda^\star_i) x_i$$

is a convex combination from both sides: the coefficients $\lambda^\star_i/\Lambda$ for $i \in I_+$ are non-negative and sum to $1$, and the coefficients $-\lambda^\star_i/\Lambda$ for $i \in I_-$ are likewise non-negative and sum to $1$. So $p \in \mathrm{conv}(I_+) \cap \mathrm{conv}(I_-)$.
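
The null-space computation in the proof translates directly into code; a minimal numpy sketch (function name ours):

```python
import numpy as np

def radon_partition(points: np.ndarray):
    """points: (d+2, d) array. Returns (I_plus, I_minus, p), p in both hulls."""
    m, d = points.shape
    assert m == d + 2
    # Homogeneous system: sum_i lam_i x_i = 0 and sum_i lam_i = 0.
    A = np.vstack([points.T, np.ones(m)])  # (d+1) x (d+2), so a null vector exists
    lam = np.linalg.svd(A)[2][-1]          # right singular vector for sigma ~ 0
    I_plus, I_minus = np.where(lam > 0)[0], np.where(lam < 0)[0]
    p = lam[I_plus] @ points[I_plus] / lam[I_plus].sum()
    return I_plus, I_minus, p

rng = np.random.default_rng(1)
I_plus, I_minus, p = radon_partition(rng.standard_normal((4, 2)))  # d = 2
print(I_plus, I_minus, p)  # p lies in the convex hulls of both parts
```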

Proof.

Proof of Theorem 6, destructive direction. For any $d + 2$ points, Radon’s theorem gives a partition $(I_+, I_-)$ with a shared point $p$. Consider the labeling that assigns $1$ to $I_+$ and $0$ to $I_-$. Any separating half-space $H = \{x : \langle w, x\rangle \ge c\}$ would contain all of $I_+$ (and hence $\mathrm{conv}(I_+) \subseteq H$, by convexity of $H$) and exclude all of $I_-$ (so $\mathrm{conv}(I_-) \subseteq H^c$, the complementary open half-space, which is also convex). But then $p \in \mathrm{conv}(I_+) \cap \mathrm{conv}(I_-) \subseteq H \cap H^c = \emptyset$: a contradiction. So this labeling is unrealizable, and $d_{\mathrm{VC}}(\text{half-spaces}) \le d + 1$.

Proof.

Proof of Theorem 6, constructive direction. Take any $d + 1$ affinely independent points, e.g., the origin together with the standard basis vectors $e_1, \dots, e_d$. For any labeling $b \in \{0,1\}^{d+1}$, let $C_b$ and $C_{\bar b}$ be the centroids of the $1$- and $0$-labeled vertices respectively. The half-space whose normal is $(C_b - C_{\bar b})$ and which passes through the midpoint $(C_b + C_{\bar b})/2$ contains the $1$-labeled vertices and excludes the $0$-labeled vertices, by affine independence. Full details in Mohri–Rostamizadeh–Talwalkar Theorem 3.20.

Axis-aligned rectangles in R^d

Theorem 8 (VC dimension of axis-aligned rectangles).

$d_{\mathrm{VC}}(\text{axis-aligned rectangles in } \mathbb{R}^d) = 2d$.

Proof.

Constructive direction. Take the $2d$ unit-axis points $\{\pm e_j : j = 1, \dots, d\}$. For any labeling $b \in \{0,1\}^{2d}$, build the rectangle one axis at a time: the $j$-th range is $[-1.5, 1.5]$ if both $+e_j$ and $-e_j$ are labeled $1$, $[-1.5, 0.5]$ if only $-e_j$ is, $[-0.5, 1.5]$ if only $+e_j$ is, and $[-0.5, 0.5]$ if neither (which excludes both). Four cases per axis, each producing the required labels on that axis’s two extreme points.

Proof.

Destructive direction. Among any $2d + 1$ points in $\mathbb{R}^d$, by pigeonhole at least one point $x^\star$ is not extreme on any axis: each axis has at most two extreme positions (the maximum and the minimum coordinate), giving only $2d$ extreme positions in total. Now consider the labeling “1 everywhere except $x^\star$.” Any rectangle that includes the other $2d$ points must include their bounding box; but the bounding box contains $x^\star$ (because $x^\star$ is not extreme on any axis), so the rectangle includes $x^\star$ as well, contradicting the requested labeling. This labeling is unrealizable, so $d_{\mathrm{VC}} \le 2d$.
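
The pigeonhole step is mechanical; a minimal sketch of the destructive direction (names ours):

```python
import numpy as np

def non_extreme_point(points: np.ndarray) -> int:
    """Index of a point attaining no axis max or min: with 2d+1 points and
    only 2d extreme slots, pigeonhole guarantees one exists."""
    extreme = set(points.argmin(axis=0)) | set(points.argmax(axis=0))
    return next(i for i in range(len(points)) if i not in extreme)

rng = np.random.default_rng(2)
d = 3
pts = rng.standard_normal((2 * d + 1, d))
i = non_extreme_point(pts)
others = np.delete(pts, i, axis=0)
# The bounding box of the other 2d points already contains the non-extreme
# point, so the labeling "1 everywhere except pts[i]" is unrealizable:
print(bool(np.all(others.min(0) <= pts[i]) and np.all(pts[i] <= others.max(0))))
```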

Decision stumps and small Boolean concept classes

Decision stumps in $\mathbb{R}^d$ are hypotheses of the form $h(x) = \mathbb{1}[x_j \ge \theta]$ (or its complement) for some axis $j$ and threshold $\theta$. The growth function is $\Pi(n) \le 2d(n+1)$: polynomial in $n$ of degree $1$. The exact VC dimension is $d_{\mathrm{VC}} = \lceil\log_2(2d+1)\rceil$ (Anthony–Bartlett 1999 §3.6). This logarithmic bound is why boosting (Friedman 2001) works: weak learners (stumps) have logarithmic VC dimension, so $T$-stump ensembles have VC dimension at most $T\log d$, much smaller than a depth-$T$ tree’s, which scales with its $2^T$ leaves.

For $k$-of-$n$ majority thresholds $h(x) = \mathbb{1}[\sum_j x_j \ge k]$ on $x \in \{0,1\}^n$, there are only $n + 1$ such functions, so $d_{\mathrm{VC}} \le \log_2(n+1)$ by the finite-class bound.

Neural networks: where classical VC analysis breaks

Theorem 9 (Neural-network VC dimension (Bartlett–Maiorov–Meir 1998 / Bartlett–Harvey–Liaw–Mehrabian 2019)).

For a feedforward neural network with $W$ parameters and $L$ layers using piecewise-linear activations,

$$c\,WL\log(W/L) \le d_{\mathrm{VC}}(\mathcal{H}_{\text{net}}) \le C\,WL\log W$$

for absolute constants $c, C > 0$.

For ResNet-50 ($W \approx 25 \times 10^6$, $L = 50$), this gives $d_{\mathrm{VC}} \approx 10^{10}$. The FTSL bound (6.2) then asks for around $10^{10}$ training samples to achieve a non-vacuous generalization guarantee. ImageNet has $10^6$. The classical VC bound is vacuous on every realistic deep network.

This is the deep-learning puzzle: networks generalize well in practice despite enormous classical VC dimension. Three modern responses pursue different routes around the impasse:

  1. Margin/norm-based bounds (Bartlett–Mendelson 2002, Bartlett–Foster–Telgarsky 2017, Neyshabur–Tomioka–Srebro 2015): replace $d_{\mathrm{VC}}$ with capacity measures scaling with weight norms or margins, often yielding non-vacuous bounds.
  2. Implicit-bias of SGD (Soudry et al 2018, Gunasekar et al 2017, Ji–Telgarsky 2019): show that gradient descent on overparameterized models is biased toward low-complexity solutions, so the effective capacity is much smaller than the parameter count.
  3. Beyond uniform convergence (Nagarajan–Kolter 2019): argue that uniform-convergence-style FTSL is fundamentally the wrong tool for deep networks; algorithm-specific bounds (PAC-Bayes, compression) are needed instead.

PAC-Bayes Bounds is the canonical alternative. Full treatments of margin/norm bounds appear in Generalization Bounds §12; the implicit-bias and beyond-UC stories will be developed in the forthcoming Double Descent topic.

Demo — Radon’s theorem in the plane

Figure: three Radon-partition geometries for four-point configurations in the plane. Panel A: convex-position quadrilateral, with the diagonals intersecting at the shared point. Panel B: a point inside a triangle, pitting the interior point against the three vertices. Panel C: near-collinear, one off-axis point against the three near-collinear points. Dragging any point reassigns the partition automatically.

The three panels show three distinct Radon-partition geometries, and the §8.5 demo lets you drag the four points to watch the partition reassign as the configuration deforms. The shared point $p$ is always inside both convex hulls: that’s what makes the labeling “$I_+ = 1$, $I_- = 0$” unrealizable by any half-plane.

From VC dimension to Rademacher complexity

The bounds developed so far are distribution-free: strong universality, but worst-case tightness. Rademacher complexity is the canonical data-dependent successor: it refines the union-bound step of the FTSL by replacing $\log\Pi_\mathcal{H}(2n)$ with $\log|\mathcal{H}_{|S}|$ on the actual sample $S$.

Empirical Rademacher complexity and Massart’s lemma

Definition 5 (Empirical Rademacher complexity).

The empirical Rademacher complexity of $\mathcal{H}$ on sample $S = (x_1, \dots, x_n)$ is

$$\hat{\mathfrak R}_S(\mathcal{H}) = \mathbb{E}_\sigma\!\Bigl[\sup_{h \in \mathcal{H}} \tfrac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\Bigr], \quad (9.1)$$

where $\sigma_1, \dots, \sigma_n$ are i.i.d. Rademacher variables (uniform on $\{\pm 1\}$). The population Rademacher complexity is $\mathfrak R_n(\mathcal{H}) = \mathbb{E}_S[\hat{\mathfrak R}_S(\mathcal{H})]$.

The quantity measures how well $\mathcal{H}$ can fit random $\pm 1$ labelings of $S$: small means the class can’t fit random labels (limited capacity), large means it can (large capacity).

Lemma 4 (Massart's finite-class lemma).

For a finite set $A \subseteq \mathbb{R}^n$ with $\sup_{a \in A}\|a\|_2 \le r$,

$$\mathbb{E}_\sigma\!\Bigl[\sup_{a \in A} \tfrac{1}{n}\sum_i \sigma_i a_i\Bigr] \le \frac{r\sqrt{2\log|A|}}{n}.$$

Specialized to $A = \mathcal{H}_{|S} \subseteq \{0,1\}^n$ (so $r = \sqrt{n}$):

$$\hat{\mathfrak R}_S(\mathcal{H}) \le \sqrt{\frac{2\log|\mathcal{H}_{|S}|}{n}}. \quad (9.2)$$

The proof is the standard sub-Gaussian max-bound argument; the full derivation is in Generalization Bounds §5 or Mohri–Rostamizadeh–Talwalkar Theorem 3.7.
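
A Monte Carlo estimate of (9.1) against the bound (9.2), using the $n+1$ half-line dichotomies as the finite labeling set (names ours):

```python
import numpy as np

def empirical_rademacher(labelings: np.ndarray, n_draws: int = 2000, seed: int = 0):
    """Monte Carlo estimate of (9.1) for a finite set of 0/1 labelings (rows)."""
    rng = np.random.default_rng(seed)
    n = labelings.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return (sigma @ labelings.T).max(axis=1).mean() / n

n = 20
A = np.array([[int(j >= i) for j in range(n)] for i in range(n + 1)])  # staircases
rad = empirical_rademacher(A)
massart = np.sqrt(2 * np.log(len(A)) / n)      # bound (9.2)
print(round(rad, 3), "<=", round(massart, 3))  # the estimate sits below the bound
```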

The Sauer–Shelah substitution

Combining (9.2) with Sauer–Shelah’s $|\mathcal{H}_{|S}| \le (en/d)^d$:

$$\hat{\mathfrak R}_S(\mathcal{H}) \le \sqrt{\frac{2d\log(en/d)}{n}} = O\!\left(\sqrt{\frac{d\log n}{n}}\right). \quad (9.3)$$

This is the same shape as the FTSL bound (6.2). The data-dependent version uses the actual $|\mathcal{H}_{|S}|$ before the Sauer–Shelah substitution, which can be much tighter when the sample is benign.

Via the canonical Rademacher-based generalization bound (developed in Generalization Bounds §5.4):

$$\sup_{h \in \mathcal{H}} |R(h) - \hat R(h)| \le 2\,\mathfrak R_n(\mathcal{H}) + O\!\left(\sqrt{\log(1/\delta)/n}\right).$$

The §9 framework feeds directly into the Structural Risk Minimization §5 Bartlett–Mendelson SRM penalty, where empirical Rademacher complexity replaces VC dimension as the capacity functional.

Why Rademacher can be tighter

Three sources of tightening over the worst-case VC bound:

  1. Duplicate samples. $|\mathcal{H}_{|S}|$ drops when the sample contains duplicates: fewer distinct points means fewer distinct labelings. Rademacher follows the drop; the VC bound is oblivious.
  2. Margin. On data with a separation margin $\gamma > 0$, only the “boundary” labelings near the decision surface are realized, so $|\mathcal{H}_{|S}|$ is much smaller than the worst case. Rademacher captures the margin directly; the VC bound does not.
  3. Low-rank structure. Data lying on a $k$-dimensional submanifold has empirical Rademacher complexity scaling with $\sqrt{k\log n/n}$ instead of $\sqrt{d\log n/n}$. The gap can be exponential in $d$ for benign distributions.

Forward-pointer: pseudo-dimension and fat-shattering for real-valued classes

For real-valued function classes $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, three complexity measures generalize VC dimension:

  • Pseudo-dimension $\mathrm{Pdim}(\mathcal{F})$ (Pollard 1984): the VC dimension of the subgraph class $\{\{(x, t) : t \le f(x)\} : f \in \mathcal{F}\}$. Reduces to VC dimension when $\mathcal{F}$ is binary.
  • Fat-shattering at scale $\gamma$, $\mathrm{fat}_\gamma(\mathcal{F})$ (Bartlett–Long–Williamson 1996): a margin-aware refinement; $\mathrm{fat}_\gamma \to \mathrm{Pdim}$ as $\gamma \to 0$.
  • Covering numbers $N(\mathcal{F}, \gamma, n)$ (Dudley 1967, van der Vaart–Wellner 1996): the minimum number of $\gamma$-balls covering $\mathcal{F}_{|S}$; tied to fat-shattering by $\log N = O(\mathrm{fat}_{\gamma/2}\log(n/\gamma))$.

Full theory in Generalization Bounds §8: Dudley’s entropy integral, Talagrand’s contraction lemma, and McDiarmid plus symmetrization for real-valued losses.

Demo — empirical Rademacher for axis-rectangles

Figure: empirical Rademacher complexity (blue) for axis-rectangles on 2D random samples, $n = 5$ to $20$, versus the Massart bound on the actual restricted class size (orange dashed) and the Sauer–Shelah Rademacher bound at $d = 4$ (red dashed), with the actual restricted class size annotated at each $n$.

The empirical Rademacher (blue) sits below both upper bounds at every $n$. The Massart bound (orange) tracks the empirical curve closely because the actual restricted class sizes on benign random samples are far below the Sauer–Shelah ceiling. The Sauer–Shelah bound (red) is the worst-case envelope: universal, but loose on every benign distribution.

Integrative worked example: half-planes and rectangles, end to end

A single integrative Monte Carlo on the two anchor classes verifies every box in the bound-vs-empirical chain — growth function (§3), Sauer–Shelah (§5), FTSL envelope (§6), realizable PAC (§7), and empirical Rademacher (§9) — all on the same two test problems.

The two test classes and experimental setup

Half-plane test: $D$ uniform on $[0,1]^2$ with $y = \mathbb{1}[x_2 > x_1 + 0.2]$, a separating line offset from the diagonal, realizable in $\mathcal{H}_{\text{half-plane}}$ with $R(h^\star) = 0$. Rectangle test: $D$ uniform on $[0,1]^2$ with $y = \mathbb{1}[x_1 \in [0.3, 0.7] \text{ and } x_2 \in [0.3, 0.7]]$, a centered $0.4 \times 0.4$ square, realizable in $\mathcal{H}_{\text{axis-rectangle}}$.

Four protocols on each class:

  1. Growth-function check at $n \in \{5, 10, 15, 20\}$: exhaustive enumeration of $|\mathcal{H}_{|S}|$, via the Cover-1965 closed form for half-planes and by brute force over the $2^n$ candidate labelings for rectangles.
  2. FTSL envelope check at $n \in \{30, 60, 120, 250, 500, 1000\}$ with $B = 30$ replicates: empirical worst-case gap from LinearSVC at $C = 10^6$ (half-plane ERM) or the smallest enclosing axis-rectangle (rectangle ERM, sketched after this list), measured on a 30k-point held-out test set.
  3. Realizable sample-complexity verification at $\epsilon \in \{0.10, 0.05, 0.02\}$, $\delta = 0.05$: find the smallest $n^\star$ at which 95% of replicates achieve $R(\hat h) \le \epsilon$, compared to the theoretical prediction (7.1).
  4. Empirical Rademacher at $n = 20$: brute-force $|\mathcal{H}_{|S}|$ followed by $B = 2000$ Rademacher-draw averages of the sup over realized labelings.
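
The rectangle ERM referenced in protocol 2 is the classic tightest-fit rule: the smallest axis-aligned box around the positive training points; a minimal sketch (names ours):

```python
import numpy as np

def rectangle_erm(X: np.ndarray, y: np.ndarray):
    """Realizable ERM for axis-rectangles: the tightest box around positives
    (zero training error under realizability). Returns a predict function."""
    pos = X[y == 1]
    if len(pos) == 0:
        return lambda Z: np.zeros(len(Z), dtype=int)
    lo, hi = pos.min(axis=0), pos.max(axis=0)
    return lambda Z: np.all((Z >= lo) & (Z <= hi), axis=1).astype(int)

def square_labels(Z):
    return ((Z > 0.3) & (Z < 0.7)).all(axis=1).astype(int)  # the 0.4 x 0.4 target

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(500, 2))
predict = rectangle_erm(X, square_labels(X))
Xtest = rng.uniform(0, 1, size=(30_000, 2))
print((predict(Xtest) != square_labels(Xtest)).mean())  # small held-out risk
```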

Growth-function verification

For each $n$, enumerate $|\mathcal{H}_{|S}|$ for half-planes (closed form $n^2 - n + 2$, exact) and rectangles (brute force over $2^n$ labelings, exact). Both classes respect the Sauer–Shelah ceiling. The empirical $\Pi_{\text{HP}}$ matches the closed form exactly; the rectangle count sits below $\binom{n}{\le 4}$ with slack that varies by sample configuration.

FTSL envelope check

Run the ERM Monte Carlo. The empirical gap is 1/10 to 1/30 of the FTSL envelope across all sample sizes, with the same $n^{-1/2}$ shape on log-log; the rectangle gap is larger than the half-plane gap by roughly the $\sqrt{4/3}$ ratio predicted by the $\sqrt{d_{\mathrm{VC}}}$ scaling.

Realizable sample-complexity verification

Both empirical curves scale as $1/\epsilon$, matching the shape of (7.1); the theoretical bound is loose by roughly an order of magnitude; and the rectangle-to-half-plane ratio holds across all $\epsilon$, consistent with the $\sqrt{d_{\mathrm{VC}}}$ scaling.

Demo — the integrative four-panel “money shot”

Figure: integrative four-panel Monte Carlo on half-planes and axis-rectangles. Panel A: empirical growth functions against their Sauer–Shelah ceilings. Panel B: FTSL envelope versus empirical worst-case gap on log-log, $n = 30$ to $1000$. Panel C: realizable sample complexity $n^\star(\epsilon)$, empirical versus theoretical, at three values of $\epsilon$. Panel D: empirical Rademacher versus the Sauer–Shelah Rademacher bound at $n = 20$ for both classes.

Panel A confirms the empirical growth functions sit below their Sauer–Shelah ceilings ($n^3$ and $n^4$ slopes). Panel B sets the FTSL envelope against the empirical worst-case gap: loose by an order of magnitude but rate-correct ($n^{-1/2}$ slope on log-log for both classes). Panel C is the realizable sample-complexity story: linear in $1/\epsilon$ as predicted. Panel D is the empirical Rademacher comparison: the empirical value sits below the Sauer–Shelah Rademacher bound at every $n$, with the gap growing as the actual $|\mathcal{H}_{|S}|$ stays well below the worst-case ceiling. Every bound is verified; the looseness is universal, but the rate is right.

ML applications

SVMs and the margin: tightening VC dimension with geometry

Half-spaces in $\mathbb{R}^d$ have $d_{\mathrm{VC}} = d + 1$: large for high-dimensional feature spaces, making the FTSL bound vacuous. SVMs sidestep the dimension dependence via margin-based VC analysis: for the margin-restricted class $\mathcal{H}_\gamma$ of half-spaces achieving margin at least $\gamma$ on data of radius $R$,

$$d_{\mathrm{VC}}(\mathcal{H}_\gamma) \le \lceil R^2/\gamma^2 \rceil,$$

independent of the ambient dimension $d$. Linear SVMs in $\mathbb{R}^d$ and kernel SVMs in infinite-dimensional RKHSs are tractable only because the margin bound replaces the dimension-dependent VC dimension with the geometry-dependent $R^2/\gamma^2$. Hinge-loss training is direct margin maximization; the soft-margin slack handles the non-separable case. The forward link is Generalization Bounds §10.

Decision trees, pruning, and capacity control

A decision tree with $N$ leaves on $d$-dimensional input has $d_{\mathrm{VC}} = O(N\log(Nd))$ (Mansour–McAllester 2000). Pruning controls VC dimension directly: cost-complexity pruning (Breiman et al. 1984) walks a nested family of trees with increasing $N$, applying the SRM machinery from Structural Risk Minimization. Random forests (Breiman 2001) and gradient-boosted trees (Friedman 2001) extend the picture with bagging and boosting respectively. The §8.3 fact that decision stumps have $d_{\mathrm{VC}} = O(\log d)$ is precisely why boosting works: a $T$-stump ensemble has $d_{\mathrm{VC}} \le T\log d$, exponentially smaller than that of a depth-$T$ tree.

ERM and the bias-complexity decomposition

The agnostic FTSL bound gives the decomposition

R(h^)R=R(h)Rapproximation error+R(h^)R(h)estimation error.R(\hat h) - R^\star = \underbrace{R(h^\star) - R^\star}_{\text{approximation error}} + \underbrace{R(\hat h) - R(h^\star)}_{\text{estimation error}}.

Approximation error decreases in dVC(H)d_{\mathrm{VC}}(\mathcal{H}) (richer classes can approximate the Bayes-optimal classifier more closely); estimation error increases in dVCd_{\mathrm{VC}} (richer classes have larger generalization gap). SRM minimizes the sum: scan nested classes H1H2\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots ordered by VC dimension and pick the index k^\hat k minimizing training error plus a VC-based capacity penalty. Full development in Structural Risk Minimization §4.
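A schematic version of that scan in code — the training errors and VC dimensions of the nested family below are hypothetical numbers chosen for illustration, and the penalty uses the agnostic-FTSL form with constants suppressed:

```python
import numpy as np

def srm_select(train_errors, vc_dims, n, delta=0.05):
    """SRM selection rule (sketch): training error plus a VC-based capacity
    penalty sqrt((d log(2n/d) + log(2/delta)) / n) per class; constants schematic."""
    penalties = [np.sqrt((d * np.log(2 * n / d) + np.log(2 / delta)) / n)
                 for d in vc_dims]
    scores = np.asarray(train_errors) + np.asarray(penalties)
    return int(np.argmin(scores)), scores

# Hypothetical nested family H_1 c H_2 c ...: richer classes fit the
# training data better but pay a larger capacity penalty.
k_hat, scores = srm_select(train_errors=[0.30, 0.18, 0.12, 0.11, 0.105],
                           vc_dims=[1, 3, 7, 15, 31], n=500)
print(k_hat, np.round(scores, 3))
```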

The deep-learning puzzle

ResNet-50 has dVC1010d_{\mathrm{VC}} \approx 10^{10}; FTSL needs 101010^{10} training samples; ImageNet provides 10610^6; ResNets nevertheless generalize well in practice. Zhang et al (2017) established three findings that sharpen this picture: memorization capacity is enormous (deep networks can fit random labels to perfect accuracy); the same networks generalize on real labels; classical VC bounds are vacuous. Three modern responses pursue different routes:

  1. Margin/norm-based bounds (Bartlett–Mendelson 2002, Bartlett–Foster–Telgarsky 2017, Neyshabur–Tomioka–Srebro 2015): replace dVCd_{\mathrm{VC}} with capacity measures scaling with weight norms or margins. The schematic bound suphR(h)R^(h)O(Wlogn/n)\sup_h |R(h) - \hat R(h)| \le O(\prod_\ell \|W_\ell\| \cdot \sqrt{\log n / n}) is informative when per-layer norms are controlled and vacuous otherwise (see the sketch after this list).
  2. Implicit-bias of SGD (Soudry et al 2018, Gunasekar et al 2017, Ji–Telgarsky 2019): the optimizer biases the trained network toward low-complexity solutions, so effective VC dimension is much smaller than parameter count.
  3. Beyond uniform convergence (Nagarajan–Kolter 2019): classical FTSL is uniform-convergence-based, but the right tool may be non-uniform algorithm-specific bounds. PAC-Bayes (PAC-Bayes Bounds) and compression bounds (Moran–Yehudayoff 2016) are canonical alternatives.
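The sketch referenced in item 1: computing the schematic product-of-spectral-norms capacity term for a toy stack of weight matrices. Everything here (the function name norm_capacity, the layer sizes, the 1/sqrt(64) scaling) is illustrative, and the published bounds carry additional margin, width, and norm-ratio factors.

```python
import numpy as np

rng = np.random.default_rng(20260513)

def norm_capacity(weights, n):
    """Schematic margin/norm capacity term: prod_l ||W_l||_2 * sqrt(log n / n)."""
    prod = 1.0
    for W in weights:
        prod *= np.linalg.norm(W, ord=2)  # spectral norm of each layer
    return prod * np.sqrt(np.log(n) / n)

# Toy 3-layer stack: the term is informative only while the per-layer
# spectral norms stay controlled; it blows up multiplicatively otherwise.
Ws = [rng.standard_normal((64, 64)) / np.sqrt(64) for _ in range(3)]
print(norm_capacity(Ws, n=10_000))
```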

The forward-pointer is Double Descent (coming soon): test error can decrease again as capacity grows past the interpolation threshold, a phenomenon classical VC analysis fundamentally cannot explain.

Demo — margin tightens the VC bound

Three curves versus the separation margin gamma from 0.05 to 0.7: the vanilla FTSL bound for half-planes (blue dashed, flat — it sees only VC dimension 3 and n = 200), the margin-based bound R^2/gamma^2 (orange, decreasing in gamma), and the empirical SVM Monte Carlo gap (red), which descends across the gamma range and tracks the margin-based shape.

At γ = 0.20: vanilla FTSL = 0.939 (flat), margin bound = 1.987, empirical = 0.0020. The margin bound tracks the empirical gap's shape; the vanilla bound doesn't see γ at all.

The FTSL bound (blue dashed) is flat — sees only dVC=3d_{\mathrm{VC}} = 3 and n=200n = 200, neither dependent on γ\gamma. The margin-based bound R2/γ2R^2/\gamma^2 (orange) decreases with γ\gamma because the effective capacity shrinks as the data become more separable. Empirical SVM Monte Carlo (red) matches the margin-based shape — the flat VC bound is genuinely the wrong capacity measure when the data have margin.

Computational notes

The 2^n enumeration ceiling

Brute-force enumeration of HS\mathcal{H}_{|S} requires 2n2^n labeling checks at O(n)O(n) cost per check. Laptop ceiling: n=16n = 16 takes milliseconds; n=20n = 20 takes 1–5 seconds; n=24n = 24 takes 30–90 seconds; n=28n = 28 takes many minutes. For in-browser visualizations, the cap is n16n \le 16. Above that, use closed-form Π(n)\Pi(n) (half-lines, intervals, half-planes via Cover 1965), Sauer–Shelah upper bounds, or symmetry-aware enumeration.
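A sketch of the brute-force protocol for axis-aligned rectangles — the O(n)O(n) check per labeling is the bounding-box test, and the diamond configuration is the standard dVC4d_{\mathrm{VC}} \ge 4 witness:

```python
import itertools
import numpy as np

def rectangles_shatter(points):
    """Brute force over all 2^n labelings (the enumeration ceiling in question).
    Axis-rectangles realize a labeling iff the bounding box of the positive
    points contains no negative point -- an O(n) check per labeling."""
    points = np.asarray(points, dtype=float)
    for labels in itertools.product([False, True], repeat=len(points)):
        mask = np.array(labels)
        if not mask.any():
            continue  # all-negative is realizable by a rectangle off the data
        lo, hi = points[mask].min(0), points[mask].max(0)
        neg = points[~mask]
        if len(neg) and np.all((neg >= lo) & (neg <= hi), axis=1).any():
            return False  # some negative point is trapped in the bounding box
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(rectangles_shatter(diamond))             # True: d_VC >= 4
print(rectangles_shatter(diamond + [(0, 0)]))  # False: the center blocks shattering
```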

Symmetry exploitation for half-spaces

Cover (1965): for nn points in general position, affine half-spaces in Rd\mathbb{R}^d realize Πhalf-space(n)=2k=0d(n1k)=O(nd)\Pi_{\text{half-space}}(n) = 2 \sum_{k=0}^{d} \binom{n-1}{k} = O(n^d) dichotomies, not 2n2^n. The constructive O(nd)O(n^d) enumeration: every distinct dichotomy is realized by a hyperplane passing through at most dd support points. For d=2d = 2, this is O(n2)O(n^2) — a sample of n=100n = 100 has Π=9,902\Pi = 9{,}902 vs 210010302^{100} \approx 10^{30}. The ceiling extends to n200n \approx 200 in R2\mathbb{R}^2.
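Cover's count as a two-line function (a sketch; the n = 100, d = 2 check reproduces the 9,902 figure above):

```python
from math import comb

def cover_count(n: int, d: int) -> int:
    """Cover (1965): number of affine-half-space dichotomies of n points
    in general position in R^d: 2 * sum_{k=0}^{d} C(n-1, k)."""
    return 2 * sum(comb(n - 1, k) for k in range(d + 1))

print(cover_count(100, 2))  # 9902 -- versus 2^100 ~ 1.3e30 for brute force
```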

Cell decomposition for axis-rectangles

O(n2d)O(n^{2d}) enumeration: every rectangle dichotomy is determined by 2d2d extreme points (axis min/max of the labeled-1 subset). For d=2d = 2: O(n4)2.6×105O(n^4) \approx 2.6 \times 10^5 candidates at n=50n = 50, versus 2502^{50} — impossible by brute force. Production visualizations should use this; the pedagogical demos in §§3.5 and 10.5 use brute force for conceptual clarity at small nn.
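A sketch of the d=2d = 2 cell-decomposition enumeration, using the notebook seed convention; every candidate box has edges on sample coordinates, so O(n4)O(n^4) boxes suffice:

```python
import numpy as np

def rectangle_dichotomies(points):
    """Enumerate the distinct axis-rectangle labelings of `points` in R^2.
    Shrinking any rectangle to the bounding box of its captured points leaves
    the labeling unchanged, so boxes with edges on sample coordinates suffice."""
    xs = np.unique(points[:, 0])
    ys = np.unique(points[:, 1])
    seen = {tuple([0] * len(points))}  # the all-negative labeling
    for x0 in xs:
        for x1 in xs[xs >= x0]:
            for y0 in ys:
                for y1 in ys[ys >= y0]:
                    inside = ((points[:, 0] >= x0) & (points[:, 0] <= x1) &
                              (points[:, 1] >= y0) & (points[:, 1] <= y1))
                    seen.add(tuple(inside.astype(int)))
    return seen

rng = np.random.default_rng(20260513)
P = rng.standard_normal((12, 2))
print(len(rectangle_dichotomies(P)))  # empirical Pi value, far below 2^12 = 4096
```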

Reproducibility and visualization caps

Seed convention: rng = np.random.default_rng(20260513) at the notebook top. Practical caps for in-browser visualizations: the shattering playground at n6n \le 6; the growth-function plotter empirical curves at n12n \le 12; the FTSL bound explorer uses closed-form bounds plus cached JSON; the empirical-shatter-check four-panel is fully precomputed and served as JSON. NumPy convention throughout: always default_rng, never the deprecated np.random.seed global state.

Connections and limits

Pseudo-dimension and fat-shattering for real-valued classes

For real-valued FRX\mathcal{F} \subseteq \mathbb{R}^\mathcal{X}, the three complexity measures generalizing VC dimension are:

  • Pseudo-dimension Pdim(F)\mathrm{Pdim}(\mathcal{F}) (Pollard 1984) — the largest nn for which some sample x1,,xnx_1, \dots, x_n and threshold vector t1,,tnt_1, \dots, t_n exist such that the thresholded binary class realizes all 2n2^n patterns (formal statement after this list). Reduces to VC dimension for binary classes. For linear regression in Rd\mathbb{R}^d, Pdim=d+1\mathrm{Pdim} = d + 1. For neural networks, Pdim\mathrm{Pdim} depends on the architecture and is polynomial in the parameter count.
  • Fat-shattering at scale γ\gamma, fatγ(F)\mathrm{fat}_\gamma(\mathcal{F}) (Bartlett–Long–Williamson 1996) — a margin constraint on the shattered set; fatγPdim\mathrm{fat}_\gamma \to \mathrm{Pdim} as γ0\gamma \to 0.
  • Covering numbers N(F,γ,n)N(\mathcal{F}, \gamma, n) (Dudley 1967, van der Vaart–Wellner 1996) — minimum number of γ\gamma-balls covering FS\mathcal{F}_{|S}; tied to fat-shattering by logN=O(fatγ/2log(n/γ))\log N = O(\mathrm{fat}_{\gamma/2} \log(n/\gamma)).
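In symbols, the pseudo-dimension definition behind the first bullet (a standard formulation):

\mathrm{Pdim}(\mathcal{F}) = \max\bigl\{\, n : \exists\, x_1, \dots, x_n \in \mathcal{X},\ t_1, \dots, t_n \in \mathbb{R} \ \text{s.t.}\ \bigl\{ \bigl(\mathbb{1}[f(x_i) > t_i]\bigr)_{i=1}^{n} : f \in \mathcal{F} \bigr\} = \{0, 1\}^n \,\bigr\}.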

Full theory in Generalization Bounds §8 — Dudley’s entropy integral, Talagrand’s contraction lemma, McDiarmid plus symmetrization for real-valued losses.

The deep-learning generalization puzzle

Zhang et al (2017) crystallized the three findings discussed in §11.4: memorization capacity is enormous, generalization on real labels is genuine, and classical VC bounds are vacuous. The conclusion is that uniform convergence is not the operative framework for deep learning — the worst-case capacity argument that powers the FTSL fails on modern models. The three modern responses sketched in §11.4 follow different routes; the forward-pointer is Double Descent (coming soon), which addresses the interpolation-threshold phenomenon directly.

Distribution-free versus distribution-dependent guarantees

The VC framework is distribution-free — universal but worst-case. Three distribution-dependent alternatives:

  • Bayes-style smoothness assumptions — the Tsybakov margin condition (§7.4 mode 3) interpolates between the 1/ϵ1/\epsilon and 1/ϵ21/\epsilon^2 rates; smoothness classes (Hölder, Sobolev) yield minimax rates. The formalStatistics nonparametric-regression topic is the entry point.
  • Algorithmic stability (Bousquet–Elisseeff 2002) — β\beta-stability gives O(β+1/n)O(\beta + \sqrt{1/n}) rate, no VC required. Ridge is β=O(1/(λn))\beta = O(1/(\lambda n))-stable. Forward link: Generalization Bounds §11.
  • Compression bounds (Littlestone–Warmuth 1986, Floyd–Warmuth 1995, Moran–Yehudayoff 2016) — reconstruction from a kk-subsample gives a klogn/n\sqrt{k \log n / n} rate. Every class of finite VC dimension admits a compression scheme of size 2O(dVC)2^{O(d_{\mathrm{VC}})} (Moran–Yehudayoff); whether size O(dVC)O(d_{\mathrm{VC}}) is always achievable remains the open sample-compression conjecture.

Where to go next

Connected formalML topics:

  • Structural Risk Minimization — uses dVCd_{\mathrm{VC}} as the canonical capacity penalty in the Vapnik SRM framework.
  • PAC-Bayes Bounds — replaces the worst-case supremum over H\mathcal{H} with a posterior average, using KL divergence as the capacity measure.
  • Generalization Bounds — the empirical-process machinery (Rademacher complexity, symmetrization, McDiarmid, Talagrand) used here without full proof.
  • PAC Learning — the parent framework for agnostic PAC, realizable PAC, and the union-bound machinery.
  • Double Descent (coming soon) — the modern overparameterized regime where VC analysis breaks down.

The VC framework’s value is not tight bounds for every ML setting; it is the structural insight that capacity is the right quantity to bound, that shattering is the right way to measure capacity, and that dVClog(n/dVC)d_{\mathrm{VC}} \log(n/d_{\mathrm{VC}}) is the right substitute for logH\log|\mathcal{H}| in the union-bound argument. Every modern alternative — margin/norm bounds, PAC-Bayes, implicit-bias analysis, compression schemes — is built on this foundation.

We started by asking for the right combinatorial substitute for H|\mathcal{H}| when the cardinality counts the wrong thing. The answer is the VC dimension and its growth-function machinery. Sauer–Shelah converts the integer dVCd_{\mathrm{VC}} into polynomial growth, the FTSL converts polynomial growth into an O(dlogn/n)O(\sqrt{d \log n / n}) uniform-convergence bound, and the equivalence theorem ties finite VC dimension to PAC-learnability in both directions. The topic delivers the foundation; tighter bounds for modern overparameterized models require the margin/norm refinements, PAC-Bayes, or implicit-bias analysis from connected topics.

Connections

  • PAC-learning sets up the (ε, δ)-framework and the finite-class union-bound; this topic supplies the combinatorial machinery (shattering, growth function, VC dimension) that extends finite-class arguments to infinite classes. The Fundamental Theorem of Statistical Learning equivalence we prove in §6 ties finite VC dimension to agnostic PAC-learnability in both directions. pac-learning
  • The §9 bridge to Rademacher complexity feeds directly into the Massart, McDiarmid, and Talagrand machinery developed in generalization-bounds; finite VC dimension is the discrete capacity precursor of the real-valued empirical-process bounds shown there. The §14.2 VC-dimension stub in generalization-bounds is discharged by this topic. generalization-bounds
  • VC dimension is the canonical capacity functional in the Vapnik SRM penalty (SRM §4); the §3 phase-transition picture and §5 Sauer–Shelah bound are the substrate that makes the SRM oracle inequality nontrivial for infinite classes. The §11.3 bias-complexity decomposition is the SRM motivation in compressed form. structural-risk-minimization
  • Hoeffding's inequality from concentration-inequalities is one of the two ingredients in the FTSL proof — the other being polynomial growth via Sauer–Shelah. The §7.3 Bernstein-vs-Hoeffding rate-gap discussion is the variance-aware concentration story that explains the realizable vs agnostic split. concentration-inequalities

References & Further Reading