
VC Dimension

Combinatorial capacity, Sauer–Shelah, and uniform convergence for binary hypothesis classes

Motivation: why finite-class arguments break for infinite hypothesis classes

The PAC Learning topic developed the sharpest tool we have for finite hypothesis classes: take Hoeffding’s inequality on a single hypothesis, apply the union bound over all $|\mathcal{H}|$ of them, and a sample size of $n = O((\log|\mathcal{H}| + \log(1/\delta))/\epsilon^2)$ suffices for uniform convergence at level $\epsilon$ with confidence $1-\delta$. The bound is tight in $|\mathcal{H}|$, transparent in its derivation, and useless the moment we try to use it on anything we’d want to learn: every real-parameter class has $|\mathcal{H}| = \infty$, and $\log\infty$ shuts the argument down. The question for this whole topic: what is the right combinatorial object to put in place of $|\mathcal{H}|$ when the cardinality counts the wrong thing?

The finite-class union-bound recap

Fix the standard setup. We have a labeled sample $S = ((X_1, Y_1), \dots, (X_n, Y_n))$ drawn i.i.d. from a distribution $D$ on $\mathcal{X} \times \{0,1\}$, a binary hypothesis class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$, the population risk $R(h) = \Pr_D[h(X) \ne Y]$, and the empirical risk $\hat R(h) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[h(X_i) \ne Y_i]$. Hoeffding’s inequality applied to the $n$ Bernoulli indicators gives $\Pr(|R(h) - \hat R(h)| > \epsilon) \le 2\exp(-2n\epsilon^2)$ for any single fixed $h$. The PAC framework wants this uniformly over the whole class, so we union-bound across $\mathcal{H}$:

$$\Pr\bigl(\exists h \in \mathcal{H} : |R(h) - \hat R(h)| > \epsilon\bigr) \le 2\,|\mathcal{H}|\exp(-2n\epsilon^2).$$

Setting the right-hand side to $\delta$ and solving for $n$ gives the canonical finite-class sample-complexity bound

$$n \ge \frac{\log|\mathcal{H}| + \log(2/\delta)}{2\epsilon^2}, \quad (1.1)$$

the headline result of PAC Learning §3. The logarithm is the union bound’s signature: every extra hypothesis costs us a $\log$ in the sample size. The bound stays usable for surprisingly large finite classes (billions of hypotheses cost twenty-something samples), but it requires $|\mathcal{H}| < \infty$.
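
A quick numerical check of (1.1) makes the logarithm’s mildness concrete; a minimal sketch in Python (the helper name is ours):

```python
import math

def finite_class_sample_size(card_H: int, eps: float, delta: float) -> int:
    """Finite-class bound (1.1): n >= (log|H| + log(2/delta)) / (2 eps^2)."""
    return math.ceil((math.log(card_H) + math.log(2 / delta)) / (2 * eps**2))

# A billion hypotheses at eps = 0.1, delta = 0.05 ...
print(finite_class_sample_size(10**9, 0.1, 0.05))  # 1221
# ... costs only ~4x more samples than ten hypotheses:
print(finite_class_sample_size(10, 0.1, 0.05))     # 300
```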

What |H| should “really” mean

Take the simplest infinite class: half-lines on the real line, $\mathcal{H}_{\text{half-line}} = \{h_a : a \in \mathbb{R}\}$ with $h_a(x) = \mathbb{1}[x \ge a]$. There are uncountably many hypotheses and bound (1.1) instantly degenerates: $\log|\mathcal{H}_{\text{half-line}}| = \infty$ forces $n = \infty$. Yet half-lines are intuitively a small class: one real degree of freedom, so a handful of points should suffice. The bound is counting something it shouldn’t be counting.

The issue is that the union bound treats every $h_a$ as a fresh draw from concentration’s perspective, but on any fixed sample $S$ most of those thresholds are behaviorally indistinguishable. Two thresholds $a, a'$ in the same gap between consecutive sample points produce the same labels on $S$. The hypotheses are different as functions on $\mathbb{R}$, but identical as labelings of $S$. If $\mathcal{H}_{\text{half-line}}$ has uncountably many functions but only $n + 1$ distinct labelings of any $n$-point sample, the right effective cardinality for the union bound is $n + 1$, not $\infty$.

The right object: labelings on a fixed sample, not functions on the input space

For a sample $S = (x_1, \dots, x_n)$ and a class $\mathcal{H}$, the restriction of $\mathcal{H}$ to $S$ is

$$\mathcal{H}_{|S} = \bigl\{(h(x_1), \dots, h(x_n)) : h \in \mathcal{H}\bigr\} \subseteq \{0,1\}^n. \quad (1.2)$$

Three immediate consequences: $|\mathcal{H}_{|S}| \le 2^n$; $|\mathcal{H}_{|S}| \le |\mathcal{H}|$; and the latter inequality can be enormously loose. The hope is to replace $|\mathcal{H}|$ in the union bound with some worst-case-over-$S$ version of $|\mathcal{H}_{|S}|$, the growth function $\Pi_\mathcal{H}(n)$ of §3. If $\Pi_\mathcal{H}(n)$ is polynomial in $n$ even when $|\mathcal{H}|$ is infinite, the union bound buys a finite sample complexity.
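
Computing the restriction directly for half-lines shows the collapse from uncountably many functions to at most $n + 1$ labelings; a minimal sketch (helper name is ours):

```python
import numpy as np

def halfline_restriction(sample: np.ndarray) -> set:
    """All distinct labelings of `sample` by h_a(x) = 1[x >= a].
    Only one threshold per gap between consecutive sorted points
    (plus two sentinels) can matter, so enumerate those."""
    xs = np.sort(sample)
    cuts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    return {tuple((sample >= a).astype(int)) for a in cuts}

rng = np.random.default_rng(0)
S = rng.uniform(0, 1, size=20)
print(len(halfline_restriction(S)))  # 21 = n + 1, however fine the threshold grid
```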

Roadmap: shattering, growth, Sauer–Shelah, uniform convergence

The four-step ladder this topic climbs: (1) shattering in §2: given $S$ of size $n$, $\mathcal{H}$ shatters $S$ when $|\mathcal{H}_{|S}| = 2^n$; (2) the growth function in §3: $\Pi_\mathcal{H}(n) = \max_{|S|=n} |\mathcal{H}_{|S}|$; (3) the VC dimension in §4: $d_{\mathrm{VC}}(\mathcal{H})$, the largest $n$ for which $\Pi_\mathcal{H}(n) = 2^n$; (4) the Sauer–Shelah lemma and the FTSL in §§5–6: finite VC dimension forces polynomial growth, which combined with Hoeffding via a doubled-sample union bound gives the agnostic uniform-convergence bound $|R(\hat h) - \hat R(\hat h)| = O(\sqrt{d_{\mathrm{VC}} \log(n/d_{\mathrm{VC}})/n})$. The shape of the ladder is the entire story in one substitution: $\log|\mathcal{H}|$ becomes $d_{\mathrm{VC}} \log(n/d_{\mathrm{VC}})$.

Demo — log cardinality blows up; restricted cardinality plateaus

Take an $n$-point sample $S$ on $[0,1]$ and discretized half-line classes $\mathcal{H}_k = \{h_a : a \in \{1/k, 2/k, \dots, k/k\}\}$ with $k$ thresholds. As $k \to \infty$, $|\mathcal{H}_k| \to \infty$ (red, diverging on the log-log plot) and the finite-class bound (1.1) degenerates. But $|\mathcal{H}_{k|S}|$ saturates at exactly $n+1$ (blue plateau) as soon as $k$ is large enough to put a threshold in every gap. The red curve is the wrong quantity for learning theory; the blue curve is the right one.

Figure: discretized half-line cardinality (red) diverges as $k$ grows; the restricted cardinality on a 20-point sample (blue) plateaus at $n + 1 = 21$ once every gap between sample points contains a threshold.

The takeaway is exactly the diagnosis from §1.2: refining $\mathcal{H}$ adds hypotheses, but doesn’t add information that distinguishes them on $S$. The behaviorally distinct hypotheses on $S$, the dichotomies of §2, are what matter.

Shattering: the combinatorial substitute for |H|

Restriction to S: dichotomies

The set of labelings $\mathcal{H}_{|S} \subseteq \{0,1\}^n$ produced by $\mathcal{H}$ on $S$ is also called the set of dichotomies (Vapnik–Chervonenkis 1971). Three properties: $\mathcal{H}_{|S} \subseteq \{0,1\}^n$, so $|\mathcal{H}_{|S}| \le 2^n$; $|\mathcal{H}_{|S}| \le |\mathcal{H}|$; and the second bound can be loose by an unbounded factor.

Definition: H shatters S

Definition 1 (Shattering).

A class $\mathcal{H}$ shatters a sample $S = (x_1, \dots, x_n)$ when $\mathcal{H}_{|S} = \{0,1\}^n$, i.e., every $b \in \{0,1\}^n$ is realized by some $h \in \mathcal{H}$.

Two pieces of commentary: shattering is a property of the pair $(\mathcal{H}, S)$, not of $\mathcal{H}$ or $S$ alone; and the definition is existential in $h$: for each target labeling, we need some witnessing hypothesis.
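
Shattering is mechanically checkable once the class restricts to finitely many distinct hypotheses on $S$; a minimal sketch for half-lines (function names are ours):

```python
def shatters(sample, hypotheses) -> bool:
    """True iff the hypothesis list realizes all 2^n labelings of `sample`."""
    labelings = {tuple(int(h(x)) for x in sample) for h in hypotheses}
    return len(labelings) == 2 ** len(sample)

def halfline_hypotheses(sample):
    """One canonical threshold per gap between sorted points, plus sentinels."""
    xs = sorted(sample)
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return [lambda x, a=a: x >= a for a in cuts]

S1, S2 = [0.3], [0.3, 0.7]
print(shatters(S1, halfline_hypotheses(S1)))  # True: one point is shattered
print(shatters(S2, halfline_hypotheses(S2)))  # False: (1, 0) is unrealizable
```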

The three-point picture: half-lines vs half-planes

Three configurations across two classes paint the picture:

  • A. Half-lines on three collinear 1D points $\{1, 2, 3\}$: only the 4 threshold patterns $(0,0,0), (0,0,1), (0,1,1), (1,1,1)$ are realized. The labelings $(1,0,0), (0,1,0), (1,0,1), (1,1,0)$ are unrealizable. Half-lines do not shatter.
  • B. Half-planes on three collinear 2D points $(1,0), (2,0), (3,0)$: same 4 patterns; half-planes on collinear points behave like half-lines on the projection.
  • C. Half-planes on three triangle vertices $(0,0), (1,0), (0.5,1)$: all 8 labelings realized, each separation achieved by a half-plane through edges or vertices. Half-planes shatter triangles.
  • D. Half-planes on any 4 points in the plane: previewed here; the full proof goes through §8.1 via Radon’s theorem.

Shattering depends on the configuration, not just the cardinality

Cases A–C carry the §2 punchline: shattering is not a function of $|S|$ alone. To prove $d_{\mathrm{VC}}(\mathcal{H}) = d$, we’ll need two pieces: a constructive direction (exhibit some $d$-point set that $\mathcal{H}$ shatters: an existence claim, usually easy) and a destructive direction (prove that every $(d+1)$-point set has some unrealizable labeling: a universal claim, usually hard).

Demo — the shatter atlas

A 3 × 8 grid: rows for the three (configuration, class) cases, columns for the 8 elements of $\{0,1\}^3$, each cell marked with a realizing classifier when one exists. Rows A and B each tally 4 of 8; row C tallies 8 of 8 and shatters.

Figure: shatter atlas. Row A: half-lines on three collinear 1D points realize 4 of 8 labelings (the monotone prefix patterns). Row B: half-planes on three collinear 2D points realize the same 4. Row C: half-planes on a triangle realize all 8 and shatter.

The atlas makes the §2.4 asymmetry concrete. Rows A and B each top out at 4 of the 8 labelings; row C realizes all 8. Drag a point in row B onto the line $y = x$ and the row jumps from 4 to 8: shattering switched on by a one-pixel move. The class alone doesn’t determine shattering; the configuration matters every bit as much.

The growth function Π(n)

Definition: maximum number of labelings on an n-point sample

Definition 2 (Growth function).

The growth function of $\mathcal{H}$ is

$$\Pi_\mathcal{H}(n) = \max_{S \subseteq \mathcal{X},\, |S| = n} |\mathcal{H}_{|S}|. \quad (3.1)$$

It is a function of $n$ alone, and monotone non-decreasing in $n$ (adding a point to a sample can only preserve or increase the number of distinct labelings).

The trivial bound, and the shattering equivalence

Proposition 1 (Trivial bound and shattering).

$\Pi_\mathcal{H}(n) \le 2^n$, with equality iff some $n$-point sample is shattered.

The proposition converts shattering into an algebraic question and sets up the VC dimension as “the largest $n$ for which the bound is tight.”

Closed-form growth functions for the canonical classes

Four canonical classes have explicitly computable growth functions:

  • Half-lines on $\mathbb{R}$: $\Pi(n) = n + 1$. The threshold falls in one of the $n+1$ gaps determined by the sorted sample points.
  • Intervals on $\mathbb{R}$: $\Pi(n) = \binom{n+1}{2} + 1$. Each non-empty interval picks a contiguous run of labels (verified by brute force in the sketch below).
  • Half-planes in $\mathbb{R}^2$: $\Pi(n) = n^2 - n + 2$ (Cover 1965), derived from the recursion $C(n+1) = C(n) + 2n$ on the number of regions cut by $n$ lines in general position, summed with the two trivial all-0 and all-1 dichotomies.
  • Axis-rectangles in $\mathbb{R}^2$: no closed form, but $\Pi(n) = O(n^4)$ via the bounding-box argument: every rectangle dichotomy is determined by its four extreme points along the two axes.
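
The interval closed form is easy to confirm by enumerating canonical intervals, one per pair of cut points; a minimal sketch (names are ours):

```python
import numpy as np
from math import comb

def interval_restriction_size(sample: np.ndarray) -> int:
    """Distinct labelings of `sample` by indicators of intervals [l, r],
    with canonical endpoints at midpoints between sorted points."""
    xs = np.sort(sample)
    cuts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    labelings = {tuple(((sample >= l) & (sample <= r)).astype(int))
                 for i, l in enumerate(cuts) for r in cuts[i:]}
    return len(labelings)

rng = np.random.default_rng(0)
for n in (3, 5, 8):
    S = rng.uniform(0, 1, size=n)
    print(n, interval_restriction_size(S), comb(n + 1, 2) + 1)  # counts agree
```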

The phase transition: exponential below the VC dimension, polynomial above

Each canonical class tracks $2^n$ until $n = d_{\mathrm{VC}}$, then peels off to polynomial. Half-lines peel at $n = 2$, intervals at $n = 3$, half-planes at $n = 4$, rectangles at $n = 5$. The polynomial degree is at most $d_{\mathrm{VC}}$ (half-planes, with quadratic growth despite $d_{\mathrm{VC}} = 3$, show it can be strictly smaller). The phase transition is what Sauer–Shelah (§5) upgrades to a theorem.

Figure: growth-function gallery on log scale for half-lines, intervals, half-planes, and axis-rectangles, $n = 1$ to $16$, with the trivial $2^n$ bound and the Sauer–Shelah binomial-sum bounds overlaid; arrows mark the phase transition at $n = d_{\mathrm{VC}} + 1$ for each class.

The picture matches the §3.4 prediction exactly: each curve tracks $2^n$ on the log scale until $n = d_{\mathrm{VC}}$, then peels off to polynomial. The Sauer–Shelah binomial-sum bounds (dotted) sit above each class’s true growth function: loose, but of the right polynomial character.

The VC dimension

Definition: largest shattered set

Definition 3 (VC dimension).

The VC dimension of $\mathcal{H}$ is

$$d_{\mathrm{VC}}(\mathcal{H}) = \max\bigl\{n \in \mathbb{N} : \Pi_\mathcal{H}(n) = 2^n\bigr\},$$

or $+\infty$ if the maximum does not exist. Conventions: $d_{\mathrm{VC}}(\emptyset) = -\infty$ and $d_{\mathrm{VC}}(\{h\}) = 0$.

The VC dimension is integer-valued (or $+\infty$); for binary classes it is the canonical capacity notion, and the real-valued generalizations of §9.4 reduce back to it.

The two-sided proof template

To prove $d_{\mathrm{VC}}(\mathcal{H}) = d$, we need two ingredients: (1) constructively exhibit some $d$-point set that $\mathcal{H}$ shatters, and (2) destructively prove that every $(d+1)$-point set has some unrealizable labeling. The destructive direction is universal, quantified over all $(d+1)$-point samples, and is where Radon’s theorem (§8.1) and pigeonhole arguments (§8.2) earn their keep.

Worked examples: half-lines, intervals, half-planes

  • $d_{\mathrm{VC}}(\mathcal{H}_{\text{half-line}}) = 1$. Constructive: a single point is shattered (one labeling each way). Destructive: any two points $x_1 < x_2$ have the unrealizable labeling $(1, 0)$: no threshold realizes “1 then 0.”
  • $d_{\mathrm{VC}}(\mathcal{H}_{\text{interval}}) = 2$. Constructive: any two points are shattered via four intervals. Destructive: any three points $x_1 < x_2 < x_3$ have the unrealizable labeling $(1, 0, 1)$: an interval is a contiguous set, so it cannot label the endpoints “in” and the middle “out.”
  • $d_{\mathrm{VC}}(\mathcal{H}_{\text{half-plane}}) = 3$. Constructive: a triangle is shattered (§2.5 row C). Destructive: every 4-point configuration has an unrealizable labeling, following from Radon’s theorem in §8.1.

Pathological cases: H_sin has 1 parameter but infinite VC dimension

The sine-wave family $\mathcal{H}_{\sin} = \{h_\omega(x) = \mathbb{1}[\sin(\omega x) \ge 0] : \omega \in \mathbb{R}\}$ has one real parameter but $d_{\mathrm{VC}}(\mathcal{H}_{\sin}) = \infty$. The construction (Vapnik 1998 §3.4) uses geometrically spaced points $x_i = 2^{-i}$; for any labeling $b \in \{0,1\}^n$, the choice $\omega = \pi\bigl(1 + \sum_{i : b_i = 0} 2^i\bigr)$ realizes $b$. Parameter count is not a reliable proxy for VC dimension.
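
The construction is easy to sanity-check numerically; a minimal sketch (assuming the 1-based indexing above):

```python
import math
from itertools import product

n = 8
xs = [2.0 ** -(i + 1) for i in range(n)]  # x_i = 2^{-i} for i = 1..n

for b in product([0, 1], repeat=n):       # every labeling in {0,1}^n
    omega = math.pi * (1 + sum(2 ** (i + 1) for i, bi in enumerate(b) if bi == 0))
    realized = tuple(int(math.sin(omega * x) >= 0) for x in xs)
    assert realized == b

print(f"all {2**n} labelings realized on {n} points")  # the set is shattered
```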

Demo — destructive direction for each canonical class

Figure: the smallest unrealizable labeling for each canonical class (blue circles labeled 1, red circles labeled 0): half-lines fail (1,0) on two points (monotonicity); intervals fail (1,0,1) on three points (connectivity); half-planes fail the XOR labeling on the unit square (convex hull); axis-rectangles fail four-cardinals-versus-origin on five points (bounding box).

The four obstructions are exemplars of the four most common destructive-direction arguments in VC theory: monotonicity (no realization of decreasing patterns), connectivity (no disconnected positive set), convex-hull (no separation across the hull’s interior), and bounding-box (no positive set excluding an interior point).

The Sauer–Shelah lemma

The numerical phase transition in §3.4 told us that finite VC dimension appears to force polynomial growth of $\Pi_\mathcal{H}(n)$. The Sauer–Shelah lemma is that observation upgraded to a theorem. Two independent discoveries, Sauer (1972) and Shelah (1972), with antecedents in Vapnik–Chervonenkis (1971), established the bound; the modern proof we give here is due to Frankl–Pach (1983) and Pajor (1985), using a clever operation on $\{0,1\}^n$ called the shifting operator.

Statement: finite VC dimension forces polynomial growth

Theorem 1 (Sauer–Shelah).

Let $\mathcal{H}$ have $d_{\mathrm{VC}}(\mathcal{H}) = d < \infty$. Then for every $n \ge 1$,

$$\Pi_\mathcal{H}(n) \le \sum_{i=0}^{d} \binom{n}{i}. \quad (5.1)$$

Consequently $\Pi_\mathcal{H}(n) \le (en/d)^d$ for $n \ge d$.

The binomial-sum form (5.1) is tight in general, i.e., some classes achieve it; the $(en/d)^d$ closed form (Corollary 1) is the convenient version for downstream applications. For any specific class it can be loose (half-planes have $\Pi(n) = n^2 - n + 2$ against the Sauer–Shelah ceiling $\binom{n}{\le 3}$), but it is universal across all classes with $d_{\mathrm{VC}} = d$.

The shifting operator

Identify $\mathcal{H}$ with its restriction $\mathcal{H}_{|S} \subseteq \{0,1\}^n$ on a fixed sample $S$ of size $n$.

Definition 4 (Shifting operator).

For $i \in \{1, \dots, n\}$, the shifting operator $S_i(\mathcal{H}) \subseteq \{0,1\}^n$ is constructed as follows. For each $v \in \mathcal{H}$ with $v_i = 1$, let $v'$ denote the down-shifted vector obtained by flipping $v_i$ to $0$. Replace $v$ with $v'$ in $S_i(\mathcal{H})$ if $v' \notin \mathcal{H}$; otherwise leave both $v$ and $v'$ in place.

Lemma 1 (Size preservation).

$|S_i(\mathcal{H})| = |\mathcal{H}|$.

Proof.

The map $v \mapsto v'$ on “lonely” vectors (those with $v_i = 1$ whose down-shift $v'$ is not in $\mathcal{H}$) is a bijection between the vectors leaving $\mathcal{H}$ and the vectors entering $S_i(\mathcal{H})$; all other vectors stay put.

Lemma 2 (VC dimension preservation).

If $S_i(\mathcal{H})$ shatters a set $T \subseteq \{1, \dots, n\}$, then $\mathcal{H}$ shatters $T$.

Proof.

Suppose $S_i(\mathcal{H})$ shatters $T$. For each $u \in \{0,1\}^T$, we must produce a vector $v \in \mathcal{H}$ with $v|_T = u$. By shattering, there is some $w \in S_i(\mathcal{H})$ with $w|_T = u$. Two cases.

Case A: $w \in \mathcal{H}$. Take $v = w$. Done.

Case B: $w \in S_i(\mathcal{H}) \setminus \mathcal{H}$. Then $w$ is a down-shifted vector: its $i$-th coordinate is $0$, and there exists $v^\star \in \mathcal{H}$ with $v^\star_i = 1$ and $v^\star_j = w_j$ for $j \ne i$.

  • Sub-case B1: $i \notin T$. Take $v = v^\star \in \mathcal{H}$; since the disagreement is at coordinate $i \notin T$, we have $v^\star|_T = w|_T = u$.

  • Sub-case B2: $i \in T$ and $u_i = 0$. We need $v \in \mathcal{H}$ with $v_i = 0$ and $v|_T = u$. Apply the shattering of $T$ by $S_i(\mathcal{H})$ to the flipped labeling $u'$ defined by $u'_i = 1$ and $u'_j = u_j$ for $j \ne i$. The witness is $w^\dagger \in S_i(\mathcal{H})$ with $w^\dagger|_T = u'$, so $w^\dagger_i = 1$. Now $w^\dagger$ cannot have been produced by a down-shift (a down-shift would set the $i$-th coordinate to $0$, not $1$), so $w^\dagger \in \mathcal{H}$ directly. Consider the down-shift $w^{\dagger\prime}$ of $w^\dagger$: it has $w^{\dagger\prime}_i = 0 = u_i$ and $w^{\dagger\prime}|_{T \setminus \{i\}} = w^\dagger|_{T \setminus \{i\}} = u|_{T \setminus \{i\}}$. We claim $w^{\dagger\prime} \in \mathcal{H}$. Suppose not: then $w^\dagger$ would be “lonely” in $\mathcal{H}$ (its down-shift would be missing), so the shifting operator $S_i$ would replace $w^\dagger$ with $w^{\dagger\prime}$ in $S_i(\mathcal{H})$, putting $w^{\dagger\prime} \in S_i(\mathcal{H})$ and removing $w^\dagger$ from $S_i(\mathcal{H})$. But $w^\dagger \in S_i(\mathcal{H})$ by hypothesis, a contradiction. Therefore $w^{\dagger\prime} \in \mathcal{H}$, and $v = w^{\dagger\prime}$ realizes $u$ on $T$.

  • Sub-case B2 with $u_i = 1$ would require $v_i = 1$, and $v = v^\star$ from Case B works directly.

The Sub-case B2 contradiction is the only delicate moment in the entire proof: we need $w^{\dagger\prime}$ to live in $\mathcal{H}$, and the only way to extract that fact is to suppose otherwise and watch the shifting operator’s definition force a contradiction with $w^\dagger \in S_i(\mathcal{H})$.

Full proof of Sauer–Shelah

Proof.

Five steps.

Step 1. Apply the shifting operators $S_1, S_2, \dots, S_n$ iteratively, cycling until no further changes occur. The total weight $\sum_v \sum_i v_i$ decreases by at least $1$ whenever a shift moves a lonely vector down, and is bounded below by $0$, so the process terminates in finitely many cycles. Call the resulting class $\mathcal{H}^\star$ (the “stable class”). By construction, $S_i(\mathcal{H}^\star) = \mathcal{H}^\star$ for every $i$.

Step 2. Lemma 1 applied iteratively: $|\mathcal{H}^\star| = |\mathcal{H}_{|S}|$. Lemma 2 applied iteratively: $d_{\mathrm{VC}}(\mathcal{H}^\star) \le d_{\mathrm{VC}}(\mathcal{H}) = d$ (if $\mathcal{H}^\star$ shattered some set, so would $\mathcal{H}$).

Step 3: the stable class is down-closed. Suppose $v \in \mathcal{H}^\star$ has $v_i = 1$. By stability $S_i(\mathcal{H}^\star) = \mathcal{H}^\star$, so the down-shift $v'$ must be in $\mathcal{H}^\star$: otherwise $S_i$ would replace $v$ with $v'$, contradicting stability. Iterating one coordinate at a time, every $u \le v$ (coordinatewise) lies in $\mathcal{H}^\star$.

Step 4: count by support sizes. For any $v \in \mathcal{H}^\star$, every subset $U \subseteq \mathrm{supp}(v) = \{i : v_i = 1\}$ has its indicator $\mathbb{1}_U \in \mathcal{H}^\star$ by down-closedness. So the restriction $\mathcal{H}^\star_{|\mathrm{supp}(v)}$ contains all $2^{|\mathrm{supp}(v)|}$ labelings: $\mathcal{H}^\star$ shatters $\mathrm{supp}(v)$. By Step 2, this forces $|\mathrm{supp}(v)| \le d$.

Step 5. The number of $v \in \{0,1\}^n$ with $|\mathrm{supp}(v)| \le d$ is exactly $\sum_{i=0}^d \binom{n}{i}$. So $|\mathcal{H}_{|S}| = |\mathcal{H}^\star| \le \binom{n}{\le d}$. Taking the maximum over $S$ gives $\Pi_\mathcal{H}(n) \le \binom{n}{\le d}$.

The shifting trick does two pieces of work simultaneously: Lemma 1 conserves the count we want to bound, and Lemma 2 conserves the VC dimension. The stable class is structurally much simpler, a down-closed family of subsets, and is easy to count by support size.
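
The whole argument is small enough to execute; a minimal sketch that stabilizes a random class and checks size preservation, down-closedness, and the binomial-sum count (all names ours):

```python
from math import comb
import random

def shift(H: set, i: int) -> set:
    """One application of the shifting operator S_i to a set of 0/1 tuples."""
    out = set()
    for v in H:
        if v[i] == 1:
            down = v[:i] + (0,) + v[i + 1:]
            out.add(down if down not in H else v)  # lonely vectors shift down
        else:
            out.add(v)
    return out

def stabilize(H: set, n: int) -> set:
    """Cycle S_1..S_n to a fixed point; the total weight strictly decreases."""
    changed = True
    while changed:
        changed = False
        for i in range(n):
            H2 = shift(H, i)
            if H2 != H:
                H, changed = H2, True
    return H

n = 10
random.seed(0)
cube = [tuple((x >> j) & 1 for j in range(n)) for x in range(2 ** n)]
H = set(random.sample(cube, 200))
Hstar = stabilize(H, n)
assert len(Hstar) == len(H)                        # Lemma 1, iterated
assert all(v[:i] + (0,) + v[i + 1:] in Hstar       # Step 3: down-closed
           for v in Hstar for i in range(n) if v[i])
d = max(sum(v) for v in Hstar)                     # largest support in H*
print(len(H), "<=", sum(comb(n, i) for i in range(d + 1)))  # Step 5 count
```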

The (en/d)^d closed form

Corollary 1 (The (en/d)^d bound).

For $n \ge d \ge 1$,

$$\sum_{i=0}^d \binom{n}{i} \le (en/d)^d.$$
Proof.

Since $d/n \le 1$, every term with $i \le d$ satisfies $\binom{n}{i} \le (n/d)^d \binom{n}{i} (d/n)^i$; summing and then extending the sum over all $i$:

$$\sum_{i \le d} \binom{n}{i} \le (n/d)^d \sum_{i \le d} \binom{n}{i} (d/n)^i \le (n/d)^d \sum_{i=0}^n \binom{n}{i} (d/n)^i = (n/d)^d (1 + d/n)^n.$$

The standard inequality $(1 + x/n)^n \le e^x$ at $x = d$ gives $(1 + d/n)^n \le e^d$. Combining: $\sum_{i \le d} \binom{n}{i} \le (n/d)^d e^d = (en/d)^d$.

The slack between $\binom{n}{\le d}$ and $(en/d)^d$ settles, for fixed $d$ as $n$ grows, at $e^d d!/d^d \approx \sqrt{2\pi d}$ by Stirling: constant in $n$ and growing only like $\sqrt{d}$. The closed form gives the right asymptotic order in $n$ and the right exponent in $d$.
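
The limiting ratio is quick to verify numerically (this is the Stirling factor discussed above; names ours):

```python
from math import comb, e, factorial, pi, sqrt

def binom_sum(n: int, d: int) -> int:
    return sum(comb(n, i) for i in range(d + 1))

for d in (1, 3, 5, 10):
    n = 200 * d                                # n >> d, so C(n, d) dominates
    ratio = (e * n / d) ** d / binom_sum(n, d)
    exact = e**d * factorial(d) / d**d         # limit of the ratio as n grows
    print(d, round(ratio, 2), round(exact, 2), round(sqrt(2 * pi * d), 2))
```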

Demo — the three bounds compared

Figure: left, the binomial-sum bound and the $(en/d)^d$ closed form against the trivial $2^n$ bound on log scale, for $d \in \{1, 2, 3, 4, 5, 10\}$; right, the ratio of the closed form to the binomial sum.

The left panel shows the bounds themselves on log scale: each $\binom{n}{\le d}$ sits below $2^n$ for every $n > d$, and the $(en/d)^d$ closed form sits above it. The right panel tracks the Stirling-derived gap: the ratio of closed form to binomial sum settles onto $e^d d!/d^d \approx \sqrt{2\pi d}$, the multiplicative slack we accept in exchange for a clean polynomial.

The Fundamental Theorem of Statistical Learning

Hoeffding for one $h$ + symmetrization + union bound over labelings + Sauer–Shelah substitution → the agnostic uniform-convergence bound. The result is one direction of the equivalence theorem that Blumer–Ehrenfeucht–Haussler–Warmuth (1989) named the Fundamental Theorem of Statistical Learning.

Setup: agnostic PAC, ERM, and the worst-case gap

For $S \sim D^n$, hypothesis class $\mathcal{H}$, and 0-1 loss: $R(h) = \Pr_D[h(X) \ne Y]$ and $\hat R(h) = (1/n)\sum_i \mathbb{1}[h(X_i) \ne Y_i]$. The empirical risk minimizer is $\hat h \in \arg\min_h \hat R(h)$. The worst-case generalization gap is

$$\Phi(S) = \sup_{h \in \mathcal{H}} |R(h) - \hat R(h)|.$$

A high-probability bound on $\Phi$ gives ERM’s agnostic guarantee $R(\hat h) \le R(h^\star) + 2\epsilon$, where $h^\star \in \arg\min_h R(h)$ is the population-best hypothesis.

Symmetrization: from population risk to a doubled empirical risk

Lemma 3 (Symmetrization (Vapnik–Chervonenkis 1971)).

For $\epsilon > 0$ and $n \ge 8/\epsilon^2$,

$$\Pr_S\bigl[\Phi(S) > \epsilon\bigr] \le 2\,\Pr_{S,S'}\!\Bigl[\sup_{h \in \mathcal{H}} |\hat R_S(h) - \hat R_{S'}(h)| > \epsilon/2\Bigr]. \quad (6.1)$$
Proof.

Fix $h$ with $|R(h) - \hat R_S(h)| > \epsilon$. Chebyshev’s inequality applied to the empirical risk on the ghost sample $S'$ gives

$$\Pr_{S'}\bigl[|\hat R_{S'}(h) - R(h)| > \epsilon/2\bigr] \le \frac{4}{n\epsilon^2} \le \frac{1}{2}$$

for $n \ge 8/\epsilon^2$. So with probability $\ge 1/2$ over $S'$, the ghost-sample risk is within $\epsilon/2$ of $R(h)$. Combined with $|R(h) - \hat R_S(h)| > \epsilon$, the triangle inequality yields $|\hat R_S(h) - \hat R_{S'}(h)| > \epsilon/2$. The factor 2 on the right absorbs the “with probability $\ge 1/2$” step. The full derivation is in Vapnik (1998) Theorem 4.1.

The payoff: the right side of (6.1) is a statement about empirical averages on $S \cup S'$ of size $2n$, where only the $\Pi_\mathcal{H}(2n)$ distinct labelings matter: finitely many, by Sauer–Shelah.

The agnostic uniform-convergence bound

Theorem 2 (FTSL upper bound).

Let $d_{\mathrm{VC}}(\mathcal{H}) = d < \infty$. For $\delta \in (0,1)$ and $n \ge \max(8/\epsilon^2, d)$, with probability at least $1 - \delta$ over $S \sim D^n$,

$$\sup_{h \in \mathcal{H}} |R(h) - \hat R(h)| \le \sqrt{\frac{8\bigl(d \log(2en/d) + \log(4/\delta)\bigr)}{n}}. \quad (6.2)$$
Proof.

From symmetrization (6.1),

$$\Pr_S[\Phi > \epsilon] \le 2\,\Pr_{S,S'}\!\Bigl[\sup_h |\hat R_S(h) - \hat R_{S'}(h)| > \epsilon/2\Bigr].$$

The supremum on the right involves at most $\Pi_\mathcal{H}(2n)$ distinct random variables, one for each labeling of $S \cup S'$. Union-bound over labelings, then apply Hoeffding to the difference of two $n$-sample averages of $[-1,+1]$-bounded variables:

$$\Pr_{S,S'}\bigl[|\hat R_S(h) - \hat R_{S'}(h)| > \epsilon/2\bigr] \le 2\exp(-n\epsilon^2/8).$$

Combining:

$$\Pr_S[\Phi > \epsilon] \le 4\,\Pi_\mathcal{H}(2n)\,\exp(-n\epsilon^2/8).$$

Apply Corollary 1: $\Pi_\mathcal{H}(2n) \le (2en/d)^d$. Setting the right side to $\delta$ and solving for $\epsilon$ gives $\epsilon = \sqrt{8(d\log(2en/d) + \log(4/\delta))/n}$.

The rate is $O(\sqrt{d\log n/n})$: $\sqrt{1/n}$ from Hoeffding, $\sqrt{d}$ from Sauer–Shelah, $\sqrt{\log n}$ from polynomial growth. The constants (the 8 and the 4) are loose by factors of 2–4 relative to modern refinements (Massart, McDiarmid, Talagrand). Inverting for sample complexity with a self-bounding argument gives $n = O((d\log(d/\epsilon) + \log(1/\delta))/\epsilon^2)$: the agnostic-PAC bound that matches (1.1) with $d_{\mathrm{VC}}$ replacing $\log|\mathcal{H}|$.
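
Evaluating (6.2) directly shows how slowly the bound becomes non-vacuous; a minimal sketch reproducing the number quoted in the demo below:

```python
from math import e, log, sqrt

def ftsl_epsilon(n: int, d: int, delta: float) -> float:
    """Agnostic uniform-convergence bound (6.2)."""
    return sqrt(8 * (d * log(2 * e * n / d) + log(4 / delta)) / n)

print(f"{ftsl_epsilon(120, 3, 0.05):.3f}")      # 1.170: vacuous at n = 120
print(f"{ftsl_epsilon(100_000, 3, 0.05):.3f}")  # 0.057: shrinks like sqrt(log n / n)
```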

The lower-bound direction and the FTSL equivalence

Theorem 3 (No-free-lunch).

If $d_{\mathrm{VC}}(\mathcal{H}) = \infty$, then for every learning algorithm $A$ and every $n$, there exists a distribution $D$ such that $\mathbb{E}_{S \sim D^n}[R(A(S))] \ge 1/4 + \min_h R(h)$.

Proof.

Pick any $2n$-point set $C$ shattered by $\mathcal{H}$ (one exists because $d_{\mathrm{VC}} = \infty$). Let $D$ be uniform on $C$ with an adversarially chosen labeling drawn after $A$ commits to its strategy. By shattering, some $h \in \mathcal{H}$ realizes the labeling exactly, so $\min_h R(h) = 0$. Algorithm $A$ sees $n$ of the $2n$ points; on the unseen half, $A$ has no information about the labeling and must guess, incurring error at least $1/2$ on the unseen half, equivalently at least $1/4$ overall. Full derivation in Shalev-Shwartz–Ben-David Theorem 5.1.

Infinite VC dimension is fatal for distribution-free learning. The sine-wave family $\mathcal{H}_{\sin}$ from §4.4 is a concrete instance: one real parameter is enough to defeat any learning algorithm in the agnostic setting.

Theorem 4 (FTSL equivalence (Blumer–Ehrenfeucht–Haussler–Warmuth 1989)).

For a binary class $\mathcal{H}$, the following are equivalent: (1) $\mathcal{H}$ has finite VC dimension; (2) $\mathcal{H}$ has the uniform convergence property; (3) $\mathcal{H}$ is agnostically PAC-learnable; (4) ERM is a successful agnostic PAC learner for $\mathcal{H}$; (5) $\mathcal{H}$ is PAC-learnable in the realizable case.

The proof chain is (1) ⇒ (2) ⇒ (3) ⇒ (4) via Theorem 2 and standard ERM analysis; (3) ⇒ (5) a fortiori; and (5) ⇒ (1) by the contrapositive of Theorem 3. The full development is in Shalev-Shwartz–Ben-David §6.4. The equivalence is what gives the theorem its “fundamental” name: every interesting learnability property reduces to the single integer $d_{\mathrm{VC}}$.

Demo — the FTSL bound versus the empirical generalization gap

Figure: left, the FTSL bound (6.2) versus $n$ on log-log for $d \in \{1, 3, 5, 10\}$, all with slope $-1/2$; right, the empirical worst-case generalization gap on half-planes (15 Monte Carlo replicates at six sample sizes) against the bound at $d = 3$, $\delta = 0.05$; at $n = 120$ the bound evaluates to $\epsilon = 1.170$.

The bound has the right asymptotic shape: both the empirical worst-case gap and the FTSL bound scale as $O(\sqrt{\log n / n})$, but the bound is loose by roughly a factor of 10. The Hoeffding constant is the largest source of slack; the polynomial-growth substitution is comparatively tight. Modern refinements via Talagrand and McDiarmid close the constant gap by another factor of 2–4, but the order-of-magnitude looseness persists.

Realizable versus agnostic rates

The FTSL sample complexity is $O((d\log(d/\epsilon) + \log(1/\delta))/\epsilon^2)$: quadratic in $1/\epsilon$. The realizable PAC case has linear dependence: $O((d\log(1/\epsilon) + \log(1/\delta))/\epsilon)$. The factor-$1/\epsilon$ gap is the headline practical consequence of the realizability assumption.

Realizable PAC: the 1/ε rate

Theorem 5 (Realizable PAC bound).

Under realizability ($\exists h^\star \in \mathcal{H} : R(h^\star) = 0$), with probability at least $1 - \delta$, any ERM $\hat h$ satisfies $R(\hat h) \le \epsilon$ provided

$$n \ge \frac{8\bigl(d\log(2e/\epsilon) + \log(2/\delta)\bigr)}{\epsilon}. \quad (7.1)$$
Proof.

Define the bad subclass $\mathcal{H}_\epsilon = \{h \in \mathcal{H} : R(h) > \epsilon\}$: the hypotheses ERM might erroneously pick. The union bound on consistency gives

$$\Pr\bigl[\exists h \in \mathcal{H}_\epsilon : \hat R(h) = 0\bigr] \le \Pi_\mathcal{H}(n)(1 - \epsilon)^n \le (en/d)^d e^{-n\epsilon},$$

using $(1 - \epsilon)^n \le e^{-n\epsilon}$. Set this $\le \delta$ and solve for $n$ to get (7.1). The exponential decay is in $n\epsilon$, not $n\epsilon^2$: that’s where the linear-versus-quadratic split comes from.

The crucial step is $\Pr[\hat R(h) = 0] = (1 - R(h))^n$: the probability that a Bernoulli($R(h)$) random variable is zero $n$ times in a row. This is not Hoeffding; the “rare-event” bound has $e^{-n\epsilon}$ tails, not $e^{-n\epsilon^2}$.

Agnostic PAC: the 1/ε² rate

Theorem 2’s bound inverts to $n = O((d\log(d/\epsilon) + \log(1/\delta))/\epsilon^2)$. The practical contrast at $\epsilon = 0.01$, $\delta = 0.05$, $d = 10$ is dramatic: realizable needs $n \approx 5{,}000$ while agnostic needs $n \approx 500{,}000$, two orders of magnitude more.
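
The headline numbers follow from the constant-free $O(\cdot)$ rates; a minimal sketch (leading constants dropped, which is how the 5,000 and 500,000 figures arise):

```python
from math import log

def n_realizable(d, eps, delta):
    """O((d log(1/eps) + log(1/delta)) / eps), leading constant dropped."""
    return (d * log(1 / eps) + log(1 / delta)) / eps

def n_agnostic(d, eps, delta):
    """O((d log(1/eps) + log(1/delta)) / eps^2), leading constant dropped."""
    return (d * log(1 / eps) + log(1 / delta)) / eps**2

print(round(n_realizable(10, 0.01, 0.05)))  # ~4,900
print(round(n_agnostic(10, 0.01, 0.05)))    # ~490,000
```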

Bernstein versus Hoeffding: where the gap comes from

Hoeffding’s bound is variance-blind: $\Pr(|\bar Z - p| > t) \le 2\exp(-2nt^2)$ regardless of $p$. Bernstein’s bound is variance-aware:

$$\Pr(|\bar Z - p| > t) \le 2\exp\!\Bigl(-\frac{nt^2}{2p(1-p) + (2/3)t}\Bigr).$$

When $p \to 0$, which is what realizability says about the variance of the error, the denominator vanishes and the bound tightens. The realizable analysis effectively uses Bernstein in disguise: at $t = \epsilon$, $p = \epsilon$, the exponent is of order $n\epsilon$ (about $3n\epsilon/8$), matching the $(1-\epsilon)^n \approx e^{-n\epsilon}$ rate. Hoeffding ignores the variance information and pays the $1/\epsilon^2$ price.
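
Plugging in numbers makes the variance-awareness visible; a minimal sketch:

```python
import math

def hoeffding_tail(n, t):
    return 2 * math.exp(-2 * n * t**2)

def bernstein_tail(n, t, p):
    return 2 * math.exp(-n * t**2 / (2 * p * (1 - p) + (2 / 3) * t))

n, eps = 2000, 0.01
print(hoeffding_tail(n, eps))         # 2*exp(-0.4) ~ 1.34: vacuous
print(bernstein_tail(n, eps, p=eps))  # 2*exp(-7.6) ~ 0.001: exponent ~ 3*n*eps/8
```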

When realizable rates degrade to agnostic

Realizability is fragile. Three concrete failure modes:

  1. Label noise. Even $\eta = 0.01$ flip probability, one in a hundred labels corrupted, destroys the realizable assumption: every $h \in \mathcal{H}$ has $R(h) \ge \eta > 0$, so the rare-event bound $e^{-n\epsilon}$ does not apply for $\epsilon < \eta$. Natarajan et al. (2013) develop the noisy-label theory in detail.
  2. Misspecification. If $\min_{h \in \mathcal{H}} R(h) > 0$, i.e., the class $\mathcal{H}$ doesn’t contain a perfect classifier, Theorem 5 doesn’t apply and the agnostic FTSL bound (6.2) takes over.
  3. Intermediate regime: Tsybakov margin. Between realizable ($p = 0$) and agnostic ($p \approx 1/2$), the Tsybakov margin condition (Mammen–Tsybakov 1999) interpolates with parameters $(c, \alpha)$: $\Pr(|\eta(X) - 1/2| \le t) \le c\,t^\alpha$ for $t > 0$. The resulting rate is $n = O(d^{1/(2-\alpha)}\epsilon^{-(2-\alpha)})$ for $\alpha \in [0, 1]$: linear at $\alpha = 1$ (full margin), quadratic at $\alpha = 0$ (no margin). Full development in Generalization Bounds §9.

Demo — sample complexity versus accuracy, two regimes

Figure: realizable versus agnostic sample-complexity curves at $d = 5$, $\delta = 0.05$; left panel on a linear scale, right panel on log-log.

The left panel makes the practical gap visible: at $\epsilon = 0.01$, the realizable bound asks for around 5,000 samples; the agnostic bound asks for around 500,000. The right panel decomposes the gap into its asymptotic shape: slope $-1$ versus slope $-2$ on log-log. The variance-aware, Bernstein-style analysis that the realizable case implicitly uses is what buys the extra factor.

VC dimension of canonical classes

Four canonical classes with full VC-dimension proofs: half-spaces in $\mathbb{R}^d$ via Radon’s theorem, axis-aligned rectangles in $\mathbb{R}^d$ via pigeonhole, decision stumps and Boolean classes (the easy cases), and neural networks via the Bartlett bounds.

Half-spaces in R^d via Radon’s theorem

Theorem 6 (VC dimension of half-spaces).

$d_{\mathrm{VC}}(\text{half-spaces in } \mathbb{R}^d) = d + 1$.

The destructive direction depends on the following geometric fact.

Theorem 7 (Radon's theorem (1921)).

Any $d + 2$ points in $\mathbb{R}^d$ admit a partition $(I_+, I_-)$ such that $\mathrm{conv}(I_+) \cap \mathrm{conv}(I_-) \ne \emptyset$.

Proof.

Consider the linear system in $\lambda \in \mathbb{R}^{d+2}$:

$$\sum_i \lambda_i x_i = 0, \qquad \sum_i \lambda_i = 0.$$

This is $d + 1$ scalar equations in $d + 2$ unknowns, so a nontrivial solution $\lambda^\star \ne 0$ exists. Let $I_+ = \{i : \lambda^\star_i > 0\}$ and $I_- = \{i : \lambda^\star_i < 0\}$; both are non-empty because $\sum_i \lambda^\star_i = 0$. Let $\Lambda = \sum_{i \in I_+} \lambda^\star_i = -\sum_{i \in I_-} \lambda^\star_i > 0$. The point

$$p = \frac{1}{\Lambda}\sum_{i \in I_+} \lambda^\star_i x_i = \frac{1}{\Lambda}\sum_{i \in I_-} (-\lambda^\star_i) x_i$$

is a convex combination from both sides: the coefficients $\lambda^\star_i/\Lambda$ for $i \in I_+$ are non-negative and sum to $1$, and the coefficients $-\lambda^\star_i/\Lambda$ for $i \in I_-$ are likewise non-negative and sum to $1$. So $p \in \mathrm{conv}(I_+) \cap \mathrm{conv}(I_-)$.
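
The null-space computation in the proof translates directly into code; a minimal numpy sketch (function name ours):

```python
import numpy as np

def radon_partition(points: np.ndarray):
    """points: (d+2, d) array. Returns (I_plus, I_minus, p), p in both hulls."""
    m, d = points.shape
    assert m == d + 2
    # Homogeneous system: sum_i lam_i x_i = 0 and sum_i lam_i = 0.
    A = np.vstack([points.T, np.ones(m)])  # (d+1) x (d+2), so a null vector exists
    lam = np.linalg.svd(A)[2][-1]          # right singular vector for sigma ~ 0
    I_plus, I_minus = np.where(lam > 0)[0], np.where(lam < 0)[0]
    p = lam[I_plus] @ points[I_plus] / lam[I_plus].sum()
    return I_plus, I_minus, p

rng = np.random.default_rng(1)
I_plus, I_minus, p = radon_partition(rng.standard_normal((4, 2)))  # d = 2
print(I_plus, I_minus, p)  # p lies in the convex hulls of both parts
```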

Proof.

Proof of Theorem 6, destructive direction. For any $d + 2$ points, Radon’s theorem gives a partition $(I_+, I_-)$ with a shared point $p$. Consider the labeling that assigns $1$ to $I_+$ and $0$ to $I_-$. Any separating half-space $H = \{x : \langle w, x\rangle \ge c\}$ would contain all of $I_+$ (and hence $\mathrm{conv}(I_+) \subseteq H$, by convexity of $H$) and exclude all of $I_-$ (so $\mathrm{conv}(I_-) \subseteq H^c$, the complementary open half-space, which is also convex). But then $p \in \mathrm{conv}(I_+) \cap \mathrm{conv}(I_-) \subseteq H \cap H^c = \emptyset$: a contradiction. So this labeling is unrealizable, and $d_{\mathrm{VC}}(\text{half-spaces}) \le d + 1$.

Proof.

Proof of Theorem 6, constructive direction. Take any $d + 1$ affinely independent points, e.g., the origin together with the standard basis vectors $e_1, \dots, e_d$. For any labeling $b \in \{0,1\}^{d+1}$, let $C_b$ and $C_{\bar b}$ be the centroids of the $1$- and $0$-labeled vertices respectively. The half-space whose normal is $(C_b - C_{\bar b})$ and which passes through the midpoint $(C_b + C_{\bar b})/2$ contains the $1$-labeled vertices and excludes the $0$-labeled vertices, by affine independence. Full details in Mohri–Rostamizadeh–Talwalkar Theorem 3.20.

Axis-aligned rectangles in R^d

Theorem 8 (VC dimension of axis-aligned rectangles).

$d_{\mathrm{VC}}(\text{axis-aligned rectangles in } \mathbb{R}^d) = 2d$.

Proof.

Constructive direction. Take the $2d$ unit-axis points $\{\pm e_j : j = 1, \dots, d\}$. For any labeling $b \in \{0,1\}^{2d}$, build the rectangle one axis at a time: the $j$-th range is $[-1.5, 1.5]$ if both $+e_j$ and $-e_j$ are labeled $1$, $[-1.5, 0.5]$ if only $-e_j$ is, $[-0.5, 1.5]$ if only $+e_j$ is, and $[-0.5, 0.5]$ if neither (which excludes both). Four cases per axis, each producing the required labels on that axis’s two extreme points.

Proof.

Destructive direction. Among any $2d + 1$ points in $\mathbb{R}^d$, by pigeonhole at least one point $x^\star$ is not extreme on any axis: each axis has at most two extreme positions (the maximum and the minimum coordinate), giving only $2d$ extreme positions in total. Now consider the labeling “1 everywhere except $x^\star$.” Any rectangle that includes the other $2d$ points must include their bounding box; but the bounding box contains $x^\star$ (because $x^\star$ is not extreme on any axis), so the rectangle includes $x^\star$ as well, contradicting the requested labeling. This labeling is unrealizable, so $d_{\mathrm{VC}} \le 2d$.
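
The pigeonhole step is mechanical; a minimal sketch of the destructive direction (names ours):

```python
import numpy as np

def non_extreme_point(points: np.ndarray) -> int:
    """Index of a point attaining no axis max or min: with 2d+1 points and
    only 2d extreme slots, pigeonhole guarantees one exists."""
    extreme = set(points.argmin(axis=0)) | set(points.argmax(axis=0))
    return next(i for i in range(len(points)) if i not in extreme)

rng = np.random.default_rng(2)
d = 3
pts = rng.standard_normal((2 * d + 1, d))
i = non_extreme_point(pts)
others = np.delete(pts, i, axis=0)
# The bounding box of the other 2d points already contains the non-extreme
# point, so the labeling "1 everywhere except pts[i]" is unrealizable:
print(bool(np.all(others.min(0) <= pts[i]) and np.all(pts[i] <= others.max(0))))
```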

Decision stumps and small Boolean concept classes

Decision stumps in $\mathbb{R}^d$ are hypotheses of the form $h(x) = \mathbb{1}[x_j \ge \theta]$ (or its complement) for some axis $j$ and threshold $\theta$. The growth function is $\Pi(n) \le 2d(n+1)$: polynomial in $n$ of degree $1$. The exact VC dimension is $d_{\mathrm{VC}} = \lceil\log_2(2d+1)\rceil$ (Anthony–Bartlett 1999 §3.6). This logarithmic bound is why boosting (Friedman 2001) works: weak learners (stumps) have logarithmic VC dimension, so $T$-stump ensembles have VC dimension at most $T\log d$, much smaller than a depth-$T$ tree’s, which scales with its $2^T$ leaves.

For $k$-of-$n$ majority thresholds $h(x) = \mathbb{1}[\sum_j x_j \ge k]$ on $x \in \{0,1\}^n$, there are only $n + 1$ such functions, so $d_{\mathrm{VC}} \le \log_2(n+1)$ by the finite-class bound.

Neural networks: where classical VC analysis breaks

Theorem 9 (Neural-network VC dimension (Bartlett–Maiorov–Meir 1998 / Bartlett–Harvey–Liaw–Mehrabian 2019)).

For a feedforward neural network with $W$ parameters and $L$ layers using piecewise-linear activations,

$$c\,WL\log(W/L) \le d_{\mathrm{VC}}(\mathcal{H}_{\text{net}}) \le C\,WL\log W$$

for absolute constants $c, C > 0$.

For ResNet-50 ($W \approx 25 \times 10^6$, $L = 50$), this gives $d_{\mathrm{VC}} \approx 10^{10}$. The FTSL bound (6.2) then asks for around $10^{10}$ training samples to achieve a non-vacuous generalization guarantee. ImageNet has $10^6$. The classical VC bound is vacuous on every realistic deep network.

This is the deep-learning puzzle: networks generalize well in practice despite enormous classical VC dimension. Three modern responses pursue different routes around the impasse:

  1. Margin/norm-based bounds (Bartlett–Mendelson 2002, Bartlett–Foster–Telgarsky 2017, Neyshabur–Tomioka–Srebro 2015): replace $d_{\mathrm{VC}}$ with capacity measures scaling with weight norms or margins, often yielding non-vacuous bounds.
  2. Implicit-bias of SGD (Soudry et al 2018, Gunasekar et al 2017, Ji–Telgarsky 2019): show that gradient descent on overparameterized models is biased toward low-complexity solutions, so the effective capacity is much smaller than the parameter count.
  3. Beyond uniform convergence (Nagarajan–Kolter 2019): argue that uniform-convergence-style FTSL is fundamentally the wrong tool for deep networks; algorithm-specific bounds (PAC-Bayes, compression) are needed instead.

PAC-Bayes Bounds is the canonical alternative. Full treatments of margin/norm bounds appear in Generalization Bounds §12; the implicit-bias and beyond-UC stories will be developed in the forthcoming Double Descent topic.

Demo — Radon’s theorem in the plane

Figure: three Radon-partition geometries for four-point configurations in the plane. Panel A: convex-position quadrilateral, with the diagonals intersecting at the shared point. Panel B: a point inside a triangle, pitting the interior point against the three vertices. Panel C: near-collinear, one off-axis point against the three near-collinear points. Dragging any point reassigns the partition automatically.

The three panels show three distinct Radon-partition geometries, and the §8.5 demo lets you drag the four points to watch the partition reassign as the configuration deforms. The shared point $p$ is always inside both convex hulls: that’s what makes the labeling “$I_+ = 1$, $I_- = 0$” unrealizable by any half-plane.

From VC dimension to Rademacher complexity

The bounds developed so far are distribution-free: strong universality, but worst-case tightness. Rademacher complexity is the canonical data-dependent successor: it refines the union-bound step of the FTSL by replacing $\log\Pi_\mathcal{H}(2n)$ with $\log|\mathcal{H}_{|S}|$ on the actual sample $S$.

Empirical Rademacher complexity and Massart’s lemma

Definition 5 (Empirical Rademacher complexity).

The empirical Rademacher complexity of $\mathcal{H}$ on sample $S = (x_1, \dots, x_n)$ is

$$\hat{\mathfrak R}_S(\mathcal{H}) = \mathbb{E}_\sigma\!\Bigl[\sup_{h \in \mathcal{H}} \tfrac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\Bigr], \quad (9.1)$$

where $\sigma_1, \dots, \sigma_n$ are i.i.d. Rademacher variables (uniform on $\{\pm 1\}$). The population Rademacher complexity is $\mathfrak R_n(\mathcal{H}) = \mathbb{E}_S[\hat{\mathfrak R}_S(\mathcal{H})]$.

The quantity measures how well $\mathcal{H}$ can fit random $\pm 1$ labelings of $S$: small means the class can’t fit random labels (limited capacity), large means it can (large capacity).

Lemma 4 (Massart's finite-class lemma).

For a finite set $A \subseteq \mathbb{R}^n$ with $\sup_{a \in A}\|a\|_2 \le r$,

$$\mathbb{E}_\sigma\!\Bigl[\sup_{a \in A} \tfrac{1}{n}\sum_i \sigma_i a_i\Bigr] \le \frac{r\sqrt{2\log|A|}}{n}.$$

Specialized to $A = \mathcal{H}_{|S} \subseteq \{0,1\}^n$ (so $r = \sqrt{n}$):

$$\hat{\mathfrak R}_S(\mathcal{H}) \le \sqrt{\frac{2\log|\mathcal{H}_{|S}|}{n}}. \quad (9.2)$$

The proof is the standard sub-Gaussian max-bound argument; the full derivation is in Generalization Bounds §5 or Mohri–Rostamizadeh–Talwalkar Theorem 3.7.
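
A Monte Carlo estimate of (9.1) against the bound (9.2), using the $n+1$ half-line dichotomies as the finite labeling set (names ours):

```python
import numpy as np

def empirical_rademacher(labelings: np.ndarray, n_draws: int = 2000, seed: int = 0):
    """Monte Carlo estimate of (9.1) for a finite set of 0/1 labelings (rows)."""
    rng = np.random.default_rng(seed)
    n = labelings.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return (sigma @ labelings.T).max(axis=1).mean() / n

n = 20
A = np.array([[int(j >= i) for j in range(n)] for i in range(n + 1)])  # staircases
rad = empirical_rademacher(A)
massart = np.sqrt(2 * np.log(len(A)) / n)      # bound (9.2)
print(round(rad, 3), "<=", round(massart, 3))  # the estimate sits below the bound
```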

The Sauer–Shelah substitution

Combining (9.2) with Sauer–Shelah’s $|\mathcal{H}_{|S}| \le (en/d)^d$:

$$\hat{\mathfrak R}_S(\mathcal{H}) \le \sqrt{\frac{2d\log(en/d)}{n}} = O\!\left(\sqrt{\frac{d\log n}{n}}\right). \quad (9.3)$$

This is the same shape as the FTSL bound (6.2). The data-dependent version uses the actual $|\mathcal{H}_{|S}|$ before the Sauer–Shelah substitution, which can be much tighter when the sample is benign.

Via the canonical Rademacher-based generalization bound (developed in Generalization Bounds §5.4):

$$\sup_{h \in \mathcal{H}} |R(h) - \hat R(h)| \le 2\,\mathfrak R_n(\mathcal{H}) + O\!\left(\sqrt{\log(1/\delta)/n}\right).$$

The §9 framework feeds directly into the Structural Risk Minimization §5 Bartlett–Mendelson SRM penalty, where empirical Rademacher complexity replaces VC dimension as the capacity functional.

Why Rademacher can be tighter

Three sources of tightening over the worst-case VC bound:

  1. Duplicate samples. $|\mathcal{H}_{|S}|$ drops when the sample contains duplicates: fewer distinct points means fewer distinct labelings. Rademacher follows the drop; the VC bound is oblivious.
  2. Margin. On data with a separation margin $\gamma > 0$, only the “boundary” labelings near the decision surface are realized, so $|\mathcal{H}_{|S}|$ is much smaller than the worst case. Rademacher captures the margin directly; the VC bound does not.
  3. Low-rank structure. Data lying on a $k$-dimensional submanifold has empirical Rademacher complexity scaling with $\sqrt{k\log n/n}$ instead of $\sqrt{d\log n/n}$. The gap can be exponential in $d$ for benign distributions.

Forward-pointer: pseudo-dimension and fat-shattering for real-valued classes

For real-valued function classes $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, three complexity measures generalize VC dimension:

  • Pseudo-dimension $\mathrm{Pdim}(\mathcal{F})$ (Pollard 1984): the VC dimension of the subgraph class $\{\{(x, t) : t \le f(x)\} : f \in \mathcal{F}\}$. Reduces to VC dimension when $\mathcal{F}$ is binary.
  • Fat-shattering at scale $\gamma$, $\mathrm{fat}_\gamma(\mathcal{F})$ (Bartlett–Long–Williamson 1996): a margin-aware refinement; $\mathrm{fat}_\gamma \to \mathrm{Pdim}$ as $\gamma \to 0$.
  • Covering numbers $N(\mathcal{F}, \gamma, n)$ (Dudley 1967, van der Vaart–Wellner 1996): the minimum number of $\gamma$-balls covering $\mathcal{F}_{|S}$; tied to fat-shattering by $\log N = O(\mathrm{fat}_{\gamma/2}\log(n/\gamma))$.

Full theory in Generalization Bounds §8: Dudley’s entropy integral, Talagrand’s contraction lemma, and McDiarmid plus symmetrization for real-valued losses.

Demo — empirical Rademacher for axis-rectangles

Figure: empirical Rademacher complexity (blue) for axis-rectangles on 2D random samples, $n = 5$ to $20$, versus the Massart bound on the actual restricted class size (orange dashed) and the Sauer–Shelah Rademacher bound at $d = 4$ (red dashed), with the actual restricted class size annotated at each $n$.

The empirical Rademacher (blue) sits below both upper bounds at every $n$. The Massart bound (orange) tracks the empirical curve closely because the actual restricted class sizes on benign random samples are far below the Sauer–Shelah ceiling. The Sauer–Shelah bound (red) is the worst-case envelope: universal, but loose on every benign distribution.

Integrative worked example: half-planes and rectangles, end to end

A single integrative Monte Carlo on the two anchor classes verifies every box in the bound-vs-empirical chain — growth function (§3), Sauer–Shelah (§5), FTSL envelope (§6), realizable PAC (§7), and empirical Rademacher (§9) — all on the same two test problems.

The two test classes and experimental setup

Half-plane test: $D$ uniform on $[0,1]^2$ with $y = \mathbb{1}[x_2 > x_1 + 0.2]$, a separating line offset from the diagonal, realizable in $\mathcal{H}_{\text{half-plane}}$ with $R(h^\star) = 0$. Rectangle test: $D$ uniform on $[0,1]^2$ with $y = \mathbb{1}[x_1 \in [0.3, 0.7] \text{ and } x_2 \in [0.3, 0.7]]$, a centered $0.4 \times 0.4$ square, realizable in $\mathcal{H}_{\text{axis-rectangle}}$.

Four protocols on each class:

  1. Growth-function check at $n \in \{5, 10, 15, 20\}$: exhaustive enumeration of $|\mathcal{H}_{|S}|$, via the Cover-1965 closed form for half-planes and by brute force over the $2^n$ candidate labelings for rectangles.
  2. FTSL envelope check at $n \in \{30, 60, 120, 250, 500, 1000\}$ with $B = 30$ replicates: empirical worst-case gap from LinearSVC at $C = 10^6$ (half-plane ERM) or the smallest enclosing axis-rectangle (rectangle ERM, sketched after this list), measured on a 30k-point held-out test set.
  3. Realizable sample-complexity verification at $\epsilon \in \{0.10, 0.05, 0.02\}$, $\delta = 0.05$: find the smallest $n^\star$ at which 95% of replicates achieve $R(\hat h) \le \epsilon$, compared to the theoretical prediction (7.1).
  4. Empirical Rademacher at $n = 20$: brute-force $|\mathcal{H}_{|S}|$ followed by $B = 2000$ Rademacher-draw averages of the sup over realized labelings.
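
The rectangle ERM referenced in protocol 2 is the classic tightest-fit rule: the smallest axis-aligned box around the positive training points; a minimal sketch (names ours):

```python
import numpy as np

def rectangle_erm(X: np.ndarray, y: np.ndarray):
    """Realizable ERM for axis-rectangles: the tightest box around positives
    (zero training error under realizability). Returns a predict function."""
    pos = X[y == 1]
    if len(pos) == 0:
        return lambda Z: np.zeros(len(Z), dtype=int)
    lo, hi = pos.min(axis=0), pos.max(axis=0)
    return lambda Z: np.all((Z >= lo) & (Z <= hi), axis=1).astype(int)

def square_labels(Z):
    return ((Z > 0.3) & (Z < 0.7)).all(axis=1).astype(int)  # the 0.4 x 0.4 target

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(500, 2))
predict = rectangle_erm(X, square_labels(X))
Xtest = rng.uniform(0, 1, size=(30_000, 2))
print((predict(Xtest) != square_labels(Xtest)).mean())  # small held-out risk
```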

Growth-function verification

For each $n$, enumerate $|\mathcal{H}_{|S}|$ for half-planes (closed form $n^2 - n + 2$, exact) and rectangles (brute force over $2^n$ labelings, exact). Both classes respect the Sauer–Shelah ceiling. The empirical $\Pi_{\text{HP}}$ matches the closed form exactly; the rectangle count sits below $\binom{n}{\le 4}$ with slack that varies by sample configuration.

FTSL envelope check

Run the ERM Monte Carlo. The empirical gap is 1/10 to 1/30 of the FTSL envelope across all sample sizes, with the same $n^{-1/2}$ shape on log-log; the rectangle gap is larger than the half-plane gap by roughly the $\sqrt{4/3}$ ratio predicted by the $\sqrt{d_{\mathrm{VC}}}$ scaling.

Realizable sample-complexity verification

Both empirical curves scale as $1/\epsilon$, matching the shape of (7.1); the theoretical bound is loose by roughly an order of magnitude; and the rectangle-to-half-plane ratio holds across all $\epsilon$, consistent with the $\sqrt{d_{\mathrm{VC}}}$ scaling.

Demo — the integrative four-panel “money shot”

Figure: integrative four-panel Monte Carlo on half-planes and axis-rectangles. Panel A: empirical growth functions against their Sauer–Shelah ceilings. Panel B: FTSL envelope versus empirical worst-case gap on log-log, $n = 30$ to $1000$. Panel C: realizable sample complexity $n^\star(\epsilon)$, empirical versus theoretical, at three values of $\epsilon$. Panel D: empirical Rademacher versus the Sauer–Shelah Rademacher bound at $n = 20$ for both classes.

Panel A confirms the empirical growth functions sit below their Sauer–Shelah ceilings ($n^3$ and $n^4$ slopes). Panel B sets the FTSL envelope against the empirical worst-case gap: loose by an order of magnitude but rate-correct ($n^{-1/2}$ slope on log-log for both classes). Panel C is the realizable sample-complexity story: linear in $1/\epsilon$ as predicted. Panel D is the empirical Rademacher comparison: the empirical value sits below the Sauer–Shelah Rademacher bound at every $n$, with the gap growing as the actual $|\mathcal{H}_{|S}|$ stays well below the worst-case ceiling. Every bound is verified; the looseness is universal, but the rate is right.

ML applications

SVMs and the margin: tightening VC dimension with geometry

Half-spaces in $\mathbb{R}^d$ have $d_{\mathrm{VC}} = d + 1$: large for high-dimensional feature spaces, making the FTSL bound vacuous. SVMs sidestep the dimension dependence via margin-based VC analysis: for the margin-restricted class $\mathcal{H}_\gamma$ of half-spaces achieving margin at least $\gamma$ on data of radius $R$,

$$d_{\mathrm{VC}}(\mathcal{H}_\gamma) \le \lceil R^2/\gamma^2 \rceil,$$

independent of the ambient dimension $d$. Linear SVMs in $\mathbb{R}^d$ and kernel SVMs in infinite-dimensional RKHSs are tractable only because the margin bound replaces the dimension-dependent VC dimension with the geometry-dependent $R^2/\gamma^2$. Hinge-loss training is direct margin maximization; the soft-margin slack handles the non-separable case. The forward link is Generalization Bounds §10.

Decision trees, pruning, and capacity control

A decision tree with $N$ leaves on $d$-dimensional input has $d_{\mathrm{VC}} = O(N\log(Nd))$ (Mansour–McAllester 2000). Pruning controls VC dimension directly: cost-complexity pruning (Breiman et al. 1984) walks a nested family of trees with increasing $N$, applying the SRM machinery from Structural Risk Minimization. Random forests (Breiman 2001) and gradient-boosted trees (Friedman 2001) extend the picture with bagging and boosting respectively. The §8.3 fact that decision stumps have $d_{\mathrm{VC}} = O(\log d)$ is precisely why boosting works: a $T$-stump ensemble has $d_{\mathrm{VC}} \le T\log d$, exponentially smaller than that of a depth-$T$ tree.

ERM and the bias-complexity decomposition

The agnostic FTSL bound gives the decomposition

R(h^)R=R(h)Rapproximation error+R(h^)R(h)estimation error.R(\hat h) - R^\star = \underbrace{R(h^\star) - R^\star}_{\text{approximation error}} + \underbrace{R(\hat h) - R(h^\star)}_{\text{estimation error}}.

Approximation error decreases in dVC(H)d_{\mathrm{VC}}(\mathcal{H}) (richer classes can approximate the Bayes-optimal classifier more closely); estimation error increases in dVCd_{\mathrm{VC}} (richer classes have larger generalization gap). SRM minimizes the sum: scan nested classes H1H2\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots ordered by VC dimension and pick the index k^\hat k minimizing training error plus a VC-based capacity penalty. Full development in Structural Risk Minimization §4.
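A schematic version of that scan in code — the training errors and VC dimensions of the nested family below are hypothetical numbers chosen for illustration, and the penalty uses the agnostic-FTSL form with constants suppressed:

```python
import numpy as np

def srm_select(train_errors, vc_dims, n, delta=0.05):
    """SRM selection rule (sketch): training error plus a VC-based capacity
    penalty sqrt((d log(2n/d) + log(2/delta)) / n) per class; constants schematic."""
    penalties = [np.sqrt((d * np.log(2 * n / d) + np.log(2 / delta)) / n)
                 for d in vc_dims]
    scores = np.asarray(train_errors) + np.asarray(penalties)
    return int(np.argmin(scores)), scores

# Hypothetical nested family H_1 c H_2 c ...: richer classes fit the
# training data better but pay a larger capacity penalty.
k_hat, scores = srm_select(train_errors=[0.30, 0.18, 0.12, 0.11, 0.105],
                           vc_dims=[1, 3, 7, 15, 31], n=500)
print(k_hat, np.round(scores, 3))
```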

The deep-learning puzzle

ResNet-50 has dVC1010d_{\mathrm{VC}} \approx 10^{10}; FTSL needs 101010^{10} training samples; ImageNet provides 10610^6; ResNets nevertheless generalize well in practice. Zhang et al (2017) established three findings that sharpen this picture: memorization capacity is enormous (deep networks can fit random labels to perfect accuracy); the same networks generalize on real labels; classical VC bounds are vacuous. Three modern responses pursue different routes:

  1. Margin/norm-based bounds (Bartlett–Mendelson 2002, Bartlett–Foster–Telgarsky 2017, Neyshabur–Tomioka–Srebro 2015): replace dVCd_{\mathrm{VC}} with capacity measures scaling with weight norms or margins. The schematic bound suphR(h)R^(h)O(Wlogn/n)\sup_h |R(h) - \hat R(h)| \le O(\prod_\ell \|W_\ell\| \cdot \sqrt{\log n / n}) is informative when per-layer norms are controlled and vacuous otherwise (see the sketch after this list).
  2. Implicit-bias of SGD (Soudry et al 2018, Gunasekar et al 2017, Ji–Telgarsky 2019): the optimizer biases the trained network toward low-complexity solutions, so effective VC dimension is much smaller than parameter count.
  3. Beyond uniform convergence (Nagarajan–Kolter 2019): classical FTSL is uniform-convergence-based, but the right tool may be non-uniform algorithm-specific bounds. PAC-Bayes (PAC-Bayes Bounds) and compression bounds (Moran–Yehudayoff 2016) are canonical alternatives.
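The sketch referenced in item 1: computing the schematic product-of-spectral-norms capacity term for a toy stack of weight matrices. Everything here (the function name norm_capacity, the layer sizes, the 1/sqrt(64) scaling) is illustrative, and the published bounds carry additional margin, width, and norm-ratio factors.

```python
import numpy as np

rng = np.random.default_rng(20260513)

def norm_capacity(weights, n):
    """Schematic margin/norm capacity term: prod_l ||W_l||_2 * sqrt(log n / n)."""
    prod = 1.0
    for W in weights:
        prod *= np.linalg.norm(W, ord=2)  # spectral norm of each layer
    return prod * np.sqrt(np.log(n) / n)

# Toy 3-layer stack: the term is informative only while the per-layer
# spectral norms stay controlled; it blows up multiplicatively otherwise.
Ws = [rng.standard_normal((64, 64)) / np.sqrt(64) for _ in range(3)]
print(norm_capacity(Ws, n=10_000))
```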

The forward-pointer is Double Descent (coming soon): test error can decrease again as capacity grows past the interpolation threshold, a phenomenon classical VC analysis fundamentally cannot explain.

Demo — margin tightens the VC bound

Three curves versus the separation margin gamma from 0.05 to 0.7: the vanilla FTSL bound for half-planes (blue dashed, flat — it sees only VC dimension 3 and n = 200), the margin-based bound R^2/gamma^2 (orange, decreasing in gamma), and the empirical SVM Monte Carlo gap (red), which descends across the gamma range and tracks the margin-based shape.

At γ = 0.20: vanilla FTSL = 0.939 (flat), margin bound = 1.987, empirical = 0.0020. The margin bound tracks the empirical gap's shape; the vanilla bound doesn't see γ at all.

The FTSL bound (blue dashed) is flat — sees only dVC=3d_{\mathrm{VC}} = 3 and n=200n = 200, neither dependent on γ\gamma. The margin-based bound R2/γ2R^2/\gamma^2 (orange) decreases with γ\gamma because the effective capacity shrinks as the data become more separable. Empirical SVM Monte Carlo (red) matches the margin-based shape — the flat VC bound is genuinely the wrong capacity measure when the data have margin.

Computational notes

The 2^n enumeration ceiling

Brute-force enumeration of HS\mathcal{H}_{|S} requires 2n2^n labeling checks at O(n)O(n) cost per check. Laptop ceiling: n=16n = 16 takes milliseconds; n=20n = 20 takes 1–5 seconds; n=24n = 24 takes 30–90 seconds; n=28n = 28 takes many minutes. For in-browser visualizations, the cap is n16n \le 16. Above that, use closed-form Π(n)\Pi(n) (half-lines, intervals, half-planes via Cover 1965), Sauer–Shelah upper bounds, or symmetry-aware enumeration.
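A sketch of the brute-force protocol for axis-aligned rectangles — the O(n)O(n) check per labeling is the bounding-box test, and the diamond configuration is the standard dVC4d_{\mathrm{VC}} \ge 4 witness:

```python
import itertools
import numpy as np

def rectangles_shatter(points):
    """Brute force over all 2^n labelings (the enumeration ceiling in question).
    Axis-rectangles realize a labeling iff the bounding box of the positive
    points contains no negative point -- an O(n) check per labeling."""
    points = np.asarray(points, dtype=float)
    for labels in itertools.product([False, True], repeat=len(points)):
        mask = np.array(labels)
        if not mask.any():
            continue  # all-negative is realizable by a rectangle off the data
        lo, hi = points[mask].min(0), points[mask].max(0)
        neg = points[~mask]
        if len(neg) and np.all((neg >= lo) & (neg <= hi), axis=1).any():
            return False  # some negative point is trapped in the bounding box
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(rectangles_shatter(diamond))             # True: d_VC >= 4
print(rectangles_shatter(diamond + [(0, 0)]))  # False: the center blocks shattering
```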

Symmetry exploitation for half-spaces

Cover (1965): for nn points in general position, affine half-spaces in Rd\mathbb{R}^d realize Πhalf-space(n)=2k=0d(n1k)=O(nd)\Pi_{\text{half-space}}(n) = 2 \sum_{k=0}^{d} \binom{n-1}{k} = O(n^d) dichotomies, not 2n2^n. The constructive O(nd)O(n^d) enumeration: every distinct dichotomy is realized by a hyperplane passing through at most dd support points. For d=2d = 2, this is O(n2)O(n^2) — a sample of n=100n = 100 has Π=9,902\Pi = 9{,}902 vs 210010302^{100} \approx 10^{30}. The ceiling extends to n200n \approx 200 in R2\mathbb{R}^2.
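Cover's count as a two-line function (a sketch; the n = 100, d = 2 check reproduces the 9,902 figure above):

```python
from math import comb

def cover_count(n: int, d: int) -> int:
    """Cover (1965): number of affine-half-space dichotomies of n points
    in general position in R^d: 2 * sum_{k=0}^{d} C(n-1, k)."""
    return 2 * sum(comb(n - 1, k) for k in range(d + 1))

print(cover_count(100, 2))  # 9902 -- versus 2^100 ~ 1.3e30 for brute force
```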

Cell decomposition for axis-rectangles

O(n2d)O(n^{2d}) enumeration: every rectangle dichotomy is determined by 2d2d extreme points (axis min/max of the labeled-1 subset). For d=2d = 2: O(n4)2.6×105O(n^4) \approx 2.6 \times 10^5 candidates at n=50n = 50, versus 2502^{50} — impossible by brute force. Production visualizations should use this; the pedagogical demos in §§3.5 and 10.5 use brute force for conceptual clarity at small nn.
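A sketch of the d=2d = 2 cell-decomposition enumeration, using the notebook seed convention; every candidate box has edges on sample coordinates, so O(n4)O(n^4) boxes suffice:

```python
import numpy as np

def rectangle_dichotomies(points):
    """Enumerate the distinct axis-rectangle labelings of `points` in R^2.
    Shrinking any rectangle to the bounding box of its captured points leaves
    the labeling unchanged, so boxes with edges on sample coordinates suffice."""
    xs = np.unique(points[:, 0])
    ys = np.unique(points[:, 1])
    seen = {tuple([0] * len(points))}  # the all-negative labeling
    for x0 in xs:
        for x1 in xs[xs >= x0]:
            for y0 in ys:
                for y1 in ys[ys >= y0]:
                    inside = ((points[:, 0] >= x0) & (points[:, 0] <= x1) &
                              (points[:, 1] >= y0) & (points[:, 1] <= y1))
                    seen.add(tuple(inside.astype(int)))
    return seen

rng = np.random.default_rng(20260513)
P = rng.standard_normal((12, 2))
print(len(rectangle_dichotomies(P)))  # empirical Pi value, far below 2^12 = 4096
```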

Reproducibility and visualization caps

Seed convention: rng = np.random.default_rng(20260513) at the notebook top. Practical caps for in-browser visualizations: the shattering playground at n6n \le 6; the growth-function plotter empirical curves at n12n \le 12; the FTSL bound explorer uses closed-form bounds plus cached JSON; the empirical-shatter-check four-panel is fully precomputed and served as JSON. NumPy convention throughout: always default_rng, never the deprecated np.random.seed global state.

Connections and limits

Pseudo-dimension and fat-shattering for real-valued classes

For real-valued FRX\mathcal{F} \subseteq \mathbb{R}^\mathcal{X}, the three complexity measures generalizing VC dimension are:

  • Pseudo-dimension Pdim(F)\mathrm{Pdim}(\mathcal{F}) (Pollard 1984) — the largest nn for which some sample x1,,xnx_1, \dots, x_n and threshold vector t1,,tnt_1, \dots, t_n exist such that the thresholded binary class realizes all 2n2^n patterns (formal statement after this list). Reduces to VC dimension for binary classes. For linear regression in Rd\mathbb{R}^d, Pdim=d+1\mathrm{Pdim} = d + 1. For neural networks, Pdim\mathrm{Pdim} depends on the architecture and is polynomial in the parameter count.
  • Fat-shattering at scale γ\gamma, fatγ(F)\mathrm{fat}_\gamma(\mathcal{F}) (Bartlett–Long–Williamson 1996) — a margin constraint on the shattered set; fatγPdim\mathrm{fat}_\gamma \to \mathrm{Pdim} as γ0\gamma \to 0.
  • Covering numbers N(F,γ,n)N(\mathcal{F}, \gamma, n) (Dudley 1967, van der Vaart–Wellner 1996) — minimum number of γ\gamma-balls covering FS\mathcal{F}_{|S}; tied to fat-shattering by logN=O(fatγ/2log(n/γ))\log N = O(\mathrm{fat}_{\gamma/2} \log(n/\gamma)).
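In symbols, the pseudo-dimension definition behind the first bullet (a standard formulation):

\mathrm{Pdim}(\mathcal{F}) = \max\bigl\{\, n : \exists\, x_1, \dots, x_n \in \mathcal{X},\ t_1, \dots, t_n \in \mathbb{R} \ \text{s.t.}\ \bigl\{ \bigl(\mathbb{1}[f(x_i) > t_i]\bigr)_{i=1}^{n} : f \in \mathcal{F} \bigr\} = \{0, 1\}^n \,\bigr\}.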

Full theory in Generalization Bounds §8 — Dudley’s entropy integral, Talagrand’s contraction lemma, McDiarmid plus symmetrization for real-valued losses.

The deep-learning generalization puzzle

Zhang et al (2017) crystallized the three findings discussed in §11.4: memorization capacity is enormous, generalization on real labels is genuine, and classical VC bounds are vacuous. The conclusion is that uniform convergence is not the operative framework for deep learning — the worst-case capacity argument that powers the FTSL fails on modern models. The three modern responses sketched in §11.4 follow different routes; the forward-pointer is Double Descent (coming soon), which addresses the interpolation-threshold phenomenon directly.

Distribution-free versus distribution-dependent guarantees

The VC framework is distribution-free — universal but worst-case. Three distribution-dependent alternatives:

  • Bayes-style smoothness assumptions — the Tsybakov margin condition (§7.4 mode 3) interpolates between the 1/ϵ1/\epsilon and 1/ϵ21/\epsilon^2 rates; smoothness classes (Hölder, Sobolev) yield minimax rates. The formalStatistics nonparametric-regression topic is the entry point.
  • Algorithmic stability (Bousquet–Elisseeff 2002) — β\beta-stability gives O(β+1/n)O(\beta + \sqrt{1/n}) rate, no VC required. Ridge is β=O(1/(λn))\beta = O(1/(\lambda n))-stable. Forward link: Generalization Bounds §11.
  • Compression bounds (Littlestone–Warmuth 1986, Floyd–Warmuth 1995, Moran–Yehudayoff 2016) — reconstruction from a kk-subsample gives a klogn/n\sqrt{k \log n / n} rate. Every class of finite VC dimension admits a compression scheme of size 2O(dVC)2^{O(d_{\mathrm{VC}})} (Moran–Yehudayoff); whether size O(dVC)O(d_{\mathrm{VC}}) is always achievable remains the open sample-compression conjecture.

Where to go next

Connected formalML topics:

  • Structural Risk Minimization — uses dVCd_{\mathrm{VC}} as the canonical capacity penalty in the Vapnik SRM framework.
  • PAC-Bayes Bounds — replaces the worst-case supremum over H\mathcal{H} with a posterior average, using KL divergence as the capacity measure.
  • Generalization Bounds — the empirical-process machinery (Rademacher complexity, symmetrization, McDiarmid, Talagrand) used here without full proof.
  • PAC Learning — the parent framework for agnostic PAC, realizable PAC, and the union-bound machinery.
  • Double Descent (coming soon) — the modern overparameterized regime where VC analysis breaks down.

The VC framework’s value is not tight bounds for every ML setting; it is the structural insight that capacity is the right quantity to bound, that shattering is the right way to measure capacity, and that dVClog(n/dVC)d_{\mathrm{VC}} \log(n/d_{\mathrm{VC}}) is the right substitute for logH\log|\mathcal{H}| in the union-bound argument. Every modern alternative — margin/norm bounds, PAC-Bayes, implicit-bias analysis, compression schemes — is built on this foundation.

We started by asking for the right combinatorial substitute for H|\mathcal{H}| when the cardinality counts the wrong thing. The answer is the VC dimension and its growth-function machinery. Sauer–Shelah converts the integer dVCd_{\mathrm{VC}} into polynomial growth, the FTSL converts polynomial growth into an O(dlogn/n)O(\sqrt{d \log n / n}) uniform-convergence bound, and the equivalence theorem ties finite VC dimension to PAC-learnability in both directions. The topic delivers the foundation; tighter bounds for modern overparameterized models require the margin/norm refinements, PAC-Bayes, or implicit-bias analysis from connected topics.

Connections

  • PAC-learning sets up the (ε, δ)-framework and the finite-class union-bound; this topic supplies the combinatorial machinery (shattering, growth function, VC dimension) that extends finite-class arguments to infinite classes. The Fundamental Theorem of Statistical Learning equivalence we prove in §6 ties finite VC dimension to agnostic PAC-learnability in both directions. pac-learning
  • The §9 bridge to Rademacher complexity feeds directly into the Massart, McDiarmid, and Talagrand machinery developed in generalization-bounds; finite VC dimension is the discrete capacity precursor of the real-valued empirical-process bounds shown there. The §14.2 VC-dimension stub in generalization-bounds is discharged by this topic. generalization-bounds
  • VC dimension is the canonical capacity functional in the Vapnik SRM penalty (SRM §4); the §3 phase-transition picture and §5 Sauer–Shelah bound are the substrate that makes the SRM oracle inequality nontrivial for infinite classes. The §11.3 bias-complexity decomposition is the SRM motivation in compressed form. structural-risk-minimization
  • Hoeffding's inequality from concentration-inequalities is one of the two ingredients in the FTSL proof — the other being polynomial growth via Sauer–Shelah. The §7.3 Bernstein-vs-Hoeffding rate-gap discussion is the variance-aware concentration story that explains the realizable vs agnostic split. concentration-inequalities

References & Further Reading