
PAC Learning Framework

From finite hypothesis classes to VC dimension and the Fundamental Theorem of Statistical Learning

Overview & Motivation

The Concentration Inequalities topic gave us quantitative tools for controlling how random quantities deviate from their expectations. Hoeffding’s inequality bounds the gap between a sample average and its mean. McDiarmid’s inequality extends this to general functions of independent variables. But those tools answer a specific question — how fast does an average concentrate? — without addressing the deeper question that drives machine learning: when can a learning algorithm generalize at all?

Consider the standard machine learning setup. We have a hypothesis class $\mathcal{H}$ — a collection of candidate classifiers. We observe a finite training sample and pick the hypothesis that performs best on that sample (empirical risk minimization). The training error is an observable quantity. The true error — performance on unseen data — is not. The gap between them is the generalization gap, and controlling it is the central problem of statistical learning theory.

PAC learning theory answers three fundamental questions:

  1. Feasibility. For which hypothesis classes is generalization possible — that is, for which $\mathcal{H}$ can we guarantee that training error approximates true error given enough data?
  2. Sample complexity. How many training examples suffice to achieve a desired accuracy $\varepsilon$ with confidence $1 - \delta$?
  3. Complexity measures. What properties of $\mathcal{H}$ determine the answers to (1) and (2)?

We will build the theory in three stages. First, we handle finite hypothesis classes using the concentration tools from the previous topic — the union bound and Hoeffding’s inequality give clean sample complexity bounds. Second, we introduce the VC dimension as the combinatorial measure that governs learnability for infinite classes, and prove the Sauer–Shelah lemma that makes infinite classes tractable. Third, we develop Rademacher complexity as a data-dependent alternative that often gives tighter bounds. The culmination is the Fundamental Theorem of Statistical Learning, which shows these perspectives are equivalent: a hypothesis class is learnable if and only if its VC dimension is finite.


The Learning Problem

We begin by formalizing the setup that machine learning algorithms operate in. The formalization may seem pedantic at first, but each definition isolates an assumption that will matter when we prove our main results.

The setup. We have an instance space $\mathcal{X}$ (the set of all possible inputs — think feature vectors in $\mathbb{R}^d$), a label space $\mathcal{Y} = \{0, 1\}$ (binary classification), and an unknown probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ that governs the data-generating process. A hypothesis is a function $h: \mathcal{X} \to \mathcal{Y}$ — a candidate classifier. A hypothesis class $\mathcal{H}$ is a collection of hypotheses that our learning algorithm considers.

We observe a training sample $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $\mathcal{D}$. The fundamental tension: we can evaluate how well a hypothesis performs on $S$, but we care about how well it performs on fresh data from $\mathcal{D}$.

Definition 1 (True Risk).

The true risk (generalization error) of hypothesis $h: \mathcal{X} \to \mathcal{Y}$ with respect to distribution $\mathcal{D}$ is:

$$R(h) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq y] = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\mathbf{1}[h(x) \neq y]]$$

where $\mathbf{1}[\cdot]$ is the indicator function that equals 1 when its argument is true and 0 otherwise.

The true risk is the probability of misclassification on a fresh example drawn from $\mathcal{D}$. It is the quantity we want to minimize but cannot compute, since $\mathcal{D}$ is unknown.

Definition 2 (Empirical Risk).

The empirical risk of $h$ on sample $S = \{(x_i, y_i)\}_{i=1}^n$ is:

$$\hat{R}_S(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[h(x_i) \neq y_i]$$

the fraction of training examples that $h$ misclassifies.

The empirical risk is computable — it is the training error. For any fixed hypothesis $h$, the empirical risk is the average of i.i.d. Bernoulli random variables with mean $R(h)$, so by the law of large numbers, $\hat{R}_S(h) \to R(h)$ as $n \to \infty$. The question is: does this convergence happen uniformly over all $h \in \mathcal{H}$ simultaneously, and how fast?
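For a single fixed hypothesis, this convergence is easy to see numerically; a minimal sketch (the true risk of 0.3 and the sample sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
true_risk = 0.3  # assumed true error rate R(h) of one fixed hypothesis

# Each example is misclassified independently with probability R(h), so the
# empirical risk is a mean of i.i.d. Bernoulli(R(h)) variables.
for n in [10, 100, 10_000]:
    mistakes = rng.binomial(1, true_risk, size=n)
    emp_risk = mistakes.mean()
    print(n, emp_risk)  # empirical risk approaches 0.3 as n grows
```

Note that this argument covers only one fixed $h$; the whole difficulty of learning theory is making it hold for every hypothesis in $\mathcal{H}$ at once.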

Definition 3 (Empirical Risk Minimization).

The Empirical Risk Minimization (ERM) learning rule selects:

$$h_S^{\mathrm{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$$

the hypothesis with the lowest training error. When the minimum is achieved by multiple hypotheses, we break ties arbitrarily.

ERM is the most natural learning strategy: pick the hypothesis that fits the training data best. Whether this strategy actually works — whether low training error implies low true error — depends on the hypothesis class $\mathcal{H}$, and characterizing when it works is the central goal of PAC theory.

Definition 4 (Realizable Case).

Hypothesis class $\mathcal{H}$ is realizable with respect to distribution $\mathcal{D}$ if there exists $h^* \in \mathcal{H}$ with $R(h^*) = 0$. That is, some hypothesis in the class achieves zero true risk — the “ground truth” labeling function belongs to $\mathcal{H}$.

Realizability is a strong assumption: it says our hypothesis class is rich enough to contain a perfect classifier. Most real-world problems are not realizable — the best classifier in $\mathcal{H}$ still makes some errors. We will handle both cases: the realizable setting first (where the analysis is cleaner), then the agnostic setting (where no assumptions are made about $\mathcal{D}$).

The learning problem setup — instance space, hypothesis class, and the true risk minimizer


Realizable PAC Learning

We are now ready to state the central definition of this topic. The name “PAC” — Probably Approximately Correct — captures both sources of uncertainty: the sample is random (so guarantees are probabilistic, “probably”), and we cannot expect perfect accuracy from finite data (so we settle for approximation, “approximately”).

Definition 5 (PAC Learnability (Realizable)).

A hypothesis class $\mathcal{H}$ is PAC learnable if there exists a learning algorithm $A$ and a function $n_{\mathcal{H}}: (0,1)^2 \to \mathbb{N}$ such that for every $\varepsilon, \delta \in (0,1)$ and every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ for which the realizability assumption holds, if $n \geq n_{\mathcal{H}}(\varepsilon, \delta)$, then with probability at least $1 - \delta$ over the random draw of $S \sim \mathcal{D}^n$:

$$R(A(S)) \leq \varepsilon$$

The function $n_{\mathcal{H}}(\varepsilon, \delta)$ is the sample complexity of learning $\mathcal{H}$.

Let’s unpack this carefully. The definition quantifies over all distributions $\mathcal{D}$ (where $\mathcal{H}$ is realizable) and all accuracy/confidence parameters $\varepsilon, \delta$. It asks: is there a single algorithm $A$ and a sample size $n$ (depending on $\varepsilon, \delta$, and possibly $\mathcal{H}$, but not on $\mathcal{D}$) such that $A$ produces a hypothesis with true risk at most $\varepsilon$ with high probability? The algorithm does not know $\mathcal{D}$ — it only sees the sample $S$.

Our first main result shows that every finite hypothesis class is PAC learnable, and ERM is the learning algorithm.

Theorem 1 (PAC Learnability of Finite Classes (Realizable)).

Let $\mathcal{H}$ be a finite hypothesis class. Then $\mathcal{H}$ is PAC learnable by ERM with sample complexity:

$$n_{\mathcal{H}}(\varepsilon, \delta) \leq \left\lceil \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon} \right\rceil$$

Proof.

Fix $\varepsilon, \delta \in (0,1)$. Let $h^* \in \mathcal{H}$ satisfy $R(h^*) = 0$ (this exists by realizability). Define the set of “bad” hypotheses:

$$\mathcal{H}_{\text{bad}} = \{h \in \mathcal{H} : R(h) > \varepsilon\}$$

Since $h^*$ achieves $\hat{R}_S(h^*) = 0$ on any sample (because $R(h^*) = 0$ and the sample is drawn from $\mathcal{D}$), ERM selects some $h_S$ with $\hat{R}_S(h_S) = 0$. The event $R(h_S) > \varepsilon$ can only occur if some bad hypothesis has zero empirical risk:

$$\Pr[R(h_S) > \varepsilon] \leq \Pr\left[\exists h \in \mathcal{H}_{\text{bad}} : \hat{R}_S(h) = 0\right]$$

We apply the union bound (from Concentration Inequalities):

$$\Pr\left[\exists h \in \mathcal{H}_{\text{bad}} : \hat{R}_S(h) = 0\right] \leq \sum_{h \in \mathcal{H}_{\text{bad}}} \Pr\left[\hat{R}_S(h) = 0\right]$$

For each bad hypothesis $h$ with $R(h) > \varepsilon$, the event $\hat{R}_S(h) = 0$ means that $h$ correctly classifies all $n$ training examples despite having true error greater than $\varepsilon$. Each training example is correctly classified with probability $1 - R(h) < 1 - \varepsilon$, and the examples are independent, so:

$$\Pr\left[\hat{R}_S(h) = 0\right] = (1 - R(h))^n \leq (1 - \varepsilon)^n \leq e^{-n\varepsilon}$$

where the last step uses the standard inequality $1 - x \leq e^{-x}$. Combining:

$$\Pr[R(h_S) > \varepsilon] \leq |\mathcal{H}_{\text{bad}}| \cdot e^{-n\varepsilon} \leq |\mathcal{H}| \cdot e^{-n\varepsilon}$$

Setting this to at most $\delta$ and solving for $n$:

$$|\mathcal{H}| \cdot e^{-n\varepsilon} \leq \delta \iff n \geq \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon}$$

$\square$

Remark (Logarithmic Sample Complexity).

The sample complexity $n = O(\log|\mathcal{H}|/\varepsilon)$ has a clean interpretation. The dependence on $|\mathcal{H}|$ is logarithmic — doubling the number of hypotheses adds only an additive $\log 2/\varepsilon$ to the requirement. This is the price of the union bound: we pay a log factor to control all hypotheses simultaneously. The dependence on accuracy is $1/\varepsilon$ — to halve the maximum error, we double the sample size.
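Plugging concrete numbers into Theorem 1 makes the logarithmic dependence tangible; a small sketch (the class sizes and parameters are illustrative, and `realizable_sample_complexity` is just a hypothetical helper name):

```python
import math

def realizable_sample_complexity(H_size, eps, delta):
    """Sample size from Theorem 1: n >= log(|H|/delta) / eps, rounded up."""
    return math.ceil(math.log(H_size / delta) / eps)

eps, delta = 0.1, 0.05
n1 = realizable_sample_complexity(1_000, eps, delta)
n2 = realizable_sample_complexity(2_000, eps, delta)
print(n1, n2)  # doubling |H| adds only ~log(2)/eps more samples
```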

The following code verifies the theoretical bound via Monte Carlo simulation. We create a finite hypothesis class of 50 linear classifiers in $\mathbb{R}^5$, run ERM on random samples of varying sizes, and compare the empirical success probability with the theoretical guarantee.

import numpy as np

rng = np.random.default_rng(0)  # seed chosen for reproducibility

n_trials = 2000
H_size_sim = 50
d = 5
epsilon_target = 0.1

H_weights = rng.standard_normal((H_size_sim, d))
H_biases = rng.standard_normal(H_size_sim)
h_star_idx = 0  # First hypothesis is the ground truth

sample_sizes_sim = [10, 20, 50, 100, 200, 500]
empirical_success = []

for n_s in sample_sizes_sim:
    successes = 0
    for _ in range(n_trials):
        X_sim = rng.standard_normal((n_s, d))
        y_sim = (X_sim @ H_weights[h_star_idx] + H_biases[h_star_idx] > 0).astype(int)

        # ERM: find hypothesis with lowest training error
        best_h, best_err = None, float('inf')
        for h_idx in range(H_size_sim):
            preds = (X_sim @ H_weights[h_idx] + H_biases[h_idx] > 0).astype(int)
            err = np.mean(preds != y_sim)
            if err < best_err:
                best_err = err
                best_h = h_idx

        # Evaluate true risk on fresh test data
        X_test = rng.standard_normal((5000, d))
        y_test = (X_test @ H_weights[h_star_idx] + H_biases[h_star_idx] > 0).astype(int)
        true_risk = np.mean(
            (X_test @ H_weights[best_h] + H_biases[best_h] > 0).astype(int) != y_test
        )

        if true_risk <= epsilon_target:
            successes += 1

    empirical_success.append(successes / n_trials)

# Compare with theoretical bound: P(success) >= 1 - |H| * exp(-n * epsilon)
theoretical_bound = [
    1 - min(H_size_sim * np.exp(-n_s * epsilon_target), 1.0)
    for n_s in sample_sizes_sim
]

Realizable PAC learning — ERM success probability vs sample size, compared with theoretical bound


Agnostic PAC Learning

The realizable setting is instructive but unrealistic. In practice, we rarely know whether our hypothesis class contains a perfect classifier. The agnostic setting drops the realizability assumption entirely: the distribution $\mathcal{D}$ can be arbitrary, and the best hypothesis in $\mathcal{H}$ may still have nonzero risk. The goal shifts from finding a hypothesis with small absolute risk to finding one whose risk is close to the best achievable within $\mathcal{H}$.

Definition 6 (Agnostic PAC Learnability).

A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exists a learning algorithm $A$ and a function $n_{\mathcal{H}}: (0,1)^2 \to \mathbb{N}$ such that for every $\varepsilon, \delta \in (0,1)$ and every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ (with no realizability assumption), if $n \geq n_{\mathcal{H}}(\varepsilon, \delta)$, then with probability at least $1 - \delta$:

$$R(A(S)) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon$$

The term $\min_{h \in \mathcal{H}} R(h)$ is the approximation error — the best possible risk within $\mathcal{H}$ — and $\varepsilon$ is the estimation error due to learning from a finite sample.

Remark (Realizable vs Agnostic Rates).

When $\mathcal{H}$ is realizable, $\min_{h \in \mathcal{H}} R(h) = 0$, and agnostic PAC reduces to the realizable definition. But the sample complexity will be different — the agnostic setting requires more samples because we cannot exploit the fact that a perfect hypothesis exists.

The key technique for proving agnostic bounds is uniform convergence: ensuring that empirical risk approximates true risk simultaneously for all hypotheses in $\mathcal{H}$.

Proposition 1 (Uniform Convergence Implies Agnostic PAC).

If $\mathcal{H}$ has the uniform convergence property — that is, for every $\varepsilon, \delta > 0$, there exists $n_{\mathrm{UC}}(\varepsilon, \delta)$ such that for $n \geq n_{\mathrm{UC}}$:

$$\Pr\left[\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \leq \varepsilon\right] \geq 1 - \delta$$

— then $\mathcal{H}$ is agnostic PAC learnable by ERM with sample complexity $n_{\mathcal{H}}(\varepsilon, \delta) = n_{\mathrm{UC}}(\varepsilon/2, \delta)$.

Proof.

Suppose uniform convergence holds with parameter $\varepsilon/2$, so that $|\hat{R}_S(h) - R(h)| \leq \varepsilon/2$ for all $h \in \mathcal{H}$ simultaneously. Let $h_S$ be the ERM hypothesis and $h^* = \arg\min_{h \in \mathcal{H}} R(h)$. Then:

$$R(h_S) \leq \hat{R}_S(h_S) + \frac{\varepsilon}{2} \leq \hat{R}_S(h^*) + \frac{\varepsilon}{2} \leq R(h^*) + \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = R(h^*) + \varepsilon$$

The first and third inequalities use uniform convergence; the second uses the fact that ERM minimizes empirical risk, so $\hat{R}_S(h_S) \leq \hat{R}_S(h^*)$. $\square$

This proposition reduces the learning problem to a concentration problem: we need to show that $\hat{R}_S(h) \approx R(h)$ uniformly over $\mathcal{H}$. For finite classes, this follows directly from Hoeffding’s inequality and the union bound.

Theorem 2 (Agnostic PAC Learnability of Finite Classes).

Let $\mathcal{H}$ be a finite hypothesis class. Then $\mathcal{H}$ is agnostic PAC learnable by ERM with sample complexity:

$$n_{\mathcal{H}}(\varepsilon, \delta) \leq \left\lceil \frac{2\log(2|\mathcal{H}|/\delta)}{\varepsilon^2} \right\rceil$$

Proof.

We need uniform convergence: $\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \leq \varepsilon/2$ with probability at least $1 - \delta$.

For a fixed hypothesis $h$, the random variables $\mathbf{1}[h(x_i) \neq y_i]$ are i.i.d. Bernoulli with mean $R(h) \in [0,1]$. By Hoeffding’s inequality (from Concentration Inequalities):

$$\Pr\left[|\hat{R}_S(h) - R(h)| \geq \frac{\varepsilon}{2}\right] \leq 2\exp\left(-\frac{n\varepsilon^2}{2}\right)$$

Applying the union bound over all $h \in \mathcal{H}$:

$$\Pr\left[\exists h \in \mathcal{H} : |\hat{R}_S(h) - R(h)| \geq \frac{\varepsilon}{2}\right] \leq 2|\mathcal{H}| \exp\left(-\frac{n\varepsilon^2}{2}\right)$$

Setting this to at most $\delta$:

$$2|\mathcal{H}| \exp\left(-\frac{n\varepsilon^2}{2}\right) \leq \delta \iff n \geq \frac{2\log(2|\mathcal{H}|/\delta)}{\varepsilon^2}$$

By Proposition 1, this gives agnostic PAC learnability with the stated sample complexity. $\square$

Remark (The 1/ε vs 1/ε² Gap).

Compare the sample complexities: realizable gives $O(\log|\mathcal{H}|/\varepsilon)$ while agnostic gives $O(\log|\mathcal{H}|/\varepsilon^2)$. The agnostic bound is quadratically worse in $\varepsilon$. This is not an artifact of the proof technique — it reflects a fundamental difference. In the realizable case, we exploited the fact that $\hat{R}_S(h^*) = 0$, which gave us a one-sided tail bound. In the agnostic case, we need a two-sided bound (empirical risk can be both above and below true risk), and controlling this requires the stronger $1/\varepsilon^2$ rate from Hoeffding’s inequality.
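The gap is dramatic in concrete numbers; a quick sketch comparing the two bounds from Theorems 1 and 2 (the parameter values are illustrative, and the helper names are ad hoc):

```python
import math

def realizable_n(H_size, eps, delta):
    # Theorem 1: n >= log(|H|/delta) / eps
    return math.ceil(math.log(H_size / delta) / eps)

def agnostic_n(H_size, eps, delta):
    # Theorem 2: n >= 2 log(2|H|/delta) / eps^2
    return math.ceil(2 * math.log(2 * H_size / delta) / eps**2)

for eps in [0.1, 0.05, 0.01]:
    print(eps, realizable_n(1_000, eps, 0.05), agnostic_n(1_000, eps, 0.05))
# Halving eps doubles the realizable bound but quadruples the agnostic one.
```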

The following simulation illustrates uniform convergence empirically: we draw 50 random hypotheses, compute their empirical risks over 500 trials, and compare the distribution of the supremum gap $\sup_h |\hat{R}_S(h) - R(h)|$ against the Hoeffding + union bound prediction.

import numpy as np

n_sample = 100
n_hyp = 50
rng2 = np.random.default_rng(123)

true_risks = rng2.beta(2, 5, n_hyp)
emp_risks = np.zeros((n_hyp, 500))

for trial in range(500):
    for h_idx in range(n_hyp):
        outcomes = rng2.binomial(1, true_risks[h_idx], n_sample)
        emp_risks[h_idx, trial] = outcomes.mean()

# Supremum gap over all hypotheses
sup_gap = np.max(np.abs(emp_risks - true_risks[:, None]), axis=0)

# Theoretical bound: Hoeffding + union bound
eps_theory = np.linspace(0.001, 0.4, 200)
p_exceed_hoeffding_ub = np.minimum(
    2 * n_hyp * np.exp(-2 * n_sample * eps_theory**2), 1.0
)

# Empirical exceedance probability for comparison
empirical_survival = [np.mean(sup_gap > e) for e in eps_theory]

Agnostic PAC learning — supremum gap distribution vs Hoeffding + union bound


The VC Dimension

For finite hypothesis classes, the story is complete: both realizable and agnostic PAC learning are possible, with sample complexity logarithmic in $|\mathcal{H}|$. But most interesting hypothesis classes are infinite — the class of all linear classifiers in $\mathbb{R}^d$, for instance, has uncountably many elements. The union bound over $|\mathcal{H}|$ is useless when $|\mathcal{H}| = \infty$.

The key insight is that an infinite hypothesis class, when restricted to a finite sample, can only produce finitely many distinct labeling patterns. The effective complexity of $\mathcal{H}$ on $n$ points is not $|\mathcal{H}|$ but the number of distinct behaviors on those points. This leads us to the VC dimension.

Definition 7 (Restriction).

The restriction of hypothesis class $\mathcal{H}$ to a finite set $C = \{x_1, \ldots, x_m\} \subset \mathcal{X}$ is:

$$\mathcal{H}_C = \{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\} \subseteq \{0,1\}^m$$

This is the set of all labeling patterns that hypotheses in $\mathcal{H}$ can produce on the points in $C$.

Even if $\mathcal{H}$ is infinite, $\mathcal{H}_C$ is always finite — it has at most $2^m$ elements (the total number of possible binary labelings of $m$ points). The question is: for how large an $m$ can we achieve the maximum $|\mathcal{H}_C| = 2^m$?
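The restriction is easy to compute for a concrete class; a sketch for 1-D thresholds $h_a(x) = \mathbf{1}[x \leq a]$ (the sample points are arbitrary):

```python
# Restriction of the threshold class h_a(x) = 1[x <= a] to m sample points.
# Though there are uncountably many thresholds a, only m + 1 distinct
# labeling patterns appear on any m distinct points.
points = [0.5, 1.3, 2.7, 4.1, 5.9]  # arbitrary sample, sorted

# One threshold below all points, one between each consecutive pair, one above:
# every other threshold produces the same pattern as one of these.
candidates = ([points[0] - 1]
              + [(a + b) / 2 for a, b in zip(points, points[1:])]
              + [points[-1] + 1])
patterns = {tuple(int(x <= a) for x in points) for a in candidates}

print(len(patterns), 2 ** len(points))  # 6 patterns out of 32 possible
```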

Definition 8 (Shattering).

A hypothesis class $\mathcal{H}$ shatters a set $C \subset \mathcal{X}$ if $\mathcal{H}_C = \{0,1\}^{|C|}$, i.e., every possible binary labeling of $C$ is realized by some $h \in \mathcal{H}$.

Shattering means $\mathcal{H}$ is maximally expressive on $C$ — no matter how the points are labeled, some hypothesis in $\mathcal{H}$ fits perfectly. This is the combinatorial analog of overfitting: if $\mathcal{H}$ can shatter large sets, it can memorize arbitrary patterns, which makes generalization harder.

Definition 9 (VC Dimension).

The Vapnik–Chervonenkis dimension of $\mathcal{H}$, denoted $\mathrm{VCdim}(\mathcal{H})$, is the largest integer $d$ such that there exists a set $C \subset \mathcal{X}$ with $|C| = d$ that is shattered by $\mathcal{H}$. If $\mathcal{H}$ can shatter arbitrarily large sets, then $\mathrm{VCdim}(\mathcal{H}) = \infty$.

The VC dimension measures the largest set of points that $\mathcal{H}$ can label in all possible ways. It is a worst-case combinatorial measure: we only need one set of size $d$ that is shattered, but no set of size $d+1$ can be shattered. Let’s see some examples.

Example 1 (Thresholds on ℝ).

Consider $\mathcal{H} = \{h_a : a \in \mathbb{R}\}$ where $h_a(x) = \mathbf{1}[x \leq a]$ (predict 1 if $x$ is at most $a$). For any single point $\{x_1\}$, we can realize both labelings: choose $a > x_1$ for label 1, or $a < x_1$ for label 0. So $\mathcal{H}$ shatters sets of size 1.

But for any pair $\{x_1, x_2\}$ with $x_1 < x_2$, the labeling $(0, 1)$ — “the smaller point is negative and the larger point is positive” — cannot be achieved by any threshold. Therefore $\mathrm{VCdim}(\mathcal{H}_{\text{threshold}}) = 1$.

Example 2 (Intervals on ℝ).

Consider $\mathcal{H} = \{h_{a,b} : a \leq b\}$ where $h_{a,b}(x) = \mathbf{1}[a \leq x \leq b]$ (predict 1 if $x$ lies in the interval $[a,b]$). Any two points can be shattered (check all four labelings). But for three points $x_1 < x_2 < x_3$, the labeling $(1, 0, 1)$ — positive at the endpoints, negative in the middle — cannot be produced by a single interval. Therefore $\mathrm{VCdim}(\mathcal{H}_{\text{interval}}) = 2$.
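Both examples can be verified mechanically by brute force; a sketch where the candidate thresholds and interval endpoints are chosen to hit every region between the test points:

```python
# Brute-force shattering check for the classes in Examples 1 and 2.
def patterns_on(points, hypotheses):
    """All distinct labeling patterns the hypotheses produce on the points."""
    return {tuple(int(h(x)) for x in points) for h in hypotheses}

def shatters(points, hypotheses):
    return len(patterns_on(points, hypotheses)) == 2 ** len(points)

# Half-integer parameters: enough candidates to realize every pattern
# that any threshold or interval can achieve on integer-valued points.
mids = [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5]
thresholds = [lambda x, a=a: x <= a for a in mids]
intervals = [lambda x, a=a, b=b: a <= x <= b
             for a in mids for b in mids if a <= b]

print(shatters([1.0], thresholds))           # True: VCdim >= 1
print(shatters([1.0, 2.0], thresholds))      # False: (0,1) unreachable
print(shatters([1.0, 2.0], intervals))       # True: VCdim >= 2
print(shatters([1.0, 2.0, 3.0], intervals))  # False: (1,0,1) unreachable
```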


The most important example for machine learning is linear classifiers, whose VC dimension has a clean characterization.

Theorem 3 (VC Dimension of Linear Classifiers).

The class of halfspaces in $\mathbb{R}^d$:

$$\mathcal{H}_{\text{lin}} = \{x \mapsto \mathbf{1}[\mathbf{w} \cdot x + b \geq 0] : \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}$$

has $\mathrm{VCdim}(\mathcal{H}_{\text{lin}}) = d + 1$.

Proof.

Lower bound ($\mathrm{VCdim} \geq d + 1$). We exhibit a set of $d+1$ points that $\mathcal{H}_{\text{lin}}$ shatters. Take the origin $\mathbf{0}$ and the $d$ standard basis vectors $\mathbf{e}_1, \ldots, \mathbf{e}_d$. For any binary labeling of these $d+1$ points, we can construct a weight vector $\mathbf{w}$ and bias $b$ that realizes it. The key is that these points are affinely independent — they do not all lie on a common hyperplane — so the $d+1$ linear constraints $\mathbf{w} \cdot x_i + b \geq 0$ (or $< 0$) are always simultaneously satisfiable.

Upper bound ($\mathrm{VCdim} \leq d + 1$). We use Radon’s theorem from convex geometry: any set of $d+2$ points in $\mathbb{R}^d$ can be partitioned into two disjoint sets $P, Q$ whose convex hulls intersect: $\mathrm{conv}(P) \cap \mathrm{conv}(Q) \neq \emptyset$. If the convex hulls intersect, no hyperplane can separate $P$ from $Q$, so the labeling that assigns 1 to $P$ and 0 to $Q$ is not realizable. Therefore no set of $d+2$ points can be shattered. $\square$

This result connects the VC dimension to the geometric dimension of the feature space: linear classifiers in $\mathbb{R}^d$ have VC dimension $d + 1$. This is why the number of features matters for generalization — it directly controls the complexity of the hypothesis class.
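The lower-bound construction in the proof is fully explicit; a sketch that enumerates all labelings for $d = 3$ (the closed-form choice of $\mathbf{w}$ and $b$ below is one of many that work):

```python
import itertools
import numpy as np

d = 3
# The shattered set from the lower-bound proof: origin plus standard basis.
points = np.vstack([np.zeros(d), np.eye(d)])  # (d+1) x d

for labels in itertools.product([0, 1], repeat=d + 1):
    signs = 2 * np.array(labels) - 1  # map {0,1} -> {-1,+1}
    b = 0.5 * signs[0]                # sign of b handles the origin
    w = signs[1:] - b                 # then w_i + b = ±1 matches e_i's label
    preds = (points @ w + b >= 0).astype(int)
    assert np.array_equal(preds, np.array(labels))

print(f"all {2**(d + 1)} labelings realized: VCdim >= d + 1 = {d + 1}")
```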

VC dimension and shattering examples — thresholds, intervals, and linear classifiers


Sauer–Shelah Lemma and Growth Functions

The VC dimension tells us the boundary at which shattering becomes impossible: $\mathcal{H}$ cannot shatter any set of size $d+1$ or larger. But we need a more quantitative statement: exactly how many labelings can $\mathcal{H}$ produce on a set of $m$ points when $m > d$? The growth function captures this.

Definition 10 (Growth Function).

The growth function (also called the shattering coefficient) of $\mathcal{H}$ is:

$$\Pi_{\mathcal{H}}(m) = \max_{C \subset \mathcal{X},\, |C|=m} |\mathcal{H}_C|$$

the maximum number of distinct labelings that $\mathcal{H}$ can produce on any set of $m$ points. By definition, $\Pi_{\mathcal{H}}(m) \leq 2^m$ always, and $\mathrm{VCdim}(\mathcal{H}) = d$ means $\Pi_{\mathcal{H}}(d) = 2^d$ but $\Pi_{\mathcal{H}}(m) < 2^m$ for all $m > d$.

The Sauer–Shelah lemma is the remarkable fact that once $\Pi_{\mathcal{H}}$ drops below $2^m$, it doesn’t just decrease slightly — it collapses from exponential to polynomial growth. This is the combinatorial engine that makes infinite hypothesis classes learnable.

Lemma 1 (Sauer–Shelah Lemma).

If $\mathrm{VCdim}(\mathcal{H}) = d < \infty$, then for all $m \geq 1$:

$$\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}$$

Proof.

We prove this by strong induction on $m + d$.

Base cases. If $d = 0$, then $\mathcal{H}$ cannot shatter any single point, so $\Pi_{\mathcal{H}}(m) \leq 1 = \binom{m}{0}$ for all $m$. If $m \leq d$, then $\sum_{i=0}^{d} \binom{m}{i} = 2^m \geq \Pi_{\mathcal{H}}(m)$ trivially.

Inductive step. Assume the lemma holds for all pairs $(m', d')$ with $m' + d' < m + d$. Let $C = \{x_1, \ldots, x_m\}$ be any set of $m$ points achieving the maximum $\Pi_{\mathcal{H}}(m) = |\mathcal{H}_C|$. Set $C' = C \setminus \{x_m\} = \{x_1, \ldots, x_{m-1}\}$.

We partition $\mathcal{H}_C$ based on the behavior at $x_m$. For each labeling pattern $\mathbf{b} = (b_1, \ldots, b_{m-1}) \in \mathcal{H}_{C'}$, either:

  • $\mathbf{b}$ extends to exactly one labeling on $C$ (either $(b_1, \ldots, b_{m-1}, 0)$ or $(b_1, \ldots, b_{m-1}, 1)$ is in $\mathcal{H}_C$, but not both), or
  • $\mathbf{b}$ extends to both labelings on $C$ (both extensions are in $\mathcal{H}_C$).

Let $\mathcal{H}_0 = \mathcal{H}_{C'}$ be the set of all patterns on $C'$, and let $\mathcal{H}_1 \subseteq \mathcal{H}_{C'}$ be the set of patterns that extend both ways. Then:

$$|\mathcal{H}_C| = |\mathcal{H}_0| + |\mathcal{H}_1|$$

because each pattern in $\mathcal{H}_0 \setminus \mathcal{H}_1$ contributes one labeling on $C$, and each pattern in $\mathcal{H}_1$ contributes two.

Now we bound each term:

  • $\mathcal{H}_0$ is the restriction of $\mathcal{H}$ to $m-1$ points, so $\mathrm{VCdim}(\mathcal{H}_0) \leq d$. By the inductive hypothesis: $|\mathcal{H}_0| \leq \sum_{i=0}^{d} \binom{m-1}{i}$.

  • For $\mathcal{H}_1$, we claim $\mathrm{VCdim}(\mathcal{H}_1) \leq d - 1$. If $\mathcal{H}_1$ shattered a set $D \subseteq C'$ of size $d$, then since every pattern in $\mathcal{H}_1$ extends both ways on $x_m$, the set $D \cup \{x_m\}$ of size $d+1$ would be shattered by $\mathcal{H}$ — contradicting $\mathrm{VCdim}(\mathcal{H}) = d$. By the inductive hypothesis: $|\mathcal{H}_1| \leq \sum_{i=0}^{d-1} \binom{m-1}{i}$.

Combining and using Pascal’s identity $\binom{m-1}{i} + \binom{m-1}{i-1} = \binom{m}{i}$:

$$|\mathcal{H}_C| \leq \sum_{i=0}^{d} \binom{m-1}{i} + \sum_{i=0}^{d-1} \binom{m-1}{i} = \binom{m-1}{0} + \sum_{i=1}^{d}\left[\binom{m-1}{i} + \binom{m-1}{i-1}\right] = \sum_{i=0}^{d} \binom{m}{i}$$

$\square$
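The lemma is tight for the interval class from Example 2, where $d = 2$; a quick check (the run-counting formula is a direct enumeration of interval labelings):

```python
from math import comb

# For intervals on the line (VCdim = 2), the number of labeling patterns on
# m distinct points is exactly 1 + (number of contiguous runs of 1s): the
# all-zeros pattern plus one pattern per run [i, j). This meets the
# Sauer-Shelah bound sum_{i<=2} C(m, i) exactly.
def interval_patterns(m):
    runs = sum(1 for i in range(m) for j in range(i + 1, m + 1))
    return 1 + runs

for m in [3, 5, 10, 20]:
    sauer = sum(comb(m, i) for i in range(3))  # d = 2
    assert interval_patterns(m) == sauer
    print(m, interval_patterns(m), 2**m)  # polynomial vs exponential
```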

Corollary 1 (Polynomial Growth Bound).

For $m \geq d \geq 1$:

$$\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^d$$

The second inequality follows from the standard bound $\sum_{i=0}^{d} \binom{m}{i} \leq (em/d)^d$, which can be proved using the entropy method or direct algebraic manipulation.

The significance of Corollary 1 cannot be overstated. The growth function transitions from $\Pi_{\mathcal{H}}(m) = 2^m$ (exponential) for $m \leq d$ to $\Pi_{\mathcal{H}}(m) \leq (em/d)^d$ (polynomial in $m$ with degree $d$) for $m > d$. This phase transition at $m = d$ is what allows the VC theory to work: we can replace $|\mathcal{H}|$ in the union bound by $\Pi_{\mathcal{H}}(n) \leq (en/d)^d$, which is polynomial rather than infinite.

The following code computes the growth function exactly and demonstrates the phase transition:

import numpy as np
from scipy.special import comb

m_range = np.arange(1, 31)

# Growth function bound for several VC dimensions, keyed by d
growth_by_d = {
    d: np.array([
        sum(comb(m, i, exact=True) for i in range(d + 1))
        for m in m_range
    ])
    for d in [1, 2, 3, 5, 10]
}

# Sauer-Shelah bound tightness
d = 3
m_range2 = np.arange(d, 51)
exact = np.array([
    sum(comb(m, i, exact=True) for i in range(d + 1))
    for m in m_range2
])
upper = (np.e * m_range2 / d) ** d  # Simplified bound

# Phase transition: ratio Pi(m) / 2^m drops at m = d
d_val = 5
m_range3 = np.arange(1, 25)
growth_exact = np.array([
    sum(comb(m, i, exact=True) for i in range(d_val + 1))
    for m in m_range3
])
ratio = growth_exact / 2.0**m_range3

Growth functions and the Sauer–Shelah bound — phase transition at m = d


The Fundamental Theorem of Statistical Learning

We now have all the ingredients to state the deepest result in PAC learning theory. The Fundamental Theorem shows that four seemingly different properties of a hypothesis class are equivalent — they are four views of the same underlying phenomenon.

Theorem 4 (VC Bound (Finite-Sample)).

Let $\mathrm{VCdim}(\mathcal{H}) = d < \infty$. For any distribution $\mathcal{D}$ and any $\delta \in (0,1)$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^n$:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_S(h)| \leq \sqrt{\frac{8d\log(2en/d) + 8\log(4/\delta)}{n}}$$

Proof.

The proof combines three techniques.

Step 1: Symmetrization. Replace the true risk $R(h)$ with the empirical risk on an independent “ghost sample” $S' \sim \mathcal{D}^n$. By a doubling argument, $\Pr[\sup_h |R(h) - \hat{R}_S(h)| \geq \varepsilon] \leq 2\Pr[\sup_h |\hat{R}_S(h) - \hat{R}_{S'}(h)| \geq \varepsilon/2]$. This step eliminates the unknown distribution $\mathcal{D}$ from the bound.

Step 2: Rademacher randomization. Since $S$ and $S'$ are exchangeable, we can insert random sign flips: replacing some pairs $(z_i, z_i')$ by $(z_i', z_i)$ doesn’t change the distribution. This yields a bound in terms of Rademacher averages.

Step 3: Growth function bound. On the combined sample $S \cup S'$ of $2n$ points, $\mathcal{H}$ produces at most $\Pi_{\mathcal{H}}(2n) \leq (2en/d)^d$ distinct labeling patterns (by Sauer–Shelah). Apply the union bound and Hoeffding’s inequality over these finitely many patterns. The resulting bound is the VC bound. $\square$

The VC bound establishes uniform convergence for any class with finite VC dimension. Combined with Proposition 1 (uniform convergence implies agnostic PAC), this gives us one direction of the Fundamental Theorem. The full result establishes equivalence.
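The bound from Theorem 4 is easy to evaluate numerically; a sketch showing how slowly it shrinks (the values of $d$, $n$, and $\delta$ are illustrative):

```python
import math

def vc_bound(d, n, delta):
    """Right-hand side of Theorem 4, with the constants as stated there."""
    return math.sqrt((8 * d * math.log(2 * math.e * n / d)
                      + 8 * math.log(4 / delta)) / n)

d, delta = 10, 0.05
for n in [10**3, 10**4, 10**5, 10**6]:
    print(n, round(vc_bound(d, n, delta), 4))
# Shrinks roughly like sqrt(d log n / n): slow, but distribution-free.
```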

Theorem 5 (The Fundamental Theorem of Statistical Learning).

For binary classification with 0-1 loss, the following are equivalent:

  1. $\mathcal{H}$ has the uniform convergence property.
  2. $\mathcal{H}$ is agnostic PAC learnable (by ERM).
  3. $\mathcal{H}$ is PAC learnable in the realizable setting (by ERM).
  4. $\mathrm{VCdim}(\mathcal{H}) < \infty$.

Proof.

We outline the implications:

(4) \Rightarrow (1): This is Theorem 4 (the VC bound). Finite VC dimension gives us the growth function bound via Sauer–Shelah, which yields uniform convergence through symmetrization and the union bound over distinct labeling patterns.

(1) \Rightarrow (2): This is Proposition 1. Uniform convergence guarantees that the ERM hypothesis has risk close to the best in H\mathcal{H}.

(2) \Rightarrow (3): Immediate — the realizable setting is the special case of the agnostic setting in which \min_{h \in \mathcal{H}} R(h) = 0, so an agnostic PAC learner is in particular a realizable PAC learner.

(3) \Rightarrow (4): This is the hardest direction, proved by contrapositive. If VCdim(H)=\mathrm{VCdim}(\mathcal{H}) = \infty, then for every sample size nn, there exists a set of 2n2n points that H\mathcal{H} shatters. An adversary can construct a distribution D\mathcal{D} on this set such that any learning algorithm fails with probability at least 1/41/4 — this is a No Free Lunch argument. The construction places uniform distribution on the 2n2n points and assigns labels that the adversary chooses after seeing the algorithm’s output, exploiting the shattering to always have a “hard” labeling available. \square

Corollary 2 (VC Dimension Sample Complexity).

If VCdim(H)=d<\mathrm{VCdim}(\mathcal{H}) = d < \infty, then H\mathcal{H} is agnostic PAC learnable with sample complexity:

nH(ε,δ)=O(dlog(1/ε)+log(1/δ)ε2)n_{\mathcal{H}}(\varepsilon, \delta) = O\left(\frac{d\log(1/\varepsilon) + \log(1/\delta)}{\varepsilon^2}\right)

In particular, the sample complexity depends on H\mathcal{H} only through its VC dimension dd — not on the cardinality or any other structural property of H\mathcal{H}.
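Corollary 2 can be made concrete by inverting the VC bound numerically: find the smallest n whose bound drops below \varepsilon. A sketch, again using Theorem 4's unoptimized constants:

```python
import numpy as np

def vc_bound(d, n, delta):
    return np.sqrt((8 * d * np.log(2 * np.e * n / d) + 8 * np.log(4 / delta)) / n)

def sample_complexity(d, eps, delta=0.05):
    """Smallest n with vc_bound(d, n, delta) <= eps, found by a
    doubling search plus bisection (the bound is decreasing in n)."""
    n = max(d, 1)
    if vc_bound(d, n, delta) <= eps:
        return n
    while vc_bound(d, n, delta) > eps:  # doubling search
        n *= 2
    lo, hi = n // 2, n  # invariant: bound(lo) > eps >= bound(hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if vc_bound(d, mid, delta) <= eps:
            hi = mid
        else:
            lo = mid
    return hi
```

Doubling d roughly doubles the required n, as the corollary's d/\varepsilon^2 scaling predicts.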

Remark (Significance of the Fundamental Theorem).

The Fundamental Theorem is the crown jewel of computational learning theory. It tells us that one number — the VC dimension — completely characterizes whether a binary hypothesis class is learnable. This is a qualitative characterization (finite VC dimension \iff learnable) that also gives quantitative sample complexity bounds (through the VC bound). The elegance is in the equivalence of four seemingly different conditions: a convergence property, two learnability definitions, and a combinatorial measure.

The Fundamental Theorem — equivalences between uniform convergence, PAC learnability, and finite VC dimension


Rademacher Complexity

The VC dimension is a combinatorial complexity measure — it depends on H\mathcal{H} but not on the data distribution D\mathcal{D} or the specific sample SS. This universality is a strength (the VC bound holds for all distributions) but also a weakness (the bound may be loose for specific distributions). Rademacher complexity provides a data-dependent alternative that can yield tighter bounds.

The intuition is simple: a complex hypothesis class can “fit random noise” — it can correlate with random labels. Rademacher complexity measures this ability.

Definition 11 (Empirical Rademacher Complexity).

For a fixed sample S = \{z_1, \ldots, z_n\} and a class \mathcal{F} of functions \mathcal{Z} \to \mathbb{R}, the empirical Rademacher complexity is:

R^S(F)=Eσ[supfF1ni=1nσif(zi)]\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(z_i)\right]

where σ1,,σn\sigma_1, \ldots, \sigma_n are i.i.d. Rademacher variables: Pr[σi=+1]=Pr[σi=1]=1/2\Pr[\sigma_i = +1] = \Pr[\sigma_i = -1] = 1/2.

The expression inside the expectation asks: given random signs σi\sigma_i, how well can the best function in F\mathcal{F} correlate with these signs? If F\mathcal{F} contains a function that correlates highly with random noise, the class is complex. If no function can do better than random chance, the class is simple.

Definition 12 (Rademacher Complexity).

The (population) Rademacher complexity is the expectation of the empirical version over the random draw of the sample:

Rn(F)=ESDn[R^S(F)]\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_{S \sim \mathcal{D}^n}\left[\hat{\mathfrak{R}}_S(\mathcal{F})\right]

This measures how well F\mathcal{F} can fit random noise on average over samples from D\mathcal{D}.

The main result connects Rademacher complexity to generalization using tools from the previous topic: McDiarmid's inequality (applied twice) handles the concentration steps, and a symmetrization argument bounds the expectation.

Theorem 6 (Rademacher Generalization Bound).

Let F\mathcal{F} be a class of functions mapping to [0,1][0,1]. For any δ>0\delta > 0, with probability at least 1δ1 - \delta over SDnS \sim \mathcal{D}^n:

supfF(E[f]1ni=1nf(zi))2R^S(F)+3log(2/δ)2n\sup_{f \in \mathcal{F}} \left(\mathbb{E}[f] - \frac{1}{n}\sum_{i=1}^n f(z_i)\right) \leq 2\hat{\mathfrak{R}}_S(\mathcal{F}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}

Proof.

We break the proof into three steps.

Step 1: Bounded differences. Define Φ(S)=supfF(E[f]E^S[f])\Phi(S) = \sup_{f \in \mathcal{F}} (\mathbb{E}[f] - \hat{\mathbb{E}}_S[f]) where E^S[f]=1nif(zi)\hat{\mathbb{E}}_S[f] = \frac{1}{n}\sum_i f(z_i). Changing a single sample point ziz_i changes Φ(S)\Phi(S) by at most 1/n1/n (since each ff maps to [0,1][0,1]). By McDiarmid’s inequality:

Pr[Φ(S)E[Φ(S)]t]exp(2nt2)\Pr[\Phi(S) - \mathbb{E}[\Phi(S)] \geq t] \leq \exp(-2nt^2)

Setting t=log(2/δ)/(2n)t = \sqrt{\log(2/\delta)/(2n)} gives Φ(S)E[Φ(S)]+log(2/δ)/(2n)\Phi(S) \leq \mathbb{E}[\Phi(S)] + \sqrt{\log(2/\delta)/(2n)} with probability 1δ/2\geq 1 - \delta/2.

Step 2: Symmetrization. We bound E[Φ(S)]\mathbb{E}[\Phi(S)] using a ghost sample S={z1,,zn}S' = \{z_1', \ldots, z_n'\}:

E[Φ(S)]=ES[supf(ES[E^S[f]]E^S[f])]ES,S[supf(E^S[f]E^S[f])]\mathbb{E}[\Phi(S)] = \mathbb{E}_S\left[\sup_f\left(\mathbb{E}_{S'}[\hat{\mathbb{E}}_{S'}[f]] - \hat{\mathbb{E}}_S[f]\right)\right] \leq \mathbb{E}_{S,S'}\left[\sup_f\left(\hat{\mathbb{E}}_{S'}[f] - \hat{\mathbb{E}}_S[f]\right)\right]

Since z_i and z_i' are identically distributed, exchanging any pair leaves the joint distribution unchanged, so we may replace each difference f(z_i') - f(z_i) by \sigma_i(f(z_i') - f(z_i)), where the \sigma_i are Rademacher variables. This gives:

E[Φ(S)]ES,S,σ[supf1ni=1nσi(f(zi)f(zi))]2ES,σ[supf1ni=1nσif(zi)]=2Rn(F)\mathbb{E}[\Phi(S)] \leq \mathbb{E}_{S,S',\boldsymbol{\sigma}}\left[\sup_f \frac{1}{n}\sum_{i=1}^n \sigma_i(f(z_i') - f(z_i))\right] \leq 2\mathbb{E}_{S,\boldsymbol{\sigma}}\left[\sup_f \frac{1}{n}\sum_{i=1}^n \sigma_i f(z_i)\right] = 2\mathfrak{R}_n(\mathcal{F})

Step 3: Empirical to population. We convert Rn(F)\mathfrak{R}_n(\mathcal{F}) to the empirical R^S(F)\hat{\mathfrak{R}}_S(\mathcal{F}) using McDiarmid again. The empirical Rademacher complexity SR^S(F)S \mapsto \hat{\mathfrak{R}}_S(\mathcal{F}) has bounded differences 1/n\leq 1/n (changing one sample point changes the supremum by at most 1/n1/n). So with probability 1δ/2\geq 1 - \delta/2:

Rn(F)R^S(F)+log(2/δ)2n\mathfrak{R}_n(\mathcal{F}) \leq \hat{\mathfrak{R}}_S(\mathcal{F}) + \sqrt{\frac{\log(2/\delta)}{2n}}

Combining Steps 1–3 by a union bound over the two δ/2\delta/2 events:

Φ(S)2R^S(F)+2log(2/δ)2n+log(2/δ)2n=2R^S(F)+3log(2/δ)2n\Phi(S) \leq 2\hat{\mathfrak{R}}_S(\mathcal{F}) + 2\sqrt{\frac{\log(2/\delta)}{2n}} + \sqrt{\frac{\log(2/\delta)}{2n}} = 2\hat{\mathfrak{R}}_S(\mathcal{F}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}

\square

Two structural results connect Rademacher complexity to our earlier measures.

Proposition 2 (Massart's Lemma).

For a finite function class F\mathcal{F}:

R^S(F)maxffS22logFn\hat{\mathfrak{R}}_S(\mathcal{F}) \leq \frac{\max_f \|f_S\|_2 \cdot \sqrt{2\log|\mathcal{F}|}}{n}

where \|f_S\|_2 = \sqrt{\sum_i f(z_i)^2}. For functions bounded in [0,1] (or taking values in \{\pm 1\}), \|f_S\|_2 \leq \sqrt{n}, so the bound becomes \sqrt{2\log|\mathcal{F}|/n}; this recovers the \sqrt{\log|\mathcal{H}|/n} rate for finite classes.
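Massart's lemma is easy to sanity-check by Monte Carlo. The sketch below draws an arbitrary finite class of [0,1]-valued functions (represented by their values on the sample; a hypothetical example, not a meaningful hypothesis class) and compares the estimated empirical Rademacher complexity against the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
num_f, n = 50, 200
# Finite class: row j holds (f_j(z_1), ..., f_j(z_n)), values in [0, 1]
F = rng.uniform(0.0, 1.0, size=(num_f, n))

# Monte Carlo estimate of the empirical Rademacher complexity
trials = 2000
est = np.mean([np.max(F @ rng.choice([-1, 1], size=n)) / n
               for _ in range(trials)])

# Massart: empirical Rademacher complexity <= max_f ||f_S||_2 * sqrt(2 log|F|) / n
bound = np.max(np.linalg.norm(F, axis=1)) * np.sqrt(2 * np.log(num_f)) / n
print(f"estimate {est:.4f} <= bound {bound:.4f}")
```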

Proposition 3 (VC Bounds Rademacher Complexity).

If VCdim(H)=d\mathrm{VCdim}(\mathcal{H}) = d, then:

Rn(H)2dlog(en/d)n\mathfrak{R}_n(\mathcal{H}) \leq \sqrt{\frac{2d\log(en/d)}{n}}

This follows from combining Massart’s Lemma with the Sauer–Shelah bound on the number of effective hypotheses: HSΠH(n)(en/d)d|\mathcal{H}_S| \leq \Pi_{\mathcal{H}}(n) \leq (en/d)^d.

Proposition 3 shows that the Rademacher bound is never worse than the VC bound (up to constants). But because R^S(F)\hat{\mathfrak{R}}_S(\mathcal{F}) depends on the specific sample SS, it can be much tighter when the data has favorable structure — for instance, when the data is low-dimensional or when most hypotheses behave similarly on typical inputs.

The following code estimates empirical Rademacher complexity via Monte Carlo simulation:

import numpy as np

rng = np.random.default_rng(0)  # random generator used throughout this snippet

def estimate_rademacher(H_preds, n_rad=500):
    """
    Estimate empirical Rademacher complexity via Monte Carlo.
    H_preds: (num_hypotheses, sample_size) predictions matrix
    n_rad: number of random sign trials
    Returns: mean of max correlation with random signs
    """
    n_s = H_preds.shape[1]
    maxcorr = []
    for _ in range(n_rad):
        sigma = rng.choice([-1, 1], size=n_s)
        maxcorr.append(np.max(H_preds @ sigma / n_s))
    return np.mean(maxcorr)

# How Rademacher complexity scales with |H|
n_data = 200
X_data = rng.standard_normal((n_data, 2))
H_sizes_test = [5, 10, 25, 50, 100, 200, 500]
rad_c = []

for H_s in H_sizes_test:
    W = rng.standard_normal((H_s, 2))
    b = rng.standard_normal(H_s)
    preds = np.sign(X_data @ W.T + b)
    rad_c.append(estimate_rademacher(preds.T))

# Compare with theoretical: O(sqrt(log|H| / n))
theory_rc = np.sqrt(2 * np.log(np.array(H_sizes_test)) / n_data)

Rademacher complexity — empirical estimation and comparison with theoretical bounds


Applications and Worked Examples

The abstract theory developed above has direct implications for understanding the generalization behavior of concrete machine learning methods. We discuss four applications.

Linear Classifiers

For halfspaces in Rd\mathbb{R}^d, we know VCdim=d+1\mathrm{VCdim} = d+1 (Theorem 3). The VC bound gives sample complexity O(d/ε2)O(d/\varepsilon^2) for agnostic PAC learning. This provides a formal justification for the common wisdom that the number of training examples should grow with the number of features. In practice, if we have d=100d = 100 features and want ε=0.05\varepsilon = 0.05 accuracy, the VC bound suggests we need on the order of 100/0.0025=40,000100/0.0025 = 40{,}000 samples — conservative, but in the right ballpark for linear methods on moderately difficult problems.
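The arithmetic behind this estimate is just the leading term of the sample complexity, with constants and logarithmic factors dropped (illustration only):

```python
def vc_ballpark(d, eps):
    """Leading-order sample size d / eps^2 suggested by the VC bound
    (constants and log factors omitted; back-of-the-envelope only)."""
    return d / eps ** 2

print(f"{vc_ballpark(100, 0.05):,.0f} samples")  # on the order of 40,000
```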

Neural Networks

The VC dimension of a neural network with WW real-valued weights is O(WlogW)O(W \log W) for threshold activations (Baum & Haussler, 1989); for depth-LL networks with piecewise-linear activations such as ReLU, it grows on the order of WLlogWWL \log W (Bartlett, Harvey, Liaw & Mehrabian, 2019). For a network with millions of parameters, this gives a sample complexity bound in the millions — but modern networks generalize well with far fewer samples than VC theory predicts. This theory-practice gap is one of the most active research areas in learning theory. Several explanations have been proposed: the effective capacity of trained networks is much smaller than the architectural capacity, optimization with SGD implicitly regularizes, and the data distribution has favorable structure that tighter measures like Rademacher complexity can exploit.

Decision Trees

A decision tree with kk internal nodes on dd-dimensional binary features has VC dimension at most O(klog(kd))O(k \log(kd)). This shows that the sample complexity scales with the tree complexity (number of splits), not with the ambient dimension dd. Pruning a decision tree reduces its VC dimension, which the theory predicts should improve generalization — consistent with the well-known observation that unpruned trees overfit.

Structural Risk Minimization

The bias-complexity tradeoff (a close cousin of the bias-variance tradeoff) is a direct consequence of the PAC framework. Given a nested sequence of hypothesis classes H1H2\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots with increasing VC dimensions d1<d2<d_1 < d_2 < \cdots:

  • The approximation error minhHkR(h)\min_{h \in \mathcal{H}_k} R(h) decreases with kk (larger classes contain better approximations).
  • The estimation error (the generalization gap, bounded by the VC bound) increases with kk (larger classes are harder to learn from finite data).

Structural Risk Minimization (SRM) selects the hypothesis class Hk\mathcal{H}_{k^*} that minimizes the sum of empirical risk and the VC complexity penalty. This is the learning-theoretic justification for regularization: by controlling model complexity, we balance the two sources of error.
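A minimal SRM sketch, assuming a hypothetical exponentially decaying approximation-error curve exp(-b·d) as a stand-in for the (unobservable) best-in-class risk, with Theorem 4's VC bound as the complexity penalty:

```python
import numpy as np

def srm_select(n=500, delta=0.05, b=0.08, d_max=50):
    """Pick the complexity d* minimizing approximation error + VC penalty.

    exp(-b * d) is a hypothetical approximation-error curve, not a
    quantity computable from data; the penalty is Theorem 4's bound.
    """
    d = np.arange(1, d_max + 1, dtype=float)
    approx = np.exp(-b * d)                           # decreases with complexity
    penalty = np.sqrt((8 * d * np.log(2 * np.e * n / d)
                       + 8 * np.log(4 / delta)) / n)  # increases with complexity
    return int(d[np.argmin(approx + penalty)])

print(srm_select())          # with only n = 500 samples, SRM stays simple
print(srm_select(n=10**6))   # with abundant data, richer classes win
```

With n = 500 the penalty dominates and SRM picks the simplest class; at n = 10^6 the penalty is small and the richest class in the range wins.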

Interactive dashboard: Structural Risk Minimization. With n = 500 samples, approximation-error decay b = 0.08, and failure probability δ = 0.05, the optimal model complexity is d* = 1, balancing approximation error (decreasing with complexity) against the estimation error from the VC bound (increasing with complexity).

Applications of PAC learning — linear classifiers, neural networks, and the bias-complexity tradeoff


Connections & Further Reading

Connections Map

  • Concentration Inequalities (Probability & Statistics): direct prerequisite. Hoeffding's inequality plus the union bound give sample complexity for finite classes; McDiarmid's inequality proves the Rademacher generalization bound.
  • Measure-Theoretic Probability (Probability & Statistics): foundational. The convergence hierarchy and the expectation operator underpin the definitions of true risk and uniform convergence.
  • PCA & Low-Rank Approximation (Linear Algebra): the VC dimension of linear subspace classifiers relates to the effective dimension captured by PCA; sample covariance concentration connects to agnostic learning bounds.
  • Simplicial Complexes (Topology): the combinatorial proof of Sauer–Shelah (induction on point removal) has structural analogues in extremal and topological combinatorics.
  • Bayesian Nonparametrics (Probability & Statistics): Bayesian model selection provides an alternative to SRM for balancing model complexity; posterior contraction rates parallel PAC sample complexity.

Key Notation Summary

  • R(h) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq y] — true risk (generalization error)
  • \hat{R}_S(h) = \frac{1}{n}\sum_i \mathbf{1}[h(x_i) \neq y_i] — empirical risk (training error)
  • h_S^{\mathrm{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h) — empirical risk minimizer
  • n_{\mathcal{H}}(\varepsilon, \delta) — sample complexity
  • \mathcal{H}_C — restriction of \mathcal{H} to the set C
  • \mathrm{VCdim}(\mathcal{H}) — Vapnik–Chervonenkis dimension
  • \Pi_{\mathcal{H}}(m) — growth function (shattering coefficient)
  • \hat{\mathfrak{R}}_S(\mathcal{F}) — empirical Rademacher complexity
  • \mathfrak{R}_n(\mathcal{F}) — population Rademacher complexity


References & Further Reading

  • Valiant (1984), "A Theory of the Learnable" — the foundational PAC learning paper.
  • Shalev-Shwartz & Ben-David (2014), Understanding Machine Learning: From Theory to Algorithms — primary reference; Chapters 2–6 cover all topics on this page.
  • Vapnik (1998), Statistical Learning Theory — comprehensive treatment of VC theory and structural risk minimization.
  • Mohri, Rostamizadeh & Talwalkar (2018), Foundations of Machine Learning — excellent treatment of Rademacher complexity and its applications.
  • Blumer, Ehrenfeucht, Haussler & Warmuth (1989), "Learnability and the Vapnik–Chervonenkis Dimension" — early paper connecting VC dimension to PAC learning.
  • Sauer (1972), "On the Density of Families of Sets" — the original Sauer lemma.
  • Bartlett & Mendelson (2002), "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results" — foundational paper on Rademacher complexity bounds.