PAC-Bayes Bounds
Posterior-averaged generalization theory — the change-of-measure inequality, McAllester / Seeger / Catoni bounds, the Gibbs distribution, and the first non-vacuous deep-network certificates
Motivation: beyond uniform convergence
What generalization-bounds §14 promised
The companion topic on generalization bounds closes by stating, but not developing, a remedy for its own central limitation. Rademacher complexity gives the cleanest non-vacuous bounds we have for small hypothesis classes — threshold classifiers, polynomials of bounded degree, linear models in low dimension. The bound it produces is a uniform statement: with probability at least $1-\delta$ over the sample $S$,

$$R(h) \;\le\; \hat R_S(h) + 2\,\mathfrak{R}_n(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2n}} \qquad \text{simultaneously for every } h \in \mathcal{H},$$

where $R(h)$ is the population risk, $\hat R_S(h)$ its empirical counterpart on $S$, and $\mathfrak{R}_n(\mathcal{H})$ the Rademacher complexity — a single scalar that summarizes the richness of $\mathcal{H}$ on $n$ points. The bound holds for every hypothesis simultaneously, which is exactly why it goes vacuous when $\mathcal{H}$ is rich enough to memorize the training set.
§14.1 of that topic stated, as a forward pointer, McAllester's 1999 PAC-Bayes bound — a strictly different way to charge for complexity, one that doesn't pay uniformly over $\mathcal{H}$ and survives the deep-network regime that breaks Rademacher. This is the topic that develops it.
The deep-network puzzle revisited
Zhang et al. (2017) showed something the classical theory cannot accommodate: a modern convolutional network with millions of parameters, trained by SGD on ImageNet, can fit randomly relabeled training data to zero error. Its Rademacher complexity is therefore essentially maximal, so the bound above exceeds $1$ — vacuous, since the risk lives in $[0,1]$. Yet the same network, trained on real labels, generalizes. The empirical generalization gap is on the order of a few percent. The classical bound and the observed gap differ by orders of magnitude.
The architectural fact that breaks uniform convergence is that $\mathcal{H}$ has the capacity to be terrible — to memorize noise — even when the specific $h$ that SGD lands on is not. A bound that charges the complexity of $\mathcal{H}$ uniformly cannot distinguish those two facts. It pays for the worst-case hypothesis regardless of which one we actually use.
PAC-Bayes in one sentence
PAC-Bayes changes the charging scheme. Instead of bounding $\sup_{h\in\mathcal{H}}\big(R(h) - \hat R_S(h)\big)$, we bound $\mathbb{E}_{h\sim Q}[R(h)] - \mathbb{E}_{h\sim Q}[\hat R_S(h)]$ — the expected gap under a posterior distribution $Q$ over hypotheses. The price we pay is no longer a complexity measure of $\mathcal{H}$; it is the KL divergence between $Q$ and a prior distribution $P$ fixed before seeing the data. Roughly,

$$\mathbb{E}_{h\sim Q}[R(h)] \;\lesssim\; \mathbb{E}_{h\sim Q}[\hat R_S(h)] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln(1/\delta)}{2n}},$$

with the precise form developed across §§3–6. The two ideas are dual: uniform convergence charges $\mathcal{H}$; PAC-Bayes charges $Q$'s drift from $P$. When $P$ concentrates on hypotheses that look like the one SGD found, $\mathrm{KL}(Q\,\|\,P)$ stays small even when $\mathcal{H}$ is vast. Dziugaite & Roy (2017) used exactly this observation to compute the first non-vacuous bound for a deep network with more parameters than training points — a result we reproduce in miniature in §11.
The hybrid framing matters. The object $Q$ is Bayesian in flavor — a distribution over hypotheses, often constructed from a Bayesian posterior or a stochastic version of the SGD output — but the guarantee is fully frequentist: it holds with probability at least $1-\delta$ over the draw of the sample $S$, uniformly across all posteriors $Q$. PAC-Bayes is neither Bayesian inference nor classical learning theory; it is the bridge between them.
Roadmap
Three canonical bounds anchor this topic. The McAllester (1999, Maurer-tightened 2004) square-root bound of §4 is the most legible — it produces the inequality above with explicit constants. The Seeger (2002) bound of §5 replaces the squared-distance comparison with the binary-KL function $\mathrm{kl}(\cdot\,\|\,\cdot)$ and is tighter when the empirical risk is near zero, which is the regime of interest for over-parameterized models. The Catoni (2007) linear bound of §6 is parametrized by a temperature $\lambda$, and its right-hand-side minimizer is the Gibbs distribution — the same object that variational inference produces in the limit of perfect approximation, and the bridge to the Bayesian correspondence developed in §9.
All three rest on a single technical tool: the Donsker–Varadhan change-of-measure inequality, which we develop from scratch in §3. The chain “DV → McAllester → Seeger → Catoni” is the spine of the topic.
The running example
Throughout §§2–10 we work with the same toy: a one-dimensional binary-classification problem with a finite class of threshold classifiers. The data are $x_i \sim \mathrm{Unif}[-1,1]$, with labels $y_i = \mathrm{sign}(x_i - \theta^\star)$ flipped independently with probability $\eta$. The hypothesis class is $\mathcal{H} = \{h_\theta : \theta \in \Theta\}$, where $\Theta$ enumerates 101 evenly spaced thresholds and $h_\theta(x) = \mathrm{sign}(x - \theta)$. The sample size is $n$.
This class is small enough to enumerate, finite enough that the Gibbs posterior is a discrete distribution we can plot, and rich enough that the choice of $P$ — uniform-on-$\Theta$ versus concentrated-near-the-ERM — produces qualitatively different bounds. The numerical demo confirms that the empirical risk curve over $\Theta$ is roughly U-shaped with a minimum near $\theta^\star$, and the ERM threshold lands close to $\theta^\star$ (up to the noise-flip and finite-sample error from $n$ draws).
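A minimal NumPy sketch of this setup; the specific values of $\theta^\star$, $\eta$, and $n$ below are illustrative assumptions, not the notebook's configuration. Later sketches in this topic reuse `thetas`, `emp_risk`, `theta_erm`, and `n` from this block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative configuration -- theta_star, eta, and n are assumptions.
theta_star, eta, n = 0.0, 0.1, 200
thetas = np.linspace(-1.0, 1.0, 101)   # the 101-threshold grid

x = rng.uniform(-1.0, 1.0, size=n)
y = np.sign(x - theta_star)
flip = rng.random(n) < eta             # independent label noise
y[flip] *= -1

# empirical 0/1 risk of every threshold classifier h_theta(x) = sign(x - theta)
emp_risk = np.array([(np.sign(x - t) != y).mean() for t in thetas])
theta_erm = thetas[emp_risk.argmin()]
print(f"ERM threshold {theta_erm:+.2f}, empirical risk {emp_risk.min():.3f}")
```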
The PAC-Bayes setup
Data, loss, hypothesis class
Throughout the topic, $S = (z_1, \ldots, z_n)$ denotes an i.i.d. sample from an unknown distribution $\mathcal{D}$ on $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$. A hypothesis $h$ is evaluated by a bounded loss $\ell : \mathcal{H}\times\mathcal{Z}\to[0,1]$, and we write

$$R(h) = \mathbb{E}_{z\sim\mathcal{D}}[\ell(h,z)], \qquad \hat R_S(h) = \frac1n\sum_{i=1}^n \ell(h, z_i)$$

for the population risk and its empirical counterpart on $S$. The hypothesis class $\mathcal{H}$ is the set of $h$ we are willing to consider — finite (as in the running example), parametric (Gaussian-process classifiers, linear models with a parameter prior), or non-parametric (the function space of an over-parameterized neural network).
The boundedness assumption matters: it is what makes the Hoeffding moments of §4 finite. The classical 0/1 classification loss satisfies it directly; squared regression loss does not, and PAC-Bayes for unbounded losses requires extra moment assumptions we discuss in §14.2.
The prior
A prior $P$ is a probability distribution over the hypothesis class, chosen before seeing the sample $S$. Two facts about $P$ deserve emphasis on first reading.
First, $P$ is not a Bayesian prior in the inference-theoretic sense — it is not a statement of belief about which $h$ generated the data, and the analysis does not require that any $h\in\mathcal{H}$ be the “true” data-generating mechanism. $P$ is a measuring stick: a distribution against which we will charge our posterior’s drift.
Second, $P$ must be chosen without using $S$. If $P$ depends on $S$, the bounds of §§4–6 are false. (Data-dependent priors are a real technique, but they are constructed carefully — typically by splitting $S$, fitting $P$ on one half, and computing the bound on the other half; we develop this in §8.3.) For the running example $P$ will be uniform on the 101 thresholds; for the §11 deep-network bound $P$ will be a Gaussian in parameter space whose variance is selected from a $\delta$-union over a finite grid.
The posterior
A posterior $Q$ is a probability distribution over $\mathcal{H}$ that may depend arbitrarily on $S$. The only requirement is $Q \ll P$ — $Q$ is absolutely continuous with respect to $P$: every event that is $P$-null is also $Q$-null. On finite or countable $\mathcal{H}$ this just means $\mathrm{supp}(Q) \subseteq \mathrm{supp}(P)$. The condition is what makes the Radon–Nikodym derivative $dQ/dP$ well-defined, and through it the KL divergence

$$\mathrm{KL}(Q\,\|\,P) = \int \ln\frac{dQ}{dP}\,dQ.$$

KL divergence is non-negative, equal to zero iff $Q = P$, and asymmetric — properties we’ll need in §3 and document fully in the KL divergence topic. For now the operational reading is: $\mathrm{KL}(Q\,\|\,P)$ measures how surprising $Q$ would be to someone who held $P$, in units of nats.
The PAC-Bayes posterior is the user’s choice. It is not the output of any inference rule. Some natural constructions — the Bayesian posterior $P(\cdot\mid S)$, the Gibbs distribution $Q_\lambda \propto e^{-\lambda\hat R_S}\,P$, a Gaussian centered on the SGD output — are good choices because they tend to balance the two competing terms of the bound (low empirical risk under $Q$ vs low KL from $P$). But the guarantee holds for every $Q$, including pathological ones.
The hybrid framing
The PAC-Bayes statement we will prove in §4 has the form

$$\Pr_{S\sim\mathcal{D}^n}\!\Big(\forall\, Q \ll P:\;\; \mathbb{E}_Q[R] \le \mathbb{E}_Q[\hat R_S] + F\big(\mathrm{KL}(Q\,\|\,P),\, n,\, \delta\big)\Big) \;\ge\; 1-\delta,$$

where $F$ is some bound functional (square-root for McAllester, binary-KL for Seeger, linear for Catoni). Three structural features distinguish PAC-Bayes from neighboring frameworks.
The bound is uniform over $Q$ but not over $h$. The supremum over the hypothesis class that uniform-convergence bounds pay for is gone; in its place we have a supremum over posteriors. This is the trade that lets PAC-Bayes survive overparameterization — there are infinitely many posteriors $Q$, but most are far from $P$ in KL, and the bound only bites when we actually use one of them.
The randomness is over $S$, not over $h$. The probability is taken with respect to the draw of the training sample. Once $S$ is drawn, the bound either holds or doesn’t for every $Q$ simultaneously, and we can pick $Q$ after seeing $S$ — indeed, that’s the whole point.
The framework is agnostic about $\mathcal{D}$. No realizability assumption, no parametric form, no assumption that any $h\in\mathcal{H}$ is the truth. The guarantee is frequentist in the same sense as Hoeffding’s inequality is frequentist.
This hybrid — a Bayesian object ($Q$) inside a frequentist guarantee (probability over $S$) — is the conceptual move that makes the framework work. PAC-Bayes is not “Bayesian PAC learning” in the sense of treating $\mathcal{H}$ as a parameter space and updating via Bayes’ rule. It is PAC learning where the learner outputs a distribution rather than a point, and the analysis charges the learner for how far that distribution drifted from a reference distribution fixed in advance.
Running example — prior and two candidate posteriors
We carry forward the §1.5 toy. The prior is uniform on $\Theta$ with $|\Theta| = 101$: every threshold gets mass $1/101$. We will compare two candidate posteriors: $Q_n$, a Gaussian-shape on the grid centered at the ERM threshold with a small bandwidth (narrow enough that nearly all mass sits within ten grid steps of the ERM); and $Q_b$, the same shape with a larger bandwidth (broad enough to keep substantial mass across most of $\Theta$).
For each we read off two numbers: the KL divergence $\mathrm{KL}(Q\,\|\,P)$ and the expected empirical risk $\mathbb{E}_{\theta\sim Q}[\hat R_S(h_\theta)]$. The narrow posterior pays a higher KL (more concentrated than uniform) but a lower expected empirical risk (its mass sits on near-ERM hypotheses); the broad posterior makes the opposite trade. On the verified notebook run the broad posterior comes out at $\mathrm{KL}(Q_b\,\|\,P) = 0.67$ with $\mathbb{E}_{Q_b}[\hat R_S] = 0.179$, and the narrow posterior at a higher KL with a lower expected risk. The PAC-Bayes bound formalizes this trade-off, and §10 plots both posteriors against the McAllester, Seeger, and Catoni bounds explicitly.
KL(Q_b ∥ P) = 0.67 | E_Q_b[R̂] = 0.179
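A sketch of the two grid posteriors, continuing the §1.5 block; the two bandwidths are illustrative guesses, not the notebook's values:

```python
from scipy.special import rel_entr

def grid_gaussian(center, bandwidth, grid):
    """Gaussian-shaped weights on a finite grid, normalized to sum to 1."""
    w = np.exp(-0.5 * ((grid - center) / bandwidth) ** 2)
    return w / w.sum()

P = np.full(len(thetas), 1.0 / len(thetas))        # uniform prior, mass 1/101
Q_narrow = grid_gaussian(theta_erm, 0.05, thetas)  # assumed bandwidths
Q_broad  = grid_gaussian(theta_erm, 0.50, thetas)

for name, Q in [("narrow", Q_narrow), ("broad", Q_broad)]:
    kl = rel_entr(Q, P).sum()          # KL(Q || P) in nats; rel_entr(0, p) = 0
    risk = (Q * emp_risk).sum()        # E_{theta ~ Q}[R_hat]
    print(f"{name}: KL = {kl:.2f} nats, E_Q[R_hat] = {risk:.3f}")
```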
The change-of-measure inequality
KL divergence as a log-density-ratio expectation
Before we develop the master tool, it pays to look at KL divergence from one more angle. We introduced it in §2.3 as the integral $\int \ln\frac{dQ}{dP}\,dQ$; equivalently, when $Q \ll P$,

$$\mathrm{KL}(Q\,\|\,P) = \mathbb{E}_{h\sim Q}\!\left[\ln\frac{dQ}{dP}(h)\right],$$

where $\frac{dQ}{dP}$ is the Radon–Nikodym derivative — the unique-up-to-$P$-null-sets density of $Q$ with respect to $P$. (See formalCalculus: lebesgue-integral and formalCalculus: radon-nikodym for the measure-theoretic foundations.) The derivative satisfies the change-of-measure formula

$$\mathbb{E}_{h\sim Q}[g(h)] = \mathbb{E}_{h\sim P}\!\left[\frac{dQ}{dP}(h)\,g(h)\right]$$

for every $Q$-integrable $g$. This is the simple algebraic fact that PAC-Bayes is built on: any expectation under $Q$ can be rewritten as an expectation under $P$, paid for by re-weighting with the density ratio. Equivalently, for nonnegative $g$, $\mathbb{E}_{h\sim P}[g(h)] \ge \mathbb{E}_{h\sim Q}\big[\big(\tfrac{dQ}{dP}\big)^{-1} g(h)\big]$ on the support of $Q$, which is the direction we’ll need in a moment.
The KL term in the PAC-Bayes bound is precisely the price the change-of-measure formula extracts when the re-weighting is taken inside a logarithm and pushed through Jensen’s inequality. The next subsection makes this concrete numerically before we state the result.
Numerical hook — the variational form as an envelope
Pick any bounded function $f : \mathcal{H}\to\mathbb{R}$. For each $Q$ compute the variational functional

$$V(Q) \;=\; \mathbb{E}_{h\sim Q}[f(h)] - \mathrm{KL}(Q\,\|\,P).$$

On the running example with $P$ uniform on the 101 thresholds and $f = -\lambda\hat R_S$ (the negative empirical risk scaled by a temperature $\lambda$), trace $V$ across a restricted family: Gaussian-shaped distributions on the grid with fixed bandwidth, parametrized by their center $c$. The demo below plots $V(Q_c)$ as a function of $c$. Two observations.
First, $V$ has a peak: among Gaussian-on-grid posteriors of this bandwidth, some center maximizes the functional. Second, the peak sits strictly below a horizontal line that we’ll mark as $\ln\mathbb{E}_{h\sim P}[e^{f(h)}]$ — a quantity that depends only on $P$ and $f$, not on $Q$. The gap between the family’s peak and that horizontal line is the price the restricted family pays for not containing the global maximizer.
When we widen the search to all $Q \ll P$, the global maximizer turns out to be the Gibbs distribution

$$Q^\star(dh) \;=\; \frac{e^{f(h)}}{\mathbb{E}_P[e^f]}\,P(dh),$$

and the maximum is exactly the horizontal envelope $\ln\mathbb{E}_P[e^f]$. The Donsker–Varadhan inequality is the statement that this envelope is universal — for every $f$ with $\mathbb{E}_P[e^f]$ finite, the gap is non-negative and tight at exactly one $Q$.
gap = 0.138 nats | KL(Q* ∥ P) = 2.20
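A sketch of the envelope computation, continuing the running-example code; the temperature value is an arbitrary assumption:

```python
lam = 20.0
f = -lam * emp_risk                        # f(h_theta) = -lam * R_hat(theta)

envelope = np.log(np.sum(P * np.exp(f)))   # ln E_P[e^f]

V = []                                     # restricted Gaussian-on-grid family
for c in thetas:
    Q = grid_gaussian(c, 0.05, thetas)
    V.append((Q * f).sum() - rel_entr(Q, P).sum())
V = np.array(V)

Q_star = P * np.exp(f)                     # Gibbs maximizer: tilted prior
Q_star /= Q_star.sum()
print(f"family peak {V.max():.3f}, envelope {envelope:.3f}, "
      f"gap {envelope - V.max():.3f} nats")
print(f"KL(Q* || P) = {rel_entr(Q_star, P).sum():.2f} nats")
```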
The variational form
We now state it formally.
Theorem 1 (Donsker–Varadhan variational form of KL).
Let $P$ be a probability measure on a measurable space $(\mathcal{H}, \mathcal{F})$ and $f : \mathcal{H}\to\mathbb{R}$ a measurable function with $\mathbb{E}_P[e^f] < \infty$. Then

$$\ln\mathbb{E}_{h\sim P}\big[e^{f(h)}\big] \;=\; \sup_{Q \ll P}\Big(\mathbb{E}_{h\sim Q}[f(h)] - \mathrm{KL}(Q\,\|\,P)\Big),$$

with the supremum achieved at the tilted distribution

$$Q^\star(dh) \;=\; \frac{e^{f(h)}}{\mathbb{E}_P[e^f]}\,P(dh),$$

i.e., the distribution whose density with respect to $P$ is $\frac{dQ^\star}{dP} = e^f / \mathbb{E}_P[e^f]$.
Rearranging gives the form we’ll actually use in §§4–6:
Corollary 1 (Change-of-measure inequality).
Under the hypotheses of Theorem 1, for every $Q \ll P$,

$$\mathbb{E}_{h\sim Q}[f(h)] \;\le\; \mathrm{KL}(Q\,\|\,P) + \ln\mathbb{E}_{h\sim P}\big[e^{f(h)}\big].$$

This is the master tool. Plug in $f = 2n(\hat R_S - R)^2$ and you’ll get the McAllester bound; plug in $f = n\,\mathrm{kl}(\hat R_S\,\|\,R)$ and you’ll get the Seeger bound; plug in $f = \lambda(R - \hat R_S)$ and the resulting linear bound’s right-hand side is minimized by the Gibbs posterior of Catoni’s linear bound. The proof in §3.4 is a single Jensen step in one direction and an explicit exhibit in the other.
Proof of Theorem 1
The proof has two halves: an upper bound on the variational functional valid for every $Q$, and an exhibit showing it is achieved at the tilted distribution $Q^\star$.
Proof.
Upper bound ($\ge$ direction of Theorem 1, equivalent to Corollary 1). Fix $Q \ll P$. By Radon–Nikodym, the derivative $q = \frac{dQ}{dP}$ exists. On the support of $Q$ — i.e., on the set $\{q > 0\}$ — the reciprocal $1/q$ is also well-defined. (Outside the support of $Q$ the value of any integrand does not matter for an integral with respect to $Q$, so we ignore that set throughout.) The change-of-measure formula gives

$$\mathbb{E}_P\big[e^f\big] \;\ge\; \mathbb{E}_P\big[e^f\,\mathbf{1}\{q>0\}\big] \;=\; \mathbb{E}_Q\Big[\frac{e^f}{q}\Big].$$

The integrand on the right is nonnegative and $Q$-integrable (since its $Q$-expectation equals $\mathbb{E}_P[e^f\,\mathbf{1}\{q>0\}]$, finite by hypothesis). Take $\ln$ of both sides and apply Jensen’s inequality — the concave $\ln$ commutes with expectation in the direction $\ln\mathbb{E}[X] \ge \mathbb{E}[\ln X]$ for any nonnegative $X$ with $\mathbb{E}[X] < \infty$:

$$\ln\mathbb{E}_P\big[e^f\big] \;\ge\; \ln\mathbb{E}_Q\Big[\frac{e^f}{q}\Big] \;\ge\; \mathbb{E}_Q\Big[\ln\frac{e^f}{q}\Big].$$

Split the logarithm:

$$\mathbb{E}_Q\Big[\ln\frac{e^f}{q}\Big] \;=\; \mathbb{E}_Q[f] - \mathbb{E}_Q[\ln q].$$

The second term is $\mathrm{KL}(Q\,\|\,P)$ by definition, since $q = \frac{dQ}{dP}$. Combining,

$$\ln\mathbb{E}_P\big[e^f\big] \;\ge\; \mathbb{E}_Q[f] - \mathrm{KL}(Q\,\|\,P),$$

which is the change-of-measure inequality after rearranging.
Saturation at $Q^\star$ ($\le$ direction). Define $Q^\star$ by $\frac{dQ^\star}{dP} = e^f/Z$ with $Z = \mathbb{E}_P[e^f]$. Then $Q^\star$ is a probability measure (it integrates to $1$) and $Q^\star \ll P$ (by construction). Its KL divergence from $P$ is

$$\mathrm{KL}(Q^\star\,\|\,P) \;=\; \mathbb{E}_{Q^\star}\Big[\ln\frac{e^f}{Z}\Big] \;=\; \mathbb{E}_{Q^\star}[f] - \ln Z,$$

so

$$\mathbb{E}_{Q^\star}[f] - \mathrm{KL}(Q^\star\,\|\,P) \;=\; \ln Z \;=\; \ln\mathbb{E}_P\big[e^f\big].$$

The variational functional equals the envelope at $Q^\star$, so the supremum on the right of Theorem 1 is attained.
Together the two halves give the variational form, and Corollary 1 follows by rearrangement.
∎ The Jensen step is where every PAC-Bayes bound is born. Strict inequality holds whenever $e^f/q$ is not $Q$-a.s. constant — and constancy holds precisely when $Q = Q^\star$. So the bound is tight iff $Q$ is the Gibbs distribution, and otherwise we pay a non-zero gap.
Demo — closed-form sanity check on Gaussians
The Gaussian case admits a closed-form check of every quantity in Theorem 1, which is the cleanest way to verify the result independent of the running example. Take $P = \mathcal{N}(0,1)$ on $\mathbb{R}$ and $f(x) = tx$. Three one-line calculations: the envelope is $\ln\mathbb{E}_P[e^{tx}] = t^2/2$; the Gibbs maximizer is $Q^\star = \mathcal{N}(t, 1)$; and the variational functional viewed as a function of the mean $m$ for $Q = \mathcal{N}(m, 1)$ is $tm - m^2/2$, which reaches its closed-form maximum $t^2/2$ at $m = t$. The notebook’s §3.5 cell confirms all three numerically and overlays a Monte-Carlo estimate of $\ln\mathbb{E}_P[e^f]$ to ground the closed-form result in independent computation; saturation is verified to within numerical tolerance.
The Gaussian-$Q$ mean parameter here is named $m$ to distinguish it from the §1.5 running-example threshold $\theta$. The two objects are unrelated — one is a sanity-check parameter on $\mathbb{R}$, the other a fixed truth in the threshold-classifier problem.
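A sketch of the check under the linear-tilt reading above ($f(x) = tx$; the value of $t$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
t = 1.5                                    # assumed tilt strength

envelope = t**2 / 2                        # ln E_P[e^{tX}] for P = N(0,1)
ms = np.linspace(-1.0, 4.0, 501)
V = t * ms - ms**2 / 2                     # E_Q[f] - KL(N(m,1) || N(0,1))

# Monte-Carlo estimate of the envelope to ground the closed form
mc = np.log(np.mean(np.exp(t * rng.standard_normal(10**6))))

print(f"envelope {envelope:.4f}, Monte-Carlo {mc:.4f}")
print(f"V max {V.max():.4f} at m = {ms[V.argmax()]:.2f} (Gibbs mean = t)")
```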
The McAllester bound
Statement
We can now state the load-bearing result of the topic. It is the bound McAllester (1999) introduced and Maurer (2004) tightened to its modern constants. The argument is a four-step chain combining the change-of-measure inequality of §3, a Bernoulli-MGF lemma due to Maurer, Markov’s inequality, and a Jensen step in the opposite direction from §3.
Theorem 2 (McAllester PAC-Bayes bound (Maurer-tightened form)).
Let $\mathcal{D}$ be any distribution on $\mathcal{Z}$, let $\ell$ be a bounded loss in $[0,1]$, let $P$ be a prior on $\mathcal{H}$ chosen before seeing the data, fix $\delta \in (0,1)$, and let $n \ge 8$. Then

$$\Pr_{S\sim\mathcal{D}^n}\!\left(\forall\, Q \ll P:\;\; \mathbb{E}_Q[R] \;\le\; \mathbb{E}_Q[\hat R_S] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}}{2n}}\right) \;\ge\; 1-\delta.$$

Two features worth flagging before the proof. The probability is over the sample $S$ only; the quantifier over $Q$ sits inside the probability, so any posterior we construct from $S$ — including a $Q$ we choose specifically to minimize the right-hand side — inherits the same $1-\delta$. And the $2\sqrt n$ inside the logarithm comes from the Maurer moment bound of §4.2; the original McAllester (1999) statement carried a worse constant, replaced by McAllester (2003) and tightened to its modern form by Maurer (2004).
The proof is the chain we previewed: a Hoeffding-type moment bound for each fixed hypothesis, integration against the prior, Markov’s inequality to convert from expectation to a high-probability statement over $S$, the change-of-measure inequality to flip from a $P$-expectation to a $Q$-expectation, and a Jensen step to pull the square root outside.
Step 1 — Maurer’s moment bound
The technical heart of the bound lives in a lemma about a single hypothesis. For a fixed $h$, the losses $\ell(h, z_i)$ are i.i.d. random variables in $[0,1]$ with mean $R(h)$, and the empirical risk $\hat R_S(h)$ is their sample mean. We need a bound on the moment generating function of the deviation between the two.
Lemma 1 (Maurer's moment bound (Maurer 2004, Theorem 1)).
For every fixed $h$ with $\ell(h, \cdot) \in [0,1]$ and every $n \ge 8$,

$$\mathbb{E}_{S\sim\mathcal{D}^n}\Big[e^{n\,\mathrm{kl}(\hat R_S(h)\,\|\,R(h))}\Big] \;\le\; 2\sqrt n,$$

where $\mathrm{kl}(p\,\|\,q) = p\ln\frac pq + (1-p)\ln\frac{1-p}{1-q}$ is the binary-KL function. Consequently, by Pinsker’s inequality $\mathrm{kl}(p\,\|\,q) \ge 2(p-q)^2$,

$$\mathbb{E}_{S\sim\mathcal{D}^n}\Big[e^{2n(\hat R_S(h) - R(h))^2}\Big] \;\le\; 2\sqrt n.$$
Two notes on this lemma before we use it. First, the moment bound is uniform in $h$ — it depends only on $n$, not on the distribution of the data. That uniformity is what makes the bound work distribution-free. Second, the $2\sqrt n$ rather than a larger constant comes from a Stirling-type analysis of the worst-case loss distribution. The argument runs as follows. By a convexity / exchangeability argument (Maurer 2004, Lemma 3), the worst-case bounded loss with given mean is the Bernoulli, so it suffices to bound the moment when $\ell(h, z_i) \sim \mathrm{Bernoulli}(q)$ for $q = R(h)$. The Bernoulli MGF evaluates explicitly:

$$\mathbb{E}\Big[e^{n\,\mathrm{kl}(\hat R_S\,\|\,q)}\Big] \;=\; \sum_{k=0}^n \binom nk q^k (1-q)^{n-k}\, e^{n\,\mathrm{kl}(k/n\,\|\,q)} \;=\; \sum_{k=0}^n \binom nk \Big(\frac kn\Big)^{k}\Big(\frac{n-k}{n}\Big)^{n-k},$$

an expression independent of $q$ — a remarkable cancellation. Stirling’s approximation gives that this sum is bounded by $2\sqrt n$ for $n \ge 8$. Pinsker’s inequality then converts the binary-KL form to the squared-difference form. Pinsker is the lossy step in McAllester’s chain — Seeger’s bound in §5 skips it.
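The cancellation and the $2\sqrt n$ envelope are easy to verify directly; a log-domain sketch:

```python
import numpy as np
from scipy.special import gammaln

def maurer_sum(n):
    """sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k); equals E[e^{n kl(Rhat||q)}] for any q."""
    k = np.arange(n + 1)
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(k > 0, k * np.log(k / n), 0.0)            # 0 log 0 = 0
        t2 = np.where(k < n, (n - k) * np.log((n - k) / n), 0.0)
    return np.exp(log_binom + t1 + t2).sum()

for n in (8, 100, 1000):
    print(n, f"sum = {maurer_sum(n):.3f}", f"2*sqrt(n) = {2*np.sqrt(n):.3f}")
```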
Step 2 — integrating against the prior
The moment bound holds for each fixed $h$ with the data-randomness on the outside. To combine it with the change-of-measure inequality, we need to put a $P$-expectation on the outside too. Because the prior does not depend on $S$, we may integrate without restriction:

$$\mathbb{E}_{h\sim P}\,\mathbb{E}_{S\sim\mathcal{D}^n}\Big[e^{2n(\hat R_S(h) - R(h))^2}\Big] \;\le\; 2\sqrt n.$$

The first inequality uses Lemma 1 pointwise in $h$: the inner $S$-expectation is bounded by $2\sqrt n$ at each $h$, and the bound passes through the $P$-expectation. Then by Fubini’s theorem — applicable because the integrand is nonnegative with finite iterated expectations, and because $P$ is independent of $S$ — we may swap the order of integration:

$$\mathbb{E}_{S\sim\mathcal{D}^n}\,\mathbb{E}_{h\sim P}\Big[e^{2n(\hat R_S(h) - R(h))^2}\Big] \;\le\; 2\sqrt n.$$

The second form is the one we’ll plug into Markov’s inequality: it is a statement about the random variable $X(S) = \mathbb{E}_{h\sim P}\big[e^{2n(\hat R_S(h) - R(h))^2}\big]$ — a function of the sample — whose expectation over $S$ is bounded by $2\sqrt n$.
This “data-independence of $P$” assumption is the workhorse of the proof. Drop it and the proof collapses: a data-dependent $P$ would make $\mathbb{E}_P$ and $\mathbb{E}_S$ non-commuting, and the moment bound — which presumes $h$ is fixed independently of $S$ — would not pass through. §8.3 develops the careful construction that allows data-dependent priors via a sample split.
Step 3 — Markov + change-of-measure
Step 2 gave a bound on $\mathbb{E}_S[X(S)]$. Markov’s inequality converts that expectation bound into a high-probability bound on $X(S)$ itself: for any $\delta\in(0,1)$,

$$\Pr_{S}\!\left(X(S) \;\ge\; \frac{2\sqrt n}{\delta}\right) \;\le\; \delta.$$

Taking complements, with probability at least $1-\delta$ over $S$,

$$\mathbb{E}_{h\sim P}\Big[e^{2n(\hat R_S(h) - R(h))^2}\Big] \;\le\; \frac{2\sqrt n}{\delta}.$$

Call this event $E$; it is the event on which all subsequent statements hold. We now condition on it for the remainder of the proof — every “with probability $1-\delta$” in the statement of Theorem 2 traces back to this single Markov step.
On the event $E$, the change-of-measure inequality from §3 (Corollary 1) with $f(h) = 2n\big(\hat R_S(h) - R(h)\big)^2$ gives, for every $Q \ll P$ simultaneously,

$$2n\,\mathbb{E}_Q\Big[\big(\hat R_S - R\big)^2\Big] \;\le\; \mathrm{KL}(Q\,\|\,P) + \ln\mathbb{E}_P\big[e^f\big] \;\le\; \mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}.$$

The uniformity in $Q$ matters: the change-of-measure inequality holds for every $Q$ on a single fixed value of $S$, so once $E$ is in hand we get the bound for every $Q$ at once — including any $Q$ we construct adaptively from $S$.
Step 4 — Jensen and the final square root
The previous step bounded $\mathbb{E}_Q\big[(\hat R_S - R)^2\big]$ from above. We want a bound on $\mathbb{E}_Q[R] - \mathbb{E}_Q[\hat R_S]$ — the expected gap, not the expected squared gap. Jensen’s inequality, applied in the direction $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$ for the convex map $\varphi(x) = x^2$, gives

$$\big(\mathbb{E}_Q[R] - \mathbb{E}_Q[\hat R_S]\big)^2 \;\le\; \mathbb{E}_Q\Big[\big(R - \hat R_S\big)^2\Big].$$

Combining with §4.3,

$$2n\,\big(\mathbb{E}_Q[R] - \mathbb{E}_Q[\hat R_S]\big)^2 \;\le\; \mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}.$$

Dividing by $2n$ and taking square roots,

$$\mathbb{E}_Q[R] \;\le\; \mathbb{E}_Q[\hat R_S] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}}{2n}}.$$

This is the statement of Theorem 2.
The Jensen step in §4.4 is the second source of looseness in the McAllester chain — the first being Pinsker in §4.1. Both can be removed by a more careful argument, and the result is the Seeger bound of §5: it skips Pinsker (keeping the binary-KL form) and skips this Jensen step (working directly with the kl-form of the gap). The price is that the Seeger bound has no closed-form expression for the population risk in terms of the empirical one — it must be inverted numerically.
On the verified running example the McAllester slack, evaluated at the §2.5 narrow posterior with the notebook’s $n$ and $\delta$, yields the certificate on $\mathbb{E}_Q[R]$ collected in the §10 comparison table.
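The evaluation in code, continuing the running-example sketch; $\delta = 0.05$ is an assumed confidence level:

```python
delta = 0.05
kl_n = rel_entr(Q_narrow, P).sum()
emp_n = (Q_narrow * emp_risk).sum()

slack = np.sqrt((kl_n + np.log(2 * np.sqrt(n) / delta)) / (2 * n))
print(f"McAllester: E_Q[R] <= {emp_n:.3f} + {slack:.3f} = {emp_n + slack:.3f}")
```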
The Seeger / kl-form bound
The binary-KL function and why Pinsker is lossy
The McAllester bound’s chain in §4 hit two lossy steps: Pinsker’s inequality in the moment lemma and Jensen-for-squares at the end. Both can be avoided if we work directly with the binary-KL function

$$\mathrm{kl}(p\,\|\,q) \;=\; p\ln\frac pq + (1-p)\ln\frac{1-p}{1-q},$$

i.e., the KL divergence between $\mathrm{Bernoulli}(p)$ and $\mathrm{Bernoulli}(q)$. This is the same function that appeared in Maurer’s moment lemma; it satisfies $\mathrm{kl}(p\,\|\,q) \ge 0$ with equality iff $p = q$, is jointly convex in $(p, q)$, and obeys Pinsker’s inequality $\mathrm{kl}(p\,\|\,q) \ge 2(p-q)^2$.
The Pinsker bound is the version McAllester’s chain plugs in. But Pinsker is very slack near the boundary. To see why, fix $q$ and vary $p = q + \varepsilon$:

$$\mathrm{kl}(q+\varepsilon\,\|\,q) \;=\; \frac{\varepsilon^2}{2q(1-q)} + O(\varepsilon^3),$$

so for small $\varepsilon$, $\mathrm{kl} \approx \varepsilon^2/(2q(1-q))$, while the Pinsker lower bound is $2\varepsilon^2$. The ratio is $1/(4q(1-q))$: at $q = 0.05$, Pinsker is a factor of about five looser; at $q = 0.005$, a factor of about fifty. The discrepancy grows as $q \to 0$, and the regime of interest for over-parameterized modern ML is exactly empirical risk near zero.
At $q = 1/2$, Pinsker is locally tight: the quadratic coefficient $1/(2q(1-q))$ equals $2$. The kl-form and the squared-difference form coincide to leading order at the symmetric midpoint, which is why McAllester’s bound is competitive in the “agnostic” regime where the empirical risk is near $1/2$ and badly off in the realizable regime where it is near $0$. The Seeger bound skips Pinsker and keeps the kl-form; the price is that the resulting statement has no closed-form solution for $\mathbb{E}_Q[R]$ in terms of $\mathbb{E}_Q[\hat R_S]$ and must be inverted numerically.
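A numeric check of the ratio at the three anchor points ($\varepsilon = 10^{-3}$ is arbitrary):

```python
import numpy as np

def binary_kl(p, q):
    """kl(p || q) between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

eps = 1e-3
for q in (0.5, 0.05, 0.005):
    ratio = binary_kl(q + eps, q) / (2 * eps**2)   # kl vs Pinsker lower bound
    print(f"q = {q}: kl / (2 eps^2) = {ratio:.1f}")
```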
Statement
Theorem 3 (Seeger / Maurer–Langford–Seeger kl-form bound).
Under the same hypotheses as Theorem 2 ($\ell \in [0,1]$, $P$ data-independent, $\delta \in (0,1)$, $n \ge 8$),

$$\Pr_{S\sim\mathcal{D}^n}\!\left(\forall\, Q \ll P:\;\; \mathrm{kl}\Big(\mathbb{E}_Q[\hat R_S]\,\Big\|\,\mathbb{E}_Q[R]\Big) \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}}{n}\right) \;\ge\; 1-\delta.$$

The statement is implicit in $\mathbb{E}_Q[R]$: it says that the Bernoulli-KL between the expected empirical risk (a known quantity once $S$ is in hand) and the expected population risk (the object we want to bound) does not exceed a quantity computable from $n$, $\delta$, and the KL between $Q$ and $P$. To extract a number for $\mathbb{E}_Q[R]$, we solve numerically (§5.4).
Two structural notes. First, the kl-form is strictly tighter than McAllester’s whenever $\mathbb{E}_Q[\hat R_S]$ is bounded away from $1/2$, and dramatically so near the boundary. Second, the bound is on the binary-KL between the averages $\mathbb{E}_Q[\hat R_S]$ and $\mathbb{E}_Q[R]$, not on the average of pointwise binary-KLs. The proof in §5.3 uses the joint convexity of $\mathrm{kl}$ to move the expectation inside.
Proof
The proof reuses the first three steps of the McAllester chain — Lemma 1, Fubini, Markov — and replaces only the last two (Pinsker + Jensen-for-squares) with a single Jensen step that exploits the joint convexity of .
Proof.
Apply Lemma 1 in its binary-KL form:

$$\mathbb{E}_{S\sim\mathcal{D}^n}\Big[e^{n\,\mathrm{kl}(\hat R_S(h)\,\|\,R(h))}\Big] \;\le\; 2\sqrt n$$

for each fixed $h$. By the §4.2 Fubini argument (with $P$ independent of $S$ so the expectations commute),

$$\mathbb{E}_{S\sim\mathcal{D}^n}\,\mathbb{E}_{h\sim P}\Big[e^{n\,\mathrm{kl}(\hat R_S(h)\,\|\,R(h))}\Big] \;\le\; 2\sqrt n.$$

By Markov’s inequality, with probability at least $1-\delta$ over $S$,

$$\mathbb{E}_{h\sim P}\Big[e^{n\,\mathrm{kl}(\hat R_S(h)\,\|\,R(h))}\Big] \;\le\; \frac{2\sqrt n}{\delta}.$$

On the event that this holds, the change-of-measure inequality (Corollary 1 of §3) with $f(h) = n\,\mathrm{kl}(\hat R_S(h)\,\|\,R(h))$ gives, for every $Q \ll P$,

$$n\,\mathbb{E}_Q\big[\mathrm{kl}(\hat R_S\,\|\,R)\big] \;\le\; \mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}.$$

The key step: the binary-KL function is jointly convex in $(p, q)$ — a standard fact about KL divergences on finite alphabets (Cover & Thomas 2006, Theorem 2.7.2). Jensen’s inequality applied to a jointly convex function gives, for any pair of random variables $(U, V)$ taking values in $[0,1]^2$,

$$\mathrm{kl}\big(\mathbb{E}[U]\,\big\|\,\mathbb{E}[V]\big) \;\le\; \mathbb{E}\big[\mathrm{kl}(U\,\|\,V)\big].$$

Setting $U = \hat R_S(h)$ and $V = R(h)$ with $h \sim Q$,

$$\mathrm{kl}\big(\mathbb{E}_Q[\hat R_S]\,\big\|\,\mathbb{E}_Q[R]\big) \;\le\; \mathbb{E}_Q\big[\mathrm{kl}(\hat R_S\,\|\,R)\big].$$

Combining with the change-of-measure inequality and dividing by $n$,

$$\mathrm{kl}\big(\mathbb{E}_Q[\hat R_S]\,\big\|\,\mathbb{E}_Q[R]\big) \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}}{n},$$

which is the statement of Theorem 3.
∎ The chain “Lemma 1 → Fubini → Markov → DV → Jensen-for-convex-kl” replaces McAllester’s “Lemma 1 → Pinsker → Fubini → Markov → DV → Jensen-for-squares.” The single Jensen step here is the same direction as the one in §4.4 — both use $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$ for a convex $\varphi$ — but with $\mathrm{kl}$ in place of squaring, the constants are tight enough that no other Pinsker-style relaxation is needed.
Inverting the bound numerically
Theorem 3 is an implicit statement: given $\hat r = \mathbb{E}_Q[\hat R_S]$ and the RHS $b = \big(\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}\big)/n$, we want the upper-confidence value defined by

$$\mathrm{kl}^{-1}(\hat r, b) \;=\; \sup\big\{\, r \in [\hat r, 1] : \mathrm{kl}(\hat r\,\|\,r) \le b \,\big\}.$$

The function $r \mapsto \mathrm{kl}(\hat r\,\|\,r)$ is convex with minimum $0$ at $r = \hat r$ and strictly increasing on $[\hat r, 1)$ (heading to $\infty$ as $r \to 1$). So $\mathrm{kl}^{-1}(\hat r, b)$ is the unique root of $\mathrm{kl}(\hat r\,\|\,r) - b$ in $(\hat r, 1)$. Bracketing the root and bisecting (or calling scipy.optimize.brentq) gives $\mathrm{kl}^{-1}$ in a few dozen iterations.
Three numerical traps to flag for the implementation. The endpoints have closed forms — at $\hat r = 0$, $\mathrm{kl}(0\,\|\,r) = \ln\frac{1}{1-r}$, so $\mathrm{kl}^{-1}(0, b) = 1 - e^{-b}$; at $\hat r = 1$ the upper-confidence bound is vacuous. The RHS $b$ can exceed $\mathrm{kl}(\hat r\,\|\,r)$ at the bracket’s right endpoint, in which case the bound is vacuous; return $1$ rather than letting brentq fail on a missing bracket. And use scipy.special.kl_div(p, q) + scipy.special.kl_div(1-p, 1-q) (which handles $p \in \{0, 1\}$ correctly), not the raw expression $p\ln(p/q) + (1-p)\ln\big((1-p)/(1-q)\big)$, which NaNs at $p \in \{0, 1\}$.
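A sketch of the inverter along exactly these lines:

```python
import numpy as np
from scipy.special import kl_div
from scipy.optimize import brentq

def binary_kl(p, q):
    # kl_div handles p in {0, 1} correctly (0 log 0 = 0)
    return kl_div(p, q) + kl_div(1 - p, 1 - q)

def kl_inverse(r_hat, b):
    """Largest r in [r_hat, 1] with kl(r_hat || r) <= b (Seeger upper bound)."""
    if b <= 0:
        return r_hat
    hi = 1.0 - 1e-11
    if binary_kl(r_hat, hi) <= b:        # no bracket: bound is vacuous
        return 1.0
    return brentq(lambda r: binary_kl(r_hat, r) - b, r_hat, hi)

# example with assumed numbers: r_hat = 0.02, b = 25/1000 nats
print(f"{kl_inverse(0.02, 0.025):.4f}")
```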
Demo — McAllester vs Seeger across $\hat r$
The demo at the close of the section makes the §5.1 claim concrete: for fixed $\mathrm{KL}$, $n$, and $\delta$, we plot the upper confidence bound on $\mathbb{E}_Q[R]$ produced by each of the two bounds as a function of $\hat r = \mathbb{E}_Q[\hat R_S]$. At $\hat r \approx 0$ the Seeger bound is dramatically tighter — its slack is roughly $b$ rather than $\sqrt{b/2}$, with $b$ the nats-per-sample budget — and the two converge as $\hat r$ approaches $1/2$, where Pinsker is locally tight. The McAllester bound is symmetric in $\hat r$ around $1/2$ (because the squared difference is); Seeger’s bound is asymmetric, with a longer arm extending into low-$\hat r$ territory.
At the canonical demo point the Seeger certificate comes out a factor of about five tighter than McAllester’s in the near-realizable regime, and the two converge as $\hat r$ approaches $1/2$.
The Catoni / linear bound and the Gibbs posterior
Statement
The third canonical bound goes back to Catoni’s 2007 monograph. It is linear in $\mathbb{E}_Q[\hat R_S]$ and in $\mathrm{KL}(Q\,\|\,P)$ — no implicit equation to invert — parametrized by a temperature $\lambda > 0$ chosen before seeing the data, and structurally crucial because its right-hand-side minimizer over $Q$ is a recognizable object: the Gibbs distribution.
Theorem 4 (Catoni linear-form PAC-Bayes bound).
Let $\mathcal{D}$ be any distribution on $\mathcal{Z}$, $\ell$ bounded in $[0,1]$, $P$ a prior chosen before seeing the data, $\delta \in (0,1)$, and $\lambda > 0$ a fixed temperature. Then

$$\Pr_{S\sim\mathcal{D}^n}\!\left(\forall\, Q \ll P:\;\; \mathbb{E}_Q[R] \;\le\; \mathbb{E}_Q[\hat R_S] + \frac{\lambda}{8n} + \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac1\delta}{\lambda}\right) \;\ge\; 1-\delta.$$

Three features distinguish this from Theorems 2 and 3. The RHS is linear in $\mathbb{E}_Q[\hat R_S]$ and in $\mathrm{KL}(Q\,\|\,P)$ — a single closed-form expression, no numerical inversion. The temperature $\lambda$ must be fixed before seeing the data for the bound to hold with the stated confidence; a data-dependent $\lambda$ (a grid search over a finite set) is allowed via a union bound that adds $\ln k$ to the numerator for $k$ candidate values (§13.3). The bound’s RHS — viewed as a function of $Q$ at fixed $\lambda$ — has a closed-form minimizer that is precisely the Gibbs distribution, which §6.2 derives.
The Gibbs distribution as the RHS minimizer
Fix $\lambda$ and the sample $S$. The RHS of Theorem 4 as a function of $Q$ is

$$B(Q) \;=\; \mathbb{E}_Q[\hat R_S] + \frac{\lambda}{8n} + \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac1\delta}{\lambda},$$

and the $Q$-dependent piece is $\mathbb{E}_Q[\hat R_S] + \mathrm{KL}(Q\,\|\,P)/\lambda$. Multiplying by $\lambda$, we want to minimize

$$\lambda\,\mathbb{E}_Q[\hat R_S] + \mathrm{KL}(Q\,\|\,P) \;=\; -\Big(\mathbb{E}_Q\big[-\lambda\hat R_S\big] - \mathrm{KL}(Q\,\|\,P)\Big).$$

The expression in parentheses is exactly the variational functional from Theorem 1 of §3 with $f = -\lambda\hat R_S$. Its maximum over $Q$ is $\ln Z_\lambda$, attained at

$$Q_\lambda(dh) \;=\; \frac{e^{-\lambda\hat R_S(h)}}{Z_\lambda}\,P(dh),$$

where the partition function $Z_\lambda = \mathbb{E}_{h\sim P}\big[e^{-\lambda\hat R_S(h)}\big]$ is computable from the sample. The minimum of $B$ over $Q$ is therefore attained at $Q_\lambda$, and plugging back,

$$B(Q_\lambda) \;=\; -\frac{\ln Z_\lambda}{\lambda} + \frac{\lambda}{8n} + \frac{\ln\frac1\delta}{\lambda}.$$

This is the tightest Catoni bound at temperature $\lambda$: every other $Q$ gives a strictly larger RHS, and the difference is precisely $\mathrm{KL}(Q\,\|\,Q_\lambda)/\lambda$ (cf. the saturation discussion in §3.4).
The distribution $Q_\lambda$ has a name in every statistical-physics-adjacent literature: it is the Gibbs distribution at temperature $\lambda$ — also called the tilted distribution, the exponentially weighted posterior, or, in the PAC-Bayes literature specifically, the Catoni-optimal posterior. The temperature controls its concentration: low $\lambda$ means $Q_\lambda$ stays close to $P$ (light tilt); high $\lambda$ means $Q_\lambda$ concentrates on hypotheses with low empirical risk (sharp tilt). §6.4 makes this precise.
The Gibbs distribution is also exactly what variational inference produces in the limit of perfect approximation (the variational family being all probability measures $Q \ll P$), and what the Bayesian posterior is when the loss is interpreted as a negative log-likelihood and $\lambda = n$. §9 develops both correspondences carefully.
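On the running example the whole derivation is a few lines; continuing the earlier sketch, with an arbitrary temperature:

```python
def gibbs_posterior(lam):
    """Catoni-optimal posterior on the threshold grid at temperature lam."""
    logw = -lam * emp_risk
    w = np.exp(logw - logw.max())      # stabilized exponential tilt of P
    return w / w.sum()

def catoni_rhs(Q, lam, delta=0.05):
    kl = rel_entr(Q, P).sum()
    return (Q * emp_risk).sum() + lam / (8 * n) + (kl + np.log(1 / delta)) / lam

lam = 50.0
Q_gibbs = gibbs_posterior(lam)
print(f"RHS at Gibbs  : {catoni_rhs(Q_gibbs, lam):.3f}")   # the minimizer
print(f"RHS at narrow : {catoni_rhs(Q_narrow, lam):.3f}")  # strictly larger
```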
Proof of Theorem 4
The proof is a four-line application of the §3 machinery.
Proof.
For each fixed $h$, the random variables $\ell(h, z_i)$ are i.i.d. with mean $R(h)$, so $\hat R_S(h)$ is an average of $n$ bounded variables and Hoeffding’s MGF bound gives, for every $\lambda > 0$,

$$\mathbb{E}_{S\sim\mathcal{D}^n}\Big[e^{\lambda(R(h) - \hat R_S(h))}\Big] \;\le\; e^{\lambda^2/(8n)}.$$

Set $f(h) = \lambda\big(R(h) - \hat R_S(h)\big)$ — a function of $h$ that depends on $S$ through $\hat R_S$. Then $\mathbb{E}_S[e^{f(h)}] \le e^{\lambda^2/(8n)}$ for each fixed $h$. By Fubini (with $P$ data-independent),

$$\mathbb{E}_{S}\,\mathbb{E}_{h\sim P}\big[e^{f(h)}\big] \;\le\; e^{\lambda^2/(8n)}.$$

Markov’s inequality: with probability $1-\delta$ over $S$,

$$\mathbb{E}_{h\sim P}\big[e^{f(h)}\big] \;\le\; \frac{e^{\lambda^2/(8n)}}{\delta}.$$

On that event, the change-of-measure inequality (Corollary 1 of §3) with this $f$ gives, for every $Q \ll P$,

$$\lambda\,\mathbb{E}_Q\big[R - \hat R_S\big] \;\le\; \mathrm{KL}(Q\,\|\,P) + \frac{\lambda^2}{8n} + \ln\frac1\delta.$$

Dividing by $\lambda$ and rearranging,

$$\mathbb{E}_Q[R] \;\le\; \mathbb{E}_Q[\hat R_S] + \frac{\lambda}{8n} + \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac1\delta}{\lambda},$$

which is the statement of Theorem 4.
∎ Two observations on the proof. The Hoeffding MGF bound is the slack term we pay for working with arbitrary bounded losses (rather than Bernoullis specifically, whose exact cumulant gives a tighter exponential form due to Catoni 2007). Replacing it with the exact Bernoulli cumulant produces Catoni’s exponential-form bound; the linearized version above is what most modern papers use. And the proof uses none of the lossy steps of McAllester’s chain — no Pinsker, no Jensen-for-squares. The cost is paid entirely in the temperature parameter: the bound depends on $\lambda$ in two competing terms ($\lambda/8n$ rising linearly, $(\mathrm{KL} + \ln\frac1\delta)/\lambda$ falling), and choosing $\lambda$ is the substance of §6.4.
Choosing $\lambda$ and the bias–variance trade-off
The RHS of Theorem 4 viewed as a function of $\lambda$ at fixed $Q$ has a clean U-shape. Differentiating,

$$\frac{d}{d\lambda}\left[\frac{\lambda}{8n} + \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac1\delta}{\lambda}\right] \;=\; \frac{1}{8n} - \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac1\delta}{\lambda^2},$$

zero at $\lambda^* = \sqrt{8n\big(\mathrm{KL}(Q\,\|\,P) + \ln\frac1\delta\big)}$. Plugging back, the optimized bound is

$$\mathbb{E}_Q[R] \;\le\; \mathbb{E}_Q[\hat R_S] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac1\delta}{2n}}.$$

This is structurally identical to McAllester (Theorem 2) but with $\ln\frac1\delta$ in place of $\ln\frac{2\sqrt n}{\delta}$ — a saving of $\ln(2\sqrt n)$ nats, sometimes meaningful. The catch is that $\lambda^*$ depends on $\mathrm{KL}(Q\,\|\,P)$, which is data-dependent. To choose $\lambda$ honestly, we must $\delta$-union over a finite grid of values, which adds $\ln k$ to the numerator for $k$ candidates — typically $k = O(\log n)$, recovering McAllester’s constant up to lower-order terms.
The Catoni form’s practical edge is therefore not in being asymptotically tighter than McAllester. It is in two other places. At fixed $\lambda$, the linearity in $\mathbb{E}_Q[\hat R_S]$ and $\mathrm{KL}(Q\,\|\,P)$ makes joint optimization over $Q$ tractable — which is what the Dziugaite–Roy reproduction of §11 exploits, gradient-descenting on a Gaussian $Q$’s mean and variances to minimize the linear RHS. And the Gibbs minimizer is a recognizable object that bridges to variational inference and Bayesian inference; minimizing McAllester’s bound over $Q$ produces a similar object but without the clean exponential form.
The temperature parameter has an interpretive role too. Small $\lambda$ corresponds to “trust the prior more” — $Q_\lambda$ stays close to $P$, the KL term is small, but $\mathbb{E}_{Q_\lambda}[\hat R_S]$ stays near the prior average (likely a bad value). Large $\lambda$ corresponds to “trust the data more” — $Q_\lambda$ concentrates near the ERM, $\mathbb{E}_{Q_\lambda}[\hat R_S]$ drops, but the KL term grows. The optimal $\lambda$ is the bias-variance sweet spot.
Demo — temperature sweep and Gibbs sharpening
We sweep $\lambda$ across roughly three orders of magnitude on the running example. The first visualization shows $Q_\lambda$ on the threshold grid at five representative temperatures. At the smallest $\lambda$, $Q_\lambda$ is nearly uniform; at the largest, it concentrates on the ERM cell. The intermediate values show smooth sharpening.
The second decomposes the optimized Catoni bound into its three components as a function of $\lambda$. The empirical-risk component $\mathbb{E}_{Q_\lambda}[\hat R_S]$ decreases monotonically from the prior average at small $\lambda$ to the ERM risk at large $\lambda$. The linearization penalty $\lambda/8n$ rises linearly. The KL-over-$\lambda$ term rises from zero, peaks, and falls back. The total has a U-shape with a clear minimum, and the verified empirical optimum gives the Gibbs-at-its-own-optimum certificate reported in the §10 comparison.
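The sweep itself, continuing the §6.2 sketch:

```python
lams = np.logspace(0, 3, 60)           # three decades of temperatures
bounds = [catoni_rhs(gibbs_posterior(l), l) for l in lams]
best = int(np.argmin(bounds))
print(f"best lambda ~ {lams[best]:.0f}, Catoni certificate = {bounds[best]:.3f}")
```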
Refinements
The three canonical bounds of §§4–6 are the load-bearing results, but the PAC-Bayes literature has produced a substantial body of refinements that bite in specific regimes: tighter constants for problems where the slack matters operationally, variance-adaptive bounds for low-variance losses, jointly-quasiconvex forms suitable for gradient-based optimization, and small-$n$ analyses that show where the canonical bounds saturate. This section is a survey rather than a full development; we sketch the four most consequential refinements at the level needed to use them, defer full proofs to their sources, and close with a numerical demonstration of the variance-adaptive trade-off.
Maurer (2004) — tighter constants and the realizable rate
Maurer’s 2004 note is the source we have already drawn on twice — first for the moment lemma in §4.1, and second for the Pinsker step that converts the kl-form to the squared-difference form. Two further contributions of the same paper sharpen practical computations.
The realizable-case improvement. When $\mathbb{E}_Q[\hat R_S] = 0$ (zero empirical risk for every $h$ in the support of $Q$), the binary-KL reduces to $\mathrm{kl}(0\,\|\,r) = \ln\frac{1}{1-r}$ rather than a quadratic. The Seeger bound in this regime certifies $\mathbb{E}_Q[R] \le 1 - e^{-b}$ where $b = \big(\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}\big)/n$, which for small $b$ is approximately $b$ itself — a fast rate of $O(1/n)$ rather than the slow $O(1/\sqrt n)$ rate of McAllester. The Dziugaite–Roy reproduction in §11 lands precisely in this realizable regime, and the fast rate is what makes the bound non-vacuous at finite $n$.
The best constant in the moment lemma: Maurer proves the $2\sqrt n$ rate is sharp for the Bernoulli, with no further refinement available without distributional assumptions. Tighter versions exist for sub-Gaussian losses (Catoni 2007) or losses with a known variance bound (Tolstikhin–Seldin 2013; §7.2 below), but they require extra structure.
Tolstikhin–Seldin (2013) — empirical-Bernstein PAC-Bayes
The McAllester and Catoni bounds use Hoeffding-style moment bounds that pay for the worst-case variance of a $[0,1]$-bounded loss, namely $1/4$. When the actual variance is much smaller — as in any realizable or near-realizable regime — that worst case is grossly conservative. Bernstein’s inequality replaces the worst-case variance with the empirical variance, and Tolstikhin & Seldin (2013) lift the substitution into a PAC-Bayes bound.
The empirical variance of the loss family under $Q$ is

$$\hat V_Q \;=\; \mathbb{E}_{h\sim Q}\left[\frac{1}{n-1}\sum_{i=1}^n \big(\ell(h, z_i) - \hat R_S(h)\big)^2\right].$$

The Tolstikhin–Seldin bound — with probability $1-\delta$ over $S$, and with a finite $\delta$-union grid over candidate variance bounds — gives, for every $Q$, schematically

$$\mathbb{E}_Q[R] \;\lesssim\; \mathbb{E}_Q[\hat R_S] + \sqrt{\frac{2\hat V_Q\big(\mathrm{KL}(Q\,\|\,P) + \ln\frac k\delta\big)}{n}} + O\!\left(\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac k\delta}{n}\right),$$

with $k$ the grid size (see Tolstikhin & Seldin 2013 for the exact constants).
The leading term is variance-adaptive: when $\hat V_Q \ll 1/4$, the bound is much tighter than McAllester’s. The correction term is the cost of working with the empirical (vs. true) variance. In the realizable regime $\hat V_Q = 0$, and only the second-order $O(1/n)$ term remains — the fast rate, without needing kl-form inversion. Mhammedi, Grünwald & Guedj (2019) refine the bound with an “Un-Expected Bernstein” inequality whose correction vanishes whenever the learning algorithm is sufficiently stable on $S$.
Thiemann et al. (2017) — a strongly quasiconvex form
The Catoni bound of §6 is jointly convex in $\mathbb{E}_Q[\hat R_S]$ and $\mathrm{KL}(Q\,\|\,P)$ at fixed $\lambda$, and convex in $\lambda$ at fixed $Q$. But it is not jointly convex in $(Q, \lambda)$ simultaneously — the $\mathrm{KL}/\lambda$ term has a saddle structure. For applied PAC-Bayes, where one wants to gradient-descend on a parametric posterior with the temperature as a learnable parameter, this is a real obstacle.
Thiemann, Igel, Wintenberger, and Seldin (2017) sidestep it. Their bound, for any $\lambda \in (0, 2)$ chosen alongside $Q$,

$$\mathbb{E}_Q[R] \;\le\; \frac{\mathbb{E}_Q[\hat R_S]}{1 - \lambda/2} + \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}}{n\,\lambda\,(1 - \lambda/2)},$$

is strongly quasiconvex in $(Q, \lambda)$ when $Q$ ranges over a suitably parametrized family. Strong quasiconvexity is weaker than convexity but strong enough that gradient descent finds the global minimum from any initialization. Pérez-Ortiz, Rivasplata, Shawe-Taylor & Szepesvári (2021) used this bound to train neural networks where the network parameters and the bound coefficient are optimized jointly via SGD, producing some of the tightest published non-vacuous bounds for deep nets (theirs land at 0.16 on MNIST-binary, vs Dziugaite & Roy’s 0.21 on the same architecture).
The trade-off: a worse leading constant at the optimal $(Q, \lambda)$, but joint-optimization tractability matters more in practice.
Foong et al. (2021) — how tight can PAC-Bayes be at small $n$?
Foong, Bruinsma, Burt, and Turner (2021) carry out the analysis. Their main result: the kl-form (Seeger) bound is essentially tight in the realizable regime at any finite $n$ — for the specific worst-case data distribution among those compatible with the observed $\mathbb{E}_Q[\hat R_S]$, the Seeger bound matches the true expected risk up to lower-order terms. McAllester is loose by a factor that depends on $\hat r$, which is the gap we visualized in §5.5.
A second contribution: in regimes where the loss has small variance but is not exactly realizable, the empirical-Bernstein form (§7.2) can beat the kl-form by a constant factor — but only when $n$ is large enough that the second-order term in Tolstikhin–Seldin is dominated by the leading variance-adaptive term. Foong et al. note that at small $n$ with a non-trivial $\delta$-union over candidate variance bounds, Bernstein’s overhead can flip the comparison in Hoeffding’s favor. The takeaway for applied work — there is no uniformly tightest bound. Pick the bound for the regime you’re in:
| Regime | Recommended bound |
|---|---|
| Realizable, $\hat r \approx 0$ | Seeger (§5) |
| Small-variance, large-$n$, non-realizable | Tolstikhin–Seldin (§7.2) |
| Agnostic, $\hat r \approx 1/2$ | McAllester (§4) or Catoni (§6) |
| Jointly-optimized neural-net bounds | Thiemann (§7.3) |
Demo — variance-adaptive slack and the small-$n$ regime flip
The Tolstikhin–Seldin advantage is most visible when the empirical variance is much less than the Hoeffding worst-case $1/4$. We plot the certificate slack — i.e., the bound minus $\mathbb{E}_Q[\hat R_S]$ — for both Catoni-Hoeffding and Catoni-Bernstein as a function of the empirical risk $\hat r$ at fixed $n$, $\mathrm{KL}$, and $\delta$, treating the loss as Bernoulli (variance $\hat r(1-\hat r)$). The two slacks coincide at $\hat r = 1/2$ where Hoeffding is locally tight; they diverge as $\hat r \to 0$ or $\hat r \to 1$, but the direction of divergence depends on $n$ and $\mathrm{KL}$.
At the running example’s small-$n$ regime and its empirical risk, the verified notebook outputs are Catoni-Hoeffding slack $0.1140$ and Tolstikhin–Seldin slack $0.1549$ — a Bernstein/Hoeffding ratio of $1.36$, meaning Hoeffding is actually tighter at this small-$n$ configuration. The variance-adaptive promise of §7.2 is real, but it kicks in for either larger $n$ (so the second-order term shrinks) or smaller variance (so the overhead shrinks). Foong et al. (2021) flag this regime flip explicitly. Drag the slider in the demo below — the Bernstein curve drops below Hoeffding when $n$ exceeds a few thousand.
Hoeffding slack = 0.1140, Bernstein slack = 0.1549
ratio Bernstein/Hoeffding = 1.36 (Hoeffding tighter — small-n regime)
Choosing prior and posterior in continuous classes
From finite enumeration to parameter-space priors
The running example used a finite hypothesis class with a uniform prior, which made the KL term a discrete sum capped at $\ln 101 \approx 4.6$ nats. Every practical PAC-Bayes application — including the §11 Dziugaite–Roy reproduction — works in continuous parameter space, typically with Gaussian $P$ and Gaussian $Q$.
A neural network with parameters $w \in \mathbb{R}^d$ defines $h_w$, and a Gaussian distribution on $\mathbb{R}^d$ induces a distribution on $\mathcal{H}$ via the pushforward. The prior $P = \mathcal{N}(0, \sigma_P^2 I)$ — isotropic Gaussian, mean zero, single scalar variance — is the canonical choice and the one Dziugaite & Roy (2017) used to compute the first non-vacuous deep-network PAC-Bayes bound. The posterior $Q = \mathcal{N}\big(\mu, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)\big)$ — diagonal-covariance Gaussian with learnable mean and learnable per-coordinate variances — is the standard parametric posterior family for the same reason: it makes the KL term computable in closed form.
The lift from finite to continuous changes nothing about the bounds of §§4–6 themselves — McAllester, Seeger, and Catoni all hold for any $(P, Q)$ on any measurable hypothesis space, with the KL on the right computed in the same way. What changes is the technique for working out the KL: it’s now an integral over $\mathbb{R}^d$ instead of a sum over cells, and we need that integral in closed form to make the bound computable. The diagonal-covariance restriction on $Q$ is a parametric approximation; a fully general $Q$ could have arbitrary correlation structure, but the resulting KL involves a determinant computation that scales as $O(d^3)$, prohibitive for large $d$. The diagonal restriction is the practical compromise.
The Gaussian–Gaussian KL closed form
Proposition 1 (Multivariate Gaussian KL).
Let $Q = \mathcal{N}(\mu_Q, \Sigma_Q)$ and $P = \mathcal{N}(\mu_P, \Sigma_P)$ be two non-degenerate Gaussians on $\mathbb{R}^d$. Then

$$\mathrm{KL}(Q\,\|\,P) \;=\; \frac12\left[\mathrm{tr}\big(\Sigma_P^{-1}\Sigma_Q\big) + (\mu_P - \mu_Q)^\top\Sigma_P^{-1}(\mu_P - \mu_Q) - d + \ln\frac{\det\Sigma_P}{\det\Sigma_Q}\right].$$

In the diagonal case with $\Sigma_Q = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ and isotropic prior $\Sigma_P = \sigma_P^2 I$, $\mu_P = 0$, this specializes to

$$\mathrm{KL}(Q\,\|\,P) \;=\; \frac{\|\mu_Q\|^2}{2\sigma_P^2} \;+\; \frac12\sum_{i=1}^d\left(\frac{\sigma_i^2}{\sigma_P^2} - 1 - \ln\frac{\sigma_i^2}{\sigma_P^2}\right).$$

The two terms have distinct interpretations and matter at very different scales in the deep-net regime.
The mean-shift term $\|\mu_Q\|^2/(2\sigma_P^2)$ is quadratic in the posterior mean and inversely proportional to the prior variance. It charges for $Q$’s mean drifting away from the prior’s. In dimension $d$, this term scales with the squared norm of $\mu_Q$ — not with $d$ explicitly — but an SGD-trained network’s parameter vector tends to have squared norm growing with $d$, so the term is typically large in the deep-net regime.
The variance term $\frac12\sum_i g(\sigma_i^2/\sigma_P^2)$ with $g(t) = t - 1 - \ln t$ charges for each posterior variance differing from the prior’s. The function $g$ is convex with minimum $0$ at $t = 1$ and growing toward $\infty$ at both endpoints. For large $d$ with all $\sigma_i^2$ equal to a single $\sigma^2$, the variance term scales as $d\,g(\sigma^2/\sigma_P^2)/2$, so even a small per-coordinate deviation accumulates into a substantial KL bill. Matching $\sigma_i^2$ to $\sigma_P^2$ across coordinates is therefore not optional in high dimension — it is the dominant constraint on the bound.
The full KL is the sum, and the two terms are independently controllable. In the §11 Dziugaite–Roy reproduction the optimization separates into roughly these two pieces — the mean trades off against empirical risk via SGD-like dynamics, while the log-variances trade off against the variance KL via a per-coordinate scaling.
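A sketch of the diagonal case of Proposition 1, with illustrative dimension and variances chosen to show the accumulation effect:

```python
import numpy as np

def gaussian_kl_diag(mu_q, sigma2_q, sigma2_p):
    """KL( N(mu_q, diag(sigma2_q)) || N(0, sigma2_p I) ), split into its two terms."""
    ratio = sigma2_q / sigma2_p
    mean_shift = np.sum(mu_q**2) / (2 * sigma2_p)
    variance = 0.5 * np.sum(ratio - 1.0 - np.log(ratio))
    return mean_shift, variance

d = 10_000                                  # illustrative dimension
rng = np.random.default_rng(2)
mu = 0.1 * rng.standard_normal(d)           # stand-in for an SGD-trained mean
ms, var = gaussian_kl_diag(mu, np.full(d, 0.09), sigma2_p=0.1)
print(f"mean-shift = {ms:.0f} nats, variance term = {var:.0f} nats")
# a 10% per-coordinate variance mismatch already costs ~27 nats at d = 10^4
```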
Data-independent prior — $\delta$-union over a discretized $\sigma_P^2$
The PAC-Bayes bounds of §§4–6 require the prior to be chosen before seeing the data. For a Gaussian prior this means $\sigma_P^2$ must be fixed in advance — but the right $\sigma_P^2$ is data-dependent. The standard recipe, due to Langford & Caruana (2001) for SVMs and adapted by Dziugaite & Roy (2017) for deep networks, is to discretize $\sigma_P^2$ over a finite grid and union-bound over the choice.
Concretely: fix a finite grid $\{\sigma_1^2, \ldots, \sigma_k^2\}$ of candidate prior variances, chosen before seeing any data. For each $\sigma_j^2$, compute the McAllester (or Catoni, Seeger, etc.) bound with confidence $\delta/k$ in place of $\delta$, treating $\mathcal{N}(0, \sigma_j^2 I)$ as the prior. By a union bound over the grid, with probability at least $1-\delta$ over $S$ all $k$ bounds hold simultaneously, and we are then free to report the tightest one with no penalty beyond the $\ln k$ added to the numerator.
The grid is typically logarithmic in $\sigma_j^2$, with a few dozen candidates, costing $\ln k$ — a handful of nats — in the numerator. This is a one-time cost regardless of $d$ and is dwarfed by the mean-shift and variance terms in any realistic high-dimensional setting. The grid must be chosen and fixed before any inspection of $S$.
A natural temptation, when implementing this, is to optimize $\sigma_P^2$ continuously based on the data. This is invalid. A continuous data-dependent prior makes $P$ a function of $S$, and the moment lemmas of §§4–6 fail because the Fubini swap requires independence. The Pérez-Ortiz et al. (2021) work pushes the data-dependence further with sample splitting — use one half of $S$ to choose $P$, compute the bound on the other half — but the discrete-grid approach is the simpler and more widely used recipe and is what §11 follows.
The weight-decay correspondence
When the loss is approximately constant over $Q$’s support around its mean $\mu$ — equivalently, when $Q$ is concentrated enough that resampling $w \sim Q$ doesn’t change predictions much — the expected empirical risk $\mathbb{E}_{w\sim Q}[\hat R_S(h_w)]$ is approximately $\hat R_S(h_\mu)$. In this regime the Catoni bound at fixed $\lambda$ becomes, up to $Q$-independent terms,

$$\mathbb{E}_Q[R] \;\lesssim\; \hat R_S(h_\mu) + \frac{\mathrm{KL}(Q\,\|\,P)}{\lambda} + \mathrm{const}.$$

If we further fix the per-coordinate variances at their prior values ($\sigma_i^2 = \sigma_P^2$, so the variance term is zero) and minimize the right-hand side over $\mu$ at fixed $\lambda$, we are solving

$$\min_{\mu}\;\; \hat R_S(h_\mu) + \frac{\|\mu\|^2}{2\lambda\sigma_P^2}.$$

This is exactly $\ell_2$-regularized empirical risk minimization — what every introductory machine-learning course calls weight decay. The regularization strength is

$$\alpha \;=\; \frac{1}{2\lambda\sigma_P^2},$$

with $\lambda$ the PAC-Bayes temperature and $\sigma_P^2$ the prior variance. Larger temperatures and broader priors both correspond to weaker regularization. The result is structural: every weight-decay-regularized model can be read as the posterior mean of a Gaussian–Gaussian PAC-Bayes setup, and the PAC-Bayes bound at that $Q$ gives a generalization guarantee. Conversely, every Gaussian PAC-Bayes bound at fixed variance reduces to weight decay when minimized over the posterior mean.
Two qualifications. The approximation $\mathbb{E}_Q[\hat R_S] \approx \hat R_S(h_\mu)$ is good only when $Q$’s support is small enough that the loss is locally well-modeled by its value at the mean. For wide $Q$, the gap between the two is exactly what makes the bound non-trivial compared to a straight regularized-ERM analysis — this is where the §11 reproduction extracts its non-vacuous guarantee. And jointly optimizing $(\mu, \sigma)$ rather than holding $\sigma$ fixed is the better strategy in practice, since the variance term can be reduced per-coordinate.
Demo — the KL surface and its two terms
The demo visualizes the §8.2 scaling argument: in high dimension, the variance term (of order $d$ unless $\sigma_i^2 \approx \sigma_P^2$) dominates the mean-shift term (of order $\|\mu_Q\|^2$) by orders of magnitude unless $\sigma$ is carefully matched to $\sigma_P$. Setting $d$, $\sigma_P$, and $\mu$ to the regime the §11 reproduction lives in, the mean-shift contribution is a modest number of nats, while the variance term blows up if $\sigma$ is even modestly off $\sigma_P$ and stays small when the two are closely matched. Both dwarf the $\ln k$ overhead of the $\delta$-union over prior variances.
The Bayesian–PAC-Bayes correspondence
Gibbs at $\lambda = n$ with log-likelihood loss
The Gibbs distribution of §6.2,

$$Q_\lambda(dh) \;\propto\; e^{-\lambda\hat R_S(h)}\,P(dh),$$

is parametrized by a temperature $\lambda$ chosen by the user. At one specific temperature with one specific loss, it coincides with a familiar object from Bayesian inference. The choice is likelihood loss: take the per-sample loss to be the negative log-likelihood, $\ell(h, z) = -\ln p(z\mid h)$, assumed bounded in $[0,1]$. The empirical risk is then

$$\hat R_S(h) \;=\; -\frac1n\sum_{i=1}^n \ln p(z_i\mid h) \;=\; -\frac1n\ln p(S\mid h),$$

where $S$ is the joint sample and $p(S\mid h) = \prod_{i=1}^n p(z_i\mid h)$ is its likelihood under hypothesis $h$. Setting $\lambda = n$ in the Gibbs distribution,

$$Q_{\lambda=n}(dh) \;\propto\; e^{-n\hat R_S(h)}\,P(dh) \;=\; p(S\mid h)\,P(dh).$$

The right-hand side, normalized, is the Bayesian posterior under prior $P$ and likelihood $p(\cdot\mid h)$, by Bayes’ theorem.
Proposition 2 (Bayesian–PAC-Bayes correspondence).
Under likelihood loss with bounded likelihood ratios, the PAC-Bayes Gibbs distribution at temperature $\lambda = n$ equals the Bayesian posterior:

$$Q_{\lambda=n}(dh) \;=\; \frac{p(S\mid h)\,P(dh)}{\int p(S\mid h')\,P(dh')} \;=\; P(dh\mid S).$$

The result is elementary algebra given the right loss, but it has a structural significance worth pulling apart: it shows that Bayesian inference is the PAC-Bayes posterior optimizer at one specific temperature, for one specific loss.
The bound holds at $\lambda = n$
A consequence of the correspondence: the PAC-Bayes bounds of §§4–6 apply to the Bayesian posterior. Plugging $Q = P(\cdot\mid S)$ into Theorem 2 (McAllester),

$$\mathbb{E}_{h\sim P(\cdot\mid S)}[R(h)] \;\le\; \mathbb{E}_{h\sim P(\cdot\mid S)}[\hat R_S(h)] + \sqrt{\frac{\mathrm{KL}\big(P(\cdot\mid S)\,\|\,P\big) + \ln\frac{2\sqrt n}{\delta}}{2n}},$$

with probability $1-\delta$ over $S$. This is a frequentist generalization guarantee for Bayesian inference — the kind of statement Bayesian methodology by itself does not produce, since Bayesian probability calculations condition on the data and do not deliver tail bounds over data realizations.
Two notes. The bound at $\lambda = n$ is generally not the tightest PAC-Bayes bound available — the Catoni-optimal $\lambda^*$ produces a smaller RHS in most regimes. And the KL term $\mathrm{KL}\big(P(\cdot\mid S)\,\|\,P\big)$ — sometimes called the information gain in the Bayesian literature — is a computable quantity in closed form for conjugate families.
The correspondence runs in both directions. Bayesians get a frequentist tail bound for free. PAC-Bayesians get a recognizable “default” choice of posterior, with familiar geometric properties, that always satisfies the bound.
The cold-posterior effect
If the Bayesian posterior were optimal for some objective beyond bound-minimization — say, test-set accuracy on the held-out distribution — we would expect $\lambda = n$ to be the empirically preferred temperature for predictive performance. It is not. Wenzel et al. (2020), “How Good is the Bayes Posterior in Deep Neural Networks Really?”, showed that for SGD-trained deep networks, the cold posterior — $Q_\lambda$ for $\lambda > n$, i.e., a Gibbs distribution more concentrated than the Bayesian posterior — consistently outperforms the Bayes posterior on test-set predictive accuracy.
The PAC-Bayes view of the cold-posterior effect is clarifying. There is no theoretical reason to expect $\lambda = n$ to be the optimal temperature for any objective. The bound-optimal temperature from §6.4 is $\lambda^* = \sqrt{8n\big(\mathrm{KL}(Q_{\lambda^*}\,\|\,P) + \ln\frac1\delta\big)}$, an implicit equation whose solution depends on the actual KL geometry of the problem. For a deep network with a KL in the thousands of nats and $n$ in the tens of thousands, the bound-optimal $\lambda^*$ typically lands strictly between the two reference points — colder than the prior ($\lambda = 0$) and warmer than Bayes ($\lambda = n$). For different problem regimes, $\lambda^*$ can fall on either side of $n$.
The empirical cold-posterior effect — $\lambda > n$ wins on test accuracy — does not contradict any PAC-Bayes statement, because PAC-Bayes does not claim $\lambda = n$ is optimal. Several explanations have been proposed for why cold posteriors win in practice. Aitchison (2021) argues the effect is a downstream consequence of dataset curation — standard image benchmarks like CIFAR-10 have been hand-curated to be unambiguous, which makes the implicit likelihood overconfident and the Bayes posterior insufficiently sharp. Izmailov et al. (2021), using full-batch HMC to sample the true posterior, find that without data augmentation the cold-posterior effect largely disappears, and attribute the empirical phenomenon to misspecification introduced by augmentation. The PAC-Bayes framework is agnostic about which is the right diagnosis; it just provides the right framing for the question.
What the correspondence does not say
Three non-implications worth stating explicitly.
Bayesian inference is not “the right way” to do PAC-Bayes. $P(\cdot\mid S)$ is one valid choice of posterior; it is rarely the bound-optimal choice. Dziugaite & Roy’s (2017) non-vacuous deep-network bound uses a Gaussian $Q$ centered on the SGD output — neither the Bayesian posterior nor any Gibbs distribution.
PAC-Bayes does not validate Bayesian inference in misspecified settings. The frequentist guarantee requires the loss to be a proper bounded negative log-likelihood AND the prior to be data-independent. Real Bayesian workflows routinely involve weakly informative priors chosen after data inspection, data-dependent hyperparameter tuning, and likelihoods that are model approximations rather than data-generating truths.
The “agreement” is at $\lambda = n$ and likelihood loss only. Other losses (0/1, hinge, squared) have no corresponding Bayesian posterior. PAC-Bayes is strictly broader than Bayesian inference; the correspondence is a coincidence in a specific corner, not a general unification.
Demo — temperature sweep on a Bernoulli toy
The cleanest demonstration of the correspondence is in a toy where the correspondence applies exactly. We work with a Bernoulli mean-estimation problem: $z_i \sim \mathrm{Bernoulli}(\theta^\star)$, $n$ samples, hypothesis class a grid of 100 discrete candidate means $\theta \in (0,1)$, uniform prior $P$, per-sample loss the negative log-likelihood $\ell(\theta, z) = -\ln p(z\mid\theta)$. The verified correspondence: at $\lambda = n$, $Q_{\lambda=n} = P(\cdot\mid S)$ across the entire grid — exact to machine precision.
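A sketch of the check, with assumed toy values for $\theta^\star$ and $n$:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_star, n = 0.3, 50
z = (rng.random(n) < theta_star).astype(float)
k = z.sum()

grid = np.linspace(0.005, 0.995, 100)       # 100 candidate Bernoulli means
log_lik = k * np.log(grid) + (n - k) * np.log(1 - grid)

# Bayes posterior: uniform prior, so posterior is proportional to likelihood
bayes = np.exp(log_lik - log_lik.max())
bayes /= bayes.sum()

# Gibbs at lambda = n with per-sample loss -log p(z|theta):
# n * R_hat(theta) = -log p(S|theta), so the exponential tilts coincide
emp_risk_grid = -log_lik / n
gibbs = np.exp(-n * emp_risk_grid - (-n * emp_risk_grid).max())
gibbs /= gibbs.sum()

print("max |Gibbs - Bayes| =", np.abs(gibbs - bayes).max())   # ~0
```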
Three bounds on the running example
We have now developed four certificate-producing tools: the classical union bound for finite hypothesis classes from generalization bounds §3, McAllester (§4), Seeger (§5), and Catoni (§6). This section puts all four on the running example, reads off the numerics, and identifies the regimes where each dominates.
Setup and the union-bound baseline
The §1.5 setup: threshold classifiers, $|\Theta| = 101$, label-noise rate $\eta$, sample size $n$, uniform prior $P$. The ERM threshold has empirical risk near zero (the noise lands inside the margin between $\theta^\star$ and the cell boundaries).
The union-bound baseline from generalization-bounds §3: with probability $1-\delta$, simultaneously for every $h$ in the finite class,

$$R(h) \;\le\; \hat R_S(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln\frac1\delta}{2n}}.$$

For our setup, $\ln|\mathcal{H}| = \ln 101 \approx 4.6$ nats enter the numerator. This is a per-hypothesis statement; PAC-Bayes is per-posterior. Both are valid; §10.2 just confirms when each dominates.
The three PAC-Bayes bounds at $Q_n$
Evaluating at the narrow posterior $Q_n$ from §2.5 (the KL and expected empirical risk read off there), at the notebook’s $n$ and $\delta$, produces the ranking below, loosest to tightest; the numeric columns are filled by the verified notebook run:

| Certificate | Value | Slack above $\mathbb{E}_Q[\hat R_S]$ |
|---|---|---|
| Union bound (on ERM point-mass) | | |
| McAllester (§4) at $Q_n$ | | |
| Catoni-optimized (§6.4) closed form at $Q_n$ | | |
| Seeger (§5) at $Q_n$ | | |
Catoni-optimized beats McAllester on the slack side because it replaces $\ln\frac{2\sqrt n}{\delta}$ with $\ln\frac1\delta$ — a $\ln(2\sqrt n)$-nat saving when the data-dependent $\lambda^*$ can be chosen freely. With an honest $\delta$-union over candidate $\lambda$ values (typically a few nats), the saving shrinks back to McAllester-equivalent. Seeger dominates everything because the realizable-regime advantage from §5.5 bears out: at small empirical risk, the kl-form is much tighter than the Pinsker-relaxed square-root form. For a topic where the typical empirical risk is small — as in any modern over-parameterized setting — Seeger should be the default.
For comparison, at the broad posterior $Q_b$ ($\mathrm{KL} = 0.67$, $\mathbb{E}_{Q_b}[\hat R_S] = 0.179$), the McAllester certificate sits on a different point of the KL-vs-risk curve: the broad posterior pays less in KL but more in expected empirical risk.
When PAC-Bayes wins
The §10.2 numbers show PAC-Bayes winning by 15–40% — real but modest, because $|\mathcal{H}|$ is small. The advantage grows in three regimes.
Large or infinite $\mathcal{H}$. The union bound scales as $\sqrt{\ln|\mathcal{H}|/n}$, going to infinity as $|\mathcal{H}| \to \infty$. PAC-Bayes with continuous Gaussian $(P, Q)$ has a KL that stays finite. The §11 Dziugaite–Roy reproduction is the marquee case: a continuum of hypotheses (union bound vacuous), but PAC-Bayes lands at a non-vacuous certificate.
Concentrated $Q$ with $\mathrm{KL}(Q\,\|\,P) \ll \ln|\mathcal{H}|$. When the posterior concentrates on a small effective subset, the KL is dramatically less than $\ln|\mathcal{H}|$. For higher-dimensional classes the saving can be orders of magnitude.
Realizable or near-realizable regimes. At $\hat r = 0$, Seeger certifies $\mathbb{E}_Q[R] \le 1 - e^{-b} \approx b$ with $b = \big(\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt n}{\delta}\big)/n$ — a fast $O(1/n)$ rate, dominating the union bound’s $O(\sqrt{\ln|\mathcal{H}|/n})$ at large $n$.
When the union bound wins
Two restricted settings.
Tiny $\mathcal{H}$ with point-estimate output. If committing to the ERM as a point estimate and $|\mathcal{H}|$ is small, the union bound’s slack is competitive with — and often slightly tighter than — McAllester for a point-mass $Q = \delta_{\hat h}$, which pays the same $\ln|\mathcal{H}|$ (against a uniform prior) plus the extra $\ln(2\sqrt n)$ in the numerator. In our running example, the point-mass McAllester slack slightly exceeds the union bound’s.
Adversarial classes where a uniform prior is correct. If no prior preference exists and the analyst is committed to reporting bounds for an arbitrary $h \in \mathcal{H}$ to be chosen later (rather than a specific posterior $Q$), the union bound is the right tool.
In practice, neither restriction holds in modern ML. Hypothesis classes are continuous and high-dimensional, the analyst is willing to commit to a posterior, and the loss is in a realizable or near-realizable regime.
Demo — bound comparison curves
The first panel sweeps $n$ at fixed $\delta$; all four bounds scale as $O(1/\sqrt n)$ or faster, Seeger dominates throughout the realizable regime, Catoni-optimized sits below McAllester by a constant gap, and Union and McAllester overlap. The second panel groups the four bounds at fixed $n$ across three posteriors. The union-bound bars are constant across clusters (the bound doesn’t depend on $Q$); the PAC-Bayes bars track the empirical-risk floor of each posterior plus its certificate slack.
Non-vacuous deep-network bounds (Dziugaite–Roy 2017)
The vacuousness reckoning, revisited
The companion topic on generalization bounds closes its §12 demo with an unsettling observation. A multi-layer perceptron with ~12,000 parameters trained on 1,000 binary-MNIST samples has Rademacher complexity at least 1 (the network can fit randomized labels to zero training error), so the classical uniform-convergence bound is at least $2$ — vacuous. The observed generalization gap on the same network is around 1–3% in test error. Classical bound and observed reality differ by two orders of magnitude.
Dziugaite & Roy (NeurIPS 2017), “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data,” produced the first PAC-Bayes bound that closed this gap. Their result: a non-vacuous bound of approximately 0.21 on the 0/1 expected risk of a stochastic version of an SGD-trained MLP on binary MNIST. The same network’s Rademacher bound on the same data is vacuous; the same network’s test 0/1 error is around 0.02. Their 0.21 bound is loose by an order of magnitude relative to test error, but it is the first bound that lives in the right neighborhood at all.
The technical move that closes the gap is exactly the PAC-Bayes posterior-vs-class duality of §1.3. Uniform convergence charges $\mathcal{H}$, which has saturated Rademacher complexity at any realistic $n$. PAC-Bayes charges $Q$’s drift from $P$, and for a Gaussian $Q$ centered on the SGD output, the KL divergence is finite and can be optimized to balance against empirical risk.
The recipe
The Dziugaite–Roy construction has four steps.
Step 1: train a deterministic network with SGD. Use standard machine-learning tools — sklearn’s MLPClassifier — to fit a network with $d$ parameters to a training set of size $n$. Call the output $w_{\mathrm{SGD}}$. This is the starting point.
Step 2: build a Gaussian PAC-Bayes posterior centered on $w_{\mathrm{SGD}}$. Define $Q = \mathcal{N}\big(\mu, \mathrm{diag}(e^{2\rho})\big)$ with diagonal covariance, learnable mean $\mu$, and learnable per-coordinate log-standard-deviations $\rho$. Initialize $\mu = w_{\mathrm{SGD}}$ and $\rho$ at some small initial value.
Step 3: choose the prior variance from a discretized grid. Per §8.3, the prior must be data-independent, so we fix a grid of $k$ candidate prior variances before seeing any data and union-bound the analysis with confidence $\delta/k$ at each grid point.
Step 4: optimize the bound jointly over $(\mu, \rho)$. At each fixed $\sigma_P^2$, minimize the Catoni-optimized PAC-Bayes bound

$$\mathbb{E}_Q[\hat R_S] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac k\delta}{2n}}$$

over $(\mu, \rho)$ via gradient descent. Closed-form KL gradient (§8.2) + reparametrization-trick empirical-risk gradient (§11.3). At convergence, evaluate the final certificate by computing $\mathbb{E}_Q[\hat R_S]$ via Monte Carlo on the 0/1 loss and plugging into the bound.
Implementation in plain NumPy
The bound optimization in step 4 is what would, in a fuller-featured implementation, use PyTorch or JAX for automatic differentiation. The topic’s code-language policy is NumPy / SciPy / sklearn, so we implement the gradients by hand.
The KL gradient is closed-form. From Proposition 1 of §8.2, with $\sigma_i^2 = e^{2\rho_i}$,

$$\frac{\partial\,\mathrm{KL}}{\partial\mu_i} \;=\; \frac{\mu_i}{\sigma_P^2}, \qquad \frac{\partial\,\mathrm{KL}}{\partial\rho_i} \;=\; \frac{e^{2\rho_i}}{\sigma_P^2} - 1.$$

Per-coordinate, $O(d)$ operations in total.
The empirical-risk gradient uses the reparametrization trick. Reparametrize $w = \mu + e^{\rho}\odot\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$. Then

$$\nabla_{\mu}\,\mathbb{E}_Q\big[\hat R_S(h_w)\big] \;=\; \mathbb{E}_{\varepsilon}\big[\nabla_w \hat R_S(h_w)\big], \qquad \nabla_{\rho}\,\mathbb{E}_Q\big[\hat R_S(h_w)\big] \;=\; \mathbb{E}_{\varepsilon}\big[\nabla_w \hat R_S(h_w)\odot e^{\rho}\odot\varepsilon\big].$$

Both are estimated by Monte Carlo with one sample per gradient step. The inner gradient is the standard backpropagation gradient through the MLP, implemented manually in NumPy.
The empirical-risk loss used for the gradient is a smooth surrogate — binary cross-entropy with sigmoid output — even though the bound certificate is reported on the 0/1 loss. Standard practice: the 0/1 loss has zero gradient almost everywhere, so a smooth surrogate is needed for the optimizer. Final certification computes $\mathbb{E}_Q[\hat R_S]$ on the 0/1 loss with a separate, larger Monte Carlo step.
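A sketch of the two gradient pieces; `grad_w_fn` is a hypothetical stand-in for the hand-written MLP backward pass, not a function defined in this topic:

```python
import numpy as np

rng = np.random.default_rng(4)

def kl_grads(mu, rho, sigma2_p):
    """Closed-form gradients of KL(Q || N(0, sigma2_p I)) from §8.2."""
    return mu / sigma2_p, np.exp(2 * rho) / sigma2_p - 1.0

def reparam_grads(mu, rho, grad_w_fn):
    """One-sample reparametrization-trick gradients of E_Q[R_hat] wrt (mu, rho).

    grad_w_fn(w): backprop gradient of the surrogate empirical risk at w
    (hypothetical helper standing in for the manual NumPy backward pass).
    """
    eps = rng.standard_normal(mu.shape)
    w = mu + np.exp(rho) * eps          # w = mu + sigma * eps
    g = grad_w_fn(w)
    return g, g * np.exp(rho) * eps     # chain rule: dw/d rho = e^rho * eps
```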
δ-union over the discretized prior variance
A logarithmic grid of K = 10–20 candidate prior variances, fixed before any inspection of S. For each σ²_P,k in the grid:
- Compute the optimized bound by running the §11.3 gradient descent at fixed σ²_P,k.
- Evaluate the final certificate with δ/K in place of δ in the Catoni-optimized formula.
By a union bound over the grid, with probability 1 − δ over S, all K certificates hold simultaneously, and the reported certificate is the minimum over the grid. The cost is ln K ≈ 2–3 nats added to the bound’s numerator — a one-time fixed cost regardless of n.
For the miniature implementation in §11.5, we use a 5-point grid to keep total runtime under budget.
The reproduction in miniature
The §11.5 demo runs the full pipeline on a binary-MNIST subset: digits 0 vs 1, n = 1,000 training points, an MLP with one hidden layer of 16 ReLU units (≈ 12,000 parameters), Adam for the bound minimization with a fixed step budget per prior-variance grid point, one Monte Carlo sample per Adam step, and a separate Monte Carlo run for the 0/1 certificate evaluation.
The verified notebook output: a final certificate of ≈ 0.24 — non-vacuous (well below 1), an order of magnitude looser than the SGD test 0/1 error of ≈ 0.02, and in the same neighborhood as Dziugaite & Roy’s 2017 result on a similar architecture. The whole pipeline runs in about 8 seconds on a 2020-era laptop.
Caveat. The miniature reproduction does not match the original Dziugaite–Roy (2017) bound of 0.21 on MNIST-binary or the Pérez-Ortiz et al. (2021) bound of 0.16 on the same data. The state-of-the-art numbers use larger training sets, more careful priors (data-dependent via sample splitting), and substantially more compute. Our miniature demonstrates the phenomenon — that PAC-Bayes can produce non-vacuous bounds for over-parameterized networks where uniform convergence cannot — at a scale that fits the topic’s 60-second CPU budget.
ML applications
PAC-Bayes is not just a tool for producing generalization certificates on individual models. The change-of-measure inequality and its consequences turn up across modern machine-learning theory in places that don’t obviously involve a posterior over hypotheses. This section sketches four such connections.
Meta-learning
In meta-learning we observe a sequence of related tasks drawn from a meta-distribution, each task t providing its own training set S_t. The natural PAC-Bayes structure is hierarchical: an outer PAC-Bayes bound at the task level (with a meta-prior over task-specific priors) plus an inner PAC-Bayes bound within each task.
Pentina & Lampert (ICML 2014), “A PAC-Bayesian Bound for Lifelong Learning,” developed the framework in the sequential-task setting. Amit & Meir (ICML 2018), “Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory,” extended this to the standard meta-learning setting with i.i.d. tasks, giving rigorous PAC-Bayes guarantees for what model-agnostic meta-learning (MAML; Finn, Abbeel, Levine 2017) does in practice. The framework has since been extended to continual learning, transfer learning, and few-shot classification.
Compression-based bounds
A network compressible to k bits has effective parameter count k, not the original d. Arora, Ge, Neyshabur, and Zhang (ICML 2018), “Stronger Generalization Bounds for Deep Nets via a Compression Approach,” made this precise via PAC-Bayes. The construction: given a k-bit compression scheme preserving test error within ε, build a posterior Q supported on the compressed networks. With a uniform prior P over the same set, KL(Q ∥ P) ≤ k ln 2 nats, and PAC-Bayes gives

R(Q) ≤ R̂_S(Q) + √((k ln 2 + ln(2√n/δ)) / (2n)).

For networks compressible to k ≪ d bits — common for over-parameterized models in their “winning ticket” subspace — the bound is non-vacuous at moderate n even when d ≫ n.
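To make the arithmetic concrete, here is the complexity term of the McAllester-form bound above for illustrative numbers (ours, not the paper’s):

```python
import numpy as np

# Illustrative: a network compressed to k bits, certified on n samples
k, n, delta = 4000, 50_000, 0.05
kl = k * np.log(2)                                   # KL(Q || P) <= k ln 2, uniform P
gap = np.sqrt((kl + np.log(2 * np.sqrt(n) / delta)) / (2 * n))
print(f"complexity term: {gap:.3f}")                 # ~0.17 -- non-vacuous headroom
```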
Zhou, Veitch, Austern, Adams, and Orbanz (ICLR 2019), “Non-vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach,” extended the compression framework to large-scale ImageNet-class models.
Information-theoretic generalization
Xu & Raginsky (NeurIPS 2017), “Information-Theoretic Analysis of Generalization Capability of Learning Algorithms,” proved a bound on the expected generalization gap in terms of the mutual information I(W; S) between the learning algorithm’s output W and the training sample S:

|E[R(W) − R̂_S(W)]| ≤ √(2σ² I(W; S) / n),

where σ² is the sub-Gaussian parameter of the per-sample loss. The mutual information is exactly the expected PAC-Bayes KL term (with the prior taken to be the marginal distribution of W), averaged over the sample randomness.
Steinke & Zakynthinou (COLT 2020), “Reasoning About Generalization via Conditional Mutual Information,” refined the Xu–Raginsky bound to use conditional mutual information given an auxiliary “supersample” of 2n points, often dramatically tighter. Negrea, Haghifam, Dziugaite, Khisti & Roy (NeurIPS 2019) and Haghifam et al. (NeurIPS 2020) applied the conditional-MI framework to Langevin dynamics, yielding state-of-the-art guarantees for SGD-based algorithms under stability assumptions.
The CMI and PAC-Bayes views are complementary. PAC-Bayes is the natural tool for certifying a specific stochastic predictor; CMI is the natural tool for analyzing the generalization properties of a learning algorithm.
Tighter empirical bounds — Pérez-Ortiz et al. (2021)
The current state-of-the-art applied PAC-Bayes bound on a vision benchmark is due to Pérez-Ortiz, Rivasplata, Shawe-Taylor, and Szepesvári, “Tighter Risk Certificates for Neural Networks” (JMLR 2021). On MNIST-binary they achieve 0.16 — versus Dziugaite & Roy (2017)’s 0.21 and the §11 miniature’s 0.24.
Their construction combines several techniques developed earlier in this topic: data-dependent priors via sample splitting; the Thiemann et al. (2017) quasiconvex form (§7.3); SVD-reduced features (PCA on the input) to reduce effective dimensionality; probabilistic networks trained from scratch with the PAC-Bayes bound as the loss.
The 0.16 result marks the frontier of what current techniques achieve. The gap to test error reflects two sources of looseness: the McAllester / Catoni constants themselves, and the choice of stochastic posterior Q, which has lower expected accuracy than the deterministic w_SGD. The latter source is fundamental — PAC-Bayes certifies R(Q), not R(w_SGD) — and §14.2 discusses de-randomization. Biggs & Guedj (AISTATS 2022) develop margin-based derandomisation techniques that recover deterministic guarantees from PAC-Bayes posteriors. Viallard, Germain, Habrard & Morvant (Mach. Learn. 2024) push further with disintegrated PAC-Bayes bounds that give single-hypothesis guarantees rather than averaged ones — a recent thread expected to keep tightening published bounds.
Computational notes
This section collects practical numerical considerations for implementing the bounds developed in §§4–11. The math is correct in finite precision only if certain standard tricks are observed.
Inverting the kl-form
The Seeger bound (§5) is implicit in the certified risk: given the empirical risk p̂ = R̂_S(Q) and the budget B = (KL(Q ∥ P) + ln(2√n/δ))/n, we need

kl⁻¹(p̂, B) := sup { q ∈ [p̂, 1] : kl(p̂ ∥ q) ≤ B }.

scipy.optimize.brentq handles this with four small additions (sketched in code after the list):
- Bracket carefully. Use [p̂ + ε, 1 − ε] with ε ≈ 1e−12; kl(p̂ ∥ q) − B changes sign exactly once on that interval.
- Handle endpoints in closed form. At p̂ = 0: kl(0 ∥ q) = −ln(1 − q), so kl⁻¹(0, B) = 1 − e^{−B}. At p̂ = 1: the bound is vacuous (certificate 1).
- Use scipy.special.kl_div. Don’t compute p ln(p/q) raw — it NaNs at p = 0. Use kl_div(p, q) + kl_div(1-p, 1-q); the extra ±(q − p) terms in scipy’s elementwise definition cancel in the sum.
- Detect the vacuous regime. If kl(p̂ ∥ 1 − ε) ≤ B, there is no root in the bracket — return 1.0.
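A minimal sketch combining the four items (function names ours):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import kl_div

def kl_binary(p, q):
    # kl(p || q) for Bernoullis; scipy's kl_div terms are x*log(x/y) - x + y,
    # and the extra -x + y pieces cancel in the two-term sum
    return kl_div(p, q) + kl_div(1.0 - p, 1.0 - q)

def kl_inverse(p_hat, B, eps=1e-12):
    """Largest q with kl(p_hat || q) <= B: the Seeger certificate."""
    if B <= 0.0:
        return p_hat
    if p_hat >= 1.0:
        return 1.0                          # empirical risk 1: vacuous
    if p_hat <= 0.0:
        return min(1.0, 1.0 - np.exp(-B))   # closed form at p_hat = 0
    hi = 1.0 - eps
    if kl_binary(p_hat, hi) <= B:           # no root inside the bracket
        return 1.0
    # kl(p_hat || q) increases in q on [p_hat, 1), so exactly one sign change
    return brentq(lambda q: kl_binary(p_hat, q) - B, p_hat + eps, hi)
```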
Stable log-sum-exp for partition functions
The Gibbs posterior Q*_λ ∝ P · exp(−λ R̂_S) has log-partition ln Z(λ) = ln Σ_h P(h) exp(−λ R̂_S(h)) (for a finite class; uniform P contributes a constant −ln |H|). The naive sum of exponentials under- or overflows when λ is large; the standard trick subtracts the maximum exponent before exponentiating.
scipy.special.logsumexp does this. For PAC-Bayes, where λ can be in the hundreds and R̂_S ∈ [0, 1], exp(−λ R̂_S) can underflow in double precision (exponents below about −745 flush to zero) without the subtraction. The Gibbs weights themselves, in log-space:
```python
import numpy as np

log_Q = np.log(P) - lam * risks   # unnormalized log-weights: log P(h) - lam * R_hat(h)
log_Q -= log_Q.max()              # stabilize: largest weight becomes exp(0) = 1
Q = np.exp(log_Q)
Q /= Q.sum()                      # normalize; absorbs the dropped log Z constant
```
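When the log-partition itself is needed (e.g., for the §9 correspondence), scipy.special.logsumexp computes it directly, continuing the arrays above:

```python
from scipy.special import logsumexp

# log Z(lam) = log sum_h P(h) * exp(-lam * risks[h]), stable for large lam
log_Z = logsumexp(np.log(P) - lam * risks)
```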
The δ-union over discretized hyperparameters
The trick recurs in §§6.4 (temperature λ), §8.3 (prior variance σ_P²), and §11.4 (Dziugaite–Roy). Recipe (sketched in code after this list):
- Fix a finite grid before seeing the data.
- At each grid point, compute the bound with confidence δ/K in place of δ.
- Report the minimum certificate over the grid.
Cost: ln K added to the numerator — 2–3 nats for K = 10–20. The grid must be data-independent. Setting the prior-variance grid by inspecting the trained network’s weight norm invalidates the bound.
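A sketch of the recipe; bound_at stands in for the optimized-bound computation at one grid point (both names are ours):

```python
def union_bound_certificate(bound_at, grid, delta):
    """Minimum certificate over a data-independent grid, valid at level delta.

    bound_at(theta, delta_k) -> certificate at hyperparameter theta computed
    with confidence delta_k; a stand-in for the optimized PAC-Bayes bound.
    """
    K = len(grid)
    certs = [bound_at(theta, delta / K) for theta in grid]  # each holds w.p. 1 - delta/K
    return min(certs)   # all hold simultaneously w.p. 1 - delta (union bound)

# hypothetical usage: grid fixed before looking at the data
# cert = union_bound_certificate(my_bound, np.logspace(-3, 0, 10), delta=0.05)
```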
KL gradients for Gaussian–Gaussian — closed form vs autograd
For the §11 implementation in NumPy, every PAC-Bayes step needs the gradient of KL(Q_{μ,ρ} ∥ P) with respect to (μ, ρ). The closed-form expressions from §8.2 are short: with Q = N(μ, diag(e^{2ρ})) and P = N(w₀, σ_P² I),

∂KL/∂μᵢ = (μᵢ − w₀,ᵢ)/σ_P²,  ∂KL/∂ρᵢ = e^{2ρᵢ}/σ_P² − 1.
For an implementation in pure NumPy (as in §11), the closed form is the right choice: speed (a single vectorized computation), numerical stability (exact up to rounding), and code simplicity (5 lines versus the apparatus of a graph-building library).
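Those lines, as a sketch consistent with the expressions above (names ours):

```python
import numpy as np

def kl_gaussian(mu, rho, w0, sigma_P2):
    """KL( N(mu, diag(e^{2 rho})) || N(w0, sigma_P2 * I) ), summed over coordinates."""
    s2 = np.exp(2.0 * rho)
    return 0.5 * np.sum(s2 / sigma_P2 + (mu - w0) ** 2 / sigma_P2
                        - 1.0 + np.log(sigma_P2) - 2.0 * rho)

def kl_gaussian_grads(mu, rho, w0, sigma_P2):
    """Closed-form per-coordinate gradients matching the expressions above."""
    return (mu - w0) / sigma_P2, np.exp(2.0 * rho) / sigma_P2 - 1.0
```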
The empirical-risk gradient does require backpropagation through the network — there is no closed form for ∇_w ℓ(w) in general. For the §11 single-hidden-layer MLP, manual backprop is roughly 15 lines of NumPy and substantially faster than spinning up a PyTorch tape; a sketch follows.
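A sketch of that manual backprop for a one-hidden-layer ReLU MLP with sigmoid/BCE output (the dict layout and names are ours, not the notebook’s):

```python
import numpy as np

def mlp_forward_backward(w, X, y):
    """BCE loss and gradients for a 1-hidden-layer ReLU MLP with sigmoid output.

    w = {"W1": (d, h), "b1": (h,), "W2": (h,), "b2": scalar}; y in {0, 1}.
    """
    Z1 = X @ w["W1"] + w["b1"]              # (n, h) pre-activations
    A1 = np.maximum(Z1, 0.0)                # ReLU
    z2 = A1 @ w["W2"] + w["b2"]             # (n,) logits
    p = 1.0 / (1.0 + np.exp(-z2))           # sigmoid probabilities
    tiny = 1e-12
    loss = -np.mean(y * np.log(p + tiny) + (1 - y) * np.log(1 - p + tiny))
    # backward pass: BCE composed with sigmoid gives dL/dz2 = (p - y) / n
    n = X.shape[0]
    dz2 = (p - y) / n
    dA1 = np.outer(dz2, w["W2"])            # (n, h)
    dZ1 = dA1 * (Z1 > 0)                    # ReLU gate
    grads = {"W1": X.T @ dZ1, "b1": dZ1.sum(axis=0),
             "W2": A1.T @ dz2, "b2": dz2.sum()}
    return loss, grads
```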
Per-coordinate versus single scalar. The closed-form KL gradient is per-coordinate, so optimizing a per-coordinate variance σᵢ² costs no more asymptotically than a single scalar σ². The Dziugaite–Roy paper uses per-coordinate variances; they tend to land at tighter bounds because the optimal σᵢ varies dramatically across coordinates. §11.5 uses per-coordinate variances as well.
Connections and limits
What the topic delivered
The fourteen sections above developed PAC-Bayes from first principles. The framework’s distinctive features are worth restating concisely.
PAC-Bayes is distribution-free: bounds hold for every data distribution D, assuming only a bounded loss. Finite-sample: non-asymptotic, with explicit constants. Posterior-averaged: it bounds R(Q), not R(h). This is the structural feature that makes PAC-Bayes survive overparameterization. Uniform over Q: once the 1 − δ event over S occurs, the bound holds for every posterior Q simultaneously — including any Q we construct from S after seeing it.
The three canonical bounds form a hierarchy. McAllester (§4) is the most legible. Seeger (§5) is the tightest in the realizable regime; the price is numerical inversion. Catoni (§6) is linear in the posterior’s empirical risk and KL, parametrized by a temperature λ, and its RHS minimizer is the Gibbs distribution that bridges to Bayesian inference (§9). For applied work, the §7.4 table summarizes when each dominates.
What PAC-Bayes can’t do
De-randomization. PAC-Bayes certifies R(Q), not R(h) for a deterministic predictor h. The standard workaround: report a guarantee on the randomized predictor. Biggs & Guedj (AISTATS 2022) and Viallard et al. (Mach. Learn. 2024) develop derandomization techniques that extract a deterministic predictor from Q with similar generalization properties, but the conversion is loose, and de-randomization remains an active research area.
Unbounded losses. The boundedness assumption on the loss is load-bearing for the moment lemmas. Extensions exist — Alquier, Ridgway & Chopin (JMLR 2016) develop PAC-Bayes for sub-Gaussian losses via variational approximations of Gibbs posteriors; Mhammedi, Grünwald & Guedj (NeurIPS 2019) refine the bounded case with the “Un-Expected Bernstein” inequality; Holland (NeurIPS 2019) handles heavy-tailed losses via robust risk estimators — but each requires extra moment assumptions, and the bound’s distribution-freeness is partially lost.
Non-iid data. The bounds assume i.i.d. samples. Extensions exist (Seldin, Laviolette, Cesa-Bianchi, Shawe-Taylor & Auer 2012 for martingale-style PAC-Bayes; Audibert & Bousquet 2007 for chaining-based bounds), but they require explicit dependence structure.
Regression vs. classification. PAC-Bayes is most cleanly developed for classification with bounded loss. Regression with unbounded loss requires the extensions above; even with bounded regression loss, the kl-form bound’s Bernoulli-MGF basis doesn’t translate directly.
The gap to test error. Even the tightest current PAC-Bayes bound on MNIST-binary (Pérez-Ortiz et al. 2021: 0.16) is an order of magnitude looser than test error (≈ 0.02). Closing this gap is the central open problem. Two sources of looseness drive it. The bounds are worst-case over data distributions compatible with S; for the actual benign D, they are loose by a factor that depends on D but is not computable from S alone. And the framework certifies stochastic predictors, whose risk is worse than the underlying deterministic predictor’s; even a perfect bound on R(Q) would not match R(w_SGD).
Alternative routes to generalization theory
PAC-Bayes is one of several frameworks. Each has its scope.
Norm-based bounds (Bartlett 1998; Neyshabur, Tomioka, and Srebro 2015; Bartlett, Foster, and Telgarsky 2017) bound the gap in terms of weight-matrix norms. They capture the qualitative “small weights ⇒ good generalization” intuition cleanly but are typically loose in practice.
Compression-based bounds (§12.2). Tight for networks with low effective rank or structured sparsity; less effective for dense networks.
Information-theoretic bounds (§12.3). PAC-Bayes-adjacent but framed in terms of algorithm-level rather than predictor-level analysis.
Algorithmic stability (Bousquet and Elisseeff 2002; Hardt, Recht, and Singer 2016 for SGD). Bounds the gap in terms of the sensitivity of the algorithm’s output to a single training example. Tighter than PAC-Bayes for specific algorithms; less general.
VC-style uniform convergence (Vapnik 1998). The classical framework. Vacuous for over-parameterized neural networks for the reasons §11.1 reviewed.
No single framework dominates across all regimes. The right choice depends on what is known about the problem: VC for simple parametric classes, norm-based for explicitly weight-regularized models, compression-based for sparse or low-rank networks, PAC-Bayes for stochastic predictors with explicit posterior structure, information-theoretic for algorithm-level analysis.
Forward-pointers in the curriculum
Three formalML topics use PAC-Bayes machinery as a substantial component when shipped.
Variational inference shares the change-of-measure machinery developed in §3 and the Gaussian–Gaussian KL apparatus of §8.2. The variational lower bound (ELBO) is structurally the Donsker–Varadhan dual of a free energy, and the Gibbs distribution of §6.2 is what VI produces in the limit of perfect approximation. The §9 Bayesian–PAC-Bayes correspondence is the formal bridge.
Bayesian neural networks (coming soon) uses Gaussian posteriors Q on parameter space identically to §8 and §11. The PAC-Bayes bound on a Bayesian neural network is one of the two primary ways to give a generalization guarantee for such a model.
Meta-learning (coming soon) uses the hierarchical PAC-Bayes structure of §12.1 — outer bound at the task level, inner bound within each task — as the canonical framework for theoretical analysis.
Two further connections. Density ratio estimation (coming soon): the Radon–Nikodym derivative dQ/dP that appears in the change-of-measure inequality is the same object that density-ratio estimation methods (KLIEP, LSIF, neural ratio estimation) compute. And stochastic gradient MCMC (coming soon): Welling and Teh’s (2011) SGLD samples from the Gibbs distribution at temperature λ = n under the log-likelihood loss, recovering the Bayesian posterior.
The topic concludes here, but the techniques carry forward. PAC-Bayes is the right framework for thinking about generalization of stochastic predictors, the right bridge from frequentist learning theory to Bayesian inference, and — with the ongoing tightening of constants and prior-engineering techniques — currently the only framework that produces non-vacuous bounds on the over-parameterized neural networks of modern practice.
Connections
- Direct predecessor. §14.1 of that topic states McAllester's bound as a forward-pointer and §12.5 closes with the vacuousness reckoning — this topic develops the full PAC-Bayes machinery that closes that gap, reproducing the Dziugaite–Roy non-vacuous deep-network bound in §11 and the bound comparison on the shared threshold-class running example in §10. generalization-bounds
- KL divergence is the load-bearing penalty in every PAC-Bayes bound: the change-of-measure inequality of §3 routes E_Q expectations through E_P expectations at a cost of KL(Q ∥ P), and the Donsker–Varadhan variational form (§3.3) is exactly the convex-conjugate relation that kl-divergence develops in its variational characterization section. kl-divergence
- Hoeffding's MGF bound and the Bernoulli-MGF lemma (Maurer 2004) are the moment-control tools that turn each fixed-hypothesis statement into a high-probability-over-S bound via Markov's inequality. The §4 McAllester proof, §6 Catoni proof, and §7.2 Tolstikhin–Seldin empirical-Bernstein refinement all rest on the concentration toolkit. concentration-inequalities
- PAC-Bayes is PAC learning where the learner outputs a distribution Q over hypotheses rather than a point hypothesis, and the analysis charges KL(Q ∥ P) instead of class complexity. The frequentist (1 − δ)-over-S guarantee is the same PAC guarantee — only the complexity-charging mechanism changes. pac-learning
- The Gibbs distribution Q*_λ ∝ P · exp(−λ R̂_S) of §6.2 is exactly what variational inference produces in the limit of perfect approximation (the family of all Q ≪ P). The ELBO is structurally the Donsker–Varadhan dual of a free energy, and the §9 Bayesian–PAC-Bayes correspondence at λ = n is the formal bridge between VI's optimization target and a frequentist generalization guarantee. variational-inference
References & Further Reading
- paper Some PAC-Bayesian Theorems — McAllester (1999) The original PAC-Bayes paper introducing the square-root bound (Machine Learning).
- paper PAC-Bayesian Stochastic Model Selection — McAllester (2003) Refined constants and stochastic-model-selection framing (Machine Learning).
- paper PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification — Seeger (2002) Original kl-form (binary-KL) bound; tighter than McAllester near the realizable regime (JMLR).
- paper A Note on the PAC Bayesian Theorem — Maurer (2004) Tightened moment lemma giving the modern 2√n constant; source of §4.2 Lemma 1 (arXiv).
- book PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning — Catoni (2007) The linear-form bound and the Gibbs-distribution-as-RHS-minimizer view (IMS Lecture Notes 56).
- paper (Not) Bounding the True Error — Langford & Caruana (2001) The δ-union-over-discretized-prior recipe used in §8.3 (NeurIPS).
- paper Asymptotic Evaluation of Certain Markov Process Expectations for Large Time, I — Donsker & Varadhan (1975) The variational form of KL that drives every PAC-Bayes proof (Comm. Pure Appl. Math.).
- book Elements of Information Theory — Cover & Thomas (2006) Theorem 2.7.2 (joint convexity of KL) used in the §5 Seeger proof (Wiley, 2nd ed.).
- paper PAC-Bayesian Learning of Linear Classifiers — Germain, Lacasse, Laviolette & Marchand (2009) Modern unified PAC-Bayes proof template used as basis for §6's Catoni derivation (ICML).
- paper PAC-Bayes-Empirical-Bernstein Inequality — Tolstikhin & Seldin (2013) Variance-adaptive PAC-Bayes via empirical Bernstein; §7.2 (NeurIPS).
- paper A Strongly Quasiconvex PAC-Bayesian Bound — Thiemann, Igel, Wintenberger & Seldin (2017) Quasiconvex form enabling joint (Q, β) gradient optimization; §7.3 (ALT).
- paper How Tight Can PAC-Bayes Be in the Small Data Regime? — Foong, Bruinsma, Burt & Turner (2021) Small-n analysis showing Seeger is essentially tight in the realizable regime; §7.4 (NeurIPS).
- paper On the Properties of Variational Approximations of Gibbs Posteriors — Alquier, Ridgway & Chopin (2016) PAC-Bayes for sub-Gaussian losses via variational approximations of Gibbs posteriors (JMLR).
- paper Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data — Dziugaite & Roy (2017) The first non-vacuous deep-network PAC-Bayes bound; reproduction in §11 (UAI).
- paper Tighter Risk Certificates for Neural Networks — Pérez-Ortiz, Rivasplata, Shawe-Taylor & Szepesvári (2021) State-of-the-art applied PAC-Bayes on MNIST-binary (0.16 cert); §7.3 & §12.4 (JMLR).
- paper A PAC-Bayesian Bound for Lifelong Learning — Pentina & Lampert (2014) Hierarchical PAC-Bayes for sequential meta-learning; §12.1 (ICML).
- paper Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory — Amit & Meir (2018) PAC-Bayes guarantees for MAML-style meta-learning; §12.1 (ICML).
- paper Stronger Generalization Bounds for Deep Nets via a Compression Approach — Arora, Ge, Neyshabur & Zhang (2018) Compression-based PAC-Bayes for deep networks; §12.2 (ICML).
- paper Non-vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach — Zhou, Veitch, Austern, Adams & Orbanz (2019) Compression PAC-Bayes scaled to ImageNet-class models; §12.2 (ICLR).
- paper Information-Theoretic Analysis of Generalization Capability of Learning Algorithms — Xu & Raginsky (2017) Mutual-information generalization bound; PAC-Bayes-adjacent algorithm-level framework (NeurIPS).
- paper Reasoning About Generalization via Conditional Mutual Information — Steinke & Zakynthinou (2020) Conditional MI refinement of Xu–Raginsky; §12.3 (COLT).
- paper Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates — Negrea, Haghifam, Dziugaite, Khisti & Roy (2019) Improved MI bounds for SGLD via data-dependent priors (NeurIPS).
- paper Sharpened Generalization Bounds based on Conditional Mutual Information and an Application to Noisy, Iterative Algorithms — Haghifam, Negrea, Khisti, Roy & Dziugaite (2020) Conditional MI refinement applied to Langevin dynamics (NeurIPS).
- paper How Good is the Bayes Posterior in Deep Neural Networks Really? — Wenzel, Roth, Veeling, Świątkowski, Tran, Mandt, Snoek, Salimans, Jenatton & Nowozin (2020) Empirical demonstration of the cold-posterior effect; §9.3 (ICML).
- paper A statistical theory of cold posteriors in deep neural networks — Aitchison (2021) Data-curation explanation for the cold-posterior effect (ICLR).
- paper What Are Bayesian Neural Network Posteriors Really Like? — Izmailov, Vikram, Hoffman & Wilson (2021) Full-batch HMC study attributing cold-posterior effect to data augmentation; §9.3 (ICML).
- paper On Margins and Derandomisation in PAC-Bayes — Biggs & Guedj (2022) Margin-based derandomisation of PAC-Bayes posteriors; §12.4 (AISTATS).
- paper A General Framework for the Practical Disintegration of PAC-Bayesian Bounds — Viallard, Germain, Habrard & Morvant (2024) Disintegrated bounds — single-hypothesis guarantees from PAC-Bayes; §12.4 / §14.2 (Mach. Learn.).
- paper PAC-Bayes Un-Expected Bernstein Inequality — Mhammedi, Grünwald & Guedj (2019) Bernstein-like inequality with X² outside the expectation; bounded-loss extension (NeurIPS).
- paper PAC-Bayes under potentially heavy tails — Holland (2019) PAC-Bayes for heavy-tailed losses via robust risk estimators; §14.2 (NeurIPS).
- paper PAC-Bayesian Inequalities for Martingales — Seldin, Laviolette, Cesa-Bianchi, Shawe-Taylor & Auer (2012) Martingale-style PAC-Bayes for non-iid data; §14.2 (IEEE TIT).
- paper Combining PAC-Bayesian and Generic Chaining Bounds — Audibert & Bousquet (2007) Chaining-based PAC-Bayes extension (JMLR).
- paper Understanding Deep Learning Requires Rethinking Generalization — Zhang, Bengio, Hardt, Recht & Vinyals (2017) The empirical observation driving §1.2's vacuousness puzzle (ICLR).
- paper Bayesian Learning via Stochastic Gradient Langevin Dynamics — Welling & Teh (2011) SGLD samples Q*_n at log-likelihood loss; §14.4 forward-pointer (ICML).
- paper Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks — Finn, Abbeel & Levine (2017) MAML — the meta-learning algorithm Amit–Meir's PAC-Bayes framework analyzes (ICML).
- book Statistical Learning Theory — Vapnik (1998) VC framework reference; §14.3 alternative-routes comparison (Wiley).
- paper Stability and Generalization — Bousquet & Elisseeff (2002) Algorithmic stability framework; §14.3 alternative-routes comparison (JMLR).
- paper Train Faster, Generalize Better: Stability of Stochastic Gradient Descent — Hardt, Recht & Singer (2016) SGD-based stability for non-convex losses; §14.3 (ICML).
- paper The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights Is More Important than the Size of the Network — Bartlett (1998) Original norm-based deep-net generalization bound; §14.3 (IEEE TIT).
- paper Norm-Based Capacity Control in Neural Networks — Neyshabur, Tomioka & Srebro (2015) Modern norm-based capacity control; §14.3 (COLT).
- paper Spectrally-Normalized Margin Bounds for Neural Networks — Bartlett, Foster & Telgarsky (2017) Spectral-norm-based margin bound; §14.3 (NeurIPS).
- paper A Primer on PAC-Bayesian Learning — Guedj (2019) Modern survey of the PAC-Bayes literature (arXiv preprint).
- paper User-Friendly Introduction to PAC-Bayes Bounds — Alquier (2021) Pedagogical survey covering the bound zoo and proof templates (arXiv preprint).