
Prediction Intervals

Constructing finite-sample-valid prediction intervals under exchangeability, iid, or location-shift symmetry — split conformal, pure QR, CQR, and Hodges-Lehmann test-inversion compared on coverage, width, conditional behavior, and cost

The Prediction-Interval Problem

This section sets the scaffolding the rest of the topic hangs on: the distinction between confidence intervals and prediction intervals, the marginal-vs-conditional coverage spectrum, and the strict assumption hierarchy under which the three featured constructions work. The mathematics here is light — definitions and named distinctions, no theorems with proofs. The work begins in §2.

Confidence intervals vs. prediction intervals. A confidence interval covers a fixed but unknown parameter. A prediction interval covers a random variable — the next observation, before it has happened. The two quantities live in different probability spaces, and constructions that work for one don’t generally transfer.

To make the distinction concrete, suppose we observe pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ and fit a linear model $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$. At a new point $X_{n+1} = x_*$ we can ask two distinct questions: where does the conditional mean $\mathbb{E}[Y \mid X = x_*]$ lie (a CI question, with width shrinking at rate $1/\sqrt{n}$), or where does $Y_{n+1}$ itself lie (a PI question, bundling estimation uncertainty and the irreducible noise of $Y_{n+1}$ around its mean). Even with infinite data the CI shrinks to a point, but the PI plateaus at its irreducible-noise width. That structural gap is the whole point of distinguishing the two.
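The CI/PI gap can be checked numerically with the classical homoscedastic linear-regression formulas. This is a minimal sketch (not the topic's notebook code); the sample size, slope, and evaluation point are illustrative choices.

```python
# CI vs PI half-widths for simple linear regression at a point x*,
# using the classical homoscedastic formulas. Illustrative values only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, n)          # true noise sd = 1

# OLS fit
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma_hat = np.sqrt(resid @ resid / (n - 2))

x_star = 1.0
# leverage term at x*
h = 1 / n + (x_star - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()
t = stats.t.ppf(0.95, df=n - 2)                     # 90% two-sided

ci_half = t * sigma_hat * np.sqrt(h)                # CI for E[Y | X = x*]
pi_half = t * sigma_hat * np.sqrt(1 + h)            # PI for Y_{n+1}

print(ci_half, pi_half)
```

The leverage term $h$ vanishes as $n$ grows, so `ci_half` shrinks to zero while `pi_half` plateaus near $t\,\hat{\sigma}$ — exactly the Figure 1 picture.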

Definition 1 (Prediction Interval).

A function $\hat{C} \colon \mathcal{X} \to 2^{\mathcal{Y}}$ from the feature space to subsets of the response space is a prediction interval at level $1 - \alpha$ if it satisfies

$$\mathbb{P}\!\left(Y_{n+1} \in \hat{C}(X_{n+1})\right) \;\ge\; 1 - \alpha,$$

where $\alpha \in (0, 1)$ is the miscoverage level and the probability is taken jointly over the training data $(X_i, Y_i)_{i=1}^n$ and the test pair $(X_{n+1}, Y_{n+1})$.

The randomness in this probability matters. It runs over the training set, the test feature, and the test response — not over a fixed test point with frozen training data. Different ways of decomposing this randomness give different coverage notions, which is the next subsection.

Confidence vs. prediction intervals on a fitted homoscedastic linear regression. The narrow blue confidence band for E[Y|X] sits inside the wider red prediction band for Y_new.
Figure 1. CI vs PI on a fitted homoscedastic linear regression. The narrow blue band is the 90% confidence interval for $\mathbb{E}[Y \mid X]$; the wider red band is the 90% prediction interval for $Y_{\mathrm{new}}$. As $n$ grows, the CI shrinks to a point but the PI plateaus near $2 t \hat{\sigma}$.

Marginal vs. conditional coverage. The probability statement in Definition 1 averages over both the training set and the test feature $X_{n+1}$. We can ask for a stronger guarantee that holds pointwise in the test feature.

Definition 2 (Marginal vs. conditional coverage).

Marginal coverage at level $1 - \alpha$:

$$\mathbb{P}\!\left(Y_{n+1} \in \hat{C}(X_{n+1})\right) \;\ge\; 1 - \alpha.$$

Conditional coverage at level $1 - \alpha$: for $\mathbb{P}_X$-almost every $x \in \mathcal{X}$,

$$\mathbb{P}\!\left(Y_{n+1} \in \hat{C}(X_{n+1}) \,\big|\, X_{n+1} = x\right) \;\ge\; 1 - \alpha.$$

Conditional coverage is much stronger. Marginal coverage allows the interval to over-cover at some $x$ and under-cover at others, as long as the average over the marginal of $X$ comes out right. Conditional coverage demands the guarantee point-by-point.

To see the gap, consider the running heteroscedastic example we’ll use throughout the topic:

$$Y \mid X = x \sim \mathcal{N}\!\left(\sin(x),\, \sigma(x)^2\right), \qquad \sigma(x) = 0.2 + 0.6\,|x|/3, \qquad X \sim \mathrm{Uniform}(-3, 3).$$

The conditional standard deviation is small near $x = 0$ and large near $x = \pm 3$. A constant-width band $\hat{C}(x) = [\hat{\mu}(x) - w, \hat{\mu}(x) + w]$ with $w$ tuned to give 90% marginal coverage will over-cover near the centre (the band is far wider than the noise needs) and under-cover near the edges (the band is too narrow to capture 90% of the conditional mass). The conditional-coverage curve $x \mapsto \mathbb{P}(Y \in \hat{C}(x) \mid X = x)$ is wildly miscalibrated even though its average is exactly $0.9$.

Marginal coverage holds at 0.889 on a constant-width band, but the binned conditional coverage histogram swings from over 99% near x=0 to roughly 73% in the high-noise tails.
Figure 2. Marginal coverage holds; conditional coverage is wildly miscalibrated. Left: scatter with the constant-width 90%-marginal band coloured by inside/outside on the heteroscedastic running example. Right: binned conditional coverage by $X$-bin, ranging from $\approx 1.00$ near $x = 0$ to $\approx 0.73$ near $x = \pm 3$. Range of conditional miscalibration: $0.27$.

This is the gap that §3 (pure QR) and §5’s CQR bridge close — and that the marginal-only conformal guarantee in §2 leaves open by design. (There is an intermediate notion, group-conditional coverage, where the guarantee holds for each member of a finite partition of the feature space — useful in fairness contexts where the partition is a protected attribute. We treat that as a forward connection in §7.)

The data-distributional assumption hierarchy. Each of the three constructions in this topic works under a different assumption on the joint distribution of $(X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})$. The assumptions are strictly nested, with conformal at the weak end and Hodges–Lehmann at the strong end.

Definition 3 (Exchangeability).

The sequence $(X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})$ is exchangeable if for every permutation $\pi$ of $\{1, \ldots, n+1\}$,

$$\big((X_{\pi(1)}, Y_{\pi(1)}), \ldots, (X_{\pi(n+1)}, Y_{\pi(n+1)})\big) \;\stackrel{d}{=}\; \big((X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})\big),$$

where $\stackrel{d}{=}$ denotes equality in joint distribution.

Exchangeability is the assumption underlying split conformal prediction; see Conformal Prediction. It is strictly weaker than iid: a uniformly random permutation of a fixed multiset is exchangeable but not iid (the coordinates are identically distributed but not independent). Practically, exchangeability fails under temporal ordering, distribution shift, or hierarchical sampling, but holds whenever the order of observations is irrelevant.

Definition 4 (iid).

The pairs $(X_i, Y_i)$ are independent and identically distributed if they are mutually independent and share a common joint distribution $\mathbb{P}_{XY}$.

Pure quantile-regression intervals built from Koenker–Bassett asymptotics require iid plus smoothness conditions on the conditional density of $Y \mid X$; see Quantile Regression.

Definition 5 (iid with symmetric residuals).

The pairs $(X_i, Y_i)$ are iid and there exists a function $\mu \colon \mathcal{X} \to \mathbb{R}$ such that the residuals $\varepsilon_i := Y_i - \mu(X_i)$ are independent of $X_i$ and have a distribution symmetric around zero.

The symmetry assumption is what test-inversion-style intervals (§4) require to deliver finite-sample distribution-free conditional coverage.

The three classes are genuinely strictly nested:

$$\big\{ \text{iid} + \text{symmetric residuals} \big\} \;\subsetneq\; \big\{ \text{iid} \big\} \;\subsetneq\; \big\{ \text{exchangeable} \big\}.$$

In English: every distribution where HL works also makes pure QR and conformal work; every iid distribution makes conformal work; not every exchangeable distribution is iid (the random-permutation-of-a-multiset example), and not every iid distribution has symmetric residuals (any skewed-noise regression).
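The random-permutation-of-a-multiset example can be verified in a few lines. A sketch (the multiset $\{0, 0, 1, 1\}$ and the draw count are illustrative):

```python
# Exchangeable but not iid: a uniformly random permutation of a fixed
# multiset. Coordinates are identically distributed but negatively
# correlated (two 1s must land somewhere), so they cannot be independent.
import numpy as np

rng = np.random.default_rng(2)
base = np.array([0, 0, 1, 1])
draws = np.array([rng.permutation(base) for _ in range(50_000)])

z1, z2 = draws[:, 0], draws[:, 1]
print(z1.mean(), z2.mean())        # both ~0.5: identical marginals
print(np.cov(z1, z2)[0, 1])        # ~ -1/12: negative covariance, not iid
```

The exact covariance is $\mathbb{E}[Z_1 Z_2] - 1/4 = 1/6 - 1/4 = -1/12$, which the simulation reproduces.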

Schematic of the assumption hierarchy as nested rounded boxes: 'Exchangeable' (split conformal §2) contains 'iid' (pure QR §3), which in turn contains 'iid + symmetric residuals' (HL test-inversion §4).
Figure 3. The data-distributional assumption hierarchy. Each construction sits at a different shell. Stronger assumptions buy stronger guarantees (better coverage type, narrower intervals, or both); weaker assumptions buy robustness (guarantees that survive when the strong ones fail).

The trade is the standard one in statistics. Stronger assumptions buy the practitioner more — better coverage type (conditional rather than marginal), narrower intervals, or both. Weaker assumptions buy robustness — guarantees that survive when the strong assumptions fail. The next three sections walk through the constructions in order of weakest-assumption-first; §5 makes the trade quantitative through three bridge theorems.

The two-axis map. Combining the two distinctions (assumption strength × coverage type) gives the map that every later section refers back to:

| Construction | Section | Assumption | Coverage |
|---|---|---|---|
| Split conformal | §2 | Exchangeable | Marginal (finite-sample) |
| Pure QR | §3 | iid + smoothness | Asymptotic conditional |
| HL test-inversion | §4 | iid + symmetric residuals | Conditional (finite-sample) |
| CQR (bridge) | §3 → §5 | Exchangeable | Marginal, conditionally adaptive |

CQR is worth flagging now, even though we don’t define it until §5.1. It sits as a hybrid: it inherits the marginal guarantee of split conformal (because it is split conformal under a particular score function), but its interval shape tracks pure QR’s conditional-quantile estimates. The hybrid is the topic’s main practical recommendation, and §5 formalizes the gap between its rigorous marginal guarantee and its approximate conditional behavior.

§§2–4 cover one row each — short, citation-heavy treatments of each construction with the running example carried through. §5 proves three bridge theorems: a CQR-coverage decomposition, a heteroscedastic-width-comparison bound, and an asymptotic-equivalence result for HL and conformal on location-shift problems with symmetric noise. §6 measures the trade-offs empirically across four data scenarios. §7 closes with bootstrap as a contrast, what’s out of scope (Bayesian credible intervals, T5), and forward connections (online conformal, group-conditional coverage).


Drag σ_max from 0 (homoscedastic) to 1 (strongly heteroscedastic) with the constant-width band selected: the strip chart deforms from flat to U-shaped while the marginal stays near 0.9 — the gap Definition 2 anticipates. Switch to pure-QR or oracle to flatten the strip chart at heteroscedasticity. Switch to split-conformal to see the same constant-width pathology under the proper conformal threshold.

The explorer above lets you walk through the §§1–3 progression interactively: dial up σ_max with the constant-width band to reproduce Figure 2’s U-shape, then swap to pure QR or oracle to watch the strip chart go flat. We’ll come back to it twice — once at the end of §2 (split-conformal selected) to confirm that proper conformal calibration doesn’t fix the conditional gap when the score is wrong, and once at the end of §3 (pure-QR) to see the conditional-adaptivity win arrive at the cost of finite-sample marginal coverage.

Construction I — Exchangeability-Based (Split Conformal)

This is the weakest-assumption construction in the topic. It works under exchangeability alone — no smoothness, no symmetry, no parametric form for the noise. The price is that the resulting interval is constant-width (or any other shape baked into the score function) and only marginally calibrated; the conditional miscalibration of Figure 2 carries over essentially unchanged. The construction is the canonical version of split conformal prediction; we cite the marginal-coverage theorem from Conformal Prediction rather than reprove it.

Two lifting moves matter here. First, we introduce the score-function abstraction — every interval in this topic can be written as the level set of some score $s(x, y)$, with the threshold either calibrated empirically (§2 and the §5 CQR bridge), determined asymptotically (§3), or inverted from a test (§4). Second, we'll see that constant-width split-conformal on the heteroscedastic running example reproduces Figure 2's marginal-vs-conditional gap almost exactly — by design, because the score $s(x, y) = |y - \hat{\mu}(x)|$ encodes nothing about heteroscedasticity. §3 starts fixing that by changing the score.

The score-function abstraction.

Definition 6 (Nonconformity score).

A nonconformity score is a function $s \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, larger values indicating that the pair $(x, y)$ is more anomalous relative to the training data. Given a threshold $q \in \mathbb{R}$, the corresponding prediction set is

$$\hat{C}_{s, q}(x) \;=\; \{ y \in \mathcal{Y} : s(x, y) \le q \}.$$

This abstraction lets us classify the constructions in this topic by their choice of $(s, q)$:

| Construction | Score $s(x, y)$ | Threshold $q$ |
|---|---|---|
| Split conformal (§2) | $\lvert y - \hat{\mu}(x) \rvert$ | conformal $(1-\alpha)$-quantile of calibration scores |
| Pure QR (§3) | $\max\!\big(\hat{q}_{\alpha/2}(x) - y,\; y - \hat{q}_{1-\alpha/2}(x)\big)$ | $0$ (no calibration) |
| CQR (§5 bridge) | same as pure QR | conformal $(1-\alpha)$-quantile of calibration scores |
| HL test-inversion (§4) | recast as a Walsh-average score | inverted from a Wilcoxon test |

The split-conformal threshold gets a name we’ll use throughout:

Definition 7 (Conformal quantile).

For nonconformity scores $S_1, \ldots, S_{n_{\mathrm{cal}}}$ on a calibration set, the conformal $(1-\alpha)$-quantile is

$$\hat{q}_{1-\alpha} \;=\; S_{(\lceil (1-\alpha)(n_{\mathrm{cal}}+1) \rceil)},$$

the $\lceil (1-\alpha)(n_{\mathrm{cal}}+1) \rceil$-th order statistic of $\{S_i\}$. The $+1$ in the rank is the finite-sample correction that turns the threshold from "approximately right asymptotically" into "exactly right under exchangeability."
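Definition 7 is a one-liner in code. A sketch (the rank-clipping guard for very small calibration sets is a simplification — exact conformal would return the whole real line when the rank exceeds $n_{\mathrm{cal}}$):

```python
# Definition 7: conformal (1 - alpha)-quantile with the +1 correction.
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the ceil((1-alpha)(n+1))-th smallest calibration score."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    k = int(np.ceil((1 - alpha) * (n + 1)))   # 1-indexed rank
    k = min(k, n)  # guard: for tiny n the exact set is unbounded; we clip
    return s[k - 1]

# n = 9, alpha = 0.1: ceil(0.9 * 10) = 9, so the 9th of 9 sorted scores
print(conformal_quantile(range(1, 10), 0.1))   # -> 9
```

Note the contrast with the naive empirical quantile `np.quantile(s, 0.9)`, which lacks the $+1$ correction and only delivers approximate coverage.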

The pure-QR row is what makes the score-function frame earn its keep. As a score with threshold $0$, pure QR is just "predict in if the score is non-positive," which lines up exactly with the algebraic statement of Construction II in §3. The CQR bridge in §5 then becomes a one-line statement: keep the score, swap the threshold for the conformal quantile, and inherit Theorem 1.

Split conformal on the running example. Three steps:

  1. Train. Split the data into a training fold and a calibration fold. Fit a base predictor $\hat{\mu}$ on the training fold only.
  2. Calibrate. Compute $S_i = |Y_i - \hat{\mu}(X_i)|$ for each calibration point, and take $\hat{q}_{1-\alpha}$ per Definition 7.
  3. Predict. For a new point $x$, return $\hat{C}(x) = [\hat{\mu}(x) - \hat{q}_{1-\alpha},\; \hat{\mu}(x) + \hat{q}_{1-\alpha}]$.

The notebook carries this out on the heteroscedastic running example with $n_{\mathrm{train}} = n_{\mathrm{cal}} = 500$ and $\hat{\mu}$ a degree-3 polynomial fit by ridge regression. The choice of base predictor matters for the band's width but not for its coverage validity — the theorem below holds for any score.
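The three steps can be sketched end-to-end as follows. This is not the notebook's exact code — the ridge penalty is an assumed placeholder value — but the split/calibrate/predict structure is the one described above:

```python
# Split conformal on the heteroscedastic running example.
import numpy as np

rng = np.random.default_rng(3)
sigma = lambda x: 0.2 + 0.6 * np.abs(x) / 3
def draw(n):
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + sigma(x) * rng.normal(size=n)

def poly(x, d=3):
    return np.column_stack([x**k for k in range(d + 1)])

alpha = 0.1

# 1. Train: degree-3 polynomial ridge fit (lam is an assumed value)
x_tr, y_tr = draw(500)
lam = 1e-3
A = poly(x_tr)
beta = np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ y_tr)
mu = lambda x: poly(x) @ beta

# 2. Calibrate: conformal quantile of absolute residuals (Definition 7)
x_cal, y_cal = draw(500)
s = np.sort(np.abs(y_cal - mu(x_cal)))
k = int(np.ceil((1 - alpha) * (len(s) + 1)))
q = s[k - 1]

# 3. Predict: constant half-width band, then check marginal coverage
x_te, y_te = draw(20_000)
covered = np.abs(y_te - mu(x_te)) <= q
print("marginal coverage:", covered.mean())   # close to 0.90
```

Any base predictor can replace the ridge fit without touching steps 2–3; only the band's width changes.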

Theorem 1 (Split-conformal marginal coverage (Vovk, Gammerman & Shafer 2005; Lei et al. 2018)).

If the calibration data $(X_i, Y_i)_{i=1}^{n_{\mathrm{cal}}}$ and the test point $(X_{n_{\mathrm{cal}}+1}, Y_{n_{\mathrm{cal}}+1})$ are exchangeable, and the nonconformity score $s$ does not depend on the calibration or test data, then for any $\alpha \in (0, 1)$ the split-conformal prediction set satisfies

$$1 - \alpha \;\le\; \mathbb{P}\!\big( Y_{n_{\mathrm{cal}}+1} \in \hat{C}(X_{n_{\mathrm{cal}}+1}) \big) \;\le\; 1 - \alpha + \frac{1}{n_{\mathrm{cal}}+1}.$$

Proved as Theorem 1 of Conformal Prediction §3 via a rank-symmetry argument: under exchangeability, the rank of the test score among the calibration scores is uniform on $\{1, \ldots, n_{\mathrm{cal}}+1\}$, and the threshold definition translates that uniform rank into the coverage statement. The $1/(n_{\mathrm{cal}}+1)$ over-coverage on the right is the finite-sample artifact of the $+1$ correction in Definition 7; it vanishes as $n_{\mathrm{cal}} \to \infty$.

We use this theorem in §5 (bridge theorems) without reproof. Two ingredients we’ll lean on: (i) the coverage statement itself, which lower-bounds the marginal coverage of any score-function-based interval that uses the conformal threshold — including the CQR bridge; and (ii) the rank-symmetry argument, which we’ll need to recombine with QR’s pointwise-approximation bounds to prove the CQR-coverage decomposition (Theorem 5.1).

Split-conformal band on the heteroscedastic running example. Left: scatter with band overlaid, points colored by inside/outside, marginal coverage 0.921. Right: histogram of calibration scores |Y - mu_hat(X)| with the threshold q-hat marked at 0.970.
Figure 4. Split-conformal band on the heteroscedastic running example with $\alpha = 0.1$, $n_{\mathrm{train}} = n_{\mathrm{cal}} = 500$. The conformal threshold $\hat{q}_{1-\alpha} = 0.97$ defines a constant half-width band around the polynomial-ridge fit $\hat{\mu}$. Empirical marginal coverage on the test set is $0.921$; the Theorem 1 envelope $[0.900, 0.902]$ bounds the true coverage probability, and the empirical figure deviates from it only by test-set sampling noise.

What this gets, what it misses. Running the construction with $\alpha = 0.1$ delivers empirical marginal coverage close to $0.9$, exactly as Theorem 1 promises. But the conditional-coverage curve reproduces Figure 2 almost identically: above $99\%$ near $x = 0$, dropping below $80\%$ near $x = \pm 3$. The reason is mechanical — the score $s(x, y) = |y - \hat{\mu}(x)|$ has no $x$-dependence in its calibration distribution, so the threshold $\hat{q}_{1-\alpha}$ is a single number, and the resulting band is constant-width regardless of $\hat{\mu}$.

Conditional coverage by 10 equal-width bins of X for the split-conformal band, designed visually parallel to Figure 2: same axes, same reference line at 0.9, same color scheme.
Figure 5. Split-conformal conditional coverage by bin reproduces Figure 2's gap. Range of conditional coverage: from $\approx 1.00$ near $x = 0$ to $\approx 0.80$ near $x = \pm 3$. Marginal coverage hits the target ($0.92$) but the conditional gap is structurally unchanged from the constant-width band — the score ignores heteroscedasticity, so the threshold is a single number, and the band is constant-width regardless of the base predictor.

Two ways forward: (a) change the score to encode heteroscedasticity (CQR, §5 bridge, with pure QR in §3 as the unconformalised version); (b) strengthen the assumptions to recover finite-sample conditional coverage by symmetry arguments (HL test-inversion, §4). §5 makes the comparison quantitative.

Construction II — Conditional-Quantile (Pure QR Intervals)

The §2 split-conformal construction gets exact finite-sample marginal coverage but a constant-width band. This section does the opposite trade: it gives up the finite-sample guarantee in exchange for a band whose shape tracks the conditional spread of $Y$ given $X$. The construction is the asymptotic prediction interval obtained by fitting two quantile-regression models at levels $\alpha/2$ and $1 - \alpha/2$, and using their fitted curves as the lower and upper endpoints of the interval — no calibration step. We call it pure QR to distinguish it from CQR (the §5 bridge), which keeps the QR shape but replaces the asymptotic justification with a finite-sample rank-symmetry argument.

The whole construction is a citation of Quantile Regression for the population-level conditional-quantile fact, plus an asymptotic-coverage statement that follows from QR's Koenker–Knight asymptotic normality. We don't reprove either result. The interesting mathematical content is the diagnosis of why the resulting interval is conditionally adaptive but only asymptotically valid — and the bookkeeping required to express it as an $(s, q)$ pair in the §2.1 score-function frame, which §5 then reuses verbatim.

The construction.

Definition 8 (Pure QR prediction interval).

Let $\hat{q}_\tau(x)$ denote a fitted estimator of the conditional $\tau$-quantile of $Y$ given $X = x$, in any function class (linear-in-features, kernel, neural). For miscoverage $\alpha \in (0, 1)$, the pure QR prediction interval at $x$ is

$$\hat{C}^{\mathrm{QR}}_\alpha(x) \;=\; \big[\, \hat{q}_{\alpha/2}(x),\; \hat{q}_{1-\alpha/2}(x) \,\big].$$

Translating into the score-function frame from §2.1: pure QR uses

$$s(x, y) \;=\; \max\!\big(\hat{q}_{\alpha/2}(x) - y,\; y - \hat{q}_{1-\alpha/2}(x)\big), \qquad q \;=\; 0.$$

The score is positive when $y$ falls outside the QR interval, and negative when inside; the threshold $0$ corresponds to "no calibration" in the score-function language. The §5 CQR bridge will keep the score and replace the threshold $0$ with a conformal $(1-\alpha)$-quantile of calibration scores per Definition 7.
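Definition 8 can be sketched numerically. For a self-contained demo we substitute a binned (piecewise-constant) conditional-quantile estimator for a fitted quantile-regression model — an assumed simplification, not the notebook's polynomial QR — since the interval construction $[\hat{q}_{\alpha/2}(x), \hat{q}_{1-\alpha/2}(x)]$ is identical either way:

```python
# Pure QR band on the running example, with binned conditional quantiles
# standing in for fitted quantile regressors. Threshold is 0: no calibration.
import numpy as np

rng = np.random.default_rng(4)
sigma = lambda x: 0.2 + 0.6 * np.abs(x) / 3
def draw(n):
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + sigma(x) * rng.normal(size=n)

alpha = 0.1
x_tr, y_tr = draw(20_000)
edges = np.linspace(-3, 3, 13)                 # 12 bins of width 0.5
lo = np.zeros(12); hi = np.zeros(12)
for b in range(12):
    yb = y_tr[(x_tr >= edges[b]) & (x_tr < edges[b + 1])]
    lo[b] = np.quantile(yb, alpha / 2)         # hat{q}_{alpha/2} per bin
    hi[b] = np.quantile(yb, 1 - alpha / 2)     # hat{q}_{1-alpha/2} per bin

def band(x):
    b = np.clip(np.digitize(x, edges) - 1, 0, 11)
    return lo[b], hi[b]

x_te, y_te = draw(20_000)
l, h = band(x_te)
inside = (y_te >= l) & (y_te <= h)
print("marginal:", inside.mean())
print("centre  :", inside[np.abs(x_te) < 0.5].mean())   # ~0.90, unlike Fig. 2
print("tails   :", inside[np.abs(x_te) > 2.5].mean())   # ~0.90 as well
print("width ratio tails/centre:",
      (h - l)[np.abs(x_te) > 2.5].mean() / (h - l)[np.abs(x_te) < 0.5].mean())
```

With a large training fold the conditional-coverage profile is roughly flat and the band is visibly heteroscedastic (the tail bins are several times wider than the centre bins) — the two wins §3 claims. Shrink the training fold and the marginal-coverage shortfall described below appears.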

Why the construction makes sense (population fact). If we knew the true conditional quantile functions $q_{\alpha/2}^*$ and $q_{1-\alpha/2}^*$, the resulting interval would have exact conditional coverage by construction: for every $x$,

$$\mathbb{P}\!\big( Y_{n+1} \in [q_{\alpha/2}^*(X_{n+1}), q_{1-\alpha/2}^*(X_{n+1})] \,\big|\, X_{n+1} = x \big) \;=\; 1 - \alpha.$$

This is just the definition of conditional quantiles — the probability mass of $Y$ given $X = x$ between its $\alpha/2$ and $1 - \alpha/2$ quantiles is, by definition, $1 - \alpha$. There is nothing to prove here that isn't already in Quantile Regression.

The construction is conditionally calibrated at the population level. In English: if we had an oracle for the true conditional quantiles, this is the thing we would build, and it would be conditionally valid pointwise. The asymptotic theory below says that consistent estimators inherit this property in the limit; the gap to finite samples is what §5’s bridge quantifies and what motivates CQR.

Theorem 2 (Pure QR asymptotic conditional coverage (Koenker–Knight)).

Suppose the conditional density $f_{Y \mid X}(\cdot \mid x)$ is positive and continuous in a neighbourhood of $q_\tau^*(x)$ for $\tau \in \{\alpha/2, 1 - \alpha/2\}$, that the QR estimators $\hat{q}_{\alpha/2}$ and $\hat{q}_{1-\alpha/2}$ are uniformly consistent on the support of $X$, and that the function class is rich enough to contain $q_\tau^*$. Then for $\mathbb{P}_X$-almost every $x$,

$$\mathbb{P}\!\big( Y_{n+1} \in \hat{C}^{\mathrm{QR}}_\alpha(X_{n+1}) \,\big|\, X_{n+1} = x \big) \;\longrightarrow\; 1 - \alpha \quad \text{as } n \to \infty.$$

Proof sketch (cited from Quantile Regression and Knight 1998): asymptotic normality $\sqrt{n}(\hat{q}_\tau(x) - q_\tau^*(x)) \xrightarrow{d} \mathcal{N}\!\big(0, \omega_\tau(x)^2\big)$ pointwise in $x$, with the asymptotic variance $\omega_\tau(x)^2$ involving the conditional density at $q_\tau^*(x)$. Pointwise consistency of $\hat{q}_\tau$ then implies pointwise convergence of conditional coverage to its population value $1 - \alpha$ by continuity of the conditional CDF.

Three things this theorem does not deliver, and §5 will quantify the gap on each:

  1. No finite-sample guarantee. The convergence is in the limit. At any fixed $n$, conditional coverage can deviate from $1 - \alpha$ by an amount that depends on QR's pointwise estimation error.
  2. No marginal guarantee either. Marginal coverage is the integral of conditional coverage against $\mathbb{P}_X$. If conditional coverage is biased downward at most $x$ (a generic possibility — QR overfits the visible data and produces too-narrow bands at finite $n$), marginal coverage falls below $1 - \alpha$. The §6 empirical comparison will surface this clearly.
  3. No uniform statement across $x$. Pointwise convergence allows arbitrarily slow convergence at $x$ values near the boundary of the support, where QR estimates are notoriously noisy. Theorem 5.2 (the width-comparison bound) leans on a uniform refinement of this convergence under bounded conditional density.

The contrast with Theorem 1 in §2 is the topic’s central trade. Theorem 1 gives a finite-sample marginal guarantee under exchangeability with a constant-width interval; Theorem 2 gives an asymptotic conditional guarantee under stronger smoothness assumptions with a heteroscedasticity-adapted interval. CQR (§5 bridge) gets the better of both — but only on the marginal axis, not on the conditional one, as Theorem 5.1 will make precise.

Pure QR on the running example. The notebook fits two quantile regressors on the heteroscedastic running example with degree-3 polynomial features at $\tau = 0.05$ and $\tau = 0.95$, returning the band $[\hat{q}_{0.05}(x), \hat{q}_{0.95}(x)]$ at $\alpha = 0.1$. Three observations:

  1. The band is visibly heteroscedastic. Narrow near $x = 0$ where the noise is small, wide near $x = \pm 3$ where it's large. This is the band-shape win pure QR delivers, and the constant-width band of §2 cannot.
  2. Conditional coverage is approximately flat. The 10-bin conditional-coverage histogram is roughly horizontal at $\approx 0.9$ — a striking visual contrast with the U-shape of Figures 2 and 5. Theorem 2 in action.
  3. Marginal coverage is not automatic. On a typical run with $n = 1000$, empirical marginal coverage often falls a bit short of $0.9$ — perhaps $0.88$ or $0.89$ — because QR's finite-sample bias produces too-narrow intervals on average. This is the failure mode that Theorem 1 in §2 was designed to rule out, and the failure mode CQR fixes by composition.
Pure QR band on the heteroscedastic running example overlaid with the dashed split-conformal band. Side-by-side conditional-coverage strip charts: split conformal U-shaped, pure QR roughly flat near 0.9.
Figure 6. Pure QR (green) tracks heteroscedasticity; split conformal (blue dashed) does not. Right strip charts: split conformal's conditional coverage is U-shaped (range $\approx 0.30$); pure QR's is roughly flat (range $\approx 0.15$). Pure QR's marginal coverage at $n_{\mathrm{train}} = 1000$ is $0.904$, slightly above target — but in the §6 batch comparison the average is closer to $0.896$, the asymptotic-only validity bites.
Monte Carlo marginal-coverage distributions over n_rep = 200 draws comparing split conformal and pure QR. Split conformal histogram tightly peaked at 0.901; pure QR distribution wider and centered slightly below 0.90.
Figure 7. Monte Carlo marginal-coverage distributions (200 reps, $n_{\mathrm{per}} = 800$). Split conformal's distribution is tightly peaked at $\approx 0.901$ (Theorem 1's finite-sample envelope); pure QR's distribution is wider and centred slightly below $0.9$ — Theorem 2 is asymptotic, and pure QR underdelivers on marginal coverage by 1–2 percentage points at this sample size.

Preview of CQR (§5 bridge). CQR is, in the score-function frame, the same construction as pure QR with the threshold $0$ replaced by the conformal $(1-\alpha)$-quantile of calibration scores per Definition 7. That is:

| | Pure QR | CQR |
|---|---|---|
| Score | $s(x, y) = \max(\hat{q}_{\alpha/2}(x) - y,\, y - \hat{q}_{1-\alpha/2}(x))$ | same score |
| Threshold | $q = 0$ | $\hat{q}_{1-\alpha}$ from a calibration set |
| Band | $[\hat{q}_{\alpha/2}(x), \hat{q}_{1-\alpha/2}(x)]$ | $[\hat{q}_{\alpha/2}(x) - \hat{q}_{1-\alpha},\; \hat{q}_{1-\alpha/2}(x) + \hat{q}_{1-\alpha}]$ |
| Guarantee | asymptotic conditional coverage (Theorem 2) | finite-sample marginal coverage (Theorem 1, applied to the QR score) |
| Summary | QR shape, no marginal calibration | QR shape with marginal calibration |

The §5 bridge makes this precise. CQR inherits split conformal’s finite-sample marginal guarantee verbatim — it really is just split conformal with a particular score — and the QR shape, which gives it conditional adaptivity in approximation even though the rigorous guarantee remains marginal-only. Theorem 5.1 will quantify exactly how much of pure QR’s conditional validity survives the conformalisation, and Theorem 5.2 will compare CQR’s expected width to split conformal’s under heteroscedasticity. We don’t define or analyze CQR further in this section — that’s §5’s job. The point of the preview is that the score-function frame from §2.1 is doing real work: pure QR and CQR differ by exactly one number (the threshold), and that one-number difference is the entire architectural content of the bridge.
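The one-number difference can be sketched directly: fit a deliberately noisy quantile band (here a binned estimator on a small training fold, an assumed stand-in for fitted QR models), then conformalise its score on a calibration fold.

```python
# Pure QR vs CQR: same score, different threshold.
import numpy as np

rng = np.random.default_rng(5)
sigma = lambda x: 0.2 + 0.6 * np.abs(x) / 3
def draw(n):
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + sigma(x) * rng.normal(size=n)

alpha = 0.1
# small training fold, so the pure-QR band is noisy and tends to be too narrow
x_tr, y_tr = draw(400)
edges = np.linspace(-3, 3, 9)                  # 8 bins
lo = np.zeros(8); hi = np.zeros(8)
for b in range(8):
    yb = y_tr[(x_tr >= edges[b]) & (x_tr < edges[b + 1])]
    lo[b], hi[b] = np.quantile(yb, [alpha / 2, 1 - alpha / 2])
q_lo = lambda x: lo[np.clip(np.digitize(x, edges) - 1, 0, 7)]
q_hi = lambda x: hi[np.clip(np.digitize(x, edges) - 1, 0, 7)]

# CQR step: conformal quantile of the QR score on a calibration fold
x_cal, y_cal = draw(400)
scores = np.maximum(q_lo(x_cal) - y_cal, y_cal - q_hi(x_cal))
k = int(np.ceil((1 - alpha) * (len(scores) + 1)))
q_hat = np.sort(scores)[k - 1]

x_te, y_te = draw(20_000)
pure = (y_te >= q_lo(x_te)) & (y_te <= q_hi(x_te))               # threshold 0
cqr = (y_te >= q_lo(x_te) - q_hat) & (y_te <= q_hi(x_te) + q_hat)
print("pure QR marginal:", pure.mean())   # typically a bit under 0.90
print("CQR marginal    :", cqr.mean())    # restored to ~0.90 or above
```

Everything upstream of `q_hat` is shared; the calibration fold and one sorted-score lookup are the entire difference between the two columns of the table above.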

Construction III — Test-Inversion (HL-Style Prediction Intervals)

The two preceding constructions illustrate the marginal/conditional trade in its purest form: §2 buys a finite-sample marginal guarantee with the price of a constant-width band; §3 buys conditional adaptivity with the price of an asymptotic-only guarantee. Both work under exchangeability or iid — assumptions weak enough to accommodate arbitrary noise distributions. This section trades in the opposite direction: we accept a much stronger assumption (iid with residuals symmetric around zero, independent of $X$) and in exchange recover finite-sample distribution-free conditional coverage — the strongest guarantee on offer in this topic.

The construction generalizes Hodges–Lehmann's test-inversion CI from Rank Tests from a confidence interval for a location parameter $\theta$ to a prediction interval for the next observation $Y_{n+1}$. Conceptually, the move is small — inverting the same rank-symmetry argument — but mathematically it requires a $1/(n+1)$ correction analogous to the conformal $+1$ correction in Definition 7. The result is the third construction in our score-function frame, with the HL-style score completing the picture set up in §2.1.

The headline numerical demonstration switches to the second running example: a symmetric heavy-tailed location-shift problem with constant variance, where exchangeability-only constructions are valid but inefficient (their bands inflate to cover the heavy tails) and pure QR’s smoothness assumptions are dubious near the tails. HL is in its element here, and §5’s Theorem 5.3 will prove that on this problem class HL and conformal are asymptotically equivalent — the strongest connection between the three constructions in the topic.

The location-shift setup. The construction requires a more restrictive data model than §§2-3:

Definition 9 (Location-shift model).

Pairs $(X_i, Y_i)_{i=1}^{n+1}$ are iid from a location-shift model if there exists a function $\mu \colon \mathcal{X} \to \mathbb{R}$ and a distribution $F$ on $\mathbb{R}$ symmetric around zero such that

$$Y_i = \mu(X_i) + \varepsilon_i, \qquad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} F, \qquad \varepsilon_i \perp X_i.$$

Three assumptions are doing work here, in increasing order of strength:

  1. iid — already required by §3.
  2. Independence of residual and feature ($\varepsilon_i \perp X_i$). Rules out heteroscedasticity. If $\mathrm{Var}(\varepsilon \mid X)$ depends on $X$, the construction below is no longer valid: the symmetry argument it leans on requires the residual distribution to be the same at every $x$.
  3. Symmetry of the residual distribution ($F = -F$). Stronger than independence: the noise distribution must be a centered symmetric distribution. Gaussian, $t$, Laplace, symmetric uniform all qualify; exponential, gamma, lognormal don't.

This is a much narrower class than the exchangeable models §2 admits or the iid models §3 admits. The payoff is correspondingly larger: a finite-sample distribution-free guarantee that is conditional on XX, not just marginal.

The most familiar example is additive Gaussian noise in regression, which trivially fits Definition 9. The more interesting example — and the one we’ll headline — is additive Student-tt noise with df=3\mathrm{df} = 3, where the heavy tails make pure QR’s smoothness assumptions wobbly and inflate the constant-width split-conformal band well beyond what the data needs.

The Walsh-average score. To put the HL-style construction into the §2.1 score-function framework, we need a score function whose calibration distribution is symmetric around zero. The natural choice generalizes the Walsh-average construction from Rank Tests:

Definition 10 (HL-style nonconformity score).

Fix a base predictor μ^\hat{\mu} trained on a held-out fold. For a calibration set with residuals ri=Yiμ^(Xi)r_i = Y_i - \hat{\mu}(X_i), the HL-style score for a candidate test pair (x,y)(x_*, y_*) is

sHL(x,y)  =  median1incal ⁣(12[(yμ^(x))+ri]).s_{\mathrm{HL}}(x_*, y_*) \;=\; \mathrm{median}_{1 \le i \le n_{\mathrm{cal}}}\!\Big( \tfrac{1}{2}\big[ (y_* - \hat{\mu}(x_*)) + r_i \big] \Big).

In words: the median of the Walsh averages pairing the test residual with each calibration residual. When yy_* is the true Yn+1Y_{n+1}, this is the one-sample HL location estimator from Rank Tests, restricted to the Walsh pairs involving the test residual. Under symmetry, that statistic is centered at zero — the property we’ll lean on for the coverage proof.

The threshold paired with this score doesn’t follow Definition 7. Instead, it comes from inverting a Wilcoxon-type rank statistic, exactly as in Rank Tests — which is why we call this test-inversion. In implementation we use the closed-form Walsh-average order-statistic version: out of M=ncal(ncal+1)/2M = n_{\mathrm{cal}}(n_{\mathrm{cal}}+1)/2 Walsh averages, the interval [A(w+1),A(Mw)][A_{(w+1)}, A_{(M-w)}] has coverage (M2w)/M(M - 2w)/M for the centre of symmetry, so we choose w=Mα/2w = \lfloor M\alpha/2 \rfloor.
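For small ncaln_{\mathrm{cal}} the closed-form index w=Mα/2w = \lfloor M\alpha/2 \rfloor and the exact signed-rank critical value can differ; the exact null distribution of W+W^+ is cheap to tabulate by a generating-function recursion. A sketch (helper names are mine, not from the text):

```python
import numpy as np

def signed_rank_null_pmf(n):
    """Exact null pmf of the Wilcoxon signed-rank statistic W+ on n symmetric
    residuals: each rank k in 1..n joins W+ independently with prob. 1/2, so
    the pmf is the coefficient array of prod_k (1 + z^k) / 2^n."""
    pmf = np.array([1.0])
    for k in range(1, n + 1):
        new = np.zeros(pmf.size + k)
        new[:pmf.size] += pmf / 2.0   # rank k excluded from W+
        new[k:] += pmf / 2.0          # rank k included in W+
        pmf = new
    return pmf

def critical_w(n, alpha):
    """Largest w with P(W+ <= w) <= alpha/2 under the null.
    Returns -1 if even w = 0 violates the level constraint."""
    cdf = np.cumsum(signed_rank_null_pmf(n))
    return int(np.searchsorted(cdf, alpha / 2.0, side="right") - 1)
```

Definition 11 takes the null distribution on ncal+1n_{\mathrm{cal}} + 1 residuals, so the call there would be critical_w(n_cal + 1, alpha).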

Definition 11 (HL-style prediction interval).

Let μ^\hat{\mu} be a base predictor trained on a separate fold, r1,,rncalr_1, \ldots, r_{n_{\mathrm{cal}}} the calibration residuals, and A(1)A(M)A_{(1)} \le \cdots \le A_{(M)} their sorted Walsh averages with M=ncal(ncal+1)/2M = n_{\mathrm{cal}}(n_{\mathrm{cal}}+1)/2. For miscoverage α\alpha, let wαw_\alpha be the integer satisfying

PH0(W+wα)    α2,\mathbb{P}_{H_0}(W^+ \le w_\alpha) \;\le\; \tfrac{\alpha}{2},

the lower α/2\alpha/2 critical value of the discrete null distribution of the signed-rank statistic on ncal+1n_{\mathrm{cal}} + 1 residuals. The HL-style prediction interval at xx_* is

C^αHL(x)  =  μ^(x)  +  [A(wα+1),  A(Mwα)].\hat{C}^{\mathrm{HL}}_\alpha(x_*) \;=\; \hat{\mu}(x_*) \;+\; \big[\, A_{(w_\alpha + 1)},\; A_{(M - w_\alpha)} \,\big].

The interval is the fitted mean plus a symmetric pair of Walsh-average order statistics — the construction that worked for the location parameter in Rank Tests lifted to the prediction setting by recentring on μ^(x)\hat{\mu}(x_*). The width A(Mwα)A(wα+1)A_{(M - w_\alpha)} - A_{(w_\alpha + 1)} is the same at every xx_*the band is constant-width like §2’s, not adaptive like §3’s. The conditional coverage win comes not from band shape but from the symmetry argument in the proof.
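Definitions 10–11 assembled into code, using the closed-form critical index Mα/2\lfloor M\alpha/2 \rfloor rather than the exact signed-rank critical value (a simplification the text itself uses in implementation; the function name is mine):

```python
import numpy as np

def hl_prediction_interval(mu_hat_x, residuals, alpha=0.1):
    """HL-style PI (Definition 11): fitted mean plus a symmetric pair of
    Walsh-average order statistics of the calibration residuals."""
    r = np.asarray(residuals, dtype=float)
    n = r.size
    i, j = np.triu_indices(n)              # all pairs i <= j, M = n(n+1)/2
    walsh = np.sort((r[i] + r[j]) / 2.0)
    M = walsh.size
    w = int(np.floor(M * alpha / 2.0))     # closed-form critical index
    # A_{(w+1)} and A_{(M-w)} in the text's 1-based indexing
    return mu_hat_x + walsh[w], mu_hat_x + walsh[M - w - 1]
```

The width walsh[M - w - 1] - walsh[w] is the same at every xx_*; only the centre μ^(x)\hat{\mu}(x_*) shifts, matching the constant-width remark above.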

Coverage theorem. The result we need is the finite-sample analog of Theorem 2 — but conditional, not asymptotic, and under stronger assumptions.

Theorem 3 (HL-style finite-sample conditional coverage).

Under the location-shift model of Definition 9, with the HL-style prediction interval of Definition 11 built from a calibration set of size ncaln_{\mathrm{cal}} disjoint from μ^\hat{\mu}’s training data, for every xXx \in \mathcal{X},

P ⁣(Yn+1C^αHL(Xn+1)Xn+1=x)    1α1ncal+1.\mathbb{P}\!\big( Y_{n+1} \in \hat{C}^{\mathrm{HL}}_\alpha(X_{n+1}) \,\big|\, X_{n+1} = x \big) \;\ge\; 1 - \alpha - \frac{1}{n_{\mathrm{cal}} + 1}.

The finite-sample slack 1/(ncal+1)1/(n_{\mathrm{cal}} + 1) mirrors the 1/(ncal+1)1/(n_{\mathrm{cal}} + 1) slack in Theorem 1 — over-coverage there, under-coverage here — and vanishes as ncaln_{\mathrm{cal}} \to \infty.

Proof.

Condition on Xn+1=xX_{n+1} = x. Under Definition 9, Yn+1μ^(x)=μ(x)μ^(x)+εn+1Y_{n+1} - \hat{\mu}(x) = \mu(x) - \hat{\mu}(x) + \varepsilon_{n+1}, where εn+1F\varepsilon_{n+1} \sim F symmetric around zero and independent of Xn+1X_{n+1} and the calibration data. Define the recentred residual

rn+1  =  Yn+1μ^(x)  =  b(x)+εn+1,b(x):=μ(x)μ^(x).r_{n+1}^* \;=\; Y_{n+1} - \hat{\mu}(x) \;=\; b(x) + \varepsilon_{n+1}, \qquad b(x) := \mu(x) - \hat{\mu}(x).

The bias b(x)b(x) is a deterministic function of xx once μ^\hat{\mu} is frozen by training-fold conditioning.

The calibration residuals ri=Yiμ^(Xi)r_i = Y_i - \hat{\mu}(X_i), conditional on μ^\hat{\mu}, take the form ri=b(Xi)+εir_i = b(X_i) + \varepsilon_i. By independence εiXi\varepsilon_i \perp X_i and the iid sampling of XiX_i, the marginal distribution of rir_i is the convolution of the distribution of b(X)b(X) and the distribution of ε\varepsilon — both symmetric around their respective centers, hence rir_i is symmetric around E[b(X)]\mathbb{E}[b(X)]. The shift is the only obstruction; we’ll see it canceled.

Now apply Rank Tests Theorem 10 to the augmented sample {r1,,rncal,rn+1}\{r_1, \ldots, r_{n_{\mathrm{cal}}}, r_{n+1}^*\} — but treat the test residual as the parameter θ\theta being estimated, not as an additional observation. The theorem says: the set of θ\theta values for which the level-α\alpha Wilcoxon test fails to reject the null hypothesis “the augmented sample’s distribution is symmetric around θ\theta” is exactly the interval [A(wα+1),A(Mwα)][A_{(w_\alpha + 1)}, A_{(M - w_\alpha)}] in Walsh-average order statistics of the calibration residuals. The interval has coverage at least 1α1 - \alpha for the true center of symmetry of the calibration residual distribution — that is, for E[b(X)]\mathbb{E}[b(X)] in our notation.

Two more steps. First, the test residual rn+1r_{n+1}^* is exchangeable with the calibration residuals (under Definition 9, the joint distribution of (r1,,rncal,rn+1)(r_1, \ldots, r_{n_{\mathrm{cal}}}, r_{n+1}^*) is permutation-invariant — they’re all of the form b(X)+εb(X) + \varepsilon with iid XX and iid ε\varepsilon). So the rank of rn+1r_{n+1}^* among the augmented sample is uniform on {1,,ncal+1}\{1, \ldots, n_{\mathrm{cal}} + 1\}, and the rank-tests coverage guarantee for the centre of symmetry transfers to a coverage guarantee for any specific augmented-sample observation, with the standard 1/(ncal+1)1/(n_{\mathrm{cal}} + 1) finite-sample correction. (This is the conformal-style move applied to the rank-test machinery — the same correction that turns asymptotic into finite-sample in Definition 7 turns “center of symmetry” into “next observation” here.)

Second, the bias b(x)b(x) does not appear on either side of the inequality. The interval μ^(x)+[A(wα+1),A(Mwα)]\hat{\mu}(x) + [A_{(w_\alpha+1)}, A_{(M-w_\alpha)}] contains Yn+1Y_{n+1} if and only if the centred quantity rn+1b(x)r_{n+1}^* - b(x) — which equals εn+1\varepsilon_{n+1}, hence follows the same symmetric distribution as the centred calibration residuals — falls in the appropriate symmetric interval around zero. By the rank-symmetry argument, that event has probability at least 1α1/(ncal+1)1 - \alpha - 1/(n_{\mathrm{cal}}+1).

The probability is conditional on Xn+1=xX_{n+1} = x throughout: nothing in the argument averaged over Xn+1X_{n+1}, so the guarantee is genuinely conditional. \square

The bias b(x)=μ(x)μ^(x)b(x) = \mu(x) - \hat{\mu}(x) deserves a remark: the proof is robust to it, but the width of the resulting interval is not. A poor predictor μ^\hat{\mu} produces calibration residuals with a wide convolved distribution, hence wider Walsh averages, hence a wider interval. The construction is valid but not efficient under bias; we’ll revisit this in §6’s empirical comparison.

Interactive-demo readout (Theorem 3 conditions met): marginal coverage 0.848 · mean width 2.445 · conditional-coverage range Δ = 0.052.

Try walking through the failure modes one at a time. With HL selected: symmetric residual + σ_max = 0 = green check (Theorem 3 holds), conditional coverage flat near 0.9. Switch to centered χ² or lognormal: the indicator flips red and marginal coverage drops well below 0.9. Or keep the Gaussian residual but push σ_max up: the strip chart re-acquires the U-shape (constant-width can't track heteroscedasticity). Switch to CQR to recover conditional adaptivity in the heteroscedastic case.

A second remark on what the theorem does and doesn’t promise. Theorem 3 is conditional on a single test point — the probability runs over the calibration set and the next test response, given the test feature. When you take a fresh batch of test points and average over them with a fixed calibration set (which is what §6’s empirical comparison does), the resulting batch-average coverage is a different statistic and routinely sits well below 1α1 - \alpha: §6’s batch comparison shows HL marginal coverage at 0.760.760.830.83 across its four scenarios, not the nominal 0.90.9 — a real and non-trivial shortfall.

Running Example 2 — heavy-tailed location-shift. The headline figure for §4 switches scenarios:

Running Example 2 (heavy-tailed location-shift).

YX=x    μ(x)+σt3,μ(x)=0.4cos(πx),σ=0.6,XUniform(2,2).Y \mid X = x \;\sim\; \mu(x) + \sigma \cdot t_3, \qquad \mu(x) = 0.4 \cos(\pi x),\quad \sigma = 0.6,\quad X \sim \mathrm{Uniform}(-2, 2).

Constant variance, additive symmetric heavy-tailed (Student-tt with df=3\mathrm{df} = 3) noise, smooth deterministic mean.

The notebook produces a side-by-side comparison on this scenario at α=0.1\alpha = 0.1:

  1. HL-style — Theorem 3 gives finite-sample conditional 0.90.9 coverage. Empirical conditional-coverage strip chart sits cleanly at 0.90.9 across all XX-bins on a single draw.
  2. Split conformal — Theorem 1 gives finite-sample marginal 0.90.9 coverage. The band is constant-width like HL, but the threshold is the empirical 9090th percentile of ri|r_i|, which under t3t_3 is inflated by the heavy tails. Visually, split conformal’s band is wider than HL’s.
  3. Pure QR — Theorem 2 holds asymptotically. Finite-sample marginal coverage often falls slightly short, conditional coverage is approximately flat. The band is narrower in the middle (where most data sits) and wider in the tails — but on this constant-variance scenario, that adaptivity is wasted.

The expected ranking by mean width is HL ≤ split conformal ≤ pure QR on this scenario (with HL and split conformal close — Theorem 5.3 proves they’re asymptotically equivalent here — but pure QR is distinctly wider because the QR fit has to spend degrees of freedom modeling a quantile function that’s actually constant in xx).
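A single seeded draw already shows the HL-vs-split-conformal half of that ranking on RE2’s noise. A sketch (my simulation, not the notebook’s Monte Carlo; closed-form critical index Mα/2\lfloor M\alpha/2 \rfloor):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, alpha, sigma = 500, 0.1, 0.6
r = sigma * rng.standard_t(3, size=n_cal)      # RE2 residuals: 0.6 * t_3

# Split conformal: constant band, half-width = conformal 0.9-quantile of |r|.
k = int(np.ceil((1 - alpha) * (n_cal + 1)))
sc_width = 2.0 * np.sort(np.abs(r))[k - 1]

# HL-style: symmetric pair of Walsh-average order statistics.
i, j = np.triu_indices(n_cal)
walsh = np.sort((r[i] + r[j]) / 2.0)
M = walsh.size
w = int(np.floor(M * alpha / 2.0))
hl_width = walsh[M - w - 1] - walsh[w]

print(f"HL width {hl_width:.2f} vs split conformal {sc_width:.2f}")
```

On draws like this the HL band comes out narrower, consistent with the Figure 9 box plots.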

Three-panel figure on RE2: scatter with HL, split conformal, and pure QR bands overlaid; Walsh-averages histogram with critical pair marked; conditional-coverage strip chart for all three constructions side-by-side near 0.9.
Figure 8. Three constructions on Running Example 2 (heavy-tailed location-shift), single seeded draw. Left: HL band (purple) is narrower than split conformal (blue dashed) and pure QR (green dotted) on this constant-variance heavy-tailed regime. Middle: calibration Walsh averages with critical pair $A_{(w+1)} \approx -1.14$, $A_{(M-w)} \approx 1.43$ marked. Right: conditional coverage by 8 $X$-bins for all three constructions, sitting near $0.9$ on this single draw.
Box plots of mean band width over 100 Monte Carlo replications on RE2. HL median around 2.27, split conformal around 3.03, pure QR around 2.98.
Figure 9. Width-comparison Monte Carlo on RE2 (100 reps, $n_{\mathrm{train}} = n_{\mathrm{cal}} = 500$). HL is the narrowest (mean width $2.27$), split conformal next ($3.03$), pure QR similar to split conformal ($2.98$). HL/SC width ratio $\approx 0.75$ at this $n_{\mathrm{cal}}$ — Theorem 5.3 says this ratio approaches $1$ as $n_{\mathrm{cal}} \to \infty$.

Limits — when each assumption fails. Definition 9’s three assumptions fail in distinct, observable ways. The notebook makes each failure visible:

  1. Symmetry violation. Replace the t3t_3 residual with a centered chi-squared minus its mean (right-skewed, mean zero, but FFF \ne -F). The empirical conditional coverage of the HL band drops below 0.90.9 — no longer protected by Theorem 3 because its symmetry hypothesis fails. Split conformal still hits 0.90.9 marginally; pure QR still flattens conditionally.
  2. Heteroscedasticity (ε⊥̸X\varepsilon \not\perp X). Switch back to Running Example 1. The HL band has constant width (it’s a single pair of Walsh-average order statistics, with no xx-dependence), so its conditional-coverage strip chart re-acquires the U-shape of Figure 5 — over-cover near x=0x = 0, under-cover near x=±3x = \pm 3. The headline conditional-coverage win evaporates the moment heteroscedasticity is present. This is exactly the regime where pure QR (and §5’s CQR bridge) wins.
  3. Non-iid (e.g., temporal correlation). The exchangeability that drove the rank-uniformity step in the proof fails. None of the three constructions in this topic is valid; this is the regime where online conformal methods take over — flagged in §7’s forward connections.
Symmetry-violation diagnostic: HL marginal coverage drops to ~0.77 under skewed centered chi-squared residuals while split conformal and pure QR remain near 0.9.
Figure 10. Symmetry-violation diagnostic. Replacing the $t_3$ residual with a centred chi-squared minus its mean (right-skewed, mean zero) breaks Definition 9's symmetry hypothesis. HL marginal coverage drops to $\approx 0.77$ (Theorem 3 no longer applies); split conformal stays at $0.93$ marginally; pure QR remains near $0.89$.

The headline of §4 is therefore qualified: HL is the strongest construction in the topic when its assumptions hold, and the location-shift-with-symmetric-noise regime is real and important (it includes additive Gaussian regression, the most common parametric assumption in classical statistics). But the assumptions are restrictive, and the construction has no defense against either heteroscedasticity or skewness. §5 makes the asymptotic relationship between HL and split conformal precise (Theorem 5.3); §6 quantifies the trade-offs across all four scenarios.

Bridge Theorems

§§2–4 introduced the three constructions through the score-function frame from §2.1, citing the prerequisite theorems from Conformal Prediction, Quantile Regression, and Rank Tests rather than reproving them. The synthesis topic earns its keep here. This section formalizes three relationships between the constructions:

  1. Theorem 5.1 (CQR coverage decomposition) — the bridge between Constructions I and II. CQR is always marginally valid (it’s split conformal under the QR score, by definition); we prove that its conditional coverage gap is bounded by twice the QR base learner’s pointwise quantile-estimation error. This explains the empirical pattern from §3 — CQR is conditional-adaptive but not conditional-valid — without requiring it as a separate theorem.
  2. Theorem 5.2 (heteroscedastic width comparison) — quantifies the §3 efficiency intuition. Under Running Example 1’s heteroscedastic noise with conditional standard deviation bounded between σ>0\sigma_- > 0 and σ+<\sigma_+ < \infty, the expected CQR width is bounded above by expected split-conformal width up to lower-order terms, with the gap closing in the homoscedastic limit σ=σ+\sigma_- = \sigma_+.
  3. Theorem 5.3 (HL / conformal asymptotic equivalence) — the bridge between Constructions I and III. Under Definition 9’s location-shift model, the HL-style and split-conformal intervals converge to the same population symmetric interval around μ(x)\mu(x). Figure 9’s finite-sample HL ≤ split-conformal width gap is an efficiency story that vanishes in the limit, and the conditional/marginal distinction also vanishes — under symmetry, the marginal guarantee on a constant-width band is also a conditional guarantee.

CQR is fully defined and analyzed here per the §3.5 agreement. Notation is shared with §§2–4.

Definition of CQR.

Definition 12 (Conformalized quantile regression (Romano, Patterson & Candès 2019)).

Let q^α/2\hat{q}_{\alpha/2} and q^1α/2\hat{q}_{1-\alpha/2} be quantile-regression estimators trained on a training fold disjoint from the calibration fold. For each calibration point ii define the CQR score

Ei  =  max ⁣(q^α/2(Xi)Yi,  Yiq^1α/2(Xi)).E_i \;=\; \max\!\big(\hat{q}_{\alpha/2}(X_i) - Y_i,\; Y_i - \hat{q}_{1-\alpha/2}(X_i)\big).

Let Q^1α\hat{Q}_{1-\alpha} be the conformal (1α)(1-\alpha)-quantile of {Ei}i=1ncal\{E_i\}_{i=1}^{n_{\mathrm{cal}}} per Definition 7. The CQR prediction interval at xx is

C^αCQR(x)  =  [q^α/2(x)Q^1α,  q^1α/2(x)+Q^1α].\hat{C}^{\mathrm{CQR}}_\alpha(x) \;=\; \big[\, \hat{q}_{\alpha/2}(x) - \hat{Q}_{1-\alpha},\; \hat{q}_{1-\alpha/2}(x) + \hat{Q}_{1-\alpha} \,\big].

In the §2.1 score-function frame, CQR is the pair (s,q)=(sQR,Q^1α)(s, q) = (s_{\mathrm{QR}}, \hat{Q}_{1-\alpha}), where sQRs_{\mathrm{QR}} is the pure-QR score from §3.1 and the threshold is the conformal quantile rather than zero. Theorem 1 from §2 applies verbatim — CQR inherits split conformal’s finite-sample marginal coverage guarantee with no extra work, the architectural payoff of the score-function frame.
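Definition 12 as code (a sketch; q_lo_cal / q_hi_cal are the fitted α/2\alpha/2 and 1α/21-\alpha/2 quantile curves evaluated at the calibration points, produced by any QR base learner trained on a disjoint fold — the names are mine):

```python
import numpy as np

def cqr_interval(q_lo_cal, q_hi_cal, y_cal, q_lo_x, q_hi_x, alpha=0.1):
    """CQR (Definition 12): conformalize a QR band with the score
    E_i = max(q_lo(X_i) - Y_i, Y_i - q_hi(X_i))."""
    q_lo = np.asarray(q_lo_cal, dtype=float)
    q_hi = np.asarray(q_hi_cal, dtype=float)
    y = np.asarray(y_cal, dtype=float)
    E = np.maximum(q_lo - y, y - q_hi)        # CQR nonconformity scores
    n = E.size
    k = int(np.ceil((1 - alpha) * (n + 1)))   # conformal quantile index
    Q = np.sort(E)[min(k, n) - 1]             # conformal (1-alpha)-quantile
    return q_lo_x - Q, q_hi_x + Q
```

Negative scores are allowed: when the QR band is already too wide, Q^1α\hat{Q}_{1-\alpha} comes out negative and CQR shrinks the band.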

Theorem 5.1 — CQR coverage decomposition. The motivating question. Pure QR (§3) is conditionally valid asymptotically; CQR (§5.1) is marginally valid in finite samples. What happens to conditional coverage under the conformalisation? Theorem 5.1 answers in two parts: a finite-sample bound (the conformal +1+1 correction transfers cleanly), and a conditional-coverage bound that decays at the rate of QR’s estimation error.

Let qα/2q_{\alpha/2}^* and q1α/2q_{1-\alpha/2}^* be the true conditional quantiles, and define the pointwise QR estimation error

Δn(x)  =  max ⁣(q^α/2(x)qα/2(x),  q^1α/2(x)q1α/2(x)).\Delta_n(x) \;=\; \max\!\Big( |\hat{q}_{\alpha/2}(x) - q_{\alpha/2}^*(x)|,\; |\hat{q}_{1-\alpha/2}(x) - q_{1-\alpha/2}^*(x)| \Big).

Theorem 5.1 (CQR coverage decomposition).

Suppose the calibration data and test point are exchangeable, the QR base learner is trained on a disjoint fold, and the conditional density fYX(x)f_{Y \mid X}(\cdot \mid x) is bounded above by fmaxf_{\max} uniformly on the support of XX.

(i) Marginal coverage (finite sample, exchangeability only). For every α(0,1)\alpha \in (0, 1),

1α    P ⁣(Yn+1C^αCQR(Xn+1))    1α+1ncal+1.1 - \alpha \;\le\; \mathbb{P}\!\big( Y_{n+1} \in \hat{C}^{\mathrm{CQR}}_\alpha(X_{n+1}) \big) \;\le\; 1 - \alpha + \frac{1}{n_{\mathrm{cal}} + 1}.

(ii) Conditional coverage gap. Additionally, assuming Definition 4 (iid) and the conditional-density bound, for PX\mathbb{P}_X-almost every xx,

P ⁣(Yn+1C^αCQR(Xn+1)Xn+1=x)    (1α)    4fmax(Δn(x)+EX[Δn(X)]).\Big| \mathbb{P}\!\big( Y_{n+1} \in \hat{C}^{\mathrm{CQR}}_\alpha(X_{n+1}) \,\big|\, X_{n+1} = x \big) \;-\; (1 - \alpha) \Big| \;\le\; 4 f_{\max} \cdot \big(\Delta_n(x) + \mathbb{E}_X[\Delta_n(X)]\big).
Proof.

(i) Marginal coverage. The QR base learner is trained on a fold disjoint from the calibration set, so the score function sQR(x,y)=max(q^α/2(x)y,yq^1α/2(x))s_{\mathrm{QR}}(x, y) = \max(\hat{q}_{\alpha/2}(x) - y, y - \hat{q}_{1-\alpha/2}(x)) does not depend on the calibration data or the test point; it depends only on the training fold (frozen) and the input pair (x,y)(x, y). Theorem 1 from §2 applies directly: the calibration scores Ei=sQR(Xi,Yi)E_i = s_{\mathrm{QR}}(X_i, Y_i) and the test score En+1=sQR(Xn+1,Yn+1)E_{n+1} = s_{\mathrm{QR}}(X_{n+1}, Y_{n+1}) are exchangeable, so the rank of En+1E_{n+1} in the augmented sample is uniform on {1,,ncal+1}\{1, \ldots, n_{\mathrm{cal}} + 1\}, giving

1α    P(En+1Q^1α)    1α+1ncal+1.1 - \alpha \;\le\; \mathbb{P}(E_{n+1} \le \hat{Q}_{1-\alpha}) \;\le\; 1 - \alpha + \frac{1}{n_{\mathrm{cal}} + 1}.

The event {En+1Q^1α}\{E_{n+1} \le \hat{Q}_{1-\alpha}\} is exactly {Yn+1C^αCQR(Xn+1)}\{Y_{n+1} \in \hat{C}^{\mathrm{CQR}}_\alpha(X_{n+1})\} by Definition 12.

(ii) Conditional coverage gap. The argument has two steps: bound the gap between CQR’s coverage at xx and the oracle conditional coverage if we knew the true quantiles, then bound the gap between the oracle and the nominal 1α1 - \alpha.

Step 1. Define the oracle CQR interval

C^αoracle(x)  =  [qα/2(x)Q^1α,  q1α/2(x)+Q^1α],\hat{C}^{\mathrm{oracle}}_\alpha(x) \;=\; [q_{\alpha/2}^*(x) - \hat{Q}_{1-\alpha}^*,\; q_{1-\alpha/2}^*(x) + \hat{Q}_{1-\alpha}^*],

where Q^1α\hat{Q}_{1-\alpha}^* is the conformal (1α)(1-\alpha)-quantile of the oracle scores Ei=max(qα/2(Xi)Yi,Yiq1α/2(Xi))E_i^* = \max(q_{\alpha/2}^*(X_i) - Y_i, Y_i - q_{1-\alpha/2}^*(X_i)). The CQR interval C^αCQR(x)\hat{C}^{\mathrm{CQR}}_\alpha(x) and the oracle interval C^αoracle(x)\hat{C}^{\mathrm{oracle}}_\alpha(x) have endpoints differing by at most Δn(x)+Q^1αQ^1α\Delta_n(x) + |\hat{Q}_{1-\alpha} - \hat{Q}_{1-\alpha}^*| at each side, so by the conditional-density bound,

P(Yn+1C^αCQRXn+1=x)P(Yn+1C^αoracleXn+1=x)    2fmax(Δn(x)+Q^1αQ^1α).\big|\, \mathbb{P}(Y_{n+1} \in \hat{C}^{\mathrm{CQR}}_\alpha \mid X_{n+1} = x) - \mathbb{P}(Y_{n+1} \in \hat{C}^{\mathrm{oracle}}_\alpha \mid X_{n+1} = x) \,\big| \;\le\; 2 f_{\max} \big(\Delta_n(x) + |\hat{Q}_{1-\alpha} - \hat{Q}_{1-\alpha}^*|\big).

The factor of 2 is from the two interval endpoints.

The conformal-threshold gap satisfies Q^1αQ^1αEX[Δn(X)]+oP(1)|\hat{Q}_{1-\alpha} - \hat{Q}_{1-\alpha}^*| \le \mathbb{E}_X[\Delta_n(X)] + o_P(1) by stability of order statistics under uniformly-bounded perturbations of the underlying random variables — a standard empirical-process argument; see Empirical Processes for the formal statement. Substituting:

P(Yn+1C^αCQRXn+1=x)P(Yn+1C^αoracleXn+1=x)    2fmax(Δn(x)+EX[Δn(X)]).\big|\, \mathbb{P}(Y_{n+1} \in \hat{C}^{\mathrm{CQR}}_\alpha \mid X_{n+1} = x) - \mathbb{P}(Y_{n+1} \in \hat{C}^{\mathrm{oracle}}_\alpha \mid X_{n+1} = x) \,\big| \;\le\; 2 f_{\max} \big(\Delta_n(x) + \mathbb{E}_X[\Delta_n(X)]\big).

Step 2. The oracle interval has Q^1α0\hat{Q}_{1-\alpha}^* \to 0 as ncaln_{\mathrm{cal}} \to \infty: by definition of the true conditional quantiles, P(Ei0)=1α\mathbb{P}(E_i^* \le 0) = 1 - \alpha, so the population (1α)(1-\alpha)-quantile of the oracle scores is zero and the conformal (1α)(1-\alpha)-quantile converges to it from above. At finite ncaln_{\mathrm{cal}}, Q^1αEX[Δn(X)]+O(ncal1/2)|\hat{Q}_{1-\alpha}^*| \le \mathbb{E}_X[\Delta_n(X)] + O(n_{\mathrm{cal}}^{-1/2}) — an empirical quantile deviates from its population value by O(n1/2)O(n^{-1/2}), and the first term is a nonnegative slack kept for symmetry with Step 1. The oracle interval thus agrees with [qα/2(x),q1α/2(x)][q_{\alpha/2}^*(x), q_{1-\alpha/2}^*(x)] up to a width-2EX[Δn(X)]\le 2 \mathbb{E}_X[\Delta_n(X)] symmetric inflation, and the conditional-density bound yields

P(Yn+1C^αoracleXn+1=x)(1α)    2fmaxEX[Δn(X)]+o(1).\big|\, \mathbb{P}(Y_{n+1} \in \hat{C}^{\mathrm{oracle}}_\alpha \mid X_{n+1} = x) - (1 - \alpha) \,\big| \;\le\; 2 f_{\max} \cdot \mathbb{E}_X[\Delta_n(X)] + o(1).

Combining Steps 1 and 2 by the triangle inequality:

P(Yn+1C^αCQRXn+1=x)(1α)    2fmax(Δn(x)+EX[Δn(X)])+2fmaxEX[Δn(X)]+o(1).\big|\, \mathbb{P}(Y_{n+1} \in \hat{C}^{\mathrm{CQR}}_\alpha \mid X_{n+1} = x) - (1 - \alpha) \,\big| \;\le\; 2 f_{\max} \big(\Delta_n(x) + \mathbb{E}_X[\Delta_n(X)]\big) + 2 f_{\max} \cdot \mathbb{E}_X[\Delta_n(X)] + o(1).

Bounding the constant factor crudely by 44 absorbs the lower-order term and gives the stated 4fmax(Δn(x)+EX[Δn(X)])4 f_{\max} \cdot (\Delta_n(x) + \mathbb{E}_X[\Delta_n(X)]) for ncaln_{\mathrm{cal}} large enough. \square

The decomposition has the expected structure: marginal coverage is rate-free (it’s 1α+O(1/ncal)1 - \alpha + O(1/n_{\mathrm{cal}}) regardless of QR’s estimation error, by exchangeability), but the conditional coverage gap is first-order in QR’s estimation error. If QR is consistent — Δn(x)0\Delta_n(x) \to 0 for almost every xx — then CQR’s conditional coverage converges to nominal pointwise, recovering pure QR’s Theorem 2 in the limit. At finite nn, CQR’s conditional coverage tracks the QR base learner’s quality, hence the “conditional-adaptive but not conditional-valid” formulation. The fmaxf_{\max} factor explains why heavy-tailed conditional distributions are hard: low density means a small change in the interval endpoint translates to a small change in coverage, which sounds like good news but is actually bad — large width changes are needed to fix coverage failures.
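The hinge of Step 2 — that the oracle scores EiE_i^* put probability exactly 1α1 - \alpha at or below zero, so their (1α)(1-\alpha)-quantile sits at zero — is easy to confirm by Monte Carlo in a toy model where the true conditional quantiles are known (Gaussian noise; the model choice is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
z = 1.6448536269514722                    # standard-normal 0.95-quantile

# Y | X = x ~ N(mu(x), 1): the true alpha/2 and 1-alpha/2 quantiles are known.
x = rng.uniform(-3, 3, size=200_000)
mu = np.sin(x)                            # any smooth mean works here
y = mu + rng.standard_normal(x.size)
q_lo, q_hi = mu - z, mu + z               # true conditional quantile curves

E_star = np.maximum(q_lo - y, y - q_hi)   # oracle CQR scores
# P(E* <= 0) = P(Y inside the true band) = 1 - alpha, so the empirical
# (1 - alpha)-quantile of E* sits at zero up to Monte Carlo noise.
print(np.mean(E_star <= 0), np.quantile(E_star, 1 - alpha))
```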

Scatter of empirical CQR conditional coverage gap vs theoretical bound 4·f_max·(Delta_n(x) + mean Delta) on RE1; all empirical points fall below the bound.
Figure 11. Theorem 5.1(ii) verification on RE1. Empirical $|\hat{P}(Y \in \hat{C}^{\mathrm{CQR}} \mid X = x) - (1 - \alpha)|$ on a 30-point $x$-grid plotted against the QR estimation error $\Delta_n(x)$, with the theoretical bound $4 f_{\max} \cdot (\Delta_n(x) + \mathbb{E}_X[\Delta_n(X)])$ overlaid as a dashed line. All empirical points sit below the bound; the bound is loose (it has to be, given the conformalisation slack).

Theorem 5.2 — heteroscedastic width comparison. Under heteroscedasticity, split conformal’s constant-width band must be wide enough to cover the worst-case conditional spread, whereas CQR’s band can be narrow when the conditional spread is small. The width-comparison theorem makes this quantitative.

Let σ(x)=StdDev(YX=x)\sigma(x) = \mathrm{StdDev}(Y \mid X = x). We assume bounded conditional standard deviation: 0<σσ(x)σ+<0 < \sigma_- \le \sigma(x) \le \sigma_+ < \infty for PX\mathbb{P}_X-almost every xx.

Theorem 5.2 (Heteroscedastic width comparison).

Under iid data with bounded conditional standard deviation σ(x)[σ,σ+]\sigma(x) \in [\sigma_-, \sigma_+], with split conformal using score yμ^(x)|y - \hat{\mu}(x)| for a consistent base predictor μ^\hat{\mu} and CQR using a consistent QR base learner, the expected band widths satisfy

EX ⁣[widthC^αCQR(X)]    EX ⁣[widthC^αSC(X)]EX[σ(X)]σ+(1+o(1)),\mathbb{E}_X\!\big[\mathrm{width}\,\hat{C}^{\mathrm{CQR}}_\alpha(X)\big] \;\le\; \mathbb{E}_X\!\big[\mathrm{width}\,\hat{C}^{\mathrm{SC}}_\alpha(X)\big] \cdot \frac{\mathbb{E}_X[\sigma(X)]}{\sigma_+} \cdot \big(1 + o(1)\big),

where o(1)0o(1) \to 0 as ncaln_{\mathrm{cal}} \to \infty. (Throughout the proof, z1α/2z_{1-\alpha/2} denotes the standard-normal (1α/2)(1-\alpha/2)-quantile.)

Equivalently in the homoscedastic limit σ=σ+\sigma_- = \sigma_+, the right-hand side simplifies to EX[widthC^αSC(X)](1+o(1))\mathbb{E}_X[\mathrm{width}\,\hat{C}^{\mathrm{SC}}_\alpha(X)] \cdot (1 + o(1)) — the two constructions have asymptotically equivalent width.

Proof.

Both halves of the proof lean on the limiting band-width formulas under consistency of the base learners.

Split conformal. Let FRF_{|R|} denote the CDF of the absolute residual Yμ^(X)=σ(X)Z|Y - \hat{\mu}(X)| = |\sigma(X) Z| with ZZ standard normal — the additional Gaussian-residual assumption is used here so the zz-quantile factors out cleanly. As ncaln_{\mathrm{cal}} \to \infty, the conformal threshold converges:

q^1α    FR1(1α).\hat{q}_{1-\alpha} \;\to\; F_{|R|}^{-1}(1-\alpha).

The width of the split-conformal band is 2q^1α2 \hat{q}_{1-\alpha} everywhere, so EX[widthC^αSC(X)]2FR1(1α)\mathbb{E}_X[\mathrm{width}\,\hat{C}^{\mathrm{SC}}_\alpha(X)] \to 2 F_{|R|}^{-1}(1-\alpha). Now FR1(1α)F_{|R|}^{-1}(1-\alpha) is the value ww such that P(σ(X)Zw)=1α\mathbb{P}(|\sigma(X) Z| \le w) = 1 - \alpha. By a tail-mass argument with σ(X)[σ,σ+]\sigma(X) \in [\sigma_-, \sigma_+], wσ+z1α/2w \ge \sigma_+ z_{1-\alpha/2} asymptotically (the band must be wide enough to cover the high-spread regions at level 1α1 - \alpha). Thus

EX[widthC^αSC(X)]    2σ+z1α/2+o(1).\mathbb{E}_X[\mathrm{width}\,\hat{C}^{\mathrm{SC}}_\alpha(X)] \;\ge\; 2 \sigma_+ z_{1-\alpha/2} + o(1).

CQR. The pure-QR band [q^α/2(x),q^1α/2(x)][\hat{q}_{\alpha/2}(x), \hat{q}_{1-\alpha/2}(x)] converges pointwise to [qα/2(x),q1α/2(x)]=[μ(x)σ(x)z1α/2,μ(x)+σ(x)z1α/2][q_{\alpha/2}^*(x), q_{1-\alpha/2}^*(x)] = [\mu(x) - \sigma(x) z_{1-\alpha/2}, \mu(x) + \sigma(x) z_{1-\alpha/2}], so its width converges pointwise to 2σ(x)z1α/22 \sigma(x) z_{1-\alpha/2}. The CQR conformal correction Q^1α\hat{Q}_{1-\alpha} inflates each side by o(1)o(1) (Step 2 of Theorem 5.1’s proof). Therefore

EX[widthC^αCQR(X)]    2z1α/2EX[σ(X)].\mathbb{E}_X[\mathrm{width}\,\hat{C}^{\mathrm{CQR}}_\alpha(X)] \;\to\; 2 z_{1-\alpha/2} \mathbb{E}_X[\sigma(X)].

Combining:

EX[widthC^αCQR(X)]EX[widthC^αSC(X)]    2z1α/2EX[σ(X)]2σ+z1α/2  =  EX[σ(X)]σ+.\frac{\mathbb{E}_X[\mathrm{width}\,\hat{C}^{\mathrm{CQR}}_\alpha(X)]}{\mathbb{E}_X[\mathrm{width}\,\hat{C}^{\mathrm{SC}}_\alpha(X)]} \;\to\; \frac{2 z_{1-\alpha/2} \mathbb{E}_X[\sigma(X)]}{2 \sigma_+ z_{1-\alpha/2}} \;=\; \frac{\mathbb{E}_X[\sigma(X)]}{\sigma_+}.

Rearranging gives the theorem statement. The ratio EX[σ(X)]/σ+1\mathbb{E}_X[\sigma(X)] / \sigma_+ \le 1 with equality iff σ(X)=σ+\sigma(X) = \sigma_+ almost surely (the homoscedastic case), so the CQR width is bounded above by the split-conformal width with equality only in the homoscedastic limit. \square

The numerical implication for Running Example 1 (σ(x)=0.2+0.6x/3\sigma(x) = 0.2 + 0.6|x|/3 on XUniform(3,3)X \sim \mathrm{Uniform}(-3, 3)): σ+=0.8\sigma_+ = 0.8, EX[σ(X)]=0.5\mathbb{E}_X[\sigma(X)] = 0.5, so the asymptotic ratio is 0.5/0.8=0.6250.5 / 0.8 = 0.625 — CQR’s band should be roughly 62.5%62.5\% of split conformal’s width. The §6 empirical comparison checks this prediction directly.
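The arithmetic behind that 0.6250.625, checked on a grid (a sketch; same numbers as the closed form):

```python
import numpy as np

# RE1 heteroscedasticity: sigma(x) = 0.2 + 0.6|x|/3 with X ~ Uniform(-3, 3).
x = np.linspace(-3.0, 3.0, 600_001)
sigma_x = 0.2 + 0.6 * np.abs(x) / 3.0
mean_sigma = sigma_x.mean()          # grid average approximates E_X[sigma(X)]
sigma_plus = sigma_x.max()           # sigma_+ attained at |x| = 3
print(mean_sigma, sigma_plus, mean_sigma / sigma_plus)   # ~ 0.5, 0.8, 0.625
```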

The Gaussian assumption in the proof can be relaxed; what’s needed is that the conditional CDF FYXF_{Y \mid X} be a location-scale family in σ(x)\sigma(x), so the zz-quantile factors out cleanly. Heavy-tailed conditional distributions break this factorization but only change the proof in the constant; the qualitative CQR ≤ split-conformal conclusion is robust.

Empirical CQR/SC width ratio plotted against theoretical E[sigma]/sigma_+ on a heteroscedasticity sweep; empirical points cluster around the identity line.
Figure 12. Theorem 5.2 verification on RE1. Empirical width ratio CQR/SC plotted against the theoretical ratio $\mathbb{E}_X[\sigma(X)]/\sigma_+$ on a heteroscedasticity-strength sweep (slope $\in [0, 0.6]$). Empirical points (green) cluster around the identity line (red dashed), confirming the asymptotic prediction within Monte Carlo error.

Theorem 5.3 — HL / conformal asymptotic equivalence. The third bridge connects Constructions I and III. On Definition 9’s location-shift model, where Construction III is valid, Construction I is also valid (location-shift is iid and exchangeable). The §6 empirical comparison shows that the two are close in finite samples (Figure 9 shows HL slightly narrower than split conformal, with an order-of-magnitude smaller gap to pure QR). Theorem 5.3 says this is no accident: in the limit, the two are the same band.

Theorem 5.3 (HL / conformal asymptotic equivalence).

Under the location-shift model of Definition 9 with continuous symmetric residual distribution FF (so F=FF = -F and FF has density ff), and a consistent base predictor μ^μ\hat{\mu} \to \mu in L2(PX)L^2(\mathbb{P}_X), both the HL-style and split-conformal prediction intervals converge to the same population symmetric interval around μ(x)\mu(x):

C^αHL(x),  C^αSC(x)  P  [μ(x)F1(1α/2),  μ(x)+F1(1α/2)]\hat{C}^{\mathrm{HL}}_\alpha(x), \;\hat{C}^{\mathrm{SC}}_\alpha(x) \;\xrightarrow{P}\; \big[\mu(x) - F^{-1}(1 - \alpha/2),\; \mu(x) + F^{-1}(1 - \alpha/2)\big]

as ncaln_{\mathrm{cal}} \to \infty, where the convergence is pointwise in xx. In particular, the conditional / marginal distinction also vanishes — both intervals achieve nominal 1α1 - \alpha coverage conditional on xx in the limit.

Proof.

Split conformal. The calibration scores are |Y_i − μ̂(X_i)| = |ε_i + b(X_i)|, with b(X_i) = μ(X_i) − μ̂(X_i) → 0 in L²(P_X) by consistency. Thus |Y_i − μ̂(X_i)| → |ε_i| in distribution, and the empirical CDF of the calibration scores converges uniformly to the CDF of |ε|. The conformal (1−α)-quantile q̂_{1−α} of these scores converges to F_{|ε|}^{−1}(1−α), which under symmetry equals F^{−1}(1 − α/2), the upper α/2-quantile of ε: |ε| exceeds w iff ε < −w or ε > w, and by symmetry these two events have equal mass. The split-conformal band μ̂(x) ± q̂_{1−α} therefore converges to μ(x) ± F^{−1}(1 − α/2), the stated population interval.
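The split-conformal half of the argument is easy to check numerically. A minimal sketch, assuming (as in the limiting regime of the proof) that μ̂ has already converged to μ, so the scores are exactly |ε_i|:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
alpha, n_cal = 0.1, 2000

# Location-shift model: Y = mu(X) + eps with eps ~ N(0, 0.5^2); with mu_hat = mu
# the conformity scores |Y - mu_hat(X)| reduce to |eps|
eps = rng.normal(0.0, 0.5, n_cal)
scores = np.abs(eps)

# Conformal (1 - alpha)-quantile with the finite-sample +1 correction
k = int(np.ceil((1 - alpha) * (n_cal + 1)))
q_hat = np.sort(scores)[k - 1]

# Population limit F_{|eps|}^{-1}(1 - alpha) = 0.5 * z_{1 - alpha/2}
q_pop = 0.5 * NormalDist().inv_cdf(1 - alpha / 2)
print(q_hat, q_pop)   # close for large n_cal
```

The two printed values agree to within Monte Carlo error, which is the convergence q̂_{1−α} → F^{−1}(1 − α/2) the proof step asserts.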

HL-style. The calibration residuals ri=Yiμ^(Xi)=εi+b(Xi)r_i = Y_i - \hat{\mu}(X_i) = \varepsilon_i + b(X_i) converge in distribution to εi\varepsilon_i as μ^μ\hat{\mu} \to \mu, so the Walsh averages Aij=(ri+rj)/2A_{ij} = (r_i + r_j)/2 converge in distribution to (εi+εj)/2(\varepsilon_i + \varepsilon_j)/2. Write FFF * F for the law of that average of two independent draws from FF; it is symmetric around zero, since a (rescaled) convolution of two zero-symmetric distributions is zero-symmetric. The empirical distribution of the Mncal2/2M \approx n_{\mathrm{cal}}^2/2 Walsh averages converges uniformly on compacts to FFF * F. The Wilcoxon critical value satisfies wα/Mα/2w_\alpha / M \to \alpha/2 as ncaln_{\mathrm{cal}} \to \infty, so the order statistics A(wα+1)A_{(w_\alpha + 1)} and A(Mwα)A_{(M - w_\alpha)} converge to the α/2\alpha/2 and 1α/21 - \alpha/2 quantiles of FFF * F.
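The Walsh-average machinery in this step can be sketched directly. A toy version, with residuals drawn as if μ̂ = μ and a plain rank cutoff at ⌊(α/2)M⌋ standing in for the exact Wilcoxon critical value w_α:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_cal = 0.1, 400
r = 0.5 * rng.standard_normal(n_cal)      # calibration residuals, F = N(0, 0.5^2)

# All Walsh averages (r_i + r_j)/2 over i <= j: M = n(n+1)/2 pairs
i, j = np.triu_indices(n_cal)
walsh = np.sort((r[i] + r[j]) / 2.0)
M = walsh.size

# Symmetric trim at rank w ~ (alpha/2) * M, mimicking w_alpha / M -> alpha/2
w = int(alpha / 2 * M)
lo, hi = walsh[w], walsh[M - 1 - w]
print(lo, hi)   # ~ the alpha/2 and 1 - alpha/2 quantiles of the Walsh-average law
```

The printed endpoints are symmetric about zero, as the zero-symmetry of the Walsh-average law predicts.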

Now the key step. By Hodges–Lehmann’s classical asymptotic-equivalence result for the Walsh-average median (see Rank Tests and Hodges–Lehmann 1963), the α/2\alpha/2 and 1α/21 - \alpha/2 quantiles of the convolution distribution FFF * F are asymptotically equivalent to the α/2\alpha/2 and 1α/21 - \alpha/2 quantiles of FF itself, in the sense that

FFF1(α/2)  =  F1(α/2)(1+o(1)),FFF1(1α/2)  =  F1(1α/2)(1+o(1)),F_{F*F}^{-1}(\alpha/2) \;=\; F^{-1}(\alpha/2) \cdot \big(1 + o(1)\big), \qquad F_{F*F}^{-1}(1 - \alpha/2) \;=\; F^{-1}(1 - \alpha/2) \cdot \big(1 + o(1)\big),

where the o(1)o(1) vanishes as the variance of FF goes to zero (the standard regime in which Walsh-averaging “improves” location estimation). For our purposes, the direction we need is: the HL band converges to μ(x)±F1(1α/2)\mu(x) \pm F^{-1}(1 - \alpha/2) modulo terms that scale with the noise variance.

By the symmetry of FF, F1(α/2)=F1(1α/2)F^{-1}(\alpha/2) = -F^{-1}(1 - \alpha/2), so the HL interval limit simplifies to μ(x)±F1(1α/2)\mu(x) \pm F^{-1}(1 - \alpha/2)exactly the split-conformal limit.

The conditional / marginal collapse follows because the limiting interval is μ(x)±F1(1α/2)\mu(x) \pm F^{-1}(1-\alpha/2), which by definition of F1(1α/2)F^{-1}(1-\alpha/2) contains εn+1\varepsilon_{n+1} with probability 1α1 - \alpha regardless of Xn+1X_{n+1}. The marginal probability is also 1α1 - \alpha (it’s the integral of a constant). \square

The asymptotic equivalence is one of those bridge results that recasts what looked like a methodological choice as a matter of finite-sample efficiency. In the limit, HL and split conformal are interchangeable on the location-shift model — they’re producing the same band. The choice between them is a question of which finite-sample correction you trust more (HL’s combinatorial correction or the conformal +1+1 correction) and what efficiency you pick up at finite nn (Figure 9’s HL ≤ split conformal width gap, which Theorem 5.3 says vanishes asymptotically).

HL/SC width ratio plotted against n_cal on a logarithmic axis, for n_cal from 50 to 2000.
Figure 13. Theorem 5.3 verification on RE2. HL/split-conformal width ratio plotted against $n_{\mathrm{cal}}$. The theorem predicts the ratio converges to 1, but at these sample sizes the observed trend is flat-to-decreasing: $0.89$ at $n_{\mathrm{cal}} = 50$, $0.74$ at $n_{\mathrm{cal}} = 1000$, $0.74$ at $n_{\mathrm{cal}} = 2000$. At $n_{\mathrm{cal}} = 2000$ the gap is still $\approx 25\%$, far from the asymptotic $1$.
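The figure's stubborn gap is easy to reproduce. A quick simulation under the proof's own assumptions (μ̂ = μ, Gaussian F, plain rank cutoffs in place of exact critical values) gives an HL/SC half-width ratio in the same range:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, n_cal = 0.1, 800
r = 0.5 * rng.standard_normal(n_cal)             # residuals; mu_hat = mu assumed

# Split-conformal half-width: (1 - alpha)-quantile of |r| with the +1 correction
k = int(np.ceil((1 - alpha) * (n_cal + 1)))
sc_half = np.sort(np.abs(r))[k - 1]

# HL-style half-width from the ordered Walsh averages, rank cutoff ~ (alpha/2)*M
i, j = np.triu_indices(n_cal)
walsh = np.sort((r[i] + r[j]) / 2.0)
w = int(alpha / 2 * walsh.size)
hl_half = (walsh[walsh.size - 1 - w] - walsh[w]) / 2.0
print(hl_half / sc_half)   # well below 1 at this n_cal, as in Figure 13
```

The ratio lands in the 0.7–0.8 band that Figure 13 reports, consistent with the caption's observation that the finite-sample gap is large.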
Companion readout for the Figure 12 sweep: at σ_max = 0 (homoscedastic) the CQR/SC width ratio is ≈ 1; at σ_max = 0.6 (RE1) the empirical ratio is 0.957 against the asymptotic Theorem 5.2 prediction E[σ(X)]/σ₊ = 0.625. Finite-n drift moves the empirical point away from the theory curve.

What the three bridges accomplish. Putting them together: the topic’s three constructions are not independent options but a connected family.

  • Theorem 5.1. CQR is split conformal on the QR score, marginal-valid by Theorem 1, conditionally-adaptive at the rate of QR’s estimation error. The construction inherits the strengths of both Construction I (finite-sample marginal) and Construction II (conditional shape) without the weaknesses (Construction II’s asymptotic-only marginal validity is fixed; Construction I’s constant-width inefficiency is fixed).
  • Theorem 5.2. Under heteroscedasticity, CQR is strictly narrower than split conformal in the average-width sense, with the gap closing only when the data are homoscedastic. The bound EX[σ(X)]/σ+\mathbb{E}_X[\sigma(X)] / \sigma_+ tells the practitioner exactly how much CQR can save.
  • Theorem 5.3. Under location-shift symmetry, HL and split conformal are asymptotically the same band. Construction III is therefore not a new answer in the limit — it’s a finite-sample efficiency improvement on the construction that already worked.

The full picture: under exchangeability, take CQR (best of marginal validity and conditional adaptivity); under location-shift symmetry, take HL with caveats — at finite nn the construction visibly undercovers in batch evaluation, as §6 will document. Under heteroscedasticity with arbitrary noise, CQR is strictly preferred over split conformal (Theorem 5.2 quantifies how much). §6 measures all of this empirically across four scenarios.

Empirical Comparison: Coverage, Width, Conditional Behavior, Cost

§§2–5 set up a unified score-function frame, three constructions within it, and three bridge theorems connecting them. This section measures the trade-offs empirically. Four constructions — split conformal, pure QR, CQR, HL — across four scenarios — homoscedastic Gaussian, heteroscedastic Gaussian (Running Example 1), heavy-tailed symmetric location-shift (Running Example 2), and a contaminated-noise robustness probe — yield a 4×44 \times 4 table of summary statistics that condenses the topic’s main practical recommendations into one plot.

The setup is deliberately constrained:

  • All four constructions use polynomial-feature base learners of the same order (degree-3 ridge for μ^\hat{\mu} in split conformal and HL; degree-3 quantile regression for q^α/2\hat{q}_{\alpha/2} and q^1α/2\hat{q}_{1-\alpha/2} in pure QR and CQR). Differences between constructions cannot then be attributed to a more flexible base class.
  • Sample sizes are matched: ntrain=ncal=500n_{\mathrm{train}} = n_{\mathrm{cal}} = 500 where calibration applies; ntrain=1000n_{\mathrm{train}} = 1000 for pure QR (which has no calibration step). Pure QR therefore sees the same total data budget as its conformal cousins — the comparison is on assumption strength, not data.
  • α=0.1\alpha = 0.1 throughout; nominal coverage 1α=0.91 - \alpha = 0.9.
  • Diagnostics are averaged over nrep=300n_{\mathrm{rep}} = 300 Monte Carlo draws of (X,Y)(X, Y) per scenario; ntest=2000n_{\mathrm{test}} = 2000 per draw.

Four scenarios.

Scenario A (homoscedastic Gaussian). YX=xN(sin(x),0.52),XUniform(3,3).Y \mid X = x \sim \mathcal{N}(\sin(x), 0.5^2), \qquad X \sim \mathrm{Uniform}(-3, 3). The textbook regression setup. Definition 9 holds with F=N(0,0.52)F = \mathcal{N}(0, 0.5^2) symmetric, so all four constructions are valid in principle.

Scenario B = Running Example 1 (heteroscedastic Gaussian). YX=xN(sin(x),σ(x)2),σ(x)=0.2+0.6x/3.Y \mid X = x \sim \mathcal{N}(\sin(x), \sigma(x)^2), \qquad \sigma(x) = 0.2 + 0.6|x|/3. Definition 9 fails (the residual is not independent of XX), but exchangeability holds. Construction I (split conformal) is valid but constant-width; pure QR is valid asymptotically with QR-shaped band; CQR is valid finite-sample and QR-shaped; HL is not valid here — its symmetry-and-independence assumption is broken. The scenario where CQR is at its best.

Scenario C = Running Example 2 (heavy-tailed location-shift). YX=xμ(x)+0.6t3,μ(x)=0.4cos(πx),XUniform(2,2).Y \mid X = x \sim \mu(x) + 0.6 \, t_3, \qquad \mu(x) = 0.4\cos(\pi x), \qquad X \sim \mathrm{Uniform}(-2, 2). Definition 9 holds with F=0.6t3F = 0.6 \, t_3 symmetric. All four constructions valid.

Scenario D (contaminated noise — robustness probe). YX=x{N(sin(x),0.32)w.p. 0.95N(sin(x),2.02)w.p. 0.05.Y \mid X = x \sim \begin{cases} \mathcal{N}(\sin(x), 0.3^2) & \text{w.p. } 0.95 \\ \mathcal{N}(\sin(x), 2.0^2) & \text{w.p. } 0.05 \end{cases}. A 95/5 mixture: most data tightly clustered around sin(x)\sin(x), but 5%5\% of observations are heavy-tailed contaminants. Symmetric and homoscedastic, so Definition 9 holds.
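The four data-generating processes can be reproduced directly from the formulas above. A hypothetical re-implementation (the helper name and scenario letters are illustrative, not the notebook's identifiers):

```python
import numpy as np

def sample_scenario(name, n, rng):
    """Draw (X, Y) from scenarios A-D as specified in the text."""
    if name == "A":                               # homoscedastic Gaussian
        x = rng.uniform(-3, 3, n)
        y = np.sin(x) + rng.normal(0.0, 0.5, n)
    elif name == "B":                             # heteroscedastic Gaussian (RE1)
        x = rng.uniform(-3, 3, n)
        sigma = 0.2 + 0.6 * np.abs(x) / 3
        y = np.sin(x) + sigma * rng.standard_normal(n)
    elif name == "C":                             # heavy-tailed location-shift (RE2)
        x = rng.uniform(-2, 2, n)
        y = 0.4 * np.cos(np.pi * x) + 0.6 * rng.standard_t(3, n)
    elif name == "D":                             # 95/5 contaminated noise
        x = rng.uniform(-3, 3, n)
        scale = np.where(rng.uniform(size=n) < 0.95, 0.3, 2.0)
        y = np.sin(x) + scale * rng.standard_normal(n)
    else:
        raise ValueError(name)
    return x, y

rng = np.random.default_rng(0)
data = {s: sample_scenario(s, 2000, rng) for s in "ABCD"}
```

Each scenario returns iid pairs, so split conformal and CQR are applicable everywhere; the location-shift structure needed by HL holds in A, C, and D but fails in B.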

The headline 4 × 4 table. Numbers are averages over nrep=300n_{\mathrm{rep}} = 300 Monte Carlo draws (the notebook’s verified output). The cond range column is the difference between max and min conditional coverage across 8 XX-bins (smaller is better — flat is the goal); runtime is per-fit milliseconds.

| Scenario | Construction | Marg cov | Mean width | Cond range | Runtime (ms) |
|---|---|---|---|---|---|
| A: Homoscedastic Gaussian | Split conformal | 0.899 ± 0.015 | 1.660 | 0.056 | 0.4 |
| | Pure QR | 0.896 ± 0.013 | 1.651 | 0.071 | 45.1 |
| | CQR | 0.900 ± 0.013 | 1.680 | 0.083 | 16.6 |
| | HL | **0.757 ± 0.019** | 1.180 | 0.080 | 4.3 |
| B: Heteroscedastic (RE1) | Split conformal | 0.899 ± 0.015 | 1.768 | **0.242** | 0.4 |
| | Pure QR | 0.897 ± 0.011 | 1.657 | 0.110 | 44.3 |
| | CQR | **0.900 ± 0.016** | **1.686** | **0.115** | 16.6 |
| | HL (broken) | **0.789 ± 0.017** | 1.257 | 0.384 | 4.2 |
| C: Heavy-tailed (RE2) | Split conformal | 0.900 ± 0.014 | 2.978 | 0.058 | 0.4 |
| | Pure QR | 0.896 ± 0.011 | 2.953 | 0.072 | 42.6 |
| | CQR | 0.901 ± 0.014 | 3.067 | 0.085 | 16.1 |
| | HL | **0.817 ± 0.021** | **2.240** | 0.084 | 4.2 |
| D: Contaminated noise | Split conformal | 0.899 ± 0.015 | 1.146 | 0.064 | 0.4 |
| | Pure QR | 0.895 ± 0.011 | 1.133 | 0.072 | 44.3 |
| | CQR | 0.901 ± 0.015 | 1.183 | 0.077 | 16.6 |
| | HL | **0.827 ± 0.031** | **0.928** | 0.083 | 4.2 |

Three patterns deserve named attention.

Pattern 1 — HL undercovers in batch evaluation across every scenario. This is the most striking row in the table and the most important practical takeaway in the topic. HL’s nominal coverage is 0.90.9 per Theorem 3, but its empirical batch coverage averages 0.7980.798 across the four scenarios — between 77 and 1414 percentage points below target. The shortfall is not a violation of Theorem 3: the theorem promises conditional coverage at a fixed test point xx, with probability over the calibration sample and the test response. The empirical statistic in the table averages over a fresh batch of ntest=2000n_{\mathrm{test}} = 2000 test points sharing a fixed calibration set per replication, then averages over 300300 replications. That average is closer to Ecal[P(YC^X,cal)]\mathbb{E}_{\text{cal}}[\mathbb{P}(Y \in \hat{C} \mid X, \text{cal})] — a different conditional structure that finite-sample HL doesn’t deliver 0.90.9 on. Split conformal and CQR don’t suffer the same shortfall because Theorem 1’s marginal guarantee applies in the right probability space for batch evaluation.

Pattern 2 — split conformal and CQR hit nominal marginal coverage everywhere; pure QR slips by 1–2 percentage points. Split conformal and CQR average 0.8990.899 and 0.9000.900 across the four scenarios — Theorem 1’s guarantee in action. Pure QR averages 0.8960.896 — its asymptotic-only Theorem 2 leaves a 1–2pp finite-sample gap that batch averaging surfaces as a real cost.

Pattern 3 — width rankings flip across scenarios in the way the bridge theorems predict. On Scenario B (heteroscedastic), CQR’s band is 5%\approx 5\% narrower than split conformal’s (1.6861.686 vs 1.7681.768), with the asymptotic Theorem 5.2 prediction EX[σ(X)]/σ+=0.625\mathbb{E}_X[\sigma(X)]/\sigma_+ = 0.625 — at finite nn we’re some way from that limit. On Scenarios A, C, D (homoscedastic-ish) CQR and split conformal are within 1%1\% of each other — Theorem 5.2’s homoscedastic-limit equivalence in action. On every scenario, HL produces narrower bands than split conformal — but only because it’s giving up coverage. Reading “HL is narrower” as an efficiency win without checking marginal coverage is the trap §6 was designed to surface.
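The two coverage diagnostics behind these patterns (marginal coverage and the 8-bin conditional range) are straightforward to compute. A sketch; the equal-mass binning and `n_bins=8` are assumptions about the notebook's exact choices:

```python
import numpy as np

def coverage_diagnostics(x, y, lo, hi, n_bins=8):
    """Marginal coverage and max-minus-min conditional coverage over X-bins."""
    covered = (y >= lo) & (y <= hi)
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    per_bin = np.array([covered[bins == b].mean() for b in range(n_bins)])
    return covered.mean(), per_bin.max() - per_bin.min()

# A constant-width band on heteroscedastic data (RE1-style): good marginal
# coverage, but a large conditional range -- the split-conformal signature
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 20000)
sigma = 0.2 + 0.6 * np.abs(x) / 3
y = np.sin(x) + sigma * rng.standard_normal(x.size)
marg, cond_range = coverage_diagnostics(x, y, np.sin(x) - 0.9, np.sin(x) + 0.9)
print(marg, cond_range)
```

The constant-width band over-covers near x = 0 (small σ) and under-covers at the edges (large σ), so the conditional range is large even though the marginal number looks healthy — exactly the Scenario B pattern in the table.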

| Construction | Live marg | Live width | Live cond Δ | Live ms | Notebook marg | Notebook width | Notebook cond Δ |
|---|---|---|---|---|---|---|---|
| split conformal | 0.890 | 1.707 | 0.268 | 0.3 | 0.899 | 1.768 | 0.242 |
| pure QR | 0.911 | 1.682 | 0.120 | 170.1 | 0.897 | 1.657 | 0.110 |
| CQR | 0.917 | 1.737 | 0.122 | 105.4 | 0.900 | 1.686 | 0.115 |
| HL | 0.792 | 1.245 | 0.413 | 11.8 | 0.789 | 1.257 | 0.384 |

Cycle through scenarios A→B→C→D and watch the live readouts converge to the notebook column. On Scenario B the CQR row should narrow visibly relative to split conformal (Theorem 5.2 efficiency win). Across all four scenarios HL's marginal coverage stays stuck around 0.76–0.83 — the batch under-coverage flagged in §6.5. Drag n_cal up to 2000 to confirm HL doesn't recover (the gap is structural, not a finite-sample artifact).

Marquee 4-panel overlay (one panel per scenario A/B/C/D) with split conformal, CQR, and HL bands overlaid on the scatter, and per-panel marginal coverage and mean width readouts.
Figure 14. The headline overlay. Four panels (one per scenario) showing scatter plus three bands: split conformal (blue), CQR (green), HL (purple). Pure QR is omitted from the overlay for visual clarity (it would clutter four already-busy plots) and reported in the table only. Reading: in Scenario B, the green CQR band hugs the data (narrow at $x = 0$, wide at $x = \pm 3$); the blue split-conformal band is constant-width; the purple HL band is constant-width-and-undercovering. In Scenarios A, C, and D, the bands look superficially similar in shape — but HL's marginal coverage is well below nominal.
Heatmap of the 4×4 table colored by deviation from the optimal value in each column.
Figure 15. The 4×4 table as a heatmap. Cells are colour-coded by deviation from the optimal value in each diagnostic column (lowest mean width, smallest conditional range, highest marginal coverage hitting target). Reads as a one-glance summary of which construction wins on which scenario by which metric — and where each construction visibly fails.

Runtime comparison. The runtime column is small and constant for split conformal (0.4\approx 0.4 ms — a single sort), 100×\approx 100\times larger for pure QR (45\approx 45 ms — two LP solves for the τ{α/2,1α/2}\tau \in \{\alpha/2, 1-\alpha/2\} quantile fits in sklearn), 40×\approx 40\times larger for CQR (17\approx 17 ms — same quantile fits but on smaller training fold), and 10×\approx 10\times for HL (4\approx 4 ms — Walsh averages are O(ncal2)O(n_{\mathrm{cal}}^2) per fit, then median computation, then critical-value lookup).

The O(ncal2)O(n_{\mathrm{cal}}^2) scaling of HL is the most consequential — at ncal=5000n_{\mathrm{cal}} = 5000 it’s already >1> 1 s per fit, and at ncal=50000n_{\mathrm{cal}} = 50000 it’s prohibitive. Split conformal and CQR scale gracefully to ncal=106n_{\mathrm{cal}} = 10^6; HL doesn’t past ncal=104n_{\mathrm{cal}} = 10^4.

Per-fit runtime for each of the four constructions plotted against n_cal on log-log axes for n_cal in [100, 10000]. HL slope is ~2 (quadratic) while the others are ~1 (linear).
Figure 16. Per-fit runtime scaling. Split conformal, CQR, and pure QR scale linearly in $n_{\mathrm{cal}}$ (slope $\approx 1$ on log-log axes); HL scales quadratically (slope $\approx 2$) because of its Walsh-average enumeration.

Practitioner’s algorithm. The empirical evidence consolidates into a single decision rule:

  1. If you have any reason to suspect heteroscedasticity → use CQR. Scenario B is unambiguous: CQR is narrower than split conformal with identical marginal coverage and a much flatter conditional-coverage profile (range 0.1150.115 vs 0.2420.242). The bound from Theorem 5.2 says the gap grows with the heteroscedasticity ratio.
  2. Use split conformal as the always-defensible baseline. Marginal guarantee is finite-sample, distribution-free, rate-free in QR’s estimation error. Smallest runtime by an order of magnitude. Only cost is a constant-width band — wasteful under heteroscedasticity, fine otherwise.
  3. Avoid HL in batch-prediction settings. Despite Theorem 3’s strong-on-paper conditional guarantee, HL’s batch coverage averages 0.7980.798 across the four scenarios — not the nominal 0.90.9. The narrower bands are not an efficiency win; they’re a coverage loss. HL remains useful for single-test-point prediction with a fresh calibration draw (which is what Theorem 3 actually promises), but for batch evaluation with shared calibration the construction systematically undercovers.
  4. Avoid pure QR alone in production. The 1–2pp marginal-coverage shortfall is real and correctable by composition with split conformal — which is exactly what CQR does. There is no scenario in the table where pure QR strictly dominates CQR.

Limits, Connections, and What’s Out of Scope

The topic has covered three constructions, three bridge theorems, and an empirical comparison. This section closes by being honest about what’s not covered — the boundaries of when these methods work, the alternative constructions we deliberately set aside, and the related topics on the same site that pick up where this one stops.

Bootstrap as a contrast

The three constructions in this topic are not the only way to build a prediction interval. The most common alternative — and one that often performs well in practice — is the bootstrap-percentile prediction interval: resample the training data with replacement BB times, refit the predictor on each resample, and take the empirical α/2\alpha/2 and 1α/21 - \alpha/2 quantiles of the resulting predicted-residual distribution at the test point. The construction sits in the same general family as the three featured here — all four use resampling-flavored arguments to bypass parametric noise assumptions — but the bootstrap operates differently along two key axes:

| Axis | Conformal / CQR / HL | Bootstrap-percentile |
|---|---|---|
| Resampling principle | Permutation / exchangeability | Sampling with replacement |
| Coverage guarantee | Finite-sample (under the relevant assumption) | Asymptotic only |
| Computational cost | O(n_cal) to O(n_cal²) for one fit | O(B · fit cost) for B refits |
| Validity scope | Exchangeable / iid / iid-symmetric | iid + Edgeworth-expansion conditions |

The bootstrap’s coverage validity rests on Edgeworth-expansion arguments (Hall 1992) that require iid data and smooth-enough moment conditions on the residual distribution. It buys nothing over CQR or split conformal under those assumptions — it gets asymptotic validity where they already had finite-sample validity — and at B=200B = 200 refits it’s typically two orders of magnitude slower per prediction interval. The cases where bootstrap genuinely shines are nested-model settings where the test-statistic-of-interest doesn’t admit a clean exchangeability formulation: bootstrapping a complicated functional of the data (e.g., a confidence interval for an R2R^2 or a difference-in-means with covariate adjustment) is often the only practical route.

For prediction intervals specifically, the bootstrap is rarely the right tool when conformal-style methods are available. We don’t recommend it as a default, but we flag it because it’s the construction practitioners most often use when they don’t know the methods in this topic. The formal treatment of the bootstrap and its theoretical foundations is in Bootstrap; a side-by-side empirical comparison with the three constructions in this topic is left as an exercise.
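For concreteness, a minimal sketch of the bootstrap-percentile construction described above. The helper name, the degree-3 polynomial base fit, and the residual-resampling step are illustrative choices, not the notebook's:

```python
import numpy as np

def bootstrap_percentile_pi(x_tr, y_tr, x_te, alpha=0.1, B=200, deg=3, seed=0):
    """Refit on B resamples, add resampled residuals, take empirical quantiles."""
    rng = np.random.default_rng(seed)
    n = y_tr.size
    sims = np.empty((B, x_te.size))
    for b in range(B):
        idx = rng.integers(0, n, n)                    # resample with replacement
        coef = np.polyfit(x_tr[idx], y_tr[idx], deg)   # refit the base predictor
        resid = y_tr[idx] - np.polyval(coef, x_tr[idx])
        sims[b] = np.polyval(coef, x_te) + rng.choice(resid, x_te.size)
    return (np.quantile(sims, alpha / 2, axis=0),
            np.quantile(sims, 1 - alpha / 2, axis=0))

# Homoscedastic check: coverage should be near nominal, but only asymptotically
rng = np.random.default_rng(4)
x_tr = rng.uniform(-3, 3, 500); y_tr = np.sin(x_tr) + rng.normal(0, 0.5, 500)
x_te = rng.uniform(-3, 3, 2000); y_te = np.sin(x_te) + rng.normal(0, 0.5, 2000)
lo, hi = bootstrap_percentile_pi(x_tr, y_tr, x_te)
print(np.mean((y_te >= lo) & (y_te <= hi)))
```

Note the cost structure the table above describes: B = 200 full refits per interval, against a single sort for split conformal, with no finite-sample guarantee in return.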

What’s out of scope

  • Bayesian credible intervals. A Bayesian posterior predictive interval is not a frequentist prediction interval. The two have different probability semantics — one is a statement about a posterior over Yn+1Y_{n+1} given the data and a prior; the other is a statement about a long-run frequency of coverage under repeated sampling. The Bayesian construction is treated in T5 (Bayesian ML) under Bayesian Neural Networks (coming soon) and related topics. The two communities sometimes use the same term (“credible interval” vs. “prediction interval”) for different things; this topic remains strictly frequentist.

  • Full / transductive conformal prediction. Conformal Prediction covers this in detail. Full conformal achieves the same finite-sample marginal coverage as split conformal but requires retraining the base predictor for every candidate yy-value, making it computationally infeasible for most modern ML models. Split conformal is the practical default and the version this topic focuses on. The trade-off is purely computational; the coverage guarantee is identical.

  • Mondrian conformal and other conditional-coverage refinements. Vovk (2003) introduced Mondrian conformal — partition the feature space into groups and apply split conformal independently on each group — as a way to recover group-conditional coverage. Foygel-Barber et al. (2021) proved a sharp impossibility theorem on full pointwise conditional coverage (already cited in Conformal Prediction), which is why CQR offers conditional adaptivity but not conditional validity. Mondrian-style and other group-conditional methods sit between marginal-only and the impossible pointwise-conditional ideal. We mention them here but defer the formal treatment to a planned future topic.

Forward connections

  • Online and adaptive conformal. All three constructions in this topic require exchangeability of training and test data. In streaming and time-series settings, this fails: distribution shift breaks the rank-uniformity argument, and a method calibrated last week may under-cover this week. Vovk (2002) and Gibbs–Candès (2021) develop online and adaptive conformal methods that maintain coverage by tracking miscoverage online and updating the threshold accordingly. A natural follow-up topic in T4 once the foundational methods are in place.

  • Covariate-shift conformal. When training and test distributions differ in PX\mathbb{P}_X but share PYX\mathbb{P}_{Y \mid X}, Tibshirani et al. (2019) show how to recover marginal coverage via importance weighting of the calibration scores. Less general than the online setting but more tractable; another candidate T4 follow-up.

  • Adaptive prediction sets for classification (APS). Conformal Prediction covers the classification analogue. The same score-function framework (Definition 6 in §2.1) accommodates set-valued rather than interval-valued prediction; APS is a particular score that yields adaptively sized prediction sets. The conditional-coverage refinements there parallel CQR’s role here.

Cross-site prerequisites

  • Confidence Intervals & Duality — the formal foundation for both HL-style test-inversion (§4) and the conformal (1α)(1-\alpha)-quantile threshold (§2.3). The duality between a level-α\alpha test and a (1α)(1-\alpha) confidence region is the abstract machinery behind every interval construction in this topic.

  • Order Statistics & Quantiles — split-conformal’s quantile of conformity scores (Definition 7), QR’s empirical conditional-quantile estimator (§3), and HL’s Walsh-average ordering (§4) all rest on order-statistic theory. The asymptotic theory of empirical quantiles underlies the stability argument in Theorem 5.1 and the HL/conformal equivalence in Theorem 5.3.

  • Empirical Processes — the asymptotic alternative to finite-sample exchangeability arguments. The bootstrap discussion in §7.1 leans on Edgeworth expansions, the QR asymptotics cited in §3.3 use empirical-process limit theorems, and Theorem 5.1’s proof appeals to uniform stability of empirical-quantile order statistics.

  • Bootstrap — self-contained treatment of the bootstrap principle, percentile and BCa intervals, and the conditions under which bootstrap-percentile intervals are valid — the third resampling-based interval-construction method, contrasted with conformal/CQR/HL in §7.1 of this topic.

Connections

  • conformal-prediction (direct prereq). Provides Theorem 1 (split-conformal marginal coverage), cited verbatim in §2 and §5.1, and the score-function frame extended in §2.1. Every construction in this topic is an (s, q) pair in the score-function abstraction introduced there.
  • quantile-regression (direct prereq). Provides Theorem 3 from QR §5 (cited as Theorem 2 here in §3.3) and the QR base learner used by pure QR (§3) and CQR (§5.1). The τ-quantile fitting machinery is reused as-is.
  • rank-tests (direct prereq). Theorem 10 from rank-tests §6 (Hodges-Lehmann distribution-free CI) is cited in the proof of Theorem 3 here in §4.3. The Walsh-average construction and signed-rank null distribution are both reused in §4.
  • statistical-depth (T4 track closer). Multivariate prediction regions inherit the depth-conformal connection: a depth-based prediction region built from the calibration sample's residual depths is the multivariate analogue of the quantile-based univariate prediction interval developed here, with shapes that adapt to the residual geometry rather than coordinate-aligned boxes.

References & Further Reading