
Quantile Regression

The Koenker–Bassett 1978 estimator: pinball-loss minimization, LP reformulation, asymptotic normality, multi-quantile rearrangement, and the base learner of CQR

The Pinball Loss and the Population Quantile

Quantile regression rests on a single asymmetric loss function, the check or pinball loss. Where squared loss penalises positive and negative residuals symmetrically and recovers the conditional mean at its minimum, the pinball loss penalises them asymmetrically and recovers the conditional quantile. The next two results make this statement precise.

Definition 1 (Pinball loss).

For $\tau \in (0, 1)$ and any real residual $u$, the pinball loss at level $\tau$ is

$$\rho_\tau(u) = u\,\bigl(\tau - \mathbb{1}\{u < 0\}\bigr) = \max\bigl\{\tau u,\ (\tau - 1)\,u\bigr\}.$$

Equivalently, $\rho_\tau(u)$ is a piecewise-linear V with slope $\tau$ on the right half-line and slope $\tau - 1$ on the left. At $\tau = 0.5$, $\rho_{0.5}(u) = \tfrac{1}{2}|u|$ is half the absolute loss; for $\tau \neq 0.5$ the V is asymmetric, with the steeper side in whichever direction errors cost more.
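
A minimal numpy rendering of Definition 1 — a quick sketch (the helper name pinball is ours) checking that the indicator form and the max form agree on random residuals:

import numpy as np

def pinball(u, tau):
    """Pinball loss ρ_τ(u) in the indicator form u·(τ − 1{u < 0})."""
    return u * (tau - (u < 0))

u = np.random.default_rng(0).normal(size=1_000)
for tau in (0.1, 0.5, 0.9):
    max_form = np.maximum(tau * u, (tau - 1) * u)
    assert np.allclose(pinball(u, tau), max_form)   # the two forms in Definition 1 coincide
print('indicator form == max form on 1,000 random residuals, τ ∈ {0.1, 0.5, 0.9}')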

Theorem 1 (Pinball minimization recovers the population quantile).

Let $Y$ be a real random variable with cumulative distribution function $F$ and finite mean. For each $\tau \in (0, 1)$, the $\tau$-quantile

$$\xi_\tau = \inf\bigl\{q \in \mathbb{R} : F(q) \geq \tau\bigr\}$$

minimises the expected pinball loss:

$$\xi_\tau \in \arg\min_{q \in \mathbb{R}} \mathbb{E}\bigl[\rho_\tau(Y - q)\bigr].$$
Proof.

Write the expected loss as a function of $q$:

$$L(q) = \mathbb{E}\bigl[\rho_\tau(Y - q)\bigr] = \tau \int_q^\infty (y - q)\, dF(y) + (1 - \tau) \int_{-\infty}^q (q - y)\, dF(y).$$

The first term penalises $Y > q$ at rate $\tau$ (the “right side”); the second penalises $Y < q$ at rate $1 - \tau$ (the “left side”). Differentiate $L(q)$ under the integral. For the first term, the Leibniz boundary contribution vanishes (the integrand $y - q$ is zero at $y = q$), and differentiating the integrand leaves $-\tau\,(1 - F(q))$. For the second term, the boundary contribution vanishes in the same way, leaving $(1 - \tau)\, F(q)$. Adding:

$$L'(q) = -\tau\,(1 - F(q)) + (1 - \tau)\, F(q) = F(q) - \tau.$$

Setting $L'(q) = 0$ gives $F(q) = \tau$, i.e., $q = \xi_\tau$. Wherever $F$ is differentiable, the second derivative is $L''(q) = f(q) \geq 0$, so $L$ is convex; whenever the density $f$ is strictly positive at $\xi_\tau$, the minimiser is unique. (When $F$ is flat at $\xi_\tau$, the argmin set is an interval; we take the smallest minimiser by convention.) $\square$

Numerical verification: sample $Y \sim \mathcal{N}(0, 1)$ and minimize the empirical pinball loss over a fine grid of $q$ for several $\tau$ — the empirical argmin should track $\Phi^{-1}(\tau)$ as $n$ grows.

import numpy as np
from scipy.stats import norm

np.random.seed(0)
y_sample = norm.rvs(size=20_000, random_state=1)
q_grid = np.linspace(-3, 3, 601)

print('Empirical vs theoretical quantile at n = 20,000:')
print(f'{"tau":>6} | {"Phi^-1(tau)":>12} | {"argmin emp loss":>16} | {"diff":>8}')
for tau in [0.10, 0.25, 0.50, 0.75, 0.90]:
    L_grid = np.array([
        np.mean(np.where(y_sample >= q,
                         tau * (y_sample - q),
                         (tau - 1) * (y_sample - q)))
        for q in q_grid
    ])
    q_hat = q_grid[L_grid.argmin()]
    q_true = norm.ppf(tau)
    print(f'{tau:>6.2f} | {q_true:>12.4f} | {q_hat:>16.4f} | {q_hat - q_true:>+8.4f}')

Remark (Pinball-loss derivative requires no smoothness on Y).

The proof opens no black boxes — it does not require $Y$ to have a density, finite variance, or bounded support. The piecewise-linear structure of $\rho_\tau$ is doing all of the work: differentiating it produces the indicator $\mathbb{1}\{u < 0\}$, and the indicator’s expectation is exactly $F$.

Remark (Conditional version: argmin over functions = conditional τ-quantile).

The same calculation, conditioned on $X = x$, gives the conditional quantile $\xi_\tau(Y \mid X = x)$. This is what quantile regression estimates: the argmin over functions $q(x)$ of $\mathbb{E}[\rho_\tau(Y - q(X))]$ is the conditional $\tau$-quantile function of $Y$ given $X$, almost surely. The QR estimator replaces the population expectation with the empirical mean and the function class with a parametric one (the linear span of the features).

Linear Quantile Regression

Theorem 1 says the population $\tau$-quantile of $Y$ is recovered by minimising the expected pinball loss. The Koenker–Bassett 1978 quantile regression estimator is the empirical-version twin of that statement: minimise the sample pinball loss over a parametric family of conditional-quantile candidates.

Definition 2 (Linear quantile regression estimator (Koenker–Bassett 1978)).

Given features $X_i \in \mathbb{R}^p$ and responses $Y_i \in \mathbb{R}$ for $i = 1, \ldots, n$, the linear quantile regression estimator at level $\tau \in (0, 1)$ is

$$\hat\beta(\tau) = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \rho_\tau\bigl(Y_i - X_i^\top \beta\bigr).$$

The fitted conditional quantile at a new point $x$ is $\hat q_\tau(x) = x^\top \hat\beta(\tau)$. When the feature vector includes a constant (intercept), the model is

$$Y_i = X_i^\top \beta(\tau) + e_i, \qquad \xi_\tau(e_i \mid X_i) = 0.$$

The workhorse fit-and-predict helper used throughout this topic — a thin wrapper around scikit-learn’s QuantileRegressor (which calls HiGHS) on degree-3 polynomial features:

from sklearn.linear_model import QuantileRegressor
from sklearn.preprocessing import PolynomialFeatures

def fit_predict_quantile(x_train, y_train, x_eval, tau, alpha_l2=0.01, degree=3):
    """Quantile regression at level τ on degree-`degree` polynomial features.

    alpha_l2: small stabilising penalty passed to QuantileRegressor's alpha,
              which scikit-learn applies as an L1 penalty (kept tiny so KB78
              asymptotics are not distorted).
    """
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    Phi_train = poly.fit_transform(np.asarray(x_train).reshape(-1, 1))
    Phi_eval = poly.transform(np.asarray(x_eval).reshape(-1, 1))
    qr = QuantileRegressor(quantile=tau, alpha=alpha_l2, solver='highs')
    qr.fit(Phi_train, y_train)
    return qr.predict(Phi_eval)

The KKT condition $L'(q) = F(q) - \tau = 0$ has a clean empirical analog: at the QR optimum, approximately a fraction $\tau$ of training residuals are negative (exactly, up to the $p$ interpolated points).

# Demo data (assumed here; any 1-D heteroscedastic sample works).
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 4, 300)
y_demo = np.sin(x_demo) + (0.3 + 0.2 * x_demo) * rng.standard_normal(300)

# At each τ, the fraction of training residuals < 0 should ≈ τ.
for tau in [0.10, 0.50, 0.90]:
    fit_at_train = fit_predict_quantile(x_demo, y_demo, x_demo, tau)
    print(f'τ = {tau}:  P(Y < q̂(X)) = {(y_demo < fit_at_train).mean():.3f}')

Remark (When does linear QR target the right thing?).

If $Y = X^\top \beta_0 + \varepsilon$ with $\varepsilon$ independent of $X$ and $\xi_\tau(\varepsilon) = c_\tau$, then $\xi_\tau(Y \mid X = x) = x^\top \beta_0 + c_\tau$. Linear QR with an intercept recovers $(\beta_0, c_\tau)$ exactly in the population. Under heteroscedastic noise — say $Y = X^\top \beta_0 + \sigma(X)\,\varepsilon$ — the conditional quantile

$$\xi_\tau(Y \mid X = x) = x^\top \beta_0 + \sigma(x)\, c_\tau$$

is not linear in $x$ unless $\sigma$ is. The fix is to enrich the feature map (polynomial, spline, RBF) so the linear span includes the conditional quantile function. That is the path we take throughout this topic: degree-3 polynomial features applied to a univariate covariate.

Remark (Why an intercept matters).

Without an intercept, the line is forced through the origin, and the QR estimator can pick up bias to compensate; with an intercept, the estimator decomposes cleanly into a slope (the location) and an intercept that absorbs the noise quantile $c_\tau$. Standard practice — and what every implementation we use here does — is to include an intercept either as a separate feature column or as the bias term of an LP solver.
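
A minimal illustration of the remark, assuming QuantileRegressor's fit_intercept flag: on a linear signal with Gaussian noise, the intercept-free fit at $\tau = 0.9$ must distort the slope to absorb the noise quantile $c_{0.9} = \Phi^{-1}(0.9) \approx 1.28$:

rng = np.random.default_rng(2)
x = rng.uniform(1, 3, 500)
y = 2.0 * x + rng.standard_normal(500)              # true slope 2, intercept 0
for fit_intercept in (True, False):
    qr = QuantileRegressor(quantile=0.9, alpha=0.0, solver='highs',
                           fit_intercept=fit_intercept)
    qr.fit(x.reshape(-1, 1), y)
    # With the intercept: slope ≈ 2 and intercept ≈ c_0.9.
    # Without: the slope inflates to compensate for the missing shift.
    print(f'fit_intercept={fit_intercept}:  slope = {qr.coef_[0]:.3f},  '
          f'intercept = {qr.intercept_:.3f}')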

The LP Reformulation

The pinball loss is piecewise-linear, so the QR optimisation is a linear program. This is more than a curiosity: it explains why sklearn’s QuantileRegressor uses HiGHS (a state-of-the-art LP solver), why QR is exact rather than iterative, and why the “interpolation” structure of the fit is combinatorial — exactly $p$ data points sit on the fitted line in the non-degenerate case.

The trick is to split each residual $r_i = Y_i - X_i^\top \beta$ into its positive and negative parts:

$$r_i = u_i^+ - u_i^-, \qquad u_i^+ = \max(r_i, 0), \quad u_i^- = \max(-r_i, 0).$$

Then $\rho_\tau(r_i) = \tau\, u_i^+ + (1 - \tau)\, u_i^-$, which is linear in the slack variables. Stacking the $n$ constraints:

Definition 3 (LP reformulation of QR).

Quantile regression at level $\tau$ is equivalent to the linear program

$$\begin{aligned} \min_{\beta,\, u^+,\, u^-}\; & \tau\, \mathbf{1}^\top u^+ + (1 - \tau)\, \mathbf{1}^\top u^- \\ \text{subject to}\; & y - X\beta = u^+ - u^-, \\ & u^+ \geq 0, \quad u^- \geq 0, \\ & \beta \in \mathbb{R}^p \;\text{(free)}. \end{aligned}$$

Stacking all variables into $z = (\beta, u^+, u^-) \in \mathbb{R}^{p + 2n}$, this is a standard-form LP with cost $c = (0_p,\ \tau \mathbf{1}_n,\ (1 - \tau)\mathbf{1}_n)$, equality constraint $A_{\text{eq}}\, z = b_{\text{eq}}$ with $A_{\text{eq}} = [X,\ I,\ -I]$ (so that $X\beta + u^+ - u^- = y$) and $b_{\text{eq}} = y$, and nonnegativity bounds on the slacks ($\beta$ is unbounded).

Remark (Combinatorial structure of the LP optimum).

A basic feasible solution of an LP has at most as many nonzero variables as equality constraints — $n$ here. At optimality, no row can have both $u_i^+ > 0$ and $u_i^- > 0$ (shrinking both by the same amount preserves feasibility and reduces cost), so each observation contributes at most one positive slack. The $p$ coordinates of $\beta$ occupy $p$ of the $n$ basic slots, leaving at most $n - p$ positive slacks. Hence at (at least) $p$ observations $u_i^+ = u_i^- = 0$ — exactly $p$ in the non-degenerate case — i.e., the QR fit interpolates $p$ of the $n$ data points. This is unique to QR: OLS interpolates none in general, and ridge only interpolates in the unpenalised, overparameterised limit.

Solving this LP directly with scipy.optimize.linprog makes Definition 3 concrete; the result agrees with QuantileRegressor to machine precision since both call HiGHS.

from scipy.optimize import linprog

def solve_qr_linprog(X, y, tau):
    """Solve QR via the LP reformulation using scipy.optimize.linprog (HiGHS).

    X: (n, p) design matrix (include constant column for intercept).
    y: (n,) responses.
    tau: quantile level in (0, 1).
    Returns: (beta, residuals, slack_pos, slack_neg).
    """
    n, p = X.shape
    # z = [β (p), u⁺ (n), u⁻ (n)],  total p + 2n variables.
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    # Equality constraint: X β + u⁺ - u⁻ = y    (slacks split residual sign)
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    b_eq = y
    # β unbounded; slacks ≥ 0
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method='highs')
    z = res.x
    beta = z[:p]
    return beta, y - X @ beta, z[p:p + n], z[p + n:]
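
A usage sketch of solve_qr_linprog checking both claims — the interpolation count from the remark above and the agreement with sklearn (the synthetic design and tolerances here are ours):

rng = np.random.default_rng(3)
n, tau = 40, 0.5
x = rng.uniform(0, 4, n)
X = np.column_stack([np.ones(n), x, x**2, x**3])    # p = 4, constant column included
y = np.sin(x) + 0.5 * rng.standard_normal(n)

beta_lp, resid, u_pos, u_neg = solve_qr_linprog(X, y, tau)
# Expect p = 4 zero residuals in the non-degenerate case (Remark 5).
print('interpolated points (|residual| < 1e-8):', int((np.abs(resid) < 1e-8).sum()))

qr = QuantileRegressor(quantile=tau, alpha=0.0, fit_intercept=False, solver='highs')
qr.fit(X, y)
print('max |Δβ| vs sklearn:', float(np.abs(beta_lp - qr.coef_).max()))
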
[Figure] The LP reformulation made concrete. Left: residuals at the τ = 0.5 optimum (n = 40, p = 4) — exactly p = 4 points sit on the fitted line (Remark 5), with the remaining n − p split between u⁺ > 0 (above) and u⁻ > 0 (below). Right: solving the same LP via scipy.linprog and via sklearn.QuantileRegressor agrees to machine precision (max |Δβ| ~ 1e-14).

Equivariance Under Monotone Transformations

Conditional quantiles have a property the conditional mean lacks: they commute with monotone transformations of $Y$. If $Y$ is income and we want the median of $\log(Y)$, the answer is $\log$ of the median of $Y$ — no Jensen correction required. This makes quantile regression unusually robust to the scale on which the response is modelled.

Theorem 2 (Equivariance of conditional quantiles).

Let $Y$ be a random variable with conditional distribution $F_{Y \mid X}$, and let $h : \mathbb{R} \to \mathbb{R}$ be a non-decreasing function. Then for every $\tau \in (0, 1)$ and every $x \in \mathrm{supp}(X)$,

$$\xi_\tau\bigl(h(Y) \mid X = x\bigr) = h\bigl(\xi_\tau(Y \mid X = x)\bigr).$$
Proof.

Recall $\xi_\tau(Y \mid X = x) = \inf\{q : F_{Y \mid X}(q \mid x) \geq \tau\}$. We need to express $F_{h(Y) \mid X}$ in terms of $F_{Y \mid X}$. Because $h$ is non-decreasing, for any $v \in \mathbb{R}$,

$$\{h(Y) \leq v\} = \{Y \leq h^{-1}_{+}(v)\},$$

where $h^{-1}_{+}(v) := \sup\{y : h(y) \leq v\}$ is the right-continuous inverse (when $h$ is invertible the two coincide; the right-continuous inverse handles $h$ with flat regions, and the proof goes through unchanged). Taking conditional probabilities,

$$F_{h(Y) \mid X}(v \mid x) = F_{Y \mid X}\bigl(h^{-1}_{+}(v) \mid x\bigr).$$

Now compute the conditional $\tau$-quantile of $h(Y)$:

$$\begin{aligned} \xi_\tau(h(Y) \mid X = x) &= \inf\bigl\{v : F_{h(Y) \mid X}(v \mid x) \geq \tau\bigr\} \\ &= \inf\bigl\{v : F_{Y \mid X}(h^{-1}_{+}(v) \mid x) \geq \tau\bigr\} \\ &= \inf\bigl\{v : h^{-1}_{+}(v) \geq \xi_\tau(Y \mid X = x)\bigr\} \quad \text{(definition of } \xi_\tau\text{)} \\ &= h\bigl(\xi_\tau(Y \mid X = x)\bigr). \end{aligned}$$

The last step uses non-decreasingness of $h$: $\{v : h^{-1}_{+}(v) \geq q^*\}$ is a right half-line in $v$ with infimum $h(q^*)$. Strict monotonicity is not needed. $\square$

Remark (Equivariance is about the conditional quantile, not the linear estimator).

Theorem 2 is a statement about conditional quantiles, not about the linear-QR estimator. If we fit linear QR to $Y$ on features $\Phi(X)$ and obtain $\hat\beta(\tau)$, there is no general guarantee that fitting linear QR to $h(Y)$ on $\Phi(X)$ returns $h$ applied to the original fit at every test point. Two reasons the equality fails in finite samples and finite-dimensional feature classes: (i) the linear span of $\Phi(X)$ is closed under linear combinations but not under composition with $h$, so the function class capable of representing $\xi_\tau(h(Y) \mid X)$ may differ from the one representing $\xi_\tau(Y \mid X)$; (ii) even if the function class is the same, the empirical pinball-loss minimiser need not commute with $h$ beyond a constant shift. What does hold: the underlying conditional quantile function is equivariant, and well-specified quantile regression captures that equivariance to the extent the linear class can represent both functions.
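
A numerical look at the gap this remark describes, reusing fit_predict_quantile from §2 with the transform h(y) = log(y − min y + 1) as in the figure below (the data generator is ours): the direct h-scale fit and h of the Y-scale fit should be close but not identical:

rng = np.random.default_rng(4)
x = rng.uniform(0, 4, 800)
y = np.sin(x) + (0.3 + 0.1 * x) * rng.standard_normal(800)
h = lambda v: np.log(v - y.min() + 1.0)             # non-decreasing on the relevant range

x_eval = np.linspace(0.2, 3.8, 50)
direct = fit_predict_quantile(x, h(y), x_eval, tau=0.5)        # QR fit on the h scale
pushed = h(fit_predict_quantile(x, y, x_eval, tau=0.5))        # h ∘ (QR fit on the Y scale)
print('max |direct − h∘fit| over x_eval:', float(np.abs(direct - pushed).max()))  # small, not 0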

Remark (Contrast with the conditional mean (Jensen)).

$\mathbb{E}[h(Y) \mid X]$ equals $h(\mathbb{E}[Y \mid X])$ only when $h$ is affine, by Jensen’s inequality. So OLS on $\log Y$ has no clean relationship to OLS on $Y$; the log-transformation changes both the mean function and the implied error structure. QR’s equivariance is therefore a strict robustness gain for tasks where the response scale is a modelling choice (income, durations, prices, counts).

[Figure] Equivariance in four panels. Top: QR fits on Y (left) vs. on h(Y) = log(Y − Yₘᵢₙ + 1) (right). Bottom-left: direct h-scale fit (solid) vs. h applied to the Y-scale fit (dashed); they agree closely but not exactly. Bottom-right: the linear-estimator gap (Remark 6) — small but nonzero, a finite-sample / linear-class artefact, not a violation of Theorem 2.

Asymptotic Theory: Koenker–Bassett 1978, Knight 1998

We just saw that QR targets the conditional $\tau$-quantile in the population. The next question is: at what rate does the estimator converge, and what is the limit distribution? The answer, established by Koenker–Bassett 1978 with the clean modern proof due to Knight 1998, is the same root-$n$ / Gaussian shape we know from OLS, with two structural differences. The “noise variance” is replaced by $\tau(1 - \tau)$ — the Bernoulli variance of the indicator $\mathbb{1}\{Y < \hat q(X)\}$ — and the design Gram matrix is replaced by a sandwich built from a density-weighted Gram matrix evaluated at the conditional quantile.

Theorem 3 (Asymptotic normality of QR (Koenker–Bassett 1978, Knight 1998)).

Suppose $(X_i, Y_i)$ are i.i.d. with finite second moments $\mathbb{E}[\lVert X \rVert^2] < \infty$, the conditional density $f_{Y \mid X}(y \mid x)$ is continuous and strictly positive at $y = \xi_\tau(Y \mid X = x)$ for almost every $x$, and the matrices

$$\Omega = \mathbb{E}[X X^\top], \qquad D(\tau) = \mathbb{E}\bigl[f_{Y \mid X}(\xi_\tau(Y \mid X) \mid X)\, X X^\top\bigr]$$

are positive definite. Then

$$\sqrt{n}\,\bigl(\hat\beta(\tau) - \beta(\tau)\bigr) \Rightarrow \mathcal{N}\Bigl(0,\ \tau(1 - \tau)\, D(\tau)^{-1}\, \Omega\, D(\tau)^{-1}\Bigr).$$
Proof.

Sketch. Set $u_i = Y_i - X_i^\top \beta(\tau)$ and write the perturbed objective $Z_n(\delta) = \sum_i [\rho_\tau(u_i - X_i^\top \delta / \sqrt{n}) - \rho_\tau(u_i)]$. Knight’s identity gives the algebraic decomposition

$$\rho_\tau(u - v) - \rho_\tau(u) = -v\,(\tau - \mathbb{1}\{u < 0\}) + \int_0^v \bigl(\mathbb{1}\{u \leq s\} - \mathbb{1}\{u \leq 0\}\bigr)\, ds.$$

Apply this with $v = X_i^\top \delta / \sqrt{n}$. The first piece is a sum of mean-zero random variables (because $\mathbb{E}[\tau - \mathbb{1}\{u_i < 0\} \mid X_i] = 0$ at the population $\tau$-quantile); by the standard CLT it converges to a Gaussian linear in $\delta$ with covariance $\tau(1 - \tau)\, \Omega$. The second piece is an empirical-process integral that, after rescaling, converges to a deterministic quadratic $\tfrac{1}{2}\delta^\top D(\tau)\, \delta$ — the density-weighted matrix arising from a first-order expansion of $F_{Y \mid X}$ at the conditional quantile. Adding the two limits, the rescaled objective $Z_n$ converges to a quadratic-plus-linear function of $\delta$ whose argmin is $\delta_* = D(\tau)^{-1}\, W$ with $W \sim \mathcal{N}(0,\, \tau(1 - \tau)\, \Omega)$. Continuous mapping (the argmin lemma for stochastic processes) transfers convergence to $\hat\delta = \sqrt{n}\,(\hat\beta(\tau) - \beta(\tau))$, giving $\hat\delta \Rightarrow \mathcal{N}(0,\, \tau(1 - \tau)\, D(\tau)^{-1}\, \Omega\, D(\tau)^{-1})$. For full regularity conditions and the empirical-process step, see formalStatistics: Empirical Processes, Topic 32 §32.5. $\square$
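
Knight’s identity is pure algebra, so it can be checked directly on random (u, v) pairs — a sketch that evaluates the integral term by a Riemann sum on a fine grid (grid size is an arbitrary choice):

rng = np.random.default_rng(5)
tau = 0.3
for _ in range(4):
    u, v = rng.normal(), rng.normal()
    lhs = (u - v) * (tau - ((u - v) < 0)) - u * (tau - (u < 0))   # ρ_τ(u−v) − ρ_τ(u)
    linear = -v * (tau - (u < 0))
    # Numeric ∫₀ᵛ (1{u≤s} − 1{u≤0}) ds; negative step handles v < 0.
    s = np.linspace(0.0, v, 200_001)
    ds = v / 200_000
    integral = ((u <= s).astype(float) - float(u <= 0))[:-1].sum() * ds
    print(f'lhs = {lhs:+.4f}   rhs = {linear + integral:+.4f}')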

The bootstrap is the most direct way to see Theorem 3 — resample $(X, Y)$ pairs $B$ times, refit QR, and read off the empirical distribution of any coefficient.

def fit_qr_coefs(x_train, y_train, tau, degree=3):
    """Return all polynomial coefficients (intercept + slopes) of QR at level τ."""
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    Phi = poly.fit_transform(np.asarray(x_train).reshape(-1, 1))
    qr = QuantileRegressor(quantile=tau, alpha=0.0, solver='highs')
    qr.fit(Phi, y_train)
    return np.concatenate([[qr.intercept_], qr.coef_])  # [a₀, a₁, a₂, a₃]


def bootstrap_qr_coefs(x, y, tau, B, rng, degree=3):
    """B (X, Y)-bootstrap resamples; returns (B, p+1) coefficient matrix."""
    n = len(x)
    out = np.empty((B, degree + 1))
    for b in range(B):
        idx = rng.integers(0, n, n)
        out[b] = fit_qr_coefs(x[idx], y[idx], tau, degree=degree)
    return out

At fixed $\tau$, the empirical standard deviation of the slope coefficient $\hat\beta_1(\tau)$ across bootstrap replicates should scale as $1/\sqrt{n}$ (so $\hat\sigma \cdot \sqrt{n}$ stabilises). Across $\tau$, the constant inflates as $\tau \to 0$ or $\tau \to 1$ — Remark 8.
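
A driver sketch for both claims, using the helpers above on a synthetic heteroscedastic generator (ours); B is kept small for speed:

rng = np.random.default_rng(6)

def make_data(n, rng):
    x = rng.uniform(0, 4, n)
    return x, np.sin(x) + (0.3 + 0.1 * x) * rng.standard_normal(n)

for n in (200, 800, 3200):                # fixed τ: σ̂·√n should stabilise
    x, y = make_data(n, rng)
    coefs = bootstrap_qr_coefs(x, y, tau=0.5, B=50, rng=rng)
    print(f'n = {n:>4}:  sd(β̂₁)·√n = {coefs[:, 1].std() * np.sqrt(n):.3f}')

x, y = make_data(800, rng)
for tau in (0.05, 0.50, 0.95):            # fixed n: the constant inflates in the tails
    coefs = bootstrap_qr_coefs(x, y, tau=tau, B=50, rng=rng)
    print(f'τ = {tau}:  sd(β̂₁) = {coefs[:, 1].std():.4f}')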


Remark (The Bernoulli variance factor τ(1−τ) and tail inflation).

The asymptotic variance scales with $\tau(1 - \tau)$, peaking at $\tau = 0.5$ (median) and vanishing at the extremes $0$ and $1$. Geometrically: at the median, the indicator $\mathbb{1}\{Y < \hat q(X)\}$ is a fair coin, contributing maximal variance to the estimating equation. At extreme $\tau$, the indicator is nearly constant, but $D(\tau)$ — the density at the quantile — also goes to zero in the tail, which generally pushes the variance up. The two effects together are why extreme-quantile QR estimates have notoriously wide CIs even at moderate $n$, and motivate Extreme Value Theory (coming soon) as a separate framework for the tail regime.
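
In the no-covariate Gaussian case the sandwich collapses to the classical sample-quantile variance $\tau(1-\tau)/f(\xi_\tau)^2$, which makes the two competing effects easy to tabulate:

from scipy.stats import norm

for tau in (0.50, 0.90, 0.99):
    xi = norm.ppf(tau)
    avar = tau * (1 - tau) / norm.pdf(xi) ** 2   # τ(1−τ)/f(ξ_τ)²: n·Var of the sample τ-quantile
    print(f'τ = {tau:.2f}:  ξ_τ = {xi:+.3f},   n·Var ≈ {avar:.2f}')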

Remark (Bahadur representation at covariate values).

A finer statement, due to Bahadur 1966 in the no-covariate case and extended to QR in Koenker–Bassett 1978, gives the linearised expansion at any fixed $x_0$:

$$\sqrt{n}\,\bigl(\hat q_\tau(x_0) - q_\tau(x_0)\bigr) = x_0^\top D(\tau)^{-1} \cdot n^{-1/2} \sum_i X_i\, \bigl(\tau - \mathbb{1}\{Y_i \leq X_i^\top \beta(\tau)\}\bigr) + o_P(1).$$

This identifies the asymptotic variance of the conditional-quantile estimator at any specific test point $x_0$ as $x_0^\top V x_0$, where $V$ is the variance from Theorem 3. The bootstrap distributions computed above are exactly the empirical version: for a fixed $x_0$ they should approach the asymptotic Gaussian as $n \to \infty$.

Multi-Quantile Estimation, Crossing, and Rearrangement

A natural use of QR is to fit multiple quantile levels simultaneously and read the resulting bands as a non-parametric estimate of the conditional distribution of $Y$ given $X$. Three or five evenly-spaced quantiles already give a usable picture; for high-resolution density estimation, fit at a fine grid in $(0, 1)$.

But there is a problem. The population conditional quantiles are weakly increasing in $\tau$ by definition: $\xi_{\tau_1}(Y \mid X = x) \leq \xi_{\tau_2}(Y \mid X = x)$ whenever $\tau_1 \leq \tau_2$. The marginal QR estimates do not enforce this. Each $\tau$-fit is a separate optimisation, and at finite $n$ in finite-dimensional function classes, the resulting curves can — and frequently do — cross: the $\tau = 0.7$ fit can dip below the $\tau = 0.6$ fit at some $x$.

Remark (Quantile crossing as a coherence violation; constrained-LP fix).

Crossing is not just an aesthetic flaw. It produces probabilistic nonsense: if $q_{0.7}(x) < q_{0.6}(x)$, monotonicity of the CDF forces $F(q_{0.7}(x) \mid x) \leq F(q_{0.6}(x) \mid x)$, so the two fits cannot simultaneously satisfy $F(q_{0.6}(x) \mid x) = 0.6$ and $F(q_{0.7}(x) \mid x) = 0.7$. Any downstream procedure that treats the bands as CDF estimates (CQR uses them as prediction-interval endpoints; quantile forests use them for distribution prediction) inherits the incoherence. Two structural fixes exist:

  1. Joint estimation with monotonicity constraints (Bondell–Reich–Wang 2010). Augment the LP with linear inequalities $q_{\tau_k}(x_l) \leq q_{\tau_{k+1}}(x_l)$ at a grid of evaluation points; solve a single bigger LP. Pros: exact monotonicity by construction. Cons: scales poorly in $n \times K \times |\text{grid}|$; loses the per-$\tau$ decoupling that makes QR fast and parallelisable.
  2. Marginal estimation followed by REARRANGEMENT (Chernozhukov–Fernández-Val–Galichon 2010). Fit each $\tau$ independently as before, then at every evaluation point sort the $K$ predicted values along the $\tau$-axis. The rearranged curves are monotone by construction, equal to the originals when no crossing is present, and never worse in $L^p$ for any $p \geq 1$ (CFV-G 2010 Theorem 1). This is the path we take.

Definition 4 (Rearrangement of conditional-quantile estimates).

Given a vector of marginal quantile estimates $\hat q_{\tau_1}(x), \ldots, \hat q_{\tau_K}(x)$ at a fixed test point $x$ and increasing $\tau$-grid $\tau_1 < \cdots < \tau_K$, the rearranged estimates at $x$ are the sorted values

$$\tilde q_{\tau_1}(x) \leq \tilde q_{\tau_2}(x) \leq \cdots \leq \tilde q_{\tau_K}(x),$$

i.e., the rearranged estimate at level $\tau_k$ equals the $k$-th order statistic of $\{\hat q_{\tau_j}(x)\}$. The function $\tau \mapsto \tilde q_\tau(x)$ is monotone non-decreasing by construction.

Remark (Rearrangement weakly improves Lᵖ approximation).

If the true conditional quantile function $\tau \mapsto q_\tau(x)$ is monotone (which it is, by definition), then for any $p \geq 1$ the $L^p$ distance from the rearranged estimates to the truth is no larger than the $L^p$ distance from the original marginal estimates. The proof is rearrangement-inequality combinatorics: matching a monotone target with a monotone candidate is always at least as good as matching a monotone target with a non-monotone one. Strict improvement happens whenever crossing is present in the original estimates. So rearrangement is a free lunch as a post-processing step, never harmful.

Two helpers — fit a $K$-quantile bundle independently and then sort along the $\tau$-axis at every evaluation point:

def fit_multiple_quantiles(x_train, y_train, x_eval, taus, alpha_l2=0.01, degree=3):
    """Fit QR at each τ in `taus` on degree-`degree` polynomial features.

    Returns a (K, |x_eval|) array Q where Q[k, j] = q̂_{taus[k]}(x_eval[j]).
    """
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    Phi_train = poly.fit_transform(np.asarray(x_train).reshape(-1, 1))
    Phi_eval = poly.transform(np.asarray(x_eval).reshape(-1, 1))
    out = np.empty((len(taus), len(x_eval)))
    for k, tau in enumerate(taus):
        qr = QuantileRegressor(quantile=tau, alpha=alpha_l2, solver='highs')
        qr.fit(Phi_train, y_train)
        out[k] = qr.predict(Phi_eval)
    return out


def rearrange_quantile_estimates(Q):
    """CFV-G 2010 rearrangement: sort along the τ-axis at each evaluation point."""
    return np.sort(Q, axis=0)
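
A usage sketch: fit a dense τ-grid on a small heteroscedastic sample (the generator is ours), then count monotonicity violations before and after rearrangement:

rng = np.random.default_rng(7)
x = rng.uniform(0, 4, 80)                        # small n makes crossing likely
y = np.sin(x) + (0.2 + 0.3 * x) * rng.standard_normal(80)
x_eval = np.linspace(0.1, 3.9, 200)
taus = np.linspace(0.05, 0.95, 19)

Q = fit_multiple_quantiles(x, y, x_eval, taus)
Q_tilde = rearrange_quantile_estimates(Q)
print('crossing violations before:', int((np.diff(Q, axis=0) < 0).sum()))
print('crossing violations after: ', int((np.diff(Q_tilde, axis=0) < 0).sum()))  # 0 by construction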

Penalized Quantile Regression

Two settings push QR away from its plain Koenker–Bassett 1978 form. First, when $p > n$ or $p \sim n$ the unregularised QR LP is under-determined and badly behaved. Second, even at moderate $p$ with $n$ large enough for asymptotics, an $L^1$ penalty drives variable selection that simple shrinkage cannot match. Both settings use the same trick: add a penalty term to the pinball-loss objective.

The $L^2$ (ridge) penalty:

$$\min_\beta \; \sum_i \rho_\tau(Y_i - X_i^\top \beta) + \lambda\, \lVert \beta \rVert_2^2.$$

The quadratic penalty turns the problem into a QP rather than an LP; it can be solved either by augmenting the LP machinery with the quadratic term or via the smoothed-check-loss accelerated-gradient solver we use in the in-browser viz components.

The $L^1$ penalty (the focus of this section, “QR-lasso” or “L1-QR”; this is also the penalty that sklearn’s QuantileRegressor applies through its alpha parameter):

$$\min_\beta \; \sum_i \rho_\tau(Y_i - X_i^\top \beta) + \lambda\, \lVert \beta \rVert_1.$$

This stays a linear program, since $\lVert \beta \rVert_1 = \mathbf{1}^\top \beta^+ + \mathbf{1}^\top \beta^-$ splits cleanly with $\beta = \beta^+ - \beta^-$. Augment the QR LP with these auxiliary variables and solve for the full path of $\lambda$ values.

Remark (Why L1-QR is the natural high-dim QR).

At the population level, suppose the true conditional $\tau$-quantile is linear with sparse slope vector $\beta_0(\tau)$ — only $s$ out of $p$ coordinates are nonzero. The LP-lasso-QR estimator targets that sparse $\beta_0(\tau)$ directly. By contrast, ridge-QR ($L^2$-penalised) shrinks all coordinates uniformly and never sets any to exactly zero, so it cannot recover the sparsity pattern; its high-dim rate is correspondingly slower.

Theorem 4 (Belloni–Chernozhukov 2011 oracle rate for L1-QR).

Suppose $(X_i, Y_i)$ are i.i.d. with $X_i$ bounded, the true conditional $\tau$-quantile is $q_\tau(x) = x^\top \beta_0(\tau)$ with $\lVert \beta_0(\tau) \rVert_0 = s$, and the design satisfies suitable restricted eigenvalue conditions on the active subspace. Choose

$$\lambda \asymp c\, \sqrt{\tau(1 - \tau)}\, \sqrt{\log p / n}$$

for a sufficiently large constant $c$. Then with high probability,

$$\lVert \hat\beta(\tau) - \beta_0(\tau) \rVert_2 = O_P\!\left(\sqrt{s \log p / n}\right),$$

matching the oracle rate one would obtain with knowledge of the support of $\beta_0(\tau)$.

Proof.

Sketch. The argument has three pieces, each of which uses tools developed in formalStatistics: Empirical Processes (Topic 32).

  1. Gradient-domination condition. The estimating function for QR has bounded influence (the pinball-loss subgradient is bounded by 1 in absolute value), so a self-normalised concentration inequality gives $\lVert (1/n) \sum_i X_i\, (\tau - \mathbb{1}\{Y_i \leq X_i^\top \beta_0\}) \rVert_\infty \leq \lambda / 2$ with high probability when $\lambda$ is chosen as above. This is the score condition that ensures the lasso doesn’t over-shrink.
  2. Restricted-eigenvalue excursion control. The pinball loss is convex but not strongly convex; restricting to the active cone restores a strong-convexity-like inequality on the design Gram matrix. Combine with the score condition to bound the $L^1$ deviation in the active subspace.
  3. Oracle inequality. The combined argument yields a high-probability bound of the form $\lVert \hat\beta - \beta_0 \rVert_2^2 \leq C\, s \log(p) / n$, where $C$ absorbs the restricted-eigenvalue constant and the $\tau(1 - \tau)$ factor. Square-rooting gives the stated rate.

Full details in BC2011 §4; the general empirical-process scaffold is Topic 32 §32.5. \square

Remark (Choice of λ (CV vs BC plug-in)).

The asymptotic rate fixes $\lambda$ only up to a constant. In practice, two routes pick the constant: (a) the BC2011 self-normalised plug-in, which uses a pivotal quantity to set $\lambda$ from data without cross-validation; (b) cross-validation on the pinball-loss objective itself. The figure below uses (b). CV-based $\lambda$ is more familiar and tends to be more aggressive than the BC plug-in, which is intentionally conservative for inference.

The L1-QR LP splits $\beta = \beta^+ - \beta^-$ with $\beta^+, \beta^- \geq 0$, then adds the slack pair $u^+, u^-$ from §3. Four nonnegative variable groups give a standard-form LP that scipy / HiGHS solves directly:

def solve_lasso_qr_linprog(X, y, tau, lam, no_penalty_mask=None):
    """L1-penalised QR via the standard LP reformulation.

    Variables z = [β⁺ (p), β⁻ (p), u⁺ (n), u⁻ (n)].
    Objective:   λ·(1ᵀβ⁺ + 1ᵀβ⁻) + τ·1ᵀu⁺ + (1−τ)·1ᵀu⁻,
    with the λ coefficients zeroed out at indices in no_penalty_mask
    (typically the intercept column).
    Constraints: X(β⁺ − β⁻) + u⁺ − u⁻ = y, all variables ≥ 0.
    """
    n, p = X.shape
    if no_penalty_mask is None:
        no_penalty_mask = np.zeros(p, dtype=bool)
    pen_costs = np.where(no_penalty_mask, 0.0, lam)
    c = np.concatenate([
        pen_costs,                  # cost on β⁺
        pen_costs,                  # cost on β⁻
        tau * np.ones(n),           # cost on u⁺
        (1 - tau) * np.ones(n),     # cost on u⁻
    ])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    bounds = [(0, None)] * (2 * p + 2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method='highs')
    z = res.x
    return z[:p] - z[p:2 * p]      # β = β⁺ − β⁻
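
A usage sketch of the LP on a sparse synthetic design (the design, λ grid, and zero-threshold here are illustrative) — the nonzero-slope count should shrink as λ grows:

rng = np.random.default_rng(8)
n, p = 200, 30
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.zeros(p)
beta_true[1:5] = [2.0, -1.5, 1.0, 0.5]           # 4 active slopes, the rest noise
y = X @ beta_true + rng.standard_normal(n)
no_pen = np.zeros(p, dtype=bool)
no_pen[0] = True                                  # leave the intercept unpenalised

for lam in (0.1, 1.0, 10.0, 100.0):
    beta = solve_lasso_qr_linprog(X, y, tau=0.5, lam=lam, no_penalty_mask=no_pen)
    n_active = int((np.abs(beta[1:]) > 1e-6).sum())
    print(f'λ = {lam:>6.1f}:  nonzero slopes = {n_active}')
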
[Figure] L1-QR (lasso-QR) on a 30-feature design with 4 truly active features (n = 200, τ = 0.5). Left: coefficient path along log₁₀(λ); colored lines are the truly active coefficients, grey lines are the 25 noise features, vanishing as λ grows. Right: 5-fold cross-validation pinball loss; the dashed line marks the CV-min λ, which selects all 4 true active features (with a few false positives — Theorem 4 governs the rate, not exact recovery).

Quantile Regression as the Base Learner of Conformalized QR

We close the loop with the use that motivated this topic in the T4 track: quantile regression as the base learner inside Conformalized Quantile Regression (Topic 1, §6). CQR uses two QR fits — at levels $\alpha/2$ and $1 - \alpha/2$ — to produce a heteroscedastic prediction band, and then applies conformal calibration on top to guarantee finite-sample marginal coverage at level $1 - \alpha$. Conformal’s coverage theorem holds for any base learner (Conformal Prediction, Theorem 1); the role of QR specifically is to give the band the right shape.

This is the division of labour:

  • QR base fits provide the band SHAPE — wide where conditional variance is large, narrow where it is small. A symmetric residual-conformal interval does not have this property; it produces constant-width bands.
  • Conformal calibration provides the band WIDTH — adjust the QR fits by an additive constant so that the empirical miscoverage rate on a held-out calibration set matches the target $\alpha$.

Conformal Prediction’s Theorem 1 (split-conformal validity) gives marginal coverage $1 - \alpha$ regardless of how badly QR is misspecified — even constant fits. What QR contributes is approximate conditional validity: when the linear class spans the true conditional quantile function, the conformal correction is a small constant and the resulting band tracks the true conditional coverage rate uniformly. When QR is misspecified, the conformal correction absorbs the misspecification globally; the resulting band still has marginal coverage $1 - \alpha$, but conditional coverage will be uneven. The full treatment of prediction-interval procedures — comparing CQR with locally adaptive variants (CQR-r, CQR-m), conditional-coverage methods, and base learners beyond linear QR — lives in Prediction Intervals (coming soon).

Recall (callback to Conformal Prediction §6, Definition 4) the CQR prediction set at level $1 - \alpha$ based on QR fits $\hat q_{\alpha/2}$ and $\hat q_{1 - \alpha/2}$:

$$\hat C_\alpha(x) = \bigl[\,\hat q_{\alpha/2}(x) - \hat Q_\alpha,\quad \hat q_{1 - \alpha/2}(x) + \hat Q_\alpha\,\bigr],$$

where $\hat Q_\alpha$ is the $(1 - \alpha)$ empirical quantile (with the standard $1/n_{\text{cal}}$ correction) of the calibration nonconformity scores

$$E_i = \max\bigl\{\,\hat q_{\alpha/2}(X_i) - Y_i,\ \ Y_i - \hat q_{1 - \alpha/2}(X_i)\,\bigr\}.$$

The two QR fits define the band shape; $\hat Q_\alpha$ is a single scalar that inflates or deflates the band uniformly to hit the target coverage.
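
The whole recipe in a few lines — a split-CQR sketch on synthetic data (the generator is ours), reusing fit_predict_quantile from §2 and the ⌈(1−α)(n_cal+1)⌉-th order statistic as the corrected calibration quantile:

rng = np.random.default_rng(9)
alpha = 0.10

def gen(n):
    x = rng.uniform(0, 4, n)
    return x, np.sin(x) + (0.2 + 0.3 * x) * rng.standard_normal(n)

x_tr, y_tr = gen(1000)
x_cal, y_cal = gen(500)
x_te, y_te = gen(2000)

lo = fit_predict_quantile(x_tr, y_tr, x_cal, tau=alpha / 2)
hi = fit_predict_quantile(x_tr, y_tr, x_cal, tau=1 - alpha / 2)
E = np.maximum(lo - y_cal, y_cal - hi)            # nonconformity scores E_i
k = int(np.ceil((1 - alpha) * (len(y_cal) + 1)))  # finite-sample corrected rank
Q_hat = np.sort(E)[k - 1]

lo_te = fit_predict_quantile(x_tr, y_tr, x_te, tau=alpha / 2) - Q_hat
hi_te = fit_predict_quantile(x_tr, y_tr, x_te, tau=1 - alpha / 2) + Q_hat
cover = ((y_te >= lo_te) & (y_te <= hi_te)).mean()
print(f'test coverage: {cover:.3f}   (target {1 - alpha:.2f})')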

[Figure] QR + conformal decomposition. Left: uncalibrated QR band at τ ∈ {0.05, 0.95} (n = 1000 train); the band shape tracks the heteroscedasticity but its empirical coverage on a held-out test set undershoots the 0.90 target (~0.87) because of misspecification + finite-sample noise. Right: CQR-corrected band — same shape, calibrated width, with test coverage hitting the target.

Connections and Further Reading

Within formalML, this topic feeds into:

  • Conformal Prediction (T4 #1). The two-sided QR fits at $\alpha/2$ and $1 - \alpha/2$ are the base learner inside Conformalized Quantile Regression. Theorem 1 of conformal-prediction guarantees marginal coverage of the CQR band regardless of QR’s specification; the role of QR is to give the band the right shape — heteroscedastic where the data are heteroscedastic. See §8 above.

  • Prediction Intervals (coming soon). A unifying treatment of frequentist prediction intervals — fixed-width residual-based, conformal, and quantile-regression-based — and their coverage guarantees under various assumptions (i.i.d., exchangeable, group-conditional). Quantile regression is one of three “spokes” feeding into that umbrella topic.

  • Extreme Value Theory (coming soon). Theorem 3’s variance formula tells us QR’s asymptotic variance inflates as $\tau \to 0$ or $\tau \to 1$, because the density-weighted matrix $D(\tau)$ shrinks in the tail. Beyond moderate quantile levels — typically $\tau$ outside roughly $[0.05, 0.95]$ for a Gaussian-tailed $Y$ — direct QR is not the right framework. EVT replaces it with the generalised Pareto / generalised extreme-value families and a peaks-over-threshold estimator that targets the tail directly.

Cross-site prerequisites:

  • formalStatistics: Linear Regression — the OLS analog. The KB78 estimator is the pinball-loss replacement of squared loss; the LP reformulation is the QR replacement of the normal equations; Theorem 3’s asymptotic Gaussian is the QR replacement of OLS’s $\hat\beta_{\text{OLS}} \sim \mathcal{N}(\beta, \sigma^2 (X^\top X)^{-1})$.

  • formalStatistics: Order Statistics and Quantiles — the no-covariate special case. With a single intercept feature, KB78 reduces to the empirical $\tau$-quantile of $\{Y_i\}$; Theorem 3 reduces to the classical Bahadur–Ghosh asymptotics.

  • formalStatistics: Empirical Processes — the toolkit behind Theorem 3’s proof sketch (Knight’s identity, the empirical-process limit of the rescaled objective, the argmin lemma) and behind Theorem 4’s restricted-eigenvalue / oracle-inequality argument.

Internal prerequisites:

  • Convex Analysis — the pinball loss is convex but non-smooth; the LP reformulation exploits its piecewise-linear structure.

  • Gradient Descent — the smoothed-check-loss accelerated-gradient solver is what powers the in-browser visualisation widgets, where running an LP solver in the user’s browser is impractical.

Connections

  • T4's first topic. The CQR callback in §8 builds directly on conformal-prediction §6 Definition 4; reading conformal-prediction first frames QR as the base learner that supplies the band SHAPE while conformal calibration supplies the band WIDTH. conformal-prediction
  • The pinball loss is a piecewise-linear convex function. The LP reformulation in §3 is exactly the standard reduction of piecewise-linear convex optimization to a linear program; the slack-variable trick (u = u⁺ − u⁻ with u⁺, u⁻ ≥ 0) is the canonical convex-analysis device. convex-analysis
  • The smoothed-check-loss accelerated-gradient solver underlying the in-browser visualization components is a direct adaptation of the smoothed-objective + Nesterov-acceleration construction — applied to a non-smooth piecewise-linear loss via Moreau-envelope smoothing. gradient-descent
  • T4's track closer cites the Koenker–Knight asymptotic theorem from §5 here as Theorem 2 (pure QR's asymptotic conditional coverage), and the QR base learner is reused in pure QR (§3) and CQR (§5.1) of that topic. The architectural punchline there is that CQR equals pure QR with the threshold $0$ replaced by a conformal $(1-\alpha)$-quantile — one number of difference, one bridge theorem of consequence. prediction-intervals
  • T4 sibling and track closer. Statistical depth handles the unconditional multivariate-quantile problem by collapsing dimension into a center-outward scalar; quantile regression handles the conditional problem one quantile-level at a time. The two converge to multivariate-quantile regression with depth-based prediction regions. statistical-depth

References & Further Reading