Quantile Regression
The Koenker–Bassett 1978 estimator: pinball-loss minimization, LP reformulation, asymptotic normality, multi-quantile rearrangement, and the base learner of CQR
The Pinball Loss and the Population Quantile
Quantile regression rests on a single asymmetric loss function, the check or pinball loss. Where squared loss penalises positive and negative residuals symmetrically and recovers the conditional mean at its minimum, the pinball loss penalises them asymmetrically and recovers the conditional quantile. The next two results make this statement precise.
Definition 1 (Pinball loss).
For $\tau \in (0, 1)$ and any real residual $u$, the pinball loss at level $\tau$ is
$$\rho_\tau(u) = u\,\big(\tau - \mathbf{1}\{u < 0\}\big) = \begin{cases} \tau\, u, & u \ge 0, \\ (\tau - 1)\, u, & u < 0. \end{cases}$$
Equivalently, $\rho_\tau$ is a piecewise-linear V with slope $\tau$ on the right half-line and slope $\tau - 1$ on the left. At $\tau = 1/2$, $\rho_{1/2}(u) = |u|/2$ is half the absolute loss; for $\tau \neq 1/2$ the V is asymmetric, with the steeper side on whichever direction we pay more for getting wrong.
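A direct vectorised implementation of Definition 1 (a minimal sketch; the helper name pinball_loss is ours, not a library function):
import numpy as np

def pinball_loss(u, tau):
    """Pinball / check loss rho_tau(u) = u * (tau - 1{u < 0}), vectorised."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))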
Theorem 1 (Pinball minimization recovers the population quantile).
Let $Y$ be a real random variable with cumulative distribution function $F$ and finite mean. For each $\tau \in (0, 1)$, the (any) $\tau$-quantile
$$q_\tau = \inf\{q \in \mathbb{R} : F(q) \ge \tau\}$$
minimises the expected pinball loss:
$$q_\tau \in \arg\min_{q \in \mathbb{R}} \; \mathbb{E}\big[\rho_\tau(Y - q)\big].$$
Proof.
Write the expected loss as a function of $q$:
$$L(q) = \mathbb{E}\big[\rho_\tau(Y - q)\big] = \tau \int_q^\infty (y - q)\, dF(y) \;+\; (1 - \tau) \int_{-\infty}^q (q - y)\, dF(y).$$
The first term penalises $Y > q$ at rate $\tau$ (the “right side”); the second penalises $Y < q$ at rate $1 - \tau$ (the “left side”). Differentiate under the integral. The derivative of the first term with respect to $q$ is $-\tau\,(1 - F(q))$ — the Leibniz contribution from the lower limit vanishes (the integrand is zero at $y = q$), and the constant $-\tau$ under the integral contributes $-\tau(1 - F(q))$. The derivative of the second term is $(1 - \tau)\, F(q)$ — the matching contribution from the upper limit and integrand. Adding:
$$L'(q) = F(q) - \tau.$$
Setting $L'(q) = 0$ gives $F(q) = \tau$, i.e., $q = q_\tau$. The second derivative, wherever $F$ is differentiable, is $L''(q) = f(q) \ge 0$, so $L$ is convex; whenever the density at $q_\tau$ is strictly positive, the minimiser is unique. (When $F$ is flat at level $\tau$, the argmin set is an interval; we take the smallest minimiser by convention.)
∎
Numerical verification: sample $Y \sim \mathcal{N}(0, 1)$ and minimise the empirical pinball loss over a fine grid of $q$ for several $\tau$ — the empirical argmin should track $\Phi^{-1}(\tau)$ as $n$ grows.
import numpy as np
from scipy.stats import norm

y_sample = norm.rvs(size=20_000, random_state=1)  # Y ~ N(0, 1)
q_grid = np.linspace(-3, 3, 601)
print('Empirical vs theoretical quantile at n = 20,000:')
print(f'{"tau":>6} | {"Phi^-1(tau)":>12} | {"argmin emp loss":>16} | {"diff":>8}')
for tau in [0.10, 0.25, 0.50, 0.75, 0.90]:
    # Empirical pinball loss over the grid; its argmin estimates the tau-quantile.
    L_grid = np.array([
        np.mean(np.where(y_sample >= q,
                         tau * (y_sample - q),
                         (tau - 1) * (y_sample - q)))
        for q in q_grid
    ])
    q_hat = q_grid[L_grid.argmin()]
    q_true = norm.ppf(tau)
    print(f'{tau:>6.2f} | {q_true:>12.4f} | {q_hat:>16.4f} | {q_hat - q_true:>+8.4f}')
Remark (Pinball-loss derivative requires no smoothness on Y).
The proof opens no black boxes — it does not require $Y$ to have a density, finite variance, or bounded support. The piecewise-linear structure of $\rho_\tau$ is doing all of the work: differentiating it produces the indicator $\mathbf{1}\{Y < q\}$, and the indicator’s expectation is exactly $F(q)$.
Remark (Conditional version: argmin over functions = conditional τ-quantile).
The same calculation, conditioned on $X = x$, gives the conditional quantile $q_\tau(x) = F_{Y \mid X = x}^{-1}(\tau)$. This is what quantile regression estimates: the argmin over functions $f$ of $\mathbb{E}\big[\rho_\tau(Y - f(X))\big]$ is the conditional $\tau$-quantile function of $Y$ given $X$, almost surely. The QR estimator replaces the population expectation with the empirical mean and the function class with a parametric one (the linear span of the features).
Linear Quantile Regression
Theorem 1 says the population $\tau$-quantile of $Y$ is recovered by minimising the expected pinball loss. The Koenker–Bassett 1978 quantile regression estimator is the empirical twin of that statement: minimise the sample pinball loss over a parametric family of conditional-quantile candidates.
Definition 2 (Linear quantile regression estimator (Koenker–Bassett 1978)).
Given features $x_i \in \mathbb{R}^p$ and responses $y_i \in \mathbb{R}$ for $i = 1, \dots, n$, the linear quantile regression estimator at level $\tau$ is
$$\hat\beta_\tau = \arg\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n \rho_\tau\big(y_i - x_i^\top \beta\big).$$
The fitted conditional quantile at a new point $x$ is $\hat q_\tau(x) = x^\top \hat\beta_\tau$. When the feature vector includes a constant (intercept), the model is
$$q_\tau(x) = \beta_0 + x^\top \beta,$$
with the intercept $\beta_0$ playing the role of the noise quantile.
The workhorse fit-and-predict helper used throughout this topic — a thin wrapper around scikit-learn’s QuantileRegressor (which calls HiGHS) on degree-3 polynomial features:
from sklearn.linear_model import QuantileRegressor
from sklearn.preprocessing import PolynomialFeatures
def fit_predict_quantile(x_train, y_train, x_eval, tau, alpha_l1=0.01, degree=3):
    """Quantile regression at level τ on degree-`degree` polynomial features.

    alpha_l1: small penalty for numerical stability. Note that
    QuantileRegressor's `alpha` multiplies an L1 penalty in scikit-learn;
    it is kept tiny so KB78 asymptotics are not distorted.
    """
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    Phi_train = poly.fit_transform(np.asarray(x_train).reshape(-1, 1))
    Phi_eval = poly.transform(np.asarray(x_eval).reshape(-1, 1))
    qr = QuantileRegressor(quantile=tau, alpha=alpha_l1, solver='highs')
    qr.fit(Phi_train, y_train)
    return qr.predict(Phi_eval)
The KKT condition has a clean empirical analog: at the QR optimum, approximately a $\tau$ fraction of training residuals are negative.
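To make the check below self-contained, here is a hypothetical demo dataset (the names x_demo and y_demo are ours; any heteroscedastic univariate sample works):
rng_demo = np.random.default_rng(0)
x_demo = rng_demo.uniform(0, 4, 500)                  # univariate covariate
y_demo = (np.sin(2 * x_demo)
          + (0.3 + 0.2 * x_demo) * rng_demo.standard_normal(500))  # heteroscedastic noise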
# At each τ, fraction of training residuals < 0 should ≈ τ.
for tau in [0.10, 0.50, 0.90]:
fit_at_train = fit_predict_quantile(x_demo, y_demo, x_demo, tau)
print(f'τ = {tau}: P(Y < q̂(X)) = {(y_demo < fit_at_train).mean():.3f}')
Remark (When does linear QR target the right thing?).
If $Y = X^\top \beta + \varepsilon$ with $\varepsilon$ independent of $X$ and $c_\tau = F_\varepsilon^{-1}(\tau)$, then $q_\tau(x) = x^\top \beta + c_\tau$. Linear QR with an intercept recovers $\beta$ exactly in the population. Under heteroscedastic noise — say $Y = X^\top \beta + \sigma(X)\,\varepsilon$ — the conditional quantile
$$q_\tau(x) = x^\top \beta + \sigma(x)\, c_\tau$$
is not linear in $x$ unless $\sigma$ is. The fix is to enrich the feature map (polynomial, spline, RBF) so the linear span includes the conditional quantile function. That is the path we take throughout this topic: degree-3 polynomial features applied to a univariate covariate.
Remark (Why an intercept matters).
Without an intercept, the line is forced through the origin, and the QR estimator can pick up bias to compensate; with an intercept, the estimator decomposes cleanly into a slope (the location) and an intercept that absorbs the noise quantile $c_\tau = F_\varepsilon^{-1}(\tau)$. Standard practice — and what every implementation we use here does — is to include an intercept either as a separate feature column or as the bias term of an LP solver.
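A quick numerical illustration of this point, as a sketch on synthetic data of our own (QuantileRegressor's fit_intercept flag toggles the intercept):
rng = np.random.default_rng(8)
x = rng.uniform(0, 4, 300).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + rng.standard_normal(300)  # true intercept 1, slope 2
for fit_int in (True, False):
    qr = QuantileRegressor(quantile=0.5, alpha=0.0, solver='highs',
                           fit_intercept=fit_int)
    qr.fit(x, y)
    print(f'fit_intercept={fit_int}: slope = {qr.coef_[0]:.3f}, '
          f'intercept = {qr.intercept_:.3f}')
# Without an intercept, the slope inflates to compensate for the missing offset.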
The LP Reformulation
The pinball loss is piecewise-linear, so the QR optimisation is a linear program. This is more than a curiosity: it explains why sklearn’s QuantileRegressor uses HiGHS (a state-of-the-art LP solver), why QR is exact rather than iterative, and why the “interpolation” structure of the fit is combinatorial — exactly $p$ data points sit on the fitted line in the non-degenerate case.
The trick is to split each residual $u_i = y_i - x_i^\top \beta$ into its positive and negative parts:
$$u_i = u_i^+ - u_i^-, \qquad u_i^+ = \max(u_i, 0) \ge 0, \qquad u_i^- = \max(-u_i, 0) \ge 0.$$
Then $\rho_\tau(u_i) = \tau\, u_i^+ + (1 - \tau)\, u_i^-$, which is linear in the slack variables. Stacking the constraints:
Definition 3 (LP reformulation of QR).
Quantile regression at level $\tau$ is equivalent to the linear program
$$\min_{\beta,\, u^+,\, u^-} \;\; \tau\, \mathbf{1}^\top u^+ + (1 - \tau)\, \mathbf{1}^\top u^- \quad \text{subject to} \quad X\beta + u^+ - u^- = y, \quad u^+ \ge 0, \;\; u^- \ge 0.$$
Stacking all variables into $z = (\beta, u^+, u^-) \in \mathbb{R}^{p + 2n}$, this is a standard-form LP with cost $c = \big(0_p,\, \tau \mathbf{1}_n,\, (1 - \tau)\mathbf{1}_n\big)$, equality constraint $A_{\mathrm{eq}} z = b_{\mathrm{eq}}$ with $A_{\mathrm{eq}} = [\,X \;\; I_n \;\; -I_n\,]$ and $b_{\mathrm{eq}} = y$, and nonnegativity bounds on the slacks ($\beta$ is unbounded).
Remark (Combinatorial structure of the LP optimum).
A basic feasible solution of an LP has at most as many nonzero variables as equality constraints — here $n$ — so the optimal $z$ has at most $n$ nonzeros in total. But each row has at most one of $u_i^+, u_i^-$ positive (both being positive would violate optimality: a quick swap reduces cost), so each observation off the fitted surface contributes exactly one nonzero slack. The remaining basic variables go to $\beta$ (which is $p$-dimensional). Hence at exactly $p$ observations $u_i^+ = u_i^- = 0$, i.e., the QR fit interpolates $p$ of the data points. This is unique to QR — OLS interpolates none in general, ridge interpolates all only in the limit.
Solving this LP directly with scipy.optimize.linprog makes Definition 3 concrete; the result agrees with QuantileRegressor to machine precision since both call HiGHS.
from scipy.optimize import linprog
def solve_qr_linprog(X, y, tau):
"""Solve QR via the LP reformulation using scipy.optimize.linprog (HiGHS).
X: (n, p) design matrix (include constant column for intercept).
y: (n,) responses.
tau: quantile level in (0, 1).
Returns: (beta, residuals, slack_pos, slack_neg).
"""
n, p = X.shape
# z = [β (p), u⁺ (n), u⁻ (n)], total p + 2n variables.
c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
# Equality constraint: X β + u⁺ - u⁻ = y (slacks split residual sign)
A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
b_eq = y
# β unbounded; slacks ≥ 0
bounds = [(None, None)] * p + [(0, None)] * (2 * n)
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method='highs')
z = res.x
beta = z[:p]
return beta, y - X @ beta, z[p:p + n], z[p + n:]
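A quick sanity check of the LP solve and the combinatorial structure above, on synthetic data of our own (a sketch; the exact zero count assumes the non-degenerate case):
rng = np.random.default_rng(2)
n_obs, tau = 200, 0.7
x = rng.uniform(0, 4, n_obs)
X_design = np.column_stack([np.ones(n_obs), x, x**2, x**3])  # intercept + cubic
y_obs = np.sin(2 * x) + (0.3 + 0.2 * x) * rng.standard_normal(n_obs)

beta, resid, u_pos, u_neg = solve_qr_linprog(X_design, y_obs, tau)
print('p =', X_design.shape[1],
      '| residuals numerically zero:', int(np.sum(np.abs(resid) < 1e-8)))  # expect p
print('fraction of residuals < 0:', np.mean(resid < 0))  # expect ≈ tau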
Equivariance Under Monotone Transformations
Conditional quantiles have a property the conditional mean lacks: they commute with monotone transformations of $Y$. If $Y$ is income and we want the median of $\log Y$, the answer is $\log$ of the median of $Y$ — no Jensen correction required. This makes quantile regression unusually robust to the scale on which the response is modelled.
Theorem 2 (Equivariance of conditional quantiles).
Let $Y$ be a random variable with conditional distribution $F(\cdot \mid x) = F_{Y \mid X = x}$, and let $g : \mathbb{R} \to \mathbb{R}$ be a non-decreasing function. Then for every $\tau \in (0, 1)$ and every $x$,
$$q_\tau\big(g(Y) \mid X = x\big) = g\big(q_\tau(Y \mid X = x)\big).$$
Proof.
Recall $q_\tau(Z \mid x) = \inf\{t : \mathbb{P}(Z \le t \mid X = x) \ge \tau\}$. We need to express $\mathbb{P}(g(Y) \le t)$ in terms of $F$. Because $g$ is non-decreasing, for any $t$,
$$\{g(Y) \le t\} = \{Y \le g^{-}(t)\},$$
where $g^{-}(t) = \sup\{y : g(y) \le t\}$ is the right-continuous inverse (when $g$ is invertible the two coincide; the right-continuous inverse handles $g$ with flat regions, and the proof goes through unchanged). Taking conditional probabilities,
$$\mathbb{P}\big(g(Y) \le t \mid X = x\big) = F\big(g^{-}(t) \mid x\big).$$
Now compute the conditional $\tau$-quantile of $g(Y)$:
$$q_\tau\big(g(Y) \mid x\big) = \inf\big\{t : F\big(g^{-}(t) \mid x\big) \ge \tau\big\} = g\big(q_\tau(Y \mid x)\big).$$
The last step uses non-decreasingness of $g$: $\{t : F(g^{-}(t) \mid x) \ge \tau\}$ is a right half-line in $t$ with infimum $g(q_\tau(Y \mid x))$. Strict monotonicity is not needed.
∎
Remark (Equivariance is about the conditional quantile, not the linear estimator).
Theorem 2 is a statement about conditional quantiles, not about the linear-QR estimator. If we fit linear QR to $Y$ on features $\phi(X)$ and obtain $\hat q_\tau(\cdot)$, there is no general guarantee that fitting linear QR to $g(Y)$ on $\phi(X)$ returns $g$ applied to the original fit at every test point. Two reasons the equality fails in finite samples and finite-dimensional feature classes: (i) the linear span of $\phi$ is closed under linear combinations but not under composition with $g$, so the function class capable of representing $q_\tau(Y \mid x)$ may differ from the one representing $q_\tau(g(Y) \mid x)$; (ii) even if the function class is the same, the empirical pinball-loss minimiser need not commute with $g$ beyond a constant shift. What does hold: the underlying conditional quantile function is equivariant, and well-specified quantile regression captures that equivariance to the extent the linear class can represent both functions.
Remark (Contrast with the conditional mean (Jensen)).
$\mathbb{E}[g(Y) \mid X]$ equals $g(\mathbb{E}[Y \mid X])$ only when $g$ is affine, by Jensen’s inequality. So OLS on $\log Y$ has no clean relationship to OLS on $Y$; the log-transformation changes both the mean function and the implied error structure. QR’s equivariance is therefore a strict robustness gain for tasks where the response scale is a modelling choice (income, durations, prices, counts).
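A numerical check of Theorem 2 in the no-covariate case, contrasting quantile equivariance with the Jensen gap of the mean (a sketch with $g = \exp$):
rng = np.random.default_rng(4)
y = rng.standard_normal(100_000)
for tau in (0.25, 0.50, 0.90):
    print(f'tau={tau}: q_tau(exp(Y)) = {np.quantile(np.exp(y), tau):.4f}, '
          f'exp(q_tau(Y)) = {np.exp(np.quantile(y, tau)):.4f}')
# The mean does NOT commute: E[exp(Y)] = e^{1/2} ≈ 1.6487, but exp(E[Y]) ≈ 1.
print('E[exp(Y)] =', np.exp(y).mean(), '| exp(E[Y]) =', np.exp(y.mean()))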
Asymptotic Theory: Koenker–Bassett 1978, Knight 1998
We just saw that QR returns the conditional $\tau$-quantile in expectation. The next question is: at what rate does the estimator converge, and what is the limit distribution? The answer, established by Koenker–Bassett 1978 with the clean modern proof due to Knight 1998, is the same root-$n$ / Gaussian shape we know from OLS, with two structural differences. The “noise variance” $\sigma^2$ is replaced by $\tau(1 - \tau)$ — the Bernoulli variance of the indicator $\mathbf{1}\{Y \le q_\tau(X)\}$ — and the design Gram matrix is sandwich-replaced by a density-weighted Gram matrix evaluated at the conditional quantile.
Theorem 3 (Asymptotic normality of QR (Koenker–Bassett 1978, Knight 1998)).
Suppose $(X_i, Y_i)$ are i.i.d. with finite second moments $\mathbb{E}\|X\|^2 < \infty$, the conditional density $f_{Y \mid X}(\cdot \mid x)$ is continuous and strictly positive at $q_\tau(x)$ for almost every $x$, and the matrices
$$D_0 = \mathbb{E}\big[X X^\top\big], \qquad D_1 = \mathbb{E}\big[f_{Y \mid X}\big(q_\tau(X) \mid X\big)\, X X^\top\big]$$
are positive definite. Then
$$\sqrt{n}\,\big(\hat\beta_\tau - \beta_\tau\big) \;\xrightarrow{d}\; \mathcal{N}\Big(0,\; \tau(1 - \tau)\, D_1^{-1} D_0 D_1^{-1}\Big).$$
Proof.
Sketch. Set $\delta = \sqrt{n}\,(\beta - \beta_\tau)$ and write the perturbed objective
$$Z_n(\delta) = \sum_{i=1}^n \Big[\rho_\tau\big(y_i - x_i^\top \beta_\tau - x_i^\top \delta/\sqrt{n}\big) - \rho_\tau\big(y_i - x_i^\top \beta_\tau\big)\Big].$$
Knight’s identity gives the algebraic decomposition
$$\rho_\tau(u - v) - \rho_\tau(u) = -v\,\big(\tau - \mathbf{1}\{u < 0\}\big) + \int_0^v \big(\mathbf{1}\{u \le s\} - \mathbf{1}\{u \le 0\}\big)\, ds.$$
Apply this with $u = y_i - x_i^\top \beta_\tau$ and $v = x_i^\top \delta/\sqrt{n}$. The first piece is a sum of mean-zero random variables (because $\mathbb{P}(Y < q_\tau(X) \mid X) = \tau$ at the population $\tau$-quantile); by the standard CLT it converges to a Gaussian linear in $\delta$ with covariance $\tau(1 - \tau)\, D_0$. The second piece is an empirical-process integral that, after rescaling, converges to the deterministic quadratic $\tfrac{1}{2}\, \delta^\top D_1 \delta$ — the density-weighted matrix arising from a first-order expansion of $F_{Y \mid X}$ at the conditional quantile. Adding the two limits, the rescaled objective converges to a quadratic-plus-linear function of $\delta$ whose argmin is Gaussian with covariance $\tau(1 - \tau)\, D_1^{-1} D_0 D_1^{-1}$. Continuous mapping (the argmin lemma for stochastic processes) transfers convergence of the objectives to convergence of the minimisers, giving the result. For full regularity conditions and the empirical-process step, see formalStatistics: Empirical Processes, Topic 32 §32.5.
∎
The bootstrap is the most direct way to see Theorem 3 — resample pairs $(X_i, Y_i)$ with replacement $B$ times, refit QR, and read off the empirical distribution of any coefficient.
def fit_qr_coefs(x_train, y_train, tau, degree=3):
"""Return all polynomial coefficients (intercept + slopes) of QR at level τ."""
poly = PolynomialFeatures(degree=degree, include_bias=False)
Phi = poly.fit_transform(np.asarray(x_train).reshape(-1, 1))
qr = QuantileRegressor(quantile=tau, alpha=0.0, solver='highs')
qr.fit(Phi, y_train)
return np.concatenate([[qr.intercept_], qr.coef_]) # [a₀, a₁, a₂, a₃]
def bootstrap_qr_coefs(x, y, tau, B, rng, degree=3):
"""B (X, Y)-bootstrap resamples; returns (B, p+1) coefficient matrix."""
n = len(x)
out = np.empty((B, degree + 1))
for b in range(B):
idx = rng.integers(0, n, n)
out[b] = fit_qr_coefs(x[idx], y[idx], tau, degree=degree)
return out
At fixed $\tau$, the empirical std of the slope coefficient across bootstrap replicates should scale as $1/\sqrt{n}$ (so $\sqrt{n} \times \text{std}$ stabilises). Across $\tau$, the constant inflates as $\tau \to 0$ or $\tau \to 1$ — Remark 8.
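A sketch of that $\sqrt{n}$ check on the demo data defined earlier (B and the subsample sizes are arbitrary choices; the two printed numbers should be of similar magnitude):
rng = np.random.default_rng(3)
for n_sub in (100, 400):
    idx = rng.choice(len(x_demo), size=n_sub, replace=False)
    coefs = bootstrap_qr_coefs(x_demo[idx], y_demo[idx], tau=0.5, B=200, rng=rng)
    print(f'n = {n_sub}: sqrt(n) * std(a1) = '
          f'{np.sqrt(n_sub) * coefs[:, 1].std():.3f}')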
Remark (The Bernoulli variance factor τ(1−τ) and tail inflation).
The asymptotic variance scales with $\tau(1 - \tau)$, peaking at $\tau = 1/2$ (the median) and vanishing at the extremes $\tau \to 0$ and $\tau \to 1$. Geometrically: at the median, the indicator is a fair coin, contributing maximal variance to the estimating equation. At extreme $\tau$, the indicator is nearly constant, but $f_{Y \mid X}(q_\tau(X) \mid X)$ — the density at the quantile — also goes to zero in the tail, which generally pushes the variance up. The two effects together are why extreme-quantile QR estimates have notoriously wide CIs even at moderate $n$, and motivate Extreme Value Theory (coming soon) as a separate framework for the tail regime.
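The no-covariate special case makes the variance formula checkable in a few lines: for an intercept-only design, $D_0 = 1$ and $D_1 = f(q_\tau)$, so the asymptotic variance of the sample $\tau$-quantile is $\tau(1 - \tau)/f(q_\tau)^2$. A Monte Carlo sketch for standard normal $Y$:
from scipy.stats import norm

tau, n_mc, reps = 0.9, 2_000, 2_000
rng = np.random.default_rng(5)
q_hats = np.quantile(rng.standard_normal((reps, n_mc)), tau, axis=1)
theory = tau * (1 - tau) / norm.pdf(norm.ppf(tau)) ** 2
print('empirical n*Var:', n_mc * q_hats.var(), '| theoretical:', theory)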
Remark (Bahadur representation at covariate values).
A finer statement, due to Bahadur 1966 in the no-covariate case and extended to QR in Koenker–Bassett 1978, gives the linearised expansion at any fixed $x$:
$$\sqrt{n}\,\big(\hat q_\tau(x) - q_\tau(x)\big) = x^\top D_1^{-1} \cdot \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \big(\tau - \mathbf{1}\{Y_i \le q_\tau(X_i)\}\big) + o_p(1).$$
This identifies the asymptotic variance of the conditional-quantile estimator at any specific test point as $x^\top \Sigma_\tau\, x$, where $\Sigma_\tau = \tau(1 - \tau)\, D_1^{-1} D_0 D_1^{-1}$ is the variance from Theorem 3. The bootstrap distributions in the figure above are exactly the empirical version: for a fixed $x$ they should approach the asymptotic Gaussian as $n \to \infty$.
Multi-Quantile Estimation, Crossing, and Rearrangement
A natural use of QR is to fit multiple quantile levels simultaneously and read the resulting bands as a non-parametric estimate of the conditional distribution of $Y$ given $X$. Three or five evenly-spaced quantiles already give a usable picture; for high-resolution density estimation, fit at a fine grid in $\tau$.
But there is a problem. The population conditional quantiles are weakly increasing in $\tau$ by definition: $q_{\tau_1}(x) \le q_{\tau_2}(x)$ whenever $\tau_1 \le \tau_2$. The marginal QR estimates do not enforce this. Each $\tau$-fit is a separate optimisation, and at finite $n$ in finite-dimensional function classes, the resulting curves can — and frequently do — cross: the higher-$\tau$ fit can dip below the lower-$\tau$ fit at some $x$.
Remark (Quantile crossing as a coherence violation; constrained-LP fix).
Crossing is not just an aesthetic flaw. It produces probabilistic nonsense: $\mathbb{P}(Y \le \hat q_{\tau_1}(x)) \approx \tau_1$ and $\mathbb{P}(Y \le \hat q_{\tau_2}(x)) \approx \tau_2$ cannot both hold if $\tau_1 < \tau_2$ but $\hat q_{\tau_1}(x) > \hat q_{\tau_2}(x)$. Any downstream procedure that treats the bands as CDF estimates (CQR uses them as prediction-interval endpoints; quantile forests use them for distribution prediction) inherits the incoherence. Two structural fixes exist:
- Joint estimation with monotonicity constraints (Bondell–Reich–Wang 2010). Augment the LP with linear inequalities $\hat q_{\tau_k}(x_j) \le \hat q_{\tau_{k+1}}(x_j)$ at a grid of evaluation points; solve a single bigger LP. Pros: exact monotonicity by construction. Cons: scales poorly in the number of levels $K$; loses the per-$\tau$ decoupling that makes QR fast and parallelisable.
- Marginal estimation followed by REARRANGEMENT (Chernozhukov–Fernández-Val–Galichon 2010). Fit each $\tau_k$ independently as before, then at every evaluation point sort the predicted values along the $\tau$-axis. The rearranged curves are monotone by construction, equal to the originals when no crossing is present, and never worse in $L^p$ for any $p \ge 1$ (CFV-G 2010 Theorem 1). This is the path we take.
Definition 4 (Rearrangement of conditional-quantile estimates).
Given marginal quantile estimates $\hat q_{\tau_1}(x), \dots, \hat q_{\tau_K}(x)$ at a fixed test point $x$ and an increasing $\tau$-grid $\tau_1 < \dots < \tau_K$, the rearranged estimates at $x$ are the sorted values
$$\tilde q_{\tau_k}(x) = \hat q_{(k)}(x), \qquad k = 1, \dots, K,$$
i.e., the rearranged estimate at level $\tau_k$ equals the $k$-th order statistic of $\big(\hat q_{\tau_1}(x), \dots, \hat q_{\tau_K}(x)\big)$. The function $\tau_k \mapsto \tilde q_{\tau_k}(x)$ is monotone non-decreasing by construction.
Remark (Rearrangement weakly improves Lᵖ approximation).
If the true conditional quantile function is monotone in $\tau$ (which it is, by definition), then for any $p \ge 1$ the $L^p$ distance from the rearranged estimates to the truth is no larger than the distance from the original marginal estimates. The proof is rearrangement-inequality combinatorics: matching a monotone target with a monotone candidate is always at least as good as matching a monotone target with a non-monotone one. Strict improvement happens whenever crossing is present in the original estimates. So rearrangement is a free lunch as a post-processing step, never harmful.
Two helpers — fit a $K$-quantile bundle independently and then sort along the $\tau$-axis at every evaluation point:
def fit_multiple_quantiles(x_train, y_train, x_eval, taus, alpha_l1=0.01, degree=3):
    """Fit QR at each τ in `taus` on degree-`degree` polynomial features.

    Returns a (K, |x_eval|) array Q where Q[k, j] = q̂_{taus[k]}(x_eval[j]).
    alpha_l1: small L1 penalty (QuantileRegressor's `alpha`), kept tiny.
    """
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    Phi_train = poly.fit_transform(np.asarray(x_train).reshape(-1, 1))
    Phi_eval = poly.transform(np.asarray(x_eval).reshape(-1, 1))
    out = np.empty((len(taus), len(x_eval)))
    for k, tau in enumerate(taus):
        qr = QuantileRegressor(quantile=tau, alpha=alpha_l1, solver='highs')
        qr.fit(Phi_train, y_train)
        out[k] = qr.predict(Phi_eval)
    return out
def rearrange_quantile_estimates(Q):
"""CFV-G 2010 rearrangement: sort along the τ-axis at each evaluation point."""
return np.sort(Q, axis=0)
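A usage sketch on the demo data defined earlier (the grid sizes are arbitrary choices; after rearrangement the violation count is zero by construction):
taus = np.linspace(0.05, 0.95, 19)
x_eval = np.linspace(0, 4, 101)
Q_raw = fit_multiple_quantiles(x_demo, y_demo, x_eval, taus)
Q_mono = rearrange_quantile_estimates(Q_raw)
print('crossing violations before:', int(np.sum(np.diff(Q_raw, axis=0) < 0)),
      '| after:', int(np.sum(np.diff(Q_mono, axis=0) < 0)))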
Penalized Quantile Regression
Two settings push QR away from its plain Koenker–Bassett 1978 form. First, when $p > n$ or $p \approx n$, the unregularised QR LP is under-determined and badly behaved. Second, even at moderate $p$ with $n$ large enough for asymptotics, an $\ell_1$ penalty drives variable selection that simple shrinkage cannot match. Both settings use the same trick: add a penalty term to the pinball-loss objective.
The $\ell_2$ penalty (ridge-QR):
$$\hat\beta_\tau^{\text{ridge}} = \arg\min_{\beta} \; \frac{1}{n} \sum_{i=1}^n \rho_\tau\big(y_i - x_i^\top \beta\big) + \lambda \|\beta\|_2^2.$$
This has a closed-form Hessian (modulo the non-smoothness of $\rho_\tau$) and can be solved either by augmenting the LP with linear-quadratic structure or via the smoothed-check-loss accelerated-gradient solver we use in the in-browser viz components.
The $\ell_1$ penalty (the focus of this section, “QR-lasso” or “L1-QR”; note that sklearn’s QuantileRegressor alpha parameter multiplies this $\ell_1$ penalty, not the $\ell_2$ one):
$$\hat\beta_\tau^{\text{lasso}} = \arg\min_{\beta} \; \frac{1}{n} \sum_{i=1}^n \rho_\tau\big(y_i - x_i^\top \beta\big) + \lambda \|\beta\|_1.$$
This stays a linear program, since $\|\beta\|_1$ splits cleanly with $\beta = \beta^+ - \beta^-$, $\beta^+, \beta^- \ge 0$, and $\|\beta\|_1 = \mathbf{1}^\top \beta^+ + \mathbf{1}^\top \beta^-$. Augment the QR LP with these auxiliary variables and solve for the full path of $\lambda$ values.
Remark (Why L1-QR is the natural high-dim QR).
At the population level, suppose the true conditional $\tau$-quantile is linear with sparse slope vector $\beta_\tau$ — only $s \ll p$ out of $p$ coordinates are nonzero. The LP-lasso-QR estimator targets that sparse $\beta_\tau$ directly. By contrast, ridge-QR ($\ell_2$-penalised) shrinks all coordinates uniformly and never sets any to exactly zero, so it cannot recover the sparsity pattern; its high-dim rate is correspondingly slower.
Theorem 4 (Belloni–Chernozhukov 2011 oracle rate for L1-QR).
Suppose $(X_i, Y_i)$ are i.i.d. with $\|X\|_\infty$ bounded, the true conditional $\tau$-quantile is $x^\top \beta_\tau$ with $\|\beta_\tau\|_0 = s \ll p$, and the design satisfies suitable restricted eigenvalue conditions on the active subspace. Choose
$$\lambda \;\asymp\; C \sqrt{\frac{\tau(1 - \tau)\, \log p}{n}}$$
for a sufficiently large constant $C$. Then with high probability,
$$\big\|\hat\beta_\tau^{\text{lasso}} - \beta_\tau\big\|_2 \;\lesssim\; \sqrt{\frac{s \log p}{n}},$$
matching the oracle rate one would obtain with knowledge of the support of $\beta_\tau$.
Proof.
Sketch. The argument has three pieces, each of which uses tools developed in formalStatistics: Empirical Processes (Topic 32).
- Gradient-domination condition. The estimating function for QR has bounded influence (the pinball-loss subgradient is bounded by 1 in absolute value), so a self-normalised concentration inequality shows that $\lambda$ dominates the sup-norm of the empirical score at $\beta_\tau$ with high probability when $\lambda$ is chosen as above. This is the score condition that ensures the lasso doesn’t over-shrink.
- Restricted-eigenvalue excursion control. The pinball loss is convex but not strongly convex; restricting to the active cone restores a strong-convexity-like inequality on the design Gram matrix. Combine with the score condition to bound the deviation in the active subspace.
- Oracle inequality. The combined argument yields a high-probability bound of the form $\|\hat\beta_\tau^{\text{lasso}} - \beta_\tau\|_2^2 \lesssim K\, s \log p / n$, where $K$ absorbs the restricted-eigenvalue constant and the $\tau(1 - \tau)$ factor. Square-rooting gives the stated rate.
Full details in BC2011 §4; the general empirical-process scaffold is Topic 32 §32.5.
∎
Remark (Choice of λ: CV vs BC plug-in).
The asymptotic rate fixes $\lambda$ only up to a constant. In practice, two routes pick the constant: (a) the BC2011 self-normalised plug-in, which uses a pivotal quantity to set $\lambda$ from data without cross-validation; (b) cross-validation on the pinball-loss objective itself. The figure below uses (b). CV-based $\lambda$ is more familiar and tends to be more aggressive than the BC plug-in, which is intentionally conservative for inference.
The L1-QR LP splits $\beta = \beta^+ - \beta^-$ with $\beta^+, \beta^- \ge 0$, then adds the slack pair $u^+, u^-$ from §3. Four nonnegative variable groups give a standard-form LP that scipy / HiGHS solves directly:
def solve_lasso_qr_linprog(X, y, tau, lam, no_penalty_mask=None):
"""L1-penalised QR via the standard LP reformulation.
Variables z = [β⁺ (p), β⁻ (p), u⁺ (n), u⁻ (n)].
Objective: λ·(1ᵀβ⁺ + 1ᵀβ⁻) + τ·1ᵀu⁺ + (1−τ)·1ᵀu⁻,
with the λ coefficients zeroed out at indices in no_penalty_mask
(typically the intercept column).
Constraints: X(β⁺ − β⁻) + u⁺ − u⁻ = y, all variables ≥ 0.
"""
n, p = X.shape
if no_penalty_mask is None:
no_penalty_mask = np.zeros(p, dtype=bool)
pen_costs = np.where(no_penalty_mask, 0.0, lam)
c = np.concatenate([
pen_costs, # cost on β⁺
pen_costs, # cost on β⁻
tau * np.ones(n), # cost on u⁺
(1 - tau) * np.ones(n), # cost on u⁻
])
A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
bounds = [(0, None)] * (2 * p + 2 * n)
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method='highs')
z = res.x
return z[:p] - z[p:2 * p] # β = β⁺ − β⁻
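A sparse-recovery sketch on synthetic data of our own (the dimensions, seed, and λ constant are arbitrary choices; because the LP objective is the unaveraged sum, λ carries an extra factor of $n$ relative to Theorem 4's scaling, and the recovered support should typically match the active set):
rng = np.random.default_rng(6)
n_obs, p_dim, tau = 400, 30, 0.5
X_design = np.column_stack([np.ones(n_obs), rng.standard_normal((n_obs, p_dim))])
beta_true = np.zeros(p_dim + 1)
beta_true[[1, 4, 9]] = [2.0, -1.5, 1.0]           # s = 3 active coordinates
y_obs = X_design @ beta_true + rng.standard_normal(n_obs)

lam = 2.0 * np.sqrt(tau * (1 - tau) * n_obs * np.log(p_dim))  # heuristic constant
mask = np.zeros(p_dim + 1, dtype=bool)
mask[0] = True                                     # leave the intercept unpenalised
beta_hat = solve_lasso_qr_linprog(X_design, y_obs, tau, lam, no_penalty_mask=mask)
print('recovered support:', np.flatnonzero(np.abs(beta_hat) > 1e-6))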
Quantile Regression as the Base Learner of Conformalized QR
We close the loop with the use that motivated this topic in the T4 track: quantile regression as the base learner inside Conformalized Quantile Regression (Topic 1, §6). CQR uses two QR fits — at levels $\alpha/2$ and $1 - \alpha/2$ — to produce a heteroscedastic prediction band, and then applies conformal calibration on top to guarantee finite-sample marginal coverage at level $1 - \alpha$. Conformal’s coverage theorem holds for any base learner (Conformal Prediction, Theorem 1); the role of QR specifically is to give the band the right shape.
This is the division of labour:
- QR base fits provide the band SHAPE — wide where conditional variance is large, narrow where it is small. A symmetric residual-conformal interval does not have this property; it produces constant-width bands.
- Conformal calibration provides the band WIDTH — adjust the QR fits by an additive constant so that the empirical miscoverage rate on a held-out calibration set matches the target $\alpha$.
Conformal Prediction’s Theorem 1 (split-conformal validity) gives marginal coverage regardless of how badly QR is misspecified — even constant fits. What QR contributes is conditional approximate validity: when the linear class spans the true conditional quantile function, the conformal correction is a small constant and the resulting band tracks the true conditional coverage rate uniformly. When QR is misspecified, the conformal correction absorbs the misspecification globally; the resulting band still has marginal coverage but conditional coverage will be uneven. The full treatment of prediction-interval procedures — comparing CQR with locally adaptive variants (CQR-r, CQR-m), conditional-coverage methods, and base learners beyond linear QR — lives in Prediction Intervals (coming soon).
Recall (callback to Conformal Prediction §6, Definition 4) the CQR prediction set at level $1 - \alpha$ based on QR fits $\hat q_{\alpha/2}$ and $\hat q_{1 - \alpha/2}$:
$$\hat C_\alpha(x) = \Big[\hat q_{\alpha/2}(x) - \hat Q_{1-\alpha}(E),\;\; \hat q_{1-\alpha/2}(x) + \hat Q_{1-\alpha}(E)\Big],$$
where $\hat Q_{1-\alpha}(E)$ is the empirical $(1 - \alpha)$-quantile (with the standard $(1 + 1/n_{\text{cal}})$ finite-sample correction) of the calibration nonconformity scores
$$E_i = \max\Big(\hat q_{\alpha/2}(X_i) - Y_i,\; Y_i - \hat q_{1-\alpha/2}(X_i)\Big).$$
The two QR fits define the band shape; $\hat Q_{1-\alpha}(E)$ is a single scalar that inflates or deflates the band uniformly to hit the target coverage.
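A minimal split-CQR sketch on the demo data, reusing fit_predict_quantile (the three-way split sizes are arbitrary; coverage is checked on a held-out test fold):
alpha = 0.1
rng = np.random.default_rng(7)
perm = rng.permutation(len(x_demo))
tr, cal, te = perm[:250], perm[250:400], perm[400:]

lo_cal = fit_predict_quantile(x_demo[tr], y_demo[tr], x_demo[cal], alpha / 2)
hi_cal = fit_predict_quantile(x_demo[tr], y_demo[tr], x_demo[cal], 1 - alpha / 2)
scores = np.maximum(lo_cal - y_demo[cal], y_demo[cal] - hi_cal)   # E_i
k = int(np.ceil((1 - alpha) * (len(cal) + 1)))                    # finite-sample correction
q_conf = np.sort(scores)[k - 1]

lo_te = fit_predict_quantile(x_demo[tr], y_demo[tr], x_demo[te], alpha / 2)
hi_te = fit_predict_quantile(x_demo[tr], y_demo[tr], x_demo[te], 1 - alpha / 2)
covered = (y_demo[te] >= lo_te - q_conf) & (y_demo[te] <= hi_te + q_conf)
print(f'test coverage ≈ {covered.mean():.3f} (target {1 - alpha})')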
Connections and Further Reading
Within formalML, this topic feeds into:
- Conformal Prediction (T4 #1). The two-sided QR fits at $\alpha/2$ and $1 - \alpha/2$ are the base learner inside Conformalized Quantile Regression. Theorem 1 of conformal-prediction guarantees marginal coverage of the CQR band regardless of QR’s specification; the role of QR is to give the band the right shape — heteroscedastic where the data are heteroscedastic. See §8 above.
- Prediction Intervals (coming soon). A unifying treatment of frequentist prediction intervals — fixed-width residual-based, conformal, and quantile-regression-based — and their coverage guarantees under various assumptions (i.i.d., exchangeable, group-conditional). Quantile regression is one of three “spokes” feeding into that umbrella topic.
- Extreme Value Theory (coming soon). Theorem 3’s variance formula tells us QR’s asymptotic variance inflates as $\tau \to 0$ or $\tau \to 1$, because the density-weighted matrix $D_1$ shrinks in the tail. Beyond moderate quantile levels — once $\tau$ sits deep in the tails of a Gaussian-tailed $Y$ — direct QR is not the right framework. EVT replaces it with the generalised Pareto / generalised extreme-value families and a peaks-over-threshold estimator that targets the tail directly.
Cross-site prerequisites:
- formalStatistics: Linear Regression — the OLS analog. The KB78 estimator is the pinball-loss replacement of squared loss; the LP reformulation is the QR replacement of the normal equations; Theorem 3’s asymptotic Gaussian is the QR replacement of OLS’s $\sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} \mathcal{N}\big(0, \sigma^2\, \mathbb{E}[XX^\top]^{-1}\big)$.
- formalStatistics: Order Statistics and Quantiles — the no-covariate special case. With a single intercept feature, KB78 reduces to the empirical $\tau$-quantile of $Y$; Theorem 3 reduces to the classical Bahadur–Ghosh asymptotics.
- formalStatistics: Empirical Processes — the toolkit behind Theorem 3’s proof sketch (Knight’s identity, the empirical-process limit of the rescaled objective, the argmin lemma) and behind Theorem 4’s restricted-eigenvalue / oracle-inequality argument.
Internal prerequisites:
- Convex Analysis — the pinball loss is convex but non-smooth; the LP reformulation exploits its piecewise-linear structure.
- Gradient Descent — the smoothed-check-loss accelerated-gradient solver is what powers the in-browser visualisation widgets, where running an LP solver in the user’s browser is impractical.
Connections
- T4's first topic. The CQR callback in §8 builds directly on conformal-prediction §6 Definition 4; reading conformal-prediction first frames QR as the base learner that supplies the band SHAPE while conformal calibration supplies the band WIDTH. conformal-prediction
- The pinball loss is a piecewise-linear convex function. The LP reformulation in §3 is exactly the standard reduction of piecewise-linear convex optimization to a linear program; the slack-variable trick (u = u⁺ − u⁻ with u⁺, u⁻ ≥ 0) is the canonical convex-analysis device. convex-analysis
- The smoothed-check-loss accelerated-gradient solver underlying the in-browser visualization components is a direct adaptation of the smoothed-objective + Nesterov-acceleration construction — applied to a non-smooth piecewise-linear loss via Moreau-envelope smoothing. gradient-descent
- T4's track closer cites the Koenker–Knight asymptotic theorem from §5 here as Theorem 2 (pure QR's asymptotic conditional coverage), and the QR base learner is reused in pure QR (§3) and CQR (§5.1) of that topic. The architectural punchline there is that CQR equals pure QR with the threshold $0$ replaced by a conformal $(1-\alpha)$-quantile — one number of difference, one bridge theorem of consequence. prediction-intervals
- T4 sibling and track closer. Statistical depth handles the unconditional multivariate-quantile problem by collapsing dimension into a center-outward scalar; quantile regression handles the conditional problem one quantile-level at a time. The two converge to multivariate-quantile regression with depth-based prediction regions. statistical-depth
References & Further Reading
- paper Regression quantiles — Koenker & Bassett (1978) The foundational paper. Definition 2 (linear QR estimator) and the structural form of Theorem 3 originate here.
- paper Limiting distributions for L₁ regression estimators under general conditions — Knight (1998) The clean modern proof of Theorem 3 via the Knight-identity decomposition used in §5's proof sketch.
- book Quantile Regression — Koenker (2005) The canonical book-length reference. Comprehensive coverage of LP reformulation, asymptotic theory, and the rearrangement / crossing literature.
- paper Quantile and probability curves without crossing — Chernozhukov, Fernández-Val & Galichon (2010) The rearrangement procedure (Definition 4 and Remark 11). Establishes that rearrangement weakly improves Lᵖ approximation.
- paper ℓ₁-penalized quantile regression in high-dimensional sparse models — Belloni & Chernozhukov (2011) Theorem 4's oracle-rate result and the BC plug-in choice of λ (Remark 13). Restricted-eigenvalue conditions for QR-lasso.
- paper Noncrossing quantile regression curve estimation — Bondell, Reich & Wang (2010) The constrained-LP alternative to rearrangement (Remark 10). Joint estimation of multiple quantile levels with monotonicity constraints.
- paper A note on quantiles in large samples — Bahadur (1966) The original Bahadur representation in the no-covariate case. Remark 9 references the QR generalization.
- paper Conformalized quantile regression — Romano, Patterson & Candès (2019) The CQR procedure used in §8 (callback to T4 Topic 1). Same reference appears in conformal-prediction.mdx; intentional duplication for self-containment.