
High-Dimensional Regression

The lasso at $p \gg n$ — definition, geometry, and ISTA / FISTA / coordinate-descent solvers; the Bickel–Ritov–Tsybakov (2009) oracle inequality at the $\sigma^2 s \log(p)/n$ rate under the restricted-eigenvalue condition; the variable-selection story under the irrepresentable condition; ridge / elastic-net / adaptive-lasso variants; and the Zhang–Zhang / Javanmard–Montanari / van de Geer–Bühlmann–Ritov–Dezeure (2014) debiased lasso for valid $\sqrt n$-confidence inference on individual coefficients.

§1. From OLS to penalized regression

When the predictor matrix has more columns than rows, ordinary least squares stops being a single estimator and becomes a degenerate family — infinitely many vectors fit the training data perfectly, none of them generalize. This section establishes the high-dimensional regime, shows OLS failing on the canonical sparse-Gaussian-design problem we’ll reuse for the next eight sections, and introduces the two classical regularization remedies: ridge (L2-penalized) and lasso (L1-penalized). Ridge gives a unique solution and shrinks every coefficient smoothly toward zero; lasso gives a sparse solution and shrinks small coefficients all the way to zero. The geometric difference between those two penalties — corners on the L1 ball, smooth curvature on the L2 ball — is what makes the lasso the central object of this topic.

§1.1 The high-dimensional regime $p \gtrsim n$ and where it appears

Standard regression theory assumes more observations than features ($n > p$), often by orders of magnitude. The high-dimensional regime flips that: $p$ is comparable to or much larger than $n$. Three places it shows up routinely:

  • Genome-wide association studies (GWAS). Regress a phenotype on hundreds of thousands to millions of single-nucleotide polymorphisms. Typical scale is $p \approx 10^6$ with $n \approx 10^4$ patients; the ratio $p / n \approx 100$ is normal, not extreme.
  • Functional MRI. A whole-brain scan resolves $\sim 10^5$ voxels per timepoint; predicting a behavioral or clinical outcome from voxel-level activations gives $p \approx 10^5$ with $n$ in the low hundreds.
  • Text and high-cardinality categorical features. A bag-of-words encoding of even a modest corpus pushes $p$ into the millions while $n$ stays in the thousands.

These problems share more than $p \gg n$ — they also tend to be sparse: only a small subset of features carries the actual signal. A handful of SNPs drive most of the heritable variation in a quantitative trait; a focal brain region carries most of the predictive signal in an fMRI study; a few keywords carry most of the topic information in a document. The lasso’s design exploits this sparsity directly.

We’ll formalize sparsity as $\|\boldsymbol\beta^*\|_0 = s$ where $s \ll p$ — the true coefficient vector $\boldsymbol\beta^* \in \mathbb{R}^p$ has only $s$ nonzero entries. The set $S = \{j : \beta^*_j \neq 0\}$ is the support of $\boldsymbol\beta^*$, and $|S| = s$ is the sparsity level. We don’t know $S$ in advance; recovering it (or, more modestly, predicting well without recovering it) is the estimator’s job.

The canonical sparse high-dimensional dataset (DGP-1). We’ll reuse the same data-generating process across §§2–9 to make the comparisons concrete. Fix:

  • $n = 200$, $p = 500$, $s = 10$.
  • Rows $\mathbf{x}_i \in \mathbb{R}^p$ iid $\mathcal{N}(\mathbf{0}, \boldsymbol\Sigma)$ with $\boldsymbol\Sigma_{jk} = 0.5^{|j-k|}$ (AR(1) Toeplitz, weakly decaying off-diagonal correlation).
  • $\beta^*_j = 1$ for $j \in S = \{0, 1, \dots, 9\}$ (contiguous active set), $\beta^*_j = 0$ otherwise.
  • Noise $\boldsymbol\varepsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ with $\sigma = 0.5$.
  • Population $R^2 \approx 0.99$ (computed: $\boldsymbol\beta^{*\top} \boldsymbol\Sigma \boldsymbol\beta^* \approx 26.0$ via the AR(1) sum over a 10×10 active block).
  • Seed: np.random.default_rng(42).

The §10 debiased-lasso coverage demonstration switches to $(n, p, s) = (200, 100, 5)$ to make OLS feasible as a baseline. All other sections reuse DGP-1.
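For concreteness, here is a minimal NumPy sketch of DGP-1. The helper name make_dgp1 and the use of Generator.multivariate_normal are illustrative assumptions (the notebook's exact sampling code isn't reproduced here); later sketches reuse this helper.

```python
import numpy as np

def make_dgp1(n=200, p=500, s=10, rho=0.5, sigma=0.5, seed=42):
    """One draw from DGP-1: AR(1) Gaussian design, sparse beta*, Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Sigma_jk = rho^|j-k|  (AR(1) Toeplitz covariance)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta_star = np.zeros(p)
    beta_star[:s] = 1.0                      # contiguous active set S = {0, ..., s-1}
    y = X @ beta_star + sigma * rng.standard_normal(n)
    return X, y, beta_star

X, y, beta_star = make_dgp1()
print(X.shape, y.shape, int((beta_star != 0).sum()))   # (200, 500) (200,) 10
```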

§1.2 Why OLS fails: the rank-deficient normal equations

OLS minimizes the squared training loss:

$$\hat{\boldsymbol\beta}^{\text{OLS}} = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \frac{1}{n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2,$$

with the closed-form solution $\hat{\boldsymbol\beta}^{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$ when $\mathbf{X}^\top \mathbf{X}$ is invertible. Invertibility requires $\text{rank}(\mathbf{X}) = p$, which in turn requires $n \ge p$.

When $p > n$, the matrix $\mathbf{X}^\top \mathbf{X} \in \mathbb{R}^{p \times p}$ has rank at most $n < p$ and is singular. The normal equations $\mathbf{X}^\top \mathbf{X} \boldsymbol\beta = \mathbf{X}^\top \mathbf{y}$ have infinitely many solutions: for any vector $\mathbf{v}$ in the $(p - n)$-dimensional null space of $\mathbf{X}$, both $\hat{\boldsymbol\beta}$ and $\hat{\boldsymbol\beta} + \mathbf{v}$ achieve the same residual. The OLS objective has a flat plateau of global minima, all of them interpolating the training data exactly.

The standard pseudoinverse fix picks the minimum-norm solution from that plateau:

$$\hat{\boldsymbol\beta}^{\text{OLS, min-norm}} = \mathbf{X}^\top (\mathbf{X} \mathbf{X}^\top)^{-1} \mathbf{y},$$

well-defined whenever $\mathbf{X}$ has full row rank. But “minimum-norm OLS” is still OLS — it interpolates the training data ($\mathbf{X} \hat{\boldsymbol\beta} = \mathbf{y}$), the training MSE is exactly zero, and the test MSE is unbounded. Even before $p$ reaches $n$, predictive performance degrades: as $p \to n$ from below, the smallest singular value of $\mathbf{X}$ approaches zero, $(\mathbf{X}^\top \mathbf{X})^{-1}$ blows up, and the OLS coefficient estimates inflate even though the in-sample fit looks great.

Sweep the active feature count from $p_{\text{used}} = 10$ to $p_{\text{used}} = 199$ on a fresh DGP-1 sample, fit OLS to each subproblem (using only the first $p_{\text{used}}$ columns of $\mathbf{X}$), and plot train MSE and test MSE on log-y. Train MSE drops monotonically toward zero as $p_{\text{used}}$ approaches $n$. Test MSE bottoms out near $p_{\text{used}} \approx s = 10$ (the truth) and explodes by orders of magnitude as $p_{\text{used}} \to n$. The reader sees, in one picture, that OLS is incapable of using the sparsity structure — it doesn’t know that only 10 features matter, so it overfits aggressively as soon as it has the degrees of freedom to do so.
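A sketch of that sweep, under simplifying assumptions: plain np.linalg.lstsq stands in for the Cholesky-based solver the viz uses, and a large fresh draw approximates the test MSE. Variable names and the p_used grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, s, rho, sigma = 200, 500, 10, 0.5, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
beta_star = np.zeros(p)
beta_star[:s] = 1.0

def draw(m):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
    return X, X @ beta_star + sigma * rng.standard_normal(m)

X_tr, y_tr = draw(n)
X_te, y_te = draw(2000)                       # large held-out test draw

for p_used in (10, 25, 50, 100, 150, 190, 199):
    cols = slice(0, p_used)
    # OLS on the first p_used columns; lstsq handles the near-singular case gracefully
    b = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)[0]
    mse_tr = np.mean((y_tr - X_tr[:, cols] @ b) ** 2)
    mse_te = np.mean((y_te - X_te[:, cols] @ b) ** 2)
    print(f"p_used={p_used:4d}   train MSE={mse_tr:9.4f}   test MSE={mse_te:9.4f}")
```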

OLS on DGP-1 (n = 200, σ = 0.5, AR(1) ρ = 0.5, s = 10) as the active feature count p_used varies from 10 to 199. Train MSE drops monotonically toward zero as p_used → n; test MSE bottoms out near p_used ≈ s (the truth) and explodes by orders of magnitude near the rank-deficiency boundary p_used = n. OLS cannot exploit the sparsity structure: given enough degrees of freedom, it overfits aggressively. (Computed with Cholesky-based ridge OLS, jitter ε = 1e-8, for numerical stability at large p_used.)

§1.3 Ridge regression as the L2 fix

Ridge regression resolves the rank-deficiency by adding a quadratic penalty:

$$\hat{\boldsymbol\beta}^{\text{ridge}}(\lambda) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 + \frac{\lambda}{2} \|\boldsymbol\beta\|_2^2,$$

where $\|\boldsymbol\beta\|_2^2 = \sum_{j=1}^p \beta_j^2$ is the squared L2 norm and $\lambda \ge 0$ is a tuning parameter ($\lambda = 0$ recovers OLS, $\lambda \to \infty$ shrinks every coefficient to zero). The closed form is:

$$\hat{\boldsymbol\beta}^{\text{ridge}}(\lambda) = (\mathbf{X}^\top \mathbf{X} + n \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}.$$

The matrix $\mathbf{X}^\top \mathbf{X} + n\lambda \mathbf{I}$ is positive definite for any $\lambda > 0$ — its eigenvalues are bounded below by $n\lambda$ — so the inversion is well-defined regardless of whether $p \le n$ or $p > n$. Ridge restores uniqueness.
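A minimal sketch of the closed form. The n·λ scaling matches the 1/(2n) objective above; as an aside, scikit-learn's Ridge penalizes an unnormalized squared loss, so its alpha corresponds to nλ in this parameterization.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """(X'X + n*lam*I)^{-1} X'y via a linear solve, matching the 1/(2n) objective above."""
    n, p = X.shape
    A = X.T @ X + n * lam * np.eye(p)      # positive definite for any lam > 0
    return np.linalg.solve(A, X.T @ y)     # eigenvalues of A are >= n*lam, so the solve is stable

# beta_ridge = ridge_closed_form(X, y, lam=1.0)   # dense: all 500 entries nonzero on DGP-1
```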

Two structural properties matter for what follows:

  1. Continuous shrinkage, dense solutions. Each coefficient is shrunk toward zero by a factor that depends on the corresponding singular value of $\mathbf{X}$, but no coefficient is set to exactly zero (with probability one over a continuous design). On DGP-1, ridge produces 500 nonzero coefficients even though only 10 features matter.
  2. Smoothness in $\lambda$. $\hat{\boldsymbol\beta}^{\text{ridge}}(\lambda)$ is a continuous, differentiable function of $\lambda$ everywhere on $[0, \infty)$. There’s no “selection event” — coefficients shrink, they don’t snap.

Ridge is the right tool when all features carry some signal and the goal is to stabilize coefficient estimates against multicollinearity. In the high-dimensional sparse regime where the truth has 10 active features out of 500, ridge’s refusal to zero out the 490 inactive features is a liability — every irrelevant coefficient contributes variance to the prediction. The standard formal statistics treatment of ridge covers the $n > p$ case and the Bayesian Gaussian-prior interpretation; we’re using ridge here as the dense-shrinkage baseline that lasso will improve on.

§1.4 The lasso as the L1 alternative

The lasso (Tibshirani 1996) replaces the squared L2 penalty with an L1 penalty:

$$\hat{\boldsymbol\beta}^{\text{lasso}}(\lambda) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 + \lambda \|\boldsymbol\beta\|_1,$$

where $\|\boldsymbol\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm. The objective is convex (squared loss is convex, L1 norm is convex, sum of convex is convex), so any local minimum is a global minimum. But the L1 norm is not differentiable at zero, and that non-smoothness has two consequences that turn out to be central:

  1. Sparsity. The optimal solution has many coefficients exactly equal to zero, and the number of nonzero coefficients is controlled by $\lambda$. Geometrically, the L1 ball $\{\boldsymbol\beta : \|\boldsymbol\beta\|_1 \le t\}$ has corners at the coordinate axes; the constrained-form lasso solution is the point where the squared-loss contour first touches the L1 ball as the ball expands, and that contact point is generically at a corner — i.e., on a coordinate hyperplane, with one or more coefficients zero. The smooth L2 ball has no corners, which is why ridge solutions are dense. We’ll come back to this picture in §2.2 with an interactive viz.
  2. No closed form in general. Unlike ridge, the lasso has no general closed-form solution. There is one when the design is orthogonal ($\mathbf{X}^\top \mathbf{X} = n \mathbf{I}$, derived in §3.1 via the soft-thresholding operator), but for general $\mathbf{X}$ the solution requires an iterative solver. §3 covers coordinate descent, ISTA, and FISTA in detail.

The L1 penalty is the smallest convex penalty that produces sparsity. The non-convex L0 penalty $\|\boldsymbol\beta\|_0 = |\{j : \beta_j \neq 0\}|$ would also produce sparse solutions — and is in some ways the “right” objective for variable selection — but the resulting optimization problem is best-subset selection, which is NP-hard. The L1 penalty is the convex relaxation of L0: among convex penalties that produce sparsity, it’s the simplest to state, the easiest to optimize, and the one with the best-developed statistical theory.

Side-by-side: ridge at three penalty levels ($\alpha \in \{0.01, 1, 100\}$) and lasso at three penalty levels ($\lambda \in \{0.001, \lambda_{\text{CV}} \approx 0.056, 1\}$), all on the same DGP-1 sample. Ridge produces 500 nonzero coefficients at every $\alpha$ — heavier penalization means smaller coefficients across the board, never exactly zero. The lasso at the CV-selected $\lambda$ has only ~12 nonzero coefficients, mostly concentrated at the true active coordinates $S = \{0, \dots, 9\}$. The contrast is the visual punchline of the section.

Ridge (top row, three α levels) versus lasso (bottom row, three λ levels) coefficient estimates on DGP-1 (n = 200, p = 500, s = 10, σ = 0.5). The ten true active coordinates (j < 10) are highlighted in black; the 490 inactive coordinates are gray. Ridge is dense at every α: continuous shrinkage but no selection. The lasso at the CV-selected λ ≈ 0.056 produces ~12 nonzero coefficients, mostly concentrated at the true active set S = {0, …, 9}. Same data, two penalties, two completely different solution structures.
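A sketch of the comparison's compute with scikit-learn, reusing the make_dgp1 helper from the §1.1 sketch; the α and λ grids follow the caption, and the exact nonzero counts will vary slightly with the draw and solver tolerance.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

X, y, _ = make_dgp1()                         # helper from the section 1.1 sketch

for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    print(f"ridge alpha={alpha:>7}: nonzero = {np.count_nonzero(coef)}")      # 500 every time

for lam in (0.001, 0.056, 1.0):
    coef = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(X, y).coef_
    active = np.flatnonzero(coef)
    print(f"lasso lambda={lam:>7}: nonzero = {active.size}, first active indices = {active[:12]}")
```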

§1.5 Roadmap

The rest of the topic answers four questions about the lasso. What does the estimator look like? §2 fills in the geometric picture and basic structural results — existence, uniqueness, KKT subgradient conditions. How do we compute it? §3 derives the soft-thresholding closed form for orthogonal designs and develops the iterative solvers (coordinate descent, ISTA, FISTA) used in general. Does it predict well, and does it recover the true support? §§4–6 work out the bias-variance trade-off, prove the headline non-asymptotic prediction-risk bound (the lasso oracle inequality, $O(\sigma^2 s \log p / n)$ under the restricted-eigenvalue condition), and treat variable-selection consistency as a separate theorem with its own sufficient condition (irrepresentable). §7 covers practical $\lambda$ selection by cross-validation and information criteria. §8 covers the ridge / elastic-net / adaptive-lasso variants and when each wins. §9 deepens the geometry of the high-dimensional regime — RIP, sub-Gaussian designs, the implication chain between conditions. Can we do inference with it? §10 is the inferential payoff: naive lasso confidence intervals undercover (PoSI; Berk et al. 2013), and the debiased-lasso construction (Zhang-Zhang 2014; Javanmard-Montanari 2014; van de Geer et al. 2014) restores valid coverage. §11 extends the lasso to non-Gaussian responses (logistic, Poisson). §12 closes with connections to double/debiased ML, causal inference, and the Bayesian counterpart in Sparse Bayesian Priors.

§2. The lasso estimator

The lasso is convex L1-penalized least squares — a well-defined, well-studied estimator with a clean geometric story and a precise first-order characterization. This section establishes the formal definition (in both penalized and constrained forms), develops the geometric intuition for why L1 penalization produces sparse solutions (corners on the L1 ball, smooth curvature on the L2 ball), addresses existence and uniqueness (Tibshirani 2013), and works out the KKT subgradient conditions that characterize every lasso solution. The KKT conditions are the load-bearing technical machinery for the rest of the topic — the soft-thresholding closed form in §3.1, the basic inequality in the oracle inequality proof of §5.2, and the debiased-lasso construction of §10.2 all derive from them.

§2.1 The L1-penalized least-squares definition

The lasso estimator is the minimizer of an L1-penalized squared loss:

Definition 1 (Lasso estimator).

For a design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, response $\mathbf{y} \in \mathbb{R}^n$, and tuning parameter $\lambda \ge 0$, the lasso estimator is

$$\hat{\boldsymbol\beta}^{\text{lasso}}(\lambda) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 + \lambda \|\boldsymbol\beta\|_1 \right\},$$

where $\|\boldsymbol\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm.

Three things to note about the definition.

The $1/(2n)$ scaling. The factor of $1/2$ in front of the squared loss is a convention that makes the gradient equal $-\mathbf{X}^\top (\mathbf{y} - \mathbf{X} \boldsymbol\beta) / n$ (no leading 2), and the $1/n$ normalization makes the objective an empirical average. With this scaling, $\lambda$ has units of “covariate-response correlation” — comparable across sample sizes — and the optimal $\lambda$ scales as $\sigma \sqrt{\log(p) / n}$ (we’ll see this in §5). The scikit-learn convention (and the one we’ll use throughout the notebook) matches: Lasso(alpha=λ) minimizes $\frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 + \alpha \|\boldsymbol\beta\|_1$, with the same $1/2n$ out front.

Convexity. The squared loss is strictly convex in $\mathbf{X} \boldsymbol\beta$ but only convex — not strictly convex — in $\boldsymbol\beta$ when $\mathbf{X}$ has rank $< p$. The L1 norm is convex (every norm is), so the sum is convex. There are no local minima that aren’t global minima; the solution set is always a convex set in $\mathbb{R}^p$.

The L1 norm is not differentiable at zero. $|\beta_j|$ is differentiable everywhere except $\beta_j = 0$, where the subgradient is the closed interval $[-1, 1]$. This non-smoothness is exactly what produces sparsity — and is also why the lasso has no closed-form solution in general (subgradients require a discrete case analysis at each coordinate).

Constrained-form duality. The penalized lasso is equivalent to the constrained-form lasso

$$\hat{\boldsymbol\beta}^{\text{lasso}}(t) = \arg\min_{\|\boldsymbol\beta\|_1 \le t} \frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2,$$

via Lagrangian duality. For each $\lambda \ge 0$ there’s a corresponding budget $t(\lambda) = \|\hat{\boldsymbol\beta}^{\text{lasso}}(\lambda)\|_1$, and conversely each $t \in [0, \|\hat{\boldsymbol\beta}^{\text{OLS}}\|_1]$ corresponds to a unique $\lambda(t) \ge 0$. The mapping $t \leftrightarrow \lambda$ is monotone but not generally available in closed form. Both forms appear in the literature; the penalized form is more convenient for proofs (one term in the gradient, no constraint qualification needed), and the constrained form is more convenient for the geometric pictures we’ll draw next.

§2.2 Geometric picture: $L^1$ corners produce sparsity

Why does L1 penalization produce solutions with exact zeros? The cleanest answer is geometric: the L1 ball has corners on the coordinate axes, and the loss contour generically touches the ball at one of those corners.

Consider the constrained lasso in 2D with a fixed budget $t$. The feasible region $\{\boldsymbol\beta : \|\boldsymbol\beta\|_1 \le t\}$ is a diamond — a square rotated 45 degrees — with vertices at $(\pm t, 0)$ and $(0, \pm t)$. The objective contours $\{\boldsymbol\beta : \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 = c\}$ are concentric ellipses centered at $\hat{\boldsymbol\beta}^{\text{OLS}}$ with axes determined by the eigenvectors of $\mathbf{X}^\top \mathbf{X}$. The constrained solution is the point where the smallest ellipse just touches the diamond.

If $\hat{\boldsymbol\beta}^{\text{OLS}}$ lies outside the diamond — which is the interesting case; otherwise the constraint is inactive — the contact point is on the diamond’s boundary. The boundary consists of four edges and four vertices. Generically, for almost every $\mathbf{X}^\top \mathbf{X}$ and $\hat{\boldsymbol\beta}^{\text{OLS}}$, the contact is at a vertex, not a smooth point on an edge. At a vertex, one coordinate is exactly zero. Sparsity.

The L2 ball is the disk $\{\boldsymbol\beta : \|\boldsymbol\beta\|_2 \le t\}$ — smooth, no corners. The contact point with an ellipse is generically a smooth point on the disk’s boundary, with both coordinates nonzero. Density.

This picture extends to $\mathbb{R}^p$ with $p > 2$. The L1 ball has $2p$ vertices (one per coordinate-axis intersection), $2p(p-1)$ edges, and a hierarchy of lower-dimensional faces; contact at a $k$-dimensional face means the solution has exactly $k + 1$ nonzero coordinates. The L2 ball has no faces of any positive codimension; contact is always smooth, the solution is always dense.


2D pedagogical picture: β̂_OLS = (0.4, 1.6), Hessian H = XᵀX/n = [[1, 0.4], [0.4, 1]], loss Q(β) = (β − β̂_OLS)ᵀ H (β − β̂_OLS). At budget t = 1 the L1 contact lands at the vertex (0, 1) (sparse, one coordinate exactly zero), while the L2 contact lands at approximately (0.39, 0.92) (dense, both coordinates nonzero). As t varies, the L1 contact stays pinned to the diamond's nearest vertex while the L2 contact slides smoothly around the disk; the vertex generically achieves a lower loss than any edge interior, which is the geometric source of sparsity.
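The 2D example is small enough to verify by brute force. The sketch below grids the plane, evaluates the quadratic loss Q(β) = (β − β̂_OLS)ᵀ H (β − β̂_OLS), and picks the feasible minimizer inside each ball; the grid resolution and variable names are illustrative.

```python
import numpy as np

b_ols = np.array([0.4, 1.6])                  # unconstrained minimizer
H = np.array([[1.0, 0.4], [0.4, 1.0]])        # Hessian of the quadratic loss
t = 1.0                                       # constraint budget

g = np.linspace(-2.0, 2.0, 1601)              # 0.0025 grid spacing
B1, B2 = np.meshgrid(g, g)
pts = np.stack([B1.ravel(), B2.ravel()], axis=1)
d = pts - b_ols
loss = np.einsum('ij,jk,ik->i', d, H, d)      # Q(beta) = (beta - b_ols)' H (beta - b_ols)

in_l1 = np.abs(pts).sum(axis=1) <= t          # diamond
in_l2 = np.linalg.norm(pts, axis=1) <= t      # disk
print("L1 contact:", pts[in_l1][np.argmin(loss[in_l1])])   # ~(0, 1): a vertex, one exact zero
print("L2 contact:", pts[in_l2][np.argmin(loss[in_l2])])   # ~(0.39, 0.92): both nonzero
```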

The figure is the geometric content of the lasso. Everything else — the soft-thresholding closed form (§3.1), the lasso path (§4.4), the active-set / equicorrelation-set characterization (§2.4 below) — is algebraic machinery that operationalizes it.

§2.3 Existence and uniqueness

Existence. The lasso objective is convex, continuous, and coercive — as $\|\boldsymbol\beta\| \to \infty$, the squared loss is bounded below by zero and the L1 penalty grows linearly, so the objective grows without bound. Convex-and-coercive implies attainment of the minimum, so the solution set $\hat{B}(\lambda) = \arg\min_{\boldsymbol\beta} \{(1/2n) \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 + \lambda \|\boldsymbol\beta\|_1\}$ is non-empty for every $\lambda \ge 0$. The solution set is also convex (intersection of all minima of a convex function) and closed (by continuity).

Uniqueness. When does $\hat{B}(\lambda)$ contain a single point? Two sufficient conditions:

  1. $p \le n$ and $\mathbf{X}$ has full column rank. Then $\mathbf{X}^\top \mathbf{X}$ is positive definite, the squared loss is strictly convex in $\boldsymbol\beta$, the lasso objective is strictly convex, and the minimum is unique.
  2. $\mathbf{X}$ has columns in “general position”. Tibshirani (2013) showed that this condition — no $k + 1$ columns of $\pm \mathbf{X}$ lie in an affine subspace of dimension $k - 1$ — is sufficient for lasso uniqueness regardless of whether $p \le n$ or $p > n$. With probability one over any continuous random design (Gaussian, bounded continuous, etc.), the columns are in general position. So for every continuous random $\mathbf{X}$, the lasso solution is unique with probability one, even at $p \gg n$.

The conditions can fail with discrete or rank-deficient designs. Two pathologies:

  • Duplicate columns. If $\mathbf{X}_j = \mathbf{X}_k$ for some $j \neq k$, then $\hat\beta_j$ and $\hat\beta_k$ can be redistributed freely subject to $\hat\beta_j + \hat\beta_k$ being constant. The solution set is a 1-parameter family.
  • One-hot encodings of categorical features. If $\mathbf{X}_j + \mathbf{X}_k + \cdots = \mathbf{1}$ for some subset of columns, the same pathology arises after centering.

But — and this is what saves the lasso in practice — the fitted values $\mathbf{X} \hat{\boldsymbol\beta}$ are always unique, even when $\hat{\boldsymbol\beta}$ is not. The squared loss is strictly convex in $\mathbf{X} \boldsymbol\beta$, so any two solutions $\hat{\boldsymbol\beta}^{(1)}, \hat{\boldsymbol\beta}^{(2)} \in \hat{B}(\lambda)$ satisfy $\mathbf{X} \hat{\boldsymbol\beta}^{(1)} = \mathbf{X} \hat{\boldsymbol\beta}^{(2)}$. The prediction is uniquely determined; only the coefficient decomposition can be ambiguous.

For DGP-1 the design is continuous Gaussian, so the lasso solution is unique with probability one. We’ll assume uniqueness throughout the rest of the topic.

§2.4 KKT subgradient conditions

The lasso objective is convex but non-differentiable. The first-order optimality condition uses the subgradient of the L1 norm in place of the gradient. Recall: for a convex function $f : \mathbb{R}^p \to \mathbb{R}$, the subdifferential at $\boldsymbol\beta$ is the set

$$\partial f(\boldsymbol\beta) = \{\mathbf{g} \in \mathbb{R}^p : f(\boldsymbol\beta') \ge f(\boldsymbol\beta) + \mathbf{g}^\top (\boldsymbol\beta' - \boldsymbol\beta) \;\; \forall \boldsymbol\beta'\}.$$

For differentiable $f$, $\partial f(\boldsymbol\beta) = \{\nabla f(\boldsymbol\beta)\}$ — a singleton. For non-differentiable points, it’s a non-trivial set. The subdifferential of the absolute value $|x|$ is

$$\partial |x| = \begin{cases} \{\mathrm{sign}(x)\} & x \neq 0, \\ [-1, 1] & x = 0. \end{cases}$$

The L1 norm $\|\boldsymbol\beta\|_1 = \sum_j |\beta_j|$ has subdifferential $\partial \|\boldsymbol\beta\|_1 = \{\mathbf{g} : g_j \in \partial |\beta_j| \text{ for all } j\}$.

The first-order optimality condition for the lasso — $\boldsymbol{0}$ is in the subdifferential of the objective at $\hat{\boldsymbol\beta}$ — gives the KKT subgradient conditions:

Proposition 1 (Lasso KKT conditions).

$\hat{\boldsymbol\beta}$ is a lasso solution at $\lambda > 0$ if and only if there exists $\hat{\mathbf{g}} \in \partial \|\hat{\boldsymbol\beta}\|_1$ such that

$$\frac{1}{n} \mathbf{X}^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}) = \lambda \hat{\mathbf{g}}.$$

Coordinate-by-coordinate, this splits into two cases:

  • Active coordinates ($\hat\beta_j \neq 0$): $\hat g_j = \mathrm{sign}(\hat\beta_j)$, so
$$\frac{1}{n} \mathbf{X}_j^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}) = \lambda \, \mathrm{sign}(\hat\beta_j).$$

The residual correlation with each active feature is exactly $\pm \lambda$ — the active features all sit on the boundary of the same “correlation level set.”

  • Inactive coordinates ($\hat\beta_j = 0$): $\hat g_j \in [-1, 1]$, so
$$\left| \frac{1}{n} \mathbf{X}_j^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}) \right| \le \lambda.$$

The residual correlation with each inactive feature is bounded by $\lambda$ — strictly less, generically.

We’ll use the names $\mathbf{X}_j$ for the $j$-th column of $\mathbf{X}$, $\hat{\mathbf{r}} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}$ for the residual vector, and $\hat{c}_j = (1/n) \mathbf{X}_j^\top \hat{\mathbf{r}}$ for the residual-feature correlation at coordinate $j$. In this notation, KKT says: $|\hat c_j| = \lambda$ for active $j$, $|\hat c_j| \le \lambda$ for inactive $j$.

Active set, equicorrelation set. Define the active set as $\hat A_\lambda = \{j : \hat\beta_j \neq 0\}$ and the equicorrelation set as $\hat E_\lambda = \{j : |\hat c_j| = \lambda\}$. The KKT conditions imply $\hat A_\lambda \subseteq \hat E_\lambda$ — every active coordinate is at the equicorrelation boundary. Generically, $\hat A_\lambda = \hat E_\lambda$ (every coordinate at the boundary is also active). When $\hat A_\lambda \subsetneq \hat E_\lambda$ — i.e., when an inactive coordinate happens to sit exactly at $|\hat c_j| = \lambda$ — the solution is non-unique on the equicorrelation set.

Two corollaries we’ll use later. First, the active set has size at most $\min(n, p)$: if $|\hat A_\lambda| > n$, the system $\frac{1}{n} \mathbf{X}_{\hat A}^\top (\mathbf{y} - \mathbf{X}_{\hat A} \hat{\boldsymbol\beta}_{\hat A}) = \lambda \, \mathrm{sign}(\hat{\boldsymbol\beta}_{\hat A})$ has no full-rank solution unless the columns of $\mathbf{X}_{\hat A}$ are linearly dependent. So the lasso never selects more than $n$ features, regardless of $p$. (This is one of the lasso’s most useful structural properties: it returns a model of size at most $n$, which is interpretable.)

Second, the KKT conditions give the dual certificate for sparsity recovery (used in §6 for variable-selection consistency): if there exists a subgradient vector $\hat{\mathbf{g}}$ with $\hat{\mathbf{g}}_S = \mathrm{sign}(\boldsymbol\beta^*_S)$ and $\|\hat{\mathbf{g}}_{S^c}\|_\infty < 1$ that satisfies KKT, then the lasso correctly identifies $S$ as the active set. The construction of this dual certificate, and the conditions on $\mathbf{X}$ that make it possible, are the irrepresentable condition (§6.2).

We verify the KKT conditions numerically on the §1 lasso fit at $\lambda_{\text{CV}}$: residual correlations at active coordinates equal $\pm \lambda$ to within $10^{-3}$; residual correlations at inactive coordinates are strictly bounded by $\lambda$, with the bulk well inside the dead zone.

KKT verification on the §1 lasso fit at λ_CV: histogram of the residual-feature correlations ĉⱼ = (1/n) Xⱼᵀ r̂. Active coordinates (red) sit at |ĉⱼ| = λ to within 10⁻³; inactive coordinates (gray) are strictly bounded by λ, with the bulk well inside the dead zone (−λ, λ).
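A sketch of that numerical check with scikit-learn, again reusing the make_dgp1 helper from §1.1. LassoCV supplies a λ_CV and the residual-feature correlations ĉⱼ are compared against it; exact values depend on the draw and the solver tolerance.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

X, y, _ = make_dgp1()                                  # helper from the section 1.1 sketch
n = X.shape[0]

lam_cv = LassoCV(cv=10, fit_intercept=False).fit(X, y).alpha_
fit = Lasso(alpha=lam_cv, fit_intercept=False, max_iter=100_000).fit(X, y)

c_hat = X.T @ (y - X @ fit.coef_) / n                  # residual-feature correlations c_j
active = fit.coef_ != 0

print(f"lambda_CV = {lam_cv:.4f}")
print("active   |c_j| in  [{:.4f}, {:.4f}]".format(np.abs(c_hat[active]).min(),
                                                   np.abs(c_hat[active]).max()))   # both ~ lambda_CV
print("inactive max |c_j| = {:.4f}".format(np.abs(c_hat[~active]).max()))          # <= lambda_CV
```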

§3. Solving the lasso

The lasso has no closed-form solution for general $\mathbf{X}$, so we need iterative algorithms. This section develops the four solvers that matter in practice, in increasing order of sophistication: the soft-thresholding closed form for orthogonal designs (§3.1, the only case that admits a closed form, but conceptually load-bearing because every general-purpose solver reduces to it inside the inner loop); coordinate descent (§3.2, the glmnet workhorse, fastest in practice for moderate-sized problems and the default in scikit-learn); ISTA, the proximal-gradient method (§3.3, simple and slow, $O(1/k)$ convergence rate, the natural first step toward FISTA); and FISTA with Nesterov momentum (§3.4, the same proximal-gradient framework with a momentum trick that improves the rate to $O(1/k^2)$, with full convergence proof). §3.5 gives practical solver-choice notes.

A common thread: every solver in this section is some application of the soft-thresholding operator $S(z, \lambda) = \mathrm{sign}(z) \cdot \max(|z| - \lambda, 0)$, the proximal operator of the L1 norm. ISTA / FISTA / coordinate descent differ only in what they soft-threshold and how often. So §3.1’s three-line derivation is the algorithmic kernel of everything that follows.

§3.1 Soft-thresholding closed form for orthogonal designs

When $\mathbf{X}^\top \mathbf{X} = n \mathbf{I}$ — an orthogonal design, achievable in practice via QR decomposition of any full-rank $\mathbf{X}$ — the lasso decouples across coordinates and admits a closed-form solution.

Substitute $\mathbf{X}^\top \mathbf{X} = n \mathbf{I}$ into the lasso objective:

$$F(\boldsymbol\beta) = \frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 + \lambda \|\boldsymbol\beta\|_1 = \frac{1}{2n} \|\mathbf{y}\|_2^2 - \frac{1}{n} \mathbf{y}^\top \mathbf{X} \boldsymbol\beta + \frac{1}{2} \|\boldsymbol\beta\|_2^2 + \lambda \|\boldsymbol\beta\|_1.$$

Drop the constant $\frac{1}{2n} \|\mathbf{y}\|_2^2$ and let $\mathbf{z} = \mathbf{X}^\top \mathbf{y} / n$. The objective separates by coordinate:

$$F(\boldsymbol\beta) = \mathrm{const} + \sum_{j=1}^p \left[ \frac{1}{2} (\beta_j - z_j)^2 + \lambda |\beta_j| \right] - \frac{1}{2} \|\mathbf{z}\|_2^2.$$

So the lasso reduces to $p$ independent univariate problems: minimize $\frac{1}{2}(\beta - z)^2 + \lambda |\beta|$ over $\beta \in \mathbb{R}$, one for each coordinate.

Theorem 1 (Soft-thresholding closed form).

For $z \in \mathbb{R}$ and $\lambda \ge 0$, the unique minimizer of $\frac{1}{2}(\beta - z)^2 + \lambda |\beta|$ is

$$S(z, \lambda) := \mathrm{sign}(z) \cdot \max(|z| - \lambda, 0) = \begin{cases} z - \lambda & z > \lambda, \\ 0 & |z| \le \lambda, \\ z + \lambda & z < -\lambda. \end{cases}$$

Equivalently, the lasso solution on an orthogonal design is $\hat\beta_j = S(z_j, \lambda)$ with $z_j = (\mathbf{X}^\top \mathbf{y} / n)_j$.

Proof.

The objective $f(\beta) = \frac{1}{2}(\beta - z)^2 + \lambda |\beta|$ is convex (sum of convex), continuous, and coercive, so a minimum exists and is unique (the quadratic part is strictly convex). The KKT condition: $0 \in \partial f(\hat\beta) = \hat\beta - z + \lambda \, \partial |\hat\beta|$, equivalently $z - \hat\beta \in \lambda \, \partial |\hat\beta|$.

Case 1: $\hat\beta > 0$. Then $\partial |\hat\beta| = \{1\}$, so $z - \hat\beta = \lambda$, giving $\hat\beta = z - \lambda$. This is consistent with $\hat\beta > 0$ if and only if $z > \lambda$.

Case 2: $\hat\beta < 0$. Then $\partial |\hat\beta| = \{-1\}$, so $z - \hat\beta = -\lambda$, giving $\hat\beta = z + \lambda$. Consistent with $\hat\beta < 0$ if and only if $z < -\lambda$.

Case 3: $\hat\beta = 0$. Then $\partial |\hat\beta| = [-1, 1]$, so we need $z \in \lambda \cdot [-1, 1] = [-\lambda, \lambda]$, i.e., $|z| \le \lambda$.

The three cases partition $\mathbb{R}$ and give a unique $\hat\beta$ for each $z$. Combining: $\hat\beta = S(z, \lambda)$.

The geometric content: for each coordinate, if the (rescaled) least-squares estimate $z_j$ is small in magnitude — bounded by $\lambda$ — the lasso sets $\hat\beta_j$ to exactly zero. Otherwise, the lasso shrinks $z_j$ toward zero by exactly $\lambda$ in absolute value. The “dead zone” $|z_j| \le \lambda$ is the source of sparsity; the constant shrinkage $|\hat\beta_j| = |z_j| - \lambda$ outside the dead zone is what biases active coefficients toward zero (the bias problem the debiased lasso fixes in §10).

For non-orthogonal designs (the generic case), no closed form exists. But $S(z, \lambda)$ remains the algorithmic atom: it’s the proximal operator of the L1 norm,

$$\mathrm{prox}_{\eta \lambda \|\cdot\|_1}(\mathbf{z}) = \arg\min_{\boldsymbol\beta} \left\{ \frac{1}{2 \eta} \|\boldsymbol\beta - \mathbf{z}\|_2^2 + \lambda \|\boldsymbol\beta\|_1 \right\} = S(\mathbf{z}, \eta \lambda),$$

applied componentwise. Coordinate descent (§3.2) and proximal gradient methods (§§3.3–3.4) repeatedly apply $S(\cdot, \cdot)$ inside their iterations.
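In code the operator is one line, and the orthogonal-design claim of Theorem 1 can be checked against a generic solver. The QR construction of an exactly orthogonal design and the problem sizes below are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Orthogonal-design check: build X with X'X = n I via QR, then compare the closed form
# against a generic lasso solver.
rng = np.random.default_rng(0)
n, p, lam = 100, 20, 0.1
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q                                     # orthonormal columns scaled by sqrt(n)
beta_true = np.concatenate([np.ones(3), np.zeros(p - 3)])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta_closed = soft_threshold(X.T @ y / n, lam)         # Theorem 1, coordinate by coordinate
beta_solver = Lasso(alpha=lam, fit_intercept=False, max_iter=100_000).fit(X, y).coef_
print(np.max(np.abs(beta_closed - beta_solver)))       # ~0 up to solver tolerance
```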

§3.2 Coordinate descent

Coordinate descent solves the lasso by cycling through coordinates and minimizing the objective over one coordinate at a time, keeping the others fixed. Each subproblem is univariate and admits a closed form via soft-thresholding.

Derivation. Fix all coordinates except βj\beta_j. The lasso objective restricted to βj\beta_j is

$$F_j(\beta_j) = \frac{1}{2n} \|\mathbf{r}_{-j} - \mathbf{X}_j \beta_j\|_2^2 + \lambda |\beta_j| + (\text{terms not depending on } \beta_j),$$

where $\mathbf{r}_{-j} = \mathbf{y} - \sum_{k \neq j} \mathbf{X}_k \beta_k$ is the partial residual with the $j$-th feature’s contribution removed. Expand the squared loss:

$$F_j(\beta_j) = \frac{c_j}{2} \beta_j^2 - z_j \beta_j + \lambda |\beta_j| + \mathrm{const}, \quad c_j := \|\mathbf{X}_j\|_2^2 / n, \quad z_j := \mathbf{X}_j^\top \mathbf{r}_{-j} / n.$$

This is $\frac{c_j}{2}(\beta_j - z_j / c_j)^2 + \lambda |\beta_j| + \mathrm{const}'$ — the same univariate problem from §3.1 up to a rescaling. The KKT condition gives

$$\hat\beta_j^{\text{new}} = \frac{S(z_j, \lambda)}{c_j} = \frac{1}{c_j} \cdot \mathrm{sign}(z_j) \cdot \max(|z_j| - \lambda, 0).$$

Coordinate descent algorithm.

  1. Initialize $\boldsymbol\beta = \mathbf{0}$, $\mathbf{r} = \mathbf{y}$.
  2. For each $j = 1, 2, \dots, p$ (cyclically):
    • Form the partial residual: $\tilde{\mathbf{r}} = \mathbf{r} + \mathbf{X}_j \beta_j$.
    • Compute $z_j = \mathbf{X}_j^\top \tilde{\mathbf{r}} / n$ and $c_j = \|\mathbf{X}_j\|_2^2 / n$.
    • Update $\beta_j^{\text{new}} = S(z_j, \lambda) / c_j$.
    • Update the residual: $\mathbf{r} \leftarrow \mathbf{r} - \mathbf{X}_j (\beta_j^{\text{new}} - \beta_j)$, then $\beta_j \leftarrow \beta_j^{\text{new}}$.
  3. Repeat step 2 until $\|\boldsymbol\beta^{\text{old}} - \boldsymbol\beta^{\text{new}}\|$ falls below tolerance.

Why the residual update. Storing $\mathbf{r}$ and updating it incrementally avoids re-computing $\mathbf{X} \boldsymbol\beta$ from scratch at each coordinate update. Each update costs $O(n)$ instead of $O(np)$, so a full cycle costs $O(np)$ — the same as one ISTA / FISTA iteration.
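A sketch of the cyclic loop above in plain NumPy, with no warm starts or active-set screening, so it is far slower than glmnet or scikit-learn but follows the update rule step for step.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200, tol=1e-8):
    """Cyclic coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    c = (X ** 2).sum(axis=0) / n              # c_j = ||X_j||^2 / n, precomputed once
    beta = np.zeros(p)
    r = y.copy()                              # running residual y - X beta
    for _ in range(n_sweeps):
        max_step = 0.0
        for j in range(p):
            # z_j = X_j' r_{-j} / n, computed from the stored residual instead of rebuilding r_{-j}
            z_j = X[:, j] @ r / n + c[j] * beta[j]
            b_new = soft_threshold(z_j, lam) / c[j]
            if b_new != beta[j]:
                r -= X[:, j] * (b_new - beta[j])      # O(n) residual update
                max_step = max(max_step, abs(b_new - beta[j]))
                beta[j] = b_new
        if max_step < tol:
            break
    return beta
```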

Convergence. The lasso objective is the sum of a smooth quadratic (strictly convex in $\mathbf{X} \boldsymbol\beta$) and a separable convex term $\lambda \sum_j |\beta_j|$. Tseng (2001) showed that block coordinate descent converges to a global minimum for any objective of this form (smooth + separable convex), with no rate guarantee in general but linear convergence under additional assumptions. In practice on lasso problems with continuous designs, coordinate descent is one of the fastest methods — Friedman, Hastie & Tibshirani (2010) report 10–100× speedups over LARS and proximal-gradient methods at typical $(n, p)$ scales. The glmnet package and scikit-learn’s Lasso both use it as the default.

Warm starts along a $\lambda$ path. Practical solvers compute the lasso path $\hat{\boldsymbol\beta}(\lambda)$ for a decreasing grid of $\lambda$ values, using the previous solution as a warm start at the next $\lambda$. The path is piecewise linear in $\lambda$ (Efron-Hastie-Johnstone-Tibshirani 2004), so a small step in $\lambda$ requires few coordinate-descent passes to converge — typically 5–20.

§3.3 ISTA: the proximal gradient method

Coordinate descent works for the lasso because the L1 penalty is separable. For more general non-smooth penalties (group lasso, fused lasso, nuclear norm), we need a different framework. Proximal gradient methods generalize gradient descent to objectives of the form $F(\boldsymbol\beta) = f(\boldsymbol\beta) + g(\boldsymbol\beta)$, where $f$ is smooth and $g$ is non-smooth but admits a tractable proximal operator. For the lasso, $f(\boldsymbol\beta) = (1/2n) \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2$ and $g(\boldsymbol\beta) = \lambda \|\boldsymbol\beta\|_1$, with $\mathrm{prox}_{\eta g}(\mathbf{z}) = S(\mathbf{z}, \eta \lambda)$.

The iterative soft-thresholding algorithm (ISTA) is the proximal-gradient iteration:

$$\boldsymbol\beta^{k+1} = \mathrm{prox}_{\eta g}(\boldsymbol\beta^k - \eta \nabla f(\boldsymbol\beta^k)) = S\!\left( \boldsymbol\beta^k - \frac{\eta}{n} \mathbf{X}^\top (\mathbf{X} \boldsymbol\beta^k - \mathbf{y}), \; \eta \lambda \right),$$

with a step size $\eta = 1/L$ where $L = \|\mathbf{X}\|_2^2 / n$ is the Lipschitz constant of $\nabla f$ (largest eigenvalue of $\mathbf{X}^\top \mathbf{X} / n$). Each iteration costs one matrix-vector multiply $\mathbf{X} \boldsymbol\beta^k$ and one $\mathbf{X}^\top \mathbf{r}$ — the same $O(np)$ as a coordinate-descent cycle.
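A sketch of the ISTA iteration with the fixed step 1/L. The Lipschitz constant is computed by an exact eigendecomposition, which is fine at these problem sizes; a power iteration would be the usual choice at scale.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """ISTA for (1/2n)||y - X beta||^2 + lam * ||beta||_1, fixed step 1/L."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n)[-1]   # Lipschitz constant of the smooth gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n       # gradient of the squared-loss term
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```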

Theorem 2 (ISTA convergence rate).

Let $L$ be the Lipschitz constant of $\nabla f$ and let $\boldsymbol\beta^*$ be a minimizer of $F = f + g$. The ISTA iterates $\{\boldsymbol\beta^k\}$ with step size $\eta = 1/L$ satisfy

$$F(\boldsymbol\beta^k) - F(\boldsymbol\beta^*) \le \frac{L}{2k} \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|_2^2 \quad \text{for all } k \ge 1.$$
Proof.

The proof has two ingredients: a per-step descent lemma and a telescoping argument.

Descent lemma (Beck-Teboulle 2009, Lemma 2.3). For any $\boldsymbol\beta, \mathbf{y} \in \mathbb{R}^p$, the proximal-gradient step $T(\boldsymbol\beta) = \mathrm{prox}_{(1/L) g}(\boldsymbol\beta - (1/L) \nabla f(\boldsymbol\beta))$ satisfies

$$F(T(\boldsymbol\beta)) - F(\mathbf{y}) \le \frac{L}{2} \|\boldsymbol\beta - \mathbf{y}\|_2^2 - \frac{L}{2} \|T(\boldsymbol\beta) - \mathbf{y}\|_2^2.$$

The proof uses $L$-smoothness of $f$ ($f(T(\boldsymbol\beta)) \le f(\boldsymbol\beta) + \nabla f(\boldsymbol\beta)^\top (T(\boldsymbol\beta) - \boldsymbol\beta) + (L/2) \|T(\boldsymbol\beta) - \boldsymbol\beta\|^2$, the standard descent lemma) plus the variational characterization of the prox.

Telescope. Apply the descent lemma at iterate $k$ with $\mathbf{y} = \boldsymbol\beta^*$:

$$F(\boldsymbol\beta^{k+1}) - F^* \le \frac{L}{2} \|\boldsymbol\beta^k - \boldsymbol\beta^*\|_2^2 - \frac{L}{2} \|\boldsymbol\beta^{k+1} - \boldsymbol\beta^*\|_2^2.$$

Sum from $k = 0$ to $K - 1$. The right side telescopes to $(L/2) \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|_2^2 - (L/2) \|\boldsymbol\beta^K - \boldsymbol\beta^*\|_2^2 \le (L/2) \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|_2^2$. The left side, using monotonicity of $F(\boldsymbol\beta^k)$ along the iteration (also from the descent lemma applied with $\mathbf{y} = \boldsymbol\beta^k$), is bounded below by $K \cdot (F(\boldsymbol\beta^K) - F^*)$. Dividing by $K$ gives the rate.

The $O(1/k)$ rate is “sublinear” — to halve the suboptimality $F(\boldsymbol\beta^k) - F^*$ requires doubling $k$. ISTA is simple and stable but slow.

§3.4 FISTA: Nesterov momentum and the $O(1/k^2)$ rate

Beck-Teboulle (2009) showed that adding Nesterov momentum to ISTA accelerates the convergence rate from $O(1/k)$ to $O(1/k^2)$ — a quadratic improvement in iteration count for the same accuracy. The algorithm:

FISTA. Set $\boldsymbol\beta^0 = \boldsymbol\beta^{-1} = \mathbf{0}$, $t_1 = 1$. For $k = 1, 2, \dots$:

  1. $t_{k+1} = (1 + \sqrt{1 + 4 t_k^2}) / 2$ — update the momentum schedule.
  2. $\mathbf{z}^k = \boldsymbol\beta^k + \frac{t_k - 1}{t_{k+1}} (\boldsymbol\beta^k - \boldsymbol\beta^{k-1})$ — momentum extrapolation.
  3. $\boldsymbol\beta^{k+1} = S\!\left( \mathbf{z}^k - \frac{1}{nL} \mathbf{X}^\top (\mathbf{X} \mathbf{z}^k - \mathbf{y}), \; \lambda / L \right)$ — proximal gradient step from $\mathbf{z}^k$, not $\boldsymbol\beta^k$.

The momentum coefficient $(t_k - 1)/t_{k+1}$ approaches $1$ as $k$ grows, giving the iteration a “running start” along the previous direction of motion.
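A sketch of FISTA mirroring the listing above; it differs from the ISTA sketch only in the momentum bookkeeping.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_fista(X, y, lam, n_iter=500):
    """FISTA (Beck-Teboulle 2009) for (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n)[-1]
    beta = np.zeros(p)
    beta_prev = np.zeros(p)
    t = 1.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = beta + ((t - 1.0) / t_next) * (beta - beta_prev)              # momentum extrapolation
        grad = X.T @ (X @ z - y) / n
        beta_prev, beta = beta, soft_threshold(z - grad / L, lam / L)     # prox step from z
        t = t_next
    return beta
```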

Theorem 3 (FISTA convergence rate (Beck-Teboulle 2009, Theorem 4.4)).

With step size $1/L$, the FISTA iterates satisfy

$$F(\boldsymbol\beta^k) - F(\boldsymbol\beta^*) \le \frac{2 L \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|_2^2}{(k+1)^2} \quad \text{for all } k \ge 1.$$
Proof.

Define $\mathbf{u}^k = t_k \boldsymbol\beta^k - (t_k - 1) \boldsymbol\beta^{k-1}$ and the Lyapunov function

$$v_k = \frac{2}{L} t_k^2 \big( F(\boldsymbol\beta^k) - F^* \big) + \|\mathbf{u}^k - \boldsymbol\beta^*\|_2^2.$$

The proof has three steps.

Step 1 (Lyapunov lemma). We show $v_{k+1} \le v_k$ for all $k \ge 1$, i.e., the Lyapunov function is non-increasing along FISTA iterations. Apply the proximal-gradient inequality (Beck-Teboulle Lemma 2.3, the same lemma used for ISTA but evaluated at the momentum point $\mathbf{z}^k$) at $\mathbf{z}^k$ with two choices of $\mathbf{y}$:

$$\frac{2}{L} (F(\boldsymbol\beta^{k+1}) - F^*) \le \|\mathbf{z}^k - \boldsymbol\beta^*\|^2 - \|\boldsymbol\beta^{k+1} - \boldsymbol\beta^*\|^2 \quad (\text{with } \mathbf{y} = \boldsymbol\beta^*),$$
$$\frac{2}{L} (F(\boldsymbol\beta^{k+1}) - F(\boldsymbol\beta^k)) \le \|\mathbf{z}^k - \boldsymbol\beta^k\|^2 - \|\boldsymbol\beta^{k+1} - \boldsymbol\beta^k\|^2 \quad (\text{with } \mathbf{y} = \boldsymbol\beta^k).$$

Multiply the first inequality by $t_{k+1}$ and the second by $(t_{k+1} - 1)$ — using the FISTA recursion $t_{k+1}^2 - t_{k+1} = t_k^2$ — and add. After algebraic manipulation that uses the definition of $\mathbf{z}^k$ in terms of $\boldsymbol\beta^k$ and $\boldsymbol\beta^{k-1}$, the right side telescopes into $\|\mathbf{u}^k - \boldsymbol\beta^*\|^2 - \|\mathbf{u}^{k+1} - \boldsymbol\beta^*\|^2$, and the left side is $(2/L)[t_{k+1}^2 (F(\boldsymbol\beta^{k+1}) - F^*) - t_k^2 (F(\boldsymbol\beta^k) - F^*)]$. Rearranging gives $v_{k+1} \le v_k$.

Step 2 ($t_k$ lower bound). By induction, $t_k \ge (k+1)/2$ for all $k \ge 1$. Base case $t_1 = 1 \ge 1$. Inductive step: $t_{k+1} = (1 + \sqrt{1 + 4 t_k^2})/2 \ge (1 + 2 t_k)/2 \ge (1 + (k+1))/2 = (k+2)/2$.

Step 3 (conclude). Iterating Step 1 from $k = 1$ gives $v_k \le v_1$. Since $\boldsymbol\beta^0 = \boldsymbol\beta^{-1} = \mathbf{0}$ implies $\mathbf{u}^1 = t_1 \boldsymbol\beta^1 = \boldsymbol\beta^1$, and $\boldsymbol\beta^1$ is one ISTA step from $\boldsymbol\beta^0$, the bound $v_1 \le \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|^2$ follows from one application of the descent lemma (the same as in the ISTA proof). So $v_k \le \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|^2$ for all $k \ge 1$. Drop the non-negative $\|\mathbf{u}^k - \boldsymbol\beta^*\|^2$ term:

$$\frac{2}{L} t_k^2 (F(\boldsymbol\beta^k) - F^*) \le v_k \le \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|^2.$$

Combine with $t_k \ge (k+1)/2$ from Step 2:

$$F(\boldsymbol\beta^k) - F^* \le \frac{L \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|^2}{2 t_k^2} \le \frac{2 L \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|^2}{(k+1)^2}.$$

A factor-of-$k$ improvement over ISTA: to reduce $F(\boldsymbol\beta^k) - F^*$ by a factor of 100, ISTA needs 100× more iterations while FISTA needs only 10×. The constant $2L \|\boldsymbol\beta^0 - \boldsymbol\beta^*\|^2$ is the same as the ISTA bound (modulo the factor of 4), so the asymptotic rate is the only source of difference — but it’s a substantial one.

FISTA is not a descent method. Unlike ISTA, $F(\boldsymbol\beta^k)$ is not monotonic along FISTA iterations — small “ripples” in $F(\boldsymbol\beta^k)$ are normal. A monotone variant (M-FISTA, Beck-Teboulle 2009 §5) accepts $\boldsymbol\beta^{k+1}$ only if $F(\boldsymbol\beta^{k+1}) \le F(\boldsymbol\beta^k)$, otherwise reuses $\boldsymbol\beta^k$. This trade-off — slightly worse worst-case constant for monotonicity — is rarely worth it in practice.

Log-log convergence trace on a smaller-scale DGP-1 (n = 150, p = 200, s = 10, σ = 0.5, AR(1) ρ = 0.5) at λ = 0.05. F* is computed by a 5,000-iteration FISTA reference. Reading off the slopes: ISTA tracks k⁻¹ (Theorem 2), FISTA tracks k⁻² (Theorem 3), and coordinate descent matches FISTA early then asymptotically beats both once the active set stabilizes — the lasso restricted to a fixed active set is a strictly-convex quadratic, where CD converges linearly. Iterations to reach F − F* < 10⁻³: ISTA = 63, FISTA = 22, CD = 6.

§3.5 Practical solver-choice notes

When does each solver win? A field guide.

Coordinate descent (sklearn.linear_model.Lasso, glmnet). Default for everything in the lasso family (lasso, elastic net, group lasso). Fastest in practice for $n, p \le 10^5$ at moderate sparsity. Warm starts along a $\lambda$ path are nearly free, which is why LassoCV is fast even with a 100-point $\lambda$ path × 10-fold CV.

FISTA. The right default for lasso variants where the L1 prox is easy but coordinate-by-coordinate updates are not — group lasso with overlapping groups, fused lasso, the generalized lasso with a non-axis-aligned penalty, total-variation penalties. Also the right default when the design matrix is structured (e.g., a fast Fourier or wavelet transform) and matrix-vector products $\mathbf{X} \mathbf{v}$ can be computed in $O(n \log n)$ rather than $O(np)$ — coordinate descent breaks the structured-multiplication advantage by accessing one column at a time.

ISTA. Pedagogically valuable, rarely the right algorithmic choice — FISTA dominates it at no extra implementation cost. Use ISTA only when the proof of correctness or the descent property is needed (some monotonicity-sensitive applications, e.g., in statistical guarantees that rely on objective decrease).

Specialized solvers we don’t cover. ADMM (Boyd et al. 2011) is the right tool for lasso variants with linear-equality-coupled penalties (e.g., the Dantzig selector). LARS (Efron-Hastie-Johnstone-Tibshirani 2004) computes the entire lasso path exactly in $O(np \min(n, p))$, which can beat coordinate descent at very small $p$ but loses badly at the high-dimensional scales we care about. Interior-point methods (CVXPY, cvxopt) work but are typically 100×+ slower than coordinate descent on lasso problems of any meaningful size.

For everything in the rest of this topic — the §1 lasso fits, the LassoCV in §7, the elastic-net comparison in §8, the debiased-lasso pipeline in §10 — we use scikit-learn’s coordinate descent. We hand-rolled FISTA above to demonstrate the $O(1/k^2)$ rate and to keep the algorithmic content visible.

§4. Bias-variance for the lasso

The lasso’s central trade-off is between bias (from L1 shrinkage of active coefficients) and variance (controlled by the size of the data-adapted active set). As $\lambda$ ranges from $0$ to $\lambda_{\max}$ — the smallest penalty at which the solution is identically zero — the prediction risk traces the canonical U-curve familiar from any bias-variance analysis. This section formalizes both halves of the trade-off, computes $\lambda_{\max}$ in closed form from the KKT conditions, and develops the lasso solution path $\hat{\boldsymbol\beta}(\lambda)$ as a piecewise-linear function of $\lambda$.

The U-curve is the practical bridge between §3 (we can compute the lasso) and §5 (the oracle inequality bounds the bottom of the U). The solution path is what LassoCV (§7) and LassoLarsIC (§7) operate on when they pick a $\lambda$.

§4.1 The bias contribution from L1 shrinkage

The lasso’s shrinkage isn’t soft and asymptotically vanishing the way a Bayesian Gaussian-prior posterior mean is — it’s a constant absolute shrinkage that biases every active coefficient toward zero by approximately $\lambda$.

The orthogonal case makes this explicit. From §3.1, with $\mathbf{X}^\top \mathbf{X} = n \mathbf{I}$ and $z_j = (\mathbf{X}^\top \mathbf{y} / n)_j$, the lasso solution is $\hat\beta_j = S(z_j, \lambda)$. Under the model $z_j \sim \mathcal{N}(\beta^*_j, \sigma^2 / n)$:

  • For “large signal” coordinates with $|\beta^*_j| \gg \lambda + \sigma/\sqrt{n}$: with high probability $|z_j| > \lambda$ and $\mathrm{sign}(z_j) = \mathrm{sign}(\beta^*_j)$, so $\hat\beta_j \approx z_j - \lambda \, \mathrm{sign}(\beta^*_j)$ and $\mathbb{E}[\hat\beta_j] \approx \beta^*_j - \lambda \, \mathrm{sign}(\beta^*_j)$. The bias is $-\lambda \, \mathrm{sign}(\beta^*_j)$ — constant magnitude, opposite sign to the true value, independent of how large $|\beta^*_j|$ is.
  • For “noise” coordinates with $\beta^*_j = 0$: by the symmetry of $z_j \sim \mathcal{N}(0, \sigma^2/n)$, $\mathbb{E}[S(z_j, \lambda)] = 0$. No bias, but a small variance contribution from the false positives where $|z_j| > \lambda$ by chance.

For general (non-orthogonal) designs, the calculation is more involved but the qualitative picture survives. Conditioning on the lasso correctly identifying the active set $S$, the active coefficients satisfy

$$\hat{\boldsymbol\beta}_S = \hat{\boldsymbol\beta}_S^{\text{OLS-on-}S} - \lambda \big(\mathbf{X}_S^\top \mathbf{X}_S / n\big)^{-1} \mathrm{sign}(\hat{\boldsymbol\beta}_S),$$

where $\hat{\boldsymbol\beta}_S^{\text{OLS-on-}S} = (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{X}_S^\top \mathbf{y}$ is the OLS estimator restricted to $S$. The shrinkage correction $-\lambda (\mathbf{X}_S^\top \mathbf{X}_S / n)^{-1} \mathrm{sign}(\cdot)$ scales linearly in $\lambda$; its magnitude depends on the conditioning of $\mathbf{X}_S^\top \mathbf{X}_S / n$ but is generally of order $\lambda$.

This bias is the price of sparsity. Two later sections fix it for different reasons:

  • The adaptive lasso (Zou 2006, §8.3) replaces the constant shrinkage $\lambda$ with feature-specific weights $\lambda \cdot w_j$ where $w_j = 1 / |\hat\beta_j^{\text{init}}|$ for some initial estimator. Coordinates with large $|\hat\beta_j^{\text{init}}|$ get small $w_j$ and small shrinkage, so the active-coefficient bias decays to zero asymptotically.
  • The debiased lasso (§10.2) explicitly subtracts off the shrinkage bias via a one-step Newton correction $\hat{\boldsymbol\beta}^{\text{db}} = \hat{\boldsymbol\beta} + (1/n) \hat{\mathbf{M}} \mathbf{X}^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta})$, producing $\sqrt{n}$-consistent normal estimates of individual coefficients suitable for hypothesis testing and confidence intervals.

For prediction the bias isn’t catastrophic — the U-curve in §4.3 shows that the bias-variance trade-off is favorable at well-chosen $\lambda$. For inference it’s the central problem of the topic, and §10 is where it gets resolved.

§4.2 Variance from sparsity adaptation

In contrast to the bias, the lasso’s variance is small — much smaller than the variance of OLS would be at the same $p$, when OLS is even defined.

The cleanest way to see this: OLS variance scales with the number of features, $\mathbb{V}\mathrm{ar}(\hat{\boldsymbol\beta}^{\text{OLS}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$ has trace $\sigma^2 \, \mathrm{tr}((\mathbf{X}^\top \mathbf{X})^{-1})$, which scales as $\sigma^2 p / n$ for a well-conditioned design. As $p \to n$ from below, the variance blows up; at $p > n$, OLS is undefined and the min-norm interpolant has its own pathologies.

The lasso, by zeroing out coordinates with small $|z_j|$, effectively reduces the model dimension from $p$ to the active-set size $|\hat A_\lambda|$. Heuristically — and this is made precise by the degrees-of-freedom result for the lasso (Zou-Hastie-Tibshirani 2007): $\mathrm{df}(\hat{\boldsymbol\beta}^{\text{lasso}}(\lambda)) = \mathbb{E}[|\hat A_\lambda|]$, the expected size of the active set — the lasso’s prediction variance scales as $\sigma^2 \, \mathbb{E}[|\hat A_\lambda|] / n$, which for a well-chosen $\lambda$ is on the order of $\sigma^2 s / n$: proportional to the true sparsity, not to $p$.

This is the sparsity-adaptation property: the lasso pays a variance cost proportional to the model size it actually uses, regardless of how many candidate features were available to start with. It’s the central reason the lasso works in the $p \gg n$ regime where OLS doesn’t.

For DGP-1 with $s = 10$, $\sigma = 0.5$, $n = 200$: the lasso variance is roughly $\sigma^2 s / n = 0.0125$, while OLS at $p = 199$ has variance roughly $\sigma^2 \cdot 199 / 200 \approx 0.249$ — a 20× advantage, before counting the bias.

§4.3 The U-curve as $\lambda$ varies

The bias-variance pieces combine into the canonical U-shaped prediction-risk curve as a function of λ\lambda. Define the prediction risk at test point x\mathbf{x}^*:

R(λ;x):=E[(xβ^(λ)xβ)2]=(E[xβ^(λ)]xβ)2bias2(x;λ)+Var(xβ^(λ))variance(x;λ),R(\lambda; \mathbf{x}^*) := \mathbb{E}\big[ (\mathbf{x}^{*\top} \hat{\boldsymbol\beta}(\lambda) - \mathbf{x}^{*\top} \boldsymbol\beta^*)^2 \big] = \underbrace{\big(\mathbb{E}[\mathbf{x}^{*\top} \hat{\boldsymbol\beta}(\lambda)] - \mathbf{x}^{*\top} \boldsymbol\beta^*\big)^2}_{\text{bias}^2(\mathbf{x}^*; \lambda)} + \underbrace{\mathbb{V}\mathrm{ar}(\mathbf{x}^{*\top} \hat{\boldsymbol\beta}(\lambda))}_{\text{variance}(\mathbf{x}^*; \lambda)},

with the expectation over training-set draws (test point x\mathbf{x}^* fixed). Average over x\mathbf{x}^* in a test set to get the integrated prediction risk IPE(λ)=Ex[R(λ;x)]\mathrm{IPE}(\lambda) = \mathbb{E}_{\mathbf{x}^*}[R(\lambda; \mathbf{x}^*)].

The U-curve has two boundaries:

  • At λ=0\lambda = 0 (OLS / min-norm OLS at p>np > n): bias is zero (or near-zero for min-norm) but variance dominates and is large or undefined.
  • At λ=λmax\lambda = \lambda_{\max} (defined below): variance is zero — β^(λ)=0\hat{\boldsymbol\beta}(\lambda) = \mathbf{0} deterministically — but bias equals the full prediction signal xβ\mathbf{x}^{*\top} \boldsymbol\beta^*, so bias² is large.

Between these endpoints the curve is U-shaped, with an optimal λ\lambda^* that minimizes IPE. The §5 oracle inequality bounds the value of IPE at this optimum from above by Cσ2slog(p)/nC \sigma^2 s \log(p) / n.

λmax\lambda_{\max} in closed form. From the KKT conditions of §2.4: β^=0\hat{\boldsymbol\beta} = \mathbf{0} is a lasso solution if and only if (Xy/n)jλ|(\mathbf{X}^\top \mathbf{y} / n)_j| \le \lambda for all jj — i.e., the inactive condition holds at every coordinate. So

λmax=Xyn=maxjXjyn.\lambda_{\max} = \left\| \frac{\mathbf{X}^\top \mathbf{y}}{n} \right\|_\infty = \max_j \left| \frac{\mathbf{X}_j^\top \mathbf{y}}{n} \right|.

For λλmax\lambda \ge \lambda_{\max}, the lasso solution is identically zero. For λ\lambda just below λmax\lambda_{\max}, exactly one coordinate becomes active (the one achieving the maximum) — this is the start of the lasso path described next. On DGP-1 with seed 42, λmax1.04\lambda_{\max} \approx 1.04, and λCV0.06\lambda_{\text{CV}} \approx 0.06 sits more than an order of magnitude smaller — well into the path’s interesting region.
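A quick numerical check of the λ_max formula, as a hedged sketch on an iid Gaussian toy design rather than DGP-1: the lasso returns the all-zero solution just above ‖Xᵀy/n‖_∞ and a non-empty active set just below it.

```python
# Verify lambda_max = ||X^T y / n||_inf: all-zero solution above it, first activation below it.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
y = X[:, :10] @ np.ones(10) + 0.5 * rng.standard_normal(n)   # toy sparse signal (assumption)

lam_max = np.max(np.abs(X.T @ y)) / n
print(f"lambda_max = {lam_max:.4f}")
for lam in (1.01 * lam_max, 0.99 * lam_max):
    coef = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    print(f"alpha = {lam:.4f}: {np.count_nonzero(coef)} nonzero coefficient(s)")
```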

The interactive viz below shows empirical bias², variance, and total MSE on a held-out test set as a function of λ\lambda, computed by Monte Carlo over BB replicate draws of DGP-1. Bias² grows monotonically with λ\lambda (constant shrinkage hurts more when applied harder); variance decays monotonically with λ\lambda (heavier penalization shrinks the active set); their sum traces a U-curve with minimum near λCV\lambda_{\text{CV}}. The minimum of the empirical U-curve coincides — within MC noise — with the value of λ\lambda that LassoCV selects automatically (§7).

Empirical bias-variance decomposition on a smaller-scale DGP-1 (n = 200, p = 200, s = 10, σ = 0.5, AR(1) ρ = 0.5, B = 20 replicates). MSE = bias² + variance is the canonical U; bias² (teal, dashed) grows with λ as constant shrinkage hits active coords harder; variance (purple, dotted) decays with λ as the active-set size shrinks. The red marker sits at the empirical λ minimizer of MSE — close to the LassoCV-selected operating point covered in §7. Computed live in-browser via warm-started ISTA across the 25-point log-spaced λ-grid (~3-5 s precompute).

Empirical bias², variance, and total test-set MSE as a function of λ on DGP-1, computed by Monte Carlo over B = 50 replicate draws. Bias² grows monotonically with λ (constant shrinkage hurts more when applied harder); variance decays monotonically with λ (heavier penalization shrinks the active set); their sum traces the canonical U-curve. The minimum of the empirical U-curve coincides — within MC noise — with the value of λ that LassoCV selects automatically (§7). (Static fallback at p = 500; the interactive viz above runs at p = 200 for in-browser tractability.)

§4.4 The lasso solution path is piecewise linear

Define the lasso solution path as the function λβ^(λ)\lambda \mapsto \hat{\boldsymbol\beta}(\lambda) for λ[0,λmax]\lambda \in [0, \lambda_{\max}]. The path has two structural properties that make it both computationally tractable and visually informative.

Theorem 1 (Piecewise linearity of the lasso path (Efron-Hastie-Johnstone-Tibshirani 2004)).

The lasso solution path β^(λ)\hat{\boldsymbol\beta}(\lambda) is a continuous piecewise-linear function of λ\lambda. There is a finite sequence of knots λmax=λ(0)>λ(1)>>λ(K)=0\lambda_{\max} = \lambda_{(0)} > \lambda_{(1)} > \cdots > \lambda_{(K)} = 0 such that on each interval [λ(k+1),λ(k)][\lambda_{(k+1)}, \lambda_{(k)}], the active set A^λ\hat A_\lambda is constant and β^(λ)\hat{\boldsymbol\beta}(\lambda) is linear in λ\lambda. The knots are exactly the values of λ\lambda at which the active set changes — a coordinate enters or leaves A^\hat A.

The proof is a direct calculation from the KKT conditions: between knots, the active set is fixed, the active KKT condition 1nXj(yXβ^)=λsign(β^j)\frac{1}{n} \mathbf{X}_j^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}) = \lambda \, \mathrm{sign}(\hat\beta_j) for jA^j \in \hat A is a linear system in β^A^\hat{\boldsymbol\beta}_{\hat A} with λ\lambda on the right-hand side, so the solution is linear in λ\lambda. The LARS algorithm (Efron et al. 2004) traces this piecewise-linear path knot-by-knot in O(npmin(n,p))O(np \min(n, p)) total time — though for moderate-to-large pp, coordinate descent on a λ\lambda-grid (§3.2) is faster in practice.

Reading the path. Plotting β^j(λ)\hat\beta_j(\lambda) vs logλ\log \lambda for all jj shows which features the lasso selects in what order and at what penalty level. As λ\lambda decreases from λmax\lambda_{\max}:

  • The first coordinate to enter is argmaxj(Xy/n)j\arg\max_j |(\mathbf{X}^\top \mathbf{y} / n)_j| — the feature most correlated with the response.
  • Subsequent coordinates enter at successively smaller λ\lambda values, in roughly the order of their importance.
  • At λ=0\lambda = 0 the path reaches OLS (in the pnp \le n case) or the min-norm OLS interpolant (in p>np > n).
  • A coordinate can also leave the active set as λ\lambda decreases (coefficient passes through zero) — uncommon in continuous designs but possible.
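The piecewise-linear path and the entry order are directly visible from scikit-learn's LARS implementation. A hedged sketch on a toy sparse design (not the article's in-browser ISTA path): lars_path returns the knot values and the coefficients at each knot; between knots the coefficients are the linear interpolants.

```python
# Trace the lasso path with LARS: knots, entry order, and coefficients at each knot.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p, s = 200, 200, 10
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 1.0                                   # toy sparse truth (assumption)
y = X @ beta_star + 0.5 * rng.standard_normal(n)

# alphas: the knots lambda_(k) (decreasing); coefs[:, k]: the solution at knot k
alphas, active, coefs = lars_path(X, y, method="lasso")
print("first features to become active:", list(active[:12]))
print("largest knots:", np.round(alphas[:5], 4))
```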

The viz below shows the lasso solution path on DGP-1: β^j(λ)\hat\beta_j(\lambda) vs logλ\log \lambda for all 500 coordinates, with the 10 true active coordinates plotted in black and the 490 inactive coordinates in light gray. The vertical line at λCV\lambda_{\text{CV}} marks the cross-validation-selected operating point. The reader sees that the true active features are consistently the first to enter the path as λ\lambda decreases, and that at λCV\lambda_{\text{CV}} the active set is a tight superset of the true SS — most of the gray noise coefficients are still at zero.

Lasso solution path on a smaller-scale DGP-1 (n = 200, p = 200, s = 10, σ = 0.5, AR(1) ρ = 0.5). The 10 true active coordinates (j < 10) plot in black; the 190 inactive coordinates in gray (only those with |β̂_j| > 0.005 anywhere on the path are drawn — most stay flat-zero and are omitted to keep the SVG light). Vertical marker at λ_CV ≈ 0.056 (the LassoUCurve minimizer above). The active features enter the path first as λ decreases from λ_max ≈ 1; at λ_CV the active set is a tight superset of the true S. Computed live via warm-started ISTA across 30 log-spaced λ values (~200 ms precompute).

The lasso solution path on DGP-1: β̂ⱼ(λ) vs log λ for all 500 coordinates. The 10 true active coordinates are plotted in black; the 490 inactive coordinates in light gray. Vertical line marks λ_CV. The true active features are consistently the first to enter the path as λ decreases; at λ_CV the active set is a tight superset of the true S. (Static fallback at p = 500; the interactive viz above runs at p = 200 for in-browser tractability.)

§5. The lasso oracle inequality

This is the topic’s headline theoretical result: under a restricted-eigenvalue condition on the design and a deviation bound on the noise, the lasso achieves prediction risk of order σ2slog(p)/n\sigma^2 s \log(p) / n — comparable to what an oracle that knew the true active set SS in advance could achieve, up to a logarithmic factor in pp. The bound is non-asymptotic (holds with high probability for any finite nn, pp), dimension-free (the dependence on pp is only logarithmic), and rate-optimal for the sparse high-dimensional regression problem.

The proof has four steps and follows Bickel-Ritov-Tsybakov (2009) closely. Step 1 (the basic inequality) uses the lasso’s defining optimality to bound the prediction error in terms of the L1 estimation error and a noise term. Step 2 (the cone condition) shows that the error vector β^β\hat{\boldsymbol\beta} - \boldsymbol\beta^* has most of its L1 mass concentrated on the true support SS. Step 3 (the restricted-eigenvalue condition) lower-bounds the prediction error in terms of the estimation error on SS, which closes the resulting self-bounding inequality. Step 4 (the deviation step) controls the noise term using a maximal sub-Gaussian inequality. Combining gives the rate.

The proof’s main work is in steps 1 and 2 — the basic inequality and the cone condition derive directly from the KKT conditions of §2.4 with no further ingredients. Step 3 is the geometric assumption on the design that we’re imposing; step 4 is the standard probabilistic deviation inequality. The whole argument is technical but elementary — no measure theory beyond the sub-Gaussian moment bound.

§5.1 Setup: prediction risk in the high-dim regime

We work in the standard high-dimensional linear regression model:

y=Xβ+ε,XRn×p,βRp,εRn.\mathbf{y} = \mathbf{X} \boldsymbol\beta^* + \boldsymbol\varepsilon, \quad \mathbf{X} \in \mathbb{R}^{n \times p}, \quad \boldsymbol\beta^* \in \mathbb{R}^p, \quad \boldsymbol\varepsilon \in \mathbb{R}^n.

We assume:

  • Sparsity. β\boldsymbol\beta^* has support S={j:βj0}S = \{j : \beta^*_j \neq 0\} with S=s|S| = s, sps \ll p.
  • Sub-Gaussian noise. ε=(ε1,,εn)\boldsymbol\varepsilon = (\varepsilon_1, \dots, \varepsilon_n) has independent entries with E[εi]=0\mathbb{E}[\varepsilon_i] = 0 and sub-Gaussian parameter σ\sigma: E[exp(tεi)]exp(t2σ2/2)\mathbb{E}[\exp(t \varepsilon_i)] \le \exp(t^2 \sigma^2 / 2) for all tRt \in \mathbb{R}. Gaussian noise with variance σ2\sigma^2 is the canonical case; bounded ε[σ,σ]\boldsymbol\varepsilon \in [-\sigma, \sigma] also qualifies.
  • Column-normalized design. Each column Xj\mathbf{X}_j satisfies Xj22n\|\mathbf{X}_j\|_2^2 \le n — a normalization convention that makes the bound clean. Equivalently, the empirical second moment (1/n)Xj221(1/n) \|\mathbf{X}_j\|_2^2 \le 1. For DGP-1 the columns satisfy this in expectation; in practice rescaling the columns to exactly Xj22=n\|\mathbf{X}_j\|_2^2 = n is standard before fitting.

The prediction risk at the lasso estimator is

PE(β^lasso):=1nX(β^lassoβ)22=1ni=1n(xiβ^lassoxiβ)2,\mathrm{PE}(\hat{\boldsymbol\beta}^{\text{lasso}}) := \frac{1}{n} \|\mathbf{X} (\hat{\boldsymbol\beta}^{\text{lasso}} - \boldsymbol\beta^*)\|_2^2 = \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top \hat{\boldsymbol\beta}^{\text{lasso}} - \mathbf{x}_i^\top \boldsymbol\beta^*)^2,

the average squared in-sample prediction error against the true regression function. (This is the “fixed-design prediction error.” It differs from the integrated Ex[]\mathbb{E}_{\mathbf{x}^*}[\cdot] test-set error of §4.3 — they coincide when the test design has the same row distribution as the training design and nn is large.)

The benchmark is the oracle estimator that knew SS in advance:

β^Soracle=(XSXS)1XSy,β^Scoracle=0.\hat{\boldsymbol\beta}^{\text{oracle}}_S = (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{X}_S^\top \mathbf{y}, \qquad \hat{\boldsymbol\beta}^{\text{oracle}}_{S^c} = \mathbf{0}.

The oracle is OLS restricted to SS. Its prediction risk is PE(β^oracle)=σ2s/n\mathrm{PE}(\hat{\boldsymbol\beta}^{\text{oracle}}) = \sigma^2 s / n in expectation (standard OLS variance with ss degrees of freedom). The oracle inequality says the lasso achieves the same rate up to a log(p)\log(p) factor, without knowing SS.

§5.2 The basic inequality

The starting point is the lasso’s defining optimality: β^lasso\hat{\boldsymbol\beta}^{\text{lasso}} achieves the minimum of the lasso objective, so in particular it does no worse than β\boldsymbol\beta^*:

12nyXβ^22+λβ^112nyXβ22+λβ1.\frac{1}{2n} \|\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}\|_2^2 + \lambda \|\hat{\boldsymbol\beta}\|_1 \le \frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta^*\|_2^2 + \lambda \|\boldsymbol\beta^*\|_1.

(We drop the “lasso” superscript for brevity; β^\hat{\boldsymbol\beta} is always the lasso solution in this section.)

Substitute y=Xβ+ε\mathbf{y} = \mathbf{X} \boldsymbol\beta^* + \boldsymbol\varepsilon and let δ=β^β\boldsymbol\delta = \hat{\boldsymbol\beta} - \boldsymbol\beta^*:

12nXβ+εXβ^22=12nεXδ22=12nε221nεXδ+12nXδ22,\frac{1}{2n} \|\mathbf{X} \boldsymbol\beta^* + \boldsymbol\varepsilon - \mathbf{X} \hat{\boldsymbol\beta}\|_2^2 = \frac{1}{2n} \|\boldsymbol\varepsilon - \mathbf{X} \boldsymbol\delta\|_2^2 = \frac{1}{2n} \|\boldsymbol\varepsilon\|_2^2 - \frac{1}{n} \boldsymbol\varepsilon^\top \mathbf{X} \boldsymbol\delta + \frac{1}{2n} \|\mathbf{X} \boldsymbol\delta\|_2^2,

and 12nyXβ22=12nε22\frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta^*\|_2^2 = \frac{1}{2n} \|\boldsymbol\varepsilon\|_2^2. The 12nε22\frac{1}{2n}\|\boldsymbol\varepsilon\|_2^2 terms cancel, giving

12nXδ221nεXδ+λβ^1λβ1.\frac{1}{2n} \|\mathbf{X} \boldsymbol\delta\|_2^2 - \frac{1}{n} \boldsymbol\varepsilon^\top \mathbf{X} \boldsymbol\delta + \lambda \|\hat{\boldsymbol\beta}\|_1 \le \lambda \|\boldsymbol\beta^*\|_1.

Rearrange and double both sides:

1nXδ222nεXδ+2λ(β1β^1).()\frac{1}{n} \|\mathbf{X} \boldsymbol\delta\|_2^2 \le \frac{2}{n} \boldsymbol\varepsilon^\top \mathbf{X} \boldsymbol\delta + 2\lambda \big( \|\boldsymbol\beta^*\|_1 - \|\hat{\boldsymbol\beta}\|_1 \big). \quad\quad (\star)

This is the basic inequality. Two things to control: the noise inner product εXδ/n\boldsymbol\varepsilon^\top \mathbf{X} \boldsymbol\delta / n, and the L1-norm difference β1β^1\|\boldsymbol\beta^*\|_1 - \|\hat{\boldsymbol\beta}\|_1.

The noise term, via Hölder’s inequality.

2nεXδ=2εXnδ2Xεnδ1.\frac{2}{n} |\boldsymbol\varepsilon^\top \mathbf{X} \boldsymbol\delta| = 2 \left| \frac{\boldsymbol\varepsilon^\top \mathbf{X}}{n} \boldsymbol\delta \right| \le 2 \left\| \frac{\mathbf{X}^\top \boldsymbol\varepsilon}{n} \right\|_\infty \cdot \|\boldsymbol\delta\|_1.

Define the noise event

E:={Xεnλ2}.\mathcal{E} := \left\{ \left\| \frac{\mathbf{X}^\top \boldsymbol\varepsilon}{n} \right\|_\infty \le \frac{\lambda}{2} \right\}.

On E\mathcal{E}, (2/n)εXδλδ1(2/n) \boldsymbol\varepsilon^\top \mathbf{X} \boldsymbol\delta \le \lambda \|\boldsymbol\delta\|_1. Step 4 below shows that with λ\lambda chosen as λ=2σ2log(2p/δ)/n\lambda = 2\sigma \sqrt{2 \log(2p/\delta) / n}, the event E\mathcal{E} holds with probability at least 1δ1 - \delta. For now assume we’re on E\mathcal{E}.

The L1-norm difference, via the support split. Decompose β^=β^S+β^Sc\hat{\boldsymbol\beta} = \hat{\boldsymbol\beta}_S + \hat{\boldsymbol\beta}_{S^c} where the subscripts indicate the indices restricted to SS and ScS^c respectively. Since βSc=0\boldsymbol\beta^*_{S^c} = \mathbf{0}:

β1=βS1,β^1=β^S1+β^Sc1,δ1=β^SβS1+β^Sc1.\|\boldsymbol\beta^*\|_1 = \|\boldsymbol\beta^*_S\|_1, \qquad \|\hat{\boldsymbol\beta}\|_1 = \|\hat{\boldsymbol\beta}_S\|_1 + \|\hat{\boldsymbol\beta}_{S^c}\|_1, \qquad \|\boldsymbol\delta\|_1 = \|\hat{\boldsymbol\beta}_S - \boldsymbol\beta^*_S\|_1 + \|\hat{\boldsymbol\beta}_{S^c}\|_1.

Apply the reverse triangle inequality β^S1βS1β^SβS1\|\hat{\boldsymbol\beta}_S\|_1 \ge \|\boldsymbol\beta^*_S\|_1 - \|\hat{\boldsymbol\beta}_S - \boldsymbol\beta^*_S\|_1:

β1β^1=βS1β^S1β^Sc1β^SβS1β^Sc1=δS1δSc1.\|\boldsymbol\beta^*\|_1 - \|\hat{\boldsymbol\beta}\|_1 = \|\boldsymbol\beta^*_S\|_1 - \|\hat{\boldsymbol\beta}_S\|_1 - \|\hat{\boldsymbol\beta}_{S^c}\|_1 \le \|\hat{\boldsymbol\beta}_S - \boldsymbol\beta^*_S\|_1 - \|\hat{\boldsymbol\beta}_{S^c}\|_1 = \|\boldsymbol\delta_S\|_1 - \|\boldsymbol\delta_{S^c}\|_1.

Substitute into ()(\star) on the event E\mathcal{E}:

1nXδ22λδ1+2λ(δS1δSc1)=3λδS1λδSc1.\frac{1}{n} \|\mathbf{X} \boldsymbol\delta\|_2^2 \le \lambda \|\boldsymbol\delta\|_1 + 2\lambda (\|\boldsymbol\delta_S\|_1 - \|\boldsymbol\delta_{S^c}\|_1) = 3\lambda \|\boldsymbol\delta_S\|_1 - \lambda \|\boldsymbol\delta_{S^c}\|_1.

Combining the δS\boldsymbol\delta_S and δSc\boldsymbol\delta_{S^c} terms:

  1nXδ22+λδSc13λδS1  on E.()\boxed{\; \frac{1}{n} \|\mathbf{X} \boldsymbol\delta\|_2^2 + \lambda \|\boldsymbol\delta_{S^c}\|_1 \le 3\lambda \|\boldsymbol\delta_S\|_1 \;} \quad \text{on } \mathcal{E}. \quad\quad (\star\star)

This is the basic inequality in its useful form. The L1 mass of δ\boldsymbol\delta outside the true support, plus the prediction error, is controlled by the L1 mass on the true support.

§5.3 The cone condition

The basic inequality ()(\star\star) has an immediate geometric consequence. The prediction-error term 1nXδ22\frac{1}{n}\|\mathbf{X}\boldsymbol\delta\|_2^2 on the LHS is non-negative, so we can drop it:

λδSc13λδS1  δSc13δS1.  \lambda \|\boldsymbol\delta_{S^c}\|_1 \le 3\lambda \|\boldsymbol\delta_S\|_1 \quad \Rightarrow \quad \boxed{\;\|\boldsymbol\delta_{S^c}\|_1 \le 3 \|\boldsymbol\delta_S\|_1.\;}

This is the cone condition. The error vector δ=β^β\boldsymbol\delta = \hat{\boldsymbol\beta} - \boldsymbol\beta^* lies in the convex cone

C(S,3):={δRp:δSc13δS1}.\mathcal{C}(S, 3) := \{ \boldsymbol\delta \in \mathbb{R}^p : \|\boldsymbol\delta_{S^c}\|_1 \le 3 \|\boldsymbol\delta_S\|_1 \}.

Interpretation. Most of the error vector’s L1 mass is on the true support SS. The factor of 3 is conventional; it comes from requiring λ\lambda to dominate twice the noise level (the λ/2\lambda/2 threshold in the noise event E\mathcal{E}): demanding λcXε/n\lambda \ge c\,\|\mathbf{X}^\top \boldsymbol\varepsilon / n\|_\infty instead yields cone constant (c+1)/(c1)(c+1)/(c-1). Different works in the literature use C(S,c0)\mathcal{C}(S, c_0) for various c01c_0 \ge 1; the bound just needs the cone constant to be finite.

The cone condition is the structural content of the basic inequality. Without any further assumption on X\mathbf{X} or ε\boldsymbol\varepsilon, we know the lasso error is concentrated on SS — but we don’t yet have a rate bound on the prediction error. The next step requires an additional assumption on the design.

§5.4 The restricted-eigenvalue condition

The pure basic inequality ()(\star\star) gives 1nXδ223λδS1\frac{1}{n}\|\mathbf{X}\boldsymbol\delta\|_2^2 \le 3\lambda \|\boldsymbol\delta_S\|_1. The RHS bounds the prediction error in terms of the L1 norm of the active part of δ\boldsymbol\delta, which has only ss entries — so by Cauchy-Schwarz, δS1sδS2\|\boldsymbol\delta_S\|_1 \le \sqrt{s} \|\boldsymbol\delta_S\|_2. We get

1nXδ223λsδS2.\frac{1}{n} \|\mathbf{X} \boldsymbol\delta\|_2^2 \le 3\lambda \sqrt{s} \|\boldsymbol\delta_S\|_2.

To convert this into a rate bound, we need to bound δS2\|\boldsymbol\delta_S\|_2 from above by something involving 1nXδ22\frac{1}{n}\|\mathbf{X}\boldsymbol\delta\|_2^2 — i.e., a lower bound on 1nXδ22\frac{1}{n}\|\mathbf{X}\boldsymbol\delta\|_2^2 in terms of δS22\|\boldsymbol\delta_S\|_2^2. That’s exactly the restricted-eigenvalue condition.

Definition 1 (Restricted-eigenvalue condition (Bickel-Ritov-Tsybakov 2009)).

The design X\mathbf{X} satisfies the restricted-eigenvalue condition RE(s,c0)\mathrm{RE}(s, c_0) with constant κ>0\kappa > 0 if for every δC(S,c0)\boldsymbol\delta \in \mathcal{C}(S, c_0) with Ss|S| \le s,

1nXδ22κ2δS22.\frac{1}{n} \|\mathbf{X} \boldsymbol\delta\|_2^2 \ge \kappa^2 \|\boldsymbol\delta_S\|_2^2.

In words: on the cone C(S,c0)\mathcal{C}(S, c_0), the empirical Gram matrix XX/n\mathbf{X}^\top \mathbf{X} / n acts like a positive-definite matrix on the active block — its smallest “eigenvalue” restricted to C(S,c0)\mathcal{C}(S, c_0) is at least κ2\kappa^2. On the full space Rp\mathbb{R}^p this would be the smallest eigenvalue of XX/n\mathbf{X}^\top \mathbf{X}/n, which is zero whenever p>np > n. The restriction to the cone C(S,c0)\mathcal{C}(S, c_0) is what makes the condition feasible in the high-dim regime.

When does RE hold? A few sufficient conditions:

  • Random Gaussian designs. If X\mathbf{X} has iid N(0,Σ)\mathcal{N}(\mathbf{0}, \boldsymbol\Sigma) rows with λmin(Σ)κ02>0\lambda_{\min}(\boldsymbol\Sigma) \ge \kappa_0^2 > 0, then RE holds with high probability with κ2κ02\kappa^2 \asymp \kappa_0^2 provided nslog(p)n \gtrsim s \log(p) (Raskutti-Wainwright-Yu 2010, Theorem 1).
  • Sub-Gaussian designs. Same conclusion under sub-Gaussian rows (Rudelson-Zhou 2013).
  • Restricted isometry property (RIP). RIP \Rightarrow RE (Candès-Tao 2005; we cover this in §9).

The condition is essentially the weakest design assumption under which the lasso works — it’s equivalent to ”XX/n\mathbf{X}^\top \mathbf{X}/n acts well on sparse vectors and small perturbations of them.” For DGP-1 with AR(1) Toeplitz Σ\boldsymbol\Sigma, λmin(Σ)=(1ρ)/(1+ρ)=1/3\lambda_{\min}(\boldsymbol\Sigma) = (1 - \rho)/(1 + \rho) = 1/3 in the limit pp \to \infty for ρ=0.5\rho = 0.5, so RE holds with κ21/3\kappa^2 \approx 1/3.
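The limit value quoted above is easy to check numerically. A small sketch, assuming numpy and scipy, comparing the smallest eigenvalue of the p × p AR(1) Toeplitz covariance against the p → ∞ limit (1 − ρ)/(1 + ρ):

```python
# Smallest eigenvalue of the AR(1) Toeplitz covariance vs. its p -> infinity limit.
import numpy as np
from scipy.linalg import toeplitz

rho, p = 0.5, 500
Sigma = toeplitz(rho ** np.arange(p))
lam_min = np.linalg.eigvalsh(Sigma)[0]                # eigvalsh returns ascending eigenvalues
print("empirical lambda_min(Sigma)  :", round(float(lam_min), 4))
print("limit (1 - rho) / (1 + rho)  :", round((1 - rho) / (1 + rho), 4))
```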

§5.5 The deviation step and the O(σ2slogp/n)O(\sigma^2 s \log p / n) rate

We now combine the three ingredients — basic inequality, cone condition, RE — and add the probabilistic step that controls E\mathcal{E}.

Combining basic inequality and RE. Start from ()(\star\star):

1nXδ223λδS13λsδS23λs1κ1nXδ22,\frac{1}{n} \|\mathbf{X} \boldsymbol\delta\|_2^2 \le 3\lambda \|\boldsymbol\delta_S\|_1 \le 3\lambda \sqrt{s} \|\boldsymbol\delta_S\|_2 \le 3\lambda \sqrt{s} \cdot \frac{1}{\kappa} \sqrt{\frac{1}{n} \|\mathbf{X} \boldsymbol\delta\|_2^2},

where the second inequality is Cauchy-Schwarz on δSRs\boldsymbol\delta_S \in \mathbb{R}^s and the third is RE. Let u=1nXδ22u = \sqrt{\frac{1}{n}\|\mathbf{X}\boldsymbol\delta\|_2^2}. Then u23λsu/κu^2 \le 3\lambda \sqrt{s} u / \kappa, so u3λs/κu \le 3\lambda \sqrt{s}/\kappa, and squaring:

1nX(β^β)229λ2sκ2on E{RE(s,3)}.( ⁣ ⁣)\frac{1}{n} \|\mathbf{X} (\hat{\boldsymbol\beta} - \boldsymbol\beta^*)\|_2^2 \le \frac{9 \lambda^2 s}{\kappa^2} \quad \text{on } \mathcal{E} \cap \{\mathrm{RE}(s, 3)\}. \quad\quad (\star\!\star\!\star)

The deviation step. It remains to choose λ\lambda so that E\mathcal{E} holds with high probability.

Lemma 1 (Sub-Gaussian maximal inequality).

Let ε\boldsymbol\varepsilon have independent entries, sub-Gaussian with parameter σ\sigma, and X\mathbf{X} have columns with Xj22n\|\mathbf{X}_j\|_2^2 \le n. For any δ(0,1)\delta \in (0, 1),

P ⁣(Xεn>σ2log(2p/δ)n)δ.\mathbb{P}\!\left( \left\| \frac{\mathbf{X}^\top \boldsymbol\varepsilon}{n} \right\|_\infty > \sigma \sqrt{\frac{2 \log(2p/\delta)}{n}} \right) \le \delta.
Proof.

For a single coordinate jj, the inner product (Xjε)/n=i(Xij/n)εi(\mathbf{X}_j^\top \boldsymbol\varepsilon) / n = \sum_i (X_{ij}/n) \varepsilon_i is a linear combination of independent sub-Gaussian random variables, hence itself sub-Gaussian with parameter

σj2:=σ2i=1n(Xij/n)2=σ2Xj22n2σ2n.\sigma_j^2 := \sigma^2 \sum_{i=1}^n (X_{ij}/n)^2 = \frac{\sigma^2 \|\mathbf{X}_j\|_2^2}{n^2} \le \frac{\sigma^2}{n}.

By the standard sub-Gaussian tail bound, P((Xjε)/n>t)exp(t2/(2σj2))exp(nt2/(2σ2))\mathbb{P}((\mathbf{X}_j^\top \boldsymbol\varepsilon)/n > t) \le \exp(-t^2 / (2 \sigma_j^2)) \le \exp(-n t^2 / (2 \sigma^2)), and the same bound for the negative tail. Union bound over the pp coordinates and the two tails:

P ⁣(Xεn>t)2pexp ⁣(nt22σ2).\mathbb{P}\!\left( \left\| \frac{\mathbf{X}^\top \boldsymbol\varepsilon}{n} \right\|_\infty > t \right) \le 2p \exp\!\left( -\frac{n t^2}{2 \sigma^2} \right).

Set the RHS equal to δ\delta and solve for tt: t=σ2log(2p/δ)/nt = \sigma \sqrt{2 \log(2p/\delta) / n}.

Choose λ\lambda. Set λ=2σ2log(2p/δ)/n\lambda = 2 \sigma \sqrt{2 \log(2p/\delta) / n} — twice the deviation threshold from Lemma 1. Then λ/2Xε/n\lambda/2 \ge \|\mathbf{X}^\top \boldsymbol\varepsilon / n\|_\infty on the event E\mathcal{E}, which holds with probability at least 1δ1 - \delta. Substituting into ( ⁣ ⁣)(\star\!\star\!\star):

1nX(β^β)2294σ22log(2p/δ)snκ2=72σ2slog(2p/δ)nκ2.\frac{1}{n} \|\mathbf{X} (\hat{\boldsymbol\beta} - \boldsymbol\beta^*)\|_2^2 \le \frac{9 \cdot 4 \sigma^2 \cdot 2 \log(2p/\delta) \cdot s}{n \cdot \kappa^2} = \frac{72 \sigma^2 s \log(2p/\delta)}{n \kappa^2}.

Cleaning up the constants:

Theorem 1 (Lasso oracle inequality (Bickel-Ritov-Tsybakov 2009)).

Assume β\boldsymbol\beta^* is ss-sparse, the noise ε\boldsymbol\varepsilon has independent sub-Gaussian entries with parameter σ\sigma, the columns of X\mathbf{X} satisfy Xj22n\|\mathbf{X}_j\|_2^2 \le n, and the design satisfies RE(s,3)\mathrm{RE}(s, 3) with constant κ>0\kappa > 0. Choose λ=2σ2log(2p/δ)/n\lambda = 2\sigma\sqrt{2\log(2p/\delta) / n} for some δ(0,1)\delta \in (0, 1). Then with probability at least 1δ1 - \delta,

1nX(β^lassoβ)2272σ2slog(2p/δ)nκ2.\frac{1}{n} \|\mathbf{X} (\hat{\boldsymbol\beta}^{\text{lasso}} - \boldsymbol\beta^*)\|_2^2 \le \frac{72 \sigma^2 s \log(2p/\delta)}{n \kappa^2}.

The order of magnitude. Setting δ=1/p\delta = 1/p (high-confidence statement, 11/p1 - 1/p probability), log(2p/δ)=log(2p2)2log(2p)\log(2p/\delta) = \log(2p^2) \le 2 \log(2p), so the bound is

1nX(β^lassoβ)22σ2slog(p)nκ2.\frac{1}{n} \|\mathbf{X} (\hat{\boldsymbol\beta}^{\text{lasso}} - \boldsymbol\beta^*)\|_2^2 \lesssim \frac{\sigma^2 s \log(p)}{n \kappa^2}.

This is the fundamental rate for sparse high-dimensional regression. Three things to note:

  1. The log(p)\log(p) factor is the price of not knowing SS. The oracle estimator β^oracle\hat{\boldsymbol\beta}^{\text{oracle}} — OLS restricted to SS — achieves σ2s/n\sigma^2 s / n in expectation. The lasso achieves σ2slog(p)/n\sigma^2 s \log(p) / n — the same rate up to a logarithmic factor. The log(p)\log(p) is the price of doing model selection from pp candidates without prior knowledge.
  2. The rate is minimax-optimal. Donoho-Johnstone (1994) and Raskutti-Wainwright-Yu (2011) showed that no estimator can beat σ2slog(p/s)/n\sigma^2 s \log(p/s) / n in the worst case over the class of ss-sparse signals. So the lasso matches the minimax rate up to the (usually modest) gap between log(p)\log(p) and log(p/s)\log(p/s) and the 1/κ21/\kappa^2 factor: the lasso pays for not knowing SS, but it doesn’t pay extra for being computationally tractable.
  3. The κ2\kappa^{-2} dependence is real. When the design is poorly conditioned (highly correlated features), κ2\kappa^{-2} blows up and the bound degrades. This is the formal counterpart of the practical observation that the lasso works less well with strongly collinear features — §8 covers the elastic net as the standard remedy.

Corollary: L1 estimation rate. A similar argument starting from ()(\star\star) and using RE gives

β^lassoβ112λsκ2σslog(p)/nκ2.\|\hat{\boldsymbol\beta}^{\text{lasso}} - \boldsymbol\beta^*\|_1 \le \frac{12 \lambda s}{\kappa^2} \asymp \frac{\sigma s \sqrt{\log(p)/n}}{\kappa^2}.

Sketch: from ()(\star\star) + Cauchy-Schwarz + RE, δS13λs/κ2\|\boldsymbol\delta_S\|_1 \le 3 \lambda s / \kappa^2, then δ1=δS1+δSc14δS1\|\boldsymbol\delta\|_1 = \|\boldsymbol\delta_S\|_1 + \|\boldsymbol\delta_{S^c}\|_1 \le 4 \|\boldsymbol\delta_S\|_1 from the cone condition. This L1L^1 estimation rate appears as a lemma in the §10.4 debiased-lasso asymptotic-normality argument.
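The σ²s log(p)/n scaling can be spot-checked in a few lines. A hedged sketch (iid Gaussian design, unit active coefficients, sklearn's Lasso at the theory-guided λ with constants simplified; none of this is the article's in-browser experiment): each doubling of n should roughly halve both the rate bound and the empirical fixed-design prediction error.

```python
# Spot-check the sigma^2 * s * log(p) / n rate of the fixed-design prediction error.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, s, sigma = 200, 10, 0.5
beta_star = np.zeros(p)
beta_star[:s] = 1.0                                   # toy sparse truth (assumption)

for n in (100, 200, 400, 800):
    X = rng.standard_normal((n, p))
    y = X @ beta_star + sigma * rng.standard_normal(n)
    lam = 2 * sigma * np.sqrt(2 * np.log(p) / n)      # Theorem 1 choice, constants simplified
    beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    pe = np.mean((X @ (beta_hat - beta_star)) ** 2)   # (1/n) * ||X (beta_hat - beta*)||_2^2
    rate = sigma**2 * s * np.log(p) / n
    print(f"n = {n:4d}: prediction error = {pe:.4f}, rate sigma^2*s*log(p)/n = {rate:.4f}")
```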

The viz below shows the empirical prediction risk on a held-out test set vs sample size nn on DGP-1, alongside the theoretical BRT bound (constant 72) and a calibrated bound (constant fit empirically). The empirical curve sits one to two orders of magnitude below the BRT bound — the constant 72 is mathematically clean (each proof step contributes a factor of 2 or 3) but practically loose. The slope match on log-log axes (both lines parallel) is the substantive confirmation of the oracle-inequality rate.

Lasso prediction risk on smaller-scale DGP-1 (p = 200, s = 10, σ = 0.5, AR(1) ρ = 0.5, λ = 2σ√(2 log(p)/n) per Theorem 1) as n varies from 50 to 800. Empirical (teal dots) sits one to two orders of magnitude below the BRT bound (amber, c = 72); the calibrated bound (purple, constant fit to the n = 800 point) sits right on top of the empirical curve. All three lines have the same -1 slope on log-log axes — the substantive confirmation of the σ²s log(p)/n rate. The constant 72 is mathematically clean (each proof step contributes a factor of 2 or 3) but practically loose by 10-100×. Computed live in-browser via single-rep ISTA at each n (~500 ms total).

Empirical lasso prediction risk ‖X_test (β̂_lasso − β*)‖² / n_test vs sample size n on DGP-1 (p = 500, s = 10, σ = 0.5 fixed; n from 50 to 800), at the theory-guided λ = 2σ√(2 log(p)/n). Both empirical risk and the theoretical BRT bound (constant 72) plus a calibrated bound (constant fit empirically) on log-log axes. The empirical curve sits one to two orders of magnitude below the BRT bound (the constant 72 is mathematically clean but practically loose); the slope match — both lines parallel on log-log — is the substantive confirmation of the oracle-inequality rate. (Static fallback at p = 500; the interactive viz above runs at p = 200.)

§6. Variable-selection consistency

The §5 oracle inequality bounds the lasso’s prediction risk: 1nX(β^β)22σ2slog(p)/n\frac{1}{n}\|\mathbf{X}(\hat{\boldsymbol\beta} - \boldsymbol\beta^*)\|_2^2 \lesssim \sigma^2 s \log(p) / n. That bound says the lasso predicts well — comparable to the oracle that knew SS in advance. It does not say the lasso correctly recovers the support SS itself.

These are different statements with different sufficient conditions, and confusing them is the most common conceptual error in lasso applications. Two highly correlated features can both be predictive of the response; the lasso might select either one, or alternate between them across resampled training sets, while keeping the prediction error small. The prediction-risk bound is robust to this kind of selection instability. The support-recovery question — does A^λ=S\hat A_\lambda = S? — is sensitive to it.

This section formalizes sign-consistency (the strongest form of support recovery), introduces the irrepresentable condition (Zhao-Yu 2006) that’s both sufficient and essentially necessary for it, states the sample-size scaling for support recovery, and contrasts the prediction-risk bound (RE-based) against the support-recovery bound (IC-based) so the difference is visible.

§6.1 Sign-consistency: what it means and why prediction consistency doesn’t imply it

For a vector βRp\boldsymbol\beta \in \mathbb{R}^p, define sign(β){1,0,1}p\mathrm{sign}(\boldsymbol\beta) \in \{-1, 0, 1\}^p entry-wise: sign(βj)=1\mathrm{sign}(\beta_j) = 1 if βj>0\beta_j > 0, 1-1 if βj<0\beta_j < 0, 00 if βj=0\beta_j = 0. Three increasingly strong support-recovery notions:

  • Support recovery (set-equality): A^λ:={j:β^j0}=S\hat A_\lambda := \{j : \hat\beta_j \neq 0\} = S.
  • Sign consistency: sign(β^)=sign(β)\mathrm{sign}(\hat{\boldsymbol\beta}) = \mathrm{sign}(\boldsymbol\beta^*). This implies A^λ=S\hat A_\lambda = S and additionally that the signs of the active coefficients are correct.
  • Sign-consistent estimation: P(sign(β^lasso)=sign(β))1\mathbb{P}(\mathrm{sign}(\hat{\boldsymbol\beta}^{\text{lasso}}) = \mathrm{sign}(\boldsymbol\beta^*)) \to 1 as nn \to \infty.

The standard target in the lasso literature is sign consistency — slightly stronger than support recovery, but only by the probability that an active coordinate is estimated with the wrong sign, an event whose probability vanishes rapidly under any reasonable signal-strength assumption.

Why prediction consistency doesn’t imply sign consistency. Consider the simplest counterexample. Take p=2p = 2 with X1\mathbf{X}_1 and X2\mathbf{X}_2 identical: X1=X2\mathbf{X}_1 = \mathbf{X}_2. The true coefficient is β=(1,0)\boldsymbol\beta^* = (1, 0) — the first feature is active, the second is not. The lasso objective is

12nyX1(β1+β2)22+λ(β1+β2),\frac{1}{2n} \|\mathbf{y} - \mathbf{X}_1 (\beta_1 + \beta_2)\|_2^2 + \lambda(|\beta_1| + |\beta_2|),

which depends on (β1,β2)(\beta_1, \beta_2) only through their sum and through β1+β2|\beta_1| + |\beta_2|. The minimizer is non-unique: any (β1,β2)(\beta_1, \beta_2) with β1+β2\beta_1 + \beta_2 equal to the optimal sum and β1+β2|\beta_1| + |\beta_2| minimal — i.e., since the optimal sum is positive here, any (β1,β2)(\beta_1, \beta_2) with β1,β20\beta_1, \beta_2 \ge 0 and the right sum — solves it. Among these, writing tt for the optimal sum (the soft-thresholded univariate fit, just below 1 for small λ\lambda), (β^1,β^2)=(t,0)(\hat\beta_1, \hat\beta_2) = (t, 0), (0,t)(0, t), and (t/2,t/2)(t/2, t/2) are all valid solutions. Prediction is identical across them; support is dramatically different.

This is an extreme case (perfectly collinear features), but the same phenomenon shows up in milder form whenever two features are highly correlated and both predictive — the lasso has no preference between them, and may flip which one it selects under tiny perturbations of the data.
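The instability is easy to reproduce. A hedged sketch with two almost identical columns (a tiny perturbation keeps the numerics well behaved): relabeling the columns flips which duplicate carries the weight, while the fitted values barely move. The λ value here is arbitrary.

```python
# Two nearly identical predictive columns: the selected support flips under relabeling.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal(n)
X = np.column_stack([x, x + 1e-6 * rng.standard_normal(n)])   # almost perfectly collinear
y = x + 0.5 * rng.standard_normal(n)                          # signal lives on the shared direction

for order in ([0, 1], [1, 0]):
    fit = Lasso(alpha=0.1, fit_intercept=False).fit(X[:, order], y)
    coef = np.empty(2)
    coef[list(order)] = fit.coef_                             # map back to the original labels
    print(f"column order {order}: coefficients on original features = {np.round(coef, 3)}")
# most or all of the weight lands on whichever duplicate the solver reaches first;
# the fitted values are essentially identical in both runs
```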

§6.2 The irrepresentable condition (Zhao-Yu 2006)

The right structural condition for sign consistency is the irrepresentable condition (IC), introduced independently by Zhao-Yu (2006) and Meinshausen-Bühlmann (2006). Given the active set SS and the sign vector sS:=sign(βS){1,1}s\mathbf{s}^*_S := \mathrm{sign}(\boldsymbol\beta^*_S) \in \{-1, 1\}^s:

Definition 1 (Irrepresentable condition).

The design X\mathbf{X} satisfies the (weak) irrepresentable condition for (S,sS)(S, \mathbf{s}^*_S) if

XScXS(XSXS)1sS    1.\big\| \mathbf{X}_{S^c}^\top \mathbf{X}_S \big(\mathbf{X}_S^\top \mathbf{X}_S\big)^{-1} \mathbf{s}^*_S \big\|_\infty \;\le\; 1.

The strong irrepresentable condition with parameter η>0\eta > 0 strengthens this to 1η\le 1 - \eta.

Geometric interpretation. The vector XS(XSXS)1sSRn\mathbf{X}_S (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{s}^*_S \in \mathbb{R}^n is the unique element of the column span of XS\mathbf{X}_S whose inner products with the active columns reproduce the sign pattern (applying XS\mathbf{X}_S^\top to it gives back sS\mathbf{s}^*_S); call it the sign-encoding direction. Then XScXS(XSXS)1sSRps\mathbf{X}_{S^c}^\top \mathbf{X}_S (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{s}^*_S \in \mathbb{R}^{p-s} collects the inner products of the inactive features Xj\mathbf{X}_j (jScj \in S^c) with that direction.

The condition asks: how strongly is each inactive feature correlated, after accounting for the active features, with the sign pattern of the active coefficients? IC says the correlation is bounded by 1; strong IC says strictly less than 1. The intuition: if some inactive feature Xj\mathbf{X}_j (jScj \in S^c) can be “represented” by the active features — written as a linear combination XSw\mathbf{X}_S \mathbf{w} — with wsS>1|\mathbf{w}^\top \mathbf{s}^*_S| > 1, the lasso will select Xj\mathbf{X}_j in preference to (or in addition to) the true active features. Strong IC rules this out.

An equivalent formulation in terms of regression coefficients. Let wj:=(XSXS)1XSXj\mathbf{w}_j := (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{X}_S^\top \mathbf{X}_j be the OLS coefficient vector for regressing the inactive feature Xj\mathbf{X}_j (jScj \in S^c) on the active features XS\mathbf{X}_S. Then IC reads

maxjScwjsS1,strong IC: 1η.\max_{j \in S^c} |\mathbf{w}_j^\top \mathbf{s}^*_S| \le 1, \qquad \text{strong IC: } \le 1 - \eta.

Each wj\mathbf{w}_j describes how the inactive feature jj is predicted by the active features. The IC says the dot product of this prediction recipe with the sign pattern of active coefficients is bounded — i.e., no inactive feature is “too aligned” with the active features in a sign-coherent way.

Population versus empirical. For a random design X\mathbf{X} with iid rows from N(0,Σ)\mathcal{N}(\mathbf{0}, \boldsymbol\Sigma), the population IC is

ΣSc,SΣS,S1sS1,\| \boldsymbol\Sigma_{S^c, S} \boldsymbol\Sigma_{S, S}^{-1} \mathbf{s}^*_S \|_\infty \le 1,

and the empirical IC concentrates around it as nn grows. The viz below plots the population IC as a function of correlation strength on DGP-1-style AR(1) Toeplitz designs.

Population IC quantity for AR(1) Toeplitz designs Σⱼₖ = ρ^|j−k| with contiguous active set S = {0, …, 9} and sign(β*_S) = (1, …, 1). Below the IC = 1 threshold, the lasso is sign-consistent (Wainwright 2009 Theorem 1); above the threshold, the lasso provably fails sign-consistency (Wainwright 2009 Theorem 3) regardless of how λ is chosen — elastic net (§8.2) or adaptive lasso (§8.3) become necessary. At ρ = 0.5 (DGP-1 default) the IC sits comfortably below 1 and the §1 viz showed clean recovery. No crossover in this ρ range. Computed live in-browser via Cholesky on the s × s = 10 × 10 active-set Gram block.

Population irrepresentable quantity (the ℓ_∞ norm of the off-support × on-support normal-equation coupling, ‖Σ_{Sᶜ S} Σ_{S S}⁻¹ sign(β*_S)‖_∞) for DGP-1-style designs (AR(1) Toeplitz Σⱼₖ = ρ^|j−k|, contiguous active set S = {0, …, 9}) as ρ varies from 0 to 0.95. Horizontal line at 1 is the IC threshold: below the line the lasso is sign-consistent; above the line it provably fails. At ρ = 0.5 (DGP-1 default) the IC quantity is comfortably below 1 and the lasso reliably recovers most of the support; at ρ > 0.7 the IC starts to bind and support recovery degrades dramatically.
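The population IC curve in the figure reduces to one small linear solve per ρ. A hedged sketch (smaller p for speed; the contiguous support and all-positive signs follow the caption):

```python
# Population irrepresentable quantity for AR(1) Toeplitz designs, contiguous support, +1 signs.
import numpy as np
from scipy.linalg import toeplitz

p, s = 200, 10
S, Sc = np.arange(s), np.arange(s, p)
signs = np.ones(s)

for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
    Sigma = toeplitz(rho ** np.arange(p))
    ic = np.max(np.abs(Sigma[np.ix_(Sc, S)] @ np.linalg.solve(Sigma[np.ix_(S, S)], signs)))
    print(f"rho = {rho:.1f}: IC quantity = {ic:.3f} ({'satisfied' if ic < 1 else 'violated'})")
```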

§6.3 The sample-size scaling for support recovery

Theorem 1 (Lasso sign-consistency (Zhao-Yu 2006; Wainwright 2009)).

Suppose:

(i) The design X\mathbf{X} satisfies the strong irrepresentable condition with parameter η>0\eta > 0.

(ii) The active-set Gram matrix is well-conditioned: λmin(XSXS/n)Cmin>0\lambda_{\min}(\mathbf{X}_S^\top \mathbf{X}_S / n) \ge C_{\min} > 0.

(iii) The columns are normalized: Xj22n\|\mathbf{X}_j\|_2^2 \le n.

(iv) The minimum signal is large enough: minjSβjc0λ(XSXS/n)1\min_{j \in S} |\beta^*_j| \ge c_0 \lambda \cdot \| (\mathbf{X}_S^\top \mathbf{X}_S / n)^{-1}\|_\infty for some constant c0c_0.

(v) The noise is sub-Gaussian with parameter σ\sigma.

Choose λ2ση2logpn\lambda \ge \frac{2 \sigma}{\eta} \sqrt{\frac{2 \log p}{n}}. Then with probability at least 14exp(cnλ2/σ2)1 - 4 \exp(-c \cdot n \lambda^2 / \sigma^2),

sign(β^lasso)=sign(β).\mathrm{sign}(\hat{\boldsymbol\beta}^{\text{lasso}}) = \mathrm{sign}(\boldsymbol\beta^*).

In particular, taking λ=c1σlog(p)/n\lambda = c_1 \sigma \sqrt{\log(p)/n} for some constant c1c_1, the conclusion holds with probability 1\to 1 as long as n(s/Cmin)log(p)n \gtrsim (s/C_{\min}) \log(p).

Proof sketch (primal-dual witness). Wainwright’s (2009) proof uses a five-step primal-dual witness construction:

  1. Restricted lasso. Solve the lasso only on the active features: β~S=argminβS12nyXSβS2+λβS1\tilde{\boldsymbol\beta}_S = \arg\min_{\boldsymbol\beta_S} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}_S \boldsymbol\beta_S\|^2 + \lambda \|\boldsymbol\beta_S\|_1. Set β~Sc=0\tilde{\boldsymbol\beta}_{S^c} = \mathbf{0} and define β~=(β~S,0)\tilde{\boldsymbol\beta} = (\tilde{\boldsymbol\beta}_S, \mathbf{0}).

  2. Sign verification. Verify that sign(β~S)=sign(βS)\mathrm{sign}(\tilde{\boldsymbol\beta}_S) = \mathrm{sign}(\boldsymbol\beta^*_S) — this is where the minimum-signal condition (iv) is used. With high probability, the restricted lasso has the right signs because the noise is small compared to βmin\beta^*_{\min}.

  3. Construct the dual. Set g~S=sign(β~S)\tilde{\mathbf{g}}_S = \mathrm{sign}(\tilde{\boldsymbol\beta}_S). The active KKT condition then determines what g~Sc\tilde{\mathbf{g}}_{S^c} would have to be for β~\tilde{\boldsymbol\beta} to be the full lasso solution:

g~Sc=XScXS(XSXS)1g~S+(noise term).\tilde{\mathbf{g}}_{S^c} = \mathbf{X}_{S^c}^\top \mathbf{X}_S (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \tilde{\mathbf{g}}_S + (\text{noise term}).
  4. Verify the inactive KKT condition. For β~\tilde{\boldsymbol\beta} to be the actual lasso solution, we need g~Sc<1\|\tilde{\mathbf{g}}_{S^c}\|_\infty < 1 (strict). The leading term is exactly the irrepresentable quantity from Definition 1; strong IC bounds it by 1η1 - \eta. The noise term is O(σlog(p)/n)O(\sigma \sqrt{\log(p)/n}) via sub-Gaussian deviation, which is η\ll \eta for large enough nn. So with high probability, g~Sc<1η/2<1\|\tilde{\mathbf{g}}_{S^c}\|_\infty < 1 - \eta/2 < 1.

  5. Conclude by KKT uniqueness. Steps 1–4 produce a (β~,g~)(\tilde{\boldsymbol\beta}, \tilde{\mathbf{g}}) pair satisfying the lasso KKT conditions with strictly bounded g~Sc\tilde{\mathbf{g}}_{S^c}. Under the design assumptions, the lasso solution is unique (recall §2.3), so β~=β^lasso\tilde{\boldsymbol\beta} = \hat{\boldsymbol\beta}^{\text{lasso}}. Since β~Sc=0\tilde{\boldsymbol\beta}_{S^c} = \mathbf{0} and sign(β~S)=sign(βS)\mathrm{sign}(\tilde{\boldsymbol\beta}_S) = \mathrm{sign}(\boldsymbol\beta^*_S), sign consistency holds. \blacksquare

The full proof with explicit constants is in Wainwright (2009, §III). The key probabilistic ingredient is the same sub-Gaussian deviation we used in §5.5 — extended to control the noise term in step 4.

Necessity of IC. Wainwright (2009, Theorem 3) also proved the converse: if the irrepresentable condition fails (say XScXS(XSXS)1sS>1+δ\| \mathbf{X}_{S^c}^\top \mathbf{X}_S (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \mathbf{s}^*_S \|_\infty > 1 + \delta for some δ>0\delta > 0), then P(sign(β^lasso)=sign(β))0\mathbb{P}(\mathrm{sign}(\hat{\boldsymbol\beta}^{\text{lasso}}) = \mathrm{sign}(\boldsymbol\beta^*)) \to 0 as nn \to \infty, regardless of how λ\lambda is chosen. The lasso provably fails to recover the support when IC is violated. So IC isn’t just a proof artifact — it’s the correct characterization of when the lasso can do support recovery.

§6.4 Contrasting prediction-risk and support-recovery: same estimator, different theorems

The two main theorems of §§5–6 differ on every axis. A comparison:

| | Prediction-risk bound (§5) | Support-recovery bound (§6) |
|---|---|---|
| What’s bounded | ‖X(β̂ − β*)‖₂²/n | P(sign(β̂) = sign(β*)) |
| Sufficient condition on X | Restricted-eigenvalue (§5.4) | Irrepresentable (§6.2) |
| Necessary? | RE essentially necessary for any sparse-regression estimator at the optimal rate | Strong IC necessary for lasso sign-consistency (Wainwright 2009) |
| Sample-size scaling | n ≳ s log(p) | n ≳ s log(p) |
| Minimum-signal needed? | No: works for any β* with ‖β*‖₀ ≤ s | Yes: β*_min ≳ λ required |
| What fixes failure | Larger n gives better RE | IC violated ⇒ lasso fundamentally can’t recover support; need adaptive lasso (§8.3) or post-selection refit |

The conditions are not nested. RE doesn’t imply IC, and IC doesn’t imply RE. They measure different geometric properties of the design:

  • RE is a lower bound on XX/n\mathbf{X}^\top \mathbf{X} / n restricted to a cone. It’s about the design being “well-conditioned on sparse and near-sparse vectors” — a global property that doesn’t depend on which SS is the active set.
  • IC is a constraint relating the inactive-to-active block of XX/n\mathbf{X}^\top \mathbf{X} / n to the sign pattern sS\mathbf{s}^*_S. It depends on SS and sS\mathbf{s}^*_S specifically.

A design can satisfy RE but violate IC (random Gaussian designs with strong correlation between active and inactive features), in which case the lasso predicts well but selects the wrong support. The reverse can also happen, though it’s less common in practice.

Practical implications. The lasso is a much better prediction tool than a model-selection tool. Two rules of thumb:

  • For prediction: trust the lasso. CV-selected λ\lambda, refit at λCV\lambda_{\text{CV}} or λ1SE\lambda_{1\text{SE}}, and use the lasso predictions. The §5 oracle inequality gives near-optimal prediction risk under mild conditions.

  • For variable selection: be skeptical of the lasso’s chosen support. Two specific patterns to watch for: (i) two highly correlated features where only one shows up in the lasso fit (the lasso may have arbitrarily picked one), and (ii) the lasso fit changing dramatically across resampled training sets (instability \Rightarrow IC likely violated). Use stability selection (Meinshausen-Bühlmann 2010) to assess; consider the adaptive lasso (§8.3) for a sign-consistent variant under weaker conditions.

The deeper bridge to §§7–10: practical λ\lambda-selection (§7) trades off these two objectives differently — LassoCV optimizes prediction (smaller λ\lambda, more features); LassoLarsIC with BIC penalizes model size more heavily (larger λ\lambda, fewer features, closer to support recovery). The debiased lasso (§10) sidesteps the support-recovery question entirely by producing valid CIs for individual coefficients without requiring sign consistency.

§7. Cross-validation for λ\lambda

The §5 oracle inequality recommends λσlog(p)/n\lambda \asymp \sigma \sqrt{\log(p)/n} for prediction-optimal performance — a rate, not a constant. The constant matters in practice (a factor of 2 in λ\lambda can change the active-set size by a factor of 2 or more), and the noise scale σ\sigma is rarely known. Practical λ\lambda-selection uses data-driven criteria: cross-validation (the default in scikit-learn and glmnet), the one-standard-error rule (a parsimony-favoring variant), and information criteria like AIC/BIC computed along the lasso path (LassoLarsIC). This section covers all three.

The CV / 1-SE / BIC distinction maps directly onto the §6 discussion: CV optimizes prediction error and tends to select more features than necessary; BIC penalizes model size more aggressively and is sometimes selection-consistent; the 1-SE rule is a Hastie-Tibshirani-Friedman compromise that gives a smaller model than CV-min at minimal prediction-performance cost. None is “right” — the right choice depends on whether you care about prediction or model interpretability.

This is a named-section of the topic per the formalML “no separate cross-validation topic” convention. The same structural pattern is used in Kernel Regression §5 for LOO-CV / GCV bandwidth selection.

§7.1 K-fold cross-validation with LassoCV

KK-fold CV estimates the prediction risk at each λ\lambda in a candidate grid by holdout. The procedure:

  1. Partition the training data into KK folds of approximately equal size.
  2. For each fold k=1,,Kk = 1, \dots, K and each candidate λ\lambda:
    • Fit the lasso on the data minus fold kk, obtaining β^(k)(λ)\hat{\boldsymbol\beta}^{(-k)}(\lambda).
    • Compute the held-out MSE: MSEk(λ)=1fold kifold k(yixiβ^(k)(λ))2\mathrm{MSE}_k(\lambda) = \frac{1}{|\text{fold } k|} \sum_{i \in \text{fold } k} (y_i - \mathbf{x}_i^\top \hat{\boldsymbol\beta}^{(-k)}(\lambda))^2.
  3. Average over folds: CV(λ)=1KkMSEk(λ)\mathrm{CV}(\lambda) = \frac{1}{K} \sum_k \mathrm{MSE}_k(\lambda).
  4. Select λmin=argminλCV(λ)\lambda_{\min} = \arg\min_\lambda \mathrm{CV}(\lambda).

Standard choices: K=10K = 10 for general use, K=5K = 5 if computation is constrained, leave-one-out (K=nK = n) only when nn is small (otherwise computationally wasteful and statistically unstable). The candidate λ\lambda-grid is typically log-spaced from λmax\lambda_{\max} (largest, all coefficients zero) down to λmax103\lambda_{\max} \cdot 10^{-3} or so, with 100 grid points — LassoCV(n_alphas=100)’s default.

Why CV works. CV(λ)\mathrm{CV}(\lambda) is an estimator of the test prediction error PE(λ)=E[(yxβ^)2]\mathrm{PE}(\lambda) = \mathbb{E}[(y - \mathbf{x}^\top \hat{\boldsymbol\beta})^2], with bias of order 1/n1/n (because each fold-fit uses (K1)/Kn(K-1)/K \cdot n samples instead of nn). At the fold sizes used in practice (K5K \ge 5), the bias is negligible compared to the variance of the CV estimate.

Computational efficiency. Naively, KK-fold CV requires K×ΛgridK \times |\Lambda_{\text{grid}}| lasso fits. The glmnet and scikit-learn implementations use warm starts along the λ\lambda path (recall §3.2) — fitting the lasso at the entire λ\lambda-grid for one fold is barely more expensive than fitting at a single λ\lambda, so the practical cost is more like KK path computations. For DGP-1 with K=10K = 10, Λ=100|\Lambda| = 100, p=500p = 500: about 1–3 seconds total, dominated by the matrix-vector multiplies inside coordinate descent.
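In scikit-learn the whole procedure is one estimator. A minimal hedged sketch (iid Gaussian toy design and illustrative constants, not the article's exact DGP-1 or its in-browser ISTA):

```python
# 10-fold cross-validation over a 100-point log-spaced lambda grid with LassoCV.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s, sigma = 200, 500, 10, 0.5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 1.0                                   # toy sparse truth (assumption)
y = X @ beta_star + sigma * rng.standard_normal(n)

cv = LassoCV(cv=10, n_alphas=100, fit_intercept=False, max_iter=5000).fit(X, y)
print("lambda_min (cv.alpha_)        :", round(cv.alpha_, 4))
print("active-set size at lambda_min :", np.count_nonzero(cv.coef_))
print("CV-MSE at the minimum         :", round(cv.mse_path_.mean(axis=1).min(), 4))
```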

§7.2 The one-standard-error rule: λ1SE\lambda_{1\mathrm{SE}} vs λmin\lambda_{\min}

The CV estimate CV(λ)\mathrm{CV}(\lambda) is itself a random quantity — it has variance over the choice of fold partition and over the training data. A useful uncertainty quantification is the standard error across folds:

SE(λ):=sdk(MSEk(λ))K.\mathrm{SE}(\lambda) := \frac{\mathrm{sd}_k(\mathrm{MSE}_k(\lambda))}{\sqrt{K}}.

The one-standard-error rule (Breiman et al. 1984; popularized by Hastie-Tibshirani-Friedman, Elements of Statistical Learning, §7.10) selects the most parsimonious model whose CV error is within one standard error of the minimum:

λ1SE:=max{λ:CV(λ)CV(λmin)+SE(λmin)}.\lambda_{1\mathrm{SE}} := \max\{\lambda : \mathrm{CV}(\lambda) \le \mathrm{CV}(\lambda_{\min}) + \mathrm{SE}(\lambda_{\min})\}.

Since CV(λ)\mathrm{CV}(\lambda) is roughly U-shaped in λ\lambda, λ1SE>λmin\lambda_{1\mathrm{SE}} > \lambda_{\min} — the 1-SE-selected model is more regularized, hence sparser.

The motivation. The CV minimizer λmin\lambda_{\min} is the unbiased “MSE-optimal” choice but tends to be unstable across resampled training data — a small perturbation in the training set can shift λmin\lambda_{\min} by a factor of 2 in either direction. The 1-SE rule trades this instability for a small, controlled increase in prediction error: the resulting model has CV-MSE within one standard error of optimal (i.e., not statistically distinguishable from λmin\lambda_{\min}‘s prediction performance) but is more parsimonious and reproducible.

In the lasso context, λ1SE\lambda_{1\mathrm{SE}} typically gives an active set 10–30% smaller than λmin\lambda_{\min}, with test prediction error 5–15% larger. For interpretability-driven applications (variable selection, communication of results, downstream modeling), the 1-SE rule is the standard recommendation.

A practical caveat. The 1-SE rule is a heuristic, not a theorem. Its bias-variance trade-off is empirically reasonable but doesn’t have a sharp theoretical justification — it doesn’t, for instance, give support consistency under weaker conditions than CV-min. If you need provable support recovery, use BIC (§7.3) or stability selection (mentioned in §6.4). If you need provable prediction risk, the §5 oracle inequality is the right reference and CV-min is the right selector.
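The 1-SE rule is a few lines on top of LassoCV's per-fold MSE path. A hedged sketch on the same illustrative toy design as above; cv.mse_path_ has shape (n_alphas, n_folds), with rows matching cv.alphas_:

```python
# One-standard-error rule computed from LassoCV's per-fold MSE path.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s, sigma = 200, 500, 10, 0.5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 1.0                                   # toy sparse truth (assumption)
y = X @ beta_star + sigma * rng.standard_normal(n)

cv = LassoCV(cv=10, n_alphas=100, fit_intercept=False, max_iter=5000).fit(X, y)
mean_mse = cv.mse_path_.mean(axis=1)                            # CV(lambda) averaged over folds
se_mse = cv.mse_path_.std(axis=1, ddof=1) / np.sqrt(cv.mse_path_.shape[1])
i_min = mean_mse.argmin()
threshold = mean_mse[i_min] + se_mse[i_min]
lam_1se = cv.alphas_[mean_mse <= threshold].max()               # largest lambda within 1 SE
print("lambda_min =", round(cv.alpha_, 4), " lambda_1SE =", round(float(lam_1se), 4))
```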

10-fold LassoCV on smaller-scale DGP-1 (n = 200, p = 200, s = 10, σ = 0.5, AR(1) ρ = 0.5). The teal curve is mean CV-MSE across folds; shaded band is ±1 SE. Five selector markers (vertical dashed lines) ordered from smallest to largest λ: CV-min (largest active set, smallest test MSE), CV-1SE (parsimony-favoring within 1 SE of CV-min), AIC and BIC (information-criterion picks on the lasso path; BIC selects sparser models than AIC), and theory-guided RIC = 2σ√(2 log p / n) from Theorem 1 (largest, conservative). Computed live in-browser via 10-fold warm-started ISTA across 25 log-spaced λ values (~1-2 s).

The 10-fold LassoCV curve CV(λ) on DGP-1 with shaded ±1 SE band. Vertical markers at λ_min (CV-MSE minimizer) and λ_1SE (largest λ within one SE of the minimum). The canonical U-shape; the 1-SE rule's choice sits in the relatively flat region near the minimum and gives a substantially sparser model at minimal prediction-performance cost. (Static fallback at p = 500, two-panel; the interactive viz above runs at p = 200 with all five selector markers in one panel.)

§7.3 BIC selection with LassoLarsIC

Information criteria offer a different selection philosophy: rather than estimating prediction error directly via holdout, they balance model fit against model complexity through an explicit complexity penalty.

The criteria. For a candidate λ\lambda with active-set size kλ:=A^λk_\lambda := |\hat A_\lambda| and residual sum of squares RSSλ:=yXβ^(λ)22\mathrm{RSS}_\lambda := \|\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}(\lambda)\|_2^2:

AIC(λ)=nlog(RSSλ/n)+2kλ,\mathrm{AIC}(\lambda) = n \log(\mathrm{RSS}_\lambda / n) + 2 k_\lambda,

BIC(λ)=nlog(RSSλ/n)+kλlogn.\mathrm{BIC}(\lambda) = n \log(\mathrm{RSS}_\lambda / n) + k_\lambda \log n.

Both penalize larger models; BIC penalizes more aggressively as soon as n>e27.4n > e^2 \approx 7.4, so BIC selects smaller models than AIC.

The use of kλ=A^λk_\lambda = |\hat A_\lambda| as the lasso’s effective degrees of freedom rests on Zou-Hastie-Tibshirani (2007), who showed that E[kλ]=df(β^lasso(λ))\mathbb{E}[k_\lambda] = \mathrm{df}(\hat{\boldsymbol\beta}^{\text{lasso}}(\lambda)) exactly — the size of the active set is an unbiased estimator of the lasso’s degrees of freedom. This is a non-trivial result; for a generic non-linear estimator, dof is not the count of nonzero parameters. The lasso’s piecewise-linear path makes the result exact.

LassoLarsIC. The scikit-learn implementation computes the entire lasso path via LARS (Least Angle Regression — Efron-Hastie-Johnstone-Tibshirani 2004), which exploits piecewise linearity to enumerate every knot λ(k)\lambda_{(k)} in O(npmin(n,p))O(np \min(n, p)) total time. The IC value is computed at each knot, and the λ\lambda minimizing the chosen criterion is returned. The path-based approach is exact (no λ\lambda-grid discretization error) but only practical for moderate pp — at p>104p > 10^4, coordinate descent on a λ\lambda-grid is much faster.

Selection consistency. BIC for the lasso is selection-consistent under additional conditions: P(A^λBIC=S)1\mathbb{P}(\hat A_{\lambda_{\mathrm{BIC}}} = S) \to 1 as nn \to \infty if the design satisfies a slightly stronger condition than IC and the minimum signal is bounded below (Wang-Li-Tsai 2007). AIC and CV are not selection-consistent in general — they tend to over-select features (include all of SS plus some noise features) even asymptotically. For variable-selection applications, this is BIC’s main virtue.

Caveats. BIC’s selection consistency is asymptotic; at finite samples, BIC can over- or under-select depending on the signal strength. AIC is roughly equivalent to leave-one-out CV in expectation (Stone 1977) and tends to choose larger models than KK-fold CV with KnK \ll n.
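Both criteria are available along the LARS path through LassoLarsIC. A hedged sketch: the toy design here uses n > p so the estimator can estimate the noise variance internally (at p ≥ n, recent scikit-learn versions require passing a noise_variance estimate explicitly).

```python
# AIC vs. BIC selection along the LARS path with LassoLarsIC (toy design with n > p).
import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(0)
n, p, s, sigma = 400, 200, 10, 0.5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 1.0                                   # toy sparse truth (assumption)
y = X @ beta_star + sigma * rng.standard_normal(n)

for criterion in ("aic", "bic"):
    fit = LassoLarsIC(criterion=criterion, fit_intercept=False).fit(X, y)
    print(f"{criterion.upper()}: lambda = {fit.alpha_:.4f}, "
          f"active-set size = {np.count_nonzero(fit.coef_)}")
# BIC's heavier complexity penalty typically picks a larger lambda, hence a smaller model
```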

§7.4 Comparison on the §1 toy DGP

The four selectors — LassoCV λmin\lambda_{\min}, LassoCV λ1SE\lambda_{1\mathrm{SE}}, LassoLarsIC BIC, LassoLarsIC AIC — make different trade-offs and select different λ\lambda values on the same data. On DGP-1 (n=200,p=500,s=10,σ=0.5n=200, p=500, s=10, \sigma=0.5), the typical pattern (summarized in the table below and in the selector viz at the end of this section):

| Selector | Selected λ | Active-set size \|Â_λ\| | Test MSE |
|---|---|---|---|
| LassoCV (λ_min) | smallest | largest (~12–18, includes some false positives) | smallest |
| LassoCV (λ_1SE) | medium | medium (~10–13, close to true s = 10) | small (5–15% above λ_min) |
| LassoLarsIC (AIC) | smallish | medium-to-large | small (close to λ_min) |
| LassoLarsIC (BIC) | largest | smallest (~8–11, may miss weak signal coords) | medium (10–25% above λ_min) |

Recommendation.

  • If prediction is the goal: use LassoCV with λmin\lambda_{\min}. The §5 oracle inequality says this achieves the optimal rate; in practice it gives the smallest test MSE on most problems.
  • If prediction is the goal but you want a smaller, more interpretable model: use the 1-SE rule. Trades a small amount of prediction performance for a substantially smaller active set and more reproducible variable selection.
  • If selection consistency is the goal: use BIC via LassoLarsIC(criterion='bic'). The selected model is asymptotically the true support under stronger conditions; finite-sample behavior depends on signal strength.
  • For everything else: start with LassoCV λmin\lambda_{\min}. It’s the default in almost every lasso application; the alternatives are refinements for specific use cases.

The §10 debiased lasso uses LassoCV λmin\lambda_{\min} as its initial estimator, then corrects the resulting bias to produce valid CIs. The choice of λ\lambda for the initial fit isn’t critical for the debiased lasso’s coverage — the one-step correction substantially compensates for the lasso’s selection idiosyncrasies.

Bar chart of CV-selected lambda values across five selectors (LassoCV-min, LassoCV-1SE, LassoLarsIC-AIC, LassoLarsIC-BIC, theory-guided RIC) on DGP-1.
Selected λ across five selectors on DGP-1: LassoCV (λ_min, smallest), LassoCV (λ_1SE), LassoLarsIC AIC, LassoLarsIC BIC, and the theory-guided RIC (λ = 2σ√(2 log(p)/n), largest). λ_min produces the largest active set with smallest test MSE; BIC produces the smallest active set; AIC sits in between. The selector choice trades off prediction error against model parsimony.

§8. Ridge, elastic net, and adaptive lasso

The lasso has three practical limitations: it can be unstable when features are highly correlated (the §6.1 collinearity counterexample — flipping between equivalent supports); it biases active coefficients toward zero by a constant λ\lambda (the §4.1 shrinkage bias); and it requires the irrepresentable condition for support recovery (the §6.2 IC, often violated in real data). Each of these motivates a variant.

Ridge (already introduced in §1.3) keeps the L2 penalty, gives a unique dense solution under any design, and is robust to correlated features — but doesn’t select. Elastic net (Zou-Hastie 2005) combines L1 + L2 penalties, getting the lasso’s sparsity with ridge’s stability under correlated features. Adaptive lasso (Zou 2006) uses data-driven feature-specific weights to remove the constant shrinkage bias and achieve the oracle property under weaker conditions than IC.

This section explains when each variant wins and closes with a side-by-side comparison on DGP-1 (§8.4). The decision tree:

  • Truth is dense (all coefficients moderate, no sparsity): ridge.
  • Truth is sparse, features are well-separated: lasso.
  • Truth is sparse but features come in correlated groups: elastic net.
  • Truth is sparse, you want unbiased active coefficients and support consistency: adaptive lasso.

§8.1 Ridge: continuous shrinkage, no selection

Recall the ridge objective from §1.3:

β^ridge(λ)=argminβRp12nyXβ22+λ2β22,\hat{\boldsymbol\beta}^{\text{ridge}}(\lambda) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 + \frac{\lambda}{2} \|\boldsymbol\beta\|_2^2,

with closed form β^ridge=(XX+nλI)1Xy\hat{\boldsymbol\beta}^{\text{ridge}} = (\mathbf{X}^\top \mathbf{X} + n\lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}. Three relevant properties at pnp \gg n:

Always defined and unique. The matrix XX+nλI\mathbf{X}^\top \mathbf{X} + n\lambda \mathbf{I} is positive definite for any λ>0\lambda > 0 regardless of pp vs nn. Ridge has no failure mode the way OLS does.

Continuous shrinkage, dense solutions. In the SVD basis X=UDV\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^\top, ridge shrinks each coefficient by a factor dj2/(dj2+nλ)d_j^2 / (d_j^2 + n\lambda) where djd_j is the jj-th singular value. Small singular values (the noisy directions) get heavy shrinkage; large singular values (the signal directions) get light shrinkage. But no coefficient is zeroed out — the solution is generically dense.
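
A quick NumPy check of the closed form against the SVD shrinkage factors, on arbitrary toy data (nothing DGP-1-specific is needed):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 200, 500, 0.1
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Closed form: (X'X + n*lambda*I)^{-1} X'y.
beta_direct = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# SVD form: each singular direction is shrunk by d_j / (d_j^2 + n*lambda).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((d / (d**2 + n * lam)) * (U.T @ y))

print("max |difference|:", np.max(np.abs(beta_direct - beta_svd)))   # ~1e-12
shrink = d**2 / (d**2 + n * lam)
print("shrinkage factor range:", shrink.min().round(3), "to", shrink.max().round(3))
```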

Bayesian interpretation. Ridge is the posterior mean of β\boldsymbol\beta under a Gaussian prior βN(0,λ1I)\boldsymbol\beta \sim \mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{I}) and Gaussian likelihood yX,βN(Xβ,σ2I)\mathbf{y} | \mathbf{X}, \boldsymbol\beta \sim \mathcal{N}(\mathbf{X} \boldsymbol\beta, \sigma^2 \mathbf{I}). The penalty strength λ\lambda is inversely related to the prior variance.

When ridge wins. Two scenarios:

  1. Truly dense β\boldsymbol\beta^*. When every feature carries some signal — no underlying sparsity — the lasso’s sparsity assumption is wrong, and the lasso under-fits (zeros out features that should be active). Ridge has no such bias.
  2. Heavy multicollinearity with no sparsity prior. When features are nearly linearly dependent and there’s no reason to prefer one over another, ridge distributes the signal smoothly across them. Lasso would arbitrarily select one and zero the others — a worse use of information.

When ridge loses. When β\boldsymbol\beta^* is genuinely sparse, ridge’s refusal to zero out inactive coordinates leaves residual noise in the fitted values. Each inactive coordinate contributes σ2dj2/(dj2+nλ)2\sigma^2 d_j^2 / (d_j^2 + n\lambda)^2 to the prediction variance — small but non-zero. Lasso eliminates this contribution by zeroing the inactive coordinates entirely. On DGP-1 (s=10s = 10, p=500p = 500, so 490 inactive coordinates), the cumulative variance contribution is substantial, and lasso prediction MSE is typically 2–5× better than ridge.

For most high-dimensional ML problems the truth is more sparse than dense, so the lasso wins more often than ridge. The standard practice is to compare both on cross-validated test MSE and pick the winner.

§8.2 The elastic net for groups of correlated features

The elastic net (Zou-Hastie 2005) combines L1 and L2 penalties:

β^EN(λ,α)=argminβRp12nyXβ22+λ[αβ1+1α2β22],\hat{\boldsymbol\beta}^{\text{EN}}(\lambda, \alpha) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 + \lambda \left[ \alpha \|\boldsymbol\beta\|_1 + \frac{1 - \alpha}{2} \|\boldsymbol\beta\|_2^2 \right],

with mixing parameter α[0,1]\alpha \in [0, 1] controlling the L1/L2 balance. α=1\alpha = 1 recovers pure lasso, α=0\alpha = 0 recovers pure ridge (up to scaling). Standard practice: α=0.5\alpha = 0.5 or tuned via cross-validation.

Two structural advantages over pure lasso:

  1. Strict convexity for α<1\alpha < 1. The L2 term makes the objective strictly convex in β\boldsymbol\beta, so the solution is unique even with duplicate or perfectly collinear columns. The §6.1 collinear-features pathology disappears.
  2. The grouping effect. For two highly correlated features Xj\mathbf{X}_j and Xk\mathbf{X}_k (correlation ρjk\rho_{jk} close to 1), the elastic-net coefficient difference is bounded:

Proposition 1 (Grouping inequality (Zou-Hastie 2005, Theorem 1)).

For any two features j,k{1,,p}j, k \in \{1, \dots, p\} with empirical correlation ρjk\rho_{jk},

β^jENβ^kENy2nλ(1α)2(1ρjk).\big| \hat\beta_j^{\text{EN}} - \hat\beta_k^{\text{EN}} \big| \le \frac{\|\mathbf{y}\|_2}{n\lambda(1 - \alpha)} \sqrt{2(1 - \rho_{jk})}.

As ρjk1\rho_{jk} \to 1, the RHS 0\to 0 — perfectly correlated features get equal coefficients. The lasso has no such property; it would pick one feature and zero the other. So in applications where the truth has correlated groups of active features (gene-expression clusters, related text features), elastic net selects all members of the group; lasso picks one member.

Reduction to a standard lasso. Zou-Hastie showed that the elastic net can be solved by augmenting the design matrix and running a standard lasso solver. Define

X~=(Xnλ(1α)Ip)R(n+p)×p,y~=(y0p)Rn+p.\tilde{\mathbf{X}} = \begin{pmatrix} \mathbf{X} \\ \sqrt{n \lambda (1 - \alpha)} \cdot \mathbf{I}_p \end{pmatrix} \in \mathbb{R}^{(n+p) \times p}, \quad \tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix} \in \mathbb{R}^{n+p}.

Then β^EN(λ,α)\hat{\boldsymbol\beta}^{\text{EN}}(\lambda, \alpha) is the lasso solution on (X~,y~)(\tilde{\mathbf{X}}, \tilde{\mathbf{y}}) at penalty λα\lambda \alpha. So all the lasso algorithms (coordinate descent, FISTA) work for the elastic net by augmentation. scikit-learn’s ElasticNet and ElasticNetCV use a direct coordinate-descent implementation that exploits the structure without explicit augmentation, but the principle is the same.
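
A numerical check of the augmentation identity; the only subtlety is that scikit-learn's 1/(2·n_samples) loss scaling shrinks the effective lasso penalty on the augmented problem to λα·n/(n+p), a bookkeeping detail assumed here rather than stated above:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = X[:, :5].sum(axis=1) + 0.5 * rng.standard_normal(n)

lam, a = 0.1, 0.5                                    # lambda and mixing alpha as in the text
enet = ElasticNet(alpha=lam, l1_ratio=a, fit_intercept=False,
                  tol=1e-10, max_iter=50_000).fit(X, y)

# Augmented design: stack sqrt(n*lam*(1-a)) * I_p under X and zeros under y.
X_aug = np.vstack([X, np.sqrt(n * lam * (1 - a)) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
# The augmented problem has n+p rows, so the equivalent lasso penalty is rescaled.
lasso_aug = Lasso(alpha=lam * a * n / (n + p), fit_intercept=False,
                  tol=1e-10, max_iter=50_000).fit(X_aug, y_aug)

print("max coefficient gap:", np.max(np.abs(enet.coef_ - lasso_aug.coef_)))  # solver tolerance
```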

When elastic net wins. Highly correlated features, especially correlated groups of features that should be selected together. Genomics is the canonical application: SNPs in linkage disequilibrium come in correlated blocks, and a gene-level signal manifests as coordinated effects across an entire block. Lasso would arbitrarily pick one SNP from each block; elastic net selects all the SNPs from active blocks.

When elastic net loses. When features are well-separated (low correlation), elastic net offers no advantage over lasso. The L2 term adds a small bias relative to lasso without reducing variance much. On the DGP-1 setting (AR(1) Toeplitz ρ=0.5\rho = 0.5, moderate correlation), elastic net and lasso typically perform similarly; on a stronger-correlation DGP (ρ=0.9\rho = 0.9), elastic net wins clearly.

§8.3 The adaptive lasso and the oracle property

The lasso biases every active coefficient by λsign(βj)-\lambda \, \mathrm{sign}(\beta^*_j) (§4.1). This constant shrinkage is independent of βj|\beta^*_j|, so even strong-signal coordinates get shrunk by a fixed amount — a real problem for parameter recovery and inference.

The adaptive lasso (Zou 2006) replaces the uniform L1 penalty with feature-specific weights:

β^aLasso(λ)=argminβRp12nyXβ22+λj=1pwjβj,\hat{\boldsymbol\beta}^{\text{aLasso}}(\lambda) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \frac{1}{2n} \|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|_2^2 + \lambda \sum_{j=1}^p w_j |\beta_j|,

with weights wj=1/β^jinitγw_j = 1 / |\hat\beta_j^{\text{init}}|^\gamma for some initial estimator β^init\hat{\boldsymbol\beta}^{\text{init}} and a tuning parameter γ>0\gamma > 0 (typically γ=1\gamma = 1). Coordinates with large initial estimates get small weights (light shrinkage); coordinates with small or zero initial estimates get large weights (heavy shrinkage, effectively forced to zero).

Reduction to standard lasso. Rescale the features: X~j=Xj/wj\tilde{\mathbf{X}}_j = \mathbf{X}_j / w_j and β~j=wjβj\tilde\beta_j = w_j \beta_j. Then yXβ2=yX~β~2\|\mathbf{y} - \mathbf{X} \boldsymbol\beta\|^2 = \|\mathbf{y} - \tilde{\mathbf{X}} \tilde{\boldsymbol\beta}\|^2 and jwjβj=jβ~j\sum_j w_j |\beta_j| = \sum_j |\tilde\beta_j|, so the weighted lasso reduces to a standard lasso on rescaled features. Solve with any lasso solver, then unscale: β^j=β~j^/wj\hat\beta_j = \hat{\tilde\beta_j} / w_j.

Choice of β^init\hat{\boldsymbol\beta}^{\text{init}}. Two standard choices:

  • OLS (when n>pn > p): β^init=(XX)1Xy\hat{\boldsymbol\beta}^{\text{init}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.
  • Ridge or LassoCV (when pnp \ge n): OLS isn’t defined, so use a regularized initial. LassoCV is the standard high-dimensional choice (cf. Bühlmann-van de Geer 2011, §2.8).
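
A compact sketch of the two-stage recipe with a LassoCV initial, $\gamma = 1$, and the small ε-floor on the weights (the same stabilizer mentioned in the Figure 8.2 caption) so that zeroed initial coordinates get a large but finite weight:

```python
import numpy as np
from scipy.linalg import toeplitz
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n, p, s, sigma = 200, 500, 10, 0.5
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(toeplitz(0.5 ** np.arange(p))).T
beta_true = np.zeros(p); beta_true[:s] = 1.0
y = X @ beta_true + sigma * rng.standard_normal(n)

# Stage 1: high-dimensional initial estimator.
beta_init = LassoCV(cv=10).fit(X, y).coef_
gamma, eps = 1.0, 1e-4                               # eps-floor keeps the weights finite
w = 1.0 / (np.abs(beta_init) + eps) ** gamma

# Stage 2: the weighted lasso is a standard lasso on rescaled features X_j / w_j.
X_tilde = X / w                                      # broadcasts over columns
fit = LassoCV(cv=10).fit(X_tilde, y)
beta_adaptive = fit.coef_ / w                        # unscale back to the original parametrization

print("active set:", np.flatnonzero(beta_adaptive)[:15])
print("mean estimate on the true support:", beta_adaptive[:s].mean().round(3))
```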

Theorem 1 (Oracle property of the adaptive lasso (Zou 2006)).

Assume the initial estimator $\hat{\boldsymbol\beta}^{\text{init}}$ is $\sqrt{n}$-consistent for $\boldsymbol\beta^*$, the minimum signal $\beta^*_{\min}$ is bounded below by an appropriate constant, the noise is sub-Gaussian, and $\lambda_n$ satisfies $\sqrt{n}\,\lambda_n \to 0$ and $n^{(1+\gamma)/2}\lambda_n \to \infty$. Then the adaptive lasso satisfies the oracle property:

(i) Selection consistency: P(A^λnaLasso=S)1\mathbb{P}(\hat A_{\lambda_n}^{\text{aLasso}} = S) \to 1.

(ii) Asymptotic normality: n(β^SaLassoβS)dN(0,σ2(ΣS,S)1)\sqrt{n} (\hat{\boldsymbol\beta}_S^{\text{aLasso}} - \boldsymbol\beta^*_S) \overset{d}{\to} \mathcal{N}\big(\mathbf{0}, \sigma^2 (\boldsymbol\Sigma_{S, S})^{-1}\big), the same asymptotic distribution as the oracle OLS estimator restricted to SS.

The oracle property is stronger than sign-consistency: in addition to recovering the support, the adaptive-lasso estimates on SS have the correct asymptotic standard errors — same as if you had known SS in advance and run OLS on it. The constant shrinkage bias disappears asymptotically because the weights wj=1/β^jinitw_j = 1/|\hat\beta_j^{\text{init}}| are bounded for active coordinates (where the initial estimator is near βj|\beta^*_j|) but diverge for inactive coordinates.

The conditions are weaker than IC. The adaptive lasso achieves selection consistency under any design that has a n\sqrt{n}-consistent initial estimator — much weaker than the strong irrepresentable condition required by the lasso. In particular, designs that fail IC (correlated active and inactive features) can still admit adaptive-lasso selection consistency.

Trade-off. The adaptive lasso requires a good initial estimator. In the high-dimensional regime where LassoCV is the standard initial, the adaptive lasso’s behavior depends on the quality of the lasso’s variable selection — a circular dependence that limits the asymptotic argument’s practical reach. Empirically, adaptive lasso usually beats vanilla lasso on support recovery and produces less-biased active coefficients, but the improvement isn’t dramatic on well-behaved problems.

§8.4 Side-by-side comparison on the §1 DGP

The four methods — ridge, lasso, elastic net, adaptive lasso — applied to DGP-1 at CV-selected penalties (Figure 8.1). The key visual contrasts:

  • Ridge’s path is dense everywhere. Every coefficient is non-zero at every λ>0\lambda > 0, shrinking smoothly. The 10 true active coordinates emerge as the largest-magnitude coefficients, but distinguishing them from the 490 inactive coordinates requires a thresholding step.
  • Lasso’s path is sparse. Most coefficients are exactly zero at moderate λ\lambda. The true active coordinates appear as the first to leave zero as λ\lambda decreases. Active coefficients are visibly shrunk relative to the truth (β=1\beta^* = 1 on SS).
  • Elastic net’s path looks like the lasso with smoother corners. Sparsity is preserved (most coefficients are zero at moderate λ\lambda), but the active coefficients are slightly less shrunk than the lasso’s at the same effective regularization. The grouping effect would be more visible on a strongly-correlated DGP.

Figure 8.2 shows the adaptive lasso vs vanilla lasso on the active-coordinate estimates: the vanilla lasso underestimates βj|\beta^*_j| by approximately λ\lambda (the constant shrinkage bias from §4.1); the adaptive lasso gets much closer to the true value βj=1\beta^*_j = 1 for active jj.

Recommendation. Default to LassoCV. If features come in correlated groups, try ElasticNetCV. If you need unbiased coefficient estimates or weaker selection conditions, run the adaptive lasso as a post-processing step on top of LassoCV. The §10 debiased lasso provides a different fix for the bias problem — it doesn’t reduce the lasso’s bias but corrects for it explicitly when forming CIs.

Regularization paths β̂_j(t) on DGP-1-style data (n = 150, p = 200, s = 10, σ = 0.5, AR(1) ρ = 0.5). True active coordinates (j < 10) drawn in distinct colors; 190 inactive coordinates in light gray. Dashed reference at β* = 1. At the median penalty t ≈ 2.47e-1, the lasso fit has 10/10 active and 0/190 noise coordinates nonzero. Ridge keeps every coefficient nonzero at every penalty; lasso and elastic net both reach exact zeros at moderate penalty, with the active coords the first to leave zero as the penalty shrinks.

Adaptive lasso vs vanilla lasso active-coefficient estimates on DGP-1; adaptive substantially closer to the true beta = 1.
Adaptive lasso vs vanilla lasso on the 10 active coordinates of DGP-1. Vanilla lasso underestimates |β*ⱼ| by approximately λ (the constant shrinkage bias from §4.1); adaptive lasso (with LassoCV initial and weights wⱼ = 1/(|β̂ⱼ_init| + ε)^γ, γ = 1, ε = 10⁻⁴) gets much closer to the true value β*ⱼ = 1. The ε-floor is a numerical-stability choice that doesn't materially affect the demo.

§9. Geometry of the high-dimensional regime

§§5–6 introduced two structural conditions on the design matrix X\mathbf{X} — the restricted-eigenvalue condition (RE) for prediction-risk bounds and the irrepresentable condition (IC) for support recovery — and stated when they hold (random Gaussian, sub-Gaussian, RIP-bounded designs) without unpacking the geometry. This section deepens the picture: it introduces the restricted isometry property (RIP) from compressed sensing, traces the implication chain RIPRE\mathrm{RIP} \Rightarrow \mathrm{RE}, gives the standard concentration arguments for sub-Gaussian random designs, and sketches when each condition fails in practice.

The section is text-heavy and relatively short — there are no new theorems beyond §§5–6, just structural unpacking. The payoff is a cleaner picture of what makes a “good” design for the lasso and what doesn’t, useful for diagnosing real-data failures.

§9.1 The restricted isometry property (RIP)

The restricted isometry property was introduced by Candès-Tao (2005) in the compressed-sensing literature, predating Bickel-Ritov-Tsybakov’s RE by a few years. It’s a uniform version of RE: instead of asking that the design preserve geometry on a particular cone C(S,c0)\mathcal{C}(S, c_0), it asks that the design act approximately like an isometry on all ss-sparse vectors.

Definition 1 (Restricted isometry property (Candès-Tao 2005)).

The matrix X\mathbf{X} satisfies the restricted isometry property of order ss with constant δs(0,1)\delta_s \in (0, 1) if for every vector vRp\mathbf{v} \in \mathbb{R}^p with v0s\|\mathbf{v}\|_0 \le s,

(1δs)v221nXv22(1+δs)v22.(1 - \delta_s) \|\mathbf{v}\|_2^2 \le \frac{1}{n} \|\mathbf{X} \mathbf{v}\|_2^2 \le (1 + \delta_s) \|\mathbf{v}\|_2^2.

The smallest such δs\delta_s is the restricted isometry constant.

In words: X\mathbf{X} (with the 1/n1/\sqrt n scaling) acts almost like an isometry on ss-sparse vectors — the squared norm Xv2/n\|\mathbf{X}\mathbf{v}\|^2 / n approximates the squared norm v2\|\mathbf{v}\|^2 to within a multiplicative factor of 1±δs1 \pm \delta_s.

Why uniform sparsity matters. The RE condition (§5.4) is keyed to a specific support SS and sign pattern sS\mathbf{s}^*_S — it gives a lower bound on Xδ2/n\|\mathbf{X} \boldsymbol\delta\|^2 / n for δ\boldsymbol\delta in the cone C(S,c0)\mathcal{C}(S, c_0). Different SS give different cones. RIP, by contrast, is a single condition that holds simultaneously for every choice of SS with Ss|S| \le s — uniformly across the combinatorial (ps)\binom{p}{s} possibilities. This makes RIP much stronger than RE for any single SS, but also much harder to verify.

Compressed-sensing origin. The compressed-sensing problem asks: given a known measurement matrix XRn×p\mathbf{X} \in \mathbb{R}^{n \times p} (with npn \ll p) and observed measurements y=Xβ\mathbf{y} = \mathbf{X} \boldsymbol\beta^* of an unknown sparse signal β\boldsymbol\beta^*, recover β\boldsymbol\beta^* exactly. The solution is L1 minimization:

β^=argminββ1subject to Xβ=y.\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta} \|\boldsymbol\beta\|_1 \quad \text{subject to } \mathbf{X} \boldsymbol\beta = \mathbf{y}.

This is the lasso in the limit λ0\lambda \to 0 with the constraint replaced by exact equality. Candès-Tao (2005) showed that if δ2s<210.414\delta_{2s} < \sqrt{2} - 1 \approx 0.414, the L1 minimizer recovers β\boldsymbol\beta^* exactly. The lasso’s noisy-version recovery results (§§5–6) inherit the geometric content of RIP, weakened to the more flexible RE.

When does RIP hold? For a random Gaussian design with iid $\mathcal{N}(0, 1)$ entries, so that $\mathbb{E}[\|\mathbf{X}_j\|_2^2 / n] = 1$ and the $1/n$ scaling in Definition 1 is the natural normalization, RIP of order $s$ with constant $\delta$ holds with high probability provided

nslog(p/s)δ2.n \gtrsim \frac{s \log(p/s)}{\delta^2}.

Same scaling for sub-Gaussian random designs (Baraniuk-Davenport-DeVore-Wakin 2008). For deterministic designs, RIP is generally hard to verify — checking δs<c\delta_s < c requires examining all (ps)\binom{p}{s} sparse subsets, which is computationally intractable.
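
Verifying RIP exactly means scanning all $\binom{p}{s}$ supports, which is hopeless; a Monte Carlo probe over random $s$-sparse directions at least gives a lower bound on $\delta_s$ and a feel for how tightly $\|\mathbf{X}\mathbf{v}\|^2/n$ concentrates around $\|\mathbf{v}\|^2$ (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, s, trials = 200, 500, 10, 2000
X = rng.standard_normal((n, p))                     # iid N(0,1) entries; (1/n)||Xv||^2 ~ ||v||^2

ratios = np.empty(trials)
for t in range(trials):
    v = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)  # one random s-sparse direction
    v[support] = rng.standard_normal(s)
    ratios[t] = np.sum((X @ v) ** 2) / (n * np.sum(v ** 2))

# The true delta_s maximizes |ratio - 1| over ALL C(p, s) supports (intractable);
# this sweep over random supports only yields a lower bound / sanity check.
print("empirical (1/n)||Xv||^2 / ||v||^2 range:",
      ratios.min().round(3), "to", ratios.max().round(3))
print("Monte Carlo lower bound on delta_s:", np.abs(ratios - 1).max().round(3))
```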

§9.2 RIP \Rightarrow RE: the implication chain

If a design satisfies RIP with a small constant, it automatically satisfies RE. The implication is straightforward but worth tracing.

Proposition 1 (RIP implies RE).

If X\mathbf{X} satisfies RIP of order 2s2s with constant δ2s<1/(1+2c0)\delta_{2s} < 1/(1 + 2c_0), then RE(s,c0)\mathrm{RE}(s, c_0) holds with constant κ21δ2s(1+2c0)\kappa^2 \ge 1 - \delta_{2s}(1 + 2c_0).

Proof sketch. Pick any δC(S,c0)\boldsymbol\delta \in \mathcal{C}(S, c_0) with S=s|S| = s. We need Xδ2/nκ2δS2\|\mathbf{X}\boldsymbol\delta\|^2 / n \ge \kappa^2 \|\boldsymbol\delta_S\|^2.

Decompose δSc\boldsymbol\delta_{S^c} by sorting its entries by magnitude into blocks of size ss: let T1T_1 be the top-ss coordinates of δSc\boldsymbol\delta_{S^c} in absolute value, T2T_2 the next ss, and so on. Let T0=ST_0 = S. Then δ=kδTk\boldsymbol\delta = \sum_k \boldsymbol\delta_{T_k} with Tks|T_k| \le s for k0k \ge 0, so each δTk\boldsymbol\delta_{T_k} is ss-sparse and we can apply RIP to it.

The cone condition δSc1c0δS1\|\boldsymbol\delta_{S^c}\|_1 \le c_0 \|\boldsymbol\delta_S\|_1 controls the L1-mass of δ\boldsymbol\delta outside SS, which (by sorting and the standard “top-ss approximation” argument) controls the L2-mass of the tail blocks k2δTk2c0δS2\sum_{k \ge 2} \|\boldsymbol\delta_{T_k}\|_2 \le c_0 \|\boldsymbol\delta_S\|_2.

Then by the triangle inequality applied to Xδ=XδT0+XδT1+k2XδTk\mathbf{X}\boldsymbol\delta = \mathbf{X}\boldsymbol\delta_{T_0} + \mathbf{X}\boldsymbol\delta_{T_1} + \sum_{k \ge 2} \mathbf{X}\boldsymbol\delta_{T_k}, and using RIP on each δTk\boldsymbol\delta_{T_k},

XδnX(δT0+δT1)nk2XδTkn1δ2sδT0T11+δsc0δS.\frac{\|\mathbf{X}\boldsymbol\delta\|}{\sqrt{n}} \ge \frac{\|\mathbf{X}(\boldsymbol\delta_{T_0} + \boldsymbol\delta_{T_1})\|}{\sqrt n} - \sum_{k \ge 2} \frac{\|\mathbf{X}\boldsymbol\delta_{T_k}\|}{\sqrt n} \ge \sqrt{1 - \delta_{2s}} \|\boldsymbol\delta_{T_0 \cup T_1}\| - \sqrt{1 + \delta_s} \cdot c_0 \|\boldsymbol\delta_S\|.

Squaring and simplifying gives RE with κ2=1δ2s(1+2c0)\kappa^2 = 1 - \delta_{2s}(1 + 2c_0). The full proof has more careful constants; van de Geer-Bühlmann (2009) and Wainwright (2019, §11.2) give the polished version. \blacksquare

The reverse implication does not hold. RE is strictly weaker than RIP. Many designs satisfy RE without satisfying RIP — most importantly, random Gaussian designs with non-identity covariance, where the columns are correlated and the uniform isometry property fails. RE for these designs comes from Raskutti-Wainwright-Yu (2010), who avoid going through RIP entirely.

So in practice: if your design is “compressed-sensing-style” (random Gaussian with iid entries, designed for sparse recovery), RIP is the right framework. If your design is “statistics-style” (random with population covariance, real-world features), RE is the right framework. The lasso oracle inequality holds under either, but the proofs and constants differ.

§9.3 Sub-Gaussian designs and concentration

The §5.5 deviation step controlled Xε/n\|\mathbf{X}^\top \boldsymbol\varepsilon / n\|_\infty via sub-Gaussian concentration on the noise ε\boldsymbol\varepsilon. For random designs — when X\mathbf{X} itself has random entries — a parallel concentration result controls how the empirical Gram matrix XX/n\mathbf{X}^\top \mathbf{X} / n approximates the population covariance Σ\boldsymbol\Sigma.

Lemma 1 (Sub-Gaussian design concentration (Vershynin 2018, Theorem 4.6.1)).

Let X\mathbf{X} have iid sub-Gaussian rows with mean zero and population covariance Σ\boldsymbol\Sigma. With probability at least 12exp(cn)1 - 2 \exp(-c n),

XXnΣopC(pn+pn)Σop,\left\| \frac{\mathbf{X}^\top \mathbf{X}}{n} - \boldsymbol\Sigma \right\|_{\text{op}} \le C \left( \sqrt{\frac{p}{n}} + \frac{p}{n} \right) \cdot \|\boldsymbol\Sigma\|_{\text{op}},

where the constants depend on the sub-Gaussian parameter.

Implications. For npn \gg p, the empirical Gram matrix concentrates tightly around Σ\boldsymbol\Sigma, and any spectral property of Σ\boldsymbol\Sigma (eigenvalues, condition number, RE constant) transfers to XX/n\mathbf{X}^\top \mathbf{X}/n with high probability. For npn \asymp p, the concentration is loose (p/n=O(1)\sqrt{p/n} = O(1)), and individual eigenvalues of XX/n\mathbf{X}^\top \mathbf{X}/n can drift substantially from those of Σ\boldsymbol\Sigma — the empirical Gram is rank-deficient when n<pn < p, so the smallest eigenvalue is exactly zero regardless of Σ\boldsymbol\Sigma.

For the lasso, the relevant concentration is the restricted version: rather than controlling the full operator norm, we need infδC(S,c0)Xδ2/(nδS2)κ2\inf_{\boldsymbol\delta \in \mathcal{C}(S, c_0)} \|\mathbf{X}\boldsymbol\delta\|^2 / (n \|\boldsymbol\delta_S\|^2) \ge \kappa^2 to inherit from infδC(S,c0)δΣδ/δS2κ02\inf_{\boldsymbol\delta \in \mathcal{C}(S, c_0)} \boldsymbol\delta^\top \boldsymbol\Sigma \boldsymbol\delta / \|\boldsymbol\delta_S\|^2 \ge \kappa_0^2.

Theorem 1 (RE for sub-Gaussian designs (Raskutti-Wainwright-Yu 2010, simplified)).

Let X\mathbf{X} have iid sub-Gaussian rows with population covariance Σ\boldsymbol\Sigma satisfying λmin(Σ)κ02>0\lambda_{\min}(\boldsymbol\Sigma) \ge \kappa_0^2 > 0. If nCslog(p)n \ge C s \log(p) for a sufficiently large constant CC depending on κ0\kappa_0 and the sub-Gaussian parameter, then with probability 1c1exp(c2n)\ge 1 - c_1 \exp(-c_2 n), the design satisfies RE(s,3)\mathrm{RE}(s, 3) with constant κ2κ02/8\kappa^2 \ge \kappa_0^2 / 8.

So the population condition λmin(Σ)>0\lambda_{\min}(\boldsymbol\Sigma) > 0 — the population covariance is positive definite — combined with the sample-size scaling nCslog(p)n \ge C s \log(p) gives the sample-level RE condition the lasso needs. This is the standard “high-dimensional probability” path from a population-level assumption to a sample-level condition. The proof uses a covering argument on the cone C(S,c0)\mathcal{C}(S, c_0), plus a uniform deviation bound on δXXδ/n\boldsymbol\delta^\top \mathbf{X}^\top \mathbf{X} \boldsymbol\delta / n for δ\boldsymbol\delta in the cone.

For DGP-1 with Σjk=0.5jk\boldsymbol\Sigma_{jk} = 0.5^{|j - k|}, λmin(Σ)(10.5)/(1+0.5)=1/3\lambda_{\min}(\boldsymbol\Sigma) \to (1 - 0.5)/(1 + 0.5) = 1/3 as pp \to \infty, so RE holds with κ21/3/80.04\kappa^2 \approx 1/3 / 8 \approx 0.04 (the constant is loose; the empirical RE is much better). The sample-size requirement nCslog(p)=C10log(500)60Cn \ge C s \log(p) = C \cdot 10 \cdot \log(500) \approx 60 C is satisfied at n=200n = 200 for any reasonable CC.
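
A short check of the $\lambda_{\min}$ claim (the Szegő-type limit of the AR(1) Toeplitz spectrum), assuming only NumPy and SciPy:

```python
import numpy as np
from scipy.linalg import toeplitz

# lambda_min of the AR(1) Toeplitz covariance approaches (1 - rho)/(1 + rho) as p grows.
rho = 0.5
for p in (50, 200, 500, 2000):
    Sigma = toeplitz(rho ** np.arange(p))
    print(p, np.linalg.eigvalsh(Sigma).min().round(4))
print("limit (1 - rho)/(1 + rho) =", (1 - rho) / (1 + rho))
```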

§9.4 When the conditions fail in practice

The lasso’s theoretical guarantees require RE (for prediction) or IC (for support recovery). When these fail, the lasso behaves badly in predictable ways. Three common failure modes:

Highly correlated features. When two features are nearly identical (correlation ρ1\rho \to 1), the irrepresentable condition fails first (the IC quantity grows linearly in ρ/(1ρ)\rho/(1-\rho) for AR(1) Toeplitz; see §6.2). RE degrades more slowly but eventually fails too. The lasso’s behavior under high correlation:

  • Support recovery: lasso may flip arbitrarily between which correlated feature it selects across resamples. This is the §6.1 stability issue.
  • Prediction: surprisingly robust — the lasso’s prediction at x\mathbf{x}^* is approximately the same whether it selected feature jj or feature kk (the two are correlated, so xjxk\mathbf{x}^*_j \approx \mathbf{x}^*_k). The §5 oracle inequality’s prediction-risk bound holds as long as RE holds, even when IC fails.

The remedies: elastic net (§8.2) for coordinated selection of correlated groups; adaptive lasso (§8.3) for sign-consistency under weaker conditions; or post-selection refit (OLS on the lasso-selected support) for unbiased coefficient estimates.

Deterministic / structured designs. Designs with deterministic structure — Vandermonde matrices for polynomial features, Fourier bases, wavelet dictionaries — typically satisfy RE only under specific column-sampling protocols. RIP is even harder to verify deterministically; the gap between provably RIP-bounded designs (a small set of explicit constructions) and the much larger set of designs that work empirically is one of the open problems in compressed sensing.

For practical applications, the lasso is usually applied to designs that are “random-Gaussian-like” enough that RE holds with high probability, even if a formal proof is unavailable. Empirical diagnostics (cross-validated stability of the active set; condition number of XSXS\mathbf{X}_S^\top \mathbf{X}_S on the lasso-selected support) substitute for rigorous condition checking.

Adversarial or pathological designs. Constructions exist where the lasso provably fails: for example, designs where some inactive feature is exactly representable as a sign-coherent combination of active features (IC = 1 exactly), or where two columns are exactly equal. These don’t appear in random data but can arise from data preprocessing — e.g., one-hot encoding of a categorical variable produces columns summing to a constant, which violates the lasso’s general-position uniqueness assumption (§2.3).

The standard preprocessing recipes — drop one level from each one-hot encoding to avoid the dummy-variable trap; check for and remove duplicate columns before fitting — handle these cases. Beyond preprocessing, the practical heuristic is: if two ostensibly different features have correlation >0.95> 0.95, suspect a data-pipeline issue and investigate before fitting.

Diagnostic toolkit. When the lasso behaves unexpectedly (large stability across resamples, support that doesn’t include obvious signal features, prediction MSE much worse than ridge), check the following:

  1. Maximum pairwise feature correlation: if >0.9> 0.9, consider elastic net.
  2. Condition number of the selected XA^XA^\mathbf{X}_{\hat A}^\top \mathbf{X}_{\hat A}: if >100> 100, the active-set Gram is ill-conditioned and the IC is likely violated; consider adaptive lasso or stability selection.
  3. Cross-validated stability of A^λ\hat A_\lambda across resamples: if the active set varies substantially, IC is likely violated; report stability selection probabilities (Meinshausen-Bühlmann 2010) instead of a single point estimate.
  4. Population vs empirical Gram drift: if XX/nΣ^op\|\mathbf{X}^\top \mathbf{X} / n - \hat{\boldsymbol\Sigma}\|_{\text{op}} is large (where Σ^\hat{\boldsymbol\Sigma} is computed on a held-out fold), the sample size may be insufficient for stable RE.

These checks — none of them theoretically rigorous, all of them practically useful — are the day-to-day handle on whether the lasso’s theoretical guarantees apply to your problem.
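
A sketch of diagnostics 1–3 (diagnostic 4 needs a held-out fold and is omitted); the bootstrap stability check reuses the full-data CV-selected λ to keep the loop cheap, which is a simplification rather than Meinshausen-Bühlmann stability selection proper:

```python
import numpy as np
from scipy.linalg import toeplitz
from sklearn.linear_model import LassoCV, Lasso
from sklearn.utils import resample

rng = np.random.default_rng(6)
n, p, s, sigma = 200, 500, 10, 0.5
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(toeplitz(0.5 ** np.arange(p))).T
beta_true = np.zeros(p); beta_true[:s] = 1.0
y = X @ beta_true + sigma * rng.standard_normal(n)

# 1. Maximum off-diagonal feature correlation.
C = np.corrcoef(X, rowvar=False)
print("max |corr|:", np.abs(C - np.eye(p)).max().round(3))

# 2. Condition number of the Gram matrix restricted to the lasso-selected support.
fit = LassoCV(cv=10).fit(X, y)
A = np.flatnonzero(fit.coef_)
print("cond(X_A' X_A):", round(np.linalg.cond(X[:, A].T @ X[:, A]), 1))

# 3. Active-set stability: selection frequency per feature over bootstrap resamples.
B, freq = 50, np.zeros(p)
for b in range(B):
    Xb, yb = resample(X, y, random_state=b)
    freq[np.flatnonzero(Lasso(alpha=fit.alpha_).fit(Xb, yb).coef_)] += 1.0 / B
print("features selected in >= 80% of resamples:", np.flatnonzero(freq >= 0.8))
```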

§10. Post-selection inference and the debiased lasso

This is the inferential payoff section. Up to this point, the lasso has been a prediction tool — point estimates of β\boldsymbol\beta^* that minimize a penalized squared loss, with rate-optimal prediction risk under RE (§5). When practitioners want to do inference — confidence intervals for individual βj\beta^*_j, hypothesis tests of βj=0\beta^*_j = 0 — the natural impulse is to treat the lasso fit like an OLS fit and apply the standard normal-theory machinery: confidence interval β^j±1.96se^(β^j)\hat\beta_j \pm 1.96 \cdot \widehat{\mathrm{se}}(\hat\beta_j).

This naive approach fails dramatically. The lasso’s bias (§4.1) shifts β^j\hat\beta_j away from βj\beta^*_j by λsign(βj)-\lambda \, \mathrm{sign}(\beta^*_j); the naive standard error doesn’t account for the selection step that determined which features are in the model; and the resulting CIs undercover by 20–50 percentage points at typical signal levels. Empirical coverage of nominally 95% naive lasso CIs lands at 50–70%.

The fix is the debiased lasso (Zhang-Zhang 2014; Javanmard-Montanari 2014; van de Geer-Bühlmann-Ritov-Dezeure 2014, three nearly-simultaneous papers): a one-step Newton correction β^db=β^+(1/n)M^X(yXβ^)\hat{\boldsymbol\beta}^{\text{db}} = \hat{\boldsymbol\beta} + (1/n) \hat{\mathbf{M}} \mathbf{X}^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}) that explicitly removes the lasso’s bias and produces n\sqrt n-consistent normal estimates of individual coefficients. The construction works in the pnp \gg n regime where standard OLS doesn’t exist; in the p<np < n regime it reduces to OLS asymptotically. The resulting confidence intervals achieve the nominal coverage.

§10.1 Why naive post-selection CIs undercover: PoSI (Berk et al. 2013)

The standard OLS confidence interval for a regression coefficient assumes the model — which features are in, which are out — is specified before the data are seen. The CI’s coverage guarantee is conditional on the model, and the CI’s standard error formula doesn’t account for any model-selection step.

The lasso violates this assumption in the most fundamental way: the model (the active set A^λ\hat A_\lambda) is selected from the data. Treating the post-lasso refit as if the model had been pre-specified double-counts the data — once for selection, once for inference — and the resulting CIs are too narrow.

Berk-Brown-Buja-Zhang-Zhao (2013) — “Valid Post-Selection Inference” formalized this problem and proposed a (very conservative) solution. Their PoSI confidence intervals provide simultaneous coverage guarantees over all submodels the procedure could have selected: P(βjCIj(M) for all jM, for all submodels M)1α\mathrm{P}(\beta^*_j \in \mathrm{CI}_j(M) \text{ for all } j \in M, \text{ for all submodels } M) \ge 1 - \alpha. The PoSI intervals use a multiplier KPoSIK_{\mathrm{PoSI}} much larger than the standard normal quantile z0.025=1.96z_{0.025} = 1.96 — typically KPoSI[3,5]K_{\mathrm{PoSI}} \in [3, 5] at p=100p = 100, growing roughly as 2logp\sqrt{2 \log p}.

PoSI intervals are valid but extremely wide. They achieve coverage by being so conservative that they’re rarely actionable. The debiased lasso takes a fundamentally different approach: rather than widening the CI to absorb selection uncertainty, it constructs an estimator whose distribution is not selection-dependent in the first place.

Three sources of failure for the naive CI. Concretely, consider the standard “post-selection” pipeline: fit lasso, identify A^λ\hat A_\lambda, refit OLS on XA^\mathbf{X}_{\hat A}, form the CI for βj\beta^*_j (jA^j \in \hat A) using the refit OLS coefficient and standard error. Three things go wrong:

  1. Selection bias on the refit estimator. OLS on a selected support is biased because the support was chosen to make the regression “look good.” On expectation, β^jrefit\hat\beta_j^{\text{refit}} is shifted away from βj\beta^*_j.
  2. Underestimated standard error. The OLS variance formula σ2(XA^XA^)jj1\sigma^2 (\mathbf{X}_{\hat A}^\top \mathbf{X}_{\hat A})^{-1}_{jj} doesn’t account for the variability in A^\hat A. The true variance is larger.
  3. Coverage degenerates for noise coords. For jA^j \notin \hat A, the refit doesn’t include βj\beta_j — the implicit estimate is 0 with no CI. Coverage of βj\beta^*_j for noise coordinates becomes vacuous (always covers 0 trivially) or undefined.

The empirical effect: at standard signal-to-noise ratios, naive 95% CIs cover at 50–70% of replications. The §10.5 numerical experiment quantifies this on DGP-1.

§10.2 The one-step debiased correction (Zhang-Zhang 2014)

The debiased lasso starts from a different question: rather than asking how to make a CI from β^lasso\hat{\boldsymbol\beta}^{\text{lasso}} that accounts for selection, it asks for a new estimator whose asymptotic distribution doesn’t depend on selection.

Recall the lasso KKT condition (§2.4): at the optimum, 1nX(yXβ^)=λg^\frac{1}{n} \mathbf{X}^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}) = \lambda \hat{\mathbf{g}} where g^β^1\hat{\mathbf{g}} \in \partial \|\hat{\boldsymbol\beta}\|_1 is a subgradient. Let Σ^:=XX/n\hat{\boldsymbol\Sigma} := \mathbf{X}^\top \mathbf{X} / n (the empirical Gram matrix). Then

Σ^β^=1nXyλg^.\hat{\boldsymbol\Sigma} \hat{\boldsymbol\beta} = \frac{1}{n} \mathbf{X}^\top \mathbf{y} - \lambda \hat{\mathbf{g}}.

If Σ^\hat{\boldsymbol\Sigma} were invertible, multiplying by Σ^1\hat{\boldsymbol\Sigma}^{-1} would give β^=Σ^1(1nXy)λΣ^1g^\hat{\boldsymbol\beta} = \hat{\boldsymbol\Sigma}^{-1} (\frac{1}{n} \mathbf{X}^\top \mathbf{y}) - \lambda \hat{\boldsymbol\Sigma}^{-1} \hat{\mathbf{g}}. The first term is OLS; the second is the lasso’s bias (the λΣ^1sign\lambda \hat{\boldsymbol\Sigma}^{-1} \mathrm{sign} correction from §4.1). To remove the bias, add it back:

β^+λΣ^1g^=Σ^11nXy.\hat{\boldsymbol\beta} + \lambda \hat{\boldsymbol\Sigma}^{-1} \hat{\mathbf{g}} = \hat{\boldsymbol\Sigma}^{-1} \cdot \frac{1}{n} \mathbf{X}^\top \mathbf{y}.

Substituting the KKT condition λg^=1nX(yXβ^)\lambda \hat{\mathbf{g}} = \frac{1}{n} \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}):

β^+Σ^11nX(yXβ^)=Σ^11nXy.\hat{\boldsymbol\beta} + \hat{\boldsymbol\Sigma}^{-1} \cdot \frac{1}{n} \mathbf{X}^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}) = \hat{\boldsymbol\Sigma}^{-1} \cdot \frac{1}{n} \mathbf{X}^\top \mathbf{y}.

This is the one-step Newton correction: starting from the lasso initial β^\hat{\boldsymbol\beta}, take one step in the direction of the gradient of the OLS loss, scaled by the OLS Hessian inverse Σ^1\hat{\boldsymbol\Sigma}^{-1}. The result is exactly OLS — when Σ^1\hat{\boldsymbol\Sigma}^{-1} exists (i.e., p<np < n).

Definition 1 (Debiased lasso (Zhang-Zhang 2014)).

Given the lasso initial β^\hat{\boldsymbol\beta} and a matrix M^Rp×p\hat{\mathbf{M}} \in \mathbb{R}^{p \times p} approximating Σ^1\hat{\boldsymbol\Sigma}^{-1}, the debiased lasso is

β^db:=β^+1nM^X(yXβ^).\hat{\boldsymbol\beta}^{\text{db}} := \hat{\boldsymbol\beta} + \frac{1}{n} \hat{\mathbf{M}} \mathbf{X}^\top (\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}).

Three observations:

  1. At p<np < n with M^=Σ^1\hat{\mathbf{M}} = \hat{\boldsymbol\Sigma}^{-1}: β^db=Σ^11nXy=β^OLS\hat{\boldsymbol\beta}^{\text{db}} = \hat{\boldsymbol\Sigma}^{-1} \cdot \frac{1}{n} \mathbf{X}^\top \mathbf{y} = \hat{\boldsymbol\beta}^{\text{OLS}}. The debiased lasso is exactly OLS.
  2. At p>np > n: Σ^\hat{\boldsymbol\Sigma} is singular and Σ^1\hat{\boldsymbol\Sigma}^{-1} doesn’t exist. We need to construct M^\hat{\mathbf{M}} that approximates the population Σ1\boldsymbol\Sigma^{-1} (or some “approximate inverse” of Σ^\hat{\boldsymbol\Sigma}). §10.3 covers the nodewise-lasso construction.
  3. The bias-variance decomposition. Substituting y=Xβ+ε\mathbf{y} = \mathbf{X} \boldsymbol\beta^* + \boldsymbol\varepsilon:
β^dbβ=1nM^Xεnormal-asymptotic term+(M^Σ^I)(β^β)remainder.\hat{\boldsymbol\beta}^{\text{db}} - \boldsymbol\beta^* = \underbrace{\frac{1}{n} \hat{\mathbf{M}} \mathbf{X}^\top \boldsymbol\varepsilon}_{\text{normal-asymptotic term}} + \underbrace{(\hat{\mathbf{M}} \hat{\boldsymbol\Sigma} - \mathbf{I})(\hat{\boldsymbol\beta} - \boldsymbol\beta^*)}_{\text{remainder}}.

The first term is a linear functional of the noise, asymptotically normal by the CLT. The remainder is $o_p(1/\sqrt n)$ whenever $\hat{\mathbf{M}} \hat{\boldsymbol\Sigma} \approx \mathbf{I}$ in the appropriate row-wise sense and $\|\hat{\boldsymbol\beta} - \boldsymbol\beta^*\|_1 \lesssim s \sqrt{\log(p)/n}$ (the L1 estimation rate from §5.5), so the asymptotic distribution is governed by the linear functional alone.

This is the one-step Newton (or “one-step correction” in semiparametric statistics) construction: take a biased initial estimator and a single Newton step toward the unbiased solution. The general theory is due to Le Cam (1956) and Bickel (1982); the lasso application is Zhang-Zhang’s contribution.

§10.3 The nodewise lasso for M^\hat{\mathbf{M}} (van de Geer et al. 2014)

In the p>np > n regime, Σ^\hat{\boldsymbol\Sigma} is singular and we need a different construction for M^\hat{\mathbf{M}}. The target is a matrix such that M^Σ^I\hat{\mathbf{M}} \hat{\boldsymbol\Sigma} \approx \mathbf{I} in some appropriate sense — specifically, with row jj of M^\hat{\mathbf{M}} chosen to make M^j,Σ^ej\hat{\mathbf{M}}_{j, \cdot} \hat{\boldsymbol\Sigma} \approx \mathbf{e}_j (the jj-th standard basis vector).

Nodewise lasso (van de Geer et al. 2014, Algorithm 1). For each j=1,,pj = 1, \dots, p:

  1. Regress feature jj on the others. Fit the lasso with response Xj\mathbf{X}_j and features Xj\mathbf{X}_{-j} (all columns except jj):
γ^j:=argminγRp112nXjXjγ22+λjγ1.\hat{\boldsymbol\gamma}_j := \arg\min_{\boldsymbol\gamma \in \mathbb{R}^{p-1}} \frac{1}{2n} \|\mathbf{X}_j - \mathbf{X}_{-j} \boldsymbol\gamma\|_2^2 + \lambda_j \|\boldsymbol\gamma\|_1.
  2. Compute the residual variance:
τ^j2:=1nXjXjγ^j22+λjγ^j1.\hat{\tau}_j^2 := \frac{1}{n} \|\mathbf{X}_j - \mathbf{X}_{-j} \hat{\boldsymbol\gamma}_j\|_2^2 + \lambda_j \|\hat{\boldsymbol\gamma}_j\|_1.
  3. Form the jj-th row of M^\hat{\mathbf{M}}: insert 11 in column jj and γ^j,k-\hat\gamma_{j, k} in column kk for kjk \neq j, then divide by τ^j2\hat\tau_j^2:
M^j,k={1/τ^j2k=j,γ^j,k/τ^j2kj.\hat{\mathbf{M}}_{j, k} = \begin{cases} 1 / \hat\tau_j^2 & k = j, \\ -\hat\gamma_{j, k} / \hat\tau_j^2 & k \neq j. \end{cases}

The construction is motivated by the partition formula for the inverse covariance: under a population Σ\boldsymbol\Sigma, the jj-th row of Σ1\boldsymbol\Sigma^{-1} is (1,γj)/τj2(1, -\boldsymbol\gamma_j^*) / \tau_j^{*2} where γj=Σj,j1Σj,j\boldsymbol\gamma_j^* = \boldsymbol\Sigma_{-j, -j}^{-1} \boldsymbol\Sigma_{-j, j} is the population regression of XjX_j on XjX_{-j} and τj2\tau_j^{*2} is the population residual variance. The nodewise lasso estimates (γj,τj2)(\boldsymbol\gamma_j^*, \tau_j^{*2}) in the high-dim regime by lasso regression rather than OLS.

Computational cost. $p$ lasso fits, each on a problem of size $(n, p-1)$, with $\lambda_j$ tuned via CV. At $p = 500$, $n = 200$ with 10-fold CV for each of the $p$ regressions, expect roughly 30 s per nodewise-lasso construction. Caching $\hat{\mathbf{M}}$ across uses on the same $\mathbf{X}$ amortizes this: the cost is paid once per dataset, not once per coefficient inference.
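
A direct transcription of the three steps into a sketch, run at a small p so the p lasso fits finish in seconds; tuning each λ_j by CV is one reasonable choice rather than a prescription from the paper, and the helper name is ours:

```python
import numpy as np
from scipy.linalg import toeplitz
from sklearn.linear_model import LassoCV

def nodewise_lasso(X, cv=5):
    """Approximate inverse of X'X/n via p lassos: row j is (1, -gamma_j) / tau_j^2."""
    n, p = X.shape
    M = np.zeros((p, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = LassoCV(cv=cv).fit(X[:, others], X[:, j])   # regress feature j on the rest
        gamma_j = fit.coef_
        resid = X[:, j] - X[:, others] @ gamma_j
        tau2_j = resid @ resid / n + fit.alpha_ * np.abs(gamma_j).sum()
        M[j, j] = 1.0 / tau2_j
        M[j, others] = -gamma_j / tau2_j
    return M

rng = np.random.default_rng(7)
n, p = 200, 50                                       # small p so the demo runs quickly
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(toeplitz(0.5 ** np.arange(p))).T
M = nodewise_lasso(X)
Sigma_hat = X.T @ X / n
# Row-wise check of M @ Sigma_hat ~ I, the quantity the debiasing remainder depends on.
print("max entry of |M Sigma_hat - I|:", np.abs(M @ Sigma_hat - np.eye(p)).max().round(3))
```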

Sparsity assumption. The nodewise lasso works because the rows of Σ1\boldsymbol\Sigma^{-1} are assumed sparse — i.e., each feature is well-predicted by a small subset of the other features. This is a real assumption with a real cost when violated; if Σ1\boldsymbol\Sigma^{-1} is dense (every feature depends on many others), the nodewise lasso is misspecified and the resulting M^\hat{\mathbf{M}} doesn’t satisfy M^Σ^I\hat{\mathbf{M}} \hat{\boldsymbol\Sigma} \approx \mathbf{I} closely enough for the one-step correction to work.

Alternative constructions. Javanmard-Montanari (2014) propose a different M^\hat{\mathbf{M}} constructed by solving an optimization problem directly (rather than nodewise lasso), with similar asymptotic guarantees. Belloni-Chernozhukov-Wang (2014) use ridge regression with shrinkage tuned to achieve a particular bias-variance trade-off. The three constructions are asymptotically equivalent at first order; in finite samples they can differ by 10–20% in CI length.

§10.4 The n\sqrt n normal asymptotics

The debiased lasso’s central asymptotic property — the foundation of all CI and hypothesis-testing applications — is:

Theorem 1 (Asymptotic normality of debiased lasso (van de Geer-Bühlmann-Ritov-Dezeure 2014, Theorem 2.2; Javanmard-Montanari 2014, Theorem 2.1)).

Assume:

(i) The design X\mathbf{X} has iid sub-Gaussian rows with population covariance Σ\boldsymbol\Sigma.

(ii) The population precision matrix Σ1\boldsymbol\Sigma^{-1} has rows of bounded sparsity: maxjΣj,10sM\max_j \|\boldsymbol\Sigma^{-1}_{j, \cdot}\|_0 \le s_M.

(iii) The noise ε\boldsymbol\varepsilon is iid sub-Gaussian with variance σ2\sigma^2.

(iv) β\boldsymbol\beta^* is ss-sparse.

(v) Sample-size scaling: nmax(s,sM)(logp)2n \gg \max(s, s_M) \cdot (\log p)^2.

(vi) Lasso initial β^\hat{\boldsymbol\beta} satisfies the §5.5 oracle inequality bound.

(vii) Nodewise-lasso M^\hat{\mathbf{M}} satisfies an analogous oracle bound.

Then for each j{1,,p}j \in \{1, \dots, p\},

n(β^jdbβj)dN(0,σ2M^j,j).\sqrt n \big(\hat\beta_j^{\text{db}} - \beta^*_j\big) \overset{d}{\to} \mathcal{N}\big(0, \, \sigma^2 \hat{\mathbf{M}}_{j, j} \big).

Equivalently, n(β^jdbβj)/σ^db,jdN(0,1)\sqrt n (\hat\beta_j^{\text{db}} - \beta^*_j) / \hat\sigma_{db, j} \overset{d}{\to} \mathcal{N}(0, 1), where the asymptotic standard error simplifies under the nodewise construction to σ^db,j2=σ2M^j,j/n\hat\sigma_{db, j}^2 = \sigma^2 \hat{\mathbf{M}}_{j, j} / n.

Proof sketch. Use the bias-variance decomposition from §10.2:

n(β^dbβ)=1nM^Xε+n(M^Σ^I)(β^β).\sqrt n (\hat{\boldsymbol\beta}^{\text{db}} - \boldsymbol\beta^*) = \frac{1}{\sqrt n} \hat{\mathbf{M}} \mathbf{X}^\top \boldsymbol\varepsilon + \sqrt n (\hat{\mathbf{M}} \hat{\boldsymbol\Sigma} - \mathbf{I})(\hat{\boldsymbol\beta} - \boldsymbol\beta^*).

For the normal-asymptotic term: 1nM^Xε\frac{1}{\sqrt n} \hat{\mathbf{M}} \mathbf{X}^\top \boldsymbol\varepsilon is a linear combination of iid sub-Gaussian noise entries with deterministic (conditional on X\mathbf{X}) coefficients. By the multivariate CLT applied coordinate-by-coordinate, the jj-th entry dN(0,σ2M^j,j)\overset{d}{\to} \mathcal{N}(0, \sigma^2 \hat{\mathbf{M}}_{j, j}) in the limit.

For the remainder term: bound (M^Σ^I)j,λj\|(\hat{\mathbf{M}} \hat{\boldsymbol\Sigma} - \mathbf{I})_{j, \cdot}\|_\infty \le \lambda_j from the nodewise-lasso KKT conditions. Then the remainder is Op(λjslog(p)/n)=Op(slog(p)/n)O_p(\lambda_j \cdot s \sqrt{\log(p)/n}) = O_p(s \log(p) / n) with λjlog(p)/n\lambda_j \asymp \sqrt{\log(p)/n}, which is op(1/n)o_p(1/\sqrt n) provided slog(p)/n0s \log(p) / \sqrt n \to 0 — equivalent to ns2(logp)2n \gg s^2 (\log p)^2. The slightly weaker nsM(logp)2n \gg s_M (\log p)^2 requirement in the theorem comes from a tighter analysis (Javanmard-Montanari 2014). \blacksquare

The (logp)2(\log p)^2 scaling is striking: the debiased lasso requires substantially more samples than the lasso’s prediction-risk bound, which only needed nslogpn \gg s \log p (one logp\log p factor). The extra logp\log p comes from the nodewise-lasso M^\hat{\mathbf{M}} requiring its own oracle-inequality bound. In practice the debiased lasso works at smaller nn than the theory requires; the conditions are sufficient but not always necessary.

Confidence interval. Theorem 1 immediately gives an asymptotically valid (1α)(1 - \alpha) CI for βj\beta^*_j:

CIdb,j1α=[β^jdbzα/2σ^M^j,j/n,  β^jdb+zα/2σ^M^j,j/n],\mathrm{CI}_{db, j}^{1 - \alpha} = \left[ \hat\beta_j^{\text{db}} - z_{\alpha/2} \hat\sigma \sqrt{\hat{\mathbf{M}}_{j, j} / n}, \; \hat\beta_j^{\text{db}} + z_{\alpha/2} \hat\sigma \sqrt{\hat{\mathbf{M}}_{j, j} / n} \right],

where zα/2z_{\alpha/2} is the standard normal quantile and σ^2\hat\sigma^2 is an estimate of the noise variance (the standard estimate from Reid-Tibshirani-Friedman 2016: σ^2=yXβ^2/(nβ^0)\hat\sigma^2 = \|\mathbf{y} - \mathbf{X} \hat{\boldsymbol\beta}\|^2 / (n - \|\hat{\boldsymbol\beta}\|_0)).

Hypothesis test. For H0:βj=0H_0: \beta^*_j = 0, reject at level α\alpha if β^jdb>zα/2σ^M^j,j/n|\hat\beta_j^{\text{db}}| > z_{\alpha/2} \hat\sigma \sqrt{\hat{\mathbf{M}}_{j, j}/n}. This test has correct asymptotic level under the conditions of Theorem 1, regardless of whether the lasso selected coordinate jj.
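
Putting §§10.2–10.4 together in the p < n case, where $\hat{\mathbf{M}} = \hat{\boldsymbol\Sigma}^{-1}$ is available and the correction reduces to OLS, a single-draw sketch of the CI and test recipe (the theory-guided λ is one choice; any reasonable λ works here):

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
n, p, s, sigma = 200, 100, 5, 0.5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:s] = 1.0
y = X @ beta_true + sigma * rng.standard_normal(n)

# Lasso initial at the theory-guided lambda.
lam = 2 * sigma * np.sqrt(2 * np.log(p) / n)
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

# One-step correction with M = Sigma_hat^{-1}, available because p < n.
Sigma_hat = X.T @ X / n
M = np.linalg.inv(Sigma_hat)
beta_db = beta_hat + M @ X.T @ (y - X @ beta_hat) / n

# Plug-in noise variance (Reid-Tibshirani-Friedman style) and 95% CIs.
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - np.count_nonzero(beta_hat))
se = np.sqrt(sigma2_hat * np.diag(M) / n)
z = norm.ppf(0.975)
lo, hi = beta_db - z * se, beta_db + z * se

covered = (lo <= beta_true) & (beta_true <= hi)
print("coverage on this draw, signal coords:", covered[:s].mean(),
      "| noise coords:", covered[s:].mean())
print("reject H0: beta_j = 0 for j =", np.flatnonzero(np.abs(beta_db) > z * se)[:10])
```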

§10.5 Empirical coverage demonstration

We compare three confidence-interval procedures on DGP-1-style data at $(n, p, s) = (200, 100, 5)$, a smaller setting chosen so that OLS remains feasible as a baseline:

  1. OLS CI (gold standard): standard normal-theory CI from OLS coefficients and OLS standard errors.
  2. Naive post-selection CI: refit OLS on the lasso-selected support, use the refit standard error to form the CI.
  3. Debiased lasso CI: Zhang-Zhang one-step correction with M^=Σ^1\hat{\mathbf{M}} = \hat{\boldsymbol\Sigma}^{-1} (the OLS Hessian inverse, available since p<np < n).

For each method and each coordinate jj, we form a nominally 95% CI and check whether it covers the true βj\beta^*_j across B=200B = 200 Monte Carlo replicates. Coverage is reported separately for signal coords (jS={0,,4}j \in S = \{0, \dots, 4\}, βj=1\beta^*_j = 1) and noise coords (j{50,51,,99}j \in \{50, 51, \dots, 99\}, βj=0\beta^*_j = 0).

The expected pattern (Figure 10.1):

  • OLS CI covers near 95% at both signal and noise coords — the gold standard, valid because the model is pre-specified.
  • Naive post-selection CI undercovers at signal coords (~50–70%) due to selection-induced bias and underestimated standard errors. At noise coords the metric is degenerate (the lasso usually doesn’t select them, so the CI is implicitly {0}\{0\} which trivially covers βj=0\beta^*_j = 0).
  • Debiased lasso CI covers near 95% at both signal and noise coords — reduces to OLS at p<np < n with the OLS-Hessian M^\hat{\mathbf{M}}, recovers the gold-standard coverage despite starting from the biased lasso initial.

The headline takeaway: the debiased lasso recovers OLS-quality inference from a biased lasso initial. At p>np > n where OLS is unavailable, the debiased lasso (with nodewise M^\hat{\mathbf{M}}) is the only valid CI procedure of the three.

Caveats on the demo. Two simplifications relative to the general debiased-lasso theory:

  1. M^=Σ^1\hat{\mathbf{M}} = \hat{\boldsymbol\Sigma}^{-1} at p<np < n. This makes the debiased lasso reduce to OLS exactly, by the §10.2 calculation. The demo is honest in that all three procedures target the same βj\beta^*_j, but it doesn’t exercise the nodewise-lasso construction (§10.3) that’s needed at p>np > n. The §10.3 algorithm has been demonstrated separately in the notebook on a single sample, showing that nodewise M^\hat{\mathbf{M}} also satisfies M^Σ^I\hat{\mathbf{M}} \hat{\boldsymbol\Sigma} \approx \mathbf{I}.
  2. Single λ\lambda choice. We use the theory-guided λ=2σ2log(p)/n\lambda = 2\sigma\sqrt{2\log(p)/n} rather than CV-tuned. CV-tuned would give qualitatively similar coverage with marginally better point estimates.

The core lesson — that naive post-selection inference fails and the one-step correction fixes it — is robust to both simplifications. In production code, the hdi R package (Dezeure-Bühlmann-Meier-Meinshausen 2015) and desparsified-lasso Python package implement the full nodewise-lasso pipeline.

Empirical 95% CI coverage rates for OLS, naive post-selection, and debiased lasso at three signal strengths (strong, borderline, noise).
Empirical 95% CI coverage rates (B = 200 MC replicates) on DGP-1-style data at n = 200, p = 100, s = 5. Three coord types: strong (β* = 1.0), borderline (β* = 0.15), noise (β* = 0).
| Method | Strong (β* = 1.0) | Borderline (β* = 0.15) | Noise (β* = 0) |
|---|---|---|---|
| OLS (gold standard) | 95.0% | 94.3% | 95.1% |
| Naive post-selection | 94.5% | 24.0% | 100.0% |
| Debiased lasso | 98.2% | 98.8% | 98.4% |

The headline is the 24.0% entry: at borderline-strength signals (β* = 0.15), naive post-selection CIs undercover by ~70 percentage points. Strong signals (β* = 1.0) are always selected and stably estimated, so the naive procedure happens to cover near nominal; noise coords are typically unselected, making the naive CI degenerate at {0} which trivially covers β* = 0. The OLS baseline (available here because p < n) and the debiased lasso both recover ~95% coverage uniformly across coord types. At p > n where OLS is infeasible, the debiased lasso with nodewise-M̂ is the only valid CI procedure of the three.

§11. Generalized lasso for non-Gaussian responses

Everything in §§1–10 was developed for the Gaussian linear model: y=Xβ+ε\mathbf{y} = \mathbf{X} \boldsymbol\beta^* + \boldsymbol\varepsilon with ε\boldsymbol\varepsilon iid N(0,σ2I)\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}), squared-error loss, soft-thresholding KKT solution. The lasso framework extends naturally to generalized linear models (GLMs) — replacing the squared-error loss with the negative log-likelihood of any GLM family — and most of the topic’s results carry over. This section covers two specific extensions (logistic, Poisson), the general IRLS-with-soft-thresholding solver, and the GLM-debiased-lasso construction for inference.

The math is mostly mechanical: substitute the GLM log-likelihood for the Gaussian one, re-derive the KKT conditions with the new gradient/Hessian, run a similar oracle-inequality argument with sub-Gaussian-replaced-by-sub-exponential noise tail bounds. The conclusions are qualitatively the same — sparsity-adaptive prediction risk under RE, sign-consistency under IC, debiased-lasso CIs at the same n\sqrt n rate. We sketch the changes without re-deriving the proofs.

The implementation story is similarly straightforward: scikit-learn’s LogisticRegression(penalty='l1') for binary responses; for count responses, scikit-learn’s PoissonRegressor (in sklearn.linear_model, L2-penalized only) or glmnet for the L1-penalized fit. The underlying solvers are coordinate-descent-with-IRLS or proximal-gradient variants of the Gaussian solvers from §3.

§11.1 Logistic lasso for binary classification

For binary response yi{0,1}y_i \in \{0, 1\} with P(yi=1xi)=σ(xiβ)\mathbb{P}(y_i = 1 \mid \mathbf{x}_i) = \sigma(\mathbf{x}_i^\top \boldsymbol\beta) where σ(z)=1/(1+ez)\sigma(z) = 1/(1 + e^{-z}) is the logistic function, the negative log-likelihood per observation is

(yi,xiβ)=yixiβ+log(1+exp(xiβ)).\ell(y_i, \mathbf{x}_i^\top \boldsymbol\beta) = -y_i \, \mathbf{x}_i^\top \boldsymbol\beta + \log(1 + \exp(\mathbf{x}_i^\top \boldsymbol\beta)).

The logistic lasso is the L1-penalized negative log-likelihood:

β^logistic-lasso(λ)=argminβRp1ni=1n(yi,xiβ)+λβ1.\hat{\boldsymbol\beta}^{\text{logistic-lasso}}(\lambda) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell(y_i, \mathbf{x}_i^\top \boldsymbol\beta) + \lambda \|\boldsymbol\beta\|_1.

The objective is convex (the negative log-likelihood is convex; sum of convex; L1 is convex). Two structural differences from the Gaussian case:

No closed-form even on orthogonal designs. The logistic loss is non-quadratic, so the orthogonal-design trick from §3.1 — which reduced the Gaussian lasso to pp univariate problems with closed-form solutions — fails. Each coordinate update involves a non-trivial root-finding problem.

The Hessian is data-dependent. For the Gaussian case, 2=XX/n\nabla^2 \ell = \mathbf{X}^\top \mathbf{X} / n — independent of β\boldsymbol\beta and of y\mathbf{y}. For logistic, 2=XW(β)X/n\nabla^2 \ell = \mathbf{X}^\top \mathbf{W}(\boldsymbol\beta) \mathbf{X} / n where W(β)=diag(σ(xiβ)(1σ(xiβ)))\mathbf{W}(\boldsymbol\beta) = \text{diag}(\sigma(\mathbf{x}_i^\top \boldsymbol\beta)(1 - \sigma(\mathbf{x}_i^\top \boldsymbol\beta))) — a β\boldsymbol\beta-dependent diagonal weighting. This makes the conditioning of the Hessian state-dependent: when many predicted probabilities are near 0 or 1 (extreme classes), W\mathbf{W} has small entries and the Hessian is poorly conditioned.

Solver: cyclic coordinate descent with IRLS quadratic approximations. At each outer iteration, form a quadratic approximation of the logistic loss around the current β(t)\boldsymbol\beta^{(t)}:

(y,Xβ)(y,Xβ(t))+(ββ(t))+12(ββ(t))2(ββ(t)),\ell(\mathbf{y}, \mathbf{X} \boldsymbol\beta) \approx \ell(\mathbf{y}, \mathbf{X} \boldsymbol\beta^{(t)}) + \nabla \ell^\top (\boldsymbol\beta - \boldsymbol\beta^{(t)}) + \frac{1}{2} (\boldsymbol\beta - \boldsymbol\beta^{(t)})^\top \nabla^2 \ell \, (\boldsymbol\beta - \boldsymbol\beta^{(t)}),

which is a weighted Gaussian lasso in β\boldsymbol\beta. Apply Gaussian-lasso coordinate descent (§3.2) to this subproblem; the solution becomes β(t+1)\boldsymbol\beta^{(t+1)}. Iterate until convergence. This is the iteratively reweighted least squares (IRLS) framework, the same iteration used for unpenalized GLM fitting (cf. McCullagh-Nelder 1989) with a soft-thresholding modification at each coordinate update.

scikit-learn’s LogisticRegression(penalty='l1', solver='saga') uses a stochastic-average-gradient variant of this scheme; solver='liblinear' uses the dual coordinate descent of Yuan-Ho-Lin (2012), faster on small problems. glmnet’s logistic variant uses outer-IRLS-plus-inner-Gaussian-coordinate-descent and is the practical reference implementation.
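
A minimal usage sketch; the mapping C = 1/(nλ) between scikit-learn's `C` and the per-observation penalty λ above is our reading of the library's objective (C multiplies the unnormalized log-likelihood) and should be treated as approximate bookkeeping:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n, p, s = 400, 200, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:s] = 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(int)

lam = 0.1                                            # per-observation L1 penalty, as in the text
clf = LogisticRegression(penalty="l1", solver="saga", C=1.0 / (n * lam),
                         fit_intercept=False, max_iter=5000).fit(X, y)
print("|active set|:", np.count_nonzero(clf.coef_), "of", p)
print("true positives among selected:", np.count_nonzero(clf.coef_[0, :s]))
```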

Theory carries over. The §5 oracle inequality has a logistic analog: under restricted-eigenvalue (with the design matrix replaced by W1/2(β)X\mathbf{W}^{1/2}(\boldsymbol\beta) \mathbf{X} to account for the GLM weighting) and the lasso initial β^logistic-lasso\hat{\boldsymbol\beta}^{\text{logistic-lasso}}, the prediction risk in deviance terms is bounded by Cσ2slog(p)/nC \sigma^2 s \log(p) / n for an appropriately defined σ\sigma that depends on the GLM family (van de Geer 2008, Bühlmann-van de Geer 2011 §6). The constants are messier than the Gaussian case but the rate is identical.

§11.2 Poisson lasso for count data

For count response yi{0,1,2,}y_i \in \{0, 1, 2, \dots\} with yixiPoisson(exp(xiβ))y_i \mid \mathbf{x}_i \sim \mathrm{Poisson}(\exp(\mathbf{x}_i^\top \boldsymbol\beta)), the negative log-likelihood is

(yi,xiβ)=yixiβ+exp(xiβ)+log(yi!),\ell(y_i, \mathbf{x}_i^\top \boldsymbol\beta) = -y_i \, \mathbf{x}_i^\top \boldsymbol\beta + \exp(\mathbf{x}_i^\top \boldsymbol\beta) + \log(y_i!),

with the constant log(yi!)\log(y_i!) irrelevant for optimization. The Poisson lasso has the same form as the logistic case:

β^Poisson-lasso(λ)=argminβRp1ni=1n(yi,xiβ)+λβ1.\hat{\boldsymbol\beta}^{\text{Poisson-lasso}}(\lambda) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell(y_i, \mathbf{x}_i^\top \boldsymbol\beta) + \lambda \|\boldsymbol\beta\|_1.

Convex objective, same IRLS-with-coordinate-descent solver pattern, no closed form. The Hessian weighting is W(β)=diag(exp(xiβ))\mathbf{W}(\boldsymbol\beta) = \text{diag}(\exp(\mathbf{x}_i^\top \boldsymbol\beta)) — the predicted Poisson means.

Useful applications. Genome-wide association studies with rare-variant counts, web traffic prediction, text n-gram occurrence counts, insurance claim counts. The Poisson lasso is the natural sparse-regression tool for any rate-modeling problem.

Caveat: overdispersion. Real count data often has variance exceeding the Poisson assumption ($\mathbb{V}\mathrm{ar}(y_i) > \mathbb{E}[y_i]$). The negative-binomial lasso handles this by adding an overdispersion parameter; in Python, statsmodels’ NegativeBinomial model with its fit_regularized() method is a common implementation, though penalized negative-binomial support is less standardized than the Gaussian, logistic, and Poisson cases.

scikit-learn’s PoissonRegressor fits Poisson regression with an L2 penalty (its alpha parameter), not L1; for the L1-penalized version, glmnet (via glmnetpy Python bindings) is the reference. A pure-NumPy implementation is straightforward via the IRLS framework (or, as sketched below, via proximal gradient) but rarely needed in practice.
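
As an illustration of that last point, here is a minimal proximal-gradient (ISTA-style, §3.3) sketch for the Poisson lasso. It is a sketch under the usual assumptions (standardized X, no intercept), with a crude backtracking line search because the Poisson loss has no global Lipschitz constant; it is not a reference implementation.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def poisson_loss(X, y, beta):
    # (1/n) sum_i [ exp(x_i' beta) - y_i x_i' beta ]  (dropping the log(y_i!) constant)
    eta = X @ beta
    return np.mean(np.exp(eta) - y * eta)

def poisson_lasso_ista(X, y, lam, n_iter=500, t0=1.0, tol=1e-8):
    """Proximal gradient for the Poisson lasso with backtracking (illustrative sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (np.exp(X @ beta) - y) / n       # gradient of the smooth part
        t = t0
        for _ in range(60):                           # backtracking line search
            beta_new = soft_threshold(beta - t * grad, t * lam)
            diff = beta_new - beta
            if poisson_loss(X, y, beta_new) <= (poisson_loss(X, y, beta)
                                                + grad @ diff + diff @ diff / (2 * t)):
                break
            t *= 0.5
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```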

§11.3 Inference extensions for GLMs

The §10 debiased-lasso construction generalizes from squared-error loss to any GLM negative log-likelihood. The key change: the matrix M^\hat{\mathbf{M}} now approximates the inverse of the GLM Hessian 2(β^)=XW(β^)X/n\nabla^2 \ell(\hat{\boldsymbol\beta}) = \mathbf{X}^\top \mathbf{W}(\hat{\boldsymbol\beta}) \mathbf{X} / n, not the unweighted Gram XX/n\mathbf{X}^\top \mathbf{X} / n.

The GLM-debiased lasso. Given a lasso initial β^GLM-lasso\hat{\boldsymbol\beta}^{\text{GLM-lasso}} and a matrix M^\hat{\mathbf{M}} approximating the inverse GLM Hessian:

β^GLM-db:=β^GLM-lasso+1nM^X(yg1(Xβ^GLM-lasso)),\hat{\boldsymbol\beta}^{\text{GLM-db}} := \hat{\boldsymbol\beta}^{\text{GLM-lasso}} + \frac{1}{n} \hat{\mathbf{M}} \mathbf{X}^\top \big(\mathbf{y} - g^{-1}(\mathbf{X} \hat{\boldsymbol\beta}^{\text{GLM-lasso}})\big),

where g1g^{-1} is the GLM inverse-link function (σ\sigma for logistic, exp\exp for Poisson). The construction has the same one-step Newton interpretation as the Gaussian case: starting from the biased lasso, take one Newton step toward the unbiased solution.

Nodewise lasso for M^\hat{\mathbf{M}} in GLMs. The GLM-Hessian-inverse M^\hat{\mathbf{M}} is constructed by a weighted nodewise lasso: regress each feature Xj\mathbf{X}_j on Xj\mathbf{X}_{-j} with sample weights wi=(W(β^))iiw_i = (\mathbf{W}(\hat{\boldsymbol\beta}))_{ii} — the same weights that appear in the GLM Hessian. The resulting M^\hat{\mathbf{M}} approximates (XW(β^)X/n)1(\mathbf{X}^\top \mathbf{W}(\hat{\boldsymbol\beta}) \mathbf{X} / n)^{-1} in the same row-by-row sense as in §10.3.
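
A schematic of the two pieces for the logistic case (the weighted nodewise regression for row $j$ of $\hat{\mathbf{M}}$, then the one-step correction displayed above) is sketched below. The $\lambda$ values, the plug-in standard error, and the function names are illustrative assumptions, not the hdi package's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

def debiased_logistic_coef(X, y, j, lam, lam_node):
    """One coordinate of a GLM-debiased lasso for logistic regression (sketch only).
    Assumptions: lambdas supplied by the caller, no intercept, plug-in variance
    m_j' (X' W X / n) m_j  -- a heuristic choice, not the paper's exact formula."""
    n, p = X.shape
    # 1. initial logistic lasso (sklearn's C = 1 / (n * lam) parameterization)
    fit = LogisticRegression(penalty="l1", solver="saga", C=1.0 / (n * lam),
                             fit_intercept=False, max_iter=5000).fit(X, y)
    beta = fit.coef_.ravel()
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    w = mu * (1.0 - mu)                                # GLM weights at beta_hat
    # 2. weighted nodewise lasso: regress sqrt(w)*X_j on sqrt(w)*X_{-j}
    Xw = X * np.sqrt(w)[:, None]
    mask = np.arange(p) != j
    node = Lasso(alpha=lam_node, fit_intercept=False).fit(Xw[:, mask], Xw[:, j])
    gamma = node.coef_
    resid = Xw[:, j] - Xw[:, mask] @ gamma
    tau2 = resid @ Xw[:, j] / n                        # scale factor tau_j^2
    m_j = np.zeros(p)
    m_j[j] = 1.0
    m_j[mask] = -gamma
    m_j = m_j / tau2                                   # row j of M_hat
    # 3. one-step correction and a plug-in standard error
    beta_db_j = beta[j] + m_j @ (X.T @ (y - mu)) / n
    var_j = m_j @ (X.T @ (w[:, None] * X) @ m_j) / n   # heuristic sandwich plug-in
    return beta_db_j, np.sqrt(var_j / n)
```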

Asymptotic normality (van de Geer-Bühlmann-Ritov-Dezeure 2014, Theorem 3.1). Under analogous conditions to Theorem 1 of §10.4 — sub-Gaussian design, sparse population GLM-Hessian inverse, ss-sparse β\boldsymbol\beta^*, sample-size scaling nmax(s,sM)(logp)2n \gg \max(s, s_M) (\log p)^2 — the GLM-debiased lasso satisfies

n(β^jGLM-dbβj)dN(0,varj),\sqrt n (\hat\beta_j^{\text{GLM-db}} - \beta^*_j) \overset{d}{\to} \mathcal{N}\big(0, \, \mathrm{var}_j\big),

with varj\mathrm{var}_j given explicitly in terms of the GLM Hessian and the noise structure. Confidence intervals and hypothesis tests follow the same recipe as §10.4.

Practical note. The GLM-debiased lasso is implemented in the hdi R package (Dezeure-Bühlmann-Meier-Meinshausen 2015) for logistic and Poisson families. Python implementations are less mature; desparsified-lasso covers the Gaussian case but not the GLM extension as of this writing. In applications, a common pattern is to fit the GLM lasso in Python (e.g., LogisticRegression(penalty='l1')) for prediction and selection, and to run the debiased confidence-interval step through R’s hdi package.

The bigger picture: lasso as a unified penalized-likelihood framework. The lasso’s L1 penalty is a regularizer, not a likelihood-specific construction. It composes with any convex log-likelihood — Gaussian (squared error), Bernoulli (logistic), Poisson (count), Cox (survival), multinomial (multiclass) — to give a family of sparse penalized estimators. Each instance has the same structural flavor (sparsity, bias, IC for selection, RE for prediction, debiased inference) with family-specific computational and statistical details. Friedman-Hastie-Tibshirani (2010, “Regularization Paths for Generalized Linear Models via Coordinate Descent”) is the canonical computational reference; Bühlmann-van de Geer (2011, Statistics for High-Dimensional Data) is the canonical statistical reference covering all families uniformly.

The forward pointer from §11 to the rest of formalML: the GLM-lasso framework is the foundation for causal-inference-methods (T6, planned) — propensity-score and outcome-regression nuisance models with binary or count outcomes use the logistic and Poisson lassos as the natural sparse estimators. The §10 debiased-lasso machinery extends to those nuisance models, giving valid CIs for treatment effects under the double/debiased ML framework (Chernozhukov et al. 2018, covered in §12.1 below).

§12. Connections, applications, and limits

This is the closing section. The lasso is one of the most widely-used algorithms in modern statistics — it shows up as a standalone estimator, as a nuisance estimator inside larger inferential pipelines, and as the conceptual ancestor of an entire family of penalized methods (group lasso, fused lasso, generalized lasso, structured-sparsity penalties). The §§5–10 results we’ve developed for the standard L1-penalized lasso transfer with appropriate modifications to most of these descendants.

This section traces five sets of connections. §12.1 covers the lasso’s role inside double/debiased ML — the dominant modern framework for inference on causal and structural parameters with high-dim nuisance. §12.2 zooms in on causal inference with high-dim confounders, the most consequential application of DML in practice. §12.3 points at the Bayesian counterpart, where horseshoe and spike-and-slab priors achieve approximate sparsity through a different mechanism. §12.4 catalogs where the lasso breaks down in practice — the regimes where the §5–10 theory either fails or requires substantial adaptation. §12.5 lists the forward pointers in formalML — the topics that build on this one.

§12.1 Double / debiased ML (Chernozhukov et al. 2018)

The §10 debiased lasso targets inference on individual coefficients of a high-dim linear regression. Double / debiased machine learning (DML; Chernozhukov-Chetverikov-Demirer-Duflo-Hansen-Newey-Robins 2018) generalizes this to inference on a low-dim target parameter θ\theta in a model where the nuisance is high-dimensional and estimable by any sufficiently regular ML method — the lasso, random forests, neural networks, gradient boosting. The framework is the modern foundation for treatment-effect inference in observational studies, instrumental-variable regression with many instruments, structural-econometrics estimation, and any other setting where a target parameter is identified by a moment condition involving a high-dim nuisance.

The setup. Suppose the target θ0R\theta_0 \in \mathbb{R} is identified by a moment condition

E[ψ(W;θ0,η0)]=0,\mathbb{E}[\psi(W; \theta_0, \eta_0)] = 0,

where $W$ is the observed data, $\psi$ is a known score function, and $\eta_0$ is a high-dimensional nuisance function. The naive plug-in estimator $\hat\theta$ solves $\frac{1}{n} \sum_i \psi(W_i; \hat\theta, \hat\eta) = 0$ where $\hat\eta$ is some ML estimate of $\eta_0$. This is biased: the regularization in $\hat\eta$ propagates into $\hat\theta$, and the resulting estimate carries a bias of order $n^{-a}$ for some $a < 1/2$, which vanishes more slowly than the $n^{-1/2}$ sampling error, so confidence intervals based on the naive plug-in undercover.

The two ingredients. DML fixes the naive plug-in via two ingredients used together:

  1. Neyman orthogonality. The score ψ\psi must satisfy ηE[ψ(W;θ0,η0)]=0\partial_\eta \mathbb{E}[\psi(W; \theta_0, \eta_0)] = 0 at the true (θ0,η0)(\theta_0, \eta_0) — the moment condition is insensitive to first-order errors in the nuisance. Constructing such an orthogonal score from an arbitrary moment condition is a calculation: typically subtract off the projection onto the nuisance-tangent space (the influence-function correction).

  2. Cross-fitting. Split the data into $K$ folds. For each $i$ in fold $k$, use a nuisance estimate $\hat\eta^{(-k)}$ trained on the data minus fold $k$. This decouples $\hat\eta$ from the residuals entering the moment condition, removing the own-observation overfitting bias that an in-sample plug-in would otherwise contribute at the $\sqrt n$ scale. Cross-fitting is essentially K-fold CV applied to estimation rather than to model selection.

Theorem 1 (DML asymptotic normality (Chernozhukov et al. 2018, simplified)).

Under Neyman orthogonality, cross-fitting, and standard regularity conditions on η^\hat\eta (specifically η^(k)η0L2=op(n1/4)\|\hat\eta^{(-k)} - \eta_0\|_{L^2} = o_p(n^{-1/4})), the DML estimator θ^DML\hat\theta^{\mathrm{DML}} satisfies n(θ^DMLθ0)dN(0,var)\sqrt n (\hat\theta^{\mathrm{DML}} - \theta_0) \overset{d}{\to} \mathcal{N}(0, \mathrm{var}), with var\mathrm{var} given by the influence function and consistently estimable.

The lasso’s role. The DML framework is silent on which ML method estimates $\hat\eta$: any method achieving the $o_p(n^{-1/4})$ rate works. The lasso is the standard choice when $\eta_0$ is a sparse high-dim regression, both because it has the right rate (the §5 oracle inequality gives $\|\hat{\boldsymbol\beta}^{\text{lasso}} - \boldsymbol\beta^*\|_2 = O_p(\sqrt{s\log(p)/n})$, which is $o_p(n^{-1/4})$ whenever $s \log p = o(\sqrt n)$) and because it’s computationally cheap. Random forests, boosting, and neural networks are alternatives at different parts of the bias-variance spectrum.
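
A minimal cross-fitted sketch for the partially linear model $Y = D\theta_0 + g_0(X) + \varepsilon$, $D = m_0(X) + v$, with lasso nuisances and the partialling-out score. The fold count and the LassoCV-based $\lambda$ selection are illustrative choices, not prescriptions from Chernozhukov et al.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def dml_plr(X, d, y, n_folds=5, seed=0):
    """Cross-fitted partialling-out DML for the partially linear model (sketch)."""
    res_y = np.zeros(len(y))
    res_d = np.zeros(len(y))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # cross-fitted nuisances: E[Y | X] and E[D | X], each by lasso on the other folds
        res_y[test] = y[test] - LassoCV(cv=5).fit(X[train], y[train]).predict(X[test])
        res_d[test] = d[test] - LassoCV(cv=5).fit(X[train], d[train]).predict(X[test])
    theta = (res_d @ res_y) / (res_d @ res_d)          # final residual-on-residual OLS
    psi = res_d * (res_y - res_d * theta)              # influence-function values
    se = np.sqrt(np.mean(psi ** 2) / np.mean(res_d ** 2) ** 2 / len(y))
    return theta, se
```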

Connection to debiased lasso (§10). The §10 debiased lasso is, in a sense, “DML for the special case of inference on an individual regression coefficient.” The one-step Newton correction can be re-derived as the application of the DML influence-function machinery to the moment condition E[Xj(YXβ0)]=0\mathbb{E}[X_j(Y - X^\top \beta_0)] = 0. Both rely on Neyman-orthogonal scores; both achieve n\sqrt n asymptotic normality despite a regularized nuisance estimator.

§12.2 High-dim confounder adjustment in causal inference

The most consequential DML application is causal inference with many confounders. Setup: observational data {(Di,Xi,Yi)}i=1n\{(D_i, X_i, Y_i)\}_{i=1}^n where Di{0,1}D_i \in \{0, 1\} is a binary treatment, XiRpX_i \in \mathbb{R}^p is a vector of confounders, and YiRY_i \in \mathbb{R} is the outcome. Goal: estimate the average treatment effect τ=E[Yi(1)Yi(0)]\tau = \mathbb{E}[Y_i(1) - Y_i(0)], where Yi(d)Y_i(d) is the potential outcome under treatment dd (unobserved counterfactual under the unrealized treatment).

Identification under standard assumptions (unconfoundedness Y(d)DXY(d) \perp D \mid X, overlap 0<P(D=1X)<10 < \mathbb{P}(D = 1 \mid X) < 1) gives

τ=E[μ(X,1)μ(X,0)+Dπ(X)π(X)(1π(X))(Yμ(X,D))],\tau = \mathbb{E}\left[\mu(X, 1) - \mu(X, 0) + \frac{D - \pi(X)}{\pi(X)(1 - \pi(X))} (Y - \mu(X, D))\right],

the augmented inverse-propensity-weighting (AIPW) representation, where π(x):=P(D=1X=x)\pi(x) := \mathbb{P}(D = 1 \mid X = x) is the propensity score and μ(x,d):=E[YX=x,D=d]\mu(x, d) := \mathbb{E}[Y \mid X = x, D = d] is the outcome regression. Both nuisance functions are high-dim regression problems; both can be estimated by lasso (logistic for π\pi, Gaussian for μ\mu).

The AIPW representation is doubly robust (Robins-Rotnitzky-Zhao 1994): the moment condition is satisfied if either $\hat\pi$ or $\hat\mu$ is consistent (not necessarily both). DML strengthens this: under $o_p(n^{-1/4})$ rates for both $\hat\pi$ and $\hat\mu$ (achievable by lasso under sparsity), the cross-fit DML estimator $\hat\tau^{\mathrm{DML}}$ is $\sqrt n$-consistent and asymptotically normal with the semiparametrically efficient variance, even though neither nuisance is estimated at a parametric rate; the bias depends only on the product of the two nuisance errors.

The practical pipeline. (i) Fit π^\hat\pi via LogisticRegression(penalty='l1') in cross-fitted form. (ii) Fit μ^(,0)\hat\mu(\cdot, 0) and μ^(,1)\hat\mu(\cdot, 1) via Lasso in cross-fitted form (separate models for treated and control). (iii) Plug into the AIPW formula. (iv) Compute the standard error from the empirical influence function. (v) Form CI / hypothesis test.
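
Steps (i)–(v) in plain scikit-learn, as a hedged sketch: the KFold cross-fitting, propensity clipping, and CV-based penalty choices below are illustrative defaults rather than part of any canonical specification.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.model_selection import KFold

def aipw_ate(X, d, y, n_folds=5, clip=0.01, seed=0):
    """Cross-fitted AIPW / DML estimate of the ATE with lasso nuisances (sketch)."""
    n = len(y)
    pi = np.zeros(n); mu0 = np.zeros(n); mu1 = np.zeros(n)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # (i) propensity score via cross-fitted L1 logistic regression
        ps = LogisticRegressionCV(penalty="l1", solver="saga", max_iter=5000)
        ps.fit(X[train], d[train])
        pi[test] = np.clip(ps.predict_proba(X[test])[:, 1], clip, 1 - clip)
        # (ii) outcome regressions via cross-fitted lasso, one model per arm
        tr0, tr1 = train[d[train] == 0], train[d[train] == 1]
        mu0[test] = LassoCV(cv=5).fit(X[tr0], y[tr0]).predict(X[test])
        mu1[test] = LassoCV(cv=5).fit(X[tr1], y[tr1]).predict(X[test])
    # (iii) AIPW pseudo-outcomes, (iv) influence-function SE, (v) 95% CI
    psi = mu1 - mu0 + d * (y - mu1) / pi - (1 - d) * (y - mu0) / (1 - pi)
    tau = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(n)
    return tau, (tau - 1.96 * se, tau + 1.96 * se)
```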

This pipeline is implemented in Python in econml (Microsoft Research) and doubleml (Bach et al.), and in R in the DoubleML package. It’s the standard observational-causal-inference workflow for moderate-to-high-dim confounder vectors. Forward pointer: T6 causal-inference-methods (coming soon) will cover this in detail, including extensions to dynamic treatments, instrumental variables, and partial identification.

§12.3 Sparse-Bayesian alternatives

The lasso’s L1 penalty has a Bayesian interpretation: it’s the negative log-prior of an iid Laplace prior βjLaplace(0,1/λ)\beta_j \sim \mathrm{Laplace}(0, 1/\lambda), so the lasso estimator is the posterior mode (MAP) under that prior. The posterior mean under the same prior is dense — Bayesian inference based on the Laplace prior doesn’t produce sparse point estimates, only sparse modes.

The Bayesian counterpart of the lasso is a class of priors that produce approximate sparsity in the posterior — most posterior mass concentrated near zero, with heavy tails for the active coefficients. Two main families, both detailed in Sparse Bayesian Priors:

  • Horseshoe (Carvalho-Polson-Scott 2010). $\beta_j \sim \mathcal{N}(0, \lambda_j^2 \tau^2)$ with $\lambda_j \sim C^+(0, 1)$ (half-Cauchy local scales) and $\tau \sim C^+(0, 1)$ (global scale). The induced marginal prior on $\beta_j$ has an infinite spike at zero (aggressive shrinkage of small signals) and polynomial tails (essentially no shrinkage of large signals), qualitatively the inverse of the lasso’s constant-shrinkage behavior, so the active-coefficient bias from §4.1 is avoided.
  • Spike-and-slab (Mitchell-Beauchamp 1988; George-McCulloch 1993). βj(1w)δ0+wN(0,σ2)\beta_j \sim (1 - w) \cdot \delta_0 + w \cdot \mathcal{N}(0, \sigma^2) — a discrete mixture of a point mass at zero (spike) and a wide Gaussian (slab). Posterior puts probability w^j\hat w_j on βj0\beta_j \neq 0, giving a natural variable-selection probability.

Trade-offs. Bayesian methods give native uncertainty quantification — credible intervals come for free from the posterior, no debiased-lasso construction needed. The cost: posterior sampling (HMC, NUTS) is computationally expensive at scale, typically 100×–1000× slower than the lasso for the same problem. The lasso wins on speed and the Bayesian methods win on inferential clarity; the practical choice depends on whether n,pn, p scales make MCMC tractable. Forward pointer: Sparse Bayesian Priors covers the full Bayesian sparsity framework — horseshoe, spike-and-slab, R2-D2, regularized horseshoe, and their computational implementations in PyMC and NumPyro.
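
For concreteness, a compact NumPyro sketch of the horseshoe regression model from the first bullet above. This is the direct (centered) parameterization for readability; in practice a non-centered parameterization or the regularized horseshoe tends to sample better.

```python
import jax.numpy as jnp
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def horseshoe_regression(X, y=None):
    n, p = X.shape
    tau = numpyro.sample("tau", dist.HalfCauchy(1.0))                 # global scale
    lam = numpyro.sample("lam", dist.HalfCauchy(jnp.ones(p)))         # local scales
    beta = numpyro.sample("beta", dist.Normal(jnp.zeros(p), tau * lam))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))             # noise scale
    numpyro.sample("y", dist.Normal(X @ beta, sigma), obs=y)

# Illustrative usage (slow relative to the lasso at large n, p):
# mcmc = MCMC(NUTS(horseshoe_regression), num_warmup=500, num_samples=500)
# mcmc.run(random.PRNGKey(0), X, y)
# posterior = mcmc.get_samples()
```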

§12.4 Where the lasso breaks down

The lasso has known failure modes beyond the §6.2 / §9.4 IC and RE failures. Five worth flagging:

Highly correlated features. Already covered (§6.1, §9.4). Recap: lasso flips arbitrarily between correlated features in the active set; elastic net is the standard remedy.

Ultra-high-dimensional regimes ($p > 10^6$). Coordinate descent’s cost per full pass over the coordinates is $O(np)$, proportional to the data size. At $p > 10^6$ with $n$ in the thousands, even one pass takes seconds; full convergence with CV-tuned $\lambda$ becomes computationally prohibitive. Sure independence screening (Fan-Lv 2008) is a preprocessing step that filters $p$ down to $O(n)$ candidates by ranking $|\mathbf{X}_j^\top \mathbf{y}|$, then runs the lasso on the reduced feature set. It loses some precision but keeps the workflow tractable. For genuinely ultrahigh-dimensional regimes ($\log p$ growing polynomially in $n$, as in some genomics applications), the screening-then-lasso pipeline is standard.
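
A screening pass is only a few lines. The sketch below keeps the top $\sim n$ features by marginal correlation before running a CV-tuned lasso; the screening size $d = n$ is one common default, not a universal rule.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def sis_then_lasso(X, y, d=None):
    """Marginal-correlation screening (in the spirit of Fan-Lv 2008), then lasso (sketch)."""
    n, p = X.shape
    d = d if d is not None else n                    # keep on the order of n features
    Xc = (X - X.mean(0)) / X.std(0)                  # standardize before ranking
    scores = np.abs(Xc.T @ (y - y.mean())) / n       # marginal correlations, up to scale
    keep = np.argsort(scores)[-d:]                   # indices of the top-d features
    fit = LassoCV(cv=5).fit(Xc[:, keep], y)
    beta = np.zeros(p)
    beta[keep] = fit.coef_
    return beta, keep
```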

Heavy-tailed (non-sub-Gaussian) noise. The §5.5 deviation step required $\boldsymbol\varepsilon$ to be sub-Gaussian. With heavier tails (say, Cauchy or Student-$t$ noise with few degrees of freedom), the maximal inequality $\|\mathbf{X}^\top \boldsymbol\varepsilon / n\|_\infty \le \lambda/2$ no longer holds with high probability at the lasso’s standard $\lambda \asymp \sigma\sqrt{\log p / n}$: the noise has rare but large excursions that the bound cannot absorb. Quantile / median regression lasso (Belloni-Chernozhukov 2011): replace the squared-error loss with the check function $\rho_\tau(u) = u(\tau - \mathbb{1}\{u < 0\})$, giving an L1-penalized quantile regression. It is robust to heavy tails because the check loss penalizes residuals linearly, not quadratically. The estimator achieves the same $\sqrt{s \log p / n}$ rate under weaker noise assumptions but requires a different solver (linear programming or smoothed coordinate descent).
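
In Python, scikit-learn's QuantileRegressor already pairs the check loss with an L1 penalty, so a median-regression lasso is essentially a one-liner; alpha plays the role of $\lambda$ and the value shown is illustrative.

```python
from sklearn.linear_model import QuantileRegressor

# Pinball (check) loss at tau = 0.5 plus alpha * ||beta||_1, solved by linear programming;
# linear rather than quadratic residual penalties give the robustness to heavy tails.
qlasso = QuantileRegressor(quantile=0.5, alpha=0.1, fit_intercept=True)
# qlasso.fit(X, y); qlasso.coef_   # usage, assuming (X, y) as in the surrounding examples
```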

Time-series and spatially correlated observations. The lasso theory above assumes iid observations. With dependent observations (autoregressive errors in time series, spatial random fields), the standard $\lambda$-tuning criteria (CV, BIC) become biased: they underestimate the prediction error because nearby training and validation points share information. Block-CV (leave-one-block-out, where each block is a contiguous time window or spatial region) is the standard remedy. Theoretical guarantees under dependence are looser; the most general results are in Wong-Li-Tewari (2020) for stationary time series.
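
scikit-learn's TimeSeriesSplit provides a forward-chaining scheme, related to though not identical with leave-one-block-out CV, and it drops directly into LassoCV; the snippet is a hedged illustration rather than a full dependent-data workflow.

```python
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on an initial contiguous segment and validates on the following block,
# avoiding the interleaving of training and validation timepoints that random K-fold produces.
lasso_ts = LassoCV(cv=TimeSeriesSplit(n_splits=5))
# lasso_ts.fit(X, y)   # usage, assuming the rows of X are in time order
```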

Causal lasso interpretation pitfalls. A frequent misuse: interpreting the lasso’s selected coefficients as “causal effects.” The lasso minimizes prediction error on the observed data; the resulting coefficients are predictively useful but not necessarily causally interpretable unless the design satisfies specific structural assumptions (e.g., randomized treatment, valid instruments). The §12.1–§12.2 DML / AIPW pipeline is the right framework when causal interpretation is the goal — the lasso’s role is as a nuisance estimator, not as the producer of causal coefficients.

§12.5 Forward pointers in formalML

The lasso machinery from this topic feeds into several planned formalML topics:

  • Semiparametric Inference (coming soon). The DML framework introduced in §12.1 generalizes substantially: any target parameter θ0\theta_0 identified by a moment condition with high-dim nuisance can be estimated n\sqrt n-consistently using cross-fitted ML estimates of the nuisance. The lasso is one of several admissible nuisance estimators (alongside random forests, gradient boosting, neural networks). The semiparametric-inference topic will treat the unified framework — Neyman orthogonal scores, the op(n1/4)o_p(n^{-1/4}) nuisance rate condition, the practical pipeline — at the level needed for applied econometrics and causal inference.

  • Causal Inference Methods (coming soon). Treatment-effect inference with high-dim confounders (§12.2) is the headline application of the DML framework. The causal-inference-methods topic will cover the full observational-causal-inference workflow: identification assumptions (unconfoundedness, overlap), the AIPW representation, double-robust efficient scores, the DML estimator and its CI / hypothesis tests, sensitivity analysis to unmeasured confounding, dynamic treatment regimes, and the connection to instrumental variables. Throughout, the lasso (logistic for propensity, Gaussian for outcome) is the default nuisance estimator.

  • PAC-Bayes Bounds (coming soon). Catoni’s (2007) PAC-Bayesian framework gives an alternative theoretical perspective on sparse regression: instead of the §5 oracle inequality (frequentist, RE-based), use a PAC-Bayes bound with a sparsity-favoring prior to derive a closely-related risk bound. The PAC-Bayes prior is structurally similar to the Bayesian sparsity priors of §12.3; the resulting estimator is closer to the posterior mean than to the lasso. The pac-bayes-bounds topic will treat this connection.

  • Bayesian Neural Networks. Sparse priors on neural network weights — particularly horseshoe priors on the input-layer weights — give automatic feature selection in deep models. The §12.3 Bayesian sparsity framework, scaled up to neural-net parameter counts, is the foundation for this. The bayesian-neural-networks topic covers practical implementations (HMC, variational inference) and the recovery guarantees.

  • Density Ratio Estimation (coming soon). Estimating $r(x) = p_2(x) / p_1(x)$ from samples of $p_1$ and $p_2$ is a problem closely related to regression: KLIEP (Kullback-Leibler importance estimation procedure) and its sparse-regularized variants reduce to convex optimization problems that look very much like the lasso. The connection isn’t immediate, but the algorithmic and theoretical machinery transfers.

The closing thesis. The lasso is one of the most successful algorithms in modern statistics because it occupies a sweet spot in the bias-variance-computability triangle: the shrinkage bias is modest under reasonable conditions, the variance adapts to the sparsity level $s$ rather than the ambient dimension $p$, and the convex-optimization formulation makes computation tractable at scales where every pre-2000 alternative was infeasible. The §5 oracle inequality and the §10 debiased-lasso construction together make the lasso the standard tool for both prediction and inference in the sparse high-dim regime, with adaptations (elastic net, adaptive lasso, GLM-lasso, DML-lasso) covering the cases where the basic version doesn’t fit. We’ve worked through the full pipeline; the rest of the formalML topic graph picks up the threads from here.

Connections

  • T2 #1 predecessor. Established the bias-variance / cross-validation discipline carried over here; §7's named-section convention for cross-validation parallels kernel-regression §5's LOO-CV / GCV section. The toy-DGP convention (controlled signal-to-noise, fixed seed, modular helper functions) is shared. kernel-regression
  • T2 #2 predecessor and freshest May-2026 exemplar. Established the §-prefix heading style, the verification-suite-against-notebook discipline, and the figure-styling baseline. This topic ports the structural template; the algorithmic substrate (sparsity / convex optimization / concentration) is independent. local-regression
  • Bayesian counterpart with horseshoe / spike-and-slab / R2-D2 priors. The lasso's L1 penalty is the negative log-prior of an iid Laplace prior; sparse-Bayesian methods replace the Laplace with heavier-tailed continuous shrinkage priors (horseshoe) or discrete-mixture spike-and-slab. §12.3 below is the explicit cross-link; sparse-bayesian-priors §5 contrasts horseshoe vs Bayesian-LASSO geometry as the prior-side dual of this topic's L1 penalty. sparse-bayesian-priors
  • Foundations-layer prerequisite. The §5.5 deviation step for the oracle inequality uses the sub-Gaussian maximal inequality on $\|\mathbf{X}^\top \boldsymbol\varepsilon / n\|_\infty$; the §6.3 sign-consistency proof uses an analogous deviation bound on the noise term in the primal-dual witness construction. concentration-inequalities
  • Foundations-layer prerequisite. The §2.4 KKT subgradient conditions, §3.1 soft-thresholding derivation, and §3 ISTA / FISTA convergence proofs all rest on the convex-analysis substrate (subdifferentials, the descent lemma, proximal operators, Lagrangian duality). convex-analysis
  • Foundations-layer prerequisite. §3.3 ISTA is the proximal-gradient generalization of gradient descent; §3.4 FISTA layers Nesterov momentum on top of it for the $O(1/k^2)$ rate. The descent-lemma machinery and the $L$-smoothness convergence analysis port directly. gradient-descent
  • Foundations-layer prerequisite. The §3 lasso solvers (ISTA, FISTA, coordinate descent) all reduce to repeated application of the soft-thresholding operator $S(\cdot, \lambda)$ — the proximal operator of the L1 norm developed in proximal-methods. The shared `proximalUtils.softThreshold` helper used in this topic's viz components is exported from the proximal-methods topic's substrate. proximal-methods

References & Further Reading