Local Polynomial Regression
Degree-$p$ generalization of local-linear regression — the Fan–Gijbels family, the equivalent-kernel formulation that makes every degree-$p$ estimator a kernel smoother with a degree-aware reweighting $K^*_p$, the Ruppert–Wand bias ladder, and the odd-vs-even degree story that distills the practical recommendation to $p \in \{1, 3\}$.
1. From local-linear to degree-$p$
Kernel Regression closed with a boundary problem and a fix. The Nadaraya–Watson estimator does fine in the interior of the support but biases toward the data side at the boundary — the kernel mass that would sit outside the support gets reflected back in by the normalization, dragging the estimate toward whatever’s nearby. The §8.2 fix was local-linear regression: fit a line, not a constant, in each kernel-weighted neighborhood. The line absorbs the slope information that NW was forced to merge into a level estimate, and the boundary bias drops from $O(h)$ to $O(h^2)$ — uniformly, across the whole support.
That fix is the simplest member of a family. The §8.3 forward-pointer named the family — degree-$p$ local polynomial regression, due to Fan, Gijbels, and collaborators in a string of 1990s papers — and left the rest for this topic. We pick it up here.
1.1 The local-cubic teaser
Before the algebra, the picture. Below are NW (degree 0), local-linear (degree 1), local-quadratic (degree 2), and local-cubic (degree 3) fits to the same 200-point sample from the §1 toy DGP that ran through Kernel Regression, at the same bandwidth and the same seed (42).
Three things to notice. The NW fit clips the peaks and troughs of the sine — degree-zero polynomials are constants, so the fit at $x$ is a kernel-weighted average of nearby $Y_i$ values, and averaging across a neighborhood with a steep slope flattens the peak. Local-linear recovers the slopes — the line through the kernel-weighted neighborhood at the peak isn’t asked to also fit the curvature, so the level estimate at the peak is no longer dragged down by the off-center neighbors. Local-cubic does both — slope and curvature — and tracks the sine more tightly still, particularly at the inflection points where local-linear has to compromise.
You’ll notice we skipped degree 2. That’s not an oversight; that’s the punchline of §5.
1.2 The Fan–Gijbels family
The general estimator, for a degree-$p$ polynomial, solves the kernel-weighted least-squares problem
$$\hat\beta(x) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{p+1}}\;\sum_{i=1}^n\Big(Y_i - \sum_{j=0}^p \beta_j\,(X_i - x)^j\Big)^2 K_h(X_i - x)$$
at every evaluation point $x$. The function-value estimate is $\hat m_h(x) = \hat\beta_0(x)$. The first $p$ derivative estimates fall out of the same fit as $\hat m^{(j)}_h(x) = j!\,\hat\beta_j(x)$ — a notably useful side-product we cash in during §9.
Three results structure the topic and earn the family its place in the modern nonparametric toolkit:
- The equivalent-kernel formulation (§3). Every degree-$p$ local polynomial estimator can be rewritten as a kernel smoother $\hat m_h(x) = \sum_i l_i(x)\,Y_i$ whose weights are asymptotically built from an equivalent kernel $K^*_p$, where $K^*_p$ is not the original kernel — it’s a degree-aware, location-aware reweighting of $K$. The bandwidth-selection theory and the implementation tricks port from Kernel Regression because of this; the higher-degree estimators are still kernel smoothers, just smarter ones.
- The bias ladder (§4). For $(p+1)$-times-differentiable $m$, degree-$p$ local polynomial achieves bias $O(h^{p+1})$ when $p$ is odd. When $p$ is even, the leading term vanishes by parity at the interior with symmetric $K$, and the next-order analysis gives $O(h^{p+2})$ — a parity-bonus interior rate that does not survive at the boundary. The variance is $O\big(\tfrac{1}{nh}\big)$ throughout.
- The odd-vs-even degree pattern (§5). Combining the first two: an odd degree $p$ provides the same interior rate as the preceding even degree $p-1$ (the §3 pairing $K^*_{p-1} = K^*_p$ at interior, with matched variance constant $R(K^*_p)$) but with strictly better boundary behavior — the parity-zero pattern that gives even $p-1$ its interior bonus collapses at the boundary, while odd $p$ retains its $O(h^{p+1})$ rate uniformly across the support and keeps design-adaptivity (no design-density-derivative contamination). So each odd $p$ Pareto-dominates the even degree just below it at no variance cost. The practical choices are $p = 1$ and $p = 3$, and the choice between them comes down to whether you need derivative estimates.
That’s the spine. Boundary behavior (§6) follows from the equivalent-kernel adapting to the boundary at odd $p$. Bandwidth selection (§7) extends Kernel Regression §5 essentially unchanged. The multivariate generalization (§8), the smoothing-spline bridge (§10), and the additive-model bridge (§11) all rest on the same scaffolding.
1.3 Roadmap
§2 is the formal definition and the matrix form. §3 is the equivalent-kernel rewrite. §4 derives the bias-variance ladder, and §5 — the topic’s sharpest result — is the odd-vs-even degree theorem. §6 takes the theorem to the boundary. §7 extends the bandwidth-selection toolkit. §8 generalizes to $d$ dimensions. §9 is derivative estimation. §10 connects to smoothing splines. §11 connects to additive models. §12 closes with the connections to semiparametric inference and the practical limits.
2. The degree-$p$ local polynomial estimator
2.1 The Taylor-expansion idea
The motivating observation is the same one that drives any local approximation in calculus. If $m$ is $(p+1)$-times differentiable at $x$, then for $X_i$ near $x$, Taylor’s theorem gives
$$m(X_i) = m(x) + m'(x)\,(X_i - x) + \frac{m''(x)}{2}(X_i - x)^2 + \cdots + \frac{m^{(p)}(x)}{p!}(X_i - x)^p + R_p(X_i),$$
with remainder $R_p(X_i) = O\big(|X_i - x|^{p+1}\big)$. So in a small enough neighborhood of $x$, $m$ looks like a polynomial of degree $p$ in $(X_i - x)$, with coefficients $\beta_j = m^{(j)}(x)/j!$.
The local polynomial estimator turns this around. We don’t know $m$, but we have data. Fit a polynomial of degree $p$ in $(X_i - x)$ to the observed $Y_i$, weighted by how close $X_i$ is to $x$. The kernel weights enforce the “near $x$” part — observations far from $x$ get exponentially down-weighted under a Gaussian kernel, or zeroed out entirely under a compact-support kernel like the Epanechnikov. The function-value estimate is the intercept of the fitted polynomial; the slope, curvature, and higher derivatives come along as the higher-order coefficients.
NW corresponds to $p = 0$ — a local constant fit, where the only coefficient is the kernel-weighted mean of the $Y_i$. Local-linear is $p = 1$, the §8 Kernel Regression fix. Local-cubic is $p = 3$.
2.2 The kernel-weighted least-squares objective
Write the kernel weight as $K_h(X_i - x)$, where $K_h(u) = h^{-1}K(u/h)$ is the standard scaling that makes $K_h$ integrate to one. The kernel-weighted sum of squared errors at evaluation point $x$ is
$$\mathrm{SSE}_x(\beta) = \sum_{i=1}^n\Big(Y_i - \sum_{j=0}^p \beta_j\,(X_i - x)^j\Big)^2 K_h(X_i - x). \tag{2.1}$$
Definition 1 (Local polynomial estimator).
The degree-$p$ local polynomial estimator of $m(x)$ at a fixed evaluation point $x$ is
$$\hat m_h(x) = \hat\beta_0(x), \qquad \hat\beta(x) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{p+1}} \mathrm{SSE}_x(\beta),$$
with derivative estimates $\hat m^{(j)}_h(x) = j!\,\hat\beta_j(x)$ for $j = 1, \dots, p$.
Two things to notice. First, the problem decouples across $x$: each evaluation point gets its own $\hat\beta(x)$. The bandwidth and the polynomial degree are global, but the fitted coefficients are local. Second, this is ordinary weighted least squares — the kernel only enters through the weights.
2.3 Normal equations
Setting $\partial\,\mathrm{SSE}_x(\beta)/\partial\beta_k = 0$ for each $k = 0, \dots, p$ gives the first-order conditions
$$\sum_{i=1}^n K_h(X_i - x)\,(X_i - x)^k\Big(Y_i - \sum_{j=0}^p \beta_j\,(X_i - x)^j\Big) = 0, \qquad k = 0, \dots, p.$$
Rearranging,
$$\sum_{j=0}^p \beta_j \sum_{i=1}^n K_h(X_i - x)\,(X_i - x)^{j+k} \;=\; \sum_{i=1}^n K_h(X_i - x)\,(X_i - x)^k\,Y_i.$$
Define the empirical kernel-moment quantities
$$S_k(x) = \sum_{i=1}^n K_h(X_i - x)\,(X_i - x)^k, \qquad T_k(x) = \sum_{i=1}^n K_h(X_i - x)\,(X_i - x)^k\,Y_i.$$
The normal equations are then $\sum_{j=0}^p S_{j+k}(x)\,\beta_j = T_k(x)$ for $k = 0, \dots, p$. Stacked as a matrix system,
$$S(x)\,\beta = T(x), \qquad S(x)_{kj} = S_{j+k}(x), \quad T(x)_k = T_k(x).$$
The matrix $S(x)$ has constant entries along anti-diagonals — a Hankel matrix. The structure isn’t load-bearing for solving (we use a generic linear solver) but it’s worth registering: $S(x)$ is fully determined by the $2p+1$ scalars $S_0(x), \dots, S_{2p}(x)$.
Existence and uniqueness. $S(x)$ is invertible iff the $X_i$’s with positive kernel weight at $x$ contain at least $p+1$ distinct values. For continuous kernels like the Gaussian, every observation has positive weight, so the matrix is invertible whenever the data has at least $p+1$ distinct $X_i$ — almost surely true under any continuous distribution of $X$. For compactly supported kernels (Epanechnikov, biweight, triweight), we need at least $p+1$ distinct $X_i$ within the kernel’s support radius of $x$. This is a local-design constraint that bites only at small bandwidths or sparse regions.
2.4 Matrix form and batched-grid vectorization
Stack the design rows for each observation into the $n \times (p+1)$ matrix
$$X_x = \begin{pmatrix} 1 & (X_1 - x) & \cdots & (X_1 - x)^p \\ \vdots & \vdots & & \vdots \\ 1 & (X_n - x) & \cdots & (X_n - x)^p \end{pmatrix}$$
and let $W_x = \operatorname{diag}\big(K_h(X_1 - x), \dots, K_h(X_n - x)\big)$ be the diagonal weight matrix. Then $S(x) = X_x^\top W_x X_x$ and $T(x) = X_x^\top W_x Y$, and the closed-form WLS solution is
$$\hat\beta(x) = \big(X_x^\top W_x X_x\big)^{-1}X_x^\top W_x\,Y. \tag{2.2}$$
The function-value estimate is
$$\hat m_h(x) = e_1^\top\hat\beta(x) = e_1^\top\big(X_x^\top W_x X_x\big)^{-1}X_x^\top W_x\,Y,$$
where $e_1 = (1, 0, \dots, 0)^\top$ picks off the intercept. We return to this structure in §3 — it’s the seed of the equivalent-kernel formulation.
Vectorization. For an evaluation grid $x_1, \dots, x_G$, equation (2.2) is $G$ separate WLS solves with different design and weight matrices. The straightforward implementation loops over the grid; the batched implementation stacks all $G$ design matrices into a $(G, n, p+1)$ array, all weight diagonals into a $(G, n)$ array, and uses NumPy’s np.linalg.solve broadcasting to do all $G$ systems at once. The floating-point cost is the same — at the §1-toy scale ($n = 200$) the speedup over a Python loop is modest, but it grows substantially at larger $n$ and $G$, where Python-loop overhead dominates the picture.
Conditioning. The columns of $X_x$ have entries $(X_i - x)^j$ that span many orders of magnitude when $h$ is small. The fix is to work in scaled coordinates $u_i = (X_i - x)/h$. The rescaled design has entries $u_i^j$, with $|u_i| = O(1)$ for the bulk of the kernel mass — well-conditioned. The function-value estimate is invariant under this rescaling because the first column is unchanged ($u_i^0 = 1$). Higher coefficients pick up an $h^{-j}$ factor: $\hat\beta_j(x) = h^{-j}\,\hat\beta_j^{\text{scaled}}(x)$. The notebook’s vectorized implementation uses scaled coordinates throughout.
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel (assumed definition — the same kernel Kernel Regression used)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def local_polynomial(X, Y, x_eval, h, p, K_fn=gaussian_kernel):
    """Vectorized degree-p local polynomial. Returns hat m_h(x) at x_eval."""
    X = np.asarray(X, dtype=float); Y = np.asarray(Y, dtype=float)
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    diff = X[None, :] - x_eval[:, None]                # (G, n)
    u = diff / h                                       # scaled coordinates (§2.4 conditioning fix)
    w = K_fn(u) / h                                    # (G, n) kernel weights K_h(X_i - x)
    # Scaled design Z[i, j, k] = u[i, j]**k.
    Z = u[..., None] ** np.arange(p + 1)               # (G, n, p+1)
    Wz = w[..., None] * Z                              # (G, n, p+1)
    # Normal equations: A_g beta_g = b_g for each evaluation point g.
    A = np.einsum("gnj,gnk->gjk", Wz, Z)               # (G, p+1, p+1)
    b = np.einsum("gnj,gn->gj", Wz, Y)                 # (G, p+1)
    # Trailing singleton axis keeps the batched solve unambiguous across NumPy versions.
    beta = np.linalg.solve(A, b[..., None])[..., 0]    # (G, p+1) — beta_0 is hat m
    return beta[..., 0]
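For orientation, a minimal usage sketch. The DGP below is a stand-in sine toy with an illustrative noise level and bandwidth — not necessarily the notebook’s exact §1 function — and it reuses `local_polynomial` and NumPy from the block above:

```python
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=200)                     # stand-in design, not the notebook's exact DGP
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)

grid = np.linspace(0.0, 1.0, 101)
for p in (0, 1, 3):                                     # NW, local-linear, local-cubic
    m_hat = local_polynomial(X, Y, grid, h=0.08, p=p)   # h = 0.08 is illustrative
    mse = float(np.mean((m_hat - np.sin(2 * np.pi * grid)) ** 2))
    print(f"p = {p}: grid MSE against the true curve = {mse:.4f}")
```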
2.5 Reading off derivatives
The Taylor identification from §2.1 said that, locally, $\beta_j = m^{(j)}(x)/j!$. Plugging in the WLS estimate gives the local-polynomial derivative estimator
$$\hat m^{(j)}_h(x) = j!\,\hat\beta_j(x), \qquad j = 0, 1, \dots, p.$$
Quick demonstration on the §1 toy. Fitting a local-cubic at a fixed interior point gives all four coefficients $\hat\beta_0, \dots, \hat\beta_3$ in one solve. The local-cubic recovers the function value and first derivative tightly; the second and third derivatives come back with substantially more variance, foreshadowing §9’s $1/(n h^{1+2j})$ variance formula — the bandwidth that’s right for $\hat m$ is too small for $\hat m''$ and $\hat m'''$.
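A sketch of the derivative read-off in code: the scaled-coordinate solve from §2.4 is reused, and the coefficients are converted to derivatives via $\hat m^{(j)} = j!\,\hat\beta^{\text{scaled}}_j/h^j$. The helper name `local_polynomial_derivatives` is illustrative, not the notebook’s:

```python
def local_polynomial_derivatives(X, Y, x_eval, h, p, K_fn=gaussian_kernel):
    """Return an array whose column j is hat m^(j)(x) for j = 0..p (one WLS solve per point)."""
    X = np.asarray(X, dtype=float); Y = np.asarray(Y, dtype=float)
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    u = (X[None, :] - x_eval[:, None]) / h             # scaled coordinates
    w = K_fn(u) / h
    Z = u[..., None] ** np.arange(p + 1)
    Wz = w[..., None] * Z
    A = np.einsum("gnj,gnk->gjk", Wz, Z)
    b = np.einsum("gnj,gn->gj", Wz, Y)
    beta_scaled = np.linalg.solve(A, b[..., None])[..., 0]              # (G, p+1)
    j = np.arange(p + 1)
    j_factorial = np.cumprod(np.concatenate(([1.0], np.arange(1.0, p + 1))))
    return beta_scaled * j_factorial / h ** j           # m^(j) = j! * beta_j_scaled / h^j
```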
3. The equivalent-kernel formulation
3.1 Local polynomial estimators are linear smoothers
Equation (2.2) gives the function-value estimate as
$$\hat m_h(x) = e_1^\top\big(X_x^\top W_x X_x\big)^{-1}X_x^\top W_x\,Y.$$
Define the weight vector
$$\big(l_1(x), \dots, l_n(x)\big) = e_1^\top\big(X_x^\top W_x X_x\big)^{-1}X_x^\top W_x.$$
Then $\hat m_h(x) = \sum_{i=1}^n l_i(x)\,Y_i$, and crucially the weights $l_i(x)$ depend only on the design — through $X_x$ and $W_x$ — and on $x$, never on the responses $Y$. So local polynomial estimators of every degree are linear smoothers: the fit at $x$ is a linear combination of the responses with design-determined weights.
The $i$-th column of $X_x^\top W_x$ is $K_h(X_i - x)\,\big(1, (X_i - x), \dots, (X_i - x)^p\big)^\top$. So
$$l_i(x) = e_1^\top\big(X_x^\top W_x X_x\big)^{-1}\big(1, (X_i - x), \dots, (X_i - x)^p\big)^\top K_h(X_i - x). \tag{3.1}$$
This is the exact, finite-sample equivalent-kernel weight. It depends on $x$, on $h$, on the kernel $K$, on the degree $p$, and on the entire design via the Gram matrix. It’s not yet a clean kernel function — it doesn’t factor as $\frac{1}{nh}K^*\!\big(\frac{X_i - x}{h}\big)$ for any fixed $K^*$.
3.2 The asymptotic equivalent kernel
The clean factorization emerges in the population limit. Replace $\frac{1}{n}X_x^\top W_x X_x$ with its expectation; the $(j, k)$ entry is
$$\mathbb{E}\big[K_h(X - x)\,(X - x)^{j+k}\big] = \int K_h(t - x)\,(t - x)^{j+k} f(t)\,dt.$$
Substitute $u = (t - x)/h$, so $t = x + hu$, $dt = h\,du$, $(t - x)^{j+k} = h^{j+k}u^{j+k}$, and $K_h(t - x) = h^{-1}K(u)$:
$$= h^{j+k}\int K(u)\,u^{j+k}\,f(x + hu)\,du.$$
To leading order in $h$, $f(x + hu) \approx f(x)$, so
$$\approx h^{j+k}\,f(x)\,\mu_{j+k}(K),$$
where $\mu_k(K) = \int u^k K(u)\,du$ are the kernel moments familiar from Kernel Regression. The Gram-matrix factorization is
$$\frac{1}{n}X_x^\top W_x X_x \;\approx\; f(x)\,H\,M\,H, \tag{3.2}$$
where $H = \operatorname{diag}(1, h, h^2, \dots, h^p)$ and $M = \big(\mu_{j+k}(K)\big)_{j,k=0}^{p}$ is the kernel-moment matrix in the $u$-coordinate.
Inverting and substituting into (3.1), and simplifying with $e_1^\top H^{-1} = e_1^\top$ and $u_i = (X_i - x)/h$:
$$l_i(x) \;\approx\; \frac{1}{n h\,f(x)}\,K^*_p\!\Big(\frac{X_i - x}{h}\Big),$$
where the asymptotic equivalent kernel is
$$K^*_p(u) \;=\; e_1^\top M^{-1}\,(1, u, u^2, \dots, u^p)^\top\,K(u). \tag{3.3}$$
The equivalent-kernel form of the local-polynomial estimator is therefore
$$\hat m_h(x) \;\approx\; \frac{1}{n h\,f(x)}\sum_{i=1}^n K^*_p\!\Big(\frac{X_i - x}{h}\Big)\,Y_i. \tag{3.4}$$
This is the headline result of the section. Local polynomial regression, at every degree $p$, is asymptotically a kernel smoother — but with $K^*_p$ in place of $K$. The original kernel enters only through $M$ (the moment matrix) and through the $K(u)$ factor in (3.3); the polynomial degree enters through $e_1^\top M^{-1}(1, u, \dots, u^p)^\top$, which is a polynomial of degree $p$ in $u$ multiplying the original kernel.
The $1/f(x)$ prefactor is worth flagging. NW had to manufacture this density correction explicitly, via the kernel-density denominator $\hat f_h(x)$. Local polynomial of any degree produces it automatically through the normal equations. That’s the algebraic source of the boundary-bias improvement we previewed in §1.
3.3 Properties of $K^*_p$
The equivalent kernel inherits three structural properties from the construction.
Proposition 1 (Moment-matching identities).
For $q = 0, 1, \dots, p$,
$$\int u^q\,K^*_p(u)\,du = \delta_{0q} = \begin{cases} 1 & q = 0,\\ 0 & q = 1, \dots, p.\end{cases}$$
The original kernel $K$ has only $\int K = 1$ guaranteed (and $\int u\,K = 0$ when $K$ is symmetric). The equivalent kernel has all moments up to order $p$ matched to those of a degree-$p$ polynomial fit that reproduces constants — far stronger. This is what kills the design-density and lower-order Taylor terms in the bias expansion of §4.
Proof.
Substitute (3.3): $\int u^q K^*_p(u)\,du = e_1^\top M^{-1}\int u^q\,(1, u, \dots, u^p)^\top K(u)\,du$. For $0 \le q \le p$, the vector $\int u^q (1, u, \dots, u^p)^\top K(u)\,du = (\mu_q, \mu_{q+1}, \dots, \mu_{q+p})^\top$ is exactly the $q$-th column of $M$. So $M^{-1}$ applied to it is the standard basis vector $e_{q+1}$, and $e_1^\top e_{q+1} = \delta_{0q}$.
∎
Property 2 — the odd-with-even pairing at interior with symmetric $K$. When $K$ is symmetric ($K(-u) = K(u)$) and $x$ is an interior point of the support, the odd kernel moments $\mu_1, \mu_3, \mu_5, \dots$ all vanish. The moment matrix $M$ then has a checkerboard zero pattern: nonzero entries only at positions $(j, k)$ with $j + k$ even. The same checkerboard pattern persists in $M^{-1}$, so the first row $e_1^\top M^{-1}$ has nonzero entries only at even indices.
In particular, with the Gaussian kernel ($\mu_0 = 1$, $\mu_2 = 1$, $\mu_4 = 3$):
- $K^*_0(u) = K(u)$ (trivially).
- $K^*_1(u) = K(u)$ (symmetric $K$ at interior gives $\mu_1 = 0$, so the linear column contributes nothing to the first row).
- $K^*_2(u) = K^*_3(u) = \dfrac{\mu_4 - \mu_2 u^2}{\mu_4 - \mu_2^2}\,K(u) = \dfrac{3 - u^2}{2}\,K(u)$, which is negative for $|u| > \sqrt{3}$.
- $K^*_4(u) = K^*_5(u) = \dfrac{15 - 10u^2 + u^4}{8}\,K(u)$, and so on.
The $K^*_{2k} = K^*_{2k+1}$ pairing at interior is a structural consequence of the checkerboard zeros: adding the degree-$(2k+1)$ column to the design matrix extends $M$ by an odd-indexed row and column whose nonzero entries sit only at the other odd indices — disconnected from the even-indexed block that determines the first row of $M^{-1}$. This algebraic fact is the seed of the odd-vs-even degree story (§5): at interior, jumping from $p = 0$ to $p = 1$, or from $p = 2$ to $p = 3$, doesn’t change the equivalent kernel — and therefore doesn’t change the leading-order bias rate.
Property 3 — $K^*_p$ is sometimes negative-valued. For $p \ge 2$, the polynomial factor $e_1^\top M^{-1}(1, u, \dots, u^p)^\top$ has zero crossings, so $K^*_p$ takes negative values. This isn’t a flaw — it’s the price of the moment-matching identities. The original $K$ is non-negative and only matches the zeroth moment; $K^*_p$ matches $p + 1$ moments, and the trade-off is negativity in the tails.
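The moment-matching identities and the checkerboard pairing can be checked numerically. The sketch below (assuming SciPy for quadrature and the Gaussian pdf; helper names are illustrative) builds $K^*_p$ directly from the Gaussian moment matrix $M$ and verifies Proposition 1 for $p = 3$:

```python
from math import factorial

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def gaussian_moment(k):
    """k-th moment of the standard Gaussian kernel: 0 for odd k, k!/(2^(k/2) (k/2)!) for even k."""
    return 0.0 if k % 2 else factorial(k) / (2 ** (k // 2) * factorial(k // 2))

def equivalent_kernel(p):
    """K*_p(u) = e_1' M^{-1} (1, u, ..., u^p)' K(u) for the Gaussian kernel."""
    M = np.array([[gaussian_moment(j + k) for k in range(p + 1)] for j in range(p + 1)])
    first_row = np.linalg.solve(M, np.eye(p + 1)[:, 0])     # M is symmetric, so this is e_1' M^{-1}
    return lambda u: np.polyval(first_row[::-1], u) * norm.pdf(u)

K3 = equivalent_kernel(3)
for q in range(5):
    moment, _ = quad(lambda u: u ** q * K3(u), -np.inf, np.inf)
    print(q, round(moment, 6))      # expect 1, 0, 0, 0, then the nonzero (p+1)-th moment
```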
3.4 Boundary behavior and the visualization
At an interior point $x$, the kernel integrates to one over its full support. At a boundary point, say $x = ch$ for some $c \ge 0$ near the support edge at $0$, only the kernel mass with $u \ge -c$ falls within the support. The kernel-moment matrix becomes
$$M(c) = \big(\mu_{j+k,\,c}\big)_{j,k=0}^{p}, \qquad \mu_{k,c} = \int_{-c}^{\infty} u^k\,K(u)\,du,$$
and the equivalent kernel becomes $c$-dependent:
$$K^*_{p,c}(u) = e_1^\top M(c)^{-1}\,(1, u, \dots, u^p)^\top\,K(u)\,\mathbf 1\{u \ge -c\}.$$
Two things change from the interior case. First, the symmetry of $K$ no longer implies $\mu_{k,c} = 0$ for odd $k$, so the checkerboard zero pattern is gone and the odd-with-even pairing breaks. Second, the equivalent kernel adapts its shape to compensate for the missing kernel mass on the truncated side. This is the automatic boundary correction — the equivalent kernel knows where the boundary is and reweights accordingly, without us having to do anything beyond solving the original normal equations.
The interior panel makes the §3.3 algebra visible. The boundary panel makes the §6 boundary-bias result visible: at the boundary, the equivalent kernel for local-linear and local-cubic absorbs the truncation, while the NW equivalent kernel ($K^*_{0,c}$, just the truncated Gaussian) is asymmetric and biased. We return to this in §6 with the formal bias rates.
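A matching sketch for the boundary case, reusing the imports from the §3.3 sketch — replace the interior moments with the truncated moments $\mu_{k,c}$ computed by quadrature (helper names illustrative):

```python
def truncated_moment(k, c):
    """mu_{k,c} = integral of u^k K(u) over [-c, inf) for the Gaussian kernel."""
    val, _ = quad(lambda u: u ** k * norm.pdf(u), -c, np.inf)
    return val

def boundary_equivalent_kernel(p, c):
    """K*_{p,c}(u) = e_1' M(c)^{-1} (1, u, ..., u^p)' K(u) on u >= -c, zero elsewhere."""
    Mc = np.array([[truncated_moment(j + k, c) for k in range(p + 1)] for j in range(p + 1)])
    row = np.linalg.solve(Mc, np.eye(p + 1)[:, 0])          # M(c) is symmetric
    return lambda u: np.where(u >= -c, np.polyval(row[::-1], u) * norm.pdf(u), 0.0)

# The §5.2 moment-matching identities still hold at the strict boundary c = 0:
K1_boundary = boundary_equivalent_kernel(1, 0.0)
for q in range(2):
    moment, _ = quad(lambda u: u ** q * float(K1_boundary(u)), 0.0, np.inf)
    print(q, round(moment, 6))      # expect 1, then 0
```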
4. Bias and variance at general $p$
4.1 The conditional bias and the residual decomposition
The cleanest path from §3 to a sharp bias formula is the conditional approach of Ruppert & Wand (1994). Conditioning on the design $X_1, \dots, X_n$, the estimator’s mean is
$$\mathbb{E}\big[\hat m_h(x)\mid \mathbf X\big] = e_1^\top\big(X_x^\top W_x X_x\big)^{-1}X_x^\top W_x\,\mathbf m,$$
where $\mathbf m = \big(m(X_1), \dots, m(X_n)\big)^\top$ collects the true regression function evaluated at the design points. Taylor-expanding $m(X_i)$ about $x$,
$$m(X_i) = \sum_{j=0}^p \frac{m^{(j)}(x)}{j!}(X_i - x)^j + \underbrace{\frac{m^{(p+1)}(x)}{(p+1)!}(X_i - x)^{p+1} + o\big(|X_i - x|^{p+1}\big)}_{r_i}.$$
The first $p + 1$ Taylor terms are exactly a degree-$p$ polynomial in $(X_i - x)$, so they live in the column space of $X_x$. The projection fixes any vector in that column space, so it returns those coefficients exactly: the polynomial part passes through untouched and contributes $m(x)$ itself, not bias. The conditional bias collapses to the residual contribution alone:
$$\operatorname{Bias}\big[\hat m_h(x)\mid \mathbf X\big] = e_1^\top\big(X_x^\top W_x X_x\big)^{-1}X_x^\top W_x\,\mathbf r, \qquad \mathbf r = (r_1, \dots, r_n)^\top. \tag{4.1}$$
This is a purely local statement — the bias depends on $m$ only through its $(p+1)$-th derivative at $x$, not through any global properties of $m$.
4.2 The leading-order bias formula
To extract the leading-order asymptotic, we substitute the leading-order approximation $r_i \approx \frac{m^{(p+1)}(x)}{(p+1)!}(X_i - x)^{p+1}$ into (4.1). The vector $\frac{1}{n}X_x^\top W_x\,\mathbf r$ has $k$-th component
$$\frac{1}{n}\sum_{i=1}^n K_h(X_i - x)\,(X_i - x)^k\,r_i \;\approx\; \frac{m^{(p+1)}(x)}{(p+1)!}\cdot\frac{1}{n}\sum_{i=1}^n K_h(X_i - x)\,(X_i - x)^{p+1+k}.$$
The empirical sum converges (by LLN + change of variables, as in §3.2) to its expectation $f(x)\,h^{p+1+k}\,\mu_{p+1+k}(K)$ to leading order. Stacking over $k$ and using the factorization of the Gram matrix from (3.2),
$$\operatorname{Bias}\big[\hat m_h(x)\mid\mathbf X\big] \;\approx\; e_1^\top\big(f(x)\,H M H\big)^{-1}\,f(x)\,\frac{m^{(p+1)}(x)}{(p+1)!}\,h^{p+1}\,H\,c_p \;=\; \frac{m^{(p+1)}(x)}{(p+1)!}\,h^{p+1}\,e_1^\top M^{-1}c_p,$$
where $c_p = \big(\mu_{p+1}(K), \mu_{p+2}(K), \dots, \mu_{2p+1}(K)\big)^\top$.
Theorem 1 (Ruppert–Wand (1994) interior bias formula).
Let $x$ be an interior point of the support of the design density $f$, with $f(x) > 0$ and $m$ $(p+1)$-times continuously differentiable near $x$. Conditional on the design,
$$\operatorname{Bias}\big[\hat m_h(x)\mid\mathbf X\big] = \frac{m^{(p+1)}(x)}{(p+1)!}\,B_p\,h^{p+1} + o_P\big(h^{p+1}\big),$$
where the asymptotic bias constant $B_p = e_1^\top M^{-1}c_p = \int u^{p+1}K^*_p(u)\,du$ is the first surviving moment of the equivalent kernel $K^*_p$.
Two remarks.
The asymptotic bias constant equals the $(p+1)$-th moment of $K^*_p$. Reading $e_1^\top M^{-1}c_p$ alongside (3.3) gives $B_p = \int u^{p+1}K^*_p(u)\,du$. The bias is determined by the first surviving moment of the equivalent kernel — the first moment that the moment-matching identities of §3.3 don’t kill. The §3 algebra and the §4 bias derivation are the same algebra read in two directions.
No design-density derivatives appear. Unlike the NW bias formula (which has a $m'(x)f'(x)/f(x)$ term), Theorem 1 depends on the design density only through the requirement $f(x) > 0$, for any $p \ge 1$. Local polynomials are design-adaptive: the WLS solve absorbs the design-density correction automatically through the Gram-matrix inversion. This is the structural advantage that won local-linear regression its place as the default at the end of Kernel Regression §8, and the formula shows the advantage extends to any $p \ge 1$.
4.3 The interior rate ladder
For symmetric $K$ at an interior point $x$, the moment matrix $M$ has the checkerboard zero pattern from §3.3, and so does $M^{-1}$. The first row of $M^{-1}$ has nonzero entries only at even indices. The vector $c_p = (\mu_{p+1}, \dots, \mu_{2p+1})^\top$ has nonzero entries only where the index has the parity of $p + 1$.
- Odd $p$: $p + 1$ is even, so $c_p$ has nonzero entries at even indices. These pair with the even-indexed nonzero entries of $e_1^\top M^{-1}$, giving $B_p \ne 0$ generically. The bias rate is $h^{p+1}$.
- Even $p$: $p + 1$ is odd, so $c_p$ has nonzero entries at odd indices. These pair with the zero odd-indexed entries of $e_1^\top M^{-1}$ — every term vanishes, so $B_p = 0$ identically. The leading term in Theorem 1 collapses, and the next-order analysis (one more term in the Taylor expansion of $m$) gives a bias of order $h^{p+2}$.
Stitching this together, the interior bias-rate ladder at symmetric $K$ on the Gaussian (computed via numerical integration; the zeros are exact zeros from parity, not numerical noise) is:
| $p$ | parity | $B_p = \int u^{p+1}K^*_p$ | $R(K^*_p) = \int (K^*_p)^2$ | leading interior rate |
|---|---|---|---|---|
| 0 | even | $0$ (parity, with NW correction at next order) | $0.282$ | $h^2$ |
| 1 | odd | $1$ | $0.282$ | $h^2$ |
| 2 | even | $0$ (parity) | $0.476$ | $h^4$ |
| 3 | odd | $-3$ | $0.476$ | $h^4$ |
| 4 | even | $0$ (parity) | $0.624$ | $h^6$ |
| 5 | odd | $15$ | $0.624$ | $h^6$ |
The $h^2/h^2$, $h^4/h^4$, $h^6/h^6$ pairing is the §3 $K^*_{2k} = K^*_{2k+1}$ pairing rendered as a bias-rate identity. Within each pair, odd $p$ is design-adaptive while even $p$ has design-density-dependent constants — a constant-level advantage that grows in importance at the boundary (§5), where the parity bonus disappears entirely.
The $p = 0$ row (NW) is the design-density-corrected $h^2$ formula already derived in Kernel Regression §3 — included here for completeness, not re-derived. The formula picks up its $m'(x)f'(x)/f(x)$ term not from Theorem 1 (which would give $B_0 = 0$ at symmetric $K$ by parity) but from the next-order correction to the asymptotic equivalent-kernel form. NW is asymptotically almost a kernel smoother, but not exactly one — local-polynomial of any $p \ge 1$ is.
Even-p bias constants are exactly zero at interior with symmetric K (parity). The leading-order bias picks up the next-order term, giving the parity-bonus rate h^(p+2) for even p versus h^(p+1) for odd p. Variance constants pair (R*_0 = R*_1, R*_2 = R*_3, R*_4 = R*_5) — the §3 odd-with-even pairing rendered as a constant.
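The ladder’s constants can be reproduced numerically, reusing `equivalent_kernel` and `quad` from the §3.3 sketch — a quick check that the even-$p$ zeros and the $R^*$ pairing come out as stated:

```python
# Bias constants B_p = ∫ u^(p+1) K*_p(u) du and variance constants R*_p = ∫ K*_p(u)^2 du (Gaussian).
for p in range(6):
    Kp = equivalent_kernel(p)
    B_p, _ = quad(lambda u: u ** (p + 1) * Kp(u), -np.inf, np.inf)
    R_p, _ = quad(lambda u: Kp(u) ** 2, -np.inf, np.inf)
    print(p, round(B_p, 4), round(R_p, 4))
# Expect B_p = 0 at even p (parity) and R*_0 = R*_1, R*_2 = R*_3, R*_4 = R*_5.
```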
4.4 Variance
The conditional variance of the linear smoother is
$$\operatorname{Var}\big[\hat m_h(x)\mid\mathbf X\big] = \sum_{i=1}^n l_i(x)^2\,\sigma^2(X_i),$$
where $l_i(x)$ is the equivalent-kernel weight from (3.1) and $\sigma^2(x) = \operatorname{Var}[Y\mid X = x]$ is the conditional noise variance.
For $X_i$ within the kernel’s effective support of $x$, $\sigma^2(X_i) \approx \sigma^2(x)$ to leading order, so $\operatorname{Var}\big[\hat m_h(x)\mid\mathbf X\big] \approx \sigma^2(x)\sum_i l_i(x)^2$. Substituting the asymptotic form (3.4) and applying LLN + change of variables,
Theorem 2 (Pointwise variance of local polynomial regression).
Under the conditions of Theorem 1, plus $\sigma^2(\cdot)$ continuous and positive at $x$,
$$\operatorname{Var}\big[\hat m_h(x)\mid\mathbf X\big] = \frac{\sigma^2(x)}{n h\,f(x)}\,R(K^*_p) + o_P\!\Big(\frac{1}{nh}\Big), \qquad R(K^*_p) = \int K^*_p(u)^2\,du.$$
The variance rate is $O\big(\frac{1}{nh}\big)$ at every $p$ — same scaling as NW. The constant $R(K^*_p)$ varies with $p$, and crucially grows with $p$ because $K^*_p$ becomes more oscillatory to match more moments. For the Gaussian, $R(K^*_0) = R(K^*_1) \approx 0.282$, while $R(K^*_2) = R(K^*_3) \approx 0.476$ — local-quadratic and local-cubic carry roughly $1.7\times$ the variance of NW and local-linear at the same $h$ (and $p \in \{4, 5\}$ roughly a $2.2\times$ multiplier). This is the variance side of the bias-variance ladder: higher $p$ buys faster bias decay at the cost of a bigger variance constant.
4.5 The AMISE-optimal bandwidth
Combining Theorems 1 and 2, the asymptotic mean squared error at an interior point $x$ at odd $p$ is
$$\operatorname{AMSE}(h; x) = \Big(\frac{m^{(p+1)}(x)}{(p+1)!}\,B_p\Big)^2 h^{2(p+1)} + \frac{\sigma^2(x)\,R(K^*_p)}{n h\,f(x)}.$$
Differentiating with respect to $h$ and setting to zero gives the local AMSE-optimal bandwidth
$$h_{\mathrm{opt}}(x) = \left[\frac{\sigma^2(x)\,R(K^*_p)\,\big((p+1)!\big)^2}{2(p+1)\,f(x)\,\big(m^{(p+1)}(x)\,B_p\big)^2\,n}\right]^{1/(2p+3)} \;\propto\; n^{-1/(2p+3)}. \tag{4.4}$$
For the §1 toy at $n = 200$, integrated over the interior of the support to avoid the boundary regions, the AMISE-optimal bandwidths follow the scalings below (the notebook reports the numeric values):
| $p$ | $h_{\mathrm{opt}}$ scaling |
|---|---|
| 1 | $n^{-1/5}$ |
| 3 | $n^{-1/9}$ |
Higher-degree estimators want wider bandwidths because their variance constant has grown but their bias constant has shrunk faster — wider $h$ exploits the better bias rate. The AMISE at $h_{\mathrm{opt}}$ scales as $n^{-(2p+2)/(2p+3)}$:
- $p = 1$: $n^{-4/5}$.
- $p = 3$: $n^{-8/9}$.
- $p = 5$: $n^{-12/13}$.
This is the Stone (1980, 1982) minimax rate $n^{-2s/(2s+1)}$ for $s$-times-differentiable regression functions, with $s = p + 1$ at odd $p$. Local polynomial regression at degree $p$ is minimax-optimal over the class of $(p+1)$-times-differentiable univariate functions — the canonical optimality result that makes local polynomial regression the modern frequentist nonparametric estimator of choice.
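The $h_{\mathrm{opt}}$ formula (4.4) is easy to wrap as a helper; the sketch below takes all plug-in quantities as arguments, and every value in the example call is hypothetical rather than a notebook result:

```python
from math import factorial

def amse_optimal_h(n, p, sigma2, m_deriv, f_x, B_p, R_star):
    """Pointwise AMSE-optimal bandwidth from (4.4): sigma2 = noise variance, m_deriv = m^(p+1)(x),
    f_x = design density at x, B_p and R_star = bias/variance constants of K*_p."""
    numerator = R_star * sigma2 * factorial(p + 1) ** 2
    denominator = 2.0 * (p + 1) * (B_p * m_deriv) ** 2 * f_x * n
    return (numerator / denominator) ** (1.0 / (2 * p + 3))

# Hypothetical plug-in values, for illustration only:
print(amse_optimal_h(n=200, p=1, sigma2=0.09, m_deriv=-30.0, f_x=1.0, B_p=1.0, R_star=0.282))
```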
5. The boundary and the odd-vs-even degree story
5.1 The question this section answers
At interior points of the support, §4 gave us a clean rate ladder: odd $p$ delivers bias $O(h^{p+1})$, even $p$ collects a parity bonus to $O(h^{p+2})$, and within each even/odd pair the interior rate and the variance constant match. The boundary is a different story. Kernel Regression §8 gave us part of it: NW at the boundary has bias $O(h)$ — one order worse than its interior $O(h^2)$ rate — and local-linear regression restores $O(h^2)$ uniformly. What about higher $p$? Does the §4 rate ladder hold uniformly across the support, or does the boundary degrade some degrees more than others?
This is the question. The answer turns on the boundary-truncated kernel moments and their interaction with the moment-matching identities of §3.
5.2 The boundary-truncated equivalent kernel
§3.4 already introduced the boundary-truncated equivalent kernel $K^*_{p,c}$, where $c = x/h$ measures the distance from the support edge in bandwidth units. The truncated kernel moments are $\mu_{k,c} = \int_{-c}^{\infty}u^k K(u)\,du$, the truncated moment matrix is $M(c)$, and the equivalent kernel satisfies the moment-matching identities
$$\int u^q\,K^*_{p,c}(u)\,du = \delta_{0q}, \qquad q = 0, \dots, p,$$
by construction (§3.3, Property 1) — the proof only uses that the $q$-th column of $M(c)$ is the vector of truncated moments $(\mu_{q,c}, \dots, \mu_{q+p,c})^\top$, an argument untouched by the truncation. What doesn’t survive the truncation is the checkerboard zero pattern that gave us the parity bonus at interior. With $c$ finite, the odd moments $\mu_{1,c}, \mu_{3,c}, \dots$ are no longer zero, $M(c)$ has full nonzero structure, and the first row of $M(c)^{-1}$ has nonzero entries at every index — both odd and even.
The §4 bias derivation goes through unchanged with the truncated moments.
Theorem 3 (Boundary bias formula).
At a boundary point $x = ch$ with $c \ge 0$ fixed, under the conditions of Theorem 1,
$$\operatorname{Bias}\big[\hat m_h(x)\mid\mathbf X\big] = \frac{m^{(p+1)}(x)}{(p+1)!}\,B_p(c)\,h^{p+1} + o_P\big(h^{p+1}\big),$$
where $B_p(c) = \int u^{p+1}K^*_{p,c}(u)\,du$ is the boundary analog of the §4 bias constant.
5.3 The boundary-rate ladder
Computing $B_p(0)$ for the Gaussian (the notebook does this via numerical integration on the truncated moments) gives, for $p = 0, \dots, 4$, nonzero constants and the boundary rates below:
| $p$ | leading boundary rate |
|---|---|
| 0 | $h$ |
| 1 | $h^2$ |
| 2 | $h^3$ |
| 3 | $h^4$ |
| 4 | $h^5$ |
All five constants are nonzero — the parity-zero pattern that simplified §4.3 is gone at the boundary. The boundary bias rate, read off from Theorem 3, improves monotonically with $p$: $h^{p+1}$ for every $p$. There is no “even $p$ falls an extra order behind” effect at the leading-order asymptotic rate in the Gaussian-kernel-on-uniform-design setting that §1’s toy provides — even $p$ simply loses its interior parity bonus, nothing worse.
The thing odd $p$ retains, even $p$ doesn’t. What separates odd from even at the boundary is design-adaptivity. From the §4.2 derivation, the leading-bias formula at odd $p$ involves only $m^{(p+1)}(x)$. At even $p$, the next-order term enters with a coefficient that involves $f'(x)/f(x)$ — the design-density correction that NW also carries.
In the boundary regime where Theorem 3 is the dominant term, this is a constant-level distinction, not a rate-level one. The simplest sharp statement is therefore: odd $p$ gives uniform bias rate $h^{p+1}$ with design-adaptive constants across the support; even $p$ gives the same uniform $h^{p+1}$ rate at leading order but with non-adaptive boundary constants.
5.4 The practical recommendation: $p \in \{1, 3\}$
Combining the §4 interior story and the §5.3 boundary story, the bias-rate ladder uniformly across the support is:
| $p$ | interior rate | boundary rate | uniform rate | design-adaptive? |
|---|---|---|---|---|
| 0 | $h^2$ | $h$ | $h$ | no |
| 1 | $h^2$ | $h^2$ | $h^2$ | yes |
| 2 | $h^4$ | $h^3$ | $h^3$ | partially |
| 3 | $h^4$ | $h^4$ | $h^4$ | yes |
| 4 | $h^6$ | $h^5$ | $h^5$ | partially |
| 5 | $h^6$ | $h^6$ | $h^6$ | yes |
Odd $p$ gives a matched uniform rate (interior and boundary both $h^{p+1}$); even $p$ gives a mismatched uniform rate (boundary one order worse than the parity-bonus interior). Combined with the variance penalty growing with $p$, the Pareto-optimal degrees in the (uniform rate, complexity) trade-off are the odd ones.
In practice:
- $p = 1$ — local-linear: the workhorse. Uniform $h^2$ bias, $R(K^*_1) = R(K)$ (no variance penalty over NW), design-adaptive. R’s `lowess` and statsmodels’s nonparametric `lowess` both default here.
- $p = 3$ — local-cubic: when you need second derivatives (§9) or your application earns the $h^4$ rate. Variance constant roughly $1.7\times$ local-linear’s, but the bias improvement is usually worth it for moderate $n$.
- $p = 2$ skipped: no interior-rate advantage over $p = 3$ ($h^4$ vs $h^4$ at interior, but $h^3$ vs $h^4$ at boundary), a variance penalty essentially equal to $p = 3$’s, and a second-derivative estimate with the wrong parity (§9).
- $p \ge 4$: rarely worth it — the variance constant grows fast, bandwidth sensitivity grows fast, and the $(p+1)$-times-differentiability assumption is hard to defend.
6. Boundary behavior across the full support
§5 measured the bias rate at one boundary point $x = ch$. This section zooms out: the bias as a function of $x$, plotted across the full support at fixed bandwidth. The §3 equivalent kernel adapts smoothly to location through $c = x/h$ (and its mirror image at the right edge), and that adaptation is what gives local polynomial regression its uniform rate.
6.1 The location-aware equivalent kernel
The boundary parameter $c = x/h$ is a smooth function of evaluation location. At the strict boundary $x = 0$, $c = 0$ — only half of a symmetric kernel’s mass is available. For $x \gg h$ (deep interior), $c$ is large and the truncation is invisible — $K^*_{p,c} \approx K^*_p$, the interior form from §3. Between these extremes, $K^*_{p,c}$ interpolates continuously, reweighting itself to compensate for whichever portion of the kernel mass falls outside the support at the current $x$.
The location-awareness is automatic — we never explicitly construct a “boundary kernel.” We just solve the same WLS problem (2.2) at every evaluation point, and the kernel-moment matrix absorbs the boundary truncation through its empirical moments.
6.2 Why $p \ge 1$ removes the boundary asymmetry
The moment-matching identities are the entire mechanism. From §3.3 and §5.2, for every $c$,
$$\int u^q\,K^*_{p,c}(u)\,du = \delta_{0q}, \qquad q = 0, \dots, p.$$
The Taylor-expansion bias derivation (§4.1) substitutes $m(x + hu)$ inside the equivalent-kernel integral, and the moment-matching identities kill the first $p$ powers of $hu$ in the expansion. The first surviving term is at order $(hu)^{p+1}$, contributing the $h^{p+1}$ leading bias.
For NW ($p = 0$), the moment-matching identity covers only $q = 0$ — not $q = 1$. At interior with symmetric $K$, $\int u\,K^*_0(u)\,du = 0$ holds for free by symmetry. At the boundary (truncated to $u \ge -c$ with $c$ finite), the symmetry argument fails and $\int u\,K^*_{0,c}(u)\,du \ne 0$. The first-order term in the Taylor expansion survives, contributing $O(h)$ to the bias — the famous NW boundary bias.
For local-linear ($p = 1$), the moment-matching identity now covers both $q = 0$ and $q = 1$ regardless of $c$. The first-order bias term vanishes uniformly across the support. The remaining leading order is $h^2$, the same as interior — boundary bias and interior bias are at the same rate.
For local-cubic ($p = 3$), all four identities for $q = 0, 1, 2, 3$ hold uniformly. The first three post-constant Taylor terms vanish, leaving leading bias rate $h^4$ uniformly across the support.
6.3 The full-domain bias figure
The empirical bias profile across $x$ at fixed $h$ on the §1 toy makes the uniformity concrete.
Shaded regions show one bandwidth h on each side of the support boundary (where the kernel mass starts to truncate). B = 30 MC replicates per p (live); the 06_bias_profile.png static figure uses B = 500 for cleaner curves.
The takeaway, visible in one figure: NW pays a boundary tax that local-linear removes and local-cubic removes more thoroughly. The §1 close (“local-cubic does both — slope and curvature — and tracks the sine more tightly still”) reflects this rate uniformity, not just constant improvements.
7. Bandwidth selection at general $p$
Kernel Regression §5 covered four bandwidth selectors for NW: Silverman’s rule of thumb, plug-in, leave-one-out cross-validation, and generalized cross-validation. Three of the four — LOO-CV, GCV, and the plug-in family — extend essentially unchanged to general via the smoother-matrix machinery already implicit in §3. Silverman’s rule is NW-specific (it borrows from kernel density estimation) and doesn’t transfer; the others do.
7.1 The smoother matrix at degree $p$
Local polynomial regression at every $p$ is a linear smoother (§3.1). When we evaluate the estimator at the design points themselves, the fitted values stack into
$$\hat{\mathbf m} = \big(\hat m_h(X_1), \dots, \hat m_h(X_n)\big)^\top = L\,Y,$$
where $L = L_{p,h}$ is the $n \times n$ smoother matrix at degree $p$. Its $(i, j)$ entry $L_{ij} = l_j(X_i)$ is the equivalent-kernel weight on observation $j$ when evaluating at $X_i$. Each row sums to one (the $q = 0$ moment-matching identity at the empirical level — partition-of-unity for the $l_j$).
The diagonal collapses cleanly. At $x = X_i$, the design row for observation $i$ is $e_1^\top$ and its kernel weight is $K_h(0)$. So
$$L_{ii} = l_i(X_i) = K_h(0)\;e_1^\top\big(X_{X_i}^\top W_{X_i} X_{X_i}\big)^{-1}e_1.$$
This is the “self-influence” of observation $i$ on its own fit. The trace $\nu = \operatorname{tr}(L)$ is the effective degrees of freedom of the local-polynomial fit at bandwidth $h$ and degree $p$ — small $h$ gives large $\nu$ (close to $n$, near-interpolation), large $h$ gives small $\nu$ (close to $p + 1$, near-global-polynomial fit). Higher $p$ at fixed $h$ produces a larger $\nu$ — each evaluation point gets a more flexible local fit, making the smoother “hungrier” for degrees of freedom.
7.2 Leave-one-out cross-validation
The LOO-CV objective is
$$\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^n\Big(Y_i - \hat m_h^{(-i)}(X_i)\Big)^2,$$
where $\hat m_h^{(-i)}$ is the local polynomial fit excluding observation $i$. Computed naively, this requires $n$ separate fits per bandwidth — prohibitive in any sweep over $h$. The leave-one-out identity (Hastie & Tibshirani 1990, applicable to any linear smoother) reduces it to one fit:
Theorem 4 (LOO-CV closed-form identity).
For any linear smoother with smoother matrix $L$,
$$Y_i - \hat m_h^{(-i)}(X_i) = \frac{Y_i - \hat m_h(X_i)}{1 - L_{ii}}.$$
Proof.
Construct a perturbed response vector $\tilde Y$ with $\tilde Y_j = Y_j$ for $j \ne i$ and $\tilde Y_i = \hat m_h^{(-i)}(X_i)$ (the LOO prediction at $X_i$). The local-polynomial fit using $\tilde Y$ must agree with $\hat m_h^{(-i)}$ at $X_i$, because replacing $Y_i$ by the prediction we’d have made anyway doesn’t change the LOO fit there. So
$$\hat m_h^{(-i)}(X_i) = \sum_{j} L_{ij}\,\tilde Y_j = \sum_{j\ne i}L_{ij}\,Y_j + L_{ii}\,\hat m_h^{(-i)}(X_i).$$
Solving for $\hat m_h^{(-i)}(X_i)$,
$$\hat m_h^{(-i)}(X_i) = \frac{\sum_{j\ne i}L_{ij}\,Y_j}{1 - L_{ii}} = \frac{\hat m_h(X_i) - L_{ii}Y_i}{1 - L_{ii}},$$
and the LOO residual is
$$Y_i - \hat m_h^{(-i)}(X_i) = Y_i - \frac{\hat m_h(X_i) - L_{ii}Y_i}{1 - L_{ii}} = \frac{Y_i - \hat m_h(X_i)}{1 - L_{ii}}.$$
∎
The identity gives LOO-CV in closed form at the cost of one full fit and the diagonal of $L$.
7.3 Generalized cross-validation
GCV (Craven & Wahba 1979) replaces the per-point in the LOO-CV formula with the average :
Two practical advantages over LOO-CV: only the trace is needed (one scalar per bandwidth), and GCV is invariant to orthogonal rotations of the response (more robust to outlier ). The asymptotic minimizers of LOO-CV and GCV coincide; GCV’s choice tends to be slightly larger in finite samples and is the more common default in production smoothers.
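Both selectors need only one fit per bandwidth once the smoother matrix is available. A sketch, reusing `gaussian_kernel` and the einsum pattern from §2.4 (helper names illustrative):

```python
def smoother_matrix(X, h, p, K_fn=gaussian_kernel):
    """n x n smoother matrix L of the degree-p fit, rows = evaluation at the design points."""
    X = np.asarray(X, dtype=float)
    u = (X[None, :] - X[:, None]) / h                  # u[i, j] = (X_j - X_i) / h
    w = K_fn(u) / h
    Z = u[..., None] ** np.arange(p + 1)               # (n, n, p+1) scaled design
    Wz = w[..., None] * Z
    A = np.einsum("inj,ink->ijk", Wz, Z)               # per-point Gram matrices
    rhs = np.zeros((len(X), p + 1, 1)); rhs[:, 0, 0] = 1.0
    v = np.linalg.solve(A, rhs)[..., 0]                # v_i = A_i^{-1} e_1
    return np.einsum("inj,ij->in", Wz, v)              # row i of L is Wz_i @ v_i

def loo_cv_and_gcv(X, Y, h, p):
    """LOO-CV via the Theorem 4 identity, plus GCV, from a single fit."""
    Y = np.asarray(Y, dtype=float)
    L = smoother_matrix(X, h, p)
    resid = Y - L @ Y
    loo = float(np.mean((resid / (1.0 - np.diag(L))) ** 2))
    gcv = float(np.mean((resid / (1.0 - np.trace(L) / len(Y))) ** 2))
    return loo, gcv
```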
7.4 Plug-in selectors
The Ruppert–Sheather–Wand (1995) plug-in family estimates the AMISE-optimal bandwidth (4.4) directly from data, by replacing the unknown functionals with pilot estimates. For $p = 1$ at interior with symmetric $K$, (4.4) integrates to the global rule
$$h_{\mathrm{AMISE}} = \left[\frac{\sigma^2\,R(K)\,(b - a)}{\mu_2(K)^2\,\int\big(m''(x)\big)^2 f(x)\,dx\;\,n}\right]^{1/5},$$
where $[a, b]$ is the estimation interval. Plug-in estimates the two unknown functionals: $\sigma^2$ via Rice’s estimator (consecutive-observation differences) and $\int (m'')^2 f$ via a pilot $\hat m''$ fit at degree 3. The pilot bandwidth has its own selection problem; RSW solve the recursion via a normal-reference rule of thumb.
For $p = 3$, (4.4) calls for $\int\big(m^{(4)}(x)\big)^2 f(x)\,dx$ — a fourth-derivative estimate. The pilot recursion deepens, and the variance of $\hat m^{(4)}$ overwhelms the plug-in’s nominal advantage. For local-cubic, plug-in selectors are theoretically clean but practically fragile — most software defaults to LOO-CV or GCV at $p = 3$.
7.5 How $h_{\mathrm{opt}}$ shifts with $p$
The §4.5 $n^{-1/(2p+3)}$ scaling predicts that the AMISE-optimal bandwidth grows with $p$. The notebook’s LOO-CV sweep on the §1 toy at $n = 200$ over a 25-point log-spaced bandwidth grid produces minimizers that shift right with $p$, consistent with the scalings:
| $p$ | $h_{\mathrm{opt}}$ scaling |
|---|---|
| 0 | $n^{-1/5}$ |
| 1 | $n^{-1/5}$ |
| 3 | $n^{-1/9}$ |
In practice, the shift is a useful diagnostic: if you’re running LOO-CV at multiple degrees on the same data and the minimizers don’t shift right with $p$, something’s wrong with your pilot or your CV implementation.
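A minimal version of that diagnostic, reusing `loo_cv_and_gcv` above and the `X, Y` sample from the §2.4 usage sketch (the grid range is illustrative):

```python
h_grid = np.geomspace(0.02, 0.5, 25)
for p in (0, 1, 3):
    cv_scores = [loo_cv_and_gcv(X, Y, h, p)[0] for h in h_grid]
    h_star = float(h_grid[int(np.argmin(cv_scores))])
    print(f"p = {p}: LOO-CV minimizer h = {h_star:.3f}")   # should shift right as p grows
```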
8. Multivariate local polynomials
Kernel Regression §6 already extended NW to $\mathbb{R}^d$ via product kernels and bandwidth matrices. Local-polynomial generalizes that extension cleanly: the design matrix grows from $p + 1$ columns to $\binom{p+d}{d}$ columns, the bandwidth becomes a matrix or a $d$-vector, and everything else from §2–§7 carries over.
8.1 The polynomial-feature design matrix
In $\mathbb{R}^d$, a polynomial of total degree $p$ around evaluation point $x$ is parameterized by multi-indices $\alpha = (\alpha_1, \dots, \alpha_d)$ with $|\alpha| = \alpha_1 + \cdots + \alpha_d \le p$. The corresponding monomial in centered coordinates is $(X_i - x)^\alpha = \prod_{k=1}^d (X_{ik} - x_k)^{\alpha_k}$. Stacking gives the multivariate design matrix with column count
$$N(d, p) = \binom{p + d}{d}.$$
For example: $N(2, 1) = 3$, $N(2, 2) = 6$, $N(2, 3) = 10$, $N(3, 2) = 10$, $N(5, 2) = 21$, $N(10, 2) = 66$, $N(10, 3) = 286$. The column count grows polynomially in $d$ at fixed $p$ — manageable at moderate $d$ but a real concern by $d = 10$ and $p = 3$.
The local-polynomial WLS problem stays in the same form as (2.1):
$$\hat\beta(x) = \operatorname*{arg\,min}_{\beta}\;\sum_{i=1}^n\Big(Y_i - \sum_{|\alpha|\le p}\beta_\alpha\,(X_i - x)^\alpha\Big)^2 K_H(X_i - x),$$
with closed-form $\hat\beta(x) = (X_x^\top W_x X_x)^{-1}X_x^\top W_x\,Y$ exactly as in (2.2), and $\hat m_H(x) = \hat\beta_{\mathbf 0}(x)$ (the intercept multi-index coefficient).
8.2 Kernel weights and bandwidth matrices
Three standard ways to put kernel weights on $\mathbb{R}^d$:
Product kernel with scalar bandwidth. The simplest: $K_H(v) = h^{-d}\prod_{k=1}^d K(v_k/h)$. One bandwidth, no scale awareness.
Product kernel with diagonal bandwidth. When coordinates have different scales: $K_H(v) = \prod_{k=1}^d h_k^{-1}K(v_k/h_k)$, i.e. $H = \operatorname{diag}(h_1^2, \dots, h_d^2)$. The most common choice in practice.
Full positive-definite bandwidth matrix. When coordinates are correlated: $K_H(v) = |H|^{-1/2}K\big(H^{-1/2}v\big)$. Bandwidth-matrix selection is a real engineering problem at $d \ge 2$; diagonal $H$ usually suffices.
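A minimal single-point sketch of the multivariate fit — total-degree features built from multi-indices, a diagonal-bandwidth product Gaussian weight, and one WLS solve. Helper names and the dropped normalizing constant are implementation choices, not the notebook’s code:

```python
from itertools import product

def multi_indices(d, p):
    """All multi-indices alpha with |alpha| <= p; the intercept (0, ..., 0) comes first."""
    return sorted((a for a in product(range(p + 1), repeat=d) if sum(a) <= p), key=sum)

def local_poly_multivariate(X, Y, x0, h, p):
    """Degree-p (total degree) local polynomial estimate of m(x0) for X of shape (n, d)."""
    X = np.asarray(X, dtype=float); Y = np.asarray(Y, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    h = np.broadcast_to(np.asarray(h, dtype=float), x0.shape)
    U = (X - x0) / h                                           # (n, d) scaled coordinates
    w = np.exp(-0.5 * np.sum(U ** 2, axis=1)) / np.prod(h)     # product Gaussian, constant dropped
    Z = np.column_stack([np.prod(U ** np.array(a), axis=1) for a in multi_indices(X.shape[1], p)])
    A = Z.T @ (w[:, None] * Z)
    b = Z.T @ (w * Y)
    return float(np.linalg.solve(A, b)[0])                     # intercept coefficient = hat m(x0)
```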
8.3 The 2-dim toy
Generalize the §1 toy to two dimensions, with $n = 400$ — twice the §1 sample size — to fill the 2-dim cube.
25×25 evaluation grid (live recompute on slider release). The static figure 08_multivariate_local_quadratic.png shows the same fit on a 50×50 grid.
The headline: local-polynomial regression in 2-dim recovers the surface faithfully at this $n$, with residuals concentrated near the boundary of the square — the multivariate analog of §6’s univariate boundary-bias profile.
8.4 The curse of dimensionality
Stone (1980, 1982) showed that for $s$-times-continuously-differentiable regression functions on $\mathbb{R}^d$, no estimator can achieve a uniform AMSE rate faster than $n^{-2s/(2s+d)}$. Local polynomial of degree $p$ achieves this rate with $s = p + 1$ (at odd $p$) — minimax-optimal — but the rate degrades sharply with $d$:
| $d$ | rate at $p = 1$ ($s = 2$) | rate at $p = 3$ ($s = 4$) | $n$ to match $d = 1$, $n = 200$ (at $p = 1$) |
|---|---|---|---|
| 1 | $n^{-4/5}$ | $n^{-8/9}$ | 200 |
| 2 | $n^{-2/3}$ | $n^{-4/5}$ | $\approx 580$ |
| 3 | $n^{-4/7}$ | $n^{-8/11}$ | $\approx 1{,}700$ |
| 5 | $n^{-4/9}$ | $n^{-8/13}$ | $\approx 1.4 \times 10^4$ |
| 10 | $n^{-2/7}$ | $n^{-4/9}$ | $\approx 2.8 \times 10^6$ |
| 20 | $n^{-1/6}$ | $n^{-2/7}$ | $\approx 10^{11}$ |
Higher polynomial degree softens the dimension penalty but only by replacing the curse with a stronger smoothness assumption. In practice, $p$ rarely exceeds 2 in multivariate settings. The escape routes — additive structure (§11), low-rank tensor methods, neural surrogates — are the subject matter of §11 and the broader high-dimensional-statistics literature.
9. Derivative estimation
The Taylor identification from §2.1 — $\beta_j = m^{(j)}(x)/j!$ — gives local polynomial regression a useful side product: derivative estimates fall out of the WLS solve at no additional cost. A single local-cubic fit returns $\hat m(x), \hat m'(x), \hat m''(x), \hat m'''(x)$ at the evaluation point, in one solve.
9.1 Reading derivatives off the WLS coefficients
For any local polynomial fit of degree $p$,
$$\hat m^{(j)}_h(x) = j!\,\hat\beta_j(x), \qquad j = 0, 1, \dots, p.$$
The §4 parity story extends: at interior with symmetric $K$, the leading bias of $\hat m^{(j)}_h(x)$ at degree $p$ is
$$\operatorname{Bias}\big[\hat m^{(j)}_h(x)\big] \;\approx\; \frac{m^{(p+1)}(x)}{(p+1)!}\,c_{j,p}\,h^{\,p+1-j},$$
with $c_{j,p} = j!\,e_{j+1}^\top M^{-1}c_p$ a kernel-moment-matrix combination, whenever $p - j$ is odd. When $p - j$ is even, the leading term involves a parity-zero moment, and the bias rate is one order better — but the next-order constant picks up design-density derivatives, the same design-adaptivity penalty of §4.3.
The “good” parity rule for derivative estimation is therefore:
- $\hat m$ ($j = 0$): want $p - j$ odd, i.e. odd $p$. Use $p = 1$ or $p = 3$.
- $\hat m'$ ($j = 1$): want even $p$ for the cleanest rate. $p = 2$ is technically optimal at interior, but $p = 3$ is the practical choice because it covers $\hat m$ and $\hat m''$ at uniform rates and the rate-suboptimality for $\hat m'$ is mild.
- $\hat m''$ ($j = 2$): want odd $p$. Use $p = 3$.
A single local-cubic ($p = 3$) fit therefore covers the practically interesting derivative range — $j = 0, 1, 2$ — with $\hat m$ and $\hat m''$ at clean uniform rates.
9.2 Bias and variance of $\hat m^{(j)}_h$
The leading-order bias is $O\big(h^{\,p+1-j}\big)$ when $p - j$ is odd. The variance scales much worse than the function-value estimate. From the §3 equivalent-kernel argument with the $H = \operatorname{diag}(1, h, \dots, h^p)$ rescaling, $\hat\beta_j(x) = h^{-j}\,\hat\beta_j^{\text{scaled}}(x)$, and the scaled variance is $O\big(\tfrac{1}{nh}\big)$. So
$$\operatorname{Var}\big[\hat m^{(j)}_h(x)\big] = O\!\Big(\frac{1}{n\,h^{1+2j}}\Big).$$
| $j$ | bias rate ($p - j$ odd) | variance rate |
|---|---|---|
| 0 | $h^{p+1}$ | $(nh)^{-1}$ |
| 1 | $h^{p}$ | $(nh^{3})^{-1}$ |
| 2 | $h^{p-1}$ | $(nh^{5})^{-1}$ |
| 3 | $h^{p-2}$ | $(nh^{7})^{-1}$ |
Variance grows by a factor of $1/h^2$ per derivative order. At $h = 0.1$, say, that’s roughly a $100\times$ variance multiplier per derivative.
9.3 The bandwidth shift for derivative estimation
The AMISE-optimal bandwidth for $\hat m^{(j)}$ at degree $p$ scales as $n^{-1/(2p+3)}$ — the same power of $n$ as for $\hat m$ — but with a $j$-dependent constant that grows with $j$. Don’t use a single bandwidth for everything. Run LOO-CV separately for each derivative target and use the $j$-specific bandwidth.
9.4 Numerical demonstration on the §1 toy
We test derivative recovery at a point where all four of $m, m', m'', m'''$ are nonzero (unlike points where the even derivatives of the sine vanish). The truth, computed analytically, appears in the third column of the table below.
A single local-cubic fit at that point on a 200-point sample reads off all four. The function value and second derivative recover cleanly (per §9.2’s parity rule); the first derivative recovers with moderate noise; the third derivative is dominated by noise at this $n$ and bandwidth.
| j | m̂^(j)(x₀) | m^(j)(x₀) | err |
|---|---|---|---|
| 0 | 0.779 | 0.770 | 0.010 |
| 1 | 4.583 | 4.943 | 0.360 |
| 2 | -35.862 | -27.915 | 7.946 |
| 3 | 36.511 | -175.398 | 211.909 |
Variance grows by ≈1/h² per derivative order — m̂'' is much noisier than m̂, m̂''' worse still.
10. Connections to smoothing splines
Smoothing splines are the other major family of frequentist nonparametric smoothers. They look very different from local polynomial regression — a penalty-based variational problem instead of a kernel-weighted local fit — but Silverman (1984) showed they’re asymptotically equivalent to local-cubic regression with a specific location-adaptive bandwidth.
10.1 The cubic smoothing-spline objective
The cubic smoothing spline minimizes
$$\sum_{i=1}^n\big(Y_i - g(X_i)\big)^2 + \lambda\int\big(g''(t)\big)^2\,dt \tag{10.1}$$
over twice-differentiable functions $g$. As $\lambda \to 0$: interpolation. As $\lambda \to \infty$: the OLS line.
The minimizer is a natural cubic spline. Schoenberg (1964) and Reinsch (1967) showed that the minimizer of (10.1) is piecewise cubic between the design points, continuous in value and first two derivatives across knots, and linear (zero curvature) outside $[X_{(1)}, X_{(n)}]$. The “natural” boundary condition is automatic.
In matrix form, the fitted values at the design points satisfy
$$\hat{\mathbf g} = \big(I + \lambda\,\Omega\big)^{-1}Y,$$
where $\Omega$ is a banded penalty matrix encoding $\int (g'')^2$ at the design points. The smoother matrix $S_\lambda = (I + \lambda\Omega)^{-1}$ makes the smoothing spline a linear smoother. The Reinsch algorithm computes $\hat{\mathbf g}$ in $O(n)$ time. The §7 LOO-CV / GCV machinery extends without change.
10.2 Silverman’s equivalent variable-kernel theorem
Theorem 5 (Silverman (1984) equivalent variable kernel).
Suppose the design density $f$ is continuous and bounded away from zero on $[a, b]$, and $\lambda = \lambda_n \to 0$ at an appropriate rate as $n \to \infty$. Then, for $x$ away from the boundary, the smoothing-spline weight on observation $i$ when evaluating at $x$ satisfies
$$w_i(x) \;\approx\; \frac{1}{n\,h(x)\,f(X_i)}\,K_S\!\Big(\frac{X_i - x}{h(x)}\Big), \qquad h(x) = \Big(\frac{\lambda}{n\,f(x)}\Big)^{1/4},$$
where $K_S$ is the Silverman kernel
$$K_S(u) = \frac{1}{2}\,\exp\!\Big(-\frac{|u|}{\sqrt 2}\Big)\,\sin\!\Big(\frac{|u|}{\sqrt 2} + \frac{\pi}{4}\Big).$$
Two things to register about $K_S$.
It’s a fourth-order kernel. $\int K_S = 1$ and $\int u^q K_S(u)\,du = 0$ for $q = 1, 2, 3$, but $\int u^4 K_S(u)\,du \ne 0$. These are the same moment-matching identities the equivalent kernel $K^*_3$ satisfies at interior with a symmetric base kernel (§3.3, Property 1, with $p = 3$). Both produce $h^4$ leading bias and $O\big(\tfrac{1}{nh}\big)$ variance.
The bandwidth is location-adaptive. Where $f(x)$ is large, $h(x)$ is small (tighter local neighborhood). Where $f(x)$ is small, $h(x)$ is large. This adaptation is automatic — the global $\lambda$ becomes a local $h(x) = (\lambda/(n f(x)))^{1/4}$ through the design density — and it’s something local-polynomial regression doesn’t do at fixed $h$.
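The Silverman kernel is a one-liner, and its fourth-order moment structure can be verified by quadrature (reusing `quad` and NumPy from the §3.3 sketch):

```python
def silverman_kernel(u):
    """K_S(u) = 0.5 * exp(-|u|/sqrt(2)) * sin(|u|/sqrt(2) + pi/4), Silverman (1984)."""
    a = np.abs(u) / np.sqrt(2.0)
    return 0.5 * np.exp(-a) * np.sin(a + np.pi / 4.0)

for q in range(5):
    moment, _ = quad(lambda u: u ** q * silverman_kernel(u), -np.inf, np.inf, limit=200)
    print(q, round(moment, 6))      # expect 1, 0, 0, 0, then a nonzero fourth moment
```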
10.3 Practical comparison
When $f$ is roughly uniform, smoothing splines and local-cubic produce nearly identical fits at matched smoothing levels. On the §1 toy, matching the smoothing spline’s $\lambda$ to the effective degrees of freedom of the local-cubic fit gives curves with small mean pointwise difference — visually indistinguishable.
Smoothing splines win when: heterogeneous design density, boundary behavior matters, single hyperparameter is preferable, Sobolev-norm formulation is natural.
Local-polynomial regression wins when: multivariate (thin-plate splines are harder), derivative estimation is needed, local control over smoothing is required, embarrassingly-parallel evaluation across a grid matters.
The Silverman kernel K_S is asymptotically equivalent to the cubic smoothing spline (Silverman 1984). We use it as a stand-in here — a true Reinsch-algorithm spline implementation is outside this PR's scope; the static figure 10_local_cubic_vs_smoothing_spline.png shows the comparison against scipy's `make_smoothing_spline`.
11. Connections to additive models
§8.4 left us with the curse: at moderate-to-large $d$, the local-polynomial rate becomes too slow for realistic sample sizes. The additive-model trick is to assume the regression function has structure that escapes the dimension penalty.
11.1 Additive structure as a curse dodge
The additive model assumes
$$m(x) = \alpha + \sum_{k=1}^d g_k(x_k). \tag{11.1}$$
Each $g_k$ depends on a single coordinate, so each can be estimated at the univariate Stone rate $n^{-2(p+1)/(2p+3)}$ — independent of $d$. At $d = 3$ with local-linear components ($p = 1$):
| approach | AMSE rate |
|---|---|
| Multivariate local-polynomial ($p = 1$, $d = 3$) | $n^{-4/7}$ |
| Additive (GAM) with local-poly components | $n^{-4/5}$ |
A severalfold AMSE improvement at realistic sample sizes, growing with $n$. The cost is the structural assumption: no interactions like $x_1 x_2$ or $g(x_1, x_2)$. Misspecification bias is $O(1)$ — it doesn’t go to zero with $n$.
The additive form (11.1) goes by GAM (Hastie & Tibshirani 1990), with “generalized” referring to the extension to non-Gaussian responses via link functions. For the regression case, the GAM reduces to (11.1) with local-polynomial component estimators inside the backfitting algorithm.
11.2 The backfitting algorithm
Algorithm 1 (Backfitting (Friedman & Stuetzle 1981; Buja, Hastie & Tibshirani 1989)).
Fit the additive model (11.1) by iteratively smoothing partial residuals against each covariate.
Initialization. $\hat\alpha = \bar Y$, $\hat g_k \equiv 0$ for $k = 1, \dots, d$.
Iteration. Cycle $k = 1, \dots, d$. Compute the partial residual
$$R_i^{(k)} = Y_i - \hat\alpha - \sum_{\ell \ne k}\hat g_\ell(X_{i\ell}),$$
then update $\hat g_k$ by smoothing the partial residuals against $X_{\cdot k}$ via a univariate local polynomial,
$$\hat g_k \;\leftarrow\; \text{local-poly smooth of }\big\{(X_{ik}, R_i^{(k)})\big\}_{i=1}^n, \qquad \hat g_k \;\leftarrow\; \hat g_k - \frac{1}{n}\sum_{i=1}^n\hat g_k(X_{ik}),$$
with the centering subtraction enforcing $\sum_i \hat g_k(X_{ik}) = 0$ (identifiability).
Convergence. Iterate until the maximum change in the fitted components falls below a tolerance. Under mild conditions (Buja–Hastie–Tibshirani 1989, Theorem 9), backfitting converges geometrically to the unique additive decomposition.
On the §11 3-dim additive toy, the notebook’s local-linear backfitting GAM converges in 6 iterations to tolerance, with a geometric convergence ratio per iteration.
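A compact backfitting sketch using the §2.4 `local_polynomial` as the component smoother — the function name and convergence bookkeeping are illustrative, not the notebook’s implementation:

```python
def backfit_additive(X, Y, h, p=1, max_iter=50, tol=1e-6):
    """Fit the additive model (11.1) by backfitting; X is (n, d). Returns (alpha_hat, g_hat),
    where g_hat[:, k] holds the fitted k-th component at the design points."""
    X = np.asarray(X, dtype=float); Y = np.asarray(Y, dtype=float)
    n, d = X.shape
    alpha = float(Y.mean())
    g = np.zeros((n, d))
    for _ in range(max_iter):
        max_change = 0.0
        for k in range(d):
            partial = Y - alpha - g.sum(axis=1) + g[:, k]          # partial residual R^(k)
            new_gk = local_polynomial(X[:, k], partial, X[:, k], h, p)
            new_gk -= new_gk.mean()                                # centering for identifiability
            max_change = max(max_change, float(np.max(np.abs(new_gk - g[:, k]))))
            g[:, k] = new_gk
        if max_change < tol:
            break
    return alpha, g
```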
11.3 Why the components have univariate rates
On convergence, $\hat g_k$ equals the local-polynomial fit of the converged partial residuals against the univariate $X_{\cdot k}$. The bias-variance analysis of §4 applies directly, with noise variance enlarged by the contribution of the other components — bounded if each component is estimable. The bias rate is the univariate $h^{p+1}$, independent of $d$.
The cost is the misspecification bias. If the true $m$ has interaction structure that (11.1) can’t capture, $\hat m$ converges to the best additive approximation of $m$, not to $m$ itself. The standard diagnostic is residual analysis on pairwise covariate plots; if systematic 2-dim structure shows up, the additive model is missing interactions.
12. Connections, applications, and limits
12.1 Local polynomials as nuisance estimators
Local polynomial regression’s most consequential modern role isn’t as a final estimator — it’s as a nuisance estimator in semiparametric and debiased-inference pipelines.
The canonical setting is the partially-linear model:
$$Y_i = \theta\,D_i + g(X_i) + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i \mid D_i, X_i] = 0,$$
with $\theta$ a low-dimensional parameter of interest and $g$ an unknown nuisance function. Robinson (1988) showed that $\theta$ can be estimated at the parametric $\sqrt{n}$ rate — despite the nonparametric nuisance — by partialling out local-polynomial estimates of the conditional expectations $\mathbb{E}[Y \mid X]$ and $\mathbb{E}[D \mid X]$.
The modern extension is double/debiased machine learning (Chernozhukov et al. 2018), which generalizes Robinson to high-dimensional nuisances via cross-fitting and Neyman orthogonality. Local-polynomial fits the framework: at $p = 3$ the $h^4$ bias rate comfortably meets the $o(n^{-1/4})$ nuisance-rate requirement of the cross-fitted, Neyman-orthogonal score, at a bandwidth chosen by LOO-CV (§7).
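A sketch of the Robinson double-residual step with local-linear nuisance fits — no cross-fitting, no standard errors, just the partialling-out idea (helper name illustrative, reusing `local_polynomial` from §2.4):

```python
def robinson_theta(D, X, Y, h, p=1):
    """Estimate theta in Y = theta*D + g(X) + eps by residual-on-residual least squares."""
    D = np.asarray(D, dtype=float); X = np.asarray(X, dtype=float); Y = np.asarray(Y, dtype=float)
    d_resid = D - local_polynomial(X, D, X, h, p)      # D - E-hat[D | X]
    y_resid = Y - local_polynomial(X, Y, X, h, p)      # Y - E-hat[Y | X]
    return float(np.sum(d_resid * y_resid) / np.sum(d_resid ** 2))
```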
Forward pointer: semiparametric-inference (coming soon) will build out the Robinson machinery and the modern Chernozhukov-style extensions.
12.2 Varying-coefficient models
Hastie & Tibshirani (1993) introduced the varying-coefficient model:
where each is an unknown smooth function. The estimator is local-linear-in- regression weighted by a univariate kernel in . Higher-degree variants extend to local polynomial in as well.
In time-series, varying-coefficient models go by time-varying coefficient regression. The §3 equivalent-kernel theory extends with the obvious modifications.
12.3 Where local polynomials break down
- High dimension without additive structure (§8.4 curse).
- Sparse design at boundary — local Gram matrix becomes ill-conditioned.
- Heavy-tailed noise — robust alternatives (LOESS with bisquare weights) handle this.
- Heteroscedastic noise — variable-bandwidth local polynomial (Fan & Gijbels 1995) handles this.
- Discontinuities — change-point detection + segment-wise fitting.
12.4 Forward pointers within formalML
| topic | track | relationship |
|---|---|---|
kernel-regression | T2 (shipped) | Predecessor; shared module, shared §1 toy DGP, shared bandwidth-selection machinery |
density-ratio-estimation (coming soon) | T2 | Uses the same kernel-methods shared module for estimation |
quantile-regression | T4 (shipped) | Pinball-loss analog of local polynomial regression |
semiparametric-inference (coming soon) | T6 | Local-cubic as nuisance estimator (§12.1) |
gaussian-processes | T5 | Smoothing-spline / GP / local-cubic asymptotic-equivalence triangle (§10) |
12.5 Closing synthesis
Local polynomial regression is the workhorse univariate nonparametric smoother of the modern toolkit. It generalizes Kernel Regression §8’s local-linear fix to arbitrary degree $p$, achieves Stone-minimax-optimal rates, gives derivative estimates as a single-fit byproduct, extends cleanly to multivariate via the polynomial-feature design matrix, and survives boundary effects via the equivalent-kernel adaptation that emerges automatically from the WLS structure.
The practical recommendation distills to $p = 1$ as the production default and $p = 3$ when derivative estimates are needed or the smoothness regime warrants the faster rate. $p = 2$ doesn’t earn its variance penalty under uniform-rate analysis; $p \ge 5$ is theoretically clean but practically fragile.
The deeper structural fact — that every local polynomial estimator is asymptotically a kernel smoother with a degree-aware, location-aware equivalent kernel — is the conceptual centerpiece, and it’s what ties the topic to its neighbors. Smoothing splines (§10) are local-cubic in disguise; GAMs (§11) are coordinate-wise local-polynomials chained by backfitting; semiparametric inference (§12.1) treats local-polynomial as the nuisance-estimator default. The equivalent kernel is the lens that makes all of these the same algebraic object, viewed from different angles.
Connections and Further Reading
Within formalML:
- Kernel Regression (T2). Direct predecessor. The §1 toy, the kernel substrate, and the bandwidth-selection machinery all carry over from kernel-regression unchanged. This topic adds the equivalent-kernel formulation, the rate ladder at general $p$, the boundary story, multivariate / derivative / smoothing-spline / GAM connections.
- Concentration Inequalities. Transitive prerequisite for the convergence-rate machinery, inherited via kernel-regression. The §4 bias and variance derivations are stated to leading order; the rigorous version (uniform consistency, asymptotic normality) requires concentration-inequality bounds for the kernel-weighted partial sums.
- Gaussian Processes (T5). §10 connects local-cubic to smoothing splines via Silverman (1984), and smoothing splines to GP regression via the spline-as-GP-posterior view. The fourth-order Silverman kernel satisfies the same moment-matching identities as $K^*_3$ — both produce $h^4$ leading bias and $O\big(\tfrac{1}{nh}\big)$ variance.
- Conformal Prediction (T4). §12.1 forward-points to local-cubic as the natural base estimator for split conformal prediction with sharper convergence than NW. The faster bias rate at $p = 3$ means tighter undersmoothing thresholds and tighter conformal intervals.
- Density-Ratio Estimation (coming soon). Uses the same kernel-methods shared module from this topic and the bandwidth-selection methods from §7, applied to density-ratio estimation instead of regression.
- Semiparametric Inference (coming soon). Local-cubic regression as the canonical nuisance estimator in the Robinson (1988) double-residual approach and the Chernozhukov et al. (2018) double/debiased ML framework — see §12.1.
Cross-site prerequisites:
- formalStatistics: Linear Regression — the parametric counterpart. Locally-weighted least squares with a degree-$p$ polynomial design matrix is this topic’s central machinery; the normal equations are the linear-regression formulas applied at every evaluation point with location-adaptive design and weight matrices. The §3 equivalent-kernel formulation is the classical projection-onto-column-space picture from linear regression read in kernel-weighted units.
- formalStatistics: Kernel Density Estimation — the density-estimation sibling. The same kernel families and the same bandwidth-moment conditions port directly. The bandwidth-selection methods (Silverman, plug-in, LOO-CV, GCV) transfer to local polynomial regression via the same linear-smoother machinery; the §3.3 moment-matching identities are the local-polynomial analog of KDE’s higher-order kernels.
- formalCalculus: Multiple Integrals — the multivariate-integration substrate. The change-of-variables substitution that converts the kernel-weighted moments into kernel-moment integrals throughout §3.2, the boundary-truncated moments $\mu_{k,c}$, and the §8 multivariate total-degree expansion all rest on standard multivariate-integration moves.
- formalCalculus: Mean Value Theorem and Taylor Expansion — the Taylor-expansion machinery. Taylor expansion to order $p + 1$ is the load-bearing technical move of the §4 bias derivation at general degree $p$. The §4.1 residual-decomposition argument uses the column-space projection identity to absorb the first $p + 1$ Taylor terms, leaving only the $(p+1)$-st derivative remainder to drive the bias.
References & Further Reading
- paper Spline functions and the problem of graduation — Schoenberg, I. J. (1964)
- paper Consistent nonparametric regression — Stone, C. J. (1977)
- paper Robust locally weighted regression and smoothing scatterplots — Cleveland, W. S. (1979)
- paper Optimal rates of convergence for nonparametric estimators — Stone, C. J. (1980)
- paper Projection pursuit regression — Friedman, J. H. & Stuetzle, W. (1981)
- paper Optimal global rates of convergence for nonparametric regression — Stone, C. J. (1982)
- paper Spline smoothing: The equivalent variable kernel method — Silverman, B. W. (1984)
- paper Locally weighted regression: An approach to regression analysis by local fitting — Cleveland, W. S. & Devlin, S. J. (1988)
- paper Root-N-consistent semiparametric regression — Robinson, P. M. (1988)
- paper Linear smoothers and additive models — Buja, A., Hastie, T. & Tibshirani, R. (1989)
- book Generalized Additive Models — Hastie, T. & Tibshirani, R. (1990)
- paper Design-adaptive nonparametric regression — Fan, J. (1992)
- paper Local linear regression smoothers and their minimax efficiencies — Fan, J. (1993)
- paper Varying-coefficient models — Hastie, T. & Tibshirani, R. (1993)
- paper Multivariate locally weighted least squares regression — Ruppert, D. & Wand, M. P. (1994)
- paper Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation — Fan, J. & Gijbels, I. (1995)
- paper An effective bandwidth selector for local least squares regression — Ruppert, D., Sheather, S. J. & Wand, M. P. (1995)
- book Kernel Smoothing — Wand, M. P. & Jones, M. C. (1995)
- book Local Polynomial Modelling and Its Applications — Fan, J. & Gijbels, I. (1996)
- book Generalized Additive Models: An Introduction with R — Wood, S. N. (2006)
- book Introduction to Nonparametric Estimation — Tsybakov, A. B. (2009)
- paper Double/debiased machine learning for treatment and structural parameters — Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. & Robins, J. (2018)