
Local Polynomial Regression

Degree-$p$ generalization of local-linear regression — the Fan–Gijbels family, the equivalent-kernel formulation that makes every degree-$p$ estimator a kernel smoother with a degree-aware reweighting $K^*_p$, the Ruppert–Wand bias ladder, and the odd-vs-even degree story that distills the practical recommendation to $p \in \{1, 3\}$.

Prerequisites: Kernel Regression

1. From local-linear to degree-$p$

Kernel Regression closed with a boundary problem and a fix. The Nadaraya–Watson estimator does fine in the interior of the support but biases toward the data side at the boundary — the kernel mass that would sit outside the support gets reflected back in by the normalization, dragging $\hat m_h(0)$ toward whatever's nearby. The §8.2 fix was local-linear regression: fit a line, not a constant, in each kernel-weighted neighborhood. The line absorbs the slope information that NW was forced to merge into a level estimate, and the boundary bias drops from $O(h)$ to $O(h^2)$ — uniformly, across the whole support.

That fix is the simplest member of a family. The §8.3 forward-pointer named the family — degree-$p$ local polynomial regression, due to Fan, Gijbels, and collaborators in a string of 1990s papers — and left the rest for this topic. We pick it up here.

1.1 The local-cubic teaser

Before the algebra, the picture. Below are NW (degree 0), local-linear (degree 1), local-quadratic (degree 2), and local-cubic (degree 3) fits to the same 200-point sample from

$$X \sim \mathrm{Uniform}(0,1), \qquad Y = \sin(2\pi X) + \tfrac{X}{2} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 0.2^2),$$

at the same bandwidth $h = 0.08$ — the same toy that ran through Kernel Regression and the same seed (42).

Figure: Four-panel teaser — NW ($p = 0$), local-linear ($p = 1$), local-quadratic ($p = 2$), and local-cubic ($p = 3$) at $h = 0.08$ on the §1 toy. NW clips the peaks and troughs of the sinusoid; local-linear recovers the slopes; local-cubic does both slope and curvature. The skipped degree 2 is the punchline of §5.

Three things to notice. The NW fit clips the peaks and troughs of the sine — degree-zero polynomials are constants, so the fit at $x_0$ is a kernel-weighted average of nearby $Y_i$ values, and averaging across a neighborhood with a steep slope flattens the peak. Local-linear recovers the slopes — the line through the kernel-weighted neighborhood at the peak isn't asked to also fit the curvature, so the level estimate at the peak is no longer dragged down by the off-center neighbors. Local-cubic does both — slope and curvature — and tracks the sine more tightly still, particularly at the inflection points where local-linear has to compromise.

You’ll notice we skipped degree 2. That’s not an oversight; that’s the punchline of §5.

1.2 The Fan–Gijbels family

The general estimator, for a degree-$p$ polynomial, solves the kernel-weighted least-squares problem

$$\hat{\boldsymbol\beta}(x) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^n \Big(Y_i - \beta_0 - \beta_1(X_i - x) - \cdots - \beta_p(X_i - x)^p\Big)^2 \, K\!\left(\frac{X_i - x}{h}\right)$$

at every evaluation point $x$. The function-value estimate is $\hat m_h(x) = \hat\beta_0(x)$. The derivative estimates fall out of the same fit as $\hat m^{(j)}_h(x) = j!\,\hat\beta_j(x)$ for $j = 1, \dots, p$ — a notably useful side-product we cash in during §9.

Three results structure the topic and earn the family its place in the modern nonparametric toolkit:

  1. The equivalent-kernel formulation (§3). Every degree-$p$ local polynomial estimator can be rewritten as a kernel smoother $\hat m_h(x) = \sum_i K^*_{h,p,x}(X_i)\,Y_i$, where $K^*$ is not the original kernel $K$ — it's a degree-aware, location-aware reweighting of $K$. The bandwidth-selection theory and the implementation tricks port from Kernel Regression because of this; the higher-degree estimators are still kernel smoothers, just smarter ones.
  2. The bias ladder (§4). For $(p+1)$-times-differentiable $m$, degree-$p$ local polynomial achieves $O(h^{p+1})$ bias when $p$ is odd. When $p$ is even, the leading $h^{p+1}$ term vanishes by parity at the interior with symmetric $K$, and the next-order analysis gives $O(h^{p+2})$ — a parity-bonus interior rate that does not survive at the boundary. The variance is $O(1/(nh))$ throughout.
  3. The odd-vs-even degree pattern (§5). Combining the first two: an odd degree $p$ provides the same interior rate as the preceding even degree $p - 1$ (the §3 pairing $K^*_{2m} = K^*_{2m+1}$ at interior, with matched variance constant $R^*_{2m} = R^*_{2m+1}$) but with strictly better boundary behavior — the parity-zero pattern that gives even $p$ its $h^{p+2}$ interior bonus collapses at the boundary, while odd $p$ retains its $h^{p+1}$ rate uniformly across the support and keeps design-adaptivity (no $f_X$-derivative contamination). So each even degree is Pareto-dominated by the next odd degree up, at no variance cost. The practical choices are $p = 1$ and $p = 3$, and the choice between them comes down to whether you need derivative estimates.

That's the spine. Boundary behavior (§6) follows from the equivalent-kernel adapting to the boundary at odd $p$. Bandwidth selection (§7) extends Kernel Regression §5 essentially unchanged. The multivariate generalization (§8), the smoothing-spline bridge (§10), and the additive-model bridge (§11) all rest on the same scaffolding.

1.3 Roadmap

§2 is the formal definition and the matrix form. §3 is the equivalent-kernel rewrite. §4 derives the bias-variance ladder, and §5 — the topic's sharpest result — is the odd-vs-even degree theorem. §6 takes the theorem to the boundary. §7 extends the bandwidth-selection toolkit. §8 generalizes to $\mathbb{R}^d$. §9 is derivative estimation. §10 connects to smoothing splines. §11 connects to additive models. §12 closes with the connections to semiparametric inference and the practical limits.

2. The degree-$p$ local polynomial estimator

2.1 The Taylor-expansion idea

The motivating observation is the same one that drives any local approximation in calculus. If $m$ is $(p+1)$-times differentiable at $x$, then for $z$ near $x$, Taylor's theorem gives

$$m(z) = m(x) + m'(x)(z - x) + \frac{m''(x)}{2!}(z - x)^2 + \cdots + \frac{m^{(p)}(x)}{p!}(z - x)^p + R_p(z, x),$$

with remainder $R_p(z, x) = O(|z - x|^{p+1})$. So in a small enough neighborhood of $x$, $m$ looks like a polynomial of degree $p$ in $(z - x)$, with coefficients $\beta_j^\star = m^{(j)}(x) / j!$.

The local polynomial estimator turns this around. We don't know $m$, but we have data. Fit a polynomial of degree $p$ in $(X_i - x)$ to the observed $Y_i$, weighted by how close $X_i$ is to $x$. The kernel weights enforce the “near $x$” part — observations far from $x$ get exponentially down-weighted under a Gaussian kernel, or zeroed out entirely under a compact-support kernel like the Epanechnikov. The function-value estimate $\hat m_h(x)$ is the intercept of the fitted polynomial; the slope, curvature, and higher derivatives come along as the higher-order coefficients.

NW corresponds to $p = 0$ — a local constant fit, where the only coefficient is the kernel-weighted mean of the $Y_i$. Local-linear is $p = 1$, the §8 Kernel Regression fix. Local-cubic is $p = 3$.

2.2 The kernel-weighted least-squares objective

Write the kernel weight as $w_i = K_h(X_i - x)$ where $K_h(z) = h^{-1} K(z/h)$ is the standard scaling that makes $K_h$ integrate to one. The kernel-weighted sum of squared errors at evaluation point $x$ is

$$\mathcal{L}_n(\boldsymbol\beta; x) = \sum_{i=1}^n w_i \, \bigg( Y_i - \sum_{k=0}^{p} \beta_k (X_i - x)^k \bigg)^{\!2}.$$

Definition 1 (Local polynomial estimator).

The degree-$p$ local polynomial estimator of $m(x) = \mathbb{E}[Y \mid X = x]$ at a fixed evaluation point $x$ is

$$\hat{\boldsymbol\beta}(x) = \arg\min_{\boldsymbol\beta \in \mathbb{R}^{p+1}} \mathcal{L}_n(\boldsymbol\beta; x), \qquad \hat m_h(x) = \hat\beta_0(x),$$

with derivative estimates $\hat m_h^{(j)}(x) = j!\, \hat\beta_j(x)$ for $j = 1, \dots, p$.

Two things to notice. First, the problem decouples across $x$: each evaluation point gets its own $\hat{\boldsymbol\beta}$. The bandwidth $h$ and the polynomial degree $p$ are global, but the fitted coefficients are local. Second, this is ordinary weighted least squares — the kernel only enters through the weights.

2.3 Normal equations

Setting $\partial \mathcal{L}_n / \partial \beta_j = 0$ for each $j \in \{0, 1, \dots, p\}$ gives the first-order conditions

$$\frac{\partial \mathcal{L}_n}{\partial \beta_j} = -2 \sum_{i=1}^n w_i (X_i - x)^j \bigg( Y_i - \sum_{k=0}^p \beta_k (X_i - x)^k \bigg) = 0.$$

Rearranging,

$$\sum_{k=0}^p \bigg( \sum_{i=1}^n w_i (X_i - x)^{j+k} \bigg) \beta_k \;=\; \sum_{i=1}^n w_i (X_i - x)^j \, Y_i, \qquad j = 0, 1, \dots, p.$$

Define the empirical kernel-moment quantities

$$\tilde S_{n,j}(x) = \sum_{i=1}^n w_i (X_i - x)^j, \qquad \tilde T_{n,j}(x) = \sum_{i=1}^n w_i (X_i - x)^j \, Y_i.$$

The normal equations are then $\sum_{k=0}^p \tilde S_{n,j+k}(x) \, \beta_k = \tilde T_{n,j}(x)$ for $j = 0, \dots, p$. Stacked as a matrix system,

$$\underbrace{ \begin{pmatrix} \tilde S_{n,0} & \tilde S_{n,1} & \cdots & \tilde S_{n,p} \\ \tilde S_{n,1} & \tilde S_{n,2} & \cdots & \tilde S_{n,p+1} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde S_{n,p} & \tilde S_{n,p+1} & \cdots & \tilde S_{n,2p} \end{pmatrix} }_{\tilde{\mathbf S}_n(x)} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_p \end{pmatrix} \;=\; \begin{pmatrix} \tilde T_{n,0}(x) \\ \tilde T_{n,1}(x) \\ \vdots \\ \tilde T_{n,p}(x) \end{pmatrix}. \quad\quad (2.1)$$

The matrix $\tilde{\mathbf S}_n(x)$ has constant entries along anti-diagonals — a Hankel matrix. The structure isn't load-bearing for solving (we use a generic linear solver) but it's worth registering: $\tilde{\mathbf S}_n(x)$ is fully determined by the $2p + 1$ scalars $\tilde S_{n,0}, \dots, \tilde S_{n,2p}$.

Existence and uniqueness. $\tilde{\mathbf S}_n(x)$ is invertible iff the $X_i$'s with positive kernel weight at $x$ contain at least $p + 1$ distinct values. For continuous kernels like the Gaussian, every observation has positive weight, so the matrix is invertible whenever the data has at least $p + 1$ distinct $X_i$ — almost surely true under any continuous distribution of $X$. For compactly supported kernels (Epanechnikov, biweight, triweight), we need at least $p + 1$ distinct $X_i$ within distance $h$ of $x$. This is a local-design constraint that bites only at small bandwidths or sparse regions.

2.4 Matrix form and batched-grid vectorization

Stack the design rows for each observation into the $n \times (p+1)$ matrix

$$\mathbf{X}_x = \begin{pmatrix} 1 & (X_1 - x) & (X_1 - x)^2 & \cdots & (X_1 - x)^p \\ 1 & (X_2 - x) & (X_2 - x)^2 & \cdots & (X_2 - x)^p \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & (X_n - x) & (X_n - x)^2 & \cdots & (X_n - x)^p \end{pmatrix},$$

and let $\mathbf{W}_x = \operatorname{diag}(w_1, \dots, w_n)$ be the diagonal weight matrix. Then $\tilde{\mathbf{S}}_n(x) = \mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x$ and $\tilde{\mathbf{T}}_n(x) = \mathbf{X}_x^\top \mathbf{W}_x \mathbf{y}$, and the closed-form WLS solution is

$$\hat{\boldsymbol\beta}(x) = \big(\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x\big)^{-1} \mathbf{X}_x^\top \mathbf{W}_x \mathbf{y}. \quad\quad (2.2)$$

The function-value estimate is

$$\hat m_h(x) = \mathbf{e}_1^\top \big(\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x\big)^{-1} \mathbf{X}_x^\top \mathbf{W}_x \mathbf{y},$$

where $\mathbf{e}_1 = (1, 0, \dots, 0)^\top$ picks off the intercept. We return to this $\mathbf{e}_1^\top \tilde{\mathbf{S}}^{-1}$ structure in §3 — it's the seed of the equivalent-kernel formulation.

Vectorization. For an evaluation grid $x_1, \dots, x_G$, equation (2.2) is $G$ separate WLS solves with $G$ different design and weight matrices. The straightforward implementation loops over $G$; the batched implementation stacks all $G$ design matrices into a $G \times n \times (p+1)$ array, all $G$ weight diagonals into a $G \times n$ array, and uses NumPy's np.linalg.solve broadcasting to do all $G$ systems at once. The cost is the same $O\big(G \cdot n \cdot (p+1) + G \cdot (p+1)^3\big)$ — at the §1-toy scale ($G = 400$, $n = 200$, $p = 3$) the speedup over a Python loop is modest (about $1.1\times$ in the notebook timing), but at $G \ge 10^4$, where Python-loop overhead dominates the picture, it grows substantially.

Conditioning. The columns of $\mathbf{X}_x$ have entries $(X_i - x)^j$ that span many orders of magnitude when $h$ is small. The fix is to work in scaled coordinates $u_i = (X_i - x)/h$. The rescaled design $\tilde{\mathbf{Z}}_x$ has entries $u_i^j$, with $|u_i| \lesssim 3$ for the bulk of the kernel mass — well-conditioned. The function-value estimate $\hat\beta_0$ is invariant under this rescaling because the first column is unchanged ($u_i^0 = 1 = (X_i - x)^0$). Higher coefficients pick up an $h^j$ factor: $\hat\beta_j^{\text{unscaled}} = \hat\beta_j^{\text{scaled}} / h^j$. The notebook's vectorized implementation uses scaled coordinates throughout.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel (defined in the Kernel Regression topic; reproduced so the block runs standalone)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def local_polynomial(X, Y, x_eval, h, p, K_fn=gaussian_kernel):
    """Vectorized degree-p local polynomial. Returns hat m_h(x) at x_eval."""
    X = np.asarray(X, dtype=float); Y = np.asarray(Y, dtype=float)
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    diff = X[None, :] - x_eval[:, None]                # (G, n)
    u = diff / h                                       # scaled coordinates
    w = K_fn(u) / h                                    # (G, n) kernel weights
    # Scaled design Z[g, i, k] = u[g, i]**k.
    Z = u[..., None] ** np.arange(p + 1)               # (G, n, p+1)
    Wz = w[..., None] * Z                              # (G, n, p+1)
    # Normal equations: A_g beta_g = b_g for each evaluation point g.
    A = np.einsum("gnj,gnk->gjk", Wz, Z)               # (G, p+1, p+1)
    b = np.einsum("gnj,gn->gj", Wz, Y)                 # (G, p+1)
    beta = np.linalg.solve(A, b)                       # (G, p+1) — beta_0 is hat m
    return beta[..., 0]
```
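As a quick usage check, here is a minimal sketch that regenerates the §1 toy sample and evaluates all four degrees on a grid. The data-generating calls are an assumption matching the §1 description (seed 42, $n = 200$), not necessarily the notebook's exact RNG code:

```python
rng = np.random.default_rng(42)                      # seed 42, as in the §1 toy
n = 200
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + X / 2 + rng.normal(0, 0.2, n)

x_grid = np.linspace(0, 1, 400)
fits = {p: local_polynomial(X, Y, x_grid, h=0.08, p=p) for p in (0, 1, 2, 3)}
```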

2.5 Reading off derivatives

The Taylor identification from §2.1 said that, locally, $\beta_j^\star = m^{(j)}(x) / j!$. Plugging in the WLS estimate gives the local-polynomial derivative estimator,

$$\hat m_h^{(j)}(x) = j! \, \hat\beta_j(x), \qquad j = 0, 1, \dots, p.$$

Quick demonstration on the §1 toy. Fitting a local-cubic at $x_0 = 0.5$ with $h = 0.08$ gives all four coefficients in one solve. The truth at $x_0 = 0.5$ is $m(0.5) = 0.25$, $m'(0.5) = -2\pi + 0.5 \approx -5.78$, $m''(0.5) = 0$, and $m'''(0.5) = 8\pi^3 \approx 248$. The local-cubic recovers the function value ($\hat m_h(0.5) \approx 0.265$) and first derivative ($\hat m_h'(0.5) \approx -5.26$) tightly; the second and third derivatives are recovered with substantially more variance ($\hat m_h''(0.5) \approx -0.64$, $\hat m_h'''(0.5) \approx 178$), foreshadowing §9's formula $\operatorname{Var}(\hat m_h^{(j)}) = O(1/(n h^{2j+1}))$ — the bandwidth that's right for $\hat m$ is too small for $\hat m^{(j)}$ at $j \ge 1$.
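A minimal sketch of the derivative read-off: the same normal-equation solve as `local_polynomial`, at a single point, returning all $p + 1$ derivatives after undoing the $u = (X_i - x)/h$ rescaling with the $j!/h^j$ factor from §2.4. It uses the toy sample generated above; the helper name is illustrative, not the notebook's:

```python
from math import factorial

def local_poly_derivatives(X, Y, x0, h, p, K_fn=gaussian_kernel):
    """One WLS solve at x0; returns (hat m(x0), hat m'(x0), ..., hat m^(p)(x0))."""
    u = (np.asarray(X, float) - x0) / h
    w = K_fn(u) / h
    Z = u[:, None] ** np.arange(p + 1)          # scaled design, columns u^0 ... u^p
    A = Z.T @ (w[:, None] * Z)
    b = Z.T @ (w * np.asarray(Y, float))
    beta_scaled = np.linalg.solve(A, b)
    # beta_j = beta_j^scaled / h^j (§2.4), then hat m^(j) = j! * beta_j.
    return np.array([factorial(j) * beta_scaled[j] / h**j for j in range(p + 1)])

m_hat, m1_hat, m2_hat, m3_hat = local_poly_derivatives(X, Y, x0=0.5, h=0.08, p=3)
```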

3. The equivalent-kernel formulation

3.1 Local polynomial estimators are linear smoothers

Equation (2.2) gives the function-value estimate as

$$\hat m_h(x) = \mathbf{e}_1^\top \big(\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x\big)^{-1} \mathbf{X}_x^\top \mathbf{W}_x \, \mathbf{y}.$$

Define

$$\mathbf{a}(x)^\top \;=\; \mathbf{e}_1^\top \big(\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x\big)^{-1} \mathbf{X}_x^\top \mathbf{W}_x \;\in\; \mathbb{R}^{1 \times n}.$$

Then $\hat m_h(x) = \mathbf{a}(x)^\top \mathbf{y} = \sum_{i=1}^n a_i(x)\, Y_i$, and crucially the weights $a_i(x)$ depend only on $\{X_j\}$ — through $\mathbf{X}_x$ and $\mathbf{W}_x$ — and on $(K, h, p)$, never on $\{Y_j\}$. So local polynomial estimators of every degree $p$ are linear smoothers: the fit at $x$ is a linear combination of the responses with design-determined weights.

The $i$-th column of $\mathbf{X}_x^\top \mathbf{W}_x$ is $w_i (1, X_i - x, \ldots, (X_i - x)^p)^\top$. So

$$a_i(x) \;=\; w_i \cdot \mathbf{e}_1^\top \big(\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x\big)^{-1} \big(1, \, X_i - x, \, \ldots, \, (X_i - x)^p\big)^\top. \quad\quad (3.1)$$

This is the exact, finite-sample equivalent-kernel weight. It depends on $x$, on $h$, on the kernel $K$, on the degree $p$, and on the entire design $\{X_j\}$ via the Gram matrix. It's not yet a clean kernel function — it doesn't factor as $K^\star\big((X_i-x)/h\big)$ for any fixed $K^\star$.

3.2 The asymptotic equivalent kernel

The clean factorization emerges in the population limit. Replace $\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x / n$ with its expectation; the $(j, k)$ entry is

$$\mathbb{E}\big[K_h(X - x)\, (X - x)^{j+k}\big] \;=\; \int K_h(z - x)\, (z - x)^{j+k}\, f_X(z)\, dz.$$

Substitute $u = (z - x)/h$, so $z = x + hu$, $dz = h\, du$, $K_h(z - x) = K(u)/h$, and $(z - x)^{j+k} = h^{j+k} u^{j+k}$:

$$= h^{j+k} \int u^{j+k}\, K(u)\, f_X(x + hu)\, du.$$

To leading order in $h$, $f_X(x + hu) \approx f_X(x)$, so

$$\mathbb{E}\big[K_h(X - x)\, (X - x)^{j+k}\big] \;\approx\; h^{j+k} \, f_X(x) \, \mu_{j+k}(K),$$

where $\mu_j(K) = \int u^j K(u)\, du$ are the kernel moments familiar from Kernel Regression. The Gram-matrix factorization is

$$\frac{1}{n}\, \mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x \;\approx\; f_X(x) \, \mathbf{D}_h \, \mathbf{S}_p \, \mathbf{D}_h, \quad\quad (3.2)$$

where $\mathbf{D}_h = \operatorname{diag}(1, h, h^2, \ldots, h^p)$ and $\mathbf{S}_p = [\mu_{j+k}(K)]_{j, k = 0}^{p}$ is the $(p+1) \times (p+1)$ kernel-moment matrix in the $u$-coordinate.

Inverting and substituting into (3.1) and simplifying with $\mathbf{e}_1^\top \mathbf{D}_h^{-1} = \mathbf{e}_1^\top$ and $\mathbf{D}_h^{-1}(1, X_i - x, \ldots, (X_i - x)^p)^\top = (1, u_i, u_i^2, \ldots, u_i^p)^\top$:

$$a_i(x) \;\approx\; \frac{1}{n\, h\, f_X(x)} \cdot K^*_p(u_i),$$

where the asymptotic equivalent kernel is

$$\boxed{\,K^*_p(u) \;=\; \mathbf{e}_1^\top \mathbf{S}_p^{-1}\, \big(1, \, u, \, u^2, \, \ldots, \, u^p\big)^\top \, K(u). \,} \quad\quad (3.3)$$

The equivalent-kernel form of the local-polynomial estimator is therefore

$$\hat m_h(x) \;\approx\; \frac{1}{n\, h\, f_X(x)} \sum_{i=1}^n K^*_p\!\left(\frac{X_i - x}{h}\right) Y_i. \quad\quad (3.4)$$

This is the headline result of the section. Local polynomial regression, at every degree $p$, is asymptotically a kernel smoother — but with $K^*_p$ in place of $K$. The original kernel $K$ enters only through $\mathbf{S}_p$ (the moment matrix) and through the $K(u)$ factor in (3.3); the polynomial degree $p$ enters through $\mathbf{e}_1^\top \mathbf{S}_p^{-1} (1, u, \ldots, u^p)^\top$, which is a polynomial of degree $p$ in $u$ multiplying the original kernel.

The $1/(nh f_X(x))$ prefactor is worth flagging. NW had to manufacture this density correction explicitly, via the denominator $\sum_i K_h(X_i - x) \approx n h f_X(x)$. Local polynomial of any degree $p \ge 1$ produces it automatically through the normal equations. That's the algebraic source of the boundary-bias improvement we previewed in §1.

3.3 Properties of $K^*_p$

The equivalent kernel $K^*_p$ inherits three structural properties from the $\mathbf{S}_p^{-1}$ construction.

Proposition 1 (Moment-matching identities).

For $j = 0, 1, \ldots, p$,

$$\int u^j\, K^*_p(u)\, du \;=\; \delta_{j, 0}.$$

The original kernel $K$ has only $\int K = 1$ guaranteed (and $\int u K = 0$ when $K$ is symmetric). The equivalent kernel $K^*_p$ has all moments up to order $p$ matched to those of a degree-$p$ polynomial that reproduces constants — far stronger. This is what kills the design-density and lower-order Taylor terms in the bias expansion of §4.

Proof.

Substitute (3.3): $\int u^j K^*_p(u)\, du = \mathbf{e}_1^\top \mathbf{S}_p^{-1} \int (1, u, \ldots, u^p)^\top u^j K(u)\, du = \mathbf{e}_1^\top \mathbf{S}_p^{-1} (\mu_j, \mu_{j+1}, \ldots, \mu_{j+p})^\top$. For $j \in \{0, 1, \ldots, p\}$, the vector $(\mu_j, \ldots, \mu_{j+p})^\top$ is exactly the $(j{+}1)$-th column of $\mathbf{S}_p$. So $\mathbf{S}_p^{-1} (\mu_j, \ldots, \mu_{j+p})^\top = \mathbf{e}_{j+1}$, and $\mathbf{e}_1^\top \mathbf{e}_{j+1} = \delta_{1, j+1} = \delta_{j, 0}$.

Property 2 — the odd-with-even pairing at interior with symmetric $K$. When $K$ is symmetric ($K(u) = K(-u)$) and $x$ is an interior point of the support, the odd kernel moments $\mu_1, \mu_3, \mu_5, \ldots$ all vanish. The moment matrix $\mathbf{S}_p$ then has a checkerboard zero pattern: nonzero entries only at positions $(j, k)$ with $j + k$ even. The same checkerboard pattern persists in $\mathbf{S}_p^{-1}$, so $\mathbf{e}_1^\top \mathbf{S}_p^{-1}$ has nonzero entries only at even indices.

In particular, with the Gaussian kernel ($\mu_2 = 1$, $\mu_4 = 3$, $\mu_6 = 15$):

  • $K^*_0(u) = K(u)$ (trivially).
  • $K^*_1(u) = K(u)$ (symmetric $K$ at interior gives $\mathbf{S}_1 = \operatorname{diag}(1, 1)$).
  • $K^*_2(u) = K^*_3(u) = \tfrac{1}{2}(3 - u^2)\, K(u)$, which is negative for $|u| > \sqrt{3} \approx 1.73$.
  • $K^*_4(u) = K^*_5(u)$, and so on.

The pairing $K^*_{2m} = K^*_{2m+1}$ at interior is a structural consequence of the checkerboard zeros: adding a $u^{2m+1}$ column to the design matrix extends $\mathbf{S}_p$ by an odd-indexed row and column whose only nonzero entries are odd moments — disconnected from the even-indexed rows that determine the first row of $\mathbf{S}_p^{-1}$. This algebraic fact is the seed of the odd-vs-even degree story (§5): at interior, jumping from $p = 0$ to $p = 1$, or from $p = 2$ to $p = 3$, doesn't change the equivalent kernel — and therefore doesn't change the leading-order bias rate.

Property 3 — sometimes negative-valued. For $p \ge 2$, the polynomial factor $\mathbf{e}_1^\top \mathbf{S}_p^{-1} (1, u, \ldots, u^p)^\top$ has zero crossings, so $K^*_p$ takes negative values. This isn't a flaw — it's the price of the moment-matching identities. The original $K$ is non-negative and only matches the zeroth moment; $K^*_p$ matches $p$ moments, and the trade-off is negativity in the tails.
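Equation (3.3) is easy to evaluate numerically. A minimal sketch for the Gaussian kernel — the moment matrix $\mathbf{S}_p$ from quadrature on a wide grid, then the polynomial factor $\mathbf{e}_1^\top \mathbf{S}_p^{-1}(1, u, \ldots, u^p)^\top$ — checking the interior pairing and Proposition 1. Helper names are illustrative, not the notebook's:

```python
u_grid = np.linspace(-8, 8, 20001)                 # wide enough to capture the Gaussian mass
du = u_grid[1] - u_grid[0]
K_vals = gaussian_kernel(u_grid)

def equivalent_kernel(p, u=u_grid, K=K_vals):
    """K*_p(u) = e_1^T S_p^{-1} (1, u, ..., u^p)^T K(u) at an interior point."""
    powers = u ** np.arange(p + 1)[:, None]        # (p+1, len(u)) rows: u^0, ..., u^p
    S = (powers[:, None, :] * powers[None, :, :] * K).sum(axis=-1) * du   # S_p = [mu_{j+k}]
    first_row = np.linalg.solve(S, np.eye(p + 1)[0])   # e_1^T S_p^{-1} (S_p is symmetric)
    return (first_row @ powers) * K

K2, K3 = equivalent_kernel(2), equivalent_kernel(3)
print(np.allclose(K2, K3))                                        # True: interior pairing K*_2 = K*_3
print([round((u_grid**j * K3).sum() * du, 6) for j in range(4)])  # ~ [1, 0, 0, 0]: Proposition 1
```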

3.4 Boundary behavior and the visualization

At an interior point $x$, the kernel $K$ integrates to one over its full support. At a boundary point, say $x = ch$ for some $c \ge 0$ near the support boundary at $0$, only the portion $u \ge -c$ of the kernel mass falls within the support. The kernel-moment matrix becomes

$$\mathbf{S}_p^{(c)} \;=\; \big[\mu_{j+k}^{(c)}\big]_{j,k=0}^{p}, \qquad \mu_j^{(c)} \;=\; \int_{-c}^{\infty} u^j\, K(u)\, du,$$

and the equivalent kernel becomes $c$-dependent:

$$K^{*\,(c)}_p(u) \;=\; \mathbf{e}_1^\top \big(\mathbf{S}_p^{(c)}\big)^{-1} \big(1, u, \ldots, u^p\big)^\top K(u), \quad u \ge -c.$$

Two things change from the interior case. First, the symmetry of $K$ no longer implies $\mu_1^{(c)} = 0$, so the checkerboard zero pattern is gone and the odd-with-even pairing breaks. Second, the equivalent kernel adapts its shape to compensate for the missing kernel mass on the truncated side. This is the automatic boundary correction — the equivalent kernel knows where the boundary is and reweights accordingly, without us having to do anything beyond solving the original normal equations.

Figure: Equivalent kernels $K^*_p$ for $p \in \{0, 1, 2, 3\}$ on the Gaussian base kernel. Left panel: interior ($c = \infty$), where $K^*_0 = K^*_1 = K$ overlap and $K^*_2 = K^*_3$ overlap, with negative tails for $|u| > \sqrt{3}$. Right panel: boundary ($c = 0$), where all four curves differ — the equivalent kernel adapts to the missing left-tail mass automatically through the WLS solve.

The interior panel makes the §3.3 algebra visible. The boundary panel makes the §6 boundary-bias result visible: at the boundary, the equivalent kernel for local-linear and local-cubic absorbs the truncation, while the NW equivalent kernel ($K^*_0 = K$, just the truncated Gaussian) is asymmetric and biased. We return to this in §6 with the formal bias rates.
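The boundary version only changes the integration range — truncate the same quadrature grid at $u \ge -c$. A minimal sketch extending the interior helper from §3.3 (again illustrative, not the notebook's code):

```python
def equivalent_kernel_boundary(p, c, u=u_grid, K=K_vals):
    """K*_p^(c)(u) on u >= -c, built from the truncated moments mu_j^(c)."""
    keep = u >= -c
    uc, Kc = u[keep], K[keep]
    powers = uc ** np.arange(p + 1)[:, None]
    S_c = (powers[:, None, :] * powers[None, :, :] * Kc).sum(axis=-1) * du   # S_p^(c)
    first_row = np.linalg.solve(S_c, np.eye(p + 1)[0])
    return uc, (first_row @ powers) * Kc

uc, K1_edge = equivalent_kernel_boundary(1, c=0.0)       # local-linear kernel at the strict edge
print((K1_edge.sum() * du, (uc * K1_edge).sum() * du))   # ~ (1, 0): moment matching survives truncation
```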

4. Bias and variance at general $p$

4.1 The conditional bias and the residual decomposition

The cleanest path from §3 to a sharp bias formula is the conditional approach of Ruppert & Wand (1994). Conditioning on the design $\{X_i\}$, the estimator's mean is

$$\mathbb{E}\big[\hat m_h(x) \mid X_1, \dots, X_n\big] \;=\; \mathbf{e}_1^\top (\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x)^{-1} \mathbf{X}_x^\top \mathbf{W}_x\, \mathbf{m},$$

where $\mathbf{m} = (m(X_1), \ldots, m(X_n))^\top$ collects the true regression function evaluated at the design points. Taylor-expanding $m$ about $x$,

$$m(X_i) \;=\; \underbrace{\sum_{j=0}^{p} \frac{m^{(j)}(x)}{j!}\, (X_i - x)^j}_{\text{in the column space of } \mathbf{X}_x} \;+\; \underbrace{\frac{m^{(p+1)}(x)}{(p+1)!}\, (X_i - x)^{p+1} + o\!\big((X_i - x)^{p+1}\big)}_{r_i}.$$

The first $p + 1$ Taylor terms are exactly degree-$p$ polynomials in $(X_i - x)$, so they live in the column space of $\mathbf{X}_x$. The projection $(\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x)^{-1} \mathbf{X}_x^\top \mathbf{W}_x$ fixes any vector in that column space, so it returns those coefficients exactly: $\mathbf{e}_1^\top$ projects out $m(x)$. The conditional bias collapses to the residual contribution alone:

$$\text{Bias}\big(\hat m_h(x) \mid \{X_i\}\big) \;=\; \mathbf{e}_1^\top (\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x)^{-1} \mathbf{X}_x^\top \mathbf{W}_x\, \mathbf{r}, \quad r_i \approx \frac{m^{(p+1)}(x)}{(p+1)!}\, (X_i - x)^{p+1}. \quad\quad (4.1)$$

This is a purely local statement — the bias depends on $m$ only through its $(p+1)$-th derivative at $x$, not through any global properties of $m$.

4.2 The leading-order bias formula

To extract the leading-order asymptotic, we substitute the leading-order approximation $r_i \approx m^{(p+1)}(x)/(p+1)! \cdot (X_i - x)^{p+1}$ into (4.1). The vector $\mathbf{X}_x^\top \mathbf{W}_x \mathbf{r}$ has $j$-th component

$$\frac{m^{(p+1)}(x)}{(p+1)!} \sum_{i=1}^n w_i\, (X_i - x)^{j + p + 1}.$$

The empirical sum $\frac{1}{n}\sum_i w_i (X_i - x)^{j+p+1}$ converges (by LLN + change of variables, as in §3.2) to its expectation $h^{j+p+1} f_X(x) \mu_{j+p+1}(K)$ to leading order. Stacking and using the $\mathbf{D}_h \mathbf{S}_p \mathbf{D}_h$ factorization of the Gram matrix from (3.2),

$$(\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x)^{-1}\, \mathbf{X}_x^\top \mathbf{W}_x\, \mathbf{r} \;\approx\; \frac{m^{(p+1)}(x)\, h^{p+1}}{(p+1)!}\, \mathbf{D}_h^{-1} \mathbf{S}_p^{-1}\, \mathbf{c}_p,$$

where $\mathbf{c}_p = (\mu_{p+1}(K),\, \mu_{p+2}(K),\, \ldots,\, \mu_{2p+1}(K))^\top \in \mathbb{R}^{p+1}$.

Theorem 1 (Ruppert–Wand (1994) interior bias formula).

Let $x$ be an interior point of the support of $X$ with $f_X(x) > 0$ and $m \in C^{p+1}$. Conditional on the design,

$$\text{Bias}(\hat m_h(x)) \;=\; \frac{h^{p+1}}{(p+1)!}\, m^{(p+1)}(x)\, b_p(K) \;+\; o(h^{p+1}),$$

where the asymptotic bias constant $b_p(K) = \mathbf{e}_1^\top \mathbf{S}_p^{-1}\, \mathbf{c}_p = \int u^{p+1}\, K^*_p(u)\, du$ is the first surviving moment of the equivalent kernel $K^*_p$.

Two remarks.

The asymptotic bias constant equals the $(p+1)$-th moment of $K^*_p$. Reading $b_p(K) = \mathbf{e}_1^\top \mathbf{S}_p^{-1} \mathbf{c}_p$ alongside (3.3) gives $b_p(K) = \int u^{p+1}\, K^*_p(u)\, du$. The bias is determined by the first surviving moment of the equivalent kernel — the first moment that the moment-matching identities of §3.3 don't kill. The §3 algebra and the §4 bias derivation are the same algebra read in two directions.

No design-density derivatives appear. Unlike the NW bias formula (which has a $2 m'(x) f_X'(x)/f_X(x)$ term), Theorem 1 depends on $m$ only through $m^{(p+1)}(x)$ for any $p \ge 1$. Local polynomials are design-adaptive: the WLS solve absorbs the design-density correction automatically through the Gram-matrix inversion. This is the structural advantage that won local-linear regression its place as the default at the end of Kernel Regression §8, and the formula shows the advantage extends to any $p \ge 1$.

4.3 The interior rate ladder

For symmetric $K$ at an interior point $x$, the moment matrix $\mathbf{S}_p$ has the checkerboard zero pattern from §3.3, and so does $\mathbf{S}_p^{-1}$. The first row of $\mathbf{S}_p^{-1}$ has nonzero entries only at even indices. The vector $\mathbf{c}_p = (\mu_{p+1}, \mu_{p+2}, \ldots, \mu_{2p+1})^\top$ has nonzero entries only where the index has the parity of $p+1$.

  • Odd $p$: $p + 1$ is even, so $\mathbf{c}_p$ has nonzero entries at even indices. These pair with the even-indexed nonzero entries of $\mathbf{e}_1^\top \mathbf{S}_p^{-1}$, giving $b_p(K) \neq 0$ generically. The bias rate is $O(h^{p+1})$.
  • Even $p$: $p + 1$ is odd, so $\mathbf{c}_p$ has nonzero entries at odd indices. These pair with the zero odd-indexed entries of $\mathbf{e}_1^\top \mathbf{S}_p^{-1}$ — every term vanishes, so $b_p(K) = 0$ identically. The leading $h^{p+1}$ term in Theorem 1 collapses, and the next-order analysis (one more term in the Taylor expansion of $m$) gives a bias of order $h^{p+2}$.

Stitching this together, the interior bias-rate ladder at symmetric $K$ on the Gaussian (computed exactly via numerical integration; the zeros are exact zeros from parity, not numerical noise) is:

| $p$ | parity | $b_p(K)$ | $R^*_p(K)$ | leading interior rate |
|----:|:------:|:--------:|:----------:|:---------------------:|
| 0 | even | $0$ (parity, with NW $f_X$ correction at next order) | $0.282095$ | $h^2$ |
| 1 | odd | $1$ | $0.282095$ | $h^2$ |
| 2 | even | $0$ (parity) | $0.476035$ | $h^4$ |
| 3 | odd | $-3$ | $0.476035$ | $h^4$ |
| 4 | even | $0$ (parity) | $0.623969$ | $h^6$ |
| 5 | odd | $15$ | $0.623969$ | $h^6$ |

The pairing $(0,1), (2,3), (4,5), \ldots$ is the §3 pairing $K^*_{2m} = K^*_{2m+1}$ rendered as a bias-rate identity. Within each pair, odd $p$ is design-adaptive while even $p$ has design-density-dependent constants — a constant-level advantage that grows in importance at the boundary (§5), where the parity bonus disappears entirely.

The $p = 0$ row (NW) is the design-density-corrected formula already derived in Kernel Regression §3 — included here for completeness, not re-derived. The $p = 0$ formula picks up its $f_X'/f_X$ term not from Theorem 1 (which would give $b_0(K) = 0$ at symmetric $K$ by parity) but from the next-order correction to the asymptotic equivalent-kernel form. NW is asymptotically almost a kernel smoother, but not exactly one — local-polynomial of any $p \ge 1$ is.

Even-$p$ bias constants are exactly zero at interior with symmetric $K$ (parity). The leading-order bias picks up the next-order term, giving the parity-bonus rate $h^{p+2}$ for even $p$ versus $h^{p+1}$ for odd $p$. Variance constants pair ($R^*_0 = R^*_1$, $R^*_2 = R^*_3$, $R^*_4 = R^*_5$) — the §3 odd-with-even pairing rendered as a constant.
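The table's constants fall straight out of the `equivalent_kernel` sketch from §3.3 above: $b_p(K) = \int u^{p+1} K^*_p(u)\, du$ and $R^*_p(K) = \int K^*_p(u)^2\, du$, evaluated on the same quadrature grid (Gaussian kernel assumed; the even-$p$ bias constants come out at zero up to quadrature error):

```python
for p in range(6):
    Kp = equivalent_kernel(p)
    b_p = (u_grid ** (p + 1) * Kp).sum() * du     # bias constant; 0 for even p by parity
    R_p = (Kp ** 2).sum() * du                    # variance constant R*_p(K)
    print(f"p={p}:  b_p = {b_p:+.6f}   R*_p = {R_p:.6f}")
```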

Figure: Empirical bias-rate check at $x_0 = 0.5$ on the §1 toy, $B = 500$ MC replicates: $|\mathrm{Bias}|$ vs $h$ on log–log axes for $p \in \{0, 1, 2, 3\}$, with theoretical $h^{p+1}$ slopes overlaid as dashed lines. The empirical slopes deviate substantially from theory because the §1 mean function $m(x) = \sin(2\pi x) + x/2$ has $m''(0.5) = 0$ and $m^{(4)}(0.5) = 0$ — both leading-order bias terms vanish at this point, so the empirical $|\mathrm{Bias}|$ values are MC-noise-dominated rather than rate-revealing. The analytical verification of the rate ladder lives in the bias-constants table above (computed exactly via numerical integration); this empirical replicate experiment serves as a finite-sample diagnostic of MC-noise scaling, not a rate confirmation.

4.4 Variance

The conditional variance of the linear smoother is $\text{Var}(\hat m_h(x) \mid \{X_i\}) = \sum_{i=1}^n a_i(x)^2\, \sigma^2(X_i)$, where $a_i(x)$ is the equivalent-kernel weight from (3.1) and $\sigma^2(z) = \text{Var}(Y \mid X = z)$ is the conditional noise variance.

For $X_i$ within the kernel's effective support around $x$, $\sigma^2(X_i) \approx \sigma^2(x)$ to leading order, so $\text{Var} \approx \sigma^2(x) \sum_i a_i(x)^2$. Substituting the asymptotic form $a_i(x) \approx K^*_p(u_i) / (n h f_X(x))$ and applying LLN + change of variables,

Theorem 2 (Pointwise variance of local polynomial regression).

Under the conditions of Theorem 1 plus $\sigma^2(x) := \mathrm{Var}[Y \mid X = x] < \infty$,

$$\text{Var}(\hat m_h(x)) \;=\; \frac{\sigma^2(x)\, R^*_p(K)}{n\, h\, f_X(x)} + o\!\big(\tfrac{1}{nh}\big), \qquad R^*_p(K) = \int K^*_p(u)^2\, du.$$

The variance rate is $O(1/(nh))$ at every $p$ — same scaling as NW. The constant $R^*_p(K)$ varies with $p$, and crucially grows with $p$ because $K^*_p$ becomes more oscillatory to match more moments. For the Gaussian, $R^*_0 = R^*_1 = 1/(2\sqrt{\pi}) \approx 0.282$, while $R^*_2 = R^*_3 \approx 0.476$ — local-quadratic and local-cubic carry roughly $1.69\times$ the variance of NW and local-linear at the same $h$ (and $R^*_4 = R^*_5 \approx 0.624$, a $2.21\times$ multiplier). This is the variance side of the bias-variance ladder: higher $p$ buys faster bias decay at the cost of a bigger variance constant.

4.5 The AMISE-optimal bandwidth

Combining Theorems 1 and 2, the asymptotic mean squared error at an interior point $x$ at odd $p$ is

$$\text{AMSE}(\hat m_h(x)) \;=\; \underbrace{\frac{h^{2(p+1)}}{[(p+1)!]^2}\, [m^{(p+1)}(x)]^2\, b_p(K)^2}_{\text{bias}^2} \;+\; \underbrace{\frac{\sigma^2(x)\, R^*_p(K)}{n\, h\, f_X(x)}}_{\text{variance}}.$$

Differentiating with respect to $h$ and setting to zero gives the local AMSE-optimal bandwidth

$$h^*_p(x) \;=\; \left[\frac{[(p+1)!]^2\, \sigma^2(x)\, R^*_p(K)}{2(p+1)\, n\, f_X(x)\, [m^{(p+1)}(x)]^2\, b_p(K)^2}\right]^{1 / (2p + 3)} \;\asymp\; n^{-1 / (2p + 3)}. \quad\quad (4.4)$$

For the §1 toy at $n = 200$, integrated over $[0.1, 0.9]$ to avoid the boundary regions, the AMISE-optimal bandwidths come out at:

| $p$ | $h^*_p$ scaling | $h^*_p$ at $n = 200$ (§1 toy) |
|----:|:---------------:|:-----------------------------:|
| 1 | $n^{-1/5}$ | $0.036$ |
| 3 | $n^{-1/9}$ | $0.093$ |

Higher-degree estimators want wider bandwidths because their variance constant $R^*_p(K)$ has grown but their bias constant has shrunk faster — wider $h$ exploits the better bias rate. The AMISE at $h^*_p$ scales as $n^{-2(p+1)/(2p+3)}$:

  • $p = 1$: $n^{-4/5}$.
  • $p = 3$: $n^{-8/9}$.
  • $p = 5$: $n^{-12/13}$.

This is the Stone (1980, 1982) minimax rate $n^{-2s/(2s+1)}$ for $s$-times-differentiable regression functions, with $s = p + 1$ at odd $p$. Local polynomial regression at degree $p$ is minimax-optimal over the class $\mathcal{F}_{p+1}^1$ of $(p+1)$-times-differentiable univariate functions — the canonical optimality result that makes local polynomial regression the modern frequentist nonparametric estimator of choice.

5. The boundary and the odd-vs-even degree story

5.1 The question this section answers

At interior points of the support, §4 gave us a clean rate ladder: odd $p$ delivers bias $O(h^{p+1})$, even $p$ collects a parity bonus to $O(h^{p+2})$, and within each pair the variance constant is identical. The boundary is a different story. Kernel Regression §8 gave us part of it: NW at $x = 0$ has bias $O(h)$ — one order worse than its $O(h^2)$ interior rate — and local-linear regression restores $O(h^2)$ uniformly. What about higher $p$? Does the §4 rate ladder hold uniformly across the support, or does the boundary degrade some degrees more than others?

This is the question. The answer turns on the boundary-truncated kernel moments and their interaction with the moment-matching identities of §3.

5.2 The boundary-truncated equivalent kernel

§3.4 already introduced the boundary-truncated equivalent kernel $K^{*,(c)}_p$, where $c = x/h \ge 0$ measures the distance from the support edge in bandwidth units. The truncated kernel moments are $\mu_j^{(c)}(K) = \int_{-c}^{\infty} u^j K(u)\, du$, the truncated moment matrix is $\mathbf{S}_p^{(c)} = [\mu_{j+k}^{(c)}]$, and the equivalent kernel satisfies the moment-matching identities

$$\int_{-c}^{\infty} u^j\, K^{*,(c)}_p(u)\, du \;=\; \delta_{j,0}, \qquad j = 0, 1, \ldots, p,$$

by construction (§3.3, Property 1) — the proof only uses that the $(j+1)$-th column of $\mathbf{S}_p^{(c)}$ is $(\mu_j^{(c)}, \ldots, \mu_{j+p}^{(c)})^\top$, which is unchanged by truncation. What doesn't survive the truncation is the checkerboard zero pattern that gave us the parity bonus at interior. With $c$ finite, the odd moments $\mu_1^{(c)}, \mu_3^{(c)}, \ldots$ are no longer zero, $\mathbf{S}_p^{(c)}$ has full nonzero structure, and the first row of $(\mathbf{S}_p^{(c)})^{-1}$ has nonzero entries at every index — both odd and even.

The §4 bias derivation goes through unchanged with the truncated moments.

Theorem 3 (Boundary bias formula).

At a boundary point $x = ch$ with $c \in [0, \infty)$,

$$\text{Bias}(\hat m_h(x)) \;=\; \frac{h^{p+1}}{(p+1)!}\, m^{(p+1)}(x)\, b_p^{(c)}(K) \;+\; o(h^{p+1}),$$

where $b_p^{(c)}(K) = \mathbf{e}_1^\top \big(\mathbf{S}_p^{(c)}\big)^{-1}\, \mathbf{c}_p^{(c)}$ is the boundary analog of the §4 bias constant.

5.3 The boundary-rate ladder

Computing $b_p^{(0)}(K)$ for the Gaussian (the notebook does this via numerical integration on the truncated moments) gives:

| $p$ | $b_p^{(0)}(K_{\text{Gauss}})$ | $\lvert b_p^{(0)} \rvert$ | leading boundary rate |
|----:|:-----------------------------:|:-------------------------:|:---------------------:|
| 0 | $+0.797885$ | $0.798$ | $h^1$ |
| 1 | $-0.751938$ | $0.752$ | $h^2$ |
| 2 | $+0.822824$ | $0.823$ | $h^3$ |
| 3 | $-1.014838$ | $1.015$ | $h^4$ |
| 4 | $+1.380150$ | $1.380$ | $h^5$ |

All five constants are nonzero — the parity-zero pattern that simplified §4.3 is gone at boundary. The boundary bias rate, read off from Theorem 3, increases monotonically with $p$: $h, h^2, h^3, h^4, h^5$ for $p = 0, 1, 2, 3, 4$. There is no “even $p$ gets stuck at $h^p$” effect at the leading-order asymptotic rate on the Gaussian-kernel-on-uniform-design setting that §1's toy provides.
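The boundary column is the same computation with the truncated helper sketched in §3.4: $b_p^{(0)}(K) = \int_0^\infty u^{p+1} K^{*,(0)}_p(u)\, du$. A minimal sketch:

```python
for p in range(5):
    uc, Kp_edge = equivalent_kernel_boundary(p, c=0.0)
    b_p0 = (uc ** (p + 1) * Kp_edge).sum() * du
    print(f"p={p}:  b_p^(0) = {b_p0:+.6f}")
```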

Figure: Empirical bias rate at boundary $x_0 = 0.005$ on the §1 toy, $B = 500$ replicates per $(p, h)$, $p \in \{0, 1, 2, 3\}$, $h$ on a 10-point log-spaced grid, with theoretical $h^{p+1}$ slopes overlaid. The analytical boundary bias constants $b_p^{(0)}(K)$ are nonzero for every $p \in \{0, \ldots, 4\}$ (table above), so the boundary rate is uniformly $h^{p+1}$ — the parity-zero pattern of the interior is gone. The MC-noisy empirical slopes at this small $n$ and $B$ are consistent with the theoretical prediction; cleaner empirical recovery would require either a larger $B$ or a slightly less extreme $x_0$.

The thing odd $p$ retains, even $p$ doesn't. What separates odd from even $p$ at boundary is design-adaptivity. From the §4.2 derivation, the leading-bias formula at odd $p$ involves only $m^{(p+1)}(x)$. At even $p$, the next-order $h^{p+2}$ term enters with a coefficient that involves $f_X'(x)/f_X(x)$ — the design-density correction that NW also carries.

In the boundary regime where Theorem 3 is the dominant term, this is a constant-level distinction, not a rate-level one. The simplest sharp statement is therefore: odd $p$ gives uniform $O(h^{p+1})$ bias rate with design-adaptive constants across the support; even $p$ gives the same uniform rate at leading order but with non-adaptive boundary constants.

5.4 The practical recommendation: $p \in \{1, 3\}$

Combining the §4 interior story and the §5.3 boundary story, the bias-rate ladder uniformly across the support is:

| $p$ | interior rate | boundary rate | uniform rate | design-adaptive? |
|----:|:-------------:|:-------------:|:------------:|:----------------:|
| 0 | $h^2$ | $h^1$ | $h^1$ | no |
| 1 | $h^2$ | $h^2$ | $h^2$ | yes |
| 2 | $h^4$ | $h^3$ | $h^3$ | partially |
| 3 | $h^4$ | $h^4$ | $h^4$ | yes |
| 4 | $h^6$ | $h^5$ | $h^5$ | partially |
| 5 | $h^6$ | $h^6$ | $h^6$ | yes |

Odd $p$ gives a matched uniform rate (interior and boundary both $h^{p+1}$); even $p$ gives a mismatched uniform rate (boundary one order worse than the parity-bonus interior). Combined with the variance penalty $R^*_p(K)$ growing with $p$, the Pareto-optimal degrees in the (uniform rate, complexity) trade-off are the odd ones.

In practice:

  • $p = 1$ — local-linear: the workhorse. Uniform $h^2$ bias, $R^*_1(K) = R^*_0(K)$ (no variance penalty over NW), design-adaptive. R's lowess and statsmodels's lowess both fit at this degree.
  • $p = 3$ — local-cubic: when you need second derivatives (§9) or your application earns the $h^4$ rate. Variance constant $R^*_3 \approx 1.69\, R^*_1$, but the bias improvement is usually worth it for moderate $n$.
  • $p = 2$ skipped: only a marginal uniform-rate improvement over $p = 1$ ($h^4$ vs $h^2$ at interior, but just $h^3$ vs $h^2$ at boundary), variance penalty similar to $p = 3$, no second-derivative estimate.
  • $p \ge 5$: rarely worth it — variance constant grows fast, bandwidth sensitivity grows fast, and the $C^{p+1}$-smoothness assumption is hard to defend.

6. Boundary behavior across the full support

§5 measured the bias rate at one boundary point $x_0 = 0.005$. This section zooms out: the bias as a function of $x$, plotted across the full support $[0, 1]$ at fixed bandwidth. The §3 equivalent kernel $K^{*,(c)}_p$ adapts smoothly to location through $c = x/h$, and that adaptation is what gives local polynomial regression its uniform rate.

6.1 The location-aware equivalent kernel

The boundary parameter $c = x/h$ is a smooth function of evaluation location. At the strict boundary $x = 0$, $c = 0$ — only $u \in [0, \infty)$ of the kernel mass is available. For $x \gg h$ (deep interior), $c$ is large and the truncation is invisible — $K^{*,(c)}_p \to K^*_p$, the interior form from §3. Between these extremes, $K^{*,(c)}_p$ interpolates continuously, reweighting itself to compensate for whichever portion of the kernel mass falls outside the support at the current $x$.

Figure: Equivalent kernels $K^{*,(c)}_1(u)$ and $K^{*,(c)}_3(u)$ at $c \in \{0, 0.5, 1, 2, \infty\}$ on the Gaussian base kernel, one panel per degree. At $c = 0$ the kernel is one-sided and asymmetric; at $c = \infty$ it is the symmetric interior form; at intermediate $c$ it transitions smoothly.

The location-awareness is automatic — we never explicitly construct a “boundary kernel.” We just solve the same WLS problem (2.2) at every evaluation point, and the kernel-moment matrix $\mathbf{X}_x^\top \mathbf{W}_x \mathbf{X}_x$ absorbs the boundary truncation through its empirical moments.

6.2 Why $p \in \{1, 3\}$ removes the boundary asymmetry

The moment-matching identities are the entire mechanism. From §3.3 and §5.2, for every $c \in [0, \infty]$,

$$\int_{-c}^{\infty} u^j\, K^{*,(c)}_p(u)\, du \;=\; \delta_{j,0}, \qquad j = 0, 1, \ldots, p.$$

The Taylor-expansion bias derivation (§4.1) substitutes $m(x + hu)$ inside the equivalent-kernel integral, and the moment-matching identities kill the first $p$ powers of $u$ in the expansion. The first surviving term is at order $u^{p+1}$, contributing the $h^{p+1}$ leading bias.

For NW ($p = 0$), the moment-matching identity covers only $\int K = 1$ — not $\int u K = 0$. At interior with symmetric $K$, $\int u K = 0$ holds for free by symmetry. At the boundary (truncated $K$ on $[-c, \infty)$ with $c < \infty$), the symmetry argument fails and $\int_{-c}^{\infty} u K(u)\, du \ne 0$. The first-order term in the Taylor expansion survives, contributing $h \cdot \mu_1^{(c)} \cdot m'(x)$ to the bias — the famous NW boundary bias.

For local-linear ($p = 1$), the moment-matching identity now covers both $\int K^*_1 = 1$ and $\int u K^*_1 = 0$ regardless of $c$. The first-order bias term vanishes uniformly across the support. The remaining $h^2$ leading order is the same as interior — boundary bias and interior bias are at the same rate.

For local-cubic ($p = 3$), all four identities $\int u^j K^*_3\, du = \delta_{j,0}$ for $j = 0, 1, 2, 3$ hold uniformly. The first three Taylor terms vanish, leaving the $h^4$ leading bias rate uniformly across the support.

6.3 The full-domain bias figure

The empirical bias profile $|\mathbb{E}[\hat m_h(x)] - m(x)|$ across $x \in [0, 1]$ at fixed $h = 0.08$ on the §1 toy makes the uniformity concrete.

Figure: Empirical bias vs $x$ at $h = 0.08$ on the §1 toy, $B = 500$ MC replicates, $p \in \{0, 1, 2, 3\}$; shaded regions mark one bandwidth $h$ on each side of the support boundary, where the kernel mass starts to truncate. NW ($p = 0$) shows pronounced spikes at the boundaries $x = 0$ and $x = 1$ (the $h$-rate boundary bias) flanking a moderate-bias interior. Local-linear ($p = 1$) shows uniformly small bias across the support — boundaries no worse than interior. Local-cubic ($p = 3$) is uniformly the smallest of the three; local-quadratic ($p = 2$) sits between.

The takeaway, visible in one figure: NW pays a boundary tax that local-linear removes and local-cubic removes more thoroughly. The §1 close (“local-cubic does both — slope and curvature — and tracks the sine more tightly still”) reflects this rate uniformity, not just constant improvements.
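A minimal Monte-Carlo sketch of the bias-profile computation behind the figure, at a reduced replicate count; it assumes the `local_polynomial` helper from §2.4 and the §1 data-generating process (the helper name and grid size are illustrative):

```python
def bias_profile(p, h=0.08, B=200, n=200, seed=0):
    """|E[hat m_h(x)] - m(x)| across [0, 1], estimated by averaging B replicate fits."""
    x_grid = np.linspace(0.0, 1.0, 101)
    m_true = np.sin(2 * np.pi * x_grid) + x_grid / 2
    rng = np.random.default_rng(seed)
    fits = np.zeros((B, x_grid.size))
    for b in range(B):
        X = rng.uniform(0, 1, n)
        Y = np.sin(2 * np.pi * X) + X / 2 + rng.normal(0, 0.2, n)
        fits[b] = local_polynomial(X, Y, x_grid, h, p)
    return x_grid, np.abs(fits.mean(axis=0) - m_true)

x_grid, abs_bias_p1 = bias_profile(p=1)
```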

7. Bandwidth selection at general $p$

Kernel Regression §5 covered four bandwidth selectors for NW: Silverman's rule of thumb, plug-in, leave-one-out cross-validation, and generalized cross-validation. Three of the four — LOO-CV, GCV, and the plug-in family — extend essentially unchanged to general $p$ via the smoother-matrix machinery already implicit in §3. Silverman's rule is NW-specific (it borrows from kernel density estimation) and doesn't transfer; the others do.

7.1 The smoother matrix at degree $p$

Local polynomial regression at every $p$ is a linear smoother (§3.1). When we evaluate the estimator at the design points themselves, the values stack into

$$\hat{\mathbf{m}}_h \;=\; \mathbf{H}_p\, \mathbf{y},$$

where $\mathbf{H}_p \in \mathbb{R}^{n \times n}$ is the smoother matrix at degree $p$. Its $(i, j)$ entry is the equivalent-kernel weight on observation $j$ when evaluating at $X_i$. Each row sums to one (the moment-matching identity $\int K^*_p = 1$ at the empirical level — a partition of unity for the $H_{ij}$).

The diagonal collapses cleanly. At $j = i$, the design vector $(1, X_i - X_i, \ldots, (X_i - X_i)^p)^\top = \mathbf{e}_1$, and $w_{ii} = K_h(0) = K(0)/h$. So

$$\boxed{\, H_{ii} \;=\; \frac{K(0)}{h}\, \big[(\mathbf{X}_{X_i}^\top \mathbf{W}_{X_i} \mathbf{X}_{X_i})^{-1}\big]_{0, 0}. \,} \quad\quad (7.1)$$

This is the “self-influence” of observation $i$ on its own fit. The trace $\operatorname{tr}(\mathbf{H}_p) = \sum_i H_{ii}$ is the effective degrees of freedom of the local-polynomial fit at bandwidth $h$ and degree $p$ — small $h$ gives large $\operatorname{tr}(\mathbf{H}_p)$ (close to $n$, near-interpolation), large $h$ gives small $\operatorname{tr}(\mathbf{H}_p)$ (close to $p + 1$, near-global-polynomial fit). Higher $p$ at fixed $h$ produces a larger $\operatorname{tr}(\mathbf{H}_p)$ — each evaluation point gets a more flexible local fit, making the smoother “hungrier” for degrees of freedom.
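A minimal sketch of (7.1): the smoother-matrix diagonal and the effective degrees of freedom, computed from the per-point Gram matrices in scaled coordinates (where the $(0,0)$ entry of the inverse Gram matrix equals its unscaled counterpart, so (7.1) applies directly). It assumes the toy sample `X` from the §2.4 usage sketch; the helper name is illustrative:

```python
def smoother_diag(X, h, p, K_fn=gaussian_kernel):
    """Diagonal of H_p via H_ii = (K(0)/h) * [(X^T W X)^{-1}]_{00}."""
    X = np.asarray(X, float)
    u = (X[None, :] - X[:, None]) / h                   # u[i, j] = (X_j - X_i) / h
    w = K_fn(u) / h
    Z = u[..., None] ** np.arange(p + 1)                # scaled design at each X_i
    A = np.einsum("inj,ink->ijk", w[..., None] * Z, Z)  # (n, p+1, p+1) Gram matrices
    inv00 = np.linalg.inv(A)[:, 0, 0]                   # [(Z^T W Z)^{-1}]_{00} = [(X^T W X)^{-1}]_{00}
    return K_fn(0.0) / h * inv00

H_diag = smoother_diag(X, h=0.08, p=1)
print("effective df:", H_diag.sum())                    # tr(H_p)
```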

7.2 Leave-one-out cross-validation

The LOO-CV objective is

$$\text{LOOCV}(h) \;=\; \frac{1}{n} \sum_{i=1}^n \big(Y_i - \hat m_h^{(-i)}(X_i)\big)^2,$$

where $\hat m_h^{(-i)}$ is the local polynomial fit excluding observation $i$. Computed naively, this requires $n$ separate fits per bandwidth — prohibitive in any sweep over $h$. The leave-one-out identity (Hastie & Tibshirani 1990, applicable to any linear smoother) reduces it to one fit:

Theorem 4 (LOO-CV closed-form identity).

For any linear smoother $\hat m_h(X_i) = (\mathbf{H}_p \mathbf{y})_i$ with smoother matrix $\mathbf{H}_p$,

$$Y_i - \hat m_h^{(-i)}(X_i) \;=\; \frac{Y_i - \hat m_h(X_i)}{1 - H_{ii}}, \qquad \text{LOOCV}(h) \;=\; \frac{1}{n} \sum_{i=1}^n \left(\frac{Y_i - \hat m_h(X_i)}{1 - H_{ii}}\right)^{\!2}.$$
Proof.

Construct a perturbed response vector $\tilde{\mathbf y}$ with $\tilde y_j = Y_j$ for $j \ne i$ and $\tilde y_i = \hat m_h^{(-i)}(X_i)$ (the LOO prediction at $X_i$). The local-polynomial fit using $\tilde{\mathbf y}$ must agree with $\hat m_h^{(-i)}$ at $X_i$, because replacing $Y_i$ by the prediction we'd have made anyway doesn't change the LOO fit there. So

$$\hat m_h^{(-i)}(X_i) \;=\; \big(\mathbf{H}_p\, \tilde{\mathbf{y}}\big)_i \;=\; \sum_{j \ne i} H_{ij}\, Y_j \;+\; H_{ii}\, \tilde y_i \;=\; \sum_{j \ne i} H_{ij}\, Y_j \;+\; H_{ii}\, \hat m_h^{(-i)}(X_i).$$

Solving for $\hat m_h^{(-i)}(X_i)$,

$$\hat m_h^{(-i)}(X_i)\, (1 - H_{ii}) \;=\; \sum_{j \ne i} H_{ij}\, Y_j \;=\; \hat m_h(X_i) - H_{ii}\, Y_i,$$

and the LOO residual is

$$Y_i - \hat m_h^{(-i)}(X_i) \;=\; Y_i - \frac{\hat m_h(X_i) - H_{ii}\, Y_i}{1 - H_{ii}} \;=\; \frac{Y_i - \hat m_h(X_i)}{1 - H_{ii}}.$$

The identity gives LOO-CV in closed form at the cost of one full fit and the diagonal of $\mathbf{H}_p$.

7.3 Generalized cross-validation

GCV (Craven & Wahba 1979) replaces the per-point $H_{ii}$ in the LOO-CV formula with the average $\operatorname{tr}(\mathbf{H}_p)/n$:

$$\text{GCV}(h) \;=\; \frac{(1/n) \sum_{i=1}^n (Y_i - \hat m_h(X_i))^2}{\big(1 - \operatorname{tr}(\mathbf{H}_p)/n\big)^2}. \quad\quad (7.4)$$

Two practical advantages over LOO-CV: only the trace is needed (one scalar per bandwidth), and GCV is invariant to orthogonal rotations of the response (more robust to outlier $H_{ii}$). The asymptotic minimizers of LOO-CV and GCV coincide; GCV's choice tends to be slightly larger in finite samples and is the more common default in production smoothers.
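Putting Theorem 4 and (7.4) together gives a one-fit-per-bandwidth selector. A minimal sketch on the §1 toy sample, reusing `local_polynomial` (§2.4) and `smoother_diag` (§7.1); the grid matches the notebook's sweep range but the function name is illustrative:

```python
def cv_scores(X, Y, h, p):
    fitted = local_polynomial(X, Y, X, h, p)              # hat m_h at the design points
    H_diag = smoother_diag(X, h, p)
    resid = Y - fitted
    loocv = np.mean((resid / (1 - H_diag)) ** 2)          # Theorem 4
    gcv = np.mean(resid ** 2) / (1 - H_diag.mean()) ** 2  # (7.4): tr(H_p)/n = mean of the H_ii
    return loocv, gcv

h_grid = np.geomspace(0.04, 0.40, 25)
scores = np.array([cv_scores(X, Y, h, p=1) for h in h_grid])
print("LOO-CV minimizer:", h_grid[scores[:, 0].argmin()])
print("GCV minimizer:   ", h_grid[scores[:, 1].argmin()])
```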

7.4 Plug-in selectors

The Ruppert–Sheather–Wand (1995) plug-in family estimates the AMISE-optimal bandwidth (4.4) directly from data, by replacing the unknown functionals with pilot estimates. For $p = 1$ at interior with symmetric $K$, (4.4) reduces to

$$h^*_1 \;=\; \left[\frac{\sigma^2\, R(K)\, (b - a)}{n\, \mu_2(K)^2\, \theta_2}\right]^{1/5}, \qquad \theta_2 \;=\; \int_a^b [m''(x)]^2\, f_X(x)\, dx.$$

Plug-in estimates the two unknown functionals: $\hat\sigma^2$ via Rice's estimator (consecutive-observation differences) and $\hat\theta_2$ via a pilot $m''$-fit at degree $p \ge 3$. The pilot bandwidth has its own selection problem; RSW solves the recursion via a normal-reference rule of thumb.

For $p = 3$, (4.4) calls for $\theta_4 = \int [m^{(4)}(x)]^2 f_X(x)\, dx$ — a fourth-derivative estimate. The pilot recursion deepens, and the variance of $\hat\theta_4$ overwhelms the plug-in's nominal advantage. For local-cubic, plug-in selectors are theoretically clean but practically fragile — most software defaults to LOO-CV or GCV at $p = 3$.
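For the §1 toy the two functionals in the $p = 1$ formula are known exactly: $\sigma^2 = 0.04$, $f_X \equiv 1$ on $[0, 1]$, and $\theta_2 = \int_0^1 [m''(x)]^2\, dx = 8\pi^4$ for $m(x) = \sin(2\pi x) + x/2$. A minimal sketch evaluating the oracle bandwidth as a sanity check on the pilot-based selectors (this is the oracle value, not the RSW recursion):

```python
sigma2, n_obs = 0.2 ** 2, 200
R_K, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0   # Gaussian-kernel constants R(K), mu_2(K)
theta2 = 8 * np.pi ** 4                    # int_0^1 (m''(x))^2 dx for the §1 toy
h_star_1 = (sigma2 * R_K * 1.0 / (n_obs * mu2 ** 2 * theta2)) ** (1 / 5)
print(h_star_1)                            # ~ 0.037, close to the §4.5 value of 0.036
```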

7.5 How $h^*_p$ shifts with $p$

The §4.5 scaling $h^*_p \asymp n^{-1/(2p+3)}$ predicts that the AMISE-optimal bandwidth grows with $p$. The notebook's LOO-CV sweep on the §1 toy at $n = 200$ over a 25-point log-spaced bandwidth grid $h \in [0.04, 0.40]$ produces:

| $p$ | $h^*_p$ scaling | LOO-CV minimizer at $n = 200$ |
|----:|:---------------:|:-----------------------------:|
| 0 | $n^{-1/3}$ | $\approx 0.040$ |
| 1 | $n^{-1/5}$ | $\approx 0.040$ |
| 3 | $n^{-1/9}$ | $\approx 0.139$ |
Figure: LOO-CV and GCV objective curves for $p \in \{0, 1, 3\}$ on the §1 toy at $n = 200$, $h$-grid log-spaced over $[0.04, 0.40]$, with minimizers marked. The minimizers — $h_{\text{loo}} \approx 0.040$ for $p \in \{0, 1\}$ and $h_{\text{loo}} \approx 0.139$ for $p = 3$ — agree with the §4.5 AMISE prediction that $h^*_p$ grows with $p$. The $p \in \{0, 1\}$ minimizers both clip the lower bound of the bandwidth grid (0.04); the true $p = 1$ minimizer is likely closer to the §4.5 AMISE prediction $h^*_1 \approx 0.036$, so widen the grid downward to 0.02 if a precise estimate is needed. The $p = 3$ minimizer at 0.139 is well-resolved.

In practice, the $h^*_p$ shift is a useful diagnostic: if you're running LOO-CV at multiple degrees on the same data and the minimizers don't shift right with $p$, something's wrong with your pilot or your CV implementation.

8. Multivariate local polynomials

Kernel Regression §6 already extended NW to $\mathbb{R}^d$ via product kernels and bandwidth matrices. Local-polynomial generalizes that extension cleanly: the design matrix grows from $p + 1$ columns to $\binom{d + p}{p}$ columns, the bandwidth becomes a $d \times d$ matrix or a $d$-vector, and everything else from §2–§7 carries over.

8.1 The polynomial-feature design matrix

In Rd\mathbb{R}^d, a polynomial of total degree pp around evaluation point x\mathbf{x} is parameterized by multi-indices α=(α1,,αd)Z0d\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{Z}_{\ge 0}^d with α:=α1++αdp|\boldsymbol\alpha| := \alpha_1 + \cdots + \alpha_d \le p. The corresponding monomial in centered coordinates is (Xix)α=j=1d(Xijxj)αj(\mathbf{X}_i - \mathbf{x})^{\boldsymbol\alpha} = \prod_{j=1}^d (X_{ij} - x_j)^{\alpha_j}. Stacking gives the multivariate design matrix ΦxRn×Md,p\boldsymbol\Phi_{\mathbf{x}} \in \mathbb{R}^{n \times M_{d, p}} with column count

Md,p  =  (d+pp)  =  (d+p)!d!p!.M_{d, p} \;=\; \binom{d + p}{p} \;=\; \frac{(d+p)!}{d!\, p!}.

For canonical choices: M1,1=2M_{1, 1} = 2, M1,3=4M_{1, 3} = 4, M2,1=3M_{2, 1} = 3, M2,2=6M_{2, 2} = 6, M2,3=10M_{2, 3} = 10, M5,2=21M_{5, 2} = 21, M10,2=66M_{10, 2} = 66. The column count grows polynomially in dd at fixed pp — manageable at moderate (d,p)(d, p) but a real concern by d5d \ge 5 and p3p \ge 3.
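
A quick sketch of the multi-index bookkeeping, with the column-count identity checked against the values above (brute-force enumeration, fine at moderate $d$ and $p$):

```python
from itertools import product
from math import comb

def multi_indices(d, p):
    """All exponent vectors alpha with |alpha| <= p — one per design-matrix column."""
    return [a for a in product(range(p + 1), repeat=d) if sum(a) <= p]

# Check M_{d,p} = C(d + p, p) against the values quoted in the text.
for d, p in [(1, 1), (1, 3), (2, 2), (5, 2), (10, 2)]:
    assert len(multi_indices(d, p)) == comb(d + p, p)
```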

The local-polynomial WLS problem stays in the same form as (2.1):

β^(x)  =  argminβRMd,pi=1nwi(YiΦx,iβ)2,m^H(x)  =  e1β^(x),\hat{\boldsymbol\beta}(\mathbf{x}) \;=\; \arg\min_{\boldsymbol\beta \in \mathbb{R}^{M_{d, p}}} \sum_{i=1}^n w_i \big(Y_i - \boldsymbol\Phi_{\mathbf{x}, i}^\top \boldsymbol\beta\big)^2, \qquad \hat m_{\mathbf{H}}(\mathbf{x}) \;=\; \mathbf{e}_1^\top \hat{\boldsymbol\beta}(\mathbf{x}),

with closed-form β^(x)=(ΦxWΦx)1ΦxWy\hat{\boldsymbol\beta}(\mathbf{x}) = (\boldsymbol\Phi_{\mathbf{x}}^\top \mathbf{W} \boldsymbol\Phi_{\mathbf{x}})^{-1} \boldsymbol\Phi_{\mathbf{x}}^\top \mathbf{W} \mathbf{y}.

8.2 Kernel weights and bandwidth matrices

Three standard ways to put kernel weights on XixRd\mathbf{X}_i - \mathbf{x} \in \mathbb{R}^d:

Product kernel with scalar bandwidth. The simplest: Kh(u)=hdjK(uj/h)K_h(\mathbf{u}) = h^{-d} \prod_j K(u_j / h). One bandwidth, no scale awareness.

Product kernel with diagonal bandwidth. When coordinates have different scales: KH(u)=(jhj1)jK(uj/hj)K_{\mathbf{H}}(\mathbf{u}) = (\prod_j h_j^{-1}) \prod_j K(u_j / h_j), H=diag(h1,,hd)\mathbf{H} = \operatorname{diag}(h_1, \ldots, h_d). The most common choice in practice.

Full positive-definite bandwidth matrix. When coordinates are correlated: KH(u)=H1/2K(H1/2u)K_{\mathbf{H}}(\mathbf{u}) = |\mathbf{H}|^{-1/2} K(\mathbf{H}^{-1/2} \mathbf{u}). Bandwidth-matrix selection is a real engineering problem at d3d \ge 3; diagonal H\mathbf{H} usually suffices.
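
Putting §8.1 and §8.2 together, a minimal sketch of the pointwise fit with a diagonal-bandwidth product kernel — the Gaussian kernel and the small usage example on a 2-dim toy like §8.3’s below are assumptions, not the notebook’s exact configuration:

```python
import numpy as np
from itertools import product

def local_poly_fit_point(X, y, x0, h, p, kernel=lambda u: np.exp(-0.5 * u**2)):
    """Degree-p multivariate local polynomial fit at x0 with a product kernel and a
    diagonal bandwidth vector h (one h_j per coordinate). Returns m_hat(x0)."""
    d = X.shape[1]
    alphas = [a for a in product(range(p + 1), repeat=d) if sum(a) <= p]
    D = X - x0                                              # centered coordinates
    Phi = np.column_stack([np.prod(D ** np.array(a), axis=1) for a in alphas])
    w = np.prod(kernel(D / h), axis=1)                      # product kernel, diagonal H
    A = Phi.T * w
    beta = np.linalg.solve(A @ Phi, A @ y)
    return beta[0]                                          # alpha = (0, ..., 0) column comes first

# Usage on a 2-dim toy like the one in §8.3 below (illustrative draw):
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 2))
y = np.sin(2 * np.pi * X[:, 0]) + np.sin(2 * np.pi * X[:, 1]) + rng.normal(0, 0.2, 400)
print(local_poly_fit_point(X, y, np.array([0.3, 0.7]), np.array([0.15, 0.15]), p=2))
# Should land near the truth m(0.3, 0.7) = sin(0.6*pi) + sin(1.4*pi) = 0.
```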

8.3 The 2-dim toy

Generalize the §1 toy to two dimensions:

XUniform([0,1]2),m(x)=sin(2πx1)+sin(2πx2),εN(0,0.22),\mathbf{X} \sim \mathrm{Uniform}([0, 1]^2), \qquad m(\mathbf{x}) = \sin(2\pi x_1) + \sin(2\pi x_2), \qquad \varepsilon \sim \mathcal{N}(0, 0.2^2),

with n=400n = 400 — twice the §1 sample size to fill the 2-dim cube.

The static figure 08_multivariate_local_quadratic.png shows the fit on a 50×50 evaluation grid; the interactive version recomputes on a 25×25 grid.

Local-quadratic (p = 2) fit on the 2-dim sin-sum toy at n = 400, isotropic bandwidth H = diag(0.15, 0.15).
Local-quadratic (p = 2) fit on the 2-dim toy at n = 400, isotropic bandwidth H = diag(0.15, 0.15). Three panels: true m(x) as heatmap, fitted m̂_H(x), residual m̂ − m with diverging colormap. Sample points overlaid on the truth panel. Grid MSE on a 50×50 evaluation lattice ≈ 0.004 — the multivariate analog of §6's univariate boundary-bias profile, with residuals concentrated near the boundary of [0, 1]².

The headline: local-polynomial regression in 2-dim recovers the surface faithfully at this nn, with residuals concentrated near the boundary of [0,1]2[0, 1]^2 — the multivariate analog of §6’s univariate boundary-bias profile.

8.4 The curse of dimensionality

Stone (1980, 1982) showed that for (p+1)(p+1)-times-continuously-differentiable regression functions on Rd\mathbb{R}^d, no estimator can achieve a uniform AMSE rate faster than n2(p+1)/(2(p+1)+d)n^{-2(p+1)/(2(p+1) + d)}. Local polynomial of degree pp achieves this rate (at odd pp) — minimax-optimal — but the rate degrades sharply with dd:

| $d$ | $p = 1$ rate | $p = 3$ rate | $n$ to match $(d = 1,\ p = 1,\ n = 200)$ |
| --- | --- | --- | --- |
| 1 | $n^{-4/5}$ | $n^{-8/9}$ | 200 |
| 2 | $n^{-2/3}$ | $n^{-4/5}$ | $\approx 600$ |
| 3 | $n^{-4/7}$ | $n^{-8/11}$ | $\approx 1{,}600$ |
| 5 | $n^{-4/9}$ | $n^{-8/13}$ | $\approx 9{,}200$ |
| 10 | $n^{-2/7}$ | $n^{-4/9}$ | $\approx 700{,}000$ |
| 20 | $n^{-1/6}$ | $n^{-2/7}$ | $\approx 5 \times 10^9$ |

Higher polynomial degree softens the dimension penalty, but only by demanding more smoothness; in practice $p$ rarely exceeds $3$ in multivariate settings. The escape routes trade the curse for structural assumptions instead — additive structure (§11), low-rank tensor methods, neural surrogates — and the latter two belong to the broader high-dimensional-statistics literature.

9. Derivative estimation

The Taylor identification from §2.1 — βj=m(j)(x)/j!\beta_j^\star = m^{(j)}(x)/j! — gives local polynomial regression a useful side product: derivative estimates fall out of the WLS solve at no additional cost. A single local-cubic fit returns m^,m^,m^,m^\hat m, \hat m', \hat m'', \hat m''' at the evaluation point, in one solve.

9.1 Reading derivatives off the WLS coefficients

For any local polynomial fit of degree pp,

m^h(j)(x)  =  j!β^j(x),j=0,1,,p.\hat m_h^{(j)}(x) \;=\; j! \, \hat\beta_j(x), \qquad j = 0, 1, \ldots, p.

The §4 parity story extends: at interior with symmetric KK, the leading bias of m^h(j)\hat m_h^{(j)} at degree pp is

Bias(m^h(j)(x))  =  hp+1jj!cj,p(K)(p+1)!m(p+1)(x)  +  o(hp+1j),\text{Bias}(\hat m_h^{(j)}(x)) \;=\; h^{p + 1 - j} \cdot \frac{j! \, c_{j, p}(K)}{(p+1)!} \, m^{(p+1)}(x) \;+\; o(h^{p+1-j}),

with $c_{j, p}(K)$ a kernel-moment-matrix combination, whenever $p - j$ is odd. When $p - j$ is even, that leading term involves an odd equivalent-kernel moment that vanishes at interior points, so the bias rate improves by one order of $h$ — but the next-order constant picks up design-density derivatives, the same design-adaptivity penalty of §4.3.

The “good” parity rule for derivative estimation is therefore:

  • m^\hat m (j=0j = 0): want pp odd. Use p=1p = 1 or p=3p = 3.
  • m^\hat m' (j=1j = 1): want pp even for the cleanest rate. p=2p = 2 is technically optimal at interior, but p=3p = 3 is the practical choice because it covers m^\hat m and m^\hat m'' at uniform rates and the rate-suboptimality for m^\hat m' is mild.
  • m^\hat m'' (j=2j = 2): want pp odd. Use p=3p = 3.

A single local-cubic (p=3p = 3) fit therefore covers the practically interesting derivative range — j=0,1,2j = 0, 1, 2 — with m^\hat m and m^\hat m'' at clean uniform rates.
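
A minimal sketch of the read-off — one degree-3 WLS solve at $x_0$, with the Gaussian kernel as an assumed choice — returns all four estimates at once:

```python
import numpy as np

def local_cubic_with_derivatives(x, y, x0, h, kernel=lambda u: np.exp(-0.5 * u**2)):
    """One degree-3 WLS solve at x0; the j-th derivative estimate is j! * beta_j."""
    d = x - x0
    w = kernel(d / h)
    Phi = np.vander(d, 4, increasing=True)     # 1, (x - x0), ..., (x - x0)^3
    A = Phi.T * w
    beta = np.linalg.solve(A @ Phi, A @ y)
    return np.array([1, 1, 2, 6]) * beta       # (m_hat, m_hat', m_hat'', m_hat''') at x0
```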

9.2 Bias and variance of m^h(j)\hat m_h^{(j)}

The leading-order bias is $O(h^{p+1-j})$ when $p - j$ is odd. The variance scales much worse than for the function-value estimate: from the §3 equivalent-kernel argument with the $\mathbf{D}_h$ rescaling, $\hat\beta_j = h^{-j}\,\hat\beta_j^{\text{scaled}}$, and the scaled variance is $O(1/(nh))$. So

Var(m^h(j))  =  (j!)2Var(β^j)  =  O ⁣(1nh2j+1).\text{Var}(\hat m_h^{(j)}) \;=\; (j!)^2 \, \text{Var}(\hat\beta_j) \;=\; O\!\left(\frac{1}{n\, h^{2j+1}}\right).
| $j$ | leading bias rate (when $p - j$ is odd) | variance rate |
| --- | --- | --- |
| 0 | $h^{p+1}$ | $1/(nh)$ |
| 1 | $h^{p}$ | $1/(nh^3)$ |
| 2 | $h^{p-1}$ | $1/(nh^5)$ |
| 3 | $h^{p-2}$ | $1/(nh^7)$ |

Variance grows by a factor of 1/h21/h^2 per derivative order. At h=0.15h = 0.15, that’s roughly a 44×44\times variance multiplier per derivative.

9.3 The bandwidth shift for derivative estimation

The AMISE-optimal bandwidth for m^h(j)\hat m_h^{(j)} at degree pp scales as n1/(2p+3)n^{-1/(2p+3)} — same in nn as for m^\hat m — but with a jj-dependent constant that grows with jj. Don’t use a single bandwidth for everything. Run LOO-CV separately for each derivative target and use the jj-specific bandwidth.

9.4 Numerical demonstration on the §1 toy

We test derivative recovery at x0=0.125x_0 = 0.125 — a point where all four of m,m,m,mm, m', m'', m''' are nonzero (unlike x0=0.5x_0 = 0.5 where the even derivatives vanish). The truth, computed analytically:

m(0.125) = \sin(\pi/4) + 1/16 \approx 0.770, \qquad m'(0.125) = \pi\sqrt{2} + 1/2 \approx 4.94,
m''(0.125) = -2\pi^2 \sqrt{2} \approx -27.92, \qquad m'''(0.125) = -4\pi^3 \sqrt{2} \approx -175.4.

A single local-cubic fit at h=0.15h = 0.15 on a 200-point sample reads off all four. The function value and second derivative recover cleanly (per §9.2’s parity rule); the first derivative recovers with moderate noise; the third derivative is dominated by noise at this nn.

| $j$ | $\hat m^{(j)}(x_0)$ | $m^{(j)}(x_0)$ | abs. error |
| --- | --- | --- | --- |
| 0 | 0.779 | 0.770 | 0.010 |
| 1 | 4.583 | 4.943 | 0.360 |
| 2 | −35.862 | −27.915 | 7.946 |
| 3 | 36.511 | −175.398 | 211.909 |

Variance grows by ≈1/h² per derivative order — m̂'' is much noisier than m̂, m̂''' worse still.

Empirical AMSE of m̂^(j)_h for j in {0, 1, 2} at p = 3, x_0 = 0.125 on the §1 toy, B = 300 replicates.
Empirical AMSE of m̂_h^(j)(x_0) at degree p = 3, x_0 = 0.125 on the §1 toy, B = 300 replicates per (j, h). Curves for j ∈ {0, 1, 2}, h on a 16-point log-spaced grid in [0.06, 0.40]. The minimizers shift rightward as j grows — derivative estimation wants wider bandwidths than function-value estimation, even at the same p. The variance multiplier 1/h^2 per derivative order is what drives this.

10. Connections to smoothing splines

Smoothing splines are the other major family of frequentist nonparametric smoothers. They look very different from local polynomial regression — a penalty-based variational problem instead of a kernel-weighted local fit — but Silverman (1984) showed they’re asymptotically equivalent to local-cubic regression with a specific location-adaptive bandwidth.

10.1 The cubic smoothing-spline objective

The cubic smoothing spline minimizes

Jλ(g)  =  i=1n(Yig(Xi))2  +  λab[g(x)]2dx(10.1)\mathcal{J}_\lambda(g) \;=\; \sum_{i=1}^n \big(Y_i - g(X_i)\big)^2 \;+\; \lambda \int_a^b \big[g''(x)\big]^2\, dx \quad\quad (10.1)

over gW22g \in W^2_2. As λ0\lambda \to 0: interpolation. As λ\lambda \to \infty: OLS line.

The minimizer is a natural cubic spline. Schoenberg (1964) and Reinsch (1967) showed that the minimizer of (10.1) is piecewise cubic between the design points, continuous in value and first two derivatives across knots, and linear (zero curvature) outside [X(1),X(n)][X_{(1)}, X_{(n)}]. The “natural” boundary condition is automatic.

In matrix form, the fitted values at the design points satisfy

g^λ  =  (In+λK)1y,(10.2)\hat{\mathbf g}_\lambda \;=\; \big(\mathbf{I}_n + \lambda\, \mathbf{K}\big)^{-1} \mathbf{y}, \quad\quad (10.2)

where KRn×n\mathbf{K} \in \mathbb{R}^{n \times n} is a banded penalty matrix encoding [g]2\int [g'']^2 at the design points. The smoother matrix Sλ=(In+λK)1\mathbf{S}_\lambda = (\mathbf{I}_n + \lambda \mathbf{K})^{-1} makes the smoothing spline a linear smoother. The Reinsch algorithm computes g^λ\hat{\mathbf g}_\lambda in O(n)O(n) time. The §7 LOO-CV / GCV machinery extends without change.
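
A minimal sketch using scipy’s `make_smoothing_spline` (SciPy ≥ 1.10), with the effective degrees of freedom $\operatorname{tr}(\mathbf{S}_\lambda)$ recovered through the linear-smoother trick of fitting unit vectors. The toy restatement is an assumption, and the printed trace depends on the draw and on scipy’s $\lambda$ scaling, so it need not match §10.3’s value exactly:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline   # SciPy >= 1.10

rng = np.random.default_rng(42)                        # illustrative restatement of the §1 toy
x = np.sort(rng.uniform(0, 1, 200))                    # make_smoothing_spline wants increasing x
y = np.sin(2 * np.pi * x) + x / 2 + rng.normal(0, 0.2, 200)

lam = 1.23e-3
y_hat = make_smoothing_spline(x, y, lam=lam)(x)

# Effective df tr(S_lambda): the fit is linear in y at fixed lambda, so the fit
# to the i-th unit vector, evaluated at x_i, is the i-th diagonal entry of S_lambda.
trace = sum(make_smoothing_spline(x, e, lam=lam)(x)[i]
            for i, e in enumerate(np.eye(len(x))))
print(f"effective df ~ {trace:.2f}")
```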

10.2 Silverman’s equivalent variable-kernel theorem

Theorem 5 (Silverman (1984) equivalent variable kernel).

Suppose fXf_X is bounded away from zero on [a,b][a, b] and λ=λn\lambda = \lambda_n satisfies n4/5λn1n^{-4/5} \ll \lambda_n \ll 1. Then the smoothing-spline weight on observation ii when evaluating at xx satisfies

[Sλ]ix    1nh(x)fX(x)KS ⁣(Xixh(x)),h(x)  =  [λnfX(x)]1/4,[\mathbf{S}_\lambda]_{ix} \;\approx\; \frac{1}{n\, h(x)\, f_X(x)}\, K_S\!\left(\frac{X_i - x}{h(x)}\right), \qquad h(x) \;=\; \left[\frac{\lambda}{n\, f_X(x)}\right]^{1/4},

where KSK_S is the Silverman kernel

KS(u)  =  12exp ⁣(u2)sin ⁣(u2+π4).(10.3)K_S(u) \;=\; \tfrac{1}{2}\, \exp\!\left(-\tfrac{|u|}{\sqrt{2}}\right) \, \sin\!\left(\tfrac{|u|}{\sqrt{2}} + \tfrac{\pi}{4}\right). \quad\quad (10.3)

Two things to register about KSK_S.

It’s a fourth-order kernel. KS=1\int K_S = 1 and ujKS=0\int u^j K_S = 0 for j=1,2,3j = 1, 2, 3, but u4KS0\int u^4 K_S \ne 0. These are the same moment-matching identities the equivalent kernel K3K^*_3 satisfies at interior with a symmetric base kernel (§3.3, Property 1, with p=3p = 3). Both produce h4h^4 leading bias and 1/(nh)1/(nh) variance.

The bandwidth is location-adaptive. Where fXf_X is large, h(x)h(x) is small (tighter local neighborhood). Where fXf_X is small, h(x)h(x) is large. This adaptation is automatic — the global λ\lambda becomes a local h(x)h(x) through fX(x)1/4f_X(x)^{-1/4} — and it’s something local-polynomial regression doesn’t do at fixed hh.
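
The fourth-order claim is easy to verify numerically — a quick quadrature check of $K_S$ (the finite integration limits are a numerical convenience):

```python
import numpy as np
from scipy.integrate import quad

def K_S(u):
    """Silverman's equivalent kernel (10.3)."""
    a = np.abs(u) / np.sqrt(2)
    return 0.5 * np.exp(-a) * np.sin(a + np.pi / 4)

# Fourth-order structure: unit mass, vanishing first three moments, nonzero fourth.
for j in range(5):
    val, _ = quad(lambda u: u**j * K_S(u), -50, 50, limit=200)
    print(j, round(val, 4))     # expect 1, 0, 0, 0, then a nonzero fourth moment
```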

10.3 Practical comparison

When fXf_X is roughly uniform, smoothing splines and local-cubic produce nearly identical fits at matched smoothing levels. On the §1 toy at local-cubic h=0.10h = 0.10 (effective df tr(H3)8.47\operatorname{tr}(\mathbf{H}_3) \approx 8.47), the matching smoothing spline at λ=1.23×103\lambda = 1.23 \times 10^{-3} has tr(Sλ)7.96\operatorname{tr}(\mathbf{S}_\lambda) \approx 7.96, with mean pointwise LCspline0.014|LC - \text{spline}| \approx 0.014 — visually indistinguishable.

Smoothing splines win when the design density is heterogeneous (the Theorem 5 bandwidth adapts to $f_X$ automatically), when boundary behavior matters, when a single global hyperparameter is preferable, or when the Sobolev-norm formulation is natural.

Local-polynomial regression wins when the problem is multivariate (thin-plate splines are harder to work with), when derivative estimates are needed, when local control over the amount of smoothing is required, or when embarrassingly-parallel evaluation across a grid matters.

The Silverman kernel $K_S$ is asymptotically equivalent to the cubic smoothing spline (Silverman 1984) and serves as a stand-in here — a full Reinsch-algorithm spline implementation is outside this topic's scope; the static figure 10_local_cubic_vs_smoothing_spline.png shows the comparison against scipy's `make_smoothing_spline`.

Silverman kernel K_S overlaid with K*_3 for the Gaussian base kernel.
Silverman kernel K_S (red) overlaid with K*_3 for the Gaussian base kernel (blue). Different shapes (exponential vs Gaussian decay) but the same fourth-order moment structure (∫K_S = 1, ∫u^j K_S = 0 for j = 1, 2, 3) — that's what produces the asymptotic equivalence in Theorem 5.
§1 toy with local-cubic (h = 0.10) and smoothing-spline (lambda = 1.23e-3) fits at matched effective df.
§1 toy with local-cubic at h = 0.10 (effective df ≈ 8.47) and cubic smoothing spline at λ = 1.23 × 10⁻³ (effective df ≈ 7.96). The two are visually indistinguishable across most of the support — Silverman's (1984) asymptotic equivalence reads off as a finite-sample near-coincidence at moderate n.

11. Connections to additive models

§8.4 left us with the curse: at d5d \ge 5, the local-polynomial rate becomes too slow for realistic sample sizes. The additive-model trick is to assume the regression function has structure that escapes the dimension penalty.

11.1 Additive structure as a curse dodge

The additive model assumes

m(x)  =  α  +  j=1dmj(xj),E[mj(Xj)]  =  0 for each j.(11.1)m(\mathbf{x}) \;=\; \alpha \;+\; \sum_{j=1}^{d} m_j(x_j), \qquad \mathbb{E}[m_j(X_j)] \;=\; 0 \text{ for each } j. \quad\quad (11.1)

Each mjm_j depends on a single coordinate, so each can be estimated at the univariate Stone rate n2(p+1)/(2p+3)n^{-2(p+1)/(2p+3)} — independent of dd. At d=5d = 5, p=1p = 1, n=200n = 200:

| approach | AMSE rate | example AMSE |
| --- | --- | --- |
| Multivariate local-polynomial | $n^{-4/9}$ | $\approx 0.073$ |
| Additive (GAM) with local-polynomial components | $n^{-4/5}$ | $\approx 0.014$ |

A 5× improvement at d=5d = 5, growing with dd. The cost is the structural assumption: no interactions like x1x2x_1 x_2 or cos(x1+x2)\cos(x_1 + x_2). Misspecification bias is O(1)O(1) — it doesn’t go to zero with nn.

The additive form (11.1) goes by the name GAM — generalized additive model (Hastie & Tibshirani 1990) — with “generalized” referring to the extension to non-Gaussian responses via link functions. For the regression case, the GAM reduces to (11.1) with local-polynomial component estimators inside the backfitting algorithm.

11.2 The backfitting algorithm

Algorithm 1 (Backfitting (Friedman & Stuetzle 1981; Buja, Hastie & Tibshirani 1989)).

Fit the additive model (11.1) by iteratively smoothing partial residuals against each covariate.

Initialization. α^=Yˉ\hat\alpha = \bar Y, m^j0\hat m_j \equiv 0 for j=1,,dj = 1, \dots, d.

Iteration. Cycle j=1,,dj = 1, \ldots, d. Compute the partial residual

ri(j)  =  Yiα^kjm^k(Xik),r_i^{(j)} \;=\; Y_i - \hat\alpha - \sum_{k \ne j} \hat m_k(X_{ik}),

then update m^j\hat m_j by smoothing the partial residuals against XijX_{ij} via a univariate local polynomial,

m^j(x)  =  [local-polynomial fit of r(j) on X,j](x)    1ni=1n[](Xij),(11.2)\hat m_j(x) \;=\; \big[\text{local-polynomial fit of } r^{(j)} \text{ on } X_{\cdot,j}\big](x) \;-\; \frac{1}{n} \sum_{i=1}^{n} [\cdots](X_{ij}), \quad\quad (11.2)

with the centering subtraction enforcing E[m^j(Xj)]0\mathbb{E}[\hat m_j(X_j)] \approx 0.

Convergence. Iterate until maxjm^j(new)m^j(old)<τ\max_j \|\hat m_j^{(\text{new})} - \hat m_j^{(\text{old})}\|_\infty < \tau (typically 10510^{-5}). Under mild conditions (Buja–Hastie–Tibshirani 1989, Theorem 9), backfitting converges geometrically to the unique additive decomposition.

On the §11 3-dim additive toy at n=400n = 400, the notebook’s local-linear backfitting GAM converges in 6 iterations to tolerance 10510^{-5} with geometric convergence ratio 0.066\approx 0.066 per iteration.
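
A minimal sketch of Algorithm 1 with local-linear components, a single shared bandwidth, and a Gaussian kernel — all simplifying assumptions relative to the notebook’s per-component bandwidth selection:

```python
import numpy as np

def loclin(x, y, x_eval, h):
    """Univariate local-linear smoother (Gaussian kernel), used as the component fitter."""
    out = np.empty(len(x_eval))
    for k, x0 in enumerate(x_eval):
        d = x - x0
        w = np.exp(-0.5 * (d / h) ** 2)
        Phi = np.column_stack([np.ones_like(d), d])
        A = Phi.T * w
        out[k] = np.linalg.solve(A @ Phi, A @ y)[0]
    return out

def backfit(X, y, h, max_sweeps=50, tol=1e-5):
    """Backfitting for the additive model (11.1) with local-linear components."""
    n, d = X.shape
    alpha, m_hat = y.mean(), np.zeros((d, n))      # m_hat[j] is m_j at the design points
    for _ in range(max_sweeps):
        delta = 0.0
        for j in range(d):
            r = y - alpha - m_hat.sum(axis=0) + m_hat[j]         # partial residuals
            new = loclin(X[:, j], r, X[:, j], h)
            new -= new.mean()                                     # centering: E[m_j] = 0
            delta = max(delta, np.max(np.abs(new - m_hat[j])))
            m_hat[j] = new
        if delta < tol:                                           # max_j sup-norm criterion
            break
    return alpha, m_hat
```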

11.3 Why the components have univariate rates

On convergence, m^j\hat m_j equals the local-polynomial fit of the converged partial residuals against the univariate X,jX_{\cdot, j}. The bias-variance analysis of §4 applies directly, with noise variance enlarged by the contribution of the other components — bounded if each component is estimable. The bias rate is the univariate n2(p+1)/(2p+3)n^{-2(p+1)/(2p+3)}, independent of dd.

The cost is the misspecification bias. If the true mm has interaction structure that (11.1) can’t capture, m^GAM(x)\hat m^{\text{GAM}}(\mathbf{x}) converges to the best additive approximation of mm, not to mm itself. The standard diagnostic is residual analysis on pairwise covariate plots; if systematic 2-dim structure shows up, the additive model is missing interactions.

GAM fit on a 3-dim additive toy with three panels showing recovered components m̂_j(x_j) overlaid with truth, partial-residual scatter for each j.
GAM fit on the 3-dim additive toy m(x) = sin(2πx_1) + 0.6(x_2 − 0.5) − 2(x_3 − 0.5)² (centered). Three panels: recovered components m̂_j(x_j) (orange) overlaid with truth (gray dashed), partial-residual scatter for each j. In-sample MSE: GAM with local-linear components ≈ 0.016, vs multivariate local-quadratic ≈ 0.017 — at d = 3 the curse is mild and the two approaches are nearly equivalent. The gap widens dramatically at d ≥ 5.
Backfitting convergence trajectory on a log scale, showing geometric decay of max_j norm difference between iterations.
Backfitting convergence on the §11 3-dim toy — geometric decay of max_j ‖m̂_j^(k+1) − m̂_j^(k)‖_∞ on a log scale. 6 iterations to tolerance 10⁻⁵, geometric convergence ratio ≈ 0.066 per iteration. Buja–Hastie–Tibshirani (1989, Theorem 9) gives the abstract guarantee; the empirical rate is well within their bound for the local-linear smoother class.

12. Connections, applications, and limits

12.1 Local polynomials as nuisance estimators

Local polynomial regression’s most consequential modern role isn’t as a final estimator — it’s as a nuisance estimator in semiparametric and debiased-inference pipelines.

The canonical setting is the partially-linear model:

Y  =  βZ  +  g(W)  +  ε,E[εZ,W]=0,Y \;=\; \boldsymbol\beta^\top \mathbf{Z} \;+\; g(W) \;+\; \varepsilon, \qquad \mathbb{E}[\varepsilon \mid \mathbf{Z}, W] = 0,

with βRq\boldsymbol\beta \in \mathbb{R}^q a low-dimensional parameter of interest. Robinson (1988) showed that β\boldsymbol\beta can be estimated at the parametric n\sqrt{n} rate — despite the nonparametric nuisance — by partialling out a local-polynomial estimate of gg.
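
A minimal sketch of the double-residual idea — the DGP, bandwidth, and local-linear nuisance smoother are illustrative assumptions; the point is that $\hat\beta$ comes from an OLS of residuals on residuals:

```python
import numpy as np

def loclin(x, y, x_eval, h):
    """Local-linear conditional-mean estimate (Gaussian kernel)."""
    out = np.empty(len(x_eval))
    for k, x0 in enumerate(x_eval):
        d = x - x0
        w = np.exp(-0.5 * (d / h) ** 2)
        Phi = np.column_stack([np.ones_like(d), d])
        A = Phi.T * w
        out[k] = np.linalg.solve(A @ Phi, A @ y)[0]
    return out

# Robinson's double residual: partial W out of both Y and Z, then OLS on the residuals.
rng = np.random.default_rng(0)
n, beta_true, h = 500, 1.5, 0.1
W = rng.uniform(0, 1, n)
Z = 0.5 * W + rng.normal(0, 0.5, n)                 # Z correlated with W
Y = beta_true * Z + np.sin(2 * np.pi * W) + rng.normal(0, 0.2, n)

eY = Y - loclin(W, Y, W, h)                          # residualize Y on W
eZ = Z - loclin(W, Z, W, h)                          # residualize Z on W
print(round((eZ @ eY) / (eZ @ eZ), 3))               # beta_hat, near 1.5 up to sampling noise
```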

The modern extension is double/debiased machine learning (Chernozhukov et al. 2018), which generalizes Robinson to high-dimensional $\mathbf{Z}$ via cross-fitting and Neyman-orthogonal scores. Local-polynomial fits the framework: at $p \ge 1$ the univariate nuisance estimate converges faster than the $n^{-1/4}$ threshold the cross-fitted orthogonal score requires, even at $h = h^*$ chosen by LOO-CV (§7).

Forward pointer: semiparametric-inference (coming soon) will build out the Robinson machinery and the modern Chernozhukov-style extensions.

12.2 Varying-coefficient models

Hastie & Tibshirani (1993) introduced the varying-coefficient model:

Y  =  β0(Z)  +  β1(Z)X1  +    +  βq(Z)Xq  +  ε,Y \;=\; \beta_0(Z) \;+\; \beta_1(Z) X_1 \;+\; \cdots \;+\; \beta_q(Z) X_q \;+\; \varepsilon,

where each βj(Z)\beta_j(Z) is an unknown smooth function. The estimator is local-linear-in-X\mathbf{X} regression weighted by a univariate kernel in ZZ. Higher-degree variants extend to local polynomial in ZZ as well.
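
A minimal sketch of the pointwise estimator — weighted least squares of $Y$ on $[1, \mathbf{X}]$ with kernel weights in $Z$; the Gaussian kernel is an assumed choice:

```python
import numpy as np

def varying_coef_fit(Z, X, y, z0, h, kernel=lambda u: np.exp(-0.5 * u**2)):
    """(beta_0(z0), ..., beta_q(z0)): WLS of y on [1, X] with kernel weights in Z."""
    w = kernel((Z - z0) / h)
    D = np.column_stack([np.ones(len(y)), X])   # design is linear in X
    A = D.T * w
    return np.linalg.solve(A @ D, A @ y)        # coefficient functions evaluated at z0
```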

In time-series, varying-coefficient models go by time-varying coefficient regression. The §3 equivalent-kernel theory extends with the obvious modifications.

12.3 Where local polynomials break down

  • High dimension at d5d \ge 5 without additive structure (§8.4 curse).
  • Sparse design at boundary — local Gram matrix becomes ill-conditioned.
  • Heavy-tailed noise — robust alternatives (LOESS with bisquare weights) handle this.
  • Heteroscedastic noise — variable-bandwidth local polynomial (Fan & Gijbels 1995) handles this.
  • Discontinuities — change-point detection + segment-wise fitting.

12.4 Forward pointers within formalML

| topic | track | relationship |
| --- | --- | --- |
| kernel-regression | T2 (shipped) | Predecessor; shared module, shared §1 toy DGP, shared bandwidth-selection machinery |
| density-ratio-estimation (coming soon) | T2 | Uses the same kernel-methods shared module for $r(x) = p_1(x)/p_0(x)$ estimation |
| quantile-regression | T4 (shipped) | Pinball-loss analog of local polynomial regression |
| semiparametric-inference (coming soon) | T6 | Local-cubic as nuisance estimator (§12.1) |
| gaussian-processes | T5 | Smoothing-spline / GP / local-cubic asymptotic-equivalence triangle (§10) |

12.5 Closing synthesis

Local polynomial regression is the workhorse univariate nonparametric smoother of the modern toolkit. It generalizes Kernel Regression §8’s local-linear fix to arbitrary degree pp, achieves Stone-minimax-optimal rates, gives derivative estimates as a single-fit byproduct, extends cleanly to multivariate via the polynomial-feature design matrix, and survives boundary effects via the equivalent-kernel adaptation that emerges automatically from the WLS structure.

The practical recommendation distills to p=1p = 1 as the production default and p=3p = 3 when derivative estimates are needed or the smoothness regime warrants the faster h4h^4 rate. p=2p = 2 doesn’t earn its variance penalty under uniform-rate analysis; p5p \ge 5 is theoretically clean but practically fragile.

The deeper structural fact — that every local polynomial estimator is asymptotically a kernel smoother with a degree-aware, location-aware equivalent kernel KpK^*_p — is the conceptual centerpiece, and it’s what ties the topic to its neighbors. Smoothing splines (§10) are local-cubic in disguise; GAMs (§11) are coordinate-wise local-polynomials chained by backfitting; semiparametric inference (§12.1) treats local-polynomial as the nuisance-estimator default. The equivalent kernel is the lens that makes all of these the same algebraic object, viewed from different angles.

Connections and Further Reading

Within formalML:

  • Kernel Regression (T2). Direct predecessor. The §1 toy, the kernel substrate, and the bandwidth-selection machinery all carry over from kernel-regression unchanged. This topic adds the equivalent-kernel formulation, the rate ladder at general pp, the boundary story, multivariate / derivative / smoothing-spline / GAM connections.

  • Concentration Inequalities. Transitive prerequisite for the convergence-rate machinery, inherited via kernel-regression. The §4 bias and variance derivations are stated to leading order; the rigorous version (uniform consistency, asymptotic normality) requires concentration-inequality bounds for the kernel-weighted partial sums.

  • Gaussian Processes (T5). §10 connects local-cubic to smoothing splines via Silverman (1984), and smoothing splines to GP regression via the spline-as-GP-posterior view. The fourth-order Silverman kernel KSK_S satisfies the same moment-matching identities as K3K^*_3 — both produce h4h^4 leading bias and 1/(nh)1/(nh) variance.

  • Conformal Prediction (T4). Local-cubic is a natural base estimator for split conformal prediction, with sharper convergence than NW. The faster bias rate at $p = 3$ means tighter undersmoothing thresholds and tighter conformal intervals.

  • Density-Ratio Estimation (coming soon). Uses the same kernel-methods shared module from this topic and the bandwidth-selection methods from §7, applied to r(x)=p1(x)/p0(x)r(x) = p_1(x)/p_0(x) instead of E[YX]\mathbb{E}[Y \mid X].

  • Semiparametric Inference (coming soon). Local-cubic regression as the canonical nuisance estimator in the Robinson (1988) double-residual approach and the Chernozhukov et al. (2018) double/debiased ML framework — see §12.1.

Cross-site prerequisites:

  • formalStatistics: Linear Regression — the parametric counterpart. Locally-weighted least squares with a degree-pp polynomial design matrix is this topic’s central machinery; the (XWX)1XWy(\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{y} normal equations are the linear-regression formulas applied at every evaluation point with location-adaptive design and weight matrices. The §3 equivalent-kernel formulation is the classical projection-onto-column-space picture from linear regression read in kernel-weighted units.

  • formalStatistics: Kernel Density Estimation — the density-estimation sibling. Same kernel families and the same bandwidth-moment conditions μ2(K),R(K)\mu_2(K), R(K) port directly. The bandwidth-selection methods (Silverman, plug-in, LOO-CV, GCV) transfer to local polynomial regression via the same linear-smoother machinery; the §3.3 moment-matching identities are the local-polynomial analog of KDE’s higher-order kernels.

  • formalCalculus: Multiple Integrals — the multivariate-integration substrate. The change-of-variables substitution u=(zx)/hu = (z-x)/h that converts the kernel-weighted moments into kernel-moment integrals throughout §3.2, the boundary-truncated form μj(c)=c\mu_j^{(c)} = \int_{-c}^{\infty}, and the §8 multivariate full-degree expansion Md,p=(d+pp)M_{d,p} = \binom{d+p}{p} all rest on standard multivariate-integration moves.

  • formalCalculus: Mean Value Theorem and Taylor Expansion — the Taylor-expansion machinery. Taylor expansion to order p+1p+1 is the load-bearing technical move of the §4 bias derivation at general degree pp. The §4.1 residual-decomposition argument uses the column-space projection identity to absorb the first p+1p+1 Taylor terms, leaving only the (p+1)(p+1)-st derivative remainder to drive the bias.

References & Further Reading