
Proximal Methods

Proximal operators, the Moreau envelope, forward-backward splitting, ISTA/FISTA for composite objectives, Douglas–Rachford splitting, and ADMM — the algorithmic toolkit for non-smooth and constrained optimization

Overview & Motivation

In Gradient Descent & Convergence we built a complete convergence theory for smooth optimization: given an $L$-smooth convex function $f$, gradient descent with step size $\eta = 1/L$ achieves an $O(1/k)$ rate, and Nesterov acceleration pushes this to the optimal $O(1/k^2)$. But smoothness is a luxury that many real problems don't enjoy.

Consider the Lasso — the workhorse of sparse regression:

$$\min_{x} \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$$

The first term is smooth (its gradient is $A^\top(Ax - b)$), but the $\ell_1$ penalty $\lambda\|x\|_1$ is not differentiable at zero, which is precisely where sparsity lives. We can't just compute $\nabla f(x_k)$ and step in the negative gradient direction, because $\nabla f$ doesn't exist everywhere.

Proximal methods resolve this tension by splitting the problem into pieces we know how to handle. The key insight: even though $\|x\|_1$ isn't smooth, we can evaluate its proximal operator, a generalization of the gradient step that replaces "move in the direction that decreases $f$ fastest" with "find the point that best balances minimizing $f$ and staying close to where you are." For the $\ell_1$ norm, this proximal operator turns out to be soft-thresholding: shrink each coordinate toward zero by $\lambda$, and set it exactly to zero if its magnitude is below $\lambda$.

What We Cover

  1. The Proximal Operator: definition, geometric interpretation, and the fixed-point characterization of minimizers.
  2. Proximal Operator Catalog: closed-form solutions for the $\ell_1$ norm (soft-thresholding), indicator functions (projection), the squared norm (shrinkage), and group norms.
  3. The Moreau Envelope: the smooth approximation that connects proximal operators to gradient-based reasoning, with the gradient identity $\nabla M_\lambda f(v) = \frac{1}{\lambda}(v - \mathrm{prox}_{\lambda f}(v))$.
  4. Proximal Gradient Descent: forward-backward splitting for composite objectives $F = g + h$ (smooth + non-smooth), with $O(1/k)$ convergence.
  5. ISTA: proximal gradient applied to the Lasso, with soft-thresholding as the key subroutine.
  6. FISTA: Nesterov-style acceleration from $O(1/k)$ to the optimal $O(1/k^2)$.
  7. Douglas–Rachford Splitting: handling problems where both terms are non-smooth.
  8. ADMM: the consensus formulation with three-variable updates and primal-dual residuals.
  9. Convergence Analysis: the rate landscape across algorithms and the role of strong convexity.
  10. Computational Notes: backtracking step size selection, convergence diagnostics, and stopping criteria.

The Proximal Operator

The proximal operator is the fundamental building block. Where gradient descent asks "which direction decreases $f$ fastest?", the proximal operator asks a different question: "which point best balances minimizing $f$ and staying close to where I am?"

Definition 1 (The Proximal Operator).

Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper, closed, convex function, and let $\lambda > 0$. The proximal operator of $f$ with parameter $\lambda$ is the map $\mathrm{prox}_{\lambda f} : \mathbb{R}^n \to \mathbb{R}^n$ defined by

$$\mathrm{prox}_{\lambda f}(v) = \arg\min_{x \in \mathbb{R}^n} \left\{ f(x) + \frac{1}{2\lambda}\|x - v\|^2 \right\}$$

The minimizer exists and is unique because the objective is $\frac{1}{\lambda}$-strongly convex (the quadratic penalty dominates at large $\|x\|$, and the sum of a convex function and a strongly convex function is strongly convex).

The parameter $\lambda$ controls the tradeoff between the two competing objectives. When $\lambda$ is small, the quadratic penalty dominates and the proximal operator stays close to $v$, barely moving. When $\lambda$ is large, the function $f$ dominates and the proximal operator moves toward the minimizer of $f$ itself. In the limit $\lambda \to 0$, $\mathrm{prox}_{\lambda f}(v) \to v$ (don't move); in the limit $\lambda \to \infty$, $\mathrm{prox}_{\lambda f}(v) \to \arg\min f$ (jump to the global minimizer).
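To make the limiting behavior concrete, here is a minimal numeric sketch (the helper name `prox_grid` is ours) that evaluates the prox of $f(x) = |x|$ by brute-force minimization over a fine grid and checks both limits against the soft-thresholding formula $\mathrm{prox}_{\lambda|\cdot|}(v) = \mathrm{sgn}(v)(|v| - \lambda)_+$:

```python
import numpy as np

def prox_grid(f, v, lam, grid):
    """Brute-force prox: minimize f(x) + (x - v)^2 / (2*lam) over a 1-D grid."""
    return grid[np.argmin(f(grid) + (grid - v) ** 2 / (2 * lam))]

grid = np.linspace(-5, 5, 200001)  # grid spacing 5e-5
v = 1.5

p_small = prox_grid(np.abs, v, 1e-3, grid)  # lam -> 0: barely moves (1.499)
p_mid   = prox_grid(np.abs, v, 1.0, grid)   # soft-threshold: 1.5 - 1.0 = 0.5
p_large = prox_grid(np.abs, v, 1e3, grid)   # lam -> inf: jumps to argmin |x| = 0
```

The same brute-force check works for any one-dimensional $f$, which makes it a handy way to validate closed-form prox formulas before using them inside a solver.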

The proximal operator has an elegant characterization in terms of fixed points that connects it directly to the optimality conditions from Convex Analysis.

Proposition 1 (Fixed-Point Characterization).

A point $x^*$ minimizes $f$ if and only if $x^*$ is a fixed point of the proximal operator:

$$x^* = \mathrm{prox}_{\lambda f}(x^*) \qquad \Longleftrightarrow \qquad 0 \in \partial f(x^*)$$

This holds for any $\lambda > 0$. That is, the fixed points of $\mathrm{prox}_{\lambda f}$ are exactly the minimizers of $f$, regardless of the choice of $\lambda$.

Proof.

($\Rightarrow$) If $x^* = \mathrm{prox}_{\lambda f}(x^*)$, then $x^*$ minimizes $f(x) + \frac{1}{2\lambda}\|x - x^*\|^2$. The optimality condition for this minimization is $0 \in \partial f(x^*) + \frac{1}{\lambda}(x^* - x^*)$, which gives $0 \in \partial f(x^*)$.

($\Leftarrow$) If $0 \in \partial f(x^*)$, then taking the subgradient $g = 0$ in the inequality $f(x) \geq f(x^*) + \langle g, x - x^*\rangle$ gives $f(x) \geq f(x^*)$ for all $x$. Hence $f(x) + \frac{1}{2\lambda}\|x - x^*\|^2 \geq f(x^*) + 0$ for all $x$, so $x^*$ minimizes the proximal objective, i.e. $x^* = \mathrm{prox}_{\lambda f}(x^*)$.


The proximal operator: definition, input-output map, and gallery of proximal operators


Proximal Operator Catalog

The utility of proximal methods depends on having cheap, closed-form proximal operators for the functions we care about. Here are the most important ones.

Definition 2 (Soft-Thresholding Operator).

The proximal operator of $\lambda|\cdot|$ (the scaled absolute value in 1D, or the scaled $\ell_1$ norm componentwise in $\mathbb{R}^n$) is the soft-thresholding operator:

$$\mathcal{S}_\lambda(v) = \mathrm{sgn}(v) \cdot (|v| - \lambda)_+ = \begin{cases} v - \lambda & \text{if } v > \lambda \\ 0 & \text{if } |v| \leq \lambda \\ v + \lambda & \text{if } v < -\lambda \end{cases}$$

In $\mathbb{R}^n$, soft-thresholding acts componentwise: $[\mathcal{S}_\lambda(v)]_i = \mathrm{sgn}(v_i)(|v_i| - \lambda)_+$.

Soft-thresholding is the engine behind every $\ell_1$-regularized method in this topic. It does two things simultaneously: it shrinks each coordinate toward zero by $\lambda$ (the "shrinkage" part), and it sets coordinates with magnitude at most $\lambda$ exactly to zero (the "thresholding" part). This is why $\ell_1$ regularization produces sparse solutions: the proximal operator itself enforces sparsity.

import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

Proposition 2 (Proximal Operator of an Indicator Function).

Let $C \subseteq \mathbb{R}^n$ be a nonempty closed convex set, and let $\iota_C$ be its indicator function ($\iota_C(x) = 0$ if $x \in C$, $+\infty$ otherwise). Then

$$\mathrm{prox}_{\lambda \iota_C}(v) = \Pi_C(v) = \arg\min_{x \in C} \|x - v\|^2$$

is the Euclidean projection onto $C$, for any $\lambda > 0$.

This is immediate from the definition: the term $\iota_C(x)$ forces the minimizer to lie in $C$, and the quadratic penalty selects the closest point in $C$ to $v$. The indicator function encodes a hard constraint, and its proximal operator is the classical projection. This means constrained optimization can be cast as unconstrained proximal operations.
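A quick sanity check (a sketch with our own helper name): projection onto the nonnegative orthant $C = \{x : x \geq 0\}$ is componentwise clipping, and by the proposition it is the prox of $\lambda\iota_C$ for every $\lambda$:

```python
import numpy as np

def prox_indicator_nonneg(v, lam=1.0):
    """prox of lam * indicator of {x >= 0}: clipping, independent of lam."""
    return np.maximum(v, 0.0)

v = np.array([1.2, -0.7, 0.0, -3.1])
p = prox_indicator_nonneg(v)  # -> [1.2, 0.0, 0.0, 0.0]

# p is the closest feasible point: random feasible x are never closer to v
rng = np.random.default_rng(0)
feasible = rng.random((100, 4)) * 5.0
dists = np.linalg.norm(feasible - v, axis=1)
```

The distance check is exactly the defining property of the projection: no feasible point beats $\Pi_C(v)$.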

Proposition 3 (Proximal Operator of the Squared Norm).

For $f(x) = \frac{1}{2}\|x\|^2$ (half the squared Euclidean norm), the proximal operator is a shrinkage toward the origin:

$$\mathrm{prox}_{\lambda f}(v) = \frac{v}{1 + \lambda}$$

This follows by setting the gradient of $\frac{1}{2}\|x\|^2 + \frac{1}{2\lambda}\|x - v\|^2$ to zero: $(1 + 1/\lambda)x = v/\lambda$, giving $x = v/(1 + \lambda)$.

Two more proximal operators appear frequently in applications:

Group soft-thresholding. For the group Lasso with $f(x) = \|x_G\|_2$ (the $\ell_2$ norm of a group of coordinates), the proximal operator is:

$$\mathrm{prox}_{\lambda f}(v_G) = \begin{cases} \left(1 - \frac{\lambda}{\|v_G\|_2}\right) v_G & \text{if } \|v_G\|_2 > \lambda \\ 0 & \text{otherwise} \end{cases}$$

This zeroes out entire groups, rather than individual coordinates.
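A direct implementation of the group prox for a single group (a sketch; the function name is ours):

```python
import numpy as np

def group_soft_threshold(v, lam):
    """prox of lam * ||.||_2 on one group: shrink the whole group toward
    zero, or zero it out entirely if its norm is at most lam."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v

v = np.array([3.0, 4.0])                 # ||v||_2 = 5
shrunk = group_soft_threshold(v, 1.0)    # -> [2.4, 3.2]
zeroed = group_soft_threshold(v, 6.0)    # -> [0.0, 0.0]
```

Note the all-or-nothing behavior: the direction of $v_G$ is preserved, but the whole group dies together once its norm drops below $\lambda$.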

Simplex projection. The proximal operator of the indicator of the probability simplex $\Delta = \{x \geq 0 : \sum_i x_i = 1\}$ is the Euclidean projection onto $\Delta$. This can be computed in $O(n \log n)$ time by sorting and thresholding (Duchi et al., 2008):

def proj_simplex(x):
    """Projection onto the probability simplex."""
    n = len(x)
    u = np.sort(x)[::-1]
    cssv = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, n + 1) > cssv)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

Proximal operator catalog: soft-thresholding, L2 projection, group soft-thresholding, simplex projection


The Moreau Envelope

The proximal operator finds the minimizer of $f(x) + \frac{1}{2\lambda}\|x - v\|^2$. But what is the value of that minimum? This is the Moreau envelope: a smooth version of $f$ that serves as a theoretical bridge between non-smooth analysis and gradient-based reasoning.

Definition 3 (The Moreau Envelope).

Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper, closed, and convex. The Moreau envelope of $f$ with parameter $\lambda > 0$ is

$$M_\lambda f(v) = \min_{x \in \mathbb{R}^n} \left\{ f(x) + \frac{1}{2\lambda}\|x - v\|^2 \right\} = f\!\left(\mathrm{prox}_{\lambda f}(v)\right) + \frac{1}{2\lambda}\left\|\mathrm{prox}_{\lambda f}(v) - v\right\|^2$$

The Moreau envelope is always finite, continuous, and, crucially, differentiable, even when $f$ is not.

The parameter $\lambda$ controls the smoothing strength. When $\lambda$ is small, $M_\lambda f$ closely approximates $f$ (but with corners rounded off). When $\lambda$ is large, $M_\lambda f$ is a very smooth, gentle version of $f$. The infimal values are preserved: $\min M_\lambda f = \min f$, and the minimizers are the same.

A beautiful special case: when $f(x) = |x|$, the Moreau envelope $M_\lambda|\cdot|$ is exactly the Huber function with parameter $\delta = \lambda$:

$$M_\lambda|\cdot|(v) = \begin{cases} \frac{v^2}{2\lambda} & \text{if } |v| \leq \lambda \\ |v| - \frac{\lambda}{2} & \text{if } |v| > \lambda \end{cases}$$

The Huber function is the standard "smooth $\ell_1$" used in robust regression: quadratic near zero and linear in the tails. We now see that it arises naturally as the Moreau envelope of the absolute value.
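The Huber identity is easy to verify numerically by computing the envelope's defining minimum by brute force on a grid (a sketch; `moreau_abs` and `huber` are our own names):

```python
import numpy as np

def moreau_abs(v, lam, grid):
    """Moreau envelope of |.| at v, by brute-force minimization over a grid."""
    return np.min(np.abs(grid) + (grid - v) ** 2 / (2 * lam))

def huber(v, lam):
    """Huber function with delta = lam."""
    return v * v / (2 * lam) if abs(v) <= lam else abs(v) - lam / 2

grid = np.linspace(-10, 10, 400001)
lam = 1.5
gaps = [abs(moreau_abs(v, lam, grid) - huber(v, lam))
        for v in (-4.0, -0.3, 0.0, 0.9, 2.7)]
# all gaps are at grid-resolution level
```

The test points deliberately straddle the kink at $|v| = \lambda$, hitting both the quadratic and the linear branch.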

Theorem 1 (Gradient of the Moreau Envelope).

For any proper, closed, convex function $f$ and $\lambda > 0$, the Moreau envelope $M_\lambda f$ is differentiable everywhere, with gradient

$$\nabla M_\lambda f(v) = \frac{1}{\lambda}\left(v - \mathrm{prox}_{\lambda f}(v)\right)$$

Moreover, $\nabla M_\lambda f$ is Lipschitz continuous with constant $1/\lambda$, so $M_\lambda f$ is $(1/\lambda)$-smooth.

Proof.

Let $p(v) = \mathrm{prox}_{\lambda f}(v)$. By definition, $p(v)$ minimizes $h(x, v) = f(x) + \frac{1}{2\lambda}\|x - v\|^2$ over $x$. The envelope is $M_\lambda f(v) = h(p(v), v)$.

By the envelope theorem (or direct computation), we differentiate with respect to $v$, using the fact that $p(v)$ is the minimizer (so the $\partial h / \partial x$ term vanishes at $x = p(v)$):

$$\nabla_v M_\lambda f(v) = \frac{\partial h}{\partial v}\bigg|_{x = p(v)} = \frac{1}{\lambda}(v - p(v))$$

For Lipschitz continuity: the proximal operator $p$ is firmly nonexpansive (a classical result), which in particular implies $\|p(v_1) - p(v_2)\| \leq \|v_1 - v_2\|$. Therefore

$$\|\nabla M_\lambda f(v_1) - \nabla M_\lambda f(v_2)\| = \frac{1}{\lambda}\|(v_1 - p(v_1)) - (v_2 - p(v_2))\| \leq \frac{2}{\lambda}\|v_1 - v_2\|$$

by the triangle inequality. A tighter analysis using firm nonexpansiveness improves the constant to $1/\lambda$.

This identity is remarkable: the gradient of the smooth envelope equals the proximal residual $v - \mathrm{prox}_{\lambda f}(v)$, scaled by $1/\lambda$. When $v = x^*$ is a minimizer, the proximal operator fixes $x^*$ (that is, $x^* = \mathrm{prox}_{\lambda f}(x^*)$), and the gradient is zero, exactly as expected.
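For $f = |\cdot|$ the identity can be checked numerically, since the prox is soft-thresholding and the envelope has a closed form (a sketch; names are ours). A centered finite difference of $M_\lambda|\cdot|$ matches $(v - \mathrm{prox}_{\lambda f}(v))/\lambda$:

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_abs(v, lam):
    """Moreau envelope of |.| evaluated via its prox."""
    p = soft_threshold(v, lam)
    return abs(p) + (p - v) ** 2 / (2 * lam)

lam, h = 0.8, 1e-6
fds, grads = [], []
for v in (2.3, 0.3):  # one point in the linear region, one in the quadratic
    fds.append((moreau_abs(v + h, lam) - moreau_abs(v - h, lam)) / (2 * h))
    grads.append((v - soft_threshold(v, lam)) / lam)
# fds and grads agree to finite-difference accuracy
```

At $v = 2.3$ the prox is $1.5$ and the gradient is exactly $1$ (the linear tail of the Huber function); at $v = 0.3$ the prox is $0$ and the gradient is $v/\lambda = 0.375$.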

The Moreau envelope: smoothing, Huber connection, and gradient comparison


Proximal Gradient Descent

Now we assemble the proximal operator into an optimization algorithm. The setting is composite minimization — the most important problem structure in modern convex optimization.

Definition 4 (Composite Minimization).

The composite minimization problem is

$$\min_{x \in \mathbb{R}^n} F(x) = g(x) + h(x)$$

where:

  • $g$ is convex and $L$-smooth (i.e., $\nabla g$ is Lipschitz with constant $L$: $\|\nabla g(x) - \nabla g(y)\| \leq L\|x - y\|$ for all $x, y$),
  • $h$ is convex but possibly non-smooth, with a cheaply computable proximal operator.

The canonical example is the Lasso: $g(x) = \frac{1}{2}\|Ax - b\|^2$ is smooth with $L = \lambda_{\max}(A^\top A)$, and $h(x) = \lambda\|x\|_1$ has the soft-thresholding proximal operator. But the framework covers much more: any problem where we can split the objective into a smooth part we can take gradients of and a structured part we can "prox."

The forward-backward splitting algorithm alternates between a gradient step on $g$ (the "forward" step) and a proximal step on $h$ (the "backward" step):

$$x_{k+1} = \mathrm{prox}_{\eta h}\!\left(x_k - \eta \nabla g(x_k)\right)$$

where $\eta \in (0, 1/L]$ is the step size. Intuitively: first move in the negative gradient direction of $g$ (the smooth part), then apply the proximal operator of $h$ to "clean up": projecting, thresholding, or whatever $h$ demands.

When $h = 0$, the proximal step is the identity and we recover vanilla gradient descent. When $g = 0$, the gradient step is trivial and we get the proximal point algorithm. Forward-backward splitting is the natural hybrid.

Theorem 2 (Convergence of Proximal Gradient Descent).

Let $F = g + h$ where $g$ is convex and $L$-smooth, and $h$ is convex with a computable proximal operator. Let $\{x_k\}$ be the iterates of proximal gradient descent with step size $\eta = 1/L$. Then:

  1. Convex case: $F(x_k) - F(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2k}$, giving a rate of $O(1/k)$.

  2. Strongly convex case: If $g$ is additionally $\mu$-strongly convex (with $\mu > 0$), then

$$F(x_k) - F(x^*) \leq \left(1 - \frac{\mu}{L}\right)^k \left(F(x_0) - F(x^*)\right)$$

giving linear convergence with rate governed by the condition number $\kappa = L/\mu$.

Proof.

The proof extends the Descent Lemma from Gradient Descent & Convergence. By $L$-smoothness of $g$:

$$g(y) \leq g(x) + \langle \nabla g(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$$

Setting $x = x_k$ and $y = x_{k+1} = \mathrm{prox}_{\eta h}(x_k - \eta \nabla g(x_k))$ with $\eta = 1/L$, and using the optimality condition for the proximal step (which says $\frac{1}{\eta}(x_k - \eta \nabla g(x_k) - x_{k+1}) \in \partial h(x_{k+1})$), we obtain the sufficient decrease condition:

$$F(x_{k+1}) \leq F(x) + \frac{1}{2\eta}\|x_k - x\|^2 - \frac{1}{2\eta}\|x_{k+1} - x\|^2$$

for any $x$. Setting $x = x^*$ and summing from $k = 0$ to $K-1$:

$$\sum_{k=0}^{K-1} \left(F(x_{k+1}) - F(x^*)\right) \leq \frac{1}{2\eta}\|x_0 - x^*\|^2$$

Since $F(x_k)$ is non-increasing, $K \cdot (F(x_K) - F(x^*)) \leq \frac{L}{2}\|x_0 - x^*\|^2$, giving the $O(1/k)$ rate.

For the strongly convex case, the additional quadratic lower bound $g(y) \geq g(x) + \langle \nabla g(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2$ gives a contraction factor of $(1 - \mu/L)$ per step, yielding linear convergence.

Proximal gradient descent: forward-backward splitting trajectory, convergence, and sparse recovery


ISTA & Soft-Thresholding

The Iterative Shrinkage-Thresholding Algorithm (ISTA) is proximal gradient descent applied to the Lasso. The name comes from the fact that the proximal step reduces to componentwise soft-thresholding — a “shrink and threshold” operation.

Proposition 4 (ISTA for the Lasso).

For the Lasso problem $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$, proximal gradient descent with step size $\eta = 1/L$ (where $L = \lambda_{\max}(A^\top A)$) reduces to:

$$x_{k+1} = \mathcal{S}_{\eta\lambda}\!\left(x_k - \eta A^\top(Ax_k - b)\right)$$

That is: take a gradient step on the smooth quadratic loss, then apply componentwise soft-thresholding with threshold $\eta\lambda$. The algorithm converges at rate $O(1/k)$ for the objective gap.

The two-step structure is transparent. The gradient step $x_k - \eta A^\top(Ax_k - b)$ moves toward fitting the data (minimizing $\|Ax - b\|^2$), and the soft-thresholding step $\mathcal{S}_{\eta\lambda}$ promotes sparsity by shrinking small coefficients to zero. The threshold $\eta\lambda$ balances the two goals: larger $\lambda$ means more thresholding and sparser solutions.

# ISTA for Lasso (assumes numpy as np, data A, b, dimension p, regularization
# weight lam_lasso, iteration budget max_iter, and soft_threshold from above)
L = np.linalg.eigvalsh(A.T @ A)[-1]  # Lipschitz constant of the smooth part
eta = 1.0 / L
x = np.zeros(p)

for k in range(max_iter):
    x_half = x - eta * A.T @ (A @ x - b)         # gradient step on g
    x = soft_threshold(x_half, eta * lam_lasso)  # prox step on h

The regularization path. As $\lambda$ varies from large to small, the ISTA solution traces a path from zero (complete sparsity) to the least-squares solution (no regularization). This path reveals which features enter the model first: the most important predictors appear at larger $\lambda$ values. Computing the full regularization path is a standard diagnostic for choosing $\lambda$ in practice.

ISTA: soft-thresholding, regularization path, and sparsity vs regularization strength


FISTA: Accelerated Proximal Gradient

ISTA's $O(1/k)$ rate is the same as vanilla gradient descent on smooth problems. Can we accelerate it, the way Nesterov acceleration improves gradient descent from $O(1/k)$ to $O(1/k^2)$? The answer is yes, and the result is FISTA, the Fast Iterative Shrinkage-Thresholding Algorithm of Beck & Teboulle (2009).

Theorem 3 (FISTA Convergence Rate).

The FISTA algorithm, defined by the updates

$$y_k = x_k + \frac{t_{k-1} - 1}{t_k}(x_k - x_{k-1})$$
$$x_{k+1} = \mathrm{prox}_{\eta h}\!\left(y_k - \eta \nabla g(y_k)\right)$$
$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$$

with $t_0 = 1$, converges at rate

$$F(x_k) - F(x^*) \leq \frac{2L\|x_0 - x^*\|^2}{(k+1)^2}$$

This is $O(1/k^2)$, which is optimal for first-order methods on the composite class.

Proof.

The proof follows Nesterov's estimation sequence technique, adapted to the composite setting. The key is the momentum schedule $t_k$ satisfying $t_k^2 - t_k \leq t_{k-1}^2$, which ensures the sequence $t_k$ grows as $t_k \sim k/2$ for large $k$.

Define the Lyapunov function $E_k = t_k^2(F(x_k) - F(x^*)) + \frac{L}{2}\|u_k - x^*\|^2$, where $u_k = x_{k-1} + t_k(x_k - x_{k-1})$. Using the sufficient decrease property from the proximal gradient step and the momentum relation, one shows $E_{k+1} \leq E_k$. Since $E_0 = F(x_0) - F(x^*) + \frac{L}{2}\|x_0 - x^*\|^2$, we get

$$t_k^2(F(x_k) - F(x^*)) \leq E_0 \leq \frac{L}{2}\|x_0 - x^*\|^2 + (F(x_0) - F(x^*))$$

and since $t_k \geq (k+1)/2$, the $O(1/k^2)$ bound follows.

The momentum schedule $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$ is the signature feature. As $k \to \infty$, the momentum coefficient $(t_{k-1} - 1)/t_k \to 1$, meaning the extrapolation step becomes increasingly aggressive. This is why FISTA trajectories can overshoot: the momentum pushes past the current iterate, trading local stability for faster global convergence.
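A few lines suffice to see this behavior directly (a sketch): $t_k$ grows roughly like $k/2$ while the momentum coefficient climbs from $0$ toward $1$:

```python
import numpy as np

t = 1.0          # t_0
coeffs = []
for k in range(200):
    t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
    coeffs.append((t - 1) / t_next)  # momentum coefficient used at this step
    t = t_next

# after 200 updates t exceeds 100 (roughly k/2) and the coefficient is near 1
```

Since $\sqrt{1 + 4t^2} > 2t$, each update increases $t$ by more than $1/2$, which is the linear growth the proof relies on.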

# FISTA for Lasso (same setup as the ISTA block above: numpy as np, A, b, p,
# lam_lasso, max_iter, eta = 1/L, and soft_threshold)
x = np.zeros(p)
x_prev = x.copy()
t_k = 1.0

for k in range(max_iter):
    t_next = (1 + np.sqrt(1 + 4 * t_k**2)) / 2
    y = x + ((t_k - 1) / t_next) * (x - x_prev)   # momentum extrapolation
    x_prev = x.copy()
    x_half = y - eta * A.T @ (A @ y - b)            # gradient step on g (at y)
    x = soft_threshold(x_half, eta * lam_lasso)      # prox step on h
    t_k = t_next

When does acceleration matter? The gap between $O(1/k)$ and $O(1/k^2)$ is most dramatic when high accuracy is needed: to reach accuracy $\epsilon = 10^{-6}$, ISTA needs $O(1/\epsilon)$ iterations while FISTA needs $O(1/\sqrt{\epsilon})$. For well-conditioned problems or moderate accuracy, ISTA may suffice. For ill-conditioned sparse recovery or high-precision scientific computing, FISTA's speedup is substantial.

FISTA acceleration: ISTA vs FISTA convergence, momentum schedule, and trajectory comparison


Douglas–Rachford Splitting

Forward-backward splitting requires one term to be smooth (so we can compute its gradient). What if both terms are non-smooth? This is the setting for Douglas–Rachford splitting.

Definition 5 (Douglas–Rachford Splitting).

To minimize $f(x) + g(x)$ where both $f$ and $g$ are proper, closed, convex (but possibly non-smooth) with computable proximal operators, the Douglas–Rachford (DR) update is:

$$y_{k+1} = \mathrm{prox}_{\gamma f}(z_k)$$
$$z_{k+1} = z_k + \mathrm{prox}_{\gamma g}(2y_{k+1} - z_k) - y_{k+1}$$

where $\gamma > 0$ is a step-size parameter. The iterates $y_k$ converge to a minimizer of $f + g$.

The variable $z_k$ is an auxiliary "shadow" variable: it doesn't converge to the solution itself, but the $y_k$ iterates do. The second update can be rewritten as $z_{k+1} = \frac{1}{2}\left(z_k + R_{\gamma g}(R_{\gamma f}(z_k))\right)$, where $R_{\gamma f}(z) = 2\,\mathrm{prox}_{\gamma f}(z) - z$ is the reflected proximal operator (or "reflection"). Douglas–Rachford is thus an averaged iteration of composed reflections, a deeply geometric operation.

The canonical application is basis pursuit: minimize $\|x\|_1$ subject to $Ax = b$ (exact sparse recovery). Setting $f(x) = \|x\|_1$ and $g(x) = \iota_{\{Ax = b\}}(x)$, both terms are non-smooth, but both have cheap proximal operators: soft-thresholding for $f$ and projection onto the affine subspace for $g$.

Theorem 4 (Convergence of Douglas–Rachford Splitting).

Let $f, g$ be proper, closed, convex functions with $\mathrm{dom}\, f \cap \mathrm{dom}\, g \neq \emptyset$ and such that $f + g$ attains its minimum. Then for any $z_0 \in \mathbb{R}^n$ and $\gamma > 0$, the Douglas–Rachford iterates satisfy:

  1. $y_k = \mathrm{prox}_{\gamma f}(z_k)$ converges to a minimizer $x^*$ of $f + g$.
  2. The convergence rate for the objective gap is $O(1/k)$.

# Douglas-Rachford for basis pursuit: min ||x||_1  s.t.  Ax = b
# (assumes numpy as np, full-row-rank A, b, p, max_iter, soft_threshold)
AAt_inv = np.linalg.inv(A @ A.T)
def proj_affine(x):
    """Euclidean projection onto the affine subspace {x : Ax = b}."""
    return x + A.T @ AAt_inv @ (b - A @ x)

gamma = 1.0
z = np.zeros(p)
for k in range(max_iter):
    y = soft_threshold(z, gamma)       # prox_f (soft-thresholding)
    r = proj_affine(2 * y - z)         # prox_g applied to the reflection 2y - z
    z = z + r - y                      # shadow-variable update
# y holds the solution
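The averaged-reflection form is the same update in disguise, which a few lines of numpy confirm at a random point (a sketch; `prox_g` here is projection onto the nonnegative orthant, chosen only for illustration):

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_f(z):
    return soft_threshold(z, 0.7)   # prox of gamma*||.||_1 with gamma = 0.7

def prox_g(z):
    return np.maximum(z, 0.0)       # projection onto {x >= 0}

def reflect(prox, z):
    return 2 * prox(z) - z

rng = np.random.default_rng(0)
z = rng.standard_normal(6)

y = prox_f(z)
z_dr = z + prox_g(2 * y - z) - y                         # DR update as stated
z_avg = 0.5 * (z + reflect(prox_g, reflect(prox_f, z)))  # averaged reflections
# z_dr and z_avg coincide
```

Expanding $\frac{1}{2}(z + R_{\gamma g}(R_{\gamma f}(z)))$ term by term recovers $z + \mathrm{prox}_{\gamma g}(2y - z) - y$ exactly, so the equality holds for every $z$, not just this sample.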

Douglas–Rachford splitting: 2D geometry, convergence, and basis pursuit recovery


ADMM

The Alternating Direction Method of Multipliers (ADMM) is the most widely used splitting method in practice. It handles the general constrained formulation $\min f(x) + g(z)$ subject to $Ax + Bz = c$ by augmenting the Lagrangian with a quadratic penalty.

Definition 6 (ADMM Update).

For the consensus problem $\min f(x) + g(z)$ subject to $x = z$, the ADMM updates (in scaled form) are:

$$x_{k+1} = \mathrm{prox}_{f/\rho}(z_k - u_k) \qquad \text{[x-update]}$$
$$z_{k+1} = \mathrm{prox}_{g/\rho}(x_{k+1} + u_k) \qquad \text{[z-update]}$$
$$u_{k+1} = u_k + x_{k+1} - z_{k+1} \qquad \text{[dual update]}$$

where $\rho > 0$ is the penalty parameter and $u$ is the scaled dual variable (the Lagrange multiplier divided by $\rho$).

For the Lasso via ADMM, we set $f(x) = \frac{1}{2}\|Ax - b\|^2$ and $g(z) = \lambda\|z\|_1$ with the constraint $x = z$. The $x$-update requires solving a linear system $(A^\top A + \rho I)x = A^\top b + \rho(z - u)$, the $z$-update is soft-thresholding with threshold $\lambda/\rho$, and the $u$-update is a simple addition.

# ADMM for Lasso (assumes numpy as np, A, b, p, lam_lasso, max_iter, soft_threshold)
rho = 1.0
AtA_rho_inv = np.linalg.inv(A.T @ A + rho * np.eye(p))
Atb = A.T @ b

x, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
for k in range(max_iter):
    x = AtA_rho_inv @ (Atb + rho * (z - u))              # x-update
    z_old = z.copy()
    z = soft_threshold(x + u, lam_lasso / rho)            # z-update
    u = u + x - z                                          # dual update

    primal_res = np.linalg.norm(x - z)                    # primal residual
    dual_res = rho * np.linalg.norm(z - z_old)            # dual residual

Convergence monitoring. ADMM convergence is tracked via two residuals:

  • Primal residual $r_k = \|x_k - z_k\|$: measures constraint violation (how far $x$ and $z$ are from consensus).
  • Dual residual $s_k = \rho\|z_k - z_{k-1}\|$: measures the change in the dual variable (stationarity of the Lagrangian).

The algorithm converges when both residuals are small. The standard stopping criterion is $r_k < \epsilon^{\mathrm{pri}}$ and $s_k < \epsilon^{\mathrm{dual}}$.

Sensitivity to $\rho$. The penalty parameter $\rho$ has a significant effect on convergence speed but does not affect the final solution. Large $\rho$ enforces the constraint more aggressively and drives the primal residual down quickly, while small $\rho$ lets the $x$- and $z$-updates move more freely and drives the dual residual down quickly. In practice, adaptive $\rho$ schemes that increase $\rho$ when the primal residual dominates and decrease it when the dual residual dominates can dramatically improve convergence.
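One common heuristic is the residual-balancing rule of Boyd et al. (2011); here is a sketch (the helper and the parameter names `mu`, `tau` are ours). Note that in scaled form the dual variable $u = y/\rho$ must be rescaled whenever $\rho$ changes:

```python
import numpy as np

def update_rho(rho, u, primal_res, dual_res, mu=10.0, tau=2.0):
    """Residual balancing: grow rho when the primal residual dominates,
    shrink it when the dual residual dominates; rescale u = y/rho to match."""
    if primal_res > mu * dual_res:
        return rho * tau, u / tau
    if dual_res > mu * primal_res:
        return rho / tau, u * tau
    return rho, u

u = np.array([1.0, -2.0])
rho2, u2 = update_rho(1.0, u, primal_res=50.0, dual_res=1.0)  # rho doubled
rho3, u3 = update_rho(1.0, u, primal_res=1.0, dual_res=1.0)   # unchanged
```

Calling this once per iteration (after computing `primal_res` and `dual_res` as in the loop above) keeps the two residuals within a factor of `mu` of each other.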

Remark.

ADMM as Douglas–Rachford on the dual. ADMM applied to the consensus problem $\min f(x) + g(z)$ s.t. $x = z$ is equivalent to applying Douglas–Rachford splitting to the dual problem. This explains why both methods achieve an $O(1/k)$ convergence rate and why the penalty parameter $\rho$ in ADMM plays the same role as the step-size parameter $\gamma$ in DR. The primal-dual structure of ADMM provides richer convergence diagnostics (two residuals instead of one), which is a practical advantage.

ADMM: algorithm comparison, primal-dual residuals, and sensitivity to penalty parameter rho


Convergence Analysis

We now have four splitting algorithms — ISTA, FISTA, Douglas–Rachford, and ADMM — each designed for a different problem structure. The following table summarizes the convergence landscape:

| Algorithm | Setting | Rate (convex) | Rate (strongly convex, $\mu > 0$) |
|---|---|---|---|
| ISTA (Proximal Gradient) | $g$ smooth + $h$ prox-friendly | $O(LR^2/k)$ | $(1 - \mu/L)^k$ (linear) |
| FISTA (Accelerated) | $g$ smooth + $h$ prox-friendly | $O(LR^2/k^2)$ (optimal) | $(1 - \sqrt{\mu/L})^k$ (linear, faster) |
| Douglas–Rachford | both non-smooth, both prox-friendly | $O(1/k)$ | – |
| ADMM | consensus/splitting form | $O(1/k)$ | – |

Here $L$ is the smoothness constant of $g$, $R = \|x_0 - x^*\|$ is the initial distance to the solution, $\mu$ is the strong convexity parameter, and $\kappa = L/\mu$ is the condition number.

Key observations:

  1. FISTA is always at least as fast as ISTA, and strictly faster when the target accuracy $\epsilon$ is small. Under strong convexity the gap widens as $\kappa$ grows: ISTA needs $O(\kappa \log(1/\epsilon))$ iterations while FISTA needs $O(\sqrt{\kappa}\log(1/\epsilon))$.

  2. Strong convexity accelerates both ISTA and FISTA from sublinear to linear convergence. The practical message: if you know your problem is strongly convex, convergence is exponentially faster.

  3. ADMM and DR have the same theoretical rate ($O(1/k)$), but ADMM is often faster in practice because it exploits the structure of the consensus formulation and allows per-variable parallelism.

  4. Condition number dependence is the bottleneck. For all these methods, ill-conditioned problems (large $\kappa$) converge slowly. Preconditioning, i.e. transforming the problem to reduce $\kappa$, is often more effective than switching algorithms.

Convergence analysis: rate comparison, iterations-to-accuracy, and actual convergence on Lasso


Computational Notes

We close with a full implementation of proximal gradient descent with Armijo backtracking — the version you’d actually use in practice, with convergence diagnostics.

Backtracking step size selection. Instead of computing $L = \lambda_{\max}(A^\top A)$ exactly (which requires an eigendecomposition), backtracking finds a suitable step size adaptively. Starting from an initial estimate $L_{\text{est}}$, it doubles $L_{\text{est}}$ until the sufficient decrease condition is satisfied:

$$g(x_{k+1}) \leq g(y) + \langle \nabla g(y), x_{k+1} - y \rangle + \frac{L_{\text{est}}}{2}\|x_{k+1} - y\|^2$$

This ensures the step size is never too large while allowing it to exceed $1/L$ when the local curvature is smaller.

import numpy as np

# uses soft_threshold as defined earlier
def proximal_gradient_bt(A, b, lam, x0=None, max_iter=500, tol=1e-8,
                         accelerated=False, verbose=False):
    """
    Proximal gradient for Lasso: min (1/2)||Ax-b||^2 + lam*||x||_1
    with Armijo backtracking for step size selection.

    Parameters
    ----------
    A : (n, p) array
    b : (n,) array
    lam : float, regularization parameter
    x0 : initial point (default: zeros)
    max_iter : max iterations
    tol : stopping tolerance on relative change
    accelerated : if True, use FISTA momentum

    Returns
    -------
    x : solution
    info : dict with convergence diagnostics
    """
    n_obs, p_dim = A.shape
    if x0 is None:
        x0 = np.zeros(p_dim)

    x = x0.copy()
    x_prev = x.copy()
    t_k = 1.0
    L_est = 1.0

    AtA = A.T @ A
    Atb_vec = A.T @ b

    info = {'f_vals': [], 'grad_norms': [], 'step_sizes': [], 'nnz': []}

    def f_smooth(x):
        r = A @ x - b
        return 0.5 * np.dot(r, r)

    def grad_smooth(x):
        return AtA @ x - Atb_vec

    def F_obj(x):
        return f_smooth(x) + lam * np.linalg.norm(x, 1)

    for k in range(max_iter):
        if accelerated and k > 0:
            t_next = (1 + np.sqrt(1 + 4 * t_k**2)) / 2
            y = x + ((t_k - 1) / t_next) * (x - x_prev)
            t_k = t_next
        else:
            y = x

        g = grad_smooth(y)
        f_y = f_smooth(y)

        # Backtracking: find step size that satisfies descent condition
        while True:
            eta_k = 1.0 / L_est
            x_new = soft_threshold(y - eta_k * g, eta_k * lam)
            f_new = f_smooth(x_new)
            d = x_new - y
            if f_new <= f_y + np.dot(g, d) + (L_est / 2) * np.dot(d, d):
                break
            L_est *= 2

        # Adaptive decrease of L_est for efficiency
        if k % 10 == 0 and L_est > 1.0:
            L_est *= 0.8

        x_prev = x.copy()
        x = x_new

        info['f_vals'].append(F_obj(x))
        info['grad_norms'].append(np.linalg.norm(g))
        info['step_sizes'].append(eta_k)
        info['nnz'].append(np.sum(np.abs(x) > 1e-8))

        # Stopping criterion: relative change in objective (guarded so it
        # remains meaningful when the objective approaches zero)
        if k > 0 and abs(info['f_vals'][-1] - info['f_vals'][-2]) < tol * (abs(info['f_vals'][-2]) + 1e-12):
            break

    return x, info
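As an end-to-end sanity check of the update the solver iterates, consider an orthonormal design, where the Lasso solution has the closed form soft_threshold(Aᵀb, λ). The sketch below defines its own `soft_threshold` helper so it runs standalone; on a general design the backtracking solver above handles the unknown Lipschitz constant instead.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau*||.||_1: shrink each coordinate toward zero, snap to zero below tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Orthonormal design: A^T A = I, so L = ||A^T A|| = 1 and the Lasso
# solution is exactly soft_threshold(A^T b, lam).
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((20, 5)))
b = A @ np.array([1.5, 0.0, -2.0, 0.0, 0.5])
lam = 0.3

# Plain ISTA with step eta = 1/L = 1
x = np.zeros(5)
for _ in range(100):
    x = soft_threshold(x - A.T @ (A @ x - b), lam)

assert np.allclose(x, soft_threshold(A.T @ b, lam), atol=1e-8)
```

Because $A^\top A = I$ here, the ISTA map reaches its fixed point in a single step, which makes the closed-form comparison exact up to floating point.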

Convergence diagnostics. A well-implemented proximal gradient solver tracks four quantities:

  1. Objective value $F(x_k)$ — should decrease monotonically for ISTA; FISTA is not a descent method, so small oscillations are normal.
  2. Gradient norm $\|\nabla g(y_k)\|$ — tracks stationarity of the smooth part only; for the composite objective, the principled measure is the proximal-gradient residual $\|x_{k+1} - y_k\|/\eta_k$, which vanishes exactly at minimizers.
  3. Step size $\eta_k$ — should stabilize; rapid changes indicate numerical issues.
  4. Sparsity $\|x_k\|_0$ (number of nonzeros) — should converge to a stable support.
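The stationarity measure behind the second diagnostic is the proximal-gradient mapping $G_\eta(x) = \big(x - \mathrm{prox}_{\eta h}(x - \eta\,\nabla g(x))\big)/\eta$, whose norm is zero exactly at minimizers of the composite objective. A minimal sketch for the $\ell_1$ case, with its own `soft_threshold` helper:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_residual(x, grad, lam, eta):
    """||G_eta(x)|| where G_eta(x) = (x - prox_{eta*lam*||.||_1}(x - eta*grad)) / eta.
    Zero if and only if x minimizes g(x) + lam*||x||_1."""
    return np.linalg.norm(x - soft_threshold(x - eta * grad, eta * lam)) / eta

# For g(x) = (1/2)||x - b||^2 the composite minimizer is soft_threshold(b, lam):
b = np.array([2.0, 0.1, -1.0])
lam = 0.5
x_star = soft_threshold(b, lam)                       # [1.5, 0.0, -0.5]
assert prox_grad_residual(x_star, x_star - b, lam, eta=1.0) < 1e-12

# Away from the minimizer the residual is strictly positive:
assert prox_grad_residual(np.zeros(3), -b, lam, eta=1.0) > 0.1
```

Unlike the raw gradient norm, this residual is a valid stopping criterion even though $h$ is non-smooth.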


Connections & Further Reading

Proximal methods sit at the intersection of convex analysis, gradient-based optimization, and linear algebra — drawing on the foundations of the first two topics in the Optimization track and connecting forward to constrained optimization.

| Topic | Connection |
| --- | --- |
| Convex Analysis | Every proximal operator solves a convex minimization problem. The subdifferential optimality condition $0 \in \partial f(x^*)$ becomes the fixed-point condition $x^* = \mathrm{prox}_f(x^*)$. Conjugate functions and the Moreau envelope are dual objects connected by the Fenchel–Moreau theorem. |
| Gradient Descent & Convergence | Proximal gradient descent generalizes gradient descent to composite objectives $F = g + h$. When $h = 0$, the proximal step reduces to the identity and we recover vanilla GD. The convergence analysis parallels the Descent Lemma, and FISTA inherits Nesterov's momentum machinery. Mirror descent from GD is a special case of the proximal point algorithm with a Bregman divergence. |
| The Spectral Theorem | The condition number $\kappa = L/\mu$ that governs proximal gradient convergence involves eigenvalues of the Hessian. For quadratic $g(x) = \frac{1}{2}x^\top A x$, the Spectral Theorem gives the eigendecomposition that determines $L$ and $\mu$. |
| Singular Value Decomposition | The proximal operator of the nuclear norm $\|X\|_* = \sum_i \sigma_i$ requires computing the SVD and applying soft-thresholding to the singular values — the singular value thresholding operator, which is the key subroutine in low-rank matrix recovery. |
| Lagrangian Duality & KKT | ADMM is a dual decomposition method: it applies Douglas–Rachford splitting to the dual of the consensus problem. The KKT conditions generalize the fixed-point characterization $0 \in \partial f(x^*)$ to constrained settings. |
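The singular value thresholding operator from the SVD row above is short enough to sketch directly; this version assumes a dense matrix and NumPy's full SVD:

```python
import numpy as np

def svt(X, tau):
    """Prox of tau*||.||_*: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# A rank-1 matrix with singular value 3: thresholding at tau = 1 shrinks it to 2.
u = np.array([[1.0], [0.0]])
v = np.array([[1.0, 0.0]])
X = 3.0 * u @ v
assert np.allclose(svt(X, 1.0), 2.0 * u @ v)
```

For large matrices, practical implementations replace the full SVD with a truncated or randomized one, since only singular values above $\tau$ survive.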

The Optimization Track

Convex Analysis
    ├── Gradient Descent & Convergence
    │       └── Proximal Methods (this topic)
    └── Lagrangian Duality & KKT

References & Further Reading