
Convex Analysis

Convex sets, convex functions, subdifferentials, and separation theorems — the geometric foundation of tractable optimization

Prerequisites: The Spectral Theorem

Overview & Motivation

Here is the single most important fact in all of optimization:

Every local minimum of a convex function is a global minimum.

This one sentence is why convex optimization is tractable and general optimization is not. In a non-convex landscape — say a deep neural network loss surface — gradient descent can get trapped in local minima, saddle points, or flat regions, and we have no guarantee that the point we converge to is anywhere near optimal. But if the function we're minimizing is convex, then any point where the gradient vanishes (or, more generally, where $0$ lies in the subdifferential) is a global minimizer. No restarts, no escaping saddle points, no hoping for the best.

This topic develops the mathematical infrastructure behind that fact. We start with the two geometric primitives — convex sets (closed under line segments) and convex functions (whose epigraph is a convex set) — and build up through three layers:

  1. Characterization: First-order and second-order conditions that connect convexity to gradients and Hessians.
  2. Closure: Operations that preserve convexity — the toolkit that lets us certify complex functions as convex without checking from scratch.
  3. Duality: Conjugate functions and subdifferentials that extend calculus to non-smooth settings, enabling the analysis of functions like $\|x\|_1$ and $\max(0, x)$ that appear throughout modern ML.

The separating hyperplane theorem provides the geometric backbone: it says that disjoint convex sets can always be separated by a hyperplane. This seemingly simple geometric fact has far-reaching consequences — it is the foundation of support vector machines, LP duality, and the entire theory of Lagrangian optimization.

What We Cover

  1. Convex Sets — the line-segment definition, convex combinations, and a gallery of examples.
  2. Convex Hull & Extreme Points — how to construct the smallest convex set containing a given set.
  3. Convex Functions — the chord inequality, epigraphs, and Jensen’s inequality.
  4. First-Order & Second-Order Conditions — connecting convexity to tangent lines and Hessian eigenvalues (via the Spectral Theorem).
  5. Operations Preserving Convexity — nonnegative weighted sums, pointwise max, composition rules (DCP).
  6. Conjugate Functions — the Legendre–Fenchel transform and biconjugation.
  7. Subdifferentials & Subgradients — generalized gradients for non-smooth functions.
  8. Separation Theorems — the geometric foundation of duality.
  9. Computational Notes — DCP rules, CVXPY, and numerical convexity verification.

Why convexity matters: in a non-convex landscape, local minima are traps; in a convex landscape, every local minimum is global


Convex Sets

Everything begins with a definition that is geometric at its core: a set is convex if you can draw a straight line between any two of its points and the line stays inside the set.

Definition 1 (Convex Set).

A set $C \subseteq \mathbb{R}^n$ is convex if for every $x, y \in C$ and every $\theta \in [0, 1]$:

$$\theta x + (1 - \theta) y \in C$$

Geometrically: the line segment from $x$ to $y$ lies entirely in $C$.

This is the line-segment test, and it is the only thing you need to verify. An ellipse passes — every chord stays inside. An L-shape fails — there exist pairs of points whose connecting segment exits the set.

Definition 2 (Convex Combination).

A point $z$ is a convex combination of $x_1, \ldots, x_k$ if $z = \sum_{i=1}^k \theta_i x_i$ where $\theta_i \geq 0$ and $\sum_{i=1}^k \theta_i = 1$. A set $C$ is convex if and only if it contains all convex combinations of its points.

Convex sets are everywhere. Here are the workhorses:

  • Halfspaces: $\{x : a^T x \leq b\}$ for fixed $a \neq 0$ and $b$. Every linear inequality defines a halfspace, and every halfspace is convex.
  • Balls: $\{x : \|x - c\| \leq r\}$ in any norm. Euclidean balls, $\ell_1$ balls (cross-polytopes), $\ell_\infty$ balls (cubes).
  • Polyhedra: $\{x : Ax \leq b\}$ — intersections of finitely many halfspaces.
  • Cones: the second-order cone $\{(x, t) : \|x\|_2 \leq t\}$, and the positive semidefinite cone $\mathbb{S}^n_+ = \{X \in \mathbb{R}^{n \times n} : X = X^T, X \succeq 0\}$.
  • The PSD cone $\mathbb{S}^n_+$: the set of all $n \times n$ symmetric positive semidefinite matrices. Its structure is governed by the Spectral Theorem — a symmetric matrix is PSD if and only if all its eigenvalues are nonnegative.

Convexity is preserved by intersection (any intersection of convex sets is convex), by affine maps (if $C$ is convex, then $f(C) = \{Ax + b : x \in C\}$ is convex), and by inverse affine maps (if $C$ is convex, then $f^{-1}(C) = \{x : Ax + b \in C\}$ is convex).
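The line-segment test also lends itself to a quick numerical spot check. The sketch below samples pairs of member points and tests whether points along the connecting segment stay inside — the membership predicates `in_halfspace` and `in_lshape` and all the constants are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def segment_stays_inside(members, indicator, n_pairs=200, n_theta=25):
    """Spot-check the line-segment test on sampled pairs of member points."""
    thetas = np.linspace(0, 1, n_theta)
    for _ in range(n_pairs):
        x, y = members[rng.choice(len(members), 2, replace=False)]
        for t in thetas:
            if not indicator(t * x + (1 - t) * y):
                return False
    return True

# Halfspace {x : a^T x <= b} -- convex, so the test should pass
a, b = np.array([1.0, 2.0]), 1.0
in_halfspace = lambda p: a @ p <= b + 1e-12
pts = rng.standard_normal((500, 2))
halfspace_members = pts[pts @ a <= b]

# L-shape (union of two axis-aligned rectangles) -- non-convex, so the test should fail
in_lshape = lambda p: (0 <= p[0] <= 2 and 0 <= p[1] <= 1) or (0 <= p[0] <= 1 and 0 <= p[1] <= 2)
lshape_members = np.array([[0.5, 1.8], [1.8, 0.5], [0.2, 0.2], [1.5, 0.9]])

print(segment_stays_inside(halfspace_members, in_halfspace))  # True
print(segment_stays_inside(lshape_members, in_lshape))        # False
```

Sampling can only falsify convexity (one escaping segment suffices); passing the check is evidence, not a proof.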


Convex vs non-convex sets: line segment test and a gallery of canonical convex sets


Convex Hull & Extreme Points

Given any set $S$ — convex or not — we can always construct the smallest convex set that contains it.

Definition 3 (Convex Hull).

The convex hull of a set $S \subseteq \mathbb{R}^n$, denoted $\mathrm{conv}(S)$, is the set of all convex combinations of points in $S$:

$$\mathrm{conv}(S) = \left\{ \sum_{i=1}^k \theta_i x_i : x_i \in S,\; \theta_i \geq 0,\; \sum_i \theta_i = 1,\; k \in \mathbb{N} \right\}$$

Equivalently, $\mathrm{conv}(S)$ is the intersection of all convex sets containing $S$.

The vertices of the convex hull are special — they are the points that cannot be written as convex combinations of other points.

Definition 4 (Extreme Point).

A point $x \in C$ is an extreme point of a convex set $C$ if there do not exist $y, z \in C$ with $y \neq z$ and $\theta \in (0, 1)$ such that $x = \theta y + (1 - \theta) z$. In other words, $x$ is not a strict convex combination of two other points in $C$.

The Krein–Milman theorem says that every compact convex set is the convex hull of its extreme points. This is a foundational result in functional analysis, and it has practical consequences: to optimize a linear function over a compact convex set, we need only check the extreme points — a fact that underpins the simplex method for linear programming.

Convex hull construction: raw point set, convex hull boundary, and extreme points highlighted

# Compute the convex hull of a random point set
import numpy as np
from scipy.spatial import ConvexHull

points = np.random.randn(30, 2) * 1.5
hull = ConvexHull(points)

# hull.vertices gives the indices of the extreme points
# (in counterclockwise order for 2-D inputs)
extreme_points = points[hull.vertices]

Convex Functions

We now turn from sets to functions. The definition is again geometric: a function is convex if the chord between any two points on its graph lies above the graph.

Definition 5 (Convex Function).

A function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is convex if for all $x, y$ in its domain and all $\theta \in [0, 1]$:

$$f(\theta x + (1 - \theta)y) \leq \theta f(x) + (1 - \theta)f(y)$$

The left side evaluates $f$ at the weighted average of $x$ and $y$; the right side is the weighted average of $f(x)$ and $f(y)$. The inequality says: the function at the blend is at most the blend of the function values.

The connection between convex functions and convex sets runs through the epigraph.

Definition 6 (Epigraph).

The epigraph of a function $f$ is the set of points lying on or above its graph:

$$\mathrm{epi}(f) = \{(x, t) \in \mathbb{R}^{n+1} : f(x) \leq t\}$$

Proposition 1 (Epigraph Characterization of Convexity).

A function $f$ is convex if and only if $\mathrm{epi}(f)$ is a convex set.

Proof.

($\Rightarrow$) Suppose $f$ is convex. Take any two points $(x, s), (y, t) \in \mathrm{epi}(f)$, so $f(x) \leq s$ and $f(y) \leq t$. For $\theta \in [0, 1]$:

$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) \leq \theta s + (1-\theta)t$$

So $(\theta x + (1-\theta)y, \,\theta s + (1-\theta)t) \in \mathrm{epi}(f)$, meaning $\mathrm{epi}(f)$ is convex.

($\Leftarrow$) Suppose $\mathrm{epi}(f)$ is convex. For $x, y$ in the domain of $f$, the points $(x, f(x))$ and $(y, f(y))$ lie in $\mathrm{epi}(f)$. Convexity of the epigraph gives:

$$(\theta x + (1-\theta)y, \,\theta f(x) + (1-\theta)f(y)) \in \mathrm{epi}(f)$$

which means $f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y)$, so $f$ is convex.

This equivalence is powerful: it lets us transfer results about convex sets directly to convex functions.

Jensen’s Inequality

The chord inequality generalizes from two points to any weighted average — this is Jensen’s inequality, one of the most widely used results in probability and information theory.

Theorem 1 (Jensen's Inequality).

If $f$ is convex and $X$ is a random variable with $\mathbb{E}[X]$ in the domain of $f$, then:

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$$

Applying a convex function then averaging gives at least as much as averaging then applying the function.

Proof.

We prove the finite case. Let $x_1, \ldots, x_k$ be values with weights $\theta_1, \ldots, \theta_k \geq 0$ summing to $1$. We proceed by induction on $k$.

Base case ($k = 2$): This is exactly the definition of convexity.

Inductive step: Assume the result holds for $k - 1$ points, and assume $\theta_k < 1$ (if $\theta_k = 1$ there is nothing to prove). Write:

$$\sum_{i=1}^k \theta_i x_i = \theta_k x_k + (1 - \theta_k) \sum_{i=1}^{k-1} \frac{\theta_i}{1 - \theta_k} x_i$$

The inner sum is a convex combination of $k - 1$ points (the weights $\theta_i / (1 - \theta_k)$ sum to $1$). Call it $\bar{x}$. By convexity of $f$:

$$f\!\left(\sum_{i=1}^k \theta_i x_i\right) = f(\theta_k x_k + (1 - \theta_k) \bar{x}) \leq \theta_k f(x_k) + (1 - \theta_k) f(\bar{x})$$

By the inductive hypothesis, $f(\bar{x}) \leq \sum_{i=1}^{k-1} \frac{\theta_i}{1 - \theta_k} f(x_i)$. Substituting:

$$f\!\left(\sum_{i=1}^k \theta_i x_i\right) \leq \theta_k f(x_k) + \sum_{i=1}^{k-1} \theta_i f(x_i) = \sum_{i=1}^k \theta_i f(x_i)$$
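A quick numerical sanity check of Jensen's inequality for the convex function $f(x) = e^x$ (a sketch; the distribution and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal(100_000)  # samples of a random variable X

f = np.exp          # a convex function
lhs = f(X.mean())   # f(E[X]), using the empirical mean
rhs = f(X).mean()   # E[f(X)], empirical

print(f"f(E[X]) = {lhs:.4f}  <=  E[f(X)] = {rhs:.4f}")
```

For a standard normal $X$, $\mathbb{E}[e^X] = e^{1/2} \approx 1.65$ while $f(\mathbb{E}[X]) \approx e^0 = 1$, so the gap is substantial.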

Convex functions: chord inequality, non-convex epigraph, and Jensen's inequality


First-Order & Second-Order Conditions

The chord inequality is the definition, but there are more practical characterizations for differentiable functions.

The Tangent Line Condition

Theorem 2 (First-Order Condition for Convexity).

Suppose $f$ is differentiable on an open convex set. Then $f$ is convex if and only if for all $x, y$:

$$f(y) \geq f(x) + \nabla f(x)^T (y - x)$$

The tangent hyperplane at any point is a global underestimator — the function never dips below its linearization.

Proof.

($\Rightarrow$) Suppose $f$ is convex. For any $\theta \in (0, 1]$:

$$f(x + \theta(y - x)) \leq (1 - \theta) f(x) + \theta f(y)$$

Rearranging:

$$f(y) \geq f(x) + \frac{f(x + \theta(y - x)) - f(x)}{\theta}$$

Taking $\theta \to 0^+$, the right side becomes $f(x) + \nabla f(x)^T(y - x)$.

($\Leftarrow$) Suppose the tangent inequality holds. For any $x, y$ and $\theta \in [0, 1]$, let $z = \theta x + (1 - \theta)y$. Apply the tangent inequality at $z$:

$$f(x) \geq f(z) + \nabla f(z)^T(x - z), \qquad f(y) \geq f(z) + \nabla f(z)^T(y - z)$$

Multiply the first by $\theta$, the second by $(1 - \theta)$, and add:

$$\theta f(x) + (1-\theta)f(y) \geq f(z) + \nabla f(z)^T(\theta x + (1-\theta)y - z) = f(z)$$

since $\theta x + (1-\theta)y - z = 0$.

The Hessian Condition

For twice-differentiable functions, convexity is equivalent to the Hessian being positive semidefinite — and this is where the Spectral Theorem enters.

Theorem 3 (Second-Order Condition for Convexity).

Suppose $f$ is twice differentiable on an open convex set. Then $f$ is convex if and only if the Hessian is positive semidefinite everywhere:

$$\nabla^2 f(x) \succeq 0 \quad \text{for all } x$$

By the Spectral Theorem, the Hessian (a symmetric matrix) admits an eigendecomposition $\nabla^2 f(x) = Q \Lambda Q^T$, and $\nabla^2 f(x) \succeq 0$ if and only if all eigenvalues $\lambda_i \geq 0$.

This is the bridge between convexity and spectral theory. When we check whether a neural network loss function is locally convex around a critical point, we compute the Hessian and check its eigenvalues. The Spectral Theorem guarantees these eigenvalues are real and the eigenvectors are orthogonal — the entire machinery of eigendecomposition is built for exactly this purpose.

Example. Consider $f(x_1, x_2) = 2x_1^2 + x_2^2 + x_1 x_2$. The Hessian is:

$$\nabla^2 f = \begin{pmatrix} 4 & 1 \\ 1 & 2 \end{pmatrix}$$

Its eigenvalues are $3 \pm \sqrt{2}$, approximately $4.41$ and $1.59$ — both positive. So $f$ is strictly convex.
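The eigenvalue check for this Hessian, done numerically:

```python
import numpy as np

# Hessian of f(x1, x2) = 2*x1**2 + x2**2 + x1*x2
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])

eigvals = np.linalg.eigvalsh(H)  # ascending order for symmetric matrices
print(eigvals)                   # approximately [1.586, 4.414], i.e. 3 ± sqrt(2)
```

Since the smallest eigenvalue is strictly positive, the Hessian is positive definite and $f$ is strictly convex.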

First-order condition (tangent below graph), second-order condition (Hessian eigenvalues), and level set comparison


Operations Preserving Convexity

Checking convexity from the definition or from the Hessian can be tedious. A more practical approach is to build convex functions from known convex building blocks using operations that preserve convexity. This is the idea behind Disciplined Convex Programming (DCP).

Nonnegative Weighted Sums

If $f_1, \ldots, f_k$ are convex and $\alpha_1, \ldots, \alpha_k \geq 0$, then $\sum_i \alpha_i f_i$ is convex. This follows directly from the definition — the chord inequality distributes over sums.

Pointwise Maximum and Supremum

Proposition 2 (Pointwise Supremum Preserves Convexity).

If $\{f_\alpha\}_{\alpha \in \mathcal{A}}$ is a family of convex functions, then $g(x) = \sup_{\alpha \in \mathcal{A}} f_\alpha(x)$ is convex.

Proof.

For any $x, y$ and $\theta \in [0, 1]$:

$$g(\theta x + (1-\theta)y) = \sup_\alpha f_\alpha(\theta x + (1-\theta)y) \leq \sup_\alpha \left[\theta f_\alpha(x) + (1-\theta) f_\alpha(y)\right]$$

Since $\theta f_\alpha(x) \leq \theta \sup_\beta f_\beta(x) = \theta \, g(x)$ and similarly for the second term:

$$g(\theta x + (1-\theta)y) \leq \theta \, g(x) + (1-\theta) \, g(y)$$

This result is remarkably useful. The maximum of finitely many convex functions is convex: $\max(f_1(x), f_2(x), f_3(x))$ is convex if each $f_i$ is. The $\ell_\infty$ norm $\|x\|_\infty = \max_i |x_i|$ is convex because it is the max of the convex functions $|x_i|$.
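A spot check (not a proof) of the chord inequality for the $\ell_\infty$ norm on random pairs — the dimension and number of trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def linf(x):
    return np.max(np.abs(x))  # pointwise max of the convex functions |x_i|

ok = True
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    theta = rng.uniform()
    blend = theta * x + (1 - theta) * y
    # chord inequality: f(blend) <= theta*f(x) + (1-theta)*f(y)
    ok &= linf(blend) <= theta * linf(x) + (1 - theta) * linf(y) + 1e-12

print(ok)  # True: no violation found
```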

Composition Rules (DCP)

The composition $h \circ g$ is convex under specific monotonicity conditions:

  • $h$ convex and nondecreasing, $g$ convex $\Rightarrow$ $h \circ g$ convex
  • $h$ convex and nonincreasing, $g$ concave $\Rightarrow$ $h \circ g$ convex
  • $h$ concave and nondecreasing, $g$ concave $\Rightarrow$ $h \circ g$ concave
  • $h$ convex, $g$ affine $\Rightarrow$ $h \circ g$ convex (no monotonicity condition needed)

These rules are the backbone of solvers like CVXPY. They verify convexity by parsing the composition structure of the objective function rather than computing second derivatives.

The Perspective Function

If $f$ is convex, then its perspective $g(x, t) = t \, f(x/t)$ for $t > 0$ is also convex. This construction appears in information theory (the KL divergence is built from the perspective of $-\log$) and in conic optimization.

Operations preserving convexity: weighted sums, pointwise max, and perspective function


Conjugate Functions

The Legendre–Fenchel conjugate provides a dual representation of convex functions — a powerful tool that transforms optimization problems into their dual forms.

Definition 7 (Legendre–Fenchel Conjugate).

The conjugate (or Fenchel conjugate) of a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is:

$$f^*(s) = \sup_{x} \left\{ s^T x - f(x) \right\}$$

For each slope $s$, the conjugate $f^*(s)$ measures the maximum gap between the linear function $s^T x$ and $f(x)$.

The conjugate $f^*$ is always convex (it is a pointwise supremum of affine functions of $s$), even if $f$ is not. Here are the classic conjugate pairs:

  • $f(x) = \frac{1}{2}x^2$ has conjugate $f^*(s) = \frac{1}{2}s^2$
  • $f(x) = \|x\|$ (any norm) has conjugate $f^*(s) = \delta_{\{s : \|s\|_* \leq 1\}}$, the indicator of the dual norm ball
  • $f(x) = e^x$ has conjugate $f^*(s) = s \ln s - s$ for $s > 0$
  • $f(x) = \delta_C(x)$, the indicator of $C$, has conjugate $f^*(s) = \sup_{x \in C} s^T x$, the support function of $C$
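The supremum in the definition can be approximated by brute force on a grid — a sketch confirming that $f(x) = \frac{1}{2}x^2$ is its own conjugate (grid bounds and spacing are arbitrary choices):

```python
import numpy as np

xs = np.linspace(-10, 10, 4001)  # grid over which the sup is taken

def conjugate(f, s):
    """Numerical Legendre-Fenchel conjugate: f*(s) = sup_x { s*x - f(x) }."""
    return np.max(s * xs - f(xs))

f = lambda x: 0.5 * x**2
for s in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    print(s, conjugate(f, s), 0.5 * s**2)  # the two values agree to grid accuracy
```

For $f(x) = \frac{1}{2}x^2$ the supremum is attained at $x = s$, which lies inside the grid for the slopes tested.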

The biconjugation theorem is the central duality result: applying the conjugate twice recovers the original function — but only if it was convex to begin with.

Theorem 4 (Fenchel–Moreau Biconjugation).

If $f$ is a closed convex function (i.e., lower semicontinuous, convex, and proper), then $f^{**} = f$.

If $f$ is not convex, then $f^{**}$ is the convex envelope of $f$ — the largest closed convex function that lies below $f$.

Proof.

Proof sketch. We always have $f^{**} \leq f$ (apply the supremum definition twice). For the reverse inequality when $f$ is closed and convex: at any point $x_0$, by the supporting hyperplane theorem, there exists $s_0$ such that $f(x_0) = s_0^T x_0 - f^*(s_0)$. Then $f^{**}(x_0) \geq s_0^T x_0 - f^*(s_0) = f(x_0)$.

Biconjugation is the foundation of Lagrangian duality in optimization: the dual problem is the conjugate of the primal, and strong duality says that solving the dual recovers the primal optimum — under convexity.

Conjugate functions: geometric construction, conjugate pairs, and biconjugation (convex envelope)


Subdifferentials & Subgradients

Not all convex functions are differentiable. The $\ell_1$ norm $\|x\|_1$, the hinge loss $\max(0, 1 - y \hat{y})$, and the ReLU $\max(0, x)$ all have kinks where the gradient does not exist. Subgradients generalize the gradient to handle these cases.

Definition 8 (Subgradient).

A vector $g \in \mathbb{R}^n$ is a subgradient of a convex function $f$ at $x$ if:

$$f(y) \geq f(x) + g^T(y - x) \quad \text{for all } y$$

That is, the affine function $y \mapsto f(x) + g^T(y - x)$ is a global underestimator of $f$.

Definition 9 (Subdifferential).

The subdifferential of $f$ at $x$ is the set of all subgradients:

$$\partial f(x) = \{g \in \mathbb{R}^n : f(y) \geq f(x) + g^T(y - x) \text{ for all } y\}$$

If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$ — the subdifferential is a singleton containing the ordinary gradient. At a non-differentiable point, $\partial f(x)$ is a closed convex set of possible "slopes."

Example: For $f(x) = |x|$:

  • At $x > 0$: $\partial f(x) = \{1\}$ (the ordinary derivative)
  • At $x < 0$: $\partial f(x) = \{-1\}$
  • At $x = 0$: $\partial f(0) = [-1, 1]$ (any slope between $-1$ and $1$ gives a valid underestimator)
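A numerical spot check that every $g \in [-1, 1]$ is a subgradient of $|x|$ at $0$, while slopes outside that interval are not (the grid and candidate slopes are arbitrary):

```python
import numpy as np

ys = np.linspace(-5, 5, 1001)

def is_subgradient_at_zero(g):
    """Check f(y) >= f(0) + g*(y - 0) for f(x) = |x| on a grid of y values."""
    return bool(np.all(np.abs(ys) >= g * ys - 1e-12))

print([g for g in [-1.5, -1.0, 0.0, 0.7, 1.0, 1.5] if is_subgradient_at_zero(g)])
# only the slopes in [-1, 1] survive
```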

The subdifferential gives us a complete optimality condition for convex optimization:

Theorem 5 (Subgradient Optimality Condition).

For a convex function $f$, $x^*$ is a global minimizer if and only if:

$$0 \in \partial f(x^*)$$

Proof.

($\Leftarrow$) If $0 \in \partial f(x^*)$, then by the subgradient inequality with $g = 0$:

$$f(y) \geq f(x^*) + 0^T(y - x^*) = f(x^*) \quad \text{for all } y$$

So $x^*$ is a global minimizer.

($\Rightarrow$) If $x^*$ is a global minimizer, then $f(y) \geq f(x^*)$ for all $y$. This means $f(y) \geq f(x^*) + 0^T(y - x^*)$, so $g = 0$ satisfies the subgradient inequality, i.e., $0 \in \partial f(x^*)$.

Subdifferential Calculus

The subdifferential obeys sum and chain rules (with some caveats):

  • Sum rule: $\partial(f_1 + f_2)(x) \supseteq \partial f_1(x) + \partial f_2(x)$ (Minkowski sum). Equality holds when a constraint qualification is satisfied (e.g., one of the functions is continuous at $x$).
  • Scalar multiplication: $\partial(\alpha f)(x) = \alpha \, \partial f(x)$ for $\alpha > 0$.
  • Chain rule: For $h(x) = g(Ax + b)$ with $g$ convex, $\partial h(x) = A^T \partial g(Ax + b)$.
  • Connection to conjugates: $s \in \partial f(x)$ if and only if $x \in \partial f^*(s)$ if and only if $f(x) + f^*(s) = s^T x$ (the Fenchel–Young equality).
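For $f(x) = \frac{1}{2}x^2$ (differentiable, so $\partial f(x) = \{x\}$, and self-conjugate with $f^*(s) = \frac{1}{2}s^2$), the Fenchel–Young equality can be checked directly:

```python
import numpy as np

f = lambda x: 0.5 * x**2
f_star = lambda s: 0.5 * s**2   # conjugate of f (self-conjugate)

for x in np.linspace(-3, 3, 7):
    s = x  # s in the subdifferential of f at x: {f'(x)} = {x}
    gap = f(x) + f_star(s) - s * x  # Fenchel-Young gap: zero exactly when s is a subgradient at x
    print(x, gap)  # gap is 0 for every x

# For s not in the subdifferential, the Fenchel-Young inequality is strict:
x, s = 1.0, 2.0
print(f(x) + f_star(s) - s * x)  # 0.5 > 0
```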

Subdifferentials: subgradient fan at kink point, set-valued subdifferential map, and optimality condition


Separation Theorems

The separation theorems are the geometric bedrock of convex optimization. They say, loosely, that convex sets that don’t overlap can be separated by a hyperplane.

Theorem 6 (Separating Hyperplane Theorem).

Let $C$ and $D$ be nonempty, disjoint, convex sets in $\mathbb{R}^n$. Then there exist $a \neq 0$ and $b$ such that:

$$a^T x \leq b \text{ for all } x \in C, \qquad a^T x \geq b \text{ for all } x \in D$$

The hyperplane $\{x : a^T x = b\}$ separates $C$ and $D$.

Proof.

Proof sketch. When $C$ and $D$ are closed and at least one is compact, we can find closest points $c^* \in C$ and $d^* \in D$ minimizing $\|c - d\|$. The separating hyperplane passes through the midpoint $(c^* + d^*)/2$ with normal $a = d^* - c^*$. The key step is showing that $a^T c \leq a^T c^*$ for every $c \in C$ (otherwise we could find a point in $C$ closer to $d^*$, contradicting the minimality of $\|c^* - d^*\|$). The argument for $D$ is symmetric.
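The closest-points construction from the sketch, implemented by brute force for two sampled disjoint disks (the disk centers, sample sizes, and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)

def disk_samples(center, n=200):
    """Sample points in the unit disk around `center` (points outside are projected to the boundary)."""
    p = rng.standard_normal((n, 2))
    p = p / np.maximum(np.linalg.norm(p, axis=1, keepdims=True), 1.0)
    return p + center

C = disk_samples(np.array([-3.0, 0.0]))
D = disk_samples(np.array([3.0, 0.0]))

# Closest pair (c*, d*) by brute force over all sample pairs
diffs = C[:, None, :] - D[None, :, :]
i, j = np.unravel_index(np.argmin((diffs**2).sum(-1)), (len(C), len(D)))
c_star, d_star = C[i], D[j]

# Hyperplane through the midpoint with normal a = d* - c*
a = d_star - c_star
b = a @ (c_star + d_star) / 2

print(np.all(C @ a <= b), np.all(D @ a >= b))  # True True: every sample is on the right side
```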

Theorem 7 (Supporting Hyperplane Theorem).

Let $C$ be a convex set and $x_0 \in \mathrm{bd}(C)$ a boundary point. Then there exists a supporting hyperplane at $x_0$: a hyperplane $\{x : a^T x = b\}$ with $a \neq 0$, $a^T x_0 = b$, and $a^T x \leq b$ for all $x \in C$.

Geometrically: we can always find a hyperplane that is tangent to CC at any boundary point, with the entire set on one side.

These theorems have far-reaching consequences:

  • SVM margin: The separating hyperplane between two classes of labeled data is the maximum-margin classifier.
  • Farkas’ lemma: A direct consequence of separation, and the foundation of LP duality and the KKT conditions.
  • Duality in convex optimization: The entire theory of Lagrangian duality rests on separating the epigraph of the objective from a set defined by the constraints.

Separating hyperplane theorem, supporting hyperplane theorem, and strict separation


Computational Notes

Disciplined Convex Programming (DCP)

Rather than checking convexity analytically, modern solvers like CVXPY use the DCP composition rules to verify convexity syntactically. A problem is DCP-compliant if the objective and constraints can be parsed into a composition tree where every node preserves convexity according to the rules in the table above.

import cvxpy as cp
import numpy as np

# Non-negative Lasso: min ||Ax - b||_2^2 + lambda * ||x||_1  s.t. x >= 0
A = np.random.randn(50, 20)
b = np.random.randn(50)
lam = 0.1

x = cp.Variable(20)
objective = cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm1(x))
constraints = [x >= 0]
problem = cp.Problem(objective, constraints)
problem.solve()

print(f"Optimal value: {problem.value:.4f}")
print(f"Solution sparsity: {np.sum(np.abs(x.value) < 1e-6)} / {x.size} zeros")

CVXPY verifies that sum_squares is convex and accepts an affine argument (so the composition with A @ x - b is convex by the affine rule), that norm1 is convex, and that x >= 0 is a valid convex constraint. If any rule is violated, CVXPY rejects the problem before solving.

Numerical Convexity Verification

For a function given as code rather than a formula, we can probe convexity numerically by sampling the Hessian at random points and verifying $\lambda_{\min}(\nabla^2 f(x)) \geq 0$ — note that sampling can only falsify convexity, never certify it:

# Check convexity by sampling Hessian eigenvalues at random points
import numpy as np

def check_convexity(grad2_f, n_samples=100, dim=5, tol=1e-10):
    """Return True if every sampled Hessian is PSD (up to tol)."""
    for _ in range(n_samples):
        x = np.random.randn(dim)
        H = grad2_f(x)                      # Hessian at x, shape (dim, dim)
        min_eig = np.linalg.eigvalsh(H)[0]  # eigvalsh returns eigenvalues in ascending order
        if min_eig < -tol:
            return False
    return True
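As a self-contained illustration of the sampling idea, using analytic Hessians of a convex quadratic and of a saddle (the helper and both test functions here are illustrative):

```python
import numpy as np

def sampled_hessian_psd(grad2_f, n_samples=100, dim=2, tol=1e-10):
    """True if the Hessian is PSD at every sampled point (evidence, not proof)."""
    rng = np.random.default_rng(0)
    return all(np.linalg.eigvalsh(grad2_f(rng.standard_normal(dim)))[0] >= -tol
               for _ in range(n_samples))

# f(x) = ||x||^2 has constant Hessian 2I (convex)
print(sampled_hessian_psd(lambda x: 2 * np.eye(2)))         # True

# f(x) = x1^2 - x2^2 has constant Hessian diag(2, -2) (not convex)
print(sampled_hessian_psd(lambda x: np.diag([2.0, -2.0])))  # False
```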

Computational convexity: DCP rules table, eigenvalue verification, and CVXPY example


Connections & Further Reading

Convex analysis is the foundation of the Optimization track, connecting backward to spectral theory and forward to gradient methods, proximal operators, and duality theory.

  • The Spectral Theorem — the second-order condition $\nabla^2 f \succeq 0$ (PSD Hessian) is verified through the eigendecomposition guaranteed by the Spectral Theorem. The PSD cone $\mathbb{S}^n_+$ is a fundamental convex set.
  • Singular Value Decomposition — the nuclear norm $\|X\|_* = \sum_i \sigma_i$ is convex, and its subdifferential involves the SVD. Convex relaxations of rank constraints use the nuclear norm ball.
  • PCA & Low-Rank Approximation — PCA solves a non-convex rank-constrained problem. Its convex relaxation, nuclear norm minimization, is a cornerstone of compressed sensing and matrix completion.
  • Gradient Descent & Convergence — convexity guarantees that gradient descent converges to the global minimum; the convergence rate depends on the strong convexity constant.
  • Proximal Methods — subgradients motivate proximal operators: the proximal map of $f$ at $x$ minimizes $f(y) + \frac{1}{2}\|y - x\|^2$, a regularized subgradient step.
  • Lagrangian Duality & KKT — conjugate functions are the engine of Lagrangian duality. The separation theorems lead to Farkas' lemma, which leads to the KKT conditions — the first-order optimality conditions for constrained convex optimization.

The Optimization Track

Convex Analysis is the root of the track: everything that follows either applies its results (gradient descent, proximal methods) or extends its duality machinery (Lagrangian duality, KKT conditions).

Convex Analysis (this topic)
    ├── Gradient Descent & Convergence
    │       └── Proximal Methods
    └── Lagrangian Duality & KKT


References & Further Reading

  • Boyd & Vandenberghe, Convex Optimization (2004) — Chapters 2–3: convex sets and convex functions; the primary reference for this topic
  • Rockafellar, Convex Analysis (1970) — the foundational monograph: conjugate functions, subdifferentials, separation theorems
  • Nesterov, Introductory Lectures on Convex Optimization (2004) — Chapter 2: smooth convex optimization and first-/second-order conditions
  • Hiriart-Urruty & Lemaréchal, Convex Analysis and Minimization Algorithms I (1993) — subdifferential calculus, conjugate duality, and Moreau envelopes
  • Grant, Boyd & Ye, "Disciplined Convex Programming" (2006) — the DCP composition rules that underpin CVXPY and CVX
  • Parikh & Boyd, "Proximal Algorithms" (2014) — Section 2: convex analysis background on subdifferentials, conjugates, and Moreau envelopes