
Lagrangian Duality & KKT Conditions

The Lagrangian, weak and strong duality, Slater's condition, KKT optimality conditions, sensitivity analysis, and applications to SVMs, water-filling, and portfolio optimization

Prerequisites: Convex Analysis

Overview & Motivation

In Convex Analysis we developed the foundations — convex sets, convex functions, subdifferentials, and the separation theorems. In Gradient Descent & Convergence and Proximal Methods we built algorithms for unconstrained optimization (or at least for objectives where constraints could be absorbed into the proximal operator). But many of the most important problems in machine learning and engineering are constrained: minimize an objective subject to explicit limits on the variables.

Consider training a support vector machine. We want the widest-margin separating hyperplane, but we require that every training point lies on the correct side of the boundary. Or consider allocating power across communication channels: we want to maximize total capacity, but we have a fixed total power budget. Or consider building a portfolio: we want maximum expected return, but we can’t exceed a given risk tolerance.

The direct approach — “optimize over the feasible set” — is often intractable. The feasible set might have a complicated geometry, or the constraints might couple variables in ways that make projection expensive. Lagrangian duality offers an alternative: attach a price (a dual variable, or multiplier) to each constraint, fold the constraints into the objective, and solve an unconstrained problem instead. The dual variables play a double role: they enforce the constraints at optimality, and they measure the sensitivity of the optimal value to constraint perturbations.

The punchline is the KKT conditions — four conditions that are necessary and sufficient for optimality in convex programs. When you see a machine learning paper derive a closed-form solution to a constrained problem, the KKT conditions are almost always the tool that gets them there.

What We Cover

  1. The Lagrangian & Dual Problem — the Lagrangian function, dual variables as prices on constraint violations, the dual function as a pointwise infimum, and concavity of the dual.
  2. Weak Duality — the universal inequality $d^* \leq p^*$ and the duality gap.
  3. Strong Duality & Slater’s Condition — when the gap closes to zero, and the geometric picture via the perturbation function.
  4. KKT Conditions — stationarity, primal feasibility, dual feasibility, and complementary slackness; necessity under strong duality and sufficiency for convex programs.
  5. Complementary Slackness — the economic insight: you only pay for constraints that bind.
  6. Saddle Point Interpretation — the minimax theorem and the Lagrangian as a saddle surface.
  7. Sensitivity Analysis — shadow prices, the perturbation function, and the identity $\partial p^*/\partial u_i = -\lambda_i^*$.
  8. Application: The SVM Dual — the hard-margin SVM, the kernel trick as a consequence of duality, and support vectors via complementary slackness.
  9. Application: Water-Filling & Portfolio Optimization — KKT in closed form for channel capacity and the efficient frontier.
  10. Computational Notes — CVXPY with dual extraction, scipy SLSQP, and the barrier method.

The Lagrangian & Dual Problem

We start with the standard form of a constrained convex optimization problem.

Definition 1 (Standard Form Convex Program).

A standard form convex program is

$$\text{minimize} \quad f_0(x) \qquad \text{subject to} \quad f_i(x) \leq 0,\; i = 1, \ldots, m, \qquad h_j(x) = 0,\; j = 1, \ldots, p$$

where $f_0, f_1, \ldots, f_m$ are convex functions and $h_1, \ldots, h_p$ are affine functions ($h_j(x) = a_j^\top x - b_j$). The primal optimal value is $p^* = \inf\{f_0(x) : x \text{ feasible}\}$.

The affineness requirement on the equality constraints is not a minor technicality — it’s what keeps the feasible set convex. A nonlinear equality constraint $h(x) = 0$ generally defines a non-convex surface, which would destroy the convexity of the problem.

Now we introduce the central object: the Lagrangian function. The idea is to relax the hard constraints into a penalty term weighted by dual variables.

Definition 2 (The Lagrangian Function).

The Lagrangian associated with the standard form problem is the function $\mathcal{L} : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ defined by

$$\mathcal{L}(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{p} \nu_j h_j(x)$$

where $\lambda \in \mathbb{R}^m$ with $\lambda_i \geq 0$ are the dual variables (or Lagrange multipliers) for the inequality constraints, and $\nu \in \mathbb{R}^p$ are the dual variables for the equality constraints.

Think of $\lambda_i$ as a price on violating the $i$-th constraint. When $f_i(x) > 0$ (the constraint is violated), the term $\lambda_i f_i(x)$ adds a positive penalty to the objective. When $f_i(x) < 0$ (the constraint has slack), the term is negative — it rewards having extra room. The dual variable $\lambda_i$ controls how aggressively we penalize violations.

Now we minimize the Lagrangian over $x$ to obtain a function of the dual variables alone.

Definition 3 (The Lagrange Dual Function).

The Lagrange dual function $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R} \cup \{-\infty\}$ is

$$g(\lambda, \nu) = \inf_{x} \mathcal{L}(x, \lambda, \nu) = \inf_{x} \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{p} \nu_j h_j(x) \right)$$

The dual problem is

$$\text{maximize} \quad g(\lambda, \nu) \qquad \text{subject to} \quad \lambda \geq 0$$

with dual optimal value $d^* = \sup\{g(\lambda, \nu) : \lambda \geq 0\}$.

The dual function has a remarkable structural property that does not depend on convexity of the primal problem.

Proposition 1 (Concavity of the Dual Function).

The dual function $g(\lambda, \nu)$ is concave, even if the primal problem is not convex. Consequently, the dual problem is always a concave maximization problem.

Proof.

For any fixed $x$, the function $(\lambda, \nu) \mapsto \mathcal{L}(x, \lambda, \nu) = f_0(x) + \sum_i \lambda_i f_i(x) + \sum_j \nu_j h_j(x)$ is affine (and hence concave) in $(\lambda, \nu)$. The dual function $g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu)$ is the pointwise infimum of a family of affine functions. Since the pointwise infimum of any collection of concave functions is concave, $g$ is concave.

This is a powerful result. No matter how ugly the primal problem is — non-convex objective, non-convex constraints, disconnected feasible set — the dual is always a concave maximization. The dual is a “convexification” of the primal, and it always produces useful lower bounds.
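We can observe this numerically. The sketch below picks a deliberately non-convex one-dimensional primal (an arbitrary illustrative choice) and approximates the infimum by a grid minimum, then checks midpoint concavity of the sampled dual:

```python
import numpy as np

# Non-convex primal: minimize f0(x) = x^4 - 3x^2 + x  subject to  x - 1 <= 0.
x_grid = np.linspace(-3.0, 3.0, 4001)
f0 = x_grid**4 - 3 * x_grid**2 + x_grid
f1 = x_grid - 1.0

def g(lam):
    # Dual function: infimum of the Lagrangian, approximated over the grid.
    return np.min(f0 + lam * f1)

lams = np.linspace(0.0, 5.0, 51)

# Midpoint concavity check: g((a+b)/2) >= (g(a)+g(b))/2 for sampled pairs.
mid_ok = all(g((a + b) / 2) >= (g(a) + g(b)) / 2 - 1e-9
             for a in lams[::10] for b in lams[::10])
print(mid_ok)  # True: the dual is concave even though the primal is not
```

The check passes exactly (up to tolerance) because the grid minimum is itself a pointwise minimum of affine functions of $\lambda$.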

A concrete example. Consider minimizing $f_0(x) = (x - 3)^2$ subject to $f_1(x) = x - 1.5 \leq 0$ (i.e., $x \leq 1.5$). The Lagrangian is $\mathcal{L}(x, \lambda) = (x-3)^2 + \lambda(x - 1.5)$. Minimizing over $x$ (take the derivative, set to zero): $2(x-3) + \lambda = 0$, so $x^*(\lambda) = 3 - \lambda/2$. Substituting back:

$$g(\lambda) = (3 - \lambda/2 - 3)^2 + \lambda(3 - \lambda/2 - 1.5) = -\lambda^2/4 + 1.5\lambda$$

This is a concave quadratic in $\lambda$, maximized at $\lambda^* = 3$ with $g(3) = 2.25$. The primal optimum is $f_0(1.5) = (1.5 - 3)^2 = 2.25$, so $p^* = d^* = 2.25$ — strong duality holds.
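A quick numerical sanity check of this example (grid approximations, nothing more):

```python
import numpy as np

# Primal: minimize (x-3)^2 subject to x <= 1.5.
x = np.linspace(-2.0, 6.0, 8001)
f0 = (x - 3.0)**2
p_star = f0[x <= 1.5].min()          # primal optimum over the feasible set

# Dual: g(lambda) = -lambda^2/4 + 1.5*lambda, maximized over lambda >= 0.
lam = np.linspace(0.0, 10.0, 10001)
g = -lam**2 / 4 + 1.5 * lam
d_star = g.max()

print(round(p_star, 4), round(d_star, 4))  # 2.25 2.25 -- zero duality gap
```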


The Lagrangian, dual function, and duality gap for parameterized constrained problems


Weak Duality

The most fundamental inequality in optimization theory requires almost no assumptions.

Theorem 1 (Weak Duality).

For any optimization problem (not necessarily convex), the dual optimal value lower-bounds the primal optimal value:

$$d^* \leq p^*$$

More precisely, for any primal-feasible $\tilde{x}$ and any dual-feasible $(\tilde{\lambda}, \tilde{\nu})$ with $\tilde{\lambda} \geq 0$:

$$g(\tilde{\lambda}, \tilde{\nu}) \leq f_0(\tilde{x})$$

The difference $p^* - d^* \geq 0$ is called the duality gap.

Proof.

Let $\tilde{x}$ be primal-feasible (so $f_i(\tilde{x}) \leq 0$ for all $i$ and $h_j(\tilde{x}) = 0$ for all $j$) and let $(\tilde{\lambda}, \tilde{\nu})$ be dual-feasible (so $\tilde{\lambda} \geq 0$). Then:

$$g(\tilde{\lambda}, \tilde{\nu}) = \inf_{x} \mathcal{L}(x, \tilde{\lambda}, \tilde{\nu}) \leq \mathcal{L}(\tilde{x}, \tilde{\lambda}, \tilde{\nu}) = f_0(\tilde{x}) + \underbrace{\sum_{i} \tilde{\lambda}_i f_i(\tilde{x})}_{\leq 0} + \underbrace{\sum_{j} \tilde{\nu}_j h_j(\tilde{x})}_{= 0} \leq f_0(\tilde{x})$$

The first inequality holds because the infimum over all $x$ is at most the value at a particular $x$. The second inequality uses $\tilde{\lambda}_i \geq 0$ and $f_i(\tilde{x}) \leq 0$ (so each product is non-positive) and $h_j(\tilde{x}) = 0$. Since this holds for all primal-feasible $\tilde{x}$ and dual-feasible $(\tilde{\lambda}, \tilde{\nu})$, taking the supremum over dual variables and infimum over primal variables gives $d^* \leq p^*$.

The proof is elegantly simple — just two inequalities and a sign argument. Note that we never used convexity. Weak duality holds for arbitrary optimization problems: non-convex objectives, non-convex constraints, discrete variables, anything. This universality is what makes the dual function so useful as a bounding tool even when we can’t solve the primal exactly.

The duality gap $p^* - d^*$ measures how much information we lose by passing to the dual. For non-convex problems, the gap can be strictly positive — the dual relaxation is too loose. The central question is: when does the gap close?
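The bounding property is easy to watch in action. A minimal sketch on the same toy problem, using random feasible primal points and random non-negative multipliers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem: minimize (x-3)^2 subject to x - 1.5 <= 0.
# Dual function derived earlier: g(lambda) = -lambda^2/4 + 1.5*lambda.
g = lambda lam: -lam**2 / 4 + 1.5 * lam

x_feas = rng.uniform(-5.0, 1.5, size=1000)   # primal-feasible points
lams = rng.uniform(0.0, 20.0, size=1000)     # dual-feasible multipliers
f0 = (x_feas - 3.0)**2

# Weak duality: every dual value lower-bounds every feasible primal value.
print(g(lams).max() <= f0.min())  # True
```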

Weak duality: the dual provides a family of lower bounds on the primal optimal value


Strong Duality & Slater’s Condition

Strong duality — the statement that $d^* = p^*$ — is the bridge that makes the dual problem as powerful as the primal. It holds for convex problems under a mild regularity condition.

Definition 4 (Slater's Condition).

A convex program satisfies Slater’s condition (or is Slater-qualified) if there exists a point $\bar{x}$ in the relative interior of the problem domain such that

$$f_i(\bar{x}) < 0 \quad \text{for all } i = 1, \ldots, m \qquad \text{and} \qquad h_j(\bar{x}) = 0 \quad \text{for all } j = 1, \ldots, p$$

That is, there exists a strictly feasible point — one that satisfies all inequality constraints with strict inequality.

Slater’s condition is saying that the feasible set has a non-empty interior (it’s not “infinitely thin”). The inequality constraints must have room to breathe — you can’t have the feasible set consist entirely of points where some constraint is active. The equality constraints, being affine, must still be satisfied exactly.

Theorem 2 (Strong Duality (Slater)).

If a convex program satisfies Slater’s condition and the primal optimal value $p^*$ is finite, then strong duality holds:

$$d^* = p^*$$

Moreover, the dual optimum is attained: there exist $\lambda^* \geq 0$ and $\nu^*$ such that $g(\lambda^*, \nu^*) = d^* = p^*$.

Proof.

We prove this via the separating hyperplane theorem applied to the perturbation set. Define the set

$$\mathcal{G} = \{(u, v, t) \in \mathbb{R}^m \times \mathbb{R}^p \times \mathbb{R} : \exists\, x \text{ with } f_i(x) \leq u_i,\; h_j(x) = v_j,\; f_0(x) \leq t\}$$

This is a convex set (because $f_0, f_i$ are convex and $h_j$ are affine). The primal optimal value corresponds to the point $(0, 0, p^*)$: the best objective value achievable when no constraint is perturbed.

Consider the point $(0, 0, p^* - \epsilon)$ for any $\epsilon > 0$. This point lies outside $\mathcal{G}$ (because $p^*$ is optimal — we can’t do better). By the supporting hyperplane theorem (from Convex Analysis), there exists a hyperplane separating $(0, 0, p^* - \epsilon)$ from $\mathcal{G}$. Taking $\epsilon \to 0$, we obtain a supporting hyperplane at $(0, 0, p^*)$.

The normal to this hyperplane gives us the dual variables: $(\lambda^*, \nu^*, \alpha)$ with $\lambda^* \geq 0$ and $\alpha \geq 0$. The separating hyperplane condition gives, for all $(u, v, t) \in \mathcal{G}$:

$$\lambda^{*\top} u + \nu^{*\top} v + \alpha t \geq \lambda^{*\top} \cdot 0 + \nu^{*\top} \cdot 0 + \alpha \cdot p^* = \alpha p^*$$

Slater’s condition ensures $\alpha > 0$ (the hyperplane is not vertical — it has a non-trivial component in the $t$-direction). Dividing by $\alpha$ and rearranging, this gives $g(\lambda^*/\alpha, \nu^*/\alpha) \geq p^*$. Combined with weak duality $g \leq p^*$, we conclude $d^* = p^*$.

When Slater fails. If the problem has no strictly feasible point, the separating hyperplane might be vertical ($\alpha = 0$), and the dual may not recover the primal value. A classic example: minimize $e^{-x}$ subject to $x^2/y \leq 0$ over the domain $\{(x, y) : y > 0\}$. The constraint forces $x = 0$, so $p^* = e^0 = 1$. The dual function is $g(\lambda) = \inf_{x,\, y > 0} \left( e^{-x} + \lambda x^2/y \right)$. For any $\lambda \geq 0$, both terms are non-negative, and sending $x \to +\infty$ with $y$ growing faster (say $y = x^3$) drives both to zero, so $g(\lambda) = 0$ for all $\lambda \geq 0$. Hence $d^* = 0 < 1 = p^*$ — a positive duality gap, because no point satisfies the inequality constraint strictly.

LP strong duality. For linear programs, strong duality holds without Slater’s condition — the LP duality theorem guarantees $d^* = p^*$ whenever the primal is feasible and bounded. This follows from the polyhedral structure of the feasible set (via Farkas’ lemma) rather than from any strict-feasibility argument.
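This is easy to witness with `scipy.optimize.linprog` on a small illustrative LP (the dual below is built by hand for this particular primal, with the sign conventions of the min/≤/non-negative form):

```python
import numpy as np
from scipy.optimize import linprog

# Primal LP: minimize c^T x  subject to  A x <= b, x >= 0.
c = np.array([-3.0, -5.0])
A = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 2.0]])
b = np.array([4.0, 12.0, 18.0])
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2, method="highs")

# Hand-built dual: maximize b^T y  subject to  A^T y <= c, y <= 0,
# written as a minimization of -b^T y for linprog.
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, 0)] * 3, method="highs")

p_star, d_star = primal.fun, -dual.fun
print(round(p_star, 4), round(d_star, 4))  # -36.0 -36.0: zero gap
```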

Strong duality and Slater's condition: strict interior guarantees zero duality gap


KKT Conditions

The Karush–Kuhn–Tucker conditions are the first-order necessary and sufficient conditions for optimality in convex programs with strong duality. They are the constrained analogue of the unconstrained condition $\nabla f(x^*) = 0$.

Definition 5 (Karush–Kuhn–Tucker (KKT) Conditions).

A primal-dual triple $(x^*, \lambda^*, \nu^*)$ satisfies the KKT conditions for the standard form convex program if all four of the following hold:

  1. Stationarity: $\nabla f_0(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla f_i(x^*) + \sum_{j=1}^{p} \nu_j^* \nabla h_j(x^*) = 0$
  2. Primal feasibility: $f_i(x^*) \leq 0$ for all $i$ and $h_j(x^*) = 0$ for all $j$
  3. Dual feasibility: $\lambda_i^* \geq 0$ for all $i$
  4. Complementary slackness: $\lambda_i^* f_i(x^*) = 0$ for all $i$

Stationarity says that the gradient of the Lagrangian with respect to $x$ vanishes at the optimum — the objective gradient is cancelled by a non-negative combination of the constraint gradients. Geometrically, $-\nabla f_0(x^*)$ lies in the cone spanned by the active constraint gradients $\{\nabla f_i(x^*) : f_i(x^*) = 0\}$.
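The four conditions translate directly into a numerical check. Below is a small sketch (the problem, tolerance, and candidate point are illustrative choices) that verifies a hand-derived triple for minimizing $(x_1-3)^2 + (x_2-2)^2$ subject to $x_1 + x_2 \leq 3$:

```python
import numpy as np

def kkt_check(x, lam, grad_f0, constraints, tol=1e-6):
    """Check the four KKT conditions for an inequality-constrained problem.

    constraints: list of (f_i, grad_f_i) callables; lam: multipliers.
    """
    grad_L = grad_f0(x) + sum(l * g(x) for l, (_, g) in zip(lam, constraints))
    f_vals = np.array([f(x) for f, _ in constraints])
    lam = np.asarray(lam)
    return {
        "stationarity": np.linalg.norm(grad_L) < tol,
        "primal_feasibility": bool(np.all(f_vals <= tol)),
        "dual_feasibility": bool(np.all(lam >= -tol)),
        "complementary_slackness": bool(np.all(np.abs(lam * f_vals) < tol)),
    }

# Toy problem: minimize (x1-3)^2 + (x2-2)^2  subject to  x1 + x2 <= 3.
grad_f0 = lambda x: np.array([2 * (x[0] - 3), 2 * (x[1] - 2)])
cons = [(lambda x: x[0] + x[1] - 3, lambda x: np.array([1.0, 1.0]))]

# Candidate optimum x* = (2, 1) with multiplier lambda* = 2.
result = kkt_check(np.array([2.0, 1.0]), [2.0], grad_f0, cons)
print(all(result.values()))  # True: (x*, lambda*) is KKT, hence optimal
```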

Theorem 3 (KKT Necessary Conditions).

If strong duality holds (i.e., $p^* = d^*$ and the dual optimum is attained), and $x^*$ is primal-optimal and $(\lambda^*, \nu^*)$ is dual-optimal, then $(x^*, \lambda^*, \nu^*)$ satisfies the KKT conditions.

Proof.

Since $x^*$ is primal-optimal and $(\lambda^*, \nu^*)$ is dual-optimal, and strong duality holds:

$$f_0(x^*) = p^* = d^* = g(\lambda^*, \nu^*) = \inf_x \mathcal{L}(x, \lambda^*, \nu^*)$$

We also know (from the weak duality proof) that

$$g(\lambda^*, \nu^*) \leq \mathcal{L}(x^*, \lambda^*, \nu^*) = f_0(x^*) + \sum_i \lambda_i^* f_i(x^*) + \sum_j \nu_j^* h_j(x^*) \leq f_0(x^*)$$

The last inequality uses $\lambda_i^* \geq 0$, $f_i(x^*) \leq 0$, and $h_j(x^*) = 0$. Since the left and right sides are equal (both equal $p^* = d^*$), every inequality in the chain must hold with equality.

Complementary slackness: The equality $\sum_i \lambda_i^* f_i(x^*) = 0$ with each term $\lambda_i^* f_i(x^*) \leq 0$ forces $\lambda_i^* f_i(x^*) = 0$ for each $i$.

Stationarity: The equality $\inf_x \mathcal{L}(x, \lambda^*, \nu^*) = \mathcal{L}(x^*, \lambda^*, \nu^*)$ means $x^*$ minimizes $\mathcal{L}(\cdot, \lambda^*, \nu^*)$. Since the Lagrangian is convex in $x$ (a sum of convex functions weighted by non-negative coefficients plus affine terms), the first-order condition gives $\nabla_x \mathcal{L}(x^*, \lambda^*, \nu^*) = 0$, which is exactly the stationarity condition.

Primal and dual feasibility hold by assumption ($x^*$ is primal-feasible, $\lambda^* \geq 0$).

The converse is even more powerful for convex problems.

Theorem 4 (KKT Sufficient for Convex Programs).

For a convex program, if $(x^*, \lambda^*, \nu^*)$ satisfies the KKT conditions, then $x^*$ is primal-optimal, $(\lambda^*, \nu^*)$ is dual-optimal, and strong duality holds with zero duality gap.

Proof.

Suppose the KKT conditions hold. Stationarity means $x^*$ minimizes $\mathcal{L}(x, \lambda^*, \nu^*)$ over $x$ (since for a convex function, $\nabla f = 0$ implies global minimum). Therefore:

$$g(\lambda^*, \nu^*) = \inf_x \mathcal{L}(x, \lambda^*, \nu^*) = \mathcal{L}(x^*, \lambda^*, \nu^*) = f_0(x^*) + \underbrace{\sum_i \lambda_i^* f_i(x^*)}_{= 0 \text{ (comp. slack.)}} + \underbrace{\sum_j \nu_j^* h_j(x^*)}_{= 0 \text{ (primal feas.)}} = f_0(x^*)$$

So $g(\lambda^*, \nu^*) = f_0(x^*)$. By weak duality, $d^* \leq p^* \leq f_0(x^*)$. But $d^* \geq g(\lambda^*, \nu^*) = f_0(x^*)$. Therefore $d^* = p^* = f_0(x^*)$: strong duality holds, $x^*$ is primal-optimal, and $(\lambda^*, \nu^*)$ is dual-optimal.

Together, Theorems 3 and 4 give the complete picture for convex programs with Slater’s condition: the KKT conditions are necessary and sufficient for optimality. This is the constrained analogue of “set the gradient to zero and solve.”


KKT conditions: stationarity, complementary slackness, and sufficiency for convex programs


Complementary Slackness

Complementary slackness deserves special attention because it is the condition that does the most work in applications.

Theorem 5 (Complementary Slackness).

If strong duality holds and $(x^*, \lambda^*, \nu^*)$ is a primal-dual optimal pair, then for each inequality constraint $i$:

$$\lambda_i^* f_i(x^*) = 0$$

That is, either $\lambda_i^* = 0$ (the multiplier is zero — the constraint is “free”) or $f_i(x^*) = 0$ (the constraint is active — it holds with equality). Both can hold simultaneously, but we can never have $\lambda_i^* > 0$ together with $f_i(x^*) < 0$.

Proof.

This was established in the proof of Theorem 3. The chain of equalities $g(\lambda^*, \nu^*) = \mathcal{L}(x^*, \lambda^*, \nu^*) = f_0(x^*)$ forces $\sum_i \lambda_i^* f_i(x^*) = 0$. Since each term satisfies $\lambda_i^* \geq 0$ and $f_i(x^*) \leq 0$, each product $\lambda_i^* f_i(x^*) \leq 0$. A sum of non-positive terms equals zero only if each term is zero.

The economic interpretation is clean: $\lambda_i^*$ is the price you’re paying for constraint $i$. If the constraint has slack ($f_i(x^*) < 0$ — you have unused capacity), the price is zero — you wouldn’t pay anything to relax a constraint that isn’t binding. If the constraint is active ($f_i(x^*) = 0$ — you’re at the limit), then the multiplier $\lambda_i^* \geq 0$ tells you how much the optimal value would improve per unit of relaxation.

This insight is what identifies the support vectors in an SVM: the data points where the margin constraint is active ($\alpha_i^* > 0$) are exactly the ones that determine the decision boundary. All other points have $\alpha_i^* = 0$ and can be removed without changing the solution.


Saddle Point Interpretation

The KKT conditions have an elegant reformulation in terms of saddle points of the Lagrangian.

Definition 6 (Saddle Point of the Lagrangian).

A point $(x^*, \lambda^*, \nu^*)$ with $\lambda^* \geq 0$ is a saddle point of the Lagrangian if

$$\mathcal{L}(x^*, \lambda, \nu) \leq \mathcal{L}(x^*, \lambda^*, \nu^*) \leq \mathcal{L}(x, \lambda^*, \nu^*)$$

for all $x$, all $\lambda \geq 0$, and all $\nu$. That is, $x^*$ minimizes $\mathcal{L}$ over $x$ and $(\lambda^*, \nu^*)$ maximizes it over the dual variables.

The Lagrangian is a “game” between two players. The primal player chooses $x$ to minimize $\mathcal{L}$; the dual player chooses $(\lambda, \nu)$ to maximize it. At a saddle point, neither player can improve their position unilaterally. This is the minimax interpretation:

$$\max_{\lambda \geq 0, \nu} \min_{x} \mathcal{L}(x, \lambda, \nu) = \min_{x} \max_{\lambda \geq 0, \nu} \mathcal{L}(x, \lambda, \nu)$$

When strong duality holds, the max and min commute — the order of play doesn’t matter. This is the minimax theorem applied to the Lagrangian.
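For the earlier 1-D example, both orders of play can be computed on a grid (a discretized illustration of the identity, not a proof; the grid bounds are arbitrary choices):

```python
import numpy as np

# Lagrangian of: minimize (x-3)^2 subject to x - 1.5 <= 0.
xs = np.linspace(-2.0, 6.0, 1601)
lams = np.linspace(0.0, 10.0, 2001)
L = (xs[:, None] - 3.0)**2 + lams[None, :] * (xs[:, None] - 1.5)

max_min = L.min(axis=0).max()   # dual player commits first: max_lam min_x
min_max = L.max(axis=1).min()   # primal player commits first: min_x max_lam
print(round(max_min, 2), round(min_max, 2))  # 2.25 2.25 -- the orders agree
```

Both values land on the saddle value $\mathcal{L}(1.5, 3) = 2.25$, matching the primal and dual optima computed earlier.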

Theorem 6 (Saddle Point Equivalence).

For a convex program, $(x^*, \lambda^*, \nu^*)$ is a saddle point of $\mathcal{L}$ if and only if $x^*$ is primal-optimal, $(\lambda^*, \nu^*)$ is dual-optimal, and strong duality holds.

Proof.

($\Rightarrow$) Suppose $(x^*, \lambda^*, \nu^*)$ is a saddle point. The right inequality $\mathcal{L}(x^*, \lambda^*, \nu^*) \leq \mathcal{L}(x, \lambda^*, \nu^*)$ for all $x$ means $x^*$ minimizes $\mathcal{L}(\cdot, \lambda^*, \nu^*)$, so $g(\lambda^*, \nu^*) = \mathcal{L}(x^*, \lambda^*, \nu^*)$.

The left inequality says $\sup_{\lambda \geq 0,\, \nu} \mathcal{L}(x^*, \lambda, \nu) \leq \mathcal{L}(x^*, \lambda^*, \nu^*)$. If $x^*$ were infeasible — some $f_i(x^*) > 0$ or $h_j(x^*) \neq 0$ — this supremum would be $+\infty$ (send the corresponding multiplier to $\pm\infty$), a contradiction; so $x^*$ is primal-feasible. Taking $\lambda = 0, \nu = 0$ gives $f_0(x^*) \leq \mathcal{L}(x^*, \lambda^*, \nu^*)$, while feasibility and $\lambda^* \geq 0$ give $\mathcal{L}(x^*, \lambda^*, \nu^*) \leq f_0(x^*)$. Hence $f_0(x^*) = \mathcal{L}(x^*, \lambda^*, \nu^*) = g(\lambda^*, \nu^*)$, giving zero duality gap.

($\Leftarrow$) If $x^*$ and $(\lambda^*, \nu^*)$ are primal and dual optimal with zero gap, then KKT holds (Theorem 3). Stationarity together with convexity in $x$ gives the right inequality; the left inequality follows because for feasible $x^*$, $\mathcal{L}(x^*, \lambda, \nu) = f_0(x^*) + \sum_i \lambda_i f_i(x^*) + \sum_j \nu_j h_j(x^*)$ is maximized over $\lambda \geq 0, \nu$ at $(\lambda^*, \nu^*)$ by complementary slackness.


The Lagrangian as a saddle surface: minimize over x, maximize over λ


Sensitivity Analysis

Dual variables are not just abstract quantities needed to state the KKT conditions — they have a concrete interpretation as shadow prices that measure how sensitive the optimal value is to constraint perturbations.

Theorem 7 (Sensitivity via Dual Variables).

Consider the perturbed problem: minimize $f_0(x)$ subject to $f_i(x) \leq u_i$ and $h_j(x) = v_j$, with optimal value $p^*(u, v)$. If strong duality holds, the dual optimum $(\lambda^*, \nu^*)$ is attained for the unperturbed problem ($u = 0, v = 0$), and $p^*(u, v)$ is differentiable at the origin, then

$$\frac{\partial p^*}{\partial u_i}\bigg|_{u=0} = -\lambda_i^* \qquad \text{and} \qquad \frac{\partial p^*}{\partial v_j}\bigg|_{v=0} = -\nu_j^*$$

That is, the optimal dual variable $\lambda_i^*$ is the shadow price of constraint $i$: it measures the rate at which the optimal value decreases when the constraint is relaxed.

Proof.

The dual function for the perturbed problem is

$$g_u(\lambda, \nu) = \inf_x \left[ f_0(x) + \sum_i \lambda_i(f_i(x) - u_i) + \sum_j \nu_j(h_j(x) - v_j) \right] = g(\lambda, \nu) - \lambda^\top u - \nu^\top v$$

By strong duality for the perturbed problem (assuming Slater holds for small perturbations):

$$p^*(u, v) = \max_{\lambda \geq 0, \nu} g_u(\lambda, \nu) = \max_{\lambda \geq 0, \nu} \left[ g(\lambda, \nu) - \lambda^\top u - \nu^\top v \right]$$

At $u = 0, v = 0$: $p^*(0, 0) = g(\lambda^*, \nu^*)$, and the maximizer is $(\lambda^*, \nu^*)$. By the envelope theorem (the derivative of the optimum with respect to a parameter equals the partial derivative of the objective at the optimal point):

$$\frac{\partial p^*}{\partial u_i}\bigg|_{u=0} = \frac{\partial}{\partial u_i}\left[g(\lambda^*, \nu^*) - (\lambda^*)^\top u - (\nu^*)^\top v\right]_{u=0} = -\lambda_i^*$$

The same argument gives $\partial p^*/\partial v_j = -\nu_j^*$.

The sign makes economic sense. A positive $\lambda_i^*$ means the constraint $f_i(x) \leq 0$ is active and costly. Increasing $u_i$ (relaxing the constraint to $f_i(x) \leq u_i$) gives more room, which decreases the optimal objective by approximately $\lambda_i^*$ per unit of relaxation. A large shadow price means the constraint is a bottleneck — relaxing it yields large improvements.

Resource allocation example. Suppose a factory maximizes profit subject to material constraints. If the shadow price of steel is $7.50/kg and the shadow price of copper is $0/kg, then buying one more kilogram of steel improves profit by $7.50, while copper has slack — the current supply isn’t a binding limit. This tells the manager exactly where to invest.
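Returning to the toy problem from earlier (where $\lambda^* = 3$), the shadow-price identity can be verified by perturbing the constraint directly:

```python
# Perturbed problem: minimize (x-3)^2 subject to x <= 1.5 + u.
# For small u the constraint stays active, so x*(u) = 1.5 + u in closed form.
p_star = lambda u: (1.5 + u - 3.0)**2

# Central finite difference of the perturbation function at u = 0.
h = 1e-5
slope = (p_star(h) - p_star(-h)) / (2 * h)
print(round(slope, 6))  # -3.0 = -lambda*, matching Theorem 7
```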


Sensitivity analysis: shadow prices and the perturbation function


Application: The SVM Dual

The hard-margin support vector machine is perhaps the most celebrated application of Lagrangian duality in machine learning. The dual formulation reveals the kernel trick, identifies the support vectors via complementary slackness, and reduces the problem to a tractable QP.

The primal. Given linearly separable training data $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, the hard-margin SVM solves:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \qquad \text{subject to} \quad y_i(w^\top x_i + b) \geq 1,\; i = 1, \ldots, n$$

The objective minimizes $\|w\|^2$, which maximizes the margin $2/\|w\|$. The constraints require each point to lie on the correct side of the margin.

The Lagrangian. Introducing multipliers $\alpha_i \geq 0$ for each constraint:

$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^\top x_i + b) - 1 \right]$$

KKT stationarity. Setting the gradient of $\mathcal{L}$ with respect to $w$ and $b$ to zero:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \implies w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i$$

$$\frac{\partial \mathcal{L}}{\partial b} = 0 \implies \sum_{i=1}^{n} \alpha_i^* y_i = 0$$

The first equation is remarkable: the optimal weight vector is a linear combination of the training points, weighted by the dual variables. Only points with $\alpha_i^* > 0$ contribute — these are the support vectors.

The dual QP. Substituting the stationarity conditions back into the Lagrangian:

$$\text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i^\top x_j) \qquad \text{subject to} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0$$

Remark (Why the SVM Dual Reveals the Kernel Trick).

The dual objective depends on the data only through the inner products $x_i^\top x_j$. This means we can replace $x_i^\top x_j$ with a kernel function $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ — computing inner products in a high-dimensional (even infinite-dimensional) feature space without ever explicitly mapping the data. This is the kernel trick, and it is a direct consequence of the dual formulation. The primal objective $\frac{1}{2}\|w\|^2$ depends on $w$ explicitly, which lives in the feature space — the kernel trick would not be visible from the primal alone.
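Concretely, kernelizing the dual amounts to swapping out the Gram matrix. The sketch below builds an RBF Gram matrix (the bandwidth `gamma` is an arbitrary illustrative choice) that could replace `X @ X.T` without touching anything else in the dual QP:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_gram(X, gamma=0.5)
# Linear dual:      Q = np.outer(y, y) * (X @ X.T)
# Kernelized dual:  Q = np.outer(y, y) * K   <- the only change needed
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))  # True True
```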

Complementary slackness identifies support vectors. By complementary slackness, $\alpha_i^*(y_i(w^{*\top} x_i + b^*) - 1) = 0$ for each $i$. Either $\alpha_i^* = 0$ (the point doesn’t contribute to $w^*$) or $y_i(w^{*\top} x_i + b^*) = 1$ (the point lies exactly on the margin). Points with $\alpha_i^* > 0$ are the support vectors — they sit on the margin boundary and fully determine the classifier. All other points could be removed without changing the solution.

import numpy as np
from scipy.optimize import minimize as sp_minimize

def svm_dual(X, y):
    """Solve the hard-margin SVM dual via scipy."""
    n = X.shape[0]
    Q = np.outer(y, y) * (X @ X.T)  # Gram matrix

    def neg_dual(alpha):
        return -np.sum(alpha) + 0.5 * alpha @ Q @ alpha

    def neg_dual_grad(alpha):
        return -np.ones(n) + Q @ alpha

    constraints = [{'type': 'eq', 'fun': lambda a: a @ y}]
    bounds = [(0, None)] * n
    result = sp_minimize(neg_dual, np.zeros(n), jac=neg_dual_grad,
                         bounds=bounds, constraints=constraints, method='SLSQP')
    alpha = result.x

    # Recover w* = sum alpha_i y_i x_i
    w = (alpha * y) @ X
    # Recover b* from a support vector
    sv = np.where(alpha > 1e-6)[0]
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b, alpha


The SVM dual: decision boundary, support vectors, and complementary slackness


Application: Water-Filling & Portfolio Optimization

Two more applications where the KKT conditions yield closed-form or near-closed-form solutions.

Water-Filling for Channel Capacity

In information theory, the water-filling problem allocates power across nn parallel channels to maximize total capacity:

$$\text{maximize} \quad \sum_{i=1}^{n} \log(1 + x_i / \alpha_i) \qquad \text{subject to} \quad \sum_{i=1}^{n} x_i = P,\; x_i \geq 0$$

where $\alpha_i > 0$ is the noise level of channel $i$, $x_i$ is the power allocated to channel $i$, and $P$ is the total power budget.

Remark (Water-Filling as KKT in Closed Form).

Writing the KKT stationarity condition for the Lagrangian $\mathcal{L} = \sum_i \log(1 + x_i/\alpha_i) - \nu(\sum_i x_i - P) + \sum_i \mu_i x_i$ and applying complementary slackness ($\mu_i x_i = 0$), we get the water-filling solution:

$$x_i^* = \max\!\left(0,\; \frac{1}{\nu^*} - \alpha_i\right)$$

The dual variable $\nu^*$ (whose reciprocal is the “water level”) is determined by the budget constraint $\sum_i x_i^* = P$. The name comes from the visualization: imagine pouring water (power) over an irregular terrain (the noise levels $\alpha_i$); the water settles to a uniform level $1/\nu^*$, filling the noisiest channels last.

import numpy as np
from scipy.optimize import brentq

def water_filling(alphas, P_total):
    """Water-filling solution via bisection for the water level 1/nu*."""
    alphas = np.asarray(alphas, dtype=float)

    def budget_residual(nu_inv):
        alloc = np.maximum(0, nu_inv - alphas)
        return np.sum(alloc) - P_total

    # Bracket the root: the residual is -P_total at min(alphas) (no channel
    # active yet) and nonnegative at max(alphas) + P_total, so brentq applies.
    nu_inv_star = brentq(budget_residual, np.min(alphas), np.max(alphas) + P_total)
    x_star = np.maximum(0, nu_inv_star - alphas)
    return x_star, 1.0 / nu_inv_star
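
As a sanity check, the water level can also be computed exactly (no bisection) by the standard sort-based method: try activating the $k$ quietest channels and accept the first level consistent with that choice. The noise levels and budget below are hypothetical, chosen only to make the KKT structure visible.

```python
import numpy as np

# Exact sort-based water level: candidate levels activate the k quietest channels.
def water_level(alphas, P):
    a = np.sort(np.asarray(alphas, dtype=float))
    for k in range(len(a), 0, -1):
        level = (P + a[:k].sum()) / k   # level if exactly k channels are active
        if level > a[k - 1]:            # consistent: all k channels get positive power
            return level
    raise ValueError("empty allocation")

alphas = np.array([1.0, 2.0, 4.0])      # hypothetical noise levels
P = 4.0                                 # hypothetical power budget
level = water_level(alphas, P)          # water level 1/nu* = 3.5
x = np.maximum(0.0, level - alphas)     # allocation [2.5, 1.5, 0.0]
assert abs(x.sum() - P) < 1e-12         # primal feasibility: budget exactly met
```

The noisiest channel ($\alpha_3 = 4$) sits above the water level and receives no power; by stationarity its multiplier $\mu_3 = \nu^* - 1/\alpha_3 > 0$ is strictly positive, which is exactly the complementary-slackness pattern.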

Mean-Variance Portfolio Optimization

In Markowitz portfolio theory, we trace the efficient frontier by solving a family of constrained problems:

$$\text{minimize} \quad \frac{1}{2} w^\top \Sigma w \qquad \text{subject to} \quad \mu^\top w \geq r_{\min},\; \mathbf{1}^\top w = 1,\; w \geq 0$$

where $\Sigma$ is the return covariance matrix, $\mu$ is the expected return vector, and $r_{\min}$ is the minimum acceptable return. The dual variable $\lambda^*$ for the return constraint is the risk-return tradeoff price: it tells you how much additional variance you must accept per unit of additional expected return.

By varying $r_{\min}$ from the minimum-variance portfolio to the maximum-return portfolio and solving the KKT conditions at each point, we trace the efficient frontier — the Pareto-optimal set of portfolios. The shadow price $\lambda^*$ is the slope of the efficient frontier at each point.
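
The slope interpretation can be checked numerically. The NumPy-only sketch below uses hypothetical two-asset data and assumes the return constraint is active while the nonnegativity constraints are slack, so the KKT conditions reduce to a linear system in $(w, \lambda, \nu)$; the envelope theorem then says $\lambda^*$ equals the finite-difference slope of the optimal value in $r_{\min}$.

```python
import numpy as np

# Hypothetical two-asset data (Sigma, mu chosen for illustration only).
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])
mu = np.array([0.08, 0.12])

def solve_kkt(r_min):
    """Solve the KKT system with the return constraint active:
       Sigma w - lam*mu - nu*1 = 0,  mu^T w = r_min,  1^T w = 1."""
    n = len(mu)
    K = np.zeros((n + 2, n + 2))
    K[:n, :n] = Sigma
    K[:n, n] = -mu
    K[:n, n + 1] = -1.0
    K[n, :n] = mu
    K[n + 1, :n] = 1.0
    rhs = np.concatenate([np.zeros(n), [r_min, 1.0]])
    sol = np.linalg.solve(K, rhs)
    w, lam = sol[:n], sol[n]
    return w, lam, 0.5 * w @ Sigma @ w

w, lam, var = solve_kkt(0.10)           # w = [0.5, 0.5], lam = 0.625

# Envelope theorem: lam is the slope d f* / d r_min of the frontier.
eps = 1e-6
slope = (solve_kkt(0.10 + eps)[2] - solve_kkt(0.10 - eps)[2]) / (2 * eps)
```

At this point both weights are positive, so dropping the $w \geq 0$ constraints is legitimate (they are inactive and their multipliers vanish by complementary slackness).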

Water-filling, portfolio efficient frontier, and portfolio weights along the frontier


Computational Notes

CVXPY with Automatic Dual Extraction

CVXPY is a Python-embedded modeling language for disciplined convex programming. It automatically solves the dual alongside the primal and exposes the dual variables directly.

import cvxpy as cp

x = cp.Variable(2)
objective = cp.Minimize((x[0] - 3)**2 + (x[1] - 2)**2)
constraints = [
    x[0] + x[1] <= 3,     # Resource constraint
    x[0] >= 0, x[1] >= 0  # Non-negativity
]
prob = cp.Problem(objective, constraints)
prob.solve()

print(f"Optimal value: {prob.value:.4f}")
print(f"Optimal x: {x.value}")
for i, c in enumerate(constraints):
    print(f"Constraint {i} dual value (shadow price): {c.dual_value:.4f}")

The .dual_value attribute gives $\lambda_i^*$ directly. This is how you extract shadow prices in practice — no manual KKT derivation needed.

scipy.optimize with SLSQP

For problems not in disciplined convex form, scipy.optimize.minimize with method='SLSQP' handles nonlinear constraints:

from scipy.optimize import minimize

result = minimize(
    fun=lambda x: (x[0] - 3)**2 + (x[1] - 2)**2,
    x0=[0.0, 0.0],
    method='SLSQP',
    constraints=[
        {'type': 'ineq', 'fun': lambda x: 3 - x[0] - x[1]},  # x0 + x1 <= 3
    ],
    bounds=[(0, None), (0, None)]
)
print(f"Optimal value: {result.fun:.6f}")
print(f"Optimal x: {result.x}")

Interior-Point / Barrier Methods

Interior-point methods approach the constrained optimum from the interior of the feasible set by adding a logarithmic barrier that blows up at the constraint boundary:

$$\min_x \; t \cdot f_0(x) - \sum_{i=1}^{m} \log(-f_i(x))$$

As the barrier parameter $t \to \infty$, the solution traces the central path toward the true constrained optimum. At each point on the central path, the barrier Hessian system is a perturbed KKT system — interior-point methods are solving KKT conditions with a controlled perturbation. The convergence rate is typically $O(\sqrt{m}\log(1/\epsilon))$ Newton steps, where $m$ is the number of constraints.

def barrier_method(f0, grad_f0, constraints, x0, t_init=1, mu=10, tol=1e-8):
    """Conceptual barrier method — solve a sequence of unconstrained problems."""
    x = x0.copy()
    t = t_init
    while True:
        # Centering step: approximately minimize t*f0(x) - sum_i log(-f_i(x)).
        # (Fixed-step gradient descent for illustration; real solvers use Newton.)
        for _ in range(100):
            barrier_grad = sum(-1.0 / c['fun'](x) * c['grad'](x) for c in constraints)
            total_grad = t * grad_f0(x) + barrier_grad
            x = x - 0.01 * total_grad

        # Stop once the duality-gap bound m/t is below tolerance; else increase t
        if len(constraints) / t < tol:
            break
        t *= mu

    return x

Constraint diagnostics. After solving, verify the KKT conditions numerically:

  1. Stationarity residual: $\|\nabla f_0(x^*) + \sum_i \lambda_i^* \nabla f_i(x^*)\|$ should be near zero.
  2. Complementary slackness: $|\lambda_i^* f_i(x^*)|$ should be near zero for each $i$.
  3. Primal feasibility: $\max_i f_i(x^*)$ should be $\leq 0$ (within tolerance).
  4. Dual feasibility: $\min_i \lambda_i^*$ should be $\geq 0$.
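
The four checks translate directly into code. The sketch below (the helper name is mine, not a library API) runs them on the problem from the CVXPY snippet above, whose optimum $x^* = (2, 1)$ with $\lambda_1^* = 2$ follows from projecting $(3, 2)$ onto the halfspace $x_0 + x_1 \leq 3$:

```python
import numpy as np

def kkt_residuals(x, lam, grad_f0, cons):
    """cons is a list of (f_i, grad_f_i) pairs for inequality constraints f_i(x) <= 0."""
    stat = np.linalg.norm(grad_f0(x) + sum(l * g(x) for l, (_, g) in zip(lam, cons)))
    comp = max(abs(l * f(x)) for l, (f, _) in zip(lam, cons))
    primal = max(f(x) for f, _ in cons)
    dual = min(lam)
    return stat, comp, primal, dual

# min (x0-3)^2 + (x1-2)^2  s.t.  x0 + x1 - 3 <= 0, -x0 <= 0, -x1 <= 0
x_star = np.array([2.0, 1.0])       # projection of (3, 2) onto the halfspace
lam_star = [2.0, 0.0, 0.0]          # lambda_1 from stationarity; others inactive
grad_f0 = lambda x: 2 * (x - np.array([3.0, 2.0]))
cons = [
    (lambda x: x[0] + x[1] - 3, lambda x: np.array([1.0, 1.0])),
    (lambda x: -x[0],           lambda x: np.array([-1.0, 0.0])),
    (lambda x: -x[1],           lambda x: np.array([0.0, -1.0])),
]
stat, comp, primal, dual = kkt_residuals(x_star, lam_star, grad_f0, cons)
# stat and comp vanish, primal <= 0, dual >= 0: all four conditions hold
```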

Computational notes: CVXPY, scipy, and the barrier method convergence


Connections & Further Reading

Lagrangian duality is the theoretical core of constrained optimization, connecting the convex analysis foundations to the algorithmic methods in the rest of the Optimization track and beyond.

| Topic | Connection |
| --- | --- |
| Convex Analysis | Every result here rests on convex analysis. Weak duality follows from the infimum structure; strong duality uses the separating hyperplane theorem; KKT stationarity extends the subdifferential condition $0 \in \partial f(x^*)$ to constrained settings. Conjugate functions are the engine of Lagrangian duality: the dual function is $g(\lambda) = -f_0^*(-A^\top\lambda) - b^\top\lambda$ for linear constraints. |
| Gradient Descent & Convergence | Interior-point methods solve the KKT system via Newton’s method on the barrier-perturbed equations. The convergence analysis of these solvers relies on the smoothness and strong convexity framework. The barrier parameter $t$ controls the tradeoff between centering accuracy and constraint satisfaction. |
| Proximal Methods | ADMM is a dual decomposition method: it applies Douglas–Rachford splitting to the dual of the consensus problem. The augmented Lagrangian $\mathcal{L}_\rho(x, \lambda) = \mathcal{L}(x, \lambda) + \frac{\rho}{2}\|Ax - b\|^2$ is a proximal regularization of the standard Lagrangian, connecting the dual update $\lambda^{k+1} = \lambda^k + \rho(Ax^{k+1} - b)$ to proximal point iterations. |
| The Spectral Theorem | Quadratic programs require the objective Hessian to be PSD for convexity. The eigendecomposition determines the condition number $\kappa(P)$ that governs interior-point convergence. For SDPs, the spectral theorem on the matrix variable provides the optimality structure. |
| Singular Value Decomposition | The SVM dual involves the Gram matrix $K_{ij} = x_i^\top x_j$. For kernel SVMs, the SVD of the feature matrix reveals the effective dimensionality and condition number of $K$, governing the dual QP’s numerical behavior. |
| PCA & Low-Rank Approximation | Nuclear norm minimization $\min \|X\|_*$ subject to linear constraints is the convex relaxation of rank-constrained optimization. The dual of this problem involves the spectral norm, and the duality theory developed here establishes the tightness of the relaxation. |
| Adjunctions | Lagrangian duality is a Galois connection — an adjunction between posets. Weak duality ($d^* \leq f^*$) is the counit condition, strong duality ($d^* = f^*$) is when the unit and counit are isomorphisms, and the duality gap measures the obstruction. The KKT conditions characterize the fixed points of the closure operator. |

The Optimization Track (Complete)

Convex Analysis
    ├── Gradient Descent & Convergence
    │       └── Proximal Methods
    └── Lagrangian Duality & KKT (this topic)

This topic completes the Optimization track. The four topics form a coherent progression: Convex Analysis provides the mathematical foundations (convex sets, functions, subdifferentials, separation theorems), Gradient Descent develops the unconstrained optimization theory (smooth, strongly convex, accelerated), Proximal Methods extends to non-smooth and composite objectives (proximal operators, splitting, ADMM), and Lagrangian Duality addresses constrained optimization with the full KKT machinery.

Potential future directions that build on this foundation include second-order cone programming (SOCP), semidefinite programming (SDP), robust optimization, and bilevel optimization — but these are beyond the current curriculum scope.


References & Further Reading

  • Boyd & Vandenberghe, Convex Optimization (2004). Chapter 5: Duality. The primary reference for Lagrangian duality, weak/strong duality, KKT conditions, and sensitivity analysis.
  • Rockafellar, Convex Analysis (1970). The foundational monograph: conjugate duality, saddle point theory, and the minimax theorem.
  • Nesterov, Introductory Lectures on Convex Optimization (2004). Chapter 4: duality in convex optimization and self-concordant barrier methods.
  • Nocedal & Wright, Numerical Optimization (2006). Chapter 12: theory of constrained optimization; KKT conditions for general nonlinear programs.
  • Burges, “A Tutorial on Support Vector Machines for Pattern Recognition” (1998). The SVM dual derivation via KKT conditions and the kernel trick.
  • Alizadeh, “Interior-Point Methods in Semidefinite Programming with Applications to Combinatorial Optimization” (1995). Interior-point duality for SDP, extending the LP duality framework to semidefinite constraints.