
Rate-Distortion Theory

The fundamental limits of lossy compression — how many bits per symbol when we tolerate distortion?

Overview & Motivation

Shannon’s source coding theorem establishes that the entropy rate $H$ is the fundamental limit of lossless compression. But lossless compression is often wildly impractical: a raw audio CD uses 1,411 kbps; MP3 achieves perceptually transparent quality at 128 kbps — a 10× reduction by accepting controlled distortion. JPEG, H.264, and every streaming codec you’ve ever used make the same bargain: trade fidelity for bits.

Rate-distortion theory answers the question that lossless coding leaves open: if we allow an average distortion of at most $D$, what is the minimum number of bits per source symbol we must transmit? The answer is the rate-distortion function $R(D)$, which Shannon proved is the solution to a convex optimization problem over conditional distributions:

$$R(D) = \min_{p(\hat{x}|x):\, \mathbb{E}[d(X,\hat{X})] \leq D} I(X; \hat{X})$$

This is one of the deepest results in information theory: it gives a single number that separates the possible from the impossible. Any coding scheme operating below $R(D)$ bits per symbol must incur distortion greater than $D$, no matter how clever the encoder.

What We Cover

  1. Distortion measures — Hamming distortion, squared error, and the expected distortion framework
  2. The rate-distortion function — definition, properties (convex, non-increasing, boundary values), and the operational meaning
  3. Rate-distortion theorem — Shannon’s achievability and converse: $R(D)$ is the exact boundary between achievable and unachievable (rate, distortion) pairs
  4. Closed-form solutions — binary source with Hamming distortion, Gaussian source with squared error, parametric form, Shannon lower bound
  5. The Blahut–Arimoto algorithm — alternating minimization for computing $R(D)$ numerically
  6. The information bottleneck — compression with relevance, connections to deep learning (VIB, $\beta$-VAE)
  7. Computational notes — Python implementations, VAE loss as rate-distortion, neural compression


Distortion Measures

Before we can talk about “acceptable” lossy compression, we need a formal notion of how much damage we’ve done to the source. A distortion measure quantifies the cost of reproducing source symbol $x$ as $\hat{x}$.

Definition 1 (Distortion Function).

A distortion function $d: \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$ assigns a non-negative real cost $d(x, \hat{x})$ to each source-reproduction pair $(x, \hat{x})$. The expected distortion (per-symbol distortion) for a joint distribution $p(x, \hat{x})$ is

$$D = \mathbb{E}[d(X, \hat{X})] = \sum_{x} \sum_{\hat{x}} p(x, \hat{x})\, d(x, \hat{x})$$

Two distortion measures dominate information theory: one for discrete sources, one for continuous.

Definition 2 (Hamming Distortion).

For discrete alphabets, the Hamming distortion is

$$d_H(x, \hat{x}) = \mathbb{1}[x \neq \hat{x}] = \begin{cases} 0 & \text{if } x = \hat{x} \\ 1 & \text{if } x \neq \hat{x} \end{cases}$$

The expected Hamming distortion equals the error probability: $\mathbb{E}[d_H] = \Pr(X \neq \hat{X})$.

Definition 3 (Squared Error Distortion).

For real-valued alphabets, the squared error distortion is

$$d_{SE}(x, \hat{x}) = (x - \hat{x})^2$$

The expected squared error distortion is the mean squared error: $\mathbb{E}[d_{SE}] = \mathbb{E}[(X - \hat{X})^2]$.

Hamming distortion treats all errors equally — a single bit flip costs the same regardless of context. Squared error, by contrast, is graded: reproducing 3.0 as 3.1 costs far less than reproducing it as 10.0. The choice of distortion measure shapes the entire rate-distortion curve, because it determines what “tolerable” compression means.

For a discrete source with alphabet $\{0, 1, \ldots, k-1\}$, a distortion measure can be organized into a distortion matrix $\mathbf{D}$ with entries $\mathbf{D}_{ij} = d(x_i, \hat{x}_j)$. For Hamming distortion, this is simply the complement of the identity: zeros on the diagonal, ones everywhere else.
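As a concrete check of these definitions, the sketch below (plain NumPy; the 0.02 off-diagonal error mass is an arbitrary illustrative choice) builds the Hamming distortion matrix for a ternary alphabet and confirms that the expected distortion equals the off-diagonal probability mass, i.e. $\Pr(X \neq \hat{X})$:

```python
import numpy as np

k = 3
# Hamming distortion matrix: zeros on the diagonal, ones everywhere else
D_mat = 1.0 - np.eye(k)

# a toy joint p(x, xhat): mostly faithful reproduction, small error mass
p_joint = np.full((k, k), 0.02)
np.fill_diagonal(p_joint, (1.0 - 0.02 * k * (k - 1)) / k)
assert abs(p_joint.sum() - 1.0) < 1e-12  # valid joint distribution

# expected distortion D = sum_{x, xhat} p(x, xhat) d(x, xhat)
expected_D = float(np.sum(p_joint * D_mat))
# off-diagonal mass: 6 cells x 0.02 = 0.12 = Pr(X != Xhat)
assert abs(expected_D - 0.12) < 1e-9
```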

Distortion measures — Hamming matrix, squared error curves, and expected distortion


The Rate-Distortion Function

The rate-distortion function answers the central question: what is the minimum number of bits per source symbol we need to transmit, if we can tolerate average distortion $\leq D$?

Definition 4 (Rate-Distortion Function).

For a source $X \sim p(x)$ with distortion measure $d(x, \hat{x})$, the rate-distortion function is

$$R(D) = \min_{p(\hat{x}|x):\, \mathbb{E}[d(X,\hat{X})] \leq D} I(X; \hat{X})$$

where the minimization is over all conditional distributions $p(\hat{x}|x)$ (called test channels) such that the expected distortion $\mathbb{E}[d(X, \hat{X})] = \sum_x \sum_{\hat{x}} p(x)\, p(\hat{x}|x)\, d(x, \hat{x})$ is at most $D$.

The test channel $p(\hat{x}|x)$ describes how we map source symbols to reproductions — it’s the probabilistic encoding rule. The mutual information $I(X; \hat{X})$ measures how many bits of information the reproduction carries about the source. Minimizing $I(X; \hat{X})$ subject to the distortion constraint finds the most compressed representation that still meets the fidelity requirement.

Proposition 1 (Properties of R(D)).

The rate-distortion function $R(D)$ has the following properties:

(i) $R(D) \geq 0$ for all $D \geq 0$.

(ii) $R(D)$ is a non-increasing function of $D$.

(iii) $R(D)$ is a convex function of $D$.

(iv) $R(0) = H(X)$ for discrete sources with Hamming distortion (the lossless limit).

(v) $R(D) = 0$ for $D \geq D_{\max}$, where $D_{\max} = \min_{\hat{x}} \mathbb{E}[d(X, \hat{x})]$ is the distortion achievable without any communication.

Proof.

(i) Mutual information is non-negative: $I(X; \hat{X}) \geq 0$, so the minimum is non-negative.

(ii) If $D_1 < D_2$, then $\{p(\hat{x}|x) : \mathbb{E}[d] \leq D_1\} \subseteq \{p(\hat{x}|x) : \mathbb{E}[d] \leq D_2\}$. Minimizing over a larger set can only decrease (or maintain) the optimum, so $R(D_2) \leq R(D_1)$.

(iii) Let $(D_1, R(D_1))$ and $(D_2, R(D_2))$ lie on the curve, achieved by test channels $p_1(\hat{x}|x)$ and $p_2(\hat{x}|x)$ respectively. For $\lambda \in [0,1]$, the mixture test channel $p_\lambda = \lambda p_1 + (1-\lambda) p_2$ achieves expected distortion $\leq \lambda D_1 + (1-\lambda) D_2$. By the convexity of mutual information in $p(\hat{x}|x)$ for fixed $p(x)$:

$$R(\lambda D_1 + (1-\lambda) D_2) \leq I_\lambda(X; \hat{X}) \leq \lambda I_1(X; \hat{X}) + (1-\lambda) I_2(X; \hat{X}) = \lambda R(D_1) + (1-\lambda) R(D_2)$$

(iv) At $D = 0$ with Hamming distortion, $\Pr(X \neq \hat{X}) = 0$, so $\hat{X} = X$ almost surely. Then $I(X; \hat{X}) = H(X)$.

(v) For $D \geq D_{\max}$, we can set $p(\hat{x}|x) = \delta_{\hat{x}^*}$ (reproduce everything as a constant $\hat{x}^*$ that minimizes $\mathbb{E}[d(X, \hat{x}^*)]$), achieving $I(X; \hat{X}) = 0$.

Properties (ii) and (iii) together tell us that $R(D)$ is a convex, non-increasing curve from $R(0) = H(X)$ down to $R(D_{\max}) = 0$. The region above the curve is achievable (we can compress to that rate at that distortion); the region below is unachievable (no coding scheme can reach it).
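Proposition 1 can be sanity-checked numerically. The sketch below uses the closed form $R(D) = H_b(p) - H_b(D)$ for a Bernoulli(0.3) source (derived in the closed-form section below) and verifies monotonicity, convexity, and the boundary values on a grid:

```python
import numpy as np

def H_b(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else float(-q * np.log2(q) - (1 - q) * np.log2(1 - q))

def R_binary(D, p=0.3):
    """R(D) = H_b(p) - H_b(D), clamped to 0 beyond D_max = min(p, 1-p)."""
    return max(0.0, H_b(p) - H_b(min(D, min(p, 1 - p))))

Ds = np.linspace(0.0, 0.4, 41)
Rs = np.array([R_binary(D) for D in Ds])

assert np.all(np.diff(Rs) <= 1e-12)            # (ii) non-increasing
mid = 0.5 * (Rs[:-2] + Rs[2:])                 # (iii) convex: chords lie above the curve
assert np.all(mid - Rs[1:-1] >= -1e-12)
assert abs(Rs[0] - H_b(0.3)) < 1e-12           # (iv) R(0) = H(X)
assert Rs[-1] == 0.0                           # (v) zero beyond D_max = 0.3
```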

Rate-distortion function — binary source, Gaussian source, achievable and unachievable regions


The Rate-Distortion Theorem

Shannon’s rate-distortion theorem gives $R(D)$ its operational meaning: it is the exact boundary between achievable and unachievable compression.

Theorem 3 (Shannon's Rate-Distortion Theorem).

For an i.i.d. source $X_1, X_2, \ldots \sim p(x)$ with distortion measure $d$:

Achievability. For any rate $R > R(D)$ and any $\varepsilon > 0$, there exists a sequence of $(2^{nR}, n)$ codes with expected distortion $\leq D + \varepsilon$ for $n$ sufficiently large.

Converse. For any rate $R < R(D)$, every sequence of $(2^{nR}, n)$ codes has expected distortion $> D$ for $n$ sufficiently large.

In other words: $R(D)$ bits per symbol are both necessary and sufficient to represent the source with average distortion at most $D$.

Proof.

Achievability. Generate $2^{nR}$ codewords $\hat{x}^n$ independently from the reproduction distribution $p^*(\hat{x})$ induced by the optimal test channel. To encode a source sequence $x^n$, find a codeword $\hat{x}^n$ that is jointly typical with $x^n$ under the optimal test channel.

By the covering lemma, this succeeds with high probability if $R > I(X; \hat{X})$: we need enough codewords to “cover” the typical set of source sequences. The expected distortion converges to $\mathbb{E}[d(X, \hat{X})] \leq D$ by the properties of jointly typical sequences — each typical pair $(x^n, \hat{x}^n)$ has per-symbol distortion close to the expected distortion of the test channel.

Converse. By the data processing inequality, for any encoder-decoder pair with index $M \in \{1, \ldots, 2^{nR}\}$:

$$nR \geq H(M) \geq I(X^n; \hat{X}^n) = \sum_{i=1}^n H(X_i) - \sum_{i=1}^n H(X_i \mid \hat{X}^n, X^{i-1})$$

The first sum uses the fact that the source is memoryless ($H(X^n) = \sum_i H(X_i)$); the second is the entropy chain rule. Since conditioning cannot increase entropy, $H(X_i \mid \hat{X}^n, X^{i-1}) \leq H(X_i \mid \hat{X}_i)$, and therefore $I(X^n; \hat{X}^n) \geq \sum_{i=1}^n I(X_i; \hat{X}_i)$.

Applying the definition of $R(D)$ to each term (with $D_i = \mathbb{E}[d(X_i, \hat{X}_i)]$), then the convexity of $R(D)$ and Jensen’s inequality:

$$R \geq \frac{1}{n}\sum_{i=1}^n I(X_i; \hat{X}_i) \geq \frac{1}{n}\sum_{i=1}^n R(D_i) \geq R\!\left(\frac{1}{n}\sum_{i=1}^n D_i\right)$$

So $R \geq R(\bar{D})$, where $\bar{D} = \frac{1}{n}\sum_i D_i$ is the average distortion: any rate below $R(D)$ cannot achieve expected distortion $D$.

The theorem is the operational backbone of lossy compression. The achievability proof is constructive (random coding with joint typicality), while the converse is information-theoretic (any code, no matter how clever, must respect $R(D)$).

Rate-distortion theorem — block diagram of lossy compression, achievable vs unachievable regions


Closed-Form Solutions

For two important source-distortion pairs, $R(D)$ has a clean analytical form. These serve as benchmarks and building blocks for understanding the general case.

Binary Source with Hamming Distortion

Theorem 1 (Rate-Distortion for Binary Source).

For a Bernoulli($p$) source with Hamming distortion:

$$R(D) = H_b(p) - H_b(D), \qquad 0 \leq D \leq \min(p, 1-p)$$

where $H_b(\cdot)$ is the binary entropy function. For $D > \min(p, 1-p)$, we have $R(D) = 0$.

Proof.

The optimal test channel is a binary symmetric channel BSC($D$): it flips each bit independently with probability $D$. Under this channel:

  • The expected distortion is exactly $D$ (the flip probability).
  • The mutual information is $I(X; \hat{X}) = H(X) - H(X|\hat{X}) = H_b(p) - H_b(D)$.

We verify this is optimal by checking the KKT conditions. With slope parameter $s < 0$, the Lagrangian is $L = I(X; \hat{X}) - s\, \mathbb{E}[d(X, \hat{X})]$. The optimal test channel satisfies $p(\hat{x}|x) \propto q(\hat{x}) \exp(s\, d(x, \hat{x}))$, which for Hamming distortion gives the BSC structure. The distortion constraint is active ($\mathbb{E}[d] = D$), confirming the solution.

At $D = 0$: $R(0) = H_b(p) - H_b(0) = H_b(p) = H(X)$, recovering lossless compression.

At $D = \min(p, 1-p)$: $R(D) = H_b(p) - H_b(\min(p,1-p)) = 0$, since $H_b$ is symmetric about $1/2$ and so $H_b(\min(p,1-p)) = H_b(p)$.

For the uniform binary source ($p = 0.5$), this simplifies to $R(D) = 1 - H_b(D)$: we start at 1 bit (lossless) and reach 0 bits at $D = 0.5$ (pure guessing).

Gaussian Source with Squared Error

Theorem 2 (Rate-Distortion for Gaussian Source).

For a Gaussian source $X \sim \mathcal{N}(0, \sigma^2)$ with squared error distortion:

$$R(D) = \frac{1}{2}\log_2 \frac{\sigma^2}{D}, \qquad 0 < D \leq \sigma^2$$

For $D > \sigma^2$, we have $R(D) = 0$.

Proof.

The optimal test channel is most easily described in reverse: draw $\hat{X} \sim \mathcal{N}(0, \sigma^2 - D)$ and independent noise $Z \sim \mathcal{N}(0, D)$, and set $X = \hat{X} + Z$. This reproduces the correct source marginal $X \sim \mathcal{N}(0, \sigma^2)$, makes $\hat{X}$ its own MMSE reconstruction ($\mathbb{E}[X \mid \hat{X}] = \hat{X}$; in the forward direction $\mathbb{E}[\hat{X} \mid X] = \frac{\sigma^2 - D}{\sigma^2} X$), and yields distortion $\mathbb{E}[(X - \hat{X})^2] = \mathbb{E}[Z^2] = D$ exactly.

The mutual information under this channel:

$$I(X; \hat{X}) = h(X) - h(X|\hat{X}) = \frac{1}{2}\log_2(2\pi e \sigma^2) - \frac{1}{2}\log_2(2\pi e D) = \frac{1}{2}\log_2\frac{\sigma^2}{D}$$

This is optimal because the Gaussian source achieves the Shannon lower bound (Proposition 2 below) with equality.

At $D = \sigma^2$: we can reproduce everything as the mean (zero), achieving zero rate.

The Gaussian $R(D)$ has a particularly clean interpretation: each extra bit per sample cuts the achievable distortion by a factor of four (roughly 6 dB of fidelity per bit). This logarithmic relationship is the foundation of quantization theory.
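A quick numeric check of this rule, using nothing but the formula of Theorem 2:

```python
import numpy as np

def R_gauss(sigma2, D):
    """R(D) = 0.5 * log2(sigma2 / D) for 0 < D <= sigma2, else 0 (Theorem 2)."""
    return 0.0 if D >= sigma2 else 0.5 * np.log2(sigma2 / D)

# one extra bit per sample cuts the distortion by a factor of 4
D = 0.4
gain = R_gauss(1.0, D / 4) - R_gauss(1.0, D)
assert abs(gain - 1.0) < 1e-12

# beyond the source variance, no bits are needed at all
assert R_gauss(1.0, 2.0) == 0.0
```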

Parametric Form and the Shannon Lower Bound

Theorem 4 (Parametric Form of R(D)).

The rate-distortion curve can be traced parametrically by its slope $s < 0$ (working in nats here for clean constants; divide by $\ln 2$ for bits):

$$R(s) = s\, D(s) - \sum_x p(x) \ln Z(x), \qquad Z(x) = \sum_{\hat{x}} q_s(\hat{x})\, e^{s\, d(x, \hat{x})}$$

$$D(s) = \sum_x \sum_{\hat{x}} p(x)\, p_s(\hat{x}|x)\, d(x, \hat{x})$$

where $p_s(\hat{x}|x) = q_s(\hat{x})\, e^{s\, d(x, \hat{x})} / Z(x)$ is the optimal test channel at slope $s$, and $q_s(\hat{x})$ is its output marginal. The slope $s = dR/dD$ is the Lagrange multiplier from the Lagrangian dual of the rate-distortion optimization.

For the binary source, the slope at distortion $D$ is $s = \ln\big(D/(1-D)\big)$ and the test channel is BSC($D$). For the Gaussian source, $s = -1/(2D)$, and the test channel adds $\mathcal{N}(0, D)$ noise.

Proposition 2 (Shannon Lower Bound).

For any continuous source $X$ with differential entropy $h(X)$ and squared error distortion:

$$R(D) \geq h(X) - \frac{1}{2}\log_2(2\pi e D)$$

with equality if and only if $X$ is Gaussian. This is the Shannon lower bound (SLB).

Proof.

Starting from the definition:

$$R(D) = \min I(X; \hat{X}) = \min \big[h(X) - h(X|\hat{X})\big] = h(X) - \max h(X|\hat{X})$$

Since $\mathrm{Var}(X|\hat{X}) \leq \mathbb{E}[(X - \hat{X})^2] \leq D$, the conditional entropy is bounded by $h(X|\hat{X}) \leq \frac{1}{2}\log_2(2\pi e D)$ — because the Gaussian maximizes differential entropy for a given variance. Therefore:

$$R(D) \geq h(X) - \frac{1}{2}\log_2(2\pi e D)$$

Equality holds when $X|\hat{X}$ is Gaussian with variance $D$, which happens exactly when $X$ itself is Gaussian.

The Shannon lower bound tells us that the Gaussian source is the hardest to compress (among sources with the same variance), in the sense that it requires the most bits at every distortion level.
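This claim can be made concrete with a small computation: a unit-variance Laplacian has strictly smaller differential entropy than the unit-variance Gaussian, so its Shannon lower bound sits strictly below the Gaussian $R(D)$ (for which the bound is tight). A sketch, using the standard formula $h = \log_2(2be)$ bits for a Laplace($b$) density:

```python
import numpy as np

D = 0.05  # an arbitrary distortion level

h_gauss = 0.5 * np.log2(2 * np.pi * np.e)  # unit-variance Gaussian
b = 1 / np.sqrt(2)                          # Laplace scale with variance 2b^2 = 1
h_laplace = np.log2(2 * b * np.e)

slb = lambda h: h - 0.5 * np.log2(2 * np.pi * np.e * D)

assert h_laplace < h_gauss                  # Gaussian maximizes entropy at fixed variance
assert slb(h_laplace) < slb(h_gauss)        # so its SLB is strictly lower
# for the Gaussian, the SLB is tight: it equals 0.5 * log2(sigma^2 / D)
assert abs(slb(h_gauss) - 0.5 * np.log2(1.0 / D)) < 1e-12
```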

Closed-form solutions — binary test channel, Gaussian R(D), Shannon lower bound


The Blahut–Arimoto Algorithm

For general discrete sources where a closed-form $R(D)$ is unavailable, the Blahut–Arimoto (BA) algorithm computes $R(D)$ via alternating minimization. The algorithm exploits the convex structure of the rate-distortion optimization.

Theorem 5 (Blahut–Arimoto for Rate-Distortion).

The rate-distortion function $R(D)$ for a discrete source $p(x)$ with finite alphabet and distortion measure $d(x, \hat{x})$ can be computed by the following alternating minimization:

Initialize: $q(\hat{x}) = $ uniform over $\hat{\mathcal{X}}$.

Repeat until convergence:

Step 1 (Optimize test channel for fixed output marginal):

$$p(\hat{x}|x) = \frac{q(\hat{x}) \exp\!\big(s\, d(x, \hat{x})\big)}{Z(x)}, \qquad Z(x) = \sum_{\hat{x}} q(\hat{x}) \exp\!\big(s\, d(x, \hat{x})\big)$$

Step 2 (Update output marginal):

$$q(\hat{x}) = \sum_x p(x)\, p(\hat{x}|x)$$

The algorithm converges to the optimal test channel for the given slope $s < 0$. Sweeping $s$ from $-\infty$ to $0$ traces out the entire $R(D)$ curve.

Proof.

The Lagrangian of the rate-distortion optimization, with slope parameter $s < 0$, is

$$L = I(X; \hat{X}) - s\, \mathbb{E}[d(X, \hat{X})]$$

This decomposes into two subproblems when we alternate between optimizing the test channel $p(\hat{x}|x)$ and the output marginal $q(\hat{x})$. Each step decreases $L$:

  • Step 1 minimizes $L$ over $p(\hat{x}|x)$ for fixed $q(\hat{x})$. The solution is the Gibbs distribution $p(\hat{x}|x) \propto q(\hat{x}) \exp(s\, d(x, \hat{x}))$, obtained by setting the functional derivative to zero.
  • Step 2 minimizes $L$ over $q(\hat{x})$ for fixed $p(\hat{x}|x)$. The optimal $q(\hat{x})$ is the marginal $\sum_x p(x)\, p(\hat{x}|x)$, which minimizes the average KL divergence $\mathbb{E}_X\big[D_{\mathrm{KL}}(p(\hat{x}|X) \,\|\, q(\hat{x}))\big]$.

Since $L$ is bounded below and decreases at each step, the algorithm converges. Since the original problem is convex, the limit is the global optimum.

The connection to the EM algorithm is direct: Step 1 is analogous to the E-step (computing a posterior given current parameters), and Step 2 is analogous to the M-step (updating parameters given the posterior). Both algorithms exploit the same alternating minimization structure on convex objectives.


Here is the Python implementation:

import numpy as np

def mutual_information(joint):
    """Mutual information in bits from a joint distribution matrix."""
    px = joint.sum(axis=1, keepdims=True)
    q = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ q)[mask])))

def blahut_arimoto_rd(px, distortion_matrix, slope, max_iter=200, tol=1e-10):
    """Blahut-Arimoto for one point on the R(D) curve (slope < 0)."""
    n_x, n_xhat = distortion_matrix.shape
    q_xhat = np.ones(n_xhat) / n_xhat  # uniform initialization

    for iteration in range(max_iter):
        # Step 1: optimal test channel (Gibbs form, computed in log space)
        log_channel = np.log(np.maximum(q_xhat, 1e-300)) + slope * distortion_matrix
        log_channel -= log_channel.max(axis=1, keepdims=True)
        p_xhat_given_x = np.exp(log_channel)
        p_xhat_given_x /= p_xhat_given_x.sum(axis=1, keepdims=True)

        # Step 2: update output marginal
        new_q = px @ p_xhat_given_x
        new_q = np.maximum(new_q, 1e-300)

        if np.max(np.abs(new_q - q_xhat)) < tol:
            break
        q_xhat = new_q

    # Compute rate and distortion for this slope
    joint = px[:, None] * p_xhat_given_x
    rate = mutual_information(joint)
    distortion = np.sum(joint * distortion_matrix)
    return rate, distortion, q_xhat, p_xhat_given_x
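To see the iteration land on the analytical curve, here is a condensed, self-contained variant of the same two steps (a minimal `ba_point` written for this check, not the listing above), verified against $R(D) = 1 - H_b(D)$ for the uniform binary source:

```python
import numpy as np

def H_b(q):
    """Binary entropy in bits."""
    return 0.0 if q <= 0 or q >= 1 else float(-q * np.log2(q) - (1 - q) * np.log2(1 - q))

def ba_point(px, D_mat, slope, iters=200):
    """One (rate, distortion) point via Blahut-Arimoto alternating minimization."""
    q = np.full(D_mat.shape[1], 1.0 / D_mat.shape[1])   # uniform output marginal
    for _ in range(iters):
        chan = q * np.exp(slope * D_mat)                # step 1: Gibbs test channel
        chan /= chan.sum(axis=1, keepdims=True)
        q = px @ chan                                   # step 2: output marginal
    joint = px[:, None] * chan
    qm = joint.sum(axis=0)
    mask = joint > 0
    rate = np.sum(joint[mask] * np.log2(joint[mask] / (px[:, None] * qm[None, :])[mask]))
    return float(rate), float(np.sum(joint * D_mat))

px = np.array([0.5, 0.5])
D_mat = 1.0 - np.eye(2)                                 # Hamming distortion
rate, dist = ba_point(px, D_mat, slope=-3.0)
# for the uniform binary source the point must land on R(D) = 1 - H_b(D)
assert abs(rate - (1 - H_b(dist))) < 1e-6
```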

Blahut–Arimoto algorithm — R(D) curve, convergence, distribution evolution


The Information Bottleneck

The information bottleneck (IB) method, introduced by Tishby, Pereira & Bialek (1999), extends rate-distortion theory from compression of $X$ to compression of $X$ while preserving information about a relevant variable $Y$. This is the bridge between rate-distortion theory and representation learning.

Definition 5 (Information Bottleneck).

Given a joint distribution $p(x, y)$, the information bottleneck seeks a compressed representation $T$ of $X$ that preserves as much information about $Y$ as possible. The IB Lagrangian is

$$\mathcal{L}_{IB} = I(X; T) - \beta\, I(T; Y)$$

where $\beta > 0$ controls the compression-relevance trade-off:

  • $I(X; T)$ = complexity — how much we remember about $X$
  • $I(T; Y)$ = relevance — how much $T$ tells us about $Y$
  • $\beta$ = Lagrange multiplier: large $\beta$ favors relevance, small $\beta$ favors compression

The IB is not just an abstract optimization — it is a special case of rate-distortion theory with an information-theoretic distortion measure.

Proposition 3 (IB as Rate-Distortion with KL Distortion).

The IB problem is equivalent to a rate-distortion problem with the “log-loss” distortion measure:

$$d_{IB}(x, t) = D_{\mathrm{KL}}\!\big(p(y|x) \,\|\, p(y|t)\big)$$

The “cost” of representing $x$ by $t$ is the KL divergence between their conditional distributions over $Y$. This makes the IB a special case of rate-distortion theory where the distortion is information-theoretic.

Proof.

We want to minimize $I(X; T)$ subject to $I(T; Y) \geq I_0$. Using the Markov chain $T - X - Y$ (the representation $T$ depends on $Y$ only through $X$), we proceed in three steps.

Step 1: Decompose the relevance constraint. Write mutual information as a conditional entropy difference:

$$I(T; Y) = H(Y) - H(Y|T)$$

So $I(T; Y) \geq I_0$ is equivalent to $H(Y|T) \leq H(Y) - I_0$.

Step 2: Introduce the KL distortion. By the Markov chain $T - X - Y$, we have $p(y|t) = \sum_x p(x|t)\, p(y|x)$. The gap between conditional entropies decomposes as:

$$H(Y|T) - H(Y|X) = \sum_t p(t) \sum_x p(x|t)\, D_{\mathrm{KL}}\!\big(p(y|x) \,\|\, p(y|t)\big) = \mathbb{E}\big[d_{IB}(X, T)\big]$$

where $d_{IB}(x, t) = D_{\mathrm{KL}}\big(p(y|x) \,\|\, p(y|t)\big)$ is the KL distortion measure. This is non-negative (by Gibbs’ inequality) and equals zero only when $T$ preserves the full conditional $p(y|x)$.

Step 3: Reformulate as rate-distortion. Since $H(Y|X)$ is a constant of the source, the constraint $H(Y|T) \leq H(Y) - I_0$ becomes $\mathbb{E}[d_{IB}(X, T)] \leq D$ for $D = H(Y) - I_0 - H(Y|X) = I(X;Y) - I_0$. The IB objective $\min I(X; T)$ subject to this expected distortion constraint is exactly the rate-distortion problem with the distortion measure $d_{IB}$.
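The identity in Step 2 can be verified numerically on a randomly generated joint distribution and an arbitrary representation channel (a self-contained sketch; the alphabet sizes and random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# random joint p(x, y) and an arbitrary stochastic channel p(t|x)
p_xy = rng.random((3, 4)); p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]
p_t_given_x = rng.random((3, 2)); p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

# Markov chain T - X - Y: p(y|t) = sum_x p(x|t) p(y|x)
p_t = p_x @ p_t_given_x
p_x_given_t = (p_t_given_x * p_x[:, None]).T / p_t[:, None]
p_y_given_t = p_x_given_t @ p_y_given_x

def H_cond(p_a, p_b_given_a):
    """Conditional entropy H(B|A) in bits (all entries strictly positive here)."""
    return float(-np.sum(p_a[:, None] * p_b_given_a * np.log2(p_b_given_a)))

# expected KL distortion E[d_IB(X, T)] in bits
dkl = np.sum(p_y_given_x[:, None, :] *
             np.log2(p_y_given_x[:, None, :] / p_y_given_t[None, :, :]), axis=2)
E_d = float(np.sum(p_x[:, None] * p_t_given_x * dkl))

gap = H_cond(p_t, p_y_given_t) - H_cond(p_x, p_y_given_x)
assert abs(gap - E_d) < 1e-10   # H(Y|T) - H(Y|X) = E[d_IB(X, T)]
```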

Proposition 4 (IB Curve Properties).

The IB curve — the set of achievable $(I(X;T), I(T;Y))$ pairs — has the following properties:

(i) It is a concave function of $I(X;T)$.

(ii) $I(T;Y) = 0$ when $I(X;T) = 0$ (a representation that retains nothing about $X$ reveals nothing about $Y$).

(iii) $I(T;Y) = I(X;Y)$ when $I(X;T) = H(X)$ (a lossless representation preserves all the relevance).

(iv) The slope $dI(T;Y)/dI(X;T)$ at any point equals $1/\beta$.

The IB curve tells us exactly how much relevance we must sacrifice for each bit of compression. Low $\beta$ means heavy compression (small $I(X;T)$, low relevance); high $\beta$ means preserving relevance at the cost of a more complex representation.

import numpy as np

# assumes a `mutual_information(joint)` helper returning bits, as in the
# Blahut-Arimoto listing above
def information_bottleneck(p_xy, beta, n_t=4, max_iter=500, tol=1e-10):
    """Compute IB solution for given beta via alternating optimization."""
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]

    # Initialize p(t|x) randomly
    p_t_given_x = np.random.dirichlet(np.ones(n_t), size=n_x)

    for _ in range(max_iter):
        p_t = p_x @ p_t_given_x
        p_t = np.maximum(p_t, 1e-300)

        # p(y|t) from Bayes: p(y|t) = sum_x p(y|x) p(x|t)
        p_y_given_t = np.zeros((n_t, n_y))
        for t in range(n_t):
            if p_t[t] > 1e-300:
                p_y_given_t[t] = (p_t_given_x[:, t] * p_x) @ p_y_given_x / p_t[t]

        # Update p(t|x) proportional to p(t) exp(-beta * D_KL(p(y|x) || p(y|t)))
        new_p = np.zeros_like(p_t_given_x)
        for i in range(n_x):
            for t in range(n_t):
                dkl = np.sum(
                    p_y_given_x[i] * np.log(
                        np.maximum(p_y_given_x[i], 1e-300)
                        / np.maximum(p_y_given_t[t], 1e-300)
                    ) * (p_y_given_x[i] > 0)
                )
                new_p[i, t] = p_t[t] * np.exp(-beta * dkl)
            new_p[i] /= np.maximum(new_p[i].sum(), 1e-300)

        if np.max(np.abs(new_p - p_t_given_x)) < tol:
            break
        p_t_given_x = new_p

    # Compute I(X;T) and I(T;Y) from the converged channel
    p_xt = p_x[:, None] * p_t_given_x
    I_XT = mutual_information(p_xt)
    p_ty = p_xt.T @ p_y_given_x  # p(t, y) via the Markov chain T - X - Y
    I_TY = mutual_information(p_ty)
    return I_XT, I_TY

Information bottleneck — IB curve, rate-distortion interpretation, β trade-off

Data Processing and Successive Refinement

Two additional results connect rate-distortion theory to practical coding systems.

Proposition 5 (DPI and Rate-Distortion).

For the Markov chain $X \to \hat{X} \to \tilde{X}$ (post-processing the compressed representation):

$$I(X; \tilde{X}) \leq I(X; \hat{X})$$

by the data processing inequality. Since $R(D)$ is non-increasing, post-processing cannot improve the rate-distortion trade-off: it can only increase the distortion, or maintain it when the processing discards nothing relevant (i.e. preserves a sufficient statistic).

Proposition 6 (Successive Refinement).

A source $X$ is successively refinable if the rate-distortion curve can be achieved by layered coding: a first description at rate $R_1$ achieves distortion $D_1$, and a second description at rate $R_2$ (given the first) achieves distortion $D_2 < D_1$, such that:

$$R_1 = R(D_1), \qquad R_1 + R_2 = R(D_2)$$

Theorem (Equitz & Cover, 1991): A source is successively refinable if and only if the optimal test channels at $D_1$ and $D_2$ can be chosen to form a Markov chain $X - \hat{X}_2 - \hat{X}_1$: the coarse reproduction depends on the source only through the fine one. The Gaussian source with squared error is successively refinable.

Successive refinement is the theoretical foundation of progressive coding (JPEG progressive, scalable video coding): a base layer provides coarse quality, and enhancement layers refine it without wasting bits.
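For the Gaussian case, the rate additivity in Proposition 6 can be checked directly: coding the first-stage residual at rate $R_2 = \tfrac{1}{2}\log_2(D_1/D_2)$ lands exactly on $R(D_2)$, with no layering penalty (a minimal numeric sketch; the specific $D_1, D_2$ values are arbitrary):

```python
import numpy as np

def R_gauss(sigma2, D):
    """Gaussian rate-distortion function of Theorem 2, in bits."""
    return 0.0 if D >= sigma2 else 0.5 * np.log2(sigma2 / D)

sigma2, D1, D2 = 1.0, 0.25, 0.01
R1 = R_gauss(sigma2, D1)          # base layer: coarse quality
R2 = 0.5 * np.log2(D1 / D2)       # enhancement layer: refine D1 down to D2

# layered rates sum to the one-shot optimum: no penalty for refinement
assert abs((R1 + R2) - R_gauss(sigma2, D2)) < 1e-12
```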

Successive refinement — layered coding, Gaussian refinability


Computational Notes

Rate-distortion theory connects directly to modern ML through three bridges: the VAE loss, neural compression, and the information bottleneck in deep learning.

VAE Loss as Rate-Distortion

Remark (VAE Loss as Rate-Distortion).

The VAE training objective (the negative ELBO) can be written as:

$$\mathcal{L}_{\text{VAE}} = \underbrace{\mathbb{E}_{q(z|x)}[-\log p(x|z)]}_{\text{distortion } D} + \underbrace{D_{\mathrm{KL}}\!\big(q(z|x) \,\|\, p(z)\big)}_{\text{rate } R}$$

This is precisely the rate-distortion Lagrangian $L = D + \beta R$ with $\beta = 1$ (the standard VAE). The reconstruction term is the distortion (negative log-likelihood under the decoder), and the KL term is the rate (the complexity of the latent code).

Remark (β-VAE and Rate-Distortion Trade-off).

The $\beta$-VAE (Higgins et al., 2017) modifies the ELBO with a tunable $\beta$:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}[-\log p(x|z)] + \beta\, D_{\mathrm{KL}}\!\big(q(z|x) \,\|\, p(z)\big)$$

This traces the rate-distortion curve:

  • $\beta < 1$: prioritize reconstruction (low distortion, high rate)
  • $\beta > 1$: prioritize compression (low rate, higher distortion, more disentangled representations)
  • $\beta = 1$: standard VAE (the “natural” operating point on the R(D) curve)

Python Implementations

Here are reference implementations for computing rate-distortion functions:

import numpy as np

def H_b(q):
    """Binary entropy in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def rate_distortion_binary(p, D):
    """R(D) = H_b(p) - H_b(D) for binary source with Hamming distortion."""
    D_max = min(p, 1 - p)
    if D >= D_max:
        return 0.0
    return H_b(p) - H_b(D)

def rate_distortion_gaussian(sigma2, D):
    """R(D) = 0.5 * log2(sigma2 / D) for Gaussian source."""
    if D >= sigma2:
        return 0.0
    return 0.5 * np.log2(sigma2 / D)

# VAE loss decomposition (PyTorch-style sketch: `encoder`, `decoder`, and
# `reparameterize` are assumed to be defined elsewhere)
def vae_loss(x, encoder, decoder, beta=1.0):
    """L = E[-log p(x|z)] + β D_KL(q(z|x) || p(z))"""
    mu, logvar = encoder(x)
    z = reparameterize(mu, logvar)

    # Distortion: reconstruction error
    recon_loss = -decoder.log_prob(x, z).mean()

    # Rate: KL divergence from Gaussian posterior to N(0, I) prior
    kl_loss = -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(dim=-1).mean()

    return recon_loss + beta * kl_loss  # rate-distortion Lagrangian

Neural Compression

Modern learned image and video codecs operationalize rate-distortion theory directly. The encoder-decoder architecture learns the optimal test channel end-to-end, with the loss function being exactly the rate-distortion Lagrangian:

$$\mathcal{L} = \underbrace{\mathbb{E}[\| x - \hat{x} \|^2]}_{\text{distortion}} + \lambda\, \underbrace{\mathbb{E}[-\log p(\hat{z})]}_{\text{rate}}$$

where $\hat{z}$ is the quantized latent representation and $p(\hat{z})$ is the learned entropy model. Different $\lambda$ values trace out the operational R(D) curve of the neural codec. Frameworks like CompressAI (implementing the learned transforms and entropy models of Ballé et al.) achieve state-of-the-art image compression by learning both the transform and the entropy model jointly.

Computational notes — VAE operating points, β-VAE trade-off, framework table


Connections & Further Reading

Connection Map

  • Shannon Entropy & Mutual Information — $R(D) = \min I(X; \hat{X})$: the rate-distortion function is defined as a minimization of mutual information. At $D=0$, $R(0) = H(X)$ recovers the lossless source coding limit.

  • KL Divergence & f-Divergences — The KL divergence appears in the IB distortion measure $d_{IB}(x,t) = D_{\mathrm{KL}}(p(y \mid x) \,\|\, p(y \mid t))$ and in the VAE rate term $D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z))$.

  • Convex Analysis — $R(D)$ is convex in $D$: the rate-distortion optimization is a convex program. The Blahut–Arimoto algorithm exploits this structure via alternating minimization.

  • Lagrangian Duality & KKT — The slope $s$ in the parametric form of $R(D)$ is the Lagrange multiplier for the distortion constraint. Strong duality holds because the optimization is convex.

  • Minimum Description Length — MDL connects source coding (Shannon) to model selection: the best model minimizes description length = code length + model complexity.

Notation Reference

Symbol — Meaning
$d(x, \hat{x})$ — Distortion function: cost of reproducing $x$ as $\hat{x}$
$D = \mathbb{E}[d(X, \hat{X})]$ — Expected (average) distortion
$R(D)$ — Rate-distortion function: minimum bits/symbol at distortion $\leq D$
$p(\hat{x} \mid x)$ — Test channel: conditional distribution from source to reproduction
$D_{\max}$ — Maximum useful distortion: achievable without any communication
$s$ — Slope parameter (Lagrange multiplier) in the parametric form
$\beta$ — Trade-off parameter in IB / $\beta$-VAE
$I(X; T)$ — Complexity (compression) in the information bottleneck
$I(T; Y)$ — Relevance (preserved information) in the information bottleneck


References & Further Reading

  • Cover & Thomas (2006), Elements of Information Theory, Chapters 10–13 — rate-distortion theory, closed-form solutions, Blahut–Arimoto algorithm
  • Berger (1971), Rate Distortion Theory: A Mathematical Basis for Data Compression — the classical monograph on rate-distortion theory
  • Blahut (1972), “Computation of Channel Capacity and Rate-Distortion Functions” — the original Blahut–Arimoto algorithm for computing R(D) via alternating minimization
  • Tishby, Pereira & Bialek (1999), “The Information Bottleneck Method” — introduces the information bottleneck: rate-distortion with relevance
  • Alemi, Fischer, Dillon & Murphy (2017), “Deep Variational Information Bottleneck” — connects the information bottleneck to deep learning via variational bounds
  • Higgins et al. (2017), “β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework” — β-VAE as a rate-distortion trade-off: tuning β traces the R(D) curve