
Shannon Entropy & Mutual Information

Information content, entropy, and mutual information — the quantities that measure uncertainty, surprise, and statistical dependence

Overview & Motivation

How much information does a random variable carry? How many bits do we need, at minimum, to encode a message? When two variables are statistically dependent, how much does knowing one tell us about the other?

These questions have precise, quantitative answers — and they all flow from a single idea that Claude Shannon published in 1948: entropy. Shannon showed that the uncertainty in a random variable has a unique, canonical measure, and that this measure controls the fundamental limits of data compression, communication, and statistical inference.

For machine learning, information theory is not optional background. Cross-entropy loss — the default objective for classification — is an information-theoretic quantity. Mutual information drives feature selection. The data processing inequality explains why deep representations can lose but never create information about the input. The maximum-entropy principle connects to Bayesian inference through the choice of priors. And KL Divergence & f-Divergences, the next topic on this track, builds directly on the entropy framework we develop here.

What We Cover

  1. Information content and Shannon entropy — the surprise of an outcome and the average surprise of a distribution
  2. Joint and conditional entropy — uncertainty in pairs of variables and the chain rule
  3. Mutual information — the symmetric measure of statistical dependence
  4. The data processing inequality — why post-processing cannot create information
  5. Differential entropy — the continuous extension and the maximum-entropy role of the Gaussian
  6. Source coding and compression — Shannon’s theorem, Kraft’s inequality, and Huffman coding
  7. Entropy rate — entropy for stochastic processes and Markov chains
  8. Connections to ML — cross-entropy loss, MI feature selection, and the InfoMax principle

Prerequisites

This topic builds on:

  • Measure-Theoretic Probability — entropy is defined as the expectation E[-\log p(X)], differential entropy requires the Lebesgue integral, and conditional entropy uses conditional expectation

Information Content & Shannon Entropy

Imagine playing a guessing game. Your friend picks a card from a deck and you must identify it with yes/no questions. If the deck has 2 cards, one question suffices — that's 1 bit of uncertainty. A deck of 8 cards takes 3 questions — 3 bits. The pattern is clear: the uncertainty in a uniform draw over n outcomes is \log_2 n bits.

But what if the outcomes are not equally likely? Shannon’s key insight was to define the information content (or surprise) of a single outcome, and then take the average to get the entropy.

Definition 1 (Information Content (Surprise)).

The information content (or self-information) of an outcome x with probability p(x) is

I(x) = -\log_2 p(x) \quad \text{[bits]}

An outcome with probability p = 1/2 carries 1 bit of information. An outcome with probability p = 1/8 carries 3 bits. Common events carry little surprise; rare events carry a lot.

The choice of \log_2 gives units of bits. Using \ln gives nats; using \log_{10} gives hartleys. The choice of base is a convention — we use bits throughout this topic.

Definition 2 (Shannon Entropy).

The Shannon entropy of a discrete random variable X with probability mass function p over a finite alphabet \mathcal{X} is

H(X) = E[-\log_2 p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) \quad \text{[bits]}

with the convention 0 \log_2 0 = 0 (justified by continuity: \lim_{p \to 0^+} p \log p = 0).

Entropy is the average surprise — the expected number of bits needed to encode a randomly drawn outcome. It is the fundamental measure of uncertainty in information theory.

The binary entropy function is the special case for a Bernoulli random variable with parameter p:

H_b(p) = -p \log_2 p - (1-p) \log_2(1-p)

This reaches its maximum of 1 bit at p = 1/2 (a fair coin) and drops to 0 at p = 0 or p = 1 (a certain outcome).
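As a quick numerical check of these endpoints, here is a minimal NumPy sketch of the binary entropy function (the function name is ours, not a library API):

```python
import numpy as np

def binary_entropy(p):
    """H_b(p) = -p log2 p - (1-p) log2(1-p), with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)   # endpoints contribute 0 by convention
    q = p[mask]
    out[mask] = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    return out

ps = np.linspace(0, 1, 11)
for p, h in zip(ps, binary_entropy(ps)):
    print(f"p = {p:.1f}  H_b = {h:.4f} bits")
# Maximum of 1 bit at p = 0.5; 0 bits at p = 0 and p = 1
```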

Properties of Entropy

Proposition 1 (Non-negativity of Entropy).

H(X) \geq 0 for all discrete random variables X, with equality if and only if X is deterministic (one outcome has probability 1).

Proof.

Each term -p(x) \log_2 p(x) \geq 0, since 0 \leq p(x) \leq 1 implies \log_2 p(x) \leq 0. The sum of non-negative terms is non-negative. Equality holds when all terms are zero, which requires each p(x) to be either 0 or 1 — i.e., X is deterministic.

Proposition 2 (Entropy Maximized by the Uniform Distribution).

For a random variable X over k outcomes, H(X) \leq \log_2 k, with equality if and only if X is uniformly distributed.

Proof.

Let u(x) = 1/k be the uniform distribution. By the log-sum inequality (or equivalently, by the non-negativity of KL divergence):

0 \leq D_{KL}(p \| u) = \sum_x p(x) \log_2 \frac{p(x)}{1/k} = -H(X) + \log_2 k

Rearranging gives H(X) \leq \log_2 k. Equality holds iff D_{KL}(p \| u) = 0, i.e., p = u.

Proposition 3 (Concavity of Entropy).

H(X) is a concave function of the probability distribution p: for distributions p_1, p_2 and \lambda \in [0, 1],

H(\lambda p_1 + (1-\lambda) p_2) \geq \lambda H(p_1) + (1-\lambda) H(p_2)

Proof.

Since f(t) = -t \log t is concave (its second derivative -1/t is negative for t > 0), we have for each x:

-[\lambda p_1(x) + (1-\lambda) p_2(x)] \log[\lambda p_1(x) + (1-\lambda) p_2(x)] \geq -\lambda p_1(x) \log p_1(x) - (1-\lambda) p_2(x) \log p_2(x)

Summing over x gives the result.

Theorem 8 (Khinchin's Uniqueness Theorem).

Shannon entropy is the unique function H(p_1, \ldots, p_k) satisfying four axioms: (1) continuity in the p_i, (2) maximality at the uniform distribution, (3) invariance under adding zero-probability outcomes, and (4) the chain rule H(X, Y) = H(X) + H(Y|X). Any function satisfying these axioms is a positive multiple of H(X) = -\sum p_i \log p_i.

Remark.

Khinchin’s theorem tells us that entropy is not an arbitrary choice — it is the only sensible measure of uncertainty satisfying natural consistency requirements. The chain rule (axiom 4) is the key structural axiom.

Explore the interactive visualization below. Drag the bars to change the probability distribution and watch entropy respond in real time.

Entropy for various distributions


Joint Entropy & Conditional Entropy

When we have two random variables, we can ask about their combined uncertainty and how much knowing one reduces uncertainty about the other.

Definition 3 (Joint Entropy).

The joint entropy of discrete random variables X and Y is

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 p(x, y)

This is the total uncertainty in the pair (X, Y) — the average number of bits needed to encode both outcomes.

Definition 4 (Conditional Entropy).

The conditional entropy of Y given X is

H(Y | X) = \sum_{x \in \mathcal{X}} p(x) \, H(Y | X = x) = -\sum_{x, y} p(x, y) \log_2 p(y | x)

This is the average remaining uncertainty in Y after observing X.

The chain rule connects these quantities.

Theorem 1 (Chain Rule for Entropy).

H(X, Y) = H(X) + H(Y | X)

More generally, for n random variables:

H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_1, \ldots, X_{i-1})

Proof.

H(X, Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)
= -\sum_{x,y} p(x,y) \log_2 [p(x) \, p(y|x)]
= -\sum_{x,y} p(x,y) \log_2 p(x) - \sum_{x,y} p(x,y) \log_2 p(y|x)
= -\sum_x p(x) \log_2 p(x) + \left( -\sum_{x,y} p(x,y) \log_2 p(y|x) \right)
= H(X) + H(Y|X)

The general case follows by induction: p(x_1, \ldots, x_n) = p(x_1) \prod_{i=2}^{n} p(x_i | x_1, \ldots, x_{i-1}), and taking -E[\log_2(\cdot)] of both sides telescopes.

Theorem 2 (Conditioning Reduces Entropy).

H(Y | X) \leq H(Y)

with equality if and only if X and Y are independent.

Proof.

We need to show that H(Y) - H(Y|X) \geq 0. This difference is the mutual information I(X; Y) (defined in the next section), and its non-negativity follows from the non-negativity of KL divergence:

I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)} = D_{KL}(p_{XY} \| p_X p_Y) \geq 0

Equality holds iff p(x,y) = p(x)p(y) for all x, y — independence.

In plain language: information can only reduce (or maintain) uncertainty, never increase it. Knowing something extra about the world cannot make you more confused on average.
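Both the chain rule and Theorem 2 are easy to verify numerically from a joint table. A minimal NumPy sketch with a made-up 2×2 joint distribution (the helper name H is ours):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y); rows index x, columns index y
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

H_xy = H(joint)
H_x, H_y = H(p_x), H(p_y)

# H(Y|X) = sum_x p(x) H(Y | X = x)
H_y_given_x = sum(px * H(row / px) for px, row in zip(p_x, joint))

print(f"H(X,Y)        = {H_xy:.4f} bits")
print(f"H(X) + H(Y|X) = {H_x + H_y_given_x:.4f} bits")   # chain rule: equal
print(f"H(Y|X) = {H_y_given_x:.4f} <= H(Y) = {H_y:.4f}")  # conditioning reduces entropy
```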

Joint and conditional entropy


Mutual Information

Mutual information captures how much knowing one variable tells us about another. It is the fundamental measure of statistical dependence in information theory.

Definition 5 (Mutual Information).

The mutual information between discrete random variables X and Y is

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)

Equivalently:

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log_2 \frac{p(x,y)}{p(x) \, p(y)}

Mutual information is the KL divergence from the joint distribution to the product of marginals: I(X;Y) = D_{KL}(p_{XY} \| p_X p_Y). It measures how far the joint distribution is from independence.

Proposition 4 (Non-negativity of Mutual Information).

I(X; Y) \geq 0, with equality if and only if X and Y are independent.

Proof.

I(X;Y) = D_{KL}(p_{XY} \| p_X p_Y) \geq 0 by the non-negativity of KL divergence (which follows from Jensen’s inequality applied to the convex function -\log). Equality holds iff p(x,y) = p(x)p(y) for all x, y.

Proposition 5 (Symmetry of Mutual Information).

I(X; Y) = I(Y; X).

Proof.

From the definition: I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) + H(X) - H(Y,X) = I(Y;X), since joint entropy is symmetric: H(X,Y) = H(Y,X).

This is remarkable: conditioning is asymmetric (H(Y|X) \neq H(X|Y) in general), but mutual information is perfectly symmetric. Knowing X tells you the same amount about Y as knowing Y tells you about X.

Proposition 6 (Mutual Information and Independence).

I(X; Y) = 0 if and only if X and Y are independent.

Proof.

This is a direct consequence of Proposition 4: I(X;Y) = D_{KL}(p_{XY} \| p_X p_Y) = 0 iff p(x,y) = p(x)p(y) everywhere.

The Information Diagram

The relationships between H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y) are best visualized as a Venn diagram — the information diagram. Two overlapping circles represent H(X) and H(Y); the overlap is I(X;Y); the left crescent is H(X|Y); the right crescent is H(Y|X); and the union is H(X,Y).

Information diagram (Venn-style)

Conditional Mutual Information and Chain Rule

Definition 6 (Conditional Mutual Information).

The conditional mutual information of X and Y given Z is

I(X; Y | Z) = H(X | Z) - H(X | Y, Z)

Theorem 3 (Chain Rule for Mutual Information).

I(X_1, X_2, \ldots, X_n ; Y) = \sum_{i=1}^{n} I(X_i ; Y | X_1, \ldots, X_{i-1})

Proof.

Apply the chain rule for entropy to H(X_1, \ldots, X_n | Y) and H(X_1, \ldots, X_n):

I(X_1, \ldots, X_n ; Y) = H(X_1, \ldots, X_n) - H(X_1, \ldots, X_n | Y)
= \sum_{i} H(X_i | X_1, \ldots, X_{i-1}) - \sum_{i} H(X_i | X_1, \ldots, X_{i-1}, Y)
= \sum_{i} [H(X_i | X_1, \ldots, X_{i-1}) - H(X_i | X_1, \ldots, X_{i-1}, Y)]
= \sum_{i} I(X_i ; Y | X_1, \ldots, X_{i-1})
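The chain rule for mutual information can be checked numerically on a small joint table. A sketch with a hypothetical joint p(x_1, x_2, y) on three binary variables (the table entries are arbitrary):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint p(x1, x2, y) on binary variables; axes = (X1, X2, Y)
joint = np.array([[[0.10, 0.05], [0.08, 0.07]],
                  [[0.20, 0.05], [0.15, 0.30]]])

H_x1   = H(joint.sum(axis=(1, 2)))   # H(X1)
H_y    = H(joint.sum(axis=(0, 1)))   # H(Y)
H_x1y  = H(joint.sum(axis=1))        # H(X1, Y)
H_x1x2 = H(joint.sum(axis=2))        # H(X1, X2)
H_all  = H(joint)                    # H(X1, X2, Y)

I_x1x2_y  = H_x1x2 + H_y - H_all               # I(X1,X2 ; Y)
I_x1_y    = H_x1 + H_y - H_x1y                 # I(X1 ; Y)
I_x2_y_x1 = (H_x1x2 - H_x1) - (H_all - H_x1y)  # I(X2 ; Y | X1)

print(f"I(X1,X2;Y)           = {I_x1x2_y:.6f} bits")
print(f"I(X1;Y) + I(X2;Y|X1) = {I_x1_y + I_x2_y_x1:.6f} bits")  # equal, per Theorem 3
```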


The Data Processing Inequality

The data processing inequality (DPI) is one of the most important results in information theory for machine learning. It formalizes a simple but powerful idea: you cannot create information by processing data.

Definition 7 (Markov Chain).

Random variables X, Y, Z form a Markov chain, written X \to Y \to Z, if Z is conditionally independent of X given Y:

p(z | x, y) = p(z | y) \quad \text{for all } x, y, z

Equivalently, X and Z are independent given Y.

Theorem 4 (Data Processing Inequality).

If X \to Y \to Z is a Markov chain, then

I(X; Z) \leq I(X; Y)

Post-processing cannot increase mutual information. Equality holds if and only if Z is a sufficient statistic of Y for X — that is, I(X; Y | Z) = 0.

Proof.

By the chain rule for mutual information:

I(X; Y, Z) = I(X; Z) + I(X; Y | Z) = I(X; Y) + I(X; Z | Y)

Since X \to Y \to Z is Markov, X \perp Z \mid Y, so I(X; Z | Y) = 0. Therefore:

I(X; Z) + I(X; Y | Z) = I(X; Y)

Since I(X; Y | Z) \geq 0 (non-negativity of conditional MI), we get I(X; Z) \leq I(X; Y).

Equality holds iff I(X; Y | Z) = 0, meaning Y provides no additional information about X beyond what Z already captures.

The DPI has profound implications for representation learning. Every layer in a neural network processes its input: if we have X \to Z_1 \to Z_2 \to \cdots \to Z_L, then I(X; Z_1) \geq I(X; Z_2) \geq \cdots \geq I(X; Z_L). Information about the input can only decrease through the network. A good representation is one that preserves the relevant information about the target while discarding noise.
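The DPI can also be verified exactly (no sampling) by composing two binary symmetric channels; the source distribution and flip probabilities below are illustrative choices:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(joint):
    """I between the two axes of a 2D joint probability table."""
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint)

def bsc(eps):
    """Binary symmetric channel: flips the input bit with probability eps."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

p_x = np.array([0.3, 0.7])      # source distribution (illustrative)
A, B = bsc(0.1), bsc(0.2)       # channels X -> Y and Y -> Z

p_xy = p_x[:, None] * A          # p(x, y) = p(x) p(y|x)
p_xz = p_x[:, None] * (A @ B)    # X -> Z composes the two channels

print(f"I(X;Y) = {mi(p_xy):.4f} bits")
print(f"I(X;Z) = {mi(p_xz):.4f} bits")  # smaller: processing lost information
```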

Explore the DPI interactively — adjust the channel type and noise level to see how information degrades through processing.

Data processing inequality


Differential Entropy

What happens when we move from discrete to continuous random variables? The naive approach — replacing sums with integrals — gives differential entropy, but with some important caveats.

Definition 8 (Differential Entropy).

For a continuous random variable X with probability density function f(x), the differential entropy is

h(X) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x) \, dx \quad \text{[bits]}

assuming the integral exists.

Unlike discrete entropy, differential entropy can be negative — it is not a true measure of “information” in the same absolute sense. However, differences of differential entropies are well-behaved, and mutual information I(X;Y) = h(X) - h(X|Y) is non-negative even for continuous variables.

Key properties

Translation invariance. h(X + c) = h(X) for any constant c. Shifting a distribution does not change its uncertainty.

Non-invariance under scaling. If Y = aX for a \neq 0, then h(Y) = h(X) + \log_2 |a|. Stretching a distribution increases its entropy; compressing it decreases it. This is why differential entropy can be negative — concentrate a distribution enough and h(X) < 0.

Maximum Entropy Distribution

Theorem 5 (Gaussian Maximizes Entropy for Fixed Variance).

Among all continuous distributions with mean \mu and variance \sigma^2, the Gaussian \mathcal{N}(\mu, \sigma^2) uniquely maximizes differential entropy:

h(X) \leq \frac{1}{2} \log_2(2\pi e \sigma^2) \quad \text{[bits]}

with equality if and only if X \sim \mathcal{N}(\mu, \sigma^2).

Proof.

Let f be any density with mean \mu and variance \sigma^2, and let \phi be the density of \mathcal{N}(\mu, \sigma^2). We have:

0 \leq D_{KL}(f \| \phi) = \int f(x) \log_2 \frac{f(x)}{\phi(x)} \, dx = -h(f) - \int f(x) \log_2 \phi(x) \, dx

Now compute \int f(x) \log_2 \phi(x) \, dx. Since \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right):

\int f(x) \log_2 \phi(x) \, dx = -\frac{1}{2}\log_2(2\pi\sigma^2) - \frac{E[(X-\mu)^2]}{2\sigma^2} \log_2 e = -\frac{1}{2}\log_2(2\pi e \sigma^2)

where we used E[(X-\mu)^2] = \sigma^2. Substituting back:

0 \leq -h(f) + \frac{1}{2}\log_2(2\pi e \sigma^2)

so h(f) \leq \frac{1}{2}\log_2(2\pi e \sigma^2) = h(\phi). Equality holds iff D_{KL}(f \| \phi) = 0, i.e., f = \phi.

This result explains why Gaussians appear everywhere in statistics and ML: they are the “least informative” distributions for a given variance. When we assume Gaussian noise or Gaussian priors, we are making the maximum-entropy assumption — committing to the distribution that introduces the least extra structure beyond the constraint on variance.
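A quick closed-form comparison: at matched variance, the uniform distribution has strictly lower differential entropy than the Gaussian, as the theorem predicts (σ² = 1 is an arbitrary choice):

```python
import numpy as np

sigma2 = 1.0  # common variance (arbitrary)

# Closed forms: Gaussian h = 0.5 log2(2*pi*e*sigma^2).
# Uniform on [a, b] has h = log2(b - a) and variance (b - a)^2 / 12,
# so at matched variance h_uniform = 0.5 * log2(12 * sigma^2).
h_gauss   = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
h_uniform = 0.5 * np.log2(12 * sigma2)

print(f"h(Gaussian) = {h_gauss:.4f} bits")
print(f"h(Uniform)  = {h_uniform:.4f} bits")  # lower, since 12 < 2*pi*e ~ 17.08
```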

Differential entropy: Gaussian vs. uniform

Maximum entropy distributions

Mutual information for continuous variables

While differential entropy has quirks (negativity, scale-dependence), mutual information remains perfectly well-defined for continuous variables:

I(X;Y) = h(X) + h(Y) - h(X,Y) = \iint f(x,y) \log_2 \frac{f(x,y)}{f(x) f(y)} \, dx \, dy

This is non-negative and equals zero iff X and Y are independent, just as in the discrete case. The scale-dependent artifacts cancel in the difference.


Source Coding & Compression

Shannon entropy is not just an abstract measure of uncertainty — it has a concrete operational meaning: the entropy of a source is the fundamental limit of lossless data compression. Shannon’s source coding theorem makes this precise.

Prefix codes and Kraft’s inequality

Definition 9 (Prefix Code).

A prefix code (or instantaneous code) assigns a binary codeword c(x) to each symbol x such that no codeword is a prefix of another. Prefix codes are uniquely decodable without lookahead.

Theorem 6 (Kraft's Inequality).

A prefix code with codeword lengths \ell_1, \ell_2, \ldots, \ell_k exists if and only if

\sum_{i=1}^{k} 2^{-\ell_i} \leq 1

Proof.

Necessity. Model the code as a complete binary tree of depth \ell_{\max}, the longest codeword length. Each codeword of length \ell_i “claims” a subtree of 2^{\ell_{\max} - \ell_i} leaves. Since the prefix property forbids overlap, the total number of claimed leaves cannot exceed 2^{\ell_{\max}}:

\sum_{i=1}^{k} 2^{\ell_{\max} - \ell_i} \leq 2^{\ell_{\max}}

Dividing by 2^{\ell_{\max}} gives \sum_i 2^{-\ell_i} \leq 1.

Sufficiency. Given lengths satisfying the inequality, assign codewords greedily in order of non-decreasing length, at each step choosing the lexicographically first available codeword. The Kraft inequality guarantees that codewords never run out.
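Both directions can be exercised in a few lines — checking the Kraft sum and running the greedy (canonical) assignment from the sufficiency argument. A sketch; the helper names are ours, and the lengths are those of the five-symbol example in this section:

```python
def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality."""
    return sum(2.0 ** -l for l in lengths)

def canonical_prefix_code(lengths):
    """Greedy codeword assignment in order of non-decreasing length."""
    code, c, prev = [], 0, 0   # c = next free codeword value at current length
    for l in sorted(lengths):
        c <<= (l - prev)       # extend the counter to the new length
        code.append(format(c, f'0{l}b'))
        c += 1
        prev = l
    return code

lengths = [1, 2, 3, 4, 4]
print(f"Kraft sum = {kraft_sum(lengths)}")  # 1.0 <= 1, so a prefix code exists
print(canonical_prefix_code(lengths))       # ['0', '10', '110', '1110', '1111']
```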

Shannon’s source coding theorem

Theorem 7 (Shannon's Source Coding Theorem).

For a discrete random variable X with entropy H(X):

  1. Converse. Any uniquely decodable code must satisfy E[\ell(X)] \geq H(X).
  2. Achievability. There exists a prefix code with E[\ell(X)] < H(X) + 1.

In particular, the Huffman code achieves H(X) \leq E[\ell_{\text{Huffman}}(X)] < H(X) + 1.

Proof.

Converse. By Kraft’s inequality, any uniquely decodable code with lengths \ell(x) defines a sub-probability distribution q(x) = 2^{-\ell(x)} / Z, where Z = \sum_x 2^{-\ell(x)} \leq 1. Since \ell(x) = -\log_2 q(x) - \log_2 Z:

E[\ell(X)] = -\sum_x p(x) \log_2 q(x) - \log_2 Z

Since -\sum_x p(x) \log_2 q(x) = H(X) + D_{KL}(p \| q) \geq H(X), and -\log_2 Z \geq 0 because Z \leq 1:

E[\ell(X)] = H(X) + D_{KL}(p \| q) - \log_2 Z \geq H(X)

Achievability. Set \ell(x) = \lceil -\log_2 p(x) \rceil. Then \ell(x) < -\log_2 p(x) + 1, so:

E[\ell(X)] < -\sum_x p(x) \log_2 p(x) + 1 = H(X) + 1

These lengths satisfy Kraft’s inequality: \sum_x 2^{-\ell(x)} \leq \sum_x 2^{\log_2 p(x)} = \sum_x p(x) = 1.

The Huffman algorithm builds an optimal prefix code by iteratively merging the two lowest-probability symbols. Explore the construction below.

Sym   p(x)    Code   ℓ(x)
A     0.438   0      1
B     0.219   10     2
C     0.146   110    3
D     0.109   1111   4
E     0.088   1110   4

H(X) = 2.064 bits
E[ℓ] = 2.102 bits
Gap (E[ℓ] − H) = 0.038 bits
H(X) ≤ E[ℓ] < H(X) + 1

Huffman tree example

Source coding: entropy as compression limit

Huffman coding in Python

import heapq

class HuffmanNode:
    def __init__(self, symbol=None, prob=0, left=None, right=None):
        self.symbol = symbol
        self.prob = prob
        self.left = left
        self.right = right

    def __lt__(self, other):
        return self.prob < other.prob

def build_huffman_tree(symbols, probs):
    heap = [HuffmanNode(s, p) for s, p in zip(symbols, probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = HuffmanNode(prob=left.prob + right.prob, left=left, right=right)
        heapq.heappush(heap, merged)
    return heap[0]

def get_codes(node, prefix='', codes=None):
    if codes is None:
        codes = {}
    if node.symbol is not None:
        codes[node.symbol] = prefix if prefix else '0'
    else:
        if node.left:
            get_codes(node.left, prefix + '0', codes)
        if node.right:
            get_codes(node.right, prefix + '1', codes)
    return codes
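As a standalone sanity check of the five-symbol example, the sketch below computes only the optimal code lengths via the two-smallest-merge rule, without building an explicit tree (the helper name is ours):

```python
import heapq

def huffman_lengths(probs):
    """Optimal code lengths: repeatedly merge the two smallest probabilities,
    tracking which symbols sit inside each merged subtree."""
    heap = [(p, [i]) for i, p in enumerate(probs)]  # (prob, symbols in subtree)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)
        p2, s2 = heapq.heappop(heap)
        for i in s1 + s2:    # every symbol in a merged subtree goes 1 bit deeper
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

probs = [0.438, 0.219, 0.146, 0.109, 0.088]
lengths = huffman_lengths(probs)
print(lengths)                           # [1, 2, 3, 4, 4], matching the table
avg = sum(p * l for p, l in zip(probs, lengths))
print(f"E[l] = {avg:.3f} bits")          # 2.102, within 1 bit of H = 2.064
```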

Entropy Rate

So far we have studied entropy for single random variables and pairs. For stochastic processes — sequences of random variables — the relevant quantity is the entropy rate: the per-symbol uncertainty in the long run.

Definition 10 (Entropy Rate).

For a stationary stochastic process \{X_1, X_2, X_3, \ldots\}, the entropy rate is

\mathcal{H} = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1)

when the limit exists. Equivalently:

\mathcal{H} = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)

For an i.i.d. process, \mathcal{H} = H(X_1) — each symbol carries its full entropy. But for processes with memory (like Markov chains), the entropy rate is strictly less than H(X_1) whenever past observations reduce uncertainty about the next symbol.

Proposition 7 (Entropy Rate of a Stationary Markov Chain).

For a stationary Markov chain with transition matrix P and stationary distribution \pi, the entropy rate is

\mathcal{H} = -\sum_{i} \pi_i \sum_{j} P_{ij} \log_2 P_{ij} = \sum_{i} \pi_i \, H(P_{i \cdot})

where H(P_{i \cdot}) is the entropy of the i-th row of the transition matrix.

Proof.

In stationarity, H(X_n | X_{n-1}) = \sum_i P(X_{n-1} = i) \, H(X_n | X_{n-1} = i) = \sum_i \pi_i \, H(P_{i\cdot}).

By the Markov property, H(X_n | X_{n-1}, \ldots, X_1) = H(X_n | X_{n-1}), so the limit defining \mathcal{H} is attained exactly:

\mathcal{H} = H(X_n | X_{n-1}) = \sum_i \pi_i \, H(P_{i\cdot}) = -\sum_i \pi_i \sum_j P_{ij} \log_2 P_{ij}

The entropy rate of a Markov chain is a weighted average of the entropies of each row of the transition matrix, weighted by the stationary distribution. States visited more often contribute more to the overall rate.
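Proposition 7 can be checked directly. A sketch for a hypothetical two-state "sticky" chain (the transition probabilities are illustrative):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical 2-state chain
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

rate = sum(p_i * H(row) for p_i, row in zip(pi, P))
print(f"stationary pi = {pi}")              # [2/3, 1/3]
print(f"entropy rate  = {rate:.4f} bits")
print(f"H(pi)         = {H(pi):.4f} bits")  # larger: memory reduces per-symbol uncertainty
```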

Entropy rate: Markov chains


Computational Notes

Computing entropy in NumPy

import numpy as np

def entropy(probs):
    """Shannon entropy in bits. Convention: 0 log 0 = 0."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # filter zeros
    return -np.sum(p * np.log2(p))

# Examples
probs_uniform = np.array([0.25, 0.25, 0.25, 0.25])
print(f"Uniform(4): H = {entropy(probs_uniform):.4f} bits")  # 2.0000

probs_peaked = np.array([0.9, 0.05, 0.03, 0.02])
print(f"Peaked:     H = {entropy(probs_peaked):.4f} bits")    # 0.6175

Mutual information from a joint table

def mutual_information(joint):
    """I(X;Y) from a 2D joint probability array."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(joint.ravel())

joint = np.array([[0.3, 0.1],
                   [0.1, 0.5]])
print(f"I(X;Y) = {mutual_information(joint):.4f} bits")

Plug-in vs. k-NN mutual information estimation

For continuous data, we can estimate MI using either a plug-in estimator (discretize into bins, compute the histogram MI) or the k-nearest-neighbor estimator of Kraskov, Stögbauer & Grassberger (2004), which avoids binning artifacts:

import numpy as np

# Plug-in estimator: discretize and compute
def mi_plugin(x, y, bins=10):
    """Plug-in MI estimator via 2D histogram (np.histogram2d takes int or array bins)."""
    hist_2d, _, _ = np.histogram2d(x, y, bins=bins)
    hist_2d = hist_2d / hist_2d.sum()  # normalize to joint PMF
    p_x = hist_2d.sum(axis=1)
    p_y = hist_2d.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(hist_2d.ravel())

The plug-in estimator is biased for small samples (tends to overestimate MI) and sensitive to bin width. The k-NN estimator is consistent and adaptive, making it the preferred method for continuous MI estimation.
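The small-sample bias is easy to demonstrate: on independent data, where the true MI is zero, the plug-in estimator returns a clearly positive value. A self-contained sketch (sample size, bin count, and seed are arbitrary choices):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi_plugin(x, y, bins=10):
    """Plug-in MI via a 2D histogram."""
    hist, _, _ = np.histogram2d(x, y, bins=bins)
    joint = hist / hist.sum()
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)   # independent of x: the true MI is 0

print(f"plug-in MI on independent data: {mi_plugin(x, y):.4f} bits")  # > 0: bias
```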

Cross-entropy loss and entropy

The cross-entropy loss used in classification is an information-theoretic quantity:

H(p, q) = -\sum_i p_i \log_2 q_i = H(p) + D_{KL}(p \| q)

For one-hot labels (p is a point mass), H(p) = 0, so H(p, q) = D_{KL}(p \| q). Minimizing cross-entropy loss is therefore equivalent to minimizing the KL divergence D_{KL}(p \| q) between the true label distribution and the model’s predictions.

import torch
import torch.nn.functional as F

# PyTorch cross-entropy loss
logits = torch.tensor([[1.0, 2.0, 0.5, 0.1]])  # raw model outputs
target = torch.tensor([1])                        # true class

loss = F.cross_entropy(logits, target)
print(f"Cross-entropy loss: {loss.item():.4f} nats")

# This equals -log(softmax(logits)[target])
probs = F.softmax(logits, dim=1)
manual_loss = -torch.log(probs[0, target[0]])
print(f"Manual check:       {manual_loss.item():.4f} nats")

Mutual information for feature selection

Rank features by their mutual information with the target variable to identify the most informative features:

I(X_i; Y) = H(Y) - H(Y | X_i)

This non-parametric criterion captures arbitrary (not just linear) dependencies, making it more powerful than correlation for feature selection in classification tasks.
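A minimal sketch of MI-based ranking on synthetic discrete features; the data-generating choices (90% label agreement, 1000 samples, the seed) are arbitrary:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits (expects a 1D array of probabilities)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi_discrete(x, y):
    """I(X;Y) for discrete arrays via the empirical joint PMF."""
    xs = np.unique(x, return_inverse=True)[1]
    ys = np.unique(y, return_inverse=True)[1]
    joint = np.zeros((xs.max() + 1, ys.max() + 1))
    np.add.at(joint, (xs, ys), 1.0)   # count co-occurrences
    joint /= joint.sum()
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint.ravel())

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)                           # binary target
f_informative = np.where(rng.random(1000) < 0.9, y, 1 - y)  # 90% agreement with y
f_noise = rng.integers(0, 2, size=1000)                     # independent noise feature

for name, f in [("informative", f_informative), ("noise", f_noise)]:
    print(f"I({name}; y) = {mi_discrete(f, y):.4f} bits")
```

The informative feature scores near 1 − H_b(0.1) ≈ 0.53 bits, while the noise feature scores near zero — exactly the ranking a feature selector should produce.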

Cross-entropy loss and MI feature selection


Connections & Further Reading

Shannon entropy opens the door to a rich web of information-theoretic quantities and their applications across machine learning, statistics, and beyond.

Downstream on this track

  • KL Divergence & f-Divergences — the asymmetric divergence D_{KL}(p \| q) = H(p, q) - H(p) generalizes to the broader family of f-divergences, each inducing a different geometry on the space of distributions.
  • Rate-Distortion Theory — Shannon’s lossy source coding theorem: the minimum rate R(D) for encoding a source at distortion level D is an optimization over mutual information.
  • Minimum Description Length — model selection via code length: the best model minimizes the total description length (model code + data-given-model code), connecting entropy to statistical learning theory.

Cross-track connections

  • Measure-Theoretic Probability — entropy is E[-\log p(X)]; differential entropy requires the Lebesgue integral; conditional entropy uses conditional expectation via the Radon–Nikodym theorem.
  • Convex Analysis — entropy is concave in p; Jensen’s inequality proves uniform maximality; channel capacity and rate-distortion optimization are convex programs.
  • Information Geometry & Fisher Metric — the Fisher metric is the Hessian of KL divergence, which is defined in terms of entropy. The entropy and divergence functions developed here are the starting point for the geometric structure on statistical manifolds.
  • Bayesian Nonparametrics — the maximum entropy principle selects the least informative prior consistent with known constraints, connecting entropy to Bayesian inference and the choice of non-informative priors.
  • Graph Laplacians & Spectrum — the entropy of a random walk’s stationary distribution H(\pi) measures graph regularity. On a d-regular graph, \pi is uniform with maximum entropy \log_2 n. The spectral gap governs the mixing rate — how fast entropy increases toward equilibrium.


References & Further Reading

  • Elements of Information Theory — Cover & Thomas (2006). The standard reference; Chapters 2–4 cover entropy, mutual information, and source coding.
  • Information Theory, Inference, and Learning Algorithms — MacKay (2003). Chapters 2, 4, and 5 — information theory with a Bayesian and ML perspective.
  • A Mathematical Theory of Communication — Shannon (1948). The foundational paper — defines entropy and mutual information and proves the source coding theorem.
  • Information Theory and Statistics — Kullback (1968). Classical treatment of the connection between information theory and statistical inference.
  • Entropy and Information Theory — Gray (2011). Rigorous measure-theoretic treatment of entropy and entropy rate for stationary processes.
  • Estimating Mutual Information — Kraskov, Stögbauer & Grassberger (2004). The k-nearest-neighbor MI estimator — the standard non-parametric method.