Differential Entropy

What is "differential" in differential entropy?
Jun 12, 2023

Entropy for continuous random variables is technically called differential entropy. I've always wondered what the differential means, and I finally have an answer.

Discrete Random Variables

Shannon's groundbreaking work in information theory [1] defined information as a measure of surprise. Specifically, for a discrete random variable $X$, the information content is $-\log{p(X)}$, where $p(X)$ is the probability mass. Consequently, the average information, or entropy $H$, is defined as [2]

$$H(X) = -\sum_{i} p(X_i)\log{p(X_i)}.$$
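As a quick sanity check of this definition, here is a minimal NumPy sketch (the helper name `discrete_entropy` is just illustrative, and it uses the natural log, so the result is in nats rather than Shannon's original bits):

```python
import numpy as np

def discrete_entropy(p):
    """Average surprise -sum_i p_i log p_i of a probability mass vector, in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p_i = 0 contribute nothing, by convention
    return float(-np.sum(p * np.log(p)))

# A fair six-sided die: every face is equally surprising, so the entropy is log(6).
print(discrete_entropy(np.ones(6) / 6), np.log(6))  # 1.7917..., 1.7917...
```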

Extending this definition to continuous random variables, however, is tricky as we'll see next.

Continuous Random Variables

Discrete probability masses are often visualized as histograms. In a similar spirit, instead of thinking in terms of a continuous random variable $X$, we are going to think in terms of its discretized version $\Delta X$, binned into buckets of width $dX$ [3].

To construct the entropy of such a discretized distribution, we need to define $p(\Delta X)$. One way is to think in terms of the area of one bin relative to the total area occupied by all bins. For $n(\Delta X)$ values in a bin, the area of that bin is $a = n(\Delta X) \times dX$ (a thin rectangle). With the total area across all bins $A = \sum a$, the probability of a bin is $p(\Delta X) = a/A$. This construction satisfies the law of total probability, $\sum p(\Delta X) = 1$, i.e. the probabilities of all bins sum to $1$.

Now that we have a normalized histogram, we can instead work with normalized counts, which we denote by $q(\Delta X)$. Under such a normalization, the area itself defines the probability of the bin:

$$p(\Delta X) = q(\Delta X) \times dX.$$
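Both constructions are easy to verify numerically. Below is a small sketch, assuming NumPy and standard-normal samples purely for illustration; with `density=True`, `np.histogram` returns exactly the normalized counts $q(\Delta X)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # samples of a continuous random variable X

counts, edges = np.histogram(x, bins=100)               # raw counts n(dX) per bin
q, _ = np.histogram(x, bins=edges, density=True)        # normalized counts q(dX)
dX = np.diff(edges)                                     # bin widths

a = counts * dX             # area of each thin rectangle
p_from_area = a / a.sum()   # p = a / A
p_from_q = q * dX           # p = q * dX

print(np.allclose(p_from_area, p_from_q))  # True: both constructions agree
print(p_from_q.sum())                      # ≈ 1.0: probabilities sum to one
```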

Instead of our original continuous random variable $X$, let us now work with this definition of probability for the discretized version $\Delta X$.

Entropy of Discretized Random Variable

Let's plug the definition of the discretized probability $p(\Delta X)$ into the entropy. We have

$$
\begin{aligned}
H(\Delta X) &= - \sum p(\Delta X) \log{p(\Delta X)} \\
&= -\sum q(\Delta X)\, dX \log{q(\Delta X)} - \log{dX} \times \underbrace{\sum q(\Delta X)\, dX}_{\sum p(\Delta X) = 1} \\
&= - dX \left[\sum q(\Delta X) \log{q(\Delta X)}\right] - \log{dX}
\end{aligned}
$$

As the bin width $dX$ approaches zero, the entropy becomes:

$$H(X) = -\int_{\mathcal{X}} q(X)\log{q(X)}\, dX + \infty$$

This result is troubling: the entropy of every continuous random variable is infinite. In principle, the result is not wrong; as the precision of our continuous quantity's measurement increases (i.e. the bin width decreases), the average surprise in the measurement increases. But it leaves us with an unworkable definition of entropy for continuous random variables, since we always need to know the bin width.
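The blow-up is easy to watch numerically. The sketch below (again assuming standard-normal samples, chosen only for illustration) shrinks the bin width and shows the discretized entropy $H(\Delta X)$ growing without bound, while the width-corrected quantity $H(\Delta X) + \log{dX}$ settles at a finite value (exactly the differential entropy defined next):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)  # samples of a continuous random variable X

for n_bins in [10, 100, 1_000, 10_000]:
    q, edges = np.histogram(x, bins=n_bins, density=True)
    dX = edges[1] - edges[0]
    p = q * dX                  # bin probabilities p(ΔX) = q(ΔX) * dX
    p = p[p > 0]
    H = -np.sum(p * np.log(p))  # entropy of the discretized variable
    print(f"dX={dX:.4f}  H(ΔX)={H:.3f}  H(ΔX)+log(dX)={H + np.log(dX):.3f}")

# H(ΔX) keeps growing as dX shrinks; H(ΔX) + log(dX) stays near 1.42 nats,
# the differential entropy of a standard normal.
```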

Differential Entropy

To work with the entropy of continuous random variables, the resolution is to keep only the interesting term and drop the constant width term $-\log{dX}$. The differential entropy is therefore given by

$$H_{\mathrm{dif}}(X) = H(X) - \log{(1/dX)} = -\int_{\mathcal{X}} q(X)\log{q(X)}\, dX$$

And therefore,

the differential comes from ignoring the constant width term, which would otherwise force the entropy to always be infinite.
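As a sanity check of this definition, the sketch below approximates the integral for a standard normal with a simple Riemann sum and compares it against the standard closed form $\tfrac{1}{2}\log(2\pi e) \approx 1.42$ nats (the grid and density are my choices, purely for illustration):

```python
import numpy as np

# Differential entropy of N(0, 1): Riemann sum of -q(X) log q(X) over a fine
# grid, compared against the closed form 0.5 * log(2 * pi * e).
x = np.linspace(-10.0, 10.0, 200_001)
dX = x[1] - x[0]
q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # density of the standard normal

h_dif = -np.sum(q * np.log(q)) * dX
print(h_dif, 0.5 * np.log(2 * np.pi * np.e))  # both ≈ 1.4189
```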

This convention is often clear from context and not made explicit. Notably, in quantities that compare two continuous distributions (e.g. the KL divergence), the $-\log{dX}$ term cancels out and causes no trouble.
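To see the cancellation concretely, here is a sketch with two Gaussians (the helper name and the choice of $\mathcal{N}(0, 1)$ versus $\mathcal{N}(1, 2^2)$ are mine, purely for illustration). Because both bin probabilities carry the same $dX$ factor, it cancels inside the log-ratio, and the discretized KL divergence is essentially independent of the bin width:

```python
import numpy as np

def discretized_kl(dX):
    """KL divergence between two binned Gaussians, N(0, 1) || N(1, 2^2)."""
    x = np.arange(-12.0, 12.0, dX)                               # bin centers
    f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)                   # density of N(0, 1)
    g = np.exp(-(x - 1)**2 / (2 * 4)) / np.sqrt(2 * np.pi * 4)   # density of N(1, 4)
    p, q = f * dX, g * dX                                        # bin probabilities
    return np.sum(p * np.log(p / q))  # the dX factors cancel inside the ratio

for dX in [0.1, 0.01, 0.001]:
    print(dX, discretized_kl(dX))  # ≈ 0.4431 regardless of bin width

# Closed form for these two Gaussians: log(2) + (1 + 1) / (2 * 4) - 1/2
print(np.log(2) + 2 / 8 - 0.5)  # ≈ 0.4431
```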

Footnotes

1. Claude E. Shannon. "A mathematical theory of communication." Bell Syst. Tech. J. 27 (1948): 623-656. https://ieeexplore.ieee.org/document/6773024

2. David J. C. MacKay. "Information Theory, Inference, and Learning Algorithms." Cambridge University Press (2003). https://www.inference.org.uk/mackay/itila/

3. James V. Stone. "Information Theory: A Tutorial Introduction." ArXiv abs/1802.05968 (2015). https://arxiv.org/abs/1802.05968

© 2023 Sanyam Kapoor