Entropy for continuous random variables is technically called differential entropy. I've always wondered what the differential means, and I finally have an answer.
Shannon's groundbreaking work in information theory^{1} defined information as a measure of surprise: for a discrete random variable $X$, the information content of an outcome is $-\log{p(X)}$, where $p(X)$ is the probability mass. Consequently, the average information, or entropy $H$, is defined as,^{2}

$$H(X) = -\sum_{x} p(x) \log{p(x)}.$$
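As a quick sanity check of the discrete definition, here is a minimal sketch; the coin example and variable names are my own illustration, not from the original.

```python
import numpy as np

# Entropy of a fair coin under the discrete definition -sum p log p.
# (The coin example and variable names are illustrative assumptions.)
p = np.array([0.5, 0.5])          # probability mass of heads/tails
H = -(p * np.log(p)).sum()        # average surprise, in nats
print(H)                          # log 2 ≈ 0.693 nats
```

A fair coin is maximally surprising among two-outcome distributions, which is why its entropy hits the two-outcome maximum of $\log 2$.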
Extending this definition to continuous random variables, however, is tricky as we'll see next.
Discrete probability masses are often visualized as histograms. In a similar spirit, instead of thinking in terms of a continuous random variable $X$, we are going to think in terms of its discretized version $\Delta X$, binned into buckets of width $dX$.^{3}
To construct the entropy of such a discretized distribution, we need to define $p(\Delta X)$. One way is to think in terms of the area of one bin relative to the total area occupied by all bins. For a bin containing $n(\Delta X)$ values, the area is $a = n(\Delta X) \times dX$ (a thin rectangle). With the total area across all bins $A = \sum a$, the probability of a bin is $p(\Delta X) = a/A$. This construction satisfies the law of total probability, $\sum p(\Delta X) = 1$, i.e. the probabilities of all bins sum to $1$.
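The bin-area construction can be sketched numerically. The Gaussian samples, the bin range, and the variable names below are illustrative assumptions on my part:

```python
import numpy as np

# Sketch of the bin-area construction: p(ΔX) = a / A.
# Gaussian samples, bin range, and names are illustrative assumptions.
rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)                  # draws of X

dX = 0.1                                            # bin width
counts, edges = np.histogram(samples, bins=np.arange(-5.0, 5.0 + dX, dX))

a = counts * dX                                     # bin areas: n(ΔX) · dX
A = a.sum()                                         # total area across all bins
p = a / A                                           # p(ΔX) = a / A

print(p.sum())                                      # sums to 1 (law of total probability)
```

Note that the $dX$ factors cancel in $a/A$, so the bin probabilities end up being the normalized counts, exactly as the next step exploits.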
Now that we have a normalized histogram, we can instead work with normalized counts, which we denote by $q(\Delta X)$. Under such a normalization, the area itself defines the probability of the bin:

$$p(\Delta X) = q(\Delta X)\, dX.$$
Instead of our original continuous random variable $X$, let us now work with this definition of probability for the discretized version $\Delta X$.
Let's plug the definition of discretized probability $p(\Delta X) = q(\Delta X)\, dX$ into entropy. We have

$$H = -\sum p(\Delta X) \log{p(\Delta X)} = -\sum q(\Delta X)\, dX \log{\left( q(\Delta X)\, dX \right)} = -\sum q(\Delta X) \log{q(\Delta X)}\, dX - \log{dX},$$

where the last step uses the normalization $\sum q(\Delta X)\, dX = 1$.
As the bin width $dX$ approaches zero, the entropy becomes:

$$\lim_{dX \to 0} H = -\int q(X) \log{q(X)}\, dX - \lim_{dX \to 0} \log{dX} = -\int q(X) \log{q(X)}\, dX + \infty.$$
This result is trouble - the entropy of every continuous random variable is infinite. In principle, this result is not wrong - as the precision of our measurement of a continuous quantity increases (i.e. the bin width decreases), the average surprise in the measurement increases without bound. But it leaves us with an unworkable definition of entropy for continuous random variables, since we would always need to know the bin width.
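The divergence is easy to see numerically: each tenfold refinement of the bin width adds roughly $\log{10} \approx 2.30$ nats to the discretized entropy. A sketch, assuming a standard normal distribution and a fixed bin range for concreteness:

```python
import numpy as np

# Sketch: the discretized entropy grows like -log(dX) as the bins shrink.
# The standard normal samples and bin range are illustrative assumptions.
rng = np.random.default_rng(0)
samples = rng.normal(size=200_000)

def discretized_entropy(samples, dX):
    counts, _ = np.histogram(samples, bins=np.arange(-6.0, 6.0 + dX, dX))
    p = counts / counts.sum()          # bin probabilities p(ΔX)
    p = p[p > 0]                       # drop empty bins (0 · log 0 = 0)
    return -(p * np.log(p)).sum()

# Each tenfold refinement of dX adds roughly log(10) ≈ 2.30 nats.
entropies = [discretized_entropy(samples, dX) for dX in (0.1, 0.01, 0.001)]
print(entropies)
```

The gap between successive values stays near $\log{10}$ rather than shrinking, which is exactly the $-\log{dX}$ term refusing to converge.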
To work with the entropy of continuous random variables, the resolution is to keep only the interesting term and drop the divergent bin-width term $-\log{dX}$. The differential entropy is therefore given by,

$$h(X) = -\int q(X) \log{q(X)}\, dX.$$
And therefore, the differential comes from ignoring the constant bin-width term, which would otherwise force the entropy of every continuous random variable to be infinite.
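As a sanity check on this definition, here is a sketch that approximates the differential entropy of a standard normal with a Riemann sum and compares it against the known closed form $\tfrac{1}{2}\log{(2\pi e)} \approx 1.4189$ nats; the choice of distribution and grid are my assumptions.

```python
import numpy as np

# Sketch: Riemann-sum approximation of the differential entropy of a
# standard normal, compared with the closed form 0.5 * log(2πe).
# The grid and the distribution are illustrative choices.
dX = 0.001
x = np.arange(-8.0, 8.0, dX)
q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # density q(X)

h = -(q * np.log(q)).sum() * dX              # -∫ q(X) log q(X) dX
print(h)                                     # ≈ 1.4189
```

Unlike the discretized entropy, this value does not depend on $dX$ once the grid is fine enough, which is the whole point of dropping the $-\log{dX}$ term.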
This definition is often clear from context and not made explicit. Notably, in cases involving the comparison of two continuous distributions (e.g. the KL-divergence), the dropped bin-width term appears in both quantities, cancels out, and causes no trouble.
Claude E. Shannon. "A mathematical theory of communication." Bell Syst. Tech. J. 27 (1948): 623-656. https://ieeexplore.ieee.org/document/6773024

David John Cameron MacKay. "Information Theory, Inference, and Learning Algorithms." IEEE Transactions on Information Theory 50 (2004): 2544-2545. https://www.inference.org.uk/mackay/itila/

James V. Stone. "Information Theory: A Tutorial Introduction." ArXiv abs/1802.05968 (2015). https://arxiv.org/abs/1802.05968