## Meaning of entropy, Kullback–Leibler distance and mutual information

Recently I came across a nice read about entropy by Cover and Thomas. I want to summarize what entropy ‘physically’ means from an information-coding perspective.
The definition of entropy for a discrete random variable is:
$H(X) = -\sum\limits_{x \in \text{Range}(X)} P(X=x) \log_2{P(X=x)}$
Entropy does not depend on the values that $X$ takes; it describes the distributional properties of $X$. The unit of entropy is the bit. To express a number in bits, take its base-2 logarithm.
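To make the definition concrete, here is a minimal Python sketch (the distributions below are made-up examples, not taken from the text) that evaluates the formula:

```python
from math import log2

def entropy(probs):
    """H(X) = -sum of p * log2(p), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1 bit
print(entropy([0.25] * 4))     # uniform over 4 values: 2 bits
print(entropy([0.9, 0.1]))     # skewed coin: about 0.47 bits
```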
What does entropy measure? Suppose person $A$ observes the value of $X$ and needs to communicate this value to person $B$ by sending a binary message. The message can be of any length. Person $A$ wants to choose an encoding scheme that minimizes the expected length of the message. A good strategy is to order the possible values of $X$ by their probabilities in decreasing order. Let's denote this ordering by $x(0),\ldots,x(n)$. Among many possible representations, the following will work:
$'0' \rightarrow x(0), '1' \rightarrow x(1), '01' \rightarrow x(2), \ldots$
and so on. The details of this representation are less relevant. The key point is to encode the most probable values of $X$ with shorter strings. One can show that under the optimal representation the expected length of the message lies between $H(X)$ and $H(X) + 1$. This is exactly what entropy measures. Notice that if each of the possible values of $X$ is equally likely, then no ‘smart’ representation can reduce the average length of the message. This is the case of maximal entropy, which corresponds to the uniform distribution.
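One standard construction that attains this bound is Huffman coding (not mentioned in the text above). A small sketch with arbitrary example probabilities makes the claim about the expected length checkable:

```python
import heapq
from math import log2

def huffman_code_lengths(probs):
    """Return the Huffman codeword length for each symbol (a minimal sketch)."""
    # Heap entries: (probability, unique tiebreaker, symbols under this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1            # merging two subtrees adds one bit to every symbol below
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

probs = [0.5, 0.25, 0.15, 0.1]                  # P(X = x(0)), ..., P(X = x(3))
H = -sum(p * log2(p) for p in probs)
L = sum(p * l for p, l in zip(probs, huffman_code_lengths(probs)))
print(H, L)                                      # here H is about 1.74 and L = 1.75
```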
Kullback–Leibler distance:
$D(p \vert\vert q) = \sum\limits_{x \in \text{Range}(X)} P(X=x) \log_2{\frac{P(X=x)}{Q(X=x)}}$
This quantity measures the ‘distance’ between two distributions (it is not symmetric, so it is not a metric in the strict sense). It measures how much longer, on average, the message becomes if its encoding scheme is optimized for distribution $Q$ when the true distribution is $P$.
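Numerically, this is the cross-entropy of $P$ with respect to $Q$ minus the entropy of $P$. A minimal Python sketch with made-up distributions:

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average codeword length when the code is built for q but the data follow p
    (ignoring the at-most-one-bit rounding overhead)."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]    # true distribution P
q = [1/3, 1/3, 1/3]      # distribution Q the code was optimized for
print(cross_entropy(p, q) - entropy(p))   # extra bits per symbol, about 0.085
print(kl(p, q))                           # the same number: D(p || q)
```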
Mutual information:
$I(X,Y) = \sum\limits_{(x,y) \in \text{Range}((X,Y))} P(X=x,Y=y) \log_2{\frac{P(X=x,Y=y)}{P(X=x)P(Y=y)}}$
This quantity measures how much longer, on average, the message describing the pair $(X, Y)$ becomes if its encoding scheme is optimized under the assumption that $X$ and $Y$ are independent, using only knowledge of their marginals.
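Equivalently, $I(X,Y) = D\big(P(X,Y) \,\vert\vert\, P(X)P(Y)\big)$, the Kullback–Leibler distance between the joint distribution and the product of the marginals. A minimal Python sketch with a made-up joint distribution:

```python
from math import log2

# Hypothetical joint distribution P(X = x, Y = y): rows index x, columns index y.
joint = [[0.30, 0.10],
         [0.10, 0.50]]

px = [sum(row) for row in joint]            # marginal distribution of X
py = [sum(col) for col in zip(*joint)]      # marginal distribution of Y

mi = sum(pxy * log2(pxy / (px[i] * py[j]))
         for i, row in enumerate(joint)
         for j, pxy in enumerate(row) if pxy > 0)
print(mi)   # extra bits per pair paid for coding (X, Y) as if they were independent
```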
