Meaning of entropy, Kullback–Leibler distance and mutual information

Recently I came across a nice read about entropy by Cover and Thomas. I want to summarize what entropy ‘physically’ means from information coding perspective.
The definition of entropy for a discretely valued random variable is:
H(X) = \sum\limits_{x \in \text{Range}(X)} P(X=x) \log_2{P(X=x)}
Entropy does not depend on the values that X takes. It describes distributional properties of X. The unit for entropy is 1 bit. To express some number in bits take base 2 log of it.
What does entropy measure? Suppose person A observes the value of X, and needs to communicate this value to person B by sending a binary message. This message can be of any length. Person A wants to choose encoding scheme to minimize the expected length of message. A good strategy is to order possible values of X according to their probabilities in decreasing order. Lets denote this ordering by x(0),\ldots,x(n). Among many possible representations, the following will work:
'0' \rightarrow x(0), '1' \rightarrow x(1), '01' \rightarrow x(2), \ldots
and so on. The details of this representation are less relevant. The key simple point is to encode most probable values of X by shorter strings. One can show, that under the optimal representation the expected length of the string will be between H(X) and H(X) + 1. This is exactly what entropy measures. Notice, if each of the possible values of X is equally likely, then no ‘smart’ representation can help reduce the average length of the string. This is the case of a maximal entropy which corresponds to the uniform distribution.
Kullback–Leibler distance:
D(p \vert\vert q) = \sum\limits_{x \in \text{Range}(X)} P(X=x) \log_2{\frac{P(X=x)}{Q(X=x)}}
This quantity measures the distance between two distributions. It measures by how longer, on average, message becomes, if its encoding scheme is optimized for distribution Q, when true distribution is P.
Mutual information:
I(X,Y) = \sum\limits_{(x,y) \in \text{Range}((X,Y))} P(X=x,Y=y) \log_2{\frac{P(X=x,Y=y)}{P(X=x)P(Y=y)}}
This quantity measures by how longer message becomes, on average, if its encoding scheme is optimized based on assumption that X and Y are independent and using only knowledge of their marginals.

This entry was posted in Machine Learning, Math, Probability Theory and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s