Recently I came across a nice read about entropy by Cover and Thomas. I want to summarize what entropy ‘physically’ means from an information-coding perspective.
The definition of entropy for a discrete-valued random variable $X$ with probability mass function $p(x)$ is:
$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
Entropy does not depend on the values that $X$ takes, only on their probabilities. It describes distributional properties of $X$. With base-2 logarithms, the unit of entropy is the bit. To express a number in bits, take its base-2 logarithm.
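To make the point about values versus probabilities concrete, here is a minimal sketch. The function name `entropy_bits` and the example variables are my own illustrative choices, not something from the book.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution: -sum p * log2(p)."""
    # Outcomes with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two variables with very different values but identical probabilities.
coin = {"heads": 0.5, "tails": 0.5}
temps = {-40.0: 0.5, 1000.0: 0.5}

h_coin = entropy_bits(coin.values())
h_temps = entropy_bits(temps.values())

# A non-uniform distribution over four values.
h_skewed = entropy_bits([0.5, 0.25, 0.125, 0.125])

print(h_coin, h_temps, h_skewed)
```

Both two-valued variables get the same entropy because only the probabilities enter the computation.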
What does entropy measure? Suppose person $A$ observes the value of $X$ and needs to communicate this value to person $B$ by sending a binary message. The message can be of any length. Person $A$ wants to choose an encoding scheme that minimizes the expected length of the message. A good strategy is to order the possible values of $X$ by decreasing probability. Let's denote this ordering by $x_1, x_2, \ldots$, so that $p(x_1) \ge p(x_2) \ge \ldots$. Among many possible representations, one that works assigns shorter binary strings to the earlier values in this ordering and longer strings to the later ones.
The details of this representation are less relevant. The key point is to encode the most probable values of $X$ by shorter strings. One can show that under the optimal (prefix-free) representation the expected length of the string is between $H(X)$ and $H(X) + 1$. This is exactly what entropy measures. Notice that if all possible values of $X$ are equally likely, then no ‘smart’ representation can help reduce the average length of the string. This is the case of maximal entropy, which corresponds to the uniform distribution.
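The bound on the expected length can be checked numerically. The sketch below uses Huffman coding, which is one standard way to build an optimal prefix-free code; it is my choice for illustration and not necessarily the representation discussed above. The function name `huffman_lengths` and the example probabilities are assumptions.

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword length of each symbol under a binary Huffman code."""
    # Heap items: (probability, tie_breaker, symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol inside them.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Skewed distribution: probable values get shorter codewords.
probs = [0.4, 0.3, 0.2, 0.1]
lengths = huffman_lengths(probs)
expected_len = sum(p * l for p, l in zip(probs, lengths))
H = entropy_bits(probs)

# Uniform distribution: every codeword has the same length.
uniform = [0.25] * 4
uniform_lengths = huffman_lengths(uniform)

print(lengths, expected_len, H)
print(uniform_lengths, entropy_bits(uniform))
```

For the skewed distribution the expected length lands between the entropy and the entropy plus one bit. For the uniform one, all codewords come out the same length, so there is nothing to gain from a clever ordering.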
Kullback–Leibler distance between two distributions $p$ and $q$:
$$D(p \,\|\, q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}$$
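This quantity measures how far the distribution $q$ is from the distribution $p$. It is not a true distance, since it is not symmetric in $p$ and $q$. It measures how much longer, on average, the message becomes if its encoding scheme is optimized for the distribution $q$ when the true distribution is $p$.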
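Here is a small sketch of the "extra bits" reading. It assumes idealized codeword lengths of $-\log_2 q(x)$ bits for a code tuned to $q$, which are whole numbers in this example because the probabilities are powers of 1/2. The function names and the two distributions are illustrative assumptions.

```python
import math

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average length when the code uses -log2 q(x) bits but values follow p."""
    # Assumes q(x) > 0 wherever p(x) > 0.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """Kullback-Leibler distance D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution
q = [0.25, 0.25, 0.5]   # distribution the code was optimized for

best_len = entropy_bits(p)                # code matched to p
mismatched_len = cross_entropy_bits(p, q) # code matched to q
extra_bits = kl_bits(p, q)

print(best_len, mismatched_len, extra_bits)
```

The gap between the two average lengths equals the Kullback–Leibler distance, here a quarter of a bit per message.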
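Mutual information between two random variables $X$ and $Y$ with joint distribution $p(x, y)$ and marginals $p(x)$ and $p(y)$:
$$I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}$$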
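This quantity measures how much longer, on average, a message encoding the pair $(X, Y)$ becomes if its encoding scheme is optimized under the assumption that $X$ and $Y$ are independent, using only knowledge of their marginals. Equivalently, it is the Kullback–Leibler distance between the joint distribution and the product of the marginals.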
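As a rough illustration, mutual information can be computed directly from a joint probability table. The function name `mutual_information_bits` and the two toy joint distributions below are my own assumptions.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) in bits from a dict mapping (x, y) -> joint probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        # Marginals are obtained by summing the joint over the other variable.
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

# Y is an exact copy of X: ignoring the dependence wastes one bit per pair.
copied = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

# X and Y are independent fair bits: assuming independence costs nothing.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

mi_copied = mutual_information_bits(copied)
mi_independent = mutual_information_bits(independent)

print(mi_copied, mi_independent)
```

When $Y$ copies $X$, a code built from the marginals alone spends two bits on a pair that needs only one, and that wasted bit is the mutual information.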

