Posted by: Chris Brew | September 15, 2009

## Mutual information

The formula for mutual information between two random variables  $X$ and $Y$ is

$I(X;Y)= \sum_{x \in X, y \in Y} p(x , y) \log \frac{p(x,y)}{p(x)p(y)}$

A random variable is little more than a set $S$ of outcomes $s \in S$, each associated with a probability $p(s)$. Because of the way probabilities must work, there are rules about the relationships that must hold between the joint probability $p(x,y)$, the marginal probabilities $p(x)$ and $p(y)$, and the conditional probabilities $p(x|y)$ and $p(y|x)$.
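These relationships are easy to check numerically. Here is a minimal sketch (the joint distribution is hypothetical) that derives the marginals and conditionals from a joint over two binary variables and confirms that $p(x,y) = p(x|y)\,p(y)$:

```python
# Joint probabilities p(x, y) for x, y in {0, 1}; the numbers are made up.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.2, (1, 1): 0.3}

# Marginals: sum the joint over the other variable.
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# Conditionals: p(x | y) = p(x, y) / p(y).
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for (x, y) in joint}

# The product rule must hold exactly: p(x, y) = p(x | y) * p(y).
for (x, y), p in joint.items():
    assert abs(p - p_x_given_y[(x, y)] * p_y[y]) < 1e-12
```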

A common pattern in statistical NLP is to use estimates of the components of $I(X;Y)$ to define properties of either individual words $x$ or word pairs $x,y$. Peter Turney’s article on PMI-IR (http://arxiv.org/abs/cs.LG/0212033) nicely explains what is going on here.

You can write two expressions for what Turney calls pointwise mutual information by slightly changing the summation in the expression above.

For a word $x \in X$ we have:

$PMI(x)= \sum_{ y \in Y} p(x , y) \log \frac{p(x,y)}{p(x)p(y)}$

and for a word pair $x \in X, y \in Y$ we have:

$PMI(x,y)= p(x , y) \log \frac{p(x,y)}{p(x)p(y)}$
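Both quantities are easy to compute from co-occurrence counts. The sketch below uses made-up bigram counts (the words and numbers are purely illustrative, with $X$ the first word and $Y$ the second) and checks that the per-word terms sum back to $I(X;Y)$:

```python
import math

# Hypothetical bigram counts; X is the first word, Y the second.
counts = {("swiss", "bank"): 20, ("swiss", "cheese"): 5,
          ("river", "bank"): 10, ("river", "cheese"): 1}
total = sum(counts.values())

# Estimate the joint and marginal probabilities from the counts.
p_xy = {pair: c / total for pair, c in counts.items()}
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

def pmi_pair(x, y):
    """The per-pair term: p(x,y) * log( p(x,y) / (p(x) p(y)) )."""
    p = p_xy[(x, y)]
    return p * math.log(p / (p_x[x] * p_y[y]))

def pmi_word(x):
    """The per-word term: the per-pair terms summed over y."""
    return sum(pmi_pair(xx, y) for (xx, y) in p_xy if xx == x)

# Summing either decomposition over everything recovers I(X;Y).
mi = sum(pmi_pair(x, y) for (x, y) in p_xy)
```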

Even though the mathematics dictates that $I(X;Y)$ is always non-negative, the PMI functions can produce either positive or negative quantities. In practice, the distributions that arise in NLP tend to yield large positive values for adjacent word pairs that intuitively feel strongly associated, such as “Swiss bank”. However, this is an empirical rather than a mathematical fact. Similarly, words $x$ that occur in what intuitively seem like “highly predictable contexts” will tend to have large positive $\mathrm{PMI}(x)$.
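To see how a single term goes negative, consider a pair that co-occurs less often than chance would predict, so that $p(x,y) < p(x)p(y)$ and the log is negative. The probabilities below are hypothetical:

```python
import math

# Made-up probabilities for a pair seen LESS often than chance predicts.
p_x, p_y, p_xy = 0.1, 0.2, 0.005   # chance level would be 0.1 * 0.2 = 0.02

# The per-pair PMI term: negative, because p(x,y) < p(x) * p(y).
term = p_xy * math.log(p_xy / (p_x * p_y))
```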

The “intuitively seems like” business above should alert you to the possibility that the PMI values may not actually correspond to anything meaningful. $I(X;Y)$ is definitely meaningful, but it doesn’t tell you much about individual words or word pairs.

There is a chain rule relating joint and conditional entropy:

$H(X,Y) = H(X) + H(Y|X) = H(Y)+H(X|Y)$

where $H(X|Y)$ is the conditional entropy and is defined as:

$H(X|Y) = -\sum_{x,y} p(x,y)\log p(x|y)$

A little algebra rearranges the chain rule expression to give

$I(X;Y) \equiv H(Y) - H(Y|X) = H(X) - H(X|Y)$

This is the $I(X;Y)$ that we call mutual information. The chain rule form makes it obvious that it is all about the difference between guessing one of the variables outright and guessing one when you already know the other. It should now also be intuitive that mutual information is non-negative, since it can be no harder to guess something when you are given a clue than it is to guess it outright: $H(Y|X) \le H(Y)$. As mentioned above, this does not guarantee that all the PMI functions that make up $I(X;Y)$ will be positive.
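The identities above can be verified numerically. A sketch on a small hypothetical joint distribution confirms that the directly computed $I(X;Y)$ equals $H(Y) - H(Y|X)$ and is non-negative:

```python
import math

# A hypothetical joint distribution over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.2, (1, 1): 0.3}

# Marginals derived from the joint above.
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.6, 1: 0.4}

def H(dist):
    """Entropy of a distribution given as a dict of probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

h_xy, h_x, h_y = H(joint), H(p_x), H(p_y)

# By the chain rule, H(Y|X) = H(X,Y) - H(X).
h_y_given_x = h_xy - h_x

# I(X;Y) computed directly from its definition.
mi_direct = sum(p * math.log(p / (p_x[x] * p_y[y]))
                for (x, y), p in joint.items())

assert abs(mi_direct - (h_y - h_y_given_x)) < 1e-12
assert mi_direct >= 0
```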