Mutual Information (2×2)
Mutual information between two binary variables from joint probabilities.
This public page keeps the free explanation visible and leaves premium worked solving, advanced walkthroughs, and saved study tools inside the app.
Core idea
Overview
Mutual Information quantifies the statistical dependence between two discrete random variables by measuring how much information is shared between them. In the 2×2 contingency case, it calculates the Kullback-Leibler divergence between the joint probability distribution and the product of the marginal distributions of two binary variables.
When to use: Apply this formula when analyzing the relationship between two binary variables, such as comparing a test result with the presence of a disease. It is preferred over linear correlation when you need to capture non-linear dependencies or general statistical association.
Why it matters: It is a foundational concept in communication theory for calculating channel capacity and in machine learning for feature selection. High mutual information indicates that knowing the state of one variable significantly reduces uncertainty about the other.
Symbols
Variables
I(X;Y) = Mutual Information, p00 = P(X=0,Y=0), p01 = P(X=0,Y=1), p10 = P(X=1,Y=0), p11 = P(X=1,Y=1)
Walkthrough
Derivation
Derivation of Mutual Information from a 2×2 Joint Table
Mutual information sums p(x,y) ln(p(x,y)/(p(x)p(y))) over all pairs.
- X and Y are binary.
- Joint probabilities p00,p01,p10,p11 sum to 1.
Start from the definition:
Mutual information quantifies dependence between X and Y.
Compute marginals from the 2×2 table:
You need p(x) and p(y) to form the ratio p(x,y)/(p(x)p(y)).
Sum the four terms (p00, p01, p10, p11):
Each non-zero joint probability contributes a term. By convention, 0·ln(0)=0.
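The three derivation steps above (compute marginals, form the ratio, sum the four terms) can be sketched in Python. The function name is illustrative, not from the page; it returns the result in nats, matching the ln-based formula.

```python
import math

def mutual_information_2x2(p00, p01, p10, p11):
    """Mutual information (in nats) of two binary variables
    from their 2x2 joint probabilities. Illustrative helper name."""
    joint = {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): p11}
    if abs(sum(joint.values()) - 1.0) > 1e-9:
        raise ValueError("joint probabilities must sum to 1")
    # Marginals: row sums give p(x), column sums give p(y).
    px = {0: p00 + p01, 1: p10 + p11}
    py = {0: p00 + p10, 1: p01 + p11}
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0:  # convention: 0 * ln(0) = 0, so zero cells contribute nothing
            total += p * math.log(p / (px[x] * py[y]))
    return total

# Independent case: every cell equals the product of its marginals, so I = 0.
print(mutual_information_2x2(0.25, 0.25, 0.25, 0.25))  # → 0.0
# Perfectly correlated fair coin: I = ln 2 ≈ 0.693 nats.
print(mutual_information_2x2(0.5, 0.0, 0.0, 0.5))
```

The early normalization check and the `p > 0` guard implement the two conventions the derivation relies on.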
Result
Visual intuition
Graph
Graph unavailable for this formula.
Conceptually, for fixed marginals, mutual information is zero when the joint table equals the product of the marginals (independence) and rises non-linearly as the cells shift toward perfect association, where it reaches its upper bound of min(H(X), H(Y)). The curve is concave rather than sigmoid: information gain is constrained by the marginal entropies, and the slope shows how rapidly uncertainty is reduced as dependency strengthens.
Why it behaves this way
Intuition
Imagine a statistical landscape where the 'height' at each (x,y) point represents the deviation from independence. Mutual information is the total 'volume' of these deviations, weighted by how frequently each combination occurs.
Signs and relationships
- \ln\frac{p(x,y)}{p(x)p(y)}: The natural logarithm transforms the ratio of probabilities into an additive measure of information. If the observed joint probability p(x,y) is larger than p(x)p(y), the log term is positive; if it is smaller, the term is negative.
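The sign behaviour of a single term can be checked numerically. The cell values below are hypothetical, chosen only to show an over-represented pair:

```python
import math

# One cell's contribution: if p(x,y) exceeds p(x)p(y), ln of the ratio is positive.
p_joint = 0.4        # hypothetical observed cell probability
p_prod = 0.5 * 0.5   # product of the corresponding marginals
term = math.log(p_joint / p_prod)
print(term > 0)  # → True: over-represented pairs contribute positively
```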
Free study cues
Insight
Canonical usage
Mutual information is a dimensionless quantity, representing a measure of statistical dependence. It is conventionally expressed in 'nats' when the natural logarithm (ln) is used, or 'bits' when logarithm base 2 (log2) is used.
Common confusion
A common confusion is treating 'nats' or 'bits' as physical units rather than as conventional units for information content, whose choice depends on the logarithm base used in the calculation.
Dimension note
Mutual information is inherently dimensionless because it is calculated from ratios of probabilities, which are themselves dimensionless.
Unit systems
One free problem
Practice Problem
A researcher is studying the link between a specific gene mutation and a rare trait. In a perfectly balanced population, the joint probabilities are all equal (0.25 each). Calculate the Mutual Information.
Solve for: I(X;Y)
Hint: If the joint probability of every cell is equal to the product of its marginal probabilities, the variables are independent.
The full worked solution stays in the interactive walkthrough.
Where it shows up
Real-World Context
In quantifying how informative a medical test result is about disease status, Mutual Information (2×2) is used to calculate Mutual Information from P(X=0,Y=0), P(X=0,Y=1), P(X=1,Y=0), and P(X=1,Y=1). The result matters because it helps evaluate model behaviour, algorithm cost, or prediction quality before relying on the output.
Study smarter
Tips
- Ensure the sum of joint probabilities (p00, p01, p10, p11) equals exactly 1.0 before starting.
- Calculate the marginal probabilities for X and Y by summing the rows and columns of the contingency table.
- Treat terms where p(x,y) is zero as zero, as the limit of p log(p) as p approaches zero is zero.
- The result is measured in nats when using the natural logarithm (ln) or bits when using log base 2.
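The first two tips above can be bundled into a small pre-flight check. The helper name and the example table are illustrative, not from the page:

```python
def table_preflight(table):
    """Check a 2x2 joint table [[p00, p01], [p10, p11]] sums to 1
    and return the marginals. Illustrative helper name."""
    total = sum(sum(row) for row in table)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("joint probabilities must sum to 1")
    px = [sum(row) for row in table]        # row sums: p(X=0), p(X=1)
    py = [sum(col) for col in zip(*table)]  # column sums: p(Y=0), p(Y=1)
    return px, py

px, py = table_preflight([[0.4, 0.1], [0.2, 0.3]])
print(px, py)
```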
Avoid these traps
Common Mistakes
- Forgetting to normalize probabilities to sum to 1.
- Mixing logs (ln vs log2) and units (nats vs bits).
Common questions
Frequently Asked Questions
What does the formula compute?
Mutual information sums p(x,y) ln(p(x,y)/(p(x)p(y))) over all four pairs of values.
When should I use it?
Apply this formula when analyzing the relationship between two binary variables, such as comparing a test result with the presence of a disease. It is preferred over linear correlation when you need to capture non-linear dependencies or general statistical association.
Why does it matter?
It is a foundational concept in communication theory for calculating channel capacity and in machine learning for feature selection. High mutual information indicates that knowing the state of one variable significantly reduces uncertainty about the other.
What are common mistakes?
Forgetting to normalize probabilities to sum to 1, and mixing logs (ln vs log2) and units (nats vs bits).
Where does it show up in practice?
In quantifying how informative a medical test result is about disease status, Mutual Information (2×2) is calculated from P(X=0,Y=0), P(X=0,Y=1), P(X=1,Y=0), and P(X=1,Y=1). The result helps evaluate model behaviour, algorithm cost, or prediction quality before relying on the output.
Any practical tips?
Ensure the joint probabilities (p00, p01, p10, p11) sum to exactly 1.0 before starting. Calculate the marginal probabilities for X and Y by summing the rows and columns of the contingency table. Treat terms where p(x,y) is zero as zero, since the limit of p log(p) as p approaches zero is zero. The result is in nats when using the natural logarithm (ln) or bits when using log base 2.
References
Sources
- Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience.
- Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423.
- Wikipedia: Mutual Information