KL Divergence (Bernoulli)
D_KL(p||q) for Bernoulli distributions.
This public page keeps the free explanation visible and leaves premium worked solutions, advanced walkthroughs, and saved study tools inside the app.
Core idea
Overview
The Bernoulli KL divergence measures the relative entropy between two Bernoulli distributions, quantifying the information lost when distribution q is used to approximate distribution p. It is a non-symmetric measure (not a metric) that characterizes the statistical discrepancy between two distributions over a binary outcome.
When to use: This equation is essential when evaluating the performance of binary classifiers or when comparing a theoretical model to observed binary frequencies. It is frequently applied in machine learning as a component of loss functions like Binary Cross-Entropy and in the context of information-theoretic model selection.
Why it matters: It provides a rigorous way to measure the 'surprise' or extra cost incurred by assuming one set of probabilities when the reality is different. In practice, minimizing this divergence optimizes data transmission and ensures that predictive models are as close to the true data generation process as possible.
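To make the Binary Cross-Entropy connection concrete, here is a minimal sketch in plain Python (function names like `entropy` and `cross_entropy` are illustrative, not any particular library's API) showing that the Bernoulli KL divergence equals cross-entropy minus entropy:

```python
import math

def entropy(p):
    # Entropy of a Bernoulli(p) distribution, in nats.
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def cross_entropy(p, q):
    # Binary cross-entropy between true p and model q, in nats.
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p, q = 0.7, 0.4
kl = cross_entropy(p, q) - entropy(p)  # D_KL(p||q) = H(p, q) - H(p)
print(kl)  # ~0.1838 nats
```

Because H(p) does not depend on the model, minimizing Binary Cross-Entropy against fixed labels is equivalent to minimizing the KL divergence to the label distribution.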
Symbols
Variables
D_KL(p||q) = KL Divergence, p = True Probability, q = Model Probability
Walkthrough
Derivation
Derivation of KL Divergence for Bernoulli Variables
KL divergence measures mismatch between true probability p and model probability q.
- Binary variable X∈{0,1}.
- True distribution: P(X=1)=p.
- Model distribution: Q(X=1)=q.
Start from the definition of KL divergence:
D_KL(P||Q) = Σ_{x∈{0,1}} P(x) ln(P(x)/Q(x))
KL is an expected log ratio of probabilities.
Write probabilities for X=1 and X=0:
P(X=1) = p, P(X=0) = 1-p and Q(X=1) = q, Q(X=0) = 1-q
Bernoulli distributions are determined by their success probabilities.
Expand the expectation:
D_KL(p||q) = p ln(p/q) + (1-p) ln((1-p)/(1-q))
This is the standard closed form for Bernoulli KL divergence.
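As a quick check of the closed form, a minimal Python sketch (assuming 0 < p < 1 and 0 < q < 1 so every logarithm is finite):

```python
import math

def bernoulli_kl(p, q):
    # Closed form D_KL(p||q) in nats; assumes 0 < p < 1 and 0 < q < 1.
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(bernoulli_kl(0.7, 0.4))  # ~0.1838, matching the cross-entropy identity above
```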
Result
Visual intuition
Graph
The graph depicts a convex, U-shaped curve showing the divergence D_KL(p||q) as the model probability q varies while the true probability p is held fixed. The curve has a global minimum of zero at q = p, where the two distributions are identical, and rises sharply toward vertical asymptotes as q approaches the boundaries of 0 or 1. This shape illustrates that information loss increases non-linearly as the model probability deviates from the true probability.
Graph type: convex (not quadratic; the curve diverges at the boundaries)
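One way to reproduce the curve's shape numerically, holding the true probability fixed and sweeping the model probability (the sample values of q are chosen only for illustration):

```python
import math

def bernoulli_kl(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p = 0.5  # true probability, held fixed
for q in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
    print(f"q={q:.2f}  D_KL={bernoulli_kl(p, q):.4f}")
# Output is 0 at q = p = 0.5 and grows without bound as q nears 0 or 1.
```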
Why it behaves this way
Intuition
Imagine two distinct bar charts, each representing a Bernoulli distribution with two bars (success and failure). The KL divergence quantifies the 'extra space' or 'distance' required to describe the first bar chart using the probabilities of the second. The closer the two charts match, the less extra description is needed.
Signs and relationships
- \ln: The logarithmic function transforms probability ratios into units of information (nats, for natural logarithm). Jensen's inequality guarantees that the weighted sum `p\ln(p/q) + (1-p)\ln((1-p)/(1-q))` is always non-negative, even though an individual term can be negative (see the numerical check after this list)
- p: The true probabilities 'p' and '(1-p)' act as weighting factors. They ensure that the information discrepancy for each outcome (success or failure) counts in proportion to how often that outcome actually occurs under the true distribution
- +: The two terms are summed to account for the total expected information discrepancy across both possible outcomes (success and failure)
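The non-negativity claim above applies to the sum, not to each term. A small numerical check (values chosen purely for illustration):

```python
import math

p, q = 0.5, 0.9
term1 = p * math.log(p / q)                    # ~ -0.2939: an individual term can be negative
term0 = (1 - p) * math.log((1 - p) / (1 - q))  # ~ +0.8047
print(term1 + term0)                           # ~ 0.5108: the weighted sum is always >= 0 (Gibbs' inequality)
```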
Free study cues
Insight
Canonical usage
KL Divergence is a dimensionless quantity, often expressed in 'nats' or 'bits' depending on the base of the logarithm used, but fundamentally represents a unitless measure of information.
Common confusion
Students might confuse 'nats' or 'bits' as physical units rather than as indicators of the logarithm's base, leading to attempts to convert them to other physical units or to expect dimensional consistency with physical quantities. The base of the logarithm (e for nats, 2 for bits) only sets the scale of the result.
Dimension note
The KL divergence is inherently dimensionless as it is calculated from probabilities, which are themselves dimensionless ratios. While 'nats' or 'bits' are often used to denote the unit of information, these are not physical units.
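Converting between the two conventions is just a change of logarithm base; a one-line sketch reusing the example value from the check above:

```python
import math

kl_nats = 0.5108                 # the p=0.5, q=0.9 example above
kl_bits = kl_nats / math.log(2)  # change of base: 1 nat = 1/ln(2) ~ 1.4427 bits
print(kl_bits)                   # ~0.7369 bits
```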
One free problem
Practice Problem
A coin is known to have a true probability of landing heads of p = 0.5. If a researcher models this coin with an estimated probability q = 0.2, calculate the resulting KL Divergence in nats.
Solve for: D_KL(p||q) in nats
Hint: Plug the values into the formula using natural logarithms for both the p/q and (1-p)/(1-q) terms.
The full worked solution stays in the interactive walkthrough.
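For readers who want to verify their own answer numerically, a quick self-check using the closed form above (the step-by-step reasoning remains in the walkthrough):

```python
import math

p, q = 0.5, 0.2
kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print(round(kl, 4))  # ~0.2231 nats
```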
Where it shows up
Real-World Context
In quantifying how much a model's predicted probability differs from reality, the Bernoulli KL divergence turns a true probability p and a model probability q into a single number of nats or bits. The result matters because it supports likelihood estimates and risk or decision statements rather than treating the model's output as certainty.
Study smarter
Tips
- Ensure p and q values remain strictly between 0 and 1 to avoid natural logs of zero or infinity (a clamping sketch follows these tips).
- Remember that D(p||q) is not equal to D(q||p); the order represents the direction from the truth p to the model q.
- A divergence of 0 always implies that the two distributions are perfectly identical.
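A minimal sketch of the first tip, clamping inputs before evaluating the formula (the `EPS` value is an illustrative choice, not a standard):

```python
import math

EPS = 1e-12  # illustrative tolerance; pick a value appropriate to your application

def safe_bernoulli_kl(p, q, eps=EPS):
    # Clamp both probabilities away from 0 and 1 so every log stays finite.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(safe_bernoulli_kl(1.0, 0.8))  # finite result instead of a math domain error
```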
Avoid these traps
Common Mistakes
- Swapping p and q (changes the value; see the quick check after this list).
- Assuming KL is a distance metric (it isn’t symmetric).
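A quick check of the asymmetry, reusing the closed form from the derivation:

```python
import math

def bernoulli_kl(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(bernoulli_kl(0.5, 0.2))  # ~0.2231
print(bernoulli_kl(0.2, 0.5))  # ~0.1927: different value, so KL is not symmetric
```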