Information Gain
Reduction in entropy.
This public page keeps the free explanation visible and leaves premium worked solving, advanced walkthroughs, and saved study tools inside the app.
Core idea
Overview
Information Gain measures the reduction in uncertainty, or entropy, within a dataset after it is partitioned based on a specific attribute. It is the primary criterion used by algorithms like ID3 and C4.5 to determine the best feature for splitting a node in a decision tree.
When to use: Apply this metric during the construction of supervised learning models to evaluate the predictive power of independent variables. It is most effective when working with categorical targets where the goal is to maximize class purity in resulting subsets.
Why it matters: By identifying features that offer the highest Information Gain, models can be built with fewer levels, reducing computational complexity. This efficiency helps prevent overfitting and ensures that the most relevant data patterns are prioritized during training.
Symbols
Variables
IG = Information Gain, H(parent) = Parent Entropy, H(children) = Weighted Child Entropy
Walkthrough
Derivation
Formula: Information Gain
Information gain measures how much uncertainty (entropy) is reduced by splitting a dataset using an attribute, guiding decision tree construction.
- A dataset S is split into subsets by values v of attribute A.
- Entropy H(S_v) is computed on the class distribution within each subset S_v.
State information gain for a split:
Subtract the weighted average entropy after the split from the original entropy before the split.
Choose the best split:
The attribute with the highest information gain produces the largest reduction in uncertainty at that node.
Note: Some algorithms use gain ratio to reduce bias towards many-valued attributes.
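The steps above can be written compactly. Assuming a split of dataset S by attribute A, with class proportions p_i inside a subset:

```latex
IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v),
\qquad
H(S) = -\sum_{i} p_i \log_2 p_i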
Result
Source: Standard curriculum — Machine Learning (Decision Trees)
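The derivation above can be sketched directly in code. This is a minimal illustration, not a library implementation; the function names entropy and information_gain are our own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG = H(parent) - weighted average H(children), split by feature value."""
    n = len(labels)
    # Group the class labels by the value the feature takes for each sample.
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    weighted_child_entropy = sum(
        (len(g) / n) * entropy(g) for g in groups.values()
    )
    return entropy(labels) - weighted_child_entropy

# A perfectly separating feature recovers all of the parent entropy:
labels  = ["spam", "spam", "ham", "ham"]
feature = ["yes",  "yes",  "no",  "no"]
print(information_gain(labels, feature))  # 1.0 (parent entropy was 1 bit)
```

Note that each child's entropy is weighted by its share of the samples, |S_v|/|S|, matching the formula stated in the derivation.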
Visual intuition
Graph
The graph depicts the binary entropy H(p) as a function of the probability p of an outcome: a concave curve that peaks at p = 0.5, the point of maximum uncertainty, and falls to zero at p = 0 and p = 1, where the outcome is certain and nothing new is learned. Information Gain is largest for splits that move child nodes from the top of this curve toward its endpoints. The shape reflects the logarithmic relationship between probability and information content, where rare events carry the highest informational value.
Graph type: concave (binary entropy curve)
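The curve can be checked numerically. A small sketch of the binary entropy function described above:

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p); zero for certain outcomes."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"H({p}) = {binary_entropy(p):.3f}")
# Peaks at p = 0.5 (1 bit) and falls to 0 at p = 0 and p = 1.
```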
Why it behaves this way
Intuition
Imagine a mixed collection of items (parent node) being sorted into smaller, more uniform groups (child nodes) based on a specific characteristic; Information Gain measures how much more organized and less mixed those child groups are compared to the parent.
Signs and relationships
- H(parent) − H(children): Subtracting H(children) from H(parent) signifies that Information Gain quantifies the *reduction* in entropy. We aim for the entropy of the child nodes to be less than that of the parent, so a positive Information Gain indicates the split made the subsets purer.
Free study cues
Insight
Canonical usage
Information Gain is a dimensionless numerical score used to quantify the reduction in entropy within a dataset.
Common confusion
Students may mistakenly try to assign physical units to Information Gain or entropy, rather than understanding them as dimensionless scores or measures of information.
Dimension note
Information Gain is a dimensionless quantity derived from the difference in entropy values, which are themselves calculated from probabilities.
Unit systems
One free problem
Practice Problem
A dataset has an initial entropy of 0.940 bits. After splitting it based on a specific feature, the weighted average entropy of the child nodes is 0.693 bits. Calculate the Information Gain.
Solve for: IG
Hint: Subtract the entropy of the children from the entropy of the parent node.
The full worked solution stays in the interactive walkthrough.
Where it shows up
Real-World Context
When choosing a feature split for a spam filter, Information Gain is computed as the parent entropy minus the weighted entropy of the child nodes. The result matters because it identifies which feature most reduces uncertainty about the class label before a conclusion is drawn from the data.
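A concrete sketch of the spam-filter scenario, using made-up counts for a hypothetical split on whether a message contains the word "free":

```python
from math import log2

def entropy_from_counts(counts):
    """Shannon entropy (bits) from class counts, e.g. [spam, ham]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Hypothetical corpus: 40 spam and 60 ham messages overall.
parent = [40, 60]
contains_free = [30, 10]   # [spam, ham] among messages with the word
no_free = [10, 50]         # [spam, ham] among messages without it

n = sum(parent)
weighted_children = sum(
    (sum(child) / n) * entropy_from_counts(child)
    for child in (contains_free, no_free)
)
ig = entropy_from_counts(parent) - weighted_children
print(f"Information gain of the split: {ig:.3f} bits")
```

With these invented counts the split yields roughly 0.256 bits of gain; a feature with a higher gain would be preferred at that node.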
Study smarter
Tips
- Ensure the children's entropy is calculated as a weighted average based on the number of samples in each branch.
- Be aware that Information Gain can be biased toward attributes with a large number of distinct values.
- A gain of zero indicates that the split does not improve the purity of the dataset at all.
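The first tip above is worth seeing numerically. In this sketch with invented counts, an uneven split shows how an unweighted mean of child entropies misleads:

```python
from math import log2

def entropy_from_counts(counts):
    """Shannon entropy (bits) from class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Uneven hypothetical split: 90 samples in one branch, 10 in the other.
branch_a = [45, 45]   # entropy 1.0 bit, holds 90% of the samples
branch_b = [10, 0]    # entropy 0.0 bits, holds 10% of the samples

unweighted = (entropy_from_counts(branch_a) + entropy_from_counts(branch_b)) / 2
weighted = 0.9 * entropy_from_counts(branch_a) + 0.1 * entropy_from_counts(branch_b)
print(unweighted, weighted)
# The unweighted mean (0.5) understates the true post-split entropy (0.9),
# because the pure branch covers only a small fraction of the samples.
```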
Avoid these traps
Common Mistakes
- Adding entropies instead of subtracting.
- Mixing log bases.
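The second mistake is easy to reproduce. Entropy in base 2 is measured in bits and in natural log in nats; mixing them in one subtraction gives a meaningless "gain". A quick sketch with a hypothetical class distribution:

```python
from math import log, log2

probs = [0.5, 0.25, 0.25]  # hypothetical class distribution

h_bits = -sum(p * log2(p) for p in probs)  # base-2 log: entropy in bits
h_nats = -sum(p * log(p) for p in probs)   # natural log: entropy in nats
print(h_bits, h_nats, h_nats / log(2))
# The two values differ by a constant factor ln(2); dividing a nats
# value by ln(2) converts it back to bits, so h_nats/ln(2) == h_bits.
```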
References
Sources
- Wikipedia: Information gain (decision tree)
- Wikipedia: Entropy (information theory)
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
- Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Standard curriculum — Machine Learning (Decision Trees)