
Information Gain

Reduction in entropy.



Core idea

Overview

Information Gain measures the reduction in uncertainty, or entropy, within a dataset after it is partitioned based on a specific attribute. It is the primary criterion used by algorithms like ID3 and C4.5 to determine the best feature for splitting a node in a decision tree.

When to use: Apply this metric during the construction of supervised learning models to evaluate the predictive power of independent variables. It is most effective when working with categorical targets where the goal is to maximize class purity in resulting subsets.

Why it matters: By identifying features that offer the highest Information Gain, models can be built with fewer levels, reducing computational complexity. This efficiency helps prevent overfitting and ensures that the most relevant data patterns are prioritized during training.
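For instance, scikit-learn's DecisionTreeClassifier can be told to choose splits by entropy reduction, i.e. information gain, rather than its default Gini criterion. The following is a minimal sketch assuming scikit-learn is installed; the dataset and hyperparameters are arbitrary choices for illustration.

```python
# Minimal sketch: grow a decision tree whose splits maximize
# information gain by selecting the "entropy" criterion.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.score(X, y))  # accuracy on the training data
```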

Symbols

Variables

IG = Information Gain, H(parent) = Parent Entropy, H(children) = Child Entropy

  • IG: Information Gain (bits)
  • H(parent): Parent Entropy (bits)
  • H(children): Child Entropy (bits)

Walkthrough

Derivation

Formula: Information Gain

IG(S, A) = H(S) − Σ_v (|S_v| / |S|) · H(S_v), summing over the values v of attribute A

Information gain measures how much uncertainty (entropy) is reduced by splitting a dataset using an attribute, guiding decision tree construction.

  • A dataset S is split into subsets S_v by the values v of attribute A.
  • Entropy H(·) is computed on the class distribution within each subset.

1. State information gain for a split: subtract the weighted average entropy after the split from the original entropy before the split, weighting each child subset S_v by its share of the samples, |S_v| / |S|.

2. Choose the best split: the attribute with the highest information gain produces the largest reduction in uncertainty at that node.

Note: Some algorithms (such as C4.5) use the gain ratio instead, to reduce bias towards many-valued attributes.

Result

IG = H(parent) − H(children), where H(children) is the weighted average entropy of the child nodes.

Source: Standard curriculum — Machine Learning (Decision Trees)
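The derivation translates directly into code. Below is a minimal sketch using only the Python standard library; the helper names entropy and information_gain and the 9-positive/5-negative toy split are our own illustration, not part of the original walkthrough.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) minus the weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy split: 14 samples (9 positive, 5 negative) partitioned into two branches.
parent = ["yes"] * 9 + ["no"] * 5
children = [["yes"] * 6 + ["no"], ["yes"] * 3 + ["no"] * 4]
print(round(information_gain(parent, children), 3))  # about 0.152 bits
```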

Visual intuition

Graph

The graph depicts binary entropy, the quantity whose reduction Information Gain measures, as a function of the probability of one class. It has a concave shape that peaks at a probability of 0.5, the point of maximum uncertainty, and falls to zero at probabilities of 0 and 1, where the outcome is certain and no new information can be gained. This shape reflects the logarithmic relationship between probability and information content, under which rare events carry the most information.

Graph type: concave (binary entropy curve)
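The curve described above is the standard binary entropy function from information theory:

\[
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)
\]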

Why it behaves this way

Intuition

Imagine a mixed collection of items (parent node) being sorted into smaller, more uniform groups (child nodes) based on a specific characteristic. Information Gain measures how much more organized and less mixed those groups are compared with the original collection.

IG: The reduction in uncertainty or randomness of a dataset after it is partitioned based on an attribute. A higher Information Gain indicates that splitting the dataset by this attribute makes the resulting subsets significantly more predictable or 'purer' in terms of their target classes.

H(parent): The initial level of uncertainty or impurity (entropy) in the dataset before any split is made. Represents how mixed the classes are in the original dataset; a higher H(parent) means the classes are more evenly distributed and thus more uncertain.

H(children): The weighted average uncertainty or impurity (entropy) of the subsets created after splitting the dataset by a particular attribute. Represents how mixed the classes are in the resulting subsets; a lower H(children) means the subsets are more homogeneous and less uncertain.

Signs and relationships

  • −H(children): The subtraction of H(children) from H(parent) signifies that Information Gain quantifies the *reduction* in entropy. We aim for the entropy of the child nodes to be less than that of the parent node, so a positive Information Gain indicates that the split has made the subsets purer.
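A hypothetical extreme case makes the sign convention concrete: a parent node with a 50/50 class mix has an entropy of 1 bit, and a split into two perfectly pure children has a weighted child entropy of 0, so

\[
IG = H(\text{parent}) - H(\text{children}) = 1 - 0 = 1 \text{ bit.}
\]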

Free study cues

Insight

Canonical usage

Information Gain is a dimensionless numerical score used to quantify the reduction in entropy within a dataset.

Common confusion

Students may mistakenly try to assign physical units to Information Gain or entropy, rather than understanding them as dimensionless scores or measures of information.

Dimension note

Information Gain is a dimensionless quantity derived from the difference in entropy values, which are themselves calculated from probabilities.

Unit systems

  • IG: dimensionless. Information Gain is a numerical score representing the reduction in entropy. While entropy can be expressed in 'bits' or 'nats' depending on the logarithm base, Information Gain itself is fundamentally a dimensionless quantity.
  • H: dimensionless. Entropy is a measure of uncertainty based on probabilities. It is dimensionless but can be expressed in 'bits' (using log base 2) or 'nats' (using natural log) to indicate the base of the logarithm.
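A quick worked conversion between the two conventions: the entropy of a fair coin flip is

\[
H = \log_2 2 = 1 \text{ bit} = \ln 2 \approx 0.693 \text{ nats},
\]

so changing the logarithm base rescales the number without changing the underlying quantity.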

One free problem

Practice Problem

A dataset has an initial entropy of 0.940 bits. After splitting it based on a specific feature, the weighted average entropy of the child nodes is 0.693 bits. Calculate the Information Gain.

  • Parent Entropy: 0.940 bits
  • Child Entropy: 0.693 bits

Solve for: IG

Hint: Subtract the entropy of the children from the entropy of the parent node.


Where it shows up

Real-World Context

When choosing a feature split for a spam filter, Information Gain compares the entropy of the labels before the split (Parent Entropy) with the weighted entropy after splitting on a candidate feature, such as whether a message contains a particular word (Child Entropy). The feature with the highest gain separates spam from non-spam most cleanly, which is why decision trees test it first.
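A hypothetical sketch of that workflow, assuming scikit-learn is available: the information gain between a feature and the class labels is their mutual information, which mutual_info_classif estimates. The toy messages and the "contains the word free" feature are invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical toy data: one binary feature per message
# (1 = message contains the word "free"); labels: 1 = spam, 0 = ham.
X = np.array([[1], [1], [1], [1], [0], [0], [0], [0]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])

# Estimated information gain (mutual information, in nats) for the feature.
print(mutual_info_classif(X, y, discrete_features=True, random_state=0))
```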

Study smarter

Tips

  • Ensure the children's entropy is calculated as a weighted average based on the number of samples in each branch (see the worked example after this list).
  • Be aware that Information Gain can be biased toward attributes with a large number of distinct values.
  • A gain of zero indicates that the split does not improve the purity of the dataset at all.
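A hypothetical illustration of the first tip: suppose a split produces children of 8 and 2 samples with entropies 0.5 and 1.0 bits. The correct weighted average differs from the naive mean:

\[
H(\text{children}) = \tfrac{8}{10}(0.5) + \tfrac{2}{10}(1.0) = 0.6 \text{ bits}, \quad \text{not} \quad \tfrac{0.5 + 1.0}{2} = 0.75 \text{ bits.}
\]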

Avoid these traps

Common Mistakes

  • Adding entropies instead of subtracting.
  • Mixing log bases.


References

Sources

  1. Wikipedia: Information gain (decision tree)
  2. Wikipedia: Entropy (information theory)
  3. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
  4. Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
  5. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
  6. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  7. Standard curriculum — Machine Learning (Decision Trees)