Binary Cross-Entropy
Loss function for binary classification.
This public page keeps the free explanation visible; premium worked solutions, advanced walkthroughs, and saved study tools stay inside the app.
Core idea
Overview
Binary Cross-Entropy measures the divergence between two probability distributions, typically the true labels and the predicted probabilities in a binary classification task. It produces a loss value that penalizes predictions increasingly steeply as they diverge from the actual class value, growing without bound for confident wrong answers.
When to use: This equation is the standard loss function for binary classification problems where the output is a single probability between 0 and 1. It is most effective when paired with a sigmoid activation function in the final layer of a neural network.
Why it matters: It provides a smooth, convex surface for optimization, allowing gradient descent to effectively update model weights. By heavily penalizing confident but incorrect predictions, it forces the model to learn more distinct boundaries between classes.
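As a rough sketch of that pairing (plain Python; the logit value and labels below are made up for illustration), a sigmoid converts a raw final-layer score into a probability, and the loss is evaluated against the label:

```python
import math

def sigmoid(z):
    # Squash a raw model score (logit) into a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, p):
    # Binary cross-entropy for a single example with label y in {0, 1}.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logit = 2.0          # hypothetical raw output of the final layer
p = sigmoid(logit)   # predicted probability of the positive class, about 0.88
print(bce(1, p))     # small loss: confident and correct
print(bce(0, p))     # large loss: confident and wrong
```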
Symbols
Variables
L = loss, y = actual label (0 or 1), p = predicted probability that y = 1
Walkthrough
Derivation
Formula: Binary Cross-Entropy (Log Loss)
Binary cross-entropy measures how well predicted probabilities match true binary labels y, heavily penalising confident wrong predictions.
- Binary labels y\in\{0,1\}.
- Predictions are probabilities in (0,1), commonly from a sigmoid.
- Logarithms are natural logs unless specified otherwise (choice changes scale only).
Write the loss for one example:
L = -\left[y\ln(p) + (1-y)\ln(1-p)\right]
If y=1, only -\ln(p) matters; if y=0, only -\ln(1-p) matters.
Average across N examples:
L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\ln(p_i) + (1-y_i)\ln(1-p_i)\right]
The dataset loss is the mean of the individual losses, giving a single number to minimise during training.
Note: In practice, probabilities are clipped away from 0 and 1 to avoid \ln(0).
Result
Source: Standard curriculum — Machine Learning (Classification Losses)
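A minimal sketch of the averaged formula, assuming NumPy is available (the labels and probabilities below are illustrative), which also applies the clipping mentioned in the note:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy over N examples.

    y : array of true labels in {0, 1}
    p : array of predicted probabilities in (0, 1)
    """
    p = np.clip(p, eps, 1 - eps)  # keep ln(p) and ln(1 - p) defined
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])          # illustrative labels
p = np.array([0.9, 0.2, 0.7, 0.4])  # illustrative predictions
print(binary_cross_entropy(y, p))   # ≈ 0.40
```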
Visual intuition
Graph
The graph shows the loss as a function of the predicted probability p. For a true label of 1, the curve -\ln(p) has a vertical asymptote at p=0; for a true label of 0, the curve -\ln(1-p) has one at p=1. As the prediction moves away from the true label, the loss rises sharply toward infinity, reflecting the penalty for confident incorrect predictions.
Graph type: logarithmic
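To reproduce that shape, a short plotting sketch (assuming NumPy and Matplotlib are available) draws both branches of the loss:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)   # predicted probability axis
plt.plot(p, -np.log(p), label="y = 1: loss = -ln(p)")
plt.plot(p, -np.log(1 - p), label="y = 0: loss = -ln(1 - p)")
plt.xlabel("predicted probability p")
plt.ylabel("loss")
plt.legend()
plt.show()
```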
Why it behaves this way
Intuition
Picture a landscape in which the model seeks the lowest point, representing minimal divergence between its predicted probabilities and the true class labels; the steep gradients near the wrong extreme severely penalize confident incorrect predictions.
Signs and relationships
- Negative sign: the natural logarithm of a probability (a value between 0 and 1) is always negative or zero. To ensure the loss L is a non-negative value that can be minimized towards zero, the entire expression is multiplied by -1.
Free study cues
Insight
Canonical usage
This equation calculates a dimensionless loss value, representing the divergence between a true binary label and a predicted probability.
Common confusion
A common mistake is to input probabilities as percentages (e.g., 75%) instead of decimal values (e.g., 0.75), which would lead to incorrect logarithmic calculations.
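A quick check in plain Python makes the difference concrete (the 0.75 value is just an example, for a true label of 1):

```python
import math

# Correct: probability entered as a decimal
print(-math.log(0.75))   # ≈ 0.288, a sensible loss for y = 1

# Wrong: probability entered as a percentage
print(-math.log(75))     # ≈ -4.32, a negative "loss" that signals the error
```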
Dimension note
All variables in the Binary Cross-Entropy formula (true label 'y', predicted probability 'p', and the resulting loss 'L') are dimensionless quantities.
One free problem
Practice Problem
A machine learning model identifies a transaction as fraudulent (y = 1). The model's predicted probability of fraud is 0.85. Calculate the binary cross-entropy loss for this specific prediction.
Solve for: L, the loss for this prediction.
Hint: When y = 1, the formula simplifies to L = -ln(p).
The full worked solution stays in the interactive walkthrough.
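If you want to verify your own arithmetic afterwards, a short plain-Python check (natural log, as in the hint) is enough:

```python
import math

y, p = 1, 0.85
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
print(loss)  # ≈ 0.163, the same as -ln(0.85)
```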
Where it shows up
Real-World Context
In training a spam classifier with probabilistic output, Binary Cross-Entropy computes the loss from the actual label (0/1) and the predicted probability. The result matters because it quantifies how confidently right or wrong each prediction is, and its gradient drives the weight updates that improve the classifier.
Study smarter
Tips
- Ensure predicted values p stay within (0, 1) to avoid undefined natural logs at 0 or 1.
- The loss is 0 only if the prediction perfectly matches the label.
- For multi-class targets, use the Categorical Cross-Entropy variant instead.
Avoid these traps
Common Mistakes
- Using p=0 or p=1 directly.
- Forgetting the (1-y) term.
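Both traps are easy to demonstrate in plain Python; the clipped version below is a sketch, not a canonical implementation:

```python
import math

def bce_buggy(y, p):
    # Forgets the (1 - y) term: reports zero loss for every y = 0 example.
    return -(y * math.log(p))

def bce(y, p, eps=1e-12):
    p = min(max(p, eps), 1 - eps)  # clip away from 0 and 1 before taking logs
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bce_buggy(0, 0.99))  # 0.0, which hides a badly wrong prediction
print(bce(0, 0.99))        # ≈ 4.6, the penalty the full formula applies
# bce(0, 1.0) would raise a math domain error without the clipping above
```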
References
Sources
- Wikipedia: Cross-entropy
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Chapter 6, Section 6.2.2.2)
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. (Chapter 4, Section 4.3.4)
- Standard curriculum — Machine Learning (Classification Losses)