
Learning rate

The learning rate is a positive step-size parameter used in gradient-based optimization.



Core idea

Overview

The learning rate is a scalar hyperparameter that determines the step size at each iteration of an optimization algorithm. It scales the gradient of the loss function, controlling how significantly the model's weights are adjusted in response to estimated error.

When to use: Apply this during the training of machine learning models when using gradient-based optimization like SGD, Adam, or RMSProp. It is used to balance the trade-off between the speed of training and the precision of the convergence toward a minimum.

Why it matters: The learning rate is arguably the most critical hyperparameter; setting it too high causes the model to overshoot the minimum and diverge, while setting it too low results in inefficient training or getting stuck in local minima.

Symbols

Variables

α = Learning Rate (variable)

Walkthrough

Derivation

Learning Rate

The learning rate α is a positive scalar hyperparameter that controls the step size of each weight update during gradient-based optimisation. Too large and the model overshoots; too small and training is slow or stalls.

  • α > 0.
  • A differentiable loss function exists so gradients can be computed.
  • The learning rate may be fixed or follow a decay schedule.
Step 1: Gradient Descent Update Rule

Each weight w is adjusted by subtracting the gradient of the loss L multiplied by α:

w ← w − α · ∂L/∂w

α scales how far we move in the direction of steepest descent.
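As a minimal sketch of this update rule (not part of the original walkthrough), assume the toy loss L(w) = (w − 3)², whose gradient is 2(w − 3) and whose minimum sits at w = 3:

```python
def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)**2
    return 2 * (w - 3)

def gd_step(w, alpha):
    # Core update rule: w <- w - alpha * dL/dw
    return w - alpha * grad(w)

w = 0.0
alpha = 0.1
w = gd_step(w, alpha)  # w moves from 0.0 to 0.6, toward the minimum at 3
```

Each call to `gd_step` nudges w toward the minimum by an amount proportional to α.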

Step 2: Effect of a Large α

A very large learning rate causes large weight updates, making the loss oscillate or diverge rather than converge.

Step 3: Effect of a Small α

A very small learning rate leads to tiny updates; training may converge very slowly or settle in a local minimum.
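Steps 2 and 3 can be seen numerically in a hedged toy example (an assumption for illustration, not the page's own code): minimising L(w) = w², where the gradient is 2w, so each update is w ← w − α · 2w.

```python
def run(alpha, steps=20, w0=1.0):
    # Repeatedly apply the update w <- w - alpha * 2 * w on L(w) = w**2
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

small = run(0.01)   # too small: |w| shrinks only slightly per step
good  = run(0.1)    # well chosen: w approaches the minimum at 0 quickly
large = run(1.1)    # too large: each step overshoots and |w| grows (divergence)
```

After 20 steps the moderate rate is much closer to the minimum than the tiny one, while the large rate has diverged.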

Step 4: Learning Rate Decay Example

A common schedule halves α every k iterations, starting from an initial rate α₀, allowing faster early training and finer tuning later:

α_t = α₀ · (1/2)^⌊t/k⌋
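The halving schedule above can be sketched directly (a minimal illustration; the variable names are placeholders):

```python
def decayed_alpha(alpha0, t, k):
    # alpha_t = alpha0 * (1/2) ** floor(t / k)
    # Every k iterations the learning rate is cut in half.
    return alpha0 * 0.5 ** (t // k)

# e.g. with alpha0 = 0.1 and k = 10:
# iterations 0-9  use 0.1, iterations 10-19 use 0.05, 20-29 use 0.025, ...
```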

Note: Common starting values: 0.1, 0.01, 0.001. Search logarithmically to find the best range.


Source: A-Level Data & Computing — Machine Learning

Visual intuition

Graph


The graph is a horizontal line representing a constant value for the learning rate (alpha) across all values of the independent variable. Since the formula defines alpha as a fixed positive parameter, the line remains parallel to the x-axis and exists only in the region where alpha is greater than zero.

Graph type: constant

Why it behaves this way

Intuition

Imagine a blindfolded person trying to find the lowest point in a hilly landscape; the learning rate determines the size of each step they take in the direction they feel is downhill.

α sets the magnitude of the step taken in the direction of the negative gradient during optimization. A larger α leads to bigger updates to the model weights, potentially speeding up convergence but risking overshooting the minimum; a smaller α leads to smaller updates, ensuring finer adjustments but potentially slowing down convergence.

Signs and relationships

  • α > 0: The learning rate must be positive because it scales the gradient to move the model parameters *down* the loss landscape towards a minimum.

Free study cues

Insight

Canonical usage

The learning rate is a dimensionless scalar used to control the step size during gradient-based optimization in machine learning.

Common confusion

A common confusion is trying to assign physical units to the learning rate. It is a hyperparameter whose value is tuned empirically, not derived from physical dimensions.

Dimension note

The learning rate is a dimensionless scalar. It acts as a pure scaling factor for the gradient vector, determining the magnitude of weight updates.

Unit systems

Dimensionless: the learning rate is a scalar hyperparameter that scales the gradient of the loss function. It does not carry physical units, acting as a pure numerical factor.

One free problem

Practice Problem

A machine learning practitioner is training a neural network with an initial learning rate alpha of 0.01. After observing that the loss is oscillating, they decide to reduce the learning rate to one-fifth of its current value. Calculate the new value for alpha.

Learning Rate: 0.01

Solve for: alpha

Hint: Divide the initial value by 5 to find the reduced rate.


Where it shows up

Real-World Context

When tuning α in a training run, the learning rate determines how strongly each computed gradient adjusts the model's weights. The result matters because a well-chosen α is often the difference between a model that converges and one that diverges or stalls.

Study smarter

Tips

  • Start with a logarithmic search (e.g., 0.1, 0.01, 0.001) to find the best range for your specific architecture.
  • Utilize learning rate decay or schedules to reduce alpha as training progresses to fine-tune weights.
  • Monitor the loss curve; consistent oscillation or rising loss usually indicates the learning rate is too high.
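The logarithmic search in the first tip can be sketched as follows. This is a hedged toy example: `train_loss` here minimises the quadratic L(w) = w² as a stand-in for a real, short training run.

```python
def train_loss(alpha, steps=20):
    # Stand-in for a short training run: minimise L(w) = w**2 from w0 = 1.0
    # and report the final loss reached with this learning rate.
    w = 1.0
    for _ in range(steps):
        w -= alpha * 2 * w
    return w ** 2

candidates = [0.1, 0.01, 0.001]          # logarithmically spaced rates
best = min(candidates, key=train_loss)   # keep the rate with the lowest loss
```

In a real setting, `train_loss` would run a few epochs of actual training per candidate before comparing final losses.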

Avoid these traps

Common Mistakes

  • Choosing a rate that is too large, which makes the loss oscillate or diverge.
  • Assuming one rate fits all models; the best α depends on the architecture, data, and optimizer.


References

Sources

  1. Deep Learning (Goodfellow, Bengio, Courville)
  2. Wikipedia: Learning rate
  3. Wikipedia: Gradient descent
  4. Pattern Recognition and Machine Learning (Bishop)
  5. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8: Optimization for Training Deep Models.
  6. A-Level Data & Computing — Machine Learning