Learning Rate (AI Glossary)

How large a step the optimiser takes when updating weights; often considered the most important hyperparameter to tune.

What is the learning rate?

The learning rate is the hyperparameter that controls how large a step the optimiser takes when it updates the model’s weights. During training, gradient descent computes which direction reduces the loss; the learning rate decides how far to move in that direction on each step. It is widely regarded as the single most important hyperparameter to get right, because it largely determines whether training converges at all and it interacts with many other settings, such as batch size and the choice of optimiser.

Returning to the hiker analogy for gradient descent: the gradient tells the hiker which way is downhill, and the learning rate is the length of their stride. Tiny steps make the descent reliable but painfully slow; giant steps risk leaping past the valley entirely.
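As a minimal sketch of the stride idea, the update rule can be written as w ← w − learning_rate × gradient. The toy loss f(w) = (w − 3)² and the names gradient_descent and toy_grad below are assumptions chosen for illustration, not part of any library.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Plain gradient descent: w <- w - learning_rate * grad(w)."""
    w = w0
    history = [w]
    for _ in range(steps):
        # The gradient gives the direction; the learning rate scales the stride.
        w = w - learning_rate * grad(w)
        history.append(w)
    return history


def toy_grad(w):
    """Gradient of the illustrative loss f(w) = (w - 3)**2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


path = gradient_descent(toy_grad, w0=0.0, learning_rate=0.1, steps=50)
print("first steps:", path[:4])
print("final weight:", path[-1])
```

With this setting each step closes a fixed fraction of the remaining distance to the minimum, so the weight ends up very close to 3 after 50 steps.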

When the learning rate is too high

If the learning rate is too large, updates overshoot the minimum of the loss. The typical symptoms are:

  • The loss oscillates instead of steadily falling.
  • The loss spikes upward or diverges, often producing NaN values.
  • Training fails to converge no matter how long it runs.

A too-high learning rate is one of the most common reasons a training run blows up early.
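The overshoot can be reproduced on a toy problem. The sketch below assumes the one-dimensional loss f(w) = w², where each update multiplies the weight by (1 − 2 × learning_rate); the helper name run and the specific rates are illustrative, and the thresholds apply only to this toy loss.

```python
def run(learning_rate, w0=1.0, steps=20):
    """Gradient descent on the toy loss f(w) = w**2, returning every weight visited."""
    w = w0
    trace = [w]
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w = w - learning_rate * grad
        trace.append(w)
    return trace


stable = run(0.1)       # each step shrinks w by a factor of 0.8
overshooting = run(0.9)  # each step jumps past the minimum: w flips sign
diverging = run(1.1)     # each step lands further away than it started

for name, trace in [("stable", stable), ("overshooting", overshooting), ("diverging", diverging)]:
    print(f"{name:>12}: first weights {trace[:4]}, final loss {trace[-1] ** 2:.3g}")
```

In a real network the same mechanism shows up as a loss curve that zigzags or explodes; the exact rate at which this happens depends on the curvature of the loss, so it has to be found empirically.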

When the learning rate is too low

If the learning rate is too small, the opposite problem appears: progress is painfully slow. The model may need far more steps and compute to reach the same performance, and it can stall on a plateau or get trapped in a shallow, suboptimal minimum because each step is too timid to escape it. Low learning rates waste time and resources even when they eventually work.
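To make the cost of a timid step size concrete, here is a small sketch that counts how many updates are needed to get within a tolerance of the minimum. It assumes the toy loss f(w) = w²; steps_to_converge, tol and max_steps are hypothetical names introduced for this example.

```python
def steps_to_converge(learning_rate, w0=1.0, tol=1e-3, max_steps=100_000):
    """Count gradient-descent steps on f(w) = w**2 until |w| < tol.

    Returns None if the tolerance is not reached within max_steps.
    """
    w = w0
    for step in range(max_steps):
        if abs(w) < tol:
            return step
        w = w - learning_rate * 2.0 * w  # gradient of w**2 is 2w
    return None


fast = steps_to_converge(0.1)
slow = steps_to_converge(0.001)
print(f"learning rate 0.1   -> {fast} steps")
print(f"learning rate 0.001 -> {slow} steps")
```

On this toy problem, cutting the learning rate by a factor of 100 raises the step count by roughly the same factor, which is the wasted compute described above.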

Warmup and decay schedules

In practice the learning rate is rarely held constant. A standard recipe combines two phases:

  • Warmup — start near zero and ramp the rate up over the first few hundred or thousand steps. Early gradients are noisy and unreliable, so a gentle start prevents them from destabilising the model.
  • Decay — after warmup, gradually shrink the learning rate, commonly with cosine decay that eases it smoothly toward zero. Larger steps early make fast progress; smaller steps later let the model settle precisely into a good minimum.
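The two phases above can be sketched as a single function of the step number. This is one common formulation (linear warmup followed by cosine decay), not the only one; lr_at_step and its parameters base_lr, warmup_steps, total_steps and min_lr are names assumed for illustration.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate for a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from near zero up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


# Illustrative settings: 100 warmup steps out of 1,000 total.
for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_at_step(step, base_lr=1e-3, warmup_steps=100, total_steps=1000))
```

The rate peaks at base_lr when warmup ends and eases toward min_lr by the final step. In a training loop, the value returned for the current step would be assigned to the optimiser before each update.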

Learning rate and adaptive optimisers

Adaptive optimisers like Adam and AdamW adjust the effective step size per parameter using running statistics of past gradients. This makes them more forgiving of the chosen learning rate than plain SGD, but it does not remove the need to tune it. The base learning rate still matters, and pairing a well-chosen value with a warmup-and-decay schedule remains one of the highest-leverage decisions in training any modern neural network.
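A rough sketch of why the base rate still matters: on its very first update, Adam normalises the gradient by its own magnitude, so the step is close to the learning rate itself regardless of how large the gradient is, whereas the SGD step scales directly with the gradient. The code follows the commonly presented Adam update for a single parameter with the usual default values for beta1, beta2 and eps; the function names are illustrative.

```python
import math


def sgd_step(grad, lr):
    """Size of a plain SGD update: proportional to the raw gradient."""
    return lr * grad


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first update (t = 1) for one parameter, starting from m = v = 0."""
    t = 1
    m = (1 - beta1) * grad          # running mean of gradients
    v = (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)    # bias correction
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


for grad in (0.01, 1.0, 100.0):
    print(f"grad={grad:>6}: sgd step={sgd_step(grad, 1e-3):.3g}, "
          f"adam step={adam_first_step(grad, 1e-3):.3g}")
```

The gradient scale is absorbed by the normalisation, but the learning rate is not: it sets the overall size of every Adam update, which is why it still needs tuning.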
