What is gradient descent?
Gradient descent is the optimisation algorithm at the heart of training almost every neural network. Training is framed as minimising a loss function — a number that measures how wrong the model’s predictions are. Gradient descent computes the gradient, the slope of that loss with respect to each of the model’s parameters, then nudges every parameter a small step in the direction that decreases the loss.
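The classic intuition is a hiker on a foggy hillside trying to reach the valley. They cannot see the bottom, but they can feel which way the ground slopes downhill and take a step that way. Repeat enough times and they descend toward a low point, though not necessarily the lowest one. The size of each step is set by the learning rate: too large and the hiker overshoots the valley floor, too small and the descent takes far longer than it needs to.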
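As a rough illustration of the update loop described above, here is a minimal sketch on a one-parameter toy problem. The function names, the toy loss and the chosen learning rate are all assumptions made for this example; a real model has many parameters and obtains its gradient by backpropagation.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from theta."""
    for _ in range(steps):
        # Move a small step in the direction that decreases the loss.
        theta = theta - learning_rate * grad(theta)
    return theta


# Hypothetical toy loss L(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
def toy_grad(theta):
    return 2.0 * (theta - 3.0)


theta_final = gradient_descent(toy_grad, theta=0.0)
print(theta_final)  # very close to 3.0, the minimiser of the toy loss
```

With this loss, each step shrinks the distance to the minimum by a constant factor, which is why a modest number of steps is enough here.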
Stochastic gradient descent
Computing the exact gradient over an entire dataset at every step is prohibitively expensive for large datasets. Stochastic gradient descent (SGD) instead estimates the gradient from a small random mini-batch of examples. Each step is far cheaper, and the sampling noise can even help by jostling the optimiser out of shallow, poor minima. The trade-off is that individual updates are less accurate, so SGD typically needs careful learning-rate tuning and often a momentum term, a running average of recent gradients, to smooth its path.
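To make the mini-batch idea concrete, the sketch below fits a straight line with plain mini-batch SGD (no momentum). The function name, the synthetic data and every hyperparameter value are assumptions chosen for illustration, and the data is noise-free so that the fit can be checked easily.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.05, batch_size=8, steps=2000, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Estimate the gradient from a small random subset of the data.
        batch = rng.sample(range(n), batch_size)
        grad_w, grad_b = 0.0, 0.0
        for i in batch:
            error = (w * xs[i] + b) - ys[i]
            grad_w += 2.0 * error * xs[i]
            grad_b += 2.0 * error
        grad_w /= batch_size
        grad_b /= batch_size
        # Same update rule as full gradient descent, but with the noisy estimate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical synthetic data drawn from y = 2x + 1, without noise.
xs = [i / 50.0 - 1.0 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]
w_fit, b_fit = sgd_linear_fit(xs, ys)
print(w_fit, b_fit)  # close to 2.0 and 1.0
```

Each step here touches 8 examples rather than all 100. On real, noisy data the estimates would keep fluctuating around the best fit rather than settling exactly on it.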
Adam and AdamW
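Much of modern training uses adaptive optimisers rather than plain SGD. Adam (Adaptive Moment Estimation) keeps exponential running averages of both the gradient and its square for every parameter, and uses them to scale each parameter’s step individually: the step is proportional to the average gradient divided by the square root of the average squared gradient. Parameters with small but consistent gradients therefore take relatively larger steps, while parameters with noisy gradients take smaller ones. This adaptive behaviour often makes Adam converge quickly and tolerate a wider range of learning rates.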
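The per-parameter scaling can be sketched in a few lines. What follows is a simplified, list-based version of the Adam update with its commonly used default coefficients. The function name, the toy loss and the learning rate of 0.05 are assumptions for this example, not a reference implementation.

```python
import math


def adam_step(params, grads, m, v, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; t is the 1-based step count."""
    for i, g in enumerate(grads):
        # Running averages of the gradient and of its square.
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        # Bias correction: both averages start at zero, so early values are too small.
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)
        # Per-parameter step: the average gradient scaled by its typical magnitude.
        params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)


# Two parameters whose gradients differ by four orders of magnitude.
first = [0.0, 0.0]
adam_step(first, [100.0, 0.01], m=[0.0, 0.0], v=[0.0, 0.0], t=1)
# Both entries of `first` move by roughly learning_rate, despite the gap in scale.

# Hypothetical toy loss sum((p - target)^2), minimised with 2000 Adam steps.
targets = [3.0, -2.0]
params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 2001):
    grads = [2.0 * (p - tgt) for p, tgt in zip(params, targets)]
    adam_step(params, grads, m, v, t, learning_rate=0.05)
```

The first call shows why Adam is less sensitive to gradient scale than SGD: on the very first step the update size is set almost entirely by the learning rate, not by the raw gradient magnitude.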
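AdamW is a refinement that decouples weight decay from the gradient update. Instead of adding an L2 penalty to the gradient, where Adam’s per-parameter scaling would distort it, AdamW shrinks each weight directly by a small fraction at every step. In practice AdamW often generalises better than Adam with L2 regularisation, and it has become a default optimiser for training large language models and many other modern deep networks.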
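The difference between the two ways of applying weight decay is easiest to see side by side. The sketch below is an illustrative, list-based update with a hypothetical `decoupled` flag that switches between AdamW-style decay and Adam with an L2 term folded into the gradient. The function names and hyperparameter values are assumptions for this example.

```python
import math


def adam_like_step(params, grads, m, v, t, learning_rate, weight_decay,
                   decoupled, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place update; `decoupled` selects AdamW-style weight decay."""
    for i, g in enumerate(grads):
        if decoupled:
            # AdamW: shrink the weight directly, outside the adaptive scaling.
            params[i] -= learning_rate * weight_decay * params[i]
        else:
            # Adam with L2 regularisation: the penalty's gradient is added to g,
            # so it is later rescaled by the adaptive denominator.
            g = g + weight_decay * params[i]
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)
        params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)


def one_step(decoupled):
    """Take a single step from weight 1.0 with a zero loss gradient."""
    params, m, v = [1.0], [0.0], [0.0]
    adam_like_step(params, [0.0], m, v, t=1, learning_rate=0.1,
                   weight_decay=0.1, decoupled=decoupled)
    return params[0]


w_adamw = one_step(decoupled=True)    # shrinks by learning_rate * weight_decay
w_adam_l2 = one_step(decoupled=False)  # the decay is normalised into a full-size step
print(w_adamw, w_adam_l2)
```

With the loss gradient set to zero, only the decay acts on the weight. The decoupled version shrinks it by 1%, whereas the L2 version takes a step of roughly the full learning rate, because the adaptive denominator rescales the small penalty gradient.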
Learning rate scheduling
Even with an adaptive optimiser, the learning rate is usually the most important hyperparameter to tune, and it generally should not stay fixed throughout training. A common recipe is:
- Warmup — start with a tiny learning rate and ramp it up over the first few thousand steps, so early, unreliable gradients do not destabilise the model.
- Decay — gradually shrink the rate afterwards, often with cosine decay, so the model takes precise small steps as it nears a good minimum.
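The two phases above can be combined into a single function of the step number. This is a minimal sketch assuming linear warmup followed by cosine decay to a floor. The function name and every default value (peak rate, warmup length, total steps, minimum rate) are hypothetical choices for illustration, not recommendations.

```python
import math


def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000,
               total_steps=100_000, min_lr=3e-5):
    """Learning rate for a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from a tiny rate up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: follow half a cosine wave from peak_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, lr_at_step(step))
```

In a training loop, the value returned for the current step would be passed to the optimiser as its learning rate before each update. Cosine decay changes the rate slowly near the start and end of the decay phase and fastest in the middle.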
Together, the optimiser and its schedule determine how reliably and how quickly a model navigates its loss landscape — and getting them right is one of the most consequential parts of training.