How Neural Networks Learn: Backpropagation and Gradient Descent

The maths behind training: how errors propagate backwards to update weights

Learning is correction, repeated

A freshly built neural network has random weights and produces meaningless outputs. Learning is the process of correcting those weights so the outputs match known answers. The recipe is simple to state: make a prediction, measure how wrong it was, and adjust the weights to be a little less wrong — then repeat thousands or millions of times. Two ideas power this loop. A loss function turns “how wrong” into a single number, and an algorithm called gradient descent, fed by backpropagation, decides how to change each weight to shrink that number. The simulator above lets you drive gradient descent yourself on a simple curve.
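As a rough illustration of how a loss function turns "how wrong" into a single number, here is a minimal sketch using mean squared error. The choice of mean squared error and the example values are assumptions made for illustration; the article does not name a specific loss.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences: one number for 'how wrong'."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return total / len(targets)


# Hypothetical known answers and two sets of guesses.
targets = [1.0, 2.0, 3.0]
bad_guess = [3.0, 0.0, 5.0]      # every prediction is off by 2
better_guess = [1.5, 2.0, 2.5]   # closer to the known answers

bad_loss = mean_squared_error(bad_guess, targets)
better_loss = mean_squared_error(better_guess, targets)

print(bad_loss)     # 4.0
print(better_loss)  # smaller, because the guesses are closer
```

A lower number means the predictions sit closer to the known answers, which is all the training loop needs in order to tell whether an adjustment helped.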

The loss landscape

Imagine plotting the loss for every possible setting of the weights. The result is a hilly surface — the loss landscape — where height is error and the lowest valley is the best the network can do. Training is the search for that valley. With one weight the landscape is a curve, as in the simulator; with millions of weights it is an unimaginably high-dimensional surface, but the same intuition holds. At any point on the surface, the gradient tells you the slope: which direction is uphill, and how steep. To reduce error you simply walk downhill, against the gradient.
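To make the idea of slope concrete, the sketch below measures the gradient of a one-weight loss curve by nudging the weight slightly in each direction. The curve `loss(w) = (w - 3)^2` and the helper name `slope` are hypothetical, chosen only so the minimum sits at a known place.

```python
def loss(w):
    # Hypothetical one-weight loss curve with its lowest point at w = 3.
    return (w - 3.0) ** 2


def slope(f, w, eps=1e-6):
    # Central finite difference: an approximation of the gradient of f at w.
    return (f(w + eps) - f(w - eps)) / (2 * eps)


left = slope(loss, 1.0)    # negative: uphill is to the left, so move right
right = slope(loss, 5.0)   # positive: uphill is to the right, so move left
bottom = slope(loss, 3.0)  # close to zero: the bottom of the valley

print(left, right, bottom)
```

The sign of the slope says which way is uphill, and its magnitude says how steep the surface is at that point. Walking downhill means moving the weight in the direction opposite to that sign.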

Gradient descent and the learning rate

Gradient descent is that downhill walk made precise. At the current weights, compute the gradient, then move each weight a small step in the opposite direction. The learning rate scales the size of that step, and it matters enormously. Too small, and training crawls, needing far more steps than necessary to reach the valley. Too large, and each step overshoots: the weight leaps past the minimum and can land farther away than it started, so the loss grows and training diverges instead of settling. The simulator makes this vivid: a modest learning rate glides smoothly into the minimum, while a large one overshoots and oscillates. Tuning this single number is one of the most important parts of training.
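The effect of the learning rate can be sketched on a single weight. The example below assumes the hypothetical loss curve `(w - 3)^2`, whose gradient is `2 * (w - 3)`; the three learning rates are illustrative values, not recommendations.

```python
def grad(w):
    # Gradient of the hypothetical loss (w - 3)^2.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate, steps):
    """Repeatedly step against the gradient and record every weight visited."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        history.append(w)
    return history


slow = gradient_descent(0.0, learning_rate=0.01, steps=20)
good = gradient_descent(0.0, learning_rate=0.1, steps=20)
diverging = gradient_descent(0.0, learning_rate=1.1, steps=20)

for name, history in [("slow", slow), ("good", good), ("diverging", diverging)]:
    print(name, "distance from minimum:", abs(history[-1] - 3.0))
```

On this curve each update multiplies the distance from the minimum by `1 - 2 * learning_rate`. With 0.01 that factor is 0.98, so progress is slow; with 0.1 it is 0.8, so the weight settles quickly; with 1.1 it is -1.2, so the weight flips sides on every step and moves farther away each time.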

Backpropagation and the chain rule

In a deep network, one weight buried in an early layer affects the final loss only indirectly, through every layer between it and the output. Backpropagation is the algorithm that untangles this. Using the chain rule of calculus, it starts at the output, computes how sensitive the loss is to the final layer's outputs, and propagates that sensitivity backwards layer by layer until every weight has a gradient. The brilliance is efficiency: instead of nudging each weight separately and re-running the network, which would need one forward pass per weight, backpropagation computes all the gradients in essentially one backward pass. Without it, training large networks would be hopelessly slow.
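The chain rule at work can be sketched on the smallest possible network: one input, one hidden unit with a tanh activation, one output, and a squared-error loss. The architecture, the variable names, and the numbers are all assumptions chosen for illustration. The final lines compare the backpropagated gradients with a slower finite-difference estimate as a sanity check.

```python
import math


def forward(w1, w2, x, target):
    """Forward pass: x -> tanh(w1 * x) -> w2 * h -> squared error."""
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2
    return loss, (x, h, y)  # keep intermediate values for the backward pass


def backward(w2, target, cache):
    """Backward pass: apply the chain rule from the loss back to each weight."""
    x, h, y = cache
    dloss_dy = 2.0 * (y - target)          # loss with respect to the output
    dloss_dw2 = dloss_dy * h               # output layer weight
    dloss_dh = dloss_dy * w2               # sensitivity passed to hidden unit
    dloss_dz = dloss_dh * (1.0 - h ** 2)   # through tanh: d tanh(z)/dz = 1 - h^2
    dloss_dw1 = dloss_dz * x               # early layer weight
    return dloss_dw1, dloss_dw2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
loss, cache = forward(w1, w2, x, target)
g1, g2 = backward(w2, target, cache)

# Sanity check: nudge each weight and re-run the forward pass.
eps = 1e-6
numeric_g1 = (forward(w1 + eps, w2, x, target)[0]
              - forward(w1 - eps, w2, x, target)[0]) / (2 * eps)
numeric_g2 = (forward(w1, w2 + eps, x, target)[0]
              - forward(w1, w2 - eps, x, target)[0]) / (2 * eps)

print(g1, numeric_g1)
print(g2, numeric_g2)
```

The finite-difference check needs two extra forward passes per weight, while the backward pass produces both gradients from one sweep. That gap is small here but grows with the number of weights.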

Stochastic gradient descent in practice

Computing the gradient over an entire dataset before every step is expensive, so real training uses stochastic gradient descent (SGD), usually in its mini-batch form: estimate the gradient from a small random batch of examples, take a step, and move on to the next batch. The estimate is noisier, but each step is far cheaper, so the network usually learns much faster overall. The noise can even help: it can jostle the weights out of shallow, poor valleys toward better ones. Modern optimizers like Adam build on SGD with adaptive, per-weight step sizes, but the heart of it remains what you can explore above: measure the slope, step downhill, and repeat until the loss is low.
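The batch-by-batch loop can be sketched with a toy model that fits a straight line. Everything here is an assumption made for illustration: the noise-free dataset on `y = 2x + 1`, the batch size, the learning rate, and the number of epochs. Only the Python standard library is used.

```python
import random

rng = random.Random(0)  # fixed seed so repeated runs give the same result

# Hypothetical toy dataset: points on the line y = 2x + 1, with no label noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + 1.0) for x in xs]


def batch_gradient(w, b, batch):
    """Gradient of the mean squared error over one batch, for w and b."""
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def sgd(examples, learning_rate=0.1, batch_size=10, epochs=50):
    examples = list(examples)  # copy so the caller's list is not reordered
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(examples)  # fresh random batches every epoch
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            grad_w, grad_b = batch_gradient(w, b, batch)
            w -= learning_rate * grad_w  # step against the estimated gradient
            b -= learning_rate * grad_b
    return w, b


w, b = sgd(data)
print(w, b)  # should end up close to the true slope 2 and intercept 1
```

Each update looks at only 10 of the 200 examples, so its gradient is an estimate rather than the exact slope, yet the weights still settle near the values that generated the data.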
