How does a neural network actually learn?

It learns by trial and correction. The network makes a prediction, a loss function measures how wrong that prediction is, and the network then adjusts its weights to reduce the error. This adjustment is computed by backpropagation and applied by gradient descent, repeated over many examples until the predictions match the correct answers well enough.

What is backpropagation?

Backpropagation is the algorithm that works out how much each weight contributed to the error. Using the chain rule from calculus, it propagates the error backwards from the output layer through every earlier layer, producing a gradient for each weight: the direction in which the loss rises as that weight changes, and how steeply. Moving each weight against its gradient reduces the loss. Backpropagation is what makes training deep networks computationally feasible.

What is gradient descent and the learning rate?

Gradient descent is the rule for updating weights: move each one a small step in the direction that lowers the loss, as indicated by its gradient. The learning rate controls the size of that step. Too small and learning crawls; too large and the update overshoots the minimum and can diverge, so choosing a good learning rate is critical.

Why is it called stochastic gradient descent?

Computing the gradient over the entire dataset every step is slow, so in practice the gradient is estimated from a small random batch of examples at a time. This randomness is the stochastic part. It makes each step noisier but far faster and, helpfully, the noise can nudge the model out of poor local minima during training.

How Neural Networks Learn: Backpropagation and Gradient Descent

Learning is correction, repeated

A freshly built neural network has random weights and produces meaningless outputs. Learning is the process of correcting those weights so the outputs match known answers. The recipe is simple to state: make a prediction, measure how wrong it was, and adjust the weights to be a little less wrong — then repeat thousands or millions of times. Two ideas power this loop. A loss function turns “how wrong” into a single number, and an algorithm called gradient descent, fed by backpropagation, decides how to change each weight to shrink that number. The simulator above lets you drive gradient descent yourself on a simple curve.
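To make "a single number" concrete, here is a minimal sketch of one common loss function, mean squared error. The choice of this particular loss, the function name, and the sample values are assumptions for illustration; the text above does not commit to any specific loss.

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences: 0 when every prediction is exact."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical predictions and known answers.
predictions = [2.5, 0.0, 2.0]
targets = [3.0, -0.5, 2.0]

print(mean_squared_error(predictions, targets))  # small but nonzero error
print(mean_squared_error(targets, targets))      # perfect predictions give 0.0
```

Squaring keeps every error positive and penalizes large misses more than small ones, which is one reason this loss is a popular default.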

The loss landscape

Imagine plotting the loss for every possible setting of the weights. The result is a hilly surface — the loss landscape — where height is error and the lowest valley is the best the network can do. Training is the search for that valley. With one weight the landscape is a curve, as in the simulator; with millions of weights it is an unimaginably high-dimensional surface, but the same intuition holds. At any point on the surface, the gradient tells you the slope: which direction is uphill, and how steep. To reduce error you simply walk downhill, against the gradient.
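The idea of the gradient as a slope can be sketched for the one-weight case. The parabola below is a hypothetical stand-in for a real loss curve, and the slope is estimated numerically by nudging the weight a tiny amount in each direction; none of these names come from the simulator.

```python
def loss(w):
    # Hypothetical one-weight loss: a parabola whose lowest point is at w = 3.
    return (w - 3.0) ** 2


def slope(f, w, eps=1e-6):
    # Central finite difference: approximate slope of f at the point w.
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


for w in (0.0, 3.0, 5.0):
    print(f"w={w}: loss={loss(w):.2f}, slope={slope(loss, w):.2f}")
```

At w = 0 the slope is negative, so uphill is to the left and the valley lies to the right. At w = 5 the slope is positive, so the valley lies to the left. At w = 3 the slope is approximately zero: the bottom of the valley.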

Gradient descent and the learning rate

Gradient descent is that downhill walk made precise. At the current weights, compute the gradient, then move each weight a small step in the opposite direction. The learning rate scales the size of that step, and it matters enormously. Too small, and training crawls, needing far more steps than necessary to reach the valley. Too large, and each step overshoots: the weight leaps past the minimum and can bounce outward, diverging instead of settling. The simulator makes this vivid: a modest learning rate glides smoothly into the minimum, while a large one overshoots and oscillates. Tuning this single number is one of the most important parts of training.
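The effect of the learning rate can be reproduced in a few lines. This is a minimal sketch on a hypothetical one-weight loss, (w - 3)^2, not the simulator's own code; the three learning rates are chosen only to show the three behaviors described above.

```python
def grad(w):
    # Derivative of the hypothetical loss (w - 3)^2.
    return 2.0 * (w - 3.0)


def descend(w, learning_rate, steps):
    """Run plain gradient descent and return every weight visited."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # step against the gradient
        history.append(w)
    return history


small = descend(0.0, learning_rate=0.01, steps=20)  # crawls
good = descend(0.0, learning_rate=0.1, steps=20)    # settles near w = 3
large = descend(0.0, learning_rate=1.1, steps=20)   # overshoots and diverges

for name, history in (("small", small), ("good", good), ("large", large)):
    print(f"{name}: final w = {history[-1]:.3f}")
```

After 20 steps the small rate has covered only about a third of the distance to the minimum, the moderate rate is within a few hundredths of it, and the large rate has bounced back and forth across the minimum and ended up more than 100 units away.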

Backpropagation and the chain rule

In a deep network, a weight buried in an early layer affects the final loss only indirectly, through every layer that comes after it. Backpropagation is the algorithm that untangles this. Using the chain rule of calculus, it starts at the output, computes how much the loss changes with respect to the final layer's output, and propagates that sensitivity backwards layer by layer until every weight has a gradient. The brilliance is efficiency: instead of nudging each weight separately and rerunning the network to see how the loss changes, backpropagation computes all the gradients in essentially one backward pass. Without it, training large networks would be hopelessly slow.
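The backward pass can be sketched on the smallest network that still has a "buried" weight: one input, one hidden unit, one output. The architecture, the tanh activation, the squared-error loss, and all names here are assumptions chosen for illustration. The chain-rule gradients are then checked against the slow alternative of nudging each weight in turn.

```python
import math


def forward(w1, w2, x, target):
    h = math.tanh(w1 * x)       # hidden activation (early layer)
    y = w2 * h                  # network output (final layer)
    loss = (y - target) ** 2    # squared error
    return h, y, loss


def backward(w1, w2, x, target):
    """Chain rule applied from the loss back to each weight."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2.0 * (y - target)      # sensitivity of loss to the output
    dloss_dw2 = dloss_dy * h           # final-layer weight
    dloss_dh = dloss_dy * w2           # pass sensitivity back to hidden unit
    dh_dz = 1.0 - h ** 2               # derivative of tanh
    dloss_dw1 = dloss_dh * dh_dz * x   # early-layer weight
    return dloss_dw1, dloss_dw2


def numeric_grads(w1, w2, x, target, eps=1e-6):
    """Slow check: perturb each weight separately and rerun the network."""
    g1 = (forward(w1 + eps, w2, x, target)[2]
          - forward(w1 - eps, w2, x, target)[2]) / (2.0 * eps)
    g2 = (forward(w1, w2 + eps, x, target)[2]
          - forward(w1, w2 - eps, x, target)[2]) / (2.0 * eps)
    return g1, g2


args = (0.5, -1.5, 2.0, 1.0)  # hypothetical w1, w2, x, target
print("backprop:", backward(*args))
print("numeric: ", numeric_grads(*args))
```

The numeric check needs two extra forward passes per weight, which is affordable for two weights but not for millions. The backward pass reuses the intermediate values from a single forward pass to get every gradient at once.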

Stochastic gradient descent in practice

Computing the gradient over an entire dataset before every step is expensive, so real training uses stochastic gradient descent (SGD): estimate the gradient from a small random batch of examples, take a step, and move on to the next batch. The estimate is noisier, but each step is far cheaper, so the network learns much faster overall. The noise is even useful — it can jostle the weights out of shallow, poor valleys toward better ones. Modern optimizers like Adam build on SGD with adaptive step sizes, but the heart of it remains what you can explore above: measure the slope, step downhill, and repeat until the loss is low.
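A minimal sketch of mini-batch SGD, fitting a line to a small synthetic dataset. The dataset (noise-free points on y = 2x + 1), the batch size, the learning rate, and the helper names are all assumptions for illustration; real training code would normally use a library optimizer rather than a hand-written loop.

```python
import random

random.seed(0)  # fixed seed so the shuffling is repeatable

# Hypothetical dataset: 100 points lying exactly on y = 2x + 1.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-50, 50)]


def batch_gradient(w, b, batch):
    """Gradient of mean squared error over one batch, for the model y = w*x + b."""
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = (w * x + b) - y
        grad_w += 2.0 * error * x
        grad_b += 2.0 * error
    n = len(batch)
    return grad_w / n, grad_b / n


w, b = 0.0, 0.0
learning_rate = 0.02
batch_size = 8

for epoch in range(100):
    random.shuffle(data)  # the "stochastic" part: random batches each pass
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w, grad_b = batch_gradient(w, b, batch)
        w -= learning_rate * grad_w  # step downhill on the batch estimate
        b -= learning_rate * grad_b

print(f"w = {w:.4f}, b = {b:.4f}")  # should approach the true values 2 and 1
```

Each update looks at only 8 of the 100 examples, so one pass through the data yields 13 weight updates instead of one. Because this toy dataset has no noise, the batch estimates all agree at the true solution and the weights settle there; on real data the batch-to-batch noise never fully disappears.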