Definition
Backpropagation (short for “backward propagation of errors”) is the algorithm that computes the gradients used to train neural networks. After the network makes a prediction and a loss function measures how wrong it was, backpropagation computes how much each of the network’s weights contributed to that error. It does this by applying the chain rule of calculus to propagate gradients backwards from the output to every parameter, so each weight can be nudged in the direction that reduces the loss.
The forward and backward pass
Training proceeds in two phases:
- Forward pass — input data flows through the network layer by layer to produce a prediction, and intermediate values are cached along the way.
- Backward pass — starting from the loss at the output, backpropagation works backwards through each layer, computing the gradient of the loss with respect to that layer’s inputs and weights.
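The two phases can be sketched on a deliberately tiny network: one tanh hidden unit followed by a linear output and a squared-error loss. This is only an illustration under those assumptions. The names `forward`, `backward`, `params` and `cache` are hypothetical, not taken from any library.

```python
import math

def forward(params, x, target):
    """Forward pass: compute the loss and cache intermediate values."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1                  # hidden pre-activation
    h = math.tanh(z)                 # hidden activation
    y = w2 * h + b2                  # prediction
    loss = 0.5 * (y - target) ** 2   # squared-error loss (an assumed choice)
    cache = (x, h, y, target, w2)    # values the backward pass will reuse
    return loss, cache

def backward(cache):
    """Backward pass: walk from the loss back to every parameter."""
    x, h, y, target, w2 = cache
    dy = y - target                  # dL/dy
    dw2 = dy * h                     # dL/dw2
    db2 = dy                         # dL/db2
    dh = dy * w2                     # dL/dh, handed back to the hidden layer
    dz = dh * (1.0 - h * h)          # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x                     # dL/dw1
    db1 = dz                         # dL/db1
    return (dw1, db1, dw2, db2)

# Hypothetical parameter values and a single training example.
params = (0.5, -0.1, 1.5, 0.2)
loss, cache = forward(params, x=2.0, target=1.0)
grads = backward(cache)
```

The backward function never re-evaluates `tanh`: it only reads the values stored in `cache`, which is why the forward pass keeps them.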
Because a neural network is a deep chain of nested functions, the chain rule lets the gradient at one layer be expressed in terms of the gradient at the next layer (the one closer to the output), multiplied by a local derivative. This is why error “flows backwards”.
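For a fully connected network, this recursion can be written out explicitly. The notation here is an assumption introduced for illustration: layer $l$ has weights $W^{(l)}$, bias $b^{(l)}$, pre-activation $z^{(l)}$, activation $a^{(l)}$ with elementwise nonlinearity $\sigma$, and $\delta^{(l)} = \partial L / \partial z^{(l)}$ denotes the gradient of the loss $L$ at that layer.

```latex
\begin{aligned}
z^{(l)} &= W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\!\left(z^{(l)}\right) \\
\delta^{(l)} &= \left( \left(W^{(l+1)}\right)^{\top} \delta^{(l+1)} \right) \odot \sigma'\!\left(z^{(l)}\right) \\
\frac{\partial L}{\partial W^{(l)}} &= \delta^{(l)} \left(a^{(l-1)}\right)^{\top}, \qquad
\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}
\end{aligned}
```

The second line is the backward flow: $\delta^{(l+1)}$ from the next layer is pulled back through that layer's weights and multiplied by the local derivative $\sigma'(z^{(l)})$.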
Why it is efficient
A naive approach would recompute derivatives independently for every weight, which is hopelessly expensive in a network with billions of parameters. Backpropagation instead treats the network as a computational graph and reuses the intermediate values from the forward pass, so each shared sub-result is computed only once. All gradients are obtained together at a cost that is a small constant multiple of one forward pass, regardless of how many weights there are. This dynamic-programming insight is what makes training large models practical.
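The cost difference can be made concrete by counting forward passes. The sketch below assumes one particular naive method, finite differences, which perturbs one weight at a time, and compares it with backpropagation on a hypothetical chain of scalar tanh layers. All names and values are illustrative.

```python
import math

forward_calls = 0  # counts how many full forward passes are run

def forward(weights, x, target):
    """Run the chain a_k = tanh(w_k * a_{k-1}) and return loss and activations."""
    global forward_calls
    forward_calls += 1
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    loss = 0.5 * (acts[-1] - target) ** 2
    return loss, acts

def grads_finite_difference(weights, x, target, eps=1e-6):
    """Naive: one extra forward pass per weight."""
    base, _ = forward(weights, x, target)
    grads = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        loss, _ = forward(bumped, x, target)
        grads.append((loss - base) / eps)
    return grads

def grads_backprop(weights, x, target):
    """Backpropagation: one forward pass, then one backward sweep."""
    _, acts = forward(weights, x, target)
    grads = [0.0] * len(weights)
    da = acts[-1] - target                    # dL/da at the output
    for i in reversed(range(len(weights))):
        dz = da * (1.0 - acts[i + 1] ** 2)    # back through tanh, reusing cached value
        grads[i] = dz * acts[i]               # dL/dw_i
        da = dz * weights[i]                  # dL/da for the previous layer
    return grads

weights = [0.9, -1.1, 0.7, 1.3, -0.8]         # hypothetical values

forward_calls = 0
fd = grads_finite_difference(weights, 0.5, 0.2)
fd_calls = forward_calls

forward_calls = 0
bp = grads_backprop(weights, 0.5, 0.2)
bp_calls = forward_calls
```

With five weights, the finite-difference version runs six forward passes and the backpropagation version runs one. The first count grows with the number of weights; the second does not.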
Backpropagation vs gradient descent
These two are often confused but play distinct roles. Backpropagation computes the gradients; gradient descent uses them to update the weights, typically by subtracting each gradient scaled by a small factor (the learning rate). In a single training step you run a forward pass, run backpropagation to get gradients, then apply a gradient-descent update.
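The division of labour is easiest to see when the two steps are separate functions. The sketch below assumes the simplest possible model, a single linear layer `w * x + b` with mean squared error, so the backward pass is the chain rule applied to just one layer. The function names, data and learning rate are hypothetical.

```python
def forward(w, b, xs, ys):
    """Forward pass: predictions and mean squared error."""
    preds = [w * x + b for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    return loss, preds

def backward(preds, xs, ys):
    """Gradient computation only: returns dL/dw and dL/db, changes nothing."""
    n = len(xs)
    dw = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    db = sum(2.0 * (p - y) for p, y in zip(preds, ys)) / n
    return dw, db

def train_step(w, b, xs, ys, learning_rate):
    """One training step: forward pass, gradients, gradient-descent update."""
    loss, preds = forward(w, b, xs, ys)
    dw, db = backward(preds, xs, ys)
    w -= learning_rate * dw          # the gradient-descent update lives here,
    b -= learning_rate * db          # not in backward()
    return w, b, loss

# Toy data generated from y = 2x + 1 (an assumed example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
losses = []
for _ in range(500):
    w, b, loss = train_step(w, b, xs, ys, learning_rate=0.05)
    losses.append(loss)
```

Swapping in a different update rule would change only the two update lines in `train_step`; `backward` would stay exactly as it is.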
Why it matters
Backpropagation is the engine behind essentially all deep learning. Every modern model — from image classifiers to large language models — is trained by repeatedly running backpropagation over batches of data. Understanding it clarifies why issues like vanishing or exploding gradients arise and why choices such as activation functions and normalisation matter so much for training deep networks.
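The vanishing-gradient problem follows directly from the chain rule's repeated multiplication. As a rough illustration, the sketch below assumes a chain of scalar sigmoid layers with a fixed weight of 1 and measures how the gradient with respect to the input shrinks as depth grows. The function names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient_magnitude(depth, weight=1.0, x=0.5):
    """|d a_depth / d x| for the chain a_k = sigmoid(weight * a_{k-1})."""
    a = x
    grad = 1.0
    for _ in range(depth):
        a = sigmoid(weight * a)
        grad *= weight * a * (1.0 - a)   # local derivative of one layer
    return abs(grad)

# Gradient magnitude at several assumed depths.
magnitudes = {d: input_gradient_magnitude(d) for d in (1, 5, 10, 20)}
```

The sigmoid's derivative never exceeds 0.25, so with this weight each extra layer multiplies the gradient by at most a quarter. Larger weights or a different activation change that factor, which is one reason these design choices affect how well deep networks train.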