Residual Connection (AI Glossary)

Skip connections that let gradients flow through very deep networks


Definition

A residual connection, also called a skip or shortcut connection, adds the input of a layer directly to its output. Instead of forcing each layer to learn a full transformation from scratch, the layer only needs to learn the difference (the residual) between its input and the desired output. Popularised by the ResNet image-classification architecture in 2015, residual connections are now a near-universal ingredient in deep networks, including virtually every transformer.

The vanishing-gradient problem they solve

Before residual connections, simply stacking many layers often made networks harder to train, not better. During back-propagation, the gradient is multiplied by each layer's local derivative in turn; when those factors are mostly smaller than one, the product across dozens of layers shrinks toward zero. This is the vanishing-gradient problem, and it means the early layers barely learn. A residual connection creates a direct identity path along which the gradient can flow backward without passing through the layer's weights, keeping the signal usable even in networks hundreds of layers deep.
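
The shrinking product described above can be illustrated with a deliberately tiny model. The sketch below assumes scalar "layers" that each multiply by a single weight, which is a simplification chosen for this example and not how a real network is built; the function names are hypothetical. Without a skip, each layer contributes a factor w to the gradient; with a skip, the layer computes x + w * x, so the factor becomes 1 + w.

```python
def plain_stack_gradient(weights):
    """d(output)/d(input) for a stack where each layer computes w * x."""
    grad = 1.0
    for w in weights:
        grad *= w          # chain rule: each layer contributes a factor w
    return grad


def residual_stack_gradient(weights):
    """Same layers, but each one computes x + w * x (a skip connection)."""
    grad = 1.0
    for w in weights:
        grad *= 1.0 + w    # the extra 1 comes from the identity path
    return grad


# Toy assumption: 20 layers, every weight equal to 0.1.
weights = [0.1] * 20
plain = plain_stack_gradient(weights)
residual = residual_stack_gradient(weights)

print(f"plain stack gradient:    {plain:.3e}")
print(f"residual stack gradient: {residual:.3e}")
```

In this toy setting the plain gradient is on the order of 1e-20, while the residual gradient stays above 1. The example also hints at a caveat: the identity path prevents the product from collapsing, but it does not by itself stop gradients from growing, which is one reason normalisation layers are usually used alongside skips.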

The identity shortcut

Mathematically, if a block computes a function F applied to its input x, a residual connection makes the block’s output F(x) + x. The “+ x” is the identity shortcut. A useful consequence: if the optimal behaviour for that block is to do nothing, the network only has to drive F(x) toward zero, which is far easier than learning to reproduce the identity mapping through nonlinear layers. This is why “learning the residual” is more forgiving than learning the full mapping.
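
As a minimal sketch of the F(x) + x pattern, the block below takes F to be linear, then ReLU, then linear. That choice of F, the helper names, and the example weights are all assumptions made for illustration. The demonstration sets the second linear layer to zero, so F(x) is zero and the block passes its input through unchanged.

```python
def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, b1, w2, b2):
    """Return F(x) + x, where F is linear -> ReLU -> linear."""
    f_x = linear(relu(linear(x, w1, b1)), w2, b2)
    # The identity shortcut: add the block's input to F's output.
    # This requires F to preserve the length of x.
    return [f + xi for f, xi in zip(f_x, x)]


x = [1.0, -2.0, 3.0]

# Arbitrary first-layer weights (illustrative values only).
w1 = [[0.5, -0.1, 0.2],
      [0.3, 0.8, -0.4],
      [-0.6, 0.1, 0.7]]
b1 = [0.1, 0.0, -0.1]

# A second layer of all zeros makes F(x) = 0 for every input.
w2 = [[0.0] * 3 for _ in range(3)]
b2 = [0.0] * 3

identity_output = residual_block(x, w1, b1, w2, b2)
print(identity_output)
```

Because F(x) is zero here, the printed vector equals x. A plain (non-residual) version of the same block would output all zeros with these weights, and would need carefully tuned weights to approximate the identity.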

Residuals in transformers

Transformers depend heavily on residual connections. A standard transformer block has two sub-layers, multi-head attention and a feed-forward network, and each is wrapped in its own residual connection, typically combined with layer normalisation. In the pre-norm arrangement the normalisation is applied to the sub-layer's input, inside the residual branch; in the post-norm arrangement it is applied after the addition. The running sum carried along the skip path through the whole stack is often called the “residual stream”, and it is central to mechanistic interpretability research, which views layers as reading from and writing to that shared stream.
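
The two normalisation arrangements differ only in where the layer norm sits relative to the addition. The sketch below shows the wiring for a single token vector, under several stated assumptions: the sub-layers are passed in as plain callables (hypothetical stand-ins, since real attention mixes information across tokens), and the layer norm omits the learnable scale and shift that practical implementations usually include.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def pre_norm_block(x, attention, feed_forward):
    """Normalise inside each residual branch; the skip path is left untouched."""
    x = add(x, attention(layer_norm(x)))
    x = add(x, feed_forward(layer_norm(x)))
    return x


def post_norm_block(x, attention, feed_forward):
    """Normalise after each addition, so the skip path is normalised too."""
    x = layer_norm(add(x, attention(x)))
    x = layer_norm(add(x, feed_forward(x)))
    return x


def zero_sublayer(v):
    """Stand-in sub-layer that writes nothing to the stream."""
    return [0.0] * len(v)


stream = [1.0, 2.0, 3.0]
print(pre_norm_block(stream, zero_sublayer, zero_sublayer))
print(post_norm_block(stream, zero_sublayer, zero_sublayer))
```

With sub-layers that contribute nothing, the pre-norm block returns its input exactly, which matches the picture of an unmodified residual stream that sub-layers only add to. The post-norm block rescales the stream at every step even in this case.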

Why they matter

Residual connections are one of the quiet enablers of the deep-learning era. They make it possible to train the very deep stacks that state-of-the-art accuracy typically requires, they stabilise optimisation, and they let architects add depth without the model degrading. Almost any state-of-the-art network you encounter today, whether for vision, language, or multimodal tasks, relies on them as foundational plumbing.
