Question 1

What is a residual connection?

Accepted Answer

A residual connection adds a layer's input directly to its output, so the layer only has to learn the difference (the 'residual') between the desired output and the input rather than the full transformation. It is also called a skip connection or shortcut connection, and it was popularised by the ResNet architecture in 2015.
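As a rough illustration of "add the input to the output", here is a minimal sketch in plain Python. The helper name `residual` and the toy layer `halve` are hypothetical, chosen only for this example. A real network would use learned tensor operations rather than list arithmetic.

```python
from typing import Callable, List

Vector = List[float]


def residual(layer: Callable[[Vector], Vector]) -> Callable[[Vector], Vector]:
    """Wrap `layer` so that the wrapped call returns layer(x) + x.

    Hypothetical helper for illustration; not taken from any library.
    """
    def wrapped(x: Vector) -> Vector:
        fx = layer(x)
        # The addition only makes sense if the layer preserves the input size.
        if len(fx) != len(x):
            raise ValueError("layer must return a vector of the same length as x")
        return [f + xi for f, xi in zip(fx, x)]
    return wrapped


def halve(x: Vector) -> Vector:
    """Toy stand-in for a learned transformation."""
    return [0.5 * xi for xi in x]


block = residual(halve)
print(block([2.0, 4.0]))  # halve gives [1.0, 2.0]; adding the input gives [3.0, 6.0]
```

The size check reflects a practical constraint: the shortcut can only be added element-wise when the layer's output has the same shape as its input.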

Question 2

Why do residual connections help training?

Accepted Answer

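They give gradients a direct path back through the network during back-propagation: because each block outputs its input plus the layer's result, the gradient reaching the block's input always includes a term that bypasses the layer entirely. This greatly reduces the vanishing-gradient problem, allowing networks hundreds of layers deep to train stably where plain stacked layers often fail.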
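The effect on gradients can be sketched with a deliberately simplified toy model. It assumes every layer is the scalar function F(x) = slope * x, so the end-to-end gradient is just a product of per-layer slopes. The function name, the slope of 0.01 and the depth of 50 are illustrative assumptions, not measurements from a real network.

```python
def end_to_end_gradient(slope: float, depth: int, use_residual: bool) -> float:
    """Derivative of the final output with respect to the input for a stack of
    `depth` scalar layers F(x) = slope * x, computed with the chain rule."""
    grad = 1.0
    for _ in range(depth):
        local = slope            # dF/dx for a plain layer
        if use_residual:
            local += 1.0         # d(x + F(x))/dx = 1 + dF/dx
        grad *= local
    return grad


plain = end_to_end_gradient(0.01, depth=50, use_residual=False)
skip = end_to_end_gradient(0.01, depth=50, use_residual=True)

print(f"plain stack:    {plain:.3e}")  # vanishingly small, on the order of 1e-100
print(f"residual stack: {skip:.3f}")   # stays of order one, about 1.64
```

The extra `1.0` in each factor is the direct path: even when a layer's own slope is tiny, the product does not collapse toward zero. In this toy model larger slopes would make the residual product grow instead, so the sketch shows why gradients stop vanishing, not that any stack is automatically well behaved.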

Question 3

Where are residual connections used in transformers?

Accepted Answer

In the standard transformer architecture, each block wraps both its attention sub-layer and its feed-forward sub-layer in its own residual connection, usually paired with layer normalisation. Without these connections, the deep stacks of layers in modern LLMs would be effectively untrainable.
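The wiring of a block can be sketched as follows. This assumes the "pre-norm" arrangement, where normalisation is applied before each sub-layer; it is one common layout, and the original transformer applied normalisation after the addition instead. The sub-layers `toy_attention` and `toy_feed_forward` are hypothetical placeholders acting on a single vector. Real attention mixes information across a whole sequence of tokens, and real layer normalisation also has a learned scale and shift.

```python
import math
from typing import Callable, List

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalise to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def add(a: Vector, b: Vector) -> Vector:
    return [ai + bi for ai, bi in zip(a, b)]


def transformer_block(x: Vector, attention: SubLayer, feed_forward: SubLayer) -> Vector:
    """Pre-norm block: each sub-layer sees a normalised copy of x, and its
    output is added back onto the un-normalised x."""
    x = add(x, attention(layer_norm(x)))      # residual around attention
    x = add(x, feed_forward(layer_norm(x)))   # residual around feed-forward
    return x


def toy_attention(x: Vector) -> Vector:
    """Placeholder that contributes nothing."""
    return [0.0 for _ in x]


def toy_feed_forward(x: Vector) -> Vector:
    """Placeholder that adds a small correction."""
    return [0.1 * xi for xi in x]


out = transformer_block([1.0, 2.0, 3.0], toy_attention, toy_feed_forward)
print(out)  # the input plus a small correction from the feed-forward placeholder
```

Because both additions operate on the running value of `x`, the original input reaches the block's output even when a sub-layer contributes nothing.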

Question 4

What does the 'residual' actually represent?

Accepted Answer

If the desired output of a block is H(x), the block learns F(x) = H(x) − x, and the connection adds x back: output = F(x) + x. Learning this residual is easier because if the best thing a block can do is pass its input through unchanged, it only needs to push F(x) toward zero rather than learn an exact identity mapping.
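The "do nothing" case can be made concrete with a one-parameter toy layer. The functions `plain_layer` and `residual_layer` and the single weight `w` are hypothetical simplifications used only for this example.

```python
from typing import List

Vector = List[float]


def plain_layer(x: Vector, w: float) -> Vector:
    """Plain layer: must represent the whole target H(x) = w * x itself."""
    return [w * xi for xi in x]


def residual_layer(x: Vector, w: float) -> Vector:
    """Residual layer: represents only F(x) = w * x, then adds x back."""
    return [w * xi + xi for xi in x]


x = [1.5, -2.0]

# Suppose the ideal behaviour is the identity, H(x) = x.
# The plain layer reaches it only at the specific weight w = 1.
print(plain_layer(x, 1.0))      # [1.5, -2.0]
# The residual layer reaches it when its weight is zero, i.e. F(x) = 0.
print(residual_layer(x, 0.0))   # [1.5, -2.0]
# A plain layer with zero weight discards its input instead.
print(plain_layer(x, 0.0) == [0.0, 0.0])  # True
```

With weights near zero, the residual layer starts out close to the identity, whereas the plain layer starts out close to erasing its input.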

Residual Connection (AI Glossary)

Definition

The vanishing-gradient problem they solve

The identity shortcut

Residuals in transformers

Why they matter