What is layer normalisation?
Layer normalisation (layer norm) is a technique that stabilises the training of deep neural networks by rescaling the activations inside a layer. For each individual example, it computes the mean and variance across the feature dimension, normalises the activations to zero mean and unit variance, then applies a learned per-feature scale and shift so the network can restore whatever scale and offset suit the next layer.
The purpose is to keep activation magnitudes well-behaved as signals pass through many layers. Without normalisation, activations can grow or shrink dramatically, producing unstable gradients and slow or failed training.
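The steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the names layer_norm, gamma, beta and eps are chosen here for the example, and the small eps term (an assumption, commonly added in practice) guards against division by zero when all features are equal.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalise one example's feature vector, then apply scale and shift."""
    n = len(x)
    # Default to the identity transform: scale 1, shift 0 (illustrative defaults).
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n

    # Statistics are taken across the features of this single example.
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)

    # Zero mean, unit variance, then the learned per-feature scale and shift.
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

features = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(features)
```

With the default gamma and beta, normed has mean close to zero and variance close to one. In a trained network gamma and beta would be learned parameters rather than fixed lists.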
Layer norm vs batch norm
Both techniques normalise activations, but along different axes:
- Batch normalisation normalises each feature across the examples in a batch. Its statistics therefore depend on batch size and on which examples happen to be grouped together. It works well for vision models with large, fixed batches but becomes unreliable with small or variable batches.
- Layer normalisation normalises across the features of a single example. It is completely independent of batch size and of other examples in the batch.
This independence is the decisive difference: layer norm gives identical behaviour whether the batch contains one example or a thousand, and whether sequences are short or long.
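A small experiment makes the difference in axes concrete. The sketch below places the same example in two different batches and normalises it both ways. It is a simplified illustration: both functions omit the learned scale and shift, batch_norm uses training-mode batch statistics only (no running averages), and all names are chosen for this example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise across the features of one example."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def batch_norm(batch, eps=1e-5):
    """Normalise each feature across the examples in the batch."""
    n = len(batch)
    columns = list(zip(*batch))  # one tuple per feature
    means = [sum(c) / n for c in columns]
    variances = [sum((v - m) ** 2 for v in c) / n
                 for c, m in zip(columns, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)]
            for row in batch]

example = [1.0, 2.0, 3.0]
batch_a = [example, [0.0, 0.0, 0.0]]
batch_b = [example, [10.0, -5.0, 7.0]]

# Layer norm: the result for `example` ignores its batch mates.
ln_a = [layer_norm(row) for row in batch_a][0]
ln_b = [layer_norm(row) for row in batch_b][0]

# Batch norm: the result for `example` changes with its batch mates.
bn_a = batch_norm(batch_a)[0]  # approximately [1, 1, 1]
bn_b = batch_norm(batch_b)[0]  # approximately [-1, 1, -1]
```

Here ln_a and ln_b are identical, while bn_a and bn_b differ even in sign, because the batch statistics shift with whichever example happens to share the batch.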
Why transformers use layer norm
Transformers process variable-length sequences, and the meaningful unit is the individual token representation. Batch statistics computed over unrelated sequences are noisy and skewed by padding and differing sequence lengths, which makes batch norm a poor fit. Layer norm normalises each token’s feature vector on its own, so it behaves consistently regardless of sequence length or batch size. This reliability is why essentially every transformer, from BERT to modern large language models, uses layer norm or its close cousin RMSNorm, which drops the mean-centring step for efficiency.
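To show what dropping the mean-centring step means in practice, here is a minimal RMSNorm sketch in plain Python. The names rms_norm, gain and eps are illustrative assumptions, and real implementations differ in details such as where eps is added.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Rescale a feature vector by its root mean square; no mean is subtracted."""
    n = len(x)
    gain = gain if gain is not None else [1.0] * n  # learned in a real model
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    return [g * v / rms for v, g in zip(x, gain)]

token = [3.0, 4.0]
out = rms_norm(token)
```

The output has a root mean square close to one, but its mean is not forced to zero. Compared with layer norm this saves computing and subtracting the mean, and there is no shift parameter in this sketch.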
Pre-norm vs post-norm
Where the normalisation is placed inside each transformer block has a large effect on trainability:
- Post-norm, used in the original 2017 transformer, applies layer norm after the residual connection adds the sublayer output. It tends to require careful learning-rate warmup and becomes hard to train as depth increases, because every residual sum is normalised and gradients must therefore pass through a layer norm at each block on the main path.
- Pre-norm applies layer norm before each sublayer (attention or feed-forward), leaving an unnormalised residual path running straight through the network. Gradients can flow along that path across many layers without passing through any normalisation, which makes very deep models far more stable to train.
Because of this stability advantage, pre-norm is now the standard in modern large transformers, often combined with RMSNorm to reduce computation. The placement looks like a minor detail, but it is one of the practical changes that made training networks with dozens or hundreds of layers reliable.
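The two placements differ only in where the normalisation sits relative to the residual addition, which a short sketch can make explicit. This is an illustrative toy, not a full transformer block: toy_sublayer is a hypothetical stand-in for attention or a feed-forward layer, and layer_norm omits the learned scale and shift.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_norm_block(x, sublayer):
    # LayerNorm(x + Sublayer(x)): the residual sum itself is normalised.
    return layer_norm(add(x, sublayer(x)))

def pre_norm_block(x, sublayer):
    # x + Sublayer(LayerNorm(x)): x reaches the output without being normalised.
    return add(x, sublayer(layer_norm(x)))

def toy_sublayer(h):
    # Hypothetical placeholder for attention or a feed-forward layer.
    return [0.5 * v for v in h]

x = [1.0, 2.0, 3.0, 6.0]
post_out = post_norm_block(x, toy_sublayer)
pre_out = pre_norm_block(x, toy_sublayer)
```

In the post-norm output the input's mean and scale have been normalised away, whereas the pre-norm output still contains x unchanged plus the sublayer's contribution. Stacking many pre-norm blocks therefore keeps an identity path from input to output.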