Question 1

What is layer normalisation?

Accepted Answer

Layer normalisation rescales the activations within a single layer for each individual example, so they have zero mean and unit variance across the feature dimension, then applies a learned scale and shift. It stabilises and speeds up training by keeping activation magnitudes well-behaved as signals flow through a deep network.
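As a minimal sketch of the computation described above, the following pure-Python function normalises one example's feature vector and then applies a scale and shift. The names `layer_norm`, `gamma`, `beta` and `eps` are illustrative assumptions, not taken from any particular library. The small `eps` term is the usual guard against division by zero.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise one example's features, then apply a learned scale and shift."""
    n = len(x)
    mean = sum(x) / n
    # Population variance over the feature dimension of this single example.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]   # activations for one example
gamma = [1.0] * len(x)     # scale, here left at the identity (assumed values)
beta = [0.0] * len(x)      # shift, here left at zero (assumed values)
y = layer_norm(x, gamma, beta)
```

With the identity scale and zero shift used here, `y` has a mean of approximately zero and a variance of approximately one. In a trained model `gamma` and `beta` would be learned parameters.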

Question 2

How is layer norm different from batch norm?

Accepted Answer

Batch norm normalises each feature across the examples in a batch, so its output for one example depends on the batch size and on the statistics of the other examples. Layer norm normalises across the features of a single example, so it is independent of batch size and of other examples, which is why it suits sequence models and small or variable batches.
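The difference in normalisation axis can be illustrated with a small sketch. It assumes training-mode batch statistics and omits the learned scale and shift for brevity. The helper names `normalise`, `layer_norm_rows` and `batch_norm_columns` are hypothetical. Each row of `batch` is one example and each column is one feature.

```python
import math

def normalise(values, eps=1e-5):
    """Shift and scale a list of numbers to roughly zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm_rows(batch):
    # Layer norm: statistics come from the features of each example (each row).
    return [normalise(row) for row in batch]

def batch_norm_columns(batch):
    # Batch norm (training-mode statistics): statistics come from each
    # feature across all examples in the batch (each column).
    columns = [normalise(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

batch = [
    [1.0, 2.0, 3.0],
    [4.0, 6.0, 8.0],
    [0.0, 10.0, 20.0],
]
smaller = batch[:2]  # same first two examples, third one dropped

ln_full, ln_small = layer_norm_rows(batch), layer_norm_rows(smaller)
bn_full, bn_small = batch_norm_columns(batch), batch_norm_columns(smaller)

# The first example's layer-norm output ignores what else is in the batch.
print(ln_full[0] == ln_small[0])
# Its batch-norm output changes when the batch composition changes.
print(bn_full[0] == bn_small[0])
```

Dropping the third example leaves the layer-norm result for the first example unchanged, while the batch-norm result for that same example shifts because the column statistics are different.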

Question 3

Why do transformers use layer norm instead of batch norm?

Accepted Answer

Transformers process variable-length sequences where batch statistics are unstable and tokens within a batch are not comparable position by position. Layer norm operates per token independently of the batch, so it behaves consistently regardless of sequence length or batch size, making it the natural fit.

Question 4

What is the difference between pre-norm and post-norm?

Accepted Answer

Post-norm applies normalisation after the residual addition, as in the original transformer; it can perform well but is harder to train at depth without careful learning-rate warmup. Pre-norm applies normalisation to the input of each sublayer, before the residual addition, which leaves the residual path unnormalised and makes very deep models train more stably, so it has become the common default.
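The two orderings can be sketched as follows. This is a toy illustration under stated assumptions: `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer, and the layer norm omits its learned scale and shift to keep the focus on where the normalisation sits.

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer norm without learned scale and shift (omitted for brevity)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise residual addition."""
    return [u + v for u, v in zip(a, b)]

def post_norm_block(x, sublayer):
    # Post-norm: normalise AFTER adding the residual, so the residual
    # stream itself passes through the normalisation.
    return layer_norm(add(x, sublayer(x)))

def pre_norm_block(x, sublayer):
    # Pre-norm: normalise only the sublayer's input; the residual
    # stream x is carried forward unchanged and the update is added to it.
    return add(x, sublayer(layer_norm(x)))

def toy_sublayer(h):
    # Hypothetical stand-in for attention or a feed-forward network.
    return [0.5 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)
```

In the pre-norm block, `pre` is exactly `x` plus the sublayer's update, so the input survives on the residual path. In the post-norm block, the sum is renormalised, so `post` has a mean of approximately zero regardless of the scale of `x`.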

Layer Normalisation (AI Glossary)
