Question 1

Why do transformers need positional encoding?

Accepted Answer

Self-attention has no built-in notion of order. It is permutation-equivariant: shuffling the input tokens only shuffles the outputs in the same way, so each token gets the same representation wherever it sits in the sequence. Positional encoding adds explicit order information so the model can tell "dog bites man" from "man bites dog".
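The order-blindness described above can be checked on a toy example. This is a minimal sketch, not a production attention layer: it assumes a single head with identity projections (Q = K = V = the input) and no causal mask, and the function names and token vectors are made up for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    # Assumption: single head, identity projections (Q = K = V = x), no mask.
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Three toy token vectors with no positional information added.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Shuffling the inputs only shuffles the outputs in the same way.
expected = [out[i] for i in perm]
same = all(
    abs(a - b) < 1e-9
    for row_a, row_b in zip(out_shuffled, expected)
    for a, b in zip(row_a, row_b)
)
print(same)
```

Each token ends up with the same output vector in both orderings, so nothing downstream can recover which ordering was used unless position is injected somewhere.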

Question 2

What is the difference between sinusoidal and learned positional encodings?

Accepted Answer

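Sinusoidal encodings are fixed sine and cosine functions of the position, evaluated at a different frequency for each pair of dimensions, and require no training. Learned encodings are trainable vectors, one per position up to a fixed maximum length, optimised along with the rest of the model. Sinusoidal encodings can be computed for any position, including lengths not seen in training, although quality at those lengths is not guaranteed. Learned encodings have no vector at all for positions beyond the trained maximum, but can fit the data better within that range.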
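As a rough illustration of the fixed variant, the sketch below computes a sinusoidal encoding for a single position. It assumes the common formulation with base 10000 and interleaved sine/cosine dimensions; the function name and the `base` parameter are illustrative choices, not taken from any particular library.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    # Even dimensions hold sin, odd dimensions hold cos; each pair shares
    # one frequency, and frequencies fall as the dimension index grows.
    pe = []
    for i in range(0, d_model, 2):
        angle = position / base ** (i / d_model)
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

first = sinusoidal_encoding(0, 4)
# Any position can be encoded, including ones far beyond a training length.
far = sinusoidal_encoding(50_000, 4)
print(first)
print(len(far))
```

A learned encoding, by contrast, would be a lookup table with one trainable row per position, so a position past the last row simply has no entry.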

Question 3

What is RoPE (Rotary Position Embedding)?

Accepted Answer

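RoPE encodes position by rotating each query and key vector, pair of dimensions by pair of dimensions, by an angle proportional to its token's position before computing attention. Because the dot product of two rotated vectors depends only on the difference between their rotation angles, the attention score depends on the relative distance between the two tokens rather than on their absolute positions. RoPE is widely used in models like Llama and GPT-NeoX.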
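The relative-distance property can be seen numerically. The sketch below is a simplified illustration, assuming adjacent dimensions are paired and base 10000 is used; real implementations differ in how they pair dimensions and in how they batch the computation. The `rope` and `dot` helpers and the sample vectors are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    # Rotate each adjacent pair (vec[i], vec[i+1]) by position * frequency.
    # Assumption: len(vec) is even and pairs are adjacent dimensions.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]

# Same offset (2) at different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
print(abs(score_near - score_far) < 1e-9)
```

The two scores agree up to floating-point error because only the offset between query and key positions survives in the dot product.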

Question 4

What is ALiBi and how does it help with long context?

Accepted Answer

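ALiBi (Attention with Linear Biases) adds no positional information to the embeddings. Instead, it subtracts a penalty from each attention score that grows linearly with the distance between query and key, using a different fixed slope for each head. Its authors report that it extrapolates to sequences longer than those seen in training, which is why it has been adopted in some long-context models.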
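To make the penalty concrete, the sketch below builds the bias that would be added to one head's attention scores. It assumes the causal setting and the geometric slope schedule described for power-of-two head counts; the function names are illustrative, and other head counts would need a different schedule.

```python
def alibi_slopes(num_heads):
    # Assumption: num_heads is a power of two. Slopes form a geometric
    # sequence, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    # bias[i][j] is added to the score of query i attending to key j.
    # Past keys get a penalty linear in distance; future keys are masked.
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Nothing in the bias depends on a maximum length, so the same rule applies unchanged to sequences longer than those used in training.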

Positional Encoding (AI Glossary)

Definition

Why attention needs it

Absolute encodings: sinusoidal and learned

Relative and rotary methods: RoPE and ALiBi

Why it matters