Question 1

What is the KV cache and why does it matter?

Accepted Answer

The KV cache stores the key and value vectors the model has already computed for every token in the context, so generating a new token only requires computing keys and values for that one token instead of for the whole sequence again. This is what makes token-by-token generation efficient. The cache grows linearly with context length (and also with the number of layers, KV heads, and concurrent sequences) and it lives in GPU memory, which makes it a major practical limit on how long a context a server can hold.
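To make the linear growth concrete, here is a rough back-of-the-envelope estimate in Python. The function name and the model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) are hypothetical values chosen for illustration and do not describe any particular model. Real servers add overhead for paging, padding, and fragmentation.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 counts one key vector plus one value vector
    per token, per layer, per KV head.
    """
    return (2 * batch_size * num_tokens * num_layers
            * num_kv_heads * head_dim * bytes_per_value)


# Hypothetical model shape, for illustration only.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(1, **config)        # 524288 bytes (512 KiB) per token
at_4k = kv_cache_bytes(4096, **config)         # 2 GiB for one 4096-token sequence
at_8k = kv_cache_bytes(8192, **config)         # exactly twice the 4096-token figure

print(per_token, at_4k / 2**30, at_8k / at_4k)
```

Under these assumptions, doubling the context doubles the cache, and every additional concurrent sequence adds its own copy.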

Question 2

Why is attention described as quadratic?

Accepted Answer

In standard self-attention, every token attends to every other token, so the number of attention scores grows with the square of the sequence length. Doubling the context roughly quadruples the compute for the attention step, and the memory as well when the full score matrix is materialized. This quadratic scaling is the core reason long context is expensive and why researchers pursue cheaper approximations.
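A small Python sketch of the arithmetic, counting query-key score entries per attention head. The function name is made up for this example, and the causal variant assumes each token attends to itself and all earlier tokens.

```python
def attention_score_entries(seq_len, causal=False):
    """Count the query-key scores computed for one attention head."""
    if causal:
        # Token i attends to tokens 0..i, giving 1 + 2 + ... + seq_len scores.
        return seq_len * (seq_len + 1) // 2
    # Full (bidirectional) attention: every token against every token.
    return seq_len * seq_len


full_1k = attention_score_entries(1024)
full_2k = attention_score_entries(2048)
print(full_2k / full_1k)  # 4.0: doubling the length quadruples the work

causal_1k = attention_score_entries(1024, causal=True)
causal_2k = attention_score_entries(2048, causal=True)
print(causal_2k / causal_1k)  # just under 4: causal masking halves the count, not the growth rate
```

Causal masking cuts the constant factor roughly in half, but the growth is still quadratic.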

Question 3

What is the difference between nominal and effective context?

Accepted Answer

Nominal context is the advertised maximum number of tokens a model accepts. Effective context is the portion of that window over which the model still retrieves and reasons reliably. Models frequently accept far more than they reliably reason over, so a million-token window does not guarantee the model attends carefully to everything inside it.
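One common way to probe effective context is a "needle in a haystack" test: hide a single fact at different depths in filler text and check whether the model can still answer a question about it. The sketch below is a minimal harness under stated assumptions: `ask_model` is a hypothetical callable you supply (prompt string in, reply string out), and substring matching is a crude stand-in for real answer grading.

```python
def build_haystack(needle, filler, total_sentences, depth):
    """Insert `needle` into `total_sentences` copies of `filler`.

    `depth` is a fraction in [0, 1]: 0 puts the needle first, 1 puts it last.
    """
    sentences = [filler] * total_sentences
    sentences.insert(round(depth * total_sentences), needle)
    return " ".join(sentences)


def recall_by_depth(ask_model, needle, expected, question,
                    filler, total_sentences, depths):
    """Return {depth: bool} recording whether each reply contained `expected`.

    `ask_model` is a hypothetical callable, not a real library API.
    """
    results = {}
    for depth in depths:
        context = build_haystack(needle, filler, total_sentences, depth)
        reply = ask_model(context + "\n\n" + question)
        results[depth] = expected in reply
    return results
```

Running this at increasing `total_sentences` gives a rough picture of where recall starts to drop, which is often well below the nominal limit.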

Question 4

What is the 'lost in the middle' problem?

Accepted Answer

Research on long-context retrieval (notably Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts") has shown that models often retrieve information placed at the very start or end of a long context more reliably than information buried in the middle. Practically, this means that where you put critical content within a long prompt affects whether the model uses it, so important instructions and facts belong near the start or the end of the prompt.
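As a sketch of how that advice might be applied when assembling a prompt, the Python below puts instructions first, the question last, and arranges supporting documents so the most relevant ones sit at the edges. The function names are invented for this example, and the alternating placement is one simple heuristic, not an established standard. It assumes the documents arrive already ranked, most relevant first.

```python
def order_for_edges(ranked_docs):
    """Reorder documents so the most relevant land at the start and end.

    `ranked_docs` must be sorted most relevant first. Even-ranked items
    fill in from the front, odd-ranked items from the back, which leaves
    the least relevant items in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


def build_prompt(instructions, ranked_docs, question):
    """Assemble a prompt with instructions first and the question last."""
    parts = [instructions] + order_for_edges(ranked_docs) + [question]
    return "\n\n".join(parts)


print(order_for_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Here "A" (most relevant) opens the document block, "B" closes it, and "E" (least relevant) ends up in the middle.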

How Context Windows Work: KV Cache, Attention, and Memory

The context window is the model’s working memory

Attention and its quadratic cost

The KV cache: speed, and its memory cost

Nominal vs effective context, and “lost in the middle”