Question 1

What is a KV cache?

Accepted Answer

A KV cache stores the key and value vectors that a transformer's attention layers compute for each token, so they do not have to be recomputed when generating later tokens. It is the standard optimisation that makes autoregressive LLM inference fast.

Question 2

Why does the KV cache speed up inference?

Accepted Answer

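Without it, generating each new token would require recomputing the keys and values for every earlier token in the sequence. With the cache, each step computes the query, key, and value for the new token only and attends over the stored history, so the attention cost per step grows linearly with sequence length rather than quadratically.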
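To make the comparison concrete, here is a minimal single-head sketch in plain Python. The projection matrices `W_Q`, `W_K`, `W_V`, the toy token vectors, and the `KVCache` class are all hypothetical names chosen for illustration; a real model has many layers and heads and runs on tensors. The sketch only shows that appending to a cache gives the same attention output as recomputing every key and value from scratch.

```python
import math

def project(x, w):
    """Multiply vector x by matrix w (a list of rows, one row per input dimension)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all stored keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

# Hypothetical toy projection weights (2-dimensional model).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, -0.5], [0.25, 1.0]]
W_V = [[1.0, 2.0], [-1.0, 0.5]]

def step_without_cache(prefix):
    """Recompute K and V for the whole prefix, then attend from the last token."""
    keys = [project(t, W_K) for t in prefix]
    values = [project(t, W_V) for t in prefix]
    return attend(project(prefix[-1], W_Q), keys, values)

class KVCache:
    """Keeps every key and value computed so far for one attention head."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        # Only the new token is projected; older keys and values are reused.
        self.keys.append(project(token, W_K))
        self.values.append(project(token, W_V))
        return attend(project(token, W_Q), self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
cache = KVCache()
cached_outputs = [cache.step(t) for t in tokens]
recomputed_outputs = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
```

Both lists hold the same outputs, but `step_without_cache` projects the whole prefix on every call while `KVCache.step` projects one token.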

Question 3

Why does the KV cache use so much memory?

Accepted Answer

It stores a key tensor and a value tensor for every token, in every layer, and for every attention head that keeps keys and values. Memory grows linearly with both sequence length and batch size, so long contexts and large batches can make the KV cache the dominant memory cost.
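A rough size estimate follows directly from that description: two tensors (keys and values) times layers, heads, head dimension, tokens, and batch size, times the bytes per stored number. The function below is a back-of-the-envelope sketch; the parameter names and the example configuration (32 layers, 32 key/value heads, head dimension 128, 16-bit values) are assumptions for illustration, not figures for any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 counts one key tensor plus one value tensor.
    bytes_per_value=2 assumes 16-bit storage (for example fp16 or bf16).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=1, batch_size=1)
full_context = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)

print(per_token // 1024, "KiB per token")             # 512 KiB per token
print(full_context / 1024**3, "GiB for 4096 tokens")  # 2.0 GiB for 4096 tokens
```

Doubling either `seq_len` or `batch_size` doubles the result, which is why long contexts and large batches are where the cache starts to dominate.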

Question 4

What happens when the KV cache fills up?

Accepted Answer

When the cache reaches the model's context limit or exhausts the available GPU memory, the system must truncate or evict older tokens, drop the request, or use techniques such as paged attention and cache compression to manage the space.
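Of the options above, evicting the oldest tokens is the simplest to sketch. The example below assumes a fixed token budget and a sliding-window policy; the class name `SlidingWindowKVCache` and its `max_tokens` parameter are hypothetical, and the keys and values are stand-in one-element lists. It is not an implementation of paged attention or cache compression, and dropping old entries means the model can no longer attend to them.

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps keys and values for only the most recent max_tokens tokens."""
    def __init__(self, max_tokens):
        # A deque with maxlen discards the oldest entry when a new one is appended at capacity.
        self.keys = deque(maxlen=max_tokens)
        self.values = deque(maxlen=max_tokens)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

window = SlidingWindowKVCache(max_tokens=3)
for position in range(5):
    # Stand-in vectors: the key records the token position, the value ten times that.
    window.append(key=[float(position)], value=[float(position) * 10.0])

print(len(window))        # 3
print(list(window.keys))  # [[2.0], [3.0], [4.0]]
```

Positions 0 and 1 have been evicted, so memory stays bounded at the cost of losing the earliest context.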

KV Cache (AI Glossary)
