What is a KV cache?
The KV cache (key-value cache) is the central optimisation that makes transformer text generation efficient. During inference, a model generates one token at a time, and each new token must “attend” to all the tokens before it. In every layer, the attention mechanism computes a key and a value vector for every token. Because attention is causal, the keys and values of earlier tokens do not change when a new token is appended. So rather than recomputing those vectors for the whole sequence at every step, the model stores them in the KV cache and reuses them, computing keys and values only for the single new token.
Why it matters
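Transformers are autoregressive: they produce token 1, then token 2 using token 1, then token 3 using tokens 1 and 2, and so on. Without a cache, generating the Nth token means re-running the model over the whole prefix, recomputing keys and values for all N−1 previous tokens even though they were already computed at earlier steps. The cost of each step then grows quadratically with the prefix length, and almost all of that work is redundant.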
With a KV cache:
- The keys and values for previous tokens are computed once and stored.
- Each new generation step only computes the key/value for the new token and attends against the cached history.
- The per-token cost of the projections and feed-forward layers becomes constant, and the attention cost of each step grows only linearly with the sequence so far rather than quadratically.
This is a classic memory-for-compute trade: you spend GPU memory to avoid redundant computation.
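The two decoding strategies above can be compared in a minimal sketch. It assumes a single attention head in a single layer, fixed input vectors in place of real token embeddings, and hypothetical names (`decode_no_cache`, `decode_with_cache`); it is an illustration of the idea, not how any particular library implements it.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (d,) query for the newest token; K, V: (t, d) keys/values of the prefix.
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def decode_no_cache(X, Wq, Wk, Wv):
    # Naive decoding: keys and values for the whole prefix are recomputed at every step.
    outputs = []
    for t in range(len(X)):
        prefix = X[: t + 1]
        K = prefix @ Wk  # redundant work for tokens 0..t-1
        V = prefix @ Wv
        outputs.append(attend(X[t] @ Wq, K, V))
    return np.stack(outputs)

def decode_with_cache(X, Wq, Wk, Wv):
    # Cached decoding: each step projects only the new token and appends to the cache.
    k_cache, v_cache = [], []
    outputs = []
    for x in X:
        k_cache.append(x @ Wk)
        v_cache.append(x @ Wv)
        # np.stack copies the cache each step; real systems preallocate a buffer instead.
        outputs.append(attend(x @ Wq, np.stack(k_cache), np.stack(v_cache)))
    return np.stack(outputs)
```

Both functions return the same outputs; the cached version simply avoids re-projecting the prefix, at the price of keeping `k_cache` and `v_cache` in memory.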
Why the cache grows so large
The cache must hold a key and a value tensor for:
- every token in the context,
- multiplied by every layer in the model,
- multiplied by every key/value head (the same as the number of attention heads in standard multi-head attention, fewer in grouped-query or multi-query attention),
- multiplied by every sequence in the batch.
So memory use scales linearly with context length and batch size. For long contexts (tens or hundreds of thousands of tokens) and large batches, the KV cache often becomes the single largest consumer of GPU memory during inference, sometimes larger than the model weights themselves.
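The scaling above can be turned into a back-of-the-envelope estimate. The helper below is a sketch with a hypothetical name (`kv_cache_bytes`), and the example configuration (32 layers, 32 key/value heads, head dimension 128, 16-bit values) is an illustrative assumption, not the specification of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # The factor 2 accounts for storing both a key and a value vector.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Illustrative configuration (an assumption, see above).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
print(per_token)             # 524288 bytes, i.e. 0.5 MiB per token
full = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(full / 2**30)          # 2.0 GiB for one 4096-token sequence
```

Doubling either the context length or the batch size doubles the result, which is why long contexts and large batches put so much pressure on memory.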
When the cache fills up
Two limits can be hit:
- The model’s context window: the maximum number of tokens it was trained to handle.
- Available hardware memory: even within the context window, the cache may not physically fit.
When either limit approaches, serving systems use strategies such as paged attention (managing the cache in fixed blocks like virtual memory), cache quantisation or compression, evicting the oldest tokens, or simply truncating the prompt. Understanding the KV cache is essential for anyone optimising long-context or high-throughput LLM serving.
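To make the paged approach concrete, here is a toy sketch of block-based cache bookkeeping. The class and method names (`PagedKVAllocator`, `append_token`, `free`) are hypothetical and do not correspond to the API of any real serving system; the sketch only tracks which block each token would live in and stores no actual tensors.

```python
class PagedKVAllocator:
    """Toy allocator that hands out fixed-size cache blocks to sequences."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                           # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token and return (block_id, offset)."""
        table = self.block_tables.get(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        offset = length % self.block_size
        if offset == 0:
            # The sequence is new or its last block is full: take a fresh block.
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            table.append(self.free_blocks.pop())
            self.block_tables[seq_id] = table
        self.lengths[seq_id] = length + 1
        return table[-1], offset

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence only claims a new block when its last one is full, memory is reserved in small increments rather than for the maximum context length up front, and blocks released by finished sequences can be reused immediately.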