The context window is the model’s working memory
A context window is the maximum span of tokens — prompt, conversation history, retrieved documents, and the response being generated — that a model can hold in view at once. It is not long-term memory; it is working memory, rebuilt every request. Everything the model “knows” in the moment must fit inside it, and anything that overflows is truncated or dropped. The advertised size (4K, 128K, 1M tokens) tells you the ceiling, but the interesting questions are how the model represents all those tokens internally and whether it actually uses them well. Both come down to two mechanisms: attention and the KV cache.
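To make "rebuilt every request" concrete, here is a minimal sketch of how an application might trim conversation history to fit a window before each call. Everything in it is an assumption for illustration: `count_tokens` is a crude whitespace stand-in (a real system must count with the model's own tokenizer), and `fit_to_window` with its drop-oldest-first policy is just one of several reasonable strategies.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for illustration only: real systems must use the
    # model's own tokenizer, which rarely matches whitespace splitting.
    return len(text.split())


def fit_to_window(system_prompt, history, max_context_tokens, reserved_for_response):
    """Keep the system prompt plus the most recent messages that fit the budget.

    The response being generated also lives in the window, so room for it
    is reserved up front.
    """
    budget = max_context_tokens - reserved_for_response - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt alone exceeds the context budget")

    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost

    kept.reverse()  # restore chronological order
    return [system_prompt] + kept


if __name__ == "__main__":
    history = ["one two three", "four five", "six seven eight nine"]
    print(fit_to_window("be brief", history, max_context_tokens=12, reserved_for_response=4))
```

With a 12-token window and 4 tokens reserved for the reply, the oldest message no longer fits and is silently lost, which is exactly the overflow behavior described above.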
Attention and its quadratic cost
Transformers process context using self-attention: for each token, the model computes how relevant every other token is and blends their information accordingly. The catch is that “every token attends to every other token” means the attention computation scales with the square of the sequence length: double the context and you roughly quadruple the attention compute and, in a naive implementation that materializes the full score matrix, the attention memory as well. This quadratic scaling is the fundamental tax on long context. A 100K-token request has 25 times the tokens of a 4K one, but roughly 625 times (25 squared) the attention work. It is also why much of the research frontier (sparse attention, sliding windows, linear-attention approximations) is about escaping that quadratic curve without losing too much quality.
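The arithmetic behind the quadratic curve is easy to check. The sketch below counts query-key pairs for full self-attention in a single head of a single layer, and the size of the score matrix if it were stored naively. The function names and the 2-bytes-per-score (16-bit) default are assumptions for illustration; optimized kernels avoid storing the full matrix, so treat the memory figure as a worst case rather than what a production server allocates.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full self-attention (one head, one layer)."""
    return n_tokens * n_tokens


def cost_ratio(long_ctx: int, short_ctx: int) -> float:
    """How many times more attention work the longer context needs."""
    return attention_pairs(long_ctx) / attention_pairs(short_ctx)


def naive_score_matrix_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Memory to store the full n x n score matrix (assumes 16-bit scores)."""
    return attention_pairs(n_tokens) * bytes_per_score


if __name__ == "__main__":
    print(cost_ratio(8_000, 4_000))       # doubling the context
    print(cost_ratio(100_000, 4_000))     # 25x the tokens
    print(naive_score_matrix_bytes(100_000) / 1e9, "GB per head per layer")
```

Doubling the context gives a ratio of 4, and going from 4,000 to 100,000 tokens gives 625, not 25.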
The KV cache: speed, and its memory cost
When a model generates text one token at a time, recomputing attention over the entire history for every new token would be ruinously slow. The KV cache fixes this: the model stores the key and value vectors it already computed for each past token and reuses them, so generating token n+1 only requires processing the one new token against the cached history. This makes generation fast, but the cache grows linearly with context length and lives in scarce GPU memory. KV-cache size is one of the real, physical limits on how long a context a given server can serve and how many requests it can batch — and it is why long contexts raise cost and can reduce throughput even when the model nominally supports them.
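A back-of-the-envelope estimate shows why the cache becomes a physical limit. The sketch below assumes the cache stores one key vector and one value vector per token, per layer, per key/value head. The model shape in `HYPOTHETICAL_MODEL` (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an illustrative assumption, not the configuration of any particular model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV-cache size in bytes.

    The leading 2 accounts for storing both a key and a value vector
    for every token, in every layer, for every KV head.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Illustrative shape only; substitute the real values for your model.
HYPOTHETICAL_MODEL = dict(n_layers=32, n_kv_heads=32, head_dim=128)

if __name__ == "__main__":
    for seq_len in (4_096, 131_072):
        gib = kv_cache_bytes(seq_len, **HYPOTHETICAL_MODEL) / 2**30
        print(f"{seq_len:>7} tokens -> {gib:.0f} GiB of KV cache per sequence")
```

Under these assumptions each token costs 512 KiB of cache, so a single 4,096-token sequence needs about 2 GiB and a single 131,072-token sequence about 64 GiB. The growth is linear, but the constant is large enough that long contexts quickly crowd out other requests in the batch.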
Nominal vs effective context, and “lost in the middle”
Here is the gap that catches people out: a model’s nominal context (what it accepts) is not the same as its effective context (what it reasons over well). Models routinely accept far more tokens than they use carefully. Empirically, the “lost in the middle” effect shows that information placed at the start or end of a long prompt tends to be retrieved more reliably than the same information buried in the middle; attention is not uniform across a huge window. The practical lessons follow directly: do not assume a giant window means perfect recall; put your most important instructions and facts near the edges of the prompt; prefer retrieving and inserting only the relevant passages (retrieval-augmented generation, or RAG) over dumping everything in; and remember that longer context costs more, runs slower, and can degrade reasoning quality well before you hit the nominal limit. “Long context” is a real capability, but it is a spectrum of diminishing returns, not a switch that gives the model a flawless memory.
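One way to act on the "edges" advice when assembling a retrieval-augmented prompt is sketched below. It assumes the passages arrive already ranked best-first by some retriever, then alternates them between the front and back of the prompt so the weakest land in the middle. The helper names `order_for_edges` and `build_prompt` are hypothetical, and this ordering is a heuristic that may or may not help a given model, so it is worth measuring rather than assuming.

```python
def order_for_edges(passages_best_first):
    """Put the strongest passages at the edges and the weakest in the middle.

    Even-ranked passages fill the front in order; odd-ranked passages fill
    the back in reverse, so rank 1 opens the list and rank 2 closes it.
    """
    front, back = [], []
    for rank, passage in enumerate(passages_best_first):
        (front if rank % 2 == 0 else back).append(passage)
    return front + back[::-1]


def build_prompt(instructions, passages_best_first, question):
    """Instructions open the prompt, the question closes it, context sits between."""
    body = "\n\n".join(order_for_edges(passages_best_first))
    return f"{instructions}\n\n{body}\n\n{question}"


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e"]  # "a" is the most relevant
    print(order_for_edges(ranked))
    print(build_prompt("Answer using only the context.", ranked, "What is X?"))
```

For five passages ranked `a` through `e`, the resulting order is `a, c, e, d, b`: the two most relevant passages sit at the two edges and the least relevant one is in the middle.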