Working memory, not long-term memory
A context window is the maximum span of text that a model can hold in view for a single request: your prompt, the conversation so far, any documents you paste or retrieve, and the answer being written. It is measured in tokens, the small chunks of text (roughly three-quarters of an English word each) that models read and write. The crucial thing to understand is that this is working memory, not long-term memory: it is rebuilt from scratch on every request. The model has no persistent memory of past conversations unless that text is fed back in. Whatever fits inside the window is what the model “knows” in the moment; anything beyond it is truncated or simply never seen.
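To make the token arithmetic concrete, here is a minimal sketch that budgets a request using the three-quarters-of-a-word rule of thumb. The function names, the `reserved_for_answer` parameter, and the whitespace word count are assumptions made for illustration; a real tokenizer will give different counts, especially for code or non-English text.

```python
import math

# Rule of thumb from the text: one token is roughly 0.75 English words.
# This is an approximation, not how any particular tokenizer behaves.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its whitespace-separated words."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_in_window(parts: list[str], window_tokens: int, reserved_for_answer: int) -> bool:
    """Check whether all input parts plus room for the answer fit in the window.

    The answer being written shares the same window as the input,
    so some tokens are held back for it.
    """
    used = sum(estimate_tokens(part) for part in parts)
    return used + reserved_for_answer <= window_tokens
```

Because the answer counts against the same window, the check reserves space for it up front instead of treating the whole window as input budget.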
Why context costs so much
Context windows are finite for a concrete technical reason: attention. To process the input, a transformer uses self-attention, in which every token is compared with every other token. In a standard implementation, the compute and memory cost of that step therefore scales with the square of the sequence length: double the context and you roughly quadruple the cost of the attention step. This quadratic scaling is why long context is expensive and slow, and why a request with a huge pasted document costs far more than a short one. It also drives much current research: sparse attention, sliding windows, and other techniques all aim to serve long context without paying the full quadratic price.
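The quadratic claim can be checked with simple counting. The sketch below counts pairwise attention scores for one head in full self-attention and compares that with a causal sliding window. The function names and the particular sliding-window definition (each token sees itself and up to `window - 1` earlier tokens) are assumptions for illustration, not a description of any specific model.

```python
def full_attention_entries(seq_len: int) -> int:
    """Pairwise scores in full self-attention: every token against every token."""
    return seq_len * seq_len


def relative_cost(seq_len: int, baseline_len: int) -> float:
    """How many times more attention scores `seq_len` needs than `baseline_len`."""
    return full_attention_entries(seq_len) / full_attention_entries(baseline_len)


def sliding_window_entries(seq_len: int, window: int) -> int:
    """Pairwise scores when each token attends only to itself and up to
    `window - 1` preceding tokens (an assumed causal sliding window)."""
    return sum(min(position + 1, window) for position in range(seq_len))
```

Doubling the length from 4,096 to 8,192 tokens gives `relative_cost(8192, 4096) == 4.0`, while the sliding-window count grows roughly linearly with length once the sequence is longer than the window.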
Nominal vs effective context
Here is the trap that catches people: the advertised size of a context window (its nominal length) is not the same as how much the model actually uses well (its effective length). Models routinely accept far more tokens than they reason over carefully. A well-documented effect called “lost in the middle” shows that information placed at the start or end of a long prompt is retrieved more reliably than the same information buried in the middle. So a million-token window is a capability, not a guarantee — feeding the model an enormous document does not ensure it attends to every part of it equally.
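One way to probe effective length is to plant a known fact at different depths in a long prompt and see where retrieval degrades. The sketch below covers only the prompt-building half; `place_needle`, the `depth` convention, and the filler passages are hypothetical, and the step that sends each prompt to a model and scores the answer is deliberately left out.

```python
def place_needle(filler: list[str], needle: str, depth: float) -> list[str]:
    """Insert `needle` among `filler` passages at a relative depth.

    depth = 0.0 puts the needle at the very start, 1.0 at the very end,
    and values in between place it proportionally far into the prompt.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return filler[:index] + [needle] + filler[index:]


def build_probe_prompts(filler: list[str], needle: str, depths: list[float]) -> dict[float, str]:
    """Build one prompt per depth so retrieval can be compared across positions."""
    return {
        depth: "\n\n".join(place_needle(filler, needle, depth))
        for depth in depths
    }
```

Comparing answers across depths such as 0.0, 0.5, and 1.0 gives a rough picture of whether the middle of the prompt is being used as well as the edges.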
Working within the limit
Because context is finite, expensive, and unevenly used, a few practical habits pay off. Put the most important instructions and facts near the edges of the prompt, not buried in the middle. Rather than dumping an entire knowledge base in, retrieve and insert only the relevant passages — the retrieval-augmented generation (RAG) pattern — which keeps prompts short, cheaper, and sharper. For long conversations, summarise older turns instead of carrying every message verbatim, so the essential context survives without consuming the whole window. And remember that longer context costs more and runs slower well before you hit the nominal limit, so “fits in the window” and “used effectively” are two different bars. Treat the context window as a scarce, valuable resource and spend it deliberately.
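The three habits above can be sketched together in a few small functions. Everything here is illustrative: the word-overlap score is a toy stand-in for a real retriever, `summarise` is a caller-supplied function (in practice probably a model call), and the prompt layout is just one way to keep instructions and the question at the edges.

```python
from typing import Callable


def overlap_score(query: str, passage: str) -> int:
    """Toy relevance score: distinct query words that also appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(passages: list[str], query: str, k: int) -> list[str]:
    """Return the k passages with the highest overlap score, best first."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def compact_history(
    turns: list[str],
    keep_recent: int,
    summarise: Callable[[list[str]], str],
) -> list[str]:
    """Keep the last `keep_recent` turns verbatim and fold older ones into a summary."""
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return ["Summary of earlier turns: " + summarise(older)] + recent


def build_prompt(
    instructions: str,
    passages: list[str],
    history: list[str],
    question: str,
) -> str:
    """Assemble a prompt with instructions first and the question last,
    so the most important text sits at the edges."""
    return "\n\n".join([instructions, *passages, *history, question])
```

A caller would typically run `retrieve` over the knowledge base, pass the conversation through `compact_history`, and hand both to `build_prompt`, so the prompt carries only the relevant passages and a condensed history.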