Definition
The context window is the maximum amount of text — measured in tokens — that a large language model can take into account at one time. It is the model’s short-term working memory: everything it can “see” while generating a response, including your instructions, any documents you paste in, prior conversation turns, and the reply it is currently producing. Anything outside the window effectively does not exist for the model.
Measured in tokens, not words
Context windows are counted in tokens, the sub-word chunks models actually process. As a rough rule, 1,000 tokens is about 750 English words. So a 128,000-token window holds about 96,000 words, roughly a 250-page book, while a 1-million-token window can fit a small codebase or several long PDFs. Crucially, the prompt and the response share the budget: a long input leaves less room for a long answer.
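The arithmetic behind that rule of thumb can be sketched in a few lines. This is only an estimate: the 0.75 ratio comes from the rule above, the function names are made up for illustration, and a real tokenizer will give different counts depending on the model and the language of the text.

```python
# Rule-of-thumb ratio from the text: 1,000 tokens is about 750 English words.
# This is an approximation, not a property of any particular tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for English text of a given word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Rough word capacity of a given number of tokens."""
    return round(token_count * WORDS_PER_TOKEN)


def remaining_for_reply(window_tokens: int, prompt_tokens: int) -> int:
    """Tokens left for the response, since prompt and reply share one budget."""
    return max(window_tokens - prompt_tokens, 0)


print(estimate_words(128_000))  # 96000
# A 90,000-word prompt in a 128,000-token window leaves about 8,000 tokens.
print(remaining_for_reply(128_000, estimate_tokens(90_000)))  # 8000
```

The second call shows the shared budget in practice: a prompt that nearly fills the window leaves only a short reply.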
Why bigger isn’t free
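The dominant attention mechanism in transformers compares every token with every other token, so its cost grows quadratically with sequence length. Doubling the context roughly quadruples the attention compute and, in a naive implementation, the memory needed to hold the attention scores. This is why long context was historically expensive and why techniques like sliding-window, sparse, and flash attention exist. Sliding-window and sparse attention reduce the number of token pairs that are compared, while FlashAttention computes exact attention without storing the full score matrix, which cuts memory use and memory traffic. Together, such techniques let models reach hundreds of thousands of tokens economically.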
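A quick count of token pairs makes the scaling concrete. The sketch below only counts comparisons and does not compute attention. The function names and the causal sliding-window variant (each token sees itself and the tokens just before it) are illustrative assumptions.

```python
def full_attention_pairs(n: int) -> int:
    """Full attention: every token is scored against every token, n * n pairs."""
    return n * n


def sliding_window_pairs(n: int, window: int) -> int:
    """Causal sliding window: each token is scored against itself and the
    tokens immediately before it, at most `window` tokens in total."""
    return sum(min(i + 1, window) for i in range(n))


# Doubling the sequence length quadruples the pair count for full attention.
print(full_attention_pairs(2_000) / full_attention_pairs(1_000))  # 4.0

# With a fixed window the count grows roughly linearly in n instead.
print(sliding_window_pairs(8, 4))   # 26
print(sliding_window_pairs(16, 4))  # 58
```

With a fixed window of 4, going from 8 to 16 tokens raises the count from 26 to 58, a little more than double rather than four times.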
The “lost in the middle” problem
A larger window does not guarantee perfect recall. Models often attend most strongly to the start and end of the context and can overlook facts buried in the middle of a very long input. Practical systems therefore place the most important instructions near the top or bottom and avoid dumping huge, unstructured text when only a small part is relevant.
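One way to apply that placement advice is to put the instructions first, the bulky reference material in the middle, and the question plus a short restatement of the instructions last. The layout, labels, and function name below are illustrative assumptions, not a standard format, and whether the repetition helps depends on the model.

```python
def build_prompt(instructions: str, documents: list[str], question: str) -> str:
    """Assemble a prompt with key instructions at the top and bottom,
    leaving the long reference material in the middle."""
    parts = [f"Instructions: {instructions}"]
    for i, doc in enumerate(documents, start=1):
        parts.append(f"[Document {i}]\n{doc}")
    parts.append(f"Question: {question}")
    # Restate the instructions so they also sit near the end of the context.
    parts.append(f"Reminder: {instructions}")
    return "\n\n".join(parts)


prompt = build_prompt(
    instructions="Answer using only the documents provided.",
    documents=["First long document...", "Second long document..."],
    question="What does the second document conclude?",
)
print(prompt)
```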
Working within the limit
When content exceeds the window, common strategies are: summarising older conversation turns, chunking documents and processing them in pieces, and — most powerfully — retrieval-augmented generation, which fetches only the passages relevant to the current query instead of stuffing everything in. Used well, these keep the model focused and costs predictable even when the source material is far larger than any single window.
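The chunking and retrieval steps can be sketched with nothing but the standard library. This is a toy under stated assumptions: chunks are split by word count, and relevance is scored by shared words, which is a stand-in for the embedding similarity that retrieval-augmented systems typically use. All function names here are hypothetical.

```python
import re


def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `chunk_size` words; neighbouring chunks
    share `overlap` words so a sentence cut at a boundary is not lost."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, stop, step)]


def word_set(text: str) -> set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks sharing the most distinct words with the query.
    Word overlap is a simple placeholder for a real similarity measure."""
    query_words = word_set(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & word_set(c)), reverse=True)
    return ranked[:top_k]


doc = (
    "The context window is measured in tokens. "
    "Attention cost grows quadratically with sequence length. "
    "Retrieval fetches only the passages relevant to the query."
)
chunks = chunk_words(doc, chunk_size=7)
print(retrieve("attention cost", chunks, top_k=1))
```

Only the retrieved chunks are then placed in the prompt, so the prompt stays small no matter how large the source material is.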