What Is a Context Window? Size, Limits, and Why It Matters

Understanding token limits in GPT-4, Claude, Gemini, and more

What a context window is

A context window is the maximum amount of text a model can consider at once, measured in tokens rather than words or characters. Crucially, it covers everything the model is working with on a given turn: the system prompt, the running conversation history, any documents or data you have pasted in, and the response the model is about to generate. A useful mental model is short-term working memory: the model reasons over whatever currently sits inside the window, and anything outside it does not exist as far as that response is concerned. This is why a model can answer brilliantly about a document you just pasted, yet appear to forget something you mentioned far earlier in a very long chat. Once the conversation outgrows the window, the oldest messages are typically dropped or summarised to make room, so the early message is no longer in front of the model.
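The idea that everything shares one window can be sketched as simple bookkeeping. This is a minimal illustration, not any provider's API: the class name, field names, and the token counts in the usage lines are all hypothetical, and the counts are assumed to come from whatever tokenizer your model uses.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Hypothetical bookkeeping for one turn; all values are token counts."""
    window_tokens: int           # total size of the model's context window
    system_tokens: int           # system prompt
    history_tokens: int          # running conversation history
    document_tokens: int         # pasted documents or data
    reserved_output_tokens: int  # room kept free for the model's response

    def used(self) -> int:
        # Everything that has to share the window on this turn.
        return (self.system_tokens + self.history_tokens
                + self.document_tokens + self.reserved_output_tokens)

    def remaining(self) -> int:
        # Negative means the turn does not fit as it stands.
        return self.window_tokens - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


# Illustrative numbers only.
budget = ContextBudget(
    window_tokens=128_000,
    system_tokens=500,
    history_tokens=20_000,
    document_tokens=90_000,
    reserved_output_tokens=4_000,
)
print(budget.used(), budget.remaining(), budget.fits())
```

The point of reserving output tokens up front is that the response counts against the same window as the input, so a prompt that fills the window exactly leaves no room for an answer.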

How large are context windows today

Context windows have grown dramatically. Early models like GPT-3.5 worked with roughly 4,000 tokens, only a few pages of text. Modern frontier models are far larger: GPT-4-class and Claude models commonly offer 128,000 to 200,000 tokens, and Google’s Gemini has shipped windows of one million tokens or more. To translate that into something tangible, a token is about three-quarters of a word in English, so 100,000 tokens is roughly 75,000 words, about the length of a short novel. That means today’s large windows can ingest entire reports, codebases, or books in a single prompt, something the 4,000-token models of a few years ago could not do. The trend is firmly toward bigger windows, but size is not the whole story.
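The token-to-word conversion above is simple arithmetic, sketched here under the article's rule of thumb of about 0.75 English words per token. The function names are illustrative, and the ratio is an approximation: real tokenizers vary by model, language, and content type, so treat the results as rough estimates.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text; an approximation


def tokens_to_words(tokens: int) -> int:
    """Approximate number of English words that fit in a token count."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Approximate tokens needed for a word count, rounded up to be safe."""
    return math.ceil(words / WORDS_PER_TOKEN)


for window in (4_000, 128_000, 200_000, 1_000_000):
    print(f"{window:>9,} tokens is roughly {tokens_to_words(window):,} words")
```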

Why bigger is not automatically better

A larger window lets you supply more material, but it does not guarantee the model uses all of it well. Research has documented a lost-in-the-middle effect: models tend to attend most reliably to information at the very start and very end of a long context, while details buried in the middle can be overlooked. Filling the window with loosely relevant text also costs more (you pay per token), increases latency, and can dilute the model’s focus on what actually matters. So a curated, tightly relevant context often produces sharper answers than a sprawling one — quality and placement beat sheer quantity.
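To make the cost point concrete, here is a minimal sketch of per-token pricing arithmetic. The price is a made-up placeholder, not any provider's actual rate, and the function name and token counts are likewise hypothetical.

```python
def prompt_cost(input_tokens: int, price_per_million_tokens: float) -> float:
    """Cost of sending input_tokens at a given price per one million tokens."""
    return input_tokens / 1_000_000 * price_per_million_tokens


# Placeholder price in arbitrary currency units; substitute your provider's rate.
PLACEHOLDER_PRICE = 2.00

sprawling = prompt_cost(150_000, PLACEHOLDER_PRICE)  # loosely relevant material
curated = prompt_cost(15_000, PLACEHOLDER_PRICE)     # only the relevant passages
print(f"sprawling: {sprawling:.2f}, curated: {curated:.2f}")
```

Whatever the actual rate, the cost scales linearly with the tokens sent, so a context ten times larger costs ten times as much on every single request.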

Working within the limits

Practical habits help you get the most from any window size. Put the most important instructions and the most relevant material near the beginning or end of the prompt, where models attend most reliably. Summarise or trim long histories rather than letting them grow unbounded; many applications periodically compress earlier turns into a short recap. For large document sets, retrieval-augmented generation is usually better than pasting everything — it fetches only the passages relevant to the question, keeping the context focused and the cost down. And when you hit a hard limit, chunk the work: split a huge task into pieces that each fit comfortably, then combine the results. Understood this way, the context window stops being a mysterious limit and becomes a budget you can manage deliberately.
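Two of these habits, trimming history and chunking, can be sketched in a few lines. This is an illustration under stated assumptions rather than a production approach: token counts are estimated from word counts using the rough 0.75 words-per-token rule, where a real application would use the model's own tokenizer, and all function names are hypothetical.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough English rule of thumb; an assumption


def estimate_tokens(text: str) -> int:
    """Crude token estimate from a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def trim_history(messages: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent messages whose estimated total fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def chunk_words(text: str, max_tokens: int) -> list[str]:
    """Split text into consecutive pieces of at most max_tokens (estimated).

    Splits on whitespace only; with max_tokens below 2, a single word
    can still exceed the estimate.
    """
    words_per_chunk = max(1, int(max_tokens * WORDS_PER_TOKEN))
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


history = ["one two three", "four five six", "seven eight nine"]
print(trim_history(history, budget_tokens=8))
print(chunk_words("a b c d e f g", max_tokens=4))
```

Dropping whole messages from the oldest end is the simplest policy; replacing the dropped turns with a short recap, as described above, preserves more of the conversation at a small token cost. Splitting on word boundaries is likewise a simplification, since real chunkers usually prefer paragraph or section boundaries so that each piece stays coherent.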
