Understanding LLM Context Windows: A Developer's Guide

Tokens, limits, and strategies for long-document tasks

What a context window is

A large language model does not have persistent memory of your conversation the way a person does. Everything it can consider for a single request lives inside its context window: a hard cap, measured in tokens, that covers your system prompt, the user message, any retrieved documents you paste in, the prior turns of the chat, and the model’s own generated reply. If the combined total exceeds the limit, what happens depends on the provider and the client: the API typically rejects the call with an error, while some chat front ends silently drop the oldest content instead. Think of the window as the model’s working memory for one call: nothing outside it has any influence on the output.
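
The "drop the oldest content" behaviour can be sketched in a few lines. This is only an illustration under stated assumptions: `rough_token_count` and `trim_history` are hypothetical helpers, not any provider's API, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, (len(text) + 3) // 4)


def trim_history(
    system_prompt: str,
    turns: List[str],
    limit: int,
    reserve_for_reply: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep the newest turns that fit in the window; drop the oldest first."""
    budget = limit - reserve_for_reply - count(system_prompt)
    if budget < 0:
        raise ValueError("system prompt plus reply reserve exceed the window")
    kept: List[str] = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count(turn)
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


turns = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
kept = trim_history("s" * 20, turns, limit=40, reserve_for_reply=10)
print(len(kept))  # the oldest turn no longer fits
```

Passing a real tokenizer as `count` makes the budget exact for a given model. The trimming policy itself (newest turns win) is a design choice; summarising old turns instead of discarding them is a common alternative.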

How tokens and limits interact

Tokens are sub-word chunks, not characters or words. A short, common word is often a single token, while a long or rare one such as “antidisestablishmentarianism” is split into several. A rough rule of thumb for English is about four characters per token, but the ratio varies by model and by text, so when you are near a limit, measure with the provider’s tokenizer (tiktoken for OpenAI models, or a count-tokens endpoint). Crucially, input and output share the same budget: the reply is generated into whatever space the prompt leaves, so filling the window with prompt leaves no room to answer. Always reserve headroom for the response. Context sizes also vary widely between models, from a few thousand tokens on older or smaller models to hundreds of thousands or more on current long-context models, so the right strategy depends as much on which model you chose as on how long your document is.
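
The headroom arithmetic is simple enough to write down. The sketch below is an assumption-laden illustration: `estimate_tokens` and `reply_headroom` are hypothetical names, the 8,000-token limit is an invented figure, and the character-based estimate should be swapped for the provider's tokenizer whenever the numbers are close.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rule-of-thumb estimate only; real counts depend on the model's tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def reply_headroom(context_limit: int, prompt_tokens: int, min_reply_tokens: int = 256) -> int:
    """Return the tokens left for the reply, or fail if too few remain."""
    remaining = context_limit - prompt_tokens
    if remaining < min_reply_tokens:
        raise ValueError(
            f"only {remaining} tokens left for the reply; need {min_reply_tokens}"
        )
    return remaining


prompt = "x" * 28_000                      # a prompt of 28,000 characters
prompt_tokens = estimate_tokens(prompt)    # 7000 by the four-character rule
headroom = reply_headroom(context_limit=8_000, prompt_tokens=prompt_tokens, min_reply_tokens=500)
print(prompt_tokens, headroom)             # prints "7000 1000"
```

Running a check like this before every call turns a truncated or rejected response into an explicit error you can handle, for example by trimming the prompt.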

Strategies for long documents

When a source exceeds the window, you have three workhorse patterns. Chunking with retrieval splits the document into passages, embeds them, and sends only the few most relevant chunks for a given query — ideal when the answer lives in a small part of a large corpus. Sliding window processes the text in overlapping segments, carrying a little context across boundaries so nothing is lost at the seams — useful for sequential tasks like cleaning or annotating a long transcript. Map-reduce summarises each chunk independently (the map step), then combines those summaries into a final answer (the reduce step) — the standard way to summarise a document far larger than any single window. You can layer these: retrieve relevant chunks, then map-reduce over them.
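
Two of these patterns, overlapping segments and map-reduce, can be sketched without any model at all. Everything named here is hypothetical: `overlapping_chunks` and `map_reduce` are illustrative helpers, the text is split on whitespace rather than real tokens, and the `summarize` and `combine` callables are trivial stand-ins for what would be model calls in practice. Retrieval is left out because it needs an embedding model.

```python
from typing import Callable, List


def overlapping_chunks(words: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split words into segments of `size` that share `overlap` words with their neighbour."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last segment already reaches the end of the text
    return chunks


def map_reduce(
    chunks: List[List[str]],
    summarize: Callable[[str], str],
    combine: Callable[[List[str]], str],
) -> str:
    """Map: summarise each chunk on its own. Reduce: merge the partial summaries."""
    partials = [summarize(" ".join(chunk)) for chunk in chunks]
    return combine(partials)


words = "a b c d e f g h i j".split()
chunks = overlapping_chunks(words, size=4, overlap=1)
# Stand-ins for model calls: "summarise" a chunk by keeping its first word.
result = map_reduce(chunks, summarize=lambda text: text.split()[0], combine=" | ".join)
print(chunks)
print(result)
```

With ten words, a size of 4 and an overlap of 1, this yields three segments whose boundaries share one word each. If the partial summaries are themselves too long for one window, the reduce step can be applied again over groups of summaries.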

Cost, latency, and the lost-in-the-middle effect

Bigger is not automatically better. Input and output tokens are both billed, and long inputs also increase latency, so a context-stuffing habit quietly inflates both your bill and your response times. Models also exhibit the lost-in-the-middle effect: recall tends to be strongest for material at the very start and end of a long context and weakest for content buried in the middle. The practical lesson is to send focused, relevant context rather than dumping everything available: put the most important material near the beginning or end, prune what the task does not need, and reach for a long-context model only when the task genuinely requires holding the whole document together at once. Used deliberately, the context window is a powerful lever; used as a dumping ground, it becomes an expensive, unreliable one.
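
One way to act on the placement advice, assuming you already have passages ranked from most to least relevant (for instance from a retrieval step), is to interleave them so the strongest ones sit at the edges of the prompt. `edges_first` is a hypothetical helper, and this ordering is one reasonable heuristic rather than an established standard.

```python
from typing import List, TypeVar

T = TypeVar("T")


def edges_first(ranked: List[T]) -> List[T]:
    """Reorder items ranked most-relevant-first so the best sit at the start and end.

    Items at even rank positions fill the front in order; items at odd positions
    fill the back in reverse, which pushes the least relevant toward the middle.
    """
    front: List[T] = []
    back: List[T] = []
    for position, item in enumerate(ranked):
        (front if position % 2 == 0 else back).append(item)
    return front + back[::-1]


ranked = ["best", "second", "third", "fourth", "worst"]
print(edges_first(ranked))
# ['best', 'third', 'worst', 'fourth', 'second']
```

Pruning comes first: if a passage is not needed for the task, leaving it out is cheaper and more reliable than finding a good position for it.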
