Question 1

What exactly is a context window?

Accepted Answer

The context window is the maximum number of tokens a model can consider at once: your prompt, any system instructions, retrieved documents, and the model's own reply all count toward it. When the total exceeds the limit, the request is rejected or older content is truncated, depending on the provider or client. It is the model's short-term memory for a single call; apart from what the model learned during training, nothing outside it influences the answer.
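
To make the truncation case concrete, here is a minimal sketch of dropping the oldest messages until a conversation fits an input budget. The function name trim_history, the sample messages, and the token counts are all hypothetical; real clients may truncate differently or simply reject the request.

```python
def trim_history(messages, counts, budget):
    """Drop the oldest messages until the total token count fits the budget.

    messages: list of strings, oldest first.
    counts:   token count for each message, in the same order (measured elsewhere).
    budget:   tokens available for input, i.e. the window minus the reserved reply.
    """
    if len(messages) != len(counts):
        raise ValueError("messages and counts must have the same length")
    total = sum(counts)
    start = 0
    # Skip past the oldest messages while the total is still over budget.
    while start < len(messages) and total > budget:
        total -= counts[start]
        start += 1
    return messages[start:], total


# Hypothetical conversation with made-up token counts.
history = ["user: hi", "assistant: hello", "user: what is a token?", "user: summarize X"]
token_counts = [2, 2, 6, 4]
kept, used = trim_history(history, token_counts, budget=10)
print(kept, used)
```

Anything trimmed this way is invisible to the model on that call, which is why the choice of what to drop matters.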

Question 2

Does a bigger context window mean I should always fill it?

Accepted Answer

No. Models often attend less reliably to information buried in the middle of a very long context — the well-known lost-in-the-middle effect — and every token you send is billed. A focused prompt with only the relevant material usually beats dumping a whole corpus in. Use the large window when you genuinely need it, not by default.

Question 3

When should I chunk versus use a longer-context model?

Accepted Answer

Chunk when the source far exceeds the window or when you only need a few relevant passages — retrieve and send just those. Reach for a longer-context model when the task needs the whole document held together at once, such as cross-referencing clauses across a long contract. Cost and latency both rise with context length, so chunking is often cheaper even when a long window would technically fit.
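
As a rough illustration of the chunk-and-retrieve route, the sketch below splits text into word-based chunks and picks the ones sharing the most words with a query. The overlap between chunks and the word-overlap scoring are assumptions made for illustration; production retrieval usually relies on embeddings or a search index, and chunk sizes are normally measured in tokens rather than words.

```python
def chunk_words(text, size=200, overlap=20):
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def top_chunks(chunks, query, k=2):
    """Rank chunks by how many query words they contain (a crude stand-in for retrieval)."""
    terms = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(terms & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical usage: send only the best-matching passages, not the whole document.
passages = chunk_words("the termination clause applies after notice is given " * 50, size=40, overlap=5)
relevant = top_chunks(passages, "termination notice", k=2)
```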

Question 4

How do I count tokens before sending a request?

Accepted Answer

Use a tokenizer that matches your model, such as tiktoken for OpenAI-family models, to measure your prompt offline, or call the provider's count-tokens endpoint if it offers one. A rough rule of thumb for English is about four characters, or three-quarters of a word, per token. Tokenizers differ per model, though, so measure rather than guess when you are near a limit. Always leave headroom for the model's reply, which also consumes the window.
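
A minimal sketch of both approaches: a character-based estimate from the rule of thumb, and an exact count when a tokenizer's encode function is passed in. The helper names estimate_tokens and count_tokens are hypothetical, and the tiktoken encoding named in the comment is an assumption, since the right encoding depends on the model.

```python
import math


def estimate_tokens(text):
    """Rough estimate using the ~4 characters per token rule of thumb for English."""
    return math.ceil(len(text) / 4)


def count_tokens(text, encode=None):
    """Return an exact count if a tokenizer's encode function is given, else the estimate."""
    if encode is not None:
        return len(encode(text))
    return estimate_tokens(text)


# With tiktoken installed, an exact count could look like this:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; it varies by model
#   count_tokens(prompt, encode=enc.encode)
prompt = "Summarize the attached contract in three bullet points."
print(count_tokens(prompt))
```

The estimate is fine for sanity checks far from the limit; close to it, pass a real tokenizer.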

Question 5

Do input and output share the same window?

Accepted Answer

Yes. The reply the model generates is drawn from the same context budget as your input, so if you fill the window with prompt you leave no room for an answer. Always reserve enough tokens for the expected output, and if you ask for a long response, send a correspondingly shorter prompt.
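
The arithmetic is simple enough to sketch. The function name, the safety margin, and the window size of 8000 tokens below are hypothetical values chosen for illustration, not figures for any particular model.

```python
def max_prompt_tokens(window, reserved_output, safety_margin=0):
    """Return how many tokens the prompt may use after reserving room for the reply."""
    available = window - reserved_output - safety_margin
    if available < 0:
        raise ValueError("reserved output and margin exceed the context window")
    return available


# Hypothetical numbers: an 8000-token window, 1000 tokens reserved for the reply,
# and a 100-token margin for counting error.
prompt_budget = max_prompt_tokens(window=8000, reserved_output=1000, safety_margin=100)
print(prompt_budget)  # 6900
```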

Understanding LLM Context Windows: A Developer's Guide

What a context window is

How tokens and limits interact

Strategies for long documents

Cost, latency, and the lost-in-the-middle effect