Why reserve tokens for the response?

The context window is shared between everything you send and everything the model generates. If you fill the whole window with input, there is no room left for the reply and the call will truncate or error.

What is a safe utilization target?

Aim to keep total input plus reserved output under about 80% of the window. That leaves headroom for token-estimate error, special tokens, and longer-than-expected responses.

How accurate is the token estimate?

It uses an English-calibrated heuristic, typically within 5-10% of the real count. For content near the limit, leave extra margin or verify with a model-specific tokenizer.

Is my content uploaded?

No. Estimation and the fit calculation run entirely in your browser. Nothing you paste leaves the page.

What is the Context Window Planner?

Paste your system prompt, conversation history, and expected response length to see what percentage of an LLM's context window you are using. Color-coded fit indicator across GPT-4o, Claude, Gemini, and more. It runs free in your browser on Gera Tools, with nothing uploaded.

Context Window Planner

Name: Context Window Planner
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Context window planner

Every model has a hard ceiling on how many tokens it can hold at once — the context window — and it is shared between your input and the model’s reply. Paste your system prompt and conversation history, reserve room for the response, and this planner tells you whether it all fits the model you picked, with a clear color-coded indicator.

How the context window is shared

The context window is not a budget solely for your input — it covers everything the model processes and generates in a single request. The breakdown:

context_used = system_prompt_tokens
             + conversation_history_tokens
             + reserved_completion_tokens
             
context_available = model_context_window - context_used

If context_used exceeds the model window, one of two things happens: the API rejects the request with an error, or (more dangerously) it silently truncates the oldest content from the beginning of the conversation. Either outcome is a failure mode — the first is noisy, the second is invisible.

What the fit indicator means

The planner calculates the total token estimate and shows a fit indicator:

Green (comfortably under ~75%) — you have substantial headroom; the request will succeed with room to spare even if estimates are off
Amber (75–90%) — you are approaching the limit; token estimate error or a longer-than-expected response could push you over; consider trimming
Red (over 90% or overflow) — the call would likely truncate or fail; reduce the prompt before proceeding

Why 80% as a practical ceiling? Token estimates carry a 5–10% margin of error. Special tokens (role markers, formatting tokens, tool definitions) add overhead that doesn’t appear directly in the pasted text. And responses often run longer than you plan for, especially on open-ended tasks.

Reasoning models need extra reserved space

Standard models emit a reply proportional to what you asked for. Reasoning models (those that produce a chain-of-thought before answering) generate “thinking” tokens that can be much larger than the visible answer. These thinking tokens consume context window space even though you may not see them in the final output. Reserve generously — several thousand extra tokens — when working with reasoning models to avoid overflow mid-response.

High-leverage ways to reduce context usage

If the planner shows you’re near or over the limit, the options with the most impact are:

Summarize conversation history. In multi-turn conversations, the history from earlier turns is often load-bearing but could be compressed. Replace turn 1–10 with a one-paragraph summary of what was established, then keep the recent turns verbatim. This can reclaim 60–70% of history tokens while preserving continuity.

Move documents to RAG. If your system prompt contains a 10-page reference document “so the model has the context it needs,” consider retrieving only the relevant passage instead. A retrieval step that pulls 500 tokens beats including 50,000 tokens on every call.

Trim the system prompt. System prompts in production applications often grow over time as instructions are added and never removed. Audit your system prompt for redundant instructions, examples that could be consolidated, and verbose phrasing that could be compressed.

Upgrade the model’s context window. Sometimes the right move is a model with a larger window, especially if the content is genuinely dense and not amenable to summarization.

Tips

Treat ~80% utilization as your practical ceiling.
Reserve generous completion space for reasoning models — their thinking tokens are invisible but real.
This tool uses an English-calibrated heuristic; for code-heavy or non-Latin content, the estimate may be off by more than the usual 5–10%.