What is a context window?

A context window is the maximum number of tokens a model can process at once, covering both your input (system prompt, history, documents, query) and the response it generates. If the total exceeds the window, the request is rejected or, depending on the API or application, older content is truncated.

Why reserve tokens for output?

The output counts against the same window as the input. If you fill the whole window with input, the model has no room to reply. Reserving an output budget guarantees space for the answer you actually want.

How do I estimate tokens from text?

A rough rule for English is about 0.75 words per token, or roughly 4 characters per token, so 1,000 tokens is about 750 words. For exact counts, use the tokenizer for your specific model, since tokenizers differ between GPT, Claude, and others.
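The rule of thumb above can be sketched in a few lines of Python. This is an illustration only: the function name `estimate_tokens` is hypothetical, and taking the larger of the two estimates is an assumption made here so that the result errs toward over-budgeting. It is not a substitute for a real tokenizer.

```python
import math

# Rough English averages taken from the rule of thumb in the text.
CHARS_PER_TOKEN = 4.0
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for English text.

    Assumption: use the larger of the character-based and word-based
    estimates, so the plan is more likely to over-budget than under-budget.
    """
    by_chars = len(text) / CHARS_PER_TOKEN
    by_words = len(text.split()) / WORDS_PER_TOKEN
    return math.ceil(max(by_chars, by_words))
```

For production budgeting, replace this estimate with a count from the tokenizer that matches your model.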

Does this planner call a model?

No. It is a pure arithmetic planner that runs in your browser. It never sends your numbers anywhere, so you can plan sensitive prompts privately.

What happens if I overflow the window?

The planner highlights the overflow in red and shows how many tokens you are over. In practice you would then trim history, summarise documents, shorten the system prompt, or move to a model with a larger window.

What is the Context Window Budget Planner?

It is a tool for planning how to spend a model's context window. Set the model's max tokens, then allocate tokens to the system prompt, conversation history, documents, and the user query to see a live breakdown and whether you fit or overflow. It runs free in your browser on Gera Tools, with nothing uploaded.

Context Window Budget Planner

Name: Context Window Budget Planner
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Why context budgeting matters

The context window budget planner helps you decide how to spend a finite resource: the tokens a model can hold in a single request. Every token you give to the system prompt, conversation history, or retrieved documents is a token you cannot give to the answer. This tool lets you set a model’s window size and divide it across segments so you can see, before you build, whether your design fits.

How it works

You choose a model preset (or type a custom window size), then enter token estimates for each segment: the system prompt, conversation history, documents or retrieved context, the user query, and a reserved output budget. The planner sums them, shows each segment as a percentage of the window, and reports the remaining headroom. If the total exceeds the window, it flags the overflow and tells you exactly how many tokens to cut.
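The arithmetic described above is simple enough to reproduce yourself. The following is a minimal sketch, not the tool's actual source: the names `Plan` and `plan_budget` and the segment keys are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    total: int                 # sum of all segments, output reserve included
    headroom: int              # tokens left over; negative when overflowing
    overflow: int              # tokens to cut; 0 when the plan fits
    shares: dict[str, float]   # each segment as a percentage of the window


def plan_budget(window: int, segments: dict[str, int]) -> Plan:
    """Sum the segments and compare the total against the window size."""
    if window <= 0:
        raise ValueError("window must be positive")
    if any(tokens < 0 for tokens in segments.values()):
        raise ValueError("segment sizes must be non-negative")
    total = sum(segments.values())
    shares = {name: 100 * tokens / window for name, tokens in segments.items()}
    return Plan(total, window - total, max(0, total - window), shares)


plan = plan_budget(
    128_000,
    {"system": 2_000, "history": 20_000, "documents": 60_000,
     "query": 1_000, "output": 4_000},
)
print(plan.total, plan.headroom, plan.overflow)  # 87000 41000 0
```

Treating the output reserve as just another segment keeps the check to a single sum, which matches the point that output shares the window with input.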

All of this is arithmetic in your browser — nothing is sent anywhere. Token counts are estimates; for exact figures, run your text through the tokenizer for your specific model, because GPT, Claude, and other families tokenize differently.

Tips and examples

A common mistake is forgetting that output shares the window. If a model has a 128k window and you stuff 127k tokens of documents in, there is no room to answer. Reserve a realistic output budget first, then fit input around it. For long chats, cap history with a sliding window or periodic summary so it does not grow unbounded. For retrieval-augmented prompts, the document segment is usually where the budget goes — prefer fewer, more relevant chunks over many marginal ones. Use the planner to compare a 16k versus a 128k model: the larger window does not just cost more, it changes what is feasible.

Common context budget patterns

Different application types tend to allocate their window very differently:

Application type	System prompt	History	Documents	Output reserve
Simple chatbot	Small	Large (grows with turns)	Minimal	Moderate
RAG document Q&A	Moderate	Small	Large	Moderate
Code generation agent	Large (tools + rules)	Small	Moderate	Large
Long-form summarizer	Small	None	Very large	Moderate

Planning the allocation for your pattern up front — before writing a line of code — prevents the most common failure mode: a design that works on short conversations but silently truncates or errors on longer ones.

Practical rules for fitting your design

Reserve output first. Decide the longest response you want, reserve that many tokens, and treat the rest as your input budget. Forgetting this step is the single most common cause of incomplete model replies.

Cap history. Conversation history grows with every turn. Without a sliding window or periodic summarization, a long chat eventually overflows even a large-window model. The planner lets you see how many turns fit before you hit the ceiling.
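One way to cap history is a sliding window that keeps only the most recent turns that fit a fixed token budget. The sketch below assumes you already have a token count per turn; the function name `keep_recent_turns` is hypothetical, and periodic summarization is not shown.

```python
def keep_recent_turns(turn_tokens: list[int], history_budget: int) -> list[int]:
    """Return the indices of the most recent turns that fit in the budget.

    Walks backwards from the newest turn and stops at the first turn that
    would push the running total over the budget, so the kept turns are
    always a contiguous tail of the conversation.
    """
    kept: list[int] = []
    used = 0
    for index in range(len(turn_tokens) - 1, -1, -1):
        if used + turn_tokens[index] > history_budget:
            break
        used += turn_tokens[index]
        kept.append(index)
    kept.reverse()  # restore chronological order
    return kept


# Four turns, oldest first, with a 700-token history budget.
print(keep_recent_turns([500, 300, 400, 200], 700))  # [2, 3]
```

Stopping at the first turn that does not fit, rather than skipping it and continuing, avoids leaving gaps in the middle of the conversation.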

Prefer fewer, better chunks for retrieval. If you are injecting retrieved documents, a tight budget favors quality over quantity — three highly relevant chunks usually beat ten marginally relevant ones and leave more room for the model to reason.
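The "fewer, better chunks" rule can be sketched as a greedy selection: take chunks in order of relevance, skip any that do not fit, and stop at a small cap. Everything here is an assumption for illustration: the function name `select_chunks`, the relevance scores (which would come from your own retriever), and the default cap of three.

```python
def select_chunks(
    chunks: list[tuple[float, int]],  # (relevance score, token count) per chunk
    budget: int,
    max_chunks: int = 3,
) -> list[int]:
    """Return indices of the chosen chunks, most relevant first."""
    # Rank chunk indices by score, highest first.
    order = sorted(range(len(chunks)), key=lambda i: chunks[i][0], reverse=True)
    chosen: list[int] = []
    used = 0
    for i in order:
        if len(chosen) == max_chunks:
            break
        tokens = chunks[i][1]
        if used + tokens <= budget:  # skip chunks that would overflow
            chosen.append(i)
            used += tokens
    return chosen


candidates = [(0.9, 800), (0.4, 300), (0.8, 1500), (0.7, 600)]
print(select_chunks(candidates, budget=2000))  # [0, 3, 1]
```

In this example the second-best chunk is skipped because it alone would use three quarters of the document budget, which leaves room for two smaller chunks instead.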

Token estimates are approximate. The rule of thumb (roughly 0.75 words per token for English, or about 4 characters per token) is only an average, and the true ratio differs between model families because GPT and Claude tokenize the same text differently. For production systems, measure with the actual tokenizer for your model.