Why reserve completion tokens up front?

The context window is shared between input and output. If you do not carve out room for the reply, the model has no space to answer, and the call either truncates the response or returns an error. Reserve generously for reasoning models, whose internal reasoning consumes tokens before any visible output.

How much should I leave for retrieved context?

Whatever is left after system prompt, user message, and reserved completion. This tool computes that remainder so you know exactly how many tokens of RAG chunks you can fit.

What if the fixed slots already exceed the window?

The tool flags an overflow and shows the negative remainder, meaning you must shrink the system prompt, trim the user message, reduce the reserved completion, or move to a larger-context model.

The result is exact arithmetic on the token figures you provide, so it is only as accurate as those figures. Estimate them with a token counter, and leave a 10-20% margin for estimation error and special tokens.

What is the Token Budget Splitter?

Given a model's context limit, the Token Budget Splitter allocates tokens across system instructions, retrieved context chunks, the user message, and a reserved completion. It outputs slot sizes, a percentage breakdown, and an overflow warning. It runs free in your browser on Gera Tools, with nothing uploaded.

Token Budget Splitter

Name: Token Budget Splitter
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Token budget splitter

A model’s context window is a fixed budget you have to divide between four competing demands: system instructions, retrieved context, the user’s message, and room for the reply. This tool takes your context limit and fixed slots, then tells you exactly how many tokens are left for retrieved context chunks — the slot that usually flexes in RAG systems.

How it works

You enter the total context window and three known quantities: system prompt tokens, reserved completion tokens, and your typical user message size. The splitter subtracts those from the window and reports the remaining budget available for retrieved context, plus a percentage breakdown of how the whole window is allocated. If your fixed slots already exceed the window, it flags the overflow and shows by how much you are over.
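The subtraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the names `BudgetSplit` and `split_budget` are hypothetical, and showing the context share as 0% on overflow is an assumption made here for readability.

```python
from dataclasses import dataclass


@dataclass
class BudgetSplit:
    available_for_context: int  # negative when the fixed slots overflow
    overflow: bool
    percentages: dict  # slot name -> share of the window, in percent


def split_budget(window: int, system: int, user: int, completion: int) -> BudgetSplit:
    """Subtract the fixed slots from the window and report the remainder."""
    if window <= 0:
        raise ValueError("window must be positive")
    remainder = window - (system + user + completion)
    slots = {
        "system": system,
        "user": user,
        "completion": completion,
        # Assumption: on overflow, report 0% for context instead of a negative share.
        "context": max(remainder, 0),
    }
    percentages = {name: round(100 * tokens / window, 1) for name, tokens in slots.items()}
    return BudgetSplit(remainder, remainder < 0, percentages)


split = split_budget(window=128_000, system=800, user=400, completion=4_000)
print(split.available_for_context)  # 122800
print(split.overflow)               # False
```

When the fixed slots exceed the window, `available_for_context` goes negative and `overflow` becomes true, which mirrors the overflow flag described above.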

Worked example

Suppose you are building a RAG system on a model with a 128,000-token window:

Slot	Tokens	Percentage
System prompt	800	0.6%
User message	400	0.3%
Reserved completion	4,000	3.1%
Available for context	122,800	95.9%

At 512 tokens per retrieved chunk, this budget fits 239 whole chunks (122,800 / 512 ≈ 239.8); at 1,024 tokens per chunk, it fits 119. Knowing the budget lets you decide chunk size and retrieval depth before writing a single line of code.
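The chunk counts come from integer division of the context budget by the chunk size. A small sketch, assuming every chunk is the same size and ignoring any per-chunk separator tokens (the function name `chunks_that_fit` is hypothetical):

```python
def chunks_that_fit(context_budget: int, chunk_tokens: int) -> int:
    """Return how many whole chunks of a fixed size fit in the context budget."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    # A negative budget (overflow) leaves room for no chunks at all.
    return max(context_budget, 0) // chunk_tokens


for size in (512, 1024):
    print(size, chunks_that_fit(122_800, size))
# 512 239
# 1024 119
```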

Now consider a smaller model window of 16,000 tokens with the same fixed slots:

Slot	Tokens
System prompt	800
User message	400
Reserved completion	4,000
Available for context	10,800

At 512 tokens per chunk, only 21 chunks fit. This immediately tells you that you need smaller chunks, a larger-context model, or a more selective retrieval strategy.

Common pitfalls when planning context budgets

Under-reserving completion tokens. If you reserve too few tokens, the model truncates its response; if the requested max_tokens exceeds the remaining window, the API may return an error instead. For reasoning-capable models, the internal reasoning chain consumes tokens before the visible output, so reserve more than you think you need.

Ignoring special tokens. Every API call adds overhead tokens for role markers and message separators, plus any tool definitions you send. These are usually small (50–200 tokens per turn) but add up in multi-turn conversations. Leave at least 10% slack beyond your estimate.
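One way to apply that slack is to inflate the estimated slots before subtracting them. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the margin is applied only to the system and user estimates, since the reserved completion is a limit you choose rather than an estimate.

```python
def with_margin(estimated_tokens: int, margin_percent: int = 10) -> int:
    """Inflate a token estimate by a percentage margin, rounding up."""
    extra = (estimated_tokens * margin_percent + 99) // 100  # integer ceiling
    return estimated_tokens + extra


def padded_context_budget(window: int, system_estimate: int, user_estimate: int,
                          completion: int, margin_percent: int = 10) -> int:
    """Context budget left after padding the estimated input slots."""
    padded_inputs = (with_margin(system_estimate, margin_percent)
                     + with_margin(user_estimate, margin_percent))
    return window - padded_inputs - completion


print(padded_context_budget(128_000, 800, 400, 4_000))  # 122680
```

With a 10% margin, the worked example's context budget drops from 122,800 to 122,680 tokens.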

Forgetting conversation history. In multi-turn systems, each exchange grows the input. A conversation that starts well within budget can overflow after 10 turns. Plan a truncation or summarisation strategy before hitting the limit.
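A simple truncation strategy keeps the most recent turns that fit a fixed history budget and drops the rest. This is a sketch of one possible policy, not a recommendation from the tool: `trim_history` is a hypothetical helper, and the per-turn token counts are assumed to come from your own token counter.

```python
def trim_history(turn_tokens: list, history_budget: int) -> list:
    """Return the indices of the newest turns whose combined size fits the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for index in range(len(turn_tokens) - 1, -1, -1):
        if used + turn_tokens[index] > history_budget:
            break
        used += turn_tokens[index]
        kept.append(index)
    kept.reverse()  # restore chronological order
    return kept


turns = [300, 450, 200, 600, 350]  # token count per turn, oldest first
print(trim_history(turns, history_budget=1_200))  # [2, 3, 4]
```

Summarising the dropped turns instead of discarding them is the other option mentioned above; it costs an extra model call but preserves earlier context.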

This is the core sizing step for any RAG pipeline. Reserve completion tokens generously and pair this tool with a token counter to verify real prompt sizes before going to production.