Does the output count toward the context window?

Yes. For most models the context window covers input and output combined, so every token you reserve for output is one fewer for your prompt. A few models count output against a separate cap; check your provider's docs.

How do I know my system prompt's token size?

Use a token counter to measure it, then enter that number here. As a rough guide, one token is about three-quarters of an English word, so 750 words come to roughly 1,000 tokens.
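If no token counter is at hand, the rule of thumb can be turned into a quick estimate. This is a minimal sketch: `estimate_tokens` is a hypothetical helper, it counts whitespace-separated words rather than running a real tokenizer, and it will be less accurate for code or non-English text.

```python
import math

# Rule of thumb from the text: one token is about three-quarters of an English word.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    # Round up so the estimate errs on the side of a larger prompt.
    return math.ceil(words / WORDS_PER_TOKEN)


system_prompt = "You are a helpful assistant. Answer briefly and cite sources."
print(estimate_tokens(system_prompt))
```

Treat the result as a starting value for the planner and replace it with a measured count when you have one.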

What happens if I overflow?

If input plus reserved output exceeds the window, the API will, depending on the provider, either reject the request or silently truncate it. The planner turns the bar red and tells you how many tokens to cut.

Should I leave a safety margin?

Yes. Token estimates aren't exact and chat templates add overhead, so leaving 5-10% headroom avoids surprise truncation in production.
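One way to apply the margin is to shrink the window before doing any other budgeting. The sketch below assumes a hypothetical helper, `usable_window`, and a margin given as a whole-number percentage. It is not how the planner itself is implemented.

```python
def usable_window(window: int, margin_percent: int = 10) -> int:
    """Return the window size left after setting aside a safety margin."""
    if not 0 <= margin_percent < 100:
        raise ValueError("margin_percent must be between 0 and 99")
    # Integer ceiling division, so the headroom is never rounded down.
    headroom = (window * margin_percent + 99) // 100
    return window - headroom


# A 128,000-token window with 10% headroom leaves 115,200 tokens to plan with.
print(usable_window(128_000, 10))
```

Budgeting against the reduced figure means a slightly low token estimate still fits the real window.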

What is the Prompt Token Budget Planner?

It is a free calculator that runs in your browser on Gera Tools, with nothing uploaded. Enter your model's context window and desired output length, then see how many tokens remain for your system prompt and retrieved context, with a visual budget bar and overflow warnings.

Prompt Token Budget Planner

Name: Prompt Token Budget Planner
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Plan your token budget before you call

Every LLM call has to fit inside a fixed context window. This planner helps you divide that window sensibly: reserve tokens for the output, account for your system prompt, and see how much room is left for user context and retrieved documents — with a visual bar that turns red the moment you overflow.

Context windows across common models

Different models offer significantly different context windows, and the right budget strategy depends on which model you are using:

Model	Approximate context window
GPT-4o	128,000 tokens
GPT-4o mini	128,000 tokens
Claude 3.5 Sonnet	200,000 tokens
Claude 3.5 Haiku	200,000 tokens
Gemini 1.5 Pro	1,000,000 tokens
Llama 3 (8B)	8,000 tokens

Larger windows do not mean you stop needing to budget. Even with 200K tokens available, a poorly planned call that fills the window with low-value retrieved documents and generates a truncated output is wasteful. The planning discipline stays the same regardless of window size: reserve output first, then allocate input.

How the budget works

For most models the context window is shared by input and output. The planner subtracts in this order:

remaining_for_context = window − output_reserve − system_prompt

If that number goes negative, the request will not fit. Because the model’s reply lives in the same window, reserving too little output truncates the answer; reserving too much starves your context. The bar shows the four segments — system, context, output, and free — proportionally.
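The subtraction above can be sketched in a few lines of Python. The function name `plan_budget` and the keys of the returned dictionary are illustrative assumptions, not the planner's actual code.

```python
def plan_budget(window: int, output_reserve: int, system_prompt: int) -> dict:
    """Apply remaining_for_context = window - output_reserve - system_prompt."""
    remaining = window - output_reserve - system_prompt
    return {
        "remaining_for_context": remaining,
        # A negative remainder means the request will not fit.
        "fits": remaining >= 0,
        # Tokens to cut to get back inside the window (0 when it fits).
        "tokens_to_cut": max(0, -remaining),
    }


print(plan_budget(window=128_000, output_reserve=4_000, system_prompt=1_500))
print(plan_budget(window=8_000, output_reserve=6_000, system_prompt=4_000))
```

In the first call, 128,000 − 4,000 − 1,500 leaves 122,500 tokens for context. In the second, the remainder is −2,000, so 2,000 tokens have to be cut before the request fits.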

Where budgets are typically wasted

Oversized system prompts are the most common problem. A 4,000-token system prompt full of repeated rules, elaborate examples, and redundant hedges is counted and billed on every single call. Cutting it to 1,500 tokens frees 2,500 tokens for user context on every request.

Uncontrolled retrieved context in RAG applications is the second most common issue. Without a hard token budget for retrieved documents, a retrieval step that returns ten long passages can consume most of the window before the user’s actual question arrives.
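A hard cap on retrieved context can be enforced with a simple greedy pass. The sketch below assumes the passages arrive already ranked by relevance and already paired with a token count. `cap_passages` and the sample data are hypothetical.

```python
def cap_passages(passages: list[tuple[str, int]], budget: int) -> list[str]:
    """Keep ranked passages, in order, while the running total stays within budget.

    Each item is (text, token_count). A passage that does not fit is skipped,
    and later, shorter passages may still be admitted.
    """
    kept: list[str] = []
    used = 0
    for text, tokens in passages:
        if used + tokens <= budget:
            kept.append(text)
            used += tokens
    return kept


ranked = [("passage A", 300), ("passage B", 500), ("passage C", 250)]
# With a 600-token cap, A fits (300), B would exceed the cap, C fits (550 total).
print(cap_passages(ranked, budget=600))
```

Skipping an oversized passage rather than stopping at it is a design choice. Stopping at the first passage that does not fit keeps the ranking strictly intact but may leave more of the budget unused.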

Tips

Reserve output first. Decide how long the answer needs to be, then spend what’s left on input.
Keep the system prompt lean. It’s billed and counted on every call; move large stable references into retrieved context instead.
Leave a 5–10% margin. Token estimates and chat-template overhead mean the real count is a bit higher than your math.
Set a hard limit on retrieved context. If you use RAG, cap the total tokens retrieved before they reach the prompt so the budget stays predictable.