Why does overlap reduce how many chunks fit?

Overlap duplicates tokens between adjacent chunks. When several overlapping chunks are retrieved together, those duplicated tokens still occupy context, so a larger overlap means each chunk costs more space and fewer chunks fit.

Should I leave room for the completion?

Yes. The context window is shared between the prompt and the model's output. If you fill the entire window with chunks, the model has no room to answer, so always reserve tokens for the completion.

How accurate is the chunk count?

It is a planning estimate. Real chunk sizes vary because tokenizers split text unevenly, so treat the number as an upper bound rather than a guarantee, and test with your actual documents.

Is my data uploaded anywhere?

No. The calculator runs entirely in your browser. Nothing you type is sent to a server, stored, or logged.

What is the RAG Context Budget Calculator?

It is a free tool on Gera Tools for sizing the retrieval step of a RAG pipeline. Enter your chunk size, overlap, system prompt length, and target model to see how many retrieved chunks fit in the context window while leaving room for the user question and completion. It runs in your browser, with nothing uploaded.

RAG Context Budget Calculator

Name: RAG Context Budget Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

RAG context budget calculator

When you build a retrieval-augmented generation (RAG) pipeline, the single hardest constraint is the context window. Every retrieved document chunk, your system prompt, the user’s question, and the space the model needs to answer all compete for the same fixed token budget. This calculator tells you how many chunks you can actually retrieve before you run out of room.

How it works

The context window is a fixed number of tokens for your chosen model. From that total the calculator subtracts three fixed costs: the system prompt, the user query, and a completion reserve for the model’s answer. Whatever is left is your retrieval budget.

Each chunk costs its base size plus its overlap, because overlap tokens are duplicated into the chunk and still occupy space when retrieved. Dividing the remaining budget by that effective chunk size gives the maximum number of chunks you can safely pass in. The calculator floors the result so you never plan for a partial chunk that would overflow the window.

The formula in plain terms

retrieval_budget  = context_window - system_prompt - user_query - completion_reserve
effective_chunk   = chunk_size + overlap
max_chunks        = floor(retrieval_budget / effective_chunk)
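The same three lines can be written as a small Python function. This is a minimal sketch, not the calculator's own source: the function name and argument names are chosen here for illustration, and clamping to zero when the fixed costs exceed the window is an assumption about sensible behaviour.

```python
def max_chunks(context_window, system_prompt, user_query,
               completion_reserve, chunk_size, overlap):
    """Return how many retrieved chunks fit in the window (all values in tokens)."""
    retrieval_budget = context_window - system_prompt - user_query - completion_reserve
    effective_chunk = chunk_size + overlap
    if effective_chunk <= 0:
        raise ValueError("chunk_size + overlap must be positive")
    # Integer floor division, so a partial chunk is never counted.
    # Assumption: if fixed costs already exceed the window, report 0 chunks.
    return max(0, retrieval_budget // effective_chunk)


# Illustrative numbers: 8000-token window, 1000 tokens of fixed costs,
# 500-token chunks with 50-token overlap.
example = max_chunks(8000, 300, 50, 650, 500, 50)
print(example)  # 12
```

Using integer division rather than a floating-point divide followed by floor keeps the result exact for any token counts.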

Worked example

Say you are building a document Q&A system on a model with a 32 000-token window. Your system prompt is 400 tokens, a typical user question is 50 tokens, and you reserve 800 tokens for the answer. That leaves 30 750 tokens for retrieval.

With 400-token chunks and 50-token overlap, each chunk costs 450 tokens:

30 750 / 450 = 68 chunks (floored)

That is a comfortable retrieval budget for most question-answering tasks. Now suppose you increase the chunk size to 800 tokens with 100-token overlap:

30 750 / 900 = 34 chunks

Fewer but larger chunks mean each one carries more surrounding context, which helps tasks where the answer spans multiple paragraphs.
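To check the two configurations side by side, the arithmetic can be reproduced in a few lines of Python. This is a sketch using the numbers from the example; the helper name plan is hypothetical. It also reports how many budget tokens are left unused after flooring.

```python
WINDOW, SYSTEM_PROMPT, USER_QUERY, RESERVE = 32_000, 400, 50, 800
budget = WINDOW - SYSTEM_PROMPT - USER_QUERY - RESERVE  # 30750 tokens for retrieval


def plan(chunk_size, overlap):
    """Return (chunk count, tokens used, tokens left over) for one configuration."""
    effective = chunk_size + overlap
    count = budget // effective
    used = count * effective
    return count, used, budget - used


small = plan(400, 50)
large = plan(800, 100)
for label, (count, used, spare) in (("400/50", small), ("800/100", large)):
    print(f"{label}: {count} chunks, {used} tokens used, {spare} spare")
```

Both configurations use 30600 tokens and leave 150 spare, so the choice between them is about how the retrieved text is divided, not how much of it fits.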

What changes the budget most

Factor	Effect
System prompt length	Directly reduces retrieval budget; verbose instructions are costly
Completion reserve	Easy to underestimate — agents or code generators can need 2 000+ tokens
Overlap size	Even small overlaps multiply across many chunks; 10% of chunk size is common
Model context window	Doubling the window slightly more than doubles the chunk count at the same settings, because the fixed costs do not grow with it
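One way to see these effects is to change a single factor at a time and recompute. The sketch below starts from the worked example's settings; the alternative values (a 1600-token system prompt, a 2000-token reserve, a 100-token overlap, a 64000-token window) are illustrative assumptions, not recommendations.

```python
def max_chunks(window, system_prompt, user_query, reserve, chunk_size, overlap):
    budget = window - system_prompt - user_query - reserve
    return max(0, budget // (chunk_size + overlap))


baseline = dict(window=32_000, system_prompt=400, user_query=50,
                reserve=800, chunk_size=400, overlap=50)

# Each variation overrides exactly one baseline setting.
variations = {
    "baseline": {},
    "system prompt 400 -> 1600": {"system_prompt": 1600},
    "reserve 800 -> 2000": {"reserve": 2000},
    "overlap 50 -> 100": {"overlap": 100},
    "window 32000 -> 64000": {"window": 64_000},
}

results = {name: max_chunks(**{**baseline, **change})
           for name, change in variations.items()}
for name, count in results.items():
    print(f"{name:<28} {count}")
```

With these numbers the baseline is 68 chunks, the longer prompt or the larger reserve each give 65, the larger overlap gives 61, and the doubled window gives 139.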

Tips and notes

Reserve generously for the completion. A summarisation task might only need a few hundred output tokens, but an agent that writes code can need several thousand. Under-reserving causes truncated answers.
Smaller chunks improve precision but add overhead. Many small chunks let retrieval target the exact passage, but more chunks mean more separator and formatting tokens between them and more duplicated overlap. Use the chunking-strategy calculator to balance this.
Token estimates are approximate. Tokenizers split text differently across models, so leave a safety margin rather than packing the window to the last token.
Test with real documents. The calculator gives a planning upper bound; actual token counts per chunk depend on your specific corpus and tokenizer. Run a sample batch through your tokenizer and compare against the estimate.
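A sample-batch check might look like the sketch below. It is tokenizer-agnostic: check_fit takes any function that maps text to a token count. The whitespace counter is only a placeholder so the example runs on its own, and it will usually undercount compared with a real tokenizer, so swap in the tokenizer for your target model. The function names and the 10 percent default margin are assumptions.

```python
def whitespace_token_count(text):
    # Placeholder only: real tokenizers usually produce more tokens than words.
    return len(text.split())


def check_fit(chunks, count_tokens, retrieval_budget, margin_percent=10):
    """Compare measured chunk sizes against the budget minus a safety margin."""
    sizes = [count_tokens(chunk) for chunk in chunks]
    # Integer arithmetic avoids floating-point rounding in the margin.
    usable = retrieval_budget * (100 - margin_percent) // 100
    total = sum(sizes)
    return {
        "total": total,
        "largest": max(sizes, default=0),
        "usable": usable,
        "fits": total <= usable,
    }


sample = ["alpha beta gamma", "delta epsilon"]
report = check_fit(sample, whitespace_token_count, retrieval_budget=10)
print(report)
```

If the measured total regularly exceeds the usable budget, lower the chunk count or raise the margin before relying on the planning estimate.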