What is the lost-in-the-middle effect?

Research shows LLMs attend best to information at the start and end of a long context and can miss relevant facts buried in the middle. Past a certain point, adding more retrieved chunks can therefore reduce answer quality, not just add cost.

Why does cost rise linearly but quality doesn't?

Each extra chunk adds a fixed number of input tokens, so cost grows linearly with chunk count. Quality, by contrast, shows diminishing returns: the first few relevant chunks help a lot, later ones help little, and the least relevant ones can hurt through the lost-in-the-middle effect.

How is the quality curve calculated?

It is an illustrative relative curve based on the shape you choose, not a measurement of your specific data. Use it to understand the shape of the tradeoff, then confirm the optimal chunk count on your own evaluation set.

What is quality-per-dollar?

It divides the relative quality at each chunk count by the cost of that call. The highest value usually occurs well short of the maximum chunk count, because the last chunks add cost without a proportional gain in quality.

How do I improve retrieval without adding chunks?

Use better embeddings and a reranker so the top few chunks are the most relevant, deduplicate near-identical passages, and put the most important context at the start or end of the prompt to dodge the lost-in-the-middle dip.

What is the Context Window Size vs Retrieval Quality Tradeoff?

It is a free RAG context tradeoff tool. It plots the cost of retrieving more chunks against the diminishing, and eventually declining, quality returns caused by the lost-in-the-middle effect, and helps you find the best quality-per-dollar point. It runs in your browser on Gera Tools, with nothing uploaded.

Context Window Size vs Retrieval Quality Tradeoff

Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

More retrieved context is not always better

In a RAG system it is tempting to retrieve “just a few more chunks” to be safe. But each chunk adds input tokens (linear cost) while quality follows diminishing returns — and past a point the lost-in-the-middle effect means extra context can actually lower answer quality. This tool plots the cost against an illustrative quality curve so you can find the point of best value rather than defaulting to the maximum.

The three forces at work

The tradeoff comes from three forces that pull in different directions as you increase the number of retrieved chunks:

Coverage (helps quality): More chunks mean a higher probability that the relevant information is included somewhere in the context. The first few chunks from a good retriever are highly likely to contain the answer; later chunks add coverage at the margin.

Noise and dilution (hurts quality): Retrieval systems are imperfect. The 5th, 8th, or 12th chunk is less likely to be truly relevant than the 1st or 2nd. Irrelevant chunks dilute the signal-to-noise ratio in the prompt, making it harder for the model to locate the relevant information.

Position effects (hurt quality at scale): Research on long-context models has repeatedly found that transformer models use information at the beginning and end of a long context more reliably than information in the middle. A fact buried at chunk 8 of 12 is genuinely harder for many models to recall than the same fact at chunk 1 of 3.

These three forces mean quality does not rise linearly with chunk count. It rises steeply at first (coverage wins), flattens as coverage saturates, and then — on many real workloads — begins to decline as noise and position effects dominate.

How the cost-quality curves work

For each chunk count, the cost is straightforward:

call_cost = (chunks × tokens_per_chunk × input_price) + (output_price × output_tokens)
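
The formula can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the function name, the parameter names, and the convention that prices are quoted in dollars per million tokens are all assumptions.

```python
def call_cost(chunks, tokens_per_chunk, input_price_per_mtok,
              output_tokens, output_price_per_mtok):
    """Dollar cost of one call, following the formula above.

    Assumption: both prices are in dollars per million tokens.
    """
    input_tokens = chunks * tokens_per_chunk
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Hypothetical figures: 500-token chunks, $3 input and $15 output
# per million tokens, 300 output tokens per answer.
for n in (1, 5, 10):
    print(n, call_cost(n, 500, 3.0, 300, 15.0))
```

Only the input term depends on the chunk count, so every additional chunk adds the same amount to the bill.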

Cost grows in a straight line with chunk count. The quality curves represent different assumptions about how the model handles increasing context:

Logarithmic — quality rises steeply at first and flattens; no decline. Optimistic assumption, appropriate if your retriever is high precision.
Plateau — quality rises, peaks, then declines, modelling the lost-in-the-middle effect. More realistic for most general-purpose models.
Linear — quality rises proportionally with chunk count. The naive assumption; included for comparison.
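
The three shapes could be modelled as below. The tool's exact formulas are not given here, so these functional forms (a normalised logarithm, a rise-and-decay curve that peaks at a chosen chunk count, and a straight line) are assumptions picked only to reproduce the described shapes. The parameters `max_chunks` and `peak` are hypothetical.

```python
import math


def quality(chunks, shape, max_chunks=20, peak=6):
    """Illustrative relative quality for a given chunk count.

    All three forms are assumptions, not measurements.
    """
    if shape == "logarithmic":
        # Steep early gains that flatten; reaches 1.0 at max_chunks.
        return math.log1p(chunks) / math.log1p(max_chunks)
    if shape == "plateau":
        # Rises to 1.0 at `peak` chunks, then declines.
        x = chunks / peak
        return x * math.exp(1 - x)
    if shape == "linear":
        # Naive assumption: quality proportional to chunk count.
        return chunks / max_chunks
    raise ValueError(f"unknown shape: {shape}")


for n in (1, 3, 6, 12, 20):
    print(n, [round(quality(n, s), 3)
              for s in ("logarithmic", "plateau", "linear")])
```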

The quality-per-dollar metric divides the relative quality at each chunk count by the cost of that call. It peaks earlier than raw quality does, because the last chunks add cost without a proportional quality gain.
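
A rough sketch of that search, under stated assumptions: the quality curve is an assumed rise-and-decay shape that peaks at `peak` chunks, prices are expressed per token, and every default value is a made-up figure for illustration rather than a value taken from the tool.

```python
import math


def best_value_chunk_count(max_chunks=20, tokens_per_chunk=500,
                           input_price=3e-6, output_tokens=300,
                           output_price=15e-6, peak=6):
    """Return (chunk_count, quality_per_dollar) at the best-value point."""
    best_n, best_qpd = 0, 0.0
    for n in range(1, max_chunks + 1):
        # Linear cost: only the input term grows with n.
        cost = n * tokens_per_chunk * input_price + output_tokens * output_price
        # Assumed plateau-style quality curve, equal to 1.0 at n == peak.
        x = n / peak
        q = x * math.exp(1 - x)
        qpd = q / cost
        if qpd > best_qpd:
            best_n, best_qpd = n, qpd
    return best_n, best_qpd


n, qpd = best_value_chunk_count()
print(f"best value at {n} chunks ({qpd:.1f} relative quality per dollar)")
```

With these made-up figures the best-value point lands at 3 chunks, half the chunk count at which the assumed quality curve itself peaks.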

What this means in practice

Many production RAG systems retrieve more chunks than they need. Teams that measure commonly report findings such as:

Top-3 chunks often match or exceed top-10 chunks in answer quality
Going from top-5 to top-15 typically changes answer quality negligibly while adding significant token cost
Systems with a reranker can achieve better quality at top-3 than systems without a reranker achieve at top-10

The implication: invest in better retrieval (embedding quality, reranking, query expansion) rather than compensating for weak retrieval by adding more chunks.

Tips for efficient retrieval

Rerank before increasing chunk count. A cross-encoder reranker that scores candidates semantically can surface the truly relevant chunks from a larger initial retrieval set, so you send fewer, better chunks.
Mind the position. Put the most important context at the start or end of the prompt to avoid the middle blind spot; some implementations explicitly order chunks by relevance score with highest-scored first.
Deduplicate near-identical chunks. Chunks from the same or similar passages waste tokens and crowd out genuinely new information.
Measure on your actual eval set. The curves here illustrate the shape of the tradeoff — find your real optimum by testing chunk counts against graded answers on your specific corpus and model.
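
Two of these tips, deduplication and position-aware ordering, can be sketched as follows. This is an illustration under assumptions: word-overlap (Jaccard) similarity stands in for whatever similarity measure a real pipeline would use, the 0.8 threshold is arbitrary, and both helper names are hypothetical. The input list is assumed to be sorted best-first by relevance.

```python
def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def dedupe(chunks, threshold=0.8):
    """Keep a chunk only if it is not too similar to one already kept."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, k) < threshold for k in kept):
            kept.append(chunk)
    return kept


def edges_first(chunks):
    """Alternate ranked chunks between the front and the back,
    so the lowest-ranked chunks end up in the middle of the prompt."""
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = [
    "the cat sat on the mat",
    "the cat sat on the mat today",  # near-duplicate of the first
    "dogs bark loudly",
]
print(edges_first(dedupe(ranked)))
```

Whether edge placement actually helps depends on the model, so it belongs in the same evaluation run as the chunk count.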