What algorithm does it use?

A greedy value-density heuristic. Each chunk's value density is its relevance score divided by its token count. Chunks are considered in descending density order, and each one is added if it fits in the remaining budget or skipped if it would overflow. This is fast and near-optimal for typical RAG packing where chunk sizes are similar.

Why not a true knapsack solver?

Exact 0/1 knapsack is overkill for context packing: relevance scores are noisy estimates, not exact values, so a greedy density-first pass usually gives essentially the same answer instantly. After the greedy pass, the tool also makes one improvement sweep that checks whether a skipped chunk can replace an included one for a net relevance gain within the budget, which captures the common improvement case.

What scale should scores use?

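Any consistent, non-negative scale works: cosine similarity (0–1), a reranker score, or a rating. The algorithm only compares relative density, so a constant multiplier cancels out as long as you use the same scale for every chunk. Scores that can be negative (raw logits) or that start from an offset (a 1–10 rating) should be mapped to a zero-based range first, because adding a constant changes the density ranking.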

Does it send my chunks anywhere?

No. All ranking and packing runs locally in your browser. Your retrieved text and scores never leave the page.

What is the RAG Context Window Optimizer?

Given retrieved chunks with relevance scores and token counts plus a token budget, a greedy value-density packing algorithm selects the highest-value chunks that fit. See which chunks are included, which are dropped, and how much of the budget is used. It runs free in your browser on Gera Tools, with nothing uploaded.

RAG Context Window Optimizer

Name: RAG Context Window Optimizer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

RAG context window optimizer

Retrieval-augmented generation lives or dies on what you put in the context window. Stuff in too much and you blow the token budget or bury the signal; trim too aggressively and the model misses the answer. This tool takes your retrieved chunks — each with a token count and a relevance score — and packs the most valuable ones into a fixed token budget so every token earns its place.

The problem this solves

In a typical RAG pipeline, a vector search returns ten to twenty candidate chunks ranked by similarity. The naive approach is to take the top N chunks until the budget runs out. The problem is that top-N selection ignores token count: a single large chunk scoring 0.85 might consume 600 tokens and block three smaller chunks that together score 2.4 and only cost 400 tokens combined. Per token, the large chunk delivers about 0.0014 relevance (0.85 / 600), while the three smaller chunks deliver 0.006 (2.4 / 400), more than four times as much. Packing by density rather than raw score captures that difference.
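To make the arithmetic concrete, here is a small Python sketch of that scenario. The individual chunk scores and sizes, the 700-token budget, and the helper name `take_in_order` are illustrative assumptions, chosen only so that the totals match the figures quoted above.

```python
# Illustrative numbers only: one large chunk plus three small ones whose
# totals match the example above (2.4 relevance, 400 tokens combined).
chunks = [
    {"id": "big", "score": 0.85, "tokens": 600},
    {"id": "s1", "score": 0.80, "tokens": 150},
    {"id": "s2", "score": 0.80, "tokens": 130},
    {"id": "s3", "score": 0.80, "tokens": 120},
]
BUDGET = 700  # assumed budget for this example


def take_in_order(ordered, budget):
    """Walk the list in the given order, keeping every chunk that still fits."""
    picked, used = [], 0
    for c in ordered:
        if used + c["tokens"] <= budget:
            picked.append(c)
            used += c["tokens"]
    return picked


by_score = sorted(chunks, key=lambda c: c["score"], reverse=True)
by_density = sorted(chunks, key=lambda c: c["score"] / c["tokens"], reverse=True)

for label, order in (("top-N by score", by_score), ("by density", by_density)):
    picked = take_in_order(order, BUDGET)
    total_score = sum(c["score"] for c in picked)
    total_tokens = sum(c["tokens"] for c in picked)
    print(f"{label}: {[c['id'] for c in picked]} "
          f"relevance={total_score:.2f} tokens={total_tokens}")
```

With these assumed inputs, ordering by score keeps only `big` (0.85 relevance for 600 tokens), while ordering by density keeps the three small chunks (2.40 relevance for 400 tokens).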

How it works

Each chunk’s value density is its relevance score divided by its token count — how much relevance you buy per token spent. The optimizer then:

Sorts all chunks by density in descending order (most efficient first).
Greedily adds each chunk if it fits within the remaining budget, skipping those that would overflow.
After the greedy pass, attempts one improvement sweep: checks whether any skipped chunk can replace an included chunk for a net relevance gain within the same budget.

The improvement sweep targets the classic failure mode of density-first selection: a small, high-density chunk is taken early and leaves too little room for a larger chunk with a higher total score. Swapping out a lower-density chunk alone cannot help, because the chunks that caused the skip all ranked ahead of the skipped one and would still be in the selection.

The result shows which chunks are included, which are dropped, the total tokens used, and the cumulative relevance score captured.
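The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool's actual source: the `Chunk` type, the `pack` function, the result keys, and the exact swap rule (replace one included chunk with one dropped chunk when the result still fits and total relevance rises) are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    id: str
    score: float  # relevance; assumed non-negative
    tokens: int   # token count; assumed > 0

    @property
    def density(self) -> float:
        return self.score / self.tokens


def pack(chunks, budget):
    """Greedy density packing plus one single-swap improvement sweep."""
    included, dropped, used = [], [], 0

    # Sort by density (descending) and keep every chunk that still fits.
    for c in sorted(chunks, key=lambda c: c.density, reverse=True):
        if used + c.tokens <= budget:
            included.append(c)
            used += c.tokens
        else:
            dropped.append(c)

    # One sweep: find the single included -> dropped swap with the largest
    # relevance gain that stays within budget, and apply it if one exists.
    best = None  # (gain, victim, candidate)
    for cand in dropped:
        for victim in included:
            gain = cand.score - victim.score
            fits = used - victim.tokens + cand.tokens <= budget
            if fits and gain > 0 and (best is None or gain > best[0]):
                best = (gain, victim, cand)
    if best is not None:
        _, victim, cand = best
        included[included.index(victim)] = cand
        dropped[dropped.index(cand)] = victim
        used += cand.tokens - victim.tokens

    return {
        "included": [c.id for c in included],
        "dropped": [c.id for c in dropped],
        "tokens_used": used,
        "relevance": round(sum(c.score for c in included), 4),
    }


# Hypothetical input where the sweep matters: "a" is dense but small, and
# taking it first leaves no room for the higher-scoring "b".
result = pack(
    [Chunk("a", 0.50, 100), Chunk("b", 0.95, 950), Chunk("c", 0.10, 50)],
    budget=1000,
)
print(result)
```

On this hypothetical input the greedy pass keeps `a` and `c` (0.60 total relevance in 150 tokens) and drops `b`. The sweep then swaps `a` for `b`, ending at 1.05 relevance in exactly 1,000 tokens.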

Why greedy density beats exact knapsack here

The exact 0/1 knapsack solution is optimal, but the standard dynamic-programming solver takes time proportional to the number of chunks times the token budget, and that extra work buys little here. Relevance scores from vector search are noisy estimates: a cosine similarity of 0.82 versus 0.80 is not meaningfully different given embedding model variance. Treating approximate scores as exact values and solving the problem exactly is false precision. The greedy density approach is fast, near-optimal when chunk sizes are similar, and matched to the actual precision of the inputs.
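For comparison, this is roughly what the exact approach looks like: a textbook dynamic-programming solver for 0/1 knapsack. The function name `exact_knapsack`, the tuple layout, and the sample chunks are assumptions made for illustration, and the tool itself does not necessarily contain anything like this.

```python
def exact_knapsack(chunks, budget):
    """Exact 0/1 knapsack by dynamic programming over token budgets.

    chunks: list of (id, score, tokens) tuples with integer token counts.
    Returns (best_total_score, ids_of_selected_chunks).
    Runs in O(len(chunks) * budget) time and memory.
    """
    n = len(chunks)
    # best[i][b] = best score using only the first i chunks within b tokens.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, score, tokens) in enumerate(chunks, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: leave chunk i out
            if tokens <= b:              # option 2: include it if it fits
                with_chunk = best[i - 1][b - tokens] + score
                if with_chunk > best[i][b]:
                    best[i][b] = with_chunk

    # Walk back through the table to recover which chunks were chosen.
    selected, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            cid, _, tokens = chunks[i - 1]
            selected.append(cid)
            b -= tokens
    return best[n][budget], selected[::-1]


# Hypothetical chunks: (id, relevance score, token count).
sample = [("a", 0.82, 300), ("b", 0.80, 250), ("c", 0.60, 200), ("d", 0.55, 400)]
total, ids = exact_knapsack(sample, budget=800)
print(round(total, 2), ids)
```

On this sample the exact solver selects `a`, `b`, and `c` for a total of 2.22, and a density-first greedy pass over the same four chunks selects the same three. The table has one row per chunk and one column per token of budget, which is where the cost comes from.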

Setting up your inputs

Token counts: Count the actual tokens your target model will see, not words or characters. For OpenAI models, use tiktoken; for others, use the model’s tokenizer. As a rough fallback for English prose, assume about 0.75 words per token, which means dividing the word count by 0.75 rather than multiplying.
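When no tokenizer is at hand, the rough estimate can be sketched like this. The function name `estimate_tokens` and the whitespace-based word split are assumptions, and the result is only an approximation for English prose.

```python
import math


def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate for English prose from a whitespace word count.

    Assumes about 0.75 words per token, so tokens = words / 0.75.
    Use the model's real tokenizer when exact counts matter.
    """
    words = len(text.split())
    return math.ceil(words / words_per_token)


# Seven whitespace-separated words, so the estimate is ceil(7 / 0.75).
print(estimate_tokens("Retrieval-augmented generation lives or dies on context."))
```

Rounding up errs on the side of overestimating, which is the safer direction when the number feeds a hard token budget.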

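Relevance scores: Use one consistent, non-negative scale. Cosine similarity from a vector store typically runs 0.0–1.0. A cross-encoder reranker might output a logit or a 0–10 score. Whatever you use, apply the same scale to every chunk in a session. The algorithm only compares relative density, so a constant multiplier cancels out. A constant offset does not: raw logits that can go negative, or ratings that start at 1 rather than 0, should be mapped to a zero-based range first.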
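A quick way to see what "relative density" preserves is to rank the same chunks on different scales. The chunk values and the helper `density_order` below are hypothetical. Multiplying every score by a constant leaves the ranking unchanged, while adding a constant offset can reorder it, which is worth keeping in mind when converting between scales.

```python
def density_order(chunks):
    """Return chunk ids sorted by score / tokens, highest first."""
    return [cid for cid, score, tokens in
            sorted(chunks, key=lambda c: c[1] / c[2], reverse=True)]


# Hypothetical chunks: (id, score on a 0-1 scale, token count).
chunks = [("a", 0.9, 400), ("b", 0.5, 150), ("c", 0.3, 120)]

scaled = [(cid, score * 10, tokens) for cid, score, tokens in chunks]
shifted = [(cid, score + 1, tokens) for cid, score, tokens in chunks]

print(density_order(chunks))   # original scale
print(density_order(scaled))   # every score multiplied by 10
print(density_order(shifted))  # every score increased by 1
```

Here the original and scaled lists both rank as `b`, `c`, `a`, while the shifted list ranks as `c`, `b`, `a`.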

Token budget: This is not your model’s full context window. Reserve room for the system prompt, the user’s question, any few-shot examples you inject, and the expected answer length. The context budget is what is left after those reservations. Underestimating this headroom is a common setup mistake.
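The subtraction is simple, but writing it down helps avoid forgetting a reservation. In this sketch the function name `context_budget` and every figure are hypothetical; substitute the numbers for your own model and prompts.

```python
def context_budget(context_window, system_prompt, question, few_shot, answer_reserve):
    """Tokens left for retrieved chunks after all other reservations."""
    reserved = system_prompt + question + few_shot + answer_reserve
    remaining = context_window - reserved
    if remaining <= 0:
        raise ValueError("reservations already exceed the context window")
    return remaining


# Hypothetical figures, all in tokens.
budget = context_budget(
    context_window=8192,
    system_prompt=400,
    question=150,
    few_shot=900,
    answer_reserve=1024,
)
print(budget)  # 5718
```

With these assumed figures, only 5,718 of the 8,192 tokens are available for retrieved chunks, so that is the number to enter as the token budget.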

Tips for better results

Rerank before optimizing. Raw vector similarity scores measure how close two embeddings are, not how well a document answers the specific query. A cross-encoder reranker or LLM-as-judge score after retrieval gives the optimizer much better signal.
Split long chunks upstream. If a high-score chunk keeps being dropped because it is too large, the fix is in your chunking strategy — split at semantic boundaries to produce smaller, denser units.
Watch the dropped list, not just the selected list. A chunk with a high raw score that consistently misses selection is telling you it is too token-heavy relative to its relevance. That is useful diagnostic feedback for your chunker.
Everything runs locally. Your chunk text, scores, and budget figures never leave your browser.