What makes up the cost of a RAG query?

Three pieces: embedding the user's query (cheap), feeding the retrieved chunks plus the query as input to the generation model (often the largest piece), and generating the answer (output tokens). Vector-search compute is usually negligible per query.

Why are retrieved chunks the biggest cost?

Retrieved context is sent as input on every query and can be thousands of tokens. More chunks or larger chunks directly raise the input token bill, so retrieval breadth is the main cost lever.

Is the embedding cost significant?

Usually no. Embedding one short query costs a tiny fraction of a cent. The expensive part is the generation call with the retrieved context attached.

Is anything sent to a server?

No. The full breakdown is computed in your browser. Nothing you enter is uploaded, stored or logged.

What is the Semantic Search Cost Estimator?

The Semantic Search Cost Estimator breaks down the full cost of one RAG query into embedding the question, vector search, retrieving context, and the final generation call, and shows the per-query and monthly cost for your workload. It runs free in your browser on Gera Tools, with nothing uploaded.

Semantic Search Cost Estimator

Name: Semantic Search Cost Estimator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

A single RAG (retrieval-augmented generation) query is not one cost but three: embedding the question, sending the retrieved context to the model, and generating the answer. This tool breaks them out so you can see where the money actually goes and project the monthly bill for your query volume.

How it works

The estimator prices each stage and sums them:

embed_cost = (query_tokens / 1e6) × embedding_price
input_cost = ((retrieved_chunks × chunk_tokens) + query_tokens) / 1e6 × gen_input_price
output_cost = (answer_tokens / 1e6) × gen_output_price
per_query = embed_cost + input_cost + output_cost
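
The formulas above translate almost line for line into code. The following is a minimal Python sketch, not the tool's actual source: the function and field names, the workload, and the prices (USD per million tokens) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class QueryCost:
    """Cost of one RAG query in USD, split by stage."""
    embed_cost: float
    input_cost: float
    output_cost: float

    @property
    def per_query(self) -> float:
        return self.embed_cost + self.input_cost + self.output_cost


def rag_query_cost(query_tokens: int, retrieved_chunks: int, chunk_tokens: int,
                   answer_tokens: int, embedding_price: float,
                   gen_input_price: float, gen_output_price: float) -> QueryCost:
    """Apply the three stage formulas; all prices are USD per 1M tokens."""
    embed_cost = query_tokens / 1e6 * embedding_price
    input_cost = (retrieved_chunks * chunk_tokens + query_tokens) / 1e6 * gen_input_price
    output_cost = answer_tokens / 1e6 * gen_output_price
    return QueryCost(embed_cost, input_cost, output_cost)


# Hypothetical workload and prices, chosen only to show the proportions.
cost = rag_query_cost(query_tokens=50, retrieved_chunks=5, chunk_tokens=300,
                      answer_tokens=200, embedding_price=0.02,
                      gen_input_price=3.0, gen_output_price=15.0)
monthly_cost = cost.per_query * 100_000  # 100,000 queries per month
```

With these assumed numbers the input stage accounts for roughly 60% of the per-query cost and the embedding stage for well under 0.1%.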

In almost every real pipeline the input cost dominates because the retrieved context is large and is re-sent on every query. Embedding the query is a rounding error by comparison; vector-search compute is typically negligible per request.

Understanding the three cost stages

Stage 1 — Query embedding

Embedding the user’s question converts it to a vector for the nearest-neighbour search. At typical embedding model prices this stage costs a fraction of a cent per query, making it negligible in the overall bill. The chunk embeddings stored in your vector index were a one-time cost and are not re-incurred per query.

Stage 2 — Retrieved context as input

This is almost always the dominant cost. Each retrieved chunk is re-sent in full as part of the generation prompt. Retrieve five chunks of 300 tokens each and you add 1,500 tokens to every single input call, regardless of the query. At scale this accumulates fast: 100,000 queries per month with five 300-token chunks is 150 million tokens just in retrieved context.
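
The monthly figure above is a straight multiplication, sketched below in Python. The function name and the $3 per million input tokens price are assumptions for illustration only.

```python
def monthly_context_tokens(queries_per_month: int, retrieved_chunks: int,
                           chunk_tokens: int) -> int:
    """Tokens spent per month on retrieved context alone (query text excluded)."""
    return queries_per_month * retrieved_chunks * chunk_tokens


tokens = monthly_context_tokens(queries_per_month=100_000,
                                retrieved_chunks=5, chunk_tokens=300)
# tokens == 150_000_000

# Assumed generation input price of $3 per 1M tokens.
context_cost = tokens / 1e6 * 3.0
# context_cost == 450.0 (USD per month, retrieved context only)
```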

The levers here are chunk count (retrieve fewer, better chunks using a re-ranker), chunk size (smaller chunks reduce token waste but may omit necessary context), and summarisation (condense retrieved passages before appending them to the prompt).

Stage 3 — Generation output

Output tokens are typically priced 3–5x higher than input tokens, but answers are usually short (a few hundred tokens at most), so this stage is rarely the largest component unless your answers are long relative to the retrieved context.
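
To check whether output could overtake input for a given workload, it helps to compute the answer length at which the two stages cost the same. The sketch below is a direct rearrangement of the input and output formulas; the function name and the prices are hypothetical.

```python
def break_even_answer_tokens(input_tokens: int, gen_input_price: float,
                             gen_output_price: float) -> float:
    """Answer length (in tokens) at which output cost equals input cost."""
    return input_tokens * gen_input_price / gen_output_price


# Five 300-token chunks plus a 50-token query, with output assumed to cost
# 5x input ($3 vs $15 per 1M tokens).
threshold = break_even_answer_tokens(input_tokens=5 * 300 + 50,
                                     gen_input_price=3.0,
                                     gen_output_price=15.0)
# threshold == 310.0
```

Under these assumptions, answers longer than about 310 tokens would make generation output the larger stage, so it is worth running this check when retrieval is narrow or the price ratio is high.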

Architecture decisions that change the numbers significantly

Re-ranking: retrieve 10 candidates from the vector index, run a lightweight cross-encoder to re-rank them, and send only the top 3 to the generator. Compared with sending all 10, this cuts retrieved-context tokens by 70%, usually with little loss in answer quality as long as the re-ranker reliably surfaces the relevant chunks.
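
The 70% figure is relative to sending every retrieved candidate to the generator. A small Python sketch of that arithmetic follows; the function name and the 300-token chunk size are assumptions, and the cost of running the re-ranker itself is not modelled.

```python
def context_token_savings(candidates: int, kept: int,
                          chunk_tokens: int) -> tuple[int, int, float]:
    """Return (tokens before, tokens after, fraction saved) for a re-rank step."""
    before = candidates * chunk_tokens
    after = kept * chunk_tokens
    return before, after, 1 - after / before


before, after, saved = context_token_savings(candidates=10, kept=3, chunk_tokens=300)
# before == 3000, after == 900, saved is approximately 0.7
```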

Hybrid retrieval: combining keyword search (BM25) with vector search can let you retrieve fewer but more relevant chunks. Sending fewer chunks directly cuts input cost.

Prompt caching: if your system prompt or retrieval instructions are long and fixed, providers that support prompt caching reuse the cached KV state across requests, so you pay full price for that prefix once and a reduced rate on later cache hits, for as long as the cache entry stays alive.
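
The effect of caching on the input stage can be sketched as below, assuming cached prefix tokens are billed at some fraction of the normal input price on a cache hit. The `cached_read_multiplier` parameter and its value of 0.1 are hypothetical; actual discounts, cache-write surcharges, and expiry rules differ by provider and are not modelled here.

```python
def input_cost_with_cache(fixed_tokens: int, variable_tokens: int,
                          gen_input_price: float,
                          cached_read_multiplier: float) -> float:
    """Input cost of one query (USD) when the fixed prefix is a cache hit.

    gen_input_price is USD per 1M tokens; cached_read_multiplier is the
    assumed fraction of that price charged for cached tokens.
    """
    fixed = fixed_tokens / 1e6 * gen_input_price * cached_read_multiplier
    variable = variable_tokens / 1e6 * gen_input_price
    return fixed + variable


# 2,000-token fixed system prompt, 1,550 tokens of retrieved context plus query.
uncached = input_cost_with_cache(2000, 1550, gen_input_price=3.0,
                                 cached_read_multiplier=1.0)
cached = input_cost_with_cache(2000, 1550, gen_input_price=3.0,
                               cached_read_multiplier=0.1)
```

With these assumed values the input cost per query falls from about $0.0107 to about $0.0053; the retrieved context is still billed at full price because it changes on every query.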

Model routing: for simple factual queries, a smaller, cheaper generation model often performs as well as a flagship model at a fraction of the price.
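
The saving from routing depends on what share of traffic can safely go to the cheaper model. A minimal sketch of the blended per-query cost follows; the function name, the 60% share, and both per-query costs are assumptions, and the cost of the routing decision itself is ignored.

```python
def blended_cost(simple_share: float, cheap_cost: float,
                 premium_cost: float) -> float:
    """Average per-query cost when a share of queries uses the cheaper model."""
    if not 0.0 <= simple_share <= 1.0:
        raise ValueError("simple_share must be between 0 and 1")
    return simple_share * cheap_cost + (1 - simple_share) * premium_cost


# Hypothetical per-query costs in USD for the two models.
average = blended_cost(simple_share=0.6, cheap_cost=0.0008, premium_cost=0.0077)
```

Here the average comes to roughly $0.0036 per query, a bit under half of the premium-only cost.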

Tips

The biggest savings come from retrieving fewer, better chunks: re-rank and send the top 3 instead of the top 10.
Trim chunk size or summarise retrieved passages before sending them to the generator to cut input tokens.
Use prompt caching for any fixed system prompt or instructions so you only pay full price for the variable retrieved context.
Route simple lookups to a cheaper generation model; reserve the premium model for queries that genuinely need stronger reasoning.