What are the cost stages in a RAG query?

A retrieval-augmented query has three priced stages: embedding the query (tiny), the generation model reading the query plus retrieved context as input, and the generation model writing the answer as output. Vector search itself is usually a fixed infrastructure cost, not per-token.

Why is retrieved context the biggest input cost?

To ground answers, you inject several retrieved chunks into the prompt. Ten chunks of 500 tokens each add 5,000 input tokens before the question itself is counted. That is far more than the short query you embed, and you pay for it again on every query.

Is vector search included?

Vector search is typically billed as flat infrastructure (managed vector DB or self-hosted), not per token, so it is not in the per-query LLM cost here. Add it separately as a fixed monthly line.
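
As a rough sketch of how the fixed line combines with per-query spend, the function below adds a flat monthly vector-search cost to the variable LLM cost. The function name, the 30-day month, and all dollar figures are illustrative assumptions, not values from this tool or from any vendor.

```python
def monthly_spend(per_query_usd: float, queries_per_day: int,
                  fixed_infra_usd: float = 0.0, days: int = 30) -> float:
    """Variable LLM spend for the month plus a flat vector-search line."""
    variable = per_query_usd * queries_per_day * days
    return variable + fixed_infra_usd


# Hypothetical inputs: $0.004 per query, 2,000 queries a day,
# and a $70/month managed vector database.
total = monthly_spend(0.004, 2_000, fixed_infra_usd=70.0)
print(f"Estimated monthly spend: ${total:,.2f}")
```

Keeping the fixed line as a separate argument makes it easy to see that it does not scale with query volume, while the LLM portion does.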

How do I cut cost per query?

Retrieve fewer, more relevant chunks (lower top-k), compress or rerank context, cache answers to repeated questions, and use a cheaper generation model for simple lookups.

What is the Knowledge Base Q&A Cost Per Query?

Break down the full cost of a RAG knowledge base Q&A query — query embedding, retrieved context injection, and answer generation. Enter token counts, model choice, and daily volume to see cost per query and monthly spend. It runs free in your browser on Gera Tools, with nothing uploaded.

Knowledge Base Q&A Cost Per Query

Name: Knowledge Base Q&A Cost Per Query
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

What one knowledge base answer really costs

A RAG (retrieval-augmented generation) query feels like one action, but you pay for three stages: embedding the user’s question, the generation model reading the question plus the retrieved context, and the model writing the answer. This calculator prices each stage so you can see exactly where your spend goes and project it across daily volume.

How the per-query cost breaks down

embed_cost   = query_tokens / 1e6 × embed_price_per_1M
gen_in_cost  = (query_tokens + context_tokens) / 1e6 × gen_input_price_per_1M
gen_out_cost = answer_tokens / 1e6 × gen_output_price_per_1M
per_query    = embed_cost + gen_in_cost + gen_out_cost

The embedding step is almost free: embedding models are cheap and the query is short. The dominant costs are the retrieved context you inject as generation input, which you pay for again on every query, and the answer output, which is billed at the higher output rate.
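
The breakdown above translates directly into a few lines of Python. This is a minimal sketch, not the calculator's actual implementation: the `Prices` class, the function name, and the default prices are placeholders chosen for illustration, so substitute your provider's current rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD per 1M tokens. Defaults are placeholders, not real vendor prices."""
    embed: float = 0.02
    gen_input: float = 3.00
    gen_output: float = 12.00


def per_query_cost(query_tokens: int, context_tokens: int,
                   answer_tokens: int, prices: Prices = Prices()) -> dict:
    """Price each of the three stages and return them with the total."""
    embed_cost = query_tokens / 1e6 * prices.embed
    gen_in_cost = (query_tokens + context_tokens) / 1e6 * prices.gen_input
    gen_out_cost = answer_tokens / 1e6 * prices.gen_output
    return {
        "embed": embed_cost,
        "gen_in": gen_in_cost,
        "gen_out": gen_out_cost,
        "total": embed_cost + gen_in_cost + gen_out_cost,
    }


# Hypothetical query: 60-token question, 1,500 context tokens, 250-token answer.
for stage, usd in per_query_cost(60, 1_500, 250).items():
    print(f"{stage:>7}: ${usd:.6f}")
```

Returning the stages separately, rather than only the total, makes it easy to see which one dominates for a given configuration.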

A concrete illustration

Suppose you retrieve 5 chunks of 400 tokens each (2,000 context tokens), the user’s question is 50 tokens, and the answer is 200 tokens.

With a mid-range generation model:

Embedding cost: 50 tokens — negligible at typical embedding prices
Generation input: (50 + 2,000) = 2,050 tokens — the main driver
Generation output: 200 tokens — often priced at around 3–4× the input rate per token

The injected context is about 98% of all input tokens (2,000 of 2,050). Every token you cut from the context, through better retrieval, reranking, or smaller chunks, translates almost directly into lower generation input cost.
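
To put numbers on this illustration, the sketch below computes the context share of input tokens and each stage's share of the per-query cost. The token counts come from the example above; the three prices are assumed placeholders (with output set to 4× input), so the cost shares will shift with real rates.

```python
# Token counts from the illustration above.
query_tokens = 50
context_tokens = 5 * 400          # 5 chunks of 400 tokens
answer_tokens = 200

# Assumed prices in USD per 1M tokens (placeholders, not vendor quotes).
EMBED_PER_1M = 0.02
GEN_INPUT_PER_1M = 3.00
GEN_OUTPUT_PER_1M = 12.00

input_tokens = query_tokens + context_tokens
context_share = context_tokens / input_tokens

costs = {
    "embedding": query_tokens / 1e6 * EMBED_PER_1M,
    "generation input": input_tokens / 1e6 * GEN_INPUT_PER_1M,
    "generation output": answer_tokens / 1e6 * GEN_OUTPUT_PER_1M,
}
total = sum(costs.values())

print(f"Context share of input tokens: {context_share:.1%}")
for stage, usd in costs.items():
    print(f"{stage:>18}: ${usd:.6f} ({usd / total:.1%} of total)")
```

Under these assumed prices, generation input accounts for roughly 72% of the per-query cost and generation output for roughly 28%, with embedding far below 1%.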

Which constraint binds: embedding vs generation

For most production RAG pipelines:

Stage	Typical token count	Cost sensitivity
Embedding	10–100 tokens (the question)	Very low — near-zero per query
Generation input	500–8,000 (question + chunks)	High — directly proportional to chunk count and size
Generation output	100–500 (the answer)	Moderate — output rates are higher per token

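Increasing top-k from 3 to 10 more than triples your generation input cost per query (about 3.3× the context tokens). Compressing chunks from 500 to 200 tokens cuts context tokens by 60%, and generation input cost by nearly as much, since the question tokens stay the same. Neither changes embedding cost, which depends only on the length of the question.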
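
The scaling can be checked with a short calculation. This sketch assumes a 50-token question and 500-token chunks as the baseline; the helper name is hypothetical. Because input cost is linear in input tokens, token ratios equal cost ratios at any fixed input price.

```python
def gen_input_tokens(top_k: int, chunk_tokens: int, query_tokens: int = 50) -> int:
    """Tokens the generation model reads: the question plus all injected chunks."""
    return query_tokens + top_k * chunk_tokens


narrow = gen_input_tokens(top_k=3, chunk_tokens=500)       # 1,550 tokens
wide = gen_input_tokens(top_k=10, chunk_tokens=500)        # 5,050 tokens
compressed = gen_input_tokens(top_k=10, chunk_tokens=200)  # 2,050 tokens

print(f"top-k 3 -> 10 multiplies input cost by {wide / narrow:.2f}x")
print(f"500 -> 200 token chunks cut input cost by {1 - compressed / wide:.1%}")
```

The reduction comes out slightly under 60% because the question tokens are not compressed along with the chunks.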

Tips to control RAG spend

Lower top-k: Retrieve 3–5 focused chunks rather than 10. A reranker helps select the best ones.
Rerank before injection: Use a lightweight cross-encoder to score and discard irrelevant chunks before they reach the prompt of the expensive frontier model.
Cache frequent answers: Identical or near-identical questions (FAQ-style) can be served from a cache with zero generation cost. Matching each new query embedding against cached ones by similarity catches near-duplicates.
Route by difficulty: Use a cheap, fast model for simple lookups and only escalate to a frontier model when the query signals complexity or low retrieval confidence.
Compress context: Summarise retrieved chunks to 30–50% of their length before injection. This cuts input tokens at the cost of one extra cheap generation call, which is usually worth it above moderate query volumes.
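
Whether the compression tip pays off per query depends on the price gap between the summarising model and the generation model. The sketch below estimates the net saving for one query; the function name and every price are assumptions for illustration, and it ignores latency and engineering overhead.

```python
def compression_net_saving(context_tokens: int, keep_ratio: float,
                           cheap_in_per_1m: float, cheap_out_per_1m: float,
                           frontier_in_per_1m: float) -> float:
    """USD saved per query by summarising context before injection.

    A positive result means compression pays for itself; negative means
    the extra summarisation call costs more than it saves.
    """
    kept_tokens = context_tokens * keep_ratio
    # The cheap model reads the full context and writes the summary.
    compress_cost = (context_tokens / 1e6 * cheap_in_per_1m
                     + kept_tokens / 1e6 * cheap_out_per_1m)
    # The frontier model no longer reads the tokens that were removed.
    saved = (context_tokens - kept_tokens) / 1e6 * frontier_in_per_1m
    return saved - compress_cost


# Hypothetical prices (USD per 1M tokens): cheap model 0.15 in / 0.60 out,
# frontier model 3.00 in. Context compressed to 40% of its length.
net = compression_net_saving(2_000, 0.4, 0.15, 0.60, 3.00)
print(f"Net saving per query: ${net:.5f}")
```

With a large price gap the saving is positive; if the generation model's input rate is close to the summariser's rates, the same call returns a negative number and compression is not worth it on cost alone.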