Semantic search cost estimator
A single RAG (retrieval-augmented generation) query is not one cost but three: embedding the question, sending the retrieved context to the generation model, and generating the answer. This tool breaks them out so you can see where the money actually goes and project the monthly bill for your query volume.
How it works
The estimator prices each stage and sums them. All prices are per 1 million tokens:
embed_cost = (query_tokens / 1e6) × embedding_price
input_cost = ((retrieved_chunks × chunk_tokens + query_tokens) / 1e6) × gen_input_price
output_cost = (answer_tokens / 1e6) × gen_output_price
per_query = embed_cost + input_cost + output_cost
In most real pipelines the input cost dominates, because the retrieved context is large and is re-sent in full on every query. Embedding the query is a rounding error by comparison, and vector-search compute is typically negligible per request, so the formulas leave it out.
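The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the names `QueryCost`, `estimate_query_cost`, and `monthly_cost` are hypothetical, and the prices in the example are made-up placeholders rather than any vendor's real rates.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000  # all prices are quoted per 1M tokens


@dataclass
class QueryCost:
    """Cost of one RAG query, split by stage (same currency unit as the prices)."""
    embed_cost: float
    input_cost: float
    output_cost: float

    @property
    def per_query(self) -> float:
        return self.embed_cost + self.input_cost + self.output_cost


def estimate_query_cost(query_tokens, retrieved_chunks, chunk_tokens,
                        answer_tokens, embedding_price,
                        gen_input_price, gen_output_price):
    """Apply the three stage formulas from the text and return the breakdown."""
    embed_cost = query_tokens / TOKENS_PER_MILLION * embedding_price
    # The generator receives the retrieved context plus the query itself.
    context_tokens = retrieved_chunks * chunk_tokens + query_tokens
    input_cost = context_tokens / TOKENS_PER_MILLION * gen_input_price
    output_cost = answer_tokens / TOKENS_PER_MILLION * gen_output_price
    return QueryCost(embed_cost, input_cost, output_cost)


def monthly_cost(per_query, queries_per_month):
    """Project the monthly bill from a per-query cost and a query volume."""
    return per_query * queries_per_month


if __name__ == "__main__":
    # Placeholder workload and prices (assumptions for illustration only).
    cost = estimate_query_cost(
        query_tokens=50, retrieved_chunks=10, chunk_tokens=500,
        answer_tokens=300, embedding_price=0.10,
        gen_input_price=1.00, gen_output_price=4.00,
    )
    print(f"embed:  {cost.embed_cost:.6f}")
    print(f"input:  {cost.input_cost:.6f}")
    print(f"output: {cost.output_cost:.6f}")
    print(f"total:  {cost.per_query:.6f}")
    print(f"monthly at 100,000 queries: {monthly_cost(cost.per_query, 100_000):.2f}")
```

With these placeholder numbers the input stage accounts for roughly 81% of the per-query cost, while embedding contributes well under 1%. That matches the pattern described above, though the exact split depends entirely on the prices and token counts you plug in.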
Tips
- The biggest savings come from retrieving fewer, better chunks: re-rank and send the top 3 instead of the top 10, which cuts retrieved-context tokens by 70% at the same chunk size.
- Trim chunk size or summarize retrieved passages before sending them to the generator to cut input tokens.
- Use prompt caching for any fixed system prompt or instructions so you only pay full price for the variable retrieved context.
- Route simple lookups to a cheaper generation model; reserve the premium model for queries that genuinely need stronger reasoning.
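
To get a feel for how much two of these tips might save, the sketch below compares sending 10 chunks against sending 3, then blends a cheaper and a premium model by routing share. It reuses the estimator's formulas. The function names, the workload, both price sets, and the 60% routing share are hypothetical placeholders, not measured values or real vendor rates.

```python
def per_query_cost(query_tokens, retrieved_chunks, chunk_tokens, answer_tokens,
                   embedding_price, gen_input_price, gen_output_price):
    """Per-query cost from the estimator's formulas; prices are per 1M tokens."""
    embed_cost = query_tokens / 1e6 * embedding_price
    input_cost = (retrieved_chunks * chunk_tokens + query_tokens) / 1e6 * gen_input_price
    output_cost = answer_tokens / 1e6 * gen_output_price
    return embed_cost + input_cost + output_cost


def blended_cost(cheap_cost, premium_cost, cheap_share):
    """Average per-query cost when cheap_share of queries go to the cheaper model."""
    return cheap_share * cheap_cost + (1 - cheap_share) * premium_cost


# Placeholder workload (assumption): 50-token query, 500-token chunks, 300-token answer.
workload = dict(query_tokens=50, chunk_tokens=500, answer_tokens=300,
                embedding_price=0.10)
# Placeholder generation prices per 1M tokens (assumptions, not real rates).
premium = dict(gen_input_price=1.00, gen_output_price=4.00)
cheap = dict(gen_input_price=0.20, gen_output_price=0.80)

# Tip 1: re-rank and send the top 3 chunks instead of the top 10.
top10 = per_query_cost(retrieved_chunks=10, **workload, **premium)
top3 = per_query_cost(retrieved_chunks=3, **workload, **premium)
chunk_saving = 1 - top3 / top10

# Tip 4: route an assumed 60% of queries to the cheaper model.
top3_cheap = per_query_cost(retrieved_chunks=3, **workload, **cheap)
routed = blended_cost(top3_cheap, top3, cheap_share=0.6)

print(f"top 10: {top10:.6f}  top 3: {top3:.6f}  saving: {chunk_saving:.1%}")
print(f"top 3 with routing: {routed:.6f}")
```

Under these assumptions, cutting from 10 chunks to 3 lowers the per-query cost by a little over half, and routing lowers it further. The comparison ignores any drop in answer quality from sending less context or using a weaker model, which has to be checked separately.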