RAG vs long-context cost calculator
Long-context models like Gemini 1.5 Pro can hold a million tokens in a single prompt, which makes for a tempting shortcut: skip the vector database and just paste your whole corpus into every request. That convenience has a cost: you pay for the entire corpus on every single query. This tool puts that number next to the cost of a RAG pipeline so you can decide with arithmetic instead of vibes.
How it works
Two strategies are priced over a 30-day month:
- Long-context: every query sends the full corpus as input. Monthly cost is corpus_tokens × price_per_token × queries_per_day × 30. There is no indexing cost, but the per-query input is enormous.
- RAG: a one-time embedding pass over the corpus (corpus_tokens × embed_price), then each query sends only the retrieved chunk (retrieved_tokens × query_price). Monthly cost is the amortized embedding plus retrieved_tokens × query_price × queries_per_day × 30.
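The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function names and the amortize_months parameter are assumptions (the text does not say over how many months the embedding pass is amortized, so the default of 1 charges it all to one month), and the prices in the example are made-up round numbers rather than any vendor's real rates.

```python
def long_context_monthly_cost(corpus_tokens, price_per_token, queries_per_day, days=30):
    """Every query resends the whole corpus as input."""
    return corpus_tokens * price_per_token * queries_per_day * days


def rag_monthly_cost(corpus_tokens, embed_price, retrieved_tokens, query_price,
                     queries_per_day, days=30, amortize_months=1):
    """One-time embedding pass plus per-query cost of the retrieved chunk only.

    amortize_months is a hypothetical knob: the embedding cost is spread
    evenly over that many months (1 = charged entirely to this month).
    """
    embedding = corpus_tokens * embed_price / amortize_months
    querying = retrieved_tokens * query_price * queries_per_day * days
    return embedding + querying


# Illustrative inputs only (not real prices): a 1M-token corpus,
# $1 per million input tokens, $0.10 per million embedded tokens,
# 2,000 retrieved tokens per query, 100 queries per day.
long_cost = long_context_monthly_cost(1_000_000, 1e-6, 100)
rag_cost = rag_monthly_cost(1_000_000, 1e-7, 2_000, 1e-6, 100)
print(f"long-context: ${long_cost:,.2f}/month")
print(f"RAG:          ${rag_cost:,.2f}/month")
```

With these made-up numbers the long-context bill lands near $3,000 a month and the RAG bill near $6, which is the gap the calculator is meant to expose.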
Latency is modelled as a function of input size: long-context pays a per-token processing cost on the whole corpus, while RAG pays a fixed retrieval overhead plus processing of only the retrieved slice.
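One way to read that latency model is as two linear functions of input size. The sketch below assumes processing time scales linearly with input tokens, which the text implies but does not state exactly; the names seconds_per_token and retrieval_overhead_s are hypothetical, and the example values are placeholders rather than measurements.

```python
def long_context_latency(corpus_tokens, seconds_per_token):
    """Per-token processing cost paid on the whole corpus."""
    return corpus_tokens * seconds_per_token


def rag_latency(retrieved_tokens, seconds_per_token, retrieval_overhead_s):
    """Fixed retrieval overhead plus processing of only the retrieved slice."""
    return retrieval_overhead_s + retrieved_tokens * seconds_per_token


# Placeholder numbers, not benchmarks: 50 microseconds per input token
# and a 0.2 s retrieval round trip.
print(long_context_latency(1_000_000, 5e-5))  # whole corpus processed per query
print(rag_latency(2_000, 5e-5, 0.2))          # overhead + small retrieved slice
```

Under this linear assumption, long-context latency grows with corpus size while RAG latency stays roughly flat, because only the retrieved slice is processed.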
Tips and notes
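- The crossover point moves earlier (toward RAG) as corpus size and query volume grow. Corpus size multiplies the cost of every long-context query but reaches the RAG bill only through the one-time embedding pass, and query volume multiplies the full corpus for long-context but only the retrieved tokens for RAG.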
- If your latency budget is tight and the corpus is large, RAG usually wins on latency too, not just cost.
- Remember the hidden RAG costs this calculator does not include: the vector database, reranking, and the engineering time to build and maintain the pipeline. For a tiny corpus those can outweigh the token savings, which is exactly when long-context is the right call.
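The crossover these tips describe can be estimated by setting the two monthly costs equal and solving for queries per day. The sketch below is an illustration under assumptions, not part of the calculator: the function name is hypothetical, the embedding pass is charged in full to one month, and rag_fixed_monthly is a hypothetical stand-in for the hidden costs (vector database, reranking) that the calculator itself leaves out.

```python
from typing import Optional


def break_even_queries_per_day(
    corpus_tokens: float,
    price_per_token: float,
    embed_price: float,
    retrieved_tokens: float,
    query_price: float,
    rag_fixed_monthly: float = 0.0,  # hypothetical: costs outside the calculator's model
    days: int = 30,
) -> Optional[float]:
    """Queries per day above which RAG is cheaper than long-context.

    Returns None when a long-context query costs no more than a RAG query,
    in which case RAG never catches up on token cost.
    """
    long_per_query = corpus_tokens * price_per_token
    rag_per_query = retrieved_tokens * query_price
    if long_per_query <= rag_per_query:
        return None
    # Costs RAG pays regardless of volume: embedding pass plus fixed overhead.
    rag_fixed = corpus_tokens * embed_price + rag_fixed_monthly
    return rag_fixed / (days * (long_per_query - rag_per_query))


# Made-up prices: a tiny 20k-token corpus with an assumed $50/month of
# pipeline overhead versus a 1M-token corpus with no overhead counted.
print(break_even_queries_per_day(20_000, 1e-6, 1e-7, 2_000, 1e-6, rag_fixed_monthly=50.0))
print(break_even_queries_per_day(1_000_000, 1e-6, 1e-7, 2_000, 1e-6))
```

With these illustrative inputs the tiny corpus stays cheaper on long-context until roughly 93 queries per day, while the large corpus crosses over at a small fraction of one query per day, which matches the direction of the tips above.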