RAG vs Long-Context: Cost & Latency Tradeoff

Should you use a 1M-token long-context model or build a RAG pipeline?

RAG vs long-context cost calculator

Long-context models like Gemini 1.5 Pro can hold a million tokens in a single prompt, which makes for a tempting shortcut: skip the vector database and paste your whole corpus into every request. That convenience has a cost: you pay for the entire corpus on every single query. This tool puts that number next to the cost of a RAG pipeline so you can decide with arithmetic instead of vibes.

How it works

Two strategies are priced over a 30-day month:

  • Long-context: every query sends the full corpus as input. Monthly cost is corpus_tokens × price_per_token × queries_per_day × 30. There is no indexing cost, but the per-query input is enormous.
  • RAG: a one-time embedding pass over the corpus (corpus_tokens × embed_price), then each query sends only the retrieved chunks (retrieved_tokens × query_price). Monthly cost is the share of that one-time embedding cost amortized to the month, plus retrieved_tokens × query_price × queries_per_day × 30.
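
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function names, the amortization_months parameter (how many months the one-time embedding cost is spread over), and every price in the example are assumptions chosen for round numbers.

```python
DAYS_PER_MONTH = 30  # the 30-day month used in the text


def long_context_monthly_cost(corpus_tokens, price_per_token, queries_per_day):
    """Every query sends the full corpus as input."""
    return corpus_tokens * price_per_token * queries_per_day * DAYS_PER_MONTH


def rag_monthly_cost(corpus_tokens, embed_price, retrieved_tokens, query_price,
                     queries_per_day, amortization_months=1):
    """One-time embedding pass plus per-query cost of the retrieved tokens.

    amortization_months is a hypothetical knob: 1 charges the whole
    embedding pass to a single month.
    """
    embedding_share = corpus_tokens * embed_price / amortization_months
    per_query = retrieved_tokens * query_price
    return embedding_share + per_query * queries_per_day * DAYS_PER_MONTH


if __name__ == "__main__":
    # Illustrative inputs only; substitute your provider's real prices.
    long_ctx = long_context_monthly_cost(
        corpus_tokens=1_000_000, price_per_token=1e-6, queries_per_day=100)
    rag = rag_monthly_cost(
        corpus_tokens=1_000_000, embed_price=1e-7, retrieved_tokens=2_000,
        query_price=1e-6, queries_per_day=100)
    print(f"long-context: ${long_ctx:,.2f}/month")
    print(f"RAG:          ${rag:,.2f}/month")
```

With these made-up prices the long-context bill is about $3,000 a month against roughly $6 for RAG, before any of the non-token RAG costs are counted.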

Latency is modelled as a function of input size: long-context pays a per-token processing cost on the whole corpus, while RAG pays a fixed retrieval overhead plus processing of only the retrieved slice.
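
A rough sketch of that latency model, assuming processing time grows linearly with input tokens. The names seconds_per_token and retrieval_overhead_s are hypothetical, and the numbers are placeholders rather than measurements of any model.

```python
def long_context_latency_s(corpus_tokens, seconds_per_token):
    """Per-token processing cost paid on the whole corpus."""
    return corpus_tokens * seconds_per_token


def rag_latency_s(retrieved_tokens, seconds_per_token, retrieval_overhead_s):
    """Fixed retrieval overhead plus processing of only the retrieved slice."""
    return retrieval_overhead_s + retrieved_tokens * seconds_per_token


if __name__ == "__main__":
    # Placeholder values; measure your own model and retriever.
    per_token = 0.00002  # assumed seconds of processing per input token
    long_ctx = long_context_latency_s(1_000_000, per_token)
    rag = rag_latency_s(2_000, per_token, retrieval_overhead_s=0.3)
    print(f"long-context: {long_ctx:.2f} s per query")
    print(f"RAG:          {rag:.2f} s per query")
```

Under this linear assumption, long-context is only faster when the corpus is small enough that processing the extra tokens takes less time than the retrieval overhead.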

Tips and notes

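  • The crossover point moves earlier (toward RAG) as corpus size and query volume grow. Corpus size multiplies every long-context query but only the one-time embedding pass on the RAG side. Query volume multiplies both bills, but each RAG query sends only the retrieved slice, so the gap widens with every query.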
  • If your acceptable latency budget is tight and the corpus is large, RAG usually wins on latency too, not just cost.
  • Remember the hidden RAG costs this calculator does not include: the vector database, reranking, and the engineering time to build and maintain the pipeline. For a tiny corpus those can outweigh the token savings, which is exactly when long-context is the right call.
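
To see where the token-cost crossover sits, set the two monthly formulas from "How it works" equal and solve for queries_per_day. The sketch below does that; the function name and the amortization_months parameter are assumptions, and it covers token costs only, so the hidden RAG costs listed above would push the real break-even higher.

```python
def break_even_queries_per_day(corpus_tokens, price_per_token, embed_price,
                               retrieved_tokens, query_price,
                               amortization_months=1, days_per_month=30):
    """Queries per day at which the two monthly token bills are equal.

    Returns None when a long-context query costs no more than a RAG query,
    in which case RAG never catches up on token cost alone.
    """
    saving_per_query = (corpus_tokens * price_per_token
                        - retrieved_tokens * query_price)
    if saving_per_query <= 0:
        return None
    # Hypothetical amortization: spread the one-time embedding pass evenly.
    monthly_embedding = corpus_tokens * embed_price / amortization_months
    return monthly_embedding / (saving_per_query * days_per_month)


if __name__ == "__main__":
    # Illustrative prices only.
    q = break_even_queries_per_day(
        corpus_tokens=1_000_000, price_per_token=1e-6, embed_price=1e-7,
        retrieved_tokens=2_000, query_price=1e-6)
    print(f"break-even at about {q:.4f} queries per day")
```

With these example prices the break-even on tokens alone is far below one query per day, which suggests that for a large corpus the decision hinges mostly on the non-token costs rather than on the token bill.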