How is the eval cost calculated?

For each example, the candidate model is billed for prompt tokens at the input rate and completion tokens at the output rate. That per-example cost is multiplied by the number of examples in the dataset. If you use an LLM judge, the judge's grading cost is added on top.

What does LLM-as-judge add to the cost?

An LLM judge reads each example's prompt, the candidate's answer and your grading rubric as input, then emits a short verdict. That roughly doubles the token volume per example, because the judge re-processes the content the candidate already produced.

Why are multiple-choice benchmarks so cheap?

Benchmarks like MMLU usually require only a single-letter answer, so completion tokens per example are tiny. Cost is dominated by the prompt, which includes the question and the answer choices. Open-ended generation evals cost far more per example.

Should I run the full benchmark every time?

Not necessarily. For iterative development, run a representative random subset to keep cost and latency low, then run the full benchmark before a release. This tool makes it easy to compare the cost of a 500-example sample versus the full 14,000-example set.
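As a rough illustration of drawing such a subset, here is a minimal Python sketch. The function name sample_subset, the fixed seed and the stand-in dataset are assumptions made for this example, not part of the tool.

```python
import random


def sample_subset(examples, k, seed=0):
    """Return k examples drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)   # local RNG, so the global random state is untouched
    k = min(k, len(examples))   # never ask for more examples than the dataset holds
    return rng.sample(examples, k)


# Stand-in for a list of 14,000 example records (hypothetical data).
full_set = list(range(14_000))
subset = sample_subset(full_set, 500)
print(len(subset), "of", len(full_set), "examples selected")
```

Fixing the seed means the same 500 examples are reused across iterations, so score changes between runs reflect the model rather than the sample.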

Is my data sent anywhere?

No. The estimator runs entirely in your browser. Nothing you enter is uploaded, stored or logged.

What is the LLM Evaluation Run Cost Estimator?

The LLM Evaluation Run Cost Estimator is a free tool for estimating what an LLM eval will cost. Enter your benchmark size (MMLU, HellaSwag or a custom eval), average prompt and completion tokens, and an optional LLM-as-judge model to see the total token cost of one full evaluation run. It runs in your browser on Gera Tools, with nothing uploaded.

LLM Evaluation Run Cost Estimator

Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

LLM evaluation run cost estimator

Running a benchmark like MMLU, HellaSwag, GSM8K or a custom eval suite means thousands of API calls, and the bill adds up fast, especially with LLM-as-judge grading, which can roughly double the token volume. This estimator tells you what one full evaluation run will cost before you launch it, so you can budget, sample sensibly and compare models on an even footing.

Types of eval and their cost profiles

Not all evals cost the same. Understanding the difference helps you plan:

Multiple-choice benchmarks (MMLU, MMLU-Pro, GPQA): Each example sends a question plus answer choices as the prompt, and the model generates a single letter (A to D for MMLU and GPQA; MMLU-Pro offers up to ten options). When the model is asked for a direct answer, completion tokens are tiny, often 1–5 per example. Cost is dominated by the prompt, which includes the question and all options. These benchmarks are inexpensive at scale.

Open-ended generation evals (TruthfulQA, AlpacaEval, MT-Bench): The model generates a full response, sometimes many paragraphs. Completion tokens are large. At model prices where output costs several times more than input, these evals cost substantially more per example.

Chain-of-thought evals (GSM8K CoT, MATH, BIG-Bench Hard): The model is prompted to reason step by step before answering. This produces long completions (the reasoning chain plus the answer), which drives cost up significantly compared to direct-answer versions of the same benchmark.

Custom evals: Your own test set. Cost depends on your prompt design and expected answer length. The estimator lets you enter these directly.
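To make the contrast between these profiles concrete, the following Python sketch splits the per-example cost into its input and output parts. All token counts and prices here are hypothetical values chosen only to show the shape of each profile; real benchmarks and models will differ.

```python
def per_example_cost(prompt_tokens, completion_tokens, input_price, output_price):
    """Return (input_cost, output_cost) in dollars for one example.

    Prices are in dollars per million tokens.
    """
    return (prompt_tokens / 1e6 * input_price,
            completion_tokens / 1e6 * output_price)


# Hypothetical prices ($ per million tokens) and token counts per profile.
INPUT_PRICE, OUTPUT_PRICE = 2.0, 8.0
profiles = {
    "multiple-choice": (500, 2),      # long prompt, single-letter answer
    "open-ended": (300, 600),         # short prompt, long free-form answer
    "chain-of-thought": (500, 400),   # reasoning chain plus final answer
}

output_share = {}
for name, (prompt, completion) in profiles.items():
    cost_in, cost_out = per_example_cost(prompt, completion, INPUT_PRICE, OUTPUT_PRICE)
    output_share[name] = cost_out / (cost_in + cost_out)
    print(f"{name}: ${cost_in + cost_out:.6f} per example, "
          f"{output_share[name]:.0%} of it from output tokens")
```

Under these assumed numbers, almost all of the multiple-choice cost comes from the prompt, while output tokens account for most of the cost in the other two profiles.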

How it works

Each example costs the candidate model its prompt tokens at the input rate plus its completion tokens at the output rate. Across the whole dataset:

candidate_cost = examples × [ (prompt_tokens / 1e6) × input_price
                            + (completion_tokens / 1e6) × output_price ]
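The formula above translates directly into a few lines of Python. This is a minimal sketch: the function and parameter names mirror the formula, prices are assumed to be in dollars per million tokens, and the sample inputs are hypothetical.

```python
def candidate_cost(examples, prompt_tokens, completion_tokens, input_price, output_price):
    """Cost in dollars of running the candidate model over the whole dataset.

    prompt_tokens and completion_tokens are per-example averages;
    input_price and output_price are dollars per million tokens.
    """
    per_example = ((prompt_tokens / 1e6) * input_price
                   + (completion_tokens / 1e6) * output_price)
    return examples * per_example


# Hypothetical multiple-choice run: 14,000 examples, 500-token prompts,
# 2-token answers, $2 input and $8 output per million tokens.
cost = candidate_cost(14_000, 500, 2, 2.0, 8.0)
print(f"Estimated candidate cost: ${cost:.2f}")
```

Because the formula is linear in the number of examples, halving the dataset halves the estimate.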

If you grade with LLM-as-judge, the judge re-reads each example’s prompt, the candidate’s answer and your rubric, then emits a short verdict. That adds a second pass over roughly the same content, which is why judge-graded evals can cost as much as the candidate run itself.
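The judge pass can be sketched the same way. In this Python example the judge's input is taken to be the prompt, the candidate's answer and the rubric, and its output is the verdict. The rubric and verdict lengths and the judge prices are assumptions made for illustration; the tool's own inputs may be organised differently.

```python
def judge_cost(examples, prompt_tokens, answer_tokens, rubric_tokens,
               verdict_tokens, judge_input_price, judge_output_price):
    """Cost in dollars of grading every example with an LLM judge.

    The judge reads prompt + candidate answer + rubric as input and
    writes a short verdict. Prices are dollars per million tokens.
    """
    judge_input_tokens = prompt_tokens + answer_tokens + rubric_tokens
    per_example = ((judge_input_tokens / 1e6) * judge_input_price
                   + (verdict_tokens / 1e6) * judge_output_price)
    return examples * per_example


# Hypothetical run: 5,000 examples, 600-token prompts, 400-token answers,
# an assumed 300-token rubric and 50-token verdict, judge priced at $2 / $8.
grading = judge_cost(5_000, 600, 400, 300, 50, 2.0, 8.0)
print(f"Estimated judge cost: ${grading:.2f}")
```

Note that the candidate's answer is billed at the judge's input rate here, so with identical prices the judge pass tends to come out somewhat cheaper than the candidate run; a more expensive judge model or a long rubric can bring it level or push it higher.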

Worked example

Suppose you want to run a 5,000-example custom eval on a model with a prompt of 600 tokens and expected completion of 400 tokens, then grade with an LLM judge.

At illustrative prices of $2 per million input tokens and $8 per million output tokens:

Candidate: 5,000 × [(600/1M × $2) + (400/1M × $8)] = 5,000 × [$0.0012 + $0.0032] = $22
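Judge pass (roughly similar token volume): the judge reads the 600-token prompt, the 400-token answer and your rubric, then writes a short verdict. What this adds depends on the judge model's prices and the rubric length; it can approach the candidate cost.
Estimated total: ~$40–50 for one full graded run, if the judge pass costs about as much as the candidate run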

Running a 500-example random subset first costs one-tenth of that, because cost scales linearly with the number of examples. It is a sensible first step before committing to the full run.

Tips for cost-efficient evaluation

For day-to-day iteration, run a random subset of a few hundred examples to keep cost and latency low, then run the full suite only before a release.
If you use LLM-as-judge, a smaller, cheaper judge model is often sufficient for relative comparisons — reserve an expensive judge for final scoring.
Always log the exact dataset version and model snapshot alongside the cost so your eval numbers stay reproducible.
Compare models at the same temperature and sampling settings; otherwise cost and quality differences are confounded.
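
One way to keep those details together is to write a small record per run. The Python sketch below is only an illustration: the EvalRunRecord fields, the helper to_log_line and the example values (including the dataset and model names) are hypothetical, not a format the tool produces.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalRunRecord:
    """Settings worth logging next to an eval score (hypothetical schema)."""
    dataset_version: str
    model_snapshot: str
    temperature: float
    examples: int
    estimated_cost_usd: float


def to_log_line(record):
    """Serialize a record as one JSON line with keys in a stable order."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values for illustration only.
record = EvalRunRecord(
    dataset_version="custom-eval-v3",
    model_snapshot="example-model-2024-01-01",
    temperature=0.0,
    examples=5_000,
    estimated_cost_usd=22.0,
)
print(to_log_line(record))
```

Appending one such line per run to a log file makes it straightforward to check later that two results were produced under the same settings.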