LLM evaluation run cost estimator
Running a benchmark like MMLU, HellaSwag, GSM8K or a custom eval suite means thousands of API calls, and the bill adds up fast, especially with LLM-as-judge grading, which can roughly double the token volume. This estimator tells you what one full evaluation run will cost before you launch it, so you can budget, sample sensibly and compare models on an even footing.
How it works
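For each example, the candidate model is billed for its prompt tokens at the input rate plus its completion tokens at the output rate. With prices quoted per million tokens and token counts taken as per-example averages, the cost across the whole dataset is: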
candidate_cost = examples × [ (prompt_tokens / 1e6) × input_price
+ (completion_tokens / 1e6) × output_price ]
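As a minimal sketch, the formula above translates directly into a small Python function. The function name, the dataset size, the token counts and the prices below are illustrative assumptions, not figures for any real model or benchmark.

```python
def candidate_cost(examples, prompt_tokens, completion_tokens,
                   input_price, output_price):
    """Estimated cost of one candidate run.

    prompt_tokens and completion_tokens are per-example averages;
    input_price and output_price are per million tokens.
    """
    per_example = ((prompt_tokens / 1e6) * input_price
                   + (completion_tokens / 1e6) * output_price)
    return examples * per_example


# Hypothetical run: 10,000 examples, 500 prompt tokens and 200 completion
# tokens each, priced at 3.00 (input) and 15.00 (output) per million tokens.
cost = candidate_cost(
    examples=10_000,
    prompt_tokens=500,
    completion_tokens=200,
    input_price=3.0,
    output_price=15.0,
)
print(f"Estimated candidate cost: ${cost:.2f}")  # $45.00
```

Note that in this made-up example the completions account for two thirds of the bill despite being shorter than the prompts, because the output rate is higher.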
If you grade with LLM-as-judge, the judge re-reads each example’s prompt, the candidate’s answer and your rubric, then emits a short verdict. That adds a second pass over roughly the same content, which is why judge-graded evals can cost as much as the candidate run itself.
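The judge pass can be estimated the same way. The sketch below assumes the judge's input is the sum of the original prompt, the candidate's completion and the rubric, and that its output is a short verdict; the function name, the rubric and verdict lengths, and the prices are hypothetical values chosen for illustration.

```python
def judge_cost(examples, prompt_tokens, completion_tokens, rubric_tokens,
               verdict_tokens, judge_input_price, judge_output_price):
    """Estimated cost of grading one run with an LLM judge.

    Assumption: the judge reads prompt + candidate answer + rubric as input
    and writes only the verdict. Prices are per million tokens.
    """
    judge_input_tokens = prompt_tokens + completion_tokens + rubric_tokens
    per_example = ((judge_input_tokens / 1e6) * judge_input_price
                   + (verdict_tokens / 1e6) * judge_output_price)
    return examples * per_example


# Hypothetical run: same 10,000 examples as a 500/200-token candidate run,
# plus a 300-token rubric and a 50-token verdict, judged at the same prices.
grading = judge_cost(
    examples=10_000,
    prompt_tokens=500,
    completion_tokens=200,
    rubric_tokens=300,
    verdict_tokens=50,
    judge_input_price=3.0,
    judge_output_price=15.0,
)
print(f"Estimated judge cost: ${grading:.2f}")  # $37.50
```

With these assumed numbers the judge pass comes to 37.50 against 45.00 for the candidate run, so grading nearly doubles the total even though the verdict itself is short.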
Tips and notes
Multiple-choice benchmarks are cheap because the answer is typically a single token; open-ended generation and chain-of-thought evals are far pricier because completions are long. For day-to-day iteration, run a random subset of a few hundred examples to keep cost and latency low, then run the full suite only before a release. If you use LLM-as-judge, a smaller, cheaper judge model is often sufficient for relative comparisons, so reserve an expensive judge for final scoring. Always log the exact dataset version and model snapshot alongside the cost so your eval numbers stay reproducible.
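One way to apply the subset tip is to draw the sample with a fixed seed, so every iteration grades the same examples, and to scale the full-run estimate by the sampled fraction. This is a sketch only: the helper name, the seed, the dataset size and the full-run cost are assumptions, and linear scaling holds only if the subset's average token counts match the full dataset's.

```python
import random


def sample_subset(example_ids, k, seed=0):
    """Return up to k example IDs, chosen reproducibly with a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(example_ids, min(k, len(example_ids)))


# Hypothetical dataset of 10,000 examples identified by index.
ids = list(range(10_000))
subset = sample_subset(ids, 300)

# Hypothetical full-run estimate, scaled by the fraction of examples sampled.
full_cost = 45.0
subset_cost = full_cost * len(subset) / len(ids)
print(f"{len(subset)} examples, estimated cost: ${subset_cost:.2f}")
```

Recording the seed next to the dataset version and model snapshot keeps the subset itself reproducible.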