Price-per-Quality-Point Calculator

Normalize LLM cost by benchmark performance to find the best value


Raw token prices are misleading: a model that costs half as much but scores far lower is not a bargain. This calculator normalizes price by benchmark performance, ranking models by the cost of each quality point so you can compare on true value.

How it works

For each model the tool computes a blended price — input and output rates combined with a configurable weighting (default 3:1 input-to-output, typical of chat workloads). It then divides that blended price by your chosen benchmark score:

price per quality point = blended_price_per_million / benchmark_score

A lower result means you pay less per point of capability. Pick MMLU (knowledge), HumanEval (coding), MATH (reasoning), or a composite average of all three. Models are re-ranked instantly, cheapest-per-point first.
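
The calculation above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual source: the function names are hypothetical, the composite is assumed to be an unweighted mean of the three benchmark scores, and the rates in the usage lines are made up for the arithmetic.

```python
def blended_price(input_per_million, output_per_million,
                  input_weight=3.0, output_weight=1.0):
    """Weighted average of input and output rates (USD per 1M tokens).

    The default 3:1 weighting gives 0.75 * input + 0.25 * output.
    """
    total = input_weight + output_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return (input_weight * input_per_million
            + output_weight * output_per_million) / total


def composite_score(mmlu, humaneval, math_score):
    """Assumption: the composite is a plain mean of the three benchmarks."""
    return (mmlu + humaneval + math_score) / 3


def price_per_point(blended_per_million, benchmark_score):
    """Cost of one benchmark point; lower is better value."""
    if benchmark_score <= 0:
        raise ValueError("benchmark score must be positive")
    return blended_per_million / benchmark_score


# Hypothetical rates: $1.00 input and $5.00 output per 1M tokens.
blended = blended_price(1.00, 5.00)      # 0.75 * 1.00 + 0.25 * 5.00
score = composite_score(90.0, 80.0, 70.0)
print(blended, score, price_per_point(blended, score))
```

With these made-up numbers the blended price is $2.00 per million tokens and the composite is 80, so each quality point costs $0.025.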

Worked example

The built-in models on the composite metric, blended at the default 3:1 input-to-output weight (blended = 0.75 × input + 0.25 × output):

Model            Blended $/1M tokens   Composite score   $ per point
Budget (small)   $0.26                 63.3              0.0041
Open-weights     $0.33                 76.0              0.0043
Mid-tier         $2.00                 80.7              0.0248
Frontier         $8.75                 90.0              0.0972

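The budget model wins on value (lowest cost per point), but only if a composite score of 63.3 clears your quality floor. If your task needs 80 or higher, the mid-tier model is the real value leader: the budget and open-weights models (63.3 and 76.0) fall below the floor, and the mid-tier model costs far less per point than the frontier model (0.0248 vs. 0.0972).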
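
The effect of a quality floor can be reproduced with the figures from the worked example. The sketch below is an assumption about how such a filter might look, not a feature of the calculator itself; `rank_by_value` and `min_score` are hypothetical names.

```python
# (model, blended USD per 1M tokens, composite score) from the worked example
MODELS = [
    ("Budget (small)", 0.26, 63.3),
    ("Open-weights", 0.33, 76.0),
    ("Mid-tier", 2.00, 80.7),
    ("Frontier", 8.75, 90.0),
]


def rank_by_value(models, min_score=0.0):
    """Drop models below the quality floor, then sort cheapest-per-point first."""
    eligible = [
        (name, price / score)
        for name, price, score in models
        if score >= min_score
    ]
    return sorted(eligible, key=lambda row: row[1])


# No floor: every model is ranked.
for name, cost in rank_by_value(MODELS):
    print(f"{name:15s} {cost:.4f}")

# Floor of 80: only the mid-tier and frontier models remain.
for name, cost in rank_by_value(MODELS, min_score=80):
    print(f"{name:15s} {cost:.4f}")
```

Without a floor the budget model ranks first; with a floor of 80 the list shrinks to two models and the mid-tier model leads.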

Tips

  • Set a minimum acceptable score mentally before reading the ranking; value-per-point only matters above your quality floor.
  • Adjust the input-output weighting to match your traffic — output-heavy workloads make output-expensive models look worse.
  • Use this to shortlist, then confirm with your own evals and the LLM API Cost Calculator for absolute monthly spend.
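
To see why the weighting tip matters, here is a small sketch with two hypothetical models, A and B, that share the same score but price input and output differently. The rates and the score are invented purely for illustration, and the helper names are not part of the calculator.

```python
# Hypothetical (input, output) rates in USD per 1M tokens; both models score 80.
RATES = {
    "A": (2.00, 4.00),   # balanced pricing
    "B": (0.50, 8.00),   # cheap input, expensive output
}
SCORE = 80.0


def cost_per_point(input_rate, output_rate, input_weight, output_weight):
    """Blend the two rates by the given weights, then divide by the score."""
    total = input_weight + output_weight
    blended = (input_weight * input_rate + output_weight * output_rate) / total
    return blended / SCORE


def cheapest(input_weight, output_weight):
    """Name of the model with the lowest cost per point under this weighting."""
    return min(
        RATES,
        key=lambda name: cost_per_point(*RATES[name], input_weight, output_weight),
    )


print(cheapest(3, 1))  # input-heavy traffic (the default 3:1 weighting)
print(cheapest(1, 3))  # output-heavy traffic
```

Under the default 3:1 weighting model B comes out cheaper per point, but at 1:3 its expensive output rate dominates and model A takes the lead.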