What does price per quality point mean?

It is the blended cost per million tokens divided by the chosen benchmark score. A lower number means you pay less for each point of measured capability, so it normalizes raw price against how good the model actually is.

Why use a blended price?

Models price input and output tokens differently. A blended price (a weighted mix, often 3:1 input-to-output) gives a single comparable figure. You can adjust the weighting to reflect your real input-output ratio.

Which benchmark should I pick?

Use MMLU for general knowledge tasks, HumanEval for code generation, and MATH for multi-step reasoning. The composite average is a reasonable default when your workload mixes all three.

Are benchmark scores comparable across models?

Roughly. Benchmarks are imperfect and providers sometimes report scores under different test conditions, but for value ranking the relative ordering is informative. Always validate the top candidates on your own evals before committing.

Does cheapest-per-point mean I should always pick it?

Not necessarily. The cheapest-per-point model may still miss a quality floor your product needs. Use this tool to shortlist value leaders, then confirm they clear your minimum acceptable score on your tasks.

What is the Price-per-Quality-Point Calculator?

A free tool that divides each LLM's blended price by a composite benchmark score (MMLU, HumanEval, MATH) to rank models by quality delivered per dollar. Compare GPT, Claude, Gemini and more on a true value basis. It runs free in your browser on Gera Tools, with nothing uploaded.

Price-per-Quality-Point Calculator

Name: Price-per-Quality-Point Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Raw token prices are misleading: a model that costs half as much but scores far lower is not a bargain. This calculator normalizes price by benchmark performance, ranking models by the cost of each quality point so you can compare on true value.

How it works

For each model the tool computes a blended price — input and output rates combined with a configurable weighting (default 3:1 input-to-output, typical of chat workloads). It then divides that blended price by your chosen benchmark score:

price per quality point = blended_price_per_million / benchmark_score

A lower result means you pay less per point of capability. Pick MMLU (knowledge), HumanEval (coding), MATH (reasoning), or a composite average of all three. Models are re-ranked instantly, cheapest-per-point first.
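
The formula can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function names composite_score and price_per_quality_point are hypothetical, the composite is assumed to be an equally weighted average, and the sample benchmark scores are made up.

```python
def composite_score(mmlu: float, humaneval: float, math_score: float) -> float:
    """Plain average of the three benchmark scores (assumed equal weighting)."""
    return (mmlu + humaneval + math_score) / 3


def price_per_quality_point(blended_price_per_million: float,
                            benchmark_score: float) -> float:
    """Blended $ per 1M tokens divided by the chosen benchmark score."""
    if benchmark_score <= 0:
        raise ValueError("benchmark_score must be positive")
    return blended_price_per_million / benchmark_score


# Hypothetical model: $2.00 blended per 1M tokens, with made-up scores.
score = composite_score(mmlu=85.0, humaneval=80.0, math_score=75.0)
cost = price_per_quality_point(2.00, score)
print(f"composite={score:.1f}, $/point={cost:.4f}")
# prints: composite=80.0, $/point=0.0250
```

Passing a single benchmark score instead of the composite gives the per-benchmark ranking in the same way.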

Worked example

The built-in models on the composite metric, blended at the default 3:1 input-to-output weight (blended = 0.75 × input + 0.25 × output):

Model	Blended $/1M	Composite	$ / point
Budget (small)	$0.26	63.3	0.0041
Open-weights	$0.33	76.0	0.0043
Mid-tier	$2.00	80.7	0.0248
Frontier	$8.75	90.0	0.0972

The budget model wins on value (lowest cost per point) — but only if a composite of 63 clears your quality floor. If your task needs 80+, the mid-tier model is the real value leader because the cheaper models simply cannot do the job.
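
The same ranking, with and without a quality floor, can be reproduced as a short Python sketch. The blended prices and composite scores are copied from the table above; the rank_by_value helper and its quality_floor parameter are hypothetical names used only for illustration.

```python
# (name, blended $ per 1M tokens, composite score) from the worked example
MODELS = [
    ("Budget (small)", 0.26, 63.3),
    ("Open-weights", 0.33, 76.0),
    ("Mid-tier", 2.00, 80.7),
    ("Frontier", 8.75, 90.0),
]


def rank_by_value(models, quality_floor=0.0):
    """Drop models below the floor, then sort cheapest-per-point first."""
    eligible = [m for m in models if m[2] >= quality_floor]
    return sorted(eligible, key=lambda m: m[1] / m[2])


for floor in (0.0, 80.0):
    leader = rank_by_value(MODELS, quality_floor=floor)[0]
    print(f"floor={floor:.0f}: {leader[0]} at {leader[1] / leader[2]:.4f} $/point")
# prints:
# floor=0: Budget (small) at 0.0041 $/point
# floor=80: Mid-tier at 0.0248 $/point
```

Filtering before sorting is the point of the design: the floor decides which models are eligible at all, and value only ranks the ones that remain.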

Choosing the right benchmark for your workload

The three benchmarks measure different capabilities, and the one you pick should reflect what you actually need the model to do:

MMLU (Massive Multitask Language Understanding) tests factual knowledge across 57 academic domains — history, law, medicine, maths. Use this benchmark axis when your workload is primarily question-answering, retrieval augmentation, or summarisation tasks where domain recall matters.
HumanEval measures code generation: the model receives a function signature and docstring and must write working Python. Use this when you are picking a model for a coding assistant, code review tool, or automated test generation pipeline.
MATH tests multi-step mathematical reasoning across problem types. Use it if your task involves structured chain-of-thought, quantitative analysis, or step-by-step derivations.
Composite (the average of all three) is the best starting default for mixed workloads — customer support bots, document processing pipelines, or general-purpose assistants where you cannot predict the split.

What the blended price means

Providers price input tokens (the prompt you send) and output tokens (the response the model generates) separately, and usually at different rates. A 3:1 blended weight assumes three prompt tokens for every output token, which is typical of a RAG or classification workload. If you are building a long-form generation tool where responses are much longer than the prompt, shift the weighting toward output to get a fair comparison. The tool lets you adjust this ratio directly.
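
To see how much the ratio matters, here is a minimal Python sketch of the blending step. The function name blended_price, its parameters, and the $1.00 and $4.00 rates are assumptions chosen for illustration, not figures from any provider or from the tool itself.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted mix of input and output $ per 1M tokens.

    The default 3:1 ratio gives weights of 0.75 (input) and 0.25 (output).
    """
    total = input_parts + output_parts
    if total <= 0:
        raise ValueError("ratio parts must sum to a positive number")
    return (input_parts * input_price + output_parts * output_price) / total


# Hypothetical rates: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
prompt_heavy = blended_price(1.00, 4.00)            # 3:1 input-to-output
long_form = blended_price(1.00, 4.00, 1.0, 3.0)     # 1:3, output-heavy
print(f"3:1 -> ${prompt_heavy:.2f}   1:3 -> ${long_form:.2f}")
# prints: 3:1 -> $1.75   1:3 -> $3.25
```

With these made-up rates the same model looks nearly twice as expensive under an output-heavy ratio, which is why the weighting should match your workload before you compare cost per point.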

Practical tips

Set a quality floor before reading the ranking. The model with the lowest cost per point may still fail to handle your task reliably. Decide your minimum acceptable benchmark score first, then look at which model above that floor wins on value.
Prices change more often than benchmarks. Update the price fields to match each provider's current rate sheet; benchmark scores for established models are more stable.
Shortlist two or three candidates, then run your own evals. Standardised benchmarks are a starting filter, not a final verdict. Test top candidates on representative examples from your actual workload before committing.
Pair this tool with the LLM API Cost Calculator to translate the per-point ranking into a concrete monthly spend projection.