What input formats are accepted?

A CSV with a header row, or a JSON array of flat objects. Columns that parse as numbers become scoring dimensions; non-numeric columns are ignored for the statistics.

How is standard deviation calculated?

It uses the population standard deviation — the square root of the mean of squared deviations from the mean. With a single row it is zero by definition.
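As a minimal sketch of that definition, the calculation could look like the following. Python is used only for illustration (the tool runs in the browser and its source is not shown here), and the function name `population_std` is an assumption.

```python
import math

def population_std(values):
    """Population standard deviation: square root of the mean squared deviation."""
    if not values:
        raise ValueError("need at least one value")
    mean = sum(values) / len(values)
    # Divide by n (population), not n - 1 (sample).
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

print(population_std([0.9]))  # a single row has no deviation from its own mean
```

Python's standard library offers the same calculation as `statistics.pstdev`; `statistics.stdev` is the sample version that divides by n - 1 and would give slightly larger values.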

What counts toward the pass rate?

For each dimension, the pass rate is the share of rows whose value is greater than or equal to your threshold. It is reported per dimension rather than as a single overall number, since rubrics usually score multiple things.
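A rough sketch of the per-dimension pass rate, assuming rows have already been parsed into dictionaries of numeric values. The function name `pass_rates` and the sample rows are illustrative, not taken from the tool.

```python
def pass_rates(rows, threshold):
    """Share of rows at or above `threshold`, computed separately per dimension."""
    rates = {}
    for dim in rows[0]:
        values = [row[dim] for row in rows]
        # ">=" means a score exactly equal to the threshold counts as a pass.
        passed = sum(1 for v in values if v >= threshold)
        rates[dim] = passed / len(values)
    return rates

rows = [
    {"accuracy": 0.9, "format": 1.0},
    {"accuracy": 0.6, "format": 1.0},
    {"accuracy": 0.85, "format": 0.5},
    {"accuracy": 0.7, "format": 1.0},
]
print(pass_rates(rows, 0.8))  # {'accuracy': 0.5, 'format': 0.75}
```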

Is my eval data sent anywhere?

No. Parsing and all statistics run in your browser. Nothing is uploaded.

What is the LLM Eval Score Aggregator?

Paste eval results as CSV or a JSON array and the tool computes mean, median, and standard deviation per scoring dimension, plus a per-dimension pass rate against a threshold you set. Everything runs client-side, so you can compare prompt or model variants without uploading anything. It runs free in your browser on Gera Tools.

LLM Eval Score Aggregator

Name: LLM Eval Score Aggregator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

LLM eval score aggregator

Once you have run an eval across many examples and several rubric dimensions, you need to collapse all those rows into numbers you can compare. This tool takes your CSV or JSON eval results and computes mean, median, standard deviation, min, max, and a pass rate per dimension — the summary table you would otherwise build by hand in a spreadsheet.

Why per-dimension statistics matter

A single overall score hides the shape of a model’s failure. Suppose you are evaluating a document summarisation system on three dimensions: factual accuracy, conciseness, and format compliance. A model might score a mean of 0.82 overall — but if accuracy is 0.65 and format compliance is 0.95, the average is misleading. The per-dimension breakdown tells you where to invest prompt engineering effort.

Standard deviation adds the next layer: a mean of 0.80 with a standard deviation of 0.02 means the model is reliably good; a mean of 0.80 with a standard deviation of 0.25 means quality swings widely from one example to the next, which matters just as much in production.

How it works

The input is detected as JSON if it parses as an array of objects; otherwise it is treated as CSV with a header row and quote-aware parsing. Every column whose values parse as numbers becomes a dimension. For each dimension the tool computes count, mean, median (middle value of the sorted set), population standard deviation, min, max, and the fraction of rows at or above your pass threshold. Everything runs locally, so you can iterate on raw eval dumps without uploading them anywhere.
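The steps above can be sketched roughly as follows. This is an illustration in Python, not the tool's actual code: the names `parse_rows`, `to_number`, and `summarize` are hypothetical, and two behaviours are assumptions rather than documented facts (a column counts as a dimension only if every value parses as a number, and the median of an even-sized set is the average of the two middle values).

```python
import csv
import io
import json
import statistics

def parse_rows(text):
    """Return a list of dicts: a JSON array of objects if it parses, else CSV."""
    try:
        data = json.loads(text)
        if isinstance(data, list) and data and all(isinstance(r, dict) for r in data):
            return data
    except json.JSONDecodeError:
        pass
    # csv.DictReader treats the first row as the header and handles quoted fields.
    return list(csv.DictReader(io.StringIO(text.strip())))

def to_number(value):
    """Return the value as a float, or None if it does not parse as a number."""
    if isinstance(value, bool):
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def summarize(text, threshold):
    rows = parse_rows(text)
    if not rows:
        return {}
    summary = {}
    for column in rows[0]:
        values = [to_number(row.get(column)) for row in rows]
        if any(v is None for v in values):
            continue  # assumption: skip columns with any non-numeric value
        summary[column] = {
            "count": len(values),
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            "std": statistics.pstdev(values),  # population standard deviation
            "min": min(values),
            "max": max(values),
            "pass_rate": sum(v >= threshold for v in values) / len(values),
        }
    return summary
```

Calling `summarize(pasted_text, 0.8)` returns one entry per numeric column, which maps directly onto the rows of a summary table.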

Example input and output

Paste a CSV like this:

example_id,accuracy,conciseness,format
1,0.9,0.75,1.0
2,0.6,0.82,1.0
3,0.85,0.68,0.5
4,0.7,0.90,1.0

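With a pass threshold of 0.8, the tool returns for each numeric column: count (4), mean, median, standard deviation, min, max, and pass rate. You can immediately see that format has a high mean (0.875) but also a wide spread, because one score of 0.5 sits far below the three scores of 1.0, while accuracy has a moderate mean of about 0.76. Note that example_id also parses as a number, so under the rule described above it would be treated as a dimension too; use non-numeric IDs if you want it left out.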
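To check the figures for the sample above by hand, a short script along these lines reproduces them. The helper name `describe` is illustrative, and averaging the two middle values for the median of four rows is an assumption about how the tool handles even-sized sets.

```python
import statistics

# The three score columns from the sample CSV above.
columns = {
    "accuracy": [0.9, 0.6, 0.85, 0.7],
    "conciseness": [0.75, 0.82, 0.68, 0.90],
    "format": [1.0, 1.0, 0.5, 1.0],
}
THRESHOLD = 0.8

def describe(values, threshold):
    """Summary statistics for one dimension."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
        "pass_rate": sum(v >= threshold for v in values) / len(values),
    }

for name, values in columns.items():
    stats = describe(values, THRESHOLD)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Rounded to two decimals, accuracy comes out at a mean of about 0.76 with a standard deviation of about 0.12, conciseness at about 0.79 and 0.08, and format at about 0.88 and 0.22. The pass rates at 0.8 are 0.5, 0.5, and 0.75 respectively.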

Tips for clean aggregation

One row per example, one column per dimension. That shape gives the cleanest aggregation; an extra column naming the prompt variant lets you paste two variants side by side and compare.
Mind the std dev. A high standard deviation next to a decent mean means the model is inconsistent — often more actionable than the average alone. Consistency at a moderate score can be more valuable in production than a high-but-noisy score.
Pick the threshold deliberately. A binary pass/fail at 0.8 frames the results differently than a continuous mean. Report both when sharing results with a team that needs to make a go/no-go decision.
Compare like with like. Aggregate each variant over the same example set so differences reflect the variant, not the sample. A larger sample smooths outliers and gives a steadier estimate, so a gap between variants scored on different sample sizes may be noise rather than a real difference.
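One way to enforce the last tip before pasting results is to keep only the examples that every variant was scored on. The sketch below is a pre-processing idea, not a feature of the tool; the column names `variant` and `example_id` and the function name `compare_on_shared_examples` are assumptions.

```python
from collections import defaultdict
import statistics

def compare_on_shared_examples(rows, dimension):
    """Mean of `dimension` per variant, using only examples every variant has."""
    by_variant = defaultdict(dict)
    for row in rows:
        by_variant[row["variant"]][row["example_id"]] = row[dimension]
    if not by_variant:
        return {}, 0
    # Keep only example IDs that appear under every variant.
    shared = set.intersection(*(set(scores) for scores in by_variant.values()))
    if not shared:
        return {}, 0
    means = {
        variant: statistics.fmean([scores[i] for i in shared])
        for variant, scores in by_variant.items()
    }
    return means, len(shared)

rows = [
    {"variant": "A", "example_id": 1, "accuracy": 0.5},
    {"variant": "A", "example_id": 2, "accuracy": 1.0},
    {"variant": "A", "example_id": 3, "accuracy": 0.0},
    {"variant": "B", "example_id": 1, "accuracy": 1.0},
    {"variant": "B", "example_id": 2, "accuracy": 1.0},
]
means, shared_count = compare_on_shared_examples(rows, "accuracy")
print(means, shared_count)  # {'A': 0.75, 'B': 1.0} 2
```

Example 3 is dropped because only variant A was scored on it, so both means are computed over the same two examples.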