LLM Eval Score Aggregator

Aggregate LLM evaluation scores across runs and rubric dimensions with one paste.

Once you have run an eval across many examples and several rubric dimensions, you need to collapse all those rows into numbers you can compare. This tool takes your CSV or JSON eval results and computes, for each dimension, the mean, median, standard deviation, min, max, and pass rate. That is the summary table you would otherwise build by hand in a spreadsheet.

How it works

The input is detected as JSON if it parses as an array of objects; otherwise it is treated as CSV with a header row and parsed with quote-aware rules. Every column whose values parse as numbers becomes a dimension. For each dimension the tool computes count, mean, median (middle value of the sorted set), population standard deviation, min, max, and the fraction of rows at or above your pass threshold. Everything runs locally, so you can iterate on raw eval dumps without uploading them anywhere.
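
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual code: the function names (`parse_rows`, `to_number`, `aggregate`), the default threshold of 0.8, the handling of empty cells, and the averaging of the two middle values for an even-length median are all choices made for this example.

```python
import csv
import io
import json
import math
import statistics


def parse_rows(text):
    """Return a list of dicts from a JSON array of objects, else from CSV."""
    try:
        data = json.loads(text)
        if isinstance(data, list) and data and all(isinstance(r, dict) for r in data):
            return data
    except json.JSONDecodeError:
        pass
    # Fall back to CSV with a header row; csv.DictReader handles quoted fields.
    return list(csv.DictReader(io.StringIO(text.strip())))


def to_number(value):
    """Convert a cell to a finite float, or return None if it is not numeric."""
    if value is None or isinstance(value, bool):
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def aggregate(text, threshold=0.8):
    """Summarize every numeric column; `threshold` is the pass cutoff."""
    rows = parse_rows(text)
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    summary = {}
    for col in columns:
        # Assumption: empty cells are skipped, and a column counts as a
        # dimension only if every remaining cell parses as a number.
        cells = [row.get(col) for row in rows]
        present = [c for c in cells if c not in (None, "")]
        nums = [to_number(c) for c in present]
        if not nums or any(n is None for n in nums):
            continue
        summary[col] = {
            "count": len(nums),
            "mean": statistics.fmean(nums),
            # Assumption: an even count averages the two middle values.
            "median": statistics.median(nums),
            "std": statistics.pstdev(nums),  # population standard deviation
            "min": min(nums),
            "max": max(nums),
            "pass_rate": sum(n >= threshold for n in nums) / len(nums),
        }
    return summary
```

Calling `aggregate(pasted_text, threshold=0.8)` returns one summary dict per numeric column. Text columns such as an example ID or variant name are skipped because their values do not parse as numbers.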

Tips and notes

  • One row per example, one column per dimension. That shape gives the cleanest aggregation. To compare variants, paste each run separately; a text column naming the variant keeps the dumps easy to tell apart and, being non-numeric, is not treated as a dimension.
  • Mind the std dev. A high standard deviation next to a decent mean means the model is inconsistent — often more actionable than the average alone.
  • Pick the threshold deliberately. A binary pass/fail at, say, 0.8 frames the results differently than a continuous mean; report both when sharing.
  • Compare like with like. Aggregate each variant over the same example set so differences reflect the variant, not the sample.
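
To make the tips on standard deviation and thresholds concrete, here is a small sketch with made-up scores for two hypothetical variants evaluated on the same five examples. The variant names, the scores, and the `summarize` helper are invented for illustration. Both variants have the same mean, yet their spread and pass rate differ sharply.

```python
import statistics


def summarize(scores, threshold=0.8):
    """Mean, population std dev, and pass rate for one dimension."""
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
    }


# Hypothetical scores: the same five examples under two variants.
steady = [0.75, 0.875, 0.625, 0.75, 0.75]
erratic = [1.0, 1.0, 0.25, 0.5, 1.0]

steady_summary = summarize(steady)    # mean 0.75, std about 0.08, pass rate 0.2
erratic_summary = summarize(erratic)  # mean 0.75, std about 0.32, pass rate 0.6

for name, result in (("steady", steady_summary), ("erratic", erratic_summary)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

On the mean alone the two variants are indistinguishable. The pass rate at 0.8 favors the erratic variant, while the standard deviation shows it fails badly on some examples, which is why reporting all three gives a fuller picture.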