When you are choosing between models, or between prompt variants, eyeballing two walls of text rarely gives a clear answer. This tool lays up to five responses side by side, lets you define the quality dimensions that matter, and tallies your scores so the winner is obvious.
How it works
Paste each response and give it a label (the model name or prompt variant). Add rating dimensions (your own criteria, such as accuracy, tone, or completeness) and score every response from 0 to 5 on each. The tool sums the scores into a total for each response and shows objective stats, such as word and character counts, beside each total. Everything runs locally; nothing is uploaded.
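The tally described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tool's actual code: the `Response` class, the `rank` helper, and the sample labels and scores are all hypothetical, and counting words by splitting on whitespace is an assumption about how the word count is computed.

```python
from dataclasses import dataclass, field

MAX_RESPONSES = 5            # the tool compares up to five responses
MIN_SCORE, MAX_SCORE = 0, 5  # each dimension is scored from 0 to 5


@dataclass
class Response:
    """Hypothetical container for one pasted response and its manual scores."""
    label: str
    text: str
    scores: dict[str, int] = field(default_factory=dict)

    def rate(self, dimension: str, score: int) -> None:
        # Reject scores outside the 0-5 range.
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"score must be between {MIN_SCORE} and {MAX_SCORE}")
        self.scores[dimension] = score

    @property
    def total(self) -> int:
        # Per-response total: the plain sum over all dimensions.
        return sum(self.scores.values())

    @property
    def word_count(self) -> int:
        # Assumption: words are whitespace-separated tokens.
        return len(self.text.split())

    @property
    def char_count(self) -> int:
        return len(self.text)


def rank(responses: list[Response]) -> list[Response]:
    """Return responses ordered from highest to lowest total."""
    if len(responses) > MAX_RESPONSES:
        raise ValueError(f"at most {MAX_RESPONSES} responses can be compared")
    return sorted(responses, key=lambda r: r.total, reverse=True)


# Made-up sample data, for illustration only.
a = Response("model-a", "Paris is the capital of France.")
b = Response("model-b", "The capital is Paris, a city on the Seine.")
for dimension, score_a, score_b in [("accuracy", 5, 5), ("tone", 3, 4), ("completeness", 2, 4)]:
    a.rate(dimension, score_a)
    b.rate(dimension, score_b)

for r in rank([a, b]):
    print(r.label, r.total, r.word_count, r.char_count)
```

With these sample scores, `model-b` totals 13 and `model-a` totals 10, so `model-b` is listed first.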
Why score manually
Automated graders are useful but can be blind to nuance: a subtle factual error, an off-brand tone, a missing caveat. By keeping scoring in your hands, the tool captures domain judgement it could never infer, while still doing the tedious part of structuring the comparison and adding up the numbers.
Tips
- Decide your dimensions before reading the responses to avoid anchoring on the first one you like.
- Keep dimensions to three or four; with too many, the totals even out and every response scores about the same.
- Watch the word counts: a higher total score on a much longer answer may just reflect padding, not quality.
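
The word-count tip can be turned into a quick sanity check. The sketch below is one possible heuristic, not something the tool does: the function name `winner_may_be_padded`, the `(label, total, word_count)` tuple layout, and the default ratio of 1.5 are all assumptions chosen for illustration.

```python
def winner_may_be_padded(results, length_ratio=1.5):
    """Return True when the top-scoring response is much longer than the runner-up.

    `results` is a list of (label, total_score, word_count) tuples.
    `length_ratio` is an arbitrary, hypothetical threshold: the winner is
    flagged when its word count exceeds the runner-up's by this factor.
    """
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    if len(ranked) < 2:
        return False  # nothing to compare against
    top_words = ranked[0][2]
    runner_up_words = ranked[1][2]
    return top_words > length_ratio * runner_up_words


# Made-up numbers: model-b wins by one point but is more than twice as long.
close_call = [("model-a", 12, 180), ("model-b", 13, 420)]
print(winner_may_be_padded(close_call))

# Here the winner is only slightly longer, so no flag is raised.
clear_win = [("model-a", 13, 200), ("model-b", 12, 180)]
print(winner_may_be_padded(clear_win))
```

A flag is only a prompt to reread the longer answer; whether the extra length is padding remains a manual judgement.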