What format should my examples be in?

Either a JSON array of objects with input and output keys, or plain text where each example is a block separated by a blank line. The tool detects JSON first and falls back to text parsing.
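As an illustration of the JSON-first, text-fallback order, here is a minimal Python sketch. It is not the tool's actual code: the function name `parse_examples` is hypothetical, and the rule that a text block's first line is the input and the remaining lines are the output is an assumption, since the page does not say how a text block is split.

```python
import json
import re


def parse_examples(raw: str) -> list[dict[str, str]]:
    """Parse pasted examples: try JSON first, then fall back to text blocks."""
    text = raw.strip()
    try:
        data = json.loads(text)
        # Accept only a list of objects that carry both expected keys.
        if isinstance(data, list) and all(
            isinstance(item, dict) and "input" in item and "output" in item
            for item in data
        ):
            return [{"input": str(d["input"]), "output": str(d["output"])} for d in data]
    except json.JSONDecodeError:
        pass  # not JSON, so treat the paste as plain text

    examples = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        # Assumption: first line is the input, the rest is the output.
        examples.append({"input": lines[0], "output": " ".join(lines[1:])})
    return examples
```

Valid JSON that is not a list of input/output objects also falls through to text parsing in this sketch, which is one possible design choice among several.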

How is the diversity score calculated?

It measures how lexically distinct your inputs are from one another using token overlap. Near-duplicate inputs lower the score because they teach the model little beyond a single pattern.
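One common way to turn token overlap into a score is Jaccard similarity. The Python sketch below assumes that metric and a simple lowercase word tokenizer; the page does not state which overlap measure or tokenizer the tool uses, so the names `tokenize`, `jaccard`, and `diversity_score` are hypothetical.

```python
import re
from itertools import combinations


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared tokens divided by all tokens; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def diversity_score(inputs: list[str]) -> float:
    """Return 1 minus the mean pairwise Jaccard similarity, in [0, 1]."""
    pairs = list(combinations([tokenize(s) for s in inputs], 2))
    if not pairs:
        return 0.0  # assumption: fewer than two inputs cannot show diversity
    mean_similarity = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity
```

Under this sketch, "is this review positive" and "is this review negative" share three of five distinct tokens, so the pair scores only 0.4 on diversity.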

Why does label balance matter?

For classification-style tasks, skewed output labels bias the model toward the majority class. The balance subscore flags when one output value dominates so you can add counter-examples.
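A dominance check of this kind can be sketched in a few lines of Python. The function name `label_balance` and the 0.6 threshold are assumptions chosen for illustration; the page does not document the cutoff the tool applies.

```python
from collections import Counter


def label_balance(outputs: list[str], threshold: float = 0.6) -> dict:
    """Report the most frequent output value and whether it dominates the set."""
    # Normalize case and whitespace so "Positive " and "positive" count together.
    counts = Counter(o.strip().lower() for o in outputs)
    if not counts:
        return {"dominant": None, "share": 0.0, "flagged": False}
    label, n = counts.most_common(1)[0]
    share = n / len(outputs)
    # Hypothetical rule: flag when one value exceeds the threshold share.
    return {"dominant": label, "share": share, "flagged": share > threshold}
```

With four "positive" outputs and one "negative", the dominant share is 0.8 and the set is flagged; an even yes/no split is not.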

How many few-shot examples should I use?

Three to eight is typical. Below three the model has little to generalize from; above eight you spend tokens with diminishing returns and risk crowding the context window.

Does this send my examples anywhere?

No. All scoring runs locally in your browser with simple text statistics. Nothing is uploaded or stored.

What is the Few-Shot Example Quality Rater?

The Few-Shot Example Quality Rater analyzes a set of few-shot examples and scores them on input diversity, output variety, label balance, and length consistency so you can curate a stronger, more reliable example set for your prompts. It runs free in your browser on Gera Tools, with nothing uploaded.

Few-Shot Example Quality Rater

Name: Few-Shot Example Quality Rater
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Few-shot prompting only works when the examples actually teach the model something. A set of near-identical inputs, lopsided labels, or examples that skip the hard cases will leave the model guessing on real inputs. This rater analyzes your example set and gives you four subscores — input diversity, output variety, label balance, and length consistency — plus an overall grade and specific suggestions, so you can curate examples that generalize instead of ones that just look complete.

How it works

Paste your examples as a JSON array ([{ "input": "...", "output": "..." }]) or as plain-text blocks separated by blank lines. The tool tokenizes each input and measures pairwise overlap to estimate diversity — sets where every input shares most of its words score low because they only demonstrate one pattern. It does the same for outputs, tallies how often each distinct output value appears to judge label balance, and checks whether output lengths swing wildly. Everything runs locally in your browser; no model call is needed.
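The length check described above can be approximated with a coefficient of variation over output word counts. This Python sketch is an assumption about how "swing wildly" might be quantified, not the tool's documented formula, and `length_consistency` is a hypothetical name.

```python
from statistics import mean, pstdev


def length_consistency(outputs: list[str]) -> float:
    """Return a score in [0, 1]; 1.0 means every output has the same word count."""
    lengths = [len(o.split()) for o in outputs]
    if len(lengths) < 2 or mean(lengths) == 0:
        return 1.0  # nothing to compare, so treat as consistent
    # Coefficient of variation: spread of lengths relative to their mean.
    cv = pstdev(lengths) / mean(lengths)
    return max(0.0, 1.0 - cv)
```

A set of one-word labels scores 1.0, while mixing a one-word output with a fifty-word paragraph drops the score close to zero.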

Tips and examples

Spread your inputs. If three of five examples start with the same phrasing, replace one with a structurally different case.
Include an edge case. Add at least one example covering the tricky or empty input — that single example often prevents the most common failures.
Balance the labels. For yes/no or category tasks, aim for roughly even representation unless the real distribution is genuinely skewed.
Keep outputs consistent in shape. If one output is a single word and another is a paragraph, the model gets a mixed signal about expected verbosity.

Why bad few-shot examples are worse than no examples

Zero-shot prompting relies entirely on the model’s pre-trained knowledge of the task. That is often sufficient for well-specified, common tasks. Few-shot prompting is intended to override or refine that default behavior by demonstration — but it only works when the examples actually model the target behavior. Problematic example sets can actively degrade performance:

Near-duplicate inputs demonstrate only a narrow input range, leaving the model poorly calibrated on inputs that look even slightly different. In effect, it learns to do well on that one pattern and little else.
Unbalanced labels in classification tasks produce a strong mode bias. If four of five examples have the label “positive,” the model learns that the base rate is 80% positive and defaults there under ambiguity.
Inconsistent output format causes the model to oscillate between formats (sometimes a JSON object, sometimes a prose sentence) depending on which examples most influence a given call.
Only easy examples leave the model unprepared for the hard cases that matter most in production. The error rate on edge cases stays at baseline even though average performance looks fine.

What a strong few-shot set looks like

A well-curated set of three to six examples typically has:

One straightforward example that establishes the basic task format clearly.
One edge case covering the tricky or unusual input that would otherwise produce a wrong answer.
Balanced labels (for classification) so no single class dominates.
Varied input phrasing across examples so the model does not pick up superficial cues about which phrasing style to respond to.
Consistent output structure — if the first output is a JSON object with reason and label keys, all outputs should follow that structure.

The rater’s subscores map directly onto these criteria: diversity catches the phrasing variety problem, label balance catches the skew problem, output variety catches the format inconsistency problem, and length consistency catches verbosity mismatch.
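To make the criteria concrete, here is a small illustrative set written as Python data, with a helper that checks every output shares the same JSON keys. The example texts, the sentiment task, and the helper name `output_key_sets` are all invented for illustration and do not come from the tool.

```python
import json
from collections import Counter

# Hypothetical sentiment examples: one plain positive, one plain negative,
# one sarcastic edge case, and one mixed review, with varied phrasing.
examples = [
    {"input": "Arrived a day early and works exactly as described!",
     "output": '{"reason": "praises delivery and function", "label": "positive"}'},
    {"input": "The checkout page crashed twice before my order went through.",
     "output": '{"reason": "reports repeated crashes", "label": "negative"}'},
    {"input": "Oh great, another update that deletes my settings.",
     "output": '{"reason": "sarcastic complaint about data loss", "label": "negative"}'},
    {"input": "Battery life is superb, though the case scratches easily.",
     "output": '{"reason": "main point is strong praise", "label": "positive"}'},
]


def output_key_sets(items: list[dict[str, str]]) -> set[frozenset[str]]:
    """Collect the distinct sets of JSON keys used across all outputs."""
    return {frozenset(json.loads(item["output"]).keys()) for item in items}


# One key set means every output follows the same structure.
consistent = len(output_key_sets(examples)) == 1
# Count labels to confirm no single class dominates.
label_counts = Counter(json.loads(e["output"])["label"] for e in examples)
```

Here every output uses the same reason and label keys, and the two labels appear twice each.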