Why does few-shot diversity matter?

Models generalize from the examples you give them, so if every example uses the same structure or vocabulary, the model will latch onto that pattern and handle off-distribution inputs poorly. Diverse examples teach the model the task rather than one narrow phrasing of it.

How is similarity measured?

Each pair of examples is compared with Jaccard similarity over their word sets: the number of words the two examples share divided by the number of distinct words in both combined. A score near 1 means the two examples use almost identical words; near 0 means they barely overlap. Pairs that score above the adjustable flagging threshold are flagged.
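The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the checker's actual code: the function names, the lowercasing, and the word-matching pattern are assumptions made for the example.

```python
import re

def word_set(text: str) -> set[str]:
    """Return the distinct words in a text (assumption: case-insensitive)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Shared vocabulary divided by combined vocabulary."""
    words_a, words_b = word_set(a), word_set(b)
    union = words_a | words_b
    if not union:
        return 0.0  # assumption: two empty examples are scored as 0
    return len(words_a & words_b) / len(union)

# The two examples share 4 words out of 6 distinct words overall.
score = jaccard("Translate the cat to French", "Translate the dog to French")
print(round(score, 2))  # 0.67
```

Swapping a single noun leaves the score high, which is exactly the kind of pair the checker is meant to surface.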

What is a good diversity score?

There is no single right number, but you generally want low pairwise similarity (most pairs well under 0.5) and meaningful length variation. If two examples are nearly identical, one of them is wasting a slot in your prompt and may bias the model.

Does length variation matter too?

Yes. If all your examples are the same length the model may learn to always produce outputs of that length. Showing short and long examples teaches it to match the input rather than a fixed template, so the tool reports the length spread as well.

What is the Few-Shot Example Diversity Checker?

It analyzes lexical diversity, length variation, and pairwise similarity across your few-shot examples, and flags examples that are too alike so your prompt does not overfit the model to a single phrasing or pattern. It runs free in your browser on Gera Tools, with nothing uploaded.

Few-Shot Example Diversity Checker

Name: Few-Shot Example Diversity Checker
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Make sure your prompt examples actually teach the task

Few-shot prompting works because the model imitates the examples you provide. The failure mode is subtle: if your examples are all phrased the same way, the model learns the phrasing instead of the task and breaks on inputs that look different. This checker measures how varied your examples really are — across vocabulary, length, and pairwise overlap — so you can catch redundancy before it biases your prompt.

How it works

Each example is tokenized into words. The tool then computes three things. First, lexical diversity: the ratio of unique words to total words across all examples, which tells you how much vocabulary your set covers. Second, length variation: the spread between your shortest and longest example, since uniform lengths push the model toward fixed-length outputs. Third, and most usefully, pairwise Jaccard similarity: for every pair of examples it divides the shared vocabulary by the combined vocabulary and flags any pair above a similarity threshold as too alike.
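The three measurements can be sketched together as follows. This is an illustrative approximation under stated assumptions, not the tool's implementation: the tokenizer, the choice to measure length in words, the default threshold of 0.7, and the name diversity_report are all hypothetical.

```python
import re
from itertools import combinations

def tokenize(text):
    """Split a text into lowercase words (assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def diversity_report(examples, threshold=0.7):
    token_lists = [tokenize(e) for e in examples]

    # 1. Lexical diversity: unique words / total words across all examples.
    all_tokens = [t for tokens in token_lists for t in tokens]
    lexical_diversity = len(set(all_tokens)) / len(all_tokens) if all_tokens else 0.0

    # 2. Length variation: longest minus shortest example, measured in words
    #    (assumption: the real tool may measure characters instead).
    lengths = [len(tokens) for tokens in token_lists]
    length_spread = max(lengths) - min(lengths) if lengths else 0

    # 3. Pairwise Jaccard similarity; pairs above the threshold are flagged.
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(token_lists), 2):
        set_a, set_b = set(a), set(b)
        union = set_a | set_b
        score = len(set_a & set_b) / len(union) if union else 0.0
        if score > threshold:
            flagged.append((i, j, round(score, 2)))

    return {
        "lexical_diversity": lexical_diversity,
        "length_spread": length_spread,
        "flagged_pairs": flagged,
    }

examples = [
    "Translate the cat to French",
    "Translate the dog to French",
    "What is the capital of Peru? Answer in one word.",
]
print(diversity_report(examples, threshold=0.6))
```

With the threshold lowered to 0.6, only the first two examples are flagged as a pair; the third shares a single word with each of them.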

Tips and notes

Aim for examples that differ in structure, length, and wording while still demonstrating the same task. Two near-identical examples waste a slot and can nudge the model toward that exact pattern, so when a pair is flagged, replace one of them with a genuinely different case — ideally an edge case or a different input shape. Diversity is not the only goal: every example should still be correct and representative. Use this tool as a redundancy filter, not a correctness check. You can paste either plain blocks separated by blank lines or a JSON array of strings; both are parsed automatically.
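One plausible way to accept both input formats is sketched below. How the checker actually detects the format is not documented here, so the logic (try JSON first, fall back to blank-line blocks) and the name parse_examples are assumptions.

```python
import json
import re

def parse_examples(raw: str) -> list[str]:
    """Parse either a JSON array of strings or blank-line-separated blocks."""
    text = raw.strip()

    # Try JSON first when the input looks like an array.
    if text.startswith("["):
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, list) and all(isinstance(x, str) for x in data):
            return [x.strip() for x in data if x.strip()]

    # Otherwise treat runs of one or more blank lines as separators.
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]

print(parse_examples('["first example", "second example"]'))
print(parse_examples("first example\n\nsecond example"))
```

Input that starts with a bracket but is not valid JSON falls through to the plain-block path, so nothing is silently discarded.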

Why few-shot diversity is harder than it looks

When writing few-shot examples for a prompt, it is natural to start with the clearest and most straightforward case and then add a second and third that are slight variations of it. The result is a set that looks diverse on the surface — three different inputs, three different outputs — but is actually highly redundant: the same vocabulary, the same sentence length, the same structure, just different nouns.

This is a problem because the model has most likely already seen the task framing many times in training, so the examples you provide tell it not just “this is the task” but “this is what inputs to this task look like.” If all your examples use polite, formal, medium-length requests, the model will tend to handle informal or very short requests less well, even though those are equally valid inputs.

What to look for and how to fix flagged pairs

When the checker flags a pair as too similar, the question is not just “are these the same?” but “do these examples teach the same thing?” Two examples that are lexically similar may teach genuinely different subtleties (different entity types, different edge cases); two that are lexically different may demonstrate exactly the same handling. Use the pairwise score as a prompt to inspect the pair, not as an automatic rejection.

Common ways to improve diversity in a flagged set:

Vary input length. Include one very short and one verbose example of the same task type. Models tend to calibrate output length to the examples they see, so a set of uniformly medium-length examples tends to produce uniformly medium-length outputs.
Vary domain vocabulary. If all your examples come from one domain or register (all medical, all technical, all formal business writing), add examples from a different one, even if the task is domain-specific.
Include an edge case. A boundary condition — an empty input, an unusually structured one, one where the correct answer is “I don’t know” or “not applicable” — teaches the model what to do when the input is unusual. Without it, the model interpolates from the clean examples and can produce confident-sounding wrong answers at the boundary.
Vary the polarity. For classification or sentiment tasks, make sure positive, negative, and neutral examples are all present and roughly balanced.

Jaccard similarity: what the score means

Jaccard similarity between two texts is defined as the size of the intersection of their word sets divided by the size of the union. A score of 1.0 means the two examples use exactly the same vocabulary; 0.0 means they share no words at all. In practice:

Above 0.7: the two examples are almost certainly redundant. Inspect the pair and, in most cases, replace one of them.
0.4 to 0.7: similar, but the examples may teach different things. Inspect manually.
Below 0.4: healthy diversity; the examples are meaningfully distinct.

These thresholds are rough guides. The checker allows you to adjust the flagging threshold if your task requires more or less strictness.
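As a rough illustration, the bands above could be applied to a score like this. The function name, the band labels, and the handling of scores that fall exactly on a boundary are assumptions; only the 0.7 and 0.4 cut-offs come from the guide above.

```python
def classify_pair(score: float,
                  flag_threshold: float = 0.7,
                  review_threshold: float = 0.4) -> str:
    """Map a Jaccard score to one of the three rough bands."""
    if score > flag_threshold:
        return "redundant"   # likely redundant: consider replacing one example
    if score >= review_threshold:
        return "inspect"     # similar, but may teach different things
    return "diverse"         # meaningfully distinct

for score in (0.85, 0.55, 0.2):
    print(score, classify_pair(score))

# A stricter task can lower the flagging threshold.
print(classify_pair(0.55, flag_threshold=0.5))
```

Keeping the thresholds as parameters mirrors the adjustable strictness described above.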