How does the automatic checking work?

For each instruction the tool runs lightweight heuristics — length and word-count rules, format checks like JSON or bullet lists, and forbidden-word checks for "do not" instructions. It is a first pass, and you can override any verdict, so the final score reflects your judgement plus the heuristics.

Why let me override the verdicts?

Many instructions are semantic ("be friendly", "stay on topic") and no heuristic can judge them reliably. Letting you set pass, partial, or fail per instruction keeps the score honest while the tool handles the mechanical checks for you.

What counts as an instruction?

The extractor pulls imperative sentences, numbered steps, and bullet points — the lines that tell the model what to do. It skips background and context. You can prune or add to the list before scoring so it matches what you actually care about.
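As a rough illustration of the extraction step described above, the sketch below collects bullet points, numbered steps, and sentences that open with an imperative verb. It is not the tool's actual extractor: the function name `extract_instructions`, the `IMPERATIVE_VERBS` list, and the regular expressions are all assumptions made for this example.

```python
import re

# Hypothetical starter list of imperative openers; a real extractor would need more.
IMPERATIVE_VERBS = {
    "use", "write", "respond", "keep", "avoid", "do",
    "include", "list", "return", "never", "always",
}

# Matches "- item", "* item", "• item", "1. item" or "1) item".
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+(.*)$")


def extract_instructions(prompt: str) -> list[str]:
    """Return candidate instructions: list items plus imperative sentences."""
    found = []
    for line in prompt.splitlines():
        line = line.strip()
        if not line:
            continue
        match = BULLET.match(line)
        if match:
            # Bullets and numbered steps are kept whole, without the marker.
            found.append(match.group(1).strip())
            continue
        # Split prose lines into sentences and keep the imperative ones.
        for sentence in re.split(r"(?<=[.!?])\s+", line):
            words = sentence.split()
            if words and words[0].lower().strip(",:") in IMPERATIVE_VERBS:
                found.append(sentence.strip())
    return found


prompt = (
    "You are a helpful assistant for a travel site.\n"
    "Respond in under 80 words. Do not use the word delve.\n"
    "- Use British English\n"
    "1. Return valid JSON"
)
instructions = extract_instructions(prompt)
```

Here the background sentence on the first line is skipped, while the two imperative sentences and both list items are kept. A verb list like this will miss some instructions and catch some non-instructions, which is why pruning the list before scoring matters.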

Does anything get uploaded?

No. Extraction, checking, and scoring all run locally in your browser, so you can paste sensitive prompts and outputs without them leaving the page.

What is the Instruction-Following Scorer?

It extracts the explicit instructions from your prompt, checks the model's output against each one with a pass, partial, or fail verdict, and computes a compliance percentage so you can see at a glance where the model drifted. It runs free in your browser on Gera Tools, with nothing uploaded.

Name: Instruction-Following Scorer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Instruction-following scorer

When a model output looks “mostly right” it is easy to miss the one rule it ignored — the length cap it blew past, the format it dropped, the word you told it to avoid. This scorer pulls the explicit instructions out of your prompt, checks the output against each one, and gives you a compliance percentage so drift is visible instead of buried in a wall of plausible text.

How it works

The tool first extracts candidate instructions: imperative sentences, numbered rules, and bullet points — the lines that actually tell the model what to do. It then runs lightweight heuristics on the output for each one: word and character limits, format checks (JSON, bullet lists, headings), and forbidden-word checks for “do not” rules. Each instruction gets a pass, partial, or fail verdict that you can override, because semantic instructions like “be concise” need a human eye. The compliance percentage updates live as you adjust.
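To make the heuristic pass more concrete, here is a minimal sketch of three mechanical checks of the kind named above: a word limit, a JSON format check, and a bullet-list check. The function names, the 10% slack that earns a partial verdict, and the bullet pattern are assumptions for illustration, not the tool's real rules.

```python
import json
import re


def check_max_words(output: str, limit: int, slack: float = 0.10) -> str:
    """Pass within the limit; partial if slightly over (assumed 10% slack)."""
    count = len(output.split())
    if count <= limit:
        return "pass"
    if count <= limit * (1 + slack):
        return "partial"
    return "fail"


def check_json(output: str) -> str:
    """Pass only if the whole output parses as JSON."""
    try:
        json.loads(output)
        return "pass"
    except json.JSONDecodeError:
        return "fail"


def check_bullets(output: str) -> str:
    """Pass if every non-empty line is a list item, partial if only some are."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return "fail"
    items = [line for line in lines if re.match(r"\s*(?:[-*•]|\d+[.)])\s+", line)]
    if len(items) == len(lines):
        return "pass"
    return "partial" if items else "fail"
```

Each function returns one of the three verdict strings, so a manual override can simply replace the returned value before the score is computed.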

Why instruction-following is harder than it looks

Large language models are trained to produce plausible, coherent text — but that goal is not the same as rigidly following a list of rules. Two common failure modes stand out in practice:

Prioritisation drift: when a prompt contains many instructions, models tend to follow the early and late ones more reliably than the middle ones. A rule buried in paragraph three is more likely to be skipped than one at the top of the list or the very end. This is sometimes called the “lost in the middle” problem.

Soft instruction collapse: precise constraints (“respond in exactly 80 words”, “use only British English”, “do not use the word ‘delve’”) are often approximated rather than enforced exactly. The output is near-compliant — maybe 95 words, one Americanism, one instance of the banned word — in a way that reads as fine on a quick scan but fails the actual specification.

This scorer makes those lapses visible, turning a vague sense of “it seems about right” into a numbered checklist with a percentage.
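One way to surface a near miss is to report how far off the output is instead of a bare yes or no. The sketch below measures the gap for an exact word count and counts whole-word occurrences of a banned word. Both helper names are hypothetical and the approach is an assumption, not a description of the scorer's internals.

```python
import re


def word_count_gap(output: str, target: int) -> int:
    """Words over (positive) or under (negative) the exact target."""
    return len(output.split()) - target


def banned_word_hits(output: str, banned: str) -> int:
    """Count whole-word, case-insensitive occurrences of a banned word."""
    pattern = r"\b" + re.escape(banned) + r"\b"
    return len(re.findall(pattern, output, flags=re.IGNORECASE))


text = "Let us delve in. Delve again; delved not."
gap = word_count_gap("one two three", 2)   # one word over the target
hits = banned_word_hits(text, "delve")     # "delved" is a different word
```

Whole-word matching is a deliberate choice here: it avoids flagging "delved" when only "delve" was banned, at the cost of missing inflected forms you may also want to catch.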

What the compliance percentage tells you

A score of 100% means every extracted instruction passed its check (or you manually marked it as passing). Lower scores point to where effort is needed — whether that means rewriting the prompt to make a rule harder to miss, splitting a complex prompt into smaller targeted ones, or trying a different model or setting.

The score is most useful as a comparative tool: score the same prompt against outputs from two model versions, or score two different prompts aimed at the same task. The difference in compliance percentage is more meaningful than any single absolute score.

When to use it

Prompt engineering: test whether a newly written system prompt is actually followed before deploying it in production.
Model comparison: feed the same prompt to two models and compare adherence scores side by side.
Regression testing: spot when a model update or a prompt tweak causes previously-passing rules to start failing.
Output review: use it as a structured review checklist before publishing or acting on AI-generated content.
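For the regression-testing case, a small sketch of comparing verdicts before and after a change may help. It assumes you have noted the per-instruction verdicts from two runs as plain dictionaries; the function name and data shape are hypothetical.

```python
# Order verdicts so a drop (for example pass -> partial) can be detected.
RANK = {"fail": 0, "partial": 1, "pass": 2}


def regressions(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return instructions whose verdict got worse between two runs."""
    return [
        rule
        for rule in before
        if rule in after and RANK[after[rule]] < RANK[before[rule]]
    ]


before = {"max 80 words": "pass", "valid JSON": "pass", "avoid 'delve'": "partial"}
after = {"max 80 words": "partial", "valid JSON": "pass", "avoid 'delve'": "pass"}

worse = regressions(before, after)
```

In this example only the word-limit rule is reported, since it slipped from pass to partial while the banned-word rule improved. Listing the affected rules by name is often more actionable than the overall percentage, which can stay flat when one rule regresses and another recovers.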

Tips and notes

Prune the extracted list first. Removing non-instructions keeps the score meaningful.
Trust the mechanical checks, judge the rest. Length and format verdicts are usually reliable, but they are still a first pass; tone and relevance are yours to set.
Use it to compare models. Score the same prompt across two models and the percentages give you a fast, concrete comparison.
Everything is local. Paste proprietary prompts and outputs freely — nothing leaves your browser.