Does this send my prompt to an AI model?

No. The evaluation runs entirely in your browser using heuristic checks. Nothing is uploaded, so you can safely paste proprietary or sensitive prompts.

What do the ten dimensions measure?

Clarity, specificity, role/persona definition, explicit output-format instructions, appropriate length, presence of examples, stated constraints, task decomposition, ambiguity, and prompt-injection resilience. Each is scored independently so you can see exactly where a prompt is weak.

Is a perfect score always best?

Not necessarily. A short factual lookup does not need examples or heavy decomposition. Use the scores as a checklist and apply judgement for the task at hand rather than chasing 20/20 on every prompt.

How is injection safety scored?

The tool looks for patterns that make prompts vulnerable, such as concatenating untrusted input without delimiters or no instruction reinforcing that user content is data, not commands. It rewards explicit delimiters and statements that user input should not override system rules.

What is the Prompt Meta-Evaluator?

Paste a system and user prompt to get a per-dimension score for clarity, specificity, role definition, output-format instructions, length, examples, constraints, and prompt-injection safety, plus concrete fixes for each weak area. It runs free in your browser on Gera Tools, with nothing uploaded.

Prompt Meta-Evaluator

Name: Prompt Meta-Evaluator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Score your prompt before you spend tokens

A weak prompt wastes tokens, money, and time on outputs you have to re-prompt anyway. The Prompt Meta-Evaluator scores your system and user prompt against ten dimensions that consistently separate reliable prompts from flaky ones, so you can fix the obvious problems before you ever hit the API.

How it works

The tool runs a set of heuristic checks entirely in your browser. It looks at length and structure, detects whether you have defined a role, asked for a specific output format, supplied examples, stated constraints, and broken the task into steps. It also flags ambiguity (vague verbs like “handle” or “deal with”) and checks for prompt-injection resilience — whether untrusted input is clearly delimited and the model is told to treat it as data. Each dimension gets a 0, 1, or 2, and the totals roll up into a score out of 20 with a grade.

What each dimension measures

The ten dimensions are not arbitrary — each one addresses a documented failure mode that produces inconsistent model output:

Clarity (0–2): Does the prompt use direct, unambiguous verbs? Instructions with words like “handle,” “manage,” or “deal with” leave the model guessing. “Extract,” “summarise,” or “classify” are clear.

Specificity (0–2): Are the requirements concrete and bounded? “Write a good email” is low; “Write a three-paragraph cold email to a CFO, opening with a cost saving, closing with a specific call to action” is high.

Role/persona definition (0–2): A named role anchors the model’s reference frame. “You are a senior contract lawyer specialising in SaaS agreements” constrains the vocabulary, tone, and assumptions far more than an unanchored instruction.

Output format (0–2): Explicitly stated format requirements — JSON with a named schema, a numbered list, a markdown table — dramatically reduce variance. Without them, the model chooses its own format, which changes between runs.

Length appropriateness (0–2): Prompts that are extremely long (many repetitions of the same instruction) or extremely short (bare question with no context) both score lower. The sweet spot is exactly enough instruction to eliminate ambiguity.

Examples (0–2): One or two well-chosen examples of input/output pairs narrow the model’s target better than any amount of description. They are the highest-leverage addition to most underpowered prompts.

Constraints (0–2): Explicit “do not” rules are surprisingly effective. “Do not mention competitor names,” “do not exceed 200 words,” “do not add disclaimers” removes classes of failure that the positive instructions miss.

Task decomposition (0–2): Complex multi-step tasks produce better output when broken into numbered steps or sequenced sub-instructions than when stated as a single imperative.

Ambiguity (0–2): The tool flags words with multiple plausible interpretations given the context — a sign to rephrase before the model guesses wrong.

Injection resilience (0–2): User-supplied content clearly delimited and described as data, not instructions, scores highest. The absence of delimiters on untrusted input is the most common attack surface for injection.

Tips for a higher score

Define a role and an output format. “You are a senior tax accountant. Reply only in JSON matching this schema” outscores a bare question.
Show, don’t just tell. One or two examples of the desired output lift the specificity and example dimensions and dramatically improve consistency.
Delimit user input. Wrap pasted content in triple backticks or XML tags and state that anything inside is data, never instructions. This is the single biggest win for injection safety.
Cut filler. Politeness padding and restating the obvious lowers the clarity score without improving results — be direct.
A perfect 20/20 is not always the goal. A simple factual lookup does not need examples or decomposition. Score with the task in mind, not in the abstract.