How reliable is a self-assessed confidence score?

It is a useful signal, not a guarantee. Models are often poorly calibrated and can be confidently wrong, so treat low scores as a prompt to verify — never treat a high score as proof a claim is true.

Where does my API key go?

Your key stays in your browser and is sent only in the direct request to OpenAI or Anthropic from your machine. It is never sent to our servers, stored, or logged.

Which models can I use?

Any current OpenAI chat model (e.g. gpt-4o-mini, gpt-4o) or Anthropic model (e.g. claude-3-5-haiku, claude-3-5-sonnet). Cheaper models are fine for scoring and keep costs low.

Why did a request fail?

Common causes are an invalid key, insufficient credit, a CORS or network block, or rate limiting. The exact provider error message is shown so you can act on it.

What is the Confidence Annotator (BYO-key)?

Paste LLM output and use your own OpenAI or Anthropic API key to have a model re-evaluate each sentence and tag it with a confidence score. Low-confidence claims are highlighted for verification. Your key stays in your browser and is sent only to the provider you choose. The tool runs free in your browser on Gera Tools, and nothing is uploaded to our servers.

Confidence Annotator (BYO-key)

Name: Confidence Annotator (BYO-key)
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Annotate LLM output with per-sentence confidence

When an LLM gives you a paragraph of claims, some are rock-solid and some are plausible guesses. This tool uses your own API key to ask a model to re-read the text sentence by sentence and assign each one a confidence score, then highlights the shaky claims so you know where to focus your fact-checking. Your key is sent only to the provider you choose and never to our servers.

How it works

The pasted output is split into sentences locally.
A single request is sent directly from your browser to OpenAI or Anthropic, asking the model to return a JSON array of { sentence, confidence } objects, where confidence is 0–100.
The results are color-coded: green for high confidence, amber for medium, and red for low — the claims most worth verifying.

No proxy, no server: the only network call is the one from your machine to the provider you chose, authenticated with your key.
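
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the tool's actual browser code: the regex splitter, the prompt wording, and the helper names (split_sentences, build_prompt, parse_scores) are all assumptions, and the network call itself is left out.

```python
import json
import re


def split_sentences(text: str) -> list[str]:
    """Naive local splitter: break after ., ! or ? followed by whitespace.

    Assumed for illustration; it will mis-split abbreviations such as "Dr. Smith".
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_prompt(sentences: list[str]) -> str:
    """Ask for a JSON array of {sentence, confidence} objects (0-100).

    The wording is hypothetical; the tool's real prompt is not documented here.
    """
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(sentences))
    return (
        "Rate how confident you are that each sentence is factually correct.\n"
        "Reply with only a JSON array of objects with keys "
        '"sentence" and "confidence" (an integer from 0 to 100).\n\n'
        + numbered
    )


def parse_scores(reply: str) -> list[dict]:
    """Validate the model's reply and clamp each score into 0-100."""
    data = json.loads(reply)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    results = []
    for item in data:
        score = max(0, min(100, int(item["confidence"])))
        results.append({"sentence": str(item["sentence"]), "confidence": score})
    return results
```

Clamping and type-checking the reply is a defensive choice: a model may return a score outside the requested range or wrap the array in extra text, and failing loudly is easier to act on than silently mis-coloring a sentence.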

What confidence scores actually measure

LLM confidence scores are a form of self-assessment, not a probability derived from the model’s internal state. When the grader model assigns a score of 40 to a sentence, it means the model believes the claim is uncertain or potentially incorrect based on what it knows — not that the claim is wrong 60% of the time. The distinction matters:

A low score is a useful signal to investigate further. Models tend to assign low scores when claims involve specific numbers, recent events, niche technical details, or anything that feels “off” even if they can’t immediately refute it.
A high score is not a guarantee of accuracy. Models are often confidently wrong, especially on topics where they have learned confident-sounding but incorrect information from training data.

Think of the color-coded output as a triage queue, not a verdict. The red sentences are where you spend your fact-checking effort first.
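
A triage queue of this kind might look like the sketch below. The 50 and 80 cut-offs, the function names, and the example scores are assumptions made for illustration; the thresholds the tool actually uses are not stated here.

```python
def color_for(confidence: int, low: int = 50, high: int = 80) -> str:
    """Map a 0-100 score to a color; the 50/80 cut-offs are assumed."""
    if confidence < low:
        return "red"
    if confidence < high:
        return "amber"
    return "green"


def triage(scored: list[dict]) -> list[dict]:
    """Return sentences lowest-confidence first, each tagged with a color."""
    ordered = sorted(scored, key=lambda item: item["confidence"])
    return [{**item, "color": color_for(item["confidence"])} for item in ordered]


# Made-up scores, shaped like the { sentence, confidence } objects described above.
scored = [
    {"sentence": "Water boils at 100 °C at sea level.", "confidence": 95},
    {"sentence": "The study included 4,283 participants.", "confidence": 40},
    {"sentence": "GDP grew 2.7%.", "confidence": 65},
]

for item in triage(scored):
    print(f'{item["color"]:>5}  {item["confidence"]:>3}  {item["sentence"]}')
```

Sorting ascending puts the red sentences at the top of the list, which matches the order in which you would fact-check them.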

What the annotator catches well

Specific statistics and numbers: “The study included 4,283 participants” or “GDP grew 2.7%” are high-precision claims that a model will often flag as uncertain when it cannot verify the exact number.
Temporal claims: “As of 2023” or “since last year” often earn lower scores because models know their training data has a cutoff, so time-sensitive claims are more likely to be out of date.
Named entities in specific contexts: “Dr. Jane Smith at MIT” pairs a name with an affiliation, and models frequently flag such claims because they cannot confirm that specific pairing.
Cause-and-effect claims: “This caused a 30% improvement” asserts a causal link, and a model may judge that the claim needs stronger evidence than a correlation.

What the annotator does not catch reliably

Plausible but wrong facts that match the model’s training data. If the model learned a claim from authoritative-sounding but incorrect sources, it will likely score that claim highly.
Omissions. The annotator checks what is said, not what is missing. An answer that is technically correct in every sentence but omits crucial context will still score well.
Logical errors. Each sentence can be factual on its own while the argument as a whole reaches an incorrect conclusion through faulty reasoning. Sentence-level confidence does not catch multi-step reasoning errors.

Tips and notes

Calibration is imperfect. Treat a low score as “go check this”, not as proof of error — and never treat a high score as proof of truth.
Use a cheap model (gpt-4o-mini or claude-3-5-haiku) for scoring; the task is simple and the cost per run stays tiny.
This pairs well with a real source check: the annotator tells you where to look; you still confirm the facts against a primary source.
For high-stakes content (medical, legal, financial), use the annotator as a first filter and then verify every claim through primary sources regardless of score.
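
One way to see how far the scores can be trusted on your own material is to verify a sample by hand and compare the outcome with the score bands. The sketch below assumes you have recorded (confidence, was_correct) pairs yourself; the function name, the 50/80 band cut-offs, and the sample data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_band(
    checked: list[tuple[int, bool]], low: int = 50, high: int = 80
) -> dict[str, float]:
    """Fraction of manually verified sentences that were correct, per band."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for confidence, was_correct in checked:
        if confidence < low:
            band = "low"
        elif confidence < high:
            band = "medium"
        else:
            band = "high"
        totals[band] += 1
        correct[band] += int(was_correct)
    return {band: correct[band] / totals[band] for band in totals}


# Hypothetical results from checking six sentences against primary sources.
checked = [(95, True), (90, False), (85, True), (60, True), (40, False), (30, True)]
print(accuracy_by_band(checked))
```

If the high band turns out to be wrong noticeably often, that is a concrete reminder not to skip verification of green sentences. A sample this small only illustrates the bookkeeping; it is not enough to judge calibration.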