How is the relevance score produced?

Each chunk is sent to your chosen model with the question and a strict rubric asking for a 0-100 score and a short reason. The tool parses the numeric score and ranks chunks high to low.

Does this keep my API key?

No. The key lives only in the page's memory for the current session and is sent solely to OpenAI or Anthropic for the scoring requests. Nothing is stored or logged.

Why score chunks instead of just passing them all?

Irrelevant chunks waste tokens, dilute the prompt, and can trigger hallucinations. Scoring lets you keep only the chunks that actually help answer the question.

What threshold should I use?

A common starting point is to keep chunks scoring 60 or above, but tune it per corpus. Inspect the rationales on borderline chunks before settling on a cutoff.

What is the Context Relevance Scorer (BYO-key)?

It is a free bring-your-own-key (BYO-key) context relevance scorer for RAG pipelines. Paste a question and your retrieved chunks, supply your own OpenAI or Anthropic key, and get a 0-100 relevance score per chunk so you can prune low-value context. It runs free in your browser on Gera Tools: nothing is uploaded to Gera Tools, and requests go directly from your browser to the provider you choose.

Context Relevance Scorer (BYO-key)

Name: Context Relevance Scorer (BYO-key)
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Score retrieved chunks before you answer

In a retrieval-augmented generation (RAG) pipeline, the retriever often returns chunks that are only loosely related to the user’s question. Passing all of them to the model wastes tokens and invites hallucination. This tool sends each chunk to your own LLM with a strict scoring rubric and returns a 0-100 relevance score plus a short rationale, so you can rank and prune context before generating the final answer.

How relevance scoring works

For every chunk you paste, the tool issues a separate request to your selected provider asking the model to rate, on a 0-100 scale, how useful that chunk is for answering the question — and to give a one-line reason. The numeric score is parsed out and the chunks are sorted high to low. Because each chunk is judged in isolation against the same rubric, the scores are comparable and you get a clean ranking rather than a vague gut feel. Your API key is used only for these direct browser-to-provider calls and is never persisted.
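The parse-and-sort step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the reply shape ("72 - covers the refund window"), the names parse_reply and rank_chunks, and the ask_model callable (standing in for the real provider request) are all assumptions.

```python
import re
from typing import Callable, NamedTuple


class ScoredChunk(NamedTuple):
    score: int
    reason: str
    chunk: str


def parse_reply(reply: str) -> tuple[int, str]:
    """Take the first integer in the reply as the score, clamped to 0-100.

    Assumes the model answers with the score first and the reason after it.
    """
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no numeric score in reply: {reply!r}")
    score = min(100, max(0, int(match.group())))
    # Drop separators such as " - " or ": " between the score and the reason.
    reason = reply[match.end():].lstrip(" \t-:.").strip()
    return score, reason


def rank_chunks(
    question: str,
    chunks: list[str],
    ask_model: Callable[[str, str], str],  # hypothetical: (question, chunk) -> reply text
) -> list[ScoredChunk]:
    """Score each chunk in isolation, then sort high to low."""
    scored = []
    for chunk in chunks:
        score, reason = parse_reply(ask_model(question, chunk))
        scored.append(ScoredChunk(score, reason, chunk))
    return sorted(scored, key=lambda s: s.score, reverse=True)
```

Passing the provider call in as a function keeps the parsing and ranking logic testable without a network request or an API key.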

Why irrelevant chunks cause hallucinations

When you pass a chunk to the model that is only loosely related to the question, the model must still do something with it. Models trained to be helpful tend to incorporate available context even when it is misleading — a chunk about a similar but different product can cause the model to answer confidently about the wrong product. This is not a failure mode you can prompt-engineer away entirely; the most reliable fix is to not include the irrelevant chunk in the first place.

Scoring lets you quantify how often this happens. If you regularly see chunks scoring below 30 that your retriever still returns with high vector similarity, your embeddings and your question semantics are misaligned — a signal that chunking strategy or the embedding model needs revisiting.
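One way to quantify that mismatch is to count how many high-similarity chunks receive a low relevance score. The sketch below assumes you have exported both numbers per chunk yourself; the Retrieved record, the 0.8 similarity floor, and the function names are illustrative choices, and only the below-30 relevance bound comes from the text above.

```python
from typing import NamedTuple


class Retrieved(NamedTuple):
    chunk_id: str
    similarity: float  # retriever's vector similarity, assumed to lie in 0-1
    relevance: int     # 0-100 score from the relevance scorer


def find_misaligned(results, min_similarity=0.8, max_relevance=30):
    """Chunks the retriever ranked highly but the scorer rated below max_relevance."""
    return [
        r for r in results
        if r.similarity >= min_similarity and r.relevance < max_relevance
    ]


def misalignment_rate(results, min_similarity=0.8, max_relevance=30):
    """Share of high-similarity chunks that were scored as irrelevant."""
    confident = [r for r in results if r.similarity >= min_similarity]
    if not confident:
        return 0.0
    flagged = find_misaligned(confident, min_similarity, max_relevance)
    return len(flagged) / len(confident)
```

A rate that stays high across many questions is the signal to revisit chunking or the embedding model; a single flagged chunk is not.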

What the rubric actually asks the model

The scoring prompt instructs the model to evaluate each chunk on a single criterion: given this question, how directly does this chunk provide information needed to answer it? The scale is:

80–100: The chunk directly addresses the question and contains specific, relevant facts or reasoning.
50–79: The chunk is related to the topic but provides only background or tangential information.
20–49: The chunk is in the same general domain but does not help answer this specific question.
0–19: The chunk is off-topic or contradicts the question’s premises.

The rationale line explains which category and why, which helps you calibrate whether your threshold is in the right place.
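For readers who want to reproduce a similar rubric in their own pipeline, here is a rough sketch of a prompt builder and a band lookup. The band boundaries come from the scale above; the prompt wording and the names RUBRIC_BANDS, build_prompt, and band_for are assumptions, not the tool's actual prompt.

```python
RUBRIC_BANDS = [
    (80, 100, "directly addresses the question with specific, relevant facts or reasoning"),
    (50, 79, "related to the topic, but only background or tangential information"),
    (20, 49, "same general domain, but does not help answer this specific question"),
    (0, 19, "off-topic, or contradicts the question's premises"),
]


def build_prompt(question: str, chunk: str) -> str:
    """Assemble a single-criterion scoring prompt (wording is illustrative)."""
    bands = "\n".join(f"{lo}-{hi}: {desc}" for lo, hi, desc in RUBRIC_BANDS)
    return (
        "Rate how directly the chunk provides information needed to answer the question.\n"
        f"Scale:\n{bands}\n"
        "Reply with the integer score, then a one-line reason.\n\n"
        f"Question: {question}\n\nChunk: {chunk}"
    )


def band_for(score: int) -> str:
    """Map a 0-100 score back to its rubric band description."""
    for lo, hi, desc in RUBRIC_BANDS:
        if lo <= score <= hi:
            return desc
    raise ValueError(f"score out of range: {score}")
```

Mapping a score back to its band makes it easy to check whether a rationale agrees with the number the model returned.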

Choosing a cutoff threshold

A threshold of 60 is a reasonable starting point: keep chunks at 60 or above, discard those below. But the right threshold depends on your corpus and how its scores are distributed. If your retriever consistently returns chunks scoring 40–80, a hard cutoff at 60 discards roughly half your results. When the rationales show that those discarded chunks really are marginal, treat it as a retrieval problem and tighten your chunking, or raise the cutoff to 70 to keep only the strongest chunks. If most chunks score above 70 and a few stragglers score below 20, a threshold of 40 is probably safe and keeps more potentially useful context. Always read the rationales on borderline chunks before settling on a number.
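The two score patterns described above can be checked with a quick threshold sweep. The score lists below are made-up examples, and keep_at_or_above and sweep are hypothetical helper names.

```python
def keep_at_or_above(scores, threshold):
    """Scores that survive a given cutoff."""
    return [s for s in scores if s >= threshold]


def sweep(scores, thresholds=(40, 60, 70)):
    """How many chunks survive at each candidate cutoff."""
    return {t: len(keep_at_or_above(scores, t)) for t in thresholds}


# Illustrative data: scores spread across the 40-80 band.
spread = [42, 48, 55, 58, 63, 67, 74, 79]
# Illustrative data: a strong cluster above 70 plus two stragglers below 20.
clustered = [92, 88, 81, 76, 73, 15, 9]

print(sweep(spread))     # {40: 8, 60: 4, 70: 2}
print(sweep(clustered))  # {40: 5, 60: 5, 70: 5}
```

In the spread case every step in the cutoff changes what the model sees, so the rationales matter. In the clustered case any cutoff between 20 and 70 keeps the same five chunks, which is why a lower threshold costs nothing there.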

Tips for reliable scoring

Keep chunks reasonably short. A 2,000-token chunk muddies the score because it may contain both relevant and irrelevant passages.
Use a cheap, fast model (gpt-4o-mini or claude-3-5-haiku) — relevance judging doesn’t need a frontier model and you’ll score many chunks quickly.
Treat the score as a ranking signal, not a verdict. Read the rationale on any chunk near your cutoff before discarding it.
Re-score after changing your chunking strategy to see whether smaller or larger windows improve relevance density.
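To compare chunking strategies after re-scoring, one simple measure is the fraction of chunks that clear your cutoff. Treating "relevance density" as that fraction is an assumption made for this sketch, and the run labels and scores are invented.

```python
def relevance_density(scores, cutoff=60):
    """Fraction of chunks scoring at or above the cutoff (0.0 for an empty run)."""
    if not scores:
        return 0.0
    return sum(s >= cutoff for s in scores) / len(scores)


def compare_runs(runs, cutoff=60):
    """Map each run label to its density, ordered from highest to lowest."""
    densities = {
        label: relevance_density(scores, cutoff) for label, scores in runs.items()
    }
    return dict(sorted(densities.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical scores for the same question under two chunk sizes.
runs = {
    "512-token": [82, 35, 61, 20],
    "256-token": [88, 74, 66, 41, 90, 12, 70, 65],
}
print(compare_runs(runs))  # {'256-token': 0.75, '512-token': 0.5}
```

Compare runs on the same set of questions, since density for a single question can swing widely.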