Context Relevance Scorer (BYO-key)

Score retrieved context chunks for relevance to your question using an LLM.


Score retrieved chunks before you answer

In a retrieval-augmented generation (RAG) pipeline, the retriever often returns chunks that are only loosely related to the user’s question. Passing all of them to the model wastes tokens and invites hallucination. This tool sends each chunk to your own LLM with a strict scoring rubric and returns a 0-100 relevance score plus a short rationale, so you can rank and prune context before generating the final answer.

How relevance scoring works

For every chunk you paste, the tool issues a separate request to your selected provider, asking the model to rate on a 0-100 scale how useful that chunk is for answering the question and to give a one-line reason. The numeric score is parsed from the reply, and the chunks are sorted from highest to lowest. Because each chunk is judged in isolation against the same rubric, scores from a single run are broadly comparable, which gives you a usable ranking rather than a vague gut feel. Your API key is used only for these direct browser-to-provider calls and is never persisted.
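The loop described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the rubric wording, the `SCORE: ... | REASON: ...` reply format, and the names `RUBRIC`, `parse_reply`, `score_chunks`, and `ask_model` are all assumptions made for the example. `ask_model` stands in for whatever call you make to your provider; it takes a prompt string and returns the model's text reply.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical rubric; the tool's real prompt is not published on this page.
RUBRIC = (
    "Rate from 0 to 100 how useful the CHUNK is for answering the QUESTION.\n"
    "Reply exactly as: SCORE: <integer> | REASON: <one line>\n\n"
    "QUESTION: {question}\n\nCHUNK: {chunk}"
)


@dataclass
class Scored:
    chunk: str
    score: int
    reason: str


def parse_reply(reply: str) -> Tuple[int, str]:
    """Pull the numeric score and one-line reason out of a model reply."""
    match = re.search(r"SCORE:\s*(\d{1,3})", reply, re.IGNORECASE)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    # Clamp in case the model ignores the 0-100 range.
    score = max(0, min(100, int(match.group(1))))
    reason_match = re.search(r"REASON:\s*(.+)", reply, re.IGNORECASE)
    reason = reason_match.group(1).strip() if reason_match else ""
    return score, reason


def score_chunks(
    question: str,
    chunks: List[str],
    ask_model: Callable[[str], str],
) -> List[Scored]:
    """Score each chunk with its own request, then sort high to low."""
    scored = []
    for chunk in chunks:
        # One request per chunk, always with the same rubric.
        reply = ask_model(RUBRIC.format(question=question, chunk=chunk))
        score, reason = parse_reply(reply)
        scored.append(Scored(chunk, score, reason))
    return sorted(scored, key=lambda s: s.score, reverse=True)
```

Passing the provider call in as a function keeps the sketch independent of any one SDK, and lets you try the parsing and sorting with a stub before spending tokens.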

Tips for reliable scoring

  • Keep chunks reasonably short. A 2,000-token chunk muddies the score because it may contain both relevant and irrelevant passages.
  • Use a cheap, fast model such as gpt-4o-mini or claude-3-5-haiku. Relevance judging rarely needs a frontier model, and a smaller model lets you score many chunks quickly.
  • Treat the score as a ranking signal, not a verdict. Read the rationale on any chunk near your cutoff before discarding it.
  • Re-score after changing your chunking strategy to see whether smaller or larger windows improve relevance density.
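
One way to treat the score as a ranking signal rather than a verdict is to split chunks into three groups: clearly kept, clearly dropped, and a borderline band whose rationale you read by hand. The sketch below assumes you already have `(chunk, score)` pairs. The function name `triage` and the default `cutoff` of 60 and `margin` of 10 are illustrative choices, not values the tool prescribes.

```python
from typing import Dict, List, Tuple

ScoredChunk = Tuple[str, int]  # (chunk text, 0-100 relevance score)


def triage(
    scored: List[ScoredChunk],
    cutoff: int = 60,   # hypothetical threshold; tune for your data
    margin: int = 10,   # hypothetical width of the "read the rationale" band
) -> Dict[str, List[ScoredChunk]]:
    """Bucket chunks into keep / review / drop around a cutoff."""
    buckets: Dict[str, List[ScoredChunk]] = {"keep": [], "review": [], "drop": []}
    for chunk, score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if abs(score - cutoff) <= margin:
            # Too close to the cutoff to decide on the number alone.
            buckets["review"].append((chunk, score))
        elif score > cutoff:
            buckets["keep"].append((chunk, score))
        else:
            buckets["drop"].append((chunk, score))
    return buckets
```

Running the same triage before and after a chunking change gives a rough view of relevance density: compare how many chunks land in each bucket.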