Cosine Similarity Threshold Optimizer

Find the optimal cosine similarity cutoff for your retrieval task.


Why the threshold matters

In retrieval and RAG systems you compare a query embedding against document embeddings with cosine similarity, then keep everything above a cutoff. Set the cutoff too low and you flood the context with irrelevant chunks; set it too high and you drop documents the model actually needed. The right value is specific to your embedding model and your corpus, and the only honest way to find it is to measure against labeled examples.
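As a rough illustration of the filtering step described above, here is a minimal sketch in plain Python. The function names `cosine_similarity` and `filter_by_threshold` and the dictionary layout of `docs` are assumptions made for this example, not part of the tool.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_threshold(query, docs, threshold):
    """Return (doc_id, score) for every document scoring >= threshold.

    `docs` is a hypothetical mapping of doc_id -> embedding vector.
    Results are sorted from highest to lowest score.
    """
    kept = []
    for doc_id, vec in docs.items():
        score = cosine_similarity(query, vec)
        if score >= threshold:
            kept.append((doc_id, score))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Raising `threshold` shrinks the returned list and lowering it grows the list, which is the precision versus recall trade-off the rest of this page measures.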

How it works

You provide a list of (score, relevant) pairs from a labeled sample. The tool treats “keep everything ≥ threshold” as a binary classifier and sweeps every meaningful cutoff: the midpoints between adjacent scores, where the decision can actually flip. At each cutoff it counts true positives, false positives, and false negatives, then computes:

  • Precision: of the documents you kept, the fraction that were relevant.
  • Recall: of the relevant documents, the fraction you kept.
  • F1 / F-beta: the weighted harmonic mean of precision and recall. Beta above 1 weights recall more heavily, beta below 1 weights precision, and beta = 1 gives F1.
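
In symbols, these are the standard textbook definitions, with TP, FP, and FN standing for the true positive, false positive, and false negative counts at a given cutoff. The notation is an assumption of this sketch (it also assumes the `amsmath` package for `\text`); the page does not show the tool's internal formulas.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP},
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
\[
F_\beta = (1 + \beta^2)\,
  \frac{\text{Precision} \cdot \text{Recall}}
       {\beta^2\,\text{Precision} + \text{Recall}}
\]
```

For example, with TP = 3, FP = 1, and FN = 0, precision is 0.75, recall is 1, and F1 (beta = 1) is 2 × 0.75 × 1 / (0.75 + 1) ≈ 0.857.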

The tool then highlights the threshold with the highest F-beta and shows the full precision-recall table, so you can see the trade-off curve and choose a slightly different point if your application demands it.
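
The sweep described in this section can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the tool's actual code: the function name `sweep_thresholds`, the row dictionary keys, and the decision to add the lowest score as an extra "keep everything" candidate are all choices made for this example.

```python
def sweep_thresholds(pairs, beta=1.0):
    """Sweep candidate cutoffs over (score, relevant) pairs.

    `pairs` is a list of (float score, bool relevant).
    Returns (best_row, rows), where each row is a dict holding the
    threshold, precision, recall, and F-beta at that cutoff.
    """
    if not pairs:
        raise ValueError("need at least one (score, relevant) pair")

    scores = sorted({s for s, _ in pairs})
    # Midpoints between adjacent distinct scores are where the
    # keep/drop decision can flip. Assumption: also include the
    # lowest score so "keep everything" appears in the table.
    candidates = [scores[0]] + [
        (lo + hi) / 2 for lo, hi in zip(scores, scores[1:])
    ]

    total_relevant = sum(1 for _, rel in pairs if rel)
    b2 = beta * beta
    rows = []
    for t in candidates:
        tp = sum(1 for s, rel in pairs if s >= t and rel)
        fp = sum(1 for s, rel in pairs if s >= t and not rel)
        fn = total_relevant - tp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = b2 * precision + recall
        fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
        rows.append({
            "threshold": t,
            "precision": precision,
            "recall": recall,
            "fbeta": fbeta,
        })

    # max() returns the first row with the highest F-beta, so ties
    # resolve to the lowest tied threshold.
    best = max(rows, key=lambda row: row["fbeta"])
    return best, rows
```

A short usage example with made-up labels:

```python
sample = [(0.9, True), (0.8, True), (0.7, False),
          (0.6, True), (0.4, False), (0.3, False)]
best, table = sweep_thresholds(sample, beta=1.0)
print(best["threshold"], best["fbeta"])
```

On this sample, rerunning with `beta=0.5` moves the chosen cutoff upward, because the one irrelevant item scoring 0.7 costs more once precision is weighted more heavily.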

Tips

  • Match beta to the cost of mistakes. For a legal or medical retriever where a missed document is expensive, push beta above 1 to favour recall. For a user-facing search where noise erodes trust, drop it below 1.
  • Sample near the boundary. Optimisation is only as good as the labels around the decision point — include pairs that are genuinely borderline, not just obvious matches and obvious non-matches.
  • Re-measure when you change models. A threshold tuned for one embedding model does not transfer to another; cosine scores are not comparable across models.