What threshold should I start from?

There is no universal value: it depends on your embedding model and data, which is exactly why you measure it. Commonly quoted starting points sit between 0.7 and 0.85, but treat those as rough guesses; this tool finds the right one for your data from your labels.

What is the F-beta weight for?

Beta controls the precision-recall trade-off. Beta of 1 (the default) balances them; values above 1 favour recall (catch more relevant docs, tolerate noise); values below 1 favour precision (fewer false positives).

How are candidate thresholds chosen?

The tool tests the midpoints between every pair of adjacent unique scores, plus the extremes. Those midpoints are the only places where the classification can change, so the sweep finds the best cutoff for your sample without missing any candidate.

How much labeled data do I need?

A few dozen pairs give a rough cutoff; one to two hundred give a stable one. Make sure your sample includes both relevant and irrelevant pairs, including borderline ones near the likely cutoff.

What is the Cosine Similarity Threshold Optimizer?

Paste a labeled set of similarity scores, each with a true/false relevance label, and the tool sweeps every candidate cutoff, computing precision, recall, and F-beta to find the cosine similarity threshold that maximises retrieval quality for your data. It runs free in your browser on Gera Tools, with nothing uploaded.

Cosine Similarity Threshold Optimizer

Name: Cosine Similarity Threshold Optimizer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Why the threshold matters

In retrieval and RAG systems you compare a query embedding against document embeddings with cosine similarity, then keep everything above a cutoff. Set the cutoff too low and you flood the context with irrelevant chunks, diluting the model’s attention and increasing hallucination risk. Set it too high and you drop documents the model genuinely needed, causing gaps in its answer. The right value is specific to your embedding model and your corpus — there is no universal number, and the only honest way to find it is to measure against labeled examples.
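As a rough illustration of the retrieval step described above, the sketch below scores toy vectors against a query and keeps everything at or above a cutoff. The function names, the 2-D vectors, and the 0.7 cutoff are all made up for the example; a real system would use its embedding model's vectors and, most likely, a vector library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def keep_above(query, docs, threshold):
    """Return (doc_id, score) for every document scoring at or above threshold."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return [(doc_id, score) for doc_id, score in scored if score >= threshold]

# Toy 2-D "embeddings", invented for illustration only.
query = [1.0, 0.0]
docs = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
kept = keep_above(query, docs, 0.7)  # "a" and "b" pass, "c" is dropped
```

Raising the cutoff to 0.8 in this toy example would also drop "b", which is the kind of trade-off the rest of this page measures.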

How it works

You provide a list of (score, relevant) pairs from a labeled sample: each line holds a cosine similarity score and a true/false label saying whether that document was actually relevant to the query. The tool treats “keep everything ≥ threshold” as a binary classifier and sweeps every meaningful candidate threshold, namely the midpoints between adjacent unique scores plus the two extremes, since those are the only points where the classification actually changes.
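The input parsing and candidate generation could be sketched as follows. This is not the tool's actual code: the comma-separated line format, the function names, and the way the two extremes are defined are assumptions made for the example.

```python
def parse_pairs(text):
    """Parse lines such as '0.83, true' into (score, relevant) tuples.

    Assumes a comma between score and label; blank lines are skipped.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        score_str, label_str = line.split(",", 1)
        pairs.append((float(score_str), label_str.strip().lower() == "true"))
    return pairs

def candidate_thresholds(scores):
    """Midpoints between adjacent unique scores, plus one cutoff at each extreme."""
    unique = sorted(set(scores))
    midpoints = [(lo + hi) / 2 for lo, hi in zip(unique, unique[1:])]
    # Assumed extremes: the lowest score keeps everything under a ">=" rule,
    # and a value just above the highest score keeps nothing.
    return [unique[0]] + midpoints + [unique[-1] + 1e-9]

sample = "0.75, true\n0.5, false\n\n0.25, false\n0.75, true"
pairs = parse_pairs(sample)
candidates = candidate_thresholds([score for score, _ in pairs])
```

Duplicate scores collapse into one value before the midpoints are taken, so three unique scores yield two midpoints and four candidates in total.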

At each candidate threshold it counts:

True positives (TP): relevant documents at or above the cutoff
False positives (FP): irrelevant documents at or above the cutoff
False negatives (FN): relevant documents below the cutoff
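The three counts can be sketched in a few lines. The function name and sample data are hypothetical, and the code assumes the "keep everything ≥ threshold" rule stated earlier.

```python
def confusion_counts(pairs, threshold):
    """Count TP, FP and FN for a 'keep if score >= threshold' rule."""
    tp = fp = fn = 0
    for score, relevant in pairs:
        kept = score >= threshold
        if kept and relevant:
            tp += 1   # relevant and kept
        elif kept:
            fp += 1   # irrelevant but kept
        elif relevant:
            fn += 1   # relevant but dropped
        # irrelevant and dropped (true negative) is not needed for these metrics
    return tp, fp, fn

# Invented sample: two relevant and two irrelevant documents.
pairs = [(0.9, True), (0.8, False), (0.6, True), (0.4, False)]
counts = confusion_counts(pairs, 0.7)  # one TP (0.9), one FP (0.8), one FN (0.6)
```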

From those it computes:

Precision = TP / (TP + FP) — of what you kept, how much was useful
Recall = TP / (TP + FN) — of all useful documents, how many you kept
F-beta — the harmonic blend of precision and recall, weighted by your chosen beta
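For reference, the standard definition of F-beta is written out below, with P for precision and R for recall. The page does not spell out the exact expression the tool uses, so treating it as this textbook form is an assumption.

```latex
\[
F_\beta
  = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}
  = \frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}}
\]
```

As a quick check, with P = 0.5, R = 1 and beta = 1 this gives 2 × 0.5 × 1 / (0.5 + 1) = 2/3, or about 0.667.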

The tool highlights the threshold with the highest F-beta and shows the full precision-recall table so you can see the trade-off curve and pick a slightly different point if your application has an asymmetric cost to false positives vs. false negatives.
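Putting the pieces together, a minimal sweep might look like the sketch below. It is an illustration under stated assumptions, not the tool's implementation: the names are hypothetical, undefined precision or recall is reported as 0.0, ties go to the lowest threshold, and the "keep nothing" extreme is skipped because it always scores zero.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from counts; defined here as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def sweep(pairs, beta=1.0):
    """Return (best_row, rows); each row is (threshold, precision, recall, f_beta)."""
    unique = sorted({score for score, _ in pairs})
    # Lowest score (keeps everything) plus midpoints between adjacent unique scores.
    candidates = [unique[0]] + [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    rows = []
    for t in candidates:
        tp = sum(1 for s, rel in pairs if s >= t and rel)
        fp = sum(1 for s, rel in pairs if s >= t and not rel)
        fn = sum(1 for s, rel in pairs if s < t and rel)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((t, precision, recall, f_beta(tp, fp, fn, beta)))
    best = max(rows, key=lambda row: row[3])  # first maximum wins on ties
    return best, rows

# Invented labels: three relevant and three irrelevant documents.
pairs = [(0.875, True), (0.75, True), (0.625, False),
         (0.5, True), (0.375, False), (0.25, False)]
best_balanced, table = sweep(pairs, beta=1.0)
best_precise, _ = sweep(pairs, beta=0.5)
```

On this invented sample the balanced optimum is 0.4375, which keeps all three relevant documents plus one irrelevant one, while beta = 0.5 moves the optimum up to 0.6875, which keeps only two relevant documents and no noise.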

Choosing the right beta

Beta controls the precision-recall trade-off:

Beta	What it favours	When to use it
0.5	Precision	When irrelevant chunks hurt more than missed ones (e.g. user-facing search)
1.0	Equal balance	Default; use when precision and recall matter equally
2.0	Recall	When a missed document is expensive (e.g. legal, medical, compliance retrieval)
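To see how beta shifts the preference, the sketch below scores two hypothetical cutoffs, one precise and one broad, under each beta from the table. The precision and recall figures are invented for illustration, and the function uses the standard F-beta definition.

```python
def f_beta_from_pr(precision, recall, beta):
    """Standard F-beta from precision and recall; 0.0 when both are zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Two made-up operating points as (precision, recall).
precise_cutoff = (0.9, 0.5)  # high threshold: little noise, misses half
broad_cutoff = (0.5, 0.9)    # low threshold: catches most, half is noise

for beta in (0.5, 1.0, 2.0):
    precise = f_beta_from_pr(*precise_cutoff, beta)
    broad = f_beta_from_pr(*broad_cutoff, beta)
    print(f"beta={beta}: precise={precise:.3f} broad={broad:.3f}")
```

With these numbers the precise cutoff wins at beta 0.5 (about 0.776 against 0.549), the two tie at beta 1.0 (about 0.643 each), and the broad cutoff wins at beta 2.0.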

Practical tips

Sample near the decision boundary. Optimisation is only as reliable as the labels close to the cutoff. Include borderline pairs — documents that are somewhat relevant or only marginally off-topic — not just obvious matches and obvious junk.

Re-measure when you change models. Cosine similarity scores are not comparable across embedding models. A threshold tuned for text-embedding-ada-002 does not carry over when you switch to a different model; the absolute values shift even if the relative ranking stays similar.

Use enough labeled examples. A few dozen pairs give a rough cutoff; one to two hundred give a stable one. If your labels come from human annotators, check that they agree on the borderline cases, because inter-annotator agreement matters most near the threshold.

Treat the optimum as a starting point. The mathematically optimal threshold on your sample may not match your user experience in practice. Use this to narrow the search, then validate the top candidates with a small user study or downstream task evaluation.