What does alpha mean here?

The blended score is alpha × dense + (1 − alpha) × BM25, after per-query min-max normalisation. Alpha = 1 is pure dense (semantic), alpha = 0 is pure BM25 (keyword); the tuner finds the mix that ranks relevant docs highest.

Why normalise the scores first?

BM25 and cosine/dot-product scores live on completely different scales, so adding them raw lets one dominate. Min-max normalising each query's scores to 0–1 puts them on equal footing before blending.

How much eval data do I need?

Enough labelled queries that the result is stable — typically 30+ queries with several candidates each. With only a handful of queries the optimal alpha can be noisy, so treat it as a starting point.

Is my data uploaded anywhere?

No. The sweep runs entirely in your browser. Nothing you paste is sent to a server, stored, or logged.

What is the Hybrid Search Weight Tuner?

Paste BM25 scores, dense scores, and relevance labels for your eval queries, and the tool finds the alpha blend that maximises MRR or hit rate. It sweeps alpha from 0 to 1 and shows the best weighting for your hybrid RAG retriever. It runs free in your browser on Gera Tools, with nothing uploaded.

Hybrid Search Weight Tuner

Name: Hybrid Search Weight Tuner
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Tune the BM25 ↔ dense blend for hybrid RAG

Hybrid retrieval combines keyword search (BM25) and dense vector search, blended by a weight alpha. Picking alpha by gut feel leaves quality on the table. This tool takes your labelled eval scores, sweeps alpha from 0 to 1 in fine steps, and reports the value that maximises your chosen retrieval metric, so you ship the blend your own data prefers. It runs in your browser.

How it works

For each query, the candidates’ BM25 and dense scores are min-max normalised to 0–1 (so neither scale dominates), then combined:

score = alpha × dense_norm + (1 − alpha) × bm25_norm
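
As a rough illustration of the formula above, here is a minimal Python sketch of per-query min-max normalisation followed by the linear blend. The function names (min_max, blend) are hypothetical, and mapping a constant score list to all zeros is an assumption; the tool's own handling of that edge case is not documented here.

```python
def min_max(scores):
    """Scale one query's scores to [0, 1].

    Assumption: if every score is identical, return all zeros
    (the tool's own behaviour for this edge case is not stated).
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend(bm25, dense, alpha):
    """score = alpha * dense_norm + (1 - alpha) * bm25_norm, per candidate."""
    bm25_norm = min_max(bm25)
    dense_norm = min_max(dense)
    return [alpha * d + (1 - alpha) * b for b, d in zip(bm25_norm, dense_norm)]


# One query with three candidates (made-up scores).
bm25 = [12.0, 7.0, 2.0]     # normalises to [1.0, 0.5, 0.0]
dense = [0.25, 0.75, 0.5]   # normalises to [0.0, 1.0, 0.5]
blended = blend(bm25, dense, alpha=0.5)
print(blended)  # [0.5, 0.75, 0.25]
```

At alpha = 0.5 the second candidate comes out on top because it is strong on the dense side and mid-range on BM25.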

Candidates are re-ranked by the blended score and the chosen metric is computed:

MRR — mean of 1 / rank of the first relevant result per query.
Hit@K — fraction of queries with a relevant result in the top K.
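
The two metrics above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's code: the helper names are hypothetical, and scoring a query with no relevant candidate as 0 is an assumption.

```python
def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant item (ranks start at 1).

    Assumption: a query with no relevant item contributes 0.0.
    """
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_labels, k):
    """1.0 if any of the top k items is relevant, else 0.0."""
    return 1.0 if any(ranked_labels[:k]) else 0.0


def mean(values):
    return sum(values) / len(values)


# Relevance labels per query, already sorted by blended score (best first).
queries = [
    [0, 1, 0],  # first relevant result at rank 2 -> reciprocal rank 0.5
    [1, 0, 0],  # first relevant result at rank 1 -> reciprocal rank 1.0
    [0, 0, 0],  # no relevant result              -> reciprocal rank 0.0
]

mrr = mean([reciprocal_rank(q) for q in queries])
hit_at_1 = mean([hit_at_k(q, 1) for q in queries])
hit_at_2 = mean([hit_at_k(q, 2) for q in queries])
print(mrr)  # 0.5
```

MRR rewards placing the relevant result higher, while Hit@K only asks whether it made the top K, so the two can prefer different alphas.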

The tuner evaluates alpha at fine steps across [0, 1] and returns the alpha with the best average metric, plus a small table showing how the metric varies so you can see how sensitive your system is to the choice.
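
The sweep described above might look roughly like the following Python sketch, which optimises MRR. Everything here is an assumption made for illustration: the function names, the step size of 0.05 (steps=20), breaking score ties by original candidate order, and returning the smallest alpha when several tie for best. The tool itself may differ on each point.

```python
def min_max(scores):
    """Per-query min-max scaling; a constant list maps to zeros (assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def reciprocal_rank(bm25, dense, labels, alpha):
    """Blend one query's scores, re-rank, and return 1 / rank of the first hit."""
    b, d = min_max(bm25), min_max(dense)
    blended = [alpha * dn + (1 - alpha) * bn for bn, dn in zip(b, d)]
    # sorted() is stable, so tied scores keep their original order.
    order = sorted(range(len(labels)), key=lambda i: -blended[i])
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            return 1.0 / rank
    return 0.0


def sweep(queries, steps=20):
    """Evaluate MRR at alpha = 0, 1/steps, ..., 1 and return the best point."""
    curve = []
    for n in range(steps + 1):
        alpha = n / steps
        mrr = sum(reciprocal_rank(*q, alpha) for q in queries) / len(queries)
        curve.append((alpha, mrr))
    # max() returns the first maximum, i.e. the smallest alpha among ties.
    best_alpha, best_mrr = max(curve, key=lambda point: point[1])
    return best_alpha, best_mrr, curve


# Two made-up queries: (bm25 scores, dense scores, relevance labels).
queries = [
    ([10.0, 0.0, 5.0], [0.5, 0.75, 0.25], [1, 0, 0]),  # BM25 favours the relevant doc
    ([8.0, 4.0, 0.0], [0.25, 0.75, 0.5], [0, 1, 0]),   # dense favours the relevant doc
]
best_alpha, best_mrr, curve = sweep(queries)
print(best_alpha, best_mrr)  # 0.35 1.0
```

In this toy data MRR is 0.75 at both extremes and 1.0 for alpha from 0.35 to 0.65, so the reported 0.35 is only the left edge of a plateau. That is one reason to look at the whole curve rather than the single best value.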

Why normalise scores before blending?

BM25 scores are unbounded TF-IDF-style numbers, while cosine similarities from dense embeddings lie in [−1, 1] (dot products do too, but only when the embeddings are unit-normalised). Adding them raw almost always lets the larger-scaled system dominate by accident. Min-max normalisation per query maps both to [0, 1], so alpha reflects the intended balance between the two signals rather than their raw scales. The normalisation is per-query, not global, because score ranges vary significantly from one query to the next.
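
To make the scale problem concrete, the following Python sketch blends the same made-up scores twice, once raw and once after min-max normalisation. The scores and helper names are hypothetical; the point is only that with raw scores even alpha = 0.9 reproduces the BM25 ordering.

```python
def min_max(scores):
    """Per-query min-max scaling; a constant list maps to zeros (assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def ranking(scores):
    """Candidate indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


bm25 = [14.0, 9.0, 3.0]    # BM25 prefers candidate 0
dense = [0.25, 0.5, 0.75]  # cosine similarity prefers candidate 2
alpha = 0.9                # 90% of the weight is meant to go to dense

raw = [alpha * d + (1 - alpha) * b for b, d in zip(bm25, dense)]
norm = [alpha * d + (1 - alpha) * b
        for b, d in zip(min_max(bm25), min_max(dense))]

print(ranking(raw))   # [0, 1, 2] -> still the BM25 order
print(ranking(norm))  # [2, 1, 0] -> the dense order, as alpha = 0.9 intends
```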

Worked example of the sweep

Suppose you have 5 queries, each with 5 candidates. You paste your BM25 and dense scores and mark which candidate is relevant. The tuner tries alpha at 0.0, 0.05, 0.10, … 1.0 (21 values in steps of 0.05). An illustrative excerpt of the results:

Alpha	MRR
0.0 (BM25 only)	0.52
0.30	0.68
0.50	0.71
0.70 (best)	0.74
1.0 (dense only)	0.61

Here alpha 0.70 wins, meaning your particular corpus responds better to dense signals but still benefits meaningfully from keyword matching. This is a plausible pattern for technical documentation where exact keyword matches matter but the user often paraphrases.

When each system tends to do better

BM25 stronger: domain-specific jargon, product codes, names, IDs — exact strings that embeddings dilute.
Dense stronger: paraphrased questions, synonyms, multi-hop reasoning, languages with rich morphology.
Equal blend: general-purpose corpora that mix both patterns.

Tips and notes

A flat curve means alpha barely matters: pick a value from the middle of the flat region and move on. A sharp peak means the blend really matters; lock it in.
Make sure each query’s candidate list includes the relevant doc(s); otherwise no alpha can rank them and the metric is artificially low.
This optimises the linear blend; if you use reciprocal rank fusion instead, the intuition (balance keyword vs semantic) still holds, but tune RRF’s k separately.
Rerun the sweep whenever you switch embedding models or update your corpus significantly — the optimal alpha is specific to your data.
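
One rough way to tell a flat curve from a sharp peak, as the first tip suggests, is to look at the spread of the metric and at which alphas land close to the best. The Python sketch below uses the five rows from the worked example table. The function name and the 0.05 tolerance are assumptions chosen for illustration, not settings from the tool.

```python
def summarise_curve(curve, tolerance=0.05):
    """Summarise a list of (alpha, metric) pairs.

    Returns the best alpha, the spread (best minus worst metric), and the
    alphas whose metric is within `tolerance` of the best.
    """
    best_alpha, best_metric = max(curve, key=lambda point: point[1])
    worst_metric = min(metric for _, metric in curve)
    spread = best_metric - worst_metric
    near_best = [a for a, m in curve if best_metric - m <= tolerance]
    return best_alpha, spread, near_best


# (alpha, MRR) rows taken from the worked example table.
curve = [(0.0, 0.52), (0.30, 0.68), (0.50, 0.71), (0.70, 0.74), (1.0, 0.61)]

best_alpha, spread, near_best = summarise_curve(curve)
print(best_alpha)  # 0.7
print(near_best)   # [0.5, 0.7]
```

Here the spread is about 0.22 MRR, so the blend clearly matters, but alphas from 0.5 to 0.7 perform similarly, so the exact value within that range is less critical.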