What is BM25 and why use it for RAG?

BM25 (Okapi BM25) is a probabilistic ranking function that scores documents by term frequency, inverse document frequency, and document length. It is the lexical baseline most production RAG systems combine with semantic search in a hybrid retriever because it excels at exact keyword and rare-term matches.

What do k1 and b control?

k1 controls term-frequency saturation — higher values let repeated terms keep adding score. b controls length normalization between 0 and 1 — at b=1 longer documents are penalized fully, at b=0 length is ignored. The common defaults are k1=1.5 and b=0.75.

How is BM25 different from cosine similarity on embeddings?

BM25 is purely lexical — it matches the actual words shared between query and document. Embedding cosine similarity is semantic and can match paraphrases with no shared words. They fail in opposite ways, which is why hybrid retrieval fuses both.

Is anything sent to a server?

No. Tokenization, IDF computation, and scoring all run in your browser. Your query and documents never leave the page, so it is safe to test with private content.

What is the BM25 Relevance Scorer?

A browser-side BM25 implementation that scores and ranks a list of text chunks against a query. Tune k1 and b, see per-document scores, and compare lexical BM25 ranking against your semantic retrieval for RAG. It runs free in your browser on Gera Tools, with nothing uploaded.

BM25 Relevance Scorer

Name: BM25 Relevance Scorer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

BM25 is the lexical ranking function behind Elasticsearch, OpenSearch, and many other production search engines, and it remains the default keyword half of nearly every hybrid RAG retriever. This tool runs a faithful Okapi BM25 implementation entirely in your browser, so you can paste a query and a set of candidate chunks and see exactly how a lexical retriever would rank them. That makes it useful for sanity-checking whether your retrieval problem even needs embeddings.

How it works

The query and every document are lowercased and split into word tokens. The tool counts how many documents contain each query term to compute the inverse document frequency, IDF(t) = ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the document count and n is the number of documents containing the term. Each document’s score is the sum over query terms of IDF(t) · f · (k1 + 1) / (f + k1 · (1 - b + b · len/avglen)), where f is the term frequency in that document, len is the document length in tokens, and avglen is the average document length. The two knobs, k1 and b, let you tune term saturation and length normalization respectively.
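The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the names tokenize and bm25_scores are hypothetical, the word-token pattern is an assumption, and each distinct query term is counted once.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into word tokens. The exact token pattern is an
    # assumption; the tool may split words differently.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document, in input order."""
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(doc_tokens)
    if n_docs == 0:
        return []
    avglen = sum(len(t) for t in doc_tokens) / n_docs
    term_counts = [Counter(t) for t in doc_tokens]

    scores = [0.0] * n_docs
    # Assumption: a term repeated in the query is only counted once.
    for term in set(tokenize(query)):
        # n = number of documents containing the term
        n = sum(1 for counts in term_counts if term in counts)
        idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
        for i, counts in enumerate(term_counts):
            f = counts[term]  # term frequency in this document
            if f == 0:
                continue
            norm = 1 - b + b * len(doc_tokens[i]) / avglen
            scores[i] += idf * f * (k1 + 1) / (f + k1 * norm)
    return scores


docs = ["the cat sat on the mat", "dogs chase cats", "the quick brown fox"]
for score, doc in sorted(zip(bm25_scores("cat", docs), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

Note that "cats" does not match the query term "cat" here, because this sketch does no stemming.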

Tips and notes

Hybrid beats either alone. In practice teams run BM25 and embedding search in parallel and fuse the rankings (reciprocal rank fusion), because BM25 nails exact identifiers and rare terms while embeddings catch paraphrase.
Start with the defaults. k1=1.5 and b=0.75 are the standard starting values; only tune after you have a labeled relevance set to measure against.
Short queries reward IDF. A rare query term that appears in only one document will dominate the ranking — that is the intended behavior, not a bug.
Everything stays local. Because nothing is uploaded, you can evaluate proprietary documents and customer queries without privacy risk.
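Reciprocal rank fusion, mentioned in the first tip, can be sketched as follows. The function name is hypothetical, and the constant k = 60 is the value commonly used for RRF rather than anything this tool prescribes. Each document earns 1 / (k + rank) from every ranking it appears in, and the totals decide the fused order.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    fused = {}
    for ranking in rankings:
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)


bm25_ranking = ["a", "b", "c"]     # hypothetical ids, best first
vector_ranking = ["b", "c", "a"]   # hypothetical ids, best first
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.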

Understanding the k1 and b parameters

The two tuning parameters have precise effects and the defaults (k1 = 1.5, b = 0.75) are starting points, not universal truths:

k1: term-frequency saturation. At k1 = 0, term frequency is ignored entirely: a document earns IDF(t) for each query term it contains, however often the term appears. At k1 = 1.5, a term appearing three times in a document contributes meaningfully more than a term appearing once, but each further repetition adds less (saturation). At very high values like k1 = 10, saturation sets in so slowly that repeated terms keep adding substantial score, and the ranking moves toward TF-IDF without the saturation effect. For short queries on short documents (tweets, titles), lower k1 values often work better; for long documents with dense repetition of key terms, higher values are reasonable.

b: document-length normalization. At b = 1.0, length is normalized fully: a document twice as long needs a term to appear twice as often to score the same as a short document. At b = 0, length is completely ignored. In practice, b = 0.75 is a good default for general text. If your corpus is naturally uniform in length (fixed-size paragraphs, for example), len/avglen is close to 1 for every document, so you can lower b toward 0.5 or even 0 without much effect.
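To see these effects numerically, the sketch below isolates the non-IDF part of the BM25 term score and evaluates it for a few settings. The helper name tf_component and the document lengths are illustrative assumptions.

```python
def tf_component(f, k1, b, length, avglen):
    """Non-IDF part of a BM25 term score: f(k1 + 1) / (f + k1 * norm)."""
    if f == 0:
        return 0.0  # an absent term contributes nothing
    norm = 1 - b + b * length / avglen
    return f * (k1 + 1) / (f + k1 * norm)


# Effect of k1 on an average-length document as the term repeats.
for k1 in (0.0, 1.5, 10.0):
    values = [round(tf_component(f, k1, 0.75, 100, 100), 3) for f in (1, 3, 10)]
    print(f"k1={k1}: f=1,3,10 -> {values}")

# Effect of b: a document twice the average length, same term frequency.
for b in (0.0, 0.75, 1.0):
    print(f"b={b}: {tf_component(2, 1.5, b, 200, 100):.3f}")
```

With b = 1.0, doubling both the term frequency and the document length leaves the value unchanged, which is the "twice as long, twice as often" behavior described above.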

Where BM25 underperforms

BM25 is purely lexical, which means it fails on paraphrase: a query for “automobile repair” will not score a document about “car maintenance” unless those exact words appear. It is also blind to word order: “dog bites man” and “man bites dog” contain the same bag of words, so they score identically against any query. These are the canonical cases where dense vector (embedding) retrieval beats BM25, which is why production RAG systems combine both.
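Both failure modes follow from the fact that BM25 only ever sees term counts. The short sketch below makes that concrete; it assumes the same lowercase word tokenization as the rest of this page, and bag_of_words is a hypothetical helper.

```python
import re
from collections import Counter


def bag_of_words(text):
    # Term counts are all a lexical scorer gets to see; order is discarded.
    return Counter(re.findall(r"\w+", text.lower()))


# Word order: both sentences reduce to identical term counts.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")

# Paraphrase: the query and the document share no terms at all.
shared = set(bag_of_words("automobile repair")) & set(bag_of_words("car maintenance"))

print(same_bag, shared)
```

Identical term counts mean identical BM25 scores for any query, and an empty overlap means a score of zero.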

BM25 also struggles with very short queries (one keyword), where IDF dominates and accidents of corpus frequency can produce surprising rankings, and with highly technical vocabularies where the same concept goes by many different names.

Using this scorer to evaluate a RAG pipeline

A practical workflow: paste a query and your retrieved top-10 chunks. The BM25 scores show you whether lexical overlap explains the ranking or whether the retriever is relying on semantic matching for which there is no surface-form evidence. When BM25 ranks a chunk highly that your vector retriever missed, that chunk is a strong candidate for the hybrid-fused result set. When BM25 scores a chunk near zero but it is ranked first by your embeddings, it is arriving via semantic match — which you can verify by confirming the relevant words are absent from the chunk.
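If you copy the scores out of the tool, the comparison described above can be automated along these lines. Everything here is a hypothetical sketch: the function name, the chunk ids, the score values, and the near-zero threshold eps are all assumptions for illustration.

```python
def compare_retrievers(bm25_by_id, vector_ranking, top_k=3, eps=1e-9):
    """Flag chunks that only one of the two retrievers surfaces."""
    bm25_ranking = sorted(bm25_by_id, key=lambda d: bm25_by_id[d], reverse=True)
    # Keep only chunks with a real lexical match in the BM25 top-k.
    bm25_top = [d for d in bm25_ranking[:top_k] if bm25_by_id[d] > eps]
    vector_top = vector_ranking[:top_k]
    return {
        # Ranked highly by BM25 but missed by the vector retriever:
        # candidates for the hybrid-fused result set.
        "lexical_only": [d for d in bm25_top if d not in vector_top],
        # Ranked highly by embeddings with (near) zero BM25 score:
        # arriving via semantic match.
        "semantic_only": [d for d in vector_top if bm25_by_id.get(d, 0.0) <= eps],
    }


bm25_by_id = {"c1": 2.1, "c2": 0.0, "c3": 0.4, "c4": 1.7}  # made-up scores
vector_ranking = ["c2", "c1", "c3", "c4"]                  # made-up order, best first
print(compare_retrievers(bm25_by_id, vector_ranking, top_k=2))
```

Chunks in neither list are ones both retrievers agree on, or ones neither ranks in the top k.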