The key never leaves your browser except in the direct HTTPS request to OpenAI's own API. The tool has no backend, stores nothing, and does not send your key to Gera or any third party.

How is similarity computed?

Each text is embedded once via the OpenAI Embeddings API, then cosine similarity is computed locally between every pair of vectors. The diagonal is always 1.0 because a text is identical to itself.

What can I use this for?

Spotting near-duplicate documents before indexing, validating that distinct classes in a dataset are actually distinct, and visualizing how tightly a set of texts clusters semantically.

How many texts can I compare?

The matrix grows as N² cells, so it stays readable up to roughly 15–20 texts. All N texts are embedded in a single batched request to keep cost and latency low.
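As a rough illustration of the single batched request, the sketch below builds one POST to the OpenAI Embeddings endpoint carrying every text in the `input` array, and reads the vectors back in input order. It is a Python sketch, not the tool's browser code; the helper names and the default model (`text-embedding-3-small`, the model named in the score table on this page) are assumptions.

```python
import json
import urllib.request

OPENAI_EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_embeddings_request(texts, api_key, model="text-embedding-3-small"):
    """Build ONE request that embeds all texts (hypothetical helper name)."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        OPENAI_EMBEDDINGS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embeddings(response_body):
    """Return the vectors sorted by each item's `index` field."""
    items = sorted(json.loads(response_body)["data"], key=lambda d: d["index"])
    return [item["embedding"] for item in items]


# Sending is left to the caller, e.g.:
#   with urllib.request.urlopen(build_embeddings_request(texts, key)) as resp:
#       vectors = parse_embeddings(resp.read())
```

One request for N texts means one round trip regardless of N, which is where the latency saving comes from.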

What is the Semantic Similarity Matrix (BYO-key)?

Embed N texts with your own OpenAI API key and render a color-coded cosine-similarity heatmap matrix in the browser. Useful for deduplication analysis, near-duplicate detection, and visualizing semantic clusters in a dataset. It runs free in your browser on Gera Tools: your texts go only to OpenAI's API, and nothing is uploaded to Gera.

Semantic Similarity Matrix (BYO-key)

Name: Semantic Similarity Matrix (BYO-key)
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Visualize semantic similarity between texts

The Semantic Similarity Matrix embeds each of your texts using your own OpenAI API key, then renders an N×N heatmap of the cosine similarity between every pair. It turns a list of strings into an at-a-glance picture of which items are semantically close — invaluable for deduplication, dataset auditing, and cluster discovery.

How it works

All texts are sent in one batched request to the OpenAI Embeddings API. The returned vectors are kept in browser memory, and cosine similarity is computed locally for each pair: the dot product of two vectors divided by the product of their magnitudes. Values range from -1 to 1, where 1 means identical direction. Each cell is color-coded, with brighter green for higher similarity, so patterns jump out immediately. The diagonal is always 1.0.
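The pairwise computation described above can be sketched in a few lines. This is a plain-Python illustration of the formula, not the tool's own code; the function names are made up for the example, and it assumes no vector is all zeros (which would divide by zero).

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Build the symmetric N x N matrix; the diagonal is fixed at 1.0."""
    n = len(vectors)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each unordered pair is computed once and mirrored.
            score = cosine_similarity(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix
```

Because the matrix is symmetric, only N(N-1)/2 pairs need computing: 20 texts fill 400 cells but involve just 190 distinct comparisons.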

What cosine similarity actually measures

Cosine similarity is not “how many words match” — it measures the angle between two high-dimensional vectors. Two short sentences about the same concept but using entirely different words can score above 0.9 with a strong embedding model; two texts sharing many surface words but on different topics may score 0.5 or lower. This is what makes embeddings useful for semantic deduplication where simple string matching fails.
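For contrast, here is a minimal sketch of the kind of surface word matching that cosine similarity over embeddings is not. It computes Jaccard overlap on lowercase word sets; the function name, the tokenizer regex, and the example sentences are all illustrative assumptions, and no embedding scores are claimed for them.

```python
import re


def word_jaccard(a, b):
    """Shared words divided by total distinct words (surface overlap only)."""
    tokens_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same idea, no shared words: word overlap sees nothing in common.
print(word_jaccard("The car won't start", "My automobile fails to turn on"))  # 0.0

# Different topics, shared surface words: overlap is well above zero.
print(word_jaccard("The bank raised interest rates",
                   "The river bank raised flood concerns"))  # 0.375
```

A word-overlap score like this cannot tell paraphrases from unrelated text, which is the gap an embedding-based comparison is meant to close.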

Typical cosine similarity ranges for text embeddings using text-embedding-3-small:

Score range	Typical meaning
0.95 – 1.0	Near-identical or paraphrased
0.85 – 0.95	Closely related, same topic
0.70 – 0.85	Related, overlapping theme
Below 0.70	Distinct topics

These thresholds vary by embedding model and domain; calibrate against known pairs from your dataset to find the right cutoff for your use case.
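One way to make the calibration step concrete is to keep the bands in a small, editable table rather than hard-coding them. The sketch below uses the ranges from the table above as defaults; the function name is hypothetical, and assigning a boundary value such as 0.95 to the higher band is an assumption, since the table's ranges touch at their edges.

```python
# (lower bound, label), checked from highest to lowest.
DEFAULT_BANDS = [
    (0.95, "near-identical or paraphrased"),
    (0.85, "closely related, same topic"),
    (0.70, "related, overlapping theme"),
]


def describe_score(score, bands=DEFAULT_BANDS, fallback="distinct topics"):
    """Map a cosine similarity to the first band whose lower bound it meets."""
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return fallback
```

After checking a handful of known duplicate and known unrelated pairs from your own data, pass an adjusted `bands` list instead of relying on the defaults.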

Practical uses

Deduplication before indexing: paste product descriptions, support articles, or job postings to spot near-duplicate entries before they reach a search index or RAG pipeline. The matrix makes it immediately visible when two chunks are functionally identical even if the wording differs.
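If you want a list rather than a visual scan, the same check can be sketched as a pass over the upper triangle of the matrix. The function name and the 0.95 default are assumptions (the default is borrowed from the top band of the score table), and the matrix in the usage lines is made-up data for illustration.

```python
def near_duplicate_pairs(matrix, threshold=0.95):
    """Return (i, j, score) for off-diagonal pairs at or above the threshold."""
    pairs = []
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):  # skip the diagonal and mirrored cells
            if matrix[i][j] >= threshold:
                pairs.append((i, j, matrix[i][j]))
    # Highest-scoring pairs first, so the likeliest duplicates lead the list.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Made-up similarity matrix for three texts.
example = [
    [1.00, 0.97, 0.40],
    [0.97, 1.00, 0.42],
    [0.40, 0.42, 1.00],
]
print(near_duplicate_pairs(example))  # [(0, 1, 0.97)]
```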

Validating training data classes: for a classification dataset, paste a sample from each class. If classes you expect to be distinct show high cross-class similarity, the model may struggle to learn the boundary — surface that before training.
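A numeric version of this check, assuming you have the similarity matrix and one class label per text, is to average the scores between every pair of texts from different classes. The function name, labels, and scores below are illustrative assumptions; the tool itself only renders the heatmap.

```python
from itertools import combinations
from statistics import mean


def mean_cross_class_similarity(matrix, labels):
    """Average similarity for each pair of distinct class labels."""
    scores = {}
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] != labels[j]:
            # Sort the label pair so (a, b) and (b, a) share one key.
            key = tuple(sorted((labels[i], labels[j])))
            scores.setdefault(key, []).append(matrix[i][j])
    return {pair: mean(values) for pair, values in scores.items()}


# Made-up data: two "billing" samples and one "shipping" sample.
labels = ["billing", "billing", "shipping"]
matrix = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.6],
    [0.8, 0.6, 1.0],
]
result = mean_cross_class_similarity(matrix, labels)
```

A class pair whose average sits close to the within-class scores is a candidate for relabeling or merging before training.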

Clustering discovery: paste a diverse list of sentences to see which naturally group together. Bright clusters off the diagonal suggest the data has a hidden category structure worth making explicit.
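To turn bright off-diagonal blocks into explicit groups, one simple option is to link any two texts whose similarity meets a threshold and take the connected groups. This is a single-link grouping sketch under that assumption, not something the tool does; the function name and the 0.85 default are illustrative.

```python
def threshold_clusters(matrix, threshold=0.85):
    """Group indices connected by chains of scores at or above the threshold."""
    n = len(matrix)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:  # depth-first walk over above-threshold links
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if j not in seen and matrix[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(group))
    return clusters
```

Note that single-link grouping chains: if A is close to B and B is close to C, all three land in one group even when A and C score low against each other, so inspect large groups before treating them as one category.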

Tips

Look for unexpectedly bright off-diagonal cells: those are likely near-duplicates.
If two texts you expect to be unrelated score high, your embedding model may be picking up shared domain vocabulary rather than shared meaning — check the actual texts to distinguish.
Keep the set under 20 lines so the grid stays readable; the tool keeps all N² cells visible, but heatmaps with more than ~15 rows become hard to scan.
The embeddings request runs against OpenAI directly with your key; nothing is stored or sent to any other server.