What does PCA do to my embeddings?

Principal Component Analysis finds the directions (principal components) along which your vectors vary the most, then projects every vector onto the top two of those directions. This compresses, say, 1536 dimensions down to two coordinates you can plot, while preserving as much of the original variance as possible. The variance percentages tell you how faithful the 2D picture is.

Why is PCA different from t-SNE or UMAP?

PCA is linear and global — it preserves large-scale variance and the relative positions of far-apart groups, and it is fast and deterministic. t-SNE and UMAP are non-linear and emphasise local neighbourhoods, often separating clusters more cleanly but at the cost of distorting global distances and requiring tuning. PCA is the right first look; reach for t-SNE/UMAP when PCA's two components explain little variance.

How many vectors and dimensions can I paste?

The tool needs at least two vectors of at least two dimensions each, and all vectors must be the same length. It comfortably handles dozens to low hundreds of vectors of typical embedding sizes in the browser. Very large sets (thousands of vectors) are better projected with a dedicated library offline.
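The input rules above can be checked before any maths runs. The following is a minimal Python sketch of such a check, not the tool's own JavaScript; the function name `parse_vectors` and the error messages are assumptions made for illustration.

```python
import json


def parse_vectors(text):
    """Parse pasted JSON and enforce: >= 2 vectors, >= 2 dimensions, equal lengths."""
    data = json.loads(text)
    if not isinstance(data, list) or len(data) < 2:
        raise ValueError("expected a JSON array of at least two vectors")

    dim = None
    for i, vec in enumerate(data):
        if not isinstance(vec, list) or len(vec) < 2:
            raise ValueError(f"vector {i} must be an array of at least two numbers")
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in vec):
            raise ValueError(f"vector {i} contains a non-numeric value")
        if dim is None:
            dim = len(vec)
        elif len(vec) != dim:
            raise ValueError(f"vector {i} has length {len(vec)}, expected {dim}")

    return [[float(x) for x in vec] for vec in data]
```

Calling `parse_vectors("[[1, 2, 3], [4, 5, 6]]")` returns the two vectors as lists of floats, while mismatched lengths raise a `ValueError`.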

Is my data uploaded anywhere?

No. The covariance matrix, principal components (via power iteration), and projection are all computed locally in JavaScript. Nothing is sent to a server, so it is safe to paste embeddings of confidential text.

My points all overlap — what does that mean?

If points cluster on top of each other and the two components explain little variance, your items are either genuinely similar or the variation lives in dimensions PCA's first two components don't capture. Try labelling the points to confirm, or use a non-linear method that surfaces local structure.

What is the Embedding Dimension Visualizer?

Paste embedding vectors as JSON and get a browser-side PCA projection to 2D, rendered as a scatter plot. Add labels to colour points and reveal semantic clusters. The tool is free on Gera Tools and runs entirely in your browser, so nothing is uploaded.

Embedding Dimension Visualizer

Name: Embedding Dimension Visualizer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Embeddings are lists of hundreds or thousands of numbers — impossible to read directly. This tool projects them down to two dimensions with PCA and draws a scatter plot, so you can literally see whether semantically similar items land near each other. Everything runs in your browser.

How it works

Paste a JSON array of equal-length numeric arrays. The tool centres the data, builds the covariance matrix, and uses power iteration to extract the top two principal components — the two directions along which your vectors vary most. Every vector is projected onto those two axes and plotted. The percentage of total variance each component captures is shown underneath, telling you how trustworthy the flattened view is: high percentages mean the 2D picture faithfully reflects the real geometry.
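The steps above (centre, covariance, power iteration, project) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the tool's actual JavaScript: the function name `pca_2d`, the seeded random start vector, the fixed iteration count, and the use of deflation to obtain the second component are all choices made for this sketch.

```python
import math
import random


def pca_2d(vectors, iterations=300, seed=0):
    """Project equal-length vectors to 2D via covariance + power iteration."""
    n, d = len(vectors), len(vectors[0])

    # 1. Centre the data by subtracting the per-dimension mean.
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centred = [[v[j] - mean[j] for j in range(d)] for v in vectors]

    # 2. Build the d x d sample covariance matrix.
    cov = [[sum(row[a] * row[b] for row in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_variance = sum(cov[j][j] for j in range(d))

    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    components, explained = [], []
    for _component in range(2):
        # 3. Power iteration: apply the matrix, renormalise, repeat.
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        eigenvalue = 0.0
        for _step in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                # No variance left in any direction: fall back to a zero axis.
                v, eigenvalue = [0.0] * d, 0.0
                break
            # Once v is unit length, |cov @ v| approaches the eigenvalue.
            v, eigenvalue = [x / norm for x in w], norm
        components.append(v)
        share = 100.0 * eigenvalue / total_variance if total_variance > 0 else 0.0
        explained.append(share)

        # 4. Deflate: remove the found direction so the next pass finds PC2.
        for a in range(d):
            for b in range(d):
                cov[a][b] -= eigenvalue * v[a] * v[b]

    # 5. Project every centred vector onto the two components.
    coords = [[sum(r * c for r, c in zip(row, comp)) for comp in components]
              for row in centred]
    return coords, explained, components
```

`coords` holds one `[x, y]` pair per input vector and `explained` holds the two variance percentages. The sign of each component is arbitrary, so a plot may appear mirrored without changing which points are near each other.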

Add labels (one per line, matching the vector order) and points sharing a label are drawn in the same colour, making clusters jump out immediately.

Reading the plot

Points that sit close together have similar embeddings, at least along the two plotted components, which suggests the model considers those items semantically related. Well-separated coloured groups mean your embedding model cleanly distinguishes those categories, which is exactly what you want before building a retrieval or classification system on top. If everything piles into one blob and the variance percentages are low, the discriminating signal lives in dimensions PCA's first two components miss; that is the cue to try t-SNE or UMAP.

Understanding the variance explained

The two percentages shown under the plot — “PC1 explains X%” and “PC2 explains Y%” — tell you how much of the original high-dimensional variation survives the projection:

Combined variance above ~60–70% means the 2D picture is a reasonable representation of the true geometry. Clusters you see are likely real.
Combined variance 30–60% means the plot captures the strongest signal but misses significant structure. Treat it as exploratory.
Combined variance below 30% means most variation is orthogonal to the two plotted axes. The plot may show groups not because they are distant in the original space, but because they happen to differ along the two components PCA selected. A non-linear method like t-SNE or UMAP would surface more structure.

Embedding models tuned on domain-specific data often produce higher variance-explained scores on in-domain data, because the relevant variation tends to be concentrated in a smaller number of directions. Treat this as a rule of thumb, not a measure of model quality.
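The three bands described above can be expressed as a small helper. This is only a sketch: the function name `describe_projection` and the band names are hypothetical, and the 60% cut-off is one reading of the approximate "60–70%" boundary.

```python
def describe_projection(pc1_percent, pc2_percent):
    """Map the two variance percentages to a rough reading of the plot."""
    combined = pc1_percent + pc2_percent
    if combined >= 60:   # assumed cut-off; the guidance says roughly 60-70%
        return combined, "reasonable"
    if combined >= 30:
        return combined, "exploratory"
    return combined, "weak"
```

For example, `describe_projection(45, 20)` gives `(65, "reasonable")`, while `describe_projection(10, 5)` gives `(15, "weak")`.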

Practical use cases

Debugging retrieval

Embed a query and 10–15 candidate document chunks. Label the query as “query” and the candidates as “relevant” or “irrelevant” based on what you know should match. Paste all vectors and check whether the query point in the scatter plot lands closer to the relevant candidates. If an irrelevant chunk sits adjacent to the query, that chunk is likely to rank near the top of your retrieval results. Confirm this with similarity scores in the original space, since a 2D projection can place points closer together than they really are. You can then trace the problem back to chunking, text preprocessing, or the embedding model.
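To confirm what the plot suggests, the candidates can be ranked against the query in the original, unprojected space. The sketch below assumes your retriever ranks by cosine similarity, which is common but not stated by the tool; `cosine_similarity` and `rank_candidates` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """candidates is a list of (label, vector); returns (score, label), best first."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

If an "irrelevant" label appears above the "relevant" ones in the returned list, the problem is real and not just a side effect of flattening to two dimensions.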

Comparing embedding models

Embed a fixed set of labelled texts with two different models (for example text-embedding-3-small and a domain-tuned open-source model). Plot each and compare cluster separation. The model with tighter, more cleanly separated label groups is usually better suited to your retrieval or classification task.
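Judging "tighter, more cleanly separated" by eye can be backed up with a rough number. The sketch below computes the mean distance between label centroids divided by the mean distance of points to their own centroid. This ratio is an assumption made for illustration, not something the tool reports, and `separation_score` is a hypothetical name. It requires Python 3.8+ for `math.dist`.

```python
import math
from collections import defaultdict


def separation_score(points, labels):
    """Higher means label groups are further apart relative to their spread."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    if len(groups) < 2:
        raise ValueError("need at least two distinct labels")

    # Centroid of each label group (mean of every coordinate).
    centroids = {label: [sum(col) / len(pts) for col in zip(*pts)]
                 for label, pts in groups.items()}

    # Mean distance from each point to its own group's centroid.
    within = sum(math.dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)

    # Mean distance between every pair of centroids.
    names = list(centroids)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    between = sum(math.dist(centroids[a], centroids[b]) for a, b in pairs) / len(pairs)

    return math.inf if within == 0 else between / within
```

Run it on the same labelled items embedded by each model. Because it is a ratio it is less sensitive to overall scale than raw distances, but it remains a heuristic and should be read alongside the plots.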

Checking for label leakage or data contamination

In a machine learning project, embed training and test examples together and label them by split. If training and test points intermingle completely, that is consistent with a well-randomised split. If they separate into two distinct clusters, there may be systematic differences between splits (temporal leakage, domain shift, or stratification issues).
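The same check can be made numerically in the original space by asking how often a point's nearest neighbour belongs to the other split. This is a hedged sketch of one possible heuristic, not something the tool does; `cross_split_neighbour_rate` is a hypothetical name, and it uses Euclidean distance via `math.dist` (Python 3.8+).

```python
import math


def cross_split_neighbour_rate(vectors, splits):
    """Fraction of points whose nearest neighbour has a different split label."""
    if len(vectors) < 2:
        raise ValueError("need at least two vectors")
    crossings = 0
    for i, v in enumerate(vectors):
        # Index of the closest other point (brute force, fine for small sets).
        nearest = min((j for j in range(len(vectors)) if j != i),
                      key=lambda j: math.dist(v, vectors[j]))
        if splits[nearest] != splits[i]:
            crossings += 1
    return crossings / len(vectors)
```

A rate near zero means the training and test points keep to themselves, which matches the "two distinct clusters" picture. What counts as a healthy rate depends on the relative sizes of the splits, so compare against a shuffled relabelling of the same vectors if in doubt.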

Tips

Use it to debug retrieval: embed a query and a handful of candidate chunks, label them, and check the query lands nearest the chunks you expect.
Compare embedding models by plotting the same labelled items from each — the model with tighter, more separated clusters is usually the better choice for your domain.
PCA is deterministic, so the same input always gives the same plot — handy for documenting results.
If all points collapse to a single dense blob, the embedding model may be producing near-constant vectors for your domain — a strong signal to try a different model or fine-tuning.