How is storage calculated?

Bytes per vector = dimensions × bytes-per-component (float32 = 4, float16 = 2, int8 = 1, binary = 1 bit = 0.125). Total = vectors × bytes per vector. The estimate covers raw vector storage, not index overhead.

How accurate are the quality numbers?

They are typical ranges from published benchmarks (e.g. MTEB and vendor reports), not measurements on your own data. int8 commonly retains ~99% of float32 recall; binary typically retains ~90–96%, often recovered to ~98%+ with a re-ranking pass over the top candidates.

What is rescoring / re-ranking?

A two-stage retrieval where binary or int8 vectors do a fast first pass, then the top N candidates are re-scored with full-precision vectors. It recovers most of the quality lost to aggressive quantization at a small extra cost.

Is anything uploaded?

No. The calculation runs entirely in your browser. Nothing you enter is sent to a server, stored or logged.

What is the Embedding Quantization Estimator?

Enter your vector count, dimensions, and target precision (float16, int8, or binary) to see the exact storage reduction and the typical retrieval-quality impact based on published benchmarks, then plan your vector index budget around it. It runs free in your browser on Gera Tools, with nothing uploaded.

Embedding Quantization Estimator

Name: Embedding Quantization Estimator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Estimate the storage vs accuracy trade-off of quantizing embeddings

Embeddings are usually stored as float32 (four bytes per dimension), which gets expensive fast at scale. Quantizing to float16, int8, or binary can cut your vector index by 2× to 32×, but trades away some retrieval quality. This tool shows the exact storage reduction and the typical quality impact from published benchmarks, so you can pick a precision with your eyes open. It runs in your browser.

How it works

Raw vector storage is just arithmetic:

bytes_per_vector = dimensions × bytes_per_component
total_bytes      = vectors × bytes_per_vector

where a component costs 4 bytes (float32), 2 (float16), 1 (int8), or 1 bit = 0.125 bytes (binary). The estimator computes this for your inputs and expresses the saving as a multiple of the float32 baseline.
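The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the function names and the 768-dimension usage value are assumptions chosen for the example.

```python
# Bytes per component for each precision; binary is 1 bit = 0.125 bytes.
BYTES_PER_COMPONENT = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "binary": 0.125,
}


def bytes_per_vector(dimensions: int, precision: str) -> float:
    """Raw bytes for one vector, ignoring index overhead and any padding."""
    return dimensions * BYTES_PER_COMPONENT[precision]


def total_bytes(vectors: int, dimensions: int, precision: str) -> float:
    """Raw bytes for the whole collection."""
    return vectors * bytes_per_vector(dimensions, precision)


def reduction_vs_float32(precision: str) -> float:
    """How many times smaller a precision is than the float32 baseline."""
    return BYTES_PER_COMPONENT["float32"] / BYTES_PER_COMPONENT[precision]


# Hypothetical usage: a 768-dimensional model at each precision.
for name in BYTES_PER_COMPONENT:
    print(name, bytes_per_vector(768, name), f"{reduction_vs_float32(name):g}x")
```

The reduction factor depends only on the precision, which is why it stays the same for any vector count or dimension.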

Alongside the numbers it pairs each precision with the typical recall retention reported in the literature:

float16 — ~2× smaller, essentially lossless (~100% recall).
int8 — ~4× smaller, typically ~99% recall retained.
binary — ~32× smaller, typically ~90–96% recall, often recovered to ~98%+ with a full-precision re-ranking pass over the top candidates.

Worked example

For a vector index of 10 million vectors at 1,536 dimensions (a common embedding model output size):

Precision	Bytes per vector	Total raw storage	Reduction
float32	6,144	~57.2 GiB	baseline
float16	3,072	~28.6 GiB	2×
int8	1,536	~14.3 GiB	4×
binary	192	~1.8 GiB	32×

These are raw vector bytes before index overhead, with totals in binary gigabytes (GiB, 2^30 bytes); in decimal units the float32 total is about 61.4 GB. In a production vector database with HNSW indexing, actual memory usage is typically 1.5–3× higher. Even so, the binary representation brings roughly 57 GiB of vectors down to a few gigabytes, a meaningful difference when deciding whether to run a vector database on-disk, in memory, or on a specific instance type.
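The worked example can be reproduced with a short script. This sketch assumes the totals are expressed in units of 2^30 bytes, which is what the 57.2 figure for float32 corresponds to. It applies the 1.5–3× range quoted above as a flat multiplier, which is a simplification: real index overhead depends on the database and its settings.

```python
GIB = 2**30  # the example totals correspond to 2**30-byte units

VECTORS = 10_000_000
DIMENSIONS = 1_536
BITS_PER_COMPONENT = {"float32": 32, "float16": 16, "int8": 8, "binary": 1}

# Assumed overhead multipliers, taken from the 1.5-3x range quoted in the text.
OVERHEAD_LOW, OVERHEAD_HIGH = 1.5, 3.0


def raw_gib(precision: str) -> float:
    """Raw vector storage for the example collection, in GiB."""
    total_bits = VECTORS * DIMENSIONS * BITS_PER_COMPONENT[precision]
    return total_bits / 8 / GIB


for precision in BITS_PER_COMPONENT:
    raw = raw_gib(precision)
    print(
        f"{precision:8s} raw {raw:6.2f} GiB, "
        f"with index roughly {raw * OVERHEAD_LOW:.1f} to {raw * OVERHEAD_HIGH:.1f} GiB"
    )
```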

How int8 quantization works

int8 quantization maps each float32 value to an integer in the range -128 to +127. The component values are first scaled so that their meaningful range spreads across the 256 available integer levels, and each scaled value is then rounded to the nearest integer. Because the cosine similarity of two unit vectors depends primarily on the sign and relative magnitude of their components, not on absolute precision, most of the relevant information survives this compression, which is why recall retention is so high.
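As a rough illustration, here is one simple scheme: symmetric scaling by each vector's largest absolute value. This is an assumption made for the sketch, not necessarily what any particular library or database does. Production systems often calibrate the range per dimension over a sample of embeddings instead. The function names and sample vectors are made up for the example.

```python
import math


def quantize_int8(vector):
    """Symmetric max-abs scaling.

    Returns (integers in [-127, 127], scale). This variant leaves -128 unused
    so that the range is symmetric and 0.0 maps exactly to 0.
    """
    max_abs = max(abs(x) for x in vector)
    if max_abs == 0.0:
        return [0] * len(vector), 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(x / scale))) for x in vector]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Approximate reconstruction of the original float values."""
    return [q * scale for q in quantized]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two made-up vectors that point in similar directions.
a = [0.12, -0.40, 0.33, 0.05, -0.27, 0.61]
b = [0.10, -0.35, 0.40, -0.02, -0.30, 0.55]
qa, scale_a = quantize_int8(a)
qb, scale_b = quantize_int8(b)

print("float32 cosine:", cosine(a, b))
print("int8 cosine:   ", cosine(qa, qb))
```

Cosine similarity ignores each vector's overall scale, so it can be computed directly on the integer codes without dequantizing first.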

How binary quantization works

Binary quantization is more aggressive: each dimension becomes a single bit (1 if the float32 value is positive, 0 otherwise). The Hamming distance between two binary vectors (the count of differing bits) becomes the similarity metric, which is extremely fast to compute because it reduces to an XOR followed by a bit count. The trade-off is that a lot of nuance is discarded, hence the wider quality range of ~90–96% of float32 recall.
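A minimal sketch of the sign-bit rule and the Hamming distance, using a plain Python integer as the packed bit vector. The function names and the two 8-dimensional sample vectors are illustrative assumptions. A real system would pack bits into byte arrays and use hardware bit-count instructions.

```python
def binarize(vector):
    """Pack sign bits into an int: bit i is 1 if vector[i] > 0, else 0."""
    bits = 0
    for i, value in enumerate(vector):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming(a_bits, b_bits):
    """Number of bit positions where the two codes differ (XOR, then count)."""
    return bin(a_bits ^ b_bits).count("1")


# Two made-up vectors whose signs disagree in exactly two dimensions.
a = [0.12, -0.40, 0.33, 0.05, -0.27, 0.61, -0.08, 0.19]
b = [0.10, -0.35, 0.40, -0.02, -0.30, 0.55, 0.04, 0.22]

a_bits, b_bits = binarize(a), binarize(b)
print(f"{a_bits:08b} vs {b_bits:08b}: Hamming distance {hamming(a_bits, b_bits)}")
```

Both disagreements come from components close to zero (0.05 vs -0.02 and -0.08 vs 0.04). That is where sign-only encoding is noisiest.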

The binary + rescoring pattern recovers most of this loss: retrieve the top 100–200 candidates using fast binary search, then recompute cosine similarity using the full float32 vectors for just those candidates. The rescoring step is cheap because it operates on a small set, and the resulting quality is close to pure float32 at a fraction of the storage and search cost.
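The two-stage pattern described above can be sketched as follows. This is a toy version under stated assumptions: a brute-force Hamming scan stands in for a real binary index, the data is random, and names such as search_with_rescoring, top_k, and candidates are hypothetical.

```python
import math
import random


def binarize(vector):
    """Pack sign bits into an int: bit i is 1 if vector[i] > 0."""
    bits = 0
    for i, value in enumerate(vector):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming(a_bits, b_bits):
    return bin(a_bits ^ b_bits).count("1")


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_with_rescoring(query, vectors, binary_codes, top_k=10, candidates=100):
    """Stage 1: shortlist by Hamming distance on the binary codes.
    Stage 2: re-rank only the shortlist by cosine on full-precision vectors."""
    query_bits = binarize(query)
    shortlist = sorted(
        range(len(vectors)),
        key=lambda i: hamming(query_bits, binary_codes[i]),
    )[:candidates]
    rescored = sorted(shortlist, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return rescored[:top_k]


# Toy data: 2,000 random 64-dimensional vectors (illustrative only).
random.seed(0)
dims = 64
vectors = [[random.gauss(0, 1) for _ in range(dims)] for _ in range(2000)]
binary_codes = [binarize(v) for v in vectors]

# The query is a slightly noisy copy of vector 42, so 42 should rank first.
query = [x + random.gauss(0, 0.1) for x in vectors[42]]
results = search_with_rescoring(query, vectors, binary_codes)
print(results[:3])
```

Only the shortlist touches the full-precision vectors, so those can live on slower storage while the compact binary codes stay in memory.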

Choosing the right precision

float16: the safe starting point for any production system. Negligible quality loss, 2× storage saving, broadly supported by GPU and CPU hardware.
int8: strong choice when storage or memory is a constraint and you can tolerate up to ~1% recall drop. Well supported in most major vector databases.
binary: appropriate when you have very large indexes (hundreds of millions of vectors) and can run a rescoring pass. Validate carefully on your own evaluation set before deploying.
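One way to turn this guidance into a first-pass check is to pick the most precise format whose raw storage fits a memory budget. This is a hypothetical helper, not part of the tool. The function name, the overhead parameter, and the 16 GiB budget are assumptions, and it considers storage only, so recall should still be validated on your own evaluation set.

```python
BYTES_PER_COMPONENT = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "binary": 0.125}

# Ordered from most to least precise.
PREFERENCE = ["float32", "float16", "int8", "binary"]


def pick_precision(vectors, dimensions, budget_bytes, overhead=1.0):
    """Return the most precise format that fits the budget, or None.

    `overhead` is an assumed flat multiplier for index structures
    (1.0 means raw vectors only).
    """
    for precision in PREFERENCE:
        needed = vectors * dimensions * BYTES_PER_COMPONENT[precision] * overhead
        if needed <= budget_bytes:
            return precision
    return None


# Hypothetical budget: 16 GiB for 10 million 1,536-dimensional vectors.
print(pick_precision(10_000_000, 1_536, 16 * 2**30))
```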