Definition
An embedding is a dense vector — a fixed-length list of numbers — that encodes the meaning of a piece of data in a high-dimensional space. Instead of treating words as opaque symbols, an embedding model maps each input to coordinates such that semantically similar items land near each other. The classic example is word arithmetic: the vectors for “king” minus “man” plus “woman” land close to “queen”, showing that relationships are captured geometrically.
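The word arithmetic above can be sketched with a toy example. The three-dimensional vectors below are hand-picked, hypothetical values chosen so the analogy works; they are not output from any real model, whose vectors would have hundreds of dimensions. The function names `cosine` and `analogy` are likewise illustrative.

```python
import math

# Hand-picked toy vectors (hypothetical values, not from a real model).
# The axes loosely stand for: royalty, maleness, femaleness.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c].

    The three input words are excluded from the candidates, as is
    conventional when evaluating word analogies.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With these toy values the nearest remaining word is "queen". Real models show the same effect only approximately, and not for every word pair.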
What can be embedded
Embeddings are not limited to text:
- Text embeddings turn words, sentences, or whole documents into vectors for search, clustering, and classification.
- Image embeddings represent pictures so visually similar images sit close together.
- Multimodal embeddings (such as those produced by CLIP) place text and images in the same space, so a caption and its matching photo align.
This shared geometry is what lets you query images with text, or find related documents regardless of exact wording.
How similarity is measured
Once data is embedded, comparing items becomes a geometry problem. The most common metric is cosine similarity, which measures the angle between two vectors and returns a value between -1 and 1: 1 means “pointing the same way” (very similar), 0 means orthogonal (unrelated), and -1 means pointing in opposite directions. Euclidean distance and dot product are also used. Because each comparison is just maths on vectors, it is fast; at the scale of millions of items, search usually relies on an approximate nearest-neighbour index rather than comparing the query against every stored vector.
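As a minimal sketch of the three metrics named above, the following uses only the Python standard library. The function names are illustrative, and the short two-dimensional vectors stand in for real embeddings.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

def euclidean_distance(a, b):
    """Straight-line distance between the two points; 0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Same direction, different length: cosine similarity is close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: cosine similarity is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite directions: cosine similarity is close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Cosine similarity ignores vector length, while dot product and Euclidean distance do not. When vectors are normalised to unit length, the dot product equals the cosine similarity, so the three metrics rank neighbours in the same order.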
Dimensions and models
Embedding models output vectors of a fixed size — commonly 384, 768, 1,536, or 3,072 dimensions. More dimensions can encode finer distinctions but increase storage and comparison cost. Popular models include OpenAI’s text-embedding-3 family and many open-source sentence-transformer models. The right choice balances quality, vector size, and price for your data volume.
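To make the storage trade-off concrete, here is a rough back-of-the-envelope calculation. It assumes each dimension is stored as a 32-bit float (4 bytes) and ignores index overhead, metadata, and any compression or quantisation, so real systems will differ. The helper `raw_storage_bytes` is a hypothetical name.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each dimension is stored as a 32-bit float

def raw_storage_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of the vectors alone, excluding index and metadata overhead."""
    return num_vectors * dimensions * bytes_per_value

# Compare the common vector sizes for a collection of one million items.
for dims in (384, 768, 1536, 3072):
    gib = raw_storage_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>5} dims: {gib:.2f} GiB per million vectors")
```

Under these assumptions, raw storage grows linearly with dimension count: going from 384 to 3,072 dimensions multiplies it by eight.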
Why embeddings matter
Embeddings are the backbone of modern semantic systems. They power semantic search, recommendation engines, clustering and deduplication, and, most importantly, retrieval-augmented generation (RAG), where relevant documents are found by vector similarity and fed into an LLM’s context. Specialised vector databases such as Pinecone, Weaviate, and Qdrant, along with extensions such as pgvector for PostgreSQL, exist specifically to store embeddings and run fast nearest-neighbour searches over them.
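The retrieval step of RAG can be sketched as a brute-force nearest-neighbour search over a small in-memory corpus. Everything here is illustrative: the documents, their three-dimensional vectors, the query vector, and the prompt wording are hypothetical stand-ins. In practice an embedding model would produce the vectors and a vector database would run the search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, corpus, k=2):
    """Return the k texts whose vectors are most similar to the query.

    corpus is a list of (text, vector) pairs. This compares the query
    against every vector, which is fine for a demo but is the step a
    vector database replaces with an index.
    """
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical documents with hand-written vectors (not real model output).
corpus = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.1, 0.9, 0.1]),
    ("Return shipping is free for damaged items.", [0.8, 0.2, 0.1]),
]

question = "How do I get my money back?"
query_vector = [0.85, 0.15, 0.05]  # hypothetical embedding of the question

retrieved = top_k(query_vector, corpus, k=2)

# The retrieved passages are placed in the prompt that would be sent to an LLM.
prompt = (
    "Answer using this context:\n"
    + "\n".join(retrieved)
    + "\n\nQuestion: "
    + question
)
print(prompt)
```

Note that the question shares no keywords with the refund document; the match comes entirely from the vectors being close together.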