The core idea
Word2vec is a method, published by a Google team in 2013, for turning words into dense numeric vectors called embeddings. Its guiding principle is the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. By training a small neural network to predict context, word2vec arranges words in a high-dimensional space where semantically related words sit close together. This replaced clumsy one-hot encodings — where every word is an isolated symbol — with a continuous map of meaning.
Skip-gram and CBOW
Word2vec comes in two flavours, defined by what they predict. In skip-gram, the network is given a single word and learns to predict the words around it. This handles rare words well because each occurrence generates several training signals. In CBOW (Continuous Bag of Words), the network does the opposite: it predicts a target word from the average of its surrounding context words. CBOW is faster and works well for frequent words. Both objectives nudge the network to assign similar vectors to words that share contexts, and the hidden layer weights become the word embeddings.
Vector arithmetic and analogies
The most striking result was that relationships became directions in the vector space. The classic demonstration is the equation king − man + woman ≈ queen: subtracting the “man” vector and adding the “woman” vector moves you from king to a point near queen. Similar arithmetic captured capital-of-country and verb-tense relationships. This showed embeddings encode structured semantic relationships, not merely a similarity score — a property that made them genuinely useful rather than just a clever compression.
Why it mattered
Before word2vec, most natural-language systems treated words as discrete IDs with no notion that “happy” and “joyful” are related. Learned embeddings let downstream models start from a representation that already encodes meaning, improving everything from sentiment analysis to machine translation. Word2vec was also fast and trainable on ordinary hardware over large text corpora, which helped the idea spread quickly. It kicked off a wave of embedding methods such as GloVe and fastText.
Word2vec’s legacy
Modern systems rarely use word2vec directly. Its embeddings are static: the word “bank” gets one vector regardless of whether it means a riverbank or a financial institution. Transformer models like BERT produce contextual embeddings that change with the sentence, which is far more powerful. But the core idea — represent meaning as a learned vector and let geometry capture similarity — is the direct ancestor of the embeddings that power today’s semantic search, recommendation engines, and retrieval-augmented generation. Word2vec proved that idea worked, and everything since builds on it.