The core idea
BERT — Bidirectional Encoder Representations from Transformers — is a language model Google released in 2018 that transformed natural language processing. Its key insight was to read text bidirectionally: rather than processing words strictly left to right, BERT looks at the entire sentence at once, using context on both sides of every word. This produces deep representations of meaning that can be adapted to many tasks. BERT is built from the encoder half of the transformer architecture, which is optimised for understanding rather than generating text.
Masked language modelling
BERT’s main pre-training trick is masked language modelling (MLM). During training, roughly 15%
of the words in each input are randomly replaced with a special [MASK] token, and the model must
predict the original words from the surrounding context. Because the gaps can be filled using words
both before and after them, the model is forced to build a genuinely bidirectional understanding of
language. This is what distinguishes BERT from earlier left-to-right models, which could only use
preceding context.
Next sentence prediction
BERT was also trained on a second objective: next sentence prediction (NSP). The model is shown two sentences and must decide whether the second genuinely follows the first or is a random unrelated sentence. This was meant to help BERT learn relationships between sentences, useful for tasks like question answering and natural language inference. Later research found NSP added little value, and follow-up models such as RoBERTa dropped it — but it was part of the original recipe.
Encoder-only architecture
BERT uses only the transformer’s encoder stack. The encoder processes all input tokens in parallel
through layers of self-attention, where each token gathers information from every other token. Because
there is no decoder generating output token by token, BERT does not write text in the usual sense.
Instead it outputs a rich vector for every input token plus a special [CLS] summary vector. These
representations are what downstream tasks build on. This design contrasts with GPT, which is
decoder-only and generates text left to right.
Pre-training, fine-tuning, and legacy
BERT popularised the pre-train then fine-tune workflow that now dominates NLP. After expensive pre-training on a massive text corpus, the same model can be fine-tuned on a small labelled dataset for a specific task — sentiment analysis, named entity recognition, question answering — by adding a thin task-specific layer and training briefly. Because BERT already understands language, fine-tuning is cheap and effective. BERT and its descendants (RoBERTa, DistilBERT, ALBERT) became the default for understanding-heavy applications, and they remain widely used in search ranking, classification, and embeddings even as generative models grab the headlines.