What is BERT in simple terms?

BERT (Bidirectional Encoder Representations from Transformers) is a model Google released in 2018 that reads a whole sentence in both directions at once to understand language. It is pre-trained on huge amounts of text by filling in masked words, then fine-tuned for specific tasks like classification or question answering. It set new state-of-the-art results across many NLP benchmarks.

What is masked language modelling?

Masked language modelling (MLM) is BERT's main pre-training task. Some words in the input are randomly hidden with a mask token, and the model must predict the missing words using the surrounding context on both sides. Because it can see words before and after the gap, BERT learns rich bidirectional representations of meaning.

How is BERT different from GPT?

BERT is encoder-only and bidirectional, designed to understand text by looking at the full context at once, which makes it strong at classification and question answering. GPT is decoder-only and reads left to right to generate text one token at a time. BERT understands; GPT generates — though modern large models blur this line.

What does fine-tuning BERT mean?

Fine-tuning takes the pre-trained BERT model and continues training it on a smaller labelled dataset for a specific task, like sentiment classification or named entity recognition. Because BERT already understands language from pre-training, fine-tuning needs far less data and compute than training from scratch and usually delivers strong results quickly.

What Is BERT? Google's Bidirectional Transformer Explained

The core idea

BERT — Bidirectional Encoder Representations from Transformers — is a language model Google released in 2018 that transformed natural language processing. Its key insight was to read text bidirectionally: rather than processing words strictly left to right, BERT looks at the entire sentence at once, using context on both sides of every word. This produces deep representations of meaning that can be adapted to many tasks. BERT is built from the encoder half of the transformer architecture, which is optimised for understanding rather than generating text.

Masked language modelling

BERT’s main pre-training trick is masked language modelling (MLM). During training, roughly 15% of the tokens in each input are selected at random for prediction. Most of the selected tokens (80%) are replaced with a special [MASK] token, 10% are swapped for a random token, and 10% are left unchanged; in every case the model must predict the original token from the surrounding context. Because the gaps can be filled using words both before and after them, the model is forced to build a genuinely bidirectional understanding of language. This is what distinguishes BERT from earlier left-to-right models, which could only use preceding context.
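The masking step can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: it assumes whitespace tokens instead of WordPiece, picks each position independently with probability 0.15, and applies the 80/10/10 split described in the BERT paper (mask, random token, unchanged). The function name mask_tokens and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) using an 80/10/10 masking split.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

# Toy input: whitespace tokens stand in for WordPiece tokens (assumption)
tokens = "the cat sat on the mat because it was tired".split()
vocab = sorted(set(tokens))

corrupted, labels = mask_tokens(tokens, vocab, seed=0)
print(corrupted)
print(labels)
```

The model only receives the corrupted sequence; the labels list is what the training loss is computed against, and positions holding None are ignored.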

Next sentence prediction

BERT was also trained on a second objective: next sentence prediction (NSP). The model is shown two sentences and must decide whether the second genuinely follows the first or is a random unrelated sentence. This was meant to help BERT learn relationships between sentences, useful for tasks like question answering and natural language inference. Later research found NSP added little value, and follow-up models such as RoBERTa dropped it — but it was part of the original recipe.
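As a rough sketch of how NSP training examples could be assembled: half the time the second sentence is the true successor, and half the time it is drawn from a different document, matching the 50/50 split described in the BERT paper. The helper name make_nsp_pairs and the sample documents are hypothetical, and the original pipeline works on longer text segments, not single sentences.

```python
import random

def make_nsp_pairs(documents, seed=None):
    """Build (sentence_a, sentence_b, is_next) examples.

    documents: a list of documents, each an ordered list of sentences.
    Requires at least two documents so a negative example can be drawn.
    """
    rng = random.Random(seed)
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that really follows
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: a sentence from some other document
                other_docs = [d for j, d in enumerate(documents) if j != doc_index]
                random_sentence = rng.choice(rng.choice(other_docs))
                pairs.append((doc[i], random_sentence, False))
    return pairs

# Hypothetical two-document corpus
documents = [
    ["She opened the door.", "The room was dark.", "She reached for the switch."],
    ["Interest rates rose in March.", "Bond prices fell in response."],
]

for sentence_a, sentence_b, is_next in make_nsp_pairs(documents, seed=0):
    print(is_next, "|", sentence_a, "->", sentence_b)
```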

Encoder-only architecture

BERT uses only the transformer’s encoder stack. The encoder processes all input tokens in parallel through layers of self-attention, where each token gathers information from every other token. Because there is no decoder generating output token by token, BERT does not write text in the usual sense. Instead it outputs a rich vector for every input token plus a special [CLS] summary vector. These representations are what downstream tasks build on. This design contrasts with GPT, which is decoder-only and generates text left to right.
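The difference between encoder-style and decoder-style attention comes down to which positions each token is allowed to see. The sketch below computes attention weights for a four-token input twice: once with full visibility (BERT-like) and once with a causal mask (GPT-like). It is a simplified single-head illustration in plain Python; the function name attention_weights and the all-zero scores are assumptions chosen to keep the output easy to read.

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over a square matrix of attention scores.

    With causal=True, position i may only attend to positions j <= i.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = [j for j in range(n) if not causal or j <= i]
        peak = max(scores[i][j] for j in visible)  # subtract max for stability
        exps = {j: math.exp(scores[i][j] - peak) for j in visible}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

# Hypothetical raw scores: all equal, so only the mask shapes the result
scores = [[0.0] * 4 for _ in range(4)]

encoder_style = attention_weights(scores)               # every token sees all four
decoder_style = attention_weights(scores, causal=True)  # token i sees tokens 0..i

print(encoder_style[0])  # [0.25, 0.25, 0.25, 0.25]
print(decoder_style[0])  # [1.0, 0.0, 0.0, 0.0]
print(decoder_style[1])  # [0.5, 0.5, 0.0, 0.0]
```

In the encoder case the first token already draws on the whole input; in the causal case it can only attend to itself, which is what makes left-to-right generation possible.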

Pre-training, fine-tuning, and legacy

BERT popularised the pre-train then fine-tune workflow that now dominates NLP. After expensive pre-training on a massive text corpus, the same model can be fine-tuned on a small labelled dataset for a specific task — sentiment analysis, named entity recognition, question answering — by adding a thin task-specific layer and training briefly. Because BERT already understands language, fine-tuning is cheap and effective. BERT and its descendants (RoBERTa, DistilBERT, ALBERT) became the default for understanding-heavy applications, and they remain widely used in search ranking, classification, and embeddings even as generative models grab the headlines.
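To make the "thin task-specific layer" concrete, here is a minimal sketch of a classification head: one linear layer plus softmax, trained with gradient descent on cross-entropy loss. Two loud assumptions apply. First, the four-dimensional input vectors are hypothetical stand-ins for the [CLS] vectors a real encoder would produce. Second, the encoder is treated as frozen here, whereas actual fine-tuning usually also updates the encoder's own weights. The class name ClassificationHead is illustrative only.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class ClassificationHead:
    """A single linear layer mapping a [CLS]-style vector to label probabilities."""

    def __init__(self, hidden_size, num_labels, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
                        for _ in range(num_labels)]
        self.bias = [0.0] * num_labels

    def predict_proba(self, cls_vector):
        logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(logits)

    def train_step(self, cls_vector, label, lr=0.5):
        """One gradient-descent step on cross-entropy loss for one example."""
        probs = self.predict_proba(cls_vector)
        for k, row in enumerate(self.weights):
            grad = probs[k] - (1.0 if k == label else 0.0)
            for j, x in enumerate(cls_vector):
                row[j] -= lr * grad * x
            self.bias[k] -= lr * grad
        return -math.log(probs[label])  # loss measured before the update

# Hypothetical stand-ins for encoder output, with labels 1 = positive, 0 = negative
examples = [([1.0, 0.2, 0.0, 0.1], 1), ([0.9, 0.1, 0.1, 0.0], 1),
            ([0.0, 0.1, 1.0, 0.8], 0), ([0.1, 0.0, 0.9, 1.0], 0)]

head = ClassificationHead(hidden_size=4, num_labels=2)
for epoch in range(20):
    losses = [head.train_step(vec, label) for vec, label in examples]

print("mean loss in final epoch:", sum(losses) / len(losses))
```

The head itself has only hidden_size × num_labels weights plus a bias per label, which is why the task-specific part of a fine-tuned model is tiny compared with the pre-trained encoder underneath it.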

The two model sizes and their scale

The original paper released two configurations, and their names are still used as shorthand:

Model	Transformer layers	Hidden size	Attention heads	Parameters
BERT-Base	12	768	12	~110 million
BERT-Large	24	1,024	16	~340 million

Both were pre-trained on the BooksCorpus (~800M words) plus English Wikipedia (~2,500M words). By 2018 standards these were large models; by today’s standards a 340M-parameter encoder is small and cheap to fine-tune, which is precisely why BERT-family models are still deployed at scale where a multi-billion-parameter generative model would be overkill.
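The parameter counts in the table can be roughly reproduced from the layer count and hidden size. The sketch below makes several assumptions that are not stated in the text above: a 30,522-token WordPiece vocabulary, 512 position embeddings, 2 segment embeddings, a feed-forward inner size of 4 × hidden, and a pooler layer on the [CLS] vector, as in the commonly distributed uncased checkpoints. The number of attention heads does not appear because the heads split the hidden size without adding weights. The function name bert_parameter_count is hypothetical.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522,
                         max_positions=512, segment_types=2):
    """Approximate weight count of a BERT encoder (no pre-training heads)."""
    feed_forward = 4 * hidden  # assumed inner size of the feed-forward block
    layer_norm = 2 * hidden    # one scale and one shift vector

    # Token, position, and segment embedding tables plus their LayerNorm
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

    # Query, key, value, and output projections, each with a bias
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> feed_forward -> hidden, each with a bias
    ffn = (hidden * feed_forward + feed_forward) + (feed_forward * hidden + hidden)
    per_layer = attention + layer_norm + ffn + layer_norm

    pooler = hidden * hidden + hidden  # dense layer applied to the [CLS] vector
    return embeddings + layers * per_layer + pooler

base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)

print(f"BERT-Base:  {base / 1e6:.1f}M")   # 109.5M
print(f"BERT-Large: {large / 1e6:.1f}M")  # 335.1M
```

Under these assumptions the totals land close to the rounded figures in the table, and most of the weights sit in the stacked transformer layers, with the embedding table as the second-largest component.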

Why “bidirectional” was the breakthrough

Earlier context-aware models faced a chicken-and-egg problem: a standard next-word language modelling objective cannot simply be conditioned on both directions at once, because in a multi-layer network each word would then be able to “see itself” through the future context and prediction becomes trivial. Masked language modelling sidesteps this by hiding the target word, so the model must reconstruct it from surrounding context without being handed the answer. This is what let BERT condition every token’s representation on the entire sentence, and it is the design choice most responsible for its state-of-the-art results on the GLUE and SQuAD benchmarks in 2018.

Choosing between a BERT-style encoder and a GPT-style decoder

BERT is an understanding model, not a generation model, and that distinction still maps cleanly onto real deployment choices:

Use a BERT-family encoder for classification, entity extraction, retrieval/embeddings, re-ranking search results, and semantic similarity — tasks where you feed in text and want a label, a span, or a vector out.
Use a decoder (GPT-family) model when you need to produce fluent text: chat, summarisation, code generation.

Modern retrieval-augmented and semantic-search systems frequently pair the two — a BERT-style encoder builds the embeddings that find relevant documents, and a decoder writes the answer. Google itself reported using BERT to improve understanding of search queries from 2019 onward.
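The retrieval half of that pairing usually reduces to comparing vectors. The sketch below ranks documents by cosine similarity to a query. The three-dimensional vectors are hypothetical stand-ins for the embeddings an encoder would produce (real ones have hundreds of dimensions), and the document ids and function names are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vector, document_vectors):
    """Return document ids sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), doc_id)
              for doc_id, vec in document_vectors.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical embeddings; a real system would get these from an encoder
document_vectors = {
    "returns-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "gift-cards": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.3, 0.0]  # e.g. an embedded "how do I send an item back?"

print(rank_documents(query_vector, document_vectors))
# ['returns-policy', 'shipping-times', 'gift-cards']
```

In a retrieval-augmented setup, the top-ranked documents would then be passed to the decoder model as context for writing the answer.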

Where BERT sits in 2026

BERT-style encoders did not disappear when generative LLMs arrived; they moved into the plumbing. Embedding models used for semantic search and RAG retrieval, rerankers, content moderation classifiers, and named-entity extractors are very often encoder architectures descended from BERT, because a bidirectional encoder is typically cheaper to run than a generative model, and often just as accurate or better, for classify-and-score workloads. Knowing the encoder / decoder distinction is still the fastest way to predict which architecture a production NLP task actually needs.

Sources

Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (2018) — the original paper.
Google — “Understanding searches better than ever before” (2019) — BERT in Google Search.
Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach” (2019) — which dropped NSP and improved results.