What self-supervised learning is
Self-supervised learning is the training approach behind most large modern AI models. The data carries no human-provided labels, yet the model still learns from a supervised-style prediction task, because the labels are generated automatically from the data itself. The classic move is to hide part of the input and ask the model to predict the hidden part. Since the correct answer is simply the piece that was hidden, you can manufacture training examples from raw text, images, audio or video in practically unlimited quantities, without paying annotators. That is what makes web-scale training possible.
Pretext tasks: free labels from raw data
The invented prediction problem is called a pretext task. Solving it is not the real goal; the goal is the representations the model builds along the way. Because predicting a masked word well requires understanding grammar, meaning and world knowledge, the model is forced to learn genuinely useful internal features. After pre-training, those features transfer to real downstream tasks like classification, search, or answering questions, often with only a small amount of labelled fine-tuning data.
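To make the "free labels" idea concrete, here is a minimal sketch of a hide-and-predict pretext task on image-like data. The function name `make_patch_example`, the 4x4 grid of numbers standing in for an image, and the fill value of 0 are all illustrative assumptions, not part of any particular library or published method.

```python
import copy

def make_patch_example(image, top, left, size, fill=0):
    """Hide a size x size patch of a 2D grid.

    Returns (masked_image, target_patch). The target is the label,
    and it comes straight from the data: no annotator is involved.
    """
    # The label is simply the region we are about to hide.
    target = [row[left:left + size] for row in image[top:top + size]]
    # Copy the image so the original data is left untouched.
    masked = copy.deepcopy(image)
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = fill
    return masked, target

# A toy 4x4 "image" (hypothetical data for illustration).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

masked, target = make_patch_example(image, top=1, left=1, size=2)
print(masked)  # [[1, 2, 3, 4], [5, 0, 0, 8], [9, 0, 0, 12], [13, 14, 15, 16]]
print(target)  # [[6, 7], [10, 11]]
```

A model trained on many such pairs would receive `masked` as input and be scored on how well it reconstructs `target`. Choosing a different patch position yields a new training example from the same image.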
Next-token prediction (GPT)
The dominant pretext task for language models is next-token prediction: given a sequence of tokens, predict the one that comes next. Every position in every document provides a free training example. By doing this over hundreds of billions to trillions of tokens, models in the GPT family learn grammar, facts, reasoning patterns and style, all without explicit labels. The same objective also makes the model a text generator: repeatedly predicting the next token and appending it to the input produces new text.
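The claim that every position provides a free training example can be sketched in a few lines. This is an illustration under simplifying assumptions: tokens are produced by splitting on whitespace (real models use subword tokenizers), and the helper `next_token_examples` and its `context_size` parameter are hypothetical names.

```python
def next_token_examples(tokens, context_size=None):
    """Turn one token sequence into (context, next_token) training pairs.

    context_size=None keeps the full left context; an integer keeps
    only that many most recent tokens.
    """
    examples = []
    for i in range(1, len(tokens)):
        start = 0 if context_size is None else max(0, i - context_size)
        # Input: everything to the left of position i. Label: token i.
        examples.append((tokens[start:i], tokens[i]))
    return examples

# Whitespace splitting is an assumption made for readability.
tokens = "the cat sat on the mat".split()

for context, target in next_token_examples(tokens):
    print(context, "->", target)
```

A sequence of six tokens yields five training pairs, the first being `['the'] -> cat`. The model only ever sees the tokens to the left of the one it must predict.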
Masked prediction (BERT)
A related strategy is masked language modelling, used by BERT. Here a random subset of the tokens in a sentence (about 15% in the original BERT recipe) is hidden, and the model must fill them in using the words on both sides. Because it sees the full context rather than only the left side, a BERT-style model builds representations well suited to tasks like classification, search and named-entity recognition.
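The masking step can be sketched as follows. This is a simplified illustration, not BERT's exact procedure: it always substitutes a `[MASK]` placeholder, whereas the original recipe replaces some selected tokens with random tokens or leaves them unchanged. The function name `mask_tokens`, the whitespace tokenisation, and the fixed seed are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (masked_tokens, labels), where labels maps each hidden
    position to the original token the model should recover.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    # Mask roughly mask_prob of the tokens, and always at least one.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # the label comes from the data itself
        masked[pos] = MASK
    return masked, labels

tokens = "the model must fill in the hidden words using both sides".split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

Unlike next-token prediction, the model here can read tokens on both sides of each `[MASK]` when making its guess.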
Contrastive learning (CLIP)
Self-supervision works beyond text. Contrastive learning teaches a model to pull matching pairs together and push mismatched pairs apart in an embedding space. CLIP does this with images and captions: it learns that a photo and its real caption should be close, while mismatched photo-caption pairs should be far apart. The result is a shared representation that connects vision and language. It powers zero-shot image classification, where an image is assigned the label whose text description lies closest to it in the embedding space, and text-to-image search.
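The pull-together, push-apart objective can be written as a symmetric cross-entropy over a similarity matrix, which is the general shape of the loss used for CLIP-style training. The sketch below is a plain-Python illustration, not CLIP's implementation: the embeddings are hand-written toy vectors (assumed non-zero), the helper names are hypothetical, and the temperature of 0.07 is an illustrative assumption (in CLIP the temperature is a learned parameter).

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k])."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # sims[i][j]: scaled cosine similarity between image i and caption j.
    sims = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Each image should pick out its own caption (rows) ...
    loss_i2t = sum(cross_entropy_row(sims[k], k) for k in range(n)) / n
    # ... and each caption should pick out its own image (columns).
    loss_t2i = sum(
        cross_entropy_row([sims[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2

# Toy embeddings: matching pairs point in the same direction.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

print(contrastive_loss(image_embs, aligned_text))   # close to zero
print(contrastive_loss(image_embs, shuffled_text))  # much larger
```

The loss is small when each image sits closest to its own caption and large when captions are swapped, which is the signal that drives the two encoders toward a shared embedding space.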
Why it changed AI
Before self-supervision became dominant, progress was largely capped by how much data humans could label by hand. By deriving labels from the data’s own structure, self-supervised learning unlocked training on web-scale collections of unlabelled data. That breadth of cheap data is a large part of what gives today’s foundation models their general knowledge, which makes self-supervised learning the engine behind the current wave of AI.