The problem distillation solves
State-of-the-art models are large, slow, and expensive to run. That is fine in a data center but impractical on a phone, in a browser, or at the scale of millions of requests where latency and cost dominate. Knowledge distillation is the technique that bridges this gap: it transfers the capability of a big, expensive teacher model into a small, cheap student model. The student ends up far smaller and faster while keeping much of what made the teacher good. It is one of the main reasons capable models now run on modest hardware, and it underpins many of the small, efficient models released in recent years.
Teacher and student: how the training works
In ordinary supervised training, a model learns from data with hard labels: this email is spam, this token comes next. In distillation, the student instead learns to match the teacher’s output. The teacher processes inputs and produces its full probability distribution over possible outputs, and the student is trained to reproduce that distribution. So the student is not just told the right answer; it is shown how confident the teacher was across all options. This is typically combined with the true labels too, usually as a weighted sum of two loss terms, so the student learns from both the ground truth and the teacher’s richer signal. The teacher does the expensive learning once; the student inherits a distilled version of it.
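To make the combined objective concrete, here is a minimal sketch in plain Python. It assumes a single classification example, and the names `distillation_loss` and `alpha` (the weight on the teacher term) are illustrative choices, not part of any particular library.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy of the predicted distribution against a target distribution."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))

def distillation_loss(student_logits, teacher_logits, true_label, alpha=0.5):
    """Weighted sum of a soft (teacher) term and a hard (label) term.

    alpha is a hypothetical mixing weight: 1.0 uses only the teacher,
    0.0 uses only the ground-truth label.
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    # Soft term: how far the student is from the teacher's full distribution.
    soft_loss = cross_entropy(teacher_probs, student_probs)
    # Hard term: ordinary cross-entropy against the one-hot true label.
    one_hot = [1.0 if i == true_label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, student_probs)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # confidently picks the wrong class
print(distillation_loss(close_student, teacher_logits, true_label=0))
print(distillation_loss(far_student, teacher_logits, true_label=0))
```

The student that tracks the teacher gets the lower loss. In a real training loop the same quantity would be computed per batch in a framework with automatic differentiation and minimised with respect to the student’s weights only.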
Soft targets and “dark knowledge”
The magic ingredient is soft targets. A hard label says “this is a 7.” A teacher’s soft target says “92% a 7, 5% a 1, 3% a 9”, encoding that sevens look a bit like ones and nines. Geoffrey Hinton and colleagues, who formalised modern distillation, called this extra information dark knowledge: the relationships and similarities the teacher learned that a single label cannot express. A “temperature” parameter is often used to soften the teacher’s distribution further: the logits are divided by the temperature before the softmax, so values above 1 flatten the distribution and expose even the small probabilities. Training on these graded signals is why a small student can often generalise better than the same-sized model trained on hard labels alone. It inherits the teacher’s sense of structure, not just its answers.
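The effect of temperature is easy to see numerically. The sketch below assumes the usual formulation, in which logits are divided by the temperature before the softmax. The logits are made-up values chosen to give probabilities close to the “7 / 1 / 9” example at temperature 1.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; higher temperature gives a flatter result."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes "7", "1", "9".
teacher_logits = [5.0, 2.0, 1.5]

for t in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(teacher_logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top class to the runners-up, so the student sees more clearly which wrong answers the teacher considers plausible. The same temperature is normally applied to the student’s logits when computing the soft-target term.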
Where distillation shows up
Distillation is widely used in modern AI. The classic example is DistilBERT, a compressed version of BERT that its authors report as about 40% smaller and 60% faster while retaining roughly 97% of BERT’s language-understanding performance. Today, many compact, efficient model families are trained partly through distillation from larger siblings or from synthetic data generated by a stronger teacher; the Phi and Gemma families and various small instruction-tuned models lean on this idea. A widespread variant is data distillation: a strong model generates high-quality training examples, and a smaller model is trained on them, effectively distilling the teacher’s behaviour through its outputs. This is how a lot of capable open small models are built.
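A toy version of data distillation fits in a few lines. Everything here is a stand-in: `teacher_predict` is a hypothetical rule playing the role of an expensive model, the inputs are random 2-D points rather than text, and the student is a small logistic-regression classifier. The point is only the workflow: the teacher labels unlabeled inputs, and the student trains on those pairs without ever seeing ground-truth labels.

```python
import math
import random

def teacher_predict(x):
    """Hypothetical stand-in for a large, expensive teacher model."""
    return 1 if x[0] + x[1] > 1.0 else 0

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def build_synthetic_dataset(n, rng):
    """Sample unlabeled inputs and let the teacher label them."""
    inputs = [(rng.random(), rng.random()) for _ in range(n)]
    return [(x, teacher_predict(x)) for x in inputs]

def train_student(dataset, epochs=50, lr=0.5):
    """Fit a small logistic-regression student with plain SGD."""
    w0, w1, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x0, x1), y in dataset:
            p = sigmoid(w0 * x0 + w1 * x1 + b)
            grad = p - y  # gradient of the log loss w.r.t. the pre-activation
            w0 -= lr * grad * x0
            w1 -= lr * grad * x1
            b -= lr * grad
    return w0, w1, b

def student_predict(params, x):
    w0, w1, b = params
    return 1 if w0 * x[0] + w1 * x[1] + b > 0 else 0

rng = random.Random(0)
student = train_student(build_synthetic_dataset(500, rng))
held_out = [(rng.random(), rng.random()) for _ in range(200)]
agreement = sum(student_predict(student, x) == teacher_predict(x) for x in held_out) / len(held_out)
print(f"student agrees with teacher on {agreement:.0%} of held-out inputs")
```

With language models the same pattern applies at a larger scale: the sampled inputs are prompts, the teacher’s labels are generated responses, and the student is fine-tuned on the resulting prompt and response pairs.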
Distillation vs other compression methods
Distillation is one of three main ways to make models smaller, and it is worth knowing how they differ. Pruning removes weights or whole neurons that contribute little to an existing model, then fine-tunes what remains. Quantization lowers the numerical precision of the weights, say from 16-bit to 4-bit, shrinking memory and speeding inference with modest quality loss. Distillation trains a genuinely new, smaller model to imitate a larger one. The key distinction: pruning and quantization compress a given model in place, while distillation produces a fresh model that can have a different shape entirely. In practice these are complementary: teams frequently distill a model and then quantize it, so the real-world recipe for an efficient model often combines two or all three.
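For contrast with distillation, the two in-place methods can be sketched on a flat list of weights. This is a simplified illustration under stated assumptions: magnitude-based pruning and symmetric uniform quantization with a single scale. The function names are hypothetical, and production schemes (structured pruning, per-channel or calibrated quantization) are more involved.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

def quantize(weights, bits):
    """Map floats to signed integer codes using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # for example, 7 for 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]

print(magnitude_prune(weights, 0.5))   # the three smallest weights become 0.0

codes, scale = quantize(weights, bits=4)
print(codes)                           # small integers in [-7, 7]
print(dequantize(codes, scale))        # close to the originals, not identical
```

Both functions operate on the weights of an existing model and keep its shape. Distillation differs in that the student’s architecture is chosen independently and its weights are learned from scratch or from a separate initialisation.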