What knowledge distillation actually is
Knowledge distillation is a model-compression technique in which a small student model is trained to reproduce the behaviour of a large, expensive teacher model. The motivation is practical: the most accurate models are often far too slow and memory-hungry to run on a phone, on an edge device, or at high request volumes. Distillation lets you train a compact model that keeps much of the teacher’s quality while running at a fraction of the cost. The idea was popularised by Geoffrey Hinton and colleagues in 2015, and it is one of the techniques behind many of the “mini,” “small,” and “flash” model variants shipped by major AI labs today.
Soft labels: the core trick
A normal classifier is trained on hard labels: for an image of a dog, the target is simply “dog: 1, everything else: 0.” That throws away a lot of information. A well-trained teacher, by contrast, produces a full soft label, perhaps “dog: 0.90, wolf: 0.07, cat: 0.02, car: 0.0001.” Those secondary probabilities tell the student something important: dogs look much more like wolves than like cars. Hinton called this signal dark knowledge. A student trained to match the teacher’s entire distribution, rather than just the top answer, inherits the teacher’s learned sense of how concepts relate, which a small model would struggle to discover on its own from sparse hard labels.
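To make the difference between hard and soft labels concrete, here is a minimal sketch in plain Python. It assumes a four-class toy problem, and the probabilities are illustrative only (the car probability is rounded up to 0.01 so the teacher's distribution sums to 1). The two hypothetical students put the same probability on "dog" but distribute the rest differently.

```python
import math

CLASSES = ["dog", "wolf", "cat", "car"]

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) in nats for two probability lists."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Hard label: all mass on "dog".
hard_label = [1.0, 0.0, 0.0, 0.0]
# Teacher's soft label (illustrative values, chosen to sum to 1).
teacher_soft = [0.90, 0.07, 0.02, 0.01]

# Two hypothetical students that agree on "dog" but not on the rest.
student_a = [0.90, 0.07, 0.02, 0.01]  # thinks dogs resemble wolves
student_b = [0.90, 0.01, 0.02, 0.07]  # thinks dogs resemble cars

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_soft, student_a)
soft_b = cross_entropy(teacher_soft, student_b)

print(f"hard-label loss: A={hard_a:.4f}  B={hard_b:.4f}")
print(f"soft-label loss: A={soft_a:.4f}  B={soft_b:.4f}")
```

The hard-label loss cannot tell the two students apart, because it only looks at the probability assigned to "dog." The soft-label loss is lower for student A, whose secondary probabilities agree with the teacher's.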
Temperature scaling
To make that dark knowledge usable, distillation applies temperature scaling to the teacher’s outputs. The teacher’s logits are divided by a temperature T greater than 1 before the softmax, which softens the distribution: the sharp 0.90 peak flattens, and the smaller probabilities become large enough to contribute meaningfully to the loss. The student is trained at the same temperature to match these softened targets (usually alongside a smaller loss term on the true hard labels). At inference time the temperature is set back to 1, so the deployed student behaves normally. Choosing T is a trade-off: too low and the soft information collapses back toward a hard label; too high and the distribution approaches uniform, washing out the differences between classes.
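The mechanics can be sketched in a few lines of plain Python. This is one common formulation under stated assumptions, not a definitive recipe: the logits are made up, the defaults `temperature=4.0` and `alpha=0.9` are illustrative choices, and the soft term is multiplied by T squared, a convention from the 2015 paper that keeps its gradient scale comparable to the hard-label term.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=4.0, alpha=0.9):
    """Weighted sum of a soft-target term and a hard-label term.

    alpha and temperature are hypothetical defaults for illustration.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # KL(teacher || student), both softened at the same temperature.
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(soft_teacher, soft_student) if t > 0)
    # Ordinary cross-entropy on the true label, computed at T = 1.
    hard_ce = -math.log(softmax(student_logits)[true_index])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard_ce

# Made-up logits for the classes dog, wolf, cat, car.
teacher_logits = [6.0, 3.5, 2.0, -3.0]
student_logits = [4.0, 1.0, 1.5, 0.0]

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
loss = distillation_loss(student_logits, teacher_logits, true_index=0)

print("T=1:", [round(p, 4) for p in sharp])
print("T=4:", [round(p, 4) for p in soft])
print("loss:", round(loss, 4))
```

Printing both distributions shows the top class losing probability mass to the others as T rises. A real training loop would compute the same quantities on batched tensors in a framework with automatic differentiation, then deploy the student with the temperature back at 1.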
Distillation versus quantization
Distillation is often confused with quantization, but they compress models in different ways. Distillation produces a genuinely smaller architecture, with fewer or narrower layers, trained to imitate the teacher. Quantization keeps the same architecture but stores weights and activations in lower numerical precision (16-bit, 8-bit, even 4-bit), shrinking memory and accelerating arithmetic, often with little accuracy loss, although the most aggressive low-bit formats tend to cost more. Distillation can recover capability that a small-from-scratch model would lack; quantization squeezes an existing model into less hardware. The two are frequently stacked: a large model is distilled into a smaller student, and that student is then quantized for the target device.
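For contrast, here is a minimal sketch of what quantization does to a list of weights. It assumes the simplest scheme, symmetric per-tensor 8-bit quantization with a single scale factor. Production toolchains typically use finer-grained scales and calibration data, and the weight values here are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in quantized]

# Made-up weights; the architecture (here, the list length) is unchanged.
weights = [0.42, -1.27, 0.003, 0.88, -0.5]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", quantized)
print("max round-trip error:", max_error)
```

The number of weights stays the same and only their precision drops, whereas distillation changes how many weights there are.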
When distillation is worth it
Distillation shines when you need to serve a model cheaply and at scale, when latency matters (interactive apps, on-device assistants), or when you must fit within fixed hardware. The cost is an extra training stage and a usually modest drop in peak accuracy versus the teacher. It is less useful when you already have abundant compute, when the task is so narrow a small model trains well directly, or when you cannot access a strong teacher’s outputs. As a rule of thumb, reach for distillation when a powerful model proves a task is solvable but is too expensive to deploy as-is.