What Is Knowledge Distillation in AI?

How smaller, faster student models learn from larger teacher models

What knowledge distillation actually is

Knowledge distillation is a model-compression technique in which a small student model is trained to reproduce the behaviour of a large, expensive teacher model. The motivation is practical: the most accurate models are often far too slow and memory-hungry to run on a phone, on an edge device, or at high request volumes. Distillation lets you train a compact model that keeps much of the teacher’s quality while running at a fraction of the cost. The idea was popularised by Geoffrey Hinton and colleagues in 2015, and it is widely used to build the “mini,” “small,” and “flash” model variants shipped by major AI labs today.

Soft labels: the core trick

A normal classifier is trained on hard labels: for an image of a dog, the target is simply “dog: 1, everything else: 0.” That throws away a lot of information. A well-trained teacher, by contrast, produces a full soft label, perhaps “dog: 0.90, wolf: 0.07, cat: 0.02, car: 0.0001.” Those secondary probabilities tell the student something important: dogs look much more like wolves than like cars. Hinton called this signal dark knowledge. A student trained to match the teacher’s entire distribution, rather than just the top answer, inherits the teacher’s learned sense of how concepts relate, which a small model would struggle to discover on its own from sparse hard labels.
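To make the difference concrete, here is a minimal sketch that scores two hypothetical students against both kinds of target. The probabilities are illustrative assumptions, adjusted so that each distribution sums to 1. Both students put 0.90 on "dog", but one spends its leftover probability on "wolf" and the other on "car".

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) in nats; zero-probability targets contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Class order: dog, wolf, cat, car (illustrative values, not real model outputs).
hard_label = [1.0, 0.0, 0.0, 0.0]
teacher_soft = [0.90, 0.07, 0.02, 0.01]

# Two hypothetical students that agree on "dog" but disagree on the rest.
student_a = [0.90, 0.07, 0.02, 0.01]  # leftover mass mostly on "wolf"
student_b = [0.90, 0.01, 0.02, 0.07]  # leftover mass mostly on "car"

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_soft, student_a)
soft_b = cross_entropy(teacher_soft, student_b)

print(f"hard-label loss: A={hard_a:.4f}  B={hard_b:.4f}")
print(f"soft-label loss: A={soft_a:.4f}  B={soft_b:.4f}")
```

The hard-label loss only looks at the probability assigned to "dog", so it cannot tell the two students apart. The soft-label loss penalises student B for treating a dog as more car-like than wolf-like, which is exactly the relational signal described above.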

Temperature scaling

To make that dark knowledge usable, distillation applies temperature scaling to the teacher’s outputs. The teacher’s logits are divided by a temperature T greater than 1 before the softmax, which softens the distribution: the sharp 0.90 peak flattens, so the smaller probabilities become more visible and informative. The student is trained at the same temperature to match these softened targets, usually alongside a lower-weighted loss term on the true hard labels. At inference time the temperature is set back to 1, so the deployed student behaves normally. Choosing T is a trade-off: too low and the soft information collapses back toward a hard label; too high and the distribution flattens toward uniform, washing out the signal.
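The mechanics can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the logits, the temperature of 4, and the weighting alpha of 0.9 are all hypothetical values chosen for the example. The soft term is multiplied by T squared, as the 2015 paper suggests, so that its gradients stay on a comparable scale to the hard-label term when T changes.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=4.0, alpha=0.9):
    """Weighted sum of a softened teacher-matching term and a hard-label term.

    temperature and alpha are illustrative defaults, not recommended settings.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # T^2 compensates for the 1/T^2 shrinkage of soft-target gradients.
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Ordinary cross-entropy on the true label, computed at temperature 1.
    hard_term = -math.log(softmax(student_logits)[true_index])
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Hypothetical logits for the classes dog, wolf, cat, car.
teacher_logits = [6.0, 3.5, 2.0, -3.0]
student_logits = [2.0, 1.8, 0.5, 0.0]

p1 = softmax(teacher_logits)         # sharp distribution at T = 1
p4 = softmax(teacher_logits, 4.0)    # softened distribution at T = 4
print("T=1:", [round(p, 4) for p in p1])
print("T=4:", [round(p, 4) for p in p4])
print("loss:", distillation_loss(student_logits, teacher_logits, true_index=0))
```

Printing the two distributions side by side shows the effect described above: at T = 4 the top class gives up probability mass and the tail classes become large enough to carry a training signal.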

Distillation versus quantization

Distillation is often confused with quantization, but they compress models in different ways. Distillation produces a genuinely smaller architecture, with fewer or narrower layers, trained to imitate the teacher. Quantization keeps the same architecture but stores weights and activations in lower numerical precision (16-bit, 8-bit, even 4-bit), shrinking memory and accelerating arithmetic, often with only a small accuracy loss, although the loss tends to grow as precision drops. Distillation can recover capability that a small model trained from scratch would lack; quantization squeezes an existing model into less hardware. The two are frequently stacked: a large model is distilled into a smaller student, and that student is then quantized for the target device.
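For contrast, here is a rough sketch of what quantization does to a set of weights. It assumes one of the simplest possible schemes, symmetric 8-bit quantization with a single scale for the whole list; the function names and weight values are hypothetical, and production toolchains typically use more elaborate variants.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

# Hypothetical weights from a single layer.
weights = [0.42, -1.27, 0.003, 0.88, -0.15]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", quantized)
print("restored:   ", [round(r, 4) for r in restored])
print("max rounding error:", max_error)
```

The number of weights is unchanged, so the architecture is exactly the same; only the precision of each stored value drops. Distillation, by contrast, changes how many weights there are in the first place.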

When distillation is worth it

Distillation shines when you need to serve a model cheaply and at scale, when latency matters (interactive apps, on-device assistants), or when you must fit within fixed hardware. The cost is an extra training stage and a usually modest drop in peak accuracy versus the teacher. It is less useful when you already have abundant compute, when the task is so narrow a small model trains well directly, or when you cannot access a strong teacher’s outputs. As a rule of thumb, reach for distillation when a powerful model proves a task is solvable but is too expensive to deploy as-is.
