Question 1

What is knowledge distillation in machine learning?

Accepted Answer

Knowledge distillation is a training technique where a small student model learns to imitate the outputs of a larger, more capable teacher model. Instead of training only on the original labels, the student also learns from the teacher's full probability distribution over classes, which carries richer information than a single correct answer.
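To make the idea of learning from both the labels and the teacher concrete, here is a minimal sketch of a combined loss in plain Python. The function names, the `alpha` mixing weight, and the example logits are illustrative assumptions, not part of the answer above; real training code would compute this over batches inside a deep learning framework.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy H(target, predicted); zero-probability targets contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5):
    """Blend a hard-label term with a teacher-matching term.

    alpha is a hypothetical mixing weight: 1.0 means labels only,
    0.0 means teacher distribution only.
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_term = cross_entropy(one_hot, student_probs)        # original label
    soft_term = cross_entropy(teacher_probs, student_probs)  # teacher's full distribution
    return alpha * hard_term + (1 - alpha) * soft_term

# Hypothetical logits for a 3-class problem where class 0 is correct.
teacher = [4.0, 2.0, -1.0]
close_student = [3.5, 1.8, -0.5]   # roughly agrees with the teacher
far_student = [-1.0, 2.0, 4.0]     # disagrees with the teacher

loss_close = distillation_loss(close_student, teacher, label=0)
loss_far = distillation_loss(far_student, teacher, label=0)
```

A student whose outputs resemble the teacher's receives a lower loss than one that disagrees, which is the pressure that drives imitation.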

Question 2

What are soft labels and why do they matter?

Accepted Answer

Soft labels are the full probability distribution a teacher model assigns across all possible outputs, not just the single hard label. They reveal which wrong answers the teacher considered plausible, encoding relationships between classes. Learning from these soft targets helps a small student model capture nuance that hard labels alone do not convey.
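As a small illustration of what a soft label exposes, the sketch below compares a hard label with a soft one for the same input. The class names and probabilities are made-up values chosen for the example, and `plausible_alternatives` is a hypothetical helper, not a library function.

```python
def plausible_alternatives(soft_label, class_names, true_class):
    """List the wrong classes, most plausible first, according to the label."""
    wrong = [(name, p) for name, p in zip(class_names, soft_label) if name != true_class]
    return sorted(wrong, key=lambda pair: pair[1], reverse=True)

class_names = ["cat", "dog", "truck"]

# Hard label: only says "cat", nothing about the other classes.
hard_label = [1.0, 0.0, 0.0]

# Soft label (assumed teacher output): "cat", but "dog" is a far likelier
# confusion than "truck".
soft_label = [0.85, 0.14, 0.01]

from_hard = plausible_alternatives(hard_label, class_names, "cat")
from_soft = plausible_alternatives(soft_label, class_names, "cat")
```

With the hard label, every wrong class looks equally wrong. With the soft label, the student can see that a cat resembles a dog much more than it resembles a truck.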

Question 3

What does temperature do during distillation?

Accepted Answer

Temperature scaling softens the teacher's probability distribution before the student learns from it. Dividing the logits by a temperature greater than 1 spreads probability mass across more classes, exposing the small differences between near-miss options. This 'dark knowledge' in the softened distribution is the key signal the student trains on. The same temperature is typically applied to the student's outputs during training, and the student goes back to a temperature of 1 for normal inference.
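The effect of temperature can be sketched with a softmax that divides the logits by a temperature before normalizing. The logits and the temperature of 4.0 below are hypothetical values chosen only to show the flattening; suitable temperatures depend on the task.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; higher temperature gives a flatter result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [6.0, 3.0, 1.0]  # assumed teacher outputs for three classes

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)
```

At temperature 1 almost all of the probability sits on the top class. At temperature 4 the ranking is unchanged, but the runner-up classes receive enough mass for the student to learn from their relative sizes.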

Question 4

How is distillation different from quantization?

Accepted Answer

Distillation trains a new, architecturally smaller model to mimic a larger one, changing the model itself. Quantization keeps the same model but stores its weights in lower precision, such as 8-bit or 4-bit integers, to shrink memory and speed up math. They are complementary: a model is often distilled first and then quantized for deployment.
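For contrast with distillation, here is a minimal sketch of what quantization does to a set of weights. It assumes a simple symmetric per-tensor 8-bit scheme; production toolchains use more elaborate variants, and the function names and weight values here are illustrative only.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.82, -0.31, 0.05, -1.27]  # hypothetical layer weights

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The number of weights and the architecture are untouched; only the precision of each stored value changes. Distillation, by contrast, produces a different model with fewer parameters.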

What Is Knowledge Distillation in AI?

What knowledge distillation actually is

Soft labels: the core trick

Temperature scaling

Distillation versus quantization

When distillation is worth it