What Is Model Compression? Making AI Models Smaller and Faster

Quantization, pruning, distillation, and architecture search for edge AI


Why models need compressing

Modern AI models can have billions of parameters, demanding gigabytes of memory and heavy compute for every prediction. That is fine in a data centre, but it is a problem when you want the model to run on a phone, in a browser, on a sensor, or simply at lower cost and latency at scale. Model compression is the set of techniques that reduce a model’s size and computational cost while trying to preserve as much of its accuracy as possible. The goal is a smaller, faster, cheaper model that behaves almost like the original.

Quantization

Quantization lowers the numerical precision used to store and compute weights and activations. A model trained in 32-bit floating point can often be converted to 8-bit integers, or even 4-bit, cutting weight memory by about four times at 8 bits and about eight times at 4 bits, and letting hardware with low-precision arithmetic units run the computation faster. Post-training quantization applies this directly to a finished model, while quantization-aware training simulates the lower precision during training so the model adapts to it. Because most of a model’s cost is moving and multiplying numbers, quantization is often the single biggest, easiest win.
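To make the idea concrete, here is a minimal sketch of post-training quantization on a handful of weights. It assumes one particular scheme, symmetric per-tensor rounding to the int8 range, which is only one of several in use. The function names and sample weights are illustrative and do not come from any specific library.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.05, 0.9, -0.33]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding moves each weight by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is now stored as one byte plus a single shared scale, instead of four bytes per weight. Production toolchains usually go further, for example with per-channel scales or calibration data for activations.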

Pruning

Pruning removes parts of the model that contribute little to its output. Unstructured pruning zeroes out individual low-magnitude weights, producing a sparse model; structured pruning removes whole units — neurons, attention heads, or layers — which is friendlier to standard hardware. The model is usually fine-tuned afterward to recover accuracy lost when the connections were cut. Pruning exploits the fact that large networks are typically over-parameterised, carrying far more capacity than a given task strictly requires.
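The two flavours of pruning can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: importance is measured by weight magnitude (absolute value for single weights, L1 norm for whole neurons), and the function names and sample values are hypothetical. The fine-tuning step that normally follows is not shown.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the fraction `sparsity` of smallest-magnitude weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


def prune_neurons(weight_rows, keep):
    """Structured pruning: keep the `keep` rows (neurons) with the largest L1 norm."""
    norms = [sum(abs(w) for w in row) for row in weight_rows]
    ranked = sorted(range(len(weight_rows)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weight_rows[i] for i in kept]


weights = [0.8, -0.02, 0.31, 0.004, -0.6, 0.07, -0.15, 0.9]
sparse = magnitude_prune(weights, sparsity=0.5)

layer = [[0.5, -0.4], [0.01, 0.02], [-0.3, 0.9]]  # one row per neuron
smaller_layer = prune_neurons(layer, keep=2)
print(sparse, smaller_layer)
```

The unstructured result keeps its shape and merely contains zeros, so it needs sparse-aware kernels to run faster. The structured result is a genuinely smaller dense matrix, which is why it maps more easily onto standard hardware.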

Knowledge distillation

Knowledge distillation trains a small “student” model to imitate a large “teacher”. Instead of learning only from hard labels, the student learns from the teacher’s full output distribution — its soft probabilities across classes or tokens — which carries richer information about how the teacher generalises. The student can be a much smaller architecture yet capture a surprising amount of the teacher’s behaviour. Distillation is how many compact deployable models inherit the strengths of far larger ones.
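A common way to express "learning from the teacher’s soft probabilities" is a loss that blends two terms: a divergence between teacher and student distributions, and the usual cross-entropy on the hard label. The sketch below assumes a widely used formulation with a softening temperature and a mixing weight. The `temperature` and `alpha` parameters, the squared-temperature scaling, and the example logits are assumptions of this illustration, not details given in the article.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target KL divergence and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy against the true class at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * (temperature ** 2) * soft + (1 - alpha) * hard


teacher = [4.0, 1.0, 0.2]        # illustrative logits for three classes
good_student = [3.8, 1.1, 0.1]   # close to the teacher
poor_student = [0.5, 2.0, 1.0]   # disagrees with the teacher

loss_good = distillation_loss(good_student, teacher, label=0)
loss_poor = distillation_loss(poor_student, teacher, label=0)
print(loss_good, loss_poor)
```

In training, this loss would be minimised over the student’s parameters with the teacher held fixed. The soft term rewards the student for matching how the teacher ranks the wrong classes as well as the right one.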

Neural architecture search and combining methods

Neural architecture search (NAS) automates the design of efficient model architectures, searching over building blocks to find ones that hit a target accuracy at minimal size or latency, sometimes tailored to specific hardware. In practice, the four techniques are complementary: a typical edge-AI pipeline might distill a large model into a NAS-designed compact architecture, prune redundant structure, then quantize the result. Every step trades a little accuracy for size and speed, so each is validated against the deployment budget — the art of compression is choosing where on that curve to land.
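The core loop of a budget-constrained architecture search can be sketched as a toy grid search. Everything here is a simplifying assumption: the search space is only the depth and width of a plain multilayer perceptron, the budget is a parameter count rather than measured latency, and `score_fn` is a hypothetical stand-in for the trained-and-evaluated accuracy a real NAS system would use.

```python
from itertools import product


def mlp_param_count(n_in, n_out, depth, width):
    """Weights plus biases for an MLP with `depth` hidden layers of size `width`."""
    sizes = [n_in] + [width] * depth + [n_out]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


def search(n_in, n_out, depths, widths, param_budget, score_fn):
    """Return (score, depth, width, params) for the best candidate within budget."""
    best = None
    for depth, width in product(depths, widths):
        params = mlp_param_count(n_in, n_out, depth, width)
        if params > param_budget:
            continue  # candidate violates the deployment budget
        score = score_fn(depth, width)
        if best is None or score > best[0]:
            best = (score, depth, width, params)
    return best


def capacity_proxy(depth, width):
    # Placeholder score: raw capacity. A real search would train and validate.
    return mlp_param_count(64, 10, depth, width)


best = search(n_in=64, n_out=10, depths=[1, 2, 3], widths=[32, 64, 128],
              param_budget=20_000, score_fn=capacity_proxy)
print(best)
```

Swapping `capacity_proxy` for a function that trains each candidate briefly and returns validation accuracy, and the parameter budget for on-device latency, turns this toy into the outline of a hardware-aware search. The winning architecture would then go through the distillation, pruning, and quantization steps described above.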
