Question 1

Why compress an AI model at all?

Accepted Answer

Large models are expensive to run, slow to respond, and often too big to fit on phones, browsers, or embedded chips. Compression shrinks the model's size and compute needs so it runs faster, costs less to serve, drains less battery, and can run on-device without a network round trip.

Question 2

Does compression always hurt accuracy?

Accepted Answer

Not necessarily. Some accuracy loss is common, but it is often small and recoverable. Light quantization or moderate pruning can shrink a model substantially with negligible quality drop, and fine-tuning after compression typically restores most of any lost accuracy. More aggressive compression buys larger savings at the cost of more accuracy.

Question 3

What is the difference between quantization and pruning?

Accepted Answer

Quantization reduces the precision of each weight, for example from 32-bit floating-point values to 8-bit integers, making every value cheaper to store and compute. Pruning removes individual weights or whole structures (such as channels or layers) entirely, making the model sparser or smaller. The two attack size from different angles and are often combined.
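To make the contrast concrete, here is a minimal sketch in plain Python. It assumes symmetric per-tensor int8 quantization and unstructured magnitude pruning, which are only two of several possible variants. The function names and the example weights are made up for illustration and do not come from any particular library.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.03, 0.41, -1.27, 0.005, 0.66, -0.09, 0.15]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
pruned = magnitude_prune(weights, sparsity=0.5)

print("int8 values:", q)
print("max quantization error:", max(abs(a - b) for a, b in zip(weights, restored)))
print("pruned:", pruned)
```

Quantization keeps every weight but stores each one more coarsely, so the error per weight is bounded by half the scale. Pruning keeps the surviving weights exact and sets the rest to zero, which only saves space or compute if the storage format or hardware can exploit the zeros.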

Question 4

Can I combine multiple compression techniques?

Accepted Answer

Yes, and practitioners usually do. A common pipeline distills a large model into a smaller architecture, prunes redundant structure, and then quantizes the result, sometimes guided by architecture search. Stacking techniques compounds the savings, though each step needs evaluation to control accuracy loss.
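A rough back-of-the-envelope calculation shows how the savings compound. The parameter counts and compression ratios below are hypothetical, chosen only to keep the arithmetic easy to follow. The sketch also assumes structured pruning that actually removes parameters, and it ignores any metadata overhead such as quantization scales.

```python
def model_size_mb(num_params, bits_per_weight):
    """Storage needed for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# (stage name, parameter count, bits per weight); all figures are hypothetical.
stages = [
    ("original, fp32", 100_000_000, 32),
    ("after distillation to a 4x smaller student", 25_000_000, 32),
    ("after structured pruning of 50% of the student", 12_500_000, 32),
    ("after int8 quantization", 12_500_000, 8),
]

baseline = model_size_mb(stages[0][1], stages[0][2])
for name, params, bits in stages:
    size = model_size_mb(params, bits)
    print(f"{name}: {size:.1f} MB ({baseline / size:.0f}x smaller than the original)")
```

Under these assumptions the individual factors of 4x, 2x, and 4x multiply to 32x overall. In practice each stage would be followed by an evaluation run, since the accuracy costs can compound as well.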

What Is Model Compression? Making AI Models Smaller and Faster

Why models need compressing

Quantization

Pruning

Knowledge distillation

Neural architecture search and combining methods