Question 1

What does quantizing a model mean?

Accepted Answer

Quantizing a model means storing its weights (and sometimes its activations) in fewer bits, for example as 8-bit or 4-bit integers instead of 16- or 32-bit floats. This shrinks both the file on disk and the memory needed at run time, so larger models fit on smaller hardware.
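
To make the idea concrete, here is a minimal sketch of one simple scheme: symmetric, per-tensor INT8 quantization with a single shared scale. The function names and the example weights are made up for illustration, and real libraries use more elaborate variants (per-channel or per-group scales, zero points, and so on).

```python
def quantize_int8(weights):
    """Map floats to int8 codes using one shared scale (symmetric, per-tensor)."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor; any non-zero scale works in that case.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.82, -1.27, 0.05, 0.40]  # made-up example values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(codes)     # [82, -127, 5, 40]
print(restored)  # close to the originals, each within half a scale step
```

Each weight is now stored as a one-byte integer plus one shared float for the scale, instead of two or four bytes per weight.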

Question 2

Does quantization hurt model accuracy?

Accepted Answer

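Some accuracy is usually lost, but modern methods keep the loss small. INT8 is often nearly lossless. Aggressive INT4 can degrade quality more, though techniques such as GPTQ and AWQ reduce the drop: GPTQ adjusts the remaining weights to compensate for each rounding error, and AWQ uses activation statistics to protect the weights that matter most.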
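
The gap between bit widths can be illustrated with a small experiment. The sketch below measures the round-trip error of naive round-to-nearest quantization at 8 and 4 bits on synthetic Gaussian values standing in for real weights. It is not GPTQ or AWQ, and weight error is only a rough proxy for the accuracy a model loses on an actual task.

```python
import random


def roundtrip_error(weights, bits):
    """Mean absolute error after symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    total = 0.0
    for w in weights:
        code = max(-qmax, min(qmax, round(w / scale)))
        total += abs(w - code * scale)
    return total / len(weights)


random.seed(0)
# Synthetic stand-in for a weight tensor (an assumption, not real model data).
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

err8 = roundtrip_error(weights, 8)
err4 = roundtrip_error(weights, 4)
print(f"8-bit mean abs error: {err8:.4f}")
print(f"4-bit mean abs error: {err4:.4f}")
```

With only 15 usable levels instead of 255, the 4-bit grid is much coarser, so its rounding error is correspondingly larger. Closing that gap is what the more careful methods aim for.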

Question 3

What is the difference between PTQ and QAT?

Accepted Answer

Post-training quantization (PTQ) compresses an already-trained model with little or no extra training, typically using only a small calibration dataset, which makes it fast and common. Quantization-aware training (QAT) simulates low precision during training so the model learns to tolerate it, which gives better accuracy at a higher training cost.
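
The two approaches share the same rounding operation and differ in when it is applied. The sketch below is illustrative only: `calibrate_scale`, `fake_quantize`, and the weight values are hypothetical, and the QAT training loop is described in comments instead of implemented.

```python
def calibrate_scale(values, bits=8):
    """PTQ-style calibration: choose a scale from observed values, no training."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(value, scale, bits=8):
    """Round a value onto the integer grid but return it as a float."""
    qmax = 2 ** (bits - 1) - 1
    code = max(-qmax, min(qmax, round(value / scale)))
    return code * scale


trained_weights = [0.31, -0.72, 0.05, 0.90]  # hypothetical trained weights

# PTQ: calibrate once on the finished model, then quantize its weights.
scale = calibrate_scale(trained_weights, bits=4)
ptq_weights = [fake_quantize(w, scale, bits=4) for w in trained_weights]
print(ptq_weights)

# QAT: the same fake_quantize step would sit inside every training forward
# pass, so the loss sees the rounding error and the weights can adapt to it.
# Gradients are commonly passed through the rounding step unchanged
# (the straight-through estimator). The training loop is omitted here.
```

In PTQ the rounding error is simply accepted after the fact. In QAT the optimizer has many steps in which to move the weights toward values that survive rounding well.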

Question 4

Why would I want a quantized model?

Accepted Answer

Quantized models use far less memory and often run faster, so you can fit a large model on a single consumer GPU or even a laptop, lower inference costs, and reduce latency, in exchange for a small loss in quality.
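
A back-of-the-envelope calculation shows where the memory savings come from. The sketch below assumes a hypothetical 7-billion-parameter model and counts weight storage only. It ignores activations, the KV cache, and the small overhead of quantization scales, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7_000_000_000  # hypothetical 7-billion-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):.1f} GB")
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  8-bit: 7.0 GB
#  4-bit: 3.5 GB
```

Going from 16-bit to 4-bit weights cuts the footprint by a factor of four, which is the difference between needing a data-center GPU and fitting within the memory of many consumer cards.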

What Is Model Quantization? Running AI With Less Memory

What model quantization is

Why precision matters for size

Post-training quantization vs QAT

The accuracy trade-off

When to use quantized models