Question 1

What is quantization in machine learning?

Accepted Answer

Quantization is the process of representing a model's weights (and sometimes its activations) with fewer bits, for example by converting 32-bit floats to 8-bit or 4-bit integers. This shrinks memory use and typically speeds up inference, usually with only a small drop in accuracy.
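
To make the idea concrete, here is a minimal sketch of one common scheme: symmetric 8-bit quantization with a single scale for the whole tensor. The function names and the random weights are illustrative assumptions, not part of any particular library, and real toolkits usually use finer-grained (per-channel or per-group) scales.

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights to int8 using one shared scale (symmetric scheme)."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

# Hypothetical stand-in for a layer's weight matrix.
rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = float(np.abs(weights - restored).max())

print("bytes before:", weights.nbytes, "bytes after:", q.nbytes)
print("largest rounding error:", max_error, "half a step:", scale / 2)
```

The int8 array takes a quarter of the memory of the float32 original, and no weight moves by more than about half a quantization step.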

Question 2

What is the difference between PTQ and QAT?

Accepted Answer

Post-training quantization (PTQ) compresses an already-trained model without further training, which makes it fast and cheap to apply. Quantization-aware training (QAT) simulates low precision during training so the model can adapt to it, which typically yields better accuracy at very low bit widths but at a much higher training cost.
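
The mechanical difference can be sketched on a toy linear model. This is an illustration under simplifying assumptions (a hypothetical `fake_quantize` helper, 4-bit symmetric rounding, plain gradient descent with a straight-through gradient), not how any specific framework implements QAT, and it makes no claim about which approach wins on this toy problem.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Quantize then dequantize, so the result is float but sits on a low-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.abs(w).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def mse(w, x, y):
    return float(np.mean((x @ w - y) ** 2))

# Toy regression data (made up for the example).
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 8))
y = x @ rng.normal(size=8)

# PTQ: fit in full precision first, then quantize once at the end.
w_fp = np.linalg.lstsq(x, y, rcond=None)[0]
w_ptq = fake_quantize(w_fp)

# QAT: the forward pass always uses quantized weights; the gradient is
# applied to the underlying float weights (straight-through estimator).
w = np.zeros(8)
for _ in range(500):
    w_q = fake_quantize(w)
    grad = 2 * x.T @ (x @ w_q - y) / len(x)
    w -= 0.05 * grad
w_qat = fake_quantize(w)

print("PTQ loss:", mse(w_ptq, x, y))
print("QAT loss:", mse(w_qat, x, y))
```

PTQ needs only the single rounding step, while QAT needs the whole training loop, which is where its extra cost comes from.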

Question 3

What are GGUF, GPTQ, and AWQ?

Accepted Answer

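GGUF is a file format, while GPTQ and AWQ are quantization methods; all three are popular for LLMs. GGUF is the file format used by llama.cpp for CPU and consumer-GPU inference. GPTQ and AWQ are post-training quantization algorithms designed to preserve accuracy at 4-bit: GPTQ quantizes weights layer by layer and adjusts the remaining weights to compensate for rounding error, while AWQ uses activation statistics to identify the most important weight channels and rescales them to protect them.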
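
The idea of protecting important weight channels can be sketched as follows. This is a simplified illustration in the spirit of activation-aware scaling, not the actual AWQ implementation: the function names, the per-column round-to-nearest quantizer, and the grid over `alpha` are all assumptions made for the example. Input channels with large activations are scaled up before rounding, and the inverse scale is applied to the activations, so the full-precision output is unchanged.

```python
import numpy as np

def round_to_nearest(w, bits=4):
    """Symmetric round-to-nearest with one scale per output column; returns floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def search_channel_scales(x, w, bits=4, steps=11):
    """Grid-search alpha in [0, 1]; alpha = 0 is plain round-to-nearest."""
    reference = x @ w
    activation_mag = np.abs(x).mean(axis=0)  # one value per input channel
    results = []
    for alpha in np.linspace(0.0, 1.0, steps):
        s = np.maximum(activation_mag, 1e-8) ** alpha
        w_q = round_to_nearest(w * s[:, None], bits)
        error = float(np.mean(((x / s) @ w_q - reference) ** 2))
        results.append((error, float(alpha)))
    baseline_error = results[0][0]
    best_error, best_alpha = min(results)
    return best_alpha, best_error, baseline_error

# Made-up calibration activations in which a few channels are much larger.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, :4] *= 20.0
w = rng.normal(size=(64, 32))

best_alpha, best_error, baseline_error = search_channel_scales(x, w)
print("plain rounding error:", baseline_error)
print("best scaled error:", best_error, "at alpha =", best_alpha)
```

Because `alpha = 0` is part of the grid, the search can never do worse than plain rounding on the calibration data; how much it helps depends on how uneven the activations are.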

Question 4

Does quantization hurt model accuracy?

Accepted Answer

There is usually some loss, but it is often small. 8-bit quantization is typically near-lossless; 4-bit causes a modest, frequently acceptable drop; below 4-bit, degradation grows quickly. Methods such as AWQ and GPTQ reduce the loss by accounting for which weights matter most to the model's output rather than rounding every weight naively.
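
A rough way to see why the loss grows at low bit widths is to measure how far weights move when rounded. The sketch below uses random Gaussian weights and naive symmetric rounding with one shared scale, both assumptions for illustration. Weight reconstruction error is only a proxy: it is not the same as task accuracy, which has to be measured on a real model.

```python
import numpy as np

def reconstruction_error(weights, bits):
    """RMS difference between weights and their quantized-then-restored values."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(weights).max()) / qmax
    restored = np.clip(np.round(weights / scale), -qmax, qmax) * scale
    return float(np.sqrt(np.mean((weights - restored) ** 2)))

# Hypothetical weights drawn from a standard normal distribution.
rng = np.random.default_rng(0)
weights = rng.normal(size=100_000)

errors = {bits: reconstruction_error(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit RMS error: {err:.4f}")
```

Each bit removed roughly doubles the step size between representable values, so the rounding error climbs steeply as the bit width shrinks.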
