Quantization (AI Glossary)

Reducing model weight precision from float32 to int8 or int4 to save memory

Ad placeholder (leaderboard)

Definition

Quantization is the technique of reducing the numerical precision used to store and compute a neural network’s parameters. Instead of representing each weight as a 32-bit floating-point number (float32), a quantized model uses a smaller representation such as 8-bit integers (int8) or even 4-bit integers (int4). Because large language models are dominated by the cost of storing and moving billions of weights, cutting precision from 32 bits to 4 bits can shrink a model’s memory footprint by roughly 8x — turning a model that needed a data centre GPU into one that runs on a laptop.

Why it matters

A 70-billion-parameter model in float16 needs around 140 GB of memory just for its weights — far beyond a consumer GPU. Quantize the same model to 4-bit and the weights drop to roughly 35 GB, bringing it within reach of high-end consumer hardware. Quantization is therefore the single most important technique for local and on-device inference: it is what makes running capable open-weight models on personal machines practical. It also reduces cloud serving costs and improves latency, since less data moves through memory.

Post-training quantization vs quantization-aware training

There are two broad approaches. Post-training quantization (PTQ) takes a fully trained model and compresses it directly, with no further training. It is fast, cheap, and the default for most open-weight model releases. Quantization-aware training (QAT) instead simulates low-precision arithmetic during training, letting the model learn to tolerate the rounding error. QAT generally produces better accuracy at aggressive bit widths but requires the full training pipeline, so it is far less common for huge LLMs.

Common formats and methods

Several formats dominate the open-source LLM ecosystem:

  • GGUF — the file format used by llama.cpp, designed for efficient CPU and consumer-GPU inference across many quantization levels (Q4, Q5, Q8, and more).
  • GPTQ — a PTQ algorithm that quantizes weights layer by layer while minimising output error, popular for 4-bit GPU inference.
  • AWQ (Activation-aware Weight Quantization) — protects the small fraction of weight channels that matter most, preserving accuracy especially well at 4-bit.

The accuracy trade-off

Quantization is a balancing act between size and quality. 8-bit is usually near-lossless. 4-bit introduces a modest, often imperceptible drop and is the sweet spot for most local deployments. Going below 4-bit (3-bit, 2-bit) saves more memory but degrades quality more sharply, so it is reserved for extreme memory constraints. Choosing a quantization level means deciding how much quality you are willing to trade for the ability to run a bigger model on smaller hardware.

Ad placeholder (rectangle)