What Is Model Quantization? Running AI With Less Memory

INT8, INT4, GPTQ, AWQ: shrinking models without destroying performance


What model quantization is

Quantization is the process of storing an AI model’s numbers using fewer bits. A model is just a giant collection of weights, normally kept as 16-bit or 32-bit floating-point values. Quantization converts those weights to lower precision — commonly 8-bit (INT8) or 4-bit (INT4) integers. The result is a much smaller model that needs less memory and can run faster, which is why quantization is central to running large language models on modest hardware.
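To make the idea concrete, here is a minimal sketch of one common scheme, symmetric quantization with a single shared scale factor. The function names and sample weights are made up for illustration; real toolkits typically use per-channel or per-group scales and packed storage.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]        # illustrative values only
q, scale = quantize_symmetric(weights, bits=8)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Only the integers and the scale need to be stored. The round trip is not exact: each restored value can differ from the original by up to half a quantization step (`scale / 2`).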

Why precision matters for size

A model’s memory footprint is roughly its parameter count times the bytes per parameter. Dropping from 16-bit to 4-bit cuts the bytes per parameter by a factor of four (from 2 bytes to 0.5), so a 13-billion-parameter model that needs about 26 GB at 16-bit can fit in roughly 7 GB at 4-bit: 6.5 GB of weights plus a small overhead for quantization metadata such as scale factors. That difference is what lets a model that previously required a data-centre GPU run on a single consumer card or laptop.
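The arithmetic can be checked with a few lines of Python. This sketch counts raw weight storage only, using decimal gigabytes; the helper name is hypothetical, and real usage adds quantization metadata, activations, and the key-value cache on top.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# A 13-billion-parameter model at several precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(13e9, bits):.1f} GB")
```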

Beyond memory, lower precision means less data to move and multiply, which typically increases throughput (tokens per second) and lowers cost.

Post-training quantization vs QAT

There are two broad approaches:

  • Post-training quantization (PTQ): take a fully trained model and compress it afterwards, often using a small calibration dataset to choose the value range that each set of weights or activations is rounded into. It is fast, needs no retraining, and is the most common path. GPTQ and AWQ are advanced forms of PTQ that use calibration data to quantize a model layer by layer while preserving as much accuracy as possible.
  • Quantization-aware training (QAT): simulate low precision during training so the model learns to be robust to it. QAT usually yields the best accuracy at low bit-widths but costs far more compute, so it is reserved for cases where every point of quality matters.
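The two ideas can be sketched side by side. The code below is a simplified illustration, not how GPTQ, AWQ, or any particular framework works: `calibrate_scale` stands in for PTQ calibration, `fake_quantize` is the round-then-restore step that QAT inserts into the forward pass, and `keep_fraction` is a hypothetical knob for clipping outliers.

```python
def fake_quantize(x, scale, bits=8):
    """Round x to the nearest representable level and return it as a float.
    QAT applies this in the forward pass so training sees the rounding error."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale

def calibrate_scale(samples, bits=8, keep_fraction=1.0):
    """PTQ-style calibration: derive a scale from observed magnitudes.
    keep_fraction < 1.0 ignores the largest values (hypothetical knob)."""
    qmax = 2 ** (bits - 1) - 1
    mags = sorted(abs(s) for s in samples)
    idx = min(len(mags) - 1, max(0, int(keep_fraction * len(mags)) - 1))
    clip = mags[idx]
    return clip / qmax if clip > 0 else 1.0

# Made-up calibration data: nine typical values and one large outlier.
calibration = [0.02, -0.05, 0.01, 0.04, -0.03, 0.06, -0.02, 0.03, 0.05, 4.0]
scale_max = calibrate_scale(calibration, bits=4)                       # range set by the outlier
scale_clip = calibrate_scale(calibration, bits=4, keep_fraction=0.9)   # outlier ignored

def mean_abs_error(samples, scale, bits):
    return sum(abs(s - fake_quantize(s, scale, bits)) for s in samples) / len(samples)

typical = calibration[:-1]
err_max = mean_abs_error(typical, scale_max, bits=4)
err_clip = mean_abs_error(typical, scale_clip, bits=4)
print(err_max, err_clip)
```

With the range set by the outlier, every typical value rounds to zero at 4-bit. Clipping gives the typical values far finer resolution, at the price of a large error on the outlier itself, which is the kind of trade-off that calibration-based methods try to manage more carefully.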

The accuracy trade-off

Quantization always discards some information, so there is a quality cost, but how large it is depends on how aggressively you quantize:

  • INT8: often nearly lossless; a safe default for most deployments.
  • INT4: much smaller, but can noticeably degrade quality unless a calibration-based method such as GPTQ or AWQ protects the most sensitive weights.
  • Below 4-bit: possible but increasingly risky; quality can fall sharply.
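A rough way to see this trend is to measure the rounding error of a simple quantizer at different bit-widths. The sketch below uses synthetic Gaussian weights and one shared scale per tensor, both assumptions made for illustration; it measures numeric error, not model quality, and practical 4-bit methods do considerably better by using per-group scales and calibration.

```python
import random

def quantization_rms_error(weights, bits):
    """RMS error after a symmetric quantize/dequantize round trip, one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 127, 7, and 1 for 8, 4, and 2 bits
    scale = max(abs(w) for w in weights) / qmax
    total = 0.0
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))
        total += (w - q * scale) ** 2
    return (total / len(weights)) ** 0.5

rng = random.Random(0)                      # fixed seed so runs are repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]

errors = {bits: quantization_rms_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit RMS error: {err:.4f}")
```

Each step down in bit-width leaves fewer representable levels, so the error grows quickly rather than gradually.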

The right setting balances how much memory you need to save against how much quality you can tolerate losing.
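One simple way to frame that balance, assuming higher precision is always preferred when it fits, is to pick the widest bit-width whose raw weights stay within a memory budget. The helper below is hypothetical and ignores activations, the key-value cache, and quantization metadata.

```python
def widest_bits_that_fit(num_params, budget_gb, candidates=(16, 8, 4)):
    """Return the highest-precision candidate whose raw weights fit, else None."""
    for bits in sorted(candidates, reverse=True):
        if num_params * bits / 8 / 1e9 <= budget_gb:
            return bits
    return None

choice_24gb = widest_bits_that_fit(13e9, budget_gb=24)   # 16-bit needs 26 GB, 8-bit needs 13 GB
choice_8gb = widest_bits_that_fit(13e9, budget_gb=8)     # only 4-bit (6.5 GB) fits
choice_4gb = widest_bits_that_fit(13e9, budget_gb=4)     # no candidate fits
print(choice_24gb, choice_8gb, choice_4gb)
```

In practice it is wise to leave headroom below the budget, since the weights are not the only thing held in memory during inference.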

When to use quantized models

Reach for quantization when you want to:

  • Run locally on a single GPU, laptop, or edge device.
  • Cut inference costs by fitting more on each accelerator.
  • Lower latency thanks to smaller, faster numeric operations.

For most users, an INT8 or a well-made INT4 build offers a large memory saving with a quality drop small enough to ignore. To compare the specific formats used for local inference, see Quantization Methods Compared: GPTQ vs AWQ vs GGUF.
