Definition
FP16 (half precision) and BF16 (bfloat16) are 16-bit floating-point number formats used to train and run deep learning models more efficiently than the traditional 32-bit FP32 format. By using half the bits per value, they roughly halve memory consumption and accelerate computation on GPU tensor cores, which is why they are central to modern large-model training.
How floating-point formats split their bits
A floating-point number is divided into a sign bit, exponent bits (which set the magnitude/range), and mantissa bits (which set the precision):
- FP32 — 1 sign, 8 exponent, 23 mantissa. Wide range, high precision, 4 bytes.
- FP16 — 1 sign, 5 exponent, 10 mantissa. Good precision but a narrow range, so it overflows and underflows easily. 2 bytes.
- BF16 — 1 sign, 8 exponent, 7 mantissa. Same range as FP32 but coarser precision. 2 bytes.
The key insight is that BF16 keeps FP32’s exponent, so it can represent the same span of very large and very small magnitudes — it just rounds more aggressively.
The precision-versus-range trade-off
Deep networks, especially large language models, produce activations and gradients that span an enormous range of magnitudes. With FP16’s small exponent, those values can silently become zero (underflow) or infinity (overflow), destabilising training. The standard fix is loss scaling: multiply the loss by a large factor before the backward pass, then divide it back out. BF16 mostly avoids this problem because its range matches FP32, at the cost of carrying fewer significant digits.
Mixed-precision training
In practice neither format is used in isolation. Mixed-precision training keeps a master copy of the weights and sensitive operations (like large reductions and softmax) in FP32, while performing the bulk of matrix multiplications and storing most activations in FP16 or BF16. This delivers the memory and speed benefits of 16-bit math while preserving the numerical stability needed for convergence.
Why it matters
On Ampere-generation and newer hardware, BF16 has become the default for LLM training because it is stable without the bookkeeping of loss scaling and still halves memory versus FP32 — letting practitioners fit larger models and bigger batches on the same GPUs. FP16 remains common for inference and on older hardware where BF16 is unsupported. Choosing the right format is a routine but consequential decision in any large-scale training pipeline.