Batch Size (AI Glossary)

How many training examples to process at once before updating model weights


Definition

Batch size is the number of training examples a model processes before it performs a single update to its weights. Modern neural networks are typically trained with mini-batch stochastic gradient descent (SGD) or a variant of it: the dataset is divided into batches, the model computes one averaged gradient per batch, and one weight update follows. Because it sets both how many updates happen per pass over the data and how much data each update sees, batch size is one of the most important training hyperparameters, influencing speed, memory use, and final model quality.
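The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name minibatch_sgd, the toy one-dimensional regression data, and the learning rate and epoch count are all assumptions chosen so the example converges quickly.

```python
import random

def minibatch_sgd(xs, ys, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w * x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # new batch composition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2 * err * xs[i]
                grad_b += 2 * err
            # One averaged gradient per batch, then one weight update.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
            updates += 1
    return w, b, updates

# Hypothetical toy data: 20 noiseless points on the line y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b, updates = minibatch_sgd(xs, ys, batch_size=4)
print(round(w, 3), round(b, 3), updates)
```

With 20 examples and a batch size of 4, each epoch performs 5 updates, so the batch size alone determines how often the weights move.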

The three regimes

  • Full-batch — the gradient is computed over the entire dataset before each update. Accurate but impractical for large datasets.
  • Mini-batch — the standard approach, using batches of tens to thousands of examples. It balances stable gradients with frequent updates.
  • Stochastic (batch size 1) — one example per update. Very noisy, but the noise can help escape poor local minima.
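One way to see the difference between the three regimes is to count weight updates per epoch. The sketch below assumes a hypothetical dataset of 50,000 examples and a mini-batch size of 128, and it assumes the final partial batch is kept rather than dropped.

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_examples / batch_size)

n = 50_000  # hypothetical dataset size
regimes = {"full-batch": n, "mini-batch": 128, "stochastic": 1}
counts = {name: updates_per_epoch(n, bs) for name, bs in regimes.items()}

for name, count in counts.items():
    print(f"{name}: {count} update(s) per epoch")
```

Full-batch training makes a single update per epoch, while batch size 1 makes one update per example.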

The large-vs-small trade-off

The choice of batch size is a genuine trade-off:

  • Large batches produce smoother, more reliable gradient estimates, make efficient use of GPU parallelism, and need fewer update steps per epoch. The downsides are higher memory consumption and a tendency to generalise slightly worse if the learning rate is not tuned.
  • Small batches need far less memory and inject more noise into each update, which often improves generalisation. The cost is that each epoch takes more update steps, so it usually runs slower on parallel hardware, and training is less stable.
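The claim that larger batches give smoother gradient estimates can be checked with a small simulation. The sketch below is illustrative only: the per-example "gradients" are synthetic Gaussian numbers rather than gradients from a real model, and the helper name gradient_noise is an assumption. It measures how much the batch-averaged gradient varies from one random batch to the next.

```python
import random
import statistics

def gradient_noise(per_example_grads, batch_size, trials=2000, seed=0):
    """Standard deviation of the batch-mean gradient across random batches."""
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.sample(per_example_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

# Synthetic per-example gradients (assumed: mean 0.5, standard deviation 1.0).
data_rng = random.Random(42)
grads = [data_rng.gauss(0.5, 1.0) for _ in range(10_000)]

noise_small = gradient_noise(grads, batch_size=4)
noise_large = gradient_noise(grads, batch_size=64)
print(noise_small, noise_large)
```

In this setup the noise shrinks roughly with the square root of the batch size, so a 16-fold larger batch gives an estimate that is about 4 times less noisy.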

Gradient accumulation

When you want the stability of a large batch but lack the GPU memory to hold one, gradient accumulation is the standard trick. You compute gradients over several small batches, sum (or average) them, and only then perform one weight update — effectively simulating a large batch. This is heavily used when fine-tuning large language models on modest hardware.
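The equivalence behind gradient accumulation can be verified directly on a toy problem. This is a minimal sketch with hypothetical helper names (mse_grad, accumulated_grad) and a single-weight model. It assumes equal-sized micro-batches, in which case averaging the micro-batch gradients reproduces the large-batch gradient.

```python
def mse_grad(w, xs, ys):
    """Mean gradient of (w * x - y)^2 with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients; assumes len(xs) splits evenly."""
    num_micro = len(xs) // micro_batch_size
    total = 0.0
    for k in range(num_micro):
        lo = k * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each micro-batch gradient so the sum is an average.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)                               # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)  # four batches of 2
print(full, accum)
```

Only one micro-batch has to be held in memory at a time, yet the single update that follows uses the same gradient a batch of 8 would have produced. Layers whose behaviour depends on batch statistics, such as batch normalisation, are an exception to this exact equivalence.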

Relationship with learning rate

Batch size does not stand alone. Because larger batches yield less noisy gradients, they can usually tolerate a larger learning rate, and they often need one to make comparable progress with fewer updates per epoch. A common heuristic is to scale the learning rate roughly in proportion to the batch size (double the batch, double the learning rate), often with a short warm-up period to keep early training stable.
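The proportional-scaling heuristic with warm-up might look like the sketch below. The function names, the reference point (learning rate 0.1 at batch size 256), and the 5-step linear warm-up are illustrative assumptions, not recommended values.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Scale the learning rate in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size

def lr_at_step(step, peak_lr, warmup_steps):
    """Linear warm-up to peak_lr, then constant (step counts from 0)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

# Hypothetical reference: lr 0.1 worked at batch size 256; now using 1024.
peak = scaled_lr(base_lr=0.1, base_batch_size=256, batch_size=1024)
schedule = [lr_at_step(s, peak, warmup_steps=5) for s in range(8)]
print(peak, schedule)
```

Quadrupling the batch size quadruples the target learning rate, and the warm-up ramps up to it over the first few updates instead of applying it from step zero.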

Why it matters

Batch size directly shapes how fast a model trains, how much memory it needs, and how well it generalises. Getting it right — together with the learning rate — is one of the practical levers that separates a model that trains smoothly from one that diverges or overfits.
