What is inference?
Inference is the stage where a trained model is used: it takes a new input, typically one it did not see during training, and produces an output. For a language model, that means turning your prompt into generated text. For an image classifier, it means labelling a new photo. Crucially, the model’s weights do not change during inference; they are frozen. Inference is simply applying what was already learned.
Inference vs training
The two phases of a model’s life are easy to confuse:
- Training — the model repeatedly sees examples (labelled data, or raw text from which a language model learns to predict the next token) and adjusts its internal weights to reduce error. This is compute-heavy but happens once, or periodically when the model is updated.
- Inference — the model applies its fixed weights to fresh inputs. This happens every time a user sends a request, potentially billions of times.
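The split between the two phases can be made concrete with a toy model. The sketch below is only an illustration under simple assumptions: a two-parameter linear model fitted by plain gradient descent, with function names (`predict`, `train_step`) invented for this example rather than taken from any library.

```python
# Toy linear model y = w * x + b in plain Python.
# All names and numbers are illustrative assumptions, not a real library API.

def predict(weights, x):
    """Inference: apply fixed weights to an input. Nothing is modified."""
    w, b = weights
    return w * x + b


def train_step(weights, examples, lr=0.05):
    """Training: one gradient-descent step on mean squared error.

    Returns a new weight tuple; this is the only place weights change.
    """
    w, b = weights
    n = len(examples)
    grad_w = sum(2 * (predict(weights, x) - y) * x for x, y in examples) / n
    grad_b = sum(2 * (predict(weights, x) - y) for x, y in examples) / n
    return (w - lr * grad_w, b - lr * grad_b)


# Training phase: repeated weight updates on examples drawn from y = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weights = (0.0, 0.0)
for _ in range(2000):
    weights = train_step(weights, examples)

# Inference phase: the weights are frozen and reused for every new request.
frozen = weights
outputs = [predict(frozen, x) for x in (10.0, -4.0, 0.5)]
```

Only `train_step` ever produces new weights. `predict` can be called any number of times on inputs that were not in the training examples, and `frozen` stays exactly as training left it.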
A useful analogy: training is like a student studying for an exam, while inference is the student sitting the exam over and over for every new question.
Why inference dominates cost
Because inference runs on every interaction, its cumulative cost usually exceeds the one-time cost of training for a widely used product. A model trained once over a few weeks might then serve requests for years. This is why so much engineering effort goes into making inference cheaper and faster: small per-request savings multiply enormously at scale.
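A back-of-the-envelope calculation shows how this plays out. Every figure in the sketch below is a made-up assumption chosen to keep the arithmetic easy to follow; none of them describes a real model or provider.

```python
# Hypothetical figures, in arbitrary currency units. Not real-world data.
TRAINING_COST = 5_000_000.0      # one-time cost of training
COST_PER_REQUEST = 0.002         # cost of serving a single request
REQUESTS_PER_DAY = 10_000_000    # traffic for a widely used product


def cumulative_inference_cost(cost_per_request, requests_per_day, days):
    """Total serving cost after a given number of days at constant traffic."""
    return cost_per_request * requests_per_day * days


def days_to_match_training(training_cost, cost_per_request, requests_per_day):
    """Days of serving after which inference spend equals the training spend."""
    return training_cost / (cost_per_request * requests_per_day)


break_even_days = days_to_match_training(
    TRAINING_COST, COST_PER_REQUEST, REQUESTS_PER_DAY
)
yearly_cost = cumulative_inference_cost(COST_PER_REQUEST, REQUESTS_PER_DAY, 365)

# Effect of shaving 10% off the per-request cost, accumulated over one year.
yearly_saving = yearly_cost - cumulative_inference_cost(
    0.9 * COST_PER_REQUEST, REQUESTS_PER_DAY, 365
)
```

With these assumed numbers, inference spend catches up with the training bill after about 250 days, and a 10% per-request saving is worth roughly 730,000 units a year.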
Inference-time optimisations
Several techniques make inference more efficient without retraining the model:
- Batching — grouping multiple requests so the hardware processes them together, improving throughput.
- KV-cache — storing the attention keys and values from earlier tokens so the model does not recompute them for every new token in a generation.
- Quantisation — storing weights at lower numerical precision (for example 8-bit or 4-bit) to cut memory use and speed up math.
- Speculative decoding — a small, fast “draft” model proposes several tokens ahead, and the larger model verifies them together in a single pass, keeping the ones it agrees with. This accelerates generation.
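The KV-cache item in the list above is perhaps the easiest to see in code. The following is a minimal sketch, not how any particular framework implements it: it uses single-head attention over tiny vectors, a hypothetical `project` function that merely scales its input in place of learned key and value projections, and a sequence that is fixed in advance instead of being generated token by token.

```python
import math

K_SCALE, V_SCALE = 0.5, 2.0  # stand-ins for learned projection weights


def project(token_vec, scale):
    """Hypothetical key/value projection: scales the vector."""
    return [scale * x for x in token_vec]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return [
        sum((e / total) * v[i] for e, v in zip(exps, values))
        for i in range(len(values[0]))
    ]


def decode_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, K_SCALE) for x in prefix]
        values = [project(x, V_SCALE) for x in prefix]
        projections += 2 * t
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Project only the newest token and append the result to the cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for x in tokens:
        key_cache.append(project(x, K_SCALE))
        value_cache.append(project(x, V_SCALE))
        projections += 2
        outputs.append(attend(x, key_cache, value_cache))
    return outputs, projections


tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
slow_out, slow_projections = decode_without_cache(tokens)
fast_out, fast_projections = decode_with_cache(tokens)
```

Both functions return the same attention outputs, but the uncached version performs 20 projections for these four tokens while the cached version performs 8. The gap grows quadratically with sequence length, which is why the cache matters for long generations, at the price of the memory needed to hold it.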
Together, these let providers serve large models at acceptable latency and cost, which is what makes interactive AI products practical.
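Quantisation, also from the list above, can be sketched in a few lines. The example assumes one simple scheme, symmetric per-tensor 8-bit quantisation with a single scale factor; production systems often use finer-grained or more elaborate schemes, and the function names here are illustrative only.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights for illustration.
weights = [0.82, -1.27, 0.003, 0.45, -0.61]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half a quantisation step.
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
```

Each weight now fits in one byte instead of four (for 32-bit floats), and the restored values differ from the originals by at most half of `scale`. Note that the smallest weight, 0.003, rounds to zero: that loss of fine detail is the accuracy trade-off quantisation makes in exchange for lower memory use.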