Question 1

What is inference in machine learning?

Accepted Answer

Inference is the phase in which a trained model takes new, unseen input and produces an output, such as a prediction, a classification, or generated text. It is the model 'doing its job' once training is finished.

Question 2

How is inference different from training?

Accepted Answer

Training adjusts the model's weights to reduce error on a dataset (labelled examples, in the supervised case) and happens once or periodically. Inference keeps those weights frozen, runs only the forward pass, and happens every time a user interacts with the model.
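The contrast can be sketched with a toy model. This is a minimal illustration, not how any production system is built: the one-dimensional linear model, the function names train and predict, the learning rate, and the data are all assumptions made for the example.

```python
# Toy 1-D linear model y = w * x + b, used only to contrast the two phases.

def predict(weights, x):
    """Inference: a forward pass that reads the weights and never changes them."""
    w, b = weights
    return w * x + b


def train(examples, lr=0.05, epochs=500):
    """Training: repeatedly adjust the weights to reduce squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            error = predict((w, b), x) - y
            # Gradient step for the per-example loss 0.5 * error**2.
            w -= lr * error * x
            b -= lr * error
    return (w, b)  # returned as a tuple, so the weights are frozen from here on


# Hypothetical labelled data drawn from y = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weights = train(examples)            # done once (or periodically)
prediction = predict(weights, 10.0)  # done on every request
```

Here train is the only function that writes to the weights; predict can be called any number of times without changing them, which is what "frozen" means in practice.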

Question 3

Why is inference so expensive at scale?

Accepted Answer

Training is a large but bounded cost, whereas inference cost accrues on every request for as long as the product stays in service. For a popular product, the cumulative compute spent on inference can come to exceed the original training cost.
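The arithmetic behind that claim is simple enough to sketch. All figures below are hypothetical, chosen only to make the numbers round; they are not measurements of any real model, and "compute units" is an arbitrary unit invented for the example.

```python
import math

# Hypothetical figures in arbitrary "compute units" (not real measurements).
TRAINING_COST = 1_000_000_000   # one-off cost of training the model
COST_PER_REQUEST = 2            # cost of serving a single inference request
REQUESTS_PER_DAY = 10_000_000   # traffic for a popular product


def cumulative_inference_cost(days):
    """Total inference spend after the given number of days in service."""
    return COST_PER_REQUEST * REQUESTS_PER_DAY * days


# Days until cumulative inference spend matches the training cost.
daily_cost = COST_PER_REQUEST * REQUESTS_PER_DAY
days_to_match = math.ceil(TRAINING_COST / daily_cost)

# Ratio of inference spend to training spend after one year.
ratio_after_year = cumulative_inference_cost(365) / TRAINING_COST
```

Under these assumed numbers, inference spend matches the training cost after 50 days and is 7.3 times larger after a year. Different traffic or per-request costs move the break-even point, but the inference total keeps growing with usage while the training cost does not.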

Question 4

What optimisations speed up inference?

Accepted Answer

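Common techniques include batching multiple requests together so they share a forward pass, caching attention keys and values (KV-cache) so earlier tokens are not recomputed, quantising weights to lower precision to cut memory use and bandwidth, and speculative decoding, in which a smaller draft model proposes tokens that the larger model then verifies.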
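As an illustration of one item on that list, weight quantisation, here is a minimal sketch of symmetric 8-bit quantisation in plain Python. The function names, the single per-tensor scale, and the sample weights are assumptions for the example; real inference libraries use more elaborate schemes (per-channel scales, calibration, packed integer kernels).

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights; a real layer would hold millions of values.
weights = [0.82, -1.27, 0.003, 0.5, -0.41]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Worst-case rounding error, which is bounded by half a quantisation step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now stored as a small integer (one byte instead of four for 32-bit floats) plus one shared scale, at the price of a rounding error of at most half a quantisation step per weight.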

Inference (AI Glossary)
