Question 1

What is the difference between training and inference?

Accepted Answer

Training is the compute-heavy process of teaching a model by adjusting its parameters on huge datasets; it happens once, or is repeated only occasionally. Inference is using the finished model, with its parameters fixed, to produce an output from new input. Training is like writing the textbook; inference is like answering a question with it. You train rarely and run inference constantly.
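To make the split concrete, here is a minimal sketch using a toy one-variable linear model rather than a real neural network. The function names `train` and `predict`, the learning rate, and the step count are illustrative assumptions, not part of any library.

```python
# Toy model: y = w * x + b. All names and settings here are illustrative.

def predict(params, x):
    """Inference: a single forward pass with fixed parameters."""
    w, b = params
    return w * x + b


def train(data, steps=2000, lr=0.01):
    """Training: repeatedly adjust parameters to reduce mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (predict((w, b), x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict((w, b), x) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Training data generated from the rule y = 3x + 1.
data = [(x, 3 * x + 1) for x in range(-5, 6)]

params = train(data)          # many passes over the data, done rarely
answer = predict(params, 10)  # one pass for a new input, done constantly
```

Training loops over the whole dataset thousands of times, while each prediction is a single cheap evaluation. Real models differ in scale, not in this basic shape.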

Question 2

Why does inference cost money every time?

Accepted Answer

Each request runs the model's computation, billions of multiplications across its parameters, on expensive GPU hardware for that specific input. Unlike a traditional lookup, there is by default no stored answer to return; the model computes the output fresh each time. That is why API providers typically charge per token of input and output.
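As a rough illustration of per-token billing, the arithmetic can be sketched as follows. The prices and traffic figures are hypothetical placeholders, not any provider's actual rates, and `request_cost` is an illustrative helper name.

```python
def request_cost(input_tokens, output_tokens,
                 price_in_per_million, price_out_per_million):
    """Cost of one request when input and output tokens are billed separately."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


# Hypothetical prices per million tokens (assumed for illustration only).
PRICE_IN = 2.0
PRICE_OUT = 8.0

# One request: 1,000 input tokens and 500 output tokens.
cost = request_cost(1_000, 500, PRICE_IN, PRICE_OUT)

# The same request repeated 100,000 times a day (assumed traffic).
daily_cost = cost * 100_000
```

A single request costs a fraction of a cent under these assumed prices, but because every request is billed, the total scales linearly with traffic.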

Question 3

Why is inference latency important?

Accepted Answer

Latency is how long a model takes to respond, and for chatbots, coding assistants, and live applications it directly shapes the user experience. Larger models are generally slower because they do more computation per token, and because output is produced one token at a time, longer responses take longer. Teams trade off model size, hardware, and optimisation techniques to hit the latency their product needs.
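A common simplification, assumed here rather than taken from the text, splits response time into the wait for the first token plus a fixed time for each later token. The timing figures below are made-up placeholders for a "small" and a "large" model, not measurements.

```python
def estimated_latency_s(output_tokens, time_to_first_token_s,
                        time_per_output_token_s):
    """Estimate total response time under a simple two-part latency model."""
    later_tokens = max(output_tokens - 1, 0)
    return time_to_first_token_s + later_tokens * time_per_output_token_s


# Hypothetical timings (assumptions for illustration only).
small_model = estimated_latency_s(200, time_to_first_token_s=0.2,
                                  time_per_output_token_s=0.01)
large_model = estimated_latency_s(200, time_to_first_token_s=0.5,
                                  time_per_output_token_s=0.04)
```

Under these assumed numbers, a 200-token reply takes about 2.2 seconds from the small model and about 8.5 seconds from the large one, which is the kind of gap that decides whether a live product feels responsive.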

Question 4

What does inference optimisation actually do?

Accepted Answer

It makes a model run faster and cheaper, in most cases without retraining it. Techniques include quantisation (using lower-precision numbers), batching (processing many requests together), and caching repeated computation. Distillation goes a step further and trains a smaller model to imitate the larger one. Together these can cut cost and latency several-fold while keeping output quality acceptable.
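Quantisation is the easiest of these to show in a few lines. The sketch below uses symmetric 8-bit quantisation with a single shared scale, which is one simple scheme among several; the weight values and function names are illustrative assumptions. Plain Python lists do not actually save memory, so this shows only the arithmetic.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in quantized]


# Illustrative weights; a real model has billions of these.
weights = [0.12, -0.5, 0.33, 0.9, -0.77]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error introduced by the lower precision.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value now fits in one byte instead of the four used by a 32-bit float, at the price of a rounding error of at most half the scale per weight. That small, bounded loss is why quality can stay acceptable while memory use and compute drop.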

What Is Inference in AI? Training vs Inference Explained

The two phases of a model’s life

What happens during inference

Why inference costs money

Why latency matters

How inference is optimised