Two phases, one model
Every AI model lives through two fundamentally different phases. Training is the learning phase, where the model learns by repeatedly making predictions, measuring how wrong each one was, and nudging its internal numbers (the weights) to do better next time. Inference is the using phase, where the finished model takes a new input and produces an output without changing itself at all. A useful analogy: training is years of medical school, inference is the doctor seeing a patient. The same brain is involved, but one phase builds the expertise and the other applies it. Understanding this split clarifies why AI costs and hardware look the way they do.
What happens during training
Training runs the model forwards and backwards. In the forward pass the model predicts an output. A loss function then scores how far that prediction was from the correct answer. In the backward pass (backpropagation), gradients flow back through the network so that every weight can be adjusted slightly in the direction that reduces the error. This loop repeats over a massive dataset, often many times, gradually shaping billions of parameters. It is extraordinarily compute-heavy: large models train on clusters of GPUs for days, weeks, or longer, consuming enormous energy. Crucially, training is a one-time investment per model version: once a run finishes, the weights are frozen and change only if someone trains the model again.
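The loop described above can be sketched on a toy scale. This is a minimal illustration, not how a production framework does it: the model has a single weight, the loss is squared error, and the names `train`, `data`, `lr`, and `epochs` are assumptions chosen for the example.

```python
# Toy training loop: fit y = w * x with plain gradient descent.
# Everything here (function name, data, learning rate) is illustrative.

def train(data, lr=0.01, epochs=200):
    w = 0.0  # the model's single weight, starting from an arbitrary value
    for _ in range(epochs):              # pass over the dataset many times
        for x, y in data:
            pred = w * x                 # forward pass: make a prediction
            loss = (pred - y) ** 2       # loss: how wrong was the prediction?
            grad = 2 * (pred - y) * x    # backward pass: d(loss)/d(w)
            w -= lr * grad               # nudge the weight to reduce the error
    return w

# Hypothetical dataset generated by the rule y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data)
print(w)  # ends up very close to 2.0
```

A real network repeats the same three steps (forward, loss, backward and update) with billions of weights instead of one, which is where the compute bill comes from.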
What happens during inference
Inference uses only the forward pass. The trained, frozen model receives your prompt, runs it through the network, and emits an output. There is no loss calculation and no weight update, so the model does not learn from what it sees. A language model generating text runs one forward pass per output token rather than a single pass for the whole reply, but each pass skips the backward pass and the update, and nothing is repeated over a dataset. An individual inference is therefore far cheaper and faster than training. This is why you can get an answer from a model in a second or two even though the model that produces it may have taken weeks and millions of dollars to train. The expertise is baked in; inference simply reads it out.
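As a rough sketch of the same idea in code, inference is just the forward computation applied to fixed weights. The `predict` function and the weight values below are hypothetical, and the weights are stored in an immutable tuple to stress that nothing is updated.

```python
# Toy inference: forward pass only, with frozen weights.
# The weight values are made up for illustration.

def predict(weights, x):
    w, b = weights
    return w * x + b  # forward pass: no loss, no gradient, no update

frozen_weights = (2.0, 0.5)  # a tuple cannot be modified in place

outputs = [predict(frozen_weights, x) for x in (1.0, 2.0, 3.0)]
print(outputs)          # [2.5, 4.5, 6.5]
print(frozen_weights)   # (2.0, 0.5), unchanged after any number of calls
```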
Why the cost structures differ
Training is a large, fixed, up-front cost paid once per model version. Inference is a small, variable cost paid per request — but multiplied across every user and every query. A widely used model serves billions of inferences, so the cumulative inference bill over its lifetime can dwarf the original training cost. This economic split drives much of modern AI engineering: teams invest heavily to train a strong model once, then work hard to make inference cheap through quantization (using lower-precision numbers), distillation (training a smaller model to mimic a larger one), batching, and caching.
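A back-of-the-envelope calculation makes the fixed-versus-variable split concrete. Every figure below is a made-up assumption for illustration, not data about any real model, and the function names are hypothetical.

```python
# Fixed training cost vs. cumulative per-request inference cost.
# All dollar amounts and request counts are hypothetical.

def lifetime_inference_cost(cost_per_request, requests):
    return cost_per_request * requests

def break_even_requests(training_cost, cost_per_request):
    # Number of requests at which total inference spend equals training spend.
    return training_cost / cost_per_request

training_cost = 10_000_000.0    # assumed one-time training cost, in dollars
cost_per_request = 0.002        # assumed cost of one inference, in dollars
requests = 10_000_000_000       # assumed lifetime traffic: 10 billion requests

total = lifetime_inference_cost(cost_per_request, requests)
break_even = break_even_requests(training_cost, cost_per_request)

print(f"lifetime inference cost: ${total:,.0f}")
print(f"break-even at about {break_even:,.0f} requests")
```

Under these assumed numbers the inference bill passes the training bill at roughly 5 billion requests and ends at about twice the training cost. That is why shaving even a fraction off the per-request cost through quantization, distillation, batching, or caching pays off at scale.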
Why the distinction matters in practice
Knowing which phase you are in explains common behaviors. A model does not “remember” your last conversation because inference does not update weights; any apparent memory comes from context fed back in, not learning. Making a model permanently better at your task requires fine-tuning — a fresh, smaller training run — not just clever prompting. And when people compare model “cost,” they may mean the colossal one-time training spend or the ongoing per-call inference price, which are wildly different numbers. Keeping training and inference straight is the foundation for reasoning about AI performance, cost, and capability.
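The point about memory can be illustrated with a deliberately simple stand-in for a frozen model. `stateless_model` is a hypothetical function that can only use what is in the prompt it receives. It is not a real model or API, but it shows why earlier turns have to be fed back in.

```python
# A stateless "model": its answer depends only on the prompt text,
# because nothing is stored between calls. Purely illustrative.

def stateless_model(prompt: str) -> str:
    marker = "My name is "
    if marker in prompt:
        name = prompt.split(marker, 1)[1].split(".", 1)[0]
        return f"Your name is {name}."
    return "I don't know your name."

first_turn = "User: My name is Ada."
second_turn = "User: What is my name?"

# Sending only the new message: the earlier turn is gone.
without_context = stateless_model(second_turn)

# Sending the history along with the new message: apparent "memory".
with_context = stateless_model(first_turn + "\n" + second_turn)

print(without_context)  # I don't know your name.
print(with_context)     # Your name is Ada.
```

Chat applications do the equivalent of the second call: they resend the conversation so far with each request, while the model's weights stay exactly as training left them.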