How Much Does It Cost to Train an LLM?

Training GPT-4 is estimated to have cost over $100 million; here is how AI training costs are calculated

What you are actually paying for

The cost of training a large language model is, at heart, the cost of compute — the raw mathematical operations needed to adjust billions of parameters over trillions of words of text. That compute is bought as GPU-hours: time on specialised accelerators like NVIDIA H100s, either rented from a cloud provider or owned outright. A single training run can occupy thousands of GPUs for weeks. On top of the compute bill sit data acquisition and cleaning, electricity, networking, storage, and the salaries of the research teams — costs that often exceed the headline training number.

The FLOPs formula

Engineers estimate compute in FLOPs (floating-point operations). A widely used rule of thumb is that training a dense transformer costs roughly 6 × N × D FLOPs, where N is the number of model parameters and D is the number of training tokens. This makes the core driver obvious: cost grows with both how big the model is and how much data it sees. Once you have the FLOPs estimate, you convert it to dollars in two steps. Dividing by one GPU’s effective throughput (FLOPs per second, accounting for real-world efficiency, which is well below peak) gives GPU-seconds; dividing by 3,600 turns those into GPU-hours, which you then multiply by the price per GPU-hour.
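
The arithmetic is short enough to sketch in a few lines of Python. This is a rough illustration, not a pricing tool: the function names are made up for this example, and the peak throughput, utilization, and hourly price are placeholder values chosen for round numbers, not quotes for any real GPU or cloud provider.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute for a dense transformer: 6 * N * D."""
    return 6.0 * n_params * n_tokens


def training_gpu_hours(n_params: float, n_tokens: float,
                       peak_flops_per_sec: float, utilization: float) -> float:
    """GPU-hours needed, given one GPU's peak throughput and the fraction
    of that peak actually achieved in practice (between 0 and 1)."""
    effective_flops_per_sec = peak_flops_per_sec * utilization
    gpu_seconds = training_flops(n_params, n_tokens) / effective_flops_per_sec
    return gpu_seconds / 3600.0


def training_cost_usd(n_params: float, n_tokens: float,
                      peak_flops_per_sec: float, utilization: float,
                      price_per_gpu_hour: float) -> float:
    """Dollar estimate: GPU-hours times the hourly price of one GPU."""
    hours = training_gpu_hours(n_params, n_tokens,
                               peak_flops_per_sec, utilization)
    return hours * price_per_gpu_hour


# Illustrative inputs only (all assumptions, not measured figures):
N_PARAMS = 7e9          # 7 billion parameters
N_TOKENS = 1.4e11       # 140 billion training tokens
PEAK_FLOPS = 1e15       # assumed peak of 1 PFLOP/s per GPU
UTILIZATION = 0.4       # assumed 40% of peak achieved
PRICE_PER_HOUR = 2.50   # assumed rental price in USD per GPU-hour

hours = training_gpu_hours(N_PARAMS, N_TOKENS, PEAK_FLOPS, UTILIZATION)
cost = training_cost_usd(N_PARAMS, N_TOKENS, PEAK_FLOPS, UTILIZATION,
                         PRICE_PER_HOUR)
print(f"{hours:,.0f} GPU-hours, about ${cost:,.0f}")
```

With these placeholder inputs the estimate comes to about 4,083 GPU-hours, or roughly $10,200 for a single clean run. Halving the assumed utilization doubles the bill, which is why real-world efficiency matters as much as the sticker price per GPU-hour.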

The Chinchilla compute-optimal insight

For years the instinct was simply to build bigger models. DeepMind’s 2022 Chinchilla study overturned that. It showed that, for a fixed compute budget, many large models were under-trained — they had too many parameters and too little data. The compute-optimal recipe scales parameters and training tokens together, roughly 20 tokens per parameter. The practical lesson is that throwing money at model size alone wastes compute; the best performance per dollar comes from balancing model size against training data. Chinchilla reshaped how labs spend their training budgets.
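
To see what "scaling together" means in numbers, the 20-tokens-per-parameter heuristic can be combined with the 6 × N × D rule of thumb. Substituting D = 20 × N gives C = 120 × N², so a compute budget C implies N = sqrt(C / 120). The sketch below does only that substitution; it is a simplification of the heuristic as stated here, not the fitted scaling laws from the study, and the function name is hypothetical.

```python
import math

# Heuristic ratio from the text above; an approximation, not an exact constant.
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(flops_budget: float,
                          tokens_per_param: float = TOKENS_PER_PARAM):
    """Split a FLOPs budget into (parameters, tokens) under two assumptions:
    training cost is 6 * N * D, and D = tokens_per_param * N."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Example budget of 5.88e23 FLOPs (an illustrative input).
n_params, n_tokens = compute_optimal_split(5.88e23)
print(f"{n_params / 1e9:.0f}B parameters, {n_tokens / 1e12:.1f}T tokens")
```

For a budget of 5.88e23 FLOPs this yields about 70 billion parameters and 1.4 trillion tokens, which lines up with the size and token count reported for the Chinchilla model itself. Because both quantities grow with the square root of the budget, quadrupling compute only doubles the recommended model size.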

Real-world cost estimates

Numbers vary enormously by tier. Training a frontier model like GPT-4 is widely estimated at tens to over a hundred million dollars for the final run alone — and total program cost, including failed runs and research, is far higher. Capable open-source models in the billions-of-parameters range can be trained for thousands to a few million dollars, depending on size and data. At the other extreme, fine-tuning an existing open model on a focused dataset can cost as little as a few hundred dollars of GPU time, because it adjusts an already-trained model rather than building one from scratch.

Why most builders never pay it

The crucial takeaway is that the eye-watering training figures apply to the handful of labs building foundation models from scratch. Almost everyone else builds on top of those models — fine-tuning an open model, using retrieval-augmented generation, or simply calling a hosted API and paying per token of usage. For typical applications, the relevant cost is not training but inference, which is pennies to dollars per million tokens. Understanding the training-cost math is mostly useful for appreciating why frontier AI is concentrated among a few well-funded players — and why standing on their shoulders is the economically sensible choice for nearly everyone else.
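
For comparison, the per-token math that most builders actually face looks like this. The sketch assumes an API that bills input and output tokens at separate per-million rates, which is a common arrangement but an assumption here; the prices, traffic figures, and function name are all hypothetical.

```python
def api_cost_usd(input_tokens: float, output_tokens: float,
                 input_price_per_million: float,
                 output_price_per_million: float) -> float:
    """Usage bill for a hosted model priced per million tokens."""
    input_cost = input_tokens / 1e6 * input_price_per_million
    output_cost = output_tokens / 1e6 * output_price_per_million
    return input_cost + output_cost


# Hypothetical workload and prices, for illustration only:
REQUESTS_PER_MONTH = 100_000
INPUT_TOKENS_PER_REQUEST = 1_000
OUTPUT_TOKENS_PER_REQUEST = 300
INPUT_PRICE = 1.00    # assumed USD per million input tokens
OUTPUT_PRICE = 4.00   # assumed USD per million output tokens

monthly_cost = api_cost_usd(
    REQUESTS_PER_MONTH * INPUT_TOKENS_PER_REQUEST,
    REQUESTS_PER_MONTH * OUTPUT_TOKENS_PER_REQUEST,
    INPUT_PRICE,
    OUTPUT_PRICE,
)
print(f"Estimated monthly API bill: ${monthly_cost:,.2f}")
```

Under these made-up numbers, 100,000 requests a month cost about $220, several orders of magnitude below even a modest from-scratch training run.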
