Why neural networks love parallel hardware
Under the hood, a neural network is mostly matrix multiplication. Each layer takes a vector of numbers, multiplies it by a large grid of weights, and sums the products to produce each output value. A single forward pass through a large language model can involve billions of these multiply-and-add operations. The key insight is that, within a layer, the individual multiplications do not depend on one another’s results, so they can all be computed at the same time.
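A minimal sketch of that independence, in plain Python with made-up weights: each output value of a layer is one row of weights dotted with the input, so the rows can be evaluated in any order, or all at once, and the result is the same. The names `layer_forward` and `layer_forward_parallel` are illustrative and not taken from any library.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(row, x):
    """Multiply matching elements and add up the products."""
    return sum(w * v for w, v in zip(row, x))

def layer_forward(weights, x):
    """One layer: every output value is an independent dot product."""
    return [dot(row, x) for row in weights]

def layer_forward_parallel(weights, x, workers=4):
    """Same computation, with rows handed to a pool of workers.

    The threads only illustrate that no row waits on another;
    they are not expected to make pure Python faster.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: dot(row, x), weights))

# Hypothetical 3x4 weight grid and 4-element input vector.
weights = [
    [1, 0, 2, -1],
    [0, 3, 1, 2],
    [4, -2, 0, 1],
]
x = [2, 1, 3, 4]

sequential = layer_forward(weights, x)
parallel = layer_forward_parallel(weights, x)
print(sequential)              # [4, 14, 10]
print(sequential == parallel)  # True
```

A real layer would have thousands of rows rather than three, which is what gives parallel hardware so much to do at once.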
This is exactly what a GPU is built for. A CPU has a small number of very fast, flexible cores designed to handle complex, branching, sequential programs. A GPU instead has thousands of simpler cores designed to do the same operation on many pieces of data simultaneously. When the work is “multiply these ten thousand numbers in parallel,” the GPU wins by a huge margin.
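The trade-off can be made concrete with a toy timing model. The core counts and per-core speeds below are invented round numbers, not measurements of any real chip, and the model ignores memory traffic and scheduling overhead. It only shows that many slow cores win when the work splits into independent pieces, and lose when each step must wait for the previous one.

```python
import math

def parallel_ticks(num_ops, cores, ops_per_core_per_tick):
    """Ticks needed when independent ops are spread evenly across cores."""
    return math.ceil(num_ops / (cores * ops_per_core_per_tick))

def sequential_ticks(num_ops, ops_per_core_per_tick):
    """Ticks needed when every op depends on the one before it,
    so only a single core can make progress."""
    return math.ceil(num_ops / ops_per_core_per_tick)

NUM_OPS = 10_000

# Hypothetical CPU: few cores, each fast.
CPU = {"cores": 8, "ops_per_core_per_tick": 4}
# Hypothetical GPU: many cores, each slow.
GPU = {"cores": 2_000, "ops_per_core_per_tick": 1}

cpu_parallel = parallel_ticks(NUM_OPS, **CPU)
gpu_parallel = parallel_ticks(NUM_OPS, **GPU)
cpu_chain = sequential_ticks(NUM_OPS, CPU["ops_per_core_per_tick"])
gpu_chain = sequential_ticks(NUM_OPS, GPU["ops_per_core_per_tick"])

print("independent work:", cpu_parallel, "vs", gpu_parallel)  # 313 vs 5
print("dependent chain: ", cpu_chain, "vs", gpu_chain)        # 2500 vs 10000
```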
How CUDA made GPUs programmable for AI
Graphics cards existed long before deep learning, but they were hard to use for anything other than rendering. In 2007 NVIDIA released CUDA, a programming model that let developers write general-purpose parallel code for the GPU in a familiar C-like language.
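To give a feel for the programming model, here is a rough emulation in plain Python, not real CUDA. In CUDA the programmer writes one function (a kernel), and the GPU runs it across a grid of thread blocks, with each thread using its block and thread indices to pick the element it works on. The nested loops below stand in for hardware that would run those threads concurrently; `scale_kernel` and `launch` are hypothetical names for this sketch.

```python
def scale_kernel(block_idx, block_dim, thread_idx, data, factor, out):
    """Body that each emulated thread runs exactly once."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(data):  # the last block may have spare threads
        out[i] = data[i] * factor

def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel for every (block, thread) pair.

    A real GPU would execute these concurrently; the loops here
    only emulate the indexing scheme.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
out = [0.0] * len(data)

block_dim = 4
# Round up so the grid covers every element (10 elements need 3 blocks of 4).
grid_dim = (len(data) + block_dim - 1) // block_dim

launch(scale_kernel, grid_dim, block_dim, data, 2.0, out)
print(grid_dim, out)
```

The bounds check matters because the grid is rounded up: here 12 threads are launched for 10 elements, and the last two must do nothing.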
CUDA’s importance lies as much in the software built around it as in the chips it runs on. On top of it NVIDIA built optimised libraries such as cuDNN for deep-learning primitives, and the major frameworks (PyTorch, TensorFlow, JAX) were written to call those libraries. The result is a deep software moat: even if a competitor ships a faster chip, most of the AI ecosystem already speaks CUDA. This software lock-in, as much as the silicon, explains NVIDIA’s dominance.
What makes a data-centre AI GPU special
A gaming GPU and a data-centre AI GPU like the H100 share DNA but diverge sharply. The AI parts are built around:
- Tensor cores: dedicated units that do mixed-precision matrix math far faster than general GPU cores.
- High-bandwidth memory (HBM): stacked memory placed right next to the GPU die that feeds data fast enough to keep the cores busy, since memory bandwidth, not raw compute, is often the real bottleneck.
- Fast interconnects (NVLink): links that let many GPUs work together as one large pool for training models too big for a single card.
Combined with constrained supply and enormous demand, these features are why a single H100 can sell for around $30,000.
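Why memory is so often the bottleneck can be seen with back-of-the-envelope arithmetic. The sketch below compares how long a matrix-vector multiply would take if it were limited only by compute versus only by memory traffic. The peak figures are assumed round numbers chosen for illustration, not quoted specifications of the H100 or any other part, and the layer size is hypothetical.

```python
def matvec_cost(rows, cols, bytes_per_weight):
    """FLOPs and weight bytes for one matrix-vector multiply.

    Only the weights are counted as memory traffic; the much
    smaller input and output vectors are ignored.
    """
    flops = 2 * rows * cols  # one multiply and one add per weight
    weight_bytes = rows * cols * bytes_per_weight
    return flops, weight_bytes

# Assumed, illustrative hardware limits.
PEAK_FLOPS = 1e15       # arithmetic operations per second
PEAK_BANDWIDTH = 3e12   # bytes per second from memory

# Hypothetical layer: 8192 x 8192 weights stored in 16 bits each.
flops, weight_bytes = matvec_cost(rows=8192, cols=8192, bytes_per_weight=2)

compute_seconds = flops / PEAK_FLOPS
memory_seconds = weight_bytes / PEAK_BANDWIDTH

print(f"FLOPs per byte moved: {flops / weight_bytes:.1f}")
print(f"compute-limited time: {compute_seconds * 1e6:.2f} microseconds")
print(f"memory-limited time:  {memory_seconds * 1e6:.2f} microseconds")
print("memory-bound" if memory_seconds > compute_seconds else "compute-bound")
```

Under these assumptions, fetching the weights takes a few hundred times longer than the arithmetic itself, so the cores would spend most of their time waiting. Processing many inputs in one batch reuses each weight after it is loaded, which raises the FLOPs-per-byte ratio and is one reason batching matters so much in practice.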
The wider accelerator landscape
NVIDIA leads, but it is not alone. Google designs its own TPUs (Tensor Processing Units) used internally and via Google Cloud. AMD ships the Instinct line of accelerators and is building out its ROCm software stack to challenge CUDA. Startups such as Cerebras and Groq pursue radically different architectures, and Apple’s unified-memory chips run smaller models efficiently on consumer devices. For most builders, though, the practical takeaway is simpler: you rarely touch this hardware directly. You either rent it through a cloud GPU provider or, more often, call a hosted model through an API and let someone else run the silicon.