What Mixture of Experts means
Mixture of Experts (MoE) is an architecture that lets a model grow enormous in total size without a proportional increase in the compute spent on each token. Instead of one large dense feed-forward block that every token passes through, an MoE layer contains many parallel sub-networks called experts, alongside a small gating network that decides, per token, which experts should handle it. Only a handful of experts run for any given token, so most of the layer's expert parameters go unused for that token.
Conditional computation: the core idea
The trick MoE exploits is conditional computation: using different parts of the network for different inputs. In a dense model, every parameter participates in every prediction, so doubling capacity roughly doubles the per-token cost. In a sparse MoE, you can add dozens or hundreds of experts to multiply the model’s total capacity, but because the gate selects only the top one or two experts per token, the active compute stays roughly flat (only the small router grows with the number of experts). This decouples a model’s knowledge capacity from its per-token cost, which is why MoE has become a favoured way to scale large language models.
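To make the "roughly flat" claim concrete, here is a rough sketch of per-token compute for a single MoE layer. It assumes each expert is a plain two-matrix feed-forward block and the router is a single linear layer; the function names and the sizes are illustrative and do not come from any particular model.

```python
def ffn_flops_per_token(d_model: int, d_hidden: int) -> int:
    """Approximate FLOPs for one two-matrix feed-forward block on one token."""
    # Two matmuls (d_model x d_hidden, then d_hidden x d_model),
    # counting 2 FLOPs per multiply-add.
    return 2 * (d_model * d_hidden + d_hidden * d_model)


def moe_flops_per_token(d_model: int, d_hidden: int, num_experts: int, top_k: int) -> int:
    """Approximate per-token FLOPs: the router plus only the top_k experts that run."""
    router = 2 * d_model * num_experts  # one score per expert
    return router + top_k * ffn_flops_per_token(d_model, d_hidden)


# Illustrative sizes only (assumptions, not taken from a real model).
D_MODEL, D_HIDDEN, TOP_K = 1024, 4096, 2

for num_experts in (8, 64, 256):
    flops = moe_flops_per_token(D_MODEL, D_HIDDEN, num_experts, TOP_K)
    print(f"{num_experts:>3} experts -> {flops:,} FLOPs per token")
```

Under these assumptions, going from 8 to 256 experts multiplies the expert parameters by 32 while per-token compute rises by under 2%, all of it from the larger router.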
The gating network
The gating network (or router) is the conductor. For each token it produces a score over the available experts and routes the token to the top-k highest-scoring experts, commonly the top one or two. The chosen experts process the token and their outputs are combined, weighted by the gate’s scores (typically normalised so the weights sum to one). Crucially, the gate is trained jointly with the experts, so the model learns a useful specialisation: different experts come to handle different kinds of input, even if no human ever assigns them topics.
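The routing step can be sketched in a few lines of plain Python. This is a minimal illustration rather than any specific model's router: the names softmax, route_token and moe_forward are hypothetical, the toy experts simply scale their input, and renormalising the selected weights to sum to one is an assumption (some routers skip it).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_scores, top_k=2):
    """Pick the top_k experts for one token; return (expert_index, weight) pairs.

    Weights are the softmax probabilities of the chosen experts,
    renormalised to sum to 1 (an assumption of this sketch).
    """
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_forward(token, gate_scores, experts, top_k=2):
    """Run only the selected experts and return their weighted sum."""
    output = [0.0] * len(token)
    for index, weight in route_token(gate_scores, top_k):
        expert_out = experts[index](token)  # unselected experts are never called
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output


# Hypothetical toy experts: each scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
# In a real model these scores come from a learned linear layer applied to the token.
gate_scores = [0.0, 2.0, 2.0, -1.0]

routing = route_token(gate_scores)
result = moe_forward(token, gate_scores, experts)
print(routing)
print(result)
```

Here experts 1 and 2 tie for the highest score, so each receives half the weight and the output is the average of their two results. Experts 0 and 3 do no work for this token.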
Total vs active parameters
The headline numbers around MoE models can be confusing. A model described as having tens or hundreds of billions of total parameters may use only a small fraction of them, the active parameters, to process each token. Total parameters set the ceiling on how much the model can know, and also how much memory it needs; active parameters set the per-token compute cost. MoE’s appeal is getting a high total (lots of capacity) with a low active count (cheap per-token compute).
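A back-of-the-envelope calculation shows how the two figures diverge. The sketch below assumes a simplified model in which every MoE layer has the same number of equally sized experts, and everything else (attention, embeddings, routers) is lumped into one shared count that every token uses. The function name and all numbers are illustrative and do not describe any real model.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k, num_moe_layers):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared_params covers everything every token uses (attention, embeddings, routers).
    """
    total = shared_params + num_moe_layers * num_experts * params_per_expert
    active = shared_params + num_moe_layers * top_k * params_per_expert
    return total, active


# Illustrative numbers only (assumptions for this sketch).
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=150_000_000,
    num_experts=16,
    top_k=2,
    num_moe_layers=32,
)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
print(f"active fraction: {active / total:.1%}")
```

Because the shared parameters are always active, the active fraction is somewhat higher than the bare ratio of top_k to num_experts would suggest.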
Trade-offs
MoE is not free. The router can suffer load imbalance, sending too many tokens to a few popular experts while others starve, which wastes capacity. Training therefore usually adds auxiliary load-balancing losses to spread tokens more evenly. Serving is also more complex, because all experts must be kept in memory and tokens routed dynamically, which complicates batching and distribution across GPUs. In return, MoE delivers far more model capacity per unit of training and inference compute, which is why several frontier-class models adopt it.
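As an illustration of what an auxiliary load-balancing loss can look like, the sketch below implements one common formulation for top-1 routing, similar in spirit to the loss used in Switch Transformer-style routers. The text above does not say which loss is meant, so treat this choice, and the function name load_balancing_loss, as assumptions. In real training the value would be scaled by a small coefficient and added to the main loss.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i f_i * p_i.

    router_probs: per-token lists of router probabilities (each sums to 1).
    assignments:  per-token index of the expert the token was sent to (top-1).
    f_i is the fraction of tokens sent to expert i; p_i is the mean router
    probability for expert i. The value is 1.0 when both are uniform and
    grows as routing concentrates on fewer experts.
    """
    num_tokens = len(assignments)
    fraction = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs, expert in zip(router_probs, assignments):
        fraction[expert] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))


# Balanced routing: four tokens spread evenly over four experts.
balanced = load_balancing_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=4,
)

# Skewed routing: every token goes to expert 0.
skewed = load_balancing_loss(
    router_probs=[[0.7, 0.1, 0.1, 0.1]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=4,
)
print(balanced, skewed)
```

The skewed case scores higher than the balanced one, so minimising this term nudges the router towards spreading tokens across experts.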