What Mixture of Experts is
Mixture of Experts (MoE) is an architecture that lets a language model be enormous in total size while staying cheap to run per token. Instead of one big dense network where every parameter processes every input, an MoE layer contains many smaller sub-networks called experts, plus a small router that decides which experts handle each token. Models such as Mixtral and, reportedly, GPT-4 use this approach to reach very large parameter counts without a matching explosion in compute.
Dense vs sparse models
In a normal dense model, every parameter is used for every token — capacity and compute rise together, so a bigger model is always more expensive to run.
MoE breaks that link by being sparse. Each token only activates a small fraction of the model:
- Total parameters: the full store of all experts plus the shared layers, which can be huge.
- Active parameters: the few experts actually run for a given token, plus the shared layers (such as attention) that every token passes through.
A model might have a trillion total parameters but activate only tens of billions per token, giving it a large model’s capacity at roughly a small model’s per-token compute cost.
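The gap between total and active parameters is easy to check with a little arithmetic. The sketch below is illustrative only: the function name `moe_param_counts` and every number in the configuration are hypothetical, chosen to keep the sums readable, and the router's own (small) parameter count is ignored.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k, num_moe_layers):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared_params covers everything every token passes through
    (attention, embeddings, and so on). Router parameters are ignored.
    """
    expert_total = params_per_expert * num_experts * num_moe_layers
    expert_active = params_per_expert * top_k * num_moe_layers
    return shared_params + expert_total, shared_params + expert_active


# Hypothetical configuration, not taken from any real model.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=150_000_000,
    num_experts=64,
    top_k=2,
    num_moe_layers=32,
)

print(f"total:  {total / 1e9:.1f}B parameters")
print(f"active: {active / 1e9:.1f}B parameters per token")
print(f"active fraction: {active / total:.1%}")
```

With these made-up numbers the model stores about 309B parameters but uses under 12B of them for any single token. Adding more experts raises the first figure without changing the second.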
How routing works
The key component is the gating network (router). For each token, it:
- Looks at the token’s current representation.
- Scores every available expert.
- Selects the top-k experts (commonly top-1 or top-2).
- Sends the token to those experts and blends their outputs, weighted by the router’s scores.
Crucially, the router is learned. It is trained jointly with the experts by gradient descent, so which experts specialise in which kinds of input emerges from training rather than from hand-written rules.
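The routing steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any particular model's implementation: the router is assumed to be a single linear layer, the gate weights are a softmax over only the selected experts' scores (one common choice; other designs apply the softmax before selecting), and the names `route_token`, `router_weights`, and the toy experts are all hypothetical. The weights are fixed by hand here, whereas in a real model they would be learned.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route_token(token, router_weights, experts, top_k=2):
    """Route one token vector through its top-k experts and blend the outputs."""
    # 1. Score every expert: one dot product per expert (a linear router).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # 2. Select the indices of the top-k experts by score.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # 3. Turn the chosen scores into blend weights that sum to 1.
    gates = softmax([logits[i] for i in chosen])
    # 4. Run only the chosen experts and take the weighted sum of their outputs.
    output = [0.0] * len(token)
    for gate, i in zip(gates, chosen):
        expert_out = experts[i](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen, gates


# Toy experts: each one just scales the token by a different constant.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]

# Hand-picked router weights, one row per expert (learned in a real model).
router_weights = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.0, -1.0],
]

token = [2.0, 1.0]
output, chosen, gates = route_token(token, router_weights, experts, top_k=2)
print("chosen experts:", chosen)
print("gate weights:", [round(g, 3) for g in gates])
print("blended output:", [round(o, 3) for o in output])
```

Only the two selected experts are ever called for this token; the other two cost nothing at run time, which is where the compute saving comes from.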
Why MoE is attractive
The headline benefit is scaling efficiency. You can grow capacity (more experts) without proportionally growing the FLOPs spent per token, so inference and training stay affordable relative to a dense model of the same total size. Experts also tend to specialise, which can improve quality on a diverse range of tasks because the model dedicates different sub-networks to different patterns.
The trade-offs
MoE is not a free lunch:
- Memory — every expert must be loaded into memory even though only a couple run per token, so MoE models are memory-hungry to serve.
- Load balancing — without care, the router can over-use a few experts and starve others; auxiliary balancing losses are added to spread the load.
- Serving complexity — routing tokens across experts (often across multiple GPUs) makes deployment and batching harder than for a plain dense model.
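To make the load-balancing point concrete, here is a hedged sketch of one auxiliary loss, written in the style of the formulation used for top-1 routing in the Switch Transformer work: the number of experts times the sum, over experts, of (fraction of tokens sent to that expert) times (mean router probability for that expert). The function name and the toy probabilities are hypothetical, and real implementations differ in details such as the scaling coefficient and how top-k routing is handled.

```python
def load_balancing_loss(router_probs):
    """Auxiliary balancing loss for top-1 routing.

    router_probs: one list per token, giving the router's probability
    for each expert (each inner list sums to 1).
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f[i]: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        top = max(range(num_experts), key=lambda i: probs[i])
        counts[top] += 1
    f = [c / num_tokens for c in counts]

    # p[i]: mean router probability assigned to expert i across tokens.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Tokens split evenly between two experts.
balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
# Every token sent to expert 0, starving expert 1.
collapsed = [[0.9, 0.1]] * 4

print("balanced loss: ", round(load_balancing_loss(balanced), 3))
print("collapsed loss:", round(load_balancing_loss(collapsed), 3))
```

The loss is lowest (about 1.0) when tokens are spread evenly and grows as the router concentrates on a few experts. In training it would be added to the main loss with a small coefficient to nudge the router toward even usage.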
In short, MoE trades memory and engineering complexity for compute efficiency at scale, a trade that pays off well for the largest frontier models. For the basics of how parameter count relates to capacity, see “What Are Parameters in an AI Model?”