Mixture of Experts (MoE) Explained: Why GPT-4 and Mixtral Use It

The architecture trick that makes large AI models efficient


What Mixture of Experts actually is

A standard transformer processes every token through the same dense layers, so doubling the parameters roughly doubles the compute. Mixture of Experts (MoE) breaks that link. Inside each MoE layer there are many parallel sub-networks called experts, typically feed-forward blocks of identical shape that each have their own weights, plus a small router that decides which experts handle each token. Only the chosen experts run, so the model can hold an enormous number of parameters while activating just a slice of them at a time.

The result is sparse activation: capacity scales with the number of experts, but cost scales with how many you actually use per token. This is the core trick behind making very large models practical to serve.

How routing works

For every token, the router produces a score for each expert, usually with a single linear layer followed by a softmax. It then selects the top-k experts — often k=2 — and sends the token only to those. Each selected expert processes the token, and their outputs are combined, weighted by the router’s scores.
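The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the router logits are supplied directly instead of coming from a learned linear layer, the experts are hypothetical scaling functions, and renormalising the top-k weights so they sum to 1 is an assumption that not every model shares.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalised to sum to 1 (an assumption made for this sketch).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]

def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and blend their outputs by weight."""
    output = [0.0] * len(token)
    for index, weight in route_token(router_logits, k):
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Toy setup: four hypothetical "experts" that just scale the token vector.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # made-up router output for one token
chosen = route_token(logits, k=2)
mixed = moe_layer([1.0, -1.0], experts, logits, k=2)
```

With these made-up logits, experts 1 and 3 tie for the top score, so each gets weight 0.5 and the blended output is [3.0, -3.0]. Experts 0 and 2 are never called, which is where the compute saving comes from.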

Because the router is learned during training, the model gradually discovers a useful specialisation: different experts tend to handle different kinds of tokens or patterns. Training usually adds an auxiliary load-balancing loss that nudges the router to spread work evenly, preventing a few popular experts from being overloaded while others sit idle.
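To make the load-balancing idea concrete, here is a minimal sketch of one common formulation, in the style of the Switch Transformer auxiliary loss: multiply, for each expert, the fraction of tokens it received by the average router probability it was given, then sum and scale by the number of experts. The article does not say which variant any particular model uses, and top-1 routing is assumed here only to keep the example short.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i(f_i * p_i).

    router_probs: per-token list of softmax probabilities over experts.
    assignments:  per-token index of the expert that received the token
                  (top-1 routing, assumed for brevity).
    f_i = fraction of tokens sent to expert i
    p_i = mean router probability assigned to expert i
    """
    num_tokens = len(assignments)
    fraction = [0.0] * num_experts
    for expert in assignments:
        fraction[expert] += 1.0 / num_tokens
    mean_prob = [
        sum(p[i] for p in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))

# Two tokens, two experts: an even split versus a collapsed router.
balanced = load_balancing_loss(
    [[0.5, 0.5], [0.5, 0.5]], assignments=[0, 1], num_experts=2)
collapsed = load_balancing_loss(
    [[0.9, 0.1], [0.9, 0.1]], assignments=[0, 0], num_experts=2)
```

The evenly split case scores 1.0 and the collapsed case scores 1.8, so adding this term to the training loss penalises a router that keeps sending everything to the same expert.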

Why it makes large models efficient

Consider Mixtral 8x7B. It has eight experts in each MoE layer but routes each token to only two. Its total parameter count is around 47 billion rather than the 56 billion the name suggests, because only the feed-forward blocks are replicated while the attention layers are shared. About 13 billion parameters are active for any given token, so the per-token compute is roughly that of a 13B dense model, while quality is closer to that of a much larger one. All 47 billion parameters still have to sit in memory. That is the whole pitch: more knowledge capacity per dollar of inference.
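The two headline figures are enough for a back-of-envelope split between shared and per-expert parameters. The sketch below assumes every parameter is either shared by all tokens or belongs to exactly one expert, ignores the small router, and uses the rounded 47B and 13B figures, so the results are rough estimates and not official numbers. The function name is hypothetical.

```python
def split_parameters(total_b, active_b, num_experts, experts_per_token):
    """Solve the two equations
        total  = shared + num_experts * per_expert
        active = shared + experts_per_token * per_expert
    for (shared, per_expert), in billions of parameters.
    per_expert is summed over all layers of the model.
    """
    per_expert = (total_b - active_b) / (num_experts - experts_per_token)
    shared = active_b - experts_per_token * per_expert
    return shared, per_expert

shared_b, per_expert_b = split_parameters(
    47, 13, num_experts=8, experts_per_token=2)
active_fraction = 13 / 47  # share of the weights used for any one token
```

Under these assumptions each expert accounts for roughly 5.7 billion parameters across all layers and roughly 1.7 billion are shared, which means a little over a quarter of the weights do the work for any single token.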

This efficiency is why MoE is attractive at frontier scale. GPT-4 is widely believed to use an MoE design, but OpenAI has never officially confirmed its architecture, so treat that as informed speculation rather than fact. Several open models, including Mixtral, DeepSeek-MoE, and Qwen-MoE, have made the approach mainstream.

The tradeoffs

MoE is not free. The full model — all experts — must still fit in memory even though only a few run per token, so VRAM requirements stay high. Routing can be unstable: if load balancing fails, quality drops and some experts are wasted. Sparse activation also complicates batching and serving, and fine-tuning MoE models is trickier than fine-tuning dense ones.

There is also a quality nuance. An MoE model with N total parameters is usually not as capable as a dense model with N parameters, because only a fraction of the weights see each token. The fair comparison is against a dense model with the same active parameter count, and there MoE typically wins.

When MoE matters to you

As a user or application builder, you rarely interact with the routing directly, but the design explains real behaviour. MoE models can feel fast for their apparent size, and they sometimes show uneven performance across topics because different experts carry different strengths. When choosing an open model, look at both numbers: total parameters tell you the memory you need, while active parameters tell you the speed and cost you will actually experience.
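As a quick illustration of reading both numbers, the helper below estimates the memory needed for the weights alone from the total parameter count and the precision. It is a rough sketch: the function name is hypothetical, 1 GB is taken as 10^9 bytes, and the KV cache, activations, and runtime overhead are all left out, so real requirements will be higher.

```python
def weight_memory_gb(total_params_billion, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_param / 8

# Mixtral-sized MoE (about 47B total parameters) at two precisions.
moe_16bit = weight_memory_gb(47, 16)
moe_4bit = weight_memory_gb(47, 4)

# A 13B dense model, for comparison with the MoE's active parameter count.
dense_13b_16bit = weight_memory_gb(13, 16)
```

At 16-bit precision the 47B MoE needs about 94 GB for weights against 26 GB for a 13B dense model, even though the per-token compute is similar. Quantising to 4 bits brings the MoE down to about 23.5 GB.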
