Question 1

What is a Mixture of Experts model?

Accepted Answer

A Mixture of Experts (MoE) model replaces the dense feed-forward layers of a transformer with a set of expert networks plus a router that picks a few experts for each token. Only the selected experts run, so the model has a huge total parameter count but activates only a fraction of it per token. This gives the capacity of a very large model at roughly the per-token compute cost of a much smaller one.
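To make the idea concrete, here is a minimal sketch of a single MoE layer in plain Python. It is a toy under stated assumptions, not how any production model is implemented: each expert is a single random linear map (real experts are small feed-forward networks), the gate weights are a softmax over only the selected experts' scores (one common variant), and the names ToyMoELayer, matvec, and softmax are made up for this illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyMoELayer:
    """Hypothetical toy layer: a linear router plus linear 'experts'."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)

        def random_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        # Router: one row of weights (one score) per expert.
        self.router = random_matrix(num_experts, dim)
        # Assumption: each expert is a single dim x dim linear map.
        self.experts = [random_matrix(dim, dim) for _ in range(num_experts)]

    def forward(self, token):
        # Score every expert for this token.
        scores = matvec(self.router, token)
        # Keep the indices of the top-k highest-scoring experts.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Normalize the chosen scores into mixing weights.
        gates = softmax([scores[i] for i in chosen])
        # Only the chosen experts run; the rest are skipped entirely.
        output = [0.0] * len(token)
        for gate, index in zip(gates, chosen):
            expert_out = matvec(self.experts[index], token)
            output = [o + gate * e for o, e in zip(output, expert_out)]
        return output, chosen


layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
output, chosen = layer.forward([1.0, -0.5, 0.25, 2.0])
print("experts used:", chosen)
print("output:", output)
```

The layer stores eight experts but the loop in forward touches only two of them per token, which is the whole point of the design.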

Question 2

Why do GPT-4 and Mixtral use Mixture of Experts?

Accepted Answer

MoE lets these models grow in total parameters, and therefore knowledge capacity, without a proportional rise in inference cost. Mixtral 8x7B has eight experts in each MoE layer but routes each token to just two, so it uses only about 13B of its roughly 47B parameters per token and runs at roughly the speed of a 13B dense model. GPT-4 is widely believed to use an MoE design for the same efficiency reasons, though OpenAI has not confirmed the details.

Question 3

What is the router in a MoE model?

Accepted Answer

The router is a small learned network, usually a single linear layer with a softmax, that scores every expert for each incoming token and selects the top-k highest-scoring ones. Its weights are trained jointly with the rest of the model. Good routing is critical: if the router sends tokens unevenly, some experts get overloaded while others go unused, hurting both quality and efficiency.
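The load-imbalance problem can be seen with a short simulation. The sketch below assumes an untrained router with random weights and random token vectors, so the numbers say nothing about any real model; route_top_k and expert_load are hypothetical helper names. It ranks on raw scores because softmax preserves ordering, so the top-k choice is the same with or without it.

```python
import random
from collections import Counter


def route_top_k(router_weights, token, k):
    """Return the indices of the k experts with the highest router score."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:k]


def expert_load(router_weights, tokens, k):
    """Count how many token assignments each expert receives."""
    counts = Counter()
    for token in tokens:
        counts.update(route_top_k(router_weights, token, k))
    return [counts.get(expert, 0) for expert in range(len(router_weights))]


rng = random.Random(0)
num_experts, dim, k = 4, 8, 2

# Assumption: random router weights and random tokens, purely illustrative.
router_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]

load = expert_load(router_weights, tokens, k)
ideal = len(tokens) * k / num_experts  # load per expert under perfect balance

print("assignments per expert:", load)
print("busiest expert vs. ideal:", max(load) / ideal)
```

A ratio of 1.0 would mean perfectly even routing; the further the busiest expert sits above that, the more the other experts are idle while it becomes the bottleneck.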

Question 4

What is the difference between total and active parameters?

Accepted Answer

Total parameters count every weight in the model, including all experts. Active parameters count only the weights actually used to process a given token: the shared layers plus the few experts the router selected. MoE models advertise large total parameter counts but much smaller active counts, which is exactly why they need less compute per token than a dense model of the same total size, even though every expert must still be held in memory.
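The arithmetic behind the two counts is simple enough to sketch. The figures below are round, made-up numbers chosen for easy mental math, not the configuration of Mixtral or any other real model, and parameter_counts is a hypothetical helper that ignores the (small) router weights.

```python
def parameter_counts(shared_params, params_per_expert, num_experts, top_k, num_moe_layers):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared_params covers everything every token passes through
    (attention, embeddings, norms). Router weights are ignored.
    """
    expert_total = params_per_expert * num_experts * num_moe_layers
    expert_active = params_per_expert * top_k * num_moe_layers
    return shared_params + expert_total, shared_params + expert_active


# Illustrative values only, not taken from a published model.
total, active = parameter_counts(
    shared_params=2_000_000_000,
    params_per_expert=150_000_000,
    num_experts=8,
    top_k=2,
    num_moe_layers=32,
)

print(f"total:  {total / 1e9:.1f}B")          # total:  40.4B
print(f"active: {active / 1e9:.1f}B")         # active: 11.6B
print(f"active share: {active / total:.0%}")  # active share: 29%
```

Note that the active share is larger than top_k / num_experts (25% here) because the shared layers run for every token regardless of routing.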

Mixture of Experts (MoE) Explained: Why GPT-4 and Mixtral Use It

What Mixture of Experts actually is

How routing works

Why it makes large models efficient

The tradeoffs

When MoE matters to you