Question 1

What is a Mixture of Experts model?

Accepted Answer

A Mixture of Experts (MoE) model replaces a dense layer (typically the feed-forward block of a transformer) with many parallel "expert" sub-networks plus a small gating network that, for each token, selects only a few experts to run. As a result, the model can have a very large total number of parameters while only a small fraction of them are active for any given token.

Question 2

What is the difference between total and active parameters?

Accepted Answer

Total parameters count every weight in the model, including all of the experts; active parameters count only the weights actually used to process a given token. An MoE model might have 100+ billion total parameters but activate only a fraction of them per token, so it has the knowledge capacity of a huge model with the per-token compute cost of a much smaller one (memory use still scales with the total, since every expert must be stored).
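The arithmetic behind the two counts can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular model: the function name `moe_ffn_param_counts` and the configuration values are hypothetical, each expert is assumed to be a two-matrix feed-forward block without biases, and the router is assumed to be a single linear layer. Attention and embedding weights, which are shared and always active, are left out.

```python
def moe_ffn_param_counts(num_layers, d_model, d_ff, num_experts, top_k):
    """Count expert and router weights only; attention and embeddings are ignored."""
    per_expert = 2 * d_model * d_ff   # up-projection + down-projection, no biases (assumed)
    router = d_model * num_experts    # one linear layer giving a score per expert (assumed)
    total = num_layers * (num_experts * per_expert + router)
    active = num_layers * (top_k * per_expert + router)
    return total, active


# Hypothetical configuration, chosen only to make the ratio visible.
total, active = moe_ffn_param_counts(
    num_layers=32, d_model=4096, d_ff=14336, num_experts=8, top_k=2
)
print(f"total expert+router params:  {total / 1e9:.1f}B")
print(f"active expert+router params: {active / 1e9:.1f}B")
print(f"active share: {active / total:.1%}")
```

With 2 of 8 experts selected per token, the active share of the expert weights comes out at roughly a quarter; the router's own weights are negligible by comparison.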

Question 3

What does the gating network do?

Accepted Answer

The gating (or router) network looks at each incoming token and produces scores for the available experts, then routes the token to the top-scoring few (often the top one or two). It is trained jointly with the experts so the model learns which experts to use for which kinds of input.
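A minimal sketch of top-k routing for a single token follows. It assumes the router is one linear layer (a weight row per expert) and that the gate weights are a softmax taken over the selected scores only; real implementations differ on details such as whether the softmax is applied before or after the top-k step. The names `route` and `router_weights` are illustrative.

```python
import math


def route(token, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    # One score (logit) per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the gate weights sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


# Four hypothetical experts, three-dimensional token.
router_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
token = [2.0, 1.0, -1.0]
selection = route(token, router_weights, k=2)
for expert_index, gate_weight in selection:
    print(f"expert {expert_index}: weight {gate_weight:.3f}")
```

The layer's output would then be the gate-weighted sum of the selected experts' outputs; the experts that were not selected are never evaluated for this token.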

Question 4

Why are MoE models harder to train and serve?

Accepted Answer

MoE adds complications at both stages. During training, the router can collapse to favouring a few experts (load imbalance), so an auxiliary load-balancing loss is usually added. At inference, every expert must be held in memory even though only a few run per token, and tokens are routed dynamically, which complicates batching and distribution across hardware. The payoff is far more capacity per unit of compute.
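To make the load-balancing idea concrete, here is a sketch of one common formulation, in the style of the auxiliary loss used by the Switch Transformer: the number of experts times the sum, over experts, of the fraction of tokens dispatched to that expert multiplied by the mean router probability for it. The answer above does not specify a particular loss, so treat this as an assumed example; top-1 dispatch is also an assumption, and the function name and sample probabilities are made up for illustration.

```python
def load_balancing_loss(router_probs, num_experts):
    """router_probs: one probability list per token, each summing to 1."""
    num_tokens = len(router_probs)
    # Fraction of tokens whose top-1 choice is each expert.
    dispatch_fraction = [0.0] * num_experts
    for probs in router_probs:
        chosen = max(range(num_experts), key=lambda i: probs[i])
        dispatch_fraction[chosen] += 1.0 / num_tokens
    # Mean router probability assigned to each expert across the batch.
    mean_prob = [
        sum(probs[i] for probs in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(dispatch_fraction, mean_prob))


# Tokens spread evenly over four experts.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Router collapse: every token strongly prefers expert 0.
collapsed = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]

balanced_loss = load_balancing_loss(balanced, num_experts=4)
collapsed_loss = load_balancing_loss(collapsed, num_experts=4)
print(f"balanced:  {balanced_loss:.2f}")
print(f"collapsed: {collapsed_loss:.2f}")
```

The balanced batch scores close to 1, while the collapsed batch scores close to the number of experts. In training, a term like this would be scaled by a small coefficient and added to the main loss so that the router is nudged toward spreading tokens more evenly.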

Mixture of Experts (AI Glossary)

What Mixture of Experts means

Conditional computation: the core idea

The gating network

Total vs active parameters

Trade-offs