Mixture of Experts (MoE) is the sparse architecture behind Mixtral and, reportedly, GPT‑4. Instead of activating all parameters for every input, MoE routes each token to a small subset of specialized sub-networks called "experts."
What is Mixture of Experts?
MoE replaces some or all of the dense feed-forward layers in a Transformer with multiple parallel sub-networks (experts). A gating network decides which expert(s) should process each token. This lets the model hold far more parameters while per-token compute stays close to that of a much smaller dense model.
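To make the layer structure concrete, here is a minimal sketch in plain Python. The names `moe_layer`, `experts`, and `fixed_router` are hypothetical, the experts are toy functions rather than real feed-forward networks, and the router is a hard-coded placeholder that returns (expert index, weight) pairs.

```python
def moe_layer(token, experts, router):
    """Run only the experts the router selects and mix their outputs."""
    output = [0.0] * len(token)
    for expert_index, weight in router(token):
        expert_out = experts[expert_index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Toy stand-ins for expert feed-forward networks (hypothetical).
experts = [
    lambda x: [2.0 * v for v in x],  # expert 0 doubles each value
    lambda x: [v + 1.0 for v in x],  # expert 1 adds one
    lambda x: [-v for v in x],       # expert 2 negates (never called below)
]

# Placeholder router: always picks experts 0 and 1 with fixed weights.
def fixed_router(token):
    return [(0, 0.75), (1, 0.25)]

result = moe_layer([1.0, 2.0], experts, fixed_router)
print(result)  # [2.0, 3.75]
```

Expert 2 is never evaluated for this token, which is the whole point: parameters that are not selected cost memory but no compute.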
How Gating Works
The gating mechanism (router) takes each token's representation and outputs a probability distribution over experts. Typically, only the top‑k experts (k=2 is common) are activated per token, and their outputs are combined using the router's scores as weights. This sparse activation is what makes MoE efficient.
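A rough sketch of top‑k gating for a single token, using only the standard library. The function names and the example scores are illustrative assumptions, and renormalizing the weights over the selected experts is one common convention, not the only one.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1 (assumed convention).
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router scores for one token over 4 experts.
gates = top_k_gate([2.0, 0.5, 1.0, -1.0], k=2)
for index, weight in gates:
    print(index, round(weight, 3))
```

With these scores, experts 0 and 2 are selected and the other two are skipped entirely for this token.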
Routing Strategies
Two common routing strategies are top‑k routing (each token selects its k highest-scoring experts) and Expert Choice routing (each expert selects the tokens it scores highest). Top‑k routers are usually trained with an auxiliary load‑balancing loss to prevent expert collapse, where a few experts receive almost all the tokens.
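One widely used form of the load‑balancing loss, in the style of the Switch Transformer, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert and sums over experts. The sketch below assumes top‑1 assignments and omits the scaling coefficient that is normally applied before adding this term to the training loss; the function and variable names are illustrative.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss N * sum_i f_i * P_i; equals 1.0 when perfectly balanced."""
    num_tokens = len(router_probs)
    # f_i: fraction of tokens assigned to expert i.
    fraction = [0.0] * num_experts
    for expert_index in assignments:
        fraction[expert_index] += 1.0 / num_tokens
    # P_i: mean router probability given to expert i across tokens.
    mean_prob = [
        sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))

# Balanced case: tokens split evenly between 2 experts.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)

# Collapsed case: every token goes to expert 0.
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)

print(balanced, round(collapsed, 3))
```

The collapsed case scores higher than the balanced one, so minimizing this term pushes the router toward spreading tokens evenly.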
MoE in Practice: GPT‑4 & Mixtral
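GPT‑4's architecture has not been published. Unconfirmed reports describe an MoE with 8 to 16 experts and roughly 1.8 trillion total parameters, of which only about 280B are active per forward pass. Mixtral 8x7B has 8 experts per MoE layer and routes each token to 2 of them. Because only the feed-forward blocks are replicated while the attention layers are shared, it has about 47B total parameters rather than 8 × 7B = 56B, with about 13B active per token. Mistral AI reports that it matches or outperforms Llama 2 70B on most of the benchmarks it tested.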
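The gap between "8x7B" and the real parameter count is easy to check with a back-of-the-envelope calculation. The hyperparameters below are assumptions taken from Mixtral 8x7B's published configuration, not from this article, and the count ignores the small normalization weights.

```python
# Assumed Mixtral 8x7B hyperparameters (published model configuration).
DIM = 4096          # model width
N_LAYERS = 32
FFN_HIDDEN = 14336  # hidden size of each expert's feed-forward block
N_HEADS = 32
N_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2

def mixtral_param_counts():
    """Return (total, active-per-token) parameter counts, ignoring norm weights."""
    expert = 3 * DIM * FFN_HIDDEN  # gated feed-forward block: three weight matrices
    attention = 2 * DIM * N_HEADS * HEAD_DIM + 2 * DIM * N_KV_HEADS * HEAD_DIM
    router = DIM * N_EXPERTS
    embeddings = 2 * VOCAB * DIM   # input embedding plus output projection
    shared = N_LAYERS * (attention + router) + embeddings
    total = shared + N_LAYERS * N_EXPERTS * expert
    active = shared + N_LAYERS * TOP_K * expert
    return total, active

total, active = mixtral_param_counts()
print(f"total: {total / 1e9:.1f}B, active: {active / 1e9:.1f}B")
# total: 46.7B, active: 12.9B
```

Nearly all of the parameters sit in the expert blocks, which is why activating 2 of 8 experts cuts the per-token parameter count by roughly a factor of 3.6.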
Trade‑offs & Future
MoE trades memory for compute: all experts must be held in memory, but each token touches only a fraction of the weights, so inference needs fewer FLOPs per token than a dense model with the same total parameter count. Challenges include expert load balancing, training instability, and communication overhead in distributed settings.
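The trade-off can be put in rough numbers. The sketch below assumes 16-bit weights (2 bytes per parameter) and the common rule of thumb of about 2 FLOPs per active parameter per token; the 40B/10B model sizes are made up for illustration, and the estimate ignores activations, the KV cache, and routing overhead.

```python
def footprint(total_params, active_params, bytes_per_param=2):
    """Rough weight memory (GB) and forward-pass FLOPs per token."""
    weight_memory_gb = total_params * bytes_per_param / 1e9
    flops_per_token = 2 * active_params  # ~2 FLOPs per active parameter (approximation)
    return weight_memory_gb, flops_per_token

# Hypothetical MoE: 40B total parameters, 10B active per token.
moe_mem, moe_flops = footprint(40e9, 10e9)

# Dense model with the same total parameter count.
dense_mem, dense_flops = footprint(40e9, 40e9)

print(moe_mem, dense_mem)       # 80.0 80.0
print(moe_flops / dense_flops)  # 0.25
```

Both models need the same memory for weights, but the MoE does about a quarter of the arithmetic per token under these assumptions.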