A mixture-of-experts, or MoE, architecture is now the dominant pattern behind leading frontier AI models because it routes each token to a small set of specialized experts rather than using all model parameters. The article cites the Artificial Analysis leaderboard, where the top 10 most intelligent open-source models adopt MoE designs, including DeepSeek AI’s DeepSeek-R1, Moonshot AI’s Kimi K2 Thinking, OpenAI’s gpt-oss-120B and Mistral AI’s Mistral Large 3. By activating only the experts relevant to a given token, MoE models raise intelligence and adaptability while containing compute and energy costs relative to dense models, which use every parameter for every token.
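To make the routing idea concrete, here is a minimal sketch of a top-k gated MoE layer. It is illustrative only: the expert count, top_k value and dimensions are assumptions, not parameters of any model named above.

```python
# Minimal top-k MoE routing sketch (illustrative; expert count, top_k and
# dimensions are assumed, not taken from any model cited in the article).
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 64, 8, 2

# A router (gate) scores every expert for each token; each expert here is a
# stand-in feed-forward block represented by a single weight matrix.
router_w = rng.standard_normal((d_model, num_experts))
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """Route each token to its top_k experts and mix their outputs by gate weight."""
    logits = tokens @ router_w                         # (n_tokens, num_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]  # chosen experts per token
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        chosen = top_idx[t]
        gate = np.exp(logits[t, chosen])
        gate /= gate.sum()                             # softmax over the chosen experts only
        # Only top_k of num_experts experts run for this token -- the source of
        # MoE's compute savings relative to a dense layer.
        out[t] = sum(g * (token @ experts[e]) for g, e in zip(gate, chosen))
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 64)
```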
Scaling MoE in production has been constrained by memory pressure and by the latency of distributing experts across multiple GPUs. NVIDIA’s answer is extreme codesign in the GB200 NVL72 rack-scale system, which integrates 72 NVIDIA Blackwell GPUs into a single NVLink fabric with 130 TB/s of NVLink connectivity, 30 TB of fast shared memory and 1.4 exaflops of AI performance. That design lets expert parallelism span up to 72 GPUs, reducing the number of experts each GPU must hold, easing parameter-loading demands on high-bandwidth memory and accelerating all-to-all expert communication. Software and format optimizations, including NVIDIA Dynamo, NVFP4 and support from TensorRT-LLM, SGLang and vLLM, help orchestrate disaggregated serving of prefill and decode tasks to maximize inference throughput and efficiency.
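As a back-of-the-envelope illustration of why wide expert parallelism eases memory pressure, the sketch below compares experts per GPU and the resulting weight footprint at different expert-parallel widths. The expert count and per-expert size are hypothetical placeholders, not figures from the article.

```python
# Back-of-the-envelope sketch: how expert-parallel (EP) width changes the
# per-GPU expert footprint. All model numbers below are hypothetical.
import math

num_experts = 256        # assumed routed-expert count for a large MoE model
params_per_expert = 44e6 # assumed parameters per expert
bytes_per_param = 1      # assumed 1 byte/param (FP8/NVFP4-class storage)

def per_gpu_footprint(ep_width: int) -> tuple[int, float]:
    """Experts hosted per GPU and their weight footprint (GB) at a given EP width."""
    experts_per_gpu = math.ceil(num_experts / ep_width)
    gb = experts_per_gpu * params_per_expert * bytes_per_param / 1e9
    return experts_per_gpu, gb

for ep_width in (8, 16, 72):  # e.g. a single 8-GPU node vs. a full NVL72 rack
    n, gb = per_gpu_footprint(ep_width)
    print(f"EP width {ep_width:>2}: {n:>3} experts/GPU, ~{gb:.1f} GB of expert weights per GPU")
```

The trade-off is that tokens must be exchanged all-to-all so each one reaches the GPUs hosting its chosen experts, which is why the article emphasizes the bandwidth of the single NVLink fabric spanning all 72 GPUs.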
NVIDIA reports a 10x generational leap in performance per watt on GB200 NVL72 for multiple MoE models compared with prior-generation platforms such as NVIDIA HGX H200, citing Kimi K2 Thinking, DeepSeek-R1 and Mistral Large 3 as examples of this improvement. Cloud providers and partners are deploying GB200 NVL72, and customers including CoreWeave, DeepL and Fireworks AI are using the rack-scale design to run and serve large MoE models. The article positions MoE as a fundamental architecture for future multimodal and agentic systems and presents GB200 NVL72 as the infrastructure enabling wide expert parallelism and materially lower per-token cost and power consumption for frontier AI workloads.
