NVIDIA swept all seven tests in MLPerf Training v5.1 and was the only platform to submit results on every benchmark, underscoring the programmability of NVIDIA GPUs and the maturity of the CUDA software stack. The company reported top performance across large language models, image generation, recommender systems, computer vision and graph neural networks, and said its partners contributed a wide set of system submissions.
The GB300 NVL72 rack-scale system, powered by the NVIDIA Blackwell Ultra GPU architecture, made its MLPerf debut. NVIDIA reported more than 4x the Llama 3.1 405B pretraining performance and nearly 5x the Llama 2 70B LoRA fine-tuning performance of the prior-generation Hopper architecture at the same GPU count. Blackwell Ultra includes new Tensor Cores that deliver 15 petaflops of NVFP4 AI compute, twice the attention-layer compute and 279 GB of HBM3e memory. The Quantum-X800 InfiniBand platform, an end-to-end 800 Gb/s networking solution, also made its debut, doubling scale-out networking bandwidth compared with the prior generation.
A central technical advance this round was the use of NVFP4 precision for training. NVIDIA said Blackwell GPUs can perform FP4 calculations, including the NVFP4 format and other FP4 variants, at double the rate of FP8, and that Blackwell Ultra raises that to three times FP8 performance. The company was the only submitter to use FP4 calculations while meeting MLPerf Training accuracy requirements. Those optimizations helped set new records, including a 10-minute time to train Llama 3.1 405B using more than 5,000 Blackwell GPUs, and an 18.79-minute run using 2,560 Blackwell GPUs that was 45 percent faster than the prior Blackwell submission with 2,496 GPUs.
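NVFP4 pairs 4-bit floating-point values (the E2M1 format, whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4 and 6) with a shared scale applied to each small block of elements. The NumPy sketch below illustrates that block-scaling idea only and is not NVIDIA's implementation: the function name, FP32 scales and nearest-value rounding are simplifying assumptions, whereas real NVFP4 stores FP8 block scales and uses hardware rounding modes.

```python
import numpy as np

# Representable magnitudes of the E2M1 (FP4) format used by NVFP4:
# sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quantize_nvfp4(x, block_size=16):
    """Quantize-dequantize a 1-D tensor with block-scaled FP4.

    Each block of `block_size` values shares one scale chosen so the
    block's largest magnitude maps onto FP4's largest value (6.0).
    Simplification: real NVFP4 keeps per-block scales in FP8 (E4M3);
    this sketch keeps them in FP32 and rounds to the nearest
    representable value for clarity.
    """
    x = np.asarray(x, dtype=np.float32)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    out = np.empty_like(blocks)
    for i, blk in enumerate(blocks):
        amax = np.abs(blk).max()
        scale = amax / 6.0 if amax > 0 else 1.0
        # Map each element to the nearest FP4 magnitude, keeping its sign.
        mags = np.abs(blk) / scale
        nearest = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[i] = np.sign(blk) * FP4_GRID[nearest] * scale
    return out.reshape(-1)[: x.size]

# Round-trip a batch of Gaussian weights and measure quantization error.
rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
w_q = fake_quantize_nvfp4(w)
print("mean relative error:", np.abs(w - w_q).mean() / np.abs(w).mean())
```

The headline numbers can also be sanity-checked with simple scaling arithmetic. Because the source gives only "more than 5,000" GPUs for the record run, the 5,120 figure below is an assumed round number for illustration, not a reported count:

```python
# Hypothetical scaling-efficiency check; 5,120 is an assumption, since the
# source states only "more than 5,000" GPUs for the record run.
small_gpus, small_minutes = 2_560, 18.79
large_gpus, large_minutes = 5_120, 10.0   # assumed GPU count

speedup = small_minutes / large_minutes   # ~1.88x faster
scale_factor = large_gpus / small_gpus    # 2x more GPUs
efficiency = speedup / scale_factor       # ~94% scaling efficiency
print(f"{speedup:.2f}x speedup on {scale_factor:.0f}x GPUs "
      f"-> {efficiency:.0%} scaling efficiency")
```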
NVIDIA also set records on two new benchmarks added this round. Llama 3.1 8B replaced BERT-large, and NVIDIA reported a 5.2-minute training time using up to 512 Blackwell Ultra GPUs. FLUX.1 replaced Stable Diffusion v2, and NVIDIA submitted a 12.5-minute result using 1,152 Blackwell GPUs. The company said it continued to hold records on the existing graph neural network, object detection and recommender system tests, and highlighted participation from 15 ecosystem organizations, including Dell Technologies, Hewlett Packard Enterprise, Lenovo, Supermicro and Lambda. NVIDIA credited its annual innovation cadence with driving rapid performance increases across pretraining, post-training and inference.
