MLCommons released the latest MLPerf benchmark results this week, and you will not be surprised to learn that Nvidia GPU-equipped systems had a good showing. What is perhaps more interesting is that Nvidia put its latest Blackwell Ultra GPUs through the tests, and the results were impressive.
Out of the 93 entries for the MLPerf 5.1 benchmark released today by MLCommons, 74 of the systems contained Nvidia Blackwell GPUs, while 19 sported various AMD Instinct GPUs. Nvidia systems topped each of the seven benchmarks in MLPerf 5.1, which span large language models (LLMs), image generation, recommender systems, computer vision, and graph neural networks.
More important than the wins for Nvidia was the opportunity to showcase its latest and greatest gear, the Blackwell Ultra GB300 GPU, which was unveiled in March 2025 and started shipping in volume just two months ago.
Blackwell Ultra performed well on MLPerf benchmarks (Image courtesy Nvidia)
Benchmarks show that Blackwell Ultra GB300 delivers about twice the FP4 performance of the Blackwell GB200, and four to five times the performance of the Hopper H100 GPUs, which Nvidia unveiled in March 2022 and started shipping later that year.
For instance, on the Llama 3.1 405B Pretraining benchmark, a system equipped with 512 Blackwell Ultra GB300 GPUs was able to complete the test in 64.6 minutes. That was about twice as fast as a system equipped with 512 Blackwell GB200 GPUs completed the task on the MLPerf 5.0 benchmark, and about 4x faster than an equivalent H100 system in an earlier round.
On a fine-tuning benchmark involving the Llama 2 70B LoRA model, an eight-GPU Blackwell Ultra GB300 setup was able to complete the task in 8.5 minutes, which was 5x faster than an eight-GPU Hopper H100 setup running on MLPerf 4.1, and about 1.6 times faster than an equivalent setup with Blackwell GB200 running on MLPerf 5.0.
While the MLPerf suite changes from round to round, individual benchmarks remain consistent across rounds, enabling apples-to-apples comparisons. For MLPerf 5.1, MLCommons retired two older models, BERT-Large and Stable Diffusion, and replaced them with Llama 3.1 8B and FLUX.1 (for image generation). This evolution reflects the community's growing focus on generative AI workloads.
“Taken together, the increased submissions to genAI benchmarks and the sizable performance improvements recorded in those tests make it clear that the community is heavily focused on genAI scenarios, to some extent at the expense of other potential applications of AI technology,” said David Kanter, head of MLPerf at MLCommons, in a blog post.
The tests also allowed Nvidia to showcase the performance of Blackwell Ultra GB300 on emerging AI workloads that use ultra-low precision 4-bit floating point data formats. In June, Nvidia launched a second type of FP4 data format, NVFP4, to go along with the preexisting MXFP4 format for Blackwell and Blackwell Ultra.
The evolution of individual MLPerf benchmarks (Image courtesy MLCommons)
During a press briefing on Monday, Nvidia shared data documenting how much more accurate the NVFP4 format is compared to MXFP4. While neither is as accurate as bfloat16 (BF16), a 16-bit floating-point format, NVFP4 displayed less loss than MXFP4 up to around 800 billion tokens, after which the loss rate accelerated for both formats.
“The basic takeaway is that we have seen, through just our own empirical observation, that NVFP4 actually delivers better accuracy than MXFP4, which is why we tend to use it not only in the inference side, but also in the training side,” said Dave Salvator, director of accelerated computing at Nvidia.
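The article does not spell out how the two formats differ internally. As background, both store 4-bit E2M1 values with shared per-block scale factors: MXFP4 uses 32-element blocks with power-of-two scales, while NVFP4 uses smaller 16-element blocks with finer-grained FP8 scales. The toy sketch below (not Nvidia's implementation, and with deliberately simplified scale encodings) illustrates why finer-grained scaling tends to reduce quantization error:

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format used by both MXFP4 and NVFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x, block, scale_fn):
    """Block-scaled FP4 quantization sketch: split x into blocks, pick a
    per-block scale, and round each element to the nearest FP4 value."""
    out = np.empty_like(x, dtype=np.float64)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        s = scale_fn(np.max(np.abs(blk)))
        scaled = blk / s
        # Round each magnitude to the nearest representable FP4 value, keep sign.
        idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID), axis=1)
        out[i:i + block] = np.sign(scaled) * FP4_GRID[idx] * s
    return out

# MXFP4-style scale: 32-element blocks, power-of-two scale (E8M0).
mx = lambda amax: 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
# NVFP4-style scale: 16-element blocks, finer-grained FP8 (E4M3) scale,
# modeled here as an unrestricted float for simplicity.
nv = lambda amax: amax / 6.0 if amax > 0 else 1.0

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
err_mx = np.mean((x - quantize_fp4(x, 32, mx)) ** 2)
err_nv = np.mean((x - quantize_fp4(x, 16, nv)) ** 2)
print(f"MXFP4-style MSE: {err_mx:.5f}  NVFP4-style MSE: {err_nv:.5f}")
```

On random data the NVFP4-style scheme shows a lower mean-squared error, consistent with the accuracy advantage Salvator describes, though real training-loss behavior depends on far more than this toy comparison captures.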
Salvator also pointed out how well Blackwell Ultra performed on benchmarks involving larger clusters. With 279GB of HBM3e memory, the GPU delivers 15 petaflops of NVFP4 compute, twice the attention-layer compute of Blackwell. Nvidia configured a cluster of GB300 NVL72 systems with 5,120 GPUs, connected with 800Gbps Quantum-X800 InfiniBand interconnect. It completed the Llama 3.1 405B pretraining workload in just 10.0 minutes, 2.7 times faster than Nvidia's previous submission, which used about 2,500 GPUs.
Salvator compared the results of the 5,120-GPU system to the results of the 512-GPU system (mentioned above), which completed the pretraining task in 64.6 minutes.
“As we went from 512 to 5,120 GPUs, as you see there, we were able to basically achieve about 85% scaling efficiency,” Salvator said. “Compute isn’t the only thing in the benchmark. There’s other pieces. There’s memory movements, there’s I/O, there’s network communications, there’s other things at play. So the fact that we’re achieving an 85% scaling efficiency, while we basically increase the GPU count by 10-fold, is actually really impressive.”
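Scaling efficiency is commonly computed as achieved speedup divided by the ideal linear speedup, though different accountings (end-to-end wall-clock time versus per-step throughput) yield different figures. A minimal sketch using the times quoted above:

```python
# Sketch: parallel scaling efficiency from the times reported above.
# Efficiency = achieved speedup / ideal (linear) speedup.
def scaling_efficiency(t_small, n_small, t_large, n_large):
    speedup = t_small / t_large   # how much faster the larger run finished
    ideal = n_large / n_small     # perfect linear scaling would give this
    return speedup / ideal

# 512 GPUs: 64.6 min; 5,120 GPUs: 10.0 min (Llama 3.1 405B pretraining).
# Note: this time-based figure lands lower than the 85% Nvidia cites,
# which presumably reflects a different accounting (e.g., throughput).
eff = scaling_efficiency(64.6, 512, 10.0, 5120)
print(f"time-based scaling efficiency: {eff:.0%}")
```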
Blackwell Ultra’s support for NVFP4 boosts performance (Image courtesy Nvidia)
Nvidia set performance records on the two new benchmarks added this round, Llama 3.1 8B and FLUX.1. Nvidia trained the Llama 3.1 8B model on a system composed of 512 Blackwell Ultra GPUs in just 5.2 minutes, and completed the FLUX.1 image generation benchmark in a record time of 12.5 minutes on a system composed of 1,152 Blackwell GPUs. The company’s records for the existing graph neural network, object detection, and recommender system tests still stand.
In the age of agentic AI, the speed at which one can either train or fine-tune a model and then transition into production (i.e. inference) mode will determine how well one competes. As Salvator points out, Nvidia’s gear dominates at every stage of the game.
“The performance pickups in the training domain translate into faster convergence of the model. And the faster the model converges, the faster the model can be deployed, the faster an organization gets to ROI, which ultimately for many organizations [is] the goal, to be able to deploy these things in a way that is, in fact, profitable,” he said.
This article first appeared on HPCwire.
The post Nvidia Showcases Blackwell Ultra Performance on MLPerf Benchmark appeared first on AIwire.
