Nvidia has claimed the top spot on the 31st Graph500 breadth-first search list with a benchmark result of 410 trillion traversed edges per second (TEPS), delivered on a commercially available cluster hosted by cloud provider CoreWeave. The record-setting run took place in a CoreWeave data center in Dallas and used 8,192 Nvidia H100 GPUs to process a graph containing 2.2 trillion vertices and 35 trillion edges. According to Nvidia, this result is more than double that of comparable Graph500 entries, including systems operated by national laboratories, and it highlights the potential of the company's accelerated computing stack for large-scale graph workloads.
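For context, the Graph500 benchmark measures how quickly a system can expand a breadth-first-search frontier across a huge sparse graph, with TEPS counting the edges inspected per second along the way. The following minimal single-GPU CUDA sketch of one level-synchronous expansion step is purely illustrative; the tiny graph, kernel, and variable names are assumptions for this article, not the distributed code behind the record run, which spreads this work across thousands of GPUs.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One level-synchronous BFS expansion step over a graph in CSR form.
// Every neighbor inspected in the inner loop is one "traversed edge",
// the unit behind the TEPS metric.
__global__ void bfs_expand(const int *row_off, const int *cols,
                           const int *frontier, int frontier_size,
                           int *levels, int *next_frontier, int *next_size,
                           int level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;
    int v = frontier[i];
    for (int e = row_off[v]; e < row_off[v + 1]; ++e) {
        int u = cols[e];
        // Claim unvisited vertices atomically so each enters the next frontier once.
        if (atomicCAS(&levels[u], -1, level + 1) == -1)
            next_frontier[atomicAdd(next_size, 1)] = u;
    }
}

int main() {
    // Tiny undirected example graph: edges 0-1, 0-2, 1-3, 2-3, 3-4.
    std::vector<int> row_off = {0, 2, 4, 6, 9, 10};
    std::vector<int> cols    = {1, 2, 0, 3, 0, 3, 1, 2, 4, 3};
    int n = 5;

    int *d_row, *d_cols, *d_levels, *d_front, *d_next, *d_next_size;
    cudaMalloc((void **)&d_row, row_off.size() * sizeof(int));
    cudaMalloc((void **)&d_cols, cols.size() * sizeof(int));
    cudaMalloc((void **)&d_levels, n * sizeof(int));
    cudaMalloc((void **)&d_front, n * sizeof(int));
    cudaMalloc((void **)&d_next, n * sizeof(int));
    cudaMalloc((void **)&d_next_size, sizeof(int));
    cudaMemcpy(d_row, row_off.data(), row_off.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cols, cols.data(), cols.size() * sizeof(int), cudaMemcpyHostToDevice);

    std::vector<int> levels(n, -1);
    levels[0] = 0;                                   // vertex 0 is the BFS source
    cudaMemcpy(d_levels, levels.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    int src = 0, frontier_size = 1;
    cudaMemcpy(d_front, &src, sizeof(int), cudaMemcpyHostToDevice);

    // Expand one frontier per iteration until no new vertices are discovered.
    for (int level = 0; frontier_size > 0; ++level) {
        cudaMemset(d_next_size, 0, sizeof(int));
        bfs_expand<<<(frontier_size + 127) / 128, 128>>>(
            d_row, d_cols, d_front, frontier_size,
            d_levels, d_next, d_next_size, level);
        cudaMemcpy(&frontier_size, d_next_size, sizeof(int), cudaMemcpyDeviceToHost);
        int *tmp = d_front; d_front = d_next; d_next = tmp;  // swap frontiers
    }

    cudaMemcpy(levels.data(), d_levels, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int v = 0; v < n; ++v) printf("vertex %d -> BFS level %d\n", v, levels[v]);
    return 0;
}
```

The atomic compare-and-swap ensures each vertex joins the next frontier exactly once, while every adjacency-list entry scanned counts toward the TEPS figure; the record system performs this kind of expansion over 35 trillion edges spread across 8,192 GPUs.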
The company emphasizes that efficiency is as important as raw speed. While a comparable top 10 Graph500 system used about 9,000 nodes, the Nvidia and CoreWeave configuration reached its result with just over 1,000 nodes, which the company says delivers 3x better performance per dollar. Nvidia illustrates the scale by noting that if every person on Earth had 150 friends, the resulting social graph would contain about 1.2 trillion edges, and the demonstrated system could search all such relationships in about three milliseconds. The achievement relies on Nvidia's integrated platform, spanning Nvidia CUDA software, Spectrum-X networking, H100 GPUs and a new active messaging library designed to minimize hardware footprint while maximizing throughput.
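Those two figures are consistent with each other. A back-of-the-envelope check (the constants below are assumptions taken from the claims above: roughly 8 billion people, 150 relationships each, and the 410-trillion-TEPS rate) lands on roughly three milliseconds:

```cuda
#include <cstdio>

int main() {
    // Assumed inputs, taken from the claims above: ~8 billion people,
    // 150 relationships each, and a traversal rate of 410 trillion TEPS.
    const double people = 8.0e9;
    const double relationships_each = 150.0;
    const double teps = 410.0e12;

    const double edges = people * relationships_each;   // ~1.2e12 edges
    const double seconds = edges / teps;                 // ~2.9e-3 s

    printf("edges: %.2e, search time: %.1f ms\n", edges, seconds * 1e3);
    return 0;
}
```

Run as ordinary host code, this prints roughly 1.2e12 edges and a search time of about 2.9 ms.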
Graph500 breadth-first search is a long-standing industry benchmark for navigating sparse, irregular graphs, such as those representing social networks, banking relationships or cybersecurity data. Traditional approaches to very large graph processing have relied on CPU-based systems, where moving graph data between nodes creates communication bottlenecks at trillion-edge scales. To overcome this, developers have used active messages that process data in place, but these techniques were originally designed for CPUs and are constrained by CPU throughput.
Nvidia reengineered this model around GPUs with a custom framework built on InfiniBand GPUDirect Async (IBGDA) and the NVSHMEM parallel programming interface, enabling GPU-to-GPU active messages and allowing hundreds of thousands of GPU threads to send messages concurrently. By running active messaging entirely on GPUs and exploiting the parallelism and memory bandwidth of H100 devices on CoreWeave's infrastructure, the system doubled the performance of comparable runs while using a fraction of the hardware and cost. Nvidia argues that this approach opens a new path for high-performance computing fields that also depend on sparse data structures, such as fluid dynamics and weather forecasting, letting developers scale their largest applications on commercially available infrastructure with technologies like NVSHMEM and IBGDA.
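To make the idea of GPU-initiated messaging concrete, here is a minimal sketch of the communication primitive underneath such a design: each GPU thread deposits a payload directly into a peer processing element's symmetric buffer with a device-side NVSHMEM put, with no CPU on the data path (the mechanism IBGDA accelerates across InfiniBand). A real active-message layer would add handlers that consume these payloads on the receiving GPU; the ring pattern, buffer sizes and names below are illustrative assumptions, not Nvidia's library.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each GPU thread writes a small payload directly into a peer PE's symmetric
// "mailbox" with a device-side one-sided put; the CPU never touches the data path.
__global__ void send_messages(int *mailbox, int slots, int my_pe, int n_pes) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= slots) return;
    int target = (my_pe + 1) % n_pes;                 // message the next PE in a ring
    // One-sided put into slot `tid` of the target PE's mailbox.
    nvshmem_int_p(&mailbox[tid], my_pe * slots + tid, target);
}

int main() {
    nvshmem_init();
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));   // one GPU per PE

    const int slots = 1024;
    // Symmetric allocation: the buffer exists on every PE and is remotely writable.
    int *mailbox = (int *)nvshmem_malloc(slots * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    send_messages<<<(slots + 255) / 256, 256, 0, stream>>>(mailbox, slots, my_pe, n_pes);
    nvshmemx_barrier_all_on_stream(stream);           // complete all puts before reading
    cudaStreamSynchronize(stream);

    // Verify: this PE's mailbox was filled by its left neighbor in the ring.
    std::vector<int> received(slots);
    cudaMemcpy(received.data(), mailbox, slots * sizeof(int), cudaMemcpyDeviceToHost);
    int left = (my_pe - 1 + n_pes) % n_pes;
    printf("PE %d of %d: slot 0 = %d (expected %d from PE %d)\n",
           my_pe, n_pes, received[0], left * slots, left);

    nvshmem_free(mailbox);
    nvshmem_finalize();
    return 0;
}
```

A sketch like this would typically be built with nvcc against the NVSHMEM headers and libraries (with relocatable device code enabled) and launched with one process per GPU, for example via the nvshmemrun launcher; at benchmark scale, the same pattern lets hundreds of thousands of threads per GPU issue such messages concurrently.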
