Huawei has unveiled its CloudMatrix 384 super node, positioning it as a domestic challenger to NVIDIA's GB200 NVL72 system in the high-performance AI hardware arena. The CloudMatrix 384 employs 384 Ascend 910C accelerators, far more silicon than NVIDIA's configuration of 36 Grace CPUs paired with 72 "Blackwell" GB200 GPUs. While Huawei's solution needs roughly five times as many accelerators to reach about 1.7 times the performance of the NVL72, it is a significant step forward for Huawei in system-level deployment, despite lagging in per-chip efficiency and performance.
At the individual accelerator level, NVIDIA maintains clear leadership. Its GB200 GPU delivers over three times the BF16 performance of Huawei's Ascend 910C (2,500 vs. 780 TeraFLOPS), carries more memory per chip (192 GB compared to 128 GB), and offers higher memory bandwidth (8 TB/s versus 3.2 TB/s). These specifications translate into raw-power and energy-efficiency advantages for NVIDIA at the chip scale. When the focus shifts to overall system capabilities, however, Huawei's CloudMatrix 384 pulls ahead: it achieves 1.7 times the aggregate PetaFLOPS, 3.6 times the total HBM capacity, and houses more than five times as many accelerators, allowing broader scalability and bandwidth within a single supercomputer node.
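Those system-level ratios follow directly from the per-chip figures quoted above. As a rough sanity check, the short Python sketch below rolls the per-chip numbers up to the system level; the specs and accelerator counts are the approximate public figures cited in this article, not official datasheet values.

```python
# System-level roll-up of the per-chip figures quoted in this article.
# All values are approximate reported numbers, not official datasheet specs.

ascend_910c = {"bf16_tflops": 780, "hbm_gb": 128}
gb200 = {"bf16_tflops": 2500, "hbm_gb": 192}

cloudmatrix_chips = 384   # Ascend 910C accelerators per CloudMatrix 384
nvl72_chips = 72          # Blackwell GPUs per GB200 NVL72

# Aggregate BF16 compute (PFLOPS) and HBM capacity (TB) per system
cm_pflops = cloudmatrix_chips * ascend_910c["bf16_tflops"] / 1000
nv_pflops = nvl72_chips * gb200["bf16_tflops"] / 1000
cm_hbm_tb = cloudmatrix_chips * ascend_910c["hbm_gb"] / 1000
nv_hbm_tb = nvl72_chips * gb200["hbm_gb"] / 1000

print(f"CloudMatrix 384: {cm_pflops:.0f} PFLOPS, {cm_hbm_tb:.1f} TB HBM")
print(f"GB200 NVL72:     {nv_pflops:.0f} PFLOPS, {nv_hbm_tb:.1f} TB HBM")
print(f"Compute ratio:    {cm_pflops / nv_pflops:.1f}x")            # ~1.7x
print(f"HBM ratio:        {cm_hbm_tb / nv_hbm_tb:.1f}x")            # ~3.6x
print(f"Chip-count ratio: {cloudmatrix_chips / nvl72_chips:.1f}x")  # ~5.3x
```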
The trade-off for this system-level scale is energy consumption. Huawei's solution draws close to four times more power than NVIDIA's: approximately 560 kW per CloudMatrix 384 system compared to 145 kW for a single GB200 NVL72. NVIDIA therefore continues to lead in single-node peak efficiency, but for organizations building massive AI superclusters where total throughput and interconnect speeds are critical, Huawei's approach is compelling. The all-to-all topology in Huawei's design enhances performance for large-scale training and inference tasks. Industry analysts note that as SMIC, the manufacturing partner for Huawei's chips, advances to newer semiconductor process nodes, future iterations could narrow or even close the current efficiency gap with NVIDIA.
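Dividing the aggregate compute from the sketch above by the reported power draws illustrates the efficiency gap. The figures below again come from this article's approximate numbers, so the result is an order-of-magnitude comparison rather than a benchmark.

```python
# Rough performance-per-watt comparison using the article's system-level figures.
# Power draws (~560 kW and ~145 kW) are approximate reported values.

systems = {
    "CloudMatrix 384": {"pflops_bf16": 300, "power_kw": 560},
    "GB200 NVL72":     {"pflops_bf16": 180, "power_kw": 145},
}

for name, spec in systems.items():
    perf_per_kw = spec["pflops_bf16"] / spec["power_kw"]
    print(f"{name}: {perf_per_kw:.2f} PFLOPS per kW")

# The NVL72 comes out at roughly 2.3x more BF16 compute per kW, which is why
# NVIDIA keeps the efficiency lead even though the CloudMatrix 384 delivers
# higher aggregate throughput per node.
```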