NVIDIA has published technical details for Blackwell Ultra, describing it as the final iteration of the Blackwell family before the transition to Rubin. The company positions the silicon as a server-focused design that departs from earlier Blackwell variants in I/O and raw scale. Blackwell Ultra is built on TSMC's 4NP node and packs 208 billion transistors, which NVIDIA states is 2.6 times the transistor count of the prior-generation Hopper design. The chip carries a 1,400 W thermal design power, so deployments will need substantial cooling infrastructure.
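The 2.6x transistor claim can be sanity-checked against the Hopper baseline; assuming that baseline is the GH100 die at roughly 80 billion transistors (a publicly stated figure, not from this article), the arithmetic works out:

```python
# Sanity check of NVIDIA's stated 2.6x transistor-count ratio.
# Assumption: the Hopper reference is GH100 at ~80 billion transistors.
blackwell_ultra_transistors = 208e9
hopper_transistors = 80e9

ratio = blackwell_ultra_transistors / hopper_transistors
print(f"{ratio:.1f}x")  # prints "2.6x"
```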
One of the most notable platform changes is support for PCIe Gen 6, whereas both consumer Blackwell and the standard server Blackwell remain on PCIe Gen 5. The Ultra design spreads 160 streaming multiprocessors across two reticle-sized dies joined by NVIDIA's NV-HBI link, which delivers a 10 TB/s die-to-die fabric. Memory is provisioned as 288 GB of HBM3E with up to 8 TB/s of bandwidth. Compute improvements center on fifth-generation Tensor Cores tuned for NVFP4, with an overall increase in NVFP4 compute density that NVIDIA quantifies at roughly 1.5 times that of the base Blackwell.
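To make the NVFP4 discussion concrete, the sketch below emulates 4-bit E2M1 quantization with a per-block scale, which is the general shape of the format (a 4-bit E2M1 element plus a shared block scale). This is a simplified illustration, not NVIDIA's implementation: the real format encodes its scale in FP8, whereas here it stays a plain float, and the block size of 16 is an assumption for the example.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (sign + 2 exponent + 1 mantissa bits),
# the element format NVFP4 is built on.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, block=16):
    """Toy NVFP4-style quantizer: per-block scale + round-to-nearest E2M1.
    Real NVFP4 stores the scale in FP8; here it remains a plain float."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        amax = np.max(np.abs(chunk))
        scale = amax / E2M1[-1] if amax > 0 else 1.0  # map block max onto 6.0
        scaled = chunk / scale
        # Round each element to the nearest representable E2M1 magnitude, keeping sign.
        idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1[None, :]), axis=1)
        out[i:i + block] = np.sign(scaled) * E2M1[idx] * scale
    return out

print(quantize_block(np.array([0.5, 1.0, 6.0, 2.4])))
```

Quantizing and dequantizing in one step like this shows where the precision loss lands: values already on the E2M1 grid survive intact, while in-between values (such as 2.4 above) snap to the nearest representable point.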
Performance implications highlighted by NVIDIA include higher tokens per second for inference workloads and improved throughput for large-batch training. Attention-layer performance is targeted by doubling the throughput of the special function units that handle transcendental operations, which reduces softmax latency and improves responsiveness in reasoning models. The blog post focuses on silicon and microarchitectural changes, software support and optimizations, and data-path enhancements geared toward server-class AI workloads.
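The softmax connection can be made explicit. In attention, every score in an L x L matrix passes through an exponential, and that exp() is exactly the transcendental work the special function units execute, which is why doubling their throughput matters for attention-heavy inference. A minimal illustration (generic numerically stable softmax, not NVIDIA code):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over attention scores.
    The exp() call is the transcendental operation that GPUs route to
    special function units, the hardware whose throughput Blackwell Ultra doubles."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # subtract max to avoid overflow
    e = np.exp(shifted)                                     # one transcendental per element
    return e / e.sum(axis=-1, keepdims=True)

# For context length L, each head evaluates exp() across an L x L score matrix,
# so transcendental cost grows quadratically with sequence length.
L = 4096
print(f"exp evaluations per head per layer: {L * L:,}")
```

The quadratic growth in exp() evaluations with context length is what makes SFU throughput a bottleneck at long sequence lengths, and hence a plausible lever for the latency reduction the post describes.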