Nvidia Blackwell Ultra GB300 NVL72 targets massive gains in agentic artificial intelligence inference

Nvidia’s GB300 NVL72 system, built on the Blackwell Ultra GPU and a codesigned software stack, sharply improves performance and cost for agentic artificial intelligence workloads, from low-latency assistants to long-context coding tools. New data highlights throughput-per-megawatt and token-cost advantages over both Hopper and prior Blackwell platforms.

The Nvidia Blackwell platform is seeing broad adoption among inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI, with deployments already reducing cost per token by up to 10x compared with earlier generations. Agentic Artificial Intelligence use cases and coding assistants are driving rapid growth in software-programming-related Artificial Intelligence queries, whose share rose from 11% to about 50% over the past year according to OpenRouter’s State of Inference report. These workloads demand both low latency across multistep workflows and long context to reason over entire codebases. New SemiAnalysis InferenceX performance data indicates that the combination of Nvidia’s software optimizations and the next-generation Blackwell Ultra platform pushes GB300 NVL72 systems to up to 50x higher throughput per megawatt, translating into 35x lower cost per token compared with the Nvidia Hopper platform.

Earlier analysis from Signal65 found that Nvidia GB200 NVL72 with tightly codesigned hardware and software delivers more than 10x more tokens per watt, which results in one-tenth the cost per token compared with the Nvidia Hopper platform, and these gains have been expanding as the stack improves. Continuous optimizations from teams behind Nvidia TensorRT-LLM, Nvidia Dynamo, Mooncake and SGLang are significantly boosting Blackwell NVL72 throughput for mixture-of-experts inference at all latency targets, and Nvidia TensorRT-LLM library changes alone have delivered up to 5x better performance on GB200 for low-latency workloads compared with just four months ago. Building on these advances, GB300 NVL72 with the Blackwell Ultra GPU extends throughput-per-megawatt to 50x compared with Hopper, and this translates into up to 35x lower cost per million tokens at low latency where agentic applications operate, enabling real-time interactive assistants to scale to many more users.
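The relationship between throughput per megawatt and cost per token can be sketched with some rough arithmetic. In the snippet below, only the up-to-50x throughput-per-megawatt figure comes from the article; the electricity price, baseline throughput and function names are hypothetical placeholders.

```python
# Illustrative only: relates a throughput-per-megawatt gain to a
# cost-per-token reduction. The 50x ratio is from the article; every
# other number here is a hypothetical placeholder.

def cost_per_million_tokens(power_cost_per_mwh, tokens_per_sec_per_mw):
    """Energy cost (in dollars) to serve one million tokens.

    power_cost_per_mwh: electricity price in dollars per megawatt-hour.
    tokens_per_sec_per_mw: sustained throughput per megawatt of power.
    """
    tokens_per_mwh = tokens_per_sec_per_mw * 3600  # tokens per MW-hour
    return power_cost_per_mwh / tokens_per_mwh * 1_000_000

# Hypothetical Hopper-class baseline: 1M tokens/s per MW at $100/MWh.
hopper = cost_per_million_tokens(100.0, 1_000_000)

# GB300 NVL72 at the article's up-to-50x throughput per megawatt.
blackwell_ultra = cost_per_million_tokens(100.0, 50_000_000)

print(f"{hopper / blackwell_ultra:.0f}x")  # → 50x if power were the only cost
```

If electricity were the only cost, a 50x throughput-per-megawatt gain would yield a 50x lower cost per token; costs that do not scale with power (capex, networking, cooling) dilute the ratio, which is consistent with the article's realized 35x cost-per-token figure sitting below the 50x throughput figure.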

The benefits of GB300 NVL72 are particularly pronounced in long-context scenarios, such as Artificial Intelligence coding assistants that must reason across entire repositories. For workloads with 128,000-token inputs and 8,000-token outputs, GB300 NVL72 delivers up to 1.5x lower cost per token compared with GB200 NVL72, helped by Blackwell Ultra’s 1.5x higher NVFP4 compute performance and 2x faster attention processing, which together allow efficient understanding of entire codebases. Major cloud providers including Microsoft, CoreWeave and Oracle Cloud Infrastructure are deploying GB300 NVL72 for low-latency and long-context use cases, with CoreWeave emphasizing that Grace Blackwell NVL72 improves token economics and makes large-scale inference more usable for customers. Looking ahead, the Nvidia Rubin platform, which combines six new chips into a single Artificial Intelligence supercomputer, is positioned to deliver further gains: up to 10x higher throughput per megawatt for mixture-of-experts inference compared with Blackwell, translating into one-tenth the cost per million tokens, and the ability to train large mixture-of-experts models using one-fourth the number of GPUs.
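To put the long-context numbers in concrete terms, here is a small sketch of per-request cost for the 128,000-token-input, 8,000-token-output workload described above. The token counts and the up-to-1.5x ratio come from the article; the dollar prices are hypothetical placeholders.

```python
# Illustrative cost arithmetic for a long-context coding-assistant request.
# Token counts and the 1.5x ratio are from the article; prices are
# hypothetical placeholders, not published rates.

def request_cost(input_tokens, output_tokens, price_per_million_tokens):
    """Cost of one request at a flat blended per-token price."""
    total_tokens = input_tokens + output_tokens
    return total_tokens * price_per_million_tokens / 1_000_000

INPUT_TOKENS, OUTPUT_TOKENS = 128_000, 8_000

gb200_price = 3.00               # hypothetical $ per million tokens on GB200
gb300_price = gb200_price / 1.5  # article's up-to-1.5x lower cost per token

gb200_cost = request_cost(INPUT_TOKENS, OUTPUT_TOKENS, gb200_price)
gb300_cost = request_cost(INPUT_TOKENS, OUTPUT_TOKENS, gb300_price)

print(f"GB200: ${gb200_cost:.3f}  GB300: ${gb300_cost:.3f} per request")
# → GB200: $0.408  GB300: $0.272 per request
```

At scale the per-request difference compounds: every million such requests would cost roughly a third less on GB300 NVL72 than on GB200 NVL72 under these assumptions.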

Impact Score: 68

Anu Bradford on tech sovereignty and regulatory fragmentation

Anu Bradford argues that Europe is wavering in its role as the world’s digital rule-setter just as governments everywhere move toward more state control over technology. Global companies are being pushed to treat geopolitical risk, data sovereignty, and Artificial Intelligence governance as core strategic issues.

Mistral launches text-to-speech model

Mistral has expanded its Voxtral family with a text-to-speech system aimed at enterprise voice applications. The company is positioning the open-weights model as a flexible alternative for organizations that want more control over deployment, cost and customization.

UK Parliament opens workforce inquiry on Artificial Intelligence

A UK Parliament committee is examining how Artificial Intelligence is changing business and work, with a focus on both economic opportunity and labour disruption. The inquiry is seeking evidence on government priorities as adoption expands across the economy.

Windows 11 tightens kernel trust for older drivers

Microsoft is changing Windows 11 kernel policy so new drivers must be signed through the Windows Hardware Compatibility Program. Older trusted drivers will still be allowed in some cases to preserve compatibility during the transition.
