The Nvidia Blackwell platform is seeing broad adoption among inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI, with deployments already cutting cost per token by up to 10x compared with earlier generations. Agentic AI use cases and coding assistants are driving rapid growth in software-programming-related AI queries, whose share grew from 11% to about 50% last year according to OpenRouter's State of Inference report, and these workloads demand both low latency across multistep workflows and long context to reason over entire codebases. New SemiAnalysis InferenceX performance data indicates that Nvidia's software optimizations, combined with the next-generation Blackwell Ultra platform, push Nvidia GB300 NVL72 systems to up to 50x higher throughput per megawatt than the Nvidia Hopper platform, which translates into 35x lower cost per token.
Earlier analysis from Signal65 found that Nvidia GB200 NVL72, with tightly co-designed hardware and software, delivers more than 10x more tokens per watt than the Nvidia Hopper platform, which translates into one-tenth the cost per token, and those gains have kept expanding as the stack improves. Continuous optimizations from the teams behind Nvidia TensorRT-LLM, Nvidia Dynamo, Mooncake and SGLang are significantly boosting Blackwell NVL72 throughput for mixture-of-experts inference across all latency targets; changes to the Nvidia TensorRT-LLM library alone have delivered up to 5x better performance on GB200 for low-latency workloads compared with just four months ago. Building on these advances, GB300 NVL72 with the Blackwell Ultra GPU extends throughput per megawatt to 50x that of Hopper, which translates into up to 35x lower cost per million tokens at the low latencies where agentic applications operate, letting real-time interactive assistants scale to many more users.
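To make the link between throughput per megawatt and cost per token concrete, here is a minimal back-of-the-envelope sketch. The throughput and electricity figures are hypothetical placeholders, not measured GB300 or Hopper numbers; the point is only the shape of the arithmetic.

```python
# Back-of-the-envelope: how throughput per megawatt maps to energy cost
# per million tokens. All figures below are hypothetical placeholders,
# not measured GB300/Hopper numbers.

def energy_cost_per_million_tokens(tokens_per_sec_per_mw: float,
                                   price_per_mwh: float) -> float:
    """Energy cost (USD) to generate 1M tokens at a given efficiency.

    tokens_per_sec_per_mw -- sustained throughput per MW of rack power
    price_per_mwh         -- electricity price in USD per MWh
    """
    tokens_per_mwh = tokens_per_sec_per_mw * 3600  # tokens per MW-hour
    return price_per_mwh / tokens_per_mwh * 1e6

# Hypothetical baseline vs. a 50x throughput-per-MW improvement.
baseline = energy_cost_per_million_tokens(tokens_per_sec_per_mw=2_000,
                                          price_per_mwh=100.0)
improved = energy_cost_per_million_tokens(tokens_per_sec_per_mw=100_000,
                                          price_per_mwh=100.0)
print(f"baseline: ${baseline:.2f}/Mtok, improved: ${improved:.3f}/Mtok")
```

In this toy model the energy cost per token falls by exactly 50x. Real cost per token also carries amortized hardware, networking and facility costs, which is one plausible reason a 50x efficiency gain lands at roughly 35x lower total cost per token rather than the full 50x.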
The benefits of GB300 NVL72 are particularly pronounced in long-context scenarios, such as AI coding assistants that must reason across entire repositories. For workloads with 128,000-token inputs and 8,000-token outputs, GB300 NVL72 delivers up to 1.5x lower cost per token than GB200 NVL72, helped by Blackwell Ultra's 1.5x higher NVFP4 compute performance and 2x faster attention processing, which together enable efficient understanding of entire codebases. Major cloud providers including Microsoft, CoreWeave and Oracle Cloud Infrastructure are deploying GB300 NVL72 for low-latency and long-context use cases, with CoreWeave emphasizing that Grace Blackwell NVL72 improves token economics and makes large-scale inference more usable for customers. Looking ahead, the Nvidia Rubin platform, which combines six new chips into a single AI supercomputer, is positioned to deliver further improvements: up to 10x higher throughput per megawatt for mixture-of-experts inference compared with Blackwell, translating into one-tenth the cost per million tokens, and the ability to train large mixture-of-experts models with just one-fourth the number of GPUs compared with Blackwell.
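Returning to the long-context point above: a rough per-token FLOP split for a single dense transformer layer illustrates why faster attention matters specifically at 128,000-token contexts. The model dimensions below are illustrative assumptions (a dense layer with multi-head attention and a 4x MLP, not any particular network), and the count ignores grouped-query attention, MoE routing, norms and embeddings.

```python
# Rough per-token FLOP split for a dense transformer layer, to illustrate
# why faster attention matters at long context. Dimensions are hypothetical
# and the count is simplified (2 FLOPs per multiply-accumulate).

def per_token_layer_flops(d_model: int, d_ff: int, ctx_len: int):
    proj = 2 * 4 * d_model ** 2          # Q/K/V/output projections
    attn = 2 * 2 * ctx_len * d_model     # QK^T scores + attention-weighted V
    mlp = 2 * 2 * d_model * d_ff         # two MLP matmuls
    return proj, attn, mlp

for ctx in (8_000, 128_000):
    proj, attn, mlp = per_token_layer_flops(d_model=8192, d_ff=32768,
                                            ctx_len=ctx)
    share = attn / (proj + attn + mlp)
    print(f"ctx={ctx:>7}: attention is {share:.0%} of per-token layer FLOPs")
```

Under these assumptions attention is only about 14% of per-token layer compute at an 8,000-token context but roughly 72% at 128,000 tokens, because the attention-score work grows linearly with context length while the projection and MLP work stays fixed. That is why a 2x attention speedup pays off disproportionately for repository-scale inputs.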
