AI applications built on token-based interactions are becoming cheaper to run as infrastructure and algorithmic efficiencies drive down the cost per token. Recent MIT research found that these efficiencies are cutting inference costs for frontier-level performance by up to 10x annually. Nvidia positions its Blackwell platform as a key lever in this shift, with inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI reporting up to 10x lower token costs compared with the Nvidia Hopper generation, particularly when paired with open source frontier models and low-precision formats like NVFP4.
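For a concrete sense of that pace, the short Python sketch below projects cost per million tokens under a steady 10x annual decline. The $1.00 starting price is a hypothetical figure chosen only for illustration, not a quote from any provider's price list.

```python
# Illustrative projection of inference cost under a steady 10x annual decline.
# The $1.00 starting price is a hypothetical figure, not a real price list entry.
start_cost_per_million_tokens = 1.00  # dollars, assumed baseline
annual_reduction_factor = 10          # "up to 10x annually" per the research cited above

for year in range(4):
    cost = start_cost_per_million_tokens / annual_reduction_factor ** year
    print(f"Year {year}: ${cost:.4f} per million tokens")
```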
In healthcare, Sully.ai turned to Baseten’s Model API to replace proprietary closed source models that suffered from unpredictable latency, rising inference costs and limited control. Baseten deploys open source models such as gpt-oss-120b on Nvidia Blackwell GPUs using NVFP4, TensorRT-LLM and the Nvidia Dynamo inference framework, having chosen Blackwell after measuring up to 2.5x better throughput per dollar than the Nvidia Hopper platform. As a result, Sully.ai’s inference costs dropped by 90%, a 10x reduction compared with the prior closed source implementation, response times improved by 65% for critical workflows like generating medical notes, and the company has returned over 30 million minutes to physicians.

In gaming, Latitude runs large open source mixture-of-experts models on DeepInfra’s Blackwell-based platform. DeepInfra reduced the cost per million tokens from 20 cents on the Nvidia Hopper platform to 10 cents on Blackwell, and moving to Blackwell’s native low-precision NVFP4 format cut that cost further to 5 cents, a 4x improvement in cost per token overall.
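The arithmetic behind those figures is easy to verify. The Python sketch below recomputes the reduction factors from the quoted numbers; the $1.00 baseline in the Sully.ai case is a hypothetical unit cost, used only to show that a 90% drop and a 10x reduction are the same thing.

```python
# Worked check of the cost figures quoted above (all values in US dollars).

# Sully.ai: a 90% cost reduction is equivalent to a 10x reduction.
prior_cost = 1.00                   # hypothetical baseline cost per unit of work
new_cost = prior_cost * (1 - 0.90)  # 90% lower
print(f"Reduction factor: {prior_cost / new_cost:.0f}x")  # -> 10x

# Latitude on DeepInfra: cost per million tokens across hardware generations.
hopper = 0.20           # Hopper generation
blackwell = 0.10        # Blackwell, 2x improvement
blackwell_nvfp4 = 0.05  # Blackwell with NVFP4, another 2x
print(f"Total improvement: {hopper / blackwell_nvfp4:.0f}x")  # -> 4x
```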
Agentic chat and customer service workloads are seeing similar gains. Sentient Chat, which can trigger cascades of autonomous agent interactions from a single query, uses Fireworks AI’s Blackwell-optimized inference platform and achieved 25-50% better cost efficiency than its previous Hopper-based deployment. That headroom let it absorb a viral launch of 1.8 million waitlisted users in 24 hours and process 5.6 million queries in a single week at low latency. Together AI runs production inference for Decagon’s multimodel voice stack on Nvidia Blackwell GPUs, combining speculative decoding, caching and autoscaling to keep responses under 400 milliseconds even when a query involves thousands of tokens, while cost per query dropped 6x compared with closed source proprietary models. Nvidia highlights that its GB200 NVL72 system delivers a 10x reduction in cost per token for reasoning mixture-of-experts models compared with Nvidia Hopper, and that the upcoming Nvidia Rubin platform integrates six new chips into a single AI supercomputer to deliver 10x the performance and 10x lower token cost compared with Blackwell.
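As a rough sense of scale, sustaining the query volume cited for Sentient Chat works out to the average rate computed in the sketch below. It assumes traffic is spread evenly across the week, which real launch traffic certainly is not, so the actual peak load would be considerably higher.

```python
# Back-of-the-envelope throughput for 5.6 million queries in one week.
# Assumes an even distribution over the week, which understates peak load.
queries = 5_600_000
seconds_per_week = 7 * 24 * 60 * 60

average_qps = queries / seconds_per_week
print(f"Average load: {average_qps:.1f} queries per second")  # ~9.3 qps on average
```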
