Inference providers slash AI token costs with Nvidia Blackwell

Inference platforms built on Nvidia Blackwell GPUs are cutting the cost of AI tokens by up to 10x, using open source models and tightly optimized software stacks across healthcare, gaming, customer service and agentic chat.

AI applications built on token-based interactions are becoming cheaper to run as infrastructure and algorithmic efficiencies drive down the cost per token. Recent MIT research found that these efficiencies are reducing inference costs for frontier-level performance by up to 10x annually. Nvidia positions its Blackwell platform as a key lever in this shift, with inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI reporting up to 10x lower token costs than the Nvidia Hopper generation, particularly when paired with open source frontier models and low-precision formats like NVFP4.

In healthcare, Sully.ai turned to Baseten’s Model API to replace proprietary closed source models that suffered from unpredictable latency, rising inference costs and limited control. Baseten deploys open source models such as gpt-oss-120b on Nvidia Blackwell GPUs using NVFP4, TensorRT-LLM and the Nvidia Dynamo inference framework; it chose Blackwell after measuring up to 2.5x better throughput per dollar than the Nvidia Hopper platform. As a result, Sully.ai’s inference costs dropped by 90%, a 10x reduction compared with the prior closed source implementation, while response times improved by 65% for critical workflows such as generating medical notes, and the company has returned over 30 million minutes to physicians.

In gaming, Latitude runs large open source mixture-of-experts models on DeepInfra’s Blackwell-based platform. DeepInfra first cut the cost per million tokens from 20 cents on Nvidia Hopper to 10 cents on Blackwell, and moving to Blackwell’s native low-precision NVFP4 format halved it again to 5 cents, a 4x improvement in cost per token overall.
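The percentage and multiplier claims above all reduce to a simple ratio. A minimal Python sketch, using the figures reported in the article (the variable names are illustrative, not from any provider's API):

```python
def cost_reduction_factor(old_cost: float, new_cost: float) -> float:
    """How many times cheaper new_cost is than old_cost."""
    return old_cost / new_cost

# DeepInfra's reported cost per million tokens for Latitude's workload:
hopper = 0.20           # dollars on Nvidia Hopper
blackwell = 0.10        # dollars on Blackwell
blackwell_nvfp4 = 0.05  # dollars on Blackwell with NVFP4

print(round(cost_reduction_factor(hopper, blackwell_nvfp4), 2))  # 4.0, the 4x total

# Sully.ai's 90% cost drop is the same claim as a 10x reduction:
print(round(cost_reduction_factor(1.0, 1.0 - 0.90), 2))  # 10.0
```

The same ratio shows why the two framings in the article are consistent: cutting costs by 90% leaves 1/10 of the original cost, i.e. a 10x reduction.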

Agentic chat and customer service workloads are seeing similar gains. Sentient Chat, which can trigger cascades of autonomous agent interactions from a single query, uses Fireworks AI’s Blackwell-optimized inference platform and achieved 25-50% better cost efficiency than its previous Hopper-based deployment, allowing it to absorb a viral launch of 1.8 million waitlisted users in 24 hours and process 5.6 million queries in a single week at low latency. Together AI runs production inference for Decagon’s multimodel voice stack on Nvidia Blackwell GPUs, combining speculative decoding, caching and autoscaling to keep responses under 400 milliseconds even when processing thousands of tokens per query, while cost per query dropped by 6x compared with closed source proprietary models. Nvidia highlights that its GB200 NVL72 system delivers a 10x reduction in cost per token for reasoning mixture-of-experts models compared with Hopper, and that the upcoming Nvidia Rubin platform integrates six new chips into a single AI supercomputer to deliver 10x performance and 10x lower token cost over Blackwell.

Impact Score: 70

Anu Bradford on tech sovereignty and regulatory fragmentation

Anu Bradford argues that Europe is wavering in its role as the world’s digital rule-setter just as governments everywhere move toward more state control over technology. Global companies are being pushed to treat geopolitical risk, data sovereignty, and AI governance as core strategic issues.

Mistral launches text-to-speech model

Mistral has expanded its Voxtral family with a text-to-speech system aimed at enterprise voice applications. The company is positioning the open-weights model as a flexible alternative for organizations that want more control over deployment, cost and customization.

UK Parliament opens workforce inquiry on AI

A UK Parliament committee is examining how AI is changing business and work, with a focus on both economic opportunity and labour disruption. The inquiry is seeking evidence on government priorities as adoption expands across the economy.

Windows 11 tightens kernel trust for older drivers

Microsoft is changing Windows 11 kernel policy so new drivers must be signed through the Windows Hardware Compatibility Program. Older trusted drivers will still be allowed in some cases to preserve compatibility during the transition.
