Inference providers slash AI token costs with Nvidia Blackwell

Inference platforms built on Nvidia Blackwell GPUs are cutting the cost of AI tokens by up to 10x, using open source models and tightly optimized software stacks across healthcare, gaming, customer service and agentic chat.

AI applications built on token-based interactions are becoming cheaper to run as the cost per token falls. Recent MIT research found that infrastructure and algorithmic efficiencies are reducing inference costs for frontier-level performance by up to 10x annually. Nvidia positions its Blackwell platform as a key lever in this shift: inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI report up to 10x lower token costs compared with the Nvidia Hopper generation, particularly when Blackwell is paired with open source frontier models and low-precision formats like NVFP4.

In healthcare, Sully.ai turned to Baseten’s Model API to replace proprietary closed source models that suffered from unpredictable latency, rising inference costs and limited control. Baseten deploys open source models such as gpt-oss-120b on Nvidia Blackwell GPUs using NVFP4, TensorRT-LLM and the Nvidia Dynamo inference framework, and chose Blackwell after seeing up to 2.5x better throughput per dollar compared with the Nvidia Hopper platform. As a result, Sully.ai’s inference costs dropped by 90%, a 10x reduction compared with the prior closed source implementation. Response times improved by 65% for critical workflows such as generating medical notes, and the company has returned more than 30 million minutes to physicians.

In gaming, Latitude runs large open source mixture-of-experts models on DeepInfra’s Blackwell-based platform. DeepInfra reduced the cost per million tokens from 20 cents on the Nvidia Hopper platform to 10 cents on Blackwell, and moving to Blackwell’s native low-precision NVFP4 format cut that cost further to 5 cents, a total 4x improvement in cost per token.
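The reduction figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `reduction_factor` and `factor_from_percent_drop` are hypothetical, and the inputs are the figures reported in this article, not independent measurements.

```python
def reduction_factor(old_cost: float, new_cost: float) -> float:
    """Return how many times cheaper new_cost is than old_cost."""
    if old_cost <= 0 or new_cost <= 0:
        raise ValueError("costs must be positive")
    return old_cost / new_cost


def factor_from_percent_drop(percent_drop: float) -> float:
    """Convert a percentage cost drop (e.g. 90) into an 'Nx cheaper' factor."""
    if not 0 <= percent_drop < 100:
        raise ValueError("percent_drop must be in [0, 100)")
    return 100.0 / (100.0 - percent_drop)


# Cost per million tokens in cents, as reported for DeepInfra in the article.
hopper = 20.0
blackwell = 10.0
blackwell_nvfp4 = 5.0

hopper_to_blackwell = reduction_factor(hopper, blackwell)
hopper_to_nvfp4 = reduction_factor(hopper, blackwell_nvfp4)

# Sully.ai's reported 90% drop in inference costs, expressed as a factor.
sully_factor = factor_from_percent_drop(90)

print(f"Hopper -> Blackwell: {hopper_to_blackwell:.1f}x")
print(f"Hopper -> Blackwell with NVFP4: {hopper_to_nvfp4:.1f}x")
print(f"90% drop: {sully_factor:.1f}x")
```

Under these inputs, the move from 20 cents to 5 cents works out to the 4x improvement cited for DeepInfra, and a 90% drop corresponds to the 10x reduction cited for Sully.ai.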

Agentic chat and customer service workloads are seeing similar gains. Sentient Chat, which can trigger cascades of autonomous agent interactions for a single query, uses Fireworks AI’s Blackwell-optimized inference platform and achieved 25-50% better cost efficiency compared with its previous Hopper-based deployment. That efficiency enabled it to support a viral launch of 1.8 million waitlisted users in 24 hours and process 5.6 million queries in a single week at low latency. Together AI runs production inference for Decagon’s multimodel voice stack on Nvidia Blackwell GPUs, combining speculative decoding, caching and autoscaling to keep responses under 400 milliseconds even when processing thousands of tokens per query, while cost per query dropped by 6x compared with closed source proprietary models. Nvidia says that its GB200 NVL72 system delivers a 10x reduction in cost per token for reasoning mixture-of-experts models compared with Nvidia Hopper, and that the upcoming Nvidia Rubin platform integrates six new chips into a single AI supercomputer to deliver 10x performance and 10x lower token cost over Blackwell.


