Inference providers slash Artificial Intelligence token costs with Nvidia Blackwell

Inference platforms built on Nvidia Blackwell GPUs are cutting the cost of Artificial Intelligence tokens by up to 10x, using open source models and tightly optimized software stacks across healthcare, gaming, customer service and agentic chat.

Artificial Intelligence applications built on token-based interactions are becoming cheaper to run as the cost per token falls. Recent MIT research found that infrastructure and algorithmic efficiencies are cutting inference costs for frontier-level performance by up to 10x annually. Nvidia positions its Blackwell platform as a key lever in this shift, with inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI reporting up to 10x lower token costs than the Nvidia Hopper generation, particularly when paired with open source frontier models and low-precision formats like NVFP4.
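NVFP4 is Blackwell's native 4-bit floating point format: weights are stored on a small grid of representable values, with fine-grained block scaling to limit accuracy loss. The NumPy sketch below simulates the idea, assuming the publicly described E2M1 value grid and 16-element blocks; the FP32 per-block scales here are a simplification of the FP8 scale factors the hardware actually uses, so treat it as an illustration rather than a faithful implementation.

```python
# Hedged sketch: simulating NVFP4-style block-scaled 4-bit quantization in NumPy.
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # positive FP4 (E2M1) values

def quantize_nvfp4_sim(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Round each 16-element block of x to the nearest FP4 value after scaling
    the block so its largest magnitude maps to the grid maximum (6.0)."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale[scale == 0] = 1.0                      # avoid divide-by-zero on all-zero blocks
    scaled = x / scale
    # nearest-value rounding of each magnitude onto the E2M1 grid
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(scaled) * E2M1_GRID[idx] * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
w_q = quantize_nvfp4_sim(w)
print("max abs quantization error:", np.abs(w - w_q).max())
```

The payoff is that a 4-bit weight occupies a quarter of the memory and bandwidth of FP16, which is where much of the reported throughput-per-dollar gain comes from.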

In healthcare, Sully.ai turned to Baseten’s Model API to replace proprietary models that suffered from unpredictable latency, rising inference costs and limited control. Baseten deploys open source models such as gpt-oss-120b on Nvidia Blackwell GPUs using NVFP4, TensorRT-LLM and the Nvidia Dynamo inference framework, having chosen Blackwell after measuring up to 2.5x better throughput per dollar than the Nvidia Hopper platform. As a result, Sully.ai’s inference costs dropped by 90%, a 10x reduction compared with the prior closed source implementation; response times improved by 65% for critical workflows like generating medical notes; and the company has returned over 30 million minutes to physicians.

In gaming, Latitude runs large open source mixture-of-experts models on DeepInfra’s Blackwell-based platform. DeepInfra cut the cost per million tokens from 20 cents on the Nvidia Hopper platform to 10 cents on Blackwell, and moving to Blackwell’s native low-precision NVFP4 format halved it again to 5 cents, a 4x improvement in cost per token overall.
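As a quick sanity check, the reported progression works out as stated (illustrative arithmetic only, using the article's figures):

```python
# DeepInfra's reported cost-per-million-token progression (USD, from the article).
hopper, blackwell, blackwell_nvfp4 = 0.20, 0.10, 0.05

print(f"Hopper -> Blackwell:  {hopper / blackwell:.0f}x cheaper")          # 2x
print(f"Blackwell -> +NVFP4:  {blackwell / blackwell_nvfp4:.0f}x cheaper") # 2x
print(f"Total improvement:    {hopper / blackwell_nvfp4:.0f}x")            # 4x
```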

Agentic chat and customer service workloads are seeing similar gains. Sentient Chat, which can trigger cascades of autonomous agent interactions from a single query, uses Fireworks AI’s Blackwell-optimized inference platform and achieved 25-50% better cost efficiency than its previous Hopper-based deployment, enabling it to absorb a viral launch of 1.8 million waitlisted users in 24 hours and process 5.6 million queries in a single week at low latency. Together AI runs production inference for Decagon’s multi-model voice stack on Nvidia Blackwell GPUs, combining speculative decoding, caching and autoscaling to keep responses under 400 milliseconds even when processing thousands of tokens per query, while cost per query dropped by 6x compared with proprietary models. Nvidia highlights that its GB200 NVL72 system delivers a 10x reduction in cost per token for reasoning mixture-of-experts models compared with Nvidia Hopper, and that the upcoming Nvidia Rubin platform integrates six new chips into a single Artificial Intelligence supercomputer to deliver 10x the performance and 10x lower token cost of Blackwell.
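Speculative decoding, one of the techniques credited above for keeping Decagon's latency under 400 milliseconds, pairs a small draft model with the large target model: the draft proposes several tokens cheaply, and the target verifies them together, so most steps emit multiple tokens per expensive forward pass. Below is a minimal greedy sketch; `draft_model` and `target_model` are hypothetical stand-ins, and production systems use probabilistic acceptance and batched verification rather than this token-by-token loop.

```python
# Minimal sketch of greedy speculative decoding (illustrative, not a serving stack).
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],    # returns next token id (cheap)
    target_model: Callable[[List[int]], int],   # returns next token id (expensive)
    k: int = 4,
) -> List[int]:
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    # 1. The draft model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model verifies. In a real system all k positions are scored
    #    in one batched forward pass, which is where the speedup comes from.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_model(ctx)
        if expected != t:
            accepted.append(expected)  # replace the first mismatch with the target's token
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Toy usage: a draft that always guesses token 1, a target that wants 1, 1, 2, ...
out = speculative_step([0], lambda ctx: 1, lambda ctx: 1 if len(ctx) < 3 else 2)
print(out)  # [1, 1, 2] -- two accepted draft tokens, then the target's correction
```

When the draft model agrees with the target most of the time, each verification pass yields several tokens, which is why the technique lowers per-query latency without changing the target model's outputs.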

Impact Score: 70

Nvidia DGX Spark brings desktop supercomputing to universities worldwide

Nvidia’s DGX Spark desktop supercomputer is giving universities petaflop-class Artificial Intelligence performance at the lab bench, supporting projects from neutrino astronomy at the South Pole to radiology report analysis and robotics on campus. Institutions are using the compact systems to run large models locally, protect sensitive data and prototype workflows before scaling to big clusters or cloud resources.

ByteDance’s Seedance 2.0 ignites Artificial Intelligence video race in China

ByteDance’s Seedance 2.0 video generation model has gone viral in China, drawing comparisons to a “Sputnik moment” and stoking competitive and regulatory concerns from Hollywood to Beijing. The system’s hyper-realistic output and multimodal input support are being cast as a direct challenge to leading Western Artificial Intelligence video models.

Courts clarify discoverability of artificial intelligence generated data in litigation

Courts are beginning to define when data from generative artificial intelligence tools must be preserved and produced in discovery, reinforcing that traditional e-discovery rules still apply. Companies are urged to build defensible, proportional strategies for identifying, preserving, and protecting artificial intelligence related data.
