Cloud providers adopt NVIDIA Dynamo to boost AI inference performance

Amazon Web Services, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure are integrating NVIDIA Dynamo to enable multi-node AI inference on NVIDIA Blackwell systems through their managed Kubernetes services.

The article highlights NVIDIA Blackwell as delivering industry-leading performance, efficiency and the lowest total cost of ownership across the models tested in the cited SemiAnalysis InferenceMAX v1 benchmark, with a figure noting Jensen Huang's claim that Blackwell delivers 10x the performance of NVIDIA Hopper. To realize that performance for complex AI models, the article explains, inference must be distributed across multiple servers to serve millions of concurrent users while keeping response times low.

The NVIDIA Dynamo software platform is presented as the production-ready solution that unlocks multi-node, or disaggregated, inference in existing cloud environments. The article describes disaggregated serving as a method that separates the two phases of model serving, prefill (processing the input prompt to build the attention key-value cache) and decode (generating output tokens one at a time), onto independently optimized GPUs to avoid bottlenecks and improve efficiency. For large reasoning and mixture-of-experts models such as DeepSeek-R1, disaggregated serving is said to be essential. As a real-world example, Baseten is reported to have used NVIDIA Dynamo to achieve 2x faster inference for long-context code generation and a 1.6x increase in throughput without adding hardware.
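To make the pattern concrete, here is a minimal, self-contained sketch of the prefill/decode split with a KV-cache handoff between the two pools. It is illustrative only: the class names and the toy cache are invented for this example, and none of this reflects Dynamo's actual API.

```python
# Toy sketch of disaggregated serving: prefill and decode run as separate
# worker pools that hand off a KV cache. Names here (PrefillWorker,
# DecodeWorker, KVCache) are hypothetical, not NVIDIA Dynamo's API.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stands in for the attention key/value state built during prefill."""
    prompt_tokens: list[str]
    state: dict = field(default_factory=dict)


class PrefillWorker:
    """Compute-bound phase: processes the whole prompt once, builds the cache."""

    def run(self, prompt: str) -> KVCache:
        tokens = prompt.split()
        # A real worker would run a forward pass over all prompt tokens here.
        return KVCache(prompt_tokens=tokens, state={"layers": len(tokens)})


class DecodeWorker:
    """Memory-bandwidth-bound phase: generates output tokens one at a time."""

    def run(self, cache: KVCache, max_tokens: int = 4) -> list[str]:
        # A real worker would reuse the transferred KV cache at each step.
        return [f"<tok{i}>" for i in range(max_tokens)]


def serve(prompt: str) -> list[str]:
    # Because the phases are separate, each pool can be sized and placed
    # on GPUs optimized for its own bottleneck.
    cache = PrefillWorker().run(prompt)   # e.g. compute-optimized GPUs
    return DecodeWorker().run(cache)      # e.g. bandwidth-optimized GPUs


if __name__ == "__main__":
    print(serve("Explain disaggregated serving in one sentence."))
```

The point of the separation is visible in `serve`: once prefill and decode are distinct workers, a long-prompt burst can be absorbed by growing the prefill pool alone, without over-provisioning decode.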

Kubernetes is identified as the scaling mechanism for disaggregated serving, and the article lists the major cloud integrations that make Dynamo available in managed Kubernetes offerings. Amazon Web Services integrates Dynamo with Amazon EKS. Google Cloud provides a Dynamo recipe for LLM inference on its AI Hypercomputer. Microsoft Azure enables multi-node LLM inference with Dynamo and ND GB200 v6 virtual machines on Azure Kubernetes Service. Oracle Cloud Infrastructure supports multi-node LLM inference with OCI Superclusters and Dynamo. The piece also mentions Nebius as a partner building cloud infrastructure to serve inference workloads with NVIDIA accelerated computing and Dynamo.
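The sketch below shows why Kubernetes is a natural fit: if prefill and decode run as separate worker pools, each can be scaled independently with the standard scale API. The deployment names and namespace are made up for illustration, and in practice Dynamo ships with its own operator and custom resources rather than hand-scaled Deployments; this only demonstrates the underlying idea using the official Kubernetes Python client.

```python
# Hypothetical sketch: scaling disaggregated worker pools independently on
# Kubernetes. Deployment names ("llm-prefill-workers", "llm-decode-workers")
# and the "inference" namespace are assumptions for this example.
from kubernetes import client, config


def scale(deployment: str, replicas: int, namespace: str = "inference") -> None:
    """Patch the replica count of one worker pool."""
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )


if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() inside a cluster
    # A burst of long prompts? Grow only the compute-bound prefill pool.
    scale("llm-prefill-workers", replicas=8)
    # Decode demand is steady, so its bandwidth-bound pool stays put.
    scale("llm-decode-workers", replicas=4)
```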

To simplify orchestration on Kubernetes, NVIDIA Grove is introduced as an API within Dynamo that lets users declare high-level specifications for complex inference systems. Grove is described as automating coordination, scaling, placement and startup order for components such as prefill, decode and routing, reducing operational complexity as inference becomes increasingly distributed. The article directs readers to technical deep dives and simulation tools for further exploration of hardware and deployment trade-offs.
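The declarative idea behind Grove can be sketched in a few lines: the operator states what the system looks like, and the platform derives details such as startup order. The spec schema and field names below are hypothetical, not Grove's actual API; the example merely uses a topological sort to show how a declared dependency graph yields a valid startup order for prefill, decode and routing components.

```python
# Illustrative sketch of a declarative inference spec. The schema (keys
# "depends_on" and "replicas") is an assumption for this example and does
# not reflect Grove's real resource definitions.
from graphlib import TopologicalSorter

# One high-level spec for a disaggregated deployment: each component
# declares what it waits for, plus its own replica count.
spec = {
    "router":  {"depends_on": ["prefill", "decode"], "replicas": 2},
    "prefill": {"depends_on": [], "replicas": 8},
    "decode":  {"depends_on": [], "replicas": 4},
}

# Derive a valid startup order from the declared dependencies, the way an
# orchestrator would before creating pods.
order = TopologicalSorter(
    {name: comp["depends_on"] for name, comp in spec.items()}
).static_order()

for name in order:
    print(f"start {spec[name]['replicas']}x {name}")
```

Running this prints the prefill and decode pools before the router, the same kind of ordering guarantee the article says Grove automates alongside coordination, scaling and placement.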
