Cloud providers adopt NVIDIA Dynamo to boost Artificial Intelligence inference performance

Amazon Web Services, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure are integrating NVIDIA Dynamo to enable multi-node Artificial Intelligence inference on Blackwell systems and managed Kubernetes services.

NVIDIA Blackwell is highlighted in the article as delivering industry-leading performance, efficiency and the lowest total cost of ownership across tested models, citing the SemiAnalysis InferenceMAX v1 benchmark and a figure in which Jensen Huang claims Blackwell delivers 10x the performance of NVIDIA Hopper. To realize that performance for complex Artificial Intelligence models, the article explains, inference must be distributed across multiple servers so systems can serve millions of concurrent users while reducing response times.

The NVIDIA Dynamo software platform is presented as the production-ready solution that unlocks multi-node, or disaggregated, inference across existing cloud environments. The article describes disaggregated serving as a method that separates the two phases of model serving, prefill and decode, onto independently optimized GPUs to avoid bottlenecks and improve efficiency. For large reasoning and mixture-of-experts models such as DeepSeek-R1, disaggregated serving is said to be essential. A real-world example notes that Baseten used NVIDIA Dynamo to achieve 2x faster inference for long-context code generation and a 1.6x increase in throughput without adding hardware.
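The prefill/decode split described above can be illustrated with a toy sketch. All names below are invented for illustration; this is not the Dynamo API, just the general idea of running the compute-bound prompt-processing phase and the memory-bandwidth-bound token-generation phase in separate worker pools that hand off a KV cache.

```python
# Toy sketch of disaggregated serving: prefill and decode run as
# separate functions, standing in for the separate GPU pools the
# article describes. Names and structures here are illustrative only.
from dataclasses import dataclass

@dataclass
class KVCache:
    # A real KV cache holds per-layer key/value tensors; here it is
    # reduced to the number of prompt tokens it was built from.
    prompt_tokens: int

def prefill(prompt: str) -> KVCache:
    """Compute-bound phase: process the whole prompt in one pass."""
    tokens = prompt.split()
    return KVCache(prompt_tokens=len(tokens))

def decode(cache: KVCache, max_new_tokens: int) -> list[str]:
    """Memory-bandwidth-bound phase: emit one token per step."""
    return [f"tok{cache.prompt_tokens + i}" for i in range(max_new_tokens)]

# A router would send prefill to one GPU pool and decode to another,
# transferring the KV cache between them.
cache = prefill("explain disaggregated serving in one line")
out = decode(cache, max_new_tokens=3)
```

Because the two phases stress hardware differently, running each on independently sized pools avoids one phase starving the other, which is the efficiency argument the article attributes to disaggregated serving.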

Kubernetes is identified as the scaling mechanism for disaggregated serving, and the article lists the major cloud integrations that make Dynamo available in managed Kubernetes offerings: Amazon Web Services integrates Dynamo with Amazon EKS, Google Cloud provides a Dynamo recipe for LLM inference on its AI Hypercomputer, Microsoft Azure enables multi-node LLM inference with Dynamo and ND GB200-v6 GPUs on Azure Kubernetes Service, and Oracle Cloud Infrastructure supports multi-node LLM inferencing with OCI Superclusters and Dynamo. The piece also mentions Nebius as a partner building cloud infrastructure to serve inference workloads with NVIDIA accelerated computing and Dynamo.

To simplify orchestration on Kubernetes, NVIDIA Grove is introduced as an application programming interface within Dynamo that lets users declare high-level specifications for complex inference systems. Grove is described as automating coordination, scaling, placement and startup order for components such as prefill, decode and routing, reducing operational complexity as inference becomes increasingly distributed. The article directs readers to technical deep dives and simulation tools for further exploration of hardware and deployment trade-offs.
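The kind of high-level declaration the article attributes to Grove can be sketched as follows. The schema, field names, and `startup_plan` helper below are all hypothetical, invented for illustration; they are not Grove's actual API, only a minimal picture of declaring components with replica counts and a startup order and letting the system derive the launch sequence.

```python
# Illustrative sketch of a declarative multi-component inference spec,
# in the spirit of what the article describes Grove doing on Kubernetes.
# Every name here is hypothetical; this is not Grove's real schema.
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    replicas: int
    startup_order: int  # lower values start first

# Declare the system at a high level: a router plus independently
# scaled prefill and decode pools.
spec = [
    Component("router", replicas=1, startup_order=0),
    Component("prefill", replicas=4, startup_order=1),
    Component("decode", replicas=8, startup_order=1),
]

def startup_plan(components: list[Component]) -> list[str]:
    """Derive a launch sequence so earlier-order components come up first."""
    return [c.name for c in sorted(components, key=lambda c: c.startup_order)]

plan = startup_plan(spec)
```

The point of the declarative style is that the operator states *what* the system looks like (components, replicas, ordering) and the platform handles coordination, placement, and scaling, which is the operational simplification the article credits to Grove.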

Impact Score: 68

Tech firms commit billions to Artificial Intelligence infrastructure

Amazon, OpenAI, Nvidia, Meta, Google and others are signing increasingly large cloud, chip and data center agreements as demand for Artificial Intelligence infrastructure accelerates. The latest wave of deals spans investments, compute purchases, chip supply agreements and data center buildouts.

JEDEC outlines LPDDR6 expansion for data centers

JEDEC has previewed planned updates to LPDDR6 aimed at pushing the memory standard beyond mobile devices and into selected data center and accelerated computing use cases. The roadmap includes higher-capacity packaging options, flexible metadata support, 512 GB densities, and a new SOCAMM2 module standard.

TSMC debuts A13 process technology

TSMC has introduced its A13 process at its 2026 North America Technology Symposium as a tighter version of A14 aimed at next-generation Artificial Intelligence, high performance computing, and mobile designs. The company positions the node as a more compact and efficient option with backward-compatible design rules for faster migration.

Google unveils eighth-generation tensor processing units

Google introduced its eighth generation of custom tensor processing units with separate designs for training and inference. The new TPU 8t and TPU 8i are aimed at large-scale model training, serving, and agentic workloads.
