AMD is escalating its competition with Nvidia across both GPU software and server CPUs, aiming to weaken CUDA lock-in while protecting its position in processor infrastructure. Ahead of Nvidia’s GTC event, AMD pushed back on claims that it trails badly in inference software, with Anush Elangovan, AMD’s VP for Artificial Intelligence software, arguing that updates to the ROCm stack have narrowed the gap and, in some cases, let AMD hardware surpass Nvidia’s B200 on four-bit floating point (FP4) workloads. He also argued that customer demand remains centered on eight-bit floating point (FP8), saying “most” users operate there and that “FP8 is still king of the hill.”
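To make the format debate concrete, the sketch below shows what casting a tensor to FP8 (the e4m3 variant) looks like in PyTorch. The naive per-tensor max scaling is an illustrative assumption; production inference stacks calibrate scales per channel or per block.

```python
# Minimal sketch: quantize a tensor to FP8 (e4m3) and back in PyTorch.
# The per-tensor scale here is a simplification for illustration only.
import torch

x = torch.randn(8, 16)
scale = x.abs().max() / 448.0                  # 448 is the largest normal e4m3 value
x_fp8 = (x / scale).to(torch.float8_e4m3fn)    # cast to 8-bit floating point
x_back = x_fp8.to(torch.float32) * scale       # dequantize for comparison
print((x - x_back).abs().max())                # observe the quantization error
```

Halving the bit width again to FP4 doubles effective memory bandwidth and compute density, which is why vendors compete on it, but it also roughly doubles the quantization error seen above, which is why many users stay at FP8.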
AMD is positioning ROCm as an open alternative to CUDA for Artificial Intelligence and high-performance computing workloads, emphasizing compatibility with frameworks such as PyTorch, TensorFlow, and JAX. A core part of that strategy is heavier investment in Triton, the open, Python-first GPU compiler originally developed at OpenAI. AMD sees Triton as the highest-level abstraction layer for GPU programming and wants it to become “the de facto” standard so that moving workloads from Nvidia to AMD is “zero friction.” Its own lower-level tools, including Fly DSL and Wave, are intended to handle hardware-specific tuning underneath that abstraction. AMD also frames Nvidia’s CUDA Tile push as a reaction to Triton’s growing role in democratizing GPU programming.
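For a sense of why that abstraction matters, below is a minimal sketch of a Triton kernel (vector addition). The same Python source compiles through Triton’s Nvidia or AMD backend without modification, which is the portability AMD is betting on; ROCm builds of PyTorch even expose AMD GPUs under the "cuda" device string, so the calling code is unchanged too.

```python
# Minimal sketch: a vendor-agnostic Triton kernel for element-wise addition.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the arrays.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements            # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)         # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Identical on CUDA and ROCm builds of PyTorch:
x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```

Nothing in the kernel names a vendor; the hardware-specific tuning Triton cannot express is what AMD says its lower-level tools are meant to absorb.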
The rivalry is also expanding into the CPU layer, where AMD is trying to defend what it sees as an established advantage as Nvidia promotes Grace and Vera. Days before GTC, AMD highlighted benchmark claims for its server chips, citing SPEC CPU benchmark data showing that its 5th-Gen EPYC CPUs delivered 2.1 times higher per-core performance than Nvidia’s Grace Superchip systems, along with up to a 2.26-times uplift in operations per watt. AMD is also tying CPUs more directly to emerging agentic Artificial Intelligence workloads, arguing that processors act as the control plane for GPU-heavy data centers by orchestrating work and managing more complex tasks.
Cloud providers are already reflecting that positioning. Microsoft added AMD’s Turin processors to its Da/Ea/Fasv7-series virtual machines and said the chips deliver 35% better CPU performance than the prior v6 AMD-based generation, along with higher instructions per clock, greater memory bandwidth, and support for advanced vector instructions. Google Cloud has also adopted AMD’s 5th-Gen EPYC processors for its C4D and H4D instances aimed at Artificial Intelligence inference, high-performance computing, and general-purpose workloads. AMD has further extended that push with edge-focused CPU variants launched last September for latency-critical applications, signaling that the company intends to compete with Nvidia from software stack to server control plane.
