NVIDIA Groq 3 LPX is positioned as the inference accelerator for NVIDIA Vera Rubin, aimed at the low-latency, large-context requirements of agentic systems. The design pairs NVIDIA Rubin GPUs with LPUs in a co-designed architecture intended to combine interactivity, intelligence, and throughput in a single inference platform. NVIDIA says the combination extends the AI factory with deterministic, low-latency token generation for real-time inference workloads.
By pairing Rubin GPUs, which supply high-bandwidth memory (HBM), with LPUs, which supply on-chip static random-access memory (SRAM), NVIDIA Vera Rubin with LPX targets a new class of inference performance for trillion-parameter models and million-token contexts. Deployed with Vera Rubin NVL72, Rubin GPUs and LPUs accelerate decode by jointly computing every layer of the AI model for every output token. NVIDIA notes that agentic systems consume up to 15x more tokens than traditional AI applications, and claims that Vera Rubin paired with LPX delivers up to 35x higher throughput per megawatt for trillion-parameter models, letting AI factories produce premium tokens at scale and unlock up to 10x more revenue per watt.
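To make the quoted multipliers concrete, here is a minimal arithmetic sketch. The 15x and 35x factors come from NVIDIA's claims above; the baseline figures are hypothetical placeholders chosen only for illustration, not published data.

```python
# Illustrative arithmetic only: the 15x and 35x multipliers are NVIDIA's
# quoted figures; the baselines below are hypothetical placeholders.

baseline_tokens_per_task = 1_000   # hypothetical traditional-AI-app figure
baseline_tokens_per_mw = 1.0       # normalized baseline throughput per megawatt

# "Agentic systems consume up to 15x more tokens"
agentic_tokens_per_task = baseline_tokens_per_task * 15

# "Up to 35x higher throughput per megawatt" with LPX
lpx_tokens_per_mw = baseline_tokens_per_mw * 35

print(agentic_tokens_per_task, lpx_tokens_per_mw)  # 15000 35.0
```

The point of the sketch is that the two multipliers compound: an agentic workload needs far more tokens per task, so a per-megawatt throughput gain of this size matters more for agentic inference than for traditional applications.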
NVIDIA says each LPX rack features 256 interconnected LPU accelerators that work with the Vera Rubin platform to accelerate inference. Each LPU accelerator delivers 500 megabytes (MB) of SRAM, 150 terabytes per second (TB/s) of SRAM bandwidth, and 2.5 TB/s scale-up bandwidth. At the rack level, LPX delivers 128 GB of SRAM for low-latency processing and 12 TB of DDR5 memory for large models and workloads. NVIDIA also highlights 40 petabytes per second (PB/s) of SRAM bandwidth per rack and 640 TB/s of scale-up bandwidth across the LPX rack for low-latency chip communication.
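The rack-level numbers above are consistent with the per-accelerator figures. A quick sanity check, where the per-LPU values come from the text and the 256x aggregation is simple arithmetic rather than an NVIDIA-published formula:

```python
# Back-of-the-envelope check of the LPX rack figures.
# Per-LPU specs are from the text; multiplying by 256 LPUs per rack is
# our own arithmetic, not an NVIDIA formula.

LPUS_PER_RACK = 256

SRAM_PER_LPU_MB = 500        # 500 MB of SRAM per LPU
SRAM_BW_PER_LPU_TBPS = 150   # 150 TB/s SRAM bandwidth per LPU
SCALEUP_BW_PER_LPU_TBPS = 2.5  # 2.5 TB/s scale-up bandwidth per LPU

rack_sram_gb = SRAM_PER_LPU_MB * LPUS_PER_RACK / 1000            # 128 GB
rack_sram_bw_pbps = SRAM_BW_PER_LPU_TBPS * LPUS_PER_RACK / 1000  # 38.4 PB/s
rack_scaleup_bw_tbps = SCALEUP_BW_PER_LPU_TBPS * LPUS_PER_RACK   # 640 TB/s

print(rack_sram_gb, rack_sram_bw_pbps, rack_scaleup_bw_tbps)
# 128.0 38.4 640.0
```

Note that 256 x 150 TB/s works out to 38.4 PB/s, which matches the "40 PB/s" rack figure only as a rounded-up marketing number; the 128 GB SRAM and 640 TB/s scale-up figures match exactly.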
The broader system is presented as part of the NVIDIA Vera Rubin NVL72 platform, which unifies seven purpose-built chips into a single AI supercomputer. NVIDIA says LPX connects to NVL72 over high-speed links designed to drive latency toward zero. The platform also adopts the NVIDIA MGX ETL rack, so deployments can standardize on a single universal rack design within Vera Rubin installations. Overall, the message centers on scaling long-context inference and enabling token factories to support more demanding agentic workloads with higher performance and efficiency.
