Penguin Solutions has introduced a production-ready key-value (KV) cache server built on CXL memory technology to address the memory wall in AI inference. The MemoryAI KV cache server targets enterprise-scale inference, including agentic AI, and aims to improve latency, throughput, GPU cluster efficiency, service-level agreement (SLA) performance, and time-to-first-token.
Inference workloads are described as fundamentally different from model training and fine-tuning because they are continuous, memory-bound, and latency-sensitive. The company estimates that inference demand is roughly 30% compute-driven (GPU) and 70% memory-driven (RAM); when memory capacity falls short, the result is performance bottlenecks and idle GPUs.
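The memory-bound claim is easiest to see with a back-of-the-envelope estimate of how large a single request's KV cache can get. The sketch below is illustrative only: the model shape (80 layers, 8 grouped-query KV heads, head dimension 128) and the 32k-token context are assumptions chosen to resemble a large open-weight model, not figures from Penguin Solutions.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Rough KV cache size: keys and values (hence the factor of 2) are
    cached for every layer, KV head, and token of the context."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama-70B-class shape serving a 32k-token context in FP16.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      seq_len=32_768, bytes_per_elem=2)
print(f"KV cache per request: {size / 2**30:.1f} GiB")  # -> 10.0 GiB
```

At roughly 10 GiB per long-context request, a handful of concurrent sessions already strain the HBM of a single GPU, which is the bottleneck that offloading the KV cache to larger DDR5 and CXL memory tiers is meant to relieve.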
The system delivers up to 11 TB of memory for memory-bound AI workloads, combining 3 TB of DDR5 main memory with up to eight 1 TB CXL Add-in Cards (AICs) for an additional 8 TB of CXL-attached capacity. The company positions the platform as a way to support higher-performance inference while improving the utilization of GPU infrastructure.
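The headline capacity follows directly from those components. Below is a minimal sketch of the arithmetic, reusing the ~10 GiB per-request estimate above and assuming, hypothetically, 80 GB of HBM per GPU for comparison; the HBM figure is not part of the product announcement.

```python
DDR5_TB = 3        # stated DDR5 main memory
CXL_AIC_TB = 1     # stated capacity per CXL Add-in Card (AIC)
NUM_AICS = 8       # stated maximum AIC count

total_tb = DDR5_TB + NUM_AICS * CXL_AIC_TB     # 3 + 8 * 1 = 11 TB
kv_cache_gib = 10                              # per-request estimate above
gpu_hbm_gib = 80                               # assumed HBM per GPU

total_gib = total_tb * 1024                    # treating TB as TiB for a rough cut
print(f"Total server memory: {total_tb} TB")
print(f"~{total_gib // kv_cache_gib} long-context KV caches in server memory "
      f"vs ~{gpu_hbm_gib // kv_cache_gib} in one GPU's HBM")
```

The roughly two-orders-of-magnitude gap in how many KV caches fit is the capacity argument behind pairing GPUs with a large CXL-expanded memory tier.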
