IBM Research, Red Hat, and Google Cloud announced at KubeCon Europe 2026 in Amsterdam that they are donating llm-d to the Cloud Native Computing Foundation as a sandbox project. The open-source framework is designed to make large language model inference a cloud-native, production-grade workload on Kubernetes. Backing from NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI positions the project as a community-governed effort around vendor-neutral inference infrastructure.
Launched in 2025, llm-d was built to make serving foundation models at scale predictable, portable, and cloud-native. It turns inference from a model-by-model deployment problem into a reusable Kubernetes-based system. The framework splits inference into prefill and decode phases and runs them on different pods, allowing each phase to scale independently. It also adds routing and scheduling based on KV-cache state, pod load, and hardware characteristics, layering a modular stack on Kubernetes with vLLM as the inference engine behind an inference gateway.
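The cache- and load-aware routing described above can be illustrated with a minimal scoring sketch. The pod fields, weights, and function names below are hypothetical stand-ins, not llm-d's actual scheduler API; the point is only the idea of ranking decode pods by how much of a prompt's KV cache they already hold, discounted by current load.

```python
# Hypothetical sketch of KV-cache-aware endpoint picking, in the spirit of
# llm-d's scheduler. All names, fields, and weights are illustrative.
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    prefix_cache_hit: float  # fraction of the prompt's KV cache already resident (0..1)
    queue_depth: int         # requests currently waiting on this pod
    max_queue: int = 16

def score(pod: Pod, w_cache: float = 0.7, w_load: float = 0.3) -> float:
    """Higher is better: reward resident KV cache, penalize queue pressure."""
    load_penalty = min(pod.queue_depth / pod.max_queue, 1.0)
    return w_cache * pod.prefix_cache_hit - w_load * load_penalty

def pick(pods: list[Pod]) -> Pod:
    """Route the request to the highest-scoring pod."""
    return max(pods, key=score)

pods = [
    Pod("decode-0", prefix_cache_hit=0.9, queue_depth=12),
    Pod("decode-1", prefix_cache_hit=0.1, queue_depth=2),
]
print(pick(pods).name)  # → decode-0: its cache hit outweighs its longer queue
```

With these particular weights, a warm cache dominates moderate load; a real scheduler would also factor in hardware characteristics and prefill/decode phase, as the article notes.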
The design is intended to improve both performance and cost efficiency for stateful inference workloads. Early testing by Google Cloud showed “2x improvements in time-to-first-token for use cases like code completion, enabling more responsive applications.” llm-d also supports hierarchical cache offloading across GPU, CPU, and storage tiers, which enables larger context windows without overwhelming accelerator memory. Its autoscaling is tuned to workload patterns and hardware rather than generic utilization metrics, and it is designed to work with Kubernetes technologies including the Gateway API Inference Extension and LeaderWorkerSet.
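The tiered offloading idea can be sketched as a simple multi-level cache: hot KV blocks stay in the fastest tier, and least-recently-used entries are demoted downward as capacity runs out. This is a toy illustration under assumed tier names and capacities, not llm-d's actual offloading implementation.

```python
# Toy sketch of hierarchical KV-cache offloading across memory tiers
# (GPU -> CPU -> storage). Tier names, capacities, and the LRU demotion
# policy are illustrative assumptions, not llm-d internals.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, capacities: dict[str, int]):
        # Fastest tier first, e.g. {"gpu": 2, "cpu": 2, "disk": 4}
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def put(self, key: str, value: str) -> None:
        self._insert(0, key, value)

    def _insert(self, level: int, key: str, value: str) -> None:
        if level >= len(self.tiers):
            return  # evicted past the slowest tier
        _name, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        if len(store) > cap:
            # Demote the least-recently-used entry to the next (slower) tier.
            old_key, old_val = store.popitem(last=False)
            self._insert(level + 1, old_key, old_val)

    def get(self, key: str):
        for name, _cap, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)  # promote back to the fastest tier
                return value, name
        return None, None

cache = TieredKVCache({"gpu": 2, "cpu": 2, "disk": 4})
for i in range(5):
    cache.put(f"seq-{i}", f"kv-{i}")
```

After five inserts, the oldest sequence has cascaded down to the storage tier while recent ones remain on the GPU; reading it back promotes it to the fast tier again, mirroring how offloading lets total cache capacity exceed accelerator memory.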
Supporters describe llm-d as a validated path from experimentation to production, with reproducible benchmarks, tested deployment patterns, and compatibility across NVIDIA GPUs, Google TPUs, and AMD and Intel hardware. IBM executives framed the donation as part of a broader push to make distributed inference a standard part of the cloud-native stack, comparable in importance to established CNCF projects. The next development cycle will focus on multi-modal workloads, Hugging Face multi-LoRA optimization, and deeper integration with vLLM, with Mistral AI already contributing code for disaggregated serving.
