AMD described a startup optimization for on-device large language model (LLM) inference on Ryzen AI processors, where execution is split between the NPU and the integrated GPU (iGPU). The NPU handles the compute-intensive prefill phase with up to 50 TOPS of AI Engine performance, while the iGPU handles the memory-bound decode phase. That hybrid design delivers low time-to-first-token and high tokens-per-second, but operator initialization imposes a significant startup penalty before inference begins.
The main bottleneck comes from how custom operators were originally initialized. A single constructor performed both host-side model reading and device-side NPU setup, including reading ONNX graph attributes, extracting weights, allocating NPU memory, creating accelerator kernels, and transferring weights. For a multi-layer transformer, these steps repeated sequentially across many operators. AMD identified the core issue as CPU cache pollution: host-side model reads and device-side driver calls touch different memory regions, and alternating between them on one thread repeatedly evicts hot L1 and L2 cache data.
The proposed fix is a two-phase deferred initialization scheme. In phase 1, the main thread performs all model-reading work, including collecting node attributes, extracting constant tensors, capturing session configuration, and storing the state needed later. In phase 2, background worker threads handle device setup, including kernel creation, device memory allocation, weight formatting and transfer, and execution parameter setup. A grouped thread pool assigns one worker thread per operator type so same-type tasks run sequentially in FIFO order for safety, while different operator types can initialize in parallel.
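The two-phase scheme and grouped pool described above can be sketched in a few lines of Python. This is a minimal illustration, not AMD's implementation: the names GroupedPool, run_two_phase, and the node tuples are hypothetical, and the "device setup" is a stand-in that only records the order in which tasks ran.

```python
import queue
import threading


class GroupedPool:
    """One worker thread per group key; tasks within a group run in FIFO order."""

    def __init__(self):
        self._queues = {}
        self._threads = []
        self._lock = threading.Lock()

    def submit(self, group, task):
        # Lazily create one queue and one worker thread for each new group.
        with self._lock:
            q = self._queues.get(group)
            if q is None:
                q = queue.Queue()
                self._queues[group] = q
                t = threading.Thread(target=self._drain, args=(q,), daemon=True)
                self._threads.append(t)
                t.start()
        q.put(task)

    @staticmethod
    def _drain(q):
        # Run tasks strictly in submission order until the None sentinel arrives.
        while True:
            task = q.get()
            if task is None:
                break
            task()

    def wait_all(self):
        # Send a stop sentinel to every group, then wait for the workers to finish.
        with self._lock:
            for q in self._queues.values():
                q.put(None)
            threads = list(self._threads)
        for t in threads:
            t.join()


def run_two_phase(nodes):
    """nodes: list of (op_type, name) pairs. Returns the device-setup order log."""
    log = []
    log_lock = threading.Lock()

    # Phase 1 (main thread): capture everything the later setup will need.
    # A real operator would store attributes, constant tensors, and session config.
    captured = [{"type": op_type, "name": name} for op_type, name in nodes]

    def make_setup(state):
        def setup():
            # Stand-in for kernel creation, memory allocation, and weight transfer.
            with log_lock:
                log.append((state["type"], state["name"]))
        return setup

    # Phase 2 (background workers): one group per operator type.
    pool = GroupedPool()
    for state in captured:
        pool.submit(state["type"], make_setup(state))
    pool.wait_all()
    return log
```

Calling run_two_phase with a mix of operator types runs same-type setups in submission order, while different types may interleave. Note that all of phase 1 completes before any phase 2 task is submitted, which is the separation the cache argument depends on.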
AMD said the feature remains opt-in through an environment variable; when it is disabled, initialization follows the original inline constructor path. Correctness is enforced with a one-shot barrier before the first inference, per-group sequential execution, and the unchanged fallback path. Profiling showed that the largest gain came from cache separation rather than parallelism, because keeping all model-reading work together preserves warm runtime data in cache. AMD reported up to 10× faster LLM initialization (from ~10 s to ~1 s) as measured on Qwen3-4B running on AMD Ryzen AI, with zero impact on inference correctness. Testing was performed on an HP OmniBook 7 Aero Laptop 13-bg1xxx with an AMD Ryzen AI 7 H 350 w/Radeon 860M and 32 GB of memory.
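The opt-in switch and the one-shot barrier can be illustrated with a similar sketch. Everything named here is an assumption made for the example: the source does not give the environment variable's name, so DEFERRED_INIT_DEMO is hypothetical, as are the Session class and its methods. For brevity the sketch starts one thread per setup task rather than one per operator type.

```python
import os
import threading

ENV_FLAG = "DEFERRED_INIT_DEMO"  # hypothetical name; the real variable is not given


def deferred_enabled(env):
    # Opt-in: anything other than an explicit "1" keeps the original inline path.
    return env.get(ENV_FLAG, "0") == "1"


class Session:
    def __init__(self, setups, env=None):
        env = os.environ if env is None else env
        self.deferred = deferred_enabled(env)
        self._results = []
        self._results_lock = threading.Lock()
        self._workers = []
        self._barrier_lock = threading.Lock()
        self._barrier_passed = False
        for setup in setups:
            if self.deferred:
                # Deferred path: device setup runs on a background thread.
                worker = threading.Thread(target=self._init_one, args=(setup,))
                worker.start()
                self._workers.append(worker)
            else:
                # Fallback path: setup runs inline, as in the original constructor.
                self._init_one(setup)

    def _init_one(self, setup):
        value = setup()
        with self._results_lock:
            self._results.append(value)

    def _wait_once(self):
        # One-shot barrier: the first caller joins all workers, later calls return fast.
        if self._barrier_passed:
            return
        with self._barrier_lock:
            if not self._barrier_passed:
                for worker in self._workers:
                    worker.join()
                self._workers.clear()
                self._barrier_passed = True

    def run(self):
        # No inference may proceed until every deferred setup has finished.
        self._wait_once()
        return sorted(self._results)
```

With the flag set, Session starts setup in the background and run() blocks once on the first call; with the flag unset, all setup has already finished inline by the time the constructor returns. Both paths produce the same result, which mirrors the claim that the optimization does not affect inference correctness.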
