A Reddit user demonstrated a workstation configuration that uses Intel Optane persistent memory modules as system memory to run a 1-trillion-parameter large language model locally. The setup was built around Optane PMem DIMMs purchased second-hand and configured to support Kimi K2.5 inference on a machine with a single GPU. The builder reported being able to “run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second” on a Xeon workstation.
The core of the build was 768GB of Optane (6x 128GB), a discontinued memory format originally intended to sit between DRAM and SSDs in both performance and capacity. Optane offers far lower latency than the best NVMe SSDs, but it is still two to three times slower than DRAM. That tradeoff appears workable for large language model inference, especially because the modules were acquired used at a price described as much lower than that of equivalent DRAM capacity. The system used an Intel Xeon Gold 6246 CPU, a Tyan S5630GMRE-CGN motherboard, an Asus Dual GeForce RTX 3060 OC 12GB GPU, 6x 32GB Samsung 2666MHz DDR4 ECC DRAM sticks, 6x 128GB Intel Optane DCPMM PC4-2666 NMA1XBD128GQS persistent memory modules, a Western Digital WD SN850X 2TB M.2 2280 NVMe SSD, an ASRock Steel Legend SL-850G 850W 80 PLUS GOLD & Cybenetics Platinum Fully Modular Power Supply, and a Silverstone SST-GD08B (Black) Grandia Series Home Theater PC Case.
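The capacity figures above imply a rough ceiling on how heavily the model has to be quantized to fit. The following is a back-of-the-envelope sketch, not the builder's calculation: it assumes decimal gigabytes, a round 1 trillion parameters, and no allowance for the KV cache, the operating system, or other overhead, so the real budget is tighter. The function name `max_bits_per_weight` is made up for illustration.

```python
def max_bits_per_weight(capacity_gb: float, n_params: float) -> float:
    """Upper bound on the average bits per parameter that fit in capacity_gb.

    Assumes decimal gigabytes (1 GB = 1e9 bytes) and ignores all overhead.
    """
    return capacity_gb * 1e9 * 8 / n_params


optane_gb = 6 * 128  # six 128GB Optane modules
dram_gb = 6 * 32     # six 32GB DDR4 sticks
n_params = 1e12      # assumed: exactly 1 trillion parameters

print(f"Optane capacity: {optane_gb} GB")
print(f"DRAM capacity:   {dram_gb} GB")
print(f"DRAM-to-Optane ratio: {dram_gb / optane_gb:.2f}")
print(f"Max average bits per weight in Optane: "
      f"{max_bits_per_weight(optane_gb, n_params):.2f}")
```

Under these assumptions the Optane pool allows roughly 6 bits per weight on average, which suggests a quantized model file rather than 16-bit weights.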
The machine was configured with the Optane in Memory Mode, in which the Optane modules are presented as system memory and the Samsung DDR4 acts as a cache in front of them. On the software side, the workload relied on Kimi K2.5’s mixture-of-experts design and a hybrid GPU/CPU inference approach using llama.cpp. To improve performance, the routing components were placed on the 12GB GPU with llama.cpp’s ‘override-tensor’ flag. The resulting performance was reported as ~4 tokens per second, a figure the builder characterized as a strong outcome given the hardware limitations and budget.
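The tensor placement described above can be illustrated with a small sketch. The override flag takes rules of the form pattern=buffer, where the pattern is a regular expression matched against tensor names. The rule below, the tensor names, and the `place` helper are all assumptions for illustration; the source does not give the builder's actual pattern. One way to get the reported effect is to send only the large per-expert weights to CPU memory and leave everything else, including the router, on the GPU.

```python
import re

# Hypothetical rule in "pattern=buffer" form; not the builder's actual setting.
OVERRIDE = r"ffn_(gate|up|down)_exps=CPU"


def place(tensor_name: str, override: str = OVERRIDE, default: str = "GPU") -> str:
    """Return the buffer a tensor would land in under a single override rule.

    Assumes search-style regex matching against the tensor name; tensors that
    do not match fall through to the default device.
    """
    pattern, buffer_type = override.rsplit("=", 1)
    return buffer_type if re.search(pattern, tensor_name) else default


# Illustrative names in the style of llama.cpp's GGUF naming for MoE models.
names = [
    "blk.3.attn_output.weight",    # attention projection
    "blk.3.ffn_gate_inp.weight",   # router that picks experts per token
    "blk.3.ffn_gate_exps.weight",  # stacked expert weights
    "blk.3.ffn_up_exps.weight",
    "blk.3.ffn_down_exps.weight",
]

for name in names:
    print(f"{name:30s} -> {place(name)}")
```

The design rationale is that the router and attention weights are small and used for every token, while each token touches only a few experts, so the bulk of the parameters can sit in slower memory without being read in full on every step.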
The experiment also points to a broader industry gap. There remains demand for a memory product that fits between DRAM and SSDs for large model workloads, especially local large language model inference. CXL (Compute Express Link) is presented as a likely future answer, with the potential to provide large pools of affordable, byte-addressable memory for this kind of use case.