Used Optane memory runs trillion-parameter model on one GPU

A workstation built with second-hand Intel Optane persistent memory modules was used to run Kimi K2.5 locally with a single GPU. The setup highlights renewed interest in a memory tier between DRAM and SSDs for large language model inference.

A Reddit user demonstrated a workstation configuration that uses Intel Optane persistent memory modules as system memory to run a 1-trillion-parameter large language model locally. The setup was built around used Optane PMem DIMMs and configured to support Kimi K2.5 inference on a Xeon workstation with a single GPU. The builder reported being able to “run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second”.

The core of the build was 768GB of Optane (6x 128GB), a discontinued memory format originally intended to sit between DRAM and SSDs in performance and capacity. Optane offers far lower latency than the best NVMe SSDs, but it is still two to three times slower than DRAM. That tradeoff appears workable for large language model inference, especially because the modules were acquired used at a price described as much lower than equivalent DRAM capacity.

The system used an Intel Xeon Gold 6246, a Tyan S5630GMRE-CGN motherboard, an Asus Dual GeForce RTX 3060 OC 12GB GPU, 6x 32GB Samsung 2666MHz DDR4 ECC DRAM sticks, 6x 128GB Intel Optane DCPMM PC4-2666 NMA1XBD128GQS persistent memory modules, a Western Digital WD SN850X 2TB M.2 2280 NVMe SSD, an ASRock Steel Legend SL-850G 850W 80 PLUS GOLD & Cybenetics Platinum Fully Modular Power Supply, and a Silverstone SST-GD08B (Black) Grandia Series Home Theater PC Case.
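The capacity arithmetic behind the build can be sketched in a few lines of Python. The module counts and sizes come from the parts list; the bits-per-weight values are illustrative assumptions, since the article does not say which quantization the builder used. The estimate covers weights only and ignores the KV cache, runtime overhead, and the GB/GiB distinction.

```python
# Back-of-the-envelope capacity check for the build described above.
# Module counts and sizes are from the article; bits-per-weight values
# are ASSUMPTIONS for illustration (the quantization used is not stated).

OPTANE_MODULES = 6
OPTANE_GB_EACH = 128
DRAM_MODULES = 6
DRAM_GB_EACH = 32

optane_gb = OPTANE_MODULES * OPTANE_GB_EACH  # Optane capacity
dram_gb = DRAM_MODULES * DRAM_GB_EACH        # DRAM capacity


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def fits_in_optane(params: float, bits_per_weight: float) -> bool:
    """True if the weights alone fit in the Optane capacity."""
    return weights_gb(params, bits_per_weight) <= optane_gb


if __name__ == "__main__":
    params = 1e12  # "1 trillion parameter model", as quoted in the article
    print(f"Optane: {optane_gb} GB, DRAM: {dram_gb} GB")
    for bits in (4, 8):  # hypothetical quantization levels
        size = weights_gb(params, bits)
        print(f"{bits}-bit weights: {size:.0f} GB, fits: {fits_in_optane(params, bits)}")
```

Under these assumptions, a 1-trillion-parameter model at roughly 4 bits per weight needs about 500GB for weights, which fits within the 768GB of Optane, while an 8-bit version at about 1,000GB would not.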

The machine was configured with the Optane in Memory Mode, in which the Samsung DDR4 acts as a cache in front of the Optane capacity. On the software side, the workload relied on Kimi K2.5’s mixture-of-experts design and a hybrid GPU/CPU inference approach using llama.cpp. To improve performance, the routing components were placed on the 12GB GPU with llama.cpp’s ‘override-tensor’ flag. The builder reported roughly 4 tokens per second and characterized that figure as a strong outcome given the hardware limitations and budget.
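As a rough illustration of the hybrid placement, the Python sketch below assembles a llama.cpp command line that offloads layers to the GPU while overriding the expert tensors back to system memory. The article does not give the builder's actual command, so the binary choice, the model filename, the layer count, and the `exps=CPU` pattern are all assumptions; the pattern follows the `<tensor name pattern>=<buffer type>` form that `--override-tensor` accepts.

```python
# Sketch of a llama.cpp invocation for hybrid GPU/CPU inference.
# ASSUMPTIONS (not from the article): the binary name, the model path,
# the layer count, and the override pattern are illustrative only.
import shlex


def build_llama_command(
    model_path: str,
    n_gpu_layers: int = 99,                # ask llama.cpp to offload all layers...
    override_tensor: str = "exps=CPU",     # ...but keep expert tensors in system memory
    binary: str = "llama-server",
) -> list[str]:
    """Return the argv list; nothing is executed here."""
    return [
        binary,
        "--model", model_path,
        "--n-gpu-layers", str(n_gpu_layers),
        "--override-tensor", override_tensor,
    ]


if __name__ == "__main__":
    # "kimi-k2.5.gguf" is a hypothetical filename.
    cmd = build_llama_command("kimi-k2.5.gguf")
    print(shlex.join(cmd))
```

The idea is that the small, frequently used tensors stay in the GPU's 12GB of memory, while the bulk of the mixture-of-experts weights sit in the larger Optane-backed system memory.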

The experiment also points to a broader industry gap. There remains demand for a memory product that fits between DRAM and SSDs for large model workloads, especially local large language model inference. CXL (Compute Express Link) is presented as a likely future answer, with the potential to provide large pools of affordable, byte-addressable memory for this kind of use case.


