Google engineers have developed a method to compress artificial intelligence (AI) data so that models need as little as one-sixth of the working memory to run. The new system, called TurboQuant, is designed to let models retain the same information and perform the same computations while using significantly less memory hardware.
Current AI systems rely heavily on working memory, also known as the key-value (KV) cache, which stores temporary computational results and context during active processing. A single sentence uses only a few dozen tokens, but storing the hundreds of thousands of tokens needed for more sophisticated work can require tens of gigabytes of KV-cache memory. These memory requirements scale linearly with the number of concurrent users, and ChatGPT alone is known to receive billions of requests every day.
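A rough back-of-envelope calculation shows where those gigabytes come from. The sketch below assumes Llama 3.1-8B-like dimensions (32 layers, 8 KV heads of dimension 128, 16-bit values); the exact figures are assumptions and vary by model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    # Each token stores one key and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Assumed Llama 3.1-8B-like dimensions (illustrative; check the model config).
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=1)
long_context = kv_cache_bytes(32, 8, 128, n_tokens=131_072)

print(f"{per_token / 1024:.0f} KiB per token")              # ~128 KiB
print(f"{long_context / 1024**3:.1f} GiB at 128k tokens")   # ~16 GiB
```

Under those assumptions, a single 128,000-token conversation occupies roughly 16 GiB before any compression, and every concurrent user adds another copy.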
TurboQuant reduces the amount of working memory needed through quantization, a process that represents values with fewer bits. Google has used quantization on neural networks for years, but typically in a static form applied once before runtime. TurboQuant instead compresses the KV cache in real time, keeping the cached data accurate and current as the model generates output. In tests on Meta’s Llama 3.1-8B, Google’s Gemma and models from Mistral AI, Google said the system showed promise for reducing KV-cache bottlenecks without sacrificing model performance.
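To see what quantization does in general (this is a generic textbook example, not Google’s TurboQuant algorithm), the sketch below rounds 32-bit floats down to 4-bit integer codes plus a single scale factor:

```python
import numpy as np

def quantize(x, bits=4):
    # Symmetric uniform quantization: map floats to signed integer codes
    # in [-(2**(bits-1) - 1), 2**(bits-1) - 1], storing one float scale.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.round(x / scale).astype(np.int8)  # 4-bit codes held in int8 here
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(8).astype(np.float32)
q, s = quantize(x, bits=4)
print("original:", np.round(x, 3))
print("restored:", np.round(dequantize(q, s), 3))
# 4-bit codes take a quarter of the space of 16-bit floats,
# at the cost of the rounding error visible in the output.
```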
Google says TurboQuant could shrink the KV cache by a factor of at least six using two methods: PolarQuant and Quantized Johnson-Lindenstrauss (QJL). PolarQuant re-expresses the cached data from Cartesian coordinates into polar coordinates, so vector angles align more consistently and can be compressed into fewer bits with less scaling information. The vectors then pass through QJL, which makes slight adjustments to correct the computational errors that quantization introduces.
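The sketch below is a toy two-dimensional illustration of the polar re-expression idea, not the published PolarQuant algorithm: it converts coordinate pairs to (radius, angle) form and stores each angle as a 3-bit code:

```python
import numpy as np

def polar_quantize(pairs, angle_bits=3):
    # Convert each (x, y) pair to polar form and quantize the angle
    # to 2**angle_bits evenly spaced directions, keeping the radius.
    x, y = pairs[..., 0], pairs[..., 1]
    r = np.hypot(x, y)                 # magnitude
    theta = np.arctan2(y, x)           # angle in [-pi, pi]
    levels = 2 ** angle_bits
    step = 2 * np.pi / levels
    q_theta = (np.round(theta / step) % levels).astype(np.uint8)
    return r, q_theta, step

def polar_dequantize(r, q_theta, step):
    theta = q_theta * step
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)

pairs = np.random.randn(4, 2).astype(np.float32)
r, q, step = polar_quantize(pairs)
print(np.round(pairs, 3))
print(np.round(polar_dequantize(r, q, step), 3))
```

The appeal of the polar form is that the angle is bounded and uniformly distributed around the circle, so it quantizes well with a fixed grid, whereas raw Cartesian coordinates need per-block scaling information.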
The advance drew attention beyond Google. In a post on X, Cloudflare CEO Matthew Prince called it “Google’s DeepSeek,” and Google’s March 24 announcement of TurboQuant sent shares of memory companies such as SanDisk, Western Digital and Seagate plummeting. Even so, the technology remains at the lab stage and has not been widely deployed in production models.
Its practical effect may also be narrower than the headline memory savings suggest. The method compresses only the working memory used during inference, when a model generates a response; training a model typically requires up to four times more memory than that, so the total impact on overall memory demand will be limited. Merrill Lynch analyst Vivek Arya wrote that the “6x improvement in memory efficiency” would likely support a “6x increase in accuracy (model size) and/or context length (KV cache allocation), rather than 6x decrease in memory,” a trade-off illustrated in the sketch below.

Google formally presented TurboQuant at ICLR 2026, held April 23-27 in Rio de Janeiro, and plans to present PolarQuant and QJL at AISTATS 2026 in Tangier, Morocco, in early May.
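As a purely back-of-envelope illustration of Arya’s point (every memory figure here is hypothetical, not from Google or Merrill Lynch):

```python
# Hypothetical serving budget: all numbers are illustrative only.
total_gb = 100.0
kv_cache_gb = 20.0                        # inference working memory (KV cache)
other_gb = total_gb - kv_cache_gb         # weights, activations, etc.

# Option A: take the claimed 6x KV-cache reduction as a memory saving.
new_total_gb = other_gb + kv_cache_gb / 6
print(f"total memory: {total_gb:.0f} GB -> {new_total_gb:.1f} GB "
      f"({100 * (1 - new_total_gb / total_gb):.0f}% saved)")

# Option B: keep the same 20 GB KV budget and fit 6x more context instead.
tokens_before = 128_000
print(f"context at fixed budget: {tokens_before:,} -> {6 * tokens_before:,} tokens")
```

In this made-up budget, a sixfold KV-cache reduction trims total memory by only about 17 percent, which is why operators may instead spend the savings on larger models or longer context windows.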
