Google engineers have developed a method to compress artificial intelligence (AI) data so that models need as little as one-sixth of the working memory to run. The new system, called TurboQuant, is designed to let models retain the same information and perform the same computations while using significantly less memory hardware.
Current AI systems rely heavily on working memory, also known as the key-value (KV) cache, which stores temporary computational results and context during active processing. A single sentence uses only a few dozen tokens, but storing the hundreds of thousands of tokens needed for more sophisticated work can require tens of gigabytes of KV-cache memory. These memory requirements scale linearly with the number of concurrent users, and ChatGPT alone is known to receive billions of requests every day.
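A rough back-of-envelope calculation shows where those gigabytes come from. The sketch below assumes Llama 3.1-8B-like dimensions (32 layers, 8 KV heads of dimension 128, 16-bit values); the exact figures are assumptions and vary by model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    # Each token stores one key and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Assumed Llama 3.1-8B-like dimensions (illustrative; check the model config).
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=1)
long_context = kv_cache_bytes(32, 8, 128, n_tokens=131_072)

print(f"{per_token / 1024:.0f} KiB per token")              # ~128 KiB
print(f"{long_context / 1024**3:.1f} GiB at 128k tokens")   # ~16 GiB
```

Under those assumptions, a single 128,000-token conversation occupies roughly 16 GiB before any compression, and every concurrent user adds another copy.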
TurboQuant reduces the amount of working memory needed through quantization, a process that represents values with fewer bits. Google has used quantization on neural networks for years, but typically in a static form applied once before runtime. TurboQuant instead compresses the KV cache in real time, keeping the cached data accurate and current as the model generates output. In tests on Meta’s Llama 3.1-8B, Google’s Gemma and models from Mistral AI, Google said the system showed promise for reducing KV-cache bottlenecks without sacrificing model performance.
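To see what quantization does in general (this is a generic textbook example, not Google’s TurboQuant algorithm), the sketch below rounds 32-bit floats down to 4-bit integer codes plus a single scale factor:

```python
import numpy as np

def quantize(x, bits=4):
    # Symmetric uniform quantization: map floats to signed integer codes
    # in [-(2**(bits-1) - 1), 2**(bits-1) - 1], storing one float scale.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.round(x / scale).astype(np.int8)  # 4-bit codes held in int8 here
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(8).astype(np.float32)
q, s = quantize(x, bits=4)
print("original:", np.round(x, 3))
print("restored:", np.round(dequantize(q, s), 3))
# 4-bit codes take a quarter of the space of 16-bit floats,
# at the cost of the rounding error visible in the output.
```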
Google says TurboQuant could shrink the KV cache by a factor of at least six using two methods: PolarQuant and Quantized Johnson-Lindenstrauss (QJL). PolarQuant re-expresses the cached data from Cartesian coordinates into polar coordinates, so vector angles align more consistently and can be compressed into fewer bits with less scaling information. The vectors then pass through QJL, which makes slight adjustments to correct the computational errors that quantization introduces.
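The sketch below is a toy two-dimensional illustration of the polar re-expression idea, not the published PolarQuant algorithm: it converts coordinate pairs to (radius, angle) form and stores each angle as a 3-bit code:

```python
import numpy as np

def polar_quantize(pairs, angle_bits=3):
    # Convert each (x, y) pair to polar form and quantize the angle
    # to 2**angle_bits evenly spaced directions, keeping the radius.
    x, y = pairs[..., 0], pairs[..., 1]
    r = np.hypot(x, y)                 # magnitude
    theta = np.arctan2(y, x)           # angle in [-pi, pi]
    levels = 2 ** angle_bits
    step = 2 * np.pi / levels
    q_theta = (np.round(theta / step) % levels).astype(np.uint8)
    return r, q_theta, step

def polar_dequantize(r, q_theta, step):
    theta = q_theta * step
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)

pairs = np.random.randn(4, 2).astype(np.float32)
r, q, step = polar_quantize(pairs)
print(np.round(pairs, 3))
print(np.round(polar_dequantize(r, q, step), 3))
```

The appeal of the polar form is that the angle is bounded and uniformly distributed around the circle, so it quantizes well with a fixed grid, whereas raw Cartesian coordinates need per-block scaling information.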
The advance drew attention beyond Google. In a post on X, Cloudflare CEO Matthew Prince called it “Google’s DeepSeek,” and Google’s March 24 announcement of TurboQuant sent shares of memory companies such as SanDisk, Western Digital and Seagate plummeting. Even so, the technology remains at the lab stage and has not been widely deployed in production models.
Its practical effect may also be narrower than the headline memory savings suggest. The method compresses only the working memory used during inference, when a model generates a response; training a model typically requires up to four times more memory than that, so the total impact on overall memory demand will be limited. Merrill Lynch analyst Vivek Arya wrote that the “6x improvement in memory efficiency” would likely support a “6x increase in accuracy (model size) and/or context length (KV cache allocation), rather than 6x decrease in memory,” a trade-off illustrated in the sketch below.

Google formally presented TurboQuant at ICLR 2026, held April 23-27 in Rio de Janeiro, and plans to present PolarQuant and QJL at AISTATS 2026 in Tangier, Morocco, in early May.
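As a purely back-of-envelope illustration of Arya’s point (every memory figure here is hypothetical, not from Google or Merrill Lynch):

```python
# Hypothetical serving budget: all numbers are illustrative only.
total_gb = 100.0
kv_cache_gb = 20.0                        # inference working memory (KV cache)
other_gb = total_gb - kv_cache_gb         # weights, activations, etc.

# Option A: take the claimed 6x KV-cache reduction as a memory saving.
new_total_gb = other_gb + kv_cache_gb / 6
print(f"total memory: {total_gb:.0f} GB -> {new_total_gb:.1f} GB "
      f"({100 * (1 - new_total_gb / total_gb):.0f}% saved)")

# Option B: keep the same 20 GB KV budget and fit 6x more context instead.
tokens_before = 128_000
print(f"context at fixed budget: {tokens_before:,} -> {6 * tokens_before:,} tokens")
```

In this made-up budget, a sixfold KV-cache reduction trims total memory by only about 17 percent, which is why operators may instead spend the savings on larger models or longer context windows.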
