The week centered on practical advances in AI infrastructure rather than new benchmark-driven capability leaps. The dominant theme was efficiency across the stack, especially in inference, where memory and latency increasingly determine what can be deployed at scale. Three releases stood out for lowering key technical constraints: Google Research’s TurboQuant for KV cache compression, Google’s Gemini 3.1 Flash Live for native audio interaction, and Mistral’s Voxtral TTS for low-latency, on-device speech generation.
TurboQuant addressed the growing cost of long-context inference by compressing the KV cache, which grows linearly with context length and can become the dominant consumer of GPU memory. Google Research reported 3-bit KV cache compression with zero measurable accuracy loss, a 6x memory reduction, and up to an 8x speedup on H100s. The method combines PolarQuant, which converts KV vectors from Cartesian to polar coordinates, with QJL, which applies a Johnson-Lindenstrauss transform and keeps only the sign bit of each projected coordinate while still preserving accurate attention scores. The framing was as notable as the numbers: TurboQuant’s error is described as approaching the Shannon lower bound, suggesting compression alone may be nearing its practical ceiling and that future gains will need to come from new architectures, sparse attention, or better eviction strategies.
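The sign-bit idea behind QJL can be illustrated with a simhash-style sketch. This is a deliberate simplification, not the published method: real QJL keeps queries in full precision and combines with the polar representation, whereas the toy below quantizes both vectors and recovers cosine similarity from the classic fact that, under a Gaussian random projection, two vectors' projected signs agree with probability 1 − θ/π:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096  # head dimension, number of random projections

# Shared Gaussian Johnson-Lindenstrauss projection matrix.
S = rng.standard_normal((m, d))

def sketch(v):
    """1-bit-per-projection sketch: keep only the sign of each projection."""
    return np.sign(S @ v)

def est_cos(q, k):
    """Estimate cos(angle(q, k)) from the fraction of agreeing sign bits,
    using P[signs agree] = 1 - theta/pi for Gaussian projections."""
    agree = np.mean(sketch(q) == sketch(k))
    return np.cos(np.pi * (1.0 - agree))

q = rng.standard_normal(d)
k = rng.standard_normal(d)
true_cos = q @ k / (np.linalg.norm(q) * np.linalg.norm(k))
print(f"true cos: {true_cos:.3f}  sign-bit estimate: {est_cos(q, k):.3f}")
```

Each 64-dim float vector shrinks to 4096 bits here purely for estimation accuracy; the point is only that sign bits of random projections preserve angles, which is what lets attention scores survive aggressive quantization.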
Voice systems, meanwhile, moved in two distinct directions. Google shipped Gemini 3.1 Flash Live as a native audio model that replaces the older multi-stage pipeline of VAD, STT, LLM, and TTS with a single system that processes raw PCM bidirectionally. It supports barge-in mid-sentence, covers over 90 languages in real time, and scored 36.1% on Scale AI’s Audio MultiChallenge. Search Live is now rolling out on this model in 200+ countries, a broad deployment of a voice architecture built for interruption and conversational continuity.
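The latency case for collapsing the cascade is simple additivity: sequential stages sum, while a native model's floor is a single time-to-first-token. A toy illustration with invented stage latencies (none of these numbers come from Google's release):

```python
# Hypothetical per-stage latencies for a cascaded voice stack, in ms.
# Values are illustrative only.
cascade_ms = {"VAD": 50, "STT": 250, "LLM": 400, "TTS": 150}

# Stages run sequentially, so time-to-first-audio is at least their sum.
cascade_first_audio = sum(cascade_ms.values())

# A native audio model maps raw PCM to audio output in one pass,
# so its floor is one model's time-to-first-token (also illustrative).
native_first_audio = 400

print(f"cascade: {cascade_first_audio} ms  native: {native_first_audio} ms")
```

The cascade also complicates barge-in: interrupting mid-TTS requires coordinating four components, whereas a single bidirectional model can react to incoming audio while still generating.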
Mistral’s Voxtral TTS took a different approach focused on portability and control. Voxtral TTS is 4B parameters, built on Ministral 3B, runs on a smartphone, clones a voice from under five seconds of audio, and ships open weights under a Creative Commons license. Time-to-first-audio is 90ms. The enterprise appeal was framed less around higher-quality voice output and more around data sovereignty, especially for regulated industries that want speech systems deployed on their own hardware without sending audio outside the datacenter.
Other research highlighted self-improving agents, agent institutions, multimodal neuroscience models, financial tool-use benchmarks, and compact world models. Product releases included Anthropic’s research preview of computer-use capabilities for Claude Code and Claude Work. In funding and industry news: Deccan AI raised a $25M Series A; Harvey closed a $200M round at an $11B valuation; Granola raised $125M at a $1.5B valuation; Kleiner Perkins raised $3.5B across two funds; Doss raised a $55M Series B; Air Street Capital closed a $232M Fund III; and SoftBank confirmed a $40B unsecured bridge loan maturing in March 2027 to fund further investments in OpenAI and general corporate purposes. Meta also increased its El Paso data center investment from $1.5B to over $10B.
