New LLM architectures target long-context efficiency

May 18, 2026

Recent open-weight language models are adding targeted architectural changes to cut the cost of long-context inference. Key ideas include cross-layer KV sharing, per-layer embeddings, compressed attention, and wider residual pathways.

Open-weight large language model design is increasingly centered on long-context efficiency. As reasoning systems and agent workflows keep more tokens active for longer periods, KV-cache size, memory traffic, and attention cost have become major bottlenecks. Recent model releases focus on reducing those costs through targeted changes inside transformer blocks, residual pathways, and attention mechanisms rather than replacing the core decoder-only transformer design.

Google’s Gemma 4 highlights two of these efficiency tactics. The E2B and E4B variants use cross-layer KV sharing, where later layers reuse key and value tensors from earlier layers of the same attention type instead of recomputing them. Gemma 4 E2B has 35 transformer layers, but only the first 15 compute their own KV projections; the final 20 layers reuse KV tensors from the most recent earlier non-shared layer of the same attention type. Gemma 4 E4B has 42 layers, with 24 layers computing their own KV and the final 18 layers sharing them. Because roughly half of the layers reuse KV tensors rather than storing their own, the design cuts KV cache size by approximately half. For the smallest model, E2B, this results in a 2.7 GB saving (at bfloat16 precision) at a 128K context. For the E4B variant, the saving is about 6 GB at 128K.

Gemma 4 also adds per-layer embeddings, which increase token-specific representational capacity without scaling the full transformer stack. Gemma 4 E2B is listed as 2.3B effective parameters, or 5.1B parameters when the embeddings are counted. Gemma 4 E4B is listed as 4.5B effective parameters, or 8B parameters with embeddings.
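The cross-layer KV sharing rule and the layer counts quoted above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Gemma 4's implementation: the helper names `kv_source_layers` and `shared_fraction` are hypothetical, the six-layer toy stack and its layer-type pattern are made up, and the fraction treats every layer's cache as equal in size, which may not hold when sliding-window and global layers cache different numbers of tokens.

```python
def kv_source_layers(layer_types, num_own_kv):
    """Map each layer index to the layer whose K/V tensors it reads.

    layer_types: attention type per layer, e.g. "sliding" or "global"
                 (labels are assumed for this sketch).
    num_own_kv:  the first `num_own_kv` layers compute their own K/V.
    Later layers reuse K/V from the most recent earlier non-shared layer
    of the same attention type.
    """
    last_own = {}  # attention type -> latest layer index with its own K/V
    sources = []
    for idx, kind in enumerate(layer_types):
        if idx < num_own_kv:
            last_own[kind] = idx
            sources.append(idx)
        else:
            if kind not in last_own:
                raise ValueError(f"no earlier non-shared {kind!r} layer for layer {idx}")
            sources.append(last_own[kind])
    return sources


def shared_fraction(num_layers, num_own_kv):
    """Fraction of layers that store no K/V of their own."""
    return (num_layers - num_own_kv) / num_layers


# Layer counts quoted in the article.
e2b = shared_fraction(35, 15)  # 20 of 35 layers reuse K/V
e4b = shared_fraction(42, 24)  # 18 of 42 layers reuse K/V
print(f"E2B: {e2b:.1%} of layers reuse K/V")
print(f"E4B: {e4b:.1%} of layers reuse K/V")

# Hypothetical toy stack: the pattern below is NOT Gemma 4's real layout.
toy_types = ["sliding", "sliding", "global", "sliding", "sliding", "global"]
toy_sources = kv_source_layers(toy_types, num_own_kv=3)
print(toy_sources)
```

In the toy stack, the last three layers read from layers 1, 1, and 2: the latest sliding-window and global layers that still compute their own projections. The layer fractions work out to about 57% for E2B and 43% for E4B, which brackets the "roughly half" figure.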

Poolside’s Laguna XS.2 takes a different route by varying the attention budget by layer. It has 40 layers in total: 30 sliding-window attention layers and 10 global (full) attention layers. The sliding-window layers attend over 512 tokens. The model also varies query-head counts per layer while keeping KV heads fixed at 8, which lets the architecture give more query heads to the cheaper sliding-window layers and fewer to the more expensive full-attention layers. The result is a more selective use of compute across the stack.
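To see why the sliding-window layers are the cheaper ones, the sketch below counts how many key positions a single new token attends to across the stack, using the 30/10 layer split and 512-token window quoted above. The function names and the context lengths in the loop are assumptions chosen for illustration, and head counts are deliberately left out, so this counts positions per layer rather than FLOPs.

```python
def attended_positions(context_len, num_sliding=30, num_global=10, window=512):
    """Key positions one new query token attends to, summed over layers.

    Sliding-window layers see at most `window` tokens; global layers see
    the whole context.
    """
    sliding = num_sliding * min(context_len, window)
    full = num_global * context_len
    return sliding + full


def all_global_positions(context_len, num_layers=40):
    """Same count if every layer used full attention (hypothetical baseline)."""
    return num_layers * context_len


# Context lengths below are illustrative, not Laguna XS.2 specifications.
for n in (512, 8_192, 131_072):
    mixed = attended_positions(n)
    dense = all_global_positions(n)
    print(f"context={n:>7}  mixed={mixed:>9}  dense={dense:>9}  ratio={mixed / dense:.3f}")
```

Once the context exceeds the window, the sliding-window term stops growing, so the ratio falls toward the share of global layers (10 of 40) as the context gets longer.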

Zyphra’s ZAYA1-8B focuses on reducing attention cost directly with Compressed Convolutional Attention. Instead of only compressing cached representations, the model performs attention in a compressed latent space and then up-projects the result. It combines this with convolutional mixing on compressed queries and keys to preserve local context.

DeepSeek V4 pushes efficiency further with two distinct changes. Its manifold-constrained hyper-connections widen the residual pathway using multiple interacting residual streams, and the team reports that an optimized implementation with 4 residual streams (n = 4) in every transformer block adds only 6.7% training-time overhead compared with the single-stream baseline. On the attention side, DeepSeek V4 alternates between Compressed Sparse Attention and Heavily Compressed Attention (HCA), both of which compress along the sequence dimension. HCA compresses every 128 tokens into one compressed KV entry. At a 1M-token context length, DeepSeek V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache size compared with DeepSeek V3.2. DeepSeek V4-Flash is lower still, at 10% of the FLOPs and 7% of the KV cache size relative to DeepSeek V3.2.
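The 128-to-1 sequence compression described for HCA can be made concrete with a small sketch. Everything here is an assumption for illustration: the function names are hypothetical, the trailing partial block is assumed to occupy one entry, and mean pooling is only a stand-in for whatever learned compression DeepSeek V4 actually uses.

```python
import math


def compressed_entries(context_len, block=128):
    """KV entries left after folding every `block` tokens into one entry.

    Assumes a trailing partial block still occupies one entry.
    """
    return math.ceil(context_len / block)


def mean_pool_blocks(values, block=128):
    """Toy stand-in for sequence compression: average each run of `block` items.

    The real model presumably learns its compression; mean pooling is used
    here only to show the shape change along the sequence dimension.
    """
    pooled = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        pooled.append(sum(chunk) / len(chunk))
    return pooled


# 1M is taken as 1,000,000 tokens here; the article does not say which convention applies.
entries_1m = compressed_entries(1_000_000)
print(f"1M tokens -> {entries_1m} compressed KV entries")

# Tiny example with block=4 so the result is easy to check by hand.
toy = mean_pool_blocks([0, 1, 2, 3, 4, 5, 6, 7], block=4)
print(toy)
```

Under these assumptions, a layer that would otherwise hold a million cached positions holds fewer than eight thousand compressed entries, which shows how sequence-dimension compression can shrink per-layer KV storage by orders of magnitude.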


Latest News

ArXiv tightens rules on Artificial Intelligence generated papers

May 18, 2026

ArXiv is escalating enforcement against careless use of large language models in scientific submissions. Authors who submit papers showing clear signs of unchecked model output can face a 1-year ban and stricter conditions for future postings.

Simple Artificial Intelligence recommendations for small business growth

May 17, 2026

Research from the University of Warwick and Nanyang Technological University, Singapore, examines how small and medium sized enterprises can use simpler Artificial Intelligence recommendation systems without large datasets or costly infrastructure. Findings from a field experiment suggest low data approaches can still increase customer engagement and spending.

Quantexa wins HMRC data modernisation contract

May 17, 2026

Quantexa has secured a £175 million, 10-year contract from HM Revenue & Customs to modernise the tax authority’s data infrastructure and support governed use of Artificial Intelligence across core operations. The deal positions the London-founded company at the centre of a major UK public sector data transformation programme.

EU Artificial Intelligence Act delay gives HR more time to prepare

May 17, 2026

The European Union has pushed back compliance deadlines for high-risk Artificial Intelligence systems, giving HR teams more time to prepare for rules that still carry broad reach beyond Europe. Experts say the delay should be treated as a chance to strengthen governance, data practices, and cross-functional accountability rather than slow down.

UK falling behind on Artificial Intelligence adoption

May 17, 2026

New research indicates the UK is losing ground on Artificial Intelligence adoption as many businesses fail to move beyond early experimentation. More than half remain stuck in the pilot phase, pointing to slow deployment across the market.
