Open-weight large language model design is increasingly centered on long-context efficiency. As reasoning systems and agent workflows keep more tokens active for longer periods, KV-cache size, memory traffic, and attention cost have become major bottlenecks. Recent model releases focus on reducing those costs through targeted changes inside transformer blocks, residual pathways, and attention mechanisms rather than replacing the core decoder-only transformer design.
Google’s Gemma 4 highlights two of these efficiency tactics. The E2B and E4B variants use cross-layer KV sharing, where later layers reuse key and value tensors from earlier layers of the same attention type instead of computing and caching their own. Gemma 4 E2B has 35 transformer layers, but only the first 15 compute their own KV projections; the final 20 layers reuse KV tensors from the most recent earlier non-shared layer of the same attention type. Gemma 4 E4B has 42 layers, with 24 layers computing their own KV tensors and the final 18 layers reusing them. Because 20 of 35 layers in E2B and 18 of 42 in E4B store no KV tensors of their own, the design cuts KV-cache size roughly in half. For E2B, the smaller of the two models, this saves 2.7 GB (at bfloat16 precision) at a 128K context; for E4B, it saves about 6 GB at 128K. Gemma 4 also adds per-layer embeddings, which increase token-specific representational capacity without scaling the full transformer stack. Gemma 4 E2B is listed as 2.3B effective parameters, or 5.1B parameters when the embeddings are counted. Gemma 4 E4B is listed as 4.5B effective parameters, or 8B parameters with embeddings.
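The sharing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kv_source_layers` and the repeating pattern of four sliding-window layers followed by one global layer are hypothetical and are not taken from the Gemma 4 release. Only the layer counts (35 layers, the first 15 computing their own KV) come from the text.

```python
def kv_source_layers(layer_types, num_own_kv):
    """Map each layer index to the layer whose K/V tensors it reads.

    The first `num_own_kv` layers compute and cache their own K/V.
    Every later layer reuses K/V from the most recent earlier
    non-shared layer of the same attention type.
    """
    last_owner = {}   # attention type -> index of latest non-shared layer
    sources = []
    for idx, kind in enumerate(layer_types):
        if idx < num_own_kv:
            last_owner[kind] = idx
            sources.append(idx)
        else:
            sources.append(last_owner[kind])
    return sources


def owned_fraction(num_layers, num_own_kv):
    """Fraction of layers that still hold their own KV-cache entries."""
    return num_own_kv / num_layers


# ASSUMPTION: a hypothetical layer layout, four sliding-window layers
# followed by one global layer, repeated across the stack.
pattern = ["sliding"] * 4 + ["global"]
e2b_types = [pattern[i % len(pattern)] for i in range(35)]

# Layer counts from the text: 35 layers, the first 15 own their KV.
e2b_sources = kv_source_layers(e2b_types, num_own_kv=15)

print(e2b_sources)
print(f"E2B layers with their own KV: {owned_fraction(35, 15):.0%}")
print(f"E4B layers with their own KV: {owned_fraction(42, 24):.0%}")
```

Under this assumed layout, every shared sliding-window layer reads from the last non-shared sliding-window layer, and every shared global layer reads from the last non-shared global layer. The actual cache saving also depends on how much each attention type caches per layer, so the layer fraction is only a rough guide.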
Poolside’s Laguna XS.2 takes a different route by varying attention budget by layer. It has 40 layers total, with 30 sliding-window attention layers and 10 global/full attention layers. The sliding-window layers attend over 512 tokens. The model also varies the number of query heads per layer while keeping KV heads fixed at 8. That lets the architecture allocate more query heads to the cheaper sliding-window layers and fewer to the more expensive full-attention layers. The result is a more selective use of compute across the stack.
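To see why the sliding-window layers are the cheaper ones, a back-of-the-envelope count of cached tokens helps. The sketch below assumes a sliding-window layer keeps only the most recent 512 tokens in its KV cache while a global layer keeps the full context. The head dimension of 128 and the 2-byte (bfloat16) storage are illustrative assumptions, not figures from the Laguna XS.2 release, and the function names are hypothetical.

```python
def cached_tokens_per_layer(context_len, window=512, n_sliding=30, n_global=10):
    """Tokens held in the KV cache by each layer (layer order is irrelevant here).

    ASSUMPTION: sliding-window layers cache at most `window` tokens,
    global layers cache the whole context.
    """
    sliding = min(context_len, window)
    return [sliding] * n_sliding + [context_len] * n_global


def kv_cache_bytes(context_len, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in bytes for one sequence.

    kv_heads=8 comes from the text; head_dim and bytes_per_value are
    assumed values chosen only for illustration.
    """
    tokens = sum(cached_tokens_per_layer(context_len))
    # Factor 2 accounts for storing both keys and values.
    return 2 * tokens * kv_heads * head_dim * bytes_per_value


context = 131_072
mixed = sum(cached_tokens_per_layer(context))
all_global = 40 * context
print(f"cached tokens, 30 sliding + 10 global: {mixed:,}")
print(f"cached tokens, 40 global layers:       {all_global:,}")
print(f"ratio: {mixed / all_global:.2f}")
print(f"approx. cache size: {kv_cache_bytes(context) / 2**30:.2f} GiB")
```

Because the sliding-window term stops growing once the context exceeds 512 tokens, almost all of the long-context cache cost comes from the 10 global layers, which is consistent with spending fewer query heads there.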
Zyphra’s ZAYA1-8B focuses on reducing attention cost directly with Compressed Convolutional Attention. Instead of only compressing cached representations, the model performs attention in a compressed latent space and then up-projects the result. It combines this with convolutional mixing on compressed queries and keys to preserve local context.
DeepSeek V4 pushes efficiency further with two distinct changes. Its manifold-constrained hyper-connections widen the residual pathway using multiple interacting residual streams, and the team reports that an optimized implementation with 4 residual streams (n = 4) in every transformer block adds only 6.7% training time overhead compared to the single-stream baseline. On the attention side, DeepSeek V4 alternates between Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), both of which compress along the sequence dimension. HCA compresses every 128 tokens into one compressed KV entry. At a 1M-token context length, DeepSeek V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache size compared with DeepSeek V3.2. DeepSeek V4-Flash goes lower still, at 10% of the FLOPs and 7% of the KV cache size relative to DeepSeek V3.2.
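The effect of compressing along the sequence dimension is easy to quantify. The sketch below counts how many compressed KV entries a 128-to-1 scheme leaves at long context, and uses plain block averaging as a stand-in compressor. The averaging is an assumption made purely for illustration and is not DeepSeek's compression function. The text also does not say whether "1M" means 1,000,000 or 1,048,576 tokens, so both are shown.

```python
def compressed_entries(context_len, block=128):
    """Number of compressed KV entries when every `block` tokens become one.

    A trailing partial block is counted as one entry (ceiling division).
    """
    return -(-context_len // block)


def block_mean_compress(vectors, block=128):
    """Stand-in compressor: average each run of `block` vectors into one.

    ASSUMPTION: mean pooling is used here only to make the shape change
    concrete; the real model's compression is not described in the text.
    """
    out = []
    for start in range(0, len(vectors), block):
        chunk = vectors[start:start + block]
        dim = len(chunk[0])
        out.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return out


for tokens in (1_000_000, 1_048_576):
    print(f"{tokens:,} tokens -> {compressed_entries(tokens):,} compressed entries")

# Toy example: 256 two-dimensional "key" vectors collapse to 2 entries.
toy_keys = [[float(i), 1.0] for i in range(256)]
toy_compressed = block_mean_compress(toy_keys)
print(len(toy_compressed), toy_compressed[0])
```

Whatever the compression function, each query in such a layer scores against thousands of entries rather than a million, which is the kind of reduction that makes the reported FLOP and cache savings plausible.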
