Recent developments in LLM architectures: KV sharing, mHC, and compressed attention

Sebastian Raschka · Ahead of AI · May 16, 2026 · 27 min read

Commentary by Hex Index staff

Sebastian Raschka has shifted the spotlight from the hype of model size to the quiet, critical engineering that actually enables long-context reasoning. While the industry chases parameter counts, Raschka argues that the real bottleneck is no longer intelligence, but memory traffic and the physical limits of the KV cache. This piece is notable because it dissects specific, recent architectural shifts in open-weight models like Gemma 4 and DeepSeek V4, revealing how developers are fundamentally rewriting the transformer block to make agents and reasoning models viable on consumer hardware.

The Memory Bottleneck

The core of Raschka's argument is that efficiency is now the primary driver of innovation. "As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints," he writes. This observation reframes the entire landscape: the next leap in capability won't come from simply adding more data, but from smarter ways to store and retrieve context.

Raschka highlights Google's Gemma 4 as a pivotal case study, specifically its use of "cross-layer attention." In a standard transformer, every layer computes its own key and value projections, which is computationally expensive. Raschka explains that Gemma 4 changes this by having later layers "reuse key-value states from earlier layers to reduce long-context memory and compute." He notes that while the concept isn't new, citing a 2024 NeurIPS paper by Brandon et al., "it's the first popular architecture where I saw this concept applied."
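The idea of later layers reusing earlier key-value states can be sketched in a few lines. This is a toy illustration, not Gemma 4's actual layout: the layer count, the split point, and the rule that every sharing layer reads from the last layer that computed its own K/V are all assumptions, and the names `build_kv_source_map` and `run_decoder_step` are hypothetical.

```python
# Sketch only: the layer count and the sharing pattern below are
# illustrative assumptions, not Gemma 4's actual configuration.

def build_kv_source_map(num_layers: int, first_shared_layer: int) -> list[int]:
    """Return, for each layer, the index of the layer whose K/V it reads.

    Layers before `first_shared_layer` compute their own K/V. Later layers
    reuse the K/V of the last layer that computed its own.
    """
    if not 0 < first_shared_layer <= num_layers:
        raise ValueError("first_shared_layer must be in 1..num_layers")
    last_owner = first_shared_layer - 1
    return [i if i < first_shared_layer else last_owner for i in range(num_layers)]


def run_decoder_step(kv_source: list[int]) -> dict[int, str]:
    """Simulate one decoding step and return the KV entries that get cached."""
    cache: dict[int, str] = {}
    for layer, source in enumerate(kv_source):
        if source == layer:
            # This layer projects its own keys and values and caches them.
            cache[layer] = f"kv_from_layer_{layer}"
        # Either way, attention in this layer reads the entry at cache[source].
        _ = cache[source]
    return cache


kv_source = build_kv_source_map(num_layers=8, first_shared_layer=4)
cache = run_decoder_step(kv_source)
print(kv_source)   # [0, 1, 2, 3, 3, 3, 3, 3]
print(len(cache))  # 4 cached KV entries instead of 8
```

Every layer still runs its own query projection and attention; only the key and value projections, and the cache entries they would produce, are skipped in the sharing layers.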

The impact is tangible. By sharing roughly half of the key-value tensors across layers, the model achieves significant memory savings without a proportional drop in quality. Raschka illustrates this with hard numbers: for the smallest Gemma 4 variant, this results in a "2.7 GB saving (at bfloat16 precision) in long 128K contexts." This matters for anyone deploying models locally, where a few gigabytes of cache can decide whether a long context fits in a laptop's memory at all.
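To see why halving the number of caching layers matters at long context, it helps to write out the standard KV-cache size formula. The configuration below is hypothetical (it is not Gemma 4's real layer count, head count, or head size, and it ignores sliding-window layers, which cache only their window), so it does not reproduce the 2.7 GB figure; it only shows how the saving scales.

```python
def kv_cache_bytes(num_kv_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 = one key tensor plus one value tensor per caching layer.
    # bytes_per_value=2 corresponds to bfloat16.
    return 2 * num_kv_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
SEQ_LEN = 128 * 1024  # 128K tokens

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
HYPOTHETICAL = dict(num_kv_heads=2, head_dim=256, seq_len=SEQ_LEN)

no_sharing = kv_cache_bytes(num_kv_layers=32, **HYPOTHETICAL)
half_shared = kv_cache_bytes(num_kv_layers=16, **HYPOTHETICAL)  # half the layers reuse K/V

print(f"no sharing:  {no_sharing / GIB:.1f} GiB")                  # 8.0 GiB
print(f"half shared: {half_shared / GIB:.1f} GiB")                 # 4.0 GiB
print(f"saved:       {(no_sharing - half_shared) / GIB:.1f} GiB")  # 4.0 GiB
```

Because the cache grows linearly with both sequence length and the number of layers that store K/V, removing half of the caching layers removes half of the cache at any context length.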

"Most of these changes look like small tweaks in my architecture diagrams, but some of them are quite intricate design changes that are worth a more detailed discussion."

Critics might argue that sharing KV tensors is merely an approximation that inevitably degrades model performance. Raschka acknowledges this, noting that the scheme "reduces model capacity," but points out that existing research suggests the impact is minimal for smaller models. The trade-off appears to be a calculated risk that pays off in deployment feasibility.

Redefining Model Size

Perhaps the most provocative claim in the piece concerns how we define a model's size. Raschka dissects Gemma 4's "E" series, which stands for "effective" parameters, introducing a design called per-layer embeddings (PLE). This technique allows the model to store additional capacity in embedding tables rather than in the heavy transformer blocks.

"The 'E' in Gemma 4 E2B and E4B stands for 'effective'," Raschka clarifies. "Gemma 4 E2B is listed as 2.3B effective parameters, or 5.1B parameters when the embeddings are counted." This distinction is vital because it separates the computational cost of inference from the total parameter count. The transformer stack remains small and fast, while the embedding layers provide the necessary nuance for token-specific information.
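The gap between the two quoted figures is simple arithmetic, worked out below. The two parameter counts come from the quote above; the variable names are only for illustration.

```python
effective_params = 2.3e9  # listed "effective" parameters (from the article)
total_params = 5.1e9      # parameters when the embeddings are counted (from the article)

# Parameters that appear only in the larger count.
embedding_params = total_params - effective_params
embedding_share = embedding_params / total_params

print(f"{embedding_params / 1e9:.1f}B")  # 2.8B
print(f"{embedding_share:.0%}")          # 55%
```

In other words, a little over half of the total parameter count sits outside the "effective" figure.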

Raschka describes the mechanism: the model prepares a "packed PLE tensor that contains one small vector per decoder layer," which is then gated and added as an extra residual update. This design choice allows the model to "increase representational capacity through embedding parameters and small projections" without the latency penalty of scaling the entire stack.
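A rough sketch of that mechanism follows, in plain Python so the shapes are easy to follow. Only the overall flow (one packed vector per token, one slice per layer, gated, then added to the residual stream) comes from the description above. The sigmoid gate, the two projection matrices `w_gate` and `w_up`, the tiny dimensions, and the function name `ple_residual_update` are assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ple_residual_update(hidden, ple_vector, w_gate, w_up):
    """Add a gated per-layer embedding to the hidden state (assumed form)."""
    gate = [sigmoid(g) for g in matvec(w_gate, hidden)]  # hidden_dim -> ple_dim
    gated = [g * e for g, e in zip(gate, ple_vector)]    # gate the small vector
    update = matvec(w_up, gated)                         # ple_dim -> hidden_dim
    return [h + u for h, u in zip(hidden, update)]       # extra residual update


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
NUM_LAYERS, HIDDEN_DIM, PLE_DIM, VOCAB_SIZE = 3, 8, 2, 5  # toy sizes

# One row per token; each row packs NUM_LAYERS small vectors back to back.
packed_ple_table = rand_matrix(VOCAB_SIZE, NUM_LAYERS * PLE_DIM)
w_gate = [rand_matrix(PLE_DIM, HIDDEN_DIM) for _ in range(NUM_LAYERS)]
w_up = [rand_matrix(HIDDEN_DIM, PLE_DIM) for _ in range(NUM_LAYERS)]

token_id = 3
packed = packed_ple_table[token_id]  # a single table lookup per token
hidden = [random.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]

for layer in range(NUM_LAYERS):
    ple_vector = packed[layer * PLE_DIM:(layer + 1) * PLE_DIM]
    # The attention and feed-forward sublayers would run here.
    hidden = ple_residual_update(hidden, ple_vector, w_gate[layer], w_up[layer])

print(len(hidden))  # 8
```

The large table is only indexed, never multiplied, which is why it adds parameters without adding much compute; the per-layer work is limited to the two small projections.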

This approach challenges the industry's obsession with raw parameter counts. Raschka suggests that for larger models, this trick is less necessary because they already have sufficient capacity, but for smaller, efficient models, it is a game-changer. He admits, however, that "we have to take Google's word here that this is an effective and worthwhile design choice," noting that independent comparison studies against standard dense models are still needed.

Strategic Attention Budgeting

The final major innovation Raschka examines is "layer-wise attention budgeting" in Poolside's Laguna XS.2. This model abandons the idea that every layer needs the same attention capacity. Instead, it varies the number of query heads per layer, allocating more heads to sliding-window layers and fewer to global attention layers.

"The point is to spend attention capacity where it is most useful, instead of giving every layer the same attention budget," Raschka writes. This mirrors the logic of the "Gating mechanism" discussed in his previous deep dives, where resources are dynamically routed, but here the routing is static and architectural. The Laguna XS.2 config explicitly allows for different query-head counts per layer, a move Raschka calls "one of the most prominent recent examples of this per-layer query-head budgeting in a production-style open model."

This strategy echoes the design philosophy seen in Apple's 2024 OpenELM, where model capacity is varied by layer to optimize efficiency. By keeping the key-value heads fixed while adjusting query heads, Laguna XS.2 maintains a compatible KV cache shape while fine-tuning the attention mechanism. This suggests a future where model architectures are no longer monolithic but are instead highly specialized, with different layers performing distinct roles in the processing pipeline.
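A small config sketch makes the point about the cache shape concrete. The head counts, head size, hidden size, and layer pattern below are hypothetical and are not taken from the Laguna XS.2 config; `LayerAttentionConfig` is an illustrative name, not a real library class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerAttentionConfig:
    kind: str             # "sliding" or "global"
    num_query_heads: int  # varies per layer
    num_kv_heads: int     # kept fixed across layers
    head_dim: int

    def q_proj_params(self, hidden_dim: int) -> int:
        """Weights in the query projection (bias ignored)."""
        return hidden_dim * self.num_query_heads * self.head_dim

    def kv_cache_shape_per_token(self) -> tuple[int, int, int]:
        """(K and V, KV heads, head size) stored per cached token."""
        return (2, self.num_kv_heads, self.head_dim)


HIDDEN_DIM = 2048  # hypothetical

# Hypothetical pattern: three sliding-window layers, then one global layer.
# Sliding-window layers get more query heads, global layers get fewer.
layers = [
    LayerAttentionConfig("global", 16, 8, 128) if i % 4 == 3
    else LayerAttentionConfig("sliding", 32, 8, 128)
    for i in range(8)
]

for cfg in layers:
    # Grouped-query attention needs query heads to divide evenly over KV heads.
    assert cfg.num_query_heads % cfg.num_kv_heads == 0

cache_shapes = {cfg.kv_cache_shape_per_token() for cfg in layers}
print(cache_shapes)                         # {(2, 8, 128)}
print(layers[0].q_proj_params(HIDDEN_DIM))  # 8388608 (sliding layer)
print(layers[3].q_proj_params(HIDDEN_DIM))  # 4194304 (global layer)
```

The query-side cost changes from layer to layer, while the per-token cache entry has the same shape everywhere, so the cache code does not need a special case per layer.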

A counterargument worth considering is whether this level of architectural complexity makes models harder to train and debug. If every layer has unique constraints, the training dynamics become more fragile. Raschka does not fully address the training stability implications of such heterogeneous layers, leaving a gap for future research.

Bottom Line

Sebastian Raschka's analysis provides a necessary corrective to the industry's fixation on scale, demonstrating that the path to viable long-context agents lies in architectural efficiency rather than brute force. The strongest part of the argument is the concrete evidence that techniques like KV sharing and per-layer embeddings can drastically reduce memory footprints without sacrificing utility. The biggest vulnerability remains the reliance on vendor claims regarding performance trade-offs, as independent benchmarks for these specific architectural tweaks are still emerging. Readers should watch for how these efficiency gains translate into real-world agent performance in the coming months.

Sources

Recent developments in LLM architectures: KV sharing, mHC, and compressed attention

by Sebastian Raschka · Ahead of AI

After a short family break, I am excited to be back and catching up on a busy few weeks of open-weight LLM releases. The thing that stood out to me is how much newer architectures are focused on long-context efficiency.

As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints, and LLM developers are adding a growing number of architecture tricks to reduce those costs.

The main examples I want to look at are KV sharing and per-layer embeddings in Gemma 4, layer-wise attention budgeting in Laguna XS.2, compressed convolutional attention in ZAYA1-8B, and mHC plus compressed attention in DeepSeek V4.

Most of these changes look like small tweaks in my architecture diagrams, but some of them are quite intricate design changes that are worth a more detailed discussion.

Note that this article is about architecture designs, so I will mostly skip dataset mixtures, training schedules, post-training details, RL recipes, benchmark tables, and product comparisons. Even with that narrower scope, there is a lot to cover. And, like always, the article turned out longer than I expected, so I will keep the focus on what changes inside the transformer block, residual stream, KV cache, or attention computation.

Please also note that I am only covering topics that involve interesting (new) design choices and that I haven’t yet covered elsewhere. This list includes:

KV sharing and per-layer embeddings in Gemma 4

Compressed convolutional attention in ZAYA1

Attention budgeting in Laguna XS.2

mHC and compressed attention in DeepSeek V4

Previous Topics

Before getting into the new parts, here are the two previous articles I will refer back to. The first one gives a broader architecture background on recent MoE models, routed experts, active parameters, and model-size comparisons. The second one covers the attention background that comes up repeatedly below, including MHA, MQA, GQA, MLA, sliding-window attention, sparse attention, and hybrid attention designs.

I also turned several of these explanations into short, standalone tutorial pages in the LLM Architecture Gallery. For example, readers can find compact explainers for GQA, MLA, sliding-window attention, DeepSeek Sparse Attention, MoE routing, and other concepts linked from the corresponding model cards and concept labels.

1. Reusing KV Tensors Across Layers to Shrink the Cache (Gemma 4)

For this tour of architecture advances and tweaks, we will go back to the beginning of April ...