
The big LLM architecture comparison

In an era where the industry chases ever-larger parameter counts, Sebastian Raschka offers a sobering, technical reality check: the fundamental architecture of large language models has barely changed in seven years. While the headlines scream about new breakthroughs, Raschka argues that beneath the hype, we are largely "polishing the same architectural foundations" rather than reinventing the wheel. This distinction is critical for any professional trying to separate genuine engineering progress from marketing noise.

The Illusion of Radical Change

Raschka begins by dismantling the assumption that recent models represent a structural revolution. He notes that despite the passage of time from GPT-2 to the latest 2025 releases, the core DNA remains strikingly consistent. "At first glance, looking back at GPT-2 (2019) and forward to DeepSeek V3 and Llama 4 (2024-2025), one might be surprised at how structurally similar these models still are," he writes. This observation is not merely nostalgic; it serves as a grounding mechanism for readers overwhelmed by the pace of AI announcements.


The author acknowledges that refinements exist, such as the shift from absolute to rotary (RoPE) positional embeddings or the replacement of GELU activations with SwiGLU. However, he frames these as incremental optimizations rather than paradigm shifts. "But beneath these minor refinements, have we truly seen groundbreaking changes, or are we simply polishing the same architectural foundations?" Raschka asks. This rhetorical question effectively challenges the reader to look past the surface-level marketing of "new" models and examine the actual engineering under the hood.
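To see what one of these refinements looks like in code, here is a minimal PyTorch sketch of a SwiGLU feed-forward block of the kind that has replaced plain GELU MLPs in most recent models; the class name and dimensions are illustrative assumptions, not taken from any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block (SwiGLU). Sizes are illustrative only."""
    def __init__(self, d_model: int = 1024, d_hidden: int = 2731):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x W_gate) acts as a data-dependent gate on (x W_up),
        # replacing the single GELU nonlinearity of older MLP blocks.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```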

DeepSeek's Efficiency Gambit

The commentary then pivots to DeepSeek V3 and R1, which Raschka identifies as the most significant architectural deviations in the current landscape. The piece highlights two specific innovations: Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE). Raschka explains that while Grouped-Query Attention (GQA) has become the standard for reducing memory usage by sharing key and value projections, DeepSeek chose a different path. "Multi-Head Latent Attention (MLA) offers a different memory-saving strategy that also pairs particularly well with KV caching," he notes.

The distinction is technical but vital. Unlike GQA, which saves memory by letting groups of query heads share the same key and value heads, MLA compresses the key and value tensors into a lower-dimensional latent space before they are stored in the KV cache. Raschka points out a crucial trade-off: "This adds an extra matrix multiplication but reduces memory usage." He further clarifies that this isn't just a theoretical exercise; empirical evidence suggests it yields better results. Citing ablation studies from the DeepSeek-V2 paper, he writes, "GQA appears to perform worse than MHA, whereas MLA offers better modeling performance than MHA, which is likely why the DeepSeek team chose MLA over GQA."
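To make the compression idea concrete, the sketch below (a rough PyTorch illustration, not DeepSeek's actual implementation; all names and dimensions are assumptions) shows keys and values being projected down into a small latent tensor for caching and re-expanded at attention time, which is where the extra matrix multiplications come from.

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Sketch of the key/value compression idea behind MLA.
    Names and sizes are illustrative, not DeepSeek's implementation."""
    def __init__(self, d_model: int = 4096, d_latent: int = 512, n_heads: int = 32):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.down = nn.Linear(d_model, d_latent, bias=False)   # compress
        self.up_k = nn.Linear(d_latent, d_model, bias=False)   # expand keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)   # expand values

    def compress(self, x: torch.Tensor) -> torch.Tensor:
        # Only this (batch, seq, d_latent) tensor goes into the KV cache,
        # instead of the full-size keys and values.
        return self.down(x)

    def expand(self, latent: torch.Tensor):
        # The extra matrix multiplications happen here, at attention time.
        b, s, _ = latent.shape
        k = self.up_k(latent).view(b, s, self.n_heads, self.head_dim)
        v = self.up_v(latent).view(b, s, self.n_heads, self.head_dim)
        return k, v
```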

This section is particularly strong because it moves beyond the "bigger is better" narrative to focus on efficiency. Raschka details how DeepSeek V3 utilizes a massive 671-billion-parameter MoE architecture but only activates a fraction of those parameters during inference. "DeepSeek V3 has 256 experts per MoE module and a total of 671 billion parameters. Yet during inference, only 9 experts are active at a time," he explains. This sparsity allows the model to hold vast knowledge while remaining computationally feasible. A counterargument worth considering is whether this complexity introduces new failure modes or instability that simpler, dense models avoid, but Raschka's data suggests the efficiency gains currently outweigh these risks.
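The routing mechanism behind that sparsity is easy to sketch. The toy PyTorch layer below scores all experts for each token but evaluates only the top few, plus an always-active shared expert mirroring the setup Raschka describes; expert counts and sizes are deliberately tiny and illustrative, not DeepSeek V3's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse Mixture-of-Experts layer; sizes are illustrative only."""
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.experts = nn.ModuleList(make_expert() for _ in range(n_experts))
        self.shared_expert = make_expert()   # always active, like DeepSeek's shared expert
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, indices = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
        outputs = []
        for t in range(x.shape[0]):                          # naive per-token dispatch
            y = self.shared_expert(x[t])                     # shared expert always runs
            for w, e in zip(weights[t], indices[t]):
                y = y + w * self.experts[int(e)](x[t])       # only the chosen experts run
            outputs.append(y)
        return torch.stack(outputs)
```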

By compressing the key and value tensors, DeepSeek achieves a rare feat: reducing memory usage while actually improving modeling performance compared to standard alternatives.

The OLMo 2 Anomaly

Shifting focus to the non-profit Allen Institute for AI, Raschka examines OLMo 2, a model that stands out for its transparency rather than its raw scale. He describes OLMo 2 as a "great blueprint for developing LLMs" due to its open documentation of training data and code. Interestingly, OLMo 2 bucks the trend of adopting efficiency mechanisms like MLA or GQA and sticks with traditional Multi-Head Attention; its innovation lies instead in the placement of normalization layers.

Raschka details how OLMo 2 reverted to a form of Post-LN (Post-Norm), a design choice that had largely been abandoned in favor of Pre-LN in models like Llama. "In 2020, Xiong et al. showed that Pre-LN results in more well-behaved gradients at initialization," he writes, which makes OLMo 2's return to a Post-LN-style ordering a deliberate, counter-trend choice that its developers found improved training stability. He further introduces the concept of QK-Norm, an additional normalization step applied to queries and keys within the attention mechanism. "QK-Norm is essentially yet another RMSNorm layer," Raschka clarifies, explaining that its placement helps stabilize the attention scores before they are processed.
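For readers who want to see what "yet another RMSNorm layer" amounts to, here is a minimal PyTorch sketch of QK-Norm applied to the query and key projections; it follows the general idea rather than OLMo 2's actual code, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Plain RMSNorm: rescale features by their root-mean-square."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class QKNormProjections(nn.Module):
    """Query/key projections with QK-Norm; sizes are illustrative only."""
    def __init__(self, d_model: int = 2048, n_heads: int = 16):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = RMSNorm(self.head_dim)   # "yet another RMSNorm layer"
        self.k_norm = RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor):
        b, s, _ = x.shape
        q = self.w_q(x).view(b, s, self.n_heads, self.head_dim)
        k = self.w_k(x).view(b, s, self.n_heads, self.head_dim)
        # Normalizing q and k bounds their scale before the softmax(QK^T)
        # attention scores are computed, which helps training stability.
        return self.q_norm(q), self.k_norm(k)
```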

The inclusion of OLMo 2 is a strategic choice by Raschka to show that innovation isn't monolithic. While DeepSeek pushes the boundaries of efficiency through complex compression, OLMo 2 explores the stability of foundational components. "Unfortunately this figure shows the results of the reordering together with QK-Norm, which is a separate concept. So, it's hard to tell how much the normalization layer reordering contributed by itself," he admits. This honesty about the limitations of current research data adds significant credibility to his analysis.

Critics might argue that focusing on architectural tweaks ignores the more significant driver of performance: the scale and quality of training data. However, Raschka's premise is that understanding the structural skeleton is essential for diagnosing why models behave the way they do, regardless of data volume.

Bottom Line

Sebastian Raschka's analysis provides a necessary corrective to the industry's tendency to conflate parameter count with architectural innovation. The strongest part of his argument is the evidence that efficiency gains, particularly through MLA and MoE, are now the primary battleground for model performance, not just raw size. The biggest vulnerability in the current landscape remains the opacity of training methodologies, which makes it difficult to fully isolate the impact of these architectural choices. As the field moves forward, professionals should watch whether these efficiency-focused designs can scale without hitting the same diminishing returns that have plagued previous generations of dense models.

Sources

The big LLM architecture comparison

by Sebastian Raschka · Ahead of AI

It has been seven years since the original GPT architecture was developed. At first glance, looking back at GPT-2 (2019) and forward to DeepSeek V3 and Llama 4 (2024-2025), one might be surprised at how structurally similar these models still are.

Sure, positional embeddings have evolved from absolute to rotational (RoPE), Multi-Head Attention has largely given way to Grouped-Query Attention, and the more efficient SwiGLU has replaced activation functions like GELU. But beneath these minor refinements, have we truly seen groundbreaking changes, or are we simply polishing the same architectural foundations?
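For illustration, a minimal, generic PyTorch sketch of the rotary-embedding idea, which rotates each query/key feature pair by a position-dependent angle (this is the textbook formulation, not any particular model's implementation):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape
    (batch, seq_len, n_heads, head_dim), with head_dim even."""
    _, seq_len, _, head_dim = x.shape
    # One rotation frequency per feature pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)      # (seq_len, head_dim / 2)
    cos = angles.cos()[None, :, None, :]           # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    # Rotate each (even, odd) feature pair by its position-dependent angle.
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```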

Comparing LLMs to determine the key ingredients that contribute to their good (or not-so-good) performance is notoriously challenging: datasets, training techniques, and hyperparameters vary widely and are often not well documented.

However, I think that there is still a lot of value in examining the structural changes of the architectures themselves to see what LLM developers are up to in 2025. (A subset of them are shown in Figure 1 below.)

So, in this article, rather than writing about benchmark performance or training algorithms, I will focus on the architectural developments that define today's flagship open models.

(As you may remember, I wrote about multimodal LLMs not too long ago; in this article, I will focus on the text capabilities of recent models and leave the discussion of multimodal capabilities for another time.)


1. DeepSeek V3/R1.

As you have probably heard more than once by now, DeepSeek R1 made a big impact when it was released in January 2025. DeepSeek R1 is a reasoning model built on top of the DeepSeek V3 architecture, which was introduced in December 2024.

While my focus here is on architectures released in 2025, I think it’s reasonable to include DeepSeek V3, since it only gained widespread attention and adoption following the launch of DeepSeek R1 in 2025.

If you are interested in the training of DeepSeek R1 specifically, you may also find my article from earlier this year useful:

In this section, I’ll focus on two key architectural techniques introduced in DeepSeek V3 that improved its computational efficiency and distinguish it from many other LLMs:

Multi-Head Latent Attention (MLA)

Mixture-of-Experts ...