← Back to Library

Beyond Standard LLMs

Sebastian Raschka delivers a crucial reality check for the artificial intelligence sector: the industry's push for efficiency may have hit a wall. The rush to replace standard transformer models with linear-attention hybrids promised far longer context windows at a fraction of the cost, yet Raschka documents a striking reversal, with leading developers walking back these very architectures after they underperformed on complex reasoning tasks.

The Efficiency Trap

Raschka begins by acknowledging the dominance of the current standard. "Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code," he writes, listing a parade of recent open-weight models that rely on this proven foundation. He notes that while efficiency tweaks like grouped-query attention have helped, the fundamental quadratic cost of processing long sequences remains a bottleneck. This sets the stage for the industry's desperate search for alternatives.
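To make the bottleneck concrete, here is a minimal sketch of standard scaled dot-product attention; the n-by-n score matrix it materializes is exactly what makes compute and memory grow quadratically with sequence length. Shapes and names are illustrative, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (n, d_head)
    scores = q @ k.T / q.shape[-1] ** 0.5  # (n, n) score matrix: quadratic in n
    return F.softmax(scores, dim=-1) @ v   # (n, d_head)

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)  # materializes a 1024 x 1024 matrix
```

Doubling the sequence length quadruples the size of that score matrix, which is why long-context workloads push engineers toward alternatives.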

The Linear Attention Revival

The core of the piece examines the recent "revival" of linear attention mechanisms. These architectures attempt to reduce computational complexity from quadratic to linear, theoretically allowing models to handle massive amounts of data without exploding memory costs. Raschka details how major players like MiniMax and Qwen experimented with these hybrids, replacing standard attention layers with mechanisms like Gated DeltaNet. "All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants," he observes, highlighting the speed at which the industry adopted these experimental designs.
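For intuition on how such variants reach linear cost, the sketch below uses the classic kernelized linear-attention trick (in the spirit of Katharopoulos et al., 2020) rather than Gated DeltaNet itself: reassociating the matrix products so the n-by-n score matrix is never formed. It is a generic, non-causal illustration, not the layer any of these models actually ship.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k: (n, d_k); v: (n, d_v)
    phi = lambda x: F.elu(x) + 1          # positive feature map standing in for softmax
    q, k = phi(q), phi(k)
    kv = k.T @ v                          # (d_k, d_v): fixed size, independent of n
    z = q @ k.sum(dim=0, keepdim=True).T  # (n, 1) normalizer
    return (q @ kv) / z                   # (n, d_v), with no (n, n) matrix anywhere

n, d = 1024, 64
out = linear_attention(torch.randn(n, d), torch.randn(n, d), torch.randn(n, d))
```

The cost becomes linear in n instead of quadratic, which is the entire appeal, and, as the rest of the piece shows, the entire risk: all pairwise interactions must squeeze through that fixed-size kv state.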

The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled dot-product attention, remains the most popular attention variant in today's LLMs.

However, Raschka's reporting takes a sharp turn when he describes the MiniMax team's decision to release their M2 model without linear attention, reverting to standard mechanisms. He explains that while the linear models worked for simple prompts, they faltered significantly in reasoning and multi-turn interactions. "It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks," Raschka writes, noting that these are critical for agentic applications. This is a vital distinction often missed in hype cycles: a model that is fast but cannot reason is functionally limited.

Critics might argue that abandoning linear attention now is premature, given that the technology is still maturing. Yet, the practical evidence from production environments suggests that the trade-off between memory efficiency and cognitive capability is not yet worth making for high-stakes tasks.

The Hybrid Compromise

The article then dissects the specific architecture of Qwen3-Next, which attempts to solve this dilemma through a hybrid approach. Instead of a full replacement, the model alternates between linear and standard attention layers in a specific ratio. Raschka explains that this design uses a "3:1 ratio" where three layers of linear attention (Gated DeltaNet) are followed by one layer of full attention. This structure allows the model to maintain a running memory state for efficiency while periodically refreshing its global context.
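As a schematic only, the layer ordering of such a stack could be generated as follows; the block count and layer names are hypothetical placeholders, and only the 3:1 ordering comes from the article.

```python
def hybrid_layer_plan(num_blocks: int) -> list[str]:
    """Layer order for a 3:1 linear-to-full attention hybrid stack."""
    plan = []
    for _ in range(num_blocks):
        plan += ["gated_deltanet"] * 3  # cheap running-state (linear attention) layers
        plan.append("full_attention")   # periodic global-context refresh
    return plan

print(hybrid_layer_plan(2))
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
```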

The mechanism relies on a "delta rule" to update a hidden state, a concept Raschka likens to biological learning. "It's basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision," he notes. This recurrent state update allows the model to process tokens one by one, avoiding the massive n-by-n attention matrix that slows down traditional transformers. However, this efficiency comes with a structural cost: the model must compress past context into a fixed-size hidden state.

In Gated DeltaNet, there's no n-by-n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in.
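A toy version of such a delta-rule update might look like the following. The gate and step-size values are invented for illustration, and the real Gated DeltaNet layer adds heads, normalization, and learned per-token gates; the sketch only shows the error-correcting update Raschka compares to the perceptron rule.

```python
import numpy as np

d_k, d_v = 4, 4
S = np.zeros((d_k, d_v))            # fixed-size memory, independent of sequence length

def step(S, k, v, alpha=0.9, beta=0.5):
    k = k / np.linalg.norm(k)       # work with a normalized key
    S = alpha * S                   # gate: decay the old memory
    pred = S.T @ k                  # value the current memory predicts for this key
    return S + beta * np.outer(k, v - pred)  # delta rule: correct toward the target

rng = np.random.default_rng(0)
for _ in range(16):                 # tokens are processed one by one
    S = step(S, rng.normal(size=d_k), rng.normal(size=d_v))

out = S.T @ rng.normal(size=d_k)    # read-out for a query: no n-by-n matrix needed
```

Because S never grows, memory stays constant no matter how long the sequence gets, which is also why old context must be overwritten: the bottleneck Raschka highlights next.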

Raschka points out that this compression creates a "bottleneck" that limits the model's ability to capture global context compared to full pairwise attention. The hybrid solution, therefore, is a compromise. It acknowledges that while linear attention is necessary for scaling context length, it cannot fully replace the global modeling power of standard transformers. The inclusion of "gated attention" layers, which use a sigmoid gate to modulate outputs, further attempts to stabilize training and prevent issues like "Attention Sink."

The attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model.
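A minimal sketch of what such output gating could look like, assuming an element-wise sigmoid gate computed from the layer input (the exact placement and parameterization in Qwen3-Next may differ):

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)  # hypothetical gate projection

    def forward(self, x, attn_out):
        # A sigmoid gate derived from the layer input scales the attention
        # output element-wise, damping outlier ("massive") activations.
        return attn_out * torch.sigmoid(self.gate(x))

x = torch.randn(2, 8, 16)           # (batch, seq, d_model)
attn_out = torch.randn(2, 8, 16)    # stand-in for an attention block's output
gated = GatedAttentionOutput(16)(x, attn_out)
```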

This nuanced framing is the piece's greatest strength. Rather than declaring one architecture the winner, Raschka illustrates a landscape where engineers are forced to balance competing priorities. The shift away from pure linear attention in the MiniMax M2 model suggests that the industry is learning that raw efficiency cannot come at the expense of intelligence.

Bottom Line

Raschka's analysis provides a necessary correction to the narrative that linear attention is the inevitable future of large language models. The strongest part of his argument is the empirical evidence that efficiency gains often degrade reasoning capabilities, forcing a return to hybrid or standard architectures. The biggest vulnerability remains the scalability of these hybrid models; if they cannot handle the context lengths promised by linear attention without sacrificing accuracy, the industry may be stuck with expensive, quadratic bottlenecks. Readers should watch for whether future iterations can solve the "memory bottleneck" without reverting to the computational costs of full attention.


Sources

Beyond Standard LLMs

by Sebastian Raschka · Ahead of AI

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism.

However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance.

After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions about what I think of alternative approaches. (I also recently gave a short talk about this at the PyTorch Conference 2025, where I promised attendees a follow-up write-up of these alternative approaches.) So here it is!

Note that each of the topics shown in the figure above would ideally deserve at least a whole article of its own (and will hopefully get one in the future). So, to keep this article at a reasonable length, many sections are kept fairly short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that have emerged in recent years.

PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording below.

(There is also a YouTube version here.)

1. Transformer-Based LLMs

Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include

DeepSeek V3/R1

OLMo 2

Gemma 3

Mistral Small 3.1

Llama 4

Qwen3

SmolLM3

Kimi K2

gpt-oss

GLM-4.5

GLM-4.6

MiniMax-M2

and many more.

(The list above focuses on the open-weight models; there are proprietary models like GPT-5, Grok 4, Gemini 2.5, etc. that also fall into this category.)

Since I have talked and written about transformer-based LLMs many times, I assume you are familiar with the broad idea and architecture. If you'd like deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article.

(Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other ...