Ten Architectures, One Conclusion: Data Still Wins
Between January 27 and February 17, 2026, ten open-weight large language models shipped from labs spanning the United States, China, the Middle East, and Canada. Sebastian Raschka, the machine learning researcher and author of Build a Large Language Model (From Scratch), catalogues them all in a single survey piece that doubles as an architectural field guide. The sheer pace is staggering. A decade ago, a single new architecture warranted a year of discourse. Now ten arrive in three weeks.
Raschka is candid about what matters most:
Modeling performance is likely not attributed to the architecture design itself but rather the dataset quality and training recipes.
That admission, buried in the conclusion, quietly undercuts the very premise of an architecture comparison. It is also the right one to make.
The DeepSeek Gravitational Pull
If there is a single through-line across all ten releases, it is the gravitational pull of DeepSeek V3. Moonshot AI's Kimi K2.5, at one trillion parameters, is explicitly described as a scaled-up version of the DeepSeek V3 architecture. Raschka notes that z.AI's GLM-5 now adopts Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention. Even Arcee AI's Trinity Large, from a previously unknown American startup, uses a DeepSeek-style Mixture-of-Experts (MoE) configuration.
Kimi K2.5 is a native multimodal model built upon Kimi K2 through large-scale joint pre-training on approximately 15 trillion mixed visual and text tokens.
The number should land with some weight. Fifteen trillion tokens is an enormous pre-training corpus, and the fact that vision tokens were mixed in from the earliest stages rather than bolted on later represents a meaningful shift in how multimodal models get built.
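The distinction between native and bolted-on multimodality is easy to sketch. In an adapter design, a vision encoder is grafted onto a finished text model; in joint pre-training, projected image patches and text embeddings share one sequence from the start. A minimal sketch of the latter, with every module name and dimension invented for illustration rather than taken from Kimi K2.5:

```python
import torch
import torch.nn as nn

# Hypothetical components; Kimi K2.5's actual modules are not public
# at this level of detail.
d_model = 1024
text_embed = nn.Embedding(32000, d_model)   # assumed vocab size
patch_proj = nn.Linear(768, d_model)        # assumed patch dimension

def build_sequence(text_ids, image_patches):
    """Interleave projected vision tokens with text tokens in one stream,
    so the transformer trains on both modalities jointly from step one."""
    vision_tokens = patch_proj(image_patches)   # (n_patches, d_model)
    text_tokens = text_embed(text_ids)          # (n_text, d_model)
    return torch.cat([vision_tokens, text_tokens], dim=0)

seq = build_sequence(torch.randint(0, 32000, (16,)), torch.randn(9, 768))
print(seq.shape)  # torch.Size([25, 1024])
```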
The Efficiency Race
StepFun's Step 3.5 Flash stands out for a pragmatic reason: speed. At 196 billion parameters with only 11 billion active per token, it achieves roughly three times the throughput of DeepSeek V3.2 on Hopper GPUs. Multi-Token Prediction (MTP) with three additional tokens during both training and inference is the key trick. Raschka explains:
DeepSeek V3 reported using MTP-1, that is, MTP with 1 extra token (instead of 3) during training, and then making MTP optional during inference. Step 3.5 Flash uses MTP with 3 additional tokens (MTP-3) during both training and inference.
This is a concrete engineering decision with measurable payoff. More labs should be this transparent about their inference optimizations.
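To make the mechanism concrete, here is a hedged sketch of multi-token prediction heads. The layout is hypothetical, and Step 3.5 Flash's actual MTP module is surely more elaborate; the point is only that extra heads predict several positions ahead, and at inference those drafted tokens can be verified in one pass, speculative-decoding style, to cut sequential decode steps:

```python
import torch.nn as nn

class MTPHeads(nn.Module):
    """Sketch of MTP-3: alongside the usual next-token head, three extra
    heads predict tokens further ahead. Sizes are illustrative."""
    def __init__(self, d_model=1024, vocab_size=32000, n_extra=3):
        super().__init__()
        self.next_token = nn.Linear(d_model, vocab_size, bias=False)
        self.lookahead = nn.ModuleList(
            nn.Linear(d_model, vocab_size, bias=False)
            for _ in range(n_extra)  # MTP-3 => heads for t+2, t+3, t+4
        )

    def forward(self, hidden):  # hidden: (batch, seq, d_model)
        # Training: each head gets its own shifted cross-entropy loss.
        # Inference: the extra heads draft tokens that one verification
        # pass accepts or rejects, reducing sequential decode steps.
        return [self.next_token(hidden)] + [h(hidden) for h in self.lookahead]
```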
Qwen's contribution to the efficiency story comes through hybrid attention. The Qwen3-Coder-Next model replaces standard attention with a Gated DeltaNet and Gated Attention hybrid in a 3:1 ratio. Raschka explains the tradeoff clearly:
DeltaNet offers less precise content-based retrieval than full attention, which is why one gated attention layer remains.
That retained attention layer in every block of four is a telling concession: pure linear attention is not yet ready to stand alone. Ant Group's Ling 2.5 takes a similar hybrid approach but substitutes Lightning Attention for DeltaNet, achieving 3.5 times the throughput of Kimi K2 at equivalent parameter counts.
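The 3:1 ratio itself is easy to picture: out of every four transformer blocks, three use the linear-time DeltaNet mixer and one keeps gated softmax attention. A minimal sketch of the interleaving, with the two layer constructors as hypothetical stand-ins for the real modules:

```python
# Sketch of a 3:1 hybrid layer stack. `make_deltanet` and `make_attention`
# are placeholders for the actual layer constructors, which are not shown.
def build_hybrid_stack(n_layers, make_deltanet, make_attention, ratio=3):
    layers = []
    for i in range(n_layers):
        # One softmax-attention layer per (ratio + 1) blocks preserves
        # precise content-based retrieval; the rest run in linear time.
        if (i + 1) % (ratio + 1) == 0:
            layers.append(make_attention())
        else:
            layers.append(make_deltanet())
    return layers
```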
Small Models, Big Claims
Two models target the on-device category. Nanbeige 4.1 3B is architecturally almost identical to Llama 3.2 3B, with one notable divergence: it drops weight tying between input embeddings and the output layer. Raschka observes that weight tying "is a nice way to reduce the total number of parameters, but it almost always results in worse training performance as evidenced by higher training and validation losses." Most of Nanbeige's gains come from post-training, not architecture.
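The tradeoff is simple to see in code. A minimal sketch, with dimensions invented for illustration rather than taken from Nanbeige's config:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Sketch of the weight-tying choice; sizes are illustrative."""
    def __init__(self, vocab_size=32000, d_model=2048, tie_weights=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        if tie_weights:
            # Llama 3.2 3B style: the output projection reuses the input
            # embedding matrix, saving vocab_size * d_model parameters.
            self.lm_head.weight = self.embed.weight
        # Nanbeige 4.1 3B unties them: both matrices train independently,
        # costing ~65M extra parameters at these (assumed) sizes in
        # exchange for lower training and validation loss.
```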
Cohere's Tiny Aya, at 3.35 billion parameters, takes a more distinctive approach with parallel transformer blocks. Instead of computing attention and the feed-forward network sequentially, both operate on the same normalized input simultaneously. It is the strongest multilingual model at the 3B scale, outperforming Qwen3-4B and Gemma 3 4B. However, its non-commercial license sharply limits real-world adoption, a constraint Raschka notes but does not dwell on; for anyone considering deployment, that restriction matters more than any architectural choice.
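Setting licensing aside, the parallel block is easy to sketch: both branches read one normalization and fold into a single residual add. A minimal version, with sizes chosen for illustration rather than taken from Tiny Aya:

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel transformer block sketch: attention and FFN both read the
    same normalized input instead of running one after the other.
    Causal masking is omitted for brevity."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Sequential: x = x + attn(norm1(x)); x = x + ffn(norm2(x))
        # Parallel: both branches share h and fold into one residual add,
        # letting the two matmul paths overlap on the accelerator.
        return x + attn_out + self.ffn(h)
```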
Benchmarks at the Breaking Point
Raschka makes an important aside about benchmark saturation. Comparing Claude Opus 4.5 and Opus 4.6 on SWE-Bench Verified, he notes they score nearly identically, despite users reporting clear differences in real-world performance. His diagnosis is sharp:
The more likely issue here is that the SWE-Bench Verified benchmark has saturated, and it may no longer be a meaningful benchmark to report from now on.
This is a field-wide problem. When benchmarks no longer discriminate between models that feel different to use, the benchmarks are broken, not the models. The community's continued reliance on these numbers is becoming an obstacle to honest evaluation.
The GLM-5 Moment
Among the ten, GLM-5 from z.AI arguably makes the strongest overall impression. At 744 billion parameters, it appears to match or exceed the performance of both GPT-5.2 extra-high and Claude Opus 4.6 on independent hallucination benchmarks. Raschka notes that its architecture is strikingly similar to DeepSeek V3.2 but reduces the number of transformer layers from 92 (in its GLM-4.7 predecessor) to 78, a decision he attributes to latency reduction:
Layer depth cannot be parallelized in the same way as width.
Fewer layers, wider experts. It is a simple principle that more teams are converging on.
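The latency logic is worth spelling out with the survey's own layer counts. Layers execute strictly in sequence, so depth adds up serially, while MoE experts within a layer can be sharded across devices and run in parallel. The per-layer time below is an invented placeholder; only the layer counts come from the piece:

```python
# Toy decode-latency model. MS_PER_LAYER is an assumed placeholder, not
# a measured number; it exists only to show that depth scales latency.
MS_PER_LAYER = 0.5

def decode_step_ms(n_layers):
    return n_layers * MS_PER_LAYER  # layers run one after another

print(decode_step_ms(92))  # GLM-4.7 depth: 46.0 ms/token (toy numbers)
print(decode_step_ms(78))  # GLM-5 depth:   39.0 ms/token, ~15% faster
```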
MiniMax M2.5, meanwhile, takes a contrarian approach. No sliding window attention, no hybrid attention mechanisms, no MLA. Just plain Grouped Query Attention (GQA) at 230 billion parameters. Despite this architectural conservatism, it leads OpenRouter usage statistics and holds its own on coding benchmarks. Sometimes the simplest design wins on cost efficiency alone.
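For reference, plain GQA is about as simple as attention variants get: query heads are grouped, and each group shares one key/value head, shrinking the KV cache. A minimal sketch with illustrative shapes, not MiniMax's implementation:

```python
import torch

def grouped_query_attention(q, k, v, n_kv_heads):
    """GQA sketch: repeat each KV head so a group of query heads shares it.
    q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    Causal masking omitted for brevity."""
    group = q.shape[1] // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # expand KV heads to match queries
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

out = grouped_query_attention(
    torch.randn(1, 8, 16, 64),   # 8 query heads
    torch.randn(1, 2, 16, 64),   # 2 shared KV heads -> group size 4
    torch.randn(1, 2, 16, 64),
    n_kv_heads=2,
)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```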
What Is Missing
Raschka acknowledges the elephant in the room: DeepSeek V4 has not shipped yet. The entire field has been building on V3's blueprints for months. When V4 arrives, it will either validate or invalidate the architectural bets dozens of teams have placed.
One gap in the survey is any sustained discussion of inference cost per token across these models. Throughput numbers appear for Step 3.5 Flash and Ling 2.5, but a systematic cost comparison would be far more useful to practitioners than yet another benchmark table. Architecture comparisons are intellectually satisfying, but the market will ultimately sort these models by price-performance ratio, not by how cleverly they arrange their attention heads.
Bottom Line
Raschka has produced a valuable reference for anyone trying to keep pace with the open-weight model explosion. The architectural diagrams alone justify the read. But his own conclusion is the most important sentence in the piece: performance comes from data and training recipes, not from architecture. The ten models surveyed here prove that point by achieving similar results through wildly different structural choices. The real race is happening in the training pipeline, where none of these teams are sharing their secrets.