From GPT-2 to gpt-oss: Analyzing the Architectural Advances

In a landscape dominated by closed ecosystems and proprietary black boxes, OpenAI's release of gpt-oss marks a seismic shift back to the open-weight era. Sebastian Raschka dissects this move not as a nostalgic return to 2019, but as a masterclass in architectural efficiency, revealing how modern models achieve massive scale without bloating hardware requirements. For the busy professional watching the AI race, the takeaway isn't just about a new model; it's about the democratization of high-performance intelligence that can now run on a single consumer graphics card.

The Architecture of Efficiency

Raschka begins by dismantling the assumption that progress requires radical structural reinvention. "There is significant rotation of employees between these labs," he speculates, noting that the industry has largely converged on the transformer architecture because "we still have not found anything better than the transformer architecture." This observation is crucial: it suggests that the current gold rush is about optimization and data, not a fundamental breakthrough in how machines "think." The author argues that while alternatives like state space models exist, "no one has shown that they perform as well as transformers at this scale."


The core of the gpt-oss innovation lies in how it strips away legacy components that no longer serve a purpose. Raschka points out that dropout, a technique to prevent overfitting, has been abandoned in modern large language models. "I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture," he writes, explaining that because these models train on massive datasets for a single epoch, "there is little risk of overfitting." This is a pragmatic evolution, shedding the baggage of older deep learning paradigms.
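To see what was shed, here is a minimal sketch of inverted dropout (the standard formulation, not code from gpt-oss; the function name and shapes are invented for illustration). Setting the drop probability to zero, as modern single-epoch LLM training effectively does, turns the layer into a no-op:

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each activation with prob p, rescale the rest."""
    if not training or p == 0.0:
        return x  # modern LLM training effectively always takes this path
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(3)
x = np.ones(10)
print(dropout(x, p=0.0, rng=rng))   # identity: gpt-oss-style training
print(dropout(x, p=0.1, rng=rng))   # GPT-2-style regularization
```

The rescaling by 1/(1-p) keeps the expected activation unchanged during training, which is why the layer can simply be dropped at inference, and, with one-epoch training, dropped entirely.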

"Most of the gains likely come from data and algorithm tweaks rather than from major architecture changes."

This framing is effective because it grounds the reader's expectations. We aren't seeing a new species of AI; we are seeing a highly refined version of the same engine, tuned for speed and accessibility. Raschka notes that the shift from GELU to Swish activation functions is driven largely by computational cost rather than a massive leap in modeling performance, stating, "Swish is computationally slightly cheaper than GELU, and that's probably the main reason it replaced GELU in most newer models." Critics might argue that this focus on efficiency comes at the cost of theoretical elegance, but in a field where inference costs dictate deployment, the practical choice wins.
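The cost argument is easy to see side by side. Below is a small sketch (my own illustration, not code from the article) of Swish/SiLU next to the tanh approximation of GELU that GPT-2 used; Swish needs only one sigmoid, while GELU requires erf or a polynomial-plus-tanh approximation:

```python
import numpy as np

def swish(x):
    # Swish / SiLU: x * sigmoid(x) -- a single sigmoid per element
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3, 3, 7)
print(np.round(swish(x), 3))
print(np.round(gelu(x), 3))
```

Plotting the two confirms how close the curves are: both are smooth, non-monotonic near zero, and near-identity for large inputs, which is why the swap is nearly free in modeling terms.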

Scaling Down to Scale Up

Perhaps the most striking aspect of the coverage is how Raschka explains the use of Mixture-of-Experts (MoE) to balance capacity with speed. By activating only a subset of the model's parameters for each token, the architecture can hold vast amounts of knowledge without slowing down the user. "The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time," Raschka explains. This is the secret sauce that allows the 120-billion parameter model to run on a single H100 GPU, a feat that would have been impossible with a dense architecture.
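The routing idea is simple enough to sketch in a few lines. The following toy example (invented names and dimensions, nothing from the gpt-oss codebase) routes a single token through the top-k highest-scoring experts, so only a fraction of the expert weights are ever multiplied:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token through the top_k highest-scoring experts.

    x: (d,) token activation; gate_w: (d, n_experts) router weights;
    experts: list of (d, d) matrices standing in for expert FFNs.
    """
    logits = x @ gate_w                   # router score per expert
    top = np.argsort(logits)[-top_k:]     # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the selected experts only
    # Only top_k expert matmuls run; the remaining parameters stay idle.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=d)
gate_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts)
print(y.shape)
```

With top_k=2 of 4 experts, half the expert parameters are untouched for this token; at gpt-oss scale the idle fraction is far larger, which is exactly the sparsity Raschka describes.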

The commentary also highlights the clever use of Grouped Query Attention (GQA) and sliding window attention to reduce memory bandwidth. "GQA reduces memory usage by grouping multiple heads to share the same key and value projections," Raschka writes, noting that this leads to "lower memory usage and improved efficiency without noticeably affecting modeling performance." The inclusion of sliding window attention, where the model only looks back 128 tokens in alternating layers, is a bold move. "The window is just 128 tokens, which is remarkably small," he observes, yet ablation studies suggest this has a minimal impact on the model's ability to handle complex tasks.
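Both tricks can be combined in one toy attention function. The sketch below (my own minimal illustration, with invented shapes and a small window rather than 128 tokens) shares one key/value head among several query heads and masks out anything beyond the sliding window:

```python
import numpy as np

def gqa_swa_attention(q, k, v, n_groups, window):
    """Grouped Query Attention with a sliding-window causal mask.

    q: (n_q_heads, seq, d); k, v: (n_groups, seq, d). Toy shapes, no batching.
    """
    n_q_heads, seq, d = q.shape
    heads_per_group = n_q_heads // n_groups
    idx = np.arange(seq)
    # token i may attend to tokens j with i - window < j <= i
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    out = np.empty_like(q)
    for h in range(n_q_heads):
        g = h // heads_per_group          # several query heads read group g's K/V
        scores = q[h] @ k[g].T / np.sqrt(d)
        scores = np.where(mask, scores, -np.inf)
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[h] = probs @ v[g]
    return out

rng = np.random.default_rng(1)
q = rng.normal(size=(8, 6, 16))   # 8 query heads
k = rng.normal(size=(2, 6, 16))   # only 2 K/V groups -> 4x smaller KV cache
v = rng.normal(size=(2, 6, 16))
out = gqa_swa_attention(q, k, v, n_groups=2, window=3)
print(out.shape)
```

The memory win is visible in the shapes: the KV cache stores 2 heads instead of 8, and the window bounds how many cached positions each layer must read at all.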

"In most MoE models, expert weights account for more than 90% of the total model parameters."

This statistic underscores the sheer scale of the hidden capacity in these models. The trade-off is clear: the model is massive in potential, but lightweight in operation. Raschka's analysis of the transition from absolute positional embeddings to Rotary Position Embeddings (RoPE) further illustrates this trend toward elegant, mathematically efficient solutions. "RoPE encodes position by rotating the query and key vectors in a way that depends on each token's position," he notes, a method that has become a staple in modern architectures like Llama.
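The elegance of RoPE is that the rotation turns absolute positions into relative ones inside the attention dot product. The toy implementation below (a common half-split formulation, not the exact gpt-oss code) demonstrates the key property: the score between a rotated query and key depends only on their positional offset:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate dim pairs of x by angles that grow with the token position.

    x: (d,) query or key vector, d even; pos: integer token position.
    """
    half = x.shape[0] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per dim pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # standard 2D rotation applied to each (x1[i], x2[i]) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(2)
q, k = rng.normal(size=8), rng.normal(size=8)
a = rope(q, 5) @ rope(k, 3)      # offset of 2
b = rope(q, 12) @ rope(k, 10)    # same offset of 2
print(np.isclose(a, b))  # True
```

Because rotations compose, R(p)ᵀR(s) = R(s−p), so attention scores see only relative distance; no learned position table is needed, and that is what made RoPE a staple in Llama-style architectures.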

The Bottom Line

Sebastian Raschka's analysis succeeds in demystifying the technical wizardry behind OpenAI's latest release, proving that the path to accessible AI is paved with incremental, highly optimized engineering rather than sudden, magical breakthroughs. The strongest part of this argument is its focus on the "why" behind the architectural choices—efficiency, cost, and hardware constraints—rather than just the "what." The biggest vulnerability, however, is the assumption that the transformer architecture will remain dominant indefinitely; as the author admits, the search for a better alternative is ongoing. For now, though, the gpt-oss models represent a pivotal moment where high-end intelligence becomes a local, manageable tool for the individual developer.

The industry has reached a plateau of structural innovation, and the next frontier is purely about how efficiently we can run what we already have. This piece is essential reading for anyone who wants to understand not just what AI can do, but how it can actually be deployed in the real world.

Sources

From GPT-2 to gpt-oss: Analyzing the Architectural Advances

by Sebastian Raschka · Ahead of AI

OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later).

This is the first time since GPT-2 that OpenAI has shared a large, fully open-weight model. Earlier GPT models showed how the transformer architecture scales. The 2022 ChatGPT release then made these models mainstream by demonstrating concrete usefulness for writing and knowledge (and later coding) tasks. Now they have shared their long-awaited open-weight models, and the architecture has some interesting details.

I spent the past few days reading through the code and technical reports to summarize the most interesting details. (Just days after, OpenAI also announced GPT-5, which I will briefly discuss in the context of the gpt-oss models at the end of this article.)

Below is a quick preview of what the article covers. For easier navigation, I recommend using the Table of Contents on the left of the article page.

Model architecture comparisons with GPT-2

MXFP4 optimization to fit gpt-oss models onto single GPUs

Width versus depth trade-offs (gpt-oss vs Qwen3)

Attention bias and sinks

Benchmarks and comparisons with GPT-5

I hope you find it informative!

1. Model Architecture Overview

Before we discuss the architecture in more detail, let's start with an overview of the two models, gpt-oss-20b and gpt-oss-120b, shown in Figure 1 below.

If you have looked at recent LLM architecture diagrams before, or read my previous Big Architecture Comparison article, you may notice that there is nothing novel or unusual at first glance.

This is not surprising, since leading LLM developers tend to use the same base architecture and then apply smaller tweaks. This is pure speculation on my part, but I think this is because:

1. There is significant rotation of employees between these labs.

2. We still have not found anything better than the transformer architecture. Even though state space models and text diffusion models exist, as far as I know no one has shown that they perform as well as transformers at this scale. (Most of the comparisons I found focus only on benchmark performance. It is still unclear how well the models handle real-world, multi-turn writing and coding tasks. At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96. EDIT: Someone ...
We still have not found anything better than the transformer architecture. Even though state space models and text diffusion models exist, as far as I know no one has shown that they perform as well as transformers at this scale. (Most of the comparisons I found focus only on benchmark performance. It is still unclear how well the models handle real-world, multi-turn writing and coding tasks. At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96. EDIT: Someone ...