Sebastian Raschka cuts through the recent noise of model releases to deliver a crucial insight: the industry is hitting a wall with brute-force scaling, and the next leap in intelligence requires a fundamental shift in how models learn to think, not just how much data they consume.
The End of the Scaling Era
Raschka opens by noting a peculiar silence surrounding the latest flagship releases from major players. "One reason could be that GPT-4.5 and Llama 4 remain conventional models, which means they were trained without explicit reinforcement learning for reasoning," he writes. This observation is striking because it reframes the recent lack of fanfare not as a failure of marketing, but as a signal that the low-hanging fruit of simply adding more parameters and data has been picked. The market is reacting to the absence of a specific capability: the ability to deliberate.
The author points out that competitors are already pivoting. "Meanwhile, competitors such as xAI and Anthropic have added more reasoning capabilities and features into their models," Raschka notes, highlighting the introduction of "thinking" buttons that toggle these advanced cognitive processes. This suggests a bifurcation in the industry where standard chatbots are becoming commoditized, while reasoning engines are becoming the new premium product. The muted response to non-reasoning models, he argues, "suggests we are approaching the limits of what scaling model size and data alone can achieve."
"The muted response to GPT-4.5 and Llama 4 (non-reasoning) models suggests we are approaching the limits of what scaling model size and data alone can achieve."
This is a bold claim, yet it is supported by the counter-example of OpenAI's o3 model. Raschka highlights that "OpenAI's recent release of the o3 reasoning model demonstrates there is still considerable room for improvement when investing compute strategically, specifically via reinforcement learning methods tailored for reasoning tasks." The key differentiator here is not just the algorithm, but the sheer volume of compute dedicated to the training phase—reportedly ten times that of its predecessor. This reframes the narrative from "bigger is better" to "smarter training is better."
Redefining Reasoning
To understand the shift, Raschka first has to define what we are actually talking about. He moves away from vague buzzwords to a functional definition: "Reasoning, in the context of LLMs, refers to the model's ability to produce intermediate steps before providing a final answer." This is the technical essence of "chain-of-thought" reasoning, where the model generates a structured sequence of logic rather than hallucinating a direct answer.
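To make this concrete, here is a minimal Python sketch of what "intermediate steps before a final answer" looks like in practice. It assumes the model wraps its deliberation in <think>...</think> tags, the convention popularized by DeepSeek-R1-style reasoning models; the tag format and the sample output below are illustrative assumptions, not examples taken from Raschka's article.

```python
def split_reasoning(output: str) -> tuple[str, str]:
    """Separate intermediate reasoning steps from the final answer,
    assuming the model emits its deliberation inside <think>...</think> tags."""
    if "<think>" in output and "</think>" in output:
        start = output.index("<think>") + len("<think>")
        end = output.index("</think>")
        reasoning = output[start:end].strip()
        answer = output[end + len("</think>"):].strip()
        return reasoning, answer
    # No explicit reasoning trace: treat the whole output as the answer.
    return "", output.strip()

# Illustrative (made-up) output from a reasoning-style model:
sample = "<think>12 apples, eat 3 leaves 9; give away 4 leaves 5.</think> 5 apples."
steps, answer = split_reasoning(sample)
```

The point is simply that the intermediate steps are explicit tokens the model is trained to emit, not hidden internal states.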
The distinction between training-time and test-time compute is critical here. Raschka explains that accuracy can be improved through "increased training or test-time compute," but notes that his focus is on the former. "In my previous article," he writes, "I solely focused on the test-time compute methods. In this article, I finally want to take a closer look at the training methods." This distinction is vital for readers; it means the intelligence is being baked into the model's weights during creation, not just summoned during the conversation. Critics might argue that test-time compute is more flexible, but Raschka's analysis suggests that without the foundational training in reasoning, the model cannot effectively utilize that extra time.
The Mechanics of Alignment
The core of Raschka's technical argument rests on Reinforcement Learning from Human Feedback (RLHF). He breaks down the standard three-step pipeline: pre-training, supervised fine-tuning, and alignment. "The original goal of RLHF is to align LLMs with human preferences," he explains, noting that this process is what makes models helpful and safe. However, the article dives deep into the mechanics of how this is achieved, specifically through an algorithm called Proximal Policy Optimization (PPO).
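For orientation, here is a structural outline of that three-stage pipeline in Python. The function names are placeholders standing in for entire training procedures, not a real training API, and the one-line stubs exist only so the outline runs end to end.

```python
# Placeholder stages: each returns a label standing in for model weights,
# so this structural outline runs without any real training.
def pretrain(corpus): return "base_model"
def supervised_finetune(model, pairs): return "sft_model"
def train_reward_model(model, rankings): return "reward_model"
def rlhf_ppo(policy, reward_model): return "aligned_model"

def build_aligned_llm(corpus, instruction_pairs, preference_rankings):
    base = pretrain(corpus)                              # 1) next-token prediction at scale
    sft = supervised_finetune(base, instruction_pairs)   # 2) imitate curated demonstrations
    rm = train_reward_model(sft, preference_rankings)    # 3a) learn a proxy for human preferences
    return rlhf_ppo(sft, rm)                             # 3b) optimize the policy against that proxy

aligned = build_aligned_llm(corpus=[], instruction_pairs=[], preference_rankings=[])
```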
Raschka uses a vivid analogy to demystify PPO, comparing the training process to a chef tweaking recipes based on customer feedback. "Your overall goal is to tweak your recipe (policy) based on customer feedback (reward)," he writes. This analogy effectively strips away the mathematical jargon, making the concept of policy optimization accessible. He also details how PPO limits how much the model can change in a single step to prevent instability. "One of the key ideas behind PPO is that it limits how much the policy is allowed to change during each update step," he notes, which prevents the model from "reinventing the kitchen" every week.
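That "limited change per update" idea corresponds to PPO's clipped surrogate objective. Below is a minimal PyTorch sketch of that loss; the tensor names (new_logprobs, old_logprobs, advantages) and the 0.2 clipping threshold are illustrative assumptions rather than code from Raschka's article.

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy
    # that actually generated the responses (the "current recipe").
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Unclipped objective: scale each advantage by the ratio.
    unclipped = ratio * advantages
    # Clipped objective: the ratio may not stray outside [1 - eps, 1 + eps],
    # which caps how far a single update can move the policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (elementwise minimum) of the two and negate,
    # since optimizers minimize the loss.
    return -torch.min(unclipped, clipped).mean()
```

Clipping the probability ratio is what keeps any single update from drifting too far from the policy that generated the data, the "don't reinvent the kitchen" constraint in Raschka's analogy.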
The article meticulously details the role of the reward model, which acts as an automated proxy for human judgment. "The idea here is that the reward model replaces and automates the labor-intensive human ranking to make the training feasible on large datasets," Raschka writes. This automation is the engine that allows for the massive scale of training required for reasoning models. Without a reward model to score millions of potential reasoning paths, the process would be prohibitively expensive and slow.
"The reward model replaces and automates the labor-intensive human ranking to make the training feasible on large datasets."
However, the reliance on reward models introduces a potential vulnerability. If the reward model is flawed or biased, the policy optimization will amplify those errors. While Raschka focuses on the efficiency of the method, a counterargument worth considering is the difficulty of accurately scoring complex reasoning steps compared to simple preference rankings. A model might learn to "game" the reward model by producing reasoning that looks correct but is logically flawed, a phenomenon known as reward hacking.
The Future of Training Pipelines
Despite the complexities, Raschka is optimistic about the trajectory. "And I expect reasoning-focused post-training to become standard practice in future LLM pipelines," he predicts. This is not just a prediction of feature adoption but a fundamental shift in how artificial intelligence is built. The era of training a model to predict the next word and hoping it learns to reason is ending; the new era is explicitly training the model to reason.
The article concludes by emphasizing that while reasoning isn't a "silver bullet," it "reliably improves model accuracy and problem-solving capabilities on challenging tasks." This pragmatic assessment grounds the excitement in measurable results. The evidence presented suggests that the industry is moving away from the passive accumulation of knowledge toward the active cultivation of cognitive skills.
Bottom Line
Sebastian Raschka makes a compelling case that the next frontier in AI is not bigger models, but smarter training methods that explicitly teach reasoning through reinforcement learning. The strongest part of his argument is the clear distinction between scaling data and investing compute in strategic reasoning training, backed by concrete examples of recent model releases. The biggest vulnerability lies in the assumption that current reward modeling techniques can perfectly capture the nuance of logical validity without introducing new forms of error. Readers should watch for how the industry balances the massive compute costs of these training methods against the tangible improvements in model reliability.