Understanding Reasoning LLMs

In a field often clouded by marketing hype, Sebastian Raschka cuts through the noise to reveal that the latest breakthrough in artificial intelligence isn't a new architecture, but a fundamental shift in how models are trained to think. The most surprising claim here isn't that machines can solve complex math problems, but that this capability can emerge from pure reinforcement learning without the traditional, labor-intensive step of supervised fine-tuning. For executives and engineers navigating the 2025 landscape, this distinction is not academic; it dictates whether you build a custom reasoning engine or simply prompt a general model to "think harder."

Defining the Reasoning Gap

Raschka begins by dismantling the vague terminology that plagues the industry. He argues that we must stop conflating basic knowledge retrieval with actual reasoning. "If you work in AI (or machine learning in general), you are probably familiar with vague and hotly debated definitions," he writes, before offering a precise operational definition: "reasoning" as the process of answering questions that require complex, multi-step generation with intermediate steps.

This framing is crucial because it separates the trivial from the transformative. A model reciting the capital of France is not reasoning; a model calculating the trajectory of a projectile based on variable wind speeds is. Raschka notes that while many modern models can mimic this behavior, true "reasoning models" are those specifically refined to excel at puzzles, advanced math, and coding challenges. The distinction matters because, as he points out, "transforming an LLM into a reasoning model also introduces certain drawbacks." These models are more expensive, more verbose, and prone to "overthinking" on simple tasks. The strategic takeaway is clear: using a sledgehammer to crack a nut is not just inefficient; it is a costly error in resource allocation.

Use the right tool (or type of LLM) for the task.

The DeepSeek Blueprint

The article's centerpiece is a dissection of the DeepSeek R1 pipeline, which Raschka presents as a new industry standard. He outlines a three-stage evolution that challenges the conventional wisdom of model development. The first stage, DeepSeek-R1-Zero, is particularly provocative. It was trained using "pure" reinforcement learning, skipping the supervised fine-tuning (SFT) step that has been the bedrock of alignment for years.

Raschka highlights the significance of this "cold start" approach: "The researchers observed an 'Aha!' moment, where the model began generating reasoning traces as part of its responses despite not being explicitly trained to do so." This suggests that the ability to chain thoughts together is not merely a result of mimicking human data, but an emergent behavior when a model is rewarded strictly for accuracy and format. This is a powerful argument for the potential of reinforcement learning to unlock capabilities that supervised data alone cannot teach.

However, the pure RL approach had limits. The flagship model, DeepSeek-R1, returned to a hybrid approach, combining the "cold-start" data with additional supervised fine-tuning and further reinforcement learning. Raschka explains that this hybrid method, which includes a "consistency reward" to prevent language mixing, mirrors the likely development path of other top-tier models like OpenAI's o1. The implication is that while pure RL can spark reasoning, a polished, production-ready model still requires the stability of supervised data.

Critics might note that relying on "cold start" data generated by an unrefined model could introduce feedback loops of error, potentially cementing bad habits before the model is even fully trained. Yet, Raschka's analysis suggests that the sheer volume of verifiable rewards—using compilers for code and deterministic systems for math—acts as a sufficient guardrail against this risk.
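
To make that guardrail concrete, here is a minimal sketch of what rule-based, verifiable rewards can look like. It is illustrative only, not DeepSeek's actual reward code: the "#### answer" convention, the <think> tag check, and the helper names are assumptions for the example.

```python
import re
import subprocess
import tempfile

def math_reward(response: str, reference_answer: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Assumes the answer sits on a trailing line of the form '#### <answer>'
    (a convention borrowed from math benchmarks); real pipelines use stricter
    parsing and numeric normalization.
    """
    lines = response.strip().splitlines()
    if not lines or not lines[-1].startswith("####"):
        return 0.0
    answer = lines[-1].removeprefix("####").strip()
    return 1.0 if answer == reference_answer.strip() else 0.0

def code_reward(candidate_code: str, test_code: str) -> float:
    """Return 1.0 if the candidate program passes the supplied unit tests.

    An interpreter plus tests acts as a deterministic verifier, so no human
    labels are needed to score the sample.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

def format_reward(response: str) -> float:
    """Small bonus when the reasoning trace is wrapped in <think>...</think> tags."""
    return 0.2 if re.search(r"<think>.+?</think>", response, re.DOTALL) else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    # Accuracy dominates; the format term only nudges the model toward
    # readable, well-delimited reasoning traces.
    return math_reward(response, reference_answer) + format_reward(response)
```

Because both checks are deterministic, a bad reasoning trace that still reaches a wrong answer earns no reward, which limits how far early errors in the "cold start" data can propagate.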

Scaling Thought at Inference Time

Beyond training pipelines, Raschka explores the concept of "inference-time scaling." This is the idea that we can improve a model's output quality not by making the model bigger, but by giving it more time and computational resources to "think" before answering. He draws a compelling analogy to human cognition: "A rough analogy is how humans tend to generate better responses when given more time to think through complex problems."

This section moves the conversation from model weights to application architecture. Techniques like chain-of-thought prompting, where the user explicitly asks the model to "think step by step," are described as a form of inference-time scaling. More advanced strategies involve voting mechanisms or search algorithms that generate multiple potential answers and select the best one. Raschka notes that while DeepSeek categorized some of these explicit search methods as "unsuccessful attempts" in their specific report, the broader industry is likely to embrace them. "I suspect that OpenAI's o1 and o3 models use inference-time scaling," he writes, which would explain their higher operational costs compared to standard models.
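
As a concrete instance of the voting idea, the sketch below implements simple self-consistency: sample several chain-of-thought completions at a nonzero temperature and keep the most frequent final answer. It assumes an OpenAI-compatible Python client; the model name, the "Final answer:" convention, and the sample count are placeholder choices, not any vendor's documented reasoning API.

```python
from collections import Counter
from openai import OpenAI  # any OpenAI-compatible client works the same way

client = OpenAI()  # assumes an API key is configured in the environment

def self_consistency_answer(question: str, n_samples: int = 8,
                            model: str = "gpt-4o-mini") -> str:
    """Sample several reasoning traces and return the majority-vote answer."""
    prompt = (
        f"{question}\n\n"
        "Think step by step, then give your result on a final line "
        "starting with 'Final answer:'."
    )
    answers = []
    for _ in range(n_samples):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,  # diversity across samples is what makes voting useful
        )
        text = reply.choices[0].message.content or ""
        for line in reversed(text.splitlines()):
            if line.lower().startswith("final answer:"):
                answers.append(line.split(":", 1)[1].strip())
                break
    # More samples means more inference-time compute and a higher bill,
    # but usually a more reliable final answer.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```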

Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks.

This observation forces a re-evaluation of cost-benefit analysis for AI deployments. If the best results come from models that spend more tokens generating intermediate steps, the cost per query rises significantly. For a business application, this means reasoning models are a specialized tool for high-stakes problems, not a drop-in replacement for every customer service chatbot.
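
A rough back-of-the-envelope calculation makes the point; the prices and token counts below are made-up placeholders rather than quoted rates for any particular model.

```python
# Illustrative numbers only: neither the prices nor the token counts refer
# to any specific provider or model.
price_per_1k_output_tokens = {"generalist": 0.0006, "reasoning": 0.0060}

standard_reply_tokens = 300           # a direct answer
reasoning_reply_tokens = 300 + 2_500  # the same answer plus a long thinking trace

generalist_cost = standard_reply_tokens / 1000 * price_per_1k_output_tokens["generalist"]
reasoning_cost = reasoning_reply_tokens / 1000 * price_per_1k_output_tokens["reasoning"]

print(f"generalist: ${generalist_cost:.4f} per query")  # ~$0.0002
print(f"reasoning:  ${reasoning_cost:.4f} per query")   # ~$0.0168, roughly 90x more
```

Even with invented numbers, a long thinking trace at a premium per-token price multiplies the cost per query by roughly two orders of magnitude, which is why routing only high-stakes requests to a reasoning model matters.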

Bottom Line

Sebastian Raschka's most valuable contribution is demystifying the "reasoning" label, stripping away the mystique to reveal a concrete methodology centered on reinforcement learning and intermediate step generation. The strongest part of his argument is the demonstration that reasoning can emerge from pure RL, challenging the industry's reliance on massive supervised datasets. However, the biggest vulnerability remains the economic reality: these models are expensive and inefficient for general tasks, meaning the future of AI will likely be a fragmented landscape of specialized, costly reasoning engines alongside cheaper, faster generalists. Readers should watch how the industry balances the high cost of these "thinking" models against the tangible value they bring to complex problem-solving.

Sources

Understanding Reasoning LLMs

by Sebastian Raschka · Ahead of AI

This article describes the four main approaches to building reasoning models, or how we can enhance LLMs with reasoning capabilities. I hope this provides valuable insights and helps you navigate the rapidly evolving literature and hype surrounding this topic.

In 2024, the LLM field saw increasing specialization. Beyond pre-training and fine-tuning, we witnessed the rise of specialized applications, from RAGs to code assistants. I expect this trend to accelerate in 2025, with an even greater emphasis on domain- and application-specific optimizations (i.e., "specializations").

The development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specialization does not replace other LLM applications, because transforming an LLM into a reasoning model also introduces certain drawbacks, which I will discuss later.

To give you a brief glimpse of what's covered below, in this article, I will:

Explain the meaning of "reasoning model"

Discuss the advantages and disadvantages of reasoning models

Outline the methodology behind DeepSeek R1

Describe the four main approaches to building and improving reasoning models

Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases

Provide tips for developing reasoning models on a tight budget

I hope you find this article useful as AI continues its rapid development this year!

How do we define "reasoning model"?

If you work in AI (or machine learning in general), you are probably familiar with vague and hotly debated definitions. The term "reasoning models" is no exception. Eventually, someone will define it formally in a paper, only for it to be redefined in the next, and so on.

In this article, I define "reasoning" as the process of answering questions that require complex, multi-step generation with intermediate steps. For example, factual question-answering like "What is the capital of France?" does not involve reasoning. In contrast, a question like "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" requires some simple reasoning. For instance, it requires recognizing the relationship between distance, speed, and time before arriving at the answer.
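
Spelled out, the intermediate step in the train example is just the distance formula:

\[
\text{distance} = \text{speed} \times \text{time} = 60\ \text{mph} \times 3\ \text{h} = 180\ \text{miles}
\]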

Most modern LLMs are capable of basic reasoning and can answer questions like, "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" So, today, when we refer to reasoning models, ...