
The state of LLM reasoning model inference

Sebastian Raschka cuts through the hype of 2025's AI boom to reveal a counterintuitive truth: the next leap in artificial intelligence isn't necessarily about building bigger brains, but about giving existing ones more time to think. While the industry obsesses over training costs, Raschka argues that the real breakthrough lies in "inference-time compute scaling"—a strategy where models trade computational power for deeper, more deliberate reasoning at the moment a user asks a question. This is not just a technical tweak; it is a fundamental shift in how we expect machines to solve complex problems, moving from instant retrieval to simulated deliberation.

The Cost of Thought

Raschka frames the current landscape by distinguishing between two primary paths to intelligence: increasing the compute used during training versus increasing the compute used during inference. He writes, "Stronger reasoning skills allow LLMs to tackle more complex problems, making them more capable across a wide range of tasks users care about." This distinction is crucial because it suggests that the ceiling for current models has not yet been reached; we simply haven't allocated enough processing power to the moment of generation.


The author explains that unlike simple question-answering bots, reasoning models are designed to "generate intermediate steps or structured 'thought' processes." This is the engine behind the new wave of models that can solve math puzzles or debug code. Raschka notes that while training compute is a one-time expense, inference scaling is an ongoing cost that scales directly with the length of the response. "Since inference costs scale with response length (e.g., a response twice as long requires twice the compute), these training approaches are inherently linked to inference scaling," he observes. This creates a new economic reality for the industry: accuracy is no longer free, and the most capable answers will inevitably be the most expensive to generate.

Humans give better responses when given more time to think, and similarly, LLMs can improve with techniques that encourage more 'thought' during generation.

This analogy is the piece's most effective rhetorical device. It demystifies the black box of neural networks by comparing them to human cognition. However, critics might argue that this comparison glosses over the massive energy consumption required to simulate that "thinking" process, raising sustainability questions that the article touches on only implicitly through the lens of cost.

The Mechanics of 'Wait'

Raschka dives deep into the recent explosion of research following the release of DeepSeek R1, specifically highlighting a paper titled "s1: Simple Test-Time Scaling." The core innovation here is the use of "wait" tokens to force the model to pause and extend its reasoning chain. He describes this as a modern evolution of the classic "think step by step" prompt. "Their approach is twofold: Create a curated SFT dataset with 1k training examples that include reasoning traces. Control the length of responses by: a) Appending 'Wait' tokens to get the LLM to generate longer responses, self-verify, and correct itself," Raschka writes.
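
To make the mechanics concrete, here is a minimal sketch of how the wait-token approach might be wired up with a Hugging Face causal LM. The model name, the `</think>` delimiter, and the retry budget are illustrative assumptions, not the s1 authors' actual code:

```python
# Minimal sketch of extending reasoning with "Wait" tokens, assuming a
# Hugging Face causal LM that emits a plain-text end-of-thinking delimiter.
# The model name and delimiter string are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-reasoning-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

END_OF_THINKING = "</think>"  # assumed delimiter ending the reasoning trace

def generate_with_waits(prompt, max_waits=2, max_new_tokens=512):
    text = prompt
    for step in range(max_waits + 1):
        inputs = tokenizer(text, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        if step < max_waits and END_OF_THINKING in text:
            # Suppress the end-of-thinking delimiter and append "Wait" so the
            # model keeps reasoning, self-verifies, and can correct itself.
            text = text.rsplit(END_OF_THINKING, 1)[0] + " Wait"
    return text
```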

This technique, which the researchers call "budget forcing," allows developers to explicitly regulate how long a model thinks before answering. Raschka finds the results compelling, noting that the team "found their budget-forcing method more effective than other inference-scaling techniques I've discussed, like majority voting." The implication is that a single, extended chain of thought is often superior to aggregating multiple short, shallow guesses.
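
For contrast, majority voting (often called self-consistency) aggregates many independent samples rather than deepening a single one. A sketch of that baseline, where `generate` is any callable mapping a prompt to a model response and `extract_final_answer` is a hypothetical parser, not anything from the article:

```python
# Sketch of majority voting (self-consistency): sample several independent
# answers with stochastic decoding and return the most common final answer.
from collections import Counter

def extract_final_answer(response):
    # Hypothetical parser: assumes responses end with "Answer: <value>".
    return response.rsplit("Answer:", 1)[-1].strip()

def majority_vote(generate, prompt, n_samples=8):
    # `generate` should sample at nonzero temperature so answers can differ.
    answers = [extract_final_answer(generate(prompt)) for _ in range(n_samples)]
    # The most frequent final answer wins.
    return Counter(answers).most_common(1)[0][0]
```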

Yet, Raschka remains critical of the methodology's limitations. He points out that the study lacked comparisons with more sophisticated search strategies like beam search or lookahead methods. "If there's something to criticize or improve, I would've liked to see results for more sophisticated parallel inference-scaling methods," he admits. This intellectual honesty strengthens his credibility; he is not just selling the latest trend but evaluating its actual performance against the theoretical maximum.

The researchers were inspired by the 'Aha moment' figure in the DeepSeek-R1 paper, where researchers saw LLMs coming up with something like 'Wait, wait. Wait. That's an aha moment I can flag here.'

The choice of the word "Wait" is not arbitrary. It reflects a specific behavioral pattern observed in reinforcement learning where models self-correct. Raschka notes that while researchers tried tokens like "Hmm," the "Wait" token performed slightly better. This granular detail suggests that the specific phrasing of the prompt can have a measurable impact on the model's internal logic, a finding that could revolutionize how we engineer prompts for critical applications.

Beyond the Single Path

The commentary then shifts to other emerging strategies, such as "Test-Time Preference Optimization" (TPO) and methods to combat "underthinking." TPO involves an iterative process where the model generates multiple responses, scores them, and refines its output based on textual feedback. Raschka explains that this allows for "on-the-fly alignment" without altering the underlying model weights. This is a significant development because it suggests that models can be aligned to human values dynamically, rather than being frozen in the state they were in at the end of training.
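
A rough sketch of what such an iterative loop could look like. Here `llm` and `score` stand in for a chat model and a reward model, and the critique/revision prompts are illustrative, not the TPO authors' exact templates:

```python
# Sketch of a Test-Time Preference Optimization (TPO) style loop. `llm` maps
# a prompt to a response and `score` is a reward model; both are stand-ins.
# Note that the model weights are never updated: alignment happens purely
# at inference time, via textual feedback.
def tpo_loop(llm, score, query, n_candidates=4, n_iterations=3):
    candidates = [llm(query) for _ in range(n_candidates)]
    for _ in range(n_iterations):
        ranked = sorted(candidates, key=score, reverse=True)
        best, worst = ranked[0], ranked[-1]
        # "Textual" feedback: have the model articulate why the best
        # candidate beats the worst one.
        critique = llm(
            f"Query: {query}\nBetter response: {best}\n"
            f"Worse response: {worst}\n"
            "Explain what makes the better response better."
        )
        # Refine: regenerate candidates conditioned on the critique.
        candidates = [
            llm(
                f"Query: {query}\nDraft: {best}\nCritique: {critique}\n"
                "Rewrite the draft to address the critique."
            )
            for _ in range(n_candidates)
        ]
    return max(candidates, key=score)
```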

Another paper, "Thoughts Are All Over the Place," addresses a phenomenon where models switch reasoning paths too frequently, leading to errors. The authors introduce a "Thought Switching Penalty" to discourage these premature transitions. Raschka highlights that this approach "does not require model fine-tuning and empirically improves accuracy across multiple challenging test sets." This reinforces the central thesis: we can extract significant performance gains by simply changing how the model navigates its own thought process, rather than retraining it from scratch.
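
In decoding terms, such a penalty can be expressed as a bias on the logits of tokens that typically open a new line of reasoning. Below is a sketch written as a Hugging Face LogitsProcessor; the trigger tokens, penalty strength, and window length are illustrative assumptions rather than the paper's exact settings:

```python
# Sketch of a thought-switching penalty as a Hugging Face LogitsProcessor:
# subtract a constant from the logits of switch-triggering tokens (e.g.,
# "Alternatively") for the first `window` decoding steps, discouraging
# premature jumps to a new reasoning path. All values are illustrative.
from transformers import LogitsProcessor

class ThoughtSwitchingPenalty(LogitsProcessor):
    def __init__(self, switch_token_ids, penalty=3.0, window=200):
        self.switch_token_ids = switch_token_ids  # ids of "Alternatively", etc.
        self.penalty = penalty                    # penalty strength
        self.window = window                      # steps during which it applies
        self.step = 0

    def __call__(self, input_ids, scores):
        if self.step < self.window:
            scores[:, self.switch_token_ids] -= self.penalty
        self.step += 1
        return scores
```

The processor would be passed to `model.generate` via a `LogitsProcessorList`; consistent with the paper's claim, no fine-tuning is involved, only a change to how tokens are selected at decode time.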

However, the article also acknowledges the limits of this approach. In the section on adversarial robustness, Raschka notes that while increased compute helps defend against attacks, "improvements in settings involving policy ambiguities or loophole exploitation are limited." This is a vital caveat. It reminds the reader that more compute is not a panacea; it cannot solve fundamental logical gaps or exploit ambiguities in the model's training data.

Unlike adversarial training, this method does not need any special training or require prior knowledge of specific attack types.

This observation underscores the efficiency of inference-time scaling. It offers a defensive capability that is adaptable and immediate, contrasting with the slow, resource-intensive process of retraining models to patch security holes. Yet, the limitation regarding policy ambiguities suggests that the models are still bound by the quality of their initial training, and no amount of "thinking time" can fully overcome a flawed foundation.

Bottom Line

Sebastian Raschka's analysis provides a necessary corrective to the industry's obsession with model size, arguing that the path to superior reasoning lies in the strategic allocation of compute during the inference phase. The strongest part of his argument is the demonstration that simple interventions, like "wait" tokens, can yield disproportionate gains in accuracy. However, the piece's biggest vulnerability is the assumption that the cost of this extra compute will be sustainable for widespread commercial use. As we move forward, the industry will need to balance the brilliance of these reasoning models against the economic reality of paying for every extra second of "thought."

Accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling.

Sources

The state of LLM reasoning model inference

by Sebastian Raschka · Ahead of AI

Improving the reasoning abilities of large language models (LLMs) has become one of the hottest topics in 2025, and for good reason. Stronger reasoning skills allow LLMs to tackle more complex problems, making them more capable across a wide range of tasks users care about.

In the last few weeks, researchers have shared a large number of new strategies to improve reasoning, including scaling inference-time compute, reinforcement learning, supervised fine-tuning, and distillation. And many approaches combine these techniques for greater effect. 

This article explores recent research advancements in reasoning-optimized LLMs, with a particular focus on the inference-time compute scaling methods that have emerged since the release of DeepSeek R1.

Implementing and improving reasoning in LLMs: The four main categories.

Since most readers are likely already familiar with LLM reasoning models, I will keep the definition short: An LLM-based reasoning model is an LLM designed to solve multi-step problems by generating intermediate steps or structured "thought" processes. Unlike simple question-answering LLMs that just share the final answer, reasoning models either explicitly display their thought process or handle it internally, which helps them to perform better at complex tasks such as puzzles, coding challenges, and mathematical problems.

In general, there are two main strategies to improve reasoning: (1) increasing training compute or (2) increasing inference compute, also known as inference-time scaling or test-time scaling. (Inference compute refers to the processing power required to generate model outputs in response to a user query after training.)

Note that the plots shown above make it look like we improve reasoning either via train-time compute OR test-time compute. However, LLMs are usually designed to improve reasoning by combining heavy train-time compute (extensive training or fine-tuning, often with reinforcement learning or specialized data) and increased test-time compute (allowing the model to "think longer" or perform extra computation during inference).

To understand how reasoning models are being developed and improved, I think it remains useful to look at the different techniques separately. In my previous article, Understanding Reasoning LLMs, I discussed a finer categorization into four categories, as summarized in the figure below.

Methods 2-4 in the figure above typically produce models that generate longer responses because they include intermediate steps and explanations in their outputs. Since inference costs scale with response length (e.g., a response twice as long requires twice the compute), these training approaches are inherently linked to inference scaling. However, in this section on inference-time compute scaling, ...