
Noteworthy AI research papers of 2024

In a year dominated by hype cycles and parameter races, Sebastian Raschka offers a rare, grounded audit of what actually moved the needle in artificial intelligence. Rather than chasing the latest marketing claims, Raschka dissects the mechanical shifts in how models are trained, how they reason, and how they see the world. This is not a list of shiny new toys; it is a technical roadmap for understanding why the industry is pivoting from simply building bigger models to making them think harder.

The Evolution of Open Weights

Raschka begins by anchoring the year in the release of Meta's Llama 3 family, noting that the true story isn't just the model sizes, but the sophistication of the training pipeline. He writes, "What's notable about the Llama 3 model family is the increased sophistication of the pre-training and post-training pipelines compared to its Llama 2 predecessor." This observation is crucial because it signals a maturation in the field; the low-hanging fruit of simply scaling data is being replaced by more nuanced engineering. The author details how the pre-training process became multi-staged and how post-training shifted from Reinforcement Learning with Human Feedback to Direct Preference Optimization.
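The shift from RLHF to Direct Preference Optimization can be made concrete with the DPO objective, which trains directly on preference pairs without a separate reward model. Below is a minimal sketch of the per-example DPO loss; the log-probability values are hypothetical placeholders, not outputs of any real model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example Direct Preference Optimization loss.

    Each argument is the summed token log-probability of the chosen or
    rejected response under the policy being trained or under the frozen
    reference model. beta controls how far the policy may drift from the
    reference.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid pushes the margin to be large and positive.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen answer more than the reference does:
low = dpo_loss(-12.0, -20.0, -14.0, -18.0)   # margin = +4, small loss
# Policy prefers the rejected answer: the loss is higher.
high = dpo_loss(-20.0, -12.0, -18.0, -14.0)  # margin = -4, larger loss
print(low < high)  # True
```

The appeal over RLHF is operational: no reward model to train and no reinforcement-learning loop, just a supervised-style loss over preference pairs.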


This shift matters for anyone deploying these systems. Raschka notes that while the architecture resembles its predecessor, the "larger vocabulary and the introduction of grouped-query attention" represent practical upgrades for efficiency. He argues that despite fierce competition from models like Qwen 2.5 and Gemma 2, Llama remains the industry standard. "I believe Llama will remain the go-to model for most users, much like ChatGPT has retained its popularity despite competition from options like Anthropic Claude, Google Gemini, DeepSeek, and others," Raschka asserts.

The field now includes many competitive open-source and open-weight LLMs, yet Llama will remain the go-to model for most users.

Critics might argue that this brand loyalty is fragile, especially as proprietary models pull ahead in reasoning benchmarks. However, Raschka's point about the ecosystem—how easy these models are to finetune and understand—suggests that community adoption creates a moat that raw performance metrics alone cannot breach.

The Inference Revolution

Perhaps the most significant insight in Raschka's analysis comes from his examination of test-time compute. The conventional wisdom has been that to get better answers, you must train a bigger model. Raschka challenges this by highlighting research suggesting that "improving LLM responses during inference time" can be more effective than scaling parameters. He draws a compelling analogy: "Suppose that humans, on hard tasks, can generate better responses if they are given more time to think. Analogously, LLMs may be able to produce better outputs given more time/resources to generate their responses."

The paper he highlights suggests that for easy and medium questions, a smaller model with extra compute can match the performance of a model fourteen times its size. The economic and environmental implications are massive. Raschka explains that the strategy involves generating multiple solutions and using a verifier to pick the best one, or sequentially revising the response. "For challenging questions, larger models outperform smaller models that get additional inference compute via the inference scaling strategies discussed earlier. However, for easy and medium questions, inference time compute can be used to match the performance of 14x larger models at the same compute budget!" he writes.
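The "generate multiple solutions, let a verifier pick the best" strategy is essentially best-of-N sampling. The sketch below illustrates the control flow only: `generate` and `verifier_score` are toy stand-ins for a real LLM and a trained verifier, using a simple arithmetic task so the code is self-contained.

```python
import random

def generate(prompt, rng):
    """Stand-in for an LLM sampling one candidate answer.
    Here it just proposes a noisy guess for a toy arithmetic task."""
    true_answer = 17 * 23
    return true_answer + rng.choice([-3, -1, 0, 0, 1, 4])

def verifier_score(prompt, candidate):
    """Stand-in for a learned verifier; higher scores are better.
    For this toy task we can check the arithmetic directly, whereas a
    real verifier is itself a trained model."""
    return -abs(candidate - 17 * 23)

def best_of_n(prompt, n=8, seed=0):
    """Best-of-N inference scaling: sample n candidates and keep the
    one the verifier scores highest. More compute (larger n) buys a
    better chance that a correct candidate is in the pool."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(prompt, c))

print(best_of_n("What is 17 * 23?"))  # picks the highest-scoring candidate
```

The key property is monotonicity in compute: with the same sampling stream, the best of 32 candidates can never score worse than the best of the first one, which is exactly the knob the inference-scaling paper turns.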

This reframing moves the conversation from "how big is your model?" to "how smart is your deployment strategy?" It suggests that the future of AI efficiency lies in dynamic resource allocation rather than static model bloat. While this approach increases latency and cost per query for complex tasks, Raschka rightly identifies it as a critical tool for on-device AI, where massive models simply cannot fit.

Seeing and Reasoning

The commentary then turns to the messy reality of multimodal systems and the elusive nature of reasoning. Raschka analyzes NVIDIA's NVLM paper, which provides a rare, head-to-head comparison of different architectural approaches to vision and language. He notes that the industry is split between unified embedding decoders and cross-modality attention mechanisms. The hybrid approach, which combines a thumbnail for context with high-resolution patches for detail, emerges as the most robust solution. "NVLM-H: Combines the strengths of both approaches for optimal performance," Raschka summarizes, highlighting the trend toward specialized architectures rather than one-size-fits-all models.
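The hybrid NVLM-H design can be illustrated at the preprocessing level: a low-resolution thumbnail of the whole image supplies global context, while a grid of high-resolution tiles preserves detail. The sketch below computes only the crop geometry; the tile size and grid logic are illustrative assumptions, not NVIDIA's exact pipeline.

```python
def hybrid_crops(width, height, tile=448):
    """Return crop boxes (left, top, right, bottom): one whole-image
    thumbnail box plus a grid of high-resolution tiles.

    The thumbnail is later resized down to give the model global
    context; each tile keeps local detail at full resolution.
    """
    # Round the image up to a whole number of tiles in each direction.
    cols = max(1, (width + tile - 1) // tile)
    rows = max(1, (height + tile - 1) // tile)
    thumbnail = (0, 0, width, height)  # resized to tile x tile downstream
    tiles = [
        (c * tile, r * tile,
         min((c + 1) * tile, width), min((r + 1) * tile, height))
        for r in range(rows)
        for c in range(cols)
    ]
    return thumbnail, tiles

thumb, tiles = hybrid_crops(1024, 768)
print(len(tiles))  # 3 columns x 2 rows = 6 tiles
```

Each crop is then encoded separately by the vision encoder, so the token budget grows with image resolution rather than being fixed by a single downscaled view.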

Finally, Raschka tackles the mystery of OpenAI's o1 model and the concept of "journey learning." He contrasts this with traditional "shortcut learning," where models are trained only on the correct answer. In journey learning, the model is trained on the entire trial-and-error process. "Traditionally, LLMs are trained on the correct solution path (shortcut learning); in journey learning, the supervised finetuning encompasses the whole trial-and-error correction process," he explains. This distinction is vital because it suggests that the next generation of AI intelligence won't come from more data, but from teaching models how to struggle with a problem before solving it.
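The difference between the two training signals becomes concrete when you look at what actually lands in the finetuning data. The toy illustration below uses a format and field names of my own invention, not anything from the o1 replication work; it only shows that the journey target contains the failed attempt and the reflection, not just the clean answer.

```python
def shortcut_example(question, solution):
    """Shortcut learning: the training target is only the clean,
    correct solution path."""
    return {"prompt": question, "target": solution}

def journey_example(question, attempts, solution):
    """Journey learning: the target includes the wrong turns, the
    reflections that catch them, and the eventual correction."""
    trace = []
    for attempt, reflection in attempts:
        trace.append(f"Attempt: {attempt}")
        trace.append(f"Reflection: {reflection}")
    trace.append(f"Corrected solution: {solution}")
    return {"prompt": question, "target": "\n".join(trace)}

q = "What is 15% of 80?"
short = shortcut_example(q, "0.15 * 80 = 12")
journey = journey_example(
    q,
    [("15 / 80 = 0.1875", "That divides instead of taking a percentage.")],
    "0.15 * 80 = 12",
)
print(len(journey["target"]) > len(short["target"]))  # True
```

Training on the longer trace teaches the model what an error looks like and how to recover from one, which a dataset of only correct paths never demonstrates.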

The next generation of AI intelligence won't come from more data, but from teaching models how to struggle with a problem before solving it.

A counterargument worth considering is whether this "journey learning" is simply a more expensive form of reinforcement learning that may not scale indefinitely. However, the philosophical shift Raschka identifies—that the path to the answer is as important as the answer itself—resonates with how human cognition actually works.

Bottom Line

Sebastian Raschka's analysis succeeds by stripping away the marketing veneer to reveal the mechanical gears turning beneath. The strongest part of his argument is the pivot from parameter scaling to inference-time compute, a shift that promises to redefine the economics of AI deployment. The biggest vulnerability in the current landscape remains the opacity of proprietary reasoning models, leaving the open-source community to reverse-engineer the very techniques that define the next frontier. For the busy professional, the takeaway is clear: the future belongs not to those with the biggest models, but to those who know how to make them think.

Sources

Noteworthy AI research papers of 2024

by Sebastian Raschka · Ahead of AI

I hope your 2025 is off to a great start! To kick off the year, I've finally been able to complete the second part of this AI Research Highlights of 2024 article. It covers a variety of relevant topics, from mixture-of-experts models to new LLM scaling laws for precision.

Note that this article is Part Two in this series, focusing on the second half of 2024, from July through December. You can find Part One, covering January to June, here.

The selection criteria are admittedly subjective, based on what stood out to me this year. I've also aimed for some variety, so it's not all just about LLM model releases.

I hope you are having a great 2025, and happy reading!

7. July: The Llama 3 Herd of Models.

Readers are probably already familiar with Meta AI's Llama 3 models and paper, but since these are such important and widely-used models, I want to dedicate the July section to The Llama 3 Herd of Models (July 2024) paper by Grattafiori and colleagues.

What's notable about the Llama 3 model family is the increased sophistication of the pre-training and post-training pipelines compared to its Llama 2 predecessor. Note that this is not only true for Llama 3 but other LLMs like Gemma 2, Qwen 2, Apple's Foundation Models, and others, as I described a few months ago in my New LLM Pre-training and Post-training Paradigms article.

7.1 Llama 3 architecture summary.

Llama 3 was first released in 8 billion and 70 billion parameter sizes, but the team kept iterating on the model, releasing 3.1, 3.2, and 3.3 versions of Llama. The sizes are summarized below: 

Llama 3 (April 2024): 8B and 70B parameters

Llama 3.1 (July 2024, discussed in the paper): 8B, 70B, and 405B parameters

Llama 3.2 (September 2024): 1B, 3B, 11B (vision-enabled), and 90B (vision-enabled) parameters

Llama 3.3 (December 2024): 70B parameters

Overall, the Llama 3 architecture closely resembles that of Llama 2. The key differences lie in its larger vocabulary and the introduction of grouped-query attention for the smaller model variant. A summary of the differences is shown in the figure below.
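The practical payoff of grouped-query attention is KV-cache memory: several query heads share one key/value head, so the cache shrinks by the group factor. A back-of-the-envelope calculation using the commonly cited Llama 3 8B configuration (32 layers, 32 query heads, 8 KV heads, head dimension 128; treat these numbers as assumptions):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    """Memory needed for the key/value cache at a given sequence length.
    The factor of 2 covers both the K and the V tensors; bytes_per_val=2
    assumes 16-bit precision."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

seq = 8192
mha = kv_cache_bytes(32, 32, 128, seq)  # multi-head: every head has its own K/V
gqa = kv_cache_bytes(32, 8, 128, seq)   # grouped-query: 8 shared KV heads
print(mha // gqa)  # 4 -- 32 query heads sharing 8 KV heads = 4x smaller cache
```

At an 8K context this works out to roughly 4 GiB for full multi-head attention versus about 1 GiB with grouped-query attention, which is exactly the kind of efficiency upgrade the paragraph above describes.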

If you're curious about architectural details, a great way to learn is by implementing the model from scratch and loading pretrained weights as a sanity check. I have a GitHub repository with a from-scratch implementation that converts GPT-2 to Llama ...