New LLM pre-training and post-training paradigms

Sebastian Raschka cuts through the noise of hundreds of monthly AI papers to reveal a quiet but seismic shift: the era of brute-force data scaling is ending, replaced by a ruthless obsession with data quality and synthetic augmentation. For busy professionals tracking the AI landscape, this analysis is vital because it exposes the hidden engineering trade-offs behind the latest models from Apple, Google, and Alibaba, proving that raw parameter counts no longer guarantee dominance.

The Quality Over Quantity Pivot

Raschka begins by dismantling the assumption that more data always equals better performance. He points to Alibaba's Qwen 2, which achieved competitive results using significantly fewer tokens than its peers, provided those tokens were rigorously filtered. "One of the focus areas has been improving the data filtering pipeline to remove low-quality data and enhancing data mixing to increase data diversity," Raschka writes. This observation is critical; it suggests the industry has hit a point of diminishing returns on raw internet scrapes. The argument lands because it aligns with the practical reality that garbage in still produces garbage out, regardless of model size.
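The filtering pipelines Raschka describes typically combine simple heuristics of this kind. The sketch below is illustrative only; the function name and thresholds are invented here and do not come from the Qwen 2 report:

```python
def passes_quality_filters(doc: str,
                           min_words: int = 50,
                           max_symbol_ratio: float = 0.1,
                           max_dup_line_ratio: float = 0.3) -> bool:
    """Illustrative heuristic filters of the kind used in pre-training
    data pipelines (thresholds here are invented for demonstration)."""
    words = doc.split()
    if len(words) < min_words:  # drop very short documents
        return False
    # Excessive non-alphanumeric symbols often indicate markup or boilerplate.
    symbols = sum(1 for ch in doc if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(doc), 1) > max_symbol_ratio:
        return False
    # Heavily repeated lines suggest navigation menus or scraped templates.
    lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
    if lines and 1 - len(set(lines)) / len(lines) > max_dup_line_ratio:
        return False
    return True

docs = ["word " * 100, "menu\nmenu\nmenu\n" * 40, "too short"]
kept = [d for d in docs if passes_quality_filters(d)]  # only the first survives
```

Real pipelines layer many more signals on top (language identification, perplexity under a reference model, deduplication), but the shape is the same: cheap, composable predicates applied at massive scale.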

However, the most striking revelation in Raschka's analysis is the heavy reliance on synthetic data—using existing AI models to generate training material for new ones. He notes that Qwen 2 used previous generations of itself to "synthesize additional pre-training data" and create high-quality instruction pairs. This creates a fascinating, if slightly circular, feedback loop. Critics might note that training on AI-generated data risks "model collapse," where errors compound over generations, but Raschka's evidence suggests that when combined with strict human filtering, this approach currently yields superior efficiency.
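The synthesis-plus-filtering loop Raschka describes can be sketched abstractly. Everything below is hypothetical scaffolding: `generate` and `score` stand in for real model calls, and the prompt format is made up for illustration:

```python
from typing import Callable

def synthesize_instruction_pairs(generate: Callable[[str], str],
                                 score: Callable[[str, str], float],
                                 seed_topics: list[str],
                                 threshold: float = 0.7) -> list[dict]:
    """Sketch of a self-synthesis loop: an existing model drafts
    instruction/response pairs, and only pairs clearing a quality
    score survive into the new training set."""
    pairs = []
    for topic in seed_topics:
        instruction = generate(f"Write a task about: {topic}")
        response = generate(instruction)
        if score(instruction, response) >= threshold:  # filter step
            pairs.append({"instruction": instruction, "response": response})
    return pairs

# Toy stand-ins so the sketch runs without a real model:
demo = synthesize_instruction_pairs(
    generate=lambda prompt: f"[model output for: {prompt}]",
    score=lambda ins, res: 0.9,
    seed_topics=["sorting algorithms"],
)
```

The filtering step is what distinguishes this from naive self-training, and it is precisely the guardrail Raschka credits with keeping model collapse at bay.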

"More is better, but only if it meets certain quality standards."

Apple's Three-Stage Discipline

Moving to Apple's Foundation Models, Raschka highlights a disciplined, three-stage pre-training process that prioritizes specific skills over general breadth. He praises the company for respecting `robots.txt` files and decontaminating benchmark data, actions that are increasingly rare in the industry. "Quality was much more important than quantity," Raschka observes, noting that Apple's team deliberately down-weighted lower-quality web crawls in favor of math and code.
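Down-weighting a data source in practice means adjusting the sampling probabilities of the training mixture. The weights below are invented purely to illustrate the mechanism; Apple's actual proportions are not public:

```python
import random

# Hypothetical mixture weights: raw web crawl down-weighted in favor
# of math and code (proportions invented for illustration).
mixture = {"web_crawl": 0.30, "math": 0.35, "code": 0.35}

def sample_source(weights: dict[str, float], rng: random.Random) -> str:
    """Pick which data source the next training document is drawn from."""
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[sample_source(mixture, rng)] += 1
# counts now roughly reflects the 30/35/35 split
```

Tuning these weights per training stage is what makes a "three-stage" curriculum possible: the same sampler, different mixtures at each phase.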

This approach is particularly notable for its use of knowledge distillation, where a massive "teacher" model trains a smaller "student" model. Raschka explains that the on-device Apple model was "distilled and pruned from a larger 6.4-billion-parameter model," allowing it to punch above its weight class. This is a strategic masterstroke for mobile deployment, where compute and battery life are constrained. The core of the argument is that the future of AI isn't just about building bigger servers; it's about compressing intelligence into devices we carry in our pockets.
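The standard distillation objective matches the student's output distribution to the teacher's temperature-softened distribution via a KL divergence. This is a minimal NumPy sketch of that textbook loss, not Apple's actual training code:

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions,
    the classic knowledge-distillation objective."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean())

teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[1.5, 0.7, -0.5]])
loss = distillation_loss(student, teacher)  # small positive number
```

The temperature softens both distributions so the student also learns from the teacher's relative rankings of wrong answers, which is where much of the "compressed intelligence" lives.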

Yet, Raschka acknowledges the opacity that remains. He admits, "Unfortunately, another theme of the technical reports is that details about the dataset are scarce," a frustration shared by the entire research community. Without full transparency on data sources, it is difficult to independently verify the robustness of these quality claims.

The Post-Training Convergence

In the realm of post-training, Raschka identifies a clear convergence toward Direct Preference Optimization (DPO) as the new standard for aligning models with human values. He contrasts this with older, more complex reinforcement learning methods, noting that the SFT (Supervised Fine-Tuning) plus DPO pipeline is becoming dominant due to its stability and ease of use. "The SFT+DPO approach seems to be the most popular preference tuning strategy at the moment due to the ease of use compared to other methods," he writes.
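Part of DPO's appeal is that its loss is a simple closed-form expression over log-probabilities, with no reward model or RL loop. Here is a per-example sketch of that published loss (the input values are made up):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy margin minus
    margin under the frozen reference model))."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    logits = beta * (policy_margin - ref_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid)

# If the policy prefers the chosen answer more strongly than the
# reference does, the loss drops below -log(0.5) ≈ 0.693.
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_logp_chosen=-6.0, ref_logp_rejected=-7.0)
```

Because the gradient flows directly through these log-probabilities, DPO trains like ordinary supervised fine-tuning, which is exactly the stability and ease-of-use advantage Raschka highlights.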

This shift matters because it lowers the barrier to entry for creating aligned models, potentially democratizing access to high-quality AI assistants. However, the reliance on synthetic data in this phase also raises questions about the diversity of human preferences being captured. If the training data is generated by the model itself, there is a risk of narrowing the scope of what the AI considers "helpful" or "harmless."

Bottom Line

Raschka's analysis provides a crucial corrective to the hype cycle, demonstrating that the next leap in AI capability will come from smarter data curation and architectural efficiency rather than just scaling up. The strongest part of his argument is the empirical evidence that smaller, cleaner datasets outperform massive, noisy ones, a lesson that will likely reshape how organizations approach their own AI strategies. The biggest vulnerability remains the lack of transparency in data sourcing, which leaves the industry to trust vendor claims over independent verification.

Sources

New LLM pre-training and post-training paradigms

by Sebastian Raschka · Ahead of AI

The development of large language models (LLMs) has come a long way, from the early GPT models to the sophisticated open-weight LLMs we have today. Initially, the LLM training process focused solely on pre-training, but it has since expanded to include both pre-training and post-training. Post-training typically encompasses supervised instruction fine-tuning and alignment, which was popularized by ChatGPT.

Training methodologies have evolved since ChatGPT was first released. In this article, I review the latest advancements in both pre-training and post-training methodologies, particularly those made in recent months.

There are hundreds of LLM papers each month proposing new techniques and approaches. However, one of the best ways to see what actually works well in practice is to look at the pre-training and post-training pipelines of the most recent state-of-the-art models. Luckily, four major new LLMs have been released in recent months, accompanied by relatively detailed technical reports.

In this article, I focus on the pre-training and post-training pipelines of the following models:

Alibaba's Qwen 2

Apple Intelligence Foundation Language Models

Google's Gemma 2

Meta AI's Llama 3.1

These models are presented in order based on the publication dates of their respective technical papers on arXiv.org, which also happens to align with their alphabetical order.

This article is a passion project that I created in my free time and over the weekends. If you find it valuable and would like to support my work, please consider purchasing a copy of my books and recommending them to your colleagues. Your review on Amazon would also be greatly appreciated!

Build a Large Language Model (from Scratch) is a highly focused book dedicated to coding LLMs from the ground up in PyTorch, covering everything from pre-training to post-training—arguably the best way to truly understand LLMs.

Machine Learning Q and AI is a great book for those who are already familiar with the basics; it dives into intermediate and advanced concepts covering deep neural networks, vision transformers, multi-GPU training paradigms, LLMs, and many more.

Machine Learning with PyTorch and Scikit-Learn is a comprehensive guide to machine learning, deep learning, and AI, offering a well-balanced mix of theory and practical code. It's the ideal starting point for anyone new to the field. 

1. Alibaba's Qwen 2

Let's begin with Qwen 2, a really strong LLM family that is competitive with other major LLMs. However, for some reason, it's less popular than the open-weight models from ...