An AI researcher named Richard Sutton has a fundamental problem with how modern language models work: they train on tens of thousands of years of human knowledge but learn nothing during the billions of inference cycles when they're actually deployed. In this piece, Dwarkesh Patel breaks down Sutton's critique and offers his own take on why that criticism might be missing something crucial.

> The current LLM paradigm treats training as a singular event rather than a continuous process.
The Core Critique
Richard Sutton has spent decades thinking about how intelligence actually works. His "bitter lesson" essay argues that the AI methods that win in the long run are those that leverage computation most effectively and scalably. The problem with today's large language models, in his view, is that nearly all their compute is spent during training, not during deployment. When you run an LLM in production, it's completely static: it doesn't learn from the billions of tokens it processes every day.
Sutton's point cuts deeper than just inefficiency. Current language models are trained on essentially all human knowledge available up to a certain date. They build representations of what humans would say next — not how the world actually works. An LLM trained only on data up to 1900 probably couldn't derive relativity from scratch, because it never learned how environments respond to different actions.
The core issues Sutton identifies are striking: LLMs can't learn on the job; they require a special, massively compute-intensive training phase and then stop learning once deployed; and the human data they depend on is a finite, non-renewable resource.
The Case for Imitation Learning
Patel disagrees with how Sutton frames the distinction between language models and true intelligence. He argues that imitation learning isn't categorically different from reinforcement learning — it's simply very short-horizon RL where the episode is just a few tokens long.
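Patel's equivalence can be made concrete. For a softmax policy over tokens, the gradient of the behavioral-cloning (cross-entropy) loss on an expert token is identical to a one-step REINFORCE gradient in which the "episode" is that single token, the action is the token the human wrote, and the reward is 1. A minimal sketch (the toy vocabulary and logits are illustrative, not from the piece):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy next-token policy over a 4-token vocabulary.
logits = np.array([0.2, -1.0, 0.5, 0.1])
probs = softmax(logits)
expert_token = 2  # the token the human actually wrote

# Behavioral cloning: gradient of log pi(expert_token) w.r.t. logits.
one_hot = np.eye(len(logits))[expert_token]
bc_grad = one_hot - probs

# One-step REINFORCE on the same sample: episode length 1, reward 1
# for the expert's token, so the update is r * grad log pi(a).
reward = 1.0
rl_grad = reward * (one_hot - probs)

assert np.allclose(bc_grad, rl_grad)  # the two updates coincide
```

Under this reading, next-token prediction is just RL with a very dense reward signal and episodes a single action long, which is why Patel resists treating the two as categorically different.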
The AlphaGo versus AlphaZero comparison illustrates this well. Both achieved superhuman performance in Go, but AlphaZero used far more compute and bootstrapped itself from scratch without human game data. Yet both succeeded. The lesson isn't that imitation learning must be abandoned; it's that, at sufficient scale, starting from human data stops being a meaningful handicap.
Patel makes a broader point: humanity's entire knowledge base was built through cultural accumulation over thousands of years. We didn't invent language or legal systems from scratch. Most technology in our phones wasn't invented by anyone currently alive. This process of learning from accumulated human knowledge is more analogous to imitation learning than pure reinforcement learning.
Why Pre-Training Might Still Matter
When pre-trained base models are further trained with reinforcement learning against ground truth, solving International Math Olympiad problems and producing working applications, remarkable capabilities emerge. These models aren't just imitating humans anymore; they're completing tasks that demand genuine reasoning.
The key question is whether imitation learning can kickstart the RL process. Pre-trained models serve as a reasonable prior for experiential learning; without that foundation, we simply don't know how to train an AI from scratch to accomplish these complex tasks.
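One common way this "prior" role is made precise in practice is to regularize the RL objective toward the pre-trained model, maximizing expected reward minus a KL penalty to the base policy. A minimal sketch under that assumption (the candidate answers, rewards, and beta are all illustrative):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Pre-trained base policy over 3 candidate answers acts as the prior.
base_logits = np.array([1.0, 0.5, 0.0])
base_probs = softmax(base_logits)

rewards = np.array([0.0, 1.0, 0.0])  # a ground-truth check favors answer 1
beta = 0.1                           # strength of the KL anchor

# Closed-form maximizer of E[r] - beta * KL(pi || base):
#   pi*(a) is proportional to base(a) * exp(r(a) / beta).
tuned_probs = base_probs * np.exp(rewards / beta)
tuned_probs /= tuned_probs.sum()

# The tuned policy concentrates on the rewarded answer, but remains a
# reweighting of the prior: anything the base model assigns zero mass
# can never gain probability, which is exactly the "prior" role.
assert tuned_probs.argmax() == 1
```

The design choice the sketch highlights: experiential learning only searches within what the pre-trained prior considers plausible, which is why removing the foundation leaves nothing for RL to refine.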
Patel acknowledges that Sutton's decades-long perspective makes certain gaps obvious: the lack of continual learning, abysmal sample efficiency, and dependence on exhaustible human data are genuine problems that pervade the current paradigm.
Counterarguments
Critics might note that defining world models by process rather than capability creates semantic confusion. Just because current LLMs aren't trained to model how their actions affect the world doesn't mean they haven't developed deep representations of it. The definition seems to privilege a specific training methodology over actual demonstrated ability.
Another reasonable disagreement centers on whether Sutton's first-principles critique actually proves what he claims. The argument that future systems won't use this paradigm is different from proving today's models have fundamental gaps. Those gaps might be fixable without abandoning the entire approach.
Bottom Line
Patel's strongest contribution isn't necessarily winning the debate — it's identifying where Sutton's critique actually lands. Even if Sutton's ideal path to AGI doesn't materialize, his decades-long perspective reveals genuine problems we don't notice because they're so pervasive in how we currently work on AI. The lack of continual learning and abysmal sample efficiency are real issues that need solving.