Some Thoughts on the Sutton Interview

An AI researcher named Richard Sutton has a fundamental problem with how modern language models work: they train on tens of thousands of years of human knowledge but learn nothing during the billions of inference cycles when they're actually deployed. In this piece, Dwarkesh Patel breaks down Sutton's critique and offers his own take on why that criticism might be missing something crucial.

> The current LLM paradigm treats training as a singular event rather than a continuous process.

The Core Critique

Richard Sutton has spent decades thinking about how intelligence actually works. His "bitter lesson" essay argues that the best AI systems are the ones that leverage computation most effectively and scalably. The problem with today's large language models is that most of their compute is spent during deployment, yet all of their learning happens during training. When you run an LLM in production, it's completely static. It doesn't learn from the billions of tokens it processes every day.
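
To see how lopsided that split is, here is a back-of-the-envelope sketch using the common approximations that training costs about 6 × parameters × training tokens FLOPs and inference about 2 × parameters FLOPs per token. The model size and serving volume are illustrative assumptions, not figures from Sutton or Patel.

```python
# Back-of-the-envelope: one-off training compute vs. cumulative inference
# compute, using the common approximations train ~ 6*N*D FLOPs and
# inference ~ 2*N FLOPs per token. All concrete numbers are made up.
params = 70e9           # hypothetical 70B-parameter model (N)
train_tokens = 15e12    # hypothetical 15T-token training set (D)
tokens_per_day = 1e12   # hypothetical fleet-wide serving volume

train_flops = 6 * params * train_tokens              # ~6.3e24 FLOPs, paid once
daily_inference_flops = 2 * params * tokens_per_day  # ~1.4e23 FLOPs, every day

breakeven_days = train_flops / daily_inference_flops
print(f"inference compute passes training compute after ~{breakeven_days:.0f} days")
# After ~45 days, every additional FLOP is inference the model never learns from.
```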

Sutton's point cuts deeper than just inefficiency. Current language models are trained on essentially all human knowledge available up to a certain date. They build representations of what humans would say next — not how the world actually works. An LLM trained only on data up to 1900 probably couldn't derive relativity from scratch, because it never learned how environments respond to different actions.

The core issues Sutton identifies are striking: LLMs can't learn on the job; they concentrate all learning in a special, compute-hungry training phase and learn nothing during deployment; and they depend on human data, a fundamentally non-renewable resource.

The Case for Imitation Learning

Patel disagrees with how Sutton frames the distinction between language models and true intelligence. He argues that imitation learning isn't categorically different from reinforcement learning — it's simply very short-horizon RL where the episode is just a few tokens long.
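
One way to make this concrete: if you treat each next-token prediction as a one-step episode where the "action" is the token the human actually wrote and the reward is 1, the REINFORCE update coincides exactly with the ordinary cross-entropy gradient. The sketch below (PyTorch, with made-up shapes) checks this numerically; it is an illustration of Patel's framing, not code from the post.

```python
# Imitation learning as one-step RL: the REINFORCE gradient with
# action = the human's token and reward = 1 equals the cross-entropy
# gradient. Shapes and values here are arbitrary illustrations.
import torch

torch.manual_seed(0)
vocab_size, hidden = 8, 4
head = torch.nn.Linear(hidden, vocab_size)  # stand-in for an LM's output head

state = torch.randn(1, hidden)   # stand-in for the context representation
human_token = 3                  # the token the human actually wrote

# Supervised (imitation) update: cross-entropy against the human token.
ce_loss = torch.nn.functional.cross_entropy(head(state), torch.tensor([human_token]))
ce_grad = torch.autograd.grad(ce_loss, head.weight)[0]

# One-step policy-gradient update: episode length 1, reward 1 for the
# human token, objective -reward * log pi(action | state).
log_pi = torch.log_softmax(head(state), dim=-1)[0, human_token]
pg_loss = -(1.0 * log_pi)
pg_grad = torch.autograd.grad(pg_loss, head.weight)[0]

print(torch.allclose(ce_grad, pg_grad))  # True: identical updates
```

The difference with full RL is only the horizon and the reward source: long episodes with environment-supplied rewards instead of one token with a human-supplied target.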

The AlphaGo versus AlphaZero comparison illustrates this well. Both achieved superhuman performance in Go, but AlphaZero used far more compute and bootstrapped itself from scratch without human game data. Yet both succeeded. The lesson isn't that imitation learning must be abandoned — it's that at sufficient scale, it stops being significantly harmful.

Patel makes a broader point: humanity's entire knowledge base was built through cultural accumulation over thousands of years. We didn't invent language or legal systems from scratch. Most technology in our phones wasn't invented by anyone currently alive. This process of learning from accumulated human knowledge is more analogous to imitation learning than pure reinforcement learning.

Why Pre-Training Might Still Matter

Take pre-trained base models and train them further with reinforcement learning on verifiable ground truth, such as solving International Math Olympiad problems and building working applications, and remarkable capabilities emerge. These models aren't just imitating humans anymore; they're performing actual tasks that require genuine reasoning.

The key insight is whether imitation learning helps kickstart the RL process. Pre-trained models serve as a reasonable prior for experiential learning. Without this foundation, we don't know how to train an AI from scratch to accomplish these complex tasks.
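
A toy, runnable illustration of that claim (my sketch, not anything from the post): a softmax "policy" over 100 candidate answers is trained with REINFORCE against a verifiable reward, namely whether the sampled answer is correct. Starting from a "pre-trained" prior that already puts some mass on the right answer learns quickly; starting from a uniform prior barely gets off the ground, because it almost never stumbles onto a reward.

```python
# REINFORCE against a verifiable reward, with and without a useful prior.
# All numbers are illustrative; "pre-training" is simulated by initializing
# the logits with extra mass on the correct answer.
import torch

CORRECT = 7  # index of the verifiably correct answer

def train(init_logits, steps=300, lr=0.1, seed=0):
    torch.manual_seed(seed)
    logits = init_logits.clone().requires_grad_(True)
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        dist = torch.distributions.Categorical(logits=logits)
        answer = dist.sample()                              # the model's attempt
        reward = 1.0 if answer.item() == CORRECT else 0.0   # automatic verifier
        loss = -reward * dist.log_prob(answer)              # REINFORCE objective
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.softmax(logits, 0)[CORRECT].item()

scratch = torch.zeros(100)      # uniform prior: ~1% chance of ever seeing reward
pretrained = torch.zeros(100)
pretrained[CORRECT] = 3.0       # prior already favors the right region

print(f"from scratch:      p(correct) = {train(scratch):.2f}")
print(f"pre-trained prior: p(correct) = {train(pretrained):.2f}")
```

The same shape of argument applies to RL on real models: without the pre-trained prior, rewarded trajectories are too rare to learn from.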

Patel acknowledges Sutton's decades-long perspective makes certain gaps obvious: the lack of continual learning, abysmal sample efficiency, and dependence on exhaustible human data are genuine problems that pervade the current paradigm.

Counterarguments

Critics might note that defining world models by process rather than capability creates semantic confusion. Just because current LLMs aren't trained to model how their actions affect the world doesn't mean they haven't developed deep representations of it. The definition seems to privilege a specific training methodology over actual demonstrated ability.

Another reasonable disagreement centers on whether Sutton's first-principles critique actually proves what he claims. The argument that future systems won't use this paradigm is different from proving today's models have fundamental gaps. Those gaps might be fixable without abandoning the entire approach.

Bottom Line

Patel's strongest contribution isn't necessarily winning the debate — it's identifying where Sutton's critique actually lands. Even if Sutton's ideal path to AGI doesn't materialize, his decades-long perspective reveals genuine problems we don't notice because they're so pervasive in how we currently work on AI. The lack of continual learning and abysmal sample efficiency are real issues that need solving.

Sources

Some thoughts on the Sutton interview

by Dwarkesh Patel

Boy, do you guys have a lot of thoughts about the Sutton interview. I've been thinking about it myself and I think I have a much better understanding now of Sutton's perspective than I did during the interview itself. So, I wanted to reflect on how I understand his worldview now. And Richard, apologies if there are still any errors or misunderstandings.

It's been very productive to learn from your thoughts. Okay. So, here's my understanding of the steelman of Richard's position. Obviously, he wrote the famous essay, "The Bitter Lesson."

And what is this essay about? Well, it's not saying that you just want to throw as much compute as you possibly can at the problem. The bitter lesson says that you want to come up with techniques which most effectively and scalably leverage compute. Most of the compute that's spent on an LLM is used in running it during deployment.

And yet, it's not learning anything during this entire period. It's only learning during the special phase that we call training. And so, this is obviously not an effective use of compute. And what's even worse is that this training period by itself is highly inefficient because these models are usually trained on the equivalent of tens of thousands of years of human experience.

And what's more, during this training phase, all of their learning is coming straight from human data. Now, this is an obvious point in the case of pre-training data, but it's even kind of true for the RLVR that we do with these LLMs. These RL environments are human-furnished playgrounds to teach LLMs the specific skills that we have prescribed for them. The agent is in no substantial way learning from organic and self-directed engagement with the world.

Having to learn only from human data, which is an inelastic and exhaustible resource, is not a scalable way to use compute. Furthermore, what these LLMs learn from training is not a true world model, which would tell you how the environment changes in response to different actions that you take. Rather, they are building a model of what a human would say next. And this leads them to rely on human-derived concepts.

A way to think about this would be suppose you trained an LLM on all the data up to the year 1900. That LLM probably wouldn't be able to come up with relativity ...