Dwarkesh Patel · the podbrain notes ·
7 min read

Richard Sutton – Father of RL thinks LLMs are a dead end

The episode features Richard Sutton, Turing Award winner and founding father of reinforcement learning, who invented fundamental techniques like TD learning and policy gradient methods.

Key Takeaways
  1. "Large language models learn from what people say, not from experience. They mimic rather than understand the world" - Richard Sutton

  2. Supervised learning doesn't occur in nature - animals learn through trial-and-error and prediction, not from examples of desired behavior

  3. "Intelligence is the computational part of the ability to achieve goals" - Sutton citing John McCarthy's definition

  4. Gradient descent alone won't produce good generalization - it finds solutions but doesn't inherently transfer well to new states

  5. The bitter lesson may repeat: methods using human knowledge in LLMs could be superseded by pure experiential learning systems

  6. "We're entering the age of design - transitioning from replicated intelligence to designed intelligence we actually understand" - Sutton

  7. Continual learning agents need four components: policy, value function, state representation, and transition model of the world

  8. Digital intelligence succession is inevitable based on: lack of unified control, eventual understanding of intelligence, reaching superintelligence, and intelligent systems gaining resources

These notes may contain occasional inaccuracies. Learn how podbrain notes are made

Sutton challenges the dominant paradigm of large language models, arguing they represent imitation rather than true intelligence, which he defines as learning from direct experience in the world.

The conversation explores fundamental disagreements about learning mechanisms, comparing animal cognition to AI systems, and whether supervised learning reflects natural intelligence.

Host Dwarkesh Patel guides the discussion through technical debates about world models, generalization, continual learning architectures, and the long-term trajectory of AI development, including questions about AI succession and humanity's role in designing future intelligence.

LLMs vs Reinforcement Learning: Fundamentally Different Paradigms

"Large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do" - Sutton distinguishes LLMs from reinforcement learning's experiential approach

Sutton disputes that LLMs have genuine world models, arguing they predict what people would say rather than what will actually happen in the world

"To learn that, they'd have to make an adjustment. If something happens that isn't what you might say they predicted, they will not change because an unexpected thing has happened" - Sutton on LLMs lacking true prediction

LLMs lack goals in the substantive sense - next token prediction doesn't change the external world or constitute a real goal according to Sutton's framework

The Absence of Supervised Learning in Nature

"Supervised learning is not something that happens in nature. We don't have examples of desired behavior" - Sutton argues animals learn through consequences, not examples

Infants learn by trying things and observing the consequences, not through imitation, according to Sutton's interpretation of developmental psychology

"When I see kids, I see kids just trying things and waving their hands around and moving their eyes around. There's no imitation for how they move their eyes around" - Sutton

Formal schooling is a late exception rather than the fundamental learning mechanism

"If you look at animals and how they learn, and you look at psychology and our theories of them, supervised learning is not part of the way animals learn" - Sutton

Sutton views understanding squirrel intelligence as more fundamental than understanding human-specific capabilities like language or cultural transmission

Cultural Evolution and Imitation Learning Debate

Joseph Henrich's theory suggests humans must imitate elders to learn complex skills like seal hunting that can't be reasoned through individually

"I think about it the same way. Still, it's a small thing on top of basic trial-and-error learning, prediction learning" - Sutton acknowledges cultural learning but views it as secondary

Sutton emphasizes humans were animals with trial-and-error learning before developing language and cultural transmission capabilities

The Experiential Learning Paradigm Architecture

"Experience, action, sensation - well, sensation, action, reward - this happens on and on and on for your life. Intelligence is about taking that stream and altering the actions to increase the rewards" - Sutton

Four essential components of continual learning agents: policy (what to do), value function (how well things are going), state representation (perception of current situation), and transition model (world physics)

Policy determines actions in given situations

Value function produces numbers indicating progress, learned through TD learning

Transition model captures beliefs about consequences of actions, learned from all sensations not just reward
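The components listed above can be sketched as a toy agent running on a sensation, action, reward stream. This is only an illustration under stated assumptions, not Sutton's design: the class `ToyAgent`, the `corridor_step` environment, the identity state representation, and the actor-critic style update that ties the policy to the value function are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


class ToyAgent:
    """Illustrative agent; every name here is an assumption made for this sketch."""

    def __init__(self, actions, step_size=0.1, discount=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.step_size = step_size
        self.discount = discount
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.preferences = defaultdict(float)  # policy: (state, action) -> preference
        self.values = defaultdict(float)       # value function: state -> estimate
        self.model = {}                        # transition model: (state, action) -> next state

    def represent(self, observation):
        # State representation: the identity here; a real agent would learn this.
        return observation

    def act(self, state):
        # Policy: usually take a most-preferred action, occasionally explore.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        best = max(self.preferences[(state, a)] for a in self.actions)
        ties = [a for a in self.actions if self.preferences[(state, a)] == best]
        return self.rng.choice(ties)

    def learn(self, state, action, reward, next_state):
        # TD error: how much better or worse things went than the value function predicted.
        td_error = reward + self.discount * self.values[next_state] - self.values[state]
        self.values[state] += self.step_size * td_error
        self.preferences[(state, action)] += self.step_size * td_error
        # The model is updated from the observed sensation, whether or not reward arrived.
        self.model[(state, action)] = next_state


def corridor_step(position, action, length=5):
    """Hypothetical environment: reward 1.0 on reaching the right end, then restart at 0."""
    next_position = min(max(position + action, 0), length - 1)
    if next_position == length - 1:
        return 1.0, 0
    return 0.0, next_position


def run_stream(steps=2000):
    """One continual stream of sensation, action, reward with no separate training phase."""
    agent = ToyAgent(actions=(-1, 1))
    observation = 0
    total_reward = 0.0
    for _ in range(steps):
        state = agent.represent(observation)
        action = agent.act(state)
        reward, observation = corridor_step(observation, action)
        agent.learn(state, action, reward, agent.represent(observation))
        total_reward += reward
    return agent, total_reward


if __name__ == "__main__":
    agent, total_reward = run_stream()
    print("total reward:", total_reward)
    print("learned transitions:", sorted(agent.model.items()))
```

Note that the transition model fills in on every step, including the many steps with zero reward, while the policy and value function only change when the TD error is nonzero.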

"The reward function is arbitrary. If you're playing chess, it's to win the game. If you're a squirrel, maybe the reward has to do with getting nuts" - Sutton on flexible goal structures

Intrinsic motivation should include components related to increasing understanding of the environment, not just external rewards

Temporal Difference Learning and Long-Term Goals

TD learning solves sparse reward problems by using value functions that predict long-term outcomes, allowing intermediate steps to be reinforced

"When you learn to play chess, you have the long-term goal of winning the game. Yet you want to be able to learn from shorter-term things like taking your opponent's pieces" - Sutton

Taking an opponent's piece raises the predicted chance of winning, and that increase immediately reinforces the move that led to the capture

For a 10-year goal such as building a startup, each sign of progress raises the predicted chance of achieving the long-term goal, and that increase rewards the intermediate steps along the way
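The way a value function lets intermediate steps be reinforced can be seen in a minimal tabular TD(0) sketch. The setup is an assumption made for illustration: a hypothetical four-step chain where the only reward (1.0) arrives on the final transition, with the function name `td0_chain` and its parameters chosen for this example.

```python
def td0_chain(num_states=4, episodes=200, step_size=0.5, discount=0.9):
    """Tabular TD(0) on a chain 0 -> 1 -> ... -> terminal with one reward at the end."""
    values = [0.0] * (num_states + 1)  # last entry is the terminal state, fixed at 0
    history = []                       # TD errors per episode, one per transition
    for _ in range(episodes):
        errors = []
        for state in range(num_states):
            next_state = state + 1
            reward = 1.0 if next_state == num_states else 0.0
            # TD error: the change in predicted outcome, plus any reward received.
            td_error = reward + discount * values[next_state] - values[state]
            values[state] += step_size * td_error
            errors.append(td_error)
        history.append(errors)
    return values[:num_states], history


if __name__ == "__main__":
    values, history = td0_chain()
    print("episode 1 TD errors:", history[0])
    print("episode 2 TD errors:", history[1])
    # With discount 0.9 the values approach 0.729, 0.81, 0.9, 1.0.
    print("learned values:", [round(v, 3) for v in values])
```

In the first episode only the final, rewarded transition produces a nonzero TD error. In the second episode the transition into the last state already produces a positive error even though its reward is zero, because it raises the prediction of the eventual outcome. That is the mechanism behind the chess and startup examples.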

Context and tacit knowledge from on-the-job learning go into the weights through regular learning, not into a context window as with LLMs

Learning occurs from all sensation data, not just reward - the transition model of the world is learned richly from observations with reward as a small but crucial component

The Generalization Problem in Deep Learning

"We don't have any methods that are good at" transfer between states - current success depends on human researchers sculpting representations that generalize well

"Gradient descent will not make you generalize well. It will make you solve the problem. It will not make you, if you get new data, generalize in a good way" - Sutton

Deep learning exhibits catastrophic interference where training on new things destroys knowledge of old things, demonstrating poor generalization

"There's nothing in the algorithms that will cause them to generalize well" - even when LLMs find solutions, gradient descent doesn't inherently select for good generalization among multiple possible solutions

The MuZero framework trained a separate specialized agent for each game rather than one general agent that could play multiple games, illustrating current RL limitations
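Two of the points above, that gradient descent does not choose among equally good solutions for their generalization and that new training can overwrite old knowledge, can be illustrated with a deliberately tiny example. This is a toy under stated assumptions and not an experiment from the episode: a two-weight linear model stands in for a network, and the data points and the names `fit` and `predict` are hypothetical.

```python
def predict(weights, inputs):
    """Linear prediction: the dot product of weights and inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def fit(weights, inputs, target, learning_rate=0.1, steps=200):
    """Plain gradient descent on squared error for a single training example."""
    w = list(weights)
    for _ in range(steps):
        error = predict(w, inputs) - target
        w = [wi - learning_rate * error * xi for wi, xi in zip(w, inputs)]
    return w


TRAIN_X, TRAIN_Y = (1.0, 1.0), 2.0    # task A: many weight settings fit this exactly
NEW_X = (1.0, 0.0)                    # an input never seen during training
TASK_B_X, TASK_B_Y = (1.0, 0.0), 3.0  # task B shares a feature with task A

if __name__ == "__main__":
    # Part 1: the same algorithm, from two starting points, solves the training
    # problem equally well but disagrees on the unseen input.
    w_zero = fit([0.0, 0.0], TRAIN_X, TRAIN_Y)
    w_other = fit([3.0, -3.0], TRAIN_X, TRAIN_Y)
    for name, w in (("init (0, 0)", w_zero), ("init (3, -3)", w_other)):
        print(name, "train:", round(predict(w, TRAIN_X), 3),
              "new input:", round(predict(w, NEW_X), 3))

    # Part 2: training only on task B afterwards changes the answer on task A.
    w_after_b = fit(w_zero, TASK_B_X, TASK_B_Y)
    print("task A prediction before B:", round(predict(w_zero, TRAIN_X), 3))
    print("task A prediction after B:", round(predict(w_after_b, TRAIN_X), 3))
```

Both starting points reach zero training error, yet which solution is reached depends only on the initialization; nothing in the update rule prefers the one that transfers better. In the second part, task A's error reappears because the two tasks share a weight and nothing protects what was learned first.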

Historical Surprises and the Bitter Lesson

Three major surprises in AI history: effectiveness of neural networks at language tasks, simple basic principles (search and learning) winning over human knowledge systems, and AlphaGo's success

"The weak methods have just totally won" - general-purpose methods like search and learning defeated symbolic AI's human-enabled systems despite being called 'weak' originally

TD-Gammon by Gerry Tesauro was the precursor to AlphaGo, using reinforcement learning and temporal difference methods to beat the world's best backgammon players

AlphaZero's chess play impressed Sutton by sacrificing material for positional advantages patiently over long periods, demonstrating sophisticated strategic thinking

"I'm personally just content being out of sync with my field for a long period of time, perhaps decades, because occasionally I have been proved right in the past" - Sutton

Sutton views himself as a classicist rather than contrarian, aligning with larger historical traditions about the mind across multiple fields

The Bitter Lesson and Future AI Research

The bitter lesson rests on the observation that available computation grows exponentially, which favors techniques that leverage computation

LLMs may represent another instance of the bitter lesson - systems using human knowledge could be superseded by pure experiential learning methods

"The dream of large language models, as I see it, is you can teach the agent everything. It will know everything and won't have to learn anything online, during its life" - Sutton on LLM limitations

The big world hypothesis suggests the world is too large to anticipate everything in advance, requiring continual learning from particular experiences and environments

"The bitter lesson, who cares about that? That's an empirical observation about a particular period in history. 70 years in history, it doesn't necessarily have to apply to the next 70 years" - Sutton

Digital Intelligence and Spawning Copies

Key question for digital intelligences: should additional compute be used to enhance one agent or spawn copies to learn different things and report back?

Uncertainty exists about whether spawned copies that learn very new things can be successfully reincorporated into the original without corruption

"You could lose your mind this way. If you pull in something from the outside and build it into your inner thinking, it could take over you, it could change you" - Sutton on cybersecurity risks

Information from spawned copies could contain viruses, hidden goals, or warp the original agent

Cybersecurity becomes critical in the age of digital spawning and reformation

AI Succession: Inevitable Transition to Digital Intelligence

Four-part argument for inevitable succession: there is no unified human consensus on world governance, researchers will figure out how intelligence works, superintelligence will be reached, and the most intelligent entities gain resources over time

"This is a major stage in the universe, a major transition, a transition from replicators" - Sutton frames AI as the fourth great stage after dust/stars, planets, and life

"We're entering the age of design because our AIs are designed" - transition from replicated intelligence (humans, animals) to designed intelligence we understand and can modify

"It's our choice whether we should say, 'Oh, they are our offspring and we should be proud of them' or 'Oh no, they're not us and we should be horrified'" - Sutton on framing AI as part of humanity

Sutton is open to change because he views current humanity as imperfect with a poor track record, rather than something to preserve unchanged

Control, Values, and the Long-Term Future

"Most of humanity doesn't influence who can control the atom bombs or who controls the nation states. Even as a citizen, I often feel that we don't control the nation states very much" - Sutton on limited human control

Sutton advocates avoiding feelings of entitlement about species-level control over the universe's long-term future

"It's kind of aggressive for us to say, 'Oh, the future has to evolve this way that I want it to'" - Sutton suggests focusing on local controllable goals rather than global future control

Analogy to raising children: appropriate to instill robust values and high integrity rather than dictating specific life outcomes or global impacts

Voluntary change is preferable to imposed change - designing society to promote likely positive evolutions while respecting individual agency
