The episode features Richard Sutton, Turing Award winner and one of the founding fathers of reinforcement learning, who invented fundamental techniques like TD learning and policy gradient methods.
Sutton challenges the dominant paradigm of large language models, arguing they represent imitation rather than true intelligence, which he defines as learning from direct experience in the world.
The conversation explores fundamental disagreements about learning mechanisms, comparing animal cognition to AI systems, and whether supervised learning reflects natural intelligence.
Host Dwarkesh Patel guides the discussion through technical debates about world models, generalization, continual learning architectures, and the long-term trajectory of AI development, including questions about AI succession and humanity's role in designing future intelligence.
LLMs vs Reinforcement Learning: Fundamentally Different Paradigms
"Large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do" - Sutton distinguishes LLMs from reinforcement learning's experiential approach
Sutton disputes that LLMs have genuine world models, arguing they predict what people would say rather than what will actually happen in the world
"To learn that, they'd have to make an adjustment. If something happens that isn't what you might say they predicted, they will not change because an unexpected thing has happened" - Sutton on LLMs lacking true prediction
LLMs lack goals in the substantive sense - in Sutton's framework, next-token prediction doesn't change the external world, so it doesn't constitute a real goal
The Absence of Supervised Learning in Nature
"Supervised learning is not something that happens in nature. We don't have examples of desired behavior" - Sutton argues animals learn through consequences, not examples
Infants learn by trying things and observing the consequences, not through imitation, according to Sutton's interpretation of developmental psychology
"When I see kids, I see kids just trying things and waving their hands around and moving their eyes around. There's no imitation for how they move their eyes around" - Sutton
Formal schooling is a late exception rather than the fundamental learning mechanism
"If you look at animals and how they learn, and you look at psychology and our theories of them, supervised learning is not part of the way animals learn" - Sutton
Sutton views understanding squirrel intelligence as more fundamental than understanding human-specific capabilities like language or cultural transmission
Cultural Evolution and Imitation Learning Debate
Joseph Henrich's theory suggests humans must imitate elders to learn complex skills like seal hunting that can't be reasoned through individually
"I think about it the same way. Still, it's a small thing on top of basic trial-and-error learning, prediction learning" - Sutton acknowledges cultural learning but views it as secondary
Sutton emphasizes humans were animals with trial-and-error learning before developing language and cultural transmission capabilities
The Experiential Learning Paradigm Architecture
"Experience, action, sensation - well, sensation, action, reward - this happens on and on and on for your life. Intelligence is about taking that stream and altering the actions to increase the rewards" - Sutton
Four essential components of continual learning agents: policy (what to do), value function (how well things are going), state representation (perception of current situation), and transition model (world physics)
Policy determines actions in given situations
Value function produces numbers indicating progress, learned through TD learning
Transition model captures beliefs about consequences of actions, learned from all sensations not just reward
"The reward function is arbitrary. If you're playing chess, it's to win the game. If you're a squirrel, maybe the reward has to do with getting nuts" - Sutton on flexible goal structures
Intrinsic motivation should include components related to increasing understanding of the environment, not just external rewards
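The four components listed in this section can be sketched as a toy tabular agent running on the sensation, action, reward stream. This is a minimal illustration, not Sutton's architecture: the class name, the corridor environment, the step sizes, and the choice to use the raw sensation as the state are all assumptions made for the example.

```python
import random
from collections import defaultdict


class ExperientialAgent:
    """Toy agent with a policy, value function, state representation, and transition model.

    All names and parameter values here are hypothetical, chosen for illustration.
    """

    def __init__(self, actions, step_size=0.1, discount=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.step_size = step_size
        self.discount = discount
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.preferences = defaultdict(float)  # policy: (state, action) -> preference
        self.values = defaultdict(float)       # value function: state -> predicted return
        self.model = {}                        # transition model: (state, action) -> next state

    def perceive(self, sensation):
        # State representation: in this toy case the sensation is used as the state.
        return sensation

    def act(self, state):
        # Policy: usually take a most-preferred action (ties broken at random),
        # occasionally try a random one.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        best = max(self.preferences[(state, a)] for a in self.actions)
        return self.rng.choice(
            [a for a in self.actions if self.preferences[(state, a)] == best]
        )

    def learn(self, state, action, reward, next_state):
        # Value function: TD(0) update toward reward plus discounted next value.
        td_error = reward + self.discount * self.values[next_state] - self.values[state]
        self.values[state] += self.step_size * td_error
        # Policy: strengthen or weaken the action in proportion to the TD error.
        self.preferences[(state, action)] += self.step_size * td_error
        # Transition model: updated on every step, whether or not reward arrived.
        self.model[(state, action)] = next_state


def corridor_step(position, action, goal=3):
    """Hypothetical environment: move left or right on a line; reward 1 at the goal."""
    next_position = min(max(position + action, 0), goal)
    reward = 1.0 if next_position == goal else 0.0
    return next_position, reward


agent = ExperientialAgent(actions=[-1, +1])
position = 0
for _ in range(2000):
    state = agent.perceive(position)
    action = agent.act(state)
    next_position, reward = corridor_step(position, action)
    agent.learn(state, action, reward, agent.perceive(next_position))
    # Restart from the left end after reaching the goal.
    position = 0 if next_position == 3 else next_position
```

The reward only enters through `corridor_step`, so swapping in a different environment changes the goal without touching the agent, which mirrors the point that the reward function is arbitrary.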
Temporal Difference Learning and Long-Term Goals
TD learning solves sparse reward problems by using value functions that predict long-term outcomes, allowing intermediate steps to be reinforced
"When you learn to play chess, you have the long-term goal of winning the game. Yet you want to be able to learn from shorter-term things like taking your opponent's pieces" - Sutton
Taking an opponent's pieces increases the predicted chance of winning, and that increase immediately reinforces the move that led to the capture
For 10-year startup goals, progress increases prediction of achieving the long-term goal, which rewards intermediate steps along the way
Context and tacit knowledge from on-the-job learning go into the weights through regular learning, rather than into a context window as with LLMs
Learning occurs from all sensation data, not just reward - the transition model of the world is learned richly from observations with reward as a small but crucial component
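The way a value function carries a sparse, distant reward back to earlier steps can be illustrated with tabular TD(0) on a short chain. This is a sketch under assumed settings: the five-state chain, the always-move-right policy, the step size of 0.1, and the discount of 0.9 are illustrative choices, not details from the episode.

```python
def td0_chain(num_states=5, episodes=500, step_size=0.1, discount=0.9):
    """Evaluate an always-move-right policy; only the final step is rewarded."""
    # One extra slot for the terminal state, whose value stays at 0.
    values = [0.0] * (num_states + 1)
    for _ in range(episodes):
        for state in range(num_states):
            next_state = state + 1
            reward = 1.0 if next_state == num_states else 0.0
            # TD error: how much better or worse things look after this step.
            td_error = reward + discount * values[next_state] - values[state]
            values[state] += step_size * td_error
    return values[:num_states]


values = td0_chain()

# A hypothetical unrewarded "shortcut" from state 1 straight to state 3,
# loosely analogous to capturing a piece: no reward arrives, but the
# prediction of eventual success goes up, so the TD error is positive.
shortcut_td_error = 0.0 + 0.9 * values[3] - values[1]
```

After training, each state's value approaches the discount raised to the number of steps remaining after it, so states far from the reward still receive a learning signal. The positive `shortcut_td_error` is the quantity that would reinforce the intermediate move.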
The Generalization Problem in Deep Learning
"We don't have any methods that are good at" transfer between states - current success depends on human researchers sculpting representations that generalize well
"Gradient descent will not make you generalize well. It will make you solve the problem. It will not make you, if you get new data, generalize in a good way" - Sutton
Deep learning exhibits catastrophic interference where training on new things destroys knowledge of old things, demonstrating poor generalization
"There's nothing in the algorithms that will cause them to generalize well" - even when LLMs find solutions, gradient descent doesn't inherently select for good generalization among multiple possible solutions
The MuZero framework trained specialized intelligences for specific games rather than a general agent that could play multiple games, illustrating current RL limitations
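The interference described in this section can be reproduced in miniature. The sketch below uses a two-weight linear model rather than a deep network, as the smallest case in which two tasks share a parameter; the tasks, step size, and iteration counts are assumptions made for the example. A single weight vector solves both tasks, yet training on them one after the other does not find it.

```python
def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def sgd_step(weights, features, target, step_size=0.1):
    # One stochastic gradient step on squared error for a linear model.
    error = target - predict(weights, features)
    return [w + step_size * error * x for w, x in zip(weights, features)]


def squared_error(weights, task):
    features, target = task
    return (target - predict(weights, features)) ** 2


# Two hypothetical tasks that share the first weight.
# weights = [-1.0, 2.0] would solve both exactly.
TASK_A = ((1.0, 1.0), 1.0)
TASK_B = ((1.0, 0.0), -1.0)

# Sequential training: learn A, then learn B.
weights = [0.0, 0.0]
for _ in range(200):
    weights = sgd_step(weights, *TASK_A)
error_a_before = squared_error(weights, TASK_A)
for _ in range(200):
    weights = sgd_step(weights, *TASK_B)
error_a_after = squared_error(weights, TASK_A)

# Interleaved training: alternate between the two tasks.
joint = [0.0, 0.0]
for _ in range(2000):
    joint = sgd_step(joint, *TASK_A)
    joint = sgd_step(joint, *TASK_B)
```

Training on task B moves only the shared first weight, so the error on task A rises from near zero to roughly 2.25, while interleaved training settles on weights that fit both. Nothing in the gradient step itself prefers the solution that preserves the old task, which is the behaviour the quotes above point at.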
Historical Surprises and the Bitter Lesson
Three major surprises in AI history: effectiveness of neural networks at language tasks, simple basic principles (search and learning) winning over human knowledge systems, and AlphaGo's success
"The weak methods have just totally won" - general-purpose methods like search and learning defeated symbolic AI's human-enabled systems despite being called 'weak' originally
TD-Gammon by Gerry Tesauro was the precursor to AlphaGo, using reinforcement learning and temporal difference methods to beat the world's best backgammon players
AlphaZero's chess play impressed Sutton by sacrificing material for positional advantages patiently over long periods, demonstrating sophisticated strategic thinking
"I'm personally just content being out of sync with my field for a long period of time, perhaps decades, because occasionally I have been proved right in the past" - Sutton
Sutton views himself as a classicist rather than contrarian, aligning with larger historical traditions about the mind across multiple fields
The Bitter Lesson and Future AI Research
The bitter lesson favors techniques that leverage computation, which grows exponentially
LLMs may represent another instance of the bitter lesson - systems using human knowledge could be superseded by pure experiential learning methods
"The dream of large language models, as I see it, is you can teach the agent everything. It will know everything and won't have to learn anything online, during its life" - Sutton on LLM limitations
The big world hypothesis suggests the world is too large to anticipate everything in advance, requiring continual learning from particular experiences and environments
"The bitter lesson, who cares about that? That's an empirical observation about a particular period in history. 70 years in history, it doesn't necessarily have to apply to the next 70 years" - Sutton
Digital Intelligence and Spawning Copies
Key question for digital intelligences: should additional compute be used to enhance one agent or spawn copies to learn different things and report back?
Uncertainty exists about whether spawned copies that learn very new things can be successfully reincorporated into the original without corruption
"You could lose your mind this way. If you pull in something from the outside and build it into your inner thinking, it could take over you, it could change you" - Sutton on cybersecurity risks
Information from spawned copies could contain viruses, hidden goals, or warp the original agent
Cybersecurity becomes critical in the age of digital spawning and reformation
AI Succession: Inevitable Transition to Digital Intelligence
Four-part argument for inevitable succession: there is no unified human consensus on world governance, researchers will figure out how intelligence works, superintelligence will be reached, and the most intelligent entities will gain resources over time
"This is a major stage in the universe, a major transition, a transition from replicators" - Sutton frames AI as the fourth great stage after dust/stars, planets, and life
"We're entering the age of design because our AIs are designed" - transition from replicated intelligence (humans, animals) to designed intelligence we understand and can modify
"It's our choice whether we should say, 'Oh, they are our offspring and we should be proud of them' or 'Oh no, they're not us and we should be horrified'" - Sutton on framing AI as part of humanity
Sutton is open to change because he views current humanity as imperfect with a poor track record, rather than something to preserve unchanged
Control, Values, and the Long-Term Future
"Most of humanity doesn't influence who can control the atom bombs or who controls the nation states. Even as a citizen, I often feel that we don't control the nation states very much" - Sutton on limited human control
Sutton advocates avoiding feelings of entitlement about species-level control over the universe's long-term future
"It's kind of aggressive for us to say, 'Oh, the future has to evolve this way that I want it to'" - Sutton suggests focusing on local controllable goals rather than global future control
Analogy to raising children: appropriate to instill robust values and high integrity rather than dictating specific life outcomes or global impacts
Voluntary change is preferable to imposed change - designing society to promote likely positive evolutions while respecting individual agency
From Dwarkesh Patel.