Dwarkesh Patel · the podbrain notes ·
6 min read

The data black hole at the center of AI

This episode features Dwarkesh Patel delivering a solo monologue — drawn from his written blog post — exploring one of the most underappreciated constraints in modern AI development: sample efficiency. Dwarkesh is a podcast host and writer known for long-form technical interviews with AI researchers and economists.

Key Takeaways
  1. Frontier AI models are trained on tens to hundreds of trillions of tokens, nearly a million-fold more data than a human absorbs from birth to adulthood (~200 million tokens).
  2. Data is the real driver of AI progress: open models lag frontier models by only ~4 months because training data can be distilled from public APIs, unlike hyperparameters or architectural tricks.
  3. Even scaling model parameters to infinity would only reduce required training data by a factor of 10, while humans are thousands to millions of times more sample-efficient than current models.
  4. The Chinchilla scaling law shows parameter count and data requirements are added independently to the loss, so you cannot scale your way out of the sample efficiency gap.
  5. A teenager learns to drive in ~20 hours; Waymo and Tesla need 3–4 orders of magnitude more data to achieve comparable self-driving performance.
  6. AIs can be 'ludicrously inefficient' to train and still be economically viable because what they learn is amortized across billions of simultaneous sessions.
  7. The genome is only 3 gigabytes, with just 1–2% of it protein-coding. That is not enough to store a pre-trained neural network, suggesting evolution found hyperparameters, not weights.
  8. Human software engineering demand may actually increase by 2027 due to AI acting as a complementary input rather than a direct replacement.


These notes may contain occasional inaccuracies. Learn how podbrain notes are made


The talk covers the staggering data requirements of frontier language models compared to human learning, the mechanics of reinforcement learning as synthetic data generation, and why scaling model size alone cannot close the efficiency gap. Dwarkesh walks through the Chinchilla scaling law, the economics of the AI data labeling industry, and what the sample efficiency problem means for the two core ambitions of AI labs: automating white-collar work and automating AI research itself. The episode closes with a tease of a future post on intelligence explosions built atop LLMs.

Sample Efficiency: The Hidden Bottleneck in AI Progress

One definition of intelligence is sample efficiency — how much data is needed to operate fluently in a domain — and it is not clear meaningful progress has been made here in recent years.

"The main way that AIs have been getting better is from adding more and better data and scaling the compute required to develop that data in the first place." - Dwarkesh

Reinforcement learning functions as synthetic data generation: compute is spent producing rollouts, a verifier or LLM-as-judge picks out the high-quality ones, and those are then used as training targets, much as internet text is in next-token prediction.

With GRPO, models generate hundreds to thousands of rollouts per task to solve the credit assignment problem, versus a human student who might practice a problem once or twice.
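
The rollout-and-score loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any lab's training code: `toy_policy`, `verifier`, `score_group`, and the guess-a-number task are hypothetical stand-ins. The only piece taken from the published GRPO recipe is the group-relative advantage, where each rollout's reward is normalized by the mean and standard deviation of its own group.

```python
import random
import statistics


def verifier(answer: int, target: int) -> float:
    """Hypothetical verifier: reward 1.0 for a correct answer, else 0.0."""
    return 1.0 if answer == target else 0.0


def toy_policy(target: int, rng: random.Random) -> int:
    """Hypothetical stand-in for sampling one rollout from a model."""
    return target + rng.choice([-1, 0, 0, 1])


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every rollout scored the same, so the group carries no signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def score_group(target: int, group_size: int, seed: int = 0):
    """Sample a group of rollouts for one task and score them."""
    rng = random.Random(seed)
    rollouts = [toy_policy(target, rng) for _ in range(group_size)]
    rewards = [verifier(answer, target) for answer in rollouts]
    return rollouts, rewards, group_relative_advantages(rewards)


rollouts, rewards, advantages = score_group(target=42, group_size=8)
for answer, reward, adv in zip(rollouts, rewards, advantages):
    print(f"answer={answer} reward={reward} advantage={adv:+.2f}")
```

Rollouts with a positive advantage are the ones reinforced. A single sample has no baseline to be compared against, which is one reason the group has to be large.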

The Data Industry Powering Frontier Models

Companies like Mercor and Scale AI post listings for Word document specialists, legal M&A diligence writers, and management consultants, which illustrates how domain-specific and bespoke expert training data must be.

Each skill requires at least hundreds of human experts generating example completions, writing rubrics, and explaining chain-of-thought reasoning.

The data labeling and RL environment industry is already earning billions per year in revenue, with Dwarkesh projecting it will reach 'decabillions' soon.

"The correct way to think about these models is not like a human who has learned all these different skills... It's more like a Frankenstein's monster, which has been built out of a billion graphs of carefully constructed examples all sewn together." - Dwarkesh

The Million-Fold Data Gap: Humans vs. AI Models

A human absorbing ~2,000 words per hour from birth to adulthood accumulates roughly 200 million tokens; frontier models train on tens to hundreds of trillions of tokens — a gap of close to one million times.
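
The arithmetic behind that comparison is easy to reproduce. The sketch below assumes 16 waking hours per day, 18 years to adulthood, and one word counted as roughly one token; none of these are stated in the notes, and the 20 and 200 trillion figures are just illustrative endpoints for "tens to hundreds of trillions".

```python
WORDS_PER_HOUR = 2_000        # figure quoted in the notes
WAKING_HOURS_PER_DAY = 16     # assumption
YEARS_TO_ADULTHOOD = 18       # assumption

# Treat one word as roughly one token (assumption).
human_tokens = WORDS_PER_HOUR * WAKING_HOURS_PER_DAY * 365 * YEARS_TO_ADULTHOOD

# Illustrative endpoints for "tens to hundreds of trillions of tokens".
model_tokens_low = 20e12
model_tokens_high = 200e12

gap_low = model_tokens_low / human_tokens
gap_high = model_tokens_high / human_tokens

print(f"human exposure: {human_tokens:,} tokens")
print(f"gap: {gap_low:,.0f}x to {gap_high:,.0f}x")
```

Under these assumptions the human total lands just above 200 million, and the gap only approaches a million-fold at the top of the range.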

A teenager learns to drive in about 20 hours of practice; even counting the 16 years of general experience of the world that come first, that is still 3–4 orders of magnitude less data than Waymo and Tesla use for self-driving training.

Humans can learn to teleoperate any robot arm within hours; current AI systems cannot perform complex open-ended robotic tasks even with millions of hours of demonstrations collected.

Deaf individuals who consume far fewer than 200 million language tokens still develop general intelligence, suggesting sensory data volume is not the source of human cognitive efficiency.

Why Three Common Objections to This Gap Fall Short

Objection 1 — Evolution pre-trained us: The human genome is only 3 gigabytes, with 1–2% protein-coding, which is not enough storage for a pre-trained neural network. Evolution more likely found the right hyperparameters and loss functions, not the weights themselves.

Even granting the evolution argument, it does not explain why every new marginal capability still requires enormous amounts of new data — unlike humans, who don't need 100 professors to learn a new programming language after being educated once.
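
The storage argument in Objection 1 can be checked with back-of-envelope arithmetic. The sketch assumes a genome of roughly 3 billion base pairs, which gives the "3 gigabytes" figure at one byte per base and less if bases are packed at 2 bits each, and it assumes 16-bit weights. The 5 trillion parameter count is the figure these notes give for frontier models.

```python
BASE_PAIRS = 3_000_000_000        # approximate length of the human genome
PROTEIN_CODING_FRACTION = 0.02    # upper end of the 1-2% quoted above

genome_bytes_as_text = BASE_PAIRS * 1       # one byte per base: ~3 GB
genome_bytes_packed = BASE_PAIRS * 2 // 8   # four possible bases fit in 2 bits
coding_bytes = genome_bytes_as_text * PROTEIN_CODING_FRACTION

MODEL_PARAMS = 5_000_000_000_000  # "around 5 trillion parameters" per the notes
BYTES_PER_PARAM = 2               # assumption: 16-bit weights
model_bytes = MODEL_PARAMS * BYTES_PER_PARAM

print(f"genome, one byte per base: {genome_bytes_as_text / 1e9:.2f} GB")
print(f"genome, 2 bits per base:   {genome_bytes_packed / 1e9:.2f} GB")
print(f"protein-coding portion:    {coding_bytes / 1e6:.0f} MB")
print(f"model weights:             {model_bytes / 1e12:.0f} TB")
print(f"weights / genome:          {model_bytes / genome_bytes_as_text:,.0f}x")
```

Even on the generous one-byte-per-base count, the weights of such a model would be thousands of times larger than the whole genome.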

Objection 2 — Multimodal sensory data: Including all sensory input from birth yields tens to hundreds of billions of tokens, but blind and deaf people still develop general intelligence, undermining the claim that sensory volume explains human efficiency.

Objection 3 — Just scale bigger: The Chinchilla scaling law puts parameter count and data into the loss as separate additive terms, so adding parameters shrinks one term but cannot substitute for the data term. Even with infinite parameters, required data would only decrease by a factor of ~10, while humans are thousands to millions of times more sample-efficient.

Frontier models currently sit around 5 trillion parameters; the human brain has ~100 trillion synapses — 1–2 orders of magnitude larger — yet scaling alone cannot bridge the efficiency gap.
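
The scaling argument in Objection 3 can be made concrete with the parametric fit from the Chinchilla paper, L(N, D) = E + A / N^alpha + B / D^beta, where N is the parameter count and D the number of training tokens. The sketch below is one reading of the "factor of ~10" claim, since the notes do not show the calculation: it asks how many tokens an infinitely large model would need to match the loss of a finite one. The constants are the fitted values reported in the paper (Hoffmann et al., 2022); the function names are hypothetical.

```python
# Fitted constants reported in the Chinchilla paper (Hoffmann et al., 2022):
# L(N, D) = E + A / N**ALPHA + B / D**BETA
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def tokens_with_infinite_params(n_params: float, n_tokens: float) -> float:
    """Tokens D' at which E + B / D'**BETA equals loss(n_params, n_tokens)."""
    reducible = loss(n_params, n_tokens) - E
    return (B / reducible) ** (1 / BETA)


# Chinchilla's own budget: 70B parameters trained on 1.4T tokens.
n, d = 70e9, 1.4e12
saving_chinchilla = d / tokens_with_infinite_params(n, d)

# Minimizing this fit at fixed compute (proportional to N * D) gives
# (A / N**ALPHA) / (B / D**BETA) = BETA / ALPHA, so the saving has a
# closed form that does not depend on the size of the budget.
saving_compute_optimal = (1 + BETA / ALPHA) ** (1 / BETA)

print(f"saving at Chinchilla's budget:        {saving_chinchilla:.1f}x")
print(f"saving at the fit's compute optimum:  {saving_compute_optimal:.1f}x")
```

Both numbers come out as single-digit multiples. Because the data term B / D^beta has no N in it, no amount of extra parameters can push the saving past that, which is small against a gap of thousands to millions.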

Why Sample Inefficiency Doesn't Block White-Collar Automation

The labs' bet on white-collar automation rests on the fact that tasks such as software engineering, analysis, and accounting are common enough to be brought into the training distribution at scale.

"AIs can learn these skills by firehosing gigawatts of training at a time, and what they learn can be amortized across billions of sessions at once. So we can be ludicrously inefficient in training them up and still be wildly in the green." - Dwarkesh

Human lifespan limits the quantity and breadth of training any individual can receive; AI has no such constraint, making even wildly inefficient training economically rational.

Dwarkesh predicts overall demand for human software engineers will be higher in 2027 than today, driven by AI as a complementary input rather than a direct substitute.

The Path to AGI: Automating AI Research Itself

For jobs requiring frequent out-of-distribution thinking, the labs' plan is to first automate AI research, then have automated AI researchers solve the sample efficiency problem.

Epoch AI reports open models lag state-of-the-art frontier models by approximately 4 months — a relatively small gap that Dwarkesh attributes to data being the primary driver of progress and being easily distillable from public APIs.

"The way that people currently think about an intelligence explosion is very clumsy — either people dismiss the possibility of AI speeding up AI progress altogether, or they assume that some kind of god pops out the other end." - Dwarkesh

Dwarkesh teases a future blog post reasoning carefully about what accelerated AI progress looks like when built atop the specific architecture and limitations of LLMs, rather than assuming a discontinuous leap.


