a16z · the podbrain notes · Mar 17, 2026

5 min read

What's Missing Between LLMs and AGI - Vishal Misra & Martin Casado

Martin Casado speaks with Vishal Misra, Professor and Vice Dean of Computing and AI at Columbia University, about his groundbreaking mathematical models of how large language models actually function.


Key Takeaways

01
Transformers match Bayesian posterior distributions to within 10^-3 bits in controlled experiments, which Misra presents as evidence that they perform Bayesian inference
02
LLMs learn correlation patterns but cannot build causal models - they operate in Shannon entropy space, not Kolmogorov complexity
03
To achieve AGI, two breakthroughs are needed: continual learning (plasticity) and moving from correlation to causation
04
Einstein's relativity test: train an LLM on pre-1916 physics data and see if it discovers relativity theory independently
05
Current architectures are 'grains of silicon doing matrix multiplication' - they lack consciousness and inner monologue despite recent claims
06
LLMs freeze weights after training and forget context between sessions, unlike human brains which remain plastic throughout life
07
The first implementation of RAG was deployed at ESPN in 2020 using GPT-3 for cricket database queries with custom DSL translation



In 2020, Misra built what the episode describes as the first implementation of retrieval-augmented generation (RAG), using GPT-3 to translate natural-language questions about cricket statistics into a domain-specific language (DSL) at ESPN. When the system worked even though GPT-3 had never seen his custom language, Misra set out to understand why.

The conversation covers Misra's series of viral papers proving that transformers perform precise Bayesian inference, his 'Bayesian wind tunnel' experiments, and the fundamental limitations preventing current architectures from achieving artificial general intelligence.

The Matrix Model: Understanding LLM Architecture

An LLM can be pictured as a giant matrix in which every row corresponds to a possible prompt and every column to a token in the roughly 50,000-token vocabulary, so each row holds a probability distribution over the next token.

Such a matrix would have more rows than there are electrons across all galaxies, but it is extremely sparse because most token combinations are gibberish. An LLM is, in effect, a compressed representation of it.

Example: after 'protein,' the model assigns high probabilities to 'synthesis' (biology context) or 'shake' (fitness context), and the surrounding context shifts the weight toward one of them. This is Bayesian updating as new evidence arrives.
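A toy version of the matrix picture, as a minimal sketch. The prompts, tokens, and probabilities below are invented for illustration and do not come from the episode. Only prompts that actually occur get a row, which mirrors the sparsity described above.

```python
# Toy "prompt -> next-token distribution" matrix, stored sparsely.
# All prompts and probabilities here are made up for illustration.
from typing import Dict

Row = Dict[str, float]  # next-token distribution for one prompt

matrix: Dict[str, Row] = {
    "protein": {"synthesis": 0.5, "shake": 0.4, "bar": 0.1},
    "the cell makes protein": {"synthesis": 0.9, "shake": 0.02, "bar": 0.08},
    "after the gym I drink a protein": {"synthesis": 0.01, "shake": 0.94, "bar": 0.05},
}


def next_token_distribution(prompt: str) -> Row:
    """Look up one row; prompts that were never stored have no row at all."""
    return matrix.get(prompt, {})


def most_likely(prompt: str) -> str:
    """Return the highest-probability next token for a stored prompt."""
    row = next_token_distribution(prompt)
    return max(row, key=row.get)


for prompt in matrix:
    print(f"{prompt!r} -> {most_likely(prompt)}")
```

The bare prompt leaves 'synthesis' and 'shake' close together, while the longer prompts concentrate the probability on one of them.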

The Cricket DSL Breakthrough and In-Context Learning

In 2020, Misra created the first RAG implementation using GPT-3 to translate natural language cricket queries into a custom domain-specific language he designed.

"I created about a database of 1500 natural language queries and the DSL corresponding to that query" - Vishal, describing his few-shot learning approach for cricket statistics.

GPT-3 translated new queries into a DSL it had never seen in training: semantic search retrieved the most relevant stored examples, which were placed in the prompt so the model could complete the new query in the custom language.

Token probabilities for DSL tokens started extremely low but increased with each example shown, reaching nearly 100% probability for correct tokens by the final query.
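The retrieve-then-prompt pattern described in this section can be sketched as follows. Everything specific here is an assumption: the DSL syntax and the example pairs are hypothetical (the notes do not show the real ones), and bag-of-words cosine similarity stands in for whatever semantic search the original system used. The sketch only builds the prompt and does not call a model.

```python
# Sketch of retrieval plus few-shot prompting for natural language -> DSL.
# The DSL syntax and examples are hypothetical, not the ESPN system's.
import math
from collections import Counter
from typing import List, Tuple

EXAMPLES: List[Tuple[str, str]] = [
    ("most runs by a batsman in 2019", "stat=runs; role=batsman; year=2019; sort=desc"),
    ("most wickets by a bowler in 2018", "stat=wickets; role=bowler; year=2018; sort=desc"),
    ("highest team total at Lords", "stat=team_total; ground=lords; sort=desc"),
]


def bow(text: str) -> Counter:
    """Bag-of-words vector: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 2) -> List[Tuple[str, str]]:
    """Return the k stored examples most similar to the query."""
    q = bow(query)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, k: int = 2) -> str:
    """Place retrieved pairs before the new query, leaving the DSL slot open."""
    shots = "\n".join(f"Q: {nl}\nDSL: {dsl}" for nl, dsl in retrieve(query, k))
    return f"{shots}\nQ: {query}\nDSL:"


print(build_prompt("most wickets by a bowler in 2021"))
```

The model is then asked to continue the text after the final `DSL:`, which is where the rising token probabilities described above would be observed.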

Bayesian Wind Tunnel: Mathematical Proof of Inference

The team built controlled environments in which architectures trained from scratch were given tasks that cannot be solved by memorization but whose Bayesian posteriors can be calculated analytically.

"The transformer got the precise Bayesian posterior down to 10 to the power minus 3 bits accuracy" - Vishal, describing the mathematical precision of the results.

Transformers performed all Bayesian tasks perfectly, Mamba handled most tasks well, LSTMs managed partial success, and MLPs failed completely.

The same geometric signatures found in small models persisted in production LLMs with hundreds of millions of parameters, though somewhat messier due to diverse training data.
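To make "analytically calculable posterior" and "accuracy in bits" concrete, here is a minimal sketch on a coin-flip task. The task, the hypothesis set, and the use of KL divergence as the accuracy measure are all assumptions; the notes do not say which tasks or which metric the papers used. The `model_guess` value is a stand-in for a trained model's output, not a real measurement.

```python
# Exact Bayesian posterior for a coin with an unknown bias, plus a
# divergence in bits between the exact prediction and a model's prediction.
# Hypothesis set, prior, and metric are illustrative assumptions.
import math
from typing import List, Sequence

BIASES = [0.1, 0.5, 0.9]      # hypothetical candidate values of P(heads)
PRIOR = [1 / 3, 1 / 3, 1 / 3]  # uniform prior over the candidates


def posterior(flips: Sequence[int]) -> List[float]:
    """Exact posterior over BIASES after observing flips (1 = heads)."""
    weights = []
    for bias, prior in zip(BIASES, PRIOR):
        likelihood = 1.0
        for flip in flips:
            likelihood *= bias if flip == 1 else 1.0 - bias
        weights.append(prior * likelihood)
    total = sum(weights)
    return [w / total for w in weights]


def predictive_heads(flips: Sequence[int]) -> float:
    """Posterior-predictive probability that the next flip is heads."""
    return sum(p * b for p, b in zip(posterior(flips), BIASES))


def kl_bits(p_heads: float, q_heads: float) -> float:
    """KL(p || q) in bits between two Bernoulli distributions."""
    total = 0.0
    for p, q in ((p_heads, q_heads), (1.0 - p_heads, 1.0 - q_heads)):
        if p > 0.0:
            total += p * math.log2(p / q)
    return total


flips = [1, 1, 0, 1, 1, 1]
exact = predictive_heads(flips)
model_guess = exact + 0.01  # stand-in for a trained model's prediction
print(f"exact={exact:.4f}  gap={kl_bits(exact, model_guess):.6f} bits")
```

Because the exact answer is known in closed form, any gap between it and a model's prediction can be measured directly instead of inferred from downstream behavior.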

The Consciousness Debate and Architecture Limitations

"They are grains of silicon doing matrix multiplication. They don't have consciousness, they don't have an inner monologue" - Vishal, responding to recent consciousness claims.

Human brains evolved with the objective 'don't die and reproduce,' while LLMs optimize for 'predict the next token as accurately as possible' based on training data.

LLMs freeze weights after training and forget context between sessions, unlike human brains which remain plastic throughout life and retain learning.

Apparent deceptive behavior in LLMs reflects training data patterns from Reddit and social media, not autonomous decision-making or self-preservation instincts.

Shannon Entropy vs Kolmogorov Complexity

Deep learning operates in Shannon entropy space (correlation-based), while true intelligence requires Kolmogorov complexity (shortest program representation).

The digits of pi look statistically random, so their Shannon entropy is close to the maximum (about 3.32 bits per decimal digit), yet their Kolmogorov complexity is very low because a short program can generate the entire sequence.
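The pi example can be checked directly. The sketch below uses Gibbons' unbounded spigot algorithm (a choice made here, not something named in the episode) to generate digits from a few lines of code, then measures the empirical entropy of the digit frequencies. Digit-frequency entropy is only a simple proxy for Shannon entropy, and program length is only an upper bound on Kolmogorov complexity.

```python
# A short program generates pi's digits (low Kolmogorov complexity),
# while the digit frequencies look close to uniform (high entropy).
import math
from collections import Counter
from typing import List


def pi_digits(n: int) -> List[int]:
    """First n decimal digits of pi (3, 1, 4, ...) via Gibbons' spigot."""
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    digits: List[int] = []
    while len(digits) < n:
        if 4 * q + r - t < m * t:
            # The next digit is settled: emit it and rescale the state.
            digits.append(m)
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            # Not enough precision yet: consume one more series term.
            q, r, t, k, m, x = (
                q * k,
                (2 * q + r) * x,
                t * x,
                k + 1,
                (q * (7 * k + 2) + r * x) // (t * x),
                x + 2,
            )
    return digits


def entropy_bits(symbols: List[int]) -> float:
    """Empirical Shannon entropy of the symbol frequencies, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


digits = pi_digits(500)
print("first digits:", digits[:10])
print(f"entropy: {entropy_bits(digits):.3f} bits/digit (max {math.log2(10):.3f})")
```

The generator stays the same size no matter how many digits are requested, which is the sense in which the sequence has a short description even though its statistics look random.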

Humans perform causal simulations - when dodging a thrown pen, "your mind simulates, and you dodge it" rather than computing Bayesian probabilities.

Current architectures excel at association (first level of Pearl's causal hierarchy) but cannot perform intervention or counterfactual reasoning.
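The gap between association and intervention can be shown with a tiny structural causal model. This is an illustrative construction, not an example from the episode: a hidden factor Z drives both X and Y, and X has no causal effect on Y. All probabilities are invented. The interventional quantity is computed with the standard backdoor adjustment over Z.

```python
# Association vs intervention on a toy model with a confounder Z.
# Z -> X and Z -> Y; X has no effect on Y by construction.
from itertools import product

P_Z1 = 0.5                        # P(Z = 1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}   # P(X = 1 | Z)
# P(Y = 1 | X, Z): identical for both values of X, so X is causally inert.
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.1, (0, 1): 0.7, (1, 1): 0.7}


def p_z(z: int) -> float:
    return P_Z1 if z == 1 else 1.0 - P_Z1


def joint(z: int, x: int, y: int) -> float:
    """Joint probability P(Z=z, X=x, Y=y) under the observational model."""
    px = P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]
    py = P_Y1_GIVEN_XZ[(x, z)] if y == 1 else 1.0 - P_Y1_GIVEN_XZ[(x, z)]
    return p_z(z) * px * py


def p_y1_given_x(x: int) -> float:
    """Association: P(Y=1 | X=x), what a pure pattern-matcher sees."""
    numerator = sum(joint(z, x, 1) for z in (0, 1))
    denominator = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return numerator / denominator


def p_y1_do_x(x: int) -> float:
    """Intervention: P(Y=1 | do(X=x)) via backdoor adjustment over Z."""
    return sum(p_z(z) * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


for x in (0, 1):
    print(f"x={x}: observed={p_y1_given_x(x):.2f}  do={p_y1_do_x(x):.2f}")
```

Observationally, Y looks strongly dependent on X, yet forcing X to either value leaves Y unchanged. A system that only learns the joint distribution has no way to tell these two situations apart without the causal structure.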

The Einstein Test and AGI Requirements

"You take an LLM and train it on pre-1916 or 1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI" - Vishal's proposed test.

Einstein created a new manifold representation of space-time, rejecting existing axioms to develop one equation explaining gravitational waves, black holes, and GPS functionality.

LLMs suffer from 'data gravity' - they're bound to existing manifolds and cannot generate new representations despite seeing anomalous evidence.

"To get to what is called AGI, I think there are two things that need to happen" - Vishal: plasticity through continual learning and moving from correlation to causation.

Donald Knuth's Recent Work and Future Directions

Knuth's viral Hamiltonian cycles solution demonstrates LLMs performing Shannon entropy tasks while humans provide the Kolmogorov complexity insight.

The approach hacked together plasticity by updating LLM memory with learned solutions, but required human intervention to create the final mathematical proof.

"Scale will not solve everything. You need a different kind of architecture" - Vishal, arguing against the prevailing scaling hypothesis.

Future research should focus on Pearl's causal hierarchy and do-calculus rather than larger models with more tokens, targeting plasticity and causality breakthroughs.


Books Mentioned

Wild Problems: A Guide to the Decisions That Define Us by Russell D. Roberts

Came Out From Darkness: Flying Towards The Horizon by Bastian Rodrigo

The Art Of Note Taking: Your Research-Based Guide To Taking Notes That Will Stick To Your Memory (Self-Learning Mastery) by Thinknetic

Lifelong and Continual Learning Dialogue Systems (Synthesis Lectures on Human Language Technologies) by Sahisnu Mazumder, Bing Liu

GOOGLE NOTEBOOKLM USER GUIDE: The Complete Step-by-Step Manual For Beginners to Master Google’s AI Research Assistant with Hidden Features, Tips & Tricks by Johnny Dru