Martin Casado speaks with Vishal Misra, Professor and Vice Dean of Computing and AI at Columbia University, about his groundbreaking mathematical models of how large language models actually function.
Five years ago, Misra built what the episode describes as the first implementation of retrieval-augmented generation (RAG), using GPT-3 to translate natural language into a domain-specific language for cricket statistics at ESPN. When the system worked even though GPT-3 had never seen his custom language, Misra set out to understand why.
The conversation covers Misra's series of viral papers proving that transformers perform precise Bayesian inference, his 'Bayesian wind tunnel' experiments, and the fundamental limitations preventing current architectures from achieving artificial general intelligence.
The Matrix Model: Understanding LLM Architecture
LLMs can be conceptualized as a giant matrix in which every row corresponds to a possible prompt and holds a probability distribution over the roughly 50,000-token vocabulary, with one column per token.
The matrix would have more rows than there are electrons across all galaxies, but it is extremely sparse because most token combinations are gibberish, so LLMs learn a compressed representation of it.
Example: after 'protein,' the model assigns high probabilities to 'synthesis' (biology context) or 'shake' (fitness context); as more context arrives, probability mass shifts toward one reading, which Misra frames as Bayesian updating on new evidence.
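The matrix picture can be made concrete with a toy sketch. The prompts, tokens, and probabilities below are invented for illustration and do not come from the episode or from any real model; the only point is that each row is a distribution over next tokens, and that adding context selects a different row.

```python
# Toy "matrix" view of an LLM: each key is a prompt (a row), each value is
# that row's probability distribution over next tokens.
# All prompts and numbers here are made up for illustration.
TOY_MATRIX = {
    ("protein",): {"synthesis": 0.45, "shake": 0.40, "folding": 0.15},
    ("gym", "protein"): {"shake": 0.85, "synthesis": 0.05, "folding": 0.10},
    ("ribosome", "protein"): {"synthesis": 0.80, "folding": 0.18, "shake": 0.02},
}


def next_token_distribution(prompt):
    """Return the stored row for a prompt, or None if no row exists.

    Almost every possible prompt has no stored row, which mirrors the
    sparsity of the conceptual matrix.
    """
    return TOY_MATRIX.get(tuple(prompt))


def most_likely(prompt):
    """Return the highest-probability next token for a prompt, if any."""
    row = next_token_distribution(prompt)
    return None if row is None else max(row, key=row.get)
```

In this sketch, the bare prompt "protein" leaves the two readings nearly tied, while a preceding "gym" or "ribosome" moves most of the probability mass to one of them. A real model does not store rows explicitly; it computes them from a compressed representation.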
The Cricket DSL Breakthrough and In-Context Learning
In 2020, Misra created the first RAG implementation using GPT-3 to translate natural language cricket queries into a custom domain-specific language he designed.
"I created about a database of 1500 natural language queries and the DSL corresponding to that query" - Vishal, describing his few-shot learning approach for cricket statistics.
The system used semantic search to retrieve the most relevant examples from that database and placed them in the prompt; GPT-3 then completed new queries in the custom DSL, a language it had never seen in training.
Token probabilities for DSL tokens started extremely low but increased with each example shown, reaching nearly 100% probability for correct tokens by the final query.
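The retrieve-then-prompt pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Misra's system: the example pairs and the DSL syntax are hypothetical, and a bag-of-words cosine similarity stands in for the semantic search he used. The resulting prompt would be sent to a language model for completion.

```python
import math
from collections import Counter

# Hypothetical (natural-language query, DSL) pairs. The DSL syntax is
# invented for this sketch and is not the language used at ESPN.
EXAMPLES = [
    ("most runs by a batsman in test matches", "stat=runs; format=test; sort=desc"),
    ("most wickets by a bowler in odi matches", "stat=wickets; format=odi; sort=desc"),
    ("highest batting average in t20 matches", "stat=average; format=t20; sort=desc"),
]


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a semantic embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, k=2):
    """Return the k stored examples most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        EXAMPLES, key=lambda ex: cosine(q, bag_of_words(ex[0])), reverse=True
    )
    return ranked[:k]


def build_prompt(query, k=2):
    """Assemble a few-shot prompt: retrieved examples, then the new query."""
    blocks = [f"Query: {nl}\nDSL: {dsl}" for nl, dsl in retrieve(query, k)]
    blocks.append(f"Query: {query}\nDSL:")
    return "\n\n".join(blocks)
```

For instance, `build_prompt("most wickets by a bowler in t20 matches")` places the two closest stored pairs ahead of the new query and ends with an open `DSL:` line for the model to complete.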
Bayesian Wind Tunnel: Mathematical Proof of Inference
The team created controlled environments in which freshly initialized (blank) architectures were trained on tasks that cannot be memorized but whose Bayesian posteriors can be calculated analytically.
"The transformer got the precise Bayesian posterior down to 10 to the power minus 3 bits accuracy" - Vishal, describing the mathematical precision of the results.
Transformers performed all Bayesian tasks perfectly, Mamba handled most tasks well, LSTMs managed partial success, and MLPs failed completely.
The same geometric signatures found in small models persisted in production LLMs with hundreds of millions of parameters, though somewhat messier due to diverse training data.
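The idea behind the wind tunnel, comparing a model's predictions against a posterior that can be computed exactly, can be illustrated with a much simpler task than those in the papers. The sketch below assumes a coin with unknown bias and a Beta prior, for which the posterior predictive has a closed form, and measures the gap between a predictor and that exact answer in bits. The task, the function names, and the frequency baseline are all assumptions chosen for illustration.

```python
import math


def bayes_predictive(observations, alpha=1.0, beta=1.0):
    """Exact posterior predictive P(next = 1) for coin flips.

    Assumes a Bernoulli likelihood with a Beta(alpha, beta) prior; with the
    default uniform prior this is Laplace's rule of succession.
    """
    heads = sum(observations)
    n = len(observations)
    return (heads + alpha) / (n + alpha + beta)


def kl_bits(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)) in bits.

    Requires 0 < q < 1 whenever the matching term of p is nonzero.
    """
    total = 0.0
    for p_i, q_i in ((p, q), (1.0 - p, 1.0 - q)):
        if p_i > 0:
            total += p_i * math.log2(p_i / q_i)
    return total


def frequency_predictor(observations):
    """A non-Bayesian baseline: the raw frequency, clipped away from 0 and 1."""
    if not observations:
        return 0.5
    freq = sum(observations) / len(observations)
    return min(max(freq, 1e-6), 1.0 - 1e-6)


flips = [1, 1, 0, 1]
exact = bayes_predictive(flips)          # (3 + 1) / (4 + 2)
baseline = frequency_predictor(flips)    # 3 / 4
gap_bits = kl_bits(exact, baseline)      # how far the baseline is from exact
```

In a wind-tunnel-style evaluation, the baseline would be replaced by a trained network's output probability, and a gap close to zero bits across many sequences would indicate that the network is reproducing the Bayesian posterior.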
The Consciousness Debate and Architecture Limitations
"They are grains of silicon doing matrix multiplication. They don't have consciousness, they don't have an inner monologue" - Vishal, responding to recent consciousness claims.
Human brains evolved with the objective 'don't die and reproduce,' while LLMs optimize for 'predict the next token as accurately as possible' based on training data.
LLMs freeze weights after training and forget context between sessions, unlike human brains which remain plastic throughout life and retain learning.
Apparent deceptive behavior in LLMs reflects training data patterns from Reddit and social media, not autonomous decision-making or self-preservation instincts.
Shannon Entropy vs Kolmogorov Complexity
Deep learning operates in Shannon entropy space (correlation-based), while true intelligence requires Kolmogorov complexity (shortest program representation).
The digits of pi look statistically random, so their Shannon entropy is high (they appear unpredictable), yet their Kolmogorov complexity is very low: a short program can generate the entire sequence.
Humans perform causal simulations: when dodging a thrown pen, "your mind simulates, and you dodge it" rather than computing Bayesian probabilities.
Current architectures excel at association (first level of Pearl's causal hierarchy) but cannot perform intervention or counterfactual reasoning.
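The pi example can be checked directly. The sketch below generates digits with Gibbons' unbounded spigot algorithm (a well-known short program, chosen here as an illustration and not mentioned in the episode) and measures the empirical per-digit Shannon entropy, treating each digit as an independent symbol. That independence assumption is a simplification.

```python
import math
from collections import Counter
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, ...) via Gibbons' spigot."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled; emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough information yet; consume another term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


def empirical_entropy_bits(symbols):
    """Shannon entropy in bits per symbol, from observed symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


digits = list(islice(pi_digits(), 1000))
entropy = empirical_entropy_bits(digits)
max_entropy = math.log2(10)  # upper bound for ten equally likely digits
```

The measured entropy comes out close to the maximum of log2(10) bits per digit, so a frequency-based view sees the digits as essentially random. The generator above, by contrast, is only a few lines long, which is the sense in which the sequence has low Kolmogorov complexity.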
The Einstein Test and AGI Requirements
"You take an LLM and train it on pre-1916 or 1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI" - Vishal's proposed test.
Einstein created a new manifold representation of space-time, rejecting existing axioms to develop one equation explaining gravitational waves, black holes, and GPS functionality.
LLMs suffer from 'data gravity' - they're bound to existing manifolds and cannot generate new representations despite seeing anomalous evidence.
"To get to what is called AGI, I think there are two things that need to happen" - Vishal: plasticity through continual learning and moving from correlation to causation.
Donald Knuth's Recent Work and Future Directions
Knuth's viral Hamiltonian cycles solution demonstrates LLMs performing Shannon entropy tasks while humans provide the Kolmogorov complexity insight.
The approach hacked together plasticity by updating LLM memory with learned solutions, but required human intervention to create the final mathematical proof.
"Scale will not solve everything. You need a different kind of architecture" - Vishal, arguing against the prevailing scaling hypothesis.
Future research should focus on Pearl's causal hierarchy and do-calculus rather than larger models with more tokens, targeting plasticity and causality breakthroughs.
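The gap between association and intervention in Pearl's hierarchy can be shown with a small simulation. The structural model below is hypothetical and chosen only for illustration: a hidden variable Z drives both X and Y, and X has no causal effect on Y. Conditioning on X (association) and forcing X (intervention) then give different answers.

```python
import random


def sample(rng, do_x=None):
    """Draw (x, y) from a toy model in which Z causes both X and Y.

    If do_x is given, X is set by intervention and ignores Z, which is the
    effect of do(X = do_x) in this model.
    """
    z = rng.random() < 0.5
    if do_x is None:
        x = z if rng.random() < 0.9 else not z  # X copies Z 90% of the time
    else:
        x = do_x
    y = z if rng.random() < 0.9 else not z      # Y copies Z 90% of the time
    return x, y


def estimate(n=50_000, seed=0):
    """Estimate P(Y=1 | X=1) and P(Y=1 | do(X=1)) by simulation."""
    rng = random.Random(seed)
    observed = [y for x, y in (sample(rng) for _ in range(n)) if x]
    forced = [y for _, y in (sample(rng, do_x=True) for _ in range(n))]
    return sum(observed) / len(observed), sum(forced) / len(forced)
```

In this model, the exact values are P(Y=1 | X=1) = 0.9 * 0.9 + 0.1 * 0.1 = 0.82 and P(Y=1 | do(X=1)) = 0.5, and the simulation lands near both. A predictor that learns only from observed correlations would report the first number when asked what happens if X is set to 1.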
From a16z.