This conversation features Sebastian Raschka, author of Build a Large Language Model from Scratch and the upcoming Build a Reasoning Model from Scratch, alongside Nathan Lambert, post-training lead at the Allen Institute for AI and author of the definitive Reinforcement Learning from Human Feedback book. Both are respected machine learning researchers, engineers, and educators who combine technical depth with accessibility.
The discussion centers on the DeepSeek moment of January 2025, when the Chinese company released DeepSeek R1 with near state-of-the-art performance at allegedly much lower costs, intensifying global AI competition. They explore the current landscape of open versus closed models, the dominance of Chinese open-weight systems, and the technical breakthroughs in post-training methods.
Key topics include the evolution from pre-training to post-training focus, the rise of RLVR (Reinforcement Learning with Verifiable Rewards), the ongoing relevance of scaling laws, and practical considerations for AI education and career paths. They also examine the cultural dynamics of AI development, from Silicon Valley's intense work culture to the broader implications for human civilization.
The DeepSeek Moment and International AI Competition
DeepSeek R1's January 2025 release achieved near state-of-the-art performance at an alleged training cost of $5 million, compared with hundreds of millions for comparable Western models, shocking the AI community and accelerating competition.
Chinese companies like DeepSeek, Qwen, Kimi, and MiniMax are releasing powerful open-weight models while US companies increasingly keep their best models closed, creating a strategic imbalance in global AI development.
"I don't think nowadays, 2026, that there will be any company who [wins everything] because researchers are frequently changing jobs, changing labs" - Sebastian, noting that ideas flow freely but resources and hardware create differentiation.
The business model difference is crucial: Chinese companies use open models to gain international influence since US companies won't pay for Chinese API subscriptions due to security concerns, while US companies monetize through subscriptions.
Architecture Evolution: From GPT-2 to Modern Transformers
Despite rapid advancement, fundamental architectures remain remarkably similar to GPT-2, with most changes being incremental tweaks like mixture of experts, different attention mechanisms, and normalization layers.
The transformer architecture from "Attention Is All You Need" established the encoder-decoder structure, with GPT focusing on just the decoder part for autoregressive text generation, forming the basis for all current language models.
Mixture of experts (MoE) gives a model multiple specialized feed-forward networks, with a router selecting which experts process each token. Because only the selected experts run, the model packs in more knowledge while using less compute per token during inference.
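The routing idea described above can be illustrated with a minimal sketch. This is not any particular model's implementation: the toy experts (simple scaling functions standing in for feed-forward networks), the hand-picked router weights, and the names `moe_forward` and `make_expert` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(scale):
    """Toy stand-in for a feed-forward network: scales every feature."""
    return lambda x: [scale * v for v in x]

def moe_forward(x, router_weights, experts, top_k=2):
    """Route one token vector x through its top_k highest-scoring experts."""
    # One router logit per expert: dot product of that expert's row with x.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    # Keep only the top_k experts; the others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, chosen

# Hypothetical setup: 4 experts, 2-dimensional token vectors.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, used = moe_forward([1.0, 2.0], router_weights, experts, top_k=2)
print(used)  # prints [1, 0]
```

With four experts and `top_k=2`, only half of the expert parameters are used for this token, which is the sense in which total capacity grows without a matching increase in per-token compute.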
"You can convert one from one, you can go from one into the other by just adding these changes, basically. This fundamentally is still the same architecture" - Sebastian on the continuity from GPT-2 to modern models.
Scaling Laws: Pre-training vs Post-training Dynamics
Pre-training scaling laws still hold, but the low-hanging fruit has been picked; the real excitement and gains are now in post-training with RLVR and inference-time compute scaling.
"Pre-training has gotten extremely expensive. I think to scale up pre-training, it's also implying that you're going to serve a very large model to the users" - Nathan on the economic constraints of scaling.
RLVR (Reinforcement Learning with Verifiable Rewards) shows log-linear scaling: each 10x increase in compute yields a roughly constant additive gain in performance, unlike RLHF, which plateaus quickly.
The cost structure differs dramatically: pre-training is a one-time expense that gives permanent capabilities, while inference scaling costs money per query but can be adjusted based on user demand and willingness to pay.
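The logarithmic scaling claim for RLVR in this section can be made concrete with a small sketch. The functional form (score grows linearly in the base-10 logarithm of compute) is one reading of the claim, and the constants `INTERCEPT` and `GAIN_PER_DECADE` are hypothetical values chosen for illustration, not measurements from the conversation.

```python
import math

def predicted_score(compute, intercept, gain_per_decade):
    """Score under an assumed log-linear law: linear in log10(compute)."""
    return intercept + gain_per_decade * math.log10(compute)

def compute_multiplier_for_gain(target_gain, gain_per_decade):
    """How many times more compute is needed for a given additive gain."""
    return 10 ** (target_gain / gain_per_decade)

# Hypothetical constants, for illustration only.
INTERCEPT = 10.0
GAIN_PER_DECADE = 8.0

# Each 10x step in compute adds the same number of points.
for compute in (1e2, 1e3, 1e4):
    print(compute, predicted_score(compute, INTERCEPT, GAIN_PER_DECADE))
```

Under this assumed law, doubling the desired gain squares the compute multiplier, which is why steady improvements remain possible but become progressively more expensive.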
Post-training Revolution: RLVR and Reasoning Models
RLVR enables models to learn through trial and error on verifiable tasks like math and coding, with training runs now lasting weeks and showing continuous improvement, unlike traditional RLHF, which plateaus.
"Just 50 steps, like in a few minutes with RLVR, the model went from 15% to 50% accuracy" - Sebastian demonstrating RLVR's rapid capability unlocking on math problems.
Inference-time scaling allows models to 'think' for minutes or hours before responding, generating hidden reasoning traces that dramatically improve accuracy on complex problems.
The training pipeline now consists of pre-training (knowledge acquisition), mid-training (specialized skills like long context), and post-training (capability unlocking through RLVR and RLHF).
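What makes a reward "verifiable" in RLVR can be sketched with a toy checker for math answers. The `Answer: <number>` output convention, the regular expression, and the function names are assumptions made for this example; real training setups use their own answer formats and checkers, and a full RLVR loop would also need a policy-update step that is not shown here.

```python
import re

# Assumed convention: the completion ends its reasoning with "Answer: <number>".
ANSWER_PATTERN = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)")

def verifiable_reward(completion, reference):
    """Return 1.0 if the last stated answer matches the reference, else 0.0."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return 0.0  # no parseable answer, so no reward
    return 1.0 if float(matches[-1]) == float(reference) else 0.0

def accuracy(completions, references):
    """Mean reward over a batch, i.e. the fraction of correct answers."""
    rewards = [verifiable_reward(c, r) for c, r in zip(completions, references)]
    return sum(rewards) / len(rewards)

samples = [
    "6 times 7 is 42. Answer: 42",
    "I think it is 41. Answer: 41",
    "Not sure how to solve this.",
    "First guess Answer: 40, corrected Answer: 42",
]
print(accuracy(samples, ["42", "42", "42", "42"]))  # prints 0.5
```

Because the reward comes from an automatic check rather than a learned preference model, the model can attempt many problems and be scored without human labeling, which is the property the trial-and-error description above relies on.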
The Coding Revolution and Developer Experience
Recent surveys show developers are shipping 50%+ AI-generated code, with senior developers more likely to use AI extensively than junior developers, suggesting expertise enables better AI utilization.
Claude Opus 4.5 has generated massive hype for coding tasks, with many finding it superior to other models for complex programming work and architectural decisions.
"I use basically half and half Cursor and Claude Code because I find them to be like fundamentally different experience and both useful" - Lex on the complementary nature of different AI coding tools.
The debate over learning and struggle continues: while AI makes coding faster, there's concern about junior developers missing fundamental learning experiences that come from working through problems independently.
Education and Learning in the AI Era
Build a Large Language Model from Scratch exemplifies the philosophy that building systems from scratch is the most effective way to understand them, providing hands-on experience with transformer architectures.
The educational challenge is finding the right balance between AI assistance and independent struggle - too much AI help prevents deep learning, while too little wastes time on mundane tasks.
Reinforcement Learning from Human Feedback addresses the philosophical complexities of preference optimization, explaining why RLHF is "never, ever fully solvable" due to the fundamental challenge of quantifying human preferences.
"I think there's a fun thing... losing my mind, that you use the router and the non-thinking model. I'm like, how do you live with that?" - Nathan on the importance of using reasoning models for complex tasks.
Silicon Valley Culture and the Future of Work
The 996 work culture (9am-9pm, 6 days/week) is becoming standard at frontier AI labs, with intense competition driving researchers to work extreme hours despite significant burnout risks.
Apple in China by Patrick McGee illustrates similar patterns of extreme work dedication, including marriage-saving programs for engineers working on supply chain development.
"My friends who are professors seem on average happier than my friends who work at a frontier lab" - Nathan observing the human cost of the AI race.
The Silicon Valley bubble creates both incredible productivity through reality distortion fields and dangerous disconnection from broader human experiences and perspectives worldwide.
Timelines and the Future of Human Civilization
AGI definitions remain contentious, but many converge on "a system that could reproduce most digital economic work" or the "remote worker" standard as a practical benchmark.
The superhuman coder milestone from AI safety frameworks may be achievable within years, but full automation faces the "jagged" nature of AI capabilities - excellent at some tasks, poor at others.
100 years from now, the current AI revolution will likely be remembered as part of the broader computing revolution, similar to how we view the Industrial Revolution's various mechanical innovations today.
"Humans do tend to find a way. I think that's what humans are built for is to have community and find a way to figure out problems" - Nathan expressing cautious optimism about navigating AI's challenges.
From Lex Fridman.