Are We Misreading the AI Exponential? Julian Schrittwieser on Move 37 & Scaling RL (Anthropic)
The episode features Julian Schrittwieser, a core contributor to DeepMind's legendary AlphaGo Zero and MuZero projects, now a key researcher at Anthropic, speaking with host Matt Turck from FirstMark.
01. "We are seeing this very consistent improvement over many many years where every say like you know 3 4 months is able to like do a task that is twice as long as before completely on its own" - Julian
02. By mid-2026, Julian extrapolates agents will work autonomously for a full day; by late 2026, at least one model matches industry experts across many occupations
03. "By 2027 2028, I think extremely likely that the models will be smart enough and capable enough to actually have that level of insight" for Nobel Prize-level discoveries - Julian
04. Move 37 from AlphaGo demonstrated AI creativity in 2016, playing an unexpected move that surprised professional Go players and ultimately won the game
05. "Pre-training on this vast data sets we have just brings us so much value that we would from a practical point of view not want to give it up" - Julian on combining pre-training with RL
06. "The main question is do these two trends balance each other out so that you know the AI makes us increasingly more productive" versus problems getting harder - Julian on AI discontinuity
07. Reinforcement learning training is much more unstable than supervised learning due to feedback cycles, requiring careful isolation of components during development
08. "If you manage to make everybody in society 10 times more productive you know what kind of abundance can we achieve?" - Julian on AI's economic potential
These notes may contain occasional inaccuracies. Learn how podbrain notes are made
Julian discusses his viral blog post 'Failing to Understand the Exponential Again,' explaining why AI bubble talk seems divorced from what's happening in frontier labs, where consistent progress continues without slowdown.
The conversation covers exponential AI trajectory predictions for 2026-2027, the evolution from AlphaGo to MuZero, the science behind AI creativity including the famous Move 37, and how reinforcement learning powers modern agentic systems.
Julian shares his journey from Austrian countryside to Google to DeepMind, his predictions for AI-driven Nobel Prize discoveries by 2027-2028, and Anthropic's approach to safety and alignment in an era of rapidly advancing capabilities.
Understanding AI's Exponential Trajectory
Julian wrote his blog post during a car ride in Kyrgyzstan after noticing AI bubble discussions seemed "very divorced from what was happening in frontier labs and what we were seeing" in terms of actual progress.
"We are seeing this very consistent improvement over many many years where every say like you know 3 4 months is able to like do a task that is twice as long as before completely on its own" - Julian, comparing the situation to early COVID exponential growth that people struggled to understand intuitively.
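The doubling claim above can be turned into a small back-of-the-envelope calculation. The sketch below is only an illustration of steady exponential growth: the function names and the 1-hour starting horizon are assumptions made for the example, not figures from the episode.

```python
import math


def task_horizon_hours(start_hours: float, months_elapsed: float,
                       doubling_months: float) -> float:
    """Task length handled autonomously after `months_elapsed`,
    assuming the horizon doubles every `doubling_months` months."""
    return start_hours * 2 ** (months_elapsed / doubling_months)


def months_until(target_hours: float, start_hours: float,
                 doubling_months: float) -> float:
    """Months needed to grow from `start_hours` to `target_hours`
    under the same steady-doubling assumption."""
    return doubling_months * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    # Hypothetical starting point: a 1-hour autonomous task horizon.
    start = 1.0
    for doubling in (3, 4):
        months = months_until(8.0, start, doubling)
        print(f"doubling every {doubling} months: "
              f"{months:.0f} months to reach an 8-hour task")
```

Under these assumptions, going from a 1-hour to an 8-hour task takes three doublings, so roughly 9 to 12 months at the quoted pace. The point of the sketch is how quickly a fixed doubling time compounds, not the specific starting value.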
Frontier labs show no slowdown in progress, with consistent improvement across many benchmarks and evaluations over multiple years, making current valuations at OpenAI, Anthropic, and Google "fairly conservative" according to Julian.
Julian suggests a possible bifurcation where frontier labs maintain solid trajectories with strong revenues while the wider AI ecosystem may experience bubble-like conditions - an unusual situation not seen in past technology rushes like the dot-com bubble or railroad era.
Predictions for 2026-2027: Agents Working All Day
"Most of the time, right, I'm not going to be smarter than statistical models, statistical extrapolation of past trends that have been very consistent" - Julian on his methodology for predictions, using naive linear extrapolation of benchmarks like SWE-bench.
By mid-2026, Julian predicts models will work autonomously for a whole day on tasks like implementing entire software features or completing comprehensive research reports. Task length is critical because "if you need to talk to it every 10 minutes, right? Versus if you have something that can go for hours at a time, obviously, right? Then you cannot just have one copy of it, you can have a whole team."
Late 2026 prediction: at least one model matches industry experts across many occupations, based on OpenAI's GDPval evaluation that collected real-world tasks from domain experts and compared model performance against human experts.
By 2027, models will frequently outperform experts on many tasks. Julian notes that messiness and task length are correlated - "if you think about like you know how do you come up with a task that you know takes a human 8 hours 16 hours right you will have to include all these messiness and all this real world mess."
Julian looks at multiple signals that could change his mind, including internal pre-training runs, fine-tuning results, RL performance, scaling trends, and whether users actually become more productive with new models over time.
Move 37 and AI Creativity Beyond Imitation
Move 37 occurred during the second game of AlphaGo's 2016 match against Lee Sedol, one of the world's best Go players. "AlphaGo played like a really unexpected unconventional move that surprised many professional Go players" - Julian, with commentators calling it "truly creative unexpected," and AlphaGo ultimately winning the game.
"For many people an early sign that AI is not just you know purely calculating following an optimal path but it can also do something that is truly novel and creative that you might not expect you know just from imitating its training data" - Julian on Move 37's significance.
Modern language models are "literally trained to generate a whole probability distribution which means that when we sample from them we can generate you know infinite amount of novel sequences from them" - Julian, arguing models clearly can do novel things.
Creating a modern equivalent of Move 37 requires "the combination of a task that is like sufficiently difficult and interesting and a model that is both able to create sufficiently diverse and creative ideas and also able to evaluate accurately how good they are." The hard part is "creating novel things that are useful and interesting," not just novel things.
AI Discovering Novel Science and Nobel Prizes
AlphaCode and AlphaTensor proved AI can discover novel programs and algorithms. Recent work from Google DeepMind and Yale in biomedical fields shows continued progress in AI discovering brand new scientific concepts.
"We're absolutely at the stage where you know it is discovering novel things and we're just moving up the scale of how impressive how interesting are the things that it is able to discover on its own" - Julian, predicting "sometime next year we're going to have some discoveries that people pretty you know unanimously agree that you know this is super impressive."
"My guess for that level of capability might be maybe 2027" - Julian on when AI could make a Nobel Prize-worthy breakthrough on its own. "By 2027 2028, I think extremely likely that the models will be smart enough and capable enough to actually have that level of insight on that level of discovery," though the actual prize would come years later due to Nobel Prize delays.
Julian emphasizes excitement about "AI that can help us advance science and really unlock you know both all the mysteries of the universe and all the improvements in living standards and abilities for us that we could have if we understood the world better."
AI Discontinuity and the Path to Singularity
"A true discontinuity is extremely unlikely" - Julian on sudden AI takeoff scenarios. AI researchers already use AI to accelerate themselves, creating "a smooth improvement of productivity."
The key question is whether difficulty of improving AI scales faster than productivity gains. "A very common effect in a very common issue in many scientific fields is that we find all the easy problems first and then you know as we continue exploring the problem the field it gets more and more difficult to make advances."
"The normal course in many scientific fields is that we actually need to exponentially increase the research effort just to keep making progress" - Julian, citing pharmacology where discovering new drugs now costs billions versus a single scientist discovering the first antibiotic by accident 100 years ago.
"We will be seeing advance signs of oh we're making faster progress every single week. We can see something is happening. Maybe we decide to pause if you don't understand what's happening" - Julian on why sudden surprise takeoff is unlikely.
Pre-training Plus RL: The Current Paradigm
"If you're thinking of oh we want some kind of system that can perform at roughly human level in basically all tasks that we care about productivity wise then I think yeah it's extremely likely that the current approach pre-training or you know transformers is going to get us there" - Julian.
Julian avoids terms like AGI and ASI, preferring to "talk very concretely about you know what problem are we solving what task are we solving what quality are we interested in because I find that often makes the actual disagreement much more obvious."
On training from scratch with RL versus pre-training: "Personally, I think that's unlikely" that future models will be pure RL. "Pre-training on this vast data sets we have just brings us so much value that we would from a practical point of view not want to give it up," though pure RL agents may be built "out of scientific interest."
Pre-training provides safety benefits: "By pre-training on you know all this human knowledge we're implicitly creating an agent that has similar values as we do and I think that is quite valuable for aligning uh you know highly intelligent agent if you already start out by caring about sort of you know the same rough set of values."
The key is not over-encoding priors: "If your pre-training if your prior knowledge prevents you from exploring something that might be the correct course of action that will be bad" - Julian on the danger of restricting search space too much.
Julian's Journey from Austrian Countryside to AI Research
Julian grew up in a small village in the Austrian countryside with no expectations of becoming an AI researcher, but was "always very interested in computers" as "this connection to the wider world to all these other interesting things."
Early interest in making computer games led to programming, though "I somehow I always got distracted by the technical aspect of I'm going to build a very general game engine that you know can run any kind of game. And so I never actually ended up making any game."
After his first year studying computer science in Vienna, a summer internship at Google changed everything: "That's when I realized oh wow these guys are doing really interesting things they you know that's where their big clusters the tens of thousands of machines are that's first time I radically changed my plans from wanting to stay in academia."
Julian worked as a software engineer at Google in advertising, which he "wasn't super excited or interested in," and was planning to leave for a hedge fund when he saw an email about Demis Hassabis giving a talk about Atari and video games. Despite being on a day off visiting a friend, "that email looked so intriguing that I was like, 'Oh no, I'm going to have to like take the train back to the office right now and like see this talk.'" That moment led him to join DeepMind.
AlphaGo to MuZero: Evolution of Game-Playing AI
AlphaGo combined deep neural networks trained to predict moves and game outcomes with Monte Carlo tree search to plan ahead. Initial training used human amateur games to predict "at each turn in the game what move would they have played," achieving "pretty decent like amateur go level" but not strong enough to beat top players.
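To make the idea of combining a move-prediction network with tree search more concrete, here is a minimal sketch of a PUCT-style selection rule, the kind of formula used in this family of algorithms to balance the network's prior against values found by search. It is an illustration only, not AlphaGo's implementation: the `Child` class, the move labels, and the constant `c_puct=1.5` are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    prior: float             # policy-network probability for this move
    visits: int = 0          # how often search has explored this move
    value_sum: float = 0.0   # sum of value estimates backed up through it

    def mean_value(self) -> float:
        # Average value seen so far; 0.0 for a move never explored.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(child: Child, parent_visits: int, c_puct: float = 1.5) -> float:
    """Mean value plus an exploration bonus that favors moves with a
    high prior and few visits."""
    bonus = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.mean_value() + bonus


def select_move(children: dict, c_puct: float = 1.5) -> str:
    """Pick the move with the highest PUCT score at one tree node."""
    parent_visits = max(1, sum(c.visits for c in children.values()))
    return max(children,
               key=lambda m: puct_score(children[m], parent_visits, c_puct))


if __name__ == "__main__":
    node = {
        "a": Child(prior=0.6, visits=10, value_sum=2.0),  # well explored
        "b": Child(prior=0.4),                            # never explored
    }
    print(select_move(node))
```

In this toy node the unexplored move "b" is selected, because its exploration bonus outweighs the modest average value of the already well-visited move "a". As visits accumulate the bonus shrinks and the measured values dominate.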
The team was "very nervous" before the Lee Sedol match with internal bets on outcomes. "If we had wanted to be a bit more safe, we may have like tried to do a few months later. And I think if you had done it a few months earlier, we would have probably lost. So it was very knife edge" - Julian, making each game "a nailbiter."
AlphaGo Zero removed all human Go knowledge, "training it just from scratch playing only against itself and rediscovering basically all go completely figuring out from scratch how to play." The team gave it game rules to score results and indicate illegal moves, but the network learned everything else from self-play.
AlphaZero generalized to chess, Go, and shogi with the same algorithm and network structure, "making it you know much simpler, elegant and faster" while "laying the groundwork for applying the algorithms to solve real problems."
MuZero solved the problem that "if you want to solve many real world tasks, you have no way of perfectly simulating what's going to happen." The breakthrough was teaching the neural network to "predict the future of the environment, the future of the world" rather than requiring perfect simulation, enabling application beyond board games to robotics and other complex domains.
Why Pre-training and RL Took Time to Combine
"Scaling up the language models to the massive degree that we scaled them up took a lot of effort on its own" - Julian. Pre-training and supervised training are "more stable and sort of easier to debug because you don't have this feedback cycle."
RL's feedback cycle makes debugging extremely difficult: "If you have you know something is not working it's very hard to figure out where in this cycle your problem is coming from. You know, maybe your training update was bad and that's why you suddenly started behaving badly or maybe the way you decided, you know, the way you select actions to behave is not correct."
"It makes a lot of sense to you know first scale up the pre-training the architectures figure out something that works pretty well especially if you can already get pretty far by some fine tuning some prompting and then when you know when it's clear that these models are really general they are really useful and we have them in pretty stable state then you know you can ramp up RL."
Even in AlphaGo and AlphaZero development, the team followed similar patterns: "We always follow a similar split as well when we first set up the architecture of the network, the training using fixed supervised data. And only when we had that working really reliable, only then did we do the full RL loop" to enable component isolation and debugging.
Scaling Laws and Compute for Reinforcement Learning
"There's less published literature about it. But I think if you if you look at all the RL literature over time, we see very similar returns on compute in pre-training and in RL where we can invest exponentially more compute in RL and keep getting benefits" - Julian.
A key open research question is determining optimal compute allocation: "What should be the split for a big model for example it could be 50/50 should be like 1 to 10 which way should it be 1 to 10 so I think that's going to be extremely interesting."
Reward sources are flexible - "the reinforcement learning process per se doesn't really care where the reward comes from." Sources include human feedback, automated signals from winning/losing or passing tests, and model-generated rewards like Anthropic's constitutional AI approach where "the model itself score whether you're following some guidelines."
"The great thing about RL is that the data is generated by your model itself. So the smarter our models become, the better RL data we can generate, the more interesting and complex tasks they can solve, which then gives us more and more data that we can train on."
Data Quality vs Quantity in Training
"That's a very interesting question that maybe doesn't have like a super clear answer yet or it's maybe still interesting research to be done" - Julian on whether quality, quantity, or recency matters more for training data.
Pre-training shows continued improvement with scale, while fine-tuning papers demonstrate "with a very small amount of examples you can teach the model how to do an interesting skill." The challenge is "we don't have any good scaling laws yet that tell us the trade-off especially I think because it's very hard to measure what is the quality of a data point."
High-quality data makes RL more stable. AlphaZero spent "a lot of computation" on planning and search to generate "very high quality data to train on which then resulted in RL training that was incredibly stable": it could run across continents with long generation times and remain robust.
Modern language model RL is less stable because "the difference in how good is the model and what data it generates that we then train on is not so large" since models more directly sample without extensive search. Improving this through "putting more reasoning into your language model to generate much more high quality training data" is a key direction for scaling RL.
How RL Powers Modern AI Agents
An agent is "an AI that can act on its own" - can take actions on a computer, save files, edit files, send emails without constant user interaction. "The main characteristic is that it doesn't have to interact with the user all the time. It can do things on its own."
"Our pre-training data is not very agent-like" - Julian explains pre-training data contains websites, books, and text with "a lot of information but it doesn't have a lot of actions. It doesn't really capture how the humans actually interact with the world." Raw pre-trained models aren't good agents and "especially it's not going to be very good at correcting for its own errors."
RL enables agents to learn from their own behavior: "In RL we can take our agent let it interact with the environment and then directly train on that interaction. So for example, if the agent did well, we can reinforce those actions. And if the agent did badly, we can push it away from those actions."
Crucially, RL allows learning recovery behaviors: "If the agent sort of did badly at the beginning, but then recovered and managed to do well, then we can also reinforce that recovery" - teaching robustness by learning "on the actual problem that is trying to solve."
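The "reinforce what went well, push away from what went badly" loop described in this section can be shown in a toy setting. The sketch below uses a REINFORCE-style policy-gradient update on a two-action bandit; the environment, reward values, learning rate, and function names are all assumptions made for illustration, and it is far simpler than training a language-model agent.

```python
import math
import random


def softmax(logits):
    """Convert action preferences into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Sample actions from the policy, observe a reward, and nudge the
    policy toward rewarded actions and away from penalized ones."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # The agent generates its own training data by acting.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. logits: one_hot(action) - probs.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)


if __name__ == "__main__":
    # Hypothetical rewards: action 0 "did well" (+1), action 1 "did badly" (-1).
    final_probs = train_bandit([1.0, -1.0])
    print([round(p, 3) for p in final_probs])
```

After training, almost all probability mass sits on the rewarded action. The same mechanism is what lets a positive outcome reinforce everything that led to it, including a recovery after an early mistake.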
Building AI Applications: When to Use RL
"Nowadays with the capabilities of like you know top Claude models, top OpenAI GPT models, you don't need to do any fine tuning. You can take the model as is, write your own tools, your own harness, and benefit from that agentic training" - Julian advising builders.
"Doing good agentic fine tuning is actually very hard. And so it's it's quite hard to do better than the top frontier models that you might get. But on the contrary, coming up with good tools and a good representation of your task makes a huge difference."
Current blockers to agentic AI aren't single issues but "improvements needed around the whole space" - making models better at correcting errors, continuing for long periods without distraction, general intelligence improvements, and speed. "There's basically like, you know, a whole set of things that we know that we can improve."
"That's actually, you know, one of the reasons why AI is a very fun field is that there are so many low-hanging fruits that, you know, you can do much better on, but already the current models are so good that it's very fun to work on it" - Julian on the state of AI research.
Goodhart's Law and the Evaluation Problem
Goodhart's Law states "any measure that becomes a target stops being a good measure" - Julian explains with the example that "if you start paying for example programmers based on how many lines of code they write well suddenly they will discover many ways to add more lines of comments which is you know completely useless."
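The lines-of-code example is easy to reproduce. The sketch below compares a naive metric (non-blank lines) with one that ignores comment-only lines; the function names and the two code samples are made up for illustration.

```python
def raw_line_count(source: str) -> int:
    """Naive productivity metric: every non-blank line counts."""
    return sum(1 for line in source.splitlines() if line.strip())


def effective_line_count(source: str) -> int:
    """Stricter metric: ignore blank lines and comment-only lines."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


# Two versions of the same function; the second is padded with comments.
ORIGINAL = "def add(a, b):\n    return a + b\n"
PADDED = (
    "# add two numbers\n"
    "def add(a, b):\n"
    "    # return the sum\n"
    "    # of a and b\n"
    "    return a + b\n"
)

if __name__ == "__main__":
    for name, src in (("original", ORIGINAL), ("padded", PADDED)):
        print(name, raw_line_count(src), effective_line_count(src))
```

The padded version scores more than twice as high on the naive metric while doing exactly the same work. The stricter metric closes this particular loophole, but once it becomes the target it can be gamed in other ways, which is the point of the law.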
"Any benchmark that is too easily measured or that has a lot of attention on it, people will optimize very hard for it. Which means that probably the model will look very good at that benchmark. But if you then use it for your own task, you might get different performance."
Best practice is "periodically create completely new held out benchmarks that nobody has seen before." Many researchers "have their own toy problems that they use to test all the models precisely for that reason" to get unbiased estimates. For companies, "make your own internal benchmark that really represents what you care about."
"It definitely used to be easier to have good evals. You know 5 years ago the tasks we were doing I think it was easier to measure model performance. I think nowadays it's much more difficult" - Julian on the evolution of evaluation challenges.
OpenAI's GDPval evaluation is "very accurate and unbiased but it's very expensive" because it involves "taking human experts having them do the task and then compare the model task to the experts and like you know rate it with multiple people." The challenge is making evals that are "both cheap to run reliable and accurate because it's easyish to make an eval that takes one of those but to get all three is quite hard."
Mechanistic Interpretability and Understanding Models
RL can destroy interpretability if not careful: "You could look at the chain of thought to you know see what are the model internal thoughts and then you could also have a thought that oh maybe I should use that as a reward signal in RL and punish the model if it thinks the wrong thing but then suddenly you completely destroyed your interpretability angle."
The Golden Gate Claude model demonstrated breakthrough interpretability: researchers "found the neurons in Claude that were responsible for the Golden Gate concept and then modified them to make a version of Claude that really love the Golden Gate Bridge in San Francisco" - a "vivid example of ah you know we really understand what's happening in this model."
"What better way is there to verify that understanding than actually changing the behavior of the model" - Julian on why the Golden Gate Claude experiment was significant for proving genuine understanding of model internals.
"As the models get smarter, we really need to be able to understand what is the model thinking internally. You know, what is the values it has? Is it lying to us? Is it actually genuinely following the instructions?" - Julian on why interpretability is critical for safety.
"If like you know people interested in working in AI or doing AI research, I think interpretability is a great area to get into" - Julian's recommendation for those entering the field.
Safety and Alignment at Anthropic
"The focus on sort of safety alignment pervades all of Anthropic and there's very rigorous processes where we train a model whenever we want to release a model both to you know analyze the capabilities of the model verify the alignment of the model" - Julian on Anthropic's approach.
"If we are unsure about the safety of model we will delay the launch and like you know until we're sufficiently sure that is actually harmless we will not launch and release a model which you know may you know I guess shows that you know people take the safety much more seriously than any financial return or revenue."
"The teams working on safety and interpretability are a big focus of the company which you know gives me a lot of confidence that we actually care about this and put a lot of effort into it" - Julian on resource allocation at Anthropic.
Safety isn't just an RL problem - "it goes throughout the whole stack." Approaches include filtering pre-training data, post-training classifiers that monitor model behavior, and safety guidelines in system prompts. "Safety alignment it really pervades the whole of research and the whole of you know product and deployment it's not just isolated into any one part."
AI's Impact on Jobs and Inequality
"Artificial intelligence is quite I mean this may sound a bit simplistic but it's quite different than human intelligence" - Julian notes models are much better at some tasks like calculation and much worse at others, so "it is not I don't think it is at all going to be any like one for one replacement."
"It's going to be much more complementary of you know the model is really good at something that maybe I really don't like doing or I'm not interested in or I'm very bad at and then I'm much better than the model at another part" - Julian on gradual complementary adoption rather than replacement.
Julian uses Claude constantly "to for example you know refactor code or maybe write some front end code that I don't want to write" while remaining "clearly much better" at other parts, demonstrating comparative advantage and synergy.
"The promise of technology has long been that, oh, we're going to be all so productive, so wealthy that we need to work much less. Yet, mysteriously, right, we all have like 40 hours working week for decades" - Julian argues this is "much more like a political social problem of like figuring out how do we actually benefit from all these improvements."
In chess and Go, thanks to AI it "has become much easier for people to study how to play Go, how to play chess, because now you don't need to find, you know, an expert tutor; anybody can practice on their own," with chess streamers becoming popular on Twitch, suggesting AI can democratize access to expertise.
"Redistributing the pie is kind of a loser's game. To get more wealthy, we really need to grow the pie" - Julian on why total wealth creation matters most. "If we manage to make everybody in society 10 times more productive you know what kind of abundance can we achieve?" in medicine, energy, materials science, and other fields "bottlenecked on how much intelligence we have access to."