What I Learned Testing GPT-5.5

This episode analyzes the release of OpenAI's GPT-5.5, featuring reactions from AI researchers, developers, and industry analysts including Pietro Schirano, Matt Shumer, Ben Davis, and teams from CodeRabbit, Every, and Artificial Analysis.

From The AI Daily Brief: Artificial Intelligence News and Analysis 4 min read

Key Takeaways

01. GPT-5.5 scored 82.7% on Terminal-Bench 2.0 versus Opus 4.7's 69.4%, marking OpenAI's return to benchmark leadership
02. The model costs $5 per million input tokens and $30 per million output tokens, 20% more expensive than Opus 4.7
03. "For the first time, I don't feel limited by what a model can do. I feel limited only by what I can imagine" - Pietro Schirano
04. GPT-5.5 can run coding tasks for 7+ hours continuously, whereas earlier models typically stopped after about 30 minutes
05. OpenAI emphasized "iterative deployment" and "democratization" in contrast to Anthropic's restricted Mythos model approach
06. "Intelligence is a function of inference compute. What matters is intelligence per token or per dollar" - Noam Brown
07. The model tops Artificial Analysis' overall benchmarks with the first-ever score in the 60s range
08. "We expect quite rapid continued progress... I would say the last few years have been surprisingly slow" - Jakub Pachocki

These notes may contain occasional inaccuracies.

The discussion covers GPT-5.5's benchmark performance against Anthropic's Claude Opus 4.7, particularly on coding tasks, knowledge work, and agent capabilities. The model represents OpenAI's response to Anthropic's unreleased Mythos model and addresses recent competition in the AI space.

Key topics include benchmark comparisons across Terminal-Bench 2.0, SWE-Bench Pro, and Vending-Bench, cost-performance analysis, coding capabilities, design limitations, and OpenAI's strategic messaging around democratization versus Anthropic's restricted access approach.

Benchmark Wars: GPT-5.5 vs Claude Opus 4.7 Performance

GPT-5.5 achieved 82.7% on Terminal-Bench 2.0 compared to Opus 4.7's 69.4%, and scored 84.9% on the real-world task benchmark GDPval versus Opus's 80.3%

Artificial Analysis declared "GPT 5.5 takes OpenAI back to the clear number one," making it the first model ever to score in the 60s on their overall index

SWE-Bench Pro results showed GPT-5.5 significantly underperforming Opus 4.7, though OpenAI's Tibo dismissed this: "you'll be missing out if you think SWEBench is representative of anything real"

On Vending-Bench, GPT-5.5 performed similarly to Opus 4.6 in single-player mode but beat Opus 4.7 in multiplayer competition without displaying "underhanded tactics like lying to suppliers"

Cost-Performance Revolution: Intelligence Per Dollar Metrics

At $5 input/$30 output per million tokens, GPT-5.5 costs double GPT-5.4 and 20% more than Opus 4.7, but "completely dominates the cost performance frontier" according to Scaling01

"Intelligence is a function of inference compute. Comparing models by a single number hasn't made sense since 2024. What matters is intelligence per token or per dollar" - Noam Brown from OpenAI

Parameter estimates suggest GPT-5.4 was 1-2 trillion, GPT-5.5 is 2-5 trillion, while Anthropic's Mythos is approximately 10 trillion parameters
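
The pricing figures above can be turned into a per-request cost with simple arithmetic. The following is a minimal sketch, not an official calculator: only the $5 and $30 per-million-token prices come from these notes, while the token counts, the 50% success rate, and the function names are hypothetical values chosen for illustration. The second function is one possible reading of "intelligence per dollar," assuming each attempt is billed whether or not it succeeds.

```python
# Prices quoted in the notes for GPT-5.5, in USD per million tokens.
GPT_5_5_INPUT_USD_PER_MTOK = 5.00
GPT_5_5_OUTPUT_USD_PER_MTOK = 30.00


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price: float = GPT_5_5_INPUT_USD_PER_MTOK,
    output_price: float = GPT_5_5_OUTPUT_USD_PER_MTOK,
) -> float:
    """Cost of one request, given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_per_solved_task_usd(cost_per_attempt_usd: float, success_rate: float) -> float:
    """Expected spend per solved task when every attempt is billed.

    success_rate is the fraction of attempts that succeed (0 < rate <= 1).
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt_usd / success_rate


if __name__ == "__main__":
    # Hypothetical workload: 200k input tokens and 50k output tokens per attempt.
    attempt = request_cost_usd(200_000, 50_000)
    print(f"cost per attempt: ${attempt:.2f}")  # $1.00 input + $1.50 output = $2.50

    # Hypothetical success rate of 50%, used only to show the calculation.
    print(f"cost per solved task: ${cost_per_solved_task_usd(attempt, 0.5):.2f}")  # $5.00
```

Under this framing, a model with a higher per-token price can still be cheaper per solved task if it succeeds more often or uses fewer output tokens per attempt, which is the kind of comparison the "intelligence per dollar" quote points toward.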

Coding Capabilities: Long-Running Tasks and Reliability

Peter Gostev reported GPT-5.5 running migration tasks for "7+ hours" and overnight coding sessions lasting "8+ hours," compared to previous 30-minute limits

"GPT-5.5 feels less tiring... It writes cleaner code, touches fewer things it doesn't need to touch, is less likely to overengineer a simple change" - Flavio Adamo

CodeRabbit measured 79.2% detection of expected issues in code review, versus a 58.3% baseline, calling it "stronger signal and better performance on issues that matter most"

Aidan McLaughlin from OpenAI described dictating an "ambitious RL run" that worked autonomously for 31 hours while he was away for days

Design and Creative Work: Persistent Limitations

In a test based on the Resonate presentation methodology, GPT-5.5 created a mood board and four visuals for a presentation on the haptic internet, working autonomously for 16+ minutes

Design taste remains limited with issues including "not a ton of visual variety," too many fonts, and breaking "the fourth wall" by referencing prompts within presentation text

"Opus 4.7 seems to write better plans and have a superior eye for design and product details" according to Every's analysis, though GPT-5.5 excels at execution

The optimal workflow appears to be "GPT Images 2 for concepting UI and then 5.5 and Codex for implementing it" rather than relying on native design capabilities

Strategic Messaging: Democratization vs Restriction

"We believe in iterative deployment... We believe in democratization. We want people to be able to use lots of AI" - Sam Altman, contrasting with Anthropic's Mythos restrictions

"Crazy how you can just ship a model without a giant PR campaign to scare the crap out of everyone first" - Justine Moore from a16z, referencing OpenAI's understated launch approach

Anthropic published a post-mortem confirming "recent Claude Code quality issues," validating user complaints about model degradation since March 4th

"Really excellent work by the inference team to serve this model so efficiently. To a significant degree, we have become an AI inference company now" - Sam Altman

Future Roadmap: Rapid Improvement Trajectory

"Yes, we expect quite rapid continued progress... I would say the last few years have been surprisingly slow" - Chief Scientist Jakub Pachocki on accelerating release pace

"What 5.5 represents is not an end point. In many ways, it's a beginning point... you should expect even larger improvements over just the upcoming months" - President Greg Brockman

GPT-5.5 may be "the initial RL checkpoint of their new pre-training model," similar to how o1-preview led to o3's breakthrough
