⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data
It’s time to take the next step up in frontier agent evals.
This post originally appeared in Latent Space.
We're thrilled to officially welcome Swyx from Latent.Space to the SAIL coalition! Swyx joined Nathan Lambert and Sebastian Raschka, PhD on this morning's Substack Live, where the conversation naturally turned to the state of SWE evals — making this the perfect piece to restack as his inaugural contribution to SAIL Media. Without further ado:
We’ve been making tongue-in-cheek references to the increasingly minor bumps in SWE-Bench Verified scores every time a new frontier model is released (Opus 4.5 → 4.6 was literally a 0.1% step down), but it is a whole other matter for the original authors of SWE-Bench Verified to make the call to discontinue reporting it.
We were excited to have Mia Glaese, original coauthor of SWE-Bench Verified and VP of Research for the Frontier Evals, Human Data, and Alignment teams, and Olivia Watkins, a researcher on Frontier Evals, drop by to talk about today’s decision to publicly abandon SWE-Bench Verified and endorse SWE-Bench Pro:
The discussion around the saturation of SWE-Bench has been swirling in the community for over a year now — most frontier models consistently report scores around 80%, while the authors of the original SWE-Bench still assert that the “ceiling” for a saturation call should be closer to 87–95%, meaning there are still quite a few percentage points to go, even on the filtered subset of 500 tasks in Verified.
