Reading today's open-closed performance gap
The complex factors that determine the single evaluation number so many focus on. Plus, how this may change in the future.
This post originally appeared in Interconnects.
“If this direct data access becomes the next frontier of training, open models in their current form will be left behind.”
The current equilibrium is clear: open models will be in perpetual catch-up with closed models. But viewing this gap as a single number, a “distance,” obscures a nuanced and crucial dynamic in which capabilities each set of models actually covers. The most popular benchmark for commenting on this gap is the Artificial Analysis Intelligence Index — a composite benchmark of ~10 sub-evals that Artificial Analysis maintains over time to capture the “frontier” of current language model capabilities.
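To make the reduction concrete, here is a minimal sketch of how a composite index collapses many sub-evals into one number. The benchmark names, scores, and equal weighting below are hypothetical assumptions for illustration; Artificial Analysis’s actual sub-eval list, normalization, and weighting scheme may differ.

```python
# Minimal sketch of a composite "intelligence index": an equal-weighted
# mean over per-benchmark scores. Names, scores, and the equal-weighting
# assumption are hypothetical, not the index's real methodology.

def composite_index(scores: dict[str, float]) -> float:
    """Collapse per-benchmark scores (each on a 0-100 scale) into one number."""
    return sum(scores.values()) / len(scores)  # equal weighting assumed

# Hypothetical sub-eval scores for a single model (not real data).
sub_evals = {
    "MMLU-Pro": 78.0,
    "GPQA Diamond": 62.5,
    "LiveCodeBench": 55.0,
    "AIME": 47.0,
}

print(f"Composite index: {composite_index(sub_evals):.1f}")  # -> 60.6
```

The sketch shows the core issue: very different per-benchmark profiles can produce the same composite score, which is exactly the nuance a single “gap” number hides.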
In particular, I spend a lot of time on how the dynamics that feed into that index are obscured by the natural tendency to reduce performance and trends to a single number. Examples include:
How benchmarks evolve over time, becoming more or less correlated with how people actually use models,
How different models’ real-world performance relates to their benchmark rankings, and
How training regimes evolve over time to move said benchmarks.