Are These “SOTA” Models Really SOTA?

Every time a new model launches, the company behind it claims it’s “state of the art.” But the real question is: how much can we trust these benchmark numbers?
These results look impressive, but we don’t always know whether the evaluations were run on truly unseen data or whether the model already saw similar questions during training. And in just the last month, three big models from three different companies have dropped, each claiming to outperform the last.
More models are coming soon, and they’ll probably make the same claim. So the question I’m raising is: are these models genuinely getting better, or are the benchmarks just conveniently overlapping with what they were trained on?
https://x.com/OpenAI/status/1999182104362668275
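For context, the contamination checks labs self-report are usually just n-gram or substring overlap between benchmark items and the training corpus (the GPT-3 paper, for instance, filtered on 13-gram overlap). Here’s a minimal sketch of that idea in Python; the function names and the toy corpus are hypothetical, not any lab’s actual pipeline:

```python
from typing import Iterable, Set

def ngrams(text: str, n: int = 13) -> Set[str]:
    # Word-level n-grams; 13 is the window size the GPT-3 paper used
    # for its contamination filter.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(item: str, training_docs: Iterable[str], n: int = 13) -> bool:
    # Flag a benchmark item if any training document shares at least
    # one n-gram with it.
    item_grams = ngrams(item, n)
    return bool(item_grams) and any(item_grams & ngrams(doc, n) for doc in training_docs)

# Hypothetical example: a benchmark question that leaked into a web crawl.
question = "What is the capital of France? A) Paris B) London C) Berlin D) Madrid"
corpus = ["quiz dump: What is the capital of France? A) Paris B) London C) Berlin D) Madrid"]
print(looks_contaminated(question, corpus))  # True
```

The catch is that only the lab itself can run a check like this, because nobody outside has the training corpus. Which is exactly why the reported numbers are so hard to audit.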