Is GPT-5.1 Really an Upgrade? But Models Can Auto-Hack Govts, so … there’s that
OpenAI, Google, and Anthropic all announced things in the last 24 hours that will affect hundreds of millions of people, just in different ways, and not always as the headlines might imply. OpenAI finished the release of GPT-5.1. Anthropic boasted that their model was the first to be able to conduct an almost autonomous cyberattack against high-value targets. And Google released what you could say, with only mild exaggeration, is the first prototype universal gaming companion.
The thing is, those are all headlines you could have read yourself. So I'm going to show you 10 things that you wouldn't get from just reading the headlines, starting with GPT-5.1, which 1 billion humans may interact with in some form by the end of this year. There indeed is the headline.
It's apparently a smarter and more conversational chatbot. So, let's unpack those two words. Is it smarter? Well, it would be more accurate, I think, to say that it thinks for longer on what it perceives to be harder questions.
Compared to GPT-5, it will think for almost twice as long on what it judges to be the top 10% hardest questions. On the other hand, if it perceives your task or question to be easier, it will think for less time, anywhere from a third less to half as much. My theory is that OpenAI were losing too much money, burning through too much compute on more common or simple tasks from users, and so tamped that down. But let's set theory aside for a moment.
What do the benchmarks show? Well, something quite similar: on plenty of benchmarks, a slightly incremental step forward, like coding benchmarks and really hard STEM knowledge benchmarks. But belying the simple headline of a smarter model, there is a strange regression on certain other benchmarks, like one mathematical benchmark and an agency benchmark measuring whether models can complete certain tasks independently. On this page, I've picked out 20 benchmarks that I follow the most. And while almost all of them haven't yet updated with GPT-5.1, I can say that SimpleBench has, and GPT-5.1 scored slightly lower than GPT-5.
This is a test of whether models can see the question behind the question, and the score comes from an average of five runs, so the drop is probably just about noticeable. The likely explanation, again, is that GPT-5.1 thinks that certain questions are easier ...
Watch the full video by AI Explained on YouTube.