
GPT-5 has arrived

A new AI model has arrived, and it's available to everyone — but probably not in the way you think.

The reviewer behind the popular AI Explained channel has spent extensive time with GPT-5, combing through the system card documentation, watching the live stream announcement, and running their own benchmarks. Their verdict: impressive in several domains, underwhelming in others, and definitively not the leap toward artificial general intelligence that some had hoped for.


The Benchmark Paradox

The most viral claim about GPT-5 centered on its SimpleBench performance: a thread showing it getting nine out of ten public questions right spread widely. But the reviewer ran their own tests and found something different — around 57 to 58 percent accuracy across three runs, not the 70 percent many expected. That's a respectable score, but it isn't setting new records. It's also worth noting that SimpleBench's public questions likely exist in GPT-5's training data, given how widely they circulated after going viral — so some of that performance may be memorization rather than reasoning.

The model does crush certain public benchmarks, though. In software engineering specifically, GPT-5 outperforms Anthropic's offerings on SWE-bench Verified — a benchmark that Anthropic had cited as proof their own models were state-of-the-art in coding. This is one of the bigger developments of the release, and it may pose real challenges to competitors.

The Hallucination Question

OpenAI made noise about GPT-5 hallucinating less than previous models — claiming 44 percent fewer responses with at least one major factual error. But when you dig into the system card, the benchmarks used for that comparison are new ones rather than the standard tests everyone already knows. On SimpleQA, the most commonly quoted benchmark for factual accuracy, GPT-5 performs about as well as o3 — maybe slightly better if you squint.
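The "44 percent fewer" figure is a relative reduction, which is easy to misread as an absolute drop in the error rate. A minimal sketch of the arithmetic, using hypothetical baseline numbers (the system card's exact figures are not reproduced here):

```python
def reduced_rate(baseline, relative_reduction):
    """Apply a relative reduction (e.g. 0.44 for '44% fewer') to a baseline error rate."""
    return baseline * (1 - relative_reduction)

# Hypothetical numbers: if a predecessor model produced a major error
# in 9% of responses, a 44% relative reduction would bring that to
# roughly 5% -- not to 9% minus 44 percentage points.
print(round(reduced_rate(0.09, 0.44), 3))
```

The point of the sketch is only that a large relative improvement can still leave a meaningful absolute error rate.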

The practical reality: the model still hallucinates around five percent of the time on major incorrect claims during actual user conversations. That's not nothing, but it's also not catastrophic. For users asking technical questions that involve images or charts, GPT-5 looks quite strong: on the MMMU benchmark it beats Gemini Deep Think, which costs $250 per month and runs much slower.

What Didn't Change

The context window hasn't expanded significantly — a disappointment for those hoping to analyze longer documents. GPT-5 remains stuck in the low hundreds of thousands of tokens rather than approaching the one million that Gemini 2.5 Pro can handle.

Translation capabilities haven't improved either, which seems like a missed opportunity given how much ground AI has covered in multilingual tasks.

And if you were hoping for new media generation capabilities — Sora, OpenAI's video model, isn't part of this release. The model selector is being deprecated for non-pro users, meaning free-tier users will see only GPT-5 going forward rather than choosing between multiple options.

"GPT-5 is now better at coding than Anthropic's flagship models, but it's not the AGI leap people were hoping for."

Counterarguments

Critics might note that benchmark performance doesn't fully capture how users actually interact with these models in real-world scenarios. The five percent hallucination rate could be far more impactful on practical work than abstract benchmarks.

Additionally, some researchers have pointed out that the system card shows no meaningful improvement on machine learning engineering benchmarks or OpenAI's own internal research tasks — the very problems the company faces when trying to advance their technology. These are the bottlenecks that actually delay training runs and launches at OpenAI, and GPT-5 doesn't meaningfully move the needle there.

Bottom Line

The reviewer's strongest argument is that GPT-5 delivers real utility for coding tasks and factual accuracy while remaining accessible to free users — a genuinely impactful democratization of frontier AI capability. The vulnerability: it's not the paradigm shift toward artificial general intelligence many expected, and certain promised improvements like expanded context windows and translation quality didn't materialize. For readers, the takeaway is that GPT-5 is a strong incremental improvement rather than a revolutionary leap — useful for coding and professional work, but don't expect AI to start solving fundamental research problems just yet.


Sources

GPT-5 has arrived

by AI Explained (video)

Well, GPT-5 is here and it's in the free tier. I've tested it a bunch, read the system card in full, and even sat through that full live stream. Wow. But actually, I think it's pretty huge that free users of ChatGPT will get access to GPT-5.

In other words, approaching a billion people will experience a significantly more intelligent AI model, at least before they hit the limits. But if you watched the live stream and demo, you may have been underwhelmed. And I don't just mean the mathematically impossible bar graphs, and there were multiple of those. There were even hallucinations in the segment describing how the model hallucinates less.

For sure, it would be easy to make a video just taking the mick out of those mistakes. But the thing is, GPT-5 is actually a pretty great model. So, here are my first impressions. First, my own logic benchmark — or, as some people call it, a trick-question benchmark.

I can confirm that GPT-5 indeed does crush the public questions of SimpleBench. Whoever it was that put out that viral thread of it getting 9 out of 10 on those 10 public questions from SimpleBench wasn't lying. Technically, in some of my early testing, it got questions right that no other model had gotten right. When I saw this, I was like, man, I'm going to have to bring out V2 really early.

Everyone's going to get super hyped. This is crazy. However, if you're newer to AI, you might not know that the performance of language models is heavily dependent on the training data they're fed. And I suspect some of these 10 public questions have made it into the training data, at least indirectly.

Not deliberately, I think, but given that the models are trained on things like Reddit and other forums, it's definitely not impossible. Given how long I normally take to update the leaderboard, you guys might be quite shocked to hear that we're doing the runs tonight. And so far, it's not setting a new record. That surprised even me actually.

I was expecting, honestly, 70%. I'll be honest with you guys. So far, in the three runs we've done, it's getting around 57 to 58%. So at this point we can be clear: it's not a new paradigm of AI, and if you didn't believe models were AGI before, this model won't convince ...