A new AI model has arrived, and it's available to everyone — but probably not in the way you think.
The reviewer behind the popular AI Explained channel has spent extensive time with GPT-5 — combing through the system card, watching the livestream announcement, and running their own benchmarks. Their verdict: impressive in several domains, underwhelming in others, and definitively not the leap toward artificial general intelligence that some had hoped for.
The Benchmark Paradox
The most viral claim about GPT-5 centered on its SimpleBench performance — a widely shared thread showed it answering nine out of ten questions correctly. But the reviewer ran their own tests and found something different: around 57 to 58 percent accuracy across three runs, not the 70 percent many expected. That's a strong score, but it isn't setting new records. It's also worth noting that SimpleBench questions likely exist in GPT-5's training data, given how public they became once that thread spread — so some of the performance may reflect memorization rather than reasoning.
The model does crush certain public benchmarks, though. In software engineering specifically, GPT-5 outperforms Anthropic's models on SWE-bench Verified — a benchmark Anthropic had cited as proof that its own models were state-of-the-art at coding. This is one of the bigger developments of the release, and it may pose a real challenge to competitors.
The Hallucination Question
OpenAI made noise about GPT-5 hallucinating less than previous models — claiming 44 percent fewer responses with at least one major factual error. But dig into the system card and the benchmarks used for that comparison are new ones, not the standard tests everyone already knows. On SimpleQA, the most commonly cited benchmark for factual accuracy, GPT-5 performs about as well as o3 — maybe slightly better if you squint.
The practical reality: in actual user conversations, the model still produces at least one major incorrect claim roughly five percent of the time. That's not nothing, but it's also not catastrophic. For users asking technical questions involving images or charts, GPT-5 looks quite strong: on the MMMU benchmark it beats Gemini Deep Think, which costs $250 per month and runs much slower.
What Didn't Change
The context window hasn't expanded significantly — a disappointment for those hoping to analyze longer documents. GPT-5 remains stuck in the low hundreds of thousands of tokens rather than approaching the one million that Gemini 2.5 Pro can handle.
Translation capabilities haven't improved either, which seems like a missed opportunity given how much ground AI has covered in multilingual tasks.
And if you were hoping for new image or video generation capabilities — a new Sora isn't here yet. The model selector is also being retired for non-Pro users, meaning free-tier users will see only GPT-5 going forward rather than choosing among multiple models.
"GPT-5 is now better at coding than Anthropic's flagship models, but it's not the AGI leap people were hoping for."
Counterarguments
Critics might note that benchmark performance doesn't fully capture how users actually interact with these models in real-world scenarios. The five percent hallucination rate could be far more impactful on practical work than abstract benchmarks.
Additionally, some researchers have pointed out that the system card shows no meaningful improvement on machine learning engineering benchmarks or OpenAI's own internal research tasks — the very problems the company faces when trying to advance its technology. These are the bottlenecks that actually delay training runs and launches at OpenAI, and GPT-5 doesn't move the needle there.
Bottom Line
The reviewer's strongest argument is that GPT-5 delivers real utility for coding tasks and factual accuracy while remaining accessible to free users — a genuinely impactful democratization of frontier AI capability. The vulnerability: it's not the paradigm shift toward artificial general intelligence many expected, and certain promised improvements like expanded context windows and translation quality didn't materialize. For readers, the takeaway is that GPT-5 is a strong incremental improvement rather than a revolutionary leap — useful for coding and professional work, but don't expect AI to start solving fundamental research problems just yet.