OpenAI claims its latest model, GPT 5.2, sets a new state-of-the-art score on GPQA and is the first model to perform at or above human expert level. But here's what's really interesting: benchmark performance is increasingly driven by how many tokens a model spends thinking, and that makes direct comparisons notoriously difficult.
The Bold Claim
OpenAI's release page for GPT 5.2 touts a claim that has caused considerable confusion: according to expert judges, the model beats or ties top industry professionals on 71% of comparisons. However, this benchmark only tests digital jobs and predefined tasks where the contextual information is provided upfront. It deliberately excludes tasks that depend on tacit knowledge, the kind that require the model to search out or intuit context for itself before it can solve the problem.
The release also omits comparisons with Claude Opus 4.5 and Gemini 3 Pro, leading researchers to run their own cheeky comparisons. When GPT 5.2 was asked to create a football-themed interaction matrix with results from this season, it performed extremely well. But given the same challenge without the $200 Pro tier's larger token budget and thinking time, it couldn't complete the task.
The Benchmark Problem
Performance on AI benchmarks increasingly depends on thinking time, or the number of tokens used: what researchers call test-time compute. The compute budget a provider allocates to answering benchmark questions can swing the results dramatically.
Take ARC-AGI-1, a benchmark designed to test fluid intelligence on problems outside the training data. Results almost uniformly improve the more dollars or tokens are spent on thinking: with GPT 5.2 Pro at Extra High reasoning effort, the model scores over 90%, and thanks to compute and algorithmic efficiencies, the price-performance ratio keeps improving as well.
The more time a model thinks, the more ideas from its training data it can try out.
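To see how much the thinking budget matters in practice, here is a minimal sketch of re-running the same question at different reasoning-effort settings and logging how many output tokens each run burns. The model name "gpt-5.2-pro" and the top "xhigh" effort level are assumptions for illustration, not confirmed identifiers; the call pattern follows OpenAI's Responses API.

```python
# Sketch: probe how test-time compute changes answers and token spend.
# "gpt-5.2-pro" and the "xhigh" effort level are assumed names for illustration.
from openai import OpenAI

client = OpenAI()
QUESTION = "..."  # one ARC-AGI-style puzzle, serialized as text

for effort in ["low", "medium", "high", "xhigh"]:
    response = client.responses.create(
        model="gpt-5.2-pro",           # assumed model identifier
        reasoning={"effort": effort},  # larger effort = more hidden thinking tokens
        input=QUESTION,
    )
    # usage.output_tokens includes the billed-but-hidden reasoning tokens
    print(effort, response.usage.output_tokens, response.output_text[:80])
```

Plotting score against tokens spent across those runs is essentially how the ARC-AGI price-performance curves are built.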
The benchmark selection itself has become problematic. OpenAI points to SWE-bench Pro as rigorous: it tests four programming languages and aims to be contamination-resistant. But benchmarks purporting to test the same thing often give different results. MMMU Pro, designed around analyzing tables, charts, and graphs, shows Gemini 3 Pro at 81% versus GPT 5.2 Thinking at 80.4%. Yet on a newer chart-reasoning benchmark, CharXiv Reasoning, GPT 5.2 gets 88.7% versus Gemini 3 Pro's 81%.
SimpleBench Results
A fully external benchmark called SimpleBench tests common-sense questions and spatio-temporal reasoning designed to exploit known model weaknesses. Run five times with GPT 5.2 Pro at Extra High reasoning effort, the model scored 57.4%. The human baseline is roughly 84%, and Gemini 3 Pro outperforms GPT 5.2 at 76.4%, though it still falls short of that baseline.
Critically, this benchmark is hard for model providers to cheat on because the correct answers are never included in the API call; the model's responses are scored locally by a program rather than by an LLM judge.
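As a rough illustration of that setup, here is a minimal sketch of programmatic scoring, assuming multiple-choice questions and an "Answer: X" response format; the field names and regex are illustrative, not the actual SimpleBench harness.

```python
# Sketch: score model outputs locally so the gold answers never leave the machine.
# The "Answer: X" format and field names are assumptions, not the real harness.
import re

def is_correct(model_output: str, gold_letter: str) -> bool:
    # Extract the final multiple-choice letter from the model's response.
    match = re.search(r"Answer:\s*([A-F])", model_output, re.IGNORECASE)
    return bool(match) and match.group(1).upper() == gold_letter.upper()

runs = [
    {"output": "Reasoning... Answer: C", "gold": "C"},
    {"output": "Answer: A", "gold": "D"},
]
accuracy = sum(is_correct(r["output"], r["gold"]) for r in runs) / len(runs)
print(f"accuracy: {accuracy:.1%}")
```

Because only the question text goes over the wire and the check is a plain string comparison, there is no opportunity for the provider, or a judge model, to see or game the answer key.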
Critics might note that GPT 5.2 still trails on LM Arena for web development, with Claude Opus 4.5 exceeding both GPT 5.2 and Gemini 3 Pro in that domain.
Long Context Capabilities
One result that stands out: GPT 5.2's ability to recall details across long contexts, achieving near-100% accuracy on the four-needle challenge, where four different items must be recalled from nearly 200,000 words. Long-context recall was previously a specialty of Gemini 3 Pro, which can handle up to a million tokens versus GPT 5.2's 400,000-token limit.
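For readers who haven't seen this kind of test, here is a minimal sketch of a multi-needle recall check in the same spirit; the filler text, needle sentences, and prompt wording are all illustrative assumptions rather than the actual evaluation.

```python
# Sketch: plant four "needle" facts in ~200,000 words of filler and check
# whether the model's answer recovers all of them. Everything here
# (filler, needles, prompt) is illustrative, not the real benchmark.
import random

NEEDLES = [
    "The red key is under the blue mat.",
    "Flight 742 departs at 06:15.",
    "The password hint is 'granite'.",
    "Marcus owns three bicycles.",
]

words = ["lorem"] * 200_000                      # stand-in filler text
for needle in NEEDLES:
    words.insert(random.randrange(len(words)), needle)
haystack = " ".join(words)

prompt = haystack + "\n\nList every unusual sentence hidden in the text above."
# Send `prompt` to the model under test, then score its reply:
def recall_rate(model_answer: str) -> float:
    return sum(needle in model_answer for needle in NEEDLES) / len(NEEDLES)
```

Scattering multiple needles rather than one is what makes the test hard: the model has to track several unrelated details across the whole window instead of getting lucky with a single lookup.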
Bottom Line
GPT 5.2 represents an incremental step forward rather than a revolutionary leap. It excels at spreadsheet creation after web research and achieves impressive long-context recall, but benchmark comparisons remain fraught with complications around thinking budgets, token spending, and benchmark selection. The model is genuinely good — though whether it's the best for any specific use case depends entirely on what that use case requires.