OpenAI claims its latest model, GPT 5.2, sets a new state-of-the-art score on GPQA and is the first model to perform at or above human expert level. But here's what's really interesting: benchmark performance is increasingly driven by how many tokens a model spends thinking, and that makes direct comparisons notoriously difficult.
The Bold Claim
OpenAI's release page for GPT 5.2 touts a claim that has caused considerable confusion: according to expert judges, the model beats or ties top industry professionals on 71% of comparisons. However, this benchmark only tests digital jobs and predefined tasks where the contextual information is provided upfront. It deliberately excludes tasks that depend on tacit knowledge, the kind that require the model to search out or intuit context for itself before it can solve the problem.
The release also omits comparisons with Claude Opus 4.5 and Gemini 3 Pro, leading researchers to run their own cheeky comparisons. When GPT 5.2 was asked to create a football-themed interaction matrix with results from this season, it performed extremely well. But given the same challenge without the $200 Pro tier's larger token budget and thinking time, it couldn't complete the task.
The Benchmark Problem
Performance on AI benchmarks increasingly depends on thinking time, or the number of tokens used: what researchers call test-time compute. The compute budget a provider allocates to answering benchmark questions can swing the results dramatically.
Take ARC-AGI-1, a benchmark designed to test fluid intelligence on problems outside the training data. Results almost uniformly improve the more dollars or tokens are spent on thinking: with GPT 5.2 Pro at Extra High reasoning effort, the model scores over 90%, and thanks to compute and algorithmic efficiencies, the price-performance ratio keeps improving as well.
The more time a model thinks, the more ideas from its training data it can try out.
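To see how much the thinking budget matters in practice, here is a minimal sketch of re-running the same question at different reasoning-effort settings and logging how many output tokens each run burns. The model name "gpt-5.2-pro" and the top "xhigh" effort level are assumptions for illustration, not confirmed identifiers; the call pattern follows OpenAI's Responses API.

```python
# Sketch: probe how test-time compute changes answers and token spend.
# "gpt-5.2-pro" and the "xhigh" effort level are assumed names for illustration.
from openai import OpenAI

client = OpenAI()
QUESTION = "..."  # one ARC-AGI-style puzzle, serialized as text

for effort in ["low", "medium", "high", "xhigh"]:
    response = client.responses.create(
        model="gpt-5.2-pro",           # assumed model identifier
        reasoning={"effort": effort},  # larger effort = more hidden thinking tokens
        input=QUESTION,
    )
    # usage.output_tokens includes the billed-but-hidden reasoning tokens
    print(effort, response.usage.output_tokens, response.output_text[:80])
```

Plotting score against tokens spent across those runs is essentially how the ARC-AGI price-performance curves are built.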
The benchmark selection itself has become problematic. OpenAI points to SWE-bench Pro as rigorous: it tests four programming languages and aims to be contamination-resistant. But benchmarks purporting to test the same thing often give different results. MMMU Pro, designed around analyzing tables, charts, and graphs, shows Gemini 3 Pro at 81% versus GPT 5.2 Thinking at 80.4%. Yet on a newer chart-reasoning benchmark, CharXiv Reasoning, GPT 5.2 gets 88.7% versus Gemini 3 Pro's 81%.
SimpleBench Results
A fully external benchmark called SimpleBench tests common-sense questions and spatio-temporal reasoning designed to exploit known model weaknesses. Run five times with GPT 5.2 Pro at Extra High reasoning effort, the model scored 57.4%. The human baseline is roughly 84%, and Gemini 3 Pro outperforms GPT 5.2 at 76.4%, though it still falls short of that baseline.
Critically, this benchmark is hard for model providers to cheat on because the correct answers are never included in the API call; the model's responses are scored locally by a program rather than by an LLM judge.
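As a rough illustration of that setup, here is a minimal sketch of programmatic scoring, assuming multiple-choice questions and an "Answer: X" response format; the field names and regex are illustrative, not the actual SimpleBench harness.

```python
# Sketch: score model outputs locally so the gold answers never leave the machine.
# The "Answer: X" format and field names are assumptions, not the real harness.
import re

def is_correct(model_output: str, gold_letter: str) -> bool:
    # Extract the final multiple-choice letter from the model's response.
    match = re.search(r"Answer:\s*([A-F])", model_output, re.IGNORECASE)
    return bool(match) and match.group(1).upper() == gold_letter.upper()

runs = [
    {"output": "Reasoning... Answer: C", "gold": "C"},
    {"output": "Answer: A", "gold": "D"},
]
accuracy = sum(is_correct(r["output"], r["gold"]) for r in runs) / len(runs)
print(f"accuracy: {accuracy:.1%}")
```

Because only the question text goes over the wire and the check is a plain string comparison, there is no opportunity for the provider, or a judge model, to see or game the answer key.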
Critics might note that GPT 5.2 still trails on LM Arena for web development, with Claude Opus 4.5 exceeding both GPT 5.2 and Gemini 3 Pro in that domain.
Long Context Capabilities
One result that stands out: GPT 5.2's ability to recall details across long contexts, achieving near-100% accuracy on the four-needle challenge, where four different items must be recalled from nearly 200,000 words. Long-context recall was previously a specialty of Gemini 3 Pro, which can handle up to a million tokens versus GPT 5.2's 400,000-token limit.
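For readers who haven't seen this kind of test, here is a minimal sketch of a multi-needle recall check in the same spirit; the filler text, needle sentences, and prompt wording are all illustrative assumptions rather than the actual evaluation.

```python
# Sketch: plant four "needle" facts in ~200,000 words of filler and check
# whether the model's answer recovers all of them. Everything here
# (filler, needles, prompt) is illustrative, not the real benchmark.
import random

NEEDLES = [
    "The red key is under the blue mat.",
    "Flight 742 departs at 06:15.",
    "The password hint is 'granite'.",
    "Marcus owns three bicycles.",
]

words = ["lorem"] * 200_000                      # stand-in filler text
for needle in NEEDLES:
    words.insert(random.randrange(len(words)), needle)
haystack = " ".join(words)

prompt = haystack + "\n\nList every unusual sentence hidden in the text above."
# Send `prompt` to the model under test, then score its reply:
def recall_rate(model_answer: str) -> float:
    return sum(needle in model_answer for needle in NEEDLES) / len(NEEDLES)
```

Scattering multiple needles rather than one is what makes the test hard: the model has to track several unrelated details across the whole window instead of getting lucky with a single lookup.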
Bottom Line
GPT 5.2 represents an incremental step forward rather than a revolutionary leap. It excels at spreadsheet creation after web research and achieves impressive long-context recall, but benchmark comparisons remain fraught with complications around thinking budgets, token spending, and benchmark selection. The model is genuinely good — though whether it's the best for any specific use case depends entirely on what that use case requires.