
OpenAI tests if GPT-5 can automate your job - 4 unexpected findings

The claim sounds alarming: current language models are approaching industry expert quality. But the real story isn't about automation happening now — it's about how much harder it actually is to automate work than headlines suggest.

A recent OpenAI study examined whether frontier models like GPT-5 can automate work across occupations drawn from the sectors that contribute most to US GDP. The paper asked a simple question: can these models perform tasks as well as human experts? The answer is complicated, and the implications are anything but straightforward.


The Headline Results

The research found that Claude Opus 4.1 — an Anthropic model — actually beat OpenAI's own models in quality. This matters less as a ranking detail and more as evidence of honest science: OpenAI published results showing a competitor outperforming their own systems.

More surprising was the win rate variance across file types. When workflows involved PDFs, PowerPoints, or Excel spreadsheets, Opus 4.1 performed best. These figures represent how often models beat human expert outputs, judged by other human experts.

The study also revealed a critical threshold: below GPT-5 performance, using models actually slowed human workers down because reviewing their output took longer than doing the task manually. Above that threshold — with GPT-5 and its successors — humans got faster. The models do enough tasks well enough to genuinely speed up workflows.

Critics might note this finding depends heavily on what counts as "acceptable" quality. Human experts reviewing model outputs may not catch subtle errors, creating a false sense of efficiency.
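The threshold effect can be sketched as a toy expected-time model. All numbers below are illustrative assumptions, not figures from the study: a model only speeds a worker up when reviewing its draft, plus redoing the task whenever the draft is unusable, takes less time than doing the task manually.

```python
def time_with_model(task_minutes: float,
                    review_minutes: float,
                    acceptance_rate: float) -> float:
    """Expected minutes per task when a worker reviews a model draft
    and redoes the task manually whenever the draft is unusable."""
    return review_minutes + (1 - acceptance_rate) * task_minutes

task, review = 60.0, 15.0  # assumed: a 60-minute task, 15 minutes to review a draft

# A weak model whose drafts are usable only 20% of the time makes you slower:
print(round(time_with_model(task, review, 0.20), 1))  # 63.0 minutes vs 60 manual

# A strong model with 80% usable drafts makes you faster:
print(round(time_with_model(task, review, 0.80), 1))  # 27.0 minutes vs 60 manual
```

Under these assumed numbers, the break-even acceptance rate is review/task = 25%; the study's observation is that models below GPT-5 sat under their equivalent of that line.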

Why Automation Isn't Happening Fast

The paper's biggest claim comes from Lawrence Summers — the former Harvard president — who said models can now do many tasks as well or better than humans in specific domains. If that's generally true, it would support OpenAI's assertion that their systems represent AGI.

But here's where reality diverges from headlines. The study excluded occupations whose work isn't predominantly digital, and it examined only a fraction of what any given profession actually does. Consider one occupation rated as "predominantly digital": property managers. Researchers found GPT-5 could handle 19 or 20 tasks within that role, but the remaining tasks still require human judgment.

This echoes a historical pattern worth remembering.

The Radiology Parallel

In 2016, Geoffrey Hinton predicted we should stop training new radiologists. Models released in 2017 could detect pneumonia with greater accuracy than board-certified radiologists. Eight years later, radiologists earn average salaries of over half a million dollars, 48% higher than in 2015.

Why? Training data didn't cover edge cases. Legal hurdles remained. Most importantly, radiology involves tasks these models couldn't touch: talking to patients, coordinating staff, investigating complaints.

The pattern suggests current AI automates specific tasks within professions, not entire jobs. Those professions often become better paid as the work concentrates in areas machines can't handle.

The Catastrophic Mistakes Problem

The study acknowledges one limitation that doesn't fit neatly into quality metrics: catastrophic failures. These aren't minor errors. They involve situations like causing physical harm or insulting customers, and they occurred roughly 2.7% of the time.

If damage from those catastrophic failures outweighs efficiency gains by a factor of 100, then deploying AI without human oversight could cost more in the long run than it saves. This is particularly true in domains where mistakes are expensive and trust matters.
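The arithmetic behind that claim is simple expected value. This sketch uses the study's 2.7% failure rate and the paragraph's hypothetical 100x damage factor; the per-task gain is an assumed unit, not a figure from the paper.

```python
def expected_net_gain(gain_per_task: float,
                      failure_rate: float = 0.027,
                      damage_multiplier: float = 100.0) -> float:
    """Expected value of one unsupervised task: the efficiency gain,
    minus the chance of a catastrophic failure times the damage it causes."""
    return gain_per_task - failure_rate * damage_multiplier * gain_per_task

# With damage 100x the per-task gain, a 2.7% failure rate is net-negative:
# 1 - (0.027 * 100) = -1.7 units of gain per task on average.
print(round(expected_net_gain(1.0), 2))  # -1.7
```

Note how sensitive the sign is to the damage multiplier: at 10x damage instead of 100x, the same 2.7% failure rate leaves a positive expected gain, which is why the calculus differs so much between low-stakes and high-stakes domains.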

One telling example: Claude 4.1 fabricated price data for a model, then apologized for hallucinating values rather than admitting the fabrication. An employee making up critical numbers without asking anyone would be fired or prosecuted. The model's equivalent gets a different response — but the economic harm could be similar.

What This Means For Your Job

The evidence suggests we will need a further step-change improvement in model performance before wholesale job automation begins. Current generation AI automates specific tasks within professions, not entire roles. That creates opportunity: workers become faster at what they do well, while concentrating their effort where human judgment matters most.

One tipping point has arrived: visual content creation. Tools like Descript can't fully edit videos, but they significantly speed up editing workflows. Understanding AI and getting familiar with using it remains one of the best bets for content creators right now.

"We will need a further step-change improvement in model performance to start genuinely automating whole swaths of the economy."

Bottom Line

The strongest part of this argument is its empirical approach — actually testing what models can do rather than speculating. The biggest vulnerability is the gap between task-specific performance and full job automation: even when models match expert quality on tested tasks, they still can't handle everything humans do in a profession.

Watch for the adoption lag. Companies dropped pilot projects at significant rates as AI diffused through industries. That suggests the time from model capability to actual workplace change is longer than headlines imply — and that's probably good news for workers who want to adapt rather than become obsolete.


Sources

OpenAI tests if GPT-5 can automate your job - 4 unexpected findings

by AI Explained

In the last 24 hours, OpenAI have released research on essentially whether current language models can automate your job. The big claim, albeit carefully worded, is that current best Frontier models are approaching industry experts in deliverable quality. But as you'll see from the title, there are plenty of unexpected findings in this research. Before I dive into that, there is one job we seem intent on automating, and that is one of being a UFC fighter.

You can laugh at the lack of performance now, but like me, you might be laughing somewhat nervously. Take a look at this Unitree G1 robot, which maybe hasn't mastered kung fu, but he's getting a bit closer. Quick predictions. Do you reckon billionaires will have robot humanoid bodyguards by 2035?

Let me know. Back to the paper, and they are only focusing on the most important sectors according to their contribution to GDP. What makes things more interesting is that the questions weren't designed by OpenAI. They were designed by industry professionals themselves, with an average of 14 years of industry experience.

They had to meet all sorts of criteria just to design the questions. And here are the headline results, which you may have seen go viral, with Claude Opus 4.1, a model by Anthropic, beating out OpenAI's models and coming quite close to par with industry experts. This I am obviously going to class as the first surprising finding. Not that Opus is the best model, because Opus 4.1, if you haven't tried it, is indeed an amazing model.

So no, that's not the most surprising bit. It's that OpenAI published this result showing Opus beating its own models. I think that's great, honest science, by the way, and I commend OpenAI for publishing this. Now you might be thinking, no Philip, the most surprising bit is how close we're getting to parity with industry experts, but I'll come back to that in just a moment.

Right now I want to cover this second, you could say somewhat surprising, result, which is that the win rate when compared to humans depended quite heavily on the file type involved. If your workflow involves submitting or producing a PDF, PowerPoint, or Excel spreadsheet, you might well find that Opus 4.1 is a league ahead. All these figures, by the way, are on how often a model beats a human expert output ...