The claim sounds alarming: current language models are approaching industry expert quality. But the real story isn't about automation happening now — it's about how much harder it actually is to automate work than headlines suggest.
A recent OpenAI study examined whether frontier models like GPT-5 can automate jobs across five occupations contributing at least 5% to US GDP. The paper asked a simple question: can these models perform tasks as well as human experts? The answer is complicated, and so are the implications.
The Headline Results
The research found that Claude Opus 4.1 — an Anthropic model — actually beat OpenAI's own models in quality. This matters less as a ranking detail and more as evidence of honest science: OpenAI published results showing a competitor outperforming their own systems.
More surprising was the variance in win rates across file types: when workflows involved PDFs, PowerPoints, or Excel spreadsheets, Opus 4.1 performed best. These win rates measure how often a model's output beat a human expert's, as judged by other human experts.
The study also revealed a critical threshold: below GPT-5 performance, using models actually slowed human workers down because reviewing their output took longer than doing the task manually. Above that threshold — with GPT-5 and its successors — humans got faster. The models do enough tasks well enough to genuinely speed up workflows.
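The threshold logic above reduces to simple expected-time arithmetic. Here is a minimal sketch; the function name, the minute values, and the acceptance rates are all illustrative assumptions, not figures from the study:

```python
# Hypothetical expected-time model for "does AI assistance speed you up?"
# All numbers below are illustrative assumptions, not figures from the study.

def expected_time_with_model(p_usable: float,
                             review_minutes: float,
                             manual_minutes: float) -> float:
    """Expected minutes per task when a human reviews model output first.

    With probability p_usable the output passes review (cost: review only);
    otherwise the human pays for the review AND redoes the task by hand.
    """
    return (p_usable * review_minutes
            + (1 - p_usable) * (review_minutes + manual_minutes))

MANUAL = 60.0   # minutes to do the task by hand (assumption)
REVIEW = 15.0   # minutes to check one model attempt (assumption)

# Weaker model: output usable 20% of the time -> net slowdown (63 min vs 60).
print(expected_time_with_model(0.20, REVIEW, MANUAL))

# Stronger model: output usable 80% of the time -> big speedup (27 min vs 60).
print(expected_time_with_model(0.80, REVIEW, MANUAL))
```

With these numbers the break-even acceptance rate is review/manual = 0.25: below it, reviewing model output costs more time than it saves, which is the slowdown effect the study describes for sub-GPT-5 models.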
Critics might note this finding depends heavily on what counts as "acceptable" quality. Human experts reviewing model outputs may not catch subtle errors, creating a false sense of efficiency.
Why Automation Isn't Happening Fast
The paper's biggest claim comes from Lawrence Summers — the former Harvard president — who said models can now do many tasks as well or better than humans in specific domains. If that's generally true, it would support OpenAI's assertion that their systems represent AGI.
But here's where reality diverges from headlines. The study excluded occupations whose work wasn't predominantly digital, and it examined only a fraction of what any given profession actually does. Consider one occupation rated as "predominantly digital": property managers. Researchers found GPT-5 could handle 19 or 20 tasks within that role, but the remaining tasks still require human judgment.
This echoes a historical pattern worth remembering.
The Radiology Parallel
In 2016, Geoffrey Hinton argued we should stop training new radiologists. Models released in 2017 could detect pneumonia with greater accuracy than board-certified radiologists. Eight years later, radiologists earn average salaries of over half a million dollars, 48% higher than in 2015.
Why? Training data didn't cover edge cases. Legal hurdles remained. Most importantly, radiology involves tasks these models couldn't touch: talking to patients, coordinating staff, investigating complaints.
The pattern suggests current AI automates specific tasks within professions, not entire jobs. Those professions often become better paid as the work concentrates in areas machines can't handle.
The Catastrophic Mistakes Problem
The study acknowledges one limitation that doesn't fit neatly into quality metrics: catastrophic failures. These aren't minor errors; they are situations that cause physical harm or insult customers, and they occurred in roughly 2.7% of cases.
If damage from those catastrophic failures outweighs efficiency gains by a factor of 100, then deploying AI without human oversight could cost more in the long run than it saves. This is particularly true in domains where mistakes are expensive and trust matters.
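The arithmetic behind that trade-off is plain expected value. A minimal sketch, using the roughly 2.7% failure rate quoted above; the per-task gain and the damage multiplier are illustrative assumptions:

```python
# Expected value per task of deploying a model without human oversight.
# The 2.7% catastrophic-failure rate comes from the discussion above;
# the gain and damage sizes are illustrative assumptions.

def net_expected_value(gain: float,
                       p_catastrophic: float,
                       damage_multiplier: float) -> float:
    """Per-task efficiency gain minus the expected cost of a catastrophic
    failure, where damage is expressed as a multiple of the gain."""
    return gain - p_catastrophic * damage_multiplier * gain

# Damage 100x the per-task gain: expected value turns negative
# (1 - 0.027 * 100 = -1.7 gains per task), so oversight-free
# deployment loses value on average.
print(net_expected_value(1.0, 0.027, 100.0))

# Damage only 10x the gain: still positive on average.
print(net_expected_value(1.0, 0.027, 10.0))
```

The break-even damage multiplier is 1/0.027, roughly 37: if a catastrophic failure costs less than about 37 per-task gains, unsupervised deployment still pays off on average; above that, the expected cost of failures swamps the efficiency benefit.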
One telling example: Claude 4.1 fabricated price data for a model, then apologized for "hallucinating" the values rather than acknowledging up front that it had invented them. An employee who made up critical numbers without telling anyone would be fired or prosecuted. The model's equivalent gets a different response, but the economic harm could be similar.
What This Means For Your Job
The evidence suggests we will need a further step-change improvement in model performance before wholesale job automation begins. Current generation AI automates specific tasks within professions, not entire roles. That creates opportunity: workers become faster at what they do well, while concentrating their effort where human judgment matters most.
One tipping point has arrived: visual content creation. Tools like Descript can't fully edit videos, but they significantly speed up editing workflows. Understanding AI and getting familiar with using it remains one of the best bets for content creators right now.
"We will need a further step-change improvement in model performance to start genuinely automating whole swaths of the economy."
Bottom Line
The strongest part of this argument is its empirical approach — actually testing what models can do rather than speculating. The biggest vulnerability is the gap between task-specific performance and full job automation: even when models match expert quality on tested tasks, they still can't handle everything humans do in a profession.
Watch for the adoption lag. Companies dropped pilot projects at significant rates as AI diffused through industries. That suggests the time from model capability to actual workplace change is longer than headlines imply — and that's probably good news for workers who want to adapt rather than become obsolete.