The Gap Between Smart and Dependable
A team of researchers at Princeton, led by Arvind Narayanan, Sayash Kapoor, and postdoctoral researcher Stephan Rabanser, has published a paper that attempts something the AI industry has largely avoided: defining and measuring the reliability of AI agents as a distinct property from their raw capability. The paper, "Towards a Science of AI Agent Reliability," tested 14 models from OpenAI, Google, and Anthropic across 500 benchmark runs and arrived at a conclusion that should give pause to anyone deploying autonomous agents in production.
Nearly two years of rapid capability progress have produced only modest reliability gains.
That single finding carries enormous weight. It suggests that the industry's standard practice of reporting a single accuracy number on benchmarks is not just incomplete but actively misleading.
Four Dimensions, Borrowed from Fields That Cannot Afford Failure
Narayanan and his co-authors drew on decades of safety engineering from aviation, nuclear power, and automotive design to decompose reliability into four dimensions: consistency, robustness, calibration, and safety. The analogy to human coworkers is instructive.
When we consider a coworker to be reliable, we don't just mean that they get things right most of the time. We mean something richer.
The paper refines these four dimensions into twelve measurable metrics. Among the findings, consistency stands out as particularly troubling. Agents that can solve a task often fail on repeated attempts under identical conditions, with outcome consistency scores ranging from 30 to 75 percent across tested models. In plain terms: ask the same agent the same question five times, and it may give wildly different results.
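What measuring that looks like in practice is easy to sketch. The snippet below is illustrative rather than the paper's actual metric; it assumes a `run_agent` callable that returns a pass/fail outcome for a task, and it treats consistency as agreement with the majority outcome across identical reruns.

```python
def outcome_consistency(run_agent, task, n_runs=5):
    """Rerun the same task under identical conditions and report
    both the pass rate and how often runs agree with the majority."""
    outcomes = [bool(run_agent(task)) for _ in range(n_runs)]
    pass_rate = sum(outcomes) / n_runs
    majority = pass_rate >= 0.5
    agreement = sum(o == majority for o in outcomes) / n_runs
    return pass_rate, agreement
```

An agent that passes three runs out of five shows a pass rate of 0.6 and an agreement of only 0.6; a perfectly consistent agent scores 1.0 on agreement whether it always passes or always fails.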
Calibration is the weakest dimension across the board: agents are not good at knowing when they're wrong, and when they report confidence, it often carries little signal.
That last point deserves emphasis. An agent that cannot distinguish its correct answers from its incorrect ones is, in a meaningful sense, untrustworthy regardless of its average accuracy.
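One simple way to quantify "little signal," sketched below under assumptions of my own rather than the paper's (which defines its own calibration metrics), is to ask whether an agent's self-reported confidence ranks its correct answers above its incorrect ones at all.

```python
def confidence_signal(records):
    """records: list of (confidence, was_correct) pairs from agent runs.
    Returns the probability that a random correct answer received higher
    confidence than a random incorrect one (an AUROC-style score);
    0.5 means confidence is uninformative, 1.0 means perfect ranking."""
    correct = [conf for conf, ok in records if ok]
    wrong = [conf for conf, ok in records if not ok]
    if not correct or not wrong:
        return None  # cannot assess calibration without both outcomes
    wins = sum((c > w) + 0.5 * (c == w) for c in correct for w in wrong)
    return wins / (len(correct) * len(wrong))
```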
Bigger Models, Not Necessarily Better Reliability
One of the more counterintuitive findings concerns scale. The conventional wisdom in AI development holds that larger models perform better. Rabanser and his colleagues found a more nuanced picture.
Bigger models aren't uniformly more reliable. Scaling up improves some dimensions (calibration, robustness) but can hurt consistency. Larger models with richer behavioral repertoires sometimes show more run-to-run variability.
This is a genuinely important result. It means the industry cannot simply scale its way to reliability. A model with more parameters may know more, but it may also be more unpredictable in how it applies that knowledge. The implications for deployment decisions are significant: organizations choosing between model sizes need to consider which reliability dimensions matter most for their use case, not just which model scores highest on capability benchmarks.
The Accuracy Threshold Argument
The paper anticipates the most obvious objection: perhaps reliability will not matter if accuracy gets high enough. If an agent is right 99 percent of the time, maybe the remaining one percent of unpredictable failures is tolerable. Narayanan and Kapoor push back firmly.
For autonomous operation in high-stakes contexts, we need 3-5 "nines" of performance -- 99.9% to 99.999% accuracy -- in order for reliability to become a non-issue, and we don't think LLM-based agents are on track to reach such a threshold.
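The reason the bar sits that high becomes clear with a back-of-the-envelope calculation that is not in the paper: an autonomous agent chains many steps, and per-step error compounds. Assuming, simplistically, that each step succeeds independently with the same accuracy:

```python
# Illustrative only: end-to-end success if each step of a task
# succeeds independently with the same per-step accuracy.
for per_step in (0.99, 0.999, 0.99999):
    for steps in (10, 50):
        print(f"{per_step} per step over {steps} steps -> "
              f"{per_step ** steps:.3f} end-to-end")
```

At 99 percent per step, a 50-step task succeeds only about 60 percent of the time; at five nines it still succeeds more than 99.9 percent of the time. Real agent steps are not independent, so these numbers are only directional, but they show why "right most of the time" does not translate into dependable autonomy.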
This is where the paper is at its most provocative -- and possibly at its most vulnerable. The claim that current architectures cannot reach such thresholds is a prediction about fundamental limits, and the history of AI is littered with confident predictions about what neural networks could never do. The authors acknowledge this uncertainty honestly, noting that if the current linear improvement trend were projected forward, agents would reach 100 percent reliability in three years. They simply do not believe linear projection is the right model.
Whether they are right about the ceiling matters less than whether they are right about the current gap. And the data on that point is compelling.
Augmentation Versus Automation
The practical advice for deployers centers on a distinction that the industry often blurs: the difference between an AI tool that assists a human and one that operates autonomously.
A coding assistant that occasionally suggests wrong variable names is annoying; an autonomous agent that manages an industrial plant and produces highly variable results is unacceptable.
Kapoor and Narayanan argue that augmentation tools get a reliability "discount" because a human reviews the output. This framing is useful, though it somewhat understates the risk. In practice, human reviewers develop automation bias -- they increasingly trust and rubber-stamp AI outputs over time, particularly when the system is usually correct. The reliability discount for augmentation may erode faster than organizations expect.
What the Benchmarks Are Missing
The paper also connects to the team's earlier work on evaluation methodology, and the authors reserve some of their sharpest criticism for how agents are currently evaluated.
Running a benchmark once and reporting the accuracy number is a superficial measure of performance. It is comparable to stress-testing a car once in perfect weather and declaring it safe if it passes.
The proposed alternative -- reliability profiles that include multiple runs, varied conditions, and ongoing retesting -- would substantially increase the cost and complexity of agent evaluation. That is not necessarily a reason to avoid it, but it does raise questions about who will bear those costs. Academic labs with limited compute budgets may find themselves even further disadvantaged relative to well-funded industry labs, potentially concentrating evaluation authority in the hands of the very companies whose products are being evaluated.
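Part of that added cost is visible in a sketch of what even the "varied conditions" slice of a profile demands. The example below is hypothetical, not the paper's protocol; it assumes a `run_agent` callable and caller-supplied perturbation functions (paraphrased instructions, reordered tools, and the like).

```python
def robustness_profile(run_agent, task, perturbations, n_runs=5):
    """Hypothetical slice of a reliability profile: rerun one task under
    varied conditions and report the drop from the unperturbed baseline.
    `perturbations` maps a label to a task-transforming function, e.g.
    {"paraphrased": rephrase, "tools_shuffled": shuffle_tools}."""
    baseline = sum(bool(run_agent(task)) for _ in range(n_runs)) / n_runs
    report = {"baseline": baseline}
    for label, perturb in perturbations.items():
        varied = sum(bool(run_agent(perturb(task)))
                     for _ in range(n_runs)) / n_runs
        report[label] = {"pass_rate": varied, "drop": baseline - varied}
    return report
```

Every perturbation multiplies the number of runs, and ongoing retesting multiplies it again over time, which is exactly where the cost concern above comes from.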
The Bigger Picture on AI Progress
Perhaps the most consequential claim in the paper is that the capability-reliability gap helps explain why AI agents have not yet produced the dramatic economic effects that their benchmark performance would seem to predict. The authors connect their work to a recent UK AI Safety Institute report that identified six barriers to broadly capable AI.
The gazillion-dollar question is whether agents will get better across the board through general methods such as inference scaling and reinforcement learning, or whether painstaking work will be required to improve individual dimensions of reliability, adaptability, originality, and so on.
If the answer is the latter -- that each dimension requires targeted engineering effort -- then the timeline to economically transformative AI agents extends considerably. It would mean that crushing a capability benchmark is the easy part, and the hard, unglamorous work of making systems dependable in the real world is where the true bottleneck lies.
Bottom Line
This paper from the Princeton team makes a rigorous case for something that practitioners have long intuited: AI agents that perform well on average can still be dangerously unreliable in practice. By decomposing reliability into measurable dimensions and showing that progress on those dimensions has lagged far behind capability gains, Narayanan, Kapoor, and Rabanser have given the field a vocabulary and a framework it badly needed. The proposed "reliability index" for AI agents could become as important as accuracy leaderboards -- if the industry is willing to adopt metrics that might make its products look less impressive. The open question is whether it will.