The Gap Between Smart and Dependable
A team of researchers at Princeton, led by Arvind Narayanan, Sayash Kapoor, and postdoctoral researcher Stephan Rabanser, has published a paper that attempts something the AI industry has largely avoided: defining and measuring the reliability of AI agents as a distinct property from their raw capability. The paper, "Towards a Science of AI Agent Reliability," tested 14 models from OpenAI, Google, and Anthropic across 500 benchmark runs and arrived at a conclusion that should give pause to anyone deploying autonomous agents in production.
Nearly two years of rapid capability progress have produced only modest reliability gains.
That single finding carries enormous weight. It suggests that the industry's standard practice of reporting a single accuracy number on benchmarks is not just incomplete but actively misleading.
Four Dimensions, Borrowed from Fields That Cannot Afford Failure
Narayanan and his co-authors drew on decades of safety engineering from aviation, nuclear power, and automotive design to decompose reliability into four dimensions: consistency, robustness, calibration, and safety. The analogy to human coworkers is instructive.
When we consider a coworker to be reliable, we don't just mean that they get things right most of the time. We mean something richer.
The paper refines these four dimensions into twelve measurable metrics. Among the findings, consistency stands out as particularly troubling. Agents that can solve a task often fail on repeated attempts under identical conditions, with outcome consistency scores ranging from 30 to 75 percent across tested models. In plain terms: ask the same agent the same question five times, and it may give wildly different results.
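What measuring that looks like in practice is easy to sketch. The snippet below is illustrative rather than the paper's actual metric; it assumes a `run_agent` callable that returns a pass/fail outcome for a task, and it treats consistency as agreement with the majority outcome across identical reruns.

```python
def outcome_consistency(run_agent, task, n_runs=5):
    """Rerun the same task under identical conditions and report
    both the pass rate and how often runs agree with the majority."""
    outcomes = [bool(run_agent(task)) for _ in range(n_runs)]
    pass_rate = sum(outcomes) / n_runs
    majority = pass_rate >= 0.5
    agreement = sum(o == majority for o in outcomes) / n_runs
    return pass_rate, agreement
```

An agent that passes three runs out of five shows a pass rate of 0.6 and an agreement of only 0.6; a perfectly consistent agent scores 1.0 on agreement whether it always passes or always fails.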
Calibration is the weakest dimension across the board: agents are not good at knowing when they're wrong, and when they report confidence, it often carries little signal.
That last point deserves emphasis. An agent that cannot distinguish its correct answers from its incorrect ones is, in a meaningful sense, untrustworthy regardless of its average accuracy.
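One simple way to quantify "little signal," sketched below under assumptions of my own rather than the paper's (which defines its own calibration metrics), is to ask whether an agent's self-reported confidence ranks its correct answers above its incorrect ones at all.

```python
def confidence_signal(records):
    """records: list of (confidence, was_correct) pairs from agent runs.
    Returns the probability that a random correct answer received higher
    confidence than a random incorrect one (an AUROC-style score);
    0.5 means confidence is uninformative, 1.0 means perfect ranking."""
    correct = [conf for conf, ok in records if ok]
    wrong = [conf for conf, ok in records if not ok]
    if not correct or not wrong:
        return None  # cannot assess calibration without both outcomes
    wins = sum((c > w) + 0.5 * (c == w) for c in correct for w in wrong)
    return wins / (len(correct) * len(wrong))
```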
Bigger Models, Not Necessarily Better Reliability
One of the more counterintuitive findings concerns scale. The conventional wisdom in AI development holds that larger models perform better. Rabanser and his colleagues found a more nuanced picture.
Bigger models aren't uniformly more reliable. Scaling up improves some dimensions (calibration, robustness) but can hurt consistency. Larger models with richer behavioral repertoires sometimes show more run-to-run variability.
This is a genuinely important result. It means the industry cannot simply scale its way to reliability. A model with more parameters may know more, but it may also be more unpredictable in how it applies that knowledge. The implications for deployment decisions are significant: organizations choosing between model sizes need to consider which reliability dimensions matter most for their use case, not just which model scores highest on capability benchmarks.
The Accuracy Threshold Argument
The paper anticipates the most obvious objection: perhaps reliability will not matter if accuracy gets high enough. If an agent is right 99 percent of the time, maybe the remaining one percent of unpredictable failures is tolerable. Narayanan and Kapoor push back firmly.
For autonomous operation in high-stakes contexts, we need 3-5 "nines" of performance -- 99.9% to 99.999% accuracy -- in order for reliability to become a non-issue, and we don't think LLM-based agents are on track to reach such a threshold.
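The reason the bar sits that high becomes clear with a back-of-the-envelope calculation that is not in the paper: an autonomous agent chains many steps, and per-step error compounds. Assuming, simplistically, that each step succeeds independently with the same accuracy:

```python
# Illustrative only: end-to-end success if each step of a task
# succeeds independently with the same per-step accuracy.
for per_step in (0.99, 0.999, 0.99999):
    for steps in (10, 50):
        print(f"{per_step} per step over {steps} steps -> "
              f"{per_step ** steps:.3f} end-to-end")
```

At 99 percent per step, a 50-step task succeeds only about 60 percent of the time; at five nines it still succeeds more than 99.9 percent of the time. Real agent steps are not independent, so these numbers are only directional, but they show why "right most of the time" does not translate into dependable autonomy.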
This is where the paper is at its most provocative -- and possibly at its most vulnerable. The claim that current architectures cannot reach such thresholds is a prediction about fundamental limits, and the history of AI is littered with confident predictions about what neural networks could never do. The authors acknowledge this uncertainty honestly, noting that if the current linear improvement trend were projected forward, agents would reach 100 percent reliability in three years. They simply do not believe linear projection is the right model.
Whether they are right about the ceiling matters less than whether they are right about the current gap. And the data on that point is compelling.
Augmentation Versus Automation
The practical advice for deployers centers on a distinction that the industry often blurs: the difference between an AI tool that assists a human and one that operates autonomously.
A coding assistant that occasionally suggests wrong variable names is annoying; an autonomous agent that manages an industrial plant and produces highly variable results is unacceptable.
Kapoor and Narayanan argue that augmentation tools get a reliability "discount" because a human reviews the output. This framing is useful, though it somewhat understates the risk. In practice, human reviewers develop automation bias -- they increasingly trust and rubber-stamp AI outputs over time, particularly when the system is usually correct. The reliability discount for augmentation may erode faster than organizations expect.
What the Benchmarks Are Missing
The paper also connects to the team's earlier work on evaluation methodology, and the authors reserve some of their sharpest criticism for how agents are currently evaluated.
Running a benchmark once and reporting the accuracy number is a superficial measure of performance. It is comparable to stress-testing a car once in perfect weather and declaring it safe if it passes.
The proposed alternative -- reliability profiles that include multiple runs, varied conditions, and ongoing retesting -- would substantially increase the cost and complexity of agent evaluation. That is not necessarily a reason to avoid it, but it does raise questions about who will bear those costs. Academic labs with limited compute budgets may find themselves even further disadvantaged relative to well-funded industry labs, potentially concentrating evaluation authority in the hands of the very companies whose products are being evaluated.
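Part of that added cost is visible in a sketch of what even the "varied conditions" slice of a profile demands. The example below is hypothetical, not the paper's protocol; it assumes a `run_agent` callable and caller-supplied perturbation functions (paraphrased instructions, reordered tools, and the like).

```python
def robustness_profile(run_agent, task, perturbations, n_runs=5):
    """Hypothetical slice of a reliability profile: rerun one task under
    varied conditions and report the drop from the unperturbed baseline.
    `perturbations` maps a label to a task-transforming function, e.g.
    {"paraphrased": rephrase, "tools_shuffled": shuffle_tools}."""
    baseline = sum(bool(run_agent(task)) for _ in range(n_runs)) / n_runs
    report = {"baseline": baseline}
    for label, perturb in perturbations.items():
        varied = sum(bool(run_agent(perturb(task)))
                     for _ in range(n_runs)) / n_runs
        report[label] = {"pass_rate": varied, "drop": baseline - varied}
    return report
```

Every perturbation multiplies the number of runs, and ongoing retesting multiplies it again over time, which is exactly where the cost concern above comes from.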
The Bigger Picture on AI Progress
Perhaps the most consequential claim in the paper is that the capability-reliability gap helps explain why AI agents have not yet produced the dramatic economic effects that their benchmark performance would seem to predict. The authors connect their work to a recent UK AI Safety Institute report that identified six barriers to broadly capable AI.
The gazillion-dollar question is whether agents will get better across the board through general methods such as inference scaling and reinforcement learning, or whether painstaking work will be required to improve individual dimensions of reliability, adaptability, originality, and so on.
If the answer is the latter -- that each dimension requires targeted engineering effort -- then the timeline to economically transformative AI agents extends considerably. It would mean that crushing a capability benchmark is the easy part, and the hard, unglamorous work of making systems dependable in the real world is where the true bottleneck lies.
Bottom Line
This paper from the Princeton team makes a rigorous case for something that practitioners have long intuited: AI agents that perform well on average can still be dangerously unreliable in practice. By decomposing reliability into measurable dimensions and showing that progress on those dimensions has lagged far behind capability gains, Narayanan, Kapoor, and Rabanser have given the field a vocabulary and a framework it badly needed. The proposed "reliability index" for AI agents could become as important as accuracy leaderboards -- if the industry is willing to adopt metrics that might make its products look less impressive. The open question is whether it will.