Rohit Krishnan challenges a foundational assumption in the rapidly scaling world of artificial intelligence: that we can simply route complex tasks to the best available model without first teaching those models to understand their own limitations. By introducing a new benchmark called MarketBench, Krishnan and his co-author Andrey Fradkin reveal a startling gap between what AI agents claim they can do and what they actually achieve, suggesting that the dream of a self-organizing AI economy is currently stalled by a lack of self-knowledge.
The Hayekian Promise vs. The Calibration Gap
Krishnan frames the problem of assigning tasks to AI agents through the lens of economic theory, specifically the work of Friedrich Hayek on the "local knowledge problem." The central thesis is that no central planner—human or algorithmic—can possess the dispersed, specific information required to match every task to the perfect agent. Instead, Krishnan proposes a market mechanism where agents bid on tasks based on their own assessment of cost and probability of success. "Markets tend to be superior to other forms of resource allocation when information and capabilities are distributed among a variety of people," Krishnan writes, arguing that this aggregation of private information is the only way to efficiently manage a heterogeneous ecosystem of models.
However, the piece quickly pivots from theory to a harsh empirical reality. To test this, the authors built MarketBench, asking six frontier models to forecast their own success rates and token consumption before attempting real software engineering tasks. The results were disqualifying for a market-based approach. "Models don't know themselves very well," Krishnan states bluntly. The data showed that while actual pass rates clustered tightly between 75% and 81%, the models' stated confidence spanned wildly from 61% to 93%. Some models, particularly from the Gemini family, were dramatically overconfident, while the GPT family was systematically under-confident.
"If you were running a market and asked agents 'how much compute will this take?' you'd get answers that are off by an order of magnitude or two."
This calibration failure has profound implications. In a functioning market, a bidder's price signals their capability. Here, the signals are noise. Krishnan notes that when they ran a simulated procurement auction, the results were predictable but disastrous: "Gemini wins 84.6% of auctions. But it's winning because it's the most overconfident, not because it's the most capable." This inverts the classic Hayekian diagnosis: the market fails here not because a central planner has crowded out local knowledge, but because the private information a market is supposed to aggregate simply doesn't exist in the agents' internal states. The analogy to Goodhart's law is apt here: once a measure (self-reported confidence) becomes a target (winning the bid), it ceases to be a good measure of the underlying reality (actual capability).
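The failure mode can be sketched in a few lines. This is our own illustration, not the authors' code: an auction that awards each task to whichever agent claims the highest success probability. The agent names and numbers are hypothetical, loosely echoing the article's spread of 61–93% stated confidence against 75–81% actual pass rates.

```python
# Illustrative sketch (not MarketBench itself): a confidence-based auction
# where the winner is whoever *claims* the highest success probability.
# All figures below are hypothetical.
agents = {
    "overconfident":  {"stated": 0.93, "actual": 0.75},
    "calibrated":     {"stated": 0.78, "actual": 0.78},
    "underconfident": {"stated": 0.61, "actual": 0.81},
}

def run_auction(agents):
    # Award the task to the highest self-reported confidence.
    return max(agents, key=lambda name: agents[name]["stated"])

winner = run_auction(agents)
print(winner)                    # "overconfident" wins the bid...
print(agents[winner]["actual"])  # ...despite having the *worst* actual pass rate
```

The point of the sketch: when bids are uncorrelated with capability, the auction systematically selects for overconfidence, exactly the Goodhart dynamic described above.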
The Limits of Prompting and the Case for Diversity
Recognizing that the models lacked self-awareness, Krishnan tested a simpler intervention: providing each model with a "report card" of its historical performance to help it calibrate its current bids. While this improved the models' average accuracy slightly, it failed to solve the core problem of task-specific routing. "The intervention improved average calibration, not comparative routing," Krishnan explains, noting that a bidder can be right on average but still useless for allocation if they cannot distinguish between tasks they can solve and those they cannot.
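The distinction between average calibration and comparative routing is worth making concrete. The following toy example (our construction, with made-up numbers) shows an agent whose average confidence perfectly matches its average pass rate, yet whose bids carry zero information about which tasks it will actually solve:

```python
import statistics

# Hypothetical per-task data for one agent: stated confidence vs outcome
# (1 = pass, 0 = fail). Average calibration is perfect, but the bids are
# flat, so they cannot distinguish solvable tasks from unsolvable ones.
stated  = [0.75, 0.75, 0.75, 0.75]
outcome = [1, 1, 1, 0]

avg_gap = abs(statistics.mean(stated) - statistics.mean(outcome))
print(avg_gap)  # 0.0 -- perfectly calibrated on average

# Routing value: does higher confidence predict success? With flat bids, no.
spread = max(stated) - min(stated)
print(spread)   # 0.0 -- no task-level signal for an allocator to use
```

A market needs the spread, not just the average: allocation depends on agents bidding *differently* on tasks they can and cannot do.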
This leads to a nuanced finding about system architecture. When the authors replaced the market mechanism with a centralized router—a single large model tasked with picking the best worker—the centralized planner actually outperformed the flawed market. "Once we held model diversity constant, a LLM central planner beat the market," Krishnan admits. This suggests that until agents can reliably self-assess, the "invisible hand" of the market is less efficient than a visible, albeit imperfect, hand of a central router.
"The single most robust finding in our live scaffold is that access to multiple different (frontier) models helps, almost regardless of how you route between them."
Despite the failure of the market mechanism itself, the study uncovers a critical practical takeaway for engineers: diversity is king. Even with crude routing, a system that leverages multiple different models significantly outperforms a single-model approach. This is a vital distinction. It implies that the immediate bottleneck is not the sophistication of the routing logic, but the fundamental architecture of the agent pool. "Don't lock into one provider, even if your routing logic is crude," Krishnan advises, emphasizing that the heterogeneity of the models themselves provides a buffer against the specific blind spots of any single architecture.
Critics might argue that focusing on self-assessment as a training target is a distraction from improving raw reasoning capabilities. If models simply get smarter, won't they naturally become better at estimating their own success? Krishnan anticipates this, arguing that solving a task and predicting the probability of solving it are distinct cognitive skills that require separate optimization. "Models are trained to solve tasks, not to predict whether they can solve them," he writes, suggesting that without explicit training on metacognition, raw intelligence alone will not yield a functional market.
Bottom Line
Rohit Krishnan's analysis delivers a necessary reality check to the hype surrounding autonomous AI markets: the infrastructure for decentralized coordination is currently broken because the participants lack self-knowledge. While the Hayekian vision of agents bidding on tasks remains theoretically sound, the empirical evidence shows that without a fundamental shift in how models are trained to understand their own capabilities, centralized routing and model diversity remain the only reliable strategies. The most urgent next step for the field is not better algorithms for bidding, but better curricula for metacognition.
"As agentic systems scale, the ability to say 'I can do this, at this cost, with this confidence' becomes as important as the ability to do the thing."
The Path Forward
The piece concludes with a call for a hybrid approach, acknowledging that pure decentralization is premature but that centralized planning will eventually hit a wall as the ecosystem grows too complex. Krishnan envisions a "scoring auction" where bids are weighted by reputation and observed history, effectively creating a market augmented by AI oversight. This middle ground recognizes that while the agents cannot yet be trusted to tell the truth, a system can be built to verify their claims over time. For now, the advice is pragmatic: test your models' self-assessment capabilities before betting your infrastructure on a market mechanism, because right now, they mostly don't know what they're good at.
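One minimal way to read the "scoring auction" idea, sketched here under our own assumptions rather than any specification from the essay: shrink each agent's stated confidence toward its observed historical pass rate before comparing bids, so a strong track record outweighs a loud self-report.

```python
# Sketch of a reputation-weighted scoring auction (our construction).
# "weight" controls how much the auctioneer trusts observed history
# over the agent's self-report; both agents' numbers are hypothetical.
def score(stated, history, weight=0.7):
    """Blend an agent's self-reported confidence with its track record."""
    observed = sum(history) / len(history)
    return weight * observed + (1 - weight) * stated

bids = {
    "overconfident": score(0.93, [1, 0, 1, 0, 1, 0, 1, 0]),  # 50% track record
    "modest":        score(0.70, [1, 1, 1, 0, 1, 1, 1, 0]),  # 75% track record
}
winner = max(bids, key=bids.get)
print(winner)  # "modest" -- history outweighs the inflated self-report
```

The design choice is the essence of the hybrid: the market still runs on bids, but the bids are disciplined by verification over time rather than taken at face value.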