{"content": "The claim that AI has "jagged" capabilities—extraordinarily good at some tasks, terrible at others—is about to collapse. And the reason it matters isn't just technical: it's a fundamental shift in how we should think about deploying AI in the workplace.
That's the argument from Nate B Jones, and he's backed it up with something that should make you sit up: four major AI labs have independently built remarkably similar multi-agent coordination systems without ever coordinating with each other. And nobody's talking about why.
The Myth of Jaggedness
For three years, we've organized our thinking around what Jones calls "the jagged frontier." AI is incredible at some things, terrible at others. Experts talk about it constantly. We see it in daily life. It feels like a truism.
But Jones wants you to reconsider: the jagged frontier was never an inherent property of AI intelligence. It was an artifact of how we were asking AI to work.
When you ask a model for one answer in one turn—solve this problem, give me an answer—all the variance in task difficulty shows up as jaggedness in outcomes. That's not because the intelligence is jagged. It's because no organizational structure was being applied to that work.
We were asking a capable analyst to solve every problem in 30 seconds with no notes, no colleagues, no ability to try something and retry. That mental model worked in 2022. But it's shifting now.
The Inference Turn
The arrival of inference computing changed everything. AI can take time to decide. It can spend tokens reasoning before it answers. With tools like ChatGPT 5.2 Thinking and 5.2 Pro, the AI can try approaches that don't work, correct its mistakes, and come back. This produces higher quality results.
We see better performance. But most of our conversation has focused on intelligence improvements—and we've missed something crucial. The jaggedness has started to smooth out.
You don't have issues with counting the Rs in "strawberry" anymore, do you? The mental model that shaped three years of AI strategy needs to change. It needs to change because, Jones argues, the last 30 days have made it clear that jaggedness is no longer the right paradigm for how AI works in the workplace.
It is certainly true that there are extraordinary capabilities for AI and capabilities that are just very good. That is a kind of jaggedness. But that's not super relevant, because the last time most people solved an international Olympiad math problem at work was never. It just doesn't happen at work.
In the world of practical work—PRDs, code, customer service tickets—AI is not jagged anymore. And we need to stop pretending that it is.
What Single Turn Actually Means
Jones lays out what single turn single agent interaction actually means: you present a problem, the model produces a response. If it contains an error midway through, the error propagates through everything that follows. If the first approach is wrong, there's no mechanism to detect that and try something else.
If the task requires more information than fits in a context window, it cannot accumulate that information incrementally. Every problem needs to be solved in one shot. This is the most primitive version of a chatbot—close to what we experienced with ChatGPT when it first launched.
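The contrast can be sketched in a few lines. This is my illustration, not anyone's real API: `ask_model` is a placeholder for a single LLM call, and the feedback loop stands in for the retry-and-accumulate behavior the article describes.

```python
def ask_model(prompt: str) -> str:
    """Placeholder for one model call; returns a candidate answer."""
    return "answer to: " + prompt

# Single-turn, single-agent: one call, no retries. An error midway
# through propagates; there is no mechanism to detect it and recover.
def single_turn(problem: str) -> str:
    return ask_model(problem)

# Iterative: try, check, revise — accumulating notes across turns,
# closer to how a competent human professional actually works.
def iterative(problem: str, check, max_turns: int = 5) -> str:
    notes = []
    answer = ask_model(problem)
    for _ in range(max_turns):
        ok, feedback = check(answer)
        if ok:
            return answer
        notes.append(feedback)  # information accumulates incrementally
        answer = ask_model(problem + "\nFeedback so far:\n" + "\n".join(notes))
    return answer
```

The point is structural: the second function isn't smarter, it just has somewhere to put intermediate feedback.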
And this is not how any competent human professional works. It's not how a lawyer researches a case. It's not how an engineer designs a system. It's not how a scientist runs an experiment. All these involve trying things, recognizing when they're not working, adjusting, accumulating information over time, getting feedback at intermediate stages, and revising.
The review processes, the sprint cycles, the peer feedback loops—all of those exist because we have a hard time solving one-shot cognition problems too. And we've forgotten that AI might be able to use that help.
We deployed AI into a paradigm that removes so many of those structures that help us think, and then we described the resulting limitations as a property of AI itself.
The Learning Curve That Matters More
Jones walks through what happened after 2022. We got inference, which helps AI catch its mistakes. We got tools for AI. We also realized we needed to get better at describing our tasks—that's prompting. We've been building that tooling while AI has been getting smarter, because we've been scaling intelligence.
We've been scaling it partly through inference and partly through reinforcement learning—the tried-and-true method since the beginning of LLMs. What we see is a trend line where intelligence has been climbing, but our fluency at using the tool has been getting better too. And we haven't been tracking that curve. We've been talking about the intelligence curve. We have not been talking about the curve that allows us to actually use this tool—the ability to learn to put agents into harnesses, the ability to use tools in a loop to do practical work.
What we really haven't recognized is that this learning trend line now matters more than the intelligence curve—at least for practical work. Because the scale at which we can operate intelligence is now a function of our ability to use tools with agents, our ability to use harnesses with agents.
A harness is the state around the agent—the scaffolding around the agent, the thing the agent operates within that allows it to do work. Maybe it's a markdown file for tasks. Maybe it's a spot to put its memory. All of it comes together as a harness, and it allows the agent to do meaningful work.
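A minimal sketch of what that scaffolding might look like on disk. The file names (`TASKS.md`, `MEMORY.md`, `PROGRESS.md`) are my own illustration of the "markdown file for tasks" and "spot to put its memory" ideas, not any vendor's actual layout.

```python
from pathlib import Path

def init_harness(root: str) -> Path:
    """Create the durable state an agent reads and writes between turns."""
    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)
    (base / "TASKS.md").write_text("- [ ] first task\n")  # task list to check off
    (base / "MEMORY.md").write_text("")                   # a spot for its memory
    (base / "PROGRESS.md").write_text("")                 # notes the next session reads
    return base

def remember(base: Path, note: str) -> None:
    """Append a note so a future session starts with context."""
    with (base / "MEMORY.md").open("a") as f:
        f.write(note + "\n")
```

Nothing here is intelligent. That's the point: the harness is plain state, and the agent's effectiveness comes from operating within it.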
We've forgotten how valuable that part is. We've forgotten that if we do it well, maybe we address the jaggedness. And so when the first couple of months of the year arrived, we were surprised when the jaggedness started to disappear.
All at once. Video starts to get better. Text starts to get better. Mathematics gets better. Science gets better. We're talking about specific advances in the last 60 days. We are not seeing jagged improvements anymore. We are seeing a pattern where everything is getting better at once. The frontier of AI is smoothing, and we see even more smoothing if we look at the smaller bubble that is work—because work is inside the frontier at this point.
For most of our work, this is a smooth product. It is not jagged. And we've got to recognize how big a deal that is because it changes all of our assumptions about where we should expect AI to work and where we should deploy.
The Proof
Jones points to March 3, when Cursor CEO Michael Truell announced that Cursor had discovered a novel solution to problem six of First Proof—a research-grade mathematics problem drawn from unpublished work by Stanford, MIT, and Berkeley academics. You can't reinforcement-learn on it. And the system didn't just solve it. It improved on the official human-written solution—stronger bounds, better coverage.
And they did it using the exact same coding harness that six weeks earlier had built a web browser from scratch. The harness ran for four days on this math problem with zero hints, zero human nudges, and zero midcourse guidance. And then it solved it.
Here's why Jones says the smoothing matters: Cursor didn't build this to solve math problems. It's one thing if Google says, "We put this special math model together. It's super special and it did special math." Or if OpenAI says the same thing. Great. Good for you.
But in this case, it matters more because Cursor is a coding company. A system designed to write code looked at a problem in spectral graph theory and produced mathematics that the problem's own authors hadn't found. This is a huge deal.
Michael Truell put it well: "This suggests our technique for scaling agent coordination might generalize beyond coding."
Jones goes farther: it suggests that the way we put agents into harnesses to do long-running work looks like it will work for any domain that is even reasonably verifiable—in other words, where we can reasonably determine a correct answer. That opens up a lot. That's not just math. That's not just code. That's legal. That's many customer service use cases because there's a verifiable correct answer.
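The reason verifiability matters can be made concrete with a propose-and-verify loop. This is a toy sketch of the general technique, not any lab's system: `propose` stands in for a model call, and the verifier is any cheap check for a correct answer.

```python
def solve_verifiable(propose, verify, attempts: int = 10):
    """Keep sampling candidates until one passes the verifier."""
    for i in range(attempts):
        candidate = propose(i)
        if verify(candidate):  # a verifiable correct answer is what makes this work
            return candidate
    return None

# Toy usage: the "model" proposes integers; the verifier wants a multiple of 7.
result = solve_verifiable(lambda i: i * 3 + 1, lambda x: x % 7 == 0)  # finds 7
```

In domains without a verifier—ambiguous requirements, stakeholder conflicts—there is nothing to plug into `verify`, which is exactly the limit the critics raise later.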
What's In The Cursor Agent
What is in the box on this Cursor agent? Is it something special? Is it secret sauce they're not going to share? No—they did share. In January, Wilson Lynn published a Cursor blog post on scaling long-running autonomous coding.
The first attempt was flat coordination—agents shared a single file, they used locks to avoid collision—and it failed very badly. Agents became risk-averse. They avoided difficult tasks and they optimized for small and safe changes. You got lots of activity, but you did not get much progress.
The breakthrough came from hierarchy and specialization. There are two layers: planners explore the codebase and create tasks, spawning sub-planners recursively, while workers pick up individual tasks and grind until done, ignoring everything else. A judge—itself an LLM—determines whether to continue, and the next iteration begins afresh.
The judge's ability to restart cleanly, bringing in a new agent with fresh context, turned out to be one of the system's most important properties because it got around the problem of the context window.
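The planner/worker/judge loop can be sketched in a few lines. All function names and structure here are my illustration of the pattern as described, not Cursor's actual code.

```python
def run_iteration(plan, work, judge, goal):
    tasks = plan(goal)                  # planner decomposes the goal into tasks
    results = [work(t) for t in tasks]  # workers grind in isolation, ignoring the rest
    return judge(goal, results)         # judge (an LLM) decides: done, or go again?

def run_harness(plan, work, judge, goal, max_iters=100):
    for _ in range(max_iters):
        if run_iteration(plan, work, judge, goal):
            return True
        # Restarting cleanly: the next iteration begins with a fresh agent
        # and fresh context, sidestepping the context-window limit.
    return False
```

Note that no state leaks between iterations except what the judge chooses to carry forward—that clean restart is the property the article highlights.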
And so, as Jones mentioned earlier, the test case was building a web browser from scratch in Rust. The agents ran for a week and wrote a million lines of code. Cursor ran the same harness on a Solid-to-React migration and got that to work. They ran it on a Java language server—these are all coding problems. They ran it on a Windows 7 emulator (1.2 million lines) and an Excel clone (1.6 million lines).
Two lessons emerged. First, model choice matters a lot for long-horizon tasks—they found that GPT 5.2 consistently outperforms Claude Opus, which tends to stop earlier and take shortcuts. Second, and more counter-intuitively, many of the improvements they made came from removing complexity in the agentic system rather than adding to it.
The actual improvement came from stripping out a lot of the complicated coordination machinery, adding hierarchy, and letting agents work in very clean isolation.
Jones says it's probably not an accident that this harness looks very similar to the Codeium harness you can set up if you download the Codeium app, where agents run in isolation in sandboxes.
The deepest observation is this: the system's behavior is disproportionately determined by the design of the prompt. Prompting is still going to matter in the future, Jones says. If you can prompt with all of the information—the complete picture of what the model needs to do to be correct—and you set up your harness correctly, it will run for a long time.
And so Cursor got excited. They got experimental and pointed the system at this math problem—and it found an approach I can barely pronounce, something involving the Marcus-Spielman-Srivastava interlacing polynomials method. Don't try to say that five times fast. But the point is it solved the problem, and it went beyond what the humans did.
This should wake you up. If you are thinking that a coding agent just does code, if you are thinking that an LLM is a narrow thing, this should wake you up. It is not a narrow thing. These LLMs, especially in agents, are generalizing broadly.
And this goes back to what Jones was saying earlier: we have assumed that jagged responses from LLMs are a function of intelligence. But the lesson that's in plain sight over the last few years is that it's actually been at least as much a function of the harness we put the agent in.
The Four Labs
At this point, four organizations—Anthropic, Google DeepMind, OpenAI, and Cursor—have independently built very large multi-agent coordination systems designed to do long-horizon work. None have coordinated. All four exhibit a similar structural pattern. And to Jones's mind, this hasn't been clearly articulated.
Hear him out: these systems are not as different as they sound. There are some differences in their patterns related to the models they use, but the underlying architectures are similar. One, decompose the work. Two, parallelize the execution. Three, verify outputs. And then iterate toward completion.
Anthropic's approach is an initializer agent that sets up environment state and a progress file. A coding agent then makes incremental progress and leaves structured artifacts that the next session can read. Without this structure, the failure modes are vivid: the agent might try to one-shot the whole implementation, run out of context midway, leave things worse than it found them, or mark features complete without testing them.
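The progress-file pattern can be sketched directly. This is an illustrative reconstruction of the described idea—the file name and JSON shape are my assumptions, not Anthropic's actual artifact format.

```python
import json
from pathlib import Path

def initialize(root: Path, steps: list) -> None:
    """Initializer agent: set up environment state and a progress file."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "progress.json").write_text(json.dumps({"done": [], "todo": steps}))

def session(root: Path) -> bool:
    """One bounded session: do the next step, leave a structured artifact.

    Returns True when nothing is left to do."""
    state = json.loads((root / "progress.json").read_text())
    if not state["todo"]:
        return True                # genuinely complete, not just claimed complete
    step = state["todo"].pop(0)
    state["done"].append(step)     # incremental progress, never a one-shot attempt
    (root / "progress.json").write_text(json.dumps(state))
    return False
```

Each session starts from the artifact the last one left, so no single run has to hold the whole job in its context window.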
Google DeepMind's approach is especially similar.
Critics might note that the math problem solved by Cursor—while impressive—still represents a narrow benchmark. The jump from solving an unpublished research problem to handling the full complexity of real-world professional work is enormous. Real business problems involve ambiguous requirements, stakeholder conflicts, and context that can't be fully formalized into a verifiable answer. That's where AI still struggles.
Bottom Line
The strongest part of Jones's argument is the evidence that four labs independently converged on similar architectures without coordination—this suggests something fundamental about how AI agents need to work. The biggest vulnerability: we've only seen this smoothing in coding and math, domains with clear verification. Most real work isn't like that. What should you watch for? Whether these agentic harnesses can actually scale to the messy, ambiguous reality of actual business problems—and whether that changes where we deploy AI.