Gergely Orosz dismantles the comforting illusion that buying a single enterprise license solves the AI coding dilemma. In a landscape where the default answer eighteen months ago was simply "buy GitHub Copilot," Orosz reveals a chaotic, fragmented reality where developer trust has replaced executive mandates as the primary driver of adoption. This is not a story about which model is fastest; it is a forensic look at how organizations of vastly different sizes are navigating a market where the tools are evolving faster than their procurement policies can handle.
The Speed of Trust vs. The Weight of Bureaucracy
Orosz draws a sharp line between the agility of small teams and the paralysis of large enterprises. For startups with fewer than sixty engineers, the selection process is almost entirely organic. The author notes that at a seed-stage logistics startup, the head of engineering adopted a "high-trust" model: "We agreed to try new tools for 2 weeks and see how everyone felt. We didn't use any hard-and-fast measurement. TLDR: I trust our devs and their opinion is a big part of this." This approach allows tools like CodeRabbit to "stick" within days if the team likes them, while inferior options are discarded just as quickly.
This fluidity stands in stark contrast to the mid-sized and large companies Orosz interviewed, where the decision-making machinery grinds to a halt under the weight of security reviews and budgetary scrutiny. At a cloud infrastructure firm with 900 employees, the principal engineer describes a painful tug-of-war: "We started with Copilot because it was easy to procure... Then switching to Cursor took forever. Pricing keeps shifting." The author highlights a critical friction point: executives are often unwilling to absorb the cost jump from a $40 monthly subscription to the $150 required for more advanced agents, even when developers are demonstrably more productive with the latter.
"The Copilot → Cursor → Claude Code migration path is well trodden, and nobody has cracked productivity measurement yet."
Critics might argue that Orosz underestimates the necessity of security gates in regulated industries, but the evidence he presents suggests these gates often become barriers to innovation rather than safeguards. The author illustrates this with a cautionary tale of an EU-based software company that declared itself "AI-first" in the summer of 2025 only to find itself gridlocked. Because leadership had not budgeted for alternatives to the default Microsoft offering, and because legal teams were paralyzed by the EU's AI Act, developers were left "stuck" with outdated models while the market moved on. As one engineer there lamented, "The pace of new models and tools in the second half of 2025 left the leadership team completely unprepared."
The Measurement Paradox
Perhaps the most damning insight in Orosz's deep dive is the universal failure to measure the actual value of AI tools. Every company, from the five-person startup to the publicly listed fintech, struggles to prove ROI. The industry's default metric—lines of code generated—is widely distrusted by engineers because it incentivizes volume over quality and ignores the most valuable use cases, such as debugging or research. Orosz writes, "Using the 'lines of code written by AI' metric creates bad incentives, a sentiment shared by AI enthusiasts and skeptics alike at the company."
The author contrasts this with the rigorous, albeit imperfect, frameworks some companies are building. Wealthsimple, for instance, ran a two-month selection process backed by usage data from Jellyfish, while WeTravel constructed a structured scoring system across five dimensions to evaluate code review tools. Yet, even with these efforts, the consensus remains that no single metric works. At a large fintech that ran a comparative test of Copilot, Claude, and Cursor across 50 pull requests, the results were nuanced: "Cursor reviews the most precise, Claude the most balanced, and Copilot the most quality-focused." This specificity is lost when companies rely on blunt, one-size-fits-all KPIs.
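A structured scoring system of the kind WeTravel built can be sketched as a simple weighted rubric. The dimension names, weights, and scores below are illustrative assumptions for this sketch only; the article says the company scored code review tools across five dimensions but does not name them.

```python
"""Hypothetical sketch of a five-dimension tool-scoring rubric.

All dimension names, weights, and scores are illustrative assumptions,
not WeTravel's actual criteria.
"""
from dataclasses import dataclass


@dataclass
class ToolScore:
    name: str
    scores: dict[str, float]  # each dimension scored 1-5 by reviewers


# Illustrative dimensions; weights sum to 1.0 so totals stay on the 1-5 scale.
WEIGHTS = {
    "review_precision": 0.30,    # does it flag real issues?
    "signal_to_noise": 0.25,     # fewer spurious comments scores higher
    "latency": 0.15,             # time from PR open to first review
    "integration_effort": 0.15,  # setup cost against the existing workflow
    "cost_per_seat": 0.15,       # cheaper scores higher
}


def weighted_score(tool: ToolScore) -> float:
    """Combine per-dimension scores into one weighted total."""
    return sum(WEIGHTS[dim] * tool.scores[dim] for dim in WEIGHTS)


candidates = [
    ToolScore("tool_a", {"review_precision": 4, "signal_to_noise": 3,
                         "latency": 5, "integration_effort": 4,
                         "cost_per_seat": 2}),
    ToolScore("tool_b", {"review_precision": 5, "signal_to_noise": 4,
                         "latency": 3, "integration_effort": 3,
                         "cost_per_seat": 3}),
]

ranked = sorted(candidates, key=weighted_score, reverse=True)
for tool in ranked:
    print(f"{tool.name}: {weighted_score(tool):.2f}")
```

The point of a rubric like this is not the arithmetic but the forced conversation about weights: a team that values signal-to-noise over raw precision will rank the same candidates differently, which is exactly the "no single metric works" consensus the article describes.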
"There's no single vendor that's universally rated by every team in all contexts."
This finding challenges the vendor marketing narrative that a single "best" tool exists for everyone. Orosz points out that a tool beloved by one team can be loathed by another, depending on their specific workflow and codebase. The author notes that at a Series D observability company, non-engineers like product managers were using Claude Code more than median engineers to handle customer bug reports, a use case that would never show up in a standard developer productivity dashboard. This highlights a blind spot in how organizations currently evaluate their tech stacks.
The Vendor Lock-in Trap
A recurring theme in Orosz's reporting is the fear of vendor lock-in, particularly among larger firms. A staff engineer at a public travel company with 1,500 employees explicitly stated, "Our main concern is avoiding vendor lock-in with a single solution." This caution is understandable given the rapid pace of change; what is standard today may be obsolete in six months. However, the author suggests that this fear often leads to inaction. The EU-based company mentioned earlier found itself unable to approve new tools for six months, leading to a "vicious cycle where the tool feels underwhelming, which suppresses adoption and makes it harder to justify further investment."
Orosz also touches on the human element of this transition. At one Series A startup, a staff engineer noted that code reviews had become a headache: code quantity increased while quality dipped, until the adoption of newer models like Opus 4.5. The solution wasn't just a new tool but a cultural shift, supported by "Agents.md" and "Claude.md" files that served as a single source of truth for coding style. This suggests that the technology is only half the battle; the other half is the documentation and shared context that allows teams to leverage these tools effectively.
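An Agents.md file of the kind described is typically just a markdown document checked into the repository root, which coding agents read for project conventions. The contents below are an illustrative sketch; the article does not show the startup's actual file, and every rule, command, and directory name here is a hypothetical example.

```markdown
# AGENTS.md — shared context for coding agents
<!-- Illustrative sketch only; rules and paths are hypothetical. -->

## Code style
- TypeScript strict mode; no `any` without a justifying comment.
- Prefer small, pure functions; colocate tests next to source files.

## Review expectations
- Every PR needs a test covering the changed behavior.
- Run `npm run lint && npm test` locally before opening a PR.

## Project layout
- `src/api/` — HTTP handlers
- `src/core/` — business logic (no I/O here)
```

Because both humans and agents read the same file, it doubles as the "single source of truth" the article credits with restoring review quality: style disputes get settled once, in writing, rather than relitigated in every AI-generated pull request.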
Bottom Line
Orosz's most compelling argument is that the era of the "silver bullet" AI tool is over; the future belongs to organizations that can build flexible, trust-based evaluation processes rather than relying on executive mandates or vanity metrics. The piece's greatest vulnerability is its reliance on anonymous sources for the majority of its data, which, while necessary for candidness, limits the ability to verify the specific outcomes of these tooling shifts. Readers should watch for how the industry develops standardized, non-invasive metrics that can actually capture the value of AI without distorting developer behavior.