An interview with the Gimlet Labs team about heterogeneous inference for AI agents

This piece cuts through the noise of the current AI infrastructure boom by challenging a fundamental assumption: that a single, dominant chip vendor can efficiently power the next generation of artificial intelligence. Chipstrat reports that the era of the "one-size-fits-all GPU" is ending, replaced by a complex, multi-vendor reality where the real innovation lies not in the silicon itself, but in the software that orchestrates it. For investors and technical leaders watching the capital expenditure arms race, the article offers a crucial pivot point: the companies winning the next decade won't be those with the most expensive hardware, but those with the smartest software to manage a heterogeneous mix of it.

The Economics of Hardware Entanglement

The article's most striking revelation concerns the financial straitjackets binding many new "neoclouds." Chipstrat notes that "most neoclouds are backed by one silicon vendor and gave significant equity in return," creating a structural inability to diversify. This is a critical insight for anyone analyzing the competitive landscape. When hardware amortization accounts for roughly "70% of their annual costs," the margin for optimization vanishes. The piece argues that this equity entanglement means these competitors "can't diversify their silicon, which is why the only software innovation they can ship is disaggregation on top of a single vendor's stack — never across vendors."
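The weight of the 70% figure is easier to see with a little arithmetic. The sketch below is illustrative only: the 70% hardware share comes from the article, while the function name and the candidate savings percentages are hypothetical values chosen for the example.

```python
def total_cost_reduction(hardware_share: float, hardware_saving: float) -> float:
    """Fractional drop in total annual cost when only the hardware line gets cheaper."""
    if not (0.0 <= hardware_share <= 1.0 and 0.0 <= hardware_saving <= 1.0):
        raise ValueError("shares and savings must be fractions between 0 and 1")
    return hardware_share * hardware_saving


HARDWARE_SHARE = 0.70  # "~70% of their annual costs", per the article

# Ceiling on what can be saved WITHOUT touching hardware: everything else combined.
non_hardware_ceiling = 1.0 - HARDWARE_SHARE
print(f"Max saving from non-hardware costs: {non_hardware_ceiling:.0%}")

# Hypothetical hardware savings from a cheaper silicon mix (assumed values).
for saving in (0.10, 0.20, 0.30):
    effect = total_cost_reduction(HARDWARE_SHARE, saving)
    print(f"{saving:.0%} cheaper hardware -> {effect:.0%} lower total cost")
```

Under these assumptions, every point shaved off the hardware bill moves total cost by 0.7 points, which is why a provider that cannot change its silicon mix has so little room left to optimize.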

This dynamic mirrors the historical constraints seen in the early days of cloud computing, where proprietary lock-in often stifled broader ecosystem growth. Just as the industry eventually moved away from monolithic mainframes to distributed systems, the current AI infrastructure is hitting a wall where a single supply chain cannot meet the diverse needs of agentic workloads. The editors highlight that Gimlet Labs, founded in 2023, is attempting to break this cycle with a "two-track business" model that deploys software inside customer data centers while operating its own mixed-silicon cloud. This approach allows them to "optimize the bottom line" through supply-chain diversity while commanding a "price premium on the top line" via differentiated token performance.

"Supply-chain diversity optimizes the bottom line, differentiated token performance commands a price premium on the top line, and one track funds the CapEx of the other."

Critics might note that managing a multi-vendor stack introduces significant operational complexity and potential points of failure that single-vendor solutions elegantly avoid. However, the article suggests that the cost of inefficiency in a homogeneous stack is becoming untenable as workloads grow more complex.

From Monolithic Chips to Disaggregated Workloads

The core technical argument rests on the idea that different parts of an AI agent's workflow require fundamentally different hardware. Natalie, a co-founder quoted in the piece, explains that "agentic inference is not a uniform workload. Different parts of it have different compute needs and different bottlenecks." The article details how Gimlet traces a PyTorch workload as a graph, splits it at optimal points, and then lowers each segment to the target vendor's framework, such as TensorRT for NVIDIA chips or equivalent frameworks for others. They explicitly avoid trying to build a "universal programming language across chips," instead choosing to leverage the native frameworks of each hardware partner.
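The trace, split, and lower flow described above can be sketched in miniature. This is not Gimlet's implementation: the `Op` type, the per-device latencies, the transfer cost, and the backend labels are all hypothetical, and a real workload graph is not a simple linear chain. The sketch only shows how a split point might be chosen by comparing per-device costs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Op:
    name: str
    cost_ms: Dict[str, float]  # hypothetical latency of this op on each device


def best_split(ops: List[Op], first: str, second: str,
               transfer_ms: float) -> Tuple[int, float]:
    """Return (i, total_ms) where ops[:i] run on `first` and ops[i:] on `second`."""
    best = None
    for i in range(len(ops) + 1):
        total = sum(op.cost_ms[first] for op in ops[:i])
        total += sum(op.cost_ms[second] for op in ops[i:])
        if 0 < i < len(ops):
            total += transfer_ms  # pay a handoff only if both devices are used
        if best is None or total < best[1]:
            best = (i, total)
    return best


# Assumed numbers for illustration only.
OPS = [
    Op("embed",     {"gpu": 1.0, "accel": 3.0}),
    Op("attention", {"gpu": 2.0, "accel": 6.0}),
    Op("mlp",       {"gpu": 5.0, "accel": 1.5}),
    Op("sample",    {"gpu": 2.0, "accel": 0.5}),
]
# Placeholder labels for the vendor-native framework each segment would be lowered to.
BACKENDS = {"gpu": "TensorRT", "accel": "vendor-native framework"}

split, total = best_split(OPS, "gpu", "accel", transfer_ms=1.0)
for device, segment in (("gpu", OPS[:split]), ("accel", OPS[split:])):
    names = [op.name for op in segment]
    print(f"{device}: {names} -> lower to {BACKENDS[device]}")
print(f"estimated total: {total} ms")
```

With these made-up costs, the mixed placement beats running everything on either device alone, even after paying for the handoff between them.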

This strategy represents a shift from the "one-size-fits-all" mentality that has dominated since the early days of CUDA. Much as the slowing of Moore's Law forced architects to look at specialized accelerators rather than just raw clock speed, the industry is now realizing that a single chip cannot be optimal for every stage of inference. The piece reports a compelling case study: on a large model with 120 billion parameters, running a speculative decoder on a specialized d-Matrix card while using NVIDIA B200s for the verifier delivered "roughly a 4× shift in the throughput-vs-interactivity Pareto frontier compared to GPU-only speculative decode."
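For readers unfamiliar with speculative decoding, here is a minimal greedy sketch of the draft-and-verify loop. It does not model any hardware: `draft` and `verifier` are toy stand-in functions invented for this example, and a production verifier would check all proposed tokens in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     verifier: NextToken, k: int) -> List[int]:
    """One round: the draft proposes k tokens; keep those the verifier agrees with."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    accepted, ctx = [], list(prefix)
    for token in proposed:
        target = verifier(ctx)
        if target != token:
            accepted.append(target)  # verifier's correction ends the round
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(verifier(ctx))  # all k accepted: verifier adds a bonus token
    return accepted


def generate(prefix: List[int], draft: NextToken, verifier: NextToken,
             k: int, n_new: int) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, verifier, k))
    return out[: len(prefix) + n_new]


def greedy(prefix: List[int], model: NextToken, n_new: int) -> List[int]:
    out = list(prefix)
    for _ in range(n_new):
        out.append(model(out))
    return out


# Toy models (assumptions): the verifier counts upward mod 10; the draft
# makes the same prediction except that it is wrong whenever the last token is 4.
def verifier(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([1], draft, verifier, k=4))
print(generate([1], draft, verifier, k=4, n_new=12) == greedy([1], verifier, 12))
```

The key property is that the output matches what the verifier alone would have produced, while the cheap draft model does most of the token-by-token work. That is what makes it plausible to place the two roles on different kinds of silicon, as in the case study the piece reports.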

This level of optimization is not just about cost; it is about latency. The article emphasizes that "AI-native customers aren't just price-sensitive — they have product latency budgets (e.g. one-second response windows, voice agents) where faster tokens unlock entirely new user experiences, not just cheaper ones." This distinction is vital. It moves the conversation from "cheaper compute" to "better user experience," a shift that could redefine market winners.

The Sovereign Cloud and the Talent Gap

A particularly nuanced section of the interview addresses the geopolitical dimension of AI infrastructure. Chipstrat identifies "sovereign clouds" in Europe, the Middle East, India, Asia, and Korea as a prime customer segment. These regions often have government funding and emerging local silicon vendors but lack the deep software talent required to write optimized kernels across different chips. The piece captures Gimlet's pitch perfectly: "make an API call, not a porting project."

This observation highlights a growing talent gap in the industry. As the hardware landscape fragments, the ability to write efficient code for specific architectures becomes a scarce resource. The article notes that "hyperscalers and frontier labs already run multi-vendor silicon... but the orchestration layer is getting more complex faster than internal teams can keep up." Consequently, these large entities are increasingly outsourcing orchestration to specialists like Gimlet, allowing them to focus their engineering attention on "next-gen training and product differentiation."

"We think that all of these options are really great for different purposes. And that's important because agentic inference is not a uniform workload."

Bottom Line

The strongest part of this argument is its clear-eyed assessment of the financial and technical limitations of single-vendor lock-in, a reality that many neoclouds are currently ignoring. The piece effectively demonstrates that the future of AI infrastructure is not about finding the single best chip, but about building the software layer that can seamlessly weave together the best chips for specific tasks. The biggest vulnerability remains the execution risk of managing such a complex, heterogeneous stack at scale; while the theory is sound, the practical challenges of debugging and maintaining a multi-vendor environment are non-trivial. Readers should watch closely to see if Gimlet's two-track model can indeed scale without the operational friction that has historically plagued similar attempts at hardware abstraction.

Deep Dives

Explore these related deep dives:

  • Designing Data-Intensive Applications by Martin Kleppmann

  • CUDA

    Understanding NVIDIA's proprietary programming model explains why Gimlet's ability to lower PyTorch graphs to non-NVIDIA frameworks is a technical breakthrough that breaks the industry's single-vendor lock-in.

  • Moore's law

    The article's argument that hardware amortization is the primary cost driver relies on the slowing of this historical trend, which forces the industry to seek efficiency gains through heterogeneous silicon rather than raw transistor scaling.

  • SAP Cloud Infrastructure

    This specific infrastructure model explains the geopolitical pressure on regions like Europe and India to adopt Gimlet's multi-vendor approach, as they lack the domestic software talent to optimize kernels for their emerging local silicon vendors.

Sources

An interview with the Gimlet Labs team about heterogeneous inference for AI agents

by Various · Chipstrat

I’ve been writing for a while about the shift from a one-size-fits-all GPU to multi-vendor, multi-silicon environments, so I wanted to talk to Gimlet directly about how cross-vendor orchestration actually works — and why most neoclouds, locked into a single-silicon vendor by equity terms, can’t compete with this model by design. See previous articles for more: multi-silicon era is here, right systems for agentic workloads, and right-sized AI infra.

Natalie is a co-founder of Gimlet, alongside CEO Zain Asgar (a Stanford CS professor). Beltir spent years at Intel before joining Gimlet five months ago, after Gimlet had been one of her portfolio companies. The company was founded in 2023, has raised $92M (Series A this March), reports more than $10M in annualized revenue, and runs a two-track business — deploying its orchestration software inside customers’ data centers, and operating its own neocloud with mixed silicon.

In this interview, we walk through how Gimlet thinks about both the architecture and the business. Important insights:

Most neoclouds are backed by one silicon vendor and gave significant equity in return. Hardware amortization is ~70% of their annual costs, leaving very little room to optimize bottom line. That equity entanglement means they can’t diversify their silicon, which is why the only software innovation they can ship is disaggregation on top of a single vendor’s stack — never across vendors

Gimlet’s two-track business model is the answer to that constraint: deploy software inside customer data centers (frontier labs, hyperscalers, sovereigns) and operate their own neocloud with mixed silicon for AI-native customers. Supply-chain diversity optimizes the bottom line, differentiated token performance commands a price premium on the top line, and one track funds the CapEx of the other

Hyperscalers and frontier labs already run multi-vendor silicon (NVIDIA, AMD, in-house ASICs), but the orchestration layer is getting more complex faster than internal teams can keep up. They’d rather spend engineering attention on next-gen training and product differentiation, so some outsource orchestration to Gimlet — and some go further, having Gimlet take on the CapEx and data-center burden so they can experiment with hardware combinations without staffing a forever-team

AI-native customers aren’t just price-sensitive — they have product latency budgets (e.g. one-second response windows, voice agents) where faster tokens unlock entirely new user experiences, not just cheaper ones

Sovereign clouds are a prime customer segment — Europe, the Middle East, India, Asia, and Korea have government funding and ...