The inference shift

Ben Thompson's latest analysis cuts through the current hype cycle surrounding artificial intelligence hardware to reveal a coming architectural fracture. While the market obsesses over raw speed and the dominance of a single chipmaker, Thompson argues that the next decade of AI will be defined not by faster processors, but by a fundamental shift in how we value memory and latency. This is a crucial distinction for investors and technologists alike, as it suggests the current gold rush for high-bandwidth memory may be a temporary peak before a massive pivot to capacity-driven, cost-efficient systems.

The GPU Monoculture

Thompson begins by contextualizing the current dominance of Graphics Processing Units (GPUs), noting that their rise was a fortunate accident of parallel processing needs. "Just as drawing pixels on a computer screen was a parallel process, which meant there was a direct connection between the number of processing units and graphics speed, making AI-related calculations was a parallel process," he writes. This historical parallel explains why Nvidia became the de facto standard: the hardware designed for video games happened to be perfect for training massive models.

However, Thompson points out that this architecture is a "jack of all trades, master of none" for the future. The current model relies on linking thousands of chips together to create a single massive memory pool, a strategy that is incredibly expensive and energy-intensive. He illustrates the scale of this dependency by citing a recent deal in which Anthropic secured over 220,000 GPUs from a space company's data center simply to handle inference workloads. "The GPUs Anthropic is contracting for... were originally used for training as well; the fact that GPUs are so flexible is a big advantage," Thompson notes. This flexibility is currently the industry's greatest asset, but it is also its biggest vulnerability.

Critics might argue that Nvidia's software ecosystem (CUDA) creates a moat so deep that hardware alternatives cannot penetrate it, regardless of efficiency. Thompson acknowledges this inertia but suggests that economic pressure from the sheer scale of future workloads will eventually force a change.

The Wafer-Scale Experiment

Enter Cerebras Systems, a company attempting to break the "reticle limit"—the physical size constraint that forces traditional chipmakers to stitch smaller chips together. Thompson explains that Cerebras has managed to turn an entire silicon wafer into a single chip, eliminating the slow connections between separate units. The result is a machine with blistering speed. "The WSE-3 has just over half the memory of an H100, but 6,000 times the memory bandwidth," he writes. This is a staggering statistic that redefines what is physically possible in silicon.

Yet Thompson is quick to temper the excitement. This approach, reminiscent of the ambitious but ultimately niche wafer-scale integration experiments of the 1980s, comes with severe trade-offs. Manufacturing a single massive chip with acceptable yields is notoriously difficult, which drives up costs. Furthermore, the architecture is best suited for a specific type of workload: "answer inference," where a human is waiting for a response. "As long as everything fits in on-chip memory Cerebras' speed is an incredible experience; the moment you need more memory... then Cerebras doesn't make much sense," Thompson warns. This limitation suggests that while Cerebras may win the race for immediate user satisfaction, it may not be the engine for the long-term future of autonomous systems.
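The trade-off described in this section can be made concrete with a little arithmetic. The sketch below is illustrative only: the spec figures are assumptions drawn from public vendor spec sheets rather than from the article, the helper names are hypothetical, and the fit check counts model weights alone, ignoring everything else a running model keeps in memory.

```python
# Assumed spec figures (public vendor numbers, not taken from the article):
# WSE-3: 44 GB on-chip SRAM, 21 PB/s memory bandwidth.
# H100 SXM: 80 GB HBM3, 3.35 TB/s memory bandwidth.
WSE3_MEMORY_GB = 44
WSE3_BANDWIDTH_TBPS = 21_000  # 21 PB/s expressed in TB/s
H100_MEMORY_GB = 80
H100_BANDWIDTH_TBPS = 3.35


def weights_size_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (1e9 params * bytes per param)."""
    return params_billions * bytes_per_param


def fits_on_chip(params_billions: float, capacity_gb: float,
                 bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit in the given memory capacity."""
    return weights_size_gb(params_billions, bytes_per_param) <= capacity_gb


memory_ratio = WSE3_MEMORY_GB / H100_MEMORY_GB
bandwidth_ratio = WSE3_BANDWIDTH_TBPS / H100_BANDWIDTH_TBPS

print(f"WSE-3 memory relative to H100: {memory_ratio:.2f}")
print(f"WSE-3 bandwidth relative to H100: {bandwidth_ratio:,.0f}x")

# Hypothetical model sizes, stored at 2 bytes per parameter.
for size_b in (8, 70):
    verdict = "fits" if fits_on_chip(size_b, WSE3_MEMORY_GB) else "does not fit"
    print(f"{size_b}B-parameter model {verdict} in {WSE3_MEMORY_GB} GB")
```

Under these assumed figures the ratios come out at a little over half the memory and roughly 6,000 times the bandwidth, which matches the quoted comparison. The fit check shows how quickly a larger model outgrows on-chip capacity.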

The most important aspect for answer inference is token speed; the most important aspect for agentic inference, however, is memory.

The Agentic Shift

The core of Thompson's argument lies in the distinction between "answer inference" and "agentic inference." The former is about a human asking a question and getting a fast reply. The latter is about autonomous agents performing complex tasks without human intervention. Thompson posits that the requirements for these two modes are fundamentally opposed. For an agent working overnight to solve a logistics problem or manage a database, latency is irrelevant; capacity is everything.

"If an agent is waiting around for a job that is being run overnight, the agent doesn't know or care about the user experience impact; what is most important is being able to accomplish a task," Thompson writes. This insight flips the entire current hardware investment thesis on its head. If the future is dominated by agents, then the industry's obsession with high-bandwidth memory (HBM) and extreme compute speed becomes a misallocation of resources. "If latency isn't the top priority, then slower and cheaper memory — like traditional DRAM, for example — makes a lot more sense," he argues.

This shift implies an unbundling of the GPU. The current architecture forces a trade-off where high-speed compute sits idle while waiting for data, or high-speed memory sits idle while waiting for compute. Thompson suggests that future systems will prioritize "good enough" compute paired with massive, cheap storage hierarchies. This has profound geopolitical implications. "China, meanwhile, for all of its lack of leading edge compute, has everything it needs for agentic inference: fast-enough (but not leading-edge) GPUs, fast-enough (but not leading-edge) CPUs, DRAM, hard drives, etc," he observes. The barrier to entry for running the next generation of AI agents may drop significantly lower than the barrier for training the models that power them.
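The latency argument in this section can be sketched with a back-of-the-envelope estimate. It relies on a common rule of thumb that is not from the article: when a single stream of text generation is limited by memory, each new token requires reading the model weights once, so token speed is bounded by memory bandwidth divided by weight size. Every number and function name below is hypothetical and chosen only for illustration.

```python
def max_tokens_per_second(bandwidth_gb_per_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed when memory-bound."""
    return bandwidth_gb_per_s / weights_gb


def hours_to_finish(total_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock hours needed to generate total_tokens at a fixed rate."""
    return total_tokens / tokens_per_second / 3600


# Hypothetical workload and hardware figures (assumptions, not measurements).
WEIGHTS_GB = 140          # e.g. 70B parameters at 2 bytes each
JOB_TOKENS = 50_000       # tokens an overnight agent task must generate
MEMORY_SYSTEMS = {
    "fast, expensive memory": 3000,  # GB/s
    "slow, cheap memory": 300,       # GB/s
}

for label, bandwidth in MEMORY_SYSTEMS.items():
    rate = max_tokens_per_second(bandwidth, WEIGHTS_GB)
    hours = hours_to_finish(JOB_TOKENS, rate)
    print(f"{label}: {rate:.1f} tokens/s, job done in {hours:.1f} h")
```

With these made-up figures the slower memory system is ten times slower per token, which a human waiting on an answer would notice, yet the overnight job still finishes before morning. That is the sense in which capacity and cost can matter more than speed for agentic work.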

Beyond Moore's Law

Thompson concludes by challenging the prevailing wisdom that computing speed must always increase. "Jensen Huang regularly says that 'Moore's Law is Dead'... Maybe the most profound implication of agents that act without humans in the loop, however, will be that Moore's Law doesn't matter, and that the way we get more compute is by realizing that the compute we have is already good enough," he writes. This is a radical departure from the tech industry's relentless pursuit of the next nanometer.

He even points to the potential for space-based data centers, where older, larger, and slower chips would be more radiation-resistant and power-efficient. "Slower chips actually make space data centers more viable," Thompson notes, highlighting how the constraints of the physical world might finally align with the economic realities of software. The argument here is that the "inference shift" is not just a hardware upgrade, but a philosophical pivot from speed to scale.

Bottom Line

Thompson's most compelling insight is that the market's current valuation of AI hardware is priced for a human-centric future that may be shrinking, while undervaluing the massive, latency-insensitive market of autonomous agents. The argument's greatest vulnerability is the sheer momentum of the current GPU ecosystem; even if a cheaper, slower architecture is theoretically superior for agents, displacing the entrenched software and hardware standards of the last decade will be a monumental task. Readers should watch closely for how hyperscalers begin to balance their portfolios between high-speed inference for users and high-capacity, low-cost infrastructure for agents.

Deep Dives

Explore these related deep dives:

  • Wafer-scale integration

    Cerebras's unique architecture abandons traditional chip packaging to build a single processor the size of a silicon wafer, directly challenging the industry standard of networking thousands of discrete GPUs described in the text.

  • Cache replacement policies

    The article identifies the 'KV cache' as the critical bottleneck for inference speed, and this concept explains the specific memory management technique that allows large language models to generate text token-by-token without reprocessing the entire conversation history.

  • Weightlessness

    The article contrasts Cerebras's unique approach with standard GPU clusters; this concept explains how their wafer-scale engine eliminates the memory bandwidth bottlenecks and inter-chip networking latency that Nvidia systems struggle to overcome for large AI models.

Sources

The inference shift

by Ben Thompson · Stratechery

If you were looking for the ideal time to IPO, being a chip company in May 2026 is hard to beat. Reuters reported over the weekend:

Cerebras Systems is set to raise the size and price of its initial public offering as soon as Monday, as demand for the artificial intelligence chipmaker’s shares continues to climb, two people familiar with the matter told Reuters on Sunday. The company is considering a new IPO price range of $150-$160 a share, up from $115-$125 a share, and raising the number of shares marketed to 30 million from 28 million, said the sources, who asked not to be identified because the information isn’t public yet.

The fundamental driver of the ongoing surge in semiconductor stocks is, of course, AI, particularly the realization that agents are going to need a lot of compute. What Cerebras represents, however, is something broader: while the compute story for AI has been largely about GPUs, particularly from Nvidia, the future is going to look increasingly heterogeneous.

The GPU Era

The story of how Graphics Processing Units became the center of AI is a well-trodden one, but in brief:

  • Just as drawing pixels on a computer screen was a parallel process, which meant there was a direct connection between the number of processing units and graphics speed, making AI-related calculations was a parallel process, which meant there was a direct connection between the number of processing units and calculation speed.

  • Nvidia enabled this dual-usage by making its graphics processors programmable, and created an entire software ecosystem called CUDA to make this programming accessible.

  • The big difference between graphics and AI has been the size of the problem being solved (models are a lot bigger than video game textures), which has led to a dramatic expansion in high-bandwidth memory (HBM) per GPU, and dramatic innovations in terms of chip-to-chip networking to allow multiple chips to work together as one addressable system. Nvidia has been the leader in both.

The number one use case for GPUs has been training, which stresses the third point in particular. While the calculations within each training step are massively parallel, the steps themselves are serial: every GPU has to share its results with every other GPU before the next step can begin. This is why a trillion-parameter model ...