Cerebras — faster tokens please

Dylan Patel delivers a startling pivot in the AI hardware narrative: the industry's single-minded pursuit of raw model intelligence is giving way to a market that pays a premium for speed. This piece is notable not for predicting the next chip architecture, but for documenting a shift in user behavior that validates a technology many dismissed as a niche curiosity. The evidence is stark: developers are willingly sacrificing frontier capabilities for faster token generation, a move that has pushed Cerebras' wafer-scale approach from the margins to the center of the sector.

The Speed Inflection

Patel argues that the era of prioritizing "smarter" tokens over "faster" ones is effectively over. He writes, "Past a certain threshold of intelligence, developers prefer faster tokens to smarter tokens." This observation reframes the entire value proposition of AI infrastructure. For years, the industry chased parameter counts, but Patel points out that in a workflow-driven environment, latency is the true bottleneck to productivity. The author notes that "the speed at which tokens are generated can be the bottleneck to 'flow state', i.e. how much productive work is completed."
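To make the latency argument concrete, the wait for a streamed response is simply its length divided by the generation speed. The sketch below is a back-of-the-envelope illustration, not something taken from the article: the function name `wait_seconds`, the response length, and the three speeds are all hypothetical values chosen for illustration.

```python
def wait_seconds(response_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a full response at a constant generation speed."""
    if response_tokens < 0:
        raise ValueError("response_tokens must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return response_tokens / tokens_per_second


# Hypothetical response length and speeds, chosen only for illustration.
RESPONSE_TOKENS = 2_000
for speed in (50, 200, 2_000):
    print(f"{speed:>5} tok/s -> {wait_seconds(RESPONSE_TOKENS, speed):.1f} s")
```

Under these assumed numbers, the same response takes tens of seconds at the slow end and about a second at the fast end, which is the kind of gap the "flow state" argument depends on.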

The data backing this claim is compelling. Patel reveals that in a recent analysis, "80% of our AI spend... was on Opus 4.6 fast," despite the fact that this tier costs significantly more and, in some cases, offers diminishing returns on speed compared to standard modes. This is a critical insight: the market is revealing its true preferences through its wallet, not just through benchmark scores. As Patel puts it, "This is the first time we've ever decided to forgo frontier intelligence in exchange for faster tokens (and at a significant price premium too!)."

Critics might argue that this focus on speed is a temporary reaction to early adoption friction rather than a permanent shift in AI utility. However, the willingness of major labs to tier their offerings, creating "fast," "priority," and "batch" modes, suggests that the allocation of compute resources is now driven by interactivity needs. The market has spoken, and it wants immediacy.

The Wafer-Scale Gamble

To understand why Cerebras is suddenly the darling of the market, one must look at its unique architecture. Patel describes the Wafer Scale Engine (WSE) as a "bold bet" that defies the traditional constraints of silicon manufacturing. Instead of splitting a wafer into many small chips, Cerebras treats the entire wafer as a single processor. "The goal is to make the entire wafer a chip," Patel writes, noting that this approach addresses the slowdown of Moore's Law by bypassing the reticle limit of 858mm².
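The 858mm² figure can be checked with simple arithmetic. The sketch below assumes the standard 26 mm by 33 mm lithography exposure field and a 300 mm wafer, and it uses a wafer-scale die area of 46,225 mm², which is the figure Cerebras has publicly quoted for its engines rather than a number from this article. All constant names are illustrative.

```python
import math

# Standard maximum exposure field for a single lithography shot (assumption).
RETICLE_WIDTH_MM = 26.0
RETICLE_HEIGHT_MM = 33.0

# 300 mm wafer; WSE area is the publicly quoted figure, not from the article.
WAFER_DIAMETER_MM = 300.0
WSE_AREA_MM2 = 46_225.0

reticle_area = RETICLE_WIDTH_MM * RETICLE_HEIGHT_MM   # largest conventional die
wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2   # full circular wafer
reticle_equivalents = WSE_AREA_MM2 / reticle_area     # how many max-size dies' worth

print(f"reticle limit: {reticle_area:.0f} mm^2")
print(f"wafer area:    {wafer_area:.0f} mm^2")
print(f"WSE is about {reticle_equivalents:.0f}x the reticle limit")
```

Under these assumptions, a single wafer-scale engine covers the silicon area of several dozen reticle-limited dies, which is the sense in which the design sidesteps the limit rather than pushing against it.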

This design choice has profound implications for memory. Unlike standard GPUs that rely on High Bandwidth Memory (HBM) stacks sitting beside the compute die, the WSE integrates massive amounts of Static Random Access Memory (SRAM) directly onto the silicon. Patel highlights the scale: "Each wafer or chip has a large pool of very fast SRAM. 50% of silicon area is dedicated to SRAM cells with the remaining 50% consisting of compute cores." Keeping weights in on-wafer SRAM avoids the latency and power costs of moving data off-chip, a problem that plagues traditional architectures. The result is a system capable of delivering "thousands of tokens per second," a figure that is "literally off the chart" compared to conventional accelerators.
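One rough way to see why memory bandwidth governs single-stream speed: during decode, each generated token requires reading roughly all active weights once, so tokens per second for one stream is bounded by bandwidth divided by bytes read per token. The sketch below is a simplified bound, not a calculation from the article. It ignores KV-cache traffic, batching, and the need to spread a large model across several wafers. The function name is hypothetical, and the bandwidth and model-size values are placeholders, not vendor specifications.

```python
def decode_tokens_per_second_bound(
    bandwidth_bytes_per_s: float,
    active_params: float,
    bytes_per_param: float,
) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    if min(bandwidth_bytes_per_s, active_params, bytes_per_param) <= 0:
        raise ValueError("all inputs must be positive")
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Placeholder inputs (assumptions): a 70e9-parameter model at 1 byte per
# parameter, read from a slower and a 1000x faster memory system.
slow_memory = decode_tokens_per_second_bound(4e12, 70e9, 1.0)
fast_memory = decode_tokens_per_second_bound(4e15, 70e9, 1.0)

print(f"slower memory: ~{slow_memory:.0f} tok/s per stream")
print(f"faster memory: ~{fast_memory:.0f} tok/s per stream")
```

The bound scales linearly with bandwidth, so under this simplified model any gain in memory bandwidth translates directly into per-stream token speed.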

However, this architecture is not without significant trade-offs. Patel is candid about the limitations, noting that the WSE has "almost zero network" bandwidth relative to its peers. "The lack of network bandwidth... is certainly a handicap in the WSE-3 design preventing Cerebras from launching their business to the stratosphere," he admits. This is a crucial counterpoint: while Cerebras wins on single-node speed, its ability to scale across massive clusters is structurally constrained compared to GPU-based systems that rely heavily on high-speed interconnects. The company is effectively betting that the demand for single-node speed is so high that it can overcome this scaling hurdle.

The Economics of the OpenAI Deal

The article's most explosive claim centers on a massive compute agreement between Cerebras and OpenAI. Patel writes that the deal involves "tens of billions of dollars for Cerebras compute," with a requirement to deliver 750 megawatts of compute capacity by 2028. This is not just a contract; it is a validation of the wafer-scale model at a scale previously thought impossible. "Demand is so strong it's making everyone look good," Patel observes, suggesting that the sheer volume of orders has obscured the technical weaknesses that were once a concern.

The author connects this deal to the broader trend of "neoclouds," where specialized hardware providers are securing their own power and infrastructure to serve specific clients. "We will see if this remains true given the slower speeds, delayed 4.7 support, and upcoming Mythos release," Patel warns, acknowledging that the current hype may face headwinds if the technology cannot keep pace with software updates. Yet, the financial trajectory is clear. "We expect Cerebras revenue to inflect sharply in the coming years, with OpenAI as the primary growth driver."

Patel also touches on the future of the technology, mentioning plans to "hybrid bond [a] wafer scale optical transceiver onto their WSE compute engine." He notes that while this is "not needed for LLM inference," it is needed for high-performance computing workloads that NVIDIA has arguably abandoned by reducing FP64 capabilities. This suggests Cerebras is positioning itself not just as an AI accelerator, but also as an option for scientific computing.

Bottom Line

Patel's analysis challenges the assumption that total throughput is the only metric that matters in AI, arguing instead that interactivity is the new currency. The strongest part of the argument is the empirical evidence of market behavior (developers paying premiums for speed) rather than theoretical benchmarks. The biggest vulnerability remains the architectural trade-off: the WSE's limited networking could become a critical bottleneck if the industry shifts back toward massive, distributed training or inference tasks that require tight cluster coordination. Readers should watch whether Cerebras can scale its wafer production to meet the OpenAI demand without compromising yield, a challenge that has historically plagued large-scale silicon designs.

Deep Dives

Explore these related deep dives:

  • Wafer-scale integration

    This manufacturing technique explains why traditional chip packaging does not apply to Cerebras, and why their 'single chip' approach cannot simply discard defective dies and must instead handle defects on the wafer itself.

  • High Bandwidth Memory

    Understanding the specific bandwidth bottlenecks of HBM stacks clarifies why Cerebras' on-wafer SRAM architecture offers a distinct advantage for fast inference over the memory-bound designs of competitors like NVIDIA.

  • Hybrid bonding

    The article's mention of future optical transceivers relies on this advanced packaging method, which allows for the ultra-dense interconnects necessary to merge compute and communication layers without traditional wire bonds.

Sources

Cerebras — faster tokens please

by Dylan Patel · SemiAnalysis

It’s been nearly 5 years since Dylan wrote a dedicated article about Cerebras in June of 2021 for the newsletter. He shipped 4 articles in 2 days! They could be read in [...] How times have changed.

One of the other things that has changed is Cerebras’s fortunes. With the arrival of fast tokens on the mainstage and a 750MW compute deal with OpenAI notched, Cerebras is feeling ready for the scrutiny of public markets. Up until just 6 months ago, we felt that the Wafer Scale Engine, despite its bold innovations, had some technical weaknesses that were too hard to cover up, hence the continued popularity of HBM-based accelerators such as GPUs and TPUs. The strengths of Cerebras (namely, speed) have been overlooked for years in favor of total throughput. But now, with frontier labs releasing fast, priority, standard and batch tiers of the same model weights, the world has revealed its preference for fast tokens with its wallet. This brings Cerebras’s strengths to the fore and is the key reason why OpenAI is willing to fork over tens of billions of dollars for Cerebras compute.

Demand is so strong it’s making everyone look good.

Today, on the verge of Cerebras’s IPO, and because we love the wafer, we are shipping an article that is as long as 4 normal articles. Inside, we will dive deep on:

Fast inference

WSE-3, Cerebras’ unique wafer-scale chip

CS-3, Cerebras’ system, with its unique architecture

A BOM cost analysis

When and how the wafer wins for fast inference

Some of the wafer’s limitations and tradeoffs

For paid subscribers we also show the economics of the huge OAI Inference deal that has changed the company’s fortunes and share our insights on how far along Cerebras is in becoming a neocloud (i.e. securing the 750MW they need by 2028 for OpenAI). Furthermore, we will talk about Cerebras’ future plans of hybrid bonding a wafer-scale optical transceiver onto their WSE compute engine, which they claim they are pursuing strictly for the love of the game, as it is not needed for LLM inference but is needed for HPC boomer workloads. These are the HPC customers whom NVIDIA has effectively abandoned after reducing FP64 native hardware on their GPUs to basically nothing.

The Need for Speed.

Fast inference has arrived.

While SemiAnalysis has historically been an SRAM machine hater, all this changed when Nvidia licensiquihired Groq in December ...