
ClusterMAX™ 2.0: The Industry Standard GPU Cloud Rating System

Dylan Patel has done more than update a ranking list; he has effectively declared that the "wild west" era of GPU cloud renting is over, replaced by a rigorous, technical standard where reliability commands a price premium. While the industry chases the newest chips, this piece argues that the real bottleneck for AI development is no longer hardware scarcity, but the chaotic quality of the software and infrastructure wrapping it.

The New Currency of Trust

Patel's central thesis is that the market is maturing fast enough to demand a new kind of due diligence. He notes that since the first version of his rating system dropped, the top-rated providers have collectively booked nearly $400 billion in Remaining Performance Obligations. This isn't just a vanity metric; it proves that enterprise buyers are willing to pay for certainty. "ClusterMAX 2.0 debuts with a comprehensive review of 84 providers, up from 26 in ClusterMAX 1.0," Patel writes, highlighting a market that has exploded in size but remains fragmented in quality.

The author's methodology is aggressive. He didn't just ask providers what they could do; he tested what they were doing. The result is a stark revelation about the state of the industry: many vendors rushed to install basic tools like Slurm (a workload manager that has been the backbone of high-performance computing since the early 2000s) only after being told they needed it. "We had cloud providers that had never installed slurm before try to install it for the first time about a week before handing over a cluster to us," Patel observes. This anecdote is damning. It suggests that for many players, the "cloud" is a marketing veneer over a fragile, untested stack.
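
To make concrete what a provider's "first-time" Slurm install has to get right, here is a minimal sketch of a multi-node GPU batch job of the kind a customer would submit on day one. The partition name, GPU counts, and time limit are illustrative assumptions, not details from the article; a healthy cluster must schedule this across nodes and expose every GPU to each task.

```shell
#!/bin/bash
# Minimal multi-node Slurm GPU job (illustrative sketch; partition name
# and resource counts are assumptions for a typical 8-GPU server).
#SBATCH --job-name=handover-check
#SBATCH --nodes=2                # span two GPU servers
#SBATCH --ntasks-per-node=8      # one task per GPU
#SBATCH --gpus-per-node=8        # request all GPUs on each node
#SBATCH --time=00:10:00          # short wall-clock limit for a smoke test
#SBATCH --partition=gpu          # assumed partition name

# srun launches the command on every allocated task; if the scheduler,
# cgroups, or GPU bindings are misconfigured, this trivial job fails.
srun nvidia-smi -L
```

A provider that set Slurm up a week before handover tends to stumble on exactly the plumbing this exercises: inter-node scheduling, GPU visibility per task, and accounting.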

"Gold and Platinum Neoclouds rise to the top by introducing new features and functionality that others do not have, and customers appreciate."

This distinction is crucial. The commentary suggests that the "commodity" trap is real. If a provider only offers raw compute without robust orchestration, they are competing on price alone—a race to the bottom that hurts the end user. Patel argues that the top-tier providers can charge more because their Total Cost of Ownership (TCO) is actually lower, thanks to fewer failures and better uptime.

The AMD Gap and the Hardware Illusion

One of the most provocative claims in the piece is the disparity between how providers handle NVIDIA versus AMD hardware. It is a common assumption that a data center is a data center, regardless of the chip inside. Patel dismantles this. "For providers that have deployed both AMD and NVIDIA GPUs, the quality of their AMD cloud offering is much worse than their NVIDIA cloud offering," he states.

He details how AMD offerings often lack critical features like detailed monitoring, automatic health checks, and working Slurm support. This is a significant finding for the industry. It implies that the ecosystem around NVIDIA's CUDA software has created a moat that goes beyond the chip itself; it extends to the entire cloud management layer. Providers are treating AMD as an afterthought, failing to integrate it with the same rigor. Critics might argue that the AMD ecosystem is simply younger and needs time to mature, but Patel's evidence suggests the issue is provider negligence, not just software immaturity. The gap is not in the silicon; it is in the service.
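
The monitoring gap is easy to illustrate. Below is a minimal health-check sketch of the kind mature NVIDIA offerings automate and, per Patel, AMD deployments often lack: scan every GPU for uncorrected ECC errors and fail the node if any are found. The check logic is vendor-neutral; only the query tool differs. The specific query and PASS/FAIL convention are illustrative assumptions, not the article's criteria.

```shell
# Sketch of an automated GPU health check: flag any GPU reporting
# uncorrected ECC errors. Parsing is separated from the vendor tool so
# the same check can back either an NVIDIA or an AMD fleet.
check_ecc() {
    # Reads one uncorrected-ECC count per GPU on stdin; prints PASS or FAIL.
    bad=0
    while read -r count; do
        case "$count" in
            ''|0) ;;          # empty line or zero errors: healthy
            *)    bad=1 ;;    # any nonzero count marks the node unhealthy
        esac
    done
    if [ "$bad" -eq 1 ]; then echo FAIL; else echo PASS; fi
}

# On an NVIDIA node the counts would come from, e.g.:
#   nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total --format=csv,noheader
# On an AMD node, from the equivalent rocm-smi/amd-smi ECC query.
printf '0\n0\n0\n0\n' | check_ecc   # healthy 4-GPU node → PASS
printf '0\n3\n0\n0\n' | check_ecc   # one GPU with ECC errors → FAIL
```

Trivial as it looks, wiring checks like this into node lifecycle (cordon on FAIL, re-test after remediation) is precisely the operational rigor that separates the tiers.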

The piece also touches on the transition to the Blackwell architecture and the specific challenges of the GB200 NVL72 systems. As these massive, liquid-cooled clusters roll out, the margin for error shrinks. Patel warns that "reliability and SLAs" are now the primary battlegrounds. A cluster that fails during a multi-week training run isn't just an inconvenience; it's a financial catastrophe.

The "Russian Roulette" of Cluster Handover

Perhaps the most visceral part of the commentary is the description of the user experience at lower-tier providers. Patel describes the process of "giving back" a cluster or playing "Russian Roulette" with marketplaces that offer single machines. "Issues with reliability of the GPU drivers, GPU server hardware, backend interconnect network, shared storage mounts, internet connection, and more can cause users to lose faith in a provider, and churn out," he writes.

This framing shifts the conversation from technical specs to human frustration. It highlights that the "cloud" is only as good as its most fragile link. The author's decision to penalize providers who delayed cluster handovers or rushed security patches reinforces a "trust but verify" philosophy. In an industry where a single misconfigured setting can lead to a container escape or a security breach, this rigor is not optional.

"We encourage providers to use these lists when developing their offerings. We consider the lists as an amalgamation of our interviews with end users, and continue to pursue the quality when developing their offerings."

This is a rare moment of constructive criticism in a sector often dominated by hype. Patel isn't just ranking; he is providing a blueprint for what a mature AI infrastructure should look like. The inclusion of criteria like "Container Escapes" and "Pentesting" signals that security is no longer a backend concern but a front-line requirement for AI labs.

Bottom Line

Patel's argument is a necessary corrective to an industry obsessed with hardware specs while ignoring the software glue that makes them useful. The strongest part of the piece is the empirical evidence that reliability commands a premium, proving that the market is ready to punish mediocrity. Its biggest vulnerability is the sheer pace of change; as new chips and architectures emerge, today's "Platinum" standards could become tomorrow's baseline, requiring the rating system to evolve faster than the providers can build. For busy decision-makers, the takeaway is clear: stop buying raw GPU hours and start buying the ecosystem that keeps them running.

Deep Dives

Explore these related deep dives:

  • Slurm Workload Manager

    The article repeatedly references Slurm as a critical orchestration layer for GPU clusters, discussing 'Slurm-on-Kubernetes' as a key trend and Slurm support quality as a differentiator between providers. Understanding how this job scheduler works provides essential context for why it matters in AI infrastructure.

  • InfiniBand

    The article mentions InfiniBand as a key interconnect technology for GPU clusters and lists it as a required qualification for hires. This high-bandwidth networking technology is fundamental to understanding why certain cloud providers achieve better performance for distributed AI training.

Sources

ClusterMAX™ 2.0: The Industry Standard GPU Cloud Rating System

by Dylan Patel · SemiAnalysis

Introduction.

GPU clouds (also known as “Neoclouds” since October of last year) are at the center of the AI boom. Neoclouds represent some of the most important transactions in AI, the critical juncture where end users rent GPUs to train models, process data, and build inference endpoints.

Our previous research has set the standard for understanding Neoclouds:

Since ClusterMAX 1.0 was released 6 months ago, we have seen significant changes in the industry. H200, B200, MI325X, and MI355X GPUs have arrived at scale. GB200 NVL72 has rolled out to hyperscale customers and GB300 NVL72 systems are being brought up. TPU and Trainium are in the arena. And many buyers are turning to the ClusterMAX rating system as the trusted, independent third party with a comprehensive, technical guide to understanding the market.

An update is needed!

Executive Summary.

YouTube summary video available here!

ClusterMAX 2.0 debuts with a comprehensive review of 84 providers, up from 26 in ClusterMAX 1.0. We increase our market view to cover 209 total providers, up from 169 in our previous article and 124 in the original AI Neocloud Playbook and Anatomy. We have interviewed over 140 end users of Neoclouds as part of this research.

We release an itemized list of all criteria we consider during testing, covering 10 primary categories (security, lifecycle, orchestration, storage, networking, reliability, monitoring, pricing, partnerships, availability).

We release five descriptions of our expectations, covering SLURM, Kubernetes, standalone machines, monitoring, and health checks. We encourage providers to use these lists when developing their offerings. We consider the lists an amalgamation of our interviews with end users, and encourage providers to continue pursuing that quality as they develop their offerings.

CoreWeave retains top spot as the only member of the Platinum tier. CoreWeave sets the bar for others to follow, and is the only cloud to consistently command premium pricing in our interviews with end users.

Nebius, Oracle and Azure are the top providers within the Gold tier. Crusoe and new entrant Fluidstack also achieve Gold tier.

Google rises to the top of the Silver tier, alongside AWS, together.ai and Lambda. Many more clouds from all around the world debut at the Bronze or Silver tier, for a total of 37 clouds achieving a medallion rating.

We provide analysis of key trends: Slurm-on-Kubernetes, Virtual Machines or Bare-Metal, Kubernetes for Training, Transition to Blackwell, GB200 NVL72 Reliability and SLA’s, Crypto Miners Here To Stay, Custom Storage ...