In a field often clouded by marketing hype and static point-in-time tests, Dylan Patel delivers something rare: a dynamic, continuously refreshed picture of how inference hardware actually performs. The battle for AI dominance, he argues, is no longer just about raw silicon, but about the intricate software ecosystems that make that silicon usable. While competitors tout peak theoretical numbers, Patel's InferenceX v2 benchmark cuts through the noise with a continuous, open-source approach, and it reveals a stark reality: Nvidia's Blackwell architecture is currently operating on a different plane of efficiency, while AMD's rapid hardware gains are being throttled by software fragmentation. For the busy executive or engineer trying to navigate the next two years of infrastructure spend, this is not just a performance chart; it is a warning about the hidden costs of composability.
The Software Gap and the Composability Crisis
Patel's central thesis is that hardware specs are becoming secondary to how well different optimization techniques work together. He writes, "The biggest issue with inference on their systems and using their software is composability. That is, many of AMDs inference optimization implementations work well in isolation, but when combined with other optimizations, the result is not as competitive as one would expect." This observation is critical because it shifts the conversation from "which chip is faster" to "which stack actually runs your model." Modern large language models rely on complex techniques like disaggregated prefill and wide expert parallelism, and a chip whose stack cannot run those techniques in combination efficiently is a liability, regardless of its raw FLOPs.
The author highlights that while AMD's SGLang implementation can match Nvidia's performance when using a subset of optimizations, the gap widens significantly when all three major optimizations are enabled. Patel notes, "While performance is competitive on AMD when enabling just a subset of the SOTA inference optimizations, enabling all three major optimizations that labs use, AMD's performance is currently not competitive with Nvidia's." This is a nuanced but devastating critique for AMD. It suggests that their engineering efforts, while impressive in isolation, have not yet achieved the systemic integration required for frontier AI workloads. A counterargument worth considering is that AMD's software stack is younger and that the rapid pace of upstream contributions mentioned by Patel suggests this gap is closing faster than historical trends would predict.
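To make the composability claim concrete, consider what such a benchmark sweep looks like in miniature. The sketch below is not InferenceX code: the optimization names, gain factors, and interaction penalty are all invented for illustration, and measure_throughput is a stub where a real harness would launch the serving engine and replay a fixed request trace. What it shows is the methodology, testing every combination of optimizations rather than each one in isolation, which is exactly how an interaction penalty like the one Patel describes becomes visible.

```python
from itertools import combinations

# Hypothetical optimization names and gain factors, invented for illustration;
# real serving stacks (SGLang, vLLM, TensorRT-LLM) expose their own switches.
OPTIMIZATIONS = ("disaggregated_prefill", "wide_expert_parallelism", "mtp_speculation")

SOLO_GAIN = {
    "disaggregated_prefill": 1.6,
    "wide_expert_parallelism": 1.8,
    "mtp_speculation": 1.4,
}

def measure_throughput(enabled: tuple[str, ...]) -> float:
    """Stub for a real benchmark run, returning tokens/sec."""
    tput = 1_000.0
    for opt in enabled:
        tput *= SOLO_GAIN[opt]
    # Toy interaction penalty: every co-enabled pair of optimizations gives
    # back part of the multiplicative gain, mimicking the composability
    # effect Patel describes on AMD.
    pairs = len(enabled) * (len(enabled) - 1) // 2
    return tput * (0.85 ** pairs)

for r in range(len(OPTIMIZATIONS) + 1):
    for combo in combinations(OPTIMIZATIONS, r):
        label = "+".join(combo) or "baseline"
        print(f"{label:60s} {measure_throughput(combo):8.0f} tok/s")
```

On a stack where the penalty factor is close to 1.0, the full combination is simply the product of the individual gains; the further it falls below 1.0, the more a chip that looks competitive in single-optimization tests falls behind in the configuration frontier labs actually run.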
"Nvidia absolutely frame mogs with the B200, B300 and ASU frat leader, rack scale GB200/GB300 NVL72 across both SGLang and TRTLLM."
The use of the phrase "frame mogs"—internet slang for dominating a visual or performance comparison—might seem informal, but it underscores the magnitude of the lead. Patel is not describing a close race; he is describing a scenario where the incumbent's ecosystem integration creates a moat that hardware parity alone cannot breach. This aligns with the historical context of GPU computing, where the CUDA ecosystem has long been the primary barrier to entry for competitors, a dynamic that Patel argues is now being replicated in the inference layer with even greater intensity.
The Blackwell Leap and the Hopper Obsolescence
The benchmark results for Nvidia's latest Blackwell architecture are presented not as an incremental update, but as a generational leap that renders previous generations nearly obsolete for high-performance inference. Patel writes, "Nvidia's GB300 NVL72 doesn't disappoint. It achieves up to 100x on FP8 vs FP4 compared to even a strong H100 disagg+wideEP+MTP baseline and 65x on FP8 vs FP8." This data point is staggering. It implies that for organizations running at the frontier, the cost-per-token economics of the Hopper architecture (H100) are being fundamentally broken by the new Blackwell Ultra systems.
The commentary on energy efficiency is particularly sharp for a C-suite audience concerned with operational expenditure. "Nvidia GPUs also dominate when it comes to energy efficiency, with much lower all-in provisioned picoJoules of energy per token across all workloads," Patel states. In an era where data center power constraints are a primary bottleneck, this efficiency advantage is as valuable as the raw speed. The author further contextualizes this by noting that the administration's push for domestic AI capacity often overlooks the physical reality of power consumption, making these efficiency gains a strategic imperative rather than just a technical nicety.
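To translate energy per token into facility-planning terms, the arithmetic is short. Every number below is an assumption chosen for round figures, not a value from the benchmark; the point is only that per-token energy multiplied by sustained throughput is, dimensionally, the power a deployment must provision.

```python
# Back-of-envelope only: all figures are illustrative assumptions,
# not measurements from the benchmark.
energy_per_token_j = 1.2        # assumed all-in energy per generated token
tokens_per_second = 100_000     # assumed sustained fleet throughput

power_kw = energy_per_token_j * tokens_per_second / 1_000   # J/s == W
kwh_per_million_tokens = energy_per_token_j * 1_000_000 / 3_600_000
cost_per_million_usd = kwh_per_million_tokens * 0.08        # assumed $/kWh

print(f"sustained draw: {power_kw:.0f} kW")
print(f"energy per 1M tokens: {kwh_per_million_tokens:.2f} kWh "
      f"(${cost_per_million_usd:.3f} at $0.08/kWh)")
```

Under these made-up numbers the deployment draws 120 kW continuously, which is why, in a power-constrained data center, halving energy per token is effectively the same as doubling deployable capacity.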
Patel also addresses the narrative around Nvidia's CEO, reframing the "Jensen Math" skepticism that has plagued the company's previous announcements. He writes, "At GTC 2024, Jensen claimed that Blackwell will deliver up to 30x perf on inference compared to H100, Jensen under promised & overdelivered on Blackwell inference performance." This is a significant moment of validation for executives and industry analysts who have been wary of the company's aggressive projections. The benchmark suggests that the physical limits of silicon are being pushed further than anticipated, likely due to the rack-scale integration of the NVL72 systems, which was a key focus of the Blackwell microarchitecture deep dive.
"Rack scale Blackwell NVL72 is framemogging hopper and makes hopper looks like it is jestermaxxing."
Again, the informal language serves to emphasize the absurdity of the performance gap. To describe the previous generation as "jestermaxxing"—a term implying looking foolish or out of touch—is a bold rhetorical choice that signals the end of the Hopper era for high-end inference. Critics might argue that such language undermines the technical rigor of the report, but in the fast-moving world of AI, the speed of obsolescence is the most important metric, and Patel is effectively communicating that the clock has run out on the old guard.
The Path Forward for AMD and the Open Ecosystem
Despite the overwhelming dominance of Nvidia, Patel does not write AMD off. Instead, he identifies a clear, actionable path for the competitor to regain relevance: upstreaming their code and focusing on composability. He writes, "We also see that for single node aggregated serving, AMD's SGLang delivers better perf per TCO than NVIDIA's SGLang for FP8." This finding is a crucial lifeline. It suggests that for specific, less complex workloads, AMD's Total Cost of Ownership (TCO) advantage is real and significant.
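The metric itself is worth making explicit, because it is how a slower chip can still win: perf per TCO divides sustained throughput by the all-in hourly cost of owning and powering the system. The toy comparison below uses invented throughput and TCO figures, and the labels system_a and system_b are placeholders rather than benchmark entries; it exists only to show how a cheaper system with lower raw throughput can come out ahead.

```python
# Invented figures: throughput and hourly TCO are placeholders, not data
# from the report.
systems = {
    "system_a": {"tokens_per_s": 40_000, "tco_per_hour_usd": 25.0},
    "system_b": {"tokens_per_s": 34_000, "tco_per_hour_usd": 17.0},
}

for name, s in systems.items():
    tokens_per_tco_dollar = s["tokens_per_s"] * 3_600 / s["tco_per_hour_usd"]
    print(f"{name}: {tokens_per_tco_dollar / 1e6:.2f}M tokens per TCO dollar")
```

Here system_a is faster in absolute terms yet yields 5.76M tokens per TCO dollar, while system_b, at a lower hourly cost, delivers 7.20M, which is the shape of the FP8 single-node result Patel reports for AMD.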
The author praises AMD's recent decision to deprecate their forked versions of open-source tools in favor of contributing to the mainline code. "It is also great to see that AMD has deprecated their second class fork of vllm to move further upstream and closer to delivering first class experience," Patel notes. This is a strategic pivot that addresses the root cause of their composability issues. By moving their engineers to work directly with the maintainers of SGLang and vLLM, AMD is attempting to solve the integration problem at the source rather than trying to patch it with proprietary solutions.
Patel also highlights the role of AMD's China-based engineering team in driving these improvements. "MoRI is AMD's MoE dispatch/combine collective and KV Cache transfer library built from first principles by AMD's cracked 10x China-based engineering team," he writes. This acknowledgment of the specific engineering teams adds a layer of human depth to the technical analysis, recognizing that the software gap is being closed by dedicated individuals working on the ground. The author recommends that Nvidia, too, should invest more in these open ecosystems, writing, "Jensen needs to staff more resources & engineers towards contributing open ecosystems like SGLang & vLLM." This is a subtle but important critique of the incumbent's strategy, suggesting that their dominance in proprietary tools (like TensorRT-LLM) may eventually backfire if they neglect the open-source community that drives innovation.
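The report does not spell out MoRI's internals, but the dispatch/combine pattern such a library implements is standard MoE plumbing, and seeing it once makes the composability stakes clearer. The single-process NumPy sketch below shows only the semantics: group tokens by their routed expert, run each expert on its slice, then restore the original token order. A real library performs the same reorder-compute-restore sequence as all-to-all collectives across GPUs; nothing here reflects MoRI's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden, num_experts = 8, 4, 2

tokens = rng.standard_normal((num_tokens, hidden))
expert_of = rng.integers(0, num_experts, size=num_tokens)   # router decision

# Dispatch: group token rows by destination expert (the all-to-all "send").
order = np.argsort(expert_of, kind="stable")
dispatched = tokens[order]

# Expert compute: each expert applies its own weights to its contiguous slice.
weights = rng.standard_normal((num_experts, hidden, hidden))
outputs = np.empty_like(dispatched)
start = 0
for e in range(num_experts):
    count = int((expert_of == e).sum())
    outputs[start:start + count] = dispatched[start:start + count] @ weights[e]
    start += count

# Combine: scatter results back to the original token order (the return trip).
combined = np.empty_like(outputs)
combined[order] = outputs

# Sanity check: each token got exactly its own expert's transform.
direct = np.stack([tokens[i] @ weights[expert_of[i]] for i in range(num_tokens)])
assert np.allclose(combined, direct)
```

In production, the dispatch and combine steps are network collectives between GPUs holding different experts, and the KV cache transfer that the quote says MoRI also handles is the analogous data-movement problem for disaggregated prefill, which is why such libraries sit at the center of the composability story.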
Bottom Line
Dylan Patel's InferenceX v2 is a definitive moment in the AI infrastructure debate, proving that the next frontier of competition is not just silicon, but the software glue that holds it together. The strongest part of the argument is the empirical evidence showing that Nvidia's Blackwell architecture has achieved a level of composability and energy efficiency that makes the previous generation economically unviable for frontier workloads. However, the analysis's biggest vulnerability is its heavy reliance on the current state of open-source adoption; if the open ecosystem fractures or if proprietary solutions become the standard, the TCO advantages for AMD could evaporate. Readers should watch closely to see if AMD's commitment to upstreaming code translates into a sustained reduction in the composability gap over the next six months.
"Nvidia absolutely frame mogs with the B200, B300 and ASU frat leader, rack scale GB200/GB300 NVL72 across both SGLang and TRTLLM."
The verdict is clear: for those building the next generation of AI services, the choice of hardware is now inextricably linked to the maturity of the software stack. The administration and private sector alike must recognize that the race is no longer about who has the fastest chip, but who has the most cohesive system. As Patel demonstrates, the gap between the two is not just a technical detail; it is a strategic chasm.