In an industry drowning in marketing brochures and theoretical peak numbers, Dylan Patel's new initiative cuts through the noise with a radical proposition: stop guessing and start measuring every single night. While competitors rely on static, one-off benchmarks that age poorly in weeks, Patel argues that the only way to understand the true state of artificial intelligence is to treat performance data as a living, breathing ecosystem that changes daily. This is not just another technical report; it is a structural intervention in how the industry evaluates the trillions of dollars being poured into AI infrastructure.
The Velocity Problem
The core of Patel's argument rests on a simple but often overlooked reality: hardware moves in steps, but software moves in sprints. He writes, "While hardware innovation drives step jumps in performance every year through the release of new GPUs and new systems, software evolves every single day, delivering continuous performance gains on top of these step jumps." This distinction is crucial for any operator trying to make sense of the current market. A chip that looks inferior on paper today might outperform its rival tomorrow simply because a key software kernel was optimized overnight.
Patel explains that traditional benchmarks fail because they capture a snapshot of a moving target. "Benchmarks conducted at a fixed point in time quickly go stale and do not represent the performance that can be achieved with the latest software packages." By automating the process to run nightly across hundreds of chips, the new InferenceMAX™ initiative aims to create a "live indicator of inference performance progress." The approach acknowledges that the gap between theoretical peak and real-world inference throughput is determined not by silicon alone, but by systems software: inference engines such as SGLang, vLLM, and TensorRT-LLM, the distributed serving strategies they employ, and the low-level kernels underneath them.
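To make the nightly cadence concrete, here is a minimal sketch of what such an automated sweep could look like. It is purely illustrative: the engine and GPU labels, the run_benchmark.py entry point, and the result fields are hypothetical placeholders, not the actual InferenceMAX™ harness.

```python
# Hypothetical sketch of a nightly benchmark sweep (not the InferenceMAX harness).
# Assumes a run_benchmark.py script exists that prints JSON results for one config.
import datetime
import itertools
import json
import subprocess

ENGINES = ["vllm", "sglang", "trtllm"]            # inference engines under test
GPUS = ["h100", "b200", "mi300x", "mi355x"]       # hardware targets (illustrative)
WORKLOADS = ["chat-1k-in-1k-out", "summarize-8k-in-1k-out"]

def nightly_sweep() -> list[dict]:
    results = []
    stamp = datetime.date.today().isoformat()
    for engine, gpu, workload in itertools.product(ENGINES, GPUS, WORKLOADS):
        # Each run uses the latest engine release, so results track software progress.
        proc = subprocess.run(
            ["python", "run_benchmark.py", "--engine", engine,
             "--gpu", gpu, "--workload", workload, "--json"],
            capture_output=True, text=True, check=True,
        )
        record = json.loads(proc.stdout)  # e.g. {"tokens_per_sec": ..., "latency_ms": ...}
        record.update({"date": stamp, "engine": engine, "gpu": gpu, "workload": workload})
        results.append(record)
    return results

if __name__ == "__main__":
    for row in nightly_sweep():
        print(row)
```

Scheduled from cron or a CI runner, a loop of this shape is what turns a benchmark from a one-off snapshot into a time series.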
Critics might argue that such a high-frequency benchmarking model introduces too much noise, making it hard to distinguish a genuine software breakthrough from a transient glitch. However, the sheer volume of data generated by running tests every night on diverse hardware should, in theory, smooth out these anomalies and surface sustained trends rather than one-off outliers.
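As a simple illustration of how that smoothing works, a rolling median over the nightly throughput history damps a single bad run while letting a sustained software gain show through. The numbers below are fabricated for the example.

```python
# Illustrative only: separate a sustained throughput shift from a one-night anomaly
# using a rolling median over nightly tokens/sec measurements (fabricated numbers).
from statistics import median

nightly_tps = [102, 101, 103, 55, 102, 104, 118, 119, 121, 120]  # hypothetical data

def rolling_median(series, window=5):
    return [median(series[max(0, i - window + 1): i + 1]) for i in range(len(series))]

print(rolling_median(nightly_tps))
# The single dip to 55 barely moves the median, while the jump to ~120
# persists across several nights and shows up as a new plateau.
```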
A Neutral Field in a Polarized Market
What makes this project particularly striking is its attempt to remain neutral in a market dominated by two giants: NVIDIA and AMD. Patel is careful to frame the results not as a victory for one vendor, but as a reflection of specific workload characteristics. He notes, "AMD and Nvidia GPUs can both deliver competitive performance for different sets of workloads, with AMD performing best for some types of workloads and Nvidia excelling at others." This nuance is vital. It moves the conversation away from tribal loyalty and toward practical engineering decisions based on cost, efficiency, and specific use cases.
The initiative has already secured buy-in from the very leaders whose products are being tested, a rare feat in the competitive semiconductor space. Jensen Huang, founder and CEO of NVIDIA, states, "The results are clear: Grace Blackwell NVL72 with TRT-LLM and Dynamo delivers unmatched performance per dollar and per megawatt—powering the most productive and cost-effective AI factories in the world." Meanwhile, Dr. Lisa Su, CEO of AMD, counters with her own data-driven confidence: "The open-source InferenceMAX benchmark gives the community transparent, nightly results that inspire trust and accelerate progress. It highlights the competitive TCO performance of our AMD Instinct MI300, MI325X, and MI355X GPUs across diverse workloads."
The fact that these rivals are both endorsing a third-party benchmark suggests a shared exhaustion with opaque marketing claims. As Peter Hoeschele of OpenAI puts it, "InferenceMAX™'s head-to-head benchmarks cut through the noise and provide a living picture of token throughput, performance per dollar, and tokens per Megawatt." This collective desire for transparency indicates that the industry is maturing; buyers are no longer satisfied with vendor-provided white papers.
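The per-dollar and per-megawatt framings in those quotes reduce to straightforward unit arithmetic. The sketch below uses fabricated inputs purely to show the conversions, not measured results for any vendor.

```python
# Back-of-the-envelope metrics with fabricated inputs (not measured results).
tokens_per_sec_per_gpu = 12_000      # hypothetical sustained throughput
gpu_hour_cost_usd = 3.50             # hypothetical rental price per GPU-hour
power_per_gpu_kw = 1.2               # hypothetical average draw, including overhead

tokens_per_hour = tokens_per_sec_per_gpu * 3600
tokens_per_dollar = tokens_per_hour / gpu_hour_cost_usd
# One common way to express "tokens per megawatt" is throughput per unit of power draw:
tokens_per_sec_per_mw = tokens_per_sec_per_gpu / (power_per_gpu_kw / 1000)

print(f"{tokens_per_dollar:,.0f} tokens per dollar")
print(f"{tokens_per_sec_per_mw:,.0f} tokens/sec per megawatt")
```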
The Software Feedback Loop
Perhaps the most significant insight Patel offers is how this benchmarking tool creates a direct feedback loop between software developers and hardware architects. He highlights the responsiveness of the AMD team, noting that when issues arise, they "immediately jumped in to help find temporary fixes that unblock us, following up with permanent patches into ROCm to ensure long-term stability." This level of collaboration is what Patel calls the "AMD 2.0 sense of urgency."
By making the benchmark open-source and running it continuously, the project allows software engineers to see exactly how their optimizations impact real-world performance. Tri Dao, Chief Scientist of Together AI, emphasizes this value: "InferenceMAX™ is valuable because it benchmarks the latest software showing how optimizations like FP4, MTP, speculative decode, and wide-EP actually play out across various hardware." This transforms benchmarking from a static report card into a dynamic tool for engineering improvement.
Open, reproducible results of this kind, the argument goes, help the whole community move faster.
A counterargument worth considering is whether this level of transparency might inadvertently expose proprietary optimizations or strategic roadmaps that companies prefer to keep under wraps. However, the overwhelming support from major cloud providers like Microsoft, Oracle, and CoreWeave suggests that the industry values the clarity of real-world data over the secrecy of marketing narratives.
Bottom Line
Patel's InferenceMAX™ initiative represents a necessary evolution in how the AI industry measures success, shifting the focus from static hardware specs to the dynamic reality of software-driven performance. Its greatest strength lies in its ability to force transparency in a market often clouded by hype, while its biggest vulnerability is the sheer complexity of maintaining a neutral, automated benchmark across rapidly changing software stacks. For anyone investing in or deploying AI infrastructure, the nightly data from this project is likely to become the single most important metric for decision-making in the coming year.