This isn't just another benchmark chart; it is a forensic audit of the AI industry's fragility on launch day. Dylan Patel exposes a startling reality: while the open-source community achieved near-instantaneous deployment, major proprietary stacks from giants like Nvidia and AMD were functionally broken for weeks. The evidence here suggests that the 'moat' of advanced hardware is only as strong as the software ecosystem surrounding it, and right now, that ecosystem is being built by volunteers, not the chip vendors themselves.
The Day 0 Reality Check
Patel frames the narrative around a critical window: the first forty-three days after DeepSeek v4's release. He argues that this period reveals the true state of engineering maturity across different hardware architectures. "The open-source InferenceX engineering team has pulled multiple all-nighters to measure performance results for this model on Day 0, Day 1, Day 2, and beyond," he writes, highlighting the frantic pace required just to get a baseline.
This focus on the immediate aftermath of a release is distinctive. Most industry analysis waits for polished press releases; Patel looks at the raw, broken code. He notes that while CUDA-based engines like vLLM and SGLang worked "out of the box," other major players stumbled immediately. "Nvidia's in house TensorRT-LLM did not work well for DeepSeek v4, and we at SemiAnalysis had to fix their open source mHC kernel launch code," Patel admits, a rare moment where an analyst publicly patches the vendor's own product.
The implications are stark. The argument here is that proprietary optimization is lagging behind community-driven development. "One of the north star goals of InferenceX... is to highlight the iterative improvements to performance over time, instead of just snapshots of performance," Patel explains. This approach shifts the metric from peak theoretical speed to real-world deployability. It suggests that for busy engineers, a chip that works on Day 1 with open tools is more valuable than one that requires weeks of vendor-specific debugging.
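The difference between a snapshot and a trend can be made concrete with a small sketch. The `Measurement` type, the `speedup_over_time` helper, and the throughput figures below are all hypothetical placeholders, not InferenceX code or data from the article. They only illustrate normalizing each day's result against the Day 0 baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    day: int             # days since the model's release
    tokens_per_s: float  # measured throughput (placeholder units)


def speedup_over_time(series):
    """Return (day, speedup vs. the earliest measurement) pairs, sorted by day."""
    ordered = sorted(series, key=lambda m: m.day)
    if not ordered:
        return []
    baseline = ordered[0].tokens_per_s
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return [(m.day, m.tokens_per_s / baseline) for m in ordered]


# Placeholder numbers for illustration only; NOT measurements from the article.
series = [Measurement(26, 400.0), Measurement(0, 4.0), Measurement(7, 40.0)]
trend = speedup_over_time(series)

for day, speedup in trend:
    print(f"Day {day}: {speedup:.0f}x the Day 0 baseline")
```

A single snapshot would report only one of these rows; the series shows how quickly a stack recovers from a poor launch.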
The CUDA moat at work: With CUDA, distributed inferencing tends to be supported near Day 0 for the latest open models.
Critics might argue that focusing on Day 0 bugs ignores the long-term stability vendors eventually achieve. However, in a sector moving as fast as generative AI, a two-week delay can mean missing an entire product cycle. The evidence of broken kernels and hardcoded constants suggests these aren't minor glitches but fundamental architectural oversights.
The Software Gap: Nvidia vs. AMD
The article draws sharp distinctions between how different hardware ecosystems handled the new model architecture. On the Nvidia side, Patel details a specific failure mode where engineers simply removed error guards rather than fixing the underlying logic. "Nvidia engineers also encountered this guard error and instead of adding code to support DeepSeek v4 Pro's 7168 hidden size, they simply removed the guard," he writes.
This is a damning critique of corporate engineering culture. With the safety check removed, the system did not crash; it silently produced "invalid generations." Patel notes that this issue persisted for over a week until his team intervened. "Running inference with these settings results in an inference run without triggering an immediate crash, but there are hidden consequences: the engine ends up corrupting hidden states and producing invalid generations," he warns.
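Why a removed guard is worse than a crash can be sketched in a few lines. This is a minimal illustration, not TensorRT-LLM's actual code: the names `SUPPORTED_HIDDEN_SIZES`, `check_hidden_size`, and `launch_kernel` are hypothetical, and the sizes in the supported set are made up. Only the 7168 hidden size comes from the article.

```python
# Hypothetical set of shapes a kernel was written and tested for.
SUPPORTED_HIDDEN_SIZES = frozenset({4096, 5120})


class UnsupportedShapeError(ValueError):
    """Raised when a kernel is asked to run on a shape it was not built for."""


def check_hidden_size(hidden_size, supported=SUPPORTED_HIDDEN_SIZES):
    # The guard: fail loudly instead of running on an untested shape.
    if hidden_size not in supported:
        raise UnsupportedShapeError(
            f"hidden_size={hidden_size} is not supported; "
            f"supported sizes: {sorted(supported)}"
        )
    return hidden_size


def launch_kernel(hidden_size, supported=SUPPORTED_HIDDEN_SIZES):
    # Deleting this call would let the launch proceed on an unsupported
    # shape, trading a visible error for silently wrong output.
    check_hidden_size(hidden_size, supported)
    return f"launched kernel for hidden_size={hidden_size}"


try:
    launch_kernel(7168)
except UnsupportedShapeError as err:
    print(f"guard fired: {err}")

# The real fix is to add support for the new shape (kernel code plus the
# guard's allow-list), represented here only by extending the set.
print(launch_kernel(7168, SUPPORTED_HIDDEN_SIZES | {7168}))
```

The design point is that the guard and the kernel's supported shapes should change together: a loud failure on Day 0 is recoverable, while corrupted hidden states can go unnoticed.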
In contrast, the AMD story is one of rapid, albeit late, redemption. On Day 0, their performance was "an unusable experience given extremely low interactivity levels." Yet, Patel highlights a dramatic turnaround led by the engineering team under HaiShaw. "The AMD SGLang engineering team... massively improved performance in the first month - achieving a more than 100x performance by Day 26," he reports.
This 100x improvement is not just a number; it represents a fundamental shift from PyTorch fallbacks to custom kernels. Patel explains that the gains came from replacing generic paths with specialized code for tasks like flash attention and sparse indexing. "The gain came almost entirely from AMD replacing PyTorch-native fallback paths with real AITER, Triton, TileLang, and FlyDSL kernels," he details.
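The fallback-versus-specialized-kernel pattern described here can be sketched generically. The `KernelRegistry` class below is a hypothetical illustration and does not reflect the APIs of AITER, Triton, TileLang, FlyDSL, or SGLang. It shows one way an engine might route an operation to a specialized implementation when one is registered, fall back to a generic path otherwise, and count the fallbacks so missing kernels are visible.

```python
import math


def reference_softmax(xs):
    """Generic slow path, standing in for a framework-native fallback."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class KernelRegistry:
    """Hypothetical dispatcher: prefer a registered kernel, else fall back."""

    def __init__(self):
        self._fast = {}
        self.fallback_hits = {}  # op name -> number of times the slow path ran

    def register(self, op_name, fn):
        self._fast[op_name] = fn

    def run(self, op_name, fallback, *args):
        fn = self._fast.get(op_name)
        if fn is None:
            # Record the miss so unoptimized ops show up in a report.
            self.fallback_hits[op_name] = self.fallback_hits.get(op_name, 0) + 1
            return fallback(*args)
        return fn(*args)


registry = KernelRegistry()

# Day 0: no specialized kernel exists, so the generic path runs.
day0 = registry.run("softmax", reference_softmax, [1.0, 2.0, 3.0])

# Later: a specialized kernel is registered. The reference function is reused
# here purely as a placeholder for a real hand-tuned implementation.
registry.register("softmax", reference_softmax)
later = registry.run("softmax", reference_softmax, [1.0, 2.0, 3.0])

print(registry.fallback_hits)  # only the Day 0 call counted as a fallback
```

Counting fallback hits per operation is one simple way to see which parts of a model still run on the generic path after a launch.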
The historical context here is vital. This mirrors the early struggles of the ROCm ecosystem and echoes previous deep dives on TensorRT-LLM, where vendor stacks often lag behind open frameworks like vLLM. The fact that AMD's team had to replace framework fallbacks with hand-written kernels after launch underscores a broader truth: hardware dominance does not guarantee software readiness.
The Open Source Advantage
The core thesis of Patel's coverage is the superiority of the open-source inference ecosystem in speed and adaptability. He points out that the open models, largely driven by Chinese labs like DeepSeek, are forcing the industry to evolve faster than ever. "China currently dominates the open model landscape, with Kimi K2.6 still beating Jensen's Nemotron Committee Coalition's Nemotron 3 Ultra on coding," he observes.
This dominance forces Western vendors to react or risk obsolescence in the inference layer. Patel emphasizes that the community tools are not just alternatives but often the primary drivers of innovation. "These inference engines are so fundamental to the global ML ecosystem that both teams have started their own company, Inferact and RadixArk, with each raising hundreds of millions of dollars," he notes.
The reliance on community support was further highlighted when hardware failures threatened the study itself. When Patel's own cluster went down, it was CoreWeave that stepped in to provide spare racks. "Luckily, CoreWeave came through and contributed compute to the open source community... scrambling to find two spare dev GB300 NVL72 racks," he writes.
This collaborative dynamic stands in contrast to the siloed approach of major chipmakers. The argument suggests that the future of AI infrastructure may depend less on who makes the best silicon and more on who builds the most responsive software layer.
In the early days of DeepSeek v4 Pro, CUDA vLLM and CUDA SGLang... worked great out of the box, proving the strength of the open ecosystems.
A counterargument worth considering is that proprietary engines like TensorRT-LLM may eventually offer superior performance at scale once these bugs are ironed out. Patel acknowledges this, noting that "as of today, TRT-LLM's performance is superior at higher batch sizes." However, he maintains that the time-to-market advantage of open tools remains a critical differentiator for most enterprises.
Bottom Line
The strongest part of Dylan Patel's analysis is its unflinching exposure of how easily proprietary software stacks can fail when faced with novel model architectures. The evidence that major vendors produced silent data corruption rather than simple crashes is a significant warning for enterprise adopters. The piece's biggest vulnerability lies in its heavy reliance on open-source benchmarks, which may not fully capture the optimized performance these vendors claim to deliver in controlled, long-term deployments. Readers should watch closely whether the rapid iterative improvements seen in the first month can be sustained as models grow even larger and more complex.