DeepSeekV4 1.6T day 0 to day 43 performance over time - Huawei, GB300 NVL72, MI355X, b200

This isn't just another benchmark chart; it is a forensic audit of the AI industry's fragility on launch day. Dylan Patel exposes a startling reality: while the open-source community achieved near-instantaneous deployment, major proprietary stacks from giants like Nvidia and AMD were functionally broken for weeks. The evidence here suggests that the 'moat' of advanced hardware is only as strong as the software ecosystem surrounding it, and right now, that ecosystem is being built by volunteers, not the chip vendors themselves.

The Day 0 Reality Check

Patel frames the narrative around a critical window: the first forty-three days after DeepSeek v4's release. He argues that this period reveals the true state of engineering maturity across different hardware architectures. "The open-source InferenceX engineering team has pulled multiple all-nighters to measure performance results for this model on Day 0, Day 1, Day 2, and beyond," he writes, highlighting the frantic pace required just to get a baseline.

This focus on the immediate aftermath of a release is distinctive. Most industry analysis waits for polished press releases; Patel looks at the raw, broken code. He notes that while CUDA-based engines like vLLM and SGLang worked "out of the box," other major players stumbled immediately. "Nvidia's in house TensorRT-LLM did not work well for DeepSeek v4, and we at SemiAnalysis had to fix their open source mHC kernel launch code," Patel admits, a rare moment where an analyst publicly patches the vendor's own product.

The implications are stark. The argument here is that proprietary optimization is lagging behind community-driven development. "One of the north star goals of InferenceX... is to highlight the iterative improvements to performance over time, instead of just snapshots of performance," Patel explains. This approach shifts the metric from peak theoretical speed to real-world deployability. It suggests that for busy engineers, a chip that works on Day 1 with open tools is more valuable than one that requires weeks of vendor-specific debugging.
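To make the snapshot-versus-trajectory distinction concrete, here is a minimal Python sketch. The `Measurement` record, the `speedup_over_day0` helper, and every number in it are hypothetical placeholders chosen for illustration; they are not InferenceX data and do not reflect its actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    day: int               # days since model release (Day 0 = launch day)
    tokens_per_sec: float  # hypothetical throughput figure


def speedup_over_day0(history: list[Measurement]) -> dict[int, float]:
    """Return each day's throughput relative to the Day 0 baseline."""
    ordered = sorted(history, key=lambda m: m.day)
    if not ordered or ordered[0].day != 0:
        raise ValueError("history must include a Day 0 baseline")
    baseline = ordered[0].tokens_per_sec
    if baseline <= 0:
        raise ValueError("Day 0 throughput must be positive")
    return {m.day: m.tokens_per_sec / baseline for m in ordered}


# Placeholder numbers only; they are not measurements from the article.
example_history = [
    Measurement(day=0, tokens_per_sec=10.0),
    Measurement(day=7, tokens_per_sec=40.0),
    Measurement(day=26, tokens_per_sec=1000.0),
]
trajectory = speedup_over_day0(example_history)
```

A single snapshot would report only one of these rows. The trajectory keeps the Day 0 baseline visible, so a late recovery and a slow start both remain on the record.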

The CUDA moat at work: With CUDA, distributed inferencing tends to be supported near Day 0 for the latest open models.

Critics might argue that focusing on Day 0 bugs ignores the long-term stability vendors eventually achieve. However, in a sector moving as fast as generative AI, a two-week delay can mean missing an entire product cycle. The evidence of broken kernels and hardcoded constants suggests these aren't minor glitches but fundamental architectural oversights.

The Software Gap: Nvidia vs. AMD

The article draws sharp distinctions between how different hardware ecosystems handled the new model architecture. On the Nvidia side, Patel details a specific failure mode where engineers simply removed error guards rather than fixing the underlying logic. "Nvidia engineers also encountered this guard error and instead of adding code to support DeepSeek v4 Pro's 7168 hidden size, they simply removed the guard," he writes.

This is a damning critique of corporate engineering culture. With the safety check removed, the system didn't crash; it silently produced "invalid generations." Patel notes that this issue persisted for over a week until his team intervened. "Running inference with these settings results in an inference run without triggering an immediate crash, but there are hidden consequences: the engine ends up corrupting hidden states and producing invalid generations," he warns.
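The difference between a guard that fails loudly and a guard that was simply deleted can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TensorRT-LLM code: `SUPPORTED_HIDDEN_SIZES`, both launcher functions, and every size other than 7168 are hypothetical.

```python
# Hypothetical set of hidden sizes a kernel was tuned for; 7168 is absent.
SUPPORTED_HIDDEN_SIZES = frozenset({2048, 4096, 8192})


class UnsupportedShapeError(ValueError):
    """Raised when a kernel is asked to run on a shape it was not built for."""


def launch_kernel_guarded(hidden_size: int) -> str:
    """Refuse to run on unsupported shapes so the failure is visible."""
    if hidden_size not in SUPPORTED_HIDDEN_SIZES:
        raise UnsupportedShapeError(
            f"hidden_size={hidden_size} is not supported; "
            f"expected one of {sorted(SUPPORTED_HIDDEN_SIZES)}"
        )
    return f"launched kernel for hidden_size={hidden_size}"


def launch_kernel_unguarded(hidden_size: int) -> str:
    """Same launcher with the guard deleted: it 'succeeds' for any input."""
    # Nothing here verifies that the kernel's shape assumptions hold, so an
    # unsupported size would run and could silently produce wrong results.
    return f"launched kernel for hidden_size={hidden_size}"


def try_launch(hidden_size: int) -> str:
    """Show the loud failure path of the guarded launcher."""
    try:
        return launch_kernel_guarded(hidden_size)
    except UnsupportedShapeError as exc:
        return f"rejected: {exc}"
```

In this sketch, `try_launch(7168)` reports a rejection while `launch_kernel_unguarded(7168)` reports success. The fix Patel describes as missing corresponds to extending the kernel logic and the supported set together, not deleting the check.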

In contrast, the AMD story is one of rapid, albeit late, redemption. On Day 0, their performance was "an unusable experience given extremely low interactivity levels." Yet, Patel highlights a dramatic turnaround led by the engineering team under HaiShaw. "The AMD SGLang engineering team... massively improved performance in the first month - achieving a more than 100x performance by Day 26," he reports.

This 100x improvement is not just a number; it represents a fundamental shift from PyTorch fallbacks to custom kernels. Patel explains that the gains came from replacing generic paths with specialized code for tasks like flash attention and sparse indexing. "The gain came almost entirely from AMD replacing PyTorch-native fallback paths with real AITER, Triton, TileLang, and FlyDSL kernels," he details.
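The fallback-replacement pattern described here can be sketched as a small dispatch table. This is a hypothetical illustration, not AMD's or SGLang's code: `KernelRegistry`, `generic_fallback`, and the placeholder kernel are invented for the example, and the op names are borrowed loosely from the tasks mentioned above.

```python
from typing import Callable, Dict, List

Kernel = Callable[[List[float]], List[float]]


def generic_fallback(xs: List[float]) -> List[float]:
    """Stand-in for a slow, framework-native path (here: doubles each value)."""
    return [x * 2.0 for x in xs]


class KernelRegistry:
    """Maps op names to specialized kernels, falling back when none exists."""

    def __init__(self) -> None:
        self._kernels: Dict[str, Kernel] = {}
        self.fallback_hits: Dict[str, int] = {}

    def register(self, op: str, kernel: Kernel) -> None:
        self._kernels[op] = kernel

    def run(self, op: str, xs: List[float]) -> List[float]:
        kernel = self._kernels.get(op)
        if kernel is None:
            # Count fallbacks so the slow ops are easy to find and replace.
            self.fallback_hits[op] = self.fallback_hits.get(op, 0) + 1
            return generic_fallback(xs)
        return kernel(xs)


registry = KernelRegistry()
# Placeholder "specialized" kernel: same result as the fallback, by design.
registry.register("flash_attention", lambda xs: [x * 2.0 for x in xs])

out_specialized = registry.run("flash_attention", [1.0, 2.0])
out_fallback = registry.run("sparse_indexing", [1.0, 2.0])
```

Both calls return the same values, which is the point: a fallback path is correct but slow, so counting fallback hits shows which ops still need a purpose-built kernel.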

The historical context here is vital. AMD's slow start mirrors the early struggles of the ROCm ecosystem, and it parallels the TensorRT-LLM story, where a vendor stack lagged behind open frameworks like vLLM. The fact that AMD's team had to replace fallback paths with purpose-built kernels before performance became usable underscores a broader truth: capable hardware does not guarantee software readiness.

The Open Source Advantage

The core thesis of Patel's coverage is the superiority of the open-source inference ecosystem in speed and adaptability. He points out that the open models, largely driven by Chinese labs like DeepSeek, are forcing the industry to evolve faster than ever. "China currently dominates the open model landscape, with Kimi K2.6 still beating Jensen's Nemotron Committee Coalition's Nemotron 3 Ultra on coding," he observes.

This dominance forces Western vendors to react or risk obsolescence in the inference layer. Patel emphasizes that the community tools are not just alternatives but often the primary drivers of innovation. "These inference engines are so fundamental to the global ML ecosystem that both teams have started their own company, Inferact and RadixArk, with each raising hundreds of millions of dollars," he notes.

The reliance on community support was further highlighted when hardware failures threatened the study itself. When Patel's own cluster went down, it was CoreWeave that stepped in to provide spare racks. "Luckily, CoreWeave came through and contributed compute to the open source community... scrambling to find two spare dev GB300 NVL72 racks," he writes.

This collaborative dynamic stands in contrast to the siloed approach of major chipmakers. The argument suggests that the future of AI infrastructure may depend less on who makes the best silicon and more on who builds the most responsive software layer.

In the early days of DeepSeek v4 Pro, CUDA vLLM and CUDA SGLang... worked great out of the box, proving the strength of the open ecosystems.

A counterargument worth considering is that proprietary engines like TensorRT-LLM may eventually offer superior performance at scale once these bugs are ironed out. Patel acknowledges this, noting that "as of today, TRT-LLM's performance is superior at higher batch sizes." However, he maintains that the time-to-market advantage of open tools remains a critical differentiator for most enterprises.

Bottom Line

The strongest part of Dylan Patel's analysis is its unflinching exposure of how easily proprietary software stacks can fail when faced with novel model architectures. The evidence that major vendors produced silent data corruption rather than simple crashes is a significant warning for enterprise adopters. The piece's biggest vulnerability lies in its heavy reliance on open-source benchmarks, which may not fully capture the optimized performance these vendors claim to deliver in controlled, long-term deployments. Readers should watch closely whether the rapid iterative improvements seen in the first month can be sustained as models grow even larger and more complex.

Deep Dives

Explore these related deep dives:

  • vLLM

    The article credits this specific open-source inference engine with enabling the rapid Day 0 performance tracking and subsequent iterative optimizations that define DeepSeek v4's deployment success.

  • TensorRT

    Understanding why Nvidia's proprietary optimization suite initially failed to support DeepSeek v4 highlights the critical friction between closed hardware ecosystems and emerging open-model architectures.

  • HiSilicon

    While the article focuses on the newer Ascend 950DT, Huawei's chip design subsidiary provides essential context for the architectural constraints and software stack evolution that Chinese labs had to navigate to achieve inference parity with Nvidia GPUs.

Sources

DeepSeekV4 1.6T day 0 to day 43 performance over time - Huawei, GB300 NVL72, MI355X, b200

by Dylan Patel · SemiAnalysis

The release of DeepSeek v4 marks another step forward for the open model community - unsurprisingly, it is the product of a Chinese lab. The evolution of its performance over time is of paramount importance to the AI ecosystem. The open-source InferenceX engineering team has pulled multiple all-nighters to measure performance results for this model on Day 0, Day 1, Day 2, and beyond and bring these results to the world. In this article, we will highlight DeepSeek v4’s Day 0 performance and explain the significant improvements made in the weeks following the model’s release. We will also explain core components of DeepSeek v4’s model architecture and discuss how it was co-designed in part for Huawei Ascend inference.

In section 2 of our blog post, we do a comprehensive analysis of DeepSeek v4’s inference on Day 0 Huawei Ascend 950DT. This article is the first analysis of Ascend 950DT DeepSeek v4 inference, and we break down the compute<>communication overlap and the different compute streams that Huawei used to optimize performance.

A key goal of InferenceX, especially during a model’s Day 0 release window, is to record each SKU’s performance using open-sourced images and recipes across as many frameworks as possible, regardless of how well these images and recipes perform. This enables us to track improvements over time, which we believe best reflects the real, deployable performance of each chip. The video below shows iterative improvements for non-MTP configs from Day 0 onward for vLLM and SGLang. Visit inference.com to see the MTP configs from Day 0 onward too.

The graphics reflect the thousands of engineering hours that went into tuning DeepSeek v4 inference performance, and most of the optimizations are merged into the master branch of SGLang/vLLM. One of the north star goals of InferenceX is to highlight the iterative improvements to performance over time, instead of just snapshots of performance. After all, when it comes to engineering, the things you learn along the way are often just as important as the end result.

In the early days of DeepSeek v4 Pro, CUDA vLLM, CUDA SGLang, and CUDA vLLM disaggregated prefill worked great out of the box, proving the strength of the vLLM and SGLang open ecosystems. These inference engines are so fundamental to the global ML ecosystem that both teams have started their own company, Inferact and RadixArk, with each raising hundreds of millions of dollars to continue to fuel ...