
AWS Trainium3 Deep Dive

Dylan Patel doesn't just report on a new chip; he exposes the quiet, industrial-scale war being waged to dismantle the monopoly on artificial intelligence infrastructure. While the industry fixates on raw speed, this deep dive reveals that the real battleground is total cost of ownership and the software ecosystem that locks developers in. The most startling claim here isn't about a single transistor, but the strategic pivot by the world's largest cloud provider to treat its custom silicon not as a proprietary fortress, but as an open platform designed to erode the very moat its competitors rely on.

The Hardware Chess Game

Patel frames the launch of Trainium3 not as a mere incremental upgrade, but as a "step-function improvement" that forces the market to reconsider who holds the keys to the kingdom. He writes, "Amazon has had the longest and broadest history of custom silicon in the datacenter. While they were behind in AI for quite some time, they are rapidly progressing to be competitive." This is a crucial reframing: the narrative of the cloud giant playing catch-up is over; the narrative of aggressive competition has begun.


The technical specifics are dense, but Patel's analysis cuts through the jargon to explain why the hardware choices matter. He notes that the new chip moves to a "switched fabric that is somewhat similar to the GB200 NVL36x2 topology," a design choice driven by the need to support complex Mixture-of-Experts models. This mirrors a recurring pattern in computing history, much like how scaling a sprawling factory in Factorio forces players to rethink resource routing and parallelism rather than simply adding more machines. The hardware is no longer just about raw power; it's about how that power is connected.
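To make the tradeoff concrete, here is a minimal Python sketch contrasting worst-case path lengths on Trn2's 4x4x4 3D torus with a single-stage switched fabric. The hop counts are illustrative, not measured figures, and the two-hop switch assumption (up to the switch, down to the destination) is a simplification; the point is that MoE all-to-all traffic is exactly the pattern that punishes long torus paths.

```python
# Rough sketch: why a switched fabric helps MoE all-to-all traffic.
# Compares a 4x4x4 3D torus (as on Trn2) against a single-stage
# switched fabric. Hop counts are illustrative, not measured numbers.

def torus_hops(a, b, k=4):
    """Shortest-path hops between nodes a and b on a k x k x k torus."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))

def max_torus_hops(k=4, dims=3):
    # The farthest pair sits k//2 away in every dimension.
    return dims * (k // 2)

SWITCH_HOPS = 2  # any-to-any through one switch stage: up, then down

print(max_torus_hops())               # 6 hops worst case on the torus
print(torus_hops((0, 0, 0), (2, 2, 2)))  # 6: a farthest-apart pair
print(SWITCH_HOPS)                    # 2 hops for every pair on the fabric
```

On the torus, all-to-all exchanges between experts traverse up to six links and contend with traffic passing through intermediate chips; on the switched fabric, every pair is uniformly two hops, which is why Patel ties the topology change directly to MoE performance.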

Patel highlights a critical supply chain reality: "AWS is switching to Hynix and Micron to achieve much faster speeds," abandoning the sub-par memory speeds that hampered the previous generation. This move underscores a broader trend where hyperscalers are no longer passive consumers of components but active shapers of the semiconductor supply chain. By demanding higher pin speeds and better memory density, they are forcing vendors to innovate faster than the traditional market cycle would allow.
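The pin-speed figures cited later in this piece (5.7 Gbps versus 9.6 Gbps for HBM3E) translate directly into bandwidth. Below is a back-of-envelope Python sketch: the 1024-bit per-stack interface is the standard JEDEC HBM width, while the four-stack count is inferred from the quoted 144 GB capacity and 12-high stacks of 24 Gb dies, not stated outright in the article.

```python
# Back-of-envelope: what the quoted HBM3E pin speeds imply for
# per-stack and per-chip bandwidth. Assumes the standard 1024-bit
# HBM interface per stack; the 36 GB/stack figure (12-high x 24Gb dies)
# and the resulting 4-stack count are inferred from the article's
# 144 GB capacity, not stated in it.

BITS_PER_STACK = 1024  # JEDEC HBM interface width, in data pins

def stack_bw_gbs(pin_speed_gbps):
    """Per-stack bandwidth in GB/s for a given per-pin data rate."""
    return pin_speed_gbps * BITS_PER_STACK / 8

capacity_gb, stack_gb = 144, 36      # 12-high stacks of 24Gb dies
stacks = capacity_gb // stack_gb     # -> 4 stacks per chip

for pin in (5.7, 9.6):               # slower vs. upgraded pin speed
    per_stack = stack_bw_gbs(pin)
    print(f"{pin} Gbps/pin -> {per_stack:.0f} GB/s per stack, "
          f"{per_stack * stacks / 1000:.1f} TB/s per chip")
# 5.7 Gbps -> ~730 GB/s/stack, ~2.9 TB/s/chip
# 9.6 Gbps -> ~1229 GB/s/stack, ~4.9 TB/s/chip
```

Under these assumptions, the jump from 5.7 to 9.6 Gbps pin speeds is worth roughly 2 TB/s of additional memory bandwidth per chip, which is why the memory vendor switch is framed as strategic rather than incidental.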

"Rather than committing to any single architectural design, AWS maximizes operational flexibility."

This flexibility is the core of the strategy. Patel argues that AWS's approach is to "optimize for perf per TCO" by refusing to be locked into a single vendor or standard. They are willing to use three different scale-up switch solutions over the lifecycle of the chip, pivoting from PCIe to a future UALink standard. This is a stark contrast to the rigid, single-vendor ecosystems that have dominated the industry. Critics might note that such a fragmented approach could confuse developers, but Patel suggests the long-term gain in cost efficiency outweighs the short-term friction.

The Software Moat and the Open Source Gambit

The most provocative section of Patel's analysis concerns the software strategy. For years, the industry has accepted that the only viable path to AI development is through a specific, proprietary software stack. Patel challenges this, writing, "We believe the CUDA Moat isn't constructed by the Nvidia engineers that built the castle, but by the millions of external developers that dig the moat around that castle by contributing to the CUDA ecosystem." He then posits that AWS is attempting to replicate this dynamic by open-sourcing its own stack.

The plan involves a massive, multi-phase shift. Phase 1 includes releasing a native PyTorch backend and open-sourcing the compiler for their kernel language, NKI. Phase 2 involves open-sourcing their XLA graph compiler and JAX software stack. Patel argues, "By open sourcing most of their software stack, AWS will help broaden adoption and kick-start an open developer ecosystem." This is a bold move that seeks to turn the very mechanism of lock-in against the incumbent. If developers can easily move between hardware platforms without rewriting their code, the power of the proprietary stack evaporates.
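The practical payoff of a native PyTorch backend is that model code stops caring which accelerator it runs on. The sketch below is illustrative only: the "neuron" device string and the attribute probe are hypothetical stand-ins (the article does not specify how the backend will be exposed), and only standard PyTorch calls are used.

```python
# Illustrative only: what a native PyTorch backend buys developers.
# The "neuron" device string and torch.neuron probe below are
# hypothetical; the point is that with an upstream backend, model code
# stays hardware-agnostic and only device selection changes.

import torch
import torch.nn as nn

def pick_device() -> torch.device:
    # Fall back gracefully: hypothetical Trainium backend, then CUDA, then CPU.
    if getattr(torch, "neuron", None) is not None:  # hypothetical probe
        return torch.device("neuron")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()
model = nn.Linear(1024, 1024).to(device)   # unchanged across backends
x = torch.randn(8, 1024, device=device)
loss = model(x).square().mean()
loss.backward()                            # same autograd path everywhere
```

If porting a model really does collapse to swapping a device string, the switching cost that sustains the incumbent's moat largely disappears; that is the bet behind the open-source phases Patel describes.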

However, Patel is not blind to the current limitations. He points out that "Trainium3 will only have Day 0 support for Logical NeuronCore (LNC) = 1 or LNC = 2," which caters to elite engineers but alienates the wider research community that prefers LNC=8. He notes, "Unfortunately, AWS does not plan on supporting LNC=8 until mid-2026." This delay is a significant vulnerability. It creates a window where the incumbent can maintain its dominance by offering a more mature toolset to the broader market. AWS's strategy relies on the assumption that the hardware cost advantage will be enough to carry the platform through this software gap.
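For readers unfamiliar with the LNC setting, the sketch below shows what it changes: how many device endpoints the framework sees per chip. The per-chip physical core count of 8 is an assumption for illustration; the article only discusses the LNC=1/2 versus LNC=8 configurations.

```python
# Sketch of what the Logical NeuronCore (LNC) setting controls: each
# logical device aggregates `lnc` physical NeuronCores, so higher LNC
# means fewer, bigger devices. The physical core count (8) is assumed
# for illustration, not taken from the article.

PHYSICAL_CORES = 8  # assumed per-chip core count, for illustration

def visible_devices(lnc: int, physical: int = PHYSICAL_CORES) -> int:
    """Number of logical devices the framework sees per chip."""
    assert physical % lnc == 0, "LNC must evenly divide physical cores"
    return physical // lnc

for lnc in (1, 2, 8):
    print(f"LNC={lnc}: {visible_devices(lnc)} logical device(s) per chip")
# LNC=1 -> 8 fine-grained devices (expert users tuning every core)
# LNC=8 -> 1 device per chip: the simple single-device view that most
#          researchers prefer, and the mode AWS won't ship until mid-2026
```

This is why the LNC=8 delay matters: until a chip presents as one device, every researcher's script inherits parallelism bookkeeping that the incumbent's stack hides by default.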

"Jensen needs to ACCELERATE even faster than he has over the past 4 months. In the same way that Intel stayed complacent in the CPU while others like AMD and ARM raced ahead, if Nvidia stays complacent they will lose their pole position even more rapidly."

This warning is the piece's most urgent call to action. Patel draws a parallel to the CPU market, where complacency led to a rapid loss of market share. The implication is clear: the era of guaranteed dominance is ending. The competition is no longer just about who has the fastest chip today, but who can build the most adaptable, cost-effective ecosystem for tomorrow.

Bottom Line

Patel's analysis succeeds in shifting the focus from the hype of individual chip specs to the structural dynamics of the AI infrastructure market. The strongest part of the argument is the identification of software openness as the primary weapon against hardware monopolies. The biggest vulnerability remains the timeline; the gap between elite engineering support and mass-market readiness could allow the incumbent to solidify its position before the new ecosystem matures. Readers should watch closely to see if the open-source strategy can truly attract the developer community needed to break the lock-in, or if the hardware cost savings alone will be enough to drive adoption.

Deep Dives

Explore these related deep dives:

  • Factorio

    Linked in the article (10 min read)

  • High Bandwidth Memory

    The article heavily discusses HBM3E specifications including pin speeds (5.7Gbps vs 9.6Gbps), stack heights (12-high), and memory capacity (144GB per chip). Understanding HBM architecture explains why these specifications matter for AI accelerator performance and the technical tradeoffs involved.

Sources

AWS Trainium3 Deep Dive

by Dylan Patel · SemiAnalysis

Trainium3: A New Challenger Approaching!

Hot on the heels of our 10K word deep dive on TPUs, Amazon launched Trainium3 (Trn3) general availability and announced Trainium4 (Trn4) at its annual AWS re:Invent. Amazon has had the longest and broadest history of custom silicon in the datacenter. While they were behind in AI for quite some time, they are rapidly progressing to be competitive. Last year we detailed Amazon’s ramp of its Trainium2 (Trn2) accelerators aimed at internal Bedrock workloads and Anthropic’s training/inference needs.

Since then, through our datacenter model and accelerator model, we detailed the huge ramp that led to our blockbuster call that AWS would accelerate on revenue.

Today, we are publishing our next technical bible on the step-function improvement that is the Trainium3 chip, microarchitecture, system and rack architecture, scale up, profilers, software platform, and datacenter ramps. This is the most detailed piece we've written on an accelerator and its hardware/software; on desktop, there is a table of contents that makes it possible to review specific sections.

Amazon Basics GB200 aka GB200-at-Home

With Trainium3, AWS remains laser-focused on optimizing performance per total cost of ownership (perf per TCO). Their hardware North Star is simple: deliver the fastest time to market at the lowest TCO. Rather than committing to any single architectural design, AWS maximizes operational flexibility. This extends from their work with multiple partners on the custom silicon side to the management of their own supply chain to multi-sourcing multiple component vendors.

On the systems and networking front, AWS is following an “Amazon Basics” approach that optimizes for perf per TCO. Design choices such as whether to use a 12.8T, 25.6T or a 51.2T bandwidth scale-out switch or to select liquid vs air cooling are merely a means to an end to provide the best TCO for the given client and the given datacenter.

For the scale-up network, while Trn2 only supports a 4x4x4 3D Torus mesh scaleup topology, Trainium3 adds a unique switched fabric that is somewhat similar to the GB200 NVL36x2 topology with a few key differences. This switched fabric was added because a switched scaleup topology has better absolute performance and perf per TCO for frontier Mixture-of-Experts (MoE) model architectures.

Even for the switches used in this scale-up architecture, AWS has decided to not decide: they will go with three different scale-up switch solutions over the lifecycle of Trainium3, starting with a 160 lane, 20 ...