Most engineering case studies celebrate the glory of scale, but Alex Xu's analysis of DoorDash's infrastructure pivot reveals a far more urgent truth: reliability is often a casualty of fragmentation, not a lack of technology. While the headline figure of 80 million requests per second grabs attention, the real story is how a single payment latency spike nearly collapsed the entire platform, exposing a dangerous patchwork of inconsistent reliability mechanisms. This is not just a technical deep dive; it is a cautionary tale for any organization transitioning from monoliths to microservices without a unified safety net.
The Cost of Fragmentation
Xu frames the narrative around a specific, devastating failure in mid-2021, where a payment service outage triggered a "retry storm" that brought the platform down for over two hours. He writes, "The incident started with the payment service experiencing high latency. Clients of this service interpreted the slow responses as potential failures and retried their requests. This created a retry storm where each retry added more load to an already overwhelmed service." This description perfectly captures the cascading nature of distributed system failures, where the cure (retries) becomes the poison.
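The amplification Xu describes can be made concrete with a small back-of-the-envelope model. This sketch is illustrative only (the rates and retry budget are assumptions, not DoorDash's numbers): it computes the load a service actually receives when clients retry every timed-out call up to a fixed number of times.

```python
def effective_load(base_rps: float, timeout_rate: float, max_retries: int) -> float:
    """Expected requests per second hitting the service when a fraction
    `timeout_rate` of calls time out and each timed-out call is retried
    up to `max_retries` times (naive retries, no backoff or budget)."""
    total = 0.0
    attempts = base_rps
    for _ in range(max_retries + 1):  # the initial attempt plus each retry wave
        total += attempts
        attempts *= timeout_rate      # only the failed attempts are retried
    return total

# A healthy service (2% timeouts) barely notices three retries...
print(round(effective_load(10_000, 0.02, 3)))  # ~10,204 rps
# ...but one already timing out 90% of the time sees load more than triple:
print(round(effective_load(10_000, 0.90, 3)))  # ~34,390 rps
```

The second case is the retry storm in miniature: the sicker the service, the more extra traffic the retries generate, which is exactly why the cure becomes the poison.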
The author argues that the root cause wasn't a lack of effort, but a lack of standardization. DoorDash had migrated to microservices between 2019 and 2023, and the outage made clear that "reliability features like Layer 7 metrics-aware circuit breakers and load shedding couldn't remain the responsibility of individual application teams." Xu's point is sharp: when every team builds their own safety gear, some will inevitably be left without a parachute. This is a critical insight for leaders who assume that moving to the cloud automatically solves operational complexity.
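To make the safety-gear metaphor concrete, here is a minimal sketch of the kind of circuit breaker the quote refers to. This is not DoorDash's implementation; the threshold and cooldown values, and the simple consecutive-failure policy, are assumptions chosen for clarity.

```python
import time

class CircuitBreaker:
    """Minimal illustrative circuit breaker: trips open after `threshold`
    consecutive failures, sheds load for `cooldown` seconds, then lets a
    single probe request through (the "half-open" state)."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped, or None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None            # half-open: admit one probe;
            self.failures = self.threshold - 1  # one more failure re-trips it
            return True
        return False  # reject immediately instead of piling on retries

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

When every team hand-rolls something like this (or forgets to), thresholds, cooldowns, and half-open behavior drift apart, which is precisely the fragmentation Xu is warning about.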
Critics might argue that enforcing a single reliability model across diverse programming languages and teams stifles innovation or slows down development velocity. However, Xu counters this by showing that the alternative—a fragmented system prone to total collapse—is far more costly. The outage demonstrated that the "patchwork approach to reliability wasn't working," forcing a complete architectural rethink.
The Pragmatic Choice: Custom Over Off-the-Shelf
Perhaps the most distinctive part of Xu's coverage is the decision-making process behind the new architecture. Many companies default to the most popular open-source tools, but DoorDash took a contrarian path. Xu details how they evaluated Istio, the "power user" approach, and rejected it because "this power comes at a cost: complexity, a steep learning curve, and heavier resource usage." With only two engineers initially dedicated to the project, the operational burden of Istio was simply unsustainable.
Similarly, they evaluated Linkerd2, which prioritizes simplicity, but found it lacked essential features such as adaptive concurrency, and it could not integrate with their existing Consul-based service discovery without significant upstream contributions. Xu notes, "The team ultimately chose Envoy as their data plane because it was battle-tested (originally built by Lyft), performant, and had strong community support." This choice reflects a mature understanding that the "best" tool is the one that fits the specific constraints of the organization, not the one with the most marketing hype.
The decision to build a custom control plane rather than adopting a full suite was a calculated risk. Xu explains, "A minimal custom control plane was faster to build and deploy than learning and operating the full complexity of Istio." This aligns with the broader history of cloud-native computing, where the Cloud Native Computing Foundation (CNCF) has seen a shift from monolithic platforms to modular, composable tools. Just as the CNCF's 2015 founding aimed to standardize container orchestration, DoorDash's move to a custom Envoy implementation shows how mature organizations often need to diverge from standard paths to solve unique scale problems.
A counterargument worth considering is whether building a custom control plane creates a long-term maintenance burden that could eventually outpace the complexity of using a standard tool like Istio. Xu acknowledges this by emphasizing the "pragmatic MVP approach," suggesting that the custom solution was a stepping stone to a unified platform rather than a permanent state.
The MVP and the Rollout Strategy
The execution phase is where Xu's commentary shines, detailing a "minimal feature set" focused strictly on the problems that caused the outage. The architecture relied on a Kubernetes mutating webhook to inject Envoy sidecars, but the configuration model was surprisingly primitive: "Users wrote raw Envoy configurations in a GitHub repository, which were versioned and reviewed like code through their standard pull request process." This human-centric approach to infrastructure configuration is a refreshing departure from the trend of fully automated, opaque systems.
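The mutating-webhook mechanism Xu mentions is worth seeing in outline. The sketch below builds the AdmissionReview response such a webhook returns to Kubernetes, appending an Envoy sidecar via a JSON patch; the container name, image tag, and port are hypothetical values, not DoorDash's actual configuration.

```python
import base64
import json

def sidecar_patch_response(uid: str) -> dict:
    """Build a mutating-webhook AdmissionReview response that appends an
    Envoy sidecar container to the incoming pod spec via JSONPatch."""
    patch = [{
        "op": "add",
        "path": "/spec/containers/-",  # "-" appends to the containers array
        "value": {
            "name": "envoy-sidecar",                 # hypothetical name
            "image": "envoyproxy/envoy:v1.28.0",     # hypothetical tag
            "ports": [{"containerPort": 9901}],      # Envoy admin port
        },
    }]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,           # must echo the uid of the admission request
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Because the webhook injects the proxy automatically at pod creation, application teams get the reliability layer without changing their deployment manifests, while the Envoy configuration itself stays human-reviewed in the GitHub repository.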
The rollout strategy was equally deliberate. Xu describes a canary deployment where "developers controlled traffic exposure by adjusting the number of canary replicas." By starting with the payment and notification services—the very culprits of the outage—the team could validate the solution in the most critical areas first. "The rollout happened gradually over two weeks to allow any unknown issues to surface in a controlled manner." This methodical pace stands in stark contrast to the "move fast and break things" mentality that often plagues tech companies.
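Why does adjusting replica counts control traffic exposure? With round-robin load balancing across identical pods, the canary's share of traffic is simply its share of total replicas. A small sketch of that arithmetic (the replica counts are illustrative, not DoorDash's):

```python
def canary_fraction(canary_replicas: int, stable_replicas: int) -> float:
    """Fraction of traffic the canary receives, assuming the load balancer
    spreads requests evenly across all replicas of the service."""
    total = canary_replicas + stable_replicas
    if total == 0:
        raise ValueError("no replicas running")
    return canary_replicas / total

# Ramping a canary from 1 to 5 replicas alongside 20 stable replicas:
for n in (1, 2, 5):
    print(f"{n} canary replica(s) -> {canary_fraction(n, 20):.0%} of traffic")
```

Scaling the canary deployment up or down is therefore a traffic dial that needs no routing rules at all, which fits the MVP's bias toward the simplest mechanism that works.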
The success of this approach is evident in the results. Xu writes, "The initial onboarding was smooth and successful with no delivery loss." This outcome validates the argument that simplicity and focus often outperform feature-rich complexity, especially when dealing with high-stakes infrastructure.
Bottom Line
Alex Xu's analysis succeeds because it prioritizes the human and operational realities of engineering over the allure of buzzwords. The strongest part of the argument is the clear demonstration that a custom, minimal solution often outperforms complex, off-the-shelf platforms when scale and speed are critical. However, the biggest vulnerability remains the long-term sustainability of a custom control plane, which requires continuous investment to avoid becoming a technical debt trap. For busy leaders, the takeaway is clear: reliability is not a feature you buy; it is a system you build, one that must be unified, visible, and rigorously tested before it is ever needed.