
How Shopify Prepares for Black Friday

Most tech coverage treats Black Friday as a marketing spectacle, but Alex Xu reframes it as the ultimate stress test for modern distributed systems. This piece stands out because it moves beyond the usual "record-breaking numbers" to reveal the nine-month, multi-track engineering marathon required to keep commerce flowing when the world's attention turns to buying. For busy leaders, the takeaway isn't just about Shopify; it's a masterclass in how to architect resilience when failure means immediate financial loss for millions of merchants.

The Three-Track Framework

Xu argues that the secret to handling massive scale isn't a single silver bullet, but a disciplined, parallel workflow. He writes, "Shopify never uses BFCM as a release deadline. Every architectural change and migration happens months before the critical window." This is a crucial distinction that separates mature engineering cultures from those that gamble on go-live dates. By decoupling deployment from the event, the team avoids the panic of last-minute patches.

The author details a "Three-Track Framework" where Capacity Planning, Infrastructure Roadmap, and Risk Assessments run simultaneously. Xu explains that "risk findings might reveal capacity gaps the team didn't account for," creating a continuous feedback loop rather than a linear checklist. This approach mirrors the philosophy found in deep dives on Extract, Transform, Load (ETL) pipelines, where data integrity depends on anticipating bottlenecks before they occur. The strength of Xu's framing here is its emphasis on proactive discovery over reactive firefighting.

Critics might argue that such a rigid, nine-month preparation cycle is too costly for smaller players, but the article suggests that the cost of downtime during peak traffic far outweighs the investment in preparation.

"The key insight is that the team cannot wait until BFCM to discover the capacity limits. It takes months of preparation to scale infrastructure and optimize code."

Chaos as a Strategy

The piece is at its most compelling when it discusses "Game Days," which Xu describes as "chaos engineering exercises that intentionally simulate production failures at the BFCM scale." This isn't just theoretical; it's a deliberate strategy to break things before the customers do. The team injects network faults, busts caches, and randomizes navigation to mimic real user behavior.
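
The idea of injecting faults at a configured rate can be sketched in a few lines. This is a minimal illustration, not Shopify's tooling; the wrapper, its knobs, and the `checkout` stub are all hypothetical, and a real Game Day would drive fault rates from a central experiment controller.

```python
import random

def with_fault_injection(func, error_rate=0.1, rng=None):
    """Wrap a service call so a fraction of invocations fail.

    error_rate is an illustrative knob: the probability that any given
    call raises an injected fault instead of running normally.
    """
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < error_rate:
            raise ConnectionError("injected network fault")
        return func(*args, **kwargs)

    return wrapped

def checkout(order_id):
    # Stand-in for a real critical-journey call (checkout, payment, etc.).
    return f"order {order_id} confirmed"

# Seeded RNG so the exercise is reproducible run-to-run.
flaky_checkout = with_fault_injection(checkout, error_rate=0.5,
                                      rng=random.Random(42))

results = {"ok": 0, "fault": 0}
for i in range(1000):
    try:
        flaky_checkout(i)
        results["ok"] += 1
    except ConnectionError:
        results["fault"] += 1
```

The point of the exercise is what happens around the wrapper: does the caller retry, shed load, or surface a clean error to the buyer?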

Xu notes that these exercises focus on "critical journeys" like checkout and payment processing, because "if these break during BFCM, merchants lose sales immediately." This focus on business-critical paths rather than just system uptime is a sophisticated shift in perspective. It acknowledges that a server being "up" is useless if the customer can't complete a transaction. The documentation of these failures feeds into a "Resiliency Matrix," a centralized record of vulnerabilities and recovery procedures that acts as a living roadmap for system hardening.
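
The article doesn't show the Resiliency Matrix's schema, but its role, a centralized record linking failure modes to recovery procedures, suggests something like the following sketch. Field names here are guesses for illustration only.

```python
from dataclasses import dataclass

@dataclass
class ResiliencyEntry:
    """One row of a hypothetical resiliency matrix (field names are guesses)."""
    component: str
    failure_mode: str
    impact: str
    recovery_procedure: str
    verified_in_game_day: bool = False

matrix = [
    ResiliencyEntry("checkout", "payment gateway timeout",
                    "buyers cannot complete purchase",
                    "fail over to secondary processor",
                    verified_in_game_day=True),
    ResiliencyEntry("analytics API", "connection pool exhaustion",
                    "merchant dashboards go stale",
                    "shed load and shorten acquire timeouts"),
]

# A living roadmap: entries not yet exercised in a Game Day are the backlog.
unverified = [e.component for e in matrix if not e.verified_in_game_day]
```

Treating the matrix as queryable data, rather than a static document, is what lets it drive the next round of hardening work.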

This methodology draws a parallel to historical chaos engineering practices, similar to how Netflix pioneered the concept of the "Chaos Monkey" to ensure their streaming service could survive region-wide outages. Xu's reporting confirms that this is no longer experimental; it is the baseline for enterprise-grade reliability.

The Analytics Blind Spot

A particularly insightful section addresses the unique challenge of 2025: preparing a new analytics platform that had never seen holiday traffic. Xu writes, "They were preparing for BFCM on APIs that had never seen holiday traffic." This asymmetry created a dangerous blind spot where historical data models were useless.

To solve this, the team ran specific Game Days for the analytics infrastructure, simulating increased loads and database latency. The results were stark: they identified issues like memory bloat in the API layer and connection pool exhaustion. Xu explains that "timeouts were too long, meaning connections would get stuck waiting," a subtle bug that only manifests under extreme pressure. By tuning these parameters and splitting API requests across different load balancers, they prevented a potential collapse of merchant visibility during their most critical sales period.
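
The connection-pool failure mode described here, long acquire timeouts leaving callers stuck waiting, is easy to demonstrate in miniature. This sketch is not Shopify's code; it uses a plain bounded queue to show why a short acquire timeout converts a hang into a fast, handleable failure.

```python
import queue

class ConnectionPool:
    """Minimal bounded pool: acquire() fails fast instead of waiting forever."""

    def __init__(self, size, acquire_timeout_s):
        self._conns = queue.Queue(maxsize=size)
        for i in range(size):
            self._conns.put(f"conn-{i}")  # stand-ins for real DB connections
        self._timeout = acquire_timeout_s

    def acquire(self):
        try:
            return self._conns.get(timeout=self._timeout)
        except queue.Empty:
            # With a long timeout, this caller would sit blocked instead.
            raise TimeoutError("pool exhausted; failing fast")

    def release(self, conn):
        self._conns.put(conn)

pool = ConnectionPool(size=2, acquire_timeout_s=0.05)
a = pool.acquire()
b = pool.acquire()
try:
    pool.acquire()   # third caller times out quickly rather than hanging
    exhausted = False
except TimeoutError:
    exhausted = True
pool.release(a)
c = pool.acquire()   # released connections are immediately reusable
```

Under extreme load the difference between a 50 ms and a 30 s acquire timeout is the difference between graceful degradation and a pile-up of stuck requests.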

This section highlights a common pitfall in system design: assuming that a new architecture will behave like the old one under stress. The article's value lies in exposing the specific, granular fixes—like increasing Kafka partitions for data freshness—that made the difference.
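
The Kafka fix mentioned above comes down to a sizing calculation: if a single partition's consumer can only sustain so much throughput, freshness at peak load dictates a minimum partition count. The numbers below are illustrative, not Shopify's; the per-partition figure is whatever your own benchmarks show.

```python
import math

def partitions_needed(peak_msgs_per_sec, per_partition_msgs_per_sec,
                      headroom=1.5):
    """Estimate partition count from target throughput.

    headroom leaves slack for bursts so consumers don't fall behind
    (which is what shows up to merchants as stale analytics).
    """
    return math.ceil(peak_msgs_per_sec * headroom / per_partition_msgs_per_sec)

# Hypothetical numbers: 600k msgs/s at peak, 10k msgs/s per partition.
n = partitions_needed(peak_msgs_per_sec=600_000,
                      per_partition_msgs_per_sec=10_000)
```

Because partitions are also the unit of consumer parallelism, under-provisioning them caps throughput no matter how many consumers you add.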

The Scale Test Reality

Finally, Xu details the "Scale Tests," which validate the entire platform working together. The team ran five major tests, ramping up to 150% of the previous year's load. "By the fourth test, Shopify hit 146 million requests per minute and over 80,000 checkouts per minute," Xu reports. These tests were so large they required coordination with cloud providers and were run at night to avoid impacting shared infrastructure.

The most revealing moment came when the team added authenticated checkout flows to the test scenarios. Xu notes, "Modeling real logged-in buyers exposed rate-limiting code paths that anonymous browsing never touches." This finding underscores the importance of simulating real user behavior rather than just raw traffic volume. Even though authenticated flows were a small percentage of traffic, they revealed bottlenecks that would have caused catastrophic failures during the actual event.

The operational plan for the actual weekend relies on this preparation, with 24/7 incident response teams and live system tuning. Xu concludes that "preparation gets you ready, but operational excellence keeps you steady," a reminder that no amount of testing can replace human vigilance during the event.

"Modeling real logged-in buyers exposed rate-limiting code paths that anonymous browsing never touches."

Bottom Line

Alex Xu's coverage is a definitive guide to the invisible engineering that powers the world's biggest shopping event, proving that resilience is a process, not a feature. The piece's greatest strength is its refusal to romanticize the "peak" moment, instead focusing on the grueling, unglamorous work of failure simulation and capacity planning. The only caveat is the sheer scale of resources required, which may feel out of reach for smaller organizations, though the principles of proactive testing remain universally applicable.

Deep Dives

Explore these related deep dives:

  • Extract, transform, load

    Linked in the article (18 min read)

  • Chaos engineering

    The article extensively discusses Shopify's 'Game Days' which are chaos engineering exercises - deliberately injecting faults to test system resilience. Understanding the principles and history of chaos engineering (pioneered by Netflix) provides crucial context for why companies intentionally break their own systems.

Sources

How Shopify Prepares for Black Friday

Note: This article is written in collaboration with the Shopify engineering team. Special thanks to the Shopify engineering team for sharing details with us about their Black Friday Cyber Monday preparation work and also for reviewing the final article before publication. All credit for the technical details shared in this article goes to the Shopify Engineering Team.

Black Friday Cyber Monday (BFCM) 2024 was massive for Shopify. The platform processed 57.3 petabytes of data, handled 10.5 trillion database queries, and peaked at 284 million requests per minute on its edge network. On app servers alone, they handled 80 million requests per minute while pushing 12 terabytes of data every minute on Black Friday.

Here’s the interesting part: this level of traffic is now the baseline for Shopify. And BFCM 2025 was even bigger, serving 90 petabytes of data, handling 1.75 trillion database writes with peak performance at 489 million requests per minute. This is why Shopify rebuilt its entire BFCM readiness program from scratch.

The preparation involved thousands of engineers working for nine months, running five major scale tests.

In this article, we will look at how Shopify prepared for success during the Super Bowl of commerce.

The Three-Track Framework

Shopify’s BFCM preparation started in March with a multi-region strategy on Google Cloud.

The engineering team organized the work into three parallel tracks that run simultaneously and influence each other:

Capacity Planning involves modeling traffic patterns using historical data and merchant growth projections. The team submits these estimates to their cloud providers early so the providers can ensure they have enough physical infrastructure available. This planning defines how much computing power Shopify needs and where it needs to be located geographically.
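
Modeling traffic from historical data and growth projections can be reduced, at its simplest, to last year's peak times a growth factor times a safety margin. The sketch below uses the article's 2024 figure of 284 million requests per minute; the growth rate and margin are illustrative guesses, and Shopify's real models work from richer merchant-level projections.

```python
def project_peak(last_peak_rpm, growth_rate, safety_margin):
    """Project next peak requests-per-minute from last year's observed peak.

    growth_rate: expected year-over-year traffic growth (e.g. 0.30 = +30%).
    safety_margin: extra headroom so the estimate isn't the ceiling.
    """
    return round(last_peak_rpm * (1 + growth_rate) * safety_margin)

# 284M rpm is from the article; 30% growth and 1.3x margin are assumptions.
target_rpm = project_peak(284_000_000, growth_rate=0.30, safety_margin=1.3)
```

Submitting a number like this to cloud providers months in advance is what gives them time to stage the physical capacity behind it.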

The Infrastructure Roadmap is where the team reviews their technology stack, evaluates what architectural changes are needed, and identifies system upgrades required to ...