Alex Xu delivers a rare, granular look inside the engine room of a tech giant, arguing that the most scalable solution for real-time personalization isn't a specialized database, but a clever repurposing of existing infrastructure. While the industry chases the latest graph database hype, Xu reveals how Netflix bypassed it entirely to handle billions of interactions, proving that operational maturity often trumps architectural novelty.
The Myth of the Perfect Tool
Xu opens by dismantling the assumption that a microservices architecture naturally leads to a unified view of the customer. He notes that "when each service manages its own data, information can become siloed," creating a fragmented reality where a user's gaming habits and viewing history exist in separate, unconnected universes. This is a critical insight for any organization scaling beyond a single product line; the technical debt of data silos grows exponentially as business models diversify.
The author argues that traditional data warehouses were insufficient because "the data lands in different tables and processes at different times," making real-time stitching impossible. Instead, Xu details Netflix's pivot to a graph representation, not because it was the trendy choice, but because "graphs enable fast relationship traversals without expensive database joins." This reframing is powerful: it positions the graph not as a database type, but as a logical model for connectivity that can be implemented in unexpected ways.
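Xu's point that "graphs enable fast relationship traversals without expensive database joins" can be illustrated with a toy adjacency-list graph; the data and function names below are purely illustrative, not drawn from Netflix's system:

```python
# Toy adjacency-list graph: relationships are direct references, so a
# traversal is a chain of lookups rather than a relational table join.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": [],
    "dave": [],
}

def friends_of_friends(user):
    # Two-hop traversal: each hop is a constant-time dict lookup per
    # neighbor, with no join or scan over unrelated rows.
    return {fof for friend in follows[user] for fof in follows[friend]}

result = friends_of_friends("alice")
```

In a relational model the same query would join a follows table to itself; in the graph model the connectivity is the data structure, which is the logical reframing Xu describes.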
"Sometimes the right solution is not the obvious one. Netflix could have used purpose-built graph databases, but chose to emulate graph capabilities using key-value storage based on operational realities like internal expertise and existing platform support."
This admission is the piece's intellectual anchor. It suggests that in high-scale engineering, the "best" technology is often the one your team knows how to operate at 3 a.m., not the one with the most impressive feature list.
Building the Pipeline
The article then dives into the mechanics of the Real-Time Distributed Graph (RDG). Xu explains that the ingestion backbone relies on Apache Kafka, chosen because it offers "durable, replayable streams that downstream processors can consume in real time." He highlights the sheer volume involved, noting that Netflix's applications generate "up to one million messages per second" across various topics. This isn't just about speed; it's about the ability to reconstruct a user's journey the moment they switch from a phone to a TV.
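The "durable, replayable" property Xu emphasizes can be sketched with a toy append-only log in which consumers hold their own offsets, so any processor can rewind the stream and reconstruct a session; this is a minimal illustration of the Kafka consumption model, not Netflix's actual pipeline, and the class and event names are invented:

```python
class ReplayableLog:
    """Toy append-only log illustrating Kafka-style durable, replayable streams."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # Messages are retained after being read, which is what makes
        # replay possible.
        self._messages.append(message)

    def read_from(self, offset):
        # Each consumer chooses its own offset, so the same stream can be
        # replayed from any point by any downstream processor.
        return self._messages[offset:]

log = ReplayableLog()
log.append({"user": "u1", "device": "phone", "event": "play"})
log.append({"user": "u1", "device": "phone", "event": "pause"})
log.append({"user": "u1", "device": "tv", "event": "resume"})

# A late-starting consumer replays the full history to reconstruct the
# user's journey across devices.
journey = [e["event"] for e in log.read_from(0)]
```

The cross-device reconstruction Xu describes falls out of this model: the TV session's processor can read events the phone produced, because the log retains them.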
The processing layer utilizes Apache Flink, which Xu praises for its "strong capabilities around near-real-time event processing." However, he doesn't shy away from the failures encountered during development. The team initially tried a monolithic approach, but "different topics have vastly different volumes and throughput patterns, making tuning impossible." The solution was counter-intuitive: they moved to a one-to-one mapping from topic to job. "While this added operational overhead, each job became simpler to maintain and tune," Xu writes. This is a vital lesson for system architects: sometimes adding complexity in the orchestration layer reduces complexity in the execution layer.
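The one-to-one topic-to-job mapping can be sketched as a registry in which each topic carries its own tuning; the topic names and parameter values below are hypothetical, since the source does not disclose the real job settings:

```python
# Hypothetical per-topic tuning: each job is sized for its own topic's
# volume and throughput pattern, rather than one monolithic job forced
# onto a single compromise configuration.
TOPIC_JOBS = {
    "playback_events": {"parallelism": 64, "batch_size": 5000},  # high volume
    "profile_updates": {"parallelism": 4, "batch_size": 200},    # low volume
    "search_queries":  {"parallelism": 16, "batch_size": 1000},
}

def job_config(topic):
    # One job per topic: tuning one topic's job cannot destabilize another's,
    # at the cost of more jobs to operate.
    return TOPIC_JOBS[topic]
```

The design choice Xu highlights is visible in the shape of the data: the monolithic alternative would collapse this table into a single row that fits none of the topics well.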
Critics might argue that maintaining a large fleet of separate jobs creates an unsustainable operational burden for smaller teams. However, Xu's point is that for an organization of Netflix's size, the cost of tuning a single massive job outweighs the cost of managing many small ones. The trade-off shifts as scale increases.
The Storage Challenge and the KVDAL Solution
Perhaps the most surprising section is the rejection of established graph databases. Xu writes that "native graph databases struggle to scale horizontally for large, real-time datasets," citing how systems like Neo4j and Amazon Neptune degrade under the weight of hundreds of millions of records. He points out that these systems are "not inherently designed for the continuous, high-throughput event streaming workloads critical to Netflix operations."
Instead, Netflix turned to KVDAL, an internal abstraction layer built on Apache Cassandra. Xu describes how they emulated graph relationships using a "two-level map architecture" where records are identified by ID and contain sorted items. "To find all titles a member watched, Netflix retrieves the entire record with one KVDAL lookup," he explains. This approach allowed them to bypass the limitations of specialized graph engines while leveraging the battle-tested scalability of a key-value store.
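The "two-level map" idea can be sketched in a few lines: the first level keys records by ID, the second holds a sorted list of items, so a single key-value lookup returns an entire adjacency list. This is an illustrative emulation of the pattern Xu describes, not KVDAL itself, and the record IDs are invented:

```python
import bisect
from collections import defaultdict

class TwoLevelMap:
    """Sketch of a two-level map emulating graph adjacency on a
    key-value store: record ID -> sorted list of items."""

    def __init__(self):
        self._records = defaultdict(list)  # first level: record_id -> items

    def add_edge(self, record_id, item):
        # Second level: keep items sorted within the record.
        bisect.insort(self._records[record_id], item)

    def lookup(self, record_id):
        # One lookup returns the whole adjacency list -- no join, no
        # per-edge query, mirroring the single-KVDAL-lookup access pattern.
        return list(self._records[record_id])

graph = TwoLevelMap()
graph.add_edge("member:42", "title:stranger_things")
graph.add_edge("member:42", "title:dark")
watched = graph.lookup("member:42")
```

Because the key-value store only ever sees opaque records, any horizontally scalable store (Cassandra, in Netflix's case) can serve as the backend.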
"Netflix found it simpler to emulate graph-like relationships in existing data storage systems rather than adopting specialized graph infrastructure."
This decision underscores a broader trend in infrastructure: the move away from "silver bullet" databases toward composable, general-purpose systems that can be tailored to specific needs. The flexibility of KVDAL allowed Netflix to assign a separate "namespace" for every node and edge type, enabling independent scaling. "If one namespace needs more storage or traffic capacity, you can move it to its own cluster for independent management," Xu notes. This granular control is what allows the system to sustain "two million reads per second and six million writes per second" without collapsing.
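The namespace-per-type scheme Xu describes can be sketched as a routing table that maps each node or edge type to a cluster, so one hot namespace can be relocated without touching the others; the namespace and cluster names below are hypothetical:

```python
class NamespaceRouter:
    """Sketch: each node/edge type gets its own namespace, which can be
    reassigned to a dedicated cluster when it needs independent scaling."""

    def __init__(self):
        self._placement = {}

    def assign(self, namespace, cluster):
        self._placement[namespace] = cluster

    def cluster_for(self, namespace):
        return self._placement[namespace]

router = NamespaceRouter()
router.assign("member_watched_title", "shared-cluster-1")
router.assign("title_metadata", "shared-cluster-1")

# The hot edge type outgrows the shared cluster, so only that namespace
# moves; co-located namespaces are unaffected.
router.assign("member_watched_title", "dedicated-cluster-7")
```

This is the granular control the article credits with sustaining the system's read and write rates: capacity decisions are made per namespace, not per database.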
Bottom Line
Xu's analysis is a masterclass in pragmatic engineering, demonstrating that the most robust systems are often built by adapting proven tools rather than chasing new ones. The piece's greatest strength is its honest accounting of failure, showing how the team learned that "isolation and independence matter at scale." The only vulnerability in the argument is its specificity; while the lessons on scaling and operational reality are universal, the sheer magnitude of Netflix's infrastructure makes the solution difficult to replicate for smaller entities without significant investment. For builders, the takeaway is clear: understand your operational constraints before you choose your database.