How LinkedIn built a next-gen service discovery for 1000s of services

The Invisible Infrastructure That Keeps LinkedIn Running

With hundreds of millions of members loading feeds and sending messages, LinkedIn's speed depends on something users never see: a coordination system that tracks where tens of thousands of software services live at any given moment. Alex Xu documents how LinkedIn replaced a decade-old architecture that was approaching capacity limits with a next-generation design built for extreme scale.

The Coordination Problem

Every modern platform breaks functionality into microservices—small programs handling authentication, messaging, feed generation, and countless other tasks. These services must find each other constantly. Hardcoding addresses fails when servers restart or scale. Service discovery maintains a living directory of IP addresses and port numbers for every active instance.
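At its core, the "living directory" can be sketched as a tiny in-memory registry. This is an illustrative Python sketch of the concept only, not LinkedIn's implementation (which is built on Java and Go infrastructure); the class and method names are my own:

```python
from collections import defaultdict

class ServiceRegistry:
    """Minimal in-memory service discovery directory (illustrative only)."""

    def __init__(self):
        # service name -> {(ip, port), ...} of live instances
        self._instances = defaultdict(set)

    def register(self, service, ip, port):
        """Called by a server instance when it starts up."""
        self._instances[service].add((ip, port))

    def deregister(self, service, ip, port):
        """Called on shutdown, or by a health checker on timeout."""
        self._instances[service].discard((ip, port))

    def resolve(self, service):
        """Called by clients to find every live instance of a service."""
        return sorted(self._instances[service])

registry = ServiceRegistry()
registry.register("feed", "10.0.0.1", 8443)
registry.register("feed", "10.0.0.2", 8443)
registry.deregister("feed", "10.0.0.1", 8443)
print(registry.resolve("feed"))  # only the still-registered instance remains
```

Everything that follows in the article is about making this simple idea survive tens of thousands of services: keeping the directory fresh, available, and fast at global scale.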

Alex Xu writes, "At LinkedIn's scale, with tens of thousands of microservices running across global data centers and handling billions of requests each day, service discovery becomes exceptionally challenging." The system must update in real time, remain highly reliable, and respond within milliseconds.

For ten years, LinkedIn used Apache ZooKeeper as the control plane. ZooKeeper is a coordination service that maintains a centralized registry. Server applications registered endpoint addresses as ephemeral nodes—temporary entries existing only while connections remained active. Clients set watchers on server clusters they needed to call, reading updates when changes occurred.
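The two ZooKeeper mechanisms doing the work here are ephemeral nodes (entries that vanish when the owning session dies) and one-shot watches (a callback fired once per change, which the client must re-arm). The sketch below simulates those semantics in plain Python so the behavior is visible without a running ZooKeeper cluster; it is not the real client API:

```python
class EphemeralRegistry:
    """Simulates ZooKeeper's ephemeral-node semantics (illustrative only):
    entries live only while the owning session does, and watchers get a
    single notification when a watched path's children change."""

    def __init__(self):
        self._nodes = {}     # path -> owning session id
        self._watchers = {}  # parent path -> list of one-shot callbacks

    def create_ephemeral(self, session_id, path):
        """A server registers its endpoint, tied to its session."""
        self._nodes[path] = session_id
        self._fire(path)

    def expire_session(self, session_id):
        """A missed health check drops the session and all its nodes."""
        for path in [p for p, s in self._nodes.items() if s == session_id]:
            del self._nodes[path]
            self._fire(path)

    def watch_children(self, parent, callback):
        """One-shot watch, as in ZooKeeper: re-arm after each event."""
        self._watchers.setdefault(parent, []).append(callback)

    def children(self, parent):
        return sorted(p for p in self._nodes if p.startswith(parent + "/"))

    def _fire(self, path):
        parent = path.rsplit("/", 1)[0]
        for cb in self._watchers.pop(parent, []):
            cb(self.children(parent))

reg = EphemeralRegistry()
events = []
reg.watch_children("/d2/feed", events.append)
reg.create_ephemeral("session-1", "/d2/feed/host1:8443")
reg.watch_children("/d2/feed", events.append)  # one-shot, so re-arm
reg.expire_session("session-1")                # node vanishes with the session
print(events)  # [['/d2/feed/host1:8443'], []]
```

Note how registration and liveness are coupled: the same session that writes the endpoint also keeps it alive, which is exactly the property that becomes dangerous when the health-check queue backs up.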

Three Critical Failures

The ZooKeeper-based architecture had fundamental problems in scalability, compatibility, and extensibility.

On scalability, Alex Xu writes, "Zookeeper is a strong consistency system, meaning it enforces strict ordering over availability. All reads, writes, and session health checks go through the same request queue." When the queue backlog grew, write requests couldn't be processed, sessions dropped due to health check timeouts, and ephemeral nodes disappeared, making healthy capacity vanish from the registry. Site availability suffered.

"As of July 2022, LinkedIn had about 2.5 years of capacity left with a 50 to 100 percent yearly growth rate in cluster size and number of watchers, even after increasing the number of Zookeeper hosts to 80."

The compatibility problems stemmed from LinkedIn's custom D2 schemas, which stood for Dynamic Discovery. These were incompatible with modern technologies like gRPC and Envoy. The logic was implemented primarily in Java, with partial and outdated Python support. Onboarding other languages required rewriting everything from scratch.

Extensibility suffered because no intermediary layer existed between the service registry and application instances, leaving nowhere to build centralized load balancing and making it hard to integrate new registries such as etcd with Kubernetes.

The Next-Gen Architecture

LinkedIn's engineering team designed a new architecture with two separate paths: Kafka for writes and Service Discovery Observer for reads.

Kafka handles application server writes and periodic heartbeats through events called Service Discovery URIs. Each URI contains service name, IP address, port number, health status, and metadata. Kafka is a distributed streaming platform capable of handling millions of messages per second.
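One of these write-path events can be pictured as a small, timestamped record. The field names below are inferred from the description in the article, not LinkedIn's actual schema, and the JSON encoding is just one plausible wire format:

```python
import json
import time

def make_uri_event(service, ip, port, healthy, metadata=None):
    """Builds one Service Discovery URI event as it might be published
    to Kafka. Field names are illustrative, not LinkedIn's schema."""
    return {
        "service": service,
        "ip": ip,
        "port": port,
        "healthy": healthy,
        "metadata": metadata or {},
        "timestamp": time.time(),  # lets consumers discard stale heartbeats
    }

event = make_uri_event("feed", "10.0.0.1", 8443, True, {"zone": "us-west"})
payload = json.dumps(event).encode()  # what a Kafka producer would send
print(event["service"], event["port"])
```

Because servers re-publish these records as periodic heartbeats, the write path needs no long-lived sessions at all; liveness is inferred from the event stream rather than from a ZooKeeper connection.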

The Service Discovery Observer consumes URIs from Kafka and writes them into main memory. Application clients open bidirectional gRPC streams to the Observer, sending subscription requests using the xDS protocol—an industry standard created by the Envoy project. The Observer keeps streams open, pushing changes as they happen rather than waiting for clients to poll.
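The read path described above can be sketched as follows, with plain callbacks standing in for the long-lived gRPC/xDS streams. This is an illustrative Python model of the push-based flow, not LinkedIn's Go implementation:

```python
class Observer:
    """Sketch of the Observer read path: consume URI events into memory
    and push changes to subscribers. Callbacks stand in for the
    bidirectional gRPC/xDS streams described in the article."""

    def __init__(self):
        self._cache = {}    # service -> {(ip, port): True}
        self._streams = {}  # service -> list of push callbacks

    def subscribe(self, service, push):
        """A client opens a 'stream': snapshot now, deltas pushed later."""
        self._streams.setdefault(service, []).append(push)
        push(self.endpoints(service))

    def on_kafka_event(self, event):
        """Consume one Service Discovery URI event (illustrative schema)."""
        eps = self._cache.setdefault(event["service"], {})
        key = (event["ip"], event["port"])
        if event["healthy"]:
            eps[key] = True
        else:
            eps.pop(key, None)
        # Push the change immediately instead of waiting for a poll.
        for push in self._streams.get(event["service"], []):
            push(self.endpoints(event["service"]))

    def endpoints(self, service):
        return sorted(self._cache.get(service, {}))

obs = Observer()
seen = []
obs.subscribe("feed", seen.append)  # initial snapshot: []
obs.on_kafka_event({"service": "feed", "ip": "10.0.0.1", "port": 8443, "healthy": True})
obs.on_kafka_event({"service": "feed", "ip": "10.0.0.1", "port": 8443, "healthy": False})
print(seen)  # [[], [('10.0.0.1', 8443)], []]
```

The in-memory cache is also what makes the availability story work: even if the Kafka consumer stalls, `endpoints` keeps answering from the last known state.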

Alex Xu writes, "As of today, one Observer can maintain 40,000 client streams while sending 10,000 updates per second and consuming 11,000 Kafka events per second." With fleet projections growing to three million instances, LinkedIn will need approximately one hundred Observers.
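The rough capacity math is worth making explicit. Assuming each of the roughly three million projected instances holds about one client stream, and each Observer is run with headroom below its 40,000-stream limit (both the one-stream-per-instance figure and the 75 percent utilization target are my assumptions, not LinkedIn's stated numbers), the fleet size comes out near one hundred:

```python
import math

def observers_needed(client_streams, streams_per_observer, headroom=0.75):
    """Observers required if each runs at `headroom` of its stream limit.
    The 75% utilization target is an assumption for illustration."""
    return math.ceil(client_streams / (streams_per_observer * headroom))

# ~3M instance streams against 40k streams per Observer:
print(observers_needed(3_000_000, 40_000))  # 100
```

The point of the exercise is that the read path scales horizontally: adding Observers adds stream capacity linearly, which ZooKeeper's single consensus-bound cluster could not do.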

Availability Over Consistency

The new architecture represents a fundamental philosophical shift. LinkedIn prioritized availability over consistency because service discovery data only needs to eventually converge. Short-term inconsistency across servers is acceptable. The data must be highly available to the huge fleet of clients.

As Alex Xu puts it, "Even when Kafka experiences significant lag or goes down, Observer continues serving client requests with its cached data, preventing cascading failures." Multiple Observer replicas reach eventual consistency after a Kafka event is consumed and processed on all replicas.

The Observer component is horizontally scalable and written in Go, chosen for high concurrency capabilities. It processes large volumes of client requests, dispatches data updates, and consumes URIs for the entire LinkedIn fleet efficiently.

The Migration Challenge

Rolling out Next-Gen Service Discovery to hundreds of thousands of hosts without impacting live requests required careful planning. LinkedIn needed the new control plane's data to exactly match ZooKeeper's data. All application servers and clients had to pick up the new mechanisms through nothing more than an infrastructure library version bump, and central control was required to switch reads and writes on and off per application.

Alex Xu writes, "First, service discovery is mission-critical, and any error could lead to severe site-wide incidents." Since ZooKeeper was approaching capacity limits, LinkedIn needed to migrate as many applications off ZooKeeper as quickly as possible.

Application states were complex and unpredictable. Rest.li applications had existed for over a decade in very different states regarding dependencies, gRPC SSL, and network access. For many applications, compatibility could not be predicted without actually enabling the read path.

Read and write migrations were coupled. If write wasn't migrated, no data could be read on Next-Gen Service Discovery. If read wasn't migrated, data was still read on ZooKeeper, blocking write migration. Since read path connectivity was vulnerable to application-specific states, read migration had to start first.

Dual Mode Strategy

LinkedIn implemented a dual mode strategy where applications run both old and new systems simultaneously, verifying the new flow behind the scenes. To decouple read and write migration, the new control plane served a combined dataset of Kafka and ZooKeeper URIs, with Kafka as primary source and ZooKeeper as backup.

Alex Xu writes, "This enabled read migration to start independently." In Dual Read mode, an application client reads data from both Next-Gen Service Discovery and ZooKeeper, keeping ZooKeeper as the source of truth for serving traffic. Using an independent background thread, the client tried to resolve traffic as if served by Next-Gen Service Discovery data and reported any errors.
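The Dual Read check can be sketched as a shadow comparison: ZooKeeper's answer serves traffic, while the Next-Gen answer is verified on the side and any disagreement or error is reported rather than allowed to affect requests. The function names below are illustrative (LinkedIn runs the check in a background thread in its infrastructure library):

```python
def dual_read(service, zk_resolve, ngsd_resolve, report_mismatch):
    """Dual Read sketch: ZooKeeper stays the source of truth for traffic
    while the Next-Gen Service Discovery result is verified on the side."""
    serving = zk_resolve(service)       # what traffic actually uses
    try:
        shadow = ngsd_resolve(service)  # what the new system would serve
        if sorted(shadow) != sorted(serving):
            report_mismatch(service, serving, shadow)
    except Exception as err:  # errors on the new path must never hurt traffic
        report_mismatch(service, serving, err)
    return serving

mismatches = []
endpoints = dual_read(
    "feed",
    zk_resolve=lambda s: [("10.0.0.1", 8443)],
    ngsd_resolve=lambda s: [("10.0.0.2", 8443)],  # simulated disagreement
    report_mismatch=lambda *args: mismatches.append(args),
)
print(endpoints, len(mismatches))  # serving data unchanged; 1 mismatch logged
```

The design choice here is that verification is read-only with respect to traffic: a bug in the new path can only ever produce a metric, never a misrouted request.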

LinkedIn built comprehensive metrics to verify connectivity, performance, and data correctness on both client and Observer sides. Connectivity and latency metrics watched connection status and data latencies. Dual Read metrics compared data received from ZooKeeper and Next-Gen Service Discovery to identify mismatches.

What This Means for Microservices

The LinkedIn case illustrates why microservices architectures demand sophisticated coordination infrastructure. Apache ZooKeeper served countless organizations for over a decade as a reliable coordination service. But at extreme scale—with tens of thousands of services, billions of daily requests, and global data center distribution—the strong consistency model becomes a liability.

Critics might note that the dual-mode migration strategy, while safe, adds operational complexity that smaller organizations cannot afford. The Observer component requires Go expertise, gRPC infrastructure, and Kafka operational knowledge. Not every engineering team has these capabilities.

Critics might also observe that separating reads and writes introduces eventual consistency as a trade-off. For service discovery, this works. For other coordination problems—distributed locks, leader election, configuration management—strong consistency remains necessary. ZooKeeper isn't obsolete; it's mismatched for this specific scale.

Alex Xu writes, "Adding Next-Gen Service Discovery as a central control plane between the service registry and clients enables LinkedIn to extend to modern service mesh features." These include centralized load balancing, security policies, and transforming endpoint addresses between IPv4 and IPv6. LinkedIn can integrate with Kubernetes to leverage application readiness probes, converting servers from actively making announcements to passively receiving status probes.
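One concrete example of such an address transformation is representing an IPv4 endpoint in IPv4-mapped IPv6 form (RFC 4291's ::ffff:a.b.c.d convention). This stdlib sketch shows the idea; it is not LinkedIn's code, and their actual translation strategy may differ:

```python
import ipaddress

def to_ipv6(addr):
    """Return an IPv6 representation of an endpoint address: IPv4
    addresses become IPv4-mapped IPv6 (RFC 4291), IPv6 passes through."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ipaddress.IPv6Address(f"::ffff:{addr}"))
    return str(ip)

print(to_ipv6("10.0.0.1"))     # IPv4-mapped IPv6 form of 10.0.0.1
print(to_ipv6("2001:db8::1"))  # already IPv6, unchanged
```

Doing this translation centrally in the control plane means individual applications never need to care which address family a peer advertises.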

Bottom Line

LinkedIn's service discovery migration demonstrates that infrastructure built for yesterday's scale becomes tomorrow's bottleneck—even when it's working perfectly. The engineering discipline here isn't fixing broken systems; it's replacing working systems before they break, while maintaining zero user impact. The dual-mode strategy and comprehensive metrics show mature operational thinking: migrate mission-critical infrastructure without betting the company on the transition.

Sources

How LinkedIn built a next-gen service discovery for 1000s of services

LinkedIn serves hundreds of millions of members worldwide, delivering fast experiences whether someone is loading their feed or sending a message. Behind the scenes, this seamless experience depends on thousands of software services working together. Service Discovery is the infrastructure system that makes this coordination possible.

Consider a modern application at scale. Instead of building one massive program, LinkedIn breaks functionality into tens of thousands of microservices. Each microservice handles a specific task like authentication, messaging, or feed generation. These services need to communicate with each other constantly, and they need to know where to find each other.

Service discovery solves this location problem. Instead of hardcoding addresses that can change as servers restart or scale, services use a directory that tracks where every service currently lives. This directory maintains IP addresses and port numbers for all active service instances.

At LinkedIn’s scale, with tens of thousands of microservices running across global data centers and handling billions of requests each day, service discovery becomes exceptionally challenging. The system must update in real time as servers scale up or down, remain highly reliable, and respond within milliseconds.

In this article, we learn how LinkedIn built and rolled out Next-Gen Service Discovery, a scalable control plane supporting app containers in multiple programming languages.

Disclaimer: This post is based on publicly shared details from the LinkedIn Engineering Team. Please comment if you notice any inaccuracies.

ZooKeeper-Based Architecture

For the past decade, LinkedIn used Apache ZooKeeper as the control plane for service discovery. ZooKeeper is a coordination service that maintains a centralized registry of services.

In this architecture, ZooKeeper allows server applications to register their endpoint addresses in a custom format called D2, which stands for Dynamic Discovery. The system stored the configuration about how RPC traffic should flow as D2 configs and served them ...