How Google manages trillions of authorizations with Zanzibar

Alex Xu doesn't just explain how Google handles permissions; he shows why "instant access" is a carefully engineered illusion that is hard to sustain at the scale of billions of users. While most technical deep dives focus on database speed, Xu argues that the true bottleneck isn't storage but the difficulty of ensuring that a revoked right disappears before the next byte of content is served. This piece matters now because as AI agents and third-party integrations multiply, the "plumbing" of trust is becoming a single point of failure for the internet economy.

The Core Problem: When Speed Meets Safety

Xu frames the challenge not as a storage issue, but as a temporal one. He writes, "The challenge multiplies at Google's scale. For reference, Zanzibar stores over two trillion permission records and serves them from dozens of data centers worldwide." This staggering volume forces a rethinking of how we define "access." In a small app, you check a list. At Google, a single user action can trigger tens or hundreds of checks, and the lists keep changing while copies of them in data centers around the world may still be stale.

The author highlights a critical correctness flaw that Google calls the "new enemy" problem. "Consider the scenario where we remove someone from a document's access list, then add new content to that document," Xu explains. "If the system uses stale permission data, the person who was just removed might still see the new content." This is the crux of the argument: in a distributed world, "now" is different for every server. If the system doesn't track the exact order of events, security collapses into a race condition.

"Any delay in these checks directly impacts user experience."

Xu's framing here is sharp. He forces the reader to realize that security isn't just about being correct; it's about being correct fast enough that the user never notices the complexity. A counterargument worth considering is whether this level of rigor is necessary for smaller organizations, but Xu's point is that the principles of consistency apply regardless of scale, even if the implementation differs.

The Data Model: Tuples Over Lists

To solve this, the article details a shift from rigid access control lists to a flexible data model. Xu describes how Zanzibar represents all permissions as "relation tuples, which are simple statements about relationships between objects and users." He illustrates this with a simple format: object, relation, user. This abstraction allows the system to handle complex hierarchies without duplicating data.
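
The object, relation, user format can be sketched in a few lines of Python. This is a minimal illustration, not Zanzibar's implementation: the `RelationTuple` type, the `check` function, the `"type:id"` naming, and the `"group:eng#member"` notation for "members of a group" are all assumptions made for the example.

```python
from typing import NamedTuple, Optional, Set, Tuple


class RelationTuple(NamedTuple):
    obj: str       # hypothetical naming, e.g. "doc:readme"
    relation: str  # e.g. "viewer"
    user: str      # a user ("user:alice") or a userset ("group:eng#member")


def check(tuples: Set[RelationTuple], obj: str, relation: str, user: str,
          _seen: Optional[Set[Tuple[str, str]]] = None) -> bool:
    """Return True if `user` has `relation` on `obj`, following usersets."""
    seen = _seen if _seen is not None else set()
    if (obj, relation) in seen:  # already explored; also guards against cycles
        return False
    seen.add((obj, relation))
    for t in tuples:
        if t.obj != obj or t.relation != relation:
            continue
        if t.user == user:  # direct grant
            return True
        if "#" in t.user:   # userset: "members of group X"
            group_obj, group_rel = t.user.split("#", 1)
            if check(tuples, group_obj, group_rel, user, seen):
                return True
    return False


tuples = {
    RelationTuple("doc:readme", "viewer", "group:eng#member"),
    RelationTuple("group:eng", "member", "user:alice"),
}
```

Adding or removing a `("group:eng", "member", ...)` tuple changes who can view the document without touching the document's own tuple, which is the point of the model.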

"Instead of listing every member of a group individually on a document, we can create one tuple that says 'members of the Engineering group can view this document,'" Xu writes. "When the Engineering group membership changes, the document permissions automatically reflect those changes." This is a profound simplification. It moves the logic from the data storage to a configuration language, allowing rules to be composed dynamically.

The author highlights how this enables permission inheritance without data duplication. "Rather than duplicating the viewer list on every document, we write a rule saying that to check who can view a document, look up its parent folder, and include that folder's viewers." This approach transforms authorization from a static lookup into a dynamic calculation, which is essential for services like YouTube where a video's access might depend on its channel, its playlist, and its parent organization simultaneously.
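
The "look up its parent folder" rule can be illustrated with a small sketch. The dictionaries `viewers` and `parent` and the function `can_view` are hypothetical stand-ins; Zanzibar expresses such rules in its own configuration language, which is not reproduced here.

```python
from typing import Dict, Optional, Set

# Direct viewer grants per object (hypothetical data).
viewers: Dict[str, Set[str]] = {
    "folder:plans": {"user:alice"},
    "doc:roadmap": {"user:bob"},
}

# Each object's parent, if it has one (hypothetical data).
parent: Dict[str, str] = {"doc:roadmap": "folder:plans"}


def can_view(obj: str, user: str) -> bool:
    """Viewers of an object are its direct viewers plus its ancestors' viewers."""
    current: Optional[str] = obj
    visited: Set[str] = set()
    while current is not None and current not in visited:
        visited.add(current)  # guard against accidental parent cycles
        if user in viewers.get(current, set()):
            return True
        current = parent.get(current)  # climb to the parent folder, if any
    return False
```

Here `user:alice` is never listed on `doc:roadmap`, yet the check succeeds because the rule consults the parent folder at evaluation time instead of copying its viewer list.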

Handling Consistency with Ordering

The most sophisticated part of Xu's analysis is the solution to the "new enemy" problem. He introduces the concept of "zookies," tokens that encode timestamps to ensure data freshness. "When an application saves new content, it requests an authorization check from Zanzibar. If authorized, Zanzibar returns a zookie encoding the current timestamp, which the application stores with the content," he explains.

This mechanism relies on Google Spanner's ability to provide external consistency across the globe. "Since the timestamp came from after any permission changes, Zanzibar will see those changes when performing the check," Xu notes. The brilliance lies in the flexibility: the system doesn't demand an exact timestamp, but rather a minimum freshness. "It specifies the minimum required freshness, not an exact timestamp," he writes. "Zanzibar can use any timestamp equal to or fresher than required, enabling performance optimizations."
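
A toy model makes the "minimum freshness" idea concrete. Everything here is an assumption for illustration: a simple integer counter stands in for Spanner timestamps, the zookie is a bare integer instead of an opaque token, and `AclStore`, `content_write_zookie`, and `check` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set, Tuple


@dataclass
class AclStore:
    """Versioned viewer list for one document, keyed by a logical timestamp."""
    clock: int = 0
    history: List[Tuple[int, FrozenSet[str]]] = field(
        default_factory=lambda: [(0, frozenset())])

    def write(self, viewers: Set[str]) -> int:
        self.clock += 1
        self.history.append((self.clock, frozenset(viewers)))
        return self.clock

    def snapshot_at(self, ts: int) -> FrozenSet[str]:
        """Viewer list as of timestamp `ts` (latest version not newer than ts)."""
        latest: FrozenSet[str] = frozenset()
        for version_ts, viewers in self.history:
            if version_ts <= ts:
                latest = viewers
        return latest


def content_write_zookie(store: AclStore) -> int:
    """Zookie handed back when content is saved: encodes the current timestamp."""
    return store.clock


def check(store: AclStore, user: str, zookie: int, replica_ts: int) -> bool:
    """Evaluate at any timestamp at least as fresh as the zookie.

    If the local replica is already fresh enough, its timestamp is used as is;
    otherwise the check must be evaluated at the zookie's timestamp.
    """
    eval_ts = max(replica_ts, zookie)
    return user in store.snapshot_at(eval_ts)


store = AclStore()
store.write({"user:alice", "user:bob"})
replica_ts = store.clock              # a replica that stops catching up here
store.write({"user:alice"})           # bob is removed
zookie = content_write_zookie(store)  # new content saved after the removal
```

In this sketch the stale replica alone would still list `user:bob`, while `check(store, "user:bob", zookie, replica_ts)` evaluates at the zookie's timestamp and denies access.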

"The zookie protocol has an important property. It specifies the minimum required freshness, not an exact timestamp."

Critics might argue that relying on a specific database technology like Spanner limits the portability of this architecture. However, Xu's point is that the principle of ordering events is universal, even if the specific tool varies. The real innovation is the decoupling of the permission check from the content storage, allowing them to coordinate without locking each other up.

The Architecture: Distribution and Caching

Finally, Xu breaks down the physical reality of running this system. "Zanzibar runs on over 10,000 servers organized into dozens of clusters worldwide," he states. The system is designed to handle "flash crowds" where a popular piece of content triggers millions of simultaneous checks. To prevent this from crashing the system, Zanzibar uses a lock table to deduplicate requests. "When multiple requests for the same check arrive simultaneously, only one actually executes the check. The others wait for the result, then all receive the same answer," Xu explains.
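
The deduplication behavior described above resembles what is often called a "single-flight" pattern. The sketch below shows one way to express it with Python threads; the `LockTable` class and its `do` method are hypothetical names, and this says nothing about how Zanzibar's lock table is built internally.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict, Hashable


class LockTable:
    """Deduplicates concurrent identical checks: one caller computes, the rest wait."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight: Dict[Hashable, Future] = {}

    def do(self, key: Hashable, compute: Callable[[], bool]) -> bool:
        with self._lock:
            fut = self._in_flight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._in_flight[key] = fut
        if not leader:
            return fut.result()  # block until the leader publishes its answer
        try:
            fut.set_result(compute())
        except Exception as exc:
            fut.set_exception(exc)  # waiters see the same failure
        finally:
            with self._lock:
                del self._in_flight[key]  # later requests compute afresh
        return fut.result()
```

Note that this only collapses requests that overlap in time. It is not a cache: once the leader finishes, the next request for the same key runs the check again.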

He also details how the system handles deeply nested groups using a component called Leopard, which precomputes transitive group membership. "Instead of following chains like 'Alice is in Backend, Backend is in Engineering,' Leopard stores direct mappings from users to all groups they belong to," he writes. This turns a slow recursive search into a millisecond set intersection.
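
The idea of precomputing transitive membership can be sketched as follows. This is an illustrative flattening over an in-memory dictionary, with hypothetical names (`flatten_memberships`, `direct`, `index`); it is not a description of Leopard's actual data structures.

```python
from typing import Dict, Set


def flatten_memberships(direct: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each member to every group it belongs to, directly or transitively.

    `direct` maps a member (a user or a group) to the groups it is directly in.
    """
    closure: Dict[str, Set[str]] = {}
    for member, groups in direct.items():
        reached: Set[str] = set()
        stack = list(groups)
        while stack:
            group = stack.pop()
            if group in reached:  # already expanded; also handles cycles
                continue
            reached.add(group)
            stack.extend(direct.get(group, ()))
        closure[member] = reached
    return closure


direct = {
    "user:alice": {"group:backend"},
    "group:backend": {"group:engineering"},
}
index = flatten_memberships(direct)

# At check time, no chain is followed: intersect the user's flattened groups
# with the groups that are allowed on the document.
allowed_groups = {"group:engineering"}
can_view = bool(index["user:alice"] & allowed_groups)
```

The cost moves from check time to index maintenance: the flattened index has to be updated whenever group membership changes.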

The performance metrics are staggering. "Around 99% of permission checks use moderately stale data, served entirely from local replicas," Xu notes, with a median latency of just 3 milliseconds. "The remaining 1% requiring fresher data have a 95th percentile latency of around 60 milliseconds due to cross-region communication." This trade-off, optimizing for the common case of slightly stale data while guaranteeing correctness for the rare fresh case, is the defining characteristic of the system.

"Most importantly, Zanzibar illustrates optimizing for observed behavior rather than theoretical worst cases."

Bottom Line

Xu's analysis succeeds because it moves beyond the "how" of the code to the "why" of the design, proving that at massive scale, correctness and speed are not opposing forces but interdependent requirements. The piece's greatest strength is its demonstration that a flexible data model and a clever consistency protocol can solve the "new enemy" problem without sacrificing performance. The only vulnerability is the heavy reliance on Google's specific infrastructure, but the architectural lessons on tuple modeling and request deduplication are universally applicable for any engineer building distributed systems. As AI agents begin to manage permissions autonomously, understanding this balance between freshness and speed will be critical.

Sources

How Google manages trillions of authorizations with Zanzibar

Sometime before 2019, Google built a system that manages permissions for billions of users while maintaining both correctness and speed.

When you share a Google Doc with a colleague or make a YouTube video private, a complex system works behind the scenes to ensure that only the right people can access the content. That system is Zanzibar, Google’s global authorization infrastructure that handles over 10 million permission checks every second across services like Drive, YouTube, Photos, Calendar, and Maps.

In this article, we will look at the high-level architecture of Zanzibar and understand the valuable lessons it provides for building large-scale systems, particularly around the challenges of distributed authorization.

See the diagram below that shows the high-level architecture of Zanzibar.

Disclaimer: This post is based on publicly shared details from the Google Engineering Team. Please comment if you notice any inaccuracies.

The Core Problem: Authorization at Scale

Authorization answers a simple question: Can this particular user access this particular resource? For a small application with a few users, checking permissions is straightforward. We might store a list of allowed users for each document and check if the requesting user is on that list.

The challenge multiplies at Google’s scale. For reference, Zanzibar stores over two trillion permission records and serves them from dozens of data centers worldwide. A typical user action might trigger tens or hundreds of permission checks. When searching for an artifact in Google Drive, the system must verify your access to every result before displaying it. Any delay in these checks directly impacts user experience.

Beyond scale, authorization systems also face a critical correctness problem that Google calls the “new enemy” problem. Consider the scenario where we remove someone from a document’s access list, then add new content to that document. If the system uses stale permission data, the person who was just removed might still see the new content. This happens when the system doesn’t properly track the order in which you made changes.

Zanzibar ...