How to write high-performance code

Alex Xu challenges a deeply ingrained myth in software engineering: that high performance is the domain of geniuses tweaking assembly code. The piece's most striking claim is that the "root of all evil" quote by Donald Knuth is almost universally misunderstood, and that the real opportunity for speed lies not in micro-optimization, but in fundamental architectural intuition. For busy engineers shipping code daily, this reframing is vital because it shifts the focus from expensive, late-stage tuning to early, high-impact design choices.

The 3% That Matters

Xu dismantles the common excuse that performance tuning is too complex for the average developer. He writes, "The truth is that writing reasonably fast code doesn't require advanced computer science knowledge or years of experience. It requires developing an intuition about where performance matters and learning some fundamental principles." This is a powerful assertion in an industry often paralyzed by imposter syndrome. By isolating the "critical 3%" of code that actually dictates system speed, Xu argues that developers can stop worrying about the 97% where efficiency is negligible.

Back-of-the-Envelope Estimation

The author's approach to estimating performance is grounded in the physical reality of hardware, not abstract theory. He illustrates the staggering difference in speed tiers: "CPU cache access, which happens in nanoseconds... is roughly 100 times faster than accessing main memory (RAM)." He drives the point home with a concrete calculation: processing one million records via individual network calls could take 14 hours, whereas batching them into groups of 1,000 reduces that time to 50 seconds. This back-of-the-envelope math is the article's strongest tool, proving that architectural decisions outweigh code-level tweaks.

The fastest code is code that never runs.

Critics might argue that in modern cloud environments with auto-scaling, such granular efficiency matters less than developer velocity. However, Xu's logic holds up when considering the cumulative cost of latency at scale; a 50-millisecond delay per user compounds into thousands of wasted server-hours and frustrated customers.

Measure First, Optimize Second

A recurring theme in the piece is the danger of trusting human intuition. Xu notes, "Our intuition about performance bottlenecks is usually wrong. We might spend days optimizing a function we think is slow, only to discover through profiling that some completely different part of the code is the actual problem." This is a crucial corrective to the "premature optimization" mindset. The article advocates for a data-driven workflow where profiling tools, rather than hunches, dictate where effort is spent.
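As a minimal sketch of this measure-first workflow, the example below profiles two hypothetical functions with Python's built-in cProfile; the function we might *suspect* is slow and the one that actually dominates are both stand-ins:

```python
# Measure first: let the profiler, not intuition, rank the hotspots.
import cProfile
import io
import pstats


def suspected_slow():
    # Looks expensive, but finishes quickly.
    return sum(i * i for i in range(10_000))


def actual_hotspot():
    # The real cost: 200,000 string conversions.
    return [str(i) for i in range(200_000)]


profiler = cProfile.Profile()
profiler.enable()
suspected_slow()
actual_hotspot()
profiler.disable()

# Rank by cumulative time and print the top entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Running this shows `actual_hotspot` near the top of the cumulative-time ranking, regardless of which function looked slower on paper.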

He distinguishes between obvious bottlenecks and the more subtle "flat profile," where no single function dominates the runtime. In these cases, Xu suggests looking for patterns across multiple functions or making structural changes higher up in the call chain. This nuanced advice moves beyond the standard "find the slow function" trope and addresses the reality of mature, complex codebases.

The Algorithmic and Memory Hierarchy

The core of Xu's argument rests on the idea that algorithmic efficiency dwarfs all other optimizations. He explains that converting a nested loop approach (O(N²)) to a hash table lookup (O(N)) can mean the difference between finishing in milliseconds or seconds. "A better algorithm can provide a 10x or 100x speedup, dwarfing any micro-optimization we make," he writes. This is the most valuable takeaway for senior engineers who might be tempted to optimize syntax before structure.
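The nested-loop-to-hash-table conversion can be sketched in a few lines. This is an illustrative example (finding IDs common to two lists), not code from the article:

```python
# O(N^2): for each element of a, scan all of b.
def common_ids_quadratic(a, b):
    return [x for x in a if x in b]  # 'x in b' is a linear scan of a list


# O(N): build a set once, then each membership check is O(1) on average.
def common_ids_linear(a, b):
    b_set = set(b)                   # one O(N) pass
    return [x for x in a if x in b_set]


a = list(range(0, 1000, 2))
b = list(range(0, 1000, 3))
assert common_ids_quadratic(a, b) == common_ids_linear(a, b)
```

Both functions return identical results; only the growth rate differs, which is exactly why the algorithmic change dwarfs any tweak inside the loop body.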

Equally important is the discussion on memory locality. Xu highlights that modern CPUs are so fast that the bottleneck is often fetching data from memory. "When the CPU fetches data from memory, it doesn't fetch just one byte. It fetches an entire cache line, typically 64 bytes." He argues that storing related data contiguously in arrays is superior to scattered storage in linked lists, even if the latter has theoretical advantages for insertion. This physical constraint of hardware often overrides theoretical computer science preferences.
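The contrast between contiguous and scattered storage can be sketched as follows. Python hides most memory layout, so this only illustrates the *shape* of the two structures; the cache-line effect Xu describes is directly visible in languages like C, where one 64-byte line holds sixteen 4-byte integers:

```python
from array import array

# Contiguous: all values sit side by side in one memory block,
# so iterating touches many values per fetched cache line.
values = array("i", range(1_000))


# Scattered: each node is a separate heap allocation reached
# through a pointer, defeating the cache line's prefetch benefit.
class Node:
    __slots__ = ("value", "next")

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


head = None
for v in reversed(range(1_000)):
    head = Node(v, head)


def linked_sum(node):
    total = 0
    while node is not None:
        total += node.value
        node = node.next
    return total


assert sum(values) == linked_sum(head)  # same answer, different layouts
```

Both traversals compute the same sum; the difference is how many memory fetches the hardware needs to do it, which is why arrays usually win in practice even when linked lists look better on paper.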

Eliminating Unnecessary Work

The final layer of Xu's strategy is about reduction. He emphasizes that "creating fast paths for common cases is a powerful technique," noting that 80% of cases often follow a simple pattern. By optimizing for the common path and deferring complex logic, systems can achieve massive gains. He also champions lazy evaluation and bailing out early on validation checks to avoid expensive downstream processing.
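A fast path plus early bailout might look like the hypothetical sketch below, which assumes most inputs are already plain lowercase ASCII and defers expensive Unicode normalization to the rare cases that need it:

```python
import unicodedata


def normalize_username(name: str) -> str:
    if not name:
        # Bail out early on the cheap validation check.
        raise ValueError("empty username")
    if name.isascii() and name.islower():
        # Fast path: the common case needs no work at all.
        return name
    # Slow path: full Unicode normalization, only when actually needed.
    return unicodedata.normalize("NFKC", name).lower()


assert normalize_username("alice") == "alice"       # fast path
assert normalize_username("Ａｌｉｃｅ") == "alice"  # slow path (fullwidth chars)
```

If 80% of inputs take the fast path, the expensive normalization runs one call in five instead of every call, without changing the function's observable behavior.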

The key point is that we should let data guide our optimization decisions.

While the advice is sound, a counterargument worth considering is the trade-off with code readability. Highly specialized, optimized paths can sometimes obscure the logic for future maintainers. Xu acknowledges the need for balance but prioritizes performance in critical sections, a stance that resonates with high-throughput systems but might feel aggressive for standard business applications.

Bottom Line

Alex Xu's piece succeeds by demystifying performance engineering, moving it from a black art to a set of practical, measurable principles. Its greatest strength is the emphasis on algorithmic choice and memory layout over syntax tweaks, while its only vulnerability is the potential complexity of implementing these changes in legacy codebases. For any developer looking to build systems that scale, the shift from "optimizing later" to "estimating early" is the single most important lesson here.

Sources

How to write high-performance code


We’ve all been there. Our code works perfectly, passes all tests, and does exactly what it’s supposed to do. Then we deploy it to production and realize it takes 10 seconds to load a page when users expect instant results. Or worse, it works fine with test data but crawls to a halt with real-world volumes.

The common reaction is to put optimization off for later, or to leave performance tuning to experts. Both assumptions are wrong. The truth is that writing reasonably fast code doesn’t require advanced computer science knowledge or years of experience. It requires developing an intuition about where performance matters and learning some fundamental principles.

Many developers have heard the famous quote about premature optimization being “the root of all evil.” However, this quote from Donald Knuth is almost always taken out of context. The full statement reads: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%”.

This article is about that critical 3%, where we’ll explore how to estimate performance impact, when to measure, what to look for, and practical techniques that work across different programming languages.

Learning to Estimate

One of the most valuable skills in performance-aware development is the ability to estimate rough performance costs before writing code. We don’t need precise measurements at this stage; we just need to understand orders of magnitude.
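Such estimates can start from a small table of speed tiers. The constants below are approximate, commonly cited figures, not measurements from the article; real numbers vary by hardware:

```python
# Rough per-operation costs for order-of-magnitude estimates (approximate).
NS = 1e-9
COST_SECONDS = {
    "l1_cache_ref": 1 * NS,
    "main_memory_ref": 100 * NS,            # ~100x slower than cache
    "ssd_random_read": 100_000 * NS,        # ~100 microseconds
    "network_round_trip": 50_000_000 * NS,  # ~50 milliseconds
}


def estimate_seconds(operation: str, count: int) -> float:
    """Order-of-magnitude time for `count` operations of one tier."""
    return COST_SECONDS[operation] * count


# A million RAM references costs a fraction of a second;
# a million network round trips costs the better part of a day.
print(f"{estimate_seconds('main_memory_ref', 1_000_000):.3f} s")
print(f"{estimate_seconds('network_round_trip', 1_000_000):.0f} s")
```

The point of the exercise is not precision but tier placement: knowing an operation lives in the nanosecond, microsecond, or millisecond tier is enough to predict which design will dominate the runtime.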

Think of computer operations as existing in different speed tiers. At the fastest tier, we have CPU cache access, which happens in nanoseconds. These are operations where the data is ...