
Understanding and Coding the KV Cache in LLMs from Scratch

In a field often obsessed with scaling laws and massive parameter counts, Sebastian Raschka turns his attention to the quiet engine room of modern artificial intelligence: the mechanics of efficiency. While the industry chases bigger models, Raschka argues that the real bottleneck for production systems isn't just raw compute power, but the redundancy of recalculating the same data over and over. He offers a rare, human-readable dissection of the "KV cache," a technique that transforms how large language models generate text, proving that optimization is just as critical as innovation.

The Cost of Redundancy

Raschka begins by dismantling a common misconception about how these models work. He illustrates that without optimization, an AI generating a sentence like "Time flies fast" effectively re-reads "Time" and "flies" from scratch every single time it predicts the next word. "The LLM does not cache intermediate key/value states, it re-encodes the full sequence every time a new token is generated," he explains. This is a profound inefficiency. As the text grows, the computational load doesn't just grow; it explodes quadratically, making long-form generation prohibitively slow.
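To make the redundancy concrete, here is a minimal sketch of the naive decoding loop described above. It is not taken from Raschka's article; `model` stands in for any causal LLM that maps token IDs to next-token logits.

```python
import torch

def generate_without_cache(model, token_ids, max_new_tokens):
    # token_ids: tensor of shape (1, num_tokens) holding the prompt
    for _ in range(max_new_tokens):
        logits = model(token_ids)  # re-encodes the *entire* sequence each step
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_token], dim=1)
    return token_ids
```

Because `model` is called on the full, growing sequence, tokens like "Time" and "flies" are re-encoded at every step, which is exactly the redundancy the KV cache removes.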


The author's framing is distinct because he refuses to treat this as a mere engineering footnote. Instead, he positions the cache as the dividing line between a theoretical model and a usable product. "The downside of a KV cache is that it adds more complexity to the code, increases memory requirements... and can't be used during training," Raschka admits. This honesty about trade-offs is refreshing. He acknowledges that while the cache demands more memory, "the inference speed-ups are often well worth the trade-offs in code complexity and memory when using LLMs in production." This pragmatic stance cuts through the hype, reminding developers that real-world deployment requires balancing speed against resource constraints.

"Without caching, the attention at step t must compare the new query with t previous keys, so the cumulative work scales quadratically... With a cache, each key and value is computed once and then reused, reducing the total per-step complexity to linear."

The Mechanics of Memory

Moving from theory to practice, Raschka provides a "from-scratch" implementation that demystifies the black box. He walks the reader through the specific code changes required to store these intermediate states, noting that the core logic is surprisingly simple: "all we have to do is compute the keys and values as usual but then store them so that we can retrieve them in the next round." He details how the model must be modified to register buffer variables, `cache_k` and `cache_v`, which hold the concatenated data across generation steps.
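That paragraph can be made concrete with a stripped-down, single-head attention module. This is a sketch in the spirit of the article rather than Raschka's exact code; the causal mask and multi-head logic are omitted for brevity.

```python
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Single-head attention with a KV cache (illustrative sketch)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)
        # Buffers that accumulate keys/values across generation steps
        self.register_buffer("cache_k", None)
        self.register_buffer("cache_v", None)

    def forward(self, x, use_cache=False):
        # x: (batch, num_new_tokens, d_in); in cached decoding this is
        # usually just the single newest token
        queries = self.W_q(x)
        keys_new, values_new = self.W_k(x), self.W_v(x)

        if use_cache:
            if self.cache_k is None:
                self.cache_k, self.cache_v = keys_new, values_new
            else:
                # Compute K/V for the new token(s) as usual, then store them
                # by concatenating onto the cache for reuse in the next round
                self.cache_k = torch.cat([self.cache_k, keys_new], dim=1)
                self.cache_v = torch.cat([self.cache_v, values_new], dim=1)
            keys, values = self.cache_k, self.cache_v
        else:
            keys, values = keys_new, values_new

        attn_scores = queries @ keys.transpose(1, 2) / keys.shape[-1] ** 0.5
        attn_weights = torch.softmax(attn_scores, dim=-1)
        return attn_weights @ values

    def reset_cache(self):
        self.cache_k, self.cache_v = None, None
```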

His approach to teaching is deliberate. He avoids obfuscating the logic with complex abstractions, opting instead for a direct comparison between a standard model and one with the cache enabled. "I opted for a simple one that emphasizes code readability," Raschka writes, acknowledging that his goal is clarity over raw performance optimization. This choice serves the reader well, as it reveals the fundamental architecture without getting lost in the weeds of GPU-specific optimizations. However, critics might note that while this implementation is excellent for understanding, it relies on concatenating tensors, which can be memory-intensive compared to the pre-allocated strategies used in high-performance production environments.
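For contrast, production systems often reserve the cache memory once and then write into it in place. The following is a hypothetical pre-allocated variant, not part of the article's implementation:

```python
import torch

class PreallocatedKVCache:
    """Sketch of a pre-allocated KV cache: memory is reserved once for the
    maximum sequence length, and new entries are written in place instead of
    growing the tensors with torch.cat at every step."""

    def __init__(self, batch_size, max_seq_len, d_out, device="cpu"):
        self.k = torch.zeros(batch_size, max_seq_len, d_out, device=device)
        self.v = torch.zeros(batch_size, max_seq_len, d_out, device=device)
        self.length = 0  # number of positions filled so far

    def update(self, keys_new, values_new):
        num_new = keys_new.shape[1]
        self.k[:, self.length:self.length + num_new] = keys_new
        self.v[:, self.length:self.length + num_new] = values_new
        self.length += num_new
        # Return views over the filled portion only
        return self.k[:, :self.length], self.v[:, :self.length]

    def reset(self):
        self.length = 0
```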

The author also highlights a critical, often overlooked detail: the necessity of resetting the cache between different prompts. "Otherwise, the queries of a new prompt will attend to stale keys left over from the previous sequence, which causes the model to rely on irrelevant context and produce incoherent output," he warns. This is a crucial lesson for engineers; the cache is a double-edged sword that can corrupt output if not managed with surgical precision.
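In practice this means clearing every block's cache before a new prompt. A minimal helper might look like the following; the `reset_cache` method name mirrors the sketch above and is an assumption, not necessarily the article's API.

```python
def reset_kv_cache(model):
    # Walk all submodules and clear any per-block KV cache so that a new
    # prompt cannot attend to stale keys from the previous sequence
    for module in model.modules():
        if hasattr(module, "reset_cache"):
            module.reset_cache()
```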

The Performance Verdict

To prove the concept, Raschka runs a side-by-side comparison using a small, untrained model. The results are stark. On a consumer-grade Mac Mini, the cached version achieves a five-fold speed increase. "So, as we can see, we already get a ~5x speed-up with a small 124 M parameter model and a short 200-token sequence length," he reports. The significance here is not just the speed, but the validation of the logic: "both the gpt_ch04.py and gpt_with_kv_cache.py implementations produce exactly the same text," confirming that the optimization does not alter the model's intelligence, only its velocity.
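A simple wall-clock harness like the one sketched below is enough to reproduce this kind of comparison on your own hardware; it is an illustrative helper, not the article's benchmarking code.

```python
import time
import torch

def time_generation(generate_fn, *args, **kwargs):
    # Wall-clock timing of a generation call; synchronize first so that any
    # pending GPU work does not distort the measurement
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    output = generate_fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return output, time.time() - start
```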

"This tells us that the KV cache is implemented correctly -- it is easy to make indexing mistakes that can lead to divergent results."

The piece concludes by reiterating that while the current example generates "gibberish" due to a lack of training, the mechanism itself is sound and ready for trained models. Raschka's work serves as a vital reminder that the future of AI isn't just about building larger brains, but about ensuring those brains can think fast enough to be useful.

Bottom Line

Sebastian Raschka's tutorial succeeds by stripping away the mystique of large language models to reveal the elegant, necessary engineering beneath. His strongest argument is that efficiency is not an afterthought but a prerequisite for production viability, a point proven by the dramatic speed gains in his code examples. The piece's only vulnerability is its focus on readability over high-performance optimization, but this is a deliberate and effective choice for educational clarity. For any developer looking to move beyond theory, this is the essential bridge to building systems that actually work.

Sources

Understanding and Coding the KV Cache in LLMs from Scratch

by Sebastian Raschka · Ahead of AI

KV caches are one of the most critical techniques for compute-efficient LLM inference in production. This article explains how they work conceptually and in code with a from-scratch, human-readable implementation.

It's been a while since I shared a technical tutorial explaining fundamental LLM concepts. As I am currently recovering from an injury and working on a bigger LLM research-focused article, I thought I'd share a tutorial article on a topic several readers asked me about (as it was not included in my Building a Large Language Model From Scratch book).

Happy reading!

Overview

In short, a KV cache stores intermediate key (K) and value (V) computations for reuse during inference (after training), which results in a substantial speed-up when generating text. The downside of a KV cache is that it adds more complexity to the code, increases memory requirements (the main reason I initially didn't include it in the book), and can't be used during training. However, the inference speed-ups are often well worth the trade-offs in code complexity and memory when using LLMs in production.

What Is a KV Cache?

Imagine the LLM is generating some text. Concretely, suppose the LLM is given the following prompt: "Time". As you may already know, LLMs generate one word (or token) at a time, and the two following text generation steps may look as illustrated in the figure below:

Note that there is some redundancy in the generated LLM text outputs, as highlighted in the next figure:

When we implement an LLM text generation function, we typically only use the last generated token from each step. However, the visualization above highlights one of the main inefficiencies on a conceptual level. This inefficiency (or redundancy) becomes more clear if we zoom in on the attention mechanism itself. (If you are curious about attention mechanisms, you can read more in Chapter 3 of my Build a Large Language Model (From Scratch) book or my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article).

The following figure shows an excerpt of an attention mechanism computation that is at the core of an LLM. Here, the input tokens ("Time" and "flies") are encoded as 3-dimensional vectors (in reality, these vectors are much larger, but this would make it challenging to fit them into a small figure). The matrices W are ...