
Attention in transformers, step-by-step

You know what's rare in a technical explanation? An author who makes you visualize the problem before solving it. Grant Sanderson does exactly that here — using concrete examples like "mole" (the animal vs. the chemical element) to ground abstract concepts in meaning. This is chapter six of a deep learning series, and it's doing something most technical writing fails at: giving readers a mental picture before diving into matrices.

The Setup

Sanderson opens with the core goal that's easy to forget amid the math: transformers aim to predict the next word. Not classify sentiment, not generate poetry — just predict what comes next. This simplicity is the anchor that makes everything else make sense.


The key insight he wants readers to carry: embeddings start generic and contextless. The word "mole" gets the same vector whether it's in "American shrew mole" or "take a biopsy of the mole." That's intentional — it's only after attention updates these vectors that they become specific. Sanderson writes:

"after the first step of a transformer... the vector that's associated with mole would be the same in all three of these cases because this initial token embedding is effectively a lookup table with no reference to the context"

This is the foundation: embeddings are dumb until attention makes them smart.
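That lookup-table behavior is easy to sketch. In this toy example (the vocabulary, embedding dimension, and `embed` helper are all invented for illustration, not taken from the video), the same token always retrieves the same row of the table, no matter what surrounds it:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                                   # toy embedding dimension
vocab = {"mole": 0, "biopsy": 1, "carbon": 2}
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Context-free lookup: identical tokens get identical vectors."""
    return embedding_table[[vocab[t] for t in tokens]]

a = embed(["mole", "biopsy"])   # "mole" in one sentence...
b = embed(["carbon", "mole"])   # ...and in a completely different one
assert np.allclose(a[0], b[1])  # same vector both times
```

Only the attention layers that follow can differentiate those two copies of "mole."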

The Mechanism

Now comes the real magic. Sanderson walks through query, key, and value matrices — but first, he sets up what these should do. He imagines a single attention head where nouns ask: "hey, are there any adjectives sitting in front of me?"

The query matrix compresses embeddings into a smaller space (say, 128 dimensions). The key matrix does the same. Then comes the dot product alignment:

"the bigger dots correspond to the larger dot products... this means the embeddings of fluffy and blue attend to the embedding of creature"

That phrase — attend to — is doing real work. It makes the mechanism feel like a conversation rather than a computation.
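That compress-then-compare step can be sketched in a few lines. The matrix names `W_Q` and `W_K`, the toy sizes, and the standard `1/sqrt(d_head)` scaling from the original "Attention Is All You Need" formulation are assumptions for illustration, not the video's notation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, n_tokens = 64, 16, 4        # toy sizes; the video's example head uses 128

E   = rng.normal(size=(n_tokens, d_model))   # one (context-free) embedding per token
W_Q = rng.normal(size=(d_model, d_head))     # query matrix: "what am I looking for?"
W_K = rng.normal(size=(d_model, d_head))     # key matrix: "what do I offer?"

Q = E @ W_Q                                  # each embedding compressed to d_head dims
K = E @ W_K
scores = Q @ K.T / np.sqrt(d_head)           # larger dot product = stronger alignment
# scores[i, j] measures how much token i's key aligns with token j's query
assert scores.shape == (n_tokens, n_tokens)
```

In the adjective-noun story, the rows for "fluffy" and "blue" would produce large dot products against the column for "creature."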

Then softmax normalizes everything between 0 and 1, making each column sum to one "as if they were a probability distribution." This is crucial: Sanderson isn't just describing math, he's motivating why the raw scores get squashed into weights that behave like a distribution before they're used to update anything.
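A minimal version of that column-wise softmax, following the video's columns-sum-to-one convention (the function name and the max-subtraction trick are standard practice, not quoted from the video):

```python
import numpy as np

def softmax_columns(scores):
    """Turn each column of raw scores into weights that sum to 1."""
    # Subtracting the column max doesn't change the result but keeps exp() stable.
    z = scores - scores.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

scores = np.array([[2.0, 0.5],
                   [1.0, 3.0]])
pattern = softmax_columns(scores)
assert np.allclose(pattern.sum(axis=0), 1.0)  # each column is a distribution
```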

The Training Secret

Here's where most explanations fail. Sanderson explains masked attention clearly:

"you never want to allow later words to influence earlier words since otherwise they could kind of give away the answer for what comes next"

This is elegant — he frames it as a narrative problem, not an algorithm. And the detail matters: setting the future entries to negative infinity before the softmax, rather than zeroing them afterward, is what keeps each column summing to one.
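A sketch of that masking step, assuming the video's column convention (column j holds the weights used to update token j, so rows for later tokens must be blocked):

```python
import numpy as np

n = 4
scores = np.random.default_rng(2).normal(size=(n, n))

# Keep entries where the key token comes at or before the query token (i <= j);
# everything below the diagonal would let later words influence earlier ones.
mask = np.triu(np.ones((n, n), dtype=bool))
masked = np.where(mask, scores, -np.inf)

# Softmax each column: exp(-inf) = 0, so blocked entries vanish
# while each column still sums to exactly 1.
z = masked - masked.max(axis=0, keepdims=True)
e = np.exp(z)
pattern = e / e.sum(axis=0, keepdims=True)

assert np.allclose(pattern.sum(axis=0), 1.0)                 # still normalized
assert np.allclose(pattern[np.tril_indices(n, -1)], 0.0)     # no look-ahead
```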

The piece also names a structural bottleneck that gets overlooked:

"its size is equal to the square of the context size... this is why context size can be a really huge bottleneck for large language models"

This matters because it explains why scaling isn't trivial — and why recent variations like sparse attention exist.
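A quick back-of-envelope illustration of that quadratic growth — the attention pattern holds one score per (query, key) pair, so a 10x longer context means a 100x larger pattern per head, per layer:

```python
# Entries in the attention pattern for a few context lengths.
pattern_entries = {n: n * n for n in (1_000, 10_000, 100_000)}

# Growing the context 10x grows the pattern 100x.
assert pattern_entries[10_000] == 100 * pattern_entries[1_000]
```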

Counterpoints

Critics might note that Sanderson's adjective-noun example, while effective, oversimplifies how real attention heads behave. The true behavior is "much harder to parse because it's based on tweaking and tuning a huge number of parameters" — and his hypothetical example could mislead readers into thinking attention is more interpretable than it actually is. Real models don't neatly map to linguistic analogies.

Bottom Line

Sanderson's strongest move is the narrative framing: making abstract matrices feel like agents in conversation ("hey, are there any adjectives sitting in front of me?"). His biggest vulnerability is that the adjective-noun example implies attention heads are more interpretable than they actually are — it works as a teaching device but glosses over how messy real transformer behavior gets. The piece succeeds because it gives readers a mental picture before showing them the math.


Sources

Attention in transformers, step-by-step

by Grant Sanderson

In the last chapter, you and I started to step through the internal workings of a transformer. This is one of the key pieces of technology inside large language models and a lot of other tools in the modern wave of AI. It first hit the scene in a now-famous 2017 paper called "Attention Is All You Need," and in this chapter you and I will dig into what this attention mechanism is, visualizing how it processes data.

As a quick recap, here's the important context I want you to have in mind. The goal of the model that you and I are studying is to take in a piece of text and predict what word comes next. The input text is broken up into little pieces that we call tokens, and these are very often words or pieces of words, but just to make the examples in this video easier for you and me to think about, let's simplify by pretending that tokens are always just words. The first step in a transformer is to associate each token with a high-dimensional vector, what we call its embedding.

Now, the most important idea I want you to have in mind is how directions in this high-dimensional space of all possible embeddings can correspond with semantic meaning. In the last chapter, we saw an example for how direction can correspond to gender, in the sense that adding a certain step in this space can take you from the embedding of a masculine noun to the embedding of the corresponding feminine noun. That's just one example; you could imagine how many other directions in this high-dimensional space could correspond to numerous other aspects of a word's meaning.

The aim of a transformer is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead bake in some much richer contextual meaning. I should say up front that a lot of people find the attention mechanism, this key piece in a transformer, very confusing, so don't worry if it takes some time for things to sink in. I think that before we dive into the computational details and all the matrix multiplications, it's worth thinking about a couple examples for the kind of behavior that we want attention to enable. Consider the phrases "American shrew mole," "one mole of carbon dioxide," and "take a biopsy ...