You know what's rare in a technical explanation? An author who makes you visualize the problem before solving it. Grant Sanderson does exactly that here, using concrete examples like "mole" (the burrowing animal, the unit in chemistry, the blemish you might biopsy) to ground abstract concepts in meaning. This is chapter six of a deep learning series, and it's doing something most technical writing fails at: giving readers a mental picture before diving into matrices.
The Setup
Sanderson opens with the core goal that's easy to forget amid the math: transformers aim to predict the next word. Not classify sentiment, not generate poetry, just predict what comes next. This simplicity anchors everything else in the chapter.
The key insight he wants readers to carry: embeddings start generic and contextless. The word "mole" gets the same vector whether it's in "American true mole" or "take a biopsy of the mole." That's intentional — it's only after attention updates these vectors that they become specific. Sanderson writes:
"after the first step of a transformer... the vector that's associated with mole would be the same in all three of these cases because this initial token embedding is effectively a lookup table with no reference to the context"
This is the foundation: embeddings are dumb until attention makes them smart.
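To see just how dumb, here's a minimal sketch in Python (toy vocabulary, random weights, and invented sizes, none of which come from the video): the initial embedding step is literally a row lookup, so "mole" comes out identical regardless of the surrounding sentence.

```python
import numpy as np

# Minimal sketch (toy vocabulary, random weights, invented sizes): the
# initial embedding is a pure lookup table, so "mole" gets the same
# vector no matter which sentence surrounds it.
rng = np.random.default_rng(0)
vocab = {"american": 0, "true": 1, "mole": 2, "take": 3, "a": 4,
         "biopsy": 5, "of": 6, "the": 7}
d_model = 8                                       # embedding width (toy)
W_embed = rng.normal(size=(len(vocab), d_model))  # the lookup table

def embed(tokens):
    return np.stack([W_embed[vocab[t]] for t in tokens])

sent_a = embed(["american", "true", "mole"])
sent_b = embed(["take", "a", "biopsy", "of", "the", "mole"])

# The row for "mole" is identical in both, context notwithstanding.
print(np.allclose(sent_a[2], sent_b[5]))   # True
```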
The Mechanism
Now enters the real magic. Sanderson walks through query, key, and value matrices — but first, he sets up what these should do. He imagines a single attention head where nouns ask: "hey, are there any adjectives sitting in front of me?"
The query matrix compresses embeddings into a smaller space (say, 128 dimensions). The key matrix does the same. Then comes the dot product alignment:
"the bigger dots correspond to the larger dot products... this means the embeddings of fluffy and blue attend to the embedding of creature"
That phrase — attend to — is doing real work. It makes the mechanism feel like a conversation rather than a computation.
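A rough sketch of that step, with made-up dimensions and randomly initialized W_Q and W_K standing in for the learned matrices, might look like this:

```python
import numpy as np

# Query/key sketch with invented sizes: each embedding is projected down
# into a smaller query/key space, and every query is dotted with every key.
rng = np.random.default_rng(1)
n_tokens, d_model, d_head = 5, 64, 16        # toy sizes, not a real model's

E   = rng.normal(size=(n_tokens, d_model))   # one embedding per token
W_Q = rng.normal(size=(d_model, d_head))     # learned in a real model
W_K = rng.normal(size=(d_model, d_head))

Q = E @ W_Q        # queries: roughly "what am I looking for?"
K = E @ W_K        # keys:    roughly "what do I have to offer?"

scores = K @ Q.T   # (n_tokens, n_tokens): entry (i, j) says how well
                   # token i's key aligns with token j's query, i.e. how
                   # much token j should attend to token i.
```

A large entry in that grid is the "fluffy attends to creature" situation from the quote: that key lines up with that query.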
Then softmax normalizes the scores to values between 0 and 1, making each column sum to one "as if they were a probability distribution." This is crucial: Sanderson isn't just describing math, he's setting up why keeping those columns normalized matters once masking enters the picture during training.
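A column-wise softmax is only a few lines; subtracting the column maximum first is the standard trick for keeping the exponentials well behaved (the numbers below are arbitrary):

```python
import numpy as np

# Column-wise softmax, following the convention that each column of the
# attention pattern sums to one (scores here are arbitrary toy values).
def softmax_columns(scores):
    scores = scores - scores.max(axis=0, keepdims=True)   # for stability
    e = np.exp(scores)
    return e / e.sum(axis=0, keepdims=True)

scores = np.array([[ 2.0, 0.1, -1.0],
                   [ 0.5, 1.5,  0.0],
                   [-0.3, 0.2,  3.0]])
pattern = softmax_columns(scores)
print(pattern.sum(axis=0))   # each column sums to 1.0
```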
The Training Secret
Here's where most explanations fail. Sanderson explains masked attention clearly:
"you never want to allow later words to influence earlier words since otherwise they could kind of give away the answer for what comes next"
This is elegant: he frames it as a narrative problem, not an algorithm. The trick is to set those future entries to negative infinity before the softmax, so they become zero afterward and every column still sums to one.
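Here's the same idea in miniature, with random scores and the convention that queries run along the columns (sizes are toy, not from any real model):

```python
import numpy as np

# Masked (causal) attention in miniature: entries where a later token
# would influence an earlier one are set to -inf *before* the softmax,
# so they become zero afterward and every column still sums to one.
def softmax_columns(scores):
    scores = scores - scores.max(axis=0, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=0, keepdims=True)

n = 4
scores = np.random.default_rng(2).normal(size=(n, n))

# With queries along columns, entry (i, j) is token i influencing token j;
# mask the strictly lower triangle, where a later token i would leak
# information to an earlier token j.
mask = np.tril(np.ones((n, n)), k=-1).astype(bool)
scores[mask] = -np.inf

pattern = softmax_columns(scores)
print(np.round(pattern, 2))   # zeros where the mask applied
print(pattern.sum(axis=0))    # columns still sum to 1.0
```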
The piece also names a structural bottleneck that gets overlooked:
"its size is equal to the square of the context size... this is why context size can be a really huge bottleneck for large language models"
This matters because it explains why scaling isn't trivial — and why recent variations like sparse attention exist.
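A quick back-of-envelope loop, with context lengths picked arbitrarily, shows how fast that quadratic term grows:

```python
# One score per (query, key) pair, so the pattern grows with the square
# of the context length (context lengths below are arbitrary examples).
for context in (1_000, 8_000, 128_000):
    entries = context * context
    print(f"{context:>7} tokens -> {entries:,} scores per attention pattern")
# Doubling the context quadruples the pattern; that quadratic growth is
# the bottleneck sparse-attention variants try to sidestep.
```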
Counterpoints
Critics might note that Sanderson's adjective-noun example, while effective, oversimplifies how real attention heads behave. The true behavior is "much harder to parse because it's based on tweaking and tuning a huge number of parameters," and the hypothetical example could lead readers to think attention is more interpretable than it is. Real models don't map neatly onto linguistic analogies.
Bottom Line
Sanderson's strongest move is the narrative framing: making abstract matrices feel like agents in conversation ("hey, are there any adjectives sitting in front of me?"). His biggest vulnerability is the flip side of that framing: the adjective-noun example works as a teaching device but glosses over how messy real transformer behavior gets. The piece succeeds because it gives readers a mental picture before showing them the math.