Grant Sanderson's piece on Transformers is essentially a gateway drug for understanding how Large Language Models actually work — and that's what makes it so compelling. He doesn't just explain what a Transformer does; he demystifies the entire machinery behind tools like ChatGPT that have captivated the world's attention. The most distinctive claim? That these systems are fundamentally built on one simple game: predicting the next word.
The Core Architecture
Sanderson opens with the foundational breakdown of GPT — Generative Pre-trained Transformer — and immediately makes the piece accessible by defining each term plainly. "Generative" means it creates new text; "Pre-trained" means it learned from a massive dataset, with room for later fine-tuning; "Transformer" is the neural network architecture itself. This is effective because he immediately signals that he's going to unpack what most people treat as a black box.
He writes, "what I want to do with this video and the following chapters is go through a visually driven explanation for what actually happens inside a Transformer." The visual approach matters here — Sanderson understands that abstract concepts like attention mechanisms need anchoring in concrete examples. He's not just teaching; he's building intuition.
The piece's real strength lies in connecting seemingly unrelated applications under one umbrella. "All those tools that took the World by storm in 2022 like Dall-E, Midjourney, that take in a text description and produce an image are based on Transformers," he writes. This single sentence ties image generation to the same architecture behind text, showing readers that the design is not tied to any one modality.
The Prediction Game
Sanderson's explanation of how LLMs generate text is where the piece becomes genuinely illuminating. He describes it as "the process of repeated prediction and sampling" — essentially a loop where the model predicts what comes next, samples from that distribution, appends it to the input, then repeats. This is the mechanic behind every ChatGPT interaction.
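The loop Sanderson describes can be sketched in a few lines. This is a toy illustration, not his code: the "model" here is a hard-coded stand-in returning a uniform distribution over a made-up six-word vocabulary, where a real Transformer would produce the distribution via a forward pass.

```python
import random

# Toy vocabulary; a real model's vocabulary has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def next_token_distribution(tokens):
    # Hypothetical stand-in for the model: in a real Transformer this
    # distribution comes from a full forward pass over `tokens`.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def generate(prompt_tokens, n_steps, seed=0):
    """Repeated prediction and sampling: predict a distribution over the
    next token, sample one token from it, append it, and loop."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        dist = next_token_distribution(tokens)
        words, probs = zip(*dist.items())
        tokens.append(rng.choices(words, weights=probs, k=1)[0])
    return tokens

print(generate(["the", "cat"], 4))
```

Swapping the stand-in for a genuine model is the only conceptual change needed; the predict-sample-append loop itself is exactly what runs behind a ChatGPT response.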
He notes something fascinating: with GPT-2 on a laptop, "the story just doesn't actually really make that much sense." But swap it for API calls to GPT-3 — "which is the same basic model just much bigger" — and suddenly "we do get a sensible story, one that even seems to infer that a pi creature would live in a land of math and computation." The jump in capability from small to large models isn't incremental; it feels almost magical. This observation captures exactly why the AI revolution has been so disorienting for non-practitioners.
The high-level overview of data flow through a Transformer is masterfully structured: tokenization, embedding matrices, attention blocks, multi-layer perceptrons (feed-forward layers), then repeating until the final vector produces a probability distribution over possible next tokens. Sanderson explicitly compares these blocks to "a giant pile of Matrix multiplications" — making the mathematical substrate visible without overwhelming non-technical readers.
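That data flow can be made concrete with a minimal NumPy skeleton. Everything below is a simplified sketch with tiny made-up dimensions and random, untrained weights: one unmasked single-head attention block and one MLP block, with layer normalization and positional encodings omitted. It exists only to show that each stage really is a pile of matrix multiplications ending in a probability distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8

# The model's weights (random stand-ins for learned parameters).
embedding = rng.normal(size=(vocab_size, d_model))   # token -> vector lookup
W_qkv = rng.normal(size=(d_model, 3 * d_model))      # attention projections
W_mlp1 = rng.normal(size=(d_model, 4 * d_model))     # MLP expand
W_mlp2 = rng.normal(size=(4 * d_model, d_model))     # MLP contract
unembed = rng.normal(size=(d_model, vocab_size))     # vector -> logits

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forward(token_ids):
    x = embedding[token_ids]                        # tokenization -> embeddings
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)       # attention block (no mask)
    x = x + softmax(q @ k.T / np.sqrt(d_model)) @ v # residual connection
    x = x + np.maximum(x @ W_mlp1, 0) @ W_mlp2      # MLP (feed-forward) block
    logits = x[-1] @ unembed                        # last position -> next token
    return softmax(logits)                          # distribution over vocab

probs = forward(np.array([1, 2, 3, 4]))
print(probs.shape, probs.sum())
```

Stacking many copies of the attention/MLP pair, with normalization in between, yields the repeated blocks Sanderson describes; the final softmax is where the probability distribution over next tokens emerges.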
The Missing Pieces
A crucial insight surfaces when he describes deep learning's format requirement: "in order for this training algorithm to work well at scale, these models have to follow a certain specific format." This is where Sanderson earns his keep. He understands that most explanations skip the architecture constraints — the reasons why Transformers process language the way they do.
He also draws a distinction that organizes the whole explanation: "you should draw a very sharp distinction in your mind between the weights of the model which I'll always color in blue or red and the data being processed which I'll always color in Gray." The weights are the learned parameters, the model's actual "brains"; the gray data is merely what flows through them. This visual convention helps readers see why LLMs can be deployed for different tasks without changing their fundamental structure: only the data changes.
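The weights-versus-data split is easy to demonstrate. In this toy sketch (not from the piece), a single weight matrix plays the "blue/red" role: it is created once and then frozen, while different "gray" inputs flow through the same unchanged parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

# Weights: learned once, then frozen (Sanderson's blue/red).
W = rng.normal(size=(4, 4))

def process(data):
    # `data` is the gray part: it changes with every query,
    # while W stays exactly the same.
    return data @ W

a = process(rng.normal(size=(3, 4)))  # one input, 3 vectors
b = process(rng.normal(size=(5, 4)))  # a different input, 5 vectors
# Both were processed by the identical W; nothing about the model changed.
```

This is why a single trained model can answer wildly different prompts: every interaction reuses the same frozen weights on fresh data.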
However, there's a notable gap: Sanderson mentions "back propagation" as the unifying training algorithm but doesn't elaborate on how it actually works in Transformers. The piece promises to cover this later, which is fine structurally, but leaves readers hanging on the most critical question — how these models learn from data at all.
Counterpoints
Critics might note that while Sanderson effectively describes what happens inside a Transformer, he doesn't fully address whether attention mechanisms are actually necessary for smaller models or whether simpler architectures could achieve similar results. The research community continues to debate efficiency versus scale tradeoffs. Additionally, the piece focuses heavily on text and image generation but barely touches on why Transformers excel at these tasks specifically — what's unique about their inductive bias?
Pull Quote
"This process here of repeated prediction and sampling is essentially what's happening when you interact with ChatGPT or any of these other large language models and you see them producing one word at a time."
Sanderson's strongest contribution is making the invisible visible: showing that every AI tool you've used isn't some alien intelligence but rather a sophisticated autocomplete operating at unprecedented scale. The "pi creature" story he references, in which GPT-3 continues a fantasy prompt and infers that such a creature would inhabit a land of math and computation, demonstrates how these models can extrapolate from training data in ways that feel genuinely creative.
Bottom Line
Sanderson's piece succeeds because it demystifies the most consequential technology of our time without requiring advanced mathematics. Its main vulnerability is that it leans on the promise of future chapters; this is clearly introductory material meant to set up deeper dives into attention blocks and multi-layer perceptrons. For busy readers wanting to understand what their ChatGPT prompts actually do, this high-level tour is exactly right: it shows that under the hood, it's all just matrix multiplication and probability distributions — nothing more magical than that.