
Transformers, the tech behind LLMs

Grant Sanderson's piece on Transformers is essentially a gateway drug for understanding how Large Language Models actually work — and that's what makes it so compelling. He doesn't just explain what a Transformer does; he demystifies the entire machinery behind tools like ChatGPT that have captivated the world's attention. The most distinctive claim? That these systems are fundamentally built on one simple game: predicting the next word.

The Core Architecture

Sanderson opens with the foundational breakdown of GPT — Generative Pre-trained Transformer — and immediately makes the piece accessible by defining each term plainly. "Generative" means it creates new text; "Pre-trained" refers to learning from a massive amount of data, with room for further fine-tuning on specific tasks; "Transformer" is the neural network itself. This is effective because he signals from the outset that he's going to unpack what most people treat as a black box.

He writes, "what I want to do with this video and the following chapters is go through a visually driven explanation for what actually happens inside a Transformer." The visual approach matters here — Sanderson understands that abstract concepts like attention mechanisms need anchoring in concrete examples. He's not just teaching; he's building intuition.


The piece's real strength lies in connecting seemingly unrelated applications under one umbrella. "All those tools that took the world by storm in 2022, like Dall-E and Midjourney, that take in a text description and produce an image, are based on Transformers," he writes. Together with his nods to speech-to-text and text-to-speech models, this passage bridges audio, image, and text processing, showing readers that the architecture is universal.

The Prediction Game

Sanderson's explanation of how LLMs generate text is where the piece becomes genuinely illuminating. He describes it as "the process of repeated prediction and sampling": a loop in which the model predicts a probability distribution over what comes next, samples from it, appends the sample to the input, and repeats. This is the mechanic behind every ChatGPT interaction.
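
To make that loop concrete, here is a minimal sketch in Python. It is not Sanderson's code; `model` and `tokenizer` are hypothetical stand-ins for any next-token predictor that returns a probability distribution over a vocabulary.

```python
import numpy as np

def generate(model, tokenizer, prompt, max_new_tokens=50):
    """Repeated prediction and sampling, as described in the video.

    `model(tokens)` is assumed to return a probability distribution
    (a vector summing to 1) over the vocabulary for the next token.
    """
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model(tokens)                               # predict a distribution over next tokens
        next_token = np.random.choice(len(probs), p=probs)  # sample from that distribution
        tokens.append(int(next_token))                      # append the sample and repeat
    return tokenizer.decode(tokens)
```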

He notes something fascinating: with GPT-2 on a laptop, "the story just doesn't actually really make that much sense." But swap it for API calls to GPT-3 — "which is the same basic model just much bigger" — and suddenly "we do get a sensible story one that even seems to infer that a pie creature would live in a Land of math and computation." The jump in capability from small to large models isn't incremental; it's almost magical. This observation captures exactly why the AI revolution has been so disorienting for non-practitioners.
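
Sanderson's laptop experiment is straightforward to reproduce. Here is a hedged sketch that runs GPT-2 locally with Hugging Face's transformers library (my tooling choice, not the video's; the prompt is only illustrative):

```python
# Requires: pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Once upon a time, a pie creature wandered into a land of"
result = generator(prompt, max_new_tokens=80, do_sample=True, temperature=0.9)
print(result[0]["generated_text"])
# Base GPT-2 tends to drift incoherently here; Sanderson's point is that the same
# prediction-and-sampling loop, run through a far larger model, yields a coherent story.
```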

The high-level overview of data flow through a Transformer is masterfully structured: tokenization, embedding matrices, attention blocks, multi-layer perceptrons (feed-forward layers), then repeating until the final vector produces a probability distribution over possible next tokens. Sanderson explicitly compares these blocks to "a giant pile of Matrix multiplications" — making the mathematical substrate visible without overwhelming non-technical readers.
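
Those stages map onto surprisingly little code. The sketch below is a toy single-head forward pass in NumPy, simplified well beyond anything in the piece (no positional encodings, no layer normalization, weights shared across layers), meant only to show that tokenized input really does flow through embedding, attention, and MLP matrices into a probability distribution:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(x, Wq, Wk, Wv):
    # Single-head self-attention: each token gathers information from earlier tokens.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    causal_mask = np.triu(np.full_like(scores, -1e9), k=1)  # no peeking at future tokens
    return softmax(scores + causal_mask) @ v

def mlp_block(x, W1, W2):
    # Per-token feed-forward layer (the multi-layer perceptron step).
    return np.maximum(x @ W1, 0) @ W2

def transformer_forward(token_ids, p, n_layers=2):
    x = p["embed"][token_ids]                                # embedding matrix: ids -> vectors
    for _ in range(n_layers):                                # repeated attention + MLP blocks
        x = x + attention_block(x, p["Wq"], p["Wk"], p["Wv"])
        x = x + mlp_block(x, p["W1"], p["W2"])
    logits = x[-1] @ p["unembed"]                            # final vector -> score per vocab entry
    return softmax(logits)                                   # probability distribution over next token

# Toy, untrained parameters, just to show the shapes line up.
rng = np.random.default_rng(0)
d, vocab = 16, 100
p = {"embed": rng.normal(size=(vocab, d)), "unembed": rng.normal(size=(d, vocab)),
     "Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)), "Wv": rng.normal(size=(d, d)),
     "W1": rng.normal(size=(d, 4 * d)), "W2": rng.normal(size=(4 * d, d))}
print(transformer_forward(np.array([1, 5, 9]), p).shape)     # (100,): one probability per token
```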

The Missing Pieces

A crucial insight surfaces when he describes deep learning's format requirement: "in order for this training algorithm to work well at scale, these models have to follow a certain specific format." This is where Sanderson earns his keep. He understands that most explanations skip the architecture constraints — the reasons why Transformers process language the way they do.

He also clarifies what every weight means: "you should draw a very sharp distinction in your mind between the weights of the model which I'll always color in blue or red and the data being processed which I'll always color in Gray." The weights are the actual brains; the gray is just input. This visual metaphor helps readers understand why LLMs can be deployed for different tasks without changing their fundamental structure — only the data changes.
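
A tiny illustration of that distinction (my own, not Sanderson's): the matrix below stands in for the blue/red weights, fixed after training, while different gray inputs flow through it.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))   # "blue/red": the model's weights, frozen after training

# "Gray": the data being processed; only this changes between tasks and prompts.
for x in [np.array([1.0, 0.0, 2.0, 1.0]),
          np.array([0.5, 3.0, 0.0, 1.0])]:
    print(x @ W)              # same weights, different data, different outputs
```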

However, there's a notable gap: Sanderson mentions backpropagation as the unifying training algorithm but doesn't elaborate on how it actually works in Transformers. The piece promises to cover this later, which is fine structurally, but it leaves readers hanging on the most critical question: how these models learn from data at all.

Counterpoints

Critics might note that while Sanderson effectively describes what happens inside a Transformer, he doesn't fully address whether attention mechanisms are actually necessary for smaller models or whether simpler architectures could achieve similar results. The research community continues to debate efficiency versus scale tradeoffs. Additionally, the piece focuses heavily on text and image generation but barely touches on why Transformers excel at these tasks specifically — what's unique about their inductive bias?

Pull Quote

"This process here of repeated prediction and sampling is essentially what's happening when you interact with ChatGPT or any of these other large language models and you see them producing one word at a time."

Sanderson's strongest contribution is making the invisible visible: showing that every AI tool you've used isn't some alien intelligence but rather a sophisticated autocomplete on an unprecedented scale. The "pie creature" story he references — generated by GPT-3 inferring mathematical properties from fantasy — demonstrates how these models can extrapolate from training data in ways that feel genuinely creative.

Bottom Line

Sanderson's piece succeeds because it demystifies the most consequential technology of our time without requiring advanced mathematics. Its main weakness is how much it defers to the promise of future chapters: this is clearly introductory material meant to set up deeper dives into attention blocks and multi-layer perceptrons. For busy readers wanting to understand what their ChatGPT prompts actually do, this high-level tour is exactly right: it shows that under the hood, it's all just matrix multiplication and probability distributions, nothing more magical than that.


Sources

Transformers, the tech behind LLMs

by Grant Sanderson · Watch video

The initials GPT stand for Generative Pre-trained Transformer. So that first word is straightforward enough: these are bots that generate new text. Pre-trained refers to how the model went through a process of learning from a massive amount of data, and the prefix insinuates that there's more room to fine-tune it on specific tasks with additional training. But the last word, that's the real key piece. A Transformer is a specific kind of neural network, a machine learning model, and it's the core invention underlying the current boom in AI. What I want to do with this video and the following chapters is go through a visually driven explanation for what actually happens inside a Transformer. We're going to follow the data that flows through it and go step by step. There are many different kinds of models that you can build using Transformers. Some models take in audio and produce a transcript; this sentence comes from a model going the other way around, producing synthetic speech just from text. All those tools that took the world by storm in 2022, like Dall-E and Midjourney, that take in a text description and produce an image, are based on Transformers. And even if I can't quite get it to understand what a pie creature is supposed to be, I'm still blown away that this kind of thing is even remotely possible. The original Transformer, introduced in 2017 by Google, was invented for the specific use case of translating text from one language into another. But the variant that you and I will focus on, which is the type that underlies tools like ChatGPT, will be a model that's trained to take in a piece of text, maybe even with some surrounding images or sound accompanying it, and produce a prediction for what comes next in the passage. That prediction takes the form of a probability distribution over many different chunks of text that might follow. At first glance you might think that predicting the next word feels like a very different goal from generating new text. But once you have a prediction model like this, a simple thing you could try to make it generate a longer piece of text is to give it an initial snippet to work with, have it take a random sample from the distribution it just generated, append that sample to the text ...