
Attention in transformers, step-by-step | Deep Learning Chapter 6

In the last chapter, you and I started to step through the internal workings of a transformer. This is one of the key pieces of technology inside large language models and a lot of other tools in the modern wave of AI. It first hit the scene in a now-famous 2017 paper called "Attention Is All You Need," and in this chapter you and I will dig into what this attention mechanism is, visualizing how it processes data.

As a quick recap, here's the important context I want you to have in mind. The goal of the model that you and I are studying is to take in a piece of text and predict what word comes next. The input text is broken up into little pieces that we call tokens, and these are very often words or pieces of words. But just to make the examples in this video easier for you and me to think about, let's simplify by pretending that tokens are always just words.

The first step in a transformer is to associate each token with a high-dimensional vector, what we call its embedding. Now, the most important idea I want you to have in mind is how directions in this high-dimensional space of all possible embeddings can correspond with semantic meaning. In the last chapter we saw an example of how a direction can correspond to gender, in the sense that adding a certain step in this space can take you from the embedding of a masculine noun to the embedding of the corresponding feminine noun. That's just one example; you could imagine how many other directions in this high-dimensional space could correspond to numerous other aspects of a word's meaning.

The aim of a transformer is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead bake in some much, much richer contextual meaning. I should say up front that a lot of people find the attention mechanism, this key piece in a transformer, very confusing, so don't worry if it takes some time for things to sink in.

I think that before we dive into the computational details and all the matrix multiplications, it's worth thinking about a couple of examples of the kind of behavior that we want attention to enable. Consider the phrases "American shrew mole," "one mole of carbon dioxide," and "take a ...
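The idea that a direction in embedding space can carry meaning, such as the gender direction mentioned above, can be sketched with plain vector arithmetic. The 4-dimensional vectors below are made-up illustrative values, not embeddings from any real model (real transformer embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

# Toy 4-dimensional "embeddings" -- purely illustrative values,
# not taken from any trained model.
embeddings = {
    "man":   np.array([1.0, 0.2, 0.0, 0.5]),
    "woman": np.array([1.0, 0.2, 1.0, 0.5]),
    "king":  np.array([0.8, 0.9, 0.0, 0.4]),
}

# A "gender direction": the step that takes a masculine embedding
# to the corresponding feminine one.
gender_direction = embeddings["woman"] - embeddings["man"]

# Adding that same step to "king" should land near where the
# embedding of "queen" would sit in this toy space.
predicted_queen = embeddings["king"] + gender_direction
print(predicted_queen)
```

In a trained model, such a direction would be estimated from many masculine/feminine pairs rather than a single one, but the arithmetic is the same.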

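The chapter builds up the full mechanism step by step, but as a preview of the matrix multiplications it refers to, here is a bare-bones sketch of single-head scaled dot-product attention, with no masking and with random matrices standing in for the learned query, key, and value parameters. The shapes and names here are illustrative choices, not the video's notation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """One head of scaled dot-product attention over token embeddings X.

    Returns one context-enriched vector per input token: each output row
    is a weighted mixture of the value vectors, with weights given by how
    strongly each token's query matches every token's key.
    """
    Q = X @ W_q                            # queries
    K = X @ W_k                            # keys
    V = X @ W_v                            # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # token-to-token relevance scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 3, 4, 2
X = rng.normal(size=(seq_len, d_model))    # toy embeddings for 3 tokens
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (3, 2): one output vector per token
```

The key point for this chapter is the last line of `attention`: every token's output is a blend of information from the other tokens, which is exactly the "bake in richer contextual meaning" behavior described above.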

Watch the full video by Grant Sanderson on YouTube.