Let's jump in and talk about the self-attention mechanism of transformers. If you can get the main idea behind this video, you'll understand the most important core idea behind what makes transformer networks work. You've seen how attention is used with sequential neural networks such as RNNs. To use attention with a style of computation more like CNNs, you need to calculate self-attention, where you create attention-based representations for each of the words in your input sentence. Let's use our running example, "Jane visite l'Afrique en septembre." Our goal will be, for each word, to compute an attention-based representation like this. We'll end up with five of these, since our sentence has five words, and when we've computed them, we'll call the representations of these five words A1 through A5. You'll notice you're starting to see a bunch of symbols Q, K, and V. We'll explain what these symbols mean in a later slide, so don't worry about them for now. The running example I'm going to use is the word l'Afrique in this sentence. We'll step through on the next slide how the transformer network's self-attention mechanism allows you to compute A3 for this word, and then you do the same thing for the other words in the sentence as well. Now, you learned previously about word embeddings, so one way to represent l'Afrique would be to just look up the word embedding for l'Afrique. But depending on the context, are we thinking of l'Afrique, of Africa, as a site of historical interest, as a holiday destination, or as the world's second largest continent? Depending on how you're thinking of l'Afrique, you may choose to represent it differently, and that's what this representation A3 will do. It will look at the surrounding words to try to figure out how we're actually talking about Africa in this sentence, and find the most appropriate representation for it.
In terms of the actual calculation, it won't be too different from the attention mechanism you saw previously as applied in the context of RNNs, except that we'll compute these representations in parallel for all five words in the sentence. When we were building attention on top of RNNs, this was the equation we used. With the self-attention mechanism, the attention equation is instead going to look like this. You can see the equations have some similarity: the inner term here also involves a softmax, just like this term over here on the left, and you can think of the exponent terms as being akin to attention values. Exactly how these terms are worked out, you'll see on the next slide, so again, don't worry about the details just yet. The main difference is that for every word, say for l'Afrique, you have three values called the query, key, and value, and these vectors are the key inputs to computing the attention value for each word. Now, let's step through the computations you need to go from the word l'Afrique to the self-attention representation A3. For reference, I've also printed up here on the upper right the softmax-like equation from the previous slide. First, we're going to associate each of the words with three vectors called the query, key, and value. If x3 is the word embedding for l'Afrique, then q3 is computed as a learned matrix, which I'm going to write as WQ, times x3, and similarly for the key and value: k3 is WK times x3, and v3 is WV times x3. These matrices WQ, WK, and WV are parameters of the learning algorithm, and they allow you to compute these query, key, and value vectors for each word. So what are these query, key, and value vectors supposed to do? They were named using a loose analogy to a concept in databases, where you can have queries and also key-value pairs.
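Since the slides themselves aren't visible here, the quantities just described can be written out in LaTeX. This is a reconstruction from the spoken description, using superscripts in angle brackets to index words, as is common in this course's notation:

```latex
% Query, key, and value for word i, from learned matrices
q^{\langle i \rangle} = W^{Q} x^{\langle i \rangle}, \qquad
k^{\langle i \rangle} = W^{K} x^{\langle i \rangle}, \qquad
v^{\langle i \rangle} = W^{V} x^{\langle i \rangle}

% Attention-based representation of word 3 (l'Afrique):
% a softmax over inner products, weighting the value vectors
A^{\langle 3 \rangle} \;=\; \sum_{i=1}^{5}
  \frac{\exp\!\big(q^{\langle 3 \rangle} \cdot k^{\langle i \rangle}\big)}
       {\sum_{j=1}^{5} \exp\!\big(q^{\langle 3 \rangle} \cdot k^{\langle j \rangle}\big)}
  \; v^{\langle i \rangle}
```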
If you're familiar with those types of databases, the analogy may make sense to you; if you're not familiar with that database concept, don't worry about it. Let me give one intuition behind the intent of these query, key, and value vectors. q3 is a question that you get to ask about l'Afrique. q3 may represent a question like: what's happening there? L'Afrique is a destination, so when computing A3 you might want to know what's happening there. What we're going to do is compute the inner product between q3 and k1, between query 3 and key 1, and this will tell us how good an answer word 1 is to the question of what's happening in Africa. Then we compute the inner product between q3 and k2, which is intended to tell us how good an answer word 2 is to that same question, and so on for the other words in the sequence. The goal of this operation is to pull up the information needed to help us compute the most useful representation A3 up here. Again, just for intuition building: if k1 represents that the first word, Jane, is a person, and k2 represents that the second word, visite, is an action, then you may find that q3 inner-producted with k2 has the largest value, and this intuitive example might suggest that visite gives you the most relevant context for what's happening in Africa, which is that it's viewed as a destination for a visit. What we will do is take these five values in this row and compute the softmax over them; that's this softmax over here. In the example we've been talking about, q3 times k2, corresponding to the word visite, maybe has the largest value, so I'm going to shade that blue over here. Then, finally, we take these softmax values and multiply them with v1, the value for word 1, v2, the value for word 2, and so on, and these correspond to that value term up there.
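The steps just described for computing A3 can be sketched in NumPy. This is a minimal illustration, not the course's code: the dimensions are arbitrary, and the embeddings and parameter matrices are random stand-ins for what a trained network would have learned.

```python
import numpy as np

rng = np.random.default_rng(0)

d_x, d_k = 8, 4          # embedding size and query/key/value size (arbitrary here)
n_words = 5              # "Jane visite l'Afrique en septembre"

# Word embeddings x^<1>..x^<5>, stacked as rows (random stand-ins)
X = rng.normal(size=(n_words, d_x))

# Learned parameter matrices W^Q, W^K, W^V (randomly initialized here)
W_Q = rng.normal(size=(d_k, d_x))
W_K = rng.normal(size=(d_k, d_x))
W_V = rng.normal(size=(d_k, d_x))

# Step 1: the query for word 3 (l'Afrique), and keys/values for every word
q3 = W_Q @ X[2]          # q^<3> = W^Q x^<3>
K = X @ W_K.T            # row i is k^<i+1>
V = X @ W_V.T            # row i is v^<i+1>

# Step 2: inner products q^<3> . k^<i> ("how good an answer is word i?")
scores = K @ q3          # shape (n_words,)

# Step 3: softmax over the five scores
weights = np.exp(scores) / np.exp(scores).sum()

# Step 4: the weighted sum of the value vectors gives A^<3>
A3 = weights @ V         # shape (d_k,)
```

Repeating steps 1 through 4 with q1, q2, q4, and q5 in place of q3 would give the other four representations.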
Finally, we sum it all up; this summation corresponds to this summation operator, and adding up all of these values gives you A3, which is equal to this value here. Another way to write A3 is as A of q3, K, V, this A up here, but sometimes it will be more convenient to just write A3. The key advantage of this representation is that l'Afrique is no longer some fixed word embedding. Instead, the self-attention mechanism can realize that l'Afrique is the destination of a visite, of a visit, and thus compute a richer, more useful representation for this word. Now, I've been using the third word, l'Afrique, as a running example, but you can use this process for all five words in your sequence to get similarly rich representations for Jane, visite, l'Afrique, en, and septembre. If you put all five of these computations together, the notation used in the literature looks like this: you can summarize all of the computations we just talked about for all the words in the sequence by writing Attention(Q, K, V), where Q, K, and V are matrices containing all of these values, and this is just a compressed or vectorized representation of the equation up here. The term in the denominator just scales the dot product so it doesn't explode; you don't really need to worry about it. Another name for this type of attention is scaled dot-product attention, and this is the form presented in the original transformer architecture paper, "Attention Is All You Need," as well. So, that's the self-attention mechanism of the transformer network. To recap: associated with each of the five words, you end up with a query, a key, and a value. The query lets you ask a question about that word, such as: what's happening in Africa?
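The vectorized form just mentioned, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, can be sketched in NumPy as follows. The function name and the random test inputs are illustrative, but the computation is the scaled dot-product attention described above:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one sentence.

    Q, K, V: arrays of shape (n_words, d_k); row i holds the
    query/key/value vector for word i.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_words, n_words) grid of q . k terms
    # Row-wise softmax (subtracting the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                # row i is A^<i>

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))  # five words, d_k = 4
A = scaled_dot_product_attention(Q, K, V)              # A[2] is A^<3>
```

Note that all five representations come out of a single pair of matrix multiplications, which is what makes this computation parallel across the words of the sentence.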
The keys of all of the other words, through their similarity to the query, help you figure out which word gives the most relevant answer to that question; in this case, visite is what's happening in Africa: someone is visiting Africa. Then, finally, the value allows the representation to plug in how visite should be represented within A3, within the representation of Africa. This lets you come up with a representation for the word Africa that says: this is Africa, and someone is visiting it. That's a much more nuanced, much richer representation for the word than if you just had to pull up the same fixed word embedding every time, without being able to adapt it based on the words to the left and to the right, without being able to take into account any of the context. Now you have learned about the self-attention mechanism. We're going to put a big for-loop over this whole thing, and that will be the multi-headed attention mechanism. Let's dive into the details of that in the next video.