You've learned about self-attention, and you've learned about multi-head attention. Let's put it all together to build the transformer network. In this video, you'll see how you can combine the attention mechanisms you saw in the previous videos to build the transformer architecture, starting again with the sentence "Jane visite l'Afrique en septembre" and its corresponding embeddings. Let's walk through how you can translate this sentence from French to English. I've also added the start-of-sentence and end-of-sentence tokens here. Up until this point, for the sake of simplicity, I've only been talking about the embeddings for the words in the sentence, but in many sequence-to-sequence translation tasks it's useful to also add the start-of-sentence (SOS) and end-of-sentence (EOS) tokens, which I have in this example.

The first step in the transformer is that these embeddings get fed into an encoder block, which has a multi-head attention layer. This is exactly what you saw on the last slide, where you feed in the values Q, K, and V computed from the embeddings and the weight matrices W. This layer then produces a matrix that can be passed into a feed-forward neural network, which helps determine what interesting features there are in the sentence. In the transformer paper, this encoder block is repeated N times, and a typical value for N is 6. So after maybe about six passes through this block, we then feed the output of the encoder into a decoder block.

Let's start building the decoder block. The decoder block's job is to output the English translation, so the first output will be the start-of-sentence token, which I've already written down here. At every step, the decoder block takes as input the first few words, whatever we've already generated of the translation. When we're just getting started, the only thing we know is that the translation will start with a start-of-sentence token, so the SOS token gets fed into this multi-head attention block, and just this one token is used to compute Q, K, and V for this multi-head attention block.

This first block's output is used to generate the Q matrix for the next multi-head attention block, and the output of the encoder is used to generate K and V. So here's the second multi-head attention block, with inputs Q, K, and V as before. Why is it structured this way? Here's one piece of intuition that might help. The input down here is whatever you've translated of the sentence so far, so it poses a query along the lines of "what comes after the start of sentence?" It then pulls context from K and V, which come from the encoder's representation of the French sentence, to decide what the next word in the sequence should be. To finish the description of the decoder block, this multi-head attention block outputs values which are fed to a feed-forward neural network. The decoder block is also repeated N times, maybe six times, where you take the output, feed it back to the input, and have this go through, say, half a dozen times. And the job of this neural network is to predict the next word in the sentence.
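To make this wiring a little more concrete, here is a minimal sketch of one encoder block and one decoder block written in PyTorch. This is not the code from this week's programming exercise; the class names, the use of nn.MultiheadAttention, and the hyperparameter values are illustrative assumptions, and the residual connections, Add & Norm layers, and masking discussed later in this video are left out so the Q, K, V wiring stays easy to see.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                      # x: (batch, src_len, d_model)
        # Self-attention: Q, K, and V are all computed from the encoder's own input.
        attn_out, _ = self.attn(x, x, x)
        return self.ffn(attn_out)

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, y, enc_out):             # y: what has been generated so far
        # First attention block: Q, K, and V all come from the partial translation.
        y_attn, _ = self.self_attn(y, y, y)
        # Second attention block: Q comes from the first block's output,
        # while K and V come from the encoder's output.
        ctx, _ = self.cross_attn(y_attn, enc_out, enc_out)
        return self.ffn(ctx)

# Repeating each block N = 6 times, as in the transformer paper.
encoder = nn.ModuleList([EncoderBlock() for _ in range(6)])
decoder = nn.ModuleList([DecoderBlock() for _ in range(6)])
```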
So hopefully, the network will decide that the first word in the English translation is "Jane," and what we do is then feed "Jane" to the input as well. Now the next query comes from SOS and "Jane," and it asks, given "Jane," what is the most appropriate next word? Let's find the right keys and values that let us generate that word, which hopefully will be "visits." Running this neural network again then generates "Africa," and we feed "Africa" back into the input; hopefully it then generates "in" and then "September," and with this input it generates the end-of-sentence token, and then we're done.

These encoder and decoder blocks, and how they're combined to perform a sequence-to-sequence translation task, are the main ideas behind the transformer architecture. In this case, you saw how you can translate an input sentence into a sentence in another language, to gain some intuition about how attention and neural networks can be combined to allow computation over the whole sequence to happen simultaneously, in parallel. But beyond these main ideas, there are a few extra bells and whistles to transformers. Let me briefly step through the extras that make the transformer network work even better.

The first of these is positional encoding of the input. If you recall the self-attention equations, there's nothing in them that indicates the position of a word. Is this word the first word in the sentence, in the middle, or the last word? But the position within the sentence can be extremely important to translation. So the way you encode the position of elements in the input is with a combination of sine and cosine equations. Let's say, for example, that your word embedding is a vector with four values; in this case, the dimension d of the word embedding is 4, so x1, x2, x3, and so on are four-dimensional vectors. In this example, we're going to create a positional embedding vector of the same dimension, also four-dimensional, and I'm going to call this positional embedding P1, for the position embedding of the first word, "Jane."

In the equations below, pos denotes the numerical position of the word, so for the word "Jane," pos is equal to 1, and i refers to the different dimensions of the encoding: the first element corresponds to i = 0, the second also to i = 0, and the third and fourth to i = 1, alternating between a sine term and a cosine term. So these are the variables, pos and i, that go into the equations down below, where pos is the position of the word, i goes from 0 to 1, and d is equal to 4, the dimension of the vector. What the positional encoding does with the sines and cosines is create a positional encoding vector that is unique for each position. So the vector P3 that encodes the position of "l'Afrique," the third word, will be a set of four values that is different from the four values used to encode the position of the first word, "Jane."
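For reference, here is a minimal sketch of those sine and cosine equations, following the standard formulation from the "Attention Is All You Need" paper: even dimensions of the positional vector use a sine and odd dimensions use the matched cosine. The function name is my own, and I index positions the way the video does, with pos = 1 for the first word "Jane," although implementations often start counting from 0.

```python
import numpy as np

def positional_encoding(max_pos, d):
    """P[pos, 2i] = sin(pos / 10000**(2i/d)),  P[pos, 2i+1] = cos(pos / 10000**(2i/d))."""
    P = np.zeros((max_pos, d))
    for pos in range(max_pos):
        for i in range(d // 2):
            angle = pos / 10000 ** (2 * i / d)
            P[pos, 2 * i] = np.sin(angle)      # sine term for dimension pair i
            P[pos, 2 * i + 1] = np.cos(angle)  # matched cosine, 90 degrees out of phase
    return P

# d = 4 as in the example; the rows for pos = 1 ("Jane") and pos = 3 ("l'Afrique")
# give the two different four-value vectors the video calls P1 and P3.
P = positional_encoding(max_pos=10, d=4)
P1, P3 = P[1], P[3]
```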
This is what the sine and cosine curves look like. Because of the term in the denominator, i = 0 gives you a sinusoidal curve, and the matching i = 0 cosine is the same curve shifted 90 degrees out of phase; i = 1 then gives you a lower-frequency sinusoid, together with its matched cosine curve. So for P1, for position 1, you read off the values of the four curves at that position to fill in the four values of the vector, whereas for a different word at a different position, say 3 on the horizontal axis, you read off a different set of values. Notice the first two values may be quite similar because they're roughly the same height, but by using these multiple sines and cosines and looking across all four values, P3 will be a different vector than P1.

So the positional encoding P1 is added directly to x1, to the input, so that each of the word vectors is also influenced, or colored, by where in the sentence the word appears. The output of the encoding block then contains the contextual semantic embedding as well as the positional encoding information. The output of the embedding layer is d, which in this case is 4, by the maximum length of sequence your model can take, and the outputs of all the later layers are also of this shape. In addition to adding these positional encodings to the embeddings, you also pass them through the network with residual connections. These residual connections are similar to those you've previously seen in ResNets, and their purpose in this case is to pass positional information along through the entire architecture.

In addition to positional encoding, the transformer network also uses a layer called Add & Norm that is very similar to the batch norm layer you're already familiar with. For the purpose of this video, don't worry about the differences; think of it as playing a role very similar to batch norm, which helps speed up learning. This batch-norm-like Add & Norm layer is repeated throughout the architecture. And finally, for the output of the decoder block, there's also a linear layer and then a softmax layer to predict the next word, one word at a time.

In case you read the literature on the transformer network, you may also hear of something called masked multi-head attention, which I'm going to draw over here. Masked multi-head attention is important only during the training process, where you're using a dataset of correct French-to-English translations to train your transformer. Previously, we stepped through how the transformer performs prediction one word at a time, but how does it train? Let's say your dataset has the correct French-to-English translation, "Jane visits Africa in September." When training, you have access to the entire correct English translation, so you have both the correct input and the correct output, and because you have the full correct output, you don't actually have to generate the words one at a time during training. Instead, what masking does is block out the last part of the sentence to mimic what the network will need to do at test time, during prediction. In other words, all that masked multi-head attention does is repeatedly pretend that the network had perfectly translated, say, the first few words, and hide the remaining words to see whether, given a perfect first part of the translation, the neural network can accurately predict the next word in the sequence.

So that's the summary of the transformer architecture. Since the paper "Attention Is All You Need" came out, there have been many other iterations of this model, such as BERT and DistilBERT, which you get to explore yourself this week. That was a lot of detail, but now you have a good sense of all of the major building blocks of the transformer network, and when you see this in this week's programming exercise, playing around with the code there will help you build even deeper intuition about how to make this work for your applications.
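As one last illustration of the masked multi-head attention described above, here is a short sketch of a look-ahead mask, which is what lets the decoder see the whole correct translation during training while still attending only to earlier positions when predicting each next word. The shapes, names, and the use of PyTorch's nn.MultiheadAttention here are my own illustrative assumptions, not the code from the programming exercise.

```python
import torch
import torch.nn as nn

T, d_model, num_heads = 5, 512, 8      # T = length of the target sentence

# Look-ahead mask: True above the diagonal means "this position is hidden".
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
y = torch.randn(1, T, d_model)         # embeddings of the correct target words

# Each position attends only to itself and earlier positions, mimicking
# one-word-at-a-time prediction even though the full correct translation
# is fed in at once during training.
out, _ = self_attn(y, y, y, attn_mask=mask)
```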