In this lesson, you'll use PyTorch to code a class that implements self-attention, masked self-attention, and encoder-decoder attention. You'll also code a class that implements multi-head attention. Let's code.

Just like before, we're going to import torch, torch.nn, and torch.nn.functional.

Now we'll code a class that implements encoder-decoder attention. Actually, we're going to code a class that implements all three types of attention that we've learned about: self-attention, masked self-attention, and encoder-decoder attention. So we start by defining a class called Attention that inherits from nn.Module, and the init method is identical to what we've already coded twice before. Bam! The forward method has two changes. First, we can now specify different encodings for the queries, keys, and values. And second, we pass those potentially different encodings to the matrices that create the queries, keys, and values. Everything else is the same.

Now let's run some numbers through it and make sure it works as expected. We'll start with the same matrix of encodings for the same tokens that we used before, except now we're specifying that they are for creating the queries. Then we create a matrix of encodings for making keys and a matrix of encodings for making values. Note that in this example I'm making the encoded values the same, so we can easily compare the result to what we did before. Then we set the seed for the random number generator and create an object from our Attention class using the same parameters we used before. Lastly, we pass the encodings to our Attention object, and these are the attention values. Bam! As always, if you got something different, you can verify the results like we did before.

Now let's talk about how to code multi-head attention. We'll start by defining a class called MultiHeadAttention that inherits from nn.Module, and in the init method we'll add one new parameter, num_heads, the number of attention heads we want. As always, the next thing we do is call the parent's init method. Then we use a for loop to create num_heads Attention objects. Each Attention object that we create is initialized with the same values for d_model, row_dim, and col_dim, and we store them in a ModuleList called heads. A ModuleList is just what it sounds like: a list of modules that we can index. The last thing we do in the init method is save the col_dim parameter.

The forward method takes the encoding matrices and then uses a for loop to pass the matrices on to each attention head. The attention values returned by each head are then concatenated and returned. And that's all there is to coding multi-head attention. Double bam!

Now let's run some numbers through it and make sure it works as expected. So we set the seed for the random number generator and then create and initialize a MultiHeadAttention object. The parameters d_model, row_dim, and col_dim are the same as before, and we're setting num_heads to one just to see if we can get the same results as before. We'll change this value later. Then we pass in the encoding matrices that we made earlier, and we get the same results as before. Bam!

And now let's do the same thing with two heads. So we reset the seed for the random number generator and create a new MultiHeadAttention object that has num_heads equal to two. Then we pass in the encoding matrices, and we get twice as many attention values as before. Triple bam!
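To make the walkthrough above concrete, here is a minimal sketch of what the two classes might look like. The parameter names d_model, row_dim, col_dim, and num_heads come straight from the narration; the weight names (W_q, W_k, W_v), the optional mask argument, and the other details are illustrative assumptions, not necessarily the lesson's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """Sketch of a class covering self-attention, masked self-attention,
    and encoder-decoder attention (details assumed)."""

    def __init__(self, d_model=2, row_dim=0, col_dim=1):
        super().__init__()
        # Weight matrices for creating queries, keys, and values.
        self.W_q = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_k = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_v = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.row_dim = row_dim
        self.col_dim = col_dim

    def forward(self, encodings_for_q, encodings_for_k, encodings_for_v, mask=None):
        # Potentially different encodings create the queries, keys, and values.
        q = self.W_q(encodings_for_q)
        k = self.W_k(encodings_for_k)
        v = self.W_v(encodings_for_v)

        # Scaled dot-product similarities between queries and keys.
        sims = torch.matmul(q, k.transpose(dim0=self.row_dim, dim1=self.col_dim))
        scaled_sims = sims / (k.size(self.col_dim) ** 0.5)

        # Optional mask, e.g. for masked self-attention in a decoder.
        if mask is not None:
            scaled_sims = scaled_sims.masked_fill(mask=mask, value=-1e9)

        attention_percents = F.softmax(scaled_sims, dim=self.col_dim)
        attention_scores = torch.matmul(attention_percents, v)
        return attention_scores


class MultiHeadAttention(nn.Module):
    """Sketch of multi-head attention built from the Attention class above."""

    def __init__(self, d_model=2, row_dim=0, col_dim=1, num_heads=1):
        super().__init__()
        # A ModuleList of num_heads attention heads, each initialized with
        # the same d_model, row_dim, and col_dim.
        self.heads = nn.ModuleList(
            [Attention(d_model, row_dim, col_dim) for _ in range(num_heads)]
        )
        self.col_dim = col_dim

    def forward(self, encodings_for_q, encodings_for_k, encodings_for_v):
        # Run each head and concatenate the attention values along col_dim.
        return torch.cat(
            [head(encodings_for_q, encodings_for_k, encodings_for_v)
             for head in self.heads],
            dim=self.col_dim,
        )
```

And a quick usage sketch along the lines of the checks described above. The specific encoding values and seed are placeholders, so your numbers will differ from the video; the point is that with num_heads=1 the output matches the plain Attention object, and with num_heads=2 you get twice as many attention values.

```python
# Hypothetical encoding matrices (same values for queries, keys, and values
# so the result can be compared directly to the single-class version).
encodings_for_q = torch.tensor([[1.16, 0.23],
                                [0.57, 1.36],
                                [4.41, -2.16]])
encodings_for_k = encodings_for_q.clone()
encodings_for_v = encodings_for_q.clone()

# Single attention object.
torch.manual_seed(42)  # placeholder seed
attention = Attention(d_model=2, row_dim=0, col_dim=1)
print(attention(encodings_for_q, encodings_for_k, encodings_for_v))

# One head should reproduce the result above.
torch.manual_seed(42)
one_head = MultiHeadAttention(d_model=2, row_dim=0, col_dim=1, num_heads=1)
print(one_head(encodings_for_q, encodings_for_k, encodings_for_v))

# Two heads produce twice as many attention values.
torch.manual_seed(42)
two_heads = MultiHeadAttention(d_model=2, row_dim=0, col_dim=1, num_heads=2)
print(two_heads(encodings_for_q, encodings_for_k, encodings_for_v))
```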