In this lesson, you'll use PyTorch to code a class that implements self-attention. You'll also run some numbers through it and verify that the calculations are correct. All right, let's code.

The first thing we do is import torch to create the tensors we will use to store the raw data and to provide a few helper functions. Note: if you are not already familiar with tensors, just think of them as multi-dimensional lists optimized for neural networks. Then we import torch.nn for the Module and Linear classes and a bunch of other helper functions. Then we import torch.nn.functional to access the softmax function that we will use when calculating attention.

Now we'll code a class that implements self-attention. We start by defining a class called SelfAttention that inherits from nn.Module, the base class for all neural network modules that you make with PyTorch.

The first thing we'll do in our new class is create an __init__ method. In this case, we're passing the __init__ method d_model, the dimension of the model, or the number of word embedding values per token. We'll use d_model to define the size of the weight matrices that we'll use to create the queries, keys, and values. For example, if d_model equals 2, then we're using two word embedding values per token, and thus, after adding positional encoding to each word embedding, we have two encoded values per token. To create the queries, we'll multiply the encoded values for each token by a 2-by-2 weight matrix. We're also passing in some convenience parameters so that we can easily modify the row and column indexes in our data. Usually the first dimension is the batch size, but in this example we won't be using batches of data. However, at some point in the future we might want to use batches, and we can adjust these parameters then. Next, we call the parent's __init__ method. Otherwise, there's no point inheriting from a class to begin with.

Now, in order to create the weight matrix that we will use to calculate the query values, q, we use nn.Linear, which will create the weight matrix and do the math for us. in_features defines how many rows are in the weight matrix, so we set it to d_model, and out_features defines the number of columns in the weight matrix, so we set it to d_model as well. Lastly, in the original Transformer manuscript they don't add additional bias terms when calculating attention, so we won't either, by setting bias=False. As a result, we end up with an object we're calling W_q that holds the currently untrained weights needed to calculate query values. And because W_q is a Linear object, it doesn't just store the weights, it will also do the math for us when the time comes. Note: just as a reminder, I've labeled the query weights matrix with the transpose symbol because of how PyTorch prints out the weights. Also note, there are really no rules about the shape of the weight matrices, so feel free to modify the code later to have other dimensions. Just make sure that the matrix multiplication still works out.

Then we do the exact same thing to create a Linear object, W_k, that contains the weights needed to calculate the keys, and then we create a Linear object, W_v, to calculate the values. The last thing we do in the __init__ method is save the row and column indexes. We then add a forward method to the SelfAttention class.
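To make all of that concrete, here is a minimal sketch of the whole class, including the forward method described next. The names SelfAttention, row_dim, and col_dim are my own choices for the class and the convenience index parameters; only d_model, W_q, W_k, and W_v come directly from the walkthrough.

```python
import torch                      # tensors and a few helper functions
import torch.nn as nn             # nn.Module and nn.Linear
import torch.nn.functional as F   # softmax


class SelfAttention(nn.Module):

    def __init__(self, d_model=2, row_dim=0, col_dim=1):
        # d_model = number of word embedding values per token.
        # row_dim, col_dim = convenience indexes so we can switch to
        # batched data later without rewriting the math.
        super().__init__()

        # Untrained weight matrices for the queries, keys, and values.
        # nn.Linear stores the weights and does the matrix multiplication
        # for us; bias=False matches the original Transformer manuscript,
        # which does not add bias terms when calculating attention.
        self.W_q = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_k = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_v = nn.Linear(in_features=d_model, out_features=d_model, bias=False)

        self.row_dim = row_dim
        self.col_dim = col_dim

    def forward(self, token_encodings):
        # token_encodings = word embeddings + positional encoding per token
        q = self.W_q(token_encodings)   # queries
        k = self.W_k(token_encodings)   # keys
        v = self.W_v(token_encodings)   # values

        # Similarities between all possible combinations of queries and keys...
        sims = torch.matmul(q, k.transpose(dim0=self.row_dim, dim1=self.col_dim))

        # ...scaled by the square root of the number of values used in each key...
        scaled_sims = sims / torch.tensor(k.size(self.col_dim) ** 0.5)

        # ...turned into percentages of influence with softmax...
        attention_percents = F.softmax(scaled_sims, dim=self.col_dim)

        # ...and used to compute a weighted sum of the values.
        attention_scores = torch.matmul(attention_percents, v)

        return attention_scores
```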
The forward method is where we actually calculate the self-attention values for each token, and we're passing it the token encodings, which are the word embeddings plus positional encoding for each input token. Now we pass the token encodings to W_q, the query weight matrix, and that does the matrix multiplication and returns the queries, stored in a variable called q. Then we calculate the keys and store them in k, and we calculate the values and store them in v.

Now, using the matrices we just created, we calculate self-attention. We start by using torch.matmul to multiply q by the transpose of k to calculate the similarities between all possible combinations of queries and keys. Then we scale the similarities by the square root of the number of values used in each key. The next thing we do to calculate attention is run the scaled similarities through a softmax function. Remember, applying the softmax function to the scaled similarities determines the percentages of influence that each token should have on the others, which is why we store the results in a variable called attention_percents. Lastly, we use torch.matmul to multiply the attention percentages by the values in v, and that gives us the final attention scores, stored in attention_scores, which we return. All together, the SelfAttention class looks just like the sketch above. Bam!

And now let's run some numbers through it and make sure it works as expected. We'll start by using torch.tensor to create a matrix of encodings for three tokens. The result, which we're calling encodings_matrix, looks like this. Now we're going to seed the random number generator with torch.manual_seed so that, hopefully, we'll all get the same results. Then we create an object from our SelfAttention class with d_model set to 2, the row index set to 0, and the column index set to 1. This creates a new object, which we save as self_attention, that has initialized the weight matrices we need to calculate the queries, keys, and values. Lastly, we pass the encodings_matrix to self_attention. Because the SelfAttention class inherits from nn.Module, it will pass the matrix to the forward method that we wrote, and the result should be a matrix of self-attention values that looks like this. Double bam!

Note: if you got something else, don't panic. In just a bit, we'll validate that the math was done correctly regardless of the output we see here. Also note, this last bit of the tensor is used for training the weights with backpropagation. Since we're only coding a self-attention class and not a full transformer, we won't actually be doing any training. But at least you know what this bit is for. Small bam!

To validate that the math was done correctly, we can start by printing out the weights in the matrix that we used to calculate the queries by transposing the weight property associated with W_q, and we should get a 2-by-2 matrix like this. Likewise, we can print out the weights in W_k and the weights in W_v. Now, combining these weight matrices with the original encodings_matrix, we can calculate the query, key, and value matrices by hand. Or we can pass the encodings_matrix directly to W_q to calculate the queries. In other words, we can extract the weights that the self_attention object is using to do calculations and use them to validate the math that the SelfAttention class performs. Triple bam!
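If you'd like to run that check yourself, here is roughly what it looks like, assuming the SelfAttention class sketched above. The encoding values and the seed below are illustrative placeholders (any small numbers will do), so the exact output will depend on the values and the PyTorch version you use.

```python
# Three tokens, each with two encoded values (word embedding + positional
# encoding). These numbers are placeholders for whatever encodings you use.
encodings_matrix = torch.tensor([[1.16, 0.23],
                                 [0.57, 1.36],
                                 [4.41, -2.16]])

torch.manual_seed(42)   # seed the random number generator

# d_model = 2 encoded values per token; rows are dimension 0, columns are dimension 1
self_attention = SelfAttention(d_model=2, row_dim=0, col_dim=1)

# Because SelfAttention inherits from nn.Module, calling the object passes
# encodings_matrix to the forward() method we wrote.
print(self_attention(encodings_matrix))

# Validate the math: print the (transposed) weight matrices...
print(self_attention.W_q.weight.transpose(0, 1))
print(self_attention.W_k.weight.transpose(0, 1))
print(self_attention.W_v.weight.transpose(0, 1))

# ...and check that multiplying the encodings by the query weights by hand
# gives the same queries that W_q computes for us.
print(torch.matmul(encodings_matrix, self_attention.W_q.weight.transpose(0, 1)))
print(self_attention.W_q(encodings_matrix))
```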