In the last video, you saw how the attention model allows a neural network to pay attention to only part of an input sentence while generating a translation, much like a human translator might. Let's now formalize that intuition into the exact details of how you would implement an attention model.

So, same as in the previous video, let's assume you have an input sentence and you use a bidirectional RNN, or bidirectional GRU, or bidirectional LSTM to compute features on every word. In practice, GRUs and LSTMs are often used for this, with maybe LSTMs being more common. And so, for the forward recurrence, you would have the forward activation for the first time step, the backward activation for the first time step, the forward activation for the second time step, the backward activation for the second time step, and so on, all the way up to the forward and backward activations for the fifth time step. We have a<0> here, and technically we could also have a backward a<6> as a vector of all zeros; actually, it's a vector of all zeros. Then, to simplify the notation going forward, at every time step, even though you have the features computed from the forward recurrence and from the backward recurrence in a bidirectional RNN, I'm just going to use a<t> to represent both of these concatenated together. So a<t> is going to be our feature vector for time step t. Although, to be consistent with the notation we'll use in a second, I'm going to call this t prime; I'm actually going to use t prime to index into the words in the French sentence.

Next, we have our forward-only, single-direction RNN with state s to generate the translation. And so at the first time step, it should generate y<1>, and this will have as input some context c. If you want to index it with time, I guess you could write c<1>, but sometimes I just write c without the superscript 1. And this will depend on the attention weights, alpha<1,1>, alpha<1,2>, and so on, which tell us how much attention to pay. So these alpha parameters tell us how much the context will depend on the features, the activations, we're getting from the different time steps. And so the way we'll define the context is that it'll actually be a weighted sum of the features from the different time steps, weighted by these attention weights.

So, more formally, the attention weights will satisfy this: they'll all be non-negative, so greater than or equal to zero, and they'll sum to 1. We'll see later how to make sure this is true. And we will have that the context, or the context at time 1, I'll often drop that superscript, is going to be the sum, over all values of t prime, of alpha<1, t'> times a<t'>. So this term here is the attention weight, and this term here comes from the features. And so alpha<t, t'> is the amount of attention that y<t> should pay to a<t'>. In other words, when you're generating the t-th output word, how much should you be paying attention to the t prime-th input word? So that's one step of generating the output. Then at the next time step, you generate the second output, and it's again done similarly, where now you have a new set of attention weights that define a new weighted sum, which generates a new context; this is also an input, and that allows you to generate the second word. Only now, the weighted sum that becomes the context of the second time step is the sum over t prime of alpha<2, t'> times a<t'>.
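To make that weighted sum concrete, here's a minimal NumPy sketch of the context computation just described. The shapes and variable names (a, alphas, Tx, the feature size) are illustrative assumptions rather than anything from the lecture: a stacks the concatenated forward and backward encoder activations a<t'>, and alphas holds the attention weights alpha<t, t'> for one output step t.

```python
# Minimal sketch of c<t> = sum over t' of alpha<t, t'> * a<t'> (assumed shapes).
import numpy as np

def context_vector(alphas, a):
    """alphas: (Tx,) attention weights for one output step; a: (Tx, 2*n_a) encoder features."""
    # The weights are non-negative and sum to 1, as stated in the lecture.
    assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
    return alphas @ a  # (Tx,) @ (Tx, 2*n_a) -> (2*n_a,) weighted sum of features

# Toy usage: Tx = 5 input words, concatenated encoder feature size 2*n_a = 8.
a = np.random.randn(5, 8)
alphas = np.array([0.1, 0.6, 0.2, 0.05, 0.05])  # hypothetical weights for one output step
c = context_vector(alphas, a)
print(c.shape)  # (8,)
```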
So using these context vectors, c<1>, let's write that back in, c<2>, and so on, this network up here looks like a pretty standard RNN sequence with the context vectors as input, and we can just generate the translation one word at a time. We have also defined how to compute the context vectors in terms of these attention weights and those features of the input sentence. So the only remaining thing to do is to define how to actually compute these attention weights. Let's do that on the next slide.

So just to recap, alpha<t, t'> is the amount of attention you should pay to a<t'> when you're trying to generate the t-th word in the output translation. Let me just write down the formula and talk about how this works. This is a formula you can use to compute alpha<t, t'>: we're going to compute these terms e<t, t'> and then use essentially a softmax to make sure that the weights sum to one when you sum over t prime. So for every fixed value of t, these things sum to one if you're summing over t prime, and using this softmax parametrization just ensures they properly sum to one.

Now, how do you compute these factors e? Well, one way to do so is to use a small neural network, as follows. s<t-1> is the neural network state from the previous time step. So here's the network we have: if you're trying to generate y<t>, then s<t-1> is the hidden state from the previous step that's fed into s<t>, and that's one input to the very small neural network, usually a one-hidden-layer neural network, because you need to compute these a lot. And then a<t'>, the feature from time step t prime, is the other input. The intuition is, if you want to decide how much attention to pay to the activation at t prime, the thing it seems like it should depend on most is your own hidden state activation from the previous time step. You don't have the current state activation yet, because the context feeds into it, so you haven't computed that. But look at whatever your hidden state is of this RNN generating the output translation, and then, for each of the positions, each of the words, look at their features. So it seems pretty natural that alpha<t, t'> and e<t, t'> should depend on these two quantities. But we don't know what the function is, so one thing we can do is just train a very small neural network to learn whatever this function should be, and trust backpropagation, trust gradient descent, to learn the right function.

And it turns out that if you implement this whole model and train it with gradient descent, the whole thing actually works. This little neural network does a pretty decent job telling you how much attention y<t> should pay to a<t'>, and this formula makes sure that the attention weights sum to one. Then, as you chug along, generating one word at a time, this neural network actually pays attention to the right parts of the input sentence and learns all of this automatically using gradient descent.

Now, one downside of this algorithm is that it does take quadratic time, or quadratic cost, to run. If you have Tx words in the input and Ty words in the output, then the total number of these attention parameters is going to be Tx times Ty, and so this algorithm runs in quadratic cost. Although in machine translation applications, where neither the input nor the output sentence is usually that long, maybe quadratic cost is actually acceptable, although there is some research work on trying to reduce this cost as well.
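Here is a minimal NumPy sketch of that alignment step, under assumptions of my own: the lecture only says the energies e<t, t'> come from a small, usually one-hidden-layer, network taking s<t-1> and a<t'>, so the tanh hidden layer, the parameter names W and v, and all the sizes below are illustrative choices, not the course's exact parametrization.

```python
# Sketch of e<t, t'> from a small one-hidden-layer network, then softmax over t'.
import numpy as np

def attention_weights(s_prev, a, W, v):
    """s_prev: (n_s,) decoder state s<t-1>; a: (Tx, 2*n_a) encoder features;
    W: (n_s + 2*n_a, n_hidden) and v: (n_hidden,) are assumed alignment parameters."""
    Tx = a.shape[0]
    # Pair s<t-1> with every encoder feature a<t'>: shape (Tx, n_s + 2*n_a).
    inputs = np.concatenate([np.tile(s_prev, (Tx, 1)), a], axis=1)
    hidden = np.tanh(inputs @ W)          # the small hidden layer
    e = hidden @ v                        # one scalar energy e<t, t'> per input word
    e = e - e.max()                       # numerical stability before exponentiating
    return np.exp(e) / np.exp(e).sum()    # softmax: non-negative weights summing to 1

# Toy usage: Tx = 5, decoder state size 6, encoder feature size 8, hidden layer size 10.
rng = np.random.default_rng(0)
a = rng.standard_normal((5, 8))
s_prev = rng.standard_normal(6)
W = rng.standard_normal((6 + 8, 10))
v = rng.standard_normal(10)
alphas = attention_weights(s_prev, a, W, v)
print(alphas, alphas.sum())  # five non-negative weights that sum to 1
```

In a real model these weights would then feed the context computation shown earlier, and W and v would be learned by gradient descent along with everything else; computing this for every (t, t') pair is exactly where the Tx times Ty quadratic cost comes from.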
Now, so far I've been describing the attention idea in the context of machine translation. Without going too much into detail, this idea has been applied to other problems as well, such as image captioning. In the image captioning problem, the task is to look at a picture and write a caption for that picture. In the paper cited at the bottom, by Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio, the authors showed that you could have a very similar architecture look at a picture and pay attention to only parts of the picture at a time while writing a caption for it. So if you're interested, I encourage you to take a look at that paper as well.

And you get to play with all this more in the programming exercise. Whereas machine translation is a very complicated problem, in the programming exercise you get to implement and play with the attention model yourself for the date normalization problem. That is, the problem of taking a date like this, this is actually the date of the Apollo moon landing, and normalizing it into a standard format, or a date like this, this, by the way, is believed to be the birth date of William Shakespeare, and having a neural network, a sequence-to-sequence model, normalize it to this format. What you'll see in the programming exercise is that you can train a neural network to take dates in any of these formats as input and have it use an attention model to generate a normalized format for these dates.

One other thing that's sometimes fun to do is to look at visualizations of the attention weights. So here's a machine translation example, and here we've plotted in different colors the magnitudes of the different attention weights. I don't want to spend too much time on this, but you find that for corresponding input and output words, the attention weights tend to be high, suggesting that when the model is generating a specific word in the output, it's usually paying attention to the correct word in the input. And all of this, including learning where to pay attention and when, was learned using backpropagation with an attention model.

So that's it for the attention model, really one of the most powerful ideas in deep learning. I hope you enjoy implementing and playing with some of these ideas yourself later in this week's programming exercises.
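If you want to try that kind of attention-weight plot yourself, here's a minimal matplotlib sketch. It assumes you have already collected the alpha<t, t'> values from a trained model; the weight matrix and the mostly-diagonal pattern below are made up for illustration (the French/English word pair is the lecture's running example), not output from an actual model.

```python
# Sketch of visualizing attention weights as a heatmap (hypothetical values).
import numpy as np
import matplotlib.pyplot as plt

input_words = ["jane", "visite", "l'afrique", "en", "septembre"]
output_words = ["jane", "visits", "africa", "in", "september"]

# Hypothetical weights: rows = output step t, columns = input step t'; each row sums to 1.
alphas = np.eye(5) * 0.8 + 0.04

fig, ax = plt.subplots()
im = ax.imshow(alphas, cmap="viridis")
ax.set_xticks(range(len(input_words)))
ax.set_xticklabels(input_words, rotation=45, ha="right")
ax.set_yticks(range(len(output_words)))
ax.set_yticklabels(output_words)
ax.set_xlabel("input word (t')")
ax.set_ylabel("output word (t)")
fig.colorbar(im, ax=ax, label="attention weight")
plt.tight_layout()
plt.show()
```

With a well-trained model you would expect the bright cells to line up with corresponding input and output words, which is exactly the pattern described in the lecture.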