You've seen how a basic RNN works. In this video, you'll learn about the gated recurrent unit, which is a modification to the RNN hidden layer that makes it much better at capturing long-range connections and helps a lot with the vanishing gradient problem. Let's take a look.

You've already seen the formula for computing the activation at time t of an RNN: it's the activation function applied to the parameter W_a times the activation from the previous time step together with the current input, plus a bias. I'm going to draw this as a picture. The RNN unit, drawn as a box, takes as inputs a<t-1>, the activation from the last time step, and x<t>, the current input. These go together, and after the weights and this type of linear calculation, if g is a tanh activation function, then after the tanh it computes the output activation a<t>. The output activation a<t> might also be passed to, say, a softmax unit that could then be used to output y-hat<t>. So this is a visualization of the RNN unit, the hidden layer of the RNN, as a picture. I show you this picture because we're going to use a similar picture to explain the GRU, the gated recurrent unit.

A lot of the ideas of GRUs are due to two papers, by Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. I'm also going to keep referring to the sentence we saw in the last video to motivate the idea: given a sentence like this, you might need to remember that the cat was singular in order to understand why the sentence says "was" rather than "were": so "the cat ... was full", or "the cats ... were full".

As we read the sentence from left to right, the GRU unit is going to have a new variable called c, which stands for memory cell. What the memory cell does is provide a bit of memory to remember, for example, whether the cat was singular or plural, so that when the network gets much further into the sentence, it can still take into account whether the subject of the sentence was singular or plural. At time t, the memory cell will have some value c<t>, and what we'll see is that the GRU unit will actually output an activation value a<t> that's equal to c<t>. For now, I want to use different symbols, c and a, to denote the memory cell value and the output activation value even though they're the same, because when we talk about LSTMs a little bit later, these will be two different values. But for now, for the GRU, c<t> is equal to the output activation a<t>.

These are the equations that govern the computations of a GRU unit. At every time step, we're going to consider overwriting the memory cell with a value c-tilde<t>, which is a candidate for replacing c<t>. We compute it with a tanh activation function applied to the parameter matrix W_c times the previous value of the memory cell (the activation value) and the current input x<t>, plus a bias. So c-tilde<t> is a candidate for replacing c<t>. And then the key, really the important idea of the GRU, is that we'll have a gate, which I'm going to call Gamma_u: the capital Greek letter gamma with subscript u, where u stands for update. This will be a value between 0 and 1, and to develop your intuition about how GRUs work, think of Gamma_u, this gate value, as being always 0 or 1.
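For reference, here are the RNN activation and the candidate value written out, using the course's angle-bracket superscripts for time steps (this is just a transcription of the spoken formulas):

    a^{<t>} = g(W_a [a^{<t-1>}, x^{<t>}] + b_a)
    \tilde{c}^{<t>} = \tanh(W_c [c^{<t-1>}, x^{<t>}] + b_c)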
In practice, you compute Gamma_u with a sigmoid function applied to the same kind of expression. Remember that the sigmoid's output is always between 0 and 1, and for most of the possible range of its input, the sigmoid is either very close to 0 or very close to 1. So for intuition, think of Gamma_u as being either 0 or 1 most of the time. As for why I chose the letter gamma: if you look at a gated fence, it has a lot of shapes that look a bit like gammas, so that's why we use Gamma_u to denote the gate. It's also the Greek counterpart of G, and G stands for gate.

Next, the key part of the GRU is this equation: we've come up with a candidate c-tilde<t> for updating c, and the gate will decide whether or not we actually update it. The way to think about it is that this memory cell c is going to be set to either 0 or 1 depending on whether the word you're considering, really the subject of the sentence, is singular or plural. Because "cat" is singular, let's say we set it to 1; if it were plural, maybe we'd set it to 0. The GRU unit will then memorize the value of c<t> all the way until "was", where it's still equal to 1, and that tells it the subject was singular, so use "was". The job of the gate Gamma_u is to decide when to update this value. In particular, when you see the phrase "the cat", you know you're talking about a new concept, the subject of the sentence, so that would be a good time to update this bit. And then when you're done using it, say after "the cat ... was full", you know you don't need to memorize it anymore and can just forget it.

The specific equation we'll use for the GRU is the following: the actual value of c<t> will be equal to the gate times the candidate value, plus 1 minus the gate times the old value c<t-1>. Notice that if the gate, this update value, is equal to 1, it's saying: set the new value of c<t> equal to the candidate value. So at "the cat", set the gate equal to 1 and go ahead and update that bit. Then for all of the words in the middle, the gate should be equal to 0, which says: don't update it, just hang on to the old value. Because if Gamma_u is equal to 0, then this term is 0 and that term is 1, so it just sets c<t> equal to the old value even as you scan the sentence from left to right. When the gate is equal to 0, it's saying: don't update it, hang on to the value, and don't forget what it was. That way, even when you get all the way to "was", hopefully you've just been setting c<t> equal to c<t-1> all along, and the network still remembers that the cat was singular.
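Written out, the gate and the update rule just described are as follows (again a transcription of the spoken equations; the gate's parameters W_u and b_u are named here by analogy with W_c and b_c, since the lecture doesn't name them explicitly):

    \Gamma_u = \sigma(W_u [c^{<t-1>}, x^{<t>}] + b_u)
    c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}
    a^{<t>} = c^{<t>}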
Let me also draw a picture to denote the GRU unit. By the way, when you look at online blog posts, textbooks, and tutorials, these types of pictures are quite popular for explaining GRUs as well as, as we'll see later, LSTM units. I personally find the equations easier to understand than the pictures, so if the picture doesn't make sense, don't worry about it; I'll draw it in case it helps some of you. A GRU unit takes as input c<t-1> from the previous time step, which happens to be equal to a<t-1>, and it also takes as input x<t>. These two get combined together, and with some appropriate weights and a tanh, this gives you c-tilde<t>, the candidate for replacing c<t>. Then, with a different set of parameters and a sigmoid activation function, this gives you Gamma_u, the update gate. Finally, all of these things combine together through another operation. I won't write out the formula in the picture, but the box I've shaded in purple represents the update equation we had down there: it takes as input the gate value, the candidate new value, and the old value c<t-1>, and together they generate the new value of the memory cell, c<t>, which is equal to a<t>. If you wish, you could also pass this to a softmax or something to make a prediction y-hat<t>.

So that is the GRU unit, or at least a slightly simplified version of it. What it's remarkably good at is deciding, through the gate, that when you're scanning the sentence from left to right, this is a good time to update one particular memory cell, and then not changing it until you get to the point, possibly much later in the sentence, where you really need to use the memory cell you set up earlier. And because the gate is computed with a sigmoid, it's quite easy to set it to zero: as long as the input to the sigmoid is a large negative value, then up to numerical round-off, the update gate will be essentially zero, very close to zero. When that's the case, the update equation ends up setting c<t> equal to c<t-1>, so the GRU is very good at maintaining the value of the cell. And because Gamma_u can be so close to zero (it can be 0.000001 or even smaller), the unit doesn't suffer much from the vanishing gradient problem: when the gate is that close to zero, c<t> becomes essentially c<t-1>, and the value of c<t> is maintained pretty much exactly even across many, many time steps. This helps significantly with the vanishing gradient problem and therefore allows a neural network to learn even very long-range dependencies, such as "cat" and "was" being related even though they're separated by a lot of words in the middle.

Now, a few more details about how you implement this. In the equations I've written, c<t> can be a vector: if you have a 100-dimensional hidden activation value, then c<t> can be 100-dimensional, and c-tilde<t> and Gamma_u would have the same dimension. In that case, the asterisks in the update equation are element-wise multiplications. So if the gate Gamma_u is a 100-dimensional vector, it's really a 100-dimensional vector of values that are mostly 0 or 1, telling you which of the bits of this 100-dimensional memory cell you want to update. Of course, in practice Gamma_u won't be exactly 0 or 1; sometimes it takes values in the middle as well, but it's convenient for intuition to think of it as mostly taking values that are pretty much exactly 0 or pretty much exactly 1. What these element-wise multiplications do is tell the GRU which dimensions of your memory cell vector to update at every time step, so you can choose to keep some bits constant while updating other bits. For example, maybe you use one bit to remember whether the cat is singular or plural, and some other bits to keep track of the fact that you're talking about food, since the sentence talks about eating, so you'd expect it to talk about whether the cat is full later on. You can use different bits and change only a subset of the bits at every point in time.
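To make the element-wise gating concrete, here is a minimal NumPy sketch of a single step of the simplified GRU described above. The function and variable names (gru_step_simplified, Wc, Wu, and so on) are just illustrative and not taken from the course notebooks.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step_simplified(c_prev, x_t, Wc, bc, Wu, bu):
        # Stack the previous memory cell and the current input: [c<t-1>, x<t>]
        concat = np.concatenate([c_prev, x_t])
        # Candidate value for replacing the memory cell
        c_cand = np.tanh(Wc @ concat + bc)
        # Update gate: an element-wise vector of values in (0, 1), mostly near 0 or 1
        gamma_u = sigmoid(Wu @ concat + bu)
        # Element-wise gating: update some dimensions, carry the rest over from c<t-1>
        c_t = gamma_u * c_cand + (1.0 - gamma_u) * c_prev
        # For the GRU, the output activation equals the memory cell
        a_t = c_t
        return c_t, a_t

    # Example: a 100-dimensional memory cell and a 10-dimensional input
    n_c, n_x = 100, 10
    rng = np.random.default_rng(0)
    Wc = rng.standard_normal((n_c, n_c + n_x))
    Wu = rng.standard_normal((n_c, n_c + n_x))
    bc, bu = np.zeros(n_c), np.zeros(n_c)
    c_t, a_t = gru_step_simplified(np.zeros(n_c), rng.standard_normal(n_x), Wc, bc, Wu, bu)

Dimensions where gamma_u comes out near 1 take their new value from the candidate, while dimensions near 0 simply copy the old memory cell value forward, which is exactly the keep-some-bits, update-other-bits behavior described above.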
You now understand the most important ideas of the GRU, but what I presented on this slide is actually a slightly simplified GRU unit. Let me describe the full GRU unit. To do that, let me copy the three main equations to the next slide. Here they are, and for the full GRU unit I'm going to make just one change: to the first equation, which calculates the candidate new value for the memory cell, I'm going to add one more term. Let me push that a little to the right and add one more gate, Gamma_r. You can think of r as standing for relevance: this gate Gamma_r tells you how relevant c<t-1> is to computing the next candidate c-tilde<t>. And Gamma_r is computed pretty much as you'd expect, with a new parameter matrix W_r applied to c<t-1> and the input x<t>, plus a bias b_r.
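Written out, the full GRU computes the following at each time step (with * denoting element-wise multiplication; like the update gate, the relevance gate uses a sigmoid):

    \Gamma_r = \sigma(W_r [c^{<t-1>}, x^{<t>}] + b_r)
    \tilde{c}^{<t>} = \tanh(W_c [\Gamma_r * c^{<t-1>}, x^{<t>}] + b_c)
    \Gamma_u = \sigma(W_u [c^{<t-1>}, x^{<t>}] + b_u)
    c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}
    a^{<t>} = c^{<t>}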
As you can imagine, there are multiple ways to design these types of units. So why do we have Gamma_r? Why not just use the simpler version from the previous slides? It turns out that over many years, researchers have experimented with many, many different versions of these units, trying to capture longer-range effects and also to address the vanishing gradient problem, and the GRU is one of the most commonly used versions that researchers have converged to and found robust and useful across many different problems. If you wish, you could try to invent new versions of these units, but the GRU is a standard one that's just commonly used, although researchers have also tried other versions that are similar but not exactly the same as what I'm writing down here. The other common version is called the LSTM, which stands for Long Short-Term Memory; we'll talk about it in the next video. GRUs and LSTMs are the two specific instantiations of this set of ideas that are most commonly used.

Just one note on notation. I've tried to define a consistent notation to make these ideas easier to understand. If you look at the academic literature, you'll sometimes see people use an alternative notation, h-tilde, u, r, and h, to refer to these same quantities. I've tried to use a notation that's more consistent between GRUs and LSTMs, and to use gamma consistently for the gates, to hopefully make these ideas easier to understand.

So that's it for the GRU, the gated recurrent unit. It's one of the ideas in RNNs that has enabled them to become much better at capturing very long-range dependencies, and it has made RNNs much more effective. Next, as I briefly mentioned, the other most commonly used variant in this class of ideas is the LSTM, or Long Short-Term Memory unit. Let's take a look at that in the next video.