In the last video, you learned about the GRU, the Gated Recurrent Unit, and how that can allow you to learn very long-range connections in a sequence. The other type of unit that allows you to do this very well is the LSTM, or the Long Short-Term Memory unit, and this is even more powerful than the GRU. Let's take a look.

Here are the equations from the previous video for the GRU. For the GRU, we had a^<t> equals c^<t>, two gates, the update gate and the relevance gate, and c tilde^<t>, a candidate for replacing the memory cell; we then used the update gate Gamma_u to decide whether or not to update c^<t> using c tilde^<t>. The LSTM is an even slightly more powerful and more general version of the GRU, and it's due to Sepp Hochreiter and Jürgen Schmidhuber. That was a really seminal paper with a huge impact on sequence modeling. Although I think it's one of the more difficult papers to read, going quite deep into the theory of vanishing gradients, I suspect more people have learned about the details of the LSTM from other places than from this particular paper, even though it has had a wonderful impact on the deep learning community.

These are the equations that govern the LSTM. We'll continue to have a memory cell c^<t>, and the candidate value for updating it will be c tilde^<t> equals tanh of W_c times [a^<t-1>, x^<t>] plus b_c. Notice that for the LSTM, we will no longer have the case that a^<t> is equal to c^<t>. This equation is just like the one on the left for the GRU, except that we now use a^<t-1> instead of c^<t-1>, and we're not using the relevance gate Gamma_r; you can have a variation of the LSTM that puts it back in, but the more common version of the LSTM doesn't bother with it. Then we have an update gate, same as before: Gamma_u equals sigmoid of W_u times [a^<t-1>, x^<t>] plus b_u. One new property of the LSTM is that instead of having one update gate control both terms of the memory cell update, we have two separate gates: instead of Gamma_u and 1 minus Gamma_u, we have Gamma_u and a forget gate, which we'll call Gamma_f. This gate Gamma_f is sigmoid of pretty much what you'd expect, W_f times [a^<t-1>, x^<t>] plus b_f. Then we have a new output gate, Gamma_o, which is sigmoid of W_o times [a^<t-1>, x^<t>] plus b_o. The update to the memory cell is c^<t> equals Gamma_u, element-wise multiplied with c tilde^<t> (the asterisk denotes element-wise, vector-vector multiplication), plus, instead of 1 minus Gamma_u, the separate forget gate Gamma_f times c^<t-1>. This gives the memory cell the option of keeping the old value c^<t-1> and then just adding to it this new value c tilde^<t>, using separate update and forget gates. The subscripts u, f, and o stand for update, forget, and output. Finally, instead of a^<t> equals c^<t>, we have a^<t> equals the output gate Gamma_o, element-wise multiplied with c^<t>. So these are the equations that govern the LSTM. You can tell it has three gates instead of two, so it's a bit more complicated, and it places the gates in slightly different places. Here again are the equations governing the behavior of the LSTM. Once again, it's traditional to explain these things using pictures, so let me draw one here, and if these pictures seem too complicated, don't worry about it.
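To make these equations concrete, here is a minimal NumPy sketch of a single LSTM forward step. The function name lstm_cell_forward, the params dictionary, and the shapes are illustrative assumptions, not lecture notation; the body simply transcribes the equations above.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, params):
    """One LSTM time step, following the equations described above.

    Assumed shapes (for illustration only):
        x_t:    (n_x, 1)  input x^<t>
        a_prev: (n_a, 1)  hidden state a^<t-1>
        c_prev: (n_a, 1)  memory cell c^<t-1>
    params holds weight matrices Wc, Wu, Wf, Wo of shape (n_a, n_a + n_x)
    and biases bc, bu, bf, bo of shape (n_a, 1).
    """
    # Stack a^<t-1> and x^<t> into one column vector [a^<t-1>, x^<t>]
    concat = np.vstack([a_prev, x_t])

    # Candidate value for the memory cell: tanh(Wc [a^<t-1>, x^<t>] + bc)
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])

    # Update, forget, and output gates (three gates instead of the GRU's two)
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])
    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])

    # Memory cell update: separate update and forget gates, element-wise products
    c_t = gamma_u * c_tilde + gamma_f * c_prev

    # Hidden state: output gate applied element-wise to the memory cell
    # (many implementations use gamma_o * np.tanh(c_t) here instead)
    a_t = gamma_o * c_t

    return a_t, c_t
```

Notice that when gamma_u is close to 0 and gamma_f is close to 1, c_t stays very close to c_prev, which is exactly the memorization behavior discussed next.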
I personally find the equations easier to understand than the picture, but I'll show the picture here for the intuition it conveys. This particular picture was very much inspired by a blog post by Chris Olah, titled Understanding LSTM Networks, and the diagram drawn here is quite similar to one that he drew in his blog post. The key thing to take away from this picture is that you use a^<t-1> and x^<t> to compute all the gate values. In this picture, a^<t-1> and x^<t> come together to compute the forget gate, the update gate, and the output gate, and they also go through a tanh to compute c tilde^<t>. These values are then combined in somewhat complicated ways, with element-wise multiplies and so on, to get c^<t> from the previous c^<t-1>.

One interesting aspect is what happens when you take a bunch of these units and connect them temporally. So there's the input x^<1>, then x^<2>, then x^<3>, and you can hook the units up as follows, where the output a from one time step is the input a at the next time step, and similarly for c. I've simplified the diagrams a little bit at the bottom. One cool thing you'll notice is the line running along the top, which shows how, so long as you set the forget and update gates appropriately, it is relatively easy for the LSTM to take some value c^<0> and have it passed all the way to the right, so that, say, c^<3> equals c^<0>. This is why the LSTM, as well as the GRU, is very good at memorizing certain values: real values stored in the memory cell can be preserved even for many, many time steps.

As you can imagine, there are also a few variations on this that people use. Perhaps the most common one is that instead of having the gate values depend only on a^<t-1> and x^<t>, sometimes people also sneak in the value c^<t-1> as well. This is called a peephole connection; not a great name, maybe, but you will see the term. What it means is that the gate values depend not just on a^<t-1> and x^<t>, but also on the previous memory cell value, and the peephole connection can go into all three of the gate computations. One technical detail: these are, say, 100-dimensional vectors, so if you have a 100-dimensional hidden memory cell unit, then so are the gates, and the, say, fifth element of c^<t-1> affects only the fifth element of the corresponding gates. The relationship is one-to-one, so it's not that every element of the 100-dimensional c^<t-1> can affect all elements of the gates; rather, the first element of c^<t-1> affects the first element of the gates, the second element affects the second element, and so on. If you ever read a paper and see someone talk about a peephole connection, that's what they mean: c^<t-1> is used to affect the gate value as well.
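Two of the points above are easy to see in code: chaining the cell over time, and the peephole variation in which the gates also look at c^<t-1>. The sketch below is only illustrative; it reuses the hypothetical lstm_cell_forward and sigmoid from the earlier snippet, and the peephole parameter names pu, pf, po are assumptions rather than lecture notation.

```python
import numpy as np  # assumes lstm_cell_forward and sigmoid from the sketch above

def lstm_forward(x_seq, a0, c0, params):
    """Chain the cell temporally: the a^<t> and c^<t> produced at one
    time step are fed in as a^<t-1> and c^<t-1> at the next."""
    a_t, c_t = a0, c0
    hidden_states = []
    for x_t in x_seq:  # x_seq: list of (n_x, 1) inputs x^<1>, x^<2>, ...
        a_t, c_t = lstm_cell_forward(x_t, a_t, c_t, params)
        hidden_states.append(a_t)
    # With the forget gate near 1 and the update gate near 0 at every step,
    # c_t can remain essentially equal to c0, as on the top line of the diagram.
    return hidden_states, c_t

def peephole_gates(x_t, a_prev, c_prev, params):
    """Peephole variation: each gate also sees the previous memory cell.
    The peephole terms are element-wise products with vectors pu, pf, po
    of shape (n_a, 1), so the i-th element of c^<t-1> affects only the
    i-th element of each gate, matching the one-to-one relationship above."""
    concat = np.vstack([a_prev, x_t])
    gamma_u = sigmoid(params["Wu"] @ concat + params["pu"] * c_prev + params["bu"])
    gamma_f = sigmoid(params["Wf"] @ concat + params["pf"] * c_prev + params["bf"])
    gamma_o = sigmoid(params["Wo"] @ concat + params["po"] * c_prev + params["bo"])
    return gamma_u, gamma_f, gamma_o
```

In some peephole formulations the output gate looks at the updated cell c^<t> rather than c^<t-1>; the version here follows the lecture's description, in which c^<t-1> can feed into all three gates.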
So that's it for the LSTM. When should you use a GRU, and when should you use an LSTM? There isn't widespread consensus on this. Even though I presented GRUs first, in the history of deep learning LSTMs actually came much earlier, and GRUs are a relatively recent invention that was derived partly as a simplification of the more complicated LSTM model. Researchers have tried both of these models on many different problems, and on different problems different algorithms win out, so there isn't a universally superior algorithm, which is why I want to show you both of them. My feeling from using these is that the advantage of the GRU is that it's a simpler model, so it's easier to build a much bigger network; having only two gates, it also runs a bit faster computationally, so it scales to somewhat bigger models. The LSTM, on the other hand, is more powerful and more flexible, since it has three gates instead of two. If you had to pick one, I think the LSTM has been the historically more proven choice, so most people today would still use the LSTM as the default first thing to try, although in the last few years GRUs have been gaining a lot of momentum, and I feel that more and more teams are also using GRUs, because they're a bit simpler, often work just as well, and can be easier to scale to even bigger problems. So that's it for LSTMs. With either GRUs or LSTMs, you'll be able to build neural networks that can capture much longer-range dependencies.