In the last video, you learned about the GRU, the Gated Recurrent Unit, and how that can allow you to learn very long-range connections in a sequence. The other type of unit that allows you to do this very well is the LSTM, or the Long Short-Term Memory unit, and this is even more powerful than the GRU. Let's take a look.

Here are the equations from the previous video for the GRU. For the GRU, we had a^<t> equals c^<t>, two gates, the update gate and the relevance gate, and c tilde^<t>, a candidate for replacing the memory cell; we then used the update gate Gamma_u to decide whether or not to update c^<t> using c tilde^<t>. The LSTM is an even slightly more powerful and more general version of the GRU, and it's due to Sepp Hochreiter and Jürgen Schmidhuber. That was a really seminal paper with a huge impact on sequence modeling. Although I think it's one of the more difficult papers to read, going quite deep into the theory of vanishing gradients, I suspect more people have learned about the details of the LSTM from other places than from this particular paper, even though it has had a wonderful impact on the deep learning community.

These are the equations that govern the LSTM. We'll continue to have a memory cell c^<t>, and the candidate value for updating it will be c tilde^<t> equals tanh of W_c times [a^<t-1>, x^<t>] plus b_c. Notice that for the LSTM, we will no longer have the case that a^<t> is equal to c^<t>. This equation is just like the one on the left for the GRU, except that we now use a^<t-1> instead of c^<t-1>, and we're not using the relevance gate Gamma_r; you can have a variation of the LSTM that puts it back in, but the more common version of the LSTM doesn't bother with it. Then we have an update gate, same as before: Gamma_u equals sigmoid of W_u times [a^<t-1>, x^<t>] plus b_u. One new property of the LSTM is that instead of having one update gate control both terms of the memory cell update, we have two separate gates: instead of Gamma_u and 1 minus Gamma_u, we have Gamma_u and a forget gate, which we'll call Gamma_f. This gate Gamma_f is sigmoid of pretty much what you'd expect, W_f times [a^<t-1>, x^<t>] plus b_f. Then we have a new output gate, Gamma_o, which is sigmoid of W_o times [a^<t-1>, x^<t>] plus b_o. The update to the memory cell is c^<t> equals Gamma_u, element-wise multiplied with c tilde^<t> (the asterisk denotes element-wise, vector-vector multiplication), plus, instead of 1 minus Gamma_u, the separate forget gate Gamma_f times c^<t-1>. This gives the memory cell the option of keeping the old value c^<t-1> and then just adding to it this new value c tilde^<t>, using separate update and forget gates. The subscripts u, f, and o stand for update, forget, and output. Finally, instead of a^<t> equals c^<t>, we have a^<t> equals the output gate Gamma_o, element-wise multiplied with c^<t>. So these are the equations that govern the LSTM. You can tell it has three gates instead of two, so it's a bit more complicated, and it places the gates in slightly different places. Here again are the equations governing the behavior of the LSTM. Once again, it's traditional to explain these things using pictures, so let me draw one here, and if these pictures seem too complicated, don't worry about it.
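To make these equations concrete, here is a minimal NumPy sketch of a single LSTM forward step. The function name lstm_cell_forward, the params dictionary, and the shapes are illustrative assumptions, not lecture notation; the body simply transcribes the equations above.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, params):
    """One LSTM time step, following the equations described above.

    Assumed shapes (for illustration only):
        x_t:    (n_x, 1)  input x^<t>
        a_prev: (n_a, 1)  hidden state a^<t-1>
        c_prev: (n_a, 1)  memory cell c^<t-1>
    params holds weight matrices Wc, Wu, Wf, Wo of shape (n_a, n_a + n_x)
    and biases bc, bu, bf, bo of shape (n_a, 1).
    """
    # Stack a^<t-1> and x^<t> into one column vector [a^<t-1>, x^<t>]
    concat = np.vstack([a_prev, x_t])

    # Candidate value for the memory cell: tanh(Wc [a^<t-1>, x^<t>] + bc)
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])

    # Update, forget, and output gates (three gates instead of the GRU's two)
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])
    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])

    # Memory cell update: separate update and forget gates, element-wise products
    c_t = gamma_u * c_tilde + gamma_f * c_prev

    # Hidden state: output gate applied element-wise to the memory cell
    # (many implementations use gamma_o * np.tanh(c_t) here instead)
    a_t = gamma_o * c_t

    return a_t, c_t
```

Notice that when gamma_u is close to 0 and gamma_f is close to 1, c_t stays very close to c_prev, which is exactly the memorization behavior discussed next.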
I personally find the equations easier to understand than the picture, but I'll show the picture here for the intuition it conveys. This particular picture was very much inspired by a blog post by Chris Olah, titled Understanding LSTM Networks, and the diagram drawn here is quite similar to one that he drew in his blog post. The key thing to take away from this picture is that you use a^<t-1> and x^<t> to compute all the gate values. In this picture, a^<t-1> and x^<t> come together to compute the forget gate, the update gate, and the output gate, and they also go through a tanh to compute c tilde^<t>. These values are then combined in somewhat complicated ways, with element-wise multiplies and so on, to get c^<t> from the previous c^<t-1>.

One interesting aspect is what happens when you take a bunch of these units and connect them temporally. So there's the input x^<1>, then x^<2>, then x^<3>, and you can hook the units up as follows, where the output a from one time step is the input a at the next time step, and similarly for c. I've simplified the diagrams a little bit at the bottom. One cool thing you'll notice is the line running along the top, which shows how, so long as you set the forget and update gates appropriately, it is relatively easy for the LSTM to take some value c^<0> and have it passed all the way to the right, so that, say, c^<3> equals c^<0>. This is why the LSTM, as well as the GRU, is very good at memorizing certain values: real values stored in the memory cell can be preserved even for many, many time steps.

As you can imagine, there are also a few variations on this that people use. Perhaps the most common one is that instead of having the gate values depend only on a^<t-1> and x^<t>, sometimes people also sneak in the value c^<t-1> as well. This is called a peephole connection; not a great name, maybe, but you will see the term. What it means is that the gate values depend not just on a^<t-1> and x^<t>, but also on the previous memory cell value, and the peephole connection can go into all three of the gate computations. One technical detail: these are, say, 100-dimensional vectors, so if you have a 100-dimensional hidden memory cell unit, then so are the gates, and the, say, fifth element of c^<t-1> affects only the fifth element of the corresponding gates. The relationship is one-to-one, so it's not that every element of the 100-dimensional c^<t-1> can affect all elements of the gates; rather, the first element of c^<t-1> affects the first element of the gates, the second element affects the second element, and so on. If you ever read a paper and see someone talk about a peephole connection, that's what they mean: c^<t-1> is used to affect the gate value as well.
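Two of the points above are easy to see in code: chaining the cell over time, and the peephole variation in which the gates also look at c^<t-1>. The sketch below is only illustrative; it reuses the hypothetical lstm_cell_forward and sigmoid from the earlier snippet, and the peephole parameter names pu, pf, po are assumptions rather than lecture notation.

```python
import numpy as np  # assumes lstm_cell_forward and sigmoid from the sketch above

def lstm_forward(x_seq, a0, c0, params):
    """Chain the cell temporally: the a^<t> and c^<t> produced at one
    time step are fed in as a^<t-1> and c^<t-1> at the next."""
    a_t, c_t = a0, c0
    hidden_states = []
    for x_t in x_seq:  # x_seq: list of (n_x, 1) inputs x^<1>, x^<2>, ...
        a_t, c_t = lstm_cell_forward(x_t, a_t, c_t, params)
        hidden_states.append(a_t)
    # With the forget gate near 1 and the update gate near 0 at every step,
    # c_t can remain essentially equal to c0, as on the top line of the diagram.
    return hidden_states, c_t

def peephole_gates(x_t, a_prev, c_prev, params):
    """Peephole variation: each gate also sees the previous memory cell.
    The peephole terms are element-wise products with vectors pu, pf, po
    of shape (n_a, 1), so the i-th element of c^<t-1> affects only the
    i-th element of each gate, matching the one-to-one relationship above."""
    concat = np.vstack([a_prev, x_t])
    gamma_u = sigmoid(params["Wu"] @ concat + params["pu"] * c_prev + params["bu"])
    gamma_f = sigmoid(params["Wf"] @ concat + params["pf"] * c_prev + params["bf"])
    gamma_o = sigmoid(params["Wo"] @ concat + params["po"] * c_prev + params["bo"])
    return gamma_u, gamma_f, gamma_o
```

In some peephole formulations the output gate looks at the updated cell c^<t> rather than c^<t-1>; the version here follows the lecture's description, in which c^<t-1> can feed into all three gates.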
So that's it for the LSTM. When should you use a GRU, and when should you use an LSTM? There isn't widespread consensus on this. Even though I presented GRUs first, in the history of deep learning LSTMs actually came much earlier, and GRUs are a relatively recent invention that was derived partly as a simplification of the more complicated LSTM model. Researchers have tried both of these models on many different problems, and on different problems different algorithms win out, so there isn't a universally superior algorithm, which is why I want to show you both of them. My feeling from using these is that the advantage of the GRU is that it's a simpler model, so it's easier to build a much bigger network; having only two gates, it also runs a bit faster computationally, so it scales to somewhat bigger models. The LSTM, on the other hand, is more powerful and more flexible, since it has three gates instead of two. If you had to pick one, I think the LSTM has been the historically more proven choice, so most people today would still use the LSTM as the default first thing to try, although in the last few years GRUs have been gaining a lot of momentum, and I feel that more and more teams are also using GRUs, because they're a bit simpler, often work just as well, and can be easier to scale to even bigger problems. So that's it for LSTMs. With either GRUs or LSTMs, you'll be able to build neural networks that can capture much longer-range dependencies.