In the last video, you saw the notation we'll use to define sequence learning problems. Now, let's talk about how you can build a model, build a neural network, to learn the mapping from X to Y.

One thing you could do is try to use a standard neural network for this task. In our previous example, we had nine input words. You could imagine taking these nine input words, maybe the nine one-hot vectors, feeding them into a standard neural network with maybe a few hidden layers, and then eventually having it output the nine values, zero or one, that tell you whether each word is part of a person's name. But this turns out not to work well, and there are really two main problems with it.

The first is that the inputs and outputs can be different lengths in different examples. It's not as if every single example has the same input length, TX, or the same output length, TY. And maybe if every sentence had a maximum length, you could pad or zero-pad every input up to that maximum length, but this still doesn't seem like a good representation.

The second, and maybe more serious, problem is that a naive neural network architecture like this doesn't share features learned across different positions of text. In particular, if the neural network has learned that the word Harry appearing in position one gives a sign that it's part of a person's name, then wouldn't it be nice if it automatically figured out that Harry appearing in some other position, XT, also means that it might be a person's name. This is similar to what you saw in convolutional neural networks, where you want things learned for one part of the image to generalize quickly to other parts of the image, and we'd like similar effects for sequence data as well. And similar to what you saw with conv nets, using a better representation will also let you reduce the number of parameters in your model. Previously, we said that each of these inputs is a 10,000-dimensional one-hot vector, so this is just a very large input layer: the total input size would be the maximum number of words times 10,000, and the weight matrix of this first layer would end up having an enormous number of parameters. A recurrent neural network, which we'll start to describe on the next slide, does not have either of these disadvantages.

So what is a recurrent neural network? Let's build one up. If you are reading the sentence from left to right, the first word you read is some first word, say X1. What we're going to do is take the first word and feed it into a neural network layer. I'm going to draw it like this: that's a hidden layer of the first neural network. And we can have the neural network maybe try to predict the output: is this part of a person's name or not? What a recurrent neural network does is, when it then goes on to read the second word in the sentence, say X2, instead of just predicting Y hat 2 using only X2, it also gets to input some information from what it had computed at time step one. In particular, the activation value from time step one is passed on to time step two. Then at the next time step, the recurrent neural network inputs the third word, X3, and tries to output some prediction, Y hat 3, and so on, up until the last time step, where it inputs X TX and outputs Y hat TY. At least in this example, TX is equal to TY, and the architecture will change a bit if TX and TY are not identical.
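To make the parameter-count concern concrete, here is a rough back-of-the-envelope sketch in Python. The vocabulary size of 10,000 and the nine-word input come from the example in the video; the hidden-layer size of 1,000 units is a hypothetical choice added purely for illustration.

```python
# Rough parameter count for the first layer of a standard fully connected
# network on the name-entity example: nine one-hot words of dimension 10,000
# flattened into a single input vector. The hidden size is hypothetical.
vocab_size = 10_000
max_words = 9
hidden_units = 1_000  # hypothetical choice for illustration

input_size = max_words * vocab_size              # 90,000 inputs
first_layer_weights = input_size * hidden_units  # 90,000,000 weights
print(first_layer_weights)
```

Even before adding biases or later layers, the first weight matrix alone is in the tens of millions of parameters, which is the point being made about why a better representation is needed.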
And so at each time step, the recurrent neural network passes on its activation to the next time step for it to use. To kick off the whole thing, we'll also have some made-up activation at time zero. This is usually the vector of zeros. Some researchers will initialize A0 randomly, or have other ways to initialize A0, but having a vector of zeros as the fake time-zero activation is the most common choice, and that gets input to the neural network.

In some research papers or in some books, you see this type of neural network drawn with the following diagram, in which at every time step you input X and output Y hat (maybe sometimes there'll be a T index there). To denote the recurrent connection, sometimes people will draw a loop like that, meaning the layer feeds back to itself, and sometimes they'll draw a shaded box, where the shaded box denotes a time delay of one step. I personally find these recurrent diagrams much harder to interpret, and so throughout this course, I'll tend to draw the unrolled diagram, like the one you have on the left. But if you see something like the diagram on the right in a textbook or in a research paper, what it really means, or the way I tend to think about it, is to mentally unroll it into the diagram you have on the left instead.

The recurrent neural network scans through the data from left to right, and the parameters it uses at each time step are shared. So there'll be a set of parameters, which we'll describe in greater detail on the next slide. The parameters governing the connection from X1 to the hidden layer will be some set of parameters we're going to write as WAX, and it's the same parameters WAX that it uses for every time step; I guess you could write WAX there as well. The activations, the horizontal connections, will be governed by some set of parameters WAA, and it's the same parameters WAA used at every time step. And similarly, there's some WYA that governs the output predictions. I'll describe on the next slide exactly how these parameters work.

So in this recurrent neural network, when making the prediction for Y3, it gets information not only from X3, but also from X1 and X2, because the information from X1 can pass through this way to help the prediction of Y3. Now, one weakness of this RNN is that it only uses information that is earlier in the sequence to make a prediction. In particular, when predicting Y3, it doesn't use information about the words X4, X5, X6, and so on. And this is a problem, because if you're given the sentence, "He said, Teddy Roosevelt was a great president," then in order to decide whether or not the word Teddy is part of a person's name, it'd be really useful to know not just information from the first two words, but information from the later words in the sentence as well. Because the sentence could also have been, "He said, Teddy bears are on sale." And so given just the first three words, it's not possible to know for sure whether the word Teddy is part of a person's name. In the first example, it is. In the second example, it's not. But you can't tell the difference if you look only at the first three words. So one limitation of this particular neural network structure is that the prediction at a certain time uses information from the inputs earlier in the sequence, but not information later in the sequence.
We will address this in a later video, where we talk about bidirectional recurrent neural networks, or BRNNs. But for now, this simpler unidirectional neural network architecture will suffice to explain the key concepts, and we'll just have to make a quick modification to these ideas later to enable, say, the prediction of y hat 3 to use both information earlier in the sequence as well as information later in the sequence. We'll get to that in a later video.

So let's now write explicitly what calculations this neural network does. Here's a cleaned-up version of the picture of the neural network. As I mentioned previously, you typically start off with the input a0 equal to the vector of all zeros. Next, this is what forward propagation looks like. To compute a1, you would compute it as an activation function g applied to waa times a0 plus wax times x1 plus a bias, which we're going to write as ba. And then to compute y hat 1, the prediction at time step 1, that would be some activation function, maybe a different activation function than the one above, applied to wya times a1 plus by.

The notation convention I'm going to use for the subscripts of these matrices is that, in this example wax, the second index means that wax is going to be multiplied by some x-like quantity, and the first index, a, means that it's used to compute some a-like quantity. And similarly, you notice that here, wya is multiplied by some a-like quantity to compute a y-like quantity. The activation function used to compute the activations a will often be a tanh in the choice of an RNN, and sometimes ReLUs are also used, but tanh is actually a pretty common choice, and we have other ways of preventing the vanishing gradient problem, which we'll talk about later this week. Depending on what your output y is: if it's a binary classification problem, then I guess you would use a sigmoid activation function, or it could be a softmax if you have a k-way classification problem. The choice of activation function here depends on what type of output y you have. So for the named entity recognition task, where y was either 0 or 1, the second g could be a sigmoid activation function. And I guess you could write g2 if you want to distinguish that these could be different activation functions, but I usually won't do that.

And then more generally, at time t, a t will be g of waa times a from the previous time step, plus wax times x from the current time step, plus ba. And y hat t is equal to g, again possibly a different activation function, of wya times a t plus by. So these equations define forward propagation in the neural network, where you start off with a0 as a vector of all zeros, and then, using a0 and x1, you compute a1 and y hat 1. Then you take x2 and use x2 and a1 to compute a2 and y hat 2, and so on, and you carry out forward propagation going from the left to the right of this picture.

Now, in order to help us develop the more complex neural networks, I'm actually going to take this notation and simplify it a little bit. So let me copy these two equations to the next slide. Right, here they are. What I'm going to do, to simplify the notation a bit, is take that and write it in a slightly simpler way. So I'm going to write this as a t equals g of just one matrix, wa, times a new quantity, which is going to be a t minus 1 and x t stacked together, and then plus ba.
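To make these forward-propagation equations concrete, here is a minimal NumPy sketch of the basic RNN, assuming tanh for the activations and a sigmoid output for the binary named entity task; the function and variable names are my own, not from the course.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def rnn_forward(x_seq, a0, Waa, Wax, Wya, ba, by):
    """Forward propagation for the basic RNN described above.

    x_seq: list of input vectors x<1>, ..., x<Tx> (e.g. one-hot word vectors)
    a0:    initial activation, usually a vector of zeros
    """
    a_prev = a0
    y_hats = []
    for x_t in x_seq:
        # a<t> = g(waa a<t-1> + wax x<t> + ba), with g = tanh
        a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
        # y_hat<t> = g(wya a<t> + by), with g = sigmoid for 0/1 labels
        y_hat_t = sigmoid(Wya @ a_t + by)
        y_hats.append(y_hat_t)
        a_prev = a_t
    return y_hats
```

Note how the same Waa, Wax, and Wya are reused at every step of the loop, which is exactly the parameter sharing described earlier.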
And the quantities on the left and right are supposed to be equivalent. So the way we define wa is we'll take the matrix waa and the matrix wax and put them side by side, stacking them horizontally, and this will be the matrix wa. So, for example, if a was 100-dimensional and, in our running example, x was 10,000-dimensional, then waa would have been a 100 by 100 dimensional matrix and wax would have been a 100 by 10,000 dimensional matrix. So stacking these two matrices together, this would be 100 rows, this part would be 100 columns, and this part would be, I guess, 10,000 columns, so wa will be a 100 by 10,100 dimensional matrix. I guess this diagram on the left is not drawn to scale, since wax would be a very wide matrix.

And what this notation means is to just take the two vectors and stack them together. So I'm going to use that notation to denote that we're going to take the vector a t minus 1, so that's 100-dimensional, and stack it on top of x t, which is 10,000-dimensional, so this ends up being a 10,100-dimensional vector. And hopefully you can check for yourself that this matrix times this vector just gives you back the original quantity, right? Because this matrix, waa stacked next to wax, multiplied by this stacked a t minus 1, x t vector, is just equal to waa times a t minus 1 plus wax times x t, which is exactly what we had back over here.

So the advantage of this notation is that rather than carrying around two parameter matrices, waa and wax, we can compress them into just one parameter matrix, wa, and this will simplify our notation when we develop more complex models. And then for the second equation, in a similar way, I'm just going to rewrite it slightly as wy times a t plus by. Now the subscripts in the notation, wy and by, just denote what type of quantity we're computing: wy indicates that that's a weight matrix for computing a y-like quantity, and up here, wa and ba indicate that those are the parameters for computing an a-like, activation, quantity.

So that's it. You now know what a basic recurrent neural network is. Next, let's talk about backpropagation and how you would learn with these RNNs.
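As a quick sanity check of this compressed notation, here is a small NumPy snippet, using the 100 and 10,000 dimensions from the running example, that verifies wa times the stacked vector equals waa times a t minus 1 plus wax times x t. The variable names are my own.

```python
import numpy as np

n_a, n_x = 100, 10_000
rng = np.random.default_rng(0)

Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal(n_a)
x_t = rng.standard_normal(n_x)

# wa = [waa, wax] stacked horizontally: a 100 x 10,100 matrix
Wa = np.concatenate([Waa, Wax], axis=1)
# [a<t-1>; x<t>] stacked vertically: a 10,100-dimensional vector
stacked = np.concatenate([a_prev, x_t])

# The compressed form gives exactly the same result as the original two terms
assert np.allclose(Wa @ stacked, Waa @ a_prev + Wax @ x_t)
```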