Language modeling is one of the most basic and important tasks in natural language processing. It is also one that RNNs do very well. In this video, you'll learn how to build a language model using an RNN, and this will lead up to a fun programming exercise at the end of this week, where you build a language model and use it to generate Shakespeare-like text and other types of text. Let's get started.

So what is a language model? Let's say you're building a speech recognition system, and you hear the sentence, "the apple and pear salad was delicious." So what did you just hear me say? Did I say "the apple and pair salad"? Or did I say "the apple and pear salad"? You probably think the second sentence is much more likely, and in fact, that's what a good speech recognition system would output, even though these two sentences sound exactly the same. The way a speech recognition system picks the second sentence is by using a language model, which tells it the probability of either of these two sentences. For example, a language model might say that the chance of the first sentence is 3.2 × 10^-13, and the chance of the second sentence is, say, 5.7 × 10^-10. With these probabilities, the second sentence is more likely by a factor of over 10^3 compared to the first sentence, and that's why a speech recognition system would pick the second choice.

So what a language model does is, given any sentence, its job is to tell you the probability of that particular sentence. And by probability of a sentence, I mean: if you were to pick up a random newspaper, open a random email, pick a random web page, or listen to the next thing a friend of yours says, what is the chance that the next sentence you encounter somewhere out there in the world will be a particular sentence, like "the apple and pear salad was delicious"? This is a fundamental component for both speech recognition systems, as you've just seen, and machine translation systems, where translation systems want to output only sentences that are likely. So the basic job of a language model is to take as input a sentence, which I'm going to write as a sequence y<1>, y<2>, up to y<Ty>. For language models, it will be useful to represent the sentences as outputs y rather than as inputs x. What a language model does is estimate the probability of that particular sequence of words.

So how do you build a language model? To build such a model using an RNN, you would first need a training set comprising a large corpus of English text, or text from whatever language you want to build a language model of. The word corpus is NLP terminology that just means a very large body or set of English text, of English sentences. So let's say you get a sentence in your training set as follows: "Cats average 15 hours of sleep a day." The first thing you would do is tokenize this sentence. That means you would form a vocabulary, as we saw in an earlier video, and then map each of these words to, say, one-hot vectors or to indices in your vocabulary. One thing you might also want to do is model when sentences end. So another common thing to do is to add an extra token called EOS, which stands for end of sentence, that can help you figure out when a sentence ends. We'll talk more about this later, but the EOS token can be appended to the end of every sentence in your training set if you want your model to explicitly capture when sentences end.
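As a concrete illustration, here is a minimal sketch of this tokenization step in Python. The toy vocabulary, the token names <UNK> and <EOS>, and the helper names tokenize and one_hot are illustrative assumptions on my part, not the exercise's actual code:

```python
import numpy as np

# A toy vocabulary (assumed for illustration): word -> index, with special
# tokens for unknown words and end of sentence.
vocab = {"cats": 0, "average": 1, "15": 2, "hours": 3,
         "of": 4, "sleep": 5, "a": 6, "day": 7, "<UNK>": 8, "<EOS>": 9}

def tokenize(sentence, vocab):
    # Lowercase, split on whitespace, map out-of-vocabulary words to <UNK>,
    # and append <EOS> so the model can learn where sentences end.
    tokens = [vocab.get(w, vocab["<UNK>"]) for w in sentence.lower().split()]
    tokens.append(vocab["<EOS>"])
    return tokens

def one_hot(index, vocab_size):
    # Column vector with a 1 at the word's index, 0 elsewhere.
    v = np.zeros((vocab_size, 1))
    v[index] = 1.0
    return v

indices = tokenize("Cats average 15 hours of sleep a day", vocab)
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7, 9] -- nine tokens including <EOS>
x = [one_hot(i, len(vocab)) for i in indices]
```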
We won't use the end-of-sentence token for the programming exercise at the end of this week, but for some applications you might want to use it, and we'll see later where this comes in handy. So in this example, we have y<1>, y<2>, y<3>, and so on up to y<9>: nine inputs in this example if you append the end-of-sentence token. During the tokenization step, you can decide whether or not the period should be a token as well. In this example, I'm just ignoring punctuation, so I'm using "day" as the last token and omitting the period. But if you want to treat the period or other punctuation as explicit tokens, then you could add the period to your vocabulary as well.

Now, one other detail is: what if some of the words in your training set are not in your vocabulary? If your vocabulary uses 10,000 words, maybe the 10,000 most common words in English, then the word Mau, as in the Egyptian Mau, which is a breed of cat, might not be one of your top 10,000 tokens. In that case, you could take the word Mau and replace it with a unique token called UNK, which stands for unknown word, and you would just model the chance of the unknown word instead of the specific word Mau.

Having carried out the tokenization step, which basically means taking the input sentence and mapping it to the individual tokens or the individual words in your vocabulary, next let's build an RNN to model the chance of these different sequences. One of the things you'll see on the next slide is that you end up setting the input x<t> equal to y<t-1>, but you'll see that in a little bit.

So let's go on to build the RNN model, and I'm going to continue to use this sentence as the running example. This will be the RNN architecture. At the first time step, you're going to end up computing some activation a<1> as a function of some input x<1>, and x<1> will just be set to a vector of all zeros. By convention, the previous activation a<0> is also set to a vector of zeros. What a<1> does is make a softmax prediction to try to figure out the probability of the first word, and that prediction is ŷ<1>. So what this step really does is run a softmax that tries to predict the probability of any word in your dictionary being the first word: what's the chance that the first word is "a", what's the chance that the first word is "Aaron", what's the chance that the first word is "cats", all the way up to what's the chance that the first word is "Zulu", what's the chance that the first word is the unknown-word token, or even what's the chance that the first word is the end-of-sentence token, though that really shouldn't happen. So ŷ<1> is output according to a softmax that just predicts the chance of the first word being whatever it ends up being, and in our example it wound up being the word "cats". This would be a 10,000-way softmax output if you have a 10,000-word vocabulary, or a 10,002-way softmax, I guess, if you count the unknown-word and end-of-sentence tokens as two additional tokens.

Then the RNN steps forward to the next step and computes some activation a<2>. At this step, its job is to try to figure out the second word. But now we will also give it the correct first word. So we'll tell it that, gee, in reality, the first word was actually "cats"; that's y<1>. This is why x<2> is set equal to y<1>. At the second step, the output is again predicted by a softmax, and the RNN's job is to predict the chance of the next word being whatever word it is.
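To make the architecture concrete, here is a minimal NumPy sketch of one such forward step. The parameter names (Waa, Wax, Wya, ba, by) and the tanh activation are standard-RNN assumptions on my part, not necessarily the exact setup of the exercise:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a column vector.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x, a_prev, Waa, Wax, Wya, ba, by):
    # One forward step: compute the new activation from the previous
    # activation and the current input, then output a softmax
    # distribution over the whole vocabulary.
    a = np.tanh(Waa @ a_prev + Wax @ x + ba)
    y_hat = softmax(Wya @ a + by)
    return a, y_hat

vocab_size, hidden = 10, 16  # toy sizes; the lecture's vocabulary is 10,000+
rng = np.random.default_rng(0)
Waa = rng.normal(0.0, 0.01, (hidden, hidden))
Wax = rng.normal(0.0, 0.01, (hidden, vocab_size))
Wya = rng.normal(0.0, 0.01, (vocab_size, hidden))
ba, by = np.zeros((hidden, 1)), np.zeros((vocab_size, 1))

a = np.zeros((hidden, 1))      # a<0>: vector of zeros by convention
x = np.zeros((vocab_size, 1))  # x<1>: also a zero vector
a, y_hat = rnn_step(x, a, Waa, Wax, Wya, ba, by)
# y_hat is the model's distribution over the first word. At later steps,
# x<t> would be the one-hot vector for the true previous word y<t-1>.
```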
Is it "a", or "Aaron", or "cats", or "Zulu", or the unknown word, or EOS, or whatever, given what came previously? In this case, I guess the right answer is "average", since the sentence starts with "cats average".

Then you go on to the next step of the RNN, where you now compute a<3>. But to predict the third word, which is "15", we can now give it the first two words. So this next input, x<3>, will be equal to y<2>, and the word "average" is input. Its job is to figure out the next word in the sequence: in other words, the probability of any word in the dictionary, given that what just came before was "cats average". In this case, the right answer is "15", and so on, until at the end you reach, I guess, time step 9. You end up feeding it x<9>, which is equal to y<8>, which is the word "day". This step has activation a<9>, and its job is to output ŷ<9>, which happens to be the EOS token. So it predicts the chance of whatever word comes next, given everything that came before, and hopefully it will predict a high chance of the EOS end-of-sentence token. So each step in the RNN looks at some set of preceding words, such as: given the first three words, what is the distribution over the next word? This RNN learns to predict one word at a time, going from left to right.

Next, to train this neural network, we're going to define the cost function. At a certain time t, if the true word was y<t> and the neural network's softmax predicted some ŷ<t>, then the loss at that step is the softmax loss function that you should already be familiar with. The overall loss is just the sum over all time steps of the losses associated with the individual predictions. If you train this RNN on a large training set, what it will be able to do is, given any initial set of words, such as "cats average 15" or "cats average 15 hours of", predict the chance of the next word. And given a new sentence, say y<1>, y<2>, y<3>, with just three words for simplicity, the way you can figure out the chance of this entire sentence is as follows: the first softmax tells you the chance of y<1>; that's the first output. The second one tells you P(y<2> | y<1>), and the third one tells you P(y<3> | y<1>, y<2>). By multiplying these three probabilities together (and you'll see much more of the details of this in the programming exercise), you end up with the probability of this three-word sentence.

So that's the basic structure of how you can train a language model using an RNN. If some of these ideas still seem a little bit abstract, don't worry: you get to practice all of them in the programming exercise. But next, it turns out one of the most fun things you can do with a language model is to sample sequences from the model. Let's take a look at that in the next video.
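Before moving on, here is a rough sketch of those last two computations in the same NumPy setting, assuming y_hats is the list of per-step softmax outputs (column vectors, as in the sketch above) and targets is the list of true word indices; these helper names are mine, not the exercise's:

```python
import numpy as np

def sequence_loss(y_hats, targets):
    # Cross-entropy summed over time steps: L = -sum_t log(y_hat<t>[y<t>])
    return -sum(np.log(y_hat[t, 0]) for y_hat, t in zip(y_hats, targets))

def sentence_probability(y_hats, targets):
    # P(y<1>, ..., y<Ty>) = P(y<1>) * P(y<2>|y<1>) * P(y<3>|y<1>,y<2>) * ...
    # i.e., the product of each step's predicted probability of the true word.
    return float(np.prod([y_hat[t, 0] for y_hat, t in zip(y_hats, targets)]))
```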