After you've trained a sequence model, one way to informally get a sense of what it has learned is to have it sample novel sequences. Let's take a look at how you could do that. Remember that a sequence model models the probability of any particular sequence of words, and what we'd like to do is sample from this distribution to generate novel sequences of words. The network was trained using the structure shown at the top, but to sample, you do something slightly different.

First, you sample the first word you want your model to generate. For that, you input the usual x1 = 0 and a0 = 0. The first time step then produces a softmax probability distribution over all possible outputs, and you randomly sample according to this softmax distribution. The softmax distribution tells you the chance that the first word is "a", the chance that it is "Aaron", the chance that it is "Zulu", the chance that it is the unknown word token, and maybe the chance that it is the end-of-sentence token. You take this vector and use, for example, the NumPy command np.random.choice to sample according to the distribution defined by this vector of probabilities, and that lets you sample the first word.

Next, you go on to the second time step. Remember that during training the second time step expects y1 as input, but here you instead take the y-hat 1 you just sampled and pass it in as the input to the next time step. So whatever word you just chose for the first time step gets passed in as the input at the second position, and the softmax then makes a prediction for y-hat 2. As a concrete example, let's say that the first word you sampled happened to be "The", which is a very common choice of first word. You then pass in "The" as x2, which is now equal to y-hat 1, and you're computing the chance of whatever the second word is given that the first word is "The"; that's y-hat 2. You again use the same kind of sampling function to sample y-hat 2, and at the next time step you take whatever word you chose, represented say as a one-hot encoding, and pass it to the next time step. Then you sample the third word, take whatever you chose, and keep going until you get to the last time step.

So how do you know when the sequence ends? One option, if the end-of-sentence token is part of your vocabulary, is to keep sampling until you generate the EOS token; that tells you you've hit the end of a sentence and can stop. Alternatively, if you did not include EOS in your vocabulary, you can just decide to sample 20 words or 100 words or something, and keep going until you've reached that number of time steps. This procedure will sometimes generate an unknown word token. If you want to make sure that your algorithm never generates this token, one thing you could do is reject any sample that comes out as the unknown word token and keep resampling from the rest of the vocabulary until you get a word that isn't unknown; or you could just leave it in the output if you don't mind having an unknown word there. So this is how you would generate a randomly chosen sentence from your RNN language model.
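For concreteness, here is a rough NumPy sketch of the sampling loop just described. The parameter names (Wax, Waa, Wya, ba, by), the vocab list, and the "<EOS>"/"<UNK>" tokens are assumptions made for illustration, not code from the course; it is just one way the procedure could look for a single-layer RNN with a softmax output.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary scores.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_sequence(Wax, Waa, Wya, ba, by, vocab, max_len=100):
    """Sample one sentence from a trained RNN language model (sketch).

    Assumed shapes: Wax (n_a, V), Waa (n_a, n_a), Wya (V, n_a),
    ba (n_a, 1), by (V, 1), where V = len(vocab).
    """
    V = len(vocab)
    eos_idx = vocab.index("<EOS>")
    unk_idx = vocab.index("<UNK>")

    x = np.zeros((V, 1))             # x<1> = 0 (zero vector)
    a = np.zeros((Waa.shape[0], 1))  # a<0> = 0
    words = []

    for _ in range(max_len):
        # One RNN time step, then a softmax over the vocabulary.
        a = np.tanh(Wax @ x + Waa @ a + ba)
        p = softmax(Wya @ a + by).ravel()

        # Randomly sample the next word according to the softmax distribution.
        idx = np.random.choice(V, p=p)
        # Optionally reject the unknown word token and resample, as mentioned above.
        while idx == unk_idx:
            idx = np.random.choice(V, p=p)

        if idx == eos_idx:           # stop once the model emits <EOS>
            break
        words.append(vocab[idx])

        # Feed the sampled word back in as a one-hot input for the next step.
        x = np.zeros((V, 1))
        x[idx] = 1

    return words
```

If you cap the length with max_len instead of relying on an EOS token, you get the "just sample 20 or 100 words" behavior mentioned above.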
Now, so far we've been building a word-level RNN, by which I mean the vocabulary consists of words from English. Depending on your application, one thing you could do is build a character-level RNN instead. In that case, your vocabulary would just be the characters a through z, as well as maybe space and punctuation if you wish, and the digits 0 through 9; and if you want to distinguish between uppercase and lowercase, you can include the uppercase letters as well. One thing you could do is look at your training set corpus, see which characters appear there, and use those to define your vocabulary. If you build a character-level language model rather than a word-level language model, then your sequence y1, y2, y3 would be the individual characters in your training data rather than the individual words. So for our previous example, the sentence "Cats average 15 hours of sleep a day": "c" would be y1, "a" would be y2, "t" would be y3, "s" would be y4, the space would be y5, and so on.

Using a character-level language model has some pros and cons. One advantage is that you don't ever have to worry about unknown word tokens. In particular, a character-level language model is able to assign a sequence like "Mao" a non-zero probability, whereas if "Mao" was not in the vocabulary of your word-level language model, you would just have to map it to the unknown word token. But the main disadvantage of the character-level language model is that you end up with much longer sequences. Many English sentences have 10 to 20 words but many dozens of characters, so character-level language models are not as good as word-level language models at capturing long-range dependencies, that is, how the early parts of the sentence affect the later parts. Character-level language models are also more computationally expensive to train. The trend I've been seeing in natural language processing is that, for the most part, word-level language models are still used, but as computers get faster, there are more and more applications where people are, at least in some special cases, starting to look at character-level models. They do tend to be much harder and much more computationally expensive to train, so they're not in widespread use today, except perhaps for specialized applications where you need to deal with unknown or other out-of-vocabulary words a lot, or where you have a more specialized vocabulary.
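To make the character-level setup just described concrete, here is a small sketch in plain Python of defining a character vocabulary from a training corpus and turning the example sentence into a sequence of character targets; the variable names are illustrative, not from the course.

```python
# Training text: the characters that actually appear here define the vocabulary.
corpus = "cats average 15 hours of sleep a day"

char_vocab = sorted(set(corpus))          # e.g. [' ', '1', '5', 'a', 'c', ...]
char_to_idx = {ch: i for i, ch in enumerate(char_vocab)}

# At the character level, each target y<t> is one character rather than one word:
# y<1> = 'c', y<2> = 'a', y<3> = 't', y<4> = 's', y<5> = ' ', and so on.
targets = [char_to_idx[ch] for ch in corpus]

print(len(char_vocab), "characters in the vocabulary")
print(targets[:5])
```

Note how even this short sentence already yields dozens of time steps, which is the longer-sequence disadvantage mentioned above.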
With these methods, what you can now do is build an RNN that looks at a corpus of English text, build a word-level or a character-level language model, and then sample from the language model you've trained. So here are some fun examples of text sampled from a language model, actually from a character-level language model; you get to implement something like this yourself in the exercise. If the model was trained on news articles, then it generates text like what's shown on the left. It looks vaguely like news text, not quite grammatical, but maybe it sounds a little bit like things that could appear in the news: "Concussion epidemic to be examined." And if it was trained on Shakespearean text, it generates text that sounds like something Shakespeare could have written: "The mortal wound hath a eclipsing love, and subjects that this doubt art in other disclose. When best to be my love, to me cease hath, for whose are bruised of mine eyes, he's."

So, that's it for the basic RNN, how you can build a language model with it, and how to sample from the language model you've trained. In the next few videos, I want to discuss some of the challenges of training RNNs, as well as how to address them, specifically vanishing gradients, by building even more powerful variants of the RNN. So, in the next video, let's talk about the problem of vanishing gradients, and then we'll go on to talk about the GRU, the Gated Recurrent Unit, as well as the LSTM model.