In this video, you'll learn about the beam search algorithm. Remember from the last video that for machine translation, given an input French sentence, you don't want to output a random English translation, you want to output the best and most likely English translation. The same is also true for speech recognition, where given an input audio clip, you don't want to output a random text transcript of that audio, you want to output the best, most likely text transcript. Beam search is the most widely used algorithm to do this, and in this video, you'll see how to get beam search to work for yourself.

Let's describe beam search using our running example of the French sentence "Jane visite l'Afrique en septembre", hopefully being translated into "Jane visits Africa in September". The first thing beam search has to do is try to pick the first word of the English translation that it's going to output. So here, I've listed all, say, 10,000 words in the vocabulary, and to simplify the problem a bit, I'm going to ignore capitalization, so I'm just listing all the words in lowercase. In the first step of beam search, I'll use this network fragment, with the encoder shown in green and the decoder shown in purple, to evaluate the probability of the first word: the probability of the first output y, given the input French sentence x.

Whereas greedy search picks only the one most likely word and moves on, beam search can consider multiple alternatives. The beam search algorithm has a parameter B, called the beam width, and for this example I'm going to set the beam width equal to 3. What this means is that beam search will consider not just one possibility, but three at a time. So in particular, let's say that when evaluating this probability over the different choices of the first word, it finds that "in", "jane", and "september" are the three most likely possibilities for the first word of the English output. Beam search will then store away in computer memory that it wants to try out all three of these words. If the beam width parameter were set differently, say to 10, then it would keep track of not just three, but of the 10 most likely possible choices for the first word. So to be clear, in order to perform this first step of beam search, what you need to do is run the input French sentence through the encoder network, and then through the first step of the decoder network. This gives a softmax output over all 10,000 possibilities, and you take those 10,000 possible outputs and keep in memory which were the top three.
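To make this first step concrete, here's a minimal sketch in Python with NumPy. It assumes the decoder's softmax output for the first step is already available as an array; the function and variable names are illustrative, not taken from the course's code.

```python
import numpy as np

def init_beams(first_word_probs, beam_width=3):
    """Step 1 of beam search: keep the beam_width most likely first words.

    first_word_probs is assumed to hold the decoder's softmax output
    P(y<1> | x) over the vocabulary (shape: (vocab_size,)).
    """
    top = np.argsort(first_word_probs)[-beam_width:]  # indices of the top B words
    # Each beam is a (partial_sequence, log_probability) pair; log
    # probabilities avoid numerical underflow when probabilities are
    # multiplied in later steps.
    return [([int(w)], float(np.log(first_word_probs[w]))) for w in top]
```

With a beam width of 3 and a 10,000-word vocabulary, this keeps exactly the three candidates described above.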
Let's go on to the second step of beam search. Having picked "in", "jane", and "september" as the three most likely choices for the first word, what beam search does now is, for each of these three choices, consider what the second word should be. So after "in", maybe the second word is "a", or maybe it's "aaron"; I'm just listing words from the vocabulary, from the dictionary. Somewhere down the list is "september", somewhere down the list is "visit", and so on all the way to "zulu", maybe the last word. To evaluate the probability of the second word, it uses this neural network fragment, with the encoder in green. For the decoder portion, when trying to decide what comes after "in", remember that the decoder first outputs y-hat 1, so we're going to set y-hat 1 to the word "in" and feed it back in.

So there's the word "in", because we've decided, for now, that the first word was "in", and we're trying to figure out the second word. This then outputs y-hat 2. By hardwiring y-hat 1, the input here, to be the first word "in", this network fragment can be used to evaluate the probability of the second word, given the input French sentence and given that the first word of the translation is "in".

Now notice that what we ultimately care about in this second step of beam search is finding the pair of first and second words that is most likely, not just the most likely second word. By the rules of conditional probability, the probability of the pair can be expressed as P(y<1> | x) times P(y<2> | x, y<1>), and it's this second factor that you're getting from this neural network fragment. So if for each of the three words you've chosen, "in", "jane", and "september", you save away the first probability, you can multiply it by this second probability to get the probability of the first and second words together.

So now you've seen how, if the first word was "in", you can evaluate the probability of the second word. If the first word was "jane", you do the same thing. The sentence could be "jane a", "jane aaron", and so on, down to "jane is", "jane visits", and so on. You'd use this neural network fragment, let me draw this in as well, where you hardwire y-hat 1 to be "jane". With the first word hardwired as "jane", this network fragment tells you the probability of the second word given the input x and given that the first word was "jane". Then, same as above, you multiply by P(y<1> | x) to get the probability of y1 and y2 for each of these 10,000 possible choices for the second word. And finally, you do the same thing for "september": all the words from "a" down to "zulu", using this network fragment, let me just draw this in as well, to see, if the first word was "september", what the most likely options for the second word are.

So for this second step of beam search, because we're continuing to use a beam width of 3, and because there are 10,000 words in the vocabulary, you end up considering 3 times 10,000, or 30,000, possibilities: 10,000 here, 10,000 here, 10,000 here. It's the beam width times the number of words in the vocabulary. What you do is evaluate all 30,000 options according to the probability of the first and second words, and then pick the top 3, cutting these 30,000 possibilities back down to 3, back down to the beam width, again. Let's say that of these 30,000 choices, the most likely were "in september", "jane is", and "jane visits". If those are the most likely 3 of the 30,000 choices, then that's what beam search memorizes and takes on to the next step.

So notice one thing: if beam search decides that the 3 most likely choices for the first and second words are "in september", "jane is", and "jane visits", then it is now rejecting "september" as a candidate for the first word of the output English translation. So we're down to 2 possibilities for the first word, but we still have a beam width of 3, keeping track of 3 choices for the pair y1, y2.
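Here's a sketch of this expansion step, continuing the illustrative code from above. The helper decoder_step is assumed: given a partial sequence hardwired as the decoder's previous outputs, it returns the softmax distribution over the next word, standing in for the network fragments in the video. The video multiplies probabilities; the code adds log probabilities, which is equivalent and numerically safer.

```python
import numpy as np

def expand_beam(beams, decoder_step, beam_width=3):
    """One expansion step. beams is a list of (sequence, log_prob) pairs.

    decoder_step(seq) is an assumed helper returning P(y_next | x, seq)
    over the vocabulary, with seq hardwired as the decoder's outputs so far.
    """
    candidates = []
    for seq, log_p in beams:                 # e.g. 3 partial translations
        probs = decoder_step(seq)            # shape (vocab_size,)
        # log P(y1, y2 | x) = log P(y1 | x) + log P(y2 | x, y1)
        scores = log_p + np.log(probs)       # 10,000 scores per beam
        for w in np.argsort(scores)[-beam_width:]:
            candidates.append((seq + [int(w)], float(scores[w])))
    # Cut the beam_width x vocab_size candidates back down to beam_width.
    candidates.sort(key=lambda c: c[1])
    return candidates[-beam_width:]
```

With B = 3 and a 10,000-word vocabulary, this scores the 30,000 candidates and keeps only the top 3, exactly the cut described above. Taking only the top 3 within each beam before the final cut is safe, because the global top 3 can contain at most 3 candidates from any single beam.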
Before going on to the third step of beam search, I just want to note that because the beam width is equal to 3, at every step you instantiate 3 copies of the network to evaluate these partial sentence fragments of the output. It's because the beam width is equal to 3 that you have 3 copies of the network with different choices for the first word, and these 3 copies can be used very efficiently to evaluate all 30,000 options for the second word. So you don't need to instantiate 30,000 copies of the network; you need only 3 copies to very quickly evaluate all 10,000 possible outputs at that softmax output, say, for y2.

Let's just quickly illustrate one more step of beam search. We said that the most likely choices for the first two words were "in september", "jane is", and "jane visits", and for each of these pairs of words, we've saved away in computer memory the probability of y1 and y2 given the input x, given the French sentence x. Similar to before, we now consider what the third word should be: is it "in september a", "in september aaron", all the way down to "in september zulu"? To evaluate possible choices for the third word, you use this network fragment, where you hardwire the first word to be "in" and the second word to be "september". This network fragment lets you evaluate the probability of the third word given the input French sentence x and given that the first two words of the English output are "in september". You then do the same thing for the second fragment, like so, and the same thing for "jane visits". Beam search will once again pick the top three possibilities: maybe it thinks "in september jane" is a likely outcome, or "jane is visiting" is likely, or maybe "jane visits africa" is likely for the first three words, and then it keeps going. You go on to the fourth step of beam search to add one more word, and on it goes. The outcome of this process, hopefully, is that, adding one word at a time, beam search will decide that "Jane visits Africa in September", terminated by the end-of-sentence symbol if you're using one in your system, which is quite common, is a likely output English sentence. You'll see more details of this for yourself in this week's programming exercise as well, where you get to play with beam search yourself.

So with a beam width of 3, beam search considers three possibilities at a time. Notice that if the beam width were set equal to 1, so it considers only one possibility, this essentially becomes the greedy search algorithm, which we discussed in the last video. But by considering multiple possibilities, say 3 or 10 or some other number, at the same time, beam search will usually find a much better output sentence than greedy search. You've now seen how beam search works, but it turns out there are some additional tips and tricks and refinements that help you make beam search work even better. Let's go on to the next video to take a look.
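To tie the steps above together, here is a minimal sketch of the complete loop. It reuses the assumed decoder_step and expand_beam helpers from the earlier sketches, with eos_id standing for the index of the end-of-sentence symbol; it's an illustration of the algorithm as described, not the course's actual implementation.

```python
def beam_search(decoder_step, eos_id, beam_width=3, max_len=30):
    """Full beam search loop, reusing expand_beam from the sketch above.

    Assumes decoder_step([]) returns P(y<1> | x), so the first expansion
    reproduces step 1. With beam_width=1 this reduces to greedy search.
    """
    beams = [([], 0.0)]   # start from the empty translation, log prob 0
    completed = []
    for _ in range(max_len):
        beams = expand_beam(beams, decoder_step, beam_width)
        # Hypotheses that just produced <EOS> are finished translations.
        completed += [(s, lp) for s, lp in beams if s[-1] == eos_id]
        beams = [(s, lp) for s, lp in beams if s[-1] != eos_id]
        if not beams:     # every hypothesis has ended in <EOS>
            break
    # Return the most likely sentence found.
    return max(completed + beams, key=lambda c: c[1])[0]
```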