In the last video, you saw how the skip-gram model allows you to construct a supervised learning task where you map from context to target words, and how that lets you learn a useful word embedding. But the downside was that the softmax objective is slow to compute. In this video, you'll see a modified learning problem called negative sampling that lets you do something similar to the skip-gram model you just saw, but with a much more efficient learning algorithm. Let's see how you can do this. Most of the ideas presented in this video are due to Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. What we're going to do in this algorithm is create a new supervised learning problem. The problem is: given a pair of words, like orange and juice, predict whether this is a context-target pair. In this example, orange and juice is a positive example, so I'll give it a target of one. How about orange and king? That's a negative example, so I'll write zero for the target. So what we're going to do is sample a context word and a target word. In this case, we have orange and juice, and we'll associate that with a label of one; I'll just put the words in the middle. The positive example is generated exactly as in the previous videos: sample a context word, look around a window of, say, plus or minus ten words, and pick a target word. That's how you generate the first row of this table: orange, juice, one. Then, to generate a negative example, I take the same context word and just pick a word at random from the dictionary. In this case, I chose the word king at random, and we'll label that as zero. Then let's take orange and pick another random word from the dictionary, under the assumption that a word picked at random probably won't be associated with the word orange. So: orange, book, zero. And let's pick a few others: maybe just by chance we pick the word "the", labeled zero, and then maybe just by chance we pick the word "of", also labeled zero. Notice that "of" is labeled zero even though the word "of" actually appears next to orange in the text. To summarize, the way we generate this dataset is: pick a context word, then pick a target word, and that gives the first row of the table, a positive example labeled one. Then, for some number of times, say k times, we take the same context word and pick random words from the dictionary, such as king, book, the, of, whatever comes out at random, and label all of those zero. Those are our negative examples. It's okay if, just by chance, one of the words picked at random from the dictionary happens to appear in the window, within plus or minus ten words of the context word orange. We then have a supervised learning problem where the learning algorithm takes as input x this pair of words and has to predict the target label y. So the problem is really: given a pair of words like orange and juice, do you think they appear together? Do you think I got these two words by sampling them close together in the text, or did I get one word from the text and one word chosen at random from the dictionary?
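As a rough sketch of this data-generation step, the short Python function below builds one positive (context, target, 1) triple from a context window and k negative (context, random word, 0) triples. The function name `generate_training_examples` and its arguments are illustrative, not part of any library.

```python
import random

def generate_training_examples(tokens, vocab, context_index, k=4, window=10):
    """Sketch: one positive (context, target) pair plus k negative pairs.
    tokens: list of word indices for the corpus (assumed to have > 1 token)
    vocab:  list of all word indices in the dictionary
    """
    context = tokens[context_index]

    # Positive example: pick a target word within +/- `window` of the context word.
    lo = max(0, context_index - window)
    hi = min(len(tokens) - 1, context_index + window)
    target_index = context_index
    while target_index == context_index:
        target_index = random.randint(lo, hi)
    examples = [(context, tokens[target_index], 1)]

    # Negative examples: same context word, k target words drawn at random
    # from the dictionary. A draw may occasionally fall inside the window;
    # as noted above, that's accepted.
    for _ in range(k):
        examples.append((context, random.choice(vocab), 0))
    return examples
```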
The task is really to distinguish between these two distributions from which you might sample a pair of words. So that's how you generate the training set. How do you choose k? Mikolov et al. recommend k of 5 to 20 for smaller datasets, and a smaller k, say 2 to 5, if you have a very large dataset. In this example, I've just used k equals 4. Next, let's describe the supervised learning model that learns the mapping from x to y. Here is the softmax model you saw in the previous video, and here is the training set we got from the previous slide, where the pair of words is the new input x and the label is the value y you're trying to predict. To define the model, I'll use c to denote the context word, t to denote the possible target word, and y to denote the 0/1 label: is this a context-target pair? We're going to define a logistic regression model: the chance that y equals 1, given the input pair (c, t), is modeled as P(y = 1 | c, t) = sigmoid(theta_t transpose times e_c). The parameters are similar to before: you have one parameter vector theta_t for each possible target word, and a separate parameter vector, really the embedding vector e_c, for each possible context word. We use this formula to estimate the probability that y equals 1. If you have k negative examples for each positive one, you can think of this as a k-to-1 ratio of negative to positive examples: for every positive example, you have k negative examples with which to train this logistic-regression-like model. To draw this as a neural network: if the input word is orange, which is word 6257, you input the one-hot vector, multiply it by the embedding matrix E, and get the embedding vector e_6257. Then what you have is really 10,000 possible logistic regression classification problems, where one of them is the classifier corresponding to "is the target word juice or not?", another, maybe somewhere down here, predicts "is the word king or not?", and so on for every possible word in your vocabulary. So think of this as 10,000 binary logistic regression classifiers, but instead of training all 10,000 of them on every iteration, we only train 5 of them: the one corresponding to the actual target word we got, plus 4 randomly chosen negative examples. That's for the case where k equals 4. So instead of one giant 10,000-way softmax, which is very expensive to compute, we've turned it into 10,000 binary classification problems, each of which is quite cheap to compute, and on every iteration we only train k + 1 of them: k negative examples and 1 positive example. This is why the computational cost of this algorithm is much lower: you're updating only k + 1 logistic units, k + 1 binary classification problems, which is relatively cheap on every iteration, rather than updating a 10,000-way softmax classifier.
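To make the "k + 1 binary classifiers instead of a full softmax" idea concrete, here is a minimal NumPy sketch of the per-example loss. It assumes an embedding matrix E and an output-parameter matrix Theta, each with one row per vocabulary word; only the k + 1 sampled rows of Theta are touched. The function name `negative_sampling_loss` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(E, Theta, context, targets, labels):
    """Loss over one positive and k negative pairs for a single context word.
    E:       (vocab_size, dim) embedding matrix, one row e_c per context word
    Theta:   (vocab_size, dim) output parameters, one row theta_t per target word
    context: index of the context word
    targets: length-(k+1) sequence of target-word indices (1 positive, k negative)
    labels:  length-(k+1) sequence of labels in {0, 1}
    """
    labels = np.asarray(labels, dtype=np.float64)
    e_c = E[context]                  # embedding vector for the context word
    logits = Theta[targets] @ e_c     # theta_t^T e_c for each sampled target
    probs = sigmoid(logits)           # P(y = 1 | c, t) for each pair
    # Binary cross-entropy over just the k + 1 sampled classifiers,
    # rather than a full 10,000-way softmax over the whole vocabulary.
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
```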
This technique is called negative sampling because, having taken a positive example like orange and juice, you then deliberately sample a bunch of negative examples (hence the name) with which to train 4 more of these binary classifiers, and on every iteration you choose 4 different random negative words to train your algorithm on. Before wrapping up, one more important detail of this algorithm: how do you choose the negative examples? After having chosen the context word orange, how do you sample the candidate target words to generate the negative examples? One thing you could do is sample according to the empirical frequency of words in your corpus, that is, according to how often different words appear. But the problem is that you end up with a very high representation of words like the, of, and, and so on. The other extreme would be to sample the negative examples uniformly at random, with probability one over the vocabulary size, but that's also very non-representative of the distribution of English words. So the authors, Mikolov et al., reported that what they empirically found to work best was a heuristic that sits in between the two extremes of sampling from the empirical frequencies, meaning whatever the observed distribution in English text is, and sampling from the uniform distribution. What they did was sample proportional to the frequency of a word raised to the power of three-fourths: if f(w_i) is the observed frequency of a particular word in the English language, or in your training corpus, then sampling proportional to f(w_i)^(3/4) lands somewhere between the uniform distribution and the observed distribution in your training set. I'm not sure this is very theoretically justified, but many researchers use this heuristic and it seems to work decently well. To summarize: you've seen how you can learn word vectors with a softmax classifier, but that's very computationally expensive, and in this video you saw how, by changing that to a set of binary classification problems, you can learn word vectors very efficiently. If you run this algorithm, you will be able to learn pretty good word vectors. Of course, as in other areas of deep learning, there are open-source implementations, and there are also pre-trained word vectors that others have trained and released online under permissive licenses. So if you want to get going quickly on an NLP problem, it's reasonable to download someone else's word vectors and use them as a starting point. That's it for the skip-gram model. In the next video, I want to share with you yet another word embedding learning algorithm that is maybe even simpler than what you've seen so far: the GloVe algorithm.
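The three-fourths-power heuristic is simple to compute; the sketch below builds the sampling distribution from raw word counts and draws k negative words from it. The function name `negative_sampling_distribution` and the toy counts are illustrative only.

```python
import numpy as np

def negative_sampling_distribution(word_counts):
    """Heuristic distribution with P(w_i) proportional to f(w_i)^(3/4),
    interpolating between the empirical word frequencies and a uniform
    distribution. word_counts: raw count for each word in the vocabulary."""
    freqs = np.asarray(word_counts, dtype=np.float64)
    freqs = freqs / freqs.sum()       # empirical frequencies f(w_i)
    weights = freqs ** 0.75           # dampen frequent words, boost rare ones
    return weights / weights.sum()    # normalize to a probability distribution

# Example: draw k = 4 negative target words for one positive example.
counts = [500, 120, 40, 7, 3]         # toy corpus counts
p = negative_sampling_distribution(counts)
negatives = np.random.choice(len(counts), size=4, p=p)
```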