In the last video, you saw how you can learn a neural language model in order to get good word embeddings. In this video, you'll see the Word2Vec algorithm, which is a simpler and computationally more efficient way to learn these types of embeddings. Let's take a look. Most of the ideas presented in this video are due to Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean.

Let's say you're given this sentence in your training set. In the skip-gram model, what we're going to do is come up with a few context-to-target pairs to create our supervised learning problem. So rather than having the context always be the last four words, or the last n words immediately before the target word, what we're going to do is randomly pick a word to be the context word. Let's say we chose the word orange. Then we'll randomly pick another word within some window, say plus or minus five or plus or minus ten words of the context word, and we choose that to be the target word. So maybe just by chance you might pick juice to be the target word; that's just one word later. Or you might choose two words before, so you have another pair where the target could be glass. Or maybe just by chance you choose the word my as the target. And so we'll set up a supervised learning problem where, given the context word, you're asked to predict what is a randomly chosen word within, say, a plus or minus five or ten word window of that input context word. Obviously this is not a very easy learning problem, because within plus or minus ten words of the word orange there could be a lot of different words. But the goal of setting up this supervised learning problem isn't to do well on the supervised learning problem per se; it's that we want to use this learning problem to learn good word embeddings.

So here are the details of the model. Let's say we'll continue to use our vocabulary of ten thousand words, though some versions of this have been trained on vocabulary sizes that exceed a million words. The basic supervised learning problem we're going to solve is to learn a mapping from some context c, such as the word orange, to some target, which we'll call t, which might be the word juice, or the word glass, or the word my, if we use the example from the previous slide. In our vocabulary, orange is word 6257 and juice is word 4834 in our vocab of ten thousand words. So that's the input x that you want to learn to map to the output y. To represent the input, such as the word orange, you can start off with a one-hot vector, which we're going to write as o subscript c, the one-hot vector for the context word. Then, similar to what you saw in the last video, you can take the embedding matrix E, multiply E by the vector o subscript c, and this gives you the embedding vector for the input context word; so here e_c is equal to capital E times that one-hot vector. Then, in the little neural network that we'll form, we're going to take this vector e_c and feed it to a softmax unit. I've been drawing the softmax unit as a node in a neural network; that's not an o, that's a softmax unit. And it's the job of the softmax unit to output y hat. So, to write out this model in detail, this is the model.
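To make the pair-generation step concrete, here is a minimal Python sketch. It is an illustration only: the function name, the window size of five, and the example sentence are assumptions chosen to match the words mentioned above (orange, juice, glass, my), not code from the lecture.

```python
import random

def sample_skip_gram_pairs(tokens, n_pairs=4, window=5):
    """Sample (context, target) training pairs for the skip-gram model.

    A context position is picked uniformly at random, and the target is a
    different word drawn from within +/- `window` positions of the context.
    """
    pairs = []
    for _ in range(n_pairs):
        c = random.randrange(len(tokens))      # random context position
        lo = max(0, c - window)
        hi = min(len(tokens) - 1, c + window)
        t = c
        while t == c:                          # target must be a different position
            t = random.randint(lo, hi)
        pairs.append((tokens[c], tokens[t]))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal".split()
print(sample_skip_gram_pairs(sentence))
# e.g. [('orange', 'juice'), ('orange', 'glass'), ('juice', 'to'), ('my', 'cereal')]
```

Note that uniform sampling of the context position is only the starting point; the end of this video discusses why a non-uniform heuristic is usually used instead.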
The softmax model estimates the probability of different target words given the input context word as e to the theta_t transpose e_c, divided by the sum over all the words in the vocabulary, that is, the sum from j equals 1 to 10,000 of e to the theta_j transpose e_c. Here, theta_t is the parameter vector associated with output t; that is, it determines the chance of that particular word t being the label. I've left off the bias term in this softmax, but we could include that too if we wish. And then, finally, the loss function for the softmax will be the usual one. We use y to represent the target word, with a one-hot representation, and y hat for the softmax output. Then the loss is the negative log likelihood: the negative of the sum from i equals 1 to 10,000 of y_i log y_i hat. So that's the usual loss for a softmax where we're representing the target y as a one-hot vector, a vector with a single 1 and the rest 0s. If the target word is juice, then element 4834 of that vector would be equal to 1 and the rest would be 0. And y hat would be a 10,000-dimensional vector output by the softmax unit, with probabilities for all 10,000 possible target words.

So, to summarize, this is the overall little model: a little neural network that basically looks up an embedding and feeds it to a softmax unit. The matrix E has a lot of parameters, one column for each of these embedding vectors e_c, and the softmax unit has its own parameters, the theta_t vectors. If you optimize this loss function with respect to all of these parameters, you actually get a pretty good set of embedding vectors. This is called the skip-gram model because it takes as input one word, like orange, and then tries to predict some word skipping a few words to the left or to the right, that is, to predict what comes a little bit before or a little bit after the context word.

Now, it turns out there are a couple of problems with using this algorithm, and the primary problem is computational speed. In particular, for the softmax model, every time you want to evaluate this probability you need to carry out a sum over all 10,000 words in your vocabulary. And maybe 10,000 isn't too bad, but if you're using a vocabulary of size 100,000 or a million, it gets really slow to sum over this denominator every single time. In fact, even 10,000 is already quite slow, and it becomes even harder to scale to larger vocabularies. So there are a few solutions to this. One that you see in the literature is to use a hierarchical softmax classifier. What that means is that, instead of trying to categorize something into all 10,000 categories in one go, imagine you have one classifier that tells you whether the target word is in the first 5,000 words of the vocabulary or in the second 5,000 words. Say this binary classifier tells you it's in the first 5,000 words. Then you can have a second classifier that tells you whether it's in the first 2,500 words of the vocab or in the second 2,500 words, and so on, until eventually you get down to exactly which word it is; that's a leaf of this tree. Having a tree of classifiers like this means that each of the interior nodes of the tree can be just a binary classifier, like a logistic classifier, and so you don't need to sum over all 10,000 words, over the whole vocabulary, in order to make a single classification.
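Before looking at the speedups in more detail, here is a small numpy sketch of the basic model just described: the one-hot lookup, the softmax over theta_t transpose e_c, and the negative log-likelihood loss for one (context, target) pair. The word indices for orange and juice follow the lecture's example; the embedding dimension of 300, the initialization, and the names are assumptions for illustration. It also makes the cost problem visible: the denominator is a sum over all 10,000 words for every single example.

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300                  # 10,000-word vocab; 300-d embeddings (assumed size)
rng = np.random.default_rng(0)

E = rng.normal(scale=0.01, size=(emb_dim, vocab_size))      # embedding matrix E
Theta = rng.normal(scale=0.01, size=(emb_dim, vocab_size))  # one softmax parameter vector theta_t per word

def skip_gram_forward(context_idx, target_idx):
    """Compute y_hat = softmax(Theta^T e_c) and the negative log-likelihood loss."""
    o_c = np.zeros(vocab_size)
    o_c[context_idx] = 1.0                         # one-hot vector o_c for the context word
    e_c = E @ o_c                                  # e_c = E * o_c (in practice, just a column lookup)

    logits = Theta.T @ e_c                         # theta_t^T e_c for every word t -- the expensive part
    logits -= logits.max()                         # subtract the max for numerical stability
    y_hat = np.exp(logits) / np.exp(logits).sum()  # softmax over all 10,000 words

    loss = -np.log(y_hat[target_idx])              # -log P(target | context)
    return y_hat, loss

y_hat, loss = skip_gram_forward(context_idx=6257, target_idx=4834)  # orange -> juice
print(y_hat.shape, float(loss))                    # (10000,), roughly log(10000) at initialization
```

At initialization the loss is close to log(10,000), about 9.2, since the softmax is nearly uniform; training adjusts both E and the theta parameters to drive it down.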
In fact, the computational cost of classifying with a tree like this scales like the log of the vocab size rather than linearly in the vocab size. So this is called a hierarchical softmax classifier. I should mention that in practice the hierarchical softmax classifier doesn't use a perfectly balanced or perfectly symmetric tree with equal numbers of words on the left and right sides of each branch. In practice, the tree can be built so that the most common words are near the top, whereas less common words like durian are buried much deeper. Because you see the more common words more often, you might need only a few traversal steps to get to common words like the and of, whereas you see less frequent words like durian much less often, so it's okay that they're buried deeper in the tree: you don't need to go that deep as frequently. There are various heuristics for building the tree that you use for a hierarchical softmax classifier. So this is one idea you see in the literature for speeding up the softmax classification, and you can read more details in the paper by Tomas Mikolov and others that I referenced on the first slide. But I won't spend too much more time on this, because in the next video we'll talk about a different method, called negative sampling, which I think is even simpler and also works really well for speeding up the softmax classifier and the problem of needing to sum over the entire vocab size in the denominator. You'll see more of that in the next video.

But before moving on, one quick topic I want you to understand is how to sample the context c. Once you sample the context c, the target t can be sampled within, say, a plus or minus ten word window of the context c. But how do you choose the context c itself? One thing you could do is just sample uniformly at random from your training corpus. When you do that, you find that there are some words, like the, of, a, and, to, and so on, that appear extremely frequently. So if you do that, you'll find that your context-to-target pairs are dominated by these types of words, whereas other words, like orange, apple, and also durian, don't appear that often. You probably don't want your training set to be dominated by these extremely frequently occurring words, because then you spend almost all your effort updating e_c for those frequent words, and you want to make sure that you spend some time updating the embeddings even for the less common words like durian. So in practice, the distribution P(c) used to sample context words isn't taken entirely uniformly at random from the training corpus; instead, there are different heuristics that you can use to balance out sampling the common words against the less common words. One such heuristic is sketched just below, after the next point.

So that's the word2vec skip-gram model. If you read the original paper by Mikolov et al. that I referenced earlier, you'll find that that paper actually had two versions of the word2vec model. The skip-gram was one, and the other is called CBOW, the Continuous Bag-of-Words model, which takes the surrounding context of a middle word and uses the surrounding words to try to predict the middle word. That algorithm also works, and it has some advantages and disadvantages.
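Going back to the context-sampling heuristics mentioned a moment ago, here is one concrete example: subsampling frequent words. This particular rule and its threshold come from the follow-up word2vec paper by Mikolov et al., not from this video, so treat it as one possible choice rather than the method the lecture prescribes.

```python
import math
import random
from collections import Counter

def keep_probability(word_frequency, t=1e-5):
    """Probability of keeping a word as a context word.

    Very frequent words (like 'the' or 'of') are kept only occasionally,
    while infrequent words (like 'durian') are kept almost every time.
    """
    return min(1.0, math.sqrt(t / word_frequency))

def subsample_contexts(tokens, t=1e-5):
    """Return the tokens that survive frequency-based subsampling."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for word in tokens:
        frequency = counts[word] / total          # empirical unigram frequency f(word)
        if random.random() < keep_probability(frequency, t):
            kept.append(word)
    return kept

# The threshold t is tuned for corpora with billions of tokens, where
# f(word) is tiny for most words; it would need to be much larger on a toy corpus.
```

With the typical threshold of t = 1e-5, a word that makes up 1% of a large corpus is kept only about 3% of the time, while words rarer than the threshold are always kept, which shifts training effort toward the less common embeddings.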
But the key problem with this algorithm, with the skip-gram model as presented so far, is that the softmax step is very expensive to calculate, because of the need to sum over your entire vocabulary in the denominator of the softmax. In the next video, I'm going to show you an algorithm that modifies the training objective so that it runs much more efficiently, which lets you apply it to much bigger training sets and therefore learn much better word embeddings. Let's go on to the next video.