In this video, you start to learn some concrete algorithms for learning word embeddings. In the history of deep learning as applied to learning word embeddings, people actually started off with relatively complex algorithms, and then over time researchers discovered that you can use simpler and simpler algorithms and still get very good results, especially if you have a large data set. But some of the algorithms that are most popular today are so simple that if I presented them first, they might seem almost a little bit magical: how could something this simple work? So what I'm going to do is start off with some of the slightly more complex algorithms, because I think it's actually easier to develop intuition about why they should work, and then we'll move on to simplify them and show you some of the simpler algorithms that also give very good results. So let's get started.

Let's say you're building a language model, and you do it with a neural network. So during training, you might want your neural network to take as input "I want a glass of orange" and then predict the next word in the sequence. Below each of these words, I've also written down the index of that word in the vocabulary. It turns out that building a neural language model is a reasonable way to learn a set of embeddings, and the ideas I present on this slide are due to Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. So here's how you can build a neural network to predict the next word in the sequence. Take the list of words "I want a glass of orange", and start with the first word, "I". I'm going to construct a one-hot vector corresponding to the word "I": a 10,000-dimensional one-hot vector o_4343, with a 1 in position 4343. What we're going to do is then have a matrix of parameters E, and take E times o_4343 to get an embedding vector e_4343. This step really means that e_4343 is obtained by multiplying the matrix E by the one-hot vector o_4343. Then we do the same for all of the other words. The word "want" is word 9665, so we take that one-hot vector and multiply it by E to get its embedding vector. And similarly for the other words: "a" is the first word in the dictionary, since alphabetically it comes first, so o_1 gets us e_1, and likewise for the rest of the phrase.

So now you have a bunch of 300-dimensional embeddings, one 300-dimensional embedding vector per word. What we can do is feed all of them into a neural network layer, and that layer then feeds into a softmax, which has its own parameters as well. The softmax classifies among the 10,000 possible outputs in the vocabulary for the final word we're trying to predict. So if in the training set we saw the word "juice", then the target for the softmax during training would be that it should predict "juice" as the word that came next. The hidden layer here has its own parameters, which I'm going to call W1 and b1, and the softmax layer has its own parameters, W2 and b2. If you're using 300-dimensional word embeddings and you have six words here, then the input is 6 times 300: an 1800-dimensional vector obtained by taking your six embedding vectors and stacking them together.
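To make the shapes concrete, here is a minimal numpy sketch of the forward pass just described, not the course's reference implementation: a 10,000-word vocabulary, 300-dimensional embeddings, six context words stacked into an 1800-dimensional input, a hidden layer, and a softmax. The indices 4343 and 9665 come from the lecture; the remaining indices, the hidden-layer size, and the random initialization are illustrative assumptions.

```python
import numpy as np

vocab_size, emb_dim, hidden_dim = 10_000, 300, 128   # hidden size is an illustrative choice

# Parameters (randomly initialized here; in practice they are learned by backprop)
E  = np.random.randn(emb_dim, vocab_size) * 0.01      # embedding matrix E
W1 = np.random.randn(hidden_dim, 6 * emb_dim) * 0.01  # hidden layer for a 6-word context
b1 = np.zeros(hidden_dim)
W2 = np.random.randn(vocab_size, hidden_dim) * 0.01   # softmax layer
b2 = np.zeros(vocab_size)

def one_hot(index, size=vocab_size):
    o = np.zeros(size)
    o[index] = 1.0
    return o

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# "I want a glass of orange": 4343 and 9665 are from the lecture, the rest are made up
context_indices = [4343, 9665, 1, 3852, 6163, 6257]

# Embedding lookup: e_j = E @ o_j (in practice you would just index a column of E)
embeddings = [E @ one_hot(j) for j in context_indices]

x = np.concatenate(embeddings)   # 6 * 300 = 1800-dimensional stacked input
h = np.tanh(W1 @ x + b1)         # hidden layer with parameters W1, b1
p = softmax(W2 @ h + b2)         # distribution over the 10,000-word vocabulary
print(p.shape)                   # (10000,)
```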
Or, what's actually more commonly done is to have a fixed historical window. So, for example, you might decide that you always want to predict the next word given, say, the previous four words, where 4 here is a hyperparameter of the algorithm. This is how you adjust to either very long or very short sentences: you decide to always just look at the previous four words, so you say, I'm just going to use those four words, and get rid of the rest. If you're always using a four-word history, this means that your neural network inputs a 1200-dimensional feature vector, which goes into the hidden layer, and then a softmax tries to predict the output. Again, you have a variety of choices here, and using a fixed history means that you can deal with arbitrarily long sentences, because the input size is always fixed.

The parameters of this model are this matrix E, and you use the same matrix E for all the words; you don't have different matrices for different positions among the preceding four words, it's the same matrix E. These weights are also parameters of the algorithm, and you can use backprop to perform gradient descent to maximize the likelihood of your training set: to repeatedly predict, given four words in a sequence, what the next word in your text corpus is (sketched in code below). It turns out that this algorithm will learn pretty decent word embeddings. The reason is, if you remember our orange juice, apple juice example, it's in the algorithm's interest to learn pretty similar word embeddings for orange and apple, because doing so allows it to fit the training set better: it's going to see "orange juice" sometimes and "apple juice" sometimes. So if you have only a 300-dimensional feature vector to represent all of these words, the algorithm will find that it fits the training set best if apples, oranges, grapes, pears, and so on, and maybe also durians, which is a rare fruit, end up with similar feature vectors. So this is one of the earlier and pretty successful algorithms for learning word embeddings, for learning this matrix E.

But now let's generalize this algorithm and see how we can derive even simpler algorithms. I want to illustrate the other algorithms using a more complex sentence as our example. Let's say that in your training set, you have this longer sentence: "I want a glass of orange juice to go along with my cereal." What we saw on the last slide was that the job of the algorithm was to predict the word "juice", which we'll call the target word, given some context, which was the last four words. If your goal is to learn an embedding, researchers have experimented with many different types of contexts. If your goal is to build a language model, then it's natural for the context to be a few words right before the target word. But if your goal isn't to learn a language model per se, then you can choose other contexts. For example, you can pose a learning problem where the context is the four words on the left and the four words on the right. What that means is that the algorithm is given the four words on the left, "a glass of orange", and the four words on the right, "to go along with", and is asked to predict the word in the middle.
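Going back to the fixed four-word history, here is a hedged Keras sketch of how such a language model might be set up and trained on integer-encoded words. The vocabulary size, embedding dimension, and window follow the lecture; the hidden-layer size, optimizer, and toy data are illustrative assumptions, not the course's actual implementation.

```python
import numpy as np
import tensorflow as tf

vocab_size, emb_dim, window = 10_000, 300, 4

# The shared embedding matrix E. Note: Keras stores it with one row per word,
# i.e. the transpose of the E-times-one-hot convention used on the slide.
embedding_layer = tf.keras.layers.Embedding(vocab_size, emb_dim)

# Fixed 4-word history: input is 4 word indices, output is the index of the next word.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(window,), dtype="int32"),
    embedding_layer,
    tf.keras.layers.Flatten(),                                 # stack 4 x 300 -> 1200-dim vector
    tf.keras.layers.Dense(128, activation="tanh"),             # hidden layer (W1, b1); size is illustrative
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # softmax over the vocabulary (W2, b2)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy data: random (four-word context, next-word target) pairs standing in for a real corpus.
contexts = np.random.randint(0, vocab_size, size=(64, window))
targets = np.random.randint(0, vocab_size, size=(64,))
model.fit(contexts, targets, epochs=1, verbose=0)

# The learned embedding matrix: one 300-dimensional vector per vocabulary word.
E = embedding_layer.get_weights()[0]   # shape (10000, 300)
```

Because the same `embedding_layer` is applied to every position in the window, all four preceding words share the single matrix E, exactly as described above.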
Posing a learning problem like this, where the embeddings of the left four words and the right four words feed into a neural network, similar to what you saw on the previous slide, to try to predict the target word in the middle, can also be used to learn word embeddings. Or, if you want to use a simpler context, you can just use the last one word. So given just the word "orange", what comes after "orange"? This would be a different learning problem where you tell it one word, "orange", and ask it, well, what do you think the next word is? You can construct a neural network that just feeds the embedding of that one previous word into a network that tries to predict the next word. Or, one thing that works surprisingly well is to take a nearby one word as the context. So I might tell you that the word "glass" is somewhere close by, and ask: I saw the word "glass", and there's another word somewhere close to "glass"; what do you think that word is? That would be using a nearby one word as the context, and we'll formalize this in the next video. This is the idea of a skip-gram model, and it's an example of a simpler algorithm where the context is now much simpler, just one word rather than four, but it works remarkably well.

So what researchers have found is that if you really want to build a language model, it's natural to use the last few words as the context. But if your main goal is really to learn a word embedding, then you can use all of these other contexts, and they will result in very meaningful word embeddings as well. We'll formalize the details of this in the next video, where we talk about the Word2Vec model.

To summarize, in this video you saw how the language modeling problem, which poses a machine learning problem where you input a context, like the last four words, and predict some target word, allows you to learn a good word embedding. In the next video, you'll see how using even simpler contexts and even simpler learning algorithms to map from context to target word can also allow you to learn a good word embedding. Let's go on to the next video, where we discuss the Word2Vec model.
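As a recap of the context choices discussed in this video, here is a small, hedged Python sketch that generates (context, target) training pairs from the example sentence under each scheme: the last four words, four words on either side, the last one word, and a nearby one word (skip-gram style). The helper names and the five-word sampling window are illustrative assumptions, not part of the lecture.

```python
import random

sentence = "I want a glass of orange juice to go along with my cereal".lower().split()

def last_four_words(tokens):
    """Language-model style: context = previous 4 words, target = the next word."""
    return [(tokens[i - 4:i], tokens[i]) for i in range(4, len(tokens))]

def left_and_right(tokens, k=4):
    """Context = k words on the left and k on the right, target = the middle word."""
    return [(tokens[i - k:i] + tokens[i + 1:i + 1 + k], tokens[i])
            for i in range(k, len(tokens) - k)]

def last_one_word(tokens):
    """Context = just the previous word."""
    return [([tokens[i - 1]], tokens[i]) for i in range(1, len(tokens))]

def nearby_one_word(tokens, window=5):
    """Skip-gram style: context = one word, target = a randomly chosen nearby word."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        j = random.choice([k for k in range(lo, hi) if k != i])
        pairs.append(([word], tokens[j]))
    return pairs

print(last_four_words(sentence)[2])   # (['a', 'glass', 'of', 'orange'], 'juice')
print(left_and_right(sentence)[2])    # (['a', 'glass', 'of', 'orange', 'to', 'go', 'along', 'with'], 'juice')
```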