By now, you should have a sense of how word embeddings can help you build NLP applications. One of the most fascinating properties of word embeddings is that they can also help with analogy reasoning. And while analogy reasoning may not be, by itself, the most important NLP application, it helps convey a sense of what these word embeddings are doing and what they can do. Let me show you what I mean.

Here are the featurized representations of a set of words that you might hope a word embedding could capture. Let's say I pose the question: man is to woman as king is to what? Many of you would say: man is to woman as king is to queen. But is it possible to have an algorithm figure this out automatically? Well, here's how you could do it. Let's say you're using this 4-dimensional vector to represent man. This would be e_5391, although just for this video, let me call it e_man. And let's say that's the embedding vector for woman, so I'm going to call that e_woman, and similarly for king and queen. For this example, I'm just going to assume you're using 4-dimensional embeddings, rather than the 50- to 1,000-dimensional embeddings that would be more typical.

One interesting property of these vectors is that if you take e_man and subtract e_woman, the first component is about -1 minus 1, which is -2, and the other components are all about 0 minus 0. So you get roughly [-2, 0, 0, 0]. And similarly, if you take e_king minus e_queen, that's approximately the same thing: the gender component is about -1 minus 0.97, which is about -2; the royalty component is about 1 minus 1, since kings and queens are both about equally royal, so that's 0; and the age and food differences are also about 0. What this is capturing is that the main difference between man and woman is the gender, and the main difference between king and queen, as represented by these vectors, is also the gender, which is why the difference e_man - e_woman and the difference e_king - e_queen are about the same.

So one way to carry out this analogy reasoning is: if the algorithm is asked, man is to woman as king is to what, it can compute e_man - e_woman and try to find the word so that e_man - e_woman is close to e_king minus the embedding of that new word. And it turns out that when queen is the word plugged in here, the left-hand side is close to the right-hand side. These ideas were first pointed out by Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig, and it's been one of the most remarkable, surprising, and influential results about word embeddings; I think it has helped the whole community get better intuitions about what word embeddings are doing.

So let's formalize how you can turn this into an algorithm. In pictures, the word embeddings live in maybe a 300-dimensional space, and the words man, woman, king, and queen are each represented as a point in that space. What we pointed out on the last slide is that the vector difference between man and woman is very similar to the vector difference between king and queen, and the arrow I just drew is really the vector that represents the difference in gender. And remember, these are points we're plotting in a 300-dimensional space.
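To make the arithmetic concrete, here is a minimal sketch in NumPy. The four components stand for the illustrative features gender, royal, age, and food, and the specific numbers are toy values chosen to mirror the example above, not values from any trained embedding.

```python
import numpy as np

# Toy 4-dimensional embeddings; the components loosely stand for
# (gender, royal, age, food). Illustrative values only.
e_man   = np.array([-1.00, 0.01, 0.03, 0.09])
e_woman = np.array([ 1.00, 0.02, 0.02, 0.01])
e_king  = np.array([-0.95, 0.93, 0.70, 0.02])
e_queen = np.array([ 0.97, 0.95, 0.69, 0.01])

print(e_man - e_woman)    # roughly [-2, 0, 0, 0]
print(e_king - e_queen)   # roughly [-2, 0, 0, 0]
# Both differences point in nearly the same "gender" direction,
# which is why e_man - e_woman is approximately e_king - e_queen.
```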
So in order to carry out this type of analogy reasoning, to figure out man is to woman as king is to what, what you can do is try to find the word w so that this equation holds true. What you want is to find the word w that maximizes the similarity between e_w and e_king - e_man + e_woman. So what I did was take this e_? and replace it with e_w, then bring e_w to one side of the equation and the other three terms to the right-hand side. If you have some appropriate similarity function for measuring how similar the embedding of some word w is to this quantity on the right, then finding the word that maximizes the similarity should hopefully let you pick out the word queen. And the remarkable thing is, this actually works. If you learn a set of word embeddings and find the word w that maximizes this type of similarity, you can actually get exactly the right answer, depending on the details of the task. If you look at research papers, it's not uncommon for them to report anywhere from, say, 30% to 75% accuracy on analogy reasoning tasks like these, where an analogy attempt counts as correct only if it gets the exact word right, so only if, in this case, it picks out the word queen.

Before moving on, I just want to clarify what this plot on the left is. Previously, we talked about using algorithms like t-SNE to visualize words. What t-SNE does is take 300-dimensional data and map it in a very nonlinear way to a 2D space, and the mapping it learns is very complicated and very nonlinear. So after the t-SNE mapping, you should not expect these types of parallelogram relationships, like the one we saw on the left, to hold true. It's really in the original 300-dimensional space that you can more reliably count on these parallelogram relationships between analogy pairs to hold. They may hold after mapping through t-SNE, but in most cases, because of t-SNE's nonlinear mapping, you should not count on it; many of the parallelogram analogy relationships will be broken by t-SNE.

Now before moving on, let me just quickly describe the similarity function that is most commonly used. The most commonly used similarity function is called cosine similarity. This is the equation we had on the previous slide: in cosine similarity, you define the similarity between two vectors u and v as u transpose v divided by their Euclidean lengths, sim(u, v) = (u^T v) / (||u||_2 ||v||_2). Ignoring the denominator for now, this is basically the inner product between u and v, and if u and v are very similar, their inner product will tend to be large. It's called cosine similarity because this quantity is actually the cosine of the angle between the two vectors u and v. So if phi is the angle between them, this formula computes the cosine of phi. And you may remember that the cosine of phi looks like this: if the angle between the vectors is zero, the cosine similarity is equal to one; if the angle is 90 degrees, the cosine similarity is zero; and if they are 180 degrees apart, pointing in completely opposite directions, it ends up being negative one. That's where the term cosine similarity comes from, and it works quite well for these analogy reasoning tasks. If you want, you can also use the squared Euclidean distance, ||u - v||^2.
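Here is a minimal sketch of how that search could look in NumPy, assuming a hypothetical dictionary `embeddings` that maps each vocabulary word to its vector; the function names are made up for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(phi) = u.v / (||u|| ||v||): 1 when aligned, 0 when orthogonal, -1 when opposite.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, embeddings):
    """Answer 'word_a is to word_b as word_c is to ?' by maximizing
    sim(e_w, e_c - e_a + e_b) over the vocabulary."""
    target = embeddings[word_c] - embeddings[word_a] + embeddings[word_b]
    best_word, best_sim = None, -np.inf
    for w, e_w in embeddings.items():
        if w in (word_a, word_b, word_c):  # skip the words in the question itself
            continue
        sim = cosine_similarity(e_w, target)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

# e.g. complete_analogy("man", "woman", "king", embeddings) should return "queen"
# when the embeddings capture the gender direction well.
```

Swapping cosine_similarity for the negative squared Euclidean distance would give the distance-based variant mentioned above.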
Technically, the squared Euclidean distance is a measure of dissimilarity rather than a measure of similarity, so you need to take its negative. That will work okay as well, although I see cosine similarity being used a bit more often. The main difference between the two is how they normalize for the lengths of the vectors u and v.

One of the remarkable results about word embeddings is the generality of the analogy relationships they can learn. For example, they can learn that man is to woman as boy is to girl, because the vector difference between man and woman, like the difference between king and queen or between boy and girl, is primarily just the gender. They can learn that Ottawa, which is the capital of Canada, is to Canada as Nairobi is to Kenya; that's capital city is to the name of the country. They can learn that big is to bigger as tall is to taller, and things like yen is to Japan, since the yen is the currency of Japan, as ruble is to Russia. And all of these things can be learned just by running a word embedding learning algorithm on a large text corpus; it can spot all of these patterns by itself, just by learning from very large bodies of text.

So in this video, you saw how word embeddings can be used for analogy reasoning. While you might not be trying to build an analogy reasoning system yourself as an application, I hope this conveys some intuition about the types of featurized, feature-like representations these word embeddings can learn. You also saw how cosine similarity can be a way to measure the similarity between two different word embeddings. We'll talk more about the properties of these embeddings and how you can use them. Next, let's talk about how you'd actually learn these word embeddings. Let's go on to the next video.