Hello, and welcome back. Last week you learned about RNNs, GRUs, and LSTMs. This week, you'll see how many of these ideas can be applied to NLP, to natural language processing, which is one of the fields of AI that is really being revolutionized by deep learning. One of the key ideas you'll learn about is word embeddings, which is a way of representing words that lets your algorithms automatically understand analogies like "Man is to Woman as King is to Queen," and many other examples. And through these ideas of word embeddings, you'll be able to build NLP applications even with relatively small labeled training sets. Finally, towards the end of the week, you'll see how to debias word embeddings, that is, to reduce undesirable gender, ethnicity, or other types of bias that learning algorithms can sometimes pick up. So with that, let's get started with a discussion of word representation.

So far, we've been representing words using a vocabulary of words, and our vocabulary from the previous week might be, say, 10,000 words. And we've been representing each word with a one-hot vector. So, for example, if Man is word number 5391 in this dictionary, then you represent it with a vector that has a 1 in position 5391 and zeros elsewhere. I'm also going to use O_5391 to denote this vector, where O stands for one-hot. Then if Woman is word number 9853, you represent it with O_9853, which has a 1 in position 9853 and zeros elsewhere. Other words, such as King, Queen, Apple, and Orange, would be similarly represented with one-hot vectors.

One of the weaknesses of this representation is that it treats each word as a thing unto itself, and it doesn't allow an algorithm to easily generalize across words. For example, say you have a language model that has learned that when you see "I want a glass of orange ___," the next word is very likely to be "juice." But even if the learning algorithm has learned that "I want a glass of orange juice" is a likely sentence, when it sees "I want a glass of apple ___," as far as it knows, the relationship between apple and orange is no closer than the relationship between any of the other words: man, woman, king, queen, and orange. So it's not easy for the learning algorithm to generalize from knowing that "orange juice" is a popular phrase to recognizing that "apple juice" might also be a popular phrase. This is because the inner product between any two different one-hot vectors is zero: take the vectors for queen and king, or for apple and orange, and their inner product is zero. And the Euclidean distance between any pair of these vectors is also the same. So the representation just doesn't know that apple and orange are somehow much more similar than king and orange or queen and orange.
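To make this concrete, here is a minimal NumPy sketch of the one-hot representation just described. The indices 5391 and 9853 come from the lecture; the other indices are made up for illustration.

```python
import numpy as np

VOCAB_SIZE = 10_000  # vocabulary size from the lecture

def one_hot(index, size=VOCAB_SIZE):
    """Return a vector of zeros with a 1 at the given word's index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

o_man    = one_hot(5391)  # O_5391, index from the lecture
o_woman  = one_hot(9853)  # O_9853, index from the lecture
o_king   = one_hot(4914)  # illustrative index
o_queen  = one_hot(7157)  # illustrative index
o_apple  = one_hot(456)   # illustrative index
o_orange = one_hot(6257)  # illustrative index

# The inner product between any two different one-hot vectors is zero,
# so the representation carries no notion of similarity between words.
print(np.dot(o_king, o_queen))    # 0.0
print(np.dot(o_apple, o_orange))  # 0.0

# And the Euclidean distance between any pair of them is identical.
print(np.linalg.norm(o_man - o_woman))     # sqrt(2), about 1.414
print(np.linalg.norm(o_king - o_orange))   # sqrt(2), about 1.414
```

Whatever pair of distinct words you pick, the numbers come out the same, which is exactly the failure to generalize described above.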
So, wouldn't it be nice if, instead of a one-hot representation, we could learn a featurized representation, where for each of these words (man, woman, king, queen, apple, orange, or really every word in the dictionary) we learn a set of features and their values? For example, one feature might be gender, running from minus one for male to plus one for female. Then the gender associated with man might be minus one, and with woman, plus one. And if you're actually learning these values, maybe for king you get minus 0.95, for queen plus 0.97, and apple and orange come out roughly genderless. Another feature might be: how royal are these things? Man and woman are not particularly royal, so they might have values close to zero, whereas king and queen are highly royal, and apple and orange are not royal at all. How about age? Man and woman don't connote much about age; maybe they imply adults, but neither necessarily young nor old, so values close to zero, whereas kings and queens are almost always adults, and apple and orange are more neutral with respect to age. Another feature: is this a food? Man is not a food, woman is not a food, and neither are king and queen, but apple and orange are foods. And there can be many other features as well, ranging from what is the size of this, what is the cost, is this something alive, is this an action, a noun, a verb, and so on.

So you can imagine coming up with many features, and for the sake of illustration, let's say 300 different features. That lets you take this list of numbers (I've only written four here, but it could be a list of 300) and turn it into a 300-dimensional vector for representing the word man. I'm going to use the notation E_5391 to denote this vector, and similarly E_9853 to denote the 300-dimensional vector representing the word woman, and likewise for the other words.

Now, if you use this representation for the words orange and apple, notice that the representations for orange and apple become quite similar. Some of the features will differ, because maybe the color of an orange and the color of an apple differ, or the taste, but by and large, a lot of the features of apple and orange take on very similar values. This increases the odds that a learning algorithm that has figured out "orange juice" is a thing will also quickly figure out that "apple juice" is a thing, so it can generalize better across different words.

Over the next few videos, we'll find ways to learn word embeddings, which is to say, high-dimensional feature vectors like these that give a better representation than one-hot vectors. The features we end up learning won't have an easy-to-interpret meaning like "component 1 is gender, component 2 is royalty, component 3 is age"; exactly what each component represents will be a bit harder to figure out. Nonetheless, the featurized representations we learn will allow an algorithm to quickly figure out that apple and orange are more similar than, say, king and orange or queen and orange.
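To see why the featurized representation helps, here is a toy sketch. The gender values for man, woman, king, and queen are from the lecture; the remaining feature values are illustrative guesses, and cosine similarity (a standard similarity measure for embeddings, not named in the lecture) stands in for "how alike are these vectors."

```python
import numpy as np

# A tiny embedding table with the four hand-picked features discussed above.
# A learned embedding would have around 300 dimensions, and its individual
# components would not be this interpretable.
#                       gender  royal   age    food
E = {
    "man":    np.array([-1.00,  0.01,  0.03,  0.09]),
    "woman":  np.array([ 1.00,  0.02,  0.02,  0.01]),
    "king":   np.array([-0.95,  0.93,  0.70,  0.02]),
    "queen":  np.array([ 0.97,  0.95,  0.69,  0.01]),
    "apple":  np.array([ 0.00, -0.01,  0.03,  0.95]),
    "orange": np.array([ 0.01,  0.00, -0.02,  0.97]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; near 1 means very similar."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Unlike one-hot vectors, these vectors rate apple and orange as far more
# similar to each other than either is to king or queen.
print(cosine_similarity(E["apple"], E["orange"]))  # close to 1
print(cosine_similarity(E["king"], E["orange"]))   # close to 0
```

An algorithm working with these vectors can now treat "apple" as a near neighbor of "orange," which is the generalization the one-hot representation could not provide.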
If you're able to learn a 300-dimensional feature vector, a 300-dimensional embedding, for each word, one popular thing to do is to take this 300-dimensional data and embed it in, say, a 2-dimensional space so that you can visualize it. One common algorithm for doing this is the t-SNE algorithm, due to Laurens van der Maaten and Geoffrey Hinton.

If you look at one of these visualized embeddings, you find that words like man and woman tend to get grouped together, king and queen tend to get grouped together, and, more broadly, words for people tend to get grouped together. Words for animals tend to get grouped together, fruits tend to be close to each other, and numbers like one, two, three, and four are close to each other. Maybe living things as a whole also tend to get grouped together. You'll sometimes see plots like these on the internet visualizing these 300-dimensional or higher-dimensional embeddings, and hopefully this gives you a sense that word embedding algorithms can learn similar features for related concepts: concepts that seem to you and me like they should be similar end up getting mapped to similar feature vectors.

These featurized representations, in maybe a 300-dimensional space, are called embeddings. The reason we call them embeddings is that you can think of a 300-dimensional space (I can't draw one, so picture a 3-D version), and you take every word, like orange, with its 300-dimensional feature vector, and the word orange gets embedded to a point in that 300-dimensional space. The word apple gets embedded to a different point in the same space. And, of course, to visualize them, algorithms like t-SNE map the points to a much lower-dimensional space so that you can actually plot the 2-D data and look at it. That's where the term embedding comes from.

Word embeddings have been one of the most important ideas in NLP, in natural language processing. In this video, you saw why you might want to learn or use word embeddings. In the next video, let's take a deeper look at how you'll be able to use these ideas to build NLP applications.
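To round out the visualization discussion, here is a minimal sketch of the t-SNE step, assuming scikit-learn and matplotlib are available. Random vectors stand in for real learned embeddings, since learning them is the subject of the coming videos; with real embeddings, related words would land near each other in the 2-D plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["man", "woman", "king", "queen", "apple", "orange",
         "one", "two", "three", "four", "dog", "cat", "fish"]

# Stand-in for a learned 300-dimensional embedding matrix (one row per word).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(words), 300))

# t-SNE maps the 300-D points down to 2-D while roughly preserving
# which points are close to which.
points_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)

# Scatter the 2-D points and label each one with its word.
plt.scatter(points_2d[:, 0], points_2d[:, 1])
for (x, y), word in zip(points_2d, words):
    plt.annotate(word, (x, y))
plt.show()
```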