So, how does PCA work? Suppose you have a dataset with two features, X1 and X2, so that initially your data is represented using the X1 and X2 axes, but you want to replace these two features with just one feature. How can you choose a new axis, let's call it the Z axis, that is a good way of capturing, or representing, the data? Let's take a look at how PCA does this.

Here's a dataset with five training examples. Remember, this is an unsupervised learning algorithm, so we just have X1 and X2; there is no label Y. An example here might have coordinates X1 = 10 and X2 = 8. If we don't want to use the X1, X2 axes, how can we pick some different axes with which to capture, or represent, what's in the data? One note on pre-processing: before applying the next few steps of PCA, the features should first be normalized to have zero mean, and I've already done that here. If the features X1 and X2 take on very different scales, for example, if you remember our housing example, where X1 was the size of a house in square feet and X2 was the number of bedrooms, then X1 could be a thousand or a couple of thousand, whereas X2 is a small number. When the features take on very different scales like that, you would also perform feature scaling before applying the next few steps of PCA. So, assuming the features have been normalized to have zero mean, that is, the mean has been subtracted from each feature, and maybe feature scaling has been applied as well so the ranges are not too far apart, what does PCA do next?

To examine what PCA does, let me remove the X1 and X2 axes so that we're left with just the five training examples. This dot here still represents the origin, the position of zero on this plot. What we have to do now with PCA is pick one axis, instead of the two axes we had previously, with which to capture what's important about these five examples. If we were to choose this axis to be our new Z axis (for this example it happens to be the same as the X1 axis), then what we're saying is that for this example, we're going to capture just this value, its coordinate on the Z axis. For the second example, we're going to capture this value, and so on for all five examples. Another way of saying this is that we're going to take each of these examples and project it down to a point on the Z axis. The word project means that you take an example and bring it onto the Z axis along a line segment that is at a 90-degree angle to the Z axis; this little box here denotes that the line segment is at 90 degrees to the Z axis. So projecting just means taking a point and finding its corresponding point on the Z axis along that 90-degree line segment.

Picking this direction as the Z axis is not a bad choice, but there are even better choices. This choice isn't too bad because when you project your examples onto the Z axis, you still capture quite a lot of the spread of the data. These five projected points are pretty spread apart, so you're still capturing a lot of the variation, or variance, in the original dataset. By that I mean that because these five projections of the data onto the Z axis are quite spread apart, their variance is decently large, which means we're still capturing quite a lot of the information in the original five examples.
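As a small aside, here's a minimal NumPy sketch of the steps so far: center the features to zero mean (and optionally scale them), pick a candidate direction for the Z axis, and project each example onto it with a dot product. The five examples and the candidate direction below are made up for illustration; they aren't the exact numbers from the plot in the video.

```python
import numpy as np

# Five made-up training examples; each row is one example (x1, x2).
X = np.array([[10.0, 8.0],
              [ 1.0, 2.0],
              [ 4.0, 5.0],
              [ 7.0, 6.0],
              [ 3.0, 4.0]])

# Pre-processing: normalize each feature to have zero mean.
X_centered = X - X.mean(axis=0)

# If the features had very different scales (square feet vs. bedrooms),
# you would also apply feature scaling here, e.g. divide by the std:
# X_centered = X_centered / X.std(axis=0)

# A candidate z axis, given as a unit-length direction vector.
# This particular choice points along the original x1 axis.
z_direction = np.array([1.0, 0.0])

# Project each example onto the z axis: one dot product per example.
z = X_centered @ z_direction
print(z)   # one coordinate per example; the spread (variance) of these is what PCA cares about
```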
Let's look at some other possible choices for the axis Z. Here's another choice, and this one is actually not great. If I choose this as my Z axis and project those same five examples down onto it, I end up with these five points, and you'll notice that compared to the previous choice, they are quite squished together. The amount they differ from each other, their variance or variation, is much smaller, and that means that with this choice of Z you're capturing much less of the information in the original dataset, because you've partially squished all five examples together.

Let's look at one last choice, which is choosing this to be the Z axis. This is actually a better choice than the previous two, because if we take the data's projections onto this Z axis, we find that these dots over here are quite far apart. So we're capturing a lot of the variation, a lot of the information, in the original dataset, even though we're now using just one coordinate, one number, to represent each training example instead of the two coordinates X1 and X2. In the PCA algorithm, this axis is called the principal component. It is the axis such that when you project the data onto it, you end up with the largest possible amount of variance. So if you were to reduce the data to one axis, or one feature, this principal component is a good choice, and that is what PCA does: if you want to reduce the data to one dimension, or one feature, it chooses this principal component.

Let me show you a visualization of how different choices of the axis affect the projection. Here we have 10 training examples, and as we move this slider (you can play with this in one of the optional labs yourself), the angle of the Z axis changes. What you see on the left is each example projected along a short line segment at 90 degrees to the Z axis, and on the right is the projection of the data, meaning the Z coordinate of each of these 10 examples. Notice that when I set the axis to about here, the points are quite squished together, so this preserves less of the information in the original data. Whereas if I set the Z axis, say, to this, then the points vary much more, so this captures much more of the information in the original dataset. That's why the principal component corresponds to setting the Z axis to about here, and this is the choice PCA would make if you asked it to reduce the data to one number, one dimension.

A machine learning library like scikit-learn, which you'll hear more about in the next video, can automatically find the principal component for you. But let's dig a little deeper into how that works. Here are my X1 and X2 axes, and here is one training example with coordinate 2 on the X1 axis and 3 on the X2 axis. Let's say that PCA has found this direction for the Z axis. What I'm drawing here, this little arrow, is a length-1 vector pointing in the direction of the Z axis that PCA has chosen. It turns out this length-1 vector is the vector (0.71, 0.71), rounded off a bit; it's actually 0.707 followed by a bunch of other digits. So, given this example with coordinates (2, 3) on the X1, X2 axes, how do we project it onto the Z axis?
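Here's a small sketch of the same idea in code: sweep a few candidate directions, measure the variance of the projections onto each one, and then let scikit-learn's PCA find the direction of maximum variance directly. The dataset and the angles below are made up for illustration; they are not the ten examples from the slider demo in the video.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up, zero-mean 2-D data with most of its spread along one direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2)) @ np.array([[3.0, 1.0],
                                         [1.0, 1.0]])
X = X - X.mean(axis=0)

# Try a few candidate directions for the z axis and measure the
# variance of the projections onto each one.
for angle_deg in [0, 30, 60, 90, 120, 150]:
    theta = np.deg2rad(angle_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit-length vector
    projections = X @ direction
    print(f"angle {angle_deg:3d} deg: variance of projections = {projections.var():.3f}")

# The principal component is the direction whose projections have the
# largest possible variance; scikit-learn finds it directly.
pca = PCA(n_components=1)
pca.fit(X)
print("principal component direction:", pca.components_[0])
```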
It turns out the formula for doing so is to take a dot product between the vector (2, 3) and the vector (0.71, 0.71). If you do that, (2, 3) dotted with (0.71, 0.71) is 2 times 0.71 plus 3 times 0.71, which equals 3.55. What that means is that the distance from the origin to this point over here is 3.55, so if we were to use a single number to capture this example, that number would be 3.55.

So far, we have talked about how to use PCA to reduce data down to one dimension, down to one number. We did so by finding the principal component, also sometimes called the first principal component; in this example, we found this as the first axis. It turns out that if you were to pick a second axis, the second axis will always be at 90 degrees to the first axis, and if you were to choose a third axis, it will be at 90 degrees to both the first and the second axes. By the way, in mathematics, being at 90 degrees is called perpendicular; the term perpendicular just means at 90 degrees. So mathematicians will sometimes say the second axis, Z2, is at 90 degrees, or perpendicular, to the first axis, Z1. Any additional axes you choose will also be at 90 degrees, or perpendicular, to Z1, Z2, and any other axes that PCA chooses. So if you had 50 features and wanted to find three principal components, then if that's the first axis, the second axis will be at 90 degrees to it, and the third axis will be at 90 degrees to both the first and the second axes.

Now, one question I'm often asked is: how is PCA different from linear regression? It turns out PCA is not linear regression; it's a totally different algorithm. Let me explain why. With linear regression, which is a supervised learning algorithm, you have data X and Y. Here's a dataset where the horizontal axis is the feature X and the vertical axis is the label Y. With linear regression, you're trying to fit a straight line so that the predicted value is as close as possible to the ground-truth label Y. In other words, you're trying to minimize the lengths of these little line segments, which are in the vertical direction; they align with the Y axis. In contrast, in PCA there is no ground-truth label Y; you just have unlabeled data, X1 and X2. Furthermore, you're not trying to fit a line that uses X1 to predict X2. Instead, the algorithm treats X1 and X2 equally, and it tries to find an axis Z such that, when you project the data onto Z, these little line segments end up small. So in linear regression, there is one number, Y, which gets very special treatment: we always measure the distance between the fitted line and Y, which is why those distances are measured only in the direction of the Y axis. Whereas in PCA, you can have many features, X1, X2, maybe all the way up to X50 if you have 50 features, and all 50 features are treated equally. We're just trying to find an axis Z so that when the data is projected onto it, along these line segments, you still retain as much of the variance of the original data as possible.
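Going back to the worked projection from a moment ago, here it is as a few lines of NumPy, using the rounded numbers from the lecture, plus a quick check that a second axis at 90 degrees really is perpendicular (dot product zero). The second direction below is just the obvious perpendicular choice in 2-D, shown for illustration.

```python
import numpy as np

x  = np.array([2.0, 3.0])     # the example, with coordinates (x1, x2) = (2, 3)
u1 = np.array([0.71, 0.71])   # unit-length vector along the z axis (rounded)

# Projection onto the z axis is a dot product with the unit vector.
z = np.dot(x, u1)             # 2 * 0.71 + 3 * 0.71
print(z)                      # approximately 3.55

# A second principal component would be perpendicular to the first;
# in 2-D the only such unit direction (up to sign) is (-0.71, 0.71).
u2 = np.array([-0.71, 0.71])
print(np.dot(u1, u2))         # 0.0, i.e. the two axes are at 90 degrees
```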
I know that when I plot these things in two dimensions, with just two features, which is all I can draw on a flat computer monitor, these pictures may look a little similar, but when you have more than two features, which is usually the case, the difference between what linear regression and PCA do is very large. These algorithms are used for totally different purposes and give you very different answers. Linear regression is used to predict a target output Y, whereas PCA takes a lot of features, treats them all equally, and reduces the number of axes needed to represent the data well. It also turns out that maximizing the spread of these projections corresponds to minimizing the distances of these line segments, the distances the points have to move to be projected down onto Z.

To illustrate the difference between linear regression and PCA in another way: if you have a dataset that looks like this, all linear regression can do is fit a line that looks like that, whereas if your dataset looks like this, PCA will choose this to be the principal component. So you should use linear regression if you're trying to predict the value of Y, and you should use PCA if you're trying to reduce the number of features in your dataset, say, to visualize it.

Finally, before we wrap up this video, there's one more thing you can do with PCA. Recall this example, which was at coordinates (2, 3); we found that if you project it onto the Z axis, you end up with 3.55. One thing you can do is ask: given just this one number, Z = 3.55, can we figure out what the original example was? It turns out there's a step in PCA called reconstruction, which tries to go from this one number, Z = 3.55, back to the original two numbers, X1 and X2. You don't have enough information to recover X1 and X2 exactly, but you can approximate them. In particular, the formula is to take this number, 3.55, which is Z, and multiply it by the length-1 vector we had just now, (0.71, 0.71). This comes out to (2.52, 2.52), which is this point over here. So we can approximate the original training example, which was at coordinates (2, 3), with this new point at (2.52, 2.52). The difference between the original point and the projected point is this little line segment here, and in this case it's not a bad approximation: (2.52, 2.52) is not that far from (2, 3). So with just one number, we get a reasonable approximation to the coordinates of the original training example. This is called the reconstruction step of PCA.

To summarize, the PCA algorithm looks at your original data and chooses one or more new axes, Z, or maybe Z1 and Z2, to represent your data. By taking your original dataset and projecting it onto the new axis or axes, you get a smaller set of numbers with which to visualize your data. You've seen the math; let's now take a look at how you can implement this in code. In the next video, we'll look at how you can use PCA yourself with the scikit-learn library. Let's go on to the next video.
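And here's a minimal sketch of the reconstruction step, again using the rounded numbers from the lecture: multiply the single retained coordinate Z by the length-1 direction vector to get an approximation of the original (X1, X2).

```python
import numpy as np

u = np.array([0.71, 0.71])   # unit-length vector along the z axis (rounded)
z = 3.55                     # the single number kept for the example (2, 3)

# Reconstruction: approximate the original two coordinates from z alone
# by multiplying z by the length-1 direction vector.
x_reconstructed = z * u
print(x_reconstructed)       # [2.5205 2.5205], i.e. roughly (2.52, 2.52)
```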