In the earlier courses, courses 1 and 2 of the specialization, you saw a lot of supervised learning algorithms follow the pattern of taking a training set, posing a cost function, and then using gradient descent or some other algorithm to optimize that cost function. It turns out that the k-means algorithm that you saw in the last video is also optimizing a specific cost function, although the optimization algorithm it uses is not gradient descent; it's actually the algorithm that you already saw in the last video. Let's take a look at what all this means, and at what the cost function for k-means is.

To get started, as a reminder, this is the notation we've been using: c^(i) is the index of the cluster, some number from 1 through K, to which training example x^(i) is currently assigned, and mu_k is the location of cluster centroid k. Let me introduce one more piece of notation: when lowercase k is equal to c^(i), mu subscript c^(i) is the cluster centroid of the cluster to which example x^(i) has been assigned. For example, if I were to look at some training example, say training example 10, and ask what's the location of the cluster centroid to which the 10th training example has been assigned, I would look up c^(10). That would give me a number from 1 to K that tells me whether example 10 was assigned to the red or the blue or some other cluster centroid, and then mu subscript c^(10) is the location of the cluster centroid to which x^(10) has been assigned.

Armed with this notation, let me now write out the cost function that k-means turns out to be minimizing. The cost function J, which is a function of c^(1) through c^(m), the assignments of points to cluster centroids, as well as mu_1 through mu_K, the locations of all the cluster centroids, is defined by this expression on the right: J(c^(1), ..., c^(m), mu_1, ..., mu_K) = (1/m) * sum from i = 1 to m of ||x^(i) - mu_{c^(i)}||^2. In other words, the cost function for k-means is the average squared distance between every training example x^(i) and the location of the cluster centroid to which that training example has been assigned. For the example up here, we would be measuring the distance between x^(10) and mu subscript c^(10), the cluster centroid to which x^(10) has been assigned, and taking the square of that distance. That would be one of the terms we're averaging over. What the k-means algorithm is doing is trying to find assignments of points to cluster centroids, as well as locations of cluster centroids, that minimize this average squared distance.

Visually, here's what you saw partway into the run of k-means in an earlier video. At this step, computing the cost function would mean looking at every one of the blue points, measuring these distances, and squaring them, and then similarly looking at every one of the red points, measuring those distances, and squaring them. The average of the squares of all of these distances for the red and the blue points is the value of the cost function J at this particular configuration of the parameters for k-means.
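To make this formula concrete, here is a minimal sketch in NumPy of how you might compute the distortion J from training examples X, assignments c, and centroids mu. The function name compute_distortion and the array shapes are my own choices for illustration, not something defined in the lecture.

```python
import numpy as np

def compute_distortion(X, c, mu):
    """Distortion J: average squared distance between each x^(i)
    and the centroid mu_{c^(i)} it is assigned to.
    X: (m, n) examples; c: (m,) cluster index per example; mu: (K, n) centroids."""
    diffs = X - mu[c]                       # x^(i) - mu_{c^(i)} for every i
    return np.mean(np.sum(diffs ** 2, axis=1))
```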
What it will do on every step is try to update the cluster assignments c^(1) through c^(30) in this example, or update the positions of the cluster centroids mu_1 and mu_2, in order to keep on reducing this cost function J. By the way, this cost function J also has a name in the literature: it's called the distortion function. I don't know that this is a great name, but if you hear someone talk about the k-means algorithm and the distortion, or the distortion cost function, that's just what this formula for J is computing.

Let's now take a deeper look at the algorithm and why it is trying to minimize this cost function J, or why it's trying to minimize the distortion. Here on top, I've copied over the cost function from the previous slide. It turns out that the first part of k-means, where you assign points to cluster centroids, is trying to update c^(1) through c^(m) to minimize the cost function J as much as possible while holding mu_1 through mu_K fixed. The second step, in contrast, where you move the cluster centroids, leaves c^(1) through c^(m) fixed and updates mu_1 through mu_K to minimize the cost function, or the distortion, as much as possible. Let's take a look at why this is the case.

During the first step, if you want to choose the values of c^(1) through c^(m), or a particular value c^(i), to minimize this, what would make the distance between x^(i) and mu_{c^(i)} as small as possible? This is the distance, or really the squared distance, between a training example x^(i) and the location of the cluster centroid to which it has been assigned. So if you want to minimize this squared distance, you should assign x^(i) to the closest cluster centroid. To take a simplified example, if you have two cluster centroids, say cluster centroids 1 and 2, and just a single training example x^(i), then if you were to assign it to cluster centroid 1, this squared distance would be this large distance, squared, and if you were to assign it to cluster centroid 2, it would be the square of this much smaller distance. So if you want to minimize this term, you take x^(i) and assign it to the closer cluster centroid, which is exactly what the algorithm is doing up here. That's why the step where you assign points to cluster centroids is choosing the values of c^(i) to minimize J, without changing mu_1 through mu_K for now, just choosing the values of c^(1) through c^(m) to make these terms as small as possible.

How about the second step of the k-means algorithm, moving the cluster centroids? It turns out that choosing mu_k to be the average, or the mean, of the points assigned to cluster k is the choice of mu that minimizes this expression. To take a simplified example, say you have a cluster with just two points assigned to it, shown as follows. With the cluster centroid here, the average of the squared distances would be a distance of 1 here, squared, plus this distance here, which is 9, squared, and you take the average of these two numbers. That turns out to be one half of 1 plus 81, which is 41. But if you were to take the average of these two points, 1 plus 11 over 2, that's equal to 6, and if you were to move the cluster centroid over here to the middle, then the average of these two squared distances, with a distance of 5 and 5 here, is one half of 5 squared plus 5 squared, which is equal to 25.
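As a small illustration of these two steps, here is a sketch with hypothetical helper names of the assignment step and the centroid-update step, together with a check of the two-point example above. I'm taking the initial centroid to sit at 2, which is implied by the stated distances of 1 and 9 but not said explicitly in the lecture.

```python
import numpy as np

def assign_to_closest(X, mu):
    """Step 1: choose c^(i) as the index of the nearest centroid,
    which minimizes J over the assignments with mu held fixed."""
    dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)  # shape (m, K)
    return np.argmin(dists, axis=1)

def move_to_mean(X, c, K):
    """Step 2: move each centroid to the mean of its assigned points,
    which minimizes J over mu with the assignments held fixed.
    (Assumes every cluster has at least one point assigned.)"""
    return np.array([X[c == k].mean(axis=0) for k in range(K)])

# Two points at 1 and 11 in one dimension, both assigned to one cluster.
points = np.array([[1.0], [11.0]])
print(np.mean(np.sum((points - np.array([[2.0]])) ** 2, axis=1)))  # 41.0
print(np.mean(np.sum((points - np.array([[6.0]])) ** 2, axis=1)))  # 25.0 at the mean
```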
And 25 is a much smaller average squared distance than 41. In fact, you can play around with the location of this cluster centroid and convince yourself that taking this mean location, this average location in the middle of these two training examples, really is the value that minimizes the average squared distance.

The fact that the k-means algorithm is optimizing a cost function J means that it is guaranteed to converge. That is, on every single iteration, the distortion cost function should go down or stay the same. If it ever fails to do so, in the worst case, if it ever goes up, that means there's a bug in the code. It should never go up, because every single step of k-means is setting the values c^(i) and mu_k to try to reduce the cost function. Also, if the cost function ever stops going down, that gives you one way to test whether k-means has converged: once there's a single iteration where it stays the same, that usually means k-means has converged and you should just stop running the algorithm any further. In some rare cases, you will run k-means for a long time and the cost function, or the distortion, goes down very, very slowly. That's a bit like gradient descent, where running it even longer might help a bit. But if the rate at which the cost function is going down has become very, very slow, you might also just say this is good enough, call it close enough to convergence, and not spend even more compute cycles running the algorithm for even longer.

So these are some of the ways that computing the cost function is helpful: it helps you figure out whether the algorithm has converged. It turns out that there's one other very useful way to take advantage of the cost function, which is to use multiple different random initializations of the cluster centroids. If you do this, you can often find much better clusters with k-means. Let's take a look in the next video at how to do that.
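To tie the convergence discussion to code, here is one possible sketch of a k-means loop that tracks the distortion, checks that it never increases, and stops once it has essentially stopped decreasing. It reuses the hypothetical helpers from the earlier sketches (assign_to_closest, move_to_mean, compute_distortion), and the initial centroids mu_init and the stopping tolerance are arbitrary choices, not values from the lecture.

```python
import numpy as np

def run_kmeans(X, mu_init, max_iters=100, tol=1e-6):
    mu = mu_init.copy()
    prev_cost = np.inf
    for _ in range(max_iters):
        c = assign_to_closest(X, mu)            # step 1: assign points to centroids
        mu = move_to_mean(X, c, mu.shape[0])    # step 2: move centroids to the mean
        cost = compute_distortion(X, c, mu)
        # The distortion should never go up; if it does, suspect a bug in the code.
        assert cost <= prev_cost + 1e-12, "distortion increased: likely a bug"
        if prev_cost - cost < tol:              # stopped (or nearly stopped) decreasing
            break
        prev_cost = cost
    return c, mu, cost
```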