In the last video, you saw how, if you have features for each movie, such as features x1 and x2 that tell you how much a movie is a romance movie and how much it is an action movie, then you can use what is essentially linear regression to learn to predict movie ratings. But what if you don't have those features x1 and x2? Let's take a look at how you can learn, or come up with, those features x1 and x2 from the data.

So here's the data we had before, but suppose that instead of having these numbers for x1 and x2, we didn't know in advance what the values of the features x1 and x2 are. I'm going to replace them with question marks over here. Now, just for the purposes of illustration, let's say we had somehow already learned parameters for the four users. Say we learned w1 = [5, 0] and b1 = 0 for user 1; w2 = [5, 0] and b2 = 0; w3 = [0, 5] and b3 = 0; and for user 4, w4 = [0, 5] and b4 = 0. We'll worry later about how we might have come up with these parameters w and b, but let's say we have them already. As a reminder, to predict user j's rating on movie i, we use w^(j) dot product with the features x^(i), plus b^(j). To simplify this example, all the values of b happen to be equal to 0, so just to reduce a little bit of writing, I'm going to ignore b for the rest of this example.

Let's take a look at how we can try to guess what might be reasonable features for movie 1. If these are the parameters you have on the left, then given that Alice rated movie 1 a 5, we should have that w1 · x1 is about equal to 5; w2 · x1 should also be about equal to 5, because Bob rated it a 5; w3 · x1 should be close to 0; and w4 · x1 should be close to 0 as well. So the question is: given these values for w that we have up here, what choice for x1 would cause these values to come out right? One possible choice would be x1 = [1, 0] for that first movie, in which case w1 · x1 equals 5, w2 · x1 equals 5, and similarly w3 or w4 dotted with this feature vector x1 equals 0.

So what we have is that, if you have the parameters for all four users, and the four ratings in this example that you want to try to match, you can take a reasonable guess at a feature vector x1 for movie 1 that would make good predictions for those four ratings. And similarly, given these parameter vectors, you can also try to come up with a feature vector x2 for the second movie, a feature vector x3 for the third movie, and so on, so that the algorithm's predictions on these additional movies come close to the ratings actually given by the users. Let's come up with a cost function for actually learning the values of x1 and x2.

By the way, notice that this works only because we have parameters for four users; that's what allows us to try to guess appropriate features x1. This is why, in a typical linear regression application, if you have just a single user, you don't actually have enough information to figure out what the features x1 and x2 would be, and why in the linear regression context you saw in Course 1 you can't come up with features x1 and x2 from scratch. But in collaborative filtering, it is because you have ratings from multiple users on the same item, the same movie, that it becomes possible to guess plausible values for these features.
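Just to make that arithmetic concrete, here is a minimal NumPy sketch of the check described above; the matrix W and the guessed feature vector x1 = [1, 0] simply encode the numbers from this example, and the variable names are only for this sketch.

```python
import numpy as np

# Learned per-user parameters from the example (all b's are 0, so we drop them).
W = np.array([
    [5.0, 0.0],  # user 1 (Alice)
    [5.0, 0.0],  # user 2 (Bob)
    [0.0, 5.0],  # user 3
    [0.0, 5.0],  # user 4
])

# A guessed feature vector for movie 1: fully "romance", not at all "action".
x1 = np.array([1.0, 0.0])

# Predicted rating of movie 1 by each user j is w^(j) . x1
predictions = W @ x1
print(predictions)  # [5. 5. 0. 0.] -- matches the ratings 5, 5, 0, 0 in the example
```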
So, given w^(1), b^(1), w^(2), b^(2), and so on through w^(n_u), b^(n_u) for the n_u users, if you want to learn the features x^(i) for a specific movie i, here's a cost function we could use, which, as usual, minimizes a squared error. If the predicted rating by user j on movie i is w^(j) · x^(i) + b^(j), we take the squared difference from the actual movie rating y(i,j), and, as before, sum over the users j, but only over the values of j where r(i,j) = 1, with the usual factor of one half in front. If we define this as a cost function for x^(i), then minimizing it as a function of x^(i) chooses the features x^(i) for movie i so that, for all the users j that rated movie i, the squared difference between the rating predicted from your choice of x^(i) and the actual rating the user gave is small. Finally, if we want a regularization term, we add the usual lambda over 2 times the sum from k = 1 through n of (x_k^(i)) squared, where n, as usual, is the number of features.

Lastly, to learn all the features x^(1) through x^(n_m), because we have n_m movies, we can take the cost function on top and sum it over all the movies, from i = 1 through the number of movies, and this becomes a cost function for learning the features for all of the movies in the dataset. So if you have parameters w and b for all the users, then minimizing this cost function as a function of x^(1) through x^(n_m), using gradient descent or some other algorithm, will actually let you take a pretty good guess at learning good features for the movies. And this is pretty remarkable: for most machine learning applications the features have to be given externally, but in this algorithm we can actually learn the features for a given movie.

But so far in this video we've assumed you had those parameters w and b for the different users. Where do you get those parameters from? Well, let's put together the algorithm from the last video for learning w and b with what we just talked about in this video for learning x, and that will give us our collaborative filtering algorithm. Here's the cost function for learning the features, which is what we derived on the last slide, alongside the cost function for learning w and b from the last video. Now, it turns out that if we put these two together, this term here is exactly the same as this term here. Notice that summing over all users j and, for each j, over all movies i with r(i,j) = 1 is the same as summing over all movies i and, for each i, over all users j with r(i,j) = 1: either way, the summation is just over all user-movie pairs for which there is a rating. So what I'm going to do is put these two cost functions together into one, writing the summation more explicitly as a sum over all pairs (i, j) where we do have a rating, with the usual squared error term; then take the regularization term from learning the parameters w and b and put it here, and take the regularization term from learning the features x and put it here. This ends up being our overall cost function for learning w, b, and x. And it turns out that if you minimize this cost function as a function of w and b as well as x, this algorithm actually works. Here's what I mean.
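To make the combined cost concrete, here is a minimal sketch of how it could be computed, assuming the ratings are stored in a matrix Y with a companion indicator matrix R; all of the variable and function names here are just for this sketch, not part of any particular library.

```python
import numpy as np

def cofi_cost(X, W, b, Y, R, lam):
    """Collaborative filtering cost for user parameters W, b and movie features X.

    X : (num_movies, num_features)  learned feature vectors x^(i)
    W : (num_users,  num_features)  per-user parameter vectors w^(j)
    b : (1, num_users)              per-user bias terms b^(j)
    Y : (num_movies, num_users)     ratings y(i, j)
    R : (num_movies, num_users)     R[i, j] = 1 if user j rated movie i, else 0
    lam : regularization parameter lambda
    """
    # Predicted rating for every (movie, user) pair: w^(j) . x^(i) + b^(j)
    predictions = X @ W.T + b
    # Squared error, counted only where a rating exists (r(i, j) = 1)
    errors = (predictions - Y) * R
    cost = 0.5 * np.sum(errors ** 2)
    # Regularize both the user parameters w and the movie features x (but not b)
    cost += (lam / 2) * (np.sum(W ** 2) + np.sum(X ** 2))
    return cost
```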
Suppose we had three users and two movies, and you have ratings for four of the six user-movie pairs but not the other two. What the first summation does is sum over all the users: for user 1 it has a term in the cost function for this rating, for user 2 it has terms for these, and for user 3 it has a term for this one. So we're summing over users first and then having one term for each movie that user rated. An alternative way to carry out this summation is to first look at movie 1, which is what this other summation does, and include all the users that rated movie 1, and then look at movie 2 and have a term for all the users that rated movie 2. You see that in both cases we're just summing over the same four pairs where the user rated the corresponding movie. That's why the summation on top and the summation here are two ways of summing over all of the pairs where the user rated the movie.

So how do you minimize this cost function as a function of w, b, and x? One thing you can do is use gradient descent. In Course 1, when we learned about linear regression, this is the gradient descent algorithm you saw: we had a cost function J, which is a function of the parameters w and b, and we applied gradient descent as follows. With collaborative filtering, the cost function isn't a function of just w and b; it's now a function of w, b, and x. Here I'm using w and b to denote the parameters for all of the users, and x, a bit informally, to denote the features for all of the movies. If you're able to take partial derivatives with respect to the different parameters, you can continue to update w and b as before; but now we need to optimize with respect to x as well, so we also update each of the features x using gradient descent, updating x to x minus the learning rate times the partial derivative of the cost J(w, b, x) with respect to x. And it turns out that if you do this, you actually find pretty good values of w and b as well as x. I'm using the notation here a little informally and not keeping careful track of the superscripts and subscripts, but the key takeaway is that in this formulation of the problem the parameters are w and b, and x is now also a parameter, which is why we minimize the cost function as a function of all three sets of parameters, w and b as well as x.

The algorithm we just derived is called collaborative filtering. The name refers to the sense that, because multiple users have rated the same movie collaboratively, together giving you a sense of what this movie may be like, you can guess appropriate features for that movie; and this in turn allows you to predict how other users who haven't yet rated that same movie may rate it in the future. So collaborative filtering is this gathering of data from multiple users, this collaboration between users, to help you predict ratings even for other users in the future.
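As a rough sketch of what those simultaneous updates could look like, continuing with the hypothetical X, W, b, Y, R matrices from the sketch above and a learning rate alpha, one gradient descent step might be written as follows; the gradient expressions are just the partial derivatives of the regularized squared-error cost.

```python
import numpy as np

def gradient_descent_step(X, W, b, Y, R, lam, alpha):
    """One gradient descent step that updates the movie features X
    along with the user parameters W and b (illustrative sketch)."""
    # Prediction errors, zeroed out where no rating exists
    E = (X @ W.T + b - Y) * R              # (num_movies, num_users)

    # Partial derivatives of the regularized squared-error cost
    grad_X = E @ W + lam * X               # with respect to the features x^(i)
    grad_W = E.T @ X + lam * W             # with respect to the parameters w^(j)
    grad_b = E.sum(axis=0, keepdims=True)  # with respect to the biases b^(j)

    # Simultaneous updates, as in ordinary gradient descent
    return X - alpha * grad_X, W - alpha * grad_W, b - alpha * grad_b
```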
So far, our problem formulation has used movie ratings from 1 to 5 stars or from 0 to 5 stars. A very common use case of recommender systems is when you have binary labels, such as whether the user favorited, liked, or interacted with an item. Let's go see how the model you've seen so far generalizes to binary labels in the next video.