So let's take a look at how we can develop a recommender system if we had features of each item, or features of each movie. So here's the same dataset that we had previously, with the four users having rated some, but not all, of the five movies. What if we additionally have features of the movies? So here I've added two features, x1 and x2, that tell us how much each of these is a romance movie and how much each of these is an action movie. For example, Love at Last is a very romantic movie, so this feature takes on 0.9, but it isn't at all an action movie, so that feature takes on 0. But it turns out Non-Stop Car Chases has just a little bit of romance in it, so its romance feature is 0.1, but it has a ton of action, so that feature takes on the value 1.0.

You'll recall that I had used the notation n_u to denote the number of users, which is 4, and m to denote the number of movies, which is 5. I'm also going to introduce n to denote the number of features we have here, and so n is equal to 2, because we have two features, x1 and x2, for each movie. With these features we have, for example, that the feature vector for movie 1, that is, the movie Love at Last, would be x(1) = [0.9, 0], and the feature vector for the third movie, Cute Puppies of Love, would be x(3) = [0.99, 0].

Let's start by taking a look at how we might make predictions for Alice's movie ratings. For user 1, that is, Alice, let's say we predict the rating for movie i as w · x(i) + b. So this is just a lot like linear regression. For example, if we end up choosing the parameters w = [5, 0] and, say, b = 0, then the prediction for movie 3, where the features are [0.99, 0] (first feature 0.99, second feature 0, just copied from above), would be w · x(3) + b = 0.99 × 5 + 0 × 0, which turns out to be equal to 4.95. And this rating seems pretty plausible. It looks like Alice has given high ratings to Love at Last and Romance Forever, two highly romantic movies, but given low ratings to the action movies Non-Stop Car Chases and Swords vs. Karate. So if we look at Cute Puppies of Love, predicting that she might rate it 4.95 seems quite plausible. And so these parameters w and b for Alice seem like a reasonable model for predicting her movie ratings.

Just to add a little bit of notation: because we have not just one user but multiple users, really n_u = 4 users, I'm going to add a superscript (1) to w and to b, to denote that these are the parameters w(1) and b(1) for user 1, so that we actually have different parameters for each of the four users in our dataset. More generally, in this model we can, for user j, not just user 1 now, predict user j's rating for movie i as w(j) · x(i) + b(j). So here the parameters w(j) and b(j) are the parameters used to predict user j's rating for movie i, which is a function of x(i), the features of movie i. And this is a lot like linear regression, except that we're fitting a different linear regression model for each of the four users in the dataset.

So let's take a look at how we can formulate the cost function for this algorithm. As a reminder, our notation is that r(i,j) = 1 if user j has rated movie i, and 0 otherwise, and y(i,j) is the rating given by user j to movie i. And on the previous slide, we defined w(j), b(j) as the parameters for user j, and x(i) as the feature vector for movie i.
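To make this concrete, here's a minimal NumPy sketch of that prediction step, using the hand-picked parameters from the example. The lecture only spells out feature values for three of the five movies, so the rows for Romance Forever and Swords vs. Karate are made up for illustration:

```python
import numpy as np

# Feature vectors x(i) = [x1 (romance), x2 (action)] for the five movies.
# Rows for Romance Forever and Swords vs. Karate are illustrative guesses;
# the lecture only gives values for the other three movies.
X = np.array([
    [0.9,  0.0],   # Love at Last
    [1.0,  0.01],  # Romance Forever (illustrative)
    [0.99, 0.0],   # Cute Puppies of Love
    [0.1,  1.0],   # Non-Stop Car Chases
    [0.0,  0.9],   # Swords vs. Karate (illustrative)
])

# Hand-picked parameters for user 1 (Alice), as in the example above.
w1 = np.array([5.0, 0.0])
b1 = 0.0

# Predicted rating for movie 3 (Cute Puppies of Love): w(1) . x(3) + b(1)
print(w1 @ X[2] + b1)  # 0.99 * 5 + 0 * 0 = 4.95
```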
So the model we have is: for user j and movie i, predict the rating to be w(j) · x(i) + b(j). I'm going to introduce just one new piece of notation: I'll use m(j) to denote the number of movies rated by user j. So if a user has rated 4 movies, then m(j) would be equal to 4, and if a user has rated 3 movies, then m(j) would be equal to 3.

What we'd like to do is learn the parameters w(j) and b(j), given the data that we have, that is, given the ratings user j has given to a set of movies. The algorithm we're going to use is very similar to linear regression. So let's write out the cost function for learning the parameters w(j) and b(j) for a given user j; let's focus on one user, user j, for now. I'm going to use the mean squared error criterion: the cost will be the prediction, w(j) · x(i) + b(j), minus the actual rating the user had given, y(i,j), squared. And we'll try to choose parameters w and b to minimize the squared error between the predicted rating and the actual rating that was observed. But the user hasn't rated all the movies, so we're going to sum only over the values of i where r(i,j) = 1, that is, only over the movies i that user j has actually rated. And then finally, we take the usual normalization, 1/(2m(j)). This is very much like the cost function we had for linear regression with m, or really m(j), training examples: you're summing over the m(j) movies for which you have a rating, taking a squared error, and normalizing by 1/(2m(j)).

Let me add just one more term to this cost function, the regularization term, to prevent overfitting: our usual regularization parameter lambda, divided by 2m(j), times the sum of the squared values of the parameters w, where n is the number of features in x(i), which is the same as the number of entries in w(j). Putting it together, the cost function for user j is

J(w(j), b(j)) = 1/(2m(j)) · Σ_{i: r(i,j)=1} (w(j) · x(i) + b(j) − y(i,j))² + λ/(2m(j)) · Σ_{k=1..n} (w_k(j))²

If you minimize this as a function of w(j) and b(j), you should come up with a pretty good choice of parameters w(j) and b(j) for making predictions for user j's ratings.

Now, before moving on: it turns out that for recommender systems it's actually convenient to eliminate this division by m(j). m(j) is just a constant in this expression, so even if you take it out, you should end up with the same minimizing values of w(j) and b(j).

So, to learn the parameters w(j), b(j) for user j, we minimize this cost function as a function of w(j) and b(j). But instead of focusing on a single user, let's look at how we learn the parameters for all of the users. To learn the parameters w(1), b(1), w(2), b(2), through w(n_u), b(n_u), we take this cost function and sum it over all n_u users:

J(w(1), b(1), …, w(n_u), b(n_u)) = 1/2 · Σ_{j=1..n_u} Σ_{i: r(i,j)=1} (w(j) · x(i) + b(j) − y(i,j))² + λ/2 · Σ_{j=1..n_u} Σ_{k=1..n} (w_k(j))²

and this becomes the cost for learning all the parameters for all of the users.
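As a rough sketch, this summed cost (with the 1/(2m(j)) factors dropped, as just described) can be computed in vectorized form. The function name and array layout here are my own choices, not from the lecture:

```python
import numpy as np

def compute_cost(X, W, B, Y, R, lam):
    """
    X : (num_movies, n)          feature vector x(i) for each movie
    W : (num_users, n)           parameter vector w(j) for each user
    B : (num_users,)             bias b(j) for each user
    Y : (num_movies, num_users)  ratings y(i,j) (arbitrary where unrated)
    R : (num_movies, num_users)  r(i,j) = 1 if user j rated movie i, else 0
    lam : regularization parameter lambda
    """
    # Predictions w(j) . x(i) + b(j) for every (movie, user) pair.
    predictions = X @ W.T + B           # B broadcasts across movies
    # Squared error counted only where a rating exists (r(i,j) = 1).
    squared_err = R * (predictions - Y) ** 2
    # Regularize every w_k(j); the 1/(2 m(j)) factors are dropped.
    return 0.5 * np.sum(squared_err) + (lam / 2) * np.sum(W ** 2)
```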
And if we use gradient descent or any other optimization algorithm to minimize this as a function of w(1), b(1) all the way through w(n_u), b(n_u), then you have a pretty good set of parameters for predicting movie ratings for all the users; a minimal sketch of such a training loop is shown at the end of this section. You may notice that this algorithm is a lot like linear regression, where the prediction w(j) · x(i) + b(j) plays a role similar to the output f(x) of linear regression, only now we're training a different linear regression model for each of the n_u users.

So that's how you can learn parameters and predict movie ratings if you have access to features x1 and x2 that tell you how much each movie is a romance movie and how much it is an action movie. But where do these features come from? And what if you don't have access to features that describe the movies in enough detail to make these predictions? In the next video, we'll look at a modification of this algorithm that will let you make predictions, let you make recommendations, even if you don't have, in advance, features that describe the items, or the movies, in sufficient detail to run the algorithm we just saw. So let's go on and take a look at that in the next video.
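As promised above, here's a minimal sketch of that training loop: plain batch gradient descent on the summed cost. The gradients follow from differentiating the cost with respect to each w(j) and b(j); the function name and hyperparameter values are mine, chosen for illustration:

```python
import numpy as np

def fit_user_models(X, Y, R, lam=1.0, alpha=0.01, num_iters=1000):
    """Fit one linear model (w(j), b(j)) per user by gradient descent."""
    num_movies, n = X.shape
    num_users = Y.shape[1]
    W = np.zeros((num_users, n))
    B = np.zeros(num_users)
    for _ in range(num_iters):
        # Prediction error, zeroed out where no rating exists.
        E = R * (X @ W.T + B - Y)       # shape (num_movies, num_users)
        grad_W = E.T @ X + lam * W      # d(cost)/d w(j), with regularization
        grad_B = E.sum(axis=0)          # d(cost)/d b(j); b is not regularized
        W -= alpha * grad_W
        B -= alpha * grad_B
    return W, B
```

With the learned W and B in hand, the full matrix of predicted ratings is simply X @ W.T + B.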