In the last video, you saw how to use a test set to evaluate the performance of a model. In this video, let's make one further refinement to that idea, which lets you use the technique to automatically choose a good model for your machine learning algorithm.

One thing we've seen is that once the model's parameters w and b have been fit to the training set, the training error may not be a good indicator of how well the algorithm will do, or how well it will generalize to new examples that were not in the training set. In particular, for this example, the training error will be pretty much zero, and that's likely much lower than the actual generalization error, by which I mean the average error on new examples that were not in the training set. What you saw in the last video is that J_test, the performance of the algorithm on examples it was not trained on, is a better indicator of how well the model will likely do on new data, meaning other data that's not in the training set.

Let's take a look at how this affects how we might use a test set to choose a model for a given machine learning application. If we're fitting a function to predict housing prices, or some other regression problem, one model you might consider is a linear model like this. This is a first-order polynomial, and I'm going to use d equals 1 on this slide to denote fitting a first-order (degree 1) polynomial. If you were to fit a model like this to your training set, you'd get some parameters w and b, and you could then compute J_test to estimate how well this generalizes to new data. On this slide, I'm going to write w1, b1, with a superscript 1, to denote the parameters you get if you fit a first-order, d equals 1, polynomial.

You might also consider fitting a second-order polynomial, or quadratic model. If you were to fit this to your training set, you'd get some parameters w2, b2, and you could similarly evaluate those parameters on your test set to get J_test of w2, b2, which gives you a sense of how well the second-order polynomial does. You can go on to try d equals 3, a third-order or degree 3 polynomial, fit the parameters, and similarly get J_test, and you might keep going until, say, you try a 10th-order polynomial and end up with J_test of w10, b10, which tells you how well the 10th-order polynomial is doing.

One procedure you could try, though it turns out not to be the best procedure, is to look at all of these J_test values and see which one is lowest. Say you find that J_test of w5, b5, for the fifth-order polynomial, is lowest. In that case, you might decide that the fifth-order polynomial, d equals 5, does best and choose that model for your application. If you then want to estimate how well this model performs, one thing you could do, though this turns out to be a slightly flawed procedure, is to report the test error, J_test of w5, b5. The reason this procedure is flawed is that J_test of w5, b5 is likely to be an optimistic estimate of the generalization error; in other words, it is likely to be lower than the actual generalization error. That's because in the procedure we talked about on this slide, we effectively fit one extra parameter, d, the degree of the polynomial, and we chose this parameter using the test set.
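To make this flawed procedure concrete, here is a minimal sketch in Python. It assumes a small synthetic 1-D regression dataset and uses numpy's polyfit as a stand-in for fitting each polynomial model; the data, variable names, and split sizes are illustrative, not taken from the lecture's slides.

```python
import numpy as np

# Illustrative data: a 1-D regression problem (e.g., house size -> price).
rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=50)
y = 1.5 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.shape)

# Simple train/test split (the flawed procedure uses only these two subsets).
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def j(coeffs, x, y):
    """Average squared error, (1 / (2m)) * sum((f(x) - y)^2)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2

# Fit polynomials of degree d = 1..10 on the training set, then
# (flawed!) pick the degree with the lowest *test* error.
j_test = {}
for d in range(1, 11):
    coeffs = np.polyfit(x_train, y_train, deg=d)  # parameters for this degree
    j_test[d] = j(coeffs, x_test, y_test)

best_d = min(j_test, key=j_test.get)
print("degree chosen by test error:", best_d)
# Reporting j_test[best_d] as the generalization error is optimistic,
# because the test set was already used to choose d.
```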
On the previous slide, we saw that if you fit w, b to the training data, then the training error would be an overly optimistic estimate of the generalization error. It turns out that if you choose the parameter d using the test set, then J_test is likewise an overly optimistic estimate, that is, lower than the actual generalization error. So the procedure on this particular slide is flawed, and I don't recommend using it.

Instead, if you want to automatically choose a model, such as deciding what degree of polynomial to use, here's how to modify the training and testing procedure in order to carry out model selection. By model selection, I mean choosing among different models, such as these 10 different models that you might contemplate using for your machine learning application. The modification is that instead of splitting your data into just two subsets, the training set and the test set, we're going to split your data into three different subsets, which we'll call the training set, the cross-validation set, and the test set.

Using our example from before of these 10 training examples, we might put 60% of the data into the training set. The notation for the training set portion will be the same as before, except that now m_train, the number of training examples, will be 6. We might put 20% of the data into the cross-validation set, using the notation x_cv of 1, y_cv of 1 for the first cross-validation example (cv stands for cross-validation), all the way down to x_cv of m_cv, y_cv of m_cv, where m_cv, equal to 2 in this example, is the number of cross-validation examples. Finally, we have the test set, same as before: x_test of 1 through x_test of m_test and y_test of 1 through y_test of m_test, where m_test here is equal to 2, the number of test examples. We'll see on the next slide how to use the cross-validation set.

So the way we'll modify the procedure is this: you've already seen the training set and the test set, and we're going to introduce a new subset of the data called the cross-validation set. The name cross-validation refers to the fact that this is an extra dataset we're going to use to check, or cross-check, the validity, or really the accuracy, of different models. I don't think it's a great name, but that is what people in machine learning have come to call this extra dataset. You may also hear people call this the validation set for short; it's just fewer syllables than cross-validation. In some applications, people also call it the development set, which means basically the same thing, or, for short, the dev set. All of these terms mean the same thing as cross-validation set. I personally use the term dev set most often because it's the shortest, fastest way to say it, but cross-validation is probably used a little more often by machine learning practitioners.

Given these three subsets of the data, the training set, the cross-validation set, and the test set, you can then compute the training error, the cross-validation error, and the test error using these three formulas. As usual, none of these terms includes the regularization term that is included in the training objective. The new term in the middle, the cross-validation error, is just the average squared error over your m_cv cross-validation examples.
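Here is a minimal sketch of the 60% / 20% / 20% split and the three error measures for a simple linear model with a single feature. The dataset and the fitted parameters are placeholders chosen only to illustrate the formulas (average squared error with no regularization term); they are not the lecture's actual numbers.

```python
import numpy as np

def squared_error_cost(w, b, x, y):
    """J = (1 / (2m)) * sum((w*x + b - y)^2), with no regularization term
    (regularization appears only in the training objective, not in these
    evaluation measures)."""
    m = len(x)
    return np.sum((w * x + b - y) ** 2) / (2 * m)

# Hypothetical dataset of 10 examples, split 60% / 20% / 20%.
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

x_train, y_train = x[:6], y[:6]      # m_train = 6
x_cv,    y_cv    = x[6:8], y[6:8]    # m_cv = 2 (cross-validation / dev set)
x_test,  y_test  = x[8:], y[8:]      # m_test = 2

# For some parameters w, b fit on the training set only:
w, b = 2.0, 1.0
j_train = squared_error_cost(w, b, x_train, y_train)
j_cv    = squared_error_cost(w, b, x_cv, y_cv)
j_test  = squared_error_cost(w, b, x_test, y_test)
print(f"J_train={j_train:.3f}  J_cv={j_cv:.3f}  J_test={j_test:.3f}")
```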
This term, in addition to being called the cross-validation error, is also commonly called the validation error for short, or even the development set error, or dev error.

Armed with these three measures of learning algorithm performance, here is how you can go about carrying out model selection. With the 10 models, same as earlier on this slide, with d equals 1, d equals 2, all the way up to the 10th-order polynomial, you can fit the parameters w1, b1, but instead of evaluating them on your test set, you evaluate these parameters on your cross-validation set and compute J_cv of w1, b1. Similarly, for the second model you get J_cv of w2, b2, and so on, all the way down to J_cv of w10, b10. Then, in order to choose a model, you look at which model has the lowest cross-validation error. Concretely, let's say that J_cv of w4, b4 is lowest; in that case, you would pick the fourth-order polynomial as the model to use for this application. Finally, if you want to report an estimate of the generalization error, of how well this model will do on new data, you would do so using that third subset of your data, the test set, and report J_test of w4, b4. Notice that throughout this entire procedure, you fit the parameters using the training set, you then chose the parameter d, the degree of polynomial, using the cross-validation set, and so up to this point you have not fit any parameters, either w, b, or d, to the test set. That's why J_test in this example will be a fair estimate of the generalization error of this model with parameters w4, b4.

So this gives a better procedure for model selection, and it lets you automatically make a decision such as what order of polynomial to choose for your linear regression model. This model selection procedure also works for choosing among other types of models, for example, choosing a neural network architecture. If you are fitting a model for handwritten digit recognition, you might consider three models like these, maybe even a larger set of models than just three, but here are a few different neural networks: small, somewhat larger, and even larger. To help you decide how many layers your neural network should have and how many hidden units per layer, you can train all three of these models and end up with parameters w1, b1 for the first model, w2, b2 for the second, and w3, b3 for the third. You can then evaluate each neural network's performance using J_cv on your cross-validation set. Since this is a classification problem, the most common choice for J_cv would be the fraction of cross-validation examples that the algorithm has misclassified. You would compute this for all three models and then pick the model with the lowest cross-validation error. If, in this example, the second model has the lowest cross-validation error, you would pick that neural network and use the parameters trained for that model. Finally, if you want to report an estimate of the generalization error, you would use the test set to estimate how well the neural network you just chose will do.
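Putting the polynomial example together, here is a hedged sketch of the corrected procedure: fit each candidate degree on the training set only, choose the degree by cross-validation error, and touch the test set only once at the end to report generalization error. As before, the synthetic data and the use of numpy's polyfit are assumptions made purely for illustration.

```python
import numpy as np

# Illustrative 1-D regression data.
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=50)
y = 1.5 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.shape)

# 60% train / 20% cross-validation / 20% test split.
x_train, y_train = x[:30], y[:30]
x_cv,    y_cv    = x[30:40], y[30:40]
x_test,  y_test  = x[40:], y[40:]

def j(coeffs, x, y):
    """Average squared error, (1 / (2m)) * sum((f(x) - y)^2)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2

# Fit each candidate degree on the training set only,
# and score it on the cross-validation set.
params, j_cv = {}, {}
for d in range(1, 11):
    params[d] = np.polyfit(x_train, y_train, deg=d)
    j_cv[d] = j(params[d], x_cv, y_cv)

# Choose the degree with the lowest cross-validation error...
best_d = min(j_cv, key=j_cv.get)
# ...and only then use the untouched test set to report generalization error.
j_test = j(params[best_d], x_test, y_test)
print("chosen degree:", best_d, " reported test error:", j_test)
```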
It's considered best practice in machine learning that if you have to make decisions about your model, such as fitting parameters or choosing the model architecture (say, the neural network architecture, or the degree of polynomial if you're fitting linear regression), you make all those decisions using only your training set and your cross-validation set, and you don't look at the test set at all while you're still making decisions regarding your learning algorithm. Only after you've come up with your one final model do you then evaluate it on the test set. Because you haven't made any decisions using the test set, this ensures that your test error is a fair, and not overly optimistic, estimate of how well your model will generalize to new data.

So that's model selection, and it's actually a very widely used procedure. I use it all the time to automatically choose what model to use for a given machine learning application. Earlier this week, I mentioned running diagnostics to decide how to improve the performance of a learning algorithm. Now that you have a way to evaluate learning algorithms and even automatically choose a model, let's dive more deeply into examples of some diagnostics. The most powerful diagnostic that I know of, and that I use for a lot of machine learning applications, is one called bias and variance. Let's take a look at what that means in the next video.