You saw in the last video how different choices of the degree of polynomial d affect the bias and variance of your learning algorithm, and therefore its overall performance. In this video, let's take a look at how regularization, specifically the choice of the regularization parameter lambda, affects the bias and variance, and therefore the overall performance of the algorithm. This, it turns out, will be helpful when you want to choose a good value of the regularization parameter lambda for your algorithm. Let's take a look. In this example, I'm going to use a fourth-order polynomial, but we're going to fit this model using regularization, where here the value of lambda is the regularization parameter that controls how much you trade off keeping the parameters w small versus fitting the training data well. Let's start with the example of setting lambda to be a very large value. Say lambda is equal to 10,000. If you were to do so, you would end up fitting a model that looks roughly like this. Because if lambda were very, very large, then the algorithm is highly motivated to keep these parameters w very small, and so you end up with w1, w2, really all of these parameters, very close to zero. The model ends up being f of x is just approximately b, a constant value, which is why you end up with a model like this. This model clearly has high bias, and it underfits the training data, because it doesn't even do well on the training set, and jtrain is large. Let's take a look at the other extreme. Let's say you set lambda to be a very small value. In fact, let's go to the extreme of setting lambda equals zero. With that choice of lambda, there is no regularization, so we're just fitting a fourth-order polynomial with no regularization, and you end up with that curve that you saw previously that overfits the data.
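To make the two extremes concrete, here is a minimal sketch, not from the lecture: the dataset is made up, and `fit_ridge` is my own stand-in that solves the regularized least-squares problem in closed form, leaving the intercept b unregularized. With a very large lambda all the w's are driven toward zero and the model collapses to roughly a constant b; with lambda equal to zero you get the unregularized fit.

```python
import numpy as np

# Made-up training set, purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 20)
y = 1.0 + np.sin(2 * x) + rng.normal(0, 0.2, 20)

# Fourth-order polynomial features: x, x^2, x^3, x^4.
X = np.column_stack([x ** d for d in range(1, 5)])

def fit_ridge(X, y, lam):
    """Minimize (1/2m)||Xw + b - y||^2 + (lam/2m)||w||^2, b unregularized.

    Centering X and y lets us solve for w on the centered data and then
    recover the intercept b from the means.
    """
    mx, my = X.mean(axis=0), y.mean()
    w = np.linalg.solve((X - mx).T @ (X - mx) + lam * np.eye(X.shape[1]),
                        (X - mx).T @ (y - my))
    return w, my - mx @ w

def j_train(X, y, w, b):
    # Mean squared training error (no regularization term in jtrain itself).
    return np.mean((X @ w + b - y) ** 2) / 2

# Very large lambda: w is pushed toward zero, so f(x) ~ b (underfits).
w_big, b_big = fit_ridge(X, y, 1e9)
# Lambda = 0: no regularization at all (prone to overfit).
w_zero, b_zero = fit_ridge(X, y, 0.0)
```

With `lam=1e9`, `w_big` is essentially zero and `b_big` is close to the mean of y, matching the "f of x is approximately b" picture; the unregularized fit achieves a lower training error, matching the "jtrain is large when lambda is huge" point.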
What we saw previously was that when you have a model like this, jtrain is small, but jcv is much larger than jtrain, and so this indicates we have high variance, and it overfits the data. It would be nice if you could have some intermediate value of lambda, not as large as 10,000, but not so small as zero, so that hopefully you get a model that looks like this, that is just right and fits the data well, with small jtrain and small jcv. So if you are trying to decide what is a good value of lambda to use for the regularization parameter, cross-validation gives you a way to do so as well. Let's take a look at how we could do so. Just as a reminder, the problem we're addressing is: if you're fitting a fourth-order polynomial, so that's the model, and you're using regularization, how can you choose a good value of lambda? This would be a procedure similar to what you saw for choosing the degree of polynomial d using cross-validation. Specifically, let's say we try to fit a model using lambda equals zero. We would minimize the cost function using lambda equals zero and end up with some parameters w1, b1, and you can then compute the cross-validation error, jcv of w1, b1. Now let's try a different value of lambda. Say you try lambda equals 0.01. Then again, minimizing the cost function gives you a second set of parameters, w2, b2, and you can also see how well that does on the cross-validation set, and so on. Let's keep trying other values of lambda. In this example, I'm going to try doubling it to lambda equals 0.02, and so that would give you jcv of w3, b3, and so on. Then let's double it again and again; after doubling a number of times, you end up with lambda approximately equal to 10, and that would give you parameters w12, b12, and jcv of w12, b12.
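The sweep just described can be sketched as follows. This is a hypothetical illustration, not the course's code: the dataset, the train/cross-validation split, and the closed-form ridge-style solver are my own stand-ins. It tries lambda = 0, then 0.01 doubled repeatedly up to roughly 10, twelve values in all, and picks the one with the lowest cross-validation error.

```python
import numpy as np

# Made-up data, split into training and cross-validation sets.
rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 30)
y = 1.0 + x - 0.5 * x ** 2 + rng.normal(0, 0.3, 30)
x_tr, y_tr = x[:20], y[:20]
x_cv, y_cv = x[20:], y[20:]

def features(x, degree=4):
    # Fourth-order polynomial features by default.
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def fit_ridge(X, y, lam):
    # Regularized least squares in closed form; intercept b unregularized.
    mx, my = X.mean(axis=0), y.mean()
    w = np.linalg.solve((X - mx).T @ (X - mx) + lam * np.eye(X.shape[1]),
                        (X - mx).T @ (y - my))
    return w, my - mx @ w

def j(X, y, w, b):
    return np.mean((X @ w + b - y) ** 2) / 2

# lambda = 0, then 0.01, 0.02, 0.04, ..., up to ~10 (twelve values in all,
# matching the w1,b1 through w12,b12 of the lecture).
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]

j_cv = []
for lam in lambdas:
    w, b = fit_ridge(features(x_tr), y_tr, lam)
    j_cv.append(j(features(x_cv), y_cv, w, b))

best = int(np.argmin(j_cv))
best_lam = lambdas[best]
# You would then keep the winning (w, b) and report jtest on a held-out
# test set as the generalization estimate.
```

The doubling schedule is just the one used in the lecture; in practice any reasonably dense grid over several orders of magnitude works the same way.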
By trying out a large range of possible values for lambda, fitting parameters using those different regularization parameters, and then evaluating the performance on the cross-validation set, you can then try to pick what is the best value for the regularization parameter. If in this example you find that jcv of w5, b5 has the lowest value of all of these different cross-validation errors, you might then decide to pick this value for lambda, and so use w5, b5 as the chosen parameters. Finally, if you want to report out an estimate of the generalization error, you would then report out the test set error, jtest of w5, b5. To further hone intuition about what this algorithm is doing, let's take a look at how training error and cross-validation error vary as a function of the parameter lambda. In this figure, I've changed the x-axis again. Notice that the x-axis here is annotated with the value of the regularization parameter lambda. If we look at the extreme of lambda equals 0, here on the left, that corresponds to not using any regularization, and so that's where we wind up with this very wiggly curve. If lambda was small, or even 0, then we have a high-variance model, and so jtrain is going to be small and jcv is going to be large, because the model does great on the training data but does much worse on the cross-validation data. The extreme on the right, with very large values of lambda, say lambda equals 10,000, ends up fitting a model that looks like that. This has high bias, it underfits the data, and it turns out jtrain will be high and jcv will be high as well. In fact, if you were to look at how jtrain varies as a function of lambda, you would find that jtrain goes up like this.
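These two curves can be traced numerically. Below is a made-up sketch, again with my own data and a closed-form ridge-style solver as stand-ins for the lecture's model, that computes jtrain and jcv over a grid of lambda values; jtrain rises as lambda grows, while jcv is typically large at both extremes and smallest somewhere in between.

```python
import numpy as np

# Made-up data split into training and cross-validation sets.
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, 40)
y = x - 0.5 * x ** 2 + rng.normal(0, 0.3, 40)
x_tr, y_tr = x[:25], y[:25]
x_cv, y_cv = x[25:], y[25:]

def features(x, degree=4):
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def fit_ridge(X, y, lam):
    # Closed-form regularized least squares; intercept b unregularized.
    mx, my = X.mean(axis=0), y.mean()
    w = np.linalg.solve((X - mx).T @ (X - mx) + lam * np.eye(X.shape[1]),
                        (X - mx).T @ (y - my))
    return w, my - mx @ w

def cost(X, y, w, b):
    return np.mean((X @ w + b - y) ** 2) / 2

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
j_train, j_cv = [], []
for lam in lambdas:
    w, b = fit_ridge(features(x_tr), y_tr, lam)
    j_train.append(cost(features(x_tr), y_tr, w, b))
    j_cv.append(cost(features(x_cv), y_cv, w, b))
# j_train rises monotonically with lambda; j_cv typically falls, bottoms
# out at some intermediate lambda, and then rises again.
```

The monotone rise of jtrain is not an accident of this dataset: adding more weight to the penalty can only pull the fit away from the unconstrained training-error minimum.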
That's because in the optimization cost function, the larger lambda is, the more the algorithm is trying to keep w squared small, that is, the more weight it's giving to this regularization term, and thus the less attention it's paying to actually doing well on the training set. Remember, this term on the left is jtrain, so the more it's trying to keep the parameters small, the less good a job it does of minimizing the training error. That's why as lambda increases, the training error jtrain will tend to increase like so. Now, how about the cross-validation error? It turns out the cross-validation error will look like this. That's because we've seen that if lambda is too small or too large, then the model doesn't do well on the cross-validation set: it either overfits, here on the left, or underfits, here on the right. There'll be some intermediate value of lambda that causes the algorithm to perform best, and what cross-validation is doing is trying out a lot of different values of lambda. This is what we saw on the last slide: try lambda equals zero, lambda equals 0.01, lambda equals 0.02; try a lot of different values of lambda, evaluate the cross-validation error at a lot of these different points, and then hopefully pick a value that has low cross-validation error. This will hopefully correspond to a good model for your application. If you compare this diagram to the one we had in the previous video, where the horizontal axis was the degree of polynomial, these two diagrams look a little bit like mirror images of each other, not mathematically and not in any formal way. That's because when you're choosing the degree of polynomial, the left part of that curve corresponded to underfitting and high bias, and the right part corresponded to overfitting and high variance, whereas in this one, high variance is on the left and high bias is on the right. That's why these two images look a little bit like mirror images of each other.
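Written out, the cost being minimized here is the regularized cost from the earlier videos, where the first sum is the squared-error term (the "term on the left" that corresponds to jtrain) and the second sum is the regularization term whose influence grows with lambda:

```latex
J(\vec{w}, b) =
\underbrace{\frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}\!\left(\vec{x}^{(i)}\right) - y^{(i)}\right)^{2}}_{\text{squared-error term}}
\;+\;
\underbrace{\frac{\lambda}{2m}\sum_{j=1}^{n} w_j^{2}}_{\text{regularization term}}
```

As lambda grows, minimizing J trades accuracy on the first term for smallness of the w's in the second, which is exactly why jtrain creeps up.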
But in both cases, cross-validation, evaluating different values, can help you choose a good value of d or a good value of lambda. So that's how the choice of the regularization parameter lambda affects the bias and variance and overall performance of your algorithm, and you've also seen how you can use cross-validation to make a good choice for the regularization parameter lambda. Now, so far we've talked about how having a high training set error, high jtrain, is indicative of high bias, and how having a high cross-validation error, jcv, specifically if it's much higher than jtrain, is indicative of a variance problem. But what do these words "high" or "much higher" actually mean? Let's take a look at that in the next video, where we'll see how you can look at the numbers jtrain and jcv and judge whether they're high or low. It turns out that one further refinement of these ideas, establishing a baseline level of performance for your learning algorithm, will make it much easier for you to look at these numbers, jtrain and jcv, and judge whether they're high or low. Let's take a look at what all this means in the next video.