The typical workflow of developing a machine learning system is that you have an idea, you train a model, and you almost always find that it doesn't work as well as you wish yet. When I'm training a machine learning model, it pretty much never works that well the first time. So a key part of the process of building a machine learning system is deciding what to do next in order to improve its performance. I've found across many different applications that looking at the bias and variance of a learning algorithm gives you very good guidance on what to try next. Let's take a look at what this means.

You might remember this example from the first course on linear regression: given this data set, if you fit a straight line to it, it doesn't do that well, and we said that this algorithm has high bias, or that it underfits the data set. If instead you fit a fourth-order polynomial, then it has high variance, or it overfits. And in the middle, if you fit a quadratic polynomial, it looks pretty good, and we said that was just right.

Because this is a problem with just a single feature x, we could plot the function f and look at it like this. But if you had more features, you can't plot f and visualize whether it's doing well as easily. So instead of looking at plots like this, a more systematic way to diagnose whether your algorithm has high bias or high variance is to look at its performance on the training set and on the cross-validation set.

In particular, let's look at the example on the left. If you were to compute J-train, how well does the algorithm do on the training set? Not that well, so J-train here would be high, because there are pretty large errors between the examples and the model's predictions. How about J-CV? If you had a few new examples that the algorithm had not previously seen, the algorithm doesn't do well on those either, so J-CV would also be high. One characteristic of an algorithm with high bias, something that is underfitting, is that it's not even doing that well on the training set. So when J-train is high, that gives you a strong indicator that this algorithm has high bias.

Now let's look at the example on the right. If you were to compute J-train, how well is this doing on the training set? It's actually doing great, because it fits the training data extremely well, so J-train here would be low. But if you were to evaluate this model on other houses not in the training set, you'd find that J-CV, the cross-validation error, is quite high. So a characteristic signature that your algorithm has high variance is J-CV being much higher than J-train. In other words, it does much better on data it has seen than on data it has not seen, and this turns out to be a strong indicator that your algorithm has high variance. Again, the point is that by computing J-train and J-CV, and seeing whether J-train is high, or whether J-CV is much higher than J-train, you get a sense of whether your algorithm has high bias or high variance, even when you can't plot the function f.
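To make these quantities concrete, here is a minimal sketch of computing J-train and J-CV for polynomial regression, assuming NumPy and scikit-learn are available. The synthetic data set, the train/CV split, and the helper name `train_and_cv_error` are illustrative choices rather than anything from the video; the errors are halved mean squared errors, matching the (1/2m) Σ (f(x) − y)² cost convention used in this course.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Illustrative 1-D data: quadratic ground truth plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=40).reshape(-1, 1)
y = 1 + x.ravel() ** 2 + rng.normal(0, 0.1, size=40)

# Hold out the last 10 examples as a cross-validation set the model never fits.
x_train, y_train = x[:30], y[:30]
x_cv, y_cv = x[30:], y[30:]

def train_and_cv_error(degree):
    """Fit a polynomial of the given degree; return (J-train, J-CV) as halved MSEs."""
    poly = PolynomialFeatures(degree, include_bias=False)
    X_train = poly.fit_transform(x_train)
    X_cv = poly.transform(x_cv)
    model = LinearRegression().fit(X_train, y_train)
    j_train = mean_squared_error(y_train, model.predict(X_train)) / 2
    j_cv = mean_squared_error(y_cv, model.predict(X_cv)) / 2
    return j_train, j_cv

for d in (1, 2, 4):
    j_train, j_cv = train_and_cv_error(d)
    print(f"degree {d}: J-train = {j_train:.4f}, J-CV = {j_cv:.4f}")
```

With this data, the d = 1 straight line shows the high-bias signature (J-train and J-CV both high relative to the quadratic fit), while J-train keeps shrinking as the degree grows.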
Finally, the case in the middle: if you look at J-train, it's pretty low, since the model is doing quite well on the training set. And if you were to look at a few new examples from, say, your cross-validation set, you'd find that J-CV is also pretty low. J-train not being too high indicates that this doesn't have a high bias problem, and J-CV not being much worse than J-train indicates that it doesn't have a high variance problem either, which is why the quadratic model seems to be a pretty good one for this application. To summarize: when d = 1, for a linear polynomial, J-train was high and J-CV was high; when d = 4, J-train was low but J-CV was high; and when d = 2, both were pretty low.

Let's now take a different view on bias and variance. In particular, on the next slide, I'd like to show you how J-train and J-CV vary as a function of the degree of the polynomial you're fitting. Let me draw a figure whose horizontal axis is the degree of polynomial we're fitting to the data. Over on the left is a small value of d, like d = 1, which corresponds to fitting a straight line; over on the right is, say, d = 4, or even higher values of d, where we're fitting a high-order polynomial.

If you were to plot J-train of w, b as a function of the degree of polynomial (here I'm assuming we're not using regularization), what you'd find is that as you fit a higher and higher-order polynomial, the training error tends to go down. A very simple linear function doesn't fit the training data that well, while a quadratic, third-order, or fourth-order polynomial fits it better and better. So as the degree of polynomial increases, J-train will typically go down.

Next, let's look at J-CV, which measures how well the model does on data that it did not get to fit. What we saw was that when d = 1, when the degree of polynomial was very low, J-CV was pretty high, because the model underfit and so didn't do well on the cross-validation set. On the right, when the degree of polynomial is very large, say 4, the model doesn't do well on the cross-validation set either, so J-CV is also high. But if d is in between, say a second-order polynomial, then it actually does much better. So if you were to vary the degree of polynomial, you'd get a curve that comes down and then goes back up: if the degree of polynomial is too low, the model underfits and doesn't do well on the cross-validation set; if it's too high, it overfits and also doesn't do well on the cross-validation set. It's only somewhere in the middle that things are just right, which is why the second-order polynomial in our example ends up with a lower cross-validation error, and neither high bias nor high variance.
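Continuing the earlier sketch (this reuses the hypothetical `train_and_cv_error` helper and the same data), the loop below sweeps the degree the way the plot on the slide does, so you can watch J-train fall while J-CV traces the U shape just described.

```python
# Sweep the polynomial degree: J-train tends to fall as the degree grows,
# while J-CV comes down and then goes back up
# (underfitting -> just right -> overfitting).
degrees = range(1, 11)
errors = [train_and_cv_error(d) for d in degrees]

for d, (j_train, j_cv) in zip(degrees, errors):
    print(f"d = {d:2d}   J-train = {j_train:.4f}   J-CV = {j_cv:.4f}")

# Pick the degree with the lowest cross-validation error, as in the
# model-selection procedure from the previous video.
best_d = min(degrees, key=lambda d: errors[d - 1][1])
print(f"degree with lowest J-CV: {best_d}")
```

On this synthetic quadratic data, the lowest J-CV typically lands at or near d = 2, matching the "just right" model from the example.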
So to summarize, how do you diagnose bias and variance in your learning algorithm? If your learning algorithm has high bias, or has underfit the data, the key indicator is that J-train is high. That corresponds to the leftmost portion of the curve, where J-train is high, and usually J-train and J-CV will be close to each other. And how do you diagnose high variance? The key indicator for high variance is that J-CV is much greater than J-train. (The double greater-than sign, ≫, means "much greater than": > means greater, and ≫ means much greater.) The rightmost portion of the plot is where J-CV is much greater than J-train; there, J-train will usually be pretty low, but the key indicator is whether J-CV is much greater than J-train. That's what happened when we fit a very high-order polynomial to this small dataset.

Even though we've just seen bias and variance as separate problems, it turns out that in some cases it's possible to simultaneously have high bias and high variance. You won't see this happen that much for linear regression, but if you're training a neural network, there are some applications where, unfortunately, you have both. One way to recognize that situation is that J-train is high, so you're not doing that well on the training set, and even worse, the cross-validation error is much larger still than the training error. The notion of simultaneous high bias and high variance doesn't really arise for linear models applied to one-dimensional data, but to give some intuition about what it looks like: it would be as if, for part of the input, you had a very complicated model that overfits, fitting the training set really well there, while for other parts of the input the model doesn't even fit the training data well, so it underfits there. This example looks artificial because it's a single-feature input, but that's how, in some applications, you can unfortunately end up with both high bias and high variance. The indicator for that is that the algorithm does poorly on the training set, and then does even much worse on the cross-validation set. For most learning applications, you'll primarily have either a high bias or a high variance problem rather than both at the same time, but it is possible to have both.

I know there's a lot to process, and there are a lot of concepts on these slides, but the key takeaways are: high bias means the model isn't doing well even on the training set, and high variance means it does much worse on the cross-validation set than on the training set. Whenever I'm training a machine learning algorithm, I almost always try to figure out to what extent it has a high bias (underfitting) problem versus a high variance (overfitting) problem, and this will give good guidance, as we'll see later this week, on how to improve the algorithm's performance. But first, let's take a look at how regularization affects the bias and variance of a learning algorithm, because that will help you better understand when you should use regularization. Let's take a look at that in the next video.
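As a recap of this video's indicators, here is a small helper that turns them into a rule of thumb. The function name and the numeric thresholds are invented for illustration, not part of the course; in practice, what counts as "high" error is judged relative to a baseline such as human-level performance, which a later video discusses.

```python
def diagnose(j_train, j_cv, high_error=0.5, gap=0.5):
    """Label a model from its training and cross-validation errors.

    `high_error` and `gap` are illustrative thresholds, not values from
    the course; "high" is really judged relative to a baseline level of
    performance on the task.
    """
    high_bias = j_train > high_error        # not doing well even on the training set
    high_variance = (j_cv - j_train) > gap  # much worse on data it has not seen
    if high_bias and high_variance:
        return "high bias AND high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "neither: J-train is low and J-CV is close to it"

print(diagnose(j_train=0.8, j_cv=2.1))  # -> "high bias AND high variance"
```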