Learning curves are a way to help understand how your learning algorithm is doing as a function of the amount of experience it has, where by experience I mean, for example, the number of training examples it has. Let's take a look. Let me plot learning curves for a model that fits a second-order polynomial, a quadratic function, and I'm going to plot both J_cv, the cross-validation error, and J_train, the training error. On this figure, the horizontal axis is m_train, the training set size, that is, the number of examples the algorithm can learn from, and the vertical axis is the error, meaning either J_cv or J_train.

Let's start by plotting the cross-validation error, J_cv(w,b). It looks something like this, and it's maybe no surprise that as m_train, the training set size, gets bigger, you learn a better model, so the cross-validation error goes down. Now let's plot J_train(w,b), the training error, as the training set size gets bigger. It turns out that as the training set size gets bigger, the training error actually increases. Let's take a look at why this is the case.

Start with just a single training example. If you fit a quadratic model to it, you can easily fit a straight line or a curve, and your training error will be zero. With two training examples, you can again fit the data and achieve zero training error. In fact, with three training examples, the quadratic function can still fit them very well and get pretty much zero training error. But if your training set gets a little bigger, say four training examples, it gets a little harder to fit all four examples perfectly; you may get a curve that fits pretty well but is a little bit off in a few places here and there, so the training error goes up a little. With five training examples it gets even a little harder to fit them all perfectly, and with an even larger training set it just gets harder and harder to fit every single training example exactly.

So to recap: when you have a very small number of training examples, like one, two, or even three, it's relatively easy to get zero or very small training error. But with a larger training set it's harder for a quadratic function to fit all the training examples perfectly, which is why the training error increases as the training set gets bigger. Notice one other thing about these curves: the cross-validation error will typically be higher than the training error, because the parameters are fit to the training set, so you expect to do at least a little better, or when m_train is small maybe even a lot better, on the training set than on the cross-validation set.

Let's now take a look at what the learning curves look like for an algorithm with high bias versus one with high variance. We'll start with the high bias, or underfitting, case. Recall that an example of high bias would be fitting a linear function to data that clearly follows a curve.
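To make this concrete, here is a minimal sketch of how you might compute such a learning curve yourself. It isn't from the lecture: the synthetic quadratic dataset, the fixed cross-validation split, and the list of training set sizes are all illustrative assumptions, and it assumes NumPy and scikit-learn are available.

```python
# Minimal sketch (illustrative, not the lecture's code) of a learning curve for a
# quadratic model: train on increasingly large subsets, record J_train and J_cv.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(1000, 1))
y = 1.5 * X[:, 0] ** 2 - 2.0 * X[:, 0] + rng.normal(0.0, 3.0, size=1000)

# Hold out a fixed cross-validation set; the rest is the pool we draw training sets from.
X_cv, y_cv = X[:200], y[:200]
X_pool, y_pool = X[200:], y[200:]

poly = PolynomialFeatures(degree=2, include_bias=False)

train_sizes = [1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 400, 800]
j_train, j_cv = [], []
for m_train in train_sizes:
    X_sub, y_sub = X_pool[:m_train], y_pool[:m_train]
    X_sub_p = poly.fit_transform(X_sub)
    model = LinearRegression().fit(X_sub_p, y_sub)
    # Training error is measured on the m_train examples the model was fit to,
    # cross-validation error on the held-out set. Dividing the MSE by 2 matches
    # the course convention J = (1 / (2m)) * sum of squared errors.
    j_train.append(mean_squared_error(y_sub, model.predict(X_sub_p)) / 2)
    j_cv.append(mean_squared_error(y_cv, model.predict(poly.transform(X_cv))) / 2)

for m, jt, jc in zip(train_sizes, j_train, j_cv):
    print(f"m_train={m:4d}  J_train={jt:8.3f}  J_cv={jc:8.3f}")
```

Running something like this, you'd expect J_train to be essentially zero for one, two, or three examples and then creep upward, while J_cv starts high and comes down as m_train grows.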
If you were to plot the training error, it will go up, as you'd expect, and in fact this curve may start to flatten out, or plateau, after a while. That's because as you get more and more training examples while fitting this simple linear function, your model doesn't actually change that much. It's fitting a straight line, and even as you get more and more examples, there's just not that much more to change, which is why the average training error flattens out after a while. Similarly, your cross-validation error will come down and also flatten out after a while; J_cv is again higher than J_train, but it tends to look like that. Beyond a certain point, even as you get more and more examples, not much changes about the straight line you're fitting. It's just too simple a model to be fitting to this much data, which is why both of these curves, J_cv and J_train, tend to flatten out.

If you have a measure of a baseline level of performance, such as human-level performance, it will tend to be a value lower than both J_train and J_cv. So human-level performance may look like this, and there's a big gap between the baseline level of performance and J_train, which was our indicator that this algorithm has high bias. That is, one could hope to do much better if only we could fit a more complex function than just a straight line.

Now, one interesting thing about this plot is you can ask: what do you think will happen if you could have a much bigger training set, extending m_train even further to the right of this plot? Well, if you were to extend both of these curves to the right, they'd both flatten out, and both of them will probably just continue to be flat. No matter how far you extend the plot, these two curves never find a way to dip down to the human-level performance; they just keep on being flat pretty much forever, no matter how large the training set gets. That gives the conclusion, maybe a little bit surprising, that if a learning algorithm has high bias, getting more training data will not, by itself, help that much. I know we're used to thinking that more data is good, but if your algorithm has high bias, then if the only thing you do is throw more training data at it, that by itself will not bring down the error rate much. And it's because of this, really: no matter how many more examples you add to this figure, the straight-line fit just isn't going to get that much better. That's why, before investing a lot of effort into collecting more training data, it's worth checking whether your learning algorithm has high bias, because if it does, you probably need to do something other than just throw more training data at it.
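As a rough numerical illustration of the gap just described, the numbers below are made up (not from the lecture) but show the high-bias pattern: both errors plateau well above the baseline, and the gap between J_train and the baseline is the part more data cannot fix.

```python
# Hypothetical plateaued errors for a linear model fit to curved data.
baseline = 0.10                  # e.g. human-level error on the task (assumed)
j_train, j_cv = 0.45, 0.48       # training and cross-validation errors after plateauing

bias_gap = j_train - baseline    # large -> high bias (underfitting); more data alone won't help
variance_gap = j_cv - j_train    # small -> variance is not the main problem here
print(f"bias gap = {bias_gap:.2f}, variance gap = {variance_gap:.2f}")
```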
Let's now take a look at what the learning curve looks like for a learning algorithm with high variance. You might remember that if you were to fit a fourth-order polynomial with a small lambda, or even lambda equal to zero, you get a curve that fits the training data very well but doesn't generalize. In this high variance scenario, J_train will go up as the training set size increases, so you get a curve that looks like this, and J_cv will be much higher: your cross-validation error is much higher than your training error. The fact that there's a huge gap here is what tells you there's high variance; the model is doing much better on the training set than on the cross-validation set. If you were to plot a baseline level of performance, such as human-level performance, you may find that J_train can sometimes even be lower than human-level performance, or human-level performance may be a little lower than J_train. When you're overfitting the training set, you may be able to fit it so well that you get an unrealistically low error, such as zero error in this example, which is actually better than how well humans would be able to predict housing prices or whatever application you're working on. But again, the signal for high variance is J_cv being much higher than J_train.

When you have high variance, increasing the training set size could help a lot. In particular, if we could extrapolate these curves to the right by increasing m_train, the training error will continue to go up, but the cross-validation error will hopefully come down and approach J_train. So in this scenario it might be possible, just by increasing the training set size, to lower the cross-validation error and get your algorithm to perform better and better. This is unlike the high bias case, where if the only thing you do is get more training data, that won't actually help your learning algorithm's performance much. To summarize: if a learning algorithm suffers from high variance, getting more training data is indeed likely to help, because extrapolating to the right of this curve, you can expect J_cv to keep coming down. In this example, just getting more training data lets the algorithm go from a relatively high cross-validation error to much closer to human-level performance. You can see that if you were to add a lot more training examples and continue to fit a fourth-order polynomial, you can get a better fourth-order polynomial fit to this data than the very wiggly curve up on top.

So if you're building a machine learning application, you could plot the learning curve if you want. That is, you can take different subsets of your training set: even if you have, say, 1,000 training examples, you could train a model on just 100 training examples and look at the training error and the cross-validation error, then train a model on 200 examples, holding out 800 examples and not using them for now, plot J_train and J_cv, and so on, repeating to plot out what the learning curve looks like (see the sketch below). If you visualize it that way, that can be another way to see whether your learning curve looks more like a high-bias or a high-variance one. One downside of plotting learning curves like this is that it is computationally quite expensive to train so many different models on different-sized subsets of your training set, so in practice it isn't done that often. But nonetheless, I find that having this mental visual picture of what the learning curve looks like sometimes helps me think through what my learning algorithm is doing and whether it has high bias or high variance.
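One way to carry out this subset-training procedure, assuming scikit-learn is available, is its learning_curve utility, which retrains a model on increasingly large subsets and reports training and cross-validation scores for each size. The synthetic data, the degree-4 pipeline, and the subset sizes below are illustrative assumptions, not the lecture's exact setup.

```python
# Sketch: plot J_train and J_cv versus training set size using scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(1000, 1))             # stand-in for 1,000 training examples
y = np.sin(1.5 * X[:, 0]) + rng.normal(0.0, 0.2, size=1000)

# Fourth-order polynomial with a small regularization parameter (lambda).
model = make_pipeline(PolynomialFeatures(degree=4, include_bias=False), Ridge(alpha=1e-3))

# Train on 100, 200, ..., 800 examples; with 5-fold CV, each run's training
# split has at most 800 of the 1,000 examples, and the rest are held out.
sizes, train_scores, cv_scores = learning_curve(
    model, X, y,
    train_sizes=np.arange(100, 801, 100),
    cv=5,
    scoring="neg_mean_squared_error",
)
j_train = -train_scores.mean(axis=1) / 2              # /2 matches the course's J convention
j_cv = -cv_scores.mean(axis=1) / 2

plt.plot(sizes, j_train, label="J_train")
plt.plot(sizes, j_cv, label="J_cv")
plt.xlabel("m_train (training set size)")
plt.ylabel("error")
plt.legend()
plt.show()
```

The shape of the two curves then tells you which regime you're in: both plateauing well above any baseline suggests high bias, while a large gap between J_cv and J_train that narrows as m_train grows suggests high variance.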
So I know we've gone through a lot about bias and variance. Let's go back to our earlier example: if you've trained a model for housing price prediction, how do bias and variance help you decide what to do next? I hope that example will now make a lot more sense to you. Let's do that in the next video.