Now you've seen a couple of different learning algorithms, linear regression and logistic regression. They work well for many tasks, but sometimes, in an application, the algorithm can run into a problem called overfitting, which can cause it to perform poorly. What I'd like to do in this video is show you what overfitting is, as well as a closely related, almost opposite problem called underfitting. And in the next videos after this, I'll share with you some techniques for addressing overfitting. In particular, there's a method called regularization. It's a very useful technique, and I use it all the time. Regularization will help you minimize this overfitting problem and get your learning algorithms to work much better. So let's take a look at what overfitting is. To help us understand it, let's take a look at a few examples. Let's go back to our original running example of predicting housing prices with linear regression, where you want to predict the price as a function of the size of a house. Suppose your data set looks like this, with the input feature x being the size of the house and the value y that you're trying to predict being the price of the house. One thing you could do is fit a linear function to this data, and if you do that, you get a straight line fit to the data that maybe looks like this. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices kind of flatten out. So this algorithm does not fit the training data very well. The technical term for this is that the model is underfitting the training data. Another term is that the algorithm has high bias. 
You may have read in the news about some learning algorithms, really unfortunately, demonstrating bias against certain ethnicities or certain genders. In machine learning, the term bias has multiple meanings. Checking learning algorithms for bias based on characteristics such as gender or ethnicity is absolutely critical. But the term bias has a second technical meaning as well, which is the one I'm using here: the algorithm has underfit the data, meaning that it's not even able to fit the training set that well, and there's a clear pattern in the training data that the algorithm is just unable to capture. Another way to think of this form of bias is as if the learning algorithm has a very strong preconception, or we say a very strong bias, that the housing prices are going to be a completely linear function of the size, despite data to the contrary. So this preconception that the data is linear causes it to fit a straight line that fits the data poorly, leading it to underfit the data. Now let's look at a second variation of a model, which is if you instead fit a quadratic function to the data with two features, x and x squared. Then when you fit the parameters w1 and w2, you can get a curve that fits the data somewhat better. Maybe it looks like this. So if you were to get a new house that's not in this set of five training examples, this model would probably do quite well on that new house. So if you're a real estate agent, you want your learning algorithm to do well even on examples that are not in the training set. That idea is called generalization. Technically, we say that you want your learning algorithm to generalize well, which means to make good predictions even on brand new examples that it has never seen before. So this quadratic model seems to fit the training set not perfectly, but pretty well, and I think it would generalize well to new examples. Now let's look at the other extreme. 
What if you were to fit a fourth order polynomial to the data? So you have x, x squared, x cubed, and x to the fourth, all as features. With this fourth order polynomial, you can actually fit a curve that passes through all five of the training examples exactly, and you might get a curve that looks like this. This on one hand seems to do an extremely good job fitting the training data because it passes through all of the training data perfectly. In fact, you'll be able to choose parameters that will result in the cost function being exactly equal to zero because the errors are zero on all five training examples. But this is a very wiggly curve. It's going up and down all over the place. And if you have this house size right here, the model would predict that this house is cheaper than houses that are smaller than it. So we don't think that this is a particularly good model for predicting housing prices. The technical term is that we'll say this model has overfit the data or this model has an overfitting problem because even though it fits the training set very well, it has fit the data almost too well, hence it's overfit. And it does not look like this model will generalize to new examples that it has never seen before. Another term for this is that the algorithm has high variance. In machine learning, many people will use the terms overfit and high variance almost interchangeably and will use the terms underfit and high bias almost interchangeably. The intuition behind overfitting or high variance is that the algorithm is trying very, very hard to fit every single training example. And it turns out that if your training set were just even a little bit different, say one house was priced just a little bit more, a little bit less, then the function that the algorithm fits could end up being totally different. 
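To make the three fits concrete, here is a minimal sketch in Python with NumPy. The five (size, price) pairs, the helper names `fit_poly` and `cost`, the bump of 20 on one price, and the query size `x_new` are all made-up assumptions for illustration, not values from the video. It fits polynomials of degree 1, 2, and 4 by least squares, reports the squared-error cost on the training set, and then nudges one training price to see how far each model's prediction moves for a house slightly larger than any in the training set.

```python
import numpy as np

# Hypothetical training set (not from the video):
# house size in 1000s of square feet, price in $1000s.
x_train = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y_train = np.array([200.0, 330.0, 350.0, 420.0, 410.0])


def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns a callable model f(x)."""
    return np.poly1d(np.polyfit(x, y, degree))


def cost(model, x, y):
    """Squared-error cost J = 1/(2m) * sum((f(x) - y)^2)."""
    return np.sum((model(x) - y) ** 2) / (2 * len(x))


degrees = (1, 2, 4)

# Training cost for the straight line, the quadratic, and the
# fourth-order polynomial. With five examples, degree 4 has enough
# parameters to pass through every point, so its cost is ~0.
train_cost = {
    d: cost(fit_poly(x_train, y_train, d), x_train, y_train) for d in degrees
}

# Make the training set "just a little bit different":
# raise one price by 20 (an arbitrary choice) and refit.
y_shifted = y_train.copy()
y_shifted[2] += 20.0

# Compare predictions at a size just beyond the largest training house.
x_new = 3.25
shift = {
    d: abs(fit_poly(x_train, y_shifted, d)(x_new)
           - fit_poly(x_train, y_train, d)(x_new))
    for d in degrees
}

for d in degrees:
    print(f"degree {d}: training cost = {train_cost[d]:.4f}, "
          f"prediction change at x_new = {shift[d]:.2f}")
```

On this toy data the fourth-order fit drives the training cost to essentially zero, yet its prediction at `x_new` moves far more than the other two when a single price changes. That sensitivity is what high variance refers to. The exact numbers depend entirely on the assumed data.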
So if two different machine learning engineers were to fit this fourth-order polynomial model to just slightly different data sets, they could end up with totally different predictions or highly variable predictions. And that's why we say the algorithm has high variance. Contrasting this rightmost model with the one in the middle for the same house, it seems the middle model gives the much more reasonable prediction for price. There isn't really a name for this case in the middle, but I'm going to call it just right because it is neither underfit nor overfit. So we can say that the goal of machine learning is to find a model that hopefully is neither underfitting nor overfitting. In other words, hopefully a model that has neither high bias nor high variance. When I think about underfitting and overfitting, high bias and high variance, I'm sometimes reminded of the children's story of Goldilocks and the Three Bears. In this children's tale, a girl called Goldilocks visits the home of a bear family. There's a bowl of porridge that's too cold to taste, and so that's no good. There's also a bowl of porridge that's too hot to eat, so that's no good either. But there's a bowl of porridge that is neither too cold nor too hot. The temperature is in the middle, which is just right to eat. So to recap, if you have too many features, like the fourth-order polynomial on the right, then the model may fit the training set well, but almost too well, so it overfits and has high variance. On the flip side, if you have too few features, like the example on the left, then the model underfits and has high bias. And in this example, using quadratic features, x and x squared, seems to be just right. So far, we've looked at underfitting and overfitting for a linear regression model. Similarly, overfitting applies to classification as well. 
Here's a classification example with two features, x1 and x2, where x1 is maybe the tumor size and x2 is the age of the patient, and we're trying to classify if a tumor is malignant or benign, as denoted by these crosses and circles. One thing you can do is fit a logistic regression model, just a simple model, like this, where as usual, g is the sigmoid function, and this term here inside is z. So if you do that, you end up with a straight line as the decision boundary. This is the line where z is equal to zero, that separates the positive and negative examples. This straight line doesn't look terrible, it looks kind of okay, but it doesn't look like a very good fit to the data either. So this is an example of underfitting or of high bias. Let's look at another example. If you were to add to your features these quadratic terms, then z becomes this new term in the middle, and the decision boundary, that is, where z equals zero, can look more like this, more like an ellipse, or part of an ellipse. And this is a pretty good fit to the data, even though it does not perfectly classify every single training example in the training set. Notice how some of these crosses get classified among the circles, but this model looks pretty good. I'm going to call it just right, and it looks like this will generalize pretty well to new patients. And finally, at the other extreme, if you were to fit a very high-order polynomial with many, many features like these, then the model may try really hard and contort or twist itself to find a decision boundary that fits your training data perfectly. Having all these higher-order polynomial features allows the algorithm to choose this really overly complex decision boundary. If the features are tumor size and age, and you're trying to classify tumors as malignant or benign, then this doesn't really look like a very good model for making predictions. 
So once again, this is an instance of overfitting and high variance, because this model, despite doing very well in the training set, doesn't look like it will generalize well to new examples. So now you've seen how an algorithm can underfit or have high bias, or overfit and have high variance. You may want to know how you can get a model that is just right. In the next video, we'll look at some ways you can address the issue of overfitting, and we'll also touch on some ideas relevant for addressing underfitting. Let's go on to the next video.
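As a rough illustration of the decision boundaries from the classification example, here is a small Python sketch. The parameter values are hand-picked assumptions, not fitted to any data, and the features are assumed to be rescaled to small numbers. The function names `z_linear`, `z_quadratic`, and `predict` are hypothetical. It only shows how the boundary z = 0 changes from a straight line to an ellipse once quadratic terms are added.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def z_linear(x1, x2):
    # Assumed parameters w1 = 1, w2 = 1, b = -3.
    # The boundary z = 0 is the straight line x1 + x2 = 3.
    return 1.0 * x1 + 1.0 * x2 - 3.0


def z_quadratic(x1, x2):
    # Assumed parameters on the quadratic features x1^2 and x2^2.
    # The boundary z = 0 is the ellipse x1^2 / 9 + x2^2 / 4 = 1.
    return (x1 ** 2) / 9.0 + (x2 ** 2) / 4.0 - 1.0


def predict(z):
    """Predict 1 when g(z) >= 0.5, which happens exactly when z >= 0."""
    return (sigmoid(z) >= 0.5).astype(int)


# Three hypothetical patients as (x1, x2) pairs in rescaled units.
x1 = np.array([1.0, 2.5, 3.0])
x2 = np.array([1.0, 1.0, 2.0])

pred_linear = predict(z_linear(x1, x2))
pred_quadratic = predict(z_quadratic(x1, x2))

print("linear boundary:   ", pred_linear)
print("quadratic boundary:", pred_quadratic)
```

The middle point falls on different sides of the two boundaries, which is the sense in which richer features let the model draw a different, more flexible boundary. Fitting the parameters to real data, and adding many higher-order terms, is what would produce the contorted boundary described for the overfit case.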
