Remember that the cost function gives you a way to measure how well a specific set of parameters fits the training data, and thereby gives you a way to try to choose better parameters. In this video, we'll look at how the squared error cost function is not an ideal cost function for logistic regression, and we'll take a look at a different cost function that can help us choose better parameters for logistic regression.

Here's what the training set for a logistic regression model might look like, where each row might correspond to a patient that was paying a visit to the doctor and wound up with some sort of diagnosis. As before, we'll use m to denote the number of training examples. Each training example has one or more features, such as the tumor size, the patient's age, and so on, for a total of n features. Let's call the features x1 through xn. Since this is a binary classification task, the target label y takes on only two values, either 0 or 1. And finally, the logistic regression model is defined by this equation. So the question you want to answer is, given this training set, how can you choose parameters w and b?

Recall that for linear regression, this is the squared error cost function. The only thing I've changed is that I put the one-half inside the summation instead of outside the summation; this will make the math you see later on this slide a little bit simpler. You might remember that in the case of linear regression, where f of x is the linear function w dot x plus b, the cost function looks like this: it is a convex function, a bowl shape or a hammock shape. So gradient descent will look like this, where you take one step, one step, one step, and so on, to converge at the global minimum.

Now, you could try to use the same cost function for logistic regression, but it turns out that if I were to write f of x equals 1 over 1 plus e to the negative of w dot x plus b, and plot the cost function using this value of f of x, then the cost will look like this. This becomes what's called a non-convex cost function. It's not convex. What this means is that if you were to try to use gradient descent, there are lots of local minima that you can get stuck in. So it turns out that for logistic regression, this squared error cost function is not a good choice. Instead, there will be a different cost function that can make the overall cost convex again, so that gradient descent can be guaranteed to converge to the global minimum.

In order to build a new cost function, one that we'll use for logistic regression, I'm going to change a little bit the definition of the cost function J of w and b. In particular, if you look inside this summation, let's call this term inside the loss on a single training example. I'm going to denote the loss via this capital L, and it is a function of the prediction of the learning algorithm, f of x, as well as of the true label, y. So the loss, given the predicted f of x and the true label y, is equal, in this case, to one-half of the squared difference. We'll see shortly that by choosing a different form for this loss function, we'll be able to keep the overall cost function, which is 1 over m times the sum of these losses, a convex function. The loss function takes as input f of x and the true label y and tells us how well we're doing on that example.
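To make this concrete, here is a minimal Python sketch of what this part of the video describes: computing the squared error cost when f of x is the sigmoid of w dot x plus b. The names here (sigmoid, squared_error_cost, X, y, w, b) are illustrative placeholders, not taken from the course's lab code; sweeping w and b over a grid and plotting this cost is what produces the non-convex, wiggly surface mentioned above.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into the interval (0, 1).
    return 1 / (1 + np.exp(-z))

def squared_error_cost(X, y, w, b):
    # Squared error cost with the logistic model f(x) = sigmoid(w . x + b).
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1},
    # w: (n,) weights, b: scalar bias.
    m = X.shape[0]
    f = sigmoid(X @ w + b)                  # predictions for all m examples
    return np.sum(0.5 * (f - y) ** 2) / m   # note the one-half inside the sum
```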
I'm going to just write down here the definition of the loss function we'll use for logistic regression. If the label y is equal to 1, then the loss is negative log of f of x. And if the label y is equal to 0, then the loss is negative log of 1 minus f of x. Let's take a look at why this loss function hopefully makes sense.

Let's first consider the case of y equals 1 and plot what this function looks like to gain some intuition about what this loss function is doing. Remember, the loss function measures how well you're doing on one training example, and it's by summing up the losses on all of the training examples that you then get the cost function, which measures how well you're doing on the entire training set. If you plot log of f, it looks like this curve here, where f is on the horizontal axis. A plot of negative log of f looks like this, where we just flip the curve about the horizontal axis. Notice that it intersects the horizontal axis at f equals 1 and continues downward from there. Now, f is the output of logistic regression, so f is always between 0 and 1. The only part of the curve that's relevant is therefore this part over here, corresponding to f between 0 and 1. So let's zoom in and take a closer look at this part of the graph.

If the algorithm predicts a probability close to 1, and the true label is 1, then the loss is very small; it's pretty much 0, because you're very close to the right answer. Continuing with the example of the true label y being 1, say it really is a malignant tumor: if the algorithm predicts 0.5, then the loss is at this point here, which is a bit higher, but not that high. Whereas in contrast, if the algorithm were to have output 0.1, meaning it thinks there's only a 10% chance of the tumor being malignant, but y really is 1, it really is malignant, then the loss is this much higher value over here. So when y is equal to 1, the loss function incentivizes, or nudges, or helps push the algorithm to make more accurate predictions, because the loss is lowest when it predicts values close to 1.

So far, we've been looking at what the loss is when y is equal to 1. Now let's look at the second part of the loss function, corresponding to when y is equal to 0. In this case, the loss is negative log of 1 minus f of x. When this function is plotted, it looks like this. The range of f is limited to 0 to 1, because logistic regression only outputs values between 0 and 1. And if we zoom in, this is what it looks like. In this plot, corresponding to y equals 0, the vertical axis shows the value of the loss for different values of f of x. When f is 0, or very close to 0, the loss is also going to be very small, which means that if the true label is 0 and the model's prediction is very close to 0, well, you nearly got it right, so the loss is appropriately very close to 0. And the larger the value of f of x gets, the bigger the loss, because the prediction is further from the true label 0. In fact, as that prediction approaches 1, the loss actually approaches infinity. Going back to the tumor prediction example, this says that if a model predicts that the patient's tumor is almost certain to be malignant, say a 99.9% chance of malignancy, but it turns out to actually not be malignant, so y equals 0, then we penalize the model with a very high loss.
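As a rough sketch of the two branches of this loss, here is one way to write it in Python; the function name and arguments are my own hypothetical choices, not from the optional lab.

```python
import numpy as np

def logistic_loss(f_x, y):
    # Loss on a single training example.
    # f_x: the model's predicted probability, strictly between 0 and 1.
    # y:   the true label, either 0 or 1.
    if y == 1:
        return -np.log(f_x)      # near 0 when f_x is close to 1; grows without bound as f_x -> 0
    else:
        return -np.log(1 - f_x)  # near 0 when f_x is close to 0; grows without bound as f_x -> 1
```

For example, with a true label of 1, a prediction of 0.9 gives a loss of about 0.11, a prediction of 0.5 gives about 0.69, and a prediction of 0.1 gives about 2.3, matching the shape of the curve described above.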
So in this case of y equals 0, similar to the case of y equals 1 on the previous slide, the further the prediction f of x is from the true value of y, the higher the loss. In fact, as f of x approaches 1, the loss gets really, really large and in fact approaches infinity. So when the true label is 0, the algorithm is strongly incentivized not to predict something too close to 1.

In this video, you saw why the squared error cost function doesn't work well for logistic regression. We also defined the loss for a single training example and came up with a new definition of the loss function for logistic regression. It turns out that with this choice of loss function, the overall cost function will be convex, and thus you can reliably use gradient descent to take you to the global minimum. Proving that this function is convex is beyond the scope of this course.

You may remember that the cost function is a function of the entire training set, and is therefore the average, or 1 over m times the sum, of the loss function on the individual training examples. So the cost for a certain set of parameters w and b is equal to 1 over m times the sum over all the training examples of the loss on those training examples. If you can find the values of the parameters w and b that minimize this, then you have a pretty good set of values for the parameters w and b for logistic regression.

In the upcoming optional lab, you get to take a look at how the squared error cost function doesn't work very well for classification, because you'll see that its surface plot results in a very wiggly cost surface with many local minima. Then you'll take a look at the new logistic loss function, which, as you can see here, produces a nice and smooth convex surface plot that does not have all those local minima. So please take a look at the code and the plots after this video.

Alright, we've seen a lot in this video. In the next video, let's go back and take the loss function for a single training example, and use that to define the overall cost function for the entire training set. We'll also figure out a simpler way to write out the cost function, which will then later allow us to run gradient descent to find good parameters for logistic regression. Let's go on to the next video.
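As a supplement to the recap above, here is one way the overall cost from this video could be sketched in Python: the average of the per-example losses, with f of x given by the sigmoid of w dot x plus b. Again, the function and variable names are placeholders, not the optional lab's actual code.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, so every prediction stays between 0 and 1.
    return 1 / (1 + np.exp(-z))

def logistic_cost(X, y, w, b):
    # J(w, b): 1/m times the sum of the per-example logistic losses.
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1},
    # w: (n,) weights, b: scalar bias.
    m = X.shape[0]
    f = sigmoid(X @ w + b)                                  # predicted probabilities
    losses = np.where(y == 1, -np.log(f), -np.log(1 - f))   # loss on each example
    return np.sum(losses) / m
```

Minimizing this cost over w and b, for instance with gradient descent as discussed in the video, is what finds a good set of parameters for logistic regression.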