So previously, you took a look at the linear regression model, then the cost function, and then the gradient descent algorithm. In this video, we're going to put it all together and use the squared error cost function for the linear regression model with gradient descent. This will allow us to train the linear regression model to fit a straight line to our training data. Let's get to it.

Here's the linear regression model, to the right is the squared error cost function, and below is the gradient descent algorithm. It turns out that if you calculate these derivatives, these are the terms you get. The derivative with respect to w is 1 over m, summed from i equals 1 through m, of the error term, that is, the difference between the predicted and the actual value, times the input feature x i. The derivative with respect to b looks the same, except that it doesn't have that x i term at the end. If you use these formulas to compute the two derivatives and implement gradient descent this way, it will work.

Now, you may be wondering, where did I get these formulas from? They're derived using calculus. If you want to see the full derivation, I'll quickly run through it on the next slide. But if you don't remember or aren't interested in the calculus, don't worry about it. You can skip the material on the next slide entirely, still implement gradient descent, finish this class, and everything will work just fine.

This slide is one of the most mathematical slides of the entire specialization and, again, is completely optional. It shows how to calculate the derivative terms. Let's start with the first term, the derivative of the cost function J with respect to w. We begin by plugging in the definition of the cost function J, which is 1 over 2m times the sum of the squared error terms. Remember also that f of w, b of x i is just w times x i plus b. What we would like to do is compute the derivative, also called the partial derivative, with respect to w of that expression. If you've taken a calculus class before, and it's totally fine if you haven't, you may know that by the rules of calculus, differentiating the squared error brings down a factor of 2, and that 2 cancels with the 2 in the 1 over 2m, leaving us with the equation you saw on the previous slide. This, by the way, is why we defined the cost function with the one-half earlier this week: it makes the partial derivative neater, because it cancels out the 2 that appears from computing the derivative.

The other derivative, with respect to b, is quite similar. I can write it out the same way, once again plugging in the definition of f of x i. By the rules of calculus, the result is the same as before, except that there's no x i at the end. The 2's cancel once more, and you end up with the expression for the derivative with respect to b.

So now you have these two expressions for the derivatives, and you can plug them into the gradient descent algorithm. Here's the gradient descent algorithm for linear regression: you repeatedly carry out these updates to w and b until convergence. Remember that f of x is the linear regression model, so it's equal to w times x plus b.
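To make those formulas concrete, here's a minimal NumPy sketch of how the two derivatives might be computed. The function name compute_gradient and the variable names are illustrative choices for this sketch, not something defined in the lecture.

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Compute dJ/dw and dJ/db for linear regression with the squared error cost.

    x, y are 1-D arrays holding the m training examples; w and b are the current parameters.
    """
    m = x.shape[0]
    f_wb = w * x + b                         # predictions f_wb(x^(i)) = w * x^(i) + b
    error = f_wb - y                         # predicted value minus actual value
    dj_dw = (1.0 / m) * np.sum(error * x)    # (1/m) * sum of error * x^(i)
    dj_db = (1.0 / m) * np.sum(error)        # (1/m) * sum of error
    return dj_dw, dj_db
```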
In those update rules, the first expression is the derivative of the cost function with respect to w, and the second is the derivative of the cost function with respect to b. And just as a reminder, you want to update w and b simultaneously on each step.

Now let's get familiar with how gradient descent works. One issue we saw with gradient descent is that it can lead to a local minimum instead of the global minimum, where the global minimum means the point with the lowest possible value of the cost function J out of all possible points. You may recall the surface plot that looks like an outdoor park with a few hills, with the grass and the birds, a relaxing hilly landscape. That function has more than one local minimum, and depending on where you initialize the parameters w and b, you can end up at different local minima; you can end up here, or you can end up here.

But it turns out that when you're using the squared error cost function with linear regression, the cost function does not and will never have multiple local minima. It has a single global minimum because of its bowl shape. The technical term for this is that the cost function is a convex function. Informally, a convex function is a bowl-shaped function that cannot have any local minima other than the single global minimum. When you implement gradient descent on a convex function, one nice property is that so long as your learning rate is chosen appropriately, it will always converge to the global minimum.

Congratulations! You now know how to implement gradient descent for linear regression. We have just one last video for this week. In that video, we'll see this algorithm in action. Let's go to that last video.
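If you'd like to see the pieces assembled in code before then, here's one possible end-to-end sketch of batch gradient descent for linear regression in NumPy. The dataset, learning rate, and number of iterations are made up purely for illustration; they are not from the course.

```python
import numpy as np

# Tiny made-up dataset for this sketch: y is roughly 2*x + 1
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.1, 6.9, 9.2])

def gradient_descent(x, y, w_init, b_init, alpha, num_iters):
    """Run batch gradient descent for linear regression with the squared error cost."""
    w, b = w_init, b_init
    m = x.shape[0]
    for _ in range(num_iters):
        error = (w * x + b) - y                  # f_wb(x^(i)) - y^(i)
        dj_dw = (1.0 / m) * np.sum(error * x)    # derivative with respect to w
        dj_db = (1.0 / m) * np.sum(error)        # derivative with respect to b
        # Simultaneous update: both derivatives are computed before either parameter changes
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

w, b = gradient_descent(x_train, y_train, w_init=0.0, b_init=0.0, alpha=0.01, num_iters=10000)
print(f"w = {w:.2f}, b = {b:.2f}")  # converges toward roughly w = 2, b = 1
```

Because the squared error cost for linear regression is convex, this loop approaches the single global minimum from any initialization, provided the learning rate alpha is small enough.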