Your subscription plan will change at the end of your current billing period. Youโll continue to have access to your current plan until then.
Welcome back!
Hi ,
We'd like to know you better so we can create more relevant courses. What do you do for work?
Course Syllabus
You've achieved today's streak!
Complete one lesson every day to keep the streak going.
Su
Mo
Tu
We
Th
Fr
Sa
You earned a Free Pass!
Free Passes help protect your daily streak. Complete more lessons to earn up to 3 Free Passes.
Elevate Your Career with Full Learning Experience
Unlock Plus AI learning and gain exclusive insights from industry leaders
Access exclusive features like graded notebooks and quizzes
Earn unlimited certificates to enhance your resume
Starting at $1 USD/mo after a free trial โ cancel anytime
You've seen the logistic regression model, you've seen the loss function that measures how well you're doing on a single training example. You've also seen the cost function that measures how well your parameters W and B are doing on your entire training set. Now let's talk about how you can use the gradient descent algorithm to train or to learn the parameters W and B on your training set. To recap, here is the familiar logistic regression algorithm, and we have on the second line the cost function, J, which is a function of your parameters W and B, and that's defined as the average, so it's 1 over M times the sum of this loss function. And so the loss function measures how well your algorithm's outputs, Y hat i on each of the training examples, stacks up or compares to the ground truth label, Yi, on each of the training examples, and the full formula is expanded out on the right. So the cost function measures how well your parameters W and B are doing on the training set. So in order to learn a set of parameters W and B, it seems natural that we want to find W and B that make the cost function J of W comma B as small as possible. So here's an illustration of gradient descent. In this diagram, the horizontal axes represent your space of parameters, W and B. In practice, W can be much higher dimensional, but for the purposes of plotting, let's illustrate W as a single real number and B as a single real number. The cost function, J of WB, is then some surface above these horizontal axes, W and B, so the height of the surface represents the value of J comma B at a certain point. And what we want to do is really to find the value of W and B that corresponds to the minimum of the cost function, J. It turns out that this particular cost function, J, is a convex function, so it's just a single big bowl. So this is a convex function, and this is as opposed to functions that look like this, which are non-convex and has lots of different local optima. So the fact that our cost function, J of WB, as defined here, is convex is one of the huge reasons why we use this particular cost function, J, for logistic regression. So to find a good value for the parameters, what we'll do is initialize W and B to some initial value, maybe denoted by that little red dot. And for logistic regression, almost any initialization method works. Usually, you initialize the values to zero. Random initialization also works, but people don't usually do that for logistic regression. But because this function is convex, no matter where you initialize, you should get to the same point, or roughly the same point. And what gradient descent does is it starts at that initial point and then takes a step in the steepest downhill direction. So after one step of gradient descent, you might end up there, because it's trying to take a step downhill in the direction of steepest descent, or as quickly downhill as possible. So that's one iteration of gradient descent. And after two iterations of gradient descent, you might step there, three iterations, and so on. I guess this is now hidden by the back of the plot until eventually, hopefully you converge to this global optimum, or get to something close to the global optimum. So this picture illustrates the gradient descent algorithm. Let's write out a bit more of the details. For the purpose of illustration, let's say that there's some function J of W that you want to minimize, and maybe that function looks like this. To make this easier to draw, I'm going to ignore B for now, just to make this a one-dimensional plot, instead of a higher-dimensional plot. So gradient descent does this. We're going to repeatedly carry out the following update. We're going to take the value of W and update it, going to use colon equals to represent updating W. So set W to W minus alpha times, and this is a derivative, D of J W, D W. And we'll repeatedly do that until the algorithm converges. So a couple points in the notation, alpha here is the learning rate, and controls how big a step we take on each iteration of gradient descent. We'll talk later about some ways for choosing the learning rate alpha. And second, this quantity here, this is a derivative, this is basically the update of the change you want to make to the parameters W. When we start to write code to implement gradient descent, we're going to use the convention that the variable name in our code, D W, will be used to represent this derivative term. So when you write code, you write something like W equals, colon equals, W minus alpha times D W. So we'll use D W to be the variable name to represent this derivative term. Now let's just make sure that this gradient descent update makes sense. Let's say that W was over here, so you're at this point on the cost function J of W. Remember that the definition of a derivative is the slope of a function at the point. So the slope of a function is really, you know, the height divided by the width, right, of a little triangle here, this tangent to J of W at that point. And so here, the derivative is positive, W gets updated as W minus a learning rate times the derivative. The derivative is positive, and so you end up subtracting from W, so you end up taking a step to the left. And so gradient descent would, you know, make your algorithm slowly decrease the parameter if you had started off with this large value of W. As another example, if W was over here, then at this point, the slope here of D J, D W will be negative, and so the gradient descent update would subtract alpha times a negative number, and so end up slowly increasing W, so you end up making W bigger and bigger with successive iterations of gradient descent, so that hopefully, whether you initialize on the left or on the right, gradient descent will move you toward this global minimum here. If you're not familiar with derivatives or with calculus and what this term D J of W, D W means, don't worry too much about it. We'll talk some more about derivatives in the next video. If you have a deep knowledge of calculus, you might be able to have a deeper intuitions about how neural networks work, but even if you're not that familiar with calculus, in the next few videos, we'll give you enough intuitions about derivatives and about calculus that you'll be able to effectively use neural networks. But the overall intuition for now is that this term represents the slope of the function, and we want to know the slope of the function at the current setting of the parameters, so that we can take these steps of steepest descent, so that we know what direction to step in, in order to go downhill on the cost function J. So, we wrote our gradient descent for J of W, if only W was your parameter. In logistic regression, your cost function is a function of both W and B, so in that case, the inner loop of gradient descent, that is this thing here, the thing you have to repeat, becomes as follows. You end up updating W as W minus the learning rate times the derivative of J of W B with respect to W, and you update B as B minus the learning rate times the derivative of the cost function with respect to B. So, these two equations at the bottom are the actual update you implement. As an aside, I just want to mention one notational convention in calculus that is a bit confusing to some people. I don't think it's super important that you understand calculus, but in case you see this, I want to make sure that you don't think too much of this, which is that in calculus, this term here, we actually write as follows, with that funny squiggle symbol. So this symbol, this is actually just a lower case d in a fancy font, in a stylized font, but when you see this expression, all this means is this is the derivative of J of W comma B, or really the slope of the function J of W comma B, how much that function slopes in the W direction. And the rule of the notation in calculus, which I think isn't totally logical, but the rule in the notation for calculus, which I think just makes things much more complicated than they need to be, is that if J is a function of two or more variables, then instead of using lower case d, you use this funny symbol. This is called a partial derivative symbol. But don't worry about this, and if J is a function of only one variable, then you use lower case d. So the only difference between whether you use this funny partial derivative symbol or lower case d, as we did on top, is whether J is a function of two or more variables, in which case you use this symbol, the partial derivative symbol, or if J is only a function of one variable, then you use lower case d. This is one of those funny rules of notation in calculus that I think just makes things more complicated than they need to be. But if you see this partial derivative symbol, all it means is you're measuring the slope of the function with respect to one of the variables. And similarly, to adhere to the formally correct mathematical notation in calculus, because here J has two inputs, not just one, this thing at the bottom should be written with this partial derivative symbol. But it really means the same thing as, you know, almost the same thing as lower case d. Finally, when you implement this in code, we're going to use the convention that this quantity, really the amount by which you update w, will denote as the variable dw in your code. And this quantity, right, the amount by which you want to update b, will denote by the variable db in your code. Alright, so that's how you can implement gradient descent. Now, if you haven't seen calculus for a few years, I know that that might seem like a lot more derivatives in calculus than you might be comfortable with so far. But if you're feeling that way, don't worry about it. In the next video, we'll give you better intuition about derivatives. And even without a deep mathematical understanding of calculus, with just an intuitive understanding of calculus, you will be able to make neural networks work effectively. So with that, let's go on to the next video where we'll talk a little bit more about derivatives.