You've seen how, in TensorFlow, you can specify a neural network architecture to compute the output y as a function of the input x, and also specify a cost function, and TensorFlow will then automatically use backpropagation to compute derivatives and use gradient descent or Adam to train the parameters of your neural network. So the backpropagation algorithm, which computes derivatives of your cost function with respect to the parameters, is a key algorithm in neural network learning. But how does it actually work? In this and the next few optional videos, we'll take a look at how backpropagation computes derivatives. These videos are completely optional, and they do go just a little bit into calculus. If you're already familiar with calculus, I hope you enjoy them, but if not, that's totally fine. We'll build up from the very basics of calculus to make sure you have all the intuition you need to understand how backpropagation works. Let's take a look.

I'm going to use a simplified cost function, J of W equals W squared. In general, the cost function is a function of the parameters W and, say, B, but for this simplified example, let's just pretend J of W equals W squared and ignore B. Let's say the value of the parameter W is equal to 3, so J of W will be equal to W squared, or 3 squared, which is 9. Now, if we were to increase W by a tiny amount, say epsilon, which I'm going to set to 0.001, how does the value of J of W change? If we increase W by 0.001, then W becomes 3 plus 0.001, which is 3.001, and so J of W, which is W squared as we defined above, is now 3.001 squared, which is 9.006001. So what we see is that if W goes up by 0.001 (I'm going to use this up arrow here to denote W going up by 0.001, where 0.001 is this small value epsilon), then J of W goes up by roughly 6 times as much, 6 times 0.001. This isn't quite exact; it actually goes up not to 9.006 but to 9.006001. But it turns out that if epsilon were infinitesimally small, and by infinitesimally small I mean very, very, very, very small (epsilon here is pretty small, but it's not infinitesimally small; think 0.0000, lots of 0's, followed by a 1), then this becomes more and more accurate. In this example, what we see is that if W goes up by epsilon, then J goes up by roughly 6 times epsilon. And in calculus, what we would say is that the derivative of J of W with respect to W is equal to 6. All this means is that if W goes up by a tiny little amount, J of W goes up 6 times as much.

What if epsilon were to take on a different value? What if epsilon were 0.002? In that case, W would be 3 plus 0.002, and W squared becomes 3.002 squared, which is 9.012004. So what we conclude is that if W goes up by 0.002, then J of W goes up by roughly 6 times 0.002: it goes up to roughly 9.012, and this 0.012 is 6 times 0.002. Again, it's a little bit off, by this extra 0.000004 here, because epsilon is not quite infinitesimally small. Once again, we see this 6 to 1 ratio between how much W goes up and how much J of W goes up, and that's why the derivative of J of W with respect to W is equal to 6. The smaller epsilon is, the more accurate this becomes. By the way, feel free to pause the video and try this calculation out yourself with other values of epsilon. The key is that so long as epsilon is pretty small, the ratio by which J of W goes up versus the amount by which W goes up should be 6 to 1.
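If you'd like to try this on a computer rather than a calculator, here is a minimal Python sketch (my own illustration, not code from the course) that checks the 6 to 1 ratio numerically for J of W equals W squared at W equals 3:

```python
# Numerically estimate the derivative of J(w) = w**2 at w = 3 by checking
# how much J goes up when w goes up by a small epsilon.
def J(w):
    return w ** 2

w = 3.0
for epsilon in [0.002, 0.001, 0.0001, 0.00001]:
    ratio = (J(w + epsilon) - J(w)) / epsilon
    print(f"epsilon = {epsilon}: J goes up by about {ratio:.4f} times epsilon")
# The ratio gets closer and closer to 6 as epsilon shrinks,
# matching the derivative 2*w = 6 at w = 3.
```

The smaller you make epsilon, the closer the printed ratio gets to 6, which is exactly the informal definition of the derivative described here.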
So feel free to try it out yourself with other values of epsilon and check that this really holds true. This leads us to an informal definition of a derivative: whenever W going up by a tiny amount epsilon causes J of W to go up by K times epsilon, and in our example just now K was equal to 6, then we say that the derivative of J of W with respect to W is equal to K, which was 6 in the example just now. You might remember that when implementing gradient descent, you would repeatedly use this rule to update the parameter WJ, where as usual alpha is the learning rate. So what does gradient descent do? Notice that if the derivative is small, then this update step will make a small change to the parameter WJ, whereas if this derivative term is large, it will result in a big change to the parameter WJ. And this makes sense, because it's essentially saying that if the derivative is small, then changing W doesn't make a big difference to the value of J, so let's not bother making a huge change to WJ. But if the derivative is large, that means even a tiny change to WJ can make a big difference in how much you can decrease the cost function J of W, so in that case let's make a bigger change to WJ, because doing so will make a big difference to how much we can reduce the cost function J.

Let's take a look at a few more examples of derivatives. What you saw in the example just now was that if W equals 3 and J of W equals W squared equals 9, then if W goes up by epsilon, by 0.001, J of W becomes J of 3.001, which is 9.006001. In other words, J has gone up by about 0.006, which is 6 times 0.001, or 6 times epsilon, which is why the derivative of J of W with respect to W is equal to 6. Let's look at what the derivative will be for other values of W. Take W equals 2. In this case, J of W, which is W squared, is now equal to 4, and if W goes up by 0.001, then J of W becomes J of 2.001, which is 4.004001. So J of W has gone up from 4 to this value over here, which is roughly 4 times epsilon bigger than 4, and that's why the derivative is now 4: W going up by epsilon has caused J of W to go up 4 times as much. Again, this extra 0.000001 is there because this isn't quite exact, since epsilon isn't infinitesimally small.

Let's look at another example. What if W were equal to negative 3? J of W, which is W squared, is still equal to 9, because negative 3 squared is 9. If W were to go up by epsilon again, you now have W equals negative 2.999, so that's J of negative 2.999, and the square of negative 2.999 is 8.994001, because W is negative 3 plus 0.001. Notice that J of W has gone down by about 0.006, which is 6 times epsilon. So what we have in this example is that J starts off at 9, but it has now gone down, notice this down arrow here instead of an up arrow, by 6 times epsilon; or equivalently, it has gone up by negative 6 times epsilon. That's why the derivative in this case is equal to negative 6: W going up by epsilon causes J of W to go up by negative 6 times epsilon when epsilon is small. Another way to visualize this is to plot the function J of W. If the horizontal axis is W and the vertical axis is J of W, then when W is equal to 3, J of W is equal to 9; when W is negative 3, it's also equal to 9; and when W is 2, J of W is equal to 4. Let me make an observation that may be relevant if you've taken a calculus class before.
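To make the connection to gradient descent concrete, here's a small Python sketch (again my own illustration, with an arbitrary learning rate, not code from the course) of that update rule applied to J of W equals W squared, whose derivative is 2 times W. Notice how the update gets smaller as the derivative gets smaller:

```python
# Gradient descent on J(w) = w**2, whose derivative is dJ/dw = 2*w.
# alpha (the learning rate) is an arbitrary illustrative choice.
alpha = 0.1
w = 3.0
for step in range(5):
    dJ_dw = 2 * w              # derivative of w**2 with respect to w
    w = w - alpha * dJ_dw      # the gradient descent update rule
    print(f"step {step}: derivative = {dJ_dw:.4f}, updated w = {w:.4f}")
# When the derivative is large, the step is large; as w approaches the
# minimum at 0, the derivative shrinks and so does the update.
```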
But if you haven't, what I say in the next 60 seconds may not make sense, but don't worry about it; you won't need it to fully follow the rest of these videos. If you've taken a class in calculus at some point, you may recognize that the derivative corresponds to the slope of a line that just touches the function J of W at a given point, say, where W equals 3. The slope of this line, which is this height over this width, turns out to be equal to 6 when W equals 3. The slope of this line turns out to be 4 when W equals 2, and the slope of this line turns out to be negative 6 when W equals negative 3. It turns out that in calculus, the slope of these lines corresponds to the derivative of the function. But if you haven't taken a calculus class before and haven't seen this slope concept, don't worry about it.

Now, there's one last observation I want to make before moving on, which is that in all three of these examples, J of W is the same function: J of W is equal to W squared. But the derivative of J of W depends on W. When W is 3, the derivative is 6; when W is 2, the derivative is 4; and when W is negative 3, the derivative is negative 6. It turns out that if you're familiar with calculus, and again, it's totally fine if you're not, calculus allows us to calculate the derivative of J of W with respect to W as 2 times W. In a little bit, I'll show you how you can use Python to compute these derivatives yourself using a nifty Python package called SymPy. But because calculus tells us that the derivative of W squared is 2W, that's why the derivative when W is 3 is 2 times 3, when W is 2 it's 2 times 2, and when W is negative 3 it's 2 times negative 3: this value of W times 2 gives you the derivative.

Let's go through just a few more examples before we wrap up. For these examples, I'm going to set W equals 2. You saw on the last slide that if J of W is W squared, then the derivative, as we said, would be 2 times W, which is 4. So if W goes up by 0.001, this being epsilon, J of W becomes this, and roughly J of W goes up by 4 times epsilon. Let's look at a few other functions. What if J of W is equal to W cubed? In this case, W cubed, 2 cubed, would be equal to 8. Or what if J of W is just equal to W? Here, W is equal to 2. Or what if J of W were 1 over W? In this case, 1 over W, or 1 over 2, would be one half, or 0.5. What is the derivative of J of W with respect to W when the cost function J of W is W cubed, or W, or 1 over W? Let me show you how you can compute these derivatives yourself using a Python package called SymPy. Let me first import SymPy, and then tell SymPy that I'm going to use J and W as symbols for computing derivatives. For our first example, the cost function J was equal to W squared; notice how SymPy actually renders it in this nifty font as well. Now, if we use SymPy to take the derivative of J with respect to W, we do it as follows, and you see that SymPy tells you this derivative is 2W. Let me choose a variable, DJDW, and set that equal to this expression, and print it out: that's 2W. And if you want to plug the value of W into this expression to evaluate it, you can call the derivative's subs method with W and 2, which means plug in the value W equals 2 into this expression and evaluate it. That gives you the value 4, which is why when W equals 2, we saw that the derivative of J was equal to 4.
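In case it's helpful to see those steps in one place, here is a minimal SymPy sketch (my own illustration; variable names like dJ_dw are just illustrative choices, not necessarily those of the course notebook) that mirrors what was just described:

```python
import sympy

# Tell SymPy to treat J and w as symbols for computing derivatives.
J, w = sympy.symbols('J, w')

# Define the cost function J = w**2 symbolically.
J = w ** 2

# Ask SymPy for the derivative of J with respect to w.
dJ_dw = sympy.diff(J, w)
print(dJ_dw)                    # prints 2*w

# Substitute w = 2 into the derivative and evaluate it.
print(dJ_dw.subs([(w, 2)]))     # prints 4
```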
Let's look at some other examples. What if J were W cubed? Then the derivative becomes 3 times W squared. So it turns out from calculus, and this is what SymPy is calculating for us, that if J is W cubed, then the derivative of J with respect to W is 3W squared, and depending on what W is, the value of the derivative changes as well. If we plug in W equals 2, we get 12 in this case. Or what if J equals W? In this case, the derivative is just equal to 1. Or, for the final example, what if J equals 1 over W? In this case, the derivative turns out to be negative 1 over W squared, which here is negative one-fourth. So what I'm going to do is take the derivatives we just worked out. Remember, for W squared it was 2W, for W cubed it was 3W squared, for W it's just 1, and for 1 over W it is negative 1 over W squared. Let's copy these back to our other slide. What SymPy, or really calculus, showed us is that if J of W is W cubed, the derivative is 3W squared, which equals 12 when W equals 2; when J of W equals W, the derivative is just equal to 1; and when J of W is 1 over W, the derivative is negative 1 over W squared, which is negative one-quarter when W equals 2.

Let's double check whether these expressions that we got from SymPy are correct. Let's try increasing W by epsilon. In this case, J of W, and again, feel free to pause the video and check this math on your own calculator if you want, is 2.001 cubed, which becomes this value. So J has gone up from 8 to roughly 8.012, that is, by roughly 12 times epsilon, and thus the derivative is indeed 12. Or, if J of W equals W, then when W increases by epsilon, J of W, which is just W, is now 2.001, so it has gone up by 0.001, which is exactly the value of epsilon. So J of W has gone up by 1 times epsilon, and the derivative is indeed equal to 1. Notice that here this is actually exactly epsilon, even though epsilon isn't infinitesimally small. For our last example, if J of W equals 1 over W and W goes up by epsilon, then J of W is 1 over 2.001, which turns out to be approximately 0.49975, with some extra digits that I've truncated. This is 0.5 minus 0.00025, so J of W started off at 0.5 and has gone down by 0.00025, and this 0.00025 is 0.25 times epsilon. It has gone down by this amount, or equivalently, it has gone up by negative 0.25 times epsilon, because negative 0.25 times epsilon is equal to this term over here. So we see that if W goes up by epsilon, J of W goes up by negative one-fourth, or negative 0.25, times epsilon, which is why the derivative in this case is negative one-quarter.

I hope that with these examples, you have a sense of what the derivative of J of W with respect to W means. It just asks: if W goes up by epsilon, how much does J of W go up? It goes up by some constant K times epsilon, and this constant K is the derivative. The value of K depends both on what the function J of W is and on what the value of W is. Before we wrap up this video, I want to briefly touch on the notation used to write derivatives that you may see in other texts. If J of W is a function of a single variable, say W, then mathematicians will sometimes write the derivative as d/dW of J of W, and notice that this notation uses the lowercase letter d. In contrast, if J is a function of more than one variable, then mathematicians will sometimes use this stylized alternative symbol to denote the derivative of J with respect to one of the parameters WI.
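Here's a minimal SymPy sketch (my own illustration, not the course notebook) that computes all four of these derivatives and double-checks each one against the numerical ratio at W equals 2:

```python
import sympy

w = sympy.symbols('w')
epsilon = 0.001

# For each cost function, compare SymPy's symbolic derivative evaluated at w = 2
# with the numerical ratio (J(w + epsilon) - J(w)) / epsilon.
for J in [w ** 2, w ** 3, w, 1 / w]:
    dJ_dw = sympy.diff(J, w)
    at_2 = dJ_dw.subs([(w, 2)])
    numeric = (J.subs([(w, 2 + epsilon)]) - J.subs([(w, 2)])) / epsilon
    print(f"J = {J}: dJ/dw = {dJ_dw}, value at w=2 is {at_2}, "
          f"numerical ratio is about {float(numeric):.4f}")
```

The numerical ratios come out close to 4, 12, 1, and negative 0.25 respectively, matching the derivatives computed above.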
To my mind, this distinction between the regular letter d and the stylized calculus derivative symbol doesn't make much sense, and the two notations tend to overcomplicate calculus and derivative notation. But for historical reasons, calculus texts use these two different notations depending on whether J is a function of a single variable or a function of multiple variables. For practical purposes, though, this notational convention just complicates things in a way that I don't think is really necessary. So for this class, I'm just going to use this notation everywhere, even when there's just a single variable. In fact, for most of our applications the function J is a function of more than one variable, so this notation, which is sometimes called the partial derivative notation, is actually the correct notation almost all the time, because J usually has more than one variable. I hope that using this notation throughout these lectures simplifies the presentation and makes derivatives a little bit easier to understand; in fact, it's the notation you've been seeing in the videos leading up to now. For conciseness, instead of writing out this full expression, you'll sometimes also see it shortened as the derivative, or partial derivative, of J with respect to WI, or written like this. These are just simplified, abbreviated forms of the expression over here. So I hope that gives you a sense of what derivatives are: if W goes up by a little bit, by epsilon, how much does J of W change as a consequence? Next, let's take a look at how you can compute derivatives in a neural network. To do so, we need to take a look at something called a computation graph. Let's go take a look at that in the next video.