In this final video on intuition for backprop, let's take a look at how the computation graph works on a larger neural network example. Here's the network we'll use: a single hidden layer with a single hidden unit that outputs a1, which feeds into the output layer that outputs the final prediction, a2. To keep the math tractable, I'm going to continue to use just a single training example with input x equals 1, y equals 5, and these will be the parameters of the network. Throughout, we're going to use the ReLU activation function, so g of z equals max of 0 and z.

Forward prop in the neural network looks like this. As usual, a1 equals g of w1 times x plus b1. It turns out w1 times x plus b1 is positive, so we're in the part of the activation function where g of z equals z, and a1 is just w1 times x plus b1, which is 2 times 1 (w1 is 2, x is 1) plus 0 (that's b1), which is equal to 2. Similarly, a2 equals g of w2 times a1 plus b2, which is w2 times a1 plus b2, again because we're in the positive part of the ReLU activation function, and that's 3 times 2 plus 1, which is equal to 7. Finally, we'll use the squared error cost function, so J of w,b is one half of a2 minus y squared, which is one half of 7 minus 5 squared, which is one half of 2 squared, which is just equal to 2.

Now let's take the calculation we just did and write it down in the form of a computation graph. To carry out the computation step by step, the first thing we need to do is take w1 and multiply it by x. So w1 feeds into a computation node that computes w1 times x, and I'm going to call this a temporary variable, t1. Next we compute z1, which is t1 plus b1, so b1 is another input here. Finally, a1 equals g of z1: we apply the activation function and end up with, again, the value 2. Next we compute t2, which is w2 times a1, and with w2 that gives us 6. Then z2: we add b2 and get 7. Finally, we apply the activation function g and still end up with 7. Lastly, J is one half of a2 minus y squared, and that gives us 2, which is the cost we computed before. So this is how you take the step-by-step computations for a larger neural network and write them out in a computation graph.

You've already seen in the last video the mechanics of how to carry out backprop, so I'm not going to go through the step-by-step calculations here. But if you were to carry out backprop, the first thing you'd do is ask: what is the derivative of the cost function J with respect to a2? It turns out, if you calculate it, to be 2, so we fill that in here. The next step is to ask: what's the derivative of the cost J with respect to z2? Using the derivative we just computed, you can figure out that this also turns out to be 2, because if z2 goes up by epsilon, you can show that for the current setting of all the parameters, a2 will go up by epsilon, and therefore J will go up by 2 times epsilon. So this derivative is equal to 2, and so on, step by step. We can then find that the derivative of J with respect to b2 is also equal to 2, the derivative with respect to t2 is equal to 2, and so on and so forth, until eventually we've computed the derivative of J with respect to all the parameters: w1, b1, w2, and b2. And that's backprop.
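To make the node-by-node arithmetic concrete, here is a minimal sketch of this forward pass and the hand-derived backward pass in plain Python. The variable names t1, z1, a1, and so on follow the computation graph above; the relu and relu_grad helpers are just shorthand for this sketch, not functions from the course code.

```python
def relu(z):
    # g(z) = max(0, z)
    return max(0.0, z)

def relu_grad(z):
    # derivative of ReLU: 1 in the positive region, 0 otherwise
    return 1.0 if z > 0 else 0.0

# Single training example and parameters from the video
x, y = 1.0, 5.0
w1, b1 = 2.0, 0.0
w2, b2 = 3.0, 1.0

# Forward prop, one computation-graph node at a time
t1 = w1 * x                 # 2
z1 = t1 + b1                # 2
a1 = relu(z1)               # 2
t2 = w2 * a1                # 6
z2 = t2 + b2                # 7
a2 = relu(z2)               # 7
J = 0.5 * (a2 - y) ** 2     # 2

# Backprop: walk the graph right to left, applying the chain rule at each node
dJ_da2 = a2 - y                     # 2
dJ_dz2 = dJ_da2 * relu_grad(z2)     # 2
dJ_db2 = dJ_dz2                     # 2
dJ_dt2 = dJ_dz2                     # 2
dJ_dw2 = dJ_dt2 * a1                # 4
dJ_da1 = dJ_dt2 * w2                # 6
dJ_dz1 = dJ_da1 * relu_grad(z1)     # 6
dJ_db1 = dJ_dz1                     # 6
dJ_dt1 = dJ_dz1                     # 6
dJ_dw1 = dJ_dt1 * x                 # 6

print(J, dJ_dw1, dJ_db1, dJ_dw2, dJ_db2)   # 2.0 6.0 6.0 4.0 2.0
```

Each derivative in the backward pass reuses the one computed just to its right in the graph, which is why a single right-to-left sweep is enough to get all of them.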
And again, I didn't go through every single mechanical step of backprop, but it's basically the process you saw in the previous video. Let me just double-check one of these examples. We saw here that the derivative of J with respect to w1 is equal to 6. What that predicts is that if w1 goes up by epsilon, J should go up by roughly 6 times epsilon. Let's step through the math and see if that's really true. These are the calculations we did again. If w1, which was 2, goes up by epsilon to 2.001, then a1, instead of 2, becomes 2.001 as well. Then 3 times 2.001 plus 1 gives us 7.003, and if a2 is 7.003, the cost becomes 7.003 minus 5 squared, which is 2.003 squared over 2, which is about 2.006. So, ignoring some of the extra digits, you see from this little calculation that if w1 goes up by 0.001, J goes up from 2 to roughly 2.006, that is, 6 times as much. So the derivative of J with respect to w1 is indeed equal to 6.

So the backprop procedure gives you a very efficient way to compute all of these derivatives, which you can then feed into the gradient descent algorithm or the Adam optimization algorithm to train the parameters of your neural network. And again, the reason we use backprop is that it's a very efficient way to compute all of the derivatives of J with respect to w1, b1, w2, and b2. I did just illustrate how we could bump w1 up by a little bit and see how much J changes, but that was a left-to-right calculation. If we had to do that procedure for each parameter, one at a time (increase w1 by 0.001 to see how that changes J, increase b1 by a little bit to see how that changes J, and so on for every parameter), it would be a very inefficient calculation. If you have n nodes in your computation graph and p parameters, that procedure ends up taking on the order of n times p steps, whereas backprop gave us all four of these derivatives in roughly n plus p steps rather than n times p. That makes a huge difference in practical neural networks, where the number of nodes and the number of parameters can be really large.

So that's the end of the videos for this week. Thanks for sticking with me through these optional videos, and I hope you now have an intuition for what's actually happening under the hood when you use a programming framework like TensorFlow to train a neural network, and how it uses the computation graph to efficiently compute derivatives for you. Many years ago, before the rise of frameworks like TensorFlow and PyTorch, researchers had to write down the neural network by hand, manually use calculus to compute the derivatives, and then implement a bunch of equations they had laboriously derived on paper in order to carry out backprop. In modern programming frameworks, you can just specify forward prop and have the framework take care of backprop for you. Thanks to the computation graph and these techniques for automatically carrying out derivative calculations, sometimes called autodiff, for automatic differentiation, this process of researchers manually using calculus to take derivatives is no longer really done.
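Here is the same spot check written as a quick finite-difference sketch in Python, using a small forward_cost helper of my own rather than anything from the course. It nudges each parameter by 0.001 and re-runs the whole forward pass, which is exactly the one-forward-pass-per-parameter procedure that becomes expensive when the number of parameters is large.

```python
def forward_cost(w1, b1, w2, b2, x=1.0, y=5.0):
    # Full forward prop followed by the squared error cost
    a1 = max(0.0, w1 * x + b1)
    a2 = max(0.0, w2 * a1 + b2)
    return 0.5 * (a2 - y) ** 2

eps = 1e-3
params = {"w1": 2.0, "b1": 0.0, "w2": 3.0, "b2": 1.0}
J0 = forward_cost(**params)   # 2.0

# One extra forward pass per parameter: this is the "n times p" behavior
for name in params:
    bumped = dict(params)
    bumped[name] += eps
    approx_grad = (forward_cost(**bumped) - J0) / eps
    print(name, approx_grad)
# Prints roughly 6, 6, 4, 2: the same values backprop produced in one backward pass
```

With p parameters this needs p extra forward passes through all n nodes of the graph, while backprop reuses one right-to-left sweep to recover every derivative at once.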
At least, I've not had to manually take derivatives like that for many years now myself, because of autodiff. So many years ago, to use neural networks, the bar for the amount of calculus you had to know used to be higher. But because of automatic differentiation algorithms, usually based on the computation graph, you can now implement a neural network and get the derivatives computed for you more easily than before. So maybe with the maturing of neural networks, the amount of calculus you need to know to get these algorithms to work has actually gone down, and that's been encouraging for a lot of people. And so that's it for the videos for this week. I hope you enjoyed the labs, and I look forward to seeing you next week.
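As a final illustration of what a framework's autodiff does with the computation graph, here is a sketch of the same tiny network in PyTorch, which the video mentions alongside TensorFlow. This code is not part of the course; it's only an assumed example showing that writing forward prop is enough for the framework to hand back the same derivatives.

```python
import torch

# Parameters from the example, marked as requiring gradients
w1 = torch.tensor(2.0, requires_grad=True)
b1 = torch.tensor(0.0, requires_grad=True)
w2 = torch.tensor(3.0, requires_grad=True)
b2 = torch.tensor(1.0, requires_grad=True)
x, y = torch.tensor(1.0), torch.tensor(5.0)

# Only forward prop is written out; PyTorch records the computation graph as it runs
a1 = torch.relu(w1 * x + b1)
a2 = torch.relu(w2 * a1 + b2)
J = 0.5 * (a2 - y) ** 2

# A single backward() call runs backprop over the recorded graph
J.backward()
print(w1.grad, b1.grad, w2.grad, b2.grad)   # tensor(6.) tensor(6.) tensor(4.) tensor(2.)
```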