The computation graph is a key idea in deep learning, and it is also how programming frameworks like TensorFlow automatically compute derivatives of your neural networks. Let's take a look at how it works. Let me illustrate the concept of a computation graph with a small neural network example. This neural network has just one layer, which is also the output layer, and just one unit in the output layer. It takes as input X, applies a linear activation function, and outputs the activation A. More specifically, its output is A equals WX plus B. So, this is basically linear regression, but expressed as a neural network with one output unit. Given the output, the cost function is then one-half A, that is the predicted value, minus the actual observed value Y squared. And for this small example, we're only going to have a single training example, where the training example is the input X equals negative 2, the ground truth output value Y equals 2, and the parameters of this network are W equals 2 and B equals 8. So, what I'd like to do is show how the computation of the cost function J can be computed step-by-step using a computation graph. And just as a reminder, when learning, we like to view the cost function J as a function of the parameters W and B. Let's take the computation of J and break it down into individual steps. First, you have the parameter W, that is an input to the cost function J. And then we first need to compute W times X. And let me just call that C as follows. W is equal to 2, X is equal to negative 2, and so C would be negative 4. And I'm just going to write the value here on top of this line, on top of this arrow to show the value that is your output on this arrow. The next step is then to compute A, which is WX plus B. So, let me create another node here. And this needs to input B, the other parameter, that is input to the cost function J. And A equals WX plus B, so it's equal to C plus B. And if you add these up, that turns out to be 4. And so this is starting to build up a computation graph in which the steps we need to compute the cost function J are broken down into smaller steps. The next step is to then compute A minus Y, which I'm going to call D. So, let me have that node D, which computes A minus Y. Y is equal to 2, so 4 minus 2 is 2. And then finally, J is the cost. It's one half of A minus Y squared, or one half of D squared, which is just equal to 2. What we've just done is build up a computation graph. And this is a graph, not in the sense of plots with X and Y axes, but this is the other sense of the word graph used in computer science, which is that this is a set of nodes that is connected by edges, or connected by arrows in this case. So, this computation graph shows the forward prop step of how we compute the output A of the neural network, but then also go further than that to also compute the value of the cost function J. The question now is, how do we find the derivative of J with respect to the parameters W and B? Let's take a look at that next. So, here's the computation graph from the previous slide, and we've completed forward prop where we've computed that J, the cost function, is equal to 2 through all these steps going from left to right forward prop in the computation graph. What we'd like to do now is compute the derivative of J with respect to W, and the derivative of J with respect to B. 
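As a concrete companion to this walkthrough, here is a minimal sketch of the forward pass through this computation graph in Python, using the example values from the video (W = 2, B = 8, X = -2, Y = 2); the lowercase variable names are just labels for the nodes of the graph, not code from the lecture.

```python
# Forward pass through the computation graph, left to right,
# using the example values from the video.
w, b = 2.0, 8.0     # parameters
x, y = -2.0, 2.0    # the single training example: input x and target y

c = w * x           # c = -4
a = c + b           # a = 4, the network's output a = w*x + b
d = a - y           # d = 2
J = 0.5 * d ** 2    # J = 2, the cost

print(c, a, d, J)   # -4.0 4.0 2.0 2.0
```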
And it turns out that whereas forward prop was a left to right calculation, computing the derivatives would be a right to left calculation, which is why it's called back prop; we're going backwards from right to left. The final computation node of this graph is this one over here, which computes J equals one half of D squared. The first step of back prop will ask if the value of D, which was the input to this node, were to change a little bit, how much does the value of J change? Specifically, we'll ask if D were to go up by a little bit, say 0.001, and that would be our value of epsilon in this case, how would the value of J change? It turns out in this case, if D goes from 2 to 2.001, then J goes from 2 to 2.002. And so if D goes up by epsilon, J goes up by roughly 2 times epsilon. And so we conclude that the derivative of J with respect to D, this value D that is input to this final node, is equal to 2. And so the first step of back prop would be to fill in this value 2 over here, where this value is the derivative of J with respect to this input value D. And we know if D changes a little bit, J changes by twice as much, because this derivative is equal to 2. The next step is to look at the node before that and ask, what is the derivative of J with respect to A? And to answer that, we have to ask, well, if A goes up by 0.001, how does that change J? Well, we know that if A goes up by 0.001, D is just A minus Y. So if A becomes 4.001, D, which is A minus Y, becomes 4.001 minus Y, where Y equals 2, so D becomes 2.001. So if A goes up by 0.001, D also goes up by 0.001. But we'd already concluded previously that if D goes up by 0.001, J goes up by twice as much. So now we know if A goes up by 0.001, D goes up by 0.001, and then J goes up roughly by 2 times 0.001. And this tells us that the derivative of J with respect to A is also equal to 2. So I'm going to fill in that value over here, that this is the derivative of J with respect to A, just as this was the derivative of J with respect to D. If you've taken a calculus class before and if you've heard of the chain rule, you might recognize that this step of computation that I just did is actually relying on the chain rule for calculus. If you're not familiar with the chain rule, don't worry about it. You won't need to know it for the rest of these videos. But if you have seen the chain rule, you might recognize that the derivative of J with respect to A is the derivative of D with respect to A times the derivative of J with respect to D. And this little calculation on top showed that the partial of D with respect to A is 1. And we showed previously that the derivative of J with respect to D is equal to 2, which is why the derivative of J with respect to A is 1 times 2, which is equal to 2. And that's the value we got. But again, if you're not familiar with the chain rule, don't worry about it. The logic that we just went through here is why we know J goes up by twice as much as A does. And that's why this derivative term is equal to 2. The next step then is to keep on going right to left as we do in backprop. And we'll ask, how much does a little change in C cause J to change? And how much does a little change in B cause J to change? And the way we figure that out is to ask, well, if C goes up by epsilon, 0.001, how much does A change? Well, A is equal to C plus B. So it turns out that if C ends up being negative 3.999, then A, which is negative 3.999 plus 8, becomes 4.001.
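Before picking the thread back up with C and B, here is a small sketch of the two backprop steps done so far, estimating the derivatives numerically by nudging D and A by epsilon, just as in the argument above; the helper function and variable names are illustrative, not from the original lecture code.

```python
# Numerically estimate dJ/dd and dJ/da by nudging each value by epsilon.
eps = 0.001
y = 2.0
a = 4.0                 # output of the forward pass
d = a - y               # 2.0

def cost_from_d(d_val):
    return 0.5 * d_val ** 2

# If d goes up by epsilon, J goes up by roughly 2 * epsilon.
dJ_dd = (cost_from_d(d + eps) - cost_from_d(d)) / eps   # ~2.0005, i.e. roughly 2

# If a goes up by epsilon, d = a - y goes up by the same epsilon (dd/da = 1),
# so by the chain rule dJ/da = dd/da * dJ/dd.
dd_da = ((a + eps - y) - (a - y)) / eps                  # ~1.0
dJ_da = dd_da * dJ_dd                                    # roughly 2

print(dJ_dd, dJ_da)
```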
So, picking back up: if C goes up by epsilon, A goes up by epsilon. And we know if A goes up by epsilon, then because the derivative of J with respect to A is 2, we know that this in turn causes J to go up by 2 times epsilon. So we can conclude that if C goes up by a little bit, J goes up by twice as much. And we know this because we know the derivative of J with respect to A is 2. So this allows us to conclude that the derivative of J with respect to C is also equal to 2. And so I'm going to fill in that value over here. And again, only if you're familiar with the chain rule, another way to write this is the derivative of J with respect to C is the derivative of A with respect to C, which turns out to be 1, times the derivative of J with respect to A, which we had previously figured out was equal to 2. So that's why this ends up being equal to 2. And by a similar calculation, if B goes up by 0.001, then A also goes up by 0.001, and J goes up by 2 times 0.001, which is why this derivative is also equal to 2. We'll fill in here the derivative of J with respect to B, and here the derivative of J with respect to C. Now, one final step, which is, what is the derivative of J with respect to W? So if W goes up by 0.001, what happens? If W were 2.001, then C, which is W times X, becomes negative 2 times 2.001, so it becomes negative 4.002. And so if W goes up by epsilon, C goes down by 2 times 0.001, or equivalently, C goes up by negative 2 times 0.001. And we know that if C goes up by negative 2 times 0.001, because the derivative of J with respect to C is 2, this means that J will go up by negative 4 times 0.001. Right? Because if C goes up by a certain amount, J changes by 2 times as much, and 2 times negative 2 is negative 4. This allows us to conclude that if W goes up by 0.001, J goes up by negative 4 times 0.001. And so the derivative of J with respect to W is negative 4. And so I'm going to write negative 4 over here, because that's the derivative of J with respect to W. And once again, the chain rule calculation, if you're familiar with it, is this: it's the derivative of C with respect to W times the derivative of J with respect to C. The derivative of C with respect to W is negative 2 and the derivative of J with respect to C is 2, which is why we end up with negative 4. But again, don't worry about it if you're not familiar with the chain rule. So to wrap up, what we've just done is manually carry out backprop in this computation graph. Whereas forward prop was a left to right computation, where we had W equals 2 that allowed us to compute C, then we had B and that allowed us to compute A, and then D, and then J, backprop went from right to left, and we would first compute the derivative of J with respect to D, and then go back to compute the derivative of J with respect to A, then the derivative of J with respect to B, the derivative of J with respect to C, and finally the derivative of J with respect to W. So that's why backprop is a right to left computation, whereas forward prop was a left to right computation. In fact, let's double check the computation that we just did. So J with these values of W, B, X, and Y is equal to one half times WX plus B minus Y squared, which is one half times 2 times negative 2 plus 8 minus 2 squared, which is equal to 2. And now if W were to go up by 0.001, then J becomes one half times, W is now 2.001, times X, which is negative 2, plus B, which is 8, minus Y, squared, and if you calculate this out, this turns out to be 1.996002.
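To collect the whole manual backprop we just walked through into one place, here is a sketch of the full right-to-left pass written with the chain-rule factors instead of epsilon nudges; the variable names mirror the nodes in the graph and are just illustrative choices. We'll come back to the double check of J right after.

```python
# Manual backprop through the computation graph.
w, b, x, y = 2.0, 8.0, -2.0, 2.0

# Forward pass (left to right)
c = w * x             # -4
a = c + b             #  4
d = a - y             #  2
J = 0.5 * d ** 2      #  2

# Backward pass (right to left)
dJ_dd = d             # J = 0.5 * d**2, so dJ/dd = d       ->  2
dJ_da = 1.0 * dJ_dd   # d = a - y, so dd/da = 1            ->  2
dJ_dc = 1.0 * dJ_da   # a = c + b, so da/dc = 1            ->  2
dJ_db = 1.0 * dJ_da   # a = c + b, so da/db = 1            ->  2
dJ_dw = x * dJ_dc     # c = w * x, so dc/dw = x = -2       -> -4

print(dJ_dw, dJ_db)   # -4.0 2.0
```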
Coming back to the double check: roughly, J has gone from 2 down to about 1.996, a decrease of roughly 0.004, and J has therefore gone down by 4 times epsilon. So this shows that if W goes up by epsilon, J goes down by 4 times epsilon, or equivalently, J goes up by negative 4 times epsilon, which is why the derivative of J with respect to W is negative 4, which is what we had worked out over here. And if you want, feel free to pause the video and double check this math yourself as well for what happens if B, the other parameter, goes up by epsilon, and hopefully you'll find that the derivative of J with respect to B is indeed 2, that if B goes up by epsilon, J goes up by 2 times epsilon, as predicted by this derivative calculation. So why do we use the backprop algorithm to compute derivatives? It turns out that backprop is an efficient way to compute derivatives, and the reason we sequence this as a right-to-left calculation is that if you were to start off and ask what is the derivative of J with respect to W, you'd want to know how much changing W affects J. If W were to go up by epsilon, how much does J change? Well, the first thing we'd want to know is the derivative of J with respect to C, because changing W changes C, this first quantity here. So to know how much changing W affects J, we want to know how much changing C affects J. But to know how much changing C affects J, the most useful thing to know would be how much changing A affects J, since changing C changes A, and so on. And that's why backprop is sequenced as a right-to-left calculation: if you do the calculation from right to left, you can find out how changing D affects J, then how changing A affects J, and so on, until you find the derivatives of each of these intermediate quantities, C, A, and D, as well as the parameters W and B. So with one right-to-left calculation, you can find out how much changing any of these intermediate quantities, C, A, or D, as well as the input parameters W and B, will affect the final output value J. One thing that makes backprop efficient is that when we do the right-to-left calculation, we had to compute this term, the derivative of J with respect to A, just once, and this quantity is then used to compute both the derivative of J with respect to W and the derivative of J with respect to B. And it turns out that if a computation graph has N nodes, meaning N of these boxes, and P parameters, so we have two parameters in this case, this procedure allows us to compute all the derivatives of J with respect to all the parameters in roughly N plus P steps, rather than N times P steps. And if you have a neural network with, say, 10,000 nodes and maybe 100,000 parameters, which would not even be considered a very large neural network by modern standards, being able to compute the derivatives in 10,000 plus 100,000 steps, which is 110,000, is much, much better than needing 10,000 times 100,000 steps, which would be a billion steps. And so the backpropagation algorithm, done using the computation graph, gives you a very efficient way to compute all the derivatives, and that's why it is such a key idea in how deep learning algorithms are implemented today.
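If you want to do this double check in code rather than by hand, here is a small sketch that bumps each parameter by epsilon and compares the resulting change in J against the derivatives from backprop; the function name cost and its default arguments are hypothetical, chosen to match the example values.

```python
# Finite-difference check of dJ/dw and dJ/db for the toy example.
def cost(w, b, x=-2.0, y=2.0):
    return 0.5 * (w * x + b - y) ** 2

eps = 0.001
J0 = cost(2.0, 8.0)                          # 2.0

dJ_dw = (cost(2.0 + eps, 8.0) - J0) / eps    # ~ -3.998, i.e. roughly -4
dJ_db = (cost(2.0, 8.0 + eps) - J0) / eps    # ~  2.0005, i.e. roughly 2

print(cost(2.0 + eps, 8.0))                  # 1.996002, as computed in the video
print(dJ_dw, dJ_db)
```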
In this video, you saw how the computation graph takes all the steps of the calculation needed to compute the output of a neural network, A, as well as the cost function, J, and breaks those step-by-step computations into the different nodes of the computation graph. It then uses a left-to-right computation, forward prop, to compute the cost function, J, and then a right-to-left or backpropagation calculation to compute all the derivatives. In this video, you saw these ideas applied to a small neural network example. In the next video, let's take these ideas and apply them to a larger neural network. Let's go on to the next video.