In the last video, we worked through an example of using a computation graph to compute a function J. Now, let's take a cleaned-up version of that computation graph and show how you can use it to figure out derivative calculations for that function J.

So, here's our computation graph. Let's say you want to compute the derivative of J with respect to v. What does that mean? It says: if we were to take this value of v and change it a little bit, how would the value of J change? Well, J is defined as 3 times v, and right now v is equal to 11. So if we were to bump up v a little bit to 11.001, then J, which is 3v, so currently 33, would get bumped up to 33.003. Here we've increased v by 0.001, and the net result is that J goes up 3 times as much. So the derivative of J with respect to v is equal to 3, because the increase in J is 3 times the increase in v. In fact, this is very analogous to the example from the previous video, where we had f(a) = 3a and derived that df/da (with slightly simplified, slightly sloppy notation) is equal to 3. Here we have J = 3v, so dJ/dv = 3, with J playing the role of f and v playing the role of a from that earlier example. In the terminology of backpropagation, what we're seeing is that if you want to compute the derivative of this final output variable, which is usually the variable you care most about, with respect to v, then we've done one step of backpropagation: we've gone one step backwards in this graph.

Now, let's look at another example. What is dJ/da? In other words, if we bump up the value of a, how does that affect the value of J? Let's go through it. Right now a is equal to 5, so let's bump it up to 5.001. The net impact is that v, which was a + u, so previously 11, gets increased to 11.001, and then, as we saw above, J gets bumped up to 33.003. So what we're seeing is that if you increase a by 0.001, J increases by 0.003. And by "increase a," I mean if you were to take this value of 5 and just plug in a new value, the change to a propagates to the right of the computation graph, so that J ends up being 33.003. The increase to J is 3 times the increase to a, so the derivative dJ/da is equal to 3.

One way to break this down is to say that if you change a, that changes v, and through changing v, that changes J. So the net change to the value of J when you nudge the value of a up a little bit is that, first, changing a increases v. How much does v increase? By an amount determined by dv/da. Then the change in v causes the value of J to increase as well. In calculus, this is called the chain rule: if a affects v, which affects J, then the amount that J changes when you nudge a is the product of how much v changes when you nudge a times how much J changes when you nudge v. That is, dJ/da = dJ/dv × dv/da. And what we saw from this calculation is that if you increase a by 0.001, v changes by the same amount, so dv/da is equal to 1. So if you plug in what we worked out previously, dJ/dv = 3 and dv/da = 1, then the product 3 × 1 gives you the correct value: dJ/da = 3.
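If it helps to see this concretely, here is a minimal sketch in Python, not the lecture's own code: the forward helper and the numerical nudging are just one way to check the two derivatives above, using the lecture's values a = 5, b = 3, c = 2.

```python
# A minimal sketch of this computation graph: J = 3v, v = a + u, u = bc,
# with the lecture's values a = 5, b = 3, c = 2.

def forward(a, b, c):
    u = b * c     # u = 6
    v = a + u     # v = 11
    return 3 * v  # J = 33

eps = 0.001

# Nudge v directly: J = 3v, so v = 11.001 gives J = 33.003.
dJ_dv = (3 * (11 + eps) - 3 * 11) / eps
print(dJ_dv)  # ~3.0

# Nudge a: the change propagates through v to J.
dJ_da = (forward(5 + eps, 3, 2) - forward(5, 3, 2)) / eps
print(dJ_da)  # ~3.0, matching dJ/dv * dv/da = 3 * 1
```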
So, this little illustration shows how having computed dJ/dv, the derivative with respect to this variable v, then helps you compute dJ/da. And that's another step of this backward calculation.

I just want to introduce one more notational convention. When you're writing code to implement backpropagation, there will usually be some final output variable that you really care about, or that you want to optimize. In this case, that final output variable is J; it's really the last node in your computation graph. So a lot of your computations will be trying to compute the derivative of that final output variable with respect to some other variable, which we might call dvar. Now, when you implement this in software, what do you call this variable? One thing you could do in Python is give it a very long name like dFinalOutputVar_dvar, but that's a lot to type. You could call it dJ_dvar. But because you're always taking derivatives with respect to this same final output variable, I'm going to introduce a shorter convention: in the code you write, we'll just use the variable name dvar to represent that quantity. So dvar, in your code, represents the derivative of the final output variable you care about, such as J (or sometimes the loss L), with respect to the various intermediate quantities you're computing. So in your code, you'd use dv to denote dJ/dv, and dv would be equal to 3. Likewise, you'd represent dJ/da as da, which we also figured out to be equal to 3.

Okay, so we've done backpropagation partially through this computation graph. Let's go through the rest of this example on the next slide, with a cleaned-up copy of the computation graph. Just to recap, what we've done so far is go backward here and figure out that dv = 3. Again, dv is just the variable name in the code; it really means dJ/dv. And we've figured out that da = 3, where da is the variable name in your code for the value of dJ/da. So we've gone backwards along these two edges.

Now, let's keep computing derivatives. Look at the variable u. What is dJ/du? Through a similar calculation as before, we start with u = 6. If you bump u up to 6.001, then v, previously 11, goes up to 11.001, and J goes from 33 to 33.003. The increase in J is 3 times the increase in u, so dJ/du = 3. And the analysis for u is very similar to the analysis we did for a: dJ/du is computed as dJ/dv × dv/du, where dJ/dv we had already figured out is 3, and dv/du turns out to be 1. So with one more step of backpropagation, we end up computing that du is also equal to 3, where du is, of course, just the code name for dJ/du.

Now, we'll step through one more example in detail. What is dJ/db? Imagine you are allowed to change the value of b, and you want to tweak b a little bit in order to minimize or maximize the value of J.
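As a small illustration of this dvar naming convention (the numbers simply restate the lecture's example, and this is a sketch rather than the course's own implementation), the backward steps so far might look like:

```python
# Backward steps so far, using the dvar convention:
# each variable dx holds the value of dJ/dx.

dv = 3       # dJ/dv = 3, since J = 3 * v
da = dv * 1  # chain rule: dJ/da = dJ/dv * dv/da, and dv/da = 1 (v = a + u)
du = dv * 1  # chain rule: dJ/du = dJ/dv * dv/du, and dv/du = 1 (v = a + u)

print(dv, da, du)  # 3 3 3
```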
So what is the derivative, or the slope, of this function J when you change the value of b a little bit? It turns out that, using the chain rule from calculus, this can be written as the product of two things: dJ/db = dJ/du × du/db. The reasoning is that if you change b a little bit, so b goes from 3 to, say, 3.001, the way it affects J is that it first affects u. How much does it affect u? Well, u is defined as b times c, so u will go from 6 (when b is 3) to 6.002, because c is equal to 2 in our example. This tells us that du/db = 2, because when you bump up b by 0.001, u increases twice as much.

So now we know that u goes up twice as much as b does. What is dJ/du? We've already figured out that it's equal to 3. And so, multiplying these two together, we find that dJ/db = 6. Here's the reasoning for the second part of the argument: we want to know, when u goes up by 0.002, how does that affect J? The fact that dJ/du = 3 tells us that when u goes up by 0.002, J goes up three times as much, so J should go up by 0.006. And if you check the math in detail, you'll find that if b becomes 3.001, then u becomes 6.002, v becomes 11.002 (that's a + u, so 5 + 6.002), and then J, which is 3 times v, ends up being 33.006. So that's how you get that dJ/db = 6. Filling that in as we go backwards, db = 6, where db is really the Python code variable name for dJ/db.

I won't go through the last example in great detail, but it turns out that if you also compute dJ/dc, it's dJ/du × du/dc, which is 3 × 3. So through this last step, you can derive that dc is equal to 9.

The key takeaway from this video, from this example, is that when computing all of these derivatives, the most efficient way to do so is through a right-to-left computation, following the direction of the red arrows. In particular, we first compute the derivative with respect to v, and that then becomes useful for computing the derivative with respect to a and the derivative with respect to u. The derivative with respect to u, in turn, becomes useful for computing the derivative with respect to b and the derivative with respect to c.

So that was the computation graph: a forward, or left-to-right, calculation to compute the cost function, such as J, that you might want to optimize, and a backward, or right-to-left, calculation to compute derivatives. If you're not familiar with calculus or the chain rule, I know some of those details may have gone by quickly, but if you didn't follow all the details, don't worry about it. I'll go over this again in the context of logistic regression and show you exactly what you need to do in order to implement the computations for the derivatives of the logistic regression model.
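To pull the whole example together, here is a short, self-contained sketch of the forward and backward passes for this graph. It assumes the lecture's values a = 5, b = 3, c = 2 and uses the dvar naming convention introduced above; it is one way to write these steps, not the course's reference code.

```python
# Forward (left-to-right) and backward (right-to-left) passes for
# J = 3v, v = a + u, u = bc. Each dvar holds dJ/dvar.

a, b, c = 5, 3, 2

# Forward pass: compute J.
u = b * c  # u = 6
v = a + u  # v = 11
J = 3 * v  # J = 33

# Backward pass: one chain-rule step per edge, right to left.
dv = 3       # dJ/dv, since J = 3 * v
da = dv * 1  # dv/da = 1, since v = a + u
du = dv * 1  # dv/du = 1, since v = a + u
db = du * c  # du/db = c = 2, so db = 3 * 2 = 6
dc = du * b  # du/dc = b = 3, so dc = 3 * 3 = 9

print(dv, da, du, db, dc)  # 3 3 3 6 9
```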