In the last video, you saw the equations for backpropagation. In this video, let's go over some intuition using the computation graph for how those equations were derived. This video is completely optional, so feel free to watch it or not. You should be able to do the homework either way.

Recall that when we talked about logistic regression, we had this forward pass where we compute z, then a, and then the loss. And then to take derivatives, we had this backward pass where we first compute da, then go on to compute dz, and then go on to compute dw and db. The definition of the loss was L(a, y) = -y log a - (1 - y) log(1 - a). So if you're familiar with calculus and you take the derivative of this with respect to a, that gives you the formula for da. If you actually work out the calculus, you can show that da = -y/a + (1 - y)/(1 - a); you derive that just by taking derivatives of the loss. It turns out that when you take another step backwards to compute dz, we worked out that dz = a - y. I didn't explain why previously, but it turns out that from the chain rule of calculus, dz = da times g'(z), where here g(z) = sigmoid(z) is our activation function for this output unit in logistic regression. Just remember, this is still logistic regression: we have x1, x2, x3, and then just one sigmoid unit, and that gives us a, or y hat. So here the activation function was a sigmoid function. As an aside, only for those of you familiar with the chain rule of calculus, the reason for this is that a = sigmoid(z), and so the partial of L with respect to z is equal to the partial of L with respect to a times da/dz. But since a = sigmoid(z), da/dz is d/dz of g(z), which is g'(z). That's why this expression, which is dz in our code, is equal to this expression, which is da in our code, times g'(z). That last derivation will make sense only if you're familiar with calculus, and specifically the chain rule from calculus. But if not, don't worry about it; I'll try to explain the intuition wherever it's needed. And then finally, having computed dz for logistic regression, we compute dw, which turns out to be dz times x, and db, which is just dz, when you have a single training example.

So that was logistic regression. What we're going to do when computing backpropagation for a neural network is a calculation a lot like this, only we'll do it twice, because now we have not x going to an output unit, but x going to a hidden layer and then going to an output unit. So instead of this computation being one step, as we have here, we'll have two steps in this kind of neural network with two layers. In this two-layer neural network, that is, with an input layer, a hidden layer, and an output layer, remember the steps of the computation. First, you compute z1, and then a1, and then you compute z2 (notice z2 also depends on the parameters W2 and b2), and then based on z2, you compute a2, and then finally, that gives you the loss. What backpropagation does is go backward to compute da2, and then dz2, then go back to compute dw2 and db2, then go back to compute da1, dz1, and so on.
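For readers who like to see the formulas as code, here is a minimal numpy sketch of that single-example logistic regression forward and backward pass. The toy values for x, y, w, and b are made up purely for illustration and are not from the course assignments.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy single example (illustrative values, not course data)
x = np.array([0.5, -1.2, 3.0])   # n_x input features
y = 1.0                          # true label
w = np.zeros(3)                  # weights
b = 0.0                          # bias

# Forward pass: z -> a -> loss
z = np.dot(w, x) + b
a = sigmoid(z)
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Backward pass, following the equations in the video
da = -y / a + (1 - y) / (1 - a)   # dL/da
dz = da * a * (1 - a)             # chain rule: da * g'(z), which simplifies to a - y
assert np.isclose(dz, a - y)
dw = dz * x                       # dL/dw for a single example
db = dz                           # dL/db for a single example
```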
We don't need to take derivatives with respect to the input x, since the input x for supervised learning is fixed, so we're not trying to optimize x. So we won't bother to take derivatives, at least for supervised learning, with respect to x. I'm also going to skip explicitly computing da2. If you want, you can actually compute da2 and then use that to compute dz2, but in practice you can collapse both of those steps into one, so you end up with dz2 = a2 - y, same as before. I'm also going to write dw2 and db2 down here below: dw2 = dz2 times a1 transpose, and db2 = dz2. This step is quite similar to logistic regression, where we had dw = dz times x, except that now a1 plays the role of x and there's an extra transpose there, because of the relationship between the capital matrix W2 and the individual parameters: with a single output unit, W2 is a row vector, whereas w in logistic regression was a column vector, and that's why there's an extra transpose on a1 here that we didn't need for x in logistic regression.

So this completes half of backpropagation. Then again, you can compute da1 if you wish, although in practice the computations for da1 and dz1 are usually collapsed into one step, so what you'd actually implement is dz1 = W2 transpose times dz2, times, element-wise, g1 prime of z1.

Just to do a check on the dimensions: if you have a neural network that looks like this, with nx = n0 input features, n1 hidden units, and n2 output units (in our case just one output unit), then the matrix W2 is n2 by n1 dimensional; z2, and therefore dz2, are n2 by 1 dimensional (really 1 by 1 when we're doing binary classification); and z1, and therefore also dz1, are n1 by 1 dimensional. Note that for any variable foo, foo and dfoo always have the same dimension; that's why w and dw always have the same dimension, and similarly for b and db, z and dz, and so on. So to make sure the dimensions all match up, we have that dz1 is equal to W2 transpose times dz2, and then an element-wise product with g1 prime of z1. Matching the dimensions from above: dz1 is n1 by 1; W2 transpose is n1 by n2 dimensional; dz2 is n2 by 1 dimensional; and g1 prime of z1 has the same dimension as z1, so it's also n1 by 1, hence the element-wise product. So the dimensions do make sense: an n1 by 1 dimensional vector can be obtained as an n1 by n2 dimensional matrix times an n2 by 1 dimensional vector, because the product of those two gives you an n1 by 1 dimensional vector, and that then becomes an element-wise product of two n1 by 1 dimensional vectors, so the dimensions match up. One tip when implementing backprop: if you just make sure that the dimensions of your matrices match up, thinking through the dimensions of your various matrices, including W1, W2, z1, z2, a1, a2, and so on, and checking that the dimensions of these matrix operations match, that alone will already eliminate quite a lot of bugs in backprop. A sketch of this dimension check appears below.
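Here is a minimal sketch, for a single training example, of the layer-2 gradients and dz1 together with the dimension checks described above. The layer sizes, the random toy data, and the choice of tanh as the hidden activation g1 are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative layer sizes: n0 input features, n1 hidden units, n2 = 1 output unit
n0, n1, n2 = 3, 4, 1
rng = np.random.default_rng(0)

x  = rng.standard_normal((n0, 1))           # one training example, as a column vector
y  = np.array([[1.0]])                      # its label
W1 = rng.standard_normal((n1, n0)) * 0.01   # small random init (discussed in a later video)
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

# Forward pass; the hidden activation g1 is assumed to be tanh here
z1 = W1 @ x + b1
a1 = np.tanh(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Backward pass for this single example, collapsing da2 into dz2 as in the video
dz2 = a2 - y                        # shape (n2, 1)
dW2 = dz2 @ a1.T                    # shape (n2, n1), same as W2
db2 = dz2                           # shape (n2, 1)
dz1 = (W2.T @ dz2) * (1 - a1**2)    # shape (n1, 1); (1 - a1**2) is g1'(z1) for tanh

# The dimension check from the video: every gradient matches its variable's shape
assert dW2.shape == W2.shape
assert db2.shape == b2.shape
assert dz1.shape == z1.shape
```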
Alright, so this gives us dz1. Then finally, just to wrap up, dw1 and db1 — we should write them here, I guess, but since I'm running out of space, I'll write them on the right of the slide. dw1 and db1 are given by the following formulas: dw1 is equal to dz1 times x transpose, and db1 is equal to dz1. You may notice a similarity between these equations and the layer-2 equations, which is really no coincidence, because x plays the role of a0, so x transpose is a0 transpose; those equations are actually very similar. Alright, so that gives a sense for how backpropagation is derived, and we have six key equations here, for dz2, dw2, db2, dz1, dw1, and db1. So let me just take these six equations and copy them over to the next slide. Here they are.

So far, we've derived backpropagation for training on a single training example at a time. But it should come as no surprise that rather than working on a single example at a time, we would like to vectorize across different training examples. You remember that for forward propagation, when we're operating on one example at a time, we had equations like z1 equals W1 x plus b1, as well as, say, a1 equals g1 of z1. In order to vectorize, we took the z's for the individual examples, z1(1), z1(2), up through z1(m), stacked them up as the columns of a matrix, and called that capital Z1. Then we found that by stacking things up in columns and defining the capital, uppercase versions of these, we just had Z1 equals W1 X plus b1, and A1 equals g1 of Z1. We defined the notation very carefully in this course to make sure that stacking examples into different columns of a matrix makes all this work out.

It turns out that if you go through the math carefully, the same trick also works for backpropagation. So the vectorized equations are as follows. First, if you take the dz's for the different training examples and stack them as the different columns of a matrix, and do the same for the a's and z's, then the vectorized implementation is dZ2 equals A2 minus Y. Then here's how you can compute dW2: it's 1 over m times dZ2 times A1 transpose. There is this extra 1 over m because the cost function J is 1 over m times the sum from i equals 1 through m of the losses, and so when computing the derivatives, we have that extra 1 over m term, just as we did when we were computing the weight updates for logistic regression. Then the update you get for db2 is, again, a sum of the dz's, times 1 over m. Then dZ1 is computed as W2 transpose times dZ2, element-wise times g1 prime of Z1. Once again, this is an element-wise product; only whereas previously, as we saw on the previous slide, this was an n1 by 1 dimensional vector, now it's an n1 by m dimensional matrix, and both of the quantities in the product are n1 by m dimensional, and that's why that asterisk is an element-wise product. Finally, the remaining two updates perhaps shouldn't look too surprising: dW1 is 1 over m times dZ1 times X transpose, and db1 is 1 over m times the sum of the dz1's.

So I hope that gives you some intuition for how the backpropagation algorithm is derived. In all of machine learning, I think the derivation of the backpropagation algorithm is actually one of the most complicated pieces of math I've seen, and it requires knowing both linear algebra as well as matrix calculus to really derive it from scratch, from first principles. If you are an expert in matrix calculus, using this process, you might be able to derive the algorithm yourself.
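To make those vectorized equations concrete, here is a minimal numpy sketch across m examples stacked as columns. The layer sizes, the random toy data, and the tanh hidden activation are again assumptions for illustration, not the course's assignment code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative sizes: n0 features, n1 hidden units, n2 = 1 output, m examples
n0, n1, n2, m = 3, 4, 1, 5
rng = np.random.default_rng(1)

X  = rng.standard_normal((n0, m))                 # examples stacked as columns
Y  = rng.integers(0, 2, size=(n2, m)).astype(float)
W1 = rng.standard_normal((n1, n0)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

# Vectorized forward pass (hidden activation assumed to be tanh)
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# Vectorized backward pass: the six equations, with the 1/m from the cost J
dZ2 = A2 - Y                                        # (n2, m)
dW2 = (1 / m) * dZ2 @ A1.T                          # (n2, n1)
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)  # (n2, 1)
dZ1 = (W2.T @ dZ2) * (1 - A1**2)                    # (n1, m); element-wise product
dW1 = (1 / m) * dZ1 @ X.T                           # (n1, n0)
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)  # (n1, 1)
```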
But I think there are actually plenty of deep learning practitioners who have seen the derivation at about the level you've seen in this video, already have all the right intuitions, and are able to implement this algorithm very effectively. So if you are an expert in calculus, do see if you can derive the whole thing from scratch. It is one of the very hardest pieces of math, one of the very hardest derivations, that I've seen in all of machine learning. But either way, if you implement this, it will work, and I think you have enough intuition to tune it and get it to work. With that, there's just one last detail I want to share with you before you implement your neural network, which is how to initialize the weights of your neural network. It turns out that initializing your parameters randomly, rather than to all zeros, is very important for training your neural network. In the next video, you'll see why.