In the previous video, you saw the basic building blocks of implementing a deep neural network: a forward propagation step for each layer, and a corresponding backward propagation step. Let's see how you can actually implement these steps.

We'll start with forward propagation. Recall that this takes a^[l-1] as input and outputs a^[l] and the cache z^[l]. We also said that, from an implementational point of view, we might cache W^[l] and b^[l] as well, just to make the function calls a bit easier in the programming exercise. The equations should already look familiar. The way to implement the forward function is just this: z^[l] = W^[l] a^[l-1] + b^[l], and then a^[l] equals the activation function g^[l] applied to z^[l]. If you want a vectorized implementation, it's Z^[l] = W^[l] A^[l-1] + b^[l], with the addition of b^[l] handled by Python broadcasting, and A^[l] = g^[l](Z^[l]), applied element-wise.

Remember the diagram for the forward step, where we had a chain of boxes going forward. You initialize that chain by feeding in A^[0], which is equal to X. So the input to the first box is really a^[0], the input features for one training example if you're processing one example at a time, or A^[0], the entire training set, if you're processing the whole training set at once. That's the input to the first forward function in the chain, and then just repeating this step lets you compute forward propagation from left to right.

Next, let's talk about the backward propagation step. Here, your goal is to input da^[l] and output da^[l-1], dW^[l], and db^[l]. Let me write out the steps you need to compute these. First, dz^[l] = da^[l] * g^[l]'(z^[l]), where * is the element-wise product. Then, to compute the derivatives, dW^[l] = dz^[l] a^[l-1]^T. I didn't explicitly put a^[l-1] in the cache, but it turns out you need this as well. Then db^[l] = dz^[l], and finally da^[l-1] = W^[l]^T dz^[l].

I don't want to go through the detailed derivation, but it turns out that if you take this definition for da^[l-1] and plug it into the first equation, you get the same formula we had previously for computing dz^[l] from the next layer's dz: dz^[l] = W^[l+1]^T dz^[l+1] * g^[l]'(z^[l]), again with an element-wise product. I know this looks like a lot of algebra, but you can double-check for yourself that this is the equation we wrote down for backpropagation last week, when we were doing a neural network with just a single hidden layer. So all you need is those four equations to implement your backward function.

Finally, let me write out the vectorized version. The first line becomes dZ^[l] = dA^[l] * g^[l]'(Z^[l]), so no surprise there. dW^[l] becomes (1/m) dZ^[l] A^[l-1]^T, and db^[l] becomes (1/m) np.sum(dZ^[l], axis=1, keepdims=True). We talked about using np.sum this way to compute db in the previous week. And finally, dA^[l-1] = W^[l]^T dZ^[l]. So this lets you take dA^[l] as input and output dW^[l] and db^[l], the derivatives you need, as well as dA^[l-1]. That's how you implement the backward function.

So, just to summarize: take the input X. The first layer might have a ReLU activation function. Then go to the second layer, which maybe uses another ReLU activation function.
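To make these per-layer steps concrete, here is a minimal NumPy sketch of one forward block and one backward block, following the vectorized equations above. The function names (forward_step, backward_step), the cache layout, and the ReLU helpers are illustrative assumptions, not the exact interface used in the programming exercise.

```python
import numpy as np

def relu(Z):
    # g(Z) = max(0, Z), applied element-wise
    return np.maximum(0, Z)

def relu_backward(dA, Z):
    # dZ = dA * g'(Z); for ReLU, g'(Z) is 1 where Z > 0 and 0 elsewhere
    return dA * (Z > 0)

def forward_step(A_prev, W, b, activation):
    """One forward block: Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l])."""
    Z = W @ A_prev + b              # b is broadcast across the m columns
    A = activation(Z)
    cache = (A_prev, W, b, Z)       # keep what the backward pass will need
    return A, cache

def backward_step(dA, cache, activation_backward):
    """One backward block: given dA[l] and the cache, return dA[l-1], dW[l], db[l]."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = activation_backward(dA, Z)                       # dZ[l] = dA[l] * g[l]'(Z[l])
    dW = (1.0 / m) * dZ @ A_prev.T                        # dW[l] = (1/m) dZ[l] A[l-1]^T
    db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)    # db[l] = (1/m) sum over examples
    dA_prev = W.T @ dZ                                    # dA[l-1] = W[l]^T dZ[l]
    return dA_prev, dW, db

# Tiny usage example: one layer with 3 units, 4 input features, m = 5 examples
A0 = np.random.randn(4, 5)
W1, b1 = np.random.randn(3, 4) * 0.01, np.zeros((3, 1))
A1, cache1 = forward_step(A0, W1, b1, relu)
dA0, dW1, db1 = backward_step(np.ones_like(A1), cache1, relu_backward)  # placeholder upstream gradient
```

Caching A^[l-1], W^[l], b^[l], and Z^[l] during the forward pass is exactly what lets the backward step compute dW^[l], db^[l], and dA^[l-1] without recomputing anything.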
The third layer might have a sigmoid activation function if you're doing binary classification, and it outputs ŷ. Using ŷ, you can compute the loss, and that allows you to start your backward iteration, where backprop computes the derivatives: dW^[3], db^[3], dW^[2], db^[2], dW^[1], db^[1]. Along the way, the caches transfer z^[1], z^[2], z^[3], and you pass back da^[2] and da^[1]. You could compute da^[0], but we won't use it, so you can just discard it. And so that's how you implement forward prop and backprop for a three-layer neural network.

There's just one last detail I didn't talk about. For the forward recursion, we initialize it with the input data X. How about the backward recursion? It turns out that da^[L], when you're using the logistic regression loss for binary classification, is equal to -y/a + (1-y)/(1-a). In other words, the derivative of the loss function with respect to the output ŷ can be shown to equal this expression. If you're familiar with calculus, you can take the loss function L and differentiate it with respect to ŷ, or with respect to a, and show that you get this formula. So this is the formula you should use for da for the final layer, capital L. And of course, if you have a vectorized implementation, you initialize the backward recursion not with this, but with dA^[L], which stacks the same quantity for the different examples: -y^(1)/a^(1) + (1-y^(1))/(1-a^(1)), and so on, down to the m-th training example, -y^(m)/a^(m) + (1-y^(m))/(1-a^(m)). That's how you initialize the vectorized version of backpropagation.

So you've now seen the basic building blocks of both forward propagation and backpropagation. If you implement these equations, you will get a correct implementation of forward prop and backprop that gives you the derivatives you need. You might be thinking: that's a lot of equations, I'm slightly confused, I'm not quite sure I see how this all works. If you're feeling that way, my advice is that when you get to this week's programming assignment, you'll implement these steps yourself and they'll feel much more concrete. I know there were a lot of equations, and maybe some of them didn't make complete sense, but working through the calculus and the linear algebra is not easy, so feel free to try; it's actually one of the more difficult derivations in machine learning. The equations we wrote down are just the calculus for computing the derivatives, especially in backprop. But once again, if this feels a little abstract or mysterious to you, it will feel more concrete once you've done the programming exercise. Although I have to say, even today when I implement a learning algorithm, sometimes even I'm surprised when my implementation works, because a lot of the complexity of machine learning comes from the data rather than from the lines of code. So sometimes you implement a few lines of code, not quite sure what they did, and they almost magically work.
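To tie the whole chain together, here is a compact, self-contained NumPy sketch of the three-layer example above (ReLU, ReLU, sigmoid), including initializing the backward recursion with dA^[3] = -(Y/A^[3]) + (1-Y)/(1-A^[3]). The layer sizes, random initialization scale, and variable names are assumptions made for illustration, not the programming-exercise code.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

# Hypothetical setup: 2 input features, hidden layers of 4 and 3 ReLU units, sigmoid output
np.random.seed(0)
m = 5                                    # number of training examples
X = np.random.randn(2, m)                # A[0] = X
Y = (np.random.rand(1, m) > 0.5) * 1.0   # binary labels

W1, b1 = np.random.randn(4, 2) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(3, 4) * 0.01, np.zeros((3, 1))
W3, b3 = np.random.randn(1, 3) * 0.01, np.zeros((1, 1))

# Forward pass: ReLU -> ReLU -> sigmoid
Z1 = W1 @ X + b1
A1 = relu(Z1)
Z2 = W2 @ A1 + b2
A2 = relu(Z2)
Z3 = W3 @ A2 + b3
A3 = sigmoid(Z3)                         # A3 is Y-hat

# Initialize the backward recursion: dA[L] = -(Y/A) + (1-Y)/(1-A)
dA3 = -(np.divide(Y, A3) - np.divide(1 - Y, 1 - A3))

# Backward pass, layer by layer
dZ3 = dA3 * A3 * (1 - A3)                # sigmoid: g'(Z) = A(1-A)
dW3 = (1 / m) * dZ3 @ A2.T
db3 = (1 / m) * np.sum(dZ3, axis=1, keepdims=True)
dA2 = W3.T @ dZ3

dZ2 = dA2 * (Z2 > 0)                     # ReLU: g'(Z) = 1 where Z > 0
dW2 = (1 / m) * dZ2 @ A1.T
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
dA1 = W2.T @ dZ2

dZ1 = dA1 * (Z1 > 0)
dW1 = (1 / m) * dZ1 @ X.T
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
# dA0 would be W1.T @ dZ1, but it is not needed, so we discard it
```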
And it's because a lot of the magic is actually not in the piece of code you write, which is often, you know, not too long. It's not exactly simple, but it's not, you know, 10,000 or 100,000 lines of code, but you feed it so much data that sometimes, even though I've worked in machine learning for a long time, sometimes it still, you know, surprises me a bit when my learning algorithm works because a lot of the complexity of your learning algorithm comes from the data rather than necessarily from your writing, you know, thousands and thousands of lines of code. All right. So that's how you implement deep neural networks. And again, this will become more concrete when you've done the programming exercise. Before moving on, I want to discuss in the next video, I want to discuss hyperparameters and parameters. It turns out that when you're training deep nets, being able to organize your hyperparameters well will help you be more efficient in developing your networks. In the next video, let's talk about exactly what that means.