One of the problems with training neural networks, especially very deep neural networks, is that of vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult. In this video, you'll see what this problem of exploding or vanishing gradients really means, as well as how careful choices of the random weight initialization can significantly reduce this problem.

Let's say you're training a very deep neural network like this. To save space on this slide, I've drawn it as if you have only two hidden units per layer, but it could be more as well. This neural network will have parameters w1, w2, w3, and so on, up to wL. For the sake of simplicity, let's say we're using an activation function g of z equals z, so a linear activation function, and let's ignore b; let's set b of l equal to zero. In that case, you can show that the output y hat will be wL times wL minus 1 times wL minus 2, dot, dot, dot, down to w3, w2, w1 times x. If you want to just check the math, w1 times x is going to be z1, because b is equal to zero, so z1 is equal to w1 times x plus b, which is zero. Then a1 is equal to g of z1, and because we use a linear activation function, this is just equal to z1. So this first term, w1 x, is equal to a1. By similar reasoning, you can figure out that w2 times w1 times x is equal to a2, because that's going to be g of z2, which is g of w2 times a1, and you can plug a1 equals w1 x in there. So this thing is equal to a2, then this thing is a3, and so on, until the product of all these matrices gives you y hat, not y.

Now, let's say that each of your weight matrices wl is just a little bit larger than 1 times the identity, so it's 1.5, 0, 0, 1.5. Technically, the last one has different dimensions, so maybe this applies just to the rest of these weight matrices. Then y hat, ignoring this last matrix which has different dimensions, will be this 1.5, 0, 0, 1.5 matrix to the power of l minus 1, times x, because if we assume that each one of these matrices is equal to 1.5 times the identity matrix, then you end up with this calculation. And so y hat will be essentially 1.5 to the power of l, technically l minus 1, times x. If l is large, for a very deep neural network, y hat will be very large. In fact, this grows exponentially; it grows like 1.5 to the number of layers. So if you have a very deep neural network, the value of y hat will explode.

Now, conversely, if we replace this with 0.5, so something less than 1, then this becomes 0.5 to the power of l; this matrix becomes 0.5 to the l minus 1, times x, again ignoring wL. So if each of your matrices is a little bit less than the identity, then if, let's say, x1 and x2 were both 1, the activations will be one half, one half, then one quarter, one quarter, then one eighth, one eighth, and so on, until this becomes 1 over 2 to the l. So the activation values will decrease exponentially as a function of the depth, as a function of the number of layers l of the network. And so, if you have a very deep network, the activations end up decreasing exponentially.
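As a rough illustration of the calculation above, here is a minimal NumPy sketch (my own illustration, not from the lecture). It forward-propagates an input through a deep network with linear activations, b set to zero, and every weight matrix equal to a scaled identity; the function name `forward_deep_linear`, the 2-unit layer width, the depth of 50, and the scale factors 1.5 and 0.5 are all just illustrative choices.

```python
import numpy as np

def forward_deep_linear(x, scale, num_layers):
    """Forward pass through a deep network with g(z) = z, b = 0,
    and every hidden-layer weight matrix W^[l] = scale * I (2x2 here)."""
    W = scale * np.eye(2)          # each layer's weight matrix
    a = x
    for _ in range(num_layers):
        a = W @ a                  # z = W a + 0, and g(z) = z, so a <- W a
    return a

x = np.array([1.0, 1.0])           # example input with x1 = x2 = 1

# Weights a little larger than the identity: activations explode like 1.5^L.
print(forward_deep_linear(x, scale=1.5, num_layers=50))   # each entry ~ 6.4e8

# Weights a little smaller than the identity: activations vanish like 0.5^L.
print(forward_deep_linear(x, scale=0.5, num_layers=50))   # each entry ~ 8.9e-16
```

Running this with a depth of 50 already pushes the activations to around 10^8 in one case and 10^-16 in the other, which is the exponential growth and decay described above.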
So the intuition I hope you can take away from this is that if the weights w are all just a little bit bigger than 1, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode. And if w is just a little bit less than the identity, say 0.9, 0, 0, 0.9, then with a very deep network the activations will decrease exponentially. And even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of l, a similar argument can be used to show that the derivatives, or the gradients you compute with gradient descent, will also increase or decrease exponentially as a function of the number of layers.

With some of the modern neural networks, you actually have l equal to 150. Microsoft recently got great results with a 152-layer neural network. But with such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of l, then these values could get really big or really small, and this makes training difficult, especially if your gradients are exponentially small in l. Then gradient descent will take tiny little steps, and it will take a long time for gradient descent to learn anything.

To summarize, you've seen how deep networks suffer from the problems of vanishing or exploding gradients. In fact, for a long time this problem was a huge barrier to training deep neural networks. It turns out there's a partial solution that doesn't completely solve this problem but helps a lot, which is a careful choice of how you initialize the weights. To see that, let's go on to the next video.
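As a companion to the activation sketch above, here is a minimal sketch (again my own illustration, not from the lecture) of the claim that the gradients behave the same way. In the same deep linear network, backpropagating an upstream gradient through a layer just multiplies it by the transpose of that layer's weight matrix, so after L layers it scales like scale^L as well; the function name and the specific numbers are hypothetical.

```python
import numpy as np

def backward_deep_linear(grad_out, scale, num_layers):
    """Backpropagate a gradient through the same deep linear network:
    each layer z = W a multiplies the upstream gradient by W^T = scale * I."""
    W = scale * np.eye(2)
    g = grad_out
    for _ in range(num_layers):
        g = W.T @ g                # chain rule through one layer
    return g

grad_y = np.array([1.0, 0.0])      # some upstream gradient dL/dy_hat

print(backward_deep_linear(grad_y, scale=1.5, num_layers=50))  # explodes like 1.5^50 ~ 6.4e8
print(backward_deep_linear(grad_y, scale=0.5, num_layers=50))  # vanishes like 0.5^50 ~ 8.9e-16
```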