When you train your neural network, it's important to initialize the weights randomly. For logistic regression, it was okay to initialize the weights to 0, but for a neural network, if you initialize all the weights to 0 and then apply gradient descent, it won't work. Let's see why. So you have here two input features, so n[0] = 2, and two hidden units, so n[1] = 2. The weight matrix associated with the hidden layer, W[1], is going to be 2 by 2. Let's say you initialize it to all 0s, a 2 by 2 matrix of zeros, and that b[1] is also [0, 0]. It turns out initializing the bias terms b to 0 is actually okay, but initializing W to all 0s is a problem. The problem with this form of initialization is that for any example you give it, the activations a[1]_1 and a[1]_2 will be equal, because both hidden units are computing exactly the same function. And when you compute backpropagation, it turns out that dz[1]_1 and dz[1]_2 will also be the same, by symmetry, because both hidden units were initialized the same way. Technically, for what I'm saying, I'm assuming the outgoing weights are also identical, so W[2] = [0, 0]. So if you initialize the neural network this way, these two hidden units are completely identical. We sometimes say they're completely symmetric, which just means they're computing exactly the same function. And by a proof by induction, it turns out that after every single iteration of training, your two hidden units are still computing exactly the same function. In particular, it's possible to show that dW will be a matrix where every row takes on the same value. So when you perform the weight update W[1] := W[1] - alpha * dW, you find that after every iteration, the first row of W[1] is equal to the second row.
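To make this concrete, here is a small numerical sketch of the 2-2-1 network above with tanh hidden units, trained from an all-zero initialization. This is my own illustration, not the course's code; the data, labels, learning rate, and iteration count are arbitrary assumptions.

```python
import numpy as np

# Toy 2-2-1 network (tanh hidden layer, sigmoid output), all weights
# initialized to zero, to illustrate the symmetry problem.
np.random.seed(0)
X = np.random.randn(2, 5)                       # 2 features, 5 examples
Y = (np.random.rand(1, 5) > 0.5).astype(float)  # arbitrary binary labels

W1 = np.zeros((2, 2)); b1 = np.zeros((2, 1))
W2 = np.zeros((1, 2)); b2 = np.zeros((1, 1))

for _ in range(100):                             # gradient descent
    Z1 = W1 @ X + b1                             # forward prop
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))               # sigmoid output
    m = X.shape[1]
    dZ2 = A2 - Y                                 # back prop
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    W1 -= 0.5 * dW1; b1 -= 0.5 * db1             # weight updates
    W2 -= 0.5 * dW2; b2 -= 0.5 * db2

# The two hidden units never differentiate: the rows of W1 stay equal.
# (In this tanh case the effect is extreme: A1 stays 0, so every weight
# gradient vanishes and W1 and W2 never move at all; only b2 changes.)
print(np.allclose(W1[0], W1[1]))                 # True
```

Note that only b2 ever changes here; every gradient that flows through the zero hidden activations vanishes, which is an extreme form of the symmetry the lecture describes.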
So it's possible to construct a proof by induction that if you initialize all the values of W to 0, then because both hidden units start off computing the same function, and both hidden units have the same influence on the output unit, after one iteration that same statement is still true: the two hidden units are still symmetric. And therefore, by induction, after two iterations, three iterations, and so on, no matter how long you train the neural network, both hidden units are still computing exactly the same function. So in this case, there's really no point to having more than one hidden unit, because they're all computing the same thing. And of course, for larger neural networks, say with three input features and a very large number of hidden units, a similar argument shows that with a network like this (I won't draw all the edges here), if you initialize the weights to 0, then all of your hidden units are symmetric, and no matter how long you run gradient descent, they'll all continue to compute exactly the same function. That's not helpful, because you want the different hidden units to compute different functions. The solution is to initialize your parameters randomly. So here's what you do. You can set W[1] = np.random.randn(2, 2), which generates samples from a standard Gaussian, and then you usually multiply this by a very small number, such as 0.01, so you initialize the weights to very small random values. And then b, it turns out, does not have this symmetry problem, what's called the symmetry breaking problem, so it's okay to initialize b to just 0s. As long as W is initialized randomly, you start off with the different hidden units computing different things, and you no longer have this symmetry breaking problem. Then similarly, W[2] you can initialize randomly, and b[2] you can initialize to 0. So you might be wondering, where did this constant come from?
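Put together, the initialization just described might look like this for the 2-2-1 example. This is a sketch; the fixed seed and the exact layer shapes are assumptions for illustration.

```python
import numpy as np

np.random.seed(1)                   # fixed seed, for reproducibility only
W1 = np.random.randn(2, 2) * 0.01  # shape (n[1], n[0]): small Gaussian values
b1 = np.zeros((2, 1))              # biases can safely start at zero
W2 = np.random.randn(1, 2) * 0.01  # shape (n[2], n[1])
b2 = np.zeros((1, 1))

# The two hidden units now start out different, so the symmetry is broken.
print(np.any(W1[0] != W1[1]))      # True
```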
And why is it 0.01? Why not 100, or 1000? It turns out we usually prefer to initialize the weights to very small random values, because if you're using, say, a tanh or sigmoid activation function, or even just a sigmoid at the output layer, then if the weights are too large, consider what happens when you compute the activation values. Remember that z[1] = W[1]x + b[1], and a[1] is the activation function applied to z[1]. So if W is very big, z will be very big, or at least some values of z will be either very large or very small, and in that case you're more likely to end up on the flat parts of the tanh or sigmoid function, where the slope or gradient is very small, meaning that gradient descent, and therefore learning, will be very slow. So just to recap: if W is too large, you're more likely to end up, even at the very start of training, with very large values of z, which causes your tanh or sigmoid activation function to be saturated, thus slowing down learning. If you don't have any sigmoid or tanh activation functions in your neural network, this is less of an issue, but if you're doing binary classification and your output unit is a sigmoid function, then you just don't want the initial parameters to be too large. So that's why multiplying by 0.01 is something reasonable to try, or any other small number. And the same goes for W[2]: that would be np.random.randn(1, 2) in this example, times 0.01. So finally, it turns out that sometimes there can be better constants than 0.01.
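As a quick check on the saturation argument: tanh's derivative is 1 - tanh(z)^2, and it collapses toward zero for large |z|, which is the flat part of the curve the lecture refers to.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

# Small z: slope near 1 (healthy gradients).
# Large z: slope near 0 (saturated, learning stalls).
for z in [0.01, 1.0, 5.0, 10.0]:
    print(f"z = {z:5.2f}   tanh'(z) = {tanh_grad(z):.2e}")
```

With weights scaled by 0.01, the initial pre-activations z stay near the steep region around 0, so the gradients start out healthy.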
When you're training a neural network with just one hidden layer, a relatively shallow network without too many hidden layers, setting the constant to 0.01 will probably work okay. But when you're training a very, very deep neural network, you might want to pick a different constant than 0.01, and in next week's material we'll talk a little bit about how and when you might want to choose one. Either way, it will usually end up being a relatively small number. So that's it for this week's videos. You now know how to set up a neural network with a hidden layer, initialize the parameters, make predictions using forward prop, and compute derivatives and implement gradient descent using back prop. With that, you should be able to do the quizzes as well as this week's programming exercises. Best of luck with that. I hope you have fun with the programming exercise, and I look forward to seeing you in the week 4 materials.