So, why does batch norm work? Here's one reason. You've seen how normalizing the input features, the Xs, to mean 0 and variance 1 can speed up learning. So rather than having some features that range from 0 to 1 and some from 1 to 1,000, by normalizing all the input features X to take on a similar range of values, you can speed up learning. One intuition behind why batch norm works is that it is doing a similar thing, but for the values in your hidden units, not just for your input layer.

Now, this is just a partial picture of what batch norm is doing. There are a couple of further intuitions that will help you gain a deeper understanding of what batch norm is doing. Let's take a look at those in this video.

A second reason why batch norm works is that it makes weights later or deeper in the neural network, say the weights in layer 10, more robust to changes to weights in earlier layers of the neural network, say in layer 1. To explain what I mean, let's look at this motivating example. Let's say you're training a network, maybe a shallow network like logistic regression, or maybe a deeper network, on the famous cat detection task. But let's say you've trained on a data set containing only images of black cats. If you now try to apply this network to data where the positive examples are not just black cats, like on the left, but colored cats, like on the right, then your classifier might not do very well.

So in pictures, if your training set looked like this, with positive examples here and negative examples here, but you were to try to generalize it to a data set where maybe the positive examples are here and the negative examples are here, then you might not expect a model trained on the data on the left to do very well on the data on the right. And this is true even though there might be one function that works well on both, because you wouldn't expect your learning algorithm to discover that green decision boundary just by looking at the data on the left.

So this idea of your data distribution changing goes by the somewhat fancy name covariate shift. And the idea is that if you've learned some X to Y mapping, and the distribution of X changes, then you might need to retrain your learning algorithm. This is true even if the ground truth function mapping from X to Y remains unchanged, which it does in this example, because the ground truth function is whether this picture is a cat or not. And the need to retrain your function becomes even more acute, or it becomes even worse, if the ground truth function shifts as well.
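To make the covariate shift idea concrete, here is a minimal toy sketch, not the lecture's cat example: the data, the fixed rule y = 1 if x1 + x2 > 0, and the use of scikit-learn's LogisticRegression are all illustrative assumptions. The ground-truth function never changes, but a classifier fit on one input distribution does poorly once the inputs shift.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(pos_center, neg_center, n=500, scale=0.5):
    """Two clusters, labeled by the fixed ground-truth rule y = 1 if x1 + x2 > 0."""
    X = np.vstack([rng.normal(pos_center, scale, size=(n, 2)),
                   rng.normal(neg_center, scale, size=(n, 2))])
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

# Training distribution ("black cats"): the decision boundary is under-determined here.
X_train, y_train = make_data(pos_center=(0.5, 5.0), neg_center=(-0.5, -5.0))
# Shifted distribution ("colored cats"): same ground-truth rule, different inputs.
X_shift, y_shift = make_data(pos_center=(5.0, -2.0), neg_center=(-5.0, 2.0))

clf = LogisticRegression().fit(X_train, y_train)
print("accuracy on the training distribution:", clf.score(X_train, y_train))
print("accuracy under covariate shift:       ", clf.score(X_shift, y_shift))
```

Even though the rule x1 + x2 > 0 works everywhere, the model fit on the left-hand clusters latches onto a boundary that merely separates them, and that boundary fails once the input distribution moves.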
So how does this problem of covariate shift apply to a neural network? Consider a deep network like this, and let's look at the learning process from the perspective of this hidden layer, the third hidden layer. So this network has learned the parameters W^[3] and b^[3]. From the perspective of the third hidden layer, it gets some set of values from the earlier layers, and then it has to do some stuff to hopefully make the output y-hat close to the ground truth value y.

So let me cover up the nodes on the left for a second. From the perspective of this third hidden layer, it gets some values; let's call them a^[2]_1, a^[2]_2, a^[2]_3, and a^[2]_4. But these values might as well be features x_1, x_2, x_3, x_4, and the job of the third hidden layer is to take these values and find a way to map them to y-hat. So you can imagine doing gradient descent so that these parameters W^[3], b^[3], as well as maybe W^[4], b^[4], and even W^[5], b^[5], learn to do a good job mapping from the values I drew in black on the left to the output values y-hat.

But now let's uncover the left of the network again. The network is also adapting the parameters W^[2], b^[2] and W^[1], b^[1], and as those parameters change, the values a^[2] will also change. So from the perspective of the third hidden layer, these hidden unit values are changing all the time, and so it's suffering from the problem of covariate shift that we talked about on the previous slide.

So what batch norm does is reduce the amount that the distribution of these hidden unit values shifts around. If we were to plot the distribution of these hidden unit values, well, technically we normalize the z values, so this is really z^[2]_1 and z^[2]_2, and I'll plot two values instead of four so we can visualize it in 2D. What batch norm is saying is that the values of z^[2]_1 and z^[2]_2 can change, and indeed they will change when the neural network updates the parameters in the earlier layers. But what batch norm ensures is that no matter how they change, the mean and variance of z^[2]_1 and z^[2]_2 will remain the same. So even if the exact values of z^[2]_1 and z^[2]_2 change, their mean and variance will at least stay mean 0 and variance 1. Or not necessarily mean 0 and variance 1, but whatever values are governed by beta^[2] and gamma^[2], which, if the neural network chooses, can force them to be mean 0 and variance 1, or really any other mean and variance.

But what this does is it limits the amount to which updating the parameters in the earlier layers can affect the distribution of values that the third layer now sees and therefore has to learn on. So batch norm reduces the problem of the input values changing; it really causes these values to become more stable, so that the later layers of the neural network have firmer ground to stand on. And even though the input distribution changes a bit, it changes less, so that even as the earlier layers keep learning, the amount that the later layers are forced to adapt as the earlier layers change is reduced. Or, if you will, it weakens the coupling between what the earlier layers' parameters have to do and what the later layers' parameters have to do. And so it allows each layer of the network to learn by itself, a little bit more independently of the other layers, and this has the effect of speeding up learning in the whole network.

So I hope this gives some better intuition, but the takeaway is that batch norm means that, especially from the perspective of one of the later layers of the neural network, the values coming from the earlier layers don't get to shift around as much, because they're constrained to have the same mean and variance. And so this makes the job of learning in the later layers easier.
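Here is a small numerical sketch of that stability property. The shapes, the particular gamma^[2] and beta^[2] values, and the simulated drift in the earlier layers are all made up for illustration: however much the raw z^[2] values move as the earlier layers change, the normalized values handed to layer 3 keep the mean set by beta^[2] and the spread set by gamma^[2].

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Batch norm transform for one layer's pre-activations z
    (shape: number of units x mini-batch size); gamma and beta are learned per unit."""
    mu = z.mean(axis=1, keepdims=True)
    sigma2 = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(sigma2 + eps)   # per unit: mean 0, variance 1
    return gamma * z_norm + beta                # per unit: mean beta, std gamma

rng = np.random.default_rng(1)
gamma2 = np.array([[1.5], [0.5]])               # gamma^[2] for the two plotted units
beta2 = np.array([[0.0], [2.0]])                # beta^[2] for the two plotted units

# Simulate the earlier layers changing: the raw z^[2] drifts in scale and location,
# but the batch-normalized values keep the same per-unit mean and standard deviation.
for scale, shift in [(1.0, 0.0), (10.0, 5.0), (0.1, -3.0)]:
    z2 = scale * rng.standard_normal((2, 64)) + shift
    z2_tilde = batch_norm_forward(z2, gamma2, beta2)
    print(np.round(z2_tilde.mean(axis=1), 3), np.round(z2_tilde.std(axis=1), 3))
```

Each pass prints a per-unit mean of roughly (0, 2) and a standard deviation of roughly (1.5, 0.5), no matter how the raw z^[2] values were scaled and shifted.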
It turns out batch norm has a second effect: it has a slight regularization effect. One non-intuitive thing about batch norm is that each mini-batch, I'll say mini-batch X^{t}, has its values z^[l] scaled by the mean and variance computed on just that one mini-batch. Now, because that mean and variance are computed on just that mini-batch, as opposed to on the entire data set, they have a little bit of noise in them, because they're estimated from just your mini-batch of, say, 64 or 128 or maybe 256 or more training examples.

So because the mean and variance are a little bit noisy, being estimated from a relatively small sample of data, the scaling process going from z^[l] to z̃^[l] is a little bit noisy as well, because it's computed using a slightly noisy mean and variance. So, similar to dropout, it adds some noise to each hidden layer's activations. The way dropout adds noise is that it takes each hidden unit and multiplies it by 0 with some probability and by 1 with some probability. So dropout adds multiplicative noise, because it's multiplying by 0 or 1, whereas batch norm has multiplicative noise, because it's scaling by the standard deviation, as well as additive noise, because it's subtracting the mean, and both the mean and the standard deviation are noisy. And so, similar to dropout, batch norm therefore has a slight regularization effect, because by adding noise to the hidden units it's forcing the downstream hidden units not to rely too much on any one hidden unit.

Because the noise added is quite small, this is not a huge regularization effect, and you might use batch norm together with dropout if you want the more powerful regularization effect of dropout. And maybe one other slightly non-intuitive effect is that if you use a bigger mini-batch size, say 512 instead of 64, then by using a larger mini-batch size you're reducing this noise and therefore also reducing this regularization effect. That's a slightly strange property of batch norm's regularization: by using a bigger mini-batch size, you reduce the regularization effect.

Having said this, I wouldn't really use batch norm as a regularizer; that's really not the intent of batch norm. But sometimes it has this extra, unintended, effect on your learning algorithm. Really, don't turn to batch norm as a regularizer. Use it as a way to normalize your hidden units' activations and therefore speed up learning, and I think the regularization is an almost unintended side effect.

So I hope that gives you better intuition about what batch norm is doing. Before we wrap up the discussion on batch norm, there's one more detail I want to make sure you know, which is that batch norm handles data one mini-batch at a time; it computes means and variances on mini-batches. But at test time, when you're trying to make predictions and evaluate the neural network, you might not have a mini-batch of examples; you might be processing one single example at a time. So at test time, you need to do something slightly differently to make sure your predictions make sense. In the next and final video on batch norm, let's talk over the details of what you need to do in order to take a neural network trained using batch norm and use it to make predictions.
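As a quick numerical sketch of why bigger mini-batches mean less noise (the stand-in z values and the batch sizes here are illustrative assumptions, not from the lecture): the per-mini-batch mean of a hidden unit's z values wobbles noticeably at size 64 and much less at size 512.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for one hidden unit's z values over the whole training set.
z = rng.standard_normal(102_400)

# How much does the per-mini-batch mean wobble for different mini-batch sizes?
for m in (64, 128, 256, 512):
    batch_means = z[: (len(z) // m) * m].reshape(-1, m).mean(axis=1)
    print(f"mini-batch size {m:3d}: std of per-batch means = {batch_means.std():.4f}")
```

The spread of the per-mini-batch estimates shrinks roughly like 1/sqrt(mini-batch size), which is why a larger mini-batch weakens batch norm's slight regularization effect.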