So, you've seen the equations for how to implement batch norm for a single hidden layer. Let's see how it fits into the training of a deep network. Say you have a neural network like this one. You've seen me say before that you can view each hidden unit as computing two things: first it computes Z, and then it applies the activation function to compute A. So you can think of each of these circles as representing a two-step computation, and similarly for the units in the next layer, and so on.

If you were not applying batch norm, you would take the input X, feed it into the first hidden layer, and compute Z1, which is governed by the parameters W1 and B1. Ordinarily, you would then feed Z1 into the activation function to compute A1. But what we'll do with batch norm is take this value Z1 and apply batch norm, sometimes abbreviated BN, to it. That step is governed by the parameters beta 1 and gamma 1, and it gives you the new normalized value Z tilde 1. You then feed that to the activation function to get A1, which is G1 applied to Z tilde 1. So for the first layer, the batch norm step really happens in between computing Z and computing A.

Next, you take this value A1 and use it to compute Z2, which is governed by W2 and B2. Similar to what you did for the first layer, you take Z2 and apply batch norm to it, governed by batch norm parameters specific to this layer, beta 2 and gamma 2. That gives you Z tilde 2, which you use to compute A2 by applying the activation function, and so on. Once again, the batch norm step happens between computing Z and computing A. The intuition is that instead of using the unnormalized value Z1, you use the normalized value Z tilde 1, and in the second layer as well, instead of using the unnormalized value Z2, you use the mean and variance normalized value Z tilde 2.

So the parameters of your network are going to be W1, B1 through WL, BL. It turns out we'll get rid of the B parameters; we'll see why on the next slide. But for now, imagine the parameters are the usual W1, B1 through WL, BL, and to this network we've added the new parameters beta 1, gamma 1, beta 2, gamma 2, and so on, for each layer in which you apply batch norm. For clarity, note that these betas have nothing to do with the hyperparameter beta we used for momentum or for computing the various exponentially weighted averages. The authors of the Adam paper used beta in their paper to denote that hyperparameter, and the authors of the batch norm paper used beta to denote this parameter, but these are two completely different betas. I decided to stick with beta in both cases, in case you read the original papers. So the beta 1, beta 2, and so on that batch norm learns is a different beta than the hyperparameter beta used in momentum, Adam, and RMSprop.

Now that these are the new parameters of your algorithm, you can use whatever optimization you want, such as gradient descent, to train them. For example, you might compute D beta L for a given layer and then update the parameter: beta L gets updated as beta L minus the learning rate times D beta L.
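To make the ordering concrete, here is a minimal NumPy sketch of one hidden layer's forward step with the batch norm operation sitting between the linear computation and the activation. The function names and shapes are illustrative only, not the course's assignment code; Z is assumed to be NL by m, with the mini-batch along the columns.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Normalize each hidden unit across the mini-batch (columns of Z),
    # then rescale by gamma and shift by beta (both shaped (n_l, 1)).
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # mean 0, variance 1 per unit
    return gamma * Z_norm + beta             # Z_tilde

def layer_forward(A_prev, W, b, gamma, beta, activation=np.tanh):
    # One hidden layer: linear step, batch norm, then activation.
    # (The bias b is kept here to match the slide; the next part of the
    # video explains why it can be dropped once batch norm is used.)
    Z = W @ A_prev + b
    Z_tilde = batch_norm_forward(Z, gamma, beta)
    return activation(Z_tilde)               # A[l] = g[l](Z_tilde[l])
```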
And you can also use Adam or RMSprop or momentum to update the parameters beta and gamma, not just gradient descent. Even though in the previous video I explained what the batch norm operation does, computing means and variances and subtracting and dividing by them, if you're using a deep learning programming framework you usually won't have to implement the batch norm step or the batch norm layer yourself. In some of the programming frameworks it can be just one line of code; for example, in the TensorFlow framework you can implement batch normalization with a single built-in function. We'll talk more about programming frameworks later, but in practice you might not end up needing to implement all these details yourself. It's still worth knowing how it works, though, so that you have a better understanding of what your code is doing.

Now, so far we've talked about batch norm as if you were training on your entire training set at a time, as if you were using batch gradient descent. In practice, batch norm is usually applied with mini-batches of your training set. So the way you actually apply batch norm is: you take your first mini-batch and compute Z1, same as we did on the previous slide, using the parameters W1 and B1. Then you take just this mini-batch and compute the mean and variance of the Z1 values on just this mini-batch; batch norm subtracts the mean, divides by the standard deviation, and then rescales by gamma 1 and beta 1 to give you Z tilde 1, all on the first mini-batch. Then you apply the activation function to get A1, compute Z2 using W2 and B2, and so on. You do all of this in order to perform one step of, say, gradient descent on the first mini-batch. Then you go to the second mini-batch, X2, and do something similar: you compute Z1 on the second mini-batch and use batch norm to compute Z tilde 1. Here, the batch norm step normalizes using just the data in your second mini-batch: it looks at the examples in the second mini-batch, computes the mean and variance of the Z1 values on just that mini-batch, and rescales by gamma and beta to get Z tilde 1. You do the same with the third mini-batch, and keep training.

Now, there's one detail of the parameterization I want to clean up. Previously I said that the parameters were WL and BL for each layer, as well as beta L and gamma L. Notice that ZL is computed as ZL equals WL times AL minus 1, plus BL. But what batch norm does is look at the mini-batch and normalize ZL to have mean zero and unit variance, and then rescale by gamma and beta. That means whatever the value of BL is, it just gets subtracted out, because during the batch norm normalization step you compute the mean of the ZL values and then subtract that mean. So adding any constant to all of the examples in a mini-batch doesn't change anything; any constant you add gets canceled out by the mean subtraction step. If you're using batch norm, you can therefore eliminate that parameter, or, if you prefer, think of it as being set permanently to zero. The parameterization then becomes ZL equals WL times AL minus 1, and then you compute the normalized ZL.
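As an aside, in current TensorFlow the "one line of code" is typically something like a tf.keras.layers.BatchNormalization layer. The claim that BL gets canceled is also easy to check numerically; here is a small NumPy sketch (the array sizes are arbitrary) showing that adding a per-unit constant to every example in a mini-batch leaves the normalized values unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))        # 4 hidden units, mini-batch of 8 examples
b = rng.normal(size=(4, 1))        # a per-unit constant, like the bias B[l]

def normalize(Z, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

# Shifting every example by b shifts the mini-batch mean by exactly b,
# so the mean-subtraction step removes it entirely.
print(np.allclose(normalize(Z), normalize(Z + b)))   # True
```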
And when you then compute Z tilde L equals gamma L times the normalized ZL plus beta L, you end up using the parameter beta L to decide the mean of Z tilde L, which is what gets passed to the next layer. So just to recap: because batch norm zeroes out the mean of the ZL values in the layer, there's no point having the parameter BL, so you might as well get rid of it. It's effectively replaced by beta L, which is the parameter that ends up controlling the shift, or the bias terms.

Finally, remember that if you're doing this on one example, the dimension of ZL is NL by 1, where NL is the number of hidden units in layer L, and so BL had dimension NL by 1. The dimensions of beta L and gamma L are therefore also NL by 1, because that's the number of hidden units you have: beta L and gamma L are used to set the mean and variance of each of the hidden units to whatever the network wants them to be.

So let's put it all together and describe how you can implement gradient descent using batch norm. Assuming you're using mini-batch gradient descent, you iterate for T equals 1 to the number of mini-batches. You implement forward prop on mini-batch XT, and in each hidden layer you use batch norm to replace ZL with Z tilde L. This ensures that within that mini-batch, the Z values end up with a normalized mean and variance, and the normalized version is Z tilde L. Then you use back prop to compute DWL, DBL, D beta L, and D gamma L for all values of L, although technically, since we've gotten rid of B, DBL actually goes away. Finally, you update the parameters: W gets updated as W minus the learning rate times DW, as usual; beta gets updated as beta minus the learning rate times D beta; and similarly for gamma (a schematic code sketch of this loop appears below). If you've computed the gradients this way, you can use gradient descent, which is what I've written down here, but this also works with gradient descent with momentum, RMSprop, or Adam, where instead of taking this gradient descent update you use the updates given by those other algorithms, as we discussed in the previous week's videos. Those other optimization algorithms can also be used to update the parameters beta and gamma that batch norm added to your algorithm.

So I hope that gives you a sense of how you could implement batch norm from scratch if you wanted to. If you're using one of the deep learning programming frameworks, which we'll talk more about later, hopefully you can just call someone else's implementation in the programming framework, which will make using batch norm much easier. Now, in case batch norm still seems a little bit mysterious, if you're still not quite sure why it speeds up training so dramatically, let's go to the next video and talk more about why batch norm really works and what it's really doing.
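To recap the loop above in code form, here is a schematic sketch of one epoch of mini-batch gradient descent with the batch norm parameters. It is only an outline: grads_fn is a hypothetical placeholder standing in for your own forward prop and back prop (assumed to return dW, dgamma, and dbeta for every layer), and with momentum, RMSprop, or Adam you would simply swap out the update rule.

```python
def train_one_epoch(params, grads_fn, mini_batches, learning_rate, num_layers):
    """Mini-batch gradient descent where each layer's parameters are
    W[l], gamma[l], beta[l] -- the bias b[l] has been removed."""
    for X_t, Y_t in mini_batches:
        # grads_fn is a placeholder: it should run forward prop (using
        # Z_tilde[l] in place of Z[l]) and back prop on this mini-batch,
        # returning gradients dW[l], dgamma[l], dbeta[l] for every layer.
        grads = grads_fn(X_t, Y_t, params)
        for l in range(1, num_layers + 1):
            params[f"W{l}"]     -= learning_rate * grads[f"dW{l}"]
            params[f"gamma{l}"] -= learning_rate * grads[f"dgamma{l}"]
            params[f"beta{l}"]  -= learning_rate * grads[f"dbeta{l}"]
    return params
```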