Hello, and welcome back. This week, you'll learn about optimization algorithms that will enable you to train your neural networks much faster. You've heard me say before that applying machine learning is a highly empirical, highly iterative process, in which you just have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly. One thing that makes this more difficult is that deep learning tends to work best in the regime of big data, when you're able to train your neural networks on a huge data set, and training on a large data set is just slow. So what you find is that having fast, good optimization algorithms can really speed up the efficiency of you and your team. Let's get started by talking about mini-batch gradient descent.

You've learned previously that vectorization allows you to efficiently compute on all m examples, so you can process your whole training set without an explicit for loop. That's why we take our training examples and stack them into one huge matrix, capital X: it's x(1), x(2), x(3), and so on up to x(m), if you have m training examples. Similarly for Y: y(1), y(2), y(3), and so on up to y(m). So the dimension of X was n_x by m, and the dimension of Y was 1 by m.

Vectorization allows you to process all m examples relatively quickly. But if m is very large, it can still be slow. For example, what if m were 5 million, or 50 million, or even bigger? With batch gradient descent on your whole training set, you have to process the entire training set before you take one little step of gradient descent, and then process the entire training set of 5 million examples again before you take another little step. It turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire giant training set of 5 million examples.

In particular, here's what you can do. Split your training set into smaller, little baby training sets, and these baby training sets are called mini-batches. Let's say each of your baby training sets has just 1,000 examples. You take x(1) through x(1,000) and call that your first little baby training set, also called a mini-batch. Then you take the next 1,000 examples, x(1,001) through x(2,000), and call that the second mini-batch, and so on. I'm going to introduce a new notation: I'll call the first one X superscript with curly braces 1, written X{1}, and the second one X superscript with curly braces 2, written X{2}. Now, if you have 5 million training examples total and each of these little mini-batches has 1,000 examples, that means you have 5,000 of them, because 5,000 times 1,000 equals 5 million. So altogether you have 5,000 mini-batches, ending with X{5,000}.

Then you do the same thing for Y: you split up the labels accordingly. So y(1) through y(1,000) becomes Y{1}, y(1,001) through y(2,000) becomes Y{2}, and so on up to Y{5,000}. Mini-batch number t, then, is comprised of X{t} and Y{t}: 1,000 training examples with their corresponding input-output pairs.
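To make the splitting step concrete, here is a minimal NumPy sketch of how you might build these mini-batches. The function name make_mini_batches and the argument mini_batch_size are illustrative choices, not part of the lecture, and the sketch assumes X has shape (n_x, m) and Y has shape (1, m), as in the notation above.

```python
import numpy as np

def make_mini_batches(X, Y, mini_batch_size=1000):
    """Split (X, Y) into consecutive mini-batches X{t}, Y{t}.

    Assumes X has shape (n_x, m) and Y has shape (1, m).
    Returns a list of (X_t, Y_t) tuples, one per mini-batch.
    """
    m = X.shape[1]
    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = min(start + mini_batch_size, m)   # last mini-batch may be smaller
        X_t = X[:, start:end]                   # shape (n_x, up to 1,000)
        Y_t = Y[:, start:end]                   # shape (1, up to 1,000)
        mini_batches.append((X_t, Y_t))
    return mini_batches

# Example: m = 5,000,000 examples split into 5,000 mini-batches of 1,000 each
# X = np.random.randn(n_x, 5_000_000); Y = np.random.randint(0, 2, (1, 5_000_000))
# mini_batches = make_mini_batches(X, Y)        # len(mini_batches) == 5000
```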
Before moving on, just to make sure my notation is clear: we have previously used superscript round brackets, as in x(i), to index into the training set, so x(i) is the i-th training example. We use superscript square brackets, as in z[l], to index into the different layers of a neural network, so z[l] is the z value for the l-th layer of the network. And here we're introducing curly brackets, as in X{t}, Y{t}, to index into the different mini-batches.

To check your understanding of this notation, what are the dimensions of X{t} and Y{t}? Well, X is n_x by m, so if X{1} holds the x values for 1,000 examples, its dimension is n_x by 1,000, and X{2} is also n_x by 1,000, and so on. So all of the X{t} have dimension n_x by 1,000, and all of the Y{t} have dimension 1 by 1,000.

To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we've been talking about previously, where you process your entire training set at the same time. The name comes from viewing that as processing your entire batch of training examples all at once. I'm not sure it's a great name, but that's just what it's called. Mini-batch gradient descent, in contrast, refers to the algorithm, which we'll talk about on the next slide, in which you process a single mini-batch X{t}, Y{t} at a time, rather than the entire training set X, Y at the same time.

So let's see how mini-batch gradient descent works. To run mini-batch gradient descent on your training set, you run for t equals 1 to 5,000, because we had 5,000 mini-batches of size 1,000 each. Inside the for loop, you basically implement one step of gradient descent using X{t}, Y{t}. It's as if you had a training set of 1,000 examples and you were implementing the algorithm you're already familiar with on that little training set of size m equals 1,000. Rather than having an explicit for loop over all 1,000 examples, you use vectorization to process all 1,000 examples at the same time.

Let's write this out. You implement forward prop on the inputs, so just on X{t}. You do that by computing Z[1] equals W[1] X plus b[1], except that, since you aren't processing the entire training set but only mini-batch t, the X here becomes X{t}. Then you have A[1] equals g[1] of Z[1]; this is a capital Z and a capital A, since this is a vectorized implementation. And so on, until you end up with A[L] equals g[L] of Z[L], which is your prediction Y hat. Notice that you should still use a vectorized implementation here; it's just that this vectorized implementation processes 1,000 examples at a time rather than 5 million.

Next, you compute the cost function J, which I'm going to write as 1 over 1,000, since here 1,000 is the size of your little training set, times the sum from i equals 1 through 1,000 of the loss of y hat (i), y(i), where this notation refers to examples from the mini-batch X{t}, Y{t}. If you're using regularization, you also add the regularization term, lambda over 2 times 1,000, times the sum over l of the squared Frobenius norms of the weight matrices. Because this is really the cost on just one mini-batch, I'm going to index this cost J with a superscript t in curly braces, J{t}.
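To make the forward prop and cost computation concrete, here is an illustrative NumPy sketch for a small two-layer network on a single mini-batch. The function name cost_on_mini_batch, the choice of ReLU and sigmoid activations, and the parameter dictionary layout are assumptions for the example; the point is that one set of matrix operations processes all 1,000 examples in the mini-batch, and the cost J{t} is averaged over those 1,000 examples.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def cost_on_mini_batch(X_t, Y_t, params, lambd=0.0):
    """Forward prop and cost J{t} on one mini-batch (X_t, Y_t).

    Illustrative two-layer network; params holds W1, b1, W2, b2.
    """
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]

    # Vectorized forward prop over all examples in this mini-batch at once
    Z1 = W1 @ X_t + b1
    A1 = relu(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)                     # Y hat for the mini-batch

    m_t = X_t.shape[1]                   # mini-batch size, e.g. 1,000
    eps = 1e-8                           # keeps the logs numerically stable
    cross_entropy = -np.sum(Y_t * np.log(A2 + eps)
                            + (1 - Y_t) * np.log(1 - A2 + eps)) / m_t

    # Optional L2 term: lambda / (2 * 1,000) times the sum of squared Frobenius norms
    l2 = (lambd / (2 * m_t)) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return cross_entropy + l2
```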
So you notice that everything we're doing is exactly the same as when we were previously implementing gradient descent, except that instead of doing it on X, Y, you're now doing it on X{t}, Y{t}. Next, you implement backprop to compute the gradients with respect to this J{t}, still using only X{t}, Y{t}. Then you update the weights: every W[l] gets updated as W[l] minus alpha times dW[l], and similarly for b[l].

This is one pass through your training set using mini-batch gradient descent. The code I've written down here is also called doing one epoch of training, where an epoch just means a single pass through the training set. So whereas with batch gradient descent a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is, one epoch, allows you to take 5,000 gradient descent steps. Now, of course, you usually want to take multiple passes through the training set, so you might have another for loop or while loop around all of this, and you keep taking passes through the training set until hopefully you converge, or at least approximately converge.

When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and it's pretty much what everyone in deep learning uses when training on a large data set. In the next video, let's delve deeper into mini-batch gradient descent so you can get a better understanding of what it's doing and why it works so well.
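To tie the pieces together, here is one possible sketch of that outer training loop, with an epoch loop wrapped around the mini-batch loop. It reuses the make_mini_batches helper sketched earlier; compute_gradients is a hypothetical placeholder standing in for your own forward prop plus backprop on one mini-batch, and alpha, num_epochs, and the parameter names are illustrative choices, not part of the lecture.

```python
def mini_batch_gradient_descent(X, Y, params, compute_gradients,
                                num_epochs=10, mini_batch_size=1000, alpha=0.01):
    """Sketch of the outer loop: multiple epochs of mini-batch gradient descent.

    compute_gradients(X_t, Y_t, params) is a placeholder for your own
    forward prop + backprop on one mini-batch; it should return a dict
    like {"dW1": ..., "db1": ..., "dW2": ..., "db2": ...}.
    """
    for epoch in range(num_epochs):                          # one epoch = one pass over the data
        for X_t, Y_t in make_mini_batches(X, Y, mini_batch_size):
            grads = compute_gradients(X_t, Y_t, params)      # backprop on the mini-batch cost J{t}
            for key in list(params):                         # update every W[l] and b[l]
                params[key] = params[key] - alpha * grads["d" + key]
    return params
```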