In the previous video, you saw how you can use mini-batch gradient descent to start making progress, to start taking gradient descent steps, even when you're just partway through processing your training set for the first time. In this video, you'll learn more details of how to implement gradient descent and gain a better understanding of what it's doing and why it works. With batch gradient descent, on every iteration you go through the entire training set, and you'd expect the cost to go down on every single iteration. So if you plot the cost function J as a function of the number of iterations, it should decrease on every single iteration, and if it ever goes up even on one iteration, then something's wrong. Maybe the learning rate's too big. With mini-batch gradient descent, though, if you plot progress on your cost function, it may not decrease on every iteration. In particular, on every iteration you're processing some mini-batch X^{t}, Y^{t}, and so if you plot the cost function J^{t}, which is computed using just X^{t}, Y^{t}, then it's as if on every iteration you're training on a different training set, or really on a different mini-batch. So if you plot the cost function J^{t}, you're more likely to see a curve that trends downwards but is also a little bit noisier. So if you plot J^{t} as you're training with mini-batch gradient descent, maybe over multiple epochs, you might expect to see a noisy, downward-trending curve. It's okay if it doesn't go down on every iteration, but it should trend downwards. The reason it'll be a little bit noisy is that maybe X^{1}, Y^{1} is just a relatively easy mini-batch, so your cost might be a bit lower, but then maybe, just by chance, X^{2}, Y^{2} is a harder mini-batch. Maybe it even has some mislabeled examples in it, in which case the cost would be a bit higher, and so on. So that's why you get these oscillations as you plot the cost when you're running mini-batch gradient descent.
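To make that noisy-cost picture concrete, here's a minimal sketch — my own toy example, not code from the course — of logistic regression trained with mini-batch gradient descent, logging the cost J^{t} of each individual mini-batch. The per-batch costs oscillate, but their trend across epochs moves downward.

```python
import numpy as np

# Toy data: 1,000 examples, 5 features, labels from a random linear rule.
rng = np.random.default_rng(0)
m, n, batch = 1000, 5, 100
X = rng.normal(size=(m, n))
true_w = rng.normal(size=n)
y = (X @ true_w + 0.1 * rng.normal(size=m) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(n)
lr = 0.5
costs = []  # one entry per mini-batch (J^{t}), not per epoch
for epoch in range(20):
    for t in range(0, m, batch):
        Xt, yt = X[t:t + batch], y[t:t + batch]
        p = sigmoid(Xt @ w)
        eps = 1e-12  # avoid log(0)
        J_t = -np.mean(yt * np.log(p + eps) + (1 - yt) * np.log(1 - p + eps))
        costs.append(J_t)
        # Gradient of the cross-entropy cost on this mini-batch only.
        w -= lr * Xt.T @ (p - yt) / len(yt)

# Noisy batch to batch, but the trend is down across epochs.
print(np.mean(costs[:10]), np.mean(costs[-10:]))
```

Plotting `costs` would show exactly the oscillating, downward-trending curve described above.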
Now, one of the parameters you need to choose is the size of your mini-batch. Recall that m is the training set size. At one extreme, if the mini-batch size is equal to m, then you just end up with batch gradient descent. In this extreme, you would just have one mini-batch, X^{1}, Y^{1}, and this mini-batch is equal to your entire training set. So setting the mini-batch size to m just gives you batch gradient descent. The other extreme would be if your mini-batch size were equal to 1. This gives you an algorithm called stochastic gradient descent, where every example is its own mini-batch. What you do in this case is look at the first mini-batch, X^{1}, Y^{1}, but when your mini-batch size is 1, this is just your first training example, and you take a gradient descent step with that first training example. Then you look at your second mini-batch, which is just your second training example, and take a gradient descent step with that. Then you do it with the third training example, and so on, looking at just one single training example at a time. So let's look at what these two extremes will do on optimizing the cost function. If you picture the contours of the cost function you're trying to minimize, with the minimum in the middle, then batch gradient descent might start somewhere and be able to take relatively low-noise, relatively large steps, and just keep marching toward the minimum. In contrast, with stochastic gradient descent, starting from some point, on every iteration you're taking a gradient descent step with just a single training example. Most of the time you head toward the global minimum, but sometimes you head in the wrong direction, if that one example happens to point you in a bad direction. So stochastic gradient descent can be extremely noisy; on average it will take you in a good direction, but sometimes it will head in the wrong direction as well.
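The two extremes can be seen as the same update loop with different mini-batch sizes. Here's a sketch under my own toy setup (linear regression; the function name `run` is mine, not the lecture's): setting `batch_size == m` recovers batch gradient descent, while `batch_size == 1` is stochastic gradient descent.

```python
import numpy as np

# Toy regression data with a known weight vector.
rng = np.random.default_rng(1)
m, n = 200, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=m)

def run(batch_size, lr, epochs=50):
    """One generic mini-batch loop; batch_size selects the algorithm."""
    w = np.zeros(n)
    for _ in range(epochs):
        perm = rng.permutation(m)  # reshuffle each epoch
        for t in range(0, m, batch_size):
            idx = perm[t:t + batch_size]
            Xt, yt = X[idx], y[idx]
            grad = Xt.T @ (Xt @ w - yt) / len(idx)  # least-squares gradient
            w -= lr * grad
    return w

w_batch = run(batch_size=m, lr=0.1)   # one step per epoch: batch GD
w_sgd = run(batch_size=1, lr=0.01)    # one step per example: SGD
print(w_batch, w_sgd)
```

Both extremes end up near the true weights here; the difference in practice is the noise of each step and the cost per iteration, as discussed next.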
And stochastic gradient descent won't ever converge; it will always just kind of oscillate and wander around the region of the minimum, but it won't ever just get to the minimum and stay there. In practice, the mini-batch size you use will be somewhere in between: somewhere between 1 and m, since 1 and m are respectively too small and too large. And here's why. If you use batch gradient descent, so your mini-batch size equals m, then you're processing a huge training set on every iteration. The main disadvantage of this is that it takes too much time, too long per iteration, assuming you have a very large training set. If you have a small training set, then batch gradient descent is fine. If you go to the opposite extreme and use stochastic gradient descent, then it's nice that you get to make progress after processing just one example. The noisiness is actually not a problem; it can be ameliorated, or reduced, by just using a smaller learning rate. But the huge disadvantage of stochastic gradient descent is that you lose almost all your speedup from vectorization, because you're processing a single training example at a time, so the way you process each example is going to be very inefficient. So what works best in practice is something in between, where you have some mini-batch size that's not too big or too small. This gives you, in practice, the fastest learning. Notice that this has two good things going for it. One is that you do get a lot of vectorization. In the example we used in the previous video, if your mini-batch size was 1,000 examples, then you might be able to vectorize across 1,000 examples, which is going to be much faster than processing the examples one at a time. And second, you can also make progress without needing to wait until you've processed the entire training set.
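On the vectorization point: here's a quick sketch (a toy least-squares gradient, my own example) showing that a single vectorized matrix operation over a mini-batch computes exactly the average of the per-example gradients. That equivalence is why the vectorized form can be so much faster than the one-example-at-a-time loop that stochastic gradient descent forces on you.

```python
import numpy as np

rng = np.random.default_rng(2)
batch, n = 128, 4
Xt = rng.normal(size=(batch, n))   # one mini-batch of inputs
yt = rng.normal(size=batch)        # corresponding targets
w = rng.normal(size=n)

# One vectorized pass over the whole mini-batch.
grad_vec = Xt.T @ (Xt @ w - yt) / batch

# The same quantity, one training example at a time (the slow path).
grad_loop = np.zeros(n)
for i in range(batch):
    grad_loop += Xt[i] * (Xt[i] @ w - yt[i])
grad_loop /= batch

print(np.allclose(grad_vec, grad_loop))  # → True
```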
So again, using the numbers we have from the previous video, each epoch, or each pass through your training set, allows you to take 5,000 gradient descent steps. So in practice, there'll be some in-between mini-batch size that works best. With mini-batch gradient descent, it's not guaranteed to always head toward the minimum, but it tends to head more consistently in the direction of the minimum than stochastic gradient descent. And it doesn't always exactly converge; it may oscillate in a very small region. If that's an issue, you can always reduce the learning rate slowly. We'll talk more about learning rate decay, or how to reduce the learning rate, in a later video. So if the mini-batch size should not be m and should not be 1, it should be something in between. How do you go about choosing it? Well, here are some guidelines. First, if you have a small training set, just use batch gradient descent. If you have a small training set, then there's no point using mini-batch gradient descent; you can process the whole training set quite fast, so you might as well use batch gradient descent. What does a small training set mean? I would say if it's less than maybe 2,000 examples, it'd be perfectly fine to just use batch gradient descent. Otherwise, if you have a bigger training set, typical mini-batch sizes would be anything from 64 up to maybe 512. And because of the way computer memory is laid out and accessed, sometimes your code runs faster if your mini-batch size is a power of 2: 64 is 2 to the 6th, 128 is 2 to the 7th, 256 is 2 to the 8th, and 512 is 2 to the 9th. So often, I'll implement my mini-batch size to be a power of 2. I know that in the previous video I used a mini-batch size of 1,000; if you really wanted to do that, I would recommend you just use 1,024, which is 2 to the power of 10. You do see mini-batch sizes of 1,024, but it is a bit more rare.
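As a sketch of these guidelines, here's a hypothetical helper (the function name `make_minibatches` is mine, not from the course) that shuffles the training set and splits it into power-of-2 mini-batches. The last mini-batch may be smaller when m isn't a multiple of the batch size.

```python
import numpy as np

def make_minibatches(X, y, batch_size=64, seed=0):
    """Shuffle (X, y) together and split into mini-batches of batch_size."""
    assert batch_size & (batch_size - 1) == 0, "use a power of 2"
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    X, y = X[perm], y[perm]  # same permutation keeps pairs aligned
    return [(X[t:t + batch_size], y[t:t + batch_size])
            for t in range(0, len(X), batch_size)]

# 150 examples with batch_size 64 -> batches of 64, 64, and 22.
X = np.arange(300, dtype=float).reshape(150, 2)
y = np.arange(150, dtype=float)
batches = make_minibatches(X, y, batch_size=64)
print(len(batches), [len(bx) for bx, _ in batches])
```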
This range of mini-batch sizes is a little bit more common. One last tip is to make sure that your mini-batch, all of your X^{t}, Y^{t}, fits in CPU/GPU memory. This really depends on your application and how large a single training example is. But if you ever process a mini-batch that doesn't actually fit in CPU/GPU memory, whatever you're using to process the data, then you'll find that the performance suddenly falls off a cliff and is suddenly much worse. So I hope this gives you a sense of the typical range of mini-batch sizes that people use. In practice, of course, the mini-batch size is actually another hyperparameter that you might do a quick search over to try to figure out which one is most efficient at reducing your cost function J. So what I would do is just try several different values, try a few different powers of 2, and then see if you can pick one that makes your gradient descent optimization algorithm as efficient as possible. Hopefully this gives you a set of guidelines for how to get started with that hyperparameter search. You now know how to implement mini-batch gradient descent and make your algorithm run much faster, especially when you're training on a large training set. But it turns out there are even more efficient algorithms than gradient descent or mini-batch gradient descent. Let's start talking about them in the next few videos.
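Treating the mini-batch size as a hyperparameter can be as simple as the following sketch (the model, data, and the name `final_cost` are my own stand-ins, not from the course): try a few powers of 2 and keep the one that drives the training cost lowest.

```python
import numpy as np

# Toy regression problem to search over.
rng = np.random.default_rng(3)
m, n = 512, 4
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n)

def final_cost(batch_size, lr=0.1, epochs=10):
    """Train with the given mini-batch size; return the final training cost."""
    w = np.zeros(n)
    for _ in range(epochs):
        for t in range(0, m, batch_size):
            Xt, yt = X[t:t + batch_size], y[t:t + batch_size]
            w -= lr * Xt.T @ (Xt @ w - yt) / len(yt)
    return float(np.mean((X @ w - y) ** 2) / 2)

# Quick search over a few powers of 2, as suggested above.
results = {bs: final_cost(bs) for bs in (64, 128, 256, 512)}
best = min(results, key=results.get)
print(results, "best:", best)
```

In a real search you'd compare wall-clock time to reach a target cost, not just the final cost, since per-step speed is the whole point of the trade-off.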