In the last video, we talked about exponentially weighted averages. This will turn out to be a key component of several optimization algorithms that you use to train your neural networks. So in this video, I want to delve a little deeper into intuitions for what this algorithm is really doing.

Recall that this is the key equation for implementing exponentially weighted averages:

v_t = beta * v_{t-1} + (1 - beta) * theta_t

So if beta equals 0.9, you got the red line. If it was much closer to 1, say 0.98, you'd get the green line. And if it was much smaller, maybe 0.5, you'd get the yellow line.

Let's look at a bit more of the math to understand how this is computing averages of the daily temperature. So here's that equation again, and let's set beta equals 0.9 and write out a few of the equations this corresponds to. Whereas when implementing it you'd have t going from 0 to 1 to 2 to 3, increasing values of t, to analyze it I've written it with decreasing values of t:

v_100 = 0.9 * v_99 + 0.1 * theta_100
v_99 = 0.9 * v_98 + 0.1 * theta_99
v_98 = 0.9 * v_97 + 0.1 * theta_98

and this goes on. So let's take the first equation and understand what v_100 really is. v_100 is going to be, let me reverse the two terms, 0.1 times theta_100 plus 0.9 times whatever the value was on the previous day. But what is v_99? Well, we just plug it in from the second equation: it's 0.1 times theta_99, again with the two terms reversed, plus 0.9 times v_98. And what is v_98? You get that from the third equation: 0.1 times theta_98 plus 0.9 times v_97, and so on.

If you multiply all of these terms out, you can show what v_100 really is. Look at the coefficient on theta_99: it's 0.1 times 0.9. On theta_98 there's a 0.1 times 0.9 times 0.9, which, if you expand out the algebra, becomes 0.1 times 0.9 squared. And if you keep expanding, you get 0.1 times 0.9 cubed on theta_97, 0.1 times 0.9 to the 4th on theta_96, and so on:

v_100 = 0.1 * theta_100 + 0.1 * 0.9 * theta_99 + 0.1 * 0.9^2 * theta_98 + 0.1 * 0.9^3 * theta_97 + 0.1 * 0.9^4 * theta_96 + ...

So this is really a weighted sum, and thus a weighted average, of theta_100, which is the current day's temperature from the perspective of v_100, which you'd compute on the 100th day of the year, together with theta_99, theta_98, theta_97, theta_96, and so on.

One way to draw this in pictures: say we have some number of days of temperature, plotted as theta against t, so theta_100 is some value, theta_99 is some value, theta_98, and so on, for t equals 100, 99, 98, and so on. And then what we have is an exponentially decaying function, starting at 0.1, then 0.9 times 0.1, then 0.9 squared times 0.1, and so on. The way you compute v_100 is to take the element-wise product between these two functions and sum it up: theta_100 times 0.1, plus theta_99 times 0.1 times 0.9, that's the second term, and so on. So it's really taking the daily temperature, multiplying it by this exponentially decaying function, and summing it up, and that becomes your v_100. It turns out that all of these coefficients add up to one, or very close to one, up to a detail called bias correction, which we'll talk about in the next video.
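To make this expansion concrete, here is a minimal sketch (my own illustrative code, not from the lecture, using made-up temperatures). It checks numerically that the recursive update and the expanded exponentially decaying weighted sum give the same v_100, and that the coefficients sum to very nearly one:

```python
import numpy as np

beta = 0.9
rng = np.random.default_rng(0)
theta = rng.uniform(20, 40, size=100)  # 100 days of made-up temperatures

# Recursive form: v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0.
v = 0.0
for t in range(100):
    v = beta * v + (1 - beta) * theta[t]

# Expanded form: v_100 = sum over k of (1 - beta) * beta^k * theta_{100-k}.
weights = (1 - beta) * beta ** np.arange(100)  # 0.1, 0.1*0.9, 0.1*0.9^2, ...
v_expanded = np.sum(weights * theta[::-1])     # most recent day gets weight 0.1

print(v, v_expanded)   # identical up to floating-point error
print(weights.sum())   # ~0.99997, i.e. very close to 1 (hence "weighted average")
```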
But because of that, this really is an exponentially weighted average. Finally, you might wonder how many days' temperature this is averaging over. Well, it turns out that 0.9 to the power of 10 is about 0.35, and this turns out to be about 1 over e, where e is the base of natural logarithms. More generally, if you have 1 minus epsilon, and in this example epsilon would be 0.1, then (1 - epsilon)^(1/epsilon) is about 1 over e, which is about 0.34, 0.35. In other words, it takes about 10 days for the height of this exponentially decaying function to decay to around a third, or really 1 over e, of the peak. It's because of this that when beta equals 0.9, we say it's as if you're computing an exponentially weighted average that focuses on just the last 10 days' temperature, because after 10 days the weight decays to less than about a third of the weight of the current day.

Whereas in contrast, if beta was equal to 0.98, what power do you need to raise 0.98 to in order for it to be really small? It turns out that 0.98 to the power of 50 is approximately equal to 1 over e. So the weight will be pretty big, bigger than 1 over e, for the first 50 days, and then it'll decay quite rapidly after that. So intuitively, though this isn't a hard and fast rule, you can think of this as averaging over about 50 days' temperature, because in this example, to use the notation here on the left, it's as if epsilon is equal to 0.02, so 1 over epsilon is 50. And this, by the way, is how we got the rule of thumb that we're averaging over roughly 1/(1 - beta) days. Here, epsilon plays the role of 1 minus beta. It tells you, up to some constant, roughly how many days' temperature you should think of this as averaging over. But this is just a rule of thumb for how to think about it, not a formal mathematical statement.

Finally, let's talk about how you actually implement this. Recall that you start with v_0 initialized to 0, then compute v_1 on the first day, v_2 on the second, and so on. Now, to explain the algorithm, it was useful to write down v_0, v_1, v_2, and so on as distinct variables. But if you're implementing this in practice, this is what you do: you initialize v to be equal to 0, and then on day one you set v := beta * v + (1 - beta) * theta_1. On the next day you update v := beta * v + (1 - beta) * theta_2, and so on. Sometimes we use the notation v_theta to denote that v is computing this exponentially weighted average of the parameter theta. So, just to say this again, but in a for-loop format: you set v_theta equal to 0, and then repeatedly, on each day, you get the next theta_t and update v_theta to beta times the old value of v_theta plus (1 - beta) times the current value of theta_t.

One of the advantages of this exponentially weighted average formula is that it takes very little memory. You just need to keep this one real number in computer memory, and you keep overwriting it with this formula using the latest value you got. And it's really for this reason, the efficiency, that it's used so much: it takes basically one line of code and the storage and memory for a single real number. Now, this is really not the best way, not the most accurate way, to compute an average.
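As a minimal sketch of that in-place implementation (my own illustrative code, with `running_average` as a hypothetical helper name), a single variable v is overwritten each day; the snippet also spot-checks the 1/e rule of thumb from above:

```python
import math

# Quick numerical check of the rule of thumb: (1 - eps)^(1/eps) ~ 1/e.
print(0.9 ** 10, 0.98 ** 50, 1 / math.e)  # ~0.349, ~0.364, ~0.368


def running_average(readings, beta=0.9):
    """Yield the exponentially weighted average after each new reading."""
    v = 0.0  # v_0 initialized to 0
    for theta_t in readings:
        v = beta * v + (1 - beta) * theta_t  # the one line that does the work
        yield v


# Usage on a short run of made-up daily temperatures:
temps = [21.0, 23.5, 19.0, 24.0, 26.5]
for day, v in enumerate(running_average(temps), start=1):
    print(f"day {day}: v = {v:.3f}")
```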
If you were to compute a moving window, where you explicitly sum over the last 10 days' or last 50 days' temperature and just divide by 10 or by 50, that usually gives you a better estimate. But the disadvantage of that, of explicitly keeping all the temperatures around and summing over the last 10 days, is that it requires more memory, it's more complicated to implement, and it's computationally more expensive. So for cases, and we'll see some examples in the next few videos, where you need to compute averages of a lot of variables, this is a very efficient way to do so, both from a computation and a memory-efficiency point of view, which is why it's used a lot in machine learning. Not to mention that it's just one line of code, which is maybe another advantage.

So now you know how to implement exponentially weighted averages. There's one more technical detail worth learning about, called bias correction. Let's see that in the next video, and then after that, you'll be able to use this to build a better optimization algorithm than straightforward gradient descent.
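For contrast, here is a sketch of the explicit moving-window alternative described above (again my own illustrative code, not from the course). It has to keep the last `window` readings in a buffer, versus the single real number the exponentially weighted average needs:

```python
from collections import deque

def moving_window_average(readings, window=10):
    """Yield the plain average of the last `window` readings."""
    buf = deque(maxlen=window)  # must store up to `window` recent values
    for theta_t in readings:
        buf.append(theta_t)
        yield sum(buf) / len(buf)  # O(window) time and memory per step,
                                   # versus O(1) for the one-line update
```

This is usually the more accurate estimate over a fixed horizon, but the buffer and the per-step summation are exactly the memory and computation costs the video points out.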