In this video, we'll look at two further refinements to the reinforcement learning algorithm you've seen. The first idea is called using mini-batches. This turns out to be an idea that can speed up your reinforcement learning algorithm, and it's also applicable to supervised learning and can speed up your supervised learning algorithm as well, like training a neural network or training a linear regression or logistic regression model. The second idea we'll look at is soft updates, which it turns out will help your reinforcement learning algorithm do a better job of converging to a good solution. Let's take a look at mini-batches and soft updates.

To understand mini-batches, let's just look at supervised learning to start. Here's a dataset of housing sizes and prices that you had seen way back in the first course of the specialization on using linear regression to predict housing prices. There, we had come up with this cost function for the parameters w and b: it was 1 over 2m times the sum of the squared differences between the prediction and the actual value y. The gradient descent algorithm was to repeatedly update w as w minus the learning rate alpha times the partial derivative with respect to w of the cost J of w, b, and similarly to update b as follows. Let me just take this definition of J of w, b and substitute it in here.

Now, when we looked at this example way back when we were starting to talk about linear regression and supervised learning, the training set size m was pretty small. I think we had 47 training examples. But what if you have a very, very large training set, say m equals 100 million? There are many countries, including the United States, with over 100 million housing units, and so a national census would give you a dataset of this order of magnitude. The problem with this algorithm when your dataset is this big is that every single step of gradient descent requires computing this average over 100 million examples, and this turns out to be very slow. Every step of gradient descent means you would compute this sum, or this average, over 100 million examples. Then you take one tiny gradient descent step and you go back and have to scan over your entire 100 million example dataset again to compute the derivative for the next step. Then you take another tiny gradient descent step, and so on and so on. When the training set size is very large, this gradient descent algorithm turns out to be quite slow.

The idea of mini-batch gradient descent is to not use all 100 million training examples on every single iteration through this loop. Instead, we may pick a smaller number, let me call it m prime, equal to say 1,000, and on every step, instead of using all 100 million examples, we will pick some subset of 1,000, or m prime, examples. This inner term becomes 1 over 2 m prime times the sum over some m prime examples. Now, each iteration through gradient descent requires looking only at 1,000 rather than 100 million examples, so every step takes much less time, and this leads to a more efficient algorithm. What mini-batch gradient descent does is, on the first iteration through the algorithm, it looks at one subset of the data. On the next iteration, it looks at a different subset of the data, and so on for the third iteration and so on, so that every iteration is looking at just a subset of the data and runs much more quickly.
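To make the mini-batch idea concrete before we look at why it's reasonable, here is a minimal sketch in Python of what one mini-batch gradient descent step might look like for the linear regression example; the variable names and the use of NumPy are illustrative assumptions rather than code from the course.

```python
import numpy as np

def mini_batch_gd_step(X, y, w, b, alpha=0.01, m_prime=1000):
    """One gradient descent step for f(x) = w*x + b using only m_prime random examples.

    X: array of housing sizes, y: array of prices (both of length m, e.g. 100 million).
    """
    m = X.shape[0]
    # Pick a random subset of m_prime examples instead of scanning all m examples
    idx = np.random.choice(m, size=min(m_prime, m), replace=False)
    X_batch, y_batch = X[idx], y[idx]

    # Prediction errors on the mini-batch only
    err = (w * X_batch + b) - y_batch

    # Gradients of J = 1/(2*m') * sum(err^2) with respect to w and b
    dj_dw = np.mean(err * X_batch)
    dj_db = np.mean(err)

    # Take one (noisy but cheap) gradient descent step
    return w - alpha * dj_dw, b - alpha * dj_db
```

Each call touches only m prime examples, so a single step stays cheap even when m is 100 million; repeating this step many times traces out the noisier but much faster path toward the minimum described next.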
To see why this might be a reasonable algorithm, here's the housing dataset. If on the first iteration we were to look at just, say, five examples, this is not the whole dataset, but it's somewhat representative of the straight line you might want to fit in the end, so taking one gradient descent step to make the algorithm better fit these five examples is okay. But then on the next iteration, you take a different five examples, like the ones shown here. You take one gradient descent step using these five examples, and on the next iteration you use a different five examples, and so on and so forth. You could scan through this list of examples from top to bottom; that would be one way. Another way would be, on every single iteration, to just pick a totally different five examples to use.

You might remember with batch gradient descent, if these are the contours of the cost function J, then batch gradient descent would say start here and take a step, take a step, take a step, take a step, take a step. Every step of gradient descent causes the parameters to reliably get closer to the global minimum of the cost function here in the middle. In contrast, mini-batch gradient descent, or the mini-batch learning algorithm, will do something like this. If you start here, then the first iteration uses just five examples, so it'll kind of head in the right direction, but maybe not the best gradient descent direction. Then the next iteration it may do that, the next iteration that, and that, and that, and sometimes, just by chance, the five examples you chose may be an unlucky choice and even head in the wrong direction, away from the global minimum, and so on and so forth. But on average, mini-batch gradient descent will tend towards the global minimum, not reliably and somewhat noisily, but every iteration is computationally much cheaper. So mini-batch learning, or mini-batch gradient descent, turns out to be a much faster algorithm when you have a very large training set. In fact, for supervised learning, when you have a very large training set, mini-batch learning, or mini-batch gradient descent, or the mini-batch version of other optimization algorithms like Adam, is used more commonly than batch gradient descent.

Going back to our reinforcement learning algorithm, this is the algorithm that we had seen previously. The mini-batch version of this would be: even if you have stored the 10,000 most recent tuples in the replay buffer, you might choose not to use all 10,000 every time you train a model. Instead, what you might do is just take a subset. You might choose just 1,000 of these (S, A, R(S), S') tuples and use them to create just 1,000 training examples to train the neural network. It turns out that this will make each iteration of training a model a little bit more noisy, but much faster, and this will overall tend to speed up this reinforcement learning algorithm. So that's how mini-batching can speed up both a supervised learning algorithm like linear regression, as well as this reinforcement learning algorithm, where you may use a mini-batch size of, say, 1,000 examples, even if you've stored away 10,000 of these tuples in your replay buffer.
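Applied to the reinforcement learning algorithm, the same idea might look roughly like the sketch below. It assumes the replay buffer is a Python list of the 10,000 most recent (S, A, R(S), S') tuples and that q_network(s) returns the current estimates of Q(s, a) for every action a; the function and parameter names are illustrative assumptions, not the actual lab code.

```python
import random
import numpy as np

def sample_training_batch(replay_buffer, q_network, gamma=0.995, batch_size=1000):
    """Build mini-batch training examples (x, y) for the Q network from the replay buffer.

    replay_buffer: list of (s, a, reward, s_prime) tuples, e.g. the 10,000 most recent.
    q_network(s): assumed to return an array of estimated Q(s, a) values, one per action a.
    """
    # Use only a random subset of the stored tuples on each training iteration
    mini_batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))

    xs, ys = [], []
    for s, a, reward, s_prime in mini_batch:
        # Input: the state-action pair (one common choice of network input)
        xs.append(np.concatenate([np.ravel(s), np.atleast_1d(a)]))
        # Target: y = R(s) + gamma * max over a' of Q(s', a')
        ys.append(reward + gamma * np.max(q_network(s_prime)))
    return np.array(xs), np.array(ys)
```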
Finally, there's one other refinement to the algorithm that can make it converge more reliably, which is this step here where we set Q equals Q new. It turns out that this can make a very abrupt change to Q. If you train a new neural network Q new, maybe just by chance it's not a very good neural network; maybe it's even a little bit worse than the old one. Then you've just overwritten your Q function with a potentially worse, noisy neural network. The soft update method helps prevent Q from getting worse because of just one unlucky step.

In particular, the neural network Q will have some parameters W and B, all the parameters for all the layers of the neural network. When you train the new neural network, you get some parameters W new and B new. In the original algorithm as described on that slide, you would set W to be equal to W new and B equal to B new; that's what set Q equals Q new means. With the soft update, what we do instead is set W equals 0.01 times W new plus 0.99 times W. In other words, we're going to make W be 99 percent the old version of W plus 1 percent of the new version W new. This is called a soft update because whenever we train a new neural network W new, we only accept a little bit of the new value. Similarly, B equals 0.01 times B new plus 0.99 times B. These numbers, 0.01 and 0.99, are hyperparameters that you can set, and they control how aggressively you move W towards W new; these two numbers are expected to add up to one. One extreme would be if you were to set W equals 1 times W new plus 0 times W, in which case you're back to the original algorithm up here, where you're just copying W new onto W. But the soft update allows you to make a more gradual change to Q, or to the neural network parameters W and B that determine your current guess for the Q function Q of S, A. It turns out that using the soft update method causes the reinforcement learning algorithm to converge more reliably. It makes it less likely that the reinforcement learning algorithm will oscillate or diverge or have other undesirable properties.

With these two final refinements to the algorithm, mini-batching, which actually applies very well to supervised learning as well, not just reinforcement learning, as well as the idea of soft updates, you should be able to get your learning algorithm to work really well on the Lunar Lander. The Lunar Lander is actually a decently complex, decently challenging application, and so if you can get it to work and land safely on the moon, I think that's actually really cool, and I hope you enjoy playing with the practice lab. Now, we've talked a lot about reinforcement learning. Before we wrap up, I'd like to share with you my thoughts on the state of reinforcement learning, so that as you go out and build applications using different machine learning techniques, be it supervised, unsupervised, or reinforcement learning techniques, you have a framework for understanding where reinforcement learning fits into the world of machine learning today. So let's go take a look at that in the next video.
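As a quick recap of the soft update in code form, here is a minimal sketch. It assumes each network's parameters are stored as a dictionary of NumPy arrays (in a framework like TensorFlow you would blend the corresponding weight tensors instead), with tau = 0.01 playing the role of the 0.01 / 0.99 split described above.

```python
def soft_update(params, new_params, tau=0.01):
    """Blend newly trained parameters into the current Q network's parameters.

    params, new_params: dicts of arrays, e.g. {"W1": ..., "b1": ..., "W2": ..., ...}.
    tau = 1.0 would copy the new parameters outright (the original "set Q = Q_new");
    a small tau such as 0.01 changes Q only gradually, which helps convergence.
    """
    return {name: tau * new_params[name] + (1.0 - tau) * params[name]
            for name in params}
```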